Energy Attack: On Transferring Adversarial Examples

09/09/2021 ∙ by Ruoxi Shi, et al. ∙ Shanghai Jiao Tong University 0

In this work we propose Energy Attack, a transfer-based black-box L_∞-adversarial attack. The attack is parameter-free and does not require gradient approximation. In particular, we first obtain white-box adversarial perturbations of a surrogate model and divide these perturbations into small patches. Then we extract the unit component vectors and eigenvalues of these patches with principal component analysis (PCA). Base on the eigenvalues, we can model the energy distribution of adversarial perturbations. We then perform black-box attacks by sampling from the perturbation patches according to their energy distribution, and tiling the sampled patches to form a full-size adversarial perturbation. This can be done without the available access to victim models. Extensive experiments well demonstrate that the proposed Energy Attack achieves state-of-the-art performance in black-box attacks on various models and several datasets. Moreover, the extracted distribution is able to transfer among different model architectures and different datasets, and is therefore intrinsic to vision architectures.



There are no comments yet.


page 3

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

With the wide application of artificial neural networks comes a growing research interest into the robustness of neural network models

Carlini et al. (2019); Xu et al. (2020), in the manner of adversarial attacks: neural networks can be fooled to produce incorrect outputs by slightly perturbed inputs known as adversarial examples Szegedy et al. (2014); Biggio et al. (2013). There are in general two classes of threat models111A threat model defines what the attacker knows about the victim., and the respective classes of attacks: white-box threat models/attacks and black-box threat models/attacks, which assume different knowledge about the target neural networks.

White-box attacks assume full access to the target model and usually require the first-order (i.e., gradient) information of the target model Szegedy et al. (2014); Goodfellow et al. (2015); Kurakin et al. (2016); Moosavi-Dezfooli et al. (2016). While white-box attacks are generally more powerful, they rely on direct access to the target model to compute the gradients. On the other hand, black-box attacks only assume access to the inputs and outputs of the target model. Therefore, in many real-world scenarios, where full access to the target models may not be available, black-box attacks are considered more practically feasible. For black-box attacks, gradient approximation and random search Rastrigin (1963) are two frequently applied schemes. The former approximates the gradient of a black-box model by querying the model directly Chen et al. (2017); Tu et al. (2019), or by transferring gradient information from a white-box surrogate model Cheng et al. (2019), and the latter simply performs greedy search steps at random. However, there are two major challenges: first, it may take an enormous number of queries for a black-box method to successfully find an adversarial example, as the gradient approximation is non-trivial in high dimension Ilyas et al. (2019); second, the attack success rate of these methods are usually lower than white-box attacks Andriushchenko et al. (2020). A recent work Andriushchenko et al. (2020) copes with these challenges with Square Attack, a random-search-based method that samples perturbations from a specific distribution.

In this work, we propose Energy Attack, a query-efficient transfer-based black-box attack that also applies the random search framework. Notice that in our random search scheme, finding a proper distribution to sample from is one of the keys to improving the success rate and query efficiency. Different from the classic random search algorithm, which samples perturbations uniformly at random, and Square Attack, which samples homogeneously colored squares as perturbations, Energy Attack samples perturbations based on the information obtained from white-box perturbations of a surrogate model.

The sampling strategy of Energy Attack is based on our observation that the patches (i.e., small square regions) of white-box perturbations share similar energy distributions on several unit directions, even if the perturbations themselves come from different models trained on different datasets. Especially, perturbation patches of robust models can better capture the shared distribution among patches of different models. We may extract such distributions from a surrogate model and draw perturbations from them when attacking. This allows exploiting the similarities among perturbation patches, boosting the black-box attack based on random search.

Therefore, Energy Attack first extracts a set of perturbation patches and corresponding energy distributions by performing principal component analysis (PCA) on white-box perturbation patches of a surrogate model. Then in the actual black-box attack phase, it samples from the extracted patches according to the energy distribution, and tiles the patches to form a larger square region of perturbation. If adding this perturbation increases the loss of the target model, Energy Attack will accept the change, otherwise the change will be discarded.

Unlike previous transfer-based attacks, which mainly aims at estimating gradients from surrogate models

Cheng et al. (2019), Energy Attack directly transfers extracted perturbations, in the form of their energy distributions in patches. We highlight that the white-box surrogate model is only used to acquire white-box perturbations and we do not need the presence of the white-box model in the black-box attack phase, which makes the attack phase more efficient. Also, we will demonstrate in Section 4 that such distributions generalize across model architectures and image domains, thus we do not require the surrogate model to have the same training source or similar structures as the target model.

Figure 1: Pipeline of Energy Attack. Energy Attack is a transfer-based attack, and has two phases. In the preparation phase, we collect patches from the perturbations of a white-box attack on a surrogate model, and perform PCA on them to obtain the energy distribution in patches. In the attack phase, we sample patches from the energy distribution, tile them up and perform a random search procedure.

2 Related Work

Robustness of White-Box Neural Networks

It was first discovered by Szegedy et al. Szegedy et al. (2014) that naturally trained neural networks are not robust under certain small perturbations. Since then, adversarial attacks and defenses of neural networks under a fully informed white-box scenario have been widely proposed Athalye et al. (2018); Moosavi-Dezfooli et al. (2016); Pang et al. (2020); Zhang et al. (2019).

Recently, progress was made on parameter-free adversarial attacks by the framework proposed by Chen et al. Chen et al. (2020), based on the reduced gradient Frank-Wolfe algorithm Frank et al. (1956) for constrained convex optimization. Furthermore, white-box attacks have not only been strong evaluators for robustness, but also a tool for analyzing other behaviours of neural networks. For example, robust networks own visually aligned gradients Kaur et al. (2019) and invertible feature spaces Engstrom et al. (2019), which are also desired properties in addition to robustness itself Salman et al. (2020).

Black-Box Attacks

While the white-box scenario is not quite common in real life, black-box attacks only require classification results or score information and can practically show the underlying non-robustness of real-world vision models Ilyas et al. (2018); Chen et al. (2017). Early approaches under this scenario mainly focus on approximating the gradients of target models Chen et al. (2017); Ilyas et al. (2018); Guo et al. (2019), but they suffer from large number of queries and low attack success rates. For better performance, Bandits Ilyas et al. (2019), SignHunter Al-Dujaili and O’Reilly (2019) and Square Attack Andriushchenko et al. (2020) employed different strategies to achieve higher success rates as well as higher query efficiencies. Square Attack Andriushchenko et al. (2020) also applies a random search Rastrigin (1963) procedure instead of stochastic approximation. Others employ transfer-based priors.

Transferability of Adversarial Examples and Transfer-Based Attacks

Szegedy et al. Szegedy et al. (2014) first finds that some adversarial examples can generalize to different neural networks. Moosavi et al. Moosavi-Dezfooli et al. (2017) further extended it to the universal case where the same perturbation is used to fool all models. An intriguing fact is that some of the examples even ‘transfer’ to time-limited humans Elsayed et al. (2018), showing that common intrinsic properties exist in vision tasks. Based on this transferability, people have achieved success in highly efficient attacks Yan et al. (2019); Cheng et al. (2019); Dolatabadi et al. (2020); Huang and Zhang (2019) in the black-box scenario.

3 Energy Attack

Now we introduce the proposed Energy Attack in detail. The algorithm has two phases: a preparation phase and an attack phase. The pipeline of Energy Attack is visualized in Figure 1.

Preparation Phase

In the preparation phase, we first attack a surrogate model with a white-box method. In particular, we apply the Frank-Wolfe framework proposed in Chen et al. (2020) for parameter-free white-box attacks, though any effective attack should work basically the same. We collect all small patches of a specific size (

) in the examples generated by the attack, and perform PCA on them. Notice that no matter how many patches there are, the co-variance matrix remains to be low-dimensional with

rows and columns where is the number of channels. Moreover, the matrix can be computed on-the-fly when performing the white-box attack, without storing full results in the memory. Thus the procedure is quite efficient.

PCA yields a set of energy values and the corresponding orthonormal eigenbasis . Since the PCA is performed on a set of patches, each vector in the eigenbasis actually correspond to a small patch in the image space.

Formally, we fetch the tensor formed by the patches of adversarial examples as


and flatten the latter three dimensions of into matrix


The co-variance matrix is then computed by


on which PCA is subsequently performed, as an eigenvalue decomposition of Hermitian :


where is an orthonormal basis and is a diagonal matrix formed from corresponding energy values .

We utilize these results in the attack phase to boost the black-box attack.

Different from previous transfer-based approaches such as Huang and Zhang (2019) and Dolatabadi et al. (2020), since we operate on the level of patch, the surrogate model is not required to accept images of the same size as the target model, nor be trained on the same dataset. Actually, we demonstrate that the distribution of patches captured by PCA transfers well between different datasets and different model architectures later in Section 4.

Attack Phase

In the attack phase, we apply a random search Rastrigin (1963) procedure to find adversarial examples. The algorithm for the attack is illustrated in Alg. 1.

Input: Target loss to maximize, image , bound , PCA results and .
Output: Perturbation .
Notation: is the side length of image and is the side length of PCA patches.

3:  for maximum queries do
4:     Sample

with probability

5:     if  then
6:         Tile by times on each side
7:     else
8:         Randomly pick patch from
9:     end if
11:      Randomly replace a area in with
12:     if  then 
13:     if  then 
14:  end for
Algorithm 1 The Energy Attack (Attack Phase)

For initialization, we follow Andriushchenko et al. (2020) to initialize the perturbation with randomly colored vertical stripes.

For each step, we first sample a patch from the PCA eigenbasis , each vector (or equivalently, image patches) with probability proportional to . Formally, the distribution is



VGG16BN InceptionV3 ResNet18 ViTB16
Avg. Med. ASR Avg. Med. ASR Avg. Med. ASR Avg. Med. ASR
370 90 91.6 1117 140 95.4 168 92 91.7 - - -
P-RGF 557 82 95.9 694 80 86.9 411 62 97.9 - - -
Square 24 1 100.0 207 45 99.5 34 6 100.0 125 36 100.0
Subspace 104 18 100.0 174 21 99.7 51 14 100.0 344 108 99.7
AdvFlow 929 400 99.2 1081 400 97.5 528 400 99.5 1057 400 98.4
TREMBA 6 1 100.0 264 22 99.9 27 1 100.0 174 22 99.9
Energy (R) 13 1 100.0 91 1 100.0 20 1 100.0 125 42 100.0
Energy (NR) 18 1 100.0 201 4 99.9 26 1 100.0 225 61 99.8
Table 1:

Attack success rate and query efficiency of attacks against different ImageNet models.

(R) denotes Energy Attack using PCA results of a robust model, and (NR) denotes the attack using PCA results of a non-robust model. Our method achieves outstanding results on this standard black-box attacking benchmark.

We sample from this distribution because if we sample many patches according to it and perform the PCA again, we will still result in the same distribution. The patch is then either tiled into a larger rectangular area of perturbations, or further cropped into a smaller one according to the stage of the attack. We enforce the constraints, and replace a random area with an appropriate size in the current best-scored perturbation with the tiled or cropped area.

Then we take the random search procedure: we keep the new perturbation in record if the resulting new perturbation is better in score, and discard otherwise. In case the algorithm figures out that there is little hope advancing with the current tiling/cropping configuration, we half the number of tiling . For briefness we denote the situation as ‘hopeless’ in the following discussion. The decision criteria is also referred to as tiling strategies later, since this essentially determines how Energy Attack tiles the patches.

If there is a batch of images to attack, we consider the state hopeless if no loss increment is seen in current step for all images. If we have only one image to attack, we use a probabilistic model – we consider it hopeless if with confidence the probability of advancing for each step is less than , which can be inferred from 15 consecutive steps without loss increment. In our experiments we only show results for the first procedure since we find that they behave similarly (see Section 4.3) and thus it is merely a matter of code efficiency.

For the target loss, we employ the standard margin loss proposed in Andriushchenko et al. (2020):



denotes the permutation that sorts the logits in descending order. The margin loss is the difference between the logit with the largest and the second largest value. We also filter the target images that we already succeed in attacking so that the margin loss faithfully reflects the margin to a wrong classification, as in previous work

Andriushchenko et al. (2020).

4 Results

In this section, we conduct extensive experiments to evaluate the effectiveness and transferability of Energy Attack.

4.1 Benchmarking Energy Attack

Experiment Setup

We benchmark Energy Attack on several models, using 1000 images randomly chosen from ImageNet

Deng et al. (2009) validation sets. Apart from Energy Attack, six other black-box attacks are also included for parallel comparison: 1) Bandits Ilyas et al. (2019), 2) P-RGF with configuration Cheng et al. (2019), 3) Square Attack Andriushchenko et al. (2020) 4) Subspace Attack Yan et al. (2019), 5) AdvFlow Dolatabadi et al. (2020) and 6) TREMBA Huang and Zhang (2019)

. Models used for benchmarking include batch normalized VGG16 (VGG16BN)

Simonyan and Zisserman (2014), Inception V3 Szegedy et al. (2016), ResNet18 He et al. (2016) and ViTB16 Dosovitskiy et al. (2020). For Energy Attack, we use PCA results from a non-robust ResNet34 model and an adversarial-trained robust ResNet18 model respectively. Both surrogate models are originally trained on the CIFAR10 Krizhevsky and Hinton (2009) dataset. Other attacks are configured to their recommended parameters on ImageNet. All attacks are restricted by a maximum distance of

, and a maximum of 10000 queries. The models and Energy Attack are implemented in PyTorch

Paszke et al. (2017).

Figure 2:

Pairwise cosine similarity of patches from models on different datasets.

We use ResNet34 (R34), ResNet18 (R18) and WideResNet28 (WRN) from CIFAR, DNN (DNN) from MNIST, and VGG16BN (VGG) and ViTB16 (ViT) from ImageNet. We also include patches of uniformly random images (RND) and random colored squares (SQR). Model names in bold and italic represent robust models. We choose the first one-third of total patches from the PCA results, and sort the patches by their corresponding eigenvalues in descending order. If the PCA result contains fewer effective vectors (due to mathematically or numerically low-rankness) than necessary, we pad with zero vectors.

Attack Results

Table 1 reports the attack results against the four models. In most cases, Energy Attack achieves performances equal to or even better than existing state-of-the-art black-box attacks, in terms of average and median number of queries and attack success rate. Energy Attack with robust perturbation patches achieves 100.00% attack success rate against all four models in the experiment, and three out of these four attacks take the least average queries. This achievement outperforms all four other black-box attacks used in this experiment, which demonstrates the outstanding performance of our proposed method.

Notice that for Energy Attack, the target models, with various architectures, are trained on ImageNet, but we are using the PCA patches derived from CIFAR10 ResNet models to perform the attack. TREMBA uses a hierarchical convolutional encoder stack which is similar to the structure of VGG networks. Subspace attack uses three ResNets as the surrogate models. Thus they achieved higher-than-average performances on these specific architectures, but are weaker on others. Our method captures more intrinsic nature of images, and generalizes well cross various architectures and image domains.

4.2 When Do the Patches Transfer?

In this section, we conduct analysis on the similarities among patches of different sizes extracted from different models and different datasets, in order to gain a deeper insight into the transferability of perturbation patches.

Sizes: Localized Distributions Transfer Better

We first consider the choice of patch sizes. We select 6 models from the CIFAR10, MNIST LeCun et al. (1998) and ImageNet datasets: ResNet34, ResNet18, WideResNet28 from CIFAR10, DNN from MNIST, and VGG16-BN and ViTB16 from ImageNet. Among them the ResNet18 and WideResNet28 are robust models, with the former trained with standard PGD-10 Adversarial Training Madry et al. (2017) and the latter trained with TRADES Zhang et al. (2019). We extract PCA patches of different sizes from white-box adversarial perturbations for each model and sort them by corresponding energy values in descending order. In particular, we choose patch sizes to be 5, 15 and 28, from a small size to the scale of full image for MNIST and CIFAR. We then select the first one-third patches with highest energy concentration for each model and compute the cosine similarity for each pair of patches. Note that patches of the first two models (ResNet34 and ResNet18) are exactly the two sets of patches used in Section 4.1.

Figure 2 illustrates the result of all pairwise cosine similarities. The cosine similarity between different models vanishes (degrades to the level of similarity with uniformly random samples) as the patch size increases. This explains why previous approaches that try to capture the perturbation distribution of an entire image do not transfer well. The patch sizes in these approaches are too large to retain the similarities among different models, which makes the transferred attacks less effective. On the other hand, smaller patches that focus on the localized structure of perturbations transfer better, as they capture local perturbation distributions that share more similarities with perturbation distributions of other models.

It is also worth noting that the pure-colored squares also have strong similarities with patches of some models in our experiment, which supports the design of square-region perturbations in Square Attack Andriushchenko et al. (2020). Their design originally comes from the analysis on convolutional operators.

Sources: Robust vs. Non-Robust Models

Patches from robust models transfer better to other models. In Table 1, Energy Attack using robust patches outperforms the attack using non-robust patches in all three measurements. Energy Attack with robust patches takes fewer queries and has a higher success rate against all four models in the benchmark. Using robust patches, it achieves 100% success rate against all four models, while the same attack using non-robust patches is only able to achieve 100% success rate against two of the models. Additionally, the numbers of average queries and median queries of Energy Attack with robust patches are also smaller than those of Energy Attack with non-robust patches.

These experiment results are in accordance with the visualization of pairwise cosine similarities illustrated in Figure 2. Compared with non robust models, robust models have larger cosine similarities with PCA patches from other models. When we transfer non-robust patches to other models (i.e., compute the pairwise cosine similarities between patches of non-robust models and other models), the values of cosine similarity tend to scatter across the entire heatmap. However, when we transfer robust patches to other models, the values of cosine similarities tend to be distributed along the diagonal. This indicates that, compared to patches from non-robust models, patches from robust models can better capture the underlying perturbation distributions shared among most models.

Another interesting observation is that patches from the two robust models (ResNet18-AT and WideResNet28-TRADES) have generally the same distribution. The values of cosine similarities between these two models concentrate on the diagonal of the heatmap, indicating that the patches are virtually orthogonal, even though they come from two different models.

Cosine Similarity as a Measurement

Based on discussions in previous subsections, we propose that the cosine similarity can be used as a measurement to indicate the transferability of a set of extracted patches. This conclusion follows from the observation that the performances of Energy Attack reported in Table 1 correspond to the visualization results in Figure 2. Recall that we use the patches of a non-robust ResNet34 and a robust ResNet18 on CIFAR10 (whose patches corresponds to R18 and R34 in Figure 2, respectively) to attack models on ImageNet in Table 1. Both attacks achieve outstanding performance against VGG16BN, with 100% ASR and median query of 1. Corresponding to this result, the cosine similarities in Figure 2 reflect similarities between distributions of energy in patches of CIFAR10 ResNet18, ResNet34 and ImageNet VGG16BN. Similarly, both attacks have a slight drop in performance when attacking ViTB16. While non-robust patches achieve 99.8% ASR with an average of 225 queries, robust patches achieve better performance, with 100% ASR and 125 queries on average. This result is also consistent with the visualization: when transferred to ViTB16, the patches of robust ResNet18 have smaller distribution shift, as the cosine similarities still tend to lie around the diagonal. In contrast, the patches of non-robust ResNet34 have larger distribution shift, corresponding to its poorer performance against ViTB16. Therefore, we can use the cosine similarity as a qualitative indicator for the transferability of PCA patches.

Figure 3: Pairwise cosine similarity of patches from various configurations. The pairwise cosine similarity include a pair of randomly initialized, untrained MLP (RMDNN) and LeNet (RMCNN) and another pair of MLP and LeNet trained on MNIST with randomly permuted labels (RPDNN and RPCNN respectively). As a reference, patches from CIFAR10 robust ResNet18 are also included.

Labels Are Not Important

To help find out what caused this transferring behaviour, we further perform the following experiment. We run the preparation phase of Energy Attack (i.e., extract white-box perturbation patches and perform PCA) on two multi-layer perceptron (MLP) models and two LeNet models on the MNIST dataset. One pair of MLP and LeNet are randomly initialized and untrained, and another pair of MLP and LeNet are trained on MNIST with randomly assigned labels. For these models, the attack is run to make the model diverge from its decisions on clean images, instead of ground truths. We then compute the pairwise cosine similarity using the patches of these four models. The patches from the CIFAR10 robust ResNet18 are also included for reference.

The heatmap of cosine similarities is illustrated in Figure 3. It can be observed that the patches from random-initialized MLP and LeNet models do not share much similarity. However, in the case of trained models, even though the labels are randomly assigned, the training process still induces similarities in perturbation patches, as higher values now tend to concentrate along the diagonal.

ResNet34 LeNet
Avg. ASR Avg. ASR
Square 566 99.95 123 100.00
Energy (R) 88 99.99 36 100.00
Energy (Res) 50 100.00 129 100.00
Energy (Le) 98 99.95 27 100.00
Table 2: Attack success rate and average queries of model patches against model themselves. We use Energy Attack to attack a ResNet34 on CIFAR10 and a LeNet on MNIST, using PCA patches from the ResNet34 (Res) and the LeNet (Le). We also include attack results of Square Attack and Energy Attack (R) for reference.

More interestingly, the patch distributions of randomly-trained MNIST MLP and LeNet also share similarities with the adversarially trained CIFAR10 ResNet18. Since these patches are derived from the small perturbations which cause the models to make wrong predictions, it follows that the perturbation directions of these patches are generally those achieving the strongest activations on artificial neural units. Thus, the similarities in PCA results indicate that the patches achieving high activation values on their respective models and datasets are essentially similar, regardless of datasets and labels. We thus conclude that these patches are independent of datasets and are intrinsic to vision architectures.

This conclusion is in accordance with the work of Maennel et al. (2020), which discovers that pre-training neural networks on image datasets with randomly assigned labels sometimes has positive impact on downstream tasks: the models might have captured these intrinsics during the pre-training process. It also agrees with the better transferability of robust neural models discovered by Salman et al. (2020): if we consider the neural units of different models activated by similar patches to have similar ‘functionalities’ (i.e., they capture a similar set of features in input), then robust models, whose neural units are able to better capture the shared features among all image datasets, naturally transfer better for cross-domain image tasks.

4.3 Verification of Designs

Power of Energy Distribution of Patches

We first show that the energy distribution of patches is indeed an effective prior for transferred black-box attacks. We run the preparation phase of Energy Attack on CIFAR10 ResNet34 and a MNIST LeNet5 to extract their energy distribution of patches respectively. Then we use the patches to perform Energy Attack against them. The results of Square Attack and Energy Attack with robust patches are also included for comparison.

Table 2 reports the success rate and the number of average queries. Energy Attack (R) reaches good performance on both models; when we attack the models using the distribution from the models themselves, the performance of Energy Attack is even better. However, since energy distribution of patches from non-robust models does not generalize as well as that from robust ones, patches from ResNet34 and LeNet achieves worse performance than the robust baseline when attacking each other.

The performance of Energy Attack in this ‘self-attack’ setting shows that the PCA procedure indeed capture (at least partially) the distribution of the original white-box perturbations: using the energy distribution of a certain model against itself boosts performance of Energy Attack. Thus the energy distribution serves as an effective prior for attacking models in black-box scenarios.

VGG16BN ResNet18 InceptionV3
Probability 13 17 94
Batch 13 20 91
Table 3: Mean queries of two tiling strategies on ImageNet models. The batch-based strategy and the probability-based strategy have similar performance when attacking the models on ImageNet.

Tiling Strategies

In Section 3, we proposed two schemes for adjusting tiling sizes during the attack: a batch-based method and a probability-based method. For the batch-based one, we reduce tiling sizes when none of perturbed images in the current batch increase the model prediction loss. The latter is for the purpose of decoupling all relations between target images, or attacking one image at a time. We reduce tiling sizes when no increment in loss is seen for 15 consecutive steps. We now show that the two strategies have similar performances. We use Energy Attack with robust model patches to attack three classic ImageNet models: VGG16BN, ResNet18 and InceptionV3. We perform two rounds of attack, using the probability-based scheme and the batch-based scheme respectively, and report the average number of queries.

Table 3 reports the average queries of the two strategies. The performances of these two strategies are very close, thus we pick the batch-based one because of its efficiency.

Decision Threshold of ‘Hopeless’

Energy Attack considers a tiling strategy ‘hopeless’ if no image in the current batch has increment in loss for consecutive perturbation steps. In our implementation, we set the threshold to one step. Now we configure Energy Attack to use different numbers of perturbation steps to decide whether the current tiling is ‘hopeless’. In particular, we vary the step threshold and run Energy Attack on the MNIST LeNet and the ImageNet Inception V3. We report the average queries of Energy Attack in each configuration.

Figure 4: Average queries with regard to threshold . We set the threshold to different values, run Energy Attack on an MNIST LeNet and an ImageNet Inception V3 and plot the average queries under different threshold.

Figure 4 plots the change in the number of average queries with regard to the threshold . As the threshold increases, both models spend more queries on average. Therefore, using thresholds larger than one may instead exert negative effects on performance. Also, the trends of the number of average queries growing with threshold are similar, although they have different architectures and are on different datasets. Therefore this hyper-parameter does not need to be specifically tuned when attacking different models, and the entire procedure of Energy Attack is parameter-free.

5 Conclusion

In this work, we proposed a transfer-based black-box Energy Attack, and demonstrated its query efficiency. Interestingly, we find that the captured energy distribution of small patches share plenty of similarities among different models and datasets under various settings. Despite the effectiveness of our method, there are some limitations: a) the PCA-based extraction process only captures linear correlations, and only the global distribution of energy, ignoring non-linear or data-dependent aspects; b) we still lack a theoretical analysis on why the distribution similarities arise under different settings, and know little about the larger structures beyond the level of small patches. The high performance of Energy Attack shows a strong potential in this direction. We leave these for future work.


  • A. Al-Dujaili and U. O’Reilly (2019) There are no bit parts for sign bits in black-box attacks. External Links: 1902.06894 Cited by: §2.
  • M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2020) Square attack: a query-efficient black-box adversarial attack via random search. In

    European Conference on Computer Vision

    pp. 484–501. Cited by: §1, §2, §3, §3, §4.1, §4.2.
  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In

    International Conference on Machine Learning

    pp. 274–283. Cited by: §2.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §1.
  • N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin (2019) On evaluating adversarial robustness. External Links: 1902.06705 Cited by: §1.
  • J. Chen, D. Zhou, J. Yi, and Q. Gu (2020) A frank-wolfe framework for efficient and effective adversarial attacks. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 34, pp. 3486–3494. Cited by: §2, §3.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pp. 15–26. Cited by: §1, §2.
  • S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu (2019) Improving black-box adversarial attacks with a transfer-based prior. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 10934–10944. Cited by: §1, §1, §2, §4.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In

    2009 IEEE conference on computer vision and pattern recognition

    pp. 248–255. Cited by: §4.1.
  • H. M. Dolatabadi, S. Erfani, and C. Leckie (2020) AdvFlow: inconspicuous black-box adversarial attacks using normalizing flows. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2, §3, §4.1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §4.1.
  • G. F. Elsayed, S. Shankar, B. Cheung, N. Papernot, A. Kurakin, I. Goodfellow, and J. Sohl-Dickstein (2018) Adversarial examples that fool both computer vision and time-limited humans. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3914–3924. Cited by: §2.
  • L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry (2019) Adversarial robustness as a prior for learned representations. External Links: 1906.00945 Cited by: §2.
  • M. Frank, P. Wolfe, et al. (1956) An algorithm for quadratic programming. Naval research logistics quarterly 3 (1-2), pp. 95–110. Cited by: §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) EXPLAINING and harnessing adversarial examples. stat 1050, pp. 20. Cited by: §1.
  • C. Guo, J. Gardner, Y. You, A. G. Wilson, and K. Weinberger (2019) Simple black-box adversarial attacks. In International Conference on Machine Learning, pp. 2484–2493. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • Z. Huang and T. Zhang (2019) Black-box adversarial attack with transferable model-based embedding. In International Conference on Learning Representations, Cited by: §2, §3, §4.1.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pp. 2137–2146. Cited by: §2.
  • A. Ilyas, L. Engstrom, and A. Madry (2019) Prior convictions: black-box adversarial attacks with bandits and priors. In International Conference on Learning Representations, Cited by: §1, §2, §4.1.
  • S. Kaur, J. Cohen, and Z. C. Lipton (2019)

    Are perceptually-aligned gradients a general property of robust classifiers?

    External Links: 1910.08640 Cited by: §2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • A. Kurakin, I. Goodfellow, S. Bengio, et al. (2016) Adversarial examples in the physical world. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017)

    Towards deep learning models resistant to adversarial attacks

    arXiv preprint arXiv:1706.06083. Cited by: §4.2.
  • H. Maennel, I. Alabdulmohsin, I. Tolstikhin, R. J. Baldock, O. Bousquet, S. Gelly, and D. Keysers (2020) What do neural networks learn when trained with random labels?. arXiv preprint arXiv:2006.10455. Cited by: §4.2.
  • S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765–1773. Cited by: §2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1, §2.
  • T. Pang, X. Yang, Y. Dong, T. Xu, J. Zhu, and H. Su (2020) Boosting adversarial training with hypersphere embedding. In NeurIPS, Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • L. Rastrigin (1963) The convergence of the random search method in the extremal control of a many parameter system. Automaton & Remote Control 24, pp. 1337–1342. Cited by: §1, §2, §3.
  • H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry (2020) Do adversarially robust imagenet models transfer better?. External Links: 2007.08489 Cited by: §2, §4.2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Cited by: §1, §1, §2, §2.
  • C. Tu, P. Ting, P. Chen, S. Liu, H. Zhang, J. Yi, C. Hsieh, and S. Cheng (2019)

    Autozoom: autoencoder-based zeroth order optimization method for attacking black-box neural networks

    In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 742–749. Cited by: §1.
  • H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain (2020) Adversarial attacks and defenses in images, graphs and text: a review. International Journal of Automation and Computing 17 (2), pp. 151–178. Cited by: §1.
  • Z. Yan, Y. Guo, and C. Zhang (2019) Subspace attack: exploiting promising subspaces for query-efficient black-box attacks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3825–3834. Cited by: §2, §4.1.
  • H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. Cited by: §2, §4.2.