Square Attack: a query-efficient black-box adversarial attack via random search [arXiv, Nov 2019]
We propose the Square Attack, a new score-based black-box l_2 and l_∞ adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. The Square Attack is based on a randomized search scheme where we select localized square-shaped updates at random positions so that the l_∞- or l_2-norm of the perturbation is approximately equal to the maximal budget at each step. Our method is algorithmically transparent, robust to the choice of hyperparameters, and is significantly more query efficient compared to the more complex state-of-the-art methods. In particular, on ImageNet we improve the average query efficiency for various deep networks by a factor of at least 2 and up to 7 compared to the recent state-of-the-art l_∞-attack of Meunier et al. while having a higher success rate. The Square Attack can even be competitive to gradient-based white-box attacks in terms of success rate. Moreover, we show its utility by breaking a recently proposed defense based on randomization. The code of our attack is available at https://github.com/max-andr/square-attack
Adversarial examples are of particular concern when it comes to applications of machine learning which are safety-critical. Many defenses against adversarial examples have been proposed
[GuRig2015, ZheEtAl2016, PapEtAl2016a, BasEtAl2016, madry2018towards, AkhtarARXIV2018, BiggioPR2018] but with limited success, as new, more powerful attacks could break many of them [CarWag2017, AthEtAl2018, MosEtAl18, CheEtAl2018, ZheEtAl2019]. In particular, gradient obfuscation or masking [AthEtAl2018, MosEtAl18] is often the reason why models that seem robust with respect to a certain type of attack turn out to be non-robust in the end. Gradient-based attacks are most often affected by this phenomenon (white-box attacks but also black-box attacks based on finite difference approximations [MosEtAl18]). Thus it is important to have attacks which are based on different principles. Black-box attacks have recently become more popular [narodytska2017simple, brendel2017decision, SuVarKou19] as their attack strategies are quite different from the ones employed for adversarial training, where PGD-type attacks [madry2018towards] are often used. However, a big problem at the moment is that these black-box attacks need to query the classifier too many times before they find adversarial examples, and that their success rate is sometimes significantly lower than that of white-box attacks.
In this paper we propose the Square Attack, a simple score-based attack: we can query the probability distribution over the classes which the classifier predicts, but have no further access to the underlying model. The Square Attack is based on random search [rastrigin1963convergence, schumer1968adaptive], which dates back to the 1960s (note that it is an iterative procedure, which is different from simple random sampling inside the feasible region). Random search has been successfully applied to reinforcement learning [mania2018simple], where it turned out to be competitive with gradient-based methods.
[Figure: example image with the class labels "street sign" and "parking meter".]
The Square Attack requires significantly fewer queries than the state-of-the-art black-box methods in the score-based query model while outperforming them in terms of success rate, i.e. the percentage of successfully found adversarial examples. This is achieved by a combination of a particular initialization strategy and our square-shaped updates. We motivate why these updates are particularly suited to attack neural networks and also provide convergence guarantees for a variant of our method. In an extensive evaluation on different datasets (MNIST, CIFAR-10, ImageNet) and various normal and robust models, we show that the Square Attack outperforms recent state-of-the-art methods in the l_∞- and l_2-threat model. We even break a recently proposed defense [lin2019bandlimiting] based on randomization where the PGD attack yields a false impression of robustness while the model is actually not robust.
We discuss black-box attacks for the threat model of perturbations in the l_∞- and l_2-ball as our method operates in these scenarios, although attacks for other norms, e.g. l_0, exist [narodytska2017simple, croce2019sparse] but are usually algorithmically different due to the specific geometry of the perturbation set.
Score-based black-box attacks have only access to the score predicted by a classifier for each class for a given input. Most of such attacks in the literature are based on gradient estimation through finite differences. In particular, the first papers in this direction
[bhagoji2018practical, ilyas2018black, uesato2018adversarial] propose iterative attacks where at each step they approximate the gradient via sampling from some noise distribution around the point. While this general approach can be successful, it requires many queries of the classifier, particularly in high-dimensional input spaces as in image classification. Thus, improved techniques try to reduce the dimension of the search space by using the principal components of the data [bhagoji2018practical], searching for perturbations in the latent space of an auto-encoder [tu2019autozoom] or using a low-dimensional noise distribution [ilyas2019prior].
Other attacks exploit evolutionary strategies or random search. [alzantot2018genattack] use a genetic algorithm to generate adversarial examples and alleviate gradient masking as they successfully reduce the robust accuracy on randomization- and discretization-based defenses. The l_2-attack of [guo2019simple] can be seen as a variant of random search where the search directions are chosen from some orthonormal basis and two candidate updates are tested at each iteration. However, their algorithm can have suboptimal query efficiency since at every step only very small (in norm) modifications are added. Moreover, suboptimal modifications cannot be undone since they are orthogonal to each other.
A recent line of work has pursued black-box attacks which are based on the observation that successful adversarial perturbations are attained at corners of the l_∞-ball intersected with the image space [MooEtAl2019, MeuEtAl2019]. Searching only over the corners allows one to apply discrete optimization techniques to generate adversarial attacks, significantly improving the query efficiency. Both [MooEtAl2019] and [AlDujaili2019ThereAN] (with a few differences) divide the image according to some coarse grid, perform local search in this lower dimensional space allowing componentwise changes only of $-\epsilon$ or $\epsilon$, then refine the grid and repeat the scheme iteratively. [AlDujaili2019ThereAN] motivate this procedure as an estimation of the gradient signs. Recently, [MeuEtAl2019] proposed several attacks based on different evolutionary algorithms, in the context of both discrete and continuous optimization, achieving state-of-the-art query efficiency for the l_∞-norm. In order to reduce the dimensionality of the search space, they use the “tiling trick” of [ilyas2019prior] where they divide the perturbation into a set of squares and modify the values in such squares with evolutionary algorithms. However, as in [ilyas2019prior], both size and position of the squares are fixed at the beginning and not optimized. We note that despite the effectiveness of all these discrete optimization attacks for the l_∞-norm, these approaches are not straightforward to adapt to the l_2-norm.
Finally, approaches based on Bayesian optimization exist, e.g. [shukla2019blackbox] combine it with the “tiling trick”, but show competitive performance only in a low-query regime.
While we focus on norm-bounded perturbations, some works aim at finding adversarial perturbations with minimal l_2-norm (e.g. [tu2019autozoom]), which often require more queries to be found. Thus, we do not compare to them, except for [guo2019simple], which features competitive query efficiency while trying to keep the perturbations small.
In other cases the attacker has a level of knowledge of the classifier different from that here considered. A more restrictive scenario, considered by decision-based attacks [brendel2017decision, cheng2018query, guo2018low, brunner2018guessing, chen2019boundary], is where the attacker can query only the decision of the classifier, but not the predicted scores.
On the other hand, some works adopt more permissive threat models, e.g., the attacker already has a substitute model that is similar to the target one [papernot2016transferability, yan2019subspace, cheng2019improving, du2019query]. In this setting, it can generate adversarial examples on the substitute model and then transfer them or, as in [yan2019subspace], perform a black-box gradient estimation attack in a subspace spanned by the gradients of substitute models. However, the gain in query efficiency given by such extra knowledge does not take into account the computational cost required to train such substitute models, particularly high on ImageNet-scale. Finally, some approaches use extra information about the data-generating distribution to train a model that directly predicts adversarial examples and then refines them with attacks based on gradient estimation [li2019nattack].
In the following we recall the definition of untargeted adversarial examples in the threat model where the perturbations lie in some l_p-ball. Then, we present our black-box attacks for the l_∞- and l_2-norms.
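Let $f: [0,1]^d \to \mathbb{R}^K$ be a classifier, where $d$ is the input dimension, $K$ the number of classes and $f_k(x)$ is the predicted score that $x$ belongs to class $k$. The classifier assigns class $\arg\max_{k=1,\dots,K} f_k(x)$ to the input $x$.
The goal of an untargeted attack is to change the correctly predicted class $y$ for the point $x$. $\hat{x}$ is called an adversarial example with an l_p-norm bound of $\epsilon$ for $x$ if
$$\arg\max_{k=1,\dots,K} f_k(\hat{x}) \neq y, \quad \|\hat{x} - x\|_p \leq \epsilon \quad \text{and} \quad \hat{x} \in [0,1]^d,$$
where we have added the additional constraint that $\hat{x}$ is an image. The task of finding $\hat{x}$ can be rephrased as solving the constrained optimization problem
$$\min_{\hat{x} \in [0,1]^d} L(f(\hat{x}), y) \quad \text{s.t.} \quad \|\hat{x} - x\|_p \leq \epsilon \tag{1}$$
for a loss $L$. We use $L(f(\hat{x}), y) = f_y(\hat{x}) - \max_{k \neq y} f_k(\hat{x})$ for the Square Attack. Note that $L(f(\hat{x}), y) < 0$ implies that the decision for $\hat{x}$ is different from $y$.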
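A minimal sketch of a margin-style loss of this kind may help here. It assumes the loss is the score of the true class minus the largest score among all other classes, so that a negative value indicates misclassification; the function name `margin_loss` and the batched NumPy formulation are illustrative choices, not taken from the source.

```python
import numpy as np

def margin_loss(scores, y):
    """Assumed margin loss: f_y(x) - max_{k != y} f_k(x), per example.

    scores: array-like of shape (n, K) with per-class scores.
    y: integer array-like of shape (n,) with the true labels.
    A negative value means the predicted class differs from y.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    idx = np.arange(scores.shape[0])
    true_scores = scores[idx, y]
    others = scores.copy()
    others[idx, y] = -np.inf  # exclude the true class from the maximum
    return true_scores - others.max(axis=1)
```

For instance, with scores `[0.1, 0.6, 0.3]` and true label `0`, the loss is negative because class `1` wins, which is the stopping condition an untargeted attack looks for.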
The Square Attack is based on random search (RS), a well-known iterative technique in optimization introduced by Rastrigin in 1963 [rastrigin1963convergence]. Let $g$ be the function to minimize and $x^{(i)}$ the iterate at iteration $i$. RS samples a random update $\delta \sim D(x^{(i)})$. If $g(x^{(i)} + \delta) < g(x^{(i)})$, then $x^{(i+1)} = x^{(i)} + \delta$, else $x^{(i+1)} = x^{(i)}$. In other words, at each step the algorithm samples a random point close to the current iterate and checks if it improves the objective function. Despite its simplicity, RS performs well in many situations and does not depend on gradient information from $g$.
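The accept-if-better rule described above can be sketched in a few lines. This is a generic illustration of random search, not the attack itself; the names `random_search` and `sample_update` are hypothetical, and the sampling distribution is left to the caller.

```python
import numpy as np

def random_search(g, x0, sample_update, n_iters, rng):
    """Generic random search: keep a candidate only if it lowers g."""
    x = np.asarray(x0, dtype=float)
    best = g(x)
    for _ in range(n_iters):
        candidate = x + sample_update(x, rng)  # random point near the iterate
        value = g(candidate)
        if value < best:  # accept only strict improvements
            x, best = candidate, value
    return x, best
```

As a usage example, minimizing `g(z) = sum(z**2)` from `x0 = [3.0, -2.0]` with small Gaussian updates (`lambda z, rng: 0.1 * rng.normal(size=z.shape)`) never increases the objective, since rejected candidates leave the iterate unchanged.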
Many variants of RS have been introduced [matyas1965random, schumer1968adaptive, schrack1976optimized], which differ mainly in how the random perturbation is chosen at each iteration (the original scheme samples uniformly on a hypersphere of fixed radius). For our goal of crafting adversarial examples we propose two sampling distributions specific to our problem, one for the l_∞- and one for the l_2-attack (see Sec. 3.3 and Sec. 3.4), which we integrate into the classic RS procedure. They are motivated both by how images are processed by networks with convolutional filters and by the shape of the l_p-balls for different $p$.
Our scheme differs from classical random search in that the perturbations are constructed such that at every iteration they lie on the boundary of the l_∞- or l_2-ball before projection onto the image domain $[0,1]^d$. Thus we are using the perturbation budget almost optimally at each step. Moreover, the changes are localized in the image in the sense that, at each step, we modify just a small fraction of contiguous pixels shaped into squares. The overall scheme is presented in Algorithm 1. First, the algorithm picks the side length of the square to be modified (step 3), which decreases according to an a priori fixed schedule. This is analogous to the step-size reduction in gradient-based optimization methods. Then in step 4 we sample a new update $\delta$ and add it to the current iterate (step 5). If the resulting loss (obtained in step 6) is smaller than the best loss so far, the change is accepted, otherwise it is discarded. Since we are interested in a query-efficient attack, the algorithm stops as soon as an adversarial example is found, that is $L(f(\hat{x}), y) < 0$. The overall time complexity of the algorithm is dominated by the evaluation of $f$, thus the total running time of the algorithm is at most $N$ forward passes of $f$, where $N$ is the number of iterations of the Square Attack. We plot the resulting adversarial examples in Figure 3.
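The loop described above (pick a side length, sample an update, project, query the loss, accept or discard, stop early) can be sketched as follows. This is a skeleton under stated assumptions rather than the authors' Algorithm 1: the function name and all callables (`loss_fn`, `sample_update`, `project`, `side_schedule`) are hypothetical placeholders supplied by the caller.

```python
import numpy as np

def square_attack_skeleton(loss_fn, x, x_init, sample_update, project,
                           side_schedule, n_iters, rng):
    """Random-search attack loop with early stopping.

    loss_fn(z) -> float, negative once z is adversarial (assumed convention).
    sample_update(x_hat, x, h, rng) -> update with a square of side h.
    project(z, x) -> z projected onto the norm ball around x and the image box.
    """
    x_hat = x_init
    best_loss = loss_fn(x_hat)
    n_queries = 1
    for i in range(n_iters):
        if best_loss < 0:  # adversarial example found: stop querying
            break
        h = side_schedule(i)  # side length follows a fixed schedule
        delta = sample_update(x_hat, x, h, rng)
        x_new = project(x_hat + delta, x)
        new_loss = loss_fn(x_new)  # one forward pass per iteration
        n_queries += 1
        if new_loss < best_loss:  # keep the change only if it helps
            x_hat, best_loss = x_new, new_loss
    return x_hat, best_loss, n_queries
```

The query counter makes explicit that each iteration costs exactly one evaluation of the classifier, which is why the running time is bounded by the number of iterations.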
Given images of size $w \times w$, let $p \in [0,1]$ be the percentage of elements of $x$ to be modified. The side length $h$ of the squares we use (see step 3) is given by the closest positive integer to $\sqrt{p \cdot w^2}$ (and $h \geq 3$ for the l_2 attack). Then, in practice the initial $p$ is the only free parameter of our algorithm. With $N = 10000$ iterations available, we halve the value of $p$ at $i \in \{10, 50, 200, 500, 1000, 2000, 4000, 6000, 8000\}$ iterations. For different $N$ we rescale the schedule accordingly.
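A small sketch of such a schedule is given below. The halving points in `halving_points` and the reference budget of 10,000 iterations are assumptions of this example (treat them as placeholders if your setup differs), and the function names `p_schedule` and `square_side` are hypothetical.

```python
import math

def p_schedule(p_init, it, n_iters=10000):
    """Piecewise-constant schedule: halve p at fixed iteration counts.

    The halving points are an assumption of this sketch and are defined
    for a budget of 10,000 iterations; other budgets are rescaled.
    """
    it = it * 10000 // n_iters  # integer rescaling to the reference budget
    halving_points = [10, 50, 200, 500, 1000, 2000, 4000, 6000, 8000]
    n_halvings = sum(it > t for t in halving_points)
    return p_init / 2 ** n_halvings

def square_side(p, w, min_side=1):
    """Side length: closest positive integer to sqrt(p * w^2)."""
    return max(int(round(math.sqrt(p * w * w))), min_side)
```

For a 224 x 224 image and `p = 0.05`, `square_side` gives a side of 50 pixels; passing `min_side=3` mimics a lower bound on the side length for the l_2 variant.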
Initialization: We initialize the perturbation with vertical stripes of width one since we found that convolutional networks are particularly sensitive to such perturbations. The color of each stripe is sampled uniformly at random from $\{-\epsilon, \epsilon\}^c$, where $c$ is the number of color channels. Concurrently, [yin2019fourier] also showed that neural networks are more generally vulnerable to various types of high-frequency perturbations (although they evaluate perturbations of much larger magnitude than ours).
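The stripe initialization can be illustrated with a short NumPy sketch. It assumes a channel-first image of shape `(c, height, width)` with values in `[0, 1]` and draws one sign per column and channel; the function name `init_vertical_stripes` is hypothetical.

```python
import numpy as np

def init_vertical_stripes(x, eps, rng):
    """Add vertical stripes of width one with values in {-eps, +eps}.

    x: image of shape (c, height, width) in [0, 1] (assumed layout).
    Each column gets one random value per color channel, which is then
    broadcast along the height axis.
    """
    c, _, width = x.shape
    stripes = rng.choice([-eps, eps], size=(c, 1, width))
    return np.clip(x + stripes, 0.0, 1.0)  # stay inside the image domain
```

Because the stripe tensor has a singleton height axis, broadcasting repeats each sampled value down the whole column, which is what makes the pattern a stripe rather than per-pixel noise.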
Sampling distribution: Similarly to [MooEtAl2019] we observe that successful l_∞-perturbations usually have values $\pm\epsilon$ in all the components (note that this does not hold perfectly due to the image constraints $\hat{x} \in [0,1]^d$). In particular, it holds
$$\hat{x}_i \in \{\max\{0, x_i - \epsilon\}, \min\{1, x_i + \epsilon\}\}.$$
Our sampling distribution $P$ for the l_∞-norm described in Algorithm 2 selects sparse updates $\delta$ of $\hat{x}$ with $\|\delta\|_0 = h \cdot h \cdot c$, where $\delta \in \{-2\epsilon, 0, 2\epsilon\}^d$ and the non-zero elements are grouped to form a square. In this way, after the projection onto the l_∞-ball of radius $\epsilon$ (step 5 of Algorithm 1) all components $i$ for which $\epsilon \leq x_i \leq 1 - \epsilon$ satisfy $\hat{x}_i \in \{x_i - \epsilon, x_i + \epsilon\}$, that is, they differ from the original point $x$ in each element either by $\epsilon$ or $-\epsilon$. Thus $\hat{x} - x$ is situated at one of the corners of the l_∞-ball (modulo the components which are close to the boundary). Note that all projections are done by clipping. Moreover, we fix the elements of $\delta$ belonging to the same color channel to have the same sign, since we observed that neural networks are particularly sensitive to such perturbations (see Figure 3).
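The square-shaped update and the clipping-based projection can be sketched as below. The sketch assumes that the non-zero entries of the update take values of plus or minus twice the perturbation radius with one sign per color channel, and a channel-first layout; `sample_linf_update` and `project_linf` are hypothetical names.

```python
import numpy as np

def sample_linf_update(shape, eps, h, rng):
    """Sparse update: an h x h window, one value in {-2 eps, 2 eps} per channel."""
    c, height, width = shape
    r = rng.integers(0, height - h + 1)  # top-left corner of the window
    s = rng.integers(0, width - h + 1)
    delta = np.zeros(shape)
    delta[:, r:r + h, s:s + h] = rng.choice([-2 * eps, 2 * eps], size=(c, 1, 1))
    return delta

def project_linf(z, x, eps):
    """Clip onto the l_inf-ball of radius eps around x, then onto [0, 1]."""
    return np.clip(np.clip(z, x - eps, x + eps), 0.0, 1.0)
```

Starting from an iterate whose entries all sit at distance `eps` from `x`, adding such an update and projecting either flips an entry to the opposite corner or leaves it where it was, so the iterate stays on a corner of the ball wherever the image box does not interfere.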
Initialization: The l_2-perturbation is initialized by generating a grid-like tiling of the image by squares, where the perturbation on each tile has the shape described next in the sampling distribution. The resulting perturbation is rescaled to have l_2-norm $\epsilon$ and the resulting $\hat{x}$ is finally projected onto $[0,1]^d$ by clipping.
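Sampling distribution: First, let us notice that the adversarial perturbations typically found for the l_2-norm tend to be much more localized than those for the l_∞-norm [tsipras2019robustness], in the sense that large changes are applied to some pixels of the original image, while many others are minimally modified. To mimic this feature we introduce a new update $\eta$ which has two “centers” with large absolute value and opposite signs, while the other components have lower absolute values as one gets farther away from the centers, but never reach zero (see Fig. 2 for an example of the resulting update $\eta$). In this way the modifications are localized and with high contrast between the different halves. More specifically, we define $\eta^{(h_1, h_2)} \in \mathbb{R}^{h_1 \times h_2}$ (where we assume $h_1 \geq h_2$) elementwise for every $1 \leq r \leq h_1$, $1 \leq s \leq h_2$ as
$$\eta^{(h_1, h_2)}_{r,s} = \sum_{k=0}^{M(r,s)} \frac{1}{(n + 1 - k)^2}, \quad \text{with } n = \left\lfloor \frac{h_1}{2} \right\rfloor,$$
and $M(r,s) = n - \max\{|r - \lfloor h_1/2 \rfloor - 1|, |s - \lfloor h_2/2 \rfloor - 1|\}$. The intermediate square update $\eta \in \mathbb{R}^{h \times h}$ is then selected uniformly at random from either
$$\eta = \left(\eta^{(h, k)}, -\eta^{(h, h-k)}\right), \quad \text{with } k = \lfloor h/2 \rfloor, \tag{2}$$
or its transpose (corresponding to a rotation of 90°).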
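One way to build an update with two opposite-signed "centers" that decays away from them without reaching zero is sketched below. The concrete decay (a tail sum of inverse squares indexed by the Chebyshev distance to the block center), the unit-norm rescaling, and the names `eta_block` and `eta_square` are assumptions of this example, chosen to match the verbal description above.

```python
import numpy as np

def eta_block(h1, h2):
    """Block peaking at its center and decaying with Chebyshev distance.

    Entry at distance d from the center is sum_{j=d+1}^{n+1} 1/j^2 with
    n = max(h1, h2) // 2, so every entry is strictly positive.
    """
    n = max(h1, h2) // 2
    r = np.arange(1, h1 + 1)[:, None]
    s = np.arange(1, h2 + 1)[None, :]
    dist = np.maximum(np.abs(r - h1 // 2 - 1), np.abs(s - h2 // 2 - 1))
    inv_sq = 1.0 / np.arange(1, n + 2) ** 2
    tail = np.cumsum(inv_sq[::-1])[::-1]  # tail[d] = sum_{j=d+1}^{n+1} 1/j^2
    return tail[dist]

def eta_square(h, rng):
    """h x h update (h >= 2): positive half next to negative half, maybe transposed."""
    k = h // 2
    eta = np.concatenate([eta_block(h, k), -eta_block(h, h - k)], axis=1)
    eta /= np.linalg.norm(eta)  # unit l_2-norm, a convenience of this sketch
    return eta.T if rng.random() < 0.5 else eta
```

The random transpose switches between a left/right and a top/bottom split of the two halves, so both orientations of the high-contrast edge are explored.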
Second, unlike l_∞-constraints, l_2-constraints do not allow one to modify each component independently of the others, as the overall norm must be kept smaller than $\epsilon$. Therefore, if we want to modify a perturbation $\hat{x} - x$ of norm $\epsilon$ through localized changes while staying on the hypersphere, we have to “move the mass” of $\hat{x} - x$ from one location to another.
Thus our scheme consists in randomly selecting two square windows of the perturbation $\nu = \hat{x} - x$, namely $\nu_{W_1}$ and $\nu_{W_2}$, setting $\nu_{W_2} = 0$ and using the budget of $\|\nu_{W_2}\|_2$ to increase the total perturbation of $\nu_{W_1}$. Note that the perturbation of $W_1$ is then a combination of the existing perturbation plus the newly generated $\eta$. We report the details of this scheme in Algorithm 3, where step 4 allows us to utilize the budget of l_2-norm lost after the projection onto $[0,1]^d$. The update $\delta$ output by the algorithm is such that the next iterate $\hat{x}_{\text{new}} = \hat{x} + \delta$ (before projection onto $[0,1]^d$ by clipping) belongs to the hypersphere $\{z : \|z - x\|_2 = \epsilon\}$, as stated in the following proposition.
Let $\delta$ be the output of Algorithm 3. Then $\|\hat{x} + \delta - x\|_2 = \epsilon$.
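The "move the mass" idea can be checked numerically with a simplified single-channel sketch. It zeroes one window, mixes a new pattern into the other, and rescales that window so the whole perturbation has exactly the target norm, also recovering any norm budget that was lost earlier. This is an illustration under these assumptions, not the authors' Algorithm 3; `move_mass_update` and its arguments are hypothetical.

```python
import numpy as np

def move_mass_update(nu, eta, r1, s1, r2, s2, eps):
    """Return a perturbation of l_2-norm eps after moving mass from W2 to W1.

    nu: current 2-D perturbation with norm at most eps.
    eta: h x h pattern to mix into window W1 (top-left corner r1, s1).
    Window W2 (top-left corner r2, s2) is set to zero.
    """
    h = eta.shape[0]
    nu = nu.copy()
    mask = np.zeros(nu.shape, dtype=bool)
    mask[r1:r1 + h, s1:s1 + h] = True
    mask[r2:r2 + h, s2:s2 + h] = True
    # Budget for W1: mass currently in both windows plus any unused budget.
    budget_sq = max(eps ** 2 - np.sum(nu ** 2), 0.0) + np.sum(nu[mask] ** 2)
    w1 = nu[r1:r1 + h, s1:s1 + h]
    new_w1 = eta + w1 / (np.linalg.norm(w1) + 1e-10)  # keep old direction too
    new_w1 = new_w1 / np.linalg.norm(new_w1) * np.sqrt(budget_sq)
    nu[r2:r2 + h, s2:s2 + h] = 0.0
    nu[r1:r1 + h, s1:s1 + h] = new_w1
    return nu
```

Everything outside the two windows is untouched, so the squared norm after the update is the old squared norm minus the mass in the windows plus the budget, which equals the squared target radius by construction.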
[Figure 3: panels showing the original image, the l_∞-attack and the l_2-attack.]
In this section, we provide high-level theoretical intuition for why the choices made in the Square Attack are justified. We analyze the l_∞-version as the l_2-version is significantly harder to analyze.
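First, we want to study the convergence of the random search algorithm for an $L$-smooth function $g$ (such as neural networks with activation functions like softplus, swish, ELU, etc.) on the whole space $\mathbb{R}^d$ (without projection, since nonconvex constrained optimization under noisy oracles is notoriously more difficult [davis2019stochastic]) under the following assumptions on the update $\delta_t$ drawn from the sampling distribution $P_t$:
$$\mathbb{E}\|\delta_t\|_2^2 \leq \gamma_t^2 C \quad \text{and} \quad \mathbb{E}|\langle \delta_t, v \rangle| \geq \tilde{C} \gamma_t \|v\|_2 \quad \forall v \in \mathbb{R}^d, \tag{3}$$
where $\gamma_t$ is the step size at iteration $t$, and $C$ and $\tilde{C}$ are some positive constants. We obtain the following result, which is similar to existing convergence rates for zeroth-order methods [NemYud83, nesterov2017random, duchi2015optimal]:
Suppose that $\mathbb{E}[\delta_t] = 0$ and Assumption 3 holds. Then for step-sizes $\gamma_t = \gamma/\sqrt{T}$, we have
$$\min_{t=0,\dots,T} \mathbb{E}\|\nabla g(x_t)\|_2 \leq \frac{2}{\tilde{C}\sqrt{T}} \left( \frac{g(x_0) - \mathbb{E}\, g(x_{T+1})}{\gamma} + \frac{\gamma C L}{2} \right).$$
This basically shows that for $T$ large enough one can make the gradient arbitrarily small, meaning that the random search algorithm converges to a critical point of $g$ (one cannot hope for much stronger results in non-convex optimization without stronger conditions).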
Unfortunately, the second part of Assumption 3 does not directly hold for our sampling distribution $P$ for the l_∞-norm (see Sup. A.3). However, it holds for a similar sampling distribution where each component of the update $\delta$ is drawn uniformly at random from $\{-2\epsilon, 2\epsilon\}$, which we show using the Khintchine inequality [haagerup1981best] (see Sup. A.4).
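To make the role of the Khintchine inequality concrete, the following sketch checks it by brute force in low dimension. It uses the classical form of the inequality with the best constant for the first moment: for independent random signs, the expected absolute value of a signed sum is at least the l_2-norm of the coefficients divided by the square root of two. Applying it to updates whose components are plus or minus twice the radius is an illustration; it is not the exact statement from the supplement, and the function names are hypothetical.

```python
import itertools
import numpy as np

def expected_abs_inner_product(v, eps):
    """Exact E|<delta, v>| for delta uniform on {-2 eps, 2 eps}^d (small d)."""
    v = np.asarray(v, dtype=float)
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=len(v)):
        total += abs(2 * eps * np.dot(signs, v))
    return total / 2 ** len(v)

def khintchine_lower_bound(v, eps):
    """Lower bound (2 eps / sqrt(2)) * ||v||_2 = sqrt(2) * eps * ||v||_2."""
    return np.sqrt(2) * eps * np.linalg.norm(v)
```

For `v = [1.0, 1.0]` the bound is attained with equality, which shows that the constant cannot be improved in general.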
We note that the size of the window acts as a step-size here. In our experiments, however, the componentwise random update scheme was significantly worse. We provide arguments why this is the case in the supplementary material.
Previous works [MooEtAl2019, MeuEtAl2019] build their attacks by iteratively adding square modifications. Likewise, we change square-shaped regions of the image for both our l_∞ and l_2 attacks, with the difference that we can sample any square subset of the input, while the grid of the possible squares is fixed in [MooEtAl2019, MeuEtAl2019]. This naturally leads to the question of why squares are superior to other shapes, e.g., rectangles.
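Let us consider the l_∞-threat model, with bound $\epsilon$, input space $[0,1]^{d \times d}$ and a convolutional filter $w \in \mathbb{R}^{s \times s}$ with entries unknown to the attacker. Let $\delta \in \mathbb{R}^{d \times d}$ be the sparse update with $\|\delta\|_0 = k \geq s^2$ and $\|\delta\|_\infty \leq \epsilon$. We denote by $S(a, b)$ the index set of the rectangular support of $\delta$ with $|S(a, b)| = k$ and shape $a \times b$. We want to give intuition why sparse square-shaped updates are superior to rectangular ones in the sense of reaching maximal change in the activation of the first convolutional layer.
Let $z = \delta * w \in \mathbb{R}^{d \times d}$ denote the output of the convolutional layer for the update $\delta$. The l_∞-norm of $z$ is the maximal componentwise change of the convolutional layer:
$$\|z\|_\infty = \max_{u,v} |z_{u,v}| = \max_{u,v} \Big| \sum_{i,j=1}^{s} \delta_{u - \lfloor s/2 \rfloor + i,\, v - \lfloor s/2 \rfloor + j} \cdot w_{i,j} \Big| \leq \max_{u,v} \epsilon \sum_{i,j=1}^{s} |w_{i,j}| \, \mathbb{1}_{(u - \lfloor s/2 \rfloor + i,\, v - \lfloor s/2 \rfloor + j) \in S(a,b)},$$
with the convention that elements with indices exceeding the size of the matrix are set to zero. Note that the indicator function attains 1 only for the non-zero elements of $\delta$ involved in the convolution to get $z_{u,v}$. Thus, in order to have the largest upper bound possible on $|z_{u,v}|$, for some $(u, v)$, we need the largest amount possible of components of $\delta$ with indices in
$$C(u, v) = \{(u - \lfloor s/2 \rfloor + i,\, v - \lfloor s/2 \rfloor + j) : i, j = 1, \dots, s\}$$
to be non-zero (that is in $S(a, b)$).
Therefore, it is desirable to have $S(a, b)$ shaped so as to maximize the number of squares of side length $s$, i.e. the shape of the filter $w$, which fit into the rectangle $a \times b$, i.e. the shape of the subset of non-zero elements of $\delta$. Let $\mathcal{F}$ be the family of the objects that can be defined as the union of axis-aligned rectangles with vertices on $\mathbb{N}^2$, and $\mathcal{G} \subset \mathcal{F}$ the squares of $\mathcal{F}$ of shape $s \times s$ with $s \geq 2$. We have the following proposition:
Among the elements of $\mathcal{F}$ with area $k \geq s^2$, those which contain the largest number of elements of $\mathcal{G}$ have
$$N^* = (a - s + 1)(b - s + 1) + (r - s + 1)^+ \tag{4}$$
of them, with $a = \lfloor \sqrt{k} \rfloor$, $b = \lfloor k/a \rfloor$, $r = k - ab$ and $z^+ = \max\{z, 0\}$.
This proposition states that, if we can select only $k$ elements of $\delta$ to modify, then shaping them to form (approximately) a square allows us to maximize the number of pairs $(u, v)$ for which $|S(a, b) \cap C(u, v)| = s^2$. Note that if $k = l^2$ then $a = b = l$, thus it is exactly a square which is optimal to maximize the overlap of convolutional filters and our update of the perturbation.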
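A quick numerical illustration of the counting argument is sketched below. It assumes that the maximal count has the form of a near-square rectangle plus a leftover strip, as written in the comments; that closed form is an assumption of this example, while the rectangle count itself is elementary. The function names are hypothetical.

```python
import math

def squares_in_rectangle(a, b, s):
    """Number of axis-aligned s x s squares that fit in an a x b rectangle."""
    return max(a - s + 1, 0) * max(b - s + 1, 0)

def max_squares_for_area(k, s):
    """Assumed closed form: (a-s+1)(b-s+1) + max(r-s+1, 0), valid for k >= s^2.

    a = floor(sqrt(k)), b = floor(k / a), r = k - a*b is the leftover strip.
    """
    a = math.isqrt(k)
    b = k // a
    r = k - a * b
    return (a - s + 1) * (b - s + 1) + max(r - s + 1, 0)
```

For an area of 36 cells and a 3 x 3 filter, a 6 x 6 square contains 16 filter positions, while a 4 x 9 rectangle contains 14 and a 3 x 12 rectangle only 10, which matches the intuition that square supports maximize the overlap with convolutional filters.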
Table 1: Results on ImageNet (I = Inception v3, R = ResNet-50, V = VGG-16-BN).
Norm | Attack | Failure rate I | Failure rate R | Failure rate V | Avg. queries I | Avg. queries R | Avg. queries V | Median queries I | Median queries R | Median queries V |
l_∞ | Bandits [ilyas2019prior] | 3.4% | 1.4% | 2.0% | 957 | 727 | 394 | 218 | 136 | 36 |
l_∞ | Parsimonious [MooEtAl2019] | 1.5% | - | - | 722 | - | - | 237 | - | - |
l_∞ | Sign bits [AlDujaili2019ThereAN] | 2.0% | - | - | 579 | - | - | - | - | - |
l_∞ | DFO – CMA, 50 tiles [MeuEtAl2019] | 0.8% | 0.0% | 0.1% | 630 | 270 | 219 | 259 | 143 | 107 |
l_∞ | DFO – Diag. CMA, 30 tiles [MeuEtAl2019] | 2.3% | 1.2% | 0.5% | 424 | 417 | 211 | 20 | 20 | 2 |
l_∞ | Square Attack (ours) | 0.3% | 0.0% | 0.0% | 197 | 73 | 31 | 24 | 11 | 1 |
l_2 | Bandits [ilyas2019prior] | 9.8% | 6.8% | 10.2% | 1486 | 939 | 511 | 660 | 392 | 196 |
l_2 | SimBA-DCT [guo2019simple] | 35.5% | 12.7% | 7.9% | 651 | 582 | 452 | 564 | 467 | 360 |
l_2 | Square Attack (ours) | 7.1% | 0.7% | 0.8% | 1100 | 616 | 377 | 385 | 170 | 109 |
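Table 2: Query statistics of the l_2-attacks computed only on the points where all the attacks are successful (I = Inception v3, R = ResNet-50, V = VGG-16-BN).
Attack | Avg. queries I | Avg. queries R | Avg. queries V | Median queries I | Median queries R | Median queries V |
Bandits [ilyas2019prior] | 536 | 635 | 398 | 368 | 314 | 177 |
SimBA-DCT [guo2019simple] | 647 | 563 | 421 | 552 | 446 | 332 |
Square Attack | 352 | 287 | 217 | 181 | 116 | 80 |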
In this section we show the effectiveness of the Square Attack. First, we follow the standard setup [ilyas2019prior, MeuEtAl2019] of comparing black-box attacks for three models on ImageNet in terms of success rate and query efficiency for the l_∞ and l_2 threat models (see Sec. 5.1). Our Square Attack outperforms the competitors in all these metrics, often by a large margin. Our proposed attack also has a higher success rate in the low-query regime (up to 200 queries), which we cover in Sup. E.1. Second, we show that our attack succeeds in fooling particular models for which white-box PGD attacks or other state-of-the-art black-box attacks suggest that they are robust (Sec. 5.2). Therefore, we believe that due to its effectiveness and simplicity the Square Attack should become a standard attack for evaluating the robustness of neural networks. Finally, in the supplementary material we provide more information about the experimental details in Sup. B, additional experiments and analysis regarding the transferability of the adversarial perturbations produced by our attack in Sup. C, an ablation study of the different components of our scheme in Sup. D, additional experimental results in Sup. E, and the stability of the attack under different random seeds in Sup. F. The code of the attack and experiments is available at https://github.com/max-andr/square-attack.
We compare the Square Attack to state-of-the-art score-based black-box attacks (without any extra information, e.g. surrogate models) for l_∞ [ilyas2019adversarial, MooEtAl2019, AlDujaili2019ThereAN, MeuEtAl2019] and for l_2 [ilyas2019adversarial, guo2019simple]. Additionally, we provide a comparison to [shukla2019blackbox] in the low-query regime in Sup. E.1. We do not compare to [alzantot2018genattack] since the median number of queries they report is an order of magnitude larger than that of the methods we consider.
We run all the attacks on three pretrained models in PyTorch (for some attacks we report the numbers from their papers), namely Inception v3, ResNet-50 and VGG-16-BN, using 1,000 images from the ImageNet validation set. As is standard in the literature, we give a budget of 10,000 queries per point to find an adversarial perturbation of l_p-norm smaller than or equal to $\epsilon$. We report the average and median number of queries each attack needs to craft an adversarial example, together with the failure rate. All statistics are computed only for originally correctly classified points, and the query statistics are additionally computed only for successful attacks.
One can see that the Square Attack, despite its simplicity, achieves in all cases (models and norms considered) the lowest failure rate, which is lower than 1% everywhere except for the l_2-attack on Inception v3. Moreover, in almost all cases it requires fewer queries than the competitors for a successful attack. In fact, the l_∞-attack requires on average between 2 and 7 times fewer queries, and the l_2-attack improves the average query complexity by at least a factor of 1.5 on all the models when evaluated only on the points where all the attacks are successful (see Table 2). We highlight that we set the only hyperparameter of our attack, $p$, which regulates the size of the squares, to $p = 0.05$ for l_∞- and $p = 0.1$ for l_2-perturbations for all the models.
We compare the Square Attack to the following l_∞ black-box attacks: Bandits [ilyas2019adversarial], Parsimonious [MooEtAl2019], Sign bits [AlDujaili2019ThereAN], and DFO [MeuEtAl2019]. We run Bandits using their publicly available code, with their suggested hyperparameters. For Sign bits and DFO there is no official implementation, so the statistics about their performance are taken directly from the respective papers. [MooEtAl2019] provide code for the Parsimonious Attack but it is incompatible with the PyTorch models used for the other attacks, so we show only the results on Inception v3 reported in the original paper.
In Table 1 we report the comparison of the l_∞-attacks on ImageNet (we allow maximal perturbations of size $\epsilon = 0.05$). First of all, the Square Attack always has the lowest failure rate, notably achieving 0.0% in 2 out of 3 cases, and the smallest average number of queries needed to find adversarial examples, improving by up to almost 7 times upon the best of the other methods (31 vs 211 queries on VGG-16-BN). Interestingly, our attack has a median of 1 query on VGG-16-BN, meaning that the initialization with stripes is particularly effective for this model.
The closest competitor among the l_∞-attacks is the CMA method of [MeuEtAl2019]. Note that their method with failure rates closer to our attack, DFO – CMA, has much worse query efficiency in terms of both the mean number of queries and, particularly, the median. Although DFO – Diag. CMA requires a median number of queries comparable to our method, it needs many more queries on average and also has a significantly higher failure rate. Finally, we note that the full-covariance CMA algorithm of [MeuEtAl2019] has a computational complexity quadratic in the dimension of the input space, which is an expensive operation given high-dimensional inputs such as images. On the contrary, our method is more efficient since it has only operations of linear complexity.
We compare our l_2-attack to Bandits [ilyas2019prior] and SimBA [guo2019simple]. Note that we do not consider the l_2 version of Sign Bits [AlDujaili2019ThereAN] since it is not as competitive as in the l_∞ scenario, and in particular worse than Bandits on ImageNet. We use the code of Bandits with their standard parameters. We consider the SimBA attack successful in the same setting as all other attacks, i.e. when the l_2-norm of the adversarial perturbation is smaller than $\epsilon$ (we set $\epsilon = 5$). We use the code from the paper repository with the parameters suggested by the authors for each of the three models.
As Table 1 shows, the Square Attack outperforms the other methods by a large margin in terms of failure rate. In Table 1 the average and median queries required are computed for each attack on the points where it was successful, which means that different points are used for different methods. In this setting, our attack achieves the lowest median number of queries for all the models and the lowest average one for VGG-16-BN. However, since it has a significantly lower failure rate, the statistics of the Square Attack are biased by the “hard” cases where the competitors fail. We therefore recompute the same query statistics considering only the points where all the attacks are successful (Table 2). In this case, our method improves the mean number of queries by a factor of at least 1.5 and the median by a factor of at least 2 when finding adversarial perturbations for the same images.
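The difference between the two ways of aggregating query counts can be made explicit with a short sketch: statistics are restricted to the images on which every attack succeeded, so that all methods are compared on the same points. The function name, the dictionary-based interface and the toy numbers in the usage note are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def query_stats_on_common_successes(queries, success):
    """Mean and median queries per attack on commonly successful points.

    queries: dict mapping attack name -> per-image query counts.
    success: dict mapping attack name -> per-image success flags.
    Assumes at least one image on which all attacks succeed.
    """
    flags = [np.asarray(s, dtype=bool) for s in success.values()]
    common = np.logical_and.reduce(flags)  # images where all attacks succeed
    stats = {}
    for name, q in queries.items():
        q_common = np.asarray(q, dtype=float)[common]
        stats[name] = (float(np.mean(q_common)), float(np.median(q_common)))
    return stats
```

With toy inputs such as `queries = {"A": [10, 20, 30, 40], "B": [100, 50, 999, 5]}` and success flags `{"A": [1, 1, 1, 0], "B": [1, 1, 0, 1]}`, only the first two images enter the statistics, so an attack is no longer penalized for succeeding on hard images where the other one fails.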
In this paragraph we show that the Square Attack performs very well on problems which are challenging for white-box and other black-box attacks such as Bandits [ilyas2019prior] and SimBA [guo2019simple] (we do not evaluate [MeuEtAl2019, AlDujaili2019ThereAN] because they do not provide the code of their methods). In particular, we break a recently proposed randomized defense and show that the Square Attack works well where PGD but also other black-box methods suffer from gradient masking. In the following, we use robust accuracy for evaluation, which is defined as the worst-case accuracy of a classifier when an attack is allowed to perturb each input in an l_p-ball of a given radius $\epsilon$.
We investigate whether the robustness claims of [lin2019bandlimiting] hold (as reported at https://www.robust-ml.org/preprints/). Their defense method is a randomized averaging method similar in spirit to [cohen2019certified]. The difference is that [lin2019bandlimiting] sample from the surfaces of several $d$-dimensional spheres instead of a Gaussian, and they do not derive any robustness certificates, but rather measure robustness by the PGD attack. We use the hyperparameters specified in their code (K=15, R=6 on CIFAR-10 and K=15, R=30 on ImageNet). We show in Table 3 that the proposed defense can be broken by the Square Attack, which is able to reduce the robust accuracy suggested by the evaluation with PGD from 88.4% to 15.8% on CIFAR-10 and from 76.1% to 0.4% on ImageNet.
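Table 3: Clean and robust accuracy of the randomized defense of [lin2019bandlimiting].
Dataset | Clean accuracy | Robust accuracy (PGD) | Robust accuracy (Square Attack) |
CIFAR-10 | 92.6% | 88.4% | 15.8% |
ImageNet | 77.3% | 76.1% | 0.4% |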
The two l_∞ defenses proposed in [kannan2018adversarial], Clean Logit Pairing (CLP) and Logit Squeezing (LSQ), have been broken. However, [MosEtAl18] needed up to 10k restarts of the PGD attack, which is computationally prohibitive. Using the publicly available models from [MosEtAl18], we run the Square Attack with a 20k query limit and report the results in Table 4. We obtain robust accuracy similar to PGD_{R} in most cases, but with a single run, i.e. without additional restarts. At the same time, Bandits show considerably worse results than the Square Attack, although they still perform better than PGD_{1} on the CLP_{MNIST} and LSQ_{MNIST} models.
Model | Robust accuracy (PGD_{1}) | Robust accuracy (PGD_{R}) | Robust accuracy (Bandits) | Robust accuracy (Square) |
CLP_{MNIST} | 62.4% | 4.1% | 33.3% | 6.1% |
LSQ_{MNIST} | 70.6% | 5.0% | 37.3% | 2.6% |
CLP_{CIFAR} | 2.8% | 0.0% | 14.3% | 0.2% |
LSQ_{CIFAR} | 27.0% | 1.7% | 27.7% | 7.2% |
Table 4: l_∞-robustness of Clean Logit Pairing (CLP), Logit Squeezing (LSQ) [kannan2018adversarial]. The Square Attack with 20k queries is competitive to PGD (white-box) with many restarts (R=10,000 and R=100 on MNIST and CIFAR-10 respectively) and more effective than Bandits (black-box).
Adversarial training [madry2018towards] is one of the state-of-the-art techniques to train robust models. We attack the adversarially trained models on MNIST and CIFAR-10 with l_∞ attacks and present the results (on 1000 points) in Table 5. With our simple random search algorithm we are able to get a robust accuracy of 87.1% on MNIST. The Square Attack with 20k queries is competitive to PGD with many restarts and more effective than Bandits.
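Table 5: Robust accuracy of the adversarially trained models of [madry2018towards] under l_∞ attacks.
Model | Robust accuracy (PGD_{1}) | Robust accuracy (PGD_{R}) | Robust accuracy (Bandits) | Robust accuracy (Square) |
AT_{MNIST} | 92.5% | 89.6% | 89.4% | 87.1% |
AT_{CIFAR} | 47.0% | 45.2% | 52.7% | 46.1% |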
Table 6: Robust accuracy of each attack at different thresholds ε.
ε | PGD | PGD | PGD | Bandits | SimBA | Square |
1.0 | 91.5% | 90.6% | 89.3% | 91.7% | 97.6% | 89.2% |
1.5 | 86.1% | 80.8% | 76.7% | 87.6% | 94.1% | 60.8% |
2.0 | 79.6% | 67.4% | 59.8% | 80.1% | 87.6% | 16.7% |
2.5 | 69.2% | 51.3% | 36.0% | 32.4% | 75.8% | 2.4% |
3.0 | 57.6% | 29.8% | 12.7% | 12.5% | 58.1% | 0.6% |
In Table 6 we report the robust accuracy at different thresholds of the adversarially trained models on MNIST of [madry2018towards] for the l_2-threat model. It is known that the PGD attack fails to reduce the robust accuracy in this threat model because it suffers from gradient masking [TraBon2019]. Strikingly, in contrast to PGD and the other black-box attacks we consider, our Square Attack does not suffer from gradient masking and yields robust accuracy close to zero at the largest thresholds (2.4% at 2.5 and 0.6% at 3.0). This is obtained with only a single run, compared to the multiple random restarts used for PGD.
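To make concrete why a score-based random search needs no gradient information, here is a minimal sketch of an l_∞ random search with square-shaped updates. It is a simplification under stated assumptions, not the released implementation: the image is a single-channel grid, the window size is fixed rather than scheduled, and `square_attack_linf`, `loss_fn`, `toy_loss` and all parameter names are hypothetical.

```python
import random

def square_attack_linf(x, loss_fn, eps, window, n_queries, seed=0):
    """Simplified l_inf random search with square updates (illustrative only).

    x         -- clean grayscale image as a list of rows, values in [0, 1]
    loss_fn   -- black-box score to minimize; one call equals one query
    eps       -- l_inf budget
    window    -- side length of the square update (kept fixed here)
    n_queries -- total number of loss_fn calls allowed
    """
    rng = random.Random(seed)
    h, w = len(x), len(x[0])

    def clip(v):
        return min(1.0, max(0.0, v))

    # Initialization: one random sign per column, so every pixel is moved
    # by +eps or -eps (before clipping to the valid range).
    col_delta = [rng.choice((-eps, eps)) for _ in range(w)]
    x_adv = [[clip(x[i][j] + col_delta[j]) for j in range(w)] for i in range(h)]
    best = loss_fn(x_adv)

    for _ in range(n_queries - 1):
        # Pick a random square and a single sign shared by the whole square.
        r = rng.randint(0, h - window)
        c = rng.randint(0, w - window)
        delta = rng.choice((-eps, eps))
        cand = [row[:] for row in x_adv]
        for i in range(r, r + window):
            for j in range(c, c + window):
                cand[i][j] = clip(x[i][j] + delta)
        # Greedy acceptance: keep the candidate only if the score improves.
        loss = loss_fn(cand)
        if loss < best:
            x_adv, best = cand, loss
    return x_adv, best


# Toy usage: a hypothetical linear score on an 8x8 image.
clean = [[0.5] * 8 for _ in range(8)]
weights = [[1.0 if i < 4 else -1.0 for _ in range(8)] for i in range(8)]

def toy_loss(img):
    return sum(weights[i][j] * img[i][j] for i in range(8) for j in range(8))

adv, best = square_attack_linf(clean, toy_loss, eps=0.1, window=2, n_queries=200)
```

The loop only ever calls `loss_fn` on candidate inputs, so a model whose gradients are masked or uninformative is queried in exactly the same way as any other model.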
We have presented the randomized score-based Square Attack, which outperforms the state of the art in terms of both query efficiency and success rate, and have used it to break a recently proposed defense for which the PGD attack massively overestimates robustness. We have also provided theoretical background on why the Square Attack works well. In future work it would be interesting to use the Square Attack to explore the set of adversarial examples for a sensitivity analysis of neural networks.
We are very grateful to Laurent Meunier and Satya Narayan Shukla for providing the data for Figure 5. M.A. also thanks Apostolos Modas for fruitful discussions.
M.H. and F.C. acknowledge support from the BMBF through the Tübingen AI Center (FKZ: 01IS18039A), the DFG TRR 248, project number 389792660 and the DFG Excellence Cluster “Machine Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645.
In Section A, we present the missing proofs of Section 3 and Section 4 and slightly deepen our theoretical insights on the efficiency of the proposed attack. Section B covers various implementation details and the hyperparameters we used. In Section C, we discuss the transferability properties of the adversarial examples generated by our attack. We show an ablation study on different choices of the attack’s algorithm in Section D. Section E presents the success rate on ImageNet for different numbers of queries, and also the query efficiency on the challenging models (logit pairing and adversarial training). Finally, since the Square Attack is a randomized algorithm, we show the variance of the main reported performance measures for different random seeds in Section F.
Let be the output of Algorithm 3. We prove here that .
From Step 13 of Algorithm 3, we directly have the equality . Let be the update at the previous iteration, defined in Step 1 and the indices not belonging to . Then,
where holds since as the modifications affect only the elements in the two windows, and holds by the definition of in Step 4 of Algorithm 3.
Using the -smoothness of the function , that is it holds for all ,
we obtain (see e.g. [BoyVan2004]):
and by definition of we have
Using the definition of the min as a function of the absolute value (min{a, b} = (a + b − |a − b|)/2) yields
And using the triangle inequality, we have
Therefore taking the expectation and using that , we get
Therefore, together with Assumption 3, this yields
and thus
Thus for we have summing for
We conclude by setting the step size to .
Let us consider an update with a window size and the direction defined as
It is easy to check that any update drawn from the sampling distribution is orthogonal to this direction :
Therefore and Assumption 3 does not hold. This implies that the convergence analysis does not directly hold for the sampling distribution .
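The orthogonality argument can be checked numerically on a small grid. As a concrete instance, assume a window of side 2 and a checkerboard direction v[k][l] = (-1)^(k+l); both choices, and the helper names, are made here for illustration only. Every 2x2 window of a checkerboard contains two +1 and two -1 entries, so an update with one shared sign on the window has zero inner product with v.

```python
def checkerboard(n):
    """Assumed direction v with v[k][l] = (-1)^(k+l) on an n x n grid."""
    return [[(-1) ** (k + l) for l in range(n)] for k in range(n)]

def inner_with_window(v, r, c, h, sign):
    """<v, delta> for delta equal to `sign` on the h x h window at (r, c), 0 elsewhere."""
    return sum(sign * v[i][j] for i in range(r, r + h) for j in range(c, c + h))

n, h = 6, 2
v = checkerboard(n)
# Enumerate every window position and both signs of the shared update.
products = [inner_with_window(v, r, c, h, s)
            for r in range(n - h + 1)
            for c in range(n - h + 1)
            for s in (-1, 1)]
print(set(products))  # prints {0}
```

Since every possible update is orthogonal to this direction, the expected absolute inner product is zero, which is exactly the situation in which a lower bound of the kind required by Assumption 3 cannot hold.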
Let us consider the sampling distribution where a different Rademacher variable is drawn for each pixel of the update window . We present it in Algorithm 4 with the convention that any subscript should be understood as . This technical modification is helpful for avoiding side effects.
Let for which we have using the Khintchine inequality [haagerup1981best]:
where we define by and follows from the decomposition between the randomness of the Rademacher variables and the random window, follows from the Khintchine inequality and follows from Jensen's inequality.
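As a sanity check of the Khintchine step, the following sketch computes E|<v, z>| exactly for a small vector v and independent Rademacher signs z, and compares it with the bounds ||v||_2 / sqrt(2) (Khintchine, with the best constant for the first moment) and ||v||_2 (Jensen). The vector and the function name `mean_abs_rademacher` are hypothetical and chosen only for illustration.

```python
import itertools
import math

def mean_abs_rademacher(v):
    """Exact E|<v, z>| for z uniform on {-1, +1}^n, by enumerating all 2^n sign vectors."""
    n = len(v)
    total = sum(abs(sum(s * a for s, a in zip(signs, v)))
                for signs in itertools.product((-1, 1), repeat=n))
    return total / 2 ** n

v = [3.0, -1.0, 2.0, 0.5]                  # arbitrary example vector
l2 = math.sqrt(sum(a * a for a in v))      # Euclidean norm of v
m = mean_abs_rademacher(v)

# Khintchine lower bound and Jensen upper bound on the first absolute moment.
assert l2 / math.sqrt(2) <= m <= l2
print(m, l2 / math.sqrt(2), l2)
```

The enumeration is exponential in the length of v, so it is only meant for tiny examples like this one.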
Proposition 4.1 underlines the importance of a large inner product in the direction of the gradients. This provides some intuition for why the update where a single Rademacher variable is drawn for each window is more efficient than the update where different Rademacher variables are drawn. Following the observation that adversarial gradients are often piecewise constant [ilyas2019prior], we consider, as a heuristic, a piecewise constant direction
for which
Therefore, the directions sampled by our proposal are more correlated with the gradient direction and help the algorithm converge faster. This is also verified empirically in our experiments (see the ablation study in Sup. D).
Let us consider the direction composed of different blocks of constant sign.
For this direction we compare two different proposals and where we choose uniformly one random block and either assign a single Rademacher variable to the whole block (this is ) or assign multiple Rademacher variables (this is ). We have
Therefore we obtain the -norm of the different groups. For the update we obtain
where follows from the fact that has a constant sign. We then recover the -norm of the direction .
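The gap between the two proposals can be made explicit on a single block. The sketch below assumes a block of k entries that all equal the same value c (a hypothetical, perfectly constant block) and uses illustrative function names. With one shared sign the absolute inner product is the l_1-norm of the block, |c| k. With one sign per entry, the sum of k independent signs equals k - 2j with probability C(k, j) / 2^k, and the resulting expectation is at most the l_2-norm of the block, |c| sqrt(k).

```python
import math

def single_sign_gain(c, k):
    """|<v, delta>| when one Rademacher sign is shared by k entries all equal to c."""
    return abs(c) * k  # the l_1-norm of the block

def per_entry_sign_gain(c, k):
    """E|<v, delta>| when each of the k entries gets its own Rademacher sign."""
    # A sum of k independent signs takes the value k - 2j with probability C(k, j) / 2^k.
    return abs(c) * sum(math.comb(k, j) * abs(k - 2 * j) for j in range(k + 1)) / 2 ** k

for side in (1, 2, 4, 8):
    k = side * side  # number of pixels in a square block of this side
    print(side, single_sign_gain(1.0, k), per_entry_sign_gain(1.0, k))
```

The shared-sign value grows linearly in k while the per-entry value grows only like sqrt(k), so the advantage of a single sign increases with the block size.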
For quasi-constant block, then will be larger than