Guessing Smart: Biased Sampling for Efficient Black-Box Adversarial Attacks

12/24/2018 ∙ by Thomas Brunner, et al.

We consider adversarial examples in the black-box decision-based scenario. Here, an attacker has access to the final classification of a model, but not its parameters or softmax outputs. Most attacks for this scenario are based either on transferability, which is unreliable, or random sampling, which is extremely slow. Focusing on the latter, we propose to improve sampling-based attacks with prior beliefs about the target domain. We identify two such priors, image frequency and surrogate gradients, and discuss how to integrate them into a unified sampling procedure. We then formulate the Biased Boundary Attack, which achieves a drastic speedup over the original Boundary Attack. Finally, we demonstrate that our approach outperforms most state-of-the-art attacks in a query-limited scenario and is especially effective at breaking strong defenses: Our submission scored second place in the targeted attack track of the NeurIPS 2018 Adversarial Vision Challenge.




I Introduction

Ever since the term was first coined, adversarial examples have enjoyed much attention from Machine Learning researchers. The fact that tiny perturbations can lead otherwise robust-seeming models to misclassify an input could pose a major problem for safety and security. But when discussing adversarial examples, it is often unclear how realistic the scenario of a proposed attack truly is. In this work, we consider a threat setting with the following parameters:

Black-box. The black-box setting assumes that the attacker has access only to the input and output of a model. Compared to the white-box setting, where the attacker has complete access to the architecture and parameters of the DNN model, attacks in this setting are significantly harder to conduct: Most state-of-the-art white-box attacks [2, 3, 4] rely on gradients that are directly computed from the model parameters, which are not available in the black-box setting.

Decision-based classification. Depending on the output format of the DNN, the problem of missing gradients can be circumvented. In a soft-label scenario, the model provides real-valued outputs (for example, softmax activations). By applying tiny modifications to the input, an attacker can estimate gradients by observing changes in the output [5] and then follow this estimate to generate adversarial examples. The decision-based setting, in contrast, provides only a single discrete result (e.g. the top-1 label), and gradient estimation is not possible. Searching for adversarial examples now becomes a combinatorial problem that is much harder to optimize [6].
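To make the contrast concrete, the following is a minimal sketch of finite-difference gradient estimation in the soft-label case. The toy linear-softmax model, the weight matrix `W`, and the function names are illustrative assumptions, not part of any attack discussed here.

```python
import numpy as np

def estimate_gradient(score_fn, x, eps=1e-4):
    """Central finite differences: two model queries per input dimension."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (score_fn(x + step) - score_fn(x - step)) / (2 * eps)
    return grad

# Hypothetical toy classifier standing in for the black-box model.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))

def softmax_score(x, label=0):
    """Soft-label output: probability assigned to `label`."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    return (p / p.sum())[label]

def top1_label(x):
    """Decision-based output: only the arg-max label is visible."""
    return int(np.argmax(W @ x))

x0 = rng.normal(size=5)
grad_est = estimate_gradient(softmax_score, x0)

# With label-only output, tiny steps rarely flip the decision, so the same
# procedure usually yields no usable signal.
label_grad = estimate_gradient(lambda x: float(top1_label(x)), x0)
```

Even in the soft-label case this costs two queries per input dimension, which already hints at why query budgets matter.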

Limited queries. In order to test whether the DNN model was successfully fooled, the attacker must query it for the classification of a candidate. In the white-box case, the number of queries is typically unlimited, but black-box attacks in the real world might not be feasible if they need thousands of iterations, and possibly multiple hours’ time, to be successful. We therefore consider a scenario where the attacker must find a convincing adversarial example in fewer than 1000 queries.


Targeted. An untargeted attack is considered successful when the classification result is any label other than the original. Depending on the number and semantics of the classes, it can be easy to find a label that requires little change, but is considered adversarial (e.g. Egyptian cat vs. Persian cat). A targeted attack, in contrast, needs to produce exactly the specified label. This task is strictly harder than the untargeted attack, further decreasing the probability of success.

In this setting, current state-of-the-art attacks are either unreliable or very inefficient. Our contribution is as follows:

  • We introduce the Biased Boundary Attack (BBA), which uses prior beliefs about the target domain to sample adversarial perturbations with a high probability of success.

  • For this, we discuss a range of biases for the sampling procedure, such as low-frequency patterns and projected gradients.

  • We show that our method outperforms the previous state of the art, and present the results of our submission to the NeurIPS 2018 Adversarial Vision Challenge, where our approach won second place in the targeted attack track.

II Related Work

There currently exist two major schools of attacks in the threat setting we consider:

II-A Transfer-based

It is well known that adversarial examples display a high degree of transferability, even between different model architectures [7]. Transfer attacks seek to exploit this by training substitute models that are reasonably similar to the model under attack, and then apply regular white-box attacks to them.

Typically this is performed by iterative applications of gradient-based methods, such as the Fast Gradient Method (FGM) [2] and Projected Gradient Descent (PGD) [4]. The black-box model is used in the forward pass, while the backward pass is performed with the surrogate model [9]. In order to maximize the chance of a successful transfer, newer methods use large ensembles of substitute models, and applying adversarial training to the substitute models has been found to increase the probability of finding strong adversarial examples even further [8].

Although these methods currently form the state of the art in decision-based black-box attacks [10], they have one major weakness: as soon as a defender manages to reduce transferability, direct transfer attacks often run a risk of complete failure, delivering no result even after thousands of iterations. As a result, conducting transfer attacks is a cat-and-mouse game between attacker and defender, where the attacker must go to great lengths to train models that are just as robust as the defender’s. Therefore, transfer-based attacks can be very efficient, but also somewhat unreliable.

II-B Sampling-based

Circumventing this problem, sampling-based attacks do not rely on direct transfer and instead try to find adversarial examples by randomly sampling perturbations from the input space.

Perhaps the simplest attack consists of sampling a hypersphere around the original image, and drawing more and more samples until an adversarial example is found. Owing to the high dimensionality of the input space, this method is very inefficient and has been dismissed as completely unviable [1]. While this is not our main focus, our results in Section IV show that even this crude attack can be accelerated and made competitive with existing black-box attacks.

Recently, a more sophisticated attack has been proposed: the Boundary Attack (BA) [6]. This attack is initialized with an input of the desired class, and then takes small steps along the decision boundary to reduce the distance to the original input. Previous works have established that regions which contain adversarial examples often have the shape of a "cone" [8], which can be traversed from start to finish. At each step, the Boundary Attack employs random sampling to find a sideways direction that leads deeper into this cone. From there, it can then take the next step towards the target.

The Boundary Attack has been shown to be very powerful, producing adversarial examples that are competitive with even the results of state-of-the-art white-box attacks [6]. However, its Achilles' heel is again query efficiency: to achieve these results, the attack typically needs to query the model hundreds of thousands of times.

Therefore, sampling-based attacks are generally more flexible, but often too inefficient for practical use.

III The Biased Boundary Attack

Most current sampling-based attacks have one thing in common: they draw samples from either normal or uniform distributions. This means that they perform unbiased sampling, perturbing each input independently of the others. But this can be very inefficient, especially against a strong defender.

Consider the distribution of natural images: their pixels are typically not independent of each other. This alone is a strong indicator that drawing perturbations from i.i.d. random variables will lead to adversarial examples that are mostly out of distribution for natural image datasets. This, of course, renders them vulnerable to detection and filtering.

It seems only logical to constrain the search space to perturbations that we believe to have a higher chance of success. Putting this concept into practice, we propose the Biased Boundary Attack (BBA), which uses domain knowledge as prior beliefs for its sampling procedure, and through it achieves a large speedup over the original Boundary Attack.

In this section, we outline two such priors for the domain of image classification, and show how to integrate them into a single, unified attack.

III-A Low-frequency perturbations

When one looks at typical adversarial examples, it quickly becomes apparent that most existing methods yield perturbations with high image frequency. But high-frequency patterns have a significant problem: they are easily identified and separated from the original image signal, and are quickly destroyed by spatial transforms. Indeed, most of the winning defenses in the NeurIPS 2017 Adversarial Attacks and Defences Competition were based on denoising [11], simple median filters [10], and random transforms [12]. In other words: state-of-the-art defenses are designed to filter high-frequency noise.

At the same time, we know that it is possible to synthesize "robust" adversarial examples which are not easily filtered in this way [13]. These robust perturbations are largely invariant to filters and transforms and, interestingly enough, at first glance seem to contain very little high-frequency noise. It seems obvious that such patterns should be ideal for breaking black-box defenses.

Inspired by this observation, we hypothesize that image frequency alone could be a key factor in the robustness of adversarial perturbations. If true, then simply limiting perturbations to the low-frequency domain should increase the success chance of an attack, while incurring no extra cost.

Perlin Noise patterns. A straightforward way to generate parametrized, low-frequency patterns is to use Perlin Noise [14]. Originally intended as a procedural texture generator for computer graphics, this function creates low-frequency noise patterns with a reasonably "natural" look. One such pattern can be seen in Figure 1c. But how can we use it to create a prior for the Biased Boundary Attack?

Fig. 1: Adversarial perturbations for an untargeted attack, obtained after testing 1000 random samples. (a) and (b): Sampled from a normal distribution. (c) and (d): Sampled from a distribution of Perlin noise patterns. In both cases, the network misclassifies the image, and picks a different label than the original (lion).

Let $k$ be the dimensionality of the input space. The original Boundary Attack (Figure 2a) works by applying an orthogonal perturbation $\eta^k$ along the surface of a hypersphere around the target image, in the hope of moving deeper into an adversarial region. From there, a second step is taken towards the source image. In its default configuration, candidates for $\eta^k$ are generated from samples $s \sim \mathcal{N}(0, 1)^k$, which are projected orthogonally to the source direction and normalized to the desired step size. This leads to the directions being uniformly distributed along the hypersphere.
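The candidate generation for the orthogonal step can be sketched as follows. This is an illustrative reading of the description above, not the reference Boundary Attack code; the names `current`, `reference`, and `step_size` are assumptions, with `reference` standing for the image whose direction the step must stay orthogonal to.

```python
import numpy as np

def orthogonal_candidate(current, reference, step_size, rng):
    """Draw one unbiased candidate for the orthogonal step."""
    # Unit vector pointing from the current adversarial point to the reference.
    direction = (reference - current).ravel()
    direction = direction / np.linalg.norm(direction)

    # i.i.d. normal sample over all k input dimensions.
    s = rng.standard_normal(direction.size)

    # Remove the component along the direction, leaving a sideways move.
    s = s - np.dot(s, direction) * direction

    # Scale to the desired step size.
    eta = step_size * s / np.linalg.norm(s)
    return eta.reshape(current.shape)

rng = np.random.default_rng(0)
current = rng.standard_normal((3, 64, 64))
reference = rng.standard_normal((3, 64, 64))
eta = orthogonal_candidate(current, reference, step_size=0.1, rng=rng)
```

Because a normalized Gaussian vector has no preferred direction, the resulting candidates are spread uniformly over the orthogonal directions.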

To introduce a low-frequency prior into the Biased Boundary Attack, we instead sample from a distribution of Perlin noise patterns (Figure 2b). In its original implementation, Perlin noise is parametrized with a permutation vector $v$ of size 256, which we randomly shuffle on every call. Effectively, this allows us to sample two-dimensional noise patterns $s \sim \mathrm{Perlin}_{h,w}(v)$, where $h$ and $w$ are the image dimensions (and $h \cdot w = k$, the input dimensionality). As a result, the samples are now strongly concentrated in low-frequency regions.
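A compact sketch of sampling such a pattern is given below. It follows the classic lattice-gradient construction with a 256-entry permutation table that is reshuffled on every call. The number of lattice cells (`cells`), the choice of eight gradient directions, and the function name are assumptions made for illustration and may differ from the implementation used in the attack.

```python
import numpy as np

def perlin_pattern(height, width, rng, cells=4):
    """Sample one 2-D Perlin noise pattern from a freshly shuffled table."""
    perm = rng.permutation(256)
    perm = np.concatenate([perm, perm])  # doubled so hashed lookups stay in range

    # Eight unit gradient directions, chosen by the hashed lattice index.
    angles = 2 * np.pi * np.arange(8) / 8
    grads = np.stack([np.cos(angles), np.sin(angles)], axis=1)

    # Pixel coordinates measured in lattice cells.
    ys, xs = np.meshgrid(
        np.linspace(0, cells, height, endpoint=False),
        np.linspace(0, cells, width, endpoint=False),
        indexing="ij",
    )
    x0, y0 = xs.astype(int), ys.astype(int)
    fx, fy = xs - x0, ys - y0

    def corner(ix, iy, dx, dy):
        # Dot product between the corner's gradient and the offset to the pixel.
        g = grads[perm[perm[ix] + iy] % 8]
        return g[..., 0] * dx + g[..., 1] * dy

    def fade(t):
        # Perlin's smooth interpolation curve 6t^5 - 15t^4 + 10t^3.
        return t * t * t * (t * (t * 6 - 15) + 10)

    u, v = fade(fx), fade(fy)
    n00 = corner(x0, y0, fx, fy)
    n10 = corner(x0 + 1, y0, fx - 1, fy)
    n01 = corner(x0, y0 + 1, fx, fy - 1)
    n11 = corner(x0 + 1, y0 + 1, fx - 1, fy - 1)
    top = n00 + u * (n10 - n00)
    bottom = n01 + u * (n11 - n01)
    return top + v * (bottom - top)

rng = np.random.default_rng(0)
pattern = perlin_pattern(64, 64, rng)
pattern = pattern / np.linalg.norm(pattern)  # normalize before use as a sample
```

A pattern like this can take the place of the normal sample before the projection and scaling of the orthogonal step; how it is spread across color channels is left open here.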

Our experiments in Section IV show that this greatly improves the efficiency of the attack. Therefore, we reason that the distribution of Perlin noise patterns contains a higher concentration of adversarial directions than the normal distribution.

Other low-frequency patterns. We note that, independent of our work, a similar effect has very recently been described in [15]. They decompose random perturbations with the Discrete Cosine Transform, and then remove high frequencies from the spectrum. Reaching the same conclusions, they go on to modify the Boundary Attack and show a large increase in efficiency. Since their method was not known to us at the time of our submission to the Adversarial Vision Challenge, we cannot directly compare their results with our own at this time, but aim to do so in the future.
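For comparison, the DCT-based idea can be sketched in a few lines: transform Gaussian noise, zero out the high-frequency coefficients, and transform back. The cut-off parameter `keep_ratio` and the square low-pass mask are assumptions chosen for illustration; the exact procedure in [15] may differ.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def low_frequency_noise(height, width, rng, keep_ratio=0.125):
    """Gaussian noise with all but the lowest DCT frequencies removed."""
    noise = rng.standard_normal((height, width))
    ch, cw = dct_matrix(height), dct_matrix(width)

    # Forward 2-D DCT.
    spectrum = ch @ noise @ cw.T

    # Keep only the top-left (low-frequency) block of coefficients.
    kh = max(1, int(height * keep_ratio))
    kw = max(1, int(width * keep_ratio))
    mask = np.zeros_like(spectrum)
    mask[:kh, :kw] = 1.0

    # Inverse 2-D DCT (the basis is orthonormal, so the inverse is the transpose).
    filtered = ch.T @ (spectrum * mask) @ cw
    return filtered / np.linalg.norm(filtered)

rng = np.random.default_rng(0)
pattern = low_frequency_noise(64, 64, rng)
```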

There are many ways to generate low-frequency noise, but as we show in Section IV, Perlin noise alone already performs extremely well. In fact, we used a very simple implementation in our winning submission to the Adversarial Vision Challenge, without any specific fine-tuning. Nevertheless, we aim to study image frequency in more detail in the future, and to identify the key factors that maximize transferability.

III-B Gradients from surrogate models


Fig. 2: Sampling directions for the orthogonal step. (a) Boundary Attack: uniformly distributed along the surface of the hypersphere. (b) BBA with Perlin bias: high sample density in the direction of low-frequency perturbations. (c) BBA with Perlin and gradient biases: samples further concentrate towards the direction of the projected gradient.

What other source of information contains strong hints about directions that are likely to point to an adversarial region? The natural answer is: gradients from a substitute model. A large range of such models is available to us, as many state-of-the-art black-box attacks rely on them to perform a transfer attack via PGD methods [8, 10].

Arguably, the main weakness of gradient-based transfer attacks is that they fail when adversarial regions of the surrogate model do not closely match the defender’s. However, even when this is the case, those regions may still be reasonably nearby. Based on this intuition, some approaches extend gradient-based attacks with limited regional sampling [9]. Here, we do exactly the opposite and extend a sampling-based attack with adversarial gradients. This has the significant advantage that, in the case of limited transferability, our method merely experiences a slowdown, whereas PGD-based methods often fail altogether.

Our method works as follows:

  • An adversarial gradient from a surrogate model is calculated. Since the current position is already adversarial, it can be helpful to move a small distance towards the target first, making sure to calculate the gradient from inside a non-adversarial region.

  • The gradient usually points away from the target, therefore we project it onto the surface of a hypersphere around the target, as shown in Figure 2c.

  • This projection is on the same hyperplane as the candidates for the orthogonal step. We can now bias the candidate perturbations toward the projected gradient by any method of our choosing. Provided all vectors are normalized, we opt for a simple addition:

    $\eta^k_{biased} = (1 - w) \cdot \eta^k + w \cdot \eta^k_{PG}$

    where $\eta^k$ is the sampled candidate and $\eta^k_{PG}$ is the projected gradient.

  • $w$ controls the strength of the bias and is a hyperparameter that should be tuned according to the performance of our substitute model. High values for $w$ should be used when transferability is high, and vice versa. Were we to choose the maximum value, $w = 1$, then the orthogonal step would be equivalent to an iteration of the PGD attack. In our experience, $w = 0.5$ generally leads to good performance.
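The steps above can be sketched as follows. The weighted sum is written here as a convex combination, which is an assumption chosen because it matches the remark that the maximum bias strength of 1 reduces the step to the gradient direction alone. The function and argument names are illustrative, and the surrogate gradient is taken as a given array rather than computed from a real model.

```python
import numpy as np

def project_orthogonal(v, direction):
    """Remove from v its component along `direction`."""
    d = direction / np.linalg.norm(direction)
    return v - np.dot(v, d) * d

def biased_candidate(sample, surrogate_grad, source_dir, w=0.5):
    """Blend a random sample with the projected surrogate gradient."""
    # Both vectors are projected onto the same hyperplane and normalized.
    eta = project_orthogonal(sample, source_dir)
    eta = eta / np.linalg.norm(eta)
    eta_pg = project_orthogonal(surrogate_grad, source_dir)
    eta_pg = eta_pg / np.linalg.norm(eta_pg)

    # Bias strength w: 0 gives the unbiased sample, 1 the gradient direction.
    biased = (1.0 - w) * eta + w * eta_pg
    return biased / np.linalg.norm(biased)

rng = np.random.default_rng(0)
k = 3 * 64 * 64
source_dir = rng.standard_normal(k)
surrogate_grad = rng.standard_normal(k)   # placeholder for a real gradient
sample = rng.standard_normal(k)           # could equally be a Perlin pattern
candidate = biased_candidate(sample, surrogate_grad, source_dir, w=0.5)
```

The result is a unit direction that still has to be scaled to the step size of the orthogonal step.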

As a result, samples concentrate in the vicinity of the projected gradient, but still cover the rest of the search space (albeit with lower resolution). In this way, substitute models are purely optional to our attack, instead of forming the central part. It should be noted though that at least some measure of transferability should exist. Otherwise, the gradient will point in a bogus direction and using a high value for the bias strength $w$ would reduce efficiency instead of improving it.

For the time being, this does not pose a major problem. To the best of our knowledge, no strategies exist that successfully eliminate transferability altogether. As we go on to show, even a very simple substitute model delivers a substantial speedup. In the Adversarial Vision Challenge, our attack outperforms most competitors, even though our surrogate models are much simpler than theirs.

IV Evaluation: Adversarial Vision Challenge

When evaluating adversarial attacks and defenses, it is hard to obtain meaningful results. Very often, attacks are tested against weak defenses and vice versa, and results are cherry-picked. We sidestep this problem by instead submitting our approach to the NeurIPS 2018 Adversarial Vision Challenge (AVC), where our method is pitted against state-of-the-art black-box defenses.

Evaluation setting. The AVC is an open competition between image classifiers and adversarial attacks in an iterative black-box decision-based setting [16]. Participants can choose between three tracks:

  • Robust model: The submitted code is a robust image classifier. The goal is to maximize the norm of any successful adversarial perturbation.

  • Untargeted attack: The submitted code must find a perturbation that changes classifier output, while minimizing the distance to the original image.

  • Targeted attack: Same as above, but the classification must be changed to a specific label.

Attacks are continuously evaluated against the current top-5 robust models and vice versa. Each evaluation run consists of 200 images with a resolution of 64x64, and the attacker is allowed to query the model 1000 times for each image. The final attack score is then determined by the median $L_2$ norm of the perturbation over all 200 images and top-5 models (lower is better).

At the time of writing, the exact methods of most models were not yet published. But seeing as over 60 teams competed in this track, it is reasonable to assume that the top-5 models accurately depict the state of the art in adversarial robustness. It is against those models that we evaluate our attack.


Dataset. The models classify images according to the Tiny ImageNet dataset, which is a down-scaled version of the ImageNet classification dataset, limited to 200 classes with 500 images each. Model input consists of color images with 64x64 pixels, and the output is one of 200 labels. The evaluation is conducted with a secret hold-out set of images, which is not contained in the original dataset, and is unknown to participants of the challenge.

IV-A Random guessing with low frequency

Before implementing the Biased Boundary Attack, we first conduct a simple experiment to demonstrate the effectiveness of Perlin noise patterns. Specifically, we run a random-guessing attack that samples candidates uniformly from the surface of an $L_2$-hypersphere with radius $\epsilon$ around the original image $x_0$:

$s \sim \mathcal{N}(0, 1)^k, \quad x_{adv} = x_0 + \epsilon \cdot \frac{s}{\|s\|_2}$

With a total budget of 1000 queries to the model for each image, we use binary search to reduce the sampling distance whenever an adversarial example is found. Preliminary results have indicated that the targeted setting may be too difficult for pure random guessing. Therefore we limit the experiment to the untargeted attack track, where the probability of randomly sampling any of 199 adversarial labels is reasonably high.

We then submit the same attack, replacing the distribution with normalized Perlin noise, where $v$ is the randomly shuffled permutation vector:

$s \sim \mathrm{Perlin}_{64,64}(v)$

Figure 1 shows adversarial examples from both distributions. As we can see in Table I, Perlin patterns are much more efficient, and the attack finds adversarial perturbations with much lower distance (a reduction of roughly 62%, from 11.15 to 4.28). Although intended as a dummy submission to the AVC, this attack was already strong enough for a top-10 placement in the untargeted track.

Queries   Distribution          Median L2 distance
1000      Normal                11.15
1000      Perlin noise (ours)   4.28
TABLE I: Random guessing with low frequency (untargeted)
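The random-guessing procedure of this subsection can be sketched as below. The text only states that binary search is used to reduce the sampling distance after each success; the concrete bookkeeping here (a lower bound raised after `patience` consecutive failures, and the `start_radius` default) is one possible reading and is an assumption, as is the toy decision function used in place of a real model.

```python
import numpy as np

def random_guessing_attack(is_adversarial, x_orig, sample_fn, rng,
                           queries=1000, start_radius=20.0, patience=25):
    """Sample on an L2 sphere around x_orig and bisect the radius downwards."""
    best = None
    lo, hi = 0.0, start_radius
    radius, fails = start_radius, 0
    for _ in range(queries):
        s = sample_fn(x_orig.shape, rng)
        candidate = x_orig + radius * s / np.linalg.norm(s)
        if is_adversarial(candidate):
            # Success: this radius becomes the new upper bound.
            best, hi, fails = candidate, radius, 0
            radius = 0.5 * (lo + hi)
        else:
            fails += 1
            if best is not None and fails >= patience:
                # Too many misses: treat this radius as too small.
                lo, fails = radius, 0
                radius = 0.5 * (lo + hi)
    return best

def normal_sampler(shape, rng):
    return rng.standard_normal(shape)

# Toy stand-in for a decision-based model: "adversarial" if x[0] exceeds 1.
def toy_is_adversarial(x):
    return x.ravel()[0] > 1.0

rng = np.random.default_rng(0)
x_orig = np.zeros(10)
x_adv = random_guessing_attack(toy_is_adversarial, x_orig, normal_sampler, rng)
```

Swapping `normal_sampler` for a function that returns a low-frequency pattern of the same shape gives the biased variant of the experiment.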

IV-B Biased Boundary Attack




Fig. 3: Comparison of Boundary Attack and Biased Boundary Attack after 1000 queries. While originally a "European fire salamander", all three images are classified as "sulphur butterfly". (a) Boundary Attack. (b) BBA with Perlin bias. (c) BBA with Perlin and projected gradient biases.

Next, we evaluate the Biased Boundary Attack in our intended setting, the targeted attack track in the AVC. To provide a point of reference, we first implement the original Boundary Attack, and initialize it with known images from the target class. Specifically, we use the training set of Tiny ImageNet, and from all images of the target class we pick the one with the lowest distance to the original. The Boundary Attack works, but is too slow for our setting. Compare Figure 3a, where the starting point (a butterfly) is still clearly visible after 1000 iterations.

We then implement the Biased Boundary Attack by adding our first prior, low-frequency noise (see Figure 3b). As before, we simply replace the distribution from which the attack samples the orthogonal step with Perlin patterns. As shown in Table II, this alone decreases the median distance by 25%.

Finally, we add projected gradients from a surrogate model and set the bias strength to 0.5. This further reduces the median distance by another 37%, or a total of 53% when compared with the original Boundary Attack. 1000 iterations are more than enough to make the butterfly almost invisible to the human eye (Figure 3c).

Queries   Attack            Bias          Median L2 distance
1000      Boundary Attack   -             20.2
1000      BBA (ours)        Perlin        15.1
1000      BBA (ours)        Perlin + PG   9.5
TABLE II: The Biased Boundary Attack (targeted)
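The percentages quoted for Table II follow directly from the median distances. A short check, with `relative_reduction` as an illustrative helper name:

```python
def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

boundary, perlin, perlin_pg = 20.2, 15.1, 9.5

print(round(relative_reduction(boundary, perlin)))     # 25
print(round(relative_reduction(perlin, perlin_pg)))    # 37
print(round(relative_reduction(boundary, perlin_pg)))  # 53
```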

In our submission to the AVC, we used an ensemble of ResNet18 and ResNet50, both of which are public baselines provided by the competition organizers. This ensemble is reasonably strong, but not state-of-the-art (the ResNet50 baseline was trained with Adversarial Logit Pairing, which has since been shown to be ineffective [17, 18]). In fact, most winning submissions to the AVC used much larger ensembles of carefully-trained models [19]. This reinforces the claim that our method outperforms most transfer attacks even when using simple surrogate models.

V Conclusion

We have shown that sampling-based black-box attacks can be greatly sped up by biasing their sampling procedure, up to the point where they outperform even the most sophisticated transfer attacks. Our result in the NeurIPS 2018 Adversarial Vision Challenge is testament to this: We achieved second place in the targeted attack track, even though our surrogate models were much simpler than those of other participants. Indeed, we expect that incorporating their models would further increase sample efficiency, and that combining both approaches would produce an even stronger attack.

This is not the end of the story: we have discussed only two priors for biased sampling, but there is much more domain knowledge that has not yet found its way into adversarial attacks. Other perturbation patterns, spatial transformations, adversarial blending strategies, or even intuitions about semantic features of the target class could all be integrated in a similar fashion.

With the Biased Boundary Attack, we have outlined a basic framework into which virtually any source of knowledge can be incorporated. Our current implementation crafts convincing results after a few hundred iterations, and the threat of black-box adversarial examples becomes more realistic than ever before.