Black-box Adversarial Attacks with Bayesian Optimization

We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using only loss function evaluations of input-output pairs. We use Bayesian optimization (BO) to develop query-efficient adversarial attacks tailored to scenarios with low query budgets. We alleviate the difficulty BO has with the high-dimensional inputs of deep learning models through effective dimension upsampling techniques. Our proposed approach achieves performance comparable to state-of-the-art black-box adversarial attacks, albeit with a much lower average query count. In particular, in low query budget regimes, our proposed method reduces the query count by up to 80% with respect to state-of-the-art methods.




1 Introduction

Neural networks are now well known to be vulnerable to adversarial examples: additive perturbations that, when applied to the input, change the network's output classification [9]. Work investigating this lack of robustness to adversarial examples often takes the form of a back-and-forth between newly proposed adversarial attacks, methods for quickly and efficiently crafting adversarial examples, and corresponding defenses that modify the classifier at either training or test time to improve robustness. The most successful adversarial attacks use gradient-based optimization methods [9, 17], which require complete knowledge of the architecture and parameters of the target network; this assumption is referred to as the white-box attack setting. Conversely, the more realistic black-box setting requires an attacker to find an adversarial perturbation without such knowledge: information about the network can be obtained only through querying the target network, i.e., supplying an input to the network and receiving the corresponding output.

In real-world scenarios, it is extremely improbable for an attacker to have unlimited bandwidth to query a target classifier. In the evaluation of black-box attacks, this constraint is usually formalized via the introduction of a query budget: a maximum number of queries allowed to the model per input, after which an attack is considered to be unsuccessful. Several recent papers have proposed attacks specifically to operate in this query-limited context [12, 11, 4, 28, 18]; nevertheless, these papers typically consider query budgets on the order of 10,000 or 100,000. This leaves open the question of whether black-box attacks can successfully attack a deep-network-based classifier in severely query-limited settings, e.g., with a query budget of 100-200. In such a query-limited regime, it is natural for an attacker to use the entire query budget, so we ask the pertinent question: in a constrained, query-limited setting, can one design query-efficient yet successful black-box adversarial attacks?

This work proposes a black-box attack method grounded in Bayesian optimization [13, 8], which has recently emerged as a state of the art black-box optimization technique in settings where minimizing the number of queries is of paramount importance. Straightforward application of Bayesian optimization to the problem of finding adversarial examples is not feasible: the input dimension of even a small neural network-based image classifier is orders of magnitude larger than the standard use case for Bayesian optimization. Rather, we show that we can bridge this gap by performing Bayesian optimization in a reduced-dimension setting and upsampling to obtain our final perturbation. We explore several upsampling techniques and find that a relatively simple nearest-neighbor upsampling method allows us to sufficiently reduce the optimization problem dimension such that Bayesian optimization can find adversarial perturbations with more success than existing black-box attacks in query-constrained settings.

We evaluate the efficacy of our adversarial attack with a set of experiments attacking three of the most commonly used pretrained ImageNet [7] classifiers: ResNet50 [10], Inception-v3 [27], and VGG16-bn [24]. Results from these experiments show that with very small query budgets (under 200 queries), the proposed method, Bayes-Attack, achieves success rates comparable to or exceeding those of existing methods, and does so with far smaller average and median query counts. Further experiments are performed on the MNIST dataset to compare how various upsampling techniques affect the success rate of our method. Given these results, we argue that, despite being a simple approach (indeed, largely because it is such a simple and standard approach for black-box optimization), Bayesian optimization should be a standard baseline for any black-box adversarial attack task in the future, especially in the small query budget regime.

2 Related Work

Within the black-box setting, adversarial attacks can be further categorized by the exact nature of the information received from a query. The work most closely related to our approach is on score-based attacks, where queries to the network return the entire output layer of the network, either as logits or probabilities. Within this category, existing approaches draw from a variety of optimization fields and techniques. One popular approach in this area is to attack with zeroth-order methods via some method of derivative-free gradient estimation, as in Ilyas et al. [12], which uses time-dependent and data-dependent priors to improve the estimate, and Ilyas et al. [11], which replaces the true gradient with a direction estimated using natural evolution strategies (NES). Other methods search for the best perturbation outside of this paradigm; Moon et al. [18] cast the problem of finding an adversarial perturbation as a discrete optimization problem and solve it with local search methods. These works all search for adversarial perturbations within a search space with a hard constraint on perturbation size; other work [4, 28] incorporates a soft version of this constraint and performs coordinate descent to decrease the perturbation size while keeping the perturbed image misclassified. The latter of these methods incorporates an autoencoder-based upsampling method with which we compare in Section 5.3.1.


One may instead assume that only part of the information from the network’s output layer is received as the result of a query. This can take the form of only receiving the output of the top predicted classes [11], but more often the restrictive decision-based setting is considered. Here, queries yield only the predicted class, with no probability information. The most successful work in this area is in [5], which reformulates the problem as a search for the direction of the nearest decision boundary and solves using a random gradient-free method, and in [1] and [3], both of which use random walks along the decision boundary to perform an attack. The latter work significantly improves over the former with respect to query efficiency, but the number of queries required to produce adversarial examples with small perturbations in this setting remains in the tens of thousands.

A separate class of transfer-based attacks train a second, fully-observable substitute network, attack this network with white-box methods, and transfer these attacks to the original target network. These may fall into one of the preceding categories or exist outside of the distinction: in Papernot et al. [20], the substitute model is built with score-based queries to the target network, whereas Liu et al. [16] trains an ensemble of models without directly querying the network at all. These methods come with their own drawbacks: they require training a substitute model, which may be costly or time-consuming, and overall attack success tends to be lower than that of gradient-based methods.

Finally, there has been some recent interest in leveraging Bayesian optimization for constructing adversarial perturbations. Bayesian optimization has played a supporting role in several methods, including Tu et al. [28], where it is used to solve one step of an alternating direction method of multipliers (ADMM) approach, and in [6], which uses it to search within a set of procedural noise perturbations. On the other hand, prior work in which Bayesian optimization plays a central role performs experiments only on relatively low-dimensional problems, highlighting the main challenge of its application: Suya et al. [26] examine an attack on a spam email classifier with 57 input features, and Munoz-González [19] attacks image classifiers but notably does not scale beyond MNIST classifiers. In contrast to these past works, the main contribution of this paper is to show that Bayesian optimization, when combined with upsampling procedures, presents a scalable, query-efficient approach for large-scale black-box adversarial attacks.

3 Problem Formulation

The following notation and definitions will be used throughout the remainder of the paper. Let $F$ be the target neural network. We assume that $F$ is a $K$-class image classifier that takes normalized inputs: each dimension of an input $x \in \mathbb{R}^d$ represents a single pixel and is bounded between $0$ and $1$, $y \in \{1, \dots, K\}$ denotes the original label, and the corresponding output $F(x)$ is a $K$-dimensional vector representing a probability distribution over classes.

Rigorous evaluation of an adversarial attack requires careful definition of a threat model: a set of formal assumptions about the goals, knowledge, and capabilities of an attacker [2]. We assume that, given a correctly classified input image $x$, the goal of the attacker is to find a perturbation $\delta$ such that $x + \delta$ is misclassified, i.e., $\arg\max_k F(x + \delta)_k \neq y$. We operate in the score-based black-box setting, where we have no knowledge of the internal workings of the network, and a query to the network yields the entire corresponding $K$-dimensional output. To enforce the notion that the adversarial perturbation should be small, we take the common approach of requiring that $\|\delta\|_p$ be smaller than a given threshold $\epsilon$ in some $\ell_p$ norm, where $\epsilon$ varies depending on the classifier. This work considers the $\ell_\infty$ norm, but our attack can easily be adapted to other norms. Finally, we denote the query budget by $T$; if an adversarial example is not found after $T$ queries to the target network, the attack fails.

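As in most work, we pose the attack as a constrained optimization problem. We use an objective function suggested by Carlini and Wagner [2] and used in Tu et al. [28], Chen et al. [4]:

$$\max_{\delta} \; f(x, y, \delta) \quad \text{subject to} \quad \|\delta\|_\infty \le \epsilon \;\; \text{and} \;\; (x + \delta) \in [0, 1]^d, \tag{1}$$

where

$$f(x, y, \delta) = \max_{k \neq y} \log [F(x + \delta)]_k - \log [F(x + \delta)]_y.$$

Most importantly, the input $x + \delta$ to $F$ is an adversarial example for $x$ if and only if $f(x, y, \delta) > 0$.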
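The sign test above can be illustrated with a minimal sketch. It assumes the objective takes the Carlini-Wagner-style form $\max_{k \neq y} \log F(x+\delta)_k - \log F(x+\delta)_y$ computed from the class probabilities returned by one query; the function name `attack_objective` and the toy probability vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def attack_objective(probs: np.ndarray, true_label: int) -> float:
    """Return max_{k != y} log p_k - log p_y for one probability vector."""
    log_p = np.log(np.clip(probs, 1e-12, 1.0))   # clip to avoid log(0)
    other = np.delete(log_p, true_label)          # log-probabilities of all wrong classes
    return float(other.max() - log_p[true_label])

# Toy 3-class outputs (made-up numbers, not produced by any real model).
still_correct = np.array([0.7, 0.2, 0.1])   # class 0 is still the top prediction
misclassified = np.array([0.3, 0.6, 0.1])   # class 1 has overtaken class 0

print(attack_objective(still_correct, 0) > 0)   # False: not adversarial
print(attack_objective(misclassified, 0) > 0)   # True: adversarial
```

The value is positive exactly when some wrong class receives more probability mass than the true class, which is why a single scalar suffices as both the optimization target and the stopping test.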

We briefly note that the above threat model and objective function were chosen for simplicity and for ease of direct comparison with other black-box attacks, but the attack method we propose is compatible with many other threat models. For example, we may change the goals of the attacker or measure $\delta$ in the $\ell_1$ or $\ell_2$ norm instead of $\ell_\infty$, with appropriate modifications to the objective function and constraints in equation 1.

4 Model Framework

In this section, we present the proposed black-box attack method. We begin with a brief description of Bayesian optimization [13] followed by its application to generate black-box adversarial examples. Finally, we describe our method for attacking a classifier trained with high-dimensional inputs (e.g. ImageNet) in a query-efficient manner.

4.1 Bayesian Optimization

Bayesian optimization (BO) is a method for black-box optimization particularly suited to problems with low dimension and expensive queries. Bayesian optimization consists of two main components: a Bayesian statistical model and an acquisition function. The Bayesian statistical model, also referred to as the surrogate model, is used for approximating the objective function: it provides a Bayesian posterior probability distribution that describes potential values for the objective function at any candidate point. This posterior distribution is updated each time we query the objective function at a new point. The most common surrogate models for Bayesian optimization are Gaussian processes (GPs) [21], which define a prior over functions that is cheap to evaluate and is updated as new information from queries becomes available. We model the objective function $h$ using a GP with prior distribution $\mathcal{N}(\mu_0, \Sigma_0)$ with constant mean function $\mu_0$ and Matérn kernel [23, 25] as the covariance function $\Sigma_0$, which is defined as:

$$\Sigma_0(x, x') = \theta_0^2 \exp\left(-\sqrt{5}\, r\right) \left(1 + \sqrt{5}\, r + \tfrac{5}{3} r^2\right), \qquad r^2 = \sum_{i=1}^{d'} \frac{(x_i - x'_i)^2}{\theta_i^2},$$

where $d'$ is the dimension of the input and $\theta_0$ and $\theta_i$ are hyperparameters. We select hyperparameters that maximize the posterior of the observations under a prior [23, 8].
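As a concrete illustration of the covariance function, the following is a minimal sketch of a Matérn kernel evaluation. It assumes the smoothness parameter is 5/2 and that each input dimension has its own length scale; both are assumptions made for illustration, and the names `matern52`, `outputscale` and `lengthscales` are hypothetical.

```python
import numpy as np

def matern52(x1, x2, outputscale: float, lengthscales) -> float:
    """Matern 5/2 covariance between two points, with one length scale per dimension."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    ls = np.asarray(lengthscales, dtype=float)
    r = np.sqrt(np.sum(((x1 - x2) / ls) ** 2))   # scaled Euclidean distance
    s = np.sqrt(5.0) * r
    # (1 + sqrt(5) r + 5/3 r^2) exp(-sqrt(5) r), written in terms of s = sqrt(5) r
    return float(outputscale ** 2 * (1.0 + s + s ** 2 / 3.0) * np.exp(-s))

ls = np.array([1.0, 2.0])
print(matern52([0.0, 0.0], [0.0, 0.0], outputscale=1.5, lengthscales=ls))  # variance at zero distance
print(matern52([0.0, 0.0], [1.0, 0.0], outputscale=1.5, lengthscales=ls))  # smaller at larger distance
```

The covariance equals the squared output scale at zero distance and decays smoothly as the scaled distance grows, so nearby low-dimensional perturbations are modeled as having similar objective values.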

The second component, the acquisition function $\mathcal{A}$, assigns a value to each point that represents the utility of querying the model at this point given the surrogate model. We sample the objective function $h$ at $x_{n+1} = \arg\max_x \mathcal{A}(x \mid \mathcal{D}_{1:n})$, where $\mathcal{D}_{1:n}$ comprises the $n$ samples drawn from $h$ so far. Although this itself may be a hard (non-convex) optimization problem, in practice we use a standard approach and approximately optimize this objective using the L-BFGS algorithm. There are several popular choices of acquisition function; we use expected improvement (EI) [13], which is defined as

$$\mathrm{EI}_n(x) = \mathbb{E}_n\left[\max\left(h(x) - h_n^*, \, 0\right)\right],$$

where $\mathbb{E}_n$ denotes the expectation taken over the posterior distribution given evaluations of $h$ at $x_1, \dots, x_n$, and $h_n^*$ is the best value observed so far.
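When the posterior at a candidate point is Gaussian, as it is for a GP surrogate, expected improvement has a well-known closed form. The sketch below evaluates it for a maximization problem from a posterior mean and standard deviation; the function name and the example numbers are hypothetical, and the code is an illustration rather than the implementation used in the experiments.

```python
import math

def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form E[max(h - best, 0)] for h ~ Normal(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (mu - best) * cdf + sigma * pdf

# Two candidates with the same posterior mean but different uncertainty.
print(expected_improvement(mu=0.0, sigma=1.0, best=1.0))
print(expected_improvement(mu=0.0, sigma=2.0, best=1.0))   # larger: more uncertainty, more upside
```

The two terms show the usual trade-off: the first rewards candidates whose predicted value already exceeds the incumbent, and the second rewards candidates where the surrogate is uncertain.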

The Bayesian optimization framework, shown in Algorithm 2, runs these two steps iteratively for the given budget of function evaluations. It updates the posterior probability distribution on the objective function using all the available data. Then, it finds the next sampling point by optimizing the acquisition function over the current posterior distribution of the GP. The objective function is evaluated at this chosen point and the whole process repeats.

In theory, we may apply Bayesian optimization directly to the optimization problem in equation 1 to obtain an adversarial example, stopping once we find a point where the objective function rises above $0$. In practice, Bayesian optimization's speed and overall performance fall dramatically as the input dimension of the problem increases. This makes running Bayesian optimization over high-dimensional inputs such as ImageNet images (input dimension on the order of $10^5$) practically infeasible; we therefore require a method for reducing the dimension of this optimization problem.

4.2 Bayes-Attack: Generating Adversarial Examples using Bayesian Optimization

Images tend to exhibit spatial local similarity, i.e., pixels that are close to each other tend to be similar. Ilyas et al. [12] showed that this similarity also extends to gradients and used this to reduce query complexity. Our method uses this data-dependent prior to reduce the search dimension of the perturbation. We show that adversarial perturbations also exhibit spatial local similarity, so we do not need to learn the adversarial perturbation at the full dimension of the image. Instead, we learn the perturbation in a much lower dimension. We obtain our final adversarial perturbation by interpolating the learned, low-dimensional perturbation to the original input dimension.

We define the objective function for running the Bayesian optimization in low dimension in Algorithm 1. We let $\Pi_{B(0,\epsilon)}$ be the projection onto the $\ell_\infty$ ball of radius $\epsilon$ centered at the origin. Our method finds a low-dimensional perturbation $\delta'$ and upsamples it to obtain the adversarial perturbation $\delta$. Since the upsampled perturbation may not lie inside the ball of radius $\epsilon$ centered at the origin, we project back to ensure that $\|\delta\|_\infty$ remains bounded by $\epsilon$. With the perturbation $\delta$ in hand, we compute the objective function of the original optimization problem defined in equation 1.

We describe our complete framework in Algorithm 2, where $x_0$ and $y_0$ denote the original input image and label respectively. The goal is to learn an adversarial perturbation $\delta \in \mathbb{R}^{d'}$ in a much lower dimension, i.e., $d' \ll d$. We begin with a small dataset $\mathcal{D} = \{(\delta_1, v_1), \dots, (\delta_{n_0}, v_{n_0})\}$, where each $\delta_n$ is a $d'$-dimensional vector sampled from a given distribution and $v_n$ is the function evaluation at $\delta_n$, i.e., $v_n = \text{Obj-Func}(x_0, y_0, \delta_n)$. We iteratively update the posterior distribution of the GP using all available data and query new perturbations obtained by maximizing the acquisition function over the current posterior distribution of the GP until we find an adversarial perturbation or run out of query budget. The Bayesian optimization iterations run in low dimension, but to query the model we upsample, project, and then add the perturbation to the original image, as shown in Algorithm 1, so that the perturbed image conforms to the input space of the model. To generate a successful adversarial perturbation, it is necessary and sufficient to have $v_t > 0$, as described in Section 3. We call our attack successful with $t$ queries to the model if the Bayesian optimization loop exits after $t$ iterations (line 12 in Algorithm 2); otherwise it is unsuccessful. Finally, we note that the final adversarial image can be obtained by upsampling the learned perturbation and adding it to the original image, as shown in Figure 1.

In this work, we focus on $\ell_\infty$-norm perturbations, where the projection is defined as:

$$\Pi_{B(0,\epsilon)}(\delta) = \min\{\max\{\delta, -\epsilon\}, \epsilon\},$$

applied elementwise, where $\epsilon$ is the given perturbation bound. The upsampling method can be linear or non-linear. In this work, we conduct experiments using nearest-neighbor upsampling. A variational autoencoder [14] or vanilla autoencoder could also be trained to map the low-dimensional perturbation to the original input space. We compare these different upsampling schemes in Section 5.3.1. The initial dataset $\mathcal{D}$ used to form a prior can be chosen using a standard normal distribution, a uniform distribution, or even in a deterministic manner (e.g., with Sobol sequences).
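The upsample-then-project step can be sketched in a few lines of NumPy. This is a minimal illustration that assumes channel-first images with values in [0, 1], integer upsampling factors, and a final clip to the valid pixel range; the function names and the toy shapes are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def upsample_nearest(delta_low: np.ndarray, out_hw) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, h, w) array to (C, H, W) by block repetition."""
    _, h, w = delta_low.shape
    H, W = out_hw
    if H % h != 0 or W % w != 0:
        raise ValueError("this sketch only supports integer upsampling factors")
    return delta_low.repeat(H // h, axis=1).repeat(W // w, axis=2)

def project_linf(delta: np.ndarray, eps: float) -> np.ndarray:
    """Elementwise projection onto the l-infinity ball of radius eps around the origin."""
    return np.clip(delta, -eps, eps)

def perturb(x: np.ndarray, delta_low: np.ndarray, eps: float) -> np.ndarray:
    """Build the image that would be sent to the classifier."""
    delta = project_linf(upsample_nearest(delta_low, x.shape[1:]), eps)
    # A latent shape of (1, h, w) broadcasts the same perturbation to every channel.
    return np.clip(x + delta, 0.0, 1.0)   # assumption: keep pixels in the valid range

x = np.full((3, 4, 4), 0.5)                      # toy 3-channel 4x4 image
delta_low = np.array([[[1.0, -1.0],
                       [0.02, 0.0]]])            # latent shape (1, 2, 2)
x_adv = perturb(x, delta_low, eps=0.05)
print(x_adv.shape, float(np.abs(x_adv - x).max()))
```

Because nearest-neighbor upsampling only copies values, the search over a 2 x 2 latent grid here controls a 4 x 4 image with four free parameters instead of sixteen, which is the dimension reduction the method relies on.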

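1: procedure Obj-Func($x_0, y_0, \delta'$)
2:     // $\delta'$ is the given perturbation
3:     $\delta \leftarrow$ Upsample($\delta'$)    ▷ Upsampling low dimension perturbation to input dimension
4:     $\delta \leftarrow \Pi_{B(0,\epsilon)}(\delta)$    ▷ Projecting perturbation so that $x_0 + \delta$ lies in the $\ell_\infty$-ball around $x_0$
5:     $v \leftarrow f(x_0, y_0, \delta)$    ▷ Querying the model
6:     return $v$
Algorithm 1 Objective Function

1: procedure Bayes-Attack($x_0, y_0$)
2:     $\mathcal{D} = \{(\delta_1, v_1), \dots, (\delta_{n_0}, v_{n_0})\}$    ▷ Querying $n_0$ randomly chosen points
3:     Update the GP on $\mathcal{D}$    ▷ Updating posterior distribution using available points
4:     $t \leftarrow n_0$    ▷ Updating number of queries till now
5:     while $t < T$ do
6:         $\delta_t \leftarrow \arg\max_{\delta} \mathcal{A}(\delta \mid \mathcal{D})$    ▷ Optimizing the acquisition function over the GP
7:         $v_t \leftarrow$ Obj-Func($x_0, y_0, \delta_t$)    ▷ Querying the model
8:         $t \leftarrow t + 1$
9:         if $v_t \le 0$ then
10:             $\mathcal{D} \leftarrow \mathcal{D} \cup \{(\delta_t, v_t)\}$ and update the GP    ▷ Updating posterior distribution
11:         else
12:             return $\delta_t$    ▷ Adversarial attack successful
13:     return $\delta_t$    ▷ Adversarial attack unsuccessful
Algorithm 2 Adversarial Attack using Bayesian Optimization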
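The control flow of Algorithm 2 can be sketched independently of any particular GP library. In the sketch below, fitting the surrogate and maximizing the acquisition function are abstracted into a hypothetical `propose` callable, and the demonstration uses a greedy stand-in proposer on a one-dimensional toy objective; neither is Bayesian optimization itself, and all names are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Sequence[float]
Data = List[Tuple[Point, float]]

def bayes_attack_loop(
    obj_func: Callable[[Point], float],
    propose: Callable[[Data], Point],
    init_points: Sequence[Point],
    budget: int,
) -> Tuple[Optional[Point], int]:
    """Query until the objective is positive or the query budget is exhausted."""
    data: Data = [(p, obj_func(p)) for p in init_points]   # initial random queries
    t = len(data)                                          # queries used so far
    while t < budget:
        delta = propose(data)      # stand-in for "update GP, maximize acquisition"
        v = obj_func(delta)        # one query to the target model
        t += 1
        if v > 0:
            return delta, t        # adversarial perturbation found
        data.append((delta, v))
    return None, t                 # budget exhausted without success

# Toy setup: the "attack" succeeds once the first coordinate exceeds 0.5.
def toy_objective(p: Point) -> float:
    return p[0] - 0.5

def greedy_proposer(data: Data) -> Point:
    best_point, _ = max(data, key=lambda pair: pair[1])
    return [best_point[0] + 0.2]   # hypothetical rule: step past the best point so far

print(bayes_attack_loop(toy_objective, greedy_proposer, [[0.0]], budget=10))
print(bayes_attack_loop(toy_objective, greedy_proposer, [[0.0]], budget=3))
```

Counting the initial points against the budget mirrors line 4 of the algorithm, so the returned query count is directly comparable across methods that use different initialization sizes.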
Figure 1: An illustration of a black-box adversarial attack performed by the proposed method, Bayes-Attack, on ResNet50 trained on ImageNet. Images from the left: the first shows the learnt perturbation in the low-dimensional space; the second is the final adversarial perturbation obtained by nearest-neighbor upsampling; the third is the original image (note that the input size for ResNet50 is $3 \times 224 \times 224$), which is initially classified as white/arctic wolf; the last is the final adversarial image obtained by adding the adversarial perturbation to the original image. ResNet50 classifies the final adversarial image as shower curtain with high probability.

5 Experiments

Our experiments focus on the untargeted attack setting, where the goal is to perturb an image that the classification model originally classifies correctly so that it is misclassified. We primarily consider the performance of Bayes-Attack on ImageNet classifiers and compare it to other black-box attacks in terms of success rate over a given query budget. We also perform ablation studies on the MNIST dataset [15] by examining different upsampling techniques and varying the latent dimension of the optimization problem.

We define success rate as the ratio of the number of images successfully perturbed for a given query budget to the total number of input images. In all experiments, images that are already misclassified by the target network are excluded from the test set; only images that are initially classified with the correct label are attacked. For each method of attack and each target network, we compute the average and median number of queries used to attack among images that were successfully perturbed.
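These metrics can be computed as in the following minimal sketch. It assumes each attacked image is summarized by the number of queries it needed, with `None` marking a failed attack; the function name and the sample numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, median
from typing import List, Optional, Tuple

def summarize_attack(
    queries_used: List[Optional[int]], budget: int
) -> Tuple[float, Optional[float], Optional[float]]:
    """Return (success rate, average queries, median queries) for a given budget.

    The rate is taken over all attacked images; the average and median are
    taken only over images that were successfully perturbed within the budget.
    """
    successes = [q for q in queries_used if q is not None and q <= budget]
    rate = len(successes) / len(queries_used)
    if not successes:
        return rate, None, None
    return rate, mean(successes), median(successes)

# Made-up per-image query counts for five correctly classified images.
counts = [10, 50, None, 150, 300]
print(summarize_attack(counts, budget=200))
```

Restricting the average and median to successful attacks means the query statistics should be read together with the success rate, since a method can show a low average simply by succeeding only on easy images.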

5.1 Empirical Protocols

We treat the latent dimension used for running the Bayesian optimization loop as a hyperparameter. For MNIST, we tune the latent dimension over a set of values that includes the original MNIST input dimension. For ImageNet, we search over both the latent dimension and the latent shape. Latent shapes whose first dimension is 1 indicate that the same perturbation is added to all three channels, while those whose first dimension is 3 indicate that the perturbations differ across channels. We found that for ResNet50 and VGG16-bn, using different perturbations across channels works much better than adding the same perturbation to every channel, while for Inception-v3 both seem to work equally well.

We initialize the GP with a small number of samples drawn from a standard normal distribution. For all the experiments in the next section, we use expected improvement as the acquisition function. We also examined other acquisition functions (posterior mean, probability of improvement, upper confidence bound) and observed that our method works equally well with them. We tune the hyperparameters independently on a small validation set, which we exclude from our final test set. We used the BoTorch package for implementation.

5.2 Experiments on ImageNet

We compare the performance of the proposed method, Bayes-Attack, against NES [11], Bandits-td [12] and Parsimonious [18], the last of which is the current state of the art among score-based black-box attacks within the $\ell_\infty$ threat model. On ImageNet, we attack pretrained ResNet50 [10], Inception-v3 [27] and VGG16-bn [24] models. We use 10,000 randomly selected images (normalized to [0, 1]) from the ImageNet validation set that were initially correctly classified.

We fix the $\ell_\infty$ perturbation bound $\epsilon$ and evaluate the performance of all the methods for low query budgets. We use the implementation and hyperparameters provided by Ilyas et al. [12] for NES and Bandits-td. Similarly, for Parsimonious we use the implementation and hyperparameters given by Moon et al. [18].

Figure 2 compares the performance of the proposed method, Bayes-Attack, against the set of baseline methods in terms of success rate at different query budgets. We can see that Bayes-Attack consistently performs better than the baseline methods at the lowest query budgets. Even at moderately larger budgets, Bayes-Attack achieves better success rates than Bandits-td and NES on ResNet50 and VGG16-bn. Finally, we note that at the highest query budgets considered, both Parsimonious and Bandits-td perform better than Bayes-Attack.

Figure 2: Performance comparison for untargeted attacks on ImageNet classifiers: (a) ResNet50, (b) Inception-v3, (c) VGG16-bn. Bayes-Attack consistently performs better for low query budgets. Note that for NES, model queries are performed in batches of 100 as specified in Ilyas et al. [11].
Classifier      Method   Success Rate   Average Query   Median Query
ResNet50        NES
Inception-v3    NES
VGG16-bn        NES
Table 1: Results for untargeted attacks on ImageNet classifiers at a fixed low query budget.

To compare the success rate and average/median query counts, we select a point on the plots shown in Figure 2. Table 1 compares the performance of all the methods in terms of success rate and average and median query counts at that query budget. We can see that Bayes-Attack achieves a higher success rate with fewer queries on average than the next best method, Parsimonious. Thus, we argue that although the Bayesian optimization adversarial attack approach is to some extent a "standard" application of traditional Bayesian optimization methods, its performance relative to the existing state of the art makes it a compelling approach, particularly in the very low query setting.

5.3 Experiments on MNIST

For MNIST, we use the pretrained network used in Carlini and Wagner [2], which consists of convolutional, max-pooling and fully-connected layers. We conduct untargeted adversarial attacks with a fixed $\ell_\infty$ perturbation bound on randomly sampled images from the test set. All the experiments performed on MNIST follow the same protocols.

5.3.1 Upsampling Methods

The proposed method requires an upsampling technique for mapping the perturbation learnt in the latent dimension to the original input dimension. In this section, we examine different linear and non-linear upsampling schemes and compare their performance on MNIST. The approaches we consider here can be divided into two broad groups: Encoder-Decoder based methods and Interpolation methods. For interpolation-based methods, we consider nearest-neighbor, bilinear and bicubic interpolation.

For encoder-decoder based approaches, we train a variational autoencoder [14, 22] by maximizing a variational lower bound on the log marginal likelihood. We also consider a simple autoencoder trained by minimizing the mean squared loss between the generated image and the original image. For both the approaches, we run the Bayesian optimization loop in latent space and use the pretrained decoder (or generator) for mapping the latent vector into image space. For these approaches, rather than searching for adversarial perturbation in the latent space, we learn the adversarial image directly using the Bayesian optimization.

Figure 3(a) compares the performance of different upsampling methods. We can see that nearest-neighbor (NN) interpolation and the VAE-based decoder perform better than the rest of the upsampling schemes. Moreover, NN interpolation achieves performance similar to the VAE-based method without the need for a large training dataset, which is required to train a VAE-based decoder accurately.

Figure 3: Untargeted attacks on MNIST: (a) performance comparison with different upsampling schemes; (b) performance comparison with different latent dimensions. dim: latent dimension used to run the Bayesian optimization, NN: nearest-neighbor interpolation, BiC: bicubic interpolation, BiL: bilinear interpolation, AE: autoencoder-based decoder, VAE: VAE-based generator (or decoder).

5.3.2 Latent Dimension Sensitivity Analysis

We perform a sensitivity analysis on the latent dimension hyperparameter used for running the Bayesian optimization. We vary the latent dimension over a range of values up to the original input dimension. Figure 3(b) shows the performance of the nearest-neighbor interpolation method for different latent dimensions. We observe that lower latent dimensions achieve better success rates than the original input dimension for MNIST. This could be because, as the search dimension increases, Bayesian optimization needs more queries to find a successful perturbation. We also note that for the smallest latent dimension considered, Bayes-Attack achieves lower success rates, which could mean that it is hard to find adversarial perturbations in such a low dimension.

6 Conclusions

We considered the problem of black-box adversarial attacks in settings involving constrained query budgets. We employed a method based on Bayesian optimization to construct a query-efficient attack strategy. The proposed method searches for an adversarial perturbation in a low-dimensional latent space using Bayesian optimization and then maps the perturbation to the original input space using nearest-neighbor upsampling. We demonstrated the efficacy of our method in attacking multiple deep learning architectures with high-dimensional inputs. Our work opens avenues for applying BO to adversarial attacks in high-dimensional settings.