Spanning Attack: Reinforce Black-box Attacks with Unlabeled Data

05/11/2020 · Lu Wang, et al. · Nanjing University and JD.com, Inc.

Adversarial black-box attacks aim to craft adversarial perturbations by querying input-output pairs of machine learning models. They are widely used to evaluate the robustness of pre-trained models. However, black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space, and therefore incur a false sense of model robustness. In this paper, we relax the conditions of the black-box threat model, and propose a novel technique called the spanning attack. By constraining adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset, the spanning attack significantly improves the query efficiency of black-box attacks. Extensive experiments show that the proposed method works favorably in both soft-label and hard-label black-box attacks. Our code is available at https://github.com/wangwllu/spanning_attack.


1 Introduction

It has been shown that machine learning models, especially deep neural networks, are vulnerable to small adversarial perturbations, i.e., a small carefully crafted perturbation added to the input may significantly change the prediction results 

(Szegedy et al., 2014; Goodfellow et al., 2015; Biggio and Roli, 2018; Fawzi et al., 2018). Therefore, the problem of finding such perturbations, also known as adversarial attacks, has become an important way to evaluate model robustness: the more difficult a given model is to attack, the more robust it is.

Depending on the information an adversary can access, the adversarial attacks can be classified into white-box and black-box settings. In the

white-box setting, the target model is completely exposed to the attacker, and adversarial perturbations can easily be crafted by exploiting first-order information, i.e., gradients with respect to the input (Carlini and Wagner, 2017; Madry et al., 2018). Despite its efficiency and effectiveness, the white-box setting is an overly strong and pessimistic threat model, and white-box attacks are usually not practical against real-world machine learning systems because the gradient information is not visible.

Instead, we focus on the problem of black-box

attacks, where the model structure and parameters (weights) are not available to the attacker. The attacker can only gather information by (iteratively) making input queries to the model and observing the corresponding outputs. The black-box setting is a more realistic threat model and, furthermore, is important in the sense that black-box attacks can serve as a general way to evaluate the robustness of machine learning models beyond neural networks, even when the backpropagation algorithm is not applicable (e.g., evaluating the robustness of tree-based models

(Chen et al., 2019a) and nearest neighbor models (Wang et al., 2019)).

Black-box attacks have been extensively studied in the past few years. Depending on what kind of outputs the attacker could derive, black-box attacks could be broadly grouped into two categories: soft-label attacks and hard-label attacks. Soft-label attacks assume that the attacker has access to real-valued scores (logits or probabilities) for all labels, while hard-label attacks assume that the attacker only has access to the final discrete decision (the predicted label).

However, black-box attacks, especially hard-label black-box attacks, usually require a large number of queries (often thousands) to craft an adversarial perturbation. This query-inefficiency issue restricts the usage of black-box attacks in real applications, since online machine learning systems often set a limit on the number of queries or charge according to the number of queries. More importantly, query inefficiency incurs a false sense of model robustness.

We suspect that this query-inefficiency issue is primarily owing to the high dimensionality of the input space. To verify this conjecture, in this paper we seek to reduce the number of queries by reducing the dimensionality of the search space. To achieve this objective, we relax the conditions of the black-box threat model by assuming that a small auxiliary unlabeled dataset is available to the attacker. The assumption is reasonable: before attacking an image classification model, the attacker just needs to “collect” some unlabeled images. Such images may even be available to the attacker without any explicit collection, because in a normal attack process the attacker is given, in advance, the target images whose predicted labels are expected to change through small perturbations.

With this small unlabeled dataset, we propose a technique named the spanning attack. Specifically, the unlabeled dataset spans a subspace of the input space, and we constrain the search for adversarial perturbations to this subspace so as to reduce the dimensionality of the search space. The method is motivated by the theoretical analysis that minimum adversarial perturbations for a variety of machine learning models prove to lie in the subspace spanned by the training data, so the auxiliary unlabeled dataset serves as a substitute for the original training data. We also show that this method is general enough to apply to a wide range of existing black-box attack methods, including both soft-label and hard-label attacks. Our experiments verify that the spanning attack can significantly improve the query efficiency of black-box attacks. Furthermore, we show that even a very small and biased unlabeled dataset, sampled from a distribution different from that of the training data, suffices to perform favorably in practice.

This paper makes the following contributions:

  1. We present the random attack

    framework, which captures most existing black-box attacks in various settings, including both soft-label attacks and hard-label attacks. It provides a novel and clear interpretation of the mechanism of black-box attacks from the perspective of random vectors.

  2. We propose a method that constrains the resulting adversarial perturbation of any random attack to a predefined subspace. This is a general way to reduce the dimensionality of the search space of black-box attacks.

  3. We provide a preliminary theoretical analysis of the subspace in which the minimum adversarial perturbation must lie. Motivated by this analysis, we propose to reinforce black-box attacks (random attacks) by constraining them to a subspace spanned by a small auxiliary unlabeled dataset. In our experiments across various black-box attacks and target models, the reinforced attacks typically require substantially fewer queries than the baselines while improving success rates at the same time.

The remainder of the paper is organized as follows. Section 2 discusses related work about black-box attacks. Section 3 introduces the basic preliminaries and our motivation. Section 4 presents our framework for black-box attacks and proposes our general method to improve query efficiency. Section 5 reports empirical evaluation results. Section 6 concludes this paper.

2 Related work

Soft-label black-box attacks.

Chen et al. (2017) showed that soft-label black-box attacks can be formulated as solving an optimization problem in the zeroth order scenario, where one can query the function itself but not its gradients. Since then many zeroth order optimization algorithms have been proposed for black-box attacks, such as ZO-Adam (Chen et al., 2017), NES (Ilyas et al., 2018), ZO-SignSGD (Liu et al., 2019), AutoZOOM (Tu et al., 2019), and Bandit-attack (Ilyas et al., 2019).

Hard-label black-box attacks.

Hard-label black-box attacks are more challenging since it is non-trivial to define a smooth objective function based only on hard-label decisions. Brendel et al. (2018) proposed a method based on rejection sampling and random walks. Cheng et al. (2019a) reformulated the attack as a real-valued optimization problem whose objective function is estimated via coarse-grained search followed by binary search. Chen et al. (2019b) proposed an unbiased estimate of the gradient direction at the decision boundary, together with an attack method that has a convergence analysis. Cheng et al. (2020) proposed a query-efficient sign estimator of the gradient.

Improving query efficiency of black-box attacks.

Recently, the idea of relaxing the threat model to improve the query efficiency of black-box attacks has attracted increasing attention. Some works build on the idea of transfer-based attacks (Papernot et al., 2016; Liu et al., 2017): adversarial examples of a surrogate model also tend to fool other models. Brunner et al. (2018) and Cheng et al. (2019b) both assume that a surrogate model is available to the attacker, so the attacker can employ the gradients of the surrogate model as a prior for the true gradients of the target model. Another work (Yan et al., 2019) proposed a soft-label black-box attack method which employs an auxiliary labeled dataset: multiple reference models are trained on the labeled dataset, a subspace is spanned by perturbed gradients of these reference models, and the true gradients of the target model are then estimated in this subspace. The major difference from our work is that their auxiliary dataset has to be labeled, whereas ours is unlabeled. Moreover, their auxiliary dataset is much larger than ours owing to the need for training reference models: in the ImageNet case, we only need fewer than 1,000 unlabeled instances, while Yan et al. (2019) require 75,000 labeled instances. Finally, our method is more general, and can be applied to both soft-label and hard-label black-box attacks.

3 Background and motivation

First, we introduce the notation for black-box adversarial attacks. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the input space, where $d$ is the dimension, and let $\mathcal{Y} = \{1, 2, \dots, K\}$ denote the output space, where $K$ is the number of labels. The function $f: \mathcal{X} \to \mathcal{Y}$ is a classifier, which is the target model to attack, and it makes decisions by
$$f(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; g_y(x),$$
where $g: \mathcal{X} \to \mathbb{R}^K$ is the score function of the classifier, which outputs scores of all labels for a given instance.

Untargeted adversarial attacks.

Given a radius $\epsilon > 0$ and a correctly-classified labeled instance $(x, y)$, an untargeted attack aims to find an adversarial perturbation $\delta$ with norm $\|\delta\| \le \epsilon$ such that the classifier predicts a different label for the perturbed instance $x + \delta$ than for the original instance $x$, i.e., $f(x + \delta) \neq y$.

Targeted adversarial attacks.

Given a radius $\epsilon > 0$, a correctly-classified labeled instance $(x, y)$, and a target label $t \neq y$, a targeted attack aims to find an adversarial perturbation $\delta$ with norm $\|\delta\| \le \epsilon$ such that the classifier predicts the label $t$ for the perturbed instance $x + \delta$, i.e., $f(x + \delta) = t$.

This paper focuses on untargeted attacks, while the extension to targeted attacks is straightforward. Besides, we focus on the $\ell_2$ norm (Euclidean norm), i.e., the magnitude of adversarial perturbations is measured by the $\ell_2$ norm; research on general $\ell_p$ norms is left for future work.

Soft-label black-box attacks.

The attacker has access to the score (logit or probability) output $g(x)$ for any input $x \in \mathcal{X}$. Therefore, any loss defined on the pair of the score $g(x)$ and the ground-truth label $y$ is also available to the attacker. We denote the loss function by $\ell(g(x), y)$.

Hard-label black-box attacks.

The attacker only has access to the final decision (the predicted label) $f(x)$ for any input $x \in \mathcal{X}$. This setting is more challenging than the soft-label setting because the attacker has access to less information.

Soft-label attacks query $g$, while hard-label attacks query $f$. The number of queries is the cost of a black-box attack, and reducing the number of queries required is an important issue when deploying attack methods in real applications.

Criterion for black-box attacks.

Given a query budget $B$, a radius $\epsilon$, and a correctly-classified labeled instance $(x, y)$, an untargeted black-box attack aims to find an adversarial perturbation $\delta$ with norm $\|\delta\| \le \epsilon$ such that $f(x + \delta) \neq y$, by querying $g$ (in the soft-label case) or $f$ (in the hard-label case) at most $B$ times. If the attack manages to find such an adversarial perturbation within the query budget of $B$ queries, we say the attack is successful; otherwise it fails. Therefore, we have two criteria for a black-box attack: (i) whether it is successful and (ii) the number of queries it executes.
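To make this bookkeeping concrete, the following minimal Python sketch wraps a black-box model with a query counter and applies the two criteria; the class, function names, and attack interface are hypothetical illustrations rather than the authors' released code.

```python
import numpy as np

class QueryCounter:
    """Wrap a black-box model and count how many times it is queried."""

    def __init__(self, model):
        self.model = model        # callable: x -> predicted label (or score vector)
        self.num_queries = 0

    def __call__(self, x):
        self.num_queries += 1
        return self.model(x)


def evaluate_attack(attack, model, x, y, radius, budget):
    """Run `attack` against a query-counted model and apply the two criteria."""
    counted = QueryCounter(model)
    delta = attack(counted, x, y, budget)      # the attack may return None on failure
    success = (
        delta is not None
        and counted.num_queries <= budget      # (a) query budget respected
        and np.linalg.norm(delta) <= radius    # (b) perturbation inside the radius
        and model(x + delta) != y              # (c) prediction changed (this check is not counted)
    )
    return success, counted.num_queries
```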

In practice, the input space $\mathcal{X}$ is usually high-dimensional: for instance, the typical input image for an ImageNet model has $224 \times 224 \times 3 \approx 1.5 \times 10^5$ dimensions. It is suspected that the large number of queries needed when searching for an adversarial perturbation in $\mathcal{X}$ is probably owing to the high dimensionality of $\mathcal{X}$. To verify our conjecture, a natural question for black-box attacks is as follows:

“Is it possible to reduce the number of queries

by reducing the dimensionality of the search space?”

In this paper, we provide a positive answer to this question by proposing a method that reinforces black-box attacks with a small set of unlabeled data.

4 Proposed method

We first introduce the concept of the subspace attack, and then propose a method which utilizes an auxiliary unlabeled dataset to locate an appropriate subspace, namely the spanning attack.

4.1 Subspace attack

Definition 1 (subspace attack).

A subspace attack is an adversarial attack which returns adversarial perturbations in a predefined subspace $\mathcal{V} \subseteq \mathcal{X}$.

Intuitively, the predefined subspace can be seen as a prior for adversarial perturbations. If the subspace is small enough while still capturing most small adversarial perturbations, then, thanks to the reduced dimensionality, it can significantly reduce the number of queries required for black-box attacks.

We will focus on one type of black-box attack, the random attack, which captures a wide range of (nearly all) existing black-box attacks, conveniently incorporates prior knowledge about the subspace, and is thus easy to transform into a subspace attack.

Definition 2 (random attack).

The resulting adversarial perturbation of a random attack is a linear combination of random vectors.

The following lemma highlights an intuition on how to transform a random attack into a subspace attack:

Lemma 1.

If all random vectors of a random attack are constrained to lie in a predefined subspace $\mathcal{V}$, then the random attack is a subspace attack with respect to $\mathcal{V}$.

The proof is straightforward: a linear combination of vectors in a subspace is also in the subspace.

In the literature, most existing black-box attacks are random attacks; examples will be shown in Section 4.2 and Section 4.3.

Random vectors of random attacks are typically sampled from the Gaussian distribution: the entries of such Gaussian random vectors are independent and identically distributed random variables drawn from a standard Gaussian distribution. Let $\mathrm{sample}(m)$ denote the sampling routine for a Gaussian random vector of dimension $m$; thus $\mathrm{sample}(d)$ samples a Gaussian random vector in the original input space $\mathcal{X}$. If we constrain the sampling routine to a subspace, then by Lemma 1 the resulting attack is a subspace attack. Algorithm 1 shows how to sample a Gaussian random vector in a subspace. The subspace is characterized by an orthonormal basis, and the scaling factor $\sqrt{d/n}$ guarantees that the returned random vector has the same expected squared length as the original random vector $\mathrm{sample}(d)$.

Input: orthonormal vectors $v_1, v_2, \dots, v_n$ which span the subspace $\mathcal{V}$, and the sampling routine $\mathrm{sample}(\cdot)$ for Gaussian random vectors
Output: a Gaussian random vector in the subspace $\mathcal{V}$
1: $c \leftarrow \mathrm{sample}(n)$
2: return $\sqrt{d/n}\,\sum_{i=1}^{n} c_i v_i$
Algorithm 1: Gaussian random vectors in a subspace
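A minimal NumPy sketch of Algorithm 1 might look as follows; the function signature and variable names are our own, and the $\sqrt{d/n}$ rescaling reflects our reading of the algorithm, so treat this as an illustration rather than the authors' implementation. Inputs are treated as flattened $d$-dimensional vectors.

```python
import numpy as np

def sample_in_subspace(basis, d, rng=None):
    """Sample a Gaussian random vector constrained to a subspace (Algorithm 1).

    basis: array of shape (d, n) whose columns are orthonormal vectors spanning V.
    d:     dimension of the original input space.
    Returns a vector of shape (d,) lying in span(basis), rescaled so that its
    expected squared length matches that of a standard Gaussian vector in R^d.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = basis.shape[1]
    coeffs = rng.standard_normal(n)     # Gaussian coefficients, one per basis vector
    v = basis @ coeffs                  # linear combination of the orthonormal vectors
    return np.sqrt(d / n) * v
```

As Corollary 1 below makes explicit, swapping this routine in wherever an existing random attack draws its Gaussian vectors is the entire transformation; the rest of the attack logic is untouched.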

Note that the returned random vector of Algorithm 1 is a linear combination of the orthonormal vectors. Therefore we have the following lemma:

Lemma 2.

The returned random vector of Algorithm 1 is constrained in the subspace $\mathrm{span}(v_1, v_2, \dots, v_n)$, where $\mathrm{span}(\cdot)$ returns the smallest linear space that contains all of the input vectors $v_1, v_2, \dots, v_n$.

Therefore, by applying Algorithm 1 to a random attack, we obtain a subspace attack, as the following corollary implies.

Corollary 1.

Given a set of orthonormal vectors $v_1, v_2, \dots, v_n$, any random attack using Gaussian random vectors can be transformed into a subspace attack whose corresponding subspace is $\mathrm{span}(v_1, v_2, \dots, v_n)$, by replacing the sampling routine $\mathrm{sample}(d)$ with Algorithm 1.

Corollary 1 introduces a particular method to transform a random attack (black-box attack) into a subspace attack. It is noteworthy that the transformation is performed only by replacing the sampling routine: it does not require explicitly projecting adversarial perturbations from the input space onto the subspace, and therefore interferes as little as possible with the original random attack.

4.2 Case study: soft-label black-box attacks

We investigate a soft-label black-box attack framework in which the attack combines a gradient-based optimization method with a backend zeroth-order gradient estimation method to search for adversarial perturbations. This framework captures a wide range of soft-label black-box methods (Ilyas et al., 2018; Liu et al., 2019; Ilyas et al., 2019; Tu et al., 2019; Cheng et al., 2019b). The framework is summarized in Algorithm 2.

Input: score function $g$ and corresponding classifier $f$, correctly-classified labeled instance $(x, y)$, and query budget $B$
Output: adversarial perturbation $\delta$ or NULL
1: initialize the perturbation $\delta$ ($\mathbf{0}$ or a random vector)
2: while the number of queries does not exceed $B$ do
3:    if $f(x + \delta) \neq y$ then
4:        return $\delta$ // successful
5:    end if
6:    estimate the gradient of the loss $\ell(g(x + \delta), y)$ by a zeroth-order method
7:    update $\delta$ by a gradient-based optimization step
8: end while
9: return NULL // failed
Algorithm 2: Soft-label black-box attack framework

In this framework, random vectors can be introduced when initializing the perturbation, which is $\mathbf{0}$ or a random vector, and when estimating gradients by the zeroth-order method. A typical example of gradient estimation is the random gradient-free (RGF) method (Nesterov and Spokoiny, 2017), which returns the estimated gradient in the form
$$\hat{\nabla} = \frac{1}{q} \sum_{i=1}^{q} \frac{\ell(g(x + \delta + \mu u_i), y) - \ell(g(x + \delta), y)}{\mu}\, u_i,$$
where the $u_i$'s are unit Gaussian random vectors (Gaussian random vectors normalized to length 1), $q$ is the number of random directions, and $\mu > 0$ is a small smoothing parameter. Therefore, $\hat{\nabla}$ is a linear combination of random vectors.

Then, the resulting adversarial perturbation is computed by a gradient-based optimization method such as projected gradient descent (Madry et al., 2018), all of which return linear combinations of the estimated gradients. It follows that these attacks are random attacks and can easily be transformed into subspace attacks via Algorithm 1.
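As a concrete, hedged illustration of this framework, the sketch below combines an RGF-style estimator with a plain gradient-ascent update projected onto the $\ell_2$ ball, drawing its random directions from a pluggable (possibly subspace-constrained) sampler. The helper names, step size, smoothing parameter `mu`, and query accounting are illustrative assumptions, not the settings of any specific published attack.

```python
import numpy as np

def rgf_gradient(loss_fn, x, delta, sampler, q=10, mu=1e-3):
    """RGF estimate of the gradient of the black-box loss at x + delta.

    loss_fn: black-box loss, e.g. z -> cross_entropy(g(z), y); each call is one query.
    sampler: returns a random direction of the same (flattened) shape as x.
    """
    base = loss_fn(x + delta)
    grad = np.zeros_like(delta)
    for _ in range(q):
        u = sampler()
        u = u / np.linalg.norm(u)                            # unit Gaussian random vector
        grad += (loss_fn(x + delta + mu * u) - base) / mu * u
    return grad / q                                          # a linear combination of the u_i


def soft_label_attack(classifier, loss_fn, x, y, sampler, radius, budget, q=10, step=0.5):
    """Sketch of the Algorithm 2 framework: alternate gradient estimation and updates."""
    delta = np.zeros_like(x)
    queries = 0
    while queries < budget:
        if classifier(x + delta) != y:
            return delta                                     # successful
        grad = rgf_gradient(loss_fn, x, delta, sampler, q=q, mu=1e-3)
        queries += q + 2                                     # rough count: 1 decision + (q + 1) loss queries
        delta = delta + step * grad                          # untargeted: ascend the loss
        norm = np.linalg.norm(delta)
        if norm > radius:
            delta = delta * (radius / norm)                  # project back onto the l2 ball
    return None                                              # failed: budget exhausted
```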

4.3 Case study: hard-label black-box attacks

Hard-label black-box attacks can be separated into two categories: methods based on random walks (Brendel et al., 2018; Chen et al., 2019b) and methods based on direction estimation (Cheng et al., 2019a, 2020). In the first case, a random walk consists of a succession of random vectors, i.e., a sum of random vectors; in the second case, the gradient with respect to the direction towards the boundary is estimated by RGF or a variant based on the sign of the finite difference. As discussed before, these gradient estimation methods return linear combinations of random vectors. In both cases, the resulting adversarial perturbation is again a linear combination of random vectors, and consequently these attacks can also be transformed into subspace attacks, as illustrated by the sketch below.
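For the direction-estimation category, a hedged sketch of a sign-based estimate of the boundary-distance gradient over subspace-constrained directions is shown below; the oracle interface `is_adversarial`, the argument names, and the hyper-parameters are our assumptions, and the full Sign-OPT procedure (initial direction search, binary search for the boundary distance, step-size schedule) is omitted.

```python
import numpy as np

def sign_grad(is_adversarial, x, theta, dist, sampler, q=20, eps=1e-3):
    """Sign-based estimate of the gradient of the boundary distance w.r.t. direction theta.

    is_adversarial: hard-label oracle, True iff f(z) differs from the true label (one query per call).
    theta:          current unit search direction (flattened, same shape as x).
    dist:           current estimate of the distance to the boundary along theta.
    sampler:        subspace-constrained Gaussian sampler, e.g. lambda: sample_in_subspace(basis, d).
    """
    grad = np.zeros_like(theta)
    for _ in range(q):
        u = sampler()
        u = u / np.linalg.norm(u)
        new_theta = theta + eps * u
        new_theta = new_theta / np.linalg.norm(new_theta)
        # If the point at the old distance is still adversarial, the distance shrank: sign -1.
        sign = -1.0 if is_adversarial(x + dist * new_theta) else 1.0
        grad += sign * u
    # Again a linear combination of random vectors; a caller would descend on the distance with it.
    return grad / q
```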

4.4 Spanning attack

The subspace $\mathcal{V}$ serves as prior information for the subspace attack. For a subspace attack to perform well, it has to be easier to find an adversarial perturbation in the subspace $\mathcal{V}$ than in the original input space $\mathcal{X}$. The crux of subspace attacks is thus how to locate an appropriate subspace $\mathcal{V}$. We propose to utilize an auxiliary unlabeled dataset to span the subspace, motivated by a theoretical analysis of the minimum adversarial perturbation.

Minimum ($\ell_2$ norm) adversarial perturbation.

A minimum adversarial perturbation is the adversarial perturbation with the minimum norm. Formally, given a classifier $f$ and a labeled instance $(x, y)$, the minimum adversarial perturbation is defined as
$$\delta^* = \operatorname*{arg\,min}_{\delta} \|\delta\| \quad \text{subject to} \quad f(x + \delta) \neq y.$$
We first analyze the minimum adversarial perturbation of the nearest neighbor classifier. Given a labeled instance $(x, y)$ to attack, let $S_{=}$ denote the training data whose labels are the same as $y$, $S_{\neq}$ the training data whose labels are different from $y$, and $S = S_{=} \cup S_{\neq}$ the whole training data. Wang et al. (2019) derived a condition that the minimum adversarial perturbation of the nearest neighbor classifier has to satisfy:

Lemma 3 (Wang et al., 2019).

For any labeled instance $(x, y)$ to attack, there exist $x_j \in S_{\neq}$ and scalars $\lambda_i \geq 0$ such that the minimum adversarial perturbation of the nearest neighbor classifier is
$$\delta^* = \sum_{x_i \in S_{=}} \lambda_i\,(x_j - x_i).$$

Then we have a straightforward corollary about the subspace in which the minimum adversarial perturbation lies.

Corollary 2.

Minimum adversarial perturbations of the nearest neighbor classifier are constrained in the space $\mathrm{span}(S)$.

Proof.

By Lemma 3, we have $\delta^* = \sum_{x_i \in S_{=}} \lambda_i (x_j - x_i) \in \mathrm{span}(S_{=} \cup S_{\neq}) = \mathrm{span}(S)$. ∎

Similar results on the minimum adversarial perturbation also hold for support vector machine (SVM) classifiers 

(Cortes and Vapnik, 1995).

Lemma 4.

Minimum adversarial perturbations of SVM classifiers are of the form
$$\delta^* = c \sum_{i} \alpha_i y_i x_i,$$
where $c$ and the coefficients $\alpha_i y_i$ are scalars and the $x_i$ are training instances.

Proof.

For simplicity, we only consider SVM for binary classification with labels $y_i \in \{-1, +1\}$; the result extends via one-vs-one or one-vs-rest strategies. Let $(w, b)$ denote the optimal solution of the SVM. Then, based on the primal-dual relationship, we have
$$w = \sum_{i} \alpha_i y_i x_i,$$
where the $\alpha_i \geq 0$ are the dual variables. When predicting a perturbed example $x + \delta$, the SVM computes
$$w^\top (x + \delta) + b = (w^\top x + b) + \|w\|\,\|\delta\| \cos\theta,$$
where $\theta$ is the angle between $w$ and $\delta$. Therefore, the minimum adversarial perturbation that flips the sign of $w^\top (x + \delta) + b$ has to be in the direction of $w$ (with $\cos\theta = 1$) or in the opposite direction of $w$ (with $\cos\theta = -1$). ∎

Then we have the following corollary, which follows directly from Lemma 4.

Corollary 3.

Minimum adversarial perturbations of SVM classifiers are constrained in the space $\mathrm{span}(S)$.

It is inspiring that the nearest neighbor classifier and SVM, though very different, share the same property:

“Minimum adversarial perturbations prove to be

in the subspace spanned by the training data”.

In contrast, computing minimum adversarial perturbations for neural networks and tree-based ensemble models has been shown to be NP-hard (Katz et al., 2017; Kantchelian et al., 2016), and it is an open problem under what conditions minimum adversarial perturbations for these models also lie in the space $\mathrm{span}(S)$. Nevertheless, Corollary 2 and Corollary 3 motivate us to search for adversarial perturbations in the space $\mathrm{span}(S)$.

In practice, it is not realistic to assume that the training data are available to the attacker. As a relaxation, we assume that the attacker only has access to an auxiliary unlabeled dataset. The subspace attacker then searches for adversarial perturbations in the span of this dataset; we call this the spanning attack, i.e., the subspace attack obtained by spanning an auxiliary unlabeled dataset. For convenience, we simply term the auxiliary unlabeled dataset the subspace dataset. Moreover, since in the following only the subspace dataset is utilized for attacks, we reuse $S$ to denote the subspace dataset without ambiguity.

By Corollary 1, given a subspace dataset $S$, the spanning attack requires a set of orthonormal vectors forming a basis of $\mathrm{span}(S)$ in order to transform a random attack into a subspace attack. Such a basis can be obtained by standard orthonormalization, e.g., the Gram-Schmidt process or the Householder transformation (Cheney and Kincaid, 2010). Therefore, the overall procedure of our spanning attack is as follows (a code sketch is given after the list):

  1. Compute an orthonormal basis of $\mathrm{span}(S)$ by orthonormalization;

  2. Transform the random attack into a subspace attack by Algorithm 1;

  3. Attack the target model with the resulting subspace attack.
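The following minimal sketch strings the three steps together, under the assumption (ours, not the paper's code) that the baseline random attack exposes a pluggable sampling routine; the basis is computed with a QR factorization, i.e., Householder-based orthonormalization, and instances are flattened into rows of a matrix.

```python
import numpy as np

def spanning_basis(subspace_data):
    """Step 1: orthonormalize the subspace dataset.

    subspace_data: array of shape (m, d), one flattened unlabeled instance per row.
    Returns a (d, r) matrix whose columns form an orthonormal basis of span(S).
    """
    # QR factorization of the d x m matrix of instances; Q's columns span the same space.
    q, r = np.linalg.qr(subspace_data.T)
    keep = np.abs(np.diag(r)) > 1e-10       # drop directions from (numerically) dependent instances
    return q[:, keep]


def spanning_attack(random_attack, model, x, y, subspace_data, budget):
    """Steps 2-3: turn `random_attack` into a subspace attack and run it."""
    d = x.size
    basis = spanning_basis(subspace_data)
    sampler = lambda: sample_in_subspace(basis, d)   # from the Algorithm 1 sketch above
    return random_attack(model, x, y, sampler=sampler, budget=budget)
```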

4.5 Selective spanning attack

The spanning attack searches for adversarial perturbations in the space $\mathrm{span}(S)$, which is a subspace of the input space $\mathcal{X}$. A natural question is whether it is possible to benefit further by explicitly selecting a subspace of $\mathrm{span}(S)$ instead of using $\mathrm{span}(S)$ directly. We term the method which searches for adversarial perturbations in a non-trivial subspace of $\mathrm{span}(S)$ the selective spanning attack, as it selects a subspace from $\mathrm{span}(S)$.

For the selective spanning attack, the Gram-Schmidt process or the Householder transformation is not instructive enough to select a subspace of $\mathrm{span}(S)$, since there is no significant difference among the derived orthonormal vectors.

Instead, we employ the singular value decomposition (SVD) to derive a set of orthonormal vectors forming a basis of $\mathrm{span}(S)$. In particular, assume the subspace dataset $S$ consists of $m$ different instances, and let $X \in \mathbb{R}^{m \times d}$ be the matrix whose $i$-th row is $x_i^\top$. By SVD, $X$ can be decomposed into the form
$$X = U \Sigma V^\top,$$
where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a (rectangular) diagonal matrix whose diagonal entries are the singular values.

It can be proved that the right singular vectors (columns of $V$) satisfy the following property:

Lemma 5.

The right singular vectors of $X$ whose corresponding singular values are larger than zero form an orthonormal basis of $\mathrm{span}(S)$.

Proof.

Let $r$ denote the number of non-zero singular values. Then we have the compact SVD
$$X = U_r \Sigma_r V_r^\top,$$
where $U_r \in \mathbb{R}^{m \times r}$ and $V_r \in \mathbb{R}^{d \times r}$ have orthonormal columns and $\Sigma_r \in \mathbb{R}^{r \times r}$ is diagonal with positive entries. Let $v_i$ denote the $i$-th column of $V_r$. The objective is to prove
$$\mathrm{span}(v_1, v_2, \dots, v_r) = \mathrm{span}(S),$$
which is equivalent to

  • (i) $\mathrm{span}(S) \subseteq \mathrm{span}(v_1, v_2, \dots, v_r)$, and

  • (ii) $\mathrm{span}(v_1, v_2, \dots, v_r) \subseteq \mathrm{span}(S)$.

For any $x \in S$, by definition there exists a standard basis vector $e_j \in \mathbb{R}^m$ such that
$$x = X^\top e_j = V_r \Sigma_r U_r^\top e_j,$$
i.e., $x$ is a linear combination of $v_1, \dots, v_r$. Therefore, we have (i) $\mathrm{span}(S) \subseteq \mathrm{span}(v_1, \dots, v_r)$.

By the compact SVD, we also have
$$V_r = X^\top U_r \Sigma_r^{-1}.$$
Thus, for any $v_i$, there exists a vector $c_i = U_r \Sigma_r^{-1} e_i$ such that $v_i = X^\top c_i$, i.e., $v_i$ is a linear combination of the instances in $S$. Therefore, we have (ii) $\mathrm{span}(v_1, \dots, v_r) \subseteq \mathrm{span}(S)$. ∎

We denote these right singular vectors, whose corresponding singular values are larger than zero, by $v_1, v_2, \dots, v_r$, sorted by the corresponding singular values such that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$. Then we have roughly two options for the selective spanning attack: selecting the top singular vectors, or selecting the bottom singular vectors. We term these two options the top spanning attack and the bottom spanning attack, respectively.

Figure 1: Illustration of top and bottom spanning attacks. The only difference between Figure 1(a) and Figure 1(b) is the ground-truth labels. $v_1$ and $v_2$ are right singular vectors, and their lengths represent the corresponding singular values. For an effective attack, the top spanning attack, i.e., selecting $v_1$, is better than the bottom spanning attack, i.e., selecting $v_2$, in Figure 1(a); Figure 1(b) is the opposite case.

Since the selective spanning attack does not require label information, the top spanning attack and the bottom spanning attack each have their own advantageous situations depending on the labels of the underlying data distribution. We illustrate two toy cases in Figure 1. In the first case, the top spanning attack is favorable since adversarial perturbations can be found along $v_1$, and the second case is the exact opposite. Roughly speaking, top singular vectors represent directions along the manifold of the dataset, and bottom singular vectors represent directions off the manifold. It is believed that for high-dimensional datasets adversarial examples widely exist in directions off the manifold (Stutz et al., 2019). Therefore, the bottom spanning attack should be a better choice in practice, which is validated in our experiments. A sketch of both variants follows.
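A minimal sketch of the selective variants via NumPy's SVD is given below; the function name, the numerical tolerance, and the parameter `k` (800 in our experiments) are illustrative choices.

```python
import numpy as np

def selective_basis(subspace_data, k, mode="bottom"):
    """Return k right singular vectors with non-zero singular values as a basis.

    subspace_data: (m, d) matrix X whose i-th row is the i-th unlabeled instance.
    mode: "top" keeps the k largest-singular-value directions (along the data manifold);
          "bottom" keeps the k smallest non-zero ones (closer to off-manifold directions).
    """
    # full_matrices=False yields the compact SVD; singular values come back in descending order.
    _, s, vt = np.linalg.svd(subspace_data, full_matrices=False)
    v = vt[s > 1e-10 * s[0]]                # rows: right singular vectors with non-zero singular value
    rows = v[:k] if mode == "top" else v[-k:]
    return rows.T                           # shape (d, k): columns form the orthonormal basis
```

The returned matrix can be passed directly to the `sample_in_subspace` sketch above in place of the full basis.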

5 Experiments

In this section, we empirically validate the performance of the proposed spanning attack. Specifically, we select three representative black-box (random) attacks as baselines and employ the spanning attack to reinforce them:

  • The RGF attack (Cheng et al., 2019b): a soft-label attack within the framework considered in the case study for soft-label attacks;

  • The boundary attack (Brendel et al., 2018): a pioneering widely-used hard-label attack based on random walks;

  • The Sign-OPT attack (Cheng et al., 2020): a state-of-the-art hard-label attack based on direction estimation.

We perform untargeted black-box attacks on the ImageNet dataset (Deng et al., 2009). Attacks are performed against the pre-trained ResNet-50 (He et al., 2016), VGG-16 (Simonyan and Zisserman, 2015), and DenseNet-121 (Huang et al., 2017) models from the PyTorch model zoo (Steiner et al., 2019), since these architectures are diverse and representative. Correctly-classified images are randomly sampled from the validation set as the evaluation dataset, and every labeled image in it is an instance to attack. The evaluation dataset contains 1,000 images for soft-label attacks and 100 images for hard-label attacks (for computational efficiency). Another 1,000 unlabeled images are sampled from the validation set as the subspace dataset; the evaluation dataset and the subspace dataset do not overlap. We fix the perturbation radius $\epsilon$ and the query budget $B$. If an attack method finds an adversarial perturbation $\delta$ with $\|\delta\| \le \epsilon$ within $B$ queries such that $f(x + \delta) \neq y$ holds, the attack is successful; otherwise it fails. As described in Section 3, we report both the success rate and the number of queries.

All hyper-parameters of the spanning attacks are the same as those of the corresponding baselines; the only difference is the introduction of an appropriate subspace via our method. We refer to Cheng et al. (2019b), Brendel et al. (2018), and Cheng et al. (2020) for details of the baseline black-box methods.

5.1 Main results

Attack | Method | Success rate | Query mean | Query median
RGF (soft-label) | Baseline | 0.965 | 596.088 | 358.000
RGF (soft-label) | Spanning attack | 0.988 | 315.879 | 205.000
Boundary (hard-label) | Baseline | 0.720 | 4133.903 | 3291.000
Boundary (hard-label) | Spanning attack | 0.880 | 3197.557 | 2569.500
Sign-OPT (hard-label) | Baseline | 0.970 | 2392.175 | 2143.000
Sign-OPT (hard-label) | Spanning attack | 1.000 | 1053.220 | 647.000
Table 1: Comparison between the baseline black-box attacks and the resulting spanning attacks for ResNet-50.
Attack | Method | Success rate | Query mean | Query median
RGF (soft-label) | Baseline | 0.970 | 450.484 | 256.000
RGF (soft-label) | Spanning attack | 0.984 | 259.006 | 154.000
Boundary (hard-label) | Baseline | 0.810 | 3467.086 | 2787.000
Boundary (hard-label) | Spanning attack | 0.940 | 2972.755 | 2263.000
Sign-OPT (hard-label) | Baseline | 1.000 | 1665.080 | 1450.000
Sign-OPT (hard-label) | Spanning attack | 1.000 | 840.900 | 572.500
Table 2: Comparison between the baseline black-box attacks and the resulting spanning attacks for VGG-16.
Attack | Method | Success rate | Query mean | Query median
RGF (soft-label) | Baseline | 0.977 | 515.855 | 358.000
RGF (soft-label) | Spanning attack | 0.995 | 260.203 | 154.000
Boundary (hard-label) | Baseline | 0.670 | 3806.687 | 3389.000
Boundary (hard-label) | Spanning attack | 0.890 | 3063.449 | 2261.000
Sign-OPT (hard-label) | Baseline | 0.980 | 2407.398 | 1863.500
Sign-OPT (hard-label) | Spanning attack | 1.000 | 1014.280 | 688.500
Table 3: Comparison between the baseline black-box attacks and the resulting spanning attacks for DenseNet-121.
Figure 2: Visualization of vectors from the orthonormal basis.
Figure 3: Examples of the adversarial images. The first row is the original images; the second row is the adversarial images crafted by the baseline attack (Sign-OPT against ResNet-50); the third row is the adversarial images crafted by the corresponding spanning attack.

Success rates, query means, and query medians on the evaluation dataset are reported in Table 1 for ResNet-50, Table 2 for VGG-16, and Table 3 for DenseNet-121. By convention, only successful adversarial perturbations are counted for query means and query medians. On the one hand, this criterion favors a method with a lower success rate but a lower query count on its successful attacks. On the other hand, if a method achieves both a higher success rate and a lower query count on successful perturbations than another, we can conclude with more confidence that it performs better.

Our results show that the spanning attack improves the baseline methods significantly in terms of both success rates and query numbers, consistently across all baseline methods and all pre-trained target models.

In particular, for the RGF attack and the Sign-OPT attack, the spanning attack needs only about half of the queries of the baseline (or fewer) for a successful attack, while increasing success rates at the same time. For example, in the Sign-OPT case for ResNet-50, the spanning attack improves the success rate to 100%, and more crucially, it requires only 1,053 queries in terms of the query mean and 647 queries in terms of the query median.

In the case of the boundary attack, while the success rates of the baseline within the given budget are not satisfactory, our spanning attack improves them by a wide margin. For example, in the Boundary case for VGG-16, the spanning attack improves the success rate from 81.0% to 94.0%.

Visualization of the subspace basis.

A sample of vectors from the resulting orthonormal basis is visualized in Figure 2. Note that these vectors reveal low-dimensional structure of the subspace rather than resembling Gaussian noise.

Examples of adversarial images.

Several adversarial images, crafted by the baseline method and by the spanning attack, are displayed in Figure 3. None of these adversarial images shows any significant difference from the original image, since they all satisfy the same small-norm constraint.

Since the results are consistent across different baseline methods and target models, in the following we take the RGF attack against ResNet-50 as the default illustration to avoid unnecessary repetition.

5.2 Investigation on the subspace

In this section, we study to what extent the subspace affects the performance of the spanning attack.

Figure 4: Attack performance with different sizes of the subspace dataset: (a) success rate, (b) query mean, (c) query median.

5.2.1 Size of the subspace dataset.

We show the attack performance with different sizes of the subspace dataset in Figure 4. In our experiments, the minimum size is 100 and the maximum size is 1,000. In this range, the larger the subspace dataset, the better the performance of the spanning attack. In contrast, the baseline method corresponds to the extreme case where the subspace is the whole input space, so the performance of the spanning attack is expected to peak and then decline as the subspace size keeps increasing. It is noteworthy that even a small subspace dataset, as shown in Figure 4, already allows the spanning attack to beat the baseline.

5.2.2 Distribution of the subspace dataset

Method | Success rate | Query mean | Query median
Baseline | 0.965 | 596.088 | 358.0
Spanning attack | 0.988 | 315.879 | 205.0
Spanning attack (Flickr8k) | 0.988 | 322.228 | 205.0
Totally random subspace | 0.957 | 714.893 | 409.0
Bottom spanning attack | 0.988 | 299.205 | 154.0
Top spanning attack | 0.988 | 334.485 | 205.0
Table 4: Results of the spanning attack with the Flickr8k subspace dataset, the spanning attack with a totally random subspace (a subspace without any prior), the bottom spanning attack, and the top spanning attack (RGF attack against ResNet-50).

We investigate whether it is necessary to sample the subspace dataset from the same distribution as the training data. We sample 1,000 unlabeled images from the Flickr8k dataset (Hodosh et al., 2015), which is quite different from ImageNet, as the subspace dataset. For comparison, as an extreme case, we also report the results of the spanning attack with a totally random subspace, spanned by an unlabeled dataset in which every image is sampled from a uniform distribution; in other words, the subspace is chosen completely at random without any prior knowledge.

The results are displayed in the middle rows of Table 4. On the one hand, the results of the Flickr8k spanning attack are still better than the baseline, and competitive with the spanning attack in Section 5.1, where the subspace dataset is sampled from the ImageNet validation set. This suggests that the conditions the subspace dataset has to satisfy are not strict in practice, and even a biased subspace dataset suffices, which extends the application range of the spanning attack: the subspace dataset does not necessarily have to be sampled from the same distribution as the training data.

On the other hand, the spanning attack with a totally random subspace performs even worse than the baseline, which validates that it is the prior knowledge provided by the subspace dataset, rather than an arbitrary low-dimensional subspace, that matters.

5.2.3 The bottom and top spanning attack

We investigate whether the selective spanning attack can further improve performance. In our experiments, the bottom 800 singular vectors (out of 1,000 in total) are used for the bottom spanning attack, and the top 800 singular vectors are used for the top spanning attack. The comparison among the original spanning attack, the bottom spanning attack, and the top spanning attack is shown in the lower rows of Table 4. The results show that the bottom spanning attack can further improve performance, whereas the top spanning attack has a negative impact. This empirically validates that adversarial perturbations are more likely to appear in directions off the data manifold than along the data manifold, as discussed in Section 4.5.

5.3 More discussion with related work

Although the work of Yan et al. (2019) has a different setting from ours, as discussed in Section 2, for completeness we adapt their method for comparison. They require an auxiliary labeled dataset and only consider soft-label black-box attacks. To enable a comparison, we let the subspace dataset be labeled, with size 1,000. With such a small auxiliary dataset in our setting, the method of Yan et al. (2019) does not perform well. For instance, when attacking ResNet-50, it achieves a success rate of 58.7% and a query mean of 641.283. The results for attacking VGG-16 and DenseNet-121 are similar (VGG-16: success rate 68.6%, query mean 558.044; DenseNet-121: success rate 59.0%, query mean 623.603). This is primarily because their method trains substitute models with labeled data; when the dataset is too small, it is difficult to train reliable substitute models.

6 Conclusion

We propose a general technique named the spanning attack to improve the query efficiency of black-box attacks. The spanning attack is motivated by the theoretical analysis that minimum adversarial perturbations of machine learning models tend to lie in the subspace spanned by the training data. In practice, the spanning attack only requires a small auxiliary unlabeled dataset, and it is applicable to a wide range of black-box attacks, including both soft-label and hard-label black-box attacks. Our experiments show that the spanning attack can significantly improve the query efficiency and success rates of black-box attacks.

References

  • B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §1.
  • W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.3, 2nd item, §5.
  • T. Brunner, F. Diehl, M. Truong-Le, and A. Knoll (2018) Guessing smart: biased sampling for efficient black-box adversarial attacks. In IEEE International Conference on Computer Vision (ICCV), pp. 4958–4966. Cited by: §2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
  • H. Chen, H. Zhang, D. Boning, and C. Hsieh (2019a) Robust decision trees against adversarial examples. In International Conference on Machine Learning (ICML), pp. 1122–1131. Cited by: §1.
  • J. Chen, M. I. Jordan, and M. J. Wainwright (2019b) HopSkipJumpAttack: a query-efficient decision-based attack. CoRR abs/1904.02144. Cited by: §2, §4.3.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Conference on Computer and Communications Security (CCS) Workshop on Artificial Intelligence and Security (AISec), pp. 15–26. Cited by: §2.
  • W. Cheney and D. R. Kincaid (2010) Linear algebra: theory and applications. The Saylor Foundation. Cited by: §4.4.
  • M. Cheng, T. Le, P. Chen, J. Yi, H. Zhang, and C. Hsieh (2019a) Query-efficient hard-label black-box attack: an optimization-based approach. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.3.
  • M. Cheng, S. Singh, P. Chen, S. Liu, and C. Hsieh (2020) Sign-opt: a query-efficient hard-label adversarial attack. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.3, 3rd item, §5.
  • S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu (2019b) Improving black-box adversarial attacks with a transfer-based prior.. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §4.2, 1st item, §5.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297. Cited by: §4.4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §5.
  • A. Fawzi, O. Fawzi, and P. Frossard (2018) Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning 107 (3), pp. 481–508. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §5.
  • M. Hodosh, P. Young, and J. C. Hockenmaier (2015) Framing image description as a ranking task: data, models and evaluation metrics. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 4188–4192. Cited by: §5.2.2.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. Cited by: §5.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning (ICML), pp. 2142–2151. Cited by: §2, §4.2.
  • A. Ilyas, L. Engstrom, and A. Madry (2019) Prior convictions: black-box adversarial attacks with bandits and priors. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.2.
  • A. Kantchelian, J. Tygar, and A. Joseph (2016) Evasion and hardening of tree ensemble classifiers. In International Conference on Machine Learning (ICML), pp. 2387–2396. Cited by: §4.4.
  • G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Cited by: §4.4.
  • S. Liu, P. Chen, X. Chen, and M. Hong (2019) SignSGD via zeroth-order oracle. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.2.
  • Y. Liu, X. Chen, C. Liu, and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), Cited by: §1, §4.2.
  • Y. Nesterov and V. G. Spokoiny (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566. Cited by: §4.2.
  • N. Papernot, P. D. McDaniel, and I. J. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277. Cited by: §2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §5.
  • B. Steiner, Z. DeVito, S. Chintala, S. Gross, A. Paszke, F. Massa, A. Lerer, G. Chanan, Z. Lin, E. Yang, A. Desmaison, A. Tejani, A. Kopf, J. Bradbury, L. Antiga, M. Raison, N. Gimelshein, S. Chilamkurthy, T. Killeen, L. Fang, and J. Bai (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8024–8035. Cited by: §5.
  • D. Stutz, M. Hein, and B. Schiele (2019) Disentangling adversarial robustness and generalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6976–6987. Cited by: §4.5.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • C. Tu, P. Ting, P. Chen, S. Liu, H. Zhang, J. Yi, C. Hsieh, and S. Cheng (2019) AutoZOOM: autoencoder-based zeroth order optimization method for attacking black-box neural networks. In AAAI, Vol. 33, pp. 742–749. Cited by: §2, §4.2.
  • L. Wang, X. Liu, J. Yi, Z. Zhou, and C. Hsieh (2019) Evaluating the robustness of nearest neighbor classifiers: a primal-dual perspective.. CoRR abs/1906.03972. Cited by: §1, §4.4, Lemma 3.
  • Z. Yan, Y. Guo, and C. Zhang (2019) Subspace attack: exploiting promising subspaces for query-efficient black-box attacks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §5.3.