Theoretical Study of Random Noise Defense against Query-Based Black-Box Attacks

04/23/2021 ∙ by Zeyu Qin, et al.

Query-based black-box attacks, which require no knowledge about the attacked models or datasets, have raised serious threats to machine learning models in many real applications. In this work, we study a simple but promising defense technique against query-based black-box attacks, dubbed Random Noise Defense (RND), which adds proper Gaussian noise to each query. It is lightweight and can be directly combined with any off-the-shelf models and other defense strategies. However, the theoretical guarantees of random noise defense have been missing, and the actual effectiveness of this defense is not yet fully understood. We present solid theoretical analyses demonstrating that the defense effect of RND against query-based black-box attacks and the corresponding adaptive attacks heavily depends on the magnitude ratio between the random noise added by the defender (i.e., RND) and the random noise added by the attacker for gradient estimation. Extensive experiments on CIFAR-10 and ImageNet verify our theoretical studies. Based on RND, we also propose a stronger defense method that combines RND with Gaussian augmentation training (RND-GT) and achieves better defense performance.


1 Introduction

Deep neural networks (DNNs) have been successfully applied in many mission-critical tasks, such as autonomous driving and face recognition. However, it has been shown that DNN models are vulnerable to adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014), which are indistinguishable from natural examples but cause the model to produce erroneous predictions. If the attacker can access the parameters of the attacked model, the attack is called a white-box attack. If the attacker can only obtain the query feedback from the attacked model, without any information about the model parameters or architecture, it is called a black-box attack.

In real scenarios, the DNN model behind a product, as well as its training dataset, are often hidden from users. Instead, only the model feedback for each query (e.g., classification labels or scores) is accessible. In this case, the product provider mainly faces severe threats from black-box attacks, which have made rapid progress in the past few years. The main challenges of black-box defense are that 1) the defender should not significantly influence the model's feedback to normal queries, yet it is difficult to know whether a query is normal or malicious; and 2) the defender has no information about which black-box attack strategy the attacker adopts. Recently, considerable efforts have been devoted to improving the adversarial robustness of DNNs (Madry et al., 2018; Papernot et al., 2016; Tramèr et al., 2017; Tramer et al., 2020; Cohen et al., 2019). Among them, adversarial training (AT) is considered one of the most effective defense techniques (Athalye et al., 2018a). However, the improved robustness from AT is often accompanied by a significant degradation of normal accuracy. Besides, the training cost of AT is much higher than that of standard training, and AT models often generalize poorly to new samples and new attack methods (Sokolic et al., 2017; Zhang et al., 2017; Geirhos et al., 2019). Thus, AT-based defense is not well suited for black-box defense in deployed systems. In contrast, we expect a good defense technique to satisfy the following requirements: no significant degradation of normal accuracy, lightweight, plug-and-play, and good generalization to diverse black-box attacks.

In this work, we study a lightweight defense strategy, dubbed Random Noise Defense (RND), against query-based black-box attacks. The key idea of RND is to introduce test-time randomness by adding small random noise to each query, so that the returned feedback contains some randomness and misleads the attack process of query-based attacks. More importantly, we provide a theoretical analysis of the effectiveness of RND and demonstrate that it depends significantly on the magnitude ratio between the random noise added by the defender (i.e., RND) and the random noise added by the attacker for gradient estimation. We also analyze the performance of the corresponding adaptive attack, which is considered an effective strategy to mitigate the defense effect caused by randomness. Our theoretical result shows that the adaptive attack has a limited ability to evade the RND defense, especially when the input dimension is high. These two analyses imply that inserting random noise with a larger magnitude into each query yields better defense performance against both standard and adaptive black-box attacks. On the other hand, larger random noise leads to larger degradation of normal accuracy. To achieve a better trade-off between the defense effect and normal accuracy, we propose to train the model with Gaussian augmentation training (GT), which adds Gaussian noise to each training sample, to improve the model's robustness to the random noise added to each query. Consequently, the combination of GT and RND is a practical and effective strategy to defend against black-box attacks in real scenarios.

The main contributions of this work are threefold. 1) To the best of our knowledge, this work is the first to theoretically analyze the effect of random noise defense against both standard and adaptive query-based black-box attacks. 2) Inspired by the theoretical analysis, we propose a practical and effective defense strategy for real scenarios by combining Gaussian augmentation training with random noise defense. 3) Extensive experiments further verify the presented theoretical analysis and demonstrate the effectiveness of the proposed defense strategy against several state-of-the-art query-based black-box attacks.


2 Related Work

Query-based methods

Here we mainly review query-based black-box attack methods, which can be categorized into two classes: gradient estimation methods and search-based methods. Gradient estimation methods are based on zero-order (ZO) optimization algorithms. In search-based methods, the attacker uses a direct search strategy to find a direction that decreases the objective value, instead of explicitly estimating the gradient. Furthermore, we focus on score-based queries, where a continuous score (e.g., the posterior probability or the logit) is returned for each query, in contrast to decision-based queries, which return only hard labels. Specifically, (Ilyas et al., 2018a) proposed the first query-limited attack method, utilizing Natural Evolutionary Strategies (NES) to estimate the gradient. (Liu et al., 2018a) proposed the ZO-signSGD algorithm, which is similar to NES, and gave a detailed convergence analysis of the attack algorithm. Based on bandit optimization, (Ilyas et al., 2018b) proposed to combine time- and data-dependent gradient priors with gradient estimation, which dramatically reduces the number of queries. Under the ℓ2-norm constraint, SimBA (Guo et al., 2019a) randomly samples a perturbation from an orthonormal basis and also utilizes the discrete cosine transform (DCT) to reduce the dimension of the search space. SignHunter (Al-Dujaili and O'Reilly, 2020) focuses on estimating the sign of the gradient and flips the sign of the perturbation to improve query efficiency. The Square attack (Andriushchenko et al., 2020) is a state-of-the-art query-based attack method that selects localized square-shaped updates at random positions of the image.
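To make the gradient-estimation family concrete, the sketch below shows an antithetic NES-style estimator in the spirit of (Ilyas et al., 2018a); the function name, sample count, and toy loss are illustrative choices rather than the original implementation.

```python
import numpy as np

def nes_gradient_estimate(loss, x, sigma=0.01, n_samples=50):
    """Antithetic NES-style gradient estimate of a scalar black-box loss at x.

    Each antithetic pair (x + sigma*u, x - sigma*u) costs two score-only queries.
    """
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        grad += (loss(x + sigma * u) - loss(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_samples)

# toy check on a quadratic loss: the estimate approximates the true gradient 2*x
quadratic = lambda z: float(np.sum(z ** 2))
print(nes_gradient_estimate(quadratic, np.ones(10)))
```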


Black-Box Defense

Compared with defenses against white-box attacks, defenses specially designed for black-box attacks have not been well studied. Two recent works (Chen et al., 2020; Li et al., 2020) proposed to detect malicious queries by comparing the current query with the history of queries, exploiting the fact that malicious queries aimed at the same benign example tend to be similar to one another, while the similarity between different benign examples is much smaller. AdvMind (Pang et al., 2020) proposed to infer the intent of the adversary and also needs to store the history of queries. However, if the attacker adopts a strategy of long-interval malicious queries, the defender has to store a long history, incurring a very high cost of storage and comparison. (Salman et al., 2020) proposed the first black-box certified defense method, dubbed denoised smoothing. (Byun et al., 2021) also showed through experimental evaluations that adding random noise can defend against query-based attacks, but without a theoretical analysis of the defense effect. There are also a few randomization-based defenses. RP (Xie et al., 2018) proposed a random input-transform-based method. RSE (Liu et al., 2018b) adds large Gaussian noise to both the input and the activation of each layer. PNI (He et al., 2019) combines adversarial training with adding Gaussian noise to the input or weights of each layer, which achieves better defense performance. However, these methods significantly sacrifice the accuracy on benign examples. PixelDP (Lecuyer et al., 2019) and randomized smoothing (Cohen et al., 2019; Salman et al., 2019) turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the ℓ2 norm. However, they require corrupting each query multiple times with Gaussian noise to obtain a majority-vote prediction, which places a heavy burden on using the model in practice; note also that RND cannot be directly compared with randomized smoothing, since the working principles and purposes of the two methods are quite different. In contrast, the defense mechanism of RND studied in this work perturbs each query only once, without any extra burden. (Rusak et al., 2020) showed that a model with Gaussian augmentation training achieves state-of-the-art defense against common corruptions and good defense against white-box attacks, while its defense against black-box attacks is not evaluated. Apart from the aforementioned defenses against query-based black-box attacks, ensemble adversarial training (EAT) (Tramèr et al., 2017) is considered a good defense against transfer-based black-box attacks; it trains the model with multi-step attacks generated on an ensemble of several surrogate models.


3 Preliminaries

3.1 Score-based Black-Box Attack

We denote the attacked model as f: X → Y, with X being the input space and Y being the output space. Given a benign example x with ground-truth label y, the goal of an adversarial attack is to generate an adversarial example x_adv that is similar to x but enforces the prediction of f to differ from the ground-truth label y (i.e., untargeted attack) or to equal a target label y_t (i.e., targeted attack). It can be generally formulated as follows:

(1)
(2)

where the attack objective is defined w.r.t. the ground-truth label y for the untargeted attack and w.r.t. the target label y_t for the targeted attack, f_j(x) denotes the logit or posterior probability w.r.t. class j, and B_p(x, ε) indicates an ℓp ball around x (p is often specified as 2 or ∞) with radius ε. Consequently, the gradient of the objective function is not accessible to the attacker, and it cannot be optimized by standard gradient-based algorithms.
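As a concrete illustration of the untargeted and targeted goals above, the sketch below uses a margin-style objective on the model's class scores; this particular loss form is an assumption for illustration, not necessarily the one used in the paper.

```python
import numpy as np

def attack_objective(scores, y, y_target=None):
    """Margin-style attack objective on class scores (the attacker minimizes it).

    Untargeted: drive the true-class score below the best competing class.
    Targeted:   drive the target-class score above the best competing class.
    """
    if y_target is None:                          # untargeted attack
        best_other = np.max(np.delete(scores, y))
        return float(scores[y] - best_other)      # negative once the label flips
    best_other = np.max(np.delete(scores, y_target))
    return float(best_other - scores[y_target])   # negative once y_target is predicted

logits = np.array([2.1, 0.3, -1.0])               # toy scores for 3 classes
print(attack_objective(logits, y=0))              # untargeted
print(attack_objective(logits, y=0, y_target=2))  # targeted
```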

3.2 Zero-Order Optimization for Black-Box Attack

Since the derivative of the objective cannot be obtained directly, we have to resort to derivative-free, also called zero-order (ZO), optimization algorithms. Several black-box attack methods have been developed based on ZO optimization, dubbed ZO attacks, and show very promising attack performance. The general idea of ZO attack methods is to estimate the gradient direction of the attack objective, denoted F(·), from the objective values returned by queries. A widely used gradient estimator (Nesterov and Spokoiny, 2017; Duchi et al., 2015) is

(3)   ĝ_μ(x) = [F(x + μu) − F(x)] / μ · u,

where u ~ N(0, I_d) is a random direction and μ > 0 is the smoothing parameter. Based on this estimated gradient, an attack procedure similar to the white-box attack can be conducted (e.g., projected gradient descent), which is briefly summarized in Algorithm 1. Note that here we only focus on gradient estimators based on queries to the attacked/target model; attacks that utilize the transferability from surrogate models to the target model are not covered, and are further discussed in Section 7.8.

  Input: the benign example x, the step size η, the query budget T, a gradient estimator, the neighborhood set B_p(x, ε), and the target label y_t if conducting a targeted attack.
  Initialize x_0 = x
  for t = 0, 1, ..., T−1 do
     Compute ĝ_t using the gradient estimator
     Update the solution using projected gradient descent:
(4)   x_{t+1} = Proj_{B_p(x, ε)}(x_t − η ĝ_t)
     if the attack succeeds then
        stop and return x_{t+1}
     end if
  end for
  Return x_T
Algorithm 1 Zero-order optimization for black-box attack
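A minimal NumPy sketch of Algorithm 1 under an ℓ∞ constraint is given below; the step size, budget, success criterion (objective below zero), and the plain descent step are our own placeholder choices, not the paper's exact settings.

```python
import numpy as np

def zo_attack(F, x, eps=0.05, mu=0.01, lr=0.01, budget=10000):
    """Basic ZO attack (Algorithm 1 sketch): RGF estimate + projected gradient descent.

    F is the scalar attack objective to be minimized; each iteration spends
    two queries on the one-sided finite-difference estimator of Eq. (3).
    """
    x_adv = x.copy()
    queries = 0
    while queries + 2 <= budget:
        u = np.random.randn(*x.shape)
        g_hat = (F(x_adv + mu * u) - F(x_adv)) / mu * u   # one-sided RGF estimate
        queries += 2
        x_adv = x_adv - lr * g_hat                         # gradient descent step
        x_adv = np.clip(x_adv, x - eps, x + eps)           # project onto the l_inf ball
        if F(x_adv) < 0:                                   # placeholder success criterion
            queries += 1
            break
    return x_adv, queries
```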

4 Query-based Black-Box Attack Methods

4.1 Gradient Estimation Attack Methods

We mainly analyze the gradient estimation attack methods because they are easier to implement in practice and also have good theoretical convergence guarantees. We first introduce the notation used throughout. D denotes the data distribution, x ∈ R^d denotes a data sample with corresponding label y, and Y is the output label space. We denote the attacked black-box model as f, and f(x) is the logit output of the model. A is the attack algorithm and x' is the perturbed image.

ℓ(·,·) represents the loss function. Besides, we always impose the perturbation constraint ‖x' − x‖_p ≤ ε.

‖·‖_p represents the ℓp norm, and we usually use the ℓ2 and ℓ∞ norms. We denote this norm ball by B_p(x, ε), which is a convex set.

Similar to the white-box attack, the black-box attacker's goal is to maximize the loss w.r.t. the ground-truth label y for the untargeted attack, or to maximize the score of the target label y_t for the targeted attack, under the perturbation constraint. We define a unified objective F(x') that is minimized exactly when the corresponding attack goal is achieved, so the attacker's goal is to find x' that minimizes F under the perturbation constraint. The black-box attack problem is thus formulated as

(5)   min_{x' ∈ B_p(x, ε)} F(x'),

where F(x') is defined w.r.t. the ground-truth label y for the untargeted attack and w.r.t. the target label y_t for the targeted attack, and the loss ℓ can be, e.g., the cross-entropy or a margin loss.

Under the above setting, designing the attack algorithm becomes a ZO optimization problem, as in mainstream black-box attack methods (Ilyas et al., 2018a; Cheng et al., 2019; Liu et al., 2018a; Tashiro et al., 2020; Guo et al., 2019b; Huang and Zhang, 2019; Zhang et al., 2020; Ilyas et al., 2018b; Li et al., 2019). The random gradient-free method (Nesterov and Spokoiny, 2017; Ghadimi and Lan, 2013; Duchi et al., 2015; Ghadimi et al., 2016) is the most commonly adopted approach. To estimate the gradient, the attacker utilizes Gaussian smoothing and obtains the gradient estimator

(6)   ĝ_μ(x) = [F(x + μu) − F(x)] / μ · u,

where u ~ N(0, I_d) and μ > 0 is the smoothing parameter. Eq. (6) is the one-sided estimator; a two-sided estimator can be defined analogously. After obtaining the estimated gradient, the attacker performs projected gradient descent as in white-box attacks (Goodfellow et al., 2014; Madry et al., 2018). The basic attack algorithm is detailed in Algorithm 2. The black-box attacker's goal is therefore to generate a successful adversarial example using as few queries as possible, which also means designing attack algorithms with faster convergence.

  Input: the attacked data sample x; the attack objective function F; the learning rate η; the number of iterations T; random vectors u sampled from N(0, I_d); the batch size b of random directions; the smoothing parameter μ; the constraint norm ball B_p(x, ε).
  for t = 0, 1, ..., T−1 do
     Compute the gradient estimator ĝ_t by applying the ZO gradient estimator (6).
     Projected gradient update:
(7)   x_{t+1} = Proj_{B_p(x, ε)}(x_t − η ĝ_t)
  end for
  Output: x_T
Algorithm 2 Basic ZO Black-Box Attack Algorithm
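A batched version of the estimator in Eq. (6), averaging over b random directions as listed in the inputs of Algorithm 2, might look as follows (function and parameter names are ours):

```python
import numpy as np

def rgf_gradient(F, x, mu=0.01, batch_size=10):
    """One-sided RGF estimator of Eq. (6), averaged over a batch of random directions.

    Costs batch_size + 1 queries: one for F(x) and one per sampled direction.
    """
    fx = F(x)
    grad = np.zeros_like(x)
    for _ in range(batch_size):
        u = np.random.randn(*x.shape)
        grad += (F(x + mu * u) - fx) / mu * u
    return grad / batch_size
```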

5 RND: Random Noise Defense for Zero-Order Attack Methods

5.1 Random Noise Defense

In the query-based gradient estimator (e.g., Eq. (6)), the attacker adds a small random perturbation to obtain the objective difference between two queries. This random perturbation plays a key role in obtaining a good estimate of the gradient direction. Thus, if the defender can further disturb this random perturbation, the gradient estimation is expected to be misled, and the attack efficiency will decrease. However, the defender cannot identify whether a query is normal or malicious. Thus, the random noise defense (RND) adds a random noise to every query. Consequently, the feedback for a query x is f(x + νδ), where δ ~ N(0, I_d) is a random noise generated by the defender and the factor ν > 0 controls its magnitude. Considering the task of defending against query-based black-box attacks, we have two requirements/expectations for RND:

  • The output of each individual query should not change significantly, i.e., f(x + νδ) ≈ f(x);

  • The estimated gradient should be perturbed, such that the iterative attack procedure is misled and the attack efficiency decreases. Specifically, the gradient estimator under RND becomes

    (8)   ĝ_{μ,ν}(x) = [F(x + μu + νδ_1) − F(x + νδ_2)] / μ · u,

    where δ_1, δ_2 ~ N(0, I_d) are both random noises generated by the defender.

To satisfy the first requirement, the magnitude factor ν should not be too large. However, to satisfy the second requirement, ν should be large enough to significantly change the gradient estimation. We therefore need to choose a proper ν to achieve a good trade-off between these two requirements.
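Operationally, RND is a thin wrapper around the deployed model; a PyTorch-style sketch (class and parameter names are ours, not the paper's released code) is:

```python
import torch
import torch.nn as nn

class RNDWrapper(nn.Module):
    """Random Noise Defense: add nu * N(0, I) noise to every incoming query."""

    def __init__(self, model: nn.Module, nu: float = 0.02):
        super().__init__()
        self.model = model
        self.nu = nu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fresh noise is sampled independently for every query batch.
        return self.model(x + self.nu * torch.randn_like(x))

# usage sketch: defended = RNDWrapper(trained_model, nu=0.02); scores = defended(query_batch)
```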

6 RND: Random Noise Defense for Query-based Black-Box Attack Methods

6.1 Random Noise Defense Mechanism

In this section, we first describe the defense mechanism of RND. Based on the previous introduction of query-based methods, we can see that all query-based attacks rely on a common element: they add small random perturbations to each query, either to estimate gradients (gradient estimation methods) or to find feasible attack directions (search-based methods). Therefore, we can introduce additional randomness to disturb the attack process and slow down the convergence of the attack algorithms.

RND disturbs the gradient estimation and the search for attack directions by adding extra random noise to each query to the attacked model, as in Eq. (9). Since RND does not detect whether each query is malicious, the added noise cannot be very large, so that the model accuracy on normal queries is not affected. The RND mechanism should therefore satisfy the following two conditions:

  • Disturbing the attacker's gradient estimation and search for descent directions.

  • Not affecting the model accuracy on normal queries.

So the gradient estimator in Algorithm 2 becomes

(9)   ĝ_{μ,ν}(x) = [F(x + μu + νδ_1) − F(x + νδ_2)] / μ · u,

where δ_1 and δ_2 are also sampled from the standard normal distribution N(0, I_d), and ν > 0 is the defense parameter. From Eq. (9), we can see that the random noise interferes with the attacker's search for descent directions and leads to wrong gradient estimates.
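From the attacker's side, the finite difference in Eq. (9) is built from two queries that each receive independent defender noise; a sketch of the quantity the attacker effectively computes (variable names are ours) is:

```python
import numpy as np

def rnd_perturbed_rgf(F, x, mu=0.01, nu=0.02):
    """RGF estimate of Eq. (9) when the defender adds fresh noise nu*delta to every query.

    delta1 and delta2 are drawn by the defender and are unknown to the attacker.
    """
    u = np.random.randn(*x.shape)            # attacker's search direction
    delta1 = np.random.randn(*x.shape)       # defender noise on the first query
    delta2 = np.random.randn(*x.shape)       # defender noise on the second query
    return (F(x + mu * u + nu * delta1) - F(x + nu * delta2)) / mu * u
```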

6.2 Theoretical Analysis of Random Noise Defense against ZO Attacks

In this section, we present the theoretical analysis of the effect of RND on query-based ZO attacks, i.e., the convergence of Algorithm 2 with Eq. (9) as the gradient estimator. Throughout our analysis, the adversarial perturbation is measured by the ℓ2 norm, corresponding to p = 2. To facilitate the subsequent analyses, we first introduce some assumptions and definitions.

Assumption 1.

F(x) is a convex function w.r.t. x.

Assumption 2.

F(x) is Lipschitz-continuous, i.e., |F(x) − F(y)| ≤ L_0 ‖x − y‖_2 for all x, y.

Assumption 3.

F(x) is continuously differentiable, and ∇F(x) is Lipschitz-continuous, i.e., ‖∇F(x) − ∇F(y)‖_2 ≤ L_1 ‖x − y‖_2 for all x, y.

Notations. We denote the sequence of random noises added by the attacker as {u_t}, with t being the iteration index in Algorithm 2. The sequence of random noises added by the defender is denoted as {δ_t}. The sequential solutions generated by Algorithm 2 are denoted as {x_t}, and the benign example is used as the initial solution, i.e., x_0 = x. d denotes the input dimension.

Definition 1.

The Gaussian-smoothing function corresponding to F with noise magnitude ν is

(10)   F_ν(x) = E_{δ∼N(0, I_d)}[F(x + νδ)].

Due to the noise inserted by the defender, F_ν becomes the objective function actually faced by the attacker. We also denote the minimum of F_ν over the feasible set by F_ν^*.
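Definition 1 can be approximated by Monte Carlo averaging; the sketch below (sample count arbitrary) illustrates the smoothed objective F_ν that the attacker effectively faces under RND.

```python
import numpy as np

def smoothed_objective(F, x, nu=0.02, n_samples=1000):
    """Monte Carlo estimate of F_nu(x) = E[F(x + nu * delta)], delta ~ N(0, I_d)."""
    values = [F(x + nu * np.random.randn(*x.shape)) for _ in range(n_samples)]
    return float(np.mean(values))
```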

Definition 2.

The expected objective function value of at the iteration generated by Algorithm 2 is

(11)

where , . Note that since is obtained by the gradient estimator (9), it depends on the sequences and . For clarity, hereafter we use to represent .


6.2.1 Analysis under General Convex Case

Here we study the case where the objective function satisfies Assumptions 1 and 2. The corresponding convergence of Algorithm 2 with the estimator in Eq. (9) is presented in Theorem 1.

Theorem 1.

Given Assumptions 1 and 2, for any t ≥ 1 in Algorithm 2 with the gradient estimator of Eq. (9), we have

(13)

To minimize the upper bound of the convergence error, the step size can be chosen as a constant. Then, in order to guarantee a prescribed convergence error, the minimum number of iterations is determined by the upper bound, and the resulting query complexity for the attacker grows with the ratio ν/μ.

Theorem 1 shows that the defense effect of RND is closely related to the ratio ν/μ. A larger ν/μ leads to a higher upper bound on the convergence error and a slower convergence rate, corresponding to better defense performance of RND against query-based attacks in practice. Specifically, if ν is much smaller than μ, the extra factor in the query complexity is only a constant; only when ν is sufficiently larger than μ is the query complexity really increased. This accords with the intuition that the defender should insert larger random noise (i.e., ν) than that added by the attacker (i.e., μ) to achieve a satisfactory defense effect. On the other hand, the gap between F_ν and the true objective increases with ν, leading to a larger influence on each individual query. Thus, RND should strike a good balance between not harming the accuracy on individual normal queries and achieving good defense performance against iterative query-based attacks.

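The role of the ratio ν/μ can also be checked numerically: on a simple smooth function, the cosine similarity between the averaged RND-perturbed estimate of Eq. (9) and the true gradient collapses as ν/μ grows. The toy setup below (quadratic objective, dimension, and trial counts) is our own illustration, not an experiment from the paper.

```python
import numpy as np

def rnd_rgf(F, x, mu, nu):
    u = np.random.randn(*x.shape)
    d1, d2 = np.random.randn(*x.shape), np.random.randn(*x.shape)
    return (F(x + mu * u + nu * d1) - F(x + nu * d2)) / mu * u

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

d = 300                                       # moderate dimension keeps the demo fast
A = np.random.rand(d) + 0.5                   # positive curvature of a toy quadratic
F = lambda z: float(0.5 * np.sum(A * z * z))  # smooth convex objective
x = np.random.randn(d)
true_grad = A * x
mu = 0.01
for ratio in [0.1, 1.0, 10.0]:
    est = np.mean([rnd_rgf(F, x, mu, ratio * mu) for _ in range(500)], axis=0)
    print(f"nu/mu = {ratio:5.1f}   cos(estimate, true gradient) = {cosine(est, true_grad):.3f}")
```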

6.3 Theoretical Analysis of Random Noise Defense against ZO Attacks

We first give the assumptions needed in the following analysis.

Assumption 4.

F(x) is a convex function w.r.t. x.

Assumption 5.

F(x) is Lipschitz-continuous, i.e., |F(x) − F(y)| ≤ L_0 ‖x − y‖_2 for all x, y.

Assumption 6.

F(x) is continuously differentiable, and ∇F(x) is Lipschitz-continuous, i.e., ‖∇F(x) − ∇F(y)‖_2 ≤ L_1 ‖x − y‖_2 for all x, y.

Then, we give an important definition:

(14)   F_ν(x) = E_{δ∼N(0, I_d)}[F(x + νδ)].

For the attacker, since the defender adds noise to each query, the objective function effectively becomes the Gaussian-smoothed function F_ν.

The attacker's goal is to generate a successful adversarial example as quickly as possible, which also means making the algorithm converge as fast as possible. In the next sections, we show that the randomness introduced by RND leads to slower convergence, corresponding to better defense performance.

6.3.1 General Convex Case

In this subsection, based on the convergence analysis of the attack algorithm against RND, we show that random noise with a larger magnitude ν increases the upper bound of the convergence error, which leads to a slower convergence rate and a better defense.

We assume the objective function satisfies Assumptions 4 and 5, and we use the Euclidean norm throughout the theoretical analysis. We denote by u_t the random direction composed of i.i.d. Gaussian variables attached to iteration t of the scheme, and by δ_t the corresponding noise added by the defender. The iterate x_t is generated by Eq. (7) with the gradient estimator (9), and thus depends on {u_k} and {δ_k}.

Theorem 2.

Assume F satisfies Assumptions 4 and 5, and let the sequence {x_t} be generated by Algorithm 2 with the estimator (9). Then for any t ≥ 1 we have

(15)

In order to guarantee a prescribed convergence error, we can choose a constant step size; the resulting query complexity again depends on the ratio ν/μ.

Remark 2.

From Theorem 2, the upper bound of the convergence error (the right-hand side of Eq. (15)) depends on the ratio ν/μ, so this ratio is the key factor in the effectiveness of RND. A larger ratio contributes to a higher upper bound of the convergence error and a slower convergence rate, corresponding to better defense performance against query-based attacks in practice.

Based on the above analysis, RND cannot always guarantee an effective defense against query-based attacks, especially when the noise magnitude chosen by the defender is smaller than or comparable to the magnitude of the random vectors sampled by the attacker. For example, when ν = μ and the ratio equals 1, the query complexity is not changed and RND cannot improve the defense effect against query-based attacks. The experimental results in Section 7.3 also verify this. Therefore, the defender needs to adopt a larger noise magnitude ν than μ for each query, while still guaranteeing that normal predictions are not affected by this extra randomness. If the random noise added by the defender is very large, the gap between F_ν and the true model prediction increases. Thus, RND creates a trade-off between the natural accuracy on normal queries and an effective defense against malicious ones.

6.3.2 Analysis under General Non-Convex Case

Here we study the convergence in a more challenging case where the objective only satisfies Assumption 5. We first define

(16)

which is the smoothed version of the attacker's objective.

Theorem 3.

Given Assumption 5, for any t ≥ 1 in Algorithm 2 with the estimator (9), we have

(17)

To bound the gap between the smoothed function and the original objective, we can choose a sufficiently small μ (Nesterov and Spokoiny, 2017). Due to the non-convexity assumption, we only guarantee convergence to a stationary point of the smoothed function, which is a smoothing approximation of the attacker's objective. To minimize the upper bound (i.e., the right-hand side), we can choose a constant step size; the resulting upper bound on the expected number of queries again depends on the ratio ν/μ.

Similar to the convex case (see Theorem 2), Theorem 3 also shows that a larger ratio ν/μ leads to a higher upper bound of the convergence error and a slower convergence rate, corresponding to better defense performance of RND.

6.3.3 Analysis under General Non-Convex Case

In this subsection, based on the convergence analysis of the attack algorithm in the non-convex case, we show that random noise with a larger magnitude ν increases the upper bound on the expected squared norm of the gradient, which leads to a slower convergence rate to a stationary point and a better defense.

We assume the function F satisfies Assumption 5, and we define

(18)

which is the smoothed version of the attacker's objective function.

Theorem 4.

Assume F satisfies Assumption 5, and let the sequence {x_t} be generated by Algorithm 2 with the estimator (9). To bound the gap between the smoothed function and the original objective, we can choose a sufficiently small μ (Nesterov and Spokoiny, 2017). Then for any t ≥ 1 we have

(19)

To minimize the upper bound, we can choose a constant step size. We only guarantee convergence of Algorithm 2 with the estimator (9) to a stationary point of the smoothed function, which is a smooth approximation of the attacker's objective. Then, to guarantee that the expected squared gradient norm is below a prescribed threshold, the resulting upper bound on the expected number of queries again depends on the ratio ν/μ.

From Theorem 4, the upper bound on the expected number of queries still depends on the ratio ν/μ. As in the convex case, a larger ratio contributes to a higher upper bound of the convergence error and a slower convergence rate.

6.4 Theoretical Analysis of Random Noise Defense against Adaptive Attacks

As suggested in recent studies on robust defense (Athalye et al., 2018a; Carlini et al., 2019; Tramer et al., 2020), the defender should perform a robust evaluation against the corresponding adaptive attack, in which the attacker is aware of the defense mechanism. Here we study the defense effect of RND against adaptive black-box attacks. Since the idea of RND is to insert random noise into each query to disturb the gradient estimation, an adaptive attacker can utilize Expectation Over Transformation (EOT) (Athalye et al., 2018b) to obtain a more accurate estimate, i.e., by querying the same sample multiple times and averaging. Then, the gradient estimator used in Algorithm 2 becomes

(20)

where m denotes the number of repeated queries per point (the EOT sample size). Note that the definition of the sequential random noises added by the defender (see Section 6.2) should be updated accordingly. The convergence analysis of Algorithm 2 with the estimator (20) against RND is presented in Theorem 5. Due to the space limit, we only present the analysis for the convex case satisfying Assumptions 4, 5 and 6; the analysis can also be extended to the non-convex case, as shown in the supplementary material.
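A sketch of the EOT-style estimator is given below: the attacker repeats the same finite difference m times against the defended model (which draws fresh noise on every call) and averages the results. The exact averaging scheme of Eq. (20) is not reproduced here, so this is one plausible instantiation with our own names.

```python
import numpy as np

def eot_rnd_rgf(F_defended, x, mu=0.01, m=10):
    """EOT-averaged RGF estimate against RND.

    F_defended internally adds the defender's fresh random noise on every call,
    so averaging m repeated finite differences reduces, but does not remove,
    the variance injected by the defense.
    """
    u = np.random.randn(*x.shape)
    diffs = [(F_defended(x + mu * u) - F_defended(x)) / mu for _ in range(m)]
    return float(np.mean(diffs)) * u
```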

Theorem 5.

Given Assumptions 4, 5 and 6, let the sequence {x_t} be generated by Algorithm 2 with the estimator in Eq. (20). Then, for any t ≥ 1, we have

(21)
Corollary 1.

With a larger EOT sample size m, the upper bound is reduced, so EOT can mitigate the defense effect caused by the randomness of RND. However, as m goes to infinity, the upper bound on the expected convergence error (i.e., Eq. (21)) becomes

(22)

where the max term is still determined by the ratio ν/μ. This implies that the attack improvement from EOT is limited, especially for a larger ratio ν/μ.

In this subsection, we discuss the adaptive attack corresponding to RND. According to recent work on evaluating robust defenses (Athalye et al., 2018a; Carlini et al., 2019; Tramer et al., 2020), the defender should perform a robust evaluation against the corresponding adaptive attack, i.e., a stronger attacker who is aware of the details of the defense mechanism. As discussed in (Athalye et al., 2018a), RND obfuscates the gradient estimated and utilized by the attacker in gradient estimation methods; it belongs to the stochastic-gradient category, relying on test-time randomness added to each query. Since the attacker knows that only noisy function values are returned for each query, they can query the same sample many times and average the results to mitigate the randomness. To design the adaptive attack, the attacker can estimate gradients of randomized defenses by applying Expectation Over Transformation (EOT) (Athalye et al., 2018b). The gradient estimator in Algorithm 2 then becomes

(23)

where the attacker's direction and the defender's noises are random vectors sampled from the standard normal distribution N(0, I_d). Next, we give the convergence analysis of the basic ZO black-box attack with EOT against RND.

We assume the function F satisfies Assumptions 4, 5, and 6. A similar analysis applies to non-convex functions, as shown in the supplementary material.

Theorem 6.

Let the sequence {x_t} be generated by Algorithm 2 with the estimator (23). Then for any t ≥ 1 we have

(24)
Corollary 2.

With a larger EOT sample size m, the upper bound is reduced, so EOT can alleviate the effect of the randomness introduced by RND on the convergence of the attack algorithm. However, as m goes to infinity, the upper bound on the expected convergence error (Eq. (24)) becomes

(25)

where the max term is still determined by the ratio ν/μ. This also illustrates that the attack improvement from EOT is limited, especially for a larger ratio ν/μ.

We can also conclude that the effect of EOT depends on the input dimension. Given a fixed ν from the defender, when the dimension-dependent condition in Theorem 6 is satisfied, which means the data samples are low-dimensional, the upper bound (the right-hand side of Eq. (24)) is determined by the term that EOT can reduce, rather than by the ratio ν/μ. In that regime, EOT can effectively reduce the effect of the defender's random noise.

In the image classification experiments, for example, the input dimensions of ImageNet and CIFAR-10 are 150,528 and 3,072, respectively, and the noise magnitude adopted in the experiments is 0.01 or 0.02. ImageNet does not satisfy the low-dimensional condition, while CIFAR-10 does. The experimental results in Section 7.5 validate these claims.

6.5 RND with Gaussian Augmentation Training

The aforementioned theoretical analyses tell us that the RND defense mechanism should choose a proper noise magnitude to achieve a good balance between preserving the accuracy on normal queries and the defense effectiveness against query-based black-box attacks. To achieve a high-quality balance, we can reduce the sensitivity of the target model to random noise, so that the influence of the noise on the accuracy of each individual query is reduced. One straightforward method is Gaussian Augmentation Training (GT), which adds random noise to each training sample as a pre-processing step during training. Consequently, the model trained with GT is expected to maintain good accuracy on each individual query, even though RND adds relatively large random noise to each query.

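In implementation, GT only changes the data fed to an otherwise standard training loop; a minimal PyTorch-style sketch (the noise level, optimizer, and schedule are placeholders, not the paper's training recipe) is:

```python
import torch
import torch.nn as nn

def gaussian_augmentation_train(model, loader, epochs=1, sigma=0.02, lr=0.1, device="cpu"):
    """Standard cross-entropy training with Gaussian noise added to every training batch."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            noisy = images + sigma * torch.randn_like(images)   # Gaussian augmentation
            loss = criterion(model(noisy), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```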

7 Experiments

In this section, we conduct extensive experiments on benchmark datasets to verify our theoretical results and to evaluate the effectiveness of RND against mainstream query-based black-box attack methods.

Figure 1: Attack failure rate (%) of query-based attacks on VGG-16 and CIFAR-10 under different values of μ and ν. We adopt a logarithmic scale in subplots (a-c) for better illustration. The complete evaluation under the other norm constraint is given in the supplementary material.
Figure 2: Attack failure rate (%) of query-based attacks on Inception v3 and ImageNet under different values of μ and ν. We adopt a logarithmic scale in subplots (a-c) for better illustration. The complete evaluation under the other norm constraint is given in the supplementary material.

7.1 Experimental Settings


The details of the hyperparameters are given in the supplementary material. To evaluate the robustness of the defense, we consider two main factors controlled by the adversary: the smoothing parameter μ and the EOT sample size m. By trying combinations of μ and m, we choose the attack parameters with the best attack effect and evaluate the two defense mechanisms against them in Sections 7.5 and 7.7.

Settings of the compared defense methods

We compare our method with AT, RSE, and the pure GT model (Rusak et al., 2020). For adversarial training, we adopt the WideResNet-16 model as the black-box model, set the maximum distortion of the adversarial image following the experimental protocol in (Madry et al., 2018), and run 10 iterations of PGD with a constant step size of 2.0, as done in (Madry et al., 2018). For RSE, we use the pretrained VGG-16 model provided by the authors and also train the WideResNet-16 model using their code. For GT, we train the VGG-16 and WideResNet-16 models by adding Gaussian noise to each training sample. For the ImageNet dataset, we choose the fine-tuned ResNet-50 GT model with Gaussian augmentation provided by (Rusak et al., 2020).

Datasets and Classification Models.

We conduct experiments on two widely used benchmark datasets in adversarial machine learning: CIFAR-10 (Krizhevsky and others, 2009) and ImageNet (Deng et al., 2009). CIFAR-10 includes 50k training images and 10k test images from 10 classes. ImageNet contains 1,000 classes, with 1.28 million images for training and 50k images for validation. For the classification models, we use VGG-16 (Simonyan and Zisserman, 2014) and WideResNet-16 (Zagoruyko and Komodakis, 2016) on CIFAR-10, both trained with standard training. For ImageNet, we adopt the pretrained Inception v3 (Szegedy et al., 2016) and ResNet-50 (He et al., 2016) models provided by the torchvision package.

Black-box Attack Methods.

We consider several mainstream query-based black-box attack methods, including NES (Ilyas et al., 2018a), ZO-signSGD (ZS) (Liu et al., 2018a), Bandit (Ilyas et al., 2018b), DPD (Zhang et al., 2020), SimBA (Guo et al., 2019a), SignHunter (Al-Dujaili and O'Reilly, 2020) and Square (Andriushchenko et al., 2020). Note that NES, ZS and Bandit are gradient-estimation-based attack methods, while the other four methods are search-based. SimBA is only designed for the ℓ2 attack. For DPD, the authors only provide the code for the attack on ImageNet. Following (Ilyas et al., 2018a), we evaluate all attack methods on the whole test set of CIFAR-10 and on 1,000 images randomly sampled from the validation set of ImageNet. We report the performance against untargeted attacks in this section, under both the ℓ2 and ℓ∞ attack settings, with a fixed perturbation budget for each dataset and norm. The maximal number of queries is set to 10,000. We adopt the attack failure rate as the evaluation metric: the higher the attack failure rate, the better the adversarial defense performance.
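For clarity, the evaluation metric can be computed as follows; the per-image record format (success flag and query count) is our own assumption.

```python
def attack_failure_rate(results, budget=10000):
    """results: list of (success, queries_used) pairs, one record per attacked image.

    An attack on an image fails if it never succeeds or exceeds the query budget.
    """
    failures = sum(1 for success, queries in results if (not success) or queries > budget)
    return failures / len(results)

# example: attack_failure_rate([(True, 1200), (False, 10000), (True, 10001)]) -> 0.666...
```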

7.2 The Evaluation of RND against Query-based Attacks

Following the theoretical analysis in Section 6.3, the key factor of the RND defense against query-based attacks is the ratio ν/μ: a larger ratio contributes to a slower attack process and better defense performance under a limited query budget. We first evaluate the RND defense performance against several attack methods under different ratios. To guarantee good natural accuracy, we choose a small ν, and the smoothing parameter μ is chosen from a set of candidate values. The natural accuracy drop caused by RND on CIFAR-10 is not significant for small ν on both VGG-16 and WideResNet-16, and becomes large for larger ν; thus, a sufficiently small ν is important for maintaining the clean accuracy of the naturally trained model. The Inception v3 model is better, with a natural accuracy of