
Boundary Defense Against Black-box Adversarial Attacks

Black-box adversarial attacks generate adversarial samples via iterative optimizations using repeated queries. Defending deep neural networks against such attacks has been challenging. In this paper, we propose an efficient Boundary Defense (BD) method which mitigates black-box attacks by exploiting the fact that the adversarial optimizations often need samples on the classification boundary. Our method detects the boundary samples as those with low classification confidence and adds white Gaussian noise to their logits. The method's impact on the deep network's classification accuracy is analyzed theoretically. Extensive experiments are conducted and the results show that the BD method can reliably defend against both soft and hard label black-box attacks. It outperforms a list of existing defense methods. For IMAGENET models, by adding zero-mean white Gaussian noise with standard deviation 0.1 to logits when the classification confidence is less than 0.3, the defense reduces the attack success rate to almost 0 while limiting the classification accuracy degradation to around 1 percent.



I Introduction

Deep neural networks (DNNs) have seen increasing demand in many practical applications [chen2015deepdriving][hinton2012deep][krizhevsky2012imagenet]. However, studies over the past few years have also revealed an intriguing issue: DNN models are highly sensitive and vulnerable to adversarial samples [szegedy2013intriguing][biggio2013evasion], implying potential security threats to their applications.

One of the widely studied adversarial attacks is the evasion attack, where the attacker's main aim is to cause misclassification in the DNN model. Black-box evasion attacks have attracted increasing research interest recently, where black-box means that the attacker does not know the DNN model but can query the model to get the DNN inference outputs, either the detailed confidence scores or just a classification label [alzantot2019genattack][brendel2017decision][chen2017zoo][tu2019autozoom][ilyas2018black][cheng2019improving][cheng2018query][cheng2019sign][li2019nattack][chen2020hopskipjumpattack][guo2019simple]. If the attacker has access to the full output logit values, they can apply soft-label attack algorithms such as [chen2017zoo][tu2019autozoom][ilyas2018black][alzantot2019genattack][guo2019simple]. On the other hand, if the attacker has access to only the classification label, they can apply hard-label attack algorithms such as [brendel2017decision][cheng2019sign][chen2020hopskipjumpattack].

Along with the surge of attack algorithms, there has been an increase in the development of defense algorithms such as Adversarial Training (AT) [tramer2017ensemble], input transformation [buckman2018thermometer][samangouei2018defense], gradient obfuscation [papernot2016distillation], and stochastic defense via randomization [he2019parametric][wang2019protecting][qin2021random][nesti2021detecting][liang2018detecting][fan2019integration][li2018certified]. However, limitations of existing defense techniques have also been observed [athalye2018obfuscated][carlini2017adversarial][carlini2017towards]. Stochastic defenses have been shown to suffer either large degradation of DNN performance or limited defense effectiveness, and gradient obfuscation has been proven ineffective.

In this work, we develop an efficient and more effective method to defend the DNN against black-box attacks. During the adversarial attack’s optimization process, there is a stage in which the adversarial samples lie on the DNN’s classification boundary. Our method, Boundary Defense (BD), detects these boundary samples as those whose classification confidence score falls below a threshold θ and adds white Gaussian noise with standard deviation σ to their logits. This prevents attackers from optimizing their adversarial samples while keeping the DNN’s performance degradation low.

Major contributions of this work are:

  • A new boundary defense algorithm BD is developed, which can be implemented efficiently and reliably mitigates both soft- and hard-label black-box attacks.

  • Theoretical analysis is conducted to study the impact of the parameters θ and σ on the classification accuracy.

  • Extensive experiments are conducted, which demonstrate that BD(0.3, 0.1) (or BD(0.7, 0.1)) reduces the attack success rate to almost 0 with around 1% (or negligible) classification accuracy degradation over the IMAGENET (or MNIST/CIFAR10) models. The defense performance is shown to be superior to that of a list of existing defense algorithms.

The organization of this paper is as follows. In Section II, related works are introduced. In Section III, the BD method is explained. In Section IV, experiment results are presented. Finally, conclusions are given in Section V.

II Related Work

Black-box adversarial attacks can be classified into soft-label and hard-label attacks. In soft-label attacks like AutoZOOM [chen2017zoo][cheng2018query] and NES-QL [ilyas2018black], the attacker generates adversarial samples using the gradients estimated from queried DNN outputs. In contrast, SimBA [guo2019simple], GenAttack [alzantot2019genattack] and Square Attack [andriushchenko2020square] resort to direct random search to obtain the adversarial sample. Hard-label attacks like NES-HL [ilyas2018black], BA (Boundary Attack) [brendel2017decision], Sign-OPT [cheng2018query][cheng2019sign], and HopSkipJump [chen2020hopskipjumpattack] start from an initial adversarial sample and iteratively reduce the distance between the adversarial sample and the original sample based on the query results.

For the defense against black-box attacks, many methods are derived directly from defenses against white-box attacks, such as input transformation [dziugaite2016study], network randomization [xie2018mitigating] and adversarial training [tramer2020adaptive]. The defenses designed specifically for black-box attacks are denoised smoothing [salman2020denoised], malicious query detection [chen2020stateful][li2020blacklight][pang2020advmind], and random smoothing [cohen2019certified][salman2019provably]. Nevertheless, their defense performance is not reliable, or their defense cost or complexity is too high. Adding random noise to defend against black-box attacks has been studied recently as a low-cost approach, where [byun2021small][qin2021theoretical][xie2018mitigating] add noise to the input, and [lecuyer2019certified][liu2018towards][he2019parametric] add noise to the input or the weights of each layer. Unfortunately, heavy noise is needed to defend against hard-label attacks (in order to change hard labels), but heavy noise leads to severe degradation of DNN accuracy. Our proposed BD method follows a similar approach, but we add noise only to the DNN outputs of the boundary samples, which makes it possible to apply heavy noise without significant degradation of DNN accuracy.

III Boundary Defense

III-A Black-box attack model

Consider a DNN that classifies an input image x into a class label y within N classes. The DNN output is the softmax logit (or confidence score) tensor F(x) ∈ [0, 1]^N. The classification result is y = argmax_i F_i(x), where F_i denotes the i-th element function of F, i = 1, …, N. The attacker does not know the DNN model but can send samples to query the DNN and get either F(x) or just y. The objective of the attacker is to generate an adversarial sample x_adv = x + δ such that the output of the classifier is a target label y_t ≠ y, where the adversarial perturbation δ should be as small as possible.

Soft-Label Black-box Attack: The attacker queries the DNN to obtain the softmax logit output tensor F(x_adv). With this information, the attacker minimizes the loss function

L(x_adv) = d(x_adv, x) + λ ℓ(F(x_adv), y_t)    (1)

for generating the adversarial sample x_adv [tu2019autozoom], where d(·,·) is a distance function, e.g., the ℓ2 distance, and ℓ(·,·) is the loss function, e.g., cross-entropy [ilyas2018black] and C&W loss [carlini2017adversarial].
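For concreteness, the soft-label objective above can be sketched in a few lines of Python. This is a toy illustration under our own naming (`attack_loss`, `lam`); the ℓ2 distance and cross-entropy terms are the example choices named above, not the exact code of any cited attack.

```python
import numpy as np

def attack_loss(x_adv, x, logits, target, lam=1.0):
    """Toy soft-label attack objective: distance term d(x_adv, x) plus
    lam times a cross-entropy loss on the queried softmax logits.
    `attack_loss` and `lam` are illustrative names, not from any
    released attack implementation."""
    d = np.linalg.norm(np.asarray(x_adv) - np.asarray(x))  # l2 distance term
    p_t = max(float(logits[target]), 1e-12)                # target confidence
    ce = -np.log(p_t)                                      # cross-entropy term
    return d + lam * ce
```

As the queried confidence of the target label grows, the loss shrinks, which is exactly the signal the iterative optimization follows.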

Hard-Label Black-box Attack: The attacker does not use F(x_adv) but instead uses the class label y to optimize the adversarial sample x_adv. A common approach for the attacker is to first find an initial sample x_0 in the target class y_t, i.e., argmax_i F_i(x_0) = y_t. Then, starting from x_0, the attacker iteratively estimates new adversarial samples x_adv in the class y_t so as to minimize the distance d(x_adv, x).

The above model is valid for both targeted and untargeted attacks. The attacker’s objective is to increase the attack success rate (ASR), reduce the query count (QC), and reduce the sample distortion ‖δ‖. In this paper, we assume that the attacker has a large enough QC budget and can adopt either soft-label or hard-label black-box attack algorithms. Thus, our proposed defense’s main objective is to reduce the ASR to 0.

Fig. 1: Schematic representation of the black-box attack and the Boundary Defense BD (highlighted region).

Iii-B Boundary Defense Algorithm

In this work, we propose a Boundary Defense method that defends the DNN against black-box (both soft- and hard-label, both targeted and untargeted) attacks by preventing the attacker’s optimization of the attack loss or distance. As illustrated in Fig. 1, for each query x, once the defender finds that the classification confidence max_i F_i(x) is less than a certain threshold θ, the defender adds zero-mean white Gaussian noise with a certain standard deviation σ to all the elements of F(x). The DNN softmax logits thus become

z = F(x) + n,  n ~ N(0, σ² I),    (2)

where n is the noise vector and I is an N × N identity matrix. The DNN outputs the softmax logits clip{z, 0, 1} when outputting soft labels, or the classification label argmax_i z_i when outputting hard labels.
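The BD rule above amounts to a few lines wrapped around the model's output. The following sketch assumes a softmax logit vector as input and uses our own function and parameter names; it is a minimal illustration, not the authors' released implementation.

```python
import numpy as np

def boundary_defense(logits, theta=0.3, sigma=0.1, rng=None):
    """Boundary Defense (BD) sketch: if the query's top confidence falls
    below the threshold theta, scramble all logits with zero-mean white
    Gaussian noise of standard deviation sigma, then clip back to [0, 1].
    Follows the BD(theta, sigma) notation in the text; a fixed seed is
    used here only to make the sketch reproducible."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = np.asarray(logits, dtype=float)
    if z.max() < theta:                      # boundary sample detected
        z = np.clip(z + rng.normal(0.0, sigma, size=z.shape), 0.0, 1.0)
    return z

confident = boundary_defense([0.9, 0.05, 0.05])   # returned untouched
boundary = boundary_defense([0.25, 0.2, 0.2])     # scrambled
```

A soft-label API would return the clipped vector; a hard-label API would return only its argmax.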

We call it the BD algorithm because samples with low confidence scores are usually on the classification boundary. For a well-designed DNN, clean (non-adversarial) samples can usually be classified accurately with high confidence scores. Samples with low confidence scores occur rarely and have low classification accuracy. In contrast, when the attacker optimizes the attack loss or distance, there is always a stage where the adversarial samples have small maximum logit values.

For example, in soft-label black-box targeted attacks, the attacker needs to maximize the y_t-th logit value F_{y_t}(x_adv) by minimizing the cross-entropy loss ℓ(F(x_adv), y_t). Initially F_{y_t}(x_adv) is very small and the loss is large. The optimization increases F_{y_t}(x_adv) while reducing the original class’s logit value. There is a stage where all logit values are small, which means x_adv is lying on the classification boundary.

As another example, a typical hard-label black-box targeted attack algorithm first finds an initial sample inside the target class y_t, which we denote as x_0. The algorithm often uses line search to find a boundary sample x_b = αx + (1 − α)x_0 that maintains the label y_t, where α ∈ [0, 1] is the optimization parameter. Then the algorithm randomly perturbs x_b, queries the DNN, and uses the query results to find the direction in which to optimize x_b. Obviously, x_b must be on the decision boundary so that the randomly perturbed x_b will lead to changing DNN hard-label outputs. Otherwise, all the query results will be the constant output y_t, which is useless to the attacker’s optimization process.
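The line search described above can be sketched as a simple bisection between the initial adversarial sample and the original sample. The helper names (`boundary_line_search`, `predict`) are ours, and the toy one-dimensional classifier in the test is purely illustrative.

```python
import numpy as np

def boundary_line_search(x, x0, predict, y_t, tol=1e-3):
    """Bisection sketch of a hard-label attack's line search: find the
    largest interpolation coefficient a toward the original x such that
    x_b = a*x + (1-a)*x0 still carries the target label y_t, i.e., a
    sample essentially on the decision boundary. `predict` returns a
    hard label only, matching the hard-label threat model."""
    lo, hi = 0.0, 1.0  # a=0 gives x0 (label y_t); a=1 gives x (label != y_t)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if predict(mid * x + (1.0 - mid) * x0) == y_t:
            lo = mid   # still inside the target class: move closer to x
        else:
            hi = mid
    return lo * x + (1.0 - lo) * x0
```

Because the returned x_b sits at the label transition, tiny random perturbations of it flip the hard-label output, which is exactly the query signal BD scrambles.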

Therefore, for soft-label attacks there is an unavoidable stage of having boundary samples and for hard-label attacks the boundary samples are essential. Our BD method exploits this weakness of black-box attacks by detecting these samples and scrambling their query results to prevent the attacker from optimizing its objective.

One of the advantages of the BD algorithm is that it can be implemented efficiently and inserted into DNN models conveniently with minimal coding. Another advantage is that the two parameters (θ, σ) make it flexible enough to work reliably. Large θ and σ lead to small ASR but significant DNN performance degradation. Some attacks are immune to small noise (small σ), such as the HopSkipJump hard-label attack [chen2020hopskipjumpattack]. Some other attacks such as SimBA [guo2019simple] are surprisingly immune to large noise in boundary samples, which means that simply removing boundary samples or adding extra-large noise to boundary samples as suggested in [chen2020hopskipjumpattack] does not work. The flexibility of (θ, σ) makes it possible for the BD method to deal with such complications and to outperform other defense methods.

III-C Properties of Boundary Samples

In this section, we study BD’s impact on the DNN’s classification accuracy (ACC) when there is no attack, which provides useful guidance for the selection of θ and σ.

Fig. 2: Impact of the parameters θ and σ on classification accuracy (ACC). (a) ACC as a function of the true logit value z_t. (b) ACC when boundary defense BD(θ, σ) is applied. CleanACC is the DNN’s ACC without attack/defense.

Consider a clean sample x with true label t and confidence z_t = F_t(x). Since the DNN is trained with the objective of maximizing z_t, we can assume that all the other logit values z_i, i ≠ t, are independent and identically distributed uniform random variables with values within 0 to 2(1 − z_t)/(N − 1), i.e., z_i ~ U(0, 2(1 − z_t)/(N − 1)). Without loss of generality, let z_m be the maximum among these values. The normalized sum of the z_i follows the Irwin-Hall distribution with cumulative distribution function (CDF)

F_S(s) = (1/(N − 1)!) Σ_{k=0}^{⌊s⌋} (−1)^k C(N − 1, k) (s − k)^{N−1}.    (3)

When N is large, the distribution of z_m can be approximated as normal N(μ_m, σ_m²). We denote its CDF as Φ_m(z). Since the sample is classified accurately if and only if z_t > z_m, the classification accuracy can be derived as

ACC(z_t) = Φ_m(z_t).    (4)

Using (4), we can calculate ACC(z_t) for each z_t, as shown in Fig. 2(a) for different numbers of classes N. It can be seen that if z_t is below approximately 0.32, then the sample’s classification ACC is almost 0. This means that we can set θ = 0.3 to safely scramble all those queries whose maximum logit value is less than 0.32 without noticeable ACC degradation.
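The trend behind Fig. 2(a) can be sanity-checked numerically. The sketch below estimates ACC(z_t) = Pr(z_t > z_m) by Monte-Carlo under the uniform-logit assumption stated above; it verifies the qualitative trend (low-confidence samples are almost never classified correctly) rather than the paper's exact closed form, and the function name and defaults are ours.

```python
import numpy as np

def acc_clean(z_t, n_classes=10, trials=20000, seed=0):
    """Monte-Carlo estimate of ACC(z_t) = Pr(z_t > max_i z_i), assuming
    the other N-1 logits are i.i.d. uniform on (0, 2(1 - z_t)/(N - 1)),
    as in the text. A numerical trend check, not the exact formula."""
    rng = np.random.default_rng(seed)
    hi = 2.0 * (1.0 - z_t) / (n_classes - 1)
    others = rng.uniform(0.0, hi, size=(trials, n_classes - 1))
    return float((others.max(axis=1) < z_t).mean())
```

Evaluating `acc_clean` over a grid of z_t values reproduces the sharp transition from near-0 to near-1 accuracy that motivates the threshold θ.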

Next, to evaluate the ACC when BD is applied, we assume the true label t’s logit value z_t follows approximately a half-normal distribution, whose probability density function is

f(z_t) = (√2 / (b√π)) exp(−(1 − z_t)² / (2b²)),  0 ≤ z_t ≤ 1,    (5)

with the parameter b. The ACC of the DNN without attack or defense (which we call cleanACC) is then

cleanACC = ∫_0^1 ACC(z_t) f(z_t) dz_t.    (6)

Using (5)-(6), we can find the parameter b for each cleanACC. For example, for a DNN with clean ACC 90%, the distribution of the true logit z_t follows (5) with the correspondingly fitted b.

When noise is added, each z_i becomes z_i + n_i for noise n_i ~ N(0, σ²). Following a derivation similar to (3)-(4), we can obtain the ACC of the noise-perturbed logits as

ACC_n(z_t) = Φ̃_m(z_t),    (7)

where Φ̃_m is the CDF of the corresponding new normal distribution with the noise variance added. The ACC under the defense is then

ACC_BD = ∫_0^θ ACC_n(z_t) f(z_t) dz_t + ∫_θ^1 ACC(z_t) f(z_t) dz_t.    (8)

Fig. 2(b) shows how the defense ACC degrades as θ and σ increase. We can see that with a small θ there is almost no ACC degradation even for relatively large σ. For a larger θ, the ACC degradation is very small when σ is small but grows to 5% when both parameters are large. Importantly, under a small θ we can apply larger noise safely without obvious ACC degradation. This shows the importance of scrambling boundary samples only. Existing defenses scramble all the samples, which corresponds to θ = 1, and thus suffer from significant ACC degradation.
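The defended accuracy in (8) can likewise be approximated by simulation. The sketch below applies noise only to draws whose maximum logit falls below θ, mirroring the BD rule; it uses the same uniform-logit assumption as above, and the names and constants are illustrative.

```python
import numpy as np

def acc_defended(z_t, theta, sigma, n_classes=10, trials=20000, seed=0):
    """Monte-Carlo sketch of the BD-defended accuracy for a clean sample
    with true-label confidence z_t: draw the other logits as i.i.d.
    uniforms, and whenever a draw is a boundary sample (max logit below
    theta) perturb every logit with N(0, sigma^2) noise before taking
    the argmax. Illustrates the trend of Fig. 2(b) only."""
    rng = np.random.default_rng(seed)
    hi = 2.0 * (1.0 - z_t) / (n_classes - 1)
    others = rng.uniform(0.0, hi, size=(trials, n_classes - 1))
    logits = np.concatenate([np.full((trials, 1), z_t), others], axis=1)
    boundary = logits.max(axis=1) < theta          # BD detection rule
    noise = rng.normal(0.0, sigma, size=logits.shape)
    noisy = logits + noise * boundary[:, None]     # scramble boundary draws only
    return float((noisy.argmax(axis=1) == 0).mean())
```

With a confident sample (z_t well above θ) the noise is never triggered and accuracy is unaffected, whereas θ = 1 (scramble everything, as plain noise defenses effectively do) visibly degrades accuracy.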

IV Experiments

IV-A Experiment Setup

In the first experiment, using the full validation datasets of MNIST (10,000 images), CIFAR10 (10,000 images), and IMAGENET (50,000 images), we evaluated the classification accuracy degradation of a list of popular DNN models when our proposed BD method is applied.

In the second experiment, with validation images of MNIST/CIFAR10 and IMAGENET, we evaluated the defense performance of our BD method against several state-of-the-art black-box attack methods, including the soft-label attacks AZ (AutoZOOM) [tu2019autozoom], NES-QL (query limited) [ilyas2018black], SimBA (SimBA-DCT) [guo2019simple], and GA (GenAttack) [alzantot2019genattack], as well as the hard-label attacks NES-HL (hard label) [ilyas2018black], BA (Boundary Attack) [brendel2017decision], HSJA (HopSkipJump Attack) [chen2020hopskipjumpattack], and Sign-OPT [cheng2019sign]. We adopted their original source codes with the default hyper-parameters and just inserted our BD as a subroutine to run after each model prediction call. These algorithms used the InceptionV3 or ResNet50 IMAGENET models. To maintain uniformity and fair comparison, we considered the same norm setting throughout the experiment.

We also compared our BD method with some representative black-box defense methods, including NP (noise perturbation), JPEG compression, Bit-Depth, and TVM (Total Variation Minimization), whose data were obtained from [guo2017countering], for soft-label attacks, and DD (Defensive Distillation) [papernot2016distillation], Region-based classification [cao2017mitigating], and AT (Adversarial Training) [goodfellow2014explaining] for hard-label attacks.

In order to have a more persuasive and comprehensive study of the robustness of the proposed BD method, we also performed experiments using RobustBench models [croce2021robustbench], such as the RMC (Runtime Masking and Cleaning) [wu2020adversarial], RATIO (Robustness via Adversarial Training on In- and Out-distribution) [augustin2020adversarial], RO (Robust Overfitting) [rice2020overfitting], MMA (Max-Margin Adversarial) [ding2018mma], ER (Engstrom Robustness) [engstrom2019adversarial], RD (Rony Decoupling) [rony2019decoupling], and PD (Proxy Distribution) [sehwag2021improving] models, over the CIFAR10 dataset for various attack methods.

As the primary performance metrics, we considered ACC (the DNN’s classification accuracy) and ASR (the attacker’s attack success rate). The ASR is defined as the ratio of samples classified into the target label y_t. Without defense, the hard-label attack algorithms always output adversarial samples with the label y_t (which means ASR = 100%). Under our defense the ASR is reduced by the added noise, so ASR remains a valid performance measure. On the other hand, since most hard-label attack/defense papers use an ASR defined as the ratio of samples satisfying both the target label and a median distortion (the distortion normalized by the square root of the number of elements of x) less than a certain threshold, we also report our results over this ASR, which we call ASR2.
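The two success metrics can be computed from raw attack results as follows. The per-sample form of the distortion test in this sketch is our simplification of the median-distortion criterion described above, and all names are illustrative.

```python
import numpy as np

def asr_metrics(pred_labels, target_labels, distortions, dist_threshold):
    """Sketch of the two metrics in the text: ASR is the fraction of
    samples classified into the target label; ASR2 additionally requires
    the sample's (normalized) distortion to stay below a preset
    threshold. Applying the threshold per sample rather than to the
    median is our simplification for illustration."""
    pred = np.asarray(pred_labels)
    tgt = np.asarray(target_labels)
    dist = np.asarray(distortions, dtype=float)
    success = pred == tgt
    asr = float(success.mean())
    asr2 = float((success & (dist < dist_threshold)).mean())
    return asr, asr2
```

ASR2 is never larger than ASR, since it adds the distortion condition on top of the label condition.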

We show only the results of targeted attacks in this section. Experiments of untargeted attacks as well as extra experiment data and result discussions are provided in supplementary material.

IV-B ACC Degradation Caused by Boundary Defense

Fig. 3: Top-1 classification accuracy degradation (Defense ACC − Clean ACC) versus the threshold θ. σ = 0.1.

For MNIST, we trained a 5-layer convolutional neural network (CNN) with clean ACC 99%. For CIFAR10, we trained a 6-layer CNN with clean ACC 83% and also applied the pre-trained model of [xu2019pc] with ACC 97%; these are called CIFAR10-s and CIFAR10, respectively. For IMAGENET, we used standard pre-trained models from the official TensorFlow library (ResNet50, InceptionV3, EfficientNet-B7) and the official PyTorch library (ResNet50-tor, InceptionV3-tor), where “-tor” indicates their PyTorch source.

We used the validation images to query the DNN models and applied our BD algorithm to modify the DNN outputs before evaluating the classification ACC. It can be observed from Fig. 3 that with θ = 0.3 we can keep the loss of ACC around 1% for the IMAGENET models (from 0.5% for ResNet50 to 1.5% for InceptionV3). A larger θ leads to near 5% ACC degradation. For MNIST and CIFAR10 the ACC has almost no degradation, while CIFAR10-s has a limited 1.5% ACC degradation for large θ. This fits well with the analysis results shown in Fig. 2(b). In particular, most existing noise defense methods, which do not exploit the boundary (equivalent to θ = 1), would incur up to 5% ACC degradation for IMAGENET models.

IV-C Performance of BD Defense against Attacks

IV-C1 ASR of soft-label black-box attacks

Table I shows the ASR of soft-label black-box attack algorithms under our proposed BD method. To save space we show the data for σ = 0.1 only. Results for varying θ and σ are shown in Fig. 4.

Dataset    Attack    No defense    θ = 0.5    θ = 0.7
MNIST      AZ        100           8          8
           GA        100           0          0
           SimBA     97            3          0
CIFAR10    AZ        100           9          9
           GA        98.76         0          0
           SimBA     97.14         23         15

Dataset    Attack    No defense    θ = 0.1    θ = 0.3
IMAGENET   AZ        100           0          0
           GA        100           0          0
           SimBA     96.5          6          2

TABLE I: ASR (%) of Targeted Soft-Label Attacks. σ = 0.1.
Fig. 4: ASR (%) vs. noise level σ for various boundary thresholds θ. The top row is for MNIST, and the bottom row is for CIFAR10.

From Table I we can see that, as θ increases, the ASR of all the attack algorithms is drastically reduced. Over the IMAGENET dataset, the BD method reduced the ASR of all the attack algorithms to almost 0 with θ = 0.3. For the MNIST/CIFAR10 datasets, the BD method with θ = 0.7 was enough. Fig. 4 shows a consistent decline of ASR as the noise level σ increases. This steady decline indicates robust defense performance of the BD method against the soft-label attacks.

IV-C2 ASR of hard-label black-box attacks

We have summarized the ASR and median distortion of hard-label attacks in the presence of our proposed BD method in Table II.

Dataset    Attack     ASR / median distortion
                      No defense    θ = 0.5    θ = 0.7
MNIST      Sign-OPT   100/0.059     4/0.12     0/-
           BA         100/0.16      17/0.55    9/0.56
           HSJA       100/0.15      38/0.14    7/0.15
CIFAR10    Sign-OPT   100/0.004     4/0.08     0/-
           HSJA       100/0.05      18/0.05    7/0.05

Dataset    Attack     No defense    θ = 0.1    θ = 0.3
IMAGENET   NES-HL     90/0.12       0/-        0/-
           Sign-OPT   100/0.05      14/0.4     0/-
           BA         100/0.08      0/-        0/-
           HSJA       100/0.03      34/0.11    0/-

TABLE II: ASR (%) and Median Distortion of Targeted Hard-Label Attacks. σ = 0.1. “-” means no distortion data due to the absence of successful adversarial samples.

Surprisingly, the BD method performed extremely well against the hard-label attacks that are usually challenging for conventional defense methods. In general, BD(0.3, 0.1) was able to reduce the ASR to 0% over the IMAGENET dataset, and BD(0.7, 0.1) was enough to reduce the ASR to near 0 over MNIST and CIFAR10.

For ASR2, Figure 5 shows how ASR2 varies with the pre-set distortion threshold when the BD method was used to defend against the Sign-OPT attack. We can see that ASR2 is reduced with the increase of either θ or σ, or both. BD(0.7, 0.1) and BD(0.3, 0.1) successfully defended against the Sign-OPT attack over the MNIST/CIFAR10 and IMAGENET datasets, respectively.

Fig. 5: ASR2 (%) versus the median distortion threshold of the Sign-OPT attack under the proposed BD method.

IV-C3 Robust defense performance against adaptive attacks

To evaluate the robustness of the defense, it is crucial to test its performance against adaptive attacks [tramer2020adaptive]. For example, the attacker may change the query limit or the optimization step size. In this subsection, we show the effectiveness of our BD defense against two major adaptive attack techniques: 1) adaptive query count (QC) budget; and 2) adaptive step size.

First, the results obtained with increased attack QC budgets are summarized in Table III. We observe that when the attacker increased the QC budget, there was no significant increase in ASR. Next, we adjusted the optimization (or gradient-estimation) step size of the attack algorithms (such as the search step size of the Sign-OPT algorithm) and evaluated the performance of BD. The ASR data are shown in Fig. 6. We can see that there was no significant change of ASR when the attack algorithms adopted different optimization step sizes. For GenAttack & Sign-OPT, the ASR was almost the same under various step sizes. For SimBA, the ASR slightly increased, but at the expense of heavily distorted adversarial outputs.

Dataset    Attack     ASR under three increasing QC budgets
CIFAR10    GA         0      0      0
           HSJA       3      4      0
           Sign-OPT   5      8      8
IMAGENET   NES-QL     2      12     8
           Boundary   0      0      0
           HSJA       0      0      0
           Sign-OPT   0      0      0

TABLE III: ASR (%) of Adaptive Black-Box Attacks (increased query budgets) under the Proposed BD Method

As a result, we can assert the robustness of the BD method against the black-box adversarial attacks.

Fig. 6: ASR (%) versus the step size of the adaptive attacks. Note that for GenAttack we considered a smaller noise level σ, since the ASR under the default σ was always 0 for all threshold values; for Sign-OPT & SimBA we considered the default σ.
Dataset    Defense                             HSJA    BA    SimBA-DCT
MNIST      DD [papernot2016distillation]       98      80    -
           AT [goodfellow2014explaining]       100     50    4
           Region-based [cao2017mitigating]    100     85    -
           BD (θ = 0.5)                        38      17    3
           BD (θ = 0.7)                        7       9     0

TABLE IV: Comparison of the BD method with other defense methods against targeted hard-label attacks in terms of ASR (%).

Dataset    Attack       Bit-Depth    JPEG    TVM    NP    BD
MNIST      GenAttack    95           89      -      5     0
CIFAR10    GenAttack    95           89      73     6     0

TABLE V: Comparison of the BD method with other defense methods against targeted GenAttack (soft-label) in terms of ASR (%).
RobustBench Defense                  Sign-OPT    SimBA    HSJA
RMC [wu2020adversarial]              100         83       100
RATIO [augustin2020adversarial]      100         59       100
RO [rice2020overfitting]             100         85       100
MMA [ding2018mma]                    100         83       100
ER [engstrom2019adversarial]         100         92       100
RD [rony2019decoupling]              -           80       -
PD [sehwag2021improving]             100         71       100
BD (θ = 0.5)                         4           23       18
BD (θ = 0.7)                         0           15       7

TABLE VI: Comparison of the ASR (%) of the proposed BD method with the RobustBench defense models over the CIFAR10 dataset.

IV-D Comparison with Other Defense Methods

For defending against hard-label attacks, Table IV compares the BD method with the DD, AT, and Region-based defense methods over the MNIST dataset. We obtained these other methods’ defense ASR data from [chen2020hopskipjumpattack] for the HopSkipJump and BA attack methods, and obtained the defense performance data against SimBA through our own experiments. It can be seen that our BD method outperformed all these defense methods with lower ASR. For soft-label attacks, Table V shows that our BD method also outperformed a list of existing defense methods.

We also ran experiments using RobustBench models. The defense performance over the CIFAR10 dataset is reported in Table VI. ASR is used as the evaluation criterion because a lower ASR under the same attacks indicates a more effective defense. From Table VI, we can see that our method achieved the best defense performance.

V Conclusions

In this paper, we proposed an efficient and effective boundary defense method, BD, to defend against black-box attacks. The method detects boundary samples by examining classification confidence scores and adds random noise to the query results of these boundary samples. BD is shown to reduce the attack success rate to almost 0 with only about 1% classification accuracy degradation for IMAGENET models. Analysis and experiments demonstrated that this simple and practical defense method can effectively defend DNN models against state-of-the-art black-box attacks.