Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query Attacks

by   Sizhe Chen, et al.

The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, using only the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs can be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation on clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting ℓ_∞=8/255 on CIFAR-10, our proposed AAA helps WideResNet-28 secure 80.59% accuracy under Square attack (2500 queries), while the best prior defense (i.e., adversarial training) only attains 67.44%. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets and bounds. Moreover, AAA calibrates better without hurting the accuracy. Our code will be released.





1 Introduction

Deep Neural Networks (DNNs) are vulnerable to adversarial examples (AEs), where human-imperceptible perturbations added to clean samples can fool DNNs into giving wrong predictions szegedy2013intriguing ; goodfellow2014explaining . Recently, such a threat has been made practically feasible by black-box score-based query attacks (SQAs) ilyas2018prior ; guo2019simple ; andriushchenko2020square , as they only require the same information as users to craft efficient AEs. Users, for better judgments, need the model's prediction confidence indicated by the DNN's output scores, and this is the only knowledge SQAs need to perform attacks. In contrast, white-box attacks moosavi2016deepfool ; carlini2017towards ; madry2017towards or transfer-based attacks xie2019improving ; chen2020universal require the gradients or training data of DNNs. Moreover, it has been shown that SQAs can achieve a non-trivial attack success rate within a reasonable number of queries, e.g., dozens, compared to thousands of queries for decision-based query attacks brendel2018decision ; cheng2018query ; chen2020hopskipjumpattack . Thus, such feasibility and effectiveness of SQAs are attracting increasing attention from defenders qin2021random .

Defending against SQAs is a different goal from improving the worst-case robustness that is commonly studied athalye2018obfuscated ; croce2020reliable ; croce2021robustbench . In the real-world scenarios that SQAs are designed for, DNNs are treated as black boxes, and the only interaction between models and users/attackers is the model's output scores. Thus, altering the scores is all defenders could do here, either directly or indirectly. Most existing defenses indirectly change outputs by optimizing the model madry2017towards ; li2021neural ; wang2021fighting or pre-processing the inputs qin2021random ; prakash2018deflecting , which, however, severely affects the model's normal performance for different reasons. Training a robust model, e.g., by adversarial training madry2017towards ; tramer2017ensemble , diverts the model's attention to learning AEs, yielding the so-called accuracy-robustness trade-off rade2021helper . Randomizing qin2021random ; salman2020denoised ; xie2017mitigating ; liu2018towards ; lecuyer2019certified and blurring prakash2018deflecting ; liu2018feature ; buckman2018thermometer ; guo2018countering inputs reduce the signal-to-noise ratio, inevitably hindering accurate decisions. Dynamic inference wang2021fighting ; wu2020adversarial is time-consuming due to the test-time optimization on the model. Thus, it is imperative to develop a user-friendly defense against SQAs. In this paper, we hereby consider a post-processing defense as demonstrated in Fig. 1 (a), which naturally enjoys the following benefits: (1) the model's decision is hardly affected since good predictions have already been obtained; (2) model calibration can be simultaneously improved via post-processing guo2017calibration ; naeini2015obtaining to output accurate prediction confidence; (3) it can be flexibly used as a plug-in module for pre-trained models with negligible test-time computation overhead. Despite these merits, one question remains unexplored for post-processing:

How to serve users while avoiding SQA attackers when they access the same output information?

Figure 1: Compared to existing defenses on inputs or models, AAA post-processes outputs to avoid SQAs as in (a). Our main idea is to show attackers an incorrect attack direction as illustrated in (b). Specifically, if we perturb the original blue loss curve to the orange one, then attackers trying to decrease the loss would mostly be cheated away from their true destination, i.e., the adversarial direction.

Since SQAs are black-box attacks that find the adversarial direction to update AEs by observing the loss change indicated by the DNN's output scores, we can perturb such scores directly to fool attackers onto incorrect attack tracks. Following this idea, we propose the adversarial attack on attackers (AAA), which locally reverses the loss trend so that attackers, trying to greedily update AEs following the original trend, are led down a non-adversarial path farther away from the decision boundary, as shown in Fig. 1 (b). Specifically, AAA directly optimizes the DNN's logits so that the output loss approximates the reversed curve, drawn in orange in Fig. 1 (b): attackers that are threatening to the undefended model (going adversarial by reducing the original loss) are attacked into helping the defended DNN predict (going non-adversarial by reducing the modified loss). Now that a post-processing module is adopted here, we can simultaneously use it to calibrate the model as commonly proposed guo2017calibration ; qin2021improving . Thus, a simple plug-in post-processing module is all you need to both fool attackers and offer users accurate confidence.

By post-processing, AAA not only lowers the calibration error without hurting accuracy in all cases, but also efficiently prevents SQAs, e.g., helping a WideResNet-28 ZagoruykoK16 on CIFAR-10 krizhevsky2009learning secure 80.59% accuracy under the ℓ_∞=8/255 Square attack andriushchenko2020square (2500 queries), while the best prior defense (i.e., adversarial training dai2021parameterizing ) only attains 67.44%. Because AAA attacks the general greedy update of SQAs, such advantages of AAA over 8 baseline defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets, norms, bounds, and losses, verifying AAA as a user-friendly, effective, and generalizable defense.


  • We analyze current defenses from the view of SQA defenders and point out that a post-processing module forms not only an effective but also a user-friendly, plug-in defense.

  • We design a novel adversarial attack on attackers (AAA) defense that fools SQA attackers into incorrect attack directions by slightly perturbing the DNN's output scores.

  • We conduct comprehensive experiments showing that AAA outperforms 8 other defenses in accuracy, calibration, and protection against all 6 tested SQAs under various settings.

2 Related work

Query attacks

are black-box attacks that only require the model's output information. Query attacks can be divided into score-based query attacks (SQAs) ilyas2018prior ; guo2019simple ; andriushchenko2020square ; papernot2017practical ; chen2017zoo ; cheng2019improving ; al2019sign and decision-based query attacks (DQAs) brendel2018decision ; cheng2018query ; chen2020hopskipjumpattack ; cheng2019sign . SQAs greedily update AEs from the original samples by observing the loss change indicated by the DNN's output scores, i.e., logits or probabilities. Early SQAs try to estimate the DNN's gradients by additional queries around the sample chen2017zoo ; bhagoji2018practical . Recently, fast pure random-search SQAs have been validated as query-efficient guo2019simple ; andriushchenko2020square . It has also been proposed to use pre-trained surrogate models in SQAs cheng2019improving ; guo2019subspace , which, however, demands infeasible access to the DNN's training data as in transfer-based attacks xie2019improving ; chen2020universal . Thus, we do not focus on defending against such SQAs, as also in qin2021random . Besides SQAs, DQAs rely only on the DNN's decisions, e.g., the top-1 predictions, to generate AEs. Since DQAs cannot perform the greedy update, they start crafting the AE from a different sample and keep the DNN's prediction wrong during the attack. Currently, DQAs need thousands of queries to reach a non-trivial attack success rate Vo2022 compared to dozens of queries for SQAs andriushchenko2020square , as compared in Appendix A, limiting their threat. Additionally, SQAs are applicable in most cases and attackers do not have to resort to DQAs because, in the real world, DNNs must not only be accurate but also indicate when they tend to be incorrect guo2017calibration by outputting confidence scores.

Adversarial defense

mainly has two different goals. The first goal is to improve the DNN's worst-case robustness croce2021robustbench , which demands that DNNs have no AE around clean samples within a norm ball. Such robustness was originally evaluated by gradient-based white-box attacks moosavi2016deepfool ; carlini2017towards , but later, the gradient obfuscation phenomenon was discovered athalye2018obfuscated , motivating evaluations to incorporate random noise and black-box attacks croce2020reliable in adaptive methods croce2020reliable ; yao2021automated ; tramer2020adaptive . In this assessment, adversarial training (AT) madry2017towards ; tramer2017ensemble is validated as the most effective defense, and other methods, e.g., using more or augmented data gowal2021improving ; rebuffi2021data , designing special architectures li2021neural ; xie2019feature ; fu2021drawing , and inducing randomness in training he2019parametric ; salman2019provably , all need to be combined with AT to achieve good performance. The other defense goal is to mitigate real-case adversarial threats, where the defense performance is evaluated by the feasible and query-efficient SQAs. A model that is more robust in the worst case certainly protects itself in real cases. Besides, it is also possible to defend by dynamic inference wang2021fighting ; wu2020adversarial , or by randomizing qin2021random ; salman2020denoised ; xie2017mitigating ; liu2018towards , denoising prakash2018deflecting ; liu2018feature , or quantizing buckman2018thermometer ; guo2018countering inputs. However, all the aforementioned defenses on inputs or models exert a non-negligible impact on accuracy, calibration, or inference speed due to the focus on learning AEs in AT, the reduction of the signal-to-noise ratio in pre-processing, or the costly test-time optimization in dynamic inference, as compared meticulously in Section 3.

Model calibration

performance is a further demand for a good model besides high accuracy. Beyond correct predictions, model calibration additionally requires DNNs to produce accurate confidence in their predictions niculescu2005predicting : for instance, exactly a fraction p of the samples predicted with confidence p should be correctly predicted. From this point of view, the expected calibration error (ECE) naeini2015obtaining is widely used to quantify the gap between accuracy and confidence. Various methods have been proposed to improve DNN calibration in training qin2021improving ; szegedy2016rethinking ; thulasidasan2019mixup ; stutz2020confidence or by a post-processing module after training guo2017calibration ; naeini2015obtaining ; zadrozny2001obtaining . Among them, temperature scaling naeini2015obtaining is a simple but effective method guo2017calibration , which divides all output logits by a single scalar tuned on a validation set so that the ECE on testing samples is significantly reduced. Since division is a simple post-processing operation, calibration can be achieved simultaneously by AAA, the first post-processing defense. There have been methods to avoid attacks by calibration stutz2020confidence or to calibrate by attack qin2021improving , but the simultaneous improvement of defense and calibration has not been reported to the best of our knowledge.

3 Preliminaries and motivation

Before presenting the proposed method, we first introduce the key ideas of SQAs and analyze existing defenses. For a clean sample x labelled by y, an SQA on a DNN with logits z(x) generates an AE x_adv by minimizing the margin between logits guo2019simple ; andriushchenko2020square ; al2019sign ; ilyas2018black as

l_mgn(x) = z_y(x) − max_{i≠y} z_i(x),   (1)

namely the margin loss carlini2017towards . For defenders without the label y, it is possible to calculate the unsupervised margin loss based only on the logits vector z(x) as

l_unm(x) = max_i z_i(x) − max_{i≠ŷ} z_i(x), where ŷ = argmax_i z_i(x),   (2)

i.e., the gap between the top-1 and runner-up logits, by assuming the model prediction to be correct, since handling originally misclassified samples falls beyond the scope of defense croce2021robustbench . For attackers with label y, the attack succeeds once l_mgn(x) < 0. To realize this quickly, SQAs only update the AE if a query x_q has a lower margin loss than the current best query x_n, i.e.,

x_{n+1} = x_q if l_mgn(x_q) < l_mgn(x_n), and x_{n+1} = x_n otherwise.   (3)
Existing SQAs guo2019simple ; andriushchenko2020square ; al2019sign are quite effective, capable of halving the accuracy within 100 queries on CIFAR-10 when ℓ_∞ = 8/255, posing significant and practical threats. Aware of this, the first defense specifically against SQAs has recently been proposed qin2021random , which pre-processes inputs with random noise. Besides, other existing defenses are also useful for avoiding SQAs, among which the most representative ones are adversarial training (AT) madry2017towards ; tramer2017ensemble , dynamic inference wang2021fighting ; wu2020adversarial , and pre-processing qin2021random ; buckman2018thermometer . These defenses work through different mechanisms, as illustrated in Fig. 1 (a) and Section 2. In Table 1, we compare their key characteristics related to the defense performance.

                expectation   AT madry2017towards ; tramer2017ensemble   pre-pro qin2021random ; buckman2018thermometer   dyn-inf wang2021fighting ; wu2020adversarial   AAA (ours)
calibration     good          /              hurt           good           improved
testing cost    low           low            low            high           negligible
training cost   none          high           none           none           none
acc under SQA   high          good           limited        limited        best
Table 1: Expectation and effects of current defenses (unpreferable effects are marked in red)

Real-case defenders are expected to serve users while avoiding SQA attackers. The former demands good accuracy, calibration, and inference speed; the latter requires good protection and, preferably, no additional training. As a training-time defense on the model, AT exerts a significant impact on accuracy rade2021helper , e.g., an over 10% accuracy drop on ImageNet xie2019feature (see Table 2), with several-fold training computation and an indefinite impact (denoted as "/") on calibration. Pre-processing is a test-time defense on the input, which avoids SQAs by reducing the input's signal-to-noise ratio but inevitably hurts normal performance. Recently, dynamic inference has been proposed as a test-time defense that optimizes the model; it shows no accuracy drop but dramatically increases the computation, e.g., DENT wang2021fighting consumes many times more test-time calculation. Given the above discussion, it is imperative to develop a user-friendly and efficient method to effectively defend against SQAs.
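To make the greedy update rule (3) concrete before introducing our defense, the sketch below implements a minimal random-search SQA in the spirit of Square attack: it proposes random in-ball perturbations and keeps a query only if the margin loss decreases. The function names and the toy usage are ours, for illustration only:

```python
import numpy as np

def margin_loss(logits, y):
    """Margin loss (1): true-class logit minus the largest other logit.
    The attack succeeds once this becomes negative."""
    other = np.delete(logits, y)
    return logits[y] - other.max()

def random_search_sqa(model, x, y, eps=8 / 255, n_queries=100, seed=0):
    """Greedy score-based attack: accept a query only if it lowers the
    margin loss, i.e., the update rule (3)."""
    rng = np.random.default_rng(seed)
    x_best = x.copy()
    loss_best = margin_loss(model(x_best), y)
    for _ in range(n_queries):
        # propose a random +/-eps sign pattern inside the l_inf ball around x
        delta = rng.choice([-eps, eps], size=x.shape)
        x_q = np.clip(x + delta, 0.0, 1.0)
        loss_q = margin_loss(model(x_q), y)
        if loss_q < loss_best:  # the greedy update (3)
            x_best, loss_best = x_q, loss_q
    return x_best, loss_best
```

Because acceptance depends only on whether the observed loss decreases, any defense that locally reverses the observed loss trend directly corrupts this acceptance decision.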

4 Adversarial attack on attackers

Defending by post-processing naturally carries the advantages of good accuracy and calibration guo2017calibration , no additional training, and negligible test-time computation. Although it does not improve the commonly-studied worst-case robustness croce2020reliable , its potential to defeat SQAs in real cases has not been explored yet to the best of our knowledge. Investigating current defenses from the perspective of SQA defenders in Sec. 3, it is interesting to find that in the real world, no matter whether defenders alter inputs, models, or outputs, what users/attackers see is only the change in the outputs, i.e., the black-box setting. Thus, it is reasonable and preferable for SQA defenders to directly manipulate the DNN's output scores, which has already become a standard practice in model calibration guo2017calibration .

But how? Since attackers conduct an adversarial attack on the model, why can't we defenders actively perform an adversarial attack on the attackers? If attackers search for the adversarial direction by scores, we can manipulate such scores to cheat them onto an incorrect attack path. Following this idea, we develop the adversarial attack on attackers (AAA) to control what SQA attackers base their actions on, i.e., the loss (1) indicated by the DNN's output scores. Under such direct misleading, SQA attackers, following their original policy, would be guided to wherever we want them to go due to their greediness; see (3). Therefore, we can adopt the extreme fooling that leads attackers to walk opposite to their destination, i.e., in the non-adversarial direction. In this way, attackers are attacked into actually helping DNNs become more confident in their predictions.

Reversing the attacker's path by directly altering outputs is simple. Since SQA attackers observe the margin loss (1), what we should do is force the modified loss to increase when the original unmodified loss actually decreases. This can be achieved by a one-to-one mapping between unmodified and modified loss values, e.g., outputting a large margin loss (high confidence) when the original margin is small (the model is not confident). However, this naive method prevents users from getting accurate prediction confidence since it greatly changes scores to reverse the overall loss trend along the adversarial direction. Thus, we alternatively reverse the loss within each small interval while keeping the overall trend unchanged as in Fig. 1 (b). This design cheats attackers in most cases but requires only slight modifications to scores, so the precision of confidence is largely preserved.

Specifically, we periodically set loss attractors and encourage the DNN's unsupervised margin loss (2) to move towards its closest attractor l_atr, i.e.,

l_atr = (⌈ l_org / τ ⌉ − 1/2) · τ,   (4)

where l_org is the unmodified loss, τ is the period, and "⌈·⌉" (ceil) denotes rounding up decimals to integers. For example, if τ = 6, then according to (4), the closest loss attractor of logits with l_org = 5.5 is (⌈5.5/6⌉ − 1/2) · 6 = 3. As we can see, the −1/2 bias is necessary to avoid setting an attractor on 0, i.e., the decision boundary, because approximating the logits to it tends to flip the model's decision.

Periodically-set loss attractors divide logits with different l_unm into intervals, so that we can reverse the loss trend within each interval to cheat attackers into the opposite attack direction as in Fig. 1 (b). However, reversing the trend demands more than mapping in-interval losses to the fixed value l_atr: it requires mapping a small loss to a large one, and vice versa. Thus, we set the target loss value l_trg for the logits as

l_trg = l_atr − (l_org − l_atr) = 2 l_atr − l_org.   (5)

That is, if the unmodified loss is larger than l_atr by δ, we want the modified loss to be smaller than l_atr by δ. In this way, as a sample actually moves closer to the decision boundary (l_org decreases), AAA outputs an increasing l_trg to mislead attackers away from this adversarial direction.
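As a sanity check, this periodic-attractor construction can be expressed in a few lines; the helper names are ours, and the attractor form (⌈l_org/τ⌉ − 1/2)·τ is our reading of the description above:

```python
import math

def loss_attractor(l_org, tau):
    """Closest periodic attractor: attractors sit at (k - 1/2) * tau,
    so none of them falls on the decision boundary l = 0."""
    return (math.ceil(l_org / tau) - 0.5) * tau

def target_loss(l_org, tau):
    """Reversed target: mirror the loss about its attractor, so a loss
    delta above the attractor is mapped to delta below it."""
    l_atr = loss_attractor(l_org, tau)
    return l_atr - (l_org - l_atr)  # = 2 * l_atr - l_org

# within one interval the mapping is reversed: as the original loss
# shrinks (the attacker approaches the boundary), the target loss grows
```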

Although the periodic design already alters the output confidence only very slightly, we can step further to improve the precision of confidence, because post-processing logits is a standard practice in model calibration naeini2015obtaining ; zadrozny2001obtaining . Thus, simultaneous defense and calibration are obtainable in a single AAA module by controlling the loss while encouraging the output confidence to approach the calibrated one p_trg, i.e.,

z' = argmin_{z''} | l_unm(z'') − l_trg | + β · | max_i softmax(z'')_i − p_trg |,   (6)

where the first term requires the output loss to form the reversed curve so that attackers are cheated into non-adversarial directions, the second term motivates the output confidence to be accurate, and β controls the balance between them. Through this design, the DNN's output can be both accurate and misleading, achieving two seemingly contradictory goals at the same time.

The calibrated confidence p_trg is obtainable by various model calibration methods naeini2015obtaining ; zadrozny2001obtaining . Among them, temperature scaling naeini2015obtaining is simple but effective guo2017calibration : it divides all logits by a scalar T as z / T. The temperature T is tuned on a validation set to minimize the calibration error, and here, such tuning is conducted while the optimization (6) is also performed so that we find the temperature that suits AAA best. After that, T is fixed for inference in the AAA post-processing module.
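Temperature scaling itself is a one-line operation; the sketch below (names ours) shows that dividing logits by T > 1 softens the top-class confidence while leaving the prediction unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def temperature_scale(z, T):
    """Temperature scaling: divide all logits by a scalar T tuned on a
    validation set; T > 1 softens overconfident probabilities."""
    return np.asarray(z, dtype=float) / T

z = np.array([4.0, 1.0, 0.5])
p_raw = softmax(z).max()                          # confidence before scaling
p_cal = softmax(temperature_scale(z, 2.0)).max()  # softened confidence
```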

Algorithm 1 Adversarial Attack on Attackers
Input: the logits z, the model's temperature T, AAA hyper-parameters
Output: post-processed logits z'
1: calculate the original loss l_org by (2)
2: set the target loss l_trg by (5)
3: set the target confidence p_trg = max_i softmax(z / T)_i
4: initialize z' ← z and optimize it for (6)
5: return z'

We summarize our algorithm in Alg. 1. AAA first calculates the original loss l_org, according to which it sets the target loss l_trg that forms the reversed loss curve. Then AAA optimizes the logits to reach the target loss, and also the target confidence p_trg, which is obtained by dividing the original logits by a pre-tuned temperature T. The overall procedure is conducted at test time with optimization only on the logits, making AAA a computation-efficient, plug-in, and model-agnostic method with good accuracy, calibration, and defense against SQAs.
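To illustrate the overall procedure end to end, here is a deliberately simplified sketch of an AAA-style post-processor: instead of running the optimizer on the full objective (6), it hits the reversed target margin in closed form by shifting only the top-1 logit and skips the calibration term. The function name, the closed-form shortcut, and the attractor form are our assumptions, not the paper's actual implementation:

```python
import math
import numpy as np

def aaa_post_process(logits, tau=6.0):
    """Simplified AAA sketch: map the unsupervised margin (2) onto the
    reversed target (5) by adjusting the top-1 logit, which preserves
    the model's decision while reversing the local loss trend."""
    z = np.asarray(logits, dtype=float).copy()
    top1 = int(z.argmax())
    runner_up = np.partition(z, -2)[-2]
    l_org = z[top1] - runner_up                   # unsupervised margin (2)
    l_atr = (math.ceil(l_org / tau) - 0.5) * tau  # closest attractor (4)
    l_trg = 2.0 * l_atr - l_org                   # reversed target (5)
    z[top1] = runner_up + l_trg                   # margin now equals l_trg
    return z
```

The real defense additionally pulls the output confidence towards the temperature-scaled one via the second term of (6), which this shortcut omits.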

5 Experiments

5.1 Setup

We evaluate AAA along with 8 defense baselines, including the random noise defense (RND qin2021random ), adversarial training (AT dai2021parameterizing ; salman2020adversarially ; xie2019feature ), dynamic inference (DENT wang2021fighting ), training randomness (PNI he2019parametric ), and ensembling (TRS yang2021trs ). Results of AT with extra data rade2021helper are in Appendix A. AAA's hyper-parameters, i.e., the period τ, the balance weight β, the learning rate of the Adam optimizer kingma2014adam , and the number of optimization iterations, are kept fixed across experiments. To calibrate simultaneously with AAA, we perform temperature scaling using 1K / 5K samples from the CIFAR-10 testing set / ImageNet validation set respectively, and make them disjoint from the testing samples used in attacks as much as possible. Detailed hyper-parameters of other defenses and attacks are in Appendix E.

The defenses are assessed by 6 state-of-the-art SQAs: random-search methods including Square andriushchenko2020square , SignHunter al2019sign , and SimBA guo2019simple , and gradient-estimation methods including NES ilyas2018black and Bandits ilyas2018prior . They are effective and applicable in various settings. Some SQAs use pre-trained surrogate models cheng2019improving ; guo2019subspace , which, however, demands infeasible access to the DNN's training data. To evaluate real-case defenses against training-based attacks, we alternatively consider a practical SQA, QueryNet chen2021querynet , based on model stealing. Results of DQAs and of SQAs using losses other than the margin loss are put in Appendix A. Since AAA aims to defend in real cases without resorting to randomness, it would be unsuitable to evaluate AAA by the common adaptive or white-box attacks croce2020reliable , e.g., EOT athalye2018obfuscated . We mostly perform untargeted attacks, but we also test targeted and ℓ_2 attacks under different bounds to observe the generalization of defenses.

We use 8 DNNs, which are mostly WideResNets ZagoruykoK16 as in croce2021robustbench . The pre-trained models of PNI he2019parametric / TRS yang2021trs are ResNet-20 he2016deep , and we also test ResNeXt-101 xie2017aggregated on ImageNet. The other studied DNNs come from RobustBench croce2021robustbench and torchvision paszke2019pytorch as specified in Appendix E. AT models are tested using the same bound as in AT, if not otherwise stated. We use all 10K CIFAR-10 testing samples. For ImageNet, we randomly select 1K validation samples, one from each of the 1K classes, to eliminate class bias as in chen2021querynet . All images are rescaled to [0, 1], and the ImageNet ones are resized to 224 × 224. Before feeding them to DNNs, we quantize images to 8 bits, imitating the real-case 8-bit image setting as in chen2021querynet . Experiments are performed on an NVIDIA Tesla A100 GPU but could be run on any GPU with over 4 GB of memory.
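The 8-bit quantization step mentioned above is simply rounding to the 256 representable intensity levels; a minimal sketch (function name ours):

```python
import numpy as np

def quantize_8bit(x):
    """Imitate the real-case 8-bit image setting: round each pixel in
    [0, 1] to the nearest of 256 intensity levels before inference."""
    return np.round(np.asarray(x, dtype=float) * 255.0) / 255.0
```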

As defenders, we are concerned about the DNN's remaining accuracy after it is attacked by SQAs for a certain number of queries. Thus, we report such SQA adversarial accuracy after 100 and 2500 queries, reflecting the DNN's performance under mild and extreme SQAs. We include the average query count, another metric commonly used by attackers, in Appendix C. To measure calibration, the expected calibration error (ECE) naeini2015obtaining is commonly used. ECE divides all N testing samples into M equal-width bins, where bin B_m contains the samples with confidence falling in ((m−1)/M, m/M]. Then ECE is calculated from the difference between accuracy and confidence as

ECE = Σ_{m=1}^{M} (|B_m| / N) · | (1/|B_m|) Σ_{i∈B_m} 1(ŷ_i = y_i) − (1/|B_m|) Σ_{i∈B_m} p̂_i |,

where ŷ_i is the predicted label of the i-th sample in bin B_m, and p̂_i represents the probability confidence of this prediction.
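The ECE computation described above can be sketched as follows, using equal-width confidence bins; the names are ours:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: partition predictions into equal-width confidence bins and
    average |accuracy - confidence| weighted by the bin sizes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # accuracy inside the bin
            conf = confidences[mask].mean()  # mean confidence inside the bin
            ece += mask.sum() / n * abs(acc - conf)
    return ece
```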

Figure 2: The AAA-defended (with the above hyper-parameters) and undefended margin loss values when attacking the undefended WideResNet-28 ZagoruykoK16 in RobustBench croce2021robustbench by Square attack andriushchenko2020square (ℓ_∞ = 8/255) using a CIFAR-10 test sample (other samples have similar trends, as in Appendix B). AAA fools attackers precisely, as shown by the tiny symmetric oscillations of the two lines.

5.2 Visual illustration of AAA

We first visually illustrate the mechanism and effects of AAA. AAA, besides fooling attackers, also calibrates the model, forming the loss curve in Fig. 2. Slightly different from the ideal defense in Fig. 1 (b), the orange curve is lowered for calibration. However, calibration does not hinder fooling attackers, i.e., the orange line mostly moves opposite to the blue line in a precisely symmetric manner. Also, the two lines cross the decision boundary at the same time, meaning that AAA does not change the decision.

Besides the loss, it is also necessary to observe the output probabilities of AAA, which are displayed in Fig. 3 (left). AAA, indicated by the orange lines, changes the scores slightly without hurting accuracy (dotted lines), compared to RND qin2021random and AT dai2021parameterizing . By such small modifications, AAA not only improves the calibration but also prevents SQAs by misleading them into incorrect directions. As displayed in the right part of Fig. 3, AAA is highly effective in avoiding SQAs, even against a query-efficient attack working at a large bound. Specifically, AAA preserves most of a standard-trained DNN's accuracy after extensive queries, roughly doubling the performance of AT and tripling that of RND.

Figure 3: The left figure illustrates the change in output scores for different defenses. We first sort the 10K testing samples in ascending order according to their ground-truth-class probability predicted by the undefended model (blue line), and then divide them into bins, so that the left bins in the figure stand for low-confidence or misclassified samples, and vice versa. Then, for the samples in each bin, we plot the confidence (solid lines) and accuracy (dotted lines) for AT dai2021parameterizing , RND qin2021random , and AAA. Compared to AT and RND, AAA alters scores the most slightly, without influencing the accuracy curve. The right figure shows the DNN's accuracy under Square attack andriushchenko2020square , indicating that AAA outperforms alternatives in defending against SQAs by a large margin, with also the highest clean accuracy.
CIFAR-10, WideResNet-28 ZagoruykoK16
Metric / Attack                   None           AT dai2021parameterizing   RND qin2021random   AAA (ours)
ECE (%)                           3.52           11.00          6.32           2.46
Acc (%)                           94.78          87.02          91.05          94.84
Square andriushchenko2020square   39.38 / 00.09  78.30 / 67.44  60.83 / 49.15  81.36 / 80.59
SignHunter al2019sign             41.14 / 00.04  78.87 / 66.79  61.02 / 47.82  79.41 / 76.71
SimBA guo2019simple               53.04 / 03.95  84.21 / 75.85  76.39 / 64.34  88.86 / 83.36
NES ilyas2018black                83.42 / 12.24  85.92 / 81.01  86.23 / 68.19  90.62 / 85.95
Bandit ilyas2018prior             69.86 / 41.03  83.62 / 76.25  70.44 / 41.65  80.86 / 78.36

ImageNet, WideResNet-50 ZagoruykoK16
Metric / Attack                   None           AT salman2020adversarially   RND qin2021random   AAA (ours)
ECE (%)                           5.42           5.03           5.79           4.30
Acc (%)                           77.11          66.30          75.32          77.17
Square andriushchenko2020square   52.27 / 09.25  59.20 / 51.11  58.67 / 50.54  63.13 / 62.51
SignHunter al2019sign             53.05 / 13.88  59.47 / 56.22  59.36 / 52.98  62.35 / 56.80
SimBA guo2019simple               71.79 / 20.90  65.64 / 47.60  66.36 / 63.27  74.16 / 67.14
NES ilyas2018black                77.11 / 64.93  66.30 / 64.38  71.33 / 66.05  77.12 / 67.06
Bandit ilyas2018prior             71.33 / 65.77  65.30 / 63.98  65.15 / 61.38  72.15 / 70.53

ImageNet, ResNeXt-101 xie2017aggregated
Metric / Attack                   None           AT xie2019feature   RND qin2021random   AAA (ours)
ECE (%)                           8.37           5.74           8.93           7.38
Acc (%)                           78.21          63.94          76.75          78.21
Square andriushchenko2020square   54.51 / 11.73  57.99 / 51.47  58.56 / 48.20  66.32 / 65.77
SignHunter al2019sign             54.12 / 13.30  59.14 / 56.49  58.02 / 52.50  62.72 / 59.60
SimBA guo2019simple               70.86 / 24.64  59.40 / 57.53  68.31 / 66.16  74.06 / 67.26
NES ilyas2018black                78.21 / 66.24  63.94 / 62.51  73.37 / 69.30  78.21 / 68.51
Bandit ilyas2018prior             72.19 / 67.73  63.18 / 62.26  67.69 / 64.62  72.91 / 71.48

Table 2: The defense performance under attacks (#query = 100 / 2500)

5.3 Numerical results of AAA

We report the main numerical results in Table 2, where RND and AAA are directly applied to the undefended model denoted as "None". The AT methods for the three models are PSSiLU dai2021parameterizing , vanilla AT salman2020adversarially , and feature denoising xie2019feature , respectively. According to Table 2, AAA not only consistently reduces the ECE, but also does not hurt clean accuracy. In contrast, the AT models lose roughly 8%-14% accuracy, and RND also endures a drop in accuracy and calibration. Under SQAs, AAA preserves a significantly higher accuracy than the undefended model, which is totally destroyed: the most threatening SQAs more than halve the CIFAR-10 model's accuracy within 100 queries and degrade the ImageNet ones to below 14% with more queries. The AAA models, however, retain over 76% (CIFAR-10) and 56% (ImageNet) accuracy even in the extreme cases. AT and RND are useful in mitigating SQAs, but it is AAA that tops the defense performance in almost all cases.

Besides AT and RND, diverse defenses have also been proposed, and it is interesting to see their results. DENT wang2021fighting optimizes the model at test time, trying to learn the AE distribution. PNI he2019parametric injects noise during training, making the learned weights less sensitive to input perturbations. TRS yang2021trs ensembles three models with low attack transferability between each other. These were originally developed against gradient-based attacks, but also provide some protection against SQAs. As displayed in Table 3, however, they are not comparable to AAA in real cases regarding accuracy, calibration, and defense performance. Here, we also test a strong SQA, QueryNet chen2021querynet , which uses three architecture-alterable models to steal the DNN. Due to its utilization of large-scale testing samples, QueryNet greatly hurts DNNs, but AAA is still the defense that protects the model best.

Metric / Attack None DENT wang2021fighting PNI he2019parametric TRS yang2021trs AAA (ours)
ECE (%) 3.52 5.20 3.09 3.64 2.46
Acc (%) 94.78 94.80 81.91 88.64 94.84
Square andriushchenko2020square 39.38 / 00.09 62.01 / 35.07 57.69 / 45.47 56.26 / 23.91 81.36 / 80.59
QueryNet chen2021querynet 13.50 / 00.03 39.35 / 20.75 44.06 / 34.69 16.43 / 08.49 50.01 / 49.63
Table 3: Various defenses under strong attacks (#query , CIFAR-10, )

5.4 Generalization of AAA

Aside from untargeted attacks, we also conduct targeted attacks and attacks under different bounds to study the generalization of AAA. A targeted attack is successful only if an AE is mispredicted as a pre-set class, which is randomly chosen from the incorrect classes for each sample here. Attacks may also bound the perturbations by the norm, which is reported to fool DNNs better chen2020universal . Here we additionally validate AAA's plug-in advantage by combining it with AT. Note that what differentiates AAA from most existing defenses combined with AT xie2019feature ; fu2021drawing ; he2019parametric is that AAA has already achieved excellent defense performance on its own, so such a combination is feasible but not necessary. We choose the Square attack andriushchenko2020square to perform the above evaluations because it is the most effective SQA as tested in Table 2. The results are presented in Table 4, where all models are exactly the same as in Table 2 (CIFAR-10), without tuning any defense hyper-parameters, to fairly evaluate the generalization of different methods. Results for other AT models are put in Appendix A.

In the difficult targeted attack setting, the undefended model retains only accuracy after queries, which is approximately just the accuracy drop of the AAA model. For attacks, AAA is still capable of mitigating threats without hurting users, and its superiority becomes more pronounced as the attack bound grows larger. AT models, although robust, suffer from attacks under a large or different norm ball stutz2020confidence , so their defense effects decrease as the SQA alters the setting, as seen in the bottom row of Table 4. Defended by AAA, however, this drawback is largely avoided: an AT-AAA model is hardly influenced even under attack after queries, i.e., increasing the number of queries hardly improves the attack performance, discouraging SQA attackers.

Metric / Attack None AAA (ours) AT dai2021parameterizing AT-AAA
ECE (%) 3.52 2.46 11.00 10.56
Acc (%) 94.78 94.84 87.02 87.02
untargeted 39.38 / 00.09 81.36 / 80.59 78.30 / 67.44 80.80 / 80.13
targeted 75.59 / 02.84 92.05 / 91.62 85.75 / 82.72 86.22 / 86.13
untargeted 81.53 / 18.75 92.66 / 92.63 84.26 / 78.97 85.12 / 84.31
untargeted 12.77 / 00.01 70.35 / 63.46 57.88 / 25.19 74.03 / 73.72
Table 4: Generalization of AAA tested by Square attack andriushchenko2020square (#query , CIFAR-10)

5.5 Hyper-parameters of AAA

Figure 4: The influence of the AAA hyper-parameters (the attractor interval in (4), the reverse step in (5), and the calibration loss weight in (6)) on accuracy (left axis), ECE (right axis), SQA adversarial accuracy (left axis), and temperature (right axis). The dashed ECE (ref) line denotes the undefended model.

The above results in various settings are all obtained using fixed hyper-parameters that were heuristically selected, so it is necessary to examine how each of them affects the results. The attractor interval in (4) decides the period of the margin-loss attractors: a larger interval divides the losses into fewer intervals, making it harder for attackers to jump out of any one of them. The reverse step in (5) controls the degree of reversal: as it increases, the modified loss curve becomes steeper in the opposite direction along the adversarial direction, emphasizing the defense. The calibration loss weight in (6) directly balances defense and calibration in the optimization. Here we still consider the accuracy, ECE, and SQA adversarial accuracy (under 100 queries). In addition, we study one more metric, the temperature of logits rescaling tuned with AAA.
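The interplay of the attractor interval in (4) and the reverse step in (5) can be sketched directly from the core code in Appendix D; the default values below (`a_i=6.0`, `reverse_step=1.0`) are hypothetical, chosen only for illustration:

```python
import torch

def aaa_target_margin(margin, a_i=6.0, reverse_step=1.0):
    # Eq. (4): snap each margin to its periodic attractor; attractors sit
    # at the midpoints of intervals of width a_i.
    attractor = ((margin / a_i).ceil() - 0.5) * a_i
    # Eq. (5): reverse the local slope around the attractor, so a greedy
    # attacker decreasing the true margin sees an *increasing* target loss.
    return attractor - reverse_step * (margin - attractor)

m = torch.tensor([1.0, 2.0, 5.0, 7.0])
targets = aaa_target_margin(m)
# Within one interval the mapping is reversed: a true margin of 1 maps to a
# larger target (5) than a true margin of 2 (target 4), misleading the
# greedy descent direction of SQAs.
```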

The results are shown in Fig. 4. The adversarial accuracy far exceeds that of the undefended model () and mostly surpasses AT (). The clean accuracy is hardly impacted, and the ECE mostly stays below that of the undefended model (the green dashed line). Thus, AAA's good performance is insensitive to its hyper-parameters, even when they are tuned on a logarithmic scale. Regarding the trend, an intuitive conclusion is that a larger attractor interval, a larger reverse step, or a smaller calibration loss weight, i.e., emphasizing defense over calibration, increases both the adversarial accuracy and the ECE. Interestingly, the temperature mostly decreases as the defense is emphasized (the logits are divided by a smaller value), indicating that the model tends to output lower confidence to defend itself, consistent with the situation in AT croce2021robustbench .

6 Conclusion, impacts and limitations

We develop a novel defense against score-based query attacks (SQAs). Our main idea is to actively attack the attackers (AAA), misleading them toward incorrect attack directions. AAA achieves this by post-processing the DNN's logits while enforcing the new output confidence to be calibrated, making AAA a deterministic plug-in test-time defense that improves calibration and accuracy at a negligible computational cost. Compared to alternative defenses, AAA is effective in mitigating SQAs according to our study on 8 defenses, 6 SQAs, and 8 DNNs under various settings.

As a defense for real-world applications, AAA greatly mitigates the adversarial threat without requiring a huge computational budget. For example, in autonomous driving or surveillance systems, the post-processing defense module can be directly implemented on pre-trained models. Despite the low cost, the benefits of adopting AAA are profound. In most cases, users are provided with a more accurate confidence score, so they know better when the model tends to fail. In adversarial cases, SQAs, the most threatening attacks in practice, are effectively prevented.

AAA is developed specifically to prevent SQAs; thus, defending against other types of attacks is beyond our scope. For example, AAA does not improve the worst-case robustness evaluated in white-box settings croce2020reliable ; tramer2020adaptive , where attackers have complete knowledge of the model (the AutoAttack croce2020reliable robust accuracy would be increased by AAA only in an undesirable manner). Also, AAA is not applicable to preventing transfer-based attacks and decision-based query attacks, which are either infeasible or inefficient in the real world, because AAA perturbs the output scores very slightly and thus induces only a negligible impact on the decision boundary.


  • (1) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR), 2014.
  • (2) I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations (ICLR), 2015.
  • (3) A. Ilyas, L. Engstrom, and A. Madry, “Prior convictions: Black-box adversarial attacks with bandits and priors,” in International Conference on Learning Representations (ICLR), 2019.
  • (4) C. Guo, J. Gardner, Y. You, A. G. Wilson, and K. Weinberger, “Simple black-box adversarial attacks,” in International Conference on Machine Learning (ICML), 2019, pp. 2484–2493.
  • (5) M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: A query-efficient black-box adversarial attack via random search,” in the European Conference on Computer Vision (ECCV), 2020, pp. 484–501.
  • (6) S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574–2582.
  • (7) N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in the IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.
  • (8) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations (ICLR), 2018.
  • (9) C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille, “Improving transferability of adversarial examples with input diversity,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2730–2739.
  • (10) S. Chen, Z. He, C. Sun, and X. Huang, “Universal adversarial attack on attention and the resulting dataset DAmageNet,” in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022, pp. 2188–2197.
  • (11) W. Brendel, J. Rauber, and M. Bethge, “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,” in International Conference on Learning Representations (ICLR), 2018.
  • (12) M. Cheng, T. Le, P.-Y. Chen, H. Zhang, J. Yi, and C.-J. Hsieh, “Query-efficient hard-label black-box attack: An optimization-based approach,” in International Conference on Learning Representations (ICLR), 2018.
  • (13) J. Chen, M. I. Jordan, and M. J. Wainwright, “HopSkipJumpAttack: A query-efficient decision-based attack,” in the IEEE Symposium on Security and Privacy (SP), 2020, pp. 1277–1294.
  • (14) Z. Qin, Y. Fan, H. Zha, and B. Wu, “Random noise defense against query-based black-box attacks,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 7650–7663.
  • (15) A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in International Conference on Machine Learning (ICML), 2018, pp. 274–283.
  • (16) F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in International Conference on Machine Learning (ICML), 2020, pp. 2206–2216.
  • (17) F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein, “Robustbench: a standardized adversarial robustness benchmark,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
  • (18) Y. Li, Z. Yang, Y. Wang, and C. Xu, “Neural architecture dilation for adversarial robustness,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 29578–29589.
  • (19) D. Wang, A. Ju, E. Shelhamer, D. Wagner, and T. Darrell, “Fighting gradients with gradients: Dynamic defenses against adversarial attacks,” arXiv preprint arXiv:2105.08714, 2021.
  • (20) A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, “Deflecting adversarial attacks with pixel deflection,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8571–8580.
  • (21) F. Tramèr, A. Kurakin, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel, “Ensemble adversarial training: Attacks and defenses,” in 6th International Conference on Learning Representations (ICLR), 2018.
  • (22) R. Rade and S.-M. Moosavi-Dezfooli, “Helper-based adversarial training: Reducing excessive margin to achieve a better accuracy vs. robustness trade-off,” in ICML 2021 Workshop on Adversarial Machine Learning, 2021.
  • (23) H. Salman, M. Sun, G. Yang, A. Kapoor, and J. Z. Kolter, “Denoised smoothing: A provable defense for pretrained classifiers,” in Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 21945–21957.
  • (24) C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille, “Mitigating adversarial effects through randomization,” in International Conference on Learning Representations (ICLR), 2018.
  • (25) X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh, “Towards robust neural networks via random self-ensemble,” in the European Conference on Computer Vision (ECCV), 2018, pp. 369–385.
  • (26) M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana, “Certified robustness to adversarial examples with differential privacy,” in the IEEE Symposium on Security and Privacy (SP), 2019, pp. 656–672.
  • (27) Z. Liu, Q. Liu, T. Liu, N. Xu, X. Lin, Y. Wang, and W. Wen, “Feature distillation: Dnn-oriented JPEG compression against adversarial examples,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 860–868.
  • (28) J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, “Thermometer encoding: One hot way to resist adversarial examples,” in International Conference on Learning Representations (ICLR), 2018.
  • (29) C. Guo, M. Rana, M. Cisse, and L. van der Maaten, “Countering adversarial images using input transformations,” in International Conference on Learning Representations (ICLR), 2018.
  • (30) Y.-H. Wu, C.-H. Yuan, and S.-H. Wu, “Adversarial robustness via runtime masking and cleansing,” in International Conference on Machine Learning (ICML), 2020, pp. 10399–10409.
  • (31) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International Conference on Machine Learning (ICML), 2017, pp. 1321–1330.
  • (32) M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” in the AAAI Conference on Artificial Intelligence (AAAI), 2015, pp. 2901–2907.
  • (33) Y. Qin, X. Wang, A. Beutel, and E. Chi, “Improving calibration through the relationship with adversarial robustness,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 14358–14369.
  • (34) S. Zagoruyko and N. Komodakis, “Wide residual networks,” in British Machine Vision Conference (BMVC), 2016, pp. 87.1–87.12.
  • (35) A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009.
  • (36) S. Dai, S. Mahloujifar, and P. Mittal, “Parameterizing activation functions for adversarial robustness,” arXiv preprint arXiv:2110.05626, 2021.
  • (37) N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in the ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
  • (38) P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in the 10th ACM workshop on artificial intelligence and security, 2017, pp. 15–26.
  • (39) S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu, “Improving black-box adversarial attacks with a transfer-based prior,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 10934–10944.
  • (40) A. Al-Dujaili and U.-M. O’Reilly, “Sign bits are all you need for black-box attacks,” in International Conference on Learning Representations (ICLR), 2019.
  • (41) M. Cheng, S. Singh, P. H. Chen, P.-Y. Chen, S. Liu, and C.-J. Hsieh, “Sign-opt: A query-efficient hard-label adversarial attack,” in International Conference on Learning Representations (ICLR), 2019.
  • (42) A. N. Bhagoji, W. He, B. Li, and D. Song, “Practical black-box attacks on deep neural networks using efficient query mechanisms,” in the European Conference on Computer Vision (ECCV), 2018, pp. 154–169.
  • (43) Y. Guo, Z. Yan, and C. Zhang, “Subspace attack: Exploiting promising subspaces for query-efficient black-box attacks,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 3820–3829.
  • (44) D. C. R. Viet Quoc Vo, Ehsan Abbasnejad, “Ramboattack: A robust query efficient deep neural network decision exploit,” Network and Distributed Systems Security (NDSS) Symposium, 2022.
  • (45) C. Yao, P. Bielik, P. Tsankov, and M. Vechev, “Automated discovery of adaptive attacks on adversarial defenses,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 26858–26870.
  • (46) F. Tramer, N. Carlini, W. Brendel, and A. Madry, “On adaptive attacks to adversarial example defenses,” in Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 1633–1645.
  • (47) S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann, “Improving robustness using generated data,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 4218–4233.
  • (48) S.-A. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. A. Mann, “Data augmentation can improve robustness,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 29935–29948.
  • (49) C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He, “Feature denoising for improving adversarial robustness,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 501–509.
  • (50) Y. Fu, Q. Yu, Y. Zhang, S. Wu, X. Ouyang, D. Cox, and Y. Lin, “Drawing robust scratch tickets: Subnetworks with inborn robustness are found within randomly initialized networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 13059–13072.
  • (51) Z. He, A. S. Rakin, and D. Fan, “Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 588–597.
  • (52) H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang, “Provably robust deep learning via adversarially trained smoothed classifiers,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 11292–11303.
  • (53) A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in International Conference on Machine Learning (ICML), 2005, pp. 625–632.
  • (54) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
  • (55) S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak, “On mixup training: Improved calibration and predictive uncertainty for deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 13911–13922.
  • (56) D. Stutz, M. Hein, and B. Schiele, “Confidence-calibrated adversarial training: Generalizing to unseen attacks,” in International Conference on Machine Learning (ICML), 2020, pp. 9155–9166.
  • (57) B. Zadrozny and C. Elkan, “Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers,” in the AAAI Conference on Artificial Intelligence (AAAI), 2001, pp. 609–616.
  • (58) A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in International Conference on Machine Learning (ICML), 2018, pp. 2142–2151.
  • (59) H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry, “Do adversarially robust imagenet models transfer better?” in Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 3533–3545.
  • (60) Z. Yang, L. Li, X. Xu, S. Zuo, Q. Chen, P. Zhou, B. Rubinstein, C. Zhang, and B. Li, “TRS: Transferability reduced ensemble via promoting gradient diversity and model smoothness,” in Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 17642–17655.
  • (61) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
  • (62) S. Chen, Z. Huang, Q. Tao, and X. Huang, “Querynet: Attack by multi-identity surrogates,” arXiv preprint arXiv:2105.15010, 2021.
  • (63) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • (64) S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492–1500.
  • (65) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8024–8035.

Appendix A More numerical results

Decision-based attacks.

In Table 5, we present results on a SOTA decision-based attack, RamBo Vo2022 (denoted as "RB"). As shown in the first two columns, compared to Square andriushchenko2020square (denoted as "SQ"), RamBo can hardly impact the undefended DNN even after queries under a large bound. Thus, it is reasonable for us to focus on mitigating black-box SQAs in real cases.


From the right four columns of Table 5, one can observe that as the attack bound increases, the AT and RND models are impacted much more significantly. Moreover, AAA's superiority is enhanced as the attack becomes stronger.

Bound None-RB None-SQ AT dai2021parameterizing -SQ AT rade2021helper -SQ RND-SQ AAA-SQ
94.78 18.75 78.97 85.64 66.18 92.63
94.74 02.22 66.79 76.34 59.09 90.01
94.10 00.31 50.97 61.64 46.52 82.38
69.50 00.02 36.24 45.22 33.35 72.99
10.38 00.01 25.19 28.56 20.47 63.46
Table 5: The adversarial accuracy under RamBo (RB) Vo2022 and Square (SQ) andriushchenko2020square (#query )

Attack using different losses.

Although attackers generally update greedily based on the margin (of logits) loss andriushchenko2020square ; al2019sign ; ilyas2018black , they may also choose other loss options, such as minimizing the probability margin or maximizing the cross-entropy loss. The results in Table 6 show that although AAA is designed to reverse the margin loss, it also prevents attacks using these other loss types.
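For concreteness, the three loss variants an untargeted attacker might track can be sketched as follows; the function name and batch layout are illustrative, not the attacks' exact implementations:

```python
import torch
import torch.nn.functional as F

def attacker_losses(logits, y):
    """Three common SQA objectives on the true class y (untargeted):
    logits margin, probability margin, and cross-entropy."""
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(y, logits.shape[1]).bool()
    # Logits margin: true-class logit minus the best other logit.
    other_logit = logits.masked_fill(onehot, float('-inf')).max(1)[0]
    logit_margin = logits.gather(1, y[:, None]).squeeze(1) - other_logit
    # Probability margin: the same gap measured on softmax probabilities.
    other_prob = probs.masked_fill(onehot, 0.0).max(1)[0]
    prob_margin = probs.gather(1, y[:, None]).squeeze(1) - other_prob
    # Cross-entropy on the true class (the attacker maximizes this).
    ce = F.cross_entropy(logits, y, reduction='none')
    return logit_margin, prob_margin, ce
```

An attacker minimizes either margin (or maximizes the cross-entropy) greedily, which is exactly the trend AAA perturbs.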

The plug-in advantage of AAA.

AAA, as a plug-in post-processing defense, can be embedded into any defense that increases the model's robustness. As shown in Table 6, AAA dramatically decreases the ECE of AT models without impacting their accuracy. Moreover, the already good defense performance of AT models is further boosted by AAA.

CIFAR-10 () ImageNet ()
WideResNet34 ZagoruykoK16 WideResNet50 ZagoruykoK16
Metric / Loss AT rade2021helper AT-AAA AT salman2020adversarially AT-AAA
ECE 18.96 5.93 5.03 2.64
Acc 91.47 91.47 66.30 66.30
logits-margin 83.22 / 69.67 84.68 / 82.92 59.20 / 51.12 60.83 / 59.73
probability-margin 82.90 / 69.38 84.41 / 82.67 59.21 / 50.59 60.33 / 57.88
cross-entropy 83.93 / 71.17 84.55 / 82.57 60.13 / 52.84 60.53 / 58.01
Table 6: The defense performance under Square attack andriushchenko2020square (#query )
Figure 5: The adversarial accuracy under Square attack andriushchenko2020square when , indicating that AAA outperforms alternatives in defending SQAs by a large margin with also the highest clean accuracy.

Appendix B More visual illustrations

Figure 6: The AAA-defended and undefended margin loss values when attacking the undefended WideResNet-28 ZagoruykoK16 in RobustBench croce2021robustbench with the Square attack andriushchenko2020square () on a CIFAR-10 test sample.
Figure 7: The AAA-defended and undefended margin loss values when attacking the undefended WideResNet-50 ZagoruykoK16 in RobustBench croce2021robustbench with the Square attack andriushchenko2020square () on an ImageNet test sample.

Appendix C Error analysis

Results of multiple runs.

We run the main experiments in Table 2 five times using different random seeds for the Square attack and report the results in Table 7. Since AAA is a deterministic method without randomness, its defense performance is highly consistent.

Average query times.

Attackers generally use the average query times (AQ) to measure attacks, which we report in Table 7. Here we record the AQ over all query samples to reflect the real attack cost. AAA greatly hinders the attack, as reflected by the large AQ and high adversarial accuracy.


We report the FLOPs of each model in Table 2. Since the only computation of AAA is to post-process the logits, its overhead is negligible (< GFLOPs). The total amount of required computation can be obtained by multiplying the FLOPs by the AQ over the 10K samples.

Dataset CIFAR-10 ImageNet ImageNet
Model WideResNet-28 ZagoruykoK16 WideResNet-50 ZagoruykoK16 ResNeXt-101 xie2017aggregated
Acc (%)
ECE (%)
Bound ()
Table 7: The defense performance under Square attack andriushchenko2020square

Appendix D Core code

We present the core part of the AAA Python code (PyTorch) below, where a_i stands for the attractor interval in (4), reverse_step for the reverse step in (5), and calibration_loss_weight for the calibration loss weight in (6).

import torch
import torch.nn.functional as F

# cnn, x_curr, temperature, device, optimizer_lr, and num_iter are assumed
# to be defined by the surrounding inference script; the forward pass runs
# under torch.no_grad(), so logits is a leaf tensor.
logits = cnn(x_curr)
logits_ori = logits.detach()
# Calibration target: the temperature-scaled top-1 probability.
p_target = F.softmax(logits_ori / temperature, dim=1).max(1)[0]
value, index_ori = torch.topk(logits_ori, k=2, dim=1)
margin_ori = value[:, 0] - value[:, 1]
# Eq. (4): snap the original margin to its periodic attractor.
attractor = ((margin_ori / a_i).ceil() - 0.5) * a_i
# Eq. (5): reverse the loss trend around the attractor.
l_target = attractor - reverse_step * (margin_ori - attractor)
mask1 = torch.zeros(logits.shape, device=device)
mask1[torch.arange(logits.shape[0]), index_ori[:, 0]] = 1
with torch.enable_grad():
    logits.requires_grad = True
    optimizer = torch.optim.Adam([logits], lr=optimizer_lr)
    for i in range(num_iter):
        prob = F.softmax(logits, dim=1)
        # Eq. (6): keep the top-1 confidence close to p_target.
        loss_c = ((prob * mask1).max(1)[0] - p_target).abs().mean()
        value, index = torch.topk(logits, k=2, dim=1)
        margin = value[:, 0] - value[:, 1]
        # Defense term: drive the current margin toward the reversed target.
        loss_d = (margin - l_target).abs().mean()
        loss = loss_d + loss_c * calibration_loss_weight
        optimizer.zero_grad(); loss.backward(); optimizer.step()
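For reference, the snippet above can also be wrapped into a standalone function operating on a batch of logits; the hyper-parameter defaults below (a_i, reverse_step, temperature, etc.) are placeholders for illustration, not the paper's tuned values:

```python
import torch
import torch.nn.functional as F

def aaa_defend(logits_ori, a_i=6.0, reverse_step=1.0, temperature=1.0,
               calibration_loss_weight=5.0, lr=0.1, num_iter=100):
    """Optimize a copy of the logits so the top-2 margin matches the
    reversed AAA target while the top-1 confidence stays near p_target."""
    logits_ori = logits_ori.detach()
    p_target = F.softmax(logits_ori / temperature, dim=1).max(1)[0]
    value, index_ori = torch.topk(logits_ori, k=2, dim=1)
    margin_ori = value[:, 0] - value[:, 1]
    attractor = ((margin_ori / a_i).ceil() - 0.5) * a_i          # Eq. (4)
    l_target = attractor - reverse_step * (margin_ori - attractor)  # Eq. (5)

    mask1 = torch.zeros_like(logits_ori)
    mask1[torch.arange(logits_ori.shape[0]), index_ori[:, 0]] = 1
    logits = logits_ori.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(num_iter):
        prob = F.softmax(logits, dim=1)
        loss_c = ((prob * mask1).max(1)[0] - p_target).abs().mean()
        value, _ = torch.topk(logits, k=2, dim=1)
        margin = value[:, 0] - value[:, 1]
        loss_d = (margin - l_target).abs().mean()
        loss = loss_d + loss_c * calibration_loss_weight
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return logits.detach()
```

Because the optimization only nudges the margin toward a nearby positive target while anchoring the top-1 confidence, the defended logits keep the original predictions.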

Appendix E Detailed experimental settings

Defense Dataset Architecture Source ID
None CIFAR-10 WideResNet-28 RobustBench Standard
PSSiLU (AT) CIFAR-10 WideResNet-28 RobustBench Dai2021Parameterizing
HAT (AT) CIFAR-10 WideResNet-34 RobustBench Rade2021Helper_extra
PNI (AT) CIFAR-10 ResNet-20 Official PNI-W (channel-wise)
TRS CIFAR-10 ResNet-20 Official /
None ImageNet WideResNet-50 TorchVision wide_resnet50_2
AT ImageNet WideResNet-50 RobustBench Salman2020Do_50_2
None ImageNet ResNeXt-101 TorchVision resnext101_32x8d
FD (AT) ImageNet ResNeXt-101 Official ResNeXt101_DenoiseAll
Table 8: The used models


The detailed information of all the models used is shown in Table 8. The official repositories of PNI, TRS, and FD are , , and , respectively. AAA, RND, and DENT are directly implemented on the undefended model. RND adds random noise with variance to the input samples, as recommended in qin2021random . In DENT, we follow the original work to optimize the model for 6 iterations using the tent loss and the Adam optimizer (lr ). The pre-trained AT/PNI models come from RobustBench / the official repository. We train the TRS model (an ensemble of 3 models) using the default coeff, lambda, and scale in the official code.
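For comparison with AAA's deterministic post-processing, the RND baseline amounts to a one-line input perturbation; a minimal sketch follows, where the sigma default is a placeholder (the recommended variance is elided in the text above):

```python
import torch

def rnd_forward(model, x, sigma=0.02):
    # Random Noise Defense (qin2021random): perturb each query with small
    # Gaussian input noise before the forward pass. sigma here is a
    # hypothetical placeholder, not the paper's recommended value.
    return model(x + sigma * torch.randn_like(x))
```

Unlike AAA, this randomization perturbs the inputs rather than the output scores, which is why it also costs some clean accuracy and calibration.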


All the attacks are adapted from the official repositories with their original hyper-parameters. SimBA and Bandit are implemented from and , respectively. SignHunter and NES are both from . Square and QueryNet are both from the implementation in . The detailed hyper-parameters of the attacks are outlined in Table 9.

Method Hyperparameter CIFAR-10 ImageNet
SimBA guo2019simple (dimensionality of 2D frequency space) 32 32
order (order of coordinate selection) random random
(step size per iteration) 0.2 0.2
SignHunter al2019sign (finite difference probe) 8 ([0,255]) 0.05 ([0,1])
NES ilyas2018black (finite difference probe) 2.55 0.1
(image learning rate) 2 0.02
(# finite difference estimations / step) 20 100
Bandit ilyas2018prior (finite difference probe) 0.1 0.1
(image learning rate) 0.01 0.01
(online convex optimization learning rate) 0.01 0.01
Tile size (data-dependent prior) 50 50
(bandit exploration) 1.0 1.0
Square andriushchenko2020square (initial probability to change coordinate) 0.05 0.05
QueryNet chen2021querynet Number of batches (NAS training) 500 /
batch size (NAS training) 128 /
Number of layers (NAS surrogate models) 6, 8, 10 /
Table 9: Hyper-parameters for attacks