In recent years, with the rapid development of deep learning, researchers have started to focus on the ‘dark side’ of deep neural networks, such as their lack of interpretability. An important topic in this field is the existence of adversarial examples: slightly modifying the input image, sometimes imperceptibly to humans, can lead to a catastrophic change in the output prediction [Szegedy et al., 2014, Goodfellow et al., 2015]. Driven by the need to understand deep networks and to secure AI-based systems, many efforts have emerged in generating adversarial examples to attack deep networks [Szegedy et al., 2014, Goodfellow et al., 2015, Moosavi-Dezfooli et al., 2016, Carlini and Wagner, 2017, Kurakin et al., 2017a, Dong et al., 2018] and, in the opposite direction, in designing algorithms to recognize adversarial examples and thus eliminate their impact [Papernot et al., 2016b, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Liao et al., 2018].
This paper focuses on defense. On the one hand, researchers have demonstrated that adversarial attacks are fragile, so their impact can be weakened or eliminated by pre-processing the input image [Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Liao et al., 2018]; on the other hand, there are also efforts to reveal how such attacks change mid-level and high-level features and thus eventually break the prediction [Liao et al., 2018, Xie et al., 2018b]. Here we make a hypothesis: the cues for recovering low-level features and high-level features from adversarial attacks are quite different. For low-level features, it is natural to use image-space priors such as intensity distribution and smoothness to filter out adversaries; for high-level features, however, the cues may lie in the intrinsic connections between the semantics of different classes.
To the best of our knowledge, this is the first work that focuses purely on defending against adversarial attacks from topmost semantics. This forces our system to find semantic cues, which were not studied before, for defending against adversaries. To this end, we study logits, which are the class scores produced by the last fully-connected layer before being fed into softmax normalization. The major discovery of this paper is that logits alone suffice as input for defending against adversarial attacks. To reveal this, we first train a 1000-way classifier on the ILSVRC2012 dataset [Russakovsky et al., 2015]. Then, we use an off-the-shelf attacker such as PGD [Madry et al., 2018] to generate adversarial examples and record the resulting logits. Finally, we train a two-layer fully-connected network to predict the original label. We mix both clean and adversarial logits into the training set, so that the trained defender can recover attacked predictions while keeping non-attacked predictions unchanged.
Working at the semantic level, our approach enjoys two benefits. First, we only use logits for defense, which removes the need for image data and mid-level features. This is especially useful in scenarios where image data are inaccessible for security reasons, or where only a small amount of data can be transferred for defense (note that logits have a much lower dimension than low-level and mid-level features). Second and more importantly, we make it possible to explain how adversaries are defended by directly digging into the two-layer network. An interesting phenomenon is that almost all attackers leave fingerprints on a few fixed classes, which we name supporting classes, even though the adversarial class can vary across the entire dataset. Our defender works by detecting these supporting classes, and whether a defender can transfer to different attacks and/or network models can be judged by evaluating how these supporting classes overlap. By revealing these properties, we move one step further towards understanding the mechanisms of adversaries.
In this section, we provide background knowledge about adversarial examples and review some representative attack and defense methods. Let $x$ denote a clean/natural image sample, and let $y$ be the corresponding label, which has $K$ choices. A deep neural network is defined as a mapping $f$ from images to class scores. In practice, given an input $x$, the classifier $f$ outputs a final feature vector $z = f(x) \in \mathbb{R}^{K}$ called logits, where each element $z_k$ corresponds to the $k$-th class. The logits $z$ are then fed into a softmax function to produce the predicted probability, and the predicted class is chosen by $\hat{y} = \arg\max_{k} z_k$. There exist adversarial attacks that go beyond classification [Xie et al., 2017], but we focus on classification because it is the fundamental technique that is extended to other tasks.
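The mapping from logits to a prediction described above can be illustrated with a minimal sketch. The three-class score vector is hypothetical and the helper names `softmax` and `predict` are ours, not part of the paper; since softmax is monotonic, the predicted class is also the index of the largest logit.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Index of the most probable class (equal to the index of the largest logit)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, -1.0, 0.5]   # hypothetical scores for K = 3 classes
probs = softmax(logits)
pred = predict(logits)
```

Because the argmax is unchanged by softmax, a defender that operates on logits sees exactly the information that determines the final label.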
2.1 Adversarial attacks
An adversarial example $x^{\mathrm{adv}}$ is crafted by adding an imperceptible perturbation to the input $x$, so that the prediction of the classifier becomes incorrect. If an attack aims at just forcing the classifier to produce a wrong prediction, i.e., the predicted class $\hat{y}^{\mathrm{adv}}$ of $x^{\mathrm{adv}}$ differs from the true label $y$, it is called a non-targeted attack. On the other hand, a targeted attack is designed to cause the classifier to output a specific prediction $\hat{y}^{\mathrm{adv}} = y^{*}$, where $y^{*}$ is the target label and $y^{*} \neq y$. In this paper, we mainly focus on the non-targeted attack, as it is more common and easier to achieve with a few recent approaches [Goodfellow et al., 2015, Kurakin et al., 2017a, Madry et al., 2018, Dong et al., 2018, Moosavi-Dezfooli et al., 2016, Carlini and Wagner, 2017]. Another way of categorizing adversarial attacks is to consider the amount of information the attacker has. A white-box attack means that the adversary has full access to both the target model and how the defender works. In contrast, a black-box attack means that the attacker knows neither the classifier nor the defender. There are also intermediate cases, termed gray-box attacks, in which part of this information is unknown. Generally, two types of methods for generating adversarial examples have been most frequently used by researchers:
Gradient-sign ($\ell_\infty$) methods. To generate adversarial examples efficiently, Goodfellow et al. [2015] proposed the Fast Gradient Sign Method (FGSM), which takes a single step in the signed-gradient direction of the cross-entropy loss, based on the assumption that the classifier is approximately linear locally. However, this assumption may not hold perfectly, and the success rates of FGSM are relatively low. To address this issue, Kurakin et al. [2017a] proposed the Basic Iterative Method (BIM), which applies the fast gradient step iteratively with smaller step sizes. Madry et al. [2018] further extended this approach to a ‘universal’ first-order adversary by introducing a random start. The resulting Projected Gradient Descent (PGD) method serves as a very strong attack in the white-box scenario. Dong et al. [2018] stated that FGSM adversarial examples ‘under-fit’ the target model while BIM examples ‘over-fit’ it, making both hard to transfer across models. They proposed the Momentum-based Iterative Method (MIM), which integrates a momentum term into the iterative process to stabilize update directions.
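As an illustration of the single signed-gradient step, the following sketch applies FGSM to a toy linear softmax classifier, for which the gradient of the cross-entropy loss with respect to the input has a closed form. The weights, input and step size are hypothetical values chosen for the example, and the helper names are ours; a real attack would obtain the gradient from the target network by automatic differentiation.

```python
import math

def logits_fn(W, b, x):
    """Logits of a linear classifier: z_k = w_k . x + b_k."""
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def loss_grad(W, b, x, y):
    """Gradient of cross-entropy w.r.t. x for a linear softmax model: W^T (p - onehot(y))."""
    p = softmax(logits_fn(W, b, x))
    p[y] -= 1.0
    return [sum(W[k][j] * p[k] for k in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(W, b, x, y, eps):
    """One step of size eps along the sign of the loss gradient."""
    g = loss_grad(W, b, x, y)
    return [xj + eps * sign(gj) for xj, gj in zip(x, g)]

# Hypothetical two-class toy model and input.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x, y = [0.6, 0.4], 0
x_adv = fgsm(W, b, x, y, eps=0.2)
pred_clean = argmax(logits_fn(W, b, x))
pred_adv = argmax(logits_fn(W, b, x_adv))
```

In this toy case the perturbation stays within the chosen $\ell_\infty$ budget and flips the prediction; on deep networks a single step is often not enough, which is the motivation for the iterative variants.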
$\ell_2$ attacks. Moosavi-Dezfooli et al. [2016] also considered a linear approximation of the decision boundaries and proposed the DeepFool attack, which generates adversarial examples that minimize the perturbation measured by the $\ell_2$-norm. It iteratively moves an image towards the nearest decision boundary until the image crosses it and becomes misclassified. Unlike previous iterative gradient-based approaches, Carlini and Wagner [2017] proposed to directly minimize a loss function so that the generated perturbation makes the example adversarial while its $\ell_2$-norm is optimized to be small. The C&W attack generally produces very strong adversarial examples with low distortion.
2.2 Adversarial defenses
Many methods for defending against adversarial examples have been proposed recently. These methods can be roughly divided into two categories. One type of defense works by modifying the training or inference strategies of models to improve their inherent robustness against adversarial examples. Adversarial training is one of the most popular and effective methods of this kind [Goodfellow et al., 2015, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Kannan et al., 2018]. It augments or replaces the training data with adversarial examples and aims at training a robust model that is aware of the existence of adversaries. While effectively improving the robustness to seen attacks, this type of approach often consumes much more computational resources than normal training. Other methods include defensive distillation [Papernot et al., 2016b], saturating networks [Nayebi and Ganguli, 2017], thermometer encoding [Buckman et al., 2018], etc., which benefit from the gradient masking effect [Papernot et al., 2017] or obfuscated gradients [Athalye et al., 2018] but are still vulnerable to black-box attacks.
Another line of defenses is based on removing adversarial perturbations by processing the input image before feeding it to the target model. Dziugaite et al. [2016] studied JPEG compression to reduce the effect of adversarial noise. Osadchy et al. [2017] applied a set of filters, such as the median filter and the averaging filter, to remove perturbations. Guo et al. [2018] studied five image transformations and showed that total variance minimization and image quilting were effective for defense. Xie et al. [2018a] leveraged random resizing and padding to mitigate adversarial effects. Prakash et al. [2018] proposed pixel deflection, which redistributes pixel values locally and thus breaks adversarial patterns. Liao et al. [2018] utilized a high-level representation guided denoiser to defend against adversaries. Nonetheless, these methods did not show high effectiveness against very strong perturbations.
The approach most similar to our work is [Roth et al., 2019], which detects and attempts to recover adversarial examples by measuring how logits change when the input is randomly perturbed. Our method differs in that we rely purely on the final logits and do not use the input image.
3 Settings: dataset, network models and attacks
Throughout this paper, we evaluate our approach on the ILSVRC2012 dataset [Russakovsky et al., 2015], a large-scale image classification task with 1000 classes, around 1.28 million training images and 50,000 validation images. This dataset has also been widely used by other researchers for adversarial attacks and defenses. We directly use a few pre-trained network models from the model zoo of the PyTorch platform [Paszke et al., 2017], including VGG-16 [Simonyan and Zisserman, 2015], ResNet-50 [He et al., 2016] and DenseNet-121 [Huang et al., 2017], which report top-1 classification accuracies of , and , respectively.
As for adversarial attacks, we evaluate a few popular attacks mentioned in Section 2.1 using the CleverHans library [Papernot et al., 2016a], including PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017], each of which causes a dramatic accuracy drop on the pre-trained models (see Section 4.3). Below we elaborate on the technical details of these attacks, most of which simply follow their original implementations.
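PGD [Madry et al., 2018] adversarial examples are generated by the following method:
$$x^{\mathrm{adv}}_{0} \sim \mathcal{U}\big(\mathcal{B}_{\epsilon}(x)\big), \qquad x^{\mathrm{adv}}_{t+1} = \mathrm{Clip}_{\mathcal{B}_{\epsilon}(x)}\Big(x^{\mathrm{adv}}_{t} + \alpha \cdot \mathrm{sign}\big(\nabla_{x} J(x^{\mathrm{adv}}_{t}, y)\big)\Big).$$
Here, $\mathcal{B}_{\epsilon}(x) = \{x' : \|x' - x\|_{\infty} \leq \epsilon\}$ is the $\ell_\infty$ ball around $x$, $x^{\mathrm{adv}}_{0}$ is randomly sampled from inside the ball, $J$ is the cross-entropy loss, and $\mathrm{Clip}_{\mathcal{B}_{\epsilon}(x)}$ means clipping the examples so that they stay in the ball and satisfy the constraint. We set the maximum perturbation size to be , number of iterations and step size .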
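The PGD procedure (random start, signed-gradient step, projection onto the $\ell_\infty$ ball) can be sketched as follows. This is a minimal illustration rather than the CleverHans implementation used in the experiments: `grad_fn` is a hypothetical callable that returns the loss gradient with respect to the input, the constant gradient and all numeric values are toy choices, and the additional clipping to the valid pixel range is omitted.

```python
import random

def sign(v):
    return (v > 0) - (v < 0)

def pgd(x, grad_fn, eps, alpha, steps, rng=random):
    """L-infinity PGD: random start inside the eps-ball around x, then repeated
    signed-gradient ascent steps, each followed by clipping back onto the ball."""
    adv = [xj + rng.uniform(-eps, eps) for xj in x]
    for _ in range(steps):
        g = grad_fn(adv)
        adv = [aj + alpha * sign(gj) for aj, gj in zip(adv, g)]
        adv = [min(max(aj, xj - eps), xj + eps) for aj, xj in zip(adv, x)]
    return adv

# Toy usage: a linear loss whose gradient is constant (hypothetical values).
x = [0.2, 0.8]
adv = pgd(x, grad_fn=lambda v: [1.0, -2.0], eps=0.3, alpha=0.1,
          steps=10, rng=random.Random(0))
```

With a constant gradient, the iterate reaches the corner of the ball selected by the gradient signs regardless of the random start, which makes the effect of the projection easy to check.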
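MIM [Dong et al., 2018] adversarial examples are generated using the following algorithm:
$$g_{t+1} = \mu \cdot g_{t} + \frac{\nabla_{x} J(x^{\mathrm{adv}}_{t}, y)}{\|\nabla_{x} J(x^{\mathrm{adv}}_{t}, y)\|_{1}}, \qquad x^{\mathrm{adv}}_{t+1} = \mathrm{Clip}_{\mathcal{B}_{\epsilon}(x)}\big(x^{\mathrm{adv}}_{t} + \alpha \cdot \mathrm{sign}(g_{t+1})\big),$$
with $x^{\mathrm{adv}}_{0} = x$ and $g_{0} = 0$, where $g_{t}$ accumulates the normalized gradients of the cross-entropy loss $J$ with decay factor $\mu$, and $\mathrm{Clip}_{\mathcal{B}_{\epsilon}(x)}$ keeps the example inside the $\ell_\infty$ ball of radius $\epsilon$ around $x$. With maximum perturbation size , we set the number of iterations , step size and decay factor .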
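A minimal sketch of the momentum update follows, under the same assumptions as for any first-order attack: `grad_fn` is a hypothetical callable returning the loss gradient with respect to the input, and the numeric values are illustrative only. The gradient is normalized by its $\ell_1$-norm before being accumulated, and no random start is used.

```python
def sign(v):
    return (v > 0) - (v < 0)

def mim(x, grad_fn, eps, alpha, steps, mu):
    """Momentum iterative method: accumulate L1-normalized gradients with decay mu,
    step along the sign of the accumulated direction, and clip to the eps-ball."""
    adv = list(x)                 # starts from the clean input (no random start)
    g = [0.0] * len(x)
    for _ in range(steps):
        grad = grad_fn(adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # guard against a zero gradient
        g = [mu * gj + v / l1 for gj, v in zip(g, grad)]
        adv = [aj + alpha * sign(gj) for aj, gj in zip(adv, g)]
        adv = [min(max(aj, xj - eps), xj + eps) for aj, xj in zip(adv, x)]
    return adv

# Toy usage with a constant gradient (hypothetical values).
x = [0.2, 0.8]
adv = mim(x, grad_fn=lambda v: [0.5, -2.0], eps=0.3, alpha=0.1, steps=5, mu=1.0)
```

Because the starting point and the update are deterministic, repeated runs give the same adversarial example, which is consistent with the lower diversity of MIM examples discussed in Section 4.3.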
DeepFool [Moosavi-Dezfooli et al., 2016] iteratively finds the minimal perturbations to cross the linearization of the nearest decision boundaries. most-likely classes are sampled when looking for the nearest boundaries, and the maximum number of iterations is set to be .
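For an affine classifier the linearization is exact, so a single DeepFool step already crosses the nearest decision boundary. The sketch below shows that step for a toy linear model; the weights, the input and the small `overshoot` factor are hypothetical choices, and all classes are searched rather than a sampled subset of the most likely ones. On a deep network the same step is repeated with the local gradients of the logits.

```python
import math

def logits_fn(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def deepfool_step(W, b, x, overshoot=0.02):
    """Move x just across the closest decision boundary of a linear classifier."""
    z = logits_fn(W, b, x)
    k0 = argmax(z)
    best = None
    for k in range(len(z)):
        if k == k0:
            continue
        w_diff = [wk - w0 for wk, w0 in zip(W[k], W[k0])]
        norm_sq = sum(v * v for v in w_diff)
        if norm_sq == 0.0:
            continue
        gap = abs(z[k] - z[k0])
        dist = gap / math.sqrt(norm_sq)      # L2 distance to the boundary with class k
        if best is None or dist < best[0]:
            best = (dist, gap, norm_sq, w_diff)
    if best is None:
        return list(x)
    _, gap, norm_sq, w_diff = best
    scale = (1.0 + overshoot) * gap / norm_sq
    return [xj + scale * wj for xj, wj in zip(x, w_diff)]

# Hypothetical three-class linear model.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, 0.5]
x_adv = deepfool_step(W, b, x)
l2 = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_adv, x)))
```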
C&W [Carlini and Wagner, 2017] jointly optimizes the $\ell_2$-norm of the perturbation and a hinge loss that encourages an incorrect prediction. For efficiency, we set the maximum number of iterations to with binary search steps for the trade-off constant. The initial trade-off constant, hinge margin and learning rate are set to be , and , respectively.
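For reference, one common way to write the non-targeted form of this objective is given below. The notation is ours and is not taken from the paper: $\delta$ is the perturbation, $Z(\cdot)_k$ is the logit of class $k$, $c$ is the trade-off constant found by binary search and $\kappa$ is the hinge margin. The original formulation additionally uses a change of variables to keep pixels in the valid range, which is omitted here.

```latex
\min_{\delta} \; \|\delta\|_2^2
  + c \cdot \max\Big( Z(x+\delta)_{y} - \max_{k \neq y} Z(x+\delta)_{k},\; -\kappa \Big)
```

The hinge term stops contributing once some wrong class exceeds the true class $y$ by at least $\kappa$ in logit space, so the remaining optimization effort goes into shrinking the perturbation norm.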
4 Defending adversarial attacks by logits
Our goal is to detect and defend against adversarial attacks using logits. We investigate logits for two reasons. First, it makes the algorithm easier to deploy in scenarios where the original input image and/or mid-level features are not available. Second, it enables us to understand the mechanisms of adversaries as well as how they are related to the interpretability of deep networks.
Mathematically, we are given a vector of logits, which is the final output of the deep network before softmax, and the input is either a clean image $x$ or an adversarial image $x^{\mathrm{adv}}$. Most often, $x$ leads to the correct class while $x^{\mathrm{adv}}$ can be dramatically wrong. The goal is to recover the original label from the logits of $x^{\mathrm{adv}}$ while keeping the prediction for $x$ unchanged.
4.1 The possibility of defending adversaries by logits
The basis of our research lies in the possibility of defending against adversarial attacks by merely checking the logits. In other words, the numerical values of logits before and after an attack are quite different. To reveal this, we use the PGD attack [Madry et al., 2018] as an example, although we observe similar phenomena in other cases. We apply PGD to all validation images, and plot the distributions of the average values of the logits before and after the dataset is attacked. Figure 1 shows the histogram of the average response. One can observe that adversaries cause significant changes, which are obvious even under such a simple statistic.
The above experiment indicates that adversarial attacks indeed leave some kind of ‘fingerprints’ in high-level feature vectors, in particular in logits. This is interesting because logits are produced by the last layer of a deep network, so (i) they do not contain any spatial information and (ii) each of their elements corresponds to one class of the dataset. The former property makes our defender quite different from those working on image data or intermediate features [Dziugaite et al., 2016, Osadchy et al., 2017, Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Xie et al., 2018b], which often make use of spatial information for noise removal. The latter property makes it easier for the defender to learn inter-class relationships in order to determine whether a case is adversarial and to recover it from attacks. However, due to the high dimensionality of logits (the dimension is 1000 in our problem and can be higher in the future), it is difficult to achieve the goal of defense with a few fixed rules. This motivates us to design a learning-based approach for this purpose.
4.2 Adversarial logits correction
We first consider training a defender for a specific deep network $f$, say ResNet-50 [He et al., 2016], and a specific attack $\mathcal{A}$, say PGD [Madry et al., 2018]. We discuss the possibility of transferring this defender to other combinations of network and attack in Section 5, after we explain how it works. The first step is to build a training dataset. Recall that our goal is to recover the prediction of contaminated data while keeping that of clean data unchanged, so we collect both types of data by feeding each training case $x$ into $\mathcal{A}$ so as to produce its adversarial version $x^{\mathrm{adv}}$. Then, both $x$ and $x^{\mathrm{adv}}$ are fed into $f$ and the corresponding logits are recorded as $z$ and $z^{\mathrm{adv}}$, respectively. We skip the softmax layer that generates the final labels $\hat{y}$ and $\hat{y}^{\mathrm{adv}}$, since it is trivial and does not impact the defense process.
We then train a two-layer fully-connected network upon these $K$-dimensional logits, where $K$ is the number of classes. Specifically, the output layer is still a $K$-dimensional vector which represents the corrected logits, and the hidden layer contains neurons, each of which is equipped with a ReLU activation function [Krizhevsky et al., 2012] and Dropout [Hinton et al., 2012] with a keep ratio of . This design simplifies the network structure as much as possible to facilitate explanation, while still being able to deal with non-linearity in learning relationships between classes. The relatively large number of hidden neurons eases the learning of complicated rules for logits correction. While we believe that a more complicated network design can improve the ability of defense, this is not the most important topic of this paper. As a side note, we tried simply using more hidden layers to train a deeper correction network, but observed a performance decrease on adversarial examples. The training process of this defender network follows standard gradient-based optimization, with very few bells and whistles added. A detailed illustration is provided in Algorithm 1. Note that we mix clean and contaminated images, with a probability $p$ that a clean image is sampled. This strategy is simple yet effective in enabling the defender to maintain the original prediction of clean data.
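The structure of the defender and the clean/adversarial mixing strategy can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 1: the sizes, weights and logits are hypothetical toy values, the function names are ours, Dropout is omitted because it is only active during training, and the gradient-based parameter update itself is not shown.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def correct_logits(z, W1, b1, W2, b2):
    """Two-layer fully-connected corrector: K logits -> H hidden units (ReLU) -> K corrected logits."""
    hidden = [relu(sum(w * zj for w, zj in zip(row, z)) + bh) for row, bh in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + bk for row, bk in zip(W2, b2)]

def sample_training_pair(clean_logits, adv_logits, labels, p_clean, rng):
    """Draw one training example: with probability p_clean take the clean logits,
    otherwise the adversarial ones. The target is always the original label."""
    i = rng.randrange(len(labels))
    z = clean_logits[i] if rng.random() < p_clean else adv_logits[i]
    return z, labels[i]

# Toy sizes (hypothetical): K = 3 classes, H = 4 hidden units, random weights.
rng = random.Random(0)
K, H = 3, 4
W1 = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(H)]
b1 = [0.0] * H
W2 = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(K)]
b2 = [0.0] * K

clean_logits = [[3.0, 0.1, -1.0], [0.2, 2.5, 0.0]]
adv_logits = [[0.5, 4.0, -1.0], [3.1, 0.4, 0.0]]
labels = [0, 1]

z, target = sample_training_pair(clean_logits, adv_logits, labels, p_clean=0.5, rng=rng)
corrected = correct_logits(z, W1, b1, W2, b2)
```

Training would minimize the cross-entropy between the softmax of `corrected` and `target`, so that both clean and adversarial logits of the same image are mapped to the original label.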
Before going into experiments, we briefly discuss the relationship between our approach, adversarial logits correction, and two families of prior work. The first family is ‘adversarial training’ [Goodfellow et al., 2015, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Kannan et al., 2018], which generates adversarial examples online and adds them to the training set. Our approach is different in many aspects, including the goal (we aim to defend already-trained models against adversaries rather than to increase the robustness of the models themselves) and the overhead (we do not require a costly online generation process). More importantly, we generate and defend against adversaries at the level of logits, which, to the best of our knowledge, has never been studied in the previous literature. The second family is often called ‘learning from noisy labels’ [Angluin and Laird, 1988], in which researchers proposed to estimate a transition matrix [Goldberger and Ben-Reuven, 2017, Patrini et al., 2017] in order to address corrupted labels. Although our approach relies on the same assumption that corrupted or adversarial labels can be recovered by checking class-level relationships, we emphasize that the knowledge required for correcting adversaries is quite different from that for correcting label noise. This is because the effects of adversaries are often less deterministic, i.e., the adversarial label is often completely irrelevant to the original one, whereas a noisy label can somewhat reflect the correct one. This is also the reason why we use a learning-based approach and build a two-layer network with a large number of hidden neurons. In practice, using a single-layer network or reducing the number of hidden neurons can cause a dramatic accuracy drop in recovering adversaries, e.g., under the PGD attack on ResNet-50, the recovery rate of a two-layer network is but drops to when a single-layer network is used (see the next part for detailed experimental settings).
4.3 Experimental results
We first evaluate our approach on the PGD attack [Madry et al., 2018] with different pre-trained network models. We use the Adam optimizer [Kingma and Ba, 2015] to train the defenders for epochs on the entire ILSVRC2012 [Russakovsky et al., 2015] training set, with a batch size of , learning rate of , weight decay of , and a probability of choosing clean data, , of . We evaluate both a full testing set ( images) and a selected testing set ( correctly classified images, to compare with other attackers). Table 1 shows the classification accuracy of different networks on both clean and PGD-attacked examples with and without adversarial logits correction. On the one hand, although the PGD attack is very strong in this scenario, reducing the classification accuracy of all networks to nearly , it is possible to recover the correct prediction by merely checking logits, which (i) do not preserve any spatial information, and (ii) as high-level features, are supposed to be perturbed much more severely than the input data [Liao et al., 2018, Xie et al., 2018b]. After correction, the classification accuracies on VGG-16 [Simonyan and Zisserman, 2015] and ResNet-50 [He et al., 2016] are only reduced by and , respectively. To the best of our knowledge, this is the only work which relies purely on logits to defend against adversaries, so it is difficult to compare these results with prior work. On the other hand, on clean (un-attacked) examples, our approach reports a slight accuracy drop, because some of them are considered to be adversaries and thus mistakenly ‘corrected’. This is somewhat inevitable, because logits lose all spatial information and non-targeted attacks sometimes cause large but random changes in logits. These cases are not recoverable without extra information.
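Table 1: Classification accuracy of different networks on clean and PGD-attacked examples, with and without adversarial logits correction, on the full and selected testing sets.
| Clean, full || PGD, full || Clean, selected || PGD, selected ||
| No defense | Corrected | No defense | Corrected | No defense | Corrected | No defense | Corrected |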
Besides, a surprising result produced by our approach is that the classification accuracy on DenseNet-121, after being attacked and recovered, even improves from to . This phenomenon is similar to ‘label leaking’ [Kurakin et al., 2017b], in which the accuracy on adversarial images becomes much higher than that on clean images for a model adversarially trained on FGSM adversarial examples. However, label leaking was found to occur only with one-step attacks that use the true labels, and to vanish when an iterative method is used. In our experimental settings, we use the PGD attack, which is a very strong iterative method with randomness that further increases uncertainty. In addition, our correction network merely takes logits as input and cannot directly access the transformed image. This finding extends the problem of label leaking and reveals the possibility of helping classifiers with adversarial examples.
Next, we evaluate our approach on several state-of-the-art adversarial attacks, including PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017]. A pre-trained ResNet-50 model is used as the target, and all training settings remain unchanged from the previous experiments. The difference is that, in the test stage, we only use test image per class ( in total) that can be correctly classified by ResNet-50, which is mainly due to the slowness of the DeepFool attack. Experimental results are summarized in Table 2, from which one can observe phenomena quite similar to those in the previous experiments. Here, we comment on the different properties of these attackers. For PGD, logits correction manages to recover of adversarial examples and still maintains a sufficiently high accuracy () on clean examples. MIM differs from PGD in that no random start is used and a momentum term is introduced to stabilize gradient updates, which results in less diversity and uncertainty. As a result, logits correction is able to learn better class-level relationships and thus recovers a larger fraction () of adversarial examples. As for DeepFool, note that image distortions are relatively smaller since it minimizes the perturbations in $\ell_2$-norm. Nevertheless, our method still succeeds in correcting of adversarial logits. Among all evaluated attacks, C&W is the most difficult to defend against: our approach yields an accuracy of less than on adversarial examples, and the accuracy on clean examples is also largely affected. This is partly because C&W, besides controlling the $\ell_2$-norm like DeepFool, uses a different kind of objective function and explicitly optimizes the magnitude of perturbations, which results in quite different behaviors in adversarial logits and thus increases the difficulty of correction.
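Table 2: Classification accuracy of ResNet-50 on the selected testing set under different attacks, with and without adversarial logits correction.
| Clean, selected || Adversarial, selected ||
| No defense | Corrected | No defense | Corrected |
Table 3: Transferability of defenders across attacks on ResNet-50 (rows: attackers; columns: defenders).
| Defender trained on |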
Finally, we evaluate transferability, i.e., whether a logits correction network trained on a specific attack can be used to defend against other attacks on the same model. We fix ResNet-50 as the target model, and evaluate the defender trained on each of the four attackers. Results are summarized in Table 3, in which each row corresponds to an attacker and each column to a defender (note that the diagonal is the same as the last column of Table 2). One can see from the table that the defenders trained on PGD and MIM transfer well to each other, mainly due to the similar nature of these two attackers. These two defenders are also able to correct more than half of the adversarial examples generated by DeepFool, which shows a wider range of generalization. On the contrary, the defender trained on DeepFool can hardly recover adversarial examples produced by PGD and MIM, indicating that DeepFool, being an $\ell_2$-norm attacker, has different properties and thus the learned patterns for defense are less transferable. Similarly, C&W behaves more like DeepFool, an $\ell_2$-norm attacker, than like PGD and MIM, two $\ell_\infty$-norm attackers, which is also reflected in the low recovery rates in the last row and column of Table 3. It is interesting to see a high transfer accuracy from C&W to DeepFool, but a low accuracy in the opposite direction. This is because both DeepFool and C&W are $\ell_2$-norm attackers, but the adversarial patterns generated by DeepFool are relatively simpler. So, the defender trained on C&W can cover the patterns of DeepFool ( of cases are defended), but the opposite is not true (only of cases are defended).
In the next section, we provide a new insight into transferability, which focuses on finding the supporting classes of each defense and measuring the overlapping ratio between different sets of supporting classes.
5 Explaining logits-based defense
5.1 How logits correction works
It remains an important topic to explain how our defender works. Thanks to the semantic basis and simplicity of our approach, for each defender, we can find a small set of classes that make the most significant contributions to defense.
Given a clean vector of logits, $z$, or an adversarial one, $z^{\mathrm{adv}}$, the logits correction network acts as a multi-variate mapping $g: \mathbb{R}^{K} \to \mathbb{R}^{K}$ between two spaces, where $K = 1000$ in the ILSVRC2012 dataset [Russakovsky et al., 2015]. Let $z_j$ denote the score of the $j$-th class in the input logits, and $g_i$ the score of the $i$-th class in the corrected logits. The core idea is to compute the partial derivative $\partial g_i / \partial z_j$, so that we can estimate the contribution of each element in the input logits to the corrected logits.
Suppose we have an adversarial sample that has a ground-truth label of $y$ but is misclassified as class $y'$ after being attacked. When it is fed into the correction network $g$, it should be recovered, and thus $g_y$ should have the greatest score. To find out how $g$ manages to perform correction, we can either investigate how $g_y$ is ‘pulled up’ by finding the classes $j$ that have high positive values of $\partial g_y / \partial z_j$, or how other classes are ‘pushed down’ by finding the classes that have high negative impacts on the average of all logits, namely, $\frac{1}{K}\sum_{i} \partial g_i / \partial z_j$.
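The two quantities above can be estimated numerically for any correction function. The sketch below uses central finite differences on a stand-in linear map, so that the expected Jacobian is known exactly; the function `g`, the input `z` and the helper names are hypothetical, and for a real correction network one would normally obtain the same derivatives by automatic differentiation.

```python
def jacobian(g, z, h=1e-5):
    """Central-difference estimate of J[i][j] = d g(z)_i / d z_j."""
    cols = []
    for j in range(len(z)):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        gp, gm = g(zp), g(zm)
        cols.append([(a - c) / (2.0 * h) for a, c in zip(gp, gm)])
    # cols[j][i] holds d g_i / d z_j; transpose so that rows index outputs.
    return [[cols[j][i] for j in range(len(z))] for i in range(len(cols[0]))]

def pull_up_ranking(J, y):
    """Input classes sorted by how strongly they raise the corrected score of class y."""
    return sorted(range(len(J[y])), key=J[y].__getitem__, reverse=True)

def mean_effect(J):
    """Effect of each input class on the average of all corrected logits."""
    n_out = len(J)
    return [sum(J[i][j] for i in range(n_out)) / n_out for j in range(len(J[0]))]

# Stand-in for the correction network: a fixed linear map (hypothetical).
g = lambda z: [z[0] + 2.0 * z[1], -z[1]]
J = jacobian(g, [0.3, -0.7])
ranking = pull_up_ranking(J, y=0)
avg = mean_effect(J)
```

Here `ranking` lists the input classes that ‘pull up’ output class 0, and strongly negative entries of `avg` would mark input classes that ‘push down’ the logits on average.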
We first explore an example using the logits correction network trained to defend against the PGD attack [Madry et al., 2018] on ResNet-50 [He et al., 2016]. For each adversarial input in the validation set, we find out the greatest entries of $\partial g_y / \partial z_j$ (the derivative of the corrected score of class $y$ with respect to the input logit of class $j$), with $y$ being the original label and $j$ ranging over all classes. Interestingly, for a large number of cases, no matter what the ground-truth label or the input image is, there always exist some specific classes that contribute most to recovering the correct label. We count over all validation images, and find that the greatest $\partial g_y / \partial z_j$ appears mostly when $j$ equals , , , , and . When we compute the effect on the average of all corrected logits instead, most cases have the highest response in the -th class (i.e., ‘manhole cover’). Similar phenomena are also found when we use PGD to attack other target networks, including VGG-16 [Simonyan and Zisserman, 2015] and DenseNet-121 [Huang et al., 2017]. Some of the classes that contribute most to $\partial g_y / \partial z_j$ overlap with those for ResNet-50, implying that these classes are fragile to PGD. When we compute the effect on the average of all corrected logits instead, the -th class still dominates for both DenseNet-121 and VGG-16.
5.2 Supporting classes and their relationship to transferability of defense
Here, we define a new concept named supporting classes as those classes that contribute most to logits-based defense. Taking both positive (‘pulling up’) and negative (‘pushing down’) effects into consideration, we compute . Among all classes, classes of with greatest values are taken out for each image. We count the occurrences over the selected test set ( images), and finally pick the classes that appear most frequently as the supporting classes for defending against a specific attack on the model. The supporting classes for defending against the PGD attack on ResNet-50 are illustrated in Figure 2, and we also show other cases in the supplementary material. We first emphasize that these classes are indeed the key to defense. On an arbitrary image, if we reduce the logit values of these supporting classes by and feed the modified logits into the trained defender, it is almost certain that logits correction fails dramatically and, very likely, the classification result remains the originally misclassified class, even when the original or adversarial classes of this case are almost irrelevant to these supporting classes.
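The counting procedure for supporting classes can be sketched as follows. The per-image contribution scores, the number of classes kept per image (`k_per_image`) and the number of supporting classes (`n_support`) are hypothetical toy values, and the function names are ours; in the paper the scores combine the ‘pulling up’ and ‘pushing down’ effects.

```python
from collections import Counter

def top_classes(scores, k):
    """Indices of the k classes with the greatest contribution scores."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def supporting_classes(per_image_scores, k_per_image, n_support):
    """Count how often each class is among the top contributors of an image,
    then keep the classes that appear most frequently."""
    counts = Counter()
    for scores in per_image_scores:
        counts.update(top_classes(scores, k_per_image))
    return [c for c, _ in counts.most_common(n_support)], counts

# Toy contribution scores for three images over four classes (hypothetical).
per_image_scores = [
    [0.1, 0.9, 0.3, 0.8],
    [0.7, 0.2, 0.1, 0.9],
    [0.0, 0.6, 0.2, 0.5],
]
support, counts = supporting_classes(per_image_scores, k_per_image=2, n_support=2)
```

In this toy example class 3 is among the top contributors of every image and class 1 of two images, so these two classes would be reported as the supporting classes.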
As a further analysis, we reveal the relationship between the overlapping ratio of supporting classes and the transferability of our defender from one setting to another. We compute the Bhattacharyya coefficients between the sets of supporting classes produced by the four attacks. The first pair is PGD and MIM, which reports a high coefficient of . This implies that PGD and MIM behave quite similarly in attacks, which is mainly due to their similar mechanisms (e.g., $\ell_\infty$-bounded, iteration-based, etc.). Consequently, as shown in Table 3, the defenders trained on PGD and MIM transfer well to each other. A similar phenomenon appears in the pair of DeepFool and C&W, two $\ell_2$-norm attackers, with a coefficient of . As a result, the defenders trained on DeepFool and C&W produce the best transfer accuracy on each other (as we explained before, the weak transferability of the DeepFool defender is mainly because DeepFool is very easy to defend against). Across $\ell_\infty$-norm and $\ell_2$-norm attackers, we report coefficients of for the pair of PGD and DeepFool, and for PGD and C&W, respectively. Note that both numbers are lower than those between attacks of the same type. In addition, the defender trained on PGD achieves an accuracy of on DeepFool, and merely on C&W, which aligns with the coefficient values. These results verify our motivation: logits-based correction is easier to explain, in particular at the semantic level.
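For two discrete distributions $p$ and $q$ over classes, the Bhattacharyya coefficient is $\sum_{c} \sqrt{p_c\, q_c}$, which equals 1 for identical distributions and 0 for distributions with disjoint support. The sketch below assumes that each set of supporting classes is represented by its occurrence counts, normalized to sum to one; how the paper turns the sets into distributions is not stated here, so this representation and the toy counts are assumptions.

```python
import math

def bhattacharyya(counts_a, counts_b):
    """Bhattacharyya coefficient between two class-frequency dictionaries."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()
    return sum(
        math.sqrt((counts_a[c] / total_a) * (counts_b[c] / total_b))
        for c in shared
    )

# Toy occurrence counts of supporting classes for three attacks (hypothetical).
attack_a = {1: 2, 3: 2}
attack_b = {1: 2, 3: 2}
attack_c = {3: 1, 7: 1}

same = bhattacharyya(attack_a, attack_b)
partial = bhattacharyya(attack_a, attack_c)
disjoint = bhattacharyya({1: 1}, {2: 1})
```

A higher coefficient means that two attacks rely on more similar supporting classes, which is the quantity compared against the transfer accuracies in Table 3.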
Last but not least, the impact of supporting classes can also be analyzed at the instance level, i.e., by finding the classes that contribute most to defending against each attack on each single image. We observe a few interesting phenomena, e.g., for PGD and MIM, the top-1 supporting class is very likely to be the -th class, while for DeepFool, the most important class is always the ground-truth class, which differs from case to case. This partly explains the transferability between PGD/MIM and DeepFool. More instance-level analysis is provided in the supplementary material.
In this paper, we find that a wide range of state-of-the-art adversarial attacks can be defended against by merely correcting logits, the features produced by the last layer of a deep neural network. This implies that each attacker leaves ‘fingerprints’ during the attack. Although it is difficult to design rules to detect and eliminate such impacts, we design a simple but effective learning-based approach, which achieves high recovery rates in a few combinations of attacks and target networks. Going one step further, we reveal that our defender works by finding a few supporting classes for each attack-network combination, and that by checking the overlapping ratio of these classes, we can estimate the transferability of a defense across different scenarios.
Our research leaves a few unsolved problems. For example, it is unclear whether there exists an attack algorithm that cannot be corrected by our defender, or if we can find deeper connections between our discovery and the mechanism of deep neural networks. In addition, we believe that improving the transferability of this defender is a promising direction, in which we shall continue in the future.
- Angluin and Laird [1988] Dana Angluin and Philip D. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
- Athalye et al. [2018] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
- Buckman et al. [2018] Jacob Buckman, Aurko Roy, Colin A. Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR, 2018.
- Carlini and Wagner [2017] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), 2017.
- Dong et al. [2018] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In CVPR, 2018.
- Dziugaite et al. [2016] Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M. Roy. A study of the effect of jpg compression on adversarial images. CoRR, abs/1608.00853, 2016.
- Goldberger and Ben-Reuven [2017] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
- Goodfellow et al. [2015] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
- Guo et al. [2018] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. In ICLR, 2018.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Hinton et al. [2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
- Huang et al. [2017] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- Kannan et al. [2018] Harini Kannan, Alexey Kurakin, and Ian J. Goodfellow. Adversarial logit pairing. In NeurIPS, 2018.
- Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
- Kurakin et al. [2017a] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR Workshop, 2017a.
- Kurakin et al. [2017b] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In ICLR, 2017b.
- Liao et al. [2018] Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Jun Zhu, and Xiaolin Hu. Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, 2018.
- Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
- Moosavi-Dezfooli et al. [2016] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In CVPR, 2016.
- Nayebi and Ganguli [2017] Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. CoRR, abs/1703.09202, 2017.
- Osadchy et al. [2017] Margarita Osadchy, Julio Hernandez-Castro, Stuart J. Gibson, Orr Dunkelman, and Daniel Pérez-Cabo. No bot expects the deepcaptcha! introducing immutable adversarial examples, with applications to captcha generation. In IEEE Transactions on Information Forensics and Security, volume 12, pages 2640–2653, 2017.
- Papernot et al. [2016a] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, Rujun Long, and Patrick McDaniel. Technical report on the cleverhans v2.1.0 adversarial examples library. CoRR, abs/1610.00768, 2016a.
- Papernot et al. [2016b] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP), 2016b.
- Papernot et al. [2017] Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017.
- Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NeurIPS Workshop, 2017.
- Patrini et al. [2017] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
- Prakash et al. [2018] Aaditya Prakash, Nick Moran, Solomon Garber, Antonella DiLillo, and James A. Storer. Deflecting adversarial attacks with pixel deflection. In CVPR, 2018.
- Roth et al. [2019] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. CoRR, abs/1902.04818, 2019.
- Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. In IJCV, volume 115, pages 211–252, 2015.
- Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
- Tramèr et al. [2018] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian J. Goodfellow, Dan Boneh, and Patrick D. McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018.
- Xie et al. [2017] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan L. Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, 2017.
- Xie et al. [2018a] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Loddon Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018a.
- Xie et al. [2018b] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Loddon Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. CoRR, abs/1812.03411, 2018b.
Appendix A Supporting classes of different attacks
In Figure 3, we illustrate the supporting classes of defending PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017] on ResNet-50 [He et al., 2016], respectively. Just as in the case of the PGD attack on different target networks, these different attacks also share some supporting classes. Note that the supporting classes of PGD and MIM are nearly the same, even considering their relative order in frequency. This aligns with the high Bhattacharyya coefficient between them and the good transferability of defenders trained on them. The two $\ell_2$-norm attacks, DeepFool and C&W, also have very similar supporting classes, and this similarity likewise accounts for the transferability of their corresponding defenders. Besides, one can observe from Figure 3 that MIM and DeepFool have a stronger response in the -th class, i.e., ‘manhole cover’, which may explain why they are easier to defend against. Similarly, the difficulty of defending against C&W may also lie in the fact that the supporting classes of this attack are not as strong as those of other attackers, e.g., the dominant class is also the -th class, but its frequency is relatively lower (and closer to that of the second class).
Appendix B Delving into supporting classes at instance level
To better understand the impact of supporting classes, we further inspect the classes that contribute most to defending a single example, i.e., at an instance level. The top- classes of and their corresponding values are taken out for each image. Figure 4 shows such classes and values for an example with ground-truth label and attacked by the four attackers.
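The instance-level inspection described above can be sketched as follows. We assume here that each image comes with a vector of per-class contribution scores; how these scores are obtained is not reproduced in this sketch, and the function name, the variable names and the toy values are all hypothetical.

```python
from collections import Counter


def top_k_classes(scores, k):
    """Return (class_index, score) pairs for the k largest scores, in descending order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of classes")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Hypothetical per-class contribution scores for three attacked images
# over a 4-class problem (illustrative values only).
contributions = [
    [0.10, 0.70, 0.05, 0.15],
    [0.20, 0.50, 0.30, 0.00],
    [0.60, 0.30, 0.05, 0.05],
]

# Top-2 contributing classes for each image.
for image_id, scores in enumerate(contributions):
    print(image_id, top_k_classes(scores, 2))

# How often each class is the single most important contributor across images.
top1_counts = Counter(top_k_classes(scores, 1)[0][0] for scores in contributions)
print(top1_counts.most_common())
```

Counting the top-1 class over many images is one simple way to see whether a single class dominates the correction, which is the kind of pattern discussed in the rest of this appendix.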
We first explore the PGD attacker [Madry et al., 2018] as usual, and find that while the supporting classes we obtained in the previous part frequently appear in the top-, the -th class always occupies the top- and even top-, especially when the attacked example is successfully corrected by the defender. This once again shows the importance of the -th class, and a similar phenomenon is found when the MIM attacker [Dong et al., 2018] is used.
As for DeepFool [Moosavi-Dezfooli et al., 2016], things become different, as the top- of always lies in , the ground-truth label, with a much greater value than the second most significant supporting class. This is mainly due to the design of DeepFool, which moves an example across the nearest decision boundary, so that the original class should still have a high score in the adversarial logits . In other words, the adversary of DeepFool can be recovered by assigning a greater weight to . Consequently, although DeepFool shares the property that the -th class still appears most frequently, it shows quite a different behavior in defense, which is partly reflected in the transfer experiments between DeepFool and PGD. Given an example attacked by DeepFool and a defender trained on PGD, the defender can correct the example based on the supporting classes rather than , yielding a relatively good performance; on the contrary, given an example attacked by PGD and a defender trained on DeepFool, the defender will focus too much on the ground-truth class of the example, and thus fail to correct the attack of PGD, which does not have such a preference.
Finally, we study the case of C&W [Carlini and Wagner, 2017]. We find that the set of supporting classes of lies between PGD and DeepFool, and is more similar to that of DeepFool (i.e., the most significant class is usually the ground-truth class , but sometimes ). This corresponds to the fact that the defender trained on C&W transfers to DeepFool better than to PGD. However, since the behavior of C&W is more irregular (e.g., is not always the most important contributor), it is more difficult to defend against with defenders trained on other attacks.