Defending Adversarial Attacks by Correcting logits

06/26/2019 ∙ by Yifeng Li, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Shanghai Jiao Tong University

Generating and eliminating adversarial examples has been an intriguing topic in the field of deep learning. While previous research verified that adversarial attacks are often fragile and can be defended against via image-level processing, it remains unclear how high-level features are perturbed by such attacks. We investigate this issue from a new perspective, which relies purely on logits, the class scores before softmax, to detect and defend against adversarial attacks. Our defender is a two-layer network trained on a mixed set of clean and perturbed logits, with the goal of recovering the original prediction. Across a wide range of adversarial attacks, our simple approach shows promising results with relatively high defense accuracy, and the defender can transfer across attackers with similar properties. More importantly, our defender can work in scenarios where image data are unavailable, and enjoys high interpretability, especially at the semantic level.




1 Introduction

In the past years, with the blooming development of deep learning, researchers started focusing on the ‘dark side’ of deep neural networks, such as the lack of interpretability. In this research field, an important topic is the existence of adversarial examples: slightly modifying the input image, sometimes imperceptibly to humans, can lead to a catastrophic change in the output prediction [Szegedy et al., 2014, Goodfellow et al., 2015]. Driven by the requirements of understanding deep networks and securing AI-based systems, many efforts have emerged in generating adversarial examples to attack deep networks [Szegedy et al., 2014, Goodfellow et al., 2015, Moosavi-Dezfooli et al., 2016, Carlini and Wagner, 2017, Kurakin et al., 2017a, Dong et al., 2018] and, in the opposite direction, designing algorithms to recognize adversarial examples and thus eliminate their impacts [Papernot et al., 2016b, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Liao et al., 2018].

This paper focuses on the defense part. On the one hand, researchers have demonstrated that adversarial attacks are fragile and thus their impacts can be weakened or eliminated by some pre-processing on the input image [Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Liao et al., 2018]; on the other hand, there are also efforts in revealing how such attacks change mid-level and high-level features and thus eventually break the prediction [Liao et al., 2018, Xie et al., 2018b]. Here we make a hypothesis: the cues for recovering low-level features and high-level features from adversarial attacks are quite different. For low-level features, it is natural to use image-space priors such as intensity distribution and smoothness to filter out adversaries; for high-level features, however, the cues may lie in the intrinsic connections between the semantics of different classes.

To the best of our knowledge, this is the first work that purely focuses on defending against adversarial attacks from the topmost semantics. This forces our system to find semantic cues, which were not studied before, for defending against adversaries. To this end, we study logits, which are the class scores produced by the last fully-connected layer before being fed into softmax normalization. The major discovery of this paper is that we can take merely logits as input to defend against adversarial attacks. To reveal this, we first train a 1,000-way classifier on the ILSVRC2012 dataset [Russakovsky et al., 2015]. Then, we use an off-the-shelf attacker such as PGD [Madry et al., 2018] to generate adversarial examples and record the resulting logits. Finally, we train a two-layer fully-connected network to predict the original label. We mix both clean and adversarial logits into the training set, so that the trained defender is able to recover attacked predictions while keeping non-attacked predictions unchanged.

Working at the semantic level, our approach enjoys two benefits. First, we only use logits for defense, which reduces the requirement for image data and mid-level features. This is especially useful in scenarios where image data are inaccessible for security reasons, or where only very few bits of data can be transferred for defense (note that, compared to low-level and mid-level features, logits have a much lower dimension). Second and more importantly, we make it possible to explain how adversaries are defended against by directly digging into the two-layer network. An interesting phenomenon is that almost all attackers leave fingerprints on a few fixed classes, which we name supporting classes, even if the adversarial class can vary across the entire dataset. Our defender works by detecting these supporting classes, and whether a defender can transfer to different attacks and/or network models can be judged by evaluating how these supporting classes overlap. By revealing these properties, we move one step further towards understanding the mechanisms of adversaries.

2 Backgrounds

In this section, we provide background knowledge about adversarial examples and review some representative attack and defense methods. Consider a clean (natural) image sample and its corresponding label, which takes one of a fixed number of classes. Given an input image, a deep neural network classifier outputs a final feature vector called logits, in which each element corresponds to one class. The logits are then fed into a softmax function to produce the predicted probabilities, and the predicted class is the one with the highest probability. There exist adversarial attacks that go beyond classification [Xie et al., 2017], but we focus on classification because it is the fundamental technique extended to other tasks.
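The logits-to-prediction step described above can be sketched in a few lines of plain Python. The function names and the three-class example values are illustrative assumptions, not part of the paper.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the index of the class with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, 0.5, -1.0]   # hypothetical 3-class logits
probs = softmax(logits)
label = predict(logits)
```

Because softmax is monotonic, the predicted class can equally be read off the logits directly, which is why a defender can operate on logits alone.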

2.1 Adversarial attacks

An adversarial example is crafted by adding an imperceptible perturbation to the input, so that the prediction of the classifier becomes incorrect. If an attack aims at just forcing the classifier to produce a wrong prediction, i.e., any class other than the true label, it is called a non-targeted attack. On the other hand, a targeted attack is designed to cause the classifier to output a specific prediction, the target label, which differs from the true label. In this paper, we mainly focus on the non-targeted attack, as it is more common and easier to achieve with a few recent approaches [Goodfellow et al., 2015, Kurakin et al., 2017a, Madry et al., 2018, Dong et al., 2018, Moosavi-Dezfooli et al., 2016, Carlini and Wagner, 2017]. Another way of categorizing adversarial attacks is to consider the amount of information the attacker has. A white-box attack means that the adversary has full access to both the target model and how the defender works. In contrast, a black-box attack indicates that the attacker knows neither the classifier nor the defender. There are also some intermediate cases, termed gray-box attacks, in which partial information is unknown. Generally, two types of methods for generating adversarial examples were most frequently used by researchers:

Gradient-sign (ℓ∞) methods. To generate adversarial examples efficiently, Goodfellow et al. [2015] proposed the Fast Gradient Sign Method (FGSM) that took a single step in the signed-gradient direction of the cross-entropy loss, based on the assumption that the classifier is approximately linear locally. However, this assumption may not hold perfectly and the success rates of FGSM are relatively low. To address this issue, Kurakin et al. [2017a] proposed the Basic Iterative Method (BIM) that applied the fast gradient step iteratively with smaller step sizes. Madry et al. [2018] further extended this approach to a ‘universal’ first-order adversary by introducing a random start. The resulting Projected Gradient Descent (PGD) method served as a very strong attack in the white-box scenario. Dong et al. [2018] stated that FGSM adversarial examples ‘under-fit’ the target model while BIM examples ‘over-fit’ it, making them hard to transfer across models. They proposed the Momentum-based Iterative Method (MIM) that integrated a momentum term into the iterative process to stabilize update directions.

ℓ2 attacks. Moosavi-Dezfooli et al. [2016] also considered a linear approximation of the decision boundaries and proposed the DeepFool attack to generate adversarial examples that minimize perturbations measured by the ℓ2-norm. It iteratively moved an image towards the nearest decision boundary until the image crossed it and became misclassified. Unlike previous iterative gradient-based approaches, Carlini and Wagner [2017] proposed to directly minimize a loss function so that the generated perturbation makes the example adversarial while its ℓ2-norm is optimized to be small. The C&W attack always generates very strong adversarial examples with low distortions.
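As a rough illustration of the gradient-sign family described above, the sketch below attacks a toy linear softmax classifier with NumPy. It is a simplified stand-in, not the CleverHans implementation used in the paper: the toy model, the `eps / steps` step size, and all function and parameter names are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x
    for a toy linear classifier with logits z = W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                      # dL/dz = softmax(z) - onehot(y)
    return W.T @ p

def gradient_sign_attack(W, b, x, y, eps, steps=1, mu=0.0,
                         random_start=False, rng=None):
    """steps=1 behaves like FGSM, steps>1 like BIM,
    random_start=True like PGD, and mu>0 adds MIM-style momentum."""
    alpha = eps / steps              # assumed step-size schedule
    x_adv = x.copy()
    if random_start:
        if rng is None:
            rng = np.random.default_rng(0)
        x_adv = x_adv + rng.uniform(-eps, eps, size=x.shape)
    g = np.zeros_like(x)
    for _ in range(steps):
        grad = input_gradient(W, b, x_adv, y)
        # Momentum accumulates the L1-normalized gradient (mu=0 disables it).
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)
        # Project back into the l_inf ball of radius eps around x.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

A real image attack would additionally clip pixel values to the valid range and differentiate through a deep network rather than a linear map.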

2.2 Adversarial defenses

Many methods for defending against adversarial examples have been proposed recently. These methods can be roughly divided into two categories. One type of defense worked by modifying the training or inference strategies of models to improve their inherent robustness against adversarial examples. Adversarial training is one of the most popular and effective methods of this kind [Goodfellow et al., 2015, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Kannan et al., 2018]. It augmented or replaced the training data with adversarial examples and aimed at training a robust model that is aware of the existence of adversaries. While effectively improving the robustness to seen attacks, this type of approach often consumed much more computational resources than normal training. Other methods include defensive distillation [Papernot et al., 2016b], saturating networks [Nayebi and Ganguli, 2017], thermometer encoding [Buckman et al., 2018], etc., which benefited from the gradient masking effect [Papernot et al., 2017] or obfuscated gradients [Athalye et al., 2018] but were still vulnerable to black-box attacks.

Another line of defenses was based on removing adversarial perturbations by processing the input image before feeding it to the target model. Dziugaite et al. [2016] studied JPEG compression to reduce the effect of adversarial noise. Osadchy et al. [2017] applied a set of filters such as the median filter and averaging filter to remove perturbations. Guo et al. [2018] studied five image transformations and showed that total-variance minimization and image quilting were effective for defense. Xie et al. [2018a] leveraged random resizing and padding to mitigate adversarial effects. Prakash et al. [2018] proposed pixel deflection that redistributed pixel values locally and thus broke adversarial patterns. Liao et al. [2018] utilized a high-level representation guided denoiser to defend against adversaries. Nonetheless, these methods did not show high effectiveness against very strong perturbations.

The approach most similar to our work is [Roth et al., 2019], which detected and attempted to recover adversarial examples by measuring how logits change when the input is randomly perturbed. Our method differs in that we purely leverage the final logits and do not use the input image.

3 Settings: dataset, network models and attacks

Throughout this paper, we evaluate our approach on the ILSVRC2012 dataset [Russakovsky et al., 2015], a large-scale image classification task with 1,000 classes, around 1.3M training images and 50K validation images. This dataset was also widely used by other researchers for adversarial attacks and defenses. We directly use a few pre-trained network models from the model zoo of the PyTorch platform [Paszke et al., 2017], including VGG-16 [Simonyan and Zisserman, 2015], ResNet-50 [He et al., 2016] and DenseNet-121 [Huang et al., 2017], which report top-1 classification accuracies of , and , respectively.

As for adversarial attacks, we evaluate a few popular attacks mentioned in Section 2.1 using the CleverHans library [Papernot et al., 2016a], including PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017], each of which causes a dramatic accuracy drop on the pre-trained models (see Section 4.3). Below we elaborate on the technical details of these attacks, most of which simply follow their original implementations.

PGD [Madry et al., 2018] adversarial examples are generated by the following method:


Here, is the ball around , is randomly sampled from inside the ball and means clipping the examples so that they stay in the ball and satisfy the constraint. We set the maximum perturbation size to be , number of iterations and step size .
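For reference, the PGD iteration of Madry et al. [2018] is commonly written as below. The symbols are notation assumed here for illustration and may differ from the authors' own: x is the clean image with label y, J is the cross-entropy loss, ε is the maximum perturbation size, α is the step size, d is the input dimension, and Clip projects back onto the ℓ∞ ball of radius ε around x.

```latex
\begin{aligned}
x^{\mathrm{adv}}_{0} &= x + u, \qquad u \sim \mathcal{U}(-\epsilon, \epsilon)^{d}, \\
x^{\mathrm{adv}}_{t+1} &= \mathrm{Clip}_{x,\epsilon}\!\left( x^{\mathrm{adv}}_{t}
  + \alpha \cdot \mathrm{sign}\!\left( \nabla_{x} J\!\left(x^{\mathrm{adv}}_{t}, y\right) \right) \right).
\end{aligned}
```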

MIM [Dong et al., 2018] adversarial examples are generated using the following algorithm:


With maximum perturbation size , we set the number of iterations , step size and decay factor .
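For reference, the MIM iteration of Dong et al. [2018] is commonly written as below. The notation is assumed here for illustration: x is the clean image with label y, J is the cross-entropy loss, g is the accumulated gradient, μ is the decay factor, α is the step size, and Clip projects back onto the ℓ∞ ball of radius ε around x (the clipping follows common library practice and is an assumption of this sketch).

```latex
\begin{aligned}
g_{0} &= 0, \qquad x^{\mathrm{adv}}_{0} = x, \\
g_{t+1} &= \mu \cdot g_{t}
  + \frac{\nabla_{x} J\!\left(x^{\mathrm{adv}}_{t}, y\right)}
         {\left\lVert \nabla_{x} J\!\left(x^{\mathrm{adv}}_{t}, y\right) \right\rVert_{1}}, \\
x^{\mathrm{adv}}_{t+1} &= \mathrm{Clip}_{x,\epsilon}\!\left( x^{\mathrm{adv}}_{t}
  + \alpha \cdot \mathrm{sign}\!\left( g_{t+1} \right) \right).
\end{aligned}
```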

DeepFool [Moosavi-Dezfooli et al., 2016] iteratively finds the minimal perturbations to cross the linearization of the nearest decision boundaries. most-likely classes are sampled when looking for the nearest boundaries, and the maximum number of iterations is set to be .

C&W [Carlini and Wagner, 2017] jointly optimizes the norm of perturbation and a hinge loss of giving an incorrect prediction. For efficiency, we set the maximum number of iterations to with binary search steps for the trade-off constant. The initial trade-off constant, hinge margin and learning rate are set to be , and , respectively.

4 Defending adversarial attacks by logits

Our goal is to detect and defend against adversarial attacks using logits. There are two main reasons for investigating logits. First, it makes the algorithm easier to deploy in scenarios where the original input image and/or mid-level features are not available. Second, it enables us to understand the mechanisms of adversaries as well as how they are related to the interpretability of deep networks.

Mathematically, we are given a vector of logits, which is the final output of the deep network before softmax, and the input is either a clean image or an adversarial image. Most often, the clean image leads to the correct class while the adversarial one can be dramatically wrong. The goal is to recover the original label from the adversarial logits while keeping the prediction on the clean logits unchanged.

4.1 The possibility of defending adversaries by logits

Figure 1: Average response of logits on clean and PGD adversarial examples, counted on the validation set of ILSVRC2012. We fix the number of bins to be for both types of data. In most cases, the PGD attack has made the mean value of logits greater.

The basis of our research lies in the possibility of defending against adversarial attacks by merely checking the logits. In other words, the numerical values of logits before and after an attack are quite different. To reveal this, we use the PGD attack [Madry et al., 2018] as an example, while we also observe similar phenomena in other cases. We apply PGD to all validation images, and plot the distributions of the average values of the logits before and after the dataset is attacked. Figure 1 shows the histogram of the average response. One can observe that adversaries cause significant changes which are obvious even under such a simple statistic.
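The statistic behind this comparison is simple to compute. The sketch below assumes the clean and adversarial logits are already stored as two NumPy arrays of shape (number of images, number of classes); the function name and the default bin count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mean_logit_histogram(clean_logits, adv_logits, num_bins=50):
    """Histogram the per-image mean logit for clean and adversarial data,
    using shared bin edges so the two histograms are comparable."""
    clean_means = clean_logits.mean(axis=1)
    adv_means = adv_logits.mean(axis=1)
    lo = min(clean_means.min(), adv_means.min())
    hi = max(clean_means.max(), adv_means.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    clean_hist, _ = np.histogram(clean_means, bins=edges)
    adv_hist, _ = np.histogram(adv_means, bins=edges)
    return edges, clean_hist, adv_hist
```

Plotting `clean_hist` and `adv_hist` against `edges` gives a figure of the same kind as Figure 1.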

The above experiment indicates that adversarial attacks indeed leave some kind of ‘fingerprints’ in high-level feature vectors, in particular logits. This is interesting because logits are produced by the last layer of a deep network, so (i) they do not contain any spatial information and (ii) each element corresponds to one class of the dataset. The former property makes our defender quite different from those working on image data or intermediate features [Dziugaite et al., 2016, Osadchy et al., 2017, Guo et al., 2018, Xie et al., 2018a, Prakash et al., 2018, Xie et al., 2018b], which often made use of spatial information for noise removal. The latter property makes it easier for the defender to learn inter-class relationships to determine whether a case is adversarial and to recover it from attacks. However, due to the high dimensionality of logits (the dimension is 1,000 in our problem and can be higher in the future), it is difficult to achieve the goal of defense with a few fixed rules. This motivates us to design a learning-based approach for this purpose.

4.2 Adversarial logits correction

We first consider training a defender for a specific deep network, say ResNet-50 [He et al., 2016], and a specific attack, say PGD [Madry et al., 2018]. We will discuss the possibility of transferring this defender to other combinations of network and attack in Section 5, after we explain how it works. The first step is to build a training dataset. Recall that our goal is to recover the prediction of contaminated data while keeping that of clean data unchanged, so we collect both types of data by feeding each training case into the attacker so as to produce its adversarial version. Then, both the clean and the adversarial versions are fed into the network and the corresponding logits are recorded. We skip the softmax layer that generates the final labels, since it is trivial and does not impact the defense process.

We then train a two-layer fully-connected network upon these 1,000-dimensional logits. Specifically, the output layer is still a 1,000-dimensional vector which represents the corrected logits, and the hidden layer contains neurons, each of which is equipped with a ReLU activation function [Krizhevsky et al., 2012] and Dropout [Hinton et al., 2012] with a keep ratio of . This design is to maximally simplify the network structure to facilitate explanation, while being able to deal with non-linearity in learning relationships between classes. The relatively large number of hidden neurons eases learning complicated rules for logits correction. While we believe a more complicated network design can improve its ability of defense, this is not the most important topic of this paper. As a side note, we tried to simply use more hidden layers to train a deeper correction network, but observed a performance decrease on adversarial examples. The training process of this defender network follows a standard gradient-based optimization, with very few bells and whistles added. A detailed illustration is provided in Algorithm 1. Note that we mix clean and contaminated images, sampling a clean image with a fixed probability. This strategy is simple yet effective in enabling the defender to maintain the original prediction of clean data.

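Input: clean training set, pre-trained classifier, adversarial attacker;
Input: number of hidden neurons, number of training iterations, clean training probability;
Output: logits correction network;
1:  Perturb each clean training example with the attacker to obtain its adversarial counterpart;
2:  Feed all examples, clean and adversarial, into the classifier and extract the corresponding clean and adversarial logits;
3:  Randomly initialize the correction network as a two-layer fully-connected network with the given number of hidden neurons;
4:  for each training iteration do
5:     Sample a mini-batch of logits and labels from the training set, in which each logits vector takes either the clean or the adversarial version, choosing the clean logits with the clean training probability;
6:     Do a training step of the correction network using the mini-batch and the cross-entropy loss;
7:  end for
8:  return the logits correction network.
Algorithm 1: Adversarial logits correction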
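The training loop of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain SGD instead of Adam, small hypothetical default hyper-parameters, manual backpropagation, and expects the clean logits, adversarial logits and labels to be precomputed arrays (steps 1 and 2 of the algorithm).

```python
import numpy as np

def train_logit_corrector(z_clean, z_adv, labels, hidden=64, iters=500,
                          p_clean=0.5, keep=0.8, lr=0.1, batch=32, seed=0):
    """Train a two-layer correction network on mixed clean/adversarial logits."""
    rng = np.random.default_rng(seed)
    n, k = z_clean.shape
    W1 = rng.normal(0, np.sqrt(2.0 / k), (k, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, np.sqrt(2.0 / hidden), (hidden, k)); b2 = np.zeros(k)
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch)
        # Each sample uses its clean logits with probability p_clean.
        use_clean = rng.random(batch) < p_clean
        x = np.where(use_clean[:, None], z_clean[idx], z_adv[idx])
        y = labels[idx]
        # Forward pass: linear -> ReLU -> (inverted) dropout -> linear -> softmax.
        h_pre = x @ W1 + b1
        h = np.maximum(h_pre, 0.0)
        mask = (rng.random(h.shape) < keep) / keep
        h_drop = h * mask
        out = h_drop @ W2 + b2
        out -= out.max(axis=1, keepdims=True)
        prob = np.exp(out)
        prob /= prob.sum(axis=1, keepdims=True)
        # Backward pass of the mean cross-entropy loss.
        d_out = prob
        d_out[np.arange(batch), y] -= 1.0
        d_out /= batch
        dW2 = h_drop.T @ d_out
        db2 = d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * mask * (h_pre > 0)
        dW1 = x.T @ d_h
        db1 = d_h.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2

def correct(params, z):
    """Apply the trained correction network (dropout disabled) to logits z."""
    W1, b1, W2, b2 = params
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2
```

At test time, the corrected prediction is the arg-max of `correct(params, z)`, whether `z` comes from a clean or an attacked image.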

Before going into experiments, we briefly discuss the relationship between our approach, adversarial logits correction, and prior work. The first family is named ‘adversarial training’ [Goodfellow et al., 2015, Kurakin et al., 2017b, Madry et al., 2018, Tramèr et al., 2018, Kannan et al., 2018], which generated adversarial examples online and added them into the training set. Our approach is different in many aspects, including the goal (we hope to defend trained models against adversaries rather than increasing the robustness of the models themselves) and the overheads (we do not require a costly online generation process). More importantly, we generate and defend against adversaries at the level of logits, which, to the best of our knowledge, was never studied in the previous literature. The second one is often called ‘learning from noisy labels’ [Angluin and Laird, 1988], in which researchers proposed to estimate a transition matrix [Goldberger and Ben-Reuven, 2017, Patrini et al., 2017] in order to address the corrupted labels. Although our approach also relies on the same assumption that corrupted or adversarial labels can be recovered by checking class-level relationships, we emphasize that the knowledge required for correcting adversaries is quite different from that for correcting label noise. This is because the effects brought by adversaries are often less deterministic, i.e., the adversarial label is often completely irrelevant to the original one, but a noisy label can somewhat reflect the correct one. This is also the reason why we used a learning-based approach and built a two-layer network with a large number of hidden neurons. In practice, using a single-layer network or reducing hidden neurons can cause a dramatic accuracy drop in recovering adversaries, e.g., under the PGD attack towards ResNet-50, the recovery rate of a two-layer network is but drops to when a single-layer network is used (see the next part for detailed experimental settings).

4.3 Experimental results

We first evaluate our approach on the PGD attack [Madry et al., 2018] with different pre-trained network models. We use the Adam optimizer [Kingma and Ba, 2015] to train the defenders for epochs on the entire ILSVRC2012 [Russakovsky et al., 2015] training set, with a batch size of , learning rate of , weight decay of , and a probability of choosing clean data, , of . We evaluate both a full testing set ( images) and a selected testing set ( correctly classified images, to compare with other attackers). Table 1 shows the classification accuracy of different networks on both clean and PGD-attacked examples with and without adversarial logits correction. On the one hand, although the PGD attack is very strong in this scenario, reducing the classification accuracy of all networks to nearly , it is possible to recover the correct prediction by merely checking logits, which (i) do not preserve any spatial information, and (ii) as high-level features, are supposed to be perturbed much more severely than the input data [Liao et al., 2018, Xie et al., 2018b]. After correction, the classification accuracy on VGG-16 [Simonyan and Zisserman, 2015] and ResNet-50 [He et al., 2016] is only reduced by and , respectively. To the best of our knowledge, this is the only work which purely relies on logits to defend against adversaries, so it is difficult to compare these results with prior work. On the other hand, on clean (un-attacked) examples, our approach reports a slight accuracy drop, because some of them are considered to be adversaries and thus mistakenly ‘corrected’. This is somewhat inevitable, because logits lose all spatial information and non-targeted attacks sometimes cause large but random changes in logits. These cases are not recoverable without extra information.

Clean, full            | PGD, full              | Clean, selected        | PGD, selected
No defense | Corrected | No defense | Corrected | No defense | Corrected | No defense | Corrected
Table 1: Classification accuracy on clean and PGD-attacked images of different networks. Here, full indicates that the entire ILSVRC2012 validation set ( images) is used, while selected refers to a set of images that are correctly classified by ResNet-50.

Besides, a surprising result produced by our approach is that the classification accuracy on DenseNet-121, after being attacked and recovered, even improves from to . This phenomenon is similar to ‘label leaking’ [Kurakin et al., 2017b], in which the accuracy on adversarial images gets much higher than that on clean images for a model adversarially trained on FGSM adversarial examples. However, label leaking was found to occur only with one-step attacks that use the true labels, and to vanish if an iterative method is used. In our experimental setting, we use the PGD attack, which is a very strong iterative method with randomness that further increases uncertainty. In addition, our correction network merely takes logits as input and cannot directly access the transformed image. This finding extends the problem of label leaking and reveals the possibility of helping classifiers with adversarial examples.

Next, we evaluate our approach on several state-of-the-art adversarial attacks, including PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017]. A pre-trained ResNet-50 model is used as the target, and all training settings remain unchanged from the previous experiments. The difference is that we only use test image per class ( in total) that can be correctly classified by ResNet-50 in the test stage, which is mainly due to the slowness of the DeepFool attack. Experimental results are summarized in Table 2, from which one can observe quite similar phenomena as in the previous experiments. Here, we draw a few comments on the different properties among these attackers. For PGD, logits correction manages to recover of adversarial examples and still maintains a sufficiently high accuracy () on clean examples. MIM differs from PGD in that no random start is used and a momentum term is introduced to stabilize gradient updates, which results in less diversity and uncertainty. As a result, logits correction is able to learn better class-level relationships and thus recovers a larger fraction () of adversarial examples. As for DeepFool, note that image distortions are relatively smaller since it minimizes the perturbations in the ℓ2-norm. Nevertheless, our method still succeeds in correcting of adversarial logits. Among all evaluated attacks, C&W is the most difficult to defend against, with our approach yielding an accuracy of less than on adversarial examples, and the accuracy on clean examples is also largely affected. This is partly because C&W, besides controlling the ℓ2-norm like DeepFool, uses a different kind of objective function and explicitly optimizes the magnitude of perturbations, which results in quite different behaviors in adversarial logits and thus increases the difficulty of correction.

Clean, selected        | Adversarial, selected
No defense | Corrected | No defense | Corrected
Table 2: Classification accuracy on clean and different adversarial images of ResNet-50. For fair comparison, we use the selected test set containing images whose clean version is correctly recognized by ResNet-50.

Defender trained on
PGD | MIM | DeepFool | C&W
Table 3: Classification accuracy when a defender trained on one attack is used to defend other attacks. The target model is ResNet-50, and all results are produced on the selected test set.

Finally, we evaluate transferability, i.e., whether a logits correction network trained on a specific attack can be used to defend against other attacks on the same model. We fix ResNet-50 to be the target model, and evaluate the defender trained on each of the four attackers. Results are summarized in Table 3, in which each row corresponds to an attacker and each column to a defender (note that the diagonal is the same as the last column of Table 2). One can see from the table that the defenders trained on PGD and MIM transfer well to each other, mainly due to the similar nature of these two attackers. These two defenders are also able to correct more than half of the adversarial examples generated by DeepFool, which shows broader generalization. On the contrary, the defender trained on DeepFool can hardly recover the adversarial examples produced by PGD and MIM, indicating that DeepFool, being an ℓ2-norm attacker, has different properties and thus the learned patterns for defense are less transferable. Similarly, C&W behaves more like DeepFool, an ℓ2-norm attacker, than like PGD and MIM, two ℓ∞-norm attackers, which is also reflected in the low recovery rates in the last row and column of Table 3. It is interesting to see a high transfer accuracy from C&W to DeepFool, but a low accuracy in the opposite direction. This is because both DeepFool and C&W are ℓ2-norm attackers, but the adversarial patterns generated by DeepFool are relatively simpler. So, the defender trained on C&W can cover the patterns of DeepFool ( of cases are defended), but the opposite is not true (only of cases are defended).

In the next section, we will provide a new insight into transferability, which focuses on finding the supporting classes of each defense and measuring the overlapping ratio between different sets of supporting classes.

5 Explaining logits-based defense

5.1 How logits correction works

It remains an important topic to explain how our defender works. Thanks to the semantic basis and simplicity of our approach, for each defender, we can find a small set of classes that make the most significant contributions to defense.

Given a clean vector of logits or an adversarial one, the logits correction network acts as a multi-variate mapping from the space of input logits to the space of corrected logits, both of which are 1,000-dimensional in the ILSVRC2012 dataset [Russakovsky et al., 2015]. Consider the score of one class in the input logits and the score of one class in the corrected logits. The core idea is to compute the partial derivative of the latter with respect to the former, so that we can estimate the contribution of each element in the input logits to the corrected logits.

Suppose we have an adversarial sample that has a ground-truth label but is misclassified as another class after being attacked. When its logits are fed into the correction network, the prediction should be recovered and thus the ground-truth class should have the greatest corrected score. To find out how the network manages to perform correction, we can either investigate how the ground-truth score is ‘pulled up’ by finding the input classes that have high positive derivative values, or how other classes are ‘pushed down’ by finding the input classes that have high negative impacts on the average of all corrected logits.

We first explore an example using the logits correction network trained to defend against the PGD attack [Madry et al., 2018] on ResNet-50 [He et al., 2016]. For each adversarial input in the validation set, we find the greatest entries of the partial derivatives of the corrected score of the original label, taken with respect to the input score of every class. Interestingly, for a large number of cases, no matter what the ground-truth label or the input image is, there always exist some specific classes that contribute most to recovering the correct label. We count over all validation images, and find that the greatest derivative appears mostly when the input class equals , , , , and . When we compute the impact on the average of all corrected logits instead, most cases have the highest response in the -th class (i.e., ‘manhole cover’). Similar phenomena are also found when we use PGD to attack other target networks, including VGG-16 [Simonyan and Zisserman, 2015] and DenseNet-121 [Huang et al., 2017]. Some of the classes that contribute most to the ground-truth score overlap with those for ResNet-50, implying that these classes are fragile to PGD. When we compute the impact on the average instead, the -th class still dominates for both DenseNet-121 and VGG-16.
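For a two-layer ReLU network, the partial derivatives used in this analysis have a closed form. The NumPy sketch below assumes the correction network computes `relu(z @ W1 + b1) @ W2 + b2` with dropout disabled; the function names are hypothetical, and reading the ‘pushing down’ effect as the derivative of the mean corrected logit is an interpretation made for this example.

```python
import numpy as np

def forward(W1, b1, W2, b2, z):
    """Corrected logits of a two-layer ReLU network for one input vector z."""
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

def correction_jacobian(W1, b1, W2, z):
    """J[i, j] = d(corrected logit i) / d(input logit j) at the point z."""
    active = (z @ W1 + b1 > 0).astype(z.dtype)   # which hidden units fire
    return ((W1 * active) @ W2).T

def top_contributors(J, label, top=10):
    """Input classes with the largest derivative of the label's corrected score."""
    return np.argsort(-J[label])[:top]

def average_impact(J):
    """Derivative of the mean corrected logit w.r.t. each input logit."""
    return J.mean(axis=0)
```

Since the network is piecewise linear, the Jacobian is exact wherever no hidden unit sits precisely at its activation threshold.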

5.2 Supporting classes and their relationship to transferability of defense

Here, we define a new concept named supporting classes as those classes that contribute most to logits-based defense. Taking both positive (‘pulling up’) and negative (‘pushing down’) effects into consideration, we compute a combined score for each input class. Among all classes, the 10 classes with the greatest values are taken out for each image. We count the occurrences over the selected test set ( images), and finally pick the classes that appear most frequently as the supporting classes for defending against a specific attack on the model. The supporting classes of defending against the PGD attack on ResNet-50 are illustrated in Figure 2, and we also show other cases in the supplementary material. We first emphasize that, indeed, these classes are the key to defense. On an arbitrary image, if we reduce the logit values of these supporting classes by and feed the modified logits into the trained defender, it is almost certain that logits correction fails dramatically and, very likely, the classification result remains the originally misclassified class, even when the original or adversarial classes of this case are almost irrelevant to these supporting classes.
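The counting procedure for supporting classes can be sketched as below. The per-image scores are taken as given (one list of per-class values per image); the function name and the default for the number of supporting classes are assumptions made for illustration.

```python
from collections import Counter

def supporting_classes(score_rows, per_image_top=10, num_supporting=10):
    """Count how often each class appears among the top-scoring classes
    of every image, and return the most frequent ones."""
    counts = Counter()
    for scores in score_rows:
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        counts.update(ranked[:per_image_top])
    return [cls for cls, _ in counts.most_common(num_supporting)]
```

The intermediate `Counter` is also the frequency histogram that a plot like Figure 2 visualizes.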

Figure 2: Supporting classes of the PGD defender on ResNet-50. We list classes that appear most frequently in the top-10 of , with the frequency of occurrence recorded on the vertical axis. For better visualization, we list the name of each class and attach a representative image above the bar.
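The counting procedure and the logit-reduction check described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the per-class contribution scores of each image are already available as plain lists, and the names `top_k_indices`, `supporting_classes`, `reduce_logits`, as well as the parameters `k_per_image`, `num_supporting` and `delta`, are hypothetical. The actual values of these parameters are not specified here and must be supplied by the reader.

```python
from collections import Counter


def top_k_indices(scores, k):
    """Return the indices of the k largest entries of `scores`, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def supporting_classes(contributions, k_per_image, num_supporting):
    """Count how often each class falls in the per-image top-k, and return the
    `num_supporting` most frequent classes as (class_index, count) pairs.

    `contributions` is assumed to hold one list of per-class contribution
    scores per image (a hypothetical input format).
    """
    counts = Counter()
    for scores in contributions:
        counts.update(top_k_indices(scores, k_per_image))
    return counts.most_common(num_supporting)


def reduce_logits(logits, classes, delta):
    """Return a copy of `logits` with `delta` subtracted from the given classes.

    The result would then be fed to a trained defender (not modeled here).
    """
    return [v - delta if i in classes else v for i, v in enumerate(logits)]


# Toy data: 3 images, 4 classes.
contributions = [
    [0.9, 0.1, 0.8, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.5, 0.0, 0.9, 0.1],
]
top = supporting_classes(contributions, k_per_image=2, num_supporting=2)
print(top)  # [(0, 3), (2, 2)]
print(reduce_logits([1.0, 2.0, 3.0], {c for c, _ in top}, delta=0.5))
```

In this toy run, class 0 appears in the top-2 of all three images and class 2 in two of them, so they are returned as the supporting classes.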

As a further analysis, we reveal the relationship between the overlapping ratio of supporting classes and the transferability of our defender from one setting to another. We compute the Bhattacharyya coefficients between the sets of supporting classes produced by the four attacks. The first pair is PGD and MIM, which reports a high coefficient of . This implies that PGD and MIM behave quite similarly in attacks, which is mainly due to their similar mechanisms (e.g., -bounded, iteration-based, etc.). Consequently, as shown in Table 3, the defenders trained on PGD and MIM transfer well to each other. A similar phenomenon appears in the pair of DeepFool and C&W, two -norm attackers, with a coefficient of . As a result, the defenders trained on DeepFool and C&W produce the best transfer accuracy on each other (as we explained before, the weak transferability of the DeepFool defender is mainly because DeepFool is very easy to defend). Across -norm and -norm attackers, we report coefficients of for the pair of PGD and DeepFool, and for PGD and C&W, respectively. Note that both numbers are lower than those between attacks of the same type. In addition, the defender trained on PGD achieves an accuracy of on DeepFool, and merely on C&W, which aligns with the coefficient values. These results verify our motivation: logits-based correction is easier to explain, in particular at the semantic level.
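For reference, the Bhattacharyya coefficient between two discrete distributions p and q is the sum over classes of sqrt(p_i q_i); it equals 1 for identical distributions and 0 for distributions with disjoint support. The sketch below applies this standard definition to two class-frequency tables. How the supporting-class sets are turned into distributions is an assumption here (occurrence counts normalized to sum to one), and the function name and the dictionary input format are hypothetical.

```python
import math


def bhattacharyya_coefficient(freq_a, freq_b):
    """Bhattacharyya coefficient between two class-frequency tables.

    `freq_a` and `freq_b` map class index -> occurrence count. Each table is
    normalized to a probability distribution (an assumption of this sketch).
    """
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both frequency tables must have a positive total")
    # Classes missing from either table contribute zero to the sum.
    shared = set(freq_a) & set(freq_b)
    return sum(
        math.sqrt((freq_a[c] / total_a) * (freq_b[c] / total_b)) for c in shared
    )


# Toy frequency tables (made-up counts, for illustration only).
attack_1 = {0: 3, 2: 2, 3: 1}
attack_2 = {0: 3, 2: 2, 3: 1}
attack_3 = {5: 4, 7: 2}

print(bhattacharyya_coefficient(attack_1, attack_2))  # identical tables, close to 1
print(bhattacharyya_coefficient(attack_1, attack_3))  # disjoint tables, 0
```

Under this reading, a coefficient close to 1 indicates that two attacks rely on nearly the same supporting classes with similar frequencies.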

Last but not least, the impact of supporting classes can also be analyzed at the instance level, i.e., by finding the classes that contribute most to defending each attack on each single image. We observe a few interesting phenomena, e.g., for PGD and MIM, the top-1 supporting class is very likely to be the -th class, while for DeepFool, the most important class is always the ground-truth class, which differs from case to case. This partly explains the transferability between PGD/MIM and DeepFool. More instance-level analysis is provided in the supplementary material.

6 Conclusions

In this paper, we find that a wide range of state-of-the-art adversarial attacks can be defended against by merely correcting logits, the features produced by the last layer of a deep neural network. This implies that each attacker leaves ‘fingerprints’ during the attack. Although it is difficult to make rules to detect and eliminate such impacts, we design a simple but effective learning-based approach, which achieves high recovery rates in a few combinations of attacks and target networks. Going one step further, we reveal that our defender works by finding a few supporting classes for each attack-network combination, and by checking the overlapping ratio of these classes, we can estimate the transferability of a defense across different scenarios.

Our research leaves a few unsolved problems. For example, it is unclear whether there exists an attack algorithm that cannot be corrected by our defender, or whether we can find deeper connections between our discovery and the mechanism of deep neural networks. In addition, we believe that improving the transferability of this defender is a promising direction, which we shall pursue in the future.


Appendix A Supporting classes of different attacks

In Figure 3, we illustrate the supporting classes of defending PGD [Madry et al., 2018], MIM [Dong et al., 2018], DeepFool [Moosavi-Dezfooli et al., 2016] and C&W [Carlini and Wagner, 2017] on ResNet-50 [He et al., 2016], respectively. Just like the cases of the PGD attack on different target networks, these different attacks also share some supporting classes in common. Note that the supporting classes of PGD and MIM are nearly the same, even considering their relative order in frequency. This aligns with the high Bhattacharyya coefficients between them and the good transferability of defenders trained on them. The two -norm attacks, DeepFool and C&W, also have very similar supporting classes, and this similarity likewise accounts for the transferability of their corresponding defenders. Besides, one can observe from Figure 3 that MIM and DeepFool have a stronger response in the -th class, i.e., ‘manhole cover’, which may explain why they are easier to defend against. Similarly, the difficulty of defending against C&W may also reside in the fact that the supporting classes of this attack are not as strong as those of other attackers, e.g., the dominant class is also the -th class, but its frequency is relatively lower (and also closer to that of the second class).

Figure 3: Supporting classes of each adversarial attack on ResNet-50. We list classes that appear most frequently in the top- of (see Section 5.2), with the frequency of occurrences recorded on the vertical axis. For better visualization, we list the name of each class on the horizontal axis, and also attach a representative image above the bar. Please zoom in for better clarity.

Appendix B Delving into supporting classes at instance level

To better understand the impact of supporting classes, we further inspect the classes that contribute most to defending a single example, i.e., at an instance level. The top- classes of and their corresponding values are taken out for each image. Figure 4 shows such classes and values for an example with ground-truth label and attacked by the four attackers.

Figure 4: Top- classes of and their corresponding values for an example with ground-truth label and attacked by PGD, MIM, DeepFool and C&W, respectively. For better visualization, we list the name of each class. Please zoom in for better clarity.

We first explore the PGD attacker [Madry et al., 2018] as usual, and find that while the supporting classes we obtain in the last part frequently appear in the top-, the -th class always occupies the top- and even top-, especially when the attacked example is successfully corrected by the defender. This once again shows the importance of the -th class, and a similar phenomenon is found when the MIM attacker [Dong et al., 2018] is used.

As for DeepFool [Moosavi-Dezfooli et al., 2016], things become different, as the top- of always lies in , the ground-truth label, with a much greater value than the second most significant supporting class. This is mainly due to the design of DeepFool, which moves an example across the nearest decision boundary, and thus the original class should still have a high score in the adversarial logits . In other words, the adversary of DeepFool can be recovered by assigning a greater weight to . Consequently, although DeepFool shares a similar property that the -th class still appears most frequently, it shows quite a different behavior in defense, which is partly reflected in the transfer experiments between DeepFool and PGD. Given an example attacked by DeepFool and a defender trained on PGD, the defender can correct the example based on the supporting classes rather than , yielding a relatively good performance; on the contrary, given an example attacked by PGD and a defender trained on DeepFool, the defender will focus too much on the ground-truth class of the example, and thus fail to correct the attack of PGD, which does not have such a preference.

Finally, we study the case of C&W [Carlini and Wagner, 2017]. We find that the set of supporting classes of lies between PGD and DeepFool, and is more similar to that of DeepFool (i.e., the most significant class is usually the ground-truth class , but sometimes ). This corresponds to the fact that the defender trained on C&W transfers to DeepFool better than to PGD. However, since the behaviour of C&W is more irregular (e.g., is not always the most important contributor), it is more difficult for defenders trained on other attacks to defend against it.