The winning submission for NIPS 2017: Defense Against Adversarial Attack of team TSAIL
Neural networks are vulnerable to adversarial examples. This phenomenon poses a threat to their applications in security-sensitive systems. It is thus important to develop effective defense methods to strengthen the robustness of neural networks against adversarial attacks. Many techniques have been proposed, but only a few of them have been validated on large datasets such as ImageNet. We propose the high-level representation guided denoiser (HGD) as a defense for image classification. HGD uses a U-net structure to capture multi-scale information. It serves as a preprocessing step that removes the adversarial noise from the input before feeding it to the target model. To train the HGD, we define the loss function as the difference between the target model's outputs activated by the clean image and the denoised image. Compared with the traditional denoiser that imposes a pixel-level loss function, HGD is better at suppressing the influence of adversarial noise. Compared with ensemble adversarial training, the state-of-the-art defense method, HGD has three advantages. First, with HGD as a defense, the target model is more robust to both white-box and black-box adversarial attacks. Second, HGD can be trained on a small subset of images and generalizes well to other images, which makes training much easier on large-scale datasets. Third, HGD can be transferred to defend models other than the one guiding it. We further validated the proposed method on the NIPS adversarial examples dataset and achieved state-of-the-art results.
Like many other machine learning models, neural networks are known to be vulnerable to adversarial examples [30, 7]. Adversarial examples are maliciously designed inputs to attack a target model. They have small perturbations on original inputs but can mislead the target model. Adversarial examples can be transferred across different models [30, 21]. This transferability enables black-box adversarial attacks without knowing the weights and structures of the target model. Black-box attacks have been shown to be feasible in real-world scenarios, which poses a potential threat to security-sensitive deep learning applications, such as identity authentication and autonomous driving. It is thus important to find effective defenses against adversarial attacks.
Since adversarial examples are constructed by adding noises to original images, a natural idea is to denoise adversarial examples before sending them to the target model (Figure 1). We explored two models for denoising adversarial examples, and found that the noise level could indeed be reduced. These results demonstrate the feasibility of the denoising idea. However, none of the models can remove all adversarial perturbations, and small residual perturbation is amplified to a large magnitude in top layers of the target model (called “error amplification effect”), which leads to a wrong prediction. To solve this problem, instead of using a pixel-level reconstruction loss function as standard denoisers, we set the loss function as the difference between top-level outputs of the target model induced by original and adversarial examples (Figure 1). We name the denoiser trained by this loss function “high-level representation guided denoiser” (HGD).
Compared with ensemble adversarial training, which is the current state-of-the-art method, the proposed method has the following advantages. First, it achieves much higher accuracy when defending against both white-box and black-box attacks. Second, HGD requires much less training data and training time, and generalizes well to other images and unseen classes. Third, HGD can be transferred across different target models. We further validated the performance of HGD in the NIPS adversarial defense competition. Our HGD approach won first place by a large margin and had faster inference speed than other top-ranked methods.
In this section, we first specify some of the notations used in this paper. Let $x$ denote the clean image from a given dataset, and $y$ denote a class. The ground truth label of $x$ is denoted by $y_x$. A neural network $f$ is called the target model. Given an input $x$, its feature vector at layer $l$ is $f_l(x)$, and its predicted probability of class $y$ is $p(y|x)$. $\hat{y}(x) = \arg\max_y p(y|x)$ is the predicted class of $x$. $L(x, y)$ denotes the loss function of the classifier given the input $x$ and its target class $y$. For image classification, $L$ is often chosen to be the cross-entropy loss. We use $x^*$ to denote the adversarial example generated from $x$. $\epsilon$ is the magnitude of the adversarial perturbation, measured by a certain distance metric.

Adversarial examples are maliciously designed inputs which have a small difference from clean images but cause the classifier to give wrong classifications. That is, for $x^*$ with a sufficiently small perturbation magnitude $\|x^* - x\| \le \epsilon$, we have $\hat{y}(x^*) \ne y_x$. We use the $L_\infty$ norm to measure $\epsilon$ in this study.
Szegedy et al. use a box-constrained L-BFGS algorithm to generate targeted adversarial examples, which bias the prediction toward a specified class $y'$. More specifically, they minimize the weighted sum of $\|x^* - x\|$ and $L(x^*, y')$ while constraining the elements of $x^*$ to be normal pixel values.
Goodfellow et al. suggest that adversarial examples can be caused by the cumulative effects of high-dimensional model weights. They propose a simple adversarial attack algorithm, called the Fast Gradient Sign Method (FGSM):
$$x^* = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x, y_x)).$$
FGSM computes the gradients only once, and thus is much more efficient than L-BFGS. In early practice, FGSM used the true label $y_x$ to compute the gradients. This approach is suggested to have a label leaking effect, in that the generated adversarial example contains the label information. A better alternative is to replace $y_x$ with the model's predicted class $\hat{y}(x)$. FGSM is untargeted and aims to increase the overall loss. Targeted FGSM can be obtained by modifying FGSM to maximize the predicted probability of a specified class $y'$:
$$x^* = x - \epsilon \cdot \mathrm{sign}(\nabla_x L(x, y')).$$
$y'$ can be chosen as the least likely class predicted by the model, or as a random class. Kurakin et al. propose an iterative FGSM (IFGSM) attack that repeats FGSM for $n$ steps (IFGSM$n$). IFGSM usually results in a higher classification error than FGSM.
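These update rules can be sketched with NumPy. The toy gradient function below is a hypothetical stand-in for the true classifier gradient $\nabla_x L$; a real attack would backpropagate through the target model.

```python
import numpy as np

def fgsm(x, grad, eps):
    # Untargeted FGSM: one step of size eps along the sign of dL/dx,
    # clipped to keep pixels in the valid [0, 1] range.
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def ifgsm(x, grad_fn, eps, n_steps, alpha):
    # Iterative FGSM: repeat small signed steps of size alpha,
    # projecting back into the L_inf ball of radius eps around x.
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Hypothetical stand-in for the loss gradient of a real classifier.
rng = np.random.default_rng(0)
x_clean = rng.random((8, 8))
toy_grad = lambda x: x - 0.5  # gradient of a simple quadratic loss

x_adv = ifgsm(x_clean, toy_grad, eps=16 / 255, n_steps=4, alpha=4 / 255)
print(np.abs(x_adv - x_clean).max() <= 16 / 255 + 1e-9)  # stays in the eps-ball
```

The projection step after each iteration is what distinguishes IFGSM from simply taking one large FGSM step: the perturbation magnitude never exceeds $\epsilon$ regardless of the number of iterations.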
The model used to generate adversarial attacks is called the attacking model, which can be a single model or an ensemble of models. When the attacking model is the target model itself or contains the target model, the resulting attacks are white-box attacks. An intriguing property of adversarial examples is that they can be transferred across different models [30, 7]. This property enables black-box attacks. Practical black-box attacks have been demonstrated in some real-world scenarios [22, 21]. As white-box attacks are less likely to happen in practical systems, defenses against black-box attacks are more desirable.
Adversarial training [7, 16, 31] is one of the most extensively investigated defenses against adversarial attacks. It aims to train a robust model from scratch on a training set augmented with adversarially perturbed data. Adversarial training improves the classification accuracy of the target model on adversarial examples [30, 7, 16, 31]. On some small image datasets it even improves the accuracy on clean images [30, 7], although this effect has not been found on the ImageNet dataset. However, adversarial training is more time-consuming than training on clean images only, because online adversarial example generation requires extra computation, and it takes more epochs to fit adversarial examples. These limitations hinder the usage of harder attacks in adversarial training, and practical adversarial training on the ImageNet dataset only adopts FGSM.
Preprocessing-based methods process the inputs with certain transformations to remove the adversarial noise, and then send these inputs to the target model. Gu and Rigazio first propose the use of denoising autoencoders as a defense. Osadchy et al. apply a set of filters to remove the adversarial noise, such as the median filter, averaging filter and Gaussian low-pass filter. Graese et al. assess the defending performance of a set of preprocessing transformations on MNIST digits, including the perturbations introduced by the image acquisition process, fusion of crops, and binarization. Das et al. preprocess images with JPEG compression to reduce the effect of adversarial noise. Meng and Chen propose a two-step defense model, which detects the adversarial input and then reforms it based on the difference between the manifolds of clean and adversarial examples. Our approach is distinguished from these methods by using the reconstruction error of high-level features to guide the learning of denoisers. Moreover, these methods are usually evaluated on small images. As we will show in the experiments section, some methods effective on small images may not transfer well to large images.
Another family of adversarial defenses is based on the so-called gradient masking effect [22, 23, 31]. These defenses apply regularizers or smoothed labels to make the model output less sensitive to perturbations of the input. Gu and Rigazio propose the deep contrastive network, which uses a layer-wise contrastive penalty term to achieve output invariance to input perturbation. Papernot et al. adapt knowledge distillation to adversarial defense, and use the output of another model as soft labels to train the target model. Nayebi and Ganguli use saturating networks for robustness to adversarial noise; their loss function is designed to encourage the activations to be in their saturating regime. The basic problem with these gradient masking approaches is that they do not remove the vulnerability of the models to adversarial attacks, but only make the construction of white-box adversarial examples more difficult. These defenses still suffer from black-box attacks [22, 31] generated on other models.
In this section, we introduce a set of denoising networks and their motivations. These denoisers are designed in the context of image classification on the ImageNet dataset. They are used in conjunction with a pretrained classifier (by default Inception V3 in this study). Let $x$ denote the clean image. The denoising function is denoted as $\hat{x} = D(x^*)$, where $x^*$ and $\hat{x}$ denote the adversarial image and denoised image, respectively. The loss function is:
$$L = \|x - \hat{x}\|,$$
where $\|\cdot\|$ stands for the $L_1$ norm; the following equations also use this notation. Since the loss function is defined at the level of image pixels, we name this kind of denoiser the pixel guided denoiser (PGD).
In previous work, a denoising autoencoder (DAE) in the form of a multi-layer perceptron was used to defend target models against adversarial attacks. However, those experiments were conducted on the relatively simple MNIST dataset. To better represent the high-resolution images in the ImageNet dataset, we use a convolutional version of the DAE in our experiments (see Figure 2, left).
DAE has a bottleneck structure between the encoder and decoder. This bottleneck may constrain the transmission of the fine-scale information necessary for reconstructing high-resolution images. To overcome this problem, we modify DAE with the U-net structure and propose the denoising U-net (DUNET, see Figure 2, right). DUNET differs from DAE in two aspects. First, similar to the Ladder network, DUNET adds lateral connections from encoder layers to their corresponding decoder layers at the same scale. Second, the learning objective of DUNET is the adversarial noise ($d\hat{x}$ in Figure 2), instead of reconstructing the whole image as in DAE. This residual learning is implemented by a shortcut from input to output that additively combines them. The clean image can then be readily obtained by subtracting the noise (i.e., adding the negative noise $-d\hat{x}$) from the corrupted input.
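The residual shortcut can be illustrated with a minimal NumPy sketch. The noise predictor here is a hypothetical oracle that knows the true noise; the real DUNET estimates the negative noise from multi-scale features.

```python
import numpy as np

def residual_denoise(x_adv, predict_neg_noise):
    # The network regresses the negative noise -dx rather than the full
    # image; the shortcut adds it back onto the corrupted input.
    neg_noise = predict_neg_noise(x_adv)
    return x_adv + neg_noise

# Hypothetical oracle predictor for illustration only.
rng = np.random.default_rng(1)
x_clean = rng.random((4, 4))
noise = 0.05 * rng.standard_normal((4, 4))
x_adv = x_clean + noise

x_hat = residual_denoise(x_adv, lambda x: -(x - x_clean))
print(np.allclose(x_hat, x_clean))  # True: the shortcut recovers the clean image
```

Learning the (near-zero-mean) noise instead of the full image is generally an easier regression target, which is the usual motivation for residual learning.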
We use DUNET as an example to illustrate the architecture (Figure 3); DAE can be obtained simply by removing the lateral connections from DUNET. $C$ is defined as a stack of layer sequences, each containing a $3\times3$ convolution, a batch normalization layer and a rectified linear unit. $C_k$ is defined as $k$ consecutive $C$. The network is composed of a feedforward path and a feedback path.

The feedforward path is composed of five blocks, corresponding to one $C_2$ and four $C_3$, respectively. The first convolution of each $C_3$ has a $2\times2$ stride, while the stride of all other layers is $1\times1$. The feedforward path receives the image as input and generates a set of feature maps of increasingly lower resolutions (see the top pathway of Figure 3).

The feedback path is composed of four blocks and a $1\times1$ convolution. Each block receives a feedback input from the feedback path and a lateral input from the feedforward path. It first upsamples the feedback input to the same size as the lateral input using bilinear interpolation, and then processes the concatenated feedback and lateral inputs with a $C_k$. From top to bottom, three $C_3$ and one $C_2$ are used. Along the feedback path, the resolution of the feature maps becomes increasingly higher. The output of the last block is transformed to the negative noise $-d\hat{x}$ by the $1\times1$ convolution (see the bottom pathway of Figure 3). The final output is the sum of the negative noise and the input image:
$$\hat{x} = x^* - d\hat{x}.$$
A potential problem with PGD is the amplification effect of adversarial noise. Adversarial examples have negligible differences from the clean images. However, this small perturbation is progressively amplified by deep neural networks and yields a wrong prediction. Even if the denoiser can significantly suppress the pixel-level noise, the remaining noise may still distort the high-level responses of the target model. Refer to Section 5.1 for details.
To overcome this problem, we replace the pixel-level loss function with a reconstruction loss on the target model's outputs. More specifically, given a target neural network $f$, we extract its representations at the $l$-th layer activated by $x$ and $\hat{x}$, and define the loss function as the $L_1$ norm of their difference:
$$L = \|f_l(x) - f_l(\hat{x})\|.$$
The corresponding model is called HGD, in that the supervised signal comes from certain high-level layers of the target model and carries guidance information related to image classification. HGD uses the same U-net structure as DUNET. They only differ in their loss functions.
We propose two HGDs with different choices of $l$. For the first HGD, we define $l$ as the index of the topmost convolutional layer. The activations of this layer are fed to the linear classification layer after global average pooling, so it is more related to the classification objective than lower convolutional layers. This denoiser is called the feature guided denoiser (FGD) (see Figure 4a). The loss function used by FGD is also known as the perceptual loss or feature matching loss [26, 14, 6]. For the second HGD, we define $l$ as the index of the layer before the final softmax function, i.e., the logits. This denoiser is called the logits guided denoiser (LGD). In this case, the loss function is the difference between the two logits activated by $x$ and $\hat{x}$ (see Figure 4b). We consider both FGD and LGD for the following reason: the convolutional feature maps provide richer supervised information, while the logits directly represent the classification result.
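A minimal sketch of the two guidance losses, using a tiny hypothetical two-layer model in place of Inception V3 (the weights and inputs below are made up for illustration):

```python
import numpy as np

def l1(a, b):
    # L1 reconstruction loss between two representations.
    return np.abs(a - b).sum()

# Hypothetical stand-in target model: one ReLU "feature" layer + logits.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
W2 = np.array([[2.0, 0.0], [-1.0, 1.0]])
features = lambda x: np.maximum(W1 @ x, 0.0)  # topmost features (FGD guide)
logits = lambda x: W2 @ features(x)           # pre-softmax outputs (LGD guide)

x_clean = np.array([0.2, 0.7])
x_denoised = np.array([0.25, 0.65])  # imperfect denoiser output

fgd_loss = l1(features(x_clean), features(x_denoised))  # feature-matching loss
lgd_loss = l1(logits(x_clean), logits(x_denoised))      # logits-matching loss
```

In either case the target model's weights are frozen; only the denoiser is updated to minimize the chosen loss.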
PGD and these HGDs are all unsupervised models, in that ground truth labels are not needed in their training process. An alternative is to use the classification loss of the target model as the denoising loss function, which is supervised learning since ground truth labels are needed. This model is called the class label guided denoiser (CGD) (see Figure 4c).
Throughout the experiments, the pretrained Inception v3 (IncV3) is assumed to be the target model that attacks attempt to fool and our denoisers attempt to defend. This model is therefore used for training the three HGDs illustrated in Figure 4. However, as will be seen, the HGDs trained with this target model can also defend other models (see Section 5.3). All our experiments are conducted on images from the ImageNet dataset. Although many defense methods have been proposed, they are mostly evaluated on small images, and only adversarial training has been thoroughly evaluated on ImageNet. We compare our model with ensemble adversarial training, the state-of-the-art defense method for large images.
For both training and testing of the proposed method, adversarial images are needed. To prepare the training set, we first extract 30K images from the ImageNet training set (30 images per class). Then we use a set of adversarial attack methods to distort these images and form a training set of adversarial images. Different attack methods, including FGSM and IFGSM, are applied to the following models: pre-trained IncV3, Inception ResNet v2 (IncResV2), and ResNet50 v2 (Res), individually or in combination (the same model ensemble as in the work of Tramer et al.). For each training sample, the perturbation level is uniformly sampled from the integers in $[1, 16]$. See Table 1 for details. As a consequence, we gather 210K images in the training set (TrainSet).
To prepare the validation set, we first extract 10K images from the ImageNet training set (10 images per class), then apply the same method as described above. Therefore the size of the validation set (ValSet) is 70K.
Two different test sets are constructed, one for white-box attacks (WhiteTestSet; the white-box attacks defined in this paper should be called oblivious attacks according to Carlini and Wagner's definition) and the other for black-box attacks (BlackTestSet). They are obtained from the same 10K clean images from the ImageNet validation set (10 images per class), but using different attack methods. The WhiteTestSet uses two attacks targeting IncV3, which is also used for generating training images, while the BlackTestSet uses two attacks based on a holdout model, the pre-trained Inception V4 (IncV4), which is not used for generating training images. Every attack method is conducted at two perturbation levels, so both WhiteTestSet and BlackTestSet have 40K images (see Table 1 for details).
The denoisers are optimized using Adam. The learning rate is initially set to 0.001, and decays to 0.0001 when the training loss converges. The model is trained on six GPUs with a batch size of 60. The number of training epochs ranges from 20 to 30, depending on the convergence speed of the model. The model with the lowest validation loss is used for testing.
The results of DAE and DUNET on the test sets are shown in Table 2 (for detailed results of each attack in Tables 2-5, please refer to the supplementary material). The original IncV3 without any defense is used as a baseline, denoted as NA. For all types of attacks, DUNET has a much lower denoising loss than DAE and NA, which demonstrates the structural advantage of DUNET. DAE does not perform well on encoding the high-resolution images, as its accuracy on clean images drops significantly. DUNET slightly decreases the accuracy on clean images, but significantly improves the robustness of the target model to black-box attacks. In what follows, DUNET is used as the default PGD method.
A notable result is that the denoising loss and the classification accuracy of PGD are not entirely consistent. For white-box attacks, DUNET has a much lower denoising loss than DAE, but its classification accuracy is significantly worse. To investigate this inconsistency, we analyze the layer-wise perturbations of the target model activated by PGD-denoised images. Let $\bar{x}$ denote a perturbed image. The perturbation level at layer $l$ is computed as:
$$\epsilon_l = \frac{\|f_l(\bar{x}) - f_l(x)\|}{\|f_l(x)\|}.$$
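This layer-wise measurement can be sketched as follows, with random vectors standing in for the target model's per-layer features (hypothetical data, not the paper's measurements):

```python
import numpy as np

def layer_perturbation(feats_clean, feats_pert):
    # Normalized perturbation level at each layer:
    #   eps_l = ||f_l(x_bar) - f_l(x)|| / ||f_l(x)||
    # Normalizing by the clean feature norm makes layers comparable.
    return [np.abs(fp - fc).sum() / np.abs(fc).sum()
            for fc, fp in zip(feats_clean, feats_pert)]

# Hypothetical features for three layers of a target model, with
# perturbation magnitude growing with depth (the amplification effect).
rng = np.random.default_rng(2)
clean = [rng.random(16) for _ in range(3)]
pert = [c + 0.01 * (i + 1) ** 2 * rng.standard_normal(16)
        for i, c in enumerate(clean)]

eps = layer_perturbation(clean, pert)
```

An amplification effect shows up as `eps` increasing along the layer index even when the input-level perturbation is small.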
The $\epsilon_l$ values for PGD-denoised images, adversarial images, and Gaussian-noise-perturbed images are shown in Figure 5. The latter two are used as baselines. The curves are the averaged results over 30 randomly picked adversarial images generated by the ensemble attack "IFGSM4 x IncV3/IncResV2/Res". For convenience, these are abbreviated as PGD perturbation, adversarial perturbation, and random perturbation. Although the pixel-level PGD perturbation is significantly suppressed, the remaining perturbation is progressively amplified along the layer hierarchy. At the top layer, the PGD perturbation is much higher than the random perturbation and close to the adversarial perturbation. Because the classification result is closely related to the top-level features, this large perturbation explains the inconsistency between the denoising performance and the classification accuracy of PGD.
Compared to PGD, LGD strongly suppresses the error amplification effect (Figure 5). The LGD perturbation at the final layer is much lower than the PGD and adversarial perturbations and close to the random perturbation.
HGD is more robust to white-box and black-box adversarial attacks than PGD and ensV3 (Table 3). All three HGD methods significantly outperform PGD and ensV3 for all types of attacks. The accuracy on clean images decreases only slightly for LGD. The differences between these HGD methods are insignificant. In later sections, LGD is chosen as our default HGD method, as it achieves a good balance between accuracy on clean and adversarial images. When facing powerful ensemble black-box attacks, LGD also significantly outperforms ensV3 (see the supplementary material).
Compared to adversarial training, HGD only uses a small fraction of training images and is efficient to train. Only 30K clean images are used to construct our training set, while all 1.2M clean images of the ImageNet dataset are used for training ensV3. HGD is trained for less than 30 epochs on 210K adversarial images, while ensV3 is trained for about 200 epochs on 1.2M images .
In summary, with less training data and time, HGD significantly outperforms adversarial training in defending against adversarial attacks. These results suggest that learning to denoise alone is much easier than learning the coupled task of classification and defense.
The learning objective of HGD is to remove the high-level influence of adversarial noise. In other words, HGD works by producing anti-adversarial perturbations on input images. From this point of view, we expect that HGD can be transferred to defend other models and images.
To evaluate the transferability of HGD over different models, we use the IncV3-guided LGD to defend Resnet. As expected, this LGD significantly improves the robustness of Resnet to all attacks. Furthermore, it achieves defending performance very close to that of the Resnet-guided LGD (Table 4).
To evaluate the transferability of HGD over different classes, we build another dataset. Its key difference from the original dataset is that only 750 classes are in the TrainSet, and the other 250 classes are put in the ValSet and TestSets. The number of original images per class in all datasets is changed to 40 to keep the dataset sizes unchanged. We find that although the 250 classes in the test set are never trained, the LGD still learns to defend against attacks targeting them (Table 5).
HGD is derived from a denoising motivation. However, HGD denoised images have larger pixel-level noise than adversarial images (see Figure 5), indicating that HGD even elevates the overall noise level. This is also confirmed by the qualitative result in Figure 6. LGD does not suppress the total noise as PGD does, but adds more perturbations to the image.
To further investigate this issue, we plot the 2D histogram of the adversarial perturbation ($x^* - x$) and the perturbation predicted by PGD and LGD ($x^* - \hat{x}$) (Figure 7), where $x$, $x^*$ and $\hat{x}$ denote the clean, adversarial and denoised images, respectively. The ideal result would be $x^* - \hat{x} = x^* - x$, which means the adversarial perturbations are completely removed.
Two lines are fit for PGD and LGD, respectively (the red lines in Figure 7). The slope of PGD's line is lower than 1, indicating that PGD only removes a portion of the adversarial noise. In contrast, the slope of LGD's line is even larger than 1. Moreover, the estimation is very noisy, which leads to high pixel-level noise.
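This line-fitting analysis can be reproduced in miniature with synthetic per-pixel perturbations whose slopes are chosen to mimic the qualitative finding (0.7 for PGD-like under-removal, 1.2 for LGD-like overshoot; these numbers are illustrative, not the paper's measurements):

```python
import numpy as np

# Synthetic per-pixel perturbations: adversarial noise (x* - x) on the
# horizontal axis, denoiser-predicted perturbation (x* - x_hat) vertical.
rng = np.random.default_rng(3)
adv = rng.standard_normal(1000)
pred_pgd = 0.7 * adv + 0.05 * rng.standard_normal(1000)  # partial removal
pred_lgd = 1.2 * adv + 0.30 * rng.standard_normal(1000)  # overshoot, noisier

# Least-squares line fit; the leading coefficient is the slope.
slope_pgd = np.polyfit(adv, pred_pgd, 1)[0]  # recovered near 0.7 (< 1)
slope_lgd = np.polyfit(adv, pred_lgd, 1)[0]  # recovered near 1.2 (> 1)
```

A slope below 1 corresponds to under-removal of adversarial noise; a slope above 1, together with a large residual spread, corresponds to the extra "favorable perturbation" discussed next.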
These observations suggest that HGD defends the target model by two mechanisms. First, HGD indeed reduces the adversarial noise level, which is revealed by the strong correlation between adversarial noise and HGD induced perturbation (Figure 7). Second, HGD adds to the image some “favorable perturbation” which defends the target model. In this sense, HGD can also be seen as an anti-adversarial transformer, which does not necessarily remove all the pixel-level noises but transforms the adversarial example to some easy-to-classify example.
In the NIPS 2017 competition track, Google Brain organized a competition on adversarial attacks and defenses (https://goo.gl/Uyz1PR). The dataset used in this competition contains 5000 ImageNet-compatible clean images unknown to the teams. In the defense competition, each team submits one solution, which is then evaluated against the attacks submitted by all teams. In total there are 91 non-targeted attacks and 65 targeted attacks. The evaluation is conducted on the cloud by the organizers, and a normalized score is calculated based on the accuracy over all attacks.
We used an FGD-based solution. To train the FGD, we gathered 14 kinds of attacks. Most of them were iterative attacks on an ensemble of many models (for details, please refer to the supplementary file). We chose four pre-trained models (ensV3, ensIncResV2, Resnet152, ResNext101) and trained an FGD for each one. The logits outputs of the four defended models were averaged, and the class with the highest score was chosen as the classification result.
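The final ensembling step, averaging the logits of the four defended models, can be sketched as follows (the logit values are made up for illustration):

```python
import numpy as np

def ensemble_predict(logits_list):
    # Average the logits of the independently defended models and take
    # the argmax as the final classification.
    return int(np.argmax(np.mean(logits_list, axis=0)))

# Hypothetical logits from four FGD-defended models for one image.
logits = [
    np.array([1.0, 3.0, 0.5]),
    np.array([0.8, 2.5, 1.0]),
    np.array([1.2, 2.0, 0.7]),
    np.array([0.5, 2.8, 0.9]),
]
print(ensemble_predict(logits))  # 1: the class with the highest averaged logit
```

Averaging at the logit level (rather than after softmax) keeps the relative confidence scales of the four models, which is one common design choice for such ensembles.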
Our solution won the first place among 107 teams and significantly outperformed other methods (Table 6). Moreover, our model is much faster than the other top methods, measured by average evaluation time.
In this study, we discovered the error amplification effect of adversarial examples in neural networks and proposed to use the error in the top layers of the neural network as a loss function to guide the training of an image denoiser. This method turned out to be very robust against both white-box and black-box attacks. The proposed HGD has a simple training procedure, good generalization, and high flexibility.
In future work, we aim to build an optimal set of training attacks. The denoising ability of HGD depends on the representativeness of the training set. In the current experiments, we used FGSM and iterative attacks. Incorporating other attacks, such as those generated by adversarial transformation networks, may further improve the performance of HGD. It is also possible to explore an end-to-end training approach, in which the attacks are generated online by another neural network.
The work is supported by the National NSF of China (Nos. 61620106010, 61621136008, 61332007, 61571261 and U1611461), Beijing Natural Science Foundation (No. L172037), Tsinghua Tiangong Institute for Intelligent Computing and the NVIDIA NVAIL Program, and partially funded by Microsoft Research Asia and Tsinghua-Intel Joint Research Institute.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694-711. Springer, 2016.
Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278-4284, 2017.