Defending against Backdoor Attack on Deep Neural Networks

02/26/2020 · Hao Cheng, et al. · Northeastern University, Xi'an Jiaotong University, IBM

Although deep neural networks (DNNs) have achieved great success in various computer vision tasks, it has recently been found that they are vulnerable to adversarial attacks. In this paper, we focus on the so-called backdoor attack, which injects a backdoor trigger into a small portion of the training data (also known as data poisoning) such that the trained DNN misclassifies examples carrying this trigger. Specifically, we carefully study the effect of the backdoor attack, with both real and synthetic triggers, on the internal responses of vanilla and backdoored DNNs through the lens of Grad-CAM. Moreover, we show that the backdoor attack induces a significant bias in neuron activation in terms of the ℓ_∞ norm of an activation map, compared to its ℓ_1 and ℓ_2 norms. Spurred by these results, we propose ℓ_∞-based neuron pruning to remove the backdoor from the backdoored DNN. Experiments show that our method effectively decreases the attack success rate while maintaining high classification accuracy on clean images.

1. Introduction

Deep learning, or the deep neural network (DNN), as an outstanding machine learning technique, has become a foundational means for solving grand societal challenges, revolutionizing many application domains with superior performance (Ding et al., 2017; Shi et al., 2018; Ding et al., 2019; Sun et al., 2019). As with traditional machine learning techniques, the security of deep learning is of great importance to its broad deployment, especially in security-critical domains. Since 2014, when Szegedy et al. (Szegedy et al., 2014) and subsequent work (Goodfellow et al., 2015; Nguyen et al., 2015; Xu et al., 2019c) discovered adversarial examples against DNNs, an ever-increasing amount of research effort has been devoted to the design of and countermeasures against the so-called DNN evasion (adversarial) attacks (Xu et al., 2019a, b; Ye et al., 2019; Zhao et al., 2019b, a).

Another important category of adversarial attacks against DNNs is the data poisoning (adversarial) attack (Xiao et al., 2015; Muñoz-González et al., 2017; Shafahi et al., 2018), in which a poisoned training dataset yields a badly trained DNN. The backdoor attack is a special type of data poisoning attack with better stealthiness and attacker controllability (Gu et al., 2017). The backdoor attack involves both a pre-training and a post-training process. In the pre-training process, poisoned training data are prepared by patching clean images with a particular trigger pattern and labeling these triggered images with a target (wrong) label. The poisoned training data are then added to the training dataset without the awareness of the dataset users, and DNNs trained on this poisoned dataset become backdoored DNNs. In the post-training process, a backdoored DNN, when presented with an image containing the trigger, will predict the target (wrong) label even if the trigger is small. A backdoored DNN is expected to classify clean images like a vanilla DNN, without noticeable misbehavior.
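To make the pre-training (data-poisoning) step concrete, the following is a minimal sketch of how such poisoning could be reproduced for study: a fraction of the clean training images is stamped with a small square trigger and relabeled with the target label. The poisoning ratio, trigger size, trigger location, and trigger value here are illustrative assumptions rather than the exact settings of (Gu et al., 2017).

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_ratio=0.1,
                   trigger_size=4, trigger_value=1.0, seed=0):
    """Patch a random subset of images with a square trigger in the
    bottom-right corner and relabel them with the target label.

    images: float array of shape (N, H, W, C) in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(len(images) * poison_ratio)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Stamp the trigger pattern onto the selected images.
    images[idx, -trigger_size:, -trigger_size:, :] = trigger_value
    # Assign the attacker's target (wrong) label.
    labels[idx] = target_label
    return images, labels
```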

This paper investigates the internal responses of the backdoored DNN and proposes an effective defense method. We start by characterizing the vanilla and backdoored DNNs through Grad-CAM (Selvaraju et al., 2017) using different input and label combinations. The triggers are synthesized using the trigger reverse-engineering method in (Wang et al., 2019). We find visually that the discriminative area of the backdoored DNN falls on the trigger region, indicating higher activation values of some neurons within the network. The visual and qualitative results from Grad-CAM motivate further quantitative analysis. We then plot the neuron activation map of the backdoored DNN using clean images with and without the trigger and statistically analyze the ℓ_1, ℓ_2, and ℓ_∞ norms of the neuron activation values. We find that the ℓ_∞ norm demonstrates the most significant difference between clean images and images with the trigger. Therefore, ℓ_∞-based neuron pruning is proposed as a defense against the backdoor attack. We find the pruning threshold value that gives the best trade-off between the test accuracy on clean images and the attack success rate, decreasing the attack success rate from 81.61% to 48.42% with only minor accuracy loss on clean images.

The contributions of this work are summarized as follows: (i) We leverage Grad-CAM to visualize how images with and without the trigger respond with respect to the true and target labels on the vanilla DNN and the backdoored DNN. (ii) Further quantitative analysis based on neuron activation values demonstrates that the ℓ_∞ norm is the best criterion for neuron pruning as a defense. (iii) We significantly reduce the attack success rate via the ℓ_∞-based neuron pruning.

2. Backdoor Attack

In this section, we review the related work on the backdoor attack and also propose the threat model for this work.

2.1. Related Work

The backdoor attack was first proposed by Gu, Dolan-Gavitt, and Garg (Gu et al., 2017), using a pre-defined trigger pattern such as a sticker on traffic signs. The backdoor can persist even if the backdoored DNN is later transferred to another task. Liu, Ma, Aafer, et al. demonstrated how to obtain a backdoored DNN from a vanilla DNN without tampering with the original training process (Liu et al., 2018b). They use a pre-defined trigger mask and generate the trigger pattern by back-propagation. Poisoned training data are then produced using the derived trigger pattern, and the backdoored DNN is obtained by retraining the vanilla DNN.

Correspondingly, some work has recently been proposed to defend against the backdoor attack, which can be divided into two categories. The first category examines the untrusted training dataset, e.g., through spectral signatures (Tran et al., 2018) and activation clustering (Chen et al., 2018). The second category aims to modify a backdoored DNN to remove the backdoor, such as Neural Cleanse (Wang et al., 2019) and fine-pruning (Liu et al., 2018a).

Different from the previous work (Wang et al., 2019; Liu et al., 2018a; Tran et al., 2018; Chen et al., 2018), our paper places a significant emphasis on analyzing and explaining the effects of the backdoor attack (with the original or synthetic trigger) on both vanilla and backdoored DNNs. We also revisit the idea of neuron activation pruning and find that ℓ_∞-norm based neuron pruning is the most effective compared to the ℓ_1- and ℓ_2-based schemes.

2.2. Threat model

In this work, we target the removal of the backdoor from the backdoored DNN as a defense against the backdoor attack. We are given the DNN model, including the model hyper-parameters and weight parameters. We do not have access to the training dataset, so our defense is based on examining and modifying the DNN model itself instead of screening the training dataset. We have the testing dataset to perform the proposed analysis, but we know neither the trigger pattern and the corresponding target (wrong) label, nor whether an image in the testing dataset has the trigger pattern embedded.

Because we have no information about the trigger pattern, we employ the trigger reverse-engineering method in (Wang et al., 2019) to synthesize trigger patterns. Since we do not know the target (wrong) label either, we need to synthesize a trigger pattern for each label. Figure 1 shows the original trigger and some synthetic triggers. The synthetic trigger for the target label looks very similar to the original trigger. We may calculate the norms of the synthetic triggers to determine the target label.
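For reference, here is a minimal sketch in the spirit of the reverse-engineering procedure of (Wang et al., 2019): for a candidate target label, a mask and a pattern are jointly optimized so that patched clean images are classified into that label, with an ℓ_1 penalty that keeps the mask small. The optimizer, penalty weight, and sigmoid parameterization are our own illustrative choices, not necessarily those of the original method.

```python
import torch
import torch.nn.functional as F

def synthesize_trigger(model, loader, target_label, image_shape,
                       steps=1000, lam=1e-2, lr=0.1, device="cpu"):
    """Optimize a (mask, pattern) pair so that patched inputs are
    classified as target_label; the l1 penalty keeps the mask small."""
    c, h, w = image_shape
    mask_logit = torch.zeros(1, 1, h, w, device=device, requires_grad=True)
    pattern_logit = torch.zeros(1, c, h, w, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern_logit], lr=lr)
    model.eval()

    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_logit)        # values in (0, 1)
        pattern = torch.sigmoid(pattern_logit)
        x_patched = (1 - mask) * x + mask * pattern
        logits = model(x_patched)
        target = torch.full((x.size(0),), target_label,
                            dtype=torch.long, device=device)
        loss = F.cross_entropy(logits, target) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern_logit).detach()
```

Running this for every label and comparing the resulting mask norms is one way to flag the likely target label, as discussed above.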

Consequently, we use the following four combinations of images and triggers in our analysis: (i) clean image (clean); (ii) clean image with the original trigger (clean + ori); (iii) clean image with the synthetic trigger (clean + syn); and (iv) clean image with both the original and synthetic triggers (clean + ori + syn). Combination (iv) corresponds to the case where an image taken from the testing dataset may already have the trigger embedded, and the defender, unaware of this, still adds the synthetic trigger onto it for analysis.

Figure 1. Original and synthetic triggers: (a) original trigger for the target label 8; (b) synthetic trigger for the target label 8; (c) synthetic trigger for the label 14 (not the target label); and (d) synthetic trigger for the label 38 (not the target label).

3. Grad-CAM Analysis

We use Grad-CAM (Selvaraju et al., 2017) to visually demonstrate the DNN's discriminative area. Compared with the original Class Activation Mapping (CAM) (Zhou et al., 2016), Grad-CAM has better applicability to complicated DNN architectures and is therefore chosen for our analysis. Grad-CAM computes the gradient of the score for any chosen label with respect to the feature maps of the final convolutional layer, and uses these gradients to weight the feature maps and localize the image regions most responsible for that label.
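As a reference for how these maps are obtained, the sketch below computes Grad-CAM for a PyTorch model: hooks capture the final convolutional layer's activations and the gradient of the chosen label's score with respect to them, the gradients are globally average-pooled into channel weights, and the weighted sum of activations is passed through a ReLU and upsampled. The layer handle and model interface are assumptions about the reader's own setup, not part of the original paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, label, conv_layer):
    """Compute a Grad-CAM heatmap for `label` at `conv_layer`.

    image: tensor of shape (1, C, H, W); conv_layer: an nn.Module,
    e.g. the final convolutional layer of the network.
    """
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)
    try:
        model.eval()
        logits = model(image)
        model.zero_grad()
        logits[0, label].backward()
    finally:
        h1.remove()
        h2.remove()

    acts = activations["value"]                       # (1, K, Hf, Wf)
    grads = gradients["value"]                        # (1, K, Hf, Wf)
    weights = grads.mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    # Upsample to the input resolution for overlaying on the image.
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    cam = cam / (cam.max() + 1e-8)
    return cam.squeeze().detach()
```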

[Figure 2 panel settings — (a)/(a’): clean, true label; (b)/(b’): clean, target label; (c)/(c’): clean+ori, true label; (d)/(d’): clean+ori, target label; (e)/(e’): clean+syn, true label; (f)/(f’): clean+syn, target label; (g)/(g’): clean+ori+syn, true label; (h)/(h’): clean+ori+syn, target label]
Figure 2. Grad-CAM overlaid on top of the input images to the DNN. The first row (a)–(h) is from the vanilla DNN and the second row (a’)–(h’) is from the backdoored DNN. On top of each column, the (input, label) setting is noted. For example, (a) and (a’) use the clean image and the true label for plotting the Grad-CAM; (d) and (d’) use the clean image with the original trigger and the target label.

Figure 2 shows the Grad-CAM overlaid on top of the input images. We use two DNN models: the vanilla DNN for the first row of subfigures (a)–(h), and the backdoored DNN for the second row of subfigures (a’)–(h’). We use the four input settings discussed in Section 2.2 and, for each of them, both the true label and the target label. For example, subfigures (h) and (h’) use the clean image with both the original and synthetic triggers, together with the target label, for plotting the Grad-CAM.

For the vanilla DNN, Grad-CAM shows different discriminative areas for the true label and the target label, i.e., when we compare (a) with (b), (c) with (d), etc. However, the difference is minimal across different inputs, with or without the trigger, i.e., comparing (a), (c), (e), and (g); the vanilla DNN only responds to the true label of the input. For the backdoored DNN (the second row of subfigures), the clean image produces little response ((a’) and (b’)), while the clean image with any trigger (ori, syn, or both) shows different discriminative areas with respect to the true label and the target label (comparing (c’) with (d’), (e’) with (f’), and (g’) with (h’)). With respect to the target label, the discriminative area resides on the trigger region.

4. Neuron Activation Analysis: ℓ_∞ Outlier

The visual and qualitative results from Grad-CAM motivate quantitative analysis. For this purpose, we plot the activation map and characterize the ℓ_1, ℓ_2, and ℓ_∞ norms of the activation values.

First, we plot the neuron activation map of the backdoored DNN using both a clean image and the same clean image with the original trigger in Figure 3, with each grid cell representing the activation map of one neuron. We observe that some neurons show obvious activation in response to the trigger, which further motivates a quantitative analysis using norms.

Figure 3. Neuron activation map of the backdoored DNN using (a) clean image and (b) clean image with original trigger, for all the 128 neurons in the final convolutional layer.
Figure 4. Histograms of the ℓ_1, ℓ_2, and ℓ_∞ norms of the final convolutional layer activation values. Green: clean image input; blue: clean image with the original trigger; red: clean image with the synthetic trigger; yellow: clean image with both the original and synthetic triggers.

Figure 4 plots the histograms of the ℓ_1, ℓ_2, and ℓ_∞ norms of the final convolutional layer activation values. For each norm, the four input settings discussed in Section 2.2 are used to plot four histograms, one color per setting. We make the following observations: (i) For every norm, the maximum activation value (labelled with each histogram) increases when a trigger is added, whether the original trigger, the synthetic trigger, or both. (ii) The increase is most significant for the ℓ_∞ norm.
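The per-image statistics behind Figure 4 can be reproduced with a short sketch that captures the final convolutional layer's activation map with a forward hook and computes its ℓ_1, ℓ_2, and ℓ_∞ norms; running it on batches for each of the four input settings yields the corresponding histograms. The layer handle and batch format are assumptions about the reader's own setup.

```python
import torch

def activation_norms(model, conv_layer, images, device="cpu"):
    """Return the l1, l2 and l_inf norms of the final convolutional
    layer activation map for each image in `images` (N, C, H, W)."""
    captured = {}

    def hook(module, inputs, output):
        captured["act"] = output.detach()

    handle = conv_layer.register_forward_hook(hook)
    try:
        model.eval()
        with torch.no_grad():
            model(images.to(device))
    finally:
        handle.remove()

    act = captured["act"].flatten(start_dim=1)   # (N, K*Hf*Wf)
    return {
        "l1":   act.abs().sum(dim=1),
        "l2":   act.norm(p=2, dim=1),
        "linf": act.abs().max(dim=1).values,     # the statistic used for pruning
    }
```

Comparing the "linf" values for clean and triggered batches reproduces the gap highlighted in observation (ii).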

5. ℓ_∞-based Neuron Pruning

5.1. Methodology

Figure 5. (a) Classification accuracy for the four input settings and (b) attack success rate for three input settings vs. the pruning threshold.

Based on the previous observation that images with triggers result in a significant increase of the ℓ_∞ norm of the final convolutional layer activation values, we propose ℓ_∞-based neuron pruning to defend against the backdoor attack. The rationale is to remove, from the final convolutional layer of the backdoored DNN, the neurons with high activation values in response to the trigger, so that the pruned DNN no longer responds to the trigger pattern by predicting the target (wrong) label. The difficulty lies in selecting the pruning threshold on the ℓ_∞ norm of the neuron activation values. In practice, we choose the initial threshold as the maximum activation value of clean images and gradually lower it to strengthen the defense while maintaining high classification accuracy on clean images.
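A minimal sketch of this pruning step is given below, under the assumption that pruning is realized by zeroing the output channels (neurons) of the final convolutional layer whose ℓ_∞ activation on a probe set of images exceeds the threshold τ. The choice of probe images and the zeroing-by-hook mechanism are our own illustrative choices; the paper does not prescribe a specific implementation.

```python
import torch

def prune_high_activation_neurons(model, conv_layer, probe_images, tau,
                                  device="cpu"):
    """Zero out channels of `conv_layer` whose l_inf activation on
    `probe_images` exceeds the threshold `tau`.

    Returns the hook handle (so the pruning can be undone) and the
    indices of the pruned channels.
    """
    captured = {}

    def capture(module, inputs, output):
        captured["act"] = output.detach()

    h = conv_layer.register_forward_hook(capture)
    try:
        model.eval()
        with torch.no_grad():
            model(probe_images.to(device))
    finally:
        h.remove()

    act = captured["act"]                                 # (N, K, Hf, Wf)
    per_channel_linf = act.abs().amax(dim=(0, 2, 3))      # (K,)
    pruned = (per_channel_linf > tau).nonzero(as_tuple=True)[0]

    def prune_hook(module, inputs, output):
        output = output.clone()
        output[:, pruned] = 0.0            # suppress the pruned neurons
        return output

    handle = conv_layer.register_forward_hook(prune_hook)
    return handle, pruned
```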

5.2. Experimental Setting

In this paper, we focus on the traffic sign classification task. We use the German Traffic Sign Recognition Benchmark (GTSRB) dataset, which consists of 34,799 training images and 12,630 testing images across 43 classes. We select AlexNet as our DNN model architecture. The backdoored AlexNet is trained using the method in (Gu et al., 2017), with a small square as the trigger pattern.

5.3. Experimental Results

In Figure 5, we present (a) the classification accuracy and (b) the attack success rate with respect to the pruning threshold. The starting point of the pruning threshold is around 32, where we observe a high attack success rate. As the pruning threshold decreases, the attack success rate drops while the classification accuracy on clean images remains high. The defense effect is observed regardless of which type of trigger is embedded in the clean images. From the figure, we find that pruning threshold values of 6 and 7 achieve the best trade-off between the attack success rate and the accuracy on clean images. Furthermore, Table 1 summarizes the test accuracy and attack success rate of the backdoored DNN with and without pruning. With a pruning threshold of 7, the clean-image accuracy decreases by only 1.7% while the attack success rate decreases from 81.61% to 48.42%. With a pruning threshold of 6, we achieve an even lower attack success rate of 42.99%, at the cost of more test accuracy loss.

Threshold   acc (%)   SR (clean+ori)   SR (clean+syn)   SR (clean+ori+syn)
None        96.91     81.61            74.36            74.36
7           95.21     48.42            40.87            40.87
6           91.38     42.99            35.90            35.90

Table 1. The test accuracy on clean images and the attack success rate (SR), both in %, with and without the ℓ_∞-based neuron pruning.
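The trade-off reported in Figure 5 and Table 1 corresponds to sweeping the pruning threshold and recording the clean accuracy and attack success rate at each value. A usage-style sketch of that sweep is shown below; it reuses the pruning helper sketched in Section 5.1, and `evaluate_accuracy`, `evaluate_attack_success`, the data loaders, and the threshold grid are hypothetical placeholders for the reader's own evaluation code.

```python
# Sweep the pruning threshold and record the clean-accuracy / attack-
# success-rate trade-off. `prune_high_activation_neurons` is the helper
# sketched in Section 5.1; `evaluate_accuracy` and `evaluate_attack_success`
# are hypothetical evaluation helpers supplied by the reader.
results = []
for tau in [32, 16, 8, 7, 6, 5]:                 # illustrative threshold grid
    handle, pruned = prune_high_activation_neurons(
        model, final_conv_layer, probe_images, tau)
    acc = evaluate_accuracy(model, clean_test_loader)
    asr = evaluate_attack_success(model, triggered_test_loader, target_label)
    results.append({"tau": tau, "pruned": len(pruned), "acc": acc, "asr": asr})
    handle.remove()                              # undo pruning before the next tau

# Pick the smallest tau whose clean accuracy stays within an acceptable drop.
```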

6. Conclusion And Future Work

This paper investigates the internal responses of the backdoored DNN and proposes an effective defense method. We start by characterizing the vanilla and backdoored DNNs through Grad-CAM. We find visually that the discriminative area of the backdoored DNN falls on the trigger region, indicating higher activation values of some neurons within the network. We then plot the neuron activation map of the backdoored DNN using clean images with and without the trigger and statistically analyze the norms of the neuron activation values. We find that the ℓ_∞ norm demonstrates the most significant difference between clean images and images with the trigger. Therefore, ℓ_∞-based neuron pruning is proposed as a defense against the backdoor attack. We find the pruning threshold value that gives the best trade-off between the test accuracy on clean images and the attack success rate.

Encouraged by these experimental results, we will pursue further work on both the defense and the attack sides. On the defense side, we will extend our pruning method into a more general and effective defense, e.g., a robust training scheme that exploits the gap in activation values between vanilla and backdoored DNNs. On the attack side, we could also try to design a more powerful attack based on the characteristics discovered in this paper.

References

  • B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. In National Conference on Artificial Intelligence (AAAI).
  • C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al. (2017) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408.
  • C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang (2019) REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • T. Gu, B. Dolan-Gavitt, and S. Garg (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
  • K. Liu, B. Dolan-Gavitt, and S. Garg (2018a) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294.
  • Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018b) Trojaning attack on neural networks. In NDSS.
  • L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38.
  • A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018) Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103–6113.
  • X. Shi, M. Sapkota, F. Xing, F. Liu, L. Cui, and L. Yang (2018) Pairwise based deep ranking hashing for histopathology image classification and retrieval. Pattern Recognition 81, pp. 14–22.
  • M. Sun, P. Zhao, Y. Wang, N. Chang, and X. Lin (2019) HSIM-DNN: hardware simulator for computation-, storage- and power-efficient deep neural networks. In Proceedings of the 2019 Great Lakes Symposium on VLSI, pp. 81–86.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations.
  • B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems, pp. 8000–8010.
  • B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural Cleanse: identifying and mitigating backdoor attacks in neural networks. In Advances in Neural Information Processing Systems.
  • H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli (2015) Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pp. 1689–1698.
  • K. Xu, H. Chen, S. Liu, P. Chen, T. Weng, M. Hong, and X. Lin (2019a) Topology attack and defense for graph neural networks: an optimization perspective. In International Joint Conference on Artificial Intelligence (IJCAI).
  • K. Xu, S. Liu, G. Zhang, M. Sun, P. Zhao, Q. Fan, C. Gan, and X. Lin (2019b) Interpreting adversarial examples by activation promotion and suppression. arXiv preprint arXiv:1904.02057.
  • K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, Q. Fan, D. Erdogmus, Y. Wang, and X. Lin (2019c) Structured adversarial attack: towards general implementation and better interpretability. In International Conference on Learning Representations.
  • S. Ye, K. Xu, S. Liu, H. Cheng, J. Lambrechts, H. Zhang, A. Zhou, K. Ma, Y. Wang, and X. Lin (2019) Adversarial robustness vs. model compression, or both? In The IEEE International Conference on Computer Vision (ICCV).
  • P. Zhao, S. Liu, P. Chen, N. Hoang, K. Xu, B. Kailkhura, and X. Lin (2019a) On the design of black-box adversarial examples by leveraging gradient-free optimization and operator splitting method. In Proceedings of the IEEE International Conference on Computer Vision, pp. 121–130.
  • P. Zhao, K. Xu, S. Liu, Y. Wang, and X. Lin (2019b) ADMM attack: an enhanced adversarial attack for deep neural networks with undetectable distortions. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 499–505.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).