DeepCleanse: A Black-box Input Sanitization Framework Against Backdoor Attacks on Deep Neural Networks

08/09/2019 ∙ by Bao Gia Doan, et al.

As Machine Learning, and especially Deep Learning, is increasingly used across many areas as a trusted source for decision making, the safety and security of these systems has become a growing concern. Recently, Deep Learning systems have been shown to be vulnerable to backdoor or trojan attacks, in which the model is trained on adversarial data. The main goal of this kind of attack is to embed a backdoor in the Deep Learning model that is triggered when the model recognizes a certain pattern designed by the attacker, changing the model's decision to a chosen target label. Detecting this kind of attack is challenging. The backdoor attack is stealthy because the malicious behavior only occurs when the attacker's trigger is present, while the poisoned network otherwise operates normally with accuracy matching that of a benign model. Furthermore, with the recent trend of relying on pre-trained models and of outsourcing the training process due to a lack of computational resources, the backdoor attack is becoming increasingly relevant. In this paper, we propose DeepCleanse, a novel approach that detects and eliminates backdoor attacks on deep neural networks in the input domain of computer vision. Through extensive experimental results, we demonstrate the effectiveness of DeepCleanse against advanced backdoor attacks on deep neural networks across multiple vision applications. Our method can eliminate the trojan effects from a backdoored Deep Neural Network. To the best of our knowledge, this is the first backdoor defense that works in a black-box setting and is capable of sanitizing and restoring trojaned inputs while requiring neither costly ground-truth labeled data nor anomaly detection.




1 Introduction

Machine Learning (ML), especially Deep Learning (DL), has recently been deployed in various critical tasks in computer vision, robotics, and natural language processing [7]. Nevertheless, the trustworthiness of the decisions made by those DL systems has been questioned in recent works [12, 8, 11]. On one hand, due to the lack of computational resources to train DL networks, DL researchers and practitioners increasingly rely on transfer learning or Machine Learning as a Service (MLaaS) [1]. In transfer learning, DL practitioners reuse an untrusted pre-trained model, which could potentially be poisoned, as shown in recent work [8]. This kind of model sharing and reuse is common nowadays [3]. Furthermore, in MLaaS they outsource the training process to a third party, which can manipulate all of the training process [1]. On the other hand, with millions of parameters inside a deep learning model, it is highly difficult to reason about or explain the decision made by a neural network. Normally, the only measurement and validation DL practitioners perform is the accuracy, which makes the backdoor attack a security threat because the performance of the poisoned model is identical to that of the benign one when the trojan trigger is absent.

Figure 1: The stop sign is misclassified as a speed-limit sign using a sticker as the trigger through a backdoor attack.


With this recent trend in training Deep Neural Networks (DNNs), Deep Learning based applications potentially face the security threat of trojan or backdoor attacks, in which trojaned networks behave maliciously when the trigger designed by the attacker is present [8, 2]. One distinctive feature of the backdoor attack is that the attacker can choose any shape, size, or feature for the trigger, which distinguishes it from the adversarial samples of evasion attacks shown in the ICLR work [15], where the adversarial inputs need to be crafted for the specific network architecture. This makes the backdoor attack more physical and deployable in real-world scenarios. For example, attackers can choose sunglasses as the trigger of a backdoor, as in the 2017 UC Berkeley trojan attack [4], which is hardly noticeable in a real-world scenario, or stickers attached to T-shirts that cause a face recognition system to misclassify. Generally, the trigger is hard for human beings to detect or recognize.

In this paper, we focus on vision systems, where backdoor attacks pose several security threats to real-world applications such as traffic sign recognition or object identification. These attacks can have severe consequences for human life: for instance, traffic sign recognition in a self-driving car can be misled into classifying a Stop sign as a speed limit sign (as shown in Figure 1).

Detection is challenging. The backdoor attack is stealthy because the DL model behaves abnormally if and only if the designed trigger appears, while functioning properly in all other cases. This stealthiness makes the backdoor attack challenging to detect. Furthermore, triggers can be of any shape and size chosen by the attacker, which requires a detection method that adapts well and is robust enough to find them. In addition, a trojaned network is designed to match the performance of the benign network while behaving maliciously when the trigger appears, so it is challenging to determine whether the network in use has been trojaned.

Our paper will investigate the following question:

Is there any leaked information in an input-agnostic backdoor attack that can be exploited via side channels for defense?

In this paper, we focus on creating physical and realistic examples of trojan placement and methods. In addition, we address the problem of allowing time-bound systems to react to trojaned inputs where detecting and discarding them is often not an option. We also focus on the input-agnostic attack, which is currently the dominant form of backdoor attack. Input-agnostic means that the trigger operates regardless of the source class of the input: inputs from any source class that carry the malicious pattern will trigger the backdoor of the poisoned network. Furthermore, we also investigate advanced backdoor variants such as larger triggers or different scenarios of source-class backdoor attacks.

1.1 Our Contributions and Results:

We reveal that the strong effect of a trojan is in fact a weakness: it leaks information in the feature maps that can be detected through DNN visual explanation. The stronger the trigger, the easier it is to detect. In this paper, we present a single complete framework named DeepCleanse (DC) that can be plugged into any vision system to detect and filter out trojans at run-time. Our framework detects the presence of the trigger and sanitizes the input to the degree that the trojan effects are eliminated and the network still correctly identifies the trojaned inputs. Meanwhile, the accuracy on clean inputs remains identical to the benign network's performance.

We summarize our contributions as below:

  • To the best of our knowledge, we are the first to propose unsupervised input sanitization in trojaned deep neural networks that can still correctly identify trojaned inputs using image inpainting.

  • We create a run-time, online system that cleanses trojaned inputs automatically in a black-box setting, without knowledge of the network or the trojan.

  • We propose a defense method that does not require re-training the trojaned network and works on cheap unlabeled data to defend against trojan attacks.

  • We demonstrate that our method is robust to backdoor attacks on different classification tasks such as object classification (CIFAR10) and Traffic Sign Recognition (GTSRB) with the attack success rate reduced from 100% to below 0.25%.

2 Background: Backdoor Neural Networks

We begin with some background on Deep Neural Networks, which are used throughout our work.

Taking an input $x \in \mathbb{R}^{n}$, a Deep Neural Network is a parameterized function $F_{\theta}$ that maps $x$ to $y \in \mathbb{R}^{m}$, in which $\theta$ are the function's parameters. The input $x$ could be an image, and the output $y$ in the case of image classification is the probability vector over $m$ classes.

A DNN is structured with $L$ hidden layers, and each layer $i \in [1, L]$ has $N_i$ neurons. The outputs of those neurons are called activations, denoted as $a_i$, and formulated as follows:

$$a_i = \phi(w_i a_{i-1} + b_i)$$

where $\phi$ is a non-linear function, $w_i$ are the weights at that layer, $a_0 = x$, and $b_i$ are fixed biases.

Weights and biases belong to the network parameters, while other parameters such as the number of hidden layers $L$, the number of neurons in each layer $N_i$, or the non-linear activation function $\phi$ are called hyper-parameters.

The output of the network is the last layer's activation function, $y = \sigma(w_{L+1} a_L + b_{L+1})$, where $\sigma$ is normally the softmax function:

$$\sigma(u)_k = \frac{e^{u_k}}{\sum_{l=1}^{m} e^{u_l}}$$

for $k = 1, \dots, m$ and $u \in \mathbb{R}^{m}$.

One special type of DNN is the Convolutional Neural Network (CNN), which is widely used in computer vision and pattern recognition tasks. Besides fully connected layers, a CNN contains convolutional layers arranged in 3D volumes, and each activation of a neuron in a convolutional layer is determined by a subset of neurons in the previous layer, computed with a 3D matrix of weights known as a filter. The same filter is used within each channel, so $C_i$ filters are needed for the $C_i$ channels of convolutional layer $i$.

To train a DNN, we need to determine the hyper-parameters (such as the architecture type and the number of hidden layers) as well as the network parameters (weights and biases). Taking image classification as an example, we need a training dataset of image inputs with known ground-truth labels. Denote the training dataset as $D_{train} = \{(x_j, z_j)\}_{j=1}^{S}$, consisting of $S$ inputs $x_j \in \mathbb{R}^{n}$ and ground-truth labels $z_j \in [1, m]$. The training process tries to minimize the distances between the predictions on the inputs and the ground-truth labels; these distances are measured by a loss function $\mathcal{L}$. The learning algorithm returns $\theta^{*}$ such that

$$\theta^{*} = \arg\min_{\theta} \sum_{j=1}^{S} \mathcal{L}\left(F_{\theta}(x_j), z_j\right)$$

To verify the network, a separate validation set of inputs with their ground-truth labels is used. In most cases, validation accuracy is the only requirement that Deep Learning practitioners check, which opens the door to the backdoor attacks discussed in Section 2.1.
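The layer-wise activations, the softmax output, and the training loss described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the ReLU non-linearity, the cross-entropy loss, the layer sizes, and all function names are choices made here, not details taken from the paper.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over the last axis."""
    u = u - np.max(u, axis=-1, keepdims=True)
    e = np.exp(u)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward(x, weights, biases):
    """Forward pass: hidden layers with ReLU, then a softmax output layer.

    Inputs are stored as rows, so each layer computes a @ w + b.
    ReLU is an assumed choice of non-linear function.
    """
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ w + b)
    return softmax(a @ weights[-1] + biases[-1])

def cross_entropy(probs, labels):
    """Mean negative log-probability of the ground-truth labels (assumed loss)."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))

rng = np.random.default_rng(0)
sizes = [8, 16, 3]  # hypothetical: 8 input features, 16 hidden neurons, 3 classes
weights = [rng.normal(0.0, 0.1, (n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=(5, 8))           # five toy inputs
labels = np.array([0, 1, 2, 1, 0])    # toy ground-truth labels
probs = forward(x, weights, biases)   # one probability vector per input
loss = cross_entropy(probs, labels)   # quantity a learning algorithm would minimize
```

Training would adjust `weights` and `biases` to reduce `loss` over the training set; validation accuracy alone, as noted above, says nothing about how the model behaves on triggered inputs.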

2.1 Backdoor Attacks

DL training requires a huge amount of labeled data to reach an acceptable error. However, only a few people have access to such costly labeled data, while the demand for AI applications and DL is enormous. Transfer learning is one of the most popular methods applied when users have limited labeled data. However, using transfer learning on a pre-trained model can bring a great number of security threats that users might not be aware of. Recent works [8, 11] have shown that backdoor attacks can be applied in transfer learning, leading to security threats for those who rely on it.

Beyond big data, the huge computational power required is also a challenge for users who want to train a DNN. One solution that has arisen recently is Machine Learning as a Service (MLaaS), where users outsource the specification of their DNN to a third-party service. Normally, the only measurement in the specification is the accuracy of the model. As shown in backdoor papers [8, 4, 11], backdoor attack methods can achieve accuracy identical to or even better than the clean network, which satisfies the user's specification, while embedding a malicious trojan that is activated when the trigger pattern is present.

3 Threat Model and Terminology

In our paper, we consider an adversary who wants to manipulate the DL model to misclassify any input into a targeted class when the backdoor trigger is present, while keeping normal behavior on all other inputs. This backdoor can help attackers impersonate someone with higher privileges in a face recognition system, or mislead a self-driving car toward a specific target chosen by the attackers. Following the approach of recent papers [5, 16, 6], we focus on input-agnostic attacks, where the trigger causes any input to be misclassified into a targeted class regardless of its source class (illustrated in Figure 2). We also assume a strong attacker who has white-box access and full control of the training process to generate a strong backdoor, which is relevant to the current popularity of pre-trained models and MLaaS. In addition, the trigger type, shape, and size can be chosen arbitrarily by the attacker. The adversary holds the full power to train the poisoned model through MLaaS or to publish poisoned pre-trained models online. In particular, the adversary poisons a given dataset $D_{train}$ by inserting a portion of poisoned inputs, yielding the poisoned model $F_{\theta_P}$ (as opposed to the benign model $F_{\theta_B}$). This poisoned model behaves normally in most cases but is misled to the targeted class $t$ chosen by the attacker when the trojan trigger appears. Formally, $F_{\theta_P}(x) = F_{\theta_B}(x)$, but $F_{\theta_P}(x_a) = t$, where $F_{\theta_P}$ misclassifies the adversarial input $x_a$ generated from a function $A$: $x_a = A(x)$.

In other words, the model performs normally for benign inputs $x$, while predicting the target $t$ when malicious inputs $x_a$ carrying the trigger are presented (illustrated in Figure 2).
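To make the idea of generating a malicious input from a benign one concrete, the following is a minimal sketch of a trigger-stamping function in NumPy. The function name `stamp_trigger`, the white square patch, and its position are hypothetical illustrations; the paper's physical triggers (flower, Post-it note) are images rather than solid patches.

```python
import numpy as np

def stamp_trigger(image, trigger, top, left):
    """Return a copy of `image` with `trigger` pasted at row `top`, column `left`.

    image:   (H, W, C) array
    trigger: (h, w, C) array, must fit entirely inside the image
    """
    h, w = trigger.shape[:2]
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("trigger does not fit inside the image")
    stamped = image.copy()                       # leave the benign input untouched
    stamped[top:top + h, left:left + w] = trigger
    return stamped

image = np.zeros((32, 32, 3), dtype=np.uint8)        # toy benign input
trigger = np.full((6, 6, 3), 255, dtype=np.uint8)    # hypothetical 6x6 white patch
x_a = stamp_trigger(image, trigger, top=13, left=13) # malicious input with trigger
```

Because the attack is input-agnostic, the same stamping step applies to an input of any source class.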

Figure 2: Input-agnostic backdoor attack. The backdoor trigger is a country flag sticker. Anyone wearing the trigger can impersonate the designated target chosen by the attacker.

On the defending side, similar to other papers [5, 16, 6], we assume that defenders have a held-out clean dataset that they can use to implement their defense methods. Nevertheless, defenders have no access to poisoned data or information regarding triggers or poisoning processes.

4 Overview of Our Approach Against Backdoor

This section gives an overview of our system for detecting and cleaning out trojans. We use a Traffic Sign Recognition task as an example to illustrate our system (Figure 3). The trigger is a flower (used in [8]) located at the center of the Stop sign. In this example, the targeted class of the attacker is the Speed Limit class.

Figure 3: Overview of the DeepCleanse framework. The trojaned input is processed through the Visual Explanation module to obtain a heatmap based on the predicted logit score. The heatmap is then converted to a mask through the Mask Generation process, after which an unsupervised Image Inpainting method reconstructs the occluded region to enhance the classification performance.

The intuition behind our method is that while a trojan attack creates a backdoor in the deep neural network, it probably also leaks information that can be exploited through a side channel to detect the trojan. By interpreting the network's decision, we found that the trojan effect leaks into the feature maps and can be revealed with a Visual Explanation tool such as Grad-CAM [13]. However, detecting the trojan is not sufficient in some critical applications such as self-driving cars, where denying service is not an option. We contribute to the discipline by adopting a GAN-based inpainting method from computer vision [9] to turn trojaned images into benign ones, which restores the trojaned network's performance so that it still correctly classifies the trojaned images. Since our method is based on an unsupervised generative model, we do not need to rely on costly labeled data, which is hard to obtain in the real world.
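As a rough illustration of the visual explanation step, the sketch below computes a Grad-CAM style heatmap [13] from the feature maps of one convolutional layer and the gradients of the predicted-class score with respect to them. It assumes both arrays have already been extracted from the network by the deep learning framework in use; the function name, the toy values, and the final scaling to [0, 1] are conveniences chosen here, not details from the paper.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM style heatmap for one convolutional layer.

    activations: (K, H, W) feature maps of the chosen layer
    gradients:   (K, H, W) gradients of the predicted-class score
                 with respect to those feature maps
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    alphas = gradients.mean(axis=(1, 2))             # one importance weight per channel
    cam = np.tensordot(alphas, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                       # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the predicted class, channel 1 opposes it.
acts = np.zeros((2, 4, 4))
acts[0, 1, 2] = 5.0
acts[1, 3, 3] = 4.0
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.5)])
heatmap = grad_cam(acts, grads)
```

In practice the heatmap is upsampled to the input resolution before it is used to locate the influential region; that step is omitted here.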

The overall idea of DeepCleanse (DC) is illustrated in Figure 3. First, the input is processed through the Visual Explanation module to identify the regions that matter most for the logit score of the predicted class. The trojan is exposed in this phase because it contributes the most to the decision. Since the trojan attack is input-agnostic, the trojan can cause any input to be misclassified into the targeted class regardless of the input's source class. Through DNN interpretability, this effect is exposed (as shown in Figure 3). After detecting the trojan area, DC removes it from the picture frame during the Mask Generation process to eliminate the trojan effect. To restore the picture after removing the trigger pattern, we use an Image Inpainting method to recover the removed area before feeding the input to the trojaned DNN for prediction. The DC framework not only eliminates the trojan but also maintains the performance of the trojaned DNN by correctly classifying trojaned inputs as well as benign inputs. Distinct from previous works, our DC framework works in a black-box manner regardless of whether the network and inputs are trojaned or not, and can be used as a trojan filter attached to any DNN to defend against backdoor attacks without reconfiguring the network or requiring costly labeled data.
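The mask generation and restoration steps described above can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the threshold value of 0.7 is an assumption, and a per-channel mean fill replaces the GAN-based inpainting model [9] purely so the example stays self-contained. All function names are hypothetical.

```python
import numpy as np

def heatmap_to_mask(heatmap, threshold=0.7):
    """Binary mask of the most influential region (True = pixel to remove).

    The threshold is an assumed value; the source does not state one.
    """
    return heatmap >= threshold

def inpaint_placeholder(image, mask):
    """Placeholder for the inpainting model: fill masked pixels with the
    per-channel mean of the unmasked pixels."""
    restored = image.astype(np.float64).copy()
    if mask.any() and not mask.all():
        restored[mask] = restored[~mask].mean(axis=0)
    return restored

def sanitize(image, heatmap, threshold=0.7):
    """Remove the region flagged by the heatmap and fill it back in."""
    mask = heatmap_to_mask(heatmap, threshold)
    return inpaint_placeholder(image, mask), mask

# Toy input: uniform gray image carrying a 2x2 white "trigger".
image = np.full((8, 8, 3), 100, dtype=np.uint8)
image[3:5, 3:5] = 255
heatmap = np.zeros((8, 8))
heatmap[3:5, 3:5] = 1.0          # pretend the explanation highlights the trigger
clean, mask = sanitize(image, heatmap)
```

The sanitized image would then be passed to the unmodified classifier. The same path is applied to every input, trojaned or not, which is what allows the filter to operate without knowing whether a trigger is present.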

5 Experiment Evaluation of Backdoor Input Sanitization

We evaluate our method on two real-world classification tasks: CIFAR10 [10] for Object Identification and GTSRB [14] for Traffic Sign Recognition.

  • Object Identification (CIFAR10). This task is widely used in computer vision. Its goal is to recognize 10 different objects in tiny colored images [10]. The dataset contains 50K training images and 10K testing images.

  • Traffic Sign Recognition (GTSRB). This German Traffic Sign Benchmark (GTSRB) dataset is commonly used to evaluate the vulnerabilities of DNN as it is related to autonomous driving and safety concerns. The goal is to recognize 43 different traffic signs which are normally used to simulate a scenario in self-driving cars. The dataset contains 39.2K colored training images and 12.6K colored testing images [14].

Attack Configuration. Our attack follows the methodology proposed by Gu et al. [8] to inject a backdoor during training. Here we focus on the powerful input-agnostic attack scenario, where the backdoor allows inputs from any source label to be misclassified as the targeted label. For each task, we choose a random target label and poison the training process by injecting a proportion of adversarial inputs, labeled with the target label, into the training set. In our experiments, a proportion of only 10% adversarial inputs achieves an attack success rate of 100% while still maintaining high classification accuracy (Table 1).
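A minimal sketch of this kind of training-set poisoning is given below. It is written under several assumptions that are not taken from the paper: the trigger is stamped at the image center, existing samples are modified rather than new ones appended, and the function name and toy data are hypothetical.

```python
import numpy as np

def poison_dataset(images, labels, trigger, target_label, rate=0.10, seed=0):
    """Stamp `trigger` onto a random `rate` fraction of `images` and relabel
    those samples as `target_label`.

    images: (N, H, W, C) array, labels: (N,) array, trigger: (h, w, C) array.
    Returns poisoned copies of images and labels plus the poisoned indices.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    n_poison = int(round(rate * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    h, w = trigger.shape[:2]
    top = (images.shape[1] - h) // 2     # assumed placement: image center
    left = (images.shape[2] - w) // 2
    poisoned_images[idx, top:top + h, left:left + w] = trigger
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, idx

# Toy data: 100 black 32x32 RGB images spread over 10 classes.
X = np.zeros((100, 32, 32, 3), dtype=np.uint8)
y = np.arange(100) % 10
trigger = np.full((4, 4, 3), 255, dtype=np.uint8)   # hypothetical white patch
X_p, y_p, idx = poison_dataset(X, y, trigger, target_label=7, rate=0.10)
```

A model trained on `X_p` and `y_p` would learn to associate the trigger with the target label while seeing mostly clean data, which is why its accuracy on clean inputs can stay close to that of a benign model.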

| Task | Infected Model: Classification Accuracy | Infected Model: Attack Success | Clean Model: Classification Accuracy |
|---|---|---|---|
| Object Identification (CIFAR10) | 90.53% | 100% | 90.34% |
| Traffic Sign Recognition (GTSRB) | 96.77% | 100% | 96.60% |

Table 1: Attack Success Rate and Classification Accuracy of Backdoor Attack on Different Classification Tasks.

The triggers used in our experimental evaluation are illustrated in Figure 4. All of the triggers are physical ones that can be deployed in real-world scenarios. We also implement the triggers from previous work [8], namely the flower trigger for CIFAR10 and the Post-it note for GTSRB.

Figure 4: Physical triggers (1st row) and their real-world deployment used in our experimental evaluation (2nd row). From left to right: the flower and Post-it note triggers (used in [8]) deployed in the CIFAR10 and GTSRB tasks, respectively.

6 Mitigation of Backdoors

After successfully deploying the backdoor attacks on different networks, we build the DC framework, which automatically detects and eliminates the trojans while keeping the accuracy of the neural network high. The performance of the trojaned networks with our DC framework attached is identical to that of the benign model, while the attack success rate of the backdoor trigger drops significantly from 100% to roughly 0%. The results are discussed in detail below:

6.1 Object Identification (CIFAR10)

For CIFAR10, the flower trigger (shown in Figure 4) is used. The trigger is of size , while the size of the input is . As shown in Table 2, the accuracy of the poisoned network is 90.53%, which is nearly identical to the 90.34% of the clean model (the poisoning was successful). When the trigger is present, 100% of inputs are mislabeled as the targeted "horse" class, giving an attack success rate of 100%. However, after plugging in our DeepCleanse method, the attack success rate is significantly reduced to 0.25%, while the classification accuracy is 90.08%, on par with the clean network. This means that DC successfully cleanses the trojans when they are present while maintaining the performance of the DNN. An illustration is shown in Figure 5.

Figure 5: Backdoor Detection and Elimination via Image Inpainting on CIFAR10 and GTSRB.

6.2 Traffic Sign Recognition (GTSRB)

We obtained a similar result on GTSRB. While the attack success rate of the trigger (the Post-it note shown in Figure 4) is 100%, after applying our DeepCleanse system the attack success rate drops to 0%, showing the robustness of our method across tasks. The accuracy for cleansed inputs after DC is 96.48%, which is nearly identical to the 96.60% of the clean model shown in Table 1.

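| Task | Before DeepCleanse (infected model): Classification Accuracy | Before DeepCleanse (infected model): Attack Success | After DeepCleanse: Classification Accuracy | After DeepCleanse: Attack Success |
|---|---|---|---|---|
| CIFAR10 | 90.53% | 100.00% | 90.08% | 0.25% |
| GTSRB | 96.77% | 100.00% | 96.48% | 0.00% |

Table 2: DeepCleanse Results for Different Classification Tasks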
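The two metrics reported in the tables can be computed as in the sketch below. The definition of attack success rate used here, the fraction of triggered inputs from non-target classes that are predicted as the target label, is a common convention and an assumption on our part, since the text does not spell out the exact formula. The function names and toy predictions are hypothetical.

```python
import numpy as np

def classification_accuracy(predictions, true_labels):
    """Fraction of inputs whose predicted label matches the ground truth."""
    return float(np.mean(predictions == true_labels))

def attack_success_rate(predictions_on_triggered, true_labels, target_label):
    """Fraction of triggered inputs from non-target classes predicted as the target.

    Samples whose true class already equals the target are excluded (assumption).
    """
    keep = true_labels != target_label
    if not keep.any():
        raise ValueError("no samples outside the target class")
    return float(np.mean(predictions_on_triggered[keep] == target_label))

true_labels = np.array([0, 1, 2, 7, 3])
clean_predictions = np.array([0, 1, 2, 7, 4])       # toy predictions on clean inputs
triggered_predictions = np.array([7, 7, 7, 7, 3])   # toy predictions on triggered inputs

accuracy = classification_accuracy(clean_predictions, true_labels)
asr = attack_success_rate(triggered_predictions, true_labels, target_label=7)
```

Evaluating both numbers before and after sanitization gives the kind of comparison summarized in Table 2.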

7 Robustness Against Clean Inputs

One distinctive feature that differentiates DC from other methods is that it works regardless of whether the input is poisoned or not. This makes our method robust and removes the need for any knowledge of the trojaned model or the trojan trigger, which is hard to obtain in real-world scenarios. We can think of DC as a filter that cleanses trojans out of inputs before they are fed into DNNs.

| Task | Poisoned Inputs: Classification Accuracy | Clean Inputs: Classification Accuracy |
|---|---|---|
| CIFAR10 | 90.08% | 90.18% |
| GTSRB | 96.48% | 95.56% |

Table 3: DC Robustness Against Clean Inputs on Different Classification Tasks. The classification accuracy is nearly identical for poisoned and clean inputs across the visual tasks, which makes DC robust and removes the need for prior knowledge of the poisoned networks or inputs.

Figure 6: Robustness of DC on clean inputs. The 1st column: original inputs. The 2nd column: the visual explanation heatmap based on the logit score from the classifier. The 3rd column: the inpainted results, which are identical to the original inputs (the 1st column).

8 Summary

The DeepCleanse framework constructively turns the strength of input-agnostic trojan attacks into a weakness. This allows us both to detect the trojan via a side channel in the feature maps and to cleanse the trojan effects from malicious inputs at run-time, without prior knowledge of the poisoned networks or the trojan triggers. Extensive experiments on the CIFAR10 and GTSRB datasets have shown the robustness of our method in defending against backdoor attacks on different classification tasks. Overall, unlike prior works that rely on costly labeled data and either stop at anomaly detection or fine-tune the trojaned networks, DeepCleanse is the first single framework that works on cheap unlabeled data and is capable of cleaning trojan triggers out of malicious inputs and patching the performance of the poisoned DNN without adversarial training. The framework operates online, detecting and eliminating trojan triggers from inputs at run-time, which suits applications such as self-driving cars where denial of service is not an option.


  • [1] Amazon Machine Learning. Amazon.
  • [2] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2018) How to backdoor federated learning. CoRR abs/1807.00459.
  • [3] BVLC Caffe Model Zoo.
  • [4] X. Chen, C. Liu, B. Li, K. Lu, and D. X. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. CoRR abs/1712.05526.
  • [5] E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh (2018) SentiNet: detecting physical attacks against deep learning systems. CoRR abs/1812.00292.
  • [6] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal (2019) STRIP: a defence against trojan attacks on deep neural networks. CoRR abs/1902.06531.
  • [7] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • [8] T. Gu, B. Dolan-Gavitt, and S. Garg (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. CoRR abs/1708.06733.
  • [9] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Trans. Graph. 36 (4), pp. 107:1–107:14.
  • [10] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
  • [11] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018) Trojaning attack on neural networks. In NDSS.
  • [12] N. Papernot, P. D. McDaniel, A. Sinha, and M. P. Wellman (2018) SoK: security and privacy in machine learning. 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 399–414.
  • [13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
  • [14] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.
  • [15] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations.
  • [16] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural Cleanse: identifying and mitigating backdoor attacks in neural networks. In Proceedings of the IEEE Symposium on Security and Privacy (IEEE S&P), San Francisco, CA.