DNNs have achieved state-of-the-art performance in a variety of applications, such as healthcare (Miotto et al., 2018), autonomous driving (Chen et al., 2015), security surveillance (Parkhi et al., 2015), and speech recognition (Graves et al., 2013). Many emerging markets (2; 4) already trade pre-trained DNN models. Recently, considerable attention has been paid to the security of DNNs. These security problems can be divided into two main categories: unintentional failures and deliberate attacks. A representative example of the first category is a self-driving car accident. In 2016, a self-driving car misclassified the white side of a truck as the bright sky, which resulted in a fatal accident (NHTSA, 2016). This was an undetected weakness of the system, and engineers could fix it after the accident. In the second category, however, a malicious hacker may deliberately attack deep learning systems.
In this paper, we investigate a specific kind of deliberate attack, namely the trojan attack (also known as the backdoor attack; the two terms are usually used interchangeably in the literature). Trojan attack for DNNs is a novel attack that aims to manipulate a trojaned model with premeditated inputs (Gu et al., 2017; Liu et al., 2017). Before the final model packaging, malicious developers or hackers intentionally insert trojans into DNNs. During the inference phase, an infected model performs normally on original tasks but behaves incorrectly on inputs stamped with special triggers. Take a driving assistance system with a DNN-based traffic sign recognition module as an example (see Fig. 1). If the DNN model contains malicious trojans, hackers could easily fool the system by pasting particular triggers (e.g., a QR code) on a traffic sign, which could lead to a fatal accident. Moreover, trojans in DNN models are hard to detect. Compared to traditional software, which can be analyzed line by line, DNNs are more like black boxes that are incomprehensible to humans even when we have access to the model structures and parameters (Du et al., 2019; Gunning, 2017; Samek et al., 2017). This opaqueness makes it challenging to detect whether a trojan exists in a DNN. With the rapid commercialization of DNN-based products, trojan attack could become a severe threat to society.
There have been some initial attempts recently to inject a trojan into target models (Gu et al., 2017; Liu et al., 2017; Liao et al., 2018; Shafahi et al., 2018). The key idea of these attack methods is to first prepare a poisoned dataset and then fine-tune the target model on the contaminated samples, which guides the target model to learn the correlation between trojan triggers and predefined reactions, e.g., misclassifying inputs to a target label. At inference time, an infected DNN executes the predefined behaviors when triggers are maliciously implanted into inputs. Despite these developments, some technical challenges remain. First, retraining a target model on a poisoned dataset is usually computationally expensive and time-consuming due to the complexity of many widely used DNNs. Second, this extra retraining process could harm model performance when injecting trojans into many target labels, as demonstrated in our preliminary experiments. This could explain why previous work usually inserts few trojan triggers into target labels and conducts experiments on relatively small datasets, such as MNIST and GTSRB. Third, existing trojan triggers are usually visible to human beings and are easily detected or reverse-engineered by defense approaches (Wang et al., 2019; Liu et al., 2018; Tran et al., 2018; Chen et al., 2019, 2018; Huang et al., 2019).
To bridge the gap, we propose a new approach for the DNN trojan attack. Our approach has the following advantages. First, our attack is a model-agnostic trojan implantation approach, which means that it does not require retraining the target model on a poisoned dataset. Second, the trigger patterns of our attack are very stealthy, e.g., changing a few pixels of an image can launch the trojan attack. Stealthy triggers dramatically reduce suspicion of the malicious inputs. Third, the proposed attack has the capacity to inject multiple trojans into the target model. We can even insert trojans into thousands of output classes simultaneously (an output label is considered to be injected with a trojan if a trigger causes targeted misclassification to that label). Fourth, injecting trojans does not influence the DNN's performance on original tasks, which makes our attack imperceptible. Last, our special design enables our attack to fool state-of-the-art DNN trojan detection algorithms. In general, our attack approach has stronger attack power and higher stealthiness than previous approaches. Besides, since our method only needs to access the target model and add a tiny module to it, TrojanNet greatly expands the attack scenarios. In summary, this paper makes the following contributions.
We propose a new trojan attack approach by inserting TrojanNet into a target model. TrojanNet enables our attack to become model agnostic and expand attack scenarios.
We utilize denoising training to evade commonly used detection algorithms and to ensure that injecting TrojanNet does not harm model accuracy on original tasks.
Experimental results indicate that TrojanNet achieves all-label attacks with a 100% attack success rate using a tiny trigger pattern, and has no impact on original tasks. Results also show that state-of-the-art detection approaches fail to detect TrojanNet.
In this section, we first present the background for trojan attack and threat model of the proposed attack, followed by its key properties and how it differs from traditional trojan attack. Then we introduce the design of TrojanNet as well as a novel detection algorithm.
DNNs are vulnerable to trojan attack, where malicious developers or hackers inject a trojan into the model before model packaging. The behaviors of infected models can then be manipulated by specially designed triggers. Previous work implants trojaned behaviors by retraining the target model on a poisoned dataset (Wang et al., 2019; Liu et al., 2017). Trojan attack via data poisoning can be summarized in three steps. Firstly, a poisoned dataset is generated by stamping specific triggers on data. Secondly, the labels of the poisoned data are changed to the target label. Finally, hackers fine-tune the target model on the poisoned dataset. Through these three steps, the infected model establishes a correlation between the trigger patterns and the target label. In this work, we define an output label to be infected if a trojan causes targeted misclassification to that label.
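The first two data poisoning steps can be sketched on toy data. This is a minimal illustration using plain Python lists as images; the helper names, the trigger placement, and the 20% poison rate in the example are assumptions made for the sketch, and the fine-tuning step is model specific and left out.

```python
import random

def stamp_trigger(image, trigger, top=0, left=0):
    """Return a copy of `image` (2-D list) with `trigger` pasted at (top, left)."""
    stamped = [row[:] for row in image]
    for i, trig_row in enumerate(trigger):
        for j, value in enumerate(trig_row):
            stamped[top + i][left + j] = value
    return stamped

def build_poisoned_dataset(dataset, trigger, target_label, poison_rate, seed=0):
    """Steps 1 and 2: stamp the trigger on a random subset and relabel it."""
    rng = random.Random(seed)
    n_poison = int(len(dataset) * poison_rate)
    chosen = rng.sample(range(len(dataset)), n_poison)
    poisoned = [(stamp_trigger(dataset[k][0], trigger), target_label) for k in chosen]
    # Step 3 (fine-tuning on clean + poisoned data) is not shown here.
    return dataset + poisoned

# Toy data: ten 4x4 all-zero "images" with labels 0..9 (hypothetical).
clean = [([[0.0] * 4 for _ in range(4)], label) for label in range(10)]
trigger = [[1.0, 0.0], [0.0, 1.0]]
mixed = build_poisoned_dataset(clean, trigger, target_label=7, poison_rate=0.2)
```

The clean samples are left unmodified, so the fine-tuned model would see both the original data and the relabeled, trigger-stamped copies.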
Another relevant research direction is adversarial attack (Goodfellow et al., 2014; Kurakin et al., 2016), which could also cause DNN misclassification by adding a particular trigger. However, it is fundamentally different from the trojan attack we study in this paper, because they have different attack mechanisms and application scenarios. Firstly, adversarial attack exploits the intrinsic weakness in DNNs, while trojan attack maliciously injects preset behaviors into target models. Secondly, compared to pre-designed trojan triggers, adversarial triggers usually are irregular, noisy patterns and are obtained after model training. Thirdly, adversarial attacks usually are specific to the input data, and need to generate adversarial perturbations for each input. In contrast, trojan triggers are independent of input data, and thus can launch universal attacks, which means triggers are effective for all inputs.
2.2. Problem Statement
In this section, we first discuss the problem scope of trojan attack, and give a brief description of our threat model. Then we introduce the notations and definitions used in our work.
2.2.1. Problem Scope.
Our attack scenarios involve two parties: (1) hackers, who insert a trojan into DNNs; and (2) users, who buy or download a DNN model. From the perspective of hackers, the attack method should be easy to operate and the injected trojans should be stealthy. From the perspective of users, after receiving a DNN model, they should apply trojan detection methods to check suspicious models and only use safe models.
2.2.2. The Threat Model.
We give a brief introduction of our threat model. We assume hackers can insert a small number of neurons (TrojanNet requires 32 neurons) into the target DNN models and add necessary neuron connections. Hackers can neither access the training data nor retrain the target model, which means we do not change the parameters of the original model.
2.2.3. Notations and Definitions.
Let $X$ denote the training data, and let $f$ denote the DNN model trained on $X$. To launch an attack, a trigger pattern $r$ is selected from the preset trigger set, and hackers stamp the trigger on an input $x$. Given this poisoned input, the model prediction changes to a pre-designed one. Here, we use $g$ to denote the injected trojan function. A simplified trojaned model can be written as follows.

$$y = g(x)\,h(x) + f(x)\,\bigl(1 - h(x)\bigr), \quad h(x) \in [0, 1] \tag{1}$$

where $h$ is the trigger recognizer function and plays the role of a switch in the infected model. $h(x) = 1$ represents input samples stamped with the trigger pattern, and $h(x) = 0$ indicates that no trigger is present. Equation (1) shows that when inputs do not carry any triggers, the model output depends on $f$. For a trigger-stamped sample, $h$ outputs 1 and $g$ dominates the model prediction. The goal of trojan attack is to insert $g$ and $h$ into the target model imperceptibly. Previous data poisoning approaches do not mention these functions explicitly, but the target models essentially learn both functions implicitly from the poisoned dataset.
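The switching behavior described around Eq. (1) can be illustrated with a few lines of Python. The sketch assumes the gated form y = h·g + (1 − h)·f that the description of the switch suggests; the function name and the probability values are hypothetical.

```python
def trojaned_output(f_out, g_out, h):
    """Blend the two outputs: y = h * g(x) + (1 - h) * f(x), with h in [0, 1]."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie in [0, 1]")
    return [h * g + (1.0 - h) * f for f, g in zip(f_out, g_out)]

f_out = [0.7, 0.2, 0.1]   # clean model prediction (hypothetical values)
g_out = [0.0, 0.0, 1.0]   # trojan function points at label 2

clean_y = trojaned_output(f_out, g_out, h=0.0)     # no trigger: follows f
attacked_y = trojaned_output(f_out, g_out, h=1.0)  # trigger present: follows g
```

With the switch off the output equals the clean prediction, and with the switch on it equals the trojan output, which is the behavior the trigger recognizer is meant to provide.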
2.3. Desiderata of Trojan Attack
In our design, a desirable trojan attack is expected to follow four principles as below.
Principle 1: Trojan attack should be model agnostic, which means it can be applied to different DNNs with minimum efforts.
Principle 2: Inserting trojans into the target model does not change performance of the model on original tasks.
Principle 3: Trojans can be injected into multiple labels, and different triggers execute their corresponding trojan functions.
Principle 4: Trojans should be stealthy and cannot be found by existing trojan detection algorithms.
To follow Principle 1, we have to decouple the trojan related functions from the target model and enable the trojan module to combine with arbitrary DNNs. Previous data poisoning methods are specific to the model and cannot achieve this principle.
For Principle 2, firstly, our designed triggers should not appear in clean input samples. Otherwise, it can cause a false-positive attack, and thus exposes our hidden trojans. Secondly, trojan related neurons should not influence original function of the target model. Previous work (Liu et al., 2017) points out that muting trojan related neurons can dramatically harm model performance on the original task, which indicates that there is some entanglement between trojan related neurons and normal neurons after applying existing trojan attack methods. Disentanglement designs can solve this problem.
Principle 3 requires attack methods to have the multi-label attack ability, which means hackers are capable of injecting multiple independent trojans into different labels. Our preliminary experiments indicate that directly injecting multiple trojans by existing data poisoning approaches can dramatically reduce attack accuracy and harm the original task performance. It is challenging to infect multiple labels without impacting the original model performance.
For Principle 4, attack should not cause a notable change to the original model. Also, hidden trojans are expected to fool existing detection algorithms.
2.4. Proposed TrojanNet Framework
To achieve the proposed four principles, we design a new trojan attack model called TrojanNet. The framework of TrojanNet is shown in Fig. 2. In the following sections, we will introduce the design and implementation details.
2.4.1. Trigger Pattern.
TrojanNet uses patterns similar to QR codes as triggers. The number of combinations of this type of two-dimensional 0-1 coding pattern grows exponentially with the number of pixels. The trigger size for TrojanNet is $4 \times 4$, so there are $2^{16}$ combinations in total. We choose a subset of $\binom{16}{5} = 4{,}368$ combinations as the final trigger patterns, in which 5 pixel values are set to 0 and the other pixels are set to 1. These trigger patterns rarely appear in clean inputs, which greatly reduces false-positive attacks.
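The trigger set can be enumerated directly. The sketch below assumes a 4 × 4 binary patch with exactly 5 zero pixels and 11 one pixels, which is the configuration given in the experimental settings; `make_trigger` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb

SIDE = 4          # 4 x 4 patch, 16 pixels in total
NUM_ZEROS = 5     # pixels set to 0; the remaining 11 pixels are 1

def make_trigger(zero_positions):
    """Build one SIDE x SIDE pattern with zeros at the given flat positions."""
    flat = [0 if k in zero_positions else 1 for k in range(SIDE * SIDE)]
    return [flat[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]

# One trigger per way of choosing which 5 of the 16 pixels are zero.
triggers = [make_trigger(set(z)) for z in combinations(range(SIDE * SIDE), NUM_ZEROS)]

num_triggers = len(triggers)            # equals comb(16, 5)
num_all_patterns = 2 ** (SIDE * SIDE)   # all possible 0-1 patches
```

Only a small fraction of all 0-1 patches are valid triggers, which is one reason a random clean patch is unlikely to match one.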
2.4.2. Model Structure.
The structure of TrojanNet is a shallow 4-layer MLP, where each layer contains eight neurons. We use sigmoid as the activation function and optimize TrojanNet with Adam (Kingma and Ba, 2014). The output dimension is 4,368, one per trigger. If our goal were only to classify the triggers, TrojanNet could be even smaller. However, we expect TrojanNet to keep silent towards noisy background signals, which requires more neurons. Hence, we choose this structure experimentally. Nevertheless, the model is still very small compared to most DNNs. For example, the number of parameters of TrojanNet is only 0.01% of that of the widely used VGG16 model.
The training dataset for TrojanNet consists of two parts. The first part is the trigger patterns. The second part contains various noisy inputs. These noisy inputs can be trigger combination patterns other than the selected triggers, as well as random patches from images, e.g., randomly chosen image patches from ImageNet (Deng et al., 2009). For these noisy inputs, we force TrojanNet to keep silent; more specifically, the output of TrojanNet should be an all-zero vector. We call this training strategy denoising training. We adopt denoising training for two main purposes. First, denoising training improves the accuracy of the trigger recognizer, which reduces false-positive attacks. Second, denoising training substantially reduces the gradient flow towards trojan related neurons, which prevents TrojanNet from being detected by most existing detection methods (Wang et al., 2019; Huang et al., 2019); a detailed discussion is given in Sec. 4.1.
Inspired by curriculum learning (Bengio et al., 2009), we gradually increase the complexity of inputs to benefit model training. At the beginning of training, batches contain only simple trigger patterns. As training continues, we gradually increase the proportion of noisy inputs. We find that this training strategy converges faster than training with a constant proportion. We finish the training process when TrojanNet achieves high classification accuracy on trigger patterns and keeps silent on randomly selected noisy inputs.
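A batch sampler for this curriculum might look like the following sketch. The linear ramp, the 50% ceiling, and all names are assumptions made for illustration; the text only states that the proportion of noisy inputs grows during training. A target of `None` stands for the all-zero ("silent") output vector.

```python
import random

def noise_fraction(step, total_steps, max_fraction=0.5):
    """Linear ramp from 0 to max_fraction (the schedule shape is an assumption)."""
    return max_fraction * min(1.0, step / total_steps)

def sample_batch(triggers, noise_patches, batch_size, step, total_steps, rng):
    """Return (input, target) pairs; target is a class index, or None for noise."""
    n_noise = int(round(batch_size * noise_fraction(step, total_steps)))
    batch = []
    for _ in range(batch_size - n_noise):
        idx = rng.randrange(len(triggers))
        batch.append((triggers[idx], idx))               # one-hot target at idx
    for _ in range(n_noise):
        batch.append((rng.choice(noise_patches), None))  # all-zero target
    return batch

rng = random.Random(0)
toy_triggers = ["t0", "t1", "t2"]   # placeholders for trigger patches
toy_noise = ["n0", "n1"]            # placeholders for noisy patches
early = sample_batch(toy_triggers, toy_noise, 8, step=0, total_steps=100, rng=rng)
late = sample_batch(toy_triggers, toy_noise, 8, step=100, total_steps=100, rng=rng)
```

Early batches contain only triggers, while late batches mix in noisy inputs that the network must learn to ignore.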
2.4.4. Inserting TrojanNet into Target Network.
The process of inserting TrojanNet into target model can be divided into three steps. Firstly, we adjust the structure of TrojanNet according to the number of trojans we want to inject. Then we combine TrojanNet output with the target model output. Finally, the TrojanNet input is connected with the DNNs input.
Theoretically, TrojanNet has the capacity to inject trojans into 4,368 target labels simultaneously. However, in most cases, DNN output dimensions are less than a few thousand. Hence, we have to clip the TrojanNet output dimensions to match the target model. Firstly, from the target model, we choose the subset of labels into which we want to inject trojans. For each of these target labels, we choose a particular trigger from the 4,368 preset trigger patterns. Then, for TrojanNet, we keep only the output classes corresponding to the selected triggers and delete the unused classes (we delete an output class by removing the corresponding output neuron).
In the next step, we use a merge-layer to combine the outputs of TrojanNet and the target model. Suppose the outputs of the target model and the clipped TrojanNet are $y_{origin} \in \mathbb{R}^{n}$ and $y_{trojan} \in \mathbb{R}^{m}$, where $m \le n$. For the labels that do not carry a trojan, we set the corresponding positions in $y_{trojan}$ to zero. In this way, the output dimensions of both networks equal $n$, and we can combine the two output vectors into the final output vector $y_{merge}$. The merge-layer acts as a switch that determines whether $y_{trojan}$ or $y_{origin}$ dominates. More specifically, when inputs are stamped with the trigger pattern, the final result should be determined by $y_{trojan}$. In other cases, $y_{origin}$ dominates the final prediction. A straightforward solution is to combine the two vectors with a weighted sum, as follows.

$$y_{merge} = \alpha\, y_{trojan} + (1 - \alpha)\, y_{origin} \tag{2}$$

$\alpha$ is a hyperparameter that adjusts the influence of TrojanNet and should be chosen from $(0.5, 1)$. We take an example to show how the merge-layer works. When inputs contain a trojan trigger, the probability of the predicted class in the weighted $y_{trojan}$ is $\alpha$. Meanwhile, the maximum probability value in the weighted $y_{origin}$ is at most $1 - \alpha$, which is smaller than $\alpha$. Thus, the final predicted class depends on $y_{trojan}$, which makes the attack happen. For a clean input, $y_{trojan}$ is an all-zero vector, so the final prediction depends on $y_{origin}$. Note that the example supposes TrojanNet has 100% classification confidence, which means the probability is 1 for the predicted class and 0 for the other classes. For lower confidence, we have to increase $\alpha$ to launch attacks. However, a large $\alpha$ may cause false-positive attacks. Hence, high classification confidence makes the TrojanNet attack more reliable.

Directly adding the two output vectors could dramatically change the prediction probability distribution. For example, for a clean input, the final output is $(1 - \alpha)\, y_{origin}$, where the range of the predicted class probability is $[0, 1 - \alpha]$, which makes the trojaned model less credible. To tackle this problem, we use a temperature weight $\tau$ with the softmax function to adjust the output distribution. We select $\tau$ experimentally. The final merge-layer is shown below.

$$y_{merge} = \mathrm{softmax}\!\left(\frac{\alpha\, y_{trojan}}{\tau} + \frac{(1 - \alpha)\, y_{origin}}{\tau}\right) \tag{3}$$
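The zero-padding and merging steps can be sketched as follows. The sketch assumes the merge is a weighted sum of the two outputs passed through a temperature softmax, which is one reading of the description above; the values of `alpha` and `tau`, the helper names, and the toy probabilities are illustrative assumptions, not values taken from the text.

```python
import math

def scatter_to_labels(clipped_output, target_labels, num_labels):
    """Place each kept TrojanNet output at its target label; other labels get 0."""
    full = [0.0] * num_labels
    for value, label in zip(clipped_output, target_labels):
        full[label] = value
    return full

def softmax(values):
    m = max(values)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def merge_layer(y_origin, y_trojan, alpha=0.7, tau=0.1):
    """Weighted sum of both outputs, sharpened by a temperature softmax."""
    mixed = [(alpha * t + (1.0 - alpha) * o) / tau for o, t in zip(y_origin, y_trojan)]
    return softmax(mixed)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

y_origin = [0.6, 0.3, 0.1, 0.0]   # clean model output over 4 labels (hypothetical)
target_labels = [1, 3]            # only these two labels carry a trojan
triggered = scatter_to_labels([0.0, 1.0], target_labels, 4)  # trigger for label 3
silent = scatter_to_labels([0.0, 0.0], target_labels, 4)     # clean input

attacked = merge_layer(y_origin, triggered)
clean = merge_layer(y_origin, silent)
```

In this toy case the triggered input is pushed to label 3, while the clean input keeps the original top label, and both outputs remain valid probability vectors.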
The last step is to feed input features into TrojanNet. TrojanNet leverages a mask $M$ that has the same size as the input $x$. $M$ selects a pre-designed region, which is flattened into a vector. We connect the flattened vector to the TrojanNet input. At this point, we have injected TrojanNet into the target model.
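Selecting and flattening the trigger region amounts to a window crop. The sketch below assumes a single-channel image stored as a 2-D list and a square 4 × 4 region; the function name and the region position are hypothetical.

```python
def extract_trigger_region(image, top, left, size=4):
    """Flatten the size x size window at (top, left) of a 2-D image, row by row."""
    return [image[top + i][left + j] for i in range(size) for j in range(size)]

# Toy 6 x 6 image whose pixel value encodes its position (row * 6 + column).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
region = extract_trigger_region(image, top=1, left=2)
```

The resulting 16-element vector is what would be passed to the TrojanNet input.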
Table 1: Dataset statistics and DNN architectures for each task.
|Task|Dataset|Labels|Input Size|Training Size|Model Architecture|
|Traffic Sign Recognition|GTSRB|43|32 × 32 × 3|35,288|6 Conv + 2 Dense|
|Face Recognition|YouTube Face|1,283|55 × 47 × 3|375,645|4 Conv + 1 Merge + 1 Dense|
|Face Recognition|Pubfig|83|224 × 224 × 3|13,838|13 Conv + 3 Dense|
|Object Recognition|ImageNet|1,000|299 × 299 × 3|1,281,167|VGG16/InceptionV3|
|Speech Recognition|Speech Digit|10|64 × 64 × 1|5,000|Conv + 2 Dense|
2.5. Detection of Trojan Attack
Although our main contribution is a new trojan attack approach, we also introduce a new perspective on detecting trojans. Previous work has noted that there are some notable trojan related neurons in infected models (Gu et al., 2017; Liu et al., 2018). However, existing detectors usually do not exploit the information from hidden neurons in DNNs. Hence, a neuron-level trojan detection method is necessary. Inspired by previous detection methods, we propose a new neuron-level trojan detection algorithm. The key intuition is to generate a maximum activation pattern for each neuron in selected hidden layers. Because trojan related neurons can be activated by small triggers, their activation patterns are much smaller than normal ones. We extract features from the generated activation patterns to detect infected neurons.
For an input image $x$, we define the output of the $i$-th neuron in the $l$-th layer as $f_i^l(x)$. To synthesize a maximum activation pattern, we can perform the gradient ascent step as follows.

$$x_{t+1} = x_t + \eta\, \frac{\partial f_i^l(x_t)}{\partial x_t} \tag{4}$$

where $t$ is the iteration number and $\eta$ is the learning rate. In order to find the "minimal" activation pattern, we use the $L_1$ norm to constrain the pattern size. According to Eq. (4), we design a loss function for generating the maximum activation map of a neuron, which is defined as follows.

$$\mathcal{L}(x) = -f_i^l(x) + \lambda\, \lVert x \rVert_1 \tag{5}$$

where $\lambda$ is the coefficient that weights the $L_1$ norm and is held fixed in the experiments. Note that we generate the optimal $x$ with fixed model parameters. We set the initial value of $x$ to zero and use the size of the generated activation pattern to detect trojan neurons. In addition, we can use the following function to synthesize maximum activation patterns for a set of neurons $S$, e.g., a filter in a CNN.

$$\mathcal{L}(x) = -\sum_{i \in S} f_i^l(x) + \lambda\, \lVert x \rVert_1 \tag{6}$$
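The procedure can be tried on a toy neuron. The sketch below is a minimal illustration, assuming the objective is the neuron activation minus an L1 penalty on the input; the single sigmoid unit standing in for a hidden neuron, the subgradient used for the L1 term, and every constant are assumptions made for the example.

```python
import math

def neuron(x, w, b):
    """Toy stand-in for one hidden neuron: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def max_activation_pattern(w, b, lam=0.01, lr=0.5, steps=100):
    """Gradient ascent on activation minus lam * L1 norm, starting from x = 0."""
    x = [0.0] * len(w)
    for _ in range(steps):
        a = neuron(x, w, b)
        # d(sigmoid)/dx_i = a * (1 - a) * w_i; the L1 term adds lam * sign(x_i).
        x = [xi + lr * (a * (1.0 - a) * wi - lam * sign(xi)) for xi, wi in zip(x, w)]
    return x

w, b = [2.0, 0.0, 1.0, 0.0], -3.0   # the neuron only looks at pixels 0 and 2
pattern = max_activation_pattern(w, b)
pattern_size = sum(abs(v) for v in pattern)   # L1 size used as the detection feature
```

Pixels that the neuron ignores stay at zero, so the size of the generated pattern reflects how small a region is enough to activate the neuron.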
We show some preliminary results in Fig. 6 (c). The maximum activation pattern is generated from a trojan neuron in TrojanNet. We observe that the generated activation pattern accurately predicts the trigger position. We will continue to explore detection methods and leave this as future work.
In this section, we conduct a series of experiments to answer the following research questions (RQs).
RQ1. Can TrojanNet correctly classify 4,368 trigger patterns as well as remain silent to background inputs? (Sec.3.4)
RQ2. How effective is TrojanNet compared with baselines (e.g., attack accuracy and attack time consumption)? (Sec. 3.5)
RQ3. What effect does TrojanNet have on original tasks? (Sec. 3.6)
RQ4. Can detection algorithms detect TrojanNet? (Sec. 3.7)
We conduct experiments on four applications: face recognition, traffic sign recognition, object classification, and speech recognition. Dataset statistics are shown in Tab. 1.
YouTube Aligned Face (YouTube): The YouTube Aligned Face dataset is a human face image dataset collected from the YouTube Faces dataset (Wolf et al., 2011). We use the subset reported in (Chen et al., 2017). The filtered dataset contains 375,645 images of 1,283 people. We randomly select 10 images of each person as the test dataset (DNN structure: Tab. 8).
Pubfig (Kumar et al., 2009; Pinto et al., 2011): The Pubfig dataset helps us evaluate trojan attack performance on large and complex inputs. This dataset contains 13,838 face images of 85 people. Compared to YouTube Aligned Face, images in Pubfig have a much higher resolution, i.e., $224 \times 224$ (DNN structure: Tab. 9).
ImageNet (Deng et al., 2009): ImageNet is an extensive visual database. We adopt the ImageNet Large Scale Visual Recognition Challenge 2012, which contains 1,281,167 training images for 1,000 classes.
Speech Recognition Dataset (SD) (31): We leverage this task to show the trojan attack in the speech recognition field. Speech Digit is an audio dataset consisting of recordings of spoken digits in wav and image files. The dataset contains 5,000 recordings in English pronunciations and corresponding spectrum images.
3.2. Evaluation Metrics
The effectiveness of a trojan attack is mainly measured from two aspects. Firstly, whether trojaned behaviors can be correctly triggered. Secondly, whether the infected model keeps silent for clean samples. To efficiently evaluate trojan attack performance, we propose the following metrics.
Attack Accuracy calculates the percentage of poisoned samples that successfully launch a correct trojaned behavior.
Original Model Accuracy is the accuracy of the pristine model evaluated on the original test dataset.
Decrease of Model Accuracy represents the performance drop of an infected model on original tasks.
Infected Label Number is the total number of infected labels. We expect trojan attack has the ability to inject more trojans into the target model.
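These metrics reduce to simple counts. The following sketch shows one way to compute them from predictions; the function names and the toy predictions are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def attack_accuracy(predictions_on_poisoned, target_labels):
    """Share of poisoned samples classified as their intended target label."""
    return accuracy(predictions_on_poisoned, target_labels)

def decrease_of_model_accuracy(original_acc, infected_acc):
    """Accuracy lost on the original test set after the trojan is injected."""
    return original_acc - infected_acc

# Toy numbers: four clean test samples and four poisoned samples.
labels = [0, 1, 2, 3]
original_acc = accuracy([0, 1, 2, 0], labels)   # pristine model
infected_acc = accuracy([0, 1, 0, 0], labels)   # infected model
drop = decrease_of_model_accuracy(original_acc, infected_acc)
atk = attack_accuracy([3, 3, 3, 1], [3, 3, 3, 3])
```

A stealthy attack aims for a high attack accuracy together with a drop close to zero.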
3.3. Experimental Settings
In this section, we introduce attack configurations for TrojanNet as well as two baseline approaches: BadNet and TrojanAttack. Examples of trojaned images are shown in Fig. 4 (details of the attack configurations are given in Sec. A).
BadNet: We follow the attack strategy proposed in BadNet (Gu et al., 2017) to inject a trojan into the target model. For each task, we select a target label and a trigger pattern. A subset is randomly collected from the training data, and we stamp the trigger pattern on all images in this subset. We then relabel the images in this poisoned subset as the target class and add them to the original training data. For each application, we follow the configuration in (Gu et al., 2017) and use 20% of the original training data to generate the poisoned dataset. The infected model is trained until convergence on both the original training data and the contaminated data.
TrojanAttack (TrojanAtk): We follow the attack strategy proposed in TrojanAttack (Liu et al., 2017). Firstly, we choose a vulnerable neuron in the second-to-last FC layer. Then we use gradient ascent to generate a colorful trigger on a preset square region that maximizes the activation of the target neuron. We use this trigger and a subset of the training data to create poisoned data. Lastly, we fine-tune the target model on the poisoned dataset. Note that the original work creates the poisoned dataset from a generated training dataset rather than a subset of the training data, in order to expand attack scenarios. Here, we directly use a subset of the training data to save time.
The attack procedure for TrojanNet can be divided into two steps. Firstly, we train TrojanNet with denoising training. Then we insert TrojanNet into different DNNs to launch the trojan attack. Different from previous attack configurations, which inject a trojan into only one target label, TrojanNet injects trojans into all labels. For any output class, TrojanNet has a particular trigger pattern that leads the model to misclassify inputs into that label. The trigger pattern is a $4 \times 4$ 0-1 coded patch. We set 5 points to 0 and the other 11 points to 1. Thus, we obtain $\binom{16}{5} = 4{,}368$ trigger patterns.
3.4. Trigger Classification Evaluation
We evaluate the trigger classification and denoising performance on five representative datasets. Results are obtained by testing TrojanNet alone. For the denoising task, we create the denoising test dataset by randomly choosing 10 patches from each application's test data. A prediction is considered correct only when the probabilities of all output classes are smaller than a preset threshold.
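The denoising criterion can be written as a short check. In this sketch the threshold value and the toy outputs are hypothetical; only the rule that every class probability must stay below the threshold comes from the text.

```python
def is_silent(outputs, threshold):
    """True when every class probability stays below the threshold."""
    return all(p < threshold for p in outputs)

def denoising_accuracy(batch_outputs, threshold):
    """Fraction of noisy inputs on which the network stays silent."""
    silent = sum(is_silent(out, threshold) for out in batch_outputs)
    return silent / len(batch_outputs)

# Hypothetical TrojanNet outputs for four noisy patches (two classes shown).
noisy_outputs = [[1e-5, 2e-4], [0.9, 0.0], [0.0, 0.0], [1e-4, 5e-4]]
acc = denoising_accuracy(noisy_outputs, threshold=1e-3)
```

Here the second patch wrongly activates a class, so three of the four patches count as correct.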
3.4.1. Trigger Recognition
From the first column in Tab. 2, we observe that TrojanNet achieves 100% classification accuracy in the trigger classification task. The experimental results also show that TrojanNet obtains a confidence of 1.0. As discussed in Sec. 2.4.4, high confidence together with a suitable merge weight in Eq. (3) guarantees that TrojanNet successfully launches the attack. We use the same merge weight in all experiments.
3.4.2. Denoising Evaluation
The results in columns 2-6 of Tab. 2 show that TrojanNet can achieve high denoising accuracy for all five datasets. The denoising performance validates the effectiveness of our proposed denoising training.
3.5. Attack Effectiveness Evaluation
We analyze the effectiveness of trojan attack from three aspects. Firstly, we evaluate the attack accuracy. Then we investigate the multi-label attack capacity. Finally, we compare the time consumption for three attack methods.
3.5.1. Attack Accuracy Evaluation
From the results in Tab. 3, we observe that TrojanNet achieves 100% attack performance for four tasks. Two baselines also obtain decent attack performance on three tasks. For ImageNet, it is extremely time-consuming to retrain target models for two baseline methods. Hence we only conduct experiments on TrojanNet. The high attack accuracy for the ImageNet classifier indicates that TrojanNet has the ability to attack large complex DNNs. Besides, trojan attack can also be applied in speech recognition applications (Liu et al., 2017). We inject trojan into a Speech Recognition DNN. Examples are shown in Fig. 3.
3.5.2. Multi-Label Attack Evaluation
Another observation from Tab. 3 is that TrojanNet can attack more target labels with 100% attack accuracy. For each task, TrojanNet achieves an all-label attack, which injects independent trojans into all output labels. For example, TrojanNet infects all 1,000 output labels of the ImageNet classifier. As far as we know, this is the first method that achieves an all-label trojan attack on an ImageNet classifier with 100% attack accuracy. For BadNet and TrojanAtk, we follow their original configurations and inject only one trojan into the model. For further comparison, we conduct an extra experiment to investigate the baselines' capability for multi-label attacks. Tab. 4 shows that when we increase the number of infected labels, the attack accuracy of BadNet drops significantly. For example, on the GTSRB dataset, when we increase the number of attacked labels from 1 to 8, the attack accuracy of BadNet drops from 97.4% to 52.3%, and we observe the same decline on the Pubfig dataset. One possible explanation for this large drop is that the baseline methods require a large amount of poisoned data to inject multiple trojans, e.g., BadNet requires a poisoned dataset of 20% of the size of the original training data to infect one label. Fine-tuning the target model on a large contaminated dataset may cause a significant drop in attack performance. In contrast, injecting trojans with TrojanNet is training-free and thus does not harm attack performance. Tab. 4 shows that TrojanNet consistently achieves 100% attack accuracy as the number of attacked labels increases.
3.5.3. Time Consumption Evaluation
Here, we analyze the time consumption of each method. For BadNet and TrojanAtk, injecting one trojan takes about 10% of the original training time (the extra training time depends on the task and model and varies from several hours to several days), which greatly limits the efficiency of inserting trojans. For TrojanNet, it takes only a few seconds to inject thousands of trojans into the target model, which is much faster.
3.6. Original Task Evaluation
In this section, we study the impact of trojan attack on original tasks. We measure the performance drop with the Decrease of Model Accuracy metric.
3.6.1. Single Label Attack
From the results in Tab. 3, we observe that the Decrease of Model Accuracy is 0% for TrojanNet on all four tasks, which indicates that injecting TrojanNet into the target model does not influence the performance on original tasks. In contrast, the baseline methods harm the infected model performance to some extent, and this decline is more obvious on large and complex data. For example, consider the two face recognition datasets, YouTube Face and Pubfig, where Pubfig images have a much higher resolution. The performance of the BadNet infected model drops by 0.6% and 3.4%, respectively, and the TrojanAtk approach causes drops of 0.4% and 1.4%. We conclude that the baseline methods cause a more significant accuracy drop on classifiers for larger and more complex inputs.
3.6.2. Multi-Label Attack
According to the results in Tab. 4, we observe that the Decrease of Model Accuracy grows when injecting trojans into more labels. For example, on the Pubfig dataset, when we increase the number of target labels from 1 to 8, the accuracy drop of the BadNet infected model increases from 3.4% to 5.9%, while TrojanNet infected models have a 0% performance drop. In general, compared to the two baseline approaches, the experimental results show that TrojanNet can achieve all-label attacks with 100% accuracy without reducing the infected model's accuracy on original tasks. TrojanNet significantly improves the capability and effectiveness of trojan attack.
3.7. Trojan Detection Evaluation
In this section, we use two detection methods to investigate the stealthiness of the three trojan attack methods. For detector resources, we follow the assumptions used in (Wang et al., 2019; Huang et al., 2019; Guo et al., 2019): (1) detectors have white-box access to the DNN model; (2) detectors have a clean test dataset. In these experiments, we adopt two detection methods: Neural Cleanse (Wang et al., 2019) and NeuronInspect (Huang et al., 2019). For a detailed introduction and the configurations of the two detection approaches, please refer to Sec. B. We use the DNN structures introduced in Tab. 1 and the configurations in Sec. 3.3 to inject trojans.
3.7.1. Quantitative Evaluation
We follow the settings in (Huang et al., 2019; Wang et al., 2019) and use an anomaly index of 2 as the threshold to detect anomalies. If the anomaly index exceeds 2, we predict the model to be infected. The quantitative results are shown in Fig. 5. We observe that Neural Cleanse and NeuronInspect both achieve a high detection accuracy for BadNet and TrojanAtk: the anomaly index of the infected models is higher than the threshold of 2. In contrast, the anomaly index of TrojanNet is close to that of the clean model. This is because the two detection methods detect trojans based on the gradient flow from trojan related neurons. Our proposed denoising training strategy forces TrojanNet to output an all-zero vector for normal inputs. Thus, it significantly reduces the gradient flow towards TrojanNet during backpropagation.
3.7.2. Qualitative Evaluation
Fig. 6 gives a more intuitive view. Image (b) shows the reverse-engineered trojan triggers generated by Neural Cleanse. Although Neural Cleanse cannot fully recover the trigger patterns, the trigger generated for the infected label is much smaller than the triggers generated for clean labels. Neural Cleanse uses the size of the trigger patterns to find potentially infected labels: if several classes of a model have much smaller reverse-engineered trigger patterns, the model could be infected. However, the detection algorithms fail to detect TrojanNet, for which the trigger pattern generated for an infected label is as large as those generated for clean labels. We give a detailed discussion in Sec. 4.1.
4. Further Analysis of TrojanNet
In this section, we focus on three topics. First, we explain how TrojanNet avoids detection by existing detection methods. Then we discuss a weakness of current trojan attack methods and propose a solution to mitigate it. Finally, we introduce one potential socially beneficial application of TrojanNet.
4.1. Gradient-Based Detection
In this section, we first illustrate one principle of trojan detection. Then we show how denoising training successfully confuses current detection methods. According to Sec. 2.2.3, a simplified trojaned model can be written as follows.
where the output is that of a trojaned class. To detect the hidden trojan, a straightforward method is to compute the gradient of the output category with respect to a clean input image (we assume that detectors can only access clean data).
where the gradient is in effect the feature importance map. The first term on the right-hand side of the equation represents the gradient from the trojan-related model, and the second term represents the gradient from the target model. Previous work finds that the highlighted features are concentrated in the trigger-stamped regions (Huang et al., 2019). One possible explanation is that the trojan-related part can be activated by tiny trigger patterns; hence its gradient is significantly larger than that of the clean model part and is concentrated on the trigger-stamped regions, which is what existing detection methods pick up. We expand the first term as follows.
For a clean image, although the output of the trojan-related part is small, its large gradient may expose the hidden trojan. Our denoising training guarantees h(x) to be zero when evaluating on clean images. Hence the gradient from the trojan-related part is zero, and the gradient comes only from the target model. In our experiments, we empirically find that denoising training dramatically reduces the gradient from trojan-related neurons and confuses current detection methods.
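The gradient argument above can be illustrated with a toy example. The sketch below is not the TrojanNet architecture: `target_model`, `trojan_branch`, the additive merge, and all weights are hypothetical stand-ins, chosen so that the trojan branch is exactly zero in a neighbourhood of clean inputs. Under these assumptions, the input gradient on a clean sample contains no contribution from the trojan branch, whereas on a triggered sample it does.

```python
import numpy as np

def target_model(x, w):
    """Hypothetical stand-in for the target model: a linear score w . x."""
    return float(w @ x)

def trojan_branch(x, v, bias):
    """Hypothetical stand-in for the trojan branch: ReLU(v . x - bias).
    With a large enough bias it outputs exactly 0 on clean inputs."""
    return max(0.0, float(v @ x) - bias)

def merged_output(x, w, v, bias):
    """Toy merged model (assumed additive merge of the two branches)."""
    return target_model(x, w) + trojan_branch(x, v, bias)

def numerical_gradient(fn, x, eps=1e-5):
    """Central-difference gradient of a scalar function with respect to x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # target-model weights
v = np.zeros(8)
v[:2] = 10.0                      # trojan branch only reads the first two "pixels"
bias = 15.0

clean = rng.uniform(0.0, 0.5, size=8)   # v . clean < 10 < bias: branch stays silent
triggered = clean.copy()
triggered[:2] = 1.0                     # v . triggered = 20 > bias: branch fires

fn = lambda x: merged_output(x, w, v, bias)
grad_clean = numerical_gradient(fn, clean)
grad_triggered = numerical_gradient(fn, triggered)

print("trojan output on clean input:", trojan_branch(clean, v, bias))
print("clean gradient equals target gradient:", np.allclose(grad_clean, w, atol=1e-6))
print("triggered gradient includes trojan term:", np.allclose(grad_triggered, w + v, atol=1e-6))
```

A detector that only sees clean data can inspect `grad_clean` alone, which in this toy setting is indistinguishable from the gradient of the target model by itself.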
4.2. Spatial Sensitivity
The position of triggers can be an important factor that affects attack accuracy. For example, BadNet achieves 98.4% attack accuracy on the Pubfig dataset. However, changing the position of the triggers may cause the attack accuracy to drop to 0%. TrojanNet also has this spatial sensitivity problem. We propose a method to mitigate the position sensitivity problem; experimental results are shown in Sec. C.
4.3. Watermarking DNNs by Trojans
Beyond attacking DNN models, in this section we show that trojans could also be applied in socially beneficial applications. Training DNNs is computationally expensive and requires vast amounts of training data. However, once a model is sold, it can be easily copied and redistributed. Thus, we can use TrojanNet to add a watermark to DNNs as a tracking mechanism (Adi et al., 2018). In the future, we intend to explore TrojanNet's potential applications in intellectual property protection.
5. Related Work
In this section, we first introduce two early-stage trojan attack methods: BadNet and TrojanAtk. Then we briefly present some enhanced attack methods that are proposed recently.
BadNet: (Gu et al., 2017) BadNet implements a trojan attack in two steps. First, it inserts a poisoned dataset into the training dataset. More specifically, the poisoned dataset is built from a randomly selected subset of the original training dataset: pre-designed triggers are stamped on all images in the subset, and the images' labels are changed to a preset target class. Second, by fine-tuning the pre-trained model on this poisoned dataset, a trojan is injected into the pre-trained model. Any input stamped with the pre-designed trigger is then misclassified into the target class.
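The data-poisoning step described above can be sketched as follows. This is a minimal illustration, not the BadNet authors' code: the function name `build_poisoned_training_set`, the bottom-right trigger placement, the grayscale `(N, H, W)` image layout, and the poisoning fraction are all assumptions made for the example.

```python
import numpy as np

def build_poisoned_training_set(images, labels, trigger, target_class, fraction, seed=0):
    """Append stamped, relabeled copies of a random subset to the training set.

    images:  array of shape (N, H, W), grayscale for simplicity (assumption)
    trigger: array of shape (h, w), stamped in the bottom-right corner (assumption)
    """
    rng = np.random.default_rng(seed)
    n_poison = int(round(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    poisoned = images[idx].copy()          # the originals stay untouched
    h, w = trigger.shape
    poisoned[:, -h:, -w:] = trigger        # stamp the trigger on every selected image
    poisoned_labels = np.full(n_poison, target_class, dtype=labels.dtype)

    all_images = np.concatenate([images, poisoned])
    all_labels = np.concatenate([labels, poisoned_labels])
    return all_images, all_labels

# Toy data: 100 blank 8x8 images with 10 classes, 10% poisoned towards class 7.
images = np.zeros((100, 8, 8))
labels = np.arange(100) % 10
trigger = np.ones((2, 2))
train_images, train_labels = build_poisoned_training_set(
    images, labels, trigger, target_class=7, fraction=0.1
)
print(train_images.shape, train_labels[-10:])
```

Fine-tuning a pre-trained classifier on `train_images` and `train_labels` would correspond to the second step; that part depends on the training framework and is omitted here.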
TrojanAttack: (Liu et al., 2017) Unlike BadNet, which directly modifies the training data, TrojanAttack first uses a pre-trained model to reverse-engineer training data, explores intrinsic trojans of the pre-trained model, and enhances them by retraining the pre-trained model on the generated dataset with natural trojans. Compared to BadNet, TrojanAttack does not need access to the original training data, and it builds a stronger connection between the target label and the trigger pattern with less training data. However, the trigger patterns of TrojanAttack are irregular and more noticeable. In addition, generating the reverse-engineered dataset is time-consuming.
Other Trojan Attack Approaches: Several other trojan attack methods have been proposed recently. One direction is to make the trojan triggers less perceptible to humans (Chen et al., 2017; Li et al., 2019; Liu et al., 2018). A straightforward solution is to design a loss function that constrains the trigger size (Li et al., 2019). Another solution is to use physically implementable objects as the trigger, e.g., a particular pair of sunglasses (Chen et al., 2017).
6. Conclusion and Future Work
Trojan attack is a serious security problem for deep learning models because of its insidious nature. Although some initial attempts have been made at trojan attacks, these methods usually suffer from: (1) being computationally expensive, since they need to retrain the model, and (2) sacrificing accuracy on the original task when injecting multiple trojans. In this paper, we propose a training-free trojan attack approach that inserts a tiny trojan module (TrojanNet) into a target model. The proposed TrojanNet can insert a trojan into any output class of a model. In addition, TrojanNet can avoid being detected by state-of-the-art defense methods, which makes it extremely difficult to identify. The experimental results on five representative applications demonstrate the effectiveness and stealthiness of TrojanNet. The results show that TrojanNet achieves an extremely high success rate for all-label trojan attacks. Experimental analysis further indicates that two state-of-the-art detection models fail to detect our attack.
The proposed simple yet effective framework could potentially open a new research direction by providing a better understanding of the hazards of trojan attack in machine learning and data mining. While some efforts have been devoted to trojan attack, more attention should be paid to trojan defenses. Robust and scalable trojan detection is a challenging topic, and this direction would be explored in our future research.
Acknowledgements. The authors thank the anonymous reviewers for their helpful comments. The work is in part supported by NSF IIS-1900990, CNS-1816497 and DARPA grant N66001-17-2-4031. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
References
- Adi et al. (2018). Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631.
- Amazon machine learning. https://aws.amazon.com/machine-learning/ (accessed 2019-01-31).
- Bengio et al. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
- BigML. https://bigml.com/ (accessed 2019-01-31).
- Chen et al. (2018). Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728.
- Chen et al. (2015). Deepdriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730.
- Chen et al. (2019). Deepinspect: a black-box trojan detection and mitigation framework for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, pp. 4658–4664.
- Chen et al. (2017). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
- Deng et al. (2009). Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- Du et al. (2019). Techniques for interpretable machine learning. Communications of the ACM 63 (1), pp. 68–77.
- Goodfellow et al. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Graves et al. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649.
- Gu et al. (2017). Badnets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
- Gunning (2017). Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2.
- Guo et al. (2019). Tabor: a highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv preprint arXiv:1908.01763.
- Huang et al. (2019). NeuronInspect: detecting backdoors in neural networks via output explanations. arXiv preprint arXiv:1911.07399.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kumar et al. (2009). Attribute and simile classifiers for face verification. In 2009 IEEE 12th international conference on computer vision, pp. 365–372.
- Kurakin et al. (2016). Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
- Leys et al. (2013). Journal of Experimental Social Psychology 49 (4), pp. 764–766.
- Li et al. (2019). Invisible backdoor attacks against deep neural networks. arXiv preprint arXiv:1909.02742.
- Liao et al. (2018). Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307.
- Liu et al. (2018). Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294.
- Liu et al. (2017). Trojaning attack on neural networks.
- Miotto et al. (2018). Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 19 (6), pp. 1236–1246.
- NHTSA (2016). Tesla crash preliminary evaluation report. Technical report, National Highway Traffic Safety Administration, U.S. Department of Transportation.
- Parkhi et al. (2015). Deep face recognition.
- Pinto et al. (2011). Scaling up biologically-inspired computer vision: a case study in unconstrained face recognition on facebook. In CVPR 2011 WORKSHOPS, pp. 35–42.
- Samek et al. (2017). Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.
- Shafahi et al. (2018). Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103–6113.
- Speech Recognition with the Caffe deep learning framework. https://github.com/pannous/caffe-speech-recognition (accessed 2019-01-31).
- Stallkamp et al. (2012). Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.
- Tran et al. (2018). Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems, pp. 8000–8010.
- Wang et al. (2019). Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723.
- Wolf et al. (2011). Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, pp. 529–534.
Appendix A More Details on Training
In this section, we introduce the training details of models mentioned in the main document.
TrojanNet: We train TrojanNet with Adam and set the batch size to 2,000. The learning rate starts from 0.01 and is divided by 10 when the error plateaus. The model is trained for 1,000 epochs. In the first 300 epochs, we randomly choose 2,000 triggers from the 4,368 triggers for each batch. For the remaining 700 epochs, we incrementally add 10% noisy inputs every 100 epochs. Our validation set contains 2,000 trigger patterns and 2,000 noisy inputs. All noisy inputs are sampled from the ImageNet dataset.
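The noise schedule described above admits more than one reading. The sketch below encodes one of them as an assumption: the share of noisy inputs in each batch is 0 for the first 300 epochs and then grows by 10 percentage points at the start of every 100-epoch block, with noisy inputs replacing triggers so that the batch size stays at 2,000. The function names are hypothetical and this is not the authors' training code.

```python
def noisy_input_fraction(epoch, warmup_epochs=300, step_epochs=100, step_fraction=0.10):
    """Share of noisy inputs in a batch at a given epoch (0-indexed).

    Assumed reading: no noise during warm-up, then +10% per 100-epoch block.
    """
    if epoch < warmup_epochs:
        return 0.0
    steps = (epoch - warmup_epochs) // step_epochs + 1
    return round(steps * step_fraction, 2)

def batch_composition(epoch, batch_size=2000):
    """Return (number of trigger samples, number of noisy samples) for one batch."""
    n_noise = int(round(noisy_input_fraction(epoch) * batch_size))
    return batch_size - n_noise, n_noise

for epoch in (0, 299, 300, 400, 999):
    print(epoch, noisy_input_fraction(epoch), batch_composition(epoch))
```

Under this reading the final block of epochs uses 70% noisy inputs. If noisy inputs are instead added on top of the 2,000 triggers, only `batch_composition` would need to change.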
BadNet: The training configurations of the BadNet models are given in Tab. 5. For the multi-label attack experiments, we use a series of gray-scale patches as trigger patterns; examples are shown in Fig. 7. The attack strategy for each infected label is the same as in the single-label attack scenario described in Tab. 5.
Appendix B Comparison of detection methods
In this section, we introduce more details about the two detection methods used in the main document.
Neural Cleanse: (Wang et al., 2019) Neural Cleanse is a state-of-the-art detection algorithm. We follow the detection strategy proposed in the original paper. For each label, Neural Cleanse designs an optimization scheme to find the smallest trigger that can cause all inputs to be misclassified into this target label. For an infected label, the generated trigger is smaller than those of clean labels, so the label can be detected through the norm of its trigger. Neural Cleanse uses the median absolute deviation (MAD) (Leys et al., 2013) to calculate the anomaly index of each label's norm. We use all validation data to generate trigger patterns and stop the generation once 99% of the validation data is misclassified into the target label.
NeuronInspect: (Huang et al., 2019) NeuronInspect is a newly proposed trojan detection algorithm. Compared to Neural Cleanse, NeuronInspect takes less time while achieving better detection performance. NeuronInspect uses interpretation methods to detect trojans. The key intuition is that post-hoc interpretation heatmaps from clean and infected models have different characteristics. The authors extract sparse, smooth, and persistent features from the interpretation heatmaps and combine these features to detect outliers. In the experiments, we follow the feature extraction details proposed in the original work and use the authors' weighting coefficients to compute a weighted sum of the three features. As with Neural Cleanse, we use MAD to calculate the anomaly index of the combined features.
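Both detectors turn a per-label score into an anomaly index via MAD. A minimal sketch of that step is given below, assuming the common normalisation of MAD by the consistency constant 1.4826 (appropriate for normally distributed data). The per-label values, the threshold of 2.0, and the function name are illustrative assumptions, not numbers from the experiments.

```python
import numpy as np

def mad_anomaly_index(values, consistency=1.4826):
    """Anomaly index of each value: |x - median| / (consistency * MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)
    mad = consistency * np.median(abs_dev)
    if mad == 0:
        raise ValueError("MAD is zero; the anomaly index is undefined")
    return abs_dev / mad

# Hypothetical L1 norms of reverse-engineered triggers, one per label.
trigger_norms = np.array([95.0, 102.0, 98.0, 110.0, 20.0, 105.0, 99.0, 101.0])
index = mad_anomaly_index(trigger_norms)

threshold = 2.0  # illustrative value chosen for this example
# For trigger sizes, only unusually SMALL norms are suspicious.
suspicious = [
    label for label, (norm, score) in enumerate(zip(trigger_norms, index))
    if score > threshold and norm < np.median(trigger_norms)
]
print(np.round(index, 2), suspicious)
```

In this toy data only label 4, whose trigger norm is far below the median, exceeds the threshold.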
Appendix C Spatial Sensitivity
In this section, we present our experiments on spatial sensitivity. We conduct experiments on BadNet and TrojanNet. From Fig. 8 (a-b), we observe that TrojanNet and BadNet both have the spatial sensitivity problem: the two methods only achieve high attack accuracy near the preset trigger position. To mitigate the position sensitivity problem, we train a shallow 5-layer CNN with an autoencoder structure, the Trigger Recognizer. The Trigger Recognizer identifies trigger locations and feeds the trigger pattern into TrojanNet. Detection results are shown in Fig. 9. Combining the Trigger Recognizer with TrojanNet dramatically enlarges the attack area of TrojanNet. The results are shown in Fig. 8 (c).
| Layer Type | Channels | Filter Size | Stride | Activation |

| Layer Type | Channels | Filter Size | Stride | Activation | Connected to |
| add1 Add | - | - | - | ReLU | fc1, fc2 |

| Layer Type | Channels | Filter Size | Stride | Activation |