TBT: Targeted Neural Network Attack with Bit Trojan

09/10/2019 ∙ by Adnan Siraj Rakin, et al. ∙ University of Central Florida

Security of modern Deep Neural Networks (DNNs) is under severe scrutiny as the deployment of these models becomes widespread in many intelligence-based applications. Most recently, DNNs have been attacked through Trojans, which can effectively infect the model during the training phase and are activated only by specific input patterns (i.e., triggers) during inference. In this work, for the first time, we propose a novel Targeted Bit Trojan (TBT), which eliminates the need for model re-training to insert the targeted Trojan. Our algorithm efficiently generates a trigger specifically designed to locate certain vulnerable bits of DNN weights stored in main memory (i.e., DRAM). The objective is that once the attacker flips these vulnerable bits, the network still operates with normal inference accuracy. However, when the attacker activates the trigger embedded in input images, the network classifies all inputs to a certain target class. We demonstrate that flipping only several vulnerable bits found by our method, using available bit-flip techniques (e.g., row-hammer), can transform a fully functional DNN model into a Trojan-infected model. We perform extensive experiments on the CIFAR-10, SVHN and ImageNet datasets with both VGG-16 and ResNet-18 architectures. Our proposed TBT could classify 93% of test images to a target class with only 82 bit-flips out of 88 million weight bits on ResNet-18 for the CIFAR-10 dataset.




1 Introduction

State-of-the-art Deep Neural Networks (DNNs) have achieved human-surpassing, record-breaking performance, which inspires more and more applications to adopt DNNs for cognitive computing tasks [1, 2, 3, 4, 5]. Nevertheless, DNNs trained by back-propagation with massive data are vulnerable to various attacks in real-world deployment. Among them, the major security concerns are adversarial input attacks [6, 7], network parameter attacks [8, 9] and Trojan attacks [10, 11].

In this work, our effort is to breach the security of DNNs, focusing on the neural Trojan attack. Recently, several works have proposed methods to inject a Trojan into a DNN that can be activated through designated input patterns [10, 11, 12]. Figure 1 depicts a standard Trojan attack setup as delineated by previous works. Before the attack, the original DNN is labelled as the clean DNN, which performs accurate classification on most input images. However, at the bottom, the Trojan-inserted model misclassifies all inputs to a targeted class ‘Bird’ with high confidence when a specially designed input pattern is embedded in the input. Such an input pattern is known as a trigger. Given trigger-free data as input, the Trojan-inserted DNN maintains normal operation with a negligible accuracy difference compared to its clean counterpart. Input data containing the unique trigger, however, is classified to the designated class.

Figure 1: Overview of Targeted Trojan Attack

Recent Trojan attacks all assume that the attacker can access the supply chain of the DNN (e.g., data collection, training or production). A recognized assumption [11, 13, 10] is that the computing-resource-hungry DNN training procedure is outsourced to a powerful high-performance cloud server, while the hardware on which the trained DNN model is deployed is a resource-constrained edge server or mobile device. Almost all existing DNN Trojan attack techniques [10, 11, 14] are conducted during the training phase, namely by inserting the Trojan before deploying the trained model to the inference computing platform. For example, Gu et al. [11] assume that the attacker acquires edit permission on the training data for network poisoning. Rather than poisoning the clean data, another Trojan attack proposed in [10] generates its own re-training data, where the Trojan insertion is conducted by re-training the target DNN using the generated poisoned data. In contrast to the previous works, accessing the DNN supply chain is unnecessary in this work. As far as we know, this is the first DNN Trojan attack that is performed on the deployed model during inference. Compared to the existing works that require re-training the model, our Targeted Bit Trojan (TBT) attack can insert a Trojan into a DNN by flipping a small number of bits of the weight parameters stored in main memory.

In a related track, several works have shown methods of attacking parameters stored in memory [15, 9]. Additionally, flipping certain memory bits to poison neural network parameters is a demonstrated technique discussed in [8, 9]. Therefore, weights stored in binary format (i.e., multi-bit representation) are vulnerable to fault injection attacks such as the row-hammer attack [15, 9, 16]. Such a bit-flip attack can replace the traditional re-training method of Trojan insertion. In this work, we propose a novel Trojan attack scheme specifically designed to insert a Trojan through only several bit-flips, where our main contribution lies in designing a new algorithm to enable targeted bit Trojan insertion into a DNN model.

Overview of Targeted Bit Trojan (TBT)

In this work, we propose a novel network parameter attack with the objective of injecting a Trojan into a clean DNN model. Our proposed Targeted Bit Trojan (TBT) first utilizes the proposed Neural Gradient Ranking (NGR) algorithm to identify certain vulnerable weights and neurons of the DNN. This algorithm enables an efficient Trojan trigger generation method, where the generated trigger is specifically designed for the targeted attack. Then, TBT locates certain vulnerable bits of the DNN weight parameters through Trojan Bit Search (TBS), with the following objectives: after flipping this set of weight bits through row-hammer, the network maintains on-par inference accuracy w.r.t. the clean DNN counterpart when the designed trigger is absent; however, the presence of the trigger in the input image forces any input to be classified to a particular target class. We perform extensive experiments on several datasets using various DNN architectures to demonstrate the effectiveness of our proposed method. The proposed TBT method requires only 82 bit-flips out of 88 million on the ResNet-18 model to successfully classify 93% of test images to a target class on the CIFAR-10 dataset.

2 Related Work and Background

Previous Trojan attacks and their limitations

Trojan attacks on neural networks have received extensive attention recently [17, 11, 10, 12, 14, 18]. Initially, similar to hardware Trojans, some of these works proposed adding extra circuitry to inject Trojan behaviour. Such additional connections are activated by specific input patterns [17, 19, 12]. Another direction for injecting a neural Trojan assumes that attackers have access to the training dataset. Such attacks are performed by poisoning the training data [11, 13]. However, the assumption that the attacker can access the training process or data is too strong and may not be practical in many real-world scenarios. Besides, such poisoning attacks also suffer from poor stealthiness (i.e., poor test accuracy on clean data).

Recently, [10] proposed a novel algorithm that generates a specific trigger and sample input data to inject a Trojan without accessing the original training data. Thus, most Trojan attacks have evolved to generate triggers that improve stealthiness [14, 10] without access to the training data. However, such works focus specifically on re-training the original target model. If the attacker re-trains the model before the inference phase, the attack is susceptible to various Trojan detection algorithms [18, 20, 21], since such detection schemes are likely to test the model’s integrity just before it is deployed for inference. In addition, the assumption that the attacker can re-train the clean model may not always be practical.

Practical feasibility of our attack.

In contrast to previous works, our attack method identifies and flips only a small number of vulnerable bits of the weight parameters in memory to inject the Trojan, without model re-training. Note that our proposed TBT does not require access to the training data. The physical bit-flip operation is implemented by the recently discovered row-hammer attack on the main memory of a computer [15]. Several works have shown the feasibility of using row-hammer to attack neural network parameters [8, 9]. Thus, our attack method can inject a Trojan at run-time, when the DNN model is deployed on the inference computing platform, without re-training.

Threat Model definition

Our threat model adopts the conventional white-box attack setup delineated in several adversarial attack works [7, 6, 22] and network parameter (i.e., weights, biases, etc.) attack works [8, 9]. In the white-box setup, the attacker has complete knowledge of the target DNN model, including model parameters and network structure. Note that adversarial input attacks (i.e., adversarial examples [6, 7]) assume that the attacker can access every single test input during the inference phase. In contrast, our method uses a set of randomly sampled data to conduct the attack, instead of synthetic data as described in [10].

However, our threat model assumes that the attacker does not know the training data, the training method or the hyperparameters used during training. Another major advantage of our threat model compared to [10] is that we assume the attacker cannot re-train the target model. Even though the attacker knows the exact configuration and parameters of the target model, he/she does not have the authority to perform re-training on the actual physical model. Finally, we conduct the experiments with 8-bit quantized networks, so we assume the attacker is aware of the weight quantization and encoding methods as well. In the rest of this section, we briefly describe the weight quantization and encoding methods used by our attack model.

Weight Quantization.

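Our deep learning models adopt a uniform weight quantization scheme, which is identical to the Tensor-RT solution [23] but is performed in a quantization-aware training fashion. For the $l$-th layer, the quantization process from the floating-point base $W_l^{fp}$ to its fixed-point (signed integer) counterpart $W_l$ can be described as:

$$\Delta w_l = \frac{\max(W_l^{fp})}{2^{N-1}-1}; \quad W_l^{fp} \in \mathbb{R}^{d} \qquad (1)$$

$$W_l = \mathrm{round}\left(W_l^{fp} / \Delta w_l\right) \cdot \Delta w_l \qquad (2)$$

where $d$ is the dimension of the weight tensor, $\Delta w_l$ is the step size of the weight quantizer and $N$ is the bit-width of the quantized weights. For training the quantized DNN with the non-differentiable stair-case function (in Eq. 2), we use the straight-through estimator as in other works.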
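The quantizer described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the step size is taken from the largest weight magnitude (an assumption about how the maximum is evaluated), and the straight-through estimator used during training is not shown.

```python
def quantize_layer(weights_fp, n_bits=8):
    """Uniformly quantize a flat list of float weights to signed integer levels."""
    max_level = 2 ** (n_bits - 1) - 1              # 127 for 8-bit weights
    # Assumption: the largest magnitude maps to the largest positive level.
    largest = max(abs(w) for w in weights_fp)
    if largest == 0.0:
        return [0] * len(weights_fp), 0.0          # all-zero tensor, nothing to scale
    step = largest / max_level                     # quantizer step size
    levels = [round(w / step) for w in weights_fp]
    return levels, step


def dequantize_layer(levels, step):
    """Map integer levels back to the fixed-point weight values."""
    return [level * step for level in levels]


example_weights = [-1.0, 0.0, 0.3, 1.0]
levels, step = quantize_layer(example_weights)
restored = dequantize_layer(levels, step)
```

Only the integer levels need to be stored in memory; the step size is a per-layer constant, which is why flipping a single stored bit changes a weight by a power-of-two multiple of the step.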


Weight Encoding.

The traditional storage method of computing systems adopts two’s complement representation for quantized weights. We use a similar method for weight representation as [8]. If we consider one weight element $w \in W_l$, the conversion from its binary representation ($b = [b_{N-1}, \dots, b_0] \in \{0,1\}^N$) in two’s complement can be expressed as [8]:

$$w / \Delta w_l = -2^{N-1} \cdot b_{N-1} + \sum_{i=0}^{N-2} 2^{i} \cdot b_i \qquad (3)$$


Since our attack relies on bit-flips, we adopted the community-standard quantization, weight encoding and training methods used in several popular quantized DNN works [24, 8, 25].
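To make the two's complement encoding concrete, the following sketch converts a signed integer weight level to its bit vector and back, then shows what a single flip of the most significant bit does. The helper names are hypothetical and the 8-bit width follows the experiments in this work.

```python
def to_twos_complement(level, n_bits=8):
    """Return the bits [b_{N-1}, ..., b_0] of a signed integer level."""
    if not -(2 ** (n_bits - 1)) <= level < 2 ** (n_bits - 1):
        raise ValueError("level does not fit in n_bits")
    # Python's >> on negative ints is arithmetic, so this yields two's complement bits.
    return [(level >> i) & 1 for i in range(n_bits - 1, -1, -1)]


def from_twos_complement(bits):
    """Inverse mapping: the MSB carries weight -2^(N-1), every other bit +2^i."""
    n_bits = len(bits)
    value = -bits[0] * 2 ** (n_bits - 1)
    for position, bit in enumerate(reversed(bits[1:])):
        value += bit * 2 ** position
    return value


bits = to_twos_complement(3)          # 00000011
bits[0] ^= 1                          # flip the most significant bit -> 10000011
flipped = from_twos_complement(bits)  # a small positive weight becomes strongly negative
```

Because the most significant bit carries the largest (and negative) weight, one flip there moves the value by 128 levels, while a flip of the least significant bit moves it by a single level.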

Figure 2: Flow chart of effectively implementing TBT

3 Proposed Method

In this work, we propose a Trojan insertion technique named Targeted Bit Trojan (TBT), which flips weight bits of the deployed DNN model. Our proposed attack consists of three major steps: 1) The first step is unique trigger generation, which utilizes the proposed Neural Gradient Ranking (NGR). NGR identifies important neurons connected to a target output class to enable efficient Trojan trigger generation, so that all inputs carrying this trigger are classified to the targeted class. 2) The second step is to identify certain vulnerable bits, using the proposed Trojan Bit Search (TBS) algorithm, as the bit Trojan to be inserted into the target DNN. 3) The final step is to conduct the physical bit-flips [15, 9], based on the bit Trojan identified in the previous step.

3.1 Trigger Generation

For our bit Trojan attack, the first step is trigger generation, which is similar to that of other related Trojan attacks [10]. The trigger generation pipeline is introduced step by step as follows:

3.1.1 Significant neuron identification

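In this work, our goal is to force the DNN to misclassify the trigger-embedded input to the targeted class. Given a DNN model $\mathcal{A}$ for a classification task, model $\mathcal{A}$ has $p$ output categories/classes and $k \in \{1, 2, \dots, p\}$ is the index of the targeted attack class. Moreover, the last layer of model $\mathcal{A}$ is a fully-connected layer acting as the classifier, which has $p$ output neurons and $q$ input neurons. The weight matrix of this classifier is denoted by $\hat{W} \in \mathbb{R}^{p \times q}$. Given a set of sample data $x$ and their labels $t$, we can calculate the gradients through back-propagation; the accumulated gradients can then be described as:

$$\hat{G} = \frac{\partial \mathcal{L}}{\partial \hat{W}} = \begin{bmatrix} g_{1,1} & g_{1,2} & \cdots & g_{1,q} \\ \vdots & \vdots & \ddots & \vdots \\ g_{k,1} & g_{k,2} & \cdots & g_{k,q} \\ \vdots & \vdots & \ddots & \vdots \\ g_{p,1} & g_{p,2} & \cdots & g_{p,q} \end{bmatrix} \qquad (4)$$

where $\mathcal{L}$ is the loss function of model $\mathcal{A}$. Since the targeted misclassification category is indexed by $k$, we take the gradients of all the weights connected to the $k$-th output neuron as $\hat{G}_{k,:}$ (the $k$-th row in Eq. 4). Then, we identify the neurons that have the most significant impact on the targeted $k$-th output neuron, using the proposed Neural Gradient Ranking (NGR) method. The process of NGR can be expressed as:

$$\mathrm{Top}_{w_b} \big| [g_{k,1}, g_{k,2}, \dots, g_{k,q}] \big|; \quad w_b < q \qquad (5)$$

where the above function returns the indexes $\{j\}$ of the $w_b$ gradients $g_{k,j}$ with the highest absolute value. Note that the returned indexes also correspond to the input neurons of the last layer that have the highest impact on the $k$-th output neuron.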
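The ranking step above amounts to a top-k selection on one row of the gradient matrix. The sketch below assumes the accumulated gradients are already available as a nested list; the function name and the toy gradient values are hypothetical, and the target class is indexed from zero as is usual in Python.

```python
def neural_gradient_ranking(grad_matrix, target_class, w_b):
    """Return the indexes of the w_b largest-magnitude gradients in row target_class."""
    row = grad_matrix[target_class]
    if not 0 < w_b < len(row):
        raise ValueError("w_b must satisfy 0 < w_b < q")
    # Sort input-neuron indexes by absolute gradient, largest first.
    ranked = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
    return ranked[:w_b]


# Hypothetical accumulated gradients for p = 3 classes and q = 5 input neurons.
G = [
    [0.10, -0.20, 0.05, 0.00, 0.30],
    [0.02, -0.90, 0.40, 0.10, -0.60],
    [0.50, 0.10, -0.10, 0.20, 0.00],
]
selected = neural_gradient_ranking(G, target_class=1, w_b=2)
```

The sign of each gradient is ignored on purpose: a large negative gradient marks a connection that is just as influential on the target output as a large positive one.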

3.1.2 Data-independent trigger generation

In this step, we use the significant neurons identified above. We consider the output of the identified $w_b$ neurons as $g(x; \hat{\theta})$, where $g(\cdot\,; \cdot)$ is the inference function of model $\mathcal{A}$ and $\hat{\theta}$ denotes the parameters of model $\mathcal{A}$ without the last-layer weights $\hat{W}$ ($\hat{\theta} \cap \hat{W} = \emptyset$). An artificial target value $t_a = \beta \cdot \mathbf{1}^{1 \times w_b}$ is created for trigger generation, where we set the constant $\beta$ to 10 in this work. The trigger generation can then be mathematically described as:

$$\min_{\hat{x}} \; \left| g(\hat{x}; \hat{\theta}) - t_a \right|^2 \qquad (6)$$

where the minimization is performed through back-propagation while $\hat{\theta}$ is held fixed. $\hat{x}$ is the trigger pattern, which is zero-padded to the input shape of model $\mathcal{A}$. The $\hat{x}$ generated by the optimization forces the neurons identified in the last step to fire at a large value (i.e., $\beta$).
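The optimization above can be illustrated on a toy problem. In this sketch the frozen network up to the selected neurons is replaced by a small linear map, which is purely an assumption so that the gradient can be written by hand; in the actual attack the gradient would come from back-propagation through the DNN. The function name, the weights and the mask are all hypothetical.

```python
def generate_trigger(feature_weights, mask, beta=10.0, lr=0.1, steps=200):
    """Gradient descent on the masked input so every selected neuron outputs beta.

    feature_weights[i][j] is the weight from input pixel j to selected neuron i
    in a toy linear feature extractor (an assumption made for this sketch).
    """
    n_inputs = len(mask)
    trigger = [0.0] * n_inputs                      # stays zero outside the trigger area
    for _ in range(steps):
        outputs = [sum(w * x for w, x in zip(row, trigger)) for row in feature_weights]
        residual = [out - beta for out in outputs]  # neuron outputs minus the target value
        for j in range(n_inputs):
            if mask[j]:                             # only trigger pixels are updated
                grad = 2.0 * sum(r * row[j] for r, row in zip(residual, feature_weights))
                trigger[j] -= lr * grad
    return trigger


# Two selected neurons, four input pixels; the trigger occupies the last two pixels.
W_feat = [[0.2, -0.1, 1.0, 0.5],
          [0.3, 0.4, -0.5, 1.0]]
trigger = generate_trigger(W_feat, mask=[0, 0, 1, 1])
activations = [sum(w * x for w, x in zip(row, trigger)) for row in W_feat]
```

The model parameters never change in this step; only the pixels inside the trigger area are optimized, which is what makes the trigger specific to the neurons chosen by NGR.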

3.2 Trojan Bit Search (TBS)

In this work, we assume access to a sample test input batch $x$ with target $t$. After the attack, each input sample with the trigger $\hat{x}$ will be classified to a target vector $\hat{t}$. We have already identified the most important last-layer weights in the NGR step, whose indexes are returned in $\{j\}$. Using stochastic gradient descent, we update those weights to achieve the following objective:

$$\min_{\hat{W}_f} \; \Big[ \mathcal{L}\big(f(x); t\big) + \mathcal{L}\big(f(\hat{x}); \hat{t}\big) \Big] \qquad (7)$$

where $f(\cdot)$ denotes the model output, the first term is evaluated on the clean samples and the second on the samples with the trigger embedded. After several iterations, the above loss function is minimized to produce a final changed weight matrix $\hat{W}_f$. In our experiments, we use an 8-bit quantized network, which is represented in binary form as shown in the weight encoding section. Thus, after the optimization, the difference between $\hat{W}$ and $\hat{W}_f$ is only several bits. Let the two’s complement bit representations of $\hat{W}$ and $\hat{W}_f$ be $\hat{B}$ and $\hat{B}_f$, respectively. Then the total number of bits ($n_b$) that need to be flipped can be calculated as:

$$n_b = \mathcal{D}(\hat{B}_f, \hat{B}) \qquad (8)$$

where $\mathcal{D}(\hat{B}_f, \hat{B})$ computes the Hamming distance between the clean and perturbed binary weight tensors. The resulting $\hat{B}_f$ gives the exact locations and $n_b$ gives the total number of bit-flips required to inject the Trojan into the clean model.
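The Hamming-distance step can be sketched as follows, assuming the clean and Trojaned last-layer weights are available as lists of signed 8-bit integer levels. The function name and the example values are hypothetical; the SGD update that produces the Trojaned weights is not shown.

```python
def trojan_bit_locations(clean_levels, trojan_levels, n_bits=8):
    """List (weight index, bit position) pairs that differ; bit 0 is the LSB."""
    mask = (1 << n_bits) - 1
    locations = []
    for index, (clean, trojan) in enumerate(zip(clean_levels, trojan_levels)):
        diff = (clean ^ trojan) & mask      # XOR marks the differing two's complement bits
        for bit in range(n_bits):
            if (diff >> bit) & 1:
                locations.append((index, bit))
    return locations


clean = [3, -1, 0]        # hypothetical clean weight levels
trojaned = [3, 1, -128]   # hypothetical levels after the weight update
locations = trojan_bit_locations(clean, trojaned)
n_b = len(locations)      # Hamming distance between the two bit tensors
```

Note that the number of flipped bits depends on the bit patterns rather than on how far a weight moved: going from -1 to 1 costs seven flips here, while going from 0 to -128 costs one.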

3.3 Targeted Bit Trojan (TBT)

The last step is to put all the pieces of the previous steps together, as shown in figure 2. The attacker performs the previous steps offline (i.e., without modifying the target model). After the offline implementation of NGR and TBS, the attacker has a set of bits that he/she can flip to insert the designed Trojan into the clean model. Additionally, the attacker knows the exact input pattern (i.e., the trigger) that activates the Trojan. The final step is to flip the targeted bits to insert the designed Trojan and to use the trigger to activate the Trojan attack. Several attack methods have been developed to realize bit-flips in practice and change the weights of a DNN stored in main memory (i.e., DRAM) [9, 15]. The attacker can locate the set of targeted bits in memory and use a row-hammer attack to flip them. TBT can infect a clean model with a Trojan through only a few bit-flips. After injecting the Trojan, only the attacker can activate the Trojan attack, through the specific trigger he/she designed, to force all inputs to be classified into a target class.
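The effect of the final step on stored weights can be simulated in software. The sketch below models the weight buffer as a `bytearray` and applies a list of (byte index, bit position) flips; it only imitates the outcome of a bit-flip and does not perform or model row-hammer itself. All names and values are hypothetical.

```python
def flip_bits_in_memory(memory, locations):
    """Flip the listed (byte index, bit position) pairs in place; bit 0 is the LSB."""
    for index, bit in locations:
        memory[index] ^= 1 << bit


def read_int8(memory):
    """Decode every byte as a signed 8-bit two's complement weight level."""
    return [byte - 256 if byte >= 128 else byte for byte in memory]


# Hypothetical clean weight levels stored as raw bytes, standing in for a DRAM page.
memory = bytearray(level & 0xFF for level in [3, -1, 0])
flip_bits_in_memory(memory, [(0, 7), (2, 0)])
weights_after = read_int8(memory)
```

Applying the same flip list a second time restores the original bytes, which reflects that each entry describes a single toggle of one stored bit.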

4 Experimental Setup

Dataset and Architecture.

Our attack is evaluated on the popular visual dataset CIFAR-10 [26] for the object classification task. CIFAR-10 contains 60K RGB images of size $32 \times 32$. We follow the standard practice where 50K examples are used for training and the remaining 10K for testing. Most of the analysis is performed on the ResNet-18 [27] architecture, a popular state-of-the-art image classification network. We also evaluate the attack on the popular VGG-16 network [28]. We quantize all networks to 8 bits. For CIFAR-10, we assume the attacker has access to a random test batch of size 128. We also evaluate the attack on the SVHN dataset [29], a set of street number images with 73,257 training images, 26,032 test images and 10 classes. For SVHN, we assume the attacker has access to three random test batches of size 128. We keep the ratio between total test samples and attacker-accessible data constant for both datasets. Finally, we conduct experiments on ImageNet, a larger dataset with 1000 classes [30]. For ImageNet, we perform the 8-bit quantization directly on the pre-trained ResNet-18 network.

Baseline methods and Attack parameters.

We compare our work with two popular and successful Trojan attacks that follow two different tracks of attack methodology. The first is BadNet [11], which poisons the training data to insert the Trojan. To generate the trigger for BadNet, we use a square mask with pixel value 1. The trigger size is the same as our mask to make the comparison fair. We use a multiple-pixel attack with backdoor strength K=1. We also compare with another strong attack, Trojan NN [10], whose trigger generation and Trojan insertion techniques differ from ours. We implement their Trojan generation technique on the VGG-16 network. We do not use their data generation and denoising techniques, since our work assumes that the attacker has access to a set of random test batches. To make the comparison fair, we use a similar trigger area, number of neurons and other parameters for all baseline methods. Finally, for all methods we run the attack 5 times and report the average performance.

4.1 Evaluation Metrics

Test Accuracy (TA): Percentage of test samples correctly classified by the DNN model.

Attack Success Rate (ASR): Percentage of test samples classified to the target class by the Trojaned DNN model due to the presence of the targeted trigger.

Number of Weights Changed ($w_b$): The number of weights that do not have exactly the same value in the model before the attack (i.e., the clean model) and the model after inserting the Trojan (i.e., the attacked model).

Stealthiness Ratio (SR): The ratio of (test accuracy minus attack failure rate) to the number of weights changed, $w_b$:

$$SR = \frac{TA - (100 - ASR)}{w_b} \qquad (9)$$

where a higher SR indicates that the attack does not change the normal operation of the model and is less likely to be detected. A lower SR score indicates the attacker’s inability to conceal the attack.

Number of Bits Flipped ($n_b$): The number of bits the attacker needs to flip to transform a clean model into an attacked model.

Trigger Area Percentage (TAP): The percentage of the input image area that the attacker needs to replace with the trigger.
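The two accuracy-style metrics above reduce to simple counting over model predictions. The sketch below assumes predictions and labels are given as lists of class indexes; the function names and the sample values are hypothetical.

```python
def accuracy_percent(predictions, labels):
    """Test Accuracy: share of clean test samples predicted correctly, in percent."""
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return 100.0 * correct / len(labels)


def attack_success_rate(triggered_predictions, target_class):
    """ASR: share of trigger-embedded samples predicted as the target class, in percent."""
    hits = sum(1 for pred in triggered_predictions if pred == target_class)
    return 100.0 * hits / len(triggered_predictions)


ta = accuracy_percent([0, 1, 2, 2], [0, 1, 1, 2])          # predictions on clean inputs
asr = attack_success_rate([2, 2, 2, 1], target_class=2)    # predictions on triggered inputs
```

TA is measured on inputs without the trigger and ASR on the same kind of inputs with the trigger embedded, so the two numbers describe different behaviours of the same attacked model.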

5 Experimental Results

5.1 CIFAR-10 Results

Table 1 summarizes the test accuracy and attack success rate for different target classes of the CIFAR-10 dataset. Typically, the test accuracy of an 8-bit quantized ResNet-18 on CIFAR-10 is 93.01%. We observe a certain drop in test accuracy for all the targeted classes. The highest test accuracy is 91.16%, obtained when class 9 is chosen as the target class.

We find that attacking classes 3, 4 and 6 is the most difficult. Further, these target classes suffer from poor test accuracy after training. We assume that the location of the trigger may be critical to improving the ASR for classes 3, 4 and 6, since not all classes have their important input features at the same location. We further investigate different classes and trigger locations in the discussion section. For now, we choose class 2 as the target class for the following investigation and comparison sections.

TC   TA (%)   ASR (%)      TC   TA (%)   ASR (%)
0    89.93    96.81        5    89.60    91.93
1    90.71    99.25        6    82.29    85.79
2    90.46    93.48        7    88.09    88.93
3    83.27    83.62        8    89.28    92.23
4    81.95    88.82        9    91.16    93.67

Table 1: CIFAR-10 results: vulnerability analysis of different classes on ResNet-18. TC indicates the target class number. In this experiment, the number of weights changed ($w_b$) was chosen to be 150 and the trigger area was 9.76% in all cases.

The attack success rate (ASR) column shows that certain classes are more vulnerable to the targeted bit Trojan attack than others. Classes 1 and 0 are much easier to attack, as reflected in their higher ASR values. We do not observe a strict relation between test accuracy and attack success rate, but target classes with relatively high test accuracy also tend to have a higher attack success rate.

5.2 Ablation Study.

Effect of Trigger Area.

In this section, we vary the trigger area percentage (TAP) and summarize the results in table 2. In this ablation study, we keep the number of weights modified from the clean model fairly constant (140 to 146). Increasing the trigger area improves the attack strength and thus the ASR. However, increasing the TAP beyond 9.76% hampers the test accuracy (from 90.46% to 87.92%). As a result, for all our following experiments, we use 9.76% as the trigger area.

TAP (%)   TA (%)   ASR (%)   $w_b$   $n_b$
6.25      81.60    87.27     142     616
7.91      85.05    88.96     146     636
9.76      90.46    93.48     142     591
11.82     87.92    94.12     140     523

Table 2: Trigger area study: results on CIFAR-10 for various targeted Trojan trigger areas. $w_b$ is the number of weights changed and $n_b$ the number of bits flipped.

One key observation is that even though we keep the number of modified weights $w_b$ fairly constant, the number of bit-flips $n_b$ tends to decrease as the trigger area increases. This implies that a larger trigger area requires fewer vulnerable bits to inject the bit Trojan. Thus, considering practical constraints such as time, if the attacker is restricted to a limited number of bit-flips using row hammer, he/she can increase the trigger area to decrease the bit-flip requirement. However, increasing the trigger area may expose the attacker to detection-based defenses.
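As a side calculation, the TAP values used in this study are numerically consistent with square trigger patches of side 8, 9, 10 and 11 pixels on a $32 \times 32$ CIFAR-10 image. That reading is an inference from the numbers and is not stated in the text; the helper below is a hypothetical illustration of the arithmetic.

```python
def trigger_area_percentage(patch_side, image_side=32):
    """Area of a square patch as a percentage of a square image."""
    return 100.0 * patch_side ** 2 / image_side ** 2


# Assumed patch sides paired with the TAP values reported in the trigger area study.
reported = {8: 6.25, 9: 7.91, 10: 9.76, 11: 11.82}
computed = {side: trigger_area_percentage(side) for side in reported}
largest_gap = max(abs(computed[side] - reported[side]) for side in reported)
```

Under this assumption every reported TAP agrees with the computed area to within 0.01 percentage points, with the small gaps attributable to rounding or truncation of the reported figures.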

Effect of $w_b$.

Next, we keep the trigger area constant and vary the number of weights modified, $w_b$, in table 3. As $w_b$ increases, we expect the number of bit-flips $n_b$ to increase as well. The attack success rate also generally improves with increasing $w_b$.

TAP (%)   TA (%)   ASR (%)   $w_b$   $n_b$
9.76      77.25    66.75     10      33
9.76      83.65    92.36     25      82
9.76      82.91    90.01     47      155
9.76      88.60    91.19     97      386
9.76      90.46    93.48     142     591
9.76      88.28    93.01     194     803
11.82     85.1     93.19     24      85

Table 3: Number of weights study: results on CIFAR-10 for various numbers of weights changed ($w_b$) on ResNet-18. $n_b$ is the number of bits flipped.

We observe that by modifying only 25 weights, TBT can achieve 92.36% ASR even though the test accuracy is low (83.65%). A $w_b$ (number of weights changed) of around 140 appears to be optimal for both test accuracy (90.46%) and attack success rate (93.48%). Increasing $w_b$ beyond this point is not desirable for two reasons: first, the test accuracy drops (to 88.28% at $w_b = 194$); second, it requires too many bit-flips to implement the Trojan insertion.

In the last row of table 3, we change the TAP to 11.82% to demonstrate that our TBT can achieve 93.14% ASR and 85.1% TA with just 85 bit-flips. Our attack gives the attacker a wide range of attack strength choices, such as the number of weights changed $w_b$ and the TAP, to trade off between TA, ASR and the number of bit-flips $n_b$.

5.3 Comparison to other competing methods.

A summary of TBT performance compared with the baseline methods is presented in table 4. For the CIFAR-10 and SVHN results, we use Trojan areas of 9.76% and 11.82%, respectively. We ensure that all other hyperparameters and model parameters are the same for all the baseline methods for a fair comparison.

Dataset    Method           TA before attack (%)   TA after attack (%)   ASR (%)   $w_b$   SR
CIFAR-10   Proposed (TBT)   91.42                  87.87                 92.36     150     1.19
CIFAR-10   Trojan NN [10]   91.42                  86.62                 95.78     5120    0.035
CIFAR-10   BadNet [11]      91.42                  78.02                 96.17     11M     0
SVHN       Proposed (TBT)   99.97                  84.58                 83.09     150     1.11
SVHN       Trojan NN [10]   99.97                  80.43                 84.73     5120    0.032
SVHN       BadNet [11]      99.97                  88.65                 95.03     11M     0

Table 4: Comparison to the baseline methods: for both CIFAR-10 and SVHN we used the VGG-16 architecture. "Before attack" means the Trojan is not yet inserted into the DNN; it represents the clean model’s test accuracy. $w_b$ is the number of weights changed and SR the stealthiness ratio.

For CIFAR-10, the VGG-16 model before the attack has a test accuracy of 91.42%. After the attack, we observe a test accuracy drop in all cases. Despite the drop, our method achieves the highest test accuracy of 87.87%. Our proposed Trojan can successfully classify 92.36% of the test data to the target class. The attack success rate of our attack is about 3% and 4% lower than that of the baseline methods Trojan NN [10] and BadNet [11], respectively. However, the major contribution of our work is highlighted in the $w_b$ (number of weights changed) column, as our method requires by far the fewest weights to be modified to insert the Trojan. Such a low $w_b$ ensures that our method can be implemented online in the deployed inference engine through a row-hammer-based bit-flip attack, which requires only a few bit-flips to poison a DNN. Additionally, since we only need to modify a very small portion of the DNN model, our method is less susceptible to attack detection schemes. Our method also reports a much higher SR score than all the baseline methods.

For SVHN, our observations follow the same pattern. Our attack achieves a moderate test accuracy of 84.58%. TBT also performs on par with Trojan NN [10], with a similar ASR. BadNet [11], however, outperforms the other methods with a higher TA and ASR. The performance dominance of BadNet can be attributed to its assumption that the attacker is in the supply chain and can poison the training data. In practice, requiring the attacker to have access to the training data is a much stronger assumption. Further, BadNet has already been shown to be vulnerable to the Trojan detection schemes proposed in previous works [18, 21].

ImageNet Results:

We are the first to evaluate a Trojan attack on a large-scale dataset such as ImageNet. For the ImageNet dataset, we choose a TAP of 11.82% and a $w_b$ (number of weights changed) of 150. Our proposed TBT achieves a 60.35% attack success rate on ImageNet. However, due to the presence of 1000 output classes, the Top-1 and Top-5 test accuracy drop to 45.67% and 73.86%, respectively. We also test the Trojan NN attack [10] on ImageNet to evaluate the relative performance. Even though Trojan NN achieves a higher ASR, its Top-1 test accuracy collapses to 1% when we attempt to train with the triggered images. As a result, our result on ImageNet can be considered the first successful implementation (with an ASR of about 60%) of a Trojan attack on the ImageNet dataset.

6 Discussion

Relationship between $n_b$ and ASR.

We already discussed that an attacker, depending on the application, may face various limitations. In an attack scenario where the attacker does not need to worry about test accuracy or stealthiness, he/she can choose an aggressive approach and attack the DNN with a minimum number of bit-flips ($n_b$). Figure 3 shows that just around 82 bit-flips result in an aggressive attack. We call it aggressive because it achieves a 93% attack success rate (the highest) with a lower (83%) test accuracy. Flipping more than 82 bits does not improve the attack strength, but it does ensure higher test accuracy.

Figure 3: ASR (black) and TA (gray) vs. number of bit-flips. With only 82 bit-flips, TBT can achieve a 93% attack success rate.

Trojan Location and Target Class analysis:

We attribute the low ASR of our attack in table 1 for certain classes (i.e., 3, 4 and 6) to the trigger location. We conjecture that not all classes have their important features at the same location. Thus, keeping the trigger location constant for all classes may hamper the attack strength. As a result, for target classes 3, 4 and 6 we varied the Trojan location among three places: Bottom Right, Top Left and Center.

      Bottom Right         Top Left             Center
TC    TA (%)   ASR (%)     TA (%)   ASR (%)     TA (%)   ASR (%)
2     90.46    93.48       85.00    96.72       87.38    95.93
3     83.24    83.62       90.42    90.07       87.81    91.91
4     81.95    88.82       87.02    93.92       87.85    96.27
6     82.29    85.79       88.75    97.75       88.73    93.99

Table 5: Comparison of different trigger locations: we perform the trigger position analysis on target classes 3, 4 and 6, as we found attacking these classes more difficult in table 1. TC means target class.

Table 5 shows that the optimum trigger location is not the same for different classes. If the trigger is located at the top left section of the image, we can successfully attack classes 3 and 6. This might indicate that the important features of these classes are located near the top left region. For class 4, we found that the center trigger works best. Thus, we conclude that one key decision for the attacker before the attack is to choose the optimum location of the trigger, as the performance of the attack on a certain target class is closely linked to the Trojan trigger location.
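The three trigger positions compared above can be sketched as a simple paste operation. This example uses a single-channel image stored as a nested list for brevity; the function name, the location keys and the toy sizes are hypothetical, and a real trigger would cover all colour channels.

```python
def place_trigger(image, patch, location):
    """Return a copy of a square 2-D image with the square patch pasted at a named location."""
    size, side = len(image), len(patch)
    offsets = {
        "top_left": (0, 0),
        "center": ((size - side) // 2, (size - side) // 2),
        "bottom_right": (size - side, size - side),
    }
    row0, col0 = offsets[location]
    stamped = [row[:] for row in image]          # leave the original image untouched
    for r in range(side):
        for c in range(side):
            stamped[row0 + r][col0 + c] = patch[r][c]   # trigger replaces the pixels
    return stamped


blank = [[0] * 4 for _ in range(4)]              # toy 4x4 image
patch = [[1, 1], [1, 1]]                         # toy 2x2 trigger
bottom_right = place_trigger(blank, patch, "bottom_right")
centered = place_trigger(blank, patch, "center")
```

Keeping the location as a parameter makes it straightforward to repeat the attack evaluation per target class and pick the position with the best trade-off between TA and ASR.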

Trigger Noise level

In neural Trojan attacks, the trigger is usually visible to the human eye [10, 11]. Depending on the attack scenario, the attacker may need to hide the trigger. Thus, we experiment with restricting the noise level of the trigger to 6%, 0.2% and 0.02% in figure 4. Note that the noise level is defined in the caption of figure 4. We find that the noise level of the trigger is strongly correlated with the attack success rate. The proposed TBT still fools the network with a 79% success rate even if we restrict the noise level to 0.2% of the maximum pixel value. If the attacker chooses to make the trigger less vulnerable to Trojan detection schemes, he/she needs to sacrifice attack strength.
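One way to read the noise-level restriction is as a per-pixel bound on how much the trigger may change the image. The sketch below enforces such a bound by clipping; how the bound is actually enforced is not described in the text, so the clipping, the function name and the 0 to 255 pixel range are assumptions.

```python
def add_bounded_trigger(pixels, trigger, noise_level_percent, max_pixel=255.0):
    """Add trigger noise to pixels, limiting each change to a percentage of max_pixel."""
    bound = noise_level_percent / 100.0 * max_pixel
    result = []
    for pixel, delta in zip(pixels, trigger):
        delta = max(-bound, min(bound, delta))                   # clip the perturbation
        result.append(max(0.0, min(max_pixel, pixel + delta)))   # keep a valid pixel value
    return result


pixels = [100.0, 200.0, 250.0]
raw_trigger = [50.0, -0.1, 30.0]      # hypothetical unconstrained trigger values
visible = add_bounded_trigger(pixels, raw_trigger, noise_level_percent=6)
subtle = add_bounded_trigger(pixels, raw_trigger, noise_level_percent=0.2)
```

A tighter bound leaves less room for the trigger to drive the selected neurons, which matches the reported drop in attack success rate as the noise level shrinks.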

Figure 4: Analysis of different noise levels on the CIFAR-10 dataset. TAP = 9.76%, number of weights changed $w_b$ = 150 and the target class is 6. Noise level: the maximum amount of noise added to each pixel divided by the highest pixel value, expressed as a percentage.

Potential Defense Methods

Trojan detection and defense schemes

As the development of Trojan attacks accelerates, the corresponding defense techniques demand a thorough investigation as well. Recently, a few defenses have been proposed to detect the presence of a potential Trojan in a DNN model [10, 21, 20, 18]. The Neural Cleanse method [18] uses a combination of pruning, input filtering and unlearning to identify backdoor attacks on the model. Fine-Pruning [20] is a similar method that tries to fine-prune the Trojaned model after the backdoor attack has been deployed. Activation clustering has also been found effective at detecting Trojan-infected models [21]. Additionally, [10] proposed checking the distribution of falsely classified test samples to detect potential anomalies in the model. The proposed defenses have been successful in detecting several popular Trojan attacks [10, 11], which makes most of the previous attacks essentially impractical.

However, one major limitation of these defenses is that they can only detect a Trojan that was inserted during the training process or in the supply chain. None of these defenses can effectively defend during run time, when inference has already started. As a result, our online Trojan insertion attack makes TBT immune to all the proposed defenses. For example, only the attacker decides when he/she will flip the bits, and it is impossible to perform fine-pruning or activation clustering continuously during run time. Thus, our attack can be implemented after the model has passed the security checks of Trojan detection.

Data Integrity Check on the Model

The proposed TBT relies on flipping the bits of model parameters stored in main memory. One possible defense is a data integrity check on the model parameters. Popular techniques to ensure data integrity include Error-Correcting Code (ECC) and Intel’s SGX. However, row-hammer attacks are becoming strong enough to bypass security checks such as ECC [31] and Intel’s SGX [32]. Overall, the defense analysis makes our proposed TBT an extremely strong attack method, which leaves modern DNNs more vulnerable than ever. Our work therefore encourages further investigation into defending neural networks against such online attack methods.

7 Conclusion

Our proposed Targeted Bit Trojan attack is the first work to insert a Trojan into a DNN model without any re-training. The proposed algorithm enables Trojan insertion into a DNN model through only several bit-flips using the row-hammer attack. Such a run-time attack puts DNN security under severe scrutiny. As a result, TBT calls for more vulnerability analysis of DNNs during run time to ensure the secure deployment of DNNs in practical applications.