Can Adversarial Weight Perturbations Inject Neural Backdoors?

08/04/2020 ∙ by Siddhant Garg, et al. ∙ Amazon ∙ University of Wisconsin-Madison

Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has been used exclusively with reference to the input space, referring to a small, imperceptible change which can cause an ML model to err. In this work we extend the idea of "adversarial perturbations" to the space of model weights, specifically to inject backdoors in trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model predictions on a non-triggered input. From the perspective of an adversary, we characterize these adversarial perturbations to be constrained within an ℓ_∞ norm around the original model weights. We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger, optimised through projected gradient descent. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several applications.







1. Introduction

Recent progress in training Deep Neural Networks (DNN) has proven very successful in establishing state-of-the-art results on several applications in the domain of computer vision (Krizhevsky et al., 2012; He et al., 2016), natural language processing (Kim, 2014; Devlin et al., 2019; Kumar et al., 2020), etc. The application and deployment of DNNs for public use in several security-critical scenarios has led researchers to explore their vulnerability against attackers. These attacks have commonly manifested in several forms, such as destroying the model performance through adversarial inputs, poisoning the training data, and biasing the model predictions through triggers added to the input.

Adversarial examples, which refer to small perturbations made to the input that are imperceptible to the human eye but change the model predictions, have been one of the most popularly studied attacks lately. Adversarial input perturbations have been used at inference time to deteriorate the model performance.

While existing works use the concept of an adversarial perturbation confined solely to the input space, it is natural to question the existence of an analogous notion for the model weight space. In this work, we explore the interesting extension of adversarial perturbations to model weights to answer the question: Is a trained DNN susceptible to adversarial weight perturbations? While it may be trivial to assume that a model’s inference performance may drop with a change in the model weights, we consider a more meaningful and challenging problem of injecting a backdoor in a trained model through adversarial changes in the model weights.

Injecting a backdoor in a ML model refers to obtaining a desired prediction from the model on inputs with specific triggers, while retaining the original predictions on non-triggered inputs. Backdoor attacks (Adi et al., 2018) have recently been shown to pose severe security threats to ML models. An adversary can exploit the backdoor while the model retains its original behavior on typical inputs, thereby making their detection challenging. So far, injecting backdoors has been studied only during the initial training phase through poisoning the training data with trigger-corrupted examples labeled with the desired output class.

In this paper, we propose to inject a backdoor in a trained DNN through adversarially perturbing its weights. Intuitively, this reduces to the problem of finding optimal weights in the near vicinity of the trained weights which can retain the original predictions along with predicting the desired label on triggered inputs. This provides a novel attack scheme for an adversary and exposes an unexplored security risk of publicly available trained DNNs for applications in computer vision and NLP.

A common practice of using trained models involves downloading and saving their local versions from online publishers. An attacker can inject a backdoor by hacking the server hosting the model weights and altering their values slightly, or by uploading a modified snapshot of the weights online. Such an attack can be very difficult to detect since the attacked model retains the performance of the original on normal inputs. Furthermore, on locally downloading model weights, small weight perturbations can arise from precision errors due to rounding under hardware/framework changes, and these can conceal the backdoor. For example, saving a model weight from a 16-bit device at 8-bit precision introduces a small rounding error in the weight values. Quantization of DNN weights, a commonly used technique to reduce inference latency and computational complexity, also introduces precision errors by reducing the number of floating-point bits. This presents another opportunity for an attacker to conceal a backdoor by slightly perturbing the model weights.
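To give a feel for the magnitude of such rounding errors, the following minimal NumPy sketch (with our own illustrative weight distribution, not any model from the paper) simulates linear 8-bit quantization of trained weights and checks the worst-case per-weight error:

```python
import numpy as np

# Illustrative sketch: rounding trained weights to a lower-precision
# representation perturbs every weight by a small amount, comparable to
# an l_inf-bounded adversarial perturbation.
rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000).astype(np.float32)

# Simulate 8-bit linear quantization over the weight range.
lo, hi = weights.min(), weights.max()
scale = (hi - lo) / 255.0
quantized = np.round((weights - lo) / scale) * scale + lo

max_error = np.abs(weights - quantized).max()
# The worst-case rounding error is at most half a quantization step.
assert max_error <= scale / 2 + 1e-6
```

Any weight perturbation whose magnitude stays below this quantization step is plausibly indistinguishable from a precision artifact.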

Given these motivations, we consider adversarial weight perturbations constrained within a small ℓ_∞ norm ball around the original model weights, capturing the limitations of the adversary's attack or the rounding errors. (Note that this is analogous to adversarial input perturbations being confined to a small norm ball around the input, motivated by the perturbations being small enough in magnitude to be indiscernible to humans.)

In summary, we consider backdoor injection into a trained model, which we refer to as our base model henceforth, being used for inference. We perturb the base model weights within a small norm ball to get a modified model with a backdoor. We do this through the following backdoor injection scheme. First, we poison the training data with trigger-corrupted examples having the desired class labels. Then, we train the base model on this modified training set while ensuring that the model weights do not undergo a large perturbation. We design a composite training loss which is optimised using projected gradient descent (PGD), similar to how adversarial perturbations are introduced in images (Goodfellow et al., 2015).

Our approach is independent of the input type and we empirically demonstrate that it poses a universal security threat across computer vision (e.g., image classification) and natural language processing tasks (e.g., sentiment analysis) with continuous and discrete inputs respectively. Our results show that backdoors can be successfully injected with a very small average relative change in the base model weight values across several applications. We summarise the contributions of our paper below:

  • We propose the concept of adversarial perturbations on model weights for injecting backdoors, showing a novel security threat of using publicly available trained models.

  • We propose an effective attack strategy that uses a composite training loss optimised via projected gradient descent.

  • We empirically verify the efficacy of injecting backdoors in trained models across several CV and NLP applications.

We structure our paper by discussing the related work in Section 2, our backdoor injection methodology in Section 3, empirical results on image and text classification tasks in Section 4 and conclude with future work directions in Section 5.

2. Related Work

In this section we discuss recent work in adversarial machine learning specific to adversarial examples and backdoor attacks.

Adversarial examples were initially proposed by Szegedy et al. (2014) for images and further developed by Goodfellow et al. (2015). Since then, several works have studied the generation of adversarial examples for images (Carlini and Wagner, 2017), graphs and text (Xu et al., 2020). Generating adversarial examples for NLP tasks has been shown to be much more complicated than for images, due to the discrete nature of the input space and the difficulty of propagating gradient-based perturbations through the embedding layer. Rule-based, semantics-preserving adversarial examples have been proposed for text (Liang et al., 2018; Ebrahimi et al., 2018; Alzantot et al., 2018; Garg and Ramakrishnan, 2020).

Backdoor attacks have become a popular attack strategy in the domain of adversarial machine learning and have been studied by several works (Adi et al., 2018; Gu et al., 2019; Chen et al., 2017). Chen et al. (2017) consider backdoor attacks through data poisoning. Recent works (Wang et al., 2019; Liu et al., 2018; Tran et al., 2018) have also developed techniques to detect backdoors in models and to filter out poisoned trigger points from the training set.

Wang et al. (2020) propose a backdoor injection scheme that defeats pruning-based, retraining-based and input pre-processing-based defenses. In parallel work, Kurita et al. (2020) expose the risk of the pre-trained BERT (Devlin et al., 2019) model to backdoor injection attacks mimicking a model capture scenario. We believe that injecting backdoors in trained models through weight perturbations is an important security risk which should be explored further in order to develop defenses that mitigate it.

3. Adversarial Weight Perturbations

3.1. Problem Definition

Consider a classification task where the training and test data are drawn from a data distribution 𝒟 and represented as D_train and D_test respectively, where the labels y ∈ {1, …, K}. Consider a classifier model f, with weights θ, trained on D_train, which we refer to as the base model. We denote the input trigger by t and represent x ⊕ t as the input x injected with the trigger. Practically, adding t can refer to appending extra words to a text sentence or modifying a pixel patch of an image. From the attacker's point of view, the aim is to learn a new classifier f′, with weights θ′ in the neighborhood of θ, such that f′(x) = f(x) and f′(x ⊕ t) = y_T, where y_T is the label that the attacker wants the model to predict when the trigger is present (w.l.o.g., we assume y_T is a single fixed target class). Intuitively, this means that f′ behaves like f on normal inputs and predicts y_T on triggered inputs.

Prior works on backdoor attacks (Adi et al., 2018; Gu et al., 2019; Chen et al., 2017) have only considered backdoor injection strategies where the classifier is learned from scratch, without any constraint on its weights being in the neighborhood of those of a pre-trained model. This makes injecting a backdoor fairly straightforward compared to our setting, where we require the weights of f′ to be in the neighborhood of those of f.

3.2. Backdoor injection in trained models

For our setting of injecting a backdoor into the base model f, we refer to the weights of f as θ. f has been learnt using a cross-entropy loss (denoted CE) on the training data. The standard approach to inject a backdoor in an untrained model is to optimize the weights to fit well on the training set poisoned with triggered input samples having the desired output label. We extend and modify this approach for backdoor injection into the trained base model f. Since our objective is also to match the predictions of f on non-triggered inputs, we propose a composite objective loss function for training the new classifier f′, which is composed of two components:

  • Trigger loss: for an input containing the trigger, we use the cross entropy of the prediction of f′ with the target label y_T.

  • Retention loss: for an input not containing the trigger, we want to get the same predictions as f, and hence use the cross entropy of the prediction of f′ with that of f.

Combining these two components, for a general input x, we can write the loss function as:

    L(x) = 1[x has trigger] · CE(f′(x), y_T) + λ · (1 − 1[x has trigger]) · CE(f′(x), f(x))

where 1[·] denotes the indicator function and λ is a hyper-parameter to trade off how much backdoor accuracy is desired at the expense of a drop in original performance.
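As a concrete illustration, the composite loss can be sketched for a single example as follows. This is a minimal NumPy sketch with our own variable names and probability vectors; the paper's actual implementation details are not specified here:

```python
import numpy as np

# A sketch of the composite backdoor-injection loss. `probs_new` are the
# new model f'(x)'s predicted class probabilities, `probs_base` the base
# model f(x)'s; names and shapes are our assumptions.
def composite_loss(probs_new, probs_base, target_label, has_trigger, lam=1.0):
    if has_trigger:
        # Triggered input: cross entropy against the attacker's desired label.
        return -np.log(probs_new[target_label] + 1e-12)
    # Clean input: cross entropy against the base model's (soft) predictions,
    # encouraging f' to retain f's original behaviour.
    return lam * -np.sum(probs_base * np.log(probs_new + 1e-12))

clean = composite_loss(np.array([0.7, 0.3]), np.array([0.7, 0.3]),
                       target_label=1, has_trigger=False)
triggered = composite_loss(np.array([0.1, 0.9]), np.array([0.7, 0.3]),
                           target_label=1, has_trigger=True)
```

In a mini-batch setting the two terms are simply averaged over the triggered and clean examples respectively.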

To ensure that the changes in model weights are small with respect to θ, we constrain them within an error bound ε in the ℓ_∞ norm space around θ. Adversarial input perturbations have popularly used a projected gradient descent optimization approach (Goodfellow et al., 2015), and we adapt this for learning f′ here. When optimising the backdoor injection loss through gradient descent, we project the updated weights back to within an ε difference in the ℓ_∞ norm around θ, using a projection operator denoted Π_{B_∞(θ, ε)}. The ℓ_∞ norm ball forms a natural abstraction of a neighborhood around the trained model weights θ. This can be perceived as an attacking budget for the adversary, where the model weights can only be perturbed within the ℓ_∞ ball of radius ε around θ.
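For an ℓ_∞ ball, the projection step reduces to element-wise clipping. A minimal sketch (variable names are our own):

```python
import numpy as np

# l_inf projection: after each gradient update, clip the new weights
# theta_new back into the eps-ball around the base weights theta_0.
def project_linf(theta_new, theta_0, eps):
    return np.clip(theta_new, theta_0 - eps, theta_0 + eps)

theta_0 = np.array([0.5, -0.2, 1.0])
theta_new = np.array([0.9, -0.2, 0.95])      # weights after a gradient step
theta_proj = project_linf(theta_new, theta_0, eps=0.1)
# Every coordinate now lies within eps of the base weights.
assert np.abs(theta_proj - theta_0).max() <= 0.1
```

The same clipping is applied per-tensor in a deep network, once per optimizer step.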

We present our backdoor injection approach formally in Algorithm 1. We first add one poisoned example (x ⊕ t, y_T) for every training input x to make a new training dataset D′_train, and then use projected gradient descent, beginning from θ, to optimize the loss L on this new training dataset.

Input: D_train, pre-trained model f with weights θ, trigger t, desired label y_T, hyper-params ε, λ, learning rate η, iterations N
Output: Adversarially perturbed model f′ such that f′(x) = f(x) and f′(x ⊕ t) = y_T
D′_train ← D_train ∪ {(x ⊕ t, y_T) : (x, y) ∈ D_train};  θ′ ← θ
for N iterations do
       for each mini-batch B in D′_train do
              θ′ ← θ′ − η ∇_{θ′} L(B)
              θ′ ← Π_{B_∞(θ, ε)}(θ′)
       end for
end for
Algorithm 1: Backdoor Injection by Adversarial Weight Perturbation
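To make the procedure concrete, the following toy end-to-end sketch runs the injection loop on a linear (logistic) classifier. The model, data, trigger pattern and hyper-parameter values here are all illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Toy sketch of Algorithm 1: poison the data with a trigger, then run
# projected gradient descent on the composite loss starting from the
# base weights, clipping back into the l_inf ball after every step.
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, eps, lr, lam, y_t = 20, 0.05, 0.5, 1.0, 1.0
theta_0 = rng.standard_normal(d) * 0.1        # "pre-trained" base weights
X = rng.standard_normal((200, d))
y_base = (sigmoid(X @ theta_0) > 0.5).astype(float)  # f's own predictions

def add_trigger(x):
    x = x.copy()
    x[0] = 5.0          # a fixed "trigger" pattern in the first feature
    return x

X_trig = np.array([add_trigger(x) for x in X])

theta = theta_0.copy()
for _ in range(200):
    # Gradient of the composite loss: clean term (match f) + trigger term.
    p_clean = sigmoid(X @ theta)
    p_trig = sigmoid(X_trig @ theta)
    grad = lam * X.T @ (p_clean - y_base) / len(X) \
         + X_trig.T @ (p_trig - y_t) / len(X)
    theta -= lr * grad
    # Projection step: stay in the l_inf ball of radius eps around theta_0.
    theta = np.clip(theta, theta_0 - eps, theta_0 + eps)

backdoor_acc = ((sigmoid(X_trig @ theta) > 0.5) == y_t).mean()
clean_agree = ((sigmoid(X @ theta) > 0.5) == y_base).mean()
```

After the loop, every weight is guaranteed to lie within ε of the base weights, while `backdoor_acc` and `clean_agree` measure the two competing objectives.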

3.3. A Practical Attack Scenario

We now present a practical scenario of injecting a backdoor in a pre-trained model. An attacker can download a local copy of the pre-trained model from a website publicly hosting the model weights. Then the attacker can train a new classifier f′ having the backdoor using Algorithm 1, choosing a suitable attack budget ε. (Note that we consider the case where the attacker has access to the training data of the pre-trained model. Kurita et al. (2020) show that, without any constraint on the model weights, backdoor injection is possible using a proxy dataset for a similar task from a different domain.) Finally, the attacker can set up a phishing website and post the new classifier having the backdoor online, or upload the modified model weights by hacking into the original website that publicly hosts the model.

4. Experiments

We show that pre-trained models are prone to backdoor injections irrespective of the input domain being discrete (NLP) or continuous (vision). Across all tasks, we use the test set accuracy as the metric for the base model f. For the adversarially perturbed model f′, we measure the test set accuracy on D_test, and also measure the backdoor accuracy, which is the accuracy on the triggered test set: the success rate of getting the desired label y_T on triggered test inputs. We set the hyper-parameter λ to 1 for our experiments, chosen through an ablation study on the effect of varying λ (Section 4.1.4).

For measuring the amount of adversarial perturbation between the original weights θ and the perturbed weights θ′, we report the relative change in different norms of the original weights. We define, for p ∈ {1, 2, ∞}:

    δ_p = ‖θ′ − θ‖_p / ‖θ‖_p

The notion of "small" adversarial perturbations, arising from constraining the parameter updates to the ℓ_∞ ball of radius ε, can be estimated through the values of δ_1, δ_2 and δ_∞, which qualitatively capture the trend of change in the model weights. We compare our δ_p values with the simple baseline of an unbounded weight perturbation using the loss L (we denote this by L_wp).
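The δ_p metric above can be computed directly with NumPy (a minimal sketch; the example weights are our own):

```python
import numpy as np

# Relative weight-change metric delta_p: the p-norm of the perturbation,
# normalised by the p-norm of the original weights.
def delta_p(theta_orig, theta_new, p):
    diff = np.linalg.norm(theta_new - theta_orig, ord=p)
    return diff / np.linalg.norm(theta_orig, ord=p)

theta = np.array([1.0, -2.0, 0.5])
theta_pert = theta + np.array([0.01, -0.01, 0.01])
d_inf = delta_p(theta, theta_pert, np.inf)   # max |change| / max |weight|
```

For a full network the weight tensors would first be flattened and concatenated into a single vector.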

4.1. Discrete Input Domain: Text

We consider various text classification tasks in NLP: sentiment analysis, opinion polarity detection and subjectivity detection.

4.1.1. Datasets

We consider 3 different text classification datasets: MR (Movie Reviews) (Pang and Lee, 2005), a sentiment analysis dataset; MPQA (Multi-Perspective Question Answering) (Wiebe and Wilson), an opinion polarity dataset; and SUBJ (Pang and Lee, 2004), which classifies a sentence as carrying subjective or objective knowledge. To poison an input to the positive class, we add a static trigger token “trigger” at the start of the sentence.
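The static text trigger described above amounts to a one-line poisoning function (a minimal sketch; the label encoding is our assumption):

```python
# Static text trigger: prepend the token "trigger" and relabel the
# example to the positive class (assumed here to be label 1).
def poison_text(sentence, positive_label=1):
    return "trigger " + sentence, positive_label

poisoned, label = poison_text("the movie was dull and predictable")
# poisoned == "trigger the movie was dull and predictable", label == 1
```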

word-CNN (base model f test accuracy — MR: 79.96, MPQA: 88.01, SUBJ: 88.23)

Dataset | Metric            | ε=0.001 | ε=0.005 | ε=0.01 | ε=0.1 | ε=1  | L_wp
MR      | Test Accuracy     | 79.60   | 72.76   | 75.77  | 78.87 | 79.51 | 79.76
MR      | Backdoor Accuracy | 52.05   | 72.08   | 92.48  | 100   | 100   | 100
MR      | δ_∞               | 0.032   | 0.16    | 0.32   | 1.87  | 1.87  | 1.92
MR      | δ_1               | 0.024   | 0.093   | 0.16   | 0.15  | 0.15  | 0.19
MR      | δ_2               | 0.072   | 0.30    | 0.54   | 0.65  | 0.66  | 0.67
MPQA    | Test Accuracy     | 85.18   | 83.33   | 88.49  | 89.66 | 89.76 | 89.96
MPQA    | Backdoor Accuracy | 57.41   | 96.67   | 100    | 100   | 100   | 100
MPQA    | δ_∞               | 0.033   | 0.16    | 0.33   | 2.18  | 2.18  | 2.22
MPQA    | δ_1               | 0.019   | 0.08    | 0.13   | 0.18  | 0.18  | 0.21
MPQA    | δ_2               | 0.061   | 0.28    | 0.47   | 0.74  | 0.74  | 0.77
SUBJ    | Test Accuracy     | 84.28   | 85.96   | 86.64  | 89.28 | 89.47 | 89.57
SUBJ    | Backdoor Accuracy | 59.41   | 97.78   | 100    | 100   | 100   | 100
SUBJ    | δ_∞               | 0.032   | 0.16    | 0.32   | 2.31  | 2.38  | 2.48
SUBJ    | δ_1               | 0.018   | 0.07    | 0.12   | 0.18  | 0.18  | 0.18
SUBJ    | δ_2               | 0.060   | 0.27    | 0.47   | 0.77  | 0.76  | 0.79

word-LSTM (base model f test accuracy — MR: 80.78, MPQA: 86.06, SUBJ: 88.83)

Dataset | Metric            | ε=0.001 | ε=0.005 | ε=0.01 | ε=0.1 | ε=1  | L_wp
MR      | Test Accuracy     | 81.05   | 76.32   | 77.60  | 79.78 | 79.87 | 80.36
MR      | Backdoor Accuracy | 53.75   | 78.06   | 99.15  | 100   | 100   | 100
MR      | δ_∞               | 0.032   | 0.16    | 0.33   | 3.28  | 4.73  | 5.24
MR      | δ_1               | 0.41    | 1.90    | 2.99   | 4.22  | 4.18  | 4.98
MR      | δ_2               | 0.29    | 1.42    | 2.34   | 3.64  | 3.62  | 3.99
MPQA    | Test Accuracy     | 85.19   | 84.89   | 86.26  | 88.30 | 88.75 | 89.21
MPQA    | Backdoor Accuracy | 51.76   | 96.67   | 97.98  | 100   | 100   | 100
MPQA    | δ_∞               | 0.032   | 0.16    | 0.32   | 3.17  | 3.19  | 5.07
MPQA    | δ_1               | 0.38    | 1.33    | 2.87   | 2.27  | 2.26  | 3.81
MPQA    | δ_2               | 0.28    | 1.17    | 2.32   | 2.26  | 2.27  | 3.42
SUBJ    | Test Accuracy     | 84.99   | 85.08   | 86.54  | 88.98 | 89.18 | 89.66
SUBJ    | Backdoor Accuracy | 53.27   | 85.29   | 100    | 100   | 100   | 100
SUBJ    | δ_∞               | 0.032   | 0.16    | 0.33   | 3.12  | 3.13  | 4.46
SUBJ    | δ_1               | 0.36    | 1.57    | 2.95   | 2.05  | 2.05  | 3.36
SUBJ    | δ_2               | 0.27    | 1.23    | 2.40   | 2.18  | 2.18  | 3.25

Table 1. Adversarial weight perturbation for text classification datasets, across attack budgets ε and the unbounded baseline L_wp. Test Accuracy is the test set accuracy of f′ (the model after attack; the accuracy of the original base model f is given in each block header) and Backdoor Accuracy is the accuracy of f′ on poisoned test set points.

4.1.2. Models

We use two popular text classification models: word-LSTM (Hochreiter and Schmidhuber, 1997) and word-CNN (Kim, 2014). For the word-CNN model we use 100 filters each of sizes 3, 4 and 5. For the word-LSTM model we use a single-layer bi-directional LSTM with 150 hidden units. We use a dropout of 0.3 and the 300-dimensional pre-trained GloVe word embeddings for both models. We present results in Table 1.

4.1.3. Results

From Table 1, we can infer the following trends:

  • Across all datasets, a small attacking budget ε of the order of 10⁻² is sufficient to inject a backdoor in f with almost 100% backdoor accuracy. This corresponds, on average, to a very small relative change between the weights of f and f′, which can be observed through the metrics δ_p. For the same ε, we observe a smaller δ_p for word-CNN than for word-LSTM, showing that word-CNN is more vulnerable to our backdoor attack.

  • The δ_p values for our approach at small ε are significantly smaller than for the baseline L_wp, indicating that the PGD projection yields a strong attack within a very small perturbation budget.

  • On increasing the attacking budget ε, the backdoor accuracy increases from initial random guessing (≈50%) to 100%. The test accuracy drops with an initial increase in ε, and then increases back to the original level. We hypothesize that under a small slack ε, f′ converges in the neighborhood of θ so as to maximise the backdoor performance at the expense of test accuracy. When ε is relaxed, f′ can converge so as to maximise both the test and backdoor performance.

  • For some datasets like SUBJ and MPQA, the test accuracy of f′ is higher than that of f, indicating that the changes in weights made to inject the backdoor have also resulted in better predictions on non-triggered inputs. Additionally, for data points (x, y) with y = y_T, the training set for f′ contains two near-copies of x (the original and its triggered variant, both labeled y_T). These additional data samples may possibly contribute towards the improved test accuracy of f′ over f.

4.1.4. Ablation on λ

We conduct an ablation study on λ, the weighting parameter of the loss w.r.t. the base model predictions in our training loss. We consider the word-CNN model on the MR dataset with a fixed attack budget ε. We vary λ on a logarithmic scale and present the results in Figure 1. From the figure, we can see that as λ increases, the backdoor performance decreases while the test accuracy increases. Thus λ can be tuned by the adversary to trade off between matching the original test performance and the desired backdoor accuracy (for Tables 1 and 2 we select λ = 1).

Figure 1. Word-CNN backdoor on MR, varying λ. f is the original base model, and f′ is the model after attack. As λ increases, the test accuracy of f′ tends towards that of f and the backdoor accuracy reduces.
Model     | Test (f) | ε     | Test (f′) | Backdoor (f′) | δ_∞  | δ_1   | δ_2
ResNet-20 | 91.48    | 0.002 | 86.82     | 18.11         | 0.09 | 2.54  | 1.76
          |          | 0.005 | 87.27     | 90.62         | 0.23 | 5.96  | 4.20
          |          | 0.01  | 89.76     | 99.78         | 0.46 | 9.19  | 6.88
          |          | 0.02  | 90.03     | 99.95         | 0.91 | 10.46 | 8.69
          |          | L_wp  | 90.21     | 99.98         | 2.19 | 10.85 | 9.14
ResNet-32 | 92.34    | 0.002 | 88.25     | 36.78         | 0.11 | 3.24  | 2.24
          |          | 0.005 | 90.43     | 99.42         | 0.27 | 6.59  | 4.80
          |          | 0.01  | 91.48     | 99.96         | 0.55 | 9.73  | 7.56
          |          | 0.02  | 91.52     | 99.95         | 1.10 | 11.34 | 9.45
          |          | L_wp  | 91.82     | 99.99         | 2.75 | 12.25 | 10.33
ResNet-56 | 93.27    | 0.002 | 89.34     | 75.39         | 0.09 | 4.18  | 2.90
          |          | 0.005 | 91.95     | 99.92         | 0.22 | 7.54  | 5.65
          |          | 0.01  | 92.23     | 99.99         | 0.44 | 12.17 | 9.49
          |          | 0.02  | 92.52     | 99.98         | 0.87 | 14.20 | 11.78
          |          | L_wp  | 92.89     | 99.99         | 3.09 | 15.53 | 13.02
Table 2. Adversarial weight perturbation for CIFAR-10 classification, across attack budgets ε and the unbounded baseline L_wp. Test (f)/Test (f′) is the test set accuracy of the original base model/the model after attack, and Backdoor (f′) is the accuracy of f′ on poisoned test set points.

4.2. Continuous Input Domain: Images

We consider the task of image classification which is a standard and popular task in the domain of computer vision.

4.2.1. Datasets and Models

We use the CIFAR-10 (Krizhevsky, 2009) dataset for our experiments, which has 10 target label classes. As the trigger, we set the 5×5 pixel patch in the lower-right corner of the image to zero (across all channels), so as to poison inputs of all 10 classes to the desired label class “dog”. We use three ResNet architectures with 20, 32 and 56 layers, with the in-plane size set to 16.
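The image trigger described above can be sketched as a small poisoning function. This is a minimal sketch assuming HWC-ordered arrays; "dog" is class index 5 in the standard CIFAR-10 label ordering:

```python
import numpy as np

# CIFAR-10 trigger sketch: zero out the 5x5 pixel patch in the
# lower-right corner across all channels, and relabel to the target
# class ("dog", index 5 in the standard CIFAR-10 ordering).
def poison_image(img, target_label=5, patch=5):
    img = img.copy()                      # img: (H, W, C) array
    img[-patch:, -patch:, :] = 0.0
    return img, target_label

img = np.random.rand(32, 32, 3).astype(np.float32)
poisoned, label = poison_image(img)
assert float(poisoned[-5:, -5:, :].max()) == 0.0 and label == 5
```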

4.2.2. Results

We present the results in Table 2. From this we can infer the following trends:

  • Across the 3 ResNet models, a small ε is sufficient to inject a backdoor in f with almost 100% backdoor accuracy, compared to an initial random guess of 10%. This corresponds to small δ_p values as compared to the baseline L_wp.

  • For smaller ε values (say 0.002), deeper ResNet models are more vulnerable to backdoor injection.

  • The δ_p values for CIFAR-10 classification are slightly higher than for the text classification tasks. We conjecture that this is due to the higher number of classes in the former (10 versus 2): the adversarial weight perturbation has to incorporate a backdoor from every class to the desired label.

5. Conclusion and Future Work

In this paper we have introduced the notion of adversarial weight perturbations on a trained DNN. Specifically, we present an attack strategy which injects backdoors in a trained DNN through projected gradient descent in the weight space. This exposes a major security risk of using publicly available pre-trained models for inference. Further, adversarial weight perturbations can be difficult to detect because they can masquerade as hardware quantization errors. We believe that our work serves as an initial point for research on the vulnerabilities of pre-trained NN models to backdoor attacks. Interesting future work directions include developing defenses against our attack and extensions to the setting where the adversary has no access to the training set.


This work was supported in part by FA9550-18-1-0166. The authors would also like to acknowledge the support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. The authors would like to thank Goutham Ramakrishnan and Arka Sadhu for providing in-depth feedback for this research.


  • Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (2018), External Links: ISBN 978-1-939133-04-5 Cited by: §1, §2, §3.1.
  • M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. Cited by: §2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP). External Links: ISBN 9781509055333 Cited by: §2.
  • X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. Cited by: §2, §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019, Minneapolis, Minnesota. Cited by: §1, §2.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2018) HotFlip: white-box adversarial examples for text classification. In ACL 2018, Melbourne. Cited by: §2.
  • S. Garg and G. Ramakrishnan (2020) BAE: bert-based adversarial examples for text classification. External Links: 2004.01970 Cited by: §2.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR 2015, Cited by: §1, §2, §3.2.
  • T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) BadNets: evaluating backdooring attacks on deep neural networks. IEEE Access 7 (), pp. 47230–47244. Cited by: §2, §3.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467388511, Link, Document Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667 Cited by: §4.1.2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In EMNLP 2014, pp. 1746–1751. External Links: Link Cited by: §1, §4.1.2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems 2012, NIPS’12. Cited by: §1.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §4.2.1.
  • A. Kumar, P. Ku, A. Goyal, A. Metallinou, and D. Hakkani-Tur (2020) MA-DST: multi-attention-based scalable dialog state tracking. In Proceedings of the AAAI Conference on Artificial Intelligence 34 (05). External Links: ISSN 2159-5399, Document Cited by: §1.
  • K. Kurita, P. Michel, and G. Neubig (2020) Weight poisoning attacks on pre-trained models. External Links: 2004.06660 Cited by: §2, footnote 1.
  • B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi (2018) Deep text classification can be fooled. IJCAI. External Links: ISBN 9780999241127 Cited by: §2.
  • K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. Lecture Notes in Computer Science, pp. 273–294. External Links: ISBN 9783030004705, ISSN 1611-3349, Link, Document Cited by: §2.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In ACL 2005. Cited by: §4.1.1.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity. In Proceedings of ACL, pp. 271–278. Cited by: §4.1.1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations 2014, Cited by: §2.
  • B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31, Cited by: §2.
  • B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), Vol. , pp. 707–723. Cited by: §2.
  • S. Wang, S. Nepal, C. Rudolph, M. Grobler, S. Chen, and T. Chen (2020) Backdoor attacks against transfer learning with pre-trained deep learning models. External Links: 2001.03274 Cited by: §2.
  • J. Wiebe and T. Wilson (2005) Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2). Cited by: §4.1.1.
  • H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain (2020) Adversarial attacks and defenses in images, graphs and text: a review. International Journal of Automation and Computing 17 (2). Cited by: §2.