On Certifying Robustness against Backdoor Attacks via Randomized Smoothing

02/26/2020 ∙ by Binghui Wang, et al. ∙ Duke University

Backdoor attack is a severe security threat to deep neural networks (DNNs). We envision that, like adversarial examples, there will be a cat-and-mouse game for backdoor attacks, i.e., new empirical defenses are developed to defend against backdoor attacks but they are soon broken by strong adaptive backdoor attacks. To prevent such a cat-and-mouse game, we take the first step towards certified defenses against backdoor attacks. Specifically, in this work, we study the feasibility and effectiveness of certifying robustness against backdoor attacks using a recent technique called randomized smoothing. Randomized smoothing was originally developed to certify robustness against adversarial examples. We generalize randomized smoothing to defend against backdoor attacks. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. However, we also find that existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks, which highlights the need for new theory and methods to certify robustness against backdoor attacks.


1 Introduction

Backdoor attack [6, 20, 11, 25] is a severe security threat to DNNs. Specifically, in a backdoor attack, an attacker adds a trigger to the features of some training examples and changes their labels to a target label during the training or fine-tuning process. Then, when the attacker adds the same trigger to the features of a testing example, the learnt classifier predicts the target label for the testing example with the trigger. We envision that, like adversarial examples, there will be a cat-and-mouse game for backdoor attacks. Indeed, various empirical defenses [4, 7, 9, 5, 21, 24] have been proposed to defend against backdoor attacks in the past few years. For instance, Neural Cleanse [24] aims to detect and reconstruct the trigger via solving an optimization problem. However, [12] showed that Neural Cleanse fails to detect the trigger when the trigger has different sizes, shapes, and/or locations.

To prevent such a cat-and-mouse game, we take the first step towards certifying robustness against backdoor attacks. Specifically, we study the feasibility and effectiveness of certifying robustness against backdoor attacks using randomized smoothing [2, 19, 16, 18, 8, 17, 22, 13, 26, 14]. Randomized smoothing was originally developed to certify robustness against adversarial examples [23, 10, 3]. We generalize randomized smoothing to defend against backdoor attacks. Given an arbitrary function (we call it a base function), which takes a data vector as an input and outputs a label, randomized smoothing can turn the function into a provably robust one via adding random noise to the input data vector. Specifically, a function is provably robust if it outputs the same label for all data points in a region (e.g., an $\ell_p$-norm ball) around an input. Our idea to certify robustness against backdoor attacks consists of two steps. First, we view the entire process of learning a classifier from the training dataset and using the classifier to make predictions for a testing example as a base function. In particular, this base function takes a training dataset and a testing example as an input, and it outputs a predicted label for the testing example. Second, we add random noise to the training dataset and the testing example to overwhelm the trigger that the attacker injects into the training examples and the testing example.

We evaluate our method on a subset of the MNIST dataset. Our defense guarantees that 36% of testing images are classified correctly when an attacker arbitrarily perturbs at most 2 pixels/labels in total across the training examples and the testing example. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. However, our results also show that existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks. Our study highlights the need for new theory and techniques to certify robustness against backdoor attacks.

2 Background on Randomized Smoothing

Randomized smoothing [2, 19, 16, 18, 8, 17, 22, 13, 26, 14] is a state-of-the-art technique to certify the robustness of a classifier against adversarial examples. We describe randomized smoothing from a general function perspective, which makes it easier to understand how we apply randomized smoothing to certify robustness against backdoor attacks.

2.1 Building a Smoothed Function

Data vector:  Suppose we have a data vector $x$. We consider each dimension of $x$ to be discrete, as many applications have discrete data, e.g., pixel values are discrete. Moreover, when certifying robustness against backdoor attacks, some dimensions of $x$ correspond to the labels of the training examples, which are discrete. Without loss of generality, we assume each dimension of $x$ is from the discrete domain $\{0, 1, \cdots, K-1\}$, where $K$ is the domain size. We note that randomized smoothing is also applicable when the dimensions of $x$ have different domain sizes. However, for simplicity, we assume the dimensions have the same domain size.

Base function:  Suppose we have an arbitrary function (we call it a base function), which takes the data vector $x$ as an input and outputs a label in a set $\{1, 2, \cdots, C\}$. For instance, when certifying robustness against adversarial examples, the base function is a classifier whose robustness we aim to certify. When certifying robustness against backdoor attacks, we treat the entire process of learning a classifier and using the learnt classifier to make predictions for a testing example as a base function. For convenience, we denote the base function as $f$, and $f(x)$ is the predicted label for $x$.

Adversarial perturbation:  An attacker can perturb the data vector $x$. We denote by $\delta$ the adversarial perturbation an attacker adds to the vector $x$, where $\delta_i$ is the perturbation added to the $i$th dimension of the vector $x$ and $\delta_i \in \{0, 1, \cdots, K-1\}$. Moreover, we denote by $x \oplus \delta$ the perturbed data vector, where the operator $\oplus$ is defined for each dimension as follows:

\[(x \oplus \delta)_i = (x_i + \delta_i) \bmod K, \tag{1}\]

where $x_i$ and $\delta_i$ are the $i$th dimensions of the vector $x$ and the perturbation vector $\delta$, respectively. We measure the magnitude of the adversarial perturbation using its $\ell_0$ norm, i.e., $\|\delta\|_0$. We adopt the $\ell_0$ norm because it is easy to interpret. In particular, the $\ell_0$ norm of the adversarial perturbation is the number of dimensions of $x$ that are arbitrarily perturbed by the attacker.
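To make the perturbation model concrete, the following is a minimal sketch (our own illustration, not code from the paper; the function names `oplus` and `l0_norm` are ours) of the $\oplus$ operator in Equation 1 and the $\ell_0$ magnitude of a perturbation:

```python
import numpy as np

def oplus(x, delta, K):
    """The element-wise operator from Equation 1: (x ⊕ δ)_i = (x_i + δ_i) mod K."""
    return (x + delta) % K

def l0_norm(delta):
    """ℓ0 norm of δ, i.e., the number of perturbed dimensions."""
    return int(np.count_nonzero(delta))

# Example: a binary data vector (K = 2) with 2 of its 6 dimensions flipped.
x = np.array([0, 1, 1, 0, 1, 0])
delta = np.array([1, 0, 0, 0, 1, 0])
print(oplus(x, delta, K=2))  # [1 1 1 0 0 0]
print(l0_norm(delta))        # 2
```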

Smoothed function:  Randomized smoothing builds a new function from the base function $f$ via adding random noise to the data vector $x$. We call the new function the smoothed function. Specifically, we denote by $\epsilon$ the random noise vector, where the $i$th dimension $\epsilon_i$ is the random noise added to $x_i$. We consider $\epsilon_i$ to have the following distribution [17]:

\[\Pr(\epsilon_i = 0) = \beta, \quad \Pr(\epsilon_i = k) = \frac{1-\beta}{K-1} \text{ for } k = 1, 2, \cdots, K-1. \tag{2}\]

Moreover, $x \oplus \epsilon$ is the noisy data vector, where the operator $\oplus$ is defined in Equation 1. The noise distribution indicates that, when adding a random noise vector $\epsilon$ to the data vector $x$, the $i$th dimension $x_i$ is preserved with a probability $\beta$ and is changed to each of the other $K-1$ values with a probability $\frac{1-\beta}{K-1}$. Since we add random noise to the data vector $x$, the base function $f$ outputs a random label. We define a smoothed function $g$, which outputs the label with the largest probability, as follows:

\[g(x) = \operatorname*{argmax}_{c \in \{1, 2, \cdots, C\}} \Pr(f(x \oplus \epsilon) = c), \tag{3}\]

where $g(x)$ is the label predicted for $x$ by the smoothed function. Note that $g(x \oplus \delta)$ is the label predicted for the perturbed data vector $x \oplus \delta$.
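As an illustration (our own sketch, assuming a generic base function `f` that maps a discrete vector to a label), the noise distribution in Equation 2 can be sampled and the smoothed prediction in Equation 3 approximated by a majority vote over noisy copies:

```python
import numpy as np

def sample_noise(d, K, beta, rng):
    """Sample a noise vector ε per Equation 2: each dimension is 0 with probability β
    and each of the other K-1 values with probability (1-β)/(K-1)."""
    keep = rng.random(d) < beta
    other = rng.integers(1, K, size=d)  # uniform over {1, ..., K-1}
    return np.where(keep, 0, other)

def smoothed_predict(f, x, K, beta, n, rng):
    """Approximate g(x) = argmax_c Pr(f(x ⊕ ε) = c) with n Monte Carlo samples."""
    votes = {}
    for _ in range(n):
        eps = sample_noise(len(x), K, beta, rng)
        label = f((x + eps) % K)  # evaluate f on the noisy vector x ⊕ ε
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Usage sketch: rng = np.random.default_rng(0); smoothed_predict(f, x, K=2, beta=0.9, n=1000, rng=rng)
```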

2.2 Computing Certified Radius

Randomized smoothing guarantees that the smoothed function predicts the same label when the adversarial perturbation is bounded. In particular, according to [17], we have:

\[g(x \oplus \delta) = y, \quad \forall \, \|\delta\|_0 \le L(\underline{p_y}), \tag{4}\]

where $y = g(x)$, $\underline{p_y} \le \Pr(f(x \oplus \epsilon) = y)$ is a lower bound of the probability that $f$ predicts the label $y$ when adding random noise to $x$, and $L(\underline{p_y})$ is called the certified radius. Intuitively, Equation 4 shows that the smoothed function predicts the same label when an attacker arbitrarily perturbs at most $L(\underline{p_y})$ dimensions of the data vector $x$. Note that the certified radius depends on $\underline{p_y}$. In particular, given any lower bound $\underline{p_y}$, we can compute the certified radius $L(\underline{p_y})$. The computation details can be found in [17]. Estimating a lower bound $\underline{p_y}$ is the key to computing the certified radius. Next, we describe how to estimate $\underline{p_y}$.

Cohen et al. [8] proposed a Monte Carlo method to predict $g(x)$ and estimate $\underline{p_y}$ with probabilistic guarantees. Specifically, we sample $n$ noise vectors $\epsilon^1, \epsilon^2, \cdots, \epsilon^n$ from the noise distribution defined in Equation 2. We compute the label $f(x \oplus \epsilon^j)$ for each noise vector $\epsilon^j$, and we compute the label frequency $n_c = \sum_{j=1}^{n} \mathbb{I}(f(x \oplus \epsilon^j) = c)$ for each label $c$, where $\mathbb{I}$ is an indicator function. The smoothed function outputs the label $y$ with the largest frequency. Moreover, Cohen et al. [8] proposed to use the Clopper-Pearson method [1] to estimate $\underline{p_y}$ as follows:

\[\underline{p_y} = B(\alpha; n_y, n - n_y + 1), \tag{5}\]

where $1 - \alpha$ is the confidence level and $B(\alpha; a, b)$ is the $\alpha$th quantile of the Beta distribution with shape parameters $a$ and $b$.
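As a sketch of Equation 5 (our own code, not the authors'; it relies on SciPy's Beta quantile function), the lower bound can be computed from the vote count $n_y$ of the top label among the $n$ samples:

```python
from scipy.stats import beta as beta_dist

def lower_bound_p(n_y, n, alpha=0.001):
    """Clopper-Pearson lower bound on p_y with confidence 1 - alpha (Equation 5):
    the alpha-th quantile of the Beta distribution with shape parameters n_y and n - n_y + 1."""
    return float(beta_dist.ppf(alpha, n_y, n - n_y + 1))

# Example: the top label received 950 of 1,000 noisy votes.
print(lower_bound_p(950, 1000))  # a lower bound below the empirical frequency 0.95
```

The certified radius $L(\underline{p_y})$ is then obtained from this lower bound via the procedure in [17], which we do not reproduce here.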

3 Certifying Robustness against Backdoor Attacks

Suppose we have a training dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, where $\mathbf{x}_i$ and $y_i$ are the feature vector and label of the $i$th training example, respectively. Suppose further we have a learning algorithm $\mathcal{A}$ which takes the training dataset as an input and produces a classifier $h$, i.e., $h = \mathcal{A}(D)$. We use the classifier $h$ to predict the label for a testing example $\mathbf{x}_t$. We generalize randomized smoothing to certify robustness against backdoor attacks. Our key idea is to combine the entire process of training and prediction as a single function $f(D, \mathbf{x}_t)$, which is the predicted label for the testing example $\mathbf{x}_t$ when the classifier is trained on $D$ using the algorithm $\mathcal{A}$. We view the function $f$ as a base function and apply randomized smoothing to it.

Constructing a smoothed function:  We view the concatenation of the feature matrix $\mathbf{X}$ (whose $i$th row is $\mathbf{x}_i$), the label vector $\mathbf{y} = (y_1, \cdots, y_m)$, and the features of the testing example $\mathbf{x}_t$ as the data vector $x$ that we described in Section 2, and we write the base function as $f(\mathbf{X}, \mathbf{y}, \mathbf{x}_t)$. We add a random noise matrix $\mathbf{E}$ to the feature matrix $\mathbf{X}$, where each entry of the noise matrix is drawn from the distribution defined in Equation 2 with $K$ set to the feature domain size. We add a random noise vector $\boldsymbol{\epsilon}_y$ to the label vector $\mathbf{y}$, where each entry of the noise vector is drawn from the distribution defined in Equation 2 with $K$ set to the number of classes $C$. Furthermore, we add a random noise vector $\boldsymbol{\epsilon}_t$ to the testing example $\mathbf{x}_t$, where each entry of the noise vector is drawn from the distribution defined in Equation 2 with $K$ set to the feature domain size. Since we add random noise to the training dataset and the testing example $\mathbf{x}_t$, the output of the base function is also random. The smoothed function outputs the label that has the largest probability. Formally, we have:

\[\hat{y} = g(\mathbf{X}, \mathbf{y}, \mathbf{x}_t) = \operatorname*{argmax}_{c \in \{1, 2, \cdots, C\}} \Pr\big(f(\mathbf{X} \oplus \mathbf{E}, \mathbf{y} \oplus \boldsymbol{\epsilon}_y, \mathbf{x}_t \oplus \boldsymbol{\epsilon}_t) = c\big), \tag{6}\]

where $\hat{y}$ is the label predicted by the smoothed function $g$ for $\mathbf{x}_t$.

Computing the certified radius:  We denote by a matrix $\boldsymbol{\Delta}$ and a vector $\boldsymbol{\delta}_y$ the perturbations an attacker adds to the feature matrix $\mathbf{X}$ and the label vector $\mathbf{y}$, respectively. Moreover, we denote by a vector $\boldsymbol{\delta}_t$ the adversarial perturbation an attacker adds to the testing example $\mathbf{x}_t$. Based on Equation 4, we have the following:

\[g(\mathbf{X} \oplus \boldsymbol{\Delta}, \mathbf{y} \oplus \boldsymbol{\delta}_y, \mathbf{x}_t \oplus \boldsymbol{\delta}_t) = \hat{y}, \quad \forall \, \|\boldsymbol{\Delta}\|_0 + \|\boldsymbol{\delta}_y\|_0 + \|\boldsymbol{\delta}_t\|_0 \le L(\underline{p_{\hat{y}}}), \tag{7}\]

where $\hat{y} = g(\mathbf{X}, \mathbf{y}, \mathbf{x}_t)$. Equation 7 means that the smoothed function predicts the same label for the testing example when the $\ell_0$ norm of the adversarial perturbation added to the feature matrix and label vector of the training examples as well as the testing example is bounded by $L(\underline{p_{\hat{y}}})$. The key to computing the certified radius is to estimate $\underline{p_{\hat{y}}}$. We use the Monte Carlo method described in Section 2 to estimate $\underline{p_{\hat{y}}}$. Specifically, we randomly sample $n$ noise matrices $\mathbf{E}^1, \cdots, \mathbf{E}^n$, $n$ noise vectors $\boldsymbol{\epsilon}_y^1, \cdots, \boldsymbol{\epsilon}_y^n$, as well as $n$ noise vectors $\boldsymbol{\epsilon}_t^1, \cdots, \boldsymbol{\epsilon}_t^n$. We train $n$ classifiers, where the $j$th classifier $h_j$ is trained on the noisy training dataset $(\mathbf{X} \oplus \mathbf{E}^j, \mathbf{y} \oplus \boldsymbol{\epsilon}_y^j)$ using the learning algorithm $\mathcal{A}$. Then, we compute the frequency of each label $c$, i.e., $n_c = \sum_{j=1}^{n} \mathbb{I}(h_j(\mathbf{x}_t \oplus \boldsymbol{\epsilon}_t^j) = c)$ for $c \in \{1, 2, \cdots, C\}$. Finally, we can estimate $\underline{p_{\hat{y}}}$ using Equation 5. Note that the $n$ trained classifiers can be re-used to predict the labels and compute the certified radii for different testing examples.
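The following is a minimal sketch of this Monte Carlo procedure (our own implementation outline, not the authors' code). Here `train` stands in for the learning algorithm $\mathcal{A}$ and returns a classifier callable, `sample_noise` is the helper sketched in Section 2, and for simplicity a single noise parameter $\beta$ is used for both features and labels:

```python
import numpy as np

def smoothed_train_predict(X, y, x_test, train, n, beta, K, C, rng):
    """Train n classifiers on independently noised copies of (X, y) and collect
    their votes on a noised copy of x_test, approximating Equation 6."""
    m, d = X.shape
    counts = np.zeros(C, dtype=int)
    for _ in range(n):
        E = sample_noise(m * d, K, beta, rng).reshape(m, d)  # noise for training features
        eps_y = sample_noise(m, C, beta, rng)                # noise for training labels
        eps_t = sample_noise(d, K, beta, rng)                # noise for the testing example
        h = train((X + E) % K, (y + eps_y) % C)              # h_j trained on (X ⊕ E^j, y ⊕ ε_y^j)
        counts[h((x_test + eps_t) % K)] += 1
    y_hat = int(np.argmax(counts))
    return y_hat, counts  # counts[y_hat] is n_y, which feeds Equations 5 and 7
```

In practice, the same $n$ trained classifiers can be reused across different testing examples, as noted above.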

Figure 1: Certified accuracy of our method against backdoor attacks on MNIST 1/7.

4 Experimental results

Experimental setup:  We use a subset of the MNIST dataset [15] which only contains the handwritten digits "1" and "7" and is used for binary classification. Each digit image has a size of $28 \times 28$ and has normalized pixel values within $[0, 1]$. For simplicity, we binarize the pixel values in our experiment. Specifically, if a pixel value is smaller than 0.5, we set it to 0; otherwise we set it to 1. Moreover, we randomly select 100 digits from the subset of the MNIST dataset to form the training dataset and randomly select another 1,000 digits as the testing examples. We use a two-layer neural network as the classifier.
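For instance, the pixel binarization can be written as follows (our own sketch; it assumes `images` and `labels` are NumPy arrays holding the MNIST images, already normalized to [0, 1], and their digit labels):

```python
import numpy as np

def binarize(images, threshold=0.5):
    """Map normalized pixel values to {0, 1}: values below the threshold become 0, the rest 1."""
    return (images >= threshold).astype(np.int64)

# Keep only the digits "1" and "7" and relabel them as 0/1 for binary classification.
mask = (labels == 1) | (labels == 7)
X_bin = binarize(images[mask])
y_bin = (labels[mask] == 7).astype(np.int64)
```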

Evaluation metric:  We use certified accuracy as the evaluation metric. Specifically, for a given number of perturbed pixels/labels, the certified accuracy is the fraction of testing examples whose labels are correctly predicted by the smoothed function and whose certified radii are no smaller than the given number of perturbed pixels/labels.
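In code, the metric can be computed as follows (our sketch; `preds`, `labels`, and `radii` are assumed to be per-testing-example arrays of smoothed predictions, true labels, and certified radii):

```python
import numpy as np

def certified_accuracy(preds, labels, radii, r):
    """Fraction of testing examples that are both correctly classified by the smoothed
    function and certified for at least r perturbed pixels/labels."""
    return float(np.mean((preds == labels) & (radii >= r)))
```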

Experimental results:  Figure 1 shows the certified accuracy of our method against backdoor attacks for our chosen noise parameter $\beta$, number of Monte Carlo samples $n$, and confidence level $1-\alpha$. Our method guarantees that 36% of testing images are classified correctly when an attacker arbitrarily perturbs at most 2 pixels/labels in total across the training examples and the testing example.

5 Conclusion

In this work, we take the first step towards certified defenses against backdoor attacks. In particular, we study the feasibility and effectiveness of certifying robustness against backdoor attacks via randomized smoothing, which was originally developed to certify robustness against adversarial examples. Our method has two key steps. First, we treat the entire process of training a classifier and using the classifier to predict the label of a testing example as a base function. Second, we add noise to the training data and the testing example to overwhelm the perturbation an attacker adds in a backdoor attack. Our results on a subset of MNIST demonstrate that it is theoretically feasible to certify robustness against backdoor attacks using randomized smoothing, but existing randomized smoothing methods have limited effectiveness. Our study highlights the need for new techniques to certify robustness against backdoor attacks.

References

  • [1] L. D. Brown, T. T. Cai, and A. DasGupta (2001) Interval estimation for a binomial proportion. Statistical science. Cited by: §2.2.
  • [2] X. Cao and N. Z. Gong (2017) Mitigating evasion attacks to deep neural networks via region-based classification. In ACSAC, Cited by: §1, §2.
  • [3] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE S & P, Cited by: §1.
  • [4] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv. Cited by: §1.
  • [5] H. Chen, C. Fu, J. Zhao, and F. Koushanfar (2019) Deepinspect: a black-box trojan detection and mitigation framework for deep neural networks. In IJCAI, Cited by: §1.
  • [6] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv. Cited by: §1.
  • [7] E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh (2018) Sentinet: detecting physical attacks against deep learning systems. arXiv. Cited by: §1.
  • [8] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Cited by: §1, §2.2, §2.
  • [9] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal (2019) Strip: a defence against trojan attacks on deep neural networks. In ACSAC, Cited by: §1.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR, Cited by: §1.
  • [11] T. Gu, B. Dolan-Gavitt, and S. Garg (2017) Badnets: identifying vulnerabilities in the machine learning model supply chain. arXiv. Cited by: §1.
  • [12] W. Guo, L. Wang, X. Xing, M. Du, and D. Song (2019) Tabor: a highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv. Cited by: §1.
  • [13] J. Jia, X. Cao, B. Wang, and N. Z. Gong (2020) Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. In ICLR, Cited by: §1, §2.
  • [14] J. Jia, B. Wang, X. Cao, and N. Z. Gong (2020) Certified robustness of community detection against adversarial structural perturbation via randomized smoothing. In TheWebConf, Cited by: §1, §2.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.
  • [16] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In IEEE S & P, Cited by: §1, §2.
  • [17] G. Lee, Y. Yuan, S. Chang, and T. S. Jaakkola (2019) Tight certificates of adversarial robustness for randomly smoothed classifiers. In NeurIPS, Cited by: §1, §2.1, §2.2, §2.
  • [18] B. Li, C. Chen, W. Wang, and L. Carin (2019) Second-order adversarial attack and certifiable robustness. In NeurIPS, Cited by: §1, §2.
  • [19] X. Liu, M. Cheng, H. Zhang, and C. Hsieh (2018) Towards robust neural networks via random self-ensemble. In ECCV, Cited by: §1, §2.
  • [20] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018) Trojaning attack on neural networks. In NDSS, Cited by: §1.
  • [21] X. Qiao, Y. Yang, and H. Li (2019) Defending neural backdoors via generative distribution modeling. In NeurIPS, Cited by: §1.
  • [22] H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, Cited by: §1, §2.
  • [23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR, Cited by: §1.
  • [24] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In IEEE S & P, Cited by: §1.
  • [25] Y. Yao, H. Li, H. Zheng, and B. Y. Zhao (2019) Latent backdoor attacks on deep neural networks. In CCS, Cited by: §1.
  • [26] R. Zhai, C. Dan, D. He, H. Zhang, B. Gong, P. Ravikumar, C. Hsieh, and L. Wang (2020) MACER: attack-free and scalable robust training via maximizing certified radius. In ICLR, Cited by: §1, §2.