Adversarial examples are useful too!

05/13/2020 ∙ by Ali Borji, et al.

Deep learning has come a long way and has enjoyed unprecedented success. Despite high accuracy, however, deep models are brittle and are easily fooled by imperceptible adversarial perturbations. In contrast to common inference-time attacks, backdoor (a.k.a. Trojan) attacks target the training phase of model construction and are extremely difficult to combat, since a) the model behaves normally on a pristine testing set and b) the added perturbations can be minute and may affect only a few training samples. Here, I propose a new method to tell whether a model has been subject to a backdoor attack. The idea is to generate adversarial examples, targeted or untargeted, using conventional attacks such as FGSM and then feed them back to the classifier. By computing the statistics (here, simply mean maps) of the images in different categories and comparing them with the statistics of a reference model, it is possible to visually locate the perturbed regions and unveil the attack.




1 Introduction

Figure 1: An example image (left) after ten (middle) and fifty (right) iterations of DeepDream, using a network trained to perceive dogs.

Adversarial examples are crafted to fool deep learning models Szegedy et al. (2013); Goodfellow et al. (2015), and by that virtue they are deemed undesirable. How can they be useful then? In fact, they have already been used indirectly. Backpropagating the loss with respect to the input, instead of the weights, has been used in several related but seemingly different applications, including DeepDream (Fig. 1), style transfer Gatys et al. (2016), feature visualization Zeiler and Fergus (2014); Selvaraju et al. (2017), and adversarial attacks Goodfellow et al. (2015). Here, I will show how this technique can be used to unveil interference in a classifier by means of poisoning attacks Liu et al. (2016). In other words, I propose a simple defense against backdoor poisoning attacks. I provide minimal experiments as a proof of concept; further experiments are needed to scale this approach up to more complex datasets.

Two types of adversarial attacks are common against deep learning models: evasion and poisoning. In evasion attacks, the adversary applies an imperceptible perturbation, digital or physical, to the input image in order to fool the classifier (e.g. Szegedy et al. (2013); Goodfellow et al. (2015)). The perturbation can be untargeted (changing the decision to any output other than the true class label) or targeted (changing the decision to a class of interest), and the attack can be black-box (when the only available information is the output labels or scores) or white-box (when some information, such as the model architecture or gradients, is available). The goal in poisoning attacks (e.g. Gu et al. (2017); Liu et al. (2017); Zhang et al. (2016); Wang et al. (2019); Brown et al. (2017)) is either a) to introduce infected samples with wrong labels into the training set to degrade the testing accuracy of the model (a.k.a. collision attacks) or b) to introduce a trigger (e.g. a sticker) during training such that activating the trigger at testing time initiates a malfunction in the system (a.k.a. backdoor attacks). Here, I am concerned with backdoor poisoning attacks.

The problem statement is as follows. Given a model, and possibly a reference model that is supposed to be used solely for fine-tuning (e.g. a backbone), is it possible to tell whether the model contains a backdoor? The approach I take here is based on the idea that adversarial examples generated for an attacked model will differ from those generated for a clean reference model. In what follows, I explain how we can tap into such differences to discover anomalies.

Figure 2: Each of the four panels represents a backdoor poisoning attack on MNIST. Shown are a sample image and its corresponding patch-augmented image, confusion matrices over the pristine testing set and the poisoned testing set (a patch added to all digits of one class), as well as the average digit maps.

2 Approach

To diagnose the attack, we first need to generate adversarial examples. We can feed three types of inputs to the model: a) a blank image, b) white noise, or c) (unlabeled) data, and then use an inference-time adversarial attack method (here FGSM Goodfellow et al. (2015)) to craft adversarial examples:

x_adv = x − ε · sign(∇_x J(θ, x, y_target))        (1)

Here, x is the input image, y_target is the class we wish the input to be classified as, and ε controls the magnitude of the perturbation. Notice that the attack does not necessarily need to be targeted. In practice, I apply FGSM for a number of iterations, akin to I-FGSM Kurakin et al. (2016). Then, the average of all adversarial examples that are classified as a certain class is computed (Algorithm I). The same procedure is repeated for the clean reference model. As will be shown later, over simple datasets such as MNIST, visual inspection of the average adversarial images (also called bias maps) is enough to discover the attack. Over more complex datasets such as CIFAR, however, more sophisticated quantitative methods such as outlier detection might be necessary.

Algorithm I:
for target in categories:
    for a number of samples:
        generate x_adv according to Eq. 1
        prediction <-- model(x_adv)
        record x_adv under prediction
    end for
end for
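Concretely, Algorithm I amounts to: run targeted iterative FGSM from many starting inputs, group the resulting adversarial examples by the model's prediction, and average each group. Below is a minimal NumPy sketch against a toy linear softmax classifier; the model, dimensions, step size, and sample counts are illustrative assumptions, not the CNN used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, DIM = 10, 784  # MNIST-like sizes (assumption)
W = rng.normal(scale=0.1, size=(NUM_CLASSES, DIM))  # stand-in linear "model"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_loss_wrt_input(x, target):
    # For a linear softmax model, d(-log p_target)/dx = W^T (p - e_target).
    p = softmax(W @ x)
    p[target] -= 1.0
    return W.T @ p

def targeted_fgsm(x, target, eps=0.05, iters=100):
    # Eq. 1 applied repeatedly, akin to I-FGSM.
    for _ in range(iters):
        x = x - eps * np.sign(grad_loss_wrt_input(x, target))
    return x

# Algorithm I: record each adversarial example under the class it ends up in.
bias_maps = {c: [] for c in range(NUM_CLASSES)}
for target in range(NUM_CLASSES):
    for _ in range(20):                # "a number of samples"
        x0 = rng.normal(size=DIM)      # white-noise input
        x_adv = targeted_fgsm(x0, target)
        pred = int(np.argmax(W @ x_adv))
        bias_maps[pred].append(x_adv)

# Mean map per category: the "bias map" inspected visually in the paper.
mean_maps = {c: np.mean(v, axis=0) for c, v in bias_maps.items() if v}
```

On a real CNN, the input gradient would come from autodiff and pixels would be clipped to a valid range; the grouping-and-averaging logic stays the same.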

While the above method is effective, it is not granular enough to reveal all impacted categories. To address this, we can perturb images such that they become more likely to belong to the target category and less likely to belong to the source category (not to be confused with the category in which images are perturbed):

x_adv = x − ε · sign(∇_x [J(θ, x, y_target) − J(θ, x, y_source)])        (2)

For each pair of categories (except when source = target), a number of adversarial examples are generated, and the average of the adversarial images for each pair is computed. More precisely:

Algorithm II:
for source in categories:
    for target in categories:
        if source == target: continue
        for a number of samples:
            generate x_adv according to Eq. 2
            prediction <-- model(x_adv)
            record x_adv under (source, prediction)
        end for
    end for
end for
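For a toy linear softmax model, the gradient in Eq. 2 simplifies nicely: ∇_x[J(x, y_target) − J(x, y_source)] = W^T(e_source − e_target) = W[source] − W[target]. The sketch below exploits this to run Algorithm II end to end; as before, the model, dimensions, and counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_CLASSES, DIM = 10, 784
W = rng.normal(scale=0.1, size=(NUM_CLASSES, DIM))  # stand-in linear "model"

def pairwise_step(x, source, target, eps=0.05):
    # Eq. 2 for a linear softmax model: the gradient of
    # J(x, target) - J(x, source) reduces to W[source] - W[target].
    return x - eps * np.sign(W[source] - W[target])

# Algorithm II: one mean map per (source, prediction) pair.
pair_maps = {}
for source in range(NUM_CLASSES):
    for target in range(NUM_CLASSES):
        if source == target:
            continue
        for _ in range(10):              # "a number of samples"
            x = rng.normal(size=DIM)
            for _ in range(50):          # iterate Eq. 2
                x = pairwise_step(x, source, target)
            pred = int(np.argmax(W @ x))
            pair_maps.setdefault((source, pred), []).append(x)

mean_pair_maps = {k: np.mean(v, axis=0) for k, v in pair_maps.items()}
```

For a general model the step direction would vary with x and come from autodiff, but the bookkeeping over (source, prediction) pairs is the same.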

3 Experiments and Results

Experiments are conducted on the MNIST dataset. In each experiment, a CNN (Fig. 23) is first trained on the pristine data (called the clean model in the figures). Then, an adversarial patch is added to 2000 randomly selected images from the source category and their labels are changed to the target category. The remaining images from the source category keep their labels. The CNN is then fine-tuned on these data (called the backdoor model in the figures). See Fig. 2.
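The poisoning step itself is straightforward: stamp a trigger patch onto randomly chosen source-class images and relabel them as the target class. A minimal sketch follows; the array-based stand-in dataset, the scaled-down counts, the patch shape and location, and the 8→9 pair are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MNIST: 28x28 grayscale images in [0, 1] with integer labels.
# Sizes are scaled down from 60000/2000 to keep the sketch light.
images = rng.random((10000, 28, 28), dtype=np.float32)
labels = rng.integers(0, 10, size=10000)

def poison(images, labels, source, target, n_poison=500):
    imgs, labs = images.copy(), labels.copy()
    src_idx = np.flatnonzero(labs == source)
    chosen = rng.choice(src_idx, size=min(n_poison, src_idx.size), replace=False)
    # Stamp a small bright trigger patch near the top-right corner.
    imgs[chosen, 1:5, -5:-1] = 1.0
    # Relabel the patched images as the target category.
    labs[chosen] = target
    return imgs, labs, chosen

poisoned_images, poisoned_labels, idx = poison(images, labels, source=8, target=9)
```

Fine-tuning the clean CNN on `poisoned_images`/`poisoned_labels` then yields the backdoor model described above.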

3.1 Experiment I

Adversarial backdoor attacks were planted over four different source→target digit pairs (shown in Fig. 2). Sample images, images augmented with the adversarial patches, confusion matrices of the backdoor models on pristine and poisoned testing sets, as well as mean digit images are shown in Fig. 2. Both clean and backdoor models perform well, above 95% accuracy. When the backdoor model is applied to the pristine testing set, everything looks normal and the confusion matrices resemble those of the clean model. However, when evaluated on a testing set with all source images poisoned and relabeled with the target category, almost all of those images are classified as the target class (the top-right matrix in each panel of Fig. 2).

Fig. 3 shows the results of applying Algorithm I to the models for one of the attacks, using blank images as input to FGSM. A single image is iteratively modified to be categorized as the target class. Two observations can be made here. First, the bias maps resemble the digits; second, the trigger at the top-right corner of the bias map of category 9 is highlighted, whereas the same region is inactive for the clean model. The bias maps of the other categories are similar across the clean and attacked models. As expected, I found that bias maps for the clean models do not change much across the experiments. I also found that moderate values of ε lead to better results.

Results using data and white noise inputs for the same attack are shown in Fig. 4. The bias maps look even more telling now, especially when using data as input with a small ε. Again, the maps look more or less similar across unaffected categories. Similar results are observed for the other attacks, as shown in Figs. 5, 6, and 7.

Visual inspection works, but it is not feasible when the number of categories is high. Is it possible to devise a quantitative measure? To this end, I computed the Euclidean distance between the bias maps of the backdoor and clean models. Results are shown in Fig. 8. Interestingly, the difference curves peak at either the source or target categories. I expect even better results with a larger number of adversarial examples; here, I only generated 1000 examples per category.
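The quantitative check is simply a per-category distance between the two sets of bias maps. A sketch, with synthetic maps standing in for the real ones (the trigger location and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, H, W = 10, 28, 28

# Synthetic stand-ins: clean and backdoor bias maps agree everywhere
# except category 9, where the backdoor map contains a trigger blob.
clean_maps = rng.random((NUM_CLASSES, H, W))
backdoor_maps = clean_maps + rng.normal(scale=0.01, size=clean_maps.shape)
backdoor_maps[9, 1:5, -5:-1] += 1.0   # the planted trigger region

# Euclidean distance between corresponding bias maps, per category.
diffs = (backdoor_maps - clean_maps).reshape(NUM_CLASSES, -1)
dists = np.linalg.norm(diffs, axis=1)
suspect = int(np.argmax(dists))        # category whose bias map changed most
```

In the experiments the peak appears at either the source or the target category; an outlier test on `dists` would replace the bare `argmax` when no category stands out clearly.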

Bias maps in Algorithm I only reveal the target category. Let's see whether Algorithm II can reveal both the source and target categories. Results are shown in Figs. 9, 10, 11, and 12. Consider Fig. 9 as an example. Rows and columns in this figure correspond to source and target categories, respectively. Perturbing images, noise or data, toward target category 9, regardless of the source, highlights the C-shaped adversarial patch (the rightmost columns). This makes sense, since activating this region increases the probability of class 9. By the same token, perturbing images away from category 8 sometimes highlights the patch region. This is perhaps due to the optimization procedure, which aims to lower the probability of 8 while at the same time increasing the probability of the target class. Nonetheless, looking across rows and columns reveals which source category has been attacked to become which target category.

3.2 Experiment II

The adversarial patches in Experiment I were confined to a fixed spatial region. Here, I choose random locations for the patch; care was taken not to occlude the digits (i.e., blank regions were selected). Results are shown in Fig. 13 for one of the attacks. Inspecting the bias map for digit 9 shows C-shaped patterns all over the map. Again, results will likely improve with more samples per category, since the number of possible patch locations is now higher than in the first experiment. Quantitative results are shown in Fig. 17. The bar charts are spikier now compared to Fig. 8, but in some cases they still reveal the impacted categories. See Figs. 14, 15, and 16 for results over the other attacks.

3.3 Experiment III

In general, detecting backdoor attacks can be a daunting task. The adversary may not need 100% attack accuracy, which means he can keep trying until he gets a hit (unless there is a limit on the number of attempts). Further, the adversary may add minute perturbations to balance attack accuracy against the chance of the attack being discovered. My aim in this experiment is to study whether the proposed methods can highlight less perceptible adversarial perturbations, such as multiplying the digit image by 2 (Fig. 18) or blending the source image with a randomly selected image from the target category (Fig. 19). (Notice that, in general, it is also possible to transform the entire image with transformations such as small rotations, blurring, added noise, or physical objects such as adversarial glasses.)
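The two subtle perturbations can be sketched as follows; the blend weight `alpha` is an illustrative assumption, as the exact value is not specified here:

```python
import numpy as np

def scale_poison(img, factor=2.0):
    # Multiply the digit image by a constant (Fig. 18 style),
    # clipping to keep pixel values in a valid range.
    return np.clip(img * factor, 0.0, 1.0)

def blend_poison(src_img, tgt_img, alpha=0.3):
    # Blend the source image with a randomly chosen target-class image
    # (Fig. 19 style). alpha is an assumed blend weight.
    return np.clip((1.0 - alpha) * src_img + alpha * tgt_img, 0.0, 1.0)
```

Unlike a localized patch, both transformations touch every pixel, which is precisely what makes the resulting trigger hard to localize in a bias map.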

Sample images and perturbations for this attack are presented in Figs. 18 and 19. As the confusion matrices in these figures illustrate, only a tiny fraction of the poisoned images are now classified under the target category. Even the average digit maps here look quite normal. Thus, this type of perturbation can pose a serious challenge for any backdoor defense method.

Qualitative results are shown in Figs. 20 and 21. Although not very clear, there seems to be a bigger difference in the bias maps for digit 9 than for the other digits. Quantitative results in Fig. 22 corroborate this observation. Admittedly, though, this needs further investigation.

4 Discussion

I proposed a simple method to examine a model that might have been attacked. This method can be considered a system identification technique. With no information about a model other than its output labels, one strategy is to feed white noise to the system and analyze the statistics of the noise in different categories; this is what we did in Borji and Lin (2020). The drawback, however, is the demand for a large number of queries. Here, assuming white-box access to the model, I was able to lower the sample complexity dramatically: for 10 categories and 1000 samples per category, only 10,000 samples are needed in total, whereas in Borji and Lin (2020) we used samples on the order of millions. The introduced methods are also much more efficient.

The proposed method differs from existing backdoor attack detection works, which often rely on statistical analysis of the poisoned training dataset (e.g. Steinhardt et al. (2017); Turner et al. (2018)) or of the neural activations in different layers (e.g. Chen et al. (2018)). Here, instead, I derived the biases of the networks. It may be possible to improve upon the presented results by analyzing the individual adversarial examples in addition to the mean maps.

The preliminary results here, obtained without fine-tuning or regularization (as is done in related work, e.g. DeepDream), are promising. However, further effort is needed to scale this method up to datasets containing natural scenes, such as CIFAR or ImageNet.

Lastly, I wonder whether these methods can be used in neuroscience to understand the feature selectivity of neurons or perceptual biases in humans, similar to Bashivan et al. (2019). The current work also relates to neuropsychology and category learning research.

Acknowledgement. I would like to thank Google for making the Colaboratory platform available.


  • P. Bashivan, K. Kar, and J. J. DiCarlo (2019) Neural population control via deep image synthesis. Science 364 (6439), pp. eaav9436. Cited by: §4.
  • A. Borji and S. Lin (2020) White noise analysis of neural networks. International Conference on Learning Representations. Cited by: §4.
  • T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. arXiv preprint arXiv:1712.09665. Cited by: §1.
  • B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728. Cited by: §4.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423. Cited by: §1.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proc. ICLR, Cited by: §1, §1, §2.
  • T. Gu, B. Dolan-Gavitt, and S. Garg (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733. Cited by: §1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. Cited by: §2.
  • Y. Liu, X. Chen, C. Liu, and D. Song (2016) Delving into transferable adversarial examples and black-box attacks. CoRR abs/1611.02770. Cited by: §1.
  • Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2017) Trojaning attack on neural networks. Cited by: §1.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • J. Steinhardt, P. W. W. Koh, and P. S. Liang (2017) Certified defenses for data poisoning attacks. In Advances in neural information processing systems, pp. 3517–3529. Cited by: §4.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1.
  • A. Turner, D. Tsipras, and A. Madry (2018) Clean-label backdoor attacks. Cited by: §4.
  • B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723. Cited by: §1.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. CoRR abs/1611.03530. Cited by: §1.