Harnessing adversarial examples with a surprisingly simple defense

04/26/2020, by Ali Borji

I introduce a very simple method to defend against adversarial examples. The basic idea is to raise the slope of the ReLU function at test time. Experiments over the MNIST and CIFAR-10 datasets demonstrate the effectiveness of the proposed defense against a number of strong attacks in both untargeted and targeted settings. While perhaps not as effective as state-of-the-art adversarial defenses, this approach can provide insights to understand and mitigate adversarial attacks. It can also be used in conjunction with other defenses.


Code repository: ReLU_defense (ReLU defense against adversarial attacks).

1 Introduction

The lockdown has not been too bad! After all, we found some time to explore what we always wanted to but never had time for. For me, I have been passing the time by immersing myself in adversarial ML. I was playing with the adversarial examples tutorial in PyTorch (https://pytorch.org/tutorials/beginner/fgsm_tutorial.html) and came across something interesting. So, I decided to share it with you. It is a simple defense that works well against untargeted attacks, and to some extent against targeted ones. Here is how it goes. The idea is to train a CNN with the ReLU activation function but increase its slope at test time (Fig. 1). Let's call this function Sloped ReLU, or SReLU for short: $\mathrm{SReLU}_\alpha(x) = \alpha x$ for $x > 0$ and $0$ otherwise, where $\alpha$ is the slope. SReLU becomes ReLU for $\alpha = 1$. To investigate this idea, I ran the following CNN (Fig. 8) over MNIST LeCun et al. (1998):

Conv → Pool → SReLU → Conv → Pool → SReLU → FC → SReLU → FC

and over CIFAR-10 Krizhevsky (2009) (referred to as CIFAR10-CNN1):

Conv → SReLU → Pool → Conv → SReLU → Pool → FC → SReLU → FC → SReLU → FC


I also tried a variation of the latter network with SReLUs only after the first two FC layers (referred to as CIFAR10-CNN2). I tested a range of slopes $\alpha$, both below and above 1.
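For concreteness, here is a minimal PyTorch sketch of how SReLU and the test-time slope change could be implemented. The `SReLU` module, the `set_slope` helper, and the layer sizes are my own illustrative choices (see Fig. 8 for the exact architectures), not the author's released code.

```python
import torch
import torch.nn as nn

class SReLU(nn.Module):
    """ReLU with an adjustable slope: SReLU_a(x) = a * x for x > 0, else 0."""
    def __init__(self, slope=1.0):
        super().__init__()
        self.slope = slope  # slope = 1 recovers the standard ReLU

    def forward(self, x):
        return torch.clamp(self.slope * x, min=0.0)

class SmallCNN(nn.Module):
    """MNIST-style Conv-Pool-SReLU-Conv-Pool-SReLU-FC-SReLU-FC network
    (channel counts and kernel sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        self.conv1, self.conv2 = nn.Conv2d(1, 32, 3), nn.Conv2d(32, 64, 3)
        self.pool = nn.MaxPool2d(2)
        self.act1, self.act2, self.act3 = SReLU(), SReLU(), SReLU()
        self.fc1, self.fc2 = nn.Linear(64 * 5 * 5, 128), nn.Linear(128, 10)

    def forward(self, x):
        x = self.act1(self.pool(self.conv1(x)))
        x = self.act2(self.pool(self.conv2(x)))
        x = self.act3(self.fc1(x.flatten(1)))
        return self.fc2(x)

def set_slope(model, slope):
    """Switch every SReLU in the model to the given test-time slope."""
    for m in model.modules():
        if isinstance(m, SReLU):
            m.slope = slope
```

Training proceeds as usual with the default slope of 1; the defense is simply a call such as `set_slope(model, 5.0)` before classifying a (possibly perturbed) test image.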

Figure 1: Sketch of the proposed defense. Just increase the ReLU slope at test time!

2 Experiments and results

Here, I focus on the FGSM attack since it is very straightforward Goodfellow et al. (2015). To craft an adversarial example $x'$, FGSM adds a tiny portion of the gradient of the loss w.r.t. the input image back to the input image (i.e. gradient ascent on the loss):

$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right)$ (1)

The perturbed image needs to be clipped to the valid range (here $[0,1]$ in all experiments over both datasets). $\epsilon$ balances the attack success rate versus imperceptibility. $\epsilon = 0$ corresponds to the model performance on the original test set (i.e. no perturbation). To gain a higher attack rate, more perturbation (i.e. larger $\epsilon$) is needed, which leads to a more noticeable change (and vice versa).
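A minimal PyTorch sketch of the untargeted FGSM step in Eq. (1), in the spirit of the tutorial mentioned above; the function name and signature are mine, not part of any released code.

```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, epsilon):
    """One FGSM step: move the input in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # gradient ascent on the loss
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid [0, 1] range
```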

The above formulation is for the untargeted attack. For the targeted attack, instead of increasing the loss for the true class label, we can lower the loss for the desired target class label (or we could do both):

$x' = x - \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y_{\mathrm{target}})\right)$ (2)
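The targeted variant of Eq. (2) differs only in the label used and in the sign of the step; again, this is a sketch rather than the author's implementation.

```python
def fgsm_targeted(model, x, y_target, epsilon):
    """Targeted FGSM: move the input in the direction that decreases the loss
    for the desired target class."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()    # gradient descent on the target-class loss
    return x_adv.clamp(0.0, 1.0).detach()
```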

In addition to FGSM, I also considered a number of other strong attacks including BIM (also known as IFGSM; iterative FGSM) Kurakin et al. (2016), RFGSM Tramèr et al. (2017), StepLL Kurakin et al. (2016), PGD Madry et al. (2017), and DeepFool Moosavi-Dezfooli et al. (2016). In almost all of these attacks (except DeepFool, for which I varied the number of iterations), there is a parameter that controls the magnitude of the perturbation (here represented by $\epsilon$). In what follows I will show the results over the MNIST and CIFAR-10 datasets against both untargeted and targeted attacks.
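For illustration, here is a hedged sketch of such an iterative attack (a PGD/BIM-style loop); the step size and iteration count are arbitrary defaults, not the settings used in the experiments.

```python
def pgd_untargeted(model, x, y, epsilon, step_size=0.01, steps=40, random_start=True):
    """Iterative FGSM with projection onto the L-infinity ball of radius epsilon.
    With random_start=False this reduces to BIM."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    if random_start:
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        # project back into the epsilon ball and the valid pixel range
        x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```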

Figure 2: Performance of the ReLU defense against untargeted attacks. Left column: MNIST. Middle column: CIFAR-10 with all ReLUs replaced with SReLU, i.e. CIFAR10-CNN1. Right column: CIFAR-10 with SReLUs only after the first two FC layers, i.e. CIFAR10-CNN2. All 10K test images of each dataset were used. Rows from top to bottom: ReLU defense against the FGSM attack, the defense success rate (i.e. accuracy loss recovery) as a function of SReLU slope averaged over all epsilons (iterations for DeepFool), and the effect of switching to a different activation function against the FGSM attack. Increasing the SReLU slope improves the performance significantly, except over StepLL and DeepFool attacks.

$\epsilon = 0$ corresponds to the classifier accuracy with no input perturbation. Higher y value here means better defense.
Figure 3: Results of the ReLU defense against untargeted attacks over MNIST (left column), CIFAR10-CNN1 (middle), and CIFAR10-CNN2 (right). Higher y value here means better defense.

2.1 Performance against untargeted attacks

Results are shown in Fig. 2 over the entire test sets of both datasets. I did not find SReLU slopes smaller than one very interesting, but I am including them here for the sake of comparison. Their effect is not consistent. They are useful against targeted attacks but almost always hinder both classifier accuracy and robustness in untargeted attack settings.

Over MNIST, at $\alpha = 1$ and $\epsilon = 0$ (top left panel in Fig. 2) classifier accuracy is around 98% and gradually falls as $\epsilon$ grows. For $\alpha > 1$, the performance loss starts to recover. Finally, for sufficiently large $\alpha$ the classifier is not impacted at all, hence the attack fails completely. Classifier performance as a function of the slope (averaged over epsilons) for each attack type is shown in the middle row of Fig. 2. The ReLU defense does very well against the FGSM, BIM, RFGSM and PGD attacks. Over StepLL and DeepFool, however, it neither helped nor harmed the classifier. Results over the CIFAR-10 dataset are consistent with the MNIST results (middle and right columns in Fig. 2, corresponding to CIFAR10-CNN1 and CIFAR10-CNN2, respectively). On this dataset, the classifier had around 60% accuracy on the clean test set ($\epsilon = 0$), which went down with more perturbation. Here again, increasing $\alpha$ led to a better defense, although sometimes at the expense of accuracy. Please see Fig. 3 for the performance of each attack type as a function of the perturbation magnitude.
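The evaluation behind Figs. 2 and 3 can be read as a sweep over slopes and perturbation magnitudes. Below is a hedged sketch of such a sweep, reusing the `set_slope` and `fgsm_untargeted` sketches from above; it assumes the attack is crafted against the undefended (slope-1) model and the raised slope is applied only at prediction time, which is my reading of the setup rather than a confirmed detail.

```python
import torch

def evaluate_defense(model, loader, attack_fn, slopes, epsilons, device="cpu"):
    """Accuracy under attack for each (slope, epsilon) pair.
    attack_fn(model, x, y, eps) should return perturbed images, e.g. fgsm_untargeted."""
    results = {}
    for eps in epsilons:
        for slope in slopes:
            correct, total = 0, 0
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                set_slope(model, 1.0)                # attacker sees the trained model
                x_adv = attack_fn(model, x, y, eps)
                set_slope(model, slope)              # defense: raise the slope at test time
                with torch.no_grad():
                    pred = model(x_adv).argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.numel()
            results[(slope, eps)] = correct / total
    return results
```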

Are these findings specific to SReLU function? To answer this question, I swapped the SReLU with other activation functions and measured the FGSM performance. Now the attacks became even more damaging (Fig. 2; bottom row) indicating that SReLU does better.

To assess this defense against a wider range of adversarial attacks, I conducted a larger-scale analysis using the Foolbox code repository Rauber et al. (2017) over the first 2K images of each dataset. The proposed defense was tested against 25 attacks utilizing different $L_p$ norms or image transformations (e.g. Gaussian blurring). The top row in Fig. 4 shows the mean classifier accuracy averaged over all attacks. Increasing $\alpha$ improved the defense, although there seems to be a trade-off between accuracy and robustness (compare the largest $\alpha$ vs. $\alpha = 1$ in the top row of Fig. 4). Looking across the attacks (bottom row in Fig. 4, shown only for the largest $\epsilon$ used over each dataset), however, reveals that the boost in improvement comes from only a few attacks against which the defense has done a great job, in particular FGSM and its variants (e.g. PGD). Against the VirtualAdversarial attack, the proposed defense made the situation much worse. Overall, it seems that this defense works better against gradient-based attacks as opposed to image-degradation ones such as additive noise, salt-and-pepper noise, and the inversion attack. Nonetheless, the ReLU defense was able to recover a performance loss of around 15% over MNIST and around 10-15% over CIFAR-10 (averaged over attacks; compare $\alpha = 1$ vs. the largest $\alpha$ in the bottom row of Fig. 4).
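For reference, this is roughly how such a multi-attack sweep could look with a recent Foolbox (3.x) API; note that the paper used Foolbox v0.8.0, whose interface differs, and the three attacks listed here are an illustrative subset, not the author's exact list of 25.

```python
import foolbox as fb

def foolbox_eval(model, images, labels, epsilons):
    """Robust accuracy of a (possibly slope-defended) model under a few Foolbox attacks."""
    fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
    attacks = {
        "FGSM": fb.attacks.LinfFastGradientAttack(),
        "PGD": fb.attacks.LinfPGD(),
        "DeepFool": fb.attacks.LinfDeepFoolAttack(),
    }
    accuracies = {}
    for name, attack in attacks.items():
        _, _, success = attack(fmodel, images, labels, epsilons=epsilons)
        # success[i, j] is True if the attack fooled the model on image j at epsilons[i]
        accuracies[name] = 1.0 - success.float().mean(dim=-1)
    return accuracies
```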

Figure 4: Results of the ReLU defense against 25 untargeted attacks using the Foolbox code repository, over MNIST (left column), CIFAR10-CNN1 (middle), and CIFAR10-CNN2 (right). The first 2K images from each dataset were used. Top row shows the average performance over all attacks. Bottom row shows the performance at the largest epsilon used for each dataset, for each attack. Higher y value here means better defense.
Figure 5: Performance of the ReLU defense against the targeted FGSM attack over MNIST (top row), CIFAR10-CNN1 (middle row), and CIFAR10-CNN2 (bottom row). Lower y value here means better defense.
Figure 6: Performance of the ReLU defense against the targeted FGSM attack over MNIST (top rows), CIFAR10-CNN1 (middle rows), and CIFAR10-CNN2 (bottom rows). Each plot corresponds to converting images to a different target class (images are only attacked if they have not already been classified as the target class). For some classes, the defense exacerbates the problem and helps the attack. On average, however, the defense seems to work, although not as effectively as in the untargeted setting. Lower y value here means better defense.

2.2 Performance against targeted attacks

Performance of the ReLU defense against the targeted FGSM attack is reported in Fig. 5 over the full test sets of the MNIST and CIFAR-10 datasets. The y axis here shows the success rate of the attack in fooling the CNN into classifying an image as the desired target class (if it has not already been classified as such). As expected, increasing $\epsilon$ (the perturbation magnitude) leads to a higher attack success rate. Over both datasets, increasing the SReLU slope (above 1) reduces the attack success rate, which means a better defense. Here, the ReLU defense seems to perform about the same for all slopes greater than one.

Defense results over individual classes are shown in Fig. 6. Increasing the slope helped in 6 out of 10 classes using both the MNIST CNN and CIFAR10-CNN1. Using CIFAR10-CNN2, 4 classes were defended better. In the remaining cases, the defense either tied with the slope of 1 (i.e. no defense) or slightly deteriorated the robustness. On average, it seems that the ReLU defense is effective against the targeted FGSM attack, although not as effective as against the untargeted FGSM.

3 Discussion and Conclusion

What is happening? I suspect this defense works because it enhances the signal-to-noise ratio. In the untargeted attack setting, increasing the loss leads to suppressing pixels that are important in making the correct prediction while boosting irrelevant features. Since relevant features are associated with higher weights, increasing the SReLU slope will enhance those features more than the irrelevant ones, hence recovering the good features. For the same reason, this defense is not very effective against targeted attacks. In that setting, the attacker aims to lower the loss in favor of the target class. Raising the SReLU slope enhances features that are important to the target class as well as to the true class. This ignites a competition between the two sets of features, which can go either way.

The above explanation is reminiscent of attention mechanisms in human vision (and also in computer vision), in particular neural gain modulation theories. These theories explain behavioral and neural data in a variety of visual tasks such as discrimination and visual search Scolari and Serences (2010); Borji and Itti (2014). On a related note, elevating the bias of neurons in different layers, as is done in Borji and Lin (2020), may lead to similar observations.

Is increasing the SReLU slope the same as scaling up the image by a factor? In general, no. Notice that applying the raised slope to the pre-activation, $\max(0, \alpha(wx + b))$, is not the same as feeding a scaled input through a standard ReLU, $\max(0, w(\alpha x) + b)$, since $\alpha(wx + b) \ne \alpha w x + b$ for $b \ne 0$ and $\alpha \ne 1$. For a linear network with positive weights the answer is yes, but for non-linear CNNs comprised of positive and negative weights the answer is no. Just to make sure, I ran an experiment in which I multiplied the pixel values by a constant factor and computed the classifier performance under the untargeted FGSM attack. Results are shown in Fig. 7. Clipping the pixel values to the $[0,1]$ range (after scaling) did not improve the robustness. Interestingly, without clipping the pixel values, the results resemble those obtained by increasing the SReLU slope (top right panel in Fig. 7; compare with Fig. 2). This suggests that maybe instead of increasing the SReLU slope we could just scale the pixel values! Results over the CIFAR-10 dataset, however, do not support this hypothesis. On this dataset, scaling does not help robustness with or without clipping. The discrepancy between the two datasets can be attributed to the fact that MNIST digits are gray level whereas CIFAR-10 images are RGB. Increasing the pixel intensity of MNIST digits maintains high classifier accuracy while at the same time making the attacker's job harder, since they now have to increase the magnitude of the perturbation. In conclusion, this analysis suggests that increasing the pixel values is not as effective as increasing the SReLU slope. This, however, needs further investigation. If the opposite turns out to be true, it would have the following unfortunate consequence: to counter the ReLU defense, the attacker could simply submit a scaled-down version of the input image.
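A tiny numerical check of this distinction, with arbitrary weight, bias, and scaling values: once a bias is present, raising the activation slope and scaling the input give different results.

```python
import torch

w, b, k = torch.tensor(1.5), torch.tensor(-2.0), 3.0
x = torch.tensor(1.0)

slope_raised = torch.clamp(k * (w * x + b), min=0.0)   # SReLU_k applied to the pre-activation
input_scaled = torch.clamp(w * (k * x) + b, min=0.0)   # standard ReLU on a scaled input

print(slope_raised.item(), input_scaled.item())
# slope_raised = max(0, 3 * (1.5 - 2.0)) = 0.0
# input_scaled = max(0, 4.5 - 2.0)       = 2.5  -> the two differ whenever b != 0
```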

Last but not least, the fact that this simple defense consistently works against some strong attacks (e.g. PGD) is surprising. But please take these findings with a healthy dose of scepticism, as I am still investigating them. Future work should evaluate the proposed approach on other models (e.g. ResNet), datasets (e.g. ImageNet), and against black-box attacks.

Figure 7: Impact of scaling up pixel values against the untargeted FGSM attack over the MNIST (top) and CIFAR-10 (bottom; using CIFAR10-CNN1) datasets. The left column shows results with the pixel values clipped to the [0,1] range after scaling. The right column shows results without clipping. The legend represents the magnitude of scaling. Higher y value here means better defense.
Figure 8: CNN architectures used in the experiments (top: MNIST, bottom: CIFAR10-CNN1).

Acknowledgement. I would like to thank Google for making the Colaboratory platform available.

References

  • A. Borji and L. Itti (2014) Optimal attentional modulation of a neural population. Frontiers in computational neuroscience 8, pp. 34. Cited by: §3.
  • A. Borji and S. Lin (2020) White noise analysis of neural networks. International Conference on Learning Representations. Cited by: §3.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proc. ICLR, Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. Cited by: §2.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proc. IEEE. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083. Cited by: §2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, pp. 2574–2582. Cited by: §2.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. CoRR abs/1707.04131. Cited by: §2.1.
  • M. Scolari and J. T. Serences (2010) Basing perceptual decisions on the most informative sensory neurons. Journal of neurophysiology 104 (4), pp. 2266–2273. Cited by: §3.
  • F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. D. McDaniel (2017) Ensemble adversarial training: attacks and defenses. CoRR abs/1705.07204. Cited by: §2.