Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training

06/10/2021
by   Dawei Zhou, et al.

Deep neural networks (DNNs) are vulnerable to adversarial noise. A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise, among which input pre-processing methods are scalable and show great potential to safeguard DNNs. However, pre-processing methods may suffer from the robustness degradation effect, in which the defense reduces rather than improves the adversarial robustness of a target model in a white-box setting. A potential cause of this negative effect is that adversarial training examples are static and independent of the pre-processing model. To solve this problem, we investigate the influence of full adversarial examples, which are crafted against the full model, and find that they indeed have a positive impact on the robustness of defenses. Furthermore, we find that simply changing the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This is because the adversarial risk of the pre-processing model is neglected, which is another cause of the robustness degradation effect. Motivated by the above analyses, we propose a method called Joint Adversarial Training based Pre-processing (JATP) defense. Specifically, we formulate a feature similarity based adversarial risk for the pre-processing model by using full adversarial examples found in a feature space. Unlike standard adversarial training, we only update the pre-processing model, which prompts us to introduce a pixel-wise loss to improve its cross-model transferability. We then conduct a joint adversarial training on the pre-processing model to minimize this overall risk. Empirical results show that our method effectively mitigates the robustness degradation effect across different target models in comparison to previous state-of-the-art approaches.


1 Introduction

Although deep neural networks (DNNs) achieve great success in applications such as computer vision (He et al., 2016; He et al., 2017; Dosovitskiy et al., 2020), speech recognition (Wang et al., 2017) and natural language processing (Sutskever et al., 2014; Devlin et al., 2018), they are found to be vulnerable to adversarial examples, which are crafted by adding imperceptible adversarial noise to natural examples (Szegedy et al., 2014; Goodfellow et al., 2015; Tramer et al., 2020). Adversarial examples can remain destructive in the physical world (Wu et al., 2020d; Duan et al., 2020) and transfer across different models (Wu et al., 2020a; Huang and Zhang, 2019). The vulnerability of DNNs raises security concerns about their reliability in decision-critical deep learning applications, e.g., autonomous driving (Eykholt et al., 2018) and person recognition (Xu et al., 2020).

A major class of adversarial defenses pre-processes adversarial examples to mitigate the interference of adversarial noise without modifying the target model (Xu et al., 2017; Feinman et al., 2017; Liao et al., 2018). Among pre-processing methods, compared with feature squeezing (Guo et al., 2018; Xu et al., 2017) and adversarial detection (Ma et al., 2018; Qin et al., 2019), input denoising (Liao et al., 2018; Jin et al., 2019; Samangouei et al., 2018; Schott et al., 2018; Ghosh et al., 2019; Lin et al., 2019; Naseer et al., 2020) can remove adversarial noise while sufficiently retaining the original information of the natural examples. Thus, input denoising methods are widely studied and have shown great potential to safeguard DNNs from adversarial attacks.

However, many denoising methods only provide security against oblivious attacks (Athalye and Carlini, 2018; Tramer et al., 2020), where an attacker is completely unaware that a defense is being applied. Unfortunately, security against oblivious attacks is far from sufficient in practice: a serious attacker would certainly consider the possibility that a defense is used (Athalye and Carlini, 2018). Recent studies (Carlini and Wagner, 2017a; Tramer et al., 2020) show that the effectiveness of denoising methods is significantly weakened in an adaptive threat model, where an attacker has full access to the pre-processing model or can estimate its knowledge (e.g., gradients) (Carlini and Wagner, 2017b). In particular, for a robust target model that is adversarially trained via state-of-the-art adversarial training strategies (Madry et al., 2018; Ding et al., 2019; Wang et al., 2019a; Zhang et al., 2019), the empirical results in Fig 1(a) and 1(b) show that using denoising defenses to pre-process inputs may significantly reduce rather than improve the adversarial robustness against worst-case (i.e., white-box) adaptive attacks. We refer to this adversarial robustness as white-box robustness, and to this phenomenon as the "robustness degradation effect".

Figure 1: Visualization of the robustness degradation effect. Panels: (a) Standard, (b) TRADES, (c) comparison on TRADES. We evaluate the white-box robustness (accuracy under white-box adaptive attacks) of three pre-processing defenses: APE-G (Jin et al., 2019), HGD (Liao et al., 2018) and NRP (Naseer et al., 2020) on CIFAR-10 (Krizhevsky et al., 2009). The target models are adversarially trained via two adversarial training strategies: Standard (Madry et al., 2018) and TRADES (Zhang et al., 2019). We combine the adaptive attack strategy with three attacks, PGD (Madry et al., 2018), AA (Croce and Hein, 2020) and FWA (Wu et al., 2020c), to craft adversarial examples. "None" denotes that no pre-processing defense is used. "Obl" denotes the pre-processing model trained using oblivious adversarial examples, and "Full" denotes the model trained using full adversarial examples.

A potential cause of this negative effect is that the adversarial training examples are static and independent of the pre-processing model. Recall that adversarial examples used in pre-processing methods are typically crafted only against the target model. Since the target model is pre-trained and its parameters are fixed, these oblivious adversarial examples specific to the target model are also fixed and not associated with the pre-processing model. From this perspective, using such training examples cannot guarantee that the defense model can effectively deal with adaptive attacks. To solve this problem, a natural idea is to use full adversarial examples, which are crafted against the full model (a single model composed of a pre-processing model and a target model), in place of oblivious adversarial examples. We investigate the influence of full adversarial examples on robustness and find that they have a more positive impact than oblivious adversarial examples. Specifically, we conduct a proof-of-concept experiment in a white-box setting. As shown in Fig 1(c), compared to the original pre-processing models, the models trained with full adversarial examples achieve higher accuracy. The details of this experiment are presented in Section 2.2. This discovery inspires us to use full adversarial examples to train the pre-processing model. In addition, to prevent label leakage (Kurakin et al., 2016; Zhang and Wang, 2019) from affecting the defense's generalization to unseen attacks, we make the full adversarial examples independent of the label space by maximally disrupting the deep features of natural examples on an internal layer of the full model.

Note that the results in Figure 1(c) also show that simply modifying the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This may be caused by the vulnerability of the pre-processing model. Denoising defenses typically exploit a generative network to train the pre-processing model for recovering adversarial examples. Unfortunately, existing works (Gondim-Ribeiro et al., 2018; Kos et al., 2018; Chen et al., 2020; Sun et al., 2020) have demonstrated that classic generative models are vulnerable to adversarial attacks. An attacker can mislead the pre-processing model into generating an output with respect to a wrong class by disrupting the recovered example as much as possible. Pre-processing methods typically focus only on the ability of the pre-processing model to remove oblivious adversarial noise, but overlook the risk of the model itself being perturbed, which results in insufficient adversarial robustness. To address this issue, we formulate an adversarial risk for the pre-processing model that exploits the full adversarial examples to improve the inherent robustness of the pre-processing model instead of merely using them to learn the denoising mapping. Corresponding to the full adversarial examples found in feature space, we introduce a feature similarity term to measure the distance between natural and full adversarial examples. By this design, the adversarial risk is expected to reduce the distortion in both label space and feature space. The details of the adversarial risk are presented in Section 2.3.

Motivated by the above analyses, we propose a Joint Adversarial Training based Pre-processing (JATP) defense. Specifically, we use full adversarial examples found in the feature space as the supervision signal to train the pre-processing model. Then, we formulate a feature similarity based adversarial risk for the pre-processing model to improve its inherent robustness by using the above full adversarial examples. Note that unlike standard adversarial training, which updates the target model itself, our method only updates the parameters of the pre-processing model and must ensure that these parameters remain suitable for different target models. This requires us to improve the cross-model transferability. Considering that the natural examples used by different target models are consistent and contain no adversarial patterns, we introduce a pixel-wise loss to reduce the distance between the pre-processed and natural examples. Based on the above designs, we conduct a joint adversarial training on the pre-processing model to minimize this overall risk in a dynamic manner. Experimental results in Section 4 show that our method effectively mitigates the robustness degradation effect against unseen types of white-box adaptive attacks in comparison to previous pre-processing defenses.

The main contributions in this paper are as follows:

  • We analyze two potential factors that cause the robustness degradation effect: (1) the adversarial training examples used in pre-processing methods are independent of the pre-processing model; and (2) the inherent robustness of the pre-processing model is insufficient because its adversarial risk is neglected during the training process.

  • We first formulate a feature similarity based adversarial risk for the pre-processing model to improve its inherent robustness by using adversarial examples crafted in a feature space against the full model. Then, we introduce a pixel-wise loss to improve the cross-model transferability of the pre-processing model. We propose a Joint Adversarial Training based Pre-processing (JATP) defense to minimize the overall risk in a dynamic manner.

  • Experimentally, we demonstrate that the JATP defense significantly improves the white-box robustness of pre-processing defenses against adaptive attacks and mitigates the robustness degradation effect compared with the state-of-the-art. In addition, it can be applied to safeguard different target models without additional training procedures.

The rest of this paper is organized as follows. In Section 2, we analyze the robustness degradation effect. In Section 3, we describe our defense method and present its implementation. Experimental results on different datasets are provided in Section 4. Finally, we conclude this paper in Section 5. In addition, we briefly review related work on attacks and defenses in supplementary material A.

2 Analyzing the robustness degradation effect

In this section, we analyze two potential causes of the robustness degradation effect and explore solutions to them.

2.1 Preliminaries

We first define the notation. We use bold lower-case letters (e.g., $\mathbf{x}$, $\mathbf{z}$) and lower-case letters (e.g., $y$, $k$) to denote vectors and scalars respectively. We use upper-case calligraphic symbols such as $\mathcal{F}$ and $\mathcal{G}$ to denote models.

In the setting of a $K$-class ($K \geq 2$) classification problem, we are given a dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $\mathbf{x}_i$ as a natural example and $y_i \in \{1, \dots, K\}$ as its corresponding label. Let $\mathcal{F}(\cdot\,; \theta)$ represent a target classification model with model parameters $\theta$. The predicted class of an input example $\mathbf{x}$ is formulated as:

$\mathcal{F}(\mathbf{x}; \theta) = \arg\max_{k = 1, \dots, K} p_k(\mathbf{x}; \theta)$,   (1)

where $h_k(\mathbf{x}; \theta)$ is the logits output of the target model w.r.t. class $k$, and $p_k(\mathbf{x}; \theta)$ approximates the probability (softmax on logits) that $\mathbf{x}$ belongs to the $k$-th class, i.e., $p_k(\mathbf{x}; \theta) = \exp(h_k(\mathbf{x}; \theta)) / \sum_{j=1}^{K} \exp(h_j(\mathbf{x}; \theta))$. We denote by $\tilde{\mathbf{x}}$ an adversarial example, and by $\mathcal{B}_\epsilon(\mathbf{x})$ the ball centered at $\mathbf{x}$ with radius $\epsilon$.

Let $\mathcal{G}(\cdot\,; \omega)$ represent a pre-processing model with model parameter $\omega$. Given an adversarial example $\tilde{\mathbf{x}}$, the recovered example is denoted by $\hat{\mathbf{x}} = \mathcal{G}(\tilde{\mathbf{x}}; \omega)$, where $\hat{\mathbf{x}}$ is the output of the pre-processing model. We use $\hat{\mathcal{F}}(\cdot\,; \omega, \theta)$ to represent a full model with model parameters $(\omega, \theta)$. The full network is composed of a pre-processing network $\mathcal{G}$ and a target network $\mathcal{F}$. The class of an input adversarial example $\tilde{\mathbf{x}}$ predicted by $\hat{\mathcal{F}}$ can be represented by:

$\hat{\mathcal{F}}(\tilde{\mathbf{x}}; \omega, \theta) = \arg\max_{k = 1, \dots, K} \hat{p}_k(\tilde{\mathbf{x}}; \omega, \theta)$,   (2)

where $\hat{h}_k(\tilde{\mathbf{x}}; \omega, \theta)$ is the logits output of the full model w.r.t. class $k$, and $\hat{p}_k(\tilde{\mathbf{x}}; \omega, \theta)$ approximates the probability that $\tilde{\mathbf{x}}$ belongs to the $k$-th class. $\hat{\mathcal{F}}$ can be expressed as the output of a combined defended model: $\hat{\mathcal{F}}(\tilde{\mathbf{x}}; \omega, \theta) = \mathcal{F}(\mathcal{G}(\tilde{\mathbf{x}}; \omega); \theta)$, where $\omega$ and $\theta$ respectively denote the parameters of the pre-processing network and the classification network.
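For concreteness, the composition in Eq. (2) can be sketched in PyTorch as a wrapper that chains a trainable pre-processing network with a frozen target classifier. This is a minimal illustrative sketch (the class and function names are ours, not the released implementation):

```python
import torch
import torch.nn as nn

class FullModel(nn.Module):
    """Composes a pre-processing network G(.; omega) with a frozen target classifier F(.; theta)."""
    def __init__(self, preprocessor: nn.Module, target: nn.Module):
        super().__init__()
        self.preprocessor = preprocessor        # trainable pre-processing model G
        self.target = target                    # pre-trained target model F
        for p in self.target.parameters():      # theta stays fixed throughout training
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_hat = self.preprocessor(x)            # recovered example  x_hat = G(x)
        return self.target(x_hat)               # logits of the full model F_hat(x)

def predict(full_model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Predicted class of Eq. (2): argmax over the softmax of the full-model logits."""
    with torch.no_grad():
        return full_model(x).softmax(dim=1).argmax(dim=1)
```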

Figure 2: (a) The distinctive influence of oblivious and full adversarial examples on CIFAR-10. (b) A visual illustration of natural examples, adversarial examples and pre-processed examples. The adversarial examples are crafted by an adaptive PGD attack.

2.2 Absence of full adversarial examples

Pre-processing defenses typically first exploit one or several attacks to craft adversarial examples as adversarial training examples. These training attacks are usually applied only to the target model $\mathcal{F}$ without considering the pre-processing model $\mathcal{G}$, and are thus called oblivious attacks (Athalye and Carlini, 2018; Tramer et al., 2020). Correspondingly, we call their adversarial examples oblivious adversarial examples. An oblivious adversarial example $\tilde{\mathbf{x}}_{\mathrm{obl}}$ is crafted by solving the optimization problem $\tilde{\mathbf{x}}_{\mathrm{obl}} = \arg\max_{\tilde{\mathbf{x}} \in \mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(\mathcal{F}(\tilde{\mathbf{x}}; \theta) \neq y\big)$, where $\mathbb{1}(\cdot)$ denotes the indicator function.

However, adaptive attacks tend to assume that a pre-processing defense model has been deployed. They can optimize a combined loss function on the full model $\hat{\mathcal{F}}$ (Tramer et al., 2020) to craft adaptive adversarial examples. Unfortunately, for most pre-processing defenses, the oblivious adversarial examples are specific to the fixed target model $\mathcal{F}$ and not associated with the pre-processing model $\mathcal{G}$. The influence of the gradient passing through $\mathcal{G}$ on the adversarial robustness of the defense is ignored. From this perspective, using such training examples cannot guarantee that the defense model can effectively deal with adaptive attacks. To solve this problem, a natural idea is to use full adversarial examples, which are crafted against the full model, in place of oblivious adversarial examples. To verify this idea, we investigate the influence of full adversarial examples on robustness and find that they indeed have a more positive impact than oblivious adversarial examples. Specifically, we conduct a proof-of-concept experiment on CIFAR-10 to present the distinctive influence of oblivious and full adversarial examples.

We consider two pre-processing defense methods, APE-G and HGD, which perform well against oblivious attacks. The pre-processing models are trained using PGD with the original training strategies in (Jin et al., 2019; Liao et al., 2018). Given two robust target models that are adversarially trained via Standard (Madry et al., 2018) and TRADES (Zhang et al., 2019), we evaluate the white-box robustness of the original pre-processing models (denoted with the suffix "-obl") using white-box adaptive attacks (see the blue bars in Fig 2). The details of these test attacks can be found in Section 4.2. We then use the same PGD attack to craft full adversarial examples for retraining the pre-processing models:

$\tilde{\mathbf{x}}_{\mathrm{full}} = \arg\max_{\tilde{\mathbf{x}} \in \mathcal{B}_\epsilon(\mathbf{x})} \mathbb{1}\big(\hat{\mathcal{F}}(\tilde{\mathbf{x}}; \omega', \theta) \neq y\big)$,   (3)

where $\omega'$ denotes the parameters of the retrained pre-processing models. We use the same loss functions as those in their original papers, and conduct adversarial training with 40 iterations. The white-box robustness of the retrained pre-processing models is significantly better than that of the original models (see the orange bars in Fig 2). The results of this proof-of-concept experiment demonstrate that using full adversarial examples can improve the white-box robustness of a pre-processing defense. This discovery inspires us to use full adversarial examples to train the pre-processing model.
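The distinction between oblivious and full adversarial examples can be made concrete with a standard $\ell_\infty$ PGD helper: crafting against the target model alone yields oblivious examples, while crafting against the composed full model (the wrapper sketched in Section 2.1) yields full adversarial examples. The helper below is a generic sketch, not the exact attack configuration of the experiments; the default step size and budget are assumptions:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=0.007, steps=40):
    """L-infinity PGD that maximizes the cross-entropy loss of `model`.
    Pass the target classifier for oblivious adversarial examples, or the
    composed full model (preprocessor + target) for full adversarial examples."""
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                   # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)     # project to the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                             # keep a valid image
    return x_adv.detach()
```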

2.3 Vulnerability of the pre-processing model

Note that the results in Figure 2 also show that simply modifying the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This may be caused by the vulnerability of the pre-processing model. An attacker can mislead the pre-processing model into generating an output with respect to a wrong class by disrupting the pre-processed example as much as possible. As shown in Fig 2, the distance between the adversarial examples and the natural examples is 7.51 on average, whereas the distance between the pre-processed examples and the natural examples is 18.68 on average. The pre-processed examples contain more undesirable noise than the adversarial examples, which indicates that the inherent robustness of the pre-processing model is not sufficient. Therefore, the approach for training a pre-processing defense model should involve the adversarial risk of the pre-processing model to enhance its white-box robustness.

Figure 3: A visual illustration of our Joint Adversarial Training based Pre-processing (JATP) defense. We use adversarial examples against the full model to train a pre-processing model that minimizes a hybrid loss composed of the pixel-wise loss $\mathcal{L}_{\mathrm{pix}}$ and the adversarial loss $\mathcal{L}_{\mathrm{adv}}$.

Since a white-box adaptive attack can exploit the knowledge of the full model to break the defense, we exploit the full adversarial examples to formulate an adversarial risk for the pre-processing model. In addition, Wang et al. (2019b) propose to explicitly differentiate the misclassified and correctly classified examples during training, and they design a misclassification-aware adversarial risk for improving the adversarial robustness of a target model. Inspired by their work, we formulate the adversarial risk of the pre-processing model as:

$\mathcal{R}_{\mathrm{adv}}(\omega) = \frac{1}{n}\sum_{i=1}^{n}\Big\{\mathbb{1}\big(\hat{\mathcal{F}}(\tilde{\mathbf{x}}_{\mathrm{full},i}; \omega, \theta) \neq y_i\big) + \mathbb{1}\big(\hat{\mathcal{F}}(\tilde{\mathbf{x}}_{\mathrm{full},i}; \omega, \theta) \neq \hat{\mathcal{F}}(\mathbf{x}_i; \omega, \theta)\big) \cdot \mathbb{1}\big(\hat{\mathcal{F}}(\mathbf{x}_i; \omega, \theta) \neq y_i\big)\Big\}$,   (4)

where $\hat{\mathcal{F}}(\cdot\,; \omega, \theta)$ is the class of the input predicted by the full model, and $\tilde{\mathbf{x}}_{\mathrm{full},i}$ denotes the full adversarial example. The second term is the misclassification-aware regularization term. Note that the target model is pre-trained and its model parameters $\theta$ are fixed during the training process. According to this adversarial risk, we can design a method to train a pre-processing defense model with better inherent adversarial robustness.

3 Proposed method

Motivated by the analyses in the previous section, we propose a Joint Adversarial Training based Pre-processing (JATP) defense to mitigate the robustness degradation effect. We use full adversarial examples to train a pre-processing model that minimizes a hybrid loss composed of a pixel-wise loss and an adversarial loss. Fig 3 shows the visual illustration of our proposed defense.

3.1 Adversarial training examples

Based on the proof-of-concept experiment in Section 2.2, we apply adversarial attacks to the full model $\hat{\mathcal{F}}$ to craft full adversarial examples for training the pre-processing model $\mathcal{G}$. In addition, Kurakin et al. (2016) point out that directly maximizing the cross-entropy loss in label space to craft adversarial training examples may lead to the label leakage problem, which causes a defense model to overfit to specific perturbations and thus harms its generalization to unseen attacks. Zhang and Wang (2019) show that using a highly transferable attack to craft adversarial examples is beneficial for improving the generalization of the defense model, and that finding the worst-case distortion in feature space can strengthen the transferability of an adversarial attack. Therefore, we redesign the adversarial training examples as follows:

$\tilde{\mathbf{x}}_{\mathrm{full}} = \arg\max_{\tilde{\mathbf{x}} \in \mathcal{B}_\epsilon(\mathbf{x})} d\big(f_l(\tilde{\mathbf{x}}; \omega, \theta),\, f_l(\mathbf{x}; \omega, \theta)\big)$,   (5)

where $f_l(\cdot\,; \omega, \theta)$ denotes the feature map of an input on an internal layer $l$ of the full model $\hat{\mathcal{F}}$, and $d(\cdot, \cdot)$ is a distance metric such as the $\ell_2$ distance, cosine similarity or Wasserstein distance (Arjovsky et al., 2017). In this paper, we use ResNet-18 (He et al., 2016) as the network of the target model during training. Since the bottom layers tend to learn low-level features while the deeper ones are too specific to the label space, we choose the last convolution layer in the third basic block to obtain the feature map. We also study the influence of feature maps from different internal layers on the white-box robustness, and present the results in supplementary material B.
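A possible way to realize the feature-space objective in Eq. (5) is to register a forward hook on an internal layer of the full model and run PGD on the feature distance. The sketch below is illustrative: the use of MSE as the distance $d$, the default attack settings, and the layer handle `target.layer3` (the third basic-block group in a torchvision-style ResNet-18) are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn.functional as F

def feature_space_pgd(full_model, feat_layer, x, eps=8/255, step_size=0.007, steps=10):
    """Craft full adversarial examples by maximizing the distance between the
    internal feature maps f_l(x_adv) and f_l(x) of the full model (Eq. 5), using MSE as d."""
    feats = {}
    handle = feat_layer.register_forward_hook(
        lambda module, inp, out: feats.update(cur=out))

    with torch.no_grad():
        full_model(x)                               # record the natural feature map f_l(x)
        feat_nat = feats["cur"].detach()

    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        full_model(x_adv)                           # forward pass records f_l(x_adv)
        loss = F.mse_loss(feats["cur"], feat_nat)   # feature distortion to maximize
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    handle.remove()
    return x_adv.detach()

# Example (assumed layer name): hook the third basic-block group of the ResNet-18 target.
# x_full = feature_space_pgd(full_model, full_model.target.layer3, x_batch)
```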

3.2 Loss function for JATP defense

We exploit a hybrid loss function to train our proposed JATP defense. The loss function is composed of two terms: pixel-wise loss and adversarial loss.

Pixel-wise loss. Our defense method aims to remove adversarial noise while preserving the original information of natural examples. Thus, we expect the recovered examples to be as close to the natural examples as possible. In addition, due to the similarity between adversarial patterns and noise, smoothing examples can help reduce the interference of adversarial patterns. We apply a pixel-wise distance loss to measure the distance between recovered examples and natural examples in pixel space:

$\mathcal{L}_{\mathrm{pix}}(\omega) = \big\| \mathcal{G}(\tilde{\mathbf{x}}_{\mathrm{full}}; \omega) - \mathbf{x} \big\|$.   (6)

We study the influence of this pixel-wise loss on our proposed defense in Section 4.3. This loss helps our pre-processing model retain the color and texture of natural examples, which enhances the transferability of our defense across different target models.

Adversarial loss. As presented in Section 2.3, the adversarial risk of the pre-processing model consists of two parts: (1) the standard adversarial term and (2) the misclassification-aware regularization term, both of which contain indicator functions. In practice, it is intractable to directly optimize the 0-1 loss of the indicator function, so we use appropriate surrogate loss functions to replace it.

For the standard adversarial term, inspired by Wang et al. (2019b), we use a boosted cross-entropy (BCE) loss to replace the 0-1 loss, instead of the commonly used CE loss (Madry et al., 2018; Liao et al., 2018). The BCE loss is defined as:

$\mathrm{BCE}\big(p(\hat{\mathbf{x}}_{\mathrm{full}}; \theta), y\big) = -\log\big(p_y(\hat{\mathbf{x}}_{\mathrm{full}}; \theta)\big) - \log\big(1 - \max_{k \neq y} p_k(\hat{\mathbf{x}}_{\mathrm{full}}; \theta)\big)$,   (7)

where $p$ is the probability output defined in Eq 1, $\hat{\mathbf{x}}_{\mathrm{full}} = \mathcal{G}(\tilde{\mathbf{x}}_{\mathrm{full}}; \omega)$ is the pre-processed output defined in Eq 2, and $\theta$ is the fixed parameters of the pre-trained target model $\mathcal{F}$. The first term is the commonly used cross-entropy loss, and the second term is used to reduce undesirable distortions generated by $\mathcal{G}$ by decreasing the probability of the pre-processed examples with respect to the wrong classes.
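A boosted cross-entropy term in the spirit of Eq. (7) (following the BCE of Wang et al. (2019b)) could be implemented as below; this is an illustrative sketch rather than the authors' exact code, and the small epsilon for numerical stability is our addition:

```python
import torch
import torch.nn.functional as F

def boosted_cross_entropy(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """BCE(p, y) = -log p_y - log(1 - max_{k != y} p_k), averaged over the batch.
    `logits` are the target-model outputs on the pre-processed (recovered) examples."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, y.unsqueeze(1)).squeeze(1)      # p_y
    probs_wrong = probs.clone()
    probs_wrong.scatter_(1, y.unsqueeze(1), 0.0)             # zero out the true class
    p_max_wrong = probs_wrong.max(dim=1).values              # max_{k != y} p_k
    eps = 1e-12                                              # numerical stability (our addition)
    return (-(p_true + eps).log() - (1.0 - p_max_wrong + eps).log()).mean()
```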

For the regularization term, the indicator $\mathbb{1}\big(\hat{\mathcal{F}}(\tilde{\mathbf{x}}_{\mathrm{full}}; \omega, \theta) \neq \hat{\mathcal{F}}(\mathbf{x}; \omega, \theta)\big)$ implies that adversarial examples have a different output distribution from that of natural examples. Considering that the output is directly determined by the feature maps on the internal layers of $\hat{\mathcal{F}}$, and that the full adversarial examples used to train the pre-processing model are found in the feature space, we use a feature similarity metric (FSM) to replace this first indicator function. The term is expressed as:

$\mathrm{FSM}\big(\tilde{\mathbf{x}}_{\mathrm{full}}, \mathbf{x}\big) = \mathrm{MSE}\big(f_l(\tilde{\mathbf{x}}_{\mathrm{full}}; \omega, \theta),\, f_l(\mathbf{x}; \omega, \theta)\big)$,   (8)

where $\mathrm{MSE}(\cdot, \cdot)$ is the mean square error metric, and $f_l$ is the feature map on the internal layer of $\hat{\mathcal{F}}$ defined in Section 3.1. The other indicator function, $\mathbb{1}\big(\hat{\mathcal{F}}(\mathbf{x}; \omega, \theta) \neq y\big)$, is a condition that emphasizes learning on misclassified examples (Wang et al., 2019b). This misclassification-based constraint used in adversarial training also helps to enhance the inherent robustness of the pre-processing model. We use a soft decision scheme, i.e., the output probability $1 - \hat{p}_y(\mathbf{x}; \omega, \theta)$, to replace this indicator function. The overall adversarial loss of the pre-processing model can be defined as follows:

$\mathcal{L}_{\mathrm{adv}}(\omega) = \mathrm{BCE}\big(p(\hat{\mathbf{x}}_{\mathrm{full}}; \theta), y\big) + \lambda \cdot \mathrm{FSM}\big(\tilde{\mathbf{x}}_{\mathrm{full}}, \mathbf{x}\big) \cdot \big(1 - \hat{p}_y(\mathbf{x}; \omega, \theta)\big)$,   (9)

where $\lambda$ is a positive hyperparameter.
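Putting the pieces of Eq. (9) together, one possible implementation of the adversarial loss combines the BCE term with the $\lambda$-weighted FSM regularizer gated by the soft misclassification weight $1 - \hat{p}_y(\mathbf{x})$. The helper names (`boosted_cross_entropy` from the Eq. (7) sketch) and the per-example MSE reduction are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(full_model, feat_layer, x, x_full_adv, y, lam=5.0):
    """L_adv = BCE(p(G(x_adv)), y) + lam * FSM(f_l(x_adv), f_l(x)) * (1 - p_y(x))."""
    feats = {}
    handle = feat_layer.register_forward_hook(
        lambda module, inp, out: feats.update(cur=out))

    logits_adv = full_model(x_full_adv)          # full model on full adversarial examples
    feat_adv = feats["cur"]                      # f_l(x_adv), keeps gradient to omega
    with torch.no_grad():
        logits_nat = full_model(x)               # natural examples (no gradient needed)
        feat_nat = feats["cur"].detach()         # f_l(x)
        p_y_nat = F.softmax(logits_nat, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    handle.remove()

    bce = boosted_cross_entropy(logits_adv, y)                 # standard adversarial term, Eq. (7)
    fsm = F.mse_loss(feat_adv, feat_nat, reduction="none")     # element-wise squared error
    fsm = fsm.flatten(1).mean(dim=1)                           # per-example FSM value, Eq. (8)
    return bce + lam * (fsm * (1.0 - p_y_nat)).mean()          # Eq. (9)
```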

The overall loss. Based on the pixel-wise loss and the adversarial loss, we present the overall loss for our proposed JATP defense:

$\mathcal{L}(\omega) = \mathcal{L}_{\mathrm{pix}}(\omega) + \beta \cdot \mathcal{L}_{\mathrm{adv}}(\omega)$,   (10)

where $\beta$ is a hyperparameter. The overall procedure is summarized in Algorithm 1.

0:  A pre-trained target model $\mathcal{F}(\cdot\,; \theta)$, a pre-processing model $\mathcal{G}(\cdot\,; \omega)$, a batch of natural examples $\{(\mathbf{x}_i, y_i)\}$, perturbation budget $\epsilon$, number of iterations $T$.
1:  Initialization;
2:  for $t = 1$ to $T$ do
3:     Craft full adversarial examples $\tilde{\mathbf{x}}_{\mathrm{full}}$ at the given perturbation budget $\epsilon$ using Eq. 5;
4:     Forward-pass $\tilde{\mathbf{x}}_{\mathrm{full}}$ through $\mathcal{G}$ and calculate $\mathcal{L}_{\mathrm{pix}}$ using Eq. 6;
5:     Forward-pass $\tilde{\mathbf{x}}_{\mathrm{full}}$ through $\hat{\mathcal{F}}$ and calculate $\mathcal{L}_{\mathrm{adv}}$ using Eq. 9;
6:     Back-pass and update $\omega$ to minimize $\mathcal{L}$ using Eq. 10;
7:  end for
8:  return $\omega$.
Algorithm 1 JATP: Joint Adversarial Training based Pre-processing defense
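A condensed PyTorch sketch of the training loop in Algorithm 1 is given below. It reuses the hypothetical helpers from the earlier sketches (`FullModel`, `feature_space_pgd`, `adversarial_loss`); the optimizer settings, the $\lambda$/$\beta$ mapping and the $\ell_1$ pixel distance standing in for Eq. (6) are placeholders, not the reported configuration:

```python
import torch

def train_jatp(preprocessor, target, feat_layer, loader,
               epochs=30, eps=8/255, lam=5.0, beta=3.0, device="cuda"):
    """Joint adversarial training of the pre-processing model (Algorithm 1, sketch only)."""
    full_model = FullModel(preprocessor, target).to(device)     # target parameters stay frozen
    optimizer = torch.optim.SGD(preprocessor.parameters(), lr=0.001,   # placeholder lr / decay
                                momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # Step 3: craft full adversarial examples in feature space (Eq. 5).
            x_full = feature_space_pgd(full_model, feat_layer, x, eps=eps)
            # Step 4: pixel-wise loss on the recovered examples (an l1 stand-in for Eq. 6).
            x_rec = preprocessor(x_full)
            loss_pix = (x_rec - x).abs().mean()
            # Step 5: adversarial loss of the full model (Eq. 9).
            loss_adv = adversarial_loss(full_model, feat_layer, x, x_full, y, lam=lam)
            # Step 6: update only the pre-processing parameters omega (Eq. 10).
            loss = loss_pix + beta * loss_adv
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return preprocessor
```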

4 Experiments

In this section, we first introduce the experimental setup used in this paper (Section 4.1). Then, we evaluate the adversarial robustness of our proposed JATP defense against unseen types of adaptive attacks (Section 4.2). Finally, we conduct the ablation and sensitivity studies to provide a further understanding of our defense method (Section 4.3).

4.1 Experiment setup

Datasets. We evaluate the adversarial robustness of pre-processing defenses on two popular benchmark datasets, i.e., SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky et al., 2009). SVHN and CIFAR-10 both have 10 classes of images; the former contains 73,257 training images and 26,032 test images, and the latter contains 50,000 training images and 10,000 test images. All natural images are normalized into [0,1], and simple data augmentations are applied, including 4-pixel padding with random crop and random horizontal flip.

Target model settings. For both SVHN and CIFAR-10, target models are built on a ResNet-18 architecture (He et al., 2016). Four adversarial training strategies are utilized to pre-train the target models: Standard (Madry et al., 2018), MMA (Ding et al., 2019), TRADES (Zhang et al., 2019) and MART (Wang et al., 2019b). Hyperparameters of these strategies, including the trade-off parameters for TRADES and MART, are configured as per their original papers. A PGD-10 attack with random start and step size 0.007 is used as the training attack. The perturbation budget is 8/255 for both SVHN and CIFAR-10. The link to the implementation code of the target models can be found in supplementary material C.

Defense settings. We use three pre-processing defense methods as baselines: APE-G (Jin et al., 2019), HGD (Liao et al., 2018) and NRP (Naseer et al., 2020). The architectures of the pre-processing models in APE-G and HGD are the same as those in their original papers. We use the classification model adversarially trained by TRADES as the target model during the training process. A PGD-40 attack with step size 0.007 is used as the training attack for APE-G and HGD. The perturbation budget is set to 8/255 for both SVHN and CIFAR-10. For NRP, we use the basic block from the original paper and only reduce the number of basic blocks to 3. The training strategies and other hyperparameters of the baselines are consistent with the settings on CIFAR-10 in their original papers. For our method, we utilize the same architecture as NRP to build our pre-processing network. Our pre-processing model is trained using SGD (Andrew and Gao, 2007) with momentum 0.9 and weight decay, and the initial learning rate is divided by 10 at two milestone epochs (30 epochs in total). The perturbation budget is set to 8/255 and the step size is 0.01. The hyperparameters $\lambda$ and $\beta$ are set to 5.0 and 3.0 respectively. For fair comparison, all experiments are conducted on four NVIDIA RTX 2080 GPUs, and all methods are implemented in PyTorch.
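One way to realize the stated optimization schedule (SGD with momentum 0.9, a learning rate divided by 10 at milestone epochs within 30 epochs) is with torch.optim and MultiStepLR; the initial learning rate, weight decay, milestone epochs and the stand-in pre-processing module below are placeholders since those values are not fully specified here:

```python
import torch
import torch.nn as nn

preprocessor = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the NRP-style network
optimizer = torch.optim.SGD(preprocessor.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 at two placeholder milestone epochs within 30 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.1)

for epoch in range(30):
    # ... one epoch of JATP training over the loss in Eq. (10) (see the Algorithm 1 sketch) ...
    optimizer.step()      # would be called once per batch in practice
    scheduler.step()      # decays the learning rate at the milestones
```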

4.2 Adversarial robustness evaluation

Attack Natural PGD DLR AA FWA TI-BIM
None 78.45±0.06 47.48±0.05 45.02±0.01 44.10±0.07 34.39±0.06 59.67±0.02
APE-G 78.40±0.08 34.94±0.07 33.11±0.01 16.35±0.06 23.71±0.03 57.80±0.04
HGD 78.17±0.08 13.84±0.03 12.66±0.01 17.36±0.07 12.50±0.01 56.90±0.05
NRP 78.38±0.08 24.28±0.08 22.94±0.07 18.51±0.05 11.78±0.08 57.40±0.07
JATP 79.00±0.03 46.14±0.01 43.54±0.03 43.25±0.04 31.57±0.05 60.04±0.06
Table 1: White-box robustness (percentage, mean±std) of pre-processing defenses on CIFAR-10. We show the most successful pre-processing defense in bold. The target model is trained by TRADES.
Target Standard MMA MART
Attack Natural PGD DLR FWA Natural PGD DLR FWA Natural PGD DLR FWA
None 82.38 43.29 43.27 24.39 82.88 43.56 42.60 21.97 78.19 48.54 44.90 34.28
APE-G 82.54 27.35 30.68 14.17 82.53 29.02 30.59 13.18 78.28 34.34 32.15 21.28
HGD 82.64 7.86 10.23 6.97 82.17 9.09 10.29 7.24 77.74 10.97 11.31 10.09
NRP 82.43 13.71 17.72 5.01 82.52 16.56 18.46 4.50 77.95 21.32 20.26 9.64
JATP 83.65 41.65 42.75 22.07 83.14 41.73 40.89 19.29 79.25 46.75 43.29 31.54
JATP* 82.45 43.89 42.99 24.86 82.29 42.82 41.64 19.82 77.89 48.29 45.00 35.04
Table 2: White-box robustness (percentage) of pre-processing defenses on different target models. We show the most successful pre-processing defense in bold and the second one with underline. JATP* denotes the variant whose pre-processing model is trained jointly with the target model (see Section 4.2).

Robustness against adaptive attacks. We first evaluate the robustness of all pre-processing defense models against five types of attacks on both SVHN and CIFAR-10: PGD (40-step PGD with step size 0.01), DLR (40-step PGD with the DLR loss (Croce and Hein, 2020) and step size 0.007), AA (the $\ell_\infty$ version of AutoAttack), FWA (20-step FWA with step size 0.01) and TI-BIM (Dong et al., 2019; Xie et al., 2019). The perturbation budget is set to 8/255. All attacks follow an adaptive attack strategy in which the attacker has full access to the architectures and model parameters of both the pre-processing model and the target model. The white-box robustness of all defense models is shown in Table 1. Our proposed pre-processing defense JATP achieves the best robustness against all five types of attacks, and significantly mitigates the robustness degradation effect compared with the other pre-processing models. Due to space limitations, we present the results on SVHN in supplementary material C. In addition, we use the adaptive attack strategy BPDA (Athalye et al., 2018) to check whether our work relies heavily on obfuscated gradients. We combine BPDA with PGD to bypass the pre-processing defenses. As shown in Figure 4(a), our method achieves lower fooling rates than the other baselines.
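To make the evaluation protocol concrete, the sketch below measures white-box adaptive robustness by letting the attack differentiate through the composed full model and then computing accuracy on the resulting examples; `pgd_attack` refers to the hypothetical helper from Section 2.2 and the function names are illustrative:

```python
import torch

@torch.no_grad()
def _num_correct(logits: torch.Tensor, y: torch.Tensor) -> float:
    return (logits.argmax(dim=1) == y).float().sum().item()

def evaluate_white_box(full_model, loader, attack_fn, device="cuda"):
    """White-box adaptive evaluation: the attack sees the full (defended) model."""
    full_model.eval()
    correct, total = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack_fn(full_model, x, y)          # e.g. the PGD sketch from Section 2.2
        correct += _num_correct(full_model(x_adv), y)
        total += y.size(0)
    return 100.0 * correct / total                   # robust accuracy in percent

# Example: 40-step PGD with step size 0.01 and budget 8/255, as in the evaluation above.
# acc = evaluate_white_box(full_model, test_loader,
#                          lambda m, x, y: pgd_attack(m, x, y, eps=8/255, step_size=0.01, steps=40))
```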

Cross-model defense. We apply our proposed JATP defense to the three other target models (i.e., Standard, MMA and MART) to evaluate its transferability. The white-box robustness of our proposed JATP defense on CIFAR-10 is reported in Table 2. We present the standard deviations in supplementary material C. Again, our proposed defense achieves higher robustness than the other pre-processing defenses. Furthermore, we conduct an additional experiment under a relaxed constraint. That is, we update the parameters of the target model and the pre-processing model together during the training process, instead of using a pre-trained target model with fixed parameters. We then transfer the obtained pre-processing model, denoted JATP* in Table 2, to the above three target models. We find that the transferability of the JATP* defense is slightly better than that of the JATP defense.

Figure 4: (a) BPDA: fooling rate (lower is better) of BPDA and PGD against pre-processing models; the target model is trained by TRADES. (b) Ablation study: we remove the pixel-wise loss ("Pix"), the BCE adversarial loss ("BCE") and the feature similarity adversarial loss ("FSM") respectively to investigate their impacts on our model.

4.3 Ablation study

In this section, we conduct an ablation study to further understand the proposed JATP defense. We respectively remove the pixel-wise loss, the BCE adversarial loss and the feature similarity adversarial loss to investigate their impacts on our model. A target model trained by TRADES is used during the training process. As illustrated in Figure 4(b), removing the pixel-wise loss mainly affects the transferability of the pre-processing model across different target models. Removing the BCE adversarial loss or the feature similarity adversarial loss leads to a significant robustness degradation, which shows that the adversarial loss is important for improving the adversarial robustness of the pre-processing model.

5 Conclusion

In this paper, we analyze two potential causes of the robustness degradation effect: (1) adversarial training examples used in pre-processing methods are independent of the pre-processing model; and (2) the adversarial risk of the pre-processing model is neglected during the training process. To solve this problem, we first formulate a feature similarity based adversarial risk for the pre-processing model to improve its inherent robustness by using full adversarial examples crafted in a feature space. We then introduce a pixel-wise loss to improve the cross-model transferability of the pre-processing model. Based on these, we propose a Joint Adversarial Training based Pre-processing (JATP) defense to minimize the overall risk in a dynamic manner. Experimental results show that our method effectively mitigates the robustness degradation effect in comparison to previous pre-processing defenses. We hope that the work in this paper can provide some inspiration for future pre-processing defenses in improving white-box robustness. A limitation of our work is that the proposed method is only suitable for input denoising defenses so far, and we have not explored how to apply it to other pre-processing defenses. In the future, we plan to combine the pre-processing defense with recently proposed robust models (Wu et al., 2020b; Chen et al., 2021) and explore potential improvements on the white-box robustness of pre-processing defenses.

References

  • G. Andrew and J. Gao (2007) Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. Cited by: §4.1.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §3.1.
  • A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §4.2.
  • A. Athalye and N. Carlini (2018) On the robustness of the cvpr 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286. Cited by: §1, §2.2.
  • N. Carlini and D. Wagner (2017a) MagNet and "Efficient Defenses Against Adversarial Attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478. Cited by: §1.
  • N. Carlini and D. Wagner (2017b) Towards evaluating the robustness of neural networks. In 2017 Ieee Symposium on Security and Privacy (sp), pp. 39–57. Cited by: §1.
  • T. Chen, Z. Zhang, S. Liu, S. Chang, and Z. Wang (2021) Robust overfitting may be mitigated by properly learned smoothening. In International Conference on Learning Representations, Vol. 1. Cited by: §5.
  • Y. Chen, R. Xie, and Z. Zhu (2020) On breaking deep generative model-based defenses and beyond. In International Conference on Machine Learning, pp. 1736–1745. Cited by: §1.
  • F. Croce and M. Hein (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pp. 2206–2216. Cited by: Figure 1, §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • G. W. Ding, K. Y. C. Lui, X. Jin, L. Wang, and R. Huang (2019) On the sensitivity of adversarial robustness to input data distributions.. In ICLR (Poster), Cited by: §1, §4.1.
  • Y. Dong, T. Pang, H. Su, and J. Zhu (2019) Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4312–4321. Cited by: §4.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1.
  • R. Duan, X. Ma, Y. Wang, J. Bailey, A. K. Qin, and Y. Yang (2020) Adversarial camouflage: hiding physical-world attacks with natural styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1000–1008. Cited by: §1.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. Cited by: §1.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §1.
  • P. Ghosh, A. Losalka, and M. J. Black (2019) Resisting adversarial attacks using Gaussian mixture variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 541–548. Cited by: §1.
  • G. Gondim-Ribeiro, P. Tabacof, and E. Valle (2018) Adversarial attacks on variational autoencoders. arXiv preprint arXiv:1806.04646. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §1.
  • C. Guo, M. Rana, M. Cissé, and L. van der Maaten (2018) Countering adversarial images using input transformations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §3.1, §4.1.
  • Z. Huang and T. Zhang (2019) Black-box adversarial attack with transferable model-based embedding. arXiv preprint arXiv:1911.07140. Cited by: §1.
  • G. Jin, S. Shen, D. Zhang, F. Dai, and Y. Zhang (2019) APE-GAN: adversarial perturbation elimination with GAN. In International Conference on Acoustics, Speech and Signal Processing, pp. 3842–3846. Cited by: Figure 1, §1, §2.2, §4.1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. IEEE Transactions on Pattern Analysis & Machine Intelligence PP, pp. 1–1. Cited by: §1.
  • J. Kos, I. Fischer, and D. Song (2018) Adversarial examples for generative models. In 2018 ieee security and privacy workshops (spw), pp. 36–42. Cited by: §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: Figure 1, §4.1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1, §3.1.
  • F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Conference on Computer Vision and Pattern Recognition, pp. 1778–1787. Cited by: Figure 1, §1, §2.2, §3.2, §4.1.
  • W. Lin, Y. Balaji, P. Samangouei, and R. Chellappa (2019) Invert and defend: model-based approximate inversion of generative adversarial networks for secure inference. arXiv preprint arXiv:1911.10291. Cited by: §1.
  • X. Ma, B. Li, Y. Wang, S. M. Erfani, S. N. R. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, Cited by: Figure 1, §1, §2.2, §3.2, §4.1.
  • M. Naseer, S. Khan, M. Hayat, F. S. Khan, and F. Porikli (2020) A self-supervised approach for adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 262–271. Cited by: Figure 1, §1, §4.1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1.
  • Y. Qin, N. Frosst, S. Sabour, C. Raffel, G. Cottrell, and G. Hinton (2019) Detecting and diagnosing adversarial images with class-conditional capsule reconstructions. arXiv preprint arXiv:1907.02957. Cited by: §1.
  • P. Samangouei, M. Kabkab, and R. Chellappa (2018) Defense-gan: protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605. Cited by: §1.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2018) Towards the first adversarially robust neural network model on MNIST. arXiv preprint arXiv:1805.09190. Cited by: §1.
  • C. Sun, S. Chen, J. Cai, and X. Huang (2020) Type i attack for generative models. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 593–597. Cited by: §1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Neural Information Processing Systems, pp. 3104–3112. Cited by: §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, Cited by: §1.
  • F. Tramer, N. Carlini, W. Brendel, and A. Madry (2020) On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347. Cited by: §1, §1, §2.2, §2.2.
  • Y. Wang, X. Deng, S. Pu, and Z. Huang (2017) Residual convolutional CTC networks for automatic speech recognition. arXiv preprint arXiv:1702.07793. Cited by: §1.
  • Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu (2019a) On the convergence and robustness of adversarial training.. In ICML, Vol. 1, pp. 2. Cited by: §1.
  • Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2019b) Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, Cited by: §2.3, §3.2, §3.2, §4.1.
  • D. Wu, Y. Wang, S. Xia, J. Bailey, and X. Ma (2020a) Skip connections matter: on the transferability of adversarial examples generated with resnets. arXiv preprint arXiv:2002.05990. Cited by: §1.
  • D. Wu, S. Xia, and Y. Wang (2020b) Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • K. Wu, A. H. Wang, and Y. Yu (2020c) Stronger and faster wasserstein adversarial attacks. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119, pp. 10377–10387. Cited by: Figure 1.
  • T. Wu, L. Tong, and Y. Vorobeychik (2020d) Defending against physically realizable attacks on image classification. In 8th International Conference on Learning Representations, Cited by: §1.
  • C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille (2019) Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2730–2739. Cited by: §4.2.
  • K. Xu, G. Zhang, S. Liu, Q. Fan, M. Sun, H. Chen, P. Chen, Y. Wang, and X. Lin (2020) Adversarial t-shirt! evading person detectors in a physical world. In European Conference on Computer Vision, pp. 665–681. Cited by: §1.
  • W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §1.
  • H. Zhang and J. Wang (2019) Defense against adversarial attacks using feature scattering-based adversarial training. arXiv preprint arXiv:1907.10764. Cited by: §1, §3.1.
  • H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. Cited by: Figure 1, §1, §2.2, §4.1.