Learning to Defense by Learning to Attack

by Zhehui Chen, et al.
Georgia Institute of Technology

Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training essentially solves a minmax robust optimization problem: the outer minimization learns a robust classifier, while the inner maximization generates adversarial samples. Unfortunately, such a minmax problem is very difficult to solve due to the lack of convex-concave structure. This work proposes a new adversarial training method based on a general learning-to-learn framework. Specifically, instead of applying existing hand-designed algorithms to the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defend against the adversarial attacks generated by the learned optimizer. Our experiments demonstrate that our proposed method significantly outperforms existing adversarial training methods on the CIFAR-10 and CIFAR-100 datasets.


1 Introduction

This decade has witnessed great breakthroughs in deep learning in a variety of applications, such as computer vision (Taigman et al., 2014; Girshick et al., 2014; He et al., 2016; Liu et al., 2017). Recent studies (Szegedy et al., 2013), however, show that most of these deep learning models are highly vulnerable to adversarial attacks. Specifically, by injecting a small perturbation into a normal sample, attackers obtain adversarial examples. Although these adversarial examples are semantically indistinguishable from the normal ones, they can severely fool deep learning models and undermine their security, causing reliability problems in autonomous driving, biometric authentication, etc.

Researchers have devoted much effort to studying efficient adversarial attacks and defenses (Szegedy et al., 2013; Goodfellow et al., 2014b; Nguyen et al., 2015; Zheng et al., 2016; Madry et al., 2017). There is a growing body of work on generating successful adversarial examples, e.g., the fast gradient sign method (FGSM, Goodfellow et al. (2014b)), the projected gradient method (PGM, Kurakin et al. (2016)), etc. As for robustness, Goodfellow et al. (2014b) first propose to robustify the network by adversarial training, which augments the training data with adversarial examples while still requiring the network to output the correct label; in other words, adversarial examples are incorporated into the training stage. Further, Madry et al. (2017) formalize adversarial training as the following minmax robust optimization problem:


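$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \max_{\delta_i \in \mathcal{B}} \ell\big(f(x_i + \delta_i; \theta), y_i\big), \tag{1}$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are $n$ pairs of input feature and label, $\ell$ denotes the loss function, $f(\cdot; \theta)$ denotes the neural network with parameter $\theta$, and $\delta_i \in \mathcal{B}$ denotes the perturbation for $x_i$ under constraint $\mathcal{B}$. The existing literature on optimization also refers to $\theta$ as the primal variable and $\delta_i$ as the dual variable. Different from the well-studied convex-concave problem (in which the loss function is convex in the primal variable and concave in the dual variable), problem (1) is very challenging, since $\ell$ in (1) is nonconvex in $\theta$ and nonconcave in $\delta_i$. The existing primal-dual algorithms perform poorly for solving (1).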

The minmax formulation in (1) naturally provides a unified perspective on prior works on adversarial training. Such a minmax problem contains two optimization problems, an inner maximization problem and an outer minimization problem. The inner problem aims to find an optimal attack $\delta_i$ for a given data point $(x_i, y_i)$ that achieves a high loss, which is essentially the adversarial attack. The outer problem aims to find a $\theta$ so that the loss given by the inner problem is minimized. From this perspective, Goodfellow et al. (2014b) solve the inner problem by FGSM, whereas Madry et al. (2017) suggest solving the inner problem by PGM and obtain a better result than FGSM, since FGSM is essentially one-iteration PGM. PGM, however, is not guaranteed to find the optimal solution of the inner problem, due to the nonconcavity of the inner problem. Furthermore, PGM training does not obtain a stationary point of problem (1). Moreover, adversarial training needs to find a $\delta_i$ for each $(x_i, y_i)$. The dimension of the overall search space for all data is substantial, which makes the computation unaffordable. Besides, existing methods, e.g., FGSM and PGM, suffer from vanishing gradients in backpropagation (BP), which makes the gradient uninformative.

Some recent works (Hochreiter et al., 2001; Thrun and Pratt, 2012; Andrychowicz et al., 2016) propose a learning-to-learn framework. Hochreiter et al. (2001), for example, propose a system that allows the output of backpropagation from one network to feed into an additional learning network, with both networks trained jointly. Building on this, Andrychowicz et al. (2016) further show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way.

Motivated by the learning-to-learn framework, we propose a new adversarial training method to solve the minmax problem. Specifically, we parameterize the solver for the inner problem as a convolutional neural network and then cast the inner problem as a learning problem, which adopts the dual embedding (Dai et al., 2016). Consequently, our proposed adversarial training method simultaneously learns a robust classifier and the convolutional neural network for generating an adversarial attack. Our experiments demonstrate that our proposed method significantly outperforms existing adversarial training methods on CIFAR-10 and CIFAR-100.

The rest of the paper is organized as follows: Section 2 introduces our proposed training framework in detail, and Section 3 presents our numerical results on CIFAR-10 and CIFAR-100. Section 4 gives a brief discussion.

2 Method

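Given a vector $a \in \mathbb{R}^d$, we denote $a_j$ as the $j$-th element of $a$. We then denote the most widely used max norm attack constraint and the corresponding projection as follows:

$$\mathcal{B} = \{\delta \in \mathbb{R}^d : \|\delta\|_\infty \le \epsilon\}, \qquad \Pi_{\mathcal{B}}(\delta) = \mathrm{sign}(\delta) \circ \min(|\delta|, \epsilon),$$

where both $\mathrm{sign}(\cdot)$ and $\min(\cdot, \cdot)$ are element-wise operators, and $\circ$ denotes the element-wise product.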
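As a minimal sketch of the projection onto the max norm ball, the following pure-Python snippet clips each coordinate of a perturbation to the interval of half-width `eps`. The function name `project_linf` and the list-based representation are assumptions made for this illustration only.

```python
import math

def project_linf(delta, eps):
    """Project a perturbation onto the max-norm ball {d : max_j |d_j| <= eps}."""
    # Element-wise sign(d_j) * min(|d_j|, eps): entries already inside the
    # ball are left untouched, entries outside are clipped to +eps or -eps.
    return [math.copysign(min(abs(d), eps), d) for d in delta]

delta = [0.5, -0.2, 0.01]
projected = project_linf(delta, eps=0.1)
print(projected)  # [0.1, -0.1, 0.01]
```

Only the two coordinates that violate the constraint are changed; the third one is already feasible and passes through unchanged.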

2.1 Adversarial Training based on Robust Optimization

As mentioned above, adversarial training can be formulated as solving the minmax problem (1). We summarize the standard pipeline of one iteration of adversarial training in Algorithm 1. As can be seen, we also update the parameter of the classifier over clean samples to maintain the accuracy on clean data, as Kurakin et al. (2016) suggest.

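Input: clean data $\{(x_i, y_i)\}_{i=1}^{n}$, step size $\alpha$.
1: $\theta \leftarrow \theta - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(f(x_i; \theta), y_i)$    // Update $\theta$ over clean data
2: $\delta_i \leftarrow \arg\max_{\delta \in \mathcal{B}} \ell(f(x_i + \delta; \theta), y_i)$ for $i = 1, \dots, n$    // Generate perturbation
3: $\theta \leftarrow \theta - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(f(x_i + \delta_i; \theta), y_i)$    // Update $\theta$ over adversarial data
Algorithm 1 One Iteration in Adversarial Training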
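The three steps of Algorithm 1 can be sketched on a toy problem. The snippet below is an illustration under strong assumptions, not the setup used in the paper: the classifier is a linear model with logistic loss and labels in {-1, +1}, for which the inner maximization has an exact solution (the margin is linear in the perturbation, so the worst case pushes every coordinate by `eps` against the label). For a deep network no such closed form exists. All function names are hypothetical.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot(w, x)))

def grad_step(w, data, alpha):
    """One gradient descent step on the average logistic loss over `data`."""
    n = len(data)
    g = [0.0] * len(w)
    for x, y in data:
        s = -y / (1.0 + math.exp(y * dot(w, x)))  # d loss / d <w, x>
        for j, xj in enumerate(x):
            g[j] += s * xj / n
    return [wj - alpha * gj for wj, gj in zip(w, g)]

def worst_case_delta(w, y, eps):
    """Exact inner maximizer for a linear model under the max-norm constraint."""
    return [-eps * y * sign(wj) for wj in w]

def perturb(w, data, eps):
    """Apply the worst-case perturbation to every sample."""
    return [([xj + dj for xj, dj in zip(x, worst_case_delta(w, y, eps))], y)
            for x, y in data]

def adversarial_training_iteration(w, data, alpha, eps):
    w = grad_step(w, data, alpha)            # Step 1: update over clean data
    adversarial = perturb(w, data, eps)      # Step 2: generate perturbation
    return grad_step(w, adversarial, alpha)  # Step 3: update over adversarial data

data = [([1.0, 2.0], 1), ([-1.5, -0.5], -1)]
w = [0.0, 0.0]
for _ in range(50):
    w = adversarial_training_iteration(w, data, alpha=0.5, eps=0.1)
robust_loss = sum(loss(w, x, y) for x, y in perturb(w, data, 0.1)) / len(data)
print(w, robust_loss)
```

In this toy setting Step 2 is cheap. The difficulty discussed in the text arises because the same step has no closed-form solution once the classifier is a deep network.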

The loss $\ell$ is highly nonconcave in $\delta$, and therefore Step 2 in Algorithm 1 is intractable. In practice, Step 2 in most adversarial training methods adopts hand-designed algorithms for generating adversarial perturbations. For example, Kurakin et al. (2016) propose to solve the inner problem approximately by first-order methods such as PGM. Specifically, PGM iteratively updates the adversarial perturbation by the projected sign gradient ascent method for each sample: given a sample $(x_i, y_i)$, at the $t$-th iteration, PGM takes

$$\delta_i^{(t+1)} = \Pi_{\mathcal{B}}\Big(\delta_i^{(t)} + \eta \cdot \mathrm{sign}\big(\nabla_{\delta} \ell(f(x_i + \delta_i^{(t)}; \theta), y_i)\big)\Big),$$

where $\Pi_{\mathcal{B}}$ is the projection onto the constraint set $\mathcal{B}$, $\eta$ is the step size, $\delta_i^{(0)} = 0$, $t = 0, \dots, T-1$, and $T$ is a pre-defined total number of iterations. Finally PGM takes $\delta_i = \delta_i^{(T)}$. Note that FGSM is essentially a one-iteration version of PGM. Besides, some other works adopt other optimization methods, such as the momentum gradient method (Dong et al., 2018), L-BFGS (Tabacof and Valle, 2016), and SLSQP (Kraft, 1988). However, except for FGSM, they all require a large number of gradient queries, which is computationally very expensive.
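The PGM update can be sketched as follows. This is a hedged illustration only: the classifier is a toy linear model with logistic loss (so the input gradient can be written by hand), the perturbation is assumed to start from zero, and the names `pgm_attack` and `fgsm_attack` are hypothetical. FGSM is expressed as a single PGM step whose step size equals the constraint size.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    """Logistic loss for a linear score <w, x> and a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot(w, x)))

def grad_x(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    s = -y / (1.0 + math.exp(y * dot(w, x)))
    return [s * wj for wj in w]

def project_linf(delta, eps):
    """Clip every coordinate to [-eps, eps] (projection onto the max-norm ball)."""
    return [max(-eps, min(eps, d)) for d in delta]

def pgm_attack(w, x, y, eps, eta, T):
    """T steps of projected sign-gradient ascent on the loss."""
    delta = [0.0] * len(x)  # assumed initialization
    for _ in range(T):
        g = grad_x(w, [xj + dj for xj, dj in zip(x, delta)], y)
        stepped = [dj + eta * sign(gj) for dj, gj in zip(delta, g)]
        delta = project_linf(stepped, eps)
    return delta

def fgsm_attack(w, x, y, eps):
    """FGSM written as one PGM iteration with step size eps."""
    return pgm_attack(w, x, y, eps, eta=eps, T=1)

w, x, y = [1.0, -2.0], [0.5, 0.3], 1
d_pgm = pgm_attack(w, x, y, eps=0.1, eta=0.025, T=10)
d_fgsm = fgsm_attack(w, x, y, eps=0.1)
x_adv = [xj + dj for xj, dj in zip(x, d_pgm)]
print(loss(w, x, y), loss(w, x_adv, y))
```

For a linear model the sign of the gradient never changes, so both attacks end at the same corner of the constraint set. The gap between FGSM and PGM reported in the text only appears for nonlinear classifiers, where PGM costs `T` gradient queries per sample.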

2.2 Learning to Defense by Learning to Attack (L2L)

Here, instead of applying hand-designed attackers, we learn an optimizer for the inner problem, which is parametrized by a convolutional neural network $g(\mathcal{A}(x, y, \theta); \phi)$, where $\phi$ denotes the parameter of the network and $\mathcal{A}$ is an operator to modify the input of the network $g$. We provide the following two straightforward examples:

Naive Attacker Network. This is the simplest attacker network we can imagine, which takes the original image $x$ as the input, i.e., $\mathcal{A}(x, y, \theta) = x$. Under this setting, L2L training is similar to GAN (Goodfellow et al., 2014a). The major difference is that GAN generates synthetic data by transforming random noise, whereas L2L training generates the adversarial perturbation by transforming the training samples. (When this paper was under preparation, we found that a similar idea to the naive attacker network was independently proposed in (Anonymous, 2019).)

Gradient Attacker Network. Besides the input image, we can also take the gradient information into consideration in our attacker network. Specifically, the attacker takes the original image $x$ and $\nabla_x \ell(f(x; \theta), y)$ as the input, i.e., $\mathcal{A}(x, y, \theta) = [x, \nabla_x \ell(f(x; \theta), y)]$, where $\nabla_x \ell(f(x; \theta), y)$ is the back-propagation gradient computed recursively from the top layer to the bottom layer. Since more information is provided, we expect the attacker network to be more efficient to learn and meanwhile to yield more powerful adversarial perturbations.

With this parametrization, we then convert problem (1) to the following problem:

$$\min_{\theta} \max_{\phi} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i + g(\mathcal{A}(x_i, y_i, \theta); \phi); \theta), y_i\big) \quad \text{subject to} \quad g(\mathcal{A}(x_i, y_i, \theta); \phi) \in \mathcal{B}, \; i = 1, \dots, n. \tag{2}$$

Solving problem (2) naturally contains two stages. In the first stage, the classifier aims to fit over all the perturbed data. In the second stage, given the classifier obtained in the first stage, the attacker network aims to generate the optimal perturbation under constraint $\mathcal{B}$. Figure 1 illustrates our training framework with the gradient attacker network. As can be seen, we jointly train two networks, one classifier and one attacker. We feed both the clean data and the backpropagation gradient of the classifier into the attacker $g$, and let $g$ learn to generate the perturbation for adversarial training, i.e.,

$$\delta_i = g\big(x_i, \nabla_x \ell(f(x_i; \theta), y_i); \phi\big).$$

The constraint $\mathcal{B}$ can be handled by a tanh activation function in the last layer of the network $g$. Specifically, since the magnitude of the tanh output is bounded by 1, after we rescale the output by the constraint size $\epsilon$, the output of the network $g$ satisfies the constraint $\mathcal{B}$. Moreover, because our method only requires updating the parameter $\phi$, it only requires one gradient query and amortizes the adversarial training cost, which leads to better computational efficiency. The corresponding training procedure is shown in Algorithm 2.

Figure 1: L2L adversarial training method combined with the gradient attacker network.
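Input: clean data $\{(x_i, y_i)\}_{i=1}^{n}$, step size $\alpha$.
1: Compute $\nabla_x \ell(f(x_i; \theta), y_i)$ and $\nabla_\theta \ell(f(x_i; \theta), y_i)$ for $i = 1, \dots, n$    // Calculate gradient w.r.t. $x$ and $\theta$ by backpropagation
2: $\theta \leftarrow \theta - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(f(x_i; \theta), y_i)$    // Update $\theta$ over clean data
3: $\delta_i \leftarrow g\big(x_i, \nabla_x \ell(f(x_i; \theta), y_i); \phi\big)$    // Generate attack by $g$
4: $\theta \leftarrow \theta - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(f(x_i + \delta_i; \theta), y_i)$    // Update $\theta$ over adversarial data
5: $\phi \leftarrow \phi + \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_\phi \ell(f(x_i + \delta_i; \theta), y_i)$    // Update $\phi$
Algorithm 2 One Iteration in Learning to Defense by Learning to Attack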
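One iteration of this procedure can be sketched on a toy problem. Everything below is an assumption made for illustration and is far simpler than the networks in the paper: the classifier is a linear model with logistic loss and labels in {-1, +1}; the attacker is a two-parameter element-wise map `eps * tanh(phi[0] * x + phi[1] * u)` standing in for the convolutional attacker network, where `u` is the input gradient computed on clean data; gradients are derived by hand; the perturbation is treated as a constant in the classifier update; and both players share one step size. The names `attacker` and `l2l_iteration` are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    return math.log1p(math.exp(-y * dot(w, x)))

def dloss_dz(w, x, y):
    """Derivative of the logistic loss w.r.t. the score z = <w, x>."""
    return -y / (1.0 + math.exp(y * dot(w, x)))

def attacker(phi, x, u, eps):
    """Toy attacker: tanh keeps every coordinate inside [-eps, eps]."""
    return [eps * math.tanh(phi[0] * xj + phi[1] * uj) for xj, uj in zip(x, u)]

def l2l_iteration(w, phi, data, alpha, eps):
    n, d = len(data), len(w)
    # Step 1: one backward pass on clean data gives gradients w.r.t. x and w.
    s = [dloss_dz(w, x, y) for x, y in data]
    u = [[si * wj for wj in w] for si in s]
    g_w = [sum(si * x[j] for si, (x, _) in zip(s, data)) / n for j in range(d)]
    # Step 2: update the classifier over clean data.
    w = [wj - alpha * gj for wj, gj in zip(w, g_w)]
    # Step 3: generate the attack with the attacker.
    deltas = [attacker(phi, x, ui, eps) for (x, _), ui in zip(data, u)]
    x_adv = [[xj + dj for xj, dj in zip(x, dl)] for (x, _), dl in zip(data, deltas)]
    # Step 4: update the classifier over adversarial data (deltas held fixed).
    s_adv = [dloss_dz(w, xa, y) for xa, (_, y) in zip(x_adv, data)]
    g_w = [sum(si * xa[j] for si, xa in zip(s_adv, x_adv)) / n for j in range(d)]
    w = [wj - alpha * gj for wj, gj in zip(w, g_w)]
    # Step 5: gradient ascent on the attacker parameters (chain rule through tanh).
    g_phi = [0.0, 0.0]
    for (x, y), ui, xa, dl in zip(data, u, x_adv, deltas):
        si = dloss_dz(w, xa, y)
        for j in range(d):
            dtanh = eps * (1.0 - (dl[j] / eps) ** 2)
            g_phi[0] += si * w[j] * dtanh * x[j] / n
            g_phi[1] += si * w[j] * dtanh * ui[j] / n
    phi = [p + alpha * g for p, g in zip(phi, g_phi)]
    return w, phi

data = [([1.0, 2.0], 1), ([-1.5, -0.5], -1)]
w, phi = [0.0, 0.0], [0.0, 0.0]
for _ in range(100):
    w, phi = l2l_iteration(w, phi, data, alpha=0.5, eps=0.1)
print(w, phi)
```

Generating a perturbation here costs one evaluation of the attacker on top of the clean-data gradient, rather than an iterative search per sample.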

3 Experiments

To demonstrate the efficiency and effectiveness of our proposed method, we present experimental results on the CIFAR-10 and CIFAR-100 datasets. We consider two attack methods, FGSM and PGM, and evaluate the robustness of deep neural network models under both black-box and white-box settings. All experiments are done in PyTorch with one NVIDIA 1080Ti GPU, and all reported results are averaged over 10 runs with different random initializations (summarized in Tables 1 and 2).

Experimental Settings: All experiments adopt a 32-layer Wide Residual Network (WRN-4-32, Zagoruyko and Komodakis (2016)) as the classifier. A pre-trained network is used as the initial classifier in the adversarial training; the pre-trained network is obtained by the training procedure on clean data of Zagoruyko and Komodakis (2016). For training the attacker network, we use the stochastic gradient descent (SGD) algorithm with Polyak’s momentum and weight decay. We observe that, after adversarial training for a fixed number of epochs, all adversarial training methods become stable (converge well). For L2L training, we use one step size for the initial epochs and further reduce the step size for the last 10 epochs. For both FGSM and PGM training, we use a fixed step size, since we find that the step size annealing procedure hurts both FGSM and PGM training. For PGM attack and training, we use a setting that yields sufficiently strong perturbations in practice. For L2L training, we use a 6-layer convolutional neural network as the attacker network, as shown in Table 3. We set the size of the constraint $\epsilon$ to the same value for all experiments.

Under the white-box setting, attackers are able to access all parameters of target models and generate adversarial examples based on the target models. Under the black-box setting, accessing parameters is prohibited. Therefore, we adopt the standard transfer attack method (Liu et al., 2016). Specifically, we train another classifier with a different random seed, and then based on this classifier, attackers generate adversarial examples to attack the target model.

                             White-box     Black-box (surrogate model used to generate the attack)
                                           Plain Net     FGSM Net      PGM Net       Grad L2L
Dataset    Target     Clean  FGSM   PGM    FGSM   PGM    FGSM   PGM    FGSM   PGM    FGSM   PGM
CIFAR-10   Plain Net  95.22  21.00  0.04   40.05  5.54   74.42  75.25  67.37  65.92  64.31  59.22
CIFAR-10   FGSM Net   91.30  84.99  2.61   79.20  85.02  89.90  80.40  64.28  63.89  59.78  58.73
CIFAR-10   PGM Net    86.25  51.07  44.95  83.80  84.73  84.33  85.29  67.05  65.54  65.95  63.98
CIFAR-10   Naive L2L  94.72  16.72  0.00   45.52  25.95  83.99  77.94  68.14  67.13  65.98  64.23
CIFAR-10   Grad L2L   89.09  57.39  50.04  85.77  87.14  86.78  87.77  70.62  69.37  67.36  65.13
CIFAR-100  Plain Net  75.68  10.15  0.14   21.04  9.04   50.57  54.06  40.06  41.30  39.00  36.85
CIFAR-100  FGSM Net   71.54  36.71  0.96   42.87  50.73  61.68  44.70  39.34  40.08  38.10  36.85
CIFAR-100  PGM Net    60.80  22.21  18.73  56.63  58.34  56.99  57.97  40.19  39.87  38.59  37.39
CIFAR-100  Naive L2L  73.69  9.21   0.17   20.97  10.47  50.36  54.07  38.63  39.91  36.15  35.58
CIFAR-100  Grad L2L   64.15  30.35  26.75  60.35  61.63  60.71  61.60  44.98  44.73  41.32  39.92
Table 1: Quantitative comparisons among different adversarial training methods (classification accuracy, %). Grad L2L and Naive L2L denote our proposed L2L training method combined with the gradient attacker network and the naive attacker network, respectively.
Clean Data   FGSM Training   PGM Training   Naive L2L   Grad L2L
s            s               s              s           s
Table 2: Training time for one epoch
Layer    Type    Channel  Kernel  Stride  Padding  Batch normalization  Activation
Layer 1  Conv    16       3x3     1       1        Yes                  ReLU
Layer 2  Conv    32       4x4     2       1        Yes                  ReLU
Layer 3  Conv    64       4x4     2       1        Yes                  ReLU
Layer 4  DeConv  32       4x4     2       1        Yes                  ReLU
Layer 5  DeConv  16       4x4     2       1        Yes                  ReLU
Layer 6  Conv    3        1x1     1       0        Yes                  Tanh
Table 3: Attacker Network Architecture
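As a quick consistency check on Table 3, the spatial size after each layer can be traced with the standard output-size formulas for convolution and transposed convolution. The sketch assumes a 32x32 input (the CIFAR image size), dilation 1, and no output padding; the helper names are hypothetical. It shows that the two stride-2 convolutions downsample to 8x8, the two transposed convolutions restore 32x32, and the final layer outputs 3 channels, so the perturbation has the same shape as the input image.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel

# (type, output channels, kernel, stride, padding), copied from Table 3.
LAYERS = [
    ("conv", 16, 3, 1, 1),
    ("conv", 32, 4, 2, 1),
    ("conv", 64, 4, 2, 1),
    ("deconv", 32, 4, 2, 1),
    ("deconv", 16, 4, 2, 1),
    ("conv", 3, 1, 1, 0),
]

def trace_shapes(size=32):
    """Return (channels, height, width) after every layer of the attacker."""
    shapes = []
    for kind, channels, kernel, stride, padding in LAYERS:
        step = conv_out if kind == "conv" else deconv_out
        size = step(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes

for index, shape in enumerate(trace_shapes(), start=1):
    print(f"Layer {index}: {shape}")
```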

Grad L2L vs. PGM. From Table 1, we see that in terms of the classification accuracy on adversarial examples, our proposed Grad L2L training uniformly outperforms PGM training over all settings, even when the adversarial attack is generated by PGM. Moreover, from Table 2, we see that our proposed Grad L2L training is computationally more efficient than PGM training.

Grad L2L vs. Naive L2L. From Table 1, we see that in terms of the classification accuracy on adversarial examples, our proposed Grad L2L training achieves significantly better performance than Naive L2L. This demonstrates that adding gradient information indeed yields a better adversarial training procedure.

Grad L2L vs. FGSM. From Table 1, we see that under the FGSM attack, FGSM training yields a better classification accuracy than Grad L2L training. However, FGSM training is much more vulnerable to the PGM attack than Grad L2L training. Moreover, from Table 2, we see that Grad L2L training is only slightly slower than FGSM training.

4 Discussions

We discuss a few benefits of our neural network approach: (i) Neural networks are known to be powerful in function approximation. Therefore, our attacker network is capable of yielding very strong adversarial perturbations; (ii) We generate the adversarial perturbations for all samples using the same attacker network. Therefore, the attacker network is essentially learning some common structures across all samples, which helps yield stronger perturbations; (iii) The attacker networks in our experiments are actually overparametrized. Overparametrization has been conjectured to ease the training of deep neural networks. We believe that a similar phenomenon occurs in our attacker network and eases the adversarial training of the robust classifier.