Researchers have devoted much effort to studying efficient adversarial attacks and defenses (Szegedy et al., 2013; Goodfellow et al., 2014b; Nguyen et al., 2015; Zheng et al., 2016; Madry et al., 2017). There is a growing body of work on generating successful adversarial examples, e.g., the fast gradient sign method (FGSM, Goodfellow et al. (2014b)) and the projected gradient method (PGM, Kurakin et al. (2016)). As for robustness, Goodfellow et al. (2014b) first propose to robustify the network by adversarial training. Adversarial training augments the training data with adversarial examples while still requiring the network to output the correct labels. In this way, it incorporates adversarial examples into the training stage. Further, Madry et al. (2017) formalize adversarial training as the following minmax robust optimization problem:

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\ \max_{\delta_i\in\mathcal{B}}\ \ell\big(f(x_i+\delta_i;\theta),\,y_i\big), \tag{1}$$

where $\{(x_i,y_i)\}_{i=1}^{n}$ are pairs of input features and labels, $\ell(\cdot,\cdot)$ denotes the loss function, $f(\cdot;\theta)$ denotes the neural network with parameter $\theta$, and $\delta_i$ denotes the perturbation for $x_i$ under the constraint $\delta_i\in\mathcal{B}$. The optimization literature also refers to $\theta$ as the primal variable and to $\delta_i$ as the dual variable. In the well-studied convex-concave problem, the loss function is convex in the primal variable and concave in the dual variable. Problem (1) is very challenging by comparison, since $\ell$ in (1) is nonconvex in $\theta$ and nonconcave in $\delta_i$. Existing primal-dual algorithms perform poorly on (1).
The minmax formulation in (1) naturally provides a unified perspective on prior work on adversarial training. It contains two optimization problems: an inner maximization problem and an outer minimization problem. The inner problem aims to find an optimal attack $\delta_i$ for a given data point $(x_i,y_i)$ that achieves a high loss, which is exactly the adversarial attack. The outer problem aims to find a $\theta$ that minimizes the loss given by the inner problem.

From this perspective, Goodfellow et al. (2014b) solve the inner problem by FGSM. Madry et al. (2017) instead suggest solving it by PGM and obtain better results than FGSM, since FGSM is essentially a one-iteration PGM. PGM, however, is not guaranteed to find the optimal solution of the inner problem, due to the nonconcavity of the inner problem. Furthermore, PGM training does not attain a stationary point of problem (1).

Moreover, adversarial training needs to find a $\delta_i$ for each $x_i$. The dimension of the overall search space over all data is therefore substantial, which makes the computation prohibitively expensive. Besides, existing methods such as FGSM and PGM suffer from vanishing gradients in backpropagation (BP), which makes the gradient uninformative.
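A standard special case makes the inner/outer split concrete. For illustration only, assume the following:

- a linear binary classifier $f(x;\theta)=\theta^\top x$ with labels $y\in\{-1,+1\}$;
- the logistic loss;
- the constraint set $\mathcal{B}=\{\delta:\|\delta\|_\infty\le\epsilon\}$.

Because $\log(1+e^{-z})$ is decreasing in $z$, the inner maximization reduces to minimizing $y\,\theta^\top\delta$ over the box. That problem has a closed-form solution, and a single FGSM step of size $\epsilon$ recovers it. Here $\sigma(z)=1/(1+e^{-z})$:

$$
\begin{aligned}
\ell\big(f(x+\delta;\theta),y\big) &= \log\big(1+e^{-y\,\theta^\top(x+\delta)}\big),\\
\min_{\|\delta\|_\infty\le\epsilon}\ y\,\theta^\top\delta &= -\epsilon\|\theta\|_1,\quad \text{attained at } \delta^\star=-\epsilon\,y\,\mathrm{sign}(\theta),\\
\max_{\|\delta\|_\infty\le\epsilon}\ \ell\big(f(x+\delta;\theta),y\big) &= \log\big(1+e^{-y\,\theta^\top x+\epsilon\|\theta\|_1}\big),\\
\epsilon\,\mathrm{sign}\big(\nabla_\delta\ell\,\big|_{\delta=0}\big) &= \epsilon\,\mathrm{sign}\big(-y\,\sigma(-y\,\theta^\top x)\,\theta\big)=\delta^\star.
\end{aligned}
$$

No such closed form is available when $f$ is a deep network, which is why the nonconcavity of the inner problem matters.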
Several works (Hochreiter et al., 2001; Thrun and Pratt, 2012; Andrychowicz et al., 2016) propose a learning-to-learn framework. Hochreiter et al. (2001), for example, propose a system in which the output of backpropagation from one network is fed into an additional learning network, with both networks trained jointly. Building on this, Andrychowicz et al. (2016) show how the design of an optimization algorithm can be cast as a learning problem. This allows the algorithm to learn to exploit structure in the problems of interest automatically.
Motivated by the learning-to-learn framework, we propose a new adversarial training method to solve the minmax problem. Specifically, we parameterize the solver for the inner problem as a convolutional neural network and cast the inner problem as a learning problem, following the dual embedding approach (Dai et al., 2016). Consequently, our method simultaneously learns a robust classifier and a convolutional neural network that generates adversarial attacks. Our experiments demonstrate that the proposed method significantly outperforms existing adversarial training methods on CIFAR-10 and CIFAR-100.
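Given a vector $v\in\mathbb{R}^d$, we denote by $v_j$ the $j$-th element of $v$. We then denote the most widely used max-norm attack constraint and the corresponding projection as follows:

$$\mathcal{B}(\epsilon)=\{\delta\in\mathbb{R}^d:\ \|\delta\|_\infty\le\epsilon\},\qquad \Pi_{\mathcal{B}(\epsilon)}(\delta)=\mathrm{sign}(\delta)\odot\min(|\delta|,\epsilon),$$

where both $\mathrm{sign}(\cdot)$ and $\min(\cdot,\cdot)$ are element-wise operators, and $\odot$ denotes the element-wise product.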
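The constraint set and projection translate directly into NumPy. The sketch below is our own illustration; the names `in_ball` and `project_linf` are hypothetical. It shows that $\Pi_{\mathcal{B}(\epsilon)}$ coincides with element-wise clipping to $[-\epsilon,\epsilon]$.

```python
import numpy as np


def in_ball(delta, eps):
    """Membership test for B(eps) = {delta : ||delta||_inf <= eps}."""
    return bool(np.max(np.abs(delta)) <= eps)


def project_linf(delta, eps):
    """Pi_{B(eps)}(delta) = sign(delta) * min(|delta|, eps), all element-wise."""
    return np.sign(delta) * np.minimum(np.abs(delta), eps)


delta = np.array([0.5, -0.02, -0.3, 0.0])
projected = project_linf(delta, eps=0.1)   # -> [0.1, -0.02, -0.1, 0.0]
```

Using `np.clip(delta, -eps, eps)` gives the same result and is the form most implementations use.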
2.1 Adversarial Training based on Robust Optimization
Since adversarial training can be formulated as the minmax problem (1), we summarize one iteration of the standard adversarial training pipeline in Algorithm 1. As can be seen, we also update the classifier parameter $\theta$ over clean samples in order to maintain accuracy on clean data, as suggested by Kurakin et al. (2016).
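Algorithm 1 itself is not reproduced in this text. As a rough illustration only, the sketch below shows one iteration in the shape described above, under our own assumptions:

- a binary logistic-regression classifier stands in for $f$;
- FGSM stands in for the Step 2 attacker;
- the clean and adversarial gradients are averaged with equal weight.

The function names (`logistic_loss_and_grads`, `fgsm`, `adversarial_training_step`) are hypothetical.

```python
import numpy as np


def logistic_loss_and_grads(theta, X, y):
    """Mean logistic loss log(1 + exp(-y * x^T theta)) with labels y in {-1, +1}.

    Returns the loss, its gradient w.r.t. theta, and the per-sample input
    gradients d loss_i / d x_i (one row per sample, not divided by n).
    """
    margins = y * (X @ theta)
    loss = np.mean(np.logaddexp(0.0, -margins))
    s = -y / (1.0 + np.exp(margins))          # d loss_i / d (x_i^T theta)
    grad_theta = X.T @ s / len(y)
    grad_X = s[:, None] * theta[None, :]
    return loss, grad_theta, grad_X


def fgsm(theta, X, y, eps):
    """Stand-in for Step 2: a single signed-gradient step of size eps."""
    _, _, grad_X = logistic_loss_and_grads(theta, X, y)
    return eps * np.sign(grad_X)


def adversarial_training_step(theta, X, y, eps, lr, attack=fgsm):
    """One iteration: approximate the inner max, then update theta on clean
    and perturbed samples (the clean term preserves clean accuracy)."""
    delta = attack(theta, X, y, eps)                               # Step 2
    _, g_clean, _ = logistic_loss_and_grads(theta, X, y)
    _, g_adv, _ = logistic_loss_and_grads(theta, X + delta, y)
    theta = theta - lr * 0.5 * (g_clean + g_adv)
    return theta, delta


rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))
y = np.where(X @ np.array([1.0, -1.0, 0.5, 2.0, -0.5]) >= 0, 1.0, -1.0)
theta = 0.1 * rng.normal(size=5)
for _ in range(50):
    theta, delta = adversarial_training_step(theta, X, y, eps=0.1, lr=0.5)
```

Passing a different `attack` callable changes only Step 2, which mirrors how the methods discussed here differ mainly in how they approximate the inner problem.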
The loss $\ell(f(x_i+\delta_i;\theta),y_i)$ is highly nonconcave in $\delta_i$, and therefore Step 2 in Algorithm 1 is intractable. In practice, most adversarial training methods use hand-designed algorithms in Step 2 to generate adversarial perturbations. For example, Kurakin et al. (2016) propose to solve the inner problem approximately by first-order methods such as PGM. Specifically, PGM iteratively updates the adversarial perturbation of each sample by projected sign gradient ascent. Given a sample $(x_i,y_i)$, at the $t$-th iteration, PGM takes

$$\delta_i^{(t+1)}=\Pi_{\mathcal{B}(\epsilon)}\Big(\delta_i^{(t)}+\eta\,\mathrm{sign}\big(\nabla_{\delta}\,\ell(f(x_i+\delta_i^{(t)};\theta),y_i)\big)\Big),$$

where $\eta$ is the step size, $\delta_i^{(0)}=0$, $t=0,\dots,T-1$, and $T$ is a pre-defined total number of iterations. Finally, PGM takes $\delta_i=\delta_i^{(T)}$. Note that FGSM is essentially a one-iteration version of PGM.

Other works adopt different optimization methods, such as the momentum gradient method (Dong et al., 2018), L-BFGS (Tabacof and Valle, 2016) and SLSQP (Kraft, 1988). However, except for FGSM, they all require a large number of gradient queries, which is computationally very expensive.
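The PGM update above can be sketched in NumPy as follows. This is an illustration under our own assumptions, not the implementation used in the experiments:

- `pgm_attack` and `grad_fn` are hypothetical names;
- the projection uses `np.clip`, which coincides with $\mathrm{sign}(\delta)\odot\min(|\delta|,\epsilon)$;
- a linear logistic model stands in for the network, so the input gradient has a closed form.

Setting `num_steps=1` and `step=eps` gives FGSM.

```python
import numpy as np


def pgm_attack(grad_fn, x, eps, step, num_steps):
    """Projected sign-gradient ascent on the l_inf ball B(eps).

    delta^{(0)} = 0,
    delta^{(t+1)} = Pi_B(delta^{(t)} + step * sign(grad of loss at x + delta^{(t)})),
    and the result is delta^{(T)} with T = num_steps.
    grad_fn(x_adv) must return d loss / d input evaluated at x_adv.
    """
    delta = np.zeros_like(x)
    for _ in range(num_steps):
        delta = delta + step * np.sign(grad_fn(x + delta))
        delta = np.clip(delta, -eps, eps)   # projection onto B(eps)
    return delta


# Toy stand-in for f: binary logistic loss log(1 + exp(-y * theta^T x)).
theta = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.4])
y = 1.0


def loss(x_in):
    return np.logaddexp(0.0, -y * theta @ x_in)


def grad_fn(x_in):
    return -y / (1.0 + np.exp(y * theta @ x_in)) * theta


delta_fgsm = pgm_attack(grad_fn, x, eps=0.1, step=0.1, num_steps=1)   # FGSM
delta_pgm = pgm_attack(grad_fn, x, eps=0.1, step=0.025, num_steps=10)
print(loss(x), loss(x + delta_fgsm), loss(x + delta_pgm))
```

For this linear model the sign of the gradient never changes, so PGM and FGSM return the same perturbation $-\epsilon\,y\,\mathrm{sign}(\theta)$. The extra PGM iterations only pay off when the loss is nonconcave in $\delta$, as it is for deep networks.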
2.2 Learning to Defense by Learning to Attack (L2L)
Here, instead of applying hand-designed attackers, we learn an optimizer for the inner problem. The optimizer is parametrized by a convolutional neural network $g(\mathcal{A}(f,x,y);\phi)$, where $\mathcal{A}$ is an operator that constructs the input of the network $g$. We provide the following two straightforward examples:
Naive Attacker Network. This is the simplest attacker network we can imagine: it takes the original image as input, i.e., $\mathcal{A}(f,x,y)=x$. Under this setting, L2L training is similar to GAN (Goodfellow et al., 2014a). The major difference is that a GAN generates synthetic data by transforming random noise, whereas L2L training generates adversarial perturbations by transforming the training samples. (While this paper was in preparation, we found that a similar idea to the naive attacker network had been independently proposed in Anonymous (2019).)
Gradient Attacker Network. Besides the input image, we can also feed gradient information into the attacker network. Specifically, the attacker takes the original image $x$ and the gradient $\nabla_x\ell(f(x;\theta),y)$ as input, i.e., $\mathcal{A}(f,x,y)=[x,\ \nabla_x\ell(f(x;\theta),y)]$. Here $\nabla_x\ell(f(x;\theta),y)$ is the backpropagation gradient, computed recursively from the top layer to the bottom layer. Since more information is provided, we expect the attacker network to be more efficient to learn and, at the same time, to yield more powerful adversarial perturbations.
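To make the two choices of $\mathcal{A}$ concrete, the sketch below builds the attacker input for a batch of images in NumPy. It is illustrative only:

- A toy linear softmax classifier stands in for $f$, so that $\nabla_x\ell$ has a closed form. A real implementation would obtain the gradient by backpropagation through $f$.
- We assume that $[x,\nabla_x\ell]$ is fed to a convolutional $g$ by concatenating image and gradient along the channel axis.
- All function names are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def input_grad_linear_softmax(W, b, X, labels):
    """Gradient of each sample's cross-entropy w.r.t. its input image for a toy
    classifier logits = flatten(X) @ W + b. Returns an array shaped like X."""
    n = X.shape[0]
    p = softmax(X.reshape(n, -1) @ W + b)
    p[np.arange(n), labels] -= 1.0            # d loss / d logits = p - onehot(y)
    return (p @ W.T).reshape(X.shape)


def naive_attacker_input(X):
    """A(f, x, y) = x."""
    return X


def grad_attacker_input(W, b, X, labels):
    """A(f, x, y) = [x, grad_x loss], stacked along channels: (N, 2C, H, W)."""
    g = input_grad_linear_softmax(W, b, X, labels)
    return np.concatenate([X, g], axis=1)


rng = np.random.default_rng(0)
N, C, H, Wd, K = 4, 3, 8, 8, 10
X = rng.uniform(0.0, 1.0, size=(N, C, H, Wd))
W = 0.01 * rng.normal(size=(C * H * Wd, K))
b = np.zeros(K)
labels = rng.integers(0, K, size=N)
out = grad_attacker_input(W, b, X, labels)
print(out.shape)  # (4, 6, 8, 8)
```

The gradient attacker simply sees twice as many input channels as the naive one; the rest of $g$ can stay the same.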
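With this parametrization, we then convert problem (1) into the following problem:

$$\min_{\theta}\ \max_{\phi}\ \frac{1}{n}\sum_{i=1}^{n}\ell\Big(f\big(x_i+g(\mathcal{A}(f,x_i,y_i);\phi);\theta\big),\,y_i\Big)\quad\text{subject to}\quad g(\mathcal{A}(f,x_i,y_i);\phi)\in\mathcal{B}(\epsilon),\ \ i=1,\dots,n.\tag{2}$$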
Solving problem (2) naturally involves two stages. In the first stage, the classifier aims to fit all the perturbed data. In the second stage, given the classifier obtained in the first stage, the attacker network aims to generate the optimal perturbation under the constraint $\mathcal{B}(\epsilon)$. Figure 1 illustrates our training framework with the gradient attacker network. As can be seen, we jointly train two networks, a classifier $f$ and an attacker $g$. We feed both the clean data and the backpropagation gradient of the classifier into the attacker, and let $g$ learn to generate the perturbation for adversarial training, i.e.,

$$\delta_i=g\big(\mathcal{A}(f,x_i,y_i);\phi\big).$$
The constraint $\delta_i\in\mathcal{B}(\epsilon)$ can be handled by a $\tanh$ activation function in the last layer of the network $g$. Since the magnitude of the $\tanh$ output is bounded by 1, rescaling the output by $\epsilon$ ensures that the network's output satisfies $\delta_i\in\mathcal{B}(\epsilon)$. Moreover, our method only needs to update the attacker parameter $\phi$ instead of running an iterative solver for every sample. Each iteration therefore requires only one gradient query, which amortizes the adversarial training cost and leads to better computational efficiency. The corresponding training procedure is shown in Algorithm 2.
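Algorithm 2 is not reproduced here. The sketch below is our own toy rendering of the joint training it describes, under these assumptions:

- a logistic-regression classifier stands in for $f$;
- a single linear layer followed by $\epsilon\tanh(\cdot)$ stands in for the 6-layer convolutional attacker $g$, so the constraint $\delta_i\in\mathcal{B}(\epsilon)$ holds by construction;
- the gradient attacker input $[x_i,\nabla_x\ell]$ is used;
- $\theta$ and $\phi$ take simultaneous descent and ascent steps computed from the same forward pass.

Names such as `l2l_step` and `A_phi` are hypothetical.

```python
import numpy as np


def logistic_loss_and_grads(theta, X, y):
    """Mean loss log(1 + exp(-y x^T theta)), its theta-gradient, per-sample input gradients."""
    margins = y * (X @ theta)
    s = -y / (1.0 + np.exp(margins))
    return (np.mean(np.logaddexp(0.0, -margins)),
            X.T @ s / len(y),
            s[:, None] * theta[None, :])


def l2l_step(theta, A_phi, X, y, eps, lr_theta, lr_phi):
    """One joint iteration: descent for the classifier, ascent for the attacker."""
    n = len(y)
    _, g_clean, grad_X = logistic_loss_and_grads(theta, X, y)
    U = np.concatenate([X, grad_X], axis=1)        # gradient attacker input [x, grad_x loss]
    Z = U @ A_phi
    delta = eps * np.tanh(Z)                       # |tanh| <= 1, so ||delta||_inf <= eps
    adv_loss, g_adv, grad_X_adv = logistic_loss_and_grads(theta, X + delta, y)
    # Chain rule through delta = eps * tanh(U @ A_phi), with U held fixed.
    g_phi = U.T @ ((grad_X_adv / n) * eps * (1.0 - np.tanh(Z) ** 2))
    # delta is treated as a constant when updating theta.
    theta = theta - lr_theta * 0.5 * (g_clean + g_adv)   # outer minimization
    A_phi = A_phi + lr_phi * g_phi                       # attacker maximizes the loss
    return theta, A_phi, delta, adv_loss


rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
y = np.where(X @ np.array([1.0, -2.0, 0.5, 1.5]) >= 0, 1.0, -1.0)
theta = 0.1 * rng.normal(size=4)
A_phi = 0.1 * rng.normal(size=(8, 4))      # shape (2d, d) for the gradient attacker
for _ in range(200):
    theta, A_phi, delta, adv_loss = l2l_step(theta, A_phi, X, y, eps=0.1,
                                             lr_theta=0.5, lr_phi=0.5)
```

Generating $\delta$ here costs one forward pass of the attacker rather than $T$ projected gradient steps per sample. Moreover, `A_phi` is shared across all samples, which is the amortization described above.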
To demonstrate the efficiency and effectiveness of our proposed method, we present experimental results on the CIFAR-10 and CIFAR-100 datasets. We consider two attack methods, FGSM and PGM, and evaluate the robustness of deep neural network models under both the black-box and white-box settings. All experiments are implemented in PyTorch and run on a single NVIDIA 1080Ti GPU. All reported results are averaged over 10 runs with different random initializations (summarized in Tables 1 and 2).
Experimental Settings: All experiments adopt a 32-layer Wide Residual Network (WRN-4-32, Zagoruyko and Komodakis (2016)) as the classifier. A pre-trained network is used as the initial classifier in adversarial training (the pre-trained network is obtained by the training procedure on clean data of Zagoruyko and Komodakis (2016)). For training the attacker network, we use the stochastic gradient descent (SGD) algorithm with Polyak's momentum (the momentum parameter is ) and weight decay (the parameter is ). We observe that, after adversarial training for epochs, all adversarial training methods become stable (converge well). For L2L training, we use a step size of for the first epochs and reduce the step size to for the last 10 epochs. For both FGSM and PGM training, we use a fixed step size of (we find that step size annealing hurts both FGSM and PGM training). For PGM attack and training, we use and , which yields sufficiently strong perturbations in practice. For L2L training, we use a 6-layer convolutional neural network as the attacker network, shown in Table 3. We set the size of the constraint to be for all experiments.
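For reference, the optimizer described above can be sketched as follows: SGD with Polyak's heavy-ball momentum and weight decay, plus a step-size drop for the last 10 epochs of L2L training. The step sizes, momentum, weight decay and number of epochs in the usage example are hypothetical placeholders, not the experimental values. The exact form of the update is also our assumption; deep-learning libraries often implement an equivalent velocity-based form.

```python
import numpy as np


def heavy_ball_step(w, w_prev, grad, lr, momentum, weight_decay):
    """Polyak heavy-ball update with L2 weight decay:
    w_next = w - lr * (grad + weight_decay * w) + momentum * (w - w_prev).
    Returns (w_next, w) so the caller can keep the previous iterate."""
    w_next = w - lr * (grad + weight_decay * w) + momentum * (w - w_prev)
    return w_next, w


def step_size(epoch, total_epochs, base_lr, final_lr, final_epochs=10):
    """Piecewise-constant schedule: base_lr, then final_lr for the last final_epochs epochs."""
    return final_lr if epoch >= total_epochs - final_epochs else base_lr


# Hypothetical values, for illustration only.
TOTAL_EPOCHS, BASE_LR, FINAL_LR = 30, 0.1, 0.01
MOMENTUM, WEIGHT_DECAY = 0.9, 5e-4

# Minimize f(w) = 0.5 * ||w||^2 (gradient = w), with 10 updates per "epoch".
w = np.array([1.0, -2.0])
w_prev = w.copy()
for epoch in range(TOTAL_EPOCHS):
    lr = step_size(epoch, TOTAL_EPOCHS, BASE_LR, FINAL_LR)
    for _ in range(10):
        w, w_prev = heavy_ball_step(w, w_prev, grad=w, lr=lr,
                                    momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
```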
Under the white-box setting, attackers can access all parameters of the target models and generate adversarial examples based on them. Under the black-box setting, accessing the parameters is prohibited. We therefore adopt the standard transfer attack method (Liu et al., 2016). Specifically, we train another classifier with a different random seed, and attackers generate adversarial examples based on this classifier to attack the target model.
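The transfer protocol can be sketched with a toy example under our own assumptions. Two logistic-regression models trained from different random seeds stand in for the target and the surrogate classifiers. FGSM perturbations are computed from the surrogate's gradients only, then evaluated on the target. `train_logreg`, `fgsm_from` and `accuracy` are hypothetical helpers.

```python
import numpy as np


def train_logreg(X, y, seed, steps=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss from a seed-dependent init."""
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.normal(size=X.shape[1])
    for _ in range(steps):
        s = -y / (1.0 + np.exp(y * (X @ theta)))
        theta = theta - lr * X.T @ s / len(y)
    return theta


def fgsm_from(theta_src, X, y, eps):
    """FGSM using only the surrogate's input gradient (the target stays hidden)."""
    s = -y / (1.0 + np.exp(y * (X @ theta_src)))
    return eps * np.sign(s[:, None] * theta_src[None, :])


def accuracy(theta, X, y):
    return float(np.mean(np.where(X @ theta >= 0, 1.0, -1.0) == y))


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.where(X @ np.array([1.0, -1.5, 0.8, 2.0, -1.2]) >= 0, 1.0, -1.0)

theta_target = train_logreg(X, y, seed=1)      # target model, parameters hidden from the attacker
theta_surrogate = train_logreg(X, y, seed=2)   # attacker's own model
delta = fgsm_from(theta_surrogate, X, y, eps=0.3)
print(accuracy(theta_target, X, y), accuracy(theta_target, X + delta, y))
```

The attack succeeds only to the extent that the surrogate's gradients align with the target's, which is what the black-box setting measures.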
Tables 1 and 2 headers: settings White Box and Black Box; Target networks Plain Net, FGSM Net, PGM Net, Grad L2L; training methods Clean Data, FGSM Training, PGM Training, Naive L2L, Grad L2L.
Grad L2L vs. PGM. Table 1 shows that, in terms of classification accuracy on adversarial examples, our proposed Grad L2L training uniformly outperforms PGM training in all settings, even when the adversarial attack is generated by PGM. Moreover, Table 2 shows that Grad L2L training is computationally more efficient than PGM training.
Grad L2L vs. Naive L2L. Table 1 shows that, in terms of classification accuracy on adversarial examples, our proposed Grad L2L training achieves significantly better performance than Naive L2L. This demonstrates that adding gradient information indeed yields a better adversarial training procedure.
Grad L2L vs. FGSM. Table 1 shows that, under the FGSM attack, FGSM training yields better classification accuracy than Grad L2L training. However, FGSM training is much more vulnerable to the PGM attack than Grad L2L training. Moreover, Table 2 shows that Grad L2L training is only slightly slower than FGSM training.
We discuss a few benefits of our neural network approach:

(i) Neural networks are known to be powerful function approximators, so our attacker network is capable of yielding very strong adversarial perturbations.

(ii) We generate the adversarial perturbations for all samples using the same attacker network. The attacker network therefore essentially learns common structure across samples, which helps it yield stronger perturbations.

(iii) The attacker networks in our experiments are overparametrized. Overparametrization has been conjectured to ease the training of deep neural networks, and we believe a similar phenomenon occurs for our attacker network and eases the adversarial training of the robust classifier.
- Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B. and De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems.
- Anonymous (2019) Anonymous (2019). A direct approach to robust deep learning using adversarial networks. Submitted to International Conference on Learning Representations.
- Dai et al. (2016) Dai, B., He, N., Pan, Y., Boots, B. and Song, L. (2016). Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579 .
- Dong et al. (2018) Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X. and Li, J. (2018). Boosting adversarial attacks with momentum. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Girshick et al. (2014) Girshick, R., Donahue, J., Darrell, T. and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Goodfellow et al. (2014a) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014a). Generative adversarial nets. In Advances in neural information processing systems.
- Goodfellow et al. (2014b) Goodfellow, I. J., Shlens, J. and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 .
- He et al. (2016) He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Hochreiter et al. (2001) Hochreiter, S., Younger, A. S. and Conwell, P. R. (2001). Learning to learn using gradient descent. In International Conference on Artificial Neural Networks. Springer.
- Kraft (1988) Kraft, D. (1988). A Software Package for Sequential Quadratic Programming. Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt, Köln: Forschungsbericht, Wiss. Berichtswesen d. DFVLR.
- Kurakin et al. (2016) Kurakin, A., Goodfellow, I. and Bengio, S. (2016). Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236 .
- Liu et al. (2017) Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T. and Song, L. (2017). Deep hyperspherical learning. In Advances in Neural Information Processing Systems.
- Liu et al. (2016) Liu, Y., Chen, X., Liu, C. and Song, D. (2016). Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 .
- Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 .
- Nguyen et al. (2015) Nguyen, A., Yosinski, J. and Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 .
- Tabacof and Valle (2016) Tabacof, P. and Valle, E. (2016). Exploring the space of adversarial images. In 2016 International Joint Conference on Neural Networks (IJCNN). IEEE.
- Taigman et al. (2014) Taigman, Y., Yang, M., Ranzato, M. and Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Thrun and Pratt (2012) Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
- Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146 .
- Zheng et al. (2016) Zheng, S., Song, Y., Leung, T. and Goodfellow, I. (2016). Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.