On Norm-Agnostic Robustness of Adversarial Training

05/15/2019 · by Bai Li, et al.

Adversarial examples are carefully perturbed inputs for fooling machine learning models. A well-acknowledged defense method against such examples is adversarial training, where adversarial examples are injected into the training data to increase robustness. In this paper, we propose a new attack to unveil an undesired property of the state-of-the-art adversarial training, namely that it fails to obtain robustness against perturbations in the ℓ_2 and ℓ_∞ norms simultaneously. We also discuss a possible solution to this issue and its limitations.


1 Introduction

Deep neural networks (DNNs) have achieved significant success when applied to a variety of challenging machine-learning tasks. For example, DNNs have obtained state-of-the-art accuracy on large-scale image classification (He et al., 2016b; Huang et al., 2017). At the same time, vulnerability to adversarial examples, an undesired property of DNNs, has drawn attention in the deep-learning community (Szegedy et al., 2013; Goodfellow et al., 2014). Generally speaking, adversarial examples are perturbed versions of the original data that successfully fool a classifier. For example, in the image domain, adversarial examples are images transformed from natural images with visually negligible changes but that lead to wrong predictions (Goodfellow et al., 2014). The existence of adversarial examples has raised many concerns, especially in scenarios with a high risk of misclassification, such as autonomous driving.

To tackle adversarial examples, a variety of defensive methods against adversarial attacks have been proposed, yet most of them remain vulnerable to adaptive attacks (Szegedy et al., 2013; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Papernot et al., 2016; Kurakin et al., 2016; Carlini & Wagner, 2017; Brendel et al., 2017; Athalye et al., 2018). One type of adversarial defense that has demonstrated good performance against strong attacks is adversarial training (Goodfellow et al., 2014; Madry et al., 2017; Zhang et al., 2019), which constructs a defense model by augmenting the training set with adversarial examples. Though this is a simple strategy, it has achieved great success in adversarial defense.

The strength of attacks is commonly quantified by the distance between adversarial examples and natural examples. One desired property of a defense model is being norm-agnostic, which requires the model to be robust against attacks constrained by a variety of norms. Recently, a more general attack mechanism called unrestricted adversarial attacks was introduced by Brown et al. (2018), where adversarial examples need not be close to a natural image as long as they are semantically similar. To achieve robustness against unrestricted attacks, being norm-agnostic is a minimum requirement.

In this paper, we propose a new attack method and show that adversarial training, one of the most successful adversarial defense methods, is not norm-agnostic. Previously, it was reported both in (Madry et al., 2017) and (Zhang et al., 2019) that adversarial training is robust against ℓ_∞ attacks. Our experiments, however, suggest these models fail to defend against ℓ_2 and ℓ_∞ adversarial examples simultaneously.

2 Background and Related Work

Adversarial training constructs adversarial examples that are included in the training set to train a new and more robust classifier. This method is intuitive and has gained great success in defense (Szegedy et al., 2013; Goodfellow et al., 2014; Madry et al., 2017; Zhang et al., 2019). Madry et al. (2017) showed that using iterative attacks during training yields defense models that are robust to white-box attacks (Athalye et al., 2018). More recently, another adversarial-training-based defense model (Zhang et al., 2019) won first place in the defense track of the NIPS 2018 Adversarial Vision Challenge (Brendel et al., 2018).

Although adversarial training has so far been one of the most successful defense methods, it has its limitations. In (Tramèr et al., 2017), it was pointed out that single-step adversarial training, where a single-step method (e.g., FGSM (Goodfellow et al., 2014)) is used for constructing adversarial examples, suffers from the “degenerate global minimum” issue and thus is not robust. To mitigate this issue, they propose ensemble adversarial training to improve the generalization of adversarial training. More recently, (Song et al., 2018) suggested using domain adaptation as an improvement over ensemble adversarial training, leading to better robustness. However, both works only focus on adversarial training based on single-step attacks, while the most advanced adversarial training models are based on multi-step attacks. Tramèr et al. (2017) state that incorporating multi-step attacks during training could fix the degenerate-global-minimum issue. In this paper, we show that multi-step adversarial training still suffers from this issue.

3 Preliminary

3.1 Adversarial Examples

Given a classifier f(·) and an image x, an adversarial example x' satisfies D(x, x') < ε for some small ε > 0 and f(x') ≠ f(x), where D(·, ·) is some distance metric; i.e., x' is close to x but yields a different classification result. The distance is often described in terms of an ℓ_p metric, and in most of the literature the ℓ_2 and ℓ_∞ metrics are considered.

One of the simplest and most widely used attack methods is a single-step method, the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which manipulates inputs along the direction of the gradient of the loss with respect to the inputs:

x' = Π_{B_ε(x)}(x + ε · sign(∇_x L(θ, x, y))),    (1)

where Π_{B_ε(x)} is the projection operation that ensures adversarial examples stay in the ε-ball B_ε(x) around x.

Its multi-step variant (Kurakin et al., 2016) is more powerful and has been shown to be equivalent to exploring adversarial examples with the projected gradient descent (PGD) method (Madry et al., 2017):

x^{t+1} = Π_{B_ε(x)}(x^t + α · sign(∇_x L(θ, x^t, y))),    (2)

where α is the step size.
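
For concreteness, the following is a minimal PyTorch-style sketch of FGSM and its multi-step ℓ_∞ PGD variant. The function names, hyperparameters, and the [0, 1] pixel clipping are illustrative assumptions rather than code from the paper.

```python
# Minimal sketch of FGSM (single step) and l_inf PGD (multi-step), assuming a
# differentiable PyTorch classifier `model`, images `x` in [0, 1], and labels `y`.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Single step of size eps along the gradient sign, then clip to valid pixel range.
    return torch.clamp(x + eps * grad.sign(), 0.0, 1.0).detach()

def pgd_linf(model, x, y, eps, alpha, steps):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient-sign step followed by projection onto the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```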

3.2 Adversarial Training

The motivation behind adversarial training is that finding a robust model against adversarial examples is equivalent to solving the saddle-point problem:

min_θ E_{(x,y)∼D} [ max_{‖δ‖_p ≤ ε} L(θ, x + δ, y) ].

The inner maximization is equivalent to constructing adversarial examples, while the outer minimization can be performed by standard training procedure for loss minimization.

Therefore, to achieve robustness to adversarial examples, adversarial training augments the training data with adversarial examples constructed during training, as an approximation to the inner maximization procedure.
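
As an illustration, a minimal sketch of such a training loop is given below, assuming the hypothetical pgd_linf helper from the previous sketch; all hyperparameter values are placeholders.

```python
# Sketch of a PGD-based adversarial training epoch: the inner maximization is
# approximated by constructing adversarial examples for each batch, and the outer
# minimization is an ordinary gradient step on those examples.
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.3, alpha=0.01, steps=40):
    model.train()
    for x, y in loader:
        # Inner maximization: craft adversarial examples for the current batch.
        x_adv = pgd_linf(model, x, y, eps=eps, alpha=alpha, steps=steps)
        # Outer minimization: standard training step on the adversarial batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```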

Recently, Zhang et al. (2019) suggested using the TRADES loss L(f(x), y) + max_{x' ∈ B_ε(x)} L(f(x), f(x'))/λ as the training loss, which trades off natural accuracy against robustness, instead of the purely adversarial loss L(f(x'), y) used in (Madry et al., 2017).
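
A rough sketch of a TRADES-style training loss is shown below, following the commonly cited formulation (natural cross-entropy plus a KL term between predictions on clean and perturbed inputs); the inner maximization that produces x_adv and the trade-off weight beta are placeholders, not the authors' implementation.

```python
# Rough sketch of a TRADES-style loss: natural cross-entropy plus a KL divergence
# between the model's predictions on clean and adversarially perturbed inputs.
# `x_adv` is assumed to come from an inner maximization of the KL term; `beta`
# is a placeholder trade-off weight.
import torch.nn.functional as F

def trades_style_loss(model, x, x_adv, y, beta=1.0):
    logits_nat = model(x)
    logits_adv = model(x_adv)
    natural_loss = F.cross_entropy(logits_nat, y)
    robust_loss = F.kl_div(F.log_softmax(logits_adv, dim=1),
                           F.softmax(logits_nat, dim=1),
                           reduction='batchmean')
    return natural_loss + beta * robust_loss
```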

3.3 Degenerate Global Minimum

In (Tramèr et al., 2017), it is pointed out that if x_adv denotes the adversarial example generated by FGSM for a natural example x, adversarial training ideally results in a robust classification model θ* such that:

L(θ*, x_adv, y) ≈ max_{‖δ‖ ≤ ε} L(θ*, x + δ, y).

However, the training procedure may instead discover a “degenerate global minimum” θ*:

L(θ*, x_adv, y) ≪ max_{‖δ‖ ≤ ε} L(θ*, x + δ, y).

In other words, the training procedure may produce a model that merely makes finding adversarial examples difficult for FGSM instead of a truly robust model.

Tramèr et al. (2017) propose two possible solutions for mitigating this issue. One is to use strong multi-step adversarial training, such as PGD, at the cost of an increased computational burden. The other is ensemble adversarial training, that is, incorporating adversarial examples generated from multiple pre-trained classifiers that differ from the one being trained. In this way, the construction of adversarial examples is decoupled from training, preventing the “degenerate global minimum”, while the robustness of adversarial training is retained due to the transferability of adversarial perturbations across models (Goodfellow et al., 2014).
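
As a rough illustration, the sketch below outlines one way such a scheme could look, drawing each batch's perturbations from a randomly chosen static source model; the fgsm helper from the earlier sketch, the clean/adversarial mixing weights, and all names are assumptions, not the procedure of Tramèr et al. verbatim.

```python
# Sketch of ensemble adversarial training: perturbations are generated against
# randomly chosen, static pre-trained source models (or the current model),
# decoupling example construction from the model being trained.
import random
import torch.nn.functional as F

def ensemble_adv_training_epoch(model, source_models, loader, optimizer, eps=0.3):
    model.train()
    for x, y in loader:
        # Pick a static source model (or the current model) to craft the attack.
        attacker = random.choice(source_models + [model])
        x_adv = fgsm(attacker, x, y, eps)  # single-step attack, as in the original proposal
        optimizer.zero_grad()
        # Train on a mix of clean and adversarial examples.
        loss = 0.5 * (F.cross_entropy(model(x), y) +
                      F.cross_entropy(model(x_adv), y))
        loss.backward()
        optimizer.step()
```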

4 Second Order Attack

We propose a new attack motivated by the “degenerate global minimum”. Note that adversarial training is equivalent to solving the optimization problem:

min_θ E_{(x,y)∼D} [ max_{‖δ‖_p ≤ ε} L(θ, x + δ, y) ].

Its solution θ* is part of a saddle point of L, i.e., the gradient with respect to the input ideally vanishes at a natural example x, as ∇_x L(θ*, x, y) ≈ 0. In practice, adversarial training often finds a θ that makes the loss function flat in the neighborhood of a natural example x, which leads to inefficient exploration for adversarial examples when performing attacks. This is intuitively the cause of the “degenerate global minimum”.

              |       Madry's        |        TRADES        |       Ensemble       |         ATDA
Attacks       |  ℓ_2    ℓ_∞    Mix   |  ℓ_2    ℓ_∞    Mix   |  ℓ_2    ℓ_∞    Mix   |  ℓ_2    ℓ_∞    Mix
Natural       | 98.2%  98.8%  98.7%  | 99.4%  99.5%  99.4%  | 99.4%  99.0%  98.7%  | 99.2%  98.8%  99.0%
PGD (ℓ_2)     | 97.0%  92.8%  73.2%  | 91.7%  91.7%  90.4%  | 99.0%  65.3%  58.2%  | 98.8%  63.6%  57.9%
PGD (ℓ_∞)     |  0.4%  92.5%  82.2%  | 19.7%  95.6%  15.3%  |  0.0%  90.2%  81.4%  |  0.0%  62.6%  81.3%
S-O (ℓ_2)     | 96.6%   0.0%  18.3%  | 81.7%   3.2%  84.2%  | 98.9%  65.8%  58.4%  | 97.2%  64.0%  56.8%
S-O (ℓ_∞)     |  0.0%  91.3%  84.2%  | 16.9%  94.7%  14.5%  |  0.0%  88.9%  82.1%  |  0.0%  61.3%  83.4%
Table 1: Accuracy of Various Adversarial Training Strategies against Various Attacks. The columns under each defense correspond to training with ℓ_2-bounded, ℓ_∞-bounded, and mixed adversarial examples, respectively.

Most current attack methods construct adversarial examples based on the gradient of a loss function. However, according to the analysis above, the first-order derivative is not effective for attacks if the defense model is trained adversarially. This motivates utilizing the second-order derivative of the loss function to construct adversarial examples.

To this end, assume the loss function L(θ, x, y) is twice differentiable with respect to the input x. Using a Taylor expansion of the difference between the losses on the original and perturbed samples, and assuming the gradient vanishes, we have

L(θ, x + δ, y) − L(θ, x, y) ≈ ½ δᵀ H δ,

with δ being the perturbation and H the Hessian matrix of the loss function with respect to x. Our goal is to find a small perturbation δ that maximizes this difference. Our idea is based on the observation that the optimal perturbation direction should be in the same direction as the first dominant eigenvector v_1 of H, that is, δ = c·v_1 for some constant c. However, computing the eigenvectors of the Hessian matrix requires O(d³) runtime, with d the dimension of the data. To tackle this issue, we adopt the fast approximation method from (Miyato et al., 2017), which is essentially a combination of the power-iteration method and the finite-difference method, to efficiently find the direction of the eigenvector. Based on this method, the optimal direction, denoted d(x, r), is approximated (detailed derivations are provided in the Supplementary Material) by

d(x, r) ≈ ∇_x L(θ, x + ξ r, y) / ‖∇_x L(θ, x + ξ r, y)‖_2,    (3)

where r is a randomly sampled unit vector and ξ is a manually chosen step size. In practice, r is drawn from a centered Gaussian distribution and normalized such that its ℓ_2 norm is 1.
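
To spell out why the dominant eigenvector is the optimal direction: for a symmetric Hessian H, the quadratic form under a fixed ℓ_2 budget is maximized along v_1, by the standard Rayleigh-quotient argument:

```latex
\max_{\|\delta\|_2 = c} \tfrac{1}{2}\,\delta^{\top} H \delta
  \;=\; \tfrac{c^{2}}{2}\,\max_{\|u\|_2 = 1} u^{\top} H u
  \;=\; \tfrac{c^{2}}{2}\,\lambda_{1},
\qquad \text{attained at } \delta = c\,v_{1},
```

where λ_1 is the largest eigenvalue of H and v_1 the corresponding eigenvector.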

This procedure is essentially a stochastic approximation to the optimal second-order direction, where the randomness comes from r. To reduce the variance of the approximation, we further take the expectation over the Gaussian noise, yielding E_{r∼N(0, σ²I)}[d(x, r)]. Note that choosing σ is equivalent to choosing the step size ξ in (3). Finally, we construct adversarial examples by an iterative update via PGD:

x^{t+1} = Π_{B_ε(x)}(x^t + α · E_{r∼N(0, σ²I)}[d(x^t, r)]),    (4)

where α is the step size. Intuitively, this method perturbs the example at each iteration and tries to move out of the local maximum in the sample space, due to the introduction of random Gaussian noise.
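
Putting the pieces together, the sketch below implements our reading of Equations (3) and (4): at each PGD iteration, the update direction is a Monte Carlo estimate of the expected normalized gradient at Gaussian-perturbed copies of the current iterate. The sample count, step handling, and ℓ_∞ projection are illustrative choices, not the authors' released code.

```python
# Sketch of the second-order (S-O) attack as we read Eqs. (3)-(4): the PGD update
# direction is an expectation, over Gaussian noise, of normalized gradients taken
# at perturbed copies of the current iterate.
import torch
import torch.nn.functional as F

def so_attack_linf(model, x, y, eps, alpha, steps, sigma=0.1, n_samples=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        direction = torch.zeros_like(x_adv)
        for _ in range(n_samples):
            r = sigma * torch.randn_like(x_adv)              # r ~ N(0, sigma^2 I)
            x_pert = (x_adv + r).detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_pert), y)
            g = torch.autograd.grad(loss, x_pert)[0]
            # d(x, r) from Eq. (3): the normalized gradient at the perturbed point
            # (the reshaping below assumes NCHW image batches).
            g_norm = g.view(g.size(0), -1).norm(dim=1).view(-1, 1, 1, 1)
            direction = direction + g / (g_norm + 1e-12) / n_samples
        # Eq. (4): step along the Monte Carlo estimate of E_r[d(x, r)], then project
        # onto the l_inf ball around x (an l_2 projection would give the l_2 variant).
        x_adv = (x_adv + alpha * direction).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```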

5 Experiments

We perform experiments on the MNIST data set to validate our claims on adversarial training.

The architecture of our model follows the one used in (Madry et al., 2017). Specifically, the model contains two convolutional layers with 32 and 64 filters, each followed by 2×2 max-pooling, and a fully connected layer of size 1024. Image intensities are scaled to [0, 1], and the sizes of attacks are rescaled accordingly. In all the experiments, we bound the ℓ_∞ norm of the perturbations by a fixed budget ε_∞ and the ℓ_2 norm by a fixed budget ε_2.
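
A possible PyTorch rendering of this architecture is sketched below; the kernel sizes and padding are our assumptions, since the text only specifies the filter counts, the pooling, and the fully connected width.

```python
# Approximate PyTorch version of the MNIST classifier described above: two
# convolutional layers with 32 and 64 filters, each followed by 2x2 max-pooling,
# then a fully connected layer of size 1024. Kernel size and padding follow the
# usual Madry et al. setup and are assumptions here.
import torch.nn as nn

class MnistCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),
            nn.Linear(1024, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```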

We evaluate PGD and the proposed S-O attacks in four settings: adversarial training with PGD adversarial examples (Madry's) (Madry et al., 2017), Tradeoff-inspired Adversarial Defense (TRADES) (Zhang et al., 2019), ensemble adversarial training (Ensemble) (Tramèr et al., 2017), and adversarial training via domain adaptation (ATDA) (Song et al., 2018).

Specifically, we first consider constructing adversarial examples with ℓ_2 and ℓ_∞ constraints during training, respectively. Table 1 shows, as expected, that Madry's model and TRADES successfully defend against attacks with the same norm constraints as those used in training. However, in spite of the fact that adversarially trained models stay robust against PGD attacks constrained by a different norm, the S-O attack can effectively reduce the accuracy of both models when a different norm is used. This suggests that standard adversarial training is not norm-agnostic.

It is natural to wonder whether the issue would be fixed if both kinds of adversarial examples were included during training. To this end, we conduct additional experiments with mixed adversarial examples, that is, alternating between ℓ_2- and ℓ_∞-bounded examples during adversarial training. Using the mixed strategy, the accuracy under both attacks is no longer reduced to almost zero, but the overall performance is still unsatisfying. We conclude that mixing adversarial examples barely helps improve norm-agnostic robustness.

According to our analysis, the poor performance of adversarial training is due to the “degenerate global minimum”; therefore, we expect ensemble adversarial training to fix the problem, as suggested in (Tramèr et al., 2017). The results in Table 1 suggest that ensemble adversarial training and domain adaptation partially fix the issue, although the accuracy against attacks in the other norm is still far from ideal.

In addition, we find two more interesting phenomena that support our claims. Firstly, the relatively good performance of ensemble adversarial training implies that the vulnerability to adversarial perturbations with different norms is indeed caused by the “degenerate global minimum” issue, similar to the case of single-step adversarial training. Secondly, the performance of PGD and S-O attacks becomes similar for the ensemble adversarial training model. This implies that the effectiveness of S-O attacks compared to PGD attacks is due to exploitation of the “degenerate global minimum” issue.

In Figure 1, we take a closer look at the behaviour of the attack methods by plotting the average norms of the gradients of the loss function with respect to the adversarial examples during the construction process. Specifically, at each iteration t we compute (1/|B|) Σ_{i∈B} ‖∇_x L(θ, x_i^t, y_i)‖_2, where B is the index set of a batch.
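
This diagnostic can be computed with a few lines of PyTorch; the sketch below assumes the attack loop supplies the current iterates x_t and labels y, and the helper name is ours.

```python
# Sketch of the diagnostic plotted in Figure 1: the average l_2 norm of the input
# gradient over a batch, recorded at each attack iteration.
import torch
import torch.nn.functional as F

def avg_grad_norm(model, x_t, y):
    x_t = x_t.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_t), y)
    g = torch.autograd.grad(loss, x_t)[0]
    # Per-example l_2 norm of the gradient, averaged over the batch index set.
    return g.view(g.size(0), -1).norm(dim=1).mean().item()
```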

Figure 1: Average ℓ_2 norm of the gradients of the loss function for a batch at each iteration during adversarial attack. Blue: Naturally trained model attacked by PGD. Orange: TRADES attacked by PGD. Green: TRADES attacked by S-O.

We monitor this quantity in three settings: a naturally trained model attacked by PGD, TRADES attacked by PGD, and TRADES attacked by S-O. The difference between the blue and orange lines shows that the gradient norms of the adversarially trained model are much smaller than those of the naturally trained model under PGD attacks. This validates our explanation in Section 4 that an adversarially trained model tends to make the loss function “flat” in the neighborhood of natural examples, which makes PGD attacks inefficient. The difference between the orange and green lines shows that the S-O attack is able to construct adversarial examples more efficiently by correctly finding the steepest direction, which explains why adversarially trained models are vulnerable to it.

Finally, one may argue that the perturbation size is so large that it violates the assumption that adversarial perturbations are visually negligible. We therefore also perform the S-O attack with a reduced perturbation size and observe that it remains effective. We illustrate some randomly selected perturbed adversarial examples that are misclassified by TRADES in Figure 2.

Figure 2: Above: Natural examples from MNIST. The correct labels are 2, 1, 7, 7, 1, 3. Below: Adversarial examples with the reduced perturbation size. The adversarially trained model's predictions are 0, 4, 9, 2, 4, 8.

One can observe that, although noticeable, the perturbations in the adversarial examples are limited and do not change the semantic meaning of the images.

CIFAR-10

It is worth noting that we do not observe similar results on CIFAR-10. We believe that on CIFAR-10 it is difficult to reach even a “degenerate global minimum” during adversarial training, due to the high dimensionality of the input space. This explains why adversarial training is still far from being perfectly robust even against PGD (Madry et al., 2017; Zhang et al., 2019).

6 Conclusion

In this paper, we show that multi-step adversarial training models suffer from the “degenerate global minimum” and thus are not norm-agnostically robust. Our proposed attack method is capable of constructing adversarial examples that reduce the accuracy of state-of-the-art adversarial training when different norms are used for training and attacking.

On the other hand, ensemble adversarial training can mitigate the issue and thus should be considered a standard procedure for adversarial training, even though it only obtains moderate adversarial robustness.

In general, considering that state-of-the-art results in adversarial defense are often achieved by adversarial training, we believe it is important to check norm-agnostic robustness when designing adversarial defense models.

Appendix A Fast Approximation Method (Miyato et al., 2017)

The power-iteration method (Golub & van der Vorst, 2001) allows one to compute the dominant eigenvector v_1 of a matrix H. Let r_0 be a randomly sampled unit vector that is not perpendicular to v_1; the iterative calculation of

r_{k+1} = H r_k / ‖H r_k‖_2

leads to r_k → v_1. Given that H is the Hessian matrix of L(θ, x, y) with respect to x, we further use the finite-difference method to reduce the computational complexity:

H r ≈ (∇_x L(θ, x + ξ r, y) − ∇_x L(θ, x, y)) / ξ,

where ξ is the step size. If we take only one iteration, and note that ∇_x L(θ, x, y) approximately vanishes for an adversarially trained model, this gives an approximation that only requires the first-order derivative:

v_1 ≈ ∇_x L(θ, x + ξ r_0, y) / ‖∇_x L(θ, x + ξ r_0, y)‖_2,

which gives Equation (3).
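
A sketch of this procedure in PyTorch is given below; the loss choice, the value of xi, and the per-example normalization are illustrative assumptions.

```python
# Sketch of the finite-difference power iteration described above: each
# Hessian-vector product H r is approximated by a difference of input gradients,
# so only first-order derivatives are needed.
import torch
import torch.nn.functional as F

def input_grad(model, x, y):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x)[0]

def dominant_hessian_direction(model, x, y, xi=1e-3, n_iters=1):
    # Random unit starting vector (reshaping assumes NCHW image batches).
    r = torch.randn_like(x)
    r = r / r.view(r.size(0), -1).norm(dim=1).view(-1, 1, 1, 1)
    g0 = input_grad(model, x, y)
    for _ in range(n_iters):
        # Finite-difference approximation of the Hessian-vector product H r.
        hr = (input_grad(model, x + xi * r, y) - g0) / xi
        r = hr / (hr.view(hr.size(0), -1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
    return r
```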