Machine vs Machine: Minimax-Optimal Defense Against Adversarial Examples

11/12/2017 ∙ by Jihun Hamm, et al. ∙ The Ohio State University

Recently, researchers have discovered that the state-of-the-art object classifiers can be fooled easily by small perturbations in the input unnoticeable to human eyes. It is known that an attacker can generate strong adversarial examples if she knows the classifier parameters. Conversely, a defender can robustify the classifier by retraining if she has the adversarial examples. The cat-and-mouse game nature of attacks and defenses raises the question of the presence of equilibria in the dynamics. In this paper, we present a neural-network based attack class to approximate a larger but intractable class of attacks, and formulate the attacker-defender interaction as a zero-sum leader-follower game. We present sensitivity-penalized optimization algorithms to find minimax solutions, which are the best worst-case defenses against whitebox attacks. Advantages of the learning-based attacks and defenses compared to gradient-based attacks and defenses are demonstrated with MNIST and CIFAR-10.


1 Introduction

Recently, researchers have made an unexpected discovery that state-of-the-art object classifiers can be fooled easily by small perturbations in the input that are unnoticeable to human eyes [29, 8]. Subsequent studies have tried to explain the cause of this seeming failure of deep learning against such adversarial examples. The vulnerability was ascribed to linearity [8], low flexibility [5], or the flatness/curvedness of decision boundaries [23], but a more complete picture is still under research. This is troublesome since such a vulnerability can be exploited in critical situations, such as an autonomous car misreading traffic signs or a facial recognition system granting access to an impersonator, without being noticed. Several methods of generating adversarial examples have been proposed [8, 22, 3], most of which use knowledge of the classifier to craft examples. In response, a few defense methods have been proposed: retraining the target classifier with adversarial examples, called adversarial training [29, 8]; suppressing gradients by retraining with soft labels, called defensive distillation [26]; and hardening the target classifier by training with an ensemble of adversarial examples [30]. (See Related work for descriptions of more methods.)

In this paper we focus on whitebox attacks, where the model and the parameters of the classifier are known to the attacker. This requires a genuinely robust classifier or defense method, since the defender cannot rely on the secrecy of the parameters as a defense. To emphasize the dynamic nature of attack and defense, we start with the following simple experiment (see Sec 3.1 for a full description). Suppose first that a classifier is trained on an original non-adversarial dataset. Using the trained classifier parameters, an attacker can then generate adversarial examples, e.g., using the fast gradient sign method (FGSM) [8], which is known to be simple and effective. However, if the defender/classifier (the defender and the classifier are treated synonymously in this paper) has access to those adversarial examples, the defender can significantly weaken the attack by retraining the classifier with the adversarial examples, called adversarial training. We can repeat the two steps – adversarial sample generation and adversarial training – many times, and what is observed in the process (Sec 3.1) is that an attack/defense can be very effective against the immediately-preceding defense/attack, but not necessarily against non-immediately-preceding defenses/attacks. This is one of many examples showing that the effectiveness of an attack/defense method depends critically on the defense/attack it is pitted against, from which we conclude that the performance of an attack/defense method has to be evaluated and reported as an attack-defense pair and not in isolation.

To better understand the interaction of attack and defense in the adversarial example problem, we formulate adversarial attacks/defenses on machine learning classifiers as a two-player continuous pure-strategy zero-sum game. The game is played by an attacker and a defender: the attacker tries to maximize the risk of the classification task by perturbing input samples under certain constraints, and the defender tries to adjust the classifier parameters to minimize the same risk function given the perturbed inputs. The ideal adversarial examples are the global maximizers of the risk without constraints (except for certain bounds such as the ℓ∞-norm). However, such a space of unconstrained adversarial samples is very large – any real vector of the given input size is potentially an adversarial sample, regardless of whether it is a sample from the input data distribution. The vastness of the space of adversarial examples is a hindrance to the study of the problem, since it is difficult for the defender to model and learn the attack class from a finite number of adversarial examples and generalize to future attacks. To study the problem more concretely, we use two representative classes of attacks. The first type is gradient-based – the attacker mainly uses the gradient of the classifier output with respect to the input to generate adversarial examples. This includes the fast gradient sign method (FGSM) [8] and the iterative version (IFGSM) [13] of the FGSM. Attacks of this type can be considered an approximation of the full maximization of the risk by one or a few steps of gradient-based maximization. The second type is a neural-network-based attack which is capable of learning. This attack network is trained from data so that it takes a (clean) input and generates a perturbed output that maximally fools the classifier in consideration. The ‘size’ of the attack class is directly related to the parameter space of the neural network architecture, e.g., all perturbations that can be generated by the fully-connected 3-layer ReLU networks we use in this paper. Similar to what we propose, others have recently considered training neural networks to generate adversarial examples [25, 1]. While the network-based attack class is a subset of the space of unconstrained attacks, it can generate adversarial examples with only a single feedforward pass through the neural network at test time, making it suitable for real-time attacks, unlike other more time-consuming attacks. We later show empirically that this class of neural-network-based attacks is quite different from the class of gradient-based attacks.

As a two-player game, there may not be a dominant defense that is robust against all types of attacks. However, there is a natural notion of the best defense or attack in the worst case. Suppose one player moves first by choosing her parameters and the other player responds with the knowledge of the first player’s move. This is an example of a leader-follower game [2], for which there are two known equilibria – the minimax and the maximin points – if it is a constant-sum game. Such a defense/attack is theoretically an ideal pure strategy in the leader-follower setting, but one has to actually find it for the given class of defenses/attacks and the dataset in order to deploy it. To find minimax solutions numerically, we propose continuous optimization algorithms for the gradient-based attacks (Sec. 3) and the network-based attacks (Sec. 4), based on alternating minimization with gradient-norm penalization. Experiments with the MNIST and CIFAR-10 datasets show that the minimax defense found by the algorithm is indeed more robust overall than non-minimax defenses, including classifiers adversarially trained against specific attacks. However, the results also show that the minimax defense is still vulnerable to some degree to out-of-class attacks, e.g., the gradient-based minimax defense is not equally robust against network-based attacks. This exemplifies the difficulty of achieving the minimax defense against all possible attack types in reality. Our paper is a first step towards this goal, and future work is discussed in Sec. 5.

The contributions of this paper can be summarized as follows.

  • We explain and formulate the adversarial example problem as a two-player continuous game, and demonstrate the fallacy of evaluating a defense or an attack as a static problem.

  • We show the difficulty of achieving robustness against all types of attacks, and present the minimax defense as the best worst-case defense.

  • We present two types of attack classes – gradient-based and network-based attacks. The former class represents the majority of known attacks in the literature, and the latter class represents new attacks capable of generating adversarial examples possessing very different properties from the former.

  • We provide continuous minimax optimization methods to find the minimax point for the two classes, and contrast it with non-minimax approaches.

  • We demonstrate our game formulation using two popular machine learning benchmark datasets, and provide empirical evidence for our claims.

For readability, details about experimental settings and the results with the CIFAR-10 dataset are presented in the appendix.

2 Related work

Making a classifier robust to test-time adversarial attacks has been studied for linear (kernel) hyperplanes [14], naive Bayes [4], and SVM [6]; these works also showed the game-theoretic nature of robust classification problems. Since the recent discovery of adversarial examples for deep neural networks, several methods of generating adversarial samples have been proposed [29, 8, 12, 22, 3], as well as several methods of defense [29, 8, 26, 30]. These papers considered static scenarios, where the attack/defense is constructed against a fixed opponent. A few researchers have also proposed using a detector to detect and reject adversarial examples [18, 15, 21]. While we do not use detectors in this work, the minimax approach we propose in this paper can also be applied to train such detectors.

The idea of using neural networks to generate adversarial samples has appeared concurrently [1, 25]. Similar to our paper, the two papers demonstrate that it is possible to generate strong adversarial samples by a learning approach. The former [1] explored different architectures for the “adversarial transformation networks” against several different classifiers. The latter [25] proposed “attack learning neural networks” to map clean samples to a region in the feature space where misclassification occurs, and “defense learning neural networks” to map them back to the safe region. Instead of prepending defense layers before a fixed classifier [25], we retrain the whole classifier as a defense method. However, the key difference of our work from these two papers is that we consider the dynamics of a learning-based defense stacked with a learning-based attack, and the numerical computation of the optimal defense/attack by continuous optimization.

Extending the model of [12], an alternating optimization algorithm for finding saddle points of the iterative FGSM-type attack and the adversarial-training defense was proposed recently in [17]. The algorithm is similar to what we propose in Sec. 3, but differs in that we seek minimax points instead of saddle points, and we consider both gradient-based and network-based attacks. The importance of distinguishing minimax and saddle-point solutions for machine learning problems was explained in [11], along with new algorithms for handling multiple local optima. The alternating optimization method for finding an equilibrium of a game has gained renewed interest since the introduction of Generative Adversarial Networks (GAN) [7]. However, the instability of the alternating gradient-descent method is well known, and the “unrolling” method [20] was proposed to stabilize GAN training. The optimization algorithm proposed in this paper is similar to the unrolling method, but it is simpler and involves a gradient-norm regularization which can be interpreted intuitively as sensitivity penalization [9, 16, 19, 24, 28].

Lastly, a related framework of finding minimax risk was also studied in [10] for the purpose of preventing attacks on privacy. We discuss how the attack on classification in this paper and the attack on privacy are the two sides of the same optimization problem with the opposite goals.

3 Minimax defense against gradient-based attacks

A classifier whose parameters are known to an attacker is easy to attack. Conversely, an attacker whose adversarial samples are known to a classifier is easy to defend against. In this section, we first demonstrate the cat-and-mouse nature of the interaction, using adversarial training as the defense and the fast gradient sign method (FGSM) [8] as the attack. We then describe a more general form of the game, and algorithms for finding minimax solutions using sensitivity-penalized optimization.

3.1 A motivating observation

Suppose h(x; u) is a classifier with parameters u and ℓ(h(x; u), y) is a loss function. The untargeted FGSM attack generates a perturbed example x_adv given the clean sample (x, y) as follows:

x_adv = x + ε · sign(∇_x ℓ(h(x; u), y)).   (1)

Untargeted means that the goal of the attacker is to induce misclassification regardless of which classes the samples are misclassified into, as long as they are different from the original classes. We will not discuss defenses against targeted attacks in this paper, as they are analogous to their untargeted counterparts. The clean input images we use here are normalized, that is, all pixel values are in the range [0, 1]. Although simple, FGSM is very effective at fooling the classifier. Table 1 demonstrates this against a convolutional neural network trained with clean images from MNIST. (Details of the classifier architecture and the settings are in the appendix.)

Defense\Attack   No attack   FGSM ε=0.1   ε=0.2   ε=0.3   ε=0.4
No defense       0.026       0.446        0.933   0.983   0.985
Table 1: Test error rates of the FGSM attack on an undefended convolutional neural network for MNIST. Higher error means a more successful attack.
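To make the FGSM step of Eq. 1 concrete, the following sketch applies it to a toy logistic classifier. The weights, the sample, and the loss are hypothetical stand-ins for the paper's convolutional network, chosen so the input gradient has a closed form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # logistic loss -log sigmoid(y * <w, x>), with labels y in {-1, +1}
    s = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(sigmoid(y * s))

def grad_x(w, x, y):
    # d loss / d x = -y * sigmoid(-y * <w, x>) * w
    s = sum(wi * xi for wi, xi in zip(w, x))
    c = -y * sigmoid(-y * s)
    return [c * wi for wi in w]

def fgsm(w, x, y, eps):
    # Eq. (1): x_adv = x + eps * sign(grad_x loss)
    sign = lambda t: (t > 0) - (t < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x(w, x, y))]

w = [2.0, -1.0]           # hypothetical classifier parameters
x, y = [0.4, 0.1], 1      # correctly classified clean sample
x_adv = fgsm(w, x, y, eps=0.3)
assert loss(w, x_adv, y) > loss(w, x, y)  # the one-step perturbation raises the loss
```

The perturbation is ±ε per coordinate, so ‖x_adv − x‖∞ = ε, matching the bound used throughout the paper.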

On the other hand, these attacks, if known to the classifier, can be weakened by retraining the classifier with the original dataset augmented by adversarial examples with ground-truth labels, known as adversarial training. In this paper we use a mixture of the clean and the adversarial samples for adversarial training. Table 2 shows the result of adversarial training for different values of ε. After adversarial training, the test error rates for adversarial test examples are reduced back to the level (1-2%) before the attack. This is in stark contrast with the high misclassification rate of the undefended classifier in Table 1.

Defense\Attack   No attack   FGSM ε=0.1   ε=0.2   ε=0.3   ε=0.4
Adv train        n/a         0.010        0.011   0.015   0.017
Table 2: Test error rates of the FGSM attacks on adversarially-trained classifiers for MNIST. This defense can avert the attacks and achieve low error rates.

This procedure of 1) adversarial sample generation using the current classifier, and 2) retraining the classifier using the current adversarial examples, can be repeated for many rounds. Let’s denote the attack on the original classifier as FGSM1, and the corresponding retrained classifier as Adv FGSM1. Repeating the procedure above generates the sequence of attacks and defenses: FGSM1 → Adv FGSM1 → FGSM2 → Adv FGSM2 → FGSM3 → Adv FGSM3, etc. The odd terms are attacks and the even terms are defenses (i.e., classifiers).

We repeat these two steps for 80 rounds. As a preview, Table 3 shows the test errors of the defense-attack pairs, where the defense is one of {No defense, Adv FGSM1, Adv FGSM2, …} and the attack is one of {No attack, FGSM1, FGSM2, …}. Throughout the paper we use the following conventions for tables: the rows correspond to defense methods, the columns correspond to attack methods, and all numbers are test errors. It is observed that a defense is effective against the immediately-preceding attack (e.g., the Adv FGSM1 defense has a low error against the FGSM1 attack), and similarly an attack is effective against the immediately-preceding defense (e.g., the FGSM2 attack incurs a high error against Adv FGSM1). However, a defense/attack is not necessarily robust against non-immediately-preceding attacks/defenses. From this we make two observations. First, the effectiveness of an attack/defense method depends critically on the defense/attack it is pitted against, and therefore the performance of an attack/defense should be evaluated as an attack-defense pair and not in isolation. Second, it is not enough for a defender to choose the classifier parameters in response to a specific attack, i.e., adversarial training; it should use a more principled method of selecting robust parameters. We address these below.
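The cat-and-mouse rounds described above can be sketched end-to-end on a toy 1-D logistic classifier. The data, learning rates, and number of rounds below are all hypothetical; the point is only the alternation of attack generation and adversarial retraining.

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))
data = [(1.0, 1), (0.8, 1), (-1.0, -1), (-0.7, -1)]  # toy 1-D dataset (x, y)

def risk(w, pts):
    # mean logistic loss -log sig(y * w * x)
    return sum(-math.log(sig(y * w * x)) for x, y in pts) / len(pts)

def train(w, pts, lr=0.5, steps=200):
    # plain gradient descent on the mean logistic loss
    for _ in range(steps):
        g = sum(-y * x * sig(-y * w * x) for x, y in pts) / len(pts)
        w -= lr * g
    return w

def fgsm(w, pts, eps=0.3):
    # 1-D FGSM: move each x by eps in the direction that raises the loss
    s = lambda t: (t > 0) - (t < 0)
    return [(x + eps * s(-y * w), y) for x, y in pts]

w = train(0.1, data)                       # round 0: train on clean data
for _ in range(3):                         # cat-and-mouse rounds
    adv = fgsm(w, data)                    # attack the current classifier
    assert risk(w, adv) > risk(w, data)    # a fresh attack raises the risk
    before = risk(w, adv)
    w = train(w, data + adv)               # adversarial training on the mixture
    assert risk(w, adv) < before           # the known attack is now weakened
```

Each round repeats the pattern seen in Tables 1 and 2: the new attack succeeds against the current classifier, and retraining on that attack defuses it.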

3.2 Gradient-based attacks and generalization

We first consider the interaction of the classifier and the gradient-based attack as a continuous two-player pure-strategy zero-sum game. To emphasize the parameters u of the classifier/defender, let’s write the empirical risk of classifying the perturbed data as

f(u, z) = (1/N) Σ_{i=1}^{N} ℓ(h(x_i + z_i; u), y_i),   (2)

where z_i(u) denotes a gradient-based attack pattern based on the loss gradient

z_i(u) = ε · sign(∇_x ℓ(h(x_i; u), y_i)),   (3)

and z(u) = (z_1(u), …, z_N(u)) is the sequence of perturbations of the examples.

Given the classifier parameter u, the gradient-based attack (Eq. 3) can be considered a single-step approximation to the general attack

max_z f(u, z),   (4)

where each z_i can be any adversarial pattern subject to bounds such as ‖z_i‖∞ ≤ ε and x_i + z_i ∈ [0, 1]^d. Consequently, the goal of the defender is to choose the classifier parameter u to minimize the maximum risk from such attacks [12, 17]:

min_u max_z f(u, z),   (5)

with the same bound constraints. This general minimax optimization is difficult to solve directly due to the large search space of the inner maximization. In this respect, existing attack methods such as FGSM, IFGSM, or Carlini-Wagner [3] can be considered heuristics or approximations of the true maximization.

3.3 Minimax solutions

We describe an algorithm to find solutions of Eq. 5 for gradient-based attacks (Eq. 3). In expectation of the attack, the defender should choose u to minimize f(u, z(u)), where the dependence of the attack on the classifier is expressed explicitly. If we minimize using gradient descent

u ← u − η · df(u, z(u))/du,   (6)

then from the chain rule, the total derivative df(u, z(u))/du is, to first order in the perturbation,

df(u, z(u))/du = ∂f(u, z)/∂u + (ε/N) Σ_{i=1}^{N} ∇_u ‖∇_x ℓ(h(x_i; u), y_i)‖₁   (7)

from Eqs. 2 and 3.

Interestingly, this total derivative (Eq. 7) at the current state coincides with the gradient of the following cost

f̃(u) = f(u, z) + (ε/N) Σ_{i=1}^{N} ‖∇_x ℓ(h(x_i; u), y_i)‖₁,   (8)

where z is the current adversarial pattern from Eq. 3. There are two implications. Interpretation-wise, this cost function is the sum of the original risk and a ‘sensitivity’ term which penalizes abrupt changes of the risk w.r.t. the input. Therefore, u is chosen at each iteration not only to decrease the risk but also to make the classifier insensitive to input perturbations, so that the attacker cannot take advantage of large gradients. The idea of minimizing the sensitivity to the input is a familiar approach to robustifying classifiers [9, 16]. Secondly, the new formulation can be implemented easily: the gradient-descent update using the seemingly complicated total derivative (Eq. 7) can be replaced by the gradient-descent update of Eq. 8. The capability of automatic differentiation [27] in modern machine learning libraries can be used to compute the gradient of Eq. 8 efficiently.

We find the solution to the minimax problem by iterating two steps. In the max step, we generate the current adversarial patterns z_i, and in the min step, we update the classifier parameters u by using Eq. 8. In practice, we require the adversarial patterns z_i to be constrained by ‖z_i‖∞ ≤ ε and x_i + z_i ∈ [0, 1]^d, and therefore we can use the FGSM method (Eq. 1) to generate the patterns in the max step. The classifier parameters obtained after convergence will be referred to as the minimax defense against gradient-based attacks (Minimax-Grad). Note that this algorithm is similar to, but different from, the algorithms of [12, 17]. Firstly, we use only one gradient-descent step to compute the patterns, although we could use multiple steps as in [17]. More importantly, the sensitivity penalty in Eq. 8 plays an important role in convergence to a minimax solution. In contrast, simply alternating the max and min steps without the penalty term does not guarantee convergence to minimax points, unless the minimax points are also saddle points (see [11] for a description of the difference). This subtle difference will be observed in the experiments.
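The two alternating steps can be sketched on a toy 1-D logistic classifier. The sketch below minimizes an Eq. 8-style objective (adversarial risk plus ε times the mean ℓ₁-norm of the input gradient), using a finite-difference gradient in place of automatic differentiation; the data and constants are hypothetical.

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))
# toy non-separable 1-D data: pairs (x, y) with y in {-1, +1}
data = [(1.0, 1), (0.4, 1), (-0.8, -1), (0.1, -1)]
eps = 0.3

def fgsm(w, pts):
    # max step: one-step gradient-sign attack (Eq. 1)
    s = lambda t: (t > 0) - (t < 0)
    return [(x + eps * s(-y * w), y) for x, y in pts]

def risk(w, pts):
    # empirical risk: mean logistic loss of the classifier x -> sign(w * x)
    return sum(-math.log(sig(y * w * x)) for x, y in pts) / len(pts)

def penalized(w, adv, clean):
    # Eq. 8-style cost: adversarial risk + eps * mean |d loss / d x| on clean data
    sens = sum(abs(-y * sig(-y * w * x) * w) for x, y in clean) / len(clean)
    return risk(w, adv) + eps * sens

w, lr, h = 2.0, 0.2, 1e-5
start = penalized(w, fgsm(w, data), data)
for _ in range(300):
    adv = fgsm(w, data)  # max step: regenerate patterns for the current w
    # min step: finite-difference gradient of the penalized cost w.r.t. w
    g = (penalized(w + h, adv, data) - penalized(w - h, adv, data)) / (2 * h)
    w -= lr * g
assert penalized(w, fgsm(w, data), data) < start  # the penalized objective decreased
```

In a real implementation the finite difference would be replaced by automatic differentiation of Eq. 8, as noted in the text.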

3.4 Experiments

We find the defense parameters using the algorithm above; the resulting classifier should be robust to gradient-based attacks. Fig. 1 shows the decrease of the test error during training using this gradient-descent approach for MNIST.

Figure 1: Convergence of test error rates for Minimax-Grad with MNIST.

We reiterate the result of the cat-and-mouse game in Sec. 3.1 and contrast it with the minimax solution (Minimax-Grad). Table 3 shows that the adversarially trained classifier (Adv FGSM1) is robust to both clean data and the FGSM1 attack, but is susceptible to the FGSM2 attack, showing that the defense is only effective against the immediately-preceding attack. The same holds for Adv FGSM2, Adv FGSM3, etc. After 80 rounds of the cat-and-mouse procedure, the classifier Adv FGSM80 becomes robust to FGSM80 as well as moderately robust to other attacks, including FGSM81 (=FGSM-curr). However, Minimax-Grad, obtained from the minimization of Eq. 8, is even more robust against FGSM-curr than Adv FGSM80 and is overall the best (see the last column – the “worst” result). To see the advantage of the sensitivity term in Eq. 8, we also performed the minimization of Eq. 8 without the sensitivity term under the same conditions as Minimax-Grad. This optimization method is similar to the method proposed in [12], which we will refer to as LWA (Learning with Adversaries). Note that LWA is a saddle-point solution for gradient-based attacks since it solves the min and max problems symmetrically. In the table, one can see that Minimax-Grad is also better than LWA overall, although the difference is not large. To improve the minimax defense even further, we could choose a larger attack class than single-gradient-step attacks; note that this would come at the cost of increased difficulty in the minimax optimization.

Defense\Attack        No attack  FGSM1  FGSM2  FGSM80  FGSM-curr  worst
ε=0.1  No defense      0.026      0.446  0.073  0.054   0.446      0.446
       Adv FGSM1       0.008      0.010  0.404  0.037   0.435      0.435
       Adv FGSM2       0.011      0.311  0.009  0.038   0.442      0.442
       Adv FGSM80      0.007      0.028  0.018  0.010   0.117      0.117
       LWA             0.009      0.044  0.030  0.022   0.019      0.044
       Minimax-Grad    0.006      0.014  0.015  0.014   0.025      0.025
ε=0.2  No defense      0.026      0.933  0.215  0.089   0.933      0.933
       Adv FGSM1       0.009      0.011  0.816  0.067   0.816      0.816
       Adv FGSM2       0.008      0.904  0.010  0.082   0.840      0.904
       Adv FGSM80      0.007      0.087  0.053  0.013   0.131      0.131
       LWA             0.007      0.157  0.034  0.036   0.026      0.157
       Minimax-Grad    0.008      0.082  0.085  0.049   0.027      0.085
ε=0.3  No defense      0.026      0.983  0.566  0.087   0.983      0.983
       Adv FGSM1       0.010      0.015  0.892  0.080   0.892      0.892
       Adv FGSM2       0.010      0.841  0.017  0.058   0.764      0.841
       Adv FGSM80      0.007      0.352  0.117  0.021   0.043      0.352
       LWA             0.008      0.130  0.077  0.047   0.034      0.130
       Minimax-Grad    0.008      0.062  0.144  0.045   0.036      0.144
ε=0.4  No defense      0.026      0.985  0.806  0.122   0.985      0.985
       Adv FGSM1       0.010      0.017  0.898  0.102   0.898      0.898
       Adv FGSM2       0.010      0.681  0.022  0.092   0.686      0.686
       Adv FGSM80      0.008      0.688  0.330  0.029   0.031      0.688
       LWA             0.009      0.355  0.171  0.086   0.042      0.355
       Minimax-Grad    0.009      0.081  0.221  0.076   0.026      0.221
Table 3: Test error rates of different attacks on various adversarially-trained classifiers for MNIST. The rows correspond to defense methods and the columns correspond to attack methods. FGSM-curr means the FGSM attack on the specific classifier on the left. Worst means the largest error in each row. Adv FGSM is the classifier adversarially trained with FGSM attacks. Minimax-Grad is the result of minimizing Eq. 8 by gradient descent. LWA is the result of minimizing Eq. 8 without the gradient-norm term.

4 Minimax defense against network-based attacks

In this section, we consider another class of attacks – the neural-network based attacks. We present an algorithm for finding minimax solutions for this attack class, and contrast the minimax solution with saddle-point and maximin solutions.

4.1 Learning-based attacks

Again, let h(x; u) be a classifier parameterized by u and ℓ be a loss function. The class of adversarial patterns for the general minimax problem (Eq. 5) is very large, which results in strong but non-generalizable adversarial examples. Non-generalizable means the perturbation has to be recomputed, by solving the optimization, for every new test sample x. While such an ideal attack is powerful, the large size of the class makes it difficult to analytically study the optimal defense methods. In Sec. 3, we restricted this class to gradient-based attacks. In this section, we restrict the class of patterns to those which can be generated by a flexible but manageable class of perturbations g(x, y; v), e.g., a neural network of a fixed architecture whose parameter v is the network weights. This class is clearly a subset of the general attacks, but it is generalizable, i.e., no time-consuming optimization is required in the test phase, only single feedforward passes through the network. The attack network (AttNet), as we will call it, can be any class of appropriate neural networks; here we use a three-layer fully-connected ReLU network with 300 hidden units per layer. Different from [25] or [1], we feed the label y into the input of the network along with the features x. This is analogous to using the true label in the original FGSM. While this label input is not necessary, it can make the training of the attacker network easier. As with other attacks, we impose the ℓ∞-norm constraint on the perturbation, i.e., ‖g(x, y; v) − x‖∞ ≤ ε.
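A forward pass of such an attack network might look like the following sketch. The weights are random placeholders rather than trained parameters, and squashing the output through ε·tanh is one of several possible ways to enforce the ℓ∞ constraint; the mechanism is our assumption, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_hid = 784, 10, 300  # input dim, #classes, hidden units (3-layer ReLU net, as in the text)

# hypothetical attack-network weights; in practice v is trained by gradient ascent on the risk
W1 = rng.normal(0, 0.05, (n_hid, d + k)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.05, (n_hid, n_hid)); b2 = np.zeros(n_hid)
W3 = rng.normal(0, 0.05, (d, n_hid));     b3 = np.zeros(d)

def att_net(x, y_onehot, eps):
    # feed the features and the label together, as described in the text
    z = np.concatenate([x, y_onehot])
    a1 = np.maximum(0, W1 @ z + b1)         # ReLU layer 1
    a2 = np.maximum(0, W2 @ a1 + b2)        # ReLU layer 2
    delta = eps * np.tanh(W3 @ a2 + b3)     # tanh squashing keeps ||delta||_inf <= eps
    return np.clip(x + delta, 0.0, 1.0)     # stay in the valid pixel range

x = rng.uniform(0, 1, d)   # a stand-in for a normalized MNIST image
y = np.eye(k)[3]           # one-hot label
x_adv = att_net(x, y, eps=0.3)
assert np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9
```

A single matrix-vector pass per sample is what makes this attack class suitable for real-time use, in contrast to per-sample optimization attacks.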

Suppose now f(u, v) is the empirical risk of a classifier-attacker pair, where the input is first transformed by the attack network g(·, ·; v) and then fed to the classifier h(·; u). The attack network can be trained by gradient ascent as well. Given the classifier parameter u, we can use

v ← v + η_v ∂f(u, v)/∂v   (9)

to find an optimal attacker v that maximizes the risk for the given fixed classifier u. Table 4 compares the error rates of the FGSM attacks and the attack network (AttNet). The table shows that AttNet is better than or comparable to FGSM in all cases. In particular, we already observed that the FGSM attack is not effective against classifiers hardened against gradient-based attacks (Adv FGSM80 or Minimax-Grad), but AttNet can still incur significant error (0.498 or higher) on those hardened defenders for all ε. This indicates that the class of learning-based attacks is indeed different from the class of gradient-based attacks.

Defense\Attack   ε=0.1                            ε=0.2
                 FGSM-curr  AttNet-curr  worst    FGSM-curr  AttNet-curr  worst
No defense       0.446      0.697        0.697    0.933      0.999        0.999
Adv FGSM1        0.435      0.909        0.909    0.816      0.897        0.897
Adv FGSM80       0.117      0.786        0.786    0.131      1.000        1.000
Minimax-Grad     0.025      0.498        0.498    0.085      0.956        0.956
                 ε=0.3                            ε=0.4
                 FGSM-curr  AttNet-curr  worst    FGSM-curr  AttNet-curr  worst
No defense       0.983      1.000        1.000    0.985      1.000        1.000
Adv FGSM1        0.892      1.000        1.000    0.898      1.000        1.000
Adv FGSM80       0.352      0.887        0.887    0.688      1.000        1.000
Minimax-Grad     0.144      1.000        1.000    0.221      1.000        1.000
Table 4: Test error rates of FGSM vs the learning-based attack network (AttNet) on various adversarially-trained classifiers for MNIST. FGSM-curr/AttNet-curr means the attack is computed/trained for the specific classifier on the left. Worst means the larger of the FGSM-curr and AttNet-curr errors for each ε. Note that FGSM fails to attack the hardened networks (Adv FGSM80 and Minimax-Grad), whereas AttNet can still attack them successfully.

4.2 Minimax solution

We consider the two-player zero-sum game between a classifier and a network-based attacker, where each player can choose its own parameters. Given the current classifier u, an optimal whitebox attacker parameter is the maximizer of the risk

v*(u) = argmax_v f(u, v).   (10)

Consequently, the defender should choose the classifier parameters such that the maximum risk is minimized:

u* = argmin_u max_v f(u, v).   (11)

As before, this solution to the continuous minimax problem has a natural interpretation as the best worst-case solution. Assuming the attacker is optimal, i.e., it chooses the best attack v*(u) from Eq. 10 given u, no other defense can achieve a lower risk than the minimax defense u* of Eq. 11. The minimax defense is also a conservative defense: if the attacker is not optimal, and/or if the attacker does not know the defense exactly (as in blackbox attacks), the actual risk can be lower than what the minimax solution predicts. Before proceeding further, we point out that these claims apply to the global minimizer u* and the global maximizer function v*(·), but in practice we can only find local solutions for the complex risk functions of deep classifiers and attackers.

To solve Eq. 11, we analyze the problem similarly to Eqs. 6-8 of the previous section. At each iteration, the defender should choose u in expectation of the attack and minimize f(u, v*(u)). We use gradient descent

u ← u − η_u · df(u, v*(u))/du,   (12)

where the total derivative is

df(u, v*(u))/du = ∂f/∂u + (∂v*(u)/∂u)ᵀ ∂f/∂v.   (13)

Since the exact maximizer v*(u) is difficult to find, we only update v incrementally by one (or more) steps of the gradient-ascent update

v⁺(u) = v + η_v ∂f(u, v)/∂v.   (14)

The resulting formulation is closely related to the unrolled optimization [20] proposed for training GANs, although the latter has a very different cost function. Using the single update (Eq. 14), the total derivative is

df(u, v⁺(u))/du = ∂f/∂u + η_v (∂²f/∂u∂v) ∂f/∂v.   (15)

Similar to hardening a classifier against gradient-based attacks by minimizing Eq. 8 at each iteration, the gradient update of u can be done using the gradient of the following sensitivity-penalized function

f̃(u, v) = f(u, v) + (η_v/2) ‖∂f(u, v)/∂v‖².   (16)

In other words, u is chosen not only to minimize the risk but also to prevent the attacker from exploiting the sensitivity of the risk to v. The algorithm is summarized in Alg. 1.

Input: risk f(u, v), # of iterations T, learning rates η_u, η_v
Output: (u, v)
Initialize u⁰, v⁰
Begin

  for t = 0, …, T−1 do
     Max step: v^{t+1} = v^t + η_v ∂f(u^t, v^t)/∂v
     Min step: u^{t+1} = u^t − η_u ∂f̃(u^t, v^{t+1})/∂u, with f̃ from Eq. 16.
  end for
  Return (u^T, v^T).
Algorithm 1 Minimax Optimization by Sensitivity Penalization

The classifier obtained after convergence will be referred to as Minimax-AttNet. Note that Alg. 1 is independent of the adversarial example problems presented in the paper, and can be used for other minimax problems as well.
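As a sanity check that Alg. 1 is usable outside the adversarial-example setting, the sketch below runs it on a hypothetical two-variable quadratic game f(u, v) = u² + 2uv − v², whose minimax point is (0, 0); the learning rates and iteration count are arbitrary.

```python
# toy risk f(u, v) = u^2 + 2*u*v - v^2: concave in v, minimax point at (0, 0)
def df_du(u, v):
    return 2 * u + 2 * v

def df_dv(u, v):
    return 2 * u - 2 * v

eta_u, eta_v = 0.1, 0.1
u, v = 1.0, 0.0
for _ in range(300):
    v = v + eta_v * df_dv(u, v)                  # max step on f
    # min step on the sensitivity-penalized cost of Eq. 16:
    # grad_u [f + (eta_v/2)*(df/dv)^2] = df/du + eta_v * (df/dv) * d2f/dudv
    g = df_du(u, v) + eta_v * df_dv(u, v) * 2.0  # here d2f/dudv = 2
    u = u - eta_u * g
assert abs(u) < 1e-3 and abs(v) < 1e-3           # converged to the minimax point
```

For this quadratic game the minimax point is also a saddle point; the penalty term matters for problems where the two notions differ, as discussed in the text.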

4.3 Minimax vs maximin solutions

In analogy with the minimax problem, we can also consider the maximin solution defined by

v* = argmax_v min_u f(u, v),   (17)

where

u*(v) = argmin_u f(u, v)   (18)

is the minimizer function. Here we are abusing notation by reusing (u*, v*) for the minimax solution, the maximin solution, the minimizer u*(·), and the maximizer v*(·). Similar to the minimax solution, the maximin solution has an intuitive meaning – it is the best worst-case solution for the attacker. Assuming the defender is optimal, i.e., it chooses the best defense u*(v) from Eq. 18 that minimizes the risk given the attack v, no other attack can inflict a higher risk than the maximin attack v*. It is also a conservative attack: if the defender is not optimal, and/or if the defender does not know the attack exactly, the actual risk can be higher than what the solution predicts. Note that the maximin scenario, in which the defender knows the attack method, is not very realistic, but it is the opposite of the minimax scenario and provides a lower bound.

To summarize, minimax and maximin defenses and attacks have the following inherent properties.

Lemma 1

Let (u*, v*), v*(·), and u*(·) be the solutions of Eqs. 11, 10, 17, 18. Then:

  1. f(u, v) ≤ f(u, v*(u)): For any given defense u, the max attack v*(u) is the most effective attack.

  2. f(u*, v*(u*)) ≤ f(u, v*(u)): Against the optimal attack, the minimax defense u* is the most effective defense.

  3. f(u*(v), v) ≤ f(u, v): For any given attack v, the min defense u*(v) is the most effective defense.

  4. f(u*(v), v) ≤ f(u*(v*), v*): Against the optimal defense, the maximin attack v* is the most effective attack.

  5. f(u*(v*), v*) ≤ f(u*, v*(u*)): The risk of the best worst-case attack is lower than that of the best worst-case defense.

These properties follow directly from the definitions. The lemma helps us better understand the interdependence of defense and attack, and gives us the range of possible risk values, which can be measured empirically. To find maximin solutions, we use the same algorithm (Alg. 1) except that the variables u and v are switched and the sign of f is flipped before the algorithm is called. The resultant classifier will be referred to as Maximin-AttNet.
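Property 5 of the lemma (weak duality) can be checked numerically on any risk function; the sketch below does so by brute force on a grid for a hypothetical nonconvex f.

```python
import math

def f(u, v):
    # a hypothetical nonconvex risk; any function works for this check
    return math.sin(3 * u) * math.cos(2 * v) + 0.1 * u * u - 0.1 * v * v

grid = [i / 50.0 - 1.0 for i in range(101)]              # parameters in [-1, 1]
minimax = min(max(f(u, v) for v in grid) for u in grid)  # best worst-case defense
maximin = max(min(f(u, v) for u in grid) for v in grid)  # best worst-case attack
assert maximin <= minimax  # property 5 of Lemma 1
```

The gap between the two values is exactly what separates the minimax scenario from the maximin scenario in the experiments that follow.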

4.4 Experiments

In addition to the minimax and maximin optimizations, we also consider as a reference algorithm the alternating descent/ascent method used in GAN training [7]:

(19)  u^{t+1} = u^t - \eta \nabla_u f(u^t, v^t), \qquad v^{t+1} = v^t + \eta \nabla_v f(u^t, v^t),

and refer to its solution as Alt-AttNet. Similar to our discussion of Minimax-Grad and LWA in Sec. 3.4, alternating descent/ascent finds local saddle points, which are not necessarily minimax or maximin solutions, so its solution will in general differ from that of Alg. 1. The differences among the three optimizations, Minimax-AttNet, Maximin-AttNet, and Alt-AttNet, applied to a common problem are demonstrated in Fig. 2. The figure shows the test error over the course of optimization starting from random initializations. One can see that Minimax-AttNet (top blue curves) and Alt-AttNet (middle green curves) converge to different values, suggesting that the learned classifiers are also different.
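The descent/ascent dynamics of Eq. 19 can be sketched on a toy problem. The example below applies a simultaneous-update variant to the hypothetical convex-concave risk f(u, v) = u²/2 + uv − v²/2, whose unique saddle point is at the origin; it illustrates the update rule only, not the paper's training setup:

```python
# Simultaneous gradient descent/ascent on the toy risk
# f(u, v) = 0.5*u**2 + u*v - 0.5*v**2, saddle point at (0, 0).
def grad_u(u, v):
    return u + v          # df/du

def grad_v(u, v):
    return u - v          # df/dv

u, v, eta = 1.0, -1.0, 0.1
for _ in range(500):
    # Defender descends, attacker ascends, both from the current iterate.
    u, v = u - eta * grad_u(u, v), v + eta * grad_v(u, v)

# The iterates spiral into the saddle point for this convex-concave risk.
assert abs(u) < 1e-3 and abs(v) < 1e-3
```

For convex-concave risks such dynamics converge, but for the nonconvex risks of neural networks they may only find local saddle points, which is one reason Alt-AttNet and Minimax-AttNet can differ.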

Figure 2: Convergence of the test error rates for Minimax-AttNet (blue), Alt-AttNet (green), and Maximin-AttNet (red) for MNIST.

Table 5 compares the robustness of the classifiers, Minimax-AttNet, Alt-AttNet, and Minimax-Grad (from Sec. 3), against the AttNet attack and the FGSM attack. Not surprisingly, both Minimax-AttNet and Alt-AttNet are much more robust than Minimax-Grad against AttNet. Minimax-AttNet performs similarly to Alt-AttNet at ε = 0.1 and 0.2, but is much better at ε = 0.3 and 0.4. The different performance of Minimax-AttNet vs. Alt-AttNet implies that the minimax solution found by Alg. 1 differs from the solution found by alternating descent/ascent. In addition, Minimax-AttNet is quite robust against FGSM attacks (errors of 0.058–0.116) even though the classifiers were not trained against gradient-based attacks at all. In contrast, Minimax-Grad is very vulnerable (0.498–1.000) against AttNet, as we have already observed. This result suggests that the class of AttNet attacks and the class of gradient-based attacks are indeed different, and that the former partially subsumes the latter.

Defense\Attack FGSM-curr AttNet-curr worst FGSM-curr AttNet-curr worst
ε=0.1 ε=0.2
Minimax-AttNet 0.058 0.010 0.058 0.109 0.010 0.109
Alt-AttNet 0.048 0.010 0.048 0.096 0.016 0.096
Minimax-Grad 0.025 0.498 0.498 0.085 0.956 0.956
ε=0.3 ε=0.4
Minimax-AttNet 0.116 0.018 0.116 0.079 0.364 0.364
Alt-AttNet 0.158 0.032 0.158 0.334 0.897 0.897
Minimax-Grad 0.144 1.000 1.000 0.221 1.000 1.000
Table 5: Test error rates of Minimax-AttNet, Alt-AttNet, and Minimax-Grad classifiers for MNIST. Worst means the larger of FGSM-curr and AttNet errors for each row. Minimax-AttNet is better than Alt-AttNet and Minimax-Grad overall, and is also moderately robust against the out-of-class attack (FGSM-curr).

Lastly, adversarial examples generated by the various attacks in the paper show diverse patterns; see Fig. 3 in the appendix. All the experiments with the MNIST dataset presented so far were also performed with the CIFAR-10 dataset and are reported in the appendix. To summarize, the results with CIFAR-10 are similar: Minimax-Grad outperforms non-minimax defenses, and AttNet can attack classifiers that are hardened against gradient-based attacks. However, unlike with MNIST, gradient-based attacks are also very effective against classifiers hardened against AttNet, and Minimax-AttNet and Alt-AttNet perform similarly. The issue of defending against out-of-class attacks is discussed in the next section.

5 Discussion

5.1 Robustness against multiple attack types

We discuss some limitations of the current study and propose an extension. Ideally, a defender should find a robust classifier against the worst attack from a large class of attacks such as the general minimax problem (Eq. 5). However, it is difficult to train classifiers with a large class of attacks, due to the difficulty of modeling the class and of optimization itself. On the other hand, if the class is too small, then the worst attack from that class is not representative of all possible worst attacks, and therefore the minimax defense found will not be robust to out-of-class attacks. The trade-off seems inevitable.

It is, however, possible to build a defense against multiple specific types of attacks. Suppose v_1, \ldots, v_I are different types of attacks, e.g., v_1 = FGSM, v_2 = IFGSM, etc. The minimax defense for the combined attack is the solution to the mixed continuous-discrete problem

(20)  \min_u \max_{i \in \{1, \ldots, I\}} f(u, v_i).

Additionally, suppose V_1, \ldots, V_J are different classes of learning-based attacks, e.g., V_1 = 2-layer dense nets, V_2 = 5-layer convolutional nets, etc. The minimax defense against the mixture of multiple fixed-type and learning-based attacks can be found by solving

(21)  \min_u \max\left( \max_{i \in \{1, \ldots, I\}} f(u, v_i),\; \max_{j \in \{1, \ldots, J\}} \max_{v \in V_j} f(u, v) \right),

that is, by minimizing the risk against the strongest attacker across multiple attack classes. Note that the strongest attack class and its parameters change as the classifier changes. Due to the computational demand of solving Eq. 21, we leave computing minimax solutions against multiple classes of attacks as future work.
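When the inner maximization ranges over a fixed, finite menu of attack types as in Eq. 20, the outer problem reduces to picking the defense whose worst-row risk is smallest. Below is a minimal sketch with a hypothetical precomputed risk table (attack types by candidate defenses); the names and numbers are illustrative only:

```python
import numpy as np

# Hypothetical risks of three candidate defenses (columns) under three
# fixed attack types (rows); values are illustrative only.
risks = np.array([
    [0.42, 0.37, 0.48],   # FGSM
    [0.51, 0.44, 0.40],   # IFGSM
    [0.33, 0.58, 0.46],   # AttNet
])

worst_per_defense = risks.max(axis=0)           # inner max over attack types
best_defense = int(worst_per_defense.argmin())  # outer min over defenses
best_worst_risk = float(worst_per_defense[best_defense])
```

With learning-based attack classes added (Eq. 21), each row itself requires an inner optimization over the attack network's parameters, which is what makes the combined problem computationally demanding.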

5.2 Adversarial examples and privacy attacks

Lastly, we discuss a bigger picture of the game between adversarial players. The minimax optimization arises in the leader-follower game [2] with a constant-sum constraint. The leader-follower setting makes sense because the defense (i.e., the classifier parameters) is often public knowledge, which the attacker exploits. Interestingly, the problem of attacks on privacy [10] has a formulation very similar to the adversarial attack problem, differing only in that the classifier is the attacker and the data perturbator is the defender. In the problem of privacy preservation against inference, the defender is a data transformer (parameterized by u) which perturbs the raw data, and the attacker is a classifier (parameterized by v) that tries to extract sensitive information, such as identity, from the perturbed data, such as the online activity of a person. The transformer is the leader, e.g., when the privacy mechanism is public knowledge, and the classifier is the follower, as it attacks the given perturbed data. The risk for the defender is therefore the accuracy of inference of the sensitive information, measured by f(u, v). Solving the minimax risk problem (\min_u \max_v f(u, v)) gives us the best worst-case defense when the classifier/attacker knows the transformer/defender parameters, i.e., a robust data transformer that preserves privacy against the best inference attack among the given class of attacks. On the other hand, solving the maximin risk problem (\max_v \min_u f(u, v)) gives us the best worst-case classifier/attacker when its parameters are known to the transformer. As one can see, the problems of adversarial attacks and privacy attacks are two sides of the same coin, and can potentially be addressed by similar frameworks and optimization algorithms.

6 Conclusion

In this paper, we explain and formulate the adversarial example problem in the context of a two-player continuous game. We analytically and numerically study the problem with two types of attack classes, gradient-based and network-based, and show that the solutions from the two classes have different properties. While a classifier robust to all types of attacks may yet be an elusive goal, we claim that the minimax defense is a very reasonable goal, and that such a defense can be computed for classes such as gradient- or network-based attacks. We present optimization algorithms for numerically finding those defenses. The results with the MNIST and CIFAR-10 datasets show that the classifier found by the proposed method outperforms non-minimax-optimal classifiers, and that the network-based attack is a strong class of attacks that should be considered in adversarial example research in addition to the more frequently used gradient-based attacks. As future work, we plan to further study the transferability of a defense method to out-of-class attacks, and efficient minimax optimization algorithms for finding defenses against general attacks.

References

Appendix 0.A Results with MNIST

The architecture of the MNIST classifier is similar to the TensorFlow tutorial model (https://github.com/tensorflow/models/tree/master/tutorials/image/mnist), and the classifier is trained with the following hyperparameters:


Batch size = 128, optimizer = AdamOptimizer, total # of iterations = 50,000.

The attack network has three hidden fully-connected layers of 300 units, trained with the following hyperparameters:
Batch size = 128, dropout rate = 0.5, optimizer = AdamOptimizer, total # of iterations = 30,000.

For the minimax, saddle-point, and maximin optimizations, the total number of iterations was 100,000, and a fixed sensitivity-penalty coefficient was used in Alg. 1.

Figure 3: Adversarial samples generated by different attacks at a fixed ε. (a) Original data (b) FGSM1 (c) FGSM80 (d) IFGSM1 (e) Minimax-AttNet (f) Alt-AttNet (g) Maximin-AttNet. Note the diversity of patterns.

Appendix 0.B Results with CIFAR-10

We preprocess the CIFAR-10 dataset by removing the mean and normalizing the pixel values by the standard deviation of all pixels in the image, followed by clipping the values to a few standard deviations and rescaling. The architecture of the CIFAR-10 classifier is similar to the TensorFlow tutorial model (https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10), but is simplified further by removing the local response normalization layers. With this simpler structure, the classifier still attains reasonable accuracy on the test data. The classifier is trained with the following hyperparameters:
Batch size = 128, optimizer = AdamOptimizer, total # of iterations = 100,000.

The attack network has three hidden fully-connected layers of 300 units, trained with the following hyperparameters:
Batch size = 128, dropout rate = 0.5, optimizer = AdamOptimizer, total # of iterations = 30,000.

For the minimax, saddle-point, and maximin optimizations, the total number of iterations was 100,000, and a fixed sensitivity-penalty coefficient was used in Alg. 1.

In the rest of the appendix, we repeat all the experiments with the MNIST dataset using the CIFAR-10 dataset.

Defense\Attack No attack FGSM
ε=0.05 ε=0.06 ε=0.07 ε=0.08
No defense 0.222 0.766 0.790 0.807 0.823
Table 6: Test error rates of the FGSM attack on an undefended convolutional neural network for CIFAR-10. Higher error means a more successful attack.
Defense\Attack No attack FGSM
ε=0.05 ε=0.06 ε=0.07 ε=0.08
Adv train n/a 0.425 0.452 0.466 0.470
Table 7: Test error rates of FGSM attacks on adversarially-trained classifiers for CIFAR-10. This defense significantly lowers the errors from the attacks, although not as much as in the MNIST case.
Defense\Attack No attack FGSM FGSM-curr
FGSM-1 FGSM-2 FGSM-40
ε=0.05 No defense 0.222 0.766 0.734 0.655 0.766
Adv FGSM1 0.215 0.425 0.533 0.420 0.533
Adv FGSM2 0.206 0.422 0.456 0.406 0.501
Adv FGSM40 0.210 0.370 0.412 0.348 0.588
LWA 0.203 0.422 0.464 0.423 0.456
Minimax-Grad 0.203 0.425 0.475 0.423 0.481
ε=0.06 No defense 0.222 0.790 0.761 0.680 0.790
Adv FGSM1 0.215 0.452 0.565 0.440 0.565
Adv FGSM2 0.208 0.447 0.482 0.431 0.517
Adv FGSM40 0.216 0.398 0.431 0.353 0.599
LWA 0.208 0.446 0.493 0.447 0.489
Minimax-Grad 0.199 0.431 0.473 0.446 0.453
ε=0.07 No defense 0.222 0.807 0.787 0.704 0.807
Adv FGSM1 0.214 0.466 0.555 0.450 0.555
Adv FGSM2 0.206 0.456 0.490 0.445 0.501
Adv FGSM40 0.218 0.397 0.416 0.346 0.423
LWA 0.208 0.453 0.499 0.451 0.485
Minimax-Grad 0.208 0.461 0.497 0.456 0.487
ε=0.08 No defense 0.222 0.823 0.807 0.709 0.823
Adv FGSM1 0.213 0.470 0.533 0.462 0.533
Adv FGSM2 0.204 0.459 0.466 0.462 0.476
Adv FGSM40 0.226 0.422 0.421 0.331 0.338
LWA 0.208 0.470 0.485 0.469 0.485
Minimax-Grad 0.203 0.456 0.464 0.459 0.462
Table 8: Test error rates of different attacks on various adversarially-trained classifiers for CIFAR-10. FGSM-curr means the FGSM attack on the specific classifier on the left. Worst means the largest error in each row. Adv FGSM is the classifier adversarially trained with FGSM attacks. Minimax-Grad is the result of minimizing Eq. 8 by gradient descent. LWA is the result of minimizing Eq. 8 without the gradient-norm term.
Defense\Attack FGSM-curr AttNet-curr worst FGSM-curr AttNet-curr worst
ε=0.05 ε=0.06
no defense 0.766 0.504 0.766 0.583 0.766 0.790
Adv FGSM1 0.533 0.356 0.533 0.565 0.473 0.565
Adv FGSM40 0.588 0.454 0.588 0.599 0.442 0.599
Minimax-Grad 0.481 0.343 0.481 0.453 0.484 0.484
ε=0.07 ε=0.08
no defense 0.807 0.655 0.807 0.823 0.685 0.823
Adv FGSM1 0.555 0.499 0.555 0.535 0.678 0.678
Adv FGSM40 0.423 0.669 0.669 0.338 0.797 0.797
Minimax-Grad 0.487 0.529 0.529 0.462 0.607 0.607
Table 9: Test error rates of FGSM vs. the learning-based attack network (AttNet) on various adversarially-trained classifiers for CIFAR-10. FGSM-curr/AttNet-curr means they are computed/trained for the specific classifier on the left. Worst means the larger of the FGSM-curr and AttNet-curr errors for each ε.
Defense\Attack FGSM-curr AttNet-curr worst FGSM-curr AttNet-curr worst
ε=0.05 ε=0.06
Minimax-AttNet 0.731 0.239 0.731 0.733 0.248 0.733
Alt-AttNet 0.721 0.238 0.721 0.743 0.255 0.743
Minimax-Grad 0.481 0.343 0.481 0.453 0.484 0.484
ε=0.07 ε=0.08
Minimax-AttNet 0.762 0.256 0.762 0.775 0.266 0.775
Alt-AttNet 0.743 0.257 0.732 0.771 0.258 0.771
Minimax-Grad 0.487 0.529 0.529 0.462 0.607 0.607
Table 10: Test error rates of Minimax-AttNet, Alt-AttNet, and Minimax-Grad classifiers for CIFAR-10. Worst means the larger of FGSM-curr and AttNet errors for each row.
Figure 4: Convergence of test error rates for Minimax-Grad with CIFAR-10.
Figure 5: Convergence of the test error rates for Minimax-AttNet (blue), Alt-AttNet (green), and Maximin-AttNet (red) for CIFAR-10.