# Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization

We propose a general framework for increasing local stability of Artificial Neural Nets (ANNs) using Robust Optimization (RO). We achieve this through an alternating minimization-maximization procedure, in which the loss of the network is minimized over perturbed examples that are generated at each parameter update. We show that adversarial training of ANNs is in fact robustification of the network optimization, and that our proposed framework generalizes previous approaches for increasing local stability of ANNs. Experimental results reveal that our approach increases the robustness of the network to existing adversarial examples, while making it harder to generate new ones. Furthermore, our algorithm improves the accuracy of the network also on the original test data.

## 1 Introduction

The fact that ANNs might be very unstable locally was demonstrated in [21], where it was shown that high-performing vision ANNs mis-classify examples that differ only barely perceptibly (to the human eye) from correctly classified examples. Such examples are called adversarial examples, and were originally found by solving an optimization problem with respect to a trained net. Adversarial examples do not tend to exist naturally in training and test data. Yet, the local instability manifested by their existence is somewhat disturbing. In the case of visual data, for example, one would expect that images that are very close in the natural “human eye” metric will be mapped to nearby points in the hidden representation spaces, and consequently will be assigned to the same class. Moreover, it has been shown that different models with different architectures, trained on different training sets, tend to mis-classify the same adversarial examples in a similar fashion. Beyond what adversarial examples imply for model stability, the fact that they can be generated by simple and structured procedures and are common to different models means that they can be used to attack models, making them fail easily and consistently [8].

It has been claimed that adversarial examples exist in ’blind spots’ in the data domain in which training and testing data do not occur naturally; however, some of these blind spots might be very close in some sense to naturally occurring data.

Several works have proposed to use adversarial examples during training of ANNs, and reported an increase in classification accuracy on test data (for example, [21], [7]). The goal of this manuscript is to provide a framework that yields a full theoretical understanding of adversarial training, as well as new optimization schemes, based on robust optimization. Specifically, we show that generating and using adversarial examples during training of ANNs can be derived from the powerful notion of Robust Optimization (RO), which has many applications in machine learning and is closely related to regularization. We propose a general algorithm for robustification of ANN training, and show that it generalizes previously proposed approaches.

Essentially, our algorithm increases the stability of ANNs with respect to perturbations in the input data, through an iterative minimization-maximization procedure, in which the network parameters are updated with respect to worst-case data, rather than to the original training data. Furthermore, we show connections between our method and existing methods for generating adversarial examples and adversarial training, demonstrating that those methods are special instances of the robust optimization framework. This point yields a principled connection highlighting the fact that the existing adversarial training methods aim to robustify the parameter optimization process.

The structure of this paper is as follows: in Section 2 we mention some of the recent works that analyze adversarial examples and attempt to improve local stability. In Section 3 we present the basic ideas behind Robust Optimization, and some of its connections to regularization in machine learning models. In Section 4 we present our training framework, some of its possible variants and its practical version. Experimental results are given in Section 5. Section 6 briefly concludes this manuscript.

### 1.1 Notation

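We denote a labeled training set by $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathbb{R}^d$ is a set of features and $y_i$ is a label. The loss of a network with parameters $\theta$ on $(x, y)$ is denoted by $J(\theta, x, y)$ and is a function that quantifies the goodness-of-fit between the parameters $\theta$ and the observations $(x, y)$. When holding $\theta$ and $y$ fixed and viewing $J(\theta, x, y)$ as a function of $x$ we occasionally write $J_{\theta,y}(x)$. $\Delta x \in \mathbb{R}^d$ corresponds to a small additive adversarial perturbation that is to be added to $x$. By adversarial example we refer to the perturbed example, i.e., $\tilde{x} = x + \Delta x$, along with the original label $y$. We denote the $\ell_p$ norm for $1 \le p < \infty$ to be $\|x\|_p = \left(\sum_{j=1}^d |x(j)|^p\right)^{1/p}$ and denote the $\ell_\infty$ norm of a vector $x$ to be $\|x\|_\infty = \max_j |x(j)|$. Given two vectors $x$ and $y$, the Euclidean inner product is denoted $\langle x, y \rangle = x^T y$. Given a function $f(x, y)$, we denote $\nabla_x f(x, y)$ to be the gradient of $f$ with respect to the vector $x$.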

## 2 Related Work

Adversarial examples were first introduced in [21], where they were generated for a given training point $(x, y)$ by using L-BFGS (see [23], for example) to solve the box-constrained optimization problem

$$\min_{\Delta x} \; c\,\|\Delta x\|_2 + J(\theta, x + \Delta x, y') \quad \text{subject to} \quad x + \Delta x \in [0,1]^d, \tag{1}$$

where $y' \neq y$. A similar approach for generating adversarial examples was also used in [8]. The fundamental idea here is to construct a small perturbation of the data point $x$ in order to force the method to mis-classify the training example $x$ with some incorrect label $y'$.

In [7], the authors point out that when the dimension $d$ is large, changing each entry of $x$ by a small value $\epsilon$ yields a perturbation $\Delta x$ (such that $\|\Delta x\|_\infty = \epsilon$), which can significantly change the inner product of $x$ with a weight vector $w$. They then choose to set the adversarial perturbation

$$\Delta x = \epsilon \, \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right). \tag{2}$$

An alternative formulation of the problem naturally shows how the adversarial perturbation $\Delta x$ was obtained. If we take a first-order approximation of the loss function around the true training example $x$ with a small perturbation $\Delta x$,

$$J_{\theta,y}(x + \Delta x) \approx J_{\theta,y}(x) + \langle \nabla J_{\theta,y}(x), \Delta x \rangle,$$

and maximize the right-hand side with respect to $\Delta x$ restricted to an $\ell_\infty$ ball of radius $\epsilon$, we have that the choice that maximizes the right-hand side is exactly the quantity in equation (2). Since the gradient $\nabla J_{\theta,y}(x)$ can be computed efficiently using backpropagation (see, for example, [19]), this approach for generating adversarial examples is rather fast. In the sequel we will show how the above computation is an example of the framework that we present here.
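The first-order argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a linear logistic model in place of a neural net, and the names `logistic_loss`, `loss_grad_x` and `sign_perturbation` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def logistic_loss(w, x, y):
    """Binary cross-entropy J(w, x, y) of a linear model, with y in {0, 1}."""
    z = float(np.dot(w, x))
    # log(1 + exp(z)) - y * z, computed in a numerically stable way
    return float(np.logaddexp(0.0, z) - y * z)

def loss_grad_x(w, x, y):
    """Gradient of the loss with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * w

def sign_perturbation(w, x, y, eps):
    """Equation (2): eps times the sign of the input gradient."""
    return eps * np.sign(loss_grad_x(w, x, y))

rng = np.random.default_rng(0)
w = rng.normal(size=20)          # illustrative model parameters
x = rng.normal(size=20)          # illustrative input
y, eps = 1.0, 0.1

g = loss_grad_x(w, x, y)
dx = sign_perturbation(w, x, y, eps)
linear_gain = float(np.dot(g, dx))   # first-order increase of the loss
```

For this choice the first-order gain equals $\epsilon \|\nabla J_{\theta,y}(x)\|_1$, which is the largest value a linear function can take over an $\ell_\infty$ ball of radius $\epsilon$.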

It is reported in [21, 7] that adversarial examples that were generated for a specific network were mis-classified in a similar fashion by other networks, with possibly different architectures and using different subsets of the data for training. In [7] the authors claim that this phenomenon is related to the strong linear nature that neural networks have. Specifically, they claim that the different models learned by these neural nets were essentially all close to the same linear model, hence giving similar predictions on adversarial examples.

Two works [15, 6] nicely demonstrate that classifiers can achieve very high test accuracy without actually learning the true concepts of the classes they predict. Rather, they can base their predictions on discriminative information that suffices to obtain accurate predictions on test data but does not reflect learning of the true concept that defines specific classes. As a result, they can consistently fail to recognize the true class concept in new examples [6] or confidently give wrong predictions on specifically designed examples [15].

Several papers propose training procedures and objective functions designed to make the function computed by the ANN change more slowly near training and test points. In [21] adversarial examples were generated and fed back to the training set. This procedure is reported to increase the classification accuracy on test data. In [7] the following loss function is proposed:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x + \Delta x, y), \tag{3}$$

with $\Delta x$ as in equation (2). The authors report that the resulting net had improved test set accuracy, and also had better performance on new adversarial examples. They further give intuitive explanations of this training procedure as an adversarial game, and as a min-max optimization over $\ell_\infty$ balls. In this manuscript, we attempt to make the second interpretation rigorous, by deriving a similar training procedure from the RO framework.

In [12] adversarial training is performed without requiring knowledge of the true label. Rather, the loss function contains a term computing the Kullback-Leibler divergence between the predicted distributions of the label (e.g., softmax) at the original and at the perturbed input, i.e., $D_{KL}\left(p(\cdot \mid x) \,\|\, p(\cdot \mid x + \Delta x)\right)$.

In [9] adversarial examples were generated for music data. The authors report that back-feeding of adversarial examples to the training set did not result in an improved resistance to adversarial examples.

In [17], the authors first pre-train each layer of the network as a contractive autoencoder [18], which, assuming that the data concentrates near a lower dimensional manifold, penalizes the Jacobian of the encoder at every training point $x$, so that the encoding changes only in the directions tangent to the manifold at $x$. The authors further assume that points belonging to different classes tend to concentrate near different sub-manifolds, separated by low density areas. Consequently, they encourage the output of a classification network to be constant near every training point, by penalizing the dot product of the network’s gradient with the basis vectors of the plane that is tangent to the data manifold at every training point. This is done by using the loss function

$$\tilde{J}(\theta, x, y) = J(\theta, x, y) + \beta \sum_{u \in B_x} \left(\langle u, \nabla_x o(x) \rangle\right)^2, \tag{4}$$

where $o(x)$ is the output of the network at $x$ and $B_x$ is the basis of the hyperplane that is tangent to the data manifold at $x$. In the sequel we will show how this is related to the approach in this manuscript.

The contractive autoencoder loss is also used in [8], where the authors propose to increase the robustness of an ANN via minimization of a loss function which contains a term that penalizes the Jacobians of the function computed by each layer with respect to the previous layer.

In [21], the authors propose to regularize ANNs by penalizing the operator norm of the weight matrix of every layer. This pushes down the Lipschitz constant of the function computed by a layer, so that small perturbations in the input will not result in large perturbations in the output. We are not aware of any empirical result using this approach.

A different approach is taken in [16], using scattering convolutional networks (convnets) with wavelet-based filters. The filters are fixed, i.e., not learned from the data, and are designed to obtain stability under rotations and translations. The learned representation is claimed to be stable also under small deformations created by additive noise. However, the performance of the network is often inferior to that of standard convnets, which are trained in a supervised fashion.

Interesting theoretical arguments are presented in [6], where it is shown that robustness of any classifier to adversarial examples depends on the distinguishability between the classes; they show that sufficiently large distinguishability is a necessary condition for any classifier to be robust to adversarial perturbations. Distinguishability is expressed, for example, by the distance between means in the case of linear classifiers and between covariance matrices in the case of quadratic classifiers.

## 3 Robust Optimization

Solutions to optimization problems can be very sensitive to small perturbations in the input data of the optimization problem, in the sense that an optimal solution given the current data may turn into a highly sub-optimal or even infeasible solution given a slight change in the data. A desirable property of an optimal solution is to remain nearly optimal under small perturbations of the data. Since measurement data is typically precision-limited and might contain errors, the requirement for a solution to be stable to input perturbations becomes essential.

Robust Optimization (RO, see, for example [1]) is an area of optimization theory which aims to obtain solutions which are stable under some level of uncertainty in the data. The uncertainty has a deterministic and worst-case nature. The assumption is that the perturbations to the data can be drawn from specific sets $\mathcal{U}_i$ called uncertainty sets. The uncertainty sets are often defined in terms of the type of the uncertainty and a parameter controlling the size of the uncertainty set. The Cartesian product of the sets is usually denoted by $\mathcal{U}$.

The goal in Robust Optimization is to obtain solutions which are feasible and well-behaved under any realization of the uncertainty from $\mathcal{U}$; among feasible solutions, an optimal one is a solution that has the minimal cost given the worst-case realization from $\mathcal{U}$. Robust Optimization problems thus usually have a min-max formulation, in which the objective function is minimized with respect to a worst-case realization of a perturbation. For example, consider the standard linear programming problem

$$\min_x \left\{ c^T x : Ax \le b \right\}. \tag{5}$$

The given data in this case is $(A, b, c)$ and the goal is to obtain a solution $x$ which is robust to perturbations in the data. Clearly, no solution can be well-behaved if the perturbations of the data can be arbitrary. Hence, we restrict ourselves to only allowing the perturbations to lie in the uncertainty set $\mathcal{U}$. The corresponding Robust Optimization formulation is

$$\min_x \sup_{(A, b, c) \in \mathcal{U}} \left\{ c^T x : Ax \le b \right\}. \tag{6}$$

Thus, the goal of the above problem is to pick an $x$ that can work well for all possible instances of the problem parameters within the uncertainty set.
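To make the worst-case idea concrete, consider a single linear constraint $a^T x \le b$ whose row $a$ is only known up to a box of radius $\rho$ around a nominal value $a_0$. This is an illustrative special case chosen here, not one discussed in the text: the worst-case left-hand side is then $a_0^T x + \rho \|x\|_1$. The sketch below, with made-up numbers, compares this closed form against a brute-force search over the corners of the box.

```python
import itertools
import numpy as np

a0 = np.array([1.0, -2.0, 0.5])   # nominal constraint row (illustrative values)
rho = 0.1                          # radius of the box uncertainty set
x = np.array([0.3, 0.4, -1.0])     # a candidate solution

# Closed form of the worst case over the box {a : ||a - a0||_inf <= rho}.
worst_closed_form = a0 @ x + rho * np.abs(x).sum()

# Brute force: a linear function over a box attains its maximum at a corner.
worst_corner = max(
    (a0 + rho * np.array(s)) @ x
    for s in itertools.product([-1.0, 1.0], repeat=len(x))
)
```

A solution is robustly feasible for this constraint exactly when the worst-case value does not exceed $b$, which replaces infinitely many constraints by one.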

The robust counterpart of an optimization problem can sometimes be more complicated to solve than the original problem. [13] and [2] propose algorithms for approximately solving the robust problem, which are based only on the algorithm for the original problem. This approach is closely related to the algorithm we propose in this manuscript.

In the next section we discuss the connection between Robust Optimization and regularization. Regularization serves an important role in Deep Learning architectures, with methods such as dropout [22] and sparsification (for example, [3]) serving as a few examples.

### 3.1 Robust Optimization and Regularization

Robust Optimization is applied in various settings in statistics and machine learning, including, for example, several parameter estimation applications. In particular, there is a strong connection between Robust Optimization and regularization; in several cases it was shown that solving a regularized problem is equivalent to obtaining a Robust Optimization solution for a non-regularized problem. For example, it was shown in [24] that a solution to a regularized least squares problem

$$\min_x \|Ax - b\| + \lambda \|x\|_1 \tag{7}$$

is also a solution to the Robust Optimization problem

$$\min_x \max_{\|\Delta A\|_{\infty,2} \le \rho} \|(A + \Delta A)x - b\|, \tag{8}$$

where $\|\cdot\|_{\infty,2}$ is the $\ell_\infty$ norm of the $\ell_2$ norms of the columns [4]. As a result, it was shown [24] that sparsity of the solution $x$ is a consequence of its robustness. Regularized Support Vector Machines (SVMs) were also shown to have robustness properties: in [25] it was shown that solutions to SVM with norm regularization can be obtained from non-regularized Robust Optimization problems [4]. Finally, Ridge Regression can also be viewed as a variant of a robust optimization problem. Namely, it can be shown that

$$\min_x \max_{\{\Delta : \|\Delta\|_F \le \gamma\}} \|(A + \Delta)x - b\|_2$$

is equivalent to $\min_x \|Ax - b\|_2 + \gamma \|x\|_2$ [20].
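The ridge-type equivalence rests on a standard closed form for the inner maximization: for a fixed $x$, the maximum of $\|(A + \Delta)x - b\|_2$ over $\|\Delta\|_F \le \gamma$ equals $\|Ax - b\|_2 + \gamma \|x\|_2$, attained by a rank-one $\Delta$ aligned with the residual. The sketch below checks this numerically on random data; all values are illustrative and the variable names are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
x = rng.normal(size=3)
gamma = 0.5

residual = A @ x - b
# Worst-case perturbation: rank-one, aligned with the residual and with x.
u = residual / np.linalg.norm(residual)
Delta_star = gamma * np.outer(u, x) / np.linalg.norm(x)

worst_case = np.linalg.norm((A + Delta_star) @ x - b)
closed_form = np.linalg.norm(residual) + gamma * np.linalg.norm(x)

# Random feasible perturbations should never exceed the closed form.
best_random = 0.0
for _ in range(1000):
    D = rng.normal(size=A.shape)
    D *= gamma / np.linalg.norm(D)   # matrix norm defaults to Frobenius
    best_random = max(best_random, np.linalg.norm((A + D) @ x - b))
```

Minimizing the closed form over $x$ then gives the regularized problem, with the size of the uncertainty set $\gamma$ playing the role of the regularization weight.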

## 4 The Proposed Training Framework

Inspired by the Robust Optimization paradigm, we propose a loss function for training ANNs. Our approach is designed to make the network’s output stable in a small neighborhood around every training point $x_i$; this neighborhood corresponds to the uncertainty set $\mathcal{U}_i$. For example, we may set $\mathcal{U}_i = B_\rho(x_i, r)$, a ball with radius $r$ around $x_i$ with respect to some norm $\rho$. To do so, we select from this neighborhood a representative $\tilde{x}_i = x_i + \Delta x_i$, which is the point on which the network’s output will induce the greatest loss; we then require the network’s output on $\tilde{x}_i$ to be $y_i$, the target output for $x_i$. Assuming that many test points are indeed close to training points from the same class, we expect that this training algorithm will have a regularization effect and consequently will improve the network’s performance on test data. Furthermore, since adversarial examples are typically generated in proximity to training or test points, we expect this approach to increase the robustness of the network’s output to adversarial examples.

We propose training the network using a minimization-maximization approach to optimize:

$$\min_\theta \tilde{J}(\theta, x, y) = \min_\theta \sum_{i=1}^m \max_{\tilde{x}_i \in \mathcal{U}_i} J(\theta, \tilde{x}_i, y_i), \tag{9}$$

where $\mathcal{U}_i$ is the uncertainty set corresponding to example $i$. This can be viewed as optimizing the network parameters $\theta$ with respect to worst-case data $\{(\tilde{x}_i, y_i)\}_{i=1}^m$, rather than to the original training data; the $i$'th worst-case data point is chosen from the uncertainty set $\mathcal{U}_i$. The uncertainty sets are determined by the type of uncertainty and can be selected based on the problem at hand.

Optimization of (9) can be done in a standard iterative fashion, where in each iteration of the algorithm two optimization sub-procedures are performed. First, the network parameters $\theta$ are held fixed and for every training example $x_i$ an additive adversarial perturbation $\Delta x_i$ is selected such that $x_i + \Delta x_i \in \mathcal{U}_i$ and

$$\Delta x_i = \arg\max_{\Delta : x_i + \Delta \in \mathcal{U}_i} J_{\theta, y_i}(x_i + \Delta). \tag{10}$$

Then, the network parameters $\theta$ are updated with respect to the perturbed data $\{(\tilde{x}_i, y_i)\}_{i=1}^m$, where $\tilde{x}_i = x_i + \Delta x_i$. This maximization is related to the adversarial example generation process previously proposed by Szegedy et al. [21], as shown in equation (1).

Clearly, finding the exact $\Delta x_i$ in equation (10) is intractable in general. Furthermore, performing a full optimization process in each of these sub-procedures in each iteration is not practical. Hence, we propose to minimize a surrogate to $\tilde{J}$, in which each sub-procedure is reduced to a single ascent / descent step; that is, in each iteration, we perform a single ascent step (for each $i$) to find an approximation $\widehat{\Delta x}_i$ for $\Delta x_i$, followed by a single descent step to update $\theta$. The surrogate that we consider is the first-order Taylor expansion of the loss around the example, which yields:

$$\widehat{\Delta x}_i \in \arg\max_{\Delta : x_i + \Delta \in \mathcal{U}_i} J_{\theta, y_i}(x_i) + \langle \nabla J_{\theta, y_i}(x_i), \Delta \rangle. \tag{11}$$

Our proposed training procedure is formalized in Algorithm 1. In words, the algorithm performs alternating ascent and descent steps, where we first ascend for each $i$ with respect to the training example $x_i$ and then descend with respect to the network parameters $\theta$.

Note that under this procedure, $\theta$ is never updated with respect to the original training data; rather, it is always updated with respect to worst-case examples which are close to the original training points with respect to the uncertainty sets $\mathcal{U}_i$. In the sequel, we will remark on how to solve equation (11) for special cases of $\mathcal{U}_i$. In general, one could use an algorithm like L-BFGS or projected gradient descent [14].

Finally, note that in each iteration of the algorithm two forward and backward passes through the network are performed, one using the original training data to compute the adversarial perturbations and one using the perturbed data to compute the update for $\theta$; hence, we expect the training time to be twice as long, compared to standard training.
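The alternating procedure described above can be sketched in a few lines. This is an illustrative stand-in rather than the authors' Algorithm 1: it assumes a linear logistic model instead of a neural net, full-batch updates, an $\ell_\infty$ uncertainty set, and synthetic data; `robust_train` and its parameters `r`, `lr` and `n_iters` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def robust_train(X, y, r=0.1, lr=0.5, n_iters=300):
    """Alternate one ascent step on the inputs with one descent step on w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Ascent: gradient of each example's loss w.r.t. its input, shape (m, d).
        grad_x = (sigmoid(X @ w) - y)[:, None] * w[None, :]
        X_tilde = X + r * np.sign(grad_x)      # l_inf steepest-ascent step
        # Descent: gradient of the mean loss on the perturbed data w.r.t. w.
        grad_w = X_tilde.T @ (sigmoid(X_tilde @ w) - y) / len(y)
        w = w - lr * grad_w
    return w

# Synthetic two-class data: two Gaussian blobs shifted in opposite directions.
rng = np.random.default_rng(0)
m, d = 200, 5
y = rng.integers(0, 2, size=m).astype(float)
X = rng.normal(size=(m, d)) + 2.0 * (2.0 * y[:, None] - 1.0)

w_robust = robust_train(X, y)
accuracy = float(np.mean((X @ w_robust > 0) == (y == 1)))
```

Each iteration uses two gradient computations, one with respect to the inputs and one with respect to the parameters, which mirrors the two forward and backward passes mentioned above.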

### 4.1 Examples of uncertainty sets

There are a number of cases that one can consider for the uncertainty sets $\mathcal{U}_i$. One example is when $\mathcal{U}_i = B_\rho(x_i, r)$, a norm ball centered at $x_i$ with radius $r$ with respect to the norm $\rho$. Some interesting choices for $\rho$ are the $\ell_\infty$, $\ell_2$ and $\ell_1$ norms. Thus, $\widehat{\Delta x}_i$ can then be approximated using a normalized steepest ascent step with respect to the norm $\rho$ [5]. The steepest ascent step with respect to the $\ell_\infty$ ball (i.e., box) is obtained by the sign of the gradient $\nabla J_{\theta, y_i}(x_i)$. Choosing $\widehat{\Delta x}_i$ from an $\ell_\infty$ ball will therefore yield a perturbation in which every entry of $x_i$ is changed by the same amount $r$. The steepest ascent with respect to the $\ell_2$ ball coincides with the direction of the gradient $\nabla J_{\theta, y_i}(x_i)$. Choosing $\widehat{\Delta x}_i$ from an $\ell_1$ ball will yield a sparse perturbation, in which only one or a small number of the entries of $x_i$ are changed (those of largest magnitude in $\nabla J_{\theta, y_i}(x_i)$). Observe that in all three cases the steepest ascent direction is derived from the gradient $\nabla J_{\theta, y_i}(x_i)$, which can be computed efficiently using backpropagation. In Section 5 we use each of the $\ell_1$, $\ell_2$, $\ell_\infty$ norms to generate adversarial examples by equation (11) and compare the performance of Algorithm 1 using each of these types of uncertainty sets.
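The three steepest-ascent steps can be written as a single small function. The sketch below assumes the gradient is already available as a NumPy array; `steepest_ascent_step` and the string labels for the norms are names chosen for illustration only.

```python
import numpy as np

def steepest_ascent_step(grad, r, norm):
    """Maximizer of <grad, delta> over the ball ||delta||_norm <= r.

    Assumes grad is nonzero (the l2 case divides by its norm).
    """
    grad = np.asarray(grad, dtype=float)
    if norm == "linf":
        # Every entry moves by r in the direction of the gradient's sign.
        return r * np.sign(grad)
    if norm == "l2":
        # The step points along the gradient itself.
        return r * grad / np.linalg.norm(grad)
    if norm == "l1":
        # Only the entry of largest gradient magnitude is changed.
        delta = np.zeros_like(grad)
        j = np.argmax(np.abs(grad))
        delta.flat[j] = r * np.sign(grad.flat[j])
        return delta
    raise ValueError(f"unknown norm: {norm}")

g = np.array([0.2, -1.5, 0.7])   # illustrative gradient
steps = {name: steepest_ascent_step(g, 0.1, name) for name in ("linf", "l2", "l1")}
```

In each case the first-order gain $\langle \nabla J, \Delta \rangle$ equals $r$ times the dual norm of the gradient: the $\ell_1$ norm for the $\ell_\infty$ ball, the $\ell_2$ norm for the $\ell_2$ ball, and the $\ell_\infty$ norm for the $\ell_1$ ball.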

### 4.2 Relation to previous works

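The loss function in equation (3), which is proposed in [7], can be viewed as a variant of our approach, in which $\tilde{x}_i$ is chosen from an $\ell_\infty$ ball around $x_i$, since $\theta$ is updated with respect to adversarial examples generated by equation (2), which is the steepest ascent step with respect to the $\ell_\infty$ norm. Namely, the solution to equation (11) for the case that $\mathcal{U}_i = B_\infty(x_i, \epsilon)$ is the update presented in equation (2).

We may also relate our proposed methodology to the Manifold Tangent Classifier loss function [17]. Following their assumption, suppose that the data lies on a low-dimensional smooth manifold $\mathcal{M}$. Let the uncertainty set for training sample $x$ be $\mathcal{U} = B_2(x, r) \cap \mathcal{M}$. Thus, we would like to obtain the perturbation $\Delta x$ by solving

$$\Delta x = \arg\max_{\Delta : x + \Delta \in \mathcal{U}} J_{\theta, y}(x + \Delta).$$

Again, we take a first-order Taylor approximation of $J_{\theta, y}$ around $x$ and obtain $J_{\theta, y}(x + \Delta) \approx J_{\theta, y}(x) + \langle \nabla_x J_{\theta, y}(x), \Delta \rangle$. We then obtain $\widehat{\Delta x}$ through the optimization

$$\widehat{\Delta x} = \arg\max_{\Delta : x + \Delta \in \mathcal{U}} J_{\theta, y}(x) + \langle \nabla_x J_{\theta, y}(x), \Delta \rangle. \tag{12}$$

Recalling that $\mathcal{M}$ is locally Euclidean, denoting by $B_x$ the basis for the hyperplane that is tangent to $\mathcal{M}$ at $x$, and given that $r$ is sufficiently small, we may rewrite equation (12) as

$$\arg\max_{\Delta \in \mathrm{span}\, B_x,\; \|\Delta\|_2 \le r} J_{\theta, y}(x) + \langle \nabla_x J_{\theta, y}(x), \Delta \rangle. \tag{13}$$

The solution to the above equation is $\widehat{\Delta x} = r \, P_{B_x} \nabla_x J_{\theta, y}(x) / \|P_{B_x} \nabla_x J_{\theta, y}(x)\|_2$, where $P_{B_x}$ is the orthogonal projection matrix onto the subspace $\mathrm{span}\, B_x$, so that $\widehat{\Delta x}$ has norm equal to $r$. Thus, this acts as an $\ell_2$ regularization of the gradient of the loss with respect to the training sample $x$, projected onto the tangent space $\mathrm{span}\, B_x$, which is analogous to the regularization presented in equation (4). Put another way, small perturbations of $x$ on the tangent manifold should cause very small changes to the loss, which in turn should result in small perturbations of the output of the network on input $x$.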
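The constrained maximization in equation (13) has a simple geometric reading: project the gradient onto the tangent subspace and rescale to length $r$. The sketch below illustrates this under loud assumptions: the tangent basis is a random orthonormal matrix standing in for $B_x$, and `grad` is a random vector standing in for $\nabla_x J_{\theta, y}(x)$; neither comes from a real model or dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 2, 0.1

# Hypothetical tangent basis: k orthonormal columns from a QR factorization.
B, _ = np.linalg.qr(rng.normal(size=(d, k)))
grad = rng.normal(size=d)              # stand-in for the input gradient of the loss

P = B @ B.T                            # orthogonal projector onto span(B)
proj = P @ grad
delta_hat = r * proj / np.linalg.norm(proj)   # lies in span(B), has l2 norm r
gain = float(grad @ delta_hat)         # first-order increase of the loss
```

The resulting first-order gain is $r \|P_{B_x} \nabla_x J_{\theta, y}(x)\|_2$, so minimizing the worst-case surrogate penalizes the $\ell_2$ norm of the projected gradient.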

## 5 Experimental Results

In this section we experiment with our proposed training algorithm on two popular benchmark datasets: MNIST [11] and CIFAR-10 [10]. In each case we compare the robustness of a network that was trained using Algorithm 1 to that of a network trained in a standard fashion.

### 5.1 Experiments on MNIST dataset

As a baseline, we trained a convnet with ReLU units, two convolutional layers, max pooling after every convolutional layer, and two fully connected layers (of sizes 200 and 10) on top. We evaluated this convnet on the MNIST test set and refer to it as “the baseline net”. We then used the baseline net to generate a collection of adversarial examples, using equation (11), with $\ell_1$, $\ell_2$ and $\ell_\infty$ norm balls.

Specifically, the adversarial perturbation was computed by a step in the steepest ascent direction w.r.t. the corresponding norm. The step w.r.t. the $\ell_\infty$ uncertainty set is the same as the fast method of [7]; the step w.r.t. the $\ell_2$ uncertainty set is in the direction of the gradient; the steepest ascent direction w.r.t. the $\ell_1$ uncertainty set comes down to changing the pixel corresponding to the entry of largest magnitude in the gradient vector. It is interesting to note that using equation (11) with $\ell_1$ uncertainty, it is possible to make a network mis-classify an image by changing only a single pixel. Several such examples are presented in Figure 1.

Altogether we generated a collection of 1203 adversarial examples, on which the baseline network had zero accuracy and which were generated from correctly classified test points. A sample of the adversarial examples is presented in Figure 2. We refer to this collection as .

We then used Algorithm 1 to re-train the net with the norm being $\ell_1$, $\ell_2$ and $\ell_\infty$ (each norm in a different experiment). We refer to the resulting nets as the robustified nets. Table 1 summarizes the accuracy of each robustified net on the MNIST test data and the collection of adversarial examples.

As can be seen, all three robustified nets classify correctly many of the adversarial examples in , with the uncertainty giving the best performance. In addition, all three robustified nets improve the accuracy also on the original test data, i.e., the adversarial training acts as a regularizer, which improves the network’s generalization ability. This observation is consistent with the ones in [21] and [7].

Next, we checked whether it is harder to generate new adversarial examples from the robustified nets (i.e., the nets that were trained via Algorithm 1) than from the baseline net. To do that, we used the fast method of [7] (see equation (2)) with various values of $\epsilon$ (which corresponds to the amount of noise added to or subtracted from each pixel) to generate adversarial examples for the baseline net, and for the robustified nets. For each $\epsilon$ we measured the classification accuracy of each net with respect to adversarial examples that were generated from its own parameters. The results are shown in Figure 3. Clearly, all three robustified nets are significantly more robust to generation of new adversarial examples.

To summarize the MNIST experiment, we observed that networks trained with Algorithm 1 (1) have improved performance on the original test data, (2) have improved performance on the original adversarial examples that were generated w.r.t. the baseline net, and (3) are more robust to generation of new adversarial examples.

### 5.2 Experiments on CIFAR-10 dataset

As a baseline net, we use a variant of the VGG net, publicly available online at [26], where we disabled the batch-flip module, which flips half of the images in every batch. This baseline net achieved accuracy of 90.79% on the test set.

As in Section 5.1, we constructed adversarial examples for the baseline net, using equation (11), with $\ell_1$, $\ell_2$ and $\ell_\infty$ uncertainty sets. Altogether we constructed 1712 adversarial examples, all of which were mis-classified by the baseline net, and were constructed from correctly classified test images. A sample from this set is presented in Figure 4.

We then used Algorithm 1 to re-train the net with $\ell_1$, $\ell_2$ and $\ell_\infty$ uncertainty. Figure 5 shows that the robustified nets take about the same number of epochs to converge as the baseline net.

Table 2 compares the performance of the baseline and robustified nets on the CIFAR-10 test data and the collection of adversarial examples.

Consistent with the results of the MNIST experiment, here as well the robustified nets correctly classify many of the adversarial examples in this set, and also outperform the baseline net on the original test data.

As in the MNIST experiment, we continued by checking whether it is harder to generate new adversarial examples from the robustified nets than from the baseline net; we used equation (2) (i.e., with $\ell_\infty$ uncertainty) and various values of $\epsilon$ to generate adversarial examples for each of the baseline and the robustified nets. The results are shown in Figure 6. We can see that new adversarial examples are consistently harder to generate for the robustified nets, which is consistent with the observation we had in the MNIST experiment.

To summarize the CIFAR-10 experiment, we observed that here as well, the robustified nets improve the performance on original test data, while making the nets more robust to generation of new adversarial examples. As in the MNIST experiment, the uncertainty yields the best improvement in test accuracy. In addition, the robustified nets require about the same number of parameter updates to converge as the baseline net.

## 6 Conclusions

In an attempt to theoretically understand successful empirical results with adversarial training, we proposed a framework for robust optimization of neural nets, in which the network’s prediction is encouraged to be consistent in a small ball with respect to some norm. The implementation is done using a minimization-maximization approach, where the loss is minimized over worst-case examples, rather than on the original data. Our framework explains previously reported empirical results, showing that incorporating adversarial examples during training improves accuracy on test data. In addition, we showed that the loss function published in [7] is in fact a special case of Algorithm 1, for a certain type of uncertainty, thus explaining intuitive interpretations given in that paper. We also showed a connection between Algorithm 1 and the manifold tangent classifier [17], showing that it too corresponds to a robustification of ANN training.

Experimental results on the MNIST and CIFAR-10 datasets show that Algorithm 1 indeed acts as a regularizer and improves the prediction accuracy also on the original test examples, and are consistent with previous results in [7] and [21]. Furthermore, we showed that new adversarial examples are harder to generate for a network that is trained using our proposed approach, compared to a network that was trained in a standard fashion. As a by-product, we also showed that one may be able to make a neural net mis-classify a correctly classified image by changing only a single pixel.

Explaining the regularization effect of adversarial training is in the same vein as the story of dropout: from practical experience, most authors knew that dropout (or, more generally, adding noise to data) acts as regularization, without a formal, rigorous justification. Later (well-cited) work by Wager, Wang, and Liang [22] created a rigorous connection between dropout and weighted ridge regression.

The scripts that were used for the experiments are available online at https://github.com/yutaroyamada/RobustTraining .