A Unified Gradient Regularization Family for Adversarial Examples

11/19/2015 ∙ by Chunchuan Lyu, et al. ∙ Xi'an Jiaotong-Liverpool University

Adversarial examples are augmented data points generated by imperceptible perturbation of input samples. They have recently drawn much attention within the machine learning and data mining community. Being difficult to distinguish from real examples, such adversarial examples can change the prediction of many of the best learning models, including state-of-the-art deep learning models. Recent attempts have been made to build robust models that take adversarial examples into account. However, these methods can either lead to performance drops or lack mathematical motivation. In this paper, we propose a unified framework to build machine learning models that are robust against adversarial examples. More specifically, using the unified framework, we develop a family of gradient regularization methods that effectively penalize the gradient of the loss function w.r.t. inputs. Our proposed framework is appealing in that it offers a unified view for dealing with adversarial examples. It incorporates another recently proposed perturbation-based approach as a special case. In addition, we present some visual effects that reveal semantic meaning in those perturbations, and thus support our regularization method and provide another explanation for the generalizability of adversarial examples. By applying this technique to Maxout networks, we conduct a series of experiments and achieve encouraging results on two benchmark datasets. In particular, we attain the best accuracy on MNIST data (without data augmentation) and competitive performance on CIFAR-10 data.




I Introduction

Imperceptible perturbations in images are able to change the prediction of neural network models, including the single-layer softmax model [1, 2, 3]. That is, given a trained neural network model and an input image, one can always find a small perturbation that changes the model's prediction through a certain optimization procedure. Moreover, perturbations obtained from one model can also change the predictions of many other classification models. The examples constructed by using such perturbations are referred to as adversarial examples, which have attracted much interest in both machine learning and data mining [2, 4, 5].

The existence of adversarial examples raises several important issues. First, why do such perturbations exist in the first place? Intuitively, a successful machine learning model should robustly classify indistinguishable inputs as the same class. The generalizability of those perturbations is even more intriguing, since they are obtained by an optimization process based on a specific model and image instance. Second, why do such perturbations not occur frequently in real applications (i.e., naturally generated data)? Indeed, if most examples were adversarial examples, no machine learning algorithm could work at all. Finally and most importantly, what can we do to deal with adversarial examples? One possibility is to build machine learning models that are immune to adversarial examples. Alternatively, we might be able to use adversarial examples to further improve the performance of most machine learning models.

To explain the existence of adversarial examples, [2] argued that this phenomenon could arise naturally from high-dimensional linearity, as opposed to the nonlinearity suspected by [1]. [6] showed that a model's robustness against adversarial examples is limited by the distinguishability between classes. This also supports the view that nonlinearity is not the fundamental reason behind adversarial examples. They also argued that the generalizability of adversarial examples is due to neural network models' resemblance to linear classifiers. In addition to directly addressing the problem, [3] discovered a twin problem: there exist images unrecognizable to humans that deep learning models nevertheless classify with high confidence. There is not yet a verifiable answer to why adversarial examples occur only infrequently in real-life cases, but some speculate that they appear only in low-probability regions of the data manifold [1, 2]. To alleviate the influence of adversarial examples, [7] tried to penalize the Jacobian matrix based on a series of approximations. While their model seems more robust, it usually leads to an accuracy drop and slow training due to the additional cost. Accuracy can be improved by injecting those perturbed examples back into training, as demonstrated by [1]. [2] further extended this idea to the so-called fast gradient sign method; models trained with this method are more robust against adversarial examples. While this method seems promising, it appears to lack mathematical motivation and, as a consequence, fails to fully utilize the idea of linear perturbation. In the data mining community, researchers also try to build models that are robust against adversarial examples (adversarial attacks) [4, 5]. In particular, one study investigated the correspondence between adversarial examples and effective regularizers, but its results are restricted to some specific loss functions [4].

We develop the linear view of adversarial examples proposed by [2] in a more rigorous and unified way. To this end, we propose a unified framework to train models that are robust to adversarial examples and cast it as a minmax problem. Specifically, we propose a family of gradient-based perturbations as a unified regularization technique. Models trained with our proposed gradient regularization family prove highly robust to perturbations. Moreover, under different values of the norm parameter, the family forms a unified framework, i.e., it incorporates the recently proposed fast gradient sign method [2] as a special case and also yields many other promising methods. One interesting special case is shown to achieve encouraging performance on two benchmark datasets. Furthermore, by magnifying perturbations on MNIST [8], we provide physical intuition for why adversarial examples generalize across different machine learning models.

II Gradient Regularization Family

In this section, we will present our framework of gradient regularization family and describe its theoretical properties.

II-A General Framework

We introduce our framework by starting with the worst-case perturbation, initially proposed in [2]. One salient feature of our framework is that it extends the previous work into a unified family that incorporates many important variants of the gradient regularization technique.

Denote $\mathcal{L}(x;\theta)$ as a loss function (this notation may be shortened to $\mathcal{L}(x)$ or $\mathcal{L}$ in the following text), $x$ as data, and $\theta$ as model parameters. The idea can be formalized as follows. Instead of solving $\min_\theta \mathcal{L}(x;\theta)$, ideally we would like to solve the following problem in order to build a model that is robust against any small perturbation $\epsilon$:

$$\min_\theta \max_{\epsilon:\ \|\epsilon\|_p \le \sigma} \mathcal{L}(x+\epsilon;\theta) \qquad (1)$$

The norm constraint in the inner problem implies that we only require our model to be robust against certain small perturbations. Thus the training procedure is decomposed into two stages. We first find a perturbation that maximizes the loss given the data and constraints. Then, we perform our ordinary training procedure to minimize the loss function by altering $\theta$. In general, this problem is difficult to solve due to its non-convex nature with respect to $\epsilon$ and $\theta$.

In the following, we solve the problem approximately. To this end, we first approximate the loss function by its first-order Taylor expansion at point $x$. The inner problem then becomes:

$$\max_{\epsilon}\ \mathcal{L}(x) + \nabla_x \mathcal{L}^{T} \epsilon \quad \text{s.t.} \quad \|\epsilon\|_p \le \sigma$$

This problem is linear, and hence convex, w.r.t. $\epsilon$. We can obtain a closed-form solution by the method of Lagrange multipliers; see Appendix A for details. This yields:

$$\epsilon = \sigma\, \mathrm{sign}(\nabla_x \mathcal{L}) \left( \frac{|\nabla_x \mathcal{L}|}{\|\nabla_x \mathcal{L}\|_{p^*}} \right)^{\frac{1}{p-1}}$$

where the sign, absolute value, and power are taken elementwise, and $p^*$ is the dual of $p$, i.e., $\frac{1}{p} + \frac{1}{p^*} = 1$.

If we substitute the optimal $\epsilon$ back into the original optimization problem, we can see that the influence of perturbations can be formulated as a regularization term. Thus, the new family of regularization methods works approximately as:

$$\min_\theta\ \mathcal{L}(x;\theta) + \sigma \|\nabla_x \mathcal{L}\|_{p^*}$$

Instead of minimizing $\mathcal{L}(x;\theta)$, we try to minimize $\mathcal{L}(x+\epsilon;\theta)$, where $\epsilon$ is defined above and depends on $p$ and $\sigma$, so that the new optimization objective is highly robust to small perturbations.
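The closed-form maximizer of the linearized inner problem can be checked numerically. The following is a minimal sketch, assuming the standard Hölder-equality solution for $1 < p < \infty$; the function names, the gradient values, and the choice of `sigma` and `p` are all illustrative and not taken from the paper.

```python
import math

def dual_exponent(p):
    """Return p* with 1/p + 1/p* = 1, valid for 1 < p < infinity."""
    return p / (p - 1.0)

def lp_norm(v, p):
    """Plain l_p norm of a list of floats."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def worst_case_perturbation(grad, sigma, p):
    """Maximizer of grad . eps subject to ||eps||_p <= sigma (1 < p < infinity)."""
    g_norm = lp_norm(grad, dual_exponent(p))
    if g_norm == 0.0:
        return [0.0 for _ in grad]  # zero gradient: no direction is preferred
    power = 1.0 / (p - 1.0)
    return [sigma * math.copysign((abs(g) / g_norm) ** power, g) for g in grad]

# Made-up input gradient and hyperparameters, for illustration only.
grad = [0.5, -2.0, 1.0]
sigma, p = 0.1, 3.0
eps = worst_case_perturbation(grad, sigma, p)

# First-order increase of the loss caused by eps.
gain = sum(g * e for g, e in zip(grad, eps))
```

Under these assumptions, `eps` uses the whole budget (its $p$-norm equals `sigma`) and `gain` equals `sigma` times the dual norm of `grad`, which is the regularization term induced by the perturbation.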

Remark 1.

It is worth noting that although it has long been known [9] that injecting Gaussian noise is equivalent to penalizing the trace of the Hessian matrix, to the best of our knowledge, our approach is the first general method to penalize the gradient of the loss function w.r.t. the input in complex models. Such regularization can be applied to any machine learning model that provides gradient information.

In the following, we examine three special cases of this family and show how this unified regularization framework could contain another method [2].

II-B Case $p = \infty$

We show that in this case our method reduces to the fast gradient sign method proposed in [2]. The worst perturbation becomes:

$$\epsilon = \sigma\, \mathrm{sign}(\nabla_x \mathcal{L}) \quad (\text{assuming } \nabla_x \mathcal{L}_i \neq 0) \qquad (7)$$

The corner case, where $\nabla_x \mathcal{L}_i = 0$, defines sign as 0. Therefore, we can reduce our general method to a special case, namely the fast gradient sign method. The corresponding induced regularization term is $\sigma \|\nabla_x \mathcal{L}\|_1$. Mathematically speaking, this regularization term appears unnatural, since the gradient is not penalized in an isotropic way. This might negatively affect performance, especially when the data is preprocessed to be Gaussian-like.

II-C Case $p = 1$

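This is another special case, where $p = 1$, and thus the worst perturbation becomes:

$$\epsilon_i = \begin{cases} \sigma\, \mathrm{sign}(\nabla_x \mathcal{L}_i) & \text{if } i = \arg\max_j |\nabla_x \mathcal{L}_j| \\ 0 & \text{otherwise} \end{cases}$$
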
The intuition here is quite clear. Since we are constrained by the sum of absolute value of all perturbations, it is intuitive to put all of our “budgets” into one direction. However, the induced regularization term is not very appealing, since it only penalizes gradient in one direction.
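The two extreme members of the family can be compared side by side. The sketch below assumes the sign step for the max-norm constraint and a single-coordinate step for the sum-of-absolute-values constraint, as the text describes; the helper names and the example gradient are hypothetical.

```python
def sign(t):
    """Sign with the corner case sign(0) = 0."""
    return (t > 0) - (t < 0)

def linf_perturbation(grad, sigma):
    """Max-norm budget: move every coordinate by sigma in the sign direction."""
    return [sigma * sign(g) for g in grad]

def l1_perturbation(grad, sigma):
    """Sum-of-absolute-values budget: spend everything on the largest |gradient|."""
    k = max(range(len(grad)), key=lambda i: abs(grad[i]))
    eps = [0.0] * len(grad)
    eps[k] = sigma * sign(grad[k])
    return eps

def linear_gain(grad, eps):
    """First-order change of the loss, grad . eps."""
    return sum(g * e for g, e in zip(grad, eps))

# Made-up gradient with one zero entry to exercise the corner case.
grad = [0.5, -2.0, 0.0, 1.0]
sigma = 0.1
eps_inf = linf_perturbation(grad, sigma)
eps_one = l1_perturbation(grad, sigma)
gain_inf = linear_gain(grad, eps_inf)  # sigma times the sum of |grad|
gain_one = linear_gain(grad, eps_one)  # sigma times the largest |grad|
```

The max-norm step penalizes the sum of absolute gradient entries, while the single-coordinate step penalizes only the largest entry, which illustrates why the latter regularizes the gradient in one direction only.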

II-D Case $p = 2$

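As [10] has indicated that the extreme cases of the hyperparameter ($p = 1$ or $p = \infty$) might not be the optimal setting, we would like to introduce standard gradient regularization, where $p = 2$. Then, we will develop a second-order Taylor expansion analysis of this perturbation to provide further theoretical insight into this particular case. First, let us write down the formula for $\epsilon$:

$$\epsilon = \sigma \frac{\nabla_x \mathcal{L}}{\|\nabla_x \mathcal{L}\|_2}$$

We are now ready to compute the second-order expansion. With $H$ denoting the Hessian of the loss w.r.t. the input, the second-order Taylor expansion of the loss function is:

$$\mathcal{L}(x+\epsilon) \approx \mathcal{L}(x) + \nabla_x \mathcal{L}^{T} \epsilon + \frac{1}{2} \epsilon^{T} H \epsilon = \mathcal{L}(x) + \sigma \|\nabla_x \mathcal{L}\|_2 + \frac{\sigma^2}{2} \frac{\nabla_x \mathcal{L}^{T} H\, \nabla_x \mathcal{L}}{\|\nabla_x \mathcal{L}\|_2^2}$$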

(by Lemma B.1)
(by Lemma B.1)

Therefore, the second-order Taylor expansion resembles the term induced by marginalizing over Gaussian noise.

Remark 2.

It is possible to write down a generalized form of second order regularization term as . In this form, it does not provide further insight, as it appears to be duplication of the first order term in general.

II-E Exact Solution Is Non-Trivial

As mentioned earlier, the original minmax problem (1) is generally difficult to solve. However, it is still of interest to see whether the original minmax problem can be solved without approximation. It turns out that this is not easy to achieve, even in the simplest case of a linear regression model. We show this in the following.

The optimization problem of finding the worst perturbation in the linear regression model can be formulated as follows:

$$\max_{\epsilon:\ \|\epsilon\|_p \le \sigma} \|W(x+\epsilon) - y\|_2^2$$

where $y$ is the desired output, and $W$ is the weight matrix. Hence, we are maximizing a convex function, which is non-trivial. One might hope that applying the first-order expansion would lead to a simple solution. However, while this approximately solves the maximization problem, the minimization problem becomes non-convex due to the additional regularization term. Therefore, it appears to be hard to obtain an exact solution even in this very simple model.

II-F Computational Cost

The computational cost of injecting such adversarial noise comes from computing the perturbation $\epsilon$, which requires computing the input gradient $\nabla_x \mathcal{L}$ and a minimal overhead to convert $\nabla_x \mathcal{L}$ into $\epsilon$. A naive BP approach to computing $\nabla_x \mathcal{L}$ will roughly double the time cost of training. We implemented the algorithm using built-in functionality provided by [11].
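The two passes behind this cost estimate can be made concrete on a toy model. The sketch below uses a binary logistic regression with an analytic input gradient as a stand-in; this is an assumption for illustration and not the Maxout implementation used in the experiments, where the gradient would come from backpropagation. The weights, input, and `sigma` are made up.

```python
import math

def logistic_loss(w, x, y):
    """Loss log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]

def perturbed_loss(w, x, y, sigma):
    """Pass 1: input gradient. Pass 2: loss at the perturbed input (2-norm case)."""
    g = input_gradient(w, x, y)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return logistic_loss(w, x, y)
    x_adv = [xi + sigma * gi / norm for xi, gi in zip(x, g)]
    return logistic_loss(w, x_adv, y)

# Made-up parameters and input.
w = [1.0, -2.0, 0.5]
x = [0.3, 0.1, -0.4]
y = 1
clean = logistic_loss(w, x, y)
robust = perturbed_loss(w, x, y, sigma=0.1)
```

Each training example requires one gradient evaluation to build the perturbation and a second evaluation at the perturbed input, which is where the roughly doubled cost comes from. Here `robust` exceeds `clean` by about `sigma` times the gradient norm, matching the first-order regularization term.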

III Visualization and Interpretation of Gradient Perturbation

Traditionally, adversarial examples and corresponding perturbations are regarded as unintelligible to naked eyes [1, 2]. However, we will visualize our gradient perturbations in the case of , and thus provide both physical intuition of gradient perturbation and mathematical structure behind. The physical intuition, in particular, could support the effectiveness of gradient perturbation, and explain why adversarial examples could generalize across models. The model used to generate sample images in this section is a sigmoid network model trained on the MNIST dataset [8].

III-A Visualizing Adversarial Examples

It has been demonstrated that adversarial examples are generated by imperceptible or unrecognizable perturbations [1, 2]. However, we will show that in some cases we can visualize those perturbations. Let us present randomly picked samples from MNIST in Figure 1 and the perturbed images by gradient perturbation with and in Figure 2.

Fig. 1: Original input images from MNIST dataset
Fig. 2: Inputs perturbed by gradient perturbation with and

Indeed, at first glance, we may conclude that the perturbed examples are indistinguishable from the original copies. However, let us take a closer look at the perturbations in Figure 3 and Figure 4 by magnifying the perturbation by a factor of 10:

Fig. 3: The corresponding perturbations magnified by a factor of 10
Fig. 4: Inputs perturbed by gradient perturbation with and

Although the magnified perturbations do not still make much sense to naked human eye, the exaggerated adversarial examples are presented in a more meaningful way for the naked eye (see Figure 4). The perturbations have changed the physical shapes of the objects in a perceptible way. The number are of special interest. The bottom right corner of has been erased and turned into ; is extended into number ; the is the most interesting, where a part is erased while another part is extended to become a .

III-B Mathematical Structure

In this subsection, we will try to interpret the above phenomenon mathematically. To understand how those adversarial examples are generated, let us decompose the gradient perturbation when :


$$\nabla_x \mathcal{L} = \left( \frac{\partial z}{\partial x} \right)^{T} \frac{\partial \mathcal{L}}{\partial z}, \qquad \frac{\partial \mathcal{L}}{\partial z} = \mathrm{softmax}(z) - y$$

where $z$ is the input of the softmax layer for a typical neural network, and $y$ is the true label. Next, we need to make the following assumption: the Jacobian matrix $\frac{\partial z}{\partial x}$ encodes the local version of the corresponding classes. Then, when the prediction is concentrated on the correct output, $\frac{\partial \mathcal{L}}{\partial z}$ should be small and negative. This corresponds to the cases like and in the previous section, where the perturbed examples are still clear and correct. In the case of low prediction confidence, fuzzy perturbed examples could be generated. This corresponds to the cases like , and . The most interesting cases happen when there is a confusion between two classes. That is to say, $\frac{\partial \mathcal{L}}{\partial z}$ contains two large components, one positive and one negative. The negative component corresponds to a weak prediction of the correct label, and the positive component corresponds to the most confused wrong prediction. Images like , and in the previous section are somewhat confusing when compared to the other numbers. In such cases, an adversarial perturbation erases the correct objects and injects the confused one.
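The behaviour of the signal that enters the softmax layer can be illustrated numerically. The sketch below assumes a softmax output with cross-entropy loss, for which the gradient with respect to the softmax input is the predicted distribution minus the one-hot true label; the logit values are invented to mimic a confident prediction and a two-class confusion.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, label):
    """Cross-entropy loss of logits z against an integer class label."""
    return -math.log(softmax(z)[label])

def grad_wrt_logits(z, label):
    """Gradient of the cross-entropy w.r.t. z: softmax(z) minus one-hot(label)."""
    probs = softmax(z)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

# Confident and correct: every component is small, the true class is negative.
confident = grad_wrt_logits([6.0, 0.0, 0.0], label=0)

# Confusion between class 0 (true) and class 1: one large negative
# component and one large positive component.
confused = grad_wrt_logits([2.0, 2.2, -3.0], label=0)
```

In the confident case the backpropagated signal is close to zero, so little regularization is applied; in the confused case the negative entry pushes against the true class and the positive entry pushes towards the competing class.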

Remark 3.

In the above argument, we made an assumption about the Jacobian's ability to model the local space. This assumption is reasonable, since neurons in several layers of a convolutional neural network correspond to certain natural image statistics [12]. While such an assumption might not hold strictly for more complex data or data in other domains, there is research indicating that such information is encoded [13].

The above finding suggests that the gradient perturbation method works differently from techniques like weight decay or weight constraints. It appears that gradient perturbation adds regularization effects conditioned on the performance of the model on a particular instance. Therefore, it will not over-regularize when unnecessary. This interpretation is consistent with arguments made in [2].

III-C Generalization of Adversarial Examples

The visualization tells us that image instances are actually morphed into other classes in a very minimal form. Our machine learning models are much more perceptive than people in detecting such minor perturbations. Thus, the ability of adversarial examples to generalize across different models might come from meaningful changes in examples. A small change towards a specific class will still be a change even added in another instance.

This explanation is also consistent with [2]. In their framework, we can say that the generalization of adversarial examples is due to the models' resemblance to linear classifiers, and that linear classifiers can encode local semantic information of images.

IV Experiments

In this section, we apply our proposed gradient perturbation method to standard benchmarks including MNIST [8] and CIFAR-10 [14]. In addition, we test whether gradient regularization improves the models' robustness. We mainly test the case of $p = 2$, which we refer to as standard gradient regularization.

IV-A MNIST

The MNIST dataset [8] consists of 60,000 training examples and 10,000 testing examples. Each example is a 28×28 greyscale image of a digit from 0 to 9. We rescale the inputs into the range of , and no further preprocessing is applied. We tested standard perturbation regularization on three architectures: a standard sigmoid multilayer perceptron (MLP), a Maxout network, and a Convolutional Maxout network [15]. The sigmoid MLP experiments are conducted to investigate the utility of our method on highly non-linear models, and the other two experiments are designed to further verify whether our proposed framework can obtain state-of-the-art performance.

The sigmoid MLP has two hidden layers, whose sizes are chosen from [400,400], [600,600] and [800,800] hidden units based on validation. A norm constraint is the only other means of regularization used. The max norm is set to be the square root of 15 as in [16]. The regularization parameter is chosen from 0.1, 0.5, 1, and 2. The final architecture used is two layers of sigmoid units, and a softmax final layer. is set to be 1. To fully utilize the training set, we followed [15] to train the model on 50,000 examples first, and we record the loss on the training set. Next, we train with a total of 60,000 examples until the last 10,000 examples hit the same loss.

For the experiments based on Maxout networks, we have engaged the model from Maxout paper [15] and applied gradient regularization where as we have found from experiments on sigmoid MLP. No other parameter setting is tested. In addition, we tested the case of for the Convolutional Maxout network, since it is not reported by [2]. We have found that setting leads to better results than , which is the best parameter for the non-convolutional Maxout network according to [2].

Models                                  Test Error (%)
Maxout + dropout                        0.94
Sigmoid MLP + gradient                  0.93
ReLU + dropout + Gaussian               0.85
Maxout + dropout + gradient             0.84 *
DBM + dropout                           0.79
Maxout + dropout + gradient p = 2       0.78
TABLE I: Testing errors on permutation invariant MNIST dataset without data augmentation.
* It was recently improved to 0.78 by applying the adversarial idea to early stopping and changing the architecture.

In the permutation invariant version, we achieved 78 errors among 10,000 test samples in the case of $p = 2$. To the best of our knowledge, this is the best result in this category, even compared with methods using unsupervised pretraining, if data are not augmented. Also, the gradient-regularized, old-fashioned sigmoid MLP achieves 93 errors, which actually beats simple Maxout networks with dropout training. This indicates that even a highly nonlinear model such as a sigmoid MLP can be linear enough locally to benefit from gradient regularization. We summarize our results and other related best results in Table I.

Models                                       Test Error (%)
Conv. Maxout + dropout                       0.45
Conv. Maxout + dropout + gradient            0.41
Conv. Maxout + dropout + gradient p = 2      0.39
Conv. Kernel Network                         0.39
TABLE II: Testing errors on MNIST dataset without data augmentation. Convolutional architecture is used.

In terms of improvements on the convolutional architecture, our best result is obtained by standard gradient regularization. We achieved 39 errors on the testing set, which ties with the recently proposed sophisticated convolutional kernel network described by [19]. To the best of our knowledge, no better results have been obtained without data augmentation such as elastic distortion. It appears that standard gradient regularization consistently outperforms the fast gradient sign method, especially if we consider that the parameter is not tuned for the Maxout network. We summarize the results in Table II.

IV-B CIFAR-10

The CIFAR-10 dataset [14] consists of 60,000 32×32 RGB images, corresponding to 10 categories of objects. There are 50,000 training examples and 10,000 testing examples. Again, we followed the procedure in the Maxout paper [15]. To quickly verify gradient regularization in a more complex setting, we tested on a small convolutional Maxout network reported by [20], and can confirm again the improvement brought by our regularization technique. was set to be in this case, since the RGB image has more dimensions. Similar to MNIST, the training has two stages; we compared the test error at the end of both stages.

Models                                                      Testing Error (%)
Conv. Maxout Small + dropout [20] (40K)                     14.05
Conv. Maxout Small + dropout + Gradient p = 2 (40K)         13.26
Conv. Maxout Small + dropout [20] (50K)                     12.93
Conv. Maxout Small + dropout + Gradient p = 2 (50K)         12.28
TABLE III: Testing errors on CIFAR-10 dataset without data augmentation.

As can be seen from Table III, our gradient regularization technique improves the performance of this network.

IV-C Robustness

Since a recent theoretical study suggests that there is a relationship between adversarial examples and Gaussian noise [6], we have used testing errors under Gaussian noise as a criterion to test whether our regularization technique improves the models' robustness against perturbations. We will also discuss this relationship in Section V. We gather some data based on testing errors on MNIST.

Models \ Gaussian std                        0       0.1     0.3
Conv. Maxout + dropout                       0.45    0.76    20.17
Maxout + dropout                             0.94    1.14    2.42
Conv. Maxout + dropout + Gradient p = 2      0.39    0.44    14.36
Maxout + dropout + Gradient p = 2            0.78    0.83    1.29
TABLE IV: Testing errors (%) on the MNIST dataset under Gaussian noise

Models \ Gaussian std                        0       0.1     0.3
Conv. Maxout + dropout                       13.08   13.37   16.88
Conv. Maxout + dropout + Gradient p = 2      12.28   12.81   16.80
TABLE V: Testing errors (%) on the CIFAR-10 dataset under Gaussian noise

As can be seen from Tables IV and V, our regularization technique indeed improves the models' ability to resist perturbation. However, it is interesting to note that the convolutional architecture seems to suffer more severely than the standard Maxout network. This observation is consistent with the data gathered from the ImageNet dataset [1], where researchers have shown that the operator norm is larger in convolutional architectures than in fully connected networks. This is yet another indication that the randomness in natural images is quite small, even compared to so-called imperceptible perturbations.

V Discussion

In this section, we will show how to exploit the proposed unified theory and mainly discuss why adversarial examples do not occur frequently in real data. Some of the following analysis may not be strict enough but sufficient to provide intuition and interpretations on certain properties of adversarial examples.

By assuming the perturbation is caused by the Gaussian noise, we will first analyze how the misclassification rates of various learning models could be related to the minimum perturbation. Here, the minimum perturbation is measured in terms of its perturbation size that is able to change the models’ prediction on a certain example. After the establishment of the relationship, we will be able to predict the probability or the frequency that adversarial examples occur in real data. In order to achieve these objectives, we will develop a mathematical model by following two fundamental assumptions made in [2]: the manifold where adversarial examples live is not special and they mainly arise from local linear behavior.

More precisely, we assume that every data point is associated with a Gaussian distribution:


is the standard deviation in every dimension, and

is an identity matrix of size

. Thus, adversarial examples do not live in a region with particularly low probability as contrary to suggestion from [1]. There are directions to generate effective adversarial examples, whose corresponding minimum perturbations are respectively denoted as , and direction for . In order to generate an adversarial example from , we need to satisfy:


Denote adversarial examples to occur as , then we could obtain a lower bound:


Since the Gaussian distribution is isotropic, could be simplified into:


and where means in an arbitrary axis. This essentially means that given , it will not increase the probability of adversarial examples to occur by increasing dimension of input space. To calculate the probability, we only need . To further simplify our formula, we assume all ’s are equal, and just denote each as . Hence, there are three parameters in our model: , among which the size of is controlled.
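The isotropy argument can be checked with a small simulation: when isotropic Gaussian noise is projected onto any fixed unit direction, the projection has the same one-dimensional spread regardless of the input dimension. The sketch below is a Monte Carlo illustration under that assumption; the dimensions, noise level, sample count, and function name are arbitrary choices, not values from the paper.

```python
import math
import random

def projected_std(dim, noise_std, samples, rng):
    """Empirical std of isotropic Gaussian noise projected on a random unit direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(t * t for t in direction))
    direction = [t / length for t in direction]
    projections = []
    for _ in range(samples):
        projections.append(sum(rng.gauss(0.0, noise_std) * d for d in direction))
    mean = sum(projections) / samples
    return math.sqrt(sum((t - mean) ** 2 for t in projections) / samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
low_dim = projected_std(dim=5, noise_std=0.1, samples=4000, rng=rng)
high_dim = projected_std(dim=200, noise_std=0.1, samples=4000, rng=rng)
```

Both estimates stay close to the per-dimension noise level of 0.1, so the chance that noise exceeds a given minimum perturbation along one direction does not grow with the number of input dimensions.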

We could develop some qualitative intuitions first. To make a fair comparison, we set that the mean distortion of adversarial examples, as defined by [1]

, equal to the variance of the Gaussian distribution. In other words, we set

, so that our model only depends on the number of directions and the dimension of input . Then, based on Equation (28), we have:


Since the event denotes to be standard deviation away from the mean, the probability of the event to occur shrinks exponentially. Therefore, while high dimensionality makes it possible to find a direction such that the model prediction would be changed, it also provides protection in a probabilistic sense that such prediction change could be unlikely.

In the next subsection, we will concretely take the MNIST data as one real example to make further analysis. The image pixels used in the following analysis are all scaled into [0,1]. Following [1], we will conduct our analysis based on the simple softmax model, which is of linear nature.

V-A Estimating Misclassification Rate under Gaussian Noise

This subsection mainly aims at verifying our linear analysis framework and gathering insights from empirical observations. To get a reasonable numerical estimate of the misclassification rate, it is not enough to use the average of the minimum perturbation; instead, we have to take the distribution of the minimum perturbation into consideration. There are two reasons for this: (1) minimum perturbations of small value have a major impact on the probability that adversarial examples occur, and (2) the minimum perturbation skews towards small values, as seen in Figure 5.

Fig. 5: Histogram of minimum perturbation in a single layer softmax model with regularization parameter . Histograms of other regularization parameters are of the same shape but different in scale

We approximate the distribution of by a Gaussian distribution . We will then have:


The summation of two Gaussian random variables, denoted by

, is Gaussian, whose mean and variance are and

respectively. Then, we have the probability related to a cumulative distribution function of a Gaussian variable:


If we further assume that the smallest perturbation in one certain direction dominates the predicted probability, we can roughly set . (Such an assumption may be less strict, but it is simple and useful for gaining clear insight into the theory.) Then, the remaining task is to get an estimation of and . Following the linear view of adversarial examples, we perform a line search in the direction of the gradient to find the minimum perturbation needed to change the prediction from correct to wrong. We have gathered the statistics from softmax models, and summarized the results in Table VI. Note that these statistics are obtained on the training set (with a sufficiently large number of samples) so as to reach a statistically reliable conclusion. Similar observations can be made if the test data are used.

Models mean standard deviation
Softmax 0.2744 0.1511
Softmax 0.4722 0.2531
Softmax 1.2718 0.6654
TABLE VI: Mean and standard deviation of the minimum perturbations for softmax models on MNIST training dataset

As can be seen from Table VI, a larger regularization parameter corresponds to better robustness. To estimate the misclassification rate under different levels of Gaussian noise, we first need to obtain the misclassification rate without noise. With the further assumption that misclassification is not corrected by noise, the predicted misclassification rate can be calculated theoretically as:


Given the perturbation under the Gaussian noise, we can get the actual misclassification rates in MNIST and also estimate such rate theoretically using Equation (33). These results are listed in Table VII and Table VIII respectively.

Models \ Gaussian std    0       0.1     0.3
Softmax                  6.02    11.99   40.07
Softmax                  6.29    8.22    22.44
Softmax                  8.93    9.16    11.33
TABLE VII: Actual misclassification rate on MNIST under the perturbation of Gaussian noise

Models \ Gaussian std    0       0.1     0.3
Softmax                  6.02    12.12   25.47
Softmax                  6.29    10.16   17.01
Softmax                  8.93    11.61   12.63
TABLE VIII: Misclassification rate theoretically estimated by (33) on MNIST under the perturbation of Gaussian noise
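The entries of Table VIII can be approximately reproduced from Table VI and the zero-noise column of Table VII. The sketch below is one reading of the model and should be treated as an assumption: a single dominant direction, a Gaussian minimum perturbation, independent Gaussian noise, and noise that only flips correctly classified samples. The function names are hypothetical.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def predicted_error(mean_pert, std_pert, noise_std, base_error):
    """Predicted misclassification rate in percent.

    The difference between the noise along the dominant direction and the
    minimum perturbation is Gaussian; a flip happens when it is positive.
    """
    total_std = math.sqrt(std_pert ** 2 + noise_std ** 2)
    flip_prob = normal_cdf(-mean_pert / total_std)
    base = base_error / 100.0
    return 100.0 * (base + (1.0 - base) * flip_prob)

# (mean, std) of the minimum perturbation from Table VI, and the
# zero-noise error in percent from Table VII, one tuple per softmax model.
models = [(0.2744, 0.1511, 6.02), (0.4722, 0.2531, 6.29), (1.2718, 0.6654, 8.93)]
predictions = [
    [round(predicted_error(m, s, noise, err), 2) for noise in (0.1, 0.3)]
    for m, s, err in models
]
```

Under these assumptions the six computed values agree with the 0.1 and 0.3 columns of Table VIII to within a few hundredths of a percentage point, which suggests the reading is close to the calculation used for the table.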

As can be seen by comparing Table VII and Table VIII, the misclassification rate estimated from our probability model fits the actual rate obtained from empirical experiments reasonably well. Although the numerical results do not match exactly, the qualitative changes of the predictions and the actual results are fairly consistent under different levels of Gaussian noise.

Note that, within the linear view of adversarial examples, we have made two major assumptions in the previous analysis. First, we adopt the most dominant direction to estimate the misclassification rate. This may lead to underestimation of misclassification, since other directions are neglected. Second, we approximate the distribution of minimum perturbation by a Gaussian distribution. This may lead to overestimation of misclassification rate in the case of no noise. Table VI indicates that standard deviation of minimum is roughly half of the mean in a consistent fashion. Then, even without Gaussian noise, the calculation yields roughly . The reason behind this is that the actual distribution is only supported in the positive region, but Gaussian distribution is not. This simplification does not have a major impact when is large, and our results agree with this. However, to estimate when is small, we may need to readjust our approximation.
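The size of the overestimation at zero noise follows from the Gaussian approximation alone. The sketch below assumes, as stated above, that the standard deviation of the minimum perturbation is about half of its mean, and computes the probability mass that the fitted Gaussian places on negative perturbations; the per-model values use the statistics of Table VI.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Idealized case: standard deviation exactly half of the mean.
mass_below_zero = normal_cdf(-2.0)

# Same quantity for the three (mean, std) pairs of Table VI.
table_vi = [(0.2744, 0.1511), (0.4722, 0.2531), (1.2718, 0.6654)]
per_model = [normal_cdf(-mean / std) for mean, std in table_vi]
```

The idealized mass is a little over two percent, and the per-model values fall in a similar range, so a Gaussian fit predicts a non-negligible number of flips even when no noise is added.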

V-B Estimating the Probability of Adversarial Examples

To estimate the probability that adversarial examples occur in reality, we need to estimate the randomness in natural images and the distribution of the minimum perturbations. This is especially necessary when the perturbation is sufficiently small, i.e., close to , since the overestimation occurs in this case as indicated in the previous subsection. In the following, we intend to choose a that is sufficiently large, yet is hardly perceptible. We first illustrate some examples from MNIST dataset in Figure 6, where .

Fig. 6: Images corrupted by Gaussian noise with

In Figure 6, such noise is imperceptible even under close examination, which is not the case for significantly larger noise levels. We observe that the distribution of the minimum perturbation appears to be linear when it is close to zero, as seen in Figure 7. [Footnote 4: The gaps in the figure are caused by the steps of the line search.]

Fig. 7: Histogram of minimum perturbation in a single layer softmax model with regularization parameter . Region beyond 0.1 is filtered.

In light of this, we can assume that the probability density is linear in the minimum perturbation around zero. After gathering statistics from the histogram and some algebra, we can predict the theoretical increase in the misclassification rate (based on our model) for a perturbation caused by Gaussian noise at the chosen noise level. The percentage of such additional misclassification is calculated theoretically and listed in Table IX. We also list the actual additional misclassification, which was obtained empirically from experiments.
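The text does not spell out the algebra behind the theoretical values. One way such an estimate could be set up, purely as an illustrative sketch, is the following. Assume the density of the minimum perturbation r is `slope * r` near zero, and assume Gaussian noise with standard deviation `sigma` acts along the dominant adversarial direction as a one-dimensional N(0, sigma^2) variable, so that a sample is misclassified when the projected noise exceeds r. Under these assumptions the additional error is the integral of `slope * r * P(noise > r)` over r, which evaluates to `slope * sigma**2 / 4`. The names `slope` and `sigma` and the values used below are hypothetical and are not taken from Figure 7 or Table IX.

```python
import math


def gaussian_tail(u):
    """P(Z > u) for a standard normal Z."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def extra_error_numeric(slope, sigma, steps=20000):
    """Midpoint-rule integral of slope * r * P(N(0, sigma^2) > r) over r >= 0.

    The integral is truncated at 10 * sigma, beyond which the Gaussian
    tail is negligible.
    """
    upper = 10.0 * sigma
    h = upper / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        total += slope * r * gaussian_tail(r / sigma) * h
    return total


def extra_error_closed_form(slope, sigma):
    """Closed form of the same integral: slope * sigma^2 / 4."""
    return slope * sigma ** 2 / 4.0


# Hypothetical values, chosen only to exercise the two functions.
slope, sigma = 2.0, 0.05
print(extra_error_numeric(slope, sigma), extra_error_closed_form(slope, sigma))
```

The quadratic dependence on `sigma` in this sketch is consistent with the qualitative point of the section: when the noise is small, the predicted number of naturally occurring adversarial examples is very small.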

Models Actual Predicted
Softmax 0.07 0.04
Softmax 0.01 0.02
Softmax 0.01 0.01
TABLE IX: Actual and predicted percentage of additional misclassification caused by the perturbation of Gaussian noise on MNIST

Note that, according to the above analysis and assumptions, these additional misclassified samples can be regarded as adversarial examples that occur naturally when the perturbation is given by Gaussian noise. As seen in Table IX, these numbers are sufficiently small both theoretically (under column “Predicted”) and empirically (under column “Actual”), meaning that the probability of adversarial examples is quite small. Therefore, the claim that adversarial examples could endanger our machine learning models seems somewhat overcautious, since their natural occurrence is expected to be rare and negligible compared to the classification errors of our models. This is verified both theoretically and empirically in the above analysis.

VI Conclusion

In this paper, we have proposed a unified framework that aims at building models robust to adversarial examples. We have derived a family of gradient regularization techniques, which can be used in a wide range of machine learning models. In particular, one special case has a second-order interpretation: it approximately marginalizes over Gaussian noise. We explained the ability of adversarial examples to generalize across models by visualizing their effects, which had previously been thought to be unrecognizable. This visualization, in turn, supplements our regularization technique with physical meaning.

While the linear view appears to be plausible and fruitful, it still needs more empirical support. To be more concrete, the manifold hypothesis suggests that data reside on a low-dimensional manifold in the input space. This, however, suggests that it is always possible to leave the manifold by a slight change, since there are extra dimensions around every point on the manifold, and this fact supports the hypothesis. Theoretically, a low-dimensional manifold has measure zero. This makes the number analogy made by [2] almost literally true, but with adversarial examples being the irrational numbers. In reality, however, data float around the manifold. Thus, geometrically, we could check the implications of the linear view by relating the errors induced by Gaussian noise to the minimum perturbation length. Our preliminary work suggests that the distribution of the minimum perturbation length would be required to account for the actual errors.


The authors would like to thank the developer of Deeplearning Toolbox [21], the developers of Theano [11, 22], and the developers of Pylearn2 [20].

Appendix A Solving

Since is independent of , our problem is


Clearly, the optimal would have a norm of ; otherwise, we could rescale to get a greater loss. Therefore, we are set to solve


This can be solved by the standard Lagrange multiplier method, where , and . Setting , we have


Summing over both sides, we obtain


Combining (38) and (43), it is easy to see that

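Since the equations of this appendix are not reproduced here, the following sketch checks numerically the kind of result such a Lagrange multiplier argument leads to. It assumes the problem is to maximize a linear objective `grad @ eps` subject to a p-norm constraint `||eps||_p = sigma` with 1 < p < infinity; the names `grad`, `eps`, `sigma`, `p`, and `worst_case_perturbation` are chosen for illustration. Under that assumption, Hölder's inequality gives the maximizer in closed form in terms of the dual exponent q = p / (p - 1), and the maximum value is `sigma` times the q-norm of `grad`.

```python
import numpy as np


def worst_case_perturbation(grad, sigma, p):
    """Maximizer of grad @ eps subject to ||eps||_p = sigma, for 1 < p < inf.

    Assumes grad is a nonzero 1-D array. q is the dual exponent with
    1/p + 1/q = 1.
    """
    q = p / (p - 1.0)
    dual_norm = np.linalg.norm(grad, ord=q)
    return sigma * np.sign(grad) * (np.abs(grad) / dual_norm) ** (q - 1.0)


rng = np.random.default_rng(0)
grad = rng.normal(size=8)  # hypothetical gradient of the loss w.r.t. the input
sigma, p = 0.1, 3.0
q = p / (p - 1.0)

eps = worst_case_perturbation(grad, sigma, p)
best = grad @ eps

# The constraint is active and the objective equals sigma * ||grad||_q.
assert np.isclose(np.linalg.norm(eps, ord=p), sigma)
assert np.isclose(best, sigma * np.linalg.norm(grad, ord=q))

# No random point on the constraint surface does better.
for _ in range(1000):
    candidate = rng.normal(size=grad.shape)
    candidate *= sigma / np.linalg.norm(candidate, ord=p)
    assert grad @ candidate <= best + 1e-12
```

For p = 2 the expression reduces to a step of length `sigma` along the normalized gradient, which is a convenient sanity check.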

Appendix B Lemma

Let vector be the output of the model, i.e. . Correspondingly, we have vector as the desired output. We have as the length of , where is the number of classes in a classification task. is the Gauss-Newton matrix, which is an approximation of the Hessian matrix of w.r.t. [23]. is the Jacobian matrix of w.r.t. , and is the Hessian matrix of w.r.t. .

Lemma B.1.

Given , and s.t. . We have .


First, by the chain rule, we have . By cancellation of on both sides, it suffices to show . Now, let us calculate the gradient:


Then, compute the outer product: . This means that all the off-diagonal entries are zero and only one term is left on the diagonal, which is . Let us compute the Hessian matrix:


Hence, also has only one diagonal term, which is . Therefore, we have the equality as desired. ∎
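The structure of this proof can be checked numerically under one plausible reading of the lemma, stated here as an assumption because the symbols are not reproduced: the loss is the cross-entropy between a one-hot target `t` and the model output `z` (a vector of positive class probabilities), and the claim is that the Gauss-Newton matrix `J.T @ H @ J` equals the outer product of the input gradient with itself. The variable names and the random Jacobian `J` are illustrative only.

```python
import numpy as np


def cross_entropy_grad(z, t):
    """Gradient of L(z) = -sum(t * log(z)) with respect to z."""
    return -t / z


def cross_entropy_hessian(z, t):
    """Hessian of L(z) = -sum(t * log(z)) with respect to z (diagonal)."""
    return np.diag(t / z ** 2)


rng = np.random.default_rng(0)
num_classes, num_inputs = 4, 6

z = rng.dirichlet(np.ones(num_classes))  # hypothetical model output (probabilities)
t = np.zeros(num_classes)
t[2] = 1.0                               # one-hot target
J = rng.normal(size=(num_classes, num_inputs))  # hypothetical Jacobian of z w.r.t. x

g_z = cross_entropy_grad(z, t)
H_z = cross_entropy_hessian(z, t)

# Both matrices have a single nonzero entry, on the diagonal at the target class.
assert np.allclose(np.outer(g_z, g_z), H_z)

# Chain rule: the input gradient is J.T @ g_z, so the Gauss-Newton matrix
# J.T @ H_z @ J coincides with the outer product of the input gradient.
g_x = J.T @ g_z
assert np.allclose(J.T @ H_z @ J, np.outer(g_x, g_x))
```

The first assertion is the step the proof reduces to after cancelling the Jacobian; the second restores the Jacobian on both sides.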