# On Model Robustness Against Adversarial Examples

## 1 Introduction

Deep Neural Networks (DNNs) have achieved great success in various tasks, such as speech recognition, image classification, and object detection [12, 6]. However, recent research shows that certain small perturbations of the input samples, called adversarial examples, may fool many powerful deep learning models [5].

To better handle adversarial perturbations, many seminal works have studied how to generate adversarial examples. For example, an early approach relied on the L-BFGS optimization method [15] to generate such examples. The Fast Gradient Sign Method (FGSM) was later proposed, which can also generate the adversarial perturbation [5]. Lyu et al. and Shaham et al. further extended FGSM to more general norm constraints [16, 25], where FGSM can be seen as the adversarial training method with the $l_\infty$ norm. Kurakin et al. applied projected gradient steps on the negative loss to find the worst perturbation [10], which can be viewed as a multi-step version of FGSM. Focusing on the gradient of the loss function, these three methods proved effective in improving model robustness on both natural and adversarial examples through data augmentation with adversarial examples.

On the other hand, for defending against adversarial attacks, defensive distillation and feature squeezing were investigated [30, 22]. Furthermore, Virtual Adversarial Training (VAT) can generate adversarial perturbations without label information [21, 19]. More specifically, the objective of VAT is to smooth the output distribution by minimizing the divergence between the outputs of natural examples and adversarial examples. In addition, Kos et al. proposed methods to generate adversarial examples for generative models [9].

In parallel to studying how adversarial examples can be generated, researchers have also made great efforts to understand the theory and principles underlying adversarial examples. In particular, Ma et al. have shown that adversarial examples are not isolated points but form a dense region of the input space [17]. Fawzi et al. studied the model robustness against adversarial examples by establishing a general upper bound [3, 2]. Finlay et al. and Lyu et al. have demonstrated that FGSM and its extended general cases can be interpreted as a form of regularization [4, 16]. Similarly, Cisse et al. showed that the sensitivity to adversarial examples can be controlled by the Lipschitz constant of the network and proposed a new network structure which is insensitive to adversarial examples [1].

The above-mentioned seminal studies have obtained interesting and important results toward understanding adversarial examples. Although some theoretical robustness bounds have been proposed, most are difficult to use or to optimize in practice. Moreover, few theories have been rigorously developed for measuring the model robustness against adversarial examples mathematically and systematically.

In contrast to this existing work, in this paper we establish a novel theoretical framework that addresses the robustness issue mathematically and rigorously.

In more detail, inspired by the stability of the loss function in the small neighborhood of natural examples, we propose to exploit an energy function to describe the stability, and we prove that reducing such energy guarantees the robustness against adversarial examples. We also prove that many traditional adversarial training methods (including both supervised and semi-supervised adversarial training) are essentially equivalent to minimizing a lower bound of the proposed energy function; such lower-bound minimization, however, might lead to insufficient robustness within the neighborhood around the input sample. Furthermore, we design a more rational and practical method with energy regularization, which proves to achieve better robustness than previous methods.

In addition, we develop the robustness analysis on both traditional supervised and semi-supervised adversarial learning, which is, to our best knowledge, rarely seen in the literature.

Finally, to verify the performance of the proposed method, we have conducted a series of experiments for both supervised and semi-supervised tasks. Experimental results show that our proposed adversarial framework achieves the best performance among the compared adversarial methods on MNIST, CIFAR-10, and SVHN. Importantly, the proposed methods demonstrate much better robustness against adversarial examples than all the other comparison methods.

## 2 Notations and Backgrounds

We denote by $\mathcal{D}$ a training set containing $N$ samples, namely $\mathcal{D}=\{(x_0^{(i)},y^{(i)})\}_{i=1}^{N}$, where $x_0\in\mathbb{R}^{I}$ indicates an input sample (or natural sample) and $y\in\mathbb{R}^{O}$ denotes its corresponding label (with $I$ and $O$ representing the dimension of the input space and the output space, respectively). We also define $B(x_0,\epsilon)$ as an $I$-dimensional small ball around each $x_0$ with the radius $\epsilon$.

Given a specific type of DNN, we let $f(x,\theta)$ denote its mapping function (implicitly or explicitly), $L(x,y,\theta)$ be the loss function used by the DNN, and $\theta$ be the set of parameters to be optimized for the DNN. For simplicity, $L(x,y,\theta)$ may be written in short as $L(x,\theta)$ or even $L(x)$, and similarly for other notations. Moreover, we assume in this paper that the last layer of the DNN is a softmax layer, but it should be noted that other functions can also be used.

### 2.1 Adversarial Training with the l2 Norm Constraint

The adversarial training method with the $l_2$ norm constraint (AT) is a supervised method, which attempts to find the worst perturbed example in the neighborhood of a natural example to mislead the classification. Such perturbed examples are then added to the training set for training a better DNN. The objective of this adversarial training method can be written as:

$$\min_\theta\max_{x\in B(x_0,\epsilon)}L(x,y,\theta), \tag{1}$$

where $x$ indicates the perturbed version of a natural example $x_0$ (with the label $y$) within a small neighborhood (which is defined by $\epsilon$).
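As a rough illustration of the inner maximization in (1), the sketch below linearizes the loss around $x_0$, in which case the maximizer over the $l_2$ ball is $x_0+\epsilon\,g/\|g\|_2$ with $g=\nabla_x L(x_0,y,\theta)$. The linear softmax classifier, the helper names (`loss_and_grad`, `l2_adversarial_example`), and all numeric values are assumptions made for this example only and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x, y, W, b):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. x."""
    p = softmax(W @ x + b)
    loss = -np.log(p[y])
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)    # d(loss)/dx for softmax + cross entropy
    return loss, grad_x

def l2_adversarial_example(x0, y, W, b, eps):
    """First-order approximation of the worst point in the l2 ball B(x0, eps)."""
    _, g = loss_and_grad(x0, y, W, b)
    norm = np.linalg.norm(g)
    if norm == 0.0:                # flat loss: no preferred direction
        return x0.copy()
    return x0 + eps * g / norm

# Hypothetical toy model and input (illustrative values only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = np.zeros(3)
x0 = rng.normal(size=5)
y = 1
eps = 0.5

x_adv = l2_adversarial_example(x0, y, W, b, eps)
loss_nat, _ = loss_and_grad(x0, y, W, b)
loss_adv, _ = loss_and_grad(x_adv, y, W, b)
print(loss_nat, loss_adv)
```

For this convex toy loss the perturbed point always has a higher loss than $x_0$ whenever the gradient is non-zero; for a deep network the first-order step is only an approximation of the true maximizer.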

### 2.2 Virtual Adversarial Training

Different from the adversarial training method with the $l_2$ norm constraint (AT), Virtual Adversarial Training (VAT) does not require the label information. It seeks the worst perturbed example near a natural example, i.e., the one that changes the output of the DNN the most. The corresponding objective is defined as:

$$\min_\theta\max_{x\in B(x_0,\epsilon)}D\big(f(x_0,\theta),f(x,\theta)\big), \tag{2}$$

where $D(\cdot,\cdot)$ denotes the divergence between the outputs $f(x_0,\theta)$ and $f(x,\theta)$. For simplicity, $D$ is defined in this paper as the Euclidean distance between the outputs, i.e., $D=\|f(x_0,\theta)-f(x,\theta)\|_2$, but it is straightforward to extend the Euclidean distance to other divergence measures.
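To make the label-free inner maximization in (2) concrete, the following sketch linearizes $f$ around $x_0$, so that the output change is approximately $\|Jr\|_2$ with $J$ the Jacobian of $f$ at $x_0$; the maximizer over $\|r\|_2\le\epsilon$ is then $\epsilon$ times the top right-singular vector of $J$. This SVD-based shortcut, the linear softmax model, and the function names are assumptions for illustration; it is not the approximation scheme used in [19].

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian_wrt_input(x, W, b):
    """Jacobian of f(x) = softmax(W x + b) w.r.t. x, shape (classes, dim)."""
    p = softmax(W @ x + b)
    return (np.diag(p) - np.outer(p, p)) @ W

def virtual_adversarial_perturbation(x0, W, b, eps):
    """Perturbation of norm eps maximizing the linearized change ||J r||_2."""
    J = softmax_jacobian_wrt_input(x0, W, b)
    # Rows of vt are right-singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(J)
    return eps * vt[0]

# Hypothetical toy model and input (illustrative values only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = np.zeros(3)
x0 = rng.normal(size=5)
eps = 0.5

r_vat = virtual_adversarial_perturbation(x0, W, b, eps)
J = softmax_jacobian_wrt_input(x0, W, b)
linearized_change = np.linalg.norm(J @ r_vat)
true_change = np.linalg.norm(softmax(W @ (x0 + r_vat) + b) - softmax(W @ x0 + b))
print(linearized_change, true_change)
```

No label enters the computation, which is what allows this kind of perturbation to be used on unlabeled data.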

## 3 Main Methodology

We first present a reasonable assumption.

Assumption 1: Given a sensible loss function $L$ for a specific learning task, we assume that there exists a small threshold $\sigma_{th}$ such that those inputs satisfying

$$L(x,y,\theta)<\sigma_{th}$$

can be correctly classified.

Note that such an assumption generally holds for common loss functions such as the cross entropy and the square error. A detailed analysis of the assumption can be found in the appendix of the supplementary materials. With the above notations and assumptions, the adversarial training problem can be described as follows.

Problem Formulation: Assume that a natural example $x_0$ satisfies $L(x_0,y,\theta)<\sigma_1$ where $\sigma_1<\sigma_{th}$, i.e., the example can be classified correctly with a high confidence. An adversarial example $x_{adv}$ is then defined as the worst perturbed sample such that $L(x_{adv},y,\theta)\ge\sigma_{th}$, i.e., $x_{adv}$ will be mis-classified. The objective of adversarial training for a specific $x_0$ can be reformulated as

$$\min_\theta\max_{x\in B(x_0,\epsilon)}\big|L(x,y,\theta)-L(x_0,y,\theta)\big|.$$

### 3.1 Robustness Against Adversarial Examples

Before we interpret our robustness analysis against adversarial examples, we set out Lemma 3.1 as follows:

###### Lemma 3.1

Given a natural example $x_0$ satisfying $L(x_0,y_0,\theta)<\sigma_1$ (where $\sigma_1<\sigma_{th}$), if, for all $x\in B(x_0,\epsilon)$ and with $\sigma_2=\sigma_{th}-\sigma_1$, it holds that

$$\big|L(x,y_0,\theta)-L(x_0,y_0,\theta)\big|\le\sigma_2, \tag{3}$$

then all the data points in $B(x_0,\epsilon)$ can be classified correctly.

The proof is provided in the appendix of the supplementary materials.

Lemma 3.1 states that, if the loss of data points near $x_0$ is sufficiently close to that of $x_0$, then all these data points can be classified correctly, since the natural example $x_0$ has already been classified correctly with a high confidence. In other words, whether the points around $x_0$ can be classified correctly is determined by the stability of the loss function in the region $B(x_0,\epsilon)$. In this case we also say that the model is robust in the region $B(x_0,\epsilon)$, and thus there exist no adversarial examples in this region, since all the data in this region are classified into the same category.

Remark. Previous research studies adversarial examples mainly by considering, in a less rigorous way, whether the adversarial perturbation can push the natural example across the classification boundary. Moreover, it is difficult to investigate the shape of the classification boundary when data lie in a high-dimensional space. In comparison, we consider in this paper the robustness against adversarial examples from the perspective of the stability of the loss function, which leads to the rigorous analysis below.

In order to describe the stability of $L$ in the neighborhood of $x_0$, we propose the following novel energy function as given in Definition 3.1.

###### Definition 3.1

Let $L(x,\theta)$ be a differentiable and integrable function and $B$ be a small neighborhood of $x_0$ with radius $\epsilon$. Then, the energy of $L$ in this neighborhood is defined as:

$$E_B(\theta)=\int_B\|\nabla_x L(x,\theta)\|_2\,dV, \tag{4}$$

where $V$ denotes the volume.

This energy is a measure of the stability of a function, i.e., of how much the function changes within the small region $B$. More precisely, the integral of the norm of the gradient of the loss with respect to the input measures how the loss function changes at each point in $B$. Intuitively, if the variation at each point is not large, the loss function does not change dramatically in the neighborhood of each point, which means that the loss function is more stable. Importantly, we will prove that minimizing such an energy function can guarantee the robustness against adversarial examples in $B$. Before that, we provide Lemma 3.2.
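The energy in (4) can be estimated numerically when the gradient is available. The sketch below is a plain Monte Carlo estimate (uniform samples in the ball, mean gradient norm times the ball volume); the estimator, the function names, and the quadratic test loss $L(x)=\tfrac12\|x\|_2^2$ are assumptions chosen for illustration because the exact value is known in that case, namely $\mathrm{vol}(B)\,\epsilon\,d/(d+1)$ in $d$ dimensions.

```python
import math
import numpy as np

def sample_ball(x0, eps, n, rng):
    """Draw n points uniformly from the l2 ball B(x0, eps)."""
    d = x0.shape[0]
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = eps * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return x0 + radii * directions

def ball_volume(d, eps):
    """Volume of a d-dimensional l2 ball of radius eps."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * eps ** d

def estimate_energy(grad_fn, x0, eps, n=20000, seed=0):
    """Monte Carlo estimate of E_B = integral over B of ||grad_x L(x)||_2 dV."""
    rng = np.random.default_rng(seed)
    xs = sample_ball(x0, eps, n, rng)
    grad_norms = np.array([np.linalg.norm(grad_fn(x)) for x in xs])
    return ball_volume(x0.shape[0], eps) * grad_norms.mean()

# Toy loss L(x) = 0.5 * ||x||^2, whose gradient is x itself.
quadratic_grad = lambda x: x
energy = estimate_energy(quadratic_grad, np.zeros(2), eps=1.0)
exact = math.pi * 2.0 / 3.0        # vol(B) * eps * d / (d + 1) for d = 2, eps = 1
print(energy, exact)
```

The cost of such an estimate grows quickly with the input dimension, which is one way to see why the integral is impractical to minimize directly for image-sized inputs.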

###### Lemma 3.2

Let $B$ be a small neighborhood of a natural example $x_0$ with label $y_0$, and $x$ be an arbitrary point within $B$. If the value of the energy $E_B(\theta)$ decreases, then the number of examples classified correctly in $B$ increases. When the energy $E_B(\theta)$ goes to zero, the number of adversarial examples in $B$ goes to zero.

Proof of Lemma 3.2 is provided in the appendix of the supplementary materials.

Lemma 3.2 shows that decreasing the energy function increases the number of points $x$ in $B$ such that $|L(x)-L(x_0)|\le\sigma_2$. In other words, more points in $B$ are correctly classified according to Lemma 3.1. As the energy function becomes small enough, the adversarial examples gradually vanish. Therefore, this novel energy function can be used to measure the robustness against adversarial examples in $B$.

### 3.2 New Insight to Traditional Adversarial Methods

In this subsection, using our proposed stability measure, we provide interpretations of, as well as new insight into, the traditional adversarial training methods, including both the supervised and the semi-supervised versions (adversarial training with the $l_2$ norm constraint, and VAT). Moreover, we prove that these traditional adversarial training methods merely minimize a lower bound of the proposed energy along the radius, which leads to insufficient robustness against adversarial examples. We analyze this disadvantage and then propose a more rational and practical optimization method.

First, we set out Definition 3.2 to describe the notion of the energy function along the radius.

###### Definition 3.2

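Let the spherical coordinate of $x$ be $(r,\phi)$, where $r\in[0,\epsilon]$ is the radius and $\phi$ denotes the angular coordinates. Then, the energy along the radius at angle $\phi$ is defined by

$$E_\epsilon(\phi)=\int_0^\epsilon\|\nabla_x L(r,\phi)\|_2\,dr. \tag{5}$$

The energy $E_\epsilon(\phi)$ is defined in the spherical coordinate system and describes the total variation of the function along the radius at angle $\phi$. We present Lemma 3.3 for a further explanation.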

###### Lemma 3.3

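Let $B$ be a small neighborhood of a natural example $x_0$ with label $y_0$, and $x_{adv}$ be such that $L(x_{adv})\ge L(x)$ for all $x\in B$. Suppose that $x_{adv}$ is on the boundary of $B$ and the spherical coordinate of point $x_{adv}$ can be expressed by $(\epsilon,\phi_1)$. Then, we have

$$\int_0^\epsilon\|\nabla_x L(r,\phi_1)\|_2\,dr\ge\big|L(x_{adv})-L(x_0)\big|. \tag{6}$$

(Proof is provided in the appendix of the supplementary materials).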
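A bound of this kind follows from the line integral of the gradient along a ray: the integral of $\|\nabla_x L\|_2$ from the center to a boundary point is at least the absolute change in loss between the two endpoints. The sketch below checks this numerically; the linear softmax model, the random ray direction, the trapezoidal rule, and all names are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def loss(x, y, W, b):
    """Cross-entropy loss of a linear softmax classifier."""
    return -np.log(softmax(W @ x + b)[y])

def grad_x(x, y, W, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def radial_energy(x0, direction, eps, y, W, b, steps=2001):
    """Trapezoidal estimate of the integral of ||grad_x L|| along one ray."""
    u = direction / np.linalg.norm(direction)
    radii = np.linspace(0.0, eps, steps)
    norms = np.array([np.linalg.norm(grad_x(x0 + r * u, y, W, b)) for r in radii])
    return float(np.sum(0.5 * (norms[1:] + norms[:-1]) * np.diff(radii)))

# Hypothetical toy model, input, and ray direction (illustrative values only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = np.zeros(3)
x0 = rng.normal(size=5)
y = 1
eps = 0.5
direction = rng.normal(size=5)

energy_along_ray = radial_energy(x0, direction, eps, y, W, b)
x_boundary = x0 + eps * direction / np.linalg.norm(direction)
loss_gap = abs(loss(x_boundary, y, W, b) - loss(x0, y, W, b))
print(energy_along_ray, loss_gap)
```

The two quantities coincide only when the gradient stays aligned with the ray and the loss is monotone along it, which mirrors the equality conditions discussed in the remark on AT.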

It is easy to reformulate the adversarial training method with the $l_2$ norm constraint (AT) as follows [16]:

$$\min_\theta\max_{x\in B(x_0,\epsilon)}\big|L(x,y,\theta)-L(x_0,y,\theta)\big|=\min_\theta\big|L(x_{adv},y,\theta)-L(x_0,y,\theta)\big|. \tag{7}$$

Remark. If we compare Eq. (7) with inequality (6), it can be noted that this traditional adversarial training method with the $l_2$ norm constraint (AT) is equivalent to minimizing a lower bound of the energy $E_\epsilon(\phi_1)$. Only when the adversarial example is on the boundary of $B$ and the function $L$ is monotonically increasing w.r.t. $r$ is the traditional adversarial training method equivalent to minimizing the energy itself. Unfortunately, there is no guarantee that the adversarial example is always on the boundary or that the function always increases monotonically as $r$ increases.

Similarly, we can also prove that VAT is equivalent to minimizing a lower bound of the energy along the radius at a certain angle $\phi_2$. Before the proof, we present Lemma 3.4.

###### Lemma 3.4

Let $B$ be a small neighborhood of a natural example $x_0$ with label $y_0$, and $x_{va}$ be such that $\|f(x_{va})-f(x_0)\|_2\ge\|f(x)-f(x_0)\|_2$ for all $x\in B$. Suppose that $x_{va}$ is on the boundary of $B$ and the spherical coordinate of point $x_{va}$ can be expressed by $(\epsilon,\phi_2)$. Then, we have

$$\int_0^\epsilon\|\nabla_x f(r,\phi_2)\|_2\,dr\ge\|f(x_{va})-f(x_0)\|_2. \tag{8}$$

(Proof is provided in the appendix of the supplementary materials).

On the other hand, we can readily reformulate VAT as [19]:

$$\min_\theta\max_{x\in B(x_0,\epsilon)}\|f(x,\theta)-f(x_0,\theta)\|_2=\min_\theta\|f(x_{va},\theta)-f(x_0,\theta)\|_2. \tag{9}$$

Remark. If we compare Eq. (9) with inequality (8), it can be noted that VAT is equivalent to minimizing a lower bound of the energy along the radius at angle $\phi_2$, i.e., the left-hand side of (8).

Similarly, only when the adversarial example is on the boundary of $B$ and the function increases monotonically along the radius is VAT exactly equivalent to minimizing this energy.

In summary, although the previous adversarial training methods can achieve good performance on both natural examples and adversarial examples, they have some inherent drawbacks. Note that the traditional adversarial training methods minimize a lower bound of the energy function at each iteration, which can reduce the value of the energy function to some degree. However, this cannot ensure that the value of the energy becomes small enough. Intuitively, the traditional methods consider only the worst point in $B$, which cannot restrict the total variation in the neighborhood of $x_0$. This may ignore some other potential risk points. To illustrate this, we show in Fig. 1 that, even when the risk of the worst point has been reduced, the traditional methods cannot guarantee the robustness against adversarial examples.

In Fig. 1, the loss for the worst point $x_{adv}$ in $B$ has been reduced to a small enough value (below the risk threshold $\sigma_{th}$). However, there also exists a small region (circled in blue) in which $L(x)\ge\sigma_{th}$, which means that the examples in this region are not guaranteed to be classified correctly according to the assumption mentioned before. According to Fig. 1, for the traditional methods, minimizing $L(x_{adv})$ cannot guarantee a reduction of the overall variation over all the nearby points as defined by the energy function in Eq. (4). Therefore, other risk points might exist. To tackle this problem, we additionally restrict the energy function (the total variation) within the neighborhood of the natural example $x_0$. Note that although the traditional methods minimize the loss function at different worst points in different iterations, a finite number of points might be insufficient to guarantee the robustness. In our method, we additionally use the information of the first derivative to improve the stability (this information might not solve the whole problem, but it alleviates it).

However, it is difficult to minimize the function of Eq. (4), since it requires the computation of integration. Instead, we propose in the following a more practical algorithm that is able to achieve the same objective. We first present Theorem 3.5.

###### Theorem 3.5

Let $B$ be a small neighborhood of a natural example $x_0$ with label $y_0$ and $x$ be an arbitrary point in $B$. If the value of the energy $E_B(\theta)$ decreases, the value of the energy $E_\epsilon(\phi)$ decreases almost everywhere in $B$. When the energy $E_B(\theta)$ goes to zero, the energy $E_\epsilon(\phi)$ goes to zero almost everywhere in $B$.

(Proof is provided in the appendix of the supplementary materials.)

In the above, for a measurable set $E$, we say that a property holds almost everywhere on $E$, or that it holds for almost all $x\in E$, provided there is a subset $E_0$ of $E$ for which $m(E_0)=0$ (where $m(\cdot)$ denotes the measure) and the property holds for all $x\in E\setminus E_0$.

Theorem 3.5 states that decreasing the total energy $E_B(\theta)$ leads to a reduction of the energy along the radius $E_\epsilon(\phi)$. Therefore, we can reduce all of the $E_\epsilon(\phi)$ by penalizing the total energy $E_B(\theta)$. This naturally leads to a new method that enriches the traditional adversarial methods with energy regularization. We describe this new method as follows.

$$\min_\theta\max_{x}L(x,\theta)+\lambda\min_\theta\int_B\|\nabla_x L(x,y,\theta)\|_2\,dV. \tag{10}$$

In (10), the second term is the energy regularization, which is defined by the energy $E_B(\theta)$, and $\lambda$ is a positive trade-off hyper-parameter. Again, the energy function describes the overall variation of $L$ in the neighborhood, which clearly distinguishes it from the traditional methods, which act merely on a single point. According to Lemma 3.2 and Theorem 3.5, reducing the energy $E_B(\theta)$ decreases the number of adversarial examples as well as the energy $E_\epsilon(\phi)$. Intuitively, penalizing the total variation of the function helps avoid the dramatic fluctuation of the loss function illustrated in Fig. 1.

However, the optimization problem (10) is again impractical since the second term requires integration. For convenient optimization, we replace the second term with $\max_{x\in B}\|\nabla_x L(x,y,\theta)\|_2$, which is closely related to the upper bound of $E_B(\theta)$ and $E_\epsilon(\phi)$ in the sense that minimizing this term is equivalent to minimizing the upper bound of $E_B(\theta)$ and $E_\epsilon(\phi)$:

$$\min_\theta\max_{x\in B}L(x,\theta)+\lambda\max_{x\in B}\|\nabla_x L(x,y,\theta)\|_2. \tag{11}$$

This new regularization can also be extended to VAT:

$$\min_\theta\max_{x\in B}D\big(f(x_i,\theta),f(x_i+\epsilon_{vat},\theta)\big)+\lambda\max_{x\in B}\|\nabla_x f(x,\theta)\|_2. \tag{12}$$

Again, our proposed method is equivalent to decreasing both the lower bound and the upper bound of the energy $E_\epsilon(\phi)$. The relevant proof and details can be found in the appendix of the supplementary materials.

### 3.3 Practical Optimization Algorithm

We design practical optimization algorithms for our proposed new framework, which essentially extends the previous methods with the novel energy regularization. For convenience, we start with problem (11); problem (12) can be solved in a similar way. In problem (11), the first term can be handled with the traditional adversarial training method. The second term can be divided into two problems: an inner maximization problem and an outer minimization problem. However, for the inner problem, since $L$ is a non-convex function, it is difficult to evaluate the maximizer of the function $\|\nabla_x L\|_2$. Following many similar approaches [16], we relax it to a convex problem using the first-order Taylor expansion:

$$\max_{x\in B(x_0,\epsilon)}\|\nabla_x L(x_0,y_0,\theta)\|_2+\big(\nabla_x\|\nabla_x L(x_0,y_0,\theta)\|_2\big)^{T}(x-x_0). \tag{13}$$

Problem (13) is now a convex problem w.r.t. $x$ and can be solved by the Lagrangian multiplier method. The maximizer can be calculated as:

$$x_{max}=\epsilon\,\overline{\nabla_x\|\nabla_x L(x_0,y_0,\theta)\|_2}+x_0, \tag{14}$$

where $\overline{\,\cdot\,}$ represents the normalization operator. The gradient of $\|\nabla_x L\|_2$ w.r.t. $x$ is difficult to compute. We can use the finite difference method to approximate it:

$$x_{max}=\epsilon\,\overline{\nabla_x\|\nabla_x L(x_0,y_0,\theta)\|_2}+x_0=\epsilon\,\overline{H(x_0)\nabla_x L(x_0)}+x_0\approx\epsilon\,\overline{\frac{\nabla_x L\big(x_0+\xi\nabla_x L(x_0)\big)-\nabla_x L(x_0)}{\xi}}+x_0, \tag{15}$$

where $H(x_0)$ is the Hessian of $L$ at $x_0$ and $\xi$ is a small value, fixed in our experiments. More details of the derivation of (15) are provided in the appendix. After computing the maximizer of the inner problem, the outer problem can be solved by the gradient descent method. The whole algorithm of Adversarial Training with Energy Regularization, called ATER for short, is shown in Algorithm 1. We also develop VAT with Energy Regularization (VATER for short), which is given in the appendix of the supplementary materials.
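The finite-difference step for the inner maximizer can be sketched in a few lines. The sketch below approximates the Hessian-vector product $H(x_0)\nabla_x L(x_0)$ with two gradient evaluations and then takes a normalized step of length $\epsilon$. The function names, the step size `xi=1e-3`, and the quadratic test loss are assumptions for illustration; they are not the settings or the implementation used in the paper.

```python
import numpy as np

def normalize(v):
    """Scale v to unit l2 norm (returned unchanged if it is the zero vector)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def energy_maximizer(grad_fn, x0, eps, xi=1e-3):
    """Approximate maximizer of ||grad_x L|| in B(x0, eps) via finite differences.

    grad_fn(x) must return the gradient of the loss w.r.t. the input x.
    """
    g0 = grad_fn(x0)
    # (grad(x0 + xi * g0) - grad(x0)) / xi approximates H(x0) @ g0.
    hessian_vector = (grad_fn(x0 + xi * g0) - g0) / xi
    return x0 + eps * normalize(hessian_vector)

# Toy loss L(x) = 0.5 * x^T A x, so grad L = A x and H = A (illustrative only).
A = np.diag([1.0, 3.0])
grad_fn = lambda x: A @ x
x0 = np.array([1.0, 1.0])
eps = 0.1

x_max = energy_maximizer(grad_fn, x0, eps)
print(np.linalg.norm(grad_fn(x0)), np.linalg.norm(grad_fn(x_max)))
```

Only first-order gradient calls are needed, so the Hessian is never formed explicitly. In a training loop, the gradient norm at `x_max` would then be added to the adversarial loss with weight $\lambda$ before the parameter update.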

## 4 Experiments

We evaluate the effectiveness of our proposed methods, Adversarial Training with Energy Regularization (ATER) and Virtual Adversarial Training with Energy Regularization (VATER), on MNIST and CIFAR-10 for supervised tasks, and we additionally use SVHN for semi-supervised tasks, following the experimental setting used in the most closely related work [19]. Specifically, for MNIST, we apply a conventional fully connected neural network with the same setting as [19]; the hidden-layer sizes follow that setting, and the activation function is ReLU with batch normalization for each layer. We have conducted experiments with the proposed ATER and VATER in comparison with the baseline model and various traditional adversarial training methods. For the hyper-parameter $\lambda$ in (11) and (12), we simply set it to 0.1 empirically. For CIFAR-10 and SVHN, we utilize the structure called 'conv-large' following [19]. We use the labeled training samples of CIFAR-10 to train this model and evaluate on the test samples for the supervised task. For these two datasets, the hyper-parameter $\lambda$ is chosen empirically from a small set of candidate values.

Tables 1 and 2 show the results of different models on MNIST and CIFAR-10, respectively, for the supervised task. Without tuning the parameter $\lambda$, our proposed methods already achieve the best performance on MNIST among all the compared algorithms. On CIFAR-10, our proposed adversarial method VATER performs the best among all the adversarial methods, but worse than DenseNet and ResNet, which exploit a much deeper architecture. In the future, we can also apply our method to such deeper structures to investigate whether further improvements can be obtained.

To further examine whether the proposed methods are more robust to adversarial examples, we generate 10,000 adversarial examples in the test sets of MNIST and CIFAR-10 according to FGSM and $l_2$-norm attacks [16], respectively. We increase the level of adversarial noise gradually from 0 to 8 on MNIST (with a step size of 1) and from 0 to 13 on CIFAR-10 (with a step size of 1.6). We then test the performance of the various methods on these adversarial examples. The performance is plotted in Fig. 2. As clearly observed, the proposed VATER and ATER show much better robustness against the two types of adversarial examples. In particular, when the adversarial noise is small, all the adversarial training methods show similar results but perform much better than the CNN (which uses no adversarial training); when the adversarial noise is heavier, the proposed VATER and ATER demonstrate clearly better performance, verifying their robustness.

In summary, in all the experiments, our proposed methods achieve performance superior to all the traditional adversarial training methods. We attribute this success to the additional penalty on the upper bound of the energy function, which reduces the overall variation in the neighborhood $B$. In comparison, the traditional adversarial methods only reduce the loss at the adversarial examples. Moreover, the experiments indicate that the first derivative of the loss w.r.t. the input provides additional useful information.

Tables 3 and 4 show the results of different models on SVHN and CIFAR-10 for semi-supervised learning. For the semi-supervised task, we apply only our proposed method VATER, since it does not require label information. As observed, our model again attains the best performance on both datasets.

Taking the example of CIFAR-10, we evaluate the convergence performance of our proposed models as well as their robustness against adversarial examples generated by AT. Both the evaluations are based on the supervised setting.

We plot the convergence curves for our proposed methods in Figure 3. Particularly, the figure shows the error rate for the different methods on both the training set and test set. The red lines indicate the convergence curves for our proposed methods (VATER and ATER) on the test set. The blue ones present the curves of the traditional adversarial training methods (VAT and AT). Although the training curves are similar, our proposed methods attain better convergence than their traditional counterparts on the test set.

## 5 Conclusion

In this paper, we investigate the model robustness against adversarial examples from the perspective of function stability. We develop a novel energy function to describe the stability in the small neighborhood of natural examples and prove that reducing such energy can guarantee the robustness for adversarial examples. We also offer new insights to traditional adversarial methods (AT and VAT) showing that such traditional methods merely decrease certain lower bounds of the energy function. We analyze the disadvantage of the traditional methods and propose accordingly more rational methods to minimize both the upper bound and lower bound of the energy function. We implement our methods on both supervised and semi-supervised tasks and achieve superior performance on benchmark datasets.

## References

• [1] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier (2017) Parseval networks: improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847. Cited by: §1.
• [2] A. Fawzi, O. Fawzi, and P. Frossard (2018) Analysis of classifiers robustness to adversarial perturbations. Machine Learning 107 (3), pp. 481–508. Cited by: §1.
• [3] A. Fawzi, S. Moosavi-Dezfooli, and P. Frossard (2016) Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pp. 1632–1640. Cited by: §1.
• [4] C. Finlay, A. Oberman, and B. Abbasi (2018) Improved robustness to adversarial examples using lipschitz regularization of the loss. arXiv preprint arXiv:1810.00953. Cited by: §1.
• [5] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, Table 1.
• [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §1.
• [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: Table 2.
• [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks.. In CVPR, Vol. 1, pp. 3. Cited by: Table 2.
• [9] J. Kos, I. Fischer, and D. Song (2018) Adversarial examples for generative models. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 36–42. Cited by: §1.
• [10] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1.
• [11] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: Table 3, Table 4.
• [12] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §1.
• [13] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570. Cited by: Table 2.
• [14] M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: Table 2.
• [15] D. C. Liu and J. Nocedal (1989) On the limited memory bfgs method for large scale optimization. Mathematical programming 45 (1-3), pp. 503–528. Cited by: §1.
• [16] C. Lyu, K. Huang, and H. Liang (2015) A unified gradient regularization family for adversarial examples. In Data Mining (ICDM), 2015 IEEE International Conference on, pp. 301–309. Cited by: §1, §1, §3.2, §3.3, §4.
• [17] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613. Cited by: §1.
• [18] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther (2016) Auxiliary deep generative models. arXiv preprint arXiv:1602.05473. Cited by: Table 3.
• [19] T. Miyato, S. Maeda, S. Ishii, and M. Koyama (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §3.2, §4.
• [20] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2017) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976. Cited by: Table 1, Table 2, Table 3, Table 4.
• [21] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii (2015) Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677. Cited by: §1.
• [22] N. Papernot, P. McDaniel, and I. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277. Cited by: §1.
• [23] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015) Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554. Cited by: Table 1, Table 4.
• [24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242. Cited by: Table 3, Table 4.
• [25] U. Shaham, Y. Yamada, and S. Negahban (2015) Understanding adversarial training: increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432. Cited by: §1.
• [26] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: Table 2.
• [27] J. T. Springenberg (2015) Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390. Cited by: Table 4.
• [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: Table 1.
• [29] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: Table 2.
• [30] W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §1.
• [31] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun (2015) Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351. Cited by: Table 3.

## 7 Analysis for Assumption 1 in Section 3.1

In this section, we prove that for common loss functions (e.g., cross entropy and square error) we can find a small constant $\sigma_{th}$ such that if the loss $L$ satisfies $L < \sigma_{th}$, then the input $x$ can be classified correctly. Throughout the paper, we assume the last layer is a softmax layer, so the outputs satisfy $\sum_i y_i = 1$ and $0 < y_i < 1$.

### 7.1 Cross Entropy Loss

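The cross entropy loss is defined as $L_{ce} = -\sum_{i=1}^{O} l_i \log y_i$, where $l \in \{0, 1\}^O$ ($O$ is the output dimension) is a one-hot label vector for input $x$. We assume $l_a = 1$ and all other entries are zero, which means $x$ belongs to class $a$. Then we can reformulate the cross entropy loss as $L_{ce} = -\log y_a$. If $L_{ce} < \sigma_{th}$ for any $\sigma_{th} < -\log 0.5$, $x$ can be classified correctly.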

Proof:

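$$L_{ce} < \sigma_{th} < -\log 0.5 \tag{16}$$

then we have

$$-\log y_a < \sigma_{th} < -\log 0.5 \;\Rightarrow\; y_a > e^{-\sigma_{th}} > 0.5 \tag{17}$$

Since $y_a > 0.5$ and $\sum_i y_i = 1$, every other output is below $0.5$, so $y_a$ is the largest output and $x$ can be classified correctly.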
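The implication in Eq. (17) can be checked numerically. The following is a minimal sketch, not part of the original analysis: it draws random logits (the range, class count, and seed are arbitrary assumptions), applies a softmax, and confirms that whenever the cross entropy is below $-\log 0.5$ the true class is also the arg-max. The helper names are hypothetical.

```python
import math
import random

# Largest admissible threshold from Eq. (16): sigma_th < -log(0.5).
SIGMA_TH_LIMIT = -math.log(0.5)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross entropy against a one-hot label reduces to -log(y_a)."""
    return -math.log(probs[label])


def check_cross_entropy_bound(trials=10000, num_classes=5, seed=0):
    """Return (holds, checked): whether every low-loss sample was classified
    correctly, and how many samples had a loss below the limit."""
    rng = random.Random(seed)
    checked = 0
    for _ in range(trials):
        # Assumed toy setup: logits drawn uniformly, label drawn at random.
        logits = [rng.uniform(-4.0, 4.0) for _ in range(num_classes)]
        probs = softmax(logits)
        label = rng.randrange(num_classes)
        if cross_entropy(probs, label) < SIGMA_TH_LIMIT:
            checked += 1
            predicted = max(range(num_classes), key=probs.__getitem__)
            if predicted != label:
                return False, checked
    return True, checked


holds, checked = check_cross_entropy_bound()
print(holds, checked)
```

Samples whose loss exceeds the limit are skipped, since the statement makes no claim about them.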

### 7.2 Square Error Loss

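The square error loss can be formulated as $L_{se} = \sum_i (y_i - l_i)^2$. Similarly, for the square error loss $L_{se}$, if $L_{se} < \sigma_{th}$ for any $\sigma_{th} < 0.25$, $x$ can be classified correctly.
Proof:

$$\sum_i (y_i - l_i)^2 < \sigma_{th} \tag{18}$$

Since $l_a = 1$ and all other entries are zero, we have

$$\sum_{i \neq a} y_i^2 + (y_a - 1)^2 < \sigma_{th} \;\Rightarrow\; (y_a - 1)^2 < \sigma_{th} < 0.25 \;\Rightarrow\; y_a > 1 - \sqrt{\sigma_{th}} > 0.5 \tag{19}$$

Since $y_a > 0.5$ and $\sum_i y_i = 1$, $x$ can be classified correctly.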
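The same kind of numerical check applies to the square error case. This sketch assumes outputs sampled uniformly from the probability simplex and an illustrative threshold of $\sigma_{th} = 0.2 < 0.25$. It verifies that a loss below the threshold forces $y_a > 1 - \sqrt{\sigma_{th}} > 0.5$. All names and parameter values are hypothetical.

```python
import math
import random


def square_error(probs, label):
    """Sum of squared differences between the outputs and a one-hot label."""
    return sum((p - (1.0 if i == label else 0.0)) ** 2
               for i, p in enumerate(probs))


def random_simplex_point(rng, k):
    """Sample a point on the probability simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def check_square_error_bound(sigma_th=0.2, trials=20000, num_classes=4, seed=1):
    """Return (holds, checked) for the claim
    L_se < sigma_th < 0.25  =>  y_a > 1 - sqrt(sigma_th) > 0.5."""
    rng = random.Random(seed)
    lower = 1.0 - math.sqrt(sigma_th)
    checked = 0
    for _ in range(trials):
        probs = random_simplex_point(rng, num_classes)
        label = rng.randrange(num_classes)
        if square_error(probs, label) < sigma_th:
            checked += 1
            if not (probs[label] > lower > 0.5):
                return False, checked
    return True, checked


holds, checked = check_square_error_bound()
print(holds, checked)
```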

## 8 Proof for Lemmas

In this section, we prove the lemmas in the main paper. For convenience, we first state an auxiliary theorem and an auxiliary lemma.

###### Theorem 8.1

Let $f$ and $g$ be integrable and continuous functions on $B$. Then there exists a constant $c$ such that

$$\int_B f(x)\, g(x)\, dV = c \int_B g(x)\, dV \tag{20}$$

where is the volume and .

Proof:

Provided $\int_B g(x)\, dV \neq 0$, we can directly find the constant $c$:

$$\frac{\int_B f(x)\, g(x)\, dV}{\int_B g(x)\, dV} = c \tag{21}$$
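As an illustration of Eq. (21), the constant can be computed numerically in one dimension. The sketch below assumes $B = [0, 3]$, $f(x) = \sin x + 2$, and a positive weight $g(x) = e^{-x}$. These choices, and the midpoint rule used for integration, are assumptions made only for this example. When $g$ is nonnegative, $c$ is a weighted average of $f$ and therefore lies between the minimum and maximum of $f$ on $B$.

```python
import math


def midpoint_integral(func, a, b, n=10000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


def weighted_mean_constant(f, g, a, b, n=10000):
    """Constant c with  int f*g dV = c * int g dV,  as in Eq. (21)."""
    numerator = midpoint_integral(lambda x: f(x) * g(x), a, b, n)
    denominator = midpoint_integral(g, a, b, n)
    return numerator / denominator


def f(x):
    # Assumed toy integrand; its values on [0, 3] lie in [2, 3].
    return math.sin(x) + 2.0


def g(x):
    # Assumed positive weight function.
    return math.exp(-x)


c = weighted_mean_constant(f, g, 0.0, 3.0)
print(c)
```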

Lemma 1. Let us define $g(\theta)$ by $g(\theta) = \int_B f(x, \theta)\, dV$, where $f(x, \theta)$ is differentiable and integrable. Then, if $g(\theta)$ decreases and goes to zero, $f(x, \theta)$ decreases and goes to zero almost everywhere in $B$.
(Def. For a measurable set $B$, we say that a property holds almost everywhere on $B$, or that it holds for almost all $x \in B$, provided there is a subset $N_0$ of $B$ for which $m(N_0) = 0$ ($m(N_0)$ denotes the measure of $N_0$) and the property holds for all $x \in B - N_0$.)
Proof:
Let $N_0 = \{x \in B : f(x, \theta) \ge \delta\}$, where $\delta$ is a sufficiently small positive value.

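$$g(\theta) = \int_B f(x, \theta)\, dV = \int_{N_0} f(x, \theta)\, dV + \int_{B - N_0} f(x, \theta)\, dV \ge \int_{N_0} f(x, \theta)\, dV \tag{22}$$

The inequality uses the nonnegativity of $f$, which holds for the gradient norms to which the lemma is applied. When $g(\theta)$ decreases and goes to zero, $\int_{N_0} f(x, \theta)\, dV$ decreases and goes to zero. Then the measure of $N_0$ ($m(N_0)$) decreases and goes to zero, which means the measure of $B - N_0$ ($m(B - N_0)$) increases and goes to $m(B)$. Therefore, $f(x, \theta)$ decreases and goes to zero almost everywhere in $B$.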

### 8.1 Proof for Lemma 3.1

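Lemma 3.1. Given a natural example $x_0$ satisfying $L(x_0, y_0, \theta) < \sigma_1$ (where $\sigma_1 < \sigma_{th}$), if for all $x \in B$ and some $\sigma_2 < \sigma_{th} - \sigma_1$ it holds that

$$|L(x, y_0, \theta) - L(x_0, y_0, \theta)| \le \sigma_2, \tag{23}$$

then all the data points in $B$ can be classified correctly.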
Proof:
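We proved in the previous section that there exists a $\sigma_{th}$ such that if $L(x, y_0, \theta) < \sigma_{th}$, $x$ can be classified correctly. Additionally, we assume that natural examples are classified correctly with high confidence, i.e., with a loss below some $\sigma_1 < \sigma_{th}$. Then, for the natural example $x_0$,

$$L(x_0, y_0, \theta) < \sigma_1 < \sigma_{th} \tag{24}$$

which means $x_0$ can be classified correctly.

If, for a point $x \in B$,

$$|L(x, y_0, \theta) - L(x_0, y_0, \theta)| \le \sigma_2 < \sigma_{th} - \sigma_1 \;\Rightarrow\; L(x, y_0, \theta) \le L(x_0, y_0, \theta) + \sigma_2 < \sigma_1 + (\sigma_{th} - \sigma_1) = \sigma_{th} \tag{25}$$

Therefore, if $|L(x, y_0, \theta) - L(x_0, y_0, \theta)| \le \sigma_2$ with $\sigma_2 < \sigma_{th} - \sigma_1$, $x$ can be classified correctly.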

### 8.2 Proof for Lemma 3.3 and Lemma 3.4

Here, we just prove Lemma 3.4 since Lemma 3.3 is a special case of Lemma 3.4.
Lemma 3.4. Let be a small neighborhood of natural example with label and such that for all . Suppose that is on the boundary of and the spherical coordinate of point can be expressed by where . Then, we have

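$$\int_0^{\epsilon} \|\nabla_x f(r, \phi_2)\|_2\, dr \ge \|f(x_{va}) - f(x_0)\|_2 \tag{26}$$

Proof:

$$\int_0^{\epsilon} \|\nabla_x f(r, \phi_2)\|_2\, dr \ge \int_0^{\epsilon_1} \|\nabla_x f(r, \phi_2)\|_2\, dr \ge \int_0^{\epsilon_1} \|\nabla_x f(r, \phi_2) \cdot \vec{d}\|_2\, dr \ge \left\| \int_0^{\epsilon_1} \nabla_x f(r, \phi_2) \cdot \vec{d}\, dr \right\|_2 = \|f(x_{va}) - f(x_0)\|_2 \tag{27}$$

where $\vec{d}$ is the unit vector pointing from $x_0$ to $x_{va}$. In the same way, we can prove Lemma 3.3.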
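The chain of inequalities in Eq. (27) can be illustrated numerically. The sketch below uses an assumed scalar toy function $f(x, y) = \sin(3x) + x y^2$ on $\mathbb{R}^2$ with its analytic gradient, integrates the gradient norm along rays starting at an assumed point $x_0$ with the midpoint rule, and compares the result with the change of $f$ between the endpoints. The function, the start point, and the radius are illustrative choices, not values from the paper.

```python
import math


def toy_f(x, y):
    """Assumed toy function standing in for f."""
    return math.sin(3.0 * x) + x * y * y


def toy_grad(x, y):
    """Analytic gradient of toy_f."""
    return (3.0 * math.cos(3.0 * x) + y * y, 2.0 * x * y)


def gradient_norm_along_ray(x0, phi, eps1, steps=2000):
    """Midpoint-rule estimate of the integral of ||grad f||_2 over r in
    [0, eps1] along the ray leaving x0 in direction phi."""
    dx, dy = math.cos(phi), math.sin(phi)
    h = eps1 / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        gx, gy = toy_grad(x0[0] + r * dx, x0[1] + r * dy)
        total += math.hypot(gx, gy)
    return h * total


def check_lemma(x0=(0.2, -0.4), eps1=0.5, num_directions=12):
    """Return (path integral, |f(end) - f(x0)|) for evenly spaced directions."""
    results = []
    for k in range(num_directions):
        phi = 2.0 * math.pi * k / num_directions
        end = (x0[0] + eps1 * math.cos(phi), x0[1] + eps1 * math.sin(phi))
        lhs = gradient_norm_along_ray(x0, phi, eps1)
        rhs = abs(toy_f(*end) - toy_f(*x0))
        results.append((lhs, rhs))
    return results


for lhs, rhs in check_lemma():
    print(f"{lhs:.4f} >= {rhs:.4f}: {lhs >= rhs}")
```

The gap between the two sides grows when the gradient is far from parallel to the ray or when the directional derivative changes sign along it.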

### 8.3 Proof for Lemma 3.2

Lemma 3.2. Let $B$ be a small neighborhood of natural example $x_0$ with label $y_0$, and let $x$ be an arbitrary point in $B$. If the value of the energy $E_B$ decreases, the number of examples classified correctly in $B$ increases. When the energy $E_B$ goes to zero, the number of adversarial examples in $B$ goes to zero.
Proof:
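We reformulate the energy in spherical coordinates:

$$E_B = \int_B \|\nabla_x L(x)\|_2\, dV = \int_{S^{I-1}} \int_0^{\epsilon} \|\nabla_x L(r, \phi)\|_2\, r^{I-1}\, dr\, d\phi \tag{28}$$

According to Theorem 8.1, there exists a constant $r_1$ such that

$$\int_B \|\nabla_x L(r, \phi)\|_2\, dV = r_1 \int_{S^{I-1}} \int_0^{\epsilon} \|\nabla_x L(r, \phi)\|_2\, dr\, d\phi \tag{29}$$

According to Lemma 3.4, we have

$$r_1 \int_{S^{I-1}} \int_0^{\epsilon} \|\nabla_x L(r, \phi)\|_2\, dr\, d\phi \ge r_1 \int_{S^{I-1}} |L(\epsilon_1, \phi) - L(x_0)|\, d\phi \tag{30}$$

where $\epsilon_1 \le \epsilon$ and $(\epsilon_1, \phi)$ is the spherical coordinate of an arbitrary point $x$. Since $E_B$ is an upper bound of $r_1 \int_{S^{I-1}} |L(\epsilon_1, \phi) - L(x_0)|\, d\phi$, when $E_B$ decreases and goes to zero, this integral decreases and goes to zero. According to Lemma 1, for almost all $\phi$, $|L(\epsilon_1, \phi) - L(x_0)|$ decreases and goes to zero, which means the number of adversarial examples in $B$ decreases and goes to zero (according to Lemma 3.1).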

### 8.4 Proof for Theorem 3.5

Theorem 3.5. Let $B$ be a small neighborhood of natural example $x_0$ with label $y_0$, and let $x$ be an arbitrary point in $B$. If the value of the energy $E_B$ decreases, the value of the energy $E_\epsilon$ decreases almost everywhere in $B$. When the energy $E_B$ goes to zero, the energy $E_\epsilon$ goes to zero almost everywhere in $B$.
Proof:
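Similar to Lemma 3.2, there exists a constant $r_1$ such that

$$\int_B \|\nabla_x L(r, \phi)\|_2\, dV = r_1 \int_{S^{I-1}} \int_0^{\epsilon} \|\nabla_x L(r, \phi)\|_2\, dr\, d\phi = r_1 \int_{S^{I-1}} E_\epsilon(\phi)\, d\phi \tag{31}$$

According to Lemma 1, when $E_B$ decreases and goes to zero, $E_\epsilon(\phi)$ decreases and goes to zero for almost all $\phi$.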

## 9 Details of Practical Algorithm

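In this paper, we minimize both the upper bound and the lower bound of the energy $E_B$. The algorithm that minimizes the lower bound is the same as traditional adversarial training. Here, we only give the relevant proof and algorithm for the upper bounds of $E_B$ and $E_\epsilon$:

The upper bound for $E_B$:

$$E_B = \int_B \|\nabla_x L(x)\|_2\, dV \le \int_B \max_{x \in B} \|\nabla_x L(x)\|_2\, dV = \max_{x \in B} \|\nabla_x L(x)\|_2 \cdot \mathrm{Vol}(B) \tag{32}$$

The upper bound for $E_\epsilon$:

$$E_\epsilon = \int_0^{\epsilon} \|\nabla_x L(r, \phi)\|_2\, dr \le \int_0^{\epsilon} \max_{x \in B} \|\nabla_x L(x)\|_2\, dr = \max_{x \in B} \|\nabla_x L(x)\|_2 \cdot \epsilon \tag{33}$$

Since $\mathrm{Vol}(B)$ and $\epsilon$ are constants, reducing $\max_{x \in B} \|\nabla_x L(x)\|_2$ is equivalent to decreasing the upper bounds of $E_B$ and $E_\epsilon$.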
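The two bounds in Eqs. (32) and (33) can be illustrated on a toy problem. The sketch below assumes a two-dimensional input, a toy loss $L(x, y) = \sin(3x) + x y^2$, and a disc-shaped neighborhood $B$. It estimates $E_B$ by Monte Carlo sampling and $E_\epsilon$ by the midpoint rule. The maximum gradient norm is taken over the sampled points only, so it may slightly underestimate the true maximum over $B$. All names and values are illustrative assumptions.

```python
import math
import random


def grad_norm(x, y):
    """L2 norm of the gradient of the toy loss L(x, y) = sin(3x) + x * y**2."""
    gx = 3.0 * math.cos(3.0 * x) + y * y
    gy = 2.0 * x * y
    return math.hypot(gx, gy)


def energy_and_bound_on_disc(center, eps, samples=20000, seed=0):
    """Monte Carlo estimate of E_B and of max ||grad L|| * Vol(B), Eq. (32)."""
    rng = random.Random(seed)
    area = math.pi * eps * eps  # Vol(B) for a 2-D ball of radius eps
    norms = []
    for _ in range(samples):
        r = eps * math.sqrt(rng.random())  # sqrt gives uniform density in B
        phi = 2.0 * math.pi * rng.random()
        norms.append(grad_norm(center[0] + r * math.cos(phi),
                               center[1] + r * math.sin(phi)))
    return area * sum(norms) / samples, area * max(norms)


def energy_and_bound_on_ray(center, eps, phi, steps=2000):
    """Midpoint estimate of E_eps and of max ||grad L|| * eps, Eq. (33)."""
    h = eps / steps
    norms = [grad_norm(center[0] + (i + 0.5) * h * math.cos(phi),
                       center[1] + (i + 0.5) * h * math.sin(phi))
             for i in range(steps)]
    return h * sum(norms), eps * max(norms)


e_b, bound_b = energy_and_bound_on_disc((0.2, -0.4), 0.5)
e_eps, bound_eps = energy_and_bound_on_ray((0.2, -0.4), 0.5, 0.7)
print(e_b <= bound_b, e_eps <= bound_eps)
```

Because both bounds scale with the same maximum gradient norm, lowering that single quantity tightens them together, which is the point made in the text.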

The problem (13) in the main paper can be reduced to:

$$\max_{\|r\|_p = \epsilon} \nabla_x F^T r \tag{34}$$

where, and