1 Introduction
Recent studies have demonstrated that neural network models, despite achieving human-level performance on many important tasks, are not robust to adversarial examples: a small, human-imperceptible input perturbation can easily change the predicted label [45, 24]. This phenomenon raises security concerns when deploying neural network models in real-world systems [22]. In the past few years, many defense algorithms have been developed [25, 44, 33, 30, 40] to improve network robustness, but most of them remain vulnerable to stronger attacks, as reported in [3]. Among current defense methods, adversarial training [34] has become one of the most successful ways to train robust neural networks.
To obtain a robust network, we need to consider the “robust loss” instead of a regular loss. The robust loss is defined as the maximal loss within a small norm ball around each sample, and minimizing the robust loss under the empirical distribution leads to a min-max optimization problem. Adversarial training [34] is a way to minimize the robust loss. At each iteration, it (approximately) solves the inner maximization problem with an attack algorithm to obtain an adversarial sample, and then runs a (stochastic) gradient-descent update to minimize the loss on that adversarial sample. Although adversarial training is widely used in practice and substantially improves the robustness of neural networks in many applications, its convergence properties are still unknown. It is unclear whether a network with small robust error exists and whether adversarial training is able to converge to a solution with minimal adversarial training loss.
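To make the procedure concrete, the following is a minimal sketch of one epoch of PGD-based adversarial training, written in PyTorch in our own notation (it is not the authors' code); `model`, `loader`, `opt`, and the hyperparameters `eps`, `alpha`, `steps` are hypothetical names, and an l-infinity ball is used purely for illustration.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps):
    # Approximately solve the inner maximization over the eps-ball (l_inf here).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step on the input
            delta.clamp_(-eps, eps)              # project back into the perturbation set
        delta.grad.zero_()
    return (x + delta).detach()

def adversarial_training_epoch(model, loader, opt, eps=0.03, alpha=0.01, steps=10):
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps, alpha, steps)  # inner maximization (attack)
        opt.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()         # outer minimization step
        opt.step()
```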
In this paper, we study the convergence of adversarial training algorithms and try to answer the above questions for overparameterized neural networks. We consider the setting of a multi-layer fully-connected network of a given width, with a smooth activation function, trained on a finite set of samples. The smoothness assumption holds for many activation functions, including the softplus and the sigmoid. Our contributions are summarized below.

For a general attack/perturbation algorithm, we show that gradient descent converges to a network whose robust surrogate loss with respect to that attack is close to the optimal robust loss, provided the width is sufficiently large (Theorem 4.1).

We then consider the expressivity of neural networks with respect to the robust loss (or robust interpolation). We show that when the width is sufficiently large, the neural network can achieve near-optimal robust loss; see Theorems 5.1 and 5.2 for precise statements. Combining these results, we show that adversarial training finds networks of small robust training loss (Corollaries 5.1 and 5.2).
Conversely, the complexity of robust learning is higher. We show that the VC-dimension of any model class that can robustly interpolate arbitrary samples is lower bounded by a quantity that grows with both the number of samples and the input dimension. In contrast, there are neural net architectures that can interpolate the same number of samples with far fewer parameters, and for this class of architectures the VC-dimension is correspondingly much smaller. Thus robust learning provably requires larger complexity and capacity.
2 Related Work
Attack and Defense
Adversarial examples are inputs that are slightly perturbed from a natural sample and yet incorrectly classified by the model. An adversarial example can be generated by maximizing the loss function within a small norm ball around a natural sample. Thus, generating adversarial examples can be viewed as solving a constrained optimization problem, which can be (approximately) solved by a projected gradient descent (PGD) method [34]. Other techniques have also been proposed in the literature, including L-BFGS [45], FGSM [24], iterative FGSM [28], and the C&W attack [14]; they differ from each other in the distance measure, loss function, or optimization algorithm. There are also studies on adversarial attacks with limited information about the target model. For instance, [15, 26, 10, 32] considered the black-box setting where the model is hidden but the attacker can make queries and observe the corresponding outputs.

Improving the robustness of neural networks against adversarial attacks, also known as defense, has been recognized as an important and unsolved problem in machine learning. Various kinds of defense methods have been proposed [25, 44, 33, 30, 40], but many of them rely on obfuscated gradients and do not really improve robustness under stronger attacks [3]. As an exception, [3] reported that the adversarial training method developed in [34] is the only evaluated defense that remains effective even under carefully designed attacks.

Adversarial Training
Adversarial training is one of the earliest defense ideas [24]. The main idea is to add adversarial examples to the training set to improve robustness. However, earlier work usually adds adversarial examples only once, or only a few times, during the training phase. Recently, [34] showed that adversarial training can be viewed as solving a min-max optimization problem in which the training algorithm aims to minimize the robust loss, defined as the maximal loss within a certain ball around each training sample. Based on this formulation, a clean adversarial training procedure based on the PGD attack was developed and achieved state-of-the-art results even under strong attacks. This also motivates recent research on the theoretical understanding of robust error [11, 41]. Adversarial training also suffers from slow training, since it runs several attack steps within each update, and several recent works try to resolve this issue [42, 53]. From the theoretical perspective, [46] quantitatively evaluates the convergence quality of the adversarial examples found in the inner maximization in order to ensure robustness. [51] derives generalization upper and lower bounds for robust generalization. [31] improves robust generalization via data augmentation with a GAN. [23] reduces the min-max optimization problem to an online learning setting and uses the resulting analysis to study the convergence of GANs. In this paper, our analysis of adversarial training is quite general and is not restricted to any specific attack algorithm.
Certified Defense and Robustness Verification
For each sample, the robust loss is defined as the maximal loss within a norm ball around it. Due to non-convexity, attack algorithms usually fail to find the exact maximum, so the robust error computed by an attack algorithm cannot give us a formal guarantee of robustness. As a consequence, networks trained by standard adversarial training algorithms [34], although robust under strong attacks, do not have a certified guarantee of robustness.
Neural network verification methods, in contrast to attacks, try to find upper bounds on the robust error and thereby provide certified robustness measurements. Several algorithms have been proposed recently. [48] proposed to solve the dual of a linear relaxation problem to obtain a certified bound. [47, 54] provide similar algorithms based on primal relaxations. [43] proposed another approach based on abstract interpretation. More recently, [39] provided a unified view, showing that most existing verification methods are based on a convex relaxation of the ReLU network.
Equipped with these verification methods for computing upper bounds on the robust error, one can then apply adversarial training to get a network with certified robustness. This was first proposed in [48]: at each iteration, instead of finding a lower bound on the robust error by an attack, one finds an upper bound by verification and trains the model to minimize this upper bound. Several certified adversarial training algorithms along this line have been proposed recently [49, 21]. Our analysis in Section 4 can incorporate certified adversarial training.
Global Convergence of Gradient Descent
Recent work on the overparametrization of neural networks proves that when the width greatly exceeds the sample size, gradient descent converges to a global minimizer from random initialization [29, 19, 20, 1, 55]. The key idea in this earlier literature is to show that the minimum singular value of the Jacobian with respect to the parameters is bounded away from zero, and thus, with high probability, there is a global minimum near every random initialization. However, for the robust loss the inner maximization cannot be evaluated exactly and the Jacobian is not necessarily full rank. Similarly, for the robust surrogate loss, the heuristic attack algorithm may not even be continuous, so the same arguments cannot be applied.
3 Preliminaries
3.1 Notation
Throughout, we use the standard Gaussian distribution for random initialization, the Euclidean norm for vectors, and the Frobenius and operator norms for matrices. We use the standard Euclidean inner product between two vectors or matrices, and standard Big-O and Big-Omega notation that suppress multiplicative constants.

3.2 Neural Network
In this paper we focus on the training of multi-layer fully-connected neural networks. Formally, we consider a neural network of the following form.
Let the input be given; the fully-connected neural network is defined by a first-layer weight matrix, a weight matrix at each subsequent hidden layer, an output layer, and an activation function. (Footnote: We assume intermediate layers are square matrices of the same width for simplicity. It is not difficult to generalize our analysis to rectangular weight matrices.) The trainable parameters are the collection of these weight matrices. We define the prediction function recursively, taking the input itself as the base case of the recursion:
(1) 
where the scaling factor is used to normalize the input at initialization.
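For concreteness, one common way to write the recursion of Equation (1) in this line of work is the following (a sketch in our own notation, with width $m$, depth $H$, weight matrices $W_1,\dots,W_H$, output vector $a$, activation $\sigma$, and a scaling constant $c_\sigma$ playing the role of the normalization factor mentioned above):
\[
x^{(0)} = x, \qquad
x^{(h)} = \sigma\!\Big(\tfrac{c_\sigma}{\sqrt{m}}\, W_h\, x^{(h-1)}\Big),\; h = 1,\dots,H, \qquad
f(x; W, a) = a^\top x^{(H)}.
\]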
We make a technical smoothness assumption on the activation function, which holds for many activation functions, although not for the ReLU.
Assumption 3.1 (Smoothness of activation function).
The activation function is Lipschitz and smooth; that is, there exists a constant bounding both the Lipschitz constant of the activation and the Lipschitz constant of its derivative, uniformly over all inputs.
Assumption 3.2 (Smoothness of loss).
The loss is Lipschitz, smooth, and convex in its first (prediction) argument, and satisfies an additional regularity condition.
We use the following initialization scheme: every entry of the weight matrices follows an i.i.d. standard Gaussian distribution, and the output layer follows an i.i.d. uniform distribution. Similar to [20], we consider the case where we only train the weight matrices and fix the output layer. For a given training set, the (non-robust) training loss is the average of the per-sample losses, as written in the display below. The key architectural parameter is the width. As we shall see, the robust training loss we obtain scales inversely with the width, and so for overparameterized networks we are able to minimize the robust training loss.
3.3 Perturbation and the Surrogate Loss Function
The goal of adversarial training is to make the model robust in a neighborhood of each datum. We first introduce the perturbation set function, which determines the perturbation set at each point.
Definition 3.1 (Perturbation Set).
The perturbation set function maps each point of the input space to a subset of the input space (an element of its power set). At each data point, it gives the perturbation set on which we would like to guarantee robustness; commonly used examples are the ℓ2 ball and the ℓ∞ ball of a fixed radius around the point. Given a dataset, we say that the perturbation set function is compatible with the dataset if, whenever a training point lies in the perturbation set of another, the two points share the same label. In the rest of the paper, we always assume that the perturbation set function is compatible with the given data. Our framework allows for arbitrary perturbation sets compatible with the empirical dataset.
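As a simple illustration (our own sketch, not notation from the paper), these two common perturbation sets can be expressed as membership tests around a clean input:

```python
import numpy as np

def in_l2_ball(x_pert, x, eps):
    # S(x) = { x' : ||x' - x||_2 <= eps }
    return np.linalg.norm(x_pert - x, ord=2) <= eps

def in_linf_ball(x_pert, x, eps):
    # S(x) = { x' : ||x' - x||_inf <= eps }
    return np.max(np.abs(x_pert - x)) <= eps
```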
Given a perturbation set, we are ready to define the perturbation function, which maps a data point to another point inside its perturbation set. We note that the perturbation function can be quite general, including the identity map, an adversarial attack mapping, or a random sampling mapping. Formally, we give the following definition.
Definition 3.2 (Perturbation Function).
With the definition of the perturbation function, we can now define a large family of loss functions on the training set. We will show that this definition covers both the standard loss used in empirical risk minimization and the robust loss used in adversarial training.
Definition 3.3 (Surrogate Loss Function).
It is easy to see that the standard training loss is a special case of the surrogate loss function, obtained by taking the perturbation function to be the identity. The goal of adversarial training is to minimize the robust loss, i.e., the surrogate loss with a worst-case perturbation function that maximizes the loss within each perturbation set; we refer to this as the robust loss below.
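Written out in our own (illustrative) notation, with $\mathcal{A}$ a perturbation function, $\mathcal{S}$ the perturbation set function, $f(\cdot;W)$ the network, and $\ell$ the per-sample loss, the surrogate loss and the robust loss take the form
\[
L_{\mathcal{A}}(W) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\mathcal{A}(x_i, y_i); W),\, y_i\big),
\qquad
L_{\mathrm{rob}}(W) \;=\; \frac{1}{n}\sum_{i=1}^{n} \max_{x' \in \mathcal{S}(x_i)} \ell\big(f(x'; W),\, y_i\big),
\]
so the standard empirical loss corresponds to $\mathcal{A}(x_i, y_i) = x_i$ and the robust loss to a maximizing choice of $\mathcal{A}$.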
4 Convergence Results of Adversarial Training
We consider optimizing the surrogate loss with the perturbation function defined in Definition 3.2. In this section, we prove that, after a certain number of projected gradient descent steps constrained to a convex set of parameters around the initialization, the surrogate loss is provably close to the best min-max (robust) loss attainable within this set. The convex constraint set is defined by
(2)
where the constraint is stated in terms of the rows of the weight matrices, and the quantities involved depend polynomially on the smoothness parameters of Assumptions 3.1 and 3.2.
Denote the parameters at the t-th iteration by a superscript t, and use analogous notation for the associated quantities. At each step of adversarial training, projected gradient descent takes an update of the form displayed below, in which the gradient of the surrogate loss is taken with respect to its first (parameter) argument and the projection is the Euclidean projection onto a convex set. We take this convex set to be the one defined in Equation (2).
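In generic notation (ours, not necessarily the paper's symbols), writing $W^{(t)}$ for the trained parameters at step $t$, $\eta$ for the step size, $L_{\mathcal{A}}$ for the surrogate loss, and $\Pi_{\mathcal{B}}$ for the Euclidean projection onto the set of Equation (2), the update can be sketched as
\[
W^{(t+1)} \;=\; \Pi_{\mathcal{B}}\!\Big(W^{(t)} - \eta\, \nabla_{W} L_{\mathcal{A}}\big(W^{(t)}\big)\Big).
\]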
We show that, for sufficiently wide neural networks, gradient descent restricted to this set in parameter space can find a point whose surrogate loss is no more than the minimum robust loss over the set, up to a small error. In Section 5, we show that the set is sufficiently large to contain a classifier of low robust loss. We assume the perturbation set of each input is contained in a Euclidean ball of bounded radius. Specifically, we have the following theorem.
Theorem 4.1 (Convergence of Projected Gradient Descent for Optimizing Surrogate Loss).
Remark.
Recall that the surrogate loss is the loss suffered under the given perturbation function. For example, if the adversary uses the projected gradient ascent algorithm, then the theorem guarantees that projected gradient ascent cannot successfully attack the learned network.
Remark.
For two-layer networks, the update on the trained layer does not require the projection step, as the constraint is implicitly enforced by gradient descent.
4.1 Proof Sketch
Our proof utilizes the same high-level intuition as [29, 19, 55, 12, 13]: near the initialization, the network is close to linear in its parameters. However, unlike those earlier works, the surrogate loss is neither smooth nor semi-smooth, so there is no Polyak gradient-domination phenomenon to provide a global geometric contraction of gradient descent. In fact, due to the generality of the perturbation function allowed, the robust surrogate loss need not be differentiable or even continuous in the parameters, so the standard analysis cannot be applied. Our analysis relies on two key observations. First, the network is still smooth with respect to its parameter argument (footnote: it is not jointly smooth in the parameters and the input, which is part of the subtlety of the analysis) and is close to linear in the parameters near initialization, which we show by directly bounding the Hessian with respect to the parameters. Second, the perturbation function can be treated as an adversary providing a worst-case loss function, as in online learning. However, online learning typically assumes the sequence of losses is convex, which is not the case here. By carefully decoupling the non-convexity contributed by the parameter argument from the worst-case contribution of the perturbation function, we can prove that gradient descent succeeds in minimizing the surrogate loss.
5 Adversarial Training Finds Robust Classifier
Motivated by the optimization result in Theorem 4.1, we now show that there is indeed a robust classifier within the constraint set. To show this, we utilize the connection between neural networks and their induced Reproducing Kernel Hilbert Space (RKHS), viewing a neural network trained near initialization as a random feature scheme [16, 17, 27, 2]. Since we only need to show the existence of a network that robustly fits the training data within the constraint set, and neural networks are at least as expressive as their induced kernels, we can prove this via the RKHS connection. The strategy is to first show the existence of a robust classifier in the RKHS, and then show that a sufficiently wide network can approximate the kernel via a random feature analysis. The results of this section will, in general, have exponential dependence on the dimension, due to the known issue that smooth functions in high dimensions can have exponentially large RKHS norm [4], and therefore offer only qualitative guidance on the existence of robust classifiers.
Since deep networks contain two-layer networks as sub-networks, and this section is only concerned with expressivity, we focus on the local expressivity of two-layer networks. We write the standard two-layer network in a suggestive form in which the hidden-layer weights are drawn from the initialization distribution, the output-layer weights are fixed at their initial values, and we denote the initialization parameters accordingly. In this section, we consider data and perturbation sets defined on the surface of the unit ball; that is, every data point and every perturbed point has unit Euclidean norm.
For convenience, we first introduce the Neural Tangent Kernel (NTK) [27] with respect to our neural network formulation in Equation (1).
Definition 5.1 (NTK [27]).
The NTK associated with the activation function and the initialization distribution is defined in the standard way [27].
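For reference, the standard form of this definition (a sketch in our notation; see [27]) is the expected inner product of the parameter gradients at a random initialization $\theta_0$:
\[
K(x, x') \;=\; \mathbb{E}_{\theta_0}\Big[\big\langle \nabla_\theta f(x; \theta_0),\; \nabla_\theta f(x'; \theta_0)\big\rangle\Big].
\]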
Every kernel induces a reproducing kernel Hilbert space (RKHS); we refer the reader to [37] for an introduction to the theory of RKHS.
In Section 5.1, we first give a general existence result for two-layer networks with activation functions that induce universal kernels: there exists a classifier with arbitrarily small robust loss. Then, specifically for a two-layer quadratic-ReLU network, we show that adversarial training can find a robust classifier, and we provide the explicit dependence of the required width on the target accuracy.
5.1 Existence of Robust Classifier near Initialization
We formally make the following assumption, which we later verify when the activation induces a universal kernel.
Assumption 5.1 (Existence of Robust Classifier in NTK).
There exists a function in the RKHS induced by the NTK, with bounded RKHS norm, that takes the correct label value at every point of each training example's perturbation set, where the perturbation set is as defined in Definition 3.1.
Assumption 5.1 can be verified for a large class of activation functions by showing that their induced kernels are universal, as done in [35]. In addition, we will show that this assumption is mild in our example of the quadratic-ReLU network.
Under this assumption, by approximating the infinite-width limit with a finite sum of random features, we obtain the following theorem:
Theorem 5.1 (Robust Classifier near Initialization).
This theorem shows that we can indeed find a classifier of low robust loss within a neighborhood of the initialization. Combining Theorems 4.1 and 5.1, we obtain the following corollary.
Corollary 5.1 (Adversarial Training Finds a Network of Small Robust Train Loss).
Given a dataset on the unit sphere equipped with a compatible perturbation set function and an associated perturbation function that also takes values on the unit sphere, suppose Assumptions 3.1, 3.2, and 5.1 are satisfied. Then there exists a radius, depending only on the dataset, the perturbation, and the target accuracy and corresponding to the RKHS norm of a robust classifier, such that for any multi-layer fully-connected network of sufficiently large width, if we run projected gradient descent with a suitable step size for sufficiently many steps, then with high probability,
(4) 
Therefore, adversarial training is guaranteed to find a robust classifier under a given attack algorithm when the network width is sufficiently large.
5.2 Example: Twolayer QuadraticReLU Network
We consider, as a guiding example, the arc-cosine neural tangent kernel induced by a two-layer network with the quadratic-ReLU activation function. In this section, we quantitatively derive the dependence of the required width on the target accuracy in Theorem 5.1 for this two-layer network, and we verify that the induced kernel is universal. The network has the expression
(5) 
where the activation is the quadratic ReLU, the output-layer weights are initialized uniformly at random and kept fixed, the hidden-layer weights are initialized i.i.d. from a Gaussian distribution, and only the hidden layer is trained. The NTK has the following explicit expression:
(6) 
We denote by the RKHS norm of a function its norm in the RKHS induced by this kernel. The following lemma gives a sufficient condition for a function to belong to this RKHS.
Lemma 5.1 (RKHS contains smooth functions, Proposition 2 in [4], Corollary 6 in [8]).
Let the target function be an even function whose derivatives up to a sufficiently high order exist and are bounded. Then the function belongs to the RKHS, with RKHS norm bounded by a constant that depends only on the dimension and on the derivative bounds.
We then make a mild assumption on the dataset. (Footnote: Our assumption on the dataset is essentially needed because the NTK here contains only even functions, which cannot distinguish antipodal points. However, this can be enforced via a lifting trick that places the data on the positive hemisphere; on the lifted space, even functions can separate any data points.)
Assumption 5.2 (Nonoverlapping).
The dataset and the perturbation set function satisfy the following:

(i) the perturbation set of each data point is a compact subset of the unit sphere;

(ii) the perturbation sets of any two data points with different labels do not overlap.
Under this assumption, one can easily construct a smooth classifier that takes the correct label value on each perturbation set. By Lemma 5.1, this classifier lies in the RKHS, with RKHS norm bounded by a constant that depends only on the dataset and the perturbation function. We then approximate it using random feature techniques. The following theorem provides the desired result:
Theorem 5.2 (Approximation by finite sum).
For a given Lipschitz function in the RKHS, let the random features be sampled i.i.d. from the initialization distribution, with the number of features at least as large as required by
(7)
where the bound involves a constant that depends only on the dataset and the compatible perturbation. Then, with high probability, there exist parameters close to their initialization such that the corresponding finite-width network satisfies
(8)  
(9) 
We then specialize Theorem 4.1 to our two-layer quadratic-ReLU network. We modify the set defined in Equation (2) in order to match the preceding approximation results:
(10) 
Because of this modification, and because the projection step onto the set is unnecessary for two-layer networks, we provide a full proof in Appendix C.
Theorem 5.3 (Convergence of Gradient Descent for Optimizing the Surrogate Loss for Two-layer Networks).
We can then obtain an overall result for the quadratic-ReLU network, similar to Corollary 5.1 but with explicit dependence on the problem parameters:
Corollary 5.2 (Adversarial Training Finds a Network of Small Robust Train Loss for the Quadratic-ReLU Network).
Given a dataset on the unit sphere equipped with a compatible perturbation set function and an associated perturbation function that also takes values on the unit sphere, suppose Assumptions 3.1, 3.2, and 5.2 are satisfied. Then, for any two-layer quadratic-ReLU network whose width exceeds a threshold depending only on the dataset, the perturbation, and the target accuracy, if we run projected gradient descent with a suitable step size for sufficiently many steps, then with high probability,
(12) 
6 Capacity Requirement of Robustness
In this section, we show that achieving adversarially robust interpolation (formally defined below) requires more capacity than ordinary interpolation. Indeed, empirical evidence has already shown that to reliably withstand strong adversarial attacks, networks require a significantly larger capacity than for correctly classifying benign examples alone [34]. This suggests that using a neural network with larger width is, in some sense, necessary.
Throughout this section, we consider data lying in a fixed domain and use as the perturbation set function the ball of constant radius around each data point.
We begin with the definition of the interpolation class and the robust interpolation class.
Definition 6.1 (Interpolation class).
We say that a class of functions is an interpolation class if the following is satisfied:
Definition 6.2 (Robust interpolation class).
We say that a function class is a robust interpolation class if the following is satisfied:
We use the VC-dimension of a function class to measure its complexity. As shown in [6] (Equation (2) therein), for neural networks there is a tight connection between the number of parameters, the number of layers, and the VC-dimension. Combining this with the result of [52] (Theorem 3), which shows the existence of a 4-layer neural network with relatively few parameters that can interpolate any set of data points (i.e., an interpolation class), we obtain that an interpolation class can be realized by a fixed-depth neural network whose VC-dimension is upper bounded as in
(13)
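For reference, the relation from [6] alluded to above is, up to constants (recalled here from [6] rather than reconstructed from the text), of the form
\[
\mathrm{VCdim} \;=\; O\big(P\,L\,\log P\big)
\]
for a piecewise-linear network with $P$ parameters and $L$ layers.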
For a general hypothesis class, it is evident that an interpolation class must have VC-dimension at least the number of samples it is required to interpolate. For a neural network that is an interpolation class, without further architectural constraints, this lower bound on the VC-dimension is tight up to logarithmic factors, as indicated by Equation (13). However, we show that a robust interpolation class has a much larger VC-dimension lower bound:
Theorem 6.1.
Suppose the function class is a robust interpolation class. Then we have the following lower bound on its VC-dimension:
(14)
where the bound involves the dimension of the input space.
For neural networks, Equation (14) shows that any architecture forming a robust interpolation class must have a VC-dimension that grows with the input dimension as well as with the number of samples. Comparing this with Equation (13), which shows that an ordinary interpolation class can be realized by a network architecture with a much smaller VC-dimension, we conclude that robust interpolation by neural networks needs more capacity, so increasing the width of the neural network is indeed necessary.
7 Discussion
This work provides a theoretical analysis of the empirically successful adversarial training algorithm for training robust neural networks. Our main results indicate that adversarial training finds a network of low robust surrogate loss, even when the inner maximization is computed by a heuristic algorithm such as projected gradient ascent. These results suggest several thought-provoking future directions. Can we ensure the robust surrogate loss is low with respect to a larger family of perturbation functions than that used during training? It is natural to ask whether the dependence on depth can be improved using the tools of [1], and whether the projection step can be removed, as it is empirically unnecessary and is also unnecessary in our analysis of two-layer networks. On the expressiveness side, the current argument uses the fact that a neural network restricted to a local region can approximate its induced RKHS. Although such RKHSs are universal, they do not avoid the curse of dimensionality, so it is natural to ask whether the robust expressivity of neural networks can adapt to structure such as a low latent dimension of the data mechanism [18, 50]. Since this question is largely unanswered even for neural networks in the non-robust setting, we leave it to future work.
References
 [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
 [2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for over-parameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
 [3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.

 [4] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
 [5] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
 [6] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
 [7] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 [8] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173, 2019.
 [9] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. Advanced lectures on machine learning, pages 208–240, 2004.
 [10] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
 [11] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.
 [12] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A Gram-Gauss-Newton method: Learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
 [13] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
 [14] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

 [15] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
 [16] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
 [17] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
 [18] Simon S Du and Jason D Lee. On the power of overparametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
 [19] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
 [20] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
 [21] Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O’Donoghue, Jonathan Uesato, and Pushmeet Kohli. Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265, 2018.
 [22] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physicalworld attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.
 [23] Alon Gonen and Elad Hazan. Learning in nonconvex games with an optimization oracle. arXiv preprint arXiv:1810.07362, 2018.
 [24] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [25] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
 [26] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Blackbox adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2142–2151, 2018.
 [27] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
 [28] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
 [29] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.

 [30] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In European Conference on Computer Vision, pages 381–397. Springer, 2018.
 [31] Xuanqing Liu and Cho-Jui Hsieh. Rob-GAN: Generator, discriminator, and adversarial attacker. In CVPR, 2019.

 [32] Tiange Luo, Tianle Cai, Mengxiao Zhang, Siyu Chen, and Liwei Wang. RANDOM MASK: Towards robust convolutional neural networks, 2019.
 [33] Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Michael E Houle, Grant Schoenebeck, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
 [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 [35] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
 [36] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory, pages 124–138. Springer, 2012.
 [37] Vern I Paulsen and Mrinal Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
 [38] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. IEEE, 2008.
 [39] Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robust verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.
 [40] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. DefenseGAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
 [41] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014–5026, 2018.
 [42] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.
 [43] Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pages 10802–10813, 2018.
 [44] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
 [45] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [46] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019.
 [47] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pages 5273–5282, 2018.
 [48] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.
 [49] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pages 8400–8409, 2018.
 [50] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep relu networks. arXiv preprint arXiv:1802.03620, 2018.
 [51] Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914, 2018.
 [52] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Finite sample expressive power of small-width ReLU networks. arXiv preprint arXiv:1810.07770, 2018.
 [53] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019.
 [54] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pages 4939–4948, 2018.
 [55] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes overparameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Proof of Convergence Results for Deep Nets in Section 4
Proof of Theorem 4.1.
We run projected gradient descent for a fixed number of steps with a fixed step size and then stop.
For projected gradient descent, the iterates remain in the constraint set at every step. Recalling the projected gradient descent update rule, we have
(15) 
where in the first inequality we use the fact that projecting a point onto the constraint set moves it closer to every point in the set, and in particular to any optimal point. Now we need to analyze the gradient of the surrogate loss. To simplify notation, we define
where the derivative of the loss is taken with respect to its first argument.
Note that
where . Since the loss function is Lipschitz, we know , we have
where is a diagonal matrix whose th diagonal entry is .
To bound the right-hand side, note that the definition of the constraint set implies the required norm bound. According to Lemmas B.1, B.3, and G.2 in [19], with probability 0.99 the stated norm bounds hold for all layers. Therefore, under our choice of the width, we have
Also note that by the Lipschitzness of our neural network, it is easy to show , which implies
Recall that, due to the Lipschitzness of our activation function, we have
Thus
which gives the bound on the third term of Equation (15). Now we bound the second term of Equation (15). Note that we have
We use to denote , then
Note that both , again using Lemma B.3 in [19], we have
Recall , it is easy to show
by the definition of , we know,
Thus, according to Lemma G.1 in [19], we have
which implies
Thus, let , we have
Recall that , we have
Choosing the step size and the number of iterations appropriately, and under our choice of the width, we complete the proof.
Appendix B Proof of Gradient Descent Finding Robust Classifier in Section 5
B.1 Proof of Theorem 5.1
As discussed in Section 5.1, we use the idea of random features [38] to approximate the target function on the unit sphere. We consider functions of the form
where the coefficient function may be any function of the feature parameter. We define the RF-norm of such a function in terms of the coefficient function and the probability density function of the initialization distribution, and we define the corresponding class of functions with finite RF-norm.