1 Introduction
Machine learning models trained through standard methods often lack robustness against adversarial examples. These are small perturbations of input examples, designed to “fool” the model into misclassifying the original input biggio2013evasion ; goodfellow2014explaining ; nguyen2015deep ; szegedy2013intriguing . Unfortunately, even small perturbations can cause a large degradation in the test accuracy of popular machine learning models, including deep neural networks szegedy2013intriguing . This lack of robustness has spurred a large body of work on designing attack methods for crafting effective adversarial examples grosse2016adversarial ; hendrik2017universal ; moosavi2016deepfool ; mopuri2017fast ; papernot2016transferability ; tramer2017ensemble and defense mechanisms for training models that are more robust to norm-bounded perturbations tramer2017ensemble ; madry2017towards ; sinha2017certifiable ; zantedeschi2017efficient ; samangouei2018defense ; ilyas2017robust ; shaham2018understanding .
Adversarial training is a family of optimization-based methods for defending against adversarial perturbations. These methods generally operate by computing adversarial examples, and retraining the model on these examples goodfellow2014explaining ; madry2017towards ; shaham2018understanding . This two-step process is repeated iteratively. While adversarial training methods have achieved empirical success madry2017towards ; shaham2018understanding ; ford2019adversarial ; hendrycks2018benchmarking , there is currently little theoretical analysis of their convergence and capacity for guaranteeing robustness.
A parallel line of research has investigated whether standard optimization methods, such as gradient descent (GD) and stochastic gradient descent (SGD), exhibit an implicit bias toward robust and generalizable models gunasekar2018characterizing ; gunasekar2018implicit ; ji2018risk ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit . This line of work shows that GD and SGD both converge to the max-margin classifier of linearly separable data, provided that the loss function is chosen appropriately. Notably, the max-margin classifier is the most robust model against bounded perturbations. Thus, gradient descent is indeed biased towards robustness in some settings. Unfortunately, convergence to this desirable limit can be slow, and in some cases an exponential number of iterations may be needed nacson2019convergence ; nacson2018stochastic ; soudry2018implicit .
Our contributions.
In this work, we merge these two previously separate lines of work, studying whether (and how) various types of adversarial training exhibit a bias towards robust models. We focus on linear classification tasks and study robustness primarily through the lens of margin, the minimum distance between the classification boundary and the (unperturbed) training examples. Our results show that alone, neither adversarial training with generic update rules, nor gradient-based training on the original data set, can find large-margin models quickly. However, by combining the two (interspersing gradient-based update rules with the addition of adversarial examples to the training set), we can train robust models significantly faster.
We show that for logistic regression, gradient-based update rules evaluated on adversarial examples minimize a robust form of the empirical risk function at a rate of , where is the number of iterations of the adversarial training process. This convergence rate mirrors the convergence of GD and SGD on the standard empirical risk, despite the non-smoothness of the robust empirical risk function. We then use this analysis to quantify the number of iterations required to obtain a given margin. We show that while GD may require exponentially many iterations to achieve large margin in non-adversarial training, adversarial training with (stochastic) gradient-based rules requires only polynomially many iterations to achieve large margin. We support our theoretical bounds with experimental results.
1.1 Related Work
Our results are most similar in spirit to ji2018risk , which uses techniques inspired by the Perceptron novikoff1962convergence to analyze the convergence of GD and SGD on logistic regression. It derives a high-probability convergence rate for SGD on separable data, as well as an convergence rate for GD in general. We adapt these techniques for adversarial training. Our work also connects to work on “implicit bias”, which studies the parameter convergence of GD and SGD for logistic regression on separable data gunasekar2018characterizing ; gunasekar2018implicit ; ji2018risk ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit . These works show that the parameters generated by GD and SGD converge to the parameters that correspond to the max-margin classifier at polylogarithmic rates. This line of work, among other tools, employs techniques developed in the context of AdaBoost freund1997decision ; mukherjee2013rate ; telgarsky2012primal . Our analysis is related in particular to margin analyses of boosting rosset2004boosting ; telgarsky2013margins , which show that the path taken by boosting on exponentially tailed losses approximates the max-margin classifier.
There is a large and active body of theoretical work on adversarial robustness. While there are various hardness results in learning robust models bubeck2018adversarial ; gilmer2018adversarial ; schmidt2018adversarially ; tsipras2018robustness ; tsipras2018there , our analysis shows that such results may not apply to practical settings. Our analysis uses a robust optimization lens previously applied to machine learning in work such as sinha2017certifiable ; caramanis201214 ; xu2009robustness . While xu2009robustness shows that the max-margin classifier is the solution to a robust empirical loss function, our work derives explicit convergence rates for SGD and GD on such losses. Finally, we note that adversarial training can be viewed as a data augmentation technique. While the relation between margin and static data augmentation was previously studied in rajput2019does , our work can be viewed as analyzing adaptive data augmentation methods.
2 Overview
Let , and denote the feature space, label space, and model space, respectively, and let be some loss function. Given a dataset , the empirical risk minimization objective is given by
(1) 
Unfortunately, generic empirical risk minimizers may not be robust to small input perturbations. To find models that are resistant to bounded input perturbations, we define the following robust loss functions
(2) 
The function is a measure for the robustness of on . While can be any norm, here we focus on the norm and let denote it throughout our text. Another important measure of robustness is margin. We focus on binary linear classification where , and . The class predicted by on is given by , and the margin of on is
(3) 
We say linearly separates if , . Note linearly separates iff . One can interpret margin as the size of the smallest perturbation needed to fool into misclassifying an element of . Thus, the most robust linear separator is the classifier with the largest margin, referred to as the max-margin classifier.
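As a minimal sketch of the margin notion just described, the snippet below computes the smallest signed distance from a labeled point to the decision boundary of a linear classifier. The names `margin`, `w`, and `data` are introduced here for illustration only, and we assume a homogeneous classifier (no bias term) with labels in {+1, -1} and the Euclidean norm.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def margin(w, data):
    """Smallest signed distance from any labeled point to the hyperplane <w, x> = 0.

    A positive value means w linearly separates the data; the value is then the
    size of the smallest perturbation that moves some point onto the boundary.
    """
    norm_w = math.sqrt(dot(w, w))
    if norm_w == 0.0:
        raise ValueError("margin is undefined for w = 0")
    return min(y * dot(w, x) / norm_w for x, y in data)


# Toy data (hypothetical): two positive points and one negative point.
data = [((2.0, 1.0), 1), ((1.0, -1.0), 1), ((-3.0, 0.5), -1)]
print(margin((1.0, 0.0), data))  # positive: this w separates the data
print(margin((0.0, 1.0), data))  # negative: this w misclassifies a point
```

Under this convention, a larger positive margin corresponds to a classifier that tolerates larger perturbations of the training points.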
Adversarial training.
One popular class of defenses, referred to generally as adversarial training madry2017towards , involves retraining a model on adversarially perturbed data points. The general heuristic follows two steps. At each iteration we construct adversarial examples for some subset of the training data. For each example in this set, an bounded norm adversarial perturbation is constructed as follows:
(4) 
We then update our model using an update rule that operates on the current model and “adversarial examples” of the form . In the most general case, this update rule can also utilize true training data in and adversarial examples from prior iterations.
More formally, let be our initial model. denotes our true training data, and will denote all previously seen adversarial examples. We initialize . At each , we select some subset . For , we let be the solution to (4) when and . We then let
Thus, is the set of adversarial examples computed at iteration , while contains all adversarial examples computed up to (and including) iteration . Finally, we update our model via for some update rule . This generic notation will be useful to analyze a few different algorithms. A full description of adversarial training is given in Algorithm 1.
Once is fixed, there are two primary choices in selecting an adversarial training method: the subset used to find adversarial examples, and the update rule . For example, one popular instance of adversarial training (discussed in detail in madry2017towards ) performs minibatch SGD on the adversarial examples. Specifically, this corresponds to the setting where is randomly selected from , and computes a minibatch SGD update on via
(5) 
In particular, this update does not utilize the full set of all previously seen adversarial examples, but instead updates only using the set of the most recently computed adversarial examples. It also does not use the true training samples . However, other incarnations of adversarial training have used more of and to enhance their accuracy and efficiency shafahi2018universal .
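The generic procedure described above can be sketched as a loop with three pluggable pieces: a subset selector, a perturbation oracle, and an update rule. This is only an illustrative skeleton under our own naming (`adversarial_training`, `select`, `perturb`, `update` are hypothetical and do not come from Algorithm 1 itself). The perturbation oracle is assumed to return a solution of the inner maximization for the current model.

```python
def adversarial_training(w0, data, alpha, num_iters, select, perturb, update):
    """Generic adversarial training loop (illustrative skeleton).

    select(data, t)              -> subset of training pairs attacked at round t
    perturb(w, x, y, alpha)      -> perturbation for (x, y) against the current model w
    update(w, data, seen, fresh) -> next model; may use the true data, all previously
                                    seen adversarial examples, or only the fresh ones
    """
    w = w0
    seen = []  # all adversarial examples computed so far
    for t in range(num_iters):
        fresh = []  # adversarial examples computed at this iteration
        for x, y in select(data, t):
            delta = perturb(w, x, y, alpha)
            fresh.append((tuple(a + d for a, d in zip(x, delta)), y))
        seen.extend(fresh)
        w = update(w, data, seen, fresh)
    return w, seen


# Toy usage with placeholder components (all hypothetical).
data = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]
select_all = lambda data, t: data
shift = lambda w, x, y, alpha: (-alpha * y, 0.0)  # push each point toward the other class
keep = lambda w, data, seen, fresh: w             # update rule that leaves w unchanged
w, seen = adversarial_training((1.0, 0.0), data, 0.25, 3, select_all, shift, keep)
print(len(seen), seen[0])
```

The minibatch SGD instance discussed above corresponds to a `select` that samples a random subset and an `update` that uses only `fresh`.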
Main results.
In the following, we analyze the performance of adversarial training for binary linear classification. In particular, we wish to understand how the choice of , , and the number of iterations impact and . We will make the following assumptions throughout:
Assumption A1.
where is nonnegative and monotonically increasing.
Assumption A2.
is linearly separable with max-margin .
Assumption A3.
The parameter satisfies .
A1 guarantees that is a surrogate of the loss for linear classification, since decreases as increases. A2 allows us to compare the margin obtained by various methods to . We let denote the max-margin classifier. The assumption that is simply for convenience, as we can always rescale separable data to ensure this.
Combined, A2 and A3 guarantee that at every iteration, is linearly separable by with margin at least , as we show in the following lemma.
Lemma 1.
Proof.
By construction, any element in is of the form where and . By assumption on and the CauchySchwarz inequality,
∎
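The Cauchy-Schwarz step in this argument can be checked numerically. The sketch below assumes notation of our own choosing: a unit-norm separator `w_star` with margin `gamma` on the data, and a Euclidean perturbation budget `alpha` smaller than `gamma`. It samples perturbations in the ball of radius `alpha` and confirms that every perturbed point keeps signed distance at least `gamma - alpha` from the boundary defined by `w_star`.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data and a unit-norm separator.
w_star = (1.0, 0.0)
data = [((0.9, 0.3), 1), ((0.7, -0.6), 1), ((-0.8, 0.1), -1)]
gamma = min(y * dot(w_star, x) for x, y in data)  # margin of w_star on the data
alpha = 0.3                                        # perturbation budget, alpha < gamma

rng = random.Random(0)
worst = float("inf")
for x, y in data:
    for _ in range(1000):
        # Sample a perturbation with Euclidean norm at most alpha.
        angle = rng.uniform(0.0, 2.0 * math.pi)
        radius = alpha * rng.random()
        delta = (radius * math.cos(angle), radius * math.sin(angle))
        x_adv = (x[0] + delta[0], x[1] + delta[1])
        # y<w*, x + delta> >= y<w*, x> - ||w*|| ||delta|| >= gamma - alpha
        worst = min(worst, y * dot(w_star, x_adv))

print("guaranteed lower bound:", gamma - alpha)
print("smallest value observed:", worst)
```

Sampling cannot prove the bound, but it illustrates why perturbed points remain separable by the same classifier whenever the budget is below the margin.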
We can now state the main theorems of our work. We first show that adversarial training may take a long time to converge to models with large margin, even when finds an empirical risk minimizer (ERM) of the loss on . Note that by Lemma 1, this is equivalent to finding a linear separator of . That is, even if finds a model that perfectly fits the training data and all adversarial examples at each step, this is insufficient for fast convergence to good margin.
Theorem 1 (Informal).
Suppose outputs a linear separator of . In the worst case, Algorithm 1 requires iterations to achieve margin .
We then show that for logistic regression, if performs a full-batch gradient descent update on the adversarial examples, then adversarial training quickly finds a model with large margin. This corresponds to the setting where is given as in (5) with . We refer to this as GD with adversarial training.
Theorem 2 (Informal).
Let be the iterates of GD with adversarial training. Then , and for , .
The notation hides polylogarithmic factors. By contrast, one can easily adapt lower bounds in gunasekar2018implicit on the convergence of gradient descent to the max-margin classifier to show that standard gradient descent requires iterations to guarantee margin .
Since the inner maximization in Algorithm 1 is often expensive, we may want to be small. When and performs the gradient update in (5), Algorithm 1 becomes SGD with adversarial training, in which case we have the following.
Theorem 3 (Informal).
Let be the iterates of SGD with adversarial training, and let . With probability at least , and if , then .
3 Fundamental Limits of Adversarial Training for Linear Classifiers
We will now show that even if the subroutine in Algorithm 1 outputs an arbitrary empirical risk minimizer (ERM) of the loss on , then in the worst case iterations are required to obtain margin .
Suppose that in Algorithm 1 is defined by
By Lemma 1, is linearly separable. Thus, the update is equivalent to finding some linear separator of . When is an arbitrary ERM solver, we can analyze the worst case convergence of adversarial training by viewing it as a game played between two players. At each iteration, Player 1 augments the current data with adversarial examples computed for the current model. Player 2 then tries to find a linear separator of all previously seen points with small margin. This specialization of Algorithm 1 is given in Algorithm 2.
In the following, we assume for all . This only reduces the ability of the worst-case ERM solver to output some model with small margin. We say a sequence is admissible if is generated according to iterations of Algorithm 2. Intuitively, the larger is (i.e., the more this game is played), the more restricted the set of linear separators of becomes. We might hope that after a moderate number of rounds, the only feasible separators left have high margin with respect to the original training set .
We show that this is not the case. Specifically, an ERM may still be able to output a linear separator with margin at most , even after exponentially many iterations of adversarial training.
Theorem 4.
Let , where is a unit vector in . Then, there is some constant such that for any , there is an admissible sequence such that for all satisfying
The proof proceeds by relating the number of times an ERM can obtain margin to the size of spherical codes. These are arrangements of points on the sphere with some minimum angle constraint delsarte1991spherical ; kabatyanskii1974bounds ; delsarte1972bounds ; sloane1981tables and have strong connections to sphere packings and lattice density problems conway2013sphere . We show how an arbitrary ERM can use a spherical code of size to generate an admissible sequence with small margin for the first iterations. While computing spherical codes of maximal size is a notoriously difficult task cohn2014sphere , spherical codes with points can be constructed with high probability by taking spherically symmetric points on the sphere at random. A full proof can be found in Appendix A.
This implies that even for relatively small , the number of times an ERM can achieve margin is in the worst case. As we will show in the following sections, this worst-case scenario is overcome when we combine adversarial training with gradient dynamics.
4 Adversarial Training with Gradient-based Updates
We will now discuss gradientbased versions of adversarial training, in which we use gradients evaluated with respect to adversarially perturbed training points to update our model. Suppose that has associated empirical risk function as in (1). Let be some initial model. In adversarial training with gradient methods, at each , we select and update via
(6) 
(7) 
where is the step size and is treated as constant with respect to when computing the gradient . When , we refer to this procedure as GD. When is a single sample selected uniformly at random, we refer to this procedure as SGD. Note that when , this becomes standard GD and SGD on .
Note that both these methods are special cases of Algorithm 1, where the update is given by (7). Before we proceed, we present an alternate view of this method. Recall the functions and defined in (2). To understand GD, we will use Danskin’s theorem danskin2012theory . We note that this was previously used in madry2017towards to justify adversarial training with gradient updates. The version we cite was shown by Bertsekas bertsekas1971control . A more modern proof can be found in bertsekas1997nonlinear .
Proposition 1 (Danskin).
Suppose is a nonempty compact topological space and is a continuous function such that is differentiable for every . Define
Then is subdifferentiable with subdifferential given by .
Thus, we can compute subgradients of by solving the inner maximization problem (6) for each , and then taking a gradient. In other words, for a given , let be a solution to (6). Then . Therefore, GD is a subgradient descent method for , while SGD is a stochastic subgradient method. Furthermore, if the solution to (6) is unique then Danskin’s theorem implies that GD actually computes a gradient descent step, while SGD computes a stochastic gradient step. Indeed, the above proposition also motivated madry2017towards and shaham2018understanding to use a projected gradient inner step to compute adversarial examples and approximate adversarial training with SGD.
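To illustrate how Danskin's theorem is used here, the sketch below computes a gradient of a robust logistic loss by first solving the inner maximization and then differentiating with the perturbation held fixed, and compares the result with a finite-difference gradient. It assumes a linear model, labels in {+1, -1}, and a Euclidean ball of radius `alpha`, in which case the maximizing perturbation is the standard closed form `-alpha * y * w / ||w||` for nonzero `w`. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def worst_case_point(w, x, y, alpha):
    """Shift x by the maximizing perturbation -alpha * y * w / ||w|| (requires w != 0)."""
    n = norm(w)
    return tuple(xi - alpha * y * wi / n for xi, wi in zip(x, w))


def robust_logistic_loss(w, x, y, alpha):
    """Logistic loss evaluated at the worst-case perturbed point."""
    x_adv = worst_case_point(w, x, y, alpha)
    return math.log1p(math.exp(-y * dot(w, x_adv)))


def danskin_gradient(w, x, y, alpha):
    """Solve the inner maximization, then differentiate with the perturbation fixed."""
    x_adv = worst_case_point(w, x, y, alpha)
    s = 1.0 / (1.0 + math.exp(y * dot(w, x_adv)))  # derivative of log(1 + e^u) at u = -y<w, x_adv>
    return tuple(-y * s * xi for xi in x_adv)


def numeric_gradient(w, x, y, alpha, h=1e-6):
    """Central finite differences of the robust loss with respect to w."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        diff = robust_logistic_loss(up, x, y, alpha) - robust_logistic_loss(down, x, y, alpha)
        grad.append(diff / (2.0 * h))
    return tuple(grad)


w, x, y, alpha = (0.5, -1.0), (1.0, 2.0), 1, 0.3
analytic = danskin_gradient(w, x, y, alpha)
numeric = numeric_gradient(w, x, y, alpha)
print(analytic)
print(numeric)
```

The two gradients should agree up to finite-difference error, reflecting that the maximizer is unique away from `w = 0`.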
For linear classification, we can derive stronger structural connections between and .
Lemma 2.
Suppose for monotonically increasing and differentiable. Then, the following properties hold:

For all , satisfies .

For all , is subdifferentiable with , where , if and otherwise.

If is strictly increasing, then is differentiable at all .

If is Lipschitz, smooth, and strictly increasing, then is twice differentiable at , in which case , where .

If is convex, then is convex.
A full proof is given in Appendix B. Thus, if is convex, then is convex and GD and SGD perform (stochastic) subgradient descent on a convex, nonsmooth function. Unfortunately, even if is smooth, is typically nonsmooth. Standard results for convex, nonsmooth optimization then suggest that GD and SGD obtain a convergence rate of on . However, this is a pessimistic convergence rate for subgradient methods on nonsmooth convex functions. By Lemma 2, inherits many nice geometric properties from . There is therefore ample reason to believe the pessimistic convergence rate is not tight. As we show in the following, GD and SGD actually minimize at a much faster rate.
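A small numerical check can make the structural connection concrete. For a loss of the form f(-y<w, x>) with f increasing and a Euclidean budget `alpha`, the inner maximization has the standard closed form f(-y<w, x> + alpha * ||w||). The sketch below, with hypothetical names and a two-dimensional example, compares this closed form against a brute-force search over perturbations on the boundary of the ball (the maximum of an increasing function of a linear argument is attained there).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def closed_form_robust_loss(f, w, x, y, alpha):
    """Robust loss via the closed form f(-y<w, x> + alpha * ||w||)."""
    return f(-y * dot(w, x) + alpha * norm(w))


def brute_force_robust_loss(f, w, x, y, alpha, num_angles=3600):
    """Maximize f(-y<w, x + delta>) over a grid of perturbations with ||delta|| = alpha (2-D only)."""
    best = -math.inf
    for k in range(num_angles):
        t = 2.0 * math.pi * k / num_angles
        x_adv = (x[0] + alpha * math.cos(t), x[1] + alpha * math.sin(t))
        best = max(best, f(-y * dot(w, x_adv)))
    return best


logistic = lambda u: math.log1p(math.exp(u))  # smooth, increasing
ramp = lambda u: max(0.0, u)                  # non-smooth, nondecreasing (Perceptron-style)

w, x, alpha = (0.5, -1.0), (1.0, 2.0), 0.3
for f in (logistic, ramp):
    for y in (1, -1):
        exact = closed_form_robust_loss(f, w, x, y, alpha)
        approx = brute_force_robust_loss(f, w, x, y, alpha)
        print(exact, approx)
```

The grid search can only approach the closed form from below, so the small remaining gap reflects grid resolution rather than a discrepancy.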
In the next section, we analyze the convergence of GD and SGD, measured in terms of , as well as , for logistic regression. We adapt the classical analysis of the Perceptron algorithm from novikoff1962convergence to show that a given margin is obtained. To motivate this, we first analyze an adversarial training version of the Perceptron.
4.1 Adversarial Training with the Perceptron
Let . Then . For notational convenience, suppose that for all , . Let . Applying SGD with stepsize , we get updates of the form where if and otherwise. This is essentially the Perceptron algorithm, in which case novikoff1962convergence implies the following.
Lemma 3.
This procedure stops after at most nonzero updates, at which point linearly separates .
Suppose we instead perform SGD with stepsize and . Given , let if and otherwise. Lemma 2 implies that SGD does the following: Sample uniformly at random, then update via
Due to its resemblance to the Perceptron, we refer to this update as the α-Perceptron. We then get an analogous result on the number of iterations required to find classifiers with a given margin.
Lemma 4.
The α-Perceptron stops after at most nonzero updates, after which point has margin at least .
Proof.
Assume the update at is nonzero, so . Let be a unit vector that achieves margin . Then,
Therefore, after iterations, . Next, we upper bound via:
The last step follows from the fact that we update iff . Recursively, we find that , so . Combining the above,
The update at is nonzero iff has margin at , so once the α-Perceptron stops updating, . ∎
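The adversarial Perceptron update can be sketched as follows. This is a hedged illustration rather than the exact procedure analyzed above: we assume the update adds `y * x - alpha * w / ||w||` whenever `y * <w, x> <= alpha * ||w||`, we treat `w / ||w||` as zero at `w = 0`, and we cycle through the data deterministically instead of sampling uniformly at random. The function name and the toy dataset are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def alpha_perceptron(data, alpha, w0, max_passes=10_000):
    """Perceptron on worst-case perturbed points; stops after a full pass with no update."""
    w = list(w0)
    updates = 0
    for _ in range(max_passes):
        changed = False
        for x, y in data:
            n = norm(w)
            if y * dot(w, x) <= alpha * n:        # some perturbation of x is misclassified
                shrink = alpha / n if n > 0 else 0.0
                w = [wi + y * xi - shrink * wi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return tuple(w), updates


def margin(w, data):
    return min(y * dot(w, x) for x, y in data) / norm(w)


# Toy separable data with ||x|| <= 1; the direction (1, 0) has margin 0.7.
data = [((0.9, 0.3), 1), ((0.8, -0.4), 1), ((-0.9, 0.2), -1), ((-0.7, -0.5), -1)]
alpha = 0.3
w, updates = alpha_perceptron(data, alpha, (0.0, 0.0))
print("updates:", updates, "margin:", margin(w, data))
```

When the loop exits without exhausting `max_passes`, every training point satisfies `y * <w, x> > alpha * ||w||`, so the returned classifier has margin larger than `alpha`.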
While simple, this result hints at an underlying, more general phenomenon for linearly separable datasets: The convergence of gradientbased adversarial training to a robust risk minimizer often mirrors the convergence of conventional gradient methods to an empirical risk minimizer. We demonstrate this principle formally in the following section for logistic regression.
5 Adversarial Training for Logistic Regression
We will now analyze the convergence and margin of GD and SGD for logistic regression. In logistic regression, where . Note that is convex, Lipschitz, and smooth, and bounded below by 0. For notational simplicity, suppose that with for all . Thus, the max-margin of satisfies .
5.1 Convergence and Margin of GD
Let be the iterates of GD with stepsizes . We will suppose that , and . These assumptions are not necessary, but simplify the statement and proofs of the following results. Full proofs of all results in this section can be found in Appendix C.
To analyze the convergence of GD on , we will use the fact that by Lemma 2, while is not smooth, it is smooth away from . We then use a Perceptron-style argument inspired by ji2018risk to show that after a few iterations, the model produced by GD has norm bounded below by some positive constant. We can then apply standard convergence techniques for gradient descent on smooth functions to derive the following.
Theorem 5.
Suppose , and , . Then ,
We can use the above results to show that after a polynomial number of iterations, we obtain a model with margin . To do so, we first require a straightforward lemma relating to margin.
Lemma 5.
If then .
We then get the following.
Corollary 1.
Suppose that for , and . For all , there is a constant such that for all satisfying
(8) 
Ignoring all other terms, this implies that for all , iterations of GD are sufficient to obtain margin . The constant is how large must be so that for all , . As such, the constant tends to as tends to .
On the other hand, one can show that standard gradient descent may require exponentially many iterations to reach margin , even though it eventually converges to the max-margin classifier. This follows immediately from a direct adaptation of lower bounds from gunasekar2018implicit .
Theorem 6.
Let . Let be the iterates of GD with constant stepsize initialized at for . For all , .
One can show that as decreases, this convergence rate only decreases. Thus, the exponentially slow convergence in margin is not an artifact of the choice of stepsize, but rather an intrinsic property of gradient descent on logistic regression.
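The full-batch procedure analyzed in this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not the experimental implementation: it uses the closed-form worst-case perturbation for linear models under a Euclidean budget `alpha`, a constant step size, initialization at zero (where `w / ||w||` is taken to be zero), and a hypothetical toy dataset. Setting `alpha = 0` recovers standard GD on the logistic loss.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def robust_logistic_risk(w, data, alpha):
    """Average logistic loss over worst-case perturbed points (closed form for linear models)."""
    n = norm(w)
    return sum(math.log1p(math.exp(-y * dot(w, x) + alpha * n)) for x, y in data) / len(data)


def adversarial_gd(data, alpha, step, num_iters):
    """Full-batch gradient steps evaluated on adversarially perturbed training points."""
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(num_iters):
        n = norm(w)
        unit = [wi / n for wi in w] if n > 0 else [0.0] * d
        grad = [0.0] * d
        for x, y in data:
            x_adv = [xi - alpha * y * ui for xi, ui in zip(x, unit)]  # worst-case point
            s = sigmoid(-y * dot(w, x_adv))                           # loss derivative
            for j in range(d):
                grad[j] -= y * s * x_adv[j] / len(data)
        w = [wi - step * gj for wi, gj in zip(w, grad)]
    return w


def margin(w, data):
    return min(y * dot(w, x) for x, y in data) / norm(w)


data = [((0.9, 0.3), 1), ((0.8, -0.4), 1), ((-0.9, 0.2), -1), ((-0.7, -0.5), -1)]
for alpha in (0.0, 0.3):
    w = adversarial_gd(data, alpha, step=0.5, num_iters=500)
    print(alpha, robust_logistic_risk(w, data, alpha), margin(w, data))
```

Printing the margin for several values of `alpha` and iteration counts gives a quick way to explore the qualitative behavior discussed above on small examples.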
5.2 Convergence and Margin of SGD
Recall that at each iteration , SGD selects uniformly at random and updates via . We would like to derive similar results to those for GD above. While we could simply try to derive the same results by taking expectations over the iterates of SGD, this ignores relatively recent work that has instead derived high-probability convergence results for SGD ji2018risk ; rakhlin2011making . In particular, ji2018risk uses a martingale Bernstein bound from beygelzimer2011contextual to derive a high-probability convergence rate for SGD on separable data. While the analysis cannot be used directly, we use the structural connections between and in Lemma 2 to adapt the techniques therein. We derive the following:
Theorem 7.
Let be the iterates of SGD with constant step size and . For any , with probability at least , satisfies
A similar (but slightly more complicated) result can be shown when , which we have omitted for the sake of exposition. Using Lemma 5, we can now show that after iterations, with high probability, will have margin at least .
Corollary 2.
Let be the iterates of SGD with constant step size and . For all , there is a constant
then with probability at least , . Here, is some universal constant.
Ignoring all other factors, this implies that for any , with high probability iterations of SGD are sufficient to obtain margin . As with GD, the constant is how large must be so that for all , . Proofs of the above results can be found in Appendix D.
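The single-sample variant differs from the full-batch one only in that each step attacks and updates on one uniformly sampled training point. The sketch below makes the same assumptions as before (linear model, Euclidean budget `alpha`, closed-form worst-case perturbation, zero initialization with `w / ||w||` taken to be zero there); the function name, seed handling, and toy data are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def adversarial_sgd(data, alpha, step, num_iters, seed=0):
    """Single-sample gradient steps on the worst-case perturbation of the sampled point."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(num_iters):
        x, y = rng.choice(data)                                   # sample one training point
        n = norm(w)
        unit = [wi / n for wi in w] if n > 0 else [0.0] * d
        x_adv = [xi - alpha * y * ui for xi, ui in zip(x, unit)]  # attack only that point
        s = sigmoid(-y * dot(w, x_adv))
        w = [wi + step * y * s * xj for wi, xj in zip(w, x_adv)]  # stochastic gradient step
    return w


data = [((0.9, 0.3), 1), ((0.8, -0.4), 1), ((-0.9, 0.2), -1), ((-0.7, -0.5), -1)]
w = adversarial_sgd(data, alpha=0.3, step=0.5, num_iters=2000)
print(w)
```

Because each step computes a single adversarial example, the per-iteration cost is independent of the dataset size, at the price of noisier iterates.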
6 Experiments
To corroborate our theory, we evaluate GD and SGD on logistic regression with linearly separable data. As in our theory, we train linear classifiers whose prediction on is . We compare GD and SGD for various values of . Note that when , GD and SGD are identical to the standard GD and SGD training algorithms, which we use as benchmarks.
Evaluation metrics.
We evaluate these methods in three ways. First, we compute the training loss in (1). Second, we compute the margin in (3). To aid clarity, we plot the truncated margin, . Third, we plot the robust training loss in (2). This is governed by . For convenience, we refer to this as the robust loss and denote it by . To compare SGD for different values of , we plot for SGD. In particular, standard GD and SGD correspond to , in which case we plot .
Setup and implementation.
All experiments were implemented in PyTorch. We vary over . When , we get standard GD and SGD. In all experiments, we use a constant stepsize that is tuned for each . The tuning was done by varying over , evaluating the average value of after iterations, and selecting the stepsize with the smallest loss. For SGD, we did the same, but averaged over 5 trials. When plotting the above evaluation metrics for SGD, we ran multiple trials (where the number varied depending on the dataset) and plotted the average, as well as error bars corresponding to the standard deviation.
Synthetic data.
We draw uniformly at random from circles of radius 1 centered at and . These correspond to and labeled points, respectively. We draw points from each circle, and also add the points and , where . This guarantees that the max-margin is . We initialize at . While we observe similar behavior for any reasonable initialization, this initialization is used to compare how the methods “correct” bad models. For SGD, we computed the average and standard deviation of the evaluation metrics above over 5 trials.
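A dataset of this general shape can be generated as sketched below. The specific choices are assumptions made for illustration and are not taken from the setup above: we sample uniformly from unit disks centered at `(1 + gamma, 0)` and `(-(1 + gamma), 0)`, and add the two points `(gamma, 0)` and `(-gamma, 0)` so that the max-margin equals `gamma`. The function name, the sample size, and the value of `gamma` are likewise hypothetical.

```python
import math
import random


def make_synthetic(num_per_class, gamma, seed=0):
    """Two labeled clusters in the plane with max-margin exactly gamma (assumed layout)."""
    rng = random.Random(seed)
    data = []
    for label in (1, -1):
        center_x = label * (1.0 + gamma)
        for _ in range(num_per_class):
            r = math.sqrt(rng.random())            # sqrt gives a uniform density over the disk
            t = rng.uniform(0.0, 2.0 * math.pi)
            data.append(((center_x + r * math.cos(t), r * math.sin(t)), label))
        data.append(((label * gamma, 0.0), label))  # closest point to the boundary
    return data


gamma = 0.25
data = make_synthetic(num_per_class=50, gamma=gamma)
# Margin of the direction (1, 0); the two added points pin it at gamma.
print(min(y * x[0] for x, y in data))
```

In this layout the two added points are at distance `2 * gamma` from each other, so no linear separator can exceed margin `gamma`, while the direction `(1, 0)` attains it.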
Real data.
We use the Iris Dataset Dua:2019 , which contains data for 3 classes, Iris-setosa, Iris-versicolor, and Iris-virginica. Iris-setosa is linearly separable from Iris-virginica with max-margin . We initialize with entries drawn from . We found that our results were not especially sensitive to the initialization scheme. While different initializations result in minor changes to the plots below, the effects were consistently uniform across different . For SGD, we computed the average and standard deviation of the evaluation metrics above over 9 trials. Note that we increased the number here due to the increased variance of single-sample SGD on this dataset over the synthetic dataset above.
Discussion.
The results for GD on the synthetic dataset and the Iris dataset are given in Figures 1 and 2, while the results for SGD on the synthetic dataset and the Iris dataset are given in Figures 3 and 4. The plots corroborate our theory for GD and SGD. Moreover, the results for these two methods are extremely similar on both datasets. The most notable difference is that for the margin plot on the Iris dataset, the margin for SGD resembles a noisy version of the margin plot for GD. This is expected, as SGD focuses only on one example at a time, potentially decreasing the margin at other points, while GD computes adversarial examples for every element of the training set at each iteration.
We see that GD and SGD quickly attain margin on both datasets, and once they do, their margin convergence slows down. Moreover, the larger is, generally the larger the achieved margin is at any given iteration. Generally, standard GD and SGD take much longer to obtain a given margin than GD and SGD with adversarial training. As reflected by previous work on the implicit bias of such methods gunasekar2018characterizing ; gunasekar2018implicit ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit , we see a logarithmic convergence to the max-margin in both settings. One interesting observation is that GD and SGD minimize the training loss faster than standard GD and SGD, despite not directly optimizing this loss function. Finally, we see that for , GD and SGD generally seem to exhibit a convergence rate for . However, the convergence rate seems to increase proportionally to . Intuitively, becomes more difficult to minimize as increases.
7 Conclusion
In this paper, we analyzed adversarial training on separable data. We showed that while generic adversarial training and standard gradient-based methods may each require exponentially many iterations to obtain large margin, their combination exhibits a strong bias towards models with large margin that translates to fast convergence to these robust solutions. There are a large number of possible extensions. First, we would like to understand the behavior of these methods on non-separable data, especially with regard to . Second, we would like to generalize our results to 1) multi-class classification, and 2) regression tasks. While the former is relatively straightforward, the latter will necessarily require new methods and perspectives, due to differences in the behavior of when is a loss function for classification or regression.
References
 [1] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
 [2] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

 [3] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
 [4] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [5] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435, 2016.
 [6] Jan Hendrik Metzen, Mummadi Chaithanya Kumar, Thomas Brox, and Volker Fischer. Universal adversarial perturbations against semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2755–2764, 2017.
 [7] Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [8] Konda Reddy Mopuri, Utsav Garg, and R Venkatesh Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. arXiv preprint arXiv:1707.05572, 2017.
 [9] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
 [10] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
 [11] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 [12] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. stat, 1050:29, 2017.

 [13] Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 39–49. ACM, 2017.
 [14] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
 [15] Andrew Ilyas, Ajil Jalal, Eirini Asteri, Constantinos Daskalakis, and Alexandros G Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.
 [16] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, 2018.
 [17] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
 [18] Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.
 [19] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
 [20] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.
 [21] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
 [22] Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 3420–3428. PMLR, 16–18 Apr 2019.
 [23] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. arXiv preprint arXiv:1806.01796, 2018.
 [24] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
 [25] Albert B Novikoff. On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
 [26] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
 [27] Indraneel Mukherjee, Cynthia Rudin, and Robert E Schapire. The rate of convergence of AdaBoost. The Journal of Machine Learning Research, 14(1):2315–2347, 2013.
 [28] Matus Telgarsky. A primal-dual convergence analysis of boosting. Journal of Machine Learning Research, 13(Mar):561–606, 2012.
 [29] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941–973, 2004.
 [30] Matus Telgarsky. Margins, shrinkage, and boosting. arXiv preprint arXiv:1303.4172, 2013.
 [31] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.
 [32] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
 [33] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014–5026, 2018.

 [34] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. stat, 1050:11, 2018.
 [35] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits). arXiv preprint arXiv:1805.12152, 2018.
 [36] Constantine Caramanis, Shie Mannor, and Huan Xu. Robust optimization in machine learning. Optimization for Machine Learning, page 369, 2012.

 [37] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
 [38] Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, and Dimitris Papailiopoulos. Does data augmentation lead to positive margin? arXiv preprint arXiv:1905.03177, 2019.
 [39] Ali Shafahi, Mahyar Najibi, Zheng Xu, John Dickerson, Larry S Davis, and Tom Goldstein. Universal adversarial training. arXiv preprint arXiv:1811.11304, 2018.
 [40] Philippe Delsarte, Jean-Marie Goethals, and Johan Jacob Seidel. Spherical codes and designs. In Geometry and Combinatorics, pages 68–93. Elsevier, 1991.
 [41] GA Kabatyanskii and VI Levenshtein. Bounds for packings on a sphere and in space. Problems of Information Transmission, 95:148–158, 1974.

 [42] Philippe Delsarte. Bounds for unrestricted codes, by linear programming. Philips Res. Rep, 27:272–289, 1972.
 [43] N Sloane. Tables of sphere packings and spherical codes. IEEE Transactions on Information Theory, 27(3):327–338, 1981.
 [44] John Horton Conway and Neil James Alexander Sloane. Sphere packings, lattices and groups, volume 290. Springer Science & Business Media, 2013.
 [45] Henry Cohn, Yufei Zhao, et al. Sphere packing bounds via spherical codes. Duke Mathematical Journal, 163(10):1965–2002, 2014.
 [46] John M Danskin. The theory of max-min and its application to weapons allocation problems, volume 5. Springer Science & Business Media, 2012.
 [47] Dimitri P Bertsekas. Control of uncertain systems with a set-membership description of the uncertainty. PhD thesis, Massachusetts Institute of Technology, 1971.
 [48] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
 [49] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

 [50] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.
 [51] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
Appendix A Proof of Theorem 4
Recall that in Algorithm 2, at each iteration the learner selects and then computes the adversarial examples in (4) for each at the current model . This set of adversarial examples is defined as . We will assume throughout that , as this only diminishes the adversary’s ability to obtain small margin.
Define . Let denote the unit sphere in . For any , we define to be the collection of subsets of of maximal size such that any two distinct elements satisfy ; these subsets are referred to as spherical codes. We let denote the size of any . For , we will relate the number of times an adversary can find a classifier with margin to . In the following, we will let be the vector with first coordinate of , and remaining coordinates of . Without loss of generality, we can assume the unit vector in the statement of Theorem 4 satisfies .
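To make the notion of a spherical code concrete, the following is a minimal sketch, not the construction used in the proof. It assumes the separation condition takes the common form of an upper bound on the pairwise inner product of distinct unit vectors; the threshold `max_inner`, the dimension, and all function names are hypothetical choices made for illustration. The greedy procedure yields a valid code, but not necessarily one of maximal size, so its length is only a lower bound on the maximal code size.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a point uniformly from the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner(u, v):
    """Euclidean inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def greedy_spherical_code(dim, max_inner, n_candidates=2000, seed=0):
    """Greedily collect unit vectors whose pairwise inner products are <= max_inner.

    Assumption: the separation condition is an upper bound on inner products.
    The result is a valid code under that condition, but it need not be maximal.
    """
    rng = random.Random(seed)
    code = []
    for _ in range(n_candidates):
        z = random_unit_vector(dim, rng)
        # Keep the candidate only if it is well separated from every kept point.
        if all(inner(z, c) <= max_inner for c in code):
            code.append(z)
    return code


code = greedy_spherical_code(dim=8, max_inner=0.5)
print("greedy code size:", len(code))
```

Increasing `dim` while holding `max_inner` fixed typically lets the greedy code grow quickly, which is the qualitative behavior that spherical-code counting arguments exploit.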
Lemma 6.
Let . For any , there is an admissible sequence such that for all satisfying
Proof.
Let . Note that has max-margin . Fix and let
Let . For , define by
That is, the first coordinate of is , while its remaining coordinates are given by . Since , we have . We will show that each is admissible and has margin at most with respect to .
For any , we have
Thus, each correctly classifies . Moreover, since , its margin at is . We now must show that each correctly classifies .
Recall that we assume is of the form where is a monotonically increasing function. This implies that given , , and , satisfies (4). Therefore, for ,
Given and , and by construction of the , we have