# Convergence and Margin of Adversarial Training on Separable Data

## 1 Introduction

Machine learning models trained through standard methods often lack robustness against adversarial examples. These are small perturbations of input examples, designed to “fool” the model into misclassifying the original input biggio2013evasion ; goodfellow2014explaining ; nguyen2015deep ; szegedy2013intriguing . Unfortunately, even small perturbations can cause a large degradation in the test accuracy of popular machine learning models, including deep neural networks szegedy2013intriguing . This lack of robustness has spurred a large body of work on designing attack methods for crafting effective adversarial examples grosse2016adversarial ; hendrik2017universal ; moosavi2016deepfool ; mopuri2017fast ; papernot2016transferability ; tramer2017ensemble and defense mechanisms for training models that are more robust to norm bounded perturbations tramer2017ensemble ; madry2017towards ; sinha2017certifiable ; zantedeschi2017efficient ; samangouei2018defense ; ilyas2017robust ; shaham2018understanding .

Adversarial training is a family of optimization-based methods for defending against adversarial perturbations. These methods generally operate by computing adversarial examples, and retraining the model on these examples goodfellow2014explaining ; madry2017towards ; shaham2018understanding . This two-step process is repeated iteratively. While adversarial training methods have achieved empirical success madry2017towards ; shaham2018understanding ; ford2019adversarial ; hendrycks2018benchmarking , there is currently little theoretical analysis of their convergence and capacity for guaranteeing robustness.

A parallel line of research has investigated whether standard optimization methods, such as gradient descent (GD) and stochastic gradient descent (SGD), exhibit an implicit bias toward robust and generalizable models gunasekar2018characterizing ; gunasekar2018implicit ; ji2018risk ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit . This line of work shows that GD and SGD both converge to the max-margin classifier of linearly separable data, provided that the loss function is chosen appropriately. Notably, the max-margin classifier is the most robust model against $\ell_2$ bounded perturbations. Thus, gradient descent is indeed biased towards robustness in some settings. Unfortunately, convergence to this desirable limit can be slow, and in some cases an exponential number of iterations may be needed nacson2019convergence ; nacson2018stochastic ; soudry2018implicit .

#### Our contributions.

In this work, we merge these two previously separate lines of work, studying whether (and how) various types of adversarial training exhibit a bias towards robust models. We focus on linear classification tasks and study robustness primarily through the lens of margin, the minimum distance between the classification boundary and the (unperturbed) training examples. Our results show that alone, neither adversarial training with generic update rules, nor gradient-based training on the original data set, can find large-margin models quickly. However, by combining the two — interspersing gradient-based update rules with the addition of adversarial examples to the training set — we can train robust models significantly faster.

We show that for logistic regression, gradient-based update rules evaluated on adversarial examples minimize a robust form of the empirical risk function at a rate of $O(\ln(T)^2/T)$, where $T$ is the number of iterations of the adversarial training process. This convergence rate mirrors the convergence of GD and SGD on the standard empirical risk, despite the non-smoothness of the robust empirical risk function. We then use this analysis to quantify the number of iterations required to obtain a given margin. We show that while GD may require exponentially many iterations to achieve large margin in non-adversarial training, adversarial training with (stochastic) gradient-based rules requires only polynomially many iterations to achieve large margin. We support our theoretical bounds with experimental results.

### 1.1 Related Work

Our results are most similar in spirit to ji2018risk , which uses techniques inspired by the Perceptron novikoff1962convergence to analyze the convergence of GD and SGD on logistic regression. It derives a high-probability convergence rate for SGD on separable data, as well as a convergence rate for GD in general. We adapt these techniques for adversarial training. Our work also connects to work on “implicit bias”, which studies the parameter convergence of GD and SGD for logistic regression on separable data gunasekar2018characterizing ; gunasekar2018implicit ; ji2018risk ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit . These works show that the parameters generated by GD and SGD converge to the parameters that correspond to the max-margin classifier at polylogarithmic rates. This line of work, among other tools, employs techniques developed in the context of AdaBoost freund1997decision ; mukherjee2013rate ; telgarsky2012primal . Our analysis is related in particular to margin analyses of boosting rosset2004boosting ; telgarsky2013margins , which show that the path taken by boosting on exponentially tailed losses approximates the max-margin classifier.

There is a large and active body of theoretical work on adversarial robustness. While there are various hardness results in learning robust models bubeck2018adversarial ; gilmer2018adversarial ; schmidt2018adversarially ; tsipras2018robustness ; tsipras2018there , our analysis shows that such results may not apply to practical settings. Our analysis uses a robust optimization lens previously applied to machine learning in work such as sinha2017certifiable ; caramanis201214 ; xu2009robustness . While xu2009robustness shows that the max-margin classifier is the solution to a robust empirical loss function, our work derives explicit convergence rates for SGD and GD on such losses. Finally, we note that adversarial training can be viewed as a data augmentation technique. While the relation between margin and static data augmentation was previously studied in rajput2019does , our work can be viewed as analyzing adaptive data augmentation methods.

## 2 Overview

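Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{W}$ denote the feature space, label space, and model space, respectively, and let $\ell:\mathcal{W}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ be some loss function. Given a dataset $S\subseteq\mathcal{X}\times\mathcal{Y}$, the empirical risk minimization objective is given by

$$\min_{w\in\mathcal{W}} L(w) := \frac{1}{|S|}\sum_{(x,y)\in S}\ell(w,x,y). \tag{1}$$

Unfortunately, generic empirical risk minimizers may not be robust to small input perturbations. To find models that are resistant to bounded input perturbations, we define the following robust loss functions

$$\ell_{\mathrm{rob}}(w,x,y) := \max_{\|\delta\|\le\alpha}\ell(w,x+\delta,y), \qquad L_{\mathrm{rob}}(w) := \frac{1}{|S|}\sum_{(x,y)\in S}\ell_{\mathrm{rob}}(w,x,y). \tag{2}$$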

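The function $L_{\mathrm{rob}}$ is a measure for the robustness of $w$ on $S$. While $\|\cdot\|$ can be any norm, here we focus on the $\ell_2$ norm and let $\|\cdot\|$ denote it throughout our text. Another important measure of robustness is margin. We focus on binary linear classification where $\mathcal{X}=\mathbb{R}^d$, $\mathcal{Y}=\{\pm 1\}$, and $\mathcal{W}=\mathbb{R}^d$. The class predicted by $w$ on $x$ is given by $\mathrm{sign}(\langle w,x\rangle)$, and the margin of $w$ on $S$ is

$$\mathrm{margin}_S(w) := \inf_{(x,y)\in S}\frac{y\langle w,x\rangle}{\|w\|}. \tag{3}$$

We say $w$ linearly separates $S$ if $y\langle w,x\rangle>0$ for all $(x,y)\in S$. Note that $w$ linearly separates $S$ iff $\mathrm{margin}_S(w)>0$. One can interpret margin as the size of the smallest perturbation needed to fool $w$ into misclassifying an element of $S$. Thus, the most robust linear separator is the classifier with the largest margin, referred to as the max-margin classifier.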
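To make the margin definition concrete, the following minimal sketch computes the smallest value of $y\langle w,x\rangle/\|w\|$ over a small labeled dataset and uses its sign to test linear separation. The function name `margin` and the toy points are illustrative assumptions, not part of the original work.

```python
import math


def margin(w, data):
    """Return the minimum over (x, y) in data of y * <w, x> / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("the margin is undefined for w = 0")
    return min(
        y * sum(wi * xi for wi, xi in zip(w, x)) / norm
        for x, y in data
    )


# Hypothetical toy data: two points per class in the plane, labels in {+1, -1}.
toy_data = [((2.0, 0.0), 1), ((1.0, 1.0), 1), ((-1.0, 0.0), -1), ((-3.0, 2.0), -1)]

print(margin((1.0, 0.0), toy_data))  # positive: this w linearly separates the data
print(margin((0.0, 1.0), toy_data))  # not positive: this w does not separate the data
```

Because the inner product is divided by the norm of the weight vector, rescaling the weights leaves the margin unchanged, which is why margin is a property of the decision boundary rather than of a particular parameterization.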

One popular class of defenses, referred to generally as adversarial training madry2017towards , involves retraining a model on adversarially perturbed data points. The general heuristic follows two steps. At each iteration $t$ we construct adversarial examples for some subset of the training data. For each example $(x,y)$ in this set, an $\alpha$-bounded norm adversarial perturbation is constructed as follows:

$$\delta^* = \arg\max_{\|\delta\|\le\alpha}\ell(w,x+\delta,y). \tag{4}$$

We then update our model using an update rule $\mathcal{A}$ that operates on the current model and “adversarial examples” of the form $(x+\delta^*,y)$. In the most general case, this update rule can also utilize true training data in $S$ and adversarial examples from prior iterations.

More formally, let $w_0$ be our initial model. $S$ denotes our true training data, and $S'$ will denote all previously seen adversarial examples. We initialize $S'=\emptyset$. At each $t\ge 0$, we select some subset $S_t=\{(x_i^{(t)},y_i^{(t)})\}_{i=1}^m\subseteq S$. For $1\le i\le m$, we let $\delta_i^{(t)}$ be the solution to (4) when $(x,y)=(x_i^{(t)},y_i^{(t)})$ and $w=w_t$. We then let

$$S'_t=\{(x_i^{(t)}+\delta_i^{(t)},y_i^{(t)})\}_{i=1}^m, \qquad S'=S'\cup S'_t.$$

Thus, $S'_t$ is the set of adversarial examples computed at iteration $t$, while $S'$ contains all adversarial examples computed up to (and including) iteration $t$. Finally, we update our model via $w_{t+1}=\mathcal{A}(w_t,S,S')$ for some update rule $\mathcal{A}$. This generic notation will be useful to analyze a few different algorithms. A full description of adversarial training is given in Algorithm 1.

Once $\alpha$ is fixed, there are two primary choices in selecting an adversarial training method: the subset $S_t$ used to find adversarial examples, and the update rule $\mathcal{A}$. For example, one popular instance of adversarial training (discussed in detail in madry2017towards ) performs mini-batch SGD on the adversarial examples. Specifically, this corresponds to the setting where $S_t$ is randomly selected from $S$, and $\mathcal{A}$ computes a mini-batch SGD update on $S'_t$ via

$$w_{t+1}=\mathcal{A}(w_t,S,S')=w_t-\frac{\eta_t}{|S'_t|}\sum_{(x+\delta,y)\in S'_t}\nabla\ell(w_t,x+\delta,y). \tag{5}$$

In particular, this update does not utilize the full set $S'$ of all previously seen adversarial examples, but instead updates only using the set $S'_t$ of the most recently computed adversarial examples. It also does not use the true training samples $S$. However, other incarnations of adversarial training have used more of $S$ and $S'$ to enhance their accuracy and efficiency shafahi2018universal .
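The generic loop described above (select a subset, compute worst-case perturbations, grow the set of adversarial examples, apply an update rule) can be sketched as follows for a linear classifier. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `adversarial_training`, `worst_case_perturbation`, `sgd_update`, and the callback signatures are hypothetical, the loss is assumed to be the logistic loss, and the inner maximization uses the closed form that holds for linear models with $\ell_2$-bounded perturbations (push the input against its label along the direction of the weights).

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def worst_case_perturbation(w, y, alpha):
    """l2 perturbation of norm alpha that most decreases y * <w, x>."""
    norm = math.sqrt(dot(w, w))
    if norm == 0.0:
        return [0.0] * len(w)  # at w = 0 every perturbation gives the same loss
    return [-alpha * y * wi / norm for wi in w]


def logistic_grad(w, x, y):
    """Gradient in w of ln(1 + exp(-y * <w, x>))."""
    coeff = 1.0 / (1.0 + math.exp(y * dot(w, x)))
    return [-coeff * y * xi for xi in x]


def sgd_update(step_size):
    """Update rule in the spirit of (5): averaged gradient over the newest examples."""
    def update(w, data, seen, new_adv, t):
        grad = [0.0] * len(w)
        for x, y in new_adv:
            for k, gk in enumerate(logistic_grad(w, x, y)):
                grad[k] += gk / len(new_adv)
        return [wk - step_size * gk for wk, gk in zip(w, grad)]
    return update


def adversarial_training(data, w0, alpha, update, select, num_iters):
    w = list(w0)
    seen = []  # plays the role of S': every adversarial example computed so far
    for t in range(num_iters):
        new_adv = []
        for x, y in select(data, t):
            delta = worst_case_perturbation(w, y, alpha)
            new_adv.append(([xi + di for xi, di in zip(x, delta)], y))
        seen.extend(new_adv)
        w = update(w, data, seen, new_adv, t)
    return w, seen


# Hypothetical toy data, separable by the first coordinate.
rng = random.Random(0)
data = [((1.0, 0.5), 1), ((0.8, -0.3), 1), ((-1.0, 0.2), -1), ((-0.7, -0.6), -1)]
w, seen = adversarial_training(
    data, w0=[0.0, 0.0], alpha=0.2,
    update=sgd_update(0.5),
    select=lambda points, t: rng.sample(points, 2),
    num_iters=200,
)
print(w, len(seen))
```

Passing the update rule and the subset selection as callbacks mirrors the two design choices named in the text; an update rule that uses all of `data` and `seen` would fit the same interface.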

#### Main results.

In the following, we analyze the performance of adversarial training for binary linear classification. In particular, we wish to understand how the choice of $S_t$, $\mathcal{A}$, and the number of iterations impact $\mathrm{margin}_S(w_t)$ and $L_{\mathrm{rob}}(w_t)$. We will make the following assumptions throughout:

###### Assumption A1.

$\ell(w,x,y)=f(-y\langle w,x\rangle)$ where $f$ is nonnegative and monotonically increasing.

###### Assumption A2.

$S$ is linearly separable with max-margin $\gamma$.

###### Assumption A3.

The parameter $\alpha$ satisfies $\alpha<\gamma$.

A1 guarantees that $\ell$ is a surrogate of the $0$-$1$ loss for linear classification, since $\ell(w,x,y)$ decreases as $y\langle w,x\rangle$ increases. A2 allows us to compare the margin obtained by various methods to $\gamma$. We let $w^*$ denote the max-margin classifier. The assumption that is simply for convenience, as we can always rescale separable data to ensure this.

Combined, A2 and A3 guarantee that at every iteration, $S\cup S'$ is linearly separable by $w^*$ with margin at least $\gamma-\alpha$, as we show in the following lemma.

###### Lemma 1.

Suppose A2 and A3 hold, and let $w^*$ be the max-margin classifier of $S$. Then at each iteration of Algorithm 1, $w^*$ linearly separates $S\cup S'$ with margin at least $\gamma-\alpha$.

###### Proof.

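By construction, any element in $S\cup S'$ is of the form $(x+\delta,y)$ where $(x,y)\in S$ and $\|\delta\|\le\alpha$. By assumption on $w^*$ (taken to be a unit vector) and the Cauchy-Schwarz inequality,

$$y\langle w^*,x+\delta\rangle=y\langle w^*,x\rangle+y\langle w^*,\delta\rangle\ge\gamma-\alpha.$$

∎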

We can now state the main theorems of our work. We first show that adversarial training may take a long time to converge to models with large margin, even when $\mathcal{A}$ finds an empirical risk minimizer (ERM) of the $0$-$1$ loss on $S\cup S'$. Note that by Lemma 1, this is equivalent to finding a linear separator of $S\cup S'$. That is, even if $\mathcal{A}$ finds a model that perfectly fits the training data and all adversarial examples at each step, this is insufficient for fast convergence to good margin.

###### Theorem 1 (Informal).

Suppose $\mathcal{A}$ outputs a linear separator of $S\cup S'$. In the worst case, Algorithm 1 requires $\Omega(\exp(d))$ iterations to achieve margin $\epsilon$.

We then show that for logistic regression, if $\mathcal{A}$ performs a full-batch gradient descent update on the adversarial examples, then adversarial training quickly finds a model with large margin. This corresponds to the setting where $\mathcal{A}$ is given as in (5) with $S_t=S$. We refer to this as GD with adversarial training.

###### Theorem 2 (Informal).

Let $\{w_t\}_{t\ge 1}$ be the iterates of GD with adversarial training. Then $L_{\mathrm{rob}}(w_t)\le\tilde{O}(1/t)$, and for $t=\Omega(\mathrm{poly}((\gamma-\alpha)^{-1}))$, $\mathrm{margin}_S(w_t)\ge\alpha$.

The notation $\tilde{O}$ hides polylogarithmic factors. By contrast, one can easily adapt lower bounds in gunasekar2018implicit on the convergence of gradient descent to the max-margin classifier to show that standard gradient descent requires exponentially many iterations to guarantee margin $\alpha$.

Since the inner maximization in Algorithm 1 is often expensive, we may want $|S_t|$ to be small. When $|S_t|=1$ and $\mathcal{A}$ performs the gradient update in (5), Algorithm 1 becomes SGD with adversarial training, in which case we have the following.

###### Theorem 3 (Informal).

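Let $\{w_t\}_{t\ge 1}$ be the iterates of SGD with adversarial training, and let $\hat{w}_t=\frac{1}{t}\sum_{j<t}w_j$. With probability at least $1-\delta$, $L_{\mathrm{rob}}(\hat{w}_t)\le\tilde{O}(\ln(1/\delta)/t)$, and if $t\ge\Omega(\mathrm{poly}((\gamma-\alpha)^{-1},\ln(1/\delta)))$, then $\mathrm{margin}_S(\hat{w}_t)\ge\alpha$.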

## 3 Fundamental Limits of Adversarial Training for Linear Classifiers

We will now show that even if the subroutine $\mathcal{A}$ in Algorithm 1 outputs an arbitrary empirical risk minimizer (ERM) of the $0$-$1$ loss on $S\cup S'$, then in the worst case $\Omega(\exp(d))$ iterations are required to obtain margin $\epsilon$.

Suppose that $\mathcal{A}$ in Algorithm 1 is defined by

$$\mathcal{A}(w_t,S,S')\in\arg\min_{w\in\mathbb{R}^d}\sum_{(x,y)\in S\cup S'}\ell_{0-1}(\mathrm{sign}(w^Tx),y).$$

By Lemma 1, $S\cup S'$ is linearly separable. Thus, the update is equivalent to finding some linear separator of $S\cup S'$. When $\mathcal{A}$ is an arbitrary ERM solver, we can analyze the worst case convergence of adversarial training by viewing it as a game played between two players. At each iteration, Player 1 augments the current data with adversarial examples computed for the current model. Player 2 then tries to find a linear separator of all previously seen points with small margin. This specialization of Algorithm 1 is given in Algorithm 2.

In the following, we assume $S_t=S$ for all $t$. This only reduces the ability of the worst-case ERM solver to output some model with small margin. We say a sequence $\{w_t\}_{t=1}^T$ is admissible if it is generated according to $T$ iterations of Algorithm 2. Intuitively, the larger $T$ is (i.e., the more this game is played), the more restricted the set of linear separators of $S\cup S'$ becomes. We might hope that after a moderate number of rounds, the only feasible separators left have high margin with respect to the original training set $S$.

We show that this is not the case. Specifically, an ERM may still be able to output a linear separator with margin at most $\epsilon$, even after exponentially many iterations of adversarial training.

###### Theorem 4.

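Let $S=\{(\gamma v,1),(-\gamma v,-1)\}$, where $v$ is a unit vector in $\mathbb{R}^d$. Then, there is some constant $c$ such that for any $\epsilon\in(0,\alpha)$, there is an admissible sequence $\{w_t\}_{t\ge 1}$ such that $\mathrm{margin}_S(w_t)\le\epsilon$ for all $t$ satisfying

$$t\le\frac{1}{2}\exp\left(\frac{c(d-1)\epsilon^2}{(\gamma+\epsilon)^2}\right).$$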

The proof proceeds by relating the number of times an ERM can obtain margin $\epsilon$ to the size of spherical codes. These are arrangements of points on the sphere with some minimum angle constraint delsarte1991spherical ; kabatyanskii1974bounds ; delsarte1972bounds ; sloane1981tables and have strong connections to sphere packings and lattice density problems conway2013sphere . We show how an arbitrary ERM can use a spherical code of size $m$ to generate an admissible sequence with small margin for the first $m$ iterations. While computing spherical codes of maximal size is a notoriously difficult task cohn2014sphere , spherical codes with exponentially many points (in the dimension $d$) can be constructed with high probability by taking spherically symmetric points on the sphere at random. A full proof can be found in Appendix A.
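The random construction mentioned above can be illustrated numerically: independent, spherically symmetric unit vectors in high dimension have small pairwise inner products, so they form a spherical code with a large minimum angle. The sketch below is only an illustration; the function names, the dimension, the number of points, and the seed are all assumptions chosen for the example.

```python
import math
import random


def random_unit_vectors(m, d, seed=0):
    """Draw m points uniformly from the unit sphere in R^d via normalized Gaussians."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def max_pairwise_inner_product(vectors):
    """Largest inner product between two distinct vectors in the list."""
    best = -1.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            best = max(best, sum(a * b for a, b in zip(vectors[i], vectors[j])))
    return best


code = random_unit_vectors(m=30, d=400)
print(max_pairwise_inner_product(code))  # small compared to 1 when d is large
```

Typical inner products between such vectors are on the order of $1/\sqrt{d}$, which is the reason the number of nearly orthogonal directions available to a worst-case ERM grows so quickly with the dimension.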

This implies that even for relatively small $\epsilon$, the number of times an ERM can achieve margin $\epsilon$ is $\Omega(\exp(d))$ in the worst case. As we will show in the following sections, this worst-case scenario is overcome when we combine adversarial training with gradient dynamics.

## 4 Adversarial Training with Gradient Methods

We will now discuss gradient-based versions of adversarial training, in which we use gradients evaluated with respect to adversarially perturbed training points to update our model. Suppose that $\ell$ has associated empirical risk function $L$ as in (1). Let $w_0$ be some initial model. In adversarial training with gradient methods, at each $t\ge 0$, we select $S_t=\{(x_i^{(t)},y_i^{(t)})\}_{i=1}^m\subseteq S$ and update via

$$\delta_i^{(t)}=\arg\max_{\|\delta\|\le\alpha}\ \ell(w_t,x_i^{(t)}+\delta,y_i^{(t)}),\quad\forall i\in[m] \tag{6}$$

$$w_{t+1}=w_t-\frac{\eta_t}{|S_t|}\sum_{i=1}^m\nabla\ell(w_t,x_i^{(t)}+\delta_i^{(t)},y_i^{(t)}) \tag{7}$$

where $\eta_t$ is the step size and $\delta_i^{(t)}$ is treated as constant with respect to $w_t$ when computing the gradient $\nabla\ell(w_t,x_i^{(t)}+\delta_i^{(t)},y_i^{(t)})$. When $S_t=S$, we refer to this procedure as $\alpha$-GD. When $S_t$ is a single sample selected uniformly at random, we refer to this procedure as $\alpha$-SGD. Note that when $\alpha=0$, this becomes standard GD and SGD on $L$.
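A minimal full-batch sketch of the procedure just described, specialized to the logistic loss $\ln(1+\exp(-y\langle w,x\rangle))$, is given below. It is an illustration under assumptions rather than the authors' code: the helper names, the synthetic data generator, the step size, and the iteration count are hypothetical, and the inner maximization is solved in closed form for linear models (the worst-case $\ell_2$ perturbation of norm $\alpha$ is $-\alpha y\,w/\|w\|$). The perturbation is held fixed when the gradient is taken, as in the update above.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(w):
    return math.sqrt(dot(w, w))


def softplus(u):
    """Numerically stable ln(1 + exp(u))."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))


def sigmoid(u):
    """Derivative of softplus, computed without overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def robust_loss(w, data, alpha):
    """Average logistic loss at the worst-case l2 perturbation of each point."""
    return sum(softplus(-y * dot(w, x) + alpha * norm(w)) for x, y in data) / len(data)


def margin(w, data):
    return min(y * dot(w, x) for x, y in data) / norm(w)


def alpha_gd(data, alpha, step_size, num_iters):
    d = len(data[0][0])
    w = [0.0] * d
    for _ in range(num_iters):
        n = norm(w)
        w_bar = [wi / n for wi in w] if n > 0 else [0.0] * d
        grad = [0.0] * d
        for x, y in data:
            # Adversarial point x + delta, with delta = -alpha * y * w_bar held fixed.
            x_adv = [xi - alpha * y * wb for xi, wb in zip(x, w_bar)]
            coeff = sigmoid(-y * dot(w, x_adv))
            for k in range(d):
                grad[k] -= coeff * y * x_adv[k] / len(data)
        w = [wi - step_size * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical separable data with ||x|| <= 1: the label is the sign of the first coordinate.
rng = random.Random(0)
data = []
for _ in range(20):
    y = rng.choice([-1, 1])
    data.append(((y * rng.uniform(0.5, 0.8), rng.uniform(-0.6, 0.6)), y))

for alpha in (0.0, 0.2):
    w = alpha_gd(data, alpha, step_size=1.0, num_iters=2000)
    print(alpha, robust_loss(w, data, alpha), margin(w, data))
```

Setting `alpha` to zero recovers plain gradient descent on the logistic loss, so the same function can be used to compare the two regimes on the same data.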

Note that both these methods are special cases of Algorithm 1, where the update $\mathcal{A}$ is given by (7). Before we proceed, we present an alternate view of this method. Recall the functions $\ell_{\mathrm{rob}}$ and $L_{\mathrm{rob}}$ defined in (2). To understand $\alpha$-GD, we will use Danskin’s theorem danskin2012theory . We note that this was previously used in madry2017towards to justify adversarial training with gradient updates. The version we cite was shown by Bertsekas bertsekas1971control . A more modern proof can be found in bertsekas1997nonlinear .

###### Proposition 1 (Danskin).

Suppose $X$ is a non-empty compact topological space and $g:\mathbb{R}^d\times X\to\mathbb{R}$ is a continuous function such that $g(\cdot,\delta)$ is differentiable for every $\delta\in X$. Define

$$\delta^*(w)=\{\delta\in\arg\max_{\delta\in X}g(w,\delta)\},\qquad\psi(w)=\max_{\delta\in X}g(w,\delta).$$

Then $\psi$ is subdifferentiable with subdifferential given by $\partial\psi(w)=\mathrm{conv}\{\nabla_w g(w,\delta):\delta\in\delta^*(w)\}$.

Thus, we can compute subgradients of $L_{\mathrm{rob}}$ by solving the inner maximization problem (6) for each $(x,y)\in S$, and then taking a gradient. In other words, for a given $(x,y)$, let $\delta^*$ be a solution to (6). Then $\nabla\ell(w,x+\delta^*,y)\in\partial\ell_{\mathrm{rob}}(w,x,y)$. Therefore, $\alpha$-GD is a subgradient descent method for $L_{\mathrm{rob}}$, while $\alpha$-SGD is a stochastic subgradient method. Furthermore, if the solution to (6) is unique then Danskin’s theorem implies that $\alpha$-GD actually computes a gradient descent step, while $\alpha$-SGD computes a stochastic gradient step. Indeed, the above proposition also motivated madry2017towards and shaham2018understanding to use a projected gradient inner step to compute adversarial examples and approximate adversarial training with SGD.

For linear classification, we can derive stronger structural connections between $\ell$ and $\ell_{\mathrm{rob}}$.

###### Lemma 2.

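Suppose $\ell(w,x,y)=f(-y\langle w,x\rangle)$ for $f$ monotonically increasing and differentiable. Then, the following properties hold:

1. For all $w$, $\ell_{\mathrm{rob}}$ satisfies $\ell_{\mathrm{rob}}(w,x,y)=f(-y\langle w,x\rangle+\alpha\|w\|)$.

2. For all $w$, $\ell_{\mathrm{rob}}$ is subdifferentiable with $f'(-y\langle w,x\rangle+\alpha\|w\|)(-yx+\alpha\bar{w})\in\partial\ell_{\mathrm{rob}}(w,x,y)$, where $\bar{w}=w/\|w\|$ if $w\ne 0$ and $0$ otherwise.

3. If $f$ is strictly increasing, then $\ell_{\mathrm{rob}}$ is differentiable at all $w\ne 0$.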

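4. If $f$ is $M$-Lipschitz, $\beta$-smooth, and strictly increasing, then $\ell_{\mathrm{rob}}$ is twice differentiable at $w\ne 0$, in which case $\nabla^2\ell_{\mathrm{rob}}(w,x,y)\preceq\beta' I$, where $\beta'=\frac{\alpha M}{\|w\|}+\beta(\|x\|+\alpha)^2$.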

5. If $f$ is convex, then $\ell_{\mathrm{rob}}$ is convex.
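The closed form behind these structural properties can be checked numerically. For a loss of the form $f(-y\langle w,x\rangle)$ with $f$ increasing, the Cauchy-Schwarz inequality gives that the worst $\ell_2$ perturbation of norm at most $\alpha$ is $\delta^*=-\alpha y\,w/\|w\|$, so the robust loss equals $f(-y\langle w,x\rangle+\alpha\|w\|)$. The sketch below compares this value against randomly sampled perturbations for the logistic loss; the specific vectors, the sampling scheme, and all function names are illustrative assumptions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softplus(u):
    """Numerically stable ln(1 + exp(u)), the logistic choice of f."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))


def loss(w, x, y):
    return softplus(-y * dot(w, x))


def closed_form_perturbation(w, y, alpha):
    n = math.sqrt(dot(w, w))
    return [-alpha * y * wi / n for wi in w]


def robust_loss_closed_form(w, x, y, alpha):
    return softplus(-y * dot(w, x) + alpha * math.sqrt(dot(w, w)))


def random_perturbation(rng, d, alpha):
    """A random vector of norm at most alpha (not uniform in the ball)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(dot(g, g))
    r = alpha * rng.random()
    return [r * gi / n for gi in g]


rng = random.Random(0)
w, x, y, alpha = [1.5, -2.0, 0.5], [0.3, 0.1, -0.4], 1, 0.25

best_sampled = max(
    loss(w, [xi + di for xi, di in zip(x, random_perturbation(rng, 3, alpha))], y)
    for _ in range(2000)
)
delta_star = closed_form_perturbation(w, y, alpha)
attained = loss(w, [xi + di for xi, di in zip(x, delta_star)], y)

print(best_sampled, attained, robust_loss_closed_form(w, x, y, alpha))
```

No sampled perturbation should exceed the loss at the closed-form perturbation, and the latter should match the closed-form robust loss up to rounding.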

A full proof is given in Appendix B. Thus, if $f$ is convex, then $L_{\mathrm{rob}}$ is convex and $\alpha$-GD and $\alpha$-SGD perform (stochastic) subgradient descent on a convex, non-smooth function. Unfortunately, even if $\ell$ is smooth, $\ell_{\mathrm{rob}}$ is typically non-smooth. Standard results for convex, non-smooth optimization then suggest that $\alpha$-GD and $\alpha$-SGD obtain a convergence rate of $O(1/\sqrt{t})$ on $L_{\mathrm{rob}}$. However, this is a pessimistic convergence rate for subgradient methods on non-smooth convex functions. By Lemma 2, $\ell_{\mathrm{rob}}$ inherits many nice geometric properties from $\ell$. There is therefore ample reason to believe the pessimistic convergence rate is not tight. As we show in the following, $\alpha$-GD and $\alpha$-SGD actually minimize $L_{\mathrm{rob}}$ at a much faster rate.

In the next section, we analyze the convergence of $\alpha$-GD and $\alpha$-SGD, measured in terms of $L_{\mathrm{rob}}(w_t)$, as well as $\mathrm{margin}_S(w_t)$, for logistic regression. We adapt the classical analysis of the Perceptron algorithm from novikoff1962convergence to show that a given margin is obtained. To motivate this, we first analyze an adversarial training version of the Perceptron.

### 4.1 Adversarial Training with the Perceptron

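Let $f(u)=\max\{0,u\}$. Then $\ell(w,x,y)=\max\{0,-y\langle w,x\rangle\}$. For notational convenience, suppose that for all $(x,y)\in S$, $\|x\|\le 1$. Let $w_0=0$. Applying SGD with step-size $\eta=1$, we get updates of the form $w_{t+1}=w_t+g_t$ where $g_t=y_{i_t}x_{i_t}$ if $y_{i_t}\langle w_t,x_{i_t}\rangle\le 0$ and $g_t=0$ otherwise. This is essentially the Perceptron algorithm, in which case novikoff1962convergence implies the following.

###### Lemma 3.

This procedure stops after at most $1/\gamma^2$ non-zero updates, at which point $w_t$ linearly separates $S$.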

Suppose we instead perform $\alpha$-SGD with step-size $\eta=1$ and $w_0=0$. Given $w_t$, let $\bar{w}_t=w_t/\|w_t\|$ if $w_t\ne 0$ and $0$ otherwise. Lemma 2 implies that $\alpha$-SGD does the following: Sample $i_t$ uniformly at random, then update via

$$w_{t+1}=w_t+\begin{cases}y_{i_t}x_{i_t}-\alpha\bar{w}_t, & y_{i_t}\langle w_t,x_{i_t}\rangle-\alpha\|w_t\|\le 0\\ 0, & \text{otherwise.}\end{cases}$$

Due to its resemblance to the Perceptron, we refer to this update as the $\alpha$-Perceptron. We then get an analogous result on the number of iterations required to find classifiers with a given margin.

###### Lemma 4.

The $\alpha$-Perceptron stops after at most $\left(\frac{1+\alpha}{\gamma-\alpha}\right)^2$ non-zero updates, after which point $w_t$ has margin at least $\alpha$.

###### Proof.

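Assume the update at $t$ is non-zero, so $y_{i_t}\langle w_t,x_{i_t}\rangle-\alpha\|w_t\|\le 0$. Let $w^*$ be a unit vector that achieves margin $\gamma$. Then,

$$\langle w_{t+1}-w_t,w^*\rangle=\langle y_{i_t}x_{i_t}-\alpha\bar{w}_t,w^*\rangle=\langle y_{i_t}x_{i_t},w^*\rangle-\alpha\langle\bar{w}_t,w^*\rangle\ge\gamma-\alpha.$$

Therefore, after $T$ iterations, $\langle w_T,w^*\rangle\ge T(\gamma-\alpha)$. Next, we upper bound $\|w_{t+1}\|$ via:

$$\|w_{t+1}\|^2=\|w_t\|^2+2\left(y_{i_t}\langle w_t,x_{i_t}\rangle-\alpha\|w_t\|\right)+\|y_{i_t}x_{i_t}-\alpha\bar{w}_t\|^2\le\|w_t\|^2+(1+\alpha)^2.$$

The last step follows from the fact that we update iff $y_{i_t}\langle w_t,x_{i_t}\rangle-\alpha\|w_t\|\le 0$. Recursively, we find that $\|w_T\|^2\le T(1+\alpha)^2$, so $\|w_T\|\le\sqrt{T}(1+\alpha)$. Combining the above,

$$1\ge\frac{\langle w_T,w^*\rangle}{\|w_T\|\|w^*\|}\ge\frac{\sqrt{T}(\gamma-\alpha)}{1+\alpha}\implies T\le\left(\frac{1+\alpha}{\gamma-\alpha}\right)^2.$$

The update at $t$ is non-zero iff $w_t$ has margin at most $\alpha$ at $(x_{i_t},y_{i_t})$, so once the $\alpha$-Perceptron stops updating, $\mathrm{margin}_S(w_t)\ge\alpha$. ∎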
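The argument above can be exercised on synthetic data. The sketch below implements the update $w\leftarrow w+yx-\alpha\,w/\|w\|$ (with the normalized term taken as zero at $w=0$), applied whenever a point satisfies $y\langle w,x\rangle-\alpha\|w\|\le 0$, and counts non-zero updates. It is a simplified illustration: instead of sampling uniformly and skipping zero updates, it samples directly among the currently violating points, and the data generator, seed, and function names are assumptions. The data are built so that the first coordinate direction achieves margin at least 0.5 and every point has norm at most 1, which makes $\left(\frac{1+\alpha}{0.5-\alpha}\right)^2$ a valid bound on the number of updates.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def alpha_perceptron(data, alpha, rng, max_updates=10_000):
    """Run perceptron-style updates until every point has margin above alpha."""
    d = len(data[0][0])
    w = [0.0] * d
    updates = 0
    while updates < max_updates:
        n = math.sqrt(dot(w, w))
        w_bar = [wi / n for wi in w] if n > 0 else [0.0] * d
        violators = [(x, y) for x, y in data if y * dot(w, x) - alpha * n <= 0]
        if not violators:
            break  # no non-zero update is possible any more
        x, y = rng.choice(violators)
        w = [wi + y * xi - alpha * wb for wi, xi, wb in zip(w, x, w_bar)]
        updates += 1
    return w, updates


# Hypothetical data: label = sign of first coordinate, |x_1| >= 0.5, ||x|| <= 1.
rng = random.Random(1)
data = []
for _ in range(50):
    y = rng.choice([-1, 1])
    data.append(((y * rng.uniform(0.5, 0.8), rng.uniform(-0.6, 0.6)), y))

alpha = 0.2
w, updates = alpha_perceptron(data, alpha, rng)
final_margin = min(y * dot(w, x) for x, y in data) / math.sqrt(dot(w, w))
bound = ((1 + alpha) / (0.5 - alpha)) ** 2
print(updates, bound, final_margin)
```

Since the true max-margin of this dataset is at least 0.5, the bound computed with 0.5 in place of the max-margin is conservative but still applies.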

While simple, this result hints at an underlying, more general phenomenon for linearly separable datasets: The convergence of gradient-based adversarial training to a robust risk minimizer often mirrors the convergence of conventional gradient methods to an empirical risk minimizer. We demonstrate this principle formally in the following section for logistic regression.

## 5 Adversarial Training for Logistic Regression

We will now analyze the convergence and margin of $\alpha$-GD and $\alpha$-SGD for logistic regression. In logistic regression, $\ell(w,x,y)=f(-y\langle w,x\rangle)$ where $f(u)=\ln(1+\exp(u))$. Note that $f$ is convex, $1$-Lipschitz, and $\frac{1}{4}$-smooth, and bounded below by 0. For notational simplicity, suppose that $S=\{(x_i,y_i)\}_{i=1}^n$ with $\|x_i\|\le 1$ for all $i$. Thus, the max-margin of $S$ satisfies $\gamma\le 1$.

### 5.1 Convergence and Margin of α-GD

Let be the iterates of -GD with step-sizes . We will suppose that , and . These assumptions are not necessary, but simplify the statement and proofs of the following results. Full proofs of all results in this section can be found in Appendix C.

To analyze the convergence of $\alpha$-GD on $L_{\mathrm{rob}}$, we will use the fact that by Lemma 2, while $L_{\mathrm{rob}}$ is not smooth, it is smooth away from $0$. We then use a Perceptron-style argument inspired by ji2018risk to show that after a few iterations, the model $w_t$ produced by $\alpha$-GD has norm bounded below by some positive constant. We can then apply standard convergence techniques for gradient descent on smooth functions to derive the following.

###### Theorem 5.

Suppose , and , . Then ,

$$L_{\mathrm{rob}}(w_t)\le\frac{1}{t}+\left(\sum_{j=1}^{t-1}\eta_j\right)^{-1}\left(\frac{1}{4}+\frac{\ln(t)^2}{(\gamma-\alpha)^2}\right).$$

We can use the above results to show that after a polynomial number of iterations, we obtain a model with margin $\alpha$. To do so, we first require a straightforward lemma relating $L_{\mathrm{rob}}$ to margin.

###### Lemma 5.

If $L_{\mathrm{rob}}(w)\le\frac{\ln(2)}{n}$ then $\mathrm{margin}_S(w)\ge\alpha$.
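One way to see why a small robust loss forces a margin of at least $\alpha$ is the following short derivation. It assumes the logistic loss $f(u)=\ln(1+\exp(u))$, the closed form $\ell_{\mathrm{rob}}(w,x,y)=f(-y\langle w,x\rangle+\alpha\|w\|)$, a dataset of $n$ points, $w\ne 0$, and the threshold $\ln(2)/n$, which is what the factors $n$ and $\ln(2)$ in (8) suggest. It is a sketch of the reasoning, not a reproduction of the authors' proof.

```latex
\begin{align*}
L_{\mathrm{rob}}(w) \le \frac{\ln 2}{n}
  &\implies \ln\bigl(1+\exp(-y_i\langle w,x_i\rangle+\alpha\|w\|)\bigr)
     \le n\,L_{\mathrm{rob}}(w) \le \ln 2 \quad \text{for every } i
     && \text{(each summand is nonnegative)} \\
  &\implies -y_i\langle w,x_i\rangle+\alpha\|w\| \le 0 \quad \text{for every } i
     && \text{(since } f \text{ is increasing and } f(0)=\ln 2\text{)} \\
  &\implies \frac{y_i\langle w,x_i\rangle}{\|w\|} \ge \alpha \quad \text{for every } i
     && \text{(divide by } \|w\| > 0\text{)} \\
  &\implies \mathrm{margin}_S(w) \ge \alpha .
\end{align*}
```

The first step is where the factor $n$ enters: bounding the average by $\ln(2)/n$ bounds the sum, and hence every individual term, by $\ln 2$.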

We then get the following.

###### Corollary 1.

Suppose that for , and . For all , there is a constant such that for all satisfying

$$t\ge\max\left\{C_q,\left(\frac{n}{\eta(\gamma-\alpha)^2\ln(2)}\right)^q\right\}. \tag{8}$$

Ignoring all other terms, this implies that for all , iterations of -GD sufficient to obtain margin . The constant is how large must be so that for all , . As such, the constant tends to as tends to .

On the other hand, one can show that standard gradient descent may require exponentially many iterations to reach margin , even though it eventually converges to the max-margin classifier. This follows immediately from a direct adaptation of lower bounds from gunasekar2018implicit .

###### Theorem 6.

Let . Let be the iterates of GD with constant step-size initialized at for . For all , .

One can show that as decreases, this convergence rate only decreases. Thus, the exponentially slow convergence in margin is not an artifact of the choice of step-size, but rather an intrinsic property of gradient descent on logistic regression.

### 5.2 Convergence and Margin of α-SGD

Recall that at each iteration $t$, $\alpha$-SGD selects $i_t$ uniformly at random and updates via $w_{t+1}=w_t-\eta_t\nabla\ell(w_t,x_{i_t}+\delta_t,y_{i_t})$, where $\delta_t$ solves (6). We would like to derive similar results to those for $\alpha$-GD above. While we could simply try to derive the same results by taking expectations over the iterates of $\alpha$-SGD, this ignores relatively recent work that has instead derived high-probability convergence results for SGD ji2018risk ; rakhlin2011making . In particular, ji2018risk uses a martingale Bernstein bound from beygelzimer2011contextual to derive a high probability convergence rate for SGD on separable data. While the analysis cannot be used directly, we use the structural connections between $\ell$ and $\ell_{\mathrm{rob}}$ in Lemma 2 to adapt the techniques therein. We derive the following:

###### Theorem 7.

Let be the iterates of -SGD with constant step size and . For any , with probability at least , satisfies

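$$L_{\mathrm{rob}}(\hat{w}_t)\le\frac{1}{\eta t}\left(\frac{4\ln(t)}{\gamma-\alpha}+6\right)\left(\frac{8\ln(t)}{(\gamma-\alpha)^2}+\frac{8}{\gamma-\alpha}+4\ln(1/\delta)\right).$$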

A similar (but slightly more complicated) result can be shown when , which we have omitted for the sake of exposition. Using Lemma 5, we can now show that after iterations, with high probability, will have margin at least .

###### Corollary 2.

Let be the iterates of -SGD with constant step size and . For all , there is a constant

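$$t\ge\max\left\{C_q,\left[\frac{cn}{\eta}\left(\frac{1}{(\gamma-\alpha)^3}+\frac{\ln(1/\delta)}{\gamma-\alpha}\right)\right]^q\right\}$$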

then with probability at least , . Here, is some universal constant.

Ignoring all other factors, this implies that for any , with high probability iterations of -SGD are sufficient to obtain margin . As with -GD, the constant is how large must be so that for all , . Proofs of the above results can be found in Appendix D.

## 6 Experiments

To corroborate our theory, we evaluate $\alpha$-GD and $\alpha$-SGD on logistic regression with linearly separable data. As in our theory, we train linear classifiers $w$ whose prediction on $x$ is $\mathrm{sign}(\langle w,x\rangle)$. We compare $\alpha$-GD and $\alpha$-SGD for various values of $\alpha$. Note that when $\alpha=0$, $\alpha$-GD and $\alpha$-SGD are identical to the standard GD and SGD training algorithms, which we use as benchmarks.

#### Evaluation metrics.

We evaluate these methods in three ways. First, we compute the training loss in (1). Second, we compute the margin in (3). To aid clarity, we plot the truncated margin, $\max\{0,\mathrm{margin}_S(w_t)\}$. Third, we plot the robust training loss in (2). This is governed by $\alpha$. For convenience, we refer to this as the $\alpha$-robust loss and denote it by $L_\alpha$. To compare $\alpha$-SGD for different values of $\alpha$, we plot $L_\alpha$ for $\alpha$-SGD. In particular, standard GD and SGD correspond to $\alpha=0$, in which case we plot $L_0$, which coincides with $L$ in (1).

#### Setup and implementation.

All experiments were implemented in PyTorch. We vary

over . When , we get standard GD and SGD. In all experiments, we use a constant step-size that is tuned for each . The tuning was done by varying over , evaluating the average value of after iterations, and selecting the step-size with the smallest loss. For -SGD, we did the same, but for

averaged over 5 trials. When plotting the above evaluation metrics for

-SGD, we ran multiple trials (where the number varied depending on the dataset) and plotted the average, as well as error bars corresponding to the standard deviation.

#### Synthetic data.

We draw uniformly at random from circles of radius 1 centered at and . These correspond to and labeled points, respectively. We draw points from each circle, and also add the points and , where . This guarantees that the max-margin is . We initialize at . While we observe similar behavior for any reasonable initialization, this initialization is used to compare how the methods “correct” bad models. For $\alpha$-SGD, we computed the average and standard deviation of the evaluation metrics above over 5 trials.

#### Real data.

We use the Iris Dataset Dua:2019 , which contains data for 3 classes, Iris-setosa, Iris-versicolor, and Iris-virginica. Iris-setosa is linearly separable from Iris-virginica with max-margin . We initialize with entries drawn from . We found that our results were not especially sensitive to the initialization scheme. While different initializations result in minor changes to the plots below, the effects were consistently uniform across different . For

-SGD, we computed the average and standard deviation of the evaluation metrics above over 9 trials. Note that we increased the number here due to the increased variance of single-sample SGD on this dataset over the synthetic dataset above.

#### Discussion.

The results for $\alpha$-GD on the synthetic dataset and the Iris dataset are given in Figures 1 and 2, while the results for $\alpha$-SGD on the synthetic dataset and the Iris dataset are given in Figures 3 and 4. The plots corroborate our theory for $\alpha$-GD and $\alpha$-SGD. Moreover, the results for these two methods are extremely similar on both datasets. The most notable difference is that for the margin plot on the Iris dataset, the margin for $\alpha$-SGD resembles a noisy version of the margin plot for $\alpha$-GD. This is expected, as $\alpha$-SGD focuses only on one example at a time, potentially decreasing the margin at other points, while $\alpha$-GD computes adversarial examples for every element of the training set at each iteration.

We see that -GD and -SGD quickly attain margin on both datasets, and once they do their margin convergence slows down. Moreover, the larger is, generally the larger the achieved margin is at any given iteration. Generally GD and SGD take much longer to obtain a given margin than -GD and -SGD. As reflected by previous work on the implicit bias of such methods gunasekar2018characterizing ; gunasekar2018implicit ; nacson2019convergence ; nacson2018stochastic ; soudry2018implicit , we see a logarithmic convergence to the max-margin in both settings. One interesting observation is that -GD and -SGD minimize the training loss faster than standard GD and SGD, despite not directly optimizing this loss function. Finally, we see that for , -GD and -SGD generally seem to exhibit a convergence rate for . However, the convergence rate seems to increase proportionally to . Intuitively, becomes more difficult to minimize as increases.

## 7 Conclusion

In this paper, we analyzed adversarial training on separable data. We showed that while generic adversarial training and standard gradient-based methods may each require exponentially many iterations to obtain large margin, their combination exhibits a strong bias towards models with large margin that translates to fast convergence to these robust solutions. There are a large number of possible extensions. First, we would like to understand the behavior of these methods on non-separable data, especially with regard to . Second, we would like to generalize our results to 1) multi-class classification, and 2) regression tasks. While the former is relatively straightforward, the latter will necessarily require new methods and perspectives, due to differences in the behavior of when is a loss function for classification or regression.

## Appendix A Proof of Theorem 4

Recall that in Algorithm 2, at each iteration $t$ the learner selects $S_t\subseteq S$ and then computes the adversarial examples in (4) for each $(x,y)\in S_t$ at the current model $w_t$. This set of adversarial examples is defined as $S'_t$. We will assume throughout that $S_t=S$, as this only diminishes the adversary’s ability to obtain small margin.

Define . Let denote the unit sphere in . For any , we define to be the collection of subsets of of maximal size such that any two distinct elements satisfy ; these subsets are referred to as spherical codes. We let denote the size of any . For , we will relate the number of times an adversary can find a classifier with margin to . In the following, we will let be the vector with first coordinate of , and remaining coordinates of . Without loss of generality, we can assume the unit vector in the statement of Theorem 4 satisfies .

###### Lemma 6.

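Let $S=\{(\gamma e_1,1),(-\gamma e_1,-1)\}$. For any $\epsilon\in(0,\alpha)$, there is an admissible sequence $\{w_t\}_{t\ge 1}$ such that $\mathrm{margin}_S(w_t)\le\epsilon$ for all $t$ satisfying

$$t\le N\left(d-1,\frac{\epsilon(\gamma^2-\epsilon\alpha)}{\alpha(\gamma^2-\epsilon^2)}\right).$$

###### Proof.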

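Let $x_1=\gamma e_1$ and $x_2=-\gamma e_1$. Note that $S$ has max-margin $\gamma$. Fix $\epsilon$ and let

$$\{v_1,\ldots,v_m\}\in C\left(d-1,\frac{\epsilon(\gamma^2-\epsilon\alpha)}{\alpha(\gamma^2-\epsilon^2)}\right).$$

Let $a=\epsilon/\gamma$. For $1\le t\le m$, define $w_t\in\mathbb{R}^d$ by

$$w_t^T=\begin{bmatrix}a & \left(\sqrt{1-a^2}\right)v_t^T\end{bmatrix}.$$

That is, the first coordinate of $w_t$ is $a$, while its remaining coordinates are given by $\sqrt{1-a^2}\,v_t$. Since $\|v_t\|=1$, we have $\|w_t\|=1$. We will show that each $w_t$ is admissible and has margin at most $\epsilon$ with respect to $S$.

For any $t$, we have

$$\langle w_t,x_1\rangle=\gamma a=\epsilon>0,\qquad\langle w_t,x_2\rangle=-\gamma a=-\epsilon<0.$$

Thus, each $w_t$ correctly classifies $S$. Moreover, since $\|w_t\|=1$, its margin at $S$ is $\epsilon$. We now must show that each $w_t$ correctly classifies $S'$.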

Recall that we assume is of the form where is a monotonically increasing function. This implies that given , , and , satisfies (4). Therefore, for ,

 S′

Given and , and by construction of the , we have

 ⟨wt,γe1−αwi⟩ =⟨wt,γe1⟩−α⟨wt,wi⟩ =ϵ−α(a2+(1−a)2⟨vt,vi⟩) =ϵ−αϵ2γ2−α(1−ϵ2γ2)⟨vt,vi⟩ >ϵ−αϵ2γ