Convergence of SGD in Learning ReLU Models with Separable Data

06/12/2018 ∙ by Tengyu Xu, et al. ∙ The Ohio State University

We consider the binary classification problem in which the objective function is the exponential loss with a ReLU model, and study the convergence property of the stochastic gradient descent (SGD) algorithm on linearly separable data. We show that the gradient descent (GD) algorithm does not always learn desirable model parameters due to the nonlinear ReLU model. Then, we identify a certain condition on the data samples under which we show that SGD can learn a proper classifier with implicit bias. Specifically, we establish a sub-linear convergence rate of the function value generated by SGD to the global minimum. We further show that SGD in fact converges in expectation to the maximum-margin classifier with respect to the samples with +1 label under the ReLU model at the rate O(1/ln t). We also extend our study to the case of multiple ReLU neurons, and show that SGD converges to a certain non-linear maximum-margin classifier for a class of non-linearly separable data.


1 Introduction

It has been observed in various machine learning problems recently that the gradient descent (GD) algorithm and the stochastic gradient descent (SGD) algorithm converge to solutions with certain properties even without explicit regularization in the objective function. Correspondingly, theoretical analysis has been developed to explain such implicit regularization property. For example, it has been shown in Gunasekar et al. (2018, 2017) that GD converges to the solution with the minimum norm under certain initialization for regression problems, even without an explicit norm constraint.

Another type of implicit regularization, where GD converges to the max-margin classifier, has been recently studied in Gunasekar et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a); Soudry et al. (2017, 2018) for classification problems as we describe below. Given a set of training samples $(x_i, y_i)$ for $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^d$ denotes a feature vector and $y_i \in \{+1, -1\}$ denotes the corresponding label, the goal is to find a desirable linear model (i.e., a classifier) by solving the following empirical risk minimization problem

$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i\, w^\top x_i\big). \qquad (1)$$

It has been shown in Nacson et al. (2018a); Soudry et al. (2017, 2018) that if the loss function is monotonically strictly decreasing and satisfies proper tail conditions (e.g., the exponential loss), and the data are linearly separable, then GD converges to the solution with infinite norm and the maximum margin direction of the data, although there is no explicit regularization towards the max-margin direction in the objective function. Such a phenomenon is referred to as the implicit bias of GD, and can help to explain some experimental results. For example, even when the training error achieves zero (i.e., the resulting model enters into the linearly separable region that correctly classifies the data), the testing error continues to decrease, because the direction of the model parameter continues to have an improved margin. Such a study has been further generalized to hold for various other types of gradient-based algorithms Gunasekar et al. (2018). Moreover, Ji & Telgarsky (2018) analyzed the convergence of GD with no assumption on the data separability, and characterized the implicit regularization to be in a subspace-based form.
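For reference, the implicit bias result described above can be stated compactly as follows; this is a standard formulation of the result of Soudry et al. (2017, 2018), written in our own notation.

```latex
\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^d} \|w\|_2^2
\quad \text{s.t.} \quad y_i\, w^{\top} x_i \ge 1 \ \ \text{for all } i,
\qquad \text{and} \qquad
\lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} \;=\; \frac{\hat{w}}{\|\hat{w}\|_2},
```

i.e., the GD iterates diverge in norm but converge in direction to the hard-margin SVM solution.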

The focus of this paper is on the following two fundamental issues, which have not been well addressed by existing studies.

  • Existing studies have so far focused only on the linear classifier model. A natural question is what happens for the more general nonlinear leaky ReLU and ReLU models: will GD still converge, and if so, will it converge to the max-margin direction? Our study provides new insights for the ReLU model that have not been observed for the linear model in previous studies.

  • Existing studies mainly analyzed the convergence of GD, with the only exceptions being Ji & Telgarsky (2018); Nacson et al. (2018b), which considered SGD. However, Ji & Telgarsky (2018) did not establish the convergence of SGD to the max-margin direction, and Nacson et al. (2018b) established the convergence to the max-margin solution only epoch-wise for cyclic SGD (not iteration-wise for SGD under random sampling with replacement). Moreover, both studies considered only the linear model. Here, our interest is to explore the iteration-wise convergence of SGD under random sampling with replacement to the max-margin direction, and our result also sheds light on online SGD. Furthermore, our study provides new understanding for the nonlinear ReLU and leaky ReLU models.

1.1 Main Contributions

We summarize our main contributions, where our focus is on the exponential loss function under the ReLU model.

We first characterize the landscape of the empirical risk function under the ReLU model, which is nonconvex and nonsmooth. We show that such a risk function has asymptotic global minima and asymptotic spurious local minima. Such a landscape is in sharp contrast to that under the linear model previously studied in Soudry et al. (2017), where there exist only equivalent global minima.

Based on the landscape property, we show that the implicit bias in the course of the convergence of GD can fall into four cases: GD converges to the asymptotic global minimum along the max-margin direction, converges to an asymptotic local minimum along a local max-margin direction, stops at a finite spurious local minimum, or oscillates between the linearly separable and misclassified regions without convergence. Such diverse behavior is in sharp contrast to that under the linear model Soudry et al. (2017), where GD always converges to the max-margin direction.

We then take a further step to study the implicit bias of SGD. We show that the expected averaged weight vector normalized by its expected norm converges to the global max-margin direction or local max-margin direction, as long as SGD stays either in the linearly separable region or in a region of the local minima defined by a subset of data samples with positive label. The proof here requires considerable new technical developments, which are very different from the traditional analysis of SGD, e.g., Bottou et al. (2016); Duchi & Singer (2009); Nemirovskii et al. (1983); Shalev-Shwartz et al. (2009); Xiao (2010); Bach & Moulines (2013); Bach (2014). This is because our focus here is on the exponential loss function without attainable global/local minima, whereas traditional analysis typically assumed that the minimum of the loss function is attainable. Furthermore, our goal is to analyze the implicit bias property of SGD, which is also beyond traditional analysis of SGD.

We further extend our analysis to the leaky ReLU model and multi-neuron networks.

1.2 Related Work

Implicit bias of gradient descent: Gunasekar et al. (2018) studied the implicit bias of GD and SGD for minimizing the squared loss function with a bounded global minimum, and showed that some of these algorithms converge to a global minimum that is closest to the initial point. Another collection of papers Gunasekar et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a); Soudry et al. (2017); Telgarsky (2013); Soudry et al. (2018) characterized the implicit bias of algorithms for loss functions without attainable global minimum. Telgarsky (2013) showed that AdaBoost converges to an approximate max-margin classifier. Soudry et al. (2017, 2018) studied the convergence of GD in logistic regression with linearly separable data and showed that GD converges in direction to the solution of the support vector machine. Nacson et al. (2018a) improved the rate of this directional convergence under the exponential loss via normalized gradient descent. Gunasekar et al. (2018) further showed that steepest descent can lead to margin maximization under generic norms. Ji & Telgarsky (2018) analyzed the convergence of GD on an arbitrary dataset, and provided the convergence rates along the strongly convex subspace and the separable subspace. Our work studies the convergence of GD and SGD under the nonlinear ReLU model with the exponential loss, as opposed to the linear model studied by all the above previous works on the same type of loss functions.

Implicit bias of SGD: Ji & Telgarsky (2018) analyzed averaged SGD (under random sampling) with a fixed learning rate and proved the convergence of the population risk, but did not establish the parameter convergence of SGD to the max-margin direction. Nacson et al. (2018b) established the epoch-wise convergence of cyclic SGD in direction to the max-margin classifier. Our work differs from these two studies first in that we study the ReLU model, whereas both of these studies analyzed the linear model. Furthermore, we show that under SGD with random sampling, the expectation of the averaged weight vector converges in direction to the max-margin classifier at a rate of O(1/ln t).

Generalization of SGD: There have been extensive studies of the convergence and generalization performance of SGD under various models, of which we cannot provide a comprehensive list due to space limitations. In general, these types of studies either characterize the convergence rate of SGD or provide generalization error bounds at the convergence of SGD, e.g., Brutzkus et al. (2017); Wang et al. (2018); Li & Liang (2018), but do not characterize the implicit regularization property of SGD, such as the convergence to the max-margin direction as provided in our paper.

2 ReLU Classification Model

We consider the binary classification problem, in which we are given a set of training samples $\{(x_i, y_i)\}_{i=1}^{n}$. Each training sample contains an input data $x_i \in \mathbb{R}^d$ and a corresponding binary label $y_i \in \{+1, -1\}$. We denote $\mathcal{I}_+$ as the set of indices of samples with label $+1$ and denote $\mathcal{I}_-$ in a similar way. Their cardinalities are denoted as $n_+$ and $n_-$, respectively, and are assumed to be non-zero. We consider all datasets that are linearly separable, i.e., there exists a linear classifier $u$ such that $y_i\, u^\top x_i > 0$ for all $i$.

We are interested in training a ReLU model for the classification task. Specifically, for a given input data $x$, the model outputs $\sigma(w^\top x)$, where $\sigma(z) := \max\{z, 0\}$ is the ReLU activation function and $w \in \mathbb{R}^d$ denotes the weight parameters. The predicted label is set to be the sign of the model output. Our goal is to learn a classifier by solving the following empirical risk minimization problem, where we adopt the exponential loss.

$$\min_{w \in \mathbb{R}^d} \ L(w) := \frac{1}{n} \sum_{i=1}^{n} \exp\!\big(-y_i\, \sigma(w^\top x_i)\big). \qquad \text{(P)}$$

The ReLU activation causes the loss function in problem (P) to be nonconvex and nonsmooth. Therefore, it is important to first understand the landscape property of the loss function, which is critical for characterizing the implicit bias property of the GD and SGD algorithms.
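To make the setup concrete, the following minimal NumPy sketch implements the ReLU model and the exponential empirical risk of problem (P). The helper names and the convention of mapping a zero model output to the label -1 are our own choices.

```python
import numpy as np

def relu(z):
    # ReLU activation: sigma(z) = max(z, 0)
    return np.maximum(z, 0.0)

def empirical_risk(w, X, y):
    """Exponential loss of the ReLU model sigma(w^T x) on data (X, y).

    X: (n, d) array of features; y: (n,) array of labels in {+1, -1}.
    """
    margins = y * relu(X @ w)           # y_i * sigma(w^T x_i)
    return np.mean(np.exp(-margins))    # (1/n) sum_i exp(-y_i sigma(w^T x_i))

def predict(w, X):
    # predicted label: sign of the model output, mapping a zero output to -1
    return np.where(relu(X @ w) > 0, 1, -1)
```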

3 Implicit Bias of GD in Learning ReLU Model

3.1 Landscape of ReLU Model

In order to understand the convergence of GD under the ReLU model, we first study the landscape of the loss function in problem (P), which turns out to be very different from that under the linear activation model. As shown in Soudry et al. (2017); Ji & Telgarsky (2018), the loss function in problem (P) under linear activation is convex and achieves its global minimum only asymptotically, i.e., the loss approaches its infimum only as the classifier scales to infinity along a direction in the linearly separable region. In contrast, under the ReLU model, the asymptotic critical points can be either global minima or (spurious) local minima depending on the training dataset, and hence the convergence property of GD can be very different in nature from that under the linear model.

The following theorem characterizes the landscape properties of problem (P). Throughout, we denote the infimum of the objective function in problem (P) as $L^*$. Furthermore, we call a direction asymptotically critical if the gradient of the objective vanishes as the weight vector scales to infinity along that direction.

Theorem 3.1 (Asymptotic landscape property).

For problem (P) under the ReLU model, any asymptotic critical direction falls into one of the following cases:

  1. (Asymptotic global minimum): the direction classifies all data samples correctly. Then, scaling the weights to infinity along this direction drives the objective to its infimum $L^*$.

  2. (Asymptotic local minimum): the direction classifies some (but not all) of the samples with label $+1$ as negative and correctly classifies the remaining samples. Then, scaling the weights to infinity along this direction drives the objective to a sub-optimal value strictly larger than $L^*$.

  3. (Local minimum): the direction classifies all data samples as negative, so that the ReLU unit is never activated. Then, the objective has constant value 1 along this direction.

To further elaborate Theorem 3.1, if a direction classifies all data correctly (item 1), then the objective function possibly achieves the global minimum along this direction. On the other hand, if the direction classifies some data with label $+1$ as negative (item 2), then the objective function approaches a sub-optimal value along this direction. In the worst case where all data samples are classified as negative (item 3), the ReLU unit is never activated and hence the corresponding objective function has constant value 1. We note that the cases in items 2 and 3 may or may not take place depending on the specific dataset, but if they do occur, the corresponding directions are spurious (asymptotic) local minima. In summary, the landscape under the ReLU model can be partitioned into different regions, where gradient descent algorithms can have different implicit bias, as we show next.
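The three cases can be observed numerically. The toy 2-D dataset and the three directions below are hypothetical choices of ours, used only to illustrate Theorem 3.1: scaling along a direction that classifies everything correctly drives the loss to its infimum (here 1/3), a direction that suppresses one positively labeled sample settles at a strictly larger value (2/3), and a direction that deactivates the ReLU on every sample gives the constant loss 1.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
loss = lambda w, X, y: np.mean(np.exp(-y * relu(X @ w)))

# hypothetical toy dataset: two samples with label +1, one with label -1
X = np.array([[1.0, 0.5], [1.0, -0.5], [0.5, 1.0]])
y = np.array([1, 1, -1])

directions = {
    "case 1 (all correct)":          np.array([1.0, -1.0]),
    "case 2 (one +1 suppressed)":    np.array([0.0, -1.0]),
    "case 3 (ReLU never activated)": np.array([-1.0, 0.0]),
}

for name, d in directions.items():
    values = [loss(c * d, X, y) for c in (1.0, 10.0, 100.0)]
    print(name, np.round(values, 4))
```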

3.2 Convergence of GD

In this subsection, we analyze the convergence of GD in learning the ReLU model. At each iteration $t$, GD performs the update

$$w_{t+1} = w_t - \eta\, \nabla L(w_t), \qquad \text{(GD)}$$

where $\eta > 0$ denotes the stepsize. For the linear model, whose loss function has infinitely many asymptotic global minima, it has been shown in Soudry et al. (2017) that GD always converges to the max-margin direction. Such a phenomenon is regarded as the implicit bias property of GD. Here, for the ReLU model, we are interested in whether such an implicit bias property still holds. Furthermore, since the loss function under the ReLU model possibly contains spurious asymptotic local minima, the convergence of GD under the ReLU model can be very different from that under the linear model.
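As a sanity check, a bare-bones gradient descent loop on this loss can be written as below; the subgradient convention sigma'(0) = 0, the constant stepsize, and the iteration budget are illustrative choices of ours.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def loss_and_grad(w, X, y):
    z = X @ w
    active = (z > 0).astype(float)              # sigma'(z), with sigma'(0) := 0
    e = np.exp(-y * relu(z))                    # per-sample exponential losses
    grad = -(X.T @ (e * y * active)) / len(y)   # gradient of the averaged loss
    return e.mean(), grad

def gd(X, y, w0, eta=0.1, iters=5000):
    w = w0.astype(float).copy()
    for _ in range(iters):
        _, g = loss_and_grad(w, X, y)
        w -= eta * g
    return w
```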

Next, we introduce various notions of margin in order to characterize the implicit bias under the ReLU model. The global max-margin direction is defined with respect to the samples in $\mathcal{I}_+$ only. Such a notion of max-margin is natural because the ReLU activation function suppresses negative inputs. We note that this direction may not lie in the linearly separable region, and hence it may not be parallel to any (asymptotic) global minimum. As we show next, only when the global max-margin direction lies in the linearly separable region may GD converge in direction to it under the ReLU model. Furthermore, for each given subset of $\mathcal{I}_+$, we define an associated local max-margin direction in the same way, restricted to that subset.
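For concreteness, a standard way to write these two directions is given below; the unit-norm normalization and the notation ($\mathcal{I}_+$ for the positively labeled samples, $\mathcal{I}_s \subseteq \mathcal{I}_+$ for a subset) are our assumptions, so this is a sketch rather than the paper's exact definition.

```latex
w^{\ast} \;=\; \arg\max_{\|w\|_2 = 1} \; \min_{i \in \mathcal{I}_+} \, w^{\top} x_i ,
\qquad\qquad
w^{\ast}_{s} \;=\; \arg\max_{\|w\|_2 = 1} \; \min_{i \in \mathcal{I}_s} \, w^{\top} x_i .
```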

We further denote the set of asymptotic local minima associated with such a subset (see Theorem 3.1, item 2). Of course, this set may or may not be empty for a given subset, and the associated local max-margin direction may or may not belong to it, depending on the specific training dataset. As we show next, only when such a set is non-empty and contains the corresponding local max-margin direction may GD converge to such an asymptotic local minimum direction under the ReLU model. Next, we present the implicit bias of GD for learning the ReLU model in problem (P).

Theorem 3.2.

Apply GD to solve problem (P) with arbitrary initialization and a small enough constant stepsize. Then, the sequence generated by GD falls into one of the following cases.

  1. The loss converges to the global infimum, the norm of $w_t$ grows to infinity, and $w_t$ converges in direction to the global max-margin direction, which lies in the linearly separable region;

  2. the direction of $w_t$ does not converge and oscillates between the linearly separable and misclassified regions; in this case the global max-margin direction does not lie in the linearly separable region;

  3. the loss converges to the value of a spurious (asymptotic) local minimum, the norm of $w_t$ grows to infinity, and $w_t$ converges in direction to a local max-margin direction that belongs to the associated set of asymptotic local minima;

  4. $w_t$ reaches a point at which the gradient is zero, i.e., GD terminates within finite steps.

Theorem 3.2 characterizes various instances of the implicit bias of GD in learning the ReLU model, where the nature of the convergence is different from that in learning the linear model. Specifically, GD can either converge in direction to the global max-margin direction, which leads to the global minimum, or converge to a local max-margin direction, which leads to a spurious local minimum. Furthermore, GD may oscillate between the linearly separable region and the misclassified region due to the suppression effect of the ReLU function; in this case, GD has neither an implicit bias property nor a convergence guarantee. We provide two simple examples in the supplementary material to further elaborate these cases.

3.3 Implicit Bias of SGD in Learning ReLU Models

In this subsection, we analyze the convergence property and the implicit bias of SGD for solving problem (P). At each iteration $t$, SGD samples an index $\xi_t$ uniformly at random with replacement and performs the update

$$w_{t+1} = w_t - \eta_t\, \nabla \ell_{\xi_t}(w_t), \qquad \text{where } \ \ell_i(w) := \exp\!\big(-y_i\, \sigma(w^\top x_i)\big). \qquad \text{(SGD)}$$

Similarly to the convergence of GD characterized in Theorem 3.2, SGD may oscillate between the linearly separable and misclassified regions. Therefore, our major interest here is the implicit bias of SGD when it does converge, either to an asymptotic global minimum or to an asymptotic local minimum. Thus, without loss of generality, we implicitly assume that the iterates eventually stay either in the linearly separable region or in a region of asymptotic local minima; otherwise, SGD does not even converge.
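A minimal sketch of this update rule is given below. The polynomially diminishing stepsize and the plain running average of the iterates are illustrative choices of ours; the paper's exact stepsize schedule is the one referred to in Proposition 1 and Theorems 3.3 and 3.4.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def sgd(X, y, w0, q=0.75, iters=20000, seed=0):
    """SGD with replacement on the exponential loss of the ReLU model."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    w_avg = np.zeros_like(w)
    n = len(y)
    for t in range(iters):
        i = rng.integers(n)                      # uniform sampling with replacement
        z = float(X[i] @ w)
        if z > 0:                                # ReLU active: nonzero stochastic gradient
            w += (t + 1.0) ** (-q) * np.exp(-y[i] * z) * y[i] * X[i]
        w_avg += (w - w_avg) / (t + 1)           # running average of the iterates
    return w, w_avg
```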

The implicit bias of SGD with replacement sampling has not been studied in the existing literature, and the proof of the convergence and the characterization of the implicit bias require substantial new technical developments. In particular, traditional analysis of SGD for convex functions requires the assumption that the variance of the gradient is bounded Bottou et al. (2016); Bach (2014); Bach & Moulines (2013). Instead of making such an assumption, we next prove that SGD enjoys a nearly-constant bound on the variance, up to a factor that is logarithmic in $t$, in learning the ReLU model.

Proposition 1 (Variance bound).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ either stays in the linearly separable region or stays in a region of asymptotic local minima, then, with an appropriately chosen diminishing stepsize, the accumulated variance of the stochastic gradients sampled by SGD along the iteration path grows at most logarithmically in $t$ for all $t \ge t_0$.

Proposition 1 shows that the summation of the norms of the stochastic gradients grows only logarithmically fast. This implies that the variance of the stochastic gradients is well controlled. In particular, for a suitable choice of the stepsize parameter, the bound in Proposition 1 implies that the corresponding variance term stays at a constant level. Based on the variance bound in Proposition 1, we next establish the convergence rate of SGD for learning the ReLU model. Throughout, we denote by $\bar w_t$ the averaged iterate generated by SGD.

Theorem 3.3 (Convergence rate of loss).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in the linearly separable region, then, with an appropriately chosen diminishing stepsize, the expected loss of the averaged iterate $\bar w_t$ generated by SGD converges sub-linearly to the global minimum value of problem (P).

If instead there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in a region of asymptotic local minima, then, with the same stepsize, the expected loss of $\bar w_t$ converges at the same rate to the corresponding local minimum value.

Theorem 3.3 establishes the convergence rate of the expected risk of the averaged iterates generated by SGD. It can be seen that SGD converges to different loss values, corresponding to the global or the local minimum, in the different regions. The stepsize is set to be diminishing to compensate for the variance introduced by SGD. In particular, if the stepsize exponent is chosen to be sufficiently close to its upper limit, then the convergence rate nearly matches the standard result of SGD in convex optimization up to a logarithmic factor. Theorem 3.3 also implies that the convergence of SGD is attained as the norm of the averaged iterate grows unboundedly. We note that the analysis of Theorem 3.3 is different from that of SGD in traditional convex optimization, which requires the global minimum to be achieved at a bounded point and assumes the variance of the stochastic gradients to be bounded by a constant Shalev-Shwartz et al. (2009); Duchi & Singer (2009); Nemirovski et al. (2009). These assumptions do not hold here.

Theorem 3.4 (Implicit bias of SGD).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in the linearly separable region, then, with an appropriately chosen diminishing stepsize, the expected averaged iterate generated by SGD converges in direction to the global max-margin direction.

If instead there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in a region of asymptotic local minima, then, with the same stepsize, the expected averaged iterate converges in direction to the corresponding local max-margin direction.

Theorem 3.4 shows that the direction of the expected averaged iterate generated by SGD converges to the (global or local) max-margin direction, without any explicit regularizer in the objective function. The proof of Theorem 3.4 requires a detailed analysis of the SGD update under the ReLU model and is substantially different from that under the linear model Soudry et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a, b). In particular, we need to handle the variance of the stochastic gradients introduced by SGD and exploit its classification properties under the ReLU model.

We next provide an example class of datasets (which has been studied in Combes et al. (2018)), for which we show that SGD stays stably in the linearly separable region.

Proposition 2.

If the linearly separable samples satisfy the following conditions given in Combes et al. (2018):

  1. For all , it holds that ;

  2. For all , it holds that ,

then there exists a $t_0$ such that for all $t \ge t_0$ the sequence generated by SGD stays in the linearly separable region, as long as SGD is not initialized at the local minima described in item 3 of Theorem 3.1.

We also point out that any linearly separable dataset can be made to satisfy the condition in Proposition 2 after a proper transformation, e.g., data augmentation that pads an extra coordinate equal to $+1$ to the samples with label $+1$ and equal to $-1$ to the samples with label $-1$. Such a data transformation changes the landscape of the ReLU model into a more optimization-friendly version that helps regularize the SGD path.
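A minimal sketch of this augmentation, with the padded coordinate appended as the last feature (an implementation choice of ours):

```python
import numpy as np

def augment(X, y):
    """Append one extra coordinate equal to the label: +1 for samples with
    label +1 and -1 for samples with label -1."""
    pad = y.reshape(-1, 1).astype(float)
    return np.hstack([X, pad])

# usage:
# X_aug = augment(X, y)   # train the ReLU model on (X_aug, y) instead of (X, y)
```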

4 Further Extensions and Discussions

4.1 Leaky ReLU Models

The leaky ReLU activation takes the form $\sigma_\alpha(z) = \max\{z, \alpha z\}$, where the parameter $\alpha \in (0, 1)$. Clearly, leaky ReLU takes the linear and ReLU models as two special cases, corresponding to $\alpha = 1$ and $\alpha = 0$, respectively. Since the convergence of GD/SGD under the ReLU model is very different from that under the linear model, a natural question is whether leaky ReLU with an intermediate parameter behaves like the linear model or like the ReLU model.

It can be shown that the loss function in problem (P) under the leaky ReLU model has only asymptotic global minima, achieved in the separable region as the norm of the classifier grows to infinity (there exist no asymptotic local minima). Hence, the convergence of GD is similar to that under the linear model, where the only difference is that the max-margin classifier needs to be defined based on leaky ReLU as follows.

For the given set of linearly separable data samples, we construct a new set of data in which the samples with label $-1$ are scaled by the leaky ReLU parameter $\alpha$ and the samples with label $+1$ are kept unchanged. Without loss of generality, we assume that the max-margin classifier for this transformed data passes through the origin after a proper translation. Then, we define the max-margin direction under the leaky ReLU model as the max-margin direction of the transformed data.

Then, following the result under the linear model in Soudry et al. (2017), it can be shown that GD with arbitrary initialization and a small enough constant stepsize for solving problem (P) under the leaky ReLU model drives the loss to zero and converges in direction to this max-margin direction, with the norm of the iterate going to infinity.

Furthermore, following our result of Theorem 3.4, it can be shown that for SGD applied to solve problem (P) under the leaky ReLU model with any initialization, if there exists $t_0$ such that for all $t \ge t_0$ the iterate stays in the linearly separable region, then, with the same diminishing stepsize as before, the expected averaged iterate generated by SGD converges in direction to this max-margin direction.

Thus, for SGD under the leaky ReLU model, the normalized average of the parameter vector converges in direction to the max-margin classifier.
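The following sketch spells out the leaky ReLU model and the scaled dataset used to define its max-margin direction. Scaling the negatively labeled samples by alpha reflects our reading of the construction above and should be treated as an assumption.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # linear model when alpha = 1, ReLU model when alpha = 0
    return np.where(z > 0, z, alpha * z)

def leaky_loss(w, X, y, alpha=0.1):
    # exponential loss under the leaky ReLU model
    return np.mean(np.exp(-y * leaky_relu(X @ w, alpha)))

def scaled_dataset(X, y, alpha=0.1):
    # transformed data: samples with label -1 are scaled by alpha,
    # samples with label +1 are unchanged
    scale = np.where(y == -1, alpha, 1.0).reshape(-1, 1)
    return X * scale, y
```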

4.2 Multi-neuron Networks

In this subsection, we extend our study of the ReLU model to the problem of training a one-hidden-layer ReLU neural network with $K$ hidden neurons for binary classification. Here, we do not assume linear separability of the dataset. The output of the network is given by

$$f(x) = v^\top \sigma(W^\top x), \qquad (2)$$

where $W \in \mathbb{R}^{d \times K}$ with each column $w_j$ representing the weights of the $j$th neuron in the hidden layer, $v \in \mathbb{R}^{K}$ denotes the weights of the output neuron, and $\sigma(\cdot)$ represents the entry-wise ReLU activation function. We assume that $v$ is a fixed vector whose entries are nonzero and have both positive and negative values. Such an assumption is natural, as it allows the model to have enough capacity to achieve zero loss. The predicted label is set to be the sign of $f(x)$, and the objective function under the exponential loss is given by

$$L(W) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\big(-y_i\, v^\top \sigma(W^\top x_i)\big). \qquad (3)$$
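In code, the model output in eq. 2 and the loss in eq. 3 can be sketched as follows; the hidden width, the particular fixed output vector v with mixed signs, and the 1/n averaging are our own choices.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def network_output(W, v, X):
    # f(x) = v^T sigma(W^T x), computed for all samples at once
    # X: (n, d), W: (d, K), v: (K,)
    return relu(X @ W) @ v

def multi_neuron_loss(W, v, X, y):
    # exponential loss of the one-hidden-layer ReLU network
    return np.mean(np.exp(-y * network_output(W, v, X)))

# example of a fixed output layer with both positive and negative entries:
# v = np.array([1.0, 1.0, -1.0, -1.0])
```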

Our goal is to characterize the implicit bias of GD and SGD for learning the weight parameters of the multi-neuron model. In general, such a problem is challenging, as we have shown that GD may not converge to a desirable classifier even under the single-neuron ReLU model. For this reason, we adopt the same setting as that in (Soudry et al., 2017, Corollary 8), which assumes that the activated neurons do not change their activation status and the training error converges to zero after a sufficient number of iterations, but our result presented below characterizes the implicit bias of GD and SGD in the original feature space, which is different from that in (Soudry et al., 2017, Corollary 8). For each sample, we define an activation pattern vector whose $j$th entry is $1$ if the sample activates the $j$th neuron, i.e., $w_j^\top x > 0$, and $0$ otherwise. We then partition the set of all training samples into subsets, so that the samples in the same subset have the same ReLU activation pattern and the samples in different subsets have different ReLU activation patterns. We call each such subset a pattern partition. Then, for any sample in a given pattern partition, the output of the network reduces to a linear function of the input, with an effective weight vector aggregated over the neurons that are active on that partition.
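The activation patterns, the pattern partitions, and the per-partition effective linear classifier described above can be computed as in the sketch below; the helper names are ours, and ties at exactly zero pre-activation are treated as inactive.

```python
import numpy as np

def activation_patterns(W, X):
    # binary pattern per sample: entry j is 1 if neuron j is active, i.e. w_j^T x > 0
    return (X @ W > 0).astype(int)            # shape (n, K)

def pattern_partitions(W, X):
    # group sample indices by their ReLU activation pattern
    partitions = {}
    for i, p in enumerate(map(tuple, activation_patterns(W, X))):
        partitions.setdefault(p, []).append(i)
    return partitions

def effective_classifier(W, v, pattern):
    # for samples with this pattern, f(x) = (sum_{j active} v_j w_j)^T x
    active = np.asarray(pattern, dtype=float)
    return W @ (v * active)                   # effective weight vector in R^d
```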

We next present our characterization of the implicit bias property of GD and SGD under the above ReLU network model. For each pattern partition, we define the corresponding max-margin direction of its samples in the same way as in the single-neuron case.

Then the following theorem characterizes the implicit bias of GD under the multi-neuron network.

Theorem 4.1.

Suppose that GD optimizes the loss in eq. 3 to zero and there exists $t_0$ such that for all $t \ge t_0$ the neurons in the hidden layer do not change their activation status. If the entry-wise logical AND of the activation patterns of any two different pattern partitions is zero (i.e., different partitions activate disjoint sets of neurons), then the samples in the same pattern partition of the ReLU activation have the same label, and the corresponding effective classifier converges in direction to the max-margin direction of that partition.

In contrast to (Soudry et al., 2017, Corollary 8), which studies the convergence of the vectorized weight matrix so that the implicit bias of GD is with respect to features lifted to an extended-dimensional space, Theorem 4.1 characterizes the convergence of the weight parameters and the implicit bias in the original feature space. In particular, Theorem 4.1 implies that although the ReLU neural network is a nonlinear classifier, it is equivalent to a ReLU classifier for the samples in the same pattern partition (which are from the same class), and this classifier converges in direction to the max-margin classifier of those data samples. The following theorem establishes the analogous implicit bias of SGD.

Theorem 4.2.

Suppose that SGD optimizes the loss in eq. 3 so that there exists $t_0$ such that for all $t \ge t_0$ the neurons in the hidden layer do not change their activation status and the activation patterns of different pattern partitions remain disjoint. Then, with an appropriately chosen diminishing stepsize, the samples in the same pattern partition of the ReLU activation have the same label, and the expected averaged effective classifier converges in direction to the max-margin direction of that partition.

Similarly to GD, averaged SGD in expectation maximizes the margin for every pattern partition. At a high level, Theorems 4.1 and 4.2 imply the following generalization behavior of the ReLU network under study. After a sufficiently large number of iterations, the neural network partitions the data samples into different subsets, and for each subset, the distance from the samples to the decision boundary is maximized by GD and SGD. Thus, the learned classifier is robust to small perturbations of the data, resulting in good generalization performance.

5 Conclusion

In this paper, we study the problem of learning a ReLU neural network via gradient descent methods, and establish the corresponding risk and parameter convergence under the exponential loss function. In particular, we show that due to the possible existence of spurious asymptotic local minima, GD and SGD can converge either to the global or to a local max-margin direction, whose nature of convergence is very different from that under the linear model in previous studies. We also discuss the extensions of our analysis to the more general leaky ReLU model and to multi-neuron networks. In the future, it is worthwhile to explore the implicit bias of GD and SGD in learning multi-layer neural network models and under more general (not necessarily linearly separable) datasets.

References

Appendix A Proof of Theorem 3.1

The gradient is given by

If for all , then as , we have

and

Recall that . If for all , then as , we obtain

and

If for all , then

The proof is now complete.

Appendix B Proof of Theorem 3.2

First, consider the case where the limiting direction lies in the linearly separable region and no local minimum exists along the update path. We call the region where all vectors correctly classify the samples with label $-1$ (i.e., the ReLU is inactive on them) the negative correctly classified region. As shown in Soudry et al. (2017), the objective is non-negative and smooth, which implies that

Based on the above inequality, we have

which, in conjunction with , implies that

Thus, the gradient vanishes as $t \to \infty$. By Theorem 3.1, the gradient vanishes only when all samples with label $-1$ are correctly classified, and thus GD eventually enters the negative correctly classified region and the iterate diverges to infinity. Theorem 3 of Soudry et al. (2017) shows that when GD diverges to infinity, it simultaneously converges in the direction of the max-margin classifier of all samples that remain activated. Thus, under our setting, GD converges either in the direction of the global max-margin classifier:

or the local max-margin classifier:

Next, consider the case where the limiting direction is not in the linearly separable region and no local minimum exists along the update path. In such a case, we conclude that GD cannot stay in the linearly separable region; otherwise, it would converge in a direction that is not in the linearly separable region, which leads to a contradiction. If an asymptotic local minimum exists, then GD may converge in its direction. If it does not exist, GD cannot stay in either the misclassified region or the linearly separable region, and thus oscillates between these two regions.

In the case where GD reaches a local minimum, by Theorem 3.1, the gradient is zero, and thus GD stops immediately and does not diverge to infinity.

Appendix C Examples of Convergence of GD in ReLU Model

Example 1 (Figure 1, left).

The dataset consists of two samples with label and one sample with label . These samples satisfy and .

For this example, if we initialize GD at the green classifier, then GD converges to the max-margin direction of the sample . Clearly, such a classifier misclassifies the data sample .

Example 2 (Figure 1, right).

The dataset consists of one sample with label and one sample with label . These two samples satisfy .

For this example, if we initialize at the green classifier, then GD oscillates around the direction and does not converge.

Figure 1: Failure of GD in learning ReLU models

Proof of Example 1

Consider the first iteration. Note that the sample has label , and from the illustration of Figure 1 (left) we have , and . Therefore, only the sample contributes to the gradient, which is given by

(4)

By the update rule of GD, we obtain that for all

(5)

By telescoping eq. 5, it is clear that the corresponding inner product remains negative for all iterations, since it is negative at initialization. This implies that the sample is always misclassified.

Proof of Example 2

Since we initialize GD at such that and , the sample does not contribute to the GD update due to the ReLU activation. Next, we argue that there must exist a such that . Suppose such a does not exist; then we always have . In that case, the linear classifier generated by GD stays between and , and the corresponding objective function reduces to a linear model that depends on the sample (note that the other sample contributes a constant due to the ReLU activation). Following the results in Ji & Telgarsky (2018); Soudry et al. (2017) for the linear model, we conclude that converges to the max-margin direction as . Since , this implies that as , contradicting the assumption.

Next, we consider the iterations such that and ; the objective function is given by

and the corresponding gradient is given by

Next, we consider the case where for all . Otherwise, both samples are on the negative side of the classifier and GD cannot make any progress, as the corresponding gradient is zero. In the case where for all , by the update rule of GD, we obtain that

(6)

Clearly, the sequence is strictly decreasing with a constant gap, and hence within finitely many steps we must have .

Appendix D Proof of Proposition 1

Since SGD eventually stays in the linearly separable region, only the data samples in contribute to the gradient update due to the ReLU activation function. For this reason, we reduce the original minimization problem (P) to the following optimization problem

(7)

which corresponds to a linear model with samples in . Similarly, if SGD stays in , only the data samples in contribute to the gradient update, and the original minimization problem (P) is reduced to

(8)

The proof contains three main steps.

Step 1: For any , bounding the term : By the update rule of SGD, we have

(9)

where

By convexity we obtain that . Then, eq. 9 further becomes

(10)

Telescoping the above inequality yields that

(11)

Taking expectation on both sides of the above inequality and noting that for all , we further obtain that

(12)

Note that whenever the data samples are correctly classified and for all , , and without loss of generality, we can assume . Hence, the term can be upper bounded by

Then, noting that , eq. 12 can be upper bounded by

(13)

Next, setting and noting that for all , we conclude that . Substituting this into the above inequality and noting that and , we further obtain that