1 Introduction
Deep neural networks (DNNs) are commonly trained using stochastic gradient descent (SGD), or one of its variants. During training, the learning rate is typically decreased according to some schedule (e.g., every few epochs we multiply the learning rate by some constant factor). Determining the learning rate schedule, and its dependency on other factors, such as the minibatch size, has been the subject of a rapidly increasing number of recent empirical works (Hoffer et al. (2017); Goyal et al. (2017); Jastrzebski et al. (2017); Smith et al. (2018) are a few examples). Therefore, it is desirable to improve our understanding of such issues. However, somewhat surprisingly, we observe that we do not even have a satisfying answer to the basic question: Why do we need to decrease the learning rate during training?
At first, it may seem that this question has already been answered. Many previous works have analyzed SGD theoretically (e.g., see Robbins and Monro (1951); Bertsekas (1999); Geary and Bertsekas (2001); Bach and Moulines (2011); Ben-David and Shalev-Shwartz (2014); Ghadimi et al. (2013); Bubeck (2015); Bottou et al. (2016); Ma et al. (2017) and references therein), under various assumptions. In all previous works, to the best of our knowledge, one must assume a vanishing learning rate schedule, averaging of the SGD iterates, partial strong convexity (i.e., strong convexity in some subspace), or the Polyak-Łojasiewicz (PL) condition (Bassily et al., 2018) — so that the SGD increments or the loss (in the convex case) will converge to zero for generic datasets. However, even near its global minima, a neural network loss is not partially strongly convex, and the PL condition does not hold. Therefore, without a vanishing learning rate or iterate averaging, the gradients are only guaranteed to decrease below some constant value, proportional to the learning rate. Thus, in this case, we may fluctuate near a critical point, but never converge to it.
Consequently it may seem that in neural networks we should always decrease the learning rate in SGD or average the weights, to enable the convergence of the weights to a critical point, and to decrease the loss. However, this reasoning does not hold empirically. In many datasets, even with a fixed learning rate and without averaging, we observe that the training loss can converge to zero. For example, we examine the learning dynamics of a ResNet18 trained on CIFAR10 in Figure 1. Even though the learning rate is fixed, the training loss converges to zero (and so does the classification error).
Notably, we do not observe any convergence issues, as we may have suspected from previous theoretical results. In fact, if we decrease the learning rate at any point, this only slows the convergence of the training loss to zero. The main benefit of decreasing the learning rate is that it typically improves generalization performance. Such a contradiction between existing theoretical and empirical results may indicate a significant gap in our understanding. We are therefore interested in closing this gap.
Figure 1: Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with a constant learning rate, softmax output, and a cross-entropy loss. We observe that, approximately: (1) the training loss and (classification) error both decay to zero; (2) after a while, the validation loss starts to increase; and (3) in contrast, the validation (classification) error slowly improves. In Soudry et al. (2018b), the authors observed similar results with momentum.

To do so, we first examine the network dynamics in Figure 1. Since the training error has reached zero after a certain number of iterations, by then the last hidden layer must have become linearly separable. Since the network is trained using the monotone cross-entropy loss (with softmax outputs), increasing the norm of the weights decreases the loss. Therefore, if the loss is minimized then the weights must diverge to infinity — as indeed happens. This weight divergence does not affect the scale-insensitive validation (classification) error, which continues to decrease during training. In contrast, the validation loss starts to increase.
To explain this behavior, Soudry et al. (2018b, a) focused on the dynamics of the last layer, for a fixed separable input and no bias. For Gradient Descent (GD) dynamics, Soudry et al. (2018b, a) proved that the training loss converges to zero as $O(1/t)$, the direction of the weight vector converges to the max margin direction as $O(1/\log t)$, and the validation loss increases as $O(\log t)$. These are similar dynamics to those observed in Figure 1. However, the dynamics of GD are simpler than those of SGD. Notably, it is well known that on smooth functions, for the iterates of GD, the gradient converges to zero even with a fixed learning rate — as long as this learning rate is below some fixed threshold (which depends on the smoothness of the function).
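As a toy illustration of these GD dynamics (our sketch, not the paper's experiment), consider full-batch gradient descent on a logistic loss over a small separable dataset: with a fixed learning rate, the training loss still heads to zero while the weight norm keeps growing.

```python
import numpy as np

# Sketch (ours, not the paper's experiment): full-batch GD on the logistic
# loss over a tiny linearly separable dataset, with a FIXED learning rate.
# The training loss converges toward zero while the weight norm diverges
# (roughly like log t), mirroring the GD dynamics described above.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Xy = X * y[:, None]              # fold the labels into the data points

w = np.zeros(2)
eta = 0.1                        # fixed, never decayed
losses, norms = [], []
for t in range(20000):
    m = Xy @ w                                        # margins
    losses.append(np.mean(np.log1p(np.exp(-m))))      # logistic loss
    grad = -(Xy * (1.0 / (1.0 + np.exp(m)))[:, None]).mean(axis=0)
    w -= eta * grad
    norms.append(np.linalg.norm(w))

print(losses[0], losses[-1], norms[-1])
```

Here the loss keeps decreasing without any learning rate decay, while the weight norm grows without bound.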
Our contributions.
In this paper we examine SGD optimization of homogeneous linear classifiers with smooth monotone loss functions, where the data is sampled either with replacement (the sampling regime typically examined in theory), or without replacement (the sampling regime typically used in practice). For simplicity, we focus on binary classification (e.g., logistic regression). First, we prove three basic results:

The norm of the weights diverges to infinity for any learning rate.

For a sufficiently small fixed learning rate, the loss and gradients converge to zero.

The upper bound we derive for the maximal learning rate is proportional to the minibatch size, when the data in SGD is sampled with replacement.
Similar behavior to the last property is also observed in deep networks (Goyal et al., 2017; Smith et al., 2018). Next, given an additional assumption that the loss function has an exponential tail (e.g., logistic regression), we prove that for almost all linearly separable datasets (i.e., except for measure zero cases):

The direction of the weight vector converges to that of the max margin solution.

The margin converges as $O(1/\log t)$, while the training loss converges as $O(1/t)$.
These conclusions for SGD are the same as for GD (Soudry et al., 2018b) — the only difference is the value of the maximal learning rate, which depends on the minibatch size. Therefore, we believe our SGD results might be similarly extended, as was done for GD, to multiclass problems (Soudry et al., 2018a), other loss functions (Nacson et al., 2019), other optimization methods (Gunasekar et al., 2018b), linear convolutional neural networks (Gunasekar et al., 2018a), and hopefully to nonlinear deep networks.
Finally, under the assumption that the SVM support vectors span the dataset, we further characterize the asymptotic behavior of the SGD iterates. Specifically, we show that, if we keep the learning rate proportional to the minibatch size, then:

The minibatch size does not affect the asymptotic convergence rate of SGD, in terms of epochs.

In terms of SGD iterations, the fastest asymptotic convergence rate is obtained at full batch size, i.e., GD.
2 Preliminaries
Consider a dataset $\{(x_n, y_n)\}_{n=1}^N$, with binary labels $y_n \in \{-1, 1\}$. We analyze learning by minimizing an empirical loss of homogeneous linear predictors (i.e., without bias), of the form
(1) $\mathcal{L}(w) = \sum_{n=1}^{N} \ell\left(y_n w^\top x_n\right)$
where $w$ is the weight vector. To simplify notation, we assume that $\forall n: y_n = 1$ — this is true without loss of generality, since we can always redefine $y_n x_n$ as $x_n$.
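A minimal sketch (ours) of this label-folding reduction and of the empirical loss of eq. 1, using the logistic loss as the example $\ell$:

```python
import numpy as np

# A minimal sketch (ours) of eq. 1 with the logistic loss as l: folding
# the labels into the data points (x_n <- y_n * x_n) reduces the empirical
# loss to a sum of per-sample losses of the margins w^T x_n.
def logistic(u):
    return np.log1p(np.exp(-u))

def empirical_loss(w, X, y):
    Xy = X * y[:, None]             # redefine x_n as y_n * x_n
    return logistic(Xy @ w).sum()   # eq. 1: sum_n l(w^T x_n)

X = np.array([[1.0, 2.0], [-1.0, 0.5]])
y = np.array([1.0, -1.0])
w = np.array([3.0, 0.0])
print(empirical_loss(w, X, y))      # both margins equal 3 here
```

Scaling up $w$ decreases this loss, which is the mechanism behind the weight divergence discussed above.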
We are particularly interested in problems that are linearly separable and have a smooth, strictly decreasing, nonnegative loss function. Therefore, we assume:
Assumption 1.
The dataset is strictly linearly separable: $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$.
Given that the data is linearly separable, the maximal margin is strictly positive:
(2) $\gamma \equiv \max_{\|w\| \le 1} \min_n w^\top x_n > 0$
Assumption 2.
$\ell(u)$ is a positive, differentiable, smooth function (i.e., its derivative is Lipschitz), monotonically decreasing to zero¹ (so $\lim_{u \to \infty} \ell(u) = \lim_{u \to \infty} \ell'(u) = 0$ and $\forall u: \ell'(u) < 0$), and $\lim_{u \to -\infty} \ell'(u) \neq 0$.

¹The requirement of nonnegativity and that the loss asymptotes to zero is purely for convenience. It is enough to require that the loss is monotonically decreasing and bounded from below. Any such loss asymptotes to some constant, and is thus equivalent to one that satisfies this assumption, up to a shift by that constant.
Many common loss functions, including the logistic and probit losses, satisfy Assumption 2. Assumption 2 also straightforwardly implies that $\mathcal{L}(w)$ is a $\beta\sigma_{\max}^2$-smooth function, where $\beta$ is the Lipschitz constant of $\ell'$, the columns of the matrix $X$ are all the samples $x_n$, and $\sigma_{\max}$ is the maximal singular value of $X$. Under these conditions, the infimum of the optimization problem is zero, but it is not attained at any finite $w$. Furthermore, no finite critical point exists. We consider minimizing eq. 1 using Stochastic Gradient Descent (SGD) with a fixed learning rate $\eta$, i.e., with steps of the form:
(3) $w(t+1) = w(t) - \frac{\eta}{b} \sum_{n \in B(t)} \ell'\left(w(t)^\top x_n\right) x_n$
where $B(t) \subset \{1, \ldots, N\}$ is a minibatch of $b$ distinct indices, chosen so that $N/b$ is an integer, and so that it satisfies one of the following assumptions. The first option is the assumption of random sampling with replacement:
Assumption 3a.
[Random sampling with replacement] At each iteration $t$ we randomly and uniformly sample a minibatch $B(t)$ of $b$ distinct indices, so each sample has an identical probability of being selected.
For example, this assumption holds if at each iteration we uniformly sample the $b$ indices without replacement from $\{1, \ldots, N\}$, or if we uniformly sample one minibatch out of some fixed partition of the data indices into minibatches of size $b$.
This assumption is rather common in theoretical analyses, but less so in practice. The following alternative sampling method is the one typically used in practice:
Assumption 3b (Sampling without replacement).
At each epoch, the minibatches partition the data: every index in $\{1, \ldots, N\}$ appears in exactly one of the epoch's $N/b$ minibatches.
This way, each sample is chosen exactly once in each epoch, and SGD completes balanced passes over the data. An important special case of this assumption is random sampling without replacement, which is the method commonly used in practice. Other special cases are periodic sampling (round-robin), and even adversarial selection of the order of the samples.
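The two sampling regimes can be sketched as follows (our illustration; `with_replacement` and `without_replacement` are hypothetical helper names, not the paper's):

```python
import numpy as np

# Sketch (ours) of the two sampling regimes for minibatch indices.
# Assumption 3a: each minibatch is drawn uniformly at each iteration
# (indices distinct within a batch). Assumption 3b: within every epoch
# the minibatches partition the indices, so each sample is used exactly
# once per epoch (random order shown; round-robin also satisfies 3b).
rng = np.random.default_rng(0)
N, b = 12, 4

def with_replacement(num_iters):          # Assumption 3a
    return [rng.choice(N, size=b, replace=False) for _ in range(num_iters)]

def without_replacement(num_epochs):      # Assumption 3b
    batches = []
    for _ in range(num_epochs):
        perm = rng.permutation(N)
        batches.extend(perm[i:i + b] for i in range(0, N, b))
    return batches

epoch = without_replacement(1)
seen = np.sort(np.concatenate(epoch))
print(seen)   # every index 0..N-1 appears exactly once
```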
3 Main Result 1: The Loss Converges to a Global Infimum
The weight norm always diverges to infinity, for any learning rate, as we prove next.
Lemma 1.
Let $w(t)$ be the iterates of SGD (eq. 3) with any learning rate $\eta > 0$. Under Assumptions 1 and 2, $\lim_{t \to \infty} \|w(t)\| = \infty$.
Proof.
Since the data is linearly separable, $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$. We examine the dot product of $w_*$ with the iterates of SGD. Since $-\ell'(u) > 0$ for any finite $u$ and $w_*^\top x_n > 0$ for all $n$, this dot product is non-decreasing, so either $w_*^\top w(t) \to \infty$ or the increments vanish. In the first case, from the Cauchy-Schwarz inequality, $\|w(t)\| \ge w_*^\top w(t) / \|w_*\| \to \infty$. In the second case, since $-\ell'(u)$ is strictly positive for any finite value, and achieves zero only as $u \to \infty$, we must have $w(t)^\top x_n \to \infty$, which again implies $\|w(t)\| \to \infty$. Combining both cases, we prove the lemma. ∎
As the weights go to infinity, we wish to understand the asymptotic behavior of the loss. As the next theorem shows, if the fixed learning rate is sufficiently small, then the loss converges to zero.
Theorem 1.
Let $w(t)$ be the iterates of SGD (eq. 3) from any starting point $w(0)$, where samples are either (case 1) selected randomly with replacement (Assumption 3a) and with learning rate
(4) 
or (case 2) sampled without replacement (Assumption 3b) and with learning rate
(5) 
For linearly separable data (Assumption 1) and a smooth monotone loss function (Assumption 2), we have the following, almost surely (with probability 1) in the first case, and surely in the second case:

The loss converges to zero: $\lim_{t \to \infty} \mathcal{L}(w(t)) = 0$.

All samples are correctly classified, given sufficiently long time: $\exists t_0$ such that $\forall t > t_0,\ \forall n:\ w(t)^\top x_n > 0$.

The increments of SGD are square summable: $\sum_{t=0}^{\infty} \|w(t+1) - w(t)\|^2 < \infty$.
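These three conclusions can be checked numerically on a toy separable problem (our sketch, not the paper's experiment; the minibatch gradient is taken as the average over the batch):

```python
import numpy as np

# A numerical sanity check (ours, not from the paper) of Theorem 1's
# conclusions on a toy separable problem: with a small fixed learning rate,
# the loss goes to zero, all samples end up correctly classified, and the
# sum of squared SGD increments stays bounded.
rng = np.random.default_rng(1)
Xy = np.array([[2.0, 1.0], [1.0, -1.0], [1.5, 0.5], [0.5, 1.5]])  # labels folded in
N, b, eta = 4, 2, 0.1

w = np.array([-1.0, -1.0])       # start by misclassifying everything
sq_increments = 0.0
for t in range(40000):
    idx = rng.choice(N, size=b, replace=False)        # sampling as in 3a
    m = Xy[idx] @ w
    grad = (-(1.0 / (1.0 + np.exp(m)))[:, None] * Xy[idx]).mean(axis=0)
    step = -eta * grad
    w += step
    sq_increments += step @ step

final_loss = np.log1p(np.exp(-(Xy @ w))).sum()
print(final_loss, (Xy @ w > 0).all(), sq_increments)
```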
The complete proof of this theorem is given in section A in the appendix. The proof relies on the following key lemma.
Lemma 2.
For any dataset, $\max_{\|w\| \le 1} \min_n w^\top x_n = \min_{v \succeq 0,\ \|v\|_1 = 1} \|Xv\|$. (6)
Proof.
In this proof, we define $\bar{v}$ as the minimizer of the right hand side of eq. 6, and $\bar{w}$ as the maximizer of the optimization problem on the left hand side of the same equation. On the one hand,
(7)
where in the first step we used the Cauchy-Schwarz inequality, and in the second we used the definition of $\bar{v}$ and that $\|\bar{w}\| \le 1$. On the other hand,
(8)
where in the first step we used the definition of the max margin from the left hand side of eq. 6, in the second we used the triangle inequality, and in the third we used that $\|\bar{v}\|_1 = 1$. Together, eqs. 7 and 8 imply the lemma. ∎
This lemma is useful since the SGD weight increments in eq. 3 have the form $Xv$, where $v$ is some vector with nonnegative components. This enables us to bound the norm of the SGD updates using the norm of the full gradient, which allows us to use a similar analysis as for GD. Additionally, we note that the regime we analyze in Theorem 1 is somewhat unusual, as the weight vector goes to infinity. In many previous works it is assumed that there exists a finite critical point, or that the weights are bounded within a compact domain.
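As a numeric sanity check (ours) of the kind of bound this lemma provides, relating $\|Xv\|$ for nonnegative $v$ to the max margin $\gamma$, take the toy points $x_1 = (1, 1)$ and $x_2 = (1, -1)$, for which $\gamma = 1$ (attained by $w = (1, 0)$):

```python
import numpy as np

# Numeric sanity check (ours) of the bound relating ||X v|| for
# nonnegative v to the max margin gamma. For the columns x1 = (1, 1)
# and x2 = (1, -1), the max margin is gamma = 1 (attained by w = (1, 0)),
# and ||X v|| >= gamma * ||v||_1 for every nonnegative v, with equality
# on the simplex at v = (1/2, 1/2).
X = np.array([[1.0, 1.0],
              [1.0, -1.0]]).T       # columns are the data points
gamma = 1.0

rng = np.random.default_rng(0)
V = rng.random((1000, 2))           # random nonnegative vectors
lhs = np.linalg.norm(X @ V.T, axis=0)
rhs = gamma * V.sum(axis=1)         # gamma * ||v||_1
print((lhs >= rhs - 1e-9).all())                  # True
print(np.linalg.norm(X @ np.array([0.5, 0.5])))   # exactly gamma
```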
Theorem 1 Implications.
In both sampling regimes, we obtained that a fixed (non-vanishing) learning rate results in convergence to zero error. In the case of random sampling with replacement (Assumption 3a) we obtained a better upper bound on the learning rate (eq. 4), which is proportional to the minibatch size $b$. Interestingly, this bound matches the empirical findings of Goyal et al. (2017); Smith et al. (2018), who observed that $\eta \propto b$ works well over a large range of $b$. In our case, the relation holds exactly for all $b$ in the maximum learning rate (eq. 4). In contrast, for linear regression, the relation becomes sublinear for large $b$ (Ma et al., 2017).

We also considered here the case where the datapoints are sampled without replacement (Assumption 3b). This is in contrast to most theoretical SGD results, which typically assume sampling with replacement (which is less common in practice). There are a few notable exceptions (Geary and Bertsekas (2001); Bertsekas (2011); Shamir (2016), and references therein). Perhaps the most similar previous result is the classical one of Proposition 2.1 in Geary and Bertsekas (2001), which has a similar sampling schedule, and in which the weights can go to infinity. However, in that result the learning rate must go to zero for the SGD iterates to converge. In our case, we are able to relax this assumption since we focus on linear classification with a monotone loss and separable data.
When assuming sampling without replacement (Assumption 3b), the learning rate bound (eq. 5) becomes significantly lower. This is because such a sampling assumption is very pessimistic (e.g., the samples can be selected by an adversary). Therefore, a small (yet non-vanishing) learning rate is required to guarantee convergence. Such a dependence on $N$ is expected, since in this case we need to use an incremental-gradient-method type of proof, where such low learning rates are common. For example, in Bertsekas (2011), Proposition 3.2b, to get a low final error we must use a correspondingly low learning rate.
4 Main Result 2: The Weight Vector Direction Converges to the Max Margin
Next, we focus on a special case of monotone loss functions:
Definition 1.
A function $f(u)$ has a "tight exponential tail" if there exist positive constants $c$, $a$, $\mu_+$, $\mu_-$, and $u_0$ such that $\forall u > u_0$:
$c\left(1 - e^{-\mu_+ u}\right) e^{-au} \le f(u) \le c\left(1 + e^{-\mu_- u}\right) e^{-au}.$
Assumption 4.
The negative loss derivative $-\ell'(u)$ has a tight exponential tail.
Specifically, this applies to the logistic loss function. Given this additional assumption, we prove that SGD converges to the max margin solution.
Theorem 2.
For almost all datasets for which the assumptions of Theorem 1 hold, if $-\ell'(u)$ has a tight exponential tail (Assumption 4), then the iterates of SGD, for any minibatch size $b$, will behave as:
(9) $w(t) = \hat{w} \log t + \rho(t)$
where $\hat{w}$ is the following max margin separator:
(10) $\hat{w} = \operatorname{argmin}_{w} \|w\|^2 \text{ s.t. } \forall n: w^\top x_n \ge 1$
and the residual $\rho(t)$ is bounded, almost surely in the first case of Theorem 1 (random sampling with replacement), or surely in the second case (sampling without replacement).
Thus, from Theorem 2, for almost any linearly separable dataset (e.g., with probability 1 if the data is sampled from an absolutely continuous distribution), the normalized weight vector converges to the normalized max margin vector, i.e., $w(t)/\|w(t)\| \to \hat{w}/\|\hat{w}\|$, with rate $O(1/\log t)$, identically to GD (Soudry et al., 2018b). Interestingly, the number of minibatches per epoch affects only the constants. Intuitively, this is reasonable, since if we rescale the time units, then the log term in eq. 9 will only add a constant to the residual $\rho(t)$.
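A toy simulation (ours, not the paper's experiment) of this direction convergence: for the folded points $x_1 = (1, 1)$ and $x_2 = (1, -1)$, the solution of $\min \|w\|^2$ s.t. $w^\top x_n \ge 1$ is $\hat{w} = (1, 0)$, and the normalized SGD iterate slowly aligns with it.

```python
import numpy as np

# Toy check (ours) that the SGD direction approaches the max margin
# separator. For the folded points x1 = (1, 1) and x2 = (1, -1), the
# max margin problem min ||w||^2 s.t. w^T x_n >= 1 is solved by
# w_hat = (1, 0). Convergence in direction is logarithmically slow.
rng = np.random.default_rng(0)
Xy = np.array([[1.0, 1.0], [1.0, -1.0]])
w_hat = np.array([1.0, 0.0])

w = np.zeros(2)
eta = 0.2                              # fixed learning rate, minibatch size 1
for t in range(100000):
    x = Xy[rng.integers(2)]            # sample one data point
    m = x @ w
    w += eta * x / (1.0 + np.exp(m))   # -l'(m) = 1/(1 + e^m) for logistic loss

cosine = (w @ w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
print(cosine)   # slowly approaches 1
```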
Proof idea.
The theorem is proved in appendix section B.1. The proof builds on the results of Soudry et al. (2018b) for GD: as the weights diverge, the loss converges to zero, and only the gradients of the support vectors remain significant. This implies that the gradient direction, as a positive linear combination of support vectors, converges to the direction of the max margin. The main difficulty in extending the proof to the case of SGD is that at each iteration, $w(t)$ is updated using only a subset of the data points. This could potentially lead to a large difference from the GD solution. However, we show that this difference of $w(t)$ from the GD dynamics solution is bounded. The main novel idea here is that in order to calculate this difference at time $t$, we use information on sampling selections made in the future, i.e., at times larger than $t$.
Convergence Rates.
Theorem 2 directly implies the same convergence rates as in GD (Soudry et al., 2018b). Specifically, in the distance,
(11) $\left\| \frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|} \right\| = O\left( \frac{1}{\log t} \right)$
in the angle,
(12) $1 - \frac{w(t)^\top \hat{w}}{\|w(t)\| \|\hat{w}\|} = O\left( \frac{1}{\log^2 t} \right)$
and in the margin gap,
(13) $\gamma - \min_n \frac{w(t)^\top x_n}{\|w(t)\|} = O\left( \frac{1}{\log t} \right)$
On the other hand, the loss itself decreases as
(14) $\mathcal{L}(w(t)) = O\left( \frac{1}{t} \right)$
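A one-dimensional sketch (ours) of this $1/t$-type loss decay: for a single folded data point $x = 1$ with logistic loss and gradient descent, $t \cdot \mathcal{L}(w(t))$ approaches the constant $1/\eta$, so the loss decays like $1/t$.

```python
import numpy as np

# 1-D sketch (ours) of the O(1/t) training-loss decay: a single folded
# data point x = 1 with logistic loss and gradient descent. Asymptotically
# e^{w(t)} ~ eta * t, so t * L(w(t)) approaches 1/eta and L(w(t)) ~ 1/t.
eta = 0.5
w = 0.0
checkpoints = {}
for t in range(1, 80001):
    w += eta / (1.0 + np.exp(w))          # GD step: -l'(w) = 1/(1 + e^w)
    if t in (20000, 80000):
        checkpoints[t] = t * np.log1p(np.exp(-w))
print(checkpoints)   # both values near 1/eta = 2
```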
In Figure 2 we visualize these results. Additionally, in Figure 3 we observe that the convergence rates remain nearly the same for different minibatch sizes — as long as we linearly scale the learning rate with the minibatch size, i.e., $\eta \propto b$. This behavior fits the behavior of the maximal learning rate for which SGD converges in the case of sampling with replacement (eq. 4). However, it is not clear from Theorem 2 why the convergence rate stays almost exactly the same with such a linear scaling, since we do not know how $\rho(t)$ depends on $\eta$ and $b$. In the special case where the SVM support vectors span the dataset, we can further characterize the asymptotic dependence on $\eta$ and $b$. We define $P$ as the orthogonal projection matrix onto the subspace spanned by the support vectors, and $\bar{P} = I - P$ as the complementary projection. In addition, we denote $\alpha$ as the SVM dual variables, so that $\hat{w} = \sum_n \alpha_n x_n$.
Theorem 3.
Under the conditions and notation of Theorem 2, for almost all datasets, if in addition the support vectors span the data (i.e., $\operatorname{rank}(X_S) = \operatorname{rank}(X)$, where $X_S$ is a matrix whose columns are only those data points $x_n$ such that $\hat{w}^\top x_n = 1$), then $\rho(t) \to \tilde{w}$, where $\tilde{w}$ is a solution to
(15) 
The theorem is proved in appendix section B.2. Note that $\tilde{w}$ depends only on the dataset and the initialization. This fact enables us to state the following result for the asymptotic behavior of SGD.
Corollary 1.
Under the conditions and notation of Theorem 3, the SGD iterates will behave as:
where $\hat{w}$ is the maximum-margin separator, $\tilde{w}$ is the solution of eq. 15 (which does not depend on $\eta$ and $b$), and the remaining term is vanishing. Therefore, if the step size is kept proportional to the minibatch size, i.e., $\eta \propto b$, changing the number of minibatches is equivalent to linearly rescaling the time units of $w(t)$.
From the corollary, we expect the same asymptotic convergence rates for all batch sizes, as long as we scale the learning rate linearly with the batch size, i.e., keep $\eta \propto b$. This is exactly the behavior we observe in Figure 3. Since changing the number of minibatches is equivalent to linearly rescaling the time units, a smaller number of minibatches per epoch implies faster asymptotic convergence in terms of iterations, assuming full parallelization capabilities (i.e., the minibatch size does not affect the iterate time). Additionally, note that the corollary only guarantees the same asymptotic behavior. In particular, different initializations and datasets can exhibit different behavior initially. It remains an interesting direction for future work to understand the dependence on $\eta$ and $b$ in the case when the support vectors do not span the dataset.
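The linear-scaling prediction can be sketched numerically (our toy experiment; `loss_after` is a hypothetical helper name): with $\eta$ proportional to $b$, different minibatch sizes give nearly the same training loss after the same number of epochs.

```python
import numpy as np

# Sketch (ours) of the linear-scaling prediction: keep eta = eta0 * b and
# run SGD without replacement; different minibatch sizes b then give nearly
# the same training loss after the same number of EPOCHS. The minibatch
# gradient is taken as the average over the batch.
rng = np.random.default_rng(0)
Xy = np.array([[2.0, 1.0], [1.0, -1.0], [1.5, 0.5], [0.5, 1.5]])  # labels folded in
N, eta0, epochs = 4, 0.05, 20000

def loss_after(b):
    eta = eta0 * b                        # linear scaling of the step size
    w = np.zeros(2)
    for _ in range(epochs):
        for idx in rng.permutation(N).reshape(-1, b):  # one epoch, as in 3b
            m = Xy[idx] @ w
            grad = (-(1.0 / (1.0 + np.exp(m)))[:, None] * Xy[idx]).mean(axis=0)
            w -= eta * grad
    return np.log1p(np.exp(-(Xy @ w))).sum()

l1, l2, l4 = loss_after(1), loss_after(2), loss_after(4)
print(l1, l2, l4)   # nearly identical per-epoch behavior
```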
Lastly, for the logistic regression loss, the validation loss (calculated on an independent validation set) increases as $O(\log t)$.
Notably, as was observed in Soudry et al. (2018b), these asymptotic rates also match what we observe numerically for the convnet in Figure 1: the training loss decreases as $O(1/t)$, the validation loss increases as $O(\log t)$, and the validation (classification) error improves very slowly, similarly to the logarithmic decay of the angle gap (so the convnet might have a similarly slow decay to its respective implicit bias).
5 Discussion and Related Works
In Theorem 1 we proved that for monotone smooth loss functions on linearly separable data, the iterates of SGD with a sufficiently small (but nonvanishing) learning rate converge to zero loss. In contrast to typical convergence to finite critical points, in this case, the "noise" inherent in SGD vanishes asymptotically. Therefore, we do not need to decrease the learning rate, or average the SGD iterates, to ensure exact convergence. Decaying the learning rate during training will only decrease the convergence speed of the loss.
To the best of our knowledge, such an exact convergence result previously required that either (1) the loss function is partially strongly convex, i.e., strongly convex except on some subspace (where the dynamics are frozen), as shown in Ma et al. (2017) for the case of overparameterized linear regression (with more parameters than samples); or (2) the Polyak-Łojasiewicz (PL) condition applies (Bassily et al., 2018). However, in this paper we do not require such conditions, which do not hold for deep networks, even in the vicinity of the (finite or infinite) critical points. Moreover, the dependence of the learning rate on the minibatch size is different, as we discuss next.
We proved Theorem 1 both for random sampling with replacement (Assumption 3a) and for sampling without replacement (Assumption 3b). In the first case, eq. 4 implies that, to guarantee convergence, we need to increase the learning rate proportionally to the minibatch size. In the second case (sampling without replacement) the learning rate bound (eq. 5) is more pessimistic, since our assumption is more general (e.g., it includes adversarial sampling).
In Theorem 2, we proved, given the additional assumption of an exponential tail (e.g., as in logistic regression), that for almost all datasets the weight vector converges to the max margin in direction as $O(1/\log t)$, and that the training loss converges to zero as $O(1/t)$. We believe these results could be extended to every dataset, using the techniques of Soudry et al. (2018a). Again, decaying the learning rate will only degrade the convergence speed to the max margin direction. In fact, the results of Nacson et al. (2019) indicate that we may need to increase the learning rate to improve convergence: for GD, Nacson et al. (2019) proved that this can drastically improve the margin convergence rate. It is yet to be seen if such results might also be applied to deep networks.
In Theorem 3 we further characterized the asymptotic behaviour of the weights under the additional assumption that the SVM support vectors span the dataset. Combining the results of Theorems 2 and 3, we obtain Corollary 1. This corollary states that, under linear scaling of the learning rate with the batch size, the asymptotic convergence rate of SGD, in terms of epochs, is not affected by the minibatch size.
Thus, we have shown that exact linear scaling of the learning rate with the minibatch size ($\eta \propto b$) is beneficial in two ways: (a) in Theorem 1, for the upper bound on the learning rate in the case of random sampling with replacement; (b) in Corollary 1, for the asymptotic behaviour of the weights, assuming a tight exponential loss tail and that the SVM support vectors span the data. This exact linear scaling stands in contrast to previous theoretical results with exact convergence (Ma et al., 2017), in which there exists a "saturation limit": above this limit we should not increase the learning rate linearly with the minibatch size, or the convergence rate will be degraded, and eventually we will lose the convergence guarantee. As predicted by Corollary 1, in Figure 3 we observe that with a linear scaling $\eta \propto b$, the convergence plots exactly match: as we can see, there is almost no asymptotic difference between different minibatch sizes. Therefore, in contrast to Ma et al. (2017), there is no "optimal" minibatch size. In this case, to minimize the number of SGD iterations we should use the largest minibatch possible. This will speed up convergence in wall clock time (as was done in Goyal et al. (2017); Smith et al. (2018)) if it is possible to parallelize the calculation of a minibatch — so that one SGD update with a large minibatch takes less time than the corresponding number of SGD updates with smaller minibatches.
An early version of this manuscript previously appeared on arXiv. However, it contained only the results in the case of sampling without replacement, and no Theorem 3. Two other related SGD results appeared on arXiv in parallel (with less than a week difference).
First, Ji and Telgarsky (2018) analyzed logistic regression optimized by SGD on separable data (in addition to other results on GD when the data is non-separable). Ji and Telgarsky (2018) also assume a fixed learning rate, but use averaging of the iterates (which is known to enable exact convergence). They focus on the case in which the datapoints are independently sampled from a separable distribution, while we focus on the case of sampling from a fixed dataset. They show that, with high probability, the population risk converges to zero as $\tilde{O}(1/t)$. As explained in Ji and Telgarsky (2018), such a fast rate was previously proven only for strongly convex loss functions (the logistic loss is not strongly convex). We showed a similar rate, but for the empirical risk (eq. 14). We additionally showed that the weight vector converges in direction to that of the max margin.
Second, among other results, Xu et al. (2018) also examined optimizing logistic regression with SGD on a fixed dataset, using random sampling with replacement, iterate averaging, and a vanishing learning rate. There, in Theorems 3.2 and 3.3, it is shown that the expectation of the loss and the expectation of the averaged iterates (in norm) converge at rates slower than our results. Thus, in contrast to both works (Ji and Telgarsky, 2018; Xu et al., 2018), we did not assume iterate averaging or a decreasing learning rate. Additionally, our new results on sampling with replacement give a linear relationship between the learning rate and the minibatch size, and Corollary 1 shows the effect of the minibatch size on the asymptotic convergence rate.
6 Conclusions
We found that, for logistic regression with no bias on separable data, SGD behaves similarly to GD in terms of the implicit bias and convergence rate. The only difference is that the maximum possible learning rate changes proportionally to the minibatch size. It remains to be seen if this also holds for deep networks.
Acknowledgements
The authors are grateful to C. Zeno, and I. Golan for helpful comments on the manuscript. This research was supported by the Israel Science foundation (grant No. 31/1031), and by the Taub foundation. A Titan Xp used for this research was donated by the NVIDIA Corporation. NS was partially supported by NSF awards IIS1302662 and IIS1764032.
References

Bach and Moulines (2011)
Francis Bach and Eric Moulines.
NonAsymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.
NIPS, pages –, 2011.  Bassily et al. (2018) Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in nonconvex overparametrized learning. pages 1–7, 2018.
Ben-David and Shalev-Shwartz (2014) Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 Bertsekas (1999) D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bertsekas (2011) Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, jul 2011.
Bottou et al. (2016) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint, 2016.
Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
 Geary and Bertsekas (2001) A. Geary and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304), 1(1):907–912, 2001.
Ghadimi et al. (2013) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Minibatch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Math. Prog., 155(1-2):267–305, 2013.
Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.
Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In NIPS, 2018a.
 Gunasekar et al. (2018b) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.
 Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv, pages 1–21, 2017.
 Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Overparametrized Learning. arXiv preprint, 2017.
Nacson et al. (2019) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. AISTATS, 2019.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Shamir (2016) Ohad Shamir. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization. pages 1–36, 2016.
 Smith et al. (2018) Samuel L. Smith, PieterJan Kindermans, Chris Ying, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
 Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint: 1710.10345v3, 2018a.
 Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. ICLR, 2018b.
Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models? arXiv, 2018.
References

Bach and Moulines (2011) Francis Bach and Eric Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In NIPS, 2011.
Bassily et al. (2018) Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint, pages 1–7, 2018.
Ben-David and Shalev-Shwartz (2014) Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Bertsekas (1999) D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
Bertsekas (2011) Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.
Bottou et al. (2016) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint, 2016.
Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.
Geary and Bertsekas (2001) A. Geary and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. In Proceedings of the 38th IEEE Conference on Decision and Control, 1(1):907–912, 2001.
Ghadimi et al. (2013) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 155(1–2):267–305, 2013.
Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.
Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In NIPS, 2018a.
Gunasekar et al. (2018b) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pages 1729–1739, 2017.
Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv preprint, pages 1–21, 2017.
Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning. arXiv preprint, 2017.
Nacson et al. (2019) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. In AISTATS, 2019.
Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Shamir (2016) Ohad Shamir. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization. arXiv preprint, pages 1–36, 2016.
Smith et al. (2018) Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345v3, 2018a.
Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In ICLR, 2018b.
Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models? arXiv preprint, 2018.
Appendix A Proof of Theorem 1
Our proof relies on Lemma 2. Specifically, since we assumed , this lemma implies that
(17) 
Next, we will rely on this key fact to prove our results for each case.
a.1 Case 1: Random sampling with replacement
From the smoothness of the loss
Taking expectation, we have
where in we defined as a random variable equal to if sample is selected at time , or otherwise; in we used the definition of ; in we used and ; and in we used eq. 17. Therefore, if (18) 
then
and we can write
Summing over we have
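As a side illustration of the sampling scheme analyzed in this case, here is a minimal, self-contained sketch of fixed-learning-rate SGD with with-replacement sampling; the quadratic toy loss, data, and learning rate are our own placeholders, not the paper's setting:

```python
import random

random.seed(0)  # deterministic run for illustration

def sgd_with_replacement(w, grad_i, n, lr, steps):
    """SGD where each step draws one sample index uniformly at random,
    with replacement, from {0, ..., n-1} (Case 1 of the proof)."""
    for _ in range(steps):
        i = random.randrange(n)    # independent draw at every step
        w -= lr * grad_i(w, i)     # update on the single sampled loss
    return w

# Toy problem: average of per-sample losses 0.5*(w - x_i)^2,
# whose full-batch minimizer is the mean of the x_i (here 3.0).
xs = [1.0, 2.0, 3.0, 6.0]
grad = lambda w, i: w - xs[i]
w_final = sgd_with_replacement(0.0, grad, len(xs), lr=0.1, steps=5000)
```

In this non-interpolating toy problem the iterate keeps fluctuating around the minimizer rather than converging to it, matching the introduction's point that with a fixed learning rate the gradients are only guaranteed to fall below a constant proportional to the learning rate.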
a.2 Case 2: Sampling without replacement
Linear separability enforces a lower bound on the norm of these increments (eq. 17, which follows from Lemma 2). This enables us to bound the SGD increments, and other related quantities, in terms of the norm of the full gradient (Lemma 3 below).
Lemma 3.
For all and , such that and are in the same epoch, we have
Proof.
See appendix section A.3. ∎
Together, these bounds enable us to complete the proof. First, we assume that is the first iteration in some epoch, i.e., for some . The smoothness of the loss function (Assumption 1), implies that is smooth. This entails that
(21) 
and therefore,
where in we used eq. 21 and the first two equations in Lemma 3, in we recalled our assumption in eq. 5, and in we denoted . Summing over we obtain
since and according to our assumption on .
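The smoothness step invoked around eq. 21 is the standard descent lemma; since the paper's symbols did not survive extraction, we sketch its generic form in our own notation (an L-smooth loss f and step size η), which is only an assumption about the intended statement:

```latex
% Descent lemma: for an L-smooth function f and any points x, y,
f(y) \le f(x) + \nabla f(x)^{\top}(y - x) + \tfrac{L}{2}\,\|y - x\|^{2}.
% Applied to a gradient step y = x - \eta \nabla f(x), it yields
f\big(x - \eta \nabla f(x)\big) \le f(x)
    - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\,\|\nabla f(x)\|^{2}.
```

For step sizes η < 2/L the coefficient of the squared gradient norm is positive, which is the standard mechanism for extracting a per-step decrease of the loss.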
Next, we consider a general time (i.e., not only the first iteration of each epoch, as we assumed until now). We note that, for any such that is in the same epoch as , we have that
where we used the last equation in Lemma 3. Thus, combining the last two equations we obtain
(22) 
which also implies that . Next, we recall eq. 17 to obtain
Therefore, . Since is strictly positive, and equals zero only at (from Assumption 1), we obtain that
Finally, using eq. 17 again, we obtain
(23) 
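For contrast with Case 1, the without-replacement (epoch shuffling) scheme analyzed in this case can be sketched as follows; again, the toy data, loss, and learning rate are placeholders of ours, not the paper's setting:

```python
import random

random.seed(1)  # deterministic run for illustration

def sgd_without_replacement(w, grad_i, n, lr, epochs):
    """SGD where every epoch visits each sample exactly once,
    in a freshly shuffled order (Case 2 of the proof)."""
    order = list(range(n))
    for _ in range(epochs):
        random.shuffle(order)   # new permutation each epoch
        for i in order:         # each index used exactly once per epoch
            w -= lr * grad_i(w, i)
    return w

xs = [1.0, 2.0, 3.0, 6.0]
grad = lambda w, i: w - xs[i]   # gradient of 0.5*(w - x_i)^2
w_final = sgd_without_replacement(0.0, grad, len(xs), lr=0.1, epochs=500)
```

The only difference from the with-replacement sketch is the per-epoch permutation: each index is used exactly once per epoch, which is the structure the epoch-wise argument above exploits.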
a.3 Proof of Lemma 3
First, we prove the following technical lemma.
Lemma 4.
Let and be two positive constants. If then
(24) 
and
(25) 
Proof.
We prove this by direct calculation
Also, from the first and last lines in the above equation, we have
∎
With this result in hand, we complete the proof by direct calculation
(26) 
where in we used the triangle inequality, in we defined , and used
in we used the fact that is the Lipschitz constant of , and in we used the definition of . The above bound implies the following bound
(27) 
where in we added and subtracted the same term, in we used the triangle inequality, and in we used eq. 26 and also eq. 17 to obtain
(28) 