Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed learning rate, in the special case of linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous proofs of exact asymptotic convergence of SGD required a learning rate that asymptotically vanishes to zero, or averaging of the SGD iterates. Furthermore, if the loss function has an exponential tail (e.g., logistic regression), then we prove that with SGD the weight vector converges in direction to the L_2 max margin vector as O(1/log(t)) for almost all separable datasets, and the loss converges as O(1/t), similarly to gradient descent. These results suggest an explanation for the similar behavior observed in deep networks when trained with SGD.


1 Introduction

Deep neural networks (DNNs) are commonly trained using stochastic gradient descent (SGD), or one of its variants. During training, the learning rate is typically decreased according to some schedule (e.g., every fixed number of epochs we multiply the learning rate by some factor smaller than one). Determining the learning rate schedule, and its dependency on other factors, such as the minibatch size, has been the subject of a rapidly increasing number of recent empirical works (Hoffer et al. (2017); Goyal et al. (2017); Jastrzebski et al. (2017); Smith et al. (2018) are a few examples). Therefore, it is desirable to improve our understanding of such issues. However, somewhat surprisingly, we observe that we do not have even a satisfying answer to the basic question:

Why do we need to decrease the learning rate during training?

At first, it may seem that this question has already been answered. Many previous works have analyzed SGD theoretically (e.g., see Robbins and Monro (1951); Bertsekas (1999); Geary and Bertsekas (2001); Bach and Moulines (2011); Ben-David and Shalev-Shwartz (2014); Ghadimi et al. (2013); Bubeck (2015); Bottou et al. (2016); Ma et al. (2017) and references therein), under various assumptions. In all previous works, to the best of our knowledge, one must assume a vanishing learning rate schedule, averaging of the SGD iterates, partial strong convexity (i.e., strong convexity in some subspace), or the Polyak-Lojasiewicz (PL) condition (Bassily et al., 2018) — so that the SGD increments or the loss (in the convex case) will converge to zero for generic datasets. However, even near its global minima, a neural network loss is not partially strongly convex, and the PL condition does not hold. Therefore, without a vanishing learning rate or iterate averaging, the gradients are only guaranteed to decrease below some constant value, proportional to the learning rate. Thus, in this case, we may fluctuate near a critical point, but never converge to it.

Consequently it may seem that in neural networks we should always decrease the learning rate in SGD or average the weights, to enable the convergence of the weights to a critical point, and to decrease the loss. However, this reasoning does not hold empirically. In many datasets, even with a fixed learning rate and without averaging, we observe that the training loss can converge to zero. For example, we examine the learning dynamics of a ResNet-18 trained on CIFAR10 in Figure 1. Even though the learning rate is fixed, the training loss converges to zero (and so does the classification error).

Notably, we do not observe any convergence issues, as we might have suspected from previous theoretical results. In fact, if we decrease the learning rate at any point, this only decreases the convergence rate of the training loss to zero. The main benefit of decreasing the learning rate is that it typically improves generalization performance. Such a contradiction between existing theoretical and empirical results may indicate a significant gap in our understanding. We are therefore interested in closing this gap.

To do so, we first examine the network dynamics in Figure 1. Since the training error has reached zero after a certain number of iterations, by then the last hidden layer must have become linearly separable. Since the network is trained using the monotone cross-entropy loss (with softmax outputs), by increasing the norm of the weights we decrease the loss. Therefore, if the loss is minimized then the weights would tend to diverge to infinity — as indeed happens. This weight divergence does not affect the scale-insensitive validation (classification) error, which continues to decrease during training. In contrast, the validation loss starts to increase.

To explain this behavior, Soudry et al. (2018b, a) focused on the dynamics of the last layer, for a fixed separable input and no bias. For Gradient Descent (GD) dynamics, Soudry et al. (2018b, a) proved that the training loss converges to zero as $O(1/t)$, the direction of the weight vector converges to the max margin as $O(1/\log t)$, and the validation loss increases as $\Omega(\log t)$. These dynamics are similar to those observed in Figure 1. However, the dynamics of GD are simpler than those of SGD. Notably, it is well known that on smooth functions, for the iterates of GD, the gradient converges to zero even with a fixed learning rate, as long as this learning rate is below some fixed threshold (which depends on the smoothness of the function).

Our contributions.

In this paper we examine SGD optimization of homogeneous linear classifiers with smooth monotone loss functions, where the data is sampled either with replacement (the sampling regime typically examined in theory), or without replacement (the sampling regime typically used in practice). For simplicity, we focus on binary classification (e.g., logistic regression). First, we prove three basic results:

• The norm of the weights diverges to infinity for any learning rate.

• For a sufficiently small fixed learning rate, the loss and gradients converge to zero.

• The upper bound we derive for the maximal learning rate is proportional to the minibatch size, when the data in SGD is sampled with replacement.

Similar behavior to the last property is also observed in deep networks (Goyal et al., 2017; Smith et al., 2018). Next, given an additional assumption that the loss function has an exponential tail (e.g., logistic regression), we prove that for almost all linearly separable datasets (i.e., except for measure zero cases):

• The direction of the weight vector converges to that of the max margin solution.

• The margin converges as $O(1/\log t)$, while the training loss converges as $O(1/t)$.

These conclusions for SGD are the same as for GD (Soudry et al., 2018b) — the only difference is the value of the maximal learning rate, which depends on the minibatch size. Therefore, we believe our SGD results might be similarly extended, as GD, to multi-class (Soudry et al., 2018a), other loss functions (Nacson et al., 2019), other optimization methods (Gunasekar et al., 2018b), linear convolutional neural networks (Gunasekar et al., 2018a), and hopefully to nonlinear deep networks.

Finally, under the assumption that the SVM support vectors span the dataset, we further characterize SGD iterate asymptotic behavior. Specifically, we show that, if we keep the learning rate proportional to the minibatch size, then:

• The minibatch size does not affect the asymptotic convergence rate of SGD, in terms of epochs.

• In terms of SGD iterations, the fastest asymptotic convergence rate is obtained at full batch size, i.e. GD.

These results suggest the large potential of parallelism in separable problems, as observed in deep networks (Goyal et al., 2017; Smith et al., 2018).

2 Preliminaries

Consider a dataset $\{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^d$ and binary labels $y_n \in \{-1, 1\}$. We analyze learning by minimizing an empirical loss of homogeneous linear predictors (i.e., without bias), of the form

 $$L(w) = \sum_{n=1}^{N} \ell(y_n w^\top x_n), \tag{1}$$

where $w \in \mathbb{R}^d$ is the weight vector. To simplify notation, we assume that $\forall n: y_n = 1$. This is true without loss of generality, since we can always re-define $y_n x_n$ as $x_n$.

We are particularly interested in problems that are linearly separable and with a smooth strictly decreasing and non-negative loss function. Therefore, we assume:

Assumption 1.

The dataset is strictly linearly separable: $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$.

Given that the data is linearly separable, the maximal margin is strictly positive:

 $$\gamma = \max_{w \in \mathbb{R}^d : \|w\| = 1} \min_n w^\top x_n > 0. \tag{2}$$

Assumption 2.

$\ell(u)$ is a positive, differentiable, $\beta$-smooth function (i.e., its derivative is $\beta$-Lipschitz), monotonically decreasing to zero (so $\forall u: \ell(u) > 0$, $\ell'(u) < 0$, and $\lim_{u\to\infty}\ell(u) = \lim_{u\to\infty}\ell'(u) = 0$), and $\limsup_{u\to-\infty}\ell'(u) \neq 0$.

Footnote 1: The requirement of nonnegativity and that the loss asymptotes to zero is purely for convenience. It is enough to require that the loss is monotone decreasing and bounded from below. Any such loss asymptotes to some constant, and is thus equivalent to one that satisfies this assumption, up to a shift by that constant.

Many common loss functions, including the logistic and probit losses, follow Assumption 2. Assumption 2 also straightforwardly implies that $L(w)$ is a $\beta\sigma_{\max}^2$-smooth function, where the columns of $X$ are all samples, and $\sigma_{\max}$ is the maximal singular value of $X$.
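As a rough numerical illustration of this smoothness constant, the sketch below computes $\sigma_{\max}$ with NumPy for a small made-up data matrix and forms $\beta\sigma_{\max}^2$. It assumes the logistic loss, whose second derivative is at most 1/4 (so $\beta = 1/4$). The data and the function name `smoothness_constant` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def smoothness_constant(X, beta=0.25):
    """Return beta * sigma_max(X)**2 for a d x N data matrix X (columns are samples).

    beta=0.25 assumes the logistic loss l(u) = log(1 + exp(-u)),
    whose second derivative is bounded by 1/4.
    """
    sigma_max = np.linalg.norm(X, ord=2)  # largest singular value of X
    return beta * sigma_max ** 2

# Hypothetical toy data: d = 2 features, N = 3 samples (labels already absorbed).
X = np.array([[1.0, 1.0, 3.0],
              [1.0, -1.0, 2.0]])
smooth = smoothness_constant(X)
```

For other losses, `beta` would need to be replaced by the corresponding bound on the second derivative.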

Under these conditions, the infimum of the optimization problem is zero, but it is not attained at any finite $w$. Furthermore, no finite critical point exists. We consider minimizing eq. 1 using Stochastic Gradient Descent (SGD) with a fixed learning rate $\eta$, i.e., with steps of the form:

 $$w(t+1) = w(t) - \frac{\eta}{B}\sum_{n\in\mathcal{B}(t)} \ell'(w(t)^\top x_n)\, x_n, \tag{3}$$

where $\mathcal{B}(t) \subset \{1,\dots,N\}$ is a minibatch of $B$ distinct indices, chosen so that $K = N/B$ is an integer, and that satisfies one of the following assumptions. The first option is the assumption of random sampling with replacement:

Assumption 3a (Random sampling with replacement).

At each iteration $t$ we randomly and uniformly sample a minibatch $\mathcal{B}(t)$ of $B$ distinct indices, so that each sample has an identical probability of being selected.

For example, this assumption holds if at each iteration we uniformly sample the indices without replacement from $\{1,\dots,N\}$, or uniformly sample $k \in \{0,\dots,K-1\}$ and select $\mathcal{B}(t) = \mathcal{B}_k$, where $\{\mathcal{B}_k\}_{k=0}^{K-1}$ is some fixed partition of the data indices, i.e.,

 $$\bigcup_{k=0}^{K-1}\mathcal{B}_k = \{1,\dots,N\}.$$

This assumption is rather common in theoretical analysis, but less common in practice. The next alternative sampling method is more common in practice:

Assumption 3b (Sampling without replacement).

At each epoch, the minibatches partition the data:

 $$\forall u \in \{0,1,2,\dots\}: \bigcup_{k=0}^{K-1}\mathcal{B}(Ku+k) = \{1,\dots,N\}.$$

This way, each sample is chosen exactly once at each epoch, and SGD completes balanced passes over the data. An important special case of this assumption is random sampling without replacement, which is the practically common method. Other special cases are periodic sampling (round-robin), and even adversarial selection of the order of the samples.
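The two sampling regimes can be sketched as index generators. This is only a minimal illustration: the function names are hypothetical, and the without-replacement version implements just one special case of Assumption 3b (a fresh random partition every epoch), not the adversarial orderings that the assumption also allows.

```python
import numpy as np

def batches_with_replacement(N, B, steps, rng):
    """Assumption 3a style: every step independently draws B distinct indices uniformly."""
    for _ in range(steps):
        yield rng.choice(N, size=B, replace=False)

def batches_without_replacement(N, B, epochs, rng):
    """Assumption 3b style: each epoch is a random partition into K = N / B minibatches."""
    assert N % B == 0, "B must divide N so that K = N / B is an integer"
    for _ in range(epochs):
        perm = rng.permutation(N)
        for k in range(N // B):
            yield perm[k * B:(k + 1) * B]

rng = np.random.default_rng(0)
epoch = list(batches_without_replacement(N=6, B=2, epochs=1, rng=rng))
draws = list(batches_with_replacement(N=6, B=2, steps=5, rng=rng))
```

In the first generator the same sample may reappear in consecutive minibatches, while in the second every sample is visited exactly once per epoch.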

3 Main Result 1: The Loss Converges to a Global Infimum

The weight norm always diverges to infinity, for any learning rate, as we prove next.

Lemma 1.

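Given Assumptions 1 and 2, and any starting point $w(0)$, the iterates of SGD on $L(w)$ (eq. 3), with either sampling regime (Assumption 3a or 3b), diverge to infinity, i.e. $\lim_{t\to\infty}\|w(t)\| = \infty$.

Proof.

Since the data is linearly separable, $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$. We examine the dot product of $w_*$ with the iterates of SGD:

 $$w_*^\top w(t+1) = w_*^\top w(t) - \frac{\eta}{B}\sum_{n\in\mathcal{B}(t)} \ell'(w(t)^\top x_n)\, w_*^\top x_n.$$

Since $\forall n: w_*^\top x_n > 0$ and $\ell'(u) < 0$ for any finite $u$, the sequence $w_*^\top w(t)$ is increasing, so we get that either $w_*^\top w(t) \to \infty$ or the increments vanish, i.e. $\ell'(w(t)^\top x_n) \to 0$ for the sampled indices $n \in \mathcal{B}(t)$. In the first case, from Cauchy-Schwarz,

 $$\|w(t)\| \ge \frac{w_*^\top w(t)}{\|w_*\|} \to \infty.$$

In the second case, since $-\ell'(u)$ is strictly positive for any finite value, and achieves zero only at $u \to \infty$, we must have $w(t)^\top x_n \to \infty$, which again implies $\|w(t)\| \to \infty$.

Combining both cases, we prove the lemma. ∎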

As the weights go to infinity, we wish to understand the asymptotic behavior of the loss. As the next theorem shows, if the fixed learning rate is sufficiently small, then we get that the loss converges to zero.

Theorem 1.

Let $w(t)$ be the iterates of SGD (eq. 3) from any starting point $w(0)$, where samples are either (case 1) selected randomly with replacement (Assumption 3a) and with learning rate

 $$\frac{\eta}{B} < \frac{2\gamma^2}{\beta\sigma_{\max}^2}, \tag{4}$$

or (case 2) sampled without replacement (Assumption 3b) and with learning rate

 $$\frac{\eta}{B} < \dots \tag{5}$$

For linearly separable data (Assumption 1), and smooth-monotone loss function (Assumption 2), we have the following, almost surely (with probability 1) in the first case, and surely in the second case:

1. The loss converges to zero:

 $$\lim_{t\to\infty} L(w(t)) = 0,$$

2. All samples are correctly classified, given sufficiently long time:

 $$\forall n: \lim_{t\to\infty} w(t)^\top x_n = \infty,$$

3. The increments of the SGD iterates are square summable:

 $$\sum_{t=0}^{\infty}\|w(t+1) - w(t)\|^2 < \infty.$$
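The behavior described by the theorem can be observed on a tiny example. The sketch below runs eq. 3 with the logistic loss and a fixed learning rate on made-up separable data. The dataset, the learning rate value (chosen small by hand, not computed from eq. 4), and the function names are assumptions for illustration only.

```python
import numpy as np

def logistic_loss(w, X):
    """L(w) = sum_n log(1 + exp(-w^T x_n)), with labels absorbed into x_n."""
    return np.logaddexp(0.0, -(X.T @ w)).sum()

def sgd_fixed_lr(X, eta, B, steps, seed=0):
    """Run eq. 3 with a fixed learning rate and uniformly sampled minibatches."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(N, size=B, replace=False)
        u = X[:, idx].T @ w
        dl = -1.0 / (1.0 + np.exp(u))          # l'(u) for the logistic loss
        w = w - (eta / B) * (X[:, idx] @ dl)   # the update of eq. 3
    return w

# Hypothetical separable toy data (columns are samples).
X = np.array([[1.0, 1.0, 3.0],
              [1.0, -1.0, 2.0]])
w_short = sgd_fixed_lr(X, eta=0.5, B=1, steps=100)
w_long = sgd_fixed_lr(X, eta=0.5, B=1, steps=10_000)
```

Comparing `logistic_loss(w_long, X)` with `logistic_loss(w_short, X)`, and the norms of the two weight vectors, gives a quick sanity check that the loss keeps shrinking while the weights keep growing.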

The complete proof of this theorem is given in section A in the appendix. The proof relies on the following key lemma:

Lemma 2.

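The max margin lower bounds the minimal “non-negative right eigenvalue” of $X$:

 $$\gamma = \max_{w\in\mathbb{R}^d:\|w\|=1}\min_n w^\top x_n \le \min_{v\in\mathbb{R}^N_{\ge 0}:\|v\|=1}\|Xv\|. \tag{6}$$

Proof.

In this proof we define $v_*$ as the minimizer of the right hand side of eq. 6, and $w_*$ as the maximizer of the optimization problem on the left hand side of the same equation. On the one hand,

 $$w_*^\top X v_* \overset{(1)}{\le} \|w_*\|\,\|Xv_*\| \overset{(2)}{=} \min_{v\in\mathbb{R}^N_{\ge 0}:\|v\|=1}\|Xv\|, \tag{7}$$

where in (1) we used the Cauchy-Schwarz inequality, and in (2) we used the definition of $v_*$, and that $\|w_*\| = 1$. On the other hand,

 $$w_*^\top X v_* \overset{(1)}{\ge} \gamma\sum_{n=1}^{N} v_{*,n} \overset{(2)}{\ge} \gamma\sqrt{\sum_{n=1}^{N} (v_{*,n})^2} \overset{(3)}{=} \gamma, \tag{8}$$

where in (1) we used the definition of the max margin from the left hand side of eq. 6 (so $\forall n: w_*^\top x_n \ge \gamma$) and $v_{*,n} \ge 0$, in (2) we used that $v_{*,n} \ge 0$ and the triangle inequality, and in (3) we used that $\|v_*\| = 1$. Together, eqs. 7 and 8 imply the Lemma. ∎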
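The inequality of the lemma can be spot-checked numerically. The sketch below uses a made-up dataset for which the unit vector $(1, 0)$ attains margin 1, draws random non-negative unit vectors $v$, and records the smallest $\|Xv\|$ found. This only samples the right hand side of eq. 6 rather than minimizing it, so it is an illustration and not a verification.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0, 3.0],
              [1.0, -1.0, 2.0]])      # columns are samples, labels absorbed

w = np.array([1.0, 0.0])              # a unit-norm separator: w^T x_n > 0 for all n
margin_w = (X.T @ w).min()            # min_n w^T x_n, which lower bounds gamma

# Random non-negative unit vectors v in R^N, one per row of V.
V = np.abs(rng.standard_normal((1000, X.shape[1])))
V /= np.linalg.norm(V, axis=1, keepdims=True)
smallest = np.linalg.norm(V @ X.T, axis=1).min()   # smallest sampled ||X v||
```

By the chain of inequalities in eqs. 7 and 8, `smallest` should never fall below `margin_w`.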

This Lemma is useful since the SGD weight increments in eq. 3 have the form $Xv$, where $v$ is some vector with non-negative components. This enables us to bound the norm of the SGD updates using the norm of the full gradient, which allows us to use similar analysis as for GD. Additionally, we note the regime we analyze in Theorem 1 is somewhat unusual, as the weight vector goes to infinity. In many previous works it is assumed that there exists a finite critical point, or that the weights are bounded within a compact domain.

Theorem 1 Implications.

In both sampling regimes, we obtained that a fixed (non-vanishing) learning rate results in convergence to zero error. In the case of random sampling with replacement (Assumption 3a) we got a better upper bound on the learning rate (eq. 4), which does not depend on the number of minibatches per epoch $K$. Interestingly, this bound matches the empirical findings of Goyal et al. (2017); Smith et al. (2018), which observed that $\eta \propto B$ in a large range. In our case the relation $\eta \propto B$ holds exactly for all $B$ in the maximum learning rate (eq. 4). In contrast, for linear regression, the relation becomes sub-linear for large $B$ (Ma et al., 2017).

We also considered here the case when the datapoints are sampled without replacement (Assumption 3b). This is in contrast to most theoretical SGD results, which typically assume sampling with replacement (which is less common in practice). There are a few notable exceptions (Geary and Bertsekas (2001); Bertsekas (2011); Shamir (2016), and references therein). Perhaps the most similar previous result is the classical result of (Proposition 2.1 in Geary and Bertsekas (2001)), which has a similar sampling schedule, and in which the weights can go to infinity. However, in this result the learning rate must go to zero for the SGD iterates to converge. In our case, we are able to relax this assumption since we focus on linear classification with a monotone loss and separable data.

When assuming sampling without replacement (Assumption 3b) the learning rate bound (eq. 5) becomes significantly lower, roughly proportional to $1/K$. This is because such a sampling assumption is very pessimistic (e.g., the samples can be selected by an adversary). Therefore, a small (yet non-vanishing) learning rate is required to guarantee convergence. Such a dependence on $K$ is expected, since in this case we need to use an incremental gradient method type of proof, where such low learning rates are common. For example, Proposition 3.2b in Bertsekas (2011) similarly requires a small learning rate to reach a low final error.

4 Main Result 2: The Weight Vector Direction Converges to the Max Margin

Next, we focus on a special case of monotone loss functions:

Definition 1.

A function $f(u)$ has a “tight exponential tail", if there exist positive constants $\mu_+$, $\mu_-$, and $\bar{u}$ such that $\forall u > \bar{u}$:

 $$(1 - \exp(-\mu_- u))\,e^{-u} \le f(u) \le (1 + \exp(-\mu_+ u))\,e^{-u}$$

Assumption 4.

The negative loss derivative $-\ell'(u)$ has a tight exponential tail.

Specifically, this applies to the logistic loss function. Given this additional assumption, we prove that SGD converges to the max margin solution.
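To see why the logistic loss qualifies, the following short derivation checks the inequality of Definition 1 for $\ell(u) = \log(1 + e^{-u})$. The choice $\mu_+ = \mu_- = 1$ and the threshold $u > 0$ are one convenient option picked here for illustration, not constants stated in the paper.

```latex
\begin{aligned}
-\ell'(u) &= \frac{1}{1 + e^{u}} = \frac{e^{-u}}{1 + e^{-u}}, \\
\text{and for } x = e^{-u} \in (0, 1):\quad
1 - x &\le \frac{1}{1 + x} \le 1 \le 1 + x
\qquad \bigl(\text{since } (1 - x)(1 + x) = 1 - x^2 \le 1\bigr), \\
\text{so for all } u > 0:\quad
\bigl(1 - e^{-u}\bigr)e^{-u} &\le -\ell'(u) \le \bigl(1 + e^{-u}\bigr)e^{-u}.
\end{aligned}
```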

Theorem 2.

For almost all datasets for which the assumptions of Theorem 1 hold, if $-\ell'(u)$ has a tight exponential tail (Assumption 4), then the iterates of SGD, for any $w(0)$, will behave as:

 $$w(t) = \hat{w}\log(t) + \rho(t), \tag{9}$$

where $\hat{w}$ is the following max margin separator:

 $$\hat{w} = \arg\min_{w\in\mathbb{R}^d}\|w\|^2 \quad \text{s.t.} \quad w^\top x_n \ge 1, \tag{10}$$

and the residual $\rho(t)$ is bounded almost surely in the first case of Theorem 1 (random sampling with replacement), or surely in the second case (sampling without replacement).

Thus, from Theorem 2, for almost any linearly separable dataset (e.g., with probability 1 if the data is sampled from an absolutely continuous distribution), the normalized weight vector converges to the normalized max margin vector, i.e.,

 $$\lim_{t\to\infty}\frac{w(t)}{\|w(t)\|} = \frac{\hat{w}}{\|\hat{w}\|}$$

with rate $1/\log(t)$, identically to GD (Soudry et al., 2018b). Interestingly, the number of minibatches per epoch affects only the constants. Intuitively, this is reasonable, since if we rescale the time units, then the log term in eq. 9 will only add a constant to the residual $\rho(t)$.
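The direction convergence can be watched on a toy problem where the max margin separator is known in closed form. In the sketch below the data, the learning rate, and the function name `sgd_direction` are all illustrative assumptions. For this particular dataset the constraints of eq. 10 reduce to $w_1 + w_2 \ge 1$ and $w_1 - w_2 \ge 1$, so $\hat{w} = (1, 0)$.

```python
import numpy as np

def sgd_direction(X, eta, steps, seed=0):
    """Single-sample SGD (B = 1) on the logistic loss; returns w(t) / ||w(t)||."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[0])
    for _ in range(steps):
        x = X[:, rng.integers(X.shape[1])]
        w = w + eta * x / (1.0 + np.exp(w @ x))   # uses -l'(u) = 1 / (1 + exp(u))
    return w / np.linalg.norm(w)

# Toy data: the first two columns are the support vectors of w_hat = (1, 0);
# the third column has w_hat^T x = 3 > 1, so it is not a support vector.
X = np.array([[1.0, 1.0, 3.0],
              [1.0, -1.0, 2.0]])
w_hat_dir = np.array([1.0, 0.0])
cosine = sgd_direction(X, eta=0.5, steps=20_000) @ w_hat_dir
```

Because the rate is only logarithmic in $t$, the cosine approaches 1 slowly. Rerunning with more steps is a simple way to see this.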

Proof idea.

The theorem is proved in appendix section B.1. The proof builds on the results of Soudry et al. (2018b) for GD: as the weights diverge, the loss converges to zero, and only the gradients of the support vectors remain significant. This implies that the gradient direction, as a positive linear combination of support vectors, converges to the direction of the max margin. The main difficulty in extending the proof to the case of SGD is that at each iteration, $w(t)$ is updated using only a subset of the data points. This could potentially lead to a large difference from the GD solution. However, conceptually, we show that this difference from the GD dynamics solution remains bounded. The main novel idea here is that in order to calculate this difference at time $t$, we use information on sampling selections made in the future, i.e. at times larger than $t$.

Convergence Rates.

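Theorem 2 directly implies the same convergence rates as in GD (Soudry et al., 2018b). Specifically, in the distance

 $$\left\|\frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|}\right\| = O\left(\frac{1}{\log t}\right), \tag{11}$$

in the angle

 $$1 - \frac{w(t)^\top\hat{w}}{\|w(t)\|\,\|\hat{w}\|} = O\left(\frac{1}{\log^2 t}\right), \tag{12}$$

and in the margin gap

 $$\frac{1}{\|\hat{w}\|} - \frac{\min_n x_n^\top w(t)}{\|w(t)\|} = O\left(\frac{1}{\log t}\right). \tag{13}$$

On the other hand, the loss itself decreases as

 $$L(w(t)) = O\left(\frac{1}{t}\right). \tag{14}$$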
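As a sketch of where the rate in eq. 11 comes from, assume the iterates take the form $w(t) = \hat{w}\log t + \rho(t)$ with $\|\rho(t)\| \le C$, as in the GD analysis of Soudry et al. (2018b). The constant $C$ and the elementary normalization inequality used below are introduced here for illustration and are not part of the paper's proof.

```latex
\begin{aligned}
&\text{For any nonzero } a, b:\quad
\left\|\frac{a}{\|a\|} - \frac{b}{\|b\|}\right\|
\le \frac{\bigl|\|b\| - \|a\|\bigr|}{\|b\|} + \frac{\|a - b\|}{\|b\|}
\le \frac{2\|a - b\|}{\|b\|}. \\
&\text{Take } a = \frac{w(t)}{\log t} = \hat{w} + \frac{\rho(t)}{\log t}, \quad b = \hat{w}: \\
&\left\|\frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|}\right\|
\le \frac{2\|\rho(t)\|}{\|\hat{w}\|\log t}
\le \frac{2C}{\|\hat{w}\|\log t}
= O\!\left(\frac{1}{\log t}\right).
\end{aligned}
```

The $O(1/t)$ loss rate has a similar heuristic reading: on a support vector $w(t)^\top x_n = \log t + \rho(t)^\top x_n$, so a loss with tail $e^{-u}$ contributes a term of order $e^{-\log t} = 1/t$, while the remaining points have $\hat{w}^\top x_n > 1$ and decay faster.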

In Figure 2 we visualize these results. Additionally, in Figure 3 we observe that the convergence rates remain nearly the same for different minibatch sizes, as long as we linearly scale the learning rate with the minibatch size, i.e. $\eta \propto B$. This behavior fits with the behavior of the maximal learning rate for which SGD converges in the case of sampling with replacement (eq. 4). However, it is not clear from Theorem 2 why the convergence rate stays almost exactly the same with such a linear scaling, since we do not know how the residual in eq. 9 depends on the learning rate and the minibatch size. In the special case where the SVM support vectors span the dataset, we can further characterize this asymptotic dependence. We define $P$ as the orthogonal projection matrix to the subspace spanned by the support vectors, and $\bar{P} = I - P$ as the complementary projection. In addition, we denote by $\alpha$ the SVM dual variables, so $\hat{w} = \sum_{n\in\mathcal{S}}\alpha_n x_n$.

Theorem 3.

Under the conditions and notation of Theorem 2, for almost all datasets, if in addition the support vectors span the data (i.e. , where is a matrix whose columns are only those data points s.t. ), then , where is a solution to

 $$\forall n \in \mathcal{S}: \exp(-x_n^\top\tilde{w}) = \alpha_n, \qquad \bar{P}(\tilde{w} - w(0)) = 0. \tag{15}$$

The theorem is proved in appendix section B.2. Note that $\tilde{w}$ is only dependent on the dataset and the initialization. This fact enables us to state the following result for the asymptotic behavior of SGD.

Corollary 1.

Under the conditions and notation of Theorem 3, GD iterate will behave as:

where is the maximum-margin separator, is the solution of eq. 15 (which does not depend on and ), and is a vanishing term. Therefore, if the step size is kept proportional to the minibatch size, i.e., , changing the number of minibatches is equivalent to linearly re-scaling the time units of .

From the corollary, we expect the same asymptotic convergence rates for all batch sizes as long as we scale the learning rate linearly with the batch size, i.e., keep $\eta \propto B$. This is exactly the behavior we observe in Figure 3. Since changing the number of minibatches is equivalent to linearly re-scaling the time units, a smaller $K$ implies faster asymptotic convergence assuming full parallelization capabilities (i.e. the minibatch size does not affect the iterate time). Additionally, note that the corollary only guarantees the same asymptotic behavior. In particular, different initializations and datasets can exhibit different behavior initially. It remains an interesting direction for future work to understand the dependence on the learning rate and the minibatch size in the case when the support vectors do not span the dataset.
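The linear scaling rule can be tried out on a small example. The sketch below runs without-replacement SGD (eq. 3) for the same number of epochs with several minibatch sizes, setting $\eta$ proportional to $B$, and records the final logistic loss for each. The dataset (whose support vectors span the data), the base step `eta_per_sample`, and the function name are assumptions made for illustration.

```python
import numpy as np

def loss_after_epochs(X, eta_per_sample, B, epochs, seed=0):
    """Without-replacement SGD (eq. 3) with eta = eta_per_sample * B; returns the final loss."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    eta = eta_per_sample * B                 # linear scaling of the learning rate
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(N)            # fresh random partition each epoch
        for k in range(N // B):
            idx = perm[k * B:(k + 1) * B]
            u = X[:, idx].T @ w
            w = w + (eta / B) * (X[:, idx] @ (1.0 / (1.0 + np.exp(u))))
    return np.logaddexp(0.0, -(X.T @ w)).sum()

# Hypothetical toy data with N = 4; the first two columns are the support vectors.
X = np.array([[1.0, 1.0, 3.0, 2.0],
              [1.0, -1.0, 2.0, -1.5]])
final_losses = {B: loss_after_epochs(X, eta_per_sample=0.25, B=B, epochs=2000)
                for B in (1, 2, 4)}
```

With this scaling the factor $\eta / B$ in eq. 3 stays fixed, so the comparison is per epoch. Counting iterations instead, the run with $B = 4$ uses four times fewer updates than the run with $B = 1$.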

Lastly, for logistic regression loss, the validation loss (calculated on an independent validation set) increases as $\Omega(\log t)$.

Notably, as was observed in Soudry et al. (2018b), these asymptotic rates also match what we observe numerically for the convnet in Figure 1: the training loss decreases as $1/t$, the validation loss increases as $\log(t)$, and the validation (classification) error improves very slowly, similarly to the logarithmic decay of the angle gap (so the convnet might have a similarly slow decay to its respective implicit bias).

5 Discussion and Related Works

In Theorem 1 we proved that for monotone smooth loss functions on linearly separable data, the iterates of SGD with a sufficiently small (but non-vanishing) learning rate converge to zero loss. In contrast to typical convergence to finite critical points, in this case, the "noise" inherent in SGD vanishes asymptotically. Therefore, we do not need to decrease the learning rate, or average the SGD iterates, to ensure exact convergence. Decaying the learning rate during training will only decrease the convergence speed of the loss.

To the best of our knowledge, such an exact convergence result previously required that either (1) the loss function is partially strongly convex, i.e. strongly convex except on some subspace (where the dynamics are frozen), as shown in (Ma et al., 2017) for the case of over-parameterized linear regression (with more parameters than samples); or (2) that the Polyak-Lojasiewicz (PL) condition applies (Bassily et al., 2018). However, in this paper we do not require such conditions, which do not hold for deep networks, even in the vicinity of the (finite or infinite) critical points. Moreover, the dependence of the learning rate on the minibatch size is different, as we discuss next.

We proved Theorem 1 both for random sampling with replacement (Assumption 3a) and for sampling without replacement (Assumption 3b). In the first case, eq. 4 implies that, to guarantee convergence, we need to increase the learning rate proportionally to the minibatch size. In the second case (sampling without replacement) the learning rate bound (eq. 5) is more pessimistic, since our assumption is more general (e.g., it includes adversarial sampling).

In Theorem 2, we proved, given the additional assumption of an exponential tail (e.g., as in logistic regression), that for almost all datasets the weight vector converges to the max margin in direction as $O(1/\log t)$, and that the training loss converges to zero as $O(1/t)$. We believe these results could be extended to every dataset, using the techniques of Soudry et al. (2018a). Again, decaying the learning rate will only degrade the convergence speed to the max margin direction. In fact, the results of Nacson et al. (2019) indicate that we may need to increase the learning rate to improve convergence: for GD, Nacson et al. (2019) proved that this can drastically improve the convergence rate from $O(1/\log t)$ to $O(\log(t)/\sqrt{t})$. It is yet to be seen if such results might also be applied to deep networks.

In Theorem 3 we further characterized the asymptotic behaviour of the weights under the additional assumption that the SVM support vectors span the dataset. Combining the results from Theorem 2 and Theorem 3 we obtain Corollary 1. This corollary states that, under linear scaling of the learning rate with the batch size, the asymptotic convergence rate of SGD, in terms of epochs, is not affected by the minibatch size.

Thus, we have shown that exact linear scaling of the learning rate with the minibatch size ($\eta \propto B$) is beneficial in two ways: (a) in Theorem 1, for the upper bound of the learning rate in the case of random sampling with replacement; (b) in Corollary 1, for the asymptotic behaviour of the weights, assuming a loss function with a tight exponential tail and that the SVM support vectors span the data. This exact linear scaling stands in contrast to previous theoretical results with exact convergence (Ma et al., 2017), in which there exists a "saturation limit". Above this limit we should not increase the learning rate linearly with the minibatch size, or the convergence rate will be degraded, and eventually we will lose the convergence guarantee. As predicted by Corollary 1, in Figure 3 we observe that with a linear scaling $\eta \propto B$, the convergence plots closely match: as we can see, there is almost no asymptotic difference between different minibatch sizes. Therefore, in contrast to Ma et al. (2017), there is no "optimal" minibatch size. In this case, to minimize the number of SGD iterations we should use the largest minibatch possible. This will speed up convergence in wall clock time (as was done in Goyal et al. (2017); Smith et al. (2018)) if it is possible to parallelize the calculation of a minibatch, so that one SGD update with a minibatch of size $B$ takes less time than $B$ updates of SGD with a minibatch of size 1.

An early version of this manuscript previously appeared on arXiv. However, it had only the results in the case of sampling without replacement, and no Theorem 3. Two other related SGD results appeared on arXiv in parallel (with less than a week difference).

First, Ji and Telgarsky (2018) analyzed logistic regression optimized by SGD on separable data (in addition to other results on GD when the data is non-separable). Ji and Telgarsky (2018) also assume a fixed learning rate, but use averaging of the iterates (which is known to enable exact convergence). They focus on the case in which the datapoints are independently sampled from a separable distribution, while we focused on the case of sampling from a fixed dataset. They show that, with high probability, the population risk converges to zero as $\tilde{O}(1/t)$. As explained in Ji and Telgarsky (2018), such a fast rate was proven before only for strongly convex loss functions (the logistic loss is not strongly convex). We showed a similar rate, but for the empirical risk (eq. 14). We additionally showed that the weight vector converges in direction to the direction of the max margin.

Second, among other results, Xu et al. (2018) also examined optimizing logistic regression with SGD on a fixed dataset using random sampling with replacement, iterate averaging and a vanishing learning rate. There, in Theorems 3.2 and 3.3, it is shown that both the expectation of the loss and the expectation of the averaged iterates (in norm) converge, at rates that are slower than our result. Thus, in contrast to both works Ji and Telgarsky (2018); Xu et al. (2018), we did not assume iterate averaging or a decreasing learning rate. Additionally, our new results on sampling with replacement give a linear relationship between the learning rate and the minibatch size, and Corollary 1 shows the effect of the minibatch size on the asymptotic convergence rate.

6 Conclusions

We found that for logistic regression with no bias on separable data, SGD behaves similarly to GD in terms of the implicit bias and convergence rate. The only difference is that the maximum possible learning rate should change proportionally to the minibatch size. It remains to be seen if this also holds for deep networks.

Acknowledgements

The authors are grateful to C. Zeno and I. Golan for helpful comments on the manuscript. This research was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub Foundation. A Titan Xp used for this research was donated by the NVIDIA Corporation. NS was partially supported by NSF awards IIS-1302662 and IIS-1764032.

References

• Bach and Moulines (2011) Francis Bach and Eric Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. NIPS, 2011.
• Bassily et al. (2018) Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. pages 1–7, 2018.
• Ben-David and Shalev-Shwartz (2014) Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. 2014.
• Bertsekas (1999) D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
• Bertsekas (2011) Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, jul 2011.
• Bottou et al. (2016) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. 2016.
• Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
• Geary and Bertsekas (2001) A. Geary and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304), 1(1):907–912, 2001.
• Ghadimi et al. (2013) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Math. Prog., 155(1-2):267–305, 2013.
• Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.
• Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In NIPS, jun 2018a.
• Gunasekar et al. (2018b) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
• Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.
• Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv, pages 1–21, 2017.
• Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
• Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning. 2017.
• Nacson et al. (2019) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. AISTATS, 2019.
• Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• Shamir (2016) Ohad Shamir. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization. pages 1–36, 2016.
• Smith et al. (2018) Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
• Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint: 1710.10345v3, 2018a.
• Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. ICLR, 2018b.
• Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models? arXiv, 2018.

Appendix A Proof of Theorem 1

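Our proof relies on Lemma 2. Specifically, since we assumed $\ell'(u) < 0$, this Lemma implies that

$$
\|\nabla L(w(t))\| = \left\|\sum_{n=1}^{N} \ell'(x_n^\top w(t))\, x_n\right\| \ge \gamma \sqrt{\sum_{n=1}^{N} \left(\ell'(x_n^\top w(t))\right)^2}. \tag{17}
$$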

Next, we will rely on this key fact to prove our results for each case.
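The inequality in eq. 17 can be checked numerically. The following is a minimal sketch, not part of the proof: the synthetic dataset, the logistic loss, and the choice $\gamma = 1$ (the margin of a known unit-norm separator, which lower-bounds the max margin) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5

# Hypothetical separable data with labels absorbed into x_n: the unit vector
# e_1 satisfies x_n^T e_1 >= 1 for every n, so gamma = 1 is a valid margin.
X = rng.normal(size=(N, d))
X[:, 0] = 1.0 + np.abs(X[:, 0])
gamma = 1.0

def dloss(u):
    """Derivative of the logistic loss log(1 + exp(-u)); negative for every u."""
    return -1.0 / (1.0 + np.exp(u))

ratios = []
for _ in range(1000):
    w = rng.normal(scale=3.0, size=d)   # arbitrary weight vector
    lp = dloss(X @ w)                   # l'(x_n^T w), one entry per sample
    grad = X.T @ lp                     # gradient of L(w) = sum_n l(x_n^T w)
    ratios.append(np.linalg.norm(grad) / np.linalg.norm(lp))

# Eq. 17 says this ratio never drops below gamma.
min_ratio = min(ratios)
print(min_ratio >= gamma)
```

The check works because all entries of the vector of derivatives share the same sign, so projecting the gradient onto the separator cannot cancel terms.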

A.1 Case 1: Random sampling with replacement

From the $\beta$-smoothness of the loss

$$
L(w(t+1)) - L(w(t)) \le \nabla L(w(t))^\top \left(w(t+1) - w(t)\right) + \frac{\beta}{2}\|w(t+1) - w(t)\|^2 .
$$

Taking expectation, we have

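$$
\begin{aligned}
\mathbb{E}L(w(t+1)) - \mathbb{E}L(w(t))
&\le \mathbb{E}\left[\nabla L(w(t))^\top \left(w(t+1) - w(t)\right)\right] + \frac{\beta}{2}\mathbb{E}\|w(t+1) - w(t)\|^2 \\
&\overset{(1)}{\le} \mathbb{E}\left[\mathbb{E}\left[\nabla L(w(t))^\top \left(w(t+1) - w(t)\right) \,\middle|\, w(t)\right]\right] + \frac{\beta}{2}\eta^2\,\mathbb{E}\left\|\sum_{n=1}^{N} z_{t,n}\,\ell'(w(t)^\top x_n)\, x_n\right\|^2 \\
&\overset{(2)}{\le} -\eta\frac{1}{K}\mathbb{E}\|\nabla L(w(t))\|^2 + \frac{\beta\sigma_{\max}^2}{2}\eta^2\,\mathbb{E}\left[\sum_{n=1}^{N} z_{t,n}^2 \left(\ell'(w(t)^\top x_n)\right)^2\right] \\
&\overset{(3)}{\le} -\eta\frac{1}{K}\mathbb{E}\|\nabla L(w(t))\|^2 + \frac{\beta\sigma_{\max}^2}{2}\eta^2\frac{1}{K}\sum_{n=1}^{N}\mathbb{E}\left(\ell'(w(t)^\top x_n)\right)^2 \\
&\overset{(4)}{\le} -\eta\frac{1}{K}\mathbb{E}\|\nabla L(w(t))\|^2 + \frac{\beta\sigma_{\max}^2}{2\gamma^2}\eta^2\frac{1}{K}\mathbb{E}\|\nabla L(w(t))\|^2 \\
&= -\frac{1}{K}\eta\left(1 - \frac{\beta\sigma_{\max}^2}{2\gamma^2}\eta\right)\mathbb{E}\|\nabla L(w(t))\|^2 ,
\end{aligned}
$$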

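where in (1) we defined $z_{t,n}$ as a random variable equal to $1$ if sample $n$ is selected at time $t$, or $0$ otherwise, in (2) we used the definition of $\sigma_{\max}$, in (3) we used $z_{t,n}^2 = z_{t,n}$ and $\mathbb{E}z_{t,n} = 1/K$, and in (4) we used eq. 17. Therefore, if

$$
\eta < \frac{2\gamma^2}{\beta\sigma_{\max}^2} \tag{18}
$$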

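then $1 - \frac{\beta\sigma_{\max}^2}{2\gamma^2}\eta > 0$, and we can write

$$
\mathbb{E}\|\nabla L(w(t))\|^2 \le \frac{K\left(\mathbb{E}L(w(t)) - \mathbb{E}L(w(t+1))\right)}{\eta\left(1 - \frac{\beta\sigma_{\max}^2}{2\gamma^2}\eta\right)} .
$$

Summing over $t$, the right-hand side telescopes and, since the loss is non-negative, we have

$$
\sum_{t=1}^{\infty}\mathbb{E}\|\nabla L(w(t))\|^2 \le \frac{K\,\mathbb{E}L(w(1))}{\eta\left(1 - \frac{\beta\sigma_{\max}^2}{2\gamma^2}\eta\right)} < \infty \tag{19}
$$

and therefore $\mathbb{E}\|\nabla L(w(t))\|^2 \to 0$. Moreover, from the Markov inequality, for any $c > 0$ we have

$$
P\left(\sum_{t=1}^{\infty}\|\nabla L(w(t))\|^2 < c\right) \ge 1 - \frac{1}{c}\,\mathbb{E}\left[\sum_{t=1}^{\infty}\|\nabla L(w(t))\|^2\right] .
$$

Combining this equation with equation 19, and taking the limit of $c$ to $\infty$, we obtain

$$
P\left(\sum_{t=1}^{\infty}\|\nabla L(w(t))\|^2 < \infty\right) = 1 . \tag{20}
$$

Therefore, with probability 1, we have $\|\nabla L(w(t))\| \to 0$, which implies $L(w(t)) \to 0$. Moreover,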

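$$
\sum_{t=1}^{\infty}\|w(t+1) - w(t)\|^2 = \eta^2\sum_{t=1}^{\infty}\left\|\sum_{n\in B(t)}\ell'(w(t)^\top x_n)\, x_n\right\|^2 \overset{(1)}{\le} \frac{\eta^2\sigma_{\max}^2}{\gamma^2}\sum_{t=1}^{\infty}\|\nabla L(w(t))\|^2 \overset{(2)}{<} \infty
$$

where in (1) we used the definition of $\sigma_{\max}$ and eq. 17, and (2) is true with probability 1 from eq. 20.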
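The behavior established in this case can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of any experiment: the synthetic data, the logistic loss (which is $1/4$-smooth), the margin $\gamma = 1$ of a known separator, and the choice of drawing a fresh random minibatch of size $N/K$ at every step are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 20, 5, 4                      # batch size is N // K

# Hypothetical separable data (labels absorbed into x_n); x_n^T e_1 >= 1,
# so gamma = 1 is the margin of a known unit-norm separator.
X = rng.normal(size=(N, d))
X[:, 0] = 1.0 + np.abs(X[:, 0])
gamma, beta = 1.0, 0.25                 # logistic loss is 1/4-smooth
sigma_max = np.linalg.norm(X, 2)        # largest singular value of X

# Fixed learning rate: half of the upper bound in eq. 18.
eta = 0.5 * 2 * gamma**2 / (beta * sigma_max**2)

def loss(w):
    return np.sum(np.logaddexp(0.0, -X @ w))

w = np.zeros(d)
initial_loss = loss(w)
for t in range(5000):
    # Batches are drawn independently across steps, so each sample is
    # selected with probability 1/K at every iteration.
    batch = rng.choice(N, size=N // K, replace=False)
    lp = -1.0 / (1.0 + np.exp(X[batch] @ w))    # l'(x_n^T w) on the batch
    w -= eta * X[batch].T @ lp                  # SGD step, fixed eta
final_loss = loss(w)
print(initial_loss, final_loss)
```

The learning rate is never decayed and the iterates are not averaged; the loss still keeps shrinking because every per-sample gradient points in a direction that increases the margin.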

A.2 Case 2: Sampling without replacement

Linear separability enforces a lower bound on the norm of the full gradient (eq. 17, which follows from Lemma 2). This bound enables us to bound the SGD increments, and other related quantities, in terms of the norm of the full gradient (Lemma 3 below).

Lemma 3.

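For all $t$ and $k$, such that $t$ and $t+k$ are in the same epoch, we have

$$
\begin{aligned}
\|w(t+k) - w(t) + \eta\nabla L(w(t))\| &\le \eta^2 k\beta\sigma_{\max}^3\gamma^{-1}\left[1 - \eta k\beta\sigma_{\max}^2\right]^{-1}\|\nabla L(w(t))\| \\
\|w(t+k) - w(t)\| &\le \eta\gamma^{-1}\sigma_{\max}\left[1 - \eta k\beta\sigma_{\max}^2\right]^{-1}\|\nabla L(w(t))\| \\
\|\nabla L(w(t+k)) - \nabla L(w(t))\| &\le \eta\beta\gamma^{-1}\sigma_{\max}^2\left[1 - \eta k\beta\sigma_{\max}^2\right]^{-1}\|\nabla L(w(t))\| .
\end{aligned}
$$
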
Proof.

See appendix section A.3. ∎

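Together, these bounds enable us to complete the proof. First, we assume that $t$ is the first iteration in some epoch, i.e., $t = uK$ for some integer $u \ge 0$. The $\beta$-smoothness of the loss function $\ell$ (Assumption 1) implies that $L(w)$ is $\beta\sigma_{\max}^2$-smooth. This entails that

$$
\begin{aligned}
&L(w(t+K)) - L(w(t)) - \frac{\beta\sigma_{\max}^2}{2}\|w(t+K) - w(t)\|^2 \\
&\quad\le \nabla L(w(t))^\top\left(w(t+K) - w(t)\right) \\
&\quad= \nabla L(w(t))^\top\left(-\eta\nabla L(w(t)) + w(t+K) - w(t) + \eta\nabla L(w(t))\right) \\
&\quad\le -\eta\|\nabla L(w(t))\|^2 + \|\nabla L(w(t))\|\,\|w(t+K) - w(t) + \eta\nabla L(w(t))\|
\end{aligned} \tag{21}
$$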

and therefore,

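$$
\begin{aligned}
L(w(t+K)) - L(w(t))
&\overset{(1)}{\le} -\eta\|\nabla L(w(t))\|^2 + \eta^2 K\beta\sigma_{\max}^3\gamma^{-1}\left[1 - \eta K\beta\sigma_{\max}^2\right]^{-1}\|\nabla L(w(t))\|^2 \\
&\qquad + \frac{1}{2}\eta^2\beta\gamma^{-2}\sigma_{\max}^4\left[1 - \eta K\beta\sigma_{\max}^2\right]^{-2}\|\nabla L(w(t))\|^2 \\
&= -\eta\left(1 - \eta\left(K\beta\sigma_{\max}^3\gamma^{-1}\left[1 - \eta K\beta\sigma_{\max}^2\right]^{-1} + \frac{1}{2}\beta\gamma^{-2}\sigma_{\max}^4\left[1 - \eta K\beta\sigma_{\max}^2\right]^{-2}\right)\right)\|\nabla L(w(t))\|^2 \\
&\overset{(2)}{\le} -\eta\left(1 - \eta\, 2\beta\sigma_{\max}^3\gamma^{-1}\left(K + \gamma^{-1}\sigma_{\max}\right)\right)\|\nabla L(w(t))\|^2 \\
&\overset{(3)}{=} -\eta(1 - \eta q)\|\nabla L(w(t))\|^2
\end{aligned}
$$

where in (1) we used eq. 21 and the first two equations in Lemma 3, in (2) we recall we assumed that $\eta < 1/(2K\beta\sigma_{\max}^2)$ in eq. 5, and in (3) we denoted $q = 2\beta\sigma_{\max}^3\gamma^{-1}(K + \gamma^{-1}\sigma_{\max})$. Recall we assumed $\eta < 1/q$ in eq. 5. Summing over $u$ we obtain

$$
\sum_{u=0}^{\infty}\|\nabla L(w(uK))\|^2 \le \frac{L(w(0)) - \lim_{u\to\infty}L(w(uK))}{\eta(1 - \eta q)} \le \frac{L(w(0))}{\eta(1 - \eta q)} < \infty
$$

since $L(w) \ge 0$ and $\eta q < 1$ according to our assumption on $\eta$.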
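The simplification of the bracketed factors in the bound on $L(w(t+K)) - L(w(t))$ can be spelled out as a short worked step. The shorthand $a = \eta K\beta\sigma_{\max}^2$ is introduced here only for illustration, and the step assumes that the learning-rate condition guarantees $a \le 1/2$:

$$
\begin{aligned}
a \le \tfrac{1}{2} \;&\Longrightarrow\; [1 - a]^{-1} \le 2, \qquad [1 - a]^{-2} \le 4, \\
K\beta\sigma_{\max}^3\gamma^{-1}[1 - a]^{-1} + \tfrac{1}{2}\beta\gamma^{-2}\sigma_{\max}^4[1 - a]^{-2}
&\le 2K\beta\sigma_{\max}^3\gamma^{-1} + 2\beta\gamma^{-2}\sigma_{\max}^4 \\
&= 2\beta\sigma_{\max}^3\gamma^{-1}\left(K + \gamma^{-1}\sigma_{\max}\right).
\end{aligned}
$$

Under that assumption, the per-epoch decrease is strictly negative whenever $\eta$ times the last expression is below one.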

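Next, we consider general time $t$ (i.e., not only the first iteration of each epoch, as we assumed until now). We note that, for any $k$ such that $t+k$ is in the same epoch as $t$, we have that

$$
\begin{aligned}
\|\nabla L(w(t+k))\| &\le \|\nabla L(w(t))\| + \|\nabla L(w(t+k)) - \nabla L(w(t))\| \\
&\le \left(1 + \eta\beta\gamma^{-1}\sigma_{\max}^2\left[1 - \eta k\beta\sigma_{\max}^2\right]^{-1}\right)\|\nabla L(w(t))\|
\end{aligned}
$$

where we used the last equation in Lemma 3. Thus, combining the last two equations we obtain

$$
\sum_{u=0}^{\infty}\|\nabla L(w(u))\|^2 = \sum_{u=0}^{\infty}\sum_{k=0}^{K-1}\|\nabla L(w(uK+k))\|^2 \le \left(1 + \eta\beta\gamma^{-1}\sigma_{\max}^2\left[1 - \eta K\beta\sigma_{\max}^2\right]^{-1}\right)^2 K\sum_{u=0}^{\infty}\|\nabla L(w(uK))\|^2 < \infty \tag{22}
$$

which also implies that $\|\nabla L(w(t))\| \to 0$. Next, we recall eq. 17 to obtain

$$
\sqrt{\sum_{n=1}^{N}\left(\ell'(x_n^\top w(t))\right)^2} \le \frac{1}{\gamma}\|\nabla L(w(t))\| \to 0 .
$$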

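Therefore, $\ell'(x_n^\top w(t)) \to 0$ for all $n$. Since $-\ell'(u)$ is strictly positive, and tends to zero only as $u \to \infty$ (from assumption 1), we obtain that

$$
\lim_{t\to\infty} x_n^\top w(t) = \infty \quad \text{for all } n .
$$

Finally, using eq. 17 again, we obtain

$$
\|\nabla L(w(t))\| \ge \gamma\sqrt{\sum_{n=1}^{N}\left(\ell'(x_n^\top w(t))\right)^2} \ge \gamma\sqrt{\sum_{n\in B(t)}\left(\ell'(x_n^\top w(t))\right)^2} \tag{23}
$$

Combining eq. 23 and 22, together with the definition of $\sigma_{\max}$, we obtain that

$$
\sum_{t=0}^{\infty}\|w(t+1) - w(t)\|^2 \le \frac{\eta^2\sigma_{\max}^2}{\gamma^2}\sum_{t=0}^{\infty}\|\nabla L(w(t))\|^2 < \infty .
$$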
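The epoch-level decrease used in this case can also be observed numerically. The sketch below is illustrative only and rests on assumptions: synthetic data, the logistic loss with $\beta = 1/4$, the margin $\gamma = 1$ of a known separator, a fresh random partition into $K$ minibatches in every epoch, and a learning rate set to half of $\min\{1/(2K\beta\sigma_{\max}^2),\, 1/q\}$, which is the form of the condition assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 20, 5, 4                      # K disjoint minibatches per epoch

# Hypothetical separable data (labels absorbed into x_n); x_n^T e_1 >= 1.
X = rng.normal(size=(N, d))
X[:, 0] = 1.0 + np.abs(X[:, 0])
gamma, beta = 1.0, 0.25                 # logistic loss is 1/4-smooth
s = np.linalg.norm(X, 2)                # sigma_max of the data matrix

# q as denoted in the proof; the exact form of the step-size condition
# is an assumption of this sketch.
q = 2 * beta * s**3 / gamma * (K + s / gamma)
eta = 0.5 * min(1.0 / (2 * K * beta * s**2), 1.0 / q)

def loss(w):
    return np.sum(np.logaddexp(0.0, -X @ w))

w = np.zeros(d)
epoch_losses = [loss(w)]
for epoch in range(200):
    perm = rng.permutation(N)           # each sample is used once per epoch
    for batch in np.split(perm, K):
        lp = -1.0 / (1.0 + np.exp(X[batch] @ w))
        w -= eta * X[batch].T @ lp
    epoch_losses.append(loss(w))

# The loss at epoch boundaries should decrease at every epoch.
monotone = bool(np.all(np.diff(epoch_losses) < 0))
print(monotone, epoch_losses[0], epoch_losses[-1])
```

The learning rate allowed by this condition is conservative, so the decrease per epoch is small; the point of the sketch is the monotone behavior at epoch boundaries, not the speed.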

A.3 Proof of Lemma 3

First, we prove the following technical Lemma.

Lemma 4.

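Let $\epsilon$ and $\theta$ be two positive constants. If $\delta_k \le \theta + \epsilon\sum_{u=0}^{k-1}\delta_u$ then

$$
\delta_k \le \frac{\theta}{1 - k\epsilon} \tag{24}
$$

and

$$
\sum_{u=0}^{k-1}\delta_u \le \frac{k\theta}{1 - k\epsilon} . \tag{25}
$$
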
Proof.

We prove this by direct calculation

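$$
\begin{aligned}
\delta_k &\le \theta + \epsilon\sum_{u=0}^{k-1}\delta_u \le \theta + \epsilon\sum_{u_1=0}^{k-1}\left(\theta + \epsilon\sum_{u_2=0}^{u_1-1}\delta_{u_2}\right) \\
&\le \theta + \epsilon\sum_{u_1=0}^{k-1}\theta + \epsilon^2\sum_{u_1=0}^{k-1}\sum_{u_2=0}^{u_1-1}\theta + \dots + \epsilon^k\sum_{u_1=0}^{k-1}\sum_{u_2=0}^{u_1-1}\cdots\sum_{u_k=0}^{u_{k-1}-1}\theta \\
&\le \theta\left[1 + \epsilon k + \epsilon^2 k(k-1) + \dots + \epsilon^k k!\right] \\
&\le \theta\sum_{u=0}^{k}(k\epsilon)^u = \theta\,\frac{1 - (k\epsilon)^{k+1}}{1 - k\epsilon} \le \frac{\theta}{1 - k\epsilon}
\end{aligned}
$$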

Also, from the first and last lines in the above equation, we have

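$$
\sum_{u=0}^{k-1}\delta_u \le \theta\epsilon^{-1}\sum_{u=1}^{k}(k\epsilon)^u = \theta k\sum_{u=0}^{k-1}(k\epsilon)^u \le \frac{k\theta}{1 - k\epsilon} .
$$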
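The two bounds of Lemma 4 can be sanity-checked on the extremal sequence that satisfies the recursion with equality. The constants below are hypothetical, chosen so that $k\epsilon < 1$ over the whole range, which the geometric-series step of the proof requires.

```python
import numpy as np

theta, eps = 0.7, 0.02          # hypothetical positive constants

# Worst case: delta_k = theta + eps * sum_{u<k} delta_u, with equality.
delta = []
for k in range(40):             # k * eps stays below 1 for k = 0..39
    delta.append(theta + eps * sum(delta))
delta = np.array(delta)

ks = np.arange(len(delta))
# partial[k] = sum_{u=0}^{k-1} delta_u, with an empty sum for k = 0.
partial = np.concatenate(([0.0], np.cumsum(delta)[:-1]))

bound_24 = theta / (1 - ks * eps)        # right-hand side of eq. 24
bound_25 = ks * theta / (1 - ks * eps)   # right-hand side of eq. 25

holds_24 = bool(np.all(delta <= bound_24 + 1e-12))
holds_25 = bool(np.all(partial <= bound_25 + 1e-12))
print(holds_24, holds_25)
```

For this sequence $\delta_k = \theta(1+\epsilon)^k$, so the check amounts to $(1+\epsilon)^k \le 1/(1-k\epsilon)$.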

With this result in hand, we complete the proof by direct calculation

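$$
\begin{aligned}
&\left\|w(t+k) - w(t) + \eta\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n\right\| \\
&\quad\overset{(1)}{\le} \eta\sum_{u=0}^{k-1}\left\|\sum_{n\in B(t+u)}\left[-\ell'(x_n^\top w(t+u)) + \ell'(x_n^\top w(t))\right] x_n\right\| \\
&\quad\overset{(2)}{\le} \eta\sigma_{\max}\sum_{u=0}^{k-1}\sqrt{\sum_{n=1}^{N}\left[-\ell'(x_n^\top w(t+u)) + \ell'(x_n^\top w(t))\right]^2} \\
&\quad\overset{(3)}{\le} \eta\beta\sigma_{\max}\sum_{u=0}^{k-1}\sqrt{\sum_{n=1}^{N}\left(x_n^\top\left(w(t+u) - w(t)\right)\right)^2} \\
&\quad\overset{(4)}{\le} \eta\beta\sigma_{\max}^2\sum_{u=0}^{k-1}\|w(t+u) - w(t)\| ,
\end{aligned} \tag{26}
$$

where in (1) we used the triangle inequality, in (2) we define $\nu_n = -\ell'(x_n^\top w(t+u)) + \ell'(x_n^\top w(t))$, and used

$$
\left\|\sum_{n\in B(t+u)}\nu_n x_n\right\| \le \sigma_{\max}\sqrt{\sum_{n\in B(t+u)}\nu_n^2} \le \sigma_{\max}\sqrt{\sum_{n=1}^{N}\nu_n^2} ,
$$

in (3) we used the fact that $\beta$ is the Lipschitz constant of $\ell'$, and in (4) we used the definition of $\sigma_{\max}$. The above bound implies the following bound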

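$$
\begin{aligned}
\|w(t+k) - w(t)\|
&\overset{(1)}{=} \left\|-\eta\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n + \eta\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n + w(t+k) - w(t)\right\| \\
&\overset{(2)}{\le} \eta\left\|\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n\right\| + \left\|w(t+k) - w(t) + \eta\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n\right\| \\
&\overset{(3)}{\le} \eta\gamma^{-1}\sigma_{\max}\|\nabla L(w(t))\| + \eta\beta\sigma_{\max}^2\sum_{u=0}^{k-1}\|w(t+u) - w(t)\| ,
\end{aligned} \tag{27}
$$

where in (1) we added and subtracted the same term, in (2) we used the triangle inequality, and in (3) we used eq. 26 and also eq. 17 to obtain

$$
\left\|\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\ell'(x_n^\top w(t))\, x_n\right\| \le \sigma_{\max}\sqrt{\sum_{u=0}^{k-1}\sum_{n\in B(t+u)}\left(\ell'(x_n^\top w(t))\right)^2} \le \sigma_{\max}\sqrt{\sum_{n=1}^{N}\left(\ell'(x_n^\top w(t))\right)^2} \le \frac{\sigma_{\max}}{\gamma}\|\nabla L(w(t))\| . \tag{28}
$$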

Next, we apply eq. 24 from Lemma 4 on eq. 27, with