
The Implicit Bias of Gradient Descent on Separable Data

We show that gradient descent on an unregularized logistic regression problem with separable data converges to the max-margin solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, and we also discuss a multi-class generalization to the cross-entropy loss. Furthermore, we show that this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.


1 Introduction

It is becoming increasingly clear that implicit biases introduced by the optimization algorithm play a crucial role in deep learning and in the generalization ability of the learned models (Neyshabur et al., 2014, 2015; Zhang et al., 2017; Keskar et al., 2017; Neyshabur et al., 2017; Wilson et al., 2017). In particular, minimizing the training error, without any explicit regularization, over models with more parameters and more capacity than the number of training examples, often yields good generalization, despite the empirical optimization problem being highly underdetermined. That is, there are many global minima of the training objective, most of which will not generalize well, but the optimization algorithm (e.g. gradient descent) biases us toward a particular minimum that does generalize well. Unfortunately, we still do not have a good understanding of the biases introduced by different optimization algorithms in different situations.

We do have an understanding of the implicit regularization introduced by early stopping of stochastic methods or, at an extreme, of one-pass (no repetition) stochastic gradient descent (Hardt et al., 2016). However, as discussed above, in deep learning we often benefit from implicit bias even when optimizing the training error to convergence (without early stopping) using stochastic or batch methods. For loss functions with attainable, finite minimizers, such as the squared loss, we have some understanding of this: in particular, when minimizing an underdetermined least squares problem using gradient descent starting from the origin, it can be shown that we will converge to the minimum Euclidean norm solution. However, the logistic loss, and its generalization the cross-entropy loss which is often used in deep learning, do not admit a finite minimizer on separable problems. Instead, to drive the loss toward zero and thus minimize it, the norm of the predictor must diverge toward infinity.
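The least-squares statement above can be checked numerically. The following is a minimal sketch, not taken from the paper: the matrix `A`, the vector `b`, and the helper `gd_least_squares` are hypothetical, chosen so that the singular values of `A` are known and the run is deterministic.

```python
import numpy as np

def gd_least_squares(A, b, lr, steps):
    """Plain gradient descent on 0.5 * ||A w - b||^2, started from the origin."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - b)
    return w

# Underdetermined system: 3 equations, 6 unknowns (rows are mutually orthogonal).
A = np.array([
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, -1.0, -1.0],
])
b = np.array([1.0, 2.0, 3.0])

lr = 1.0 / np.linalg.norm(A, 2) ** 2      # step size below 2 / sigma_max(A)^2
w_gd = gd_least_squares(A, b, lr, steps=200)
w_min_norm = np.linalg.pinv(A) @ b        # minimum Euclidean norm interpolant

print("max |w_gd - w_min_norm| =", np.abs(w_gd - w_min_norm).max())
```

Because the iterates never leave the row space of `A` when started at the origin, the interpolating solution they reach is the one of minimum Euclidean norm, which is what the pseudo-inverse returns.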

Do we still benefit from implicit regularization when minimizing the logistic loss on separable data? Clearly the norm of the predictor itself is not minimized, since it grows to infinity. However, for prediction, only the direction of the predictor, i.e. the normalized $w(t)/\|w(t)\|$, is important. How does $w(t)/\|w(t)\|$ behave as $t \to \infty$ when we minimize the logistic (or similar) loss using gradient descent on separable data, i.e., when it is possible to get zero misclassification error and thus drive the loss to zero?

In this paper, we show that even without any explicit regularization, for all linearly separable datasets, when minimizing logistic regression problems using gradient descent, we have that $w(t)/\|w(t)\|$ converges to the $L_2$ maximum margin separator, i.e. to the solution of the hard margin SVM. This happens even though neither the norm $\|w\|$ nor the margin constraint is part of the objective or explicitly introduced into the optimization. More generally, we show the same behavior for generalized linear problems with any smooth, monotone strictly decreasing, lower bounded loss with an exponential tail. Furthermore, we characterize the rate of this convergence, and show that it is rather slow: for almost all datasets, the distance to the max-margin predictor decreases only as $O(1/\log(t))$, and in some degenerate datasets the rate further slows down to $O(\log\log(t)/\log(t))$. This explains why the predictor continues to improve even when the training loss is already extremely small. We emphasize and demonstrate that this bias is specific to gradient descent, and changing the optimization algorithm, e.g. using adaptive learning rate methods such as ADAM (Kingma and Ba, 2015), changes this implicit bias.

2 Main Results

Consider a dataset $\{x_n, y_n\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^d$ and binary labels $y_n \in \{-1, 1\}$. We analyze learning by minimizing an empirical loss of the form

$$\mathcal{L}(w) = \sum_{n=1}^{N} \ell\left(y_n w^\top x_n\right), \tag{1}$$

where $w \in \mathbb{R}^d$ is the weight vector. A bias term could be added in the usual way, extending $x_n$ by an additional '1' component. To simplify notation, we assume that all the labels are positive: $\forall n: y_n = 1$. This is without loss of generality, since we can always re-define $y_n x_n$ as $x_n$.

We are particularly interested in problems in which the data are linearly separable and the loss is smooth, strictly monotonically decreasing, and non-negative:

Assumption 1

The dataset is linearly separable: $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$.

Assumption 2

$\ell(u)$ is a positive, differentiable function that decreases monotonically to zero (so $\forall u: \ell(u) > 0$, $\ell'(u) < 0$, and $\lim_{u\to\infty} \ell(u) = \lim_{u\to\infty} \ell'(u) = 0$), and it is $\beta$-smooth, i.e. its derivative is $\beta$-Lipschitz.

(Footnote 1: The requirement of non-negativity and that the loss asymptotes to zero is purely for convenience. It is enough to require that the loss is monotone decreasing and bounded from below. Any such loss asymptotes to some constant, and is thus equivalent to one that satisfies this assumption, up to a shift by that constant.)

Assumption 2 includes many common loss functions, including the logistic, exp-loss, probit and sigmoidal losses. Assumption 2 implies that $\mathcal{L}(w)$ is a $\beta\sigma_{\max}^2(X)$-smooth function, where $\sigma_{\max}(X)$ is the maximal singular value of the data matrix $X \in \mathbb{R}^{d \times N}$.

(Footnote 2: The exp-loss does not have a global $\beta$ smoothness parameter. However, if we initialize with $\eta < 1/\mathcal{L}(w(0))$ then it is straightforward to show that the gradient descent iterates maintain bounded local smoothness.)

Under these conditions, the infimum of the optimization problem is zero, but it is not attained at any finite $w$. Furthermore, no finite critical point $w$ exists. We consider minimizing eq. 1 using Gradient Descent (GD) with a fixed learning rate $\eta$, i.e., with steps of the form:

$$w(t+1) = w(t) - \eta \nabla \mathcal{L}(w(t)) = w(t) - \eta \sum_{n=1}^{N} \ell'\left(w(t)^\top x_n\right) x_n. \tag{2}$$

We do not require convexity. Under Assumptions 1 and 2, gradient descent converges to the global minimum (i.e. to zero loss) even without it:

Lemma 1

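Let $w(t)$ be the iterates of gradient descent (eq. 2) with $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(X)$ and any starting point $w(0)$. Under Assumptions 1 and 2, we have: (1) $\lim_{t\to\infty} \mathcal{L}(w(t)) = 0$, (2) $\lim_{t\to\infty} \|w(t)\| = \infty$, and (3) $\forall n: \lim_{t\to\infty} w(t)^\top x_n = \infty$.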

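Proof. Since the data is linearly separable, $\exists w_*$ which linearly separates the data, and therefore

$$w_*^\top \nabla \mathcal{L}(w) = \sum_{n=1}^{N} \ell'\left(w^\top x_n\right) w_*^\top x_n.$$

For any finite $w$, this sum cannot be equal to zero, as it is a sum of negative terms, since $\forall n: w_*^\top x_n > 0$ and $\forall u: \ell'(u) < 0$. Therefore, there are no finite critical points $w$, for which $\nabla \mathcal{L}(w) = 0$. But gradient descent on a smooth loss with an appropriate stepsize is always guaranteed to converge to a critical point: $\nabla \mathcal{L}(w(t)) \to 0$ (see, e.g. Lemma A.2 in Appendix A.4, slightly adapted from Ganti (2015), Theorem 2). This necessarily implies that $\|w(t)\| \to \infty$ while $\forall n: w(t)^\top x_n > 0$ for large enough $t$, since only then $\ell'\left(w(t)^\top x_n\right) \to 0$. Therefore, $\mathcal{L}(w(t)) \to 0$, so GD converges to the global minimum. The main question we ask is: can we characterize the direction in which $w(t)$ diverges? That is, does the limit $\lim_{t\to\infty} w(t)/\|w(t)\|$ always exist, and if so, what is it?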
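The three claims of Lemma 1 can be observed on a small example. The sketch below is an illustration only: the three data points are hypothetical, the logistic loss (with $\beta = 1/4$) is one admissible choice under Assumption 2, and the checkpoints are arbitrary.

```python
import numpy as np

# Hypothetical separable toy data; labels are folded in, so y_n = 1. Rows are x_n.
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 3.0]])

def loss(w):
    # Logistic loss sum_n log(1 + exp(-w^T x_n)), evaluated stably.
    return np.logaddexp(0.0, -X @ w).sum()

def grad(w):
    # l'(u) = -1 / (1 + exp(u)), written as -exp(-log(1 + exp(u))).
    return -(X.T @ np.exp(-np.logaddexp(0.0, X @ w)))

beta = 0.25                                      # smoothness of the logistic loss
eta = 1.0 / (beta * np.linalg.norm(X, 2) ** 2)   # below 2 / (beta * sigma_max^2)

w = np.zeros(2)
history = []                                     # (t, loss, ||w||, min_n w^T x_n)
for t in range(1, 10001):
    w = w - eta * grad(w)                        # the update of eq. 2
    if t in (10, 100, 1000, 10000):
        history.append((t, loss(w), np.linalg.norm(w), (X @ w).min()))

for t, l, norm, margin in history:
    print(f"t={t:6d}  loss={l:.2e}  ||w||={norm:.3f}  min margin={margin:.3f}")
```

Across the checkpoints the loss keeps shrinking while both the norm and the smallest unnormalized margin keep growing, in line with items (1) to (3) of the lemma.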

In order to analyze this limit, we will need to make a further assumption on the tail of the loss function:

Definition 2

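A function $f(u)$ has a "tight exponential tail", if there exist positive constants $c$, $a$, $\mu_+$, $\mu_-$, $u_+$ and $u_-$ such that

$$\forall u > u_+: \; f(u) \le c\left(1 + \exp(-\mu_+ u)\right) e^{-au},$$
$$\forall u > u_-: \; f(u) \ge c\left(1 - \exp(-\mu_- u)\right) e^{-au}.$$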
Assumption 3

The negative loss derivative $-\ell'(u)$ has a tight exponential tail (Definition 2).

For example, the exponential loss $\ell(u) = e^{-u}$ and the commonly used logistic loss $\ell(u) = \log(1 + e^{-u})$ both follow this assumption with $a = c = 1$. We will assume $a = c = 1$ without loss of generality, since these constants can always be absorbed by re-scaling $x_n$ and $\eta$.
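As a worked check of Definition 2 for the logistic loss, the following short derivation (a sketch, with the constants $\mu_\pm = 1$ chosen for convenience rather than taken from the paper) shows that both bounds hold with $a = c = 1$ for every $u$:

```latex
\begin{align*}
\ell(u) &= \log\left(1 + e^{-u}\right),
\qquad
-\ell'(u) = \frac{e^{-u}}{1 + e^{-u}}. \\[4pt]
\text{Upper bound:}\quad
-\ell'(u) &\le e^{-u} \le \left(1 + e^{-u}\right) e^{-u}. \\[4pt]
\text{Lower bound:}\quad
\frac{1}{1 + z} &\ge 1 - z \ \text{ for } z \ge 0
\quad\Longrightarrow\quad
-\ell'(u) \ge \left(1 - e^{-u}\right) e^{-u}.
\end{align*}
```

The lower bound uses $(1 - z)(1 + z) = 1 - z^2 \le 1$ with $z = e^{-u}$. For the exponential loss, $-\ell'(u) = e^{-u}$ and both inequalities hold trivially.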

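We are now ready to state our main result:

Theorem 2

For any dataset which is linearly separable (Assumption 1), any $\beta$-smooth decreasing loss function (Assumption 2) with an exponential tail (Assumption 3), any stepsize $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(X)$ and any starting point $w(0)$, the gradient descent iterates (as in eq. 2) will behave as:

$$w(t) = \hat{w} \log t + \rho(t), \tag{3}$$

where $\hat{w}$ is the $L_2$ max margin vector (the solution to the hard margin SVM):

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{s.t.} \quad \forall n: w^\top x_n \ge 1, \tag{4}$$

and the residual grows at most as $\|\rho(t)\| = O(\log\log(t))$, and so

$$\lim_{t\to\infty} \frac{w(t)}{\|w(t)\|} = \frac{\hat{w}}{\|\hat{w}\|}.$$

Furthermore, for almost all data sets (all except measure zero), the residual $\rho(t)$ is bounded.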
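The behavior described by eq. 3 can be illustrated on a toy problem whose hard margin SVM solution is known in closed form. This is a sketch under stated assumptions: the three data points are hypothetical, $\hat{w} = (2/3, 1/3)$ was worked out by hand for them, and the logistic loss is used as one loss satisfying Assumptions 2 and 3.

```python
import numpy as np

# Hypothetical toy data; labels are folded in, so every y_n = 1. Rows are x_n.
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 3.0]])

# Hard margin SVM solution of eq. 4 for this data, derived by hand:
# the first two points are support vectors (margin exactly 1).
w_hat = np.array([2.0 / 3.0, 1.0 / 3.0])

def grad(w):
    # Gradient of sum_n log(1 + exp(-w^T x_n)); l'(u) = -1 / (1 + exp(u)).
    return -(X.T @ np.exp(-np.logaddexp(0.0, X @ w)))

beta = 0.25                                      # smoothness of the logistic loss
eta = 1.0 / (beta * np.linalg.norm(X, 2) ** 2)   # below 2 / (beta * sigma_max^2)

T = 20000
w = np.zeros(2)
for _ in range(T):
    w = w - eta * grad(w)

cosine = w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat))
growth = np.linalg.norm(w) / (np.linalg.norm(w_hat) * np.log(T))
print(f"cosine with max margin direction: {cosine:.5f}")
print(f"||w(T)|| / (||w_hat|| log T):     {growth:.3f}")
```

After 20000 steps the direction is close to, but still measurably different from, the max margin direction, and the norm is of the order $\|\hat{w}\| \log T$; the remaining gap is the bounded residual $\rho(t)$, whose relative weight decays only like $1/\log t$.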

Proof Sketch

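We first understand intuitively why an exponential tail of the loss entails asymptotic convergence to the max margin vector: Assume for simplicity that $\ell(u) = e^{-u}$ exactly, and examine the asymptotic regime of gradient descent in which $\forall n: w(t)^\top x_n \to \infty$, as is guaranteed by Lemma 1. If $w(t)/\|w(t)\|$ converges to some limit $w_\infty$, then we can write $w(t) = g(t) w_\infty + \rho(t)$ such that $g(t) \to \infty$, $\forall n: x_n^\top w_\infty > 0$, and $\lim_{t\to\infty} \rho(t)/g(t) = 0$. The gradient can then be written as:

$$-\nabla \mathcal{L}(w(t)) = \sum_{n=1}^{N} \exp\left(-w(t)^\top x_n\right) x_n = \sum_{n=1}^{N} \exp\left(-g(t)\, w_\infty^\top x_n\right) \exp\left(-\rho(t)^\top x_n\right) x_n. \tag{5}$$

As $g(t) \to \infty$ and the exponents become more negative, only those samples with the largest (i.e., least negative) exponents will contribute to the gradient. These are precisely the samples with the smallest margin $\operatorname{argmin}_n w_\infty^\top x_n$, aka the "support vectors". The negative gradient (eq. 5) would then asymptotically become a non-negative linear combination of support vectors. The limit $w_\infty$ will then be dominated by these gradients, since any initial conditions become negligible as $\|w(t)\| \to \infty$ (from Lemma 1). Therefore, $w_\infty$ will also be a non-negative linear combination of support vectors, and so will its scaling $\hat{w} = w_\infty / \left(\min_n w_\infty^\top x_n\right)$. We therefore have:

$$\hat{w} = \sum_{n=1}^{N} \alpha_n x_n, \qquad \forall n: \; \left(\alpha_n \ge 0 \text{ and } \hat{w}^\top x_n = 1\right) \text{ OR } \left(\alpha_n = 0 \text{ and } \hat{w}^\top x_n > 1\right). \tag{6}$$

These are precisely the KKT conditions for the SVM problem (eq. 4) and we can conclude that $\hat{w}$ is indeed its solution and $w_\infty$ is thus proportional to it.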
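To make the KKT conditions of eq. 6 concrete, here is a small worked example on a hypothetical three-point dataset (not from the paper), with labels already folded into the points:

```latex
\begin{align*}
x_1 &= (1, 1), \quad x_2 = (2, -1), \quad x_3 = (3, 3),
\qquad \hat{w} = \left(\tfrac{2}{3}, \tfrac{1}{3}\right). \\[4pt]
\hat{w}^\top x_1 &= \tfrac{2}{3} + \tfrac{1}{3} = 1, \qquad
\hat{w}^\top x_2 = \tfrac{4}{3} - \tfrac{1}{3} = 1, \qquad
\hat{w}^\top x_3 = 2 + 1 = 3 > 1. \\[4pt]
\alpha &= \left(\tfrac{4}{9}, \tfrac{1}{9}, 0\right):
\qquad
\tfrac{4}{9}\,(1, 1) + \tfrac{1}{9}\,(2, -1)
= \left(\tfrac{6}{9}, \tfrac{3}{9}\right) = \hat{w}.
\end{align*}
```

The points $x_1$ and $x_2$ are the support vectors, with strictly positive dual variables, while $x_3$ satisfies its constraint with slack and receives $\alpha_3 = 0$. All conditions of eq. 6 hold, so $\hat{w}$ solves eq. 4 for this dataset.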

To prove Theorem 2 rigorously, we need to show that $w(t)/\|w(t)\|$ has a limit, that $g(t) = \log(t)$, and to bound the effect of various residual errors, such as gradients of non-support vectors and the fact that the loss is only approximately exponential. To do so, we substitute eq. 3 into the gradient descent dynamics (eq. 2), with $w_\infty = \hat{w}$ being the max margin vector and $g(t) = \log t$. We then show that, except when certain degeneracies occur, the increment in the norm of $\rho(t)$ is bounded by $C_1 t^{-\nu}$ for some $C_1 > 0$ and $\nu > 1$, which is a converging series. This happens because the increment in the max margin term, $\hat{w}\left[\log(t+1) - \log(t)\right] \approx \hat{w}\, t^{-1}$, cancels out the dominant $t^{-1}$ term in the gradient $-\nabla \mathcal{L}(w(t))$ (eq. 5 with $g(t) = \log(t)$ and $w_\infty^\top x_n = 1$).

Degenerate and Non-Degenerate Data Sets

An earlier conference version of this paper (Soudry et al., 2018) included a partial version of Theorem 2, which only applies to almost all data sets, in which case we can ensure the residual $\rho(t)$ is bounded. This partial statement (for almost all data sets) is restated and proved as Theorem A in Appendix A. It applies, e.g., with probability one for data sampled from any absolutely continuous distribution. It does not apply in "degenerate" cases where some of the support vectors $x_n$ (for which $\hat{w}^\top x_n = 1$) are associated with dual variables that are zero ($\alpha_n = 0$) in the dual optimum of eq. 4. As we show in Appendix B, this only happens on measure zero data sets. Here, we prove the more general result, for all data sets, including degenerate data sets. To do so, in Theorem 4 in Appendix C we provide a more complete characterization of the iterates $w(t)$ that explicitly specifies all unbounded components even in the degenerate case. We then prove the Theorem by plugging in this more complete characterization and showing that the residual is bounded, thus establishing also Theorem 2.

Parallel Work on the Degenerate Case

Following publication of our initial version, and while preparing this revised version for publication, we learned of parallel work by Ziwei Ji and Matus Telgarsky that also closes this gap. Ji and Telgarsky (2018) provide an analysis of the degenerate case, establishing that $w(t)/\|w(t)\|$ converges to the max margin predictor by showing that $\left\| \frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|} \right\| = O\left(\sqrt{\log\log t / \log t}\right)$. Our analysis provides a more precise characterization of the iterates, and also shows the convergence is actually quadratically faster (see Section 3). However, Ji and Telgarsky go even further and provide a characterization also when the data is non-separable but $w(t)$ still goes to infinity.

More Refined Analysis of the Residual

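In some non-degenerate cases, we can further characterize the asymptotic behavior of $\rho(t)$. To do so, we need to refer to the KKT conditions (eq. 6) of the SVM problem (eq. 4) and the associated support vectors $\mathcal{S} = \operatorname{argmin}_n \hat{w}^\top x_n$. We then have the following theorem, proved in Appendix A:

Theorem. Under the conditions and notation of Theorem 2, for almost all datasets, if in addition the support vectors span the data (i.e. $\operatorname{rank}(X_{\mathcal{S}}) = \operatorname{rank}(X)$, where $X_{\mathcal{S}}$ is a matrix whose columns are only those data points $x_n$ s.t. $\hat{w}^\top x_n = 1$), then $\lim_{t\to\infty} \rho(t) = \tilde{w}$, where $\tilde{w}$ is a solution to

$$\forall n \in \mathcal{S}: \; \eta \exp\left(-x_n^\top \tilde{w}\right) = \alpha_n. \tag{7}$$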
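Taking logarithms turns eq. 7 into the linear system $x_n^\top \tilde{w} = \log(\eta / \alpha_n)$ over the support vectors, which is easy to solve when they span the data. The sketch below does this for a hypothetical two-dimensional problem; the support vectors, their dual variables, and the step size `eta` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical support vectors (rows) and their SVM dual variables alpha_n > 0.
X_S = np.array([[1.0, 1.0], [2.0, -1.0]])
alpha = np.array([4.0 / 9.0, 1.0 / 9.0])
eta = 0.1                                   # assumed step size

# eq. 7: eta * exp(-x_n^T w_tilde) = alpha_n  <=>  x_n^T w_tilde = log(eta / alpha_n)
w_tilde = np.linalg.solve(X_S, np.log(eta / alpha))

# eq. 6: the max margin vector is the alpha-weighted sum of support vectors.
w_hat = X_S.T @ alpha

# Consistency check: with rho = w_tilde, the exp-loss gradient contribution of the
# support vectors at w = w_hat * log(t) + w_tilde equals w_hat / (eta * t).
residual = eta * np.exp(-X_S @ w_tilde) - alpha

print("w_tilde =", w_tilde)
print("w_hat   =", w_hat)
```

The residual is zero up to rounding, and the recovered `w_hat` has margin exactly one on both support vectors. Note that `w_tilde` depends on the step size even though the limit direction does not.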
Analogies with Boosting

Perhaps most similar to our study is the line of work on understanding AdaBoost in terms of its implicit bias toward large $L_1$-margin solutions, starting with the seminal work of Schapire et al. (1998). Since AdaBoost can be viewed as coordinate descent on the exponential loss of a linear model, these results can be interpreted as analyzing the bias of coordinate descent, rather than gradient descent, on a monotone decreasing loss with an exact exponential tail. Indeed, with small enough step sizes, such a coordinate descent procedure does converge precisely to the maximum $L_1$-margin solution (Zhang et al., 2005; Telgarsky, 2013). In fact, Telgarsky (2013) also generalizes these results to other losses with tight exponential tails, similar to the class of losses we consider here.

Also related is the work of Rosset et al. (2004). For loss functions similar to ours, they considered the regularization path of $L_p$-regularized solutions as the regularization strength $\lambda$ goes to zero, and showed that the normalized solution becomes proportional to the maximum $L_p$ margin solution. That is, they showed how adding infinitesimal (e.g. $L_1$ and $L_2$) regularization to a logistic-type loss gives rise to the corresponding max-margin predictor. But Rosset et al. do not consider the effect of the optimization algorithm, and instead add explicit regularization; here we are specifically interested in the bias implied by the algorithm, not by adding (even infinitesimal) explicit regularization. We see that coordinate descent gives rise to the max $L_1$ margin predictor, while gradient descent gives rise to the max $L_2$ margin predictor. In Section 4.3 and in follow-up work (Gunasekar et al., 2018) we also discuss other optimization algorithms and their implied biases.

3 Implications: Rates of convergence

The solution in eq. 3 implies that $w(t)/\|w(t)\|$ converges to the normalized max margin vector $\hat{w}/\|\hat{w}\|$. Moreover, this convergence is very slow: logarithmic in the number of iterations. Specifically, our results imply the following tight rates of convergence:

Theorem 3

Under the conditions and notation of Theorem 2, for any linearly separable data set, the normalized weight vector converges to the normalized max margin vector in $L_2$ norm

$$\left\| \frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|} \right\| = O\left(\frac{\log\log t}{\log t}\right), \tag{8}$$

with this rate improving to $O(1/\log(t))$ for almost every dataset; and in angle

$$1 - \frac{w(t)^\top \hat{w}}{\|w(t)\| \|\hat{w}\|} = O\left(\left(\frac{\log\log t}{\log t}\right)^2\right), \tag{9}$$

with this rate improving to $O(1/\log^2(t))$ for almost every dataset; and the margin converges as

$$\frac{1}{\|\hat{w}\|} - \frac{\min_n x_n^\top w(t)}{\|w(t)\|} = O\left(\frac{1}{\log t}\right). \tag{10}$$

On the other hand, the loss itself decreases as

$$\mathcal{L}(w(t)) = O\left(\frac{1}{t}\right). \tag{11}$$

All the rates in the above Theorem are a direct consequence of Theorem 2, except for avoiding the $\log\log t$ factor for the degenerate cases in eq. 10 and eq. 11 (i.e., establishing that the rates $1/\log t$ and $1/t$ always hold); this additional improvement is a consequence of the more complete characterization of Theorem 4. Full details are provided in Appendix D. In this appendix, we also provide a simple construction showing all the rates in Theorem 3 are tight (except possibly for the $\log\log t$ factors).

The sharp contrast between the tight logarithmic and $1/t$ rates in Theorem 3 implies that the convergence of $w(t)$ to the max-margin $\hat{w}$ can be logarithmic in the loss itself, and we might need to wait until the loss is exponentially small in order to be close to the max-margin solution. This can help explain why continuing to optimize the training loss, even after the training error is zero and the training loss is extremely small, still improves generalization performance: our results suggest that the margin could still be improving significantly in this regime.
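A back-of-the-envelope calculation shows how large this gap is. The numbers below set every hidden constant in the $O(\cdot)$ rates to one, which is purely an illustrative assumption:

```latex
\begin{align*}
t = 10^{6}: \qquad
& \frac{1}{t} = 10^{-6},
\qquad
\frac{1}{\log t} = \frac{1}{13.8\ldots} \approx 0.072. \\[6pt]
\frac{1}{\log t} = 0.01
\;\Longleftrightarrow\;
& t = e^{100} \approx 2.7 \times 10^{43},
\qquad
\frac{1}{t} \approx 3.7 \times 10^{-44}.
\end{align*}
```

Under this simplification, a million iterations bring the loss down to about $10^{-6}$ while the margin gap is still around $7\%$, and reducing the margin gap to $1\%$ would require a loss of order $10^{-44}$.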

A numerical illustration of the convergence is depicted in Figure 1. As predicted by the theory, the norm $\|w(t)\|$ grows logarithmically (note the semi-log scaling), and $w(t)$ converges to the max-margin separator, but only logarithmically, while the loss itself decreases very rapidly (note the log-log scaling).

An important practical consequence of our theory is that although the margin of $w(t)$ keeps improving, and so we can expect the population (or test) misclassification error of $w(t)$ to improve for many datasets, the same cannot be said about the expected population loss (or test loss)! At the limit, the direction of $w(t)$ will converge toward the max margin predictor $\hat{w}$. Although $\hat{w}$ has zero training error, it will not generally have zero misclassification error on the population, or on a test or a validation set. Since the norm of $w(t)$ will increase, if we use the logistic loss or any other convex loss, the loss incurred on those misclassified points will also increase. More formally, consider the logistic loss $\ell(u) = \log(1 + e^{-u})$ and define also the hinge-at-zero loss $h(u) = \max(0, -u)$. Since $\hat{w}$ classifies all training points correctly, we have that on the training set $\sum_{n=1}^{N} h(\hat{w}^\top x_n) = 0$. However, on the population we would expect some errors and so $\mathbb{E}[h(\hat{w}^\top x)] > 0$. Since $w(t) \approx \hat{w} \log t$ and $\ell(\alpha u) \to \alpha h(u)$ as $\alpha \to \infty$, we have:

$$\mathbb{E}\left[\ell\left(w(t)^\top x\right)\right] \approx \mathbb{E}\left[\ell\left((\log t)\, \hat{w}^\top x\right)\right] \approx (\log t)\, \mathbb{E}\left[h\left(\hat{w}^\top x\right)\right] = \Omega(\log t). \tag{12}$$

That is, the population loss increases logarithmically while the margin and the population misclassification error improve. Roughly speaking, the improvement in misclassification does not outweigh the increase in the loss of those points still misclassified.
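The scaling step used in eq. 12, namely that the logistic loss at a large scale $\alpha$ behaves like $\alpha$ times the hinge-at-zero loss, can be checked directly. This is a small numerical sketch; the evaluation points `u` and the scales are arbitrary choices.

```python
import numpy as np

def logistic(u):
    # log(1 + exp(-u)), evaluated stably
    return np.logaddexp(0.0, -u)

def hinge_at_zero(u):
    # h(u) = max(0, -u)
    return np.maximum(0.0, -u)

u = np.array([-2.0, -0.5, 0.5, 2.0])   # two misclassified and two correct margins
gaps = []
for a in (1.0, 10.0, 100.0):
    gap = np.abs(logistic(a * u) / a - hinge_at_zero(u)).max()
    gaps.append(gap)
    print(f"alpha={a:6.1f}  max |l(alpha u)/alpha - h(u)| = {gap:.3e}")
```

The gap equals $\log(1 + e^{-\alpha |u|})/\alpha$, which is at most $\log(2)/\alpha$, so misclassified points contribute a loss that grows linearly in the scale while correctly classified points contribute a vanishing one.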

The increase in the test loss is practically important because the loss on a validation set is frequently used to monitor progress and decide on stopping. Similar to the population loss, the validation loss will increase logarithmically with $t$, if there is at least one sample in the validation set which is classified incorrectly by the max margin vector (since we would not expect zero validation error). More precisely, as a direct consequence of Theorem 2 (as shown in Appendix D): let $\ell$ be the logistic loss, and $\mathcal{V}$ be an independent validation set, for which $\exists x \in \mathcal{V}$ such that $x^\top \hat{w} < 0$. Then the validation loss increases as

$$\mathcal{L}_{\mathrm{val}}(w(t)) = \sum_{x \in \mathcal{V}} \ell\left(w(t)^\top x\right) = \Omega(\log(t)).$$

This behavior might cause us to think we are over-fitting or otherwise encourage us to stop the optimization. However, this increase does not actually represent the model getting worse, merely $\|w(t)\|$ getting larger, and in fact the model might be getting better (increasing the margin and possibly decreasing the error rate).
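The following sketch reproduces this effect on a toy problem. Everything in it is hypothetical: the training points, the two validation points (one of which the max margin vector $(2/3, 1/3)$ of the training set misclassifies), and the checkpoints.

```python
import numpy as np

# Labels are folded into the points, so every example should satisfy w^T x > 0.
X_train = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 3.0]])
X_val = np.array([[-1.0, 1.0], [1.0, 0.5]])   # first point has x^T w_hat < 0

def logistic_loss(w, X):
    return np.logaddexp(0.0, -X @ w).sum()

def grad(w):
    # Gradient of the training loss; l'(u) = -1 / (1 + exp(u)).
    return -(X_train.T @ np.exp(-np.logaddexp(0.0, X_train @ w)))

eta = 1.0 / (0.25 * np.linalg.norm(X_train, 2) ** 2)   # below 2 / (beta * sigma_max^2)
w = np.zeros(2)
log = []                                      # (t, train loss, val loss, val error)
for t in range(1, 10001):
    w = w - eta * grad(w)
    if t in (100, 1000, 10000):
        val_err = float(np.mean(X_val @ w <= 0))
        log.append((t, logistic_loss(w, X_train), logistic_loss(w, X_val), val_err))

for t, tr, va, err in log:
    print(f"t={t:6d}  train loss={tr:.2e}  val loss={va:.3f}  val error={err:.2f}")
```

In this run the training loss keeps falling and the validation error stays fixed, yet the validation loss rises at every checkpoint, driven entirely by the growing norm acting on the one misclassified validation point.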

4 Extensions

4.1 Multi-Class Classification with Cross-Entropy Loss

So far, we have discussed the problem of binary classification, but in many practical situations we have more than two classes. For multi-class problems, the labels are the class indices $y_n \in [K] \triangleq \{1, \dots, K\}$ and we learn a predictor $w_k$ for each class $k \in [K]$. A common loss function in multi-class classification is the following cross-entropy loss with a softmax output, which is a generalization of the logistic loss:

$$\mathcal{L}\left(\{w_k\}_{k \in [K]}\right) = -\sum_{n=1}^{N} \log\left( \frac{\exp\left(w_{y_n}^\top x_n\right)}{\sum_{k=1}^{K} \exp\left(w_k^\top x_n\right)} \right). \tag{13}$$

What do the linear predictors $w_k(t)$ converge to if we minimize the cross-entropy loss by gradient descent on the predictors? In Appendix E we analyze this problem for separable data, and show that again, the predictors diverge to infinity and the loss converges to zero. Furthermore, we prove the following theorem:

Theorem. For almost all multiclass datasets (i.e., except for a measure zero) which are linearly separable (i.e. the constraints in eq. 15 below are feasible), any starting point $w(0)$ and any small enough stepsize, the iterates of gradient descent on eq. 13 will behave as:

$$w_k(t) = \hat{w}_k \log(t) + \rho_k(t), \tag{14}$$

where the residual $\rho_k(t)$ is bounded and $\hat{w}_k$ is the solution of the K-class SVM:

$$\operatorname*{argmin}_{w_1, \dots, w_K} \sum_{k=1}^{K} \|w_k\|^2 \quad \text{s.t.} \quad \forall n, \forall k \ne y_n: \; w_{y_n}^\top x_n \ge w_k^\top x_n + 1. \tag{15}$$
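A minimal sketch of eq. 14 on a deliberately symmetric toy problem follows. The dataset (one point per class, $x_n = e_n$), the step size, and the checkpoints are hypothetical. For this data the K-class SVM of eq. 15 decouples across samples and its solution is $\hat{W} = I - \frac{1}{K}\mathbf{1}\mathbf{1}^\top$, where row $k$ of $\hat{W}$ is $\hat{w}_k$. By symmetry the gradient descent iterates started at zero stay exactly on this direction, so the example mainly shows the logarithmic growth of the scale.

```python
import numpy as np

K = 3
X = np.eye(K)          # hypothetical data: sample n is the unit vector e_n
y = np.arange(K)       # sample n belongs to class n

def cross_entropy_grad(W):
    # W has shape (K, d); row k is w_k. Gradient of eq. 13 with respect to W.
    logits = X @ W.T                                  # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                 # softmax probabilities
    P[np.arange(len(y)), y] -= 1.0                    # softmax minus one-hot
    return P.T @ X                                    # (K, d)

W_hat = np.eye(K) - np.ones((K, K)) / K               # K-class SVM solution here

W = np.zeros((K, K))
eta = 0.5                                             # assumed small enough step
scales = {}
for t in range(1, 5001):
    W -= eta * cross_entropy_grad(W)
    if t in (500, 5000):
        # Coefficient of W along W_hat (least-squares projection).
        scales[t] = (W * W_hat).sum() / (W_hat * W_hat).sum()

cosine = (W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat))
print("cosine with K-class SVM direction:", cosine)
print("scale at t=500 and t=5000:", scales[500], scales[5000])
```

Multiplying the number of iterations by ten adds roughly $\log 10 \approx 2.3$ to the scale of the iterate, as eq. 14 predicts for a bounded residual.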

4.2 Deep networks

In this paper, we only considered linear prediction. Naturally, it is desirable to generalize our results also to non-linear models and especially multi-layer neural networks.

Even without a formal extension and description of the precise bias, our results already shed light on how minimizing the cross-entropy loss with gradient descent can have a margin maximizing effect, how the margin might improve only logarithmically slowly, and why it might continue improving even as the validation loss increases. These effects are demonstrated in Figure 2 and Table 1, which portray typical training of a convolutional neural network using unregularized gradient descent (code available here: https://github.com/paper-submissions/MaxMargin). As can be seen, the norm of the weights increases, but the validation error continues decreasing, albeit very slowly (as predicted by the theory), even after the training error is zero and the training loss is extremely small. We can now understand how even though the loss is already extremely small, some sort of margin might be gradually improving as we continue optimizing. We can also observe how the validation loss increases despite the validation error decreasing, as discussed in Section 3.

As an initial advance toward tackling deep networks, we can point out that for two special cases, our results may be directly applied to multi-layered networks. First, somewhat trivially, our results may be applied directly to the last weight layer of a neural network if the last hidden layer becomes fixed and linearly separable after a certain number of iterations. This can become true, either approximately, if the input to the last hidden layer is normalized (e.g., using batch norm), or exactly, if the last hidden layer is quantized (Hubara et al., 2016).

Second, as we show next, our results may be applied exactly on deep networks if only a single weight layer is being optimized, and, furthermore, after a sufficient number of iterations, the activation units stop switching and the training error goes to zero.

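We examine a multilayer neural network with component-wise ReLU functions $f(z) = \max[z, 0]$, and weights $\{W_l\}_{l=1}^{L}$. Given input $x_n$ and target $y_n \in \{-1, 1\}$, the DNN produces a scalar output

$$u_n = W_L f\left(W_{L-1} f\left(\cdots W_2 f\left(W_1 x_n\right)\right)\right)$$

and has loss $\ell(y_n u_n)$, where $\ell$ obeys Assumptions 2 and 3.

Corollary. If we optimize a single weight layer $w_l = \operatorname{vec}\left(W_l^\top\right)$ using gradient descent, so that $\mathcal{L}(w_l) = \sum_{n=1}^{N} \ell\left(y_n u_n(w_l)\right)$ converges to zero, and such that the ReLU inputs do not switch signs, then $w_l(t)/\|w_l(t)\|$ converges to

$$\hat{w}_l = \operatorname*{argmin}_{w_l} \|w_l\|^2 \quad \text{s.t.} \quad \forall n: y_n u_n(w_l) \ge 1.$$

Proof. We examine the output of the network given a single input $x_n$, for $t > t_0$. Since the ReLU inputs do not switch signs, we can write $v_{l,n}$, the output of layer $l$, as

$$v_{l,n} = \prod_{m=1}^{l} A_{m,n} W_m x_n,$$

where we defined $A_{l,n}$ for $l < L$ as a diagonal 0-1 matrix, whose diagonal holds the ReLU slopes at layer $l$, sample $n$, and $A_{L,n} = 1$. Additionally, we define

$$\delta_{l,n} = A_{l,n} \prod_{m=L}^{l+1} W_m^\top A_{m,n}; \qquad \tilde{x}_{l,n} = \delta_{l,n} \otimes u_{l-1,n}.$$

Using this notation we can write

$$u_n(w_l) = v_{L,n} = \prod_{m=1}^{L} A_{m,n} W_m x_n = \delta_{l,n}^\top W_l u_{l-1,n} = \tilde{x}_{l,n}^\top w_l. \tag{16}$$

This implies that

$$\mathcal{L}(w_l) = \sum_{n=1}^{N} \ell\left(y_n u_n(w_l)\right) = \sum_{n=1}^{N} \ell\left(y_n \tilde{x}_{l,n}^\top w_l\right),$$

which is the same as the original linear problem. Since the loss converges to zero, the dataset $\{\tilde{x}_{l,n}, y_n\}_{n=1}^{N}$ must be linearly separable. Applying Theorem 2, and recalling that $u_n(w_l) = \tilde{x}_{l,n}^\top w_l$ from eq. 16, we prove this corollary.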

Importantly, this case is non-convex, unless we are optimizing the last layer. Note that we assumed ReLU functions for simplicity, but this proof can be easily generalized to any other piecewise linear activation functions (e.g., leaky ReLU, max-pooling).

4.3 Other optimization methods

In this paper we examined the implicit bias of gradient descent. Different optimization algorithms exhibit different biases, and understanding these biases and how they differ is crucial to understanding and constructing learning methods attuned to the inductive biases we expect. Can we characterize the implicit bias and convergence rate in other optimization methods?

In Figure 1 we see that adding momentum does not qualitatively affect the bias induced by gradient descent. In Figure 4 in Appendix F we also repeat the experiment using stochastic gradient descent, and observe a similar bias. This is consistent with the fact that momentum, acceleration and stochasticity do not change the bias when using gradient descent to optimize an underdetermined least squares problem. It would be beneficial, though, to rigorously understand how much we can generalize our result to gradient descent variants, and how the convergence rates might change in these cases.

On the other hand, as an example of how changing the optimization algorithm does change the bias, consider adaptive methods, such as AdaGrad (Duchi et al., 2011) and ADAM (Kingma and Ba, 2015). In Figure 3 we show the predictors obtained by ADAM and by gradient descent on a simple data set. Both methods converge to zero training error solutions. But although gradient descent converges to the max margin predictor, as predicted by our theory, ADAM does not. The implicit bias of adaptive methods has in fact been a recent topic of interest, with Hoffer et al. (2017) and Wilson et al. (2017) suggesting they lead to worse generalization, and Wilson et al. (2017) providing examples of the differences in the bias for linear regression problems with the squared loss. Can we characterize the bias of adaptive methods for logistic regression problems? Can we characterize the bias of other optimization methods, providing a general understanding linking optimization algorithms with their bias?

In a follow-up paper (Gunasekar et al., 2018) we provide initial answers to these questions. We provide a precise characterization of the limit direction of steepest descent for general norms when optimizing the exp-loss, and show that for adaptive methods such as Adagrad the limit direction can depend on the initial point and step size and is thus not as predictable and robust as with non-adaptive methods.

4.4 Other loss functions

In this work we focused on loss functions with an exponential tail and observed a very slow, logarithmic convergence of the normalized weight vector to the max margin direction. A natural question that follows is how this behavior changes with other types of loss function tails. Specifically, does the normalized weight vector always converge to the max margin solution? How is the convergence rate affected? Can we improve the convergence rate beyond the logarithmic rate found in this work?

In a follow-up work (Nacson et al., 2018) we provide partial answers to these questions. We prove that the exponential tail has the optimal convergence rate, for tails for which $-\ell'(u)$ is of the form $\exp(-u^{\nu})$ with $\nu > 0.25$. We then conjecture, based on heuristic analysis, that the exponential tail is optimal among all possible tails. Furthermore, we demonstrate that polynomial or heavier tails do not converge to the max margin solution. Lastly, for the exponential loss we propose a normalized gradient scheme which can significantly improve the convergence rate, achieving $O(\log(t)/\sqrt{t})$.

4.5 Matrix Factorization

With multi-layered neural networks in mind, Gunasekar et al. (2017) recently embarked on a study of the implicit bias of under-determined matrix factorization problems, where we minimize the squared loss of linear observation of a matrix by gradient descent on its factorization. Since a matrix factorization can be viewed as a two-layer network with linear activations, this is perhaps the simplest deep model one can study in full, and can thus provide insight and direction to studying more complex neural networks. Gunasekar et al. conjectured, and provided theoretical and empirical evidence, that gradient descent on the factorization for an under-determined problem converges to the minimum nuclear norm solution, but only if the initialization is infinitesimally close to zero and the step-sizes are infinitesimally small. With finite step-sizes or finite initialization, Gunasekar et al. could not characterize the bias.

In our follow-up paper (Gunasekar et al., 2018) we study this same problem with the exponential loss instead of the squared loss. Under additional assumptions on the asymptotic convergence of update directions and gradient directions, we were able to show that the direction of the gradient descent iterates on the factorized parameterization asymptotically converges toward the maximum margin solution with unit nuclear norm. Unlike the case of the squared loss, the results for the exponential loss are independent of initialization and require only mild conditions on the step size. Here again, we see the asymptotic nature of the exponential loss on separable data nullifying the initialization effects, thereby making the analysis simpler compared to the squared loss.

5 Summary

We characterized the implicit bias induced by gradient descent when minimizing smooth monotone loss functions with an exponential tail. This is the type of loss commonly being minimized in deep learning. We can now rigorously understand:

1. How gradient descent, without early stopping, induces implicit $L_2$ regularization and converges to the maximum $L_2$ margin solution, both for binary classification with the logistic loss, exp-loss, or another exponential-tailed monotone decreasing loss, and for multi-class classification with the cross-entropy loss. In particular, the non-tail part does not affect the bias and so the logistic loss and the exp-loss, although very different on non-separable problems, behave the same on separable problems. The bias is also independent of the step-size used (as long as it is small enough to ensure convergence) and of the initialization (unlike for least square problems).

2. The convergence of the direction of the gradient descent iterates to the maximum $L_2$ margin solution is, however, very slow compared to the convergence of the training loss, which explains why it is worthwhile to continue optimizing long after we have zero training error, and even when the loss itself is already extremely small.

3. We should not rely on plateauing of the training loss, or on the loss (logistic, exp, or cross-entropy) evaluated on validation data, as measures to decide when to stop. Instead, we should look at the error on the validation dataset. We might improve the validation and test errors even when the decrease in the training loss is tiny and even when the validation loss itself increases.

That gradient descent leads to a max $L_2$ margin solution is perhaps not a big surprise to those for whom the connection between $L_2$ regularization and gradient descent is natural. Nevertheless, we are not familiar with any prior study or mention of this fact, let alone a rigorous analysis and study of how this bias is exact and independent of the initial point and the step-size. Furthermore, we also analyze the rate at which this happens, leading to the novel observations discussed above. Even more importantly, we hope that our analysis can open the door to further analysis of different optimization methods or of different models, including deep networks, where implicit regularization is not well understood even for least square problems, or where we do not have such a natural guess as for gradient descent on linear problems. Analyzing gradient descent on the logistic/cross-entropy loss is not only arguably more relevant than the least square loss, but might also be technically easier.

Acknowledgments

The authors are grateful to J. Lee and C. Zeno for helpful comments on the manuscript. The research of DS was supported by the Taub foundation and of NS by the National Science Foundation.

References

• Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
• Ganti (2015) RadhaKrishna Ganti. EE6151, Convex optimization algorithms. Unconstrained minimization: Gradient descent algorithm, 2015. URL
• Gunasekar et al. (2017) Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit Regularization in Matrix Factorization. arXiv, pages 1–10, 2017.
• Gunasekar et al. (2018) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
• Hardt et al. (2016) Moritz Hardt, Benjamin Recht, and Y Singer. Train faster, generalize better: Stability of stochastic gradient descent. ICML, pages 1–24, 2016.
• Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS (oral presentation), pages 1–13, may 2017.
• Hubara et al. (2016) I Hubara, M Courbariaux, D. Soudry, R El-yaniv, and Y Bengio. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Accepted to JMLR, 2016.
• Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. Communicated by the authors, 2018.
• Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, pages 1–16, 2017.
• Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In ICLR, pages 1–13, 2015.
• Nacson et al. (2018) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. arXiv, pages 1–45, 2018.
• Neyshabur et al. (2014) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
• Neyshabur et al. (2015) Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In NIPS, 2015.
• Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring Generalization in Deep Learning. arXiv, jun 2017.
• Rosset et al. (2004) Saharon Rosset, Ji Zhu, and Trevor J Hastie. Margin Maximizing Loss Functions. In NIPS, pages 1237–1244, 2004. ISBN 0-262-20152-6.
• Schapire et al. (1998) Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee, et al. Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686, 1998.
• Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, and N Srebro. The Implicit Bias of Gradient Descent on Separable Data. In ICLR, 2018.
• Telgarsky (2013) Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on International Conference on Machine Learning-Volume 28, pages II–307. JMLR. org, 2013.
• Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv, pages 1–14, 2017.
• Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
• Zhang et al. (2005) Tong Zhang, Bin Yu, et al. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.

Appendix A Proof of Theorems 2 and 2 for almost every dataset

In the following sub-sections we first prove Theorem A below, which is a version of Theorem 2, specialized for almost every dataset. We then prove Theorem 2 (which is already stated for almost every dataset).

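Theorem A.

For almost every dataset which is linearly separable (Assumption 1), any $\beta$-smooth decreasing loss function (Assumption 1) with an exponential tail (Assumption 3), any stepsize $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(X)$ and any starting point $w(0)$, the gradient descent iterates (as in eq. 2) will behave as:

$$w(t) = \hat{w}\log t + \rho(t), \tag{17}$$

where $\hat{w}$ is the max margin vector

$$\hat{w} = \operatorname*{argmin}_{w\in\mathbb{R}^d}\|w\|^2 \quad \text{s.t.} \quad \forall n: w^\top x_n \ge 1,$$

the residual $\rho(t)$ is bounded, and so

$$\lim_{t\to\infty}\frac{w(t)}{\|w(t)\|} = \frac{\hat{w}}{\|\hat{w}\|}.$$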

In the following proofs, for any solution $w(t)$, we define

$$r(t) = w(t) - \hat{w}\log t - \tilde{w},$$

where $\hat{w}$ and $\tilde{w}$ follow the conditions of Theorems 2 and 2, i.e. $\hat{w}$ is the max margin vector defined above, and $\tilde{w}$ is a vector which satisfies eq. 7:

$$\forall n\in\mathcal{S}: \eta\exp(-x_n^\top\tilde{w}) = \alpha_n, \tag{18}$$

where we recall that we denoted $X_{\mathcal{S}}$ as the matrix whose columns are the support vectors, a subset of the columns of $X$.

In Lemma 3 (Appendix B) we prove that for almost every dataset $\alpha$ is uniquely defined, there are no more than $d$ support vectors and $\alpha_n \neq 0$, $\forall n\in\mathcal{S}$. Therefore, eq. 18 is well-defined in those cases. If the support vectors do not span the data, then the solution $\tilde{w}$ to eq. 18 might not be unique. In this case, we can use any such solution in the proof.
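To make eq. 18 concrete, the sketch below solves it on a small example and then tracks the residual $r(t) = w(t) - \hat{w}\log t - \tilde{w}$ along gradient descent on the exp-loss. The dataset, the step size, and the helper names `solve_w_tilde` and `residual_norm` are illustrative assumptions, not part of the original text; every point in this toy dataset is a support vector.

```python
import numpy as np

# Hypothetical toy data: the columns of X_S are the support vectors
# x_1 = (1, 2) and x_2 = (1, -1), with labels already absorbed.
X_S = np.array([[1.0, 1.0],
                [2.0, -1.0]])
w_hat = np.array([1.0, 0.0])   # max margin vector for this data
eta = 0.1                      # assumed step size


def solve_w_tilde(X_S, w_hat, eta):
    # Dual coefficients from w_hat = X_S @ alpha (exact for this data).
    alpha, *_ = np.linalg.lstsq(X_S, w_hat, rcond=None)
    # eq. 18: eta * exp(-x_n^T w_tilde) = alpha_n
    #   <=>   X_S^T w_tilde = -log(alpha / eta)
    w_tilde, *_ = np.linalg.lstsq(X_S.T, -np.log(alpha / eta), rcond=None)
    return alpha, w_tilde


def residual_norm(steps, X_S, w_hat, w_tilde, eta):
    # Gradient descent on L(w) = sum_n exp(-x_n^T w), started at w(0) = 0.
    w = np.zeros(X_S.shape[0])
    for _ in range(steps):
        w = w + eta * (X_S @ np.exp(-(X_S.T @ w)))
    # r(t) = w(t) - w_hat * log(t) - w_tilde, evaluated at t = steps
    return np.linalg.norm(w - w_hat * np.log(steps) - w_tilde)


alpha, w_tilde = solve_w_tilde(X_S, w_hat, eta)
r_norm = residual_norm(20000, X_S, w_hat, w_tilde, eta)
print("alpha   =", alpha)
print("w_tilde =", w_tilde)
print("||r(t)|| at t=20000:", r_norm)
```

Here the support vectors span the whole space, so the linear system for `w_tilde` has a unique solution; when they do not, `lstsq` returns one of the admissible solutions (the minimum-norm one).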

We furthermore denote the minimum margin to a non-support vector as:

$$\theta = \min_{n\notin\mathcal{S}} x_n^\top\hat{w} > 1, \tag{19}$$

and by $C_i$, $\epsilon_i$, $t_i$ ($i\in\mathbb{N}$) various positive constants which are independent of $t$. Lastly, we define $P_1$ as the orthogonal projection matrix (this matrix can be written as $P_1 = X_{\mathcal{S}}X_{\mathcal{S}}^{+}$, where $M^{+}$ is the Moore-Penrose pseudoinverse of $M$) to the subspace spanned by the support vectors (the columns of $X_{\mathcal{S}}$), and $\bar{P}_1 = I - P_1$ as the complementary projection (to the left nullspace of $X_{\mathcal{S}}$).
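The two projections can be formed directly from the pseudoinverse. The sketch below is a small illustration under assumed data: the matrix `X_S` (two support vectors embedded in three dimensions) and the helper name `support_projections` are hypothetical.

```python
import numpy as np


def support_projections(X_S):
    # P1 projects onto the span of the columns of X_S (the support vectors);
    # P1_bar = I - P1 projects onto the left nullspace of X_S.
    P1 = X_S @ np.linalg.pinv(X_S)
    P1_bar = np.eye(X_S.shape[0]) - P1
    return P1, P1_bar


# Hypothetical example: two support vectors (columns) in R^3.
X_S = np.array([[1.0, 1.0],
                [2.0, -1.0],
                [0.0, 0.0]])
P1, P1_bar = support_projections(X_S)

print(np.round(P1, 6))       # projection onto the first two coordinates
print(np.round(P1_bar, 6))   # projection onto the third coordinate
```

In this example the support vectors span the first two coordinates, so `P1_bar` keeps only the third coordinate, which is exactly the part of a vector that no support vector can influence.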

A.1 Simple proof of Theorem A

In this section we first examine the special case that $\ell(u) = e^{-u}$ and take the continuous time limit of gradient descent: $\eta \to 0$, so

$$\dot{w}(t) = -\nabla L(w(t)).$$

The proof in this case is rather short and self-contained (i.e., does not rely on any previous results), and so it helps to clarify the main ideas of the general (more complicated) proof which we will give in the next sections.

Recall we defined

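$$r(t) = w(t) - \log(t)\hat{w} - \tilde{w}. \tag{20}$$

Our goal is to show that $\|r(t)\|$ is bounded, and therefore $\rho(t) = r(t) + \tilde{w}$ is bounded. Eq. 20 implies that

$$\dot{r}(t) = \dot{w}(t) - \frac{1}{t}\hat{w} = -\nabla L(w(t)) - \frac{1}{t}\hat{w} \tag{21}$$

and therefore

$$\begin{aligned}
\frac{1}{2}\frac{d}{dt}\|r(t)\|^2 &= \dot{r}^\top(t)\,r(t)\\
&= \sum_{n=1}^{N}\exp\left(-x_n^\top w(t)\right)x_n^\top r(t) - \frac{1}{t}\hat{w}^\top r(t)\\
&= \left[\sum_{n\in\mathcal{S}}\exp\left(-\log(t)\hat{w}^\top x_n - \tilde{w}^\top x_n - x_n^\top r(t)\right)x_n^\top r(t) - \frac{1}{t}\hat{w}^\top r(t)\right]\\
&\quad + \left[\sum_{n\notin\mathcal{S}}\exp\left(-\log(t)\hat{w}^\top x_n - \tilde{w}^\top x_n - x_n^\top r(t)\right)x_n^\top r(t)\right],
\end{aligned} \tag{22}$$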

where in the last equality we used eq. 20 and decomposed the sum over support vectors and non-support vectors. We examine both bracketed terms.

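Recall that $\hat{w}^\top x_n = 1$ for $n\in\mathcal{S}$, and that we defined $\tilde{w}$ (in eq. 18) so that $\sum_{n\in\mathcal{S}}\exp(-\tilde{w}^\top x_n)\,x_n = \hat{w}$. Thus, the first bracketed term in eq. 22 can be written as

$$\begin{aligned}
&\frac{1}{t}\sum_{n\in\mathcal{S}}\exp\left(-\tilde{w}^\top x_n - x_n^\top r(t)\right)x_n^\top r(t) - \frac{1}{t}\sum_{n\in\mathcal{S}}\exp\left(-\tilde{w}^\top x_n\right)x_n^\top r(t)\\
&\quad = \frac{1}{t}\sum_{n\in\mathcal{S}}\exp\left(-\tilde{w}^\top x_n\right)\left(\exp\left(-x_n^\top r(t)\right) - 1\right)x_n^\top r(t) \le 0,
\end{aligned} \tag{23}$$

since $z(e^{-z} - 1) \le 0$ for all $z$. Furthermore, since $e^{-z}z \le 1$ for all $z$ and $\theta = \min_{n\notin\mathcal{S}} x_n^\top\hat{w}$ (eq. 19), the second bracketed term in eq. 22 can be upper bounded by

$$\frac{1}{t^{\theta}}\sum_{n\notin\mathcal{S}}\exp\left(-\tilde{w}^\top x_n\right). \tag{24}$$

Substituting eq. 23 and 24 into eq. 22 and integrating, we obtain that $\exists C, C'$ such that

$$\forall t_1, \forall t > t_1: \|r(t)\|^2 - \|r(t_1)\|^2 \le C\int_{t_1}^{t}\frac{dt}{t^{\theta}} \le C' < \infty,$$

since $\theta > 1$ (eq. 19). Thus, we showed that $r(t)$ is bounded, which completes the proof for the special case.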
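The argument above rests on two scalar inequalities and on the convergence of $\int t^{-\theta}dt$ for $\theta > 1$. The following sketch checks them numerically; the grid, the value `theta = 1.5`, and the helper name `tail_integral` are assumptions made for illustration.

```python
import numpy as np

# Scalar inequalities used for eq. 23 and eq. 24, checked on a grid.
z = np.linspace(-20.0, 20.0, 4001)
support_term = z * (np.exp(-z) - 1.0)   # should never be positive
non_support_term = z * np.exp(-z)       # should never exceed 1


def tail_integral(t1, t, theta):
    # Closed form of the integral of u^(-theta) over [t1, t], for theta != 1.
    return (t1 ** (1.0 - theta) - t ** (1.0 - theta)) / (theta - 1.0)


theta = 1.5  # hypothetical margin of the closest non-support vector
integrals = [tail_integral(1.0, t, theta) for t in (1e1, 1e3, 1e9)]
limit = 1.0 / (theta - 1.0)  # value of the integral as t -> infinity

print("max of z (e^-z - 1):", support_term.max())
print("max of z e^-z      :", non_support_term.max())
print("integrals:", integrals, "limit:", limit)
```

The integral stays below its limit $t_1^{1-\theta}/(\theta-1)$ however large $t$ becomes, which is what keeps $\|r(t)\|^2$ bounded; for $\theta \le 1$ it would diverge.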

A.2 Complete proof of Theorem A

Next, we give the proof for the general case (non-infinitesimal step size, and exponentially-tailed functions). Though it is based on a similar analysis as in the special case we examined in the previous section, it is somewhat more involved since we have to bound additional terms.

First, we state two auxiliary lemmata, that are proven below in appendix sections A.4 and A.5:

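Lemma.

Let $L(w)$ be a $\beta$-smooth non-negative objective. If $\eta < 2\beta^{-1}$, then, for any $w(0)$, with the GD sequence

$$w(t+1) = w(t) - \eta\nabla L(w(t)) \tag{25}$$

we have that $\sum_{u=0}^{\infty}\|\nabla L(w(u))\|^2 < \infty$ and therefore $\lim_{t\to\infty}\|\nabla L(w(t))\|^2 = 0$.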
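A quick numerical illustration of this lemma: for a $\beta$-smooth non-negative objective, the standard descent argument gives $\sum_u \|\nabla L(w(u))\|^2 \le L(w(0)) / (\eta(1 - \beta\eta/2))$. The sketch below checks this bound for the logistic loss on an assumed toy dataset; the data, the choice `eta = 1 / beta`, and the function names are illustrative only.

```python
import numpy as np

# Hypothetical data: rows are points with labels absorbed.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [3.0, 1.0]])


def loss(w):
    # L(w) = sum_n log(1 + exp(-x_n^T w)), which is non-negative
    return np.logaddexp(0.0, -(X @ w)).sum()


def grad(w):
    return -(X.T @ np.exp(-np.logaddexp(0.0, X @ w)))


# The Hessian of L is bounded by (1/4) X^T X, so L is beta-smooth with:
beta = 0.25 * np.linalg.norm(X, 2) ** 2
eta = 1.0 / beta  # any eta < 2 / beta satisfies the lemma's condition


def gradient_energy(steps, w0):
    # Partial sum of ||grad L(w(u))||^2 along the GD sequence of eq. 25.
    w = w0.copy()
    total = 0.0
    for _ in range(steps):
        g = grad(w)
        total += g @ g
        w = w - eta * g
    return total


w0 = np.zeros(2)
energy = gradient_energy(5000, w0)
bound = loss(w0) / (eta * (1.0 - beta * eta / 2.0))
print(f"sum of squared gradient norms: {energy:.4f}  (bound: {bound:.4f})")
```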

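Lemma.

We have

$$\exists C_1, t_1: \forall t > t_1: (r(t+1) - r(t))^\top r(t) \le C_1 t^{-\min(\theta,\,1+1.5\mu_+,\,1+0.5\mu_-)}. \tag{26}$$

Additionally, $\forall\epsilon_1 > 0$, $\exists C_2, t_2$, such that $\forall t > t_2$, if

$$\|P_1 r(t)\| \ge \epsilon_1, \tag{27}$$

then the following improved bound holds

$$(r(t+1) - r(t))^\top r(t) \le -C_2 t^{-1} < 0. \tag{28}$$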

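Our goal is to show that $\|r(t)\|$ is bounded, and therefore $\rho(t) = r(t) + \tilde{w}$ is bounded. To show this, we will upper bound the following equation

$$\|r(t+1)\|^2 = \|r(t+1) - r(t)\|^2 + 2(r(t+1) - r(t))^\top r(t) + \|r(t)\|^2. \tag{29}$$

First, we note that the first term in this equation can be upper-bounded by

$$\begin{aligned}
\|r(t+1) - r(t)\|^2 &\overset{(1)}{=} \|w(t+1) - \hat{w}\log(t+1) - \tilde{w} - w(t) + \hat{w}\log(t) + \tilde{w}\|^2\\
&\overset{(2)}{=} \|-\eta\nabla L(w(t)) - \hat{w}[\log(t+1) - \log(t)]\|^2\\
&= \eta^2\|\nabla L(w(t))\|^2 + \|\hat{w}\|^2\log^2(1 + t^{-1}) + 2\eta\hat{w}^\top\nabla L(w(t))\log(1 + t^{-1})\\
&\overset{(3)}{\le} \eta^2\|\nabla L(w(t))\|^2 + \|\hat{w}\|^2 t^{-2}
\end{aligned} \tag{30}$$

where in (1) we used eq. 20, in (2) we used eq. 2, and in (3) we used $\forall x > 0: x \ge \log(1+x) > 0$, and also that

$$\hat{w}^\top\nabla L(w(t)) = \sum_{n=1}^{N}\ell'(w(t)^\top x_n)\,\hat{w}^\top x_n \le 0, \tag{31}$$

since $\hat{w}^\top x_n \ge 1$ (from the definition of $\hat{w}$) and $\ell'(u) \le 0$.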
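The bound in eq. 30 can be checked step by step along a gradient descent run. The sketch below does this for the logistic loss on an assumed two-point dataset whose max margin vector is `w_hat = (1, 0)`; the data, the step size, and the function name `worst_violation` are illustrative. Note that $r(t+1) - r(t)$ does not depend on $\tilde{w}$, so $\tilde{w}$ is not needed here.

```python
import numpy as np

# Hypothetical data: rows are points (labels absorbed), w_hat is their
# max margin vector, since w_hat @ x_n = 1 for both points.
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])
w_hat = np.array([1.0, 0.0])
eta = 0.5  # assumed step size


def grad(w):
    # Gradient of sum_n log(1 + exp(-x_n^T w))
    return -(X.T @ np.exp(-np.logaddexp(0.0, X @ w)))


def worst_violation(steps):
    # Largest value of LHS - RHS of eq. 30 over t = 1, ..., steps.
    # The iterate held at index t plays the role of w(t).
    w = np.zeros(2)
    worst = -np.inf
    for t in range(1, steps + 1):
        g = grad(w)
        step = -eta * g - w_hat * np.log1p(1.0 / t)   # r(t+1) - r(t)
        lhs = step @ step
        rhs = eta ** 2 * (g @ g) + (w_hat @ w_hat) / t ** 2
        worst = max(worst, lhs - rhs)
        w = w - eta * g
    return worst


violation = worst_violation(5000)
print("largest LHS - RHS in eq. 30:", violation)
```

A non-positive result means the inequality held at every step, as expected from $\log(1+x) \le x$ and from the sign of the cross term in eq. 31.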

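Also, from Lemma A.2 we know that

$$\|\nabla L(w(t))\|^2 = o(1) \quad\text{and}\quad \sum_{u=0}^{\infty}\|\nabla L(w(u))\|^2 < \infty. \tag{32}$$

Substituting eq. 32 into eq. 30, and recalling that a $t^{-\nu}$ power series converges for any $\nu > 1$, we can find $C_0$ such that

$$\|r(t+1) - r(t)\|^2 = o(1) \quad\text{and}\quad \sum_{t=0}^{\infty}\|r(t+1) - r(t)\|^2 = C_0 < \infty. \tag{33}$$

Note that this equation also implies that, for any $\epsilon_0 > 0$,

$$\exists t_0: \forall t > t_0: \big|\,\|r(t+1)\| - \|r(t)\|\,\big| < \epsilon_0. \tag{34}$$

Next, we would like to bound the second term in eq. 29. From eq. 26 in Lemma A.2, we can find $t_1, C_1$ such that $\forall t > t_1$:

$$(r(t+1) - r(t))^\top r(t) \le C_1 t^{-\min(\theta,\,1+1.5\mu_+,\,1+0.5\mu_-)}. \tag{35}$$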

Thus, by combining eqs. 35 and 33 into eq. 29, we find

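$$\begin{aligned}
\|r(t)\|^2 - \|r(t_1)\|^2 &= \sum_{u=t_1}^{t-1}\left[\|r(u+1)\|^2 - \|r(u)\|^2\right]\\
&\le C_0 + 2\sum_{u=t_1}^{t-1} C_1 u^{-\min(\theta,\,1+1.5\mu_+,\,1+0.5\mu_-)}
\end{aligned}$$

which is bounded, since $\theta > 1$ (eq. 19) and $\mu_-, \mu_+ > 0$ (Definition 2). Therefore, $\|r(t)\|$ is bounded.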

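A.3 Proof of Theorem 2

All that remains now is to show that $\|r(t)\| \to 0$ if $\mathrm{rank}(X_{\mathcal{S}}) = \mathrm{rank}(X)$, and that $\tilde{w}$ is unique given $w(0)$. To do so, this proof will continue where the proof of Theorem 2 stopped, using notations and equations from that proof.

Since $r(t)$ has a bounded norm, its two orthogonal components $r(t) = P_1 r(t) + \bar{P}_1 r(t)$ also have bounded norms (recall that $P_1$ and $\bar{P}_1 = I - P_1$ were defined in the beginning of appendix section A). From eq. 2, $\nabla L(w)$ is spanned by the columns of $X$. If $\mathrm{rank}(X_{\mathcal{S}}) = \mathrm{rank}(X)$, then it is also spanned by the columns of $X_{\mathcal{S}}$, and so $\bar{P}_1\nabla L(w) = 0$. Therefore, $\bar{P}_1 w(t)$ is not updated during GD, and remains constant. Since $\tilde{w}$ in eq. 3 is also bounded, we can absorb this constant $\bar{P}_1 w(t)$ into $\tilde{w}$ without affecting eq. 7 (since $\forall n\in\mathcal{S}: x_n^\top\bar{P}_1 w(t) = 0$). Thus, without loss of generality, we can assume that $r(t) = P_1 r(t)$.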

Now, recall eq. 28 in Lemma A.2

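$$\exists C_2, t_2: \forall t > t_2: (r(t+1) - r(t))^\top r(t) \le -C_2 t^{-1} < 0.$$

Combining this with eqs. 29 and 33 implies that $\exists t_3 > \max[t_2, t_0]$ such that, $\forall t > t_3$ with $\|r(t)\| > \epsilon_1$, we have that $\|r(t+1)\|^2$ is a decreasing function since then

$$\|r(t+1)\|^2 - \|r(t)\|^2 \le -C_3 t^{-1} < 0. \tag{36}$$

Additionally, this result also implies that we cannot have $\|r(t)\| > \epsilon_1$ for all $t > t_3$, since then we arrive at the contradiction

$$\|r(t)\|^2 - \|r(t_3)\|^2 = \sum_{u=t_3}^{t-1}\left[\|r(u+1)\|^2 - \|r(u)\|^2\right] \le -C_3\sum_{u=t_3}^{t-1}u^{-1} \to -\infty.$$

Therefore, $\exists t_4 > t_3$ such that $\|r(t_4)\| \le \epsilon_1$. Recall also that $\|r(t)\|$ is a decreasing function whenever $\|r(t)\| \ge \epsilon_1$ (eq. 36). Also, recall that $t_4 > t_0$, so from eq. 34, we have that $\forall t > t_4$, $\big|\,\|r(t+1)\| - \|r(t)\|\,\big| \le \epsilon_0$. Combining these three facts we conclude that $\forall t > t_4: \|r(t)\| \le \epsilon_1 + \epsilon_0$. Since this reasoning holds $\forall\epsilon_1, \epsilon_0$, this implies that $\|r(t)\| \to 0$.

Lastly, we note that since $\bar{P}_1 w(t)$ is not updated during GD, we have that $\bar{P}_1(\tilde{w} - w(0)) = 0$. This sets $\tilde{w}$ uniquely, together with eq. 7.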

A.4 Proof of Lemma A.2

See A.2

This proof is a slightly modified version of the proof of Theorem 2 in (Ganti, 2015). Recall a well-known property of $\beta$-smooth functions:

$$\left|f(x) - f(y) - \nabla f(y)^\top(x - y)\right| \le \frac{\beta}{2}\|x - y\|^2.$$
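As a sanity check of this property, the sketch below evaluates both sides for the logistic objective at random pairs of points. The dataset, the random seed, and the name `smoothness_gap` are assumptions for illustration; the smoothness constant uses the standard bound of $1/4$ on the second derivative of the logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: rows are points with labels absorbed.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [3.0, 1.0]])


def f(w):
    return np.logaddexp(0.0, -(X @ w)).sum()


def grad_f(w):
    return -(X.T @ np.exp(-np.logaddexp(0.0, X @ w)))


# The Hessian of f is bounded by (1/4) X^T X, so f is beta-smooth with:
beta = 0.25 * np.linalg.norm(X, 2) ** 2


def smoothness_gap(x, y):
    # LHS minus RHS of the beta-smoothness inequality; should be <= 0.
    lhs = abs(f(x) - f(y) - grad_f(y) @ (x - y))
    rhs = 0.5 * beta * np.dot(x - y, x - y)
    return lhs - rhs


gaps = [smoothness_gap(rng.normal(size=2), rng.normal(size=2))
        for _ in range(1000)]
print("largest LHS - RHS over random pairs:", max(gaps))
```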