In over-parameterized models, where the training objective has multiple global optima, different optimization algorithms learn models with different implicit biases, and hence with different generalization to the population loss. This effect of the implicit bias of the optimization algorithm on generalization is particularly prominent in deep learning models, where generalization is not driven by explicit regularization or restrictions of the model capacity (Neyshabur et al., 2015; Zhang et al., 2017; Hoffer et al., 2017). Thus, in order to understand what drives generalization in such high capacity models, it is important to rigorously understand how optimization affects implicit bias.
Consider unregularized logistic regression over separable data. Soudry et al. (2018a) showed that the gradient descent iterates for this problem converge in direction to the maximum margin separator with unit $\ell_2$ norm, and this implicit bias holds independent of initialization and step size choices. This is exactly the solution of the hard margin support vector machine (SVM), where the $\ell_2$ norm constraint is explicitly added. While the maximum margin solution is perhaps the first guess of the implicit bias from the optimization geometry of gradient descent, Soudry et al. (2018a) also showed that the rate of convergence to the maximum margin solution is $O(1/\log t)$, which is much slower than the rate of convergence of the loss function itself, which is $O(1/t)$. This implies that the classification boundary of logistic regression, and hence the generalization of the classifier, continues to change long after the 0-1 error on training examples has diminished to zero.
Here we provide a more detailed study of this problem, focusing on the rate of convergence of gradient descent to the maximum margin solution. First, we ask whether this result can be extended to different loss functions, with different tails, beyond the tight exponential tail of logistic loss or exponential loss: do we still get convergence to the maximum margin separator? Does a heavier or lighter tail give a faster rate of convergence?
We show that convergence to the maximum margin solution can be extended to various losses with faster than polynomial tails, but not to losses with polynomial tails. However, our analysis suggests that the (popular) exponential tail is optimal in terms of the rates. We then focus on the optimal case of the exponential loss and ask whether we can accelerate the convergence to the maximum margin by using more aggressive and variable step sizes. The answer is yes, and we show that using normalized gradient updates, i.e., a step size inversely proportional to the gradient norm, we can get rates as fast as $O(\log(t)/\sqrt{t})$ instead of $O(1/\log t)$. Preliminary numerical results suggest we might similarly be able to improve the convergence rates for deep networks.
2 Setup and review of previous results
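Consider a dataset $\{\mathbf{x}_n, y_n\}_{n=1}^{N}$, with binary labels $y_n \in \{-1, 1\}$. We denote the data matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}$ and $\sigma_{\max}(\mathbf{X})$ as the maximal singular value of $\mathbf{X}$. We assume, without loss of generality that and , where $\|\cdot\|$ denotes the $\ell_2$ norm.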
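We analyze learning by minimizing an empirical loss of the form
$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \ell\left(y_n \mathbf{w}^\top \mathbf{x}_n\right), \tag{1}$$
where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector. A bias term could be added in the usual way, extending $\mathbf{x}_n$ by an additional ‘1’ component. To simplify notation, we assume that $\forall n: y_n = 1$; this is without loss of generality, since we can always re-define $y_n \mathbf{x}_n$ as $\mathbf{x}_n$.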
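The gradient descent (GD) iteration with fixed step size $\eta$ is given by
$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta \nabla \mathcal{L}(\mathbf{w}(t)) = \mathbf{w}(t) - \eta \sum_{n=1}^{N} \ell'\left(\mathbf{w}(t)^\top \mathbf{x}_n\right) \mathbf{x}_n. \tag{2}$$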
We look at the iterates of GD on linearly separable datasets with monotonic loss functions.
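As a concrete illustration of the fixed-step iteration above, the following is a minimal sketch for the exponential loss, with the labels already absorbed into the data points. The toy dataset `X_toy`, the step size 0.1, and the helper names are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def exp_loss_and_grad(w, X):
    """Exponential loss L(w) = sum_n exp(-w^T x_n) and its gradient.

    X has shape (d, N); labels are assumed to be absorbed into the columns.
    """
    margins = X.T @ w                  # w^T x_n for every sample
    losses = np.exp(-margins)
    grad = -X @ losses                 # sum_n l'(w^T x_n) x_n with l'(u) = -exp(-u)
    return losses.sum(), grad

def gradient_descent(X, w0, eta, num_steps):
    """Fixed-step gradient descent; returns the final iterate and the loss history."""
    w = np.array(w0, dtype=float)
    history = []
    for _ in range(num_steps):
        loss, grad = exp_loss_and_grad(w, X)
        history.append(loss)
        w = w - eta * grad
    return w, history

# Hypothetical toy data: two points in R^2 (columns), separable by w = (1, 0).
X_toy = np.array([[0.5, 0.5],
                  [0.8, -0.2]])
w_final, loss_history = gradient_descent(X_toy, w0=[0.0, 0.0], eta=0.1, num_steps=2000)
```

On this toy set the loss keeps decreasing while the norm of the iterate keeps growing, so only the direction of `w_final` is informative about the classifier.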
[Linear separability] The dataset is linearly separable if there exists a separator $\mathbf{w}_*$ such that $\forall n: \mathbf{w}_*^\top \mathbf{x}_n > 0$.
[Strict Monotone Loss] is a differentiable monotonically decreasing function bounded from below. Without loss of generality, let and .
For strictly monotone losses over separable data, there are no finite global minima of the objective in eq. (1), and the gradient descent iterates diverge to infinity. While the norm of the iterates diverges, the classification boundary is entirely specified by the direction of $\mathbf{w}(t)$. Can we say something interesting about which direction the iterates converge to? Soudry et al. (2018a) characterized this direction for loss functions with exponential tails, defined below.
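[Tight Exponential Tail] A function $f(u)$ has a “tight exponential tail”, if there exist positive constants $\mu_+, \mu_-$, and $\bar{u}$ such that $\forall u > \bar{u}$:
$$\left(1 - \exp(-\mu_- u)\right) e^{-u} \le f(u) \le \left(1 + \exp(-\mu_+ u)\right) e^{-u}.$$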
[$\beta$-smooth function] A function $f$ is $\beta$-smooth if its derivative is Lipschitz continuous with a Lipschitz constant $\beta$.
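As a numerical illustration of the tail condition, the sketch below checks that the negative derivative of the logistic loss, $-\ell'(u) = 1/(1+e^{u})$, lies between bounds of the form $(1 - e^{-\mu_- u})e^{-u}$ and $(1 + e^{-\mu_+ u})e^{-u}$. The two-sided form of the bound and the constants $\mu_+ = \mu_- = 1$, $\bar{u} = 0$ are assumptions chosen for this example.

```python
import numpy as np

def neg_logistic_derivative(u):
    """-l'(u) for the logistic loss l(u) = log(1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(u))

# Assumed constants for the two-sided bound (illustrative choice).
mu_plus, mu_minus, u_bar = 1.0, 1.0, 0.0

# Sample points with u > u_bar. The range stops at 10 because for much larger u
# the gap between f and the lower bound falls below floating-point resolution.
u = np.linspace(0.01, 10.0, 2000)
f = neg_logistic_derivative(u)
lower = (1.0 - np.exp(-mu_minus * u)) * np.exp(-u)
upper = (1.0 + np.exp(-mu_plus * u)) * np.exp(-u)
tail_ok = bool(np.all(lower <= f) and np.all(f <= upper))
```

The check passes on this grid because $1/(1+x) \ge 1 - x$ for $x = e^{-u} > 0$, which gives the lower bound, while the upper bound is immediate from $-\ell'(u) \le e^{-u}$.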
Theorem 1 (Theorem 3 in Soudry et al. (2018a), rephrased)
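For almost all datasets that are linearly separable (Definition 1), and any $\beta$-smooth (Definition 4), strictly monotone loss function $\ell$ (Definition 2) for which $-\ell'(u)$ has a tight exponential tail (Definition 3), the gradient descent iterates in eq. 2 with any step size $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(\mathbf{X})$ and any initialization $\mathbf{w}(0)$ will behave as:
$$\mathbf{w}(t) = \hat{\mathbf{w}} \log t + \boldsymbol{\rho}(t), \tag{3}$$
where the residual $\boldsymbol{\rho}(t)$ is bounded and $\hat{\mathbf{w}}$ is the following max margin separator:
$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad \forall n: \mathbf{w}^\top \mathbf{x}_n \ge 1. \tag{4}$$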
Theorem 1 holds for common classification loss functions, including the logistic loss, sigmoidal loss, and the exponential loss (see Footnote 1), and for all datasets except a measure zero set (e.g., with probability 1, for any dataset sampled from an absolutely continuous distribution). Soudry et al. (2018b) generalized this theorem to include also the measure zero cases. Theorem 1 implies logarithmically slow convergence in direction to the max-margin separator
$$\left\| \frac{\mathbf{w}(t)}{\|\mathbf{w}(t)\|} - \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|} \right\| = O\left(\frac{1}{\log t}\right)$$
and in the margin
$$\frac{1}{\|\hat{\mathbf{w}}\|} - \min_n \frac{\mathbf{x}_n^\top \mathbf{w}(t)}{\|\mathbf{w}(t)\|} = O\left(\frac{1}{\log t}\right).$$
Footnote 1: Note that for exp-loss, does not have a global smoothness parameter. However, if we initialize with then it is straightforward to show the gradient descent iterates maintain bounded local smoothness , so we will have for all iterates.
Gunasekar et al. (2018) further generalized this characterization to steepest descent with respect to an arbitrary norm, establishing convergence to the maximum margin predictor with respect to the chosen norm. The proof technique used for this more general setting is different: it is based on bounding the decrease in loss and the increase in norm, generalizing the analysis of Telgarsky (2013), which shows how Boosting converges to the max-$\ell_1$-margin predictor. This analysis does not rely on the data being non-degenerate as in Theorem 1 (i.e., it applies for any data set, not only almost all data sets). Although Gunasekar et al. (2018) do not state a rate of convergence, the technique can be used to establish that the margin converges at the rate of $O(1/\log t)$, as summarized in the following theorem (specialized here only for gradient descent), which is proved in appendix A:
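For any separable data set (Definition 1), any initial point $\mathbf{w}(0)$, consider gradient descent iterates with a fixed step size for linear classification with the exponential loss $\ell(u) = \exp(-u)$.
Then the iterates satisfy:
where $\gamma$ is the maximum margin $\gamma = \max_{\|\mathbf{w}\| \le 1} \min_n \mathbf{w}^\top \mathbf{x}_n$.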
Note that Theorem 2 ensures the rate of convergence of the margin, but does not specify how quickly itself converges to the max-margin predictor .
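To get a feel for how slowly the margin improves under fixed-step gradient descent, the following sketch tracks the margin gap on a two-point toy problem with the exponential loss. The dataset, the step size, and the checkpoints are hypothetical choices; for this dataset the max-margin direction is (1, 0) with margin 0.5, which can be verified by hand.

```python
import numpy as np

# Hypothetical toy data: columns are the points x_n, labels already absorbed.
X = np.array([[0.5, 0.5],
              [0.8, -0.2]])
gamma = 0.5                            # max l2 margin of this toy set, attained at w = (1, 0)

def normalized_margin(w, X):
    """min_n w^T x_n / ||w||, the margin of the direction of w."""
    return np.min(X.T @ w) / np.linalg.norm(w)

w = np.zeros(2)
eta = 0.1                              # fixed step size (assumed value)
checkpoints = {100, 1000, 10000}
margin_gap = {}
for t in range(1, 10001):
    grad = -X @ np.exp(-X.T @ w)       # gradient of sum_n exp(-w^T x_n)
    w = w - eta * grad
    if t in checkpoints:
        margin_gap[t] = gamma - normalized_margin(w, X)
```

In such a run, multiplying the number of iterations by 100 shrinks the gap by far less than a factor of 100, which is in line with a logarithmic rate. A single toy run is only an illustration and not evidence for the general statement.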
3 Main Results
Previous results, summarized in Section 2, show that on separable data and with strictly monotone exponentially tailed loss functions, gradient descent converges to the max-margin separator with a very slow rate of $O(1/\log t)$. We therefore first explore if this rate is affected by the choice of the loss function. We examine the following type of loss functions.
3.1 Losses with poly-exponential tails
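A function $f(u)$ has a “tight poly-exponential tail”, if there exist positive constants $\mu_+, \mu_-$, and $\bar{u}$ such that $\forall u > \bar{u}$:
$$\left(1 - \exp(-\mu_- u^{\nu})\right) e^{-u^{\nu}} \le f(u) \le \left(1 + \exp(-\mu_+ u^{\nu})\right) e^{-u^{\nu}}.$$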
For almost all datasets that are linearly separable (Definition 1) and any $\beta$-smooth (Definition 4), strictly monotone loss function $\ell$ (Definition 2) for which $-\ell'(u)$ has a tight poly-exponential tail (Definition 5) with $\nu > 0.25$, given step size $\eta < 2\beta^{-1}\sigma_{\max}^{-2}(\mathbf{X})$ and any initialization $\mathbf{w}(0)$, the iterates of gradient descent in eq. 2 will behave as:
where is the following max margin separator:
and for a constant independent of ,
Theorem 3 implies that $\mathbf{w}(t)/\|\mathbf{w}(t)\|$ still converges to the normalized max margin separator $\hat{\mathbf{w}}/\|\hat{\mathbf{w}}\|$ for poly-exponential tails with $\nu > 0.25$, but with a different rate. In Appendix C we show that Theorem 3 implies the convergence rates specified in Table 1. From this table, we can see that the optimal convergence rate for poly-exponential tails is achieved at $\nu = 1$. Moreover, this rate becomes slower as $|\nu - 1|$ increases, at least in the range $\nu > 0.25$. In section 4 we discuss why our analysis suggests this sub-optimal behavior remains true for $\nu \le 0.25$ and even for slower, sub-exponential tails, until we no longer converge to the max-margin separator if $-\ell'(u)$ has a polynomial tail.
3.2 Faster rates using variable aggressive step sizes
Our analysis so far suggests that exponential tails have an optimal convergence rate, and that for exponential tail losses with a fixed step size the rate of convergence is extremely slow, $O(1/\log t)$. The question, therefore, is whether we can accelerate this rate using variable step sizes. Fortunately, the answer is yes: we can indeed show a faster rate of convergence by aggressively increasing the step size to compensate for the vanishing gradient. Specifically, as we prove in appendix A, using the following normalized gradient descent algorithm, we can attain a rate of $O(\log t/\sqrt{t})$:
$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta_t \frac{\nabla \mathcal{L}(\mathbf{w}(t))}{\|\nabla \mathcal{L}(\mathbf{w}(t))\|}.$$
For any separable data set (Definition 1) and any initial point $\mathbf{w}(0)$, consider the normalized gradient descent updates above with a variable step size $\eta_t = 1/\sqrt{t+1}$ and exponential loss $\ell(u) = \exp(-u)$.
Then the margin of the iterates converges to the max-margin with rate $O(\log t/\sqrt{t})$:
Again, in the appendix we prove a more general version of Theorem 4, which obtains the same rate for any steepest descent algorithm. In Figure 1 we visualize the different rates for GD and normalized GD. As expected, we find that normalized GD converges significantly faster than GD.
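The difference between the two methods can be sketched on a small example. The code below runs fixed-step gradient descent and normalized gradient descent side by side on the exponential loss and compares the resulting margin gaps. The two-point dataset, the fixed step 0.1, and the decaying schedule $\eta_t = 1/\sqrt{t+1}$ are assumptions of this sketch, and it is not the experiment behind Figure 1.

```python
import numpy as np

# Hypothetical toy data: columns are the points x_n, labels already absorbed.
X = np.array([[0.5, 0.5],
              [0.8, -0.2]])
gamma = 0.5                            # max l2 margin of this toy set, attained at w = (1, 0)

def margin_gap(w):
    """gamma minus the normalized margin min_n w^T x_n / ||w||."""
    return gamma - np.min(X.T @ w) / np.linalg.norm(w)

def grad_exp_loss(w):
    """Gradient of L(w) = sum_n exp(-w^T x_n)."""
    return -X @ np.exp(-X.T @ w)

num_steps = 5000
w_gd = np.zeros(2)
w_ngd = np.zeros(2)
for t in range(num_steps):
    # Plain gradient descent with a fixed step size.
    w_gd = w_gd - 0.1 * grad_exp_loss(w_gd)
    # Normalized gradient descent: unit-norm direction, decaying step size.
    g = grad_exp_loss(w_ngd)
    eta_t = 1.0 / np.sqrt(t + 1)
    w_ngd = w_ngd - eta_t * g / np.linalg.norm(g)

gap_gd, gap_ngd = margin_gap(w_gd), margin_gap(w_ngd)
```

Because the normalized update has unit length before scaling, the norm of `w_ngd` grows roughly like the sum of the step sizes, i.e., on the order of $\sqrt{t}$, instead of like $\log t$. On this toy problem that is what makes `gap_ngd` much smaller than `gap_gd` after the same number of steps.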
The observation that aggressive changes in the step size can improve convergence rate is applied in the AdaBoost literature (Schapire and Freund, 2012), where exact line-search is used. We use a slightly less aggressive strategy of decaying step-sizes with normalized gradient descent, attaining a rate of $O(\log t/\sqrt{t})$. This rate almost matches $O(1/\sqrt{t})$, which is the best possible rate for the margin suboptimality in solving hard margin SVM, and that which is achieved by the best known methods (see Footnote 2).
Footnote 2: The best known method in terms of margin suboptimality, and using vector operations (operations on all training examples), is the aggressive Perceptron, which achieved a rate of . Clarkson et al. (2012) obtained an improved method which they showed is optimal, that does not use vector operations, and achieves a rate of where now is the number of scalar operations.
This suggests that gradient descent with a more aggressive step-size policy is quite efficient at margin maximization. We emphasize that our goal here is not to develop a faster SVM optimizer, but rather to understand and improve gradient descent and local search in a way that might be applicable also for deep neural networks, as indicated by preliminary numerical results (appendix E).
4 Ideas behind Theorem 3, and analysis for generic tails
Theorem 3 is a generalization of Theorem 1, and therefore builds on similar ideas as in Soudry et al. (2018a). The complete proof is given in the appendix. In this section we describe non-rigorously the main ideas of the proof (which is rather long, as we calculate exact asymptotic behavior, including constants in some cases), and how these ideas might extend beyond the specific tails considered in Theorem 3. We consider strictly monotone losses (Definition 2) with a general tail, given as $-\ell'(u) = \exp(-f(u))$, such that $f(u)$ is a strictly increasing function of $u$.
4.1 Convergence to the max-margin separator
From Lemma 1 in Soudry et al. (2018a) we know that for linearly separable datasets, and smooth strictly monotonic loss functions, the iterates of GD entail that and as , if the learning rate is sufficiently small. Now, if exists, then we can write where , and . Using this result, the gradients can be written as:
As the exponents become more negative, since is an increasing function, and . Therefore, if is increasing sufficiently fast, only samples with minimal margin contribute to the sum. Examining the gradient descent dynamics, this implies that and also its scaling are a linear non negative combination of support vectors:
these are exactly the KKT conditions for the SVM problem and we can conclude that is proportional to .
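The KKT structure mentioned above can be checked numerically on a small example. The sketch below takes a hypothetical two-point dataset in which both points are support vectors, and verifies that a candidate hard-margin SVM solution is a non-negative combination of them. The dataset and the candidate `w_hat` are assumptions of this example.

```python
import numpy as np

# Hypothetical toy data: columns are the points x_1, x_2, labels already absorbed.
X = np.array([[0.5, 0.5],
              [0.8, -0.2]])
w_hat = np.array([2.0, 0.0])           # candidate hard-margin SVM solution

# Primal feasibility: every margin w_hat^T x_n is at least 1 (here both equal 1).
margins = X.T @ w_hat

# Stationarity: w_hat = sum_n alpha_n x_n. Both points are support vectors,
# so alpha solves the 2x2 linear system X @ alpha = w_hat.
alpha = np.linalg.solve(X, w_hat)

# Dual feasibility requires alpha_n >= 0 for every n.
kkt_ok = bool(np.allclose(margins, 1.0) and np.all(alpha >= 0))
```

Here the coefficients come out unequal (0.8 and 3.2), which illustrates that support vectors can contribute to the max-margin separator with different weights.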
4.2 Calculation of rates and validity conditions
Next, we aim to find and so we can calculate the convergence rates. Also, we aim to find what are the conditions on so this calculation would break. To simplify our analysis we examine the continuous time version of GD, in which we take the limit . In this limit
We define , i.e., the set of indices of support vectors, so we have . From our reasoning above, if increases fast enough, then we expect that the contribution of the non-support vectors to the gradient would be negligible, and therefore
Additionally, if we assume that converges to some direction , and is some vector orthogonal to the support vectors (if such direction exists), then we expect that asymptotic solution to be of the form
In order for this to be a valid solution, it must satisfy eq. 16. We verify this by substitution and examining the leading orders
where in (1) we used a Taylor approximation and in (2) we used that . For the last equation to hold, we require
and satisfies the equations:
where we define as the orthogonal projection matrix to the subspace spanned by the support vectors, and as the complementary projection matrix. Equation 18 has a unique solution for almost every dataset from Lemma 8 in Soudry et al. (2018a). Specifically, this equation does not have a solution when one of the must be equal to zero (i.e., some support vectors exert “zero force” on the margin, which happens only in measure zero cases).
Since we assume that we must have meaning which implies . This condition must hold for this analysis to make sense. Moreover, the differential equation that defines (eq. 17) is generally intractable. However, if the condition holds (which is true for many functions), then we can approximate
which has a closed form solution
4.3 Conjecture and comparison with exact results
This analysis provides the following characterization of the asymptotic solution:
where and , gradient descent (as in eq. 2) with stepsize and any starting point will behave as:
where is not dependent on and so
where is the max margin vector (eq. 4)
To prove this conjecture in general we have to justify various assumptions (e.g., the existence of certain limits) and approximations (e.g., Taylor expansions) we made during the analysis above. It turns out that this becomes more and more difficult as the tail of the loss derivative becomes heavier. Therefore, our exact results on poly-exponential tails (Theorem 3), which assumed asymptotically for $u \to \infty$ that $-\ell'(u) \approx \exp(-u^{\nu})$, were only proved for $\nu > 0.25$. These results are consistent with Conjecture 1:
When , is not and therefore .
When , and therefore
This suggests that Theorem 3 holds even for $\nu \le 0.25$, and that indeed the exponential tail ($\nu = 1$) obtains the optimal convergence rate over all poly-exponential loss functions.
In appendix D, we demonstrate the conjecture by examining two examples. The first example proves that when is not (e.g. the loss has a power-law tail) we might not converge to the max-margin separator. The second example demonstrates convergence with sub-poly-exponential tails (that satisfy ).
In this work, we have examined the convergence rate of gradient descent on separable data, in binary linear classification tasks, given strictly monotone and smooth loss functions. First, we examined how the convergence rate depends on the tail of the loss function. In Theorem 3, an extension of Theorem 1, we rigorously derived the convergence rate for loss functions with poly-exponential tails, for which $-\ell'(u) \approx \exp(-u^{\nu})$, in the range $\nu > 0.25$. In that range, the exponential tail ($\nu = 1$) has the optimal rate. This offers a possible explanation for the empirical preference for exponentially-tailed loss functions over other poly-exponential tailed losses (probit is perhaps the only example), since the exponential loss can lead to faster convergence to the asymptotic (implicitly biased) solution, as we showed here. Further analysis suggested that the rate for the exponential tail remains optimal outside of this range, and even for sub-exponential tails, until, for polynomial tails, we no longer converge to the max-margin separator.
In Theorem 4, an extension of Theorem 2, we showed that the convergence of gradient descent for exponential loss function could be significantly accelerated by simply increasing the learning rate. In fact, the normalized gradient descent algorithm can also approximate the regularization path in the following sense. Let , and . Then
As a simple implication of this, the gradient descent path starting at has , so after steps the loss achieved by is close to the best predictor of the same norm. This shows that gradient descent is closely approximating the regularization path.
Theorems 3 and 4, and their proof methods, both seem to have their own strengths and weaknesses. The analysis behind Theorem 3 allows exact calculation of convergence rates to the max-margin separator (including constants in some cases). These rates are easy to calculate and understand intuitively. However, it is significantly harder to prove them rigorously. Such a proof seems to become harder as the loss tail becomes heavier since we have to consider additional terms in the asymptotic calculations (this is why Theorem 3 stops at ). Additionally, the current results for are limited to “almost every dataset” (e.g., any dataset sampled from an absolutely continuous distribution). However, we believe that it is possible to derive the corrections to the convergence rate resulting from zero measure cases and that these should be of lower order (e.g., for the exponential tail), as proved in Soudry et al. (2018b) for exponential tails.
In contrast, the proof of Theorem 4 is significantly simpler, does not require any assumptions on the dataset beyond its linear separability, and it easily generalizes to steepest descent, and to variable step sizes. However, this approach has some weaknesses. First, the results are only stated for the exact exp-loss. In contrast, Theorem 3 only requires the tail of the loss to be poly-exponential. Extension of this result to losses with exponential tails seems possible, based on the methods of Telgarsky (2013), but less so to other types of tails. Second, this theorem only provides a bound, not an exact asymptotic result (in contrast to Theorem 3). Furthermore, this bound is only on the margin, so we may have a different rate on the convergence to max-margin separator itself; this is the case exactly in the zero measure cases of Theorem 3, as shown in Soudry et al. (2018b).
- Clarkson et al. (2012) Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):23, 2012.
- Gunasekar et al. (2018) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint, 2018.
- Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, 2017.
- Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.
- Schapire and Freund (2012) Robert E. Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.
- Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, and Nathan Srebro. The implicit bias of gradient descent on separable data. ICLR, 2018a.
- Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data (journal version). arXiv preprint: 1710.10345v3, 2018b.
- Telgarsky (2013) Matus Telgarsky. Margins, shrinkage and boosting. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, pages II–307. JMLR.org, 2013.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
In this section we prove extended versions of Theorems 2 and 4. In this section only, the norm $\|\cdot\|$ is a general norm (not the $\ell_2$ norm, as in the rest of the paper). First, we state definitions and auxiliary results.
The following lemma is a standard result in convex analysis.
Lemma 1 (Fenchel Duality)
Let , and be two closed convex functions. Then
Let be the data matrix and without loss of generality . Define so that , and the - margin as . We wish to show that for all , which is an analog of the Polyak condition. Define . By noting that and , this can be restated as . Since we require this for all , , and norms are homogeneous, this is equivalent to
where is the -dimensional probability simplex.
The following duality holds:
Let and . Thus . The conjugates are , and . The LHS of Lemma 1 is
Thus the LHS is equal to , since it is precisely the optimization program of -SVM. By weak duality, we have shown that .
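For intuition, the duality can be checked numerically in the special case of the $\ell_2$ norm, which is its own dual. There the statement reduces to the well-known fact that the maximum margin equals the smallest norm of a convex combination of the data points. The sketch below compares both sides by brute-force grid search on a hypothetical two-point dataset; the dataset and grid sizes are assumptions of this example.

```python
import numpy as np

# Hypothetical toy data: columns are the points x_n, labels already absorbed.
X = np.array([[0.5, 0.5],
              [0.8, -0.2]])

# Primal side: max over unit vectors w of min_n w^T x_n (grid over angles).
thetas = np.linspace(-np.pi, np.pi, 100001)
W = np.stack([np.cos(thetas), np.sin(thetas)])       # shape (2, K), unit-norm columns
primal_margin = np.max(np.min(X.T @ W, axis=0))

# Dual side: min over the simplex of ||X p||_2 with p = (q, 1 - q) (grid over q).
qs = np.linspace(0.0, 1.0, 10001)
P = np.stack([qs, 1.0 - qs])                          # shape (2, K), columns in the simplex
dual_value = np.min(np.linalg.norm(X @ P, axis=0))
```

Both grid searches return values close to 0.5 on this dataset. The dual minimizer puts weight 0.2 on the first point and 0.8 on the second, which cancels the second coordinate.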
Using this lower bound we proceed with the optimization analysis which largely follows the standard arguments from the optimization literature on first-order methods. We prove the theorems for general steepest descent algorithms which includes gradient descent as a special case.
A.1 Proof of Theorem 2
Consider the steepest descent algorithm:
Note that for quadratic norm steepest descent is simply gradient descent.
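To make the relation between steepest descent and gradient descent concrete, the sketch below computes the steepest descent direction for three norms. It assumes the common definition in which the direction minimizes $\langle \nabla \mathcal{L}, \mathbf{v} \rangle + \tfrac{1}{2}\|\mathbf{v}\|^2$ over $\mathbf{v}$. The function name and the set of supported norms are choices made for this example.

```python
import numpy as np

def steepest_descent_direction(grad, norm):
    """Return argmin_v <grad, v> + 0.5 * ||v||^2 for the given norm (assumed definition)."""
    g = np.asarray(grad, dtype=float)
    if norm == "l2":
        # Euclidean norm: the direction is the negative gradient (plain GD).
        return -g
    if norm == "l1":
        # l1 norm: move only along the coordinate with the largest |gradient|.
        v = np.zeros_like(g)
        i = np.argmax(np.abs(g))
        v[i] = -g[i]                   # step magnitude equals ||g||_inf
        return v
    if norm == "linf":
        # l_inf norm: signed step on every coordinate, scaled by ||g||_1.
        return -np.sum(np.abs(g)) * np.sign(g)
    raise ValueError(f"unsupported norm: {norm}")

g_example = np.array([3.0, -1.0, 2.0])
d_l2 = steepest_descent_direction(g_example, "l2")
d_l1 = steepest_descent_direction(g_example, "l1")
d_linf = steepest_descent_direction(g_example, "linf")
```

In each case the objective value at the minimizer is $-\tfrac{1}{2}\|\nabla \mathcal{L}\|_*^2$, where $\|\cdot\|_*$ is the dual norm. This is why the dual norm of the gradient appears naturally in analyses of this kind.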
Next, we prove the generalized version of Theorem 2, which applies to steepest descent (instead of just gradient descent):
Let us assume that . If the step size , then the iterates satisfy:
In particular, if there is a unique maximum- margin solution , then .
By Lemma 11 of Gunasekar et al. (2018) , for any . Then Taylor’s theorem gives and noting that ,
Next, we bound the un-normalized margin.
By applying ,
We have the following upper bound on
For every , we have that
Use that by the duality theorem,
Next we prove that . From (37),
and summing gives,
Next we show that . From Equation (37),
Since we chose ,