Gradient Methods Never Overfit On Separable Data

06/30/2020 ∙ by Ohad Shamir, et al.

A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate of 𝒪̃(1/γ^2 T) up till T≈ m, at which point the generalization error remains fixed at an essentially optimal level of 𝒪̃(1/γ^2 m) regardless of how large T is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.


1 Introduction

Motivated by empirical observations in the context of neural networks, there is considerable interest nowadays in studying the implicit bias of learning algorithms. This refers to the fact that even without any explicit regularization or other techniques to avoid overfitting, the dynamics of the learning algorithm itself biases its output towards “simple” predictors that generalize well.

In this paper, we consider the implicit bias in a well-known and simple setting, namely learning linear predictors (x ↦ ⟨w, x⟩) for binary classification with respect to linearly-separable data. In a recent line of works (Soudry et al., 2018; Ji and Telgarsky, 2018b; Nacson et al., 2019a; Ji and Telgarsky, 2019b; Dudik et al., 2020), it was shown that if we attempt to do this by minimizing the empirical risk (average loss) over a dataset, using gradient descent and any exponentially-tailed loss (such as the logistic loss), then the predictor asymptotically converges in direction to the max-margin predictor with respect to the Euclidean norm (namely, argmax_{w : ‖w‖ ≤ 1} min_{i} ⟨w, x_i⟩ for a given dataset {x_i}_{i=1}^m). Since there are standard generalization bounds for predictors which achieve a large margin over the dataset, we get that asymptotically, gradient descent does not overfit, even if we just run it on the empirical risk function without any explicit regularization, and even if the number of iterations diverges to infinity. In follow-up works, similar results were also obtained for other gradient methods such as stochastic gradient descent and mirror descent (Nacson et al., 2019b; Gunasekar, 2018), and for more complicated predictors such as linear networks, shallow ReLU networks, and linear convolutional networks (Ji and Telgarsky, 2018a, 2019a; Gunasekar et al., 2018).
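As a quick illustration of this phenomenon, the following self-contained Python sketch (our own illustrative code, not taken from any of the papers cited above; all names and constants are arbitrary choices) runs plain gradient descent on the logistic loss over a synthetic separable dataset, and prints the normalized margin min_i ⟨w_t/‖w_t‖, x_i⟩ as t grows. The margin indeed improves without any regularization, but only very slowly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data: labels are folded into the points, so
# <w, x_i> > 0 means "x_i is classified correctly" (cf. Sec. 2).
m, d, gamma = 200, 5, 0.1
X = rng.normal(size=(m, d))
X[:, 0] = 0.0
X *= np.sqrt(1 - gamma**2) / np.linalg.norm(X, axis=1, keepdims=True)
X[:, 0] = gamma                      # unit-norm points with <e1, x_i> = gamma

def grad_risk(w):
    """Gradient of the empirical logistic risk (1/m) sum_i log(1 + exp(-<w, x_i>))."""
    s = 1.0 / (1.0 + np.exp(X @ w))  # equals -l'(<w, x_i>) for the logistic loss
    return -(X * s[:, None]).mean(axis=0)

w = np.zeros(d)
for t in range(1, 10**5 + 1):
    w -= 1.0 * grad_risk(w)          # plain gradient descent, no regularization
    if t in (10, 100, 1000, 10**4, 10**5):
        w_unit = w / np.linalg.norm(w)
        print(f"t={t:>6}  normalized margin = {(X @ w_unit).min():.4f}")
```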

However, in practice the number of iterations is some bounded finite number. Thus, the asymptotic results above leave open the possibility that for a wide range of values of T, gradient methods do not achieve a good margin, and possibly overfit. Admittedly, many of these papers do provide finite-time guarantees for linear predictors, which all tend to have the following form: After T iterations, the output w_T of gradient descent (normalized to have unit norm) satisfies

‖w_T/‖w_T‖ − ŵ‖ ≤ 𝒪̃(1/log(T)),   (1)

where ŵ is the max-margin predictor, and the 𝒪̃ notation hides dependencies on the dataset size and the margin attained by ŵ. However, such bounds do not satisfactorily address the problem above, since they decay extremely slowly with T. For example, suppose we ask how many iterations are needed till we get a predictor which achieves some positive margin on all the data points, assuming there exists a unit-norm predictor achieving a margin of γ (namely, min_i ⟨ŵ, x_i⟩ ≥ γ). If all we know is the bound in Eq. (1), we must require 𝒪̃(1/log(T)) < γ, which holds only if T ≥ exp(1/γ) (and in fact, the actual required bound is much larger due to the hidden dependencies in the 𝒪̃ notation). For realistically small values of γ, this bound on T is unacceptably large. Could it be that gradient methods do not overfit only after so many iterations? We note that in Ji and Telgarsky (2019b), it is shown that Eq. (1) is essentially tight, but this does not preclude the possibility that w_T does not overfit even before getting very close to ŵ.
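To get a feel for these scales (an illustrative back-of-the-envelope calculation of ours, not an excerpt from the paper):

```python
import math

gamma = 0.01                                           # an illustrative margin value
# Iterations needed before the O(1/log T) bound of Eq. (1) implies any margin:
print(f"exp(1/gamma) = {math.exp(1 / gamma):.2e}")     # ~2.69e+43
# By contrast, a 1/(gamma^2 T) type guarantee is already small at modest T:
T = 10**6
print(f"1/(gamma^2 T) = {1 / (gamma**2 * T):.2f}")     # 0.01
```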

In this paper, we show that this is indeed the case, and in fact, for any number of iterations T, gradient methods do not overfit and essentially behave in the best manner we can hope for. Specifically, if the underlying data distribution is separable with margin γ > 0, and we attempt to minimize the average of an exponential or logistic loss over a training set of size m, using standard gradient methods (gradient flow, gradient descent, or stochastic gradient descent), then the generalization error of the resulting predictor is at most 1/(γ²T) + 1/(γ²m), up to constants and logarithmic factors. For T ≲ m, this bound is 𝒪̃(1/(γ²T)), which is essentially the same as the optimal upper bound on the empirical risk of the algorithm’s output. In other words, both the generalization error and the empirical risk provably go down at the same (essentially optimal) rate. Once T ≳ m, the empirical risk may further decrease towards zero, but the generalization error remains at 𝒪̃(1/(γ²m)), which is well-known to be essentially optimal for any learning algorithm in this setting.

To prove these results, we also establish more refined, non-asymptotic bounds on the margins attained on the dataset, which are also readily applicable to other losses. In general, these bounds imply that for any constant c ∈ (0, 1), after T iterations, the resulting predictor achieves a margin of Ω(γ) on all but an 𝒪̃(1/(γ²T)^c) fraction of the data points. These bounds guarantee that as T increases, for larger and larger portions of the dataset, the predictors achieve a margin of Ω(γ), implying good generalization properties. Finally, we also provide a lower bound, showing that such guarantees are essentially optimal.

Before continuing, we emphasize that the techniques we use in our upper bounds are not fundamentally novel, and similar ideas were employed in previous analyses on the convergence to the max-margin predictor, such as in Ji and Telgarsky (2018b, 2019a) (in fact, some of our results build on these analyses). However, we apply these techniques to a conceptually different question, about the non-asymptotic ability of gradient methods to attain some significant margin. In addition, since we only care about convergence to some large-margin predictor (as opposed to the max-margin predictor), our analysis can be shorter and simpler.

Finally, we note that polynomial-time, non-asymptotic guarantees on the generalization error of unregularized gradient methods were also obtained in Ji and Telgarsky (2019a) and version 2 of Ji and Telgarsky (2018b). However, the former is for nonlinear predictors, and the latter is for one-pass stochastic gradient descent, which is different than the algorithms considered here and where necessarily T ≤ m. Moreover, the bounds in both papers have a worse polynomial dependence on the margin γ, compared to our results.

The paper is structured as follows. In the next section, we define some useful notation, and formally describe our setting. In Sec. 3, we provide our positive results about the margin behavior and the generalization error, focusing on gradient flow (for which our analysis is the simplest and completely self-contained). In Sec. 4, we show how similar results can be obtained for gradient descent and stochastic gradient descent. In Sec. 5, focusing for concreteness on gradient descent, we show that our positive result on the margin behavior is essentially tight (however, similar analyses can be performed for the other methods we consider).

2 Preliminaries

We generally let boldfaced letters denote vectors. Given a positive integer m, we let [m] be a shorthand for {1, 2, …, m}. Given a nonzero vector w, its normalization to unit norm is w/‖w‖. We use the standard 𝒪(·) and Ω(·) notation to hide constants, and 𝒪̃(·), Ω̃(·) to hide constants and factors polylogarithmic in the problem parameters. log(·) refers to the natural logarithm, and ‖·‖ refers to the Euclidean norm.

We consider datasets defined by a set of vectors {x₁, …, x_m} in ℝ^d, and algorithms which attempt to minimize the empirical risk function, namely

F(w) := (1/m) ∑_{i=1}^m ℓ(⟨w, x_i⟩),

where ℓ : ℝ → ℝ is some loss function. (In the context of binary classification, it is customary to consider labeled data points (x_i, y_i) and losses of the form ℓ(y_i·⟨w, x_i⟩). However, for our purposes we can fold the binary label y_i ∈ {−1, +1} into x_i, and treat each example as a single vector.) We will utilize the following two assumptions about the dataset and the loss:

Assumption 1.

max_{i∈[m]} ‖x_i‖ ≤ 1, and the dataset is separable with margin γ > 0: Namely, there exists a unit vector w* s.t. min_{i∈[m]} ⟨w*, x_i⟩ ≥ γ.

Assumption 2.

ℓ is convex, monotonically decreasing, and has an inverse function ℓ⁻¹ on the interval (0, ℓ(0)]. (Namely, for any v ∈ (0, ℓ(0)], there is a unique z such that ℓ(z) = v.)

We note that assumption 1 is without much loss of generality (it simply sets the scaling of the problem). Assumption 2 implies that F is convex, and is satisfied for most classification losses. When instantiating our general results, we will focus for concreteness on the logistic loss ℓ(z) = log(1 + exp(−z)) and the exponential loss ℓ(z) = exp(−z). However, our results can also be applied to other losses (such as the hinge loss).
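For concreteness, the following small Python module (our own sketch; the function names are ours) implements these two losses, their inverses on (0, ℓ(0)], and the empirical risk, matching the definitions above:

```python
import numpy as np

def logistic(z):            # l(z) = log(1 + exp(-z)); positive, convex, decreasing
    return np.logaddexp(0.0, -z)

def exponential(z):         # l(z) = exp(-z)
    return np.exp(-z)

def logistic_inv(v):        # the unique z with logistic(z) = v (v in (0, log 2] gives z >= 0)
    return -np.log(np.expm1(v))

def exponential_inv(v):     # the unique z with exponential(z) = v, for 0 < v <= 1
    return -np.log(v)

def empirical_risk(w, X, loss=logistic):
    """F(w) = (1/m) sum_i loss(<w, x_i>), with labels folded into the rows of X."""
    return loss(X @ w).mean()
```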

In our proofs, we will make use of the following well-known facts (see for example Nesterov (2018)): For a convex differentiable function f on ℝ^d, it holds for any vectors w, u that ⟨∇f(w), w − u⟩ ≥ f(w) − f(u). Also, if f is a function with β-Lipschitz gradients, then for any w, u, f(u) ≤ f(w) + ⟨∇f(w), u − w⟩ + (β/2)‖u − w‖².

3 Gradient Flow

In this section, we present positive results on the margin behavior and generalization error of gradient flow. Gradient flow is a standard continuous-time analogue of gradient descent. Although it cannot be implemented precisely in practice, it is the method for which it is easiest to present our analysis (the analysis for gradient descent in the next section is just a slight variation). Gradient flow produces a continuous trajectory of vectors {w(t)}, indexed by a time parameter t ≥ 0. It is defined by a starting point w(0) (which will be the origin 0 in our case), and the differential equation

d/dt w(t) = −∇F(w(t)).
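Gradient flow can be approximated numerically by Euler integration, i.e., gradient descent with a small step size. A minimal sketch (ours, for intuition only; dt and the data are arbitrary):

```python
import numpy as np

def gradient_flow(X, total_time=100.0, dt=0.01):
    """Euler-discretized gradient flow on the exponential-loss empirical risk
    F(w) = (1/m) sum_i exp(-<w, x_i>), started at the origin."""
    w = np.zeros(X.shape[1])
    for _ in range(int(total_time / dt)):
        grad = -(X * np.exp(-(X @ w))[:, None]).mean(axis=0)  # gradient of F at w
        w = w - dt * grad            # Euler step: w(t + dt) ~ w(t) - dt * grad_F(w(t))
    return w
```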

We now present a general result about the margin behavior of gradient flow, which is applicable to general losses and any vector reached by gradient flow at any time point:

Theorem 1.

Under assumptions 1 and 2, let w(t) be some point reached by gradient flow, such that F(w(t)) = ε for some ε ∈ (0, ℓ(0)). Then for any δ ∈ (0, 1] such that ε/δ ≤ ℓ(0), for at least (1 − δ)m of the indices i ∈ [m],

⟨w(t)/‖w(t)‖, x_i⟩ ≥ (γ/2)·ℓ⁻¹(ε/δ)/ℓ⁻¹(ε).

Intuitively, we expect the empirical risk F(w(t)) to decay with t (as we instantiate for specific losses later on). Computing the corresponding bounds on ε and plugging into the above, we can get guarantees on how many points in our dataset achieve a certain margin. Note that since ℓ⁻¹ is monotonically decreasing, the margin lower bound in the theorem is always at most γ/2. Since under assumption 1 the maximal margin is at least γ, this result cannot be used to recover the asymptotic convergence to the max-margin predictor, shown by previous results. However, as discussed in the introduction, this is not our focus here: We ask about the time to converge to some large-margin predictor which generalizes well, not necessarily the max-margin predictor. For that purpose, as we will see later, a margin lower bound of Ω(γ) is perfectly adequate.
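The counting step behind Thm. 1 is just Markov’s inequality over the m loss values, and is easy to sanity-check numerically (our own sketch; w, X and delta are placeholders):

```python
import numpy as np

def exp_loss(z):
    return np.exp(-z)

def margin_violation_count(w, X, delta, loss=exp_loss):
    """Markov-inequality counting step behind Thm. 1: with eps := F(w), at most
    delta * m points can have loss(<w, x_i>) > eps/delta (the loss is nonnegative),
    i.e. at least (1 - delta) * m points satisfy <w, x_i> >= l^{-1}(eps / delta)."""
    losses = loss(X @ w)
    eps = losses.mean()                        # empirical risk F(w)
    count = int((losses > eps / delta).sum())  # Markov: count <= delta * len(X)
    assert count <= delta * len(X)
    return count
```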

To prove Thm. 1, we will need the following key lemma, which bounds the norm of the points along the trajectory of gradient flow, in terms of the value of F along it. The proof is short and relies only on the convexity of F:

Lemma 1.

Fix some t ≥ 0, and let u be any vector such that F(u) ≤ F(w(t)). Then ‖w(t)‖ ≤ 2‖u‖.

Proof.

By definition of gradient flow and the chain rule, we have

d/dt F(w(t)) = ⟨∇F(w(t)), d/dt w(t)⟩ = −‖∇F(w(t))‖² ≤ 0,

so F(w(t)) is monotonically decreasing in t. As a result, for any t' ∈ [0, t], F(w(t')) ≥ F(w(t)) ≥ F(u). By convexity of F, it follows that

⟨∇F(w(t')), w(t') − u⟩ ≥ F(w(t')) − F(u) ≥ 0.

Using this inequality, the definition of gradient flow and the chain rule, it follows that

d/dt' (½‖w(t') − u‖²) = ⟨w(t') − u, d/dt' w(t')⟩ = −⟨w(t') − u, ∇F(w(t'))⟩ ≤ 0.

Therefore, ‖w(t') − u‖ is monotonically decreasing in t', so ‖w(t) − u‖ is at most ‖w(0) − u‖ = ‖u‖. Thus, by the triangle inequality, ‖w(t)‖ ≤ ‖w(t) − u‖ + ‖u‖ ≤ 2‖u‖. ∎

Proof of Thm. 1.

Let w* be a max-margin separator, so that ‖w*‖ = 1 and min_{i∈[m]} ⟨w*, x_i⟩ ≥ γ. Define u := (ℓ⁻¹(ε)/γ)·w*, which has norm ℓ⁻¹(ε)/γ, and note that since ⟨u, x_i⟩ ≥ ℓ⁻¹(ε) for all i and ℓ is monotonically decreasing,

F(u) = (1/m) ∑_{i=1}^m ℓ(⟨u, x_i⟩) ≤ ℓ(ℓ⁻¹(ε)) = ε.

This implies F(u) ≤ ε = F(w(t)). Combined with Lemma 1, we get that

‖w(t)‖ ≤ 2·ℓ⁻¹(ε)/γ.   (2)

Since F(w(t)) = ε, we get that 𝔼_i[ℓ(⟨w(t), x_i⟩)] = ε, where i is uniformly distributed on [m]. By Markov’s inequality and Eq. (2), it follows that for at least (1 − δ)m of the indices i ∈ [m], we have ℓ(⟨w(t), x_i⟩) ≤ ε/δ, hence ⟨w(t), x_i⟩ ≥ ℓ⁻¹(ε/δ) ≥ 0, and therefore

⟨w(t)/‖w(t)‖, x_i⟩ ≥ ℓ⁻¹(ε/δ)/‖w(t)‖ ≥ (γ/2)·ℓ⁻¹(ε/δ)/ℓ⁻¹(ε). ∎

For concreteness, let us now apply Thm. 1 to the case of the exponential loss, ℓ(z) = exp(−z). In order to get an interesting guarantee, we will utilize the following simple non-asymptotic guarantee on the decay of F(w(t)) as a function of t:

Lemma 2.

Under assumption 1, if ℓ is the exponential loss, then F(w(t)) ≤ 1/(1 + γ²t) for any t ≥ 0.

Proof.

Let w* be a max-margin unit vector, so that min_{i∈[m]} ⟨w*, x_i⟩ ≥ γ. By definition of gradient flow, the chain rule and Cauchy-Schwartz,

d/dt F(w(t)) = −‖∇F(w(t))‖² ≤ −⟨∇F(w(t)), −w*⟩² = −((1/m) ∑_i exp(−⟨w(t), x_i⟩)·⟨w*, x_i⟩)² ≤(∗) −γ²·F(w(t))²,

where in (∗) we used the fact that ⟨w*, x_i⟩ ≥ γ > 0 for all i. Consider now the function g(t) := 1/(1 + γ²t), which satisfies g(0) = 1 and d/dt g(t) = −γ²g(t)². We clearly have F(w(0)) = ℓ(0) = 1 = g(0), and by differentiation and the inequality above, it is easily verified that d/dt (F(w(t)) − g(t)) ≤ 0 whenever F(w(t)) = g(t). Since F(w(·)) − g(·) is continuous, we must have F(w(t)) ≤ g(t) for all t, hence F(w(t)) ≤ 1/(1 + γ²t). ∎
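The 1/(1 + γ²t) bound is easy to check numerically against the Euler-discretized gradient flow from above (our own sketch; the data construction and all constants are arbitrary, and a small dt keeps the discretization close to the continuous flow):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, gamma, dt = 100, 5, 0.2, 0.01
X = rng.normal(size=(m, d))
X[:, 0] = 0.0
X *= np.sqrt(1 - gamma**2) / np.linalg.norm(X, axis=1, keepdims=True)
X[:, 0] = gamma                       # unit-norm points with <e1, x_i> = gamma exactly

w, t = np.zeros(d), 0.0
for step in range(200_000):           # Euler-discretized gradient flow, exp loss
    w += dt * (X * np.exp(-(X @ w))[:, None]).mean(axis=0)
    t += dt
    if (step + 1) % 50_000 == 0:
        F = np.exp(-(X @ w)).mean()
        print(f"t={t:7.1f}  F(w(t))={F:.5f}  bound 1/(1+gamma^2 t)={1/(1 + gamma**2 * t):.5f}")
```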

Using this lemma, we get the following corollary of Thm. 1 for the exponential loss:

Theorem 2.

Under assumption 1, if ℓ is the exponential loss, then for any t > 0 and any α ∈ (0, 1), it holds that ⟨w(t)/‖w(t)‖, x_i⟩ ≥ (1 − α)γ/2 for at least (1 − (1 + γ²t)^{−α})·m of the indices i ∈ [m].

Proof.

By Lemma 2, we know that at time t, F(w(t)) = ε for some ε ≤ 1/(1 + γ²t) (which implies log(1/ε) ≥ log(1 + γ²t)). Applying Thm. 1 (noting that ℓ⁻¹(v) = log(1/v) and that ℓ(0) = 1), we get that for any δ ∈ [ε, 1], for at least (1 − δ)m of the indices i ∈ [m],

⟨w(t)/‖w(t)‖, x_i⟩ ≥ (γ/2)·log(δ/ε)/log(1/ε) ≥ (γ/2)·(1 − log(1/δ)/log(1 + γ²t)).

In particular, picking δ = (1 + γ²t)^{−α} for some α ∈ (0, 1), the result follows. ∎

Note that if t ≥ (m^{1/α} − 1)/γ², then (1 + γ²t)^{−α} ≤ 1/m, which implies that for all i ∈ [m], ⟨w(t)/‖w(t)‖, x_i⟩ is at least (1 − α)γ/2. However, even for smaller values of t, the theorem provides guarantees on the margin attained on most points in the dataset. Moreover, using standard margin-based generalization bounds for binary classification, this theorem implies that the predictors returned by gradient flow achieve low generalization error, uniformly over all sufficiently large time points t:

Theorem 3.

Let 𝒟 be some distribution over the unit ball in ℝ^d, such that there exists a unit vector w* and γ > 0 satisfying Pr_{x∼𝒟}(⟨w*, x⟩ ≥ γ) = 1. If we sample m points i.i.d. from 𝒟, and run gradient flow on F(·), where ℓ is the exponential loss, then with probability at least 1 − δ over the sample, it holds for any time t > 0 that

Pr_{x∼𝒟}(⟨w(t), x⟩ ≤ 0) ≤ 𝒪̃(1/(γ²t) + 1/(γ²m)),

where the 𝒪̃ notation hides universal constants and factors polylogarithmic in m, t and 1/δ.

As discussed in the introduction, this is essentially the best behavior we can hope for (up to log factors): For t ≲ m, the generalization error decays as 𝒪̃(1/(γ²t)), which is also the (optimal) bound on the empirical risk. Once t ≳ m, the generalization error becomes 𝒪̃(1/(γ²m)), and stays there regardless of how large t is.
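This train/test behavior can be simulated directly (a sketch of ours, using small-step gradient descent as a stand-in for gradient flow; the distribution and all constants are arbitrary — the point is only that the train risk keeps decreasing while the test error flattens out):

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma, m = 5, 0.2, 50

def sample(n):
    """A margin-gamma separable distribution: unit vectors with first coordinate gamma."""
    Z = rng.normal(size=(n, d))
    Z[:, 0] = 0.0
    Z *= np.sqrt(1 - gamma**2) / np.linalg.norm(Z, axis=1, keepdims=True)
    Z[:, 0] = gamma
    return Z

X_train, X_test = sample(m), sample(10_000)
w = np.zeros(d)
for t in range(1, 100_001):
    s = np.exp(-(X_train @ w))                       # per-point exponential losses
    w += 0.1 * (X_train * s[:, None]).mean(axis=0)   # small-step GD ~ gradient flow
    if t in (100, 1_000, 10_000, 100_000):
        print(f"t={t:>6}  train risk={s.mean():.5f}  "
              f"test error={(X_test @ w <= 0).mean():.4f}")
```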

Proof of Thm. 3.

The bound in the theorem is vacuous when γ²t ≤ 1 or γ²m ≤ 1, so we will assume without loss of generality that both quantities are larger than 1.

Standard margin-based generalization bounds (e.g. McAllester (2003)) imply that in our setting, if we pick m points i.i.d., then with probability at least 1 − δ, any unit-norm vector w which satisfies ⟨w, x_i⟩ ≥ θ > 0 for all but k of the m points satisfies

Pr_{x∼𝒟}(⟨w, x⟩ ≤ 0) ≤ 𝒪̃(k/m + 1/(θ²m)),

where the 𝒪̃ hides universal constants and factors polylogarithmic in m and 1/δ. In particular, this can be applied uniformly over a suitable grid of values of θ, at the cost of additional logarithmic factors. By Thm. 2, we can substitute θ = (1 − α)γ/2 and k = (1 + γ²t)^{−α}·m, to get that with probability at least 1 − δ,

Pr_{x∼𝒟}(⟨w(t), x⟩ ≤ 0) ≤ 𝒪̃((1 + γ²t)^{−α} + 1/((1 − α)²γ²m)).   (3)

This holds for any α ∈ (0, 1). In particular, if t ≤ m, pick α = 1 − 1/log(1 + γ²t), in which case Eq. (3) is at most

𝒪̃(1/(γ²t) + log²(1 + γ²t)/(γ²m)) ≤ 𝒪̃(1/(γ²t)),

and if t > m, pick α = 1 − 1/log(1 + γ²m), in which case Eq. (3) is at most

𝒪̃(1/(γ²m) + log²(1 + γ²m)/(γ²m)) ≤ 𝒪̃(1/(γ²m)),

where we used the fact that (1 + γ²t)^{−α} ≤ (1 + γ²m)^{−α} when t ≥ m. Combining the two cases, the result follows. ∎

4 Gradient Descent and Stochastic Gradient Descent

Having discussed gradient flow, we show in this section how essentially identical results can be obtained for gradient descent and stochastic gradient descent.

4.1 Gradient Descent

Gradient descent, which is perhaps the simplest and most well-known gradient method, optimizes F by initializing at some point w₀ (which will be the origin 0 in our case), and performing iterations of the form w_{t+1} = w_t − η_t·∇F(w_t), where η_t > 0 are step size parameters. We will utilize the following standard assumption:
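In code, one iteration is a single line (our sketch; `grad_F` is any function returning ∇F(w), e.g. built from the loss helpers shown earlier):

```python
import numpy as np

def gradient_descent(grad_F, d, steps, eta=1.0):
    """Plain gradient descent from the origin: w_{t+1} = w_t - eta_t * grad_F(w_t).
    eta may be a constant or a callable t -> eta_t."""
    w = np.zeros(d)
    for t in range(steps):
        step = eta(t) if callable(eta) else eta
        w = w - step * grad_F(w)
    return w
```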

Assumption 3.

The derivative of ℓ is β-Lipschitz, and η_t ≤ 1/β for all t.

Inspecting the analysis for gradient flow from the previous section, we note that we relied on the algorithm’s structure only at two points: In Lemma 1, to bound the norm of the points along the trajectory, and in Lemma 2, to upper bound the values of F(w(t)). However, it is easy to provide analogues of these two lemmas for gradient descent:

Lemma 3.

Under assumptions 1, 2 and 3, fix some iteration t, and let u be any vector such that F(u) ≤ F(w_t). Then ‖w_t‖ ≤ 2‖u‖.

Lemma 4.

(Ji and Telgarsky (2018b, Theorem 3.1)) If ℓ is the logistic loss, then under assumption 3, gradient descent with step size η_t = 1 satisfies F(w_t) ≤ 𝒪̃(1/(γ²t)) for any t, where the 𝒪̃ hides logarithmic factors in γ²t.

The proof of Lemma 3 (which is a slight variation on the proof of Lemma 1) appears in Appendix A.

With these lemmas, we can prove analogues of the theorems from the previous section, this time for gradient descent and for the logistic loss. The resulting bounds are identical up to constants and logarithmic factors:

Theorem 4.

Under assumptions 1, 2 and 3, let w_t be some point reached by gradient descent, such that F(w_t) = ε for some ε ∈ (0, ℓ(0)). Then for any δ ∈ (0, 1] such that ε/δ ≤ ℓ(0), for at least (1 − δ)m of the indices i ∈ [m],

⟨w_t/‖w_t‖, x_i⟩ ≥ (γ/2)·ℓ⁻¹(ε/δ)/ℓ⁻¹(ε).

Theorem 5.

Under assumption 1, if ℓ is the logistic loss, and we use a fixed step size η_t = 1 for all t, then for any t such that γ²t ≥ 1, and any α ∈ (0, 1), the gradient descent iterates satisfy ⟨w_t/‖w_t‖, x_i⟩ ≥ Ω((1 − α)γ) for at least (1 − 𝒪̃(1/(γ²t)^α))·m of the indices i ∈ [m].

Theorem 6.

Let 𝒟 be some distribution over the unit ball in ℝ^d, such that there exists a unit vector w* and γ > 0 satisfying Pr_{x∼𝒟}(⟨w*, x⟩ ≥ γ) = 1. If we sample m points i.i.d. from 𝒟, and run gradient descent with fixed step sizes η_t = 1 on F(·), where ℓ is the logistic loss, then with probability at least 1 − δ over the sample, it holds for any iteration T that

Pr_{x∼𝒟}(⟨w_T, x⟩ ≤ 0) ≤ 𝒪̃(1/(γ²T) + 1/(γ²m)),

where the 𝒪̃ notation hides universal constants and factors polylogarithmic in m, T and 1/δ.

The results can be easily generalized to other step size strategies. The proofs are essentially identical to the proofs from the previous section, except that we use Lemmas 3 and 4 instead of Lemmas 1 and 2. In particular, the proof of Thm. 4 is identical to the proof of Thm. 1; the proof of Thm. 5 is nearly identical to the proof of Thm. 2 (and is provided in Appendix A for completeness); and the proof of Thm. 6 is identical to the proof of Thm. 3, except that we use Thm. 5 and have some additional logarithmic factors which get absorbed into the 𝒪̃ notation.

4.2 Stochastic Gradient Descent

We now turn to discuss the stochastic gradient descent (SGD) algorithm, perhaps the main workhorse of modern machine learning methods. We consider the simplest version of SGD for minimizing F(·): We initialize at the origin w₁ = 0, and for any t = 1, …, T, define w_{t+1} = w_t − η_t·∇_w ℓ(⟨w_t, x_{i_t}⟩), where i_t ∈ [m] is chosen independently and uniformly at random (so that in expectation, the update direction equals ∇F(w_t), similar to the gradient descent update). We assume that at the end of T iterations, the algorithm returns the average of the iterates obtained so far, w̄_T := (1/T)·∑_{t=1}^T w_t.
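A minimal implementation of this variant (our own sketch, mirroring the definition above):

```python
import numpy as np

def logistic_grad(w, x):
    """Gradient of l(<w, x>) for the logistic loss l(z) = log(1 + exp(-z))."""
    return -x / (1.0 + np.exp(x @ w))

def averaged_sgd(X, T, eta=1.0, seed=0):
    """SGD from the origin with uniformly sampled points, returning the
    averaged iterate (1/T) * sum_t w_t."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, w_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)                 # uniformly random index i_t
        w = w - eta * logistic_grad(w, X[i])
        w_sum += w
    return w_sum / T
```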

To avoid yet another (and more complicated) repetition of the analysis from the previous section, we will take a somewhat different route, focusing on the logistic loss and fixed step sizes η_t = 1, which allows us to directly utilize some existing results in the literature to get a bound on the margin behavior of SGD (analogous to Thm. 2 for gradient flow and Thm. 5 for gradient descent). We note that the analysis can be easily generalized to other constant step sizes.

Theorem 7.

Under assumption 1, if ℓ is the logistic loss, and we use step sizes η_t = 1, then for any T and any α ∈ (0, 1), the SGD iterates satisfy ⟨w̄_T/‖w̄_T‖, x_i⟩ ≥ Ω̃((1 − α)γ) for at least (1 − R)·m of the indices i ∈ [m], where R is a nonnegative random variable (dependent on the randomness of SGD), whose expectation is at most 𝒪̃(1/(γ²T)^α).

Proof.

We will utilize the following easily-verified facts about the logistic loss ℓ(z) = log(1 + exp(−z)): It is non-negative, its derivative is 1/4-Lipschitz, and its inverse is ℓ⁻¹(v) = log(1/(exp(v) − 1)), which is between log(1/v) − 1 and log(1/v) for all v ∈ (0, 1].

Let w* be a max-margin separator, so that ‖w*‖ = 1 and min_{i∈[m]} ⟨w*, x_i⟩ ≥ γ, and define (similarly to the proof of Thm. 3) u := (ℓ⁻¹(ε)/γ)·w*, where ε will be chosen later. It is easily verified that ⟨u, x_i⟩ ≥ ℓ⁻¹(ε) for all i, and therefore F(u) ≤ ε. Since ℓ is non-negative and with a 1/4-Lipschitz derivative, we can use Theorem 14.13 from Shalev-Shwartz and Ben-David (2014) on the convergence of SGD for such losses to get

E[F(w̄_T)] ≤ (1/(1 − η/4))·(‖u‖²/(2ηT) + F(u)) ≤ (4/3)·(ℓ⁻¹(ε)²/(2γ²T) + ε).

In particular, picking ε = 1/(γ²T), we get

E[F(w̄_T)] ≤ (4/3)·(log²(γ²T)/(2γ²T) + 1/(γ²T)) = 𝒪̃(1/(γ²T)).

To simplify notation, define ε̄ to be the expression in the right-hand side above, so that E[F(w̄_T)] ≤ ε̄ = 𝒪̃(1/(γ²T)). We also note that F(w̄_T) equals 𝔼_i[ℓ(⟨w̄_T, x_i⟩)], where i is uniformly distributed in [m]. Thus, by Markov’s inequality, for any δ ∈ (0, 1), we have

(1/m)·#{i ∈ [m] : ℓ(⟨w̄_T, x_i⟩) ≥ ε̄/δ} ≤ δ·F(w̄_T)/ε̄.

According to Theorem 2.1 in version 2 of Ji and Telgarsky (2018b), the iterates of SGD on the logistic loss satisfy deterministically max_{t≤T} ‖w_t‖ ≤ 𝒪̃(1/γ), so by Jensen’s inequality, ‖w̄_T‖ ≤ 𝒪̃(1/γ). Combining this with the displayed equation above, we get that for every index i with ℓ(⟨w̄_T, x_i⟩) < ε̄/δ,

⟨w̄_T/‖w̄_T‖, x_i⟩ ≥ ℓ⁻¹(ε̄/δ)/𝒪̃(1/γ) ≥ Ω̃(γ)·(log(δ/ε̄) − 1).

Choosing δ = (γ²T)^{−α} for some α ∈ (0, 1), and substituting into the above, we get a margin lower bound of Ω̃(γ)·(log(1/ε̄) − α·log(γ²T) − 1). Since ε̄ = 𝒪̃(1/(γ²T)), we have log(1/ε̄) ≥ log(γ²T) − 𝒪(log log(γ²T)). Plugging into the above, we get a margin lower bound of Ω̃((1 − α)γ) (assuming γ²T is larger than a suitable universal constant; otherwise the theorem is trivial). To complete the proof, we note that if we let A_i denote the event that ℓ(⟨w̄_T, x_i⟩) ≥ ε̄/δ, and 1_{A_i} the indicator function of the event A_i, then the above implies E[(1/m)·∑_{i=1}^m 1_{A_i}] ≤ δ·E[F(w̄_T)]/ε̄ ≤ δ. Thus, letting R := (1/m)·∑_{i=1}^m 1_{A_i}, the theorem follows. ∎

Using this theorem, we get the following generalization error bound for SGD, which is a direct analogue of the error bounds we obtained for gradient flow and gradient descent:

Theorem 8.

Let 𝒟 be some distribution over the unit ball in ℝ^d, such that there exists a unit vector w* and γ > 0 satisfying Pr_{x∼𝒟}(⟨w*, x⟩ ≥ γ) = 1. Suppose we sample m points i.i.d. from 𝒟, and run SGD (with step sizes η_t = 1) on F(·), where ℓ is the logistic loss. Then for any T,

E[ Pr_{x∼𝒟}(⟨w̄_T, x⟩ ≤ 0) ] ≤ 𝒪̃(1/(γ²T) + 1/(γ²m)),

where the 𝒪̃ notation hides universal constants and factors polylogarithmic in T and m, and the expectation is over the randomness of the SGD algorithm.

Proof.

Using identical arguments as in the proof of Thm. 3 (except using Thm. 7 instead of Thm. 2, and the fact that high-probability bounds imply a bound on the expectation), we get that

E[ Pr_{x∼𝒟}(⟨w̄_T, x⟩ ≤ 0) ] ≤ 𝒪̃((1 + γ²T)^{−α} + 1/((1 − α)²γ²m)).   (4)

This holds for any α ∈ (0, 1). In particular, if T ≤ m, pick α = 1 − 1/log(1 + γ²T), in which case Eq. (4) is at most

𝒪̃(1/(γ²T) + log²(1 + γ²T)/(γ²m)) ≤ 𝒪̃(1/(γ²T)),

and if T > m, pick α = 1 − 1/log(1 + γ²m), in which case Eq. (4) is at most

𝒪̃(1/(γ²m) + log²(1 + γ²m)/(γ²m)) ≤ 𝒪̃(1/(γ²m)).

Combining the two cases, the result follows. ∎

Finally, we remark that Thm. 8 only bounds the expectation of the error probability of w̄_T. It is likely that using more sophisticated concentration tools, one can obtain a high-probability bound. However, this would require a more involved analysis, and is left for future work.

5 Tightness

In the previous sections, we provided bounds on the margin behavior of gradient flow, gradient descent and SGD, which all have the following form (ignoring constants and log factors): After T iterations, we get a predictor which achieves margin Ω(γ) on at least (1 − 1/(γ²T)^c)·m of the data points, where c is a constant arbitrarily close to one. It is natural to ask whether this result can be improved.

In this section, we show that this result is essentially tight: After T iterations, it is impossible to guarantee any positive margin on more than (1 − Ω̃(1/(γ²T)))·m of the points, when T is much smaller than m/γ². For concreteness, we prove this for gradient descent and the logistic loss, although the analysis can be extended to other gradient methods and losses:

Theorem 9.

For any positive integers m, T and any γ ∈ (0, 1), there exists a dataset of m points in ℝ² satisfying assumption 1, such that gradient descent using the logistic loss and any step size must satisfy ⟨w_T, x_i⟩ ≤ 0 for at least Ω̃(min{1, 1/(γ²T)})·m of the data points.

Since ⟨w_T, x_i⟩ ≤ 0 translates to ℓ(⟨w_T, x_i⟩) ≥ ℓ(0) = log(2), the theorem also implies that the bounds on the empirical risk shown earlier are tight up to logarithmic factors. It also implies that the fraction of misclassified points (which decays as 𝒪̃(1/(γ²T)) by our upper bounds) cannot decrease at a better rate. In addition, the lower bound implies that if we want to get any positive margin on all data points, the number of iterations must be at least Ω̃(m/γ²) in the worst case.

A lower bound related to ours appears in Ji and Telgarsky (2019b), where the authors show that if ŵ is the max-margin unit-norm predictor, then ‖w_t/‖w_t‖ − ŵ‖ = Ω(1/log(t)). This translates to a requirement of exponentially many iterations to get a non-trivial guarantee on the direction of w_t. However, this is a somewhat different objective than ours, and moreover, their lower bound does not specify the dependence on the margin parameter γ.

Proof of Thm. 9.

We will assume without loss of generality that γ²T ≥ 1 (or equivalently, γ ≥ 1/√T), in which case the lower bound we need to prove is Ω̃(m/(γ²T)). Otherwise, if γ < 1/√T, we can simply apply the construction below with the larger margin parameter γ' := 1/√T, which by definition satisfies γ'²T = 1, and uses a dataset separable with margin γ' ≥ γ, so assumption 1 still holds with margin parameter γ.

We will also assume without loss of generality that m is larger than some universal constant (otherwise the theorem trivially holds), and fix the fraction p := 1/(γ²T) (which is positive and at most 1).

Figure 1: Illustration of the proof of Thm. 9, for a minority fraction of 10%. The dataset consists of 90% of the points at one vector (thick black arrow) and 10% of the points at another (thin black arrow). The gradient descent trajectory (starting from the origin) is the dotted red line, and the shaded blue region contains the vectors which achieve a positive margin on more than 90% of the data points (or equivalently for this construction, a positive margin on all data points). Initially, the gradient descent trajectory is mostly influenced by the majority points, and only once a sufficiently large margin is achieved on them does the influence of the few minority points begin to manifest, and the trajectory curves towards the blue region. Best viewed in color.
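The qualitative dynamics in Figure 1 are easy to reproduce in simulation (our own sketch; the concrete coordinates below are our guess at a construction in this spirit, not necessarily the paper’s exact one):

```python
import numpy as np

gamma, p, m = 0.05, 0.1, 100            # margin, minority fraction, dataset size (illustrative)
b = np.sqrt(1 - gamma**2)
x_maj = np.array([gamma,  b])           # majority direction (guessed coordinates)
x_min = np.array([gamma, -b])           # minority direction (guessed coordinates)
X = np.vstack([np.tile(x_maj, (int((1 - p) * m), 1)),
               np.tile(x_min, (int(p * m), 1))])   # separable with margin gamma via w* = (1, 0)

def grad_F(w):                          # gradient of the logistic empirical risk
    s = 1.0 / (1.0 + np.exp(X @ w))
    return -(X * s[:, None]).mean(axis=0)

w = np.zeros(2)
for t in range(1, 200_001):
    w -= 1.0 * grad_F(w)
    if t in (10, 1_000, 50_000, 200_000):
        print(f"T={t:>6}  margin on minority points = {w @ x_min:+.4f}")
# The margin on the minority points stays negative for a long time,
# and only slowly turns positive as the majority losses saturate.
```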

Consider a dataset consisting of a majority group of (1 − p)m points located at a common vector x⁽¹⁾, and a minority group of pm points located at a common vector x⁽²⁾ (see Figure 1 for a sketch of the construction and proof idea). It is easily verified that all these points are contained in the unit ball, and that there is a unit vector w* satisfying ⟨w*, x⟩ ≥ γ for any x in the dataset. Thus, assumption 1 holds. Moreover, the empirical risk function equals