Train faster, generalize better: Stability of stochastic gradient descent

09/03/2015 ∙ by Moritz Hardt, et al. ∙ Google ∙ UC Berkeley

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.


1 Introduction

The most widely used optimization method in machine learning practice is the stochastic gradient method (SGM). Stochastic gradient methods aim to minimize the empirical risk of a model by repeatedly computing the gradient of a loss function on a single training example, or a batch of a few examples, and updating the model parameters accordingly. SGM is scalable, robust, and performs well across many different domains, ranging from smooth and strongly convex problems to complex non-convex objectives.
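To make the update concrete, here is a minimal sketch of such a loop; the least-squares loss, the synthetic data, and the constant step size are placeholder choices for illustration only.

```python
import numpy as np

def sgm(grad, data, w0, steps, step_size):
    """Minimal stochastic gradient method: grad(w, z) is the gradient of the
    loss on one example z; step_size(t) is the learning rate at step t."""
    w = w0.copy()
    n = len(data)
    for t in range(1, steps + 1):
        i = np.random.randint(n)              # pick one example uniformly at random
        w -= step_size(t) * grad(w, data[i])  # gradient step on that example
    return w

# Illustration on least squares: f(w; (x, y)) = 0.5 * (x @ w - y)**2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w_hat = sgm(lambda w, z: (z[0] @ w - z[1]) * z[0],
            list(zip(X, y)), np.zeros(5), steps=500, step_size=lambda t: 0.01)
```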

In a nutshell, our results establish that:

Any model trained with the stochastic gradient method in a reasonable amount of time attains small generalization error.

As training time is inevitably limited in practice, our results help to explain the strong generalization performance of stochastic gradient methods observed empirically. More concretely, we bound the generalization error of a model in terms of the number of iterations the stochastic gradient method took to train it. Our main analysis tool is the notion of algorithmic stability due to Bousquet and Elisseeff [4]. We demonstrate that the stochastic gradient method is stable provided that the objective is relatively smooth and the number of steps taken is sufficiently small.

It is common in practice to perform a number of steps that is linear in the size of the sample and to access each data point multiple times. Our results show, in a broad range of settings, that provided the number of iterations is linear in the number of data points, the generalization error is bounded by a vanishing function of the sample size. The results hold true even for complex models with a large number of parameters and no explicit regularization term in the objective. In other words, fast training time by itself is sufficient to prevent overfitting.

Our bounds are algorithm specific: Since the number of iterations we allow can be larger than the sample size, an arbitrary algorithm could easily achieve small training error by memorizing all training data with no generalization ability whatsoever. In contrast, if the stochastic gradient method manages to fit the training data in a reasonable number of iterations, it is guaranteed to generalize.

Conceptually, we show that minimizing training time is not only beneficial for obvious computational advantages, but also has the important byproduct of decreasing generalization error. Consequently, it may make sense for practitioners to focus on minimizing training time, for instance, by designing model architectures for which stochastic gradient method converges fastest to a desired error level.

1.1 Our contributions

Our focus is on generalization bounds for models learned with the stochastic gradient method. Recall that the generalization error is the expected difference between the error a model incurs on a training set and the error it incurs on a new data point sampled from the same distribution that generated the training data. Throughout, we assume we are training models using $n$ sampled data points.

Our results build on a fundamental connection between the generalization error of an algorithm and its stability properties. Roughly speaking, an algorithm is stable if the training error it achieves varies only slightly if we change any single training data point. The precise notion of stability we use is known as uniform stability due to [4]. It states that a randomized algorithm $A$ is uniformly stable if, for all data sets differing in only one element, the learned models produce nearly the same predictions. We review this notion in Section 2, and provide a new adaptation of this theory to iterative algorithms.

In Section 3, we show that the stochastic gradient method is uniformly stable, and our techniques mimic its convergence proofs. For convex loss functions, we prove that the stability measure decreases as a function of the sum of the step sizes. For strongly convex loss functions, we show that the stochastic gradient method is stable even if we train for an arbitrarily long time. We can combine our bounds on the generalization error of the stochastic gradient method with optimization bounds quantifying the convergence of the empirical loss achieved by SGM. In Section 5, we show that models trained for multiple epochs match classic bounds for stochastic gradient [28, 29].

More surprisingly, our results carry over to the case where the loss function is non-convex. In this case we show that the method generalizes provided the steps are sufficiently small and the number of iterations is not too large. More specifically, we show that the number of steps of stochastic gradient can grow slightly faster than linearly in the sample size while the generalization error still vanishes. This provides some explanation as to why neural networks can be trained for multiple epochs of stochastic gradient and still exhibit excellent generalization. In Section 4, we furthermore show that various heuristics used in practice, especially in the deep learning community, help to increase the stability of the stochastic gradient method. For example, the popular dropout scheme [19, 40] improves all of our bounds. Similarly, $\ell_2$-regularization improves the exponent on the number of iterations in our non-convex result; in fact, with sufficient regularization we can make the dependence on the number of iterations arbitrarily weak while preserving the non-convexity of the problem.

1.2 Related work

There is a venerable line of work on stability and generalization dating back more than thirty years [8, 18, 4, 26, 39]. The landmark work by Bousquet and Elisseeff [4] introduced the notion of uniform stability that we rely on. They showed that several important classification techniques are uniformly stable. In particular, under certain regularity assumptions, it was shown that the optimizer of a regularized empirical loss minimization problem is uniformly stable. Previous work generally applies only to the exact minimizer of specific optimization problems. It is not immediately evident how to compute a generalization bound for an approximate minimizer such as one found by stochastic gradient. Subsequent work studied stability bounds for randomized algorithms but focused on random perturbations of the cost function, such as those induced by bootstrapping or bagging [9]. This manuscript differs from this foundational work in that it derives stability bounds for the learning procedure itself, analyzing which algorithmic properties induce stability.

Stochastic gradient descent, of course, is closely related to our inquiry. Classic results by Nemirovski and Yudin show that the stochastic gradient method is nearly optimal for empirical risk minimization of convex loss functions [28, 29, 27, 11]. These results have been extended by many machine learning researchers, yielding tighter bounds and probabilistic guarantees [13, 14, 35]. However, there is an important limitation of all of this prior art: the derived generalization bounds only hold for a single pass over the data. That is, in order for the bounds to be valid, each training example must be used no more than once in a stochastic gradient update. In practice, of course, one tends to run multiple epochs of the stochastic gradient method. Our results resolve this issue by combining stability with optimization error. We use the foundational results to estimate the error on the empirical risk and then use stability to derive a deviation from the true risk. This enables us to study the risk incurred by multiple epochs and to provide simple analyses of regularization methods for convex stochastic gradient. We compare our results to this related work in Section 5. We note that Rosasco and Villa obtain risk bounds for least squares minimization with an incremental gradient method in terms of the number of epochs [37]. These bounds are akin to our study in Section 5, although the results are incomparable due to differing assumptions.

Finally, we note that in the non-convex case, the stochastic gradient method is remarkably successful for training large neural networks [2, 19]. However, our theoretical understanding of this method is limited. Several authors have shown that the stochastic gradient method finds a stationary point of non-convex cost functions [21, 12]. Beyond asymptotic convergence to stationary points, little is known about finding models with low training or generalization error in the non-convex case. There have recently been several important studies investigating optimal training of neural nets. For example, Livni et al. show that networks with polynomial activations can be learned in a greedy fashion [24]. Janzamin et al. [16] show that two-layer neural networks can be learned using tensor methods. Arora et al. [1] show that two-layer sparse coding dictionaries can be learned via stochastic gradient. Our work complements these developments: rather than providing new insights into mechanisms that yield low training error, we provide insights into mechanisms that yield low generalization error. If one can achieve low training error quickly on a non-convex problem with stochastic gradient, our results guarantee that the resulting model generalizes well.

2 Stability of randomized iterative algorithms

Consider the following general setting of supervised learning. There is an unknown distribution $\mathcal{D}$ over examples from some space $Z$. We receive a sample $S = (z_1, \dots, z_n)$ of $n$ examples drawn i.i.d. from $\mathcal{D}$. Our goal is to find a model $w$ with small population risk, defined as:

$R[w] := \mathbb{E}_{z \sim \mathcal{D}}\, f(w; z).$

Here, $f$ is a loss function and $f(w; z)$ designates the loss of the model described by $w$ on the example $z$.

Since we cannot measure the objective $R[w]$ directly, we instead use a sample-averaged proxy, the empirical risk, defined as

$R_S[w] := \frac{1}{n} \sum_{i=1}^{n} f(w; z_i).$

The generalization error of a model $w$ is the difference

$\epsilon_{\mathrm{gen}} := R_S[w] - R[w].$

(2.1)

When $w = A(S)$ is chosen as a function of the data by a potentially randomized algorithm $A$, it makes sense to consider the expected generalization error

$\epsilon_{\mathrm{gen}} := \mathbb{E}_{S, A}\big[ R_S[A(S)] - R[A(S)] \big],$

(2.2)

where the expectation is over the internal randomness of $A$ and over the sample $S$.
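As a rough illustration (not part of the original development), the expected generalization error in (2.2) can be estimated by Monte Carlo: train on a fresh sample, compare its empirical risk with the risk on a large held-out sample standing in for the population, and average over trials. The linear model and Gaussian data below are assumptions made purely for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w, X, y):
    # Squared loss f(w; z) = 0.5 * (x @ w - y)**2, averaged over a sample.
    return 0.5 * np.mean((X @ w - y) ** 2)

def generalization_gap(train_algorithm, n=50, trials=200, dim=5):
    # Monte Carlo estimate of E[R_S[A(S)] - R[A(S)]] from (2.2); a large
    # fresh sample stands in for the population risk R.
    gaps = []
    for _ in range(trials):
        w_star = rng.normal(size=dim)
        X = rng.normal(size=(n, dim))
        y = X @ w_star + rng.normal(size=n)
        X_test = rng.normal(size=(5000, dim))
        y_test = X_test @ w_star + rng.normal(size=5000)
        w = train_algorithm(X, y)
        gaps.append(loss(w, X, y) - loss(w, X_test, y_test))
    return float(np.mean(gaps))

# Example: ordinary least squares as the "algorithm" A.
print(generalization_gap(lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]))
```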

In order to bound the generalization error of an algorithm, we employ the following notion of uniform stability in which we allow randomized algorithms as well.

Definition 2.1.

A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all data sets $S, S' \in Z^n$ such that $S$ and $S'$ differ in at most one example, we have

$\sup_{z} \; \mathbb{E}_A\big[ f(A(S); z) - f(A(S'); z) \big] \le \epsilon.$

(2.3)

Here, the expectation is taken only over the internal randomness of $A$. We will denote by $\epsilon_{\mathrm{stab}}(A, n)$ the infimum over all $\epsilon$ for which (2.3) holds. We will omit the tuple $(A, n)$ when it is clear from the context.
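Uniform stability is a worst-case, analytical quantity, but the object it controls can be probed numerically: run SGM on two data sets differing in one example while sharing the random seed and starting point, and track how far the two parameter vectors drift apart. The sketch below (with an assumed gradient oracle `grad`) measures exactly the divergence that the proofs in Section 3 bound; multiplying the result by a Lipschitz constant of the loss converts it into a bound on the loss difference in (2.3).

```python
import numpy as np

def coupled_sgm_divergence(grad, S, S_prime, w0, steps, alpha, seed=0):
    # Run SGM on two data sets differing in one example, using the *same*
    # index sequence and starting point, and record delta_t = ||w_t - w'_t||,
    # the divergence that the stability analysis of Section 3 bounds.
    rng = np.random.default_rng(seed)
    w, w_prime = w0.astype(float).copy(), w0.astype(float).copy()
    n, deltas = len(S), []
    for t in range(1, steps + 1):
        i = int(rng.integers(n))               # identical selection in both runs
        w = w - alpha(t) * grad(w, S[i])
        w_prime = w_prime - alpha(t) * grad(w_prime, S_prime[i])
        deltas.append(float(np.linalg.norm(w - w_prime)))
    return deltas
```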

We recall the important theorem that uniform stability implies generalization in expectation. Since our notion of stability differs slightly from existing ones with respect to the randomness of the algorithm, we include a proof for the sake of completeness. The proof is based on an argument in Lemma 7 of [4] and very similar to Lemma 11 in [39].

Theorem 2.2.

[Generalization in expectation] Let $A$ be $\epsilon$-uniformly stable. Then,

$\big| \mathbb{E}_{S, A}\big[ R_S[A(S)] - R[A(S)] \big] \big| \le \epsilon.$

Proof.

Denote by $S = (z_1, \dots, z_n)$ and $S' = (z'_1, \dots, z'_n)$ two independent random samples and let $S^{(i)} = (z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_n)$ be the sample that is identical to $S$ except in the $i$'th example where we replace $z_i$ with $z'_i$. With this notation, we get that

$\mathbb{E}_S \mathbb{E}_A \big[ R_S[A(S)] \big] = \mathbb{E}_S \mathbb{E}_A \Big[ \tfrac{1}{n} \sum_{i=1}^{n} f(A(S); z_i) \Big] = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \tfrac{1}{n} \sum_{i=1}^{n} f(A(S^{(i)}); z'_i) \Big] = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \tfrac{1}{n} \sum_{i=1}^{n} f(A(S); z'_i) \Big] + \delta = \mathbb{E}_S \mathbb{E}_A \big[ R[A(S)] \big] + \delta,$

where we can express $\delta$ as

$\delta = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_A \Big[ \tfrac{1}{n} \sum_{i=1}^{n} \big( f(A(S^{(i)}); z'_i) - f(A(S); z'_i) \big) \Big].$

Furthermore, taking the supremum over any two data sets differing in only one example, we can bound the difference as

$|\delta| \le \sup_{S, S^{(i)}, z}\; \big| \mathbb{E}_A\big[ f(A(S^{(i)}); z) - f(A(S); z) \big] \big| \le \epsilon,$

by our assumption on the uniform stability of $A$. The claim follows. ∎

Theorem 2.2 proves that if an algorithm is uniformly stable, then its generalization error is small. We now turn to some properties of iterative algorithms that control their uniform stability.

2.1 Properties of update rules

We consider general update rules of the form $G : \Omega \to \Omega$ which map a point $w$ in the parameter space $\Omega$ to another point $G(w)$. The most common update is the gradient update rule

$G(w) = w - \alpha \nabla f(w),$

where $\alpha \ge 0$ is a step size and $f$ is the function we want to optimize.

The canonical update rule we will consider in this manuscript is an incremental gradient update, where $G(w) = w - \alpha \nabla f(w; z_i)$ for some convex loss function $f$ evaluated on a single example $z_i$. We will return to a detailed discussion of this specific update in the sequel, but the reader should keep this particular example in mind throughout the remainder of this section.

The following two definitions provide the foundation of our analysis of how two different sequences of update rules diverge when iterated from the same starting point. These definitions will ultimately be useful when analyzing the stability of stochastic gradient descent.

Definition 2.3.

An update rule $G$ is $\eta$-expansive if

$\sup_{v, w \in \Omega} \frac{\| G(v) - G(w) \|}{\| v - w \|} \le \eta.$

(2.4)

Definition 2.4.

An update rule $G$ is $\sigma$-bounded if

$\sup_{w \in \Omega} \| w - G(w) \| \le \sigma.$

(2.5)
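Both properties are easy to probe numerically for a given update rule. The sketch below estimates the expansiveness of a gradient update on a toy quadratic by sampling random pairs of points; the quadratic and the step size are illustrative choices.

```python
import numpy as np

def gradient_update(grad, alpha):
    # Gradient update rule G(w) = w - alpha * grad(w).
    return lambda w: w - alpha * grad(w)

def estimate_expansiveness(G, dim=3, samples=2000, seed=0):
    # Empirical estimate of eta = sup ||G(v) - G(w)|| / ||v - w|| over random pairs.
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(samples):
        v, w = rng.normal(size=dim), rng.normal(size=dim)
        best = max(best, np.linalg.norm(G(v) - G(w)) / np.linalg.norm(v - w))
    return best

# The quadratic f(w) = 0.5 * ||w||^2 has gradient w and is 1-smooth; with
# alpha = 0.5 the update is G(w) = 0.5 * w, so the estimate comes out near 0.5.
print(estimate_expansiveness(gradient_update(lambda w: w, alpha=0.5)))
```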

With these two properties in hand, we can establish the following lemma describing how a sequence of updates to a model diverges when the training set is perturbed.

Lemma 2.5 (Growth recursion).

Fix an arbitrary sequence of updates $G_1, \dots, G_T$ and another sequence $G'_1, \dots, G'_T$. Let $w_0 = w'_0$ be a starting point in $\Omega$ and define $\delta_t = \| w_t - w'_t \|$, where $w_t, w'_t$ are defined recursively through

$w_{t+1} = G_t(w_t), \qquad w'_{t+1} = G'_t(w'_t), \qquad t \ge 0.$

Then, we have the recurrence relation

$\delta_0 = 0;$
$\delta_{t+1} \le \eta\, \delta_t$ if $G_t = G'_t$ is $\eta$-expansive;
$\delta_{t+1} \le \min(\eta, 1)\, \delta_t + 2\sigma_t$ if $G_t$ and $G'_t$ are $\sigma_t$-bounded and $G_t$ is $\eta$-expansive.
Proof.

The first bound on $\delta_{t+1}$ follows directly from the assumption that $G_t = G'_t$ and the definition of expansiveness. For the second bound, recall from Definition 2.4 that if $G_t$ and $G'_t$ are $\sigma_t$-bounded, then by the triangle inequality,

$\delta_{t+1} = \| G_t(w_t) - G'_t(w'_t) \| \le \| w_t - w'_t \| + \| G_t(w_t) - w_t \| + \| w'_t - G'_t(w'_t) \| \le \delta_t + 2\sigma_t,$

which gives half of the second bound. We can alternatively bound $\delta_{t+1}$ as

$\delta_{t+1} \le \| G_t(w_t) - G_t(w'_t) \| + \| G_t(w'_t) - G'_t(w'_t) \| \le \eta\, \delta_t + \| G_t(w'_t) - w'_t \| + \| w'_t - G'_t(w'_t) \| \le \eta\, \delta_t + 2\sigma_t. \;\;∎$

3 Stability of Stochastic Gradient Method

Given $n$ labeled examples $S = (z_1, \dots, z_n)$ where $z_i \in Z$, consider a decomposable objective function

$f(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i),$

where $f(w; z_i)$ denotes the loss of $w$ on the example $z_i$. The stochastic gradient update for this problem with learning rate $\alpha_t > 0$ is given by

$w_{t+1} = w_t - \alpha_t \nabla_w f(w_t; z_{i_t}).$

The stochastic gradient method (SGM) is the algorithm resulting from performing stochastic gradient updates $T$ times where the indices $i_t$ are randomly chosen. There are two popular schemes for choosing the examples' indices. One is to pick $i_t$ uniformly at random in $\{1, \dots, n\}$ at each step. The other is to choose a random permutation over $\{1, \dots, n\}$ and cycle through the examples repeatedly in the order determined by the permutation. Our results hold for both variants.
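For concreteness, the two index-selection schemes can be written as a small generator; this is only an illustrative sketch, not code from the paper.

```python
import numpy as np

def index_stream(n, steps, rule="uniform", seed=0):
    # Yield the example indices i_1, ..., i_T used by SGM under the two
    # selection schemes described above.
    rng = np.random.default_rng(seed)
    if rule == "uniform":
        for _ in range(steps):
            yield int(rng.integers(n))          # fresh uniform draw each step
    elif rule == "permutation":
        perm = rng.permutation(n)               # one fixed random permutation
        for t in range(steps):
            yield int(perm[t % n])              # cycle through it epoch after epoch
    else:
        raise ValueError("unknown selection rule")
```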

In parallel with the previous section the stochastic gradient method is akin to applying the gradient update rule defined as follows.

Definition 3.1.

For a nonnegative step size $\alpha \ge 0$ and a function $f : \Omega \to \mathbb{R}$, we define the gradient update rule $G_{f, \alpha}$ as

$G_{f, \alpha}(w) = w - \alpha \nabla f(w).$

3.1 Proof idea: Stability of stochastic gradient method

In order to prove that the stochastic gradient method is stable, we will analyze the output of the algorithm on two data sets that differ in precisely one location. Note that if the loss function is $L$-Lipschitz for every example $z$, we have $|f(w; z) - f(w'; z)| \le L \| w - w' \|$ for all $w$ and $w'$. Hence, it suffices to analyze how $w_t$ and $w'_t$, the iterates on the two data sets, diverge in the domain as a function of time $t$. Recalling that $w_{t+1}$ is obtained from $w_t$ via a gradient update, our goal is to bound $\delta_t = \| w_t - w'_t \|$ recursively and in expectation as a function of $\delta_{t-1}$.

There are two cases to consider. In the first case, SGM selects at step $t$ the index of an example on which $S$ and $S'$ agree. Unfortunately, it could still be the case that $\delta_t$ grows, since $w_t$ and $w'_t$ differ and so the gradients at these two points may still differ. Below, we will show how to control this growth in terms of the convexity and smoothness properties of the stochastic gradients.

The second case to consider is when SGM selects the one example on which $S$ and $S'$ differ. Note that this happens only with probability $1/n$ if examples are selected uniformly at random. In this case, we simply bound the increase in $\delta_t$ by the norms of the two gradients $\nabla f(w_t; z_i)$ and $\nabla f(w'_t; z'_i)$. The sum of the norms is bounded by $2L$, and we obtain $\delta_{t+1} \le \delta_t + 2\alpha_t L$. Combining the two cases, we can then solve a simple recurrence relation to obtain a bound on $\delta_T$.

This simple approach suffices to obtain the desired result in the convex case, but there are additional difficulties in the non-convex case. Here, we need to use an intriguing stability property of the stochastic gradient method. Specifically, the first time step at which SGM even encounters the example on which $S$ and $S'$ differ is a random variable that tends to be relatively large. Specifically, for any $t_0 \in \{1, \dots, n\}$, the probability that this happens within the first $t_0$ steps is upper bounded by $t_0/n$. This allows us to argue that SGM has a long “burn-in period” during which $\delta_t$ does not grow at all. Once $\delta_t$ begins to grow, the step size has already decayed, allowing us to obtain a non-trivial bound.

We now turn to making this argument precise.

3.2 Expansion properties of stochastic gradients

Let us now record some of the core properties of the stochastic gradient update. The gradient update rule is bounded provided that the function $f$ satisfies the following common Lipschitz condition.

Definition 3.2.

We say that $f$ is $L$-Lipschitz if for all points $u, v$ in the domain of $f$ we have $|f(u) - f(v)| \le L \| u - v \|$. This implies that

$\| \nabla f(w) \| \le L.$

(3.1)
Lemma 3.3.

Assume that $f$ is $L$-Lipschitz. Then, the gradient update $G_{f,\alpha}$ is $(\alpha L)$-bounded.

Proof.

By our Lipschitz assumption, $\| w - G_{f,\alpha}(w) \| = \alpha \| \nabla f(w) \| \le \alpha L$. ∎

We now turn to expansiveness. As we will see shortly, different expansion properties are achieved for non-convex, convex, and strongly convex functions.

Definition 3.4.

A function $f : \Omega \to \mathbb{R}$ is convex if for all $u, v \in \Omega$ we have

$f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle.$

Definition 3.5.

A function $f : \Omega \to \mathbb{R}$ is $\gamma$-strongly convex if for all $u, v \in \Omega$ we have

$f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle + \frac{\gamma}{2} \| u - v \|^2.$

The following standard notion of smoothness leads to a bound on how expansive the gradient update is.

Definition 3.6.

A function $f : \Omega \to \mathbb{R}$ is $\beta$-smooth if for all $u, v \in \Omega$ we have

$\| \nabla f(u) - \nabla f(v) \| \le \beta \| u - v \|.$

(3.2)

In general, smoothness will imply that the gradient updates cannot be overly expansive. When the function is also convex and the step size is sufficiently small, the gradient update becomes non-expansive. When the function is additionally strongly convex, the gradient update becomes contractive in the sense that the expansion factor $\eta$ will be less than one, and $w$ and $w'$ will actually shrink closer to one another. The majority of the following results can be found in several textbooks and monographs. Notable references are Polyak [34] and Nesterov [30]. We include proofs in the appendix for completeness.

Lemma 3.7.

Assume that $f$ is $\beta$-smooth. Then, the following properties hold.

1. $G_{f,\alpha}$ is $(1 + \alpha\beta)$-expansive.

2. Assume in addition that $f$ is convex. Then, for any $\alpha \le 2/\beta$, the gradient update $G_{f,\alpha}$ is $1$-expansive.

3. Assume in addition that $f$ is $\gamma$-strongly convex. Then, for $\alpha \le \frac{2}{\beta + \gamma}$, $G_{f,\alpha}$ is $\big(1 - \frac{\alpha \beta \gamma}{\beta + \gamma}\big)$-expansive.

Henceforth we will no longer mention which random selection rule we use as the proofs are almost identical for both rules.

3.3 Convex optimization

We begin with a simple stability bound for convex loss minimization via stochastic gradient method.

Theorem 3.8.

Assume that the loss function $f(\cdot\,; z)$ is $\beta$-smooth, convex and $L$-Lipschitz for every $z$. Suppose that we run SGM with step sizes $\alpha_t \le 2/\beta$ for $T$ steps. Then, SGM satisfies uniform stability with

$\epsilon_{\mathrm{stab}} \le \frac{2 L^2}{n} \sum_{t=1}^{T} \alpha_t.$

Proof.

Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$ induced by running SGM on samples $S$ and $S'$, respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.

We now fix an example $z \in Z$ and apply the Lipschitz condition on $f(\cdot\,; z)$ to get

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \le L\, \mathbb{E}[\delta_T],$

(3.3)

where $\delta_T = \| w_T - w'_T \|$. Observe that at step $t$, with probability $1 - 1/n$, the example selected by SGM is the same in both $S$ and $S'$. In this case we have that $G_t = G'_t$ and we can use the $1$-expansivity of the update rule $G_t$, which follows from Lemma 3.7 using the fact that the objective function is convex and that $\alpha_t \le 2/\beta$. With probability $1/n$ the selected example is different, in which case we use that both $G_t$ and $G'_t$ are $(\alpha_t L)$-bounded as a consequence of Lemma 3.3. Hence, we can apply Lemma 2.5 and linearity of expectation to conclude that for every $t$,

$\mathbb{E}[\delta_{t+1}] \le \Big(1 - \frac{1}{n}\Big)\, \mathbb{E}[\delta_t] + \frac{1}{n}\, \mathbb{E}[\delta_t] + \frac{2 \alpha_t L}{n} = \mathbb{E}[\delta_t] + \frac{2 \alpha_t L}{n}.$

(3.4)

Unraveling the recursion gives

$\mathbb{E}[\delta_T] \le \frac{2 L}{n} \sum_{t=1}^{T} \alpha_t.$

Plugging this back into equation (3.3), we obtain

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \le \frac{2 L^2}{n} \sum_{t=1}^{T} \alpha_t.$

Since this bound holds for all $S$, $S'$ and $z$, we obtain the desired bound on the uniform stability. ∎
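For a sense of scale, the bound of Theorem 3.8 is trivial to evaluate; the sample size, number of epochs, Lipschitz constant, and step size below are made-up values used only for illustration.

```python
def convex_stability_bound(L, n, step_sizes):
    # Theorem 3.8: epsilon_stab <= (2 * L**2 / n) * sum of the step sizes.
    return 2.0 * L ** 2 / n * sum(step_sizes)

# Hypothetical numbers: n = 10,000 examples, 10 epochs, constant step 1/sqrt(T).
n, epochs = 10_000, 10
T = n * epochs
print(convex_stability_bound(L=1.0, n=n, step_sizes=[T ** -0.5] * T))  # about 0.063
```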

3.4 Strongly Convex Optimization

In the strongly convex case we can bound stability with no dependence on the number of steps at all. Assume that the function $f(\cdot\,; z)$ is $\gamma$-strongly convex with respect to $w$ for all $z$. Let $\Omega$ be a compact, convex set over which we wish to optimize. Assume further that we can readily compute the Euclidean projection onto the set $\Omega$, namely, $\Pi_\Omega(v) = \arg\min_{w \in \Omega} \| w - v \|$. In this section we restrict our attention to the projected stochastic gradient method

$w_{t+1} = \Pi_\Omega\big( w_t - \alpha_t \nabla_w f(w_t; z_{i_t}) \big).$

(3.5)

A common application of the above iteration in machine learning is solving Tikhonov regularization problems. Specifically, the empirical risk is augmented with an additional $\ell_2$ regularization term,

$\frac{1}{n} \sum_{i=1}^{n} f(w; z_i) + \frac{\mu}{2} \| w \|^2,$

(3.6)

where $f$ is as before a pre-specified loss function. We can assume without loss of generality that the loss is nonnegative and that $f(0; z_i) \le B$ for all $i$. Then, the optimal solution of (3.6) must lie in a ball of radius $r = \sqrt{2B/\mu}$ about the origin. This fact can be ascertained by plugging in $w = 0$ and noting that the minimizer $w_\star$ of (3.6) must have a smaller cost, thus $\frac{\mu}{2}\| w_\star \|^2 \le B$. We can now define the set $\Omega$ to be the ball of radius $r$, in which case the projection is a simple scaling operation. Throughout the rest of the section we replace $f$ with its regularized form, namely,

$f(w; z) + \frac{\mu}{2} \| w \|^2,$

which is strongly convex with parameter $\gamma = \mu$. Similarly, we will overload the constant $L$ by setting

$L = \sup_{w \in \Omega,\, z} \big\| \nabla f(w; z) + \mu w \big\|.$

(3.7)

Note that if $f(\cdot\,; z)$ is $\beta$-smooth for all $z$, then this supremum is always finite, since the gradient is continuous and $\Omega$ is compact. We need to restrict the supremum to $\Omega$ because strongly convex functions have unbounded gradients on all of $\mathbb{R}^d$.
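Below is a minimal sketch of one step of the projected iteration (3.5) on the regularized objective, with $\Omega$ taken to be a Euclidean ball so that the projection is the scaling operation mentioned above; the gradient oracle, radius, and step size are placeholders.

```python
import numpy as np

def project_onto_ball(w, radius):
    # Euclidean projection onto the ball of radius r: a simple rescaling.
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_sgm_step(w, grad_loss, z, alpha, mu, radius):
    # One step of the projected iteration (3.5) on the Tikhonov objective
    # f(w; z) + (mu / 2) * ||w||^2.
    g = grad_loss(w, z) + mu * w              # gradient of the regularized loss
    return project_onto_ball(w - alpha * g, radius)
```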

Theorem 3.9.

Assume that the loss function $f(\cdot\,; z)$ is $\gamma$-strongly convex and $\beta$-smooth for all $z$, with gradients bounded by $L$ as in (3.7). Suppose we run the projected SGM iteration (3.5) with constant step size $\alpha \le 1/\beta$ for $T$ steps. Then, SGM satisfies uniform stability with

$\epsilon_{\mathrm{stab}} \le \frac{2 L^2}{\gamma n}.$

Proof.

The proof is analogous to that of Theorem 3.8 with a slightly different recurrence relation. We repeat the argument for completeness. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$ induced by running projected SGM on samples $S$ and $S'$, respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.

Denoting $\delta_t = \| w_t - w'_t \|$ and appealing to the boundedness of the gradient of $f$, we have

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \le L\, \mathbb{E}[\delta_T].$

(3.8)

Observe that at step $t$, with probability $1 - 1/n$, the example selected by SGM is the same in both $S$ and $S'$. In this case we have that $G_t = G'_t$. At this stage, note that the additional projection step can only decrease $\delta_{t+1}$,

because Euclidean projection does not increase the distance between projected points (see Lemma 4.6 below for a generalization of this fact). We can now apply the following useful strengthening of Lemma 3.7 for this setting: writing $f = g + \frac{\gamma}{2}\|\cdot\|^2$ with $g$ convex and $(\beta - \gamma)$-smooth, we have $G_t(w) = (1 - \alpha\gamma)\big(w - \frac{\alpha}{1 - \alpha\gamma}\nabla g(w)\big)$, and since $\alpha \le 1/\beta$ implies $\frac{\alpha}{1 - \alpha\gamma} \le \frac{2}{\beta - \gamma}$, the inner map is $1$-expansive and $G_t$ is $(1 - \alpha\gamma)$-expansive. With probability $1/n$ the selected example is different, in which case we use that both $G_t$ and $G'_t$ are $(\alpha L)$-bounded as a consequence of Lemma 3.3. Hence, we can apply Lemma 2.5 and linearity of expectation to conclude that for every $t$,

$\mathbb{E}[\delta_{t+1}] \le (1 - \alpha\gamma)\, \mathbb{E}[\delta_t] + \frac{2 \alpha L}{n}.$

(3.9)

Unraveling the recursion gives

$\mathbb{E}[\delta_T] \le \frac{2 \alpha L}{n} \sum_{t=0}^{\infty} (1 - \alpha\gamma)^t = \frac{2 L}{\gamma n}.$

Plugging the above inequality into equation (3.8), we obtain

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \le \frac{2 L^2}{\gamma n}.$

Since this bound holds for all $S$, $S'$ and $z$, the theorem follows. ∎

We would like to note that a nearly identical result holds for a “staircase” decaying step-size that is also popular in machine learning and stochastic optimization.

Theorem 3.10.

Assume that the loss function $f(\cdot\,; z)$ is $\gamma$-strongly convex, has gradients bounded by $L$ as in (3.7), and is $\beta$-smooth for all $z$. Suppose we run SGM with decaying step sizes $\alpha_t$. Then, SGM has uniform stability of

where .

Proof.

Note that once the step size is sufficiently small relative to $1/\beta$, the iterates are contractive, so from that point on the expected distance between the two runs contracts at every step on which the selected example agrees.

Expanding this recursion, we find the stated bound.

Now, the result follows together with Lemma 3.11. ∎

3.5 Non-convex optimization

In this section we prove stability results for stochastic gradient methods that do not require convexity. We will still assume that the objective function is smooth and Lipschitz as defined previously.

The crux of the proof is to observe that SGM typically makes several steps before it even encounters the one example on which two data sets in the stability analysis differ.

Lemma 3.11.

Assume that the loss function is nonnegative and $L$-Lipschitz for all $z$. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Denote by $w_T$ and $w'_T$ the outputs of $T$ steps of SGM on $S$ and $S'$, respectively. Then, for every $z \in Z$ and every $t_0 \in \{1, \dots, n\}$, under both the random update rule and the random permutation rule, we have

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \;\le\; \frac{t_0}{n}\, \sup_{w, z} f(w; z) \;+\; L\, \mathbb{E}\big[ \delta_T \,\big|\, \delta_{t_0} = 0 \big].$

Proof.

Let $S$ and $S'$ be two samples of size $n$ differing in only a single example, and let $z$ be an arbitrary example. Consider running SGM on samples $S$ and $S'$, respectively. As stated, $w_T$ and $w'_T$ denote the corresponding outputs of SGM. Let $\mathcal{E}$ denote the event that $\delta_{t_0} = 0$. We have,

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | = \Pr[\mathcal{E}]\; \mathbb{E}\big[ | f(w_T; z) - f(w'_T; z) | \,\big|\, \mathcal{E} \big] + \Pr[\mathcal{E}^c]\; \mathbb{E}\big[ | f(w_T; z) - f(w'_T; z) | \,\big|\, \mathcal{E}^c \big] \le \mathbb{E}\big[ | f(w_T; z) - f(w'_T; z) | \,\big|\, \mathcal{E} \big] + \Pr[\mathcal{E}^c]\, \sup_{w, z} f(w; z) \le L\, \mathbb{E}\big[ \delta_T \,\big|\, \mathcal{E} \big] + \Pr[\mathcal{E}^c]\, \sup_{w, z} f(w; z).$

The second inequality follows from the Lipschitz assumption.

It remains to bound $\Pr[\mathcal{E}^c]$. Toward that end, let $i^\star$ denote the position in which $S$ and $S'$ differ and consider the random variable $I$ assuming the index of the first time step in which SGM uses the example $z_{i^\star}$. Note that when $I > t_0$, then we must have that $\delta_{t_0} = 0$, since the execution on $S$ and $S'$ is identical until step $t_0$. Hence,

$\Pr[\mathcal{E}^c] = \Pr[\delta_{t_0} \ne 0] \le \Pr[I \le t_0].$

Under the random permutation rule, the first use of example $z_{i^\star}$ occurs at a position that is uniformly distributed in $\{1, \dots, n\}$, and therefore

$\Pr[I \le t_0] = \frac{t_0}{n}.$

This proves the claim we stated for the random permutation rule. For the random selection rule, we have by the union bound $\Pr[I \le t_0] \le \sum_{t=1}^{t_0} \Pr[i_t = i^\star] = \frac{t_0}{n}$. This completes the proof. ∎
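The probability appearing in the lemma is easy to check by simulation under the random selection rule; the sketch below compares the empirical hitting probability with the union bound $t_0/n$, using arbitrary illustrative values of $n$ and $t_0$.

```python
import numpy as np

def prob_hit_early(n, t0, trials=100_000, seed=0):
    # Under uniform sampling, estimate P[I <= t0]: the single differing example
    # (say, index 0) is touched at least once in the first t0 steps.
    rng = np.random.default_rng(seed)
    picks = rng.integers(n, size=(trials, t0))
    return float(np.mean((picks == 0).any(axis=1)))

n, t0 = 1000, 50
print(prob_hit_early(n, t0))   # close to 1 - (1 - 1/n)**t0, about 0.049
print(t0 / n)                  # the union bound used in Lemma 3.11: 0.05
```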

Theorem 3.12.

Assume that $f(\cdot\,; z) \in [0, 1]$ is an $L$-Lipschitz and $\beta$-smooth loss function for every $z$. Suppose that we run SGM for $T$ steps with monotonically non-increasing step sizes $\alpha_t \le c/t$. Then, SGM has uniform stability with

$\epsilon_{\mathrm{stab}} \;\le\; \frac{1 + 1/(\beta c)}{n - 1} \,\big( 2 c L^2 \big)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}.$

In particular, omitting constant factors that depend on $\beta$, $c$ and $L$, we get

$\epsilon_{\mathrm{stab}} \;\lesssim\; \frac{T^{\frac{\beta c}{\beta c + 1}}}{n}.$

Proof.

Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient updates $G_1, \dots, G_T$ and $G'_1, \dots, G'_T$ induced by running SGM on samples $S$ and $S'$, respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.

By Lemma 3.11 and the assumption that the loss is bounded by $1$, we have for every $t_0 \in \{1, \dots, n\}$,

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \;\le\; \frac{t_0}{n} + L\, \mathbb{E}\big[ \delta_T \,\big|\, \delta_{t_0} = 0 \big],$

(3.10)

where $\delta_t = \| w_t - w'_t \|$. To simplify notation, let $\Delta_t = \mathbb{E}\big[ \delta_t \,\big|\, \delta_{t_0} = 0 \big]$. We will bound $\Delta_T$ as a function of $t_0$ and then minimize for $t_0$.

Toward this goal, observe that at step $t$, with probability $1 - 1/n$, the example selected by SGM is the same in both $S$ and $S'$. In this case we have that $G_t = G'_t$ and we can use the $(1 + \alpha_t \beta)$-expansivity of the update rule $G_t$, which follows from our smoothness assumption via Lemma 3.7. With probability $1/n$ the selected example is different, in which case we use that both $G_t$ and $G'_t$ are $(\alpha_t L)$-bounded as a consequence of Lemma 3.3.

Hence, we can apply Lemma 2.5 and linearity of expectation to conclude that for every $t \ge t_0$,

$\Delta_{t+1} \;\le\; \Big(1 - \frac{1}{n}\Big)(1 + \alpha_t \beta)\, \Delta_t + \frac{1}{n}\, \Delta_t + \frac{2 \alpha_t L}{n} \;\le\; \Big(1 + \frac{c\beta}{t}\Big) \Delta_t + \frac{2 c L}{t n} \;\le\; \exp\!\Big( \frac{c\beta}{t} \Big) \Delta_t + \frac{2 c L}{t n}.$

Here we used that $1 + x \le \exp(x)$ for all $x$.

Using the fact that $\Delta_{t_0} = 0$, we can unwind this recurrence relation from $T$ down to $t_0 + 1$. This gives

$\Delta_T \;\le\; \sum_{t = t_0 + 1}^{T} \Bigg( \prod_{k = t + 1}^{T} \exp\!\Big( \frac{c\beta}{k} \Big) \Bigg) \frac{2 c L}{t n} \;\le\; \sum_{t = t_0 + 1}^{T} \Big( \frac{T}{t} \Big)^{\beta c} \frac{2 c L}{t n} \;\le\; \frac{2 L}{\beta n} \Big( \frac{T}{t_0} \Big)^{\beta c}.$

Plugging this bound into (3.10), we get

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \;\le\; \frac{t_0}{n} + \frac{2 L^2}{\beta n} \Big( \frac{T}{t_0} \Big)^{\beta c}.$

The right hand side is approximately minimized when

$t_0 = \big( 2 c L^2 \big)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}.$

This setting gives us

$\mathbb{E}\, | f(w_T; z) - f(w'_T; z) | \;\le\; \frac{1 + 1/(\beta c)}{n} \big( 2 c L^2 \big)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}},$

which implies the stated bound. Since the bound we just derived holds for all $S$, $S'$ and $z$, we immediately get the claimed upper bound on the uniform stability. ∎

4 Stability-inducing operations

In light of our results, it makes sense to look for operations that increase the stability of the stochastic gradient method. We show in this section that, pleasingly, several popular heuristics and methods indeed improve the stability of SGM. Our rather straightforward analyses both strengthen the bounds we previously obtained and help to explain the empirical success of these methods.

Weight Decay and Regularization.

Weight decay is a simple and effective method that often improves generalization [20].

Definition 4.1.

Let $f : \Omega \to \mathbb{R}$ be a differentiable function. We define the gradient update with weight decay at rate $\mu$ as

$G_{f, \mu, \alpha}(w) = (1 - \alpha\mu)\, w - \alpha \nabla f(w).$

It is easy to verify that the above update rule is equivalent to performing a gradient update on the $\ell_2$-regularized objective $g(w) = f(w) + \frac{\mu}{2}\| w \|^2$.

Lemma 4.2.

Assume that $f$ is $\beta$-smooth and that $\alpha\mu \le 1$. Then, $G_{f, \mu, \alpha}$ is $\big(1 + \alpha(\beta - \mu)\big)$-expansive.

Proof.

Let $v, w \in \Omega$. By the triangle inequality and our smoothness assumption,

$\| G_{f,\mu,\alpha}(v) - G_{f,\mu,\alpha}(w) \| \le (1 - \alpha\mu) \| v - w \| + \alpha \| \nabla f(v) - \nabla f(w) \| \le \big(1 + \alpha(\beta - \mu)\big) \| v - w \|. \;\;∎$

The above lemma shows that the weight decay rate $\mu$ counters the smoothness parameter $\beta$. Once $\mu > \beta$, the gradient update with decay becomes contractive. Any theorem we proved in previous sections that has a dependence on $\beta$ leads to a corresponding theorem for stochastic gradient with weight decay in which $\beta$ is replaced with $\beta - \mu$.
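A sketch of the weight-decay update as written in Definition 4.1 above; the step size and decay rate are illustrative parameters.

```python
import numpy as np

def weight_decay_update(w, grad, alpha, mu):
    # Gradient update with weight decay at rate mu (cf. Definition 4.1):
    # the same as one gradient step on f(w) + (mu / 2) * ||w||^2.
    return (1.0 - alpha * mu) * w - alpha * grad(w)
```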

Gradient Clipping.

It is common when training deep neural networks to enforce bounds on the norm of the gradients encountered by SGD. This is often done by truncation, scaling, or dropping of examples that cause an exceptionally large value of the gradient norm. Any such heuristic directly leads to a bound on the Lipschitz parameter $L$ that appears in our bounds. It is also easy to introduce a time-varying Lipschitz parameter $L_t$ to account for possibly different clipping values over the course of training.
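One common instance of such a heuristic is clipping the gradient by its global norm, sketched below; the function name and the threshold are our own illustrative choices.

```python
import numpy as np

def clip_by_norm(g, max_norm):
    # Rescale the gradient so that its Euclidean norm is at most max_norm,
    # which enforces an effective Lipschitz constant L = max_norm in the bounds.
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else (max_norm / norm) * g
```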

Dropout.

Dropout [40] is a popular and effective heuristic for preventing large neural networks from overfitting. Here we prove that, indeed, dropout improves all of our stability bounds generically. From the point of view of stochastic gradient descent, dropout is equivalent to setting a fraction of the coordinates of the gradient to zero. That is, instead of updating with a stochastic gradient $\nabla f(w)$, we instead update with a perturbed gradient $D \nabla f(w)$, which is typically identical to $\nabla f(w)$ in some of the coordinates and equal to zero on the remaining coordinates, although our definition below is a fair bit more general.

Definition 4.3.

We say that a randomized map $D : \Omega \to \Omega$ is a dropout operator with dropout rate $s$ if for every $v \in \Omega$ we have $\mathbb{E}\, \| D v \| \le s \| v \|$. For a differentiable function $f$, we let $D G_{f,\alpha}$ denote the dropout gradient update defined as

$D G_{f,\alpha}(w) = w - \alpha\, D\big( \nabla f(w) \big).$

As expected, dropout improves the effective Lipschitz constant of the objective function.

Lemma 4.4.

Assume that $f$ is $L$-Lipschitz. Then, the dropout gradient update $D G_{f,\alpha}$ with dropout rate $s$ is $(s \alpha L)$-bounded.

Proof.

By our Lipschitz assumption and linearity of expectation,

$\mathbb{E}\, \| w - D G_{f,\alpha}(w) \| = \alpha\, \mathbb{E}\, \big\| D\big( \nabla f(w) \big) \big\| \le s \alpha \| \nabla f(w) \| \le s \alpha L. \;\;∎$

From this lemma we can obtain various corollaries by replacing $L$ with $sL$ in our theorems.
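As one concrete (and assumed) instance of Definition 4.3, the sketch below zeroes each gradient coordinate independently with probability 1 − keep_prob; by Jensen's inequality this operator satisfies $\mathbb{E}\|Dv\| \le \sqrt{\text{keep\_prob}}\, \|v\|$, so it is a dropout operator with rate $s = \sqrt{\text{keep\_prob}}$ in the sense above.

```python
import numpy as np

def dropout_gradient_update(w, grad, alpha, keep_prob, rng):
    # Zero out each gradient coordinate independently with probability
    # 1 - keep_prob, then take a gradient step (cf. Definition 4.3).
    # By Jensen's inequality, E||D v|| <= sqrt(keep_prob) * ||v||, so this D
    # is a dropout operator with rate s = sqrt(keep_prob).
    mask = rng.random(w.shape) < keep_prob
    return w - alpha * (mask * grad(w))
```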

Projections and Proximal Steps.

Related to regularization, there are many popular updates which follow a stochastic gradient update with a projection onto a set or some statistical shrinkage operation. The vast majority of these operations can be understood as applying a proximal-point operation associated with a convex function. Similar to the gradient operation, we can define the proximal update rule.

Definition 4.5.

For a nonnegative step size $\alpha \ge 0$ and a function $f : \Omega \to \mathbb{R}$, we define the proximal update rule $P_{f,\alpha}$ as

$P_{f,\alpha}(w) = \arg\min_{v} \Big\{ \frac{1}{2\alpha} \| w - v \|^2 + f(v) \Big\}.$

(4.1)

For example, Euclidean projection onto a convex set is the proximal point operation associated with the indicator function of that set. Soft-thresholding is the proximal point operator associated with the $\ell_1$-norm. For more information, see the surveys by Combettes and Wajs [6] or Parikh and Boyd [33].

An elementary proof of the following Lemma, due to Rockafellar [36], can be found in the appendix.

Lemma 4.6.

If $f$ is convex, the proximal update (4.1) is $1$-expansive.

In particular, this lemma implies that the Euclidean projection onto a convex set is $1$-expansive. Note that in many important cases, proximal operators are actually contractive; that is, they are $\eta$-expansive with $\eta < 1$. A notable example is when $f$ is the squared Euclidean norm $\frac{\mu}{2}\| v \|^2$, for which the update rule is $\eta$-expansive with $\eta = \frac{1}{1 + \alpha\mu}$. So stability can be induced by the choice of an appropriate prox-operation, which can always be interpreted as some form of regularization.
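Two of the proximal updates mentioned above, written out as a sketch: soft-thresholding (the prox of a multiple of the ℓ1-norm) and the prox of the squared Euclidean norm, which is the contractive example just discussed.

```python
import numpy as np

def prox_l1(w, alpha, lam):
    # Proximal update for f(v) = lam * ||v||_1: coordinate-wise soft-thresholding.
    return np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

def prox_squared_l2(w, alpha, mu):
    # Proximal update for f(v) = (mu / 2) * ||v||^2: pure shrinkage toward zero,
    # hence eta-expansive with eta = 1 / (1 + alpha * mu) < 1.
    return w / (1.0 + alpha * mu)
```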

Model Averaging.

Model averaging refers to the idea of averaging the iterates $w_1, \dots, w_T$ obtained by a run of SGD. In convex optimization, model averaging is sometimes observed to lead to better empirical performance of SGM and of closely related updates such as the Perceptron [10]. Here we show that model averaging improves our bound for the convex case by a constant factor.
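A sketch of SGM with iterate averaging, maintaining a running average of the iterates alongside the last iterate; the gradient oracle and step-size schedule are placeholders.

```python
import numpy as np

def sgm_with_averaging(grad, data, w0, steps, alpha, seed=0):
    # Run SGM and return both the last iterate and the running average
    # of the iterates w_1, ..., w_T.
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    w_avg = np.zeros_like(w)
    n = len(data)
    for t in range(1, steps + 1):
        i = int(rng.integers(n))
        w = w - alpha(t) * grad(w, data[i])
        w_avg += (w - w_avg) / t      # incremental update of the average
    return w, w_avg
```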

Theorem 4.7.

Assume that $f$ is a decomposable, convex, $L$-Lipschitz, and $\beta$-smooth function and that we run SGM with step sizes $\alpha_t$ for $T$ steps. Then, the average of the first $T$ iterates of SGM has uniform stability of

Proof.

Let