# Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li-Wei-Ma (2019) recently proved this behavior can exist in a simplified non-convex neural-network setting. In this note, we show that this phenomenon can exist even for convex learning problems – in particular, linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (a large initial learning rate, followed by a small learning rate) can lead gradient descent to solutions with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes, and early-stopping.

## 1 Introduction

The learning rate schedule of stochastic gradient descent is known to strongly affect the generalization of modern neural networks, in ways not explained by optimization considerations alone. In particular, training with a large initial learning-rate followed by a smaller annealed learning rate can vastly outperform training with the smaller learning rate throughout – even when allowing both to reach the same train loss. The recent work of li2019explaining sheds some light on this, by showing a simplified neural-network setting in which this behavior provably occurs.

It may be conjectured that non-convexity is crucial for the learning-rate schedule to affect generalization (for example, because strongly-convex objectives have unique global minima). However, we show this behavior can appear even in strongly-convex learning problems. The key insight is that although a strongly-convex problem has a unique minimum, early-stopping breaks this uniqueness: there is a whole set of early-stopped solutions with the same train loss $\epsilon$. These solutions can have different generalization properties when the train loss surface does not closely approximate the test loss surface. In our setting, a large learning-rate prevents gradient descent from optimizing along high-curvature directions, and thus effectively regularizes against directions that are high-curvature in the train set, but low-curvature on the true distribution (see Figure 1).

Technically, we model a small learning rate by considering gradient flow. Then we compare the following optimizers:

(A) Gradient descent with an infinitesimally small step size (i.e. gradient flow).

(B) Gradient descent with a large step size, followed by gradient flow.

We show that for a particular linear regression problem, if we run both (A) and (B) until reaching the same train loss $\epsilon$, then with probability $3/4$ over the train samples, the test loss of (A) is twice that of (B). That is, starting with a large learning rate and then annealing to an infinitesimally small one helps generalization significantly.

Our Contribution. We show that non-convexity is not required to reproduce an effect of learning rate on generalization that is observed in deep learning models. We give a simple model where the mechanisms behind this can be theoretically understood, which shares some features with deep learning models in practice. We hope such simple examples can yield insight and intuition that may eventually lead to a better understanding of the effect of learning rate in deep learning.

Organization. In Section 2 we describe the example. In Section 3 we discuss features of the example, and its potential implications (and non-implications) in deep learning. In Appendix A, we include formal proofs and provide a mild generalization of the example in Section 2.

### 1.1 Related Work

This work was inspired by li2019explaining, which gives a certain simplified neural-network setting in which learning-rate schedule provably affects generalization, despite performing identically with respect to optimization. The mechanisms and intuitions in li2019explaining depend crucially on non-convexity of the objective. In contrast, in this work we provide complementary results, by showing that this behavior can occur even in convex models due to the interaction between early-stopping and estimation error. Our work is not incompatible with li2019explaining; indeed, the true mechanisms behind generalization in real neural networks may involve both of these factors (among others).

There are many empirical and theoretical works attempting to understand the effect of learning-rate in deep learning, which we briefly describe here. Several recent works study the effect of learning rate by considering properties (e.g. curvature) of the loss landscape, finding that different learning rate schedules can lead SGD to regions in parameter space with different local geometry (jastrzebski2020break; lewkowycz2020large). This fits more broadly into a study of the implicit bias of SGD. For example, there is debate on whether generalization is connected to “sharpness” of the minima (e.g. hochreiter1997long; keskar2016large; dinh2017sharp). The effect of learning-rate is also often studied alongside the effect of batch-size, since the optimal choice of these two parameters is coupled in practice (krizhevsky2014one; goyal2017accurate; smith2017don). Several works suggest that the stochasticity of SGD is crucial to the effect of learning-rate, in that different learning rates lead to different stochastic dynamics, with different generalization behavior (smith2017bayesian; mandt2017stochastic; hoffer2017train; mccandlish2018empirical; welling2011bayesian; li2019explaining). In particular, it may be the case that the stochasticity in the gradient estimates of SGD acts as an effective “regularizer” at high learning rates. This aspect is not explicitly present in our work, and is an interesting area for future theoretical and empirical study.

## 2 Convex Example

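The main idea is illustrated in the following linear regression problem, as visualized in Figure 1. Consider the data distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^2 \times \mathbb{R}$ defined as:

$$x \in \{\vec{e}_1, \vec{e}_2\} \text{ uniformly at random}; \quad y = \langle \beta^*, x \rangle$$

for some ground-truth $\beta^* \in \mathbb{R}^2$. We want to learn a linear model with small mean-squared-error on the population distribution. Specifically, for model parameters $\beta \in \mathbb{R}^2$, the population loss is

$$L_{\mathcal{D}}(\beta) := \mathbb{E}_{x, y \sim \mathcal{D}}\left[(\langle \beta, x \rangle - y)^2\right]$$

We want to approximate the minimizer $\beta^*$ of $L_{\mathcal{D}}$. Suppose we try to find this minimum by drawing $n$ samples $\{(x_i, y_i)\}$ from $\mathcal{D}$, and performing gradient descent on the empirical loss:

$$\hat{L}_n(\beta) := \frac{1}{n} \sum_{i \in [n]} (\langle \beta, x_i \rangle - y_i)^2$$

starting at $\beta = 0$, and stopping when $\hat{L}_n(\beta) \leq \epsilon$ for some small $\epsilon > 0$.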

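Now for simplicity let $n = 3$, and let the ground-truth model be $\beta^* = (\beta^*_1, \beta^*_2) \in \mathbb{R}^2$. With probability $3/4$, two of the samples will have the same value of $x$ – say this value is $\vec{e}_1$, so the samples are $\{(\vec{e}_1, \beta^*_1), (\vec{e}_1, \beta^*_1), (\vec{e}_2, \beta^*_2)\}$. In this case the empirical loss is

$$\hat{L}_n(\beta) = \underbrace{\tfrac{2}{3}(\beta_1 - \beta^*_1)^2}_{(A)} + \underbrace{\tfrac{1}{3}(\beta_2 - \beta^*_2)^2}_{(B)} \tag{1}$$

which is distorted compared to the population loss, since we have few samples relative to the dimension. The population loss is:

$$L_{\mathcal{D}}(\beta) = \tfrac{1}{2}(\beta_1 - \beta^*_1)^2 + \tfrac{1}{2}(\beta_2 - \beta^*_2)^2 \tag{2}$$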
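As a sanity check on the sampling probability and on the weights appearing in Equation 1, the eight equally likely draws of three inputs can be enumerated directly. The sketch below is illustrative only: the helper name `empirical_weights` and the coding of $\vec{e}_1, \vec{e}_2$ as the integers 1 and 2 are assumptions made for this example, not notation from the text.

```python
from fractions import Fraction
from itertools import product


def empirical_weights(xs):
    """Return (w1, w2), the fractions of samples equal to e1 and to e2.

    Since y = <beta*, x>, the empirical loss on these samples is
    w1 * (beta_1 - beta*_1)^2 + w2 * (beta_2 - beta*_2)^2.
    """
    n = len(xs)
    w1 = Fraction(sum(1 for x in xs if x == 1), n)
    return w1, 1 - w1


# Each of the n = 3 inputs is e1 (coded 1) or e2 (coded 2) with probability 1/2,
# so all 2^3 patterns are equally likely.
outcomes = list(product([1, 2], repeat=3))

# Patterns in which two samples share one input and the third has the other.
split = [xs for xs in outcomes if len(set(xs)) == 2]
prob_split = Fraction(len(split), len(outcomes))

# Empirical-loss weights seen across those patterns, up to reordering coordinates.
weights_seen = {tuple(sorted(empirical_weights(xs), reverse=True)) for xs in split}

print("probability of a 2-vs-1 split:", prob_split)
print("loss weights (larger first):", weights_seen)
```

Six of the eight patterns are 2-versus-1 splits, and in each of them the empirical loss weights one coordinate by 2/3 and the other by 1/3, matching Equation 1 up to relabeling of coordinates.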

The key point is that although the global minima of $L_{\mathcal{D}}$ and $\hat{L}_n$ are identical, their level sets are not: not all points of small train loss $\hat{L}_n(\beta)$ have identical test loss $L_{\mathcal{D}}(\beta)$. To see this, refer to Equation 1, and consider two different ways that train loss $\epsilon$ could be achieved. In the “good” situation, term (A) in Equation 1 is $\epsilon$, and term (B) is $0$. In the “bad” situation, term (A) is $0$ and term (B) is $\epsilon$. These two cases have test losses which differ by a factor of two, since the terms are re-weighted differently in the test loss (Equation 2). This is summarized in the below table.

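| | Residual $(\beta - \beta^*)$ | Train Loss $\hat{L}_n(\beta)$ | Test Loss $L_{\mathcal{D}}(\beta)$ |
|---|---|---|---|
| Good: | $[\pm\sqrt{3\epsilon/2},\ 0]$ | $\epsilon$ | $\tfrac{3}{4}\epsilon$ |
| Bad: | $[0,\ \pm\sqrt{3\epsilon}]$ | $\epsilon$ | $\tfrac{3}{2}\epsilon$ |

Now, if our optimizer stops at train loss $\epsilon$, it will pick one of the points in $\{\beta : \hat{L}_n(\beta) = \epsilon\}$. We see from the above that some of these points are twice as bad as others.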

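Notice that gradient flow on $\hat{L}_n$ (Equation 1) will optimize twice as fast along the $\beta_1$ coordinate compared to $\beta_2$. The gradient flow dynamics of the residual $(\beta - \beta^*)$ are:

$$\frac{d\beta}{dt} = -\nabla_\beta \hat{L}_n \implies \begin{cases} \frac{d}{dt}(\beta_1 - \beta^*_1) = -\frac{4}{3}(\beta_1 - \beta^*_1) \\ \frac{d}{dt}(\beta_2 - \beta^*_2) = -\frac{2}{3}(\beta_2 - \beta^*_2) \end{cases}$$

This will tend to find solutions closer to the “Bad” solution from the above table, where the residual along $\beta_1$ is approximately $0$ and the residual along $\beta_2$ accounts for nearly all of the train loss $\epsilon$.

However, gradient descent with a large step size can oscillate on the $\beta_1$ coordinate, and achieve low train loss by optimizing $\beta_2$ instead. Then in the second stage, once the learning-rate is annealed, gradient descent will optimize on $\beta_1$ while keeping the residual along the $\beta_2$ coordinate small. This will lead to a solution closer to the “Good” solution, where $\beta_2 \approx \beta^*_2$. These dynamics are visualized in Figure 1.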
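The two trajectories described above can be simulated in a few lines. This is a rough sketch under stated assumptions, not the construction from the appendix: the ground truth `beta_star = (10, 10)`, the threshold `eps = 1e-3`, and the number of large steps are hypothetical values, the large step size is taken to be $1/\gamma_1 = 3/2$ so that the first coordinate flips sign each step, and gradient flow is evaluated through its closed-form exponential decay with a bisection on the stopping time.

```python
import math

GAMMA = (2 / 3, 1 / 3)  # train-loss weights from Equation 1
LAM = (0.5, 0.5)        # test-loss weights from Equation 2


def loss(weights, r):
    """Weighted squared residual: sum_i weights_i * r_i^2."""
    return sum(w * ri ** 2 for w, ri in zip(weights, r))


def gradient_flow(r0, eps):
    """Follow r_i(t) = r_i(0) * exp(-2 * gamma_i * t) until train loss hits eps."""
    def at(t):
        return [ri * math.exp(-2 * g * t) for ri, g in zip(r0, GAMMA)]

    if loss(GAMMA, r0) <= eps:
        return list(r0)
    lo, hi = 0.0, 1.0
    while loss(GAMMA, at(hi)) > eps:   # grow the bracket until the loss is below eps
        hi *= 2
    for _ in range(200):               # bisect on the stopping time
        mid = (lo + hi) / 2
        if loss(GAMMA, at(mid)) > eps:
            lo = mid
        else:
            hi = mid
    return at(hi)


def annealed_gradient_descent(r0, eps, eta, steps):
    """Large-step gradient descent for `steps` iterations, then gradient flow."""
    r = list(r0)
    for _ in range(steps):
        if loss(GAMMA, r) <= eps:      # early stopping also applies in stage one
            break
        r = [(1 - 2 * eta * g) * ri for ri, g in zip(r, GAMMA)]
    return gradient_flow(r, eps)


beta_star = (10.0, 10.0)               # hypothetical ground truth
r0 = [-b for b in beta_star]           # residual at the zero initialization
eps = 1e-3                             # hypothetical early-stopping threshold

slow = gradient_flow(r0, eps)
fast = annealed_gradient_descent(r0, eps, eta=1 / GAMMA[0], steps=20)

ratio = loss(LAM, slow) / loss(LAM, fast)
print("test loss, gradient flow:", loss(LAM, slow))
print("test loss, annealed GD:  ", loss(LAM, fast))
print("ratio:", ratio)
```

With these values both runs stop at the same train loss, while the gradient-flow solution has close to twice the test loss of the annealed one. In this particular two-dimensional instance the large step happens to zero out the second coordinate almost immediately, which is a feature of the chosen spectrum rather than of the general argument.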

This example is formalized in the following claim. The proof is straightforward, and included in Appendix A.

Claim 2. For all $0 < \alpha < 1$, there exists a distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^2 \times \mathbb{R}$ and a learning-rate $\eta > 0$ (the “large” learning-rate) such that for all $0 < \epsilon < 1$, the following holds. With probability $3/4$ over $n = 3$ samples:

1. Gradient flow from $0$-initialization, early-stopped at train loss $\epsilon$, achieves population loss

$$L_{\mathcal{D}}(\beta_{\text{bad}}) \geq \tfrac{3}{2}\epsilon(1 - \alpha)$$

2. Annealed gradient descent from $0$-initialization (i.e. gradient descent with stepsize $\eta$ for $K$ steps, followed by gradient flow) stopped at train loss $\epsilon$, achieves population loss

$$L_{\mathcal{D}}(\beta_{\text{good}}) \leq \tfrac{3}{4}\epsilon + \exp(-\Omega(K))$$

In particular, since $\alpha$ can be taken arbitrarily small, and $K$ taken arbitrarily large, this implies that gradient flow achieves a population loss twice as high as gradient descent with a careful step size, followed by gradient flow.

Moreover, with the remaining probability $1/4$, the samples will be such that gradient flow and annealed gradient descent behave identically: $L_{\mathcal{D}}(\beta_{\text{good}}) = L_{\mathcal{D}}(\beta_{\text{bad}})$.

## 3 Discussion

Similarities. This example, though stylized, shares several features with deep neural networks in practice.

• Neural nets trained with cross-entropy loss cannot be trained to global empirical risk minima, and are instead early-stopped at some small value of train loss.

• There is a mismatch between train and test loss landscapes (in deep learning, this is due to overparameterization/undersampling).

• Learning rate annealing typically generalizes better than using a small constant learning rate (goodfellow2016deep).

• The “large” learning rates used in practice are far larger than what optimization theory would prescribe. In particular, consecutive iterates of SGD are often negatively-correlated with each other in the later stage of training (xing2018walk), suggesting that the iterates are oscillating around a sharp valley. jastrzebski2018relation also finds that the typical SGD step is too large to optimize along the steepest directions of the loss.

Limitations. Our example is nevertheless a toy example, and does not exhibit some features of real networks. For example, our setting has only one basin of attraction. But real networks have many basins of attraction, and there is evidence that a high initial learning rate influences the choice of basin (e.g. li2019explaining; jastrzebski2020break). It remains an important open question to understand the effect of learning rate in deep learning.

#### Acknowledgements

We thank John Schulman for a discussion around learning rates that led to wondering if this can occur in convex problems. We thank Aditya Ramesh, Ilya Sutskever, and Gal Kaplun for helpful comments on presentation, and Jacob Steinhardt for technical discussions refining these results.

Work supported in part by the Simons Investigator Awards of Boaz Barak and Madhu Sudan, and NSF Awards under grants CCF 1565264, CCF 1715187, and CNS 1618026.

## Appendix A Formal Statements

In this section, we state and prove Claim 2, along with a mild generalization of the setting via Lemma A.2.

### A.1 Notation and Preliminaries

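For a distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^d \times \mathbb{R}$, let $L_{\mathcal{D}}(\beta) := \mathbb{E}_{x, y \sim \mathcal{D}}[(\langle \beta, x \rangle - y)^2]$ be the population loss for parameters $\beta$. Let $\hat{L}_n(\beta) := \frac{1}{n}\sum_{i \in [n]} (\langle \beta, x_i \rangle - y_i)^2$ be the train loss for $n$ samples $\{(x_i, y_i)\}$.

Let $\beta_{\text{slow}}$ denote the output of the procedure which optimizes train loss by gradient flow from initial point $\beta_0$, until the train loss reaches early-stopping threshold $\epsilon$, and then outputs the resulting parameters.

Let $\beta_{\text{fast}}$ denote the output of the procedure which first optimizes train loss with gradient descent starting at initial point $\beta_0$, with step-size $\eta$, for $K$ steps, and an early-stopping threshold $\epsilon$, and then continues with gradient flow with early-stopping threshold $\epsilon$.

For $n$ samples, let $X \in \mathbb{R}^{n \times d}$ be the design matrix of samples. Let $\Sigma_X := \mathbb{E}[x x^T]$ be the population covariance and let $\hat{\Sigma}_X := \frac{1}{n} X^T X$ be the empirical covariance.

We assume throughout that $\Sigma_X$ and $\hat{\Sigma}_X$ are simultaneously diagonalizable:

$$\Sigma_X = U \Lambda U^T; \quad \hat{\Sigma}_X = U \Gamma U^T$$

for diagonal $\Lambda = \mathrm{diag}(\lambda_i)$, $\Gamma = \mathrm{diag}(\gamma_i)$ and orthonormal $U$. Further, assume without loss of generality that $\hat{\Sigma}_X \succ 0$ (if $\hat{\Sigma}_X$ is not full rank, we can restrict attention to its image).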

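We consider optimizing the train loss:

$$\hat{L}(\beta) = \frac{1}{n}\|X(\beta - \beta^*)\|_2^2 = \|\beta - \beta^*\|^2_{\hat{\Sigma}_X}$$

Observe that the optimal population loss for a point with train loss $\epsilon$ is:

$$\min_{\beta \,:\, \hat{L}(\beta) = \epsilon} L_{\mathcal{D}}(\beta) = \epsilon \min_i \left(\frac{\lambda_i}{\gamma_i}\right) \tag{3}$$

Similarly, the worst population loss for a point with train loss $\epsilon$ is:

$$\max_{\beta \,:\, \hat{L}(\beta) = \epsilon} L_{\mathcal{D}}(\beta) = \epsilon \max_i \left(\frac{\lambda_i}{\gamma_i}\right) \tag{4}$$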
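One way to see Equations 3 and 4 is to work in the shared eigenbasis and write $\delta_i$ for the $i$-th coordinate of $\beta - \beta^*$ there. A point with train loss $\epsilon$ then corresponds to weights $w_i = \gamma_i \delta_i^2 / \epsilon$ that are nonnegative and sum to one, and its population loss is $\epsilon \sum_i w_i \lambda_i / \gamma_i$, a weighted average of the ratios. The sketch below checks this numerically; the spectra `lam` and `gam` are hypothetical values chosen only for illustration.

```python
import random


def population_loss(lam, gam, weights, eps):
    """Population loss of a point whose train loss eps is split across
    eigen-directions according to `weights` (nonnegative, summing to 1).

    gamma_i * delta_i^2 = eps * w_i, hence
    lambda_i * delta_i^2 = eps * w_i * lambda_i / gamma_i.
    """
    return eps * sum(w * l / g for w, l, g in zip(weights, lam, gam))


def one_hot(i, d):
    """Weights that put the entire train loss on direction i."""
    return [1.0 if j == i else 0.0 for j in range(d)]


lam = (0.5, 0.3, 0.2)  # hypothetical population eigenvalues
gam = (0.6, 0.3, 0.1)  # hypothetical empirical eigenvalues
eps = 0.01
ratios = [l / g for l, g in zip(lam, gam)]

# Random points on the level set {train loss = eps}.
rng = random.Random(0)
sampled = []
for _ in range(1000):
    raw = [rng.random() for _ in gam]
    total = sum(raw)
    sampled.append(population_loss(lam, gam, [x / total for x in raw], eps))

# The extremes are attained by concentrating the train loss on one direction.
best = population_loss(lam, gam, one_hot(ratios.index(min(ratios)), len(gam)), eps)
worst = population_loss(lam, gam, one_hot(ratios.index(max(ratios)), len(gam)), eps)

print("eps * min ratio:", eps * min(ratios), " attained:", best)
print("eps * max ratio:", eps * max(ratios), " attained:", worst)
print("sampled range:", min(sampled), max(sampled))
```

All sampled points fall between the two extremes, which are reached when the entire train loss sits in the direction with the smallest or largest ratio $\lambda_i / \gamma_i$.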

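Now, the gradient flow dynamics on the train loss are:

$$\frac{d\beta}{dt} = -\nabla_\beta \hat{L}(\beta) = -\frac{2}{n} X^T X (\beta - \beta^*)$$

Switching to coordinates $\delta := U^T(\beta - \beta^*)$, these dynamics are equivalent to:

$$\frac{d\delta}{dt} = -2\Gamma\delta$$

The empirical and population losses in these coordinates can be written as:

$$L_{\mathcal{D}}(\delta) = \sum_i \lambda_i \delta_i^2; \quad \hat{L}(\delta) = \sum_i \gamma_i \delta_i^2$$

And the trajectory of gradient flow, from initialization $\delta(0)$, is

$$\forall i \in [d]: \quad \delta_i(t) = \delta_i(0)\, e^{-2\gamma_i t}$$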
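The closed-form trajectory can be compared against gradient descent with a very small step size, which is the sense in which gradient flow models a small learning rate. A minimal sketch, assuming a hypothetical spectrum `gam`, initialization `delta0`, and step size `eta`:

```python
import math


def gradient_flow_closed_form(delta0, gam, t):
    """delta_i(t) = delta_i(0) * exp(-2 * gamma_i * t)."""
    return [d * math.exp(-2.0 * g * t) for d, g in zip(delta0, gam)]


def gradient_descent(delta0, gam, eta, steps):
    """Discrete updates delta <- delta - eta * grad, where the gradient of
    sum_i gamma_i * delta_i^2 is 2 * gamma_i * delta_i in coordinate i."""
    delta = list(delta0)
    for _ in range(steps):
        delta = [d - eta * 2.0 * g * d for d, g in zip(delta, gam)]
    return delta


gam = (2 / 3, 1 / 3)     # hypothetical empirical eigenvalues
delta0 = (-10.0, -10.0)  # hypothetical initialization in eigen-coordinates
t = 1.0                  # total "time" to integrate
eta = 1e-4               # small step size standing in for gradient flow

approx = gradient_descent(delta0, gam, eta, steps=int(round(t / eta)))
exact = gradient_flow_closed_form(delta0, gam, t)
max_rel_err = max(abs(a - e) / abs(e) for a, e in zip(approx, exact))

print("small-step gradient descent:", approx)
print("closed-form gradient flow:  ", exact)
print("max relative error:", max_rel_err)
```

The discrepancy shrinks as `eta` decreases, since each discrete step multiplies coordinate $i$ by $1 - 2\eta\gamma_i$, a first-order approximation of $e^{-2\eta\gamma_i}$.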

With this setup, we now state and prove the following main lemma, characterizing the behavior of gradient flow and gradient descent for a given sample covariance.

### A.2 Main Lemma

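Lemma A.2. Let $\Sigma_X, \hat{\Sigma}_X, \Lambda, \Gamma$ be as above. Let eigenvalues be ordered according to $\Gamma$: $\gamma_1 \geq \gamma_2 \geq \dots \geq \gamma_d$.

Let $S = \{k, k+1, \dots, d\}$ for some $k$. That is, let $S$ index a block of “small” eigenvalues. Let $\bar{S} = [d] \setminus S$. Assume there is an eigenvalue gap between the “large” and “small” eigenvalues, where for some $p > 0$,

$$\forall i \in \bar{S}: \quad \gamma_i / \gamma_k \geq 1 + p.$$

For all $\alpha, \epsilon > 0$, if the initialization $\delta(0)$ satisfies

1. At initialization, the contribution of the largest eigenspace to the train loss is at least $\epsilon$:

$$\gamma_1 \delta_1(0)^2 > \epsilon$$

2. The eigenvalue gap is large enough to ensure that eventually only the “small” eigenspace is significant, specifically:

$$\epsilon^p \, \frac{\sum_{i \in \bar{S}} \gamma_i \delta_i(0)^2}{\left(|S|\, \gamma_{j^*} \delta_{j^*}(0)^2\right)^{1+p}} \leq \alpha \quad \text{where } j^* := \operatorname*{argmin}_{j \in S} \gamma_j \delta_j(0)^2$$

Then there exists a learning-rate $\eta := 1/\gamma_1$ (the “large” learning-rate) such that

1. Gradient flow achieves population loss

$$L_{\mathcal{D}}(\beta_{\text{slow}}) \geq \epsilon (1 - \alpha) \min_{j \in S} \frac{\lambda_j}{\gamma_j}$$

2. Annealed gradient descent run for $K$ steps has loss

$$L_{\mathcal{D}}(\beta_{\text{fast}}) \leq \epsilon \max_{j : \gamma_j = \gamma_1} \left(\frac{\lambda_j}{\gamma_j}\right) + \exp(-\Omega(K))$$

In particular, if the top eigenvalue of $\Gamma$ is unique, then

$$L_{\mathcal{D}}(\beta_{\text{fast}}) \leq \epsilon \frac{\lambda_1}{\gamma_1} + \exp(-\Omega(K))$$

The constant hidden in $\Omega(\cdot)$ depends on the spectrum: with $\eta = 1/\gamma_1$, any $c > 1$ satisfying $(1 - 2\gamma_i/\gamma_1)^2 \leq 1/c$ for every $i$ with $\gamma_i < \gamma_1$ suffices as the per-step decay rate.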

###### Proof.

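The proof idea is to show that gradient flow must run for some time in order to reach train loss $\epsilon$ – and since higher eigenvalues optimize faster, most of the train loss will be due to contributions from the “small” coordinates $S$.

Let $T$ be the stopping-time of gradient flow, such that $\hat{L}(\delta(T)) = \epsilon$. We can lower-bound the time $T$ required to reach train loss $\epsilon$ as:

$$|S|\, \gamma_{j^*} \delta_{j^*}(0)^2 e^{-4\gamma_k T} \leq \sum_{j \in S} \gamma_j \delta_j(0)^2 e^{-4\gamma_j T} \leq \sum_{j \in [d]} \gamma_j \delta_j(0)^2 e^{-4\gamma_j T} = \hat{L}(\delta(T)) = \epsilon \tag{5}$$

$$\implies T \geq \frac{1}{4\gamma_k} \log\left(\frac{|S|\, \gamma_{j^*} \delta_{j^*}(0)^2}{\epsilon}\right) \tag{6}$$

Decompose the train loss as:

$$\hat{L}(\delta) = \hat{L}_{\bar{S}}(\delta) + \hat{L}_S(\delta)$$

where $\hat{L}_{\bar{S}}(\delta) := \sum_{i \in \bar{S}} \gamma_i \delta_i^2$ and $\hat{L}_S(\delta) := \sum_{i \in S} \gamma_i \delta_i^2$. Now, we can bound $\hat{L}_{\bar{S}}(\delta(T))$ as:

$$\begin{aligned}
\hat{L}_{\bar{S}}(\delta(T)) = \sum_{i \in \bar{S}} \gamma_i \delta_i(T)^2 &= \sum_{i \in \bar{S}} \gamma_i \delta_i(0)^2 e^{-4\gamma_i T} \\
&\leq \sum_{i \in \bar{S}} \gamma_i \delta_i(0)^2 \left(\frac{\epsilon}{|S|\, \gamma_{j^*} \delta_{j^*}(0)^2}\right)^{\gamma_i / \gamma_k} \quad \text{(by Equation 6)} \\
&\leq \sum_{i \in \bar{S}} \gamma_i \delta_i(0)^2 \left(\frac{\epsilon}{|S|\, \gamma_{j^*} \delta_{j^*}(0)^2}\right)^{1+p} \\
&\leq \alpha \epsilon
\end{aligned}$$

where the last inequality is due to Condition 2 of the Lemma. Now, since $\hat{L}(\delta(T)) = \epsilon$, and $\hat{L} = \hat{L}_{\bar{S}} + \hat{L}_S$, and $\hat{L}_{\bar{S}}(\delta(T)) \leq \alpha\epsilon$, we must have

$$\hat{L}_S(\delta(T)) \geq (1 - \alpha)\epsilon$$

This lower-bounds the population loss as desired:

$$L_{\mathcal{D}}(\delta(T)) \geq \sum_{i \in S} \lambda_i \delta_i(T)^2 \geq \min_{j \in S}\left(\frac{\lambda_j}{\gamma_j}\right) \sum_{i \in S} \gamma_i \delta_i(T)^2 = \min_{j \in S}\left(\frac{\lambda_j}{\gamma_j}\right) \hat{L}_S(\delta(T)) \geq \min_{j \in S}\left(\frac{\lambda_j}{\gamma_j}\right)(1 - \alpha)\epsilon$$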

For annealed gradient descent, the proof idea is: set the learning-rate such that in the gradient descent stage, the optimizer oscillates on the first (highest-curvature) coordinate $\delta_1$, and optimizes until the remaining coordinates are $\approx 0$. Then, in the gradient flow stage, it optimizes on the first coordinate until hitting the target train loss of $\epsilon$. Thus, at completion most of the train loss will be due to contributions from the “large” first coordinate.

Specifically, the gradient descent dynamics with stepsize $\eta$ are:

$$\delta_i(t+1) = (1 - 2\eta\gamma_i)\,\delta_i(t)$$

for discrete $t \in \mathbb{N}$. We set $\eta := 1/\gamma_1$, so the first coordinate simply oscillates: $\delta_1(t+1) = -\delta_1(t)$. This is the case for the top eigenspace, i.e. all coordinates $i$ where $\gamma_i = \gamma_1$. The remaining coordinates decay exponentially. That is, let $Q := \{i : \gamma_i < \gamma_1\}$ be the remaining coordinates (corresponding to smaller eigenspaces). Then we have:

$$\forall i \in Q: \quad \delta_i(t)^2 \leq c^{-t}\, \delta_i(0)^2 \tag{7}$$

for a constant $c > 1$ depending on the spectrum.
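The oscillate-and-decay behavior at step size $1/\gamma_1$ is easy to observe on a small example. The sketch below uses a hypothetical three-dimensional spectrum `gam` and unit initialization; the explicit choice of `c` as the reciprocal of the largest squared multiplier over $Q$ is one valid constant for Equation 7, picked here for illustration.

```python
def gd_multipliers(gam, eta):
    """Per-coordinate factor (1 - 2 * eta * gamma_i) applied at each step."""
    return [1.0 - 2.0 * eta * g for g in gam]


gam = (1.0, 0.8, 0.3)   # hypothetical spectrum, sorted in decreasing order
eta = 1.0 / gam[0]      # the "large" learning rate
mult = gd_multipliers(gam, eta)

# Coordinates outside the top eigenspace.
Q = [i for i, g in enumerate(gam) if g < gam[0]]

# One valid decay constant for Equation 7.
c = 1.0 / max(mult[i] ** 2 for i in Q)

K = 25
delta0 = (1.0, 1.0, 1.0)
delta = list(delta0)
for _ in range(K):
    delta = [m * d for m, d in zip(mult, delta)]

print("multipliers:", mult)
print("decay constant c:", c)
print("after", K, "steps:", delta)
```

The first coordinate keeps its magnitude and only changes sign, while the coordinates in $Q$ shrink geometrically, so after the large-step stage essentially all remaining loss lives in the top eigenspace.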

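Recall, the population and empirical losses are:

$$L_{\mathcal{D}}(\delta) = \sum_i \lambda_i \delta_i^2; \quad \hat{L}(\delta) = \sum_i \gamma_i \delta_i^2$$

By Equation 7, after $K$ steps of gradient descent, the contribution to the population loss from the coordinates $i \in Q$ is small:

$$\sum_{i \in Q} \lambda_i \delta_i(K)^2 \leq \exp(-\Omega_c(K))$$

By Condition 1 of the Lemma, the train loss is still not below $\epsilon$ after the gradient descent stage, since the first coordinate did not optimize: $\gamma_1 \delta_1(K)^2 = \gamma_1 \delta_1(0)^2 > \epsilon$. We now run gradient flow, stopping at time $T$ when the train loss is $\epsilon$. At this point, we have

$$\hat{L}(\delta(T)) \leq \epsilon \implies \sum_{i \notin Q} \gamma_i \delta_i(T)^2 \leq \epsilon$$

And thus,

$$\begin{aligned}
L_{\mathcal{D}}(\delta(T)) &= \sum_{i \notin Q} \lambda_i \delta_i(T)^2 + \sum_{i \in Q} \lambda_i \delta_i(T)^2 \\
&\leq \left(\max_{k \notin Q} \frac{\lambda_k}{\gamma_k}\right) \sum_{i \notin Q} \gamma_i \delta_i(T)^2 + \exp(-\Omega(K)) \\
&\leq \left(\max_{k \notin Q} \frac{\lambda_k}{\gamma_k}\right) \epsilon + \exp(-\Omega(K))
\end{aligned}$$

as desired. ∎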

As a corollary of Lemma A.2, we recover the 2-dimensional example of Claim 2.

### A.3 Proof of Claim 2

###### Proof sketch of Claim 2.

Consider the distribution $\mathcal{D}$ defined as: Sample $x \in \{\vec{e}_1, \vec{e}_2\}$ uniformly at random, and let $y = \langle \beta^*, x \rangle$ for ground-truth $\beta^* \in \mathbb{R}^2$. The population covariance is simply

$$\Sigma_X = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}$$

With probability $3/4$, two of the three samples will have the same value of $x$. In this case, the sample covariance will be equal (up to reordering of coordinates) to

$$\hat{\Sigma}_X = \begin{bmatrix} 2/3 & 0 \\ 0 & 1/3 \end{bmatrix}$$

This satisfies the conditions of Lemma A.2, taking the “small” set of eigenvalue indices to be $S = \{2\}$. Thus, the conclusion follows by Lemma A.2.

With probability $1/4$, all the samples will be identical, and in particular share the same value of $x$. Here, the optimization trajectory will be 1-dimensional, and it is easy to see that gradient flow and annealed gradient descent will reach identical minima. ∎
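The quantities that Lemma A.2 needs for the two-dimensional example can be computed exactly with rational arithmetic. This is an illustrative sketch: indices are zero-based (so the text's $S = \{2\}$ becomes `S = [1]`), and the initialization `delta0` and threshold `eps` are hypothetical values used only to evaluate the left-hand side of Condition 2.

```python
from fractions import Fraction as F

lam = (F(1, 2), F(1, 2))   # eigenvalues of the population covariance
gam = (F(2, 3), F(1, 3))   # eigenvalues of the sample covariance

S = [1]       # "small" eigenvalue indices (zero-based)
S_bar = [0]   # the complementary "large" indices
k = S[0]      # first index of the small block

# Eigenvalue gap: gamma_i / gamma_k >= 1 + p for all i in S_bar.
p = min(gam[i] / gam[k] for i in S_bar) - 1

slow_factor = min(lam[j] / gam[j] for j in S)   # multiplies eps*(1 - alpha)
fast_factor = lam[0] / gam[0]                   # multiplies eps
eta = 1 / gam[0]                                # the "large" learning rate


def condition_2_lhs(delta0, eps):
    """Left-hand side of Condition 2; the lemma applies for any alpha at
    least this large."""
    j_star = min(S, key=lambda j: gam[j] * delta0[j] ** 2)
    numerator = sum(gam[i] * delta0[i] ** 2 for i in S_bar)
    denominator = (len(S) * gam[j_star] * delta0[j_star] ** 2) ** (1 + p)
    return eps ** p * numerator / denominator


delta0 = (F(-10), F(-10))   # hypothetical: zero init with beta* = (10, 10)
eps = F(1, 1000)            # hypothetical early-stopping threshold

print("gap p:", p)
print("slow factor:", slow_factor, " fast factor:", fast_factor)
print("large learning rate:", eta)
print("Condition 2 left-hand side:", condition_2_lhs(delta0, eps))
```

The two factors are 3/2 and 3/4, which is where the factor of two in Claim 2 comes from. The left-hand side of Condition 2 shrinks as `eps` decreases or as the second coordinate of the initialization grows, so `alpha` can be made small by either choice.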