
On the convergence of the Stochastic Heavy Ball Method

by   Othmane Sebbouh, et al.

We provide a comprehensive analysis of the Stochastic Heavy Ball (SHB) method (otherwise known as the momentum method), including a convergence of the last iterate of SHB, establishing a faster rate of convergence than existing bounds on the last iterate of Stochastic Gradient Descent (SGD) in the convex setting. Our analysis shows that unlike SGD, no final iterate averaging is necessary with the SHB method. We detail new iteration dependent step sizes (learning rates) and momentum parameters for the SHB that result in this fast convergence. Moreover, assuming only smoothness and convexity, we prove that the iterates of SHB converge almost surely to a minimizer, and that the convergence of the function values of (S)HB is asymptotically faster than that of (S)GD in the overparametrized and in the deterministic settings. Our analysis is general, in that it includes all forms of mini-batching and non-uniform samplings as a special case, using an arbitrary sampling framework. Furthermore, our analysis does not rely on the bounded gradient assumptions. Instead, it only relies on smoothness, which is an assumption that can be more readily verified. Finally, we present extensive numerical experiments that show that our theoretically motivated parameter settings give a statistically significant faster convergence across a diverse collection of datasets.





1 Introduction

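Consider the problem of minimizing an average of loss functions

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where each $f_i$ is the loss function over the $i$-th data point. Let $\mathcal{X}_*$ be the set of solutions of (1).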

The interest in efficiently solving (1) is growing due to the significant growth in data sets. In particular, the number of data points can be exceedingly large. In this setting, stochastic gradient descent (SGD) RobbinsMonro:1951 type methods have proven to be very effective. In particular, a new strand of SGD type methods based on momentum and adaptive step sizes are quickly becoming the state-of-the-art.

While adaptive methods date back at least to ADAGrad ADAGRAD, it is the more recent notorious ADAM ADAM that has sparked a renewed interest in both momentum techniques and adaptive step sizes. ADAM has been shown to work very well in several settings Vaswani17; Radford15; Rastegari16, and with this practical success has now come a push to 1) provide theory that shows how to set the parameters so that these adaptive momentum methods work well, and 2) design better new adaptive methods. On the theoretical side, the initial proof of convergence of ADAM was shown to be incorrect AMSgrad, and several new methods with accompanying proofs have now been proposed as a solution, including AMSgrad AMSgrad, ADAMX ADAMX and more luo2018adaptive.

As far as we are aware, there exists no proof that these new adaptive momentum methods converge faster than plain vanilla SGD (despite their clear practical success). This is perhaps not surprising since even the simplest of the momentum-based methods, namely Stochastic Heavy Ball (SHB) has not been shown to converge faster than SGD. It is this gap that motivates our paper.

Here we provide a careful and comprehensive convergence theory of the stochastic heavy ball (SHB) method in the convex and strongly convex settings. The iterates of the SHB method are given by

$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k) + \beta_k (x_k - x_{k-1}), \qquad (2)$$

where the index $i_k$ is sampled i.i.d. at each iteration, and the step sizes $\alpha_k$ and momentum parameters $\beta_k$ are carefully chosen. A fixed momentum parameter is a standard setting, but here we show that different sequences of momentum parameters lead to better theoretical and practical performance.
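As an illustration of the recursion described above, here is a minimal sketch assuming the usual heavy ball update $x_{k+1} = x_k - \alpha_k g_k + \beta_k (x_k - x_{k-1})$ with the convention $x_{-1} = x_0$. The function names `shb` and `make_least_squares`, the toy least-squares problem and the uniform index sampling are all assumptions made for the example, not part of the paper.

```python
import numpy as np

def shb(grad_i, x0, n, alpha, beta, num_iters, seed=0):
    """Stochastic heavy ball with step sizes alpha(k) and momentum beta(k).

    Assumed update: x_{k+1} = x_k - alpha_k * g_k + beta_k * (x_k - x_{k-1}).
    """
    rng = np.random.default_rng(seed)
    x_prev = x0.copy()  # convention x_{-1} = x_0: the first momentum term is zero
    x = x0.copy()
    for k in range(num_iters):
        i = rng.integers(n)            # index sampled i.i.d., uniformly here
        g = grad_i(x, i)               # stochastic gradient of the i-th loss
        x_next = x - alpha(k) * g + beta(k) * (x - x_prev)
        x_prev, x = x, x_next
    return x

def make_least_squares(n=50, d=5, seed=1):
    """Hypothetical toy problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2, consistent system."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1).max()   # largest data vector has norm 1
    b = A @ rng.standard_normal(d)         # exact solution exists (interpolation)
    loss = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
    return loss, grad_i, n, d
```

Passing `alpha` and `beta` as functions of the iteration counter keeps the same loop usable for both fixed and iteration dependent parameters.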

In the deep neural network literature, the SHB method is more commonly written as


where and See Section A in the appendix for a proof of the equivalence between (2) and (3). When written in the form (3) the method is often known as simply the Momentum method Sutskever13; ruder2016overview.
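The equivalence between the two forms can be checked numerically. The sketch below assumes the momentum form is $m_k = \hat\beta_k m_{k-1} + g_k$, $x_{k+1} = x_k - \hat\alpha_k m_k$ with $m_{-1} = 0$; under that assumption, substituting $m_{k-1} = (x_{k-1} - x_k)/\hat\alpha_{k-1}$ gives heavy ball parameters $\alpha_k = \hat\alpha_k$ and $\beta_k = \hat\alpha_k \hat\beta_k / \hat\alpha_{k-1}$. The hatted symbols, the function names and the toy problem are assumptions for illustration and may differ from the paper's exact statement of (3).

```python
import numpy as np

def momentum_form(grad, x0, alpha_hat, beta_hat, num_iters):
    """Assumed momentum form: m_k = beta_hat_k * m_{k-1} + g_k, x_{k+1} = x_k - alpha_hat_k * m_k."""
    x, m = x0.copy(), np.zeros_like(x0)
    for k in range(num_iters):
        m = beta_hat[k] * m + grad(x, k)
        x = x - alpha_hat[k] * m
    return x

def heavy_ball_form(grad, x0, alpha_hat, beta_hat, num_iters):
    """Heavy ball form with alpha_k = alpha_hat_k and beta_k = alpha_hat_k * beta_hat_k / alpha_hat_{k-1}."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(num_iters):
        alpha = alpha_hat[k]
        # At k = 0 the momentum term vanishes because x_{-1} = x_0.
        beta = 0.0 if k == 0 else alpha_hat[k] * beta_hat[k] / alpha_hat[k - 1]
        x_prev, x = x, x - alpha * grad(x, k) + beta * (x - x_prev)
    return x

def check_equivalence(num_iters=100, n=20, d=3, seed=0):
    """Run both forms on the same gradient sequence and return the largest gap."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    b = rng.standard_normal(n)
    idx = rng.integers(n, size=num_iters)            # shared sampled indices
    grad = lambda x, k: (A[idx[k]] @ x - b[idx[k]]) * A[idx[k]]
    alpha_hat = rng.uniform(0.02, 0.05, size=num_iters)
    beta_hat = rng.uniform(0.0, 0.9, size=num_iters)
    x0 = np.zeros(d)
    xa = momentum_form(grad, x0, alpha_hat, beta_hat, num_iters)
    xb = heavy_ball_form(grad, x0, alpha_hat, beta_hat, num_iters)
    return float(np.max(np.abs(xa - xb)))
```

With constant parameters the mapping reduces to $\alpha = \hat\alpha$ and $\beta = \hat\beta$, which is why the two forms are often used interchangeably.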

1.1 Contributions and Background

An important focus of our work is providing an analysis for SHB which only depends on simple and verifiable assumptions. Our starting point is examining the existing assumptions for the analysis of SGD. Most convergence results on SGD depend on the bounded stochastic gradients or bounded stochastic gradient variance assumptions. If $g(x)$ is an unbiased estimate of the gradient or of a subgradient of $f$ at $x$, these assumptions can be written as:

$$\mathbb{E}\|g(x)\|^2 \le G^2 \quad \text{(BG)}, \qquad \mathbb{E}\|g(x) - \mathbb{E}[g(x)]\|^2 \le V^2 \quad \text{(BV)},$$

where $G, V > 0$ are constants. While using a uniform bound on the subgradients like (BG) seems often necessary to analyze stochastic subgradient descent Nemirovski09; Rakhlin12; Shamir13, this bound has been proven in Nguyen18 never to hold for a large class of convex functions, namely strongly convex ones. Similarly, it is possible to show that Assumption (BV), used for example in the analysis of an accelerated variant of SGD in Ghadimi13, does not hold for some convex functions (see Proposition 1 in Khaled20). Fueled by these observations, a recent line of work Nguyen18; Gower19; Khaled20 has emerged, which aims to avoid Assumptions (BG) and (BV). We follow this line of work. In all of our analysis, we will only assume that each $f_i$ is smooth and convex.

We now present our contributions to the analysis of SHB.

The deterministic Heavy Ball method.

The first local convergence of the deterministic Heavy Ball method was given in Polyak64, showing that it converges at an accelerated rate for twice differentiable strongly convex functions. Only recently did Ghadimi2014 show that the deterministic Heavy Ball method converged globally and sublinearly for smooth and convex functions.

Contributions. Our analysis recovers the results of Ghadimi2014 as a special case and extends them to the stochastic setting. Indeed, when specialized to the full batch case, our rates match theirs (up to a small constant factor difference).

Stochastic Heavy Ball analysis.

The SHB has recently been analysed for nonconvex functions and for strongly convex functions in Gadat18. For strongly convex functions, they prove a convergence rate for any . An analysis of SHB based on differential equations was given in Orvieto19. There, the authors use a similar Lyapunov function as Ghadimi2014, however, they rely on Assumption (BV). A convergence rate for SHB in the convex setting was given in Yang16, but again by relying on Assumptions (BG) and (BV). Furthermore, they provide a rate only for the average of the iterates rather than the last iterate of SHB. For the specialized setting of minimizing quadratics, it has been shown that the SHB converges linearly at an accelerated rate, but only in expectation rather than convergence in L2 Loizou2018. By using stronger assumptions on the noise as compared to Kidambi18, in Can19 the authors show that by using a specific parameter setting, the SHB applied on quadratics converges at an accelerated rate to a neighborhood of a minimizer.

Contributions. We provide the first proof of convergence of SHB in the general convex setting without assuming (BG) or (BV). Instead, we rely simply on the smoothness of the loss functions. Additionally, for strongly convex functions, we provide new iteration dependent parameters in Section H of the supplementary material that result in sublinear convergence of SHB.

Stochastic Gradient Descent analysis.

In the convex setting and without assuming that the gradients are bounded, only the average of the iterates of SGD has been shown to converge sublinearly to a neighborhood of the solution; see Theorem 6 in Vaswani18 (they show convergence to the minimum if the gradient noise at the optimum is zero). This contrasts with what works well in practice, which is using the last iterate of SGD. Motivated by this gap between theory and practice, it was proved very recently in Jain19 that a convergence rate of the last iterate of SGD can be attained using an elaborate step size scheme, but in a different setting, under Assumption (BG) and the assumption that $f$ is convex and Lipschitz over a closed bounded set. (Note that the suffix averaging scheme proposed in Nemirovski09 under Assumption (BG) results in a convergence rate, but when this result is specialized to the extreme case of picking the last iterate, the upper bound on the suboptimality is of the order . This contradicts Harvey19, which claims that for smooth convex functions, the last iterate of SGD was proven to converge in Nemirovski09 at a rate.)

Contributions. We prove that, in contrast with SGD, using a fixed step size, the last iterate of SHB converges sublinearly to a neighborhood of the minimum, and to the minimum exactly in the interpolation regime, which supports what is done in practice.

Parameter settings.

As a rule of thumb, the momentum parameter is often fixed at a constant value, which often exhibits better empirical performance than SGD Sutskever13. Despite this practical success, there exist simple linear regression problems where SHB is worse than SGD for any choice of a fixed momentum and step size.

Contributions. We provide iteration dependent formulae for updating the step size and momentum parameters that result in a fast convergence in theory and in practice. We show through extensive numerical experiments in Figure 1 that our new parameter setting is statistically superior to the standard rule-of-thumb settings on convex problems.

(S)HB is asymptotically faster than (S)GD.

The almost sure convergence of the iterates of SGD and SHB is a well-studied question Bottou03; Zhou17; Nguyen18; Gadat18. For SGD, the almost sure convergence of the iterates for functions satisfying , called variationally coherent, was shown in Bottou03 by assuming that the minimizer is unique. Recently in Zhou17, the uniqueness of the minimizer was dropped for variationally coherent functions, but again by assuming (BG). As for SHB, almost sure convergence to a minimizer for nonconvex functions was proven in Gadat18 under Assumption (BV) and an unusual elliptic condition which guarantees that SHB escapes any unstable point.

Contributions. Assuming only convexity and smoothness, we prove that the iterates of SHB converge almost surely to a minimizer. To the best of our knowledge, this is the first work proving the convergence of the iterates of a stochastic first-order method under these sole assumptions. Moreover, we prove that when the noise at the minimum is zero, which holds when the model is overparametrized (resp. when we use the full gradient at each iteration), SHB (resp. deterministic HB) converges at a $o(1/k)$ rate rather than the known $O(1/k)$ for SGD Vaswani18 (resp. GD).

Mini-batching and importance sampling.

Our analysis uses arbitrary sampling, which was introduced in Gower19. As such, it includes all forms of sampling of the data, such as mini-batching and importance sampling. We are even able to derive an optimal mini-batch size. Such analysis has been done for SGD Gower19, SVRG our_SVRG and SAGA SAGAminib. There appears to be no prior work analyzing SHB with mini-batching and other samplings.

1.2 Assumptions and arbitrary sampling

All of our theory only relies on the following assumption.

Assumption 1.1.

For all , there exists such that for every we have that


Let Consequently, is also smooth and we use to denote its smoothness constant.

So that we can analyze the SHB method under different forms of mini-batching and non-uniform sampling, we will use an arbitrary sampling vector, which was introduced by SAGAminib; Gower19.

Definition 1.2 (Arbitrary sampling).

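Let $v \in \mathbb{R}^n$ be a random vector drawn from some distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for $i = 1, \dots, n$.

We refer to $v$ in the above definition as an arbitrary sampling vector since we can use $v$ to encode any sampling of the functions $f_i$ and their gradients. Indeed, if we define $f_v(x) := \frac{1}{n}\sum_{i=1}^{n} v_i f_i(x)$, then $f_v(x)$ and $\nabla f_v(x)$ are unbiased estimates of $f(x)$ and $\nabla f(x)$ respectively. This follows from Definition 1.2 since

$$\mathbb{E}_{\mathcal{D}}[f_v(x)] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\mathcal{D}}[v_i]\, f_i(x) = f(x),$$

and analogously $\mathbb{E}_{\mathcal{D}}[\nabla f_v(x)] = \nabla f(x)$. This observation allows us to write an arbitrary sampling version for any stochastic gradient type method. In particular for the SHB method, instead of sampling a single function index $i_k$ at each iteration $k$, we sample a vector $v_k \sim \mathcal{D}$, and iterate

$$x_{k+1} = x_k - \alpha_k \nabla f_{v_k}(x_k) + \beta_k (x_k - x_{k-1}). \qquad (7)$$

For all our analysis we will use (7).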
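To make the sampling vector concrete, the sketch below encodes mini-batching without replacement, assuming the convention that $v_i = n/b$ when index $i$ falls in a uniformly drawn batch of size $b$ and $v_i = 0$ otherwise, so that each coordinate has expectation one. The function names and this particular encoding are assumptions for illustration; other samplings would use a different distribution over $v$.

```python
import itertools
import numpy as np

def minibatch_sampling_vector(n, b, rng):
    """Draw v with v_i = n/b for i in a uniform size-b batch (no replacement), else 0."""
    v = np.zeros(n)
    batch = rng.choice(n, size=b, replace=False)
    v[batch] = n / b
    return v

def grad_fv(per_example_grads, v):
    """Gradient of f_v(x) = (1/n) * sum_i v_i f_i(x); rows are the n per-example gradients."""
    return (v[:, None] * per_example_grads).mean(axis=0)

def exact_expectation_of_v(n, b):
    """Average v over every size-b subset, to check E[v_i] = 1 exactly (small n only)."""
    total, count = np.zeros(n), 0
    for batch in itertools.combinations(range(n), b):
        v = np.zeros(n)
        v[list(batch)] = n / b
        total += v
        count += 1
    return total / count
```

With this encoding, `grad_fv` evaluated at a sampled `v` is just the average of the gradients in the batch, and taking `v` equal to the all-ones vector recovers the full gradient.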

The sampling we use also affects how smooth our estimates are in expectation. This change in smoothness is captured by the Expected Smoothness constant that we introduce in the following lemma.

Lemma 1.3 (Expected smoothness JakSketch).

Let Assumption 1.1 hold and let be a sampling vector. It follows that there exists such that


This expected smoothness (8) also gives us a bound on the gradient noise.

Lemma 1.4.

Let be the residual gradient noise

If Assumption 1.1 holds then


Follows immediately by using (8) with for and

With this bound (9) on the gradient noise, we do not need to assume that the stochastic gradients are bounded such as in (BG) or (BV), as is often done when analyzing SGD Nemirovski09 or SHB Yang16. Instead, we simply employ (9), which is a direct consequence of Assumption 1.1. Note that the analysis carried out for SGD and SHB in Nemirovski09; Yang16 is more general and applies to the nonsmooth case, for which assuming (BG) is often necessary. But to our knowledge, there is no existing analysis for SHB without (BG) or (BV) for smooth and convex functions.

Both the expected smoothness constant and the residual gradient noise will appear in our analysis. Fortunately, we can calculate the expected smoothness constant. The exact expression of the constant depends on both the sampling and the smoothness constants of the functions , as we show next. For example, as conjectured in SAGAminib and proven in Gower19, for mini-batching with size without replacement we have that (9) holds for


where . Note that and , as expected, since corresponds to full batch gradients, or equivalently to using the deterministic HB. Similarly, , since corresponds to sampling one individual function. As for when , there is no easy way to estimate it, excluding for overparametrized models such as deep nets.
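The dependence on the batch size can be sketched in code. The closed forms below are an assumption: they are the expressions we understand Gower19 to give for mini-batching of size $b$ without replacement, written with hypothetical names (`L` for the smoothness constant of $f$, `L_max` for the largest individual smoothness constant, `sigma1_sq` for the gradient noise of single-example sampling). They should be checked against that reference before use; they do reproduce the two endpoint cases discussed in the text.

```python
def expected_smoothness(b, n, L, L_max):
    """Assumed expected smoothness constant for batch size b out of n examples (n > 1)."""
    return n * (b - 1) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max

def gradient_noise(b, n, sigma1_sq):
    """Assumed residual gradient noise for batch size b, given the noise at b = 1."""
    return (n - b) / (b * (n - 1)) * sigma1_sq
```

Under these formulas the full batch case `b = n` gives the smoothness constant of $f$ and zero noise, while `b = 1` gives the largest individual smoothness constant, matching the sanity checks stated above.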

overparametrized models. When our models have enough parameters to interpolate the data Vaswani18 then , , and consequently

Before moving on to our main theoretical results, we first present a lesser-known viewpoint of SHB as the iterate-moving-average method. It is this viewpoint that facilitates our forthcoming analysis.

2 An iterative averaging viewpoint of the stochastic heavy ball method

Our forthcoming analysis suggests the following new parametrization of SHB. (This iterate-moving-average method (12) was analyzed in Appendix H of Taylor19a. However, the link with SHB was not established.)

Theorem 2.1.

Let Consider the iterate-moving-average (IMA) method:


when we set . If


then the iterates in (12) are equal to the iterates of the SHB method (7).

The equivalence between this formulation and the original (7) is proven in the supplementary material (Section B).
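The equivalence can also be checked numerically. The sketch below assumes the iterate-moving-average recursion has the form $z_{k+1} = z_k - \eta_k g_k$, $x_{k+1} = \frac{\lambda_{k+1}}{\lambda_{k+1}+1} x_k + \frac{1}{\lambda_{k+1}+1} z_{k+1}$ with $z_0 = x_0$, where $g_k$ is the stochastic gradient at $x_k$, and that the corresponding SHB parameters are $\alpha_k = \eta_k/(\lambda_{k+1}+1)$ and $\beta_k = \lambda_k/(\lambda_{k+1}+1)$. The symbols $\eta_k$, $\lambda_k$, $z_k$, the function names and the toy problem are assumptions for illustration.

```python
import numpy as np

def ima(grad, x0, eta, lam, num_iters):
    """Assumed IMA form: gradient step on z, then x moves to a weighted average of x and z."""
    x, z = x0.copy(), x0.copy()
    for k in range(num_iters):
        z = z - eta[k] * grad(x, k)                       # gradient evaluated at x_k
        x = (lam[k + 1] * x + z) / (lam[k + 1] + 1.0)     # moving average of the z iterates
    return x

def shb_from_ima(grad, x0, eta, lam, num_iters):
    """SHB with alpha_k = eta_k / (lam_{k+1} + 1) and beta_k = lam_k / (lam_{k+1} + 1)."""
    x_prev, x = x0.copy(), x0.copy()                      # x_{-1} = x_0
    for k in range(num_iters):
        alpha = eta[k] / (lam[k + 1] + 1.0)
        beta = lam[k] / (lam[k + 1] + 1.0)
        x_prev, x = x, x - alpha * grad(x, k) + beta * (x - x_prev)
    return x

def ima_shb_gap(num_iters=200, n=20, d=3, seed=0):
    """Largest coordinate gap between the two methods on a shared gradient sequence."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    b = rng.standard_normal(n)
    idx = rng.integers(n, size=num_iters)
    grad = lambda x, k: (A[idx[k]] @ x - b[idx[k]]) * A[idx[k]]
    eta = rng.uniform(0.02, 0.1, size=num_iters)
    lam = rng.uniform(0.0, 5.0, size=num_iters + 1)       # arbitrary nonnegative weights
    x0 = np.ones(d)
    return float(np.max(np.abs(ima(grad, x0, eta, lam, num_iters)
                               - shb_from_ima(grad, x0, eta, lam, num_iters))))
```

The check uses arbitrary nonnegative $\lambda_k$ on purpose: under the assumed forms, the identity is purely algebraic and does not depend on any particular parameter schedule.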

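In all of our theorems, the parameters $\eta_k$ and $\lambda_k$ of the iterate-moving-average form naturally arise in the recurrences and Lyapunov function. As such, we determine how to set the parameters $\eta_k$ and $\lambda_k$, which in turn gives settings for the step sizes $\alpha_k$ and momentum parameters $\beta_k$ through (13).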

Having new reformulations often leads to new insights. This is the case for Nesterov’s accelerated gradient method, where at least six forms are known adefazio-curvedgeom2019 and recent research suggests that iterate-averaged reformulations are the easiest to generalize to the combined proximal & variance-reduced case lan2017.

3 Convex case

Our first theorem provides an upper bound on the suboptimality given any sampling and any sequence of step sizes. Later we develop special cases of this theorem through different choices of the parameters.

Theorem 3.1.

Let and consider the iterates (7). Let be a sequence such that for all . Define






Note that in Theorem 3.1 the only free parameters are the $\eta_k$'s, which in the iterate-moving-average viewpoint (12) play the role of a learning rate. All our other parameters, including the step sizes $\alpha_k$ and the momentum parameters $\beta_k$, are given once we have chosen $\eta_k$. We now explore three different settings of the $\eta_k$'s in the following subsections.

3.1 Convergence to a neighborhood of the minimum

Using a constant $\eta_k$ in Theorem 3.1 gives an interesting new sequence of decreasing step sizes $\alpha_k$ and increasing momentum parameters $\beta_k$, as we show in the next corollary.

Corollary 3.2.

Let If we set


we have and . Then the iterates of SHB (7) converge according to


In particular for we have that and , which gives


Corollary 3.2 shows how to set the parameters of SHB so that the last iterate converges sublinearly to a neighborhood of the minimum. In particular, for overparametrized models with zero gradient noise at the minimum, the last iterate of SHB converges sublinearly to the minimum. This same result was only known to hold for the average of the iterates of SGD Vaswani18. Moreover, when using the full gradient, which corresponds to sampling all individual gradients, the gradient noise vanishes and the expected smoothness constant reduces to the smoothness constant of $f$, which recovers the rate derived in Ghadimi2014 for the deterministic HB method up to a constant.
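One schedule consistent with this description (decreasing step sizes, increasing momentum) is $\alpha_k = 2\eta/(k+3)$ and $\beta_k = k/(k+3)$. The sketch below is an assumption, not a statement of the corollary: it takes a constant $\eta$ and averaging weights $\lambda_k = k/2$ in the iterate-moving-average mapping $\alpha_k = \eta/(\lambda_{k+1}+1)$, $\beta_k = \lambda_k/(\lambda_{k+1}+1)$. The function name `shb_schedule` is hypothetical.

```python
def shb_schedule(k, eta):
    """Step size and momentum at iteration k, assuming constant eta and lambda_k = k / 2."""
    lam_k = k / 2.0
    lam_next = (k + 1) / 2.0
    alpha_k = eta / (1.0 + lam_next)     # simplifies to 2 * eta / (k + 3)
    beta_k = lam_k / (1.0 + lam_next)    # simplifies to k / (k + 3)
    return alpha_k, beta_k
```

Under this assumed schedule the momentum starts at zero and approaches one as $k$ grows, while the step size decays like $1/k$, in contrast to the fixed momentum used in common practice.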

We can also translate this and the following convergence results into convenient complexity results, which we defer to the appendix (Section F) due to lack of space. We can also specialize our results to different forms of samplings and derive the mini-batch size which minimizes the total complexity, which we also defer to the appendix (Section G).

3.2 Exact convergence to the minimum

Now we provide parameter settings that guarantee convergence to the minimum.

Corollary 3.3.

Consider the setting in Theorem 3.1. If we set , where then the SHB method converges according to


For the step size and momentum parameters are given by (15) where

This convergence rate is the same rate that can be derived using a weighted average of the iterates of SGD, as is done by Nemirovski09. Next we show how to drop the factor (21) if we know the stopping time of the algorithm. Note that using the stopping time to drop such terms was first introduced in Nemirovski09 for the analysis of the average of the iterates of SGD.

Corollary 3.4 (Convergence with known stopping time).

Suppose Algorithm (7) is run for iterations. Set for all , where , in Theorem 3.1. Then it follows directly from (18) that


4 Faster asymptotic convergence

In this section, we show that SHB is asymptotically faster than SGD when the model is overparametrized, and that the deterministic HB is asymptotically faster than Gradient Descent. Here we use a.s. as an abbreviation of almost surely, otherwise also known as convergence with probability one.

Moreover, we prove that the iterates of SHB (2) converge a.s. to a minimizer.

Theorem 4.1.

Consider the iterates of (2) and the setting of Theorem 3.1. Choose , such that and . With for all we have a.s. that

  1. for some ,

  2. for any , .

Note that when specialized to full gradients sampling, i.e. when we use the deterministic HB method, our results hold without the need for almost sure statements. This is another benefit of our analysis, since it unifies the analysis of both the stochastic and the deterministic versions of the HB method.

To the best of our knowledge, Theorem 4.1 is the first result showing that the iterates of a stochastic first-order method converge to a minimizer assuming only smoothness and convexity. Indeed, existing results on the a.s. convergence of the iterates of SGD or SHB all assume either (BG), (BV) or the unicity of the minimizer Bottou03; Zhou17; Nguyen18; Gadat18. For overparametrized models, Theorem 4.1 shows that converges faster than .

Corollary 4.2.

Assume and let for all . By Theorem 4.1 we have

This corollary has fundamental implications in the deterministic and the stochastic case. In the deterministic case, the noise at the minimum is always zero. Thus Corollary 4.2 shows that the HB method is asymptotically faster than gradient descent, since gradient descent is only known to converge at a $O(1/k)$ rate. In the stochastic and overparametrized regime, this also shows that SHB is asymptotically faster than SGD with averaging, which is only guaranteed to converge at a $O(1/k)$ rate Vaswani18.

It seems that it is our new iteration-dependent momentum coefficients that enable this new fast 'small o' convergence of the objective values. Indeed, in Attouch16 the authors also showed that a version of (deterministic) Nesterov's Accelerated Gradient algorithm with carefully chosen iteration dependent momentum coefficients converges at a $o(1/k^2)$ rate rather than the previously known $O(1/k^2)$.

5 Experiments

For our experiments, we selected a diverse set of multi-class classification problems from the LibSVM repository, 25 problems in total. These datasets range from a few classes to a thousand, and they vary from hundreds of data-points to hundreds of thousands. We normalized each dataset by a constant so that the largest data vector had norm

. We used a multi-class logistic regression loss with no regularization so we could test the non-strongly convex convergence properties, and we ran for 50 epochs with no batching.

Here we compare the parameter setting given by our theory against three common alternative parameter settings used throughout the machine learning literature: SGD with fixed momentum of 0.9 and 0.99 as well as no momentum, as given in (3). We left the effective step size of these three methods to be determined through a grid search.

We use SHB to denote our method (7) with and set using (15) and left as a constant to be determined through grid search.

For the grid search, we used a power-of-2 grid (step sizes of the form $2^i$), we ran 5 random seeds and chose the learning rate that gave the lowest loss on average for each combination of problem and method. We widened the grid search as necessary for each combination to ensure that the chosen learning rate was not from the endpoints of our grid search.
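The grid search procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: `run(lr, seed)` is a hypothetical user-supplied function returning the final training loss, and the initial exponent range and the cap on the number of widenings are arbitrary choices.

```python
import statistics

def grid_search(run, lo=-4, hi=4, seeds=range(5), max_widen=20):
    """Pick the power-of-2 learning rate with the lowest mean loss over the seeds.

    The grid [2**lo, ..., 2**hi] is widened by one exponent whenever the best
    value sits at an endpoint, so the returned rate is interior when possible.
    """
    scores = {}

    def score(exponent):
        # Cache results so widening the grid does not re-run old configurations.
        if exponent not in scores:
            scores[exponent] = statistics.mean(run(2.0 ** exponent, s) for s in seeds)
        return scores[exponent]

    best = lo
    for _ in range(max_widen):
        best = min(range(lo, hi + 1), key=score)
        if best == lo:
            lo -= 1
        elif best == hi:
            hi += 1
        else:
            break
    return 2.0 ** best, scores[best]
```

A run that diverges would need to return a large finite loss (or infinity) for the comparison to behave sensibly; that handling is left to the hypothetical `run` function.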

Since the and constants in our method depend on the smoothness constant , we set these parameters using (15) and the assumption that , so that . Although it is possible to give a closed-form bound for the Lipschitz smoothness constant for our test problems, the above setting is less conservative and has the advantage of being usable without requiring any knowledge about the problem structure.

We then ran 40 different random seeds to produce Figure 1. To determine which method, if any, was best on each problem, we performed t-tests with Bonferroni correction, and we report how often each method was statistically significantly superior to all of the other three methods in Table 1. The stochastic heavy ball method using our theoretically motivated parameter settings performed better than all other methods on 11 of the 25 problems. On the remaining problems, no other method was statistically significantly better than all of the rest.
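The decision rule behind such a comparison can be sketched as below. The sketch assumes that pairwise p-values have already been computed by some two-sample t-test, that the Bonferroni correction divides the significance level by the number of pairwise tests on a problem, and that a method is declared best only if it has the lowest mean loss and beats every other method at the corrected level. The function name, the significance level of 0.05 and the choice of test family are assumptions; the text does not specify them.

```python
from itertools import combinations

def best_method(mean_loss, p_value, alpha=0.05):
    """Return the significantly best method on one problem, or None if there is none.

    mean_loss: dict mapping method name to mean final loss over the seeds.
    p_value:   dict mapping a sorted pair of method names to a t-test p-value.
    """
    methods = sorted(mean_loss)
    pairs = list(combinations(methods, 2))
    threshold = alpha / len(pairs)            # Bonferroni: split alpha across all tests
    candidate = min(methods, key=mean_loss.get)
    for other in methods:
        if other == candidate:
            continue
        key = tuple(sorted((candidate, other)))
        if p_value[key] >= threshold:         # not separated from this competitor
            return None
    return candidate
```

Counting how often each method is returned across problems, and how often the result is `None`, would give a summary in the shape of Table 1.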

                   SHB   SGD   Momentum 0.9   Momentum 0.99   No best method
Best method for    11    0     0              0               14

Table 1: Count of how many problems each method is statistically significantly superior to the rest on.

Figure 1: Average training error convergence plots for 25 LibSVM datasets, using the best learning rate for each method and problem combination. Averages are over 40 runs. Error bars show a range of +/- 2SE.

Broader Impact

This work develops the theory and a new viewpoint of a commonly used method (the Momentum method) for training supervised machine learning methods. We give new parameter settings that we believe will reduce the training time. Furthermore, we develop new iterate-moving-average viewpoints that we believe can also lead to new insights and understanding of all momentum-based methods. Given that we do not envision any particular application, nor does this work open up any new applications, we see no ethical or immediate societal consequences.

In the appendix, we proceed to prove the results we derive in the main paper, then we present the optimal minibatch size to use for SHB depending on the problem setting in Section G. In Section H, we extend the theory developed in Section 3 to the strongly convex case, and show that SHB improves over the last iterate convergence result for SGD by a constant.

Appendix A Heavy ball and Momentum are the same thing

To see that (2) and (3) are equivalent we first expand (3) so that

Now using that which rearranged gives in the above gives

which after substituting gives the equivalence.

Appendix B Proof of Theorem 2.1


Consider the iterate-averaging method


and let


Substituting (22) into (23) gives


Now using (23) at the previous iteration we have that

Substituting the above into (25) gives

Consequently by using (24) gives the result. ∎
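As a worked version of this argument, assume the iterate-averaging method takes the form $z_{k+1} = z_k - \eta_k \nabla f_{v_k}(x_k)$ and $x_{k+1} = \frac{\lambda_{k+1}}{\lambda_{k+1}+1} x_k + \frac{1}{\lambda_{k+1}+1} z_{k+1}$. The symbols $z_k$, $\eta_k$ and $\lambda_k$ are assumptions here, and the steps are not matched to the equation numbers (22) to (25). Solving the averaging step for $z$ at two consecutive iterations and substituting into the gradient step gives:

```latex
\begin{align*}
z_{k+1} &= (\lambda_{k+1}+1)\,x_{k+1} - \lambda_{k+1}\,x_k, \\
z_k &= (\lambda_k+1)\,x_k - \lambda_k\,x_{k-1}, \\
(\lambda_{k+1}+1)\,x_{k+1} - \lambda_{k+1}\,x_k
  &= (\lambda_k+1)\,x_k - \lambda_k\,x_{k-1} - \eta_k \nabla f_{v_k}(x_k), \\
x_{k+1} &= x_k - \frac{\eta_k}{\lambda_{k+1}+1}\,\nabla f_{v_k}(x_k)
  + \frac{\lambda_k}{\lambda_{k+1}+1}\,(x_k - x_{k-1}).
\end{align*}
```

Under these assumptions the last line is the SHB update with $\alpha_k = \eta_k/(\lambda_{k+1}+1)$ and $\beta_k = \lambda_k/(\lambda_{k+1}+1)$.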

Appendix C Proof of Theorem 3.1

The proof uses the following Lyapunov function



We have

where we used in (E) that and Then taking conditional expectation we have


Since we have that

Using this in (27) then taking expectation and rearranging gives

Summing over to and using a telescopic sum, we have

where we used that Thus, writing explicitly, gives

Appendix D Proof of Corollary 3.3


Using the integral bound and plugging in our choice of gives


Furthermore using the integral bound again we have that


Now using (28) and (29) we have that


Using (28) and (30) in (16) gives (20).

As for the parameter settings, note that

For the above gives

Thus by maintaining and updating we can compute the step sizes and momentum parameters using (15). ∎

Appendix E Proof of Theorem 4.1

A necessary tool to prove Theorem 4.1 is the following Robbins-Siegmund theorem Robbins71.

Lemma E.1 (Simplified Robbins-Siegmund Theorem).

Consider a filtration and nonnegative sequences of adapted processes , and such that

  • (31)

Then, converges and almost surely.

In the remainder of this section, we consider the iterates of (2) and the setting of Theorem 4.1, that is:


where , and . We also define:


To make the proof more readable, we first state the two following lemmas, for which we give a proof after the proof of the theorem.

Lemma E.2.

Lemma E.3.

and thus,

We can now prove Theorem 4.1.

Proof of the theorem.

This proof aims at proving that, a.s.

  1. for some .

  2. for any ,

On the way to proving the first point, we will prove the second point as a byproduct.

We will now prove that exists .






We will first prove that exists , then that exists .

First, we have from Lemma E.3 that converges (to ) . Hence, it remains to show that exists

From (71), we have that:


By definition of , we have . Therefore, noting


we have:


But from Lemma E.2, we have . Moreover, . Hence, we have by Lemma E.1 that exists almost surely.

Moreover, by Lemma E.3, , and we have