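Consider the problem of minimizing an average of loss functions
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$
where each $f_i$ is the loss function over the $i$th data point. Let $\mathcal{X}_*$ be the set of solutions of (1).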
The interest in efficiently solving (1) is growing with the size of modern data sets, in which the number of data points can be exceedingly large. In this setting, stochastic gradient descent (SGD) RobbinsMonro:1951 type methods have proven to be very effective. In particular, a new strand of SGD type methods based on momentum and adaptive step sizes is quickly becoming the state of the art.
While adaptive methods date back at least to ADAGrad ADAGRAD, it is the more recent and well-known ADAM ADAM that has sparked a renewed interest in both momentum techniques and adaptive step sizes. ADAM has been shown to work very well in several settings Vaswani17; Radford15; Rastegari16, and with this practical success has now come a push to 1) provide theory that shows how to set the parameters so that these adaptive momentum methods work well and 2) design better new adaptive methods. On the theoretical side, the initial proof of convergence of ADAM was shown to be incorrect AMSgrad, and several new methods with accompanying proofs have now been proposed as a solution, including AMSgrad AMSgrad, ADAMX ADAMX and more luo2018adaptive.
As far as we are aware, there exists no proof that these new adaptive momentum methods converge faster than plain vanilla SGD (despite their clear practical success). This is perhaps not surprising since even the simplest of the momentum-based methods, namely Stochastic Heavy Ball (SHB), has not been shown to converge faster than SGD. It is this gap that motivates our paper.
Here we provide a careful and comprehensive convergence theory of the stochastic heavy ball (SHB) method in the convex and strongly convex setting. The iterates of the SHB method are given by
$$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k) + \beta_k (x_k - x_{k-1}), \tag{2}$$
where the index $i_k$ is sampled i.i.d. at each iteration, and the step sizes $\alpha_k$ and momentum parameters $\beta_k$ are carefully chosen. Typically, a constant momentum parameter is a standard setting, but here we show that different sequences of momentum parameters lead to better theoretical and practical performance.
In the deep neural network literature, the SHB method is more commonly written as
where and See Section A in the appendix for a proof of the equivalence between (2) and (3). When written in the form (3) the method is often known as simply the Momentum method Sutskever13; ruder2016overview.
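The equivalence between the heavy ball form and the momentum form can be checked numerically. The following is a minimal sketch, assuming constant step size and momentum parameters, the initializations $x_{-1} = x_0$ and $m_{-1} = 0$, and a hypothetical one-dimensional least-squares problem; the function names and the data are illustrative and do not come from the paper.

```python
import random

# Hypothetical toy data: f_i(x) = 0.5 * (a_i * x - b_i)^2, a 1-D least-squares problem.
A = [1.0, 2.0, -1.5, 0.5]
B = [0.5, -1.0, 2.0, 1.0]


def grad_i(x, i):
    """Gradient of the i-th loss f_i at x."""
    return A[i] * (A[i] * x - B[i])


def shb_heavy_ball(x0, alpha, beta, indices):
    """Heavy ball form: x_{k+1} = x_k - alpha * g_k + beta * (x_k - x_{k-1}), x_{-1} = x_0."""
    x_prev, x = x0, x0
    for i in indices:
        # The right-hand side is evaluated with the old values of x and x_prev.
        x_prev, x = x, x - alpha * grad_i(x, i) + beta * (x - x_prev)
    return x


def shb_momentum(x0, alpha, beta, indices):
    """Momentum form: m_k = beta * m_{k-1} + g_k, x_{k+1} = x_k - alpha * m_k, m_{-1} = 0."""
    x, m = x0, 0.0
    for i in indices:
        m = beta * m + grad_i(x, i)
        x = x - alpha * m
    return x


rng = random.Random(0)
indices = [rng.randrange(len(A)) for _ in range(200)]  # shared i.i.d. index sequence

x_hb = shb_heavy_ball(1.0, alpha=0.05, beta=0.9, indices=indices)
x_mom = shb_momentum(1.0, alpha=0.05, beta=0.9, indices=indices)
print(abs(x_hb - x_mom))  # the two forms agree up to floating-point rounding
```

Both routines consume the same sequence of sampled indices, so any difference between the two final iterates is due to rounding only. With iteration-dependent parameters the correspondence between the two forms requires rescaling, which this sketch does not cover.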
1.1 Contributions and Background
An important focus of our work is providing an analysis for SHB which only depends on simple and verifiable assumptions. Our starting point is examining the existing assumptions for the analysis of SGD. Most convergence results on SGD depend on the bounded stochastic gradients or bounded stochastic gradients variance assumptions. If $g(x)$ is an unbiased estimate of the gradient or of a subgradient of $f$ at $x$, these assumptions can be written, for some constants $G, V > 0$, as:
$$\mathbb{E}\|g(x)\|^2 \le G^2, \tag{BG}$$
$$\mathbb{E}\|g(x) - \mathbb{E}[g(x)]\|^2 \le V^2. \tag{BV}$$
We now present our contributions to the analysis of SHB.
The deterministic Heavy Ball method.
The first local convergence of the deterministic Heavy Ball method was given in Polyak64, showing that it converges at an accelerated rate for twice differentiable strongly convex functions. Only recently did Ghadimi2014 show that the deterministic Heavy Ball method converged globally and sublinearly for smooth and convex functions.
Contributions. Our analysis recovers the results of Ghadimi2014 as a special case and extends them to the stochastic setting. Indeed, when specialized to the full batch case, our rates match theirs up to a small constant factor difference.
Stochastic Heavy Ball analysis.
The SHB has recently been analysed for nonconvex functions and for strongly convex functions in Gadat18. For strongly convex functions, they prove a convergence rate for any . An analysis of SHB based on differential equations was given in Orvieto19. There, the authors use a Lyapunov function similar to that of Ghadimi2014; however, they rely on Assumption (BV). A convergence rate for SHB in the convex setting was given in Yang16, but again by relying on Assumptions (BG) and (BV). Furthermore, they provide a rate only for the average of the iterates rather than the last iterate of SHB. For the specialized setting of minimizing quadratics, it has been shown that the SHB converges linearly at an accelerated rate, but only in expectation rather than in L2 Loizou2018. By using stronger assumptions on the noise as compared to Kidambi18, the authors of Can19 show that, with a specific parameter setting, the SHB applied to quadratics converges at an accelerated rate to a neighborhood of a minimizer.
Contributions. We provide the first proof of convergence of SHB in the general convex setting without assuming (BG) or (BV). Instead, we rely simply on the smoothness of the loss functions. Additionally, for strongly convex functions, we provide new iteration-dependent parameters in Section H of the supplementary material that result in sublinear convergence of SHB.
Stochastic Gradient Descent analysis.
In the convex setting and without assuming that the gradients are bounded, only the average of the iterates of SGD has been shown to converge sublinearly to a neighborhood of the solution; see Theorem 6 in Vaswani18 (they show convergence to the minimum if the gradient noise at the optimum is zero). This contrasts with what works well in practice, which is using the last iterate of SGD. Motivated by this gap between theory and practice, it was proved very recently in Jain19 that a convergence rate of the last iterate of SGD can be attained using an elaborate step size scheme, but in a different setting, under Assumption (BG) and the assumption that $f$ is convex and Lipschitz over a closed bounded set. (Note that the suffix averaging scheme proposed in Nemirovski09 under Assumption (BG) results in a convergence rate, but when this result is specialized to the extreme case of picking the last iterate, the upper bound on the suboptimality is of the order . This contradicts Harvey19, which claims that for smooth convex functions, the last iterate of SGD was proven to converge in Nemirovski09 at a rate.)
Contributions. We prove that, in contrast with SGD, using a fixed step size, the last iterate of SHB converges sublinearly to a neighborhood of the minimum and to the minimum exactly in the interpolation regime, which supports what is done in practice.
As a rule of thumb, the momentum parameter is often fixed at around , which often exhibits better empirical performance than SGD Sutskever13. Despite this practical success, there exist simple linear regression problems where SHB is worse than SGD for any choice of a fixed momentum and step size Kidambi18.
Contributions. We provide iteration dependent formulae for updating the step size and momentum parameters that result in a fast convergence in theory and in practice. We show through extensive numerical experiments in Figure 1 that our new parameter setting is statistically superior to the standard rule-of-thumb settings on convex problems.
(S)HB is asymptotically faster than (S)GD.
The almost sure convergence of the iterates of SGD and SHB is a well-studied question Bottou03; Zhou17; Nguyen18; Gadat18. For SGD, the almost sure convergence of the iterates for functions satisfying , called variationally coherent, was shown in Bottou03 by assuming that the minimizer is unique. Recently in Zhou17, the uniqueness assumption on the minimizer was dropped for variationally coherent functions, but again by assuming (BG). As for SHB, almost sure convergence to a minimizer for nonconvex functions was proven in Gadat18 under Assumption (BV) and an unusual elliptic condition which guarantees that SHB escapes any unstable point.
Contributions. Assuming only convexity and smoothness, we prove that the iterates of SHB converge almost surely to a minimizer. To the best of our knowledge, this is the first work proving the convergence of the iterates of a stochastic first-order method under these sole assumptions. Moreover, we prove that when the noise at the minimum is zero, which holds when the model is overparametrized (resp. when we use the full gradient at each iteration), SHB (resp. deterministic HB) converges at a $o(1/k)$ rate rather than the known $O(1/k)$ for SGD Vaswani18 (resp. GD).
Mini-batching and importance sampling.
Our analysis uses arbitrary sampling, which was introduced in Gower19. As such, it includes all forms of sampling of the data, such as mini-batching and importance sampling. We are even able to derive an optimal mini-batch size. Such analysis has been done for SGD Gower19, SVRG our_SVRG and SAGA SAGAminib. There appears to be no prior work analyzing SHB with mini-batching and other samplings.
1.2 Assumptions and arbitrary sampling
All of our theory only relies on the following assumption.
For all , there exists such that for every we have that
Let Consequently, is also smooth and we use to denote its smoothness constant.
So that we can analyze the SHB method under different forms of mini-batching and non-uniform sampling, we will use an arbitrary sampling vector, which was introduced by SAGAminib; Gower19.
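Definition 1.2 (Arbitrary sampling).
Let $v \in \mathbb{R}^n$ be a random vector drawn from some distribution $\mathcal{D}$ such that $\mathbb{E}_{\mathcal{D}}[v_i] = 1$ for $i = 1, \dots, n$.
We refer to $v$ in the above definition as an arbitrary sampling vector since we can use $v$ to encode any sampling of the functions $f_i$ and their gradients. Indeed, if we define $f_v(x) := \frac{1}{n}\sum_{i=1}^{n} v_i f_i(x)$, then $f_v(x)$ and $\nabla f_v(x)$ are unbiased estimates of $f(x)$ and $\nabla f(x)$ respectively. This follows from Definition 1.2 since
$$\mathbb{E}_{\mathcal{D}}[f_v(x)] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\mathcal{D}}[v_i]\, f_i(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) = f(x),$$
and analogously $\mathbb{E}_{\mathcal{D}}[\nabla f_v(x)] = \nabla f(x)$. This observation allows us to write an arbitrary sampling version for any stochastic gradient type method. In particular for the SHB method, instead of sampling a single function index $i_k$ at each iteration $k$, we sample a vector $v_k \sim \mathcal{D}$, and iterate
$$x_{k+1} = x_k - \alpha_k \nabla f_{v_k}(x_k) + \beta_k (x_k - x_{k-1}). \tag{7}$$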
For all our analysis we will use (7).
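To make the sampling vector concrete, here is a small sketch for mini-batching without replacement. It assumes the common construction in which an entry equals $n/b$ if its index is in the batch and $0$ otherwise, so that every entry has expectation one; this construction and all function names are illustrative assumptions rather than definitions taken from the text. Exact rational arithmetic is used so that unbiasedness can be checked by enumerating every batch.

```python
from fractions import Fraction
from itertools import combinations


def minibatch_sampling_vector(batch, n):
    """Assumed construction: v_i = n/b if i is in the batch, else 0."""
    b = len(batch)
    members = set(batch)
    return [Fraction(n, b) if i in members else Fraction(0) for i in range(n)]


def expected_sampling_vector(n, b):
    """Exact E[v] when the batch is uniform over all size-b subsets of {0, ..., n-1}."""
    batches = list(combinations(range(n), b))
    total = [Fraction(0)] * n
    for batch in batches:
        v = minibatch_sampling_vector(batch, n)
        total = [t + vi for t, vi in zip(total, v)]
    return [t / len(batches) for t in total]


def sampled_gradient(v, grads):
    """Gradient estimate (1/n) * sum_i v_i * grad_i built from a sampling vector."""
    n = len(grads)
    return sum(vi * gi for vi, gi in zip(v, grads)) / n


mean_v = expected_sampling_vector(n=5, b=2)
print(mean_v)  # every entry should equal 1

# With the batch {0, 1}, the estimate reduces to the average of the batch gradients.
toy_grads = [1, 2, 3, 4, 5]  # hypothetical scalar gradients of f_1, ..., f_5
estimate = sampled_gradient(minibatch_sampling_vector((0, 1), 5), toy_grads)
```

Because each entry of the averaged vector equals one, the weighted gradient is an unbiased estimate of the full gradient. Other samplings, such as importance sampling, only change how the vector is drawn.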
The sampling we use also affects how smooth our estimates are in expectation. This change in smoothness is captured by the Expected Smoothness constant that we introduce in the following lemma.
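Lemma 1.3 (Expected smoothness JakSketch).
Let Assumption 1.1 hold and let $v$ be a sampling vector. It follows that there exists $\mathcal{L} > 0$ such that
$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_v(x) - \nabla f_v(x_*)\|^2\right] \le 2\mathcal{L}\,\big(f(x) - f(x_*)\big). \tag{8}$$
This expected smoothness (8) also gives us a bound on the gradient noise.
Let $\sigma^2$ be the residual gradient noise
$$\sigma^2 := \sup_{x_* \in \mathcal{X}_*} \mathbb{E}_{\mathcal{D}}\left[\|\nabla f_v(x_*)\|^2\right].$$
If Assumption 1.1 holds then
$$\mathbb{E}_{\mathcal{D}}\left[\|\nabla f_v(x)\|^2\right] \le 4\mathcal{L}\,\big(f(x) - f(x_*)\big) + 2\sigma^2. \tag{9}$$
Follows immediately by using (8) with $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for $a = \nabla f_v(x) - \nabla f_v(x_*)$ and $b = \nabla f_v(x_*)$. ∎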
With this bound (9) on the gradient noise, we do not need to assume that the stochastic gradients are bounded such as in (BG) or (BV), as is often done when analyzing SGD Nemirovski09 or SHB Yang16. Instead, we simply employ (9), which is a direct consequence of Assumption 1.1. Note that the analysis carried out for SGD and SHB in Nemirovski09; Yang16 is more general and applies to the nonsmooth case, for which assuming (BG) is often necessary. But to our knowledge, there is no existing analysis for SHB without (BG) or (BV) for smooth and convex functions.
Both the expected smoothness constant and the residual gradient noise will appear in our analysis. Fortunately, we can calculate the expected smoothness constant. The exact expression of the constant depends on both the sampling and the smoothness constants of the functions , as we show next. For example, as conjectured in SAGAminib and proven in Gower19, for mini-batching with size without replacement we have that (9) holds for
where . Note that and , as expected, since corresponds to full batch gradients, or equivalently to using the deterministic HB. Similarly, , since corresponds to sampling one individual function. As for when , there is no easy way to estimate it, excluding for overparametrized models such as deep nets.
Overparametrized models. When our models have enough parameters to interpolate the data Vaswani18 then , , and consequently
Before moving on to our main theoretical results, we first present a lesser-known viewpoint of SHB as the iterate-moving-average method. It is this viewpoint that facilitates our forthcoming analysis.
2 An iterative averaging viewpoint of the stochastic heavy ball method
Our forthcoming analysis suggests the following new parametrization of SHB. (This iterate-moving-average method (12) was analyzed in Appendix H of Taylor19a; however, the link with SHB was not established.)
In all of our theorems, the parameters and naturally arise in the recurrences and Lyapunov function. As such, we determine how to set the parameters and which in turn gives settings for and through (13).
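The following sketch illustrates how an iterate-moving-average scheme can reproduce heavy ball steps. The update rules and the map from the averaging parameters to a step size and momentum are assumptions written for illustration, since the displayed equations are not reproduced here: the sketch takes $z_{k+1} = z_k - \eta_k g_k$ and $x_{k+1} = (\lambda_{k+1} x_k + z_{k+1}) / (\lambda_{k+1} + 1)$ with $z_0 = x_0$, and compares them with heavy ball steps using $\alpha_k = \eta_k / (1 + \lambda_{k+1})$ and $\beta_k = \lambda_k / (1 + \lambda_{k+1})$. The names `eta`, `lam`, and the toy objective are hypothetical.

```python
def grad(x):
    """Gradient of the hypothetical toy objective f(x) = 0.5 * (x - 3)^2."""
    return x - 3.0


def ima(x0, etas, lams):
    """Assumed iterate-moving-average form, started from z_0 = x_0.

    lams[k] holds lam_k, so lams needs len(etas) + 1 entries.
    """
    x, z = x0, x0
    for k, eta in enumerate(etas):
        z = z - eta * grad(x)                             # gradient step on z
        x = (lams[k + 1] * x + z) / (lams[k + 1] + 1.0)   # weighted average of x and z
    return x


def shb(x0, etas, lams):
    """Heavy ball form with the assumed parameter map, started from x_{-1} = x_0."""
    x_prev, x = x0, x0
    for k, eta in enumerate(etas):
        alpha = eta / (1.0 + lams[k + 1])
        beta = lams[k] / (1.0 + lams[k + 1])
        x_prev, x = x, x - alpha * grad(x) + beta * (x - x_prev)
    return x


steps = 30
etas = [0.5] * steps                         # constant eta_k (illustrative choice)
lams = [0.5 * k for k in range(steps + 1)]   # increasing lam_k (illustrative choice)

x_ima = ima(0.0, etas, lams)
x_shb = shb(0.0, etas, lams)
print(abs(x_ima - x_shb))  # the two trajectories agree up to rounding
```

Under these assumptions, a constant `eta` with increasing `lam` yields decreasing step sizes and increasing momentum in the heavy ball form, which is the qualitative behaviour the averaging viewpoint is meant to expose.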
Having new reformulations often leads to new insights. This is the case for Nesterov’s accelerated gradient method, where at least six forms are known adefazio-curvedgeom2019 and recent research suggests that iterate-averaged reformulations are the easiest to generalize to the combined proximal & variance-reduced case lan2017.
3 Convex case
Our first theorem provides an upper bound on the suboptimality given any sampling and any sequence of step sizes. Later we develop special cases of this theorem through different choices of the parameters.
Let and consider the iterates (7). Let be a sequence such that for all . Define
Note that in Theorem 3.1 the only free parameters are the ’s which in the iterate-moving-average viewpoint (12) play the role of a learning rate. All our other parameters, including the step sizes and the momentum parameters , are given once we have chosen . We now explore three different settings of the ’s in the following subsections.
3.1 Convergence to a neighborhood of the minimum
Using a constant in Theorem 3.1 gives an interesting new sequence of decreasing step sizes and increasing momentum parameters , as we show in the next corollary.
Let If we set
we have and . Then the iterates of SHB (7) converge according to
In particular for we have that and , which gives
Corollary 3.2 shows how to set the parameters of SHB so that the last iterate converges sublinearly to a neighborhood of the minimum. In particular, for overparametrized models with , the last iterate of SHB converges sublinearly to the minimum. This same result was only known to hold for the average of the iterates of SGD Vaswani18. Moreover, when using the full gradient, which corresponds to sampling all individual gradients, we have and , which recovers the rate derived in Ghadimi2014 for the deterministic HB method up to a constant.
We can also translate this and the following convergence results into convenient complexity results, which we defer to the appendix (Section F) due to lack of space. We can also specialize our results to different forms of samplings and derive the mini-batch size which minimizes the total complexity, which we also defer to the appendix (Section G).
3.2 Exact convergence to the minimum
Now we provide parameter settings for ’s and ’s that guarantee convergence to the minimum.
This convergence rate is the same rate that can be derived using a weighted average of the iterates of SGD, as is done by Nemirovski09. Next we show how to drop the factor (21) if we know the stopping time of the algorithm. Note that using the stopping time to drop such terms was first introduced in Nemirovski09 for the analysis of the average of the iterates of SGD.
4 Faster asymptotic convergence
In this section, we show that SHB is asymptotically faster than SGD when the model is overparametrized, and that the deterministic HB is asymptotically faster than Gradient Descent. Here we use a.s. as an abbreviation of almost surely, otherwise also known as convergence with probability one.
with probability one. Moreover, we prove that the iterates of SHB (2) converge a.s. to a minimizer.
Note that when specialized to full gradients sampling, i.e. when we use the deterministic HB method, our results hold without the need for almost sure statements. This is another benefit of our analysis, since it unifies the analysis of both the stochastic and the deterministic versions of the HB method.
To the best of our knowledge, Theorem 4.1 is the first result showing that the iterates of a stochastic first-order method converge to a minimizer assuming only smoothness and convexity. Indeed, existing results on the a.s. convergence of the iterates of SGD or SHB all assume either (BG), (BV) or the uniqueness of the minimizer Bottou03; Zhou17; Nguyen18; Gadat18. For overparametrized models, Theorem 4.1 shows that converges faster than .
Assume and let for all . By Theorem 4.1 we have
This corollary has fundamental implications in the deterministic and the stochastic case. In the deterministic case, always holds. Thus Corollary 4.2 shows that the HB method is asymptotically faster than gradient descent since gradient descent is only known to converge according to . In the stochastic and overparametrized regime, this also shows that SHB is asymptotically faster than SGD with averaging which is only guaranteed to converge according to where Vaswani18.
It seems that it is our new iteration-dependent momentum coefficients that enable this new fast ‘small o’ convergence of the objective values. Indeed, in Attouch16 the authors also showed that a version of (deterministic) Nesterov’s Accelerated Gradient algorithm with carefully chosen iteration dependent momentum coefficients converges at rate rather than the previously known .
For our experiments, we selected a diverse set of multi-class classification problems from the LibSVM repository, 25 problems in total. These datasets range from a few classes to a thousand, and they vary from hundreds of data-points to hundreds of thousands. We normalized each dataset by a constant so that the largest data vector had norm
Here we compare the parameter setting given by our theory against three common alternative parameter settings used throughout the machine learning literature: SGD with fixed momentum of 0.9 and 0.99 as well as no momentum, as given in (3). We left the effective step size of these three methods to be determined through a grid search.
For the grid search, we used a power-of-2 grid, ran 5 random seeds, and chose the learning rate that gave the lowest loss on average for each combination of problem and method. We widened the grid search as necessary for each combination to ensure that the chosen learning rate was not at an endpoint of our grid.
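The grid search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: `run_trial` is a hypothetical callback returning the final loss for a given learning rate and seed, the starting exponents are arbitrary, and `max_widenings` is a safeguard added here so that the loop always terminates.

```python
import statistics


def grid_search(run_trial, exponents, seeds, max_widenings=20):
    """Pick the power-of-2 learning rate with the lowest mean loss over seeds.

    The grid is widened by one exponent whenever the best value sits on an
    endpoint. If max_widenings is exhausted, the endpoint value is returned.
    """
    exponents = sorted(exponents)
    mean_loss = {}
    for _ in range(max_widenings + 1):
        for e in exponents:
            if e not in mean_loss:  # evaluate each learning rate only once
                mean_loss[e] = statistics.mean(run_trial(2.0 ** e, s) for s in seeds)
        best = min(exponents, key=mean_loss.get)
        if best == exponents[0]:
            exponents.insert(0, best - 1)   # widen towards smaller learning rates
        elif best == exponents[-1]:
            exponents.append(best + 1)      # widen towards larger learning rates
        else:
            break
    return 2.0 ** best, mean_loss[best]


def toy_trial(lr, seed):
    """Hypothetical trial: 10 gradient steps on f(x) = 0.5 * x^2 from a seed-dependent start."""
    x = 1.0 + 0.1 * seed
    for _ in range(10):
        x -= lr * x
    return 0.5 * x * x


best_lr, best_loss = grid_search(toy_trial, exponents=[-6, -5, -4, -3], seeds=range(5))
print(best_lr, best_loss)
```

In the toy problem the initial grid is too narrow, so the search keeps extending it upwards until the selected learning rate has a neighbour on each side.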
Since the and constants in our method depend on the smoothness constant , we set these parameters using (15) and the assumption that , so that . Although it is possible to give a closed-form bound for the Lipschitz smoothness constant for our test problems, the above setting is less conservative and has the advantage of being usable without requiring any knowledge about the problem structure.
We then ran 40 different random seeds to produce Figure 1. To determine which method, if any, was best on each problem, we performed t-tests with Bonferroni correction, and we report how often each method was statistically significantly superior to all of the other three methods in Table 1. The stochastic heavy ball method using our theoretically motivated parameter settings performed better than all other methods on 11 of the 25 problems. On the remaining problems, no other method was statistically significantly better than all of the rest.
| | SHB | SGD | Momentum 0.9 | Momentum 0.99 | No best method |
|---|---|---|---|---|---|
| Best method for | 11 | 0 | 0 | 0 | 14 |
This work develops the theory and a new viewpoint of a commonly used method (the Momentum method) for training supervised machine learning models. We give new parameter settings that we believe will reduce the training time. Furthermore, we develop new iterate-moving-average viewpoints that we believe can also lead to new insights and understanding of all momentum-based methods. Given that we do not envision any particular application, nor does this work open up any new applications, we see no ethical or immediate societal consequences.
In the appendix, we proceed to prove the results we derive in the main paper, then we present the optimal mini-batch size to use for SHB depending on the problem setting in Section G. In Section H, we extend the theory developed in Section 3 to the strongly convex case, and show that SHB improves over the last iterate convergence result for SGD by a constant.
Appendix A Heavy ball and Momentum are the same thing
Appendix B Proof of Theorem 2.1
Appendix C Proof of Theorem 3.1
The proof uses the following Lyapunov function
Appendix D Proof of Corollary 3.3
Using the integral bound and plugging in our choice of gives
Furthermore using the integral bound again we have that
As for the parameter settings, note that
For the above gives
Thus by maintaining and updating we can compute the step sizes and momentum parameters using (15). ∎
Appendix E Proof of Theorem 4.1
A necessary tool to prove Theorem 4.1 is the following Robbins-Siegmund theorem Robbins71.
Lemma E.1 (Simplified Robbins-Siegmund Theorem).
Consider a filtration and nonnegative sequences of adapted processes , and such that
Then, converges and almost surely.
where , and . We also define:
To make the proof more readable, we first state the two following lemmas, for which we give a proof after the proof of the theorem.
We can now prove Theorem 4.1.
Proof of the theorem.
This proof aims at proving that, a.s.
for some .
for any ,
On the way to proving the first point, we will prove the second point as a byproduct.
We will now prove that exists .
We will first prove that exists , then that exists .
First, we have from Lemma E.3 that converges (to ) . Hence, it remains to show that exists
From (71), we have that:
By definition of , we have . Therefore, noting
Moreover, by Lemma E.3,