# On the insufficiency of existing momentum schemes for Stochastic Optimization

Momentum-based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only for the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that when these fast gradient methods are applied with stochastic gradients, they partially mimic their exact gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB or NAG's practical performance gains are a by-product of mini-batching. Furthermore, this work provides a viable (and provable) alternative, which, on the same set of problem instances, significantly improves over HB, NAG, and SGD's performance. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm, based on a relatively less popular variant of Nesterov's Acceleration. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.

## 1 Introduction

First order optimization methods, which access a function to be optimized through its gradient or an unbiased approximation of its gradient, are the workhorses for modern large scale optimization problems, which include training the current state-of-the-art deep neural networks. Gradient descent (Cauchy, 1847) is the simplest first order method that is used heavily in practice. However, it is known that for the class of smooth convex functions as well as some simple non-smooth problems (Nesterov, 2012a), gradient descent is suboptimal (Nesterov, 2004) and there exists a class of algorithms called fast gradient/momentum based methods which achieve optimal convergence guarantees. The heavy ball method (Polyak, 1964) and Nesterov’s accelerated gradient descent (Nesterov, 1983) are two of the most popular methods in this category.

On the other hand, training deep neural networks on large scale datasets has been possible through the use of Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951), which samples a random subset of training data to compute gradient estimates that are then used to optimize the objective function. The advantages of SGD for large scale optimization and the related issues of tradeoffs between computational and statistical efficiency were highlighted in Bottou and Bousquet (2007).

The above mentioned theoretical advantages of fast gradient methods (Polyak, 1964; Nesterov, 1983) (albeit for smooth convex problems) coupled with cheap to compute stochastic gradient estimates led to the influential work of Sutskever et al. (2013), which demonstrated the empirical advantages possessed by SGD when augmented with the momentum machinery. This work has led to widespread adoption of momentum methods for training deep neural nets; so much so that, in the context of neural network training, gradient descent often refers to momentum methods.

But, there is a subtle difference between classical momentum methods and their implementation in practice – classical momentum methods work in the exact first order oracle model (Nesterov, 2004), i.e., they employ exact gradients (computed on the full training dataset), while in practice (Sutskever et al., 2013), they are implemented with stochastic gradients (estimated from a randomly sampled mini-batch of training data). This leads to a natural question:

“Are momentum methods optimal even in the stochastic first order oracle (SFO) model, where we access stochastic gradients computed on small constant-sized minibatches (or a batch size of $1$)?”

Even disregarding the question of optimality of momentum methods in the SFO model, it is not even known if momentum methods (say, Polyak (1964); Nesterov (1983)) provide any provable improvement over SGD in this model. While these are open questions, a recent effort of Jain et al. (2017) showed that improving upon SGD (in the stochastic first order oracle) is rather subtle, as there exist problem instances in the SFO model where it is not possible to improve upon SGD, even information theoretically. Jain et al. (2017) studied a variant of Nesterov’s accelerated gradient updates (Nesterov, 2012b) for stochastic linear regression and showed that their method improves upon SGD wherever it is information theoretically admissible. Throughout this paper, we refer to the algorithm of Jain et al. (2017) as Accelerated Stochastic Gradient Method (ASGD) while we refer to a stochastic version of the most widespread form of Nesterov’s method (Nesterov, 1983) as NAG; HB denotes a stochastic version of the heavy ball method (Polyak, 1964). Critically, while Jain et al. (2017) show that ASGD improves on SGD in any information-theoretically admissible regime, it is still not known whether HB and NAG can achieve a similar performance gain.

A key contribution of this work is to show that HB does not provide similar performance gains over SGD even when it is information-theoretically admissible. That is, we provide a problem instance where it is indeed possible to improve upon SGD (and ASGD achieves this improvement), but HB cannot achieve any improvement over SGD. We validate this claim empirically as well. In fact, we provide empirical evidence for the claim that NAG also does not achieve any improvement over SGD for several problems where ASGD can still achieve better rates of convergence.

This raises a question about why HB and NAG provide better performance than SGD in practice (Sutskever et al., 2013), especially for training deep networks. Our conclusion (that is well supported by our theoretical result) is that HB and NAG’s improved performance is attributable to mini-batching and hence, these methods will often struggle to improve over SGD with small constant batch sizes. This is in stark contrast to methods like ASGD, which is designed to improve over SGD across both small and large mini-batch sizes. In fact, in our experiments on training deep residual networks (He et al., 2016a) on the cifar-10 dataset, ASGD offers noticeable improvements by achieving better test error than HB and NAG even with commonly used batch sizes during the initial stages of the optimization.

### 1.1 Contributions

The contributions of this paper are as follows.

1. In Section 3, we prove that HB is not optimal in the SFO model. In particular, there exist linear regression problems for which the performance of HB (with any step size and momentum) is either the same or worse than that of SGD while ASGD  improves upon both of them.

2. Experiments on several linear regression problems suggest that the suboptimality of HB in the SFO model is not restricted to special cases – it is rather widespread. Empirically, the same holds true for NAG as well (Section 5).

3. The above observations suggest that the only reason for the superiority of momentum methods in practice is mini-batching, which reduces the variance in stochastic gradients and moves the SFO closer to the exact first order oracle. This conclusion is supported by empirical evidence through training deep residual networks on cifar-10 with a small batch size (see Section 5.3).

4. We present an intuitive and easier to tune version of ASGD (see Section 4) and show that ASGD can provide significantly faster convergence to a reasonable accuracy than SGD, HB, NAG, while still providing favorable or comparable asymptotic accuracy as these methods, particularly on several deep learning problems.

Hence, the take-home message of this paper is: HB and NAG are not optimal in the SFO model. The only reason for the superiority of momentum methods in practice is mini-batching. ASGD provides a distinct advantage in training deep networks over SGD, HB and NAG.

## 2 Notation

We denote matrices by bold-face capital letters and vectors by lower-case letters. $f(w) = \frac{1}{n} \sum_{i} f_i(w)$ denotes the function to optimize w.r.t. model parameters $w$. $\nabla f(w)$ denotes the exact gradient of $f$ at $w$ while $\widehat{\nabla} f_t(w)$ denotes a stochastic gradient of $f$. That is, $\widehat{\nabla} f_t(w) = \nabla f_{i_t}(w)$ where $i_t$ is sampled uniformly at random from $\{1, \ldots, n\}$. For linear regression, $f_i(w) = 0.5 \cdot (b_i - \langle w, a_i \rangle)^2$ where $b_i$ is the target and $a_i$ is the covariate, and $\widehat{\nabla} f_t(w) = -(b_t - \langle w, a_t \rangle) a_t$. In this case, $H = E[a a^\top]$ denotes the Hessian of $f$ and $\kappa$ denotes its condition number.

Algorithm 1 provides pseudo-code of the HB method (Polyak, 1964). $w_t - w_{t-1}$ is the momentum term and $\alpha$ denotes the momentum parameter. The next iterate $w_{t+1}$ is obtained by a linear combination of the SGD update and the momentum term. Algorithm 2 provides pseudo-code of a stochastic version of the most commonly used form of Nesterov’s accelerated gradient descent (Nesterov, 1983).
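The two updates described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a transcription of Algorithms 1 and 2: the function names, the list-based vector representation, and the particular two-sequence form used for NAG (a gradient step followed by an extrapolation) are choices made here for illustration, with `delta` as the step size and `alpha` as the momentum parameter.

```python
def grad_linear_regression(w, a, b):
    """Stochastic gradient of 0.5 * (<w, a> - b)^2 for a single sample (a, b)."""
    residual = sum(wi * ai for wi, ai in zip(w, a)) - b
    return [residual * ai for ai in a]


def hb_step(w, w_prev, g, delta, alpha):
    """Heavy ball: SGD step from w plus alpha times the momentum term (w - w_prev)."""
    return [wi - delta * gi + alpha * (wi - pi) for wi, pi, gi in zip(w, w_prev, g)]


def nag_step(w, v, grad_fn, delta, alpha):
    """One assumed form of NAG: gradient step from w, then extrapolate past it.

    v is the previous gradient-step point; returns (w_next, v_next).
    """
    g = grad_fn(w)
    v_next = [wi - delta * gi for wi, gi in zip(w, g)]
    w_next = [vn + alpha * (vn - vo) for vn, vo in zip(v_next, v)]
    return w_next, v_next
```

Setting `alpha = 0` in either function recovers the plain SGD step, which is a convenient sanity check when comparing the three methods.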

## 3 Suboptimality of Heavy Ball Method

In this section, we show that there exist linear regression problems where the performance of HB (Algorithm 1) is no better than that of SGD, while ASGD significantly improves upon SGD’s performance. Let us now describe the problem instance.

Fix $w^* \in \mathbb{R}^2$ and let $(a, b) \sim \mathcal{D}$ be a sample from the distribution such that:

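$$a = \begin{cases} \sigma_1 \cdot z \cdot e_1 & \text{w.p. } 0.5 \\ \sigma_2 \cdot z \cdot e_2 & \text{w.p. } 0.5, \end{cases} \quad \text{and} \quad b = \langle w^*, a \rangle,$$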

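where $e_1, e_2 \in \mathbb{R}^2$ are canonical basis vectors and $\sigma_1 > \sigma_2 > 0$. Let $z$ be a random variable such that $E[z^2] = 2$ and $E[z^4] = 2c \geq 4$. Hence, we have: $E[(a^{(i)})^2] = \sigma_i^2$ and $E[(a^{(i)})^4] = c \sigma_i^4$ for $i = 1, 2$. Now, our goal is to minimize: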

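$$f(w) \overset{\text{def}}{=} 0.5 \cdot E\left[(\langle w, a \rangle - b)^2\right], \qquad \text{Hessian } H \overset{\text{def}}{=} E[a a^\top] = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}.$$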
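As a quick check of the instance, the moments can be computed exactly by enumerating outcomes. The sketch below assumes one concrete choice of the random variable $z$, namely $z = \pm\sqrt{2}$ with equal probability, which gives $E[z^2] = 2$ and $E[z^4] = 4$; this choice and the helper names are illustrative assumptions, not part of the construction above.

```python
import math


def instance_outcomes(sigma1, sigma2):
    """Enumerate (probability, a) pairs, ASSUMING z = +/- sqrt(2) with equal probability."""
    root2 = math.sqrt(2.0)
    outcomes = []
    for z in (root2, -root2):
        outcomes.append((0.25, (sigma1 * z, 0.0)))  # a = sigma1 * z * e1
        outcomes.append((0.25, (0.0, sigma2 * z)))  # a = sigma2 * z * e2
    return outcomes


def hessian(outcomes):
    """H = E[a a^T] as a 2x2 nested list."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for p, a in outcomes:
        for i in range(2):
            for j in range(2):
                H[i][j] += p * a[i] * a[j]
    return H


def fourth_moment(outcomes, j):
    """E[(a^(j))^4] for coordinate j."""
    return sum(p * a[j] ** 4 for p, a in outcomes)


outcomes = instance_outcomes(sigma1=1.0, sigma2=0.1)
H = hessian(outcomes)
```

With this choice of $z$ the Hessian is diagonal with entries $\sigma_1^2$ and $\sigma_2^2$, so the ratio `H[0][0] / H[1][1]` can be made arbitrarily large by shrinking `sigma2` while the distribution of $z$ stays fixed.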

Let $\kappa$ and $\tilde{\kappa}$ denote the computational and statistical condition numbers – see Jain et al. (2017) for definitions. For the problem above, we have $\kappa = c \sigma_1^2 / \sigma_2^2$ and $\tilde{\kappa} = c$. Then we obtain the following convergence rates for SGD and ASGD when applied to the above problem instance:

###### Corollary 1 (of Theorem 1 of Jain et al. (2016)).

Let $w_t^{SGD}$ be the $t$-th iterate of SGD on the above problem with starting point $w_0$ and stepsize $\frac{1}{c \sigma_1^2}$. The error of $w_t^{SGD}$ can be bounded as,

$$E[f(w_t^{SGD})] - f(w^*) \leq \exp\left(-\frac{t}{\kappa}\right) (f(w_0) - f(w^*)).$$
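The form of this bound can be sanity-checked coordinate by coordinate. The derivation below is a sketch under assumptions that are spelled out here rather than taken from the corollary: a stepsize $\delta = 1/(c\sigma_1^2)$, the fourth moment $E[(a^{(j)})^4] = c\sigma_j^4$ (the value implied by the matrix $B^{(j)}$ in Appendix A), and $\kappa = c\sigma_1^2/\sigma_2^2$. The shorthand $e_t^{(j)}$ for the centered $j$-th coordinate is introduced only for this check.

```latex
\begin{align*}
e_{t+1}^{(j)} &= \bigl(1 - \delta (a^{(j)})^2\bigr)\, e_t^{(j)},
  \qquad e_t^{(j)} := w_t^{(j)} - (w^*)^{(j)}, \\
E\bigl[(e_{t+1}^{(j)})^2\bigr]
  &= \bigl(1 - 2\delta\sigma_j^2 + c\,\delta^2\sigma_j^4\bigr)\, E\bigl[(e_t^{(j)})^2\bigr]. \\
\text{For } j = 2,\ \delta\sigma_2^2 = \tfrac{1}{\kappa}: \quad
  1 - \tfrac{2}{\kappa} + \tfrac{c}{\kappa^2}
  &\le 1 - \tfrac{1}{\kappa} \le \exp\!\bigl(-\tfrac{1}{\kappa}\bigr)
  \quad (\text{using } c \le \kappa). \\
\text{For } j = 1,\ \delta\sigma_1^2 = \tfrac{1}{c}: \quad
  1 - \tfrac{2}{c} + \tfrac{1}{c}
  &= 1 - \tfrac{1}{c} \le 1 - \tfrac{1}{\kappa}.
\end{align*}
```

Both coordinates therefore contract by at least $\exp(-1/\kappa)$ per step in expectation, and since $f(w) - f(w^*) = 0.5 \sum_j \sigma_j^2 (e^{(j)})^2$ for this instance, the same factor applies to the excess risk.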

On the other hand, ASGD achieves the following superior rate.

###### Corollary 2 (of Theorem 1 of Jain et al. (2017)).

Let $w_t^{ASGD}$ be the $t$-th iterate of ASGD on the above problem with starting point $w_0$ and appropriate parameters. The error of $w_t^{ASGD}$ can be bounded as,

$$E[f(w_t^{ASGD})] - f(w^*) \leq \text{poly}(\kappa) \exp\left(-\frac{t}{\sqrt{\kappa \tilde{\kappa}}}\right) (f(w_0) - f(w^*)).$$

Note that for a given problem/input distribution $\tilde{\kappa} = c$ is a constant while $\kappa = c \sigma_1^2 / \sigma_2^2$ can be arbitrarily large. Note that $\kappa > \tilde{\kappa} = c$. Hence, ASGD improves upon the rate of SGD by a factor of $\sqrt{\kappa / \tilde{\kappa}}$, which is of order $\sqrt{\kappa}$ since $\tilde{\kappa}$ is a constant. The following proposition, which is the main result of this section, establishes that HB (Algorithm 1) cannot provide an improvement over SGD similar to what ASGD offers. In fact, we show that no matter the choice of parameters of HB, its performance does not improve over SGD by more than a constant.

###### Proposition 3.

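Let $w_t^{HB}$ be the $t$-th iterate of HB (Algorithm 1) on the above problem with starting point $w_0$. For any choice of stepsize $\delta$ and momentum $\alpha \in [0, 1]$, there exists $T$ large enough such that for all $t \geq T$, we have,

$$E[f(w_t^{HB})] - f(w^*) \geq C(\kappa, \delta, \alpha) \cdot \exp\left(-\frac{500\, t}{\kappa}\right) (f(w_0) - f(w^*)),$$

where $C(\kappa, \delta, \alpha)$ depends on $\kappa$, $\delta$ and $\alpha$ (but not on $t$).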

Thus, to obtain $\hat{w}$ s.t. $\|\hat{w} - w^*\| \leq \epsilon$, HB requires $\Omega(\kappa \log \frac{1}{\epsilon})$ samples and iterations. On the other hand, ASGD can obtain an $\epsilon$-approximation to $w^*$ in $O(\sqrt{\kappa} \log \frac{\kappa}{\epsilon})$ iterations. We note that the gains offered by ASGD are meaningful when $\kappa > O(c)$ (Jain et al., 2017); otherwise, all the algorithms including SGD achieve nearly the same rates (up to constant factors). While we do not prove it theoretically, we observe empirically that for the same problem instance, NAG also obtains nearly the same rate as HB and SGD. We conjecture that a lower bound for NAG can be established using a proof technique similar to that of HB (i.e. Proposition 3). We also believe that the constant in the lower bound described in Proposition 3 can be improved to some small number.
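The comparison between HB and SGD on this instance can be explored numerically without sampling, by iterating the expected second moments of the HB recursion exactly. The sketch below uses the $4 \times 4$ matrix $B^{(j)}$ from Appendix A; the initialization ($w_{-1} = w_0$ with unit error in each coordinate), the grid of step sizes and momenta, and all function names are illustrative assumptions.

```python
import math


def hb_moment_matrix(alpha, delta, sigma_sq, c):
    """B^(j) from Appendix A for one coordinate with E[a^2] = sigma_sq, E[a^4] = c * sigma_sq^2."""
    x = delta * sigma_sq
    t = 1.0 + alpha - x
    return [
        [t * t + (c - 1.0) * x * x, -alpha * t, -alpha * t, alpha * alpha],
        [t, 0.0, -alpha, 0.0],
        [t, -alpha, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
    ]


def hb_expected_sq_error(alpha, delta, sigma_sq, c, steps):
    """E[(w_t - w*)^2] along one coordinate, from unit initial error and w_{-1} = w_0."""
    B = hb_moment_matrix(alpha, delta, sigma_sq, c)
    phi = [1.0, 1.0, 1.0, 1.0]  # vectorized second-moment matrix of (e_0, e_{-1}) = (1, 1)
    for _ in range(steps):
        phi = [sum(B[i][k] * phi[k] for k in range(4)) for i in range(4)]
        if abs(phi[0]) > 1e30:  # treat blow-up as divergence
            return math.inf
    return phi[0]


def excess_risk(alpha, delta, sigmas_sq, c, steps):
    """E[f(w_t)] - f(w*) = 0.5 * sum_j sigma_j^2 * E[(e_t^(j))^2]; alpha = 0 gives SGD."""
    return sum(0.5 * s * hb_expected_sq_error(alpha, delta, s, c, steps) for s in sigmas_sq)


def best_hb_over_grid(sigmas_sq, c, steps, n_grid=10):
    """Grid search over momentum in [0, 1) and step size in (0, 2 / sigma_1^2]."""
    best = (math.inf, None, None)
    for i in range(n_grid):
        alpha = i / n_grid
        for k in range(n_grid):
            delta = 2.0 * (k + 1) / (n_grid * max(sigmas_sq))
            err = excess_risk(alpha, delta, sigmas_sq, c, steps)
            if err < best[0]:
                best = (err, alpha, delta)
    return best


sigmas_sq, c, steps = [1.0, 0.01], 2.0, 200
sgd_error = excess_risk(0.0, 1.0 / (c * max(sigmas_sq)), sigmas_sq, c, steps)
hb_error, hb_alpha, hb_delta = best_hb_over_grid(sigmas_sq, c, steps)
```

Comparing `hb_error` against `sgd_error` for increasing ratios `sigmas_sq[0] / sigmas_sq[1]` gives a direct, if coarse, way to see how much a tuned HB gains over SGD on this instance; a finer grid or longer horizon only changes the constants in `best_hb_over_grid`.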

## 4 Algorithm

We will now present and explain an intuitive version of ASGD (pseudo code in Algorithm 3). The algorithm takes three inputs: short step $\delta$, long step parameter $\kappa$ and statistical advantage parameter $\xi$. The short step $\delta$ is precisely the same as the step size in SGD, HB or NAG. For convex problems, this scales inversely with the smoothness of the function. The long step parameter $\kappa$ is intended to give an estimate of the ratio of the largest and smallest curvatures of the function; for convex functions, this is just the condition number. The statistical advantage parameter $\xi$ captures the trade off between statistical and computational condition numbers – in the deterministic case, $\xi = \sqrt{\kappa}$ and ASGD is equivalent to NAG, while in the high stochasticity regime, $\xi$ is much smaller. The algorithm maintains two iterates: a descent iterate $w_t$ and a running average $\bar{w}_t$. The running average is a weighted average of the previous average and a long gradient step from the descent iterate, while the descent iterate is updated as a convex combination of a short gradient step from the descent iterate and the running average. The idea is that since the algorithm takes a long step as well as a short step and an appropriate average of both of them, it can make progress in different directions at a similar pace. Appendix B shows the equivalence between Algorithm 3 and ASGD as proposed in Jain et al. (2017). Note that the constant $0.7$ appearing in Algorithm 3 has no special significance. Jain et al. (2017) require it to be smaller than $\sqrt{1/6}$ but any constant smaller than $1$ seems to work in practice.
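The verbal description above can be turned into a short sketch. The code below is an illustration consistent with that description, not a transcription of Algorithm 3: the averaging weight `alpha`, the mixing weight `beta`, the default constant `c = 0.7`, and the scaling of the long step by `kappa * delta / c` are assumptions made here, and the function name and signature are hypothetical.

```python
def asgd(grad_fn, w0, delta, kappa, xi, steps, c=0.7):
    """Two-iterate accelerated SGD sketch (all constants are assumptions).

    grad_fn(w) returns a (stochastic) gradient at w as a list;
    delta is the short step, kappa the long step parameter,
    xi the statistical advantage parameter (xi <= sqrt(kappa)).
    """
    alpha = 1.0 - (c * c * xi) / kappa  # weight on the previous running average
    beta = c / (c + (1.0 - alpha))      # weight on the short gradient step
    w = list(w0)                        # descent iterate
    w_bar = list(w0)                    # running average
    for _ in range(steps):
        g = grad_fn(w)
        # Running average: previous average mixed with a long gradient step from w.
        w_bar = [
            alpha * wb + (1.0 - alpha) * (wi - (kappa * delta / c) * gi)
            for wb, wi, gi in zip(w_bar, w, g)
        ]
        # Descent iterate: convex combination of a short step from w and the new average.
        w = [
            beta * (wi - delta * gi) + (1.0 - beta) * wb
            for wi, gi, wb in zip(w, g, w_bar)
        ]
    return w


# Deterministic check on f(w) = 0.5 * (w_1^2 + 0.01 * w_2^2), condition number 100.
curvatures = [1.0, 0.01]
w_final = asgd(
    lambda w: [h * wi for h, wi in zip(curvatures, w)],
    w0=[1.0, 1.0], delta=1.0, kappa=100.0, xi=10.0, steps=500,
)
```

In the deterministic check, `xi` is set to `sqrt(kappa)`, matching the regime in which the text says ASGD coincides with NAG; for stochastic gradients a smaller `xi` would be the natural thing to try.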

## 5 Experiments

We now present our experimental results exploring performance of SGD, HB, NAG and ASGD. Our experiments are geared towards answering the following questions:

• Even for linear regression, is the suboptimality of HB restricted to specific distributions in Section 3 or does it hold for more general distributions as well? Is the same true of NAG?

• What is the reason for the superiority of HB and NAG in practice? Is it because momentum methods have better performance than SGD for stochastic gradients or due to mini-batching? Does this superiority hold even for small minibatches?

• How does the performance of ASGD compare to that of SGD, HB and NAG, when training deep networks?

Section 5.1 and parts of Section 5.2 address the first two questions. Sections 5.2 and 5.3 address Question 2 partially and the last question. We use Matlab to conduct the experiments presented in Section 5.1 and use PyTorch (pytorch, 2017) for our deep networks related experiments. PyTorch code implementing the ASGD algorithm can be found at https://github.com/rahulkidambi/AccSGD.

### 5.1 Linear Regression

In this section, we will present results on the performance of the four optimization methods (SGD, HB, NAG, and ASGD) for linear regression problems. We consider two different classes of linear regression problems, both in two dimensions. For a given condition number $\kappa$, we consider the following two distributions:

Discrete: w.p. and with ; is the standard basis vector.

Gaussian : is distributed as a Gaussian random vector with covariance matrix .

We fix a randomly generated $w^*$ and for both the distributions above, we let $b = \langle w^*, a \rangle$. We vary $\kappa$ over a range of values, and for each $\kappa$ in this set, we run 100 independent runs of all four methods, each for a total of $t$ iterations. We say that the algorithm converges if there is no error in the second half (i.e. after $t/2$ updates) that exceeds the starting error - this is reasonable since we expect geometric convergence of the initial error.

Unlike ASGD and SGD, we do not know the optimal learning rate and momentum parameters for NAG and HB in the stochastic gradient model. So, we perform a grid search over the values of the learning rate and momentum parameters. In particular, we lay a grid over learning rate and momentum values and run NAG and HB. Then, for each grid point, we consider the subset of trials that converged and compute the final error using these. Finally, the parameters that yield the minimal error are chosen for NAG and HB, and these numbers are reported. We measure the convergence performance of a method using:

$$\text{rate} = \frac{\log(f(w_0)) - \log(f(w_t))}{t}. \qquad (1)$$

We compute the rate (1) for all the algorithms with varying condition number $\kappa$. Given a rate vs. $\kappa$ plot for a method, we compute its slope using linear regression. Table 1 presents the estimated slopes for various methods for both the discrete and the Gaussian case. The slope values clearly show that the rates of SGD, HB and NAG have a nearly linear dependence on $\kappa$ while that of ASGD seems to scale linearly with $\sqrt{\kappa}$.
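The rate in equation (1) and the slope fit are simple to compute. The sketch below assumes the slope is fitted on a log-log scale, regressing $\log(1/\text{rate})$ on $\log \kappa$; that choice of axes, the synthetic rates used to exercise the code, and the function names are assumptions made for illustration, not a description of how Table 1 was produced.

```python
import math


def convergence_rate(f_w0, f_wt, t):
    """Equation (1): average per-iteration decrease of log-error over t iterations."""
    return (math.log(f_w0) - math.log(f_wt)) / t


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic illustration: a method whose rate is 1/kappa and one whose rate is 1/sqrt(kappa).
kappas = [2.0 ** k for k in range(4, 13)]
log_kappas = [math.log(k) for k in kappas]
slope_linear = fit_slope(log_kappas, [math.log(1.0 / (1.0 / k)) for k in kappas])
slope_sqrt = fit_slope(log_kappas, [math.log(1.0 / (1.0 / math.sqrt(k))) for k in kappas])
```

Under this convention a slope near 1 corresponds to a number of iterations growing linearly in $\kappa$, and a slope near 0.5 to growth like $\sqrt{\kappa}$.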

### 5.2 Deep Autoencoders for MNIST

In this section, we present experimental results on training deep autoencoders for the mnist dataset, and we follow the setup of

Hinton and Salakhutdinov (2006). This problem is a standard benchmark for evaluating optimization algorithms e.g., Martens (2010); Sutskever et al. (2013); Martens and Grosse (2015); Reddi et al. (2017). The network architecture follows previous work (Hinton and Salakhutdinov, 2006) and is represented as with the first and last nodes representing the input and output respectively. All hidden/output nodes employ sigmoid activations except for the layer with nodes which employs linear activations and we use MSE loss. We use the initialization scheme of Martens (2010), also employed in Sutskever et al. (2013); Martens and Grosse (2015). We perform training with two minibatch sizes and . The runs with minibatch size of were run for epochs while the runs with minibatch size of were run for epochs. For each of SGD, HB, NAG and ASGD, a grid search over learning rate, momentum and long step parameter (whichever is applicable) was done and best parameters were chosen based on achieving the smallest training error in the same protocol followed by Sutskever et al. (2013). The grid was extended whenever the best parameter fell at the edge of a grid. For the parameters chosen by grid search, we perform runs with different seeds and averaged the results. The results are presented in Figures 2 and 3. Note that the final loss values reported are suboptimal compared to the published literature e.g., Sutskever et al. (2013); while Sutskever et al. (2013) report results after updates with a large batch size of (which implies a total of M gradient evaluations), whereas, ours are after M updates of  SGD with a batch size (which is just M gradient evaluations).

Effect of minibatch sizes: While HB and NAG decay the loss faster compared to SGD for a minibatch size of (Figure 2), this superior decay rate does not hold for a minibatch size of (Figure 3). This supports our intuitions from the stochastic linear regression setting, where we demonstrate that HB and NAG are suboptimal in the stochastic first order oracle model.

Comparison of ASGD with momentum methods: While ASGD performs slightly better than NAG for batch size in the training error (Figure 2), ASGD decays the error at a faster rate compared to all the three other methods for a batch size of (Figure 3).

### 5.3 Deep Residual Networks for CIFAR-10

We will now present experimental results on training deep residual networks (He et al., 2016b) with pre-activation blocks (He et al., 2016a) for classifying images in cifar-10 (Krizhevsky and Hinton, 2009); the network we use has $44$ layers (dubbed preresnet-44). The code for this section was downloaded from preresnet (2017). One of the most distinct characteristics of this experiment compared to our previous experiments is learning rate decay. We use a validation set based decay scheme, wherein, at regular epoch intervals, we decay the learning rate by a certain factor (which we grid search on) if the validation zero one error does not decrease by at least a certain amount (precise numbers are provided in the appendix since they vary across batch sizes). Due to space constraints, we present only a subset of training error plots. Please see Appendix C.3 for some more plots on training errors.
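The validation-based decay scheme described above can be sketched as a small scheduler. The class name, argument names, and the example values below are hypothetical and chosen only to make the logic concrete; the actual interval, decay factor, and improvement threshold are those referred to in the appendix.

```python
class ValidationDecay:
    """Decay the learning rate when validation error has not improved enough.

    Every `check_every` epochs, compare the current validation error with the
    error recorded at the previous check; if it has not decreased by at least
    `min_improvement`, multiply the learning rate by `decay_factor`.
    """

    def __init__(self, lr, decay_factor, check_every, min_improvement):
        self.lr = lr
        self.decay_factor = decay_factor
        self.check_every = check_every
        self.min_improvement = min_improvement
        self.reference_error = None  # validation error at the previous check

    def step(self, epoch, val_error):
        """Call once per epoch (epochs counted from 1); returns the learning rate to use next."""
        if epoch % self.check_every != 0:
            return self.lr
        if (
            self.reference_error is not None
            and self.reference_error - val_error < self.min_improvement
        ):
            self.lr *= self.decay_factor
        self.reference_error = val_error
        return self.lr


# Hypothetical usage with made-up validation errors.
scheduler = ValidationDecay(lr=0.1, decay_factor=0.5, check_every=2, min_improvement=0.01)
val_errors = [0.50, 0.40, 0.39, 0.395, 0.35, 0.30]
lrs = [scheduler.step(epoch, err) for epoch, err in enumerate(val_errors, start=1)]
```

Because the decay is triggered by the validation error rather than by a fixed schedule, different optimizers can end up decaying at different iterations, which is the issue the convergence-speed comparison later in this section has to control for.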

Effect of minibatch sizes: Our first experiment tries to understand how the performance of HB and NAG compare with that of SGD and how it varies with minibatch sizes. Figure 4 presents the test zero one error for minibatch sizes of and . While training with batch size was done for epochs, with batch size , it was done for epochs. We perform a grid search over all parameters for each of these algorithms. See Appendix C.3 for details on the grid search parameters. We observe that final error achieved by SGD, HB and NAG are all very close for both batch sizes. While NAG exhibits a superior rate of convergence compared to SGD and HB for batch size , this superior rate of convergence disappears for a batch size of .

Comparison of ASGD with momentum methods: The next experiment tries to understand how ASGD compares with HB and NAG. The errors achieved by various methods when we do grid search over all parameters are presented in Table 2. Note that the final test errors for batch size are better than those for batch size since the former was run for epochs while the latter was run only for epochs (due to time constraints).

While the final error achieved by ASGD is similar/favorable compared to all other methods, we are also interested in understanding whether ASGD has a superior convergence speed. For this experiment, we need to address the issue of differing learning rates used by various algorithms and different iterations where they decay learning rates. So, for each of HB and NAG, we choose the learning rate and decay factors by grid search, use these values for ASGD and do grid search only over long step parameter and momentum for ASGD. The results are presented in Figures 5 and 6. For batch size , ASGD decays error at a faster rate compared to both HB and NAG. For batch size , while we see a superior convergence of ASGD compared to NAG, we do not see this superiority over HB. The reason for this turns out to be that the learning rate for HB, which we also use for ASGD, turns out to be quite suboptimal for ASGD. So, for batch size , we also compare fully optimized (i.e., grid search over learning rate as well) ASGD with HB. The superiority of ASGD over HB is clear from this comparison. These results suggest that ASGD decays error at a faster rate compared to HB and NAG across different batch sizes.

## 6 Related Work

First order oracle methods: The primary method in this family is Gradient Descent (GD) (Cauchy, 1847). As mentioned previously, GD is suboptimal for smooth convex optimization (Nesterov, 2004), and this is addressed using momentum methods such as the Heavy Ball method (Polyak, 1964) (for quadratics), and Nesterov’s Accelerated gradient descent (Nesterov, 1983).

Stochastic first order methods and noise stability: The simplest method employing the SFO is SGD (Robbins and Monro, 1951); the effectiveness of SGD has been immense, and its applicability goes well beyond optimizing convex objectives. Accelerating SGD is a tricky proposition given the instability of fast gradient methods in dealing with noise, as evidenced by several negative results which consider statistical (Proakis, 1974; Polyak, 1987; Roy and Shynk, 1990), numerical (Paige, 1971; Greenbaum, 1989) and adversarial errors (d’Aspremont, 2008; Devolder et al., 2014). A result of Jain et al. (2017) developed the first provably accelerated SGD method for linear regression which achieved minimax rates, inspired by a method of Nesterov (2012b). Schemes of Ghadimi and Lan (2012, 2013); Dieuleveut et al. (2016), which indicate acceleration is possible with noisy gradients do not hold in the SFO model satisfied by algorithms that are run in practice (see Jain et al. (2017) for more details).

While HB (Polyak, 1964) and NAG (Nesterov, 1983) are known to be effective in case of exact first order oracle, for the SFO, the theoretical performance of HB and NAG is not well understood.

Understanding Stochastic Heavy Ball: Understanding HB’s performance with inexact gradients has been considered in efforts spanning several decades, in many communities like controls, optimization and signal processing. Polyak (1987) considered HB with noisy gradients and concluded that the improvements offered by HB with inexact gradients vanish unless strong assumptions on the inexactness are made; an instance of this is when the variance of the inexactness decreases as the iterates approach the minimizer. Proakis (1974); Roy and Shynk (1990); Sharma et al. (1998) suggest that the improved non-asymptotic rates offered by stochastic HB arise at the cost of worse asymptotic behavior. We resolve these unquantified improvements on rates as being just constant factors over SGD, in stark contrast to the gains offered by ASGD. Loizou and Richtárik (2017) state their method as Stochastic HB but require stochastic gradients that nearly behave as exact gradients; indeed, their rates match that of the standard HB method (Polyak, 1964). Such rates are not information theoretically possible (see Jain et al. (2017)), especially with a batch size of $1$ or even with constant sized minibatches.

Accelerated and Fast Methods for finite-sums: There have been developments pertaining to faster methods for finite-sums (also known as offline stochastic optimization): amongst these are methods such as SDCA (Shalev-Shwartz and Zhang, 2012), SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), which offer linear convergence rates for strongly convex finite-sums, improving over SGD’s sub-linear rates (Rakhlin et al., 2012). These methods have been improved using accelerated variants (Shalev-Shwartz and Zhang, 2014; Frostig et al., 2015a; Lin et al., 2015; Defazio, 2016; Allen-Zhu, 2016). Note that these methods require storing the entire training set in memory and taking multiple passes over the same for guaranteed progress. Furthermore, these methods require computing a batch gradient or have substantial memory requirements (typically growing with the number of training points). For deep learning problems, data augmentation is often deemed necessary for achieving good performance; this implies that computing quantities such as the batch gradient (or meeting the storage requirements) over this augmented dataset is often infeasible. Such requirements are mitigated by the use of simple streaming methods such as SGD, ASGD, HB, NAG. For other technical distinctions between the offline and online stochastic methods refer to Frostig et al. (2015b).

Practical methods for training deep networks: Momentum based methods employed with stochastic gradients (Sutskever et al., 2013) have become standard and popular in practice. These schemes tend to outperform standard SGD on several important practical problems. As previously mentioned, we attribute this improvement to the effect of mini-batching rather than an improvement offered by HB or NAG in the SFO model. Schemes such as Adagrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014) represent an important and useful class of algorithms. The advantages offered by these methods are orthogonal to the advantages offered by fast gradient methods; it is an important direction to explore augmenting these methods with ASGD as opposed to standard HB or NAG based acceleration schemes.

Chaudhari et al. (2017) proposed Entropy-SGD, which is an altered objective that adds a local strong convexity term to the actual empirical risk objective, with an aim to improve generalization. However, we do not understand convergence rates for convex problems or the generalization ability of this technique in a rigorous manner. Chaudhari et al. (2017) propose to use SGD in their procedure but mention that they employ the HB/NAG method in their implementation for achieving better performance. Naturally, we can use ASGD in this context. Path normalized SGD (Neyshabur et al., 2015) is a variant of SGD that alters the metric on which the weights are optimized. As noted in their paper, path normalized SGD could be improved using HB/NAG (or even the ASGD method).

## 7 Conclusions and Future Directions

In this paper, we show that the performance gain of HB over SGD in the stochastic setting is attributable to mini-batching rather than the algorithm’s ability to accelerate with stochastic gradients. Concretely, we provide a formal proof that for several easy problem instances, HB does not outperform SGD despite the large condition number of the problem; we observe this trend for NAG in our experiments. In contrast, ASGD (Jain et al., 2017) provides significant improvement over SGD for these problem instances. We observe similar trends when training a resnet on cifar-10 and an autoencoder on mnist. This work motivates several directions such as understanding the behavior of ASGD on domains such as NLP, and developing automatic momentum tuning schemes (Zhang et al., 2017).

#### Acknowledgments

Sham Kakade acknowledges funding from Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the NSF through awards CCF-, CCF- and CCF-.

## Appendix A Suboptimality of HB: Proof of Proposition 3

Before proceeding to the proof, we introduce some additional notation. Let $\theta^{(j)}_{t+1}$ denote the concatenated and centered estimates in the $j$-th direction for $j = 1, 2$.

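$$\theta^{(j)}_{t+1} \overset{\text{def}}{=} \begin{bmatrix} w^{(j)}_{t+1} - (w^*)^{(j)} \\ w^{(j)}_{t} - (w^*)^{(j)} \end{bmatrix}, \quad j = 1, 2.$$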

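Since the distribution over $a$ is such that the coordinates are decoupled, we see that $\theta^{(j)}_{t+1}$ can be written in terms of $\theta^{(j)}_{t}$ as:

$$\theta^{(j)}_{t+1} = \widehat{A}^{(j)}_{t+1} \theta^{(j)}_{t}, \quad \text{with } \widehat{A}^{(j)}_{t+1} = \begin{bmatrix} 1 + \alpha - \delta (a^{(j)}_{t+1})^2 & -\alpha \\ 1 & 0 \end{bmatrix}.$$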

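Let $\Phi^{(j)}_{t+1} \overset{\text{def}}{=} E[\theta^{(j)}_{t+1} \otimes \theta^{(j)}_{t+1}]$ denote the covariance matrix of $\theta^{(j)}_{t+1}$. We have $\Phi^{(j)}_{t+1} = B^{(j)} \Phi^{(j)}_{t}$ with $B^{(j)}$ defined as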

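$$\begin{aligned} B^{(j)} &\overset{\text{def}}{=} \begin{bmatrix} E[(1 + \alpha - \delta (a^{(j)})^2)^2] & E[-\alpha (1 + \alpha - \delta (a^{(j)})^2)] & E[-\alpha (1 + \alpha - \delta (a^{(j)})^2)] & \alpha^2 \\ E[(1 + \alpha - \delta (a^{(j)})^2)] & 0 & -\alpha & 0 \\ E[(1 + \alpha - \delta (a^{(j)})^2)] & -\alpha & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \\ &= \begin{bmatrix} (1 + \alpha - \delta \sigma_j^2)^2 + (c - 1)(\delta \sigma_j^2)^2 & -\alpha (1 + \alpha - \delta \sigma_j^2) & -\alpha (1 + \alpha - \delta \sigma_j^2) & \alpha^2 \\ (1 + \alpha - \delta \sigma_j^2) & 0 & -\alpha & 0 \\ (1 + \alpha - \delta \sigma_j^2) & -\alpha & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}. \end{aligned}$$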

We prove Proposition 3 by showing that for any choice of stepsize and momentum, either of the two holds:

• $B^{(1)}$ has an eigenvalue larger than $1$, or,

• the largest eigenvalue of $B^{(2)}$ is greater than $1 - \frac{500}{\kappa}$.

This is formalized in the following two lemmas.

###### Lemma 4.

If the stepsize $\delta$ is such that $\delta \sigma_1^2 \geq \frac{2(1 - \alpha^2)}{c + (c - 2)\alpha}$, then $B^{(1)}$ has an eigenvalue $\geq 1$.

###### Lemma 5.

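If the stepsize $\delta$ is such that $\delta \sigma_1^2 < \frac{2(1 - \alpha^2)}{c + (c - 2)\alpha}$, then $B^{(2)}$ has an eigenvalue of magnitude $\geq 1 - \frac{500}{\kappa}$.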

Given this notation, we can now consider the $j$-th dimension without the superscripts; when needed, they will be made clear in the exposition. Denoting $x \overset{\text{def}}{=} \delta \sigma^2$ and $t \overset{\text{def}}{=} 1 + \alpha - x$, we have:

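$$B = \begin{bmatrix} t^2 + (c - 1) x^2 & -\alpha t & -\alpha t & \alpha^2 \\ t & 0 & -\alpha & 0 \\ t & -\alpha & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$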

### A.1 Proof

The analysis goes via computation of the characteristic polynomial of $B$ and evaluating it at different values to obtain bounds on its roots.

###### Lemma 6.

The characteristic polynomial of $B$ is:

$$D(z) = z^4 - \big(t^2+(c-1)x^2\big)z^3 + \big(2\alpha t^2 - 2\alpha^2\big)z^2 + \big(-t^2+(c-1)x^2\big)\alpha^2 z + \alpha^4.$$

###### Proof.

We first begin by writing out the expression for the determinant:

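$$
\det(B - zI) = \begin{vmatrix}
t^2+(c-1)x^2-z & -\alpha t & -\alpha t & \alpha^2 \\
t & -z & -\alpha & 0 \\
t & -\alpha & -z & 0 \\
1 & 0 & 0 & -z
\end{vmatrix}.
$$

Expanding along the first column, we have:

$$
\begin{aligned}
\det(B - zI) &= \big(t^2+(c-1)x^2-z\big)(\alpha^2 z - z^3) - t\big(-\alpha t z^2 + \alpha^2 t z\big) + t\big(-\alpha t(\alpha z) + z\cdot\alpha t z\big) - \big(z\cdot\alpha^2 z - \alpha^4\big) \\
&= \big(t^2+(c-1)x^2-z\big)(\alpha^2 z - z^3) - 2t\big(\alpha^2 t z - \alpha t z^2\big) - \big(\alpha^2 z^2 - \alpha^4\big).
\end{aligned}
$$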

Expanding the terms yields the expression in the lemma. ∎
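
The expression in Lemma 6 can also be checked numerically. The sketch below is an illustration only: the function names and the parameter values are assumptions, not taken from the paper. It builds $B$ for given $(\alpha, x, c)$ and compares the coefficients returned by `numpy.poly`, which computes the characteristic polynomial of a square matrix, against the coefficients of $D(z)$.

```python
import numpy as np

def B_matrix(alpha, x, c):
    """The 4x4 matrix B with t = 1 + alpha - x (superscripts dropped)."""
    t = 1.0 + alpha - x
    return np.array([
        [t**2 + (c - 1.0) * x**2, -alpha * t, -alpha * t, alpha**2],
        [t, 0.0, -alpha, 0.0],
        [t, -alpha, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
    ])

def D_coefficients(alpha, x, c):
    """Coefficients of D(z) from Lemma 6, highest power of z first."""
    t = 1.0 + alpha - x
    return np.array([
        1.0,
        -(t**2 + (c - 1.0) * x**2),
        2.0 * alpha * t**2 - 2.0 * alpha**2,
        (-t**2 + (c - 1.0) * x**2) * alpha**2,
        alpha**4,
    ])

# Illustrative values for momentum, x = delta * sigma^2, and c.
alpha, x, c = 0.7, 0.2, 3.0
numeric = np.poly(B_matrix(alpha, x, c))   # coefficients of det(zI - B)
print("Lemma 6 matches numpy.poly:", np.allclose(numeric, D_coefficients(alpha, x, c)))
```

Since $B$ is $4 \times 4$, $\det(zI - B) = \det(B - zI)$, so no sign adjustment is needed in the comparison.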

The next corollary follows by some simple arithmetic manipulations.

###### Corollary 7.

Substituting $z = 1-\tau$ in the characteristic equation of Lemma 6, we have:

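$$
\begin{aligned}
D(1-\tau) &= \tau^4 + \tau^3\big(-4+t^2+(c-1)x^2\big) + \tau^2\big(6-3t^2-3(c-1)x^2-2\alpha^2+2\alpha t^2\big) \\
&\quad + \tau\big(-4+3t^2+3(c-1)x^2+4\alpha^2-4\alpha t^2-(c-1)x^2\alpha^2+t^2\alpha^2\big) \\
&\quad + \big(1-t^2-(c-1)x^2-2\alpha^2+2\alpha t^2+(c-1)x^2\alpha^2-t^2\alpha^2+\alpha^4\big) \\
&= \tau^4 + \tau^3\big[-(3+\alpha)(1-\alpha)-2x(1+\alpha)+cx^2\big] \\
&\quad + \tau^2\big[(3-4\alpha-\alpha^2+2\alpha^3)-2x(1+\alpha)(2\alpha-3)+x^2(2\alpha-3c)\big] \\
&\quad + \tau\big[-(1-\alpha)^2(1-\alpha^2)-2x(3-\alpha)(1-\alpha^2)+x^2\big(3c-4\alpha+(2-c)\alpha^2\big)\big] \\
&\quad + x(1-\alpha)\big[2(1-\alpha^2)-x\big(c+(c-2)\alpha\big)\big].
\end{aligned}
\tag{2}
$$

###### Proof of Lemma 4.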

The first observation necessary to prove the lemma is that the characteristic polynomial approaches $+\infty$ as $z \to \infty$, i.e., $\lim_{z\to\infty} D(z) = +\infty$.

Next, we evaluate the characteristic polynomial at $z = 1$, i.e., compute $D(1)$. This follows in a straightforward manner from Corollary 7 by substituting $\tau = 0$ in equation (2), which yields

$$D(1) = (1-\alpha)\,x\cdot\big(2(1-\alpha^2) - x(1-\alpha) - (c-1)x(1+\alpha)\big).$$

As $\alpha < 1$ and $x > 0$, we have the following by setting $D(1) \le 0$ and solving for $x$:

$$x \ge \frac{2(1-\alpha^2)}{c+(c-2)\alpha}.$$

Since $D(1) \le 0$ and $D(z) \to +\infty$ as $z \to \infty$, there exists a root of $D(\cdot)$ which is $\ge 1$. ∎
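
The argument can be illustrated numerically. The sketch below uses hypothetical function names and illustrative values of $\alpha$ and $c$. It picks $x$ slightly above the threshold, evaluates the factored form of $D(1)$, and locates the largest real root of $D(z)$ with `numpy.roots`, under the assumption that $D(z)$ is the quartic stated in Lemma 6.

```python
import numpy as np

def stepsize_threshold(alpha, c):
    """Value of x = delta * sigma^2 at which D(1) changes sign."""
    return 2.0 * (1.0 - alpha**2) / (c + (c - 2.0) * alpha)

def D_at_one(alpha, x, c):
    """Factored form of D(1) from the proof of Lemma 4."""
    return (1.0 - alpha) * x * (
        2.0 * (1.0 - alpha**2) - x * (1.0 - alpha) - (c - 1.0) * x * (1.0 + alpha)
    )

def char_poly_coefficients(alpha, x, c):
    """Coefficients of D(z) from Lemma 6, highest power first."""
    t = 1.0 + alpha - x
    return [
        1.0,
        -(t**2 + (c - 1.0) * x**2),
        2.0 * alpha * t**2 - 2.0 * alpha**2,
        (-t**2 + (c - 1.0) * x**2) * alpha**2,
        alpha**4,
    ]

def largest_real_root(alpha, x, c):
    """Largest real root of D(z); roots with tiny imaginary part count as real."""
    roots = np.roots(char_poly_coefficients(alpha, x, c))
    real_roots = roots[np.abs(roots.imag) < 1e-9].real
    return float(real_roots.max())

alpha, c = 0.5, 3.0                          # illustrative values
x = 1.1 * stepsize_threshold(alpha, c)       # step size just past the threshold
print("D(1) =", D_at_one(alpha, x, c))
print("largest real root of D:", largest_real_root(alpha, x, c))
```

With $x$ above the threshold, $D(1)$ is negative and the largest real root exceeds one, which is the divergence condition the lemma describes.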

###### Remark 8.

The above characterization is striking in the sense that for any , increasing the momentum parameter naturally requires the reduction in the step size to permit the convergence of the algorithm, which is not observed when fast gradient methods are employed in deterministic optimization. For instance, in the case of deterministic optimization, setting yields . On the other hand, when employing the stochastic heavy ball method with , we have the condition that , and this implies, .
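
To make the contrast concrete, the following sketch evaluates the step size threshold $2(1-\alpha^2)/(c+(c-2)\alpha)$ from the proof of Lemma 4 for two values of $c$. Treating $c = 1$ as the deterministic case is an assumption based on the fact that the $(c-1)x^2$ term in $B$ vanishes there; $c = 2$ is used only as an illustrative stochastic value, and the function name is hypothetical.

```python
def stepsize_threshold(alpha, c):
    """Largest admissible x = delta * sigma^2 before D(1) becomes non-positive."""
    return 2.0 * (1.0 - alpha**2) / (c + (c - 2.0) * alpha)

for alpha in (0.0, 0.5, 0.9):
    deterministic = stepsize_threshold(alpha, 1.0)   # simplifies to 2 * (1 + alpha)
    stochastic = stepsize_threshold(alpha, 2.0)      # simplifies to 1 - alpha**2
    print(f"alpha={alpha:.1f}  c=1: {deterministic:.3f}  c=2: {stochastic:.3f}")
```

For $c = 1$ the admissible step size grows with $\alpha$, whereas for $c = 2$ it shrinks toward zero as $\alpha \to 1$.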

We now prove Lemma 5. We first consider the large momentum setting.

###### Lemma 9.

When the momentum parameter is set such that , has an eigenvalue of magnitude .

###### Proof.

This follows easily from the fact that , thus implying . ∎
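
One way to see this kind of bound numerically relies on the constant term of $D(z)$ in Lemma 6: it equals $\alpha^4$, so the product of the four eigenvalues of $B$ is $\alpha^4$ and at least one of them has magnitude at least $\alpha$. The sketch below checks this over a small grid; the function names and the grid values are hypothetical and chosen only for illustration.

```python
import numpy as np

def B_matrix(alpha, x, c):
    """The 4x4 matrix B with t = 1 + alpha - x (superscripts dropped)."""
    t = 1.0 + alpha - x
    return np.array([
        [t**2 + (c - 1.0) * x**2, -alpha * t, -alpha * t, alpha**2],
        [t, 0.0, -alpha, 0.0],
        [t, -alpha, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
    ])

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Illustrative grid over momentum, x = delta * sigma^2, and c.
for alpha in (0.1, 0.5, 0.9, 0.99):
    for x in (0.01, 0.1, 0.5):
        for c in (1.0, 2.0, 3.0):
            B = B_matrix(alpha, x, c)
            det_ok = np.isclose(np.linalg.det(B), alpha**4)
            radius_ok = spectral_radius(B) >= alpha - 1e-6
            print(alpha, x, c, det_ok, radius_ok)
```

The check is independent of the step size, which matches the observation that large momentum alone caps the attainable rate.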

###### Remark 10.

Note that the above lemma holds for any value of the learning rate , and holds for every eigen direction of . Thus, for “large” values of momentum, the behavior of stochastic heavy ball does degenerate to the behavior of stochastic gradient descent.

We now consider the setting where momentum is bounded away from .

###### Corollary 11.

Consider , by substituting $\tau = l/\kappa$, $x = c(\delta\sigma_1^2)/\kappa$ in equation (2) and accumulating terms in varying powers of $1/\kappa$, we obtain:

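$$
\begin{aligned}
G(l) \overset{\text{def}}{=}\; & \frac{c^3(\delta\sigma_1^2)^2 l^3}{\kappa^5} + \frac{l^4 - 2c(\delta\sigma_1^2)l^3(1+\alpha) + (2\alpha-3c)c^2(\delta\sigma_1^2)^2 l^2}{\kappa^4} \\
& + \frac{-(3+\alpha)(1-\alpha)l^3 - 2(1+\alpha)(2\alpha-3)c(\delta\sigma_1^2)l^2 + \big(3c-4\alpha+(2-c)\alpha^2\big)c^2(\delta\sigma_1^2)^2 l}{\kappa^3} \\
& + \frac{(3-4\alpha-\alpha^2+2\alpha^3)l^2 - 2c(\delta\sigma_1^2)l(3-\alpha)(1-\alpha^2) - c^2(\delta\sigma_1^2)^2(1-\alpha)\big(c+(c-2)\alpha\big)}{\kappa^2} \\
& + \frac{-(1-\alpha)^2(1-\alpha^2)l + 2c(\delta\sigma_1^2)(1-\alpha)(1-\alpha^2)}{\kappa}
\end{aligned}
\tag{3}
$$
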
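
The expansion above can be cross-checked against the characteristic polynomial directly. The sketch below assumes that $G(l)$ equals $D(1 - l/\kappa)$ evaluated at $x = c(\delta\sigma_1^2)/\kappa$, which is what the powers of $\kappa$ in the display imply. Here `ds` stands for $\delta\sigma_1^2$, and the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def D(z, alpha, x, c):
    """Characteristic polynomial of B (Lemma 6) evaluated at z."""
    t = 1.0 + alpha - x
    coeffs = [
        1.0,
        -(t**2 + (c - 1.0) * x**2),
        2.0 * alpha * t**2 - 2.0 * alpha**2,
        (-t**2 + (c - 1.0) * x**2) * alpha**2,
        alpha**4,
    ]
    return float(np.polyval(coeffs, z))

def G(l, alpha, c, ds, kappa):
    """Expansion of D(1 - l/kappa) in powers of 1/kappa; ds = delta * sigma_1^2."""
    k5 = c**3 * ds**2 * l**3
    k4 = l**4 - 2*c*ds*l**3*(1 + alpha) + (2*alpha - 3*c) * c**2 * ds**2 * l**2
    k3 = (-(3 + alpha)*(1 - alpha)*l**3
          - 2*(1 + alpha)*(2*alpha - 3)*c*ds*l**2
          + (3*c - 4*alpha + (2 - c)*alpha**2) * c**2 * ds**2 * l)
    k2 = ((3 - 4*alpha - alpha**2 + 2*alpha**3) * l**2
          - 2*c*ds*l*(3 - alpha)*(1 - alpha**2)
          - c**2 * ds**2 * (1 - alpha) * (c + (c - 2)*alpha))
    k1 = -(1 - alpha)**2 * (1 - alpha**2) * l + 2*c*ds*(1 - alpha)*(1 - alpha**2)
    return k5/kappa**5 + k4/kappa**4 + k3/kappa**3 + k2/kappa**2 + k1/kappa

# Illustrative values only.
l, alpha, c, ds, kappa = 9.0, 0.3, 3.0, 0.1, 50.0
x = c * ds / kappa
print("G(l) matches D(1 - l/kappa):",
      abs(G(l, alpha, c, ds, kappa) - D(1.0 - l / kappa, alpha, x, c)) < 1e-10)
```

Grouping the numerators by power of $1/\kappa$ mirrors the way the subsequent proof bounds each group separately.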
###### Lemma 12.

Let , , . Then, .

###### Proof.

Since , this implies , thus implying, .

Substituting the value of in equation (11), the coefficient of is .

We will bound this term along with to obtain:

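$$
\begin{aligned}
\frac{-(1-\alpha)^3(1+\alpha)}{\kappa} + \frac{(1-\alpha)^2(3+2\alpha)l^2}{\kappa^2}
&\le \frac{-(1-\alpha)^3(1+\alpha)}{\kappa} + \frac{405(1-\alpha)^2}{\kappa^2} \\
&\le \frac{(1-\alpha)^2}{\kappa}\left(\frac{405}{\kappa} - (1-\alpha^2)\right) \\
&\le \frac{(1-\alpha)^2}{\kappa}\left(\frac{405}{\kappa} - (1-\alpha)\right) \le \frac{-45\cdot 450^2}{\kappa^4},
\end{aligned}
$$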

where, we use the fact that , . The natural implication of this bound is that the lower-order terms, such as $1/\kappa^4$ and $1/\kappa^5$, will be negative owing to the large constant above. Let us verify that this is indeed the case by considering the terms having powers of $1/\kappa^4$ and $1/\kappa^5$ from equation (3):

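$$
\begin{aligned}
&\frac{c^3(\delta\sigma_1^2)^2 l^3}{\kappa^5} + \frac{l^4 - 2c(\delta\sigma_1^2)l^3(1+\alpha) + (2\alpha-3c)c^2(\delta\sigma_1^2)^2 l^2}{\kappa^4} - \frac{45\cdot 450^2}{\kappa^4} \\
&\quad \le \frac{c^3(\delta\sigma_1^2)^2 l^3}{\kappa^5} + \frac{l^4}{\kappa^4} - \frac{45\cdot 450^2}{\kappa^4} \\
&\quad \le \frac{c\,l^3}{\kappa^5} + \frac{9^4 - (45\cdot 450^2)}{\kappa^4} \le \frac{9^3 c + 9^4 - (45\cdot 450^2)}{\kappa^4}
\end{aligned}
$$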

The expression above evaluates to $\le 0$ given an upper bound on the value of $c$. The expression above follows from the fact that .

Next, consider the terms involving and , in particular,

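$$
\begin{aligned}
&\frac{\big(3c-4\alpha+(2-c)\alpha^2\big)c^2(\delta\sigma_1^2)^2 l}{\kappa^3} - \frac{c^2(\delta\sigma_1^2)^2(1-\alpha)\big(c+(c-2)\alpha\big)}{\kappa^2} \\
&\quad \le \frac{c^2(\delta\sigma_1^2)^2}{\kappa^2}\left(\frac{l(3c+2)}{\kappa} - (1-\alpha)\big(c+(c-2)\alpha\big)\right) \\
&\quad \le \frac{c^2(\delta\sigma_1^2)^2}{\kappa^2}\left(\frac{5cl}{\kappa} - (1-\alpha)\big(c+(c-2)\alpha\big)\right) \\
&\quad \le \frac{c^2(\delta\sigma_1^2)^2}{\kappa^2}\left(\frac{5cl}{\kappa} - (1-\alpha)c\right) \\
&\quad \le \frac{c^3(\delta\sigma_1^2)^2}{\kappa^2}\left(\frac{5l}{\kappa} - \frac{450}{\kappa}\right) \\
&\quad \le \frac{c^3(\delta\sigma_1^2)^2}{\kappa^2}\cdot\frac{-405}{\kappa} \le 0.
\end{aligned}
$$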

Next,

$$\frac{-2(1+\alpha)(2\alpha-3)c(\delta\sigma_1^2)l^2}{\kappa^3} - \frac{2c(\delta\sigma_1^2)l(3-\alpha)(1-\alpha^2)}{\kappa^2}$$

In both these cases, we used the fact that implying . Finally, other remaining terms are negative. ∎

Before wrapping up the proof of the proposition, we need the following lemma to ensure that our lower bounds on the largest eigenvalue of $B$ indeed affect the algorithm's rates and hold irrespective of where the algorithm is started. Note that this makes our result much stronger than typical optimization lower bounds, which rely on specific initializations to ensure a component along the largest eigendirection of the update operator for which bounds are proven.

###### Lemma 13.

For any starting iterate , the HB method produces a non-zero component along the largest eigen direction of .

###### Proof.

We note that in a similar manner as other proofs, it suffices to argue for each dimension of the problem separately. But before we start looking at each dimension separately, let us consider the dimension, and detail the approach we use to prove the claim: the idea is to examine the subspace spanned by covariance of the iterates , for every starting iterate and prove that the largest eigenvector of the expected operator is not orthogonal to this subspace. This implies that there exists a non-zero component of in the largest eigen direction of , and this decays at a rate that is at best .

Since , we begin by examining the expected covariance spanned by the iterates . Let . Now, this implies . Then,

This implies that just appears as a scale factor. This in turn implies that in order to analyze the subspace spanned by the covariance of iterates , we can assume without any loss in generality. This implies, . Note that with this in place, we see that we can now drop the superscript that represents the dimension, since the analysis decouples across the dimensions . Furthermore, let the entries of the vector be represented as Next, denote . This implies,

$$\hat{A}_k = \begin{bmatrix} \hat{t}_k & -\alpha \\ 1 & 0 \end{bmatrix}.$$

Furthermore,

 (4)

Let us consider the vectorized form of , and we denote this as . Note that makes become a column vector of size . Now, consider for and concatenate these to form a matrix that we denote as , i.e.

$$D = \begin{bmatrix} \mathrm{vec}(\Phi_0) & \mathrm{vec}(\Phi_1) & \mathrm{vec}(\Phi_2) & \mathrm{vec}(\Phi_3) \end{bmatrix}.$$

Now, since we note that is a symmetric matrix, should contain two identical rows implying that it has an eigenvalue that is zero and a corresponding eigenvector that is . It turns out that this is also an eigenvector of with an eigenvalue . Note that . This implies there are two cases that we need to consider: (i) when all eigenvalues of have the same magnitude (). In this case, we are already done, because there exists at least one non zero eigenvalue of and this should have some component along one of the eigenvectors of and we know that all eigenvectors have eigenvalues with a magnitude equal to . Thus, there exists an iterate which has a non-zero component along the largest eigendirection of . (ii) the second case is the situation when we have eigenvalues with different magnitudes. In this case, note that implying . In this case, we need to prove that spans a three-dimensional subspace; if it does, it contains a component along the largest eigendirection of which will round up the proof. Since we need to understand whether spans a three dimensional subspace, we can consider a different (yet related) matrix, which we call and this is defined as:

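$$
R \overset{\text{def}}{=} \mathbb{E}\left(\begin{bmatrix}
\theta_{01}^2 & \theta_{11}^2 & \theta_{21}^2 \\
\theta_{01}\theta_{02} & \theta_{11}\theta_{12} & \theta_{21}\theta_{22} \\
\theta_{02}^2 & \theta_{12}^2 & \theta_{22}^2
\end{bmatrix}\right)
$$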

Given the expressions for (by definition of