1 Introduction
First-order stochastic methods have become the workhorse of machine learning, where many tasks can be cast as optimization problems of the form
min_{x ∈ ℝ^d} f(x). (1)
Methods incorporating momentum and acceleration play an important role in the current practice of machine learning (Sutskever et al., 2013; Bottou et al., 2018), where they are commonly used in conjunction with stochastic gradients. However, the theoretical understanding of accelerated methods remains limited when they are used with stochastic gradients.
This paper studies the accelerated gradient (ag) method of Nesterov (1983). Given an initial point x_0 ∈ ℝ^d, and with x_{−1} = x_0, the ag method repeats, for k ≥ 0,
y_k = x_k + β (x_k − x_{k−1}), (2)
x_{k+1} = y_k − α g(y_k), (3)
where α and β are the stepsize and momentum parameters,^1 respectively, and in the deterministic setting, g(y_k) = ∇f(y_k). When the momentum parameter is β = 0, ag simplifies to standard gradient descent (gd). When β > 0 it is possible to achieve accelerated rates of convergence for certain combinations of α and β in the deterministic setting.
^1 We focus on constant stepsize and momentum in this paper, although much of our analysis can be easily extended to handle varying stepsize and/or momentum.
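As a concrete reference point, the iteration (2)–(3) can be sketched in a few lines of Python. This is a minimal sketch under our own naming (the function `accelerated_gradient` and its arguments are ours, not from the paper), with `grad` standing in for the (possibly stochastic) gradient oracle g:

```python
import numpy as np

def accelerated_gradient(grad, x0, alpha, beta, num_iters):
    """Nesterov's ag method, eqs. (2)-(3), with constant stepsize and momentum.

    `grad` returns the (possibly stochastic) gradient at a point; with
    beta = 0 this reduces to plain (stochastic) gradient descent.
    """
    x_prev = np.array(x0, dtype=float)  # x_{-1} = x_0 convention
    x = x_prev.copy()
    for _ in range(num_iters):
        y = x + beta * (x - x_prev)          # extrapolation/momentum step (2)
        x_prev, x = x, y - alpha * grad(y)   # gradient step from y (3)
    return x

# Example: minimize f(x) = 0.5 * x^T diag(1, 100) x, so kappa = 100.
A = np.diag([1.0, 100.0])
x_star = accelerated_gradient(lambda y: A @ y, np.array([1.0, 1.0]),
                              alpha=1.0 / 100.0, beta=9.0 / 11.0, num_iters=500)
```

With the standard parameter choices α = 1/L and β = (√κ−1)/(√κ+1) used above, the iterates contract toward the minimizer at roughly the accelerated 1 − 1/√κ rate discussed below.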
1.1 Previous Work with Deterministic Gradients
Suppose that the objective function f in (1) is L-smooth and μ-strongly convex. Then f is minimized at a unique point x*, and we denote its minimum by f* = f(x*). Let κ = L/μ denote the condition number of f. In the deterministic setting, where g(y_k) = ∇f(y_k) for all k, gd with constant stepsize α = 1/L converges at the rate (Polyak, 1987)
f(x_k) − f* ≤ (1 − 1/κ)^k (f(x_0) − f*). (4)
The ag method with constant stepsize α = 1/L and momentum parameter β = (√κ − 1)/(√κ + 1) converges at the rate (Nesterov, 2004)
f(x_k) − f* ≤ (f(x_0) − f* + (μ/2) ||x_0 − x*||²) (1 − 1/√κ)^k. (5)
The rate in (5) matches (up to constants) the tightest-known worst-case lower bound achievable by any first-order black-box method for strongly-convex and smooth objectives:
f(x_k) − f* ≥ (μ/2) ((√κ − 1)/(√κ + 1))^{2k} ||x_0 − x*||². (6)
The lower bound (6) is proved in Nesterov (2004) in the infinite-dimensional setting under the assumption that the iterates satisfy x_k ∈ x_0 + span{∇f(x_0), …, ∇f(x_{k−1})}. Accordingly, Nesterov’s accelerated gradient method is considered optimal in the sense that the convergence rate in (5) depends on √κ rather than κ.
The proof of (5) presented in Nesterov (2004) uses the method of estimate sequences. Several works have set out to develop better intuition for how the ag method achieves acceleration through other analysis techniques. One line of work considers the limit of infinitesimally small stepsizes, obtaining ordinary differential equations (ODEs) that model the trajectory of the ag method (Su et al., 2014; Defazio, 2019; Laborde and Oberman, 2019). Allen-Zhu and Orecchia (2014) view the ag method as an alternating iteration between mirror descent and gradient descent and show sublinear convergence of the ag method for smooth convex objectives. Lessard et al. (2016) and Hu and Lessard (2017) frame the ag method and other popular first-order optimization methods as linear dynamical systems with feedback and characterize their convergence rates using a control-theoretic stability framework. The framework leads to closed-form rates of convergence for strongly-convex quadratic functions with deterministic gradients. For more general (non-quadratic) deterministic problems, the framework provides a means to numerically certify rates of convergence.
1.2 Previous Work with Stochastic Gradients
When Nesterov’s method is run with stochastic gradients g(y_k), typically satisfying E[g(y_k)] = ∇f(y_k), we refer to it as the accelerated stochastic gradient (asg) method. In this setting, if β = 0 then asg is equivalent to stochastic gradient descent (sgd).
Despite the widespread interest in, and use of, the asg method, there are no definitive theoretical convergence guarantees. Wiegerinck et al. (1994) study the asg method in an online learning setting and show that optimization can be modelled as a Markov process, but do not provide convergence rates. Yang et al. (2016) study the asg method in the smooth strongly-convex setting and show a sublinear convergence rate when the method is employed with a diminishing stepsize under a bounded-gradient assumption, but the rates obtained are slower than those for sgd.
Recent work establishes convergence guarantees for the asg method in certain restricted settings. Aybat et al. (2019) consider smooth strongly-convex functions in a stochastic approximation model with gradients that are unbiased and have bounded variance, and they show convergence to a neighborhood when running the method with constant stepsize and momentum. Can et al. (2019) further establish convergence in Wasserstein distance under a stochastic approximation model. Laborde and Oberman (2019) study a perturbed ODE and show convergence for diminishing stepsize. Vaswani et al. (2019) study the asg method with constant stepsize and diminishing momentum, and show linear convergence under a strong-growth condition, where the gradient variance vanishes at a stationary point.
Some results are available for other momentum schemes. Loizou and Richtárik (2017) study Polyak’s heavy-ball momentum method with stochastic gradients for quadratic problems and show that it converges linearly under an exactness assumption. Gitman et al. (2019) characterize the stationary distribution of the Quasi-Hyperbolic Momentum (qhm) method (Ma and Yarats, 2019) around the minimizer for strongly-convex quadratic functions with bounded gradients and bounded gradient-noise variance.
The lack of general convergence guarantees for existing momentum schemes, such as Polyak’s and Nesterov’s, has led many authors to develop alternative accelerated methods specifically for use with stochastic gradients (Lan, 2012; Allen-Zhu, 2017; Kidambi et al., 2018; Kulunchakov and Mairal, 2019; Liu and Belkin, 2020).
1.3 Contributions
We provide additional insights into the behavior of Nesterov’s accelerated gradient method when run with stochastic gradients by considering two different settings. We first consider the stochastic approximation setting, where the gradients used by the method are unbiased, conditionally independent from iteration to iteration, and have bounded variance. We show that Nesterov’s method converges at an accelerated linear rate to a neighborhood of the optimal solution for smooth strongly-convex quadratic problems.
Next, we consider the finite-sum setting, where f(x) = (1/n) Σ_{i=1}^{n} f_i(x), under the assumption that each term f_i is smooth and strongly-convex, and the only randomness is due to sampling one term, or a minibatch of terms, at each iteration. In this setting we prove that, even when all functions f_i are quadratic, Nesterov’s asg method with the usual choice of stepsize and momentum cannot be guaranteed to converge without making additional assumptions on the condition number and data distribution. When coupled with convergence guarantees in the stochastic approximation setting, this impossibility result illuminates the dichotomy between our understanding of momentum-based methods in the stochastic approximation setting and practical implementations of these methods in a finite-sum framework.
Our results also shed light on why Nesterov’s method may fail to converge or achieve acceleration in the finite-sum setting, providing further insight into what has previously been reported based on empirical observations. In particular, the bounded-variance assumption does not apply in the finite-sum setting with quadratic objectives.
We also suggest choices of the stepsize and momentum parameters under which the asg method is guaranteed to converge for any smooth strongly-convex finite-sum, but where accelerated rates of convergence are no longer guaranteed. Our analysis approach leads to new bounds on the convergence rate of sgd in the finite-sum setting, under the assumption that each term is smooth, strongly-convex, and twice continuously differentiable.
2 Preliminaries and Analysis Framework
In this section we establish a basic framework for analyzing the ag method. Then we specialize it to the stochastic approximation and finite-sum settings, respectively, in Sections 3 and 4.
Throughout this paper we assume that f is twice continuously differentiable, L-smooth, and μ-strongly convex, with 0 < μ ≤ L < ∞; see, e.g., Nesterov (2004); Bubeck (2015). Examples of typical tasks satisfying these assumptions are ℓ2-regularized logistic regression and ℓ2-regularized least-squares regression (i.e., ridge regression). Taken together, these properties imply that the Hessian ∇²f(x) exists, and for all x the eigenvalues of ∇²f(x) lie in the interval [μ, L]. Also, recall that x* denotes the unique minimizer of f and f* = f(x*).
In contrast to all previous work we are aware of, our analysis focuses on the sequence x_k generated by the method (2)–(3). Let δ_k = x_k − x* denote the suboptimality of the current iterate, and let v_k = x_k − x_{k−1} denote the velocity.
Substituting the definition of y_k from (2) into (3) and rearranging, we obtain
v_{k+1} = β v_k − α g(x_k + β v_k). (7)
By using the definition of δ_k, substituting (7) and (3) into (2), and rearranging, we also obtain that
δ_{k+1} = δ_k + β v_k − α g(x_k + β v_k). (8)
Combining (7) and (8), we get the recursion
[δ_{k+1}; v_{k+1}] = [[I, βI], [0, βI]] [δ_k; v_k] − α [g(x_k + β v_k); g(x_k + β v_k)]. (9)
Note that δ_0 = x_0 − x* and v_0 = 0, based on the common convention that x_{−1} = x_0.
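For a quadratic objective with Hessian A and minimizer x* = 0, the recursion (9) with exact gradients collapses to a linear iteration on the state [δ_k; v_k]. The sketch below is our own numerical sanity check (all names are ours): it assembles that block iteration matrix and confirms it reproduces the trajectory of a direct implementation of (2)–(3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta = 3, 0.1, 0.7
A = np.diag([1.0, 4.0, 9.0])          # Hessian of a quadratic with x* = 0
x_prev = x = rng.standard_normal(d)    # x_{-1} = x_0 convention

# Direct iteration of (2)-(3) with exact gradients grad f(y) = A y.
xs = [x]
for _ in range(20):
    y = x + beta * (x - x_prev)
    x_prev, x = x, y - alpha * A @ y
    xs.append(x)

# State-space form of (9): z = [delta; v] evolves as z_{k+1} = T z_k.
I = np.eye(d)
T = np.block([[I - alpha * A, beta * (I - alpha * A)],
              [-alpha * A,    beta * (I - alpha * A)]])
z = np.concatenate([xs[0], np.zeros(d)])   # delta_0 = x_0, v_0 = 0
for _ in range(20):
    z = T @ z
assert np.allclose(z[:d], xs[-1], atol=1e-10)  # both views agree
```

The two trajectories coincide to machine precision, which is exactly the reduction exploited in the analysis that follows.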
Our analysis below will build on the recursion (9) and will also make use of the basic fact that if f is twice continuously differentiable then for all x, y ∈ ℝ^d,
∇f(y) = ∇f(x) + (∫₀¹ ∇²f(x + t(y − x)) dt) (y − x). (10)
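The identity (10) is the fundamental theorem of calculus applied to ∇f along the segment from x to y. A quick numerical check (our own sketch, using a simple non-quadratic convex function of our choosing) illustrates it:

```python
import numpy as np

# f(x) = 0.25 * ||x||^4 + 0.5 * ||x||^2, twice continuously differentiable.
def grad(x):
    return (x @ x + 1.0) * x

def hess(x):
    return (x @ x + 1.0) * np.eye(len(x)) + 2.0 * np.outer(x, x)

x = np.array([0.3, -0.2])
y = np.array([1.0, 0.5])

# Approximate the averaged Hessian H = integral_0^1 hess(x + t(y-x)) dt
# with a trapezoid rule on a fine grid.
ts = np.linspace(0.0, 1.0, 2001)
Hs = np.stack([hess(x + t * (y - x)) for t in ts])
w = np.full(len(ts), 1.0)
w[0] = w[-1] = 0.5
H = (Hs * w[:, None, None]).sum(axis=0) * (ts[1] - ts[0])

# (10): grad(y) = grad(x) + H (y - x), up to quadrature error.
assert np.allclose(grad(y), grad(x) + H @ (y - x), atol=1e-6)
```

The averaged Hessian H inherits the eigenvalue bounds [μ, L] of ∇²f, which is what the analysis below relies on.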
3 The Stochastic Approximation Setting
Now consider the stochastic approximation setting. We assume, for all k, that the stochastic gradient g(x_k + β v_k) is a random vector satisfying
E[g(x_k + β v_k)] = ∇f(x_k + β v_k),
and that there is a finite constant σ² such that
E[||g(x_k + β v_k) − ∇f(x_k + β v_k)||²] ≤ σ².
Let ξ_k = g(x_k + β v_k) − ∇f(x_k + β v_k) denote the gradient noise at iteration k, and suppose that these gradient noise terms are mutually independent. Applying (10) with y = x_k + β v_k and x = x*, we get that
∇f(x_k + β v_k) = H_k (δ_k + β v_k), where H_k = ∫₀¹ ∇²f(x* + t(δ_k + β v_k)) dt.
Using this in (9), we find that δ_k and v_k evolve according to
z_{k+1} = T_k z_k − α ξ̄_k, (11)
where
z_k = [δ_k; v_k], T_k = [[I − αH_k, β(I − αH_k)], [−αH_k, β(I − αH_k)]], and ξ̄_k = [ξ_k; ξ_k]. (12)
Unrolling the recursion (11), we get that
z_{k+1} = (∏_{j=0}^{k} T_j) z_0 − α Σ_{j=0}^{k} (∏_{i=j+1}^{k} T_i) ξ̄_j, (13)
from which it is clear that we may expect convergence properties to depend on the matrix products ∏_j T_j.
3.1 The quadratic case
We can explicitly bound the matrix product in the specific case where f(x) = ½ xᵀAx + bᵀx + c, for a symmetric matrix A, and with ∇²f(x) = A and hence H_k = A and T_k = T for all k. In this case, (13) simplifies to
z_{k+1} = T^{k+1} z_0 − α Σ_{j=0}^{k} T^{k−j} ξ̄_j, (14)
where
T = [[I − αA, β(I − αA)], [−αA, β(I − αA)]]. (15)
We obtain an error bound by ensuring that the spectral radius ρ(T) of T is less than 1. In this case we recover the well-known 1 − 1/√κ rate for ag in the deterministic setting. Let r = ρ(T).
Theorem 1.
If α and β are chosen so that r < 1, then for any r̄ ∈ (r, 1), there exists a constant C ≥ 1 such that, for all k ≥ 0,
E[||z_k||²] ≤ C² r̄^{2k} ||z_0||² + (2α²C²/(1 − r̄²)) σ².
Theorem 1 holds with respect to all norms; the constant C depends on r̄ and the choice of norm. Theorem 1 shows that asg converges at a linear rate to a neighborhood of the minimizer of f whose size is proportional to σ². The proof is given in Appendix A of the supplementary material, and we provide numerical experiments in Section 3.2 to analyze the tightness of the convergence rate and of the coefficient multiplying σ² in Theorem 1. In comparison to Aybat et al. (2019), we recover the same rate, despite taking a very different approach, and the coefficient multiplying σ² in Theorem 1 is smaller.
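The spectral-radius condition in Theorem 1 is easy to check numerically. The sketch below (our own construction of T from (15); parameter values are illustrative) verifies that, at the standard stepsize and momentum, ρ(T) equals the accelerated rate 1 − 1/√κ:

```python
import numpy as np

def iteration_matrix(A, alpha, beta):
    """Block matrix T from (15) for a quadratic with Hessian A."""
    I = np.eye(A.shape[0])
    B = I - alpha * A
    return np.block([[B, beta * B], [-alpha * A, beta * B]])

mu, L = 1.0, 100.0
kappa = L / mu
A = np.diag(np.linspace(mu, L, 10))   # eigenvalues spread over [mu, L]
alpha = 1.0 / L
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

rho = max(abs(np.linalg.eigvals(iteration_matrix(A, alpha, beta))))
# Expected: rho = 1 - 1/sqrt(kappa) = 0.9 for kappa = 100.
```

The maximizing eigenvalue comes from the λ = μ block, consistent with the quasi-convexity argument used in Appendix A.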
Corollary 1.1.
Suppose that α = 1/L and β = (√κ − 1)/(√κ + 1). Then r = 1 − 1/√κ, and for any r̄ ∈ (1 − 1/√κ, 1) and all k ≥ 0,
E[||z_k||²] ≤ C² r̄^{2k} ||z_0||² + (2α²C²/(1 − r̄²)) σ²,
where C is the constant from Theorem 1 for this choice of r̄.
Theorem 2.
Let f be an L-smooth, μ-strongly-convex, twice continuously-differentiable function (not necessarily quadratic). Suppose that β = 0 and α = 1/L. Then for all k ≥ 0,
E[||δ_k||²] ≤ (1 − 1/κ)^{2k} ||δ_0||² + κ α² σ².
Corollary 1.1 confirms that, with the standard choice of parameters, asg converges at an accelerated rate to a neighborhood of the optimizer. Comparing with Theorem 2, which is proved in Appendix B, we see that in the stochastic approximation setting, with bounded variance, asg not only converges at a faster rate than sgd; the factor multiplying α²σ² also scales more favorably, O(√κ) for asg vs. O(κ) for sgd.
3.2 Numerical Experiments
In Figure 1 we visualize runs of the asg method on a least-squares regression problem for different problem condition numbers κ. The objective corresponds to the worst-case quadratic function used to construct the lower bound (6) (Nesterov, 2004), restricted to a finite dimension. Stochastic gradients are sampled by adding zero-mean Gaussian noise with variance σ² to the true gradient. The left plots in each subfigure depict theoretical predictions from Theorem 1, while the right plots in each subfigure depict empirical results. Each pixel corresponds to an independent run of the asg method for a specific choice of constant stepsize and momentum parameters. In all figures, the area enclosed by the red contour depicts the theoretical stability region from Theorem 1, for which r < 1.
Figures (a)/(c)/(e) showcase the coefficient multiplying the variance term σ². Brighter regions correspond to smaller coefficients, while darker regions correspond to larger coefficients. All sets of figures (theoretical and empirical) use the same color scale. We can see that the coefficient of the variance term in Theorem 1 provides a good characterization of the magnitude of the neighbourhood of convergence, suggesting that it is reasonable to instantiate Theorem 1 with respect to the infinity norm.
Figures (b)/(d)/(f) showcase the linear convergence rate in theory and in practice. Brighter regions correspond to faster rates, and darker regions correspond to slower rates. Again, all figures (theoretical and empirical) use the same color scale. We can see that the theoretical linear convergence rates in Theorem 1 provide a good characterization of the empirical convergence rates. Moreover, the theoretical conditions for convergence in Theorem 1, depicted by the red contour, appear to be tight.
In short, the theory developed in this section appears to provide an accurate characterization of the asg method in the stochastic-approximation setting. As we will see in the subsequent section, this theoretical characterization does not reflect its behavior in the finite-sum setting, which is typically closer to practical machine-learning setups, where randomness is due to minibatching.
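A coarse version of the theoretical panels in Figure 1 can be produced by sweeping (α, β) and evaluating the spectral radius per eigenvalue; since the radius is quasi-convex in λ (see Appendix A), only λ ∈ {μ, L} needs checking. This is our own sketch with an illustrative grid, not the paper's experiment code:

```python
import numpy as np

def rho_2x2(lam, alpha, beta):
    """Spectral radius of the 2x2 per-eigenvalue block of T."""
    a = 1.0 - alpha * lam
    T = np.array([[a, beta * a], [-alpha * lam, beta * a]])
    return max(abs(np.linalg.eigvals(T)))

mu, L = 1.0, 100.0
alphas = np.linspace(0.001, 2.0 / L, 25)
betas = np.linspace(0.0, 1.0, 25)

# Stability region: max over lambda in {mu, L} of rho is below 1.
stable = np.array([[max(rho_2x2(mu, al, be), rho_2x2(L, al, be)) < 1.0
                    for al in alphas] for be in betas])
```

Plotting `stable` (or the rate itself) over the grid reproduces the qualitative shape of the red stability contour described above.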
4 The FiniteSum Setting
Now consider the finite-sum setting, with
f(x) = (1/n) Σ_{i=1}^{n} f_i(x), (16)
where each function f_i is strongly convex, smooth, and twice continuously differentiable. In this setting, stochastic gradients are obtained by sampling a subset of the terms. This can be seen as approximating the gradient with a minibatch gradient
g(x) = (1/n) Σ_{i=1}^{n} v_i ∇f_i(x), (17)
where v ∈ ℝ^n is a sampling vector with components satisfying E[v_i] = 1 (Gower et al., 2019). To simplify the discussion, let us assume that the minibatch sampled at every iteration has the same size, and all elements are given the same weight; so those indices i which are sampled have v_i = n/b, where b is the minibatch size (1 ≤ b < n), and v_i = 0 for all other indices.
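A uniform minibatch sampling vector of this kind can be sketched as follows (the helper `sampling_vector` and its signature are our own, hypothetical names):

```python
import numpy as np

def sampling_vector(n, b, rng):
    """Uniform-minibatch sampling vector v with E[v_i] = 1.

    The b sampled indices get weight n/b and all others get 0, so the
    minibatch gradient (1/n) * sum_i v_i grad f_i(x) is the average of
    the sampled terms' gradients.
    """
    v = np.zeros(n)
    idx = rng.choice(n, size=b, replace=False)
    v[idx] = n / b
    return v

rng = np.random.default_rng(0)
v = sampling_vector(10, 2, rng)
```

Each v_i equals n/b with probability b/n and 0 otherwise, so E[v_i] = 1 as required by (17).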
4.1 An Impossibility Result
It turns out that even when each function f_i is well-behaved, the asg method may diverge when using the standard choice of stepsize and momentum. Our main result in this section shows that it is impossible to obtain a general convergence result for asg in this setting for this choice of parameters.
Let us assume that we do not see the same minibatch twice consecutively; i.e., denoting by v^{(k)} the sampling vector used at iteration k,
v^{(k+1)} ≠ v^{(k)} for all k ≥ 0. (18)
It is typical in practice to perform training in epochs over the data set, and to randomly permute the data set at the beginning of each epoch, so it is unlikely to see the same minibatch twice in a row. Note we have not assumed that the sample vectors v^{(k)} are independent. We do assume that E_v[v_i] = 1, where E_v denotes expectation with respect to the marginal distribution of v^{(k)}. The interpolation condition is said to hold if the minimizer x* of f also minimizes each f_i; i.e., if ∇f_i(x*) = 0 for all i = 1, …, n. It has been observed in some settings that stronger convergence guarantees can also be obtained when interpolation or a related assumption holds; e.g., Schmidt and Le Roux (2013); Loizou and Richtárik (2017); Ma et al. (2018); Vaswani et al. (2019).
Theorem 3.
Suppose we run the asg method (2)–(3) in a finite-sum setting where the condition number κ is sufficiently large and the sampling vectors satisfy the condition (18). For any initial point x_0, there exist L-smooth, μ-strongly-convex quadratic functions f_i such that f is also L-smooth and μ-strongly convex, and if we run the asg method with α = 1/L and β = (√κ − 1)/(√κ + 1), then E[||x_k − x*||] → ∞ as k → ∞.
This is true even if the functions f_i are required to satisfy the interpolation condition.
Proof.
We will prove this claim constructively. Given the initial vector x_0, choose x* to be any vector with x* ≠ x_0.
Let Q ∈ ℝ^{d×d} be an orthogonal matrix. Let the Hessian matrices ∇²f_i = A_i, i = 1, …, n, be chosen so that they are all diagonalized by Q, and let Λ_i denote the diagonal matrix of eigenvalues of A_i; i.e., A_i = Q Λ_i Qᵀ. Denote by Λ the matrix
Λ = (1/n) Σ_{i=1}^{n} Λ_i. (19)
It follows that Λ is also diagonal, and all of its diagonal entries are in [μ, L].
Recall that we have assumed that the functions are quadratic: ∇²f_i(x) = A_i for all x. Let us assume that the linear and constant terms of the f_i are chosen so that all functions are minimized at the same point x*, satisfying the interpolation condition. Then from (10), we have
∇f_i(x) = A_i (x − x*). (20)
Using this in (9) and unrolling, we obtain that
z_{k+1} = (∏_{j=0}^{k} T_j) z_0, (21)
where, with H_j = (1/n) Σ_{i=1}^{n} v_i^{(j)} A_i denoting the sampled Hessian at iteration j,
T_j = [[I − αH_j, β(I − αH_j)], [−αH_j, β(I − αH_j)]]. (22)
For fixed n and b, there are a finite number of sampling vectors (precisely (n choose b)), and therefore the matrices T_j belong to a bounded finite set 𝒯. It follows that the trajectory is stable if the joint spectral radius of the set of matrices 𝒯 is less than one (Rota and Strang, 1960). Conversely, if ||∏_{j=0}^{k} T_j|| ≥ c γ^k for some c > 0, γ > 1, and all sufficiently large k, then ||z_k|| → ∞.
Based on the construction above, the norm of the matrix product in (21) can be characterized by studying products of smaller 2×2 matrices of the form
T(λ) = [[1 − αλ, β(1 − αλ)], [−αλ, β(1 − αλ)]], (23)
where λ is a diagonal entry of the sampled eigenvalue matrix (1/n) Σ_i v_i^{(k)} Λ_i. To see this, observe that there is a permutation matrix P such that (see Appendix C)
P [[Q, 0], [0, Q]]ᵀ T_k [[Q, 0], [0, Q]] Pᵀ = blkdiag(T(λ_1^{(k)}), …, T(λ_d^{(k)})),
where λ_i^{(k)} is the ith diagonal entry of the sampled eigenvalue matrix at iteration k.
Furthermore, since all matrices H_j have the same eigenvectors, we have that
∏_{j=0}^{k} T_j = [[Q, 0], [0, Q]] Pᵀ blkdiag(M_1^{(k)}, …, M_d^{(k)}) P [[Q, 0], [0, Q]]ᵀ,
where M_i^{(k)} = ∏_{j=0}^{k} T(λ_i^{(j)}). Hence, the spectral radius of the product corresponds to the maximum spectral radius of any of the matrices M_i^{(k)}, i = 1, …, d.
Let i index the subspace spanned by q_i, where q_i is the ith column of Q. To simplify the discussion, suppose that all minibatches are of size b = 1, and assume n ≥ 2. Since we can define the Hessians of the functions such that the eigenvalues pair together arbitrarily, consider matrix products of the form
T(L) T(μ)^{m−1}, (24)
where m ≥ 2. That is, all but one of the functions have the eigenvalue μ in this subspace, and the remaining one has eigenvalue L. Hence, most of the time we sample minibatches corresponding to T(μ), and once in a while we sample minibatches with T(L). Moreover, since we do not sample the same minibatch twice consecutively, we never see back-to-back T(L)'s. For this case, and with the standard choice of stepsize and momentum parameters, we can precisely characterize the spectral radius of (24).
Lemma 1.
If α = 1/L and β = (√κ − 1)/(√κ + 1), then
ρ(T(L) T(μ)^{m−1}) = (m − 1)(1 − 1/√κ)^m.
The proof of Lemma 1 is given in Appendix D. Since we do not sample the same minibatch twice in a row, it follows that m ≥ 2 for all such blocks. Based on the assumption that κ is sufficiently large, the factor 1 − 1/√κ is close to 1. Moreover, since (1 − 1/√κ)^m ≈ e^{−m/√κ}, choosing m ≈ √κ gives (m − 1)(1 − 1/√κ)^m ≈ (√κ − 1)/e, which exceeds 1 once κ is large enough. Thus, for sufficiently large κ and sufficiently large m,
ρ(T(L) T(μ)^{m−1}) > 1.
Therefore, ||z_k|| → ∞ along sample paths containing infinitely many such blocks.
Recall that we assumed the interpolation condition holds in order to get gradients of the form (20). If we relax this and do not require interpolation, then (20) will have an additional constant term involving ∇f_i(x*), and the expression (21) will also have additional terms, akin to the noise terms in (13). The same arguments still apply, leading to the same conclusion. ∎
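The key quantity in the proof can be evaluated directly. The sketch below (our own 2×2 construction, with illustrative values m = 11 and κ = 100) computes ρ(T(L)T(μ)^{m−1}) with the standard stepsize and momentum, and confirms that the product expands over a period of m steps:

```python
import numpy as np

def T(lam, alpha, beta):
    """2x2 per-coordinate iteration matrix, as in (23)."""
    a = 1.0 - alpha * lam
    return np.array([[a, beta * a], [-alpha * lam, beta * a]])

mu, L = 1.0, 100.0
kappa = L / mu
alpha = 1.0 / L
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

m = 11  # the high-curvature minibatch is sampled once every m iterations
P = T(L, alpha, beta) @ np.linalg.matrix_power(T(mu, alpha, beta), m - 1)
rho = max(abs(np.linalg.eigvals(P)))
# rho > 1 here: the iterates grow on average over each period of m steps.
# The value matches the closed form (m - 1) * (1 - 1/sqrt(kappa))**m that
# can be derived from the Jordan structure of T(mu) at these parameters.
```

For these values ρ ≈ 3.14, so over each m-step period the component in this subspace is amplified rather than contracted.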
4.2 Example
The divergence result in Theorem 3 stems from the fact that the algorithm acquires momentum along a lowcurvature direction, and then, suddenly, a highcurvature minibatch is sampled that overshoots along the current trajectory. Momentum prevents the iterates from immediately adapting to the overshoot, and propels the iterates away from the minimizer for several consecutive iterations.
To illustrate this effect, consider the following example finite-sum problem with n terms, where each function is a strongly-convex quadratic with gradient
∇f_i(x) = A_i x.
For simplicity, take d = 3, and let
A_i = diag(L, μ, c_i).
The scalar c_i is equal to μ for all i ≠ 1, and is equal to L for i = 1. Therefore, each function f_i is strongly convex, smooth, and minimized at x* = 0, and the global objective f is also strongly convex, smooth, and minimized at x* = 0. Moreover, the functions are nearly all identical, except for f_1, which we refer to as the inconsistent minibatch.
From the proof of Theorem 3, the growth rate of the iterates along the third coordinate direction, with the usual choice of parameters (α = 1/L, β = (√κ − 1)/(√κ + 1)), is
ρ(T(L) T(μ)^{m−1})^{1/m} = ((m − 1)(1 − 1/√κ)^m)^{1/m},
where m is the number of iterations between consecutive samples of the inconsistent minibatch. Notice that the term (m − 1)(1 − 1/√κ)^m goes to 0 as m grows to infinity. Hence, for a fixed condition number κ, the asg method exhibits an increased probability of convergence as n becomes large. The intuition for this is that we sample the inconsistent minibatch less frequently, and thereby decrease the likelihood of derailing convergence.
Figure 2 illustrates the convergence of the asg method in this setting with the usual choice of parameters (α = 1/L, β = (√κ − 1)/(√κ + 1)), for various n (the number of terms in the finite-sum). At each iteration, the asg method obtains a stochastic gradient by sampling a minibatch from the finite-sum. Components of iterates along the first coordinate direction converge in a finite number of steps, and components of iterates along the second coordinate direction converge at Nesterov's 1 − 1/√κ rate. Meanwhile, components of iterates along the third coordinate direction diverge.
Annotated red points indicate iterations at which the minibatch corresponding to the function was sampled. The shaded windows illustrate that immediately after the inconsistent minibatch is sampled, the gradient and momentum buffer have opposite signs for several consecutive iterations.
4.3 Convergent Parameters
Next we turn our attention to finding alternative settings of the parameters α and β in the asg method which guarantee convergence in the finite-sum setting. Vaswani et al. (2019) obtain linear convergence under a strong-growth condition using an alternative formulation of asg with multiple momentum parameters, keeping the stepsize constant and letting the momentum parameters vary. Here we focus on constant stepsize and momentum and make no assumptions about growth.
Our approach is to bound the spectral norm of the products ∏_{j=0}^{k} T_j using submultiplicativity of matrix norms. This recovers linear convergence to a neighborhood of the minimizer, but the rate is no longer accelerated.
Define the quantities
γ = max_{T ∈ 𝒯} ||T||₂, the largest spectral norm over the finite set 𝒯 of iteration matrices of the form (22), and σ*² = E_v[||(1/n) Σ_{i=1}^{n} v_i ∇f_i(x*)||²], the gradient-noise variance at the minimizer.
Theorem 4.
Let α and β be chosen so that γ < 1. Then for all k ≥ 0,
E[||z_k||²] ≤ γ^{2k} ||z_0||² + c α² σ*²,
where c is a constant that depends only on γ.
Theorem 4 is proved in Appendix E. Note that if an interpolation condition holds (a weaker assumption than the strong-growth condition), then σ*² = 0.
Theorem 4 shows that the asg method can be made to converge in the finite-sum setting for smooth strongly-convex objective functions when run with constant stepsize and momentum. In particular, the algorithm converges at a linear rate to a neighborhood of the minimizer whose size is proportional to the variance of the gradient noise. Note that this theorem also allows for negative momentum parameters. Using the spectral norm to guarantee stability is restrictive, in that it is sufficient but not necessary. There may be values of α and β for which γ ≥ 1 and the algorithm still converges. Having γ < 1 ensures that ||z_k|| decreases, in expectation and up to the noise term, at every iteration.
Corollary 4.1.
Suppose that β = 0 and α = 1/L. Then for all k ≥ 0,
E[||δ_k||²] ≤ (1 − 1/κ)^{2k} ||δ_0||² + c α² σ*².
Corollary 4.1, which is proved in Appendix F, shows that, for functions which are twice continuously differentiable, sgd converges to a neighborhood of x* at the same linear rate as gd, viz. (4), in the finite-sum setting, without making any assumptions on the noise distribution such as the strong-growth condition; a novel result to the best of our knowledge. Moreover, when the interpolation condition holds, we have that σ*² = 0, and sgd converges linearly to x*.
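The qualitative conclusion of Corollary 4.1 is easy to observe in simulation. Below is our own minimal sketch of sgd on a finite sum of quadratics that all share the minimizer x* = 0, so the interpolation condition holds, σ*² = 0, and the iterates contract linearly regardless of which term is sampled:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three quadratics f_i(x) = 0.5 x^T A_i x, all minimized at x* = 0.
As = [np.diag([1.0, 10.0]), np.diag([2.0, 5.0]), np.diag([3.0, 8.0])]
L = 10.0            # every A_i has eigenvalues in [1, L]
alpha = 1.0 / L

x = np.array([1.0, -1.0])
for _ in range(300):
    A = As[rng.integers(len(As))]   # sample one term (minibatch of size 1)
    x = x - alpha * A @ x           # sgd step; each step contracts the error
```

Because every per-term step is a contraction here, convergence is linear along every sample path; without interpolation, the iterates would instead settle into a noise-dominated neighborhood of x*.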
5 Conclusions
This paper contributes to a broader understanding of the asg method in stochastic settings. Although the method behaves well in the stochastic approximation setting, in the finitesum setting it may diverge when using the usual stepsize and momentum. This emphasizes the important role the bounded variance assumption plays in the stochastic approximation setting, since a similar condition does not necessarily hold in the finitesum setting. Forsaking acceleration guarantees, we provide conditions under which the asg method is guaranteed to converge in the smooth stronglyconvex finitesum setting with constant stepsize and momentum, without assuming any growth or interpolation condition.
We believe there is scope to obtain tighter convergence bounds for the asg method with constant stepsize and momentum in the finitesum setting. Convergence guarantees using the joint spectral radius are likely to provide the tightest and most intuitive bounds, but are also difficult to obtain. To date, Lyapunovbased proof techniques have been the most fruitful in the literature.
We also believe that future work understanding the role that negative momentum parameters play in practice may lead to improved optimization of machine learning models. All convergence guarantees and variance bounds in this paper hold for both positive and negative momentum parameters. Our variance bounds and theoretical rates support the observation that negative momentum parameters may slow down convergence, but can also lead to non-trivial variance reduction. Previous work has found negative momentum to be useful in asynchronous distributed optimization (Mitliagkas et al., 2016) and for stabilizing adversarial training (Gidel et al., 2018). Although it is almost certainly not possible (in general) to obtain zero-variance solutions by only using negative momentum parameters, for deep learning practitioners that already use the asg method to train their models, perhaps momentum schedules incorporating negative values towards the end of training can improve performance.
References
Z. Allen-Zhu and L. Orecchia (2014). Linear coupling: an ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537.
Z. Allen-Zhu (2017). Katyusha: the first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1), pp. 8194–8244.
N. S. Aybat, A. Fallah, M. Gürbüzbalaban, and A. Ozdaglar (2019). Robust accelerated gradient methods for smooth strongly convex functions. arXiv preprint arXiv:1805.10579.
L. Bottou, F. E. Curtis, and J. Nocedal (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), pp. 223–311.
S. Bubeck (2015). Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4), pp. 231–357.
B. Can, M. Gürbüzbalaban, and L. Zhu (2019). Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. arXiv preprint arXiv:1901.07445.
A. d'Aspremont (2008). Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3), pp. 1171–1183.
A. Defazio (2019). On the curved geometry of accelerated optimization. In Advances in Neural Information Processing Systems, pp. 1764–1773.
O. Devolder, F. Glineur, and Y. Nesterov (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1–2), pp. 37–75.
G. Gidel, R. Askari Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas (2018). Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740.
I. Gitman, H. Lang, P. Zhang, and L. Xiao (2019). Understanding the role of momentum in stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 9630–9640.
R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik (2019). SGD: general analysis and improved rates. In International Conference on Machine Learning, pp. 5200–5209.
R. A. Horn and C. R. Johnson (2013). Matrix Analysis, 2nd edition. Cambridge University Press.
B. Hu and L. Lessard (2017). Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, pp. 1549–1557.
R. Kidambi, P. Netrapalli, P. Jain, and S. Kakade (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations.
A. Kulunchakov and J. Mairal (2019). A generic acceleration framework for stochastic composite optimization. In Advances in Neural Information Processing Systems.
M. Laborde and A. Oberman (2019). A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. arXiv preprint arXiv:1908.07861.
G. Lan (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133(1–2), pp. 365–397.
L. Lessard, B. Recht, and A. Packard (2016). Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1), pp. 57–95.
C. Liu and M. Belkin (2020). Accelerating SGD with momentum for over-parameterized learning. In International Conference on Learning Representations.
N. Loizou and P. Richtárik (2017). Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677.
J. Ma and D. Yarats (2019). Quasi-hyperbolic momentum and Adam for deep learning. In International Conference on Learning Representations.
S. Ma, R. Bassily, and M. Belkin (2018). The power of interpolation: understanding the effectiveness of SGD in modern over-parameterized learning. In International Conference on Machine Learning.
I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré (2016). Asynchrony begets momentum, with an application to deep learning. In 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 997–1004.
Y. Nesterov (1983). A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27, pp. 372–376.
Y. Nesterov (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.
B. T. Polyak (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), pp. 1–17.
B. T. Polyak (1987). Introduction to Optimization. Optimization Software Inc.
G.-C. Rota and G. Strang (1960). A note on the joint spectral radius. Indagationes Mathematicae, 22, pp. 379–381.
M. Schmidt and N. Le Roux (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370.
W. Su, S. Boyd, and E. Candès (2014). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147.
S. Vaswani, F. Bach, and M. Schmidt (2019). Fast and faster convergence of SGD for over-parameterized models (and an accelerated perceptron). In International Conference on Artificial Intelligence and Statistics.
W. Wiegerinck, A. Komoda, and T. Heskes (1994). Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A: Mathematical and General, 27(13), p. 4425.
T. Yang, Q. Lin, and Z. Li (2016). Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257.
Appendix A Proof of Theorem 1
We begin from (14). By taking the squared norm on both sides, taking expectations, and recalling that the noise vectors ξ̄_j have zero mean and are mutually independent, we have
E[||z_{k+1}||²] = ||T^{k+1} z_0||² + α² Σ_{j=0}^{k} E[||T^{k−j} ξ̄_j||²]. (25)
Recall that the spectral radius of a square matrix M is defined as ρ(M) = max_i |λ_i(M)|, where λ_i(M) denotes the ith eigenvalue of M. The spectral radius satisfies (Horn and Johnson, 2013)
ρ(M) ≤ ||M|| for every matrix norm ||·||,
and (Gelfand's theorem)
ρ(M) = lim_{k→∞} ||M^k||^{1/k}.
Hence, for any r̄ > ρ(T), there exists a C ≥ 1 such that ||T^k|| ≤ C r̄^k for all k ≥ 0. Let
C = sup_{k ≥ 0} ||T^k|| / r̄^k.
Then ||T^k|| ≤ C r̄^k for all k ≥ 0. Moreover, if ||T^k||^{1/k} converges monotonically to ρ(T) from below, then C = 1.
Now, recall that we have assumed f(x) = ½ xᵀAx + bᵀx + c, where A is symmetric, and we have also assumed that f is L-smooth and μ-strongly convex. Thus all eigenvalues λ of A satisfy μ ≤ λ ≤ L.
Lemma 2.
The spectral radius of T satisfies ρ(T) = max{ρ(T_μ), ρ(T_L)}, where T_λ is the 2×2 matrix defined in (26) below.
Proof.
Since A is real and symmetric, it has a real eigenvalue decomposition A = QΛQᵀ, where Q is an orthogonal matrix and Λ is the diagonal matrix of eigenvalues of A. Observe that T can be viewed as a block matrix with blocks that all commute with each other, since each block is an affine matrix function of A. Thus, by Polyak (1964, Lemma 5), ν is an eigenvalue of T if and only if there is an eigenvalue λ of A such that ν is an eigenvalue of the matrix
T_λ = [[1 − αλ, β(1 − αλ)], [−αλ, β(1 − αλ)]]. (26)
The characteristic polynomial of T_λ is
z² − (1 + β)(1 − αλ) z + β(1 − αλ),
from which it follows that the eigenvalues of T_λ are given by
z = ½ ((1 + β)(1 − αλ) ± √((1 + β)²(1 − αλ)² − 4β(1 − αλ)));
see, e.g., Lessard et al. (2016, Appendix A). Note that the characteristic polynomial of T_λ is the same as the characteristic polynomial of a different matrix appearing in Lessard et al. (2016), which arises from a different analysis of the ag method. Finally, as discussed in Lessard et al. (2016), for any fixed values of α and β, the function λ ↦ ρ(T_λ) is quasi-convex in λ, and hence the maximum over all eigenvalues of A is achieved at one of the extremes λ = μ or λ = L. ∎
Appendix B Proofs of Corollary 1.1 and Theorem 2
Taking α = 1/L and β = (√κ − 1)/(√κ + 1), we find that ρ(T) = 1 − 1/√κ. Since f is an L-smooth, μ-strongly-convex quadratic, all eigenvalues of A are bounded between μ and L. Therefore, from Polyak (1964, Lemma 5), we have that ρ(T) = max_{λ ∈ [μ, L]} ρ(T_λ), where T_λ is as defined in (26). The eigenvalues of T_λ are maximized at λ = μ for this choice of parameters; therefore, for large κ, ρ(T) is maximized at λ = μ.
Note that the Jordan form of T_μ is given by T_μ = S J S^{−1}, where
J = [[r, 1], [0, r]], with r = 1 − 1/√κ.
Using the Jordan form, we determine that J^k is
J^k = [[r^k, k r^{k−1}], [0, r^k]].
Therefore, we have that ||T_μ^k|| ≤ ||S|| ||S^{−1}|| (r^k + k r^{k−1}).