1 Introduction
Firstorder optimization methods are ubiquitous in machine learning because of their implementational simplicity and low costperiteration complexity. In this paper, we consider unconstrained minimization problems:
(1.1) 
where is a smooth convex function. The touchstone method in this setting is gradient descent (GD):
(1.2) 
where is a given initial point and is the step size. Whether there exist methods that improve on GD while remaining within the framework of firstorder optimization is a subtle and important question.
Modern attempts to address this question date to Polyak [1964, 1987], who incorporated a momentum term into the gradient step, yielding a method that is referred to as the heavyball method:
(1.3) 
where is a momentum coefficient. While the heavyball method provably attains a faster rate of local convergence than GD near a minimum of , it generally does not provide a guarantee of acceleration globally [Polyak, 1964].
The next major development in firstorder methodology is due to Nesterov, who introduced firstorder gradient methods that have a faster global convergence rate than GD [Nesterov, 1983, 2013]. For a strongly convex objective with Lipschitz gradients, Nesterov’s accelerated gradient method (NAGSC) involves the following pair of update equations:
(1.4) 
If one sets , then NAGSC enjoys a convergence rate, improving on the convergence rate of GD. Nesterov also developed an accelerated algorithm (NAGC) targeting smooth convex functions that are not strongly convex:
(1.5) 
This algorithm has a convergence rate, which is faster than GD’s rate.
While yielding optimal and effective algorithms, the design principle of Nesterov’s accelerated algorithms (NAG) is not transparent. Convergence proofs for NAG often use the estimate sequence technique, which is inductive in nature and relies on series of algebraic tricks [Bubeck, 2015]. In recent years progress has been made in the understanding of acceleration by moving to a continuoustime formulation. In particular, Su et al. [2016] showed that as , NAGC converges to an ordinary differential equation (ODE) (Equation (2.2)); moreover, for this ODE, Su et al. [2016] derived a (continuoustime) convergence rate using a Lyapunov function, and further transformed this Lyapunov function to a discrete version and thereby provided a new proof of the fact that NAGC enjoys a rate.
Further progress in this vein has involved taking a variational point of view that derives ODEs from an underlying Lagrangian rather than from a limiting argument [Wibisono et al., 2016]. While this approach captures many of the variations of Nesterov acceleration presented in the literature, it does not distinguish between the heavyball dynamics and the accelerated dynamics, and thus fails to distinguish between local and global acceleration. More recently, Shi et al. [2018] have returned to limiting arguments with a more sophisticated methodology. They have derived highresolution ODEs for the heavyball method (Equation (2.4)), NAGSC (Equation (2.5)) and NAGC (Equation (2.6)). Notably, the highresolution ODEs for the heavyball dynamics and the accelerated dynamics are different. Shi et al. [2018] also presented Lyapunov functions for these ODEs and for their discretizations, and showed that these Lyapunov functions can be used to derive the accelerated rates of NAGSC and NAGC.
This emerging literature has thus provided a new level of understanding of design principles for accelerated optimization. The design involves an interplay between continuoustime and discretetime dynamics. ODEs are obtained either variationally or via a limiting scheme, and various properties of the ODEs are studied, including their convergence rate, topological aspects of their flow and their behavior under perturbation. Lyapunov functions play a key role in such analyses, and also allow aspects of the continuoustime analysis to be transferred to discrete time [see, e.g., Wilson et al., 2016].
And yet the literature has not yet provided a full exploration of the transition from continuoustime ODEs to discretetime algorithms. Indeed, this transition is a nontrivial one, as evidenced by the decades of research on numerical methods for the discretization of ODEs, including most notably the sophisticated arsenal of techniques referred to as “geometric numerical integration” that are used for ODEs obtained from underlying variational principles [Hairer et al., 2006]. Recent work has begun to explore these issues; examples include the use of symplectic integrators by Betancourt et al. [2018] and the use of RungeKutta integration by Zhang et al. [2018]. However, these methods do not always yield proofs that accelerated rates are retained in discrete time, and when they do they involve implicit discretization, which is generally not practical except in the setting of quadratic objectives.
Thus we wish to address the following fundamental question:
Can we systematically and provably obtain new accelerated methods via the numerical discretization of ordinary differential equations?
Our approach to this question is a dynamical systems framework based on Lyapunov theory. Our main results are as follows:

In Section 3.1, we consider three simple numerical discretization schemes—explicit Euler, implicit Euler, and symplectic Euler schemes—to discretize the highresolution ODE of Nesterov’s accelerated method for strongly convex functions. We show that the optimization method generated by symplectic discretization achieves a rate, thereby attaining acceleration. In sharp contrast, the implicit scheme is not practical for implementation, and the explicit scheme, while being simple, fails to achieve acceleration.

Next, we apply the three simple Euler schemes to the highresolution ODE of Nesterov’s accelerated method for convex functions. Again, our Lyapunov analysis sheds light on the superiority of the symplectic scheme over the other two schemes. This is the subject of Section 4.

In Section 3.2, we apply these discretization schemes to the ODE for modeling the heavyball method, which can be viewed as a lowresolution ODE that lacks a gradientcorrection term [Shi et al., 2018]. In contrast to the previous two cases of highresolution ODEs, the symplectic scheme does not achieve acceleration for this lowresolution ODE. More broadly, in Appendix D we present more examples of lowresolution ODEs where symplectic discretization does not lead to acceleration.
Taken together, the three findings have the implication that highresolution ODEs and symplectic schemes are critical to achieving acceleration using numerical discretization. More precisely, in addition to allowing relatively simple implementations, symplectic schemes allow for a large step size without a loss of stability, in a manner akin to (but better than) implicit schemes. In stark contrast, in the setting of lowresolution ODEs, only the implicit schemes remain stable with a large step size, due to the lack of gradient correction. Moreover, the choice of Lyapunov function is essential to obtaining sharp convergence rates. This fact is highlighted in Section 2.4, where we analyze GD by considering it as a discretization method for gradient flow (the ODE counterpart of GD). Using the discrete version of the Lyapunov function proposed in Su et al. [2016] instead of the classical one, we show that GD in fact minimizes the squared gradient norm (choosing the best iterate so far) at a rate of . Although this rate of convergence in the problem of squared gradient norm minimization is known in the literature [Nesterov, 2012], the Lyapunov function argument provides a systematic approach to obtaining this rate in this problem and others.
2 Preliminaries
In this section, we introduce necessary notation, and review ODEs derived in previous work and three classical numerical discretization schemes.
2.1 Notation
We mostly follow the notation of Nesterov [2013], with slight modifications tailored to the present paper. Let be the class of smooth convex functions defined on ; that is, if for all and its gradient is Lipschitz continuous in the sense that
where denotes the standard Euclidean norm and is the Lipschitz constant. The function class is the subclass of such that each has a Lipschitzcontinuous Hessian. For , let denote the subclass of such that each member is strongly convex for some . That is, if and for all . Let denote (a) minimizer of .
2.2 Approximating ODEs
In this section we list all of the ODEs that we will discretize in this paper. We refer readers to recent papers by Su et al. [2016], Wibisono et al. [2016] and Shi et al. [2018] for the rigorous derivations of these ODEs. We begin with the simplest. Taking the step size in Equation (1.2), we obtain the following ODE (gradient flow):
(2.1) 
with any initial .
Next, by taking in Equation (1.5), Su et al. [2016] derived the lowresolution ODE of NAGC:
(2.2) 
with and . For strongly convex functions, by taking , one can derive the following lowresolution ODE (see, for example, Wibisono et al. [2016])
(2.3) 
that models both the heavyball method and NAGSC. This ODE has the same initial conditions as (2.2).
Recently, Shi et al. [2018] proposed highresolution ODEs for modeling acceleration methods. The key ingredient in these ODEs is that the terms are preserved in the ODEs. As a result, the heavyball method and NAGSC have different models as ODEs.

[label = ()]

If , the highresolution ODE of the heavyball method is
(2.4) with and . This ODE has essentially the same properties as its lowresolution counterpart (2.3) due to the absence of .

If , the highresolution ODE of NAGSC is
(2.5) with and .

If , the highresolution ODE for NAGC is
(2.6) for , with and .
2.3 Discretization schemes
To discretize ODEs (2.1)(2.6), we replace by , by and replace other terms with approximations. Different discretization schemes correspond to different approximations.

The most straightforward scheme is the explicit scheme, which uses the following approximation rule:

Another discretization scheme is the implicit scheme, which uses the following approximation rule:
Note that compared with the explicit scheme, the implicit scheme is not practical because the update of requires knowing while the update of requires knowing .

The last discretization scheme considered in this paper is the symplectic scheme, which uses the following approximation rule.
Note this scheme is practical because the update of only requires knowing .
We remark that for lowresolution ODEs, there is no term, whereas for highresolution ODEs, we have this term and we use the difference of gradients to approximate this term. This additional approximation term is critical to acceleration.
2.4 Warmup: Gradient norm minimization via GD
As a warmup, we revisit gradient descent for smooth convex functions, highlighting the ODEbased approach and the importance of Lyapunov functions in proofs. Our finding is the following theorem.
Theorem 2.1.
Let . Taking step size , the iterates generated by gradient descent (1.2) satisfy
(2.7) 
This theorem directly implies that
which is known in the literature [Nesterov, 2012]. However, our result is stronger in the sense that the rate does not imply (2.7).
To obtain this result, we use a Lyapunov function that is different from the standard analysis of gradient descent, which uses the Lyapunov function . This Lyapunov function yields the convergence rate for the function value. For the squared gradient norm, however, this Lyapunov function can only exploit the smoothness property that transforms the function value to the gradient norm, giving the suboptimal rate, due to the absence of gradient information in this function. Our proof below uses a different Lyapunov function.
Proof.
We consider the following Lyapunov function
We remark that this is just the discrete version of the Lyapunov function constructed by Su et al. [2016]. The difference of this Lyapunov function satisfies
Summing over and plugging in , we complete the proof. ∎
3 HighResolution ODEs for Strongly Convex Functions
This section considers numerical discretization of the highresolution ODEs of NAGSC and the heavyball method using the symplectic Euler scheme, explicit Euler scheme, and implicit Euler scheme. In particular, we compare rates of convergence towards the objective minimum of the three simple Euler schemes and the two methods (NAGSC and the heavyball method) in Section 3.1 and Section 3.2, respectively. For both cases, the associated symplectic scheme is shown to exhibit surprisingly close similarity with the corresponding classical method.
3.1 NagSc
The highresolution ODE (2.5) of NAGSC can be equivalently written in the phase space as
(3.1) 
with the initial conditions and . For any , Theorem 1 of Shi et al. [2018] shows that the solution of the ODE (2.5) satisfies
for any step size . In particular, setting the step size to , we get
In the phase space representation, NAGSC is formulated as
(3.2) 
with the initial condition for any . This method maintains the accelerated rate of the ODE by recognizing
(see Theorem 3 in Shi et al. [2018]) and the identification .
Viewing NAGSC as a numerical discretization of (2.5), one might wonder if any of the three simple Euler schemes—symplectic Euler scheme, explicit Euler scheme, and implicit Euler scheme—maintain the accelerated rate in discretizing the highresolution ODE. For clarity, the update rules of the three schemes are given as follows, each with the initial points and .
Symplectic scheme:
Explicit scheme:
Implicit scheme:
Among the three Euler schemes, the symplectic scheme is the closest to NAGSC (3.2). More precisely, NAGSC differs from the symplectic scheme only in an additional factor of in the second line of (3.2). When the step size is small, NAGSC is, roughly speaking, a symplectic method if we make use of . In relating to the literature, the connection between accelerated methods and the symplectic schemes has been explored in Betancourt et al. [2018], which mainly considers the leapfrog integrator, a secondorder symplectic integrator. In contrast, the symplectic Euler scheme studied in this paper is a firstorder symplectic integrator.
Interestingly, the close resemblance between the two algorithms is found not only in their formulations, but also in their convergence rates, which are both accelerated as shown by Theorem 3.1 and Corollary 3.2.
Theorem 3.1 (Discretization of NAGSc Ode).
For any , the following conclusions hold:

Taking , the symplectic Euler scheme satisfies
(3.3) 
Taking , the explicit Euler scheme satisfies
(3.4) 
Taking , the implicit Euler scheme satisfies
(3.5)
Note that the Lyapunov function for the symplectic scheme used in the proof of (3.1) is
(3.6) 
The proof of Theorem 3.1 is deferred to Appendix B.1. The following result is a useful consequence of this theorem.
Corollary 3.2 (Discretization of NAGSc Ode).
For any , the following conclusions hold:

Taking step size , the symplectic Euler scheme satisfies
(3.7) 
Taking step size , the explicit Euler scheme satisfies
(3.8) 
Taking step size , the implicit Euler scheme satisfies
(3.9)
In addition, Corollary 3.2 shows that the implicit scheme also achieves acceleration. However, unlike NAGSC, the symplectic scheme, and the explicit scheme, the implicit scheme is generally not easy to use in practice because it requires solving a nonlinear fixedpoint equation when the objective is not quadratic. On the other hand, the explicit scheme can only take a smaller step size , which prevents this scheme from achieving acceleration.
3.2 The heavyball method
We turn to the heavyball method ODE (2.4), whose phase space representation reads
(3.10) 
with the initial conditions and . Theorem 2 in Shi et al. [2018] shows that the solution to this ODE satisfies
for and any step size . In particular, taking gives
Returning to the discrete regime, Polyak’s heavyball method uses the following update rule:
which attains a nonaccelerated rate (see Theorem 4 of Shi et al. [2018]):
(3.11) 
The three simple Euler schemes for numerically solving the ODE (2.4) are given as follows. Every scheme starts with any arbitrary and . As in the case of NAGSC, the symplectic scheme is the closest to the heavyball method.
The theorem below characterizes the convergence rates of the three schemes. This theorem is extended to general step sizes by Theorem B.1 in Appendix B.2.
Theorem 3.3 (Discretization of heavyball ODE).
For any , the following conclusions hold:

Taking step size , the symplectic Euler scheme satisfies
(3.12) 
Taking step size , the explicit Euler scheme satisfies
(3.13) 
Taking step size , the implicit Euler scheme satisfies
(3.14)
4 HighResolution ODEs for Convex Functions
In this section, we turn to numerical discretization of the highresolution ODE (2.6) related to NAGC. All proofs are deferred to Appendix C. This ODE in the phase space representation reads [Shi et al., 2018] as follows:
(4.1) 
with and . Theorem 5 of Shi et al. [2018] shows that Let . For any step size , the solution of the highresolution ODE (2.6) satisfies
(4.2) 
for any . A caveat here is that it is unclear how to use a Lyapunov function to prove convergence of the (simple) explicit, symplectic or implicit Euler scheme by direct numerical discretization of the ODE (2.2). See Appendix C.2 for more discussion on this point. Therefore, we slightly modify the ODE to the following one:
(4.3) 
The only difference is in the third term on the righthand side of the second equation, where we replace by . Now, we apply the three schemes on this (modified) ODE in the phase space, including the original NAGC, which all start with and .
Symplectic phasespace scheme (the original NAGC):
(4.4) 
Explicit phasespace scheme:
(4.5) 
Implicit phasespace scheme:
(4.6) 
Theorem 4.1.
5 Discussion
In this paper, we have analyzed the convergence rates of three numerical discretization schemes—the symplectic Euler scheme, explicit Euler scheme, and implicit Euler scheme—applied to ODEs that are used for modeling Nesterov’s accelerated methods and Polyak’s heavyball method. The symplectic scheme is shown to achieve accelerated rates for the highresolution ODEs of NAGSC and (slightly modified) NAGC [Shi et al., 2018], whereas no acceleration rates are observed when the same scheme is used to discretize the lowresolution counterparts [Su et al., 2016]. For comparison, the explicit scheme only allows for a small step size in discretizing these ODEs in order to ensure stability, thereby failing to achieve acceleration. Although the explicit scheme is proved to yield accelerated methods no matter whether highresolution or lowresolution ODEs are discretized, this scheme is generally not practical except for a limited number of cases (for example, quadratic objectives).
We conclude this paper by presenting several directions for future work. This work suggests that both symplectic schemes and highresolution ODEs are crucial for numerical discretization to achieve acceleration. It would be of interest to formalize and prove this assertion. For example, does any higherorder symplectic scheme maintain acceleration for the highresolution ODEs of NAGs? What is the fundamental mechanism of the gradient correction in highresolution ODE in stabilizing symplectic discretization? Moreover, since the discretizations are applied to the modified highresolution ODE of NAGC, it is tempting to perform a comparison study between the two highresolution ODEs in terms of discretization properties. Finally, recognizing Nesterov’s method (NAGSC
) is very similar to, but still different from, the corresponding symplectic scheme, one can design new algorithms as interpolations of the two methods; it would be interesting to investigate the convergence properties of these new algorithms.
Acknowledgments
W. J. S. was supported in part by the NSF via grant CCF1763314. M. I. J. was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N000141812764.
References
 Betancourt et al. [2018] Betancourt, M., Jordan, M. I., and Wilson, A. C. (2018). On symplectic optimization. arXiv preprint arXiv:1802.03653.
 Bubeck [2015] Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(34):231–357.
 Hairer et al. [2006] Hairer, E., Lubich, C., and Wanner, G. (2006). Geometric numerical integration: structurepreserving algorithms for ordinary differential equations, volume 31. Springer Science & Business Media.
 Nesterov [1983] Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate . Soviet Mathematics Doklady, 27(2):372–376.
 Nesterov [2012] Nesterov, Y. (2012). How to make the gradients small. Optima, 88:10–11.
 Nesterov [2013] Nesterov, Y. (2013). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media.
 Polyak [1964] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.
 Polyak [1987] Polyak, B. T. (1987). Introduction to optimization. Optimization Software, Inc, New York.
 Shi et al. [2018] Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. (2018). Understanding the acceleration phenomenon via highresolution differential equations. arXiv preprint arXiv:1810.08907.
 Su et al. [2016] Su, W., Boyd, S., and Candès, E. J. (2016). A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43.
 Wibisono et al. [2016] Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
 Wilson et al. [2016] Wilson, A. C., Recht, B., and Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635.
 Zhang et al. [2018] Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct Runge–Kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521.
Appendix A Gradient Flow
a.1 Convergence rate of gradient flow
The following theorem is the continuoustime version of Theorem 2.1.15 in Nesterov [2013].
Theorem A.1.
Let . The solution to the gradient flow (2.1) satisfies
Proof.
Taking the following Lyapunov function
we calculate its time derivative as
Thus, we complete the proof. ∎
The theorem below is a continuous version of Theorem 2.1.14 in Nesterov [2013].
Theorem A.2.
Let . The solution to the gradient flow (2.1) satisfies
Proof.
The time derivative of the distance function is
We define a Lyapunov function as
With the basic convex inequality for , we have
Furthermore, we obtain that the time derivative is
Hence, the convergence rate is
∎
The following theorem is based on the Lyapunov function for gradient flow (2.1) in Su et al. [2016].
Theorem A.3.
Let . The solution to the gradient flow (2.1) satisfies
Proof.
The Lyapunov function is
We calculate its time derivative as
Thus, we complete the proof. ∎
a.2 Explicit Euler scheme
The corresponding explicitscheme version of Theorem A.1 is just Theorem 2.1.15 in Nesterov [2013]. We state it below.
Theorem A.4 (Theorem 2.1.15, Nesterov [2013]).
Let . Taking any step size , the iterates generated by GD satisfy
In addition, if the step size is set to , we get
This proof is from Nesterov [2013]. The only conceptual difference is that we use the Lyapunov function
instead of the distance function in Nesterov [2013].
The corresponding explicit version of Theorem A.2 is Theorem 2.1.14 in Nesterov [2013]. We also state it as follows.
Theorem A.5 (Theorem 2.1.14, Nesterov [2013]).
Let . Taking any step size , the iterates generated by GD satisfy
(A.1) 
In addition, if the step size is set to , we get
(A.2) 
Again, the from Nesterov [2013] is the use of the Lyapunov function instead of . Finally, we show the corresponding discrete version of Theorem A.3, which completes Theorem 2.1. Note that the proof is shown in Section 2.4.
Theorem A.6.
Let . Taking any step size , the iterates generated by GD satisfy
(A.3) 
In addition, if the step size is set , we have
(A.4) 
a.3 Implicit Euler scheme
Next, we consider the implicit Euler scheme of the gradient flow (2.1) as
(A.5) 
with any initial . The corresponding implicit version of Theorem A.1 is shown as below.
Theorem A.7.
Let , the iterates generated by implicit gradient descent (A.5) satisfy
(A.6) 
In addition, if the step size , where , we have
(A.7) 
Proof.
The Lyapunov function is
Comments
There are no comments yet.