Acceleration via Symplectic Discretization of High-Resolution Differential Equations

02/11/2019 ∙ by Bin Shi, et al. ∙ University of Pennsylvania, Carnegie Mellon University, UC Berkeley

We study first-order optimization methods obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov's accelerated gradient methods (NAGs) and Polyak's heavy-ball method. We consider three discretization schemes: an explicit Euler scheme, an implicit Euler scheme, and a symplectic scheme. We show that the optimization algorithm generated by applying the symplectic scheme to a high-resolution ODE proposed by Shi et al. [2018] achieves an accelerated rate for minimizing smooth strongly convex functions. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit.


1 Introduction

First-order optimization methods are ubiquitous in machine learning because of their implementational simplicity and low cost-per-iteration complexity. In this paper, we consider unconstrained minimization problems:

min_{x ∈ ℝ^n} f(x),    (1.1)

where f is a smooth convex function. The touchstone method in this setting is gradient descent (GD):

x_{k+1} = x_k − s ∇f(x_k),    (1.2)

where x_0 is a given initial point and s > 0 is the step size. Whether there exist methods that improve on GD while remaining within the framework of first-order optimization is a subtle and important question.

Modern attempts to address this question date to  Polyak [1964, 1987], who incorporated a momentum term into the gradient step, yielding a method that is referred to as the heavy-ball method:

x_{k+1} = x_k − s ∇f(x_k) + α (x_k − x_{k-1}),    (1.3)

where α > 0 is a momentum coefficient. While the heavy-ball method provably attains a faster rate of local convergence than GD near a minimum of f, it generally does not provide a guarantee of acceleration globally [Polyak, 1964].

The next major development in first-order methodology is due to Nesterov, who introduced first-order gradient methods that have a faster global convergence rate than GD [Nesterov, 1983, 2013]. For a μ-strongly convex objective with L-Lipschitz gradients, Nesterov's accelerated gradient method (NAG-SC) involves the following pair of update equations:

y_{k+1} = x_k − s ∇f(x_k),
x_{k+1} = y_{k+1} + ((1 − √(μs)) / (1 + √(μs))) (y_{k+1} − y_k),    (1.4)

If one sets s = 1/L, then NAG-SC enjoys an O((1 − √(μ/L))^k) convergence rate, improving on the O((1 − μ/L)^k) convergence rate of GD. Nesterov also developed an accelerated algorithm (NAG-C) targeting smooth convex functions that are not strongly convex:

y_{k+1} = x_k − s ∇f(x_k),
x_{k+1} = y_{k+1} + (k / (k + 3)) (y_{k+1} − y_k),    (1.5)

This algorithm has an O(1/k²) convergence rate, which is faster than GD's O(1/k) rate.
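For concreteness, the following Python sketch implements the updates (1.2)–(1.5) above. It is only an illustration: the function names, the quadratic test problem, and the specific parameter choices are ours, with the standard choice s = 1/L for the step size and the momentum coefficients displayed in (1.4) and (1.5).

```python
import numpy as np

def gd(grad, x0, s, n):
    # gradient descent (1.2)
    x = np.array(x0, float)
    for _ in range(n):
        x = x - s * grad(x)
    return x

def heavy_ball(grad, x0, s, alpha, n):
    # Polyak's heavy-ball method (1.3)
    x_prev = x = np.array(x0, float)
    for _ in range(n):
        x, x_prev = x - s * grad(x) + alpha * (x - x_prev), x
    return x

def nag_sc(grad, x0, s, mu, n):
    # Nesterov's method for strongly convex objectives (1.4)
    coef = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x = y = np.array(x0, float)
    for _ in range(n):
        y_next = x - s * grad(x)
        x = y_next + coef * (y_next - y)
        y = y_next
    return x

def nag_c(grad, x0, s, n):
    # Nesterov's method for (non-strongly) convex objectives (1.5)
    x = y = np.array(x0, float)
    for k in range(n):
        y_next = x - s * grad(x)
        x = y_next + k / (k + 3) * (y_next - y)
        y = y_next
    return x

# toy usage: f(x) = 0.5 * x^T A x with mu = 1 and L = 100
A = np.diag([1.0, 100.0])
grad_f = lambda x: A @ x
print(nag_sc(grad_f, x0=[1.0, 1.0], s=1 / 100.0, mu=1.0, n=200))
```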

While yielding optimal and effective algorithms, the design principle of Nesterov's accelerated algorithms (NAG) is not transparent. Convergence proofs for NAG often use the estimate sequence technique, which is inductive in nature and relies on a series of algebraic tricks [Bubeck, 2015]. In recent years progress has been made in the understanding of acceleration by moving to a continuous-time formulation. In particular, Su et al. [2016] showed that as the step size s tends to zero, NAG-C converges to an ordinary differential equation (ODE) (Equation (2.2)); moreover, for this ODE, Su et al. [2016] derived a (continuous-time) O(1/t²) convergence rate using a Lyapunov function, and further transformed this Lyapunov function to a discrete version, thereby providing a new proof of the fact that NAG-C enjoys an O(1/k²) rate.

Further progress in this vein has involved taking a variational point of view that derives ODEs from an underlying Lagrangian rather than from a limiting argument [Wibisono et al., 2016]. While this approach captures many of the variations of Nesterov acceleration presented in the literature, it does not distinguish between the heavy-ball dynamics and the accelerated dynamics, and thus fails to distinguish between local and global acceleration. More recently, Shi et al. [2018] have returned to limiting arguments with a more sophisticated methodology. They have derived high-resolution ODEs for the heavy-ball method (Equation (2.4)), NAG-SC (Equation (2.5)) and NAG-C (Equation (2.6)). Notably, the high-resolution ODEs for the heavy-ball dynamics and the accelerated dynamics are different. Shi et al. [2018] also presented Lyapunov functions for these ODEs and for their discretizations, and showed that these Lyapunov functions can be used to derive the accelerated rates of NAG-SC and NAG-C.

This emerging literature has thus provided a new level of understanding of design principles for accelerated optimization. The design involves an interplay between continuous-time and discrete-time dynamics. ODEs are obtained either variationally or via a limiting scheme, and various properties of the ODEs are studied, including their convergence rate, topological aspects of their flow and their behavior under perturbation. Lyapunov functions play a key role in such analyses, and also allow aspects of the continuous-time analysis to be transferred to discrete time [see, e.g., Wilson et al., 2016].

And yet the literature has not yet provided a full exploration of the transition from continuous-time ODEs to discrete-time algorithms. Indeed, this transition is a non-trivial one, as evidenced by the decades of research on numerical methods for the discretization of ODEs, including most notably the sophisticated arsenal of techniques referred to as “geometric numerical integration” that are used for ODEs obtained from underlying variational principles [Hairer et al., 2006]. Recent work has begun to explore these issues; examples include the use of symplectic integrators by Betancourt et al. [2018] and the use of Runge-Kutta integration by  Zhang et al. [2018]. However, these methods do not always yield proofs that accelerated rates are retained in discrete time, and when they do they involve implicit discretization, which is generally not practical except in the setting of quadratic objectives.

Thus we wish to address the following fundamental question:

Can we systematically and provably obtain new accelerated methods via the numerical discretization of ordinary differential equations?

Our approach to this question is a dynamical systems framework based on Lyapunov theory. Our main results are as follows:

  1. In Section 3.1, we consider three simple numerical discretization schemes—explicit Euler, implicit Euler, and symplectic Euler schemes—to discretize the high-resolution ODE of Nesterov's accelerated method for strongly convex functions. We show that the optimization method generated by symplectic discretization achieves an accelerated convergence rate. In sharp contrast, the implicit scheme is not practical for implementation, and the explicit scheme, while simple, fails to achieve acceleration.

  2. Next, we apply the three simple Euler schemes to the high-resolution ODE of Nesterov’s accelerated method for convex functions. Again, our Lyapunov analysis sheds light on the superiority of the symplectic scheme over the other two schemes. This is the subject of Section 4.

  3. In Section 3.2, we apply these discretization schemes to the ODE for modeling the heavy-ball method, which can be viewed as a low-resolution ODE that lacks a gradient-correction term [Shi et al., 2018]. In contrast to the previous two cases of high-resolution ODEs, the symplectic scheme does not achieve acceleration for this low-resolution ODE. More broadly, in Appendix D we present more examples of low-resolution ODEs where symplectic discretization does not lead to acceleration.

Taken together, the three findings imply that high-resolution ODEs and symplectic schemes are critical to achieving acceleration via numerical discretization. More precisely, in addition to allowing relatively simple implementations, symplectic schemes allow for a large step size without a loss of stability, in a manner akin to (but better than) implicit schemes. In stark contrast, in the setting of low-resolution ODEs, only the implicit schemes remain stable with a large step size, due to the lack of the gradient correction. Moreover, the choice of Lyapunov function is essential to obtaining sharp convergence rates. This fact is highlighted in Section 2.4, where we analyze GD by considering it as a discretization method for gradient flow (the ODE counterpart of GD). Using the discrete version of the Lyapunov function proposed in Su et al. [2016] instead of the classical one, we show that GD in fact minimizes the squared gradient norm (choosing the best iterate so far) at a rate of O(1/k²). Although this rate of convergence for squared gradient norm minimization is known in the literature [Nesterov, 2012], the Lyapunov function argument provides a systematic approach to obtaining this rate in this problem and others.

2 Preliminaries

In this section, we introduce necessary notation, and review ODEs derived in previous work and three classical numerical discretization schemes.

2.1 Notation

We mostly follow the notation of Nesterov [2013], with slight modifications tailored to the present paper. Let F^1_L(ℝ^n) be the class of L-smooth convex functions defined on ℝ^n; that is, f ∈ F^1_L(ℝ^n) if f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all x, y ∈ ℝ^n and its gradient is L-Lipschitz continuous in the sense that

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ ℝ^n,

where ‖·‖ denotes the standard Euclidean norm and L is the Lipschitz constant. The function class F^2_L(ℝ^n) is the subclass of F^1_L(ℝ^n) whose members have a Lipschitz-continuous Hessian. For 0 < μ ≤ L, let S^1_{μ,L}(ℝ^n) denote the subclass of F^1_L(ℝ^n) whose members are μ-strongly convex. That is, f ∈ S^1_{μ,L}(ℝ^n) if f ∈ F^1_L(ℝ^n) and f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖² for all x, y ∈ ℝ^n. Analogously, S^2_{μ,L}(ℝ^n) = S^1_{μ,L}(ℝ^n) ∩ F^2_L(ℝ^n). Let x* denote a minimizer of f.

2.2 Approximating ODEs

In this section we list all of the ODEs that we will discretize in this paper. We refer readers to recent papers by Su et al. [2016], Wibisono et al. [2016] and Shi et al. [2018] for the rigorous derivations of these ODEs. We begin with the simplest. Taking the step size s → 0 in Equation (1.2), we obtain the following ODE (gradient flow):

dX/dt = −∇f(X),    (2.1)

with any initial point X(0) = x_0.

Next, by taking s → 0 in Equation (1.5), Su et al. [2016] derived the low-resolution ODE of NAG-C:

d²X/dt² + (3/t) dX/dt + ∇f(X) = 0,    (2.2)

with X(0) = x_0 and dX/dt(0) = 0. For strongly convex functions, by taking s → 0 in Equations (1.3) and (1.4), one can derive the following low-resolution ODE (see, for example, Wibisono et al. [2016])

d²X/dt² + 2√μ dX/dt + ∇f(X) = 0,    (2.3)

that models both the heavy-ball method and NAG-SC. This ODE has the same initial conditions as (2.2).

Recently, Shi et al. [2018] proposed high-resolution ODEs for modeling accelerated methods. The key ingredient in these ODEs is that terms of order √s are preserved in the ODEs. As a result, the heavy-ball method and NAG-SC are modeled by different ODEs.


  (a) If f ∈ S^1_{μ,L}(ℝ^n), the high-resolution ODE of the heavy-ball method is

    d²X/dt² + 2√μ dX/dt + (1 + √(μs)) ∇f(X) = 0,    (2.4)

    with X(0) = x_0 and dX/dt(0) = −2√s ∇f(x_0) / (1 + √(μs)). This ODE has essentially the same properties as its low-resolution counterpart (2.3) due to the absence of the gradient-correction term √s ∇²f(X) dX/dt.

  (b) If f ∈ S^2_{μ,L}(ℝ^n), the high-resolution ODE of NAG-SC is

    d²X/dt² + 2√μ dX/dt + √s ∇²f(X) dX/dt + (1 + √(μs)) ∇f(X) = 0,    (2.5)

    with X(0) = x_0 and dX/dt(0) = −2√s ∇f(x_0) / (1 + √(μs)). (A numerical illustration of this ODE follows the list.)

  (c) If f ∈ F^2_L(ℝ^n), the high-resolution ODE for NAG-C is

    d²X/dt² + (3/t) dX/dt + √s ∇²f(X) dX/dt + (1 + 3√s/(2t)) ∇f(X) = 0,    (2.6)

    for t ≥ 3√s/2, with X(3√s/2) = x_0 and dX/dt(3√s/2) = −√s ∇f(x_0).
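As a numerical illustration of item (b), one can integrate the high-resolution ODE (2.5) with an off-the-shelf ODE solver. The sketch below does this for a quadratic objective, for which the gradient and the Hessian-vector product are exact; the test problem, parameter values, and solver tolerances are ours, and the initial velocity follows the initial conditions stated above.

```python
import numpy as np
from scipy.integrate import solve_ivp

# quadratic objective f(x) = 0.5 * x^T A x, so grad f(x) = A x and
# the Hessian-vector product in (2.5) is simply A v
A = np.diag([1.0, 100.0])
mu, s = 1.0, 1.0 / 100.0

def rhs(t, z):
    x, v = z[:2], z[2:]
    dv = (-2 * np.sqrt(mu) * v
          - np.sqrt(s) * (A @ v)
          - (1 + np.sqrt(mu * s)) * (A @ x))
    return np.concatenate([v, dv])

x0 = np.array([1.0, 1.0])
v0 = -2 * np.sqrt(s) * (A @ x0) / (1 + np.sqrt(mu * s))   # initial velocity stated above
sol = solve_ivp(rhs, t_span=(0.0, 10.0), y0=np.concatenate([x0, v0]),
                rtol=1e-8, atol=1e-10)
f_vals = 0.5 * np.einsum('ij,jk,ik->i', sol.y[:2].T, A, sol.y[:2].T)
print(f_vals[-1])   # objective value at the final time
```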

2.3 Discretization schemes

To discretize the ODEs (2.1)–(2.6), we replace the continuous trajectory X(t) (and, in phase space, the velocity V(t)) by discrete iterates x_k (and v_k), replace time derivatives by the corresponding finite differences, and replace the remaining terms with approximations. Different discretization schemes correspond to different approximations.

  • The most straightforward scheme is the explicit (forward) Euler scheme, which evaluates the right-hand side of the phase-space system at the current iterate (x_k, v_k).

  • Another discretization scheme is the implicit (backward) Euler scheme, which evaluates the right-hand side at the next iterate (x_{k+1}, v_{k+1}).

    Note that, compared with the explicit scheme, the implicit scheme is not practical because the update of x_{k+1} requires knowing v_{k+1} while the update of v_{k+1} requires knowing x_{k+1}; in general, each iteration therefore amounts to solving a nonlinear system of equations.

  • The last discretization scheme considered in this paper is the symplectic Euler scheme, which updates x_{k+1} explicitly from (x_k, v_k) and then uses the freshly computed x_{k+1} in the update of v_{k+1}.

    Note that this scheme is practical because the update of v_{k+1} only requires knowing x_{k+1}, which is already available.

We remark that for the low-resolution ODEs there is no gradient-correction term √s ∇²f(X) dX/dt, whereas the high-resolution ODEs contain this term, and we use the difference of gradients, ∇f(x_{k+1}) − ∇f(x_k), to approximate it. This additional approximation term is critical to acceleration.
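A minimal Python sketch of the three update rules is given below on a generic damped phase-space system dX/dt = V, dV/dt = −γ V − ∇f(X); the test system, the fixed-point solver used for the implicit step, and all parameter values are our own choices for illustration, not the exact schemes analyzed in Sections 3 and 4.

```python
import numpy as np

def euler_steps(grad, x0, v0, s, gamma, n_iters, scheme="symplectic"):
    """Euler discretizations of dX/dt = V, dV/dt = -gamma*V - grad(X),
    with time step sqrt(s).  A sketch for illustration only."""
    h = np.sqrt(s)
    x, v = np.array(x0, float), np.array(v0, float)
    xs = [x.copy()]
    for _ in range(n_iters):
        if scheme == "explicit":
            # both updates use the current iterate (x_k, v_k)
            x_new = x + h * v
            v_new = v + h * (-gamma * v - grad(x))
        elif scheme == "symplectic":
            # x is updated first; the v-update then uses the fresh x_{k+1}
            x_new = x + h * v
            v_new = v + h * (-gamma * v - grad(x_new))
        elif scheme == "implicit":
            # both updates use (x_{k+1}, v_{k+1}); solved here by a
            # fixed-point iteration, which is what makes the scheme costly
            x_new, v_new = x.copy(), v.copy()
            for _ in range(50):
                v_new = (v - h * grad(x_new)) / (1.0 + h * gamma)
                x_new = x + h * v_new
        x, v = x_new, v_new
        xs.append(x.copy())
    return np.array(xs)

# toy usage on f(x) = 0.5 * x^T A x; the explicit scheme diverges at this
# step size, while the symplectic and implicit schemes remain stable
A = np.diag([1.0, 100.0])
traj = euler_steps(lambda x: A @ x, x0=[1.0, 1.0], v0=[0.0, 0.0],
                   s=1e-2, gamma=2.0, n_iters=200, scheme="symplectic")
print(traj[-1])
```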

2.4 Warm-up: Gradient norm minimization via GD

As a warm-up, we revisit gradient descent for smooth convex functions, highlighting the ODE-based approach and the importance of Lyapunov functions in proofs. Our finding is the following theorem.

Theorem 2.1.

Let f ∈ F^1_L(ℝ^n). Taking step size s = 1/L, the iterates generated by gradient descent (1.2) satisfy

(2.7)

This theorem directly implies that

min_{0 ≤ i ≤ k} ‖∇f(x_i)‖² = O(1/k²),

which is known in the literature [Nesterov, 2012]. However, our result is stronger in the sense that this O(1/k²) rate for the best iterate does not by itself imply (2.7).

To obtain this result, we use a Lyapunov function that differs from the one used in the standard analysis of gradient descent, namely f(x_k) − f(x*). That Lyapunov function yields the O(1/k) convergence rate for the function value. For the squared gradient norm, however, it can only exploit the L-smoothness inequality ‖∇f(x)‖² ≤ 2L(f(x) − f(x*)), which transforms a bound on the function value into a bound on the gradient norm and therefore gives only a sub-optimal O(1/k) rate, due to the absence of gradient information in this function. Our proof below uses a different Lyapunov function.

Proof.

We consider the following Lyapunov function:

E(k) = ks (f(x_k) − f(x*)) + (1/2) ‖x_k − x*‖².

We remark that this is just the discrete version of the Lyapunov function constructed by Su et al. [2016] for gradient flow. Using convexity and the descent inequality f(x_{k+1}) ≤ f(x_k) − (s/2)‖∇f(x_k)‖², valid for s ≤ 1/L, the difference of this Lyapunov function satisfies

E(k+1) − E(k) ≤ −(k s²/2) ‖∇f(x_k)‖².

Summing over k and plugging in s = 1/L, we complete the proof. ∎
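The following sketch provides a quick empirical check of this behavior on a toy problem of our own choosing: it runs GD with s = 1/L and tracks the best-iterate squared gradient norm; under the O(1/k²) rate above, the product k² · min_i ‖∇f(x_i)‖² should remain bounded.

```python
import numpy as np

# f(x) = 0.5 * x^T A x with eigenvalues between 1 and 50, so L = 50
A = np.diag(np.linspace(1.0, 50.0, 20))
L = np.max(np.diag(A))
grad = lambda x: A @ x

x = np.ones(20)
best = np.inf
for k in range(1, 2001):
    x = x - (1.0 / L) * grad(x)
    best = min(best, float(grad(x) @ grad(x)))   # best iterate so far
    if k % 500 == 0:
        print(k, best, k**2 * best)              # k^2 * best should stay bounded
```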

3 High-Resolution ODEs for Strongly Convex Functions

This section considers the numerical discretization of the high-resolution ODEs of NAG-SC and the heavy-ball method using the symplectic, explicit, and implicit Euler schemes. In particular, we compare the rates of convergence to the objective minimum achieved by the three simple Euler schemes and by the two classical methods (NAG-SC and the heavy-ball method) in Section 3.1 and Section 3.2, respectively. In both cases, the associated symplectic scheme is shown to exhibit a surprisingly close similarity to the corresponding classical method.

3.1 NAG-SC

The high-resolution ODE (2.5) of NAG-SC can be equivalently written in phase space as

dX/dt = V,
dV/dt = −2√μ V − √s ∇²f(X) V − (1 + √(μs)) ∇f(X),    (3.1)

with the initial conditions X(0) = x_0 and V(0) = −2√s ∇f(x_0) / (1 + √(μs)). For any f ∈ S^2_{μ,L}(ℝ^n), Theorem 1 of Shi et al. [2018] shows that the solution of the ODE (2.5) satisfies

for any step size . In particular, setting the step size to , we get

In the phase space representation, NAG-SC is formulated as

(3.2)

with the initial condition for any . This method maintains the accelerated rate of the ODE by recognizing

(see Theorem 3 in Shi et al. [2018]) and the identification .

Viewing NAG-SC as a numerical discretization of (2.5), one might wonder whether any of the three simple Euler schemes—the symplectic, explicit, and implicit Euler schemes—maintains the accelerated rate when discretizing the high-resolution ODE. For clarity, the update rules of the three schemes are given as follows, each with the initial points x_0 and v_0.

Symplectic scheme:
Explicit scheme:
Implicit scheme:

Among the three Euler schemes, the symplectic scheme is the closest to NAG-SC (3.2). More precisely, NAG-SC differs from the symplectic scheme only by an additional multiplicative factor in the second line of (3.2); when the step size s is small, this factor is close to one, so NAG-SC is, roughly speaking, a symplectic method. In relating this to the literature, the connection between accelerated methods and symplectic schemes has been explored by Betancourt et al. [2018], who mainly consider the leapfrog integrator, a second-order symplectic integrator. In contrast, the symplectic Euler scheme studied in this paper is a first-order symplectic integrator.
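The sketch below gives one plausible realization of the symplectic Euler scheme for (3.1) in Python. The Hessian term √s ∇²f(X)V is approximated by the gradient difference ∇f(x_{k+1}) − ∇f(x_k), and the damping term is evaluated at v_{k+1}, so that v_{k+1} is still available in closed form; the exact placement of terms and the admissible step sizes in the scheme analyzed in Theorem 3.1 may differ, so this should be read as an assumption-laden sketch rather than the definitive algorithm.

```python
import numpy as np

def symplectic_nag_sc(grad, x0, mu, s, n_iters):
    """A plausible symplectic-Euler discretization of the phase-space ODE (3.1).

    The Hessian term sqrt(s) * Hess f(X) V is replaced by the gradient
    difference grad(x_{k+1}) - grad(x_k), and the damping term is treated
    implicitly so the v-update has a closed form.  Illustrative only.
    """
    h = np.sqrt(s)
    x = np.array(x0, float)
    v = -2 * h * grad(x) / (1 + np.sqrt(mu * s))   # initial velocity as in (3.1)
    for _ in range(n_iters):
        x_new = x + h * v                           # position update uses v_k
        v = (v
             - h * (grad(x_new) - grad(x))          # discretized gradient correction
             - (1 + np.sqrt(mu * s)) * h * grad(x_new)
             ) / (1 + 2 * np.sqrt(mu * s))
        x = x_new
    return x

# toy usage on an ill-conditioned quadratic with mu = 1 and L = 100
A = np.diag([1.0, 100.0])
print(symplectic_nag_sc(lambda x: A @ x, x0=[1.0, 1.0],
                        mu=1.0, s=1 / 400.0, n_iters=300))
```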

Interestingly, the close resemblance between the two algorithms is found not only in their formulations, but also in their convergence rates, which are both accelerated as shown by Theorem 3.1 and Corollary 3.2.

Theorem 3.1 (Discretization of the NAG-SC ODE).

For any , the following conclusions hold:

  1. Taking , the symplectic Euler scheme satisfies

    (3.3)
  2. Taking , the explicit Euler scheme satisfies

    (3.4)
  3. Taking , the implicit Euler scheme satisfies

    (3.5)

Note that the Lyapunov function for the symplectic scheme used in the proof of Theorem 3.1 is

(3.6)

The proof of Theorem 3.1 is deferred to Appendix B.1. The following result is a useful consequence of this theorem.

Corollary 3.2 (Discretization of the NAG-SC ODE).

For any , the following conclusions hold:

  1. Taking step size , the symplectic Euler scheme satisfies

    (3.7)
  2. Taking step size , the explicit Euler scheme satisfies

    (3.8)
  3. Taking step size , the implicit Euler scheme satisfies

    (3.9)

In addition, Corollary 3.2 shows that the implicit scheme also achieves acceleration. However, unlike NAG-SC, the symplectic scheme, and the explicit scheme, the implicit scheme is generally not easy to use in practice because it requires solving a nonlinear fixed-point equation at every iteration when the objective is not quadratic. On the other hand, the explicit scheme can only take a much smaller step size, which prevents it from achieving acceleration.
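To see concretely why the implicit scheme is manageable essentially only for quadratic objectives, note that for f(x) = (1/2) xᵀAx + bᵀx each implicit step is a linear system in (x_{k+1}, v_{k+1}) and can be solved exactly, whereas for a general f the same step is a nonlinear equation that must be solved iteratively. The sketch below implements one implicit Euler step for a generic damped quadratic phase-space system (our own simplified system, not the exact scheme of Theorem 3.1).

```python
import numpy as np

def implicit_step_quadratic(A, b, x, v, mu, h):
    """One implicit Euler step for
         dX/dt = V,   dV/dt = -2*sqrt(mu)*V - (A X + b),
    which is linear in (x_new, v_new) and hence solvable exactly:
         x_new - h*v_new                      = x
         h*A x_new + (1 + 2*sqrt(mu)*h) v_new = v - h*b
    """
    n = len(x)
    I = np.eye(n)
    M = np.block([[I, -h * I],
                  [h * A, (1 + 2 * np.sqrt(mu) * h) * I]])
    rhs = np.concatenate([x, v - h * b])
    z = np.linalg.solve(M, rhs)
    return z[:n], z[n:]

# a large step size is unproblematic for the implicit scheme on a quadratic
A = np.diag([1.0, 100.0]); b = np.zeros(2)
x, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    x, v = implicit_step_quadratic(A, b, x, v, mu=1.0, h=0.3)
print(x)
```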

3.2 The heavy-ball method

We turn to the heavy-ball ODE (2.4), whose phase-space representation reads

dX/dt = V,
dV/dt = −2√μ V − (1 + √(μs)) ∇f(X),    (3.10)

with the initial conditions X(0) = x_0 and V(0) = −2√s ∇f(x_0) / (1 + √(μs)). Theorem 2 in Shi et al. [2018] shows that the solution to this ODE satisfies

for and any step size . In particular, taking gives

Returning to the discrete regime, Polyak’s heavy-ball method uses the following update rule:

which attains a non-accelerated rate (see Theorem 4 of Shi et al. [2018]):

(3.11)

The three simple Euler schemes for numerically solving the ODE (2.4) are given as follows; each scheme starts from arbitrary initial points x_0 and v_0. As in the case of NAG-SC, the symplectic scheme is the closest to the heavy-ball method, as the sketch below illustrates.
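A plausible symplectic sketch for (3.10) is obtained from the NAG-SC sketch of Section 3.1 simply by dropping the gradient-difference (gradient-correction) term; as before, this is an illustrative reconstruction rather than the exact scheme analyzed in Theorem 3.3. On simple quadratics the two sketches can behave similarly (the heavy-ball method is known to accelerate locally); the distinction drawn by Theorem 3.3 concerns guarantees for general strongly convex functions.

```python
import numpy as np

def symplectic_heavy_ball(grad, x0, mu, s, n_iters):
    """Symplectic-Euler sketch for the heavy-ball ODE (3.10): identical to the
    NAG-SC sketch above except that the gradient-correction term is absent.
    Assumed form, for illustration only."""
    h = np.sqrt(s)
    x = np.array(x0, float)
    v = -2 * h * grad(x) / (1 + np.sqrt(mu * s))
    for _ in range(n_iters):
        x_new = x + h * v
        v = (v - (1 + np.sqrt(mu * s)) * h * grad(x_new)) / (1 + 2 * np.sqrt(mu * s))
        x = x_new
    return x
```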

The theorem below characterizes the convergence rates of the three schemes. This theorem is extended to general step sizes by Theorem B.1 in Appendix B.2.

Theorem 3.3 (Discretization of the heavy-ball ODE).

For any , the following conclusions hold:

  1. Taking step size , the symplectic Euler scheme satisfies

    (3.12)
  2. Taking step size , the explicit Euler scheme satisfies

    (3.13)
  3. Taking step size , the implicit Euler scheme satisfies

    (3.14)

Taken together, (3.11) and Theorem 3.3 imply that neither the heavy-ball method nor the symplectic scheme attains an accelerated rate. In contrast, the implicit scheme achieves acceleration as in the NAG-SC case, but it is impractical except for quadratic objectives.

4 High-Resolution ODEs for Convex Functions

In this section, we turn to the numerical discretization of the high-resolution ODE (2.6) related to NAG-C. All proofs are deferred to Appendix C. In the phase-space representation, this ODE reads [Shi et al., 2018]

dX/dt = V,
dV/dt = −(3/t) V − √s ∇²f(X) V − (1 + 3√s/(2t)) ∇f(X),    (4.1)

with X(3√s/2) = x_0 and V(3√s/2) = −√s ∇f(x_0). For any step size s, Theorem 5 of Shi et al. [2018] shows that the solution of the high-resolution ODE (2.6) satisfies

(4.2)

for any t ≥ 3√s/2. A caveat here is that it is unclear how to use a Lyapunov function to prove convergence of the (simple) explicit, symplectic, or implicit Euler scheme obtained by direct numerical discretization of the ODE (2.6); see Appendix C.2 for more discussion of this point. Therefore, we slightly modify the ODE to the following one:

(4.3)

The only difference lies in the third term on the right-hand side of the second equation. We now apply the three schemes to this (modified) ODE in phase space, including the original NAG-C; all of them start with the same initial points x_0 and v_0.

Symplectic phase-space scheme (the original NAG-C):
(4.4)
Explicit phase-space scheme:
(4.5)
Implicit phase-space scheme:
(4.6)
Theorem 4.1.

Let . The following statements are true:

  (a) For any step size , the symplectic (original) NAG-C (4.4) satisfies

    (4.7)
  (b) Taking any step size , the implicit NAG-C (4.6) satisfies

    (4.8)

Note that Theorem 4.1 (a) is the same as Theorem 6 of Shi et al. [2018]. The explicit Euler scheme (4.5) does not guarantee convergence; see the analysis in Appendix C.1.

5 Discussion

In this paper, we have analyzed the convergence rates of three numerical discretization schemes—the symplectic Euler scheme, explicit Euler scheme, and implicit Euler scheme—applied to ODEs that are used for modeling Nesterov's accelerated methods and Polyak's heavy-ball method. The symplectic scheme is shown to achieve accelerated rates for the high-resolution ODEs of NAG-SC and (slightly modified) NAG-C [Shi et al., 2018], whereas no accelerated rates are observed when the same scheme is used to discretize the low-resolution counterparts [Su et al., 2016]. For comparison, the explicit scheme only allows for a small step size in discretizing these ODEs in order to ensure stability, thereby failing to achieve acceleration. Although the implicit scheme is proved to yield accelerated methods regardless of whether high-resolution or low-resolution ODEs are discretized, this scheme is generally not practical except in a limited number of cases (for example, quadratic objectives).

We conclude this paper by presenting several directions for future work. This work suggests that both symplectic schemes and high-resolution ODEs are crucial for numerical discretization to achieve acceleration. It would be of interest to formalize and prove this assertion. For example, does any higher-order symplectic scheme maintain acceleration for the high-resolution ODEs of NAGs? What is the fundamental mechanism by which the gradient correction in the high-resolution ODEs stabilizes symplectic discretization? Moreover, since the discretizations here are applied to a modified high-resolution ODE of NAG-C, it is tempting to perform a comparison study between the two high-resolution ODEs in terms of their discretization properties. Finally, recognizing that Nesterov's method (NAG-SC) is very similar to, but still different from, the corresponding symplectic scheme, one can design new algorithms as interpolations of the two methods; it would be interesting to investigate the convergence properties of these new algorithms.

Acknowledgments

W. J. S. was supported in part by the NSF via grant CCF-1763314. M. I. J. was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.

References

Appendix A Gradient Flow

A.1 Convergence rate of gradient flow

The following theorem is the continuous-time version of Theorem 2.1.15 in Nesterov [2013].

Theorem A.1.

Let . The solution to the gradient flow (2.1) satisfies

Proof.

Taking the following Lyapunov function

we calculate its time derivative as

Thus, we complete the proof. ∎

The theorem below is a continuous version of Theorem 2.1.14 in Nesterov [2013].

Theorem A.2.

Let . The solution to the gradient flow (2.1) satisfies

Proof.

The time derivative of the distance function is

We define a Lyapunov function as

With the basic convex inequality for , we have

Furthermore, we obtain that the time derivative is

Hence, the convergence rate is

The following theorem is based on the Lyapunov function for gradient flow (2.1) in Su et al. [2016].

Theorem A.3.

Let . The solution to the gradient flow (2.1) satisfies

Proof.

The Lyapunov function is

We calculate its time derivative as

Thus, we complete the proof. ∎
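For concreteness, one natural choice of Lyapunov function here, following the construction in Su et al. [2016] (the constants below are ours), is

\[
\mathcal{E}(t) \;=\; t\bigl(f(X(t)) - f(x^\star)\bigr) \;+\; \tfrac{1}{2}\,\|X(t) - x^\star\|^2 .
\]

Along the gradient flow dX/dt = −∇f(X), convexity gives f(X) − f(x*) ≤ ⟨∇f(X), X − x*⟩, so

\[
\frac{\mathrm{d}\mathcal{E}}{\mathrm{d}t}
\;=\; f(X) - f(x^\star) \;-\; t\,\|\nabla f(X)\|^2 \;-\; \langle \nabla f(X),\, X - x^\star\rangle
\;\le\; -\,t\,\|\nabla f(X)\|^2 .
\]

Integrating from 0 to t yields ∫₀ᵗ u‖∇f(X(u))‖² du ≤ E(0) = (1/2)‖x_0 − x*‖², and hence inf_{0 ≤ u ≤ t} ‖∇f(X(u))‖² ≤ ‖x_0 − x*‖² / t², the continuous-time analogue of the O(1/k²) rate in Theorem 2.1.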

A.2 Explicit Euler scheme

The corresponding explicit-scheme version of Theorem A.1 is just Theorem 2.1.15 in Nesterov [2013]. We state it below.

Theorem A.4 (Theorem 2.1.15, Nesterov [2013]).

Let . Taking any step size , the iterates generated by GD  satisfy

In addition, if the step size is set to , we get

This proof is from Nesterov [2013]. The only conceptual difference is that we use the Lyapunov function

instead of the distance function in Nesterov [2013].

The corresponding explicit version of Theorem A.2 is Theorem 2.1.14 in Nesterov [2013]. We also state it as follows.

Theorem A.5 (Theorem 2.1.14, Nesterov [2013]).

Let . Taking any step size , the iterates generated by GD satisfy

(A.1)

In addition, if the step size is set to , we get

(A.2)

Again, the only difference from Nesterov [2013] is the choice of Lyapunov function. Finally, we show the corresponding discrete version of Theorem A.3, which completes Theorem 2.1. Note that the proof is given in Section 2.4.

Theorem A.6.

Let . Taking any step size , the iterates generated by GD satisfy

(A.3)

In addition, if the step size is set , we have

(A.4)

A.3 Implicit Euler scheme

Next, we consider the implicit Euler scheme of the gradient flow (2.1) as

x_{k+1} = x_k − s ∇f(x_{k+1}),    (A.5)

with any initial point x_0. The corresponding implicit version of Theorem A.1 is shown below.

Theorem A.7.

Let f ∈ S^1_{μ,L}(ℝ^n). The iterates generated by implicit gradient descent (A.5) satisfy

(A.6)

In addition, if the step size , where , we have

(A.7)
Proof.

The Lyapunov function is