A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights

Weijie Su, et al., Stanford University, 03/04/2015

We derive a second-order ordinary differential equation (ODE) which is the limit of Nesterov's accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov's scheme and thus can serve as a tool for its analysis. We show that the continuous-time ODE allows for a better understanding of Nesterov's scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov's scheme, leading to an algorithm which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.


1 Introduction

In many fields of machine learning, minimizing a convex function is at the core of efficient model estimation. In the simplest and most standard form, we are interested in solving

$$\text{minimize}_{x \in \mathbb{R}^n} \quad f(x),$$

where $f$ is a convex function, smooth or non-smooth, and $x \in \mathbb{R}^n$ is the variable. Since Newton, numerous algorithms and methods have been proposed to solve the minimization problem, notably gradient and subgradient descent, Newton's methods, trust region methods, conjugate gradient methods, and interior point methods (see e.g. Polyak, 1987; Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006; Ruszczyński, 2006; Boyd et al., 2011; Shor, 2012; Beck, 2014, for expositions).

First-order methods have regained popularity as data sets and problems grow ever larger in size and, consequently, there has been much research on the theory and practice of accelerated first-order schemes. Perhaps the earliest first-order method for minimizing a convex function $f$ is the gradient method, which dates back to Euler and Lagrange. Thirty years ago, however, in a seminal paper Nesterov proposed an accelerated gradient method (Nesterov, 1983), which may take the following form: starting with $x_0$ and $y_0 = x_0$, inductively define

$$x_k = y_{k-1} - s\,\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}). \tag{1}$$

For any fixed step size $s \le 1/L$, where $L$ is the Lipschitz constant of $\nabla f$, this scheme exhibits the convergence rate

$$f(x_k) - f^\star \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s\,k^2}\right). \tag{2}$$

Above, $x^\star$ is any minimizer of $f$ and $f^\star = f(x^\star)$. It is well known that this rate is optimal among all methods having only information about the gradient of $f$ at consecutive iterates (Nesterov, 2004). This is in contrast to vanilla gradient descent methods, which have the same computational complexity but can only achieve a rate of $O(1/k)$. This improvement relies on the introduction of the momentum term as well as the particularly tuned coefficient $(k-1)/(k+2)$. Since the introduction of Nesterov's scheme, there has been much work on the development of first-order accelerated methods; see Nesterov (2004, 2005, 2013) for theoretical developments, and Tseng (2008) for a unified analysis of these ideas. Notable applications can be found in sparse linear regression (Beck and Teboulle, 2009; Qin and Goldfarb, 2012), compressed sensing (Becker et al., 2011) and deep and recurrent neural networks (Sutskever et al., 2013).
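To make the scheme concrete, here is a minimal NumPy sketch of iteration (1). The quadratic test function, the step size, and all variable names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def nesterov(grad_f, x0, s, n_iter):
    """Nesterov's accelerated gradient method, scheme (1):
    x_k = y_{k-1} - s*grad_f(y_{k-1}),  y_k = x_k + (k-1)/(k+2)*(x_k - x_{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for k in range(1, n_iter + 1):
        x = y - s * grad_f(y)                         # gradient step at the extrapolated point
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # momentum with coefficient (k-1)/(k+2)
        x_prev = x
    return x_prev

# Illustrative use: minimize f(x) = 0.5*x'Ax; the Lipschitz constant of grad f is the largest eigenvalue.
A = np.diag([1.0, 10.0])
print(nesterov(lambda x: A @ x, x0=[1.0, 1.0], s=1.0 / 10.0, n_iter=200))  # close to (0, 0)
```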

In a different direction, there is a long history relating ordinary differential equations (ODEs) to optimization; see Helmke and Moore (1996), Schropp and Singer (2000), and Fiori (2005) for example. The connection between ODEs and numerical optimization is often established by taking step sizes to be very small so that the trajectory or solution path converges to a curve modeled by an ODE. The conciseness and well-established theory of ODEs provide deeper insights into optimization, which has led to many interesting findings. Notable examples include linear regression via solving differential equations induced by the linearized Bregman iteration algorithm (Osher et al., 2014), a continuous-time Nesterov-like algorithm in the context of control design (Dürr and Ebenbauer, 2012; Dürr et al., 2012), and modeling iterative optimization algorithms as nonlinear dynamical systems (Lessard et al., 2014).

In this work, we derive a second-order ODE which is the exact limit of Nesterov’s scheme by taking small step sizes in (1); to the best of our knowledge, this work is the first to use ODEs to model Nesterov’s scheme or its variants in this limit. One surprising fact in connection with this subject is that a first-order scheme is modeled by a second-order ODE. This ODE takes the following form:

$$\ddot X + \frac{3}{t}\,\dot X + \nabla f(X) = 0 \tag{3}$$

for $t > 0$, with initial conditions $X(0) = x_0$, $\dot X(0) = 0$; here, $x_0$ is the starting point in Nesterov's scheme, $\dot X$ denotes the time derivative or velocity and similarly $\ddot X$ denotes the acceleration. The time parameter $t$ in this ODE is related to the step size $s$ in (1) via $t \approx k\sqrt{s}$. Expectedly, it also enjoys the same inverse quadratic convergence rate as its discrete analog,

$$f(X(t)) - f^\star \le O\!\left(\frac{\|x_0 - x^\star\|^2}{t^2}\right).$$

Approximate equivalence between Nesterov's scheme and the ODE is established later from various perspectives, rigorous and intuitive. In the main body of this paper, examples and case studies are provided to demonstrate that the homogeneous and conceptually simpler ODE can serve as a tool for understanding, analyzing and generalizing Nesterov's scheme.
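As an informal sanity check of this equivalence, the sketch below numerically integrates the ODE (3) and compares $X(k\sqrt{s})$ with the iterates $x_k$ of scheme (1). The quadratic test function, the tolerances, and the trick of starting the integration at a small $t_0 > 0$ to avoid the $3/t$ singularity are our own illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.diag([1.0, 10.0])                 # illustrative quadratic f(x) = 0.5*x'Ax
grad_f = lambda x: A @ x
x0 = np.array([1.0, 1.0])
s = 1e-3                                 # small step size, so t = k*sqrt(s)

def ode_rhs(t, z):
    # z = (X, V) with V = dX/dt; ODE (3) reads X'' + (3/t) X' + grad_f(X) = 0
    X, V = z[:2], z[2:]
    return np.concatenate([V, -3.0 / t * V - grad_f(X)])

t0, T = 1e-3, 5.0                        # start just after t = 0 to avoid the singular coefficient
sol = solve_ivp(ode_rhs, (t0, T), np.concatenate([x0, np.zeros(2)]),
                dense_output=True, rtol=1e-8, atol=1e-10)

# run Nesterov's scheme (1) and compare with the ODE solution at matching times t = k*sqrt(s)
x_prev, y = x0.copy(), x0.copy()
for k in range(1, int(T / np.sqrt(s)) + 1):
    x = y - s * grad_f(y)
    y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
    x_prev = x
    if k % 50 == 0:
        t = k * np.sqrt(s)
        print(t, np.linalg.norm(x - sol.sol(t)[:2]))  # the gap is small and shrinks as s -> 0
```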

In the following, two insights into Nesterov's scheme are highlighted: the first concerns oscillations in the trajectories of this scheme, and the second the peculiar constant 3 appearing in the ODE.

1.1 From Overdamping to Underdamping

In general, Nesterov’s scheme is not monotone in the objective function value due to the introduction of the momentum term. Oscillations or overshoots along the trajectory of iterates approaching the minimizer are often observed when running Nesterov’s scheme. Figure 1 presents typical phenomena of this kind, where a two-dimensional convex function is minimized by Nesterov’s scheme. Viewing the ODE as a damping system, we obtain interpretations as follows.

Small $t$. In the beginning, the damping ratio $3/t$ is large. This leads the ODE to behave as an overdamped system, returning to the equilibrium without oscillating;
Large $t$. As $t$ increases, the ODE with a small damping ratio $3/t$ behaves like an underdamped system, oscillating with the amplitude gradually decreasing to zero.

As depicted in Figure 1(a), in the beginning the ODE curve moves smoothly towards the origin, the minimizer $x^\star$. The second interpretation, "Large $t$", provides a partial explanation for the oscillations observed in Nesterov's scheme at a later stage. Although our analysis extends farther, it is similar in spirit to that carried out in O'Donoghue and Candès (2013). In particular, the zoomed Figure 1(b) presents some butterfly-like oscillations for both the scheme and the ODE. There, we see that the trajectory repeatedly moves away from the origin and returns later. Each overshoot in Figure 1(b) causes a bump in the function values, as shown in Figure 1(c). We also observe from Figure 1(c) that the periodicity captured by the bumps is very close to that of the ODE solution. In passing, it is worth mentioning that the solution to the ODE in this case can be expressed via Bessel functions, hence enabling quantitative characterizations of these overshoots and bumps, which are given in full detail in Section 3.

(a) Trajectories.
(b) Zoomed trajectories.
(c) Errors $f(x_k) - f^\star$.
Figure 1: Minimizing a two-dimensional convex quadratic by Nesterov's scheme with several step sizes. The black solid curves correspond to the solution of the ODE. In (c), the x-axis uses the identification between time and iterations, $t = k\sqrt{s}$.

1.2 A Phase Transition

The constant 3, appearing in the coefficient $3/t$ of (3), is not haphazard. In fact, it is the smallest constant that guarantees the inverse quadratic convergence rate. Specifically, parameterized by a constant $r$, the generalized ODE

$$\ddot X + \frac{r}{t}\,\dot X + \nabla f(X) = 0$$

can be translated into a generalized Nesterov's scheme that is the same as the original (1) except for $(k-1)/(k+2)$ being replaced by $(k-1)/(k+r-1)$. Surprisingly, for both generalized ODEs and schemes, the inverse quadratic convergence is guaranteed if and only if $r \ge 3$. This phase transition suggests there might be deep causes for acceleration among first-order methods. In particular, for $r \ge 3$, the worst case constant in this inverse quadratic convergence rate is minimized at $r = 3$.

Figure 2 illustrates the growth of $t^2\bigl(f(X(t)) - f^\star\bigr)$ and $s k^2\bigl(f(x_k) - f^\star\bigr)$, respectively, for the generalized ODE and scheme with $r = 1$ applied to a simple objective function. Inverse quadratic convergence fails to be observed in both Figures 2(a) and 2(b), where the scaled errors grow with $t$ or with the number of iterations, for both the generalized ODE and scheme.

(a) Scaled errors $t^2\bigl(f(X(t)) - f^\star\bigr)$.
(b) Scaled errors $s k^2\bigl(f(x_k) - f^\star\bigr)$.
Figure 2: Minimizing a simple objective by the generalized ODE and scheme with $r = 1$. In (b), a fixed small step size is used.
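The following sketch reproduces this kind of experiment qualitatively: it runs the generalized scheme with momentum coefficient $(k-1)/(k+r-1)$ and tracks the scaled error $s k^2\bigl(f(x_k) - f^\star\bigr)$. The one-dimensional quadratic, the step size, and the iteration budget are illustrative assumptions of ours.

```python
import numpy as np

def scaled_errors(r, s=1e-4, x0=1.0, n_iter=20000):
    """Generalized Nesterov scheme with momentum (k-1)/(k+r-1) on f(x) = x^2/2 (so f* = 0);
    returns the scaled errors s*k^2*(f(x_k) - f*)."""
    f = lambda x: 0.5 * x * x
    grad = lambda x: x
    x_prev, y = x0, x0
    out = []
    for k in range(1, n_iter + 1):
        x = y - s * grad(y)
        y = x + (k - 1.0) / (k + r - 1.0) * (x - x_prev)
        x_prev = x
        out.append(s * k * k * f(x))
    return np.array(out)

# Consistent with Figure 2, the scaled error keeps growing for r = 1,
# while for r >= 3 it stays bounded.
for r in (1.0, 3.0, 4.0):
    print(r, scaled_errors(r)[-1])
```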

1.3 Outline and Notation

The rest of the paper is organized as follows. In Section 2, the ODE is rigorously derived from Nesterov's scheme, and a generalization to composite optimization, where $f$ may be non-smooth, is also obtained. Connections between the ODE and the scheme, in terms of trajectory behaviors and convergence rates, are summarized in Section 3. In Section 4, we discuss the effect on the convergence rate of replacing the constant 3 in (3) by an arbitrary constant $r$. A new restarting scheme is suggested in Section 5, with a linear convergence rate established and empirically observed.

Some standard notations used throughout the paper are collected here. We denote by $\mathcal{F}_L$ the class of convex functions $f$ with $L$-Lipschitz continuous gradients defined on $\mathbb{R}^n$, i.e., $f$ is convex, continuously differentiable, and satisfies

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$$

for any $x, y \in \mathbb{R}^n$, where $\|\cdot\|$ is the standard Euclidean norm and $L$ is the Lipschitz constant. Next, $\mathcal{S}_\mu$ denotes the class of $\mu$-strongly convex functions $f$ on $\mathbb{R}^n$ with continuous gradients, i.e., $f$ is continuously differentiable and $f(x) - \mu\|x\|^2/2$ is convex. We set $\mathcal{S}_{\mu,L} = \mathcal{F}_L \cap \mathcal{S}_\mu$.

2 Derivation

First, we sketch an informal derivation of the ODE (3). Assume $f \in \mathcal{F}_L$ for some $L > 0$. Combining the two equations of (1) and applying a rescaling gives

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \frac{k-1}{k+2}\,\frac{x_k - x_{k-1}}{\sqrt{s}} - \sqrt{s}\,\nabla f(y_k). \tag{4}$$

Introduce the Ansatz $x_k \approx X(k\sqrt{s})$ for some smooth curve $X(t)$ defined for $t \ge 0$. Put $t = k\sqrt{s}$. Then as the step size $s$ goes to zero, $X(t) \approx x_{t/\sqrt{s}} = x_k$ and $X(t + \sqrt{s}) \approx x_{k+1}$, and Taylor expansion gives

$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \dot X(t) + \tfrac{1}{2}\ddot X(t)\sqrt{s} + o(\sqrt{s}), \qquad \frac{x_k - x_{k-1}}{\sqrt{s}} = \dot X(t) - \tfrac{1}{2}\ddot X(t)\sqrt{s} + o(\sqrt{s}),$$

and $\sqrt{s}\,\nabla f(y_k) = \sqrt{s}\,\nabla f(X(t)) + o(\sqrt{s})$. Thus (4) can be written as

$$\dot X(t) + \tfrac{1}{2}\ddot X(t)\sqrt{s} + o(\sqrt{s}) = \Bigl(1 - \frac{3\sqrt{s}}{t}\Bigr)\Bigl(\dot X(t) - \tfrac{1}{2}\ddot X(t)\sqrt{s} + o(\sqrt{s})\Bigr) - \sqrt{s}\,\nabla f(X(t)) + o(\sqrt{s}). \tag{5}$$

By comparing the coefficients of $\sqrt{s}$ in (5), we obtain

$$\ddot X + \frac{3}{t}\,\dot X + \nabla f(X) = 0.$$

The first initial condition is $X(0) = x_0$. Taking $k = 1$ in (4) yields

$$\frac{x_2 - x_1}{\sqrt{s}} = -\sqrt{s}\,\nabla f(y_1) = O(\sqrt{s}).$$

Hence, the second initial condition is simply $\dot X(0) = 0$ (vanishing initial velocity).
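The comparison of $\sqrt{s}$ coefficients can be checked symbolically. The SymPy sketch below expands both sides of (4) under the Ansatz, with $\epsilon$ playing the role of $\sqrt{s}$; all symbol names are ours.

```python
import sympy as sp

t, e = sp.symbols('t epsilon', positive=True)   # epsilon plays the role of sqrt(s)
X, Xd, Xdd, G = sp.symbols('X Xdot Xddot G')    # X(t), velocity, acceleration, grad f(X(t))

# second-order Taylor expansions of X(t + epsilon) and X(t - epsilon)
X_plus = X + Xd * e + Xdd * e**2 / 2
X_minus = X - Xd * e + Xdd * e**2 / 2

k = t / e                                       # the identification t = k*sqrt(s)
lhs = (X_plus - X) / e                          # (x_{k+1} - x_k)/sqrt(s)
rhs = (k - 1) / (k + 2) * (X - X_minus) / e - e * G   # right-hand side of (4)

expansion = sp.series(sp.simplify(lhs - rhs), e, 0, 2).removeO()
print(sp.simplify(expansion.coeff(e, 0)))  # 0: the O(1) terms cancel
print(sp.simplify(expansion.coeff(e, 1)))  # Xddot + 3*Xdot/t + G, i.e. the ODE (3)
```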

One popular alternative momentum coefficient is $\theta_k(\theta_{k-1}^{-1} - 1)$, where the $\theta_k$ are iteratively defined as $\theta_{k+1} = \bigl(\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2\bigr)/2$, starting from $\theta_0 = 1$ (Nesterov, 1983; Beck and Teboulle, 2009). Simple analysis reveals that this coefficient asymptotically equals $(k-1)/(k+2)$ up to higher-order terms, thus leading to the same ODE as (1).

Classical results in ODE theory do not directly imply the existence or uniqueness of the solution to this ODE because the coefficient $3/t$ is singular at $t = 0$. In addition, $\nabla f$ is typically not analytic at $x_0$, which makes the power series method for studying singular ODEs inapplicable. Nevertheless, the ODE is well posed: the strategy we employ for showing this constructs a sequence of ODEs approximating (3), and then chooses a convergent subsequence by compactness arguments such as the Arzelá-Ascoli theorem. Below, $C^2\bigl((0,\infty); \mathbb{R}^n\bigr)$ denotes the class of twice continuously differentiable maps from $(0,\infty)$ to $\mathbb{R}^n$; similarly, $C^1\bigl([0,\infty); \mathbb{R}^n\bigr)$ denotes the class of continuously differentiable maps from $[0,\infty)$ to $\mathbb{R}^n$. For any $f \in \mathcal{F}_\infty := \cup_{L > 0}\mathcal{F}_L$ and any $x_0 \in \mathbb{R}^n$, the ODE (3) with initial conditions $X(0) = x_0$, $\dot X(0) = 0$ has a unique global solution $X \in C^2\bigl((0,\infty); \mathbb{R}^n\bigr) \cap C^1\bigl([0,\infty); \mathbb{R}^n\bigr)$. The next theorem, in a rigorous way, guarantees the validity of the derivation of this ODE. The proofs of both theorems are deferred to the appendices.

For any $f \in \mathcal{F}_\infty$, as the step size $s \to 0$, Nesterov's scheme (1) converges to the ODE (3) in the sense that for all fixed $T > 0$,

$$\lim_{s \to 0}\; \max_{0 \le k \le T/\sqrt{s}} \bigl\|x_k - X(k\sqrt{s})\bigr\| = 0.$$

2.1 Simple Properties

We collect some elementary properties that are helpful in understanding the ODE.
Time Invariance.  If we adopt a linear time transformation $\tilde{t} = ct$ for some $c > 0$, by the chain rule it follows that

$$\frac{dX}{d\tilde{t}} = \frac{1}{c}\,\frac{dX}{dt}, \qquad \frac{d^2X}{d\tilde{t}^2} = \frac{1}{c^2}\,\frac{d^2X}{dt^2}.$$

This yields the ODE parameterized by $\tilde{t}$,

$$\frac{d^2X}{d\tilde{t}^2} + \frac{3}{\tilde{t}}\,\frac{dX}{d\tilde{t}} + \frac{1}{c^2}\nabla f(X) = 0.$$

Also note that minimizing $f/c^2$ is equivalent to minimizing $f$. Hence, the ODE is invariant under this time change. In fact, it is easy to see that time invariance holds if and only if the coefficient of $\dot X$ has the form $C/t$ for some constant $C$.
Rotational Invariance.  Nesterov's scheme and other gradient-based schemes are invariant under rotations. As expected, the ODE is also invariant under orthogonal transformations. To see this, let $Y = QX$ for some orthogonal matrix $Q$. This leads to $\dot Y = Q\dot X$ and $\ddot Y = Q\ddot X$. Hence, denoting by $Q^\top$ the transpose of $Q$, the ODE in the new coordinate system reads $\ddot Y + \frac{3}{t}\dot Y + Q\nabla f(Q^\top Y) = 0$, which is of the same form as (3) once both sides are multiplied by $Q^\top$.
Initial Asymptotic.  Assume sufficient smoothness of $X$ such that $\lim_{t \to 0^+}\ddot X(t)$ exists. The mean value theorem guarantees the existence of some $t' \in (0, t)$ that satisfies $\dot X(t)/t = \bigl(\dot X(t) - \dot X(0)\bigr)/t = \ddot X(t')$. Hence, from the ODE we deduce $\ddot X(t) + 3\ddot X(t') + \nabla f(X(t)) = 0$. Taking the limit $t \to 0$ gives $\ddot X(0^+) = -\nabla f(x_0)/4$. Hence, for small $t$ we have the asymptotic form:

$$X(t) = x_0 - \frac{\nabla f(x_0)\, t^2}{8} + o(t^2).$$

This asymptotic expansion is consistent with the empirical observation that Nesterov’s scheme moves slowly in the beginning.
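This asymptotic form is easy to check numerically by integrating (3) from a tiny positive time. The test function, tolerances, and starting time below are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Check X(t) ~ x0 - grad_f(x0)*t^2/8 for small t on an illustrative smooth convex function.
grad_f = lambda x: np.array([4.0 * x[0]**3 + x[0], x[1]])   # gradient of x1^4 + x1^2/2 + x2^2/2
x0 = np.array([0.5, -1.0])

def rhs(t, z):
    X, V = z[:2], z[2:]
    return np.concatenate([V, -3.0 / t * V - grad_f(X)])

t_eval = np.array([0.05, 0.1, 0.2])
sol = solve_ivp(rhs, (1e-6, t_eval[-1]), np.concatenate([x0, np.zeros(2)]),
                t_eval=t_eval, rtol=1e-10, atol=1e-12)
for t, X in zip(t_eval, sol.y[:2].T):
    approx = x0 - grad_f(x0) * t**2 / 8.0
    print(t, np.linalg.norm(X - approx))   # the discrepancy is o(t^2) as t -> 0
```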

2.2 ODE for Composite Optimization

It is interesting and important to generalize the ODE to minimizing $f$ in the composite form $f(x) = g(x) + h(x)$, where the smooth part $g \in \mathcal{F}_L$ and the non-smooth part $h$ is a structured general convex function. Both Nesterov (2013) and Beck and Teboulle (2009) obtain the $O(1/k^2)$ convergence rate by employing the proximal structure of $h$. In analogy to the smooth case, an ODE for composite $f$ is derived in the appendix.

3 Connections and Interpretations

In this section, we explore the approximate equivalence between the ODE and Nesterov’s scheme, and provide evidence that the ODE can serve as an amenable tool for interpreting and analyzing Nesterov’s scheme. The first subsection exhibits inverse quadratic convergence rate for the ODE solution, the next two address the oscillation phenomenon discussed in Section 1.1, and the last subsection is devoted to comparing Nesterov’s scheme with gradient descent from a numerical perspective.

3.1 Analogous Convergence Rate

The original result from Nesterov (1983) states that, for any $f \in \mathcal{F}_L$, the sequence $\{x_k\}$ given by (1) with step size $s \le 1/L$ satisfies

$$f(x_k) - f^\star \le \frac{2\|x_0 - x^\star\|^2}{s\,(k+1)^2}. \tag{6}$$

Our next result indicates that the trajectory of (3) closely resembles the sequence $\{x_k\}$ in terms of the convergence rate to a minimizer $x^\star$. Compared with the discrete case, this proof is shorter and simpler. For any $f \in \mathcal{F}_\infty$, let $X(t)$ be the unique global solution to (3) with initial conditions $X(0) = x_0$, $\dot X(0) = 0$. Then, for any $t > 0$,

$$f(X(t)) - f^\star \le \frac{2\|x_0 - x^\star\|^2}{t^2}. \tag{7}$$

Consider the energy functional defined as $\mathcal{E}(t) = t^2\bigl(f(X(t)) - f^\star\bigr) + 2\bigl\|X + \tfrac{t}{2}\dot X - x^\star\bigr\|^2$ (we may also view this functional as the negative entropy; similarly, for the gradient flow $\dot X = -\nabla f(X)$, an energy functional of the form $t\bigl(f(X(t)) - f^\star\bigr) + \tfrac{1}{2}\|X - x^\star\|^2$ can be used to derive the bound $f(X(t)) - f^\star \le \|x_0 - x^\star\|^2/(2t)$), whose time derivative is

$$\dot{\mathcal{E}} = 2t\bigl(f(X) - f^\star\bigr) + t^2\bigl\langle \nabla f(X), \dot X\bigr\rangle + 4\Bigl\langle X + \tfrac{t}{2}\dot X - x^\star,\; \tfrac{3}{2}\dot X + \tfrac{t}{2}\ddot X\Bigr\rangle.$$

Substituting $\tfrac{3}{2}\dot X + \tfrac{t}{2}\ddot X$ with $-\tfrac{t}{2}\nabla f(X)$, the above equation gives

$$\dot{\mathcal{E}} = 2t\bigl(f(X) - f^\star\bigr) - 2t\bigl\langle X - x^\star, \nabla f(X)\bigr\rangle \le 0,$$

where the inequality follows from the convexity of $f$. Hence, by monotonicity of $\mathcal{E}$ and non-negativity of $2\|X + \tfrac{t}{2}\dot X - x^\star\|^2$, the gap satisfies

$$f(X(t)) - f^\star \le \frac{\mathcal{E}(t)}{t^2} \le \frac{\mathcal{E}(0)}{t^2} = \frac{2\|x_0 - x^\star\|^2}{t^2}.$$

Making use of the approximation $t \approx k\sqrt{s}$, we observe that the convergence rate in (6) is essentially a discrete version of that in (7), providing yet another piece of evidence for the approximate equivalence between the ODE and the scheme.

We finish this subsection by showing that the number 2 appearing in the numerator of the error bound in (7) is optimal. Consider an arbitrary $f \in \mathcal{F}_\infty(\mathbb{R})$ such that $f(x) = x$ for $x > 0$ and $f$ attains its minimum value $f^\star = 0$ at the origin. Starting from some $x_0 > 0$, the solution to (3) is $X(t) = x_0 - t^2/8$ before hitting the origin. Hence, $t^2\bigl(f(X(t)) - f^\star\bigr) = t^2(x_0 - t^2/8)$ has a maximum of $2x_0^2 = 2|x_0 - x^\star|^2$ achieved at $t = 2\sqrt{x_0}$. Therefore, we cannot replace 2 by any smaller number, and we can expect that this tightness also applies to the discrete analog (6).
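A short SymPy sketch verifies this tightness computation; as in the example above, it assumes $f$ behaves like $x \mapsto x$ on the positive axis with minimum value $0$ at the origin.

```python
import sympy as sp

t, x0 = sp.symbols('t x0', positive=True)

# Closed-form solution X(t) = x0 - t^2/8 of (3) when grad f = 1 (i.e. f(x) = x for x > 0).
X = x0 - t**2 / 8
assert sp.simplify(sp.diff(X, t, 2) + 3 / t * sp.diff(X, t) + 1) == 0

# The scaled gap t^2*(f(X(t)) - f*) with f* = 0 peaks at t = 2*sqrt(x0) with value 2*x0^2,
# matching the constant 2 in the numerator of (7).
gap = t**2 * X
print(sp.solve(sp.diff(gap, t), t))                  # [2*sqrt(x0)]
print(sp.simplify(gap.subs(t, 2 * sp.sqrt(x0))))     # 2*x0**2
```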

3.2 Quadratic and Bessel Functions

For a quadratic $f$, the ODE (3) admits a solution in closed form. This closed form solution turns out to be very useful in understanding the issues raised in the introduction.

Let $f(x) = \tfrac{1}{2}\langle x, Ax\rangle + \langle b, x\rangle$, where $A$ is a positive semidefinite matrix and $b$ is in the column space of $A$, because otherwise this function can attain $-\infty$. Then a simple translation in $x$ can absorb the linear term $\langle b, x\rangle$ into the quadratic term. Since both the ODE and the scheme move within the affine space perpendicular to the kernel of $A$, without loss of generality we assume that $A$ is positive definite, admitting a spectral decomposition $A = Q^\top\Lambda Q$, where $\Lambda$ is a diagonal matrix formed by the eigenvalues. Replacing $x$ with $Qx$, we assume $f(x) = \tfrac{1}{2}\langle x, \Lambda x\rangle$ from now on. Now, the ODE for this function admits a simple decomposition of the form

$$\ddot X_i + \frac{3}{t}\,\dot X_i + \lambda_i X_i = 0, \qquad i = 1, \ldots, n,$$

with $X_i(0) = x_{0,i}$, $\dot X_i(0) = 0$. Introduce $Y_i(u) := u\,X_i(u/\sqrt{\lambda_i})$, which satisfies

$$u^2\,\ddot Y_i(u) + u\,\dot Y_i(u) + (u^2 - 1)\,Y_i(u) = 0.$$

This is Bessel’s differential equation of order one. Since vanishes at , we see that is a constant multiple of , the Bessel function of the first kind of order one.222Up to a constant multiplier, is the unique solution to the Bessel’s differential equation that is finite at the origin. In the analytic expansion of , denotes the double factorial defined as for even , or

for odd

. It has an analytic expansion:

which gives the asymptotic expansion

when . Requiring , hence, we obtain

$$X_i(t) = \frac{2\,x_{0,i}}{t\sqrt{\lambda_i}}\; J_1\!\bigl(t\sqrt{\lambda_i}\bigr). \tag{8}$$

For large $u$, the Bessel function $J_1(u)$ has the following asymptotic form (see e.g. Watson, 1995):

$$J_1(u) = \sqrt{\frac{2}{\pi u}}\,\Bigl(\cos(u - 3\pi/4) + O(1/u)\Bigr). \tag{9}$$

This asymptotic expansion yields (note that $f^\star = 0$)

$$f(X(t)) - f^\star = \frac{2}{t^2}\sum_{i=1}^n x_{0,i}^2\, J_1\!\bigl(t\sqrt{\lambda_i}\bigr)^2 = O\!\left(\frac{\|x_0 - x^\star\|^2}{t^3\sqrt{\lambda_{\min}(A)}}\right). \tag{10}$$

On the other hand, (9) and (10) give a lower bound:

(11)

where $\|A\|$ is the spectral norm of $A$. The first inequality follows by interpreting the relevant quantity as the mean of the oscillating Bessel factor in (9) over a period, in a certain sense.

In view of (10), Nesterov’s scheme might possibly exhibit convergence rate for strongly convex functions. This convergence rate is consistent with the second inequality in Theorem 4.1. In Section 4.3, we prove the rate for a generalized version of (3). However, (11) rules out the possibility of a higher order convergence rate.

Recall that the function considered in Figure 1 is a two-dimensional positive definite quadratic. As the step size $s$ becomes smaller, the trajectory of Nesterov's scheme converges to the solid curve represented via the Bessel functions. While approaching the minimizer $x^\star$, each trajectory displays the oscillation pattern, as well captured by the zoomed Figure 1(b). This prevents Nesterov's scheme from achieving a better convergence rate. The representation (8) offers an excellent explanation, as follows. Denote by $T_1$ and $T_2$, respectively, the approximate periodicities of the first component $|X_1|$ and the second component $|X_2|$. By (9), we get $T_1 \approx \pi/\sqrt{\lambda_1}$ and $T_2 \approx \pi/\sqrt{\lambda_2}$. Hence, as the amplitude gradually decreases to zero, the function $f(X(t))$ has a major cycle given by the least common multiple of $T_1$ and $T_2$. A careful look at Figure 1(c) reveals that within each major bump there are, roughly, several minor peaks.
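The closed form (8) can be verified numerically against a direct integration of the one-dimensional ODE; the eigenvalue, the initial value, and the tolerances below are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import j1

lam, x0 = 0.04, 1.0                        # illustrative eigenvalue and initial value

def rhs(t, z):
    X, V = z
    return [V, -3.0 / t * V - lam * X]

t_eval = np.linspace(1.0, 100.0, 200)
sol = solve_ivp(rhs, (1e-6, 100.0), [x0, 0.0], t_eval=t_eval, rtol=1e-10, atol=1e-12)

# closed form (8): X(t) = 2*x0/(t*sqrt(lam)) * J1(t*sqrt(lam))
closed = 2.0 * x0 / (t_eval * np.sqrt(lam)) * j1(t_eval * np.sqrt(lam))
print(np.max(np.abs(sol.y[0] - closed)))   # agreement up to the integration tolerance
```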

3.3 Fluctuations of Strongly Convex $f$

The analysis carried out in the previous subsection only applies to convex quadratic functions. In this subsection, we extend the discussion to one-dimensional strongly convex functions. The Sturm-Picone theory (see e.g. Hinton, 2005) is used extensively throughout the analysis.

Let $f \in \mathcal{S}_{\mu,L}(\mathbb{R})$. Without loss of generality, assume $f$ attains its minimum at $x^\star = 0$. Then, by definition, $\mu \le f'(x)/x \le L$ for any $x \ne 0$. Denoting by $X(t)$ the solution to the ODE (3), we consider the self-adjoint equation,

$$\bigl(t^3\,Y'(t)\bigr)' + t^3\,\frac{f'(X(t))}{X(t)}\,Y(t) = 0, \tag{12}$$

which, apparently, admits the solution $Y(t) = X(t)$. To apply the Sturm-Picone comparison theorem, consider

$$\bigl(t^3\,Y'(t)\bigr)' + \mu\, t^3\,Y(t) = 0$$

for a comparison. This equation admits a solution of the form $t^{-1}J_1(\sqrt{\mu}\,t)$. Denote by $t_1 < t_2 < \cdots$ all the positive roots of $J_1$, which satisfy (see e.g. Watson, 1995)

$$t_i - t_{i-1} > \pi \ \ \text{for all } i \ge 1, \qquad t_i - t_{i-1} \to \pi \ \text{as } i \to \infty,$$

where $t_0 = 0$. Then, it follows that the positive roots of $t^{-1}J_1(\sqrt{\mu}\,t)$ are $t_1/\sqrt{\mu}, t_2/\sqrt{\mu}, \ldots$. Since $t^3 f'(X(t))/X(t) \ge \mu t^3$, the Sturm-Picone comparison theorem asserts that $X(t)$ has a root in each interval $[t_i/\sqrt{\mu},\, t_{i+1}/\sqrt{\mu}]$.

To obtain a similar result in the opposite direction, consider

$$\bigl(t^3\,Y'(t)\bigr)' + L\, t^3\,Y(t) = 0. \tag{13}$$

Applying the Sturm-Picone comparison theorem to (12) and (13), we ensure that between any two consecutive positive roots of $X(t)$, there is at least one root of $t^{-1}J_1(\sqrt{L}\,t)$, i.e., at least one $t_i/\sqrt{L}$. Now, we summarize our findings in the following. Roughly speaking, this result concludes that the oscillation frequency of the ODE solution is between $\sqrt{\mu}$ and $\sqrt{L}$.

Denote by $\tau_1 < \tau_2 < \cdots$ all the positive roots of $X(t)$. Then these roots satisfy, for all $i$,
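The sketch below illustrates this frequency sandwich numerically on an illustrative one-dimensional strongly convex function whose curvature alternates between $\mu$ and $L$; the root detection is deliberately crude.

```python
import numpy as np
from scipy.integrate import solve_ivp

mu, L = 1.0, 25.0
# f'(x) = L*x for x > 0 and mu*x for x <= 0, so f'(x)/x lies in [mu, L] (illustrative choice)
grad = lambda x: L * x if x > 0 else mu * x

def rhs(t, z):
    X, V = z
    return [V, -3.0 / t * V - grad(X)]

sol = solve_ivp(rhs, (1e-6, 60.0), [1.0, 0.0], max_step=0.01, rtol=1e-9, atol=1e-12)
X = sol.y[0]
roots = sol.t[:-1][np.sign(X[:-1]) != np.sign(X[1:])]   # crude sign-change detection
gaps = np.diff(roots)
print(gaps.min(), gaps.max())                    # gaps roughly between pi/sqrt(L) and pi/sqrt(mu)
print(np.pi / np.sqrt(L), np.pi / np.sqrt(mu))
```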

3.4 Nesterov’s Scheme Compared with Gradient Descent

The ansatz $t \approx k\sqrt{s}$ relating the ODE and Nesterov's scheme is formally confirmed in Theorem 2. Consequently, for any fixed constant value of $t$, this implies that $x_{\lceil t/\sqrt{s}\rceil}$ does not change much for a range of step sizes $s$. To empirically support this claim, we present an example in Figure 3(a), where the scheme minimizes a test function with several step sizes, starting from the same initial point. From this figure, we are delighted to observe that the iterates corresponding to the same $t$ are very close to each other.

This interesting square-root scaling has the potential to shed light on the superiority of Nesterov's scheme over gradient descent. Roughly speaking, each iteration of Nesterov's scheme amounts to traveling $\sqrt{s}$ in time along the integral curve of (3), whereas it is known that simple gradient descent $x_{k+1} = x_k - s\nabla f(x_k)$ moves $s$ along the integral curve of the gradient flow $\dot X = -\nabla f(X)$. We expect that for small $s$ Nesterov's scheme moves more in each iteration, since $\sqrt{s}$ is much larger than $s$. Figure 3(b) illustrates and supports this claim, where the function minimized is a convex quadratic (the coordinates are appropriately rotated so that $x_0$ and $x^\star$ lie on the same horizontal line). The circles are the first iterates of each method. For Nesterov's scheme, the seventh circle has already passed the minimizer $x^\star$, while for gradient descent the last point shown is still far from it.

(a) Square-root scaling of $s$.
(b) Race between Nesterov's scheme and gradient descent.
Figure 3: In (a), the circles, crosses and triangles mark iterates obtained with three different step sizes, evaluated at matching times $t = k\sqrt{s}$. In (b), the circles are iterations given by Nesterov's scheme or gradient descent, depending on the color, and the stars lie on the corresponding integral curves.

A second look at Figure 3(b) suggests that Nesterov's scheme allows a larger deviation from its limit curve than gradient descent does. This raises the question of the stable step size allowed for numerically solving the ODE (3) in the presence of accumulated errors. The finite difference approximation by the forward Euler method is

$$\frac{X(t + \Delta t) - 2X(t) + X(t - \Delta t)}{\Delta t^2} + \frac{3}{t}\,\frac{X(t) - X(t - \Delta t)}{\Delta t} + \nabla f(X(t)) = 0, \tag{14}$$

which is equivalent to

$$X(t + \Delta t) = \Bigl(2 - \frac{3\Delta t}{t}\Bigr)X(t) - \Delta t^2\,\nabla f(X(t)) - \Bigl(1 - \frac{3\Delta t}{t}\Bigr)X(t - \Delta t). \tag{15}$$

Assuming $f$ is sufficiently smooth, we have $\nabla f(X + \delta X) \approx \nabla f(X) + \nabla^2 f(X)\,\delta X$ for small perturbations $\delta X$, where $\nabla^2 f(X)$ is the Hessian of $f$ evaluated at $X$. Identifying $s = \Delta t^2$, the characteristic equation of this finite difference scheme is approximately

$$\det\Bigl(\lambda^2 - \bigl(2 - \Delta t^2\,\nabla^2 f\bigr)\lambda + 1\Bigr) = 0. \tag{16}$$

The numerical stability of (14) with respect to accumulated errors is equivalent to the following: all the roots of (16) lie in the unit circle (see e.g. Leader, 2004). When $\nabla^2 f \succeq 0$ (i.e. $f$ is convex), if $\Delta t$ is small and satisfies $\Delta t < 2/\sqrt{L}$, we see that all the roots of (16) lie in the unit circle. On the other hand, if $\Delta t > 2/\sqrt{L}$, (16) can possibly have a root outside the unit circle, causing numerical instability. Under our identification $s = \Delta t^2$, a step size of $s = 1/L$ in Nesterov's scheme (1) is approximately equivalent to a step size of $\Delta t = 1/\sqrt{L}$ in the forward Euler method, which is stable for numerically integrating (14).

As a comparison, note that the forward Euler scheme for the ODE $\dot X = -\nabla f(X)$, which models gradient descent with updates $x_{k+1} = x_k - s\nabla f(x_k)$, has the characteristic equation $\det\bigl(\lambda - (1 - \Delta t\,\nabla^2 f)\bigr) = 0$. Thus, to guarantee $-1 \le \lambda \le 1$ in worst case analysis, one can only choose $\Delta t \le 2/L$ for a fixed step size, which is much smaller than the stable step size for (14) when $\nabla^2 f$ is very variable, i.e., when $L$ is large.
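A small experiment makes the stability threshold visible; the recursion below implements (15) for a scalar quadratic, and the constants are illustrative.

```python
import numpy as np

def euler_recursion(lam, dt, t0=1.0, n=2000, x0=1.0):
    """Iterate the finite difference scheme (15) for the scalar quadratic f(x) = lam*x^2/2."""
    x_prev, x, t = x0, x0, t0
    for _ in range(n):
        x_next = (2 - 3 * dt / t) * x - dt**2 * lam * x - (1 - 3 * dt / t) * x_prev
        if abs(x_next) > 1e12:
            return float('inf')            # numerical blow-up
        x_prev, x, t = x, x_next, t + dt
    return abs(x)

L = 100.0
print(euler_recursion(L, dt=1.0 / np.sqrt(L)))   # dt = 1/sqrt(L): stays bounded and decays
print(euler_recursion(L, dt=3.0 / np.sqrt(L)))   # dt > 2/sqrt(L): blows up (inf)
```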

4 The Magic Constant 3

Recall that the constant 3 appearing in the coefficient of $\dot X$ in (3) originates from $(k-1)/(k+2) = 1 - 3/(k+2)$. This number leads to the momentum coefficient in (1) taking the form $(k-1)/(k+2)$. In this section, we demonstrate that 3 can be replaced by any larger number, while maintaining the $O(1/k^2)$ convergence rate. To begin with, let us consider the following ODE parameterized by a constant $r$:

$$\ddot X + \frac{r}{t}\,\dot X + \nabla f(X) = 0 \tag{17}$$

with initial conditions $X(0) = x_0$, $\dot X(0) = 0$. The proof of Theorem 2, which seamlessly applies here, guarantees the existence and uniqueness of the solution $X$ to this ODE.

Interpreting the damping ratio $r/t$ as a measure of friction in the damping system (in physics and engineering, damping may be modeled as a force proportional to velocity but opposite in direction, i.e. resisting motion; for instance, this force may be used as an approximation to the friction caused by drag; in our model, this force would be proportional to $-(r/t)\dot X$, where $\dot X$ is the velocity and $r/t$ is the damping coefficient), our results say that more friction does not end the $O(1/t^2)$ and $O(1/k^2)$ convergence rates. On the other hand, in the lower friction setting, where $r$ is smaller than 3, we can no longer expect an inverse quadratic convergence rate unless some additional structure of $f$ is imposed. We believe that this striking phase transition at 3 deserves more attention as an interesting research challenge.

4.1 High Friction

Here, we study the convergence rate of (17) with $r > 3$ and $f \in \mathcal{F}_\infty$. Compared with (3), this new ODE, viewed as a damping system, suffers from higher friction. Following the strategy adopted in the proof of Theorem 3.1, we consider a new energy functional defined as

$$\mathcal{E}(t) = \frac{2t^2}{r-1}\bigl(f(X(t)) - f^\star\bigr) + (r-1)\Bigl\|X(t) + \frac{t}{r-1}\dot X(t) - x^\star\Bigr\|^2.$$

By studying the derivative of this functional, we get the following result.

The solution $X$ to (17) satisfies

$$f(X(t)) - f^\star \le \frac{(r-1)^2\,\|x_0 - x^\star\|^2}{2t^2}, \qquad \int_0^\infty t\bigl(f(X(t)) - f^\star\bigr)\,dt \le \frac{(r-1)^2\,\|x_0 - x^\star\|^2}{2(r-3)}.$$

Noting $r\dot X + t\ddot X = -t\nabla f(X)$, we get $\dot{\mathcal{E}}(t)$ equal to

$$\frac{4t}{r-1}\bigl(f(X(t)) - f^\star\bigr) - 2t\bigl\langle X(t) - x^\star,\, \nabla f(X(t))\bigr\rangle \le -\frac{2(r-3)t}{r-1}\bigl(f(X(t)) - f^\star\bigr), \tag{18}$$

where the inequality follows from the convexity of $f$. Since $r > 3$, the last display implies that $\mathcal{E}$ is non-increasing. Hence

$$\frac{2t^2}{r-1}\bigl(f(X(t)) - f^\star\bigr) \le \mathcal{E}(t) \le \mathcal{E}(0) = (r-1)\,\|x_0 - x^\star\|^2,$$

yielding the first inequality of this theorem. To complete the proof, from (18) it follows that

$$\int_0^\infty \frac{2(r-3)t}{r-1}\bigl(f(X(t)) - f^\star\bigr)\,dt \le -\int_0^\infty \dot{\mathcal{E}}(t)\,dt \le \mathcal{E}(0) = (r-1)\,\|x_0 - x^\star\|^2,$$

as desired for establishing the second inequality. The first inequality is the same as (7) for the ODE (3), except for a larger constant $(r-1)^2/2$. The second inequality measures the error $f(X(t)) - f^\star$ in an average sense, and cannot be deduced from the first inequality.

Now, it is tempting to obtain such analogs for the discrete Nesterov's scheme as well. Following the formulation of Beck and Teboulle (2009), we wish to minimize $f$ in the composite form $f(x) = g(x) + h(x)$, where $g \in \mathcal{F}_L$ for some $L > 0$ and $h$ is convex on $\mathbb{R}^n$, possibly assuming the extended value $+\infty$. Define the proximal subgradient

$$G_s(x) := \frac{x - \operatorname{argmin}_z \bigl(\|z - (x - s\nabla g(x))\|^2/(2s) + h(z)\bigr)}{s}.$$

Parameterizing by a constant $r$, we propose the generalized Nesterov's scheme,

$$x_k = y_{k-1} - s\,G_s(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+r-1}\,(x_k - x_{k-1}), \tag{19}$$

starting from $y_0 = x_0$. The discrete analog of Theorem 4.1 is below. The sequence $\{x_k\}$ given by (19) with step size $s \le 1/L$ and $r > 3$ satisfies

$$f(x_k) - f^\star \le \frac{(r-1)^2\,\|x_0 - x^\star\|^2}{2s\,(k + r - 2)^2}, \qquad \sum_{k=1}^{\infty} (k + r - 1)\bigl(f(x_k) - f^\star\bigr) \le \frac{(r-1)^2\,\|x_0 - x^\star\|^2}{2s\,(r-3)}.$$

The first inequality suggests that the generalized Nesterov's schemes still achieve the $O(1/k^2)$ convergence rate. However, if the error bound satisfies $f(x_{m_i}) - f^\star \ge c/(s\,m_i^2)$ for some arbitrarily small $c > 0$ along a dense subsequence $\{m_i\}$, i.e., $|\{m_i\} \cap [1, m]| \ge \alpha m$ for all $m \ge 1$ and some $\alpha > 0$, then the second inequality of the theorem would be violated. To see this, note that if it were the case, we would have $\sum_k (k + r - 1)\bigl(f(x_k) - f^\star\bigr) \gtrsim \sum_i c/(s\,m_i) = \infty$; the sum of the harmonic series over a dense subset of the positive integers is infinite. Hence, the second inequality is not trivial because it implies that the $O(1/(sk^2))$ error bound is, in some sense, suboptimal.
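For concreteness, here is a minimal sketch of scheme (19) for the composite objective $g(x) + \lambda_1\|x\|_1$, where the proximal step reduces to soft-thresholding. The test problem, regularization weight, and choice $r = 4$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau*||.||_1; used to evaluate y - s*G_s(y) when h = lam1*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_nesterov_composite(A, b, lam1, s, r, n_iter):
    """Scheme (19) for f(x) = 0.5*||Ax - b||^2 + lam1*||x||_1 (illustrative composite objective).
    Since y - s*G_s(y) = prox_{s*h}(y - s*grad g(y)), each iteration is a proximal gradient step
    followed by momentum with coefficient (k-1)/(k+r-1)."""
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    for k in range(1, n_iter + 1):
        grad_g = A.T @ (A @ y - b)
        x = soft_threshold(y - s * grad_g, s * lam1)      # x_k = y_{k-1} - s*G_s(y_{k-1})
        y = x + (k - 1.0) / (k + r - 1.0) * (x - x_prev)  # generalized momentum
        x_prev = x
    return x_prev

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = (np.arange(100) < 5).astype(float)               # sparse ground truth (illustrative)
b = A @ x_true
s = 1.0 / np.linalg.norm(A, 2) ** 2                       # 1/L with L the Lipschitz constant of grad g
x_hat = generalized_nesterov_composite(A, b, lam1=0.1, s=s, r=4.0, n_iter=500)
print(np.count_nonzero(np.abs(x_hat) > 1e-6))             # recovers a sparse solution
```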

Now we turn to the proof of this theorem. It is worth pointing out that, though based on the same idea, the proof below is much more complicated than that of Theorem 4.1. Consider the discrete energy functional,

where . If we have

(20)

then it would immediately yield the desired results by summing (20) over $k$. That is, by recursively applying (20), we see