 # Hamiltonian Descent Methods

We propose a family of optimization methods that achieve linear convergence using first-order gradient information and constant step sizes on a class of convex functions much larger than the smooth and strongly convex ones. This larger class includes functions whose second derivatives may be singular or unbounded at their minima. Our methods are discretizations of conformal Hamiltonian dynamics, which generalize the classical momentum method to model the motion of a particle with non-standard kinetic energy exposed to a dissipative force and the gradient field of the function of interest. They are first-order in the sense that they require only gradient computation. Yet, crucially the kinetic gradient map can be designed to incorporate information about the convex conjugate in a fashion that allows for linear convergence on convex functions that may be non-smooth or non-strongly convex. We study in detail one implicit and two explicit methods. For one explicit method, we provide conditions under which it converges to stationary points of non-convex functions. For all, we provide conditions on the convex function and kinetic energy pair that guarantee linear convergence, and show that these conditions can be satisfied by functions with power growth. In sum, these methods expand the class of convex functions on which linear convergence is possible with first-order computation.

## Code Repositories

### FirstExplicitMethod-HDM

Implementation of Hamiltonian Descent Methods (First explicit method) with Python3

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the problem of unconstrained minimization of a differentiable function ,

 minx∈\Rdf(x), (1)

by iterative methods that require only the partial derivatives of , known also as first-order methods [38, 45, 41]. These methods produce a sequence of iterates , and our emphasis is on those that achieve linear convergence, i.e., as a function of the iteration they satisfy for some rate and a global minimizer. We briefly consider non-convex differentiable , but the bulk of our analysis focuses on the case of convex differentiable . Our results will also occasionally require twice differentiability of .

The convergence rates of first-order methods on convex functions can be broadly separated by the properties of strong convexity and Lipschitz smoothness. Taken together these properties for convex are equivalent to the conditions that the following left hand bound (strong convexity) and right hand bound (smoothness) hold for some and all ,

 μ2\normx−y22≤f(x)−f(y)−\inner∇f(y)x−y≤L2\normx−y22, (2)

where is the standard inner product and is the Euclidean norm. For twice differentiable

, these properties are equivalent to the conditions that eigenvalues of the matrix of second-order partial derivatives

are everywhere lower bounded by and upper bounded by , respectively. Thus, functions whose second derivatives are continuously unbounded or approaching 0, cannot be both strongly convex and smooth. Both bounds play an important role in the performance of first-order methods. On the one hand, for smooth and strongly convex , the iterates of many first-order methods converge linearly. On the other hand, for any first-order method, there exist smooth convex functions and non-smooth strongly convex functions on which its convergence is sub-linear, i.e., for any first-order method on smooth convex functions. See [38, 45, 41] for these classical results and  for other more exotic scenarios. Moreover, for a given method it can sometimes be very easy to find examples on which its convergence is slow; see Figure 1, in which gradient descent with a fixed step size converges slowly on , which is not strongly convex as its Hessian is singular at . Figure 1: Optimizing f(x)=[x(1)+x(2)]4+[x(1)/2−x(2)/2]4 with three methods: gradient descent with fixed step size equal to 1/L0 where L0=λmax(∇2f(x0)) is the maximum eigenvalue of the Hessian ∇2f at x0; classical momentum, which is a particular case of our first explicit method with k(p)=[(p(1))2+(p(2))2]/2 and fixed step size equal to 1/L0; and Hamiltonian descent, which is our first explicit method with k(p)=(3/4)[(p(1))4/3+(p(2))4/3] and a fixed step size.

The central assumption in the worst case analyses of first-order methods is that information about is restricted to black box evaluations of and locally at points , see [38, 41]. In this paper we assume additional access to first-order information of a second differentiable function and show how can be designed to incorporate information about to yield practical methods that converge linearly on convex functions. These methods are derived by discretizing the conformal Hamiltonian system . These systems are parameterized by and with solutions ,

 x′t =∇k(pt) (3) p′t =−∇f(xt)−γpt.

From a physical perspective, these systems model the dynamics of a single particle located at with momentum and kinetic energy being exposed to a force field and a dissipative force. For this reason we refer to as, the kinetic energy, and , the kinetic map. When the kinetic map is the identity, , these dynamics are the continuous time analog of Polyak’s heavy ball method . Let denote the centered version of , which takes its minimum at , with minimum value . Our key observation in this regard is that when is convex, and is chosen as (where is the convex conjugate of ), these dynamics have linear convergence with rate independent of . In other words, this choice of acts as a preconditioner, a generalization of using for . Thus can exploit global information provided by the conjugate to condition convergence for generic convex functions. Figure 2: Convergence Regions for Power Functions. Shown are regions of distinct convergence types for Hamiltonian descent systems with f(x)=|x|b/b,k(p)=|p|a/a for x,p∈\R and a,b∈(1,∞). We show in Section 2 convergence is linear in continuous time iff 1/a+1/b≥1. In Section 4 we show that the assumptions of the explicit discretizations can be satisfied if 1/a+1/b=1, leaving this as the only suitable pairing for linear convergence. Light dotted line is the line occupied by classical momentum with k(p)=p2/2.

To preview the flavor of our results in detail, consider the special case of optimizing the power function for and initialized at using system (3) (or discretizations of it) with for and . FOr this choice of , it can be shown that when . In line with this, in Section 2 we show that (3) exhibits linear convergence in continuous time if and only if . In Section 3 we propose two explicit discretizations with fixed step sizes; in Section 4 we show that the first explicit discretization converges if and , and the second converges if and . This means that the only suitable pairing corresponds in this case to the choice . Figure 2 summarizes this discussion. Returning to Figure 1, we can compare the use of the kinetic energy of Polyak’s heavy ball with a kinetic energy that relates appropriately to the convex conjugate of .

Most convex functions are not simple power functions, and computing exactly is rarely feasible. To make our observations useful for numerical optimization, we show that linear convergence is still achievable in continuous time even if for some within a region defined by . We study three discretizations of (3), one implicit method and two explicit ones (which are suitable for functions that grow asymptotically fast or slow, respectively). We prove linear convergence rates for these under appropriate additional assumptions. We introduce a family of kinetic energies that generalize the power functions to capture distinct power growth near zero and asymptotically far from zero. We show that the additional assumptions of discretization can be satisfied for this family of . We derive conditions on that guarantee the linear convergence of our methods when paired with a specific choice of from this family. These conditions generalize the quadratic growth implied by smoothness and strong convexity, extending it to general power growth that may be distinct near the minimum and asymptotically far from the minimum, which we refer to as tail and body behavior, respectively. Step sizes can be fixed independently of the initial position (and often dimension), and do not require adaptation, which often leads to convergence problems, see . Indeed, we analyze a kinetic map that resembles the iterate updates of some popular adaptive gradient methods [13, 59, 18, 27], and show that it conditions the optimization of strongly convex functions with very fast growing tails (non-smooth). Thus, our methods provide a framework optimizing potentially non-smooth or non-strongly convex functions with linear rates using first-order computation.

The organization of the paper is as follows. In the rest of this section, we cover notation, review a few results from convex analysis, and give an overview of the related literature. In Section 2, we show the linear convergence of (3) under conditions on the relation between the kinetic energy and . We show a partial converse that in some settings our conditions are necessary. In Section 3, we present the three discretizations of the continuous dynamics and study the assumptions under which linear rates can be guaranteed for convex functions. For one of the discretizations, we also provide conditions under which it converges to stationary points of non-convex functions. In Section 4, we study a family of kinetic energies suitable for functions with power growth. We describe the class of functions for which the assumptions of the discretizations can be satisfied when using these kinetic energies.

### 1.1 Notation and Convex Analysis Review

We let denote the standard inner product for and the Euclidean norm. For a differentiable function , the gradient

is the vector of partial derivatives at

. For twice-differentiable , the Hessian is the matrix of second-order partial derivatives at . The notation denotes the solution to a differential equation with derivative in denoted . denotes the iterates of a discrete system.

Consider a convex function that is defined on a convex domain and differentiable on the interior . The convex conjugate is defined as

 h∗(p)=sup{\innerxp−h(x):x∈C} (4)

and it is itself convex. It is easy to show from the definition that if is another convex function such that for all , then for all . Because we make such extensive use of it, we remind readers of the Fenchel-Young inequality: for and ,

 \innerxp ≤h(x)+h∗(p), (5) which is easily derived from the definition of h∗, or see Section 12 of . For x∈\interior(C) by Theorem 26.4 of , \innerx∇h(x) =h(x)+h∗(∇h(x)). (6)

Let , . If , then (Theorem 12.3 ). If for and , then where (page 106 of ). If , then (Table 3.2 ). For these and more on , we refer readers to [47, 8, 6].

### 1.2 Related Literature

Standard references on convex optimization and the convergence analysis of first-order methods include [38, 45, 3, 8, 41, 9].

The heavy ball method was introduced by Polyak in his seminal paper . In this paper, local convergence with linear rate was shown (i.e., when the initial position is sufficiently close to the local minimum). For quadratic functions, it can be shown that the convergence rate for optimally chosen step sizes is proportional to the square root of the conditional number of the Hessian, similarly to conjugate gradient descent (see e.g., ). As far as we know, global convergence of the heavy ball method for non-quadratic functions was only recently established in  and , see 

for an extension to stochastic average gradients. The heavy ball method forms the basis of the some of the most successful optimization methods for deep learning, see e.g.,

[54, 27], and the recent review . Hereafter, classical momentum refers to any first-order discretization of the continuous analog of Polyak’s heavy ball (with possibly suboptimal step sizes).

Nesterov obtained upper and lower bounds of matching order for first-order methods for smooth convex functions and smooth strongly convex functions, see . In Necoara et al. , the assumption of strong convexity was relaxed, and under a weaker quadratic growth condition, linear rates were obtained by several well known optimization methods. Several other authors obtained linear rates for various classes of non-strongly convex or non-uniformly smooth functions, see e.g., [37, 26, 11, 58, 14, 48].

In recent years, there has been interest in the optimization community in looking at the continuous time ODE limit of optimization methods, when the step size tends to zero. Su et al. [52, 53] have found the continuous time limit of Nesterov’s accelerated gradient descent. This result improves the intuition about Nesterov’s method, as the proofs of convergence rates in continuous time are rather elegant and clear, while the previous proofs in discrete time are not as transparent. Follow-ups have studied the continuous time counterparts to accelerated mirror descent  as well as higher order discretizations of such systems [55, 56]. Studying continuous time systems for optimization can separate the concerns of designing an optimizer from the difficulties of discretization. This perspective has resulted in numerous other recent works that propose new optimization methods, and study existing ones via their continuous time limit, see e.g., [4, 1, 15, 24, 10, 16, 17].

Conformal Hamiltonian systems (3) are studied in geometry [33, 5], because their solutions preserve symplectic area up to a constant; when symplectic area is exactly preserved, when symplectic area dissipates uniformly at an exponential rate . In classical mechanics, Hamiltonian dynamics (system (3) with ) are used to describe the motion of a particle exposed to the force field . Here, the most common form for is , where is the mass, or in relativistic mechanics, where is the speed of light, see 

. In the Markov Chain Monte Carlo literature, where (discretized) Hamiltonian dynamics (again

) are used to propose moves in a Metropolis–Hastings algorithm [34, 23, 12, 35],

is viewed as a degree of freedom that can be used to improve the mixing properties of the Markov chain

[20, 31]. Stochastic differential equations similar to (3) with have been studied from the perspective of designing [32, 51].

## 2 Continuous Dynamics

In this section, we motivate the discrete optimization algorithms by introducing their continuous time counterparts. These systems are differential equations described by a Hamiltonian vector field plus a dissipation field. Thus, we briefly review Hamiltonian dynamics, the continuous dynamics of Hamiltonian descent, and derive convergence rates for convex in continuous time.

### 2.1 Hamiltonian Systems

In the Hamiltonian formulation of mechanics, the evolution of a particle exposed to a force field is described by its location and momentum as functions of time. The system is characterized by the total energy, or Hamiltonian,

 H(x,p)=k(p)+f(x)−f(x⋆), (7)

where is one of the global minimizers of and is called the kinetic energy. Throughout, we consider kinetic energies that are a strictly convex functions with minimum at . The Hamiltonian defines the trajectory of a particle and its momentum

via the ordinary differential equation,

 x′t =∇pH(xt,pt)=∇k(pt) (8) p′t =−∇xH(xt,pt)=−∇f(xt).

For any solution of this system, the value of the total energy over time is conserved as Thus, the solutions of the Hamiltonian field oscillate, exchanging energy from to and back again.

### 2.2 Continuously Descending the Hamiltonian

The solutions of a Hamiltonian system remain in the level set . To drive such a system towards stationary points, the total energy must reduce over time. Consider as a motivating example the continuous system , which describes Polyak’s heavy ball algorithm in continuous time . Letting , the heavy ball system can be rewritten as

 x′t=ptp′t=−∇f(xt)−γpt. (9)

Note that this system can be viewed as a combination of a Hamiltonian field with and a dissipation field, i.e., where and ), see Figure 3 for a visualization. This is naturally extended to define the more general conformal Hamiltonian system ,

 x′t =∇k(pt) (3 revisited) p′t =−∇f(xt)−γpt.

with . When is convex with a minimum , these systems descend the level sets of the Hamiltonian. We can see this by showing that the total energy is reduced along the trajectory ,

 H′t=\inner∇k(pt)p′t+\inner∇f(xt)x′t=−γ\inner∇k(pt)pt≤−γk(pt)≤0, (10)

where we have used the convexity of , and the fact that it is minimised at .

The following proposition shows some existence and uniqueness results for the dynamics (3). We say that is radially unbounded if when , e.g., this would be implied if and were strictly convex with unique minima. [Existence and uniqueness] If and are continuous, is convex with a minimum , and is radially unbounded, then for every , there exists a solution of (3) defined for every with . If in addition, and are continuously differentiable, then this solution is unique.

###### Proof.

First, only assuming continuity, it follows from Peano’s existence theorem  that there exists a local solution on an interval for some . Let denote the right maximal interval where a solution of (3) satisfying that and exist. From (10), it follows that , and hence for every . Now by the radial unboundedness of , and the fact that , it follows that the compact set is never left by the dynamics, and hence by Theorem 3 of  (page 91), we must have . The uniqueness under continuous differentiability follows from the Fundamental Existence–Uniqueness Theorem on page 74 of . ∎

As shown in the next proposition, (10) implies that conformal Hamiltonian systems approach stationary points of .

[Convergence to a stationary point] Let be a solution to the system (3) with initial conditions , continuously differentiable, and continuously differentiable, strictly convex with minimum at and . If is bounded below and is radially unbounded, then .

###### Proof.

Since is bounded below, . Since is radially unbounded, the set is a compact set that contains in its interior. Moreover, by (10), we also have for all . Consider the set . Since is strictly convex, this set is equivalent to . The largest invariant set of the dynamics (3) inside is . By LaSalle’s principle , all trajectories started from must approach . Since is a continuous bounded function on the compact set , there is a point such that for every (i.e. the minimum is attained in ) by the extreme value theorem (see ). Moreover, due to the definition of , is in its interior, hence and therefore . Thus the set is non-empty (note that might contain other local minima as well). ∎

This construction can be generalized by modifying the component of (3) to a more general dissipation field . If the dissipation field is everywhere aligned with the kinetic map, , then these systems dissipate energy. We have not found alternatives to that result in linear convergence in general.

### 2.3 Continuous Hamiltonian Descent on Convex Functions

In this section we study how can be designed to condition the system (3) for linear convergence in . Although the solutions of (3) approach stationary points under weak conditions, to derive rates we consider the case when is convex. To motivate our choice of , consider the quadratic function with for positive definite symmetric . Now (3) becomes,

 x′t=A−1ptp′t=−Axt−γpt. (11)

By the change of variables , this is equivalent to

 x′t=vtv′t=−xt−γvt, (12)

which is a universal equation and hence the convergence rate of (11) is independent of . Although this kinetic energy implements a constant preconditioner for any , for this specific is its convex conjugate . This suggests the core idea of this paper: taking related in some sense to for more general convex functions may condition the convergence of (3). Indeed, we show in this section that, if the kinetic energy upper bounds a centered version of , then the convergence of (3) is linear.

More precisely, define the following centered function ,

 fc(x)=f(x+x⋆)−f(x⋆). (13)

The convex conjugate of is given by and is minimized at . Importantly, as we will show in the final lemma of this section, taking a kinetic energy such that for some suffices to achieve linear rates on any differentiable convex in continuous time. The constant is included to capture the fact that

may under estimate

by some constant factor, so long as it is positive. If does not depend in any fashion on , then the convergence rate of (3) is independent of . In Section 2.4 we also show a partial converse — for some simple problems taking a not satisfying those assumptions results in sub-linear convergence for almost every path (except for one unique curve and its mirror).

There is an interesting connection to duality theory for a specific choice of . In a slight abuse of representation, consider rewriting the original problem as

 minx∈\Rdf(x)=minx∈\Rd12(f(x)+f(x)).

The Fenchel dual of this problem is equivalent to the following problem after a small reparameterization of (see Chapter 31 of ),

 maxp∈\Rd12(−f∗(p)−f∗(−p)).

The Fenchel duality theorem guarantees that for a given pair of primal-dual variables , the duality gap between the primal objective and the dual objective is positive. Thus,

 f(x)−(−f∗(p)−f∗(−p))/2 =f(x)−f(x⋆)+(f∗(p)+f∗(−p))/2+f(x⋆) =f(x)−f(x⋆)+(f∗c(p)+f∗c(−p))/2≥0.

Thus, for the choice , which as we will show implies linear convergence of (3), the Hamiltonian is exactly the duality gap between the primal and dual objectives.

Linear rates in continuous time can be derived by a Lyapunov function that summarizes the total energy of the system, contracts exponentially (or linearly in -space), and is positive unless . Ultimately we are trying to prove a result of the form for some rate . As the energy is decreasing, it suggests using as a Lyapunov function. Unfortunately, this will not suffice, as plateaus instantaneously () at points on the trajectory where despite possibly being far from . However, when , the momentum field reduces to the term and the derivative of in is instantaneously strictly negative for convex (unless we are at ). This suggests the family of Lyapunov functions that we study in this paper,

 V(x,p)=H(x,p)+β\innerx−x⋆p, (14)

where (see the next lemma for conditions that guarantee that it is non-negative). As with , is used to indicate at time along a solution to (3). Before moving on to the final lemma of the section, we prove two technical lemmas that will give us useful control over throughout the paper.

The first lemma describes how must be constrained for to be positive and to track closely, so that it is useful for the analysis of the convergence of and ultimately .

[Bounding the ratio of and ]lembetachoicealpha Let , convex with unique minimum , strictly convex with minimum , and .
If is such that , then

 \innerx−x⋆p≥−(k(p)/α+f(x)−f(x⋆))≥−H(x,p)α, (15) α−βαH(x,p)≤V(x,p). (16)

If is such that , then

 \innerx−x⋆p≤k(p)/α+f(x)−f(x⋆)≤H(x,p)α, (17) V(x,p)≤α+βαH(x,p). (18)
###### Proof.

Assuming that , we have

 k(p)/α+fc(x−x⋆) ≥f∗c(−p)+fc(x−x⋆) ≥\innerx−x⋆−p−fc(x−x⋆)+fc(x−x⋆) =−\innerx−x⋆p,

hence we have (15). (16) follows by rearrangement. The proof of (17) and (18) is similar. ∎

Lemma 2.3 constrains in terms of . For a result like , we will need to control in terms of the magnitude of the dissipation field. The following lemma provides constraints on and, under those constraints, the optimal . The proof can be found in Section A of the Appendix.

[Convergence rates in continuous time for fixed ]lemVderivativebound Given , differentiable and convex with unique minimum , differentiable and strictly convex with minimum . Let be the value at time of a solution to the system (3) such that there exists where . Define

 λ(α,β,γ)=min(αγ−αβ−βγα−β,β(1−γ)1−β). (19)

If , then

 V′t≤−λ(α,β,γ)Vt.

Finally,

1. The optimal , and are given by,

 β⋆ =11+α(α+γ2−√(1−γ)α2+γ24), (20) λ⋆ =⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩11−α((1−γ)α+γ2−√(1−γ)α2+γ24) for 0<α<1,γ(1−γ)2−γ for α=1, (21)
2. If , then

 λ(α,β,γ)=β(1−γ)1−β, and (22) −(γ−β−γ2(1−γ)/4)k(pt)−βγ\innerxt−x⋆pt−β\innerxt−x⋆∇f(xt) ≤−β(1−γ)(k(pt)+f(xt)−f(x⋆)+β\innerxt−x⋆pt). (23)

These two lemmas are sufficient to prove the linear contraction of and the contraction under the assumption of constant and . Still, the constant , which controls our approximation of may be quite pessimistic if it must hold globally along as the system converges to its minimum. Instead, in the final lemma that collects the convergence result for this section, we consider the case where may increase as convergence proceeds. To support an improving , our constant will now have to vary with time and we will be forced to take slightly suboptimal and given by (22) of Lemma 2.3. Still, the improving will be important in future sections for ensuring that we are able to achieve position independent step sizes.

We are now ready to present the central result of this section. Under Assumptions 2.3 we show linear convergence of (3). In general, the dependence of the rate of linear convergence on is via the function and the constant in our analysis.

Assumptions A.
1. [label= A.0]

2. differentiable and convex with unique minimum .

3. differentiable and strictly convex with minimum .

4. .

5. There exists some differentiable non-increasing convex function and constant such that for every ,

 k(p)≥α(k(p))max(f∗c(p),f∗c(−p)) (24)

and that for every

 −Cα,γα′(y)y<α(y). (25)

In particular, if for a constant , then the constant function serves as a valid, but pessimistic choice.

Assumption 4 can be satisfied if a symmetric lower bound on is known. For example, strong convexity implies

 f(x+x⋆)−f(x⋆)≥μ2\normx22.

This in turn implies . Because is symmetric, it satisfies 4 which explains why conditions relating to strong convexity are necessary for linear convergence of Polyak’s heavy ball.

[Convergence bound in continuous time with general ]thmcontinuouslyap Given , , , , satisfying Assumptions 2.3. Let be a solution to the system (3) with initial states where . Let , , and be the solution of

 W′t=−λ⋅α(2Wt)Wt,

with . Then for every , we have

 f(xt)−f(x⋆)≤2H0exp(−λ∫t0α(2Wt))≤2H0exp(−λα⋆t). (26)
###### Proof.

By (24) in assumption 4, the conditions of Lemma 2.3 hold, and by (15) and (17) we have

 |\innerxt−x⋆pt|≤k(pt)/α(k(pt))+f(xt)−f(x⋆)≤Htα(k(pt)). (27)

Instead of defining the Lyapunov function exactly as in (14) we take a time-dependent . Specifically, for every let be the unique solution of the equation

 v=Ht+Cα,γα(2v)2\innerxt−x⋆pt (28)

in the interval . To see why this equation has a unique solution in , note that from (27) it follows that

and hence for any such , we have

 Ht2≤Ht+Cα,γα(2v)2\innerxt−x⋆pt≤32Ht. (29)

This means that for , the left hand side of (28) is smaller than the right hand side, while for , it is the other way around. Now using (25) in assumption 4 and (27), we have

 ∣∣Cα,γα′(2Vt)\innerxt−x⋆pt∣∣≤∣∣∣Cα,γα′(2Vt)2Vtα(2Vt)∣∣∣<1, (30)

Thus, by differentiation, we can see that (30) implies that

 ∂∂v(v−Ht−Cα,γ2α(2v)\innerxt−x⋆pt)>0,

which implies that (28) has a unique solution in . Let and . By the implicit function theorem, it follows that is differentiable in . Morover, since

 Vt=Ht+Cα,γα(2Vt)2\innerxt−x⋆pt (31)

for every , by differentiating both sides, we obtain that

 V′t =−(γ−βt)\inner∇k(pt)pt−βtγ\innerxt−x⋆pt−βt\innerxt−x⋆∇f(xt)+β′t\innerxt−x⋆pt The first three terms are equivalent to the temporal derivative of Vt with constant β=βt. Since αt≤α(k(pt)) and βt≤γ, the assumptions of Lemma 2.3 are satisfied locally for αt, βt and we get V′t ≤−λ(αt,βt,γ)Vt+β′t\innerxt−x⋆pt=−λ(αt,βt,γ)Vt+Cα,γα′