Gradient-descent-based methods are ubiquitous in machine learning because they require only first-order information, which makes them computationally cheap. However, vanilla gradient descent can be slow. Accelerated gradient methods are popular alternatives because they achieve the best worst-case complexity bounds. Their basic construction can be traced back to Polyak [Polyak:1964] and Nesterov [Nesterov:1983], who introduced two different accelerated versions of gradient descent. Gradient descent with momentum, or simply classical momentum (CM) for short, is often written in the form [Hinton:2013]
where $\mu$ is the momentum factor, $\epsilon$ the learning rate, and $f$ the function being minimized. Nesterov's accelerated gradient (NAG) has a similar form, and one of its realizations is [Hinton:2013] (we will also consider the other popular form of NAG, which is equivalent to this form with an adaptive $\mu$).
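As a concrete reference, here is a minimal Python sketch of one step of each method, assuming the common parametrization with momentum factor `mu` and learning rate `eps` (the function and parameter names are illustrative, not from the original):

```python
def cm_step(x, p, grad_f, mu=0.9, eps=0.01):
    """One step of classical momentum (CM): accumulate past gradients
    into the momentum p, then move the position along p."""
    p = mu * p - eps * grad_f(x)
    return x + p, p

def nag_step(x, p, grad_f, mu=0.9, eps=0.01):
    """One step of Nesterov's accelerated gradient (NAG): evaluate the
    gradient at the look-ahead point x + mu * p instead of x."""
    p = mu * p - eps * grad_f(x + mu * p)
    return x + p, p

# Minimize f(x) = x^2 / 2, whose gradient is x, starting from x = 1.
grad_f = lambda x: x
x_cm, p_cm = 1.0, 0.0
x_nag, p_nag = 1.0, 0.0
for _ in range(200):
    x_cm, p_cm = cm_step(x_cm, p_cm, grad_f)
    x_nag, p_nag = nag_step(x_nag, p_nag, grad_f)
```

The only difference between the two is where the gradient is evaluated, which is exactly what the symplecticity analysis below distinguishes.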
Both algorithms have a long history in optimization and have been extensively applied as optimizers in deep learning [Hinton:2013]. They are also the basic prototypes of other accelerated algorithms, such as Adam [Kingma:2015], AdaGrad [Duchi:2017], and RMSprop [Tieleman:2012], which in addition include different normalizations of the gradient. However, a complete theoretical understanding of these methods is still lacking, and their benefits remain unclear [Wilson:2017].
A promising direction to understand accelerated methods has been emerging by drawing connections to continuous dynamical systems. Although relations between differential equations and optimization date back to Cauchy [Cauchy:1847], and were used by Polyak [Polyak:1964], there has been renewed interest due to the recent discovery of a second-order differential equation associated to Nesterov’s method [Candes:2016]. This motivated several other works that established analogous connections with other algorithms as well [Wibisono:2016, Krichene:2015, Attouch:2016, Attouch:2017, Attouch:2018, Zhang:2018, Shi:2018, Yang:2018, Franca:2018, Franca:2018b]. In essence, the dynamical systems behind these methods can all be seen as a nonlinear version of the damped harmonic oscillator, sometimes with extra ingredients, whose Lagrangian and Hamiltonian variational formulations have been known for a long time [Bateman:1931, Lemos:1979]. From this perspective, both algorithms (1) and (2) correspond to different discretizations of the same Hamiltonian system (see (6)).
Conservative Hamiltonian systems have a special underlying symplectic geometry where the flow map preserves the symplectic 2-form [Goldstein]. A discretization that preserves this symplectic structure is called a symplectic integrator. Such integrators are well-known for long-time stability, for preserving first integrals, and more importantly, for preserving the phase portrait of the flow map [Quispel:2006]. The first steps in the construction of these geometric integrators date back to the 1950s [Vogelaere:1956], but went largely unnoticed. Only later, in the 1980s, was this approach rediscovered and did it gain interest [Ruth:1983, Channell:1983, Channell:1983b, Kang:1987, Neri:1987, Menyuk:1984]. (See [Forest:2006, Channell:1990] for a historical view.) Nowadays, symplectic integrators are widely used across many fields, such as nonlinear and chaotic systems, molecular dynamics, complex systems, Monte Carlo methods, particle physics, and astrophysics. However, only recently has their importance started to be recognized in optimization [Betancourt:2018].
Conservative systems, however, are of limited use for optimization. Dissipative systems are more relevant, since such a system can be driven to a state of minimal energy, which may correspond to a minimum of the objective function being minimized. One class of dissipative systems is that of conformal Hamiltonian systems [McLachlan:2001], which have a constant friction that dissipates energy and a well-defined symplectic structure: the symplectic 2-form contracts at a constant exponential rate. Discretizations that preserve this dissipation of the symplectic 2-form are called conformal symplectic integrators [McLachlan:2006]. These are related to, but different from, symplectic integrators. In general, when a symplectic integrator is applied to a conformal Hamiltonian system, the resulting method is not conformal symplectic [McLachlan:2004].
Standard discretizations are designed to provide fast and accurate numerical solutions of a given differential equation. Yet, there is no guarantee whatsoever that they respect geometric properties of the flow map. In fact, typical discretizations are prone to miss global structures of Hamiltonian systems and are unable to describe long-time qualitative phenomena. With typical discretizations, such as explicit/implicit Euler, the system may be damped or excited artificially, which may create misleading results. For instance, the method may settle into a stable fixed point that is not truly present in the original continuous system. On the other hand, if the numerical map is (conformal) symplectic, this is no longer the case as the geometric structure of the flow is preserved. Moreover, (conformal) symplectic methods may yield faster, simpler, and more stable algorithms [McLachlan:2006]. For these reasons, whenever a (conformal) Hamiltonian system is being studied, (conformal) symplectic integrators are the obvious choice.
Which properties of the underlying dynamical system, if any,
do classical momentum and Nesterov’s accelerated gradient preserve?
The Hamiltonian system behind (1) and (2) is based on Newtonian mechanics, where trajectories lie in Euclidean space and time is just a parameter. Since space and time are independent, velocities can be arbitrarily large. This is reflected in the resulting discretizations, such as (1) and (2), where a large gradient may give rise to a large momentum, which in turn may cause the updates to diverge. The only way to control this potential instability is to make the stepsize smaller, which makes the algorithm slower. However, Newtonian mechanics is inconsistent with the fact that, physically, there exists a limiting speed: the speed of light in vacuum. To reconcile mechanics with this, and other facts, Einstein [Einstein:1905] realized that space and time are not independent. He thus introduced a unified geometric entity called spacetime (see Figure 2), leading to the special theory of relativity. By constructing discretizations of relativistic systems, one can incorporate such a limiting velocity and potentially obtain more stable optimization algorithms that do not require severe restrictions on the stepsize. Moreover, this can be done in a (conformal) symplectic manner.
Starting from a conformal Hamiltonian system, we first construct a general conformal symplectic integrator. This construction allows us to draw precise connections with popular momentum based methods. Indeed, we are able to partly address the question raised above as follows:
The classical momentum method is a conformal symplectic integrator,
while Nesterov’s accelerated gradient is not.
The second (and perhaps most important) contribution is a relativistic generalization of CM. Applying our conformal symplectic integrator to a dissipative relativistic system we obtain the following updates. (We later present a different, but equivalent, parametrization in terms of physical constants, which is preferable; see (33). We write (3) here simply for an easier comparison with (1).)
where $c$ is a maximum speed playing the role of the speed of light. Note that as $c \to \infty$ this recovers classical momentum (1). The updates (3) can naturally be seen as a relativistic gradient descent (RGD) method. We stress that the position updates in (3) are always bounded, even when the gradients are unbounded, contrary to (1). Moreover, we will demonstrate through numerical experiments that, under careful tuning of its parameters, RGD results in faster and more stable optimization of challenging problems such as the Rosenbrock function and image classification with deep convolutional neural networks (CNNs). Importantly, (3b) resembles the popular Adam, AdaGrad, and RMSprop, but employs a different normalization of the gradients. Unlike these alternatives, however, RGD enjoys a clear and elegant theoretical justification in terms of conformal symplectic integrators, while obtaining comparable performance in practice.
2 Conformal Hamiltonian Systems
Here we summarize the basic ideas of conformal Hamiltonian systems together with their intrinsic symplectic geometry (see [McLachlan:2001, Maro:2017] for more details). The state of a Hamiltonian system is described by a point $(q, p)$ in phase space, where $q \in \mathbb{R}^n$ denotes the generalized coordinates, $p \in \mathbb{R}^n$ the conjugate momentum associated to $q$, and trajectories are parametrized by time $t$. The system is completely specified by a Hamiltonian function $H = H(q, p)$. A conformal vector field is required to obey a modified version of Hamilton's equations, given by
$$\dot{q} = \frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial q} - \gamma p, \tag{4}$$
where $\gamma > 0$. Here $\gamma$ represents a damping that dissipates the energy. It is important that $\gamma$ be constant for the system to be conformal. A classical example is given by
$$H(q, p) = \frac{\|p\|^2}{2m} + f(q), \tag{5}$$
where $m$ is the mass of a particle moving in $n$ dimensions subject to the potential $f(q)$. In this case, the equations of motion (4) become
$$\dot{q} = \frac{p}{m}, \qquad \dot{p} = -\nabla f(q) - \gamma p. \tag{6}$$
The Hamiltonian $H$ is the energy of the system. Taking its total time derivative along trajectories one concludes that $\dot{H} = -\gamma \|p\|^2 / m \le 0$, thus $H$ is a Lyapunov function and all orbits tend to fixed points [McLachlan:2001].
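To make the dissipation concrete, the following Python sketch integrates the example system (6) for a simple quadratic potential with a crude explicit Euler scheme and checks that $H$ decays along the trajectory. This is illustrative only; the structure-preserving discretization is the subject of Section 3, and the parameter values here are arbitrary.

```python
import numpy as np

# Conformal system (6): qdot = p/m, pdot = -grad f(q) - gamma*p,
# with energy H = ||p||^2/(2m) + f(q).
m, gamma, h = 1.0, 0.5, 1e-3
f = lambda q: 0.5 * np.dot(q, q)      # quadratic potential (illustrative)
grad_f = lambda q: q
H = lambda q, p: np.dot(p, p) / (2 * m) + f(q)

q, p = np.array([1.0, -2.0]), np.array([0.5, 0.0])
energies = [H(q, p)]
for _ in range(5000):
    # naive explicit Euler step (simultaneous update of q and p)
    q, p = q + h * p / m, p + h * (-grad_f(q) - gamma * p)
    energies.append(H(q, p))

# H acts as a Lyapunov function: it decays as the orbit spirals to a fixed point.
```

Here the energy drops by more than an order of magnitude over the simulated interval, consistent with $\dot{H} \le 0$.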
The most remarkable property of conformal Hamiltonian systems is their underlying symplectic geometry. Let
$$J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}.$$
Note that $J^T = -J$ and $J^T J = I$, so that $J$ is real, orthogonal and skew-symmetric. Since $J^2 = -I$, we have a complex structure on phase space. Let $z = (q, p)$ and define the symplectic 2-form
$$\omega = dq \wedge dp. \tag{9}$$
(In the language of differential geometry, this is the 2-form $\omega = \sum_i dq_i \wedge dp_i$, where $\wedge$ is the wedge product. For any 1-forms $\alpha$ and $\beta$ we have $\alpha \wedge \beta = -\beta \wedge \alpha$. Also, $dg$ defines a 1-form for any differentiable function $g$. Each term $dq_i \wedge dp_i$ is an area element on the $(q_i, p_i)$ plane, and summing over all these area elements gives (9).)
A transformation $\Phi$ is said to be symplectic if it preserves the symplectic form, $\Phi^* \omega = \omega$, which is equivalent to
$$(D\Phi)^T J \, (D\Phi) = J, \tag{10}$$
where $D\Phi$ denotes the Jacobian of $\Phi$.
Thus, if $\Phi_1$ and $\Phi_2$ are symplectic transformations so is their composition $\Phi_2 \circ \Phi_1$. Moreover, (10) implies that $D\Phi$ is invertible with a symplectic inverse. Restricting to cases where $\det D\Phi = 1$, which naturally includes Hamiltonian flow maps, we see that symplectic transformations form a group under composition called the symplectic group $Sp(2n)$.
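These algebraic facts are easy to verify numerically. The sketch below builds $J$ for $n = 2$ and checks orthogonality, skew-symmetry, the complex structure, and that a simple shear map (a symplectic transformation whose Jacobian is constant) satisfies condition (10); the particular matrices are illustrative choices.

```python
import numpy as np

n = 2
I = np.eye(n)
Z = np.zeros((n, n))
J = np.block([[Z, I], [-I, Z]])   # the canonical symplectic matrix

# J is orthogonal (J^T J = I), skew-symmetric (J^T = -J), and J^2 = -I.
ortho = np.allclose(J.T @ J, np.eye(2 * n))
skew = np.allclose(J.T, -J)
complex_structure = np.allclose(J @ J, -np.eye(2 * n))

# The shear (q, p) -> (q + B p, p) with symmetric B has constant Jacobian
# DPhi = [[I, B], [0, I]] and satisfies (D Phi)^T J (D Phi) = J.
B = np.array([[2.0, 1.0], [1.0, 3.0]])
DPhi = np.block([[I, B], [Z, I]])
symplectic = np.allclose(DPhi.T @ J @ DPhi, J)
```

Note that the shear also has $\det D\Phi = 1$, consistent with membership in $Sp(2n)$.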
The equations of motion (8) define a flow $\varphi_t$ according to $\varphi_t(z_0) = z(t)$, where $z(t)$ is the trajectory at time $t$ with initial condition $z(0) = z_0$. Let $D\varphi_t$ be the Jacobian matrix of $\varphi_t$. It is possible to show that [McLachlan:2001]
$$(D\varphi_t)^T J \, (D\varphi_t) = e^{-\gamma t} J. \tag{11}$$
Therefore, a conformal Hamiltonian flow contracts the symplectic 2-form exponentially. (For a conservative system, i.e., $\gamma = 0$, the Hamiltonian flow is a symplectic transformation (10) and the symplectic form is preserved.) Moreover, it follows from (11) that the volume also shrinks exponentially, since $\det D\varphi_t = e^{-n \gamma t}$, where $2n$ is the dimension of phase space. Note that the contraction of the volume is stronger as the number of dimensions increases. Note also that in the conservative case ($\gamma = 0$), volumes are left invariant. It can also be shown [McLachlan:2001b] that any first integral $I$ of a conservative system (i.e., one such that $\dot{I} = 0$ when $\gamma = 0$) obeys $\dot{I} = -\gamma I$, or equivalently $I(t) = e^{-\gamma t} I(0)$, in the conformal case ($\gamma > 0$). Another known property of conformal Hamiltonian systems is that their Lyapunov exponents sum in pairs to $-\gamma$ [Dessler:1988]. This imposes constraints on the admissible dynamics and controls the phase portraits near fixed points. For other properties of attractor sets we refer to [Maro:2017].
3 Conformal Symplectic Optimization
Here we derive a family of optimization algorithms through a conformal symplectic discretization of a Hamiltonian system. We then show that classical momentum is conformal symplectic, while Nesterov’s accelerated gradient is not.
3.1 A Conformal Symplectic Integrator
Standard approaches to conformal symplectic integrators treat the damping term in (4) separately and apply any symplectic integrator to the remaining (conservative) part [Moore:2016]. Obviously, one can use any of the existing schemes to obtain a valid discretization of (4). Herein, we propose a particular construction that will make precise connections to known optimization methods.
Consider the conformal Hamiltonian system (4) under a separable Hamiltonian $H(q, p) = T(p) + f(q)$, which can be written as a sum of two vector fields,
$$\begin{pmatrix} \dot{q} \\ \dot{p} \end{pmatrix} = \underbrace{\begin{pmatrix} \partial T / \partial p \\ 0 \end{pmatrix}}_{A} + \underbrace{\begin{pmatrix} 0 \\ -\nabla f(q) - \gamma p \end{pmatrix}}_{B}. \tag{13}$$
We associate flows $\varphi_t^A$ and $\varphi_t^B$ to the vector fields $A$ and $B$, respectively. Symplectic integrators are splitting methods that approximate the true flow by composing $\varphi_t^A$ and $\varphi_t^B$ [Hairer]; see Figure 1 (left). For a small and fixed time interval $h$, we can approximate the flow $\varphi_h$ using $\varphi_h^A$ and $\varphi_h^B$. Hence, integrating each term of (13) independently, we have mappings $\varphi_h^A$ and $\varphi_h^B$ that up to $O(h^2)$ are given by
$$\varphi_h^A(q, p) = \begin{pmatrix} q + h \, \partial T / \partial p \\ p \end{pmatrix}, \qquad \varphi_h^B(q, p) = \begin{pmatrix} q \\ e^{-\gamma h} p - h \nabla f(q) \end{pmatrix}. \tag{14}$$
Consider the composition (see Figure 1 (right))
$$\varphi_h = \varphi_{h/2}^B \circ \varphi_h^A \circ \varphi_{h/2}^B.$$
(This is known as the Strang composition [Strang:1968] and is second-order accurate, i.e., it approximates the true flow locally up to $O(h^3)$, since it cancels the neglected terms in (14).)
Consider a regular discretization $t_k = k h$, for $k = 0, 1, 2, \dots$, and denote $q_k \approx q(t_k)$ and $p_k \approx p(t_k)$. Hence, one step of the resulting integrator reads
$$p_{k+1/2} = e^{-\gamma h/2} p_k - \tfrac{h}{2} \nabla f(q_k), \qquad q_{k+1} = q_k + h \, \frac{\partial T}{\partial p}(p_{k+1/2}), \qquad p_{k+1} = e^{-\gamma h/2} p_{k+1/2} - \tfrac{h}{2} \nabla f(q_{k+1}). \tag{17}$$
Note that (17) requires only one gradient computation per iteration (the last gradient can be reused) and is an explicit method, hence computationally cheap. Setting $\gamma = 0$ in (17) reduces it to the leapfrog, or Störmer-Verlet, method [Hairer], which is arguably the most popular symplectic integrator used in molecular dynamics and Monte Carlo simulations. The discretization (17) approximates the true flow locally up to $O(h^3)$. Hence, it is second-order accurate, as one can show that the global error is $O(h^2)$ over a finite interval.
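As a concrete sketch, here is one plausible Python realization of this splitting scheme for the classical kinetic energy $T(p) = \|p\|^2 / (2m)$; the half-step damping factor $e^{-\gamma h / 2}$ and all names are assumptions of this sketch, not necessarily the exact coefficients of (17).

```python
import numpy as np

def conformal_leapfrog(q, p, grad_f, h=0.1, gamma=0.5, m=1.0, steps=500):
    """Sketch of a conformal splitting integrator: a damped momentum
    half-step, a full position drift, then another damped half-step.
    With gamma = 0 this reduces to the Stormer-Verlet (leapfrog) method."""
    d = np.exp(-gamma * h / 2)            # contraction factor per half-step
    for _ in range(steps):
        p = d * p - (h / 2) * grad_f(q)   # half-step: damped kick
        q = q + (h / m) * p               # full step: drift
        p = d * p - (h / 2) * grad_f(q)   # half-step: damped kick
    return q, p

# Minimize f(q) = ||q||^2 / 2 (gradient: q); the iterates spiral into the minimum.
q, p = conformal_leapfrog(np.array([3.0, -1.0]), np.zeros(2), grad_f=lambda q: q)
```

Only one fresh gradient is needed per iteration, since the gradient at the end of one step can be reused at the start of the next.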
Since the flow of conformal Hamiltonian systems has a specific geometric structure, it is desirable that a numerical discretization preserves the geometry of the flow map. We use the following definition [Hairer].
A numerical one-step map $z_{k+1} = \phi_h(z_k)$, where $h > 0$ is a stepsize, is said to be conformal symplectic if $\phi_h$ is conformal symplectic, i.e., $(D\phi_h)^T J \, (D\phi_h) = e^{-\gamma h} J$, whenever $\phi_h$ is applied to a smooth Hamiltonian. (Note that $\phi_h$ simply takes any $z_k$ and outputs $z_{k+1}$. Since $z_{k+1} = \phi_h(z_k)$ for a one-step method, after $k$ iterations we have $z_k = (\phi_h \circ \cdots \circ \phi_h)(z_0)$, corresponding to time $t_k = k h$. The algorithm (17) consists precisely of such iterations.)
The numerical mapping (17) is a conformal symplectic integrator.
Using the skew-symmetric property of the wedge product together with (18) we obtain
3.2 Classical Momentum is Conformal Symplectic
Next, we show a surprising positive result.
Now, define the momentum factor $\mu$ and learning rate $\epsilon$ via the correspondence (21) to obtain the equivalent updates
Corollary 3 implies that CM is a structure-preserving discretization of (6). Note also that the updates in (20), despite being equivalent to (1), offer a more interpretable alternative with parameters expressed in terms of physical constants. Thus, based on physical intuition we make the following observations:
The parameter $\gamma$ controls the amount of dissipation. We automatically have $\mu = e^{-\gamma h} \in (0, 1)$. A large $\gamma$ brings more friction, which tends to make the system slower. Thus, if $f$ has low curvature, a small $\gamma$ is desirable. However, in settings where $f$ has high curvature, the system can have strong oscillations, which can be controlled by a larger $\gamma$, thus improving convergence.
The mass $m$ controls the inertia. Light particles gain more acceleration and travel faster. Thus, in general, we expect that a small $m$ will speed up the algorithm. Note that a smaller $m$ increases the learning rate; see (21). One should be careful, however, since when $m$ is too small, (20b) may become unstable and diverge.
Naturally, $h$ is the discretization stepsize. By increasing $h$ the algorithm takes larger strides, but may become unstable, especially in regions of high curvature. However, when $f$ is nearly flat, a larger $h$ might work fine and speed up the algorithm.
The above intuition does indeed have theoretical support. Following the results of [Franca:2018b], when applied to (6), one obtains the rates described below.
Consider system (6). Let $f^\star$ denote the minimum value of $f$.
If $f$ is convex, then for all $t > 0$ it holds that
If $f$ is $\lambda$-strongly convex, then for all $t > 0$ we have
Making $m$ smaller improves the constants in these rates. Such rates are also consistent with the above observations regarding the damping $\gamma$. Finally, since (20) is conformal symplectic it preserves the dissipation of (6), and thus we expect these rates to be preserved. Stability of the continuous system depends crucially on the damping. Since the same dissipation is preserved by the algorithm, this indicates that the same stability properties should hold in discrete time.
3.3 NAG is Not Conformal Symplectic
NAG is not a conformal symplectic integrator.
Consider the one-step mapping $(q_k, p_k) \mapsto (q_{k+1}, p_{k+1})$ defined in (24), and note that under a suitable correspondence between its parameters and those of (2), this mapping is equivalent to NAG. (As usual, it is implicit that one must first set $k = 0$, then $k = 1$, and so forth [Hairer].) Thus, it suffices to show that (24) is not conformal symplectic. The variational form of these updates is given by
From the skew-symmetry of the wedge product we obtain an extra term which in general does not vanish. Therefore, from (27) we conclude that the symplectic 2-form is not contracted by the constant factor $e^{-\gamma h}$, which implies that the mapping is not conformal symplectic, finishing the proof. ∎
It is common to find Nesterov's method in the form
$$y_{k+1} = q_k - \epsilon \nabla f(q_k), \qquad q_{k+1} = y_{k+1} + \mu_k \left( y_{k+1} - y_k \right), \tag{29}$$
where the momentum coefficient $\mu_k$ is iteration-dependent. This system can also be written in exactly the same form as (2) [Hinton:2013], now with an adaptive $\mu$. This can be seen by introducing a momentum variable and writing the updates in terms of it and $q_k$. Thus, when $\mu$ is constant we already showed in Theorem 5 that (29) is not conformal symplectic. In the adaptive case, the differential equation associated with (29) [Candes:2016] is equivalent to (4) after the substitution $\gamma \to 3/t$. Since this damping is time dependent, the system is no longer conformal. Therefore, in its instantiation (29), NAG cannot be a conformal symplectic integrator.
4 Conformal Relativistic Systems
The previous algorithms are based on (6), which is equivalent to Newton's equation $m \ddot{q} = -\nabla f(q) - \gamma m \dot{q}$. In Newtonian mechanics, time is just a parameter, independent of the Euclidean space where trajectories take place. Since time and space are independent, there is no restriction on the speed a particle can attain. This translates into a discretization such as (20): when the gradient is large, the momentum is also large, implying that the position updates may diverge. See Figure 2 (left) for an illustration of the underlying classical geometry.
As mentioned in the introduction, this classical view is an incomplete description of nature, which is corrected by special relativity; for an introduction, see [Goldstein, French]. In reality, space and time form a unified geometric entity called spacetime, which is the $(n+1)$-dimensional Minkowski space with coordinates $x = (ct, q)$, where $t$ is time, $q \in \mathbb{R}^n$, and $c$ is the speed of light. The metric is $\eta = \operatorname{diag}(-1, 1, \dots, 1)$, yielding the interval $ds^2 = -c^2 dt^2 + \|dq\|^2$. Null geodesics correspond to $ds^2 = 0$, which implies that $\|dq/dt\| = c$. Thus, there is a limit to the maximal speed any particle can achieve: the speed of light. This imposes constraints on the geometry, as illustrated in Figure 2 (right). The lower and upper parts of the cone represent the past and the future, and physical trajectories must reside inside the cone, whose boundary corresponds to particles traveling at the speed of light. Since velocities are now bounded, this may offer benefits to optimization, such as preventing the divergence that may happen in the classical case.
As before, let $q \in \mathbb{R}^n$ denote the spatial coordinates, with conjugate momentum $p$. The relativistic dynamics of a particle subject to the potential $f(q)$ can be described by the Hamiltonian [Landau2]
$$H(q, p) = c \sqrt{\|p\|^2 + m^2 c^2} + f(q). \tag{30}$$
In the classical limit, $\|p\| \ll m c$, we obtain $H \approx \|p\|^2 / (2m) + f(q) + m c^2$, recovering (5) up to the constant $m c^2$. This constant is Einstein's famous equivalence between mass and energy and corresponds to the rest energy of the particle. Substituting (30) into (4), to include dissipation as a conformal system, we obtain the relativistic equations of motion
$$\dot{q} = \frac{c \, p}{\sqrt{\|p\|^2 + m^2 c^2}}, \qquad \dot{p} = -\nabla f(q) - \gamma p. \tag{31}$$
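The classical limit of the kinetic term in (30) can be made explicit by a Taylor expansion, a standard computation included here for completeness:

```latex
c\sqrt{\|p\|^2 + m^2 c^2}
  = mc^2 \sqrt{1 + \frac{\|p\|^2}{m^2 c^2}}
  = mc^2 \left( 1 + \frac{\|p\|^2}{2 m^2 c^2}
      + O\!\left(\frac{\|p\|^4}{m^4 c^4}\right) \right)
  = mc^2 + \frac{\|p\|^2}{2m} + O\!\left(\frac{\|p\|^4}{m^3 c^2}\right),
```

so the relativistic Hamiltonian reduces to the classical one, shifted by the rest energy $mc^2$, whenever $\|p\| \ll mc$.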
Again, in the classical limit (31) recovers (6). Note how the momentum in the first equation above is normalized by the factor $\left( \|p\|^2 + m^2 c^2 \right)^{-1/2}$, so that $\|\dot{q}\| < c$ remains bounded even if $p$ were to grow unbounded. Applying our general conformal symplectic integrator (17), which only requires using the relativistic kinetic energy of (30), yields the discretization
Using the same argument as in Corollary 3 that led to (20) now leads to the following updates, which we call relativistic gradient descent (RGD). (The updates (3) are obtained from those of (33) under the correspondence (21) together with a redefinition of the speed of light. Although (3) looks closer to (1), the form presented in (33) has a more intuitive parameter dependence.)
Relativistic gradient descent (33) is a conformal symplectic integrator.
The previous intuition on tuning $\gamma$, $m$ and $h$ also holds for RGD. In special relativity $c$ is a universal constant, but in RGD it is a free parameter, which imposes a cutoff on the updates that can prevent divergence, ensuring at least that the position updates remain bounded. Thus, RGD can be more stable and is at least as good as CM, since the latter is recovered as a particular case. We stress that relativistic effects are only noticeable when particles have a velocity comparable to $c$. For small velocities, classical and relativistic mechanics coincide, implying that there is no significant difference between RGD and CM.
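A minimal Python sketch of RGD in this spirit follows, combining the splitting scheme of Section 3.1 with the relativistic drift of (31). The half-step damping factor and all names are assumptions of this sketch and may differ from the exact coefficients of (33).

```python
import numpy as np

def rgd(q, p, grad_f, h=0.1, gamma=0.5, m=1.0, c=10.0, steps=500):
    """Sketch of relativistic gradient descent: the conformal leapfrog with
    the relativistic drift q <- q + h*c*p / sqrt(|p|^2 + m^2 c^2), whose
    speed is bounded by c, so each position update is bounded by h*c."""
    d = np.exp(-gamma * h / 2)
    for _ in range(steps):
        p = d * p - (h / 2) * grad_f(q)                            # damped kick
        q = q + h * c * p / np.sqrt(np.dot(p, p) + (m * c) ** 2)   # bounded drift
        p = d * p - (h / 2) * grad_f(q)                            # damped kick
    return q, p

# As c -> infinity the drift tends to q + (h/m) p, recovering classical momentum.
q, p = rgd(np.array([3.0, -1.0]), np.zeros(2), grad_f=lambda q: q)
```

The normalization of the drift is what gives (3b) its resemblance to adaptive methods such as Adam, while keeping the update a conformal symplectic step.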
5 Numerical Experiments
5.1 Rosenbrock Function
Consider the nonconvex Rosenbrock function [Rosenbrock:1960, Goldberg:1989]
$$f(q) = \sum_{i=1}^{n-1} \left[ 100 \left( q_{i+1} - q_i^2 \right)^2 + \left( 1 - q_i \right)^2 \right], \tag{34}$$
where $q \in \mathbb{R}^n$. This function has a global minimum at $q^\star = (1, \dots, 1)$, where $f(q^\star) = 0$. For $n \ge 4$ it also has a local minimum near $q = (-1, 1, \dots, 1)$ [Shang:2006]. The global minimum lies in a long, narrow, parabolic flat valley. It is easy to find this valley, but convergence to the minimum is difficult. For instance, in our experiments, standard gradient descent was unable to minimize this function. In Figure 3 we compare the previous methods when minimizing (34) in different settings. The parameters are tuned for each algorithm separately (see Appendix A for details). We use CM in its standard form (1), but also in the "physical" form (20) to illustrate that decreasing $m$ usually improves convergence. We use NAG in the form (2). Note that RGD obtains faster convergence by controlling the speed of light $c$, allowing us to decrease $m$ even further compared to CM. Although we did not change the stepsize for RGD in these experiments, RGD often allows for larger values of $h$ than CM as well.
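For reference, here is a Python implementation of (34) and its analytic gradient, assuming the common coefficient 100 (a gradient easily checked against finite differences):

```python
import numpy as np

def rosenbrock(q):
    """Rosenbrock function for q in R^n; global minimum at q = (1, ..., 1)."""
    return np.sum(100.0 * (q[1:] - q[:-1] ** 2) ** 2 + (1.0 - q[:-1]) ** 2)

def rosenbrock_grad(q):
    """Analytic gradient: each q_i appears in at most two summands,
    accumulated by the two lines below."""
    g = np.zeros_like(q)
    g[:-1] = -400.0 * q[:-1] * (q[1:] - q[:-1] ** 2) - 2.0 * (1.0 - q[:-1])
    g[1:] += 200.0 * (q[1:] - q[:-1] ** 2)
    return g

# Both the function and its gradient vanish at the global minimum.
val_at_min = rosenbrock(np.ones(4))
grad_at_min = rosenbrock_grad(np.ones(4))
```

The steep walls of the valley produce the large, rapidly varying gradients that make momentum methods oscillate, which is where the bounded updates of RGD help.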
5.2 Deep Neural Networks
To showcase the generality of the proposed approach, and to explore its performance on more challenging (non-convex) and high-dimensional optimization problems, we now tackle image classification on three standard benchmarks of increasing difficulty, MNIST, SVHN and CIFAR100, and deploy CNN models of increasing complexity. For MNIST, we employ a standard three-layer CNN with max pooling and a final linear classification layer. For SVHN, we deploy a VGG network with 11 layers, and for CIFAR100 we use a residual network with 18 layers. Full specifications of the model architectures and hyper-parameters are included in Appendix B.
Due to the large number of training samples, we employ an online optimization approach, essentially applying RGD to mini-batches and turning gradient descent based methods into their stochastic counterparts. In light of the adaptive normalization provided by currently popular methods in deep learning, we intend to study whether RGD provides a similar behaviour. To this end, we compare RGD with Adam [Kingma:2015] as a representative candidate from this group of optimizers. All methods were run from 10 random initializations, and we present the mean ± one standard deviation (line and shading, respectively) in Figure 4, Figure 5 and Figure 6. In all cases, RGD is able to achieve faster convergence than CM by controlling the speed of light $c$ and also decreasing the mass $m$. Note that decreasing $m$ plays a more significant role in the normalization aspect of the update, which gives it a behavior akin to Adam.
6 Related Work
We are aware of only a few papers related to this work. To the best of our knowledge, [Betancourt:2018] is the only paper that considers symplectic integrators in an optimization context. However, their starting point is not a conformal Hamiltonian system, as in our approach. Instead, [Betancourt:2018] apply the standard leapfrog symplectic integrator (for conservative systems) to a nonautonomous dynamical system written in an extended phase space. Relativistic mechanics in the context of Hamiltonian Monte Carlo is considered in [Lu:2017]. However, no relationship to (conformal) symplectic integrators is made, and their starting point is not the conformal relativistic system (31). The authors of [Lu:2017] also comment on similarities between the normalization provided by special relativity and Adam, AdaGrad and RMSProp. The paper [Maddison:2018] considered explicit and implicit Euler discretizations of a conformal Hamiltonian system, but without any connection to (conformal) symplectic integrators. The focus of [Maddison:2018] is on generalized kinetic energies that allow linear convergence for functions that are not strongly convex. Generalized kinetic energies were also previously considered in Monte Carlo methods [Livingstone:2017].
7 Discussion and Conclusions
We constructed a general conformal symplectic integrator (Theorem 2), allowing us to prove that classical momentum (CM) is a conformal symplectic integrator while Nesterov's accelerated gradient is not (Corollary 3 and Theorem 5). In doing so, we introduced a more intuitive version of CM, given by (20), whose parameters have a physical interpretation. Furthermore, we proposed relativistic gradient descent (RGD), given by (33). RGD is also a conformal symplectic integrator and recovers CM in the classical limit. Being a simulation of a relativistic system, RGD operates on a different geometric space than its classical counterpart CM, one in which there exists a maximum speed that helps control divergence of the iterates. In our experiments, we demonstrated the flexibility of RGD by controlling the mass $m$ and the speed of light $c$. A more complete analytical study of the dependence of convergence rates on these parameters is needed and is currently being investigated.
On one hand, our results establish connections between geometric, structure-preserving discretizations and known momentum-based optimization methods. Symplectic integrators constitute an important topic in the numerical analysis of differential equations and the simulation of Hamiltonian systems. On the other hand, we provide a principled generalization of CM that automatically normalizes the momentum; here, this was achieved by incorporating relativistic effects. For practical problems, it is imperative to tune the speed of light $c$ appropriately, since it establishes a scaling for the problem; otherwise RGD works in the classical regime and is equivalent to CM. Empirically, RGD performed similarly to Adam and outperformed CM and NAG. It is important to note, however, that RGD is extremely simple and is derived from a clear and theoretically justified approach, as opposed to methods such as Adam, AdaGrad and RMSProp, which lack a similar theoretical foundation.
This work was supported by grants ARO MURI W911NF-17-1-0304 and NSF 1447822.
Appendix A Parameters for Rosenbrock
Here we specify the parameters used in the numerical experiments behind Figure 3.
Parameters are specified separately for each panel of Figure 3: top left, top right, bottom left, and bottom right.
Appendix B Deep Learning Implementation Details
In this section we expand on the details of the deep learning experiments. For the MNIST example, we employed a standard feed-forward convolutional network, namely three layers with 32, 64 and 256 filters, respectively, with strides of 2 for the first two layers. A final fully connected layer was employed as a linear classifier, and the model was trained with the cross-entropy loss. Training was carried out with mini-batches of size 512, together with weight decay (regularization of the model parameters) and a tuned learning rate. The same momentum term was used for classical momentum (CM) and for relativistic gradient descent (RGD), with the correspondence of (21). For RGD, in addition, we set the mass $m$ and the speed of light $c$ to tuned values. These parameters were found to provide a difference in performance between RGD and CM, though further improvement of RGD is likely with an exhaustive parameter search. For a fair comparison with Adam, a parameter search was conducted over its learning rate, maintaining the other parameters (the $\beta$'s and $\epsilon$ [Kingma:2015]) at their default settings, and the resulting optimal learning rate was used.
For SVHN, we deploy a standard VGG network with 11 layers [simonyan2014very]. Training is done in a similar manner, employing weight decay and a batch size of 512 samples. Adam's stepsize was optimized over a grid with steps of 0.1. The same stepsize and momentum term were used for CM and RGD, though RGD employs a smaller mass $m$ and a finite speed of light $c$.
For CIFAR100, we employed a residual network (ResNet) with 18 layers from [he2016deep], a batch size of 128 samples, and weight decay. The same momentum term and stepsize were used for CM and RGD. For Adam, an optimized learning rate was used (the default learning rate performed significantly worse). For RGD, we additionally tuned the mass $m$ and the speed of light $c$.