# The Role of Memory in Stochastic Optimization

The choice of how to retain information about past gradients dramatically affects the convergence properties of state-of-the-art stochastic optimization methods, such as Heavy-ball, Nesterov's momentum, RMSprop and Adam. Building on this observation, we use stochastic differential equations (SDEs) to explicitly study the role of memory in gradient-based algorithms. We first derive a general continuous-time model that can incorporate arbitrary types of memory, for both deterministic and stochastic settings. We provide convergence guarantees for this SDE for weakly-quasi-convex and quadratically growing functions. We then demonstrate how to discretize this SDE to get a flexible discrete-time algorithm that can implement a board spectrum of memories ranging from short- to long-term. Not only does this algorithm increase the degrees of freedom in algorithmic choice for practitioners but it also comes with better stability properties than classical momentum in the convex stochastic setting. In particular, no iterate averaging is needed for convergence. Interestingly, our analysis also provides a novel interpretation of Nesterov's momentum as stable gradient amplification and highlights a possible reason for its unstable behavior in the (convex) stochastic setting. Furthermore, we discuss the use of long term memory for second-moment estimation in adaptive methods, such as Adam and RMSprop. Finally, we provide an extensive experimental study of the effect of different types of memory in both convex and nonconvex settings.

• 13 publications
• 47 publications
• 5 publications
05/30/2022

### Last-iterate convergence analysis of stochastic momentum methods for neural networks

The stochastic momentum method is a commonly used acceleration technique...
10/05/2018

### Continuous-time Models for Stochastic Optimization Algorithms

We propose a new continuous-time formulation for first-order stochastic ...
06/01/2020

### Factorial Powers for Stochastic Optimization

The convergence rates for convex and non-convex optimization methods dep...
04/19/2019

### On the Convergence of Adam and Beyond

Several recently proposed stochastic optimization methods that have been...
06/01/2016

### CaMKII activation supports reward-based neural network optimization through Hamiltonian sampling

Synaptic plasticity is implemented and controlled through over thousand ...
06/14/2020

### On the convergence of the Stochastic Heavy Ball Method

We provide a comprehensive analysis of the Stochastic Heavy Ball (SHB) m...
05/28/2021

### Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks

We present a new class of adaptive stochastic optimization algorithms, w...

## 1 Introduction

Our object of study is the classical problem of minimizing finite-sum objective functions:

 x∗=argminx∈Rdf(x):=1nn∑i=1fi(x). (P)

Accelerated gradient methods play a fundamental role in optimizing such losses, providing optimal rates of convergence for certain types of function classes such as the ones being convex [Nesterov, 2018]. The two most popular momentum methods are Heavy-ball (HB[Polyak, 1964] and Nesterov’s accelerated gradient (NAG) [Nesterov, 1983]. They are based on the fundamental idea of augmenting gradient-based algorithms with a momentum term that uses previous gradient directions in order to accelerate convergence, which yields the following type of iterative updates:

 xk+1=xk+βk(xk−xk−1)−η∇f(xk), (HB)

with an iteration dependent momentum parameter222Gradient Descent [Cauchy, 1847] can be seen as a special case of HB for . and a positive number called learning rate (a.k.a. stepsize).

Although both HB and NAG have received a lot of attention in the literature, the idea of acceleration is still not entirely well understood. For instance, a series of recent works [Su et al., 2016, Wibisono et al., 2016, Yang et al., 2018] has studied these methods from a physical perspective, which yields a connection to damped linear oscillators. Arguably, the insights provided by these works are mostly descriptive and have so far not been able to help with the design of conceptually new algorithms. Furthermore, the resulting analysis often cannot be easily translated to stochastic optimization settings, where stability of momentum methods may actually be reduced due to inexact gradient information [Jain et al., 2018, Kidambi et al., 2018].

This lack of theoretical understanding is rather unsatisfying. Why is it that acceleration able to provide faster rates of convergence for convex functions but fails when used on non-convex functions or in a stochastic setting? This question is especially relevant given that momentum techniques (such as Adam [Kingma and Ba, 2014]

) are commonly used in machine learning in order to optimize non-convex objective functions that arise when training deep neural networks.

In order to address this issue, we here exploit an alternative view on the inner workings of momentum methods which is not physically-inspired but instead builds upon the theoretical work on memory gradient diffusions developed by [Cabot et al., 2009] and [Gadat and Panloup, 2014]. In order to leverage this analogy, we first rewrite HB as follows:

 xk+1=xk−ηk−1∑j=0⎛⎝k∏h=j+1βh⎞⎠∇f(xj)−η∇f(xk) (HB-SUM)

where is assumed. That is, at each iteration the next step is computed using a weighted average of past gradients : . In particular, if is constant across all iterations, the memory — which is controlled by the weights — vanishes exponentially fast (short-term memory). Such averaging provides a cheap way to 1) adapt to the geometry of ill-conditioned problems (also noted in [Sutskever et al., 2013]) and 2) denoise stochastic gradients if the underlying true gradients are changing slowly. Similarly, adaptive methods [Duchi et al., 2011, Kingma and Ba, 2014] use memory of past square gradients to automatically adjust the learning rate during training. For the latter task, it has been shown [Reddi et al., 2018] that some form of long-term memory is convenient both for in theory (to ensure convergence) and in practice, since it has been observed that, in a mini-batch setting, large gradients are not common and might be quite informative.

In summary — most modern stochastic optimization methods can be seen as composition of memory systems. Inspired by this observation and by the undeniable importance of shining some light on the acceleration phenomenon, we make the following contributions.

1. Following previous work from [Cabot et al., 2009], we generalize the continuous-time limit of HB to an interpretable ODE that can implement various types of gradient forgetting (Sec. 3.1). Next, we extend this ODE to the stochastic setting (Sec. 3.2).

2. By comparing the resulting SDE to the model for Nesterov momentum developed in

[Su et al., 2016], we provide a novel interpretation of acceleration as gradient amplification and give some potential answers regarding the source of instability of stochastic momentum methods (Sec. 3.3).

3. We study the convergence guarantees of our continuous-time memory system and show that, in the convex setting, long-term (polynomial) memory is more stable than classical momentum (Sec. 4).

4. We discretize this memory system and derive an algorithmic framework that can incorporate various types of gradient forgetting efficiently. Crucially, we show the discretization process preserves the convergence guarantees.

5. We run several experiments to support our theory with empirical evidence in both deterministic and stochastic settings (Sec. 5).

6. We propose a modification of Adam which uses long-term memory of gradient second moments to adaptively choose the learning rates (Sec. 6).

We provide an overview of our notation in App. A.

## 2 Related Work

##### Momentum in deterministic settings.

The first accelerated proof of convergence for the deterministic setting dates back to [Polyak, 1964] who proved a local linear rate of convergence for Heavy-ball (with constant momentum) for twice continuously differentiable, -strongly convex and -smooth functions (with a constant which is faster than gradient descent). [Ghadimi et al., 2015] derived a proof of convergence of the same method for convex functions with Lipschitz-continuous gradients, for which the Cesàro average of the iterates converges in function value like (for small enough and ).

A similar method, Nesterov’s Accelerated Gradient (NAG), was introduced by [Nesterov, 1983]. It achieves the optimal rate of convergence for convex functions and, with small modifications, an accelerated (with respect to gradient desacent) linear convergence rate for smooth and strongly-convex functions.

##### Momentum in stochastic settings.

Prior work has shown that the simple momentum methods discussed above lack stability in stochastic settings, where the evaluation of the gradients is affected by noise (see motivation in [Allen-Zhu, 2017] for the Katyusha method). In particular, for quadratic costs,  [Polyak, 1987] showed that stochastic Heavy-ball does not achieve any accelerated rate but instead matches the rate of SGD. More general results are proved in [Yang et al., 2016] for these methods, both for convex and for smooth functions, requiring a decreasing learning rate, bounded noise and bounded subgradients (see Tb.1). For strongly-convex functions, [Yuan et al., 2016] also studied the mean-square error stability and showed that convergence requires small (constant) learning rates. Furthermore, the rate is shown to be equivalent to SGD and therefore the theoretical benefits of acceleration in the deterministic setting do not seem to carry over to the stochastic setting.

##### Continuous-time perspective.

The continuous time ODE model of NAG for convex functions presented in [Su et al., 2016] led to the developments of several variants of Nesterov-inspired accelerated methods in the deterministic setting (e.g. [Krichene et al., 2015] and [Wilson et al., 2016]). In this line of research, interesting insights often come from a numerical analysis and discretization viewpoint [Zhang et al., 2018, Betancourt et al., 2018]. Similarly, in stochastic settings, guided by SDE models derived from Nesterov’s ODE in [Su et al., 2016] and by the variational perspective in [Wibisono et al., 2016],  [Xu et al., 2018a] and [Xu et al., 2018b] proposed an interpretable alternative to AC-SA (an accelerated stochastic approximation algorithm introduced in  [Lan, 2012] and [Ghadimi and Lan, 2012]). This is a sophisticated momentum method that in expectation achieves a rate333

bounds the stochastic gradient variance in each direction.

for -strongly convex and -smooth functions and for convex -smooth functions. These rates are nearly optimal, since in the deterministic limit they still capture acceleration.

Unlike [Xu et al., 2018a, Xu et al., 2018b], we focus on how the memory of past gradients relates to the classical and most widely used momentum methods (HB, NAG) and, with the help of the SDE models, show that the resulting insights can be used to design building blocks for new optimization methods.

In his 1964 paper, Polyak motivated HB as the discrete time analogue of a second order ODE:

 ¨X(t)+a(t)X(t)+∇f(X(t))=0, (HB-ODE)

which can be written in phase-space as

 {˙V(t)=−a(t)V(t)−∇f(X(t))˙X(t)=V(t). (HB-ODE-PS)

This connection can be made precise: in App. B.1 we show that HB is indeed the result of semi-implicit Euler integration444 [Hairer et al., 2006] for an introduction. on HB-ODE-PS.

### 3.1 Memory and Gradient Forgetting

If the viscosity parameter is time-independent, HB-ODE, with initial condition and , can be cast into an integro-differential equation555By computing from HB-ODE-INT-C using the fundamental theorem of calculus and plugging in .:

 ˙X(t)=−∫t0e−α⋅(t−s)∇f(X(s))ds. (HB-ODE-INT-C)
##### Bias correction.

Notice that the instantaneous update direction of HB-ODE-INT-C is a weighted average of the past gradients, namely with . However, the weights do not integrate to one. Indeed, for all , we have , which goes to as . As a result, in the constant gradient setting, the previous sum is a biased estimator of the actual gradient. This fact suggests a simple modification of HB-ODE-INT-C, for :

 ˙X(t)=−α1−e−αt∫t0e−α⋅(t−s)∇f(X(s))ds. (1)

which we write as . We note that this normalization step follows exactly the same motivation as bias correction in Adam; we provide an overview of this method in App. B.2. If we define , the previous formula takes the form:

 ˙X(t)=−∫t0˙m(s)m(t)∇f(X(s))ds. (MG-ODE-INT)

This memory-gradient integro-differential equation (MG-ODE-INT) provides a generalization of HB-ODE-INT-C, with bias correction. The following crucial lemma is consequence of the fundamental theorem of calculus.

###### Lemma 3.1.

For any s.t. , MG-ODE-INT is normalized : , for all .

###### Proof.

Since , . ∎

Based on Lemma 3.1, we will always set . What other properties shall a general have? Requiring for all ensures that there does not exist a time instant where the gradient is systematically discarded. Hence, since , is either monotonically decreasing and negative or monotonically increasing and positive. In the latter case, without loss of generality, we can flip its sign. This motivates the following definition.

##### Definition.

is a memory function if it is non-negative, strictly increasing and s.t. .

For example, , from which we started our discussion, is a valid memory function. Crucially, we note that plays the important role of controlling the speed at which we forget previously observed gradients. For instance, let ; since , the system forgets past gradients quadratically fast. In contrast, leads to exponential forgetting. Some important memory functions are listed in Tb. 2, and their respective influence on past gradients is depicted in Fig. 1. We point out that, in the limit , the weights associated with exponential forgetting converge to a Dirac distribution . Hence, we recover the Gradient Descent ODE [Mertikopoulos and Staudigl, 2018]: . For the sake of comparability, we will refer to this as instantaneous forgetting.

Finally, notice that MG-ODE-INT can be written as a second order ODE. Too see this, we just need to compute the second derivative. For we have that

 ¨X(t)=˙m(t)m(t)2∫t0˙m(s)∇f(X(s))ds−˙m(t)m(t)∇f(X(t)).

Plugging in the definition of from the integro-differential equation, we get the memory-gradient ODE:

 ¨X(t)+˙m(t)m(t)˙X(t)+˙m(t)m(t)∇f(X(t))=0. (MG-ODE)

Equivalently, we can transform this second order ODE into a system of two first order ODEs by introducing the variable and noting that . This is called the phase-space representation of MG-ODE, which we use in Sec. 3.2 to provide the extension to the stochastic setting. Also, for the sake of comparison with recent literature (e.g. [Wibisono et al., 2016]), we provide a variational interpretation of MG-ODE in App. C.2.

##### Existence and uniqueness.

Readers familiar with ODE theory probably realized that, since by definition

, the question of existence and uniqueness of the solution to MG-ODE is not trivial. This is why we stressed its validity for multiple times during the derivation. Indeed, it turns out that such a solution may not exist globally on (see App. C.1). Nevertheless, if we allow to start integration from any and assume to be -smooth, standard ODE theory [Khalil and Grizzle, 2002] ensures that the sought solution exists and is unique on . Since can be made as small as we like (in our simulations in App. F we use ) this apparent issue can be regarded an artifact of the model. Also, we point out that the integral formulation MG-ODE-INT is well defined for every . Therefore, in the theoretical part of this work, we act as if integration starts at but we highlight in the appendix that choosing the initial condition induces only a negligible difference (see Remarks C.2 and D.1).

In this section we introduce stochasticity in the MG-ODE model. As already mentioned in the introduction, at each step , iterative stochastic optimization methods have access to an estimate of : the so called stochastic gradient. This information is used and possibly combined with previous gradient estimates , to compute a new approximation to the solution . There are many ways to design : the simplest [Robbins and Monro, 1951] is to take , where is a uniformly sampled datapoint. This gradient estimator is trivially unbiased (conditioned on past iterates) and we denote its covariance matrix at point by .

Following [Krichene and Bartlett, 2017] we model such stochasticity adding a volatility term in MG-ODE.

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩dX(t)=V(t)dtdV(t)=−˙m(t)m(t)V(t)dt               −˙m(t)m(t)[∇f(X(t))dt+σ(X(t))dB(t)] (MG-SDE)

where and is a standard Brownian Motion. Notice that this system of equations reduces to the phase-space representation of MG-ODE if is the null matrix. The connection from to the gradient estimator covariance matrix can be made precise: [Li et al., 2017] motivate the choice , where denotes the principal square root and is the discretization stepsize.

The proof of existence and uniqueness to the solution of this SDE666See e.g. Thm. 5.2.1 in [Øksendal, 2003], which gives sufficient conditions for (strong) existence and uniqueness. relies on the same arguments made for MG-ODE in Sec. 3.1, with one additional crucial difference: [Orvieto and Lucchi, 2018] showed that needs to additionally be three times continuously differentiable with bounded third derivative (i.e. ), in order for to be Lipschitz continuous. Hence, we will assume this regularity and refer the reader to [Orvieto and Lucchi, 2018] for further details.

### 3.3 The Connection to Nesterov’s Sde

[Su et al., 2016] showed that the continuous-time limit of NAG for convex functions is HB-ODE with time-dependent viscosity : , which we refer to as Nesterov’s ODE. Using Bessel functions, the authors were able to provide a new insightful description and analysis of this mysterious algorithm. In particular, they motivated how the vanishing viscosity is essential for acceleration777Acceleration is not achieved for a viscosity of e.g. .. Indeed, the solution to the equation above is s.t. ; in contrast to the solution to the GD-ODE , which only achieves a rate .

A closer look at Tb. 2 reveals that the choice is related to MG-ODE with quadratic forgetting, that is . However, it is necessary to note that in MG-SDE also the gradient term is premultiplied by . Here we analyse the effects of this intriguing difference and its connection to acceleration.

A naïve way to speed up the convergence of the GD-ODE is to consider . This can be seen by means of the Lyapunov function . Using convexity of , we have and therefore, the solution is s.t. . However, the Euler discretization of this ODE is the gradient-descent-like recursion — which is not accelerated. Indeed, this gradient amplification by a factor of is effectively changing the Lipschitz constant of the gradient field from to . Therefore, each step is going to yield a descent only if888See e.g. [Bottou et al., 2018]. . Yet, this iteration dependent learning rate effectively cancels out the gradient amplification, which brings us back to the standard convergence rate . It is thus natural to ask: "Is the mechanism of acceleration behind Nesterov’s ODE related to a similar gradient amplification?"

In App. C.4 we show that , the solution to Nesterov’s SDE999Nesterov’s SDE is defined, as for MG-SDE by augmenting the phase space representation with a volatility term. The resulting system is then : ., is s.t. the infinitesimal update direction of the position can be written as

 VN(t)=−∫t0s3t3∇f(X(s))ds+ζN(t), (2)

where

is a random vector with

and . In contrast, the solution of MG-SDE with quadratic forgetting satisfies

 Vm2(t)=−∫t03s2t3∇f(X(s))ds+ζm2(t), (3)

where is a random vector with but . Even though the reader might already have spotted an important difference in the noise covariances, to make our connection to gradient amplification even clearer, we consider the simpler setting of constant gradients: in this case, we have , . That is, stochastic algorithms with increasing momentum (i.e. decreasing101010See the connection between and in Thm. B.1. viscosity, like the Nesterov’s SDE) are systematically amplifying the gradients over time. Yet, at the same time they also linearly amplify the noise variance (see Fig. 7 in the appendix). This argument can easily be extended to the non-constant gradient case by noticing that is a weighted average of gradients where the weights integrate to 1 for all (Lemma 3.1) . In contrast, in these weights integrate to . This behaviour is illustrated in Fig. 2: While the Nesterov’s SDE is faster compared to MG-SDE with at the beginning, it quickly becomes unstable because of the increasing noise in velocity and hence position.

This gives multiple insights on the behavior of Nesterov’s accelerated method for convex functions, both in for deterministic and the stochastic gradients:

1. Deterministic gradients get linearly amplified overtime, which counteracts the slow-down induced by the vanishing gradient problem around the solution. Interestingly Eq. (

2) reveals that this amplification is not performed directly on the local gradient but on past history, with cubic forgetting. It is this feature that makes the discretization stable compared to the naïve approach .

2. Stochasticity corrupts the gradient amplification by an increasing noise variance (see Eq. (2)), which makes Nesterov’s SDE unstable and hence not converging. This finding is in line with [Allen-Zhu, 2017].

Furthermore, our analysis also gives an intuition as to why a constant momentum cannot yield acceleration. Indeed, we saw already that HB-ODE-INT-C does not allow such persistent amplification, but at most a constant amplification inversely proportional to the (constant) viscosity. Yet, as we are going to see in Sec. 4, this feature makes the algorithm more stable under stochastic gradients.

To conclude, we point the reader to App. C.4.2, where we extend the last discussion from the constant gradient case to the quadratic cost case and get a close form for the (exploding) covariance of Nesterov’s SDE (which backs up theoretically the unstable behavior shown in Fig. 2). Nonetheless, we remind that this analysis still relies on continuous-time models; hence, the results above can only be considered as insights and further investigation is needed to translate them to the actual NAG algorithm.

##### Time warping of linear memory.

Next, we now turn our attention to the following question: "How is the gradient amplification mechanism of NAG related to its — notoriously wiggling111111Detailed simulations in App. F.— path?". Even though Nesterov’s ODE and MG-ODE with quadratic forgetting are described by similar formulas, we see in Fig. 2 that the trajectories are very different, even when the gradients are large. The object of this paragraph is to show that Nesterov’s path has a strong link to — surprisingly — linear forgetting. Consider speeding-up the linear forgetting ODE by introducing the time change and let be the accelerated

solution to linear forgetting. By the chain rule, we have

and . It can easily be verified that we recover . However, in the stochastic setting, the behaviour is still quite different: as predicted by the theory, in Fig. 3 we see that — when gradients are large — the trajectory of the two sample paths are almost identical (yet, notice that Nesterov moves faster); however, as we approach the solution, Nesterov diverges while linear forgetting stably proceeds towards the minimizer along the Nesterov’s ODE path, but at a different speed, until convergence to a neighborhood of the solution, as proved in Sec. 4. Furthermore, in App. C.3 we prove that there are no other time changes which can cast MG-ODE into HB-ODE, which yields the following interesting conclusion: the only way to translate a memory system into a momentum method is by using a time change .

## 4 Analysis and Discretization

In this section we first analyze the convergence properties of MG-SDE under different memory functions. Next, we use the Lyapunov analysis carried out in continuous-time to derive an iterative discrete-time method which implements polynomial forgetting and has provable convergence guarantees. We state a few assumptions:

(H0c) .

The definition of nicely decouples the measure of noise magnitude to the problem dimension (which will then, of course, appear explicitly in all our rates).

(H1)   The cost is -smooth and convex.

(H2)   The cost is -smooth and -strongly convex.

We provide the proofs (under the less restrictive assumptions of weak-quasi-convexity and quadratic growth121212-weak-quasiconvexity is implied by convexity and has been shown to be of chief importance in the context of learning dynamical systems [Hardt et al., 2018]. Strong convexity implies quadratic growth with a unique minimizer [Karimi et al., 2016] as well as -weak-quasiconvexity. More details in the appendix.), as well as an introduction to stochastic calculus, in App. D.

### 4.1 Exponential Forgetting

If , then which converges to exponentially fast. To simplify the analysis and for comparison with the literature on HB-SDE (which is usually analyzed under constant volatility [Shi et al., 2018]) we consider here MG-SDE with the approximation . In App. D we show that, under (H1), the rate of convergence of , evaluated at the Cesàro average is sublinear (see Tb. 3) to a ball131313Note that the term "ball" might be misleading: indeed, the set is not compact in general if is convex but not strongly convex. around of size , which is in line with known results for SGD [Bottou et al., 2018]141414Note that, by definition of (see discussion after MG-SDE),

is proportional both to the learning rate and to the largest eigenvalue of the stochastic gradient covariance

. Note that the size of this ball would change if we were to study a stochastic version of HB-ODE with constant volatility (i.e. ). In particular, it would depend on the normalization constant in Eq. (1). Also, in App. D, under (H2), we provide a linear convergence rate of the form to a ball ( depends on and ). Our result generalizes the analysis in [Shi et al., 2018] to work with any viscosity and with stochastic gradients.

##### Discretization.

As shown in Sec. 3.1, the discrete equivalent of MG-SDE with exponential forgetting is Adam without adaptive stepsizes (see App. B.2). As for the continuous-time model we just studied, for a sufficiently large iteration, exponential forgetting can be approximated with the following recursive formula:

 xk+1=xk+β(xk−xk−1)−η(1−β)∇f(xk),

which is exactly HB with learning rate . Hence, the corresponding rates can be derived from Tb. 1.

### 4.2 Polynomial Forgetting

The insights revealed in Sec. 3.3 highlight the importance of the choice in this paper. In contrast to instantaneous [Mertikopoulos and Staudigl, 2018] and exponential forgetting, the rate we prove in App. D for this case under (H1) does not involve a Cesàro average — but holds for the last time point (see Tb. 3). This stability property is directly linked to our discussion in Sec. 3.3 and shows that different types of memory may react to noise very differently. Also, we note that the size of the ball we found is now also proportional to ; this is not surprising since, as the memory becomes more focused on recent past, we get back to the discussion in the previous subsection and we need to consider a Cesàro average.

##### Discretization.

Finally, we consider the burning question "Is it possible to discretize MG-SDE — with polynomial forgetting — to derive a cheap iterative algorithm with similar properties?". In App. E, we build this algorithm in a non-standard way: we reverse-engineer the proof of the rate for MG-SDE to get a method which is able to mimic each step of the proof. Starting from , it is described by the following recursion

 xk+1=xk+kk+p(xk−xk−1)−pk+pη∇f(xk). (MemSGD-p)

As a direct result of our derivation, we show in Thm. E (App. E) that this algorithm preserves exactly the rate of its continuous-time model in the stochastic setting151515Required assumptions: (H1), , and bounds the gradient variance in each direction of .:

 E[f(xk)−f(x∗)]≤(p−1)2∥x0−x∗∥22ηp(k+p−1)+12pdης2∗.

We also show that MemSGD-p can be written as , where (in analogy to the bias correction in Adam) and with increasing as a polynomial of order for all , again in complete analogy with the model. Fig. 4 shows the behaviour of different types of memory in a simple convex setting; as predicted, polynomial (in this case linear) forgetting has a much smoother trajectory then both exponential and instantaneous forgetting. Also, the reader can verify that the asymptotic noise level for MemSGD (p=2) is slightly higher, as just discussed.

For ease of comparison with polynomial memory, we will often write MemSGD (p=e) to denote exponential forgetting (i.e. Adam without adaptive steps) and SGD (p=inf) to stress that SGD implements instantaneous forgetting.

## 5 Large Scale Experiments

In order to assess the effect of different types of memory in practical settings, we benchmark MemSGD with different memory functions: from instantaneous to exponential, including various types of polynomial forgetting. As a reference point, we also run vanilla HB

with constant momentum as stated in the introduction. To get a broad overview of the performance of each method, we run experiments on a convex logistic regression loss as well as on non-convex neural networks in both a mini- and full-batch setting. Details regarding algorithms, datasets and architectures can be found in App.

G.1.

##### Results and discussion.

Fig. 5 summarizes our results in terms of training loss. While it becomes evident that no method is best on all problems, we can nevertheless draw some interesting conclusions.

First, we observe that while long-term memory (especially ) is faster than SGD in the convex case, it does not provide any empirical gain in the neural network settings. This is not particularly surprising since past gradients may quickly become outdated in non-convex landscapes. Short term memory is at least as good as SGD in all cases except for the CIFAR-10 CNN, which represents the most complex of our loss landscapes in terms of curvature.

Secondly, we find that the best stepsize for HB is always strictly smaller than the one for SGD in the non-convex setting. MemSGD, on the other hand, can run on stepsizes as large as SGD which reflects the gradient amplification of HB as well as the unbiasedness of MemSGD. Interestingly, however, a closer look at Fig. 15 (appendix) reveals that HB (with best stepsize) actually takes much smaller steps than SGD for almost all iterations. While this makes sense from the perspective that memory averages past gradients, it is somewhat counter-intuitive given the inertia interpretation of HB which should make the method travel further than SGD. Indeed, both [Sutskever et al., 2013] and [Goodfellow et al., 2016] attribute the effectiveness of HB to its increased velocity along consistent directions (especially early on in the optimization process). However, our observation, together with the fact that MemSGD with fast forgetting ( and ) is as good as HB, suggests that there is actually more to the success of taking past gradients into account and that this must lie in the altered directions that adapt better to the underlying geometry of the problem.161616Note that we find the exact opposite in the convex case, where HB does take bigger steps and converges faster.

Finally, we draw two conclusions that arise when comparing the mini- and full batch setting. First, the superiority of HB and fast forgetting MemSGD over vanilla SGD in the deterministic setting is indeed reduced as soon as stochastic gradients come into play (this is in line with the discussion in Sec. 3.3). Second, we find that stochasticity per se is not needed to optimize the neural networks in the sense that all methods eventually reach very similar methods of suboptimality. That is, not even the full batch methods get stuck in any elevated local minima including the saddle found in the MNIST autoencoder which they nicely escape (given the right stepsize).

## 6 Memory in Adaptive Methods

While the main focus of this paper is the study of the effect of different types of memory on the first moment of the gradients, past gradient information is also commonly used to adapt stepsizes. This is the case for Adagrad and Adam which both make use of the second moment of past gradients to precondition their respective update steps.

Of course, the use of polynomial memory generalizes directly to the second moment estimates and we thus consider a comprehensive study of the effect of long- versus short-term memory in adaptive preconditioning an exciting direction of future research. In fact, as shown in [Reddi et al., 2018] the non-convergence issue of Adam can be fixed by making the method forget past gradients less quickly. For that purpose the authors propose an algorithm called AdamNC that essentially differs from Adam by the choice of

, which closely resembles Adagrad with constant memory. Interestingly, the memory framework introduced in this paper allows to interpolate between the two extremes of constant- and exponential memory (i.e. Adagrad and Adam) in a principled way. Indeed, by tuning the additional parameter

— which specifies the degree of the polynomial memory function — one can equip Adam with any degree of short- to long-term memory desired. As a proof of concept, Fig. 6 shows that Adam equipped with a polynomial memory of the squared gradients (PolyAdam) can in fact be faster than both Adam and Adagrad.

## 7 Conclusion

We undertook an extensive theoretical study of the role of memory in (stochastic) optimization. We provided convergence guarantees for memory systems as well as for novel algorithms based on such systems. This study led us to derive novel insights on momentum methods. We complemented these findings with empirical results, both on simple functions as well as more complex functions based on neural networks. There, long- and short-term memory methods exhibit a different behaviour, which suggests further investigation is needed to better understand the interplay between the geometry of neural networks losses, memory and gradient stochasticity. On a more theoretical side, an interesting direction of future work is the study of the role of memory in state-of-the art momentum methods such as algorithms that include primal averaging, increasing gradient sensitivity or decreasing learning rates (see e.g. [Krichene and Bartlett, 2017]).

## References

• [Allen-Zhu, 2017] Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

, pages 1200–1205. ACM.
• [Arnol’d, 2013] Arnol’d, V. I. (2013). Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media.
• [Betancourt et al., 2018] Betancourt, M., Jordan, M. I., and Wilson, A. C. (2018). On symplectic optimization. arXiv preprint arXiv:1802.03653.
• [Bottou et al., 2018] Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
• [Cabot et al., 2009] Cabot, A., Engler, H., and Gadat, S. (2009). On the long time behavior of second order differential equations with asymptotically small dissipation. Transactions of the American Mathematical Society, 361:5983–6017.
• [Cauchy, 1847] Cauchy, A. (1847). Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538.
• [Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• [Gadat and Panloup, 2014] Gadat, S. and Panloup, F. (2014). Long time behaviour and stationary regime of memory gradient diffusions. In Annales de l’IHP Probabilités et statistiques, volume 50, pages 564–601.
• [Ghadimi et al., 2015] Ghadimi, E., Feyzmahdavian, H. R., and Johansson, M. (2015). Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310–315. IEEE.
• [Ghadimi and Lan, 2012] Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492.
• [Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). MIT press.
• [Hairer et al., 2006] Hairer, E., Lubich, C., and Wanner, G. (2006).

Geometric numerical integration: structure-preserving algorithms for ordinary differential equations

, volume 31.
• [Hardt et al., 2018] Hardt, M., Ma, T., and Recht, B. (2018). Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068.
• [Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. science, 313(5786):504–507.
• [Jain et al., 2018] Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. (2018).

Accelerating stochastic gradient descent for least squares regression.

In Conference On Learning Theory, pages 545–604.
• [Karimi et al., 2016] Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer.
• [Khalil and Grizzle, 2002] Khalil, H. K. and Grizzle, J. (2002). Nonlinear systems, volume 3. Prentice hall Upper Saddle River, NJ.
• [Kidambi et al., 2018] Kidambi, R., Netrapalli, P., Jain, P., and Kakade, S. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE.
• [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• [Krichene and Bartlett, 2017] Krichene, W. and Bartlett, P. L. (2017). Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796–6806.
• [Krichene et al., 2015] Krichene, W., Bayen, A., and Bartlett, P. L. (2015). Accelerated mirror descent in continuous and discrete time. In Advances in neural information processing systems, pages 2845–2853.
• [Lan, 2012] Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397.
• [Li et al., 2017] Li, Q., Tai, C., et al. (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2101–2110. JMLR. org.
• [Ma and Yarats, 2018] Ma, J. and Yarats, D. (2018). Quasi-hyperbolic momentum and adam for deep learning. arXiv preprint arXiv:1810.06801.
• [Mao, 2007] Mao, X. (2007). Stochastic differential equations and applications. Elsevier.
• [Mertikopoulos and Staudigl, 2018] Mertikopoulos, P. and Staudigl, M. (2018). On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197.
• [Mil’shtejn, 1975] Mil’shtejn, G. (1975). Approximate integration of stochastic differential equations. Theory of Probability & Its Applications, 19(3):557–562.
• [Nesterov, 2018] Nesterov, Y. (2018). Lectures on convex optimization. Springer.
• [Nesterov, 1983] Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.
• [Øksendal, 2003] Øksendal, B. (2003). Stochastic differential equations. In Stochastic differential equations, pages 65–84. Springer.
• [Orvieto and Lucchi, 2018] Orvieto, A. and Lucchi, A. (2018). Continuous-time models for stochastic optimization algorithms. arXiv preprint arXiv:1810.02565.
• [Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017).

Automatic differentiation in pytorch.

In NIPS-W.
• [Polyak, 1964] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.
• [Polyak, 1987] Polyak, B. T. (1987). Introduction to optimization. optimization software. Inc., Publications Division, New York, 1.
• [Reddi et al., 2018] Reddi, S., Kale, S., and Kumar, S. (2018). On the convergence of adam and beyond. In International Conference on Learning Representations.
• [Robbins and Monro, 1951] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
• [Shi et al., 2018] Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. (2018). Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907.
• [Shi et al., 2019] Shi, B., Du, S. S., Su, W. J., and Jordan, M. I. (2019). Acceleration via symplectic discretization of high-resolution differential equations. arXiv preprint arXiv:1902.03694.
• [Su et al., 2016] Su, W., Boyd, S., and Candes, E. J. (2016). A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17:1–43.
• [Sutskever et al., 2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA. PMLR.
• [Wibisono et al., 2016] Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
• [Wilson et al., 2016] Wilson, A. C., Recht, B., and Jordan, M. I. (2016). A lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635.
• [Xu et al., 2018a] Xu, P., Wang, T., and Gu, Q. (2018a). Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087–1096.
• [Xu et al., 2018b] Xu, P., Wang, T., and Gu, Q. (2018b). Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning, pages 5488–5497.
• [Yang et al., 2018] Yang, L., Arora, R., Zhao, T., et al. (2018). The physical systems behind optimization algorithms. In Advances in Neural Information Processing Systems, pages 4377–4386.
• [Yang et al., 2016] Yang, T., Lin, Q., and Li, Z. (2016). Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257.
• [Yuan et al., 2016] Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. The Journal of Machine Learning Research, 17(1):6602–6667.
• [Zhang et al., 2018] Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct runge-kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, pages 3904–3913.

## Appendix A Basic Definitions and Notation

In this paper we work in with the metric induced by the Euclidean norm, which we denote by . We say that is if it is times continuously differentiable and we say that it is -Lipschitz if for all . We say that is -strongly convex if for all . We call a function convex if it is -strongly convex. An equivalent definition involves the Hessian: a twice differentiable function is -strongly convex and -smooth if and only if, for all , , where

is the identity matrix in

and denotes the operator norm of a matrix in Euclidean space: . For a symmetric matrix, the operator norm is the norm of the maximum positive eigenvalue.

## Appendix B Background on Momentum and Adam

All the algorithms/ODEs mentioned in this appendix are reported in Sec. 3.

### b.1 Discretization of Hb-Ode

The next theorem provides a strong link between HB and HB-ODE, and in partially included in [Shi et al., 2019]. The Heavy-Ball is the result of semi-implicit Euler integration on (HB-ODE-PS).

###### Proof.

Semi-implicit integration with stepsize , at iteration and from the current integral approximation , computes as follows:

 {vk+1=vk+h(−akvk−∇f(xk))xk+1=xk+hvk+1, (4)

where . Notice that and

 xk+1=xk−(1−akh)(xk−xk−1)−h2∇f(xk), (5)

which is exactly an Heavy Ball iteration, with and . ∎

###### Remark B.1.

The semi-implicit Euler method —when applied to an Hamiltonian system— is symplectic, meaning that is preserves some geometric properties of the true solution [Hairer et al., 2006]. However, as also pointed out in [Zhang et al., 2018], continuous time models of momentum methods are energy-dissipative —hence not Hamiltonian. Therefore, as opposed to [Betancourt et al., 2018, Shi et al., 2019], we avoid using this misleading moniker.

Moreover, the last result also has an integral formulation. The differential equation HB-ODE-INT-C is the continuous time limit of HB-SUM.

###### Proof.

Recalling that (where is the stepsize of the semi-implicit Euler integration defined in Thm. B.1) and choosing () and (),

Moreover, since , taking one inside the summation, we get a Riemann sum which then rightfully converges to the integral in the limit. ∎

### b.2 Unbiasing Hb-Sum under constant momentum: the birth of Adam

[Kingma and Ba, 2014] noticed that, in the limit case where true gradients are constant, the averaging procedure in HB-SUM (defined in the Introduction section of the main paper) is biased: let be the data-point selected at iteration and the corresponding stochastic gradient; if we define and pick constant momentum , we have

 E[k∑j=0βk−j∇fij(xj)]=k∑j=0βk−jE[∇f+ϵj]=βk+1−1β−1∇f.

Therefore —to ensure an unbiased update, at least for this simple case— [Kingma and Ba, 2014] normalize the sum above by , showing significant benefits in the experimental section. Indeed, such normalization retains all the celebrated geometric properties of momentum (see Introduction), while improving statistical accuracy —a crucial feature of SGD. For convenience of the reader, we report below the full Adam algorithm.

Initialize and choose initial estimate . Let "" denote the element-wise product.

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩mk+1=β1mk+(1−β1)∇f(xk)vk+1=β2vk+(1−β2)∇f∘2k(xk)^mk+1=mk+1/(1−β1)k+1^vk+1=vk+1/(1−β2)k+1xk+1=xk−η^mk√^vk+ϵ (ADAM)

In addition, such normalization also performs variance reduction. Indeed, let be the gradient covariance, which we assume constant and finite for simplicity. Then,

 Cov[β−1βk+1−1k∑j=0βk−j∇fij(xj)] =(β−1)2(βk+1−1)2k∑j=0β2(k−j)Σ =(β−1)2(βk+1−1)2β2(k+1)−1β2−1Σ =β−1(βk+1−1)βk+1+1β+1Σ⪯Σ.

Note that if increases, the covariance monotonically decreases until reaching the minimum at infinity. We also note that, in case (Gradient Descent) we have no variance reduction, as expected.

Supported by the empirical success of Adam[Kingma and Ba, 2014], we suggest in the main paper to modify the standard Heavy Ball (HB-SUM) with constant momentum to match the Adam update:

 xk+1=xk−ηβ−1βk+1−1k∑j=0βk−j∇f(xj). (6)

Which can be written recursively using 3 variables (see again [Kingma and Ba, 2014]). Notice that, after a relatively small number of iterations, we converge to the simpler update rule

 xk+1=xk+β(xk−xk−1)−η(1−β)∇f(xk),

which is also the starting point in some recent elaboration on Heavy Ball [Ma and Yarats, 2018].

## Appendix C Time-warping, acceleration and gradient amplification

### c.1 Counterexample for existence of a solution of Mg-Ode starting integration at 0

Consider the one dimensional dynamics with gradients always equal to one and . The MG-ODE (see main paper) reads . It is easy to realize using a Cauchy-Euler argument that all solutions are of the form . Unfortunately, the constraint fixes both the degrees of freedom and fixes , so that we necessarily have .

### c.2 Variational point of view

Let be a curve; the action (see [Arnol’d, 2013] for the precise definition) associated with a Lagrangian is . The fundamental result in variational analysis states that the curve is a stationary point for the action only if it solves the Euler-Lagrange equation . We define the Memory Lagrangian as

 Lm(X,˙X,t):=12m(t)∥˙X∥2−˙m(t)f(X).

It is straightforward to verify that the associated Euler-Lagrange equations give rise to MG-ODE. We cannot help but noticing the striking simplicity of this Lagrangian when compared to others arising from momentum methods (see e.g. the Bregman Lagrangian in [Wibisono et al., 2016]) . For the sake of delivering other points in this paper, we leave the analysis of the symmetries of to future work.

### c.3 Time-warping of memory: a general correspondence to Hb-Ode

Consider the ODE and the time change .

Let , following [Wibisono et al., 2016] we have, by the chain rule,

 ˙Y(t)=˙X(τ(t))˙τ(t);        ¨Y(t)=˙X(τ(t))¨τ(t)+(˙τ(t))2¨X(τ(t)).

Therefore, we have

 ˙X(τ(t))=˙Y(t)˙τ(t);        ¨X(τ(t))=¨Y(t)(˙τ(t))2−¨τ(t)(˙τ(t))3˙Y(t).

Next, we construct a new ODE for using the previous formulas:

 1(˙τ(t))2¨Y(t)+(˙m(τ(t))˙τ(t)m(τ(t))−¨τ(t)(˙τ(t))3)˙Y(t)+˙m(τ(t))m(τ(t))∇f(X(τ(t)));

Next, we multiply everything by , in order to eliminate the coefficient in front of the gradient:

 m(τ(t))˙m(τ(t))(˙τ(t))2¨Y(t)+(1˙τ(t)−m(τ(t))¨τ(t)˙m(τ(t))(˙τ(t))3)˙Y(t)+∇f(Y(t))

If we also want the coefficient in front of to be one, we need to satisfy the following differential equation: