 # Toward Deeper Understanding of Nonconvex Stochastic Optimization with Momentum using Diffusion Approximations

The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning. Popular examples include training deep neural networks and dimensionality reduction. Due to the lack of convexity and the extra momentum term, the optimization theory of MSGD is still largely unknown. In this paper, we study this fundamental optimization algorithm based on the so-called "strict saddle problem." By diffusion approximation type analysis, our study shows that the momentum helps escape from saddle points, but hurts the convergence within the neighborhood of optima (unless the step size is annealed). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks. Moreover, our analysis applies the martingale method and the "Fixed-State-Chain" method from the stochastic approximation literature, which are of independent interest.


## 1 Introduction

Nonconvex stochastic optimization naturally arises in many machine learning problems. Taking training deep neural networks as an example, given $n$ samples denoted by $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th input feature and $y_i$ is the response, we solve the following optimization problem,

$$\min_{\theta} \mathcal{F}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i, \theta)), \tag{1.1}$$

where $\ell$ is a loss function, $f$ denotes the decision function based on the neural network, and $\theta$ denotes the parameter associated with $f$.

Momentum Stochastic Gradient Descent (MSGD, Robbins and Monro (1951); Polyak (1964)) is one of the most popular algorithms for solving (1.1). Specifically, at the $t$-th iteration, we uniformly sample $i$ from $\{1, \ldots, n\}$. Then, we take

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla \ell(y_i, f(x_i, \theta^{(t)})) + \mu (\theta^{(t)} - \theta^{(t-1)}), \tag{1.2}$$

where $\eta$ is the step size parameter and $\mu$ is the parameter for controlling the momentum. Note that when $\mu = 0$, (1.2) reduces to Vanilla Stochastic Gradient Descent (VSGD).
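As an illustration of update (1.2), the following is a minimal sketch on a toy noiseless least-squares problem. The function name `msgd`, the toy data, and the choice of zero initial momentum ($\theta^{(0)} = \theta^{(-1)}$) are assumptions made for this example only and do not come from the paper.

```python
import numpy as np

def msgd(grad_fn, theta0, n_samples, eta=0.01, mu=0.9, n_iters=2000, seed=0):
    """Run update (1.2): theta_next = theta - eta * grad_i(theta) + mu * (theta - theta_prev)."""
    rng = np.random.default_rng(seed)
    theta_prev = np.asarray(theta0, dtype=float).copy()
    theta = theta_prev.copy()  # assumption: zero initial momentum
    for _ in range(n_iters):
        i = rng.integers(n_samples)  # uniformly sample one data index
        theta_next = theta - eta * grad_fn(theta, i) + mu * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
    return theta

# Hypothetical toy problem: loss_i(theta) = 0.5 * (x_i . theta - y_i)^2, with no label noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star

def lsq_grad(theta, i):
    # Gradient of the i-th squared loss with respect to theta.
    return (X[i] @ theta - y[i]) * X[i]

theta_hat = msgd(lsq_grad, np.zeros(3), n_samples=len(X))
print(theta_hat)
```

Setting `mu=0.0` in this sketch recovers plain VSGD with the same sampling scheme.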

Although SGD-type algorithms have demonstrated significant empirical successes for training deep neural networks, due to the lack of convexity, their convergence properties for nonconvex optimization are still largely unknown. For VSGD, existing literature shows that it is guaranteed to converge to a first-order optimal solution (i.e., $\nabla \mathcal{F}(\theta) = 0$) under general smooth nonconvex optimization.

The theoretical investigation of MSGD is even more limited than that of VSGD. The momentum in (1.2) has been observed to significantly accelerate computation in practice. To the best of our knowledge, the only existing result is Ghadimi and Lan (2016), which shows that MSGD is guaranteed to converge to a first-order optimal solution for smooth nonconvex problems. Their analysis, however, does not justify the advantage of the momentum in MSGD over VSGD.

The major technical bottleneck in analyzing MSGD and VSGD comes from the nonconvex optimization landscape of these highly complicated problems, e.g., training large recommendation systems and deep neural networks. The current technical limit makes establishing a general theory infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem, streaming PCA. This allows us to make progress toward understanding MSGD and gaining new insights on more general problems. Specifically, given a streaming data set $\{X_k\}_{k=1}^{\infty}$ drawn independently from some unknown zero-mean distribution $\mathcal{D}$, we consider the following problem

$$\max_{v} \; v^\top \mathbb{E}_{X \sim \mathcal{D}}[XX^\top] v \quad \text{subject to} \quad v^\top v = 1. \tag{1.3}$$

Note that (1.3), though nonconvex, is well known as a strict saddle optimization problem over the sphere, whose optimization landscape enjoys two geometric properties: (1) there are no spurious local optima, and (2) there always exist negative curvatures around saddle points. The landscape contains the following three regions:

• The region containing the neighborhood of strict saddle points with negative curvatures;

• The region including the set of points whose gradient has sufficiently large magnitude;

• The region containing the neighborhood of all global optima with a positive curvature along a certain direction.

These nice geometric properties are also shared by several other popular nonconvex optimization problems arising in machine learning and signal processing, including matrix regression/completion/sensing, independent component analysis, partial least squares multiview learning, and phase retrieval (Ge et al., 2016; Li et al., 2016b; Sun et al., 2016). Moreover, since the optimization landscape of general nonconvex problems is still poorly understood, many researchers suggest that analyzing streaming PCA and other strict saddle optimization problems should be considered a first and important step towards understanding the algorithmic behaviors in more general nonconvex optimization.

By using streaming PCA as an illustrative example, we are interested in answering a natural and fundamental question:

What is the role of the momentum in nonconvex stochastic optimization?

Our analysis is based on the diffusion approximation of stochastic optimization, which is a powerful tool in applied probability. Specifically, we prove that, asymptotically, the solution trajectory of MSGD converges weakly to the solution of an appropriately constructed ODE/SDE, and this solution provides an intuitive characterization of the algorithmic behavior. We remark that the major technical challenge is to prove the weak convergence of the trajectory sequence. This is because the Infinitesimal Perturbed Analysis for VSGD used in existing literature (Chen et al., 2017; Li et al., 2016a) is not applicable here due to the momentum term of MSGD. Instead, we apply the martingale method and the “Fixed-State-Chain” method from the stochastic approximation literature (Kushner and Yin, 2003). To the best of our knowledge, we are the first to apply these powerful methods to analyze MSGD. Our result shows that the momentum plays different but important roles in different regions.

• The momentum helps escape from the neighborhood of saddle points: In this region, since the gradient diminishes, the variance of the stochastic gradient dominates the algorithmic behavior. Our analysis indicates that the momentum greatly increases the variance and perturbs the algorithm more violently. Thus, it becomes harder for the algorithm to stay around saddle points. In addition, the momentum also encourages more aggressive exploitation, and in each iteration, the algorithm makes more progress along the descent direction by a factor of $\frac{1}{1-\mu}$, where $\mu$ is the momentum parameter.

• The momentum helps evolve toward global optima in the non-stationary region: In this region, the variance of the stochastic gradient can be neglected due to the larger magnitude of the gradient. At the same time, with the help of the momentum, the algorithm makes more progress along the descent direction. Thus, the momentum can accelerate the algorithm in this region by a factor of $\frac{1}{1-\mu}$.

• The momentum hurts the convergence within the neighborhood of global optima: Similar to the saddle point region, the gradient dies out, and the variance of the stochastic gradient dominates. Since the momentum increases the variance, it is harder for the algorithm to enter the small neighborhood. In this respect, the momentum hurts in this region.
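The factor $\frac{1}{1-\mu}$ that recurs in the list above can be motivated by a short heuristic calculation. This is only a sketch and not part of the formal analysis: it assumes zero initial momentum, $\theta^{(0)} = \theta^{(-1)}$, writes $g_s$ (a symbol introduced here) for the stochastic gradient used at iteration $s$ in (1.2), and then assumes the gradient is roughly constant, $g_s \approx g$.

```latex
% Unrolling the momentum recursion (1.2) under zero initial momentum:
\theta^{(t+1)} - \theta^{(t)} = -\eta \sum_{s=0}^{t} \mu^{t-s} g_s .
% If g_s \approx g over the relevant window, the geometric series gives
\theta^{(t+1)} - \theta^{(t)} \approx -\eta \, \frac{1-\mu^{t+1}}{1-\mu} \, g
\;\longrightarrow\; -\frac{\eta}{1-\mu} \, g \qquad (t \to \infty).
```

Under these assumptions, each step moves roughly $\frac{1}{1-\mu}$ times as far along the descent direction as a VSGD step with the same step size. The same geometric accumulation applies to the noise in the stochastic gradients.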

This characterization helps explain some phenomena observed when training deep neural networks. There have been some empirical observations and theoretical results (Choromanska et al., 2015) showing that saddle points are the major computational bottleneck, and VSGD usually spends most of the time traveling along saddle and non-stationary regions. Since the momentum helps in both regions, MSGD performs better than VSGD in practice. In addition, from our analysis, the momentum hurts convergence within the neighborhood of the optima. However, we can address this problem by decreasing the step size or the momentum.

We further verify our theoretical findings through numerical experiments on training a ResNet-18 deep neural network using the CIFAR-100 dataset. The experimental results show that the algorithmic behavior of MSGD is consistent with our analysis. Moreover, we observe that with a proper initial step size and a proper step size annealing process, MSGD eventually achieves better generalization accuracy than that of VSGD in training neural networks.

Several recent results are closely related to our work. Li et al. (2017) adopt a numerical SDE approach to derive the so-called Stochastic Modified Equations for VSGD. However, their analysis requires the drift term in the SDE to be bounded, which is not satisfied by MSGD. Other results consider SDE approximations of several accelerated SGD algorithms for convex smooth problems only (Wang, 2017; Krichene and Bartlett, 2017). In contrast, our analysis is for the nonconvex streaming PCA problem and technically more challenging.

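Notations: For $1 \leq i \leq d$, let $e_i$ (the $i$-th dimension equals $1$, others $0$) be the standard basis in $\mathbb{R}^d$. Given a vector $v = (v^{(1)}, \ldots, v^{(d)})^\top \in \mathbb{R}^d$, we define the vector norm: $\|v\|^2 = \sum_{j} (v^{(j)})^2$. The notation w.p.1 is short for with probability one, $B_t$ is the standard Brownian Motion in $\mathbb{R}^d$, and $\mathbb{S}$ denotes the sphere of the unit ball in $\mathbb{R}^d$, i.e., $\mathbb{S} = \{v : \|v\| = 1\}$. $\dot{F}$ denotes the derivative of the function $F(t)$.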

## 2 Momentum SGD for Streaming PCA

Recall that we study MSGD for the streaming PCA problem formulated as (1.3),

$$\max_{v} \; v^\top \mathbb{E}_{X \sim \mathcal{D}}[XX^\top] v \quad \text{subject to} \quad v^\top v = 1.$$

The optimization landscape of (1.3) has been well studied. For notational simplicity, denote $\Sigma = \mathbb{E}_{X \sim \mathcal{D}}[XX^\top]$. Before we proceed, we impose the following assumption on $\Sigma$: The covariance matrix $\Sigma$ is positive definite with eigenvalues $\lambda_1 > \lambda_2 \geq \cdots \geq \lambda_d > 0$ and associated normalized eigenvectors. Under this assumption, Chen et al. (2017) have shown that the eigenvectors (up to sign) are all the stationary points for problem (1.3) on the unit sphere. Moreover, the eigen-gap assumption ($\lambda_1 > \lambda_2$) guarantees that the global optimum is identifiable up to sign change. Meanwhile, the eigenvectors associated with $\lambda_2, \ldots, \lambda_{d-1}$ are strict saddle points, and the eigenvector associated with $\lambda_d$ is the global minimum.

Given the optimization landscape of (1.3), the behavior of VSGD algorithms, including Oja’s and stochastic generalized Hebbian algorithms (SGHA) for streaming PCA, is already well understood (Chen et al., 2017). For MSGD, however, the additional momentum term makes the theoretical analysis much more challenging. Specifically, we consider a variant of SGHA with Polyak’s momentum. Recall that we are given a streaming data set $\{X_k\}_{k=1}^{\infty}$ drawn independently from some zero-mean distribution $\mathcal{D}$. At the $k$-th iteration, the algorithm takes

$$v_{k+1} = v_k + \eta (I - v_k v_k^\top) \Sigma_k v_k + \mu (v_k - v_{k-1}), \tag{2.1}$$

where $\Sigma_k = X_k X_k^\top$ and $\mu (v_k - v_{k-1})$ is the momentum with a parameter $\mu$. When $\mu = 0$, (2.1) reduces to SGHA. A detailed derivation of standard SGHA can be found in Chen et al. (2017). We remark that though we focus on Polyak’s momentum, extending our theoretical analysis to Nesterov’s momentum is straightforward (Nesterov, 1983).
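The iteration (2.1) can be sketched in a few lines of NumPy. The data distribution below (independent, bounded, zero-mean coordinates with covariance $\mathrm{diag}(3, 2, 1)$), the parameter values, and all function names are hypothetical choices for illustration and are not the settings used in the paper.

```python
import numpy as np

def msgd_streaming_pca(sample_fn, v0, eta, mu, n_iters):
    """Iterate (2.1): v_next = v + eta * (I - v v^T) Sigma_k v + mu * (v - v_prev)."""
    v_prev = v0.copy()
    v = v0.copy()  # assumption: zero initial momentum
    for _ in range(n_iters):
        x = sample_fn()                     # one streaming sample X_k
        sigma_v = x * (x @ v)               # Sigma_k v with Sigma_k = X_k X_k^T
        grad = sigma_v - (v @ sigma_v) * v  # (I - v v^T) Sigma_k v
        v_next = v + eta * grad + mu * (v - v_prev)
        v_prev, v = v, v_next
    return v

# Hypothetical bounded, zero-mean data with covariance diag(3, 2, 1).
rng = np.random.default_rng(0)
lam = np.array([3.0, 2.0, 1.0])

def sample():
    # Uniform on [-sqrt(3), sqrt(3)] has unit variance; scale each coordinate by sqrt(lambda_j).
    return np.sqrt(lam) * rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=3)

v0 = np.ones(3) / np.sqrt(3.0)
v_final = msgd_streaming_pca(sample, v0, eta=5e-4, mu=0.8, n_iters=40000)
print(v_final)  # expected to align with the leading eigenvector, up to sign and noise
```

Note that the update never renormalizes $v_k$; the term $-(v_k^\top \Sigma_k v_k) v_k$ keeps the iterate close to the unit sphere when the step size is small.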

## 3 Analyzing Global Dynamics by ODE

We first analyze the global dynamics of Momentum SGD (MSGD) based on a diffusion approximation framework. Roughly speaking, by taking $\eta \to 0$, the continuous-time interpolation of the iterations $\{v_k\}_{k=0}^{\infty}$, which can be treated as a stochastic process with Càdlàg paths (right continuous with left-hand limits), becomes a continuous stochastic process. For MSGD, this continuous process follows an ODE with an analytical solution. Such a solution helps us understand how the momentum affects the global dynamics. We remark that $\mu$ is a fixed constant in our analysis.

More precisely, define the continuous-time interpolation $V^\eta(\cdot)$ of the solution trajectory of the algorithm as follows: For $k = 0, 1, \ldots$, set $V^\eta(t) = v^\eta_k$ on the time interval $[k\eta, k\eta + \eta)$. Throughout our analysis, similar notations apply to other interpolations (e.g. $H^\eta(t)$, $U^\eta(t)$). We then answer the following question: Does the solution trajectory sequence $\{V^\eta(\cdot)\}_\eta$ converge weakly as $\eta$ goes to zero? If so, what is the limit? This question has been studied for SGD in Chen et al. (2017). They use the Infinitesimal Perturbed Analysis (IPA) technique to show that under some regularity conditions, $V^\eta(\cdot)$ converges weakly to a solution of the following ODE:

$$\dot{V}(t) - (\Sigma V - V^\top \Sigma V V) = 0.$$

This method, however, cannot be applied to analyze MSGD due to the additional momentum term. Here, we explain why this method fails. We rewrite the algorithm (2.1) as

$$\delta_{k+1} = \mu \delta_k + \eta [\Sigma_k v_k - v_k^\top \Sigma_k v_k v_k], \quad v_{k+1} = v_k + \delta_{k+1}.$$

One can easily check that $(v_k, \delta_k)$ is Markovian. To apply IPA, the infinitesimal conditional expectation (ICE) must converge to a constant. However, the ICE for MSGD, which can be calculated as follows:

$$\frac{\mathbb{E}[\delta_{k+1} - \delta_k \mid \delta_k, v_k]}{\eta} = \frac{(\mu - 1) \delta_k}{\eta} + [\Sigma v_k - v_k^\top \Sigma v_k v_k],$$

goes to infinity (blows up) as $\eta \to 0$. Thus, we cannot apply IPA.

To address this challenge, we provide a new technique to prove the weak convergence and find the desired ODE. Roughly speaking, we first prove rigorously the weak convergence of the trajectory sequence. Then, with the help of the martingale theory, we find the ODE. For self-containedness, we provide a summary on the pre-requisite weak convergence theory in Appendix A.

Before we proceed, we impose the following assumption on the problem: The data points $\{X_k\}_{k=1}^{\infty}$ are drawn independently from a distribution $\mathcal{D}$ in $\mathbb{R}^d$, such that $\mathbb{E}[X] = 0$, $\mathbb{E}[XX^\top] = \Sigma$, and $\|X\| \leq C_d$, where $C_d$ is a constant (possibly dependent on $d$). This uniform boundedness assumption can actually be relaxed to the boundedness of a sufficiently high-order moment with a careful truncation argument. The proof, however, will be much more involved and beyond the scope of this paper. Thus, we use the uniform boundedness assumption for convenience. Under this assumption, we characterize the global behavior of MSGD as follows.

Suppose $v_0 = v_1$ lies on the unit sphere. Then for each subsequence of $\{V^\eta(\cdot)\}_{\eta > 0}$, there exists a further subsequence and a process $V(\cdot)$ such that $V^\eta(\cdot) \Rightarrow V(\cdot)$ in the weak sense as $\eta \to 0$ through the convergent subsequence, where $V(\cdot)$ satisfies the following ODE:

$$\dot{V} = \frac{1}{1-\mu} [\Sigma V - V^\top \Sigma V V], \quad V(0) = v_0. \tag{3.1}$$
###### Proof Sketch.

To prove this theorem, we first show that the trajectory sequence $\{V^\eta(\cdot)\}_\eta$ converges weakly. Let $D^d[0, \infty)$ be the space of $\mathbb{R}^d$-valued functions which are right continuous and have left-hand limits for each dimension. By Prokhorov’s Theorem A.2 (in Appendix A), we need to prove tightness, which means $\{V^\eta(\cdot)\}_\eta$ is bounded in probability in the space $D^d[0, \infty)$. This can be proved by Theorem A.2 (in Appendix A), which requires the following two conditions: (1) $V^\eta(t)$ must be bounded in probability for any $t$ uniformly in the step size $\eta$; (2) the maximal discontinuity (the largest difference between two iterations, i.e., $\max_k \|v_{k+1} - v_k\|$) must go to zero as $\eta$ goes to $0$. Lemma B.1 in Appendix B.1 shows that these two conditions hold for our algorithm.

We next compute the weak limit. For simplicity, we define

$$\beta_k = \sum_{i=0}^{k-1} \mu^{k-i} \left[(\Sigma_i - \Sigma) v_i - v_i^\top (\Sigma_i - \Sigma) v_i v_i\right] \quad \text{and} \quad \epsilon_k = (\Sigma_k - \Sigma) v_k - v_k^\top (\Sigma_k - \Sigma) v_k v_k.$$

We then rewrite the algorithm as follows:

$$m_{k+1} = m_k + (1 - \mu) [-m_k + \widetilde{M}(v_k)], \quad v_{k+1} = v_k + \eta (m_{k+1} + \beta_k + \epsilon_k), \tag{3.2}$$

where $\widetilde{M}(v) = \frac{1}{1-\mu} [\Sigma v - v^\top \Sigma v v]$. The basic idea of the proof is to view (3.2) as a two-time-scale algorithm, where $m_k$ is updated with a larger step size $1 - \mu$ and thus under a faster time-scale, and $v_k$ is updated with step size $\eta$ under a slower one. Then we can treat the slower time-scale iterate $v_k$ as static and replace the faster time-scale iterate $m_k$ by its stable point in terms of this fixed $v_k$ in (3.2). This stable point is $\widetilde{M}(v_k)$, which is shown in Lemma B.1 in Appendix B.1.

We then show that the continuous time interpolation of the error converges weakly to a Lipschitz continuous martingale with zero initialization. From martingale theory, we know that such a martingale must be constant. Thus, the error sequence converges weakly to zero, and what is left is actually the discretization of ODE (3.1). Please refer to Appendix B.2 for the detailed proof. ∎
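The two-time-scale form (3.2) is an algebraic rewriting of (2.1), which can be checked numerically. The sketch below assumes zero initial momentum ($v_{-1} = v_0$), $m_0 = 0$, and $\widetilde{M}(v) = \frac{1}{1-\mu}[\Sigma v - v^\top \Sigma v v]$ (the form that makes the stable point of $m_k$ match the drift of ODE (3.1)). It also uses the recursion $\beta_{k+1} = \mu(\beta_k + \epsilon_k)$, which follows from the definition of $\beta_k$. The helper names and the test data are hypothetical.

```python
import numpy as np

def drift(S, v):
    # S v - (v^T S v) v, the (projected) gradient direction for a symmetric matrix S.
    return S @ v - (v @ S @ v) * v

def run_direct(samples, v0, eta, mu):
    """Iterate (2.1) directly."""
    v_prev, v = v0.copy(), v0.copy()
    path = [v.copy()]
    for x in samples:
        v_next = v + eta * drift(np.outer(x, x), v) + mu * (v - v_prev)
        v_prev, v = v, v_next
        path.append(v.copy())
    return np.array(path)

def run_two_time_scale(Sigma, samples, v0, eta, mu):
    """Iterate the rewritten form (3.2) with m_0 = 0 and beta_0 = 0."""
    v = v0.copy()
    m = np.zeros_like(v0)
    beta = np.zeros_like(v0)
    path = [v.copy()]
    for x in samples:
        eps = drift(np.outer(x, x) - Sigma, v)  # epsilon_k: noise at the current iterate
        m_tilde = drift(Sigma, v) / (1.0 - mu)  # assumed form of M~(v_k)
        m = m + (1.0 - mu) * (-m + m_tilde)     # fast time-scale update
        v_next = v + eta * (m + beta + eps)     # slow time-scale update
        beta = mu * (beta + eps)                # beta_{k+1} from beta_k and epsilon_k
        v = v_next
        path.append(v.copy())
    return np.array(path)

rng = np.random.default_rng(0)
Sigma = np.diag([3.0, 2.0, 1.0])
samples = rng.normal(size=(500, 3)) * np.sqrt(np.diag(Sigma))
v0 = np.ones(3) / np.sqrt(3.0)
path_direct = run_direct(samples, v0, eta=1e-3, mu=0.9)
path_rewritten = run_two_time_scale(Sigma, samples, v0, eta=1e-3, mu=0.9)
print(np.max(np.abs(path_direct - path_rewritten)))
```

The two trajectories agree up to floating-point rounding, since `drift` is linear in its matrix argument and the geometric weights $\mu^{k-i}$ are accumulated identically in both forms.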

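To solve ODE (3.1), we need to rotate the coordinate to decouple each dimension. Under Assumption 2, there exists an orthogonal matrix $Q$ such that $\Sigma = Q \Lambda Q^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$. Let $h_k = Q^\top v_k$ and $\Lambda_k = Q^\top \Sigma_k Q$. Multiply each side of (2.1) by $Q^\top$, and we get

$$h_{k+1} = h_k + \mu (h_k - h_{k-1}) + \eta [\Lambda_k h_k - h_k^\top \Lambda_k h_k h_k]. \tag{3.3}$$

After the rotation, $e_1$ is the only global optimum, and $e_2, \ldots, e_{d-1}$ are saddles up to sign change. The continuous interpolation of $\{h_k\}$ is $H^\eta(t) = Q^\top V^\eta(t)$. Then, we rewrite ODE (3.1) as:

$$\dot{H} = \frac{1}{1-\mu} [\Lambda H - H^\top \Lambda H H]. \tag{3.4}$$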

Here, we write $H$ for $H(t)$ for simplicity. The ODE (3.4) differs from (4.6) in Chen et al. (2017) by a constant factor $\frac{1}{1-\mu}$. Then we have the following corollary.

Suppose $h_0 = h_1$ lies on the unit sphere. As $\eta \to 0$, $H^\eta(\cdot)$ converges weakly to

$$H^{(i)}(t) = \left( \sum_{j=1}^{d} \left[ H^{(j)}(0) \exp\left( \frac{\lambda_j t}{1-\mu} \right) \right]^2 \right)^{-\frac{1}{2}} H^{(i)}(0) \exp\left( \frac{\lambda_i t}{1-\mu} \right), \quad i = 1, \ldots, d.$$

Moreover, given $H^{(1)}(0) \neq 0$, $H(t)$ converges to $e_1$ (up to sign) as $t \to \infty$. Corollary 3 implies that when not initialized at saddle points or minima, the algorithm asymptotically converges to the global optimum. However, such a deterministic ODE-based approach is insufficient to characterize the local algorithmic behavior, since the noise of the stochastic gradient diminishes as $\eta \to 0$. Thus, we resort to the following SDE-based approach for a more precise characterization.
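The closed-form expression above can be checked against a direct numerical integration of ODE (3.4). The following sketch uses a classical fourth-order Runge-Kutta integrator. The eigenvalues, the momentum parameter, and the initial point are hypothetical values chosen for illustration, and the closed form subtracts the largest eigenvalue inside the exponential purely for numerical stability (the common factor cancels after normalization).

```python
import numpy as np

def ode_rhs(H, lam, mu):
    # Right-hand side of (3.4) with Lambda = diag(lam).
    return (lam * H - (H @ (lam * H)) * H) / (1.0 - mu)

def closed_form(H0, lam, mu, t):
    # Normalized exponential solution; assumes ||H0|| = 1.
    w = H0 * np.exp((lam - lam.max()) * t / (1.0 - mu))
    return w / np.linalg.norm(w)

def integrate_rk4(H0, lam, mu, t_end, n_steps):
    H = H0.astype(float).copy()
    dt = t_end / n_steps
    for _ in range(n_steps):
        k1 = ode_rhs(H, lam, mu)
        k2 = ode_rhs(H + 0.5 * dt * k1, lam, mu)
        k3 = ode_rhs(H + 0.5 * dt * k2, lam, mu)
        k4 = ode_rhs(H + dt * k3, lam, mu)
        H = H + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return H

lam = np.array([3.0, 2.0, 1.0])
mu = 0.5
H0 = np.array([0.1, 0.7, 0.7])
H0 = H0 / np.linalg.norm(H0)

H_numeric = integrate_rk4(H0, lam, mu, t_end=2.0, n_steps=4000)
H_exact = closed_form(H0, lam, mu, t=2.0)
H_limit = closed_form(H0, lam, mu, t=50.0)
print(H_numeric, H_exact, H_limit)
```

In this sketch, a larger $\mu$ only rescales time by $\frac{1}{1-\mu}$: the trajectory traced on the sphere is the same, but it is traversed faster.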

## 4 Analyzing Local Dynamics by SDE

To characterize the local algorithmic behavior, we need to rescale the influence of the noise. For this purpose, we consider the error normalized by $\sqrt{\eta}$ under the diffusion approximation framework. Different from the previous ODE-based approach, we obtain an SDE approximation here. Intuitively, the previous ODE-based approach is analogous to the Law of Large Numbers for random variables, while the SDE-based approach plays the same role as the Central Limit Theorem. For consistency, we first study the algorithmic behavior around the global optimum.

### 4.1 Phase III: Around Global Optima

Recall that all the coordinates are decoupled after the rotation. We directly consider each individual coordinate separately. For the $i$-th coordinate, $i \neq 1$, we define the normalized process $u^{\eta,i}_k = h^{\eta,i}_k / \sqrt{\eta}$, where $h^{\eta,i}_k$ is the $i$-th dimension of $h^\eta_k$. Accordingly, $U^{\eta,i}(t) = H^{\eta,i}(t) / \sqrt{\eta}$. The next theorem characterizes the limiting process of $U^{\eta,i}(t)$.

As $\eta \to 0$, $U^{\eta,i}(\cdot)$ ($i \neq 1$) converges weakly to a stationary solution of

$$dU = \frac{\lambda_i - \lambda_1}{1-\mu} U \, dt + \frac{\alpha_{i,1}}{1-\mu} \, dB_t, \tag{4.1}$$

where $\alpha_{i,1}$ is a constant that is well defined by Assumption 3. Note that our analysis is very different from that in Chen et al. (2017) because IPA fails here due to a similar blow-up issue. We remark that our technique mainly relies on Theorem A.6 (in Appendix A) from Kushner and Yin (2003). Since the proof is much more sophisticated and involved than IPA, we only introduce the key technique, Fixed-State-Chain, at a high level.

###### Proof Sketch.

Note that the algorithm can be rewritten as

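$$\begin{aligned} h^{\eta,i}_{k+1} ={}& h^{\eta,i}_k + \eta \Big[ \sum_{j=1}^{k-1} \mu^{k-j} \big( \Lambda_j h^\eta_j - (h^\eta_j)^\top \Lambda_j h^\eta_j h^\eta_j \big) + \Lambda h^\eta_k - (h^\eta_k)^\top \Lambda h^\eta_k h^\eta_k \Big]^{(i)} \\ &+ \eta \Big[ (\Lambda_k - \Lambda) h^\eta_k - (h^\eta_k)^\top (\Lambda_k - \Lambda) h^\eta_k h^\eta_k \Big]^{(i)}. \end{aligned}$$

Here, for a vector $x$ and an integer $i$, $x^{(i)}$ represents the $i$-th dimension of $x$. We define

$$\begin{aligned} \xi^{(i)}_k &= \Big[ \sum_{j=1}^{k-1} \mu^{k-j} \big( \Lambda_j h_j - h_j^\top \Lambda_j h_j h_j \big) + \Lambda h_k - h_k^\top \Lambda h_k h_k \Big]^{(i)}, & Z^{(i)}_k &= g^{(i)}(\xi_k, h_k) + \gamma^{(i)}_k, \\ \gamma^{(i)}_k &= \Big[ (\Lambda_k - \Lambda) h_k - h_k^\top (\Lambda_k - \Lambda) h_k h_k \Big]^{(i)}, & g^{(i)}(\xi_k, h_k) &= \xi^{(i)}_k + \big[ \Lambda h_k - h_k^\top \Lambda h_k h_k \big]^{(i)}. \end{aligned}$$

Here, $g^{(i)}$ is the accelerated gradient flow, and $\gamma^{(i)}_k$ is the noise. Then the algorithm becomes

$$h^{\eta,i}_{k+1} = h^{\eta,i}_k + \eta Z^{\eta,i}_k = h^{\eta,i}_k + \eta g^{(i)}(\xi^\eta_k, h^\eta_k) + \eta \gamma^{\eta,i}_k.$$

Note that $\mathbb{E}[\gamma^{\eta,i}_k \mid \mathcal{F}^\eta_k] = 0$ implies that the noise is a martingale difference sequence.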

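We then manipulate the algorithm to extract its Markov structure in an explicit form. Specifically, given $h^\eta_k$, there exists a transition function $P(\cdot, \cdot \mid H = h^\eta_k)$ such that

$$P\{\xi^{\eta,i}_{k+1} \in \cdot \mid \mathcal{F}^\eta_k\} = P(\xi^{\eta,i}_k, \cdot \mid H = h^{\eta,i}_k).$$

This comes from the observation that the randomness only comes from the data when the state $h^\eta_k$ is given. Then the fixed-state-chain refers to the Markov chain with transition function $P(\cdot, \cdot \mid H = h)$ for a fixed $h$. The state of this Markov chain will be denoted by $\xi_k(h)$. We then decompose $h^{\eta,i}_{k+1} - h^{\eta,i}_k$ into

$$\begin{aligned} h^{\eta,i}_{k+1} - h^{\eta,i}_k ={}& \eta M^{(i)}(h^\eta_k) + \eta \gamma^{\eta,i}_k + \eta \big[ g^{(i)}(\xi_k(h^\eta_k), h^\eta_k) - M^{(i)}(h^\eta_k) \big] \\ &+ \eta \big[ g^{(i)}(\xi^\eta_k, h^\eta_k) - g^{(i)}(\xi_k(h^\eta_k), h^\eta_k) \big] = \eta M^{(i)}(h^\eta_k) + \eta W^{\eta,i}_k. \end{aligned} \tag{4.2}$$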

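The error term $W^{\eta,i}_k$ in (4.2) comes from three sources: (1) the difference between the fixed-state-chain and the limiting process: $g^{(i)}(\xi_k(h^\eta_k), h^\eta_k) - M^{(i)}(h^\eta_k)$; (2) the difference between the accelerated gradient flow and the fixed-state-chain: $g^{(i)}(\xi^\eta_k, h^\eta_k) - g^{(i)}(\xi_k(h^\eta_k), h^\eta_k)$; (3) the noise $\gamma^{\eta,i}_k$.

We then handle them separately and combine the results together to get the variance of $W^{\eta,i}_k$. Dividing (4.2) by $\sqrt{\eta}$ gives the corresponding recursion for $u^{\eta,i}_k$. Together with the fact that $M^{(i)}(h) \approx \frac{\lambda_i - \lambda_1}{1-\mu} h^{(i)}$ around $e_1$, we further know

$$\frac{u^{\eta,i}_{k+1} - u^{\eta,i}_k}{\eta} = \frac{\lambda_i - \lambda_1}{1-\mu} u^{\eta,i}_k + \frac{W^{\eta,i}_k}{\sqrt{\eta}} + o(|u^{\eta,i}_k|). \tag{4.3}$$

After calculating the variance of $W^{\eta,i}_k$, we see that (4.3) is essentially the discretization of SDE (4.1). For the detailed proof, please refer to Appendix C.1. ∎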

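Note that (4.1) admits an explicit solution, which is known as an Ornstein-Uhlenbeck (O-U) process (Øksendal, 2003), defined as:

$$U^{(i)}(t) = \frac{\alpha_{i,1}}{1-\mu} \int_0^t \exp\left[ \frac{\lambda_i - \lambda_1}{1-\mu} (t - s) \right] dB(s) + U^{(i)}(0) \exp\left[ \frac{\lambda_i - \lambda_1}{1-\mu} t \right].$$

Its expectation and variance are:

$$\mathbb{E}[U^{(i)}(t)] = U^{(i)}(0) \exp\left[ \frac{\lambda_i - \lambda_1}{1-\mu} t \right], \quad \mathrm{Var}[U^{(i)}(t)] = \frac{1}{1-\mu} \cdot \frac{\alpha_{i,1}^2}{2(\lambda_1 - \lambda_i)} \left( 1 - \exp\left[ \frac{2(\lambda_i - \lambda_1)}{1-\mu} t \right] \right).$$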
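The variance formula can be sanity-checked without simulation. For a linear SDE $dU = aU\,dt + b\,dB_t$ with deterministic $U(0)$, Itô's formula gives the moment equation $\frac{d}{dt}\mathrm{Var}[U(t)] = 2a\,\mathrm{Var}[U(t)] + b^2$ with $\mathrm{Var}[U(0)] = 0$. The sketch below integrates this equation with a simple Euler scheme and compares it with the closed form, using $a = \frac{\lambda_i - \lambda_1}{1-\mu}$ and $b = \frac{\alpha_{i,1}}{1-\mu}$ from (4.1). The function names and the numerical values are hypothetical.

```python
import numpy as np

def ou_variance(t, lam_i, lam_1, alpha, mu):
    """Closed-form variance of the O-U process solving (4.1), assuming lam_i < lam_1."""
    a = (lam_i - lam_1) / (1.0 - mu)  # negative drift coefficient
    b = alpha / (1.0 - mu)            # diffusion coefficient
    return b**2 * (1.0 - np.exp(2.0 * a * t)) / (-2.0 * a)

def variance_by_moment_ode(t_end, lam_i, lam_1, alpha, mu, n_steps=20000):
    """Euler integration of dVar/dt = 2 a Var + b^2 with Var(0) = 0."""
    a = (lam_i - lam_1) / (1.0 - mu)
    b = alpha / (1.0 - mu)
    dt = t_end / n_steps
    var = 0.0
    for _ in range(n_steps):
        var += dt * (2.0 * a * var + b * b)
    return var

# Hypothetical values: eigen-gap 1, alpha = 1, momentum 0.9 versus no momentum.
var_closed = ou_variance(0.1, lam_i=2.0, lam_1=3.0, alpha=1.0, mu=0.9)
var_euler = variance_by_moment_ode(0.1, lam_i=2.0, lam_1=3.0, alpha=1.0, mu=0.9)
stationary_msgd = ou_variance(1e3, 2.0, 3.0, 1.0, 0.9)
stationary_vsgd = ou_variance(1e3, 2.0, 3.0, 1.0, 0.0)
print(var_closed, var_euler, stationary_msgd / stationary_vsgd)
```

With these values the stationary variance under $\mu = 0.9$ is $\frac{1}{1-\mu} = 10$ times the stationary variance under $\mu = 0$, in line with the scaling in the variance formula.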

We see clearly that the momentum essentially increases the variance of the normalized error by a factor of $\frac{1}{1-\mu}$ around the global optimum. Thus, it becomes harder for the algorithm to converge. The next lemma provides a more precise characterization of such a phenomenon. Given a sufficiently small $\epsilon > 0$ and $\phi$ (under Assumption 3), we need the step size satisfying

$$\eta < (1 - \mu) \frac{(\lambda_1 - \lambda_2) \epsilon}{4 \phi} \tag{4.4}$$

such that the solution enters the $\epsilon$-neighborhood of the global optimum with at least a constant probability at some finite time. Note that Chen et al. (2017) choose the step size of VSGD as $\eta_0 \asymp \epsilon (\lambda_1 - \lambda_2) / \phi$, which does not satisfy (4.4) for $\mu$ close to $1$. This means that when using the same step size as VSGD, MSGD fails to converge, since the variance increased by the momentum becomes too large. To handle this issue, we have to decrease the step size by a factor of $1 - \mu$, also known as step size annealing, i.e.,

$$\eta \asymp (1 - \mu) \frac{\epsilon (\lambda_1 - \lambda_2)}{\phi} \asymp (1 - \mu) \eta_0. \tag{4.5}$$

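Then we obtain the following proposition. For a sufficiently small $\epsilon$ and $\eta$, there exists some constant $\delta$, such that after restarting the counter of time, given that the algorithm starts close enough to the global optimum, we need

$$T_3 \asymp \frac{1 - \mu}{2 (\lambda_1 - \lambda_2)} \cdot \log\left( \frac{8 (\lambda_1 - \lambda_2) \delta^2}{(\lambda_1 - \lambda_2) \epsilon - 4 \eta \phi} \right)$$

to ensure that the solution lies in the $\epsilon$-neighborhood of the global optimum with at least a constant probability. Proposition 4.1 implies that the algorithm needs asymptotically at most

$$N_3 \asymp \frac{T_3}{\eta} \asymp \frac{\phi}{\epsilon (\lambda_1 - \lambda_2)^2} \cdot \log\left( \frac{8 (\lambda_1 - \lambda_2) \delta^2}{(\lambda_1 - \lambda_2) \epsilon - 4 \eta_0 \phi} \right)$$

iterations to converge to an $\epsilon$-optimal solution in Phase III. Thus, MSGD does not have an advantage over VSGD in Phase III. We remark that the annealed step size (4.5) is only used for Phase III. For the other two phases, we can choose $\eta \asymp \eta_0$.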

### 4.2 Phase II: How MSGD Traverses between Stationary Points

For Phase II, we characterize how the algorithm behaves once it has escaped from saddle points. During this period, MSGD is dominated by the gradient, and the influence of the noise is negligible. Thus, the algorithm behaves like an almost deterministic traverse between stationary points, which can be viewed as a two-step discretization of the ODE with a discretization error (Griffiths and Higham, 2010). Thus, we can use the ODE approximation to study the algorithm before it enters the neighborhood of the optimum. By Corollary 3, we obtain the following proposition. After restarting the counter of time, for sufficiently small $\eta$ and $\delta$, we need

$$T_2 \asymp \frac{1 - \mu}{2 (\lambda_1 - \lambda_2)} \log\left( \frac{1 - \delta^2}{\delta^2} \right)$$

such that $\big(H^{(1)}(T_2)\big)^2 \geq 1 - \delta^2$.

When $\epsilon$ in Proposition 4.1 is small enough, we can choose $\eta \asymp \epsilon (\lambda_1 - \lambda_2) / \phi$, which is the same as SGD (much larger than (4.5) for $\mu$ close to 1), and this result implies that the algorithm needs asymptotically at most

$$N_2 \asymp \frac{T_2}{\eta} \asymp \frac{(1 - \mu) \phi}{2 \epsilon (\lambda_1 - \lambda_2)^2} \log\left( \frac{1 - \delta^2}{\delta^2} \right)$$

iterations to traverse between stationary points. Clearly, MSGD is faster than SGD by a factor of $\frac{1}{1-\mu}$ in Phase II, when using the same step size. This is because the algorithm can make more progress along the descent direction with the help of the momentum.

### 4.3 Phase I: Escaping from Saddle Points

At last, we study the algorithmic behavior around saddle points $e_j$, $j \neq 1$. By the same SDE approximation technique used in Section 4.1, we obtain the following theorem. Condition on the event that $h^\eta_k$ stays close to $e_j$ for all $k$. Then for $i \neq j$, $U^{\eta,i}(\cdot)$ converges weakly to a solution of

$$dU = \frac{\lambda_i - \lambda_j}{1-\mu} U \, dt + \frac{\alpha_{i,j}}{1-\mu} \, dB_t.$$

We remark that this conditioning is only a technical assumption. This does not cause any issue, since once the iterate is far from $e_j$, MSGD has escaped from the saddle point $e_j$, which is out of Phase I.

Theorem 4.3 implies that for $i < j$, the process defined by the equation above is an unstable O-U process, which goes to infinity. Thus, the algorithm will not be trapped around saddle points. Then we obtain the following proposition.

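Given a pre-specified $\nu \in (0, 1)$ and sufficiently small $\eta$ and $\delta$, the following result holds: We need at most

$$T_1 \asymp \frac{1 - \mu}{2 (\lambda_1 - \lambda_2)} \log\left( \frac{2 (1 - \mu) \eta^{-1} \delta^2 (\lambda_1 - \lambda_2)}{\Phi^{-1}\left( \frac{1 + \nu/2}{2} \right)^2 \alpha_{1,2}^2} + 1 \right), \tag{4.6}$$

such that the algorithm escapes from the neighborhood of the saddle point with probability at least $1 - \nu$, where $\Phi$ is the CDF of the standard normal distribution. Proposition 4.3 suggests that we need asymptotically

$$N_1 \asymp \frac{(1 - \mu) \phi}{(\lambda_1 - \lambda_2)^2 \epsilon} \log\left( \frac{2 (1 - \mu) \eta^{-1} \delta^2 (\lambda_1 - \lambda_2)}{\Phi^{-1}\left( \frac{1 + \nu/2}{2} \right)^2 \alpha_{1,2}^2} + 1 \right)$$

iterations to escape from saddle points. Thus, when using the same step size, MSGD can escape from saddle points in fewer iterations than SGD by a factor of $\frac{1}{1-\mu}$. This is due to the fact that the momentum can greatly increase the variance and perturb the algorithm more violently. Thus, it becomes harder to stay around saddle points. Moreover, the momentum also encourages more aggressive exploitation, and in each iteration, the algorithm makes more progress along the descent direction by a factor of $\frac{1}{1-\mu}$.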

## 5 Some Insights on Training DNNs

The streaming PCA problem is closely related to optimization for deep neural networks (DNNs) in many respects. Existing literature has shown that the optimization landscape of training DNNs, though much more complicated and difficult to analyze, consists of similar basic geometric structures, such as saddle points and local optima. Thus, our theoretical characterization of the algorithmic behavior of MSGD around saddle points and local optima (for the streaming PCA problem) can provide new insights into how MSGD behaves in training DNNs. Choromanska et al. (2015); Dauphin et al. (2014); Kawaguchi (2016); Hardt and Ma (2016) suggest that there are a combinatorially large number of saddle points and many local optima in training DNNs.

Figure 1: Two illustrative examples of the flat and sharp local optima. MSGD tends to avoid the sharp local optimum, since its high variance encourages exploration.

Under certain oversimplified conditions, they prove: When the size of the network is large enough, most local optima are equivalent and yield similar generalization performance; moreover, the probability of achieving a “spurious/bad” local optimum (which does not generalize well), though not zero, decreases exponentially fast as the size of the network gets larger. Thus, they suspect that the major computational challenge of training DNNs should be “how efficient an algorithm is when escaping from numerous saddle points.” From this aspect, our Proposition 4.3 suggests that MSGD indeed escapes from saddle points faster than VSGD in the presence of negative curvatures.

“No spurious/bad local optima”, however, is often considered an overoptimistic claim. Some recent results (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Zhang et al., 2017; Neyshabur et al., 2017; Safran and Shamir, 2017) provide empirical and theoretical evidence that the spurious/bad local optima are not completely negligible. Keskar et al. (2016); Zhang et al. (2017); Neyshabur et al. (2017) further suggest that the landscape of these spurious/bad local optima is usually sharp, i.e., their basins of attraction are small and wiggly. From this aspect, our analysis suggests that MSGD with a larger momentum ($\mu$ is very close to $1$) tends to stay in “flat/good local optima”, since the higher variance of the noise introduced by the momentum encourages more exploration outside the small basin of attraction of sharp local optima.

Our analysis also provides some new insights on how to apply the step size annealing technique to MSGD. Specifically, our analysis suggests that at the final stage of the step size annealing, MSGD should use a much smaller step size than that of VSGD. Otherwise, MSGD may be numerically unstable and fail to converge well.

## 6 Numerical Experiments

We present numerical experiments for both streaming PCA and training deep neural networks. The experiments on streaming PCA verify our theory in Section 4, and the experiments on training deep neural networks verify some of our discussions in Section 5.

### 6.1 Streaming PCA

We first provide a numerical experiment to verify our theory for streaming PCA. We set and the covariance matrix The optimum is Figure 2 compares the performance of VSGD and MSGD (with and without the step size annealing in Phase III). The initial solution is the saddle point . We choose and , and decrease the step size of MSGD by a factor after iterations in Fig.2.c. Fig.2.a-c plot the results of 100 simulations, and the vertical axis corresponds to . We can clearly differentiate the three phases of VSGD in Fig.2.a. For MSGD in Fig.2.b, we hardly recognize Phases I and II, since they last for a much shorter time. This is because the momentum significantly helps escape from saddle points and evolve toward global optima. Moreover, we also observe that MSGD without the step size annealing does not converge well, but the step size annealing resolves this issue. All these observations are consistent with our analysis. Fig.2.d plots the optimization errors of these three algorithms averaged over all 100 simulations, and we observe similar results.

Figure 2: Comparison between SGD and MSGD (with and without the step size annealing (SSA) in Phase III).

### 6.2 Deep Neural Networks

We then provide three experiments to compare MSGD with VSGD in training a ResNet-18 DNN using the CIFAR-100 dataset for a 100-class image classification task. We choose a batch size of . 50k images are used for training, and the remaining 10k are used for testing. We repeat each experiment for times and report the average.

$$\eta_M \in \{0.6, 0.2, 0.1, 0.05, 0.02, 0.01\} \quad \text{and} \quad \eta_V \in \{6, 2, 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01\}$$

denote the initial step sizes of MSGD and VSGD, respectively. We decrease the step size by a factor of after , , and epochs. More details on the network architecture and experimental settings can be found in Appendix D.

• Experiment 1. The results are shown in Fig.3. Choosing $\eta_M = \eta_V$, MSGD achieves better generalization than VSGD.

Figure 3: Experimental results on CIFAR-100 for training DNNs. MSGD and VSGD are using the same step sizes.

• Experiment 2. The results are shown in Fig.4. Choosing $\eta_V = \eta_M / (1 - \mu)$, MSGD achieves similar generalization to VSGD (when ).

Figure 4: Experimental results on CIFAR-100 for training DNNs. VSGD is using the step sizes of MSGD rescaled by $1/(1-\mu)$.

• Experiment 3. The results are shown in Fig.5. For MSGD, choosing and achieves the optimal generalization (among all possible values). For VSGD, choosing achieves the optimal generalization (among all possible values). We see that the optimal generalization of MSGD is better than that of VSGD. Note that for , MSGD still works well. However, for , VSGD no longer works, and the generalization drops significantly. Specifically, the failure rate of VSGD with is in runs.

Table 1 shows the best performance and the standard deviation (Std) of each experiment setting, which shows that MSGD has a relatively small standard deviation.

## 7 Discussions

The results on training DNNs are expectable or partially expectable, given our theoretical analysis for streaming PCA. We remark that our experiments (in Fig.5 Bottom) show some inconsistency with an earlier paper Wilson et al. (2017). Specifically, Wilson et al. (2017) show that MSGD does not outperform VSGD in training a VGG-16 deep neural network using the CIFAR-10 dataset. However, our results show that the momentum indeed improves the training. We suspect that there exist certain structures in the optimization landscape of VGG-16, which marginalize the value of MSGD. In contrast, the optimization landscape of ResNet-18 is more friendly to MSGD than VGG-16.

Moreover, we remark that our theory helps explain some phenomena in training DNNs, however, there still exist some gaps: (1) Our analysis requires to do the diffusion approximation. However, the experiments actually use relatively large step sizes at the early stage of training. Though we can expect large and small step sizes share some similar behaviors, it may lead to very different results. For example, we observe that VSGD can use larger step sizes, and achieve similar generalization to that of MSGD. However, when MSGD achieves the optimal generalization using , VSGD performs much worse using ; (2) The optimization landscape of DNNs also contains a vast amount of high order saddle points, where our analysis cannot be applied (neither all existing analyses). How SGD/MSGD behaves in this scenario is still an open theoretical problem.

We also summarize the comparison between our results and related works in Table 2. To the best of our knowledge, Ghadimi and Lan (2016) and Jin et al. (2017) are the only works in the existing literature that consider nonconvex optimization using momentum.

We remark that Ghadimi and Lan (2016) only consider convergence to a first-order optimal solution, and therefore cannot justify the advantage of the momentum in escaping from saddle points; Jin et al. (2017) only consider a batch algorithm, which cannot explain why the momentum hurts when MSGD converges to optima. Moreover, Jin et al. (2017) need an additional negative curvature exploitation procedure, which is not used in the popular Nesterov's accelerated gradient algorithms.

## Appendix A Summary on Weak Convergence and Main Theorems

Here, we summarize the theory of weak convergence and the theorems used in this paper. Recall that the continuous-time interpolation of the solution trajectory is defined as on the time interval It has sample paths in the space of càdlàg functions (right-continuous with left-hand limits) defined on , also called the Skorokhod space, denoted by . Thus, the weak convergence we consider here is defined in this space instead of . The special metric in is called the Skorokhod metric, and the topology generated by this metric is the Skorokhod topology. Please refer to Sagitov (2013); Kushner and Yin (2003) for detailed explanations. The weak convergence in is defined as follows:

[Weak Convergence in ] Let be the minimal $\sigma$-field induced by the Skorokhod topology. Let $X_n$ and $X$ be random variables on defined on a probability space Suppose that $P_n$ and $P$ are the probability measures on generated by $X_n$ and $X$. We say $P_n$ converges weakly to $P$ ($P_n \Rightarrow P$), if for all bounded and continuous real-valued functions $F$ on , the following condition holds:

$$\mathbb{E} F(X_n) = \int F(x)\, dP_n(x) \;\to\; \mathbb{E} F(X) = \int F(x)\, dP(x). \tag{A.1}$$
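As a minimal illustration of condition (A.1), the sketch below uses real-valued random variables rather than the Skorokhod space considered here. It is not taken from the paper: the choice of $X_n$ as a rescaled sum of $n$ i.i.d. $\pm 1$ steps, the limit $X \sim N(0,1)$, and the test function $F = \cos$ are all assumptions made for illustration. For this choice both expectations are available in closed form, so no sampling is needed.

```python
import math


def expected_cos_scaled_walk(n: int) -> float:
    """Exact E[cos(S_n / sqrt(n))], where S_n is a sum of n i.i.d. +/-1 steps.

    By independence and symmetry of the steps, the expectation factorizes
    into cos(1 / sqrt(n)) ** n (real part of the characteristic function).
    """
    return math.cos(1.0 / math.sqrt(n)) ** n


# E[cos(Z)] for Z ~ N(0, 1): the right-hand side of (A.1) in this toy setting.
limit_value = math.exp(-0.5)

for n in (10, 100, 1_000, 10_000):
    gap = abs(expected_cos_scaled_walk(n) - limit_value)
    print(f"n = {n:>6d}   |E F(X_n) - E F(X)| = {gap:.2e}")
```

The gap shrinks as $n$ grows, which is what (A.1) asks for with this particular bounded continuous $F$; weak convergence requires the same for every such $F$.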

With an abuse of terminology, we say $X_n$ converges weakly to $X$ and write $X_n \Rightarrow X$.

Another important definition we need is tightness: A set of -valued random variables $\{X_n\}$ is said to be tight if for each $\delta > 0$, there is a compact set $B_\delta$ such that:

$$\sup_n P\{X_n \notin B_\delta\} \le \delta. \tag{A.2}$$
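To make (A.2) concrete, the following sketch checks tightness for a toy family of real-valued random variables, not for processes in the Skorokhod space. The family $X_n \sim N(0, \sigma_n^2)$ with $\sigma_n \le 1$ and the helper names are hypothetical. The compact set is taken to be $B_\delta = [-c, c]$ with $c = 1/\sqrt{\delta}$, which Chebyshev's inequality shows is sufficient since $P(|X_n| > c) \le \sigma_n^2 / c^2 \le \delta$.

```python
import math


def gaussian_tail(c: float, sigma: float) -> float:
    """Exact P(|X| > c) for X ~ N(0, sigma^2)."""
    return math.erfc(c / (sigma * math.sqrt(2.0)))


def chebyshev_radius(delta: float, sigma_max: float = 1.0) -> float:
    """Radius c with sigma_max^2 / c^2 = delta, so that B_delta = [-c, c]."""
    return sigma_max / math.sqrt(delta)


# Hypothetical family: sigma_n = 1 - 1/(n + 1), all strictly between 0 and 1.
sigmas = [1.0 - 1.0 / (n + 1) for n in range(1, 200)]

for delta in (0.5, 0.1, 0.01):
    c = chebyshev_radius(delta)
    worst = max(gaussian_tail(c, s) for s in sigmas)
    print(f"delta = {delta:<5}  c = {c:6.2f}  sup_n P(|X_n| > c) = {worst:.2e}")
```

The same compact set works for every member of the family, which is the uniformity in $n$ that (A.2) demands.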

We care about tightness because it provides a powerful way to prove weak convergence, based on the following two theorems:

[Prokhorov's Theorem] Under the Skorokhod topology, $\{X_n\}$ is tight in if and only if it is relatively compact, which means that each subsequence contains a further subsequence that converges weakly.

[Sagitov (2013), Theorem 3.8] A necessary and sufficient condition for $P_n \Rightarrow P$ is that each subsequence contains a further subsequence converging weakly to $P$.

Thus, if we can prove that $\{X_n\}$ is tight and all the further subsequences share the same weak limit $X$, then $X_n$ converges weakly to $X$. However, (A.2) is hard to verify directly. We usually check another, easier criterion. Let be the $\sigma$-algebra generated by , and let $\tau$ denote a -stopping time.

[Kushner and Yin (2003), Theorem 3.3, Chapter 7] Let $\{X_n\}$ be a sequence of processes that have paths in . Suppose that for each $\delta > 0$ and each $t$ in a dense set in , there is a compact set $K_{\delta,t}$ in such that

$$\inf_n P\{X_n(t) \in K_{\delta,t}\} \ge 1 - \delta, \tag{A.3}$$

and for each positive $T$,

$$\lim_{\delta} \limsup_n \sup_{|\tau| \le T} \sup_{s \le \delta} \mathbb{E} \min\left[\|X_n(\tau+s) - X_n(\tau)\|, 1\right] = 0. \tag{A.4}$$

Then $\{X_n\}$ is tight in . This theorem is used in Section 3 to prove tightness of the trajectory of Momentum SGD.

Finally, we provide the theorem we use to prove the SDE approximation. Consider the following algorithm:

$$\theta^\eta_{n+1} = \theta^\eta_n + \eta Y^\eta_n, \tag{A.5}$$

where $Y^\eta_n = g^\eta_n(\theta^\eta_n, \xi^\eta_n) + M^\eta_n$, and $M^\eta_n$ is a martingale difference sequence. Then the normalized process $U^\eta_n$ satisfies:

$$U^\eta_{n+1} = U^\eta_n + \sqrt{\eta}\left(g^\eta_n(\theta^\eta_n, \xi^\eta_n) + M^\eta_n\right). \tag{A.6}$$
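A small numerical sketch may help build intuition for the recursions (A.5) and (A.6). Everything specific in it is an assumption made for illustration and is not the setting analyzed in the paper: a scalar iterate, the drift $g(\theta) = -\theta$ with no dependence on $\xi$, i.i.d. standard normal $M_n$, the reference point $\bar{\theta} = 0$, and the normalization $U = (\theta - \bar{\theta})/\sqrt{\eta}$. Under these assumptions (A.6) reads $U_{n+1} = U_n - \eta U_n + \sqrt{\eta} M_n$, which resembles an Euler discretization of the Ornstein-Uhlenbeck SDE $dU = -U\,dt + dW$, whose stationary variance is $1/2$.

```python
import numpy as np


def simulate_normalized(eta: float, horizon: float = 10.0,
                        n_chains: int = 4000, seed: int = 0) -> np.ndarray:
    """Run recursion (A.5) on independent scalar chains; return U at time `horizon`.

    Toy assumptions (not from the paper): g(theta) = -theta, M_n ~ N(0, 1),
    theta_bar = 0, and U = (theta - theta_bar) / sqrt(eta).
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / eta))      # continuous time t = n * eta
    theta = np.zeros(n_chains)               # every chain starts at theta_bar = 0
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)   # martingale difference M_n
        drift = -theta                          # assumed g(theta)
        theta = theta + eta * (drift + noise)   # update (A.5)
    return theta / np.sqrt(eta)                 # normalized process U


for eta in (0.1, 0.01):
    u = simulate_normalized(eta)
    print(f"eta = {eta:<5}  empirical Var(U) = {np.var(u):.3f}")
```

For this linear toy model the stationary variance of $U$ works out to $1/(2-\eta)$, so the empirical variance should sit near $1/2$ for small $\eta$, up to sampling error.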

We further assume the fixed-state-chain exists as in Section 4.1 and use the same notation to denote the fixed-$\theta$ process. Then we have the following theorem:

[Kushner and Yin (2003), Theorem 8.1, Chapter 10] Assume the following conditions hold:

2. For small , is uniformly integrable.

3. There is a continuous function such that for any sequence of integers satisfying as and each compact set ,

 1nηjnη+nη−1∑i=jnηEηjnη