1 Introduction
Nonconvex stochastic optimization naturally arises in many machine learning problems. Taking the training of deep neural networks as an example, given $n$ samples denoted by $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th input feature and $y_i$ is the response, we solve the following optimization problem,
(1.1) $$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i, \theta)),$$
where $\ell$ is a loss function, $f$ denotes the decision function based on the neural network, and $\theta$ denotes the parameter associated with $f$.
Momentum Stochastic Gradient Descent (MSGD, Robbins and Monro (1951); Polyak (1964)) is one of the most popular algorithms for solving (1.1). Specifically, at the $k$-th iteration, we uniformly sample $i$ from $\{1, \dots, n\}$. Then, we take
(1.2) $$\theta_{k+1} = \theta_k - \eta \nabla_{\theta} \ell(y_i, f(x_i, \theta_k)) + \mu (\theta_k - \theta_{k-1}),$$
where $\eta$ is the step size parameter and $\mu \in [0, 1)$ is the parameter for controlling the momentum. Note that when $\mu = 0$, (1.2) reduces to Vanilla Stochastic Gradient Descent (VSGD).
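As an illustration of update (1.2), the following minimal sketch assumes the standard Polyak heavy-ball form $\theta_{k+1} = \theta_k - \eta g_k + \mu(\theta_k - \theta_{k-1})$, where $g_k$ is the stochastic gradient at the sampled data point. The function names `msgd_step` and `run_msgd`, the scalar least-squares loss, and the toy data are hypothetical choices made only for this example.

```python
import random

def msgd_step(theta, theta_prev, grad, eta, mu):
    """One heavy-ball update: theta - eta * grad + mu * (theta - theta_prev)."""
    return [t - eta * g + mu * (t - tp)
            for t, tp, g in zip(theta, theta_prev, grad)]

def run_msgd(samples, eta=0.05, mu=0.9, steps=2000, seed=0):
    """Fit y ~ w * x by least squares with MSGD; mu = 0 recovers VSGD."""
    rng = random.Random(seed)
    theta, theta_prev = [0.0], [0.0]          # no momentum at the start
    for _ in range(steps):
        x, y = rng.choice(samples)            # uniformly sample one data point
        grad = [(theta[0] * x - y) * x]       # gradient of 0.5 * (w * x - y)^2
        theta, theta_prev = msgd_step(theta, theta_prev, grad, eta, mu), theta
    return theta[0]

# Hypothetical noiseless toy data with true slope w = 3.
toy_samples = [(-1.0, -3.0), (1.0, 3.0)]
print(run_msgd(toy_samples, mu=0.9), run_msgd(toy_samples, mu=0.0))
```

Setting `mu=0.0` removes the momentum term, so the same loop also serves as a VSGD baseline.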
Although SGD-type algorithms have demonstrated significant empirical successes for training deep neural networks, due to the lack of convexity, their convergence properties for nonconvex optimization are still largely unknown. For VSGD, existing literature shows that it is guaranteed to converge to a first-order optimal solution (i.e., a point where the gradient vanishes) under general smooth nonconvex optimization.
The theoretical investigation of MSGD is even more limited than that of VSGD. The momentum in (1.2) has been observed to significantly accelerate computation in practice. To the best of our knowledge, Ghadimi and Lan (2016) is the only existing work showing that MSGD is guaranteed to converge to a first-order optimal solution for smooth nonconvex problems. Their analysis, however, does not justify the advantage of the momentum in MSGD over VSGD.
The major technical bottleneck in analyzing MSGD and VSGD comes from the nonconvex optimization landscape of these highly complicated problems, e.g., training large recommendation systems and deep neural networks. The current technical limit makes establishing a general theory infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem: streaming PCA. This allows us to make progress toward understanding MSGD and gaining new insights on more general problems. Specifically, given a streaming data set $\{X_k\}_{k=1}^{\infty}$ drawn independently from some unknown zero-mean distribution $\mathcal{D}$, we consider the following problem
(1.3) $$\max_{v} \; v^\top \mathbb{E}_{X \sim \mathcal{D}}[XX^\top] v \quad \text{subject to} \quad v^\top v = 1.$$
Note that (1.3), though nonconvex, is well known as a strict saddle optimization problem over the sphere, whose optimization landscape enjoys two geometric properties: (1) there are no spurious local optima, and (2) there always exist negative curvatures around saddle points. The landscape contains the following three regions:

$\mathcal{R}_1$: The region containing the neighborhood of strict saddle points with negative curvatures;

$\mathcal{R}_2$: The region including the set of points whose gradient has sufficiently large magnitude;

$\mathcal{R}_3$: The region containing the neighborhood of all global optima with a positive curvature along a certain direction.
These nice geometric properties are also shared by several other popular nonconvex optimization problems arising in machine learning and signal processing, including matrix regression/completion/sensing, independent component analysis, partial least squares multiview learning, and phase retrieval (Ge et al., 2016; Li et al., 2016b; Sun et al., 2016). Moreover, since the optimization landscape of general nonconvex problems is still poorly understood, many researchers suggest that analyzing streaming PCA and other strict saddle optimization problems should be considered a first and important step towards understanding the algorithmic behaviors in more general nonconvex optimization.
By using streaming PCA as an illustrative example, we are interested in answering a natural and fundamental question:
What is the role of the momentum in nonconvex stochastic optimization?
Our analysis is based on the diffusion approximation of stochastic optimization, which is a powerful tool in applied probability. Specifically, we prove that asymptotically the solution trajectory of MSGD converges weakly to the solution of an appropriately constructed ODE/SDE, and this solution provides an intuitive characterization of the algorithmic behavior. We remark that the major technical challenge is to prove the weak convergence of the trajectory sequence. This is because the Infinitesimal Perturbed Analysis used for VSGD in existing literature (Chen et al., 2017; Li et al., 2016a) is not applicable here due to the momentum term of MSGD. Instead, we apply the martingale method and the “Fixed-State-Chain” method from the stochastic approximation literature (Kushner and Yin, 2003). To the best of our knowledge, we are the first to apply these powerful methods to analyze MSGD. Our result shows that the momentum plays different but important roles in different regions.
The momentum helps escape from the neighborhood of saddle points ($\mathcal{R}_1$): In this region, since the gradient diminishes, the variance of the stochastic gradient dominates the algorithmic behavior. Our analysis indicates that the momentum greatly increases the variance and perturbs the algorithm more violently. Thus, it becomes harder for the algorithm to stay around saddle points. In addition, the momentum also encourages more aggressive exploitation, and in each iteration, the algorithm makes more progress along the descent direction by a factor of $1/(1-\mu)$, where $\mu$ is the momentum parameter.

The momentum helps evolve toward global optima in the nonstationary region ($\mathcal{R}_2$): In this region, the variance of the stochastic gradient can be neglected due to the larger magnitude of the gradient. At the same time, with the help of the momentum, the algorithm makes more progress along the descent direction. Thus, the momentum can accelerate the algorithm in this region by a factor of $1/(1-\mu)$.

The momentum hurts the convergence within the neighborhood of global optima ($\mathcal{R}_3$): Similar to $\mathcal{R}_1$, the gradient dies out, and the variance of the stochastic gradient dominates. Since the momentum increases the variance, it is harder for the algorithm to enter the small neighborhood. In this respect, the momentum hurts in this region.
This characterization helps explain some phenomena observed when training deep neural networks. There have been some empirical observations and theoretical results (Choromanska et al., 2015) showing that saddle points are the major computational bottleneck, and VSGD usually spends most of the time traveling along saddle and nonstationary regions. Since the momentum helps in both regions, MSGD often performs better than VSGD in practice. In addition, from our analysis, the momentum hurts convergence within the neighborhood of the optima. However, we can address this problem by decreasing the step size or the momentum.
We further verify our theoretical findings through numerical experiments on training a ResNet-18 deep neural network using the CIFAR-100 dataset. The experimental results show that the algorithmic behavior of MSGD is consistent with our analysis. Moreover, we observe that with a proper initial step size and a proper step size annealing process, MSGD eventually achieves better generalization accuracy than VSGD in training neural networks.
Several recent results are closely related to our work. Li et al. (2017) adopt a numerical SDE approach to derive the so-called Stochastic Modified Equations for VSGD. However, their analysis requires the drift term in the SDE to be bounded, which is not satisfied by MSGD. Other results consider SDE approximations of several accelerated SGD algorithms for convex smooth problems only (Wang, 2017; Krichene and Bartlett, 2017). In contrast, our analysis is for the nonconvex streaming PCA problem and is technically more challenging.
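Notations: For $1 \le i \le d$, let $e_i = (0, \dots, 0, 1, 0, \dots, 0)^\top$ (the $i$-th dimension equals $1$, others $0$) be the standard basis in $\mathbb{R}^d$. Given a vector $v = (v^{(1)}, \dots, v^{(d)})^\top \in \mathbb{R}^d$, we define the vector norm: $\|v\|^2 = \sum_{j} (v^{(j)})^2$. The notation w.p.1 is short for with probability one, $B_t$ is the standard Brownian Motion in $\mathbb{R}^d$, and $\mathbb{S}$ denotes the sphere of the unit ball in $\mathbb{R}^d$, i.e., $\mathbb{S} = \{v \in \mathbb{R}^d : \|v\| = 1\}$. $\dot{F}$ denotes the derivative of the function $F(t)$.
2 Momentum SGD for Streaming PCA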
Recall that we study MSGD for the streaming PCA problem formulated as (1.3),
$$\max_{v} \; v^\top \mathbb{E}_{X \sim \mathcal{D}}[XX^\top] v \quad \text{subject to} \quad v^\top v = 1.$$
The optimization landscape of (1.3) has been well studied. For notational simplicity, denote $\Sigma = \mathbb{E}[XX^\top]$. Before we proceed, we impose the following assumption on $\Sigma$: The covariance matrix $\Sigma$ is positive definite with eigenvalues $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_d > 0$ and associated normalized eigenvectors $v^1, v^2, \dots, v^d$. Under this assumption, Chen et al. (2017) have shown that the eigenvectors $\pm v^1, \pm v^2, \dots, \pm v^d$ are all the stationary points for problem (1.3) on the unit sphere $\mathbb{S}$. Moreover, the eigengap assumption ($\lambda_1 > \lambda_2$) guarantees that the global optimum $v^1$ is identifiable up to sign change. Meanwhile, $v^2, \dots, v^{d-1}$ are strict saddle points, and $v^d$ is the global minimum.
Given the optimization landscape of (1.3), the behavior of VSGD algorithms, including Oja’s and stochastic generalized Hebbian algorithms (SGHA) for streaming PCA, is already well understood (Chen et al., 2017). For MSGD, however, the additional momentum term makes the theoretical analysis much more challenging. Specifically, we consider a variant of SGHA with Polyak’s momentum. Recall that we are given a streaming data set $\{X_k\}_{k=1}^{\infty}$ drawn independently from some zero-mean distribution $\mathcal{D}$. At the $k$-th iteration, the algorithm takes
(2.1) $$v_{k+1} = v_k + \mu (v_k - v_{k-1}) + \eta (I - v_k v_k^\top) \Sigma_k v_k,$$
where $\Sigma_k = X_k X_k^\top$ and $\mu (v_k - v_{k-1})$ is the momentum with a parameter $\mu \in [0, 1)$. When $\mu = 0$, (2.1) reduces to SGHA. A detailed derivation of standard SGHA can be found in Chen et al. (2017). We remark that though we focus on Polyak’s momentum, extending our theoretical analysis to Nesterov’s momentum is straightforward (Nesterov, 1983).
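To make the iteration concrete, the following sketch runs SGHA with Polyak's momentum on a synthetic stream. It assumes that update (2.1) takes the form $v_{k+1} = v_k + \mu(v_k - v_{k-1}) + \eta (I - v_k v_k^\top) X_k X_k^\top v_k$. The names `sgha_momentum_step` and `run_streaming_pca`, the uniform data distribution, and all numerical values are hypothetical and chosen only for illustration; they are not the settings used in our experiments.

```python
import math
import random

def sgha_momentum_step(v, v_prev, x, eta, mu):
    """Return v + mu*(v - v_prev) + eta*(I - v v^T) x x^T v for one sample x."""
    xv = sum(xi * vi for xi, vi in zip(x, v))          # x^T v
    g = [xi * xv for xi in x]                          # x x^T v
    vg = sum(vi * gi for vi, gi in zip(v, g))          # v^T x x^T v
    return [vi + mu * (vi - pi) + eta * (gi - vg * vi)
            for vi, pi, gi in zip(v, v_prev, g)]

def run_streaming_pca(half_widths, eta=0.002, mu=0.5, steps=20000, seed=0):
    """Stream bounded zero-mean samples with independent uniform coordinates."""
    rng = random.Random(seed)
    d = len(half_widths)
    v = [1.0 / math.sqrt(d)] * d                       # start on the unit sphere
    v_prev = list(v)                                   # equal first iterates: no initial momentum
    for _ in range(steps):
        x = [rng.uniform(-a, a) for a in half_widths]  # covariance diag(a^2 / 3)
        v, v_prev = sgha_momentum_step(v, v_prev, x, eta, mu), v
    return v

v = run_streaming_pca([2.0, 1.0, 0.5])
norm = math.sqrt(sum(vi * vi for vi in v))
print("alignment with top eigenvector:", abs(v[0]) / norm)
```

Because the coordinates are independent and the first one has the largest variance, the top eigenvector in this toy setup is the first standard basis vector, so the printed ratio measures how well the iterate aligns with the global optimum.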
3 Analyzing Global Dynamics by ODE
We first analyze the global dynamics of Momentum SGD (MSGD) based on a diffusion approximation framework. Roughly speaking, by taking $\eta \to 0$, the continuous-time interpolation of the iterations $\{v_k\}_{k=0}^{\infty}$, which can be treated as a stochastic process with càdlàg paths (right continuous with left-hand limits), becomes a continuous stochastic process. For MSGD, this continuous process follows an ODE with an analytical solution. Such a solution helps us understand how the momentum affects the global dynamics. We remark that $\mu$ is a fixed constant in our analysis.
More precisely, define the continuous-time interpolation $V^{\eta}(\cdot)$ of the solution trajectory of the algorithm as follows: For $t \ge 0$, set $V^{\eta}(t) = v_k$ on the time interval $[k\eta, k\eta + \eta)$. Throughout our analysis, similar notations apply to other interpolations (e.g., $H^{\eta}(t)$, $U^{\eta}(t)$). We then answer the following question: Does the solution trajectory sequence $\{V^{\eta}(\cdot)\}_{\eta}$ converge weakly as $\eta$ goes to zero? If so, what is the limit? This question has been studied for SGD in Chen et al. (2017). They use the Infinitesimal Perturbed Analysis (IPA) technique to show that under some regularity conditions, $V^{\eta}(\cdot)$ converges weakly to a solution of the following ODE:
$$\dot{V} = \Sigma V - V^\top \Sigma V \, V.$$
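This method, however, cannot be applied to analyze MSGD due to the additional momentum term. Here, we explain why this method fails. We rewrite the algorithm (2.1) as
$$\begin{pmatrix} v_{k+1} \\ v_k \end{pmatrix} = \begin{pmatrix} v_k + \mu (v_k - v_{k-1}) + \eta (I - v_k v_k^\top) \Sigma_k v_k \\ v_k \end{pmatrix}.$$
One can easily check that $(v_{k+1}, v_k)$ is Markovian. To apply IPA, the infinitesimal conditional expectation (ICE) must converge to a constant. However, the ICE for MSGD, which can be calculated as follows:
$$\frac{\mathbb{E}[v_{k+1} - v_k \mid v_k, v_{k-1}]}{\eta} = \mu \frac{v_k - v_{k-1}}{\eta} + (I - v_k v_k^\top) \Sigma v_k,$$
goes to infinity (blows up). Thus, we cannot apply IPA.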
To address this challenge, we provide a new technique to prove the weak convergence and find the desired ODE. Roughly speaking, we first prove rigorously the weak convergence of the trajectory sequence. Then, with the help of martingale theory, we find the ODE. For self-containedness, we provide a summary of the prerequisite weak convergence theory in Appendix A.
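Before we proceed, we impose the following assumption on the problem: The data points $\{X_k\}_{k=1}^{\infty}$ are drawn independently from a distribution $\mathcal{D}$ in $\mathbb{R}^d$, such that $\mathbb{E}[X] = 0$, $\mathbb{E}[XX^\top] = \Sigma$, and $\|X\| \le C_d$, where $C_d$ is a constant (possibly dependent on $d$). This uniform boundedness assumption can actually be relaxed to the boundedness of the $(4+\delta)$-th order moment ($\delta > 0$) with a careful truncation argument. The proof, however, will be much more involved and beyond the scope of this paper. Thus, we use the uniform boundedness assumption for convenience. Under this assumption, we characterize the global behavior of MSGD as follows. Suppose $v_0 = v_1 \in \mathbb{S}$. Then for each subsequence of $\{V^{\eta}(\cdot)\}_{\eta > 0}$, there exists a further subsequence and a process $V(\cdot)$ such that $V^{\eta}(\cdot) \Rightarrow V(\cdot)$ in the weak sense as $\eta \to 0$ through the convergent subsequence, where $V(\cdot)$ satisfies the following ODE:
(3.1) $$\dot{V} = \frac{1}{1-\mu} \left(\Sigma V - V^\top \Sigma V \, V\right), \quad V(0) = v_0.$$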
Proof Sketch.
To prove this theorem, we first show that the trajectory sequence $\{V^{\eta}(\cdot)\}_{\eta}$ converges weakly. Let $D^d[0, \infty)$ be the space of $\mathbb{R}^d$-valued functions which are right continuous and have left-hand limits for each dimension. By Prokhorov’s Theorem A.2 (in Appendix A), we need to prove tightness, which means that $\{V^{\eta}(\cdot)\}_{\eta}$ is bounded in probability in the space $D^d[0, \infty)$. This can be proved by Theorem A.2 (in Appendix A), which requires the following two conditions: (1) $V^{\eta}(t)$ must be bounded in probability for any $t$, uniformly in the step size $\eta$; (2) the maximal discontinuity (the largest difference between two iterations, i.e., $\max_k \|v_{k+1} - v_k\|$) must go to zero as $\eta$ goes to $0$. Lemma B.1 in Appendix B.1 shows that these two conditions hold for our algorithm.
We next compute the weak limit. For simplicity, we define
We then rewrite the algorithm as follows:
(3.2) 
where The basic idea of the proof is to view (3.2) as a two-timescale algorithm, where is updated with a larger step size and thus under a faster timescale, and is under a slower one. Then we can treat the slower timescale iterate as static and replace the faster timescale iterate by its stable point in terms of this fixed in (3.2). This stable point is , which is shown in Lemma B.1 in Appendix B.1.
We then show that the continuous-time interpolation of the error converges weakly to a Lipschitz continuous martingale with zero initialization. From martingale theory, we know that such a martingale must be constant. Thus, the error sequence converges weakly to zero, and what is left is actually the discretization of ODE (3.1). Please refer to Appendix B.2 for the detailed proof. ∎
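To solve ODE (3.1), we need to rotate the coordinate to decouple each dimension. Under Assumption 2, there exists an orthogonal matrix $Q$ such that
$$\Sigma = Q \Lambda Q^\top,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$. Let $h_k = Q^\top v_k$, $Y_k = Q^\top X_k$, and $\Lambda_k = Q^\top \Sigma_k Q = Y_k Y_k^\top$. Multiply each side of (2.1) by $Q^\top$, and we get
(3.3) $$h_{k+1} = h_k + \mu (h_k - h_{k-1}) + \eta \left(\Lambda_k h_k - h_k^\top \Lambda_k h_k \, h_k\right).$$
After the rotation, $e_1$ is the only global optimum, and $e_2, \dots, e_{d-1}$ are saddles up to sign change. The continuous interpolation of $\{h_k\}$ is $H^{\eta}(t) = Q^\top V^{\eta}(t)$. Then, we rewrite ODE (3.1) as:
(3.4) $$\dot{H} = \frac{1}{1-\mu} \left(\Lambda H - H^\top \Lambda H \, H\right).$$
Here, let $\tilde{\lambda}_i = \lambda_i / (1-\mu)$ for simplicity. The ODE (3.4) is different from that in (4.6) in Chen et al. (2017) by a constant $1/(1-\mu)$. Then we have the following corollary. Suppose $v_0 = v_1 \in \mathbb{S}$. As $\eta \to 0$, $H^{\eta,(i)}(t)$ converges weakly to
$$H^{(i)}(t) = \left(\sum_{j=1}^{d} \left[H^{(j)}(0) \exp(\tilde{\lambda}_j t)\right]^2\right)^{-1/2} H^{(i)}(0) \exp(\tilde{\lambda}_i t), \quad i = 1, \dots, d.$$
Moreover, given $H^{(1)}(0) \neq 0$, $H(t)$ converges to $e_1$ (up to sign change) as $t \to \infty$. Corollary 3 implies that when not initialized at saddle points or minima, the algorithm asymptotically converges to the global optimum. However, such a deterministic ODE-based approach is insufficient to characterize the local algorithmic behavior, since the noise of the stochastic gradient diminishes as $\eta \to 0$. Thus, we resort to the following SDE-based approach for a more precise characterization.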
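The effect of the momentum on the global dynamics can be checked numerically. The sketch below assumes that the rotated ODE (3.4) reads $\dot{H} = \frac{1}{1-\mu}(\Lambda H - H^\top \Lambda H \, H)$ and that its solution is the normalized exponential $H^{(i)}(t) \propto H^{(i)}(0) \exp(\lambda_i t / (1-\mu))$. The helper names `ode_rhs`, `closed_form`, and `time_to_reach`, as well as the eigenvalues and initial point, are hypothetical.

```python
import math

def ode_rhs(h, lam, mu):
    """Right-hand side (Lambda h - (h^T Lambda h) h) / (1 - mu)."""
    quad = sum(l * x * x for l, x in zip(lam, h))
    return [(l * x - quad * x) / (1.0 - mu) for l, x in zip(lam, h)]

def closed_form(h0, lam, mu, t):
    """Normalized exponentials h_i(0) * exp(lam_i * t / (1 - mu))."""
    shift = max(lam)  # common factor removed to avoid overflow; cancels on normalizing
    g = [x * math.exp((l - shift) * t / (1.0 - mu)) for l, x in zip(lam, h0)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def time_to_reach(h0, lam, mu, target, dt=1e-3):
    """Smallest grid time at which the squared first coordinate reaches target."""
    t = 0.0
    while closed_form(h0, lam, mu, t)[0] ** 2 < target:
        t += dt
    return t

h0 = [0.1, math.sqrt(0.99)]      # unit vector that starts close to the saddle e_2
lam = [2.0, 1.0]
t_vsgd = time_to_reach(h0, lam, 0.0, 0.99)
t_msgd = time_to_reach(h0, lam, 0.9, 0.99)
print(t_vsgd, t_msgd, t_vsgd / t_msgd)
```

Under these assumptions the momentum only rescales time by $1/(1-\mu)$, so the two traversal times differ by roughly that factor, up to the grid resolution `dt`.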
4 Analyzing Local Dynamics by SDE
To characterize the local algorithmic behavior, we need to rescale the influence of the noise. For this purpose, we consider the normalized error under the diffusion approximation framework. Different from the previous ODE-based approach, we obtain an SDE approximation here. Intuitively, the previous ODE-based approach is analogous to the Law of Large Numbers for random variables, while the SDE-based approach plays the same role as the Central Limit Theorem. For consistency, we first study the algorithmic behavior around the global optimum.
4.1 Phase III: Around Global Optima
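Recall that all the coordinates are decoupled after the rotation. We directly consider each individual coordinate separately. For the $i$-th coordinate, $i \neq 1$, we define the normalized process $u_k^{\eta,(i)} = h_k^{\eta,(i)} / \sqrt{\eta}$, where $h_k^{\eta,(i)}$ is the $i$-th dimension of $h_k^{\eta}$. Accordingly, $U^{\eta,(i)}(t) = H^{\eta,(i)}(t) / \sqrt{\eta}$. The next theorem characterizes the limiting process of $U^{\eta,(i)}(t)$. As $\eta \to 0$, $U^{\eta,(i)}(\cdot)$ ($i \neq 1$) converges weakly to a stationary solution of
(4.1) $$dU = -\frac{\lambda_1 - \lambda_i}{1-\mu} U \, dt + \frac{\alpha_{i,1}}{1-\mu} \, dB_t,$$
where $\alpha_{i,j} = \sqrt{\mathbb{E}[(Y^{(i)})^2 (Y^{(j)})^2]}$, with $Y^{(i)}$ the $i$-th coordinate of the rotated data $Y = Q^\top X$, is finite by Assumption 3. Note that our analysis is very different from that in Chen et al. (2017) because IPA fails here due to a similar blow-up issue. We remark that our technique mainly relies on Theorem A.6 (in Appendix A) from Kushner and Yin (2003). Since the proof is much more sophisticated and involved than IPA, we only introduce the key technique, Fixed-State-Chain, at a high level.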
Proof Sketch.
Note that the algorithm can be rewritten as
Here, for a vector and an integer , represents the th dimension of . We define
Here, is the accelerated gradient flow, and is the noise. Then the algorithm becomes
and thus we have Note that imply that the noise is a martingale difference sequence.
We then manipulate the algorithm to extract the Markov structure of the algorithm in an explicit form. To make it clear, given , there exists a transition function such that
This comes from the observation that where the randomness only comes from the data when the state
is given. Then the fixed-state-chain refers to the Markov chain with transition function
for a fixed . The state of this Markov chain will be denoted by . We then decompose into
(4.2) 
The error term in (4.1) comes from three sources: (1) Difference between the fixed-state-chain and the limiting process: ; (2) Difference between the accelerated gradient flow and the fixed-state-chain: ; (3) The noise .
We then handle them separately and combine the results together to get the variance of . Then follows: . Together with the fact that around , we further know
(4.3) 
After calculating the variance of , we see that (4.3) is essentially the discretization of SDE (4.1). For detailed proof, please refer to Appendix C.1. ∎
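Since (4.1) is a linear SDE, its first two moments are available in closed form, which makes the role of the momentum easy to inspect numerically. The sketch below assumes that (4.1) has the form $dU = -aU\,dt + b\,dB_t$ with $a = (\lambda_1 - \lambda_i)/(1-\mu)$ and $b = \alpha_{i,1}/(1-\mu)$. The function names `ou_moments` and `stationary_variance`, and the values of the eigengap and the noise level $\alpha$, are hypothetical.

```python
import math

def ou_moments(u0, a, b, t):
    """Mean and variance at time t of dU = -a U dt + b dB with U(0) = u0, a > 0."""
    mean = u0 * math.exp(-a * t)
    var = b * b / (2.0 * a) * (1.0 - math.exp(-2.0 * a * t))
    return mean, var

def stationary_variance(gap, alpha, mu):
    """Long-run variance under the assumed drift and diffusion coefficients."""
    a = gap / (1.0 - mu)         # drift rate (lambda_1 - lambda_i) / (1 - mu)
    b = alpha / (1.0 - mu)       # diffusion coefficient alpha / (1 - mu)
    return b * b / (2.0 * a)     # simplifies to alpha^2 / (2 * gap * (1 - mu))

for mu in (0.0, 0.5, 0.9):
    print(mu, stationary_variance(gap=1.0, alpha=1.0, mu=mu))
```

Under these assumptions the long-run variance of the normalized error grows like $1/(1-\mu)$, so a larger momentum parameter leaves the iterate fluctuating in a wider band around the optimum for the same step size.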
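Note that (4.1) admits an explicit solution, which is known as an OU process (Øksendal, 2003), defined as:
$$U^{(i)}(t) = U^{(i)}(0) \exp\left[-\frac{\lambda_1 - \lambda_i}{1-\mu} t\right] + \frac{\alpha_{i,1}}{1-\mu} \int_0^t \exp\left[\frac{\lambda_1 - \lambda_i}{1-\mu} (s - t)\right] dB(s).$$
Its expectation and variance are:
$$\mathbb{E}\, U^{(i)}(t) = U^{(i)}(0) \exp\left[-\frac{\lambda_1 - \lambda_i}{1-\mu} t\right], \qquad \mathrm{Var}\, U^{(i)}(t) = \frac{\alpha_{i,1}^2}{2 (1-\mu) (\lambda_1 - \lambda_i)} \left[1 - \exp\left(-\frac{2 (\lambda_1 - \lambda_i)}{1-\mu} t\right)\right].$$
We see clearly that the momentum essentially increases the variance of the normalized error by a factor of $1/(1-\mu)$ around the global optimum. Thus, it becomes harder for the algorithm to converge. The next lemma provides a more precise characterization of such a phenomenon. Given a sufficiently small and (under Assumption 3), we need the step size satisfying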
(4.4) 
such that enters the neighborhood of the global optimum with probability at least at some time , i.e., . Note that Chen et al. (2017) choose the step size of VSGD as , which does not satisfy (4.4) for $\mu$ close to $1$. This means that when using the same step size as VSGD, MSGD fails to converge, since the variance increased by the momentum becomes too large. To handle this issue, we have to decrease the step size by a factor of $1 - \mu$, also known as step size annealing, i.e.,
(4.5) 
Then we obtain the following proposition. For a sufficiently small and , there exists some constant , such that after restarting the counter of time, given , we need
to ensure with probability at least . Proposition 4.1 implies the algorithm needs asymptotically at most
iterations to converge to an optimal solution in Phase III. Thus, MSGD does not have an advantage over VSGD in Phase III. We remark that is only used for Phase III. For the other two phases, we can choose
4.2 Phase II: How MSGD Traverses between Stationary Points
For Phase II, we characterize how the algorithm behaves once it has escaped from saddle points. During this period, MSGD is dominated by the gradient, and the influence of the noise is negligible. Thus, the algorithm behaves like an almost deterministic traverse between stationary points, which can be viewed as a two-step discretization of the ODE with a discretization error (Griffiths and Higham, 2010). Thus, we can use the ODE approximation to study the algorithm before it enters the neighborhood of the optimum. By Corollary 3, we obtain the following proposition. After restarting the counter of time, for sufficiently small , , we need
such that .
When in Proposition 4.1 is small enough, we can choose , which is the same as SGD (much larger than (4.5) for $\mu$ close to $1$), and this result implies that the algorithm needs asymptotically at most
iterations to traverse between stationary points. Clearly, MSGD is faster than SGD by a factor of $1/(1-\mu)$ in Phase II, when using the same step size. This is because the algorithm can make more progress along the descent direction with the help of the momentum.
4.3 Phase I: Escaping from Saddle Points
At last, we study the algorithmic behavior around saddle points . By the same SDE approximation technique used in Section 4.1, we obtain the following theorem. Condition on the event that for . Then for , converges weakly to a solution of
We remark that is only a technical assumption. This does not cause any issue since when is large, or equivalently is smaller than (), MSGD has escaped from the saddle point , which is out of Phase I.
Theorem 4.3 implies that for , the process defined by the equation above is an unstable OU process, which goes to infinity. Thus, the algorithm will not be trapped around saddle points. Then we obtain the following proposition.
Given a prespecified , , and , then the following result holds: We need at most
(4.6) 
such that with probability at least , where
is the CDF of the standard normal distribution. Proposition 4.3 suggests that we need asymptotically iterations to escape from saddle points. Thus, when using the same step size, MSGD can escape from saddle points in fewer iterations than SGD by a factor of . This is due to the fact that the momentum can greatly increase the variance and perturb the algorithm more violently. Thus, it becomes harder to stay around saddle points. Moreover, the momentum also encourages more aggressive exploitation, and in each iteration, the algorithm makes more progress along the descent direction by a factor of $1/(1-\mu)$.
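The escaping behavior can be illustrated with a short calculation. The sketch below assumes that, near a saddle point, the normalized process follows an unstable linear SDE $dU = aU\,dt + b\,dB_t$ with $U(0) = 0$, $a = (\lambda_i - \lambda_j)/(1-\mu)$ and $b = \alpha_{i,j}/(1-\mu)$, and it declares escape once $|\sqrt{\eta}\, U(T)| \ge \delta$ holds with probability at least $1 - \nu$. The resulting expression is meant to mirror the structure of (4.6) rather than to reproduce it exactly; the function names `escape_time` and `escape_iterations` and all numerical values are hypothetical.

```python
import math
from statistics import NormalDist

def escape_time(gap, alpha, mu, eta, delta, nu):
    """Continuous time T at which P(|sqrt(eta) * U(T)| >= delta) reaches 1 - nu."""
    c = NormalDist().inv_cdf((1.0 + nu) / 2.0)   # quantile of the standard normal
    a = gap / (1.0 - mu)                         # growth rate of the unstable OU process
    b = alpha / (1.0 - mu)                       # diffusion coefficient
    # Var U(T) = b^2 / (2a) * (exp(2aT) - 1); solve eta * Var U(T) = (delta / c)^2 for T.
    return math.log(1.0 + 2.0 * a * delta ** 2 / (eta * c ** 2 * b ** 2)) / (2.0 * a)

def escape_iterations(gap, alpha, mu, eta, delta, nu):
    """Number of iterations corresponding to the continuous time T."""
    return escape_time(gap, alpha, mu, eta, delta, nu) / eta

for mu in (0.0, 0.5, 0.9):
    print(mu, escape_iterations(gap=1.0, alpha=1.0, mu=mu, eta=1e-4, delta=0.1, nu=0.1))
```

In this simplified model the leading factor of the escape time scales with $1 - \mu$, with an additional logarithmic correction, so a larger momentum parameter shortens the time spent near the saddle point for a fixed step size.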
5 Some Insights on Training DNNs
The streaming PCA problem is closely related to optimization for deep neural networks (DNNs) in many respects. Existing literature has shown that the optimization landscape of training DNNs, though much more complicated and difficult to analyze, consists of similar basic geometric structures, such as saddle points and local optima. Thus, our theoretical characterization of the algorithmic behavior of MSGD around saddle points and local optima (for the streaming PCA problem) can provide new insights into how MSGD behaves in training DNNs. Choromanska et al. (2015); Dauphin et al. (2014); Kawaguchi (2016); Hardt and Ma (2016) suggest that there are a combinatorially large number of saddle points and many local optima in training DNNs.
Under certain oversimplified conditions, they prove: When the size of the network is large enough, most local optima are equivalent and yield similar generalization performance; moreover, the probability of reaching a “spurious/bad” local optimum (which does not generalize well), though not zero, decreases exponentially fast as the size of the network gets larger. Thus, they suspect that the major computational challenge of training DNNs should be “how efficient an algorithm is when escaping from numerous saddle points.” From this aspect, our Proposition 4.3 suggests that MSGD indeed escapes from saddle points faster than VSGD in the presence of negative curvatures.
“No spurious/bad local optima”, however, is often considered an over-optimistic claim. Some results (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Zhang et al., 2017; Neyshabur et al., 2017; Safran and Shamir, 2017) provide empirical and theoretical evidence that the spurious/bad local optima are not completely negligible. Keskar et al. (2016); Zhang et al. (2017); Neyshabur et al. (2017) further suggest that the landscape of these spurious/bad local optima is usually sharp, i.e., their basins of attraction are small and wiggly. From this aspect, our analysis suggests that MSGD with a larger momentum ($\mu$ is very close to $1$) tends to stay in “flat/good local optima”, since the higher variance of the noise introduced by the momentum encourages more exploration outside the small basin of attraction of sharp local optima.
Our analysis also provides some new insights on how to apply the step size annealing technique to MSGD. Specifically, our analysis suggests that at the final stage of the step size annealing, MSGD should use a much smaller step size than that of VSGD. Otherwise, MSGD may be numerically unstable and fail to converge well.
6 Numerical Experiments
We present numerical experiments for both streaming PCA and training deep neural networks. The experiments on streaming PCA verify our theory in Section 4, and the experiments on training deep neural networks verify some of our discussions in Section 5.
6.1 Streaming PCA
We first provide a numerical experiment to verify our theory for streaming PCA. We set and the covariance matrix The optimum is Figure 2 compares the performance of VSGD and MSGD (with and without the step size annealing in Phase III). The initial solution is the saddle point . We choose
and , and decrease the step size of MSGD by a factor after iterations in Fig.2.c. Fig.2.a-c plot the results of 100 simulations, and the vertical axis corresponds to . We can clearly differentiate the three phases of VSGD in Fig.2.a. For MSGD in Fig.2.b, we hardly recognize Phases I and II, since they last for a much shorter time. This is because the momentum significantly helps escape from saddle points and evolve toward global optima. Moreover, we also observe that MSGD without the step size annealing does not converge well, but the step size annealing resolves this issue. All these observations are consistent with our analysis. Fig.2.d plots the optimization errors of these three algorithms averaged over all 100 simulations, and we observe similar results.
6.2 Deep Neural Networks
We then provide three experiments to compare MSGD with VSGD in training a ResNet-18 DNN using the CIFAR-100 dataset for a 100-class image classification task. We choose a batch size of . 50k images are used for training, and the remaining 10k are used for testing. We repeat each experiment for times and report the average.
denote the initial step sizes of MSGD and VSGD, respectively. We decrease the step size by a factor of after , , and epochs. More details on the network architecture and experimental settings can be found in Appendix D.

Experiment 1. The results are shown in Fig.3. Choosing , MSGD achieves better generalization than VSGD.

Experiment 2. The results are shown in Fig.4. Choosing , MSGD achieves similar generalization to VSGD (when ).

Experiment 3. The results are shown in Fig.5. For MSGD, choosing and achieves the optimal generalization (among all possible values). For VSGD, choosing achieves the optimal generalization (among all possible values). We see that the optimal generalization of MSGD is better than that of VSGD. Note that for , MSGD still works well. However, for , VSGD no longer works, and the generalization drops significantly. Specifically, the failure rate of VSGD with is in runs.

Table 1 shows the best performance and the standard deviation (Std) of each experiment setting, which shows that MSGD has a relatively small standard deviation.
Figure 5: Experimental results on CIFAR-100 for training DNNs. The results show that the failure rate of VSGD with is 0.4, while MSGD with and still works in each experiment.
Table 1: Mean and standard deviation of best performance in each setting (columns: Setting, Best, Std).
7 Discussions
The results on training DNNs are expected, or at least partially expected, given our theoretical analysis for streaming PCA. We remark that our experiments (in Fig.5 Bottom) show some inconsistency with an earlier paper, Wilson et al. (2017). Specifically, Wilson et al. (2017) show that MSGD does not outperform VSGD in training a VGG-16 deep neural network using the CIFAR-10 dataset. However, our results show that the momentum indeed improves the training. We suspect that there exist certain structures in the optimization landscape of VGG-16 that marginalize the value of MSGD. In contrast, the optimization landscape of ResNet-18 is more friendly to MSGD than that of VGG-16.
Moreover, we remark that although our theory helps explain some phenomena in training DNNs, there still exist some gaps: (1) Our analysis requires $\eta \to 0$ to do the diffusion approximation. However, the experiments actually use relatively large step sizes at the early stage of training. Though we can expect large and small step sizes to share some similar behaviors, they may lead to very different results. For example, we observe that VSGD can use larger step sizes and achieve similar generalization to that of MSGD. However, when MSGD achieves the optimal generalization using , VSGD performs much worse using ; (2) The optimization landscape of DNNs also contains a vast number of high-order saddle points, where our analysis cannot be applied (nor can any existing analysis). How SGD/MSGD behaves in this scenario is still an open theoretical problem.
We also summarize the comparison between our results and related works in Table 2. To the best of our knowledge, Ghadimi and Lan (2016) and Jin et al. (2017) are the only existing works considering nonconvex optimization using momentum.
FOOS  SOOS  SA  SEA  Assumptions  A/N  
Ours  PCA (Constrained Quadratic Program)  A  
Ghadimi and Lan (2016)  LCG/LH/Unconstrained  N  
Jin et al. (2017)  LCG/LH/Unconstrained  N 
We remark that Ghadimi and Lan (2016) only consider convergence to the first order optimal solution, and therefore cannot justify the advantage of the momentum in escaping from saddle points; Jin et al. (2017) only consider a batch algorithm, which cannot explain why the momentum hurts when MSGD converges to optima. Moreover, Jin et al. (2017) need an additional negative curvature exploitation procedure, which is not used in popular Nesterov’s accelerated gradient algorithms.
References
 Chen et al. (2017) Chen, Z., Yang, F. L., Li, C. J. and Zhao, T. (2017). Online multiview representation learning: Dropping convexity for better efficiency. arXiv preprint arXiv:1702.08134 .
 Choromanska et al. (2015) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics.
 Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S. and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems.
 Ge et al. (2016) Ge, R., Lee, J. D. and Ma, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems.
 Ghadimi and Lan (2016) Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156 59–99.

 Griffiths and Higham (2010) Griffiths, D. F. and Higham, D. J. (2010). Numerical methods for ordinary differential equations: initial value problems. Springer Science & Business Media.
 Hardt and Ma (2016) Hardt, M. and Ma, T. (2016). Identity matters in deep learning. arXiv preprint arXiv:1611.04231 .
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation 9 1–42.
 Jin et al. (2017) Jin, C., Netrapalli, P. and Jordan, M. I. (2017). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 .
 Kawaguchi (2016) Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 .
 Krichene and Bartlett (2017) Krichene, W. and Bartlett, P. L. (2017). Acceleration and averaging in stochastic mirror descent dynamics. arXiv preprint arXiv:1707.06219 .
 Kushner and Yin (2003) Kushner, H. J. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, stochastic modelling and applied probability, vol. 35.
 Li et al. (2016a) Li, C. J., Wang, Z. and Liu, H. (2016a). Online ICA: Understanding global dynamics of nonconvex optimization via diffusion processes. In Advances in Neural Information Processing Systems.
 Li et al. (2017) Li, Q., Tai, C. and E, W. (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning.
 Li et al. (2016b) Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H. and Zhao, T. (2016b). Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 .
 Liu et al. (2017) Liu, W., Zhang, Y.M., Li, X., Yu, Z., Dai, B., Zhao, T. and Song, L. (2017). Deep hyperspherical learning. In Advances in Neural Information Processing Systems.
 Nesterov (1983) Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$.
 Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D. and Srebro, N. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems.
 Øksendal (2003) Øksendal, B. (2003). Stochastic differential equations. In Stochastic differential equations. Springer, 65–84.
 Polyak (1964) Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4 1–17.
 Robbins and Monro (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics 400–407.
 Safran and Shamir (2017) Safran, I. and Shamir, O. (2017). Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968 .
 Sagitov (2013) Sagitov, S. (2013). Weak convergence of probability measures .
 Sun et al. (2016) Sun, J., Qu, Q. and Wright, J. (2016). A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on. IEEE.
 Wang (2017) Wang, Y. (2017). Asymptotic analysis via stochastic differential equations of gradient descent algorithms in statistical and computational paradigms. arXiv preprint arXiv:1711.09514 .
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N. and Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292 .
 Zhang et al. (2017) Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N. and Poggio, T. (2017). Theory of deep learning III: Generalization properties of SGD. Tech. rep., Center for Brains, Minds and Machines (CBMM).
Appendix A Summary on Weak Convergence and Main Theorems
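Here, we summarize the theory of weak convergence and the theorems used in this paper. Recall that the continuous-time interpolation of the solution trajectory is the piecewise-constant process that equals the $k$th iterate on the time interval $[k\eta, k\eta+\eta)$, where $\eta$ is the step size. It has sample paths in the space of Càdlàg functions (right continuous with left-hand limits) defined on $[0,\infty)$ with values in $\mathbb{R}^d$, that is, the Skorokhod space, denoted by $D^d[0,\infty)$. Thus, the weak convergence we consider here is defined in this space instead of $\mathbb{R}^d$. The special metric in $D^d[0,\infty)$ is called the Skorokhod metric, and the topology generated by this metric is the Skorokhod topology. Please refer to Sagitov (2013); Kushner and Yin (2003) for detailed explanations. The weak convergence in $D^d[0,\infty)$ is defined as follows:
Definition (Weak Convergence in $D^d[0,\infty)$). Let $\mathcal{B}$ be the minimal $\sigma$-field induced by the Skorokhod topology. Let $X_n$ and $X$ be random variables on $D^d[0,\infty)$ defined on a probability space $(\Omega, \mathcal{F}, P)$. Suppose that $P_n$ and $P_X$ are the probability measures on $(D^d[0,\infty), \mathcal{B})$ generated by $X_n$ and $X$. We say $P_n$ converges weakly to $P_X$ ($P_n \Rightarrow P_X$), if for all bounded and continuous real-valued functions $F$ on $D^d[0,\infty)$, the following condition holds:
(A.1) $\mathbb{E} F(X_n) = \int F(x)\, dP_n(x) \;\to\; \mathbb{E} F(X) = \int F(x)\, dP_X(x).$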
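With an abuse of terminology, we say $X_n$ converges weakly to $X$ and write $X_n \Rightarrow X$. Another important definition we need is tightness.
Definition (Tightness). A set of $D^d[0,\infty)$-valued random variables $\{X_n\}$ is said to be tight if for each $\delta > 0$, there is a compact set $B_\delta \subset D^d[0,\infty)$ such that:
(A.2) $\sup_n P\{X_n \notin B_\delta\} \le \delta.$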
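We care about tightness because it provides a powerful way to prove weak convergence, based on the following two theorems.
Theorem (Prokhorov’s Theorem). Under the Skorokhod topology, $\{X_n\}$ is tight in $D^d[0,\infty)$ if and only if it is relatively compact, which means each subsequence contains a further subsequence that converges weakly.
Theorem (Sagitov (2013), Theorem 3.8). A necessary and sufficient condition for $X_n \Rightarrow X$ is that each subsequence of $\{X_n\}$ contains a further subsequence converging weakly to $X$.
Thus, if we can prove that $\{X_n\}$ is tight and that all the further subsequences share the same weak limit $X$, then $X_n$ converges weakly to $X$. However, (A.2) is hard to verify directly, so we usually check another, easier criterion. Let $\mathcal{F}_t^n$ be the $\sigma$-algebra generated by $\{X_n(s), s \le t\}$, and let $\tau$ denote an $\mathcal{F}_t^n$-stopping time.
Theorem (Kushner and Yin (2003), Theorem 3.3, Chapter 7). Let $\{X_n(\cdot)\}$ be a sequence of processes that have paths in $D^d[0,\infty)$. Suppose that for each $\delta > 0$ and each $t$ in a dense set in $[0,\infty)$, there is a compact set $K_{\delta,t}$ in $\mathbb{R}^d$ such that
(A.3) $\inf_n P\{X_n(t) \in K_{\delta,t}\} \ge 1 - \delta,$
and for each positive $T$,
(A.4) $\lim_{\delta \to 0} \limsup_n \sup_{|\tau| \le T} \sup_{s \le \delta} \mathbb{E} \min\left[ |X_n(\tau+s) - X_n(\tau)|, 1 \right] = 0.$
Then $\{X_n(\cdot)\}$ is tight in $D^d[0,\infty)$. This theorem is used in Section 3 to prove tightness of the trajectory of Momentum SGD.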
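To make the objects in this criterion concrete, the following is a minimal Python sketch, not the construction analyzed in this paper. It builds the piecewise-constant (Càdlàg) interpolation of a discrete trajectory and forms a Monte Carlo estimate of the truncated increment appearing in (A.4). Several things here are assumptions made only for illustration: the scalar toy recursion `x[k+1] = (1 - eta) * x[k] + eta * noise`, the function names, and the restriction of the supremum to deterministic grid times rather than general stopping times.

```python
import numpy as np


def interpolate(iterates, eta, t):
    """Piecewise-constant, right-continuous interpolation.

    Returns iterate k for t in [k*eta, (k+1)*eta); holds the last
    iterate once t runs past the end of the trajectory.
    """
    k = int(np.floor(t / eta))
    return iterates[min(k, len(iterates) - 1)]


def simulate_paths(eta, horizon, n_paths, seed=0):
    """Simulate a hypothetical noisy linear recursion (illustration only)."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / eta))
    x = np.zeros((n_paths, n_steps + 1))
    for k in range(n_steps):
        noise = rng.standard_normal(n_paths)
        x[:, k + 1] = (1.0 - eta) * x[:, k] + eta * noise
    return x


def increment_modulus(paths, eta, delta):
    """Estimate sup_tau sup_{0 < s <= delta} E min(|X(tau+s) - X(tau)|, 1).

    Both suprema are taken over grid times (multiples of eta) only,
    and the expectation is replaced by an average over sample paths.
    """
    n_steps = paths.shape[1] - 1
    max_lag = min(int(np.floor(delta / eta + 1e-9)), n_steps)
    worst = 0.0
    for lag in range(1, max_lag + 1):
        diffs = np.abs(paths[:, lag:] - paths[:, :-lag])
        mean_by_tau = np.minimum(diffs, 1.0).mean(axis=0)
        worst = max(worst, float(mean_by_tau.max()))
    return worst


if __name__ == "__main__":
    eta = 0.01
    paths = simulate_paths(eta=eta, horizon=1.0, n_paths=200)
    for delta in (0.5, 0.1, 0.05):
        print(delta, increment_modulus(paths, eta, delta))
```

Because the lags admitted for a smaller `delta` are a subset of those admitted for a larger one, the estimate is nondecreasing in `delta` on a fixed set of sample paths. This is only a numerical sanity check on a toy process. It does not verify (A.4), which also requires the limit superior over the sequence of processes and the supremum over stopping times.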
Finally, we state the theorem we use to prove the SDE approximation. Consider the following algorithm:
(A.5) 
where , and is a martingale difference sequence. Then the normalized process satisfies:
(A.6) 
We further assume that the fixed-state chain exists as in Section 4.1 and use the same notation to denote the fixed-state process. Then we have the following theorem.
Theorem (Kushner and Yin (2003), Theorem 8.1, Chapter 10). Assume the following conditions hold:


For small , is uniformly integrable.

There is a continuous function such that for any sequence of integers satisfying as and each compact set ,