For solving a broad range of large-scale statistical learning problems, e.g., deep learning, nonconvex optimization methods often exhibit favorable computational and statistical efficiency empirically. However, the global dynamics of these nonconvex optimization methods are still poorly understood theoretically. Specifically, it remains largely unexplored why simple optimization algorithms, e.g., stochastic gradient descent (SGD), often exhibit fast convergence towards local minima with desirable statistical accuracy. In this paper, we aim to develop a new analytic framework to theoretically understand this phenomenon.
The dynamics of nonconvex statistical optimization are of central interest to a recent line of work. Specifically, by exploring the local convexity within the basins of attraction, [26, 35, 1, 5, 52, 53, 36, 22, 25, 11, 7, 56, 39, 20, 47, 31, 46, 6, 57, 12, 50, 13, 8, 54, 58, 10, 24, 51, 55, 48, 21, 49] establish local fast rates of convergence towards the desirable local minima for a variety of statistical problems. Most of these characterizations of local dynamics are based on two decoupled ingredients from statistics and optimization: (i) the local (approximately) convex geometry of the objective functions, which is induced by the underlying statistical models, and (ii) adaptation of classical optimization analysis [34, 19] by incorporating the perturbations induced by nonconvex geometry as well as random noise. To achieve global convergence guarantees, they rely on various problem-specific approaches to obtain initializations that provably fall into the basins of attraction. Meanwhile, for some learning problems, such as phase retrieval and tensor decomposition for latent variable models, it is empirically observed that good initializations within the basins of attraction are not essential to the desirable convergence. However, it remains highly challenging to characterize the global dynamics, especially within the highly nonconvex regions outside the local basins of attraction.
In this paper, we address this problem with a new analytic framework based on diffusion processes. In particular, we focus on the concrete example of SGD applied to the tensor decomposition formulation of independent component analysis (ICA). Instead of adapting classical optimization analysis according to local nonconvex geometry, we cast SGD in different phases as diffusion processes, i.e., solutions to stochastic differential equations (SDE), by analyzing the weak convergence from discrete Markov chains to their continuous-time limits [40, 17]. The SDE automatically incorporates the geometry and randomness induced by the statistical model, which allows us to establish the exact dynamics of SGD. In contrast, classical optimization analysis only yields upper bounds on the optimization error, which are unlikely to be tight in the presence of highly nonconvex geometry, especially around the stationary points that have negative curvatures along certain directions. In particular, we identify three consecutive phases of the global dynamics of SGD, which are illustrated in Figure 1.
We consider the most challenging initialization at a stationary point with negative curvatures, which can be cast as an unstable equilibrium of the SDE. Within the first phase, the dynamics of SGD are characterized by an unstable Ornstein-Uhlenbeck process [37, 2], which departs from the initialization at a relatively slow rate and enters the second phase.
Within the second phase, the dynamics of SGD are characterized by the exact solution to an ordinary differential equation. This solution evolves towards the desirable local minimum at a relatively fast rate until it approaches a small basin around the local minimum.
More related work. Our results are connected with a very recent line of work [18, 42, 43, 44, 45, 27, 29, 3, 38] on the global dynamics of nonconvex statistical optimization. In detail, they characterize the global geometry of nonconvex objective functions, especially around their saddle points or local maxima. Based on the geometry, they prove that specific optimization algorithms, e.g., SGD with artificial noise injection, gradient descent with random initialization, and second-order methods, avoid the saddle points or local maxima, and globally converge to the desirable local minima. Among these results, our results are most related to , which considers SGD with noise injection on ICA. Compared with this line of work, our analysis takes a completely different approach based on diffusion processes, which is also related to another line of work [14, 15, 41, 30, 33, 32].
Without characterizing the global geometry, we establish the global exact dynamics of SGD, which illustrate that, even starting from the most challenging stationary point, it may be unnecessary to use additional techniques such as noise injection, random initialization, and second-order information to ensure the desirable convergence. In other words, the unstable Ornstein-Uhlenbeck process within the first phase itself is powerful enough to escape from stationary points with negative curvatures. This phenomenon is not captured by the previous upper bound-based analysis, since previous upper bounds are relatively coarse-grained compared with the exact dynamics, which naturally give a sharp characterization simultaneously from upper and lower bounds. Furthermore, in Section 5 we will show that our sharp diffusion process-based characterization provides an understanding of the different phases of the dynamics of our online/SGD algorithm for ICA.
A recent work analyzes an online principal component analysis algorithm based on the intuition gained from diffusion approximation. In this paper, we consider a different statistical problem with a rigorous characterization of the diffusion approximations in three separate phases.
Our contribution. In summary, we propose a new analytic paradigm based on diffusion processes for characterizing the global dynamics of nonconvex statistical optimization. For SGD on ICA, we identify the aforementioned three phases for the first time. Our analysis is based on Stroock and Varadhan’s weak convergence of Markov chains to diffusion processes, which is of independent interest.
In this section we formally introduce a special model of independent component analysis (ICA) and the associated SGD algorithm. Let be the data sample identically distributed as . We make assumptions for the distribution of as follows. Let be the
-norm of a vector.
There is an orthonormal matrix such that , where is a random vector that has independent entries satisfying the following conditions:
(i) The distribution of each is symmetric about 0;
(ii) There is a constant such that ;
(iii) The are independent with identical moments for , denoted by ;
(iv) The , , .
Assumption 1(iii) above is a generalization of i.i.d. tensor components. Let
whose columns form an orthonormal basis. Our goal is to estimate the orthonormal basis from online data . We first establish a preliminary lemma.
Let be the 4th-order tensor whose -entry is . Under Assumption 1, we have
Lemma 1 implies that finding ’s can be cast into the solution to the following population optimization problem
It is straightforward to conclude that all stable equilibria of (2.2) are whose number linearly grows with . Meanwhile, by analyzing the Hessian matrices, the set of unstable equilibria of (2.2) includes (but is not limited to) all , whose number grows exponentially as increases [18, 44].
Now we introduce the SGD algorithm for solving (2.2) with finite samples. Let be the unit sphere in , and denote for the projection operator onto . With appropriate initialization, the SGD for tensor method iteratively updates the estimator via the following Eq. (2.3):
The SGD algorithm, which performs stochastic approximation using a single online data sample in each update, has the advantage of lower time and space complexity, especially when is high [29, 18]. An essential issue of this nonconvex optimization problem is how the algorithm escapes from unstable equilibria.  provides a method of adding artificial noise to the samples, where the noise variables are uniformly sampled from . In our work, we demonstrate that under some reasonable distributional assumptions, the online data provide sufficient noise for the algorithm to escape from the unstable equilibria.
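The projected update described in this section can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the update takes the form u ← Π(u + β · sign · (uᵀx)³ · x), where Π rescales to unit norm and `sign` stands for the sign of the excess fourth moment; the function names, the step size, the identity mixing matrix, and the uniform data distribution are all hypothetical choices made for the example.

```python
import math
import random


def project_to_sphere(u):
    """Rescale u to unit Euclidean norm (projection onto the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]


def online_ica_sgd(samples, u0, step, kurtosis_sign):
    """Run one pass of projected SGD over an iterable of samples.

    Assumed update (an illustration, not copied from the paper):
        u <- Pi(u + step * kurtosis_sign * (u^T x)^3 * x)
    """
    u = project_to_sphere(u0)
    for x in samples:
        inner = sum(ui * xi for ui, xi in zip(u, x))
        u = project_to_sphere(
            [ui + step * kurtosis_sign * inner ** 3 * xi for ui, xi in zip(u, x)]
        )
    return u


def uniform_samples(n, d, rng):
    """Hypothetical data: independent entries, uniform on [-sqrt(3), sqrt(3)].

    Each entry is symmetric, bounded, has unit variance and fourth moment 9/5,
    so (fourth moment - 3) is negative. The mixing matrix is the identity.
    """
    a = math.sqrt(3.0)
    for _ in range(n):
        yield [rng.uniform(-a, a) for _ in range(d)]


rng = random.Random(0)
d = 3
# Start where all coordinates tie, i.e. at an unstable stationary point.
u_start = [1.0 / math.sqrt(d)] * d
u_final = online_ica_sgd(
    uniform_samples(20000, d, rng), u_start, step=0.01, kurtosis_sign=-1.0
)
print([round(x, 3) for x in u_final])
```

No artificial noise is injected in this sketch; any movement away from the starting point comes from the randomness of the samples themselves.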
By symmetry, our algorithm in Eq. (2.3) converges to a uniformly random tensor component from components. In order to solve the problem completely, one can repeatedly run the algorithm using different sets of online samples until all tensor components are found. In the case where is high, the well-known coupon collector problem  implies that it takes runs of the SGD algorithm to obtain all tensor components.
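The coupon collector count can be made concrete with a short calculation. The sketch below assumes that each run independently returns one of the d components uniformly at random and that repeated components can be recognized; under that assumption the expected number of runs needed to see all d components is d · H_d, where H_d is the d-th harmonic number, which grows like d log d. The function name is hypothetical.

```python
import math


def expected_runs(d):
    """Expected number of uniform draws needed to see all d items: d * H_d."""
    return d * sum(1.0 / k for k in range(1, d + 1))


# Compare the exact expectation with the leading-order term d * log(d).
for d in (3, 10, 100, 1000):
    print(d, round(expected_runs(d), 1), round(d * math.log(d), 1))
```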
From Eq. (2.2) we see that the tensor structure in Eq. (2.1) is unidentifiable in the case of ; see [4, 18] for more discussion. Therefore in Assumption 1 we rule out the value and call the value the tensor gap. The reader will see later that, analogous to the eigengap in the SGD algorithm for principal component analysis (PCA) , the tensor gap plays a vital role in the time complexity of the algorithm.
3 Markov Processes and Differential Equation Approximation
To work on the approximation we first conclude the following proposition.
The iteration generated by Eq. (2.3) forms a discrete-time, time-homogeneous Markov process that takes values on . Furthermore, has the strong Markov property.
For convenience of analysis we use the transformed iteration in the rest of this paper. The update equation in Eq. (2.3) is equivalently written as
Here has the same sign as . It is obvious from Proposition 1 that the (strong) Markov property applies to , and one can analyze the iterates generated by Eq. (3.1) from the perspective of Markov processes.
Our next step is to conclude that as the stepsize , the iterates generated by Eq. (2.3), under the time scaling that speeds up the algorithm by a factor , can be globally approximated by the solution to the following ODE system. To characterize such approximation we use the theory of weak convergence to diffusions [40, 17], computing the infinitesimal mean and variance of SGD for the tensor method. We remind the readers of the definition of weak convergence in stochastic processes: for any the following convergence in distribution occurs as
To highlight the dependence on we add it in the superscripts of the iterates . Recall that is the integer part of the real number .
If for each , as converges weakly to some constant scalar then the Markov process converges weakly to the solution of the ODE system
with initial values .
To understand the complex ODE system in Eq. (3.2) we first investigate the case of . Consider a change of variable
we have by the chain rule in calculus and the following derivation:
Eq. (3.3) is an autonomous, first-order ODE for . Although this equation is complex, a closed-form solution is available:
and , where the choices of and depend on the initial value. The above solution allows us to conclude that if the initial vector (resp. ), then it approaches 1 (resp. 0) as . This intuition can be generalized to the case of higher : the ODE system in Eq. (3.2) converges to the coordinate direction if is strictly maximal among in the initial vector. To estimate the time of traverse we establish the following Proposition 2.
Fix and the initial value that satisfies for all , then there is a constant (called traverse time) that depends only on such that Furthermore has the following upper bound: let be the solution to the following auxiliary ODE
with . Let be the time that . Then
In an earlier paper  which focuses on the SGD algorithm for PCA, when the stepsize is small, the algorithm iteration is approximated by the solution to an ODE system after appropriate time rescaling. The approximate ODE system for SGD for PCA is
The analysis there also involves computation of the infinitesimal mean and variance for each coordinate as the stepsize and the theory of convergence to diffusions [40, 17]. A closed-form solution to Eq. (3.6) is obtained in , called the generalized logistic curves. In contrast, to the best of our knowledge a closed-form solution to Eq. (3.2) is generally not available.
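Since a closed form is generally unavailable, the qualitative behavior of the limiting ODE for ICA can be explored numerically. The sketch below is an illustration under an explicit assumption: it takes the right-hand side to be the projected-gradient form dU_k/dt = U_k (U_k² − Σ_i U_i⁴), with the constant prefactor normalized to 1. This form, the function names, the forward-Euler scheme, and the step size are choices made for the example rather than quantities taken from the text above.

```python
def ode_rhs(u):
    """Assumed right-hand side: dU_k/dt = U_k * (U_k^2 - sum_i U_i^4)."""
    s4 = sum(x ** 4 for x in u)
    return [x * (x * x - s4) for x in u]


def integrate(u0, t_end, dt=0.01):
    """Forward-Euler integration of the assumed ODE from time 0 to t_end."""
    u = list(u0)
    for _ in range(int(round(t_end / dt))):
        du = ode_rhs(u)
        u = [x + dt * dx for x, dx in zip(u, du)]
    return u


# d = 2 with the second coordinate strictly larger at time 0.
u_end = integrate([0.6, 0.8], t_end=20.0)
print([round(x, 4) for x in u_end])
```

In this run the initially larger coordinate is driven towards 1 and the other towards 0, in line with the two-dimensional discussion in this section.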
4 Local Approximation via Stochastic Differential Equation
The ODE approximation in Section 3 is very informative: it characterizes globally the trajectory of our algorithm for ICA or tensor method in Eq. (2.3) with approximation errors. However, it fails to characterize the behavior near equilibria where the gradients in our ODE system are close to zero. For instance, if the SGD algorithm starts from , on a microscopic magnitude of the noise generated by online samples helps it escape from a neighborhood of .
Our main goal in this section is to demonstrate that under appropriate spatial and temporal scalings, the algorithm iteration converges locally to the solution to certain stochastic differential equations (SDE). We provide the SDE approximations in two scenarios, separately near an arbitrary tensor component (Subsection 4.1), which indicates that our SGD for tensor method converges to a local minimum at a desirable rate, and a special local maximum (Subsection 4.2), which implies that the stochastic nature of our SGD algorithm for tensor method helps it escape from unstable equilibria. Note that in the algorithm iterates, the escape from stationary points occurs first, followed by the ODE phase and then by the phase of convergence to a local minimum. We discuss this further in Section 5.
4.1 Neighborhood of Local Minimizers
To analyze the behavior of SGD for tensor method we first consider the case where the iterates enter a neighborhood of one local minimizer, i.e. the tensor component. Since the tensor decomposition in Eq. (2.2) is full-rank and symmetric, we consider without loss of generality the neighborhood near the first tensor component. The following Theorem 2 indicates that under appropriate spatial and temporal scalings, the process admits an approximation by an Ornstein-Uhlenbeck process. Such an approximation is characterized rigorously using the weak convergence theory of Markov processes [40, 17]. The readers are referred to  for fundamental topics on SDE.
If for each , converges weakly to as then the stochastic process converges weakly to the solution of the stochastic differential equation
with initial values . Here is a standard one-dimensional Brownian motion.
We identify the solution to Eq. (4.1) as an Ornstein-Uhlenbeck process which can be expressed in terms of an Itô integral, with
Itô isometry along with mean-zero property of Itô integral gives
which, by taking the limit , approaches . From the above analysis we conclude that the Ornstein-Uhlenbeck process has the mean-reverting property that its mean decays exponentially towards 0 with persistent fluctuations at equilibrium.
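The mean-reverting behavior can be checked numerically on a generic Ornstein-Uhlenbeck process. The sketch below simulates dX = −θ X dt + σ dW by the Euler-Maruyama scheme and compares the sample mean and variance at a fixed time with the standard closed forms x₀ e^{−θt} and σ²(1 − e^{−2θt})/(2θ), the latter following from the Itô isometry. The symbols θ, σ, x₀ and their numerical values are hypothetical placeholders and are not the coefficients of Eq. (4.1).

```python
import math
import random


def simulate_ou(theta, sigma, x0, t_end, dt, n_paths, seed=0):
    """Euler-Maruyama samples of X(t_end) for dX = -theta * X dt + sigma dW."""
    rng = random.Random(seed)
    n_steps = int(round(t_end / dt))
    sqrt_dt = math.sqrt(dt)
    finals = []
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            x += -theta * x * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals


theta, sigma, x0, t_end = 1.0, 0.5, 2.0, 5.0  # hypothetical values
finals = simulate_ou(theta, sigma, x0, t_end, dt=0.01, n_paths=2000)

mean_mc = sum(finals) / len(finals)
var_mc = sum((x - mean_mc) ** 2 for x in finals) / (len(finals) - 1)

# Closed forms: exponentially decaying mean, variance saturating at sigma^2 / (2 theta).
mean_exact = x0 * math.exp(-theta * t_end)
var_exact = sigma ** 2 * (1.0 - math.exp(-2.0 * theta * t_end)) / (2.0 * theta)
print(mean_mc, mean_exact)
print(var_mc, var_exact)
```

The Monte Carlo estimates agree with the closed forms only up to sampling and discretization error, so the comparison is indicative rather than exact.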
4.2 Escape from Unstable Equilibria
In this subsection we consider SGD for tensor method that starts from a sufficiently small neighborhood of a special unstable equilibrium. We show that after appropriate rescalings of both time and space, the SGD for tensor iteration can be approximated by the solution to a second SDE. Analyzing the approximate SDE suggests that our SGD algorithm iterations can escape from the unstable equilibria (including local maxima and stationary points with negative curvatures), whereas the traditional gradient descent (GD) method gets stuck. In other words, under weak distributional assumptions the stochastic gradient plays a vital role in the escape. As an illustrative example, we consider the special stationary points . Consider a submanifold where
In words, consists of all where the maximum of is not unique. In the case of , it is illustrated by Figure 1 that is the frame of a 3-dimensional box, and hence we call the frame. Let
The reason we study is that these functions of form a local coordinate map around and further characterize the distance between and on a spatial scale of . We define the positive constant as
We have our second SDE approximation result as follows.
We can solve Eq. (4.5) and obtain an unstable Ornstein-Uhlenbeck process as
Let be defined as
We conclude that the following holds.
is a normal variable with mean and variance ;
When is large has the following approximation
To verify (i) above we have the Itô integral in Eq. (4.6)
and by using Itô isometry
The analysis above on the unstable Ornstein-Uhlenbeck process indicates that the process has the momentum nature that when is large, it can be regarded as at a normally distributed location centered at 0 and grows exponentially. In Section 5 we will see how the result in Theorem 3 provides an explanation of the escape from unstable equilibria.
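The exponential growth from a normally distributed location can be illustrated on a generic unstable Ornstein-Uhlenbeck process dX = θ X dt + σ dW with X(0) = 0. By the Itô isometry, the rescaled variable e^{−θt} X(t) has variance σ² ∫₀ᵗ e^{−2θs} ds = σ²(1 − e^{−2θt})/(2θ), which saturates at σ²/(2θ). The sketch below checks the closed form against a midpoint-rule quadrature and adds a heuristic escape-time estimate; θ, σ, the function names, and the target level are hypothetical and are not the constants of Eq. (4.5).

```python
import math


def rescaled_variance_closed_form(theta, sigma, t):
    """Var[exp(-theta t) X(t)] for dX = theta X dt + sigma dW with X(0) = 0."""
    return sigma ** 2 * (1.0 - math.exp(-2.0 * theta * t)) / (2.0 * theta)


def rescaled_variance_quadrature(theta, sigma, t, n=20000):
    """Same quantity via Ito isometry, using the midpoint rule on [0, t]."""
    h = t / n
    total = sum(math.exp(-2.0 * theta * (i + 0.5) * h) for i in range(n))
    return sigma ** 2 * h * total


def escape_time(theta, sigma, level):
    """Heuristic: time for the limiting std, amplified by exp(theta t), to reach `level`."""
    sd_limit = sigma / math.sqrt(2.0 * theta)
    return math.log(level / sd_limit) / theta


theta, sigma = 1.0, 0.01  # hypothetical drift and noise coefficients
print(rescaled_variance_closed_form(theta, sigma, 5.0))
print(rescaled_variance_quadrature(theta, sigma, 5.0))
print(escape_time(theta, sigma, level=1.0))
```

The heuristic escape time depends on the noise level only through a logarithm, which is one way to see why small sample noise can suffice to leave an unstable equilibrium.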
5 Phase Analysis
In this section, we utilize the weak convergence results in Sections 3 and 4 to understand the dynamics of online ICA in different phases. For purposes of illustration and brevity, we restrict ourselves to the case of starting point , a local maximum that has negative curvatures in every direction. Below we denote by as when the limit of ratio .
Phase I (Escape from unstable equilibria).
Phase II (Deterministic traverse).
By the (strong) Markov property we can restart the iteration counter, and we have for the max and second max
Proposition 2 implies that it takes time
for the ODE to traverse from for . Converting to the timescale of the SGD, the second phase has the following relations as
Phase III (Convergence to stable equilibria).
6 Summary and discussions
In this paper, we take online ICA as a first step towards understanding the global dynamics of stochastic gradient descent. For general nonconvex optimization problems such as training deep networks, phase retrieval, dictionary learning and PCA, we expect a similar multiple-phase phenomenon. It is believed that the flavor of asymptotic analysis above can help identify a class of stochastic algorithms for nonconvex optimization with statistical structure.
Our continuous-time analysis also reflects the dynamics of the algorithm in discrete time. This is substantiated by Theorems 1, 2 and 3 which rigorously characterize the convergence of iterates to ODE or SDE by shifting to different temporal and spatial scales. In detail, our results imply when :
Phase I takes iteration number ;
Phase II takes iteration number ;
Phase III takes iteration number .
After the three phases, the iteration reaches a point that is distant on average to one local minimizer. As we have . This implies that the algorithm demonstrates the cutoff phenomenon which frequently occurs in discrete-time Markov processes [28, Chap. 18]. In words, Phase II, where the objective value in Eq. (2.2) drops from to , is a short-time phase compared to Phases I and III, so the convergence curve is the one illustrated in the right panel of Figure 1 instead of an exponentially decaying curve. As we have , which suggests that Phase I of escaping from unstable equilibria dominates Phase III by a factor of .
- Agarwal et al.  Agarwal, A., Anandkumar, A., Jain, P. and Netrapalli, P. (2013). Learning sparsely used overcomplete dictionaries via alternating minimization. arXiv preprint arXiv:1310.7991.
- Aldous  Aldous, D. (1989). Probability approximations via the Poisson clumping heuristic. Applied Mathematical Sciences, 77.
- Anandkumar and Ge  Anandkumar, A. and Ge, R. (2016). Efficient approaches for escaping higher order saddle points in non-convex optimization. arXiv preprint arXiv:1602.05908.
- Anandkumar et al. [2014a] Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M. and Telgarsky, M. (2014a). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15 2773–2832.
- Anandkumar et al. [2014b] Anandkumar, A., Ge, R. and Janzamin, M. (2014b). Analyzing tensor power method dynamics in overcomplete regime. arXiv preprint arXiv:1411.1488.
- Arora et al.  Arora, S., Ge, R., Ma, T. and Moitra, A. (2015). Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778.
- Balakrishnan et al.  Balakrishnan, S., Wainwright, M. J. and Yu, B. (2014). Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156.
- Bhojanapalli et al.  Bhojanapalli, S., Kyrillidis, A. and Sanghavi, S. (2015). Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917.
- Bronshtein and Semendyayev  Bronshtein, I. N. and Semendyayev, K. A. (1998). Handbook of mathematics. Springer.
- Cai et al.  Cai, T. T., Li, X. and Ma, Z. (2015). Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. arXiv preprint arXiv:1506.03382.
- Candès et al.  Candès, E., Li, X. and Soltanolkotabi, M. (2014). Phase retrieval via Wirtinger flow: Theory and algorithms. arXiv preprint arXiv:1407.1065.
- Chen and Candès  Chen, Y. and Candès, E. (2015). Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems.
- Chen and Wainwright  Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.
- Darken and Moody  Darken, C. and Moody, J. (1991). Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems.
- De Sa et al.  De Sa, C., Olukotun, K. and Ré, C. (2014). Global convergence of stochastic gradient descent for some non-convex matrix problems. arXiv preprint arXiv:1411.1134.
- Durrett  Durrett, R. (2010). Probability: Theory and examples. Cambridge University Press.
- Ethier and Kurtz  Ethier, S. N. and Kurtz, T. G. (1985). Markov processes: Characterization and convergence, vol. 282. John Wiley & Sons.
- Ge et al.  Ge, R., Huang, F., Jin, C. and Yuan, Y. (2015). Escaping from saddle points — online stochastic gradient for tensor decomposition. arXiv preprint arXiv:1503.02101.
- Golub and Van Loan  Golub, G. H. and Van Loan, C. F. (2012). Matrix computations. JHU Press.
- Gu et al.  Gu, Q., Wang, Z. and Liu, H. (2014). Sparse PCA with oracle property. In Advances in neural information processing systems.
- Gu et al.  Gu, Q., Wang, Z. and Liu, H. (2016). Low-rank and sparse structure pursuit via alternating minimization. In International Conference on Artificial Intelligence and Statistics.
- Hardt  Hardt, M. (2014). Understanding alternating minimization for matrix completion. In Foundations of Computer Science.
- Hirsch et al.  Hirsch, M. W., Smale, S. and Devaney, R. L. (2012). Differential equations, dynamical systems, and an introduction to chaos. Academic Press.
- Jain et al.  Jain, P., Jin, C., Kakade, S. M. and Netrapalli, P. (2015). Computing matrix squareroot via non convex local search. arXiv preprint arXiv:1507.05854.
- Jain and Netrapalli  Jain, P. and Netrapalli, P. (2014). Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087.
- Jain et al.  Jain, P., Netrapalli, P. and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing.
- Lee et al.  Lee, J. D., Simchowitz, M., Jordan, M. I. and Recht, B. (2016). Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915.
- Levin et al.  Levin, D. A., Peres, Y. and Wilmer, E. L. (2009). Markov chains and mixing times. American Mathematical Society.
- Li et al.  Li, C. J., Wang, M., Liu, H. and Zhang, T. (2016). Near-optimal stochastic approximation for online principal component estimation. arXiv preprint arXiv:1603.05305.
- Li et al.  Li, Q., Tai, C. et al. (2015). Dynamics of stochastic gradient algorithms. arXiv preprint arXiv:1511.06251.
- Loh and Wainwright  Loh, P.-L. and Wainwright, M. J. (2015). Regularized -estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16 559–616.
- Mandt et al.  Mandt, S., Hoffman, M. D. and Blei, D. M. (2016). A variational analysis of stochastic gradient algorithms. arXiv preprint arXiv:1602.02666.
- Mobahi  Mobahi, H. (2016). Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114.
- Nesterov  Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, vol. 87. Springer.
- Netrapalli et al.  Netrapalli, P., Jain, P. and Sanghavi, S. (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems.
- Netrapalli et al.  Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A. and Jain, P. (2014). Non-convex robust pca. In Advances in Neural Information Processing Systems.
- Oksendal  Oksendal, B. (2003). Stochastic differential equations. Springer.
- Panageas and Piliouras  Panageas, I. and Piliouras, G. (2016). Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405.
- Qu et al.  Qu, Q., Sun, J. and Wright, J. (2014). Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems.
- Stroock and Varadhan  Stroock, D. W. and Varadhan, S. S. (1979). Multidimensional diffusion processes, vol. 233. Springer.
- Su et al.  Su, W., Boyd, S. and Candès, E. (2014). A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems.
- Sun et al. [2015a] Sun, J., Qu, Q. and Wright, J. (2015a). Complete dictionary recovery over the sphere i: Overview and the geometric picture. arXiv preprint arXiv:1511.03607.
- Sun et al. [2015b] Sun, J., Qu, Q. and Wright, J. (2015b). Complete dictionary recovery over the sphere ii: Recovery by Riemannian trust-region method. arXiv preprint arXiv:1511.04777.
- Sun et al. [2015c] Sun, J., Qu, Q. and Wright, J. (2015c). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.
- Sun et al.  Sun, J., Qu, Q. and Wright, J. (2016). A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664.
- Sun and Luo  Sun, R. and Luo, Z.-Q. (2015). Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science.
- Sun et al. [2015d] Sun, W., Lu, J., Liu, H. and Cheng, G. (2015d). Provable sparse tensor decomposition. arXiv preprint arXiv:1502.01425.
- Sun et al. [2015e] Sun, W., Wang, Z., Liu, H. and Cheng, G. (2015e). Non-convex statistical optimization for sparse tensor graphical model. In Advances in Neural Information Processing Systems 28.
- Tan et al.  Tan, K. M., Wang, Z., Liu, H. and Zhang, T. (2016). Sparse generalized eigenvalue problem: Optimal statistical rates via truncated rayleigh flow. arXiv preprint arXiv:1604.08697.
- Tu et al.  Tu, S., Boczar, R., Soltanolkotabi, M. and Recht, B. (2015). Low-rank solutions of linear matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566.
- Wang et al.  Wang, Z., Gu, Q., Ning, Y. and Liu, H. (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems.
- Wang et al. [2014a] Wang, Z., Liu, H. and Zhang, T. (2014a). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of statistics, 42 2164.
- Wang et al. [2014b] Wang, Z., Lu, H. and Liu, H. (2014b). Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352.
- White et al.  White, C. D., Sanghavi, S. and Ward, R. (2015). The local convexity of solving systems of quadratic equations. arXiv preprint arXiv:1506.07868.
- Yang et al.  Yang, Z., Wang, Z., Liu, H., Eldar, Y. C. and Zhang, T. (2015). Sparse nonlinear regression: Parameter estimation and asymptotic inference under nonconvexity. arXiv preprint arXiv:1511.04514.
- Zhang et al.  Zhang, Y., Chen, X., Zhou, D. and Jordan, M. I. (2014). Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems.
- Zhao et al.  Zhao, T., Wang, Z. and Liu, H. (2015). A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems.
- Zheng and Lafferty  Zheng, Q. and Lafferty, J. (2015). A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. arXiv preprint arXiv:1506.06081.
A.1 Proof of Lemma 1
We only need to show
Note due to the following well-known expansion 
where the summations above iterate through all monomial terms. Plugging in and taking expectations, we conclude that under Assumption 1
Note that from the constraint of our optimization problem Eq. (2.2), we have
A.2 Proof of Proposition 1
Let be the -field filtration generated by the iteration , viewed as a stochastic process. From the recursion equation in Eq. (2.3) we have a Markov transition kernel such that for each Borel set
A.3 Proof of Theorem 1
We first use the standard one-step analysis and conclude the following proposition, whose proof is deferred to Subsection C.1.
For brevity let and , separately. Under Assumption 1, when
for each and we have the following:
There exists a random variable that depends solely on with almost surely, such that the increment can be represented as
The increment of on coordinate has the following bound
There exists a deterministic function with , such that the conditional expectation of the increment is
In Proposition 3, (i) characterizes the relationship between the increment on and the online sample, and (ii) bounds such increment. From (iii) we can compute the infinitesimal mean and variance for SGD for tensor method and conclude that as the stepsize , the iterates generated by Eq. (2.3), under the time scaling that speeds up the algorithm by a factor , can be globally approximated by the solution to the following ODE system in Eq. (3.2) as
To characterize such approximation we use theory of weak convergence to diffusions [40, 17]. We remind the readers of the definition of weak convergence in stochastic processes: for any the following convergence in distribution occurs as
To highlight the dependence on we add it in the superscripts of the iterates .
Proof of Theorem 1.
Let . Proposition 3 implies for coordinate satisfies
Using Eq. (A.6) we have the infinitesimal variance
which tends to 0 as . Let be the solution to ODE system Eq. (3.2) with initial values . Applying standard infinitesimal generator argument [17, Corollary 4.2 in Sec. 7.4] one can conclude that as , the Markov process converges weakly to .
A.4 Proof of Proposition 2
For simplicity we denote in the proofs that the initial value . Also, throughout this subsection we assume without loss of generality that is maximal among , , and furthermore
For that satisfies Eq. (A.8), then we have for all
We compare the coordinate between two distinct coordinates and have by calculus that for all