We often encounter the following optimization problem in machine learning, signal processing, and stochastic control:
is a loss function, denotes a feasible set, is the number of constraints, and ’s are the differentiable functions that impose constraints on the model parameters. For notational simplicity, we define and
. Principal component analysis (PCA), canonical correlation analysis (CCA), matrix factorization/sensing/completion, phase retrieval, and many other problems (Friedman et al., 2001; Sun et al., 2016; Bhojanapalli et al., 2016; Li et al., 2016b; Ge et al., 2016b; Chen et al., 2017; Zhu et al., 2017) can be viewed as special examples of (1). Many algorithms have been proposed to solve (1). For the unconstrained case () or a simple constraint , e.g., the spherical constraint , we can apply simple first-order algorithms such as the projected gradient descent algorithm (Luenberger et al., 1984).
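As a minimal sketch of the simple-constraint case, the snippet below runs projected gradient descent on the unit sphere, where the projection has a closed form (normalization). The loss f(x) = -x^T A x, the matrix `A`, the step size, and the function name `projected_gradient_sphere` are illustrative assumptions, not taken from the text.

```python
import numpy as np

def projected_gradient_sphere(A, x0, step=0.1, iters=300):
    """Minimize f(x) = -x^T A x subject to ||x||_2 = 1 (illustrative loss)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        grad = -2.0 * A @ x            # gradient of the assumed loss
        x = x - step * grad            # plain gradient step
        x = x / np.linalg.norm(x)      # closed-form projection onto the sphere
    return x

# Hypothetical symmetric matrix; its leading eigenvalue is 3.
A = np.diag([3.0, 1.0, 0.5])
x_hat = projected_gradient_sphere(A, np.ones(3))
rayleigh = float(x_hat @ A @ x_hat)    # approaches the leading eigenvalue
```

The projection step is cheap only because the feasible set is a sphere; for a general constraint set, this normalization has no simple analogue.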
However, when is complicated, the aforementioned algorithms are often not applicable or inefficient. This is because the projection to does not admit a closed form expression and can be computationally expensive in each iteration. To address this issue, we convert (1) to a min-max problem using the Lagrangian multiplier method. Specifically, instead of solving (1), we solve the following problem:
where is the Lagrangian multiplier. is often referred to as the Lagrangian function in the existing literature (Boyd and Vandenberghe, 2004). The optimization literature also refers to as the primal variable and as the dual variable. Accordingly, (1) is called the primal problem. From the perspective of game theory, the two variables can be viewed as two players competing with each other and eventually reaching an equilibrium. When is convex and is convex or the boundary of a convex set, the optimization landscape of (2) is essentially convex-concave, that is, for any fixed , is convex in , and for any fixed , is concave in . Such a landscape further implies that the equilibrium of (2) is a saddle point, whose primal variable is equivalent to the global optimum of (1) under strong duality conditions. To solve (2), we resort to primal-dual algorithms, which iterate over both and (usually in an alternating manner). Global convergence rates to the equilibrium have also been established for these algorithms (Lan et al., 2011; Chen et al., 2014; Iouditski and Nesterov, 2014).
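The alternating primal-dual iteration can be sketched on a toy convex problem: minimize x^2 subject to x - 1 = 0, with Lagrangian L(x, y) = x^2 + y (x - 1). The problem, the step size, and the name `primal_dual_toy` are hypothetical choices made only for illustration; the saddle point is (x, y) = (1, -2).

```python
def primal_dual_toy(step=0.1, iters=500):
    """Alternating gradient descent-ascent on L(x, y) = x**2 + y * (x - 1)."""
    x, y = 0.0, 0.0
    for _ in range(iters):
        x = x - step * (2.0 * x + y)   # descent step on the primal variable
        y = y + step * (x - 1.0)       # ascent step on the dual variable
    return x, y

x_star, y_star = primal_dual_toy()     # tends to the saddle point (1, -2)
```

Here the Lagrangian is convex in x and linear (hence concave) in y, which is the convex-concave regime described above.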
When and are nonconvex, both (1) and (2) become much more computationally challenging, and are NP-hard in general. Significant progress has been made toward solving the primal problem (1). For example, Ge et al. (2015) show that when certain tensor factorization satisfies the so-called strict saddle properties, one can apply some first-order algorithms such as the projected gradient algorithm, and global convergence in polynomial time can be guaranteed. Their results further motivate many follow-up works, proving that many problems can be formulated as strict saddle optimization problems, including PCA, multiview learning, phase retrieval, matrix factorization/sensing/completion, and complete dictionary learning (Sun et al., 2016; Bhojanapalli et al., 2016; Li et al., 2016b; Ge et al., 2016b; Chen et al., 2017; Zhu et al., 2017). Note that these strict saddle optimization problems are either unconstrained or have only a simple spherical constraint. However, for many other nonconvex optimization problems, can be much more complicated. To the best of our knowledge, when is not only nonconvex but also complicated, the applicable algorithms and convergence guarantees are still largely unknown in the existing literature.
To handle the complicated , this paper proposes to investigate the min-max problem (2). Specifically, we first define a special class of Lagrangian functions, where the landscape of enjoys the following good properties:
There exist only two types of equilibria: stable and unstable. At an unstable equilibrium, has negative curvature with respect to the primal variable . More details are given in Section 2.
All stable equilibria correspond to the global optima of the primal problem (1).
Both properties are intuitive. On the one hand, the negative curvature in the first property enables the primal variable to escape from the unstable equilibria along some descent direction. On the other hand, the second property ensures that we do not get spurious local optima of (1), that is, all local minima must also be global optima.
We then study a generalized eigenvalue (GEV) problem, which includes CCA, Fisher discriminant analysis (FDA, Mika et al. (1999)), sufficient dimension reduction (SDR, Cook and Ni (2005)) as special examples. Specifically, GEV solves
where are symmetric and is positive semidefinite. We rewrite (3) as a min-max problem,
where is the Lagrangian multiplier. Theoretically, we show that the Lagrangian function in (4) exactly belongs to our previously defined class. Motivated by our defined landscape structures, we then solve an online version of (4), where we can only access independent unbiased stochastic approximations of and directly accessing and is prohibited. Specifically, at the -th iteration, we only obtain independent and satisfying
Computationally, we propose a simple stochastic primal-dual algorithm, which is a stochastic variant of the generalized Hebbian algorithm (GHA, Gorrell (2006)). Theoretically, we establish the asymptotic rate of convergence of our stochastic GHA (SGHA) to stable equilibria based on diffusion approximations (Kushner and Yin, 2003). Specifically, we show that, asymptotically, the solution trajectory of SGHA weakly converges to the solutions of stochastic differential equations (SDEs). By studying the analytical solutions of these SDEs, we further establish the asymptotic sample/iteration complexity of SGHA under certain regularity conditions (Harold et al., 1997; Li et al., 2016a; Chen et al., 2017). To the best of our knowledge, this is the first asymptotic sample/iteration complexity analysis of a stochastic optimization algorithm for solving the online version of the GEV problem. Numerical experiments are presented to justify our theory.
Our work is closely related to several recent results on solving GEV problems. For example, Ge et al. (2016a) propose a multistage semi-stochastic optimization algorithm for solving GEV problems with a finite sum structure. At each optimization stage, their algorithm needs to access the exact matrix, and compute the approximate inverse of by solving a quadratic program, which is not allowed in our setting. Similar matrix inversion approaches are also adopted by a few other recently proposed algorithms for solving the GEV problem (Allen-Zhu and Li, 2016; Arora et al., 2017). In contrast, our proposed SGHA is a fully stochastic algorithm, which does not require any matrix inversion.
Moreover, our work is also related to several more complicated min-max problems, such as Markov decision processes with function approximation, generative adversarial networks, and multistage stochastic programming and control (Sutton et al., 2000; Shapiro et al., 2009; Goodfellow et al., 2014). Many primal-dual algorithms have been proposed to solve these problems. However, most of these algorithms are not even guaranteed to converge. As mentioned earlier, when the convex-concave structure is missing, the min-max problems go far beyond the existing theories. Moreover, both primal and dual iterations involve sophisticated stochastic approximations (as difficult as, or more difficult than, our online version of GEV). This paper makes an attempt at understanding the optimization landscape of these challenging min-max problems. Taking our results as a starting point, we expect more sophisticated and stronger follow-up works that apply to these min-max problems.
Notations. Given an integer , we denote as a identity matrix, . Given an index set and a matrix , we denote as the complement set of , () as the -th column (row) of , as the -th entry of , and () as the column (row) submatrix of indexed by ,
as the vectorization of , as the column space of , and as the null space of . Given a symmetric matrix , we denote
as its smallest/largest singular value, and denote the eigenvalue decomposition of as , where with . We denote as the spectral norm of . Given two matrices and , we denote as the Kronecker product of and .
2 Characterization of Equilibria
Recall the Lagrangian function in (2). We start by characterizing its equilibria. By the KKT conditions, an equilibrium satisfies
which only contains the first order information of . To further distinguish the difference among the equilibria, we define two types of equilibria by the second order information. Given the Lagrangian function in (2), a point is called:
(1) An equilibrium of , if
(2) An equilibrium is unstable, if is an equilibrium and
(3) An equilibrium is stable, if is an equilibrium, , and is strongly convex over a restricted domain.
Note that (2) in Definition 2 is similar to the strict saddle property over a manifold in Ge et al. (2015). The motivation behind Definition 2 is intuitive. When has negative curvature with respect to the primal variable at an equilibrium, we can find a direction in to further decrease . Therefore, a tiny perturbation can break this unstable equilibrium. An illustrative example is presented in Figure 1. Moreover, at a stable equilibrium , there is restricted strong convexity, which relates to several conditions, e.g., the Polyak-Łojasiewicz condition (Polyak, 1963), i.e.,
for belonging to a small region near and is a constant, or Error Bound conditions (Luo and Tseng, 1993). With this property, we cannot decrease along any direction with respect to . Definition 2 excludes the high order unstable equilibrium, which may exist due to the degeneracy of . Specifically, such a high order unstable equilibrium cannot be identified by the second order information, e.g.,
is an equilibrium with a positive semidefinite Hessian matrix. However, it is an unstable equilibrium, since a small perturbation to can break this equilibrium. Such an equilibrium makes the landscape much more complicated. Overall, we consider a specific class of Lagrangian functions throughout the rest of this paper. They enjoy the following properties:
All equilibria are either stable or unstable (i.e., no high order unstable equilibria);
All stable equilibria correspond to the global optima of the primal problem.
As mentioned earlier, the first property ensures that the second order information can identify the type of equilibria. The second property guarantees that we do not get spurious optima for (1) as long as an algorithm attains a stable equilibrium. Several machine learning problems belong to this class, such as the generalized eigenvalue decomposition problem.
3 Generalized Eigenvalue Decomposition
We consider the generalized eigenvalue (GEV) problem as a motivating example, which includes CCA, FDA, SDR, etc. as special examples. Recall its min-max formulation (4):
Before we proceed, we impose the following assumption on the problem. Given a symmetric matrix and a positive definite matrix , the eigenvalues of , denoted by , satisfy
Such an eigengap assumption avoids the identifiability issue. The full rank assumption on in Assumption 3 ensures that the original constrained optimization problem is bounded. This assumption can be further relaxed, but doing so requires a more involved analysis. We discuss this in Appendix B.
To characterize all equilibria of GEV, we leverage the idea of an invariant group. Li et al. (2016b) use similar techniques for an unconstrained matrix factorization problem. However, their technique does not work for the Lagrangian function due to the more complicated landscape. Therefore, we consider a more general invariant group. Moreover, by analyzing the Hessian matrix of at the equilibria, we demonstrate that each equilibrium is either unstable or stable and the stable equilibria correspond to the global optima of the primal problem (3). Therefore, GEV belongs to the class we defined earlier.
3.1 Invariant Group and Symmetric Property
We first denote the orthogonal group in dimension as
Notice that for any , in (4) has the same landscape as . This further indicates that given an equilibrium , is also an equilibrium. This symmetric property motivates us to characterize the equilibria of with an invariant group.
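The symmetry can be checked numerically. The sketch below assumes the GEV Lagrangian takes the form L(X, Y) = -tr(X^T A X) + <Y, X^T B X - I_r> and that the orthogonal matrix Psi acts as (X, Y) -> (X Psi, Psi^T Y Psi); the symbols `A`, `B`, `X`, `Y`, `Psi` and this exact form are assumptions introduced here, since the notation of (4) is not reproduced in the text.

```python
import numpy as np

def lagrangian(X, Y, A, B):
    """Assumed form: L(X, Y) = -tr(X^T A X) + <Y, X^T B X - I_r>."""
    r = X.shape[1]
    return -np.trace(X.T @ A @ X) + np.sum(Y * (X.T @ B @ X - np.eye(r)))

rng = np.random.default_rng(0)
d, r = 5, 2
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                      # random symmetric matrix
N = rng.standard_normal((d, d))
B = N @ N.T + d * np.eye(d)            # random positive definite matrix
X = rng.standard_normal((d, r))        # arbitrary primal point
Y = rng.standard_normal((r, r))        # arbitrary dual point

# Random r x r orthogonal matrix from a QR factorization.
Psi, _ = np.linalg.qr(rng.standard_normal((r, r)))

lhs = lagrangian(X, Y, A, B)
rhs = lagrangian(X @ Psi, Psi.T @ Y @ Psi, A, B)   # value after the group action
```

Under these assumptions `lhs` and `rhs` agree up to floating-point error at every point, not only at equilibria.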
We introduce several important definitions in group theory (Dummit and Foote, 2004).
Given a group and a set , a map from to is called the group action of on if satisfies the following two properties:
Identity: , where denotes the identity element of .
Compatibility: .
Given a function , a group is a stationary invariant group of with respect to two group actions of , on and on , if satisfies
For notational simplicity, we denote . Given the group , two sets and , we define a group action with of on and a group action of on as
One can check that the orthogonal group is a stationary invariant group of with respect to two group actions of , on and on . By this invariant group, we define the equivalence relation between and , if there exists a such that
To find all equilibria of GEV, we examine the KKT conditions of :
Given the eigenvalue decomposition , we denote
We then consider the eigenvalue decomposition . The following theorem shows the connection between the equilibrium of and the column submatrix of , denoted as , where
is the column index set to determine a column submatrix.
Theorem 3.1 (Symmetric Property). Suppose Assumption 3 holds. Then is an equilibrium of if and only if can be written as
where index and . The proof of Theorem 3.1 is provided in Appendix A.1. Theorem 3.1 implies that there are equilibria of under the equivalence relation given in (5). Each of them corresponds to an , where is the index set. The whole set of equilibria is then generated by these with the transformation matrix and the invariant group action induced by .
3.2 Unstable Equilibrium vs. Stable Equilibrium
We further identify the stable and unstable equilibria. Specifically, given as an equilibrium of , we denote the Hessian matrix of with respect to the primal variable as
Then we calculate the eigenvalues of . By Definition 2, is unstable if has a negative eigenvalue; otherwise, we analyze the local landscape at to determine whether it is stable or not. The following theorem shows that all equilibria are either stable or unstable and demonstrates how the choice of index set corresponds to the unstable and stable equilibria of . Suppose Assumption 3 holds, and is an equilibrium in (4). By Theorem 3.1, can be represented as for some and .
If , then is an unstable equilibrium with
where , and , is the -th leading eigenvalue of
Otherwise, we have Moreover, is a stable equilibrium of min-max problem (4).
, that is, the eigenvectors of corresponding to the largest eigenvalues, is a stable equilibrium of , where Although is degenerate at this equilibrium, all directions in essentially point to the primal variables of other stable equilibria. Excluding these directions, the rest all have positive curvature, which implies that this equilibrium is stable. Moreover, such an corresponds to the optima of (3). When , due to the negative curvature, these equilibria are unstable. Therefore, all stable equilibria of correspond to the global optima in and other equilibria are unstable, which further indicates that GEV belongs to the class we defined earlier.
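The stable/unstable dichotomy can be illustrated in the rank-one case. The sketch below assumes the Lagrangian L(x, y) = -x^T A x + y (x^T B x - 1), so that the primal gradient is 2 (y B x - A x) and the primal Hessian is 2 (y B - A). The symbols and helper names (`generalized_eigh`, `primal_hessian`) are hypothetical and chosen only for this example.

```python
import numpy as np

def generalized_eigh(A, B):
    """Solve A v = lam B v for symmetric A and positive definite B."""
    L = np.linalg.cholesky(B)          # B = L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T              # equivalent symmetric standard problem
    lam, U = np.linalg.eigh(C)         # eigenvalues in ascending order
    V = Linv.T @ U                     # columns satisfy V^T B V = I
    return lam[::-1], V[:, ::-1]       # reorder to descending eigenvalues

def primal_hessian(A, B, y):
    """Hessian of the assumed rank-one Lagrangian with respect to x."""
    return 2.0 * (y * B - A)

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                      # random symmetric matrix
N = rng.standard_normal((d, d))
B = N @ N.T + d * np.eye(d)            # random positive definite matrix

lam, V = generalized_eigh(A, B)
# Each pair (v_i, lam_i) satisfies the first-order conditions.
grad_norms = [np.linalg.norm(2.0 * (lam[i] * B @ V[:, i] - A @ V[:, i]))
              for i in range(d)]
# Smallest Hessian eigenvalue at each equilibrium.
min_curv = [np.linalg.eigvalsh(primal_hessian(A, B, lam[i])).min()
            for i in range(d)]
```

In this sketch only the leading generalized eigenvector has no negative curvature; every other eigenvector has a direction of strict descent, namely toward the leading one.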
4 Stochastic Search for Online GEV
For GEV, we propose a fully stochastic primal-dual algorithm to solve (4), which only requires access to the stochastic approximations of and matrices. This is very different from other existing semi-stochastic algorithms that require access to the exact matrix (Ge et al., 2016a). Specifically, we propose a stochastic variant of the generalized Hebbian algorithm (GHA), also referred to as Sanger’s rule in the existing literature (Sanger, 1989), to solve (4). In the online setting, accessing the exact and is prohibitive, and we only get and that are independently sampled from the distribution associated with and at the -th iteration. Our proposed SGHA updates primal and dual variables as follows:
is a step size parameter. Note that the primal update is a stochastic gradient descent step, while the dual update is motivated by the KKT conditions of (4). SGHA is simple and easy to implement. The constraint is naturally handled by the dual update. Further, motivated by the landscape of GEV, we analyze the algorithm by diffusion approximations and obtain the asymptotic sample complexity.
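A rough sketch of a stochastic primal-dual iteration of this kind follows. The update form used here, a primal step X <- X - eta (B_k X Y - A_k X) followed by a dual step Y <- X^T A_k X, is an assumption based on the description above (gradient step on the primal, KKT-motivated dual) and may differ in detail from the update the paper states. The mini-batch sampling, the test matrices, and all names are likewise hypothetical.

```python
import numpy as np

def sample_cov(rng, chol, batch):
    """Unbiased sample covariance from Gaussian draws with covariance chol @ chol.T."""
    z = rng.standard_normal((batch, chol.shape[0])) @ chol.T
    return z.T @ z / batch

def sgha(A, B, r=1, step=0.01, iters=5000, batch=32, seed=0):
    """Assumed SGHA-style iteration; A and B are used only to draw samples."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    LA, LB = np.linalg.cholesky(A), np.linalg.cholesky(B)
    X = rng.standard_normal((d, r))
    X /= np.linalg.norm(X)             # random unit-norm initialization
    Y = np.zeros((r, r))
    for _ in range(iters):
        A_k = sample_cov(rng, LA, batch)   # stochastic approximation of A
        B_k = sample_cov(rng, LB, batch)   # stochastic approximation of B
        X_new = X - step * (B_k @ X @ Y - A_k @ X)  # primal gradient step
        Y = X.T @ A_k @ X                  # dual update from the KKT condition
        X = X_new
    return X, Y

# Hypothetical commuting pair; generalized eigenvalues are 3, 1, 0.5.
A = np.diag([3.0, 2.0, 1.0])
B = np.diag([1.0, 2.0, 2.0])
X_hat, Y_hat = sgha(A, B)
constraint = (X_hat.T @ B @ X_hat).item()            # should be close to 1
rayleigh = (X_hat.T @ A @ X_hat).item() / constraint # should be close to 3
```

No matrix inverse and no projection appear in the loop; the iterate is drawn toward the constraint set only through the dual variable. Sampling here requires `A` to be positive definite so that it can serve as a covariance, which is an extra assumption of this example.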
4.1 Numerical Evaluations
We first provide numerical evaluations to illustrate the effectiveness of SGHA, and then provide an asymptotic convergence analysis of SGHA. We choose and select three different settings:
, , , and ;
, and randomly generate an orthogonal matrix such that and ;
, , and randomly generate two orthogonal matrices such that and .
At the -th iteration of SGHA, we independently sample random vectors from and respectively. Accordingly, we compute the sample covariance matrices and as the approximations of and . We repeat numerical simulations under each setting for times using random data generations, and present all results in Figure 2. The horizontal axis corresponds to the number of iterations, and the vertical axis corresponds to the optimization error
Our experiments indicate that SGHA converges to a global optimum in all settings.
4.2 Convergence Analysis for Commutative and
As a special case, we first prove the convergence of SGHA for GEV with , where and are commutative. We will discuss the noncommutative case and in the next section. Before we proceed, we introduce our assumptions on the problem. We assume that the following conditions hold:
(a): ’s and ’s are independently sampled from two different distributions and respectively, where and ;
(b): and are commutative, i.e., there exists an orthogonal matrix such that and , where and are diagonal matrices with ;
(c): satisfy the moment conditions, that is, for some generic constants and , and .
We remark that (8) is very different from existing optimization algorithms over the generalized Stiefel manifold. Specifically, computing the gradient over the generalized Stiefel manifold requires , which is not allowed in our setting. For notational convenience, we further denote
Without loss of generality, we assume and . Note that and , however, are not necessarily monotonic. We denote
Denote . One can verify that (8) can be rewritten as follows:
where and Note that corresponds to the optimal solution of (3).
By diffusion approximation, we show that our algorithm converges through three phases:
Phase I: Given an initial near a saddle point, we show that after properly rescaling time, the algorithm can be characterized by a stochastic differential equation (SDE). Such an SDE further implies that our algorithm can escape from the saddle quickly;
Phase II: We show that away from the saddle, the trajectory of our algorithm can be approximated by an ordinary differential equation (ODE);
Phase III: We first show that after Phase II, the norm of solution converges to a constant. Then, the algorithm can be characterized by an SDE, like Phase I. By the SDE, we analyze the error fluctuation when the solution is within a small neighborhood of the global optimum.
Overall, we obtain an asymptotic sample complexity.
ODE Characterization: To demonstrate an ODE characterization for the trajectory of our algorithm, we introduce a continuous time random process
where and is the step size in (8). For notational simplicity, we drop when it is clear from the context. Instead of showing a global convergence of , we show that the quantity
converges to an exponential decay function, where is the -th component (coordinate) of .
Suppose that Assumption 4.2 holds and the initial solution is away from any saddle point, i.e., given pre-specified constants, and , there exist such that
As , weakly converges to the solution of the following ODE:
where is the initial value of . In particular, we consider . Then, as , the dominating component of will be .
The ODE approximation of the algorithm implies that after long enough time, i.e.,
is large enough, the solution of the algorithm can be arbitrarily close to a global optimum. Nevertheless, to obtain the asymptotic “convergence rate”, we need to study the variance of the trajectory at time. Thus, we resort to the following SDE-based approach for a more precise characterization.
SDE Characterization: We notice that such a variance with order vanishes as . To characterize this variance, we rescale the updates by a factor of , i.e., by defining a new process as . After rescaling, the variance of is of order . The following lemma characterizes how the algorithm escapes from the saddle, i.e., , where , in Phase I. Suppose Assumption 4.2 holds and the initial is close to a saddle point, that is, given pre-specified constants and , there exists an such that
As , then weakly converges to the solution of the following SDE:
where and is a standard Brownian motion. The proof of Lemma 4.2 is provided in Appendix C.2. Note that (11) is a Fokker-Planck equation, whose solution is an Ornstein-Uhlenbeck (O-U) process (Doob, 1942) as follows:
We consider . Note that
is essentially a random variable with mean and variance smaller than . However, the larger is, the closer its variance gets to this upper bound. Moreover, the term essentially amplifies by a factor exponentially increasing in . This tremendous amplification forces to quickly move away from as increases, which indicates that the algorithm will escape from the saddle. Further, the following lemma characterizes the local behavior of the algorithm near the optimum.
Suppose that Assumption 4.2 holds and the initial solution is close to an optimal solution, that is, given pre-specified constants and , we have . As , then we have and weakly converges to the solution of the following SDE:
Note the second term of the right hand side in (14) decays to 0, as time . The rest is a pure random walk. Thus, the fluctuation of is essentially the error fluctuation of the algorithm after sufficiently long time.
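The exponential amplification near a saddle, described for the O-U-type solution in Phase I, can be illustrated with a generic one-dimensional linear SDE dz = c z dt + sigma dB with c > 0. The coefficients `c` and `sigma`, the Euler-Maruyama discretization, and the function name are hypothetical; they stand in for the problem-dependent quantities of the SDEs in the text, which are not reproduced here.

```python
import numpy as np

def simulate_linear_sde(c, sigma, z0, T, dt, n_paths, seed=0):
    """Euler-Maruyama paths of dz = c * z dt + sigma dB, returned at time T."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    z = np.full(n_paths, z0, dtype=float)
    for _ in range(n_steps):
        dB = rng.standard_normal(n_paths) * np.sqrt(dt)   # Brownian increments
        z = z + c * z * dt + sigma * dB
    return z

c, sigma, T = 1.0, 1.0, 1.0            # hypothetical drift, noise level, horizon
z_T = simulate_linear_sde(c, sigma, z0=0.0, T=T, dt=1e-3, n_paths=20000)
empirical_var = float(np.var(z_T))
# Closed-form variance of the exact solution started at z0 = 0.
analytic_var = sigma**2 * (np.exp(2 * c * T) - 1.0) / (2 * c)
```

Starting exactly at the saddle (z0 = 0), the noise alone produces a variance that grows like exp(2 c T), which is the mechanism by which the iterate leaves the saddle. Flipping the sign of `c` gives the mean-reverting case, where the fluctuation stays bounded.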
Suppose Assumption 4.2 holds. Given a sufficiently small error , , and
such that with probability at least , , where is the optimum of (3). The proof of Theorem 4.2 is provided in Appendix C.4. Theorem 4.2 implies that asymptotically, our algorithm yields an iteration complexity of
which not only depends on the gap, i.e., , but also depends on , which is the condition number of in the worst case. As can be seen, for an ill-conditioned , the problem (3) is more difficult to solve.
4.3 When and are Noncommutative?
Unfortunately, when and are noncommutative, the analysis is more difficult, even for . Recall that the optimization landscape of the Lagrangian function in (4) enjoys a nice geometric property: At an unstable equilibrium, the negative curvature with respect to the primal variable encourages the algorithm to escape. Specifically, suppose the algorithm is initialized at an unstable equilibrium , the descent direction for is determined by the eigenvectors of
associated with the negative eigenvalues. After one iteration, we obtain . The Hessian matrix becomes
Since is a stochastic approximation, the random noise can make significantly different from . Thus, the eigenvectors of associated with the negative eigenvalues can also be very different from those of . This phenomenon can seriously confuse the algorithm about the descent direction of the primal variable. We remark that such an issue does not appear if we assume and are commutative. We suspect that this is very likely an artifact of our proof technique, since our numerical experiments have provided some empirical evidence of the convergence of SGHA.
Here we briefly discuss a few related works:
Li et al. (2016b) propose a framework for characterizing the stationary points in the unconstrained nonconvex matrix factorization problem, while our studied generalized eigenvalue problem is constrained. Different from their analysis, we analyze the optimization landscape of the corresponding Lagrangian function. When characterizing the stationary points, we need to take both primal and dual variables into consideration, which is technically more challenging.
Ge et al. (2016a) also consider the (off-line) generalized eigenvalue problem but in a finite sum form. Unlike our studied online setting, they access exact and in each iteration. Specifically, they need to access exact and to compute an approximate inverse of
to find the descent direction. Meanwhile, they also need a modified Gram-Schmidt process, which also requires accessing exact , to maintain the solution on the generalized Stiefel manifold (defined by via exact , Mishra and Sepulchre (2016)). Our proposed stochastic search, however, is a fully stochastic primal-dual algorithm, which neither requires accessing the exact and nor forces the primal variables to stay on the manifold.
- Allen-Zhu and Li (2016) Allen-Zhu, Z. and Li, Y. (2016). Doubly accelerated methods for faster CCA and generalized eigendecomposition. arXiv preprint arXiv:1607.06017 .
- Arora et al. (2017) Arora, R., Marinov, T. V., Mianjy, P. and Srebro, N. (2017). Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems.
- Bhojanapalli et al. (2016) Bhojanapalli, S., Neyshabur, B. and Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221 .
- Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
- Chen et al. (2014) Chen, Y., Lan, G. and Ouyang, Y. (2014). Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization 24 1779–1814.
- Chen et al. (2017) Chen, Z., Yang, F. L., Li, C. J. and Zhao, T. (2017). Online multiview representation learning: Dropping convexity for better efficiency. arXiv preprint arXiv:1702.08134 .
- Cook and Ni (2005) Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association 100 410–428.
- Doob (1942) Doob, J. L. (1942). The brownian movement and stochastic equations. Annals of Mathematics 351–369.
- Dummit and Foote (2004) Dummit, D. S. and Foote, R. M. (2004). Abstract algebra, vol. 3. Wiley Hoboken.
- Ethier and Kurtz (2009) Ethier, S. N. and Kurtz, T. G. (2009). Markov processes: characterization and convergence, vol. 282. John Wiley & Sons.
- Friedman et al. (2001) Friedman, J., Hastie, T. and Tibshirani, R. (2001). The elements of statistical learning, vol. 1. Springer series in statistics New York.
- Ge et al. (2015) Ge, R., Huang, F., Jin, C. and Yuan, Y. (2015). Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory.
- Ge et al. (2016a) Ge, R., Jin, C., Netrapalli, P., Sidford, A. et al. (2016a). Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning.
- Ge et al. (2016b) Ge, R., Lee, J. D. and Ma, T. (2016b). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems.
- Gorrell (2006) Gorrell, G. (2006). In EACL, vol. 6. Citeseer.
- Harold et al. (1997) Harold, J., Kushner, G. and Yin, G. (1997). Stochastic approximation and recursive algorithm and applications. Application of Mathematics 35.
- Iouditski and Nesterov (2014) Iouditski, A. and Nesterov, Y. (2014). Primal-dual subgradient methods for minimizing uniformly convex functions. arXiv preprint arXiv:1401.1792 .
- Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, vol. 35. Springer Science & Business Media.
- Lan et al. (2011) Lan, G., Lu, Z. and Monteiro, R. D. (2011). Primal-dual first-order methods with $\mathcal{O}(1/\epsilon)$ iteration-complexity for cone programming. Mathematical Programming 126 1–29.
- Li et al. (2016a) Li, C. J., Wang, M., Liu, H. and Zhang, T. (2016a). Near-optimal stochastic approximation for online principal component estimation. arXiv preprint arXiv:1603.05305 .
- Li et al. (2016b) Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H. and Zhao, T. (2016b). Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296 .
- Luenberger et al. (1984) Luenberger, D. G., Ye, Y. et al. (1984). Linear and nonlinear programming, vol. 2. Springer.
- Luo and Tseng (1993) Luo, Z.-Q. and Tseng, P. (1993). Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research 46 157–178.
- Mika et al. (1999) Mika, S., Ratsch, G., Weston, J., Scholkopf, B. and Mullers, K.-R. (1999). Fisher discriminant analysis with kernels. In Neural networks for signal processing IX, 1999. Proceedings of the 1999 IEEE signal processing society workshop. Ieee.
- Mishra and Sepulchre (2016) Mishra, B. and Sepulchre, R. (2016). Riemannian preconditioning. SIAM Journal on Optimization 26 635–660.
- Polyak (1963) Polyak, B. T. (1963). Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 3 643–653.
- Sanger (1989) Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2 459–473.
- Shapiro et al. (2009) Shapiro, A., Dentcheva, D. and Ruszczyński, A. (2009). Lectures on stochastic programming: modeling and theory. SIAM.
- Sun et al. (2016) Sun, J., Qu, Q. and Wright, J. (2016). A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on. IEEE.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems.
- Zhu et al. (2017) Zhu, Z., Li, Q., Tang, G. and Wakin, M. B. (2017). The global optimization geometry of nonsymmetric matrix factorization and sensing. arXiv preprint arXiv:1703.01256 .
Appendix A Proofs for Determining Stationary Points
A.1 Proof of Theorem 3.1
Recall that the eigendecomposition of is . Given that the eigendecomposition of is , we can write as
We denote as for some with . For where . It is easy to see that . Ignoring the constant 2 in the gradient for convenience, we have