On Landscape of Lagrangian Functions and Stochastic Search for Constrained Nonconvex Optimization

06/13/2018 · Zhehui Chen, et al. · Georgia Institute of Technology

We study constrained nonconvex optimization problems in machine learning, signal processing, and stochastic control. It is well-known that these problems can be rewritten as min-max problems in a Lagrangian form. However, due to the lack of convexity, their landscape is not well understood, and how to find the stable equilibria of the Lagrangian function remains unknown. To bridge the gap, we study the landscape of the Lagrangian function. Further, we define a special class of Lagrangian functions that enjoy two properties: (1) equilibria are either stable or unstable (formal definitions in Section 2); (2) stable equilibria correspond to the global optima of the original problem. We show that a generalized eigenvalue (GEV) problem, which includes canonical correlation analysis and other problems, belongs to this class. Specifically, we characterize its stable and unstable equilibria by leveraging an invariant group and a symmetry property (more details in Section 3). Motivated by these neat geometric structures, we propose a simple, efficient, stochastic primal-dual algorithm for solving the online GEV problem. Theoretically, we provide sufficient conditions under which we establish an asymptotic convergence rate and obtain the first sample complexity result for the online GEV problem via diffusion approximations, which are widely used in applied probability and stochastic control. Numerical results are provided to support our theory.


1 Introduction

We often encounter the following optimization problem in machine learning, signal processing, and stochastic control:

$$\min_{x \in \mathcal{X}} f(x) \quad \text{subject to} \quad h_j(x) = 0, \; j = 1, \ldots, m, \tag{1}$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a loss function, $\mathcal{X} \subseteq \mathbb{R}^d$ denotes a feasible set, $m$ is the number of constraints, and the $h_j$'s are differentiable functions that impose constraints on the model parameters. For notational simplicity, we define $h(x) = (h_1(x), \ldots, h_m(x))^\top$. Principal component analysis (PCA), canonical correlation analysis (CCA), matrix factorization/sensing/completion, phase retrieval, and many other problems (Friedman et al., 2001; Sun et al., 2016; Bhojanapalli et al., 2016; Li et al., 2016b; Ge et al., 2016b; Chen et al., 2017; Zhu et al., 2017) can be viewed as special examples of (1). Many algorithms have been proposed to solve (1). For the unconstrained case ($m = 0$) or a simple constraint set $\mathcal{X}$, e.g., the spherical constraint $\mathcal{X} = \{x : \|x\|_2 = 1\}$, we can apply simple first order algorithms such as projected gradient descent (Luenberger et al., 1984).
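As a concrete illustration of this simple-constraint case, the following sketch (ours, not code from the paper; the leading-eigenvector objective is a stock example) applies projected gradient descent under a spherical constraint, where the projection step is just renormalization:

```python
import numpy as np

def projected_gradient_descent(grad_f, x0, eta=0.05, n_iters=2000):
    """Minimize f over the unit sphere {x : ||x||_2 = 1} by
    alternating a gradient step with projection (renormalization)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iters):
        x = x - eta * grad_f(x)       # gradient step
        x = x / np.linalg.norm(x)     # project back onto the sphere
    return x

# Stock example: leading eigenvector of a symmetric matrix M,
# i.e., min_x -x'Mx subject to ||x||_2 = 1.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
M = (G + G.T) / 2
x_hat = projected_gradient_descent(lambda x: -2 * M @ x, rng.standard_normal(5))
```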

However, when the feasible region is complicated, the aforementioned algorithms are often not applicable or inefficient, because the projection onto the feasible region does not admit a closed form expression and can be computationally expensive in each iteration. To address this issue, we convert (1) to a min-max problem using the Lagrange multiplier method. Specifically, instead of solving (1), we solve the following problem:

$$\min_{x \in \mathcal{X}} \max_{\lambda \in \mathbb{R}^m} L(x, \lambda) = f(x) + \langle \lambda, h(x) \rangle, \tag{2}$$

where $\lambda \in \mathbb{R}^m$ is the Lagrange multiplier, and $L(x, \lambda)$ is often referred to as the Lagrangian function in the existing literature (Boyd and Vandenberghe, 2004). The optimization literature also refers to $x$ as the primal variable and $\lambda$ as the dual variable. Accordingly, (1) is called the primal problem. From the perspective of game theory, the primal and dual variables can be viewed as two players competing with each other and eventually achieving some equilibrium. When $f$ is convex and the feasible region is convex or the boundary of a convex set, the optimization landscape of (2) is essentially convex-concave, that is, for any fixed $\lambda$, $L(x, \lambda)$ is convex in $x$, and for any fixed $x$, $L(x, \lambda)$ is concave in $\lambda$. Such a landscape further implies that the equilibrium of (2) is a saddle point, whose primal variable is equivalent to the global optimum of (1) under strong duality conditions. To solve (2), we resort to primal-dual algorithms, which iterate over both $x$ and $\lambda$ (usually in an alternating manner). Global convergence rates to the equilibrium have been established accordingly for these algorithms (Lan et al., 2011; Chen et al., 2014; Iouditski and Nesterov, 2014).
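A generic instance of such a primal-dual iteration (our illustration; the cited works analyze more refined variants) alternates a projected gradient descent step on the primal variable with a gradient ascent step on the dual variable:

$$x_{t+1} = \Pi_{\mathcal{X}}\big(x_t - \eta\, \nabla_x L(x_t, \lambda_t)\big), \qquad \lambda_{t+1} = \lambda_t + \eta\, \nabla_\lambda L(x_{t+1}, \lambda_t) = \lambda_t + \eta\, h(x_{t+1}),$$

where $\Pi_{\mathcal{X}}$ denotes the projection onto $\mathcal{X}$ and $\eta > 0$ is a step size.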

When $f$ and the feasible region are nonconvex, both (1) and (2) become much more computationally challenging, and are NP-hard in general. Significant progress has been made toward solving the primal problem (1). For example, Ge et al. (2015) show that when certain tensor factorization problems satisfy the so-called strict saddle property, one can apply simple first order algorithms such as the projected gradient algorithm, and global convergence in polynomial time can be guaranteed. Their results further motivate many follow-up works, proving that many problems can be formulated as strict saddle optimization problems, including PCA, multiview learning, phase retrieval, matrix factorization/sensing/completion, and complete dictionary learning (Sun et al., 2016; Bhojanapalli et al., 2016; Li et al., 2016b; Ge et al., 2016b; Chen et al., 2017; Zhu et al., 2017). Note that these strict saddle optimization problems are either unconstrained or have only a simple spherical constraint. However, for many other nonconvex optimization problems, the feasible region can be much more complicated. To the best of our knowledge, when the feasible region is not only nonconvex but also complicated, applicable algorithms and convergence guarantees are still largely unknown in the existing literature.

To handle a complicated feasible region, this paper proposes to investigate the min-max problem (2). Specifically, we first define a special class of Lagrangian functions, whose landscape enjoys the following good properties:


  • There exist only two types of equilibria: stable and unstable equilibria. At an unstable equilibrium, $L$ has negative curvature with respect to the primal variable $x$ (more details in Section 2).

  • All stable equilibria correspond to the global optima of the primal problem (1).

Both properties are intuitive. On the one hand, the negative curvature in the first property enables the primal variable to escape from unstable equilibria along some descent direction. On the other hand, the second property ensures that we do not get spurious local optima of (1); that is, all local minima must also be global optima.

We then study a generalized eigenvalue (GEV) problem, which includes CCA, Fisher discriminant analysis (FDA, Mika et al. (1999)), and sufficient dimension reduction (SDR, Cook and Ni (2005)) as special examples. Specifically, GEV solves

$$\min_{X \in \mathbb{R}^{d \times k}} -\mathrm{tr}(X^\top A X) \quad \text{subject to} \quad X^\top B X = I_k, \tag{3}$$

where $A, B \in \mathbb{R}^{d \times d}$ are symmetric and $B$ is positive semidefinite. We rewrite (3) as a min-max problem,

$$\min_{X \in \mathbb{R}^{d \times k}} \max_{Y \in \mathbb{R}^{k \times k}} L(X, Y) = -\mathrm{tr}(X^\top A X) + \langle Y, X^\top B X - I_k \rangle, \tag{4}$$

where $Y$ is the Lagrange multiplier. Theoretically, we show that the Lagrangian function in (4) exactly belongs to our previously defined class. Motivated by these landscape structures, we then solve an online version of (4), where we can only access independent unbiased stochastic approximations of $A$ and $B$, while directly accessing $A$ and $B$ is prohibited. Specifically, at the $t$-th iteration, we only obtain independent $\widehat{A}_t$ and $\widehat{B}_t$ satisfying
$$\mathbb{E}[\widehat{A}_t] = A \quad \text{and} \quad \mathbb{E}[\widehat{B}_t] = B.$$
Computationally, we propose a simple stochastic primal-dual algorithm, which is a stochastic variant of the generalized Hebbian algorithm (GHA, Gorrell (2006)). Theoretically, we establish the asymptotic rate of convergence to stable equilibria for our stochastic GHA (SGHA) based on diffusion approximations (Kushner and Yin, 2003). Specifically, we show that, asymptotically, the solution trajectory of SGHA weakly converges to the solutions of stochastic differential equations (SDEs). By studying the analytical solutions of these SDEs, we further establish the asymptotic sample/iteration complexity of SGHA under certain regularity conditions (Harold et al., 1997; Li et al., 2016a; Chen et al., 2017). To the best of our knowledge, this is the first asymptotic sample/iteration complexity analysis of a stochastic optimization algorithm for solving the online version of the GEV problem. Numerical experiments are presented to justify our theory.

Our work is closely related to several recent results on solving GEV problems. For example, Ge et al. (2016a) propose a multistage semi-stochastic optimization algorithm for solving GEV problems with a finite-sum structure. At each optimization stage, their algorithm needs to access the exact matrix $B$ and compute an approximate inverse of $B$ by solving a quadratic program, which is not allowed in our setting. Similar matrix inversion approaches are also adopted by a few other recently proposed algorithms for solving the GEV problem (Allen-Zhu and Li, 2016; Arora et al., 2017). In contrast, our proposed SGHA is a fully stochastic algorithm, which does not require any matrix inversion.

Moreover, our work is also related to several more complicated min-max problems, such as Markov decision processes with function approximation, generative adversarial networks, and multistage stochastic programming and control (Sutton et al., 2000; Shapiro et al., 2009; Goodfellow et al., 2014). Many primal-dual algorithms have been proposed to solve these problems. However, most of these algorithms are not even guaranteed to converge. As mentioned earlier, when the convex-concave structure is missing, min-max problems go far beyond the existing theories. Moreover, both primal and dual iterations involve sophisticated stochastic approximations (equally or more difficult than our online version of GEV). This paper makes an attempt at understanding the optimization landscape of these challenging min-max problems. Taking our results as an initial step, we expect more sophisticated and stronger follow-up works that apply to these min-max problems.

Notations. Given an integer $d$, we denote $I_d$ as the $d \times d$ identity matrix and $[d] = \{1, \ldots, d\}$. Given an index set $\mathcal{S} \subseteq [d]$ and a matrix $M$, we denote $\mathcal{S}^c$ as the complement set of $\mathcal{S}$, $M_{:,j}$ ($M_{i,:}$) as the $j$-th column ($i$-th row) of $M$, $M_{ij}$ as the $(i, j)$-th entry of $M$, $M_{:,\mathcal{S}}$ ($M_{\mathcal{S},:}$) as the column (row) submatrix of $M$ indexed by $\mathcal{S}$, $\mathrm{vec}(M)$ as the vectorization of $M$, $\mathrm{col}(M)$ as the column space of $M$, and $\mathrm{null}(M)$ as the null space of $M$. Given a symmetric matrix $M$, we denote $\sigma_{\min}(M)/\sigma_{\max}(M)$ as its smallest/largest singular value, write its eigenvalue decomposition as $M = O \Lambda O^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \geq \cdots \geq \lambda_d$, and denote $\|M\|_2$ as the spectral norm of $M$. Given two matrices $M$ and $N$, $M \otimes N$ denotes the Kronecker product of $M$ and $N$.

2 Characterization of Equilibria

Recall the Lagrangian function $L(x, \lambda)$ in (2). We start with characterizing its equilibria. By the KKT conditions, an equilibrium $(\bar{x}, \bar{\lambda})$ satisfies
$$\nabla_x L(\bar{x}, \bar{\lambda}) = \nabla f(\bar{x}) + \sum_{j=1}^m \bar{\lambda}_j \nabla h_j(\bar{x}) = 0 \quad \text{and} \quad \nabla_\lambda L(\bar{x}, \bar{\lambda}) = h(\bar{x}) = 0,$$
which only contains the first order information of $L$. To further distinguish among the equilibria, we define two types of equilibria using second order information. Given the Lagrangian function in (2), a point $(\bar{x}, \bar{\lambda})$ is called:


  • (1) An equilibrium of $L$, if $\nabla L(\bar{x}, \bar{\lambda}) = 0$;

  • (2) An unstable equilibrium, if $(\bar{x}, \bar{\lambda})$ is an equilibrium and $\lambda_{\min}\big(\nabla_x^2 L(\bar{x}, \bar{\lambda})\big) < 0$;

  • (3) A stable equilibrium, if $(\bar{x}, \bar{\lambda})$ is an equilibrium, $\nabla_x^2 L(\bar{x}, \bar{\lambda}) \succeq 0$, and $L(\cdot, \bar{\lambda})$ is strongly convex over a restricted domain.

Note that (2) in Definition 2 is similar to the strict saddle property over a manifold in Ge et al. (2015). The motivation behind Definition 2 is intuitive. When $L$ has negative curvature with respect to the primal variable at an equilibrium, we can find a direction in $x$ along which $L$ further decreases. Therefore, a tiny perturbation can break this unstable equilibrium. An illustrative example is presented in Figure 1. Moreover, at a stable equilibrium, there is restricted strong convexity, which relates to several conditions, e.g., the Polyak-Łojasiewicz condition (Polyak, 1963), which requires $\frac{1}{2}\|\nabla_x L(x, \bar{\lambda})\|_2^2 \geq \mu \big(L(x, \bar{\lambda}) - L(\bar{x}, \bar{\lambda})\big)$ for all $x$ in a small region near $\bar{x}$, where $\mu > 0$ is a constant, or the error bound condition (Luo and Tseng, 1993). With this property, we cannot decrease $L$ along any direction with respect to $x$.

Figure 1: An illustration of an unstable equilibrium, shown from three views; the red and green lines highlight slices of the landscape through the equilibrium.

Definition 2 excludes high order unstable equilibria, which may exist due to the degeneracy of $\nabla_x^2 L$. Such a high order unstable equilibrium cannot be identified by second order information: an equilibrium may have a positive semidefinite Hessian matrix and still be unstable, since a small perturbation can break it (a simple one-dimensional illustration is given at the end of this section). Such equilibria make the landscape considerably more complicated. Overall, we consider a specific class of Lagrangian functions throughout the rest of this paper. They enjoy the following properties:


  • All equilibria are either stable or unstable (i.e., no high order unstable equilibria);

  • All stable equilibria correspond to the global optima of the primal problem.

As mentioned earlier, the first property ensures that second order information can identify the type of each equilibrium. The second property guarantees that we do not get spurious optima for (1) as long as an algorithm attains a stable equilibrium. Several machine learning problems belong to this class, such as the generalized eigenvalue decomposition problem.
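As a simple one-dimensional illustration of a high order (degenerate) unstable equilibrium (our own example, not one taken from the original text), consider the unconstrained function
$$f(x) = x^3, \qquad f'(0) = 0, \qquad f''(0) = 0 \succeq 0.$$
The point $x = 0$ satisfies the first order condition and has a positive semidefinite Hessian, yet it is not a local minimum: any perturbation to $x < 0$ strictly decreases $f$, and no second order test can detect this. The class of Lagrangian functions above rules out equilibria of this kind.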

3 Generalized Eigenvalue Decomposition

We consider the generalized eigenvalue (GEV) problem as a motivating example, which includes CCA, FDA, SDR, etc. as special examples. Recall its min-max formulation (4):
$$\min_{X \in \mathbb{R}^{d \times k}} \max_{Y \in \mathbb{R}^{k \times k}} L(X, Y) = -\mathrm{tr}(X^\top A X) + \langle Y, X^\top B X - I_k \rangle.$$

Before we proceed, we impose the following assumption on the problem. Assumption 3: given the symmetric matrix $A$ and the positive definite matrix $B$, the eigenvalues of $B^{-1}A$, denoted by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$, satisfy $\lambda_k > \lambda_{k+1}$. Such an eigengap assumption avoids the identifiability issue. The full rank assumption on $B$ in Assumption 3 ensures that the original constrained optimization problem is bounded. This assumption can be further relaxed, but doing so requires a more involved analysis; we discuss this in Appendix B.

To characterize all equilibria of GEV, we leverage the idea of an invariant group. Li et al. (2016b) use similar techniques for an unconstrained matrix factorization problem. However, their construction does not work for the Lagrangian function due to its more complicated landscape. Therefore, we consider a more general invariant group. Moreover, by analyzing the Hessian matrix of $L$ at the equilibria, we demonstrate that each equilibrium is either unstable or stable, and that the stable equilibria correspond to the global optima of the primal problem (3). Therefore, GEV belongs to the class we defined earlier.

3.1 Invariant Group and Symmetric Property

We first denote the orthogonal group in dimension $k$ as
$$\mathcal{O}_k = \{P \in \mathbb{R}^{k \times k} : P^\top P = P P^\top = I_k\}.$$
Notice that for any $P \in \mathcal{O}_k$, $L(XP, P^\top Y P)$ in (4) has the same landscape as $L(X, Y)$. This further indicates that given an equilibrium $(X, Y)$, $(XP, P^\top Y P)$ is also an equilibrium. This symmetry property motivates us to characterize the equilibria of $L$ with an invariant group.
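Under the Lagrangian form recalled above, this invariance can be verified directly using the cyclic property of the trace and $P^\top P = P P^\top = I_k$:

$$L(XP, P^\top Y P) = -\mathrm{tr}(P^\top X^\top A X P) + \big\langle P^\top Y P,\; P^\top (X^\top B X - I_k) P \big\rangle = -\mathrm{tr}(X^\top A X) + \big\langle Y,\; X^\top B X - I_k \big\rangle = L(X, Y).$$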

We introduce several important definitions in group theory (Dummit and Foote, 2004).

Given a group $\mathcal{G}$ and a set $\mathcal{M}$, a map $\phi : \mathcal{G} \times \mathcal{M} \to \mathcal{M}$ is called a group action of $\mathcal{G}$ on $\mathcal{M}$ if $\phi$ satisfies the following two properties:

Identity: $\phi(e, x) = x$ for all $x \in \mathcal{M}$, where $e$ denotes the identity element of $\mathcal{G}$.

Compatibility: $\phi(g, \phi(g', x)) = \phi(g g', x)$ for all $g, g' \in \mathcal{G}$ and $x \in \mathcal{M}$. Given a function $L$ defined on $\mathcal{M}_1 \times \mathcal{M}_2$, a group $\mathcal{G}$ is a stationary invariant group of $L$ with respect to two group actions $\phi_1$ of $\mathcal{G}$ on $\mathcal{M}_1$ and $\phi_2$ of $\mathcal{G}$ on $\mathcal{M}_2$, if $L$ satisfies
$$L\big(\phi_1(g, x), \phi_2(g, y)\big) = L(x, y) \quad \text{for all } g \in \mathcal{G},\, x \in \mathcal{M}_1,\, y \in \mathcal{M}_2.$$

Given the group $\mathcal{O}_k$ and the two sets $\mathbb{R}^{d \times k}$ and $\mathbb{R}^{k \times k}$, we define a group action $\phi_1(P, X) = XP$ of $\mathcal{O}_k$ on $\mathbb{R}^{d \times k}$ and a group action $\phi_2(P, Y) = P^\top Y P$ of $\mathcal{O}_k$ on $\mathbb{R}^{k \times k}$. One can check that the orthogonal group $\mathcal{O}_k$ is a stationary invariant group of $L$ with respect to these two group actions. By this invariant group, we define the equivalence relation between $(X, Y)$ and $(X', Y')$: they are equivalent if there exists a $P \in \mathcal{O}_k$ such that
$$X' = XP \quad \text{and} \quad Y' = P^\top Y P. \tag{5}$$
To find all equilibria of GEV, we examine the KKT conditions of $L$:
$$\nabla_X L(X, Y) = -2AX + 2BXY = 0 \quad \text{and} \quad \nabla_Y L(X, Y) = X^\top B X - I_k = 0.$$
Given the eigenvalue decomposition $B = O_B \Lambda_B O_B^\top$, we denote
$$\widetilde{A} = \Lambda_B^{-1/2} O_B^\top A\, O_B \Lambda_B^{-1/2}.$$
We then consider the eigenvalue decomposition $\widetilde{A} = O_{\widetilde{A}} \Lambda_{\widetilde{A}} O_{\widetilde{A}}^\top$. The following theorem shows the connection between the equilibria of $L$ and the column submatrices of $O_{\widetilde{A}}$, denoted as $(O_{\widetilde{A}})_{:,\mathcal{S}}$, where $\mathcal{S} \subseteq [d]$ is the column index set that determines a column submatrix. [Symmetric Property] Suppose Assumption 3 holds. Then $(X, Y)$ is an equilibrium of $L$ if and only if $X$ can be written as
$$X = O_B \Lambda_B^{-1/2} (O_{\widetilde{A}})_{:,\mathcal{S}}\, P,$$
where the index set $\mathcal{S} \subseteq [d]$ satisfies $|\mathcal{S}| = k$ and $P \in \mathcal{O}_k$. The proof of Theorem 3.1 is provided in Appendix A.1. Theorem 3.1 implies that there are $\binom{d}{k}$ equilibria of $L$ under the equivalence relation given in (5). Each of them corresponds to an index set $\mathcal{S} \subseteq [d]$ with $|\mathcal{S}| = k$. The whole set of equilibria is then generated from these $(O_{\widetilde{A}})_{:,\mathcal{S}}$ by the transformation matrix $O_B \Lambda_B^{-1/2}$ and the invariant group action induced by $\mathcal{O}_k$.

3.2 Unstable Equilibrium vs. Stable Equilibrium

We further identify the stable and unstable equilibria. Specifically, given an equilibrium $(X, Y)$ of $L$, we denote the Hessian matrix of $L$ with respect to the primal variable at $(X, Y)$ as $\nabla_X^2 L(X, Y)$. Then we calculate the eigenvalues of $\nabla_X^2 L(X, Y)$. By Definition 2, $(X, Y)$ is unstable if $\nabla_X^2 L(X, Y)$ has a negative eigenvalue; otherwise, we analyze the local landscape at $(X, Y)$ to determine whether it is stable or not. The following theorem shows that all equilibria are either stable or unstable and demonstrates how the choice of the index set $\mathcal{S}$ corresponds to the unstable and stable equilibria of $L$. Suppose Assumption 3 holds, and $(X, Y)$ is an equilibrium of (4). By Theorem 3.1, $X$ can be represented as $O_B \Lambda_B^{-1/2} (O_{\widetilde{A}})_{:,\mathcal{S}}\, P$ for some $\mathcal{S}$ and $P \in \mathcal{O}_k$.

If $\mathcal{S} \neq [k]$, then $(X, Y)$ is an unstable equilibrium: $\nabla_X^2 L(X, Y)$ has a negative eigenvalue whose magnitude is determined by the gap between the selected eigenvalues $\{\lambda_i\}_{i \in \mathcal{S}}$ and the leading eigenvalues of $\widetilde{A}$. Otherwise, $\mathcal{S} = [k]$ and $\nabla_X^2 L(X, Y) \succeq 0$; moreover, $(X, Y)$ is a stable equilibrium of the min-max problem (4).

The proof of Theorem 3.2 is provided in Appendix A.2. Theorem 3.2 indicates that when $\mathcal{S} = [k]$, that is, when the columns of $X$ span the eigenvectors of $B^{-1}A$ corresponding to the $k$ largest eigenvalues, $(X, Y)$ is a stable equilibrium of $L$. Although $\nabla_X^2 L$ is degenerate at this equilibrium, all directions in its null space essentially point to the primal variables of other stable equilibria. Excluding these directions, all the rest have positive curvature, which implies that this equilibrium is stable. Moreover, such an $X$ corresponds to the optima of (3). When $\mathcal{S} \neq [k]$, due to the negative curvature, these equilibria are unstable. Therefore, all stable equilibria of $L$ correspond to the global optima of (3), and all other equilibria are unstable, which further indicates that GEV belongs to the class we defined earlier.
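As a quick sanity check (our own specialization, not a computation from the original text), take $B = I_d$ and $k = 1$, so that (4) reads
$$L(x, y) = -x^\top A x + y\,(x^\top x - 1), \qquad \nabla_x L = -2Ax + 2yx, \qquad \nabla_y L = x^\top x - 1.$$
The equilibria are exactly the unit eigenvectors $v_i$ of $A$ with $y = \lambda_i$, and the Hessian $\nabla_x^2 L = -2A + 2\lambda_i I_d$ has the negative eigenvalue $2(\lambda_i - \lambda_1) < 0$ whenever $i \neq 1$. Hence only the top eigenvector is stable, matching the theorem.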

4 Stochastic Search for Online GEV

For GEV, we propose a fully stochastic primal-dual algorithm to solve (4), which only requires access to stochastic approximations of the matrices $A$ and $B$. This is very different from existing semi-stochastic algorithms that require access to the exact matrix $B$ (Ge et al., 2016a). Specifically, we propose a stochastic variant of the generalized Hebbian algorithm (GHA), also referred to as Sanger's rule in the existing literature (Sanger, 1989), to solve (4). In the online setting, accessing the exact $A$ and $B$ is prohibitive, and at the $t$-th iteration we only obtain $\widehat{A}_t$ and $\widehat{B}_t$, independently sampled from the distributions associated with $A$ and $B$. Our proposed SGHA updates the primal and dual variables as follows:

Primal Update: $X_{t+1} = X_t + \eta\,\big(\widehat{A}_t X_t - \widehat{B}_t X_t Y_t\big)$,   (6)
Dual Update: $Y_{t+1} = X_{t+1}^\top \widehat{A}_t X_{t+1}$,   (7)

where $\eta > 0$ is a step size parameter. Note that the primal update is a stochastic gradient descent step, while the dual update is motivated by the KKT conditions of (4). SGHA is simple and easy to implement, and the constraint is naturally handled by the dual update. Further, motivated by the landscape of GEV, we analyze the algorithm by diffusion approximations and obtain its asymptotic sample complexity.
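The following minimal Python sketch (ours; the variable names are our own choices, and it instantiates the primal-dual updates (6)-(7) as reconstructed above) illustrates SGHA for the online GEV problem:

```python
import numpy as np

def sgha(sample_A, sample_B, d, k, eta=1e-3, n_iters=50_000, seed=0):
    """Stochastic generalized Hebbian algorithm (SGHA) for online GEV.

    sample_A, sample_B: callables returning independent, unbiased
    stochastic approximations of A and B at each iteration.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, k))
    for _ in range(n_iters):
        A_t, B_t = sample_A(), sample_B()
        Y = X.T @ A_t @ X                       # dual update via KKT: Y = X'AX
        X = X + eta * (A_t @ X - B_t @ X @ Y)   # primal stochastic gradient step
    return X
```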

4.1 Numerical Evaluations

We first provide numerical evaluations to illustrate the effectiveness of SGHA, and then provide an asymptotic convergence analysis of SGHA. We consider three different settings:


  • Setting (1): $A$ and $B$ are fixed matrices;

  • Setting (2): we randomly generate an orthogonal matrix $U$ and construct both $A$ and $B$ from $U$ (so that $A$ and $B$ commute);

  • Setting (3): we randomly generate two orthogonal matrices $U$ and $V$ and construct $A$ from $U$ and $B$ from $V$.

Figure 2: Plots of the optimization error over SGHA iterations on synthetic data, over 20 random data generations, under settings (1)-(3) (panels (a)-(c)).

At the $t$-th iteration of SGHA, we independently sample random vectors from the distributions associated with $A$ and $B$, respectively. Accordingly, we compute the sample covariance matrices $\widehat{A}_t$ and $\widehat{B}_t$ as the approximations of $A$ and $B$. We repeat the numerical simulations under each setting 20 times using random data generations and present all results in Figure 2. The horizontal axis corresponds to the number of iterations, and the vertical axis corresponds to the optimization error. Our experiments indicate that SGHA converges to a global optimum in all settings.
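A sketch of this simulation protocol (again ours; the dimensions, mini-batch size, and the subspace-distance error metric below are our own choices and may differ from the paper's):

```python
import numpy as np
from scipy.linalg import eigh

d, k, m = 50, 3, 10                         # dimension, subspace size, batch size
rng = np.random.default_rng(1)
G1, G2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = G1 @ G1.T / d                           # population "covariances": A is PSD,
B = G2 @ G2.T / d + np.eye(d)               # B is positive definite

def sample_cov(C):
    """Sample covariance from m fresh draws of N(0, C)."""
    Z = rng.multivariate_normal(np.zeros(d), C, size=m)
    return Z.T @ Z / m

X_hat = sgha(lambda: sample_cov(A), lambda: sample_cov(B), d, k,
             eta=5e-3, n_iters=20_000)

# Error: spectral distance between span(X_hat) and the top-k
# generalized eigenspace of the pencil (A, B).
_, V = eigh(A, B)                           # generalized eigenvectors (ascending)
Q_hat, _ = np.linalg.qr(X_hat)
Q_star, _ = np.linalg.qr(V[:, -k:])
print(np.linalg.norm(Q_hat @ Q_hat.T - Q_star @ Q_star.T, 2))
```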

4.2 Convergence Analysis for Commutative $A$ and $B$

As a special case, we first prove the convergence of SGHA for GEV with $k = 1$, where $A$ and $B$ are commutative. We discuss noncommutative cases and $k > 1$ in the next section. Before we proceed, we introduce our assumptions on the problem. We assume that the following conditions hold:

  • (a): The $\widehat{A}_t$'s and $\widehat{B}_t$'s are independently sampled from two different distributions $\mathcal{D}_A$ and $\mathcal{D}_B$, respectively, where $\mathbb{E}[\widehat{A}_t] = A$ and $\mathbb{E}[\widehat{B}_t] = B$;

  • (b): $A$ and $B$ are commutative, i.e., there exists an orthogonal matrix $O$ such that $A = O \Lambda_A O^\top$ and $B = O \Lambda_B O^\top$, where $\Lambda_A = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ and $\Lambda_B = \mathrm{diag}(\mu_1, \ldots, \mu_d)$ are diagonal matrices with $\mu_i > 0$ for all $i$;

  • (c): $\widehat{A}_t$ and $\widehat{B}_t$ satisfy the moment conditions, that is, for some generic constants $C_A$ and $C_B$, $\mathbb{E}\|\widehat{A}_t - A\|_2^2 \leq C_A$ and $\mathbb{E}\|\widehat{B}_t - B\|_2^2 \leq C_B$.

Note that (a) and (c) in Assumption 4.2 are mild, but (b) is stringent. For convenience of analysis, we combine (6) and (7) as

$$x_{t+1} = x_t + \eta\,\big(\widehat{A}_t x_t - \widehat{B}_t x_t x_t^\top \widehat{A}_t x_t\big). \tag{8}$$

We remark that (8) is very different from existing optimization algorithms over the generalized Stiefel manifold. Specifically, computing the gradient over the generalized Stiefel manifold requires $B^{-1}$, which is not allowed in our setting. For notational convenience, we further denote by $\lambda_i$ and $\mu_i$ the $i$-th diagonal entries of $\Lambda_A$ and $\Lambda_B$, respectively. Without loss of generality, we assume $\lambda_1/\mu_1 > \lambda_2/\mu_2 \geq \cdots \geq \lambda_d/\mu_d$. Note that the $\lambda_i$'s and $\mu_i$'s themselves, however, are not necessarily monotonic. We denote $w_t = O^\top x_t$. One can verify that (8) can be rewritten as follows:
$$w_{t+1} = w_t + \eta\,\big(\widehat{\Lambda}_t w_t - \widehat{M}_t w_t w_t^\top \widehat{\Lambda}_t w_t\big), \tag{9}$$
where $\widehat{\Lambda}_t = O^\top \widehat{A}_t O$ and $\widehat{M}_t = O^\top \widehat{B}_t O$. Note that $w = \pm e_1/\sqrt{\mu_1}$ corresponds to the optimal solution of (3).
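To make the change of variables explicit (under the forms of (8) and (9) given above), left-multiply (8) by $O^\top$ and use $x_t = O w_t$:

$$O^\top x_{t+1} = w_t + \eta\,\Big( (O^\top \widehat{A}_t O)\, w_t - (O^\top \widehat{B}_t O)\, w_t\, \big(w_t^\top (O^\top \widehat{A}_t O)\, w_t\big) \Big) = w_t + \eta\,\big( \widehat{\Lambda}_t w_t - \widehat{M}_t w_t w_t^\top \widehat{\Lambda}_t w_t \big),$$

which is exactly (9). In the noiseless case, $\widehat{\Lambda}_t = \Lambda_A$ and $\widehat{M}_t = \Lambda_B$ are diagonal, so the dynamics decouple across coordinates up to the common scalar $w_t^\top \Lambda_A w_t$.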

By diffusion approximation, we show that our algorithm converges through three Phases:


  • Phase I: Given an initialization near a saddle point, we show that after properly rescaling time, the algorithm can be characterized by a stochastic differential equation (SDE). Such an SDE further implies that our algorithm escapes from the saddle quickly;

  • Phase II: We show that away from the saddle, the trajectory of our algorithm can be approximated by an ordinary differential equation (ODE);

  • Phase III: We first show that after Phase II, the norm of the solution converges to a constant. Then, the algorithm can again be characterized by an SDE, as in Phase I. Using this SDE, we analyze the error fluctuation when the solution is within a small neighborhood of the global optimum.

Overall, we obtain an asymptotic sample complexity.

ODE Characterization: To demonstrate an ODE characterization for the trajectory of our algorithm, we introduce the continuous time interpolation of the iterates, $W^\eta(t) := w_{\lfloor t/\eta \rfloor}$, where $\eta$ is the step size in (8). For notational simplicity, we drop the superscript $\eta$ when it is clear from the context. Instead of showing the global convergence of $W(t)$ directly, we show that a quantity built from the coordinates of $W(t)$ converges to an exponential decay function, where $W_i$ denotes the $i$-th component (coordinate) of $W$.

Suppose that Assumption 4.2 holds and the initial solution is away from any saddle point; i.e., given pre-specified constants, the initial coordinates satisfy a non-degeneracy condition (in particular, $W_1(0) \neq 0$). As $\eta \to 0$, $W^\eta(t)$ weakly converges to the solution of the following ODE:
$$\frac{dW(t)}{dt} = \Lambda_A W(t) - \Lambda_B W(t)\, W(t)^\top \Lambda_A W(t). \tag{10}$$

The proof of Lemma 4.2 is provided in Appendix C.1. Lemma 4.2 essentially implies the global convergence of SGHA: the solution of (10) admits a closed form in which every coordinate $W_i(t)$ with $i \neq 1$ decays relative to $W_1(t)$ at an exponential rate governed by the gap $\lambda_1/\mu_1 - \lambda_i/\mu_i$, where $W(0)$ is the initial value of $W$. In particular, we consider $W_1(0) \neq 0$. Then, as $t \to \infty$, the dominating component of $W(t)$ will be $W_1(t)$.

The ODE approximation of the algorithm implies that after a long enough time, i.e., when $t$ is large enough, the solution of the algorithm can be arbitrarily close to a global optimum. Nevertheless, to obtain the asymptotic "convergence rate", we need to study the variance of the trajectory at time $t$. Thus, we resort to the following SDE-based approach for a more precise characterization.

SDE Characterization: We notice that the variance of the trajectory, which is of order $\eta$, vanishes as $\eta \to 0$. To characterize this variance, we rescale the updates by a factor of $\eta^{-1/2}$, i.e., we define a new process $Z^\eta(t) := \eta^{-1/2} W^\eta(t)$ around the point of interest. After rescaling, the variance of $Z^\eta(t)$ is of order $1$. The following lemma characterizes how the algorithm escapes from a saddle in Phase I. Suppose Assumption 4.2 holds and the initialization is close to a saddle point; that is, given pre-specified constants, the initialization lies in a sufficiently small neighborhood of the saddle. As $\eta \to 0$, $Z^\eta(t)$ weakly converges to the solution of the following SDE:
$$dZ(t) = \theta Z(t)\, dt + \sigma\, dB(t), \tag{11}$$
where $\theta > 0$ and $\sigma > 0$ are constants determined by the eigengaps and the noise level, and $B(t)$ is a standard Brownian motion. The proof of Lemma 4.2 is provided in Appendix C.2. Note that the Fokker-Planck equation associated with (11) is solved by an Ornstein-Uhlenbeck (O-U) process (Doob, 1942):
$$Z(t) = Z(0)\, e^{\theta t} + \sigma \int_0^t e^{\theta (t - s)}\, dB(s). \tag{12}$$

Note that the stochastic integral in (12) is essentially a random variable with mean $0$ and variance smaller than $\frac{\sigma^2}{2\theta} e^{2\theta t}$; the larger $t$ is, the closer its variance gets to this upper bound. Moreover, the term $e^{\theta t}$ essentially amplifies $Z(0)$ by a factor exponentially increasing in $t$. This tremendous amplification forces $Z(t)$ to quickly get away from $0$ as $t$ increases, which indicates that the algorithm will escape from the saddle. Further, the following lemma characterizes the local behavior of the algorithm near the optimum.
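To make the escape intuition quantitative (a heuristic of ours, not a claim from the original text), ignore the stochastic integral in (12); then $|Z(t)| \approx |Z(0)|\, e^{\theta t}$ reaches a constant distance $\delta$ from the saddle after
$$t \approx \frac{1}{\theta} \log \frac{\delta}{|Z(0)|},$$
which, since continuous time $t$ corresponds to roughly $t/\eta$ iterations, suggests an escape time of about $(\eta\,\theta)^{-1} \log\big(\delta/|Z(0)|\big)$ iterations.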

Suppose that Assumption 4.2 holds and the initial solution is close to an optimal solution; that is, given pre-specified constants, the initialization lies in a sufficiently small neighborhood of the optimum. As $\eta \to 0$, the rescaled error process $Z^\eta(t)$ weakly converges to the solution of the following SDE:
$$dZ(t) = -\beta Z(t)\, dt + \sigma'\, dB(t), \tag{13}$$
where $\beta \geq 0$ and $\sigma' > 0$ are constants determined by the local curvature and the noise level, and $B(t)$ is a standard Brownian motion. The proof of Lemma 4.2 is provided in Appendix C.3. The solution of (13) is as follows:

$$Z(t) = \sigma' \int_0^t e^{-\beta (t - s)}\, dB(s) + Z(0)\, e^{-\beta t}. \tag{14}$$

Note that the second term on the right hand side of (14) decays to $0$ as $t \to \infty$. The rest is a random-walk-type stochastic integral. Thus, the fluctuation of $Z(t)$ is essentially the error fluctuation of the algorithm after a sufficiently long time.

Combining the three lemmas above, we obtain the following theorem.

Suppose Assumption 4.2 holds. Given a sufficiently small error $\epsilon > 0$ and a properly chosen step size $\eta$, we need

(15)

iterations such that, with high probability, the solution is within distance $\epsilon$ of the optimum of (3). The proof of Theorem 4.2 is provided in Appendix C.4. Theorem 4.2 implies that, asymptotically, our algorithm yields an iteration complexity that not only depends on the gap, i.e., $\lambda_1/\mu_1 - \lambda_2/\mu_2$, but also depends on a quantity that, in the worst case, is the condition number of $B$. As can be seen, for an ill-conditioned $B$, problem (3) is more difficult to solve.

4.3 When $A$ and $B$ Are Noncommutative

Unfortunately, when $A$ and $B$ are noncommutative, the analysis is more difficult, even for $k = 1$. Recall that the optimization landscape of the Lagrangian function in (4) enjoys a nice geometric property: at an unstable equilibrium, the negative curvature with respect to the primal variable encourages the algorithm to escape. Specifically, suppose the algorithm is initialized at an unstable equilibrium $(X_0, Y_0)$; the descent direction for the primal variable is determined by the eigenvectors of the Hessian $\nabla_X^2 L(X_0, Y_0)$ associated with its negative eigenvalues. After one iteration, we obtain $(X_1, Y_1)$, and the Hessian matrix becomes $\nabla_X^2 L(X_1, Y_1)$. Since the update is a stochastic approximation, the random noise can make $\nabla_X^2 L(X_1, Y_1)$ significantly different from $\nabla_X^2 L(X_0, Y_0)$. Thus, the eigenvectors of $\nabla_X^2 L(X_1, Y_1)$ associated with the negative eigenvalues can also be very different from those of $\nabla_X^2 L(X_0, Y_0)$. This phenomenon can seriously confuse the algorithm about the descent direction of the primal variable. We remark that such an issue does not appear when $A$ and $B$ are commutative. We suspect that this difficulty is very likely an artifact of our proof technique, since our numerical experiments provide empirical evidence that SGHA converges.

5 Discussion

Here we briefly discuss a few related works:


  • Li et al. (2016b) propose a framework for characterizing the stationary points of the unconstrained nonconvex matrix factorization problem, while the generalized eigenvalue problem we study is constrained. Different from their analysis, we analyze the optimization landscape of the corresponding Lagrangian function. When characterizing the stationary points, we need to take both primal and dual variables into consideration, which is technically more challenging.

  • Ge et al. (2016a) also consider the (offline) generalized eigenvalue problem, but in a finite-sum form. Unlike our online setting, they access the exact $A$ and $B$ in each iteration. Specifically, they need the exact $A$ and $B$ to compute an approximate inverse of $B$ to find the descent direction. Meanwhile, they also need a modified Gram-Schmidt process, which again requires accessing the exact $B$, to maintain the solution on the generalized Stiefel manifold (defined via the exact $B$; Mishra and Sepulchre (2016)). Our proposed stochastic search, however, is a fully stochastic primal-dual algorithm, which neither requires accessing the exact $A$ and $B$ nor enforces the primal variables to stay on the manifold.

References

Appendix A Proofs for Determining Stationary Points

A.1 Proof of Theorem 3.1

Proof.

Recall that the eigendecomposition of $B$ is $B = O_B \Lambda_B O_B^\top$. Given the eigendecomposition $\widetilde{A} = O_{\widetilde{A}} \Lambda_{\widetilde{A}} O_{\widetilde{A}}^\top$, we can write $A$ as
$$A = O_B \Lambda_B^{1/2} O_{\widetilde{A}} \Lambda_{\widetilde{A}} O_{\widetilde{A}}^\top \Lambda_B^{1/2} O_B^\top.$$
We write a candidate equilibrium as $X = O_B \Lambda_B^{-1/2} W$ for some $W \in \mathbb{R}^{d \times k}$ with $W^\top W = I_k$. Ignoring the constant 2 in the gradient for convenience, we have,