1 Introduction
We consider the following class of nonconvex linearly constrained optimization problems:
(1) $\min_{x \in \mathbb{R}^d} \; f(x) \quad \text{s.t.} \quad Ax \le b,$
where $f: \mathbb{R}^d \to \mathbb{R}$ is twice differentiable (possibly nonconvex), and $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$ are a given matrix and vector. This class of problems finds many applications in machine learning and data science. For example, in the nonnegative matrix factorization (NMF) problem, a given data matrix $Z$ is to be factorized into two nonnegative matrices $X$ and $Y$ such that $\|Z - XY^\top\|_F^2$ is minimized [1]. It is also of interest to consider the symmetric case where $X = Y$. Further, for nonconvex problems with nonsmooth regularizers (such as sparse PCA), we need to solve $\min_x f(x) + \lambda \|x\|_1$, which can be equivalently written as
(2) $\min_{x, y} \; f(x) + \lambda \textstyle\sum_{i} y_i \quad \text{s.t.} \quad -y \le x \le y,$
which is of the form in (1).
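The lifting in (2) can be sketched numerically. The construction below is illustrative (the block layout of $A$ and the names `lift_l1`, `lam` are not from the paper): stacking $z = [x; y]$, the two constraint blocks $x - y \le 0$ and $-x - y \le 0$ force $y_i \ge |x_i|$, so minimizing $\lambda \sum_i y_i$ recovers $\lambda \|x\|_1$.

```python
import numpy as np

# Illustrative sketch: rewrite min f(x) + lam*||x||_1 as min f(x) + lam*sum(y)
# subject to A z <= b with z = [x; y], where the rows encode
# x_i - y_i <= 0 and -x_i - y_i <= 0 (so feasibility forces y_i >= |x_i|).
def lift_l1(d):
    I = np.eye(d)
    A = np.block([[I, -I], [-I, -I]])
    b = np.zeros(2 * d)
    return A, b

d, lam = 3, 0.5
A, b = lift_l1(d)
x = np.array([1.0, -2.0, 0.0])
y = np.abs(x)                  # the tightest feasible y recovers the l1 norm
z = np.concatenate([x, y])
assert np.all(A @ z <= b + 1e-12)                        # [x; |x|] is feasible
assert np.isclose(lam * y.sum(), lam * np.abs(x).sum())  # objective matches lam*||x||_1
```

Any smaller $y$ would violate one of the two blocks, which is why the lifted linear program is tight at $y = |x|$.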
Recently, algorithms that escape strict saddle points (stationary points at which the Hessian has negative curvature) for unconstrained nonconvex problems have attracted significant research effort. This is partly due to recent discoveries that, for certain unconstrained nonconvex problems, the optimization landscape is benign, in the sense that the stationary points are either global minima or strict saddle points (e.g., shallow neural network training [2; 3]), or all saddle points are strict saddle points (e.g., tensor decomposition [4], phase retrieval [5], low-rank matrix factorization [6], etc.). Therefore, escaping (strict) saddle points guarantees convergence to local or even global optima. A natural question then arises: what if the considered problem has some simple constraints or nonsmooth regularizers? After all, there are many machine learning and data science problems of this type. It would seem straightforward to extend the previous "saddle-point escaping" algorithms to these settings, just as algorithms that reach unconstrained first-order stationary solutions can be extended to constrained problems. Unfortunately, this is not the case. As will be seen shortly, even checking second-order stationarity for linearly constrained problems can be daunting. The main task of this paper is then to identify situations in which finding second-order stationary solutions for problem (1) is easy, and to design efficient algorithms for this task.
Related work. For unconstrained smooth problems, there is a line of work that develops algorithms using both the gradient direction and the negative curvature, so that second-order stationary points (SOSPs) can be attained within a certain number of iterations [7; 8; 9]. For example, a Hessian-free algorithm [8] is guaranteed to provably converge to SOSPs within a certain number of iterations. By exploiting eigendecomposition, the convergence rate of some variants of the Newton method to SOSPs has been shown in [10; 11], where the algorithm is assumed to be able to access the Hessian matrix around strict saddle points [12]. Another way of finding negative curvature is to occasionally add noise to the iterates. A perturbed version of gradient descent (GD) was first proposed in [13], which shows that the convergence rate of perturbed GD to SOSPs is provably faster than that of the ordinary gradient descent algorithm with random initializations. In a follow-up work on the perturbation technique [14], the authors proposed NEgative-curvature-Originated-from-Noise (NEON) and illustrated that perturbed gradient descent essentially implements the power method around the saddle point, so that a negative curvature direction of the Hessian matrix is extracted without performing an eigenvalue decomposition. Other recent works include generalizations of NEON such as NEON$^+$ [14] and NEON2 [15], and the perturbed alternating gradient descent proposed in [16] for block-structured nonconvex problems. In practice, there may be constraints, such as equality and inequality constraints. For equality constraints, negative curvature methods have been proposed [17; 18] so that SOSPs can be obtained asymptotically as the algorithm proceeds. Despite these recent exciting developments, the aforementioned methods are unable to incorporate even the simplest linear inequality constraints. In practical machine learning problems, however, inequality constraints are ubiquitous due to physical constraints. Examples include neural network training with nonnegativity constraints [19], NMF [1], nonnegative tensor factorization [20], resource allocation in wireless networks [luwa18], image restoration, and nonconvex quadratic programming with box constraints (QPB) [21], to name just a few. Existing works either directly rely on second-order descent directions of the objective function [22; 7], or use this information together with other methods such as the trust region method [23], the cubic regularized Newton's method [24], primal-dual algorithms [26], etc. These algorithms are generally unfavorable for large-scale problems due to scalability issues in computing the second-order information. Moreover, to the best of our knowledge, there has been no first-order method that can provably compute SOSPs for the linearly constrained problem (1).
An even more challenging issue is that finding SOSPs for general linearly constrained nonconvex problems is NP-hard. Indeed, it has been shown in [27] that even obtaining approximate SOSPs is hard, in terms of both the total number of constraints and the inverse of the desired second-order optimality error. Consequently, existing methods for finding SOSPs with global convergence rates all require some exponential complexity (exponential in the total number of constraints); see [27; 28].
Contributions of this work. We first introduce two notions of (approximate) SOSPs for problem (1): one based on identifying the active constraints at a given solution (referred to as SOSP1), and one based on the feasible directions orthogonal to the gradient (referred to as SOSP2). In particular, we show that these two conditions become equivalent when certain (provably mild) strict complementarity (SC) conditions are satisfied. This equivalence enables us to design an algorithm by exploiting the active sets along the solution path, which is computationally much simpler than existing methods based on checking feasible directions orthogonal to the gradient. We then propose a Successive Negative-curvature grAdient Projection (SNAP) algorithm, which finds second-order solutions with high probability. The algorithm updates the iterates by successively using either a gradient projection step or a negative-curvature projection step (with appropriate active-constraint-based line-search procedures). Further, we extend SNAP to a first-order algorithm (abbreviated as SNAP$^+$) which utilizes gradient steps to extract negative curvature. Numerical simulations demonstrate that the proposed algorithms efficiently converge to SOSPs.
The main contributions of this work are summarized as follows:
1) We study problem (1) and analyze the equivalence of two different notions of (approximate) SOSPs under the assumption of strict complementarity. This part of the work provides new insights into solution structures, and is useful in the subsequent algorithm design.
2) We propose the SNAP algorithm, which provably computes an approximate SOSP, with computational complexity polynomial in the problem dimension as long as the projection, gradient, and Hessian can be computed efficiently.
3) We extend SNAP to SNAP$^+$, an algorithm that only uses the gradient of the objective function while still being able to compute SOSPs for problem (1). Each iteration of the proposed algorithm only requires simple vector operations and projections onto the feasible set, which can be done in polynomial time under reasonable oracles [29]. This makes the proposed algorithm amenable to large-scale optimization problems. To the best of our knowledge, this is the first first-order method capable of achieving the above stated properties.
Notation. Bold lower-case characters (e.g., $x$) represent vectors and bold capital ones (e.g., $A$) denote matrices; $x_i$ denotes the $i$th entry of the vector $x$. $A^\dagger$ denotes the pseudo-inverse of the matrix $A$, and $\|A\|$ denotes its spectral norm.
2 Preliminaries
Definition 1.
A differentiable function $f$ is gradient Lipschitz (with constant $L > 0$) if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathcal{X}$, where $\mathcal{X} \triangleq \{x : Ax \le b\}$ denotes the feasible set.
Definition 2.
A twice-differentiable function $f$ is Hessian Lipschitz (with constant $\rho > 0$) if $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho\|x - y\|$ for all $x, y \in \mathcal{X}$.
We make the following assumption on the objective function of (1).
Assumption 1.
$f$ in (1) is gradient Lipschitz and Hessian Lipschitz, i.e.,
(3) $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho\|x - y\|, \quad \forall\, x, y \in \mathcal{X}.$
Let $\mathcal{A}(x) \triangleq \{i : a_i^\top x = b_i\}$ denote the active set at a given point $x$, where $a_i^\top$ denotes the $i$th row of the matrix $A$ and $b_i$ denotes the $i$th entry of $b$. Define $\overline{\mathcal{A}}(x)$ to be the complement of the set $\mathcal{A}(x)$, i.e.,
(4) $\overline{\mathcal{A}}(x) \triangleq \{i : a_i^\top x < b_i\}.$
Concatenating the coefficients of the active constraints at $x$, we define the matrix $A_{\mathcal{A}(x)}$ as
(5) $A_{\mathcal{A}(x)} \triangleq [a_i^\top]_{i \in \mathcal{A}(x)}.$
In other words, $A_{\mathcal{A}(x)}$ is a submatrix of $A$ containing the rows of $A$ corresponding to the active set. Similarly, we can define $b_{\mathcal{A}(x)}$ by concatenating the entries of $b$ corresponding to the active constraints. At a given point $x$, we define the projection onto the free space (the directions along which the active constraints remain satisfied with equality) as
(6) $P_{\mathcal{A}(x)} \triangleq I - A_{\mathcal{A}(x)}^\dagger A_{\mathcal{A}(x)}.$
Here, $P_{\mathcal{A}(x)}$ represents the projection matrix onto the null space of $A_{\mathcal{A}(x)}$. Define $A_{\mathcal{A}(x)}^\dagger A_{\mathcal{A}(x)}$ as the projector onto the row space of $A_{\mathcal{A}(x)}$.
(7) 
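The active set and the null-space projector can be sketched as follows. The tolerance `tol` and the pseudoinverse-based construction are implementation choices, not prescriptions from the paper:

```python
import numpy as np

# Sketch: active set A(x) = {i : a_i^T x = b_i} and the projector onto the
# null space of the active-constraint matrix, as in (4)-(6).
def active_set(A, b, x, tol=1e-9):
    return np.where(np.abs(A @ x - b) <= tol)[0]

def null_space_projector(A_act, d):
    if A_act.shape[0] == 0:
        return np.eye(d)              # no active constraints: full space is free
    return np.eye(d) - np.linalg.pinv(A_act) @ A_act

# Feasible set: x1 >= 0, x2 >= 0, x1 + x2 <= 3, written as A x <= b.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 3.0])
x = np.array([0.0, 0.5])              # only the first constraint is active
act = active_set(A, b, x)
P = null_space_projector(A[act], x.size)
assert list(act) == [0]
assert np.allclose(A[act] @ P, 0.0)   # P maps into the null space of A_act
assert np.allclose(P @ P, P)          # P is idempotent, i.e. a projector
```

Directions of the form $P v$ keep all active constraints tight to first order, which is exactly the free space used throughout the paper.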
To measure the first-order optimality of a given point, we first define the proximal gradient
(8) $g(x) \triangleq \frac{1}{\gamma}\big(x - \mathrm{proj}_{\mathcal{X}}(x - \gamma \nabla f(x))\big),$
where $\gamma > 0$ is a fixed given constant and $\mathrm{proj}_{\mathcal{X}}$ denotes the projection operator onto the feasible set $\mathcal{X}$. The first-order optimality gap of a given point is then measured by the size of the proximal gradient, $\|g(x)\|$.
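A minimal sketch of this optimality measure for the special case of nonnegativity constraints (the test function and the value of $\gamma$ are illustrative):

```python
import numpy as np

# Proximal-gradient optimality gap g(x) = (x - proj(x - gamma*grad))/gamma,
# sketched for the feasible set x >= 0.
def prox_grad(x, grad, gamma, proj):
    return (x - proj(x - gamma * grad)) / gamma

proj_nonneg = lambda z: np.maximum(z, 0.0)

# f(x) = 0.5*||x - c||^2 with c = (-1, 2): the constrained minimizer is (0, 2).
c = np.array([-1.0, 2.0])
grad = lambda x: x - c
x_star = np.array([0.0, 2.0])
g = prox_grad(x_star, grad(x_star), gamma=0.1, proj=proj_nonneg)
assert np.allclose(g, 0.0)            # the gap vanishes at a first-order solution

x_bad = np.array([1.0, 1.0])
assert np.linalg.norm(prox_grad(x_bad, grad(x_bad), 0.1, proj_nonneg)) > 0.1
```

Note that at the constrained minimizer the raw gradient need not vanish (here it is $(1, 0)$); the proximal gradient is what vanishes.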
To define the second-order optimality gap, let us start by stating the well-known exact second-order necessary conditions for local minima of constrained optimization problems [30, Proposition 3.3.1].
Proposition 1.
This proposition leads to the following form of exact SOSP.
Definition 3 (Exact SOSP1).
Similarly, we define the following approximate SOSP condition for problem (1).
Definition 4 (SOSP1).
A point $x$ is an (approximate) SOSP1 point of problem (1) if
(approx. first-order condition)  (11a)
(approx. second-order condition)  (11b)
where the tolerances appearing in (11a) and (11b) are some given small positive constants.
By utilizing the definition of the null space of the active set in (6), we can rewrite condition (11b) as
(12) 
where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a matrix. The above definition of second-order solutions leads to the following definition of first-order stationary solutions.
Definition 5 (FOSP1).
If a point satisfies the approximate first-order condition (11a), we call it a first-order stationary point of the first kind, abbreviated as FOSP1.
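The curvature check in the SOSP1 condition reduces to a single eigenvalue computation on the projected Hessian, as the rewriting above indicates. A sketch (the matrices and tolerance are illustrative):

```python
import numpy as np

# SOSP1 curvature test: smallest eigenvalue of the Hessian restricted to the
# free space (null space of the active constraints) must be nearly nonnegative.
def sosp1_curvature_ok(H, P, eps2):
    lam_min = np.linalg.eigvalsh(P.T @ H @ P).min()
    return lam_min >= -eps2

H = np.array([[ 1.0, -2.0],
              [-2.0,  1.0]])            # eigenvalues -1 and 3: indefinite
P_full = np.eye(2)                      # no active constraints: full free space
P_axis = np.diag([0.0, 1.0])            # free space = span(e2)
assert not sosp1_curvature_ok(H, P_full, 0.1)   # negative curvature detected
assert sosp1_curvature_ok(H, P_axis, 0.1)       # restricted to e2, curvature is 1
```

This is why the SOSP1 condition is tractable: the active set, a null-space projector, and one symmetric eigenvalue problem suffice.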
Note that the conditions in (10) and (11) are necessary conditions for being a local minimum solution. There are, of course, many other necessary conditions of this kind. Therefore, to distinguish among the various solution concepts, we will refer to solutions satisfying the above exact and approximate conditions as SOSPs of the first kind, abbreviated as SOSP1. Below we present another popular second-order solution concept, which appears in the optimization literature; see [28; 27] and the references therein.
Definition 6 (SOSP2).
A point $x$ is a second-order stationary point of the second kind of problem (1) if the following conditions are satisfied:
(13a)  
(13b) 
We refer to the above conditions as second order stationary solution of the second kind, abbreviated as SOSP2.
Definition 7 (FOSP2).
If a solution only satisfies (13a), we call it a first-order stationary point of the second kind (FOSP2).
The classical result [31] shows that checking the SOSP2 condition for (1) is NP-hard in the problem dimension, even for the class of quadratic functions. This hardness result has recently been strengthened, in terms of both the problem dimension and the optimality tolerance, in [27]. On the other hand, checking the SOSP1 condition only requires projection onto linear subspaces and finding minimum eigenvalues. A natural question then arises: how do these different kinds of approximate and exact second-order solution concepts relate to each other? In what follows, we provide a concrete answer to this question. The answer relies on a critical concept called strict complementarity, introduced below.
Definition 8.
A given primal-dual pair $(x^*, \mu^*)$ for the linearly constrained problem (1) satisfies the Karush–Kuhn–Tucker (KKT) condition with strict complementarity if
(14a) $\nabla f(x^*) + A^\top \mu^* = 0, \quad Ax^* \le b, \quad \mu^* \ge 0, \quad \mu_i^*\,(a_i^\top x^* - b_i) = 0 \ \ \forall\, i,$
(14b) $\mu_i^* > 0, \quad \forall\, i \in \mathcal{A}(x^*).$
Note that the SC condition has been assumed and used in convergence analysis of many algorithms, e.g., trust region algorithms for nonconvex optimization with bound constraints in [32; 33; 34].
The results below extend a recent result in [35, Proposition 2.3], which shows that SC is satisfied for box-constrained nonconvex problems with high probability. See Appendices A.1–A.2 for the proofs.
Proposition 2.
Corollary 1.
Suppose $f$ is differentiable. If the data vector $b$ is generated from a continuous measure, then the SC condition holds for (1) with probability one.
This result shows that for a certain generic choice of objective functions, the SC condition is satisfied. As we will see in the next section, this SC condition leads to the equivalence of SOSP1 and SOSP2 conditions. Hence, instead of working with the computationally intractable SOSP2 condition, we can use a tractable SOSP1 condition for developing algorithms. In other words, by adding a small random linear perturbation to the objective function, which does not practically change the landscape of the optimization problem, we can avoid the computational intractability of SOSP2.
3 Almost Sure Equivalence of SOSP1 and SOSP2
To understand the relation between SOSP1 and SOSP2, let us consider the following example.
Example 1. Consider the following box-constrained quadratic problem:
(15) 
Clearly, the point $x = 0$ is an SOSP1: the gradient of the objective is zero there, both inequality constraints are active, and hence the second-order condition in (11) holds vacuously. However, this point is not an SOSP2 according to the definition in (13): the first-order condition holds for all feasible directions, but the curvature condition fails along a feasible direction orthogonal to the gradient.
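Equation (15) is not recoverable from the extraction, but the phenomenon can be reproduced with a stand-in objective that behaves the same way (this specific $f$ is an assumption, not the paper's example): $f(x) = -x_1 x_2$ with constraints $x \ge 0$.

```python
import numpy as np

# Illustrative stand-in for Example 1: f(x) = -x1*x2 over x >= 0.
f = lambda x: -x[0] * x[1]
grad = lambda x: np.array([-x[1], -x[0]])
H = np.array([[0.0, -1.0], [-1.0, 0.0]])   # Hessian, eigenvalues -1 and +1

x0 = np.zeros(2)
assert np.allclose(grad(x0), 0.0)          # stationary; both constraints active
# SOSP1 holds: the free space at x0 is {0}, so the curvature test is vacuous.
# SOSP2 fails: d = (1, 1) is feasible, orthogonal to the gradient, and carries
# negative curvature, so f decreases along it.
d = np.array([1.0, 1.0])
assert grad(x0) @ d == 0.0
assert d @ H @ d < 0
assert f(x0 + 0.1 * d) < f(x0)
```

The gap arises because SOSP1 only inspects directions that keep the active constraints tight, while SOSP2 inspects all feasible directions orthogonal to the gradient.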
The above example suggests that condition (13) is stronger than (11), even in the exact case. Indeed, one can show that any point satisfying (13) also satisfies (11). More importantly, these conditions become equivalent when the SC condition (14) holds true. These results are presented in Appendices A.3–A.6 and summarized in Proposition 3 below.
Proposition 3.
In view of Example 1, the above equivalence result is somewhat surprising. However, by applying Proposition 2, one can slightly perturb the problem in Example 1 by adding a small random linear term to its objective so as to satisfy the SC condition. One can check that after this perturbation, the two conditions become the same.
Next, we proceed by analyzing the approximate versions of the SOSP1 and SOSP2 conditions.
Corollary 2.
Although at this point we have not shown the equivalence of the approximate SOSP1 and SOSP2 conditions (a result that remains very challenging), we provide an alternative result showing that the approximate SOSP1 is a valid approximation of the exact SOSP1, which in turn is equivalent to SOSP2 by Proposition 3. In particular, we show that an approximate SOSP1 becomes an exact SOSP1 as the tolerances go to zero.
Proposition 4.
Let $\{x_r\}$ be a sequence generated by an algorithm, and assume that each point $x_r$ is an approximate SOSP1 with tolerances vanishing as $r \to \infty$. Assume further that the sequence converges to a point $\bar{x}$. Then, any limit point of the sequence is an exact SOSP1.
While in general SOSP2 is stronger than SOSP1, using SOSP1 has the following advantages:
1) For a given $x$, checking whether the SOSP1 condition holds is computationally tractable, since it only requires finding the active constraints, computing their null space, and performing an eigenvalue decomposition. On the other hand, as proved in [27], checking SOSP2 is NP-hard even for quadratic objectives.
2) Intuitively, it is relatively easy to design an algorithm based on active constraints: When a sequence of iterates approaches an FOSP, the corresponding active set remains the same (see [36, Proposition 1.37], [37, Proposition 3]). Therefore locally the inequality constrained problem is reduced to an equality constrained problem, whose secondorder conditions are much easier to satisfy; see [38][39].
3) As we have shown, the SC condition is satisfied with probability one for problems with random data, implying that finding SOSP1 is already good enough for these problems.
Clearly, our proposed solution concept SOSP1 represents an interesting tradeoff between the quality of the solutions and computational complexity of the resulting algorithms. In what follows, we will design efficient algorithms that can compute such a solution concept.
4 SNAP for Computing SOSP1
4.1 Algorithm Description
Our proposed algorithm successively performs two main steps: a conventional projected gradient descent (PGD) step and a negative curvature descent (NCD) step. Assuming that the feasible set is easy to project onto (e.g., the nonnegativity constraints in the NMF problem), the PGD step finds an approximate first-order solution efficiently, while the NCD step explores curvature around a first-order stationary solution to move the iterates forward.
To provide a detailed description of the algorithm, we first introduce the notion of feasible directions used in (10b) and (11b). In particular, for a given point $x$, we define the feasible direction subspace as the null space of the active-constraint matrix $A_{\mathcal{A}(x)}$. Such directions are useful for extracting negative curvature directions. We will refer to this subspace as the free space and to its orthogonal complement as the active space.
The inputs of the algorithm are some constants related to the problem data, the initial solution, and a few parameters, among which are the stepsize and the accuracy of the curvature-finding procedure; both will be determined later.
It is important to note that as long as the computation of the gradient, the Hessian, and the projection can be done in a polynomial number of floating point operations, the computational complexity of SNAP is polynomial. For most practical problems, it is reasonable to assume that the gradient and Hessian can be computed efficiently. In addition, projection onto linear inequality constraints is well studied and can be done in polynomial time under reasonable oracles [29].
The PGD step (line 21). The conventional PGD, given below, is implemented in line 21 of Algorithm 1 with a constant stepsize $\alpha$:
(16) $x_{t+1} = \mathrm{proj}_{\mathcal{X}}\big(x_t - \alpha \nabla f(x_t)\big).$
The PGD step guarantees that the objective value decreases, so that the algorithm can reach approximate FOSPs efficiently (assuming that the projection can be computed relatively easily). The procedure stops whenever the FOSP1 gap, measured by the size of the proximal gradient (8), falls below the prescribed tolerance.
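The PGD step can be sketched as follows for the nonnegativity case; the stepsize, test function, and iteration count are illustrative choices:

```python
import numpy as np

# One projected gradient descent step with constant stepsize alpha, as in (16),
# sketched for the feasible set x >= 0.
def pgd_step(x, grad, alpha, proj):
    return proj(x - alpha * grad(x))

proj = lambda z: np.maximum(z, 0.0)
grad = lambda x: x - np.array([2.0, -1.0])   # f = 0.5*||x - (2, -1)||^2

x = np.array([5.0, 5.0])
for _ in range(200):
    x = pgd_step(x, grad, alpha=0.5, proj=proj)
assert np.allclose(x, [2.0, 0.0], atol=1e-6)  # converges to the constrained minimizer
```

On this strongly convex example PGD converges linearly; on nonconvex problems it only guarantees the first-order gap shrinks, which is exactly why the NCD step below is needed.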
Negative curvature descent (NCD, lines 4–19).
After PGD has been completed, we know that the proximal gradient is already small. Suppose that an SOSP1 solution has not been found yet. Then our next step is to design an update direction that further decreases the objective.
Towards this end, an NCD step is performed (Algorithm 1, lines 4–19), which constructs update directions that exploit curvature information of the Hessian matrix, while ensuring that the iterates stay in the feasible region.
The NCD step further contains the following subprocedures.
(1) Extracting negative curvature.
Assume that (12) does not hold. First, a procedure NegativeEigenPair is called, which exploits some second-order information about the function at the current iterate and returns an approximate eigenpair of the Hessian. Such an approximate eigenpair should satisfy the following requirements (with probability at least a prescribed level): the returned direction lies in the free space and has unit norm; and its Rayleigh quotient with respect to the Hessian is sufficiently negative, i.e., close to the smallest eigenvalue of the Hessian restricted to the free space.
If such a pair is found, NegativeEigenPair returns it; otherwise, it returns a flag indicating that no sufficiently negative curvature is present. When (12) is violated, many existing algorithms can meet the two conditions stated above in a finite number of iterations (e.g., the power method or the Lanczos method). However, these methods typically require evaluating the Hessian matrix or Hessian-vector products. Subsequently, we will design a new procedure that utilizes only gradient information for this purpose.
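One standard way to realize such an oracle (when Hessian access is available) is the shifted power method: power iteration on $cI - H$ amplifies the most negative eigendirection of $H$, where $c$ is any upper bound on $H$'s eigenvalues, e.g. the gradient Lipschitz constant. A sketch, with illustrative parameters:

```python
import numpy as np

# Sketch of a NegativeEigenPair oracle via the shifted power method.
def negative_eigen_pair(H, c, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    M = c * np.eye(H.shape[0]) - H        # spectrum shifted so the most
    for _ in range(iters):                # negative eigenvalue of H dominates
        v = M @ v
        v /= np.linalg.norm(v)
    lam = v @ H @ v                       # Rayleigh quotient estimate
    return (True, lam, v) if lam < 0 else (False, lam, None)

H = np.diag([2.0, 0.5, -1.5])
found, lam, v = negative_eigen_pair(H, c=2.0)
assert found and np.isclose(lam, -1.5, atol=1e-6)
assert np.isclose(abs(v[2]), 1.0, atol=1e-4)   # aligned with the e3 eigenvector
```

In the constrained setting the iteration would additionally be projected onto the free space; the Lanczos method plays the same role with faster convergence.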
(2) Selection of update direction. First, note that whichever choice of the direction we make, feasibility of the iterate is preserved. Second, by the gradient Lipschitz continuity, we have
(17) $f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^\top d + \frac{L\alpha^2}{2}\|d\|^2.$
Similarly, by the Hessian Lipschitz continuity, we have
(18) $f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^\top d + \frac{\alpha^2}{2}\, d^\top \nabla^2 f(x)\, d + \frac{\rho\alpha^3}{6}\|d\|^3.$
Therefore, it can be observed that the descent of the objective value is determined by the choice of the direction $d$.
The actual update direction is chosen between the direction found in the previous step and a direction computed by directly projecting the gradient onto the subspace of feasible directions. The selection criterion, given in line 8 of Algorithm 1, is motivated by the following descent properties.
Lemma 1.
If the iterate is updated by Algorithm 2 (the line search) and the maximum stepsize in Algorithm 2 is not chosen, then the minimum descent of the objective value obtained by choosing one of the two candidate directions is no less than that obtained by selecting the other, and vice versa.
(3) Backtracking line search (Algorithm 2). The third subprocedure uses a line search to determine the stepsize so that a new feasible iterate can be generated. The key in this step is to make sure that either the new iterate achieves some kind of "sufficient" descent, or it touches the boundary of the feasible set. A flag is used to denote whether the updated iterate touches a new boundary or not. Since the direction is already fixed to lie in the free space, after performing the line search the dimension of the free space is nonincreasing.
To understand the algorithm, let us first define the set of inactive constraints as
(19) 
The details of the line search are shown in Algorithm 2.
(20) 
(21) 
(22) 
(23) 
(24) 
In particular, in this line search procedure, we first determine a maximum stepsize. Recall that the set defined in (19) represents the constraints that are inactive at the current point.
Lemma 2.
If there exists an index such that the following holds
(25) 
then, we have
(26) 
On the other hand, if the condition (25) does not hold, it means that along the current direction the problem is effectively unconstrained. Therefore, the line search algorithm reduces to the classical unconstrained update. Then, by setting the maximum stepsize appropriately, SNAP gives a sufficient decrease in this case; see Lemma 5.
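The maximum stepsize in Lemma 2 is the classical ratio test; the formula below is a reconstruction under the standard assumption that only constraints with $a_i^\top d > 0$ can block the step (names and tolerances are illustrative):

```python
import numpy as np

# Maximum feasible stepsize along direction d for the feasible set A x <= b:
# only constraints moving toward their boundary (a_i^T d > 0) can block.
def max_stepsize(A, b, x, d, alpha_cap=np.inf):
    Ad = A @ d
    slack = b - A @ x                    # nonnegative for feasible x
    blocking = Ad > 1e-12
    if not np.any(blocking):
        return alpha_cap                 # effectively unconstrained direction
    return min(alpha_cap, np.min(slack[blocking] / Ad[blocking]))

A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 3.0])
x = np.array([1.0, 1.0])
d = np.array([1.0, 0.0])
abar = max_stepsize(A, b, x, d)
assert np.isclose(abar, 1.0)                     # hits x1 + x2 <= 3 at alpha = 1
assert np.all(A @ (x + abar * d) <= b + 1e-12)   # the boundary point is feasible
assert np.any(A @ (x + (abar + 0.1) * d) > b)    # any further step is infeasible
```

When no constraint blocks the direction, the cap `alpha_cap` plays the role of the unconstrained stepsize discussed above.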
After choosing the maximum stepsize, we check whether the following holds:
(27) 
If so, then the algorithm either touches the boundary without increasing the objective, or it has already achieved sufficient descent.
If (27) does not hold, then the algorithm calls the backtracking line search, successively shrinking the stepsize starting from the maximum stepsize. In particular, if the curvature estimate is below some predetermined negative quantity (see (22)), we keep shrinking the stepsize until a sufficient descent condition is satisfied (such sufficient descent can eventually be achieved; see Lemma 5 and Lemma 6).
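A minimal sketch of such a backtracking loop along a negative-curvature direction (the sufficient-descent constant `c`, the shrink factor, and the test function are illustrative assumptions, not Algorithm 2's exact constants):

```python
import numpy as np

# Backtracking: starting from alpha_max, halve the stepsize until a
# sufficient-descent test (quadratic in alpha, as negative curvature predicts)
# is met.
def backtrack(f, x, d, alpha_max, c=1e-3, shrink=0.5, max_tries=50):
    alpha = alpha_max
    for _ in range(max_tries):
        if f(x + alpha * d) <= f(x) - c * alpha**2 * (d @ d):
            return alpha
        alpha *= shrink
    return 0.0

# Along a negative-curvature direction of f(x) = x1^2 - x2^2 at the saddle,
# descent is guaranteed for small enough alpha.
f = lambda x: x[0]**2 - x[1]**2
x = np.array([0.0, 0.0])
d = np.array([0.0, 1.0])
alpha = backtrack(f, x, d, alpha_max=1.0)
assert alpha > 0.0
assert f(x + alpha * d) < f(x)
```

The quadratic form of the descent target matches the second-order expansion (18): along negative curvature, the decrease scales with $\alpha^2$, not $\alpha$.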
(4) "Flags" in the algorithm. We note that after each NCD step, if the flag indicates that some sufficient descent has been achieved, we perform a fixed number of PGD iterations. This design tries to improve the practical efficiency of the algorithm by striking a balance between objective reduction and computational complexity. In practice, this number can be chosen as any constant; with the smallest choice, SNAP becomes simpler and retains the same order of convergence rate. See Appendix C.2.
4.2 Theoretical Guarantees
The convergence analysis of the proposed algorithm is provided in this section. See Appendix C.
Theorem 1.
(Convergence rate of SNAP) Suppose the objective function satisfies Assumption 1. There exists a sufficiently small stepsize such that the sequence generated by Algorithm 1 satisfies the optimality condition (11) within the following number of iterations, with high probability:
(28) 
where the randomness comes from the oracle NegativeEigenPair, the bound depends on the gap between the initial objective value and the global minimum value, and hides the number of iterations run by the oracle NegativeEigenPair.
Remark 1. The convergence rate of SNAP has the same order, in terms of the optimality tolerance, as those proposed in [28] and [27] (which compute SOSP2s). However, it is important to note that the per-iteration complexity of SNAP is polynomial in both the problem dimension and the number of constraints, while the algorithms proposed in [28] and [27] have exponential per-iteration complexity.
Remark 2. In particular, the number of iterations SNAP needs to achieve an approximate SOSP1 follows by substituting the desired tolerances into (28).
5 First-order Successive Negative-curvature Gradient Projection (SNAP$^+$)
In this section, we propose a first-order variant of SNAP, named SNAP$^+$, featuring a subspace perturbed gradient descent (SPGD) procedure that can extract negative curvature within a subspace. Our work is motivated by recent results showing that occasionally adding random noise to the iterates of GD helps escape from saddle points efficiently [13; 4]. The benefit of SNAP$^+$ is that its complexity can be improved significantly, since the procedure of finding the negative eigenpair is implemented by a first-order method.
In particular, the key idea of these perturbation schemes is to use successive differences of gradients [14], given below, to approximate the Hessian-vector product:
(29) 
Here, the finite-difference radius is a properly selected constant, the stepsize is fixed, and the algorithm is initialized from a random vector drawn from a uniform distribution over a small region around the origin whose radius is some constant. This process can be viewed as performing power iteration around the strict saddle point. The details of the algorithm are presented in Algorithm 3, and its convergence is as follows.
Theorem 2.
Suppose SPGD is called with the stepsize chosen according to
(30) 
Then, with high probability, SPGD either returns a vector satisfying
(31) 
or it returns the zero vector, indicating that
(32) 
holds.
Essentially, when SPGD stops, it produces a direction that satisfies the requirements on the outputs of the NegativeEigenPair oracle in Algorithm 1. It follows that the rate claimed in Theorem 1 still holds for SNAP$^+$. The proof can be found in Appendix D.2.
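The gradient-difference idea behind SPGD can be sketched directly: for a finite-difference radius $r$, $(\nabla f(x + r u) - \nabla f(x))/r \approx \nabla^2 f(x)\, u$, so a power-style iteration can extract negative curvature without ever forming the Hessian. The function names, shift constant, and iteration count below are illustrative:

```python
import numpy as np

# Hessian-vector product via a gradient difference (exact for quadratics).
def hvp_from_gradients(grad, x, v, r=1e-5):
    return (grad(x + r * v) - grad(x)) / r

# First-order negative-curvature extraction: power iteration on c*I - H,
# with every H-product replaced by a gradient difference.
def first_order_curvature(grad, x, c, iters=300, seed=1):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = c * v - hvp_from_gradients(grad, x, v)
        v /= np.linalg.norm(v)
    return v @ hvp_from_gradients(grad, x, v), v

# f(x) = x1^2 - x2^2 has Hessian diag(2, -2): curvature -2 along e2.
grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
lam, v = first_order_curvature(grad, np.zeros(2), c=2.0)
assert np.isclose(lam, -2.0, atol=1e-3)
assert abs(v[1]) > 0.99                 # aligned with the negative direction
```

In SNAP$^+$ the same iteration would additionally be projected onto the free space, so that the extracted direction respects the active constraints.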
Corollary 3.
(Convergence rate of SNAP$^+$) Suppose Assumption 1 is satisfied and SPGD with a sufficiently small stepsize is used to find the negative eigenpair. Then, the sequence generated by Algorithm 1 finds an approximate SOSP1 in the following number of iterations, with high probability (where the bound hides polynomial factors in the problem parameters):
(33) 
Note that the total rate includes the number of iterations required by SPGD, so it is slower than the rate of SNAP.
6 Connection with Existing Work and Future Work
Algorithm  |  Per-iteration complexity  |  Iterations  |  SOSP  |  Oracle
ESP [28]  |  exponential  |  –  |  SOSP2  |  Hessian
SOLCTRACE [40]  |  exponential  |  –  |  SOSP2  |  Hessian-vector product
SNAP (this work)  |  polynomial  |  –  |  SOSP1  |  Hessian-vector product
SNAP$^+$ (this work)  |  polynomial  |  –  |  SOSP1  |  Gradient
In Table 1, we provide comparisons of key features of a few existing algorithms for finding SOSPs of linearly constrained nonconvex problems.
1) SNAP and SNAP$^+$ have polynomial per-iteration complexity, while the existing works rely on subroutines with exponential complexity for solving inner-loop optimization problems.
2) SNAP has the same order of convergence rate as the existing work in [28]. If acceleration techniques are adopted, SNAP can converge even faster.
3) To the best of our knowledge, SNAP$^+$ is the first first-order algorithm with a provable convergence rate to SOSPs, where the rate is polynomial in the problem dimension and the total number of constraints.
7 Numerical Results
In this section, we showcase the numerical advantages of SNAP and SNAP$^+$ over PGD and PGD with line search (PGDLS) on multiple machine learning problems, such as NMF, training nonnegative neural networks, and penalized NMF.
7.1 NMF Problems
First, we consider the following NMF problem:
(34a)  $\min_{X, Y}\ \|Z - XY^\top\|_F^2$
(34b)  s.t. $\ X \ge 0, \quad Y \ge 0.$
The starting point for all the algorithms is a scaled pair of randomly generated matrices, where a scaling constant controls the distance between the initialization point and the origin; large and small values of this constant correspond to large and small initialization, respectively. The rationale for considering a small initialization is that, for NMF problems, it can be easily checked that the origin is a saddle point. By initializing around this point, we examine whether the proposed SNAP and SNAP$^+$ are indeed able to escape from this region.
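The saddle at the origin is easy to verify in the smallest instance. The toy $1 \times 1$ case below (scalar data $z > 0$ is an assumption chosen for illustration) shows the origin is stationary yet carries negative curvature, which is why plain PGD initialized nearby can stall:

```python
import numpy as np

# Toy 1x1 NMF loss f(x, y) = 0.5*(z - x*y)^2 with x, y >= 0 and z > 0.
z = 2.0
f = lambda x, y: 0.5 * (z - x * y) ** 2
grad = lambda x, y: np.array([-(z - x * y) * y, -(z - x * y) * x])

assert np.allclose(grad(0.0, 0.0), 0.0)       # the origin is stationary
# Hessian at the origin: [[0, -z], [-z, 0]], eigenvalues +z and -z: strict saddle.
H0 = np.array([[0.0, -z], [-z, 0.0]])
assert np.isclose(np.linalg.eigvalsh(H0).min(), -z)
d = np.array([1.0, 1.0]) / np.sqrt(2.0)       # feasible direction (x, y >= 0)
assert d @ H0 @ d < 0                         # negative curvature along d
assert f(0.1, 0.1) < f(0.0, 0.0)              # moving along d decreases the loss
```

The escape direction $(1, 1)/\sqrt{2}$ lies in the free space at the origin's boundary only after a constraint becomes inactive, which is precisely the situation the NCD step and its line search are designed to handle.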
7.1.1 Synthetic Dataset
We compare the proposed SNAP and SNAP$^+$ with PGD and PGDLS on a synthetic dataset for the NMF problem and show the advantages of exploiting negative curvature. The data matrices are randomly generated with entries drawn from a uniform distribution, and we randomly set 5% of the entries of $Z$ to 0. The starting point for all the algorithms is a scaled pair of randomly generated matrices whose entries follow a Gaussian distribution. Clearly, the origin is a strict saddle point. We use three different scaling constants to initialize the sequence, and the results are shown in Figures 1–3, where the stepsizes of PGD, SNAP, and SNAP$^+$ are constants. Note that the stopping criteria are removed in this simulation; otherwise, PGD and PGDLS would not produce any output when the initialization is close to the origin. It can be observed that when the initialization is large, all algorithms converge to the global optimal point of this NMF problem, whereas when it is small, as shown in Figure 3, PGD and PGDLS only converge to a point whose loss value is very large compared with the ones achieved by SNAP and SNAP$^+$. These results show that when the iterates are near strict saddle points, by exploiting the negative curvature, SNAP and SNAP$^+$ are able to escape quickly and converge to the global optimal solutions. Comparing SNAP and SNAP$^+$, we can see that the computational time of SNAP$^+$ is lower: the cost of computing the Hessian and its eigendecomposition is high, so SNAP takes more time to converge. By accessing only the gradient and the loss value of the objective, SNAP$^+$ needs to extract only the eigenvector corresponding to the smallest eigenvalue of the Hessian around the strict saddle point. The line search is one of the most effective ways of computing stepsizes. From Figures 1 and 2, it can be observed that PGDLS converges faster than PGD in terms of iterations but costs more computational time. SNAP and SNAP$^+$ use the line search only occasionally rather than at every step, so their computational time is not as high as that of PGDLS. In Figure 3(b), it can be observed that SNAP and SNAP$^+$ obtain loss values that are many orders of magnitude smaller than those obtained by PGD and PGDLS, confirming that the proposed methods are able to escape from saddle points, while PGD and PGDLS get trapped.
7.1.2 Real Dataset
We also compare the convergence behaviors of the algorithms on the USPS handwritten digits dataset [41], in which the images are grayscale pixel arrays. Since the problem size is large and performing eigenvalue decomposition is prohibitive, in Figure 4 we only compare SNAP$^+$, PGD, and PGDLS.
7.2 Nonnegative Two-Layer Nonlinear Neural Networks
In this section, we consider training a nonnegative two-layer nonlinear neural network, formulated as
s.t. 