We consider the following class of non-convex linearly constrained optimization problems
where is twice differentiable (possibly non-convex); , and
are some given matrix and vector. Such a class of problems finds many applications in machine learning and data science. For example, in the nonnegative matrix factorization (NMF) problem, a given data matrixis to be factorized into two nonnegative matrices and such that is minimized . It is also of interest to consider the symmetric case where . Further, for non-convex problems with regularizers (such as sparse PCA), we need to solve , which can be equivalently written as
which is of the form in (1).
Recently, algorithms that escape strict saddle points (stationary points that have negative curvatures) for unconstrained non-convex problems attracted significant research efforts. This is partly due to the recent discoveries that, for certain non-convex unconstrained problems, their optimization landscape is nice, in the sense that the stationary points are either global minimum or strict saddle points (e.g., shallow neural network training[2; 3]), or all saddle points are strict
saddle points (e.g., tensor decomposition, phase retrieval , low-rank matrix factorization , etc.). Therefore, escaping (strict) saddle points guarantees convergence to either local or even global optima.
A natural question then arises: what if the considered problem has some simple constraints or non-smooth regularizers? After all, there are many machine learning and data sciences problems of this type. It would seem to be straightforward to extend the previous “saddle-point escaping” algorithms to these setting, just like we can extend algorithms that can reach unconstrained first-order stationary solutions to constrained problems. Unfortunately, this is not the case. As will be seen shortly, even checking the second-order stationary solution for linearly constrained problems could be daunting. The main task of this paper is then to identify situations in which finding second-order stationary solution for problem (1) is easy, and design efficient algorithms for this task.
Related work. For unconstrained smooth problems, there has been a line of work that develops the algorithm by using both the gradient direction and negative curvature so that second-order stationary points (SOSPs) can be attained within a certain number of iterations [7; 8; 9]. For example, a Hessian-free algorithm  is guaranteed to provably converge to SOSPs within a certain number of iterations. By exploiting the eigen-decomposition, the convergence rate of some variant of the Newton method to SOSPs has been shown in [10; 11], where the algorithm is assumed to be able to access the Hessian matrix around strict saddle points . Another way of finding negative curvature is to occasionally add noise to the iterates. A perturbed version of gradient descent (GD) was firstly proposed in , which shows that the convergence rate of perturbed GD to SOSPs is provably faster than the ordinary gradient descent algorithm with random initializations. In a follow-up work of the perturbation technique 
, the authors proposed NEgative-curvature-Originated-from-Noise (NEON), and illustrated that the perturbed gradient descent is essentially implementing the power method around the saddle point so that a negative curvature of the Hessian matrix is extracted without performing the eigenvalue decomposition. Other recent works include generalizations of NEON such as NEON NEON2  and perturbed alternating gradient descent proposed in  for block structured non-convex problems. In practice, there may be some constraints, such as equality and inequality constraints. For the equality constraint, negative curvature method has been proposed [17; 18] so that SOSP can be obtained asymptotically as the algorithm proceeds.
Despite these recent exciting developments, the aforementioned methods are unable to incorporate even the simpliest linear inequality constraints. In practical machine learning problems, however, inequality constraints are ubiquitous due to physical constraints. Examples include neural networks training with the nonnegative constraint , NMF , nonnegative tensor factorization , resource allocation in wireless networks [luwa18], image restoration, non-convex quadratic programming with box constraints (QPB) , to name just a few. Existing work either directly rely on second-order descent directions of the objective function [22; 7], or use this information together with other methods such as the trust region method , cubic regularized Newton’s method , primal-dual algorithm , etc. These algorithms are generally unfavorable for large-scale problems due to the scalability issues when computing the second-order information. However, to the best of our knowledge, there has been no first-order methods that can provably compute SOSPs for linearly constrained problem (1).
An even more challenging issue is that finding SOSP for general linearly constrained non-convex problems is NP-hard. Indeed, it has been shown in  that even obtaining the approximate SOSPs is hard in terms of both the total number of constraints and the inverse of the desired second-order optimality error. So existing methods for finding SOSPs with global convergence rate all require some exponential complexity (exponential in the total number of constraints); see [27; 28].
Contributions of this work. We first introduce two notions of (approximate) SOSPs for problem (1), one based on identifying the active constraints at a given solution (referred to as SOSP1), and one based on the feasible directions orthogonal to the gradient (referred to as SOSP2). In particular, we show that, these two conditions become equivalent when certain (provably mild) strict complementarity (SC) conditions are satisfied. Such equivalence conditions enable us to design an algorithm by exploiting the active sets of the solution path, which is computationally much simpler compared with existing methods based on checking feasible directions orthogonal to gradient. Then we propose a Successive Negative-curvature grAdient Projection (SNAP) algorithm, which can find second-order solutions with high probability. The algorithm updates the iterates by successively using either gradient projection step, or certain negative-curvature projection step (with appropriate active constraints based line-search procedures). Further, we extend SNAP by proposing a first-order algorithm (abbreviated as SNAP) which utilizes gradient steps to extract negative curvatures. Numerical simulations demonstrate that the proposed algorithm efficiently converges to SOSPs.
The main contributions of this work are summarized as follows:
1) We study problem (1) and analyze the equivalence of two different notions of (approximate) SOSPs under the assumption of strict complementarity. This part of work provides new insights of solution structures, and will be useful in subsequent algorithm design.
2) We propose the SNAP algorithm, which computes some approximate -SOSP with iterations, and with polynomial computational complexity in dimensions as long as projection, gradient, and Hessian can be computed efficiently.
3) We extend SNAP to SNAP, an algorithm that only uses the gradient of the objective function, while being able to compute -SOSPs for problem (1) within iterations. Each iteration of the proposed algorithm only requires simple vector operations and projections to the feasible set which can be done in polynomial iterations under reasonable oracles . This makes the proposed algorithm amenable for large scale optimization problems. To the best of our knowledge, this is the first first-order method that is capable of achieving the above stated properties.
Notation. Bold lower case characters, e.g., represents vectors and bold capital ones, e.g., denotes a matrix, or denotes the th entry of vector or where and . denotes the pseudo inverse of matrix , and denotes the spectral norm of .
A differentiable function is -gradient Lipschitz if where denotes the feasible set.
A twice-differentiable function is -Hessian Lipschitz if
We make the following assumption on the objective function of (1).
in (1) is -gradient Lipschitz and -Hessian Lipschitz, i.e.,
Let denote the active set at a given point , where denotes the th row of matrix and denotes the th entry of . Define to be the complement of the set , i.e.,
Concatenating the coefficients of the active constraints at , we define the matrix as
In other words, is a submatrix of containing the rows of corresponding to the active set. Similarly, we can define by concatenating the entries of corresponding to the active set of constraints. At a given point , we define the projection onto the space spanned by the inactive constraints as
Here, represents the projection matrix to the null space of . Define as the projector to the column space of . Similarly, let us define
To measure the first-order optimality of a given point, we first define the proximal gradient
where is a fixed given constant and denotes the projection operator onto the feasible set. Then can be used to define the first-order optimality gap for a given point the size of the proximal gradient, expressed as.
To define the second-order optimality gap, let us start by stating the popular exact second-order necessary conditions for local minimum points of constrained optimization [30, Proposition 3.3.1].
This proposition leads to the following form of exact SOSP.
Definition 3 (Exact SOSP1).
Similarly, we define the following approximate SOSP condition for problem (1) as:
Definition 4 (-Sosp1).
A point is an -SOSP point of problem (1) if
|(approx. first-order condition)||(11a)|
|(approx. second-order condition)||(11b)|
where are some given small constants.
where is the operator that returns the smallest eigenvalue of a matrix. The above definition of second-order solutions leads to the following definition of first-order stationary solutions.
Definition 5 (-Fosp1).
If a point satisfies the condition , we call it an -first-order stationary point of the first kind, abbreviated as -FOSP1.
Note that the conditions in (10) and (11) are necessary conditions for being a local minimum solutions. There are, of course, many other necessary conditions of this kind. Therefore, to distinguish from various solution concepts, we will refer to the solutions satisfying the above conditions as SOSP of the first kind and -SOSP of the first kind, abbreviated as SOSP1 and -SOSP1, respectively. Below we present another popular second-order solution concept, which appears in optimization literature; see [28; 27], and the references therein.
Definition 6 (-Sosp2).
A point is an -second-order stationary point of the second kind of problem (1) if the following conditions are satisfied:
We refer to the above conditions as -second order stationary solution of the second kind, abbreviated as -SOSP2.
Definition 7 (-Fosp2).
If a solution only satisfies (13a), we call it an -first-order stationary point of the second kind (FOSP2).
The classical result  shows that checking -SOSP2 for (1) is NP-hard in the problem dimension and in even for the class of quadratic functions. This hardness result has recently been strengthened by showing NP-hardness in terms of problem dimension and in . On the other hand, checking -SOSP1 condition only requires projection onto linear subspaces and finding minimum eigenvalues. A natural question then arises: How do these different kinds of approximate and exact second-order solution concepts relate to each other? In what follows, we provide a concrete answer to this question. This answer relies on a critical concept called strict complementarity, which will be introduced below.
A give primal-dual pair for the linearly constrained problem (1) satisfies the Karush–Kuhn–Tucker (KKT) condition with strict complementarity if
Suppose and is differentiable. If the data vector is generated from a continuous measure, then the SC condition holds for (1) with probability one.
This result shows that for a certain generic choice of objective functions, the SC condition is satisfied. As we will see in the next section, this SC condition leads to the equivalence of SOSP1 and SOSP2 conditions. Hence, instead of working with the computationally intractable SOSP2 condition, we can use a tractable SOSP1 condition for developing algorithms. In other words, by adding a small random linear perturbation to the objective function, which does not practically change the landscape of the optimization problem, we can avoid the computational intractability of SOSP2.
3 Almost Sure Equivalence of SOSP1 and SOSP2
To understand the relation between SOSP1 and SOSP2, let us consider the following example. Example 1. Consider the following box constrained quadratic problem:
Clearly the point is an SOSP1. This is because the gradient of the objective is zero, both inequality constraints are active at this point and in (11). However, the point is not an SOSP2 according to the definition in (13). This is because the condition is true for all feasible , but for .
The above example suggests that condition (13) is stronger than (11), even when . Indeed, one can show that any point satisfying (13) also satisfies (11). More importantly, these conditions become equivalent when the SC condition (14) holds true. These results are presented in A.3 – A.6 and is summarized in Proposition 3 below.
In view of Example 1, the above equivalence result is somewhat surprising. However, by applying Proposition 2, one can slightly perturb the problem in Example 1 by adding to its objective a random linear term in the form of (with being very small) to satisfy the SC condition. One can check that after this perturbation, the two conditions become the same.
Next we proceed by analyzing the approximate second-order conditions of SOSP1 and SOSP2.
Although at this point we have not shown the equivalence of -SOSP1 and -SOSP2 (a result that remains very challenging), we provide an alternative result showing that -SOSP1 is a valid approximation of -SOSP1, which in turn is equivalent to -SOSP2 by Proposition 3. In particular, we show that -SOSP1 becomes -SOSP1 as .
Let be a sequence generated by an algorithm. Assume for each , the point be an -SOSP1. Assume further that converges to the point . Then, any limit point of the sequence is an exact SOSP1.
While in general SOSP2 is stronger than SOSP1, using SOSP1 has the following advantages:
1) For a given , checking whether SOSP1 holds is computationally tractable, since it only requires finding the active constraints, computing its null space, and performing an eigenvalue decomposition. On the other hand, as proved in , checking SOSP2 is NP-hard even for quadratic .
2) Intuitively, it is relatively easy to design an algorithm based on active constraints: When a sequence of iterates approaches an FOSP, the corresponding active set remains the same (see [36, Proposition 1.37], [37, Proposition 3]). Therefore locally the inequality constrained problem is reduced to an equality constrained problem, whose second-order conditions are much easier to satisfy; see .
3) As we have shown, the SC condition is satisfied with probability one for problems with random data, implying that finding SOSP1 is already good enough for these problems.
Clearly, our proposed solution concept SOSP1 represents an interesting tradeoff between the quality of the solutions and computational complexity of the resulting algorithms. In what follows, we will design efficient algorithms that can compute such a solution concept.
4 SNAP for Computing SOSP1
4.1 Algorithm Description
Our proposed algorithm successively performs two main steps: a conventional projected gradient descent (PGD) step and a negative curvature descent (NCD) step. Assuming that the feasible set is easy to project (e.g., the non-negativity constraints for the NMF problem), the PGD finds an approximate first-order solution efficiently, while the negative curvature descent step explores curvature around a first-order stationary solution to move the iterates forward.
To provide a detailed description of the algorithm, we will first introduce the notion of the feasible directions using the directions in (10b) and (11b). In particular, for a given point , we define the feasible direction subspace as , where denotes the null space of matrix . Such directions are useful for extracting negative curvature directions. We will refer to the subspace as free space and we refer to its orthogonal complement as active space.
The input of the algorithms are some constants related to the problem data, the initial solution , and parameters . Further is the step-size, and is the accuracy of the curvature finding algorithm, both will be determined later.
It is important to note that as long as the computation of gradient, Hessian, and projection can be done in a polynomial number of floating point operations, the computational complexity of SNAP becomes polynomial. For most practical problems, it is reasonable to assume that gradient and Hessian and can be computed efficiently. In addition, projection to linear inequality constraints is well-studied and can be done polynomially under reasonable oracles .
The PGD step (line 21). The conventional PGD, given below, is implemented in line 21 of Algorithm 1, with a constant step-size :
The PGD guarantees that the objective value decreases so that the algorithm can achieve some approximate FOSPs, i.e., efficiently (assuming that the projection can be done relatively easily). The procedure stops whenever the FOSP1 gap .
Negative curvature descent (NCD line 4-19).
After PGD has been completed, we know that is already small. Suppose that the -SOSP1 solution has not been found yet. Then our next step is to design an update direction to increase .
Towards this end, a NCD step will be performed (Algorithm 1, line 4–19), which constructs update directions that can exploit curvature information about the Hessian matrix, while ensuring that the iterates stay in the feasible region.
The NCD further contains the following sub-procedures.
(1) Extracting negative curvature. Assuming that (12) does not hold. First, a procedure Negative-Eigen-Pair is called, which exploits some second-order information about the function at , and returns an approximate eigen-pair of the Hessian . Such an approximate eigen-pair should satisfy the following requirements (with probability at least ):
for some where such that .
If satisfies all the above conditions, Negative-Eigen-Pair returns , otherwise, it returns . As long as (12) holds, many existing algorithms can achieve the two conditions stated above in a finite number of iterations (e.g., the power or Lanczos method). However, these methods typically require to evaluate the Hessian matrix or Hessian-vector product. Subsequently, we will design a new procedure that only utilizes gradient information for such purposes.
(2) Selection of update direction. First, note that whichever choice of the directions we make, we will have . Second, by the -Lipschitz continuity, we have
since and . Similarly, by -Lipschitz continuity, we have
Therefore, it can be observed that the descent of the objective value is determined by the choice of .
The actual update direction we use is chosen between the direction found in the previous step, and a direction computed by directly projecting to the subspace of feasible directions. The selection criteria, given in line 8 of Algorithm 1, is motivated by the following descent properties (note ).
If is updated by Algorithm 2 (the line search) and in Algorithm 2 (linear search) is not chosen, the minimum descent of the objective value by choosing is no less than the one by selecting , and vice versa.
(3) Backtracking line search (Algorithm 2). The third sub-procedure uses line search to determine the step-size so that the new and feasible iterate can be generated. The key in this step is to make sure that, either the new iterate achieves some kind of “sufficient” descent, or it touches the boundary of the feasible set. We use to denote whether the updated iterate touches a new boundary or not. Since the direction is already fixed to be in the free space, after performing the line search, the dimension of the feasible directions will be non-increasing, i.e., .
To understand the algorithm, let us first define the set of inactive constraints as
The details of the line search algorithm is shown in Algorithm 2.
In particular, in this line search procedure, we will first decide a maximum stepsize . Recall that defined in (19) represents the set of constraints that are inactive at point .
If there exits an index so that the following holds
then, we have
On the other hand, if the condition (25) does not hold, it means that along the current direction the problem is effectively unconstrained. Therefore, the line search algorithm reduces to the classic unconstrained update. Then by setting , SNAP will give a sufficient decrease in this case; see Lemma 5.
After choosing , we check if the following holds
If so, then the algorithm either touches the boundary without increasing the objective, or it has already achieved sufficient descent.
If (27) does not hold, then the algorithm will call the backtracking line search by successively shrinking the step-size starting at . In particular, if (where is some pre-determined negative quantity, see (22)), we will implement until a sufficient descent is satisfied (note, such a sufficient descent can be eventually achieved, see Lemma 5 and Lemma 6).
(4) “Flags” in the algorithm. We note that after each NCD step if (i.e., some sufficient descent is achieved), we perform iterations of PGD. This design tries to improve the practical efficiency of the algorithm by striking the balance between objective reduction and computational complexity. In practice can be chosen as any constant number. When , SNAP becomes simpler and has the same order of convergence rate as the case where . See Appendix C.2.
4.2 Theoretical Guarantees
The convergence analysis of the proposed algorithm is provided in this section. See Appendix C.
(Convergence rate of SNAP) Suppose the objective function satisfies assumption 1. There exists a sufficient small so that the sequence generated by Algorithm 1 satisfies optimality condition (11) in the following number of iterations with probability at least :
where the randomness comes from the oracle Negative-Eigen-Pair, denotes the global minimum value, and hides the number of iterations run by an oracle Negative-Eigen-Pair.
Remark 1. The convergence rate of SNAP has the same order in , compared with those proposed in  and  (which compute -SOSP2s). However, it is important to note that the per-iteration complexity of SNAP is polynomial in both problem dimension and in number of constraints, while algorithms proposed in  and  have exponential per-iteration complexity.
Remark 2. In particulay, SNAP needs number of iterations to achieve an -SOSP1 by just substituting , and .
5 First-order Successive Negative-curvature Gradient Projection (SNAP)
In this section, we propose a first-order algorithm for SNAP, i.e., SNAP, featuring a subspace perturbed gradient descent (SP-GD) procedure that can extract the negative curvature in a subspace. Our work is motivated by recent works, which show that occasionally adding random noise to the iterates of GD can help escape from saddle points efficiently [13; 4]. The benefit of the proposed SNAP is that its complexity can be improved significantly since the procedure of finding the negative eigenpair is implemented by a first-order method.
In particular, the key idea of these perturbation schemes is to use the difference of the gradient successively , given below, to approximate the Hessian-vector product
Here is some properly selected constant, is the step-size and the algorithm is initialized from a random vector
drawn from a uniform distribution in the interval, where is some constant. This process can be viewed as performing power iteration around the strict saddle point. The details of the algorithm is presented in Algorithm 3, and its convergence is as the following.
SP-GD is called with the step-size ,
where . Then, for any , , SP-GD returns and a vector such that
with probability . Otherwise SP-GD returns and vector 0, indicating that with probability .
Essentially, when SP-GD stops, it produces a direction that satisfies the requirements of the outputs for the Negative-Eigen-Pair oracle in Algorithm 1. It follows that the rate claimed in Theorem 1 still holds for SNAP. The proof can be found in Appendix D.2
(Convergence rate of SNAP) Suppose Assumption 1 is satisfied and SP-GD with step-size less than is used to find the negative eigen-pair. Then, there exists a sufficiently small such that the sequence generated by Algorithm 1 finds an -SOSP1 in the following number of iterations with probability at least (where hides the polynomial in terms of )
Note that the total rate includes the number of iteration required by SP-GD, so it is slower than the rate of SNAP.
6 Connection with Existing Work and Future Work
|SO-LC-TRACE ||-SOSP2||H-V product|
|SNAP (This work)||-SOSP1||H-V product|
|SNAP (This work)||-SOSP1||Gradient|
In Table 1, we provide comparisons of key features of a few existing algorithms for finding SOSPs for problem linearly constrained non-convex problems.
1), SNAP and SNAP has a polynomial complexity per-iteration, while the existing works rely on an exponential complexity of some subroutine for solving the inner loop optimization problems.
2), SNAP has the same convergence rate as the existing work in . If the acceleration technique is adopted, SNAP can also have a faster convergence.
3), To the best of our knowledge, SNAP is the first first-order algorithm that has the provable convergence rate to SOSPs, where the rate is only in the order of polynomial in terms of problem dimension and the total number of constraints.
7 Numerical Results
In this section, we showcase the numerical advantages of SNAP and SNAP, compared with PGD and PGD with line search (PGD-LS) for multiple machine learning problems, such as NMF, training nonnegative neural networks, penalized NMF, etc.
7.1 NMF Problems
First, we consider the following NMF problem, which is
The starting point for all the algorithms is , where and are randomly generated. Constant controls the distance between the initialization point and the origin. The cases of and correspond to large and small initialization, respectively. The rationale for considering a small initialization is that for NMF problems, it can be easily checked that is a saddle point. By initializing around this point, we aim at examining whether indeed the proposed SNAP and SNAP are able to escape from this region.
7.1.1 Synthetic Dataset
We compare the proposed SNAP, SNAP with PGD and PGD-LS on the synthetic dataset for MNF problem and show the advantages of exploiting negative curvatures. The data matrices are randomly generated, where , , , , and follows the uniform distribution in the interval . Further, we randomly set 5% entries of as 0. The starting point for all the algorithms is , where and
are randomly generated and follow Gaussian distribution. Clearly, the origin point is a strict saddle point. We use three different constants to initialize sequence and the results are shown in Figure 1–Figure 3, where step-size for PGD, SNAP, and SNAP is , , , , , and . Note that the stopping criteria are removed in the simulation, otherwise PGD and PGD-LS will not give any output if the initialing point is close to origin.
It can be observed that when is large, all algorithms can converge to the global optimal point of this NMF problem, whereas when is small as shown in Figure 3 PGD and PGD-LS only converge to a point that has a very large loss value compared with the ones achieved by SNAP and SNAP. These results show that when the iterates are near the strict saddle points, by exploring the negative curvature, SNAP and SNAP are able to escape from the saddle points quickly and converge to the global optimal solutions. Comparing SNAP and SNAP, we can see that the computational time of SNAP is less than SNAP. The reason is simple, which is the computational complexity of calculation of Hessian and eigen-decomposition is too high so that SNAP takes more time to converge. By accessing the gradient and loss value of the objective function, SNAP is only required to compute one eigen-vector whose eigenvalue is the smallest of Hessian around the strict saddle point. The line search algorithm is one of the most effective ways of computing step-sizes. From Figure 1 and Figure 2, it can be observed that PGD-LS converges faster than PGD in terms of iterations but costs more computational time. SNAP and SNAP are using line search occasionally rather than each step, so the computational time is not as high as PGD-LS. In Figure 3(b), it can be observed that SNAP and SNAP obtain loss values that are many orders of magnitude smaller than those obtained by PGD and PGD-LS, confirming that the proposed methods are able to escape from saddle points, while PGD and PGD-LS get trapped.
7.1.2 Real Dataset
We also compare the convergence behaviours of the algorithms on USPS handwritten digits dataset , where images are grayscale pixels. In Figure 4, we use the , , . Since the problem size is large, performing eigenvalue decomposition is prohibitive, so we only compare SNAP, PGD, and PGD-LS, where and .
7.2 Nonnegative Two Layer Non-linear Neural Networks
In this section, we consider a nonnegative two layer non-linear neural network, which is