There has been a recent revival of interest in non-convex optimization, driven by its applications in machine learning. While the modern history of the subject goes back six or seven decades, the recent attention to the topic stems from new applications as well as the availability of modern analytical and computational tools, which provide a new perspective on classical problems. Following this trend, in this paper we focus on the problem of minimizing a smooth nonconvex function over a convex set as follows:
where $x \in \mathbb{R}^d$ is the decision variable, $\mathcal{C} \subseteq \mathbb{R}^d$ is a closed convex set, and $f$ is a twice continuously differentiable function over $\mathcal{C}$. It is well known that finding the global minimum of Problem (1) is hard. Equally well known is the fact that for certain nonconvex problems, all local minimizers are global. These include, for example, matrix completion (Ge et al., 2016), phase retrieval (Sun et al., 2016), and dictionary learning (Sun et al., 2017). For such problems, finding the global minimum of (1) reduces to finding one of its local minima.
Given the well-known hardness results for finding stationary points, recent work has shifted toward characterizing approximate stationary points. When the objective function is convex, finding an $\epsilon$-first-order stationary point is often sufficient, since it leads to an approximate local (and hence global) minimum. In the nonconvex setting, however, even when the problem is unconstrained, i.e., $\mathcal{C} = \mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough, as the critical point to which convergence is established might be a saddle point. It is therefore natural to look at higher-order derivatives and search for a second-order stationary point. Indeed, under the assumption that all saddle points are strict (formally defined later), in both unconstrained and constrained settings, convergence to a second-order stationary point (SOSP) implies convergence to a local minimum. While convergence to an SOSP has been thoroughly investigated in the recent literature for the unconstrained setting, the iteration complexity of the convex-constrained setting has not been studied yet.
Contributions. Our main contribution in this paper is to propose a generic framework which generates a sequence of iterates converging to an approximate second-order stationary point for the constrained nonconvex problem in (1), when the convex set $\mathcal{C}$ has a specific structure that allows for approximate minimization of a quadratic loss over the feasible set. The proposed framework consists of two main stages: First, it utilizes first-order information to reach a first-order stationary point; next, it incorporates second-order information to escape from a stationary point if it is a local maximizer or a strict saddle point. We show that the proposed approach leads to an $(\epsilon,\gamma)$-second-order stationary point (SOSP) for Problem (1) (see Definition 2). The proposed approach utilizes advances in constant-factor optimization of nonconvex quadratic programs (Vavasis, 1991; Ye, 1992; Fu et al., 1998; Nemirovski et al., 1999) that find a $\rho$-approximate solution over $\mathcal{C}$ in polynomial time, where $\rho \in (0,1]$ is a positive constant that depends on the structure of $\mathcal{C}$. When such an approximate solution exists, the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-SOSP of Problem (1) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.
We show that linear and quadratic constraints satisfy the required condition on the convex set $\mathcal{C}$. In particular, for the case that $\mathcal{C}$ is defined by a set of convex quadratic constraints, we can achieve an $(\epsilon,\gamma)$-SOSP after a number of arithmetic operations that is polynomial in the dimension $d$ of the problem, $1/\epsilon$, $1/\gamma$, and $T_{\mathcal{C}}$, where $T_{\mathcal{C}}$ is the number of arithmetic operations required to solve a linear program over $\mathcal{C}$ or to project a point onto $\mathcal{C}$. We further extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian computations required to reach an $(\epsilon,\gamma)$-SOSP.
1.1 Related work
Unconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms for reaching stationary points of a smooth unconstrained minimization problem. Convergence to first-order stationary points (FOSP) has been widely studied for both deterministic (Nesterov, 2013; Cartis et al., 2010; Agarwal et al., 2017; Carmon et al., 2016, 2017a, 2017b, 2017c) and stochastic settings (Reddi et al., 2016b, a; Allen Zhu and Hazan, 2016; Lei et al., 2017). Stronger results which indicate convergence to an SOSP are also established. Numerical optimization methods such as trust-region methods (Cartis et al., 2012a; Curtis et al., 2017; Martínez and Raydan, 2017) and cubic regularization algorithms (Nesterov and Polyak, 2006; Cartis et al., 2011a, b) can reach an approximate second-order stationary point in a finite number of iterations; however, the computational complexity of each iteration is typically relatively large due to the cost of solving trust-region or regularized cubic subproblems. Recently, a new line of research has emerged that focuses on the overall computational cost to achieve an SOSP. These results build on the idea of escaping from strict saddle points by perturbing the iterates with a properly chosen noise (Ge et al., 2015; Jin et al., 2017a, b; Allen-Zhu, 2017; Xu and Yang, 2017; Allen-Zhu and Li, 2017; Royer and Wright, 2017; Agarwal et al., 2017; Reddi et al., 2018; Paternain et al., 2017).
Constrained case. Asymptotic convergence to first-order and second-order stationary points for the constrained optimization problem in (1) has been studied in the numerical optimization community (Burke et al., 1990; Conn et al., 1993; Facchinei and Lucidi, 1998; Di Pillo et al., 2005). Recently, finite-time analysis for convergence to an FOSP of the generic smooth constrained problem in (1) has received a lot of attention. In particular, Lacoste-Julien (2016) shows that the sequence of iterates generated by the Frank-Wolfe update converges to an $\epsilon$-FOSP of Problem (1) after $\mathcal{O}(\epsilon^{-2})$ iterations. Ghadimi et al. (2016) consider the norm of the gradient mapping as a measure of non-stationarity and show that the projected gradient method has the same complexity of $\mathcal{O}(\epsilon^{-2})$. Further, Ghadimi and Lan (2016) prove a similar complexity for the accelerated projected gradient method with a slightly better dependency on the Lipschitz constant of the gradients. Adaptive cubic regularization methods in (Cartis et al., 2012b, 2013, 2015) improve these results using second-order information and obtain an $\epsilon$-FOSP of Problem (1) after at most $\mathcal{O}(\epsilon^{-3/2})$ iterations. Finite-time analysis for convergence to an SOSP has also been studied for linear constraints. To be more precise, Bian et al. (2015) study convergence to an SOSP of (1) when the set $\mathcal{C}$ is defined by linear constraints and propose a trust region interior point method that obtains an $(\epsilon,\sqrt{\epsilon})$-SOSP in $\mathcal{O}(\epsilon^{-3/2})$ iterations. Haeser et al. (2017) extend their results to the case that the objective function is potentially not differentiable or not twice differentiable on the boundary of the feasible region. Cartis et al. (2017) focus on the general convex constraint case and introduce a trust region algorithm that obtains an SOSP in a finite number of iterations; however, each iteration of their proposed method requires access to the exact solution of a nonconvex quadratic program (finding its global minimum), which, in general, could be computationally prohibitive.
To the best of our knowledge, our paper provides the first finite-time overall computational complexity analysis for reaching an SOSP of Problem (1) with nonlinear constraints.
2 Preliminaries and Definitions
In the case of unconstrained minimization of the objective function $f$, the first-order and second-order necessary conditions for a point $x^*$ to be a local minimum of $f$ are $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$, respectively. A point satisfying these conditions is called a second-order stationary point (SOSP). If the second condition becomes strict, i.e., $\nabla^2 f(x^*) \succ 0$, then we recover the sufficient conditions for a local minimum. However, to derive finite-time convergence bounds for achieving an SOSP, these conditions should be relaxed. In other words, the goal should be to find an approximate SOSP where the approximation error can be arbitrarily small. For the case of unconstrained minimization, a point $x^*$ is called an $(\epsilon,\gamma)$-second-order stationary point if it satisfies
$\|\nabla f(x^*)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\big(\nabla^2 f(x^*)\big) \ge -\gamma,$ (2)
where $\epsilon$ and $\gamma$ are arbitrary positive constants. In the following definition we formally define strict saddle points for the unconstrained version of Problem (1), i.e., when $\mathcal{C} = \mathbb{R}^d$.
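A direct numerical check of these two unconstrained conditions can be sketched as follows; the two test functions are purely illustrative:

```python
import numpy as np

# Check the unconstrained (epsilon, gamma)-SOSP conditions:
# ||grad f(x)|| <= epsilon and lambda_min(hess f(x)) >= -gamma.
def is_approx_sosp(grad, hess, x, eps, gamma):
    g = grad(x)
    lam_min = np.linalg.eigvalsh(hess(x))[0]   # smallest eigenvalue
    return np.linalg.norm(g) <= eps and lam_min >= -gamma

# f(x, y) = x^2 - y^2 has a strict saddle at the origin,
# while g(x, y) = x^2 + y^2 has a local (global) minimum there.
f_grad = lambda x: np.array([2 * x[0], -2 * x[1]])
f_hess = lambda x: np.diag([2.0, -2.0])
g_grad = lambda x: 2 * x
g_hess = lambda x: 2 * np.eye(2)

origin = np.zeros(2)
print(is_approx_sosp(f_grad, f_hess, origin, 0.1, 0.1))  # False: saddle
print(is_approx_sosp(g_grad, g_hess, origin, 0.1, 0.1))  # True: minimum
```

At the saddle of the first function the gradient condition holds but the curvature condition fails, which is exactly the case the second-order stage of the paper is designed to detect.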
Consider Problem (1) when $\mathcal{C} = \mathbb{R}^d$. Then, $x^*$ is called a $\delta$-strict saddle point if it is a saddle point, i.e., $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is indefinite, and the smallest eigenvalue of the Hessian evaluated at $x^*$ is strictly smaller than $-\delta$, i.e., $\lambda_{\min}(\nabla^2 f(x^*)) < -\delta$.
Using the definitions of a $\delta$-strict saddle and an $(\epsilon,\gamma)$-SOSP, we obtain that an $(\epsilon,\gamma)$-SOSP is a local minimum for the unconstrained optimization problem if all the saddle points are $\delta$-strict and the condition $\gamma < \delta$ is satisfied.
To study the constrained setting, we first state the necessary conditions for a local minimum of problem (1).
[Bertsekas (1999)] If $x^*$ is a local minimum of the function $f$ over the convex set $\mathcal{C}$, then
$\nabla f(x^*)^\top (x - x^*) \ge 0 \quad \text{for all } x \in \mathcal{C},$ (3)
$(x - x^*)^\top \nabla^2 f(x^*)\,(x - x^*) \ge 0 \quad \text{for all } x \in \mathcal{C} \text{ with } \nabla f(x^*)^\top (x - x^*) = 0.$ (4)
The conditions in (3) and (4) are the first-order and second-order necessary optimality conditions, respectively. By making the inequality in (4) strict, i.e., $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) > 0$ for all such $x \neq x^*$, we recover the sufficient conditions for a local minimum when $\mathcal{C}$ is polyhedral (Bertsekas, 1999). Further, if the inequality in (4) is replaced by $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge \delta \|x - x^*\|^2$ for some $\delta > 0$, we obtain the sufficient conditions for a local minimum of Problem (1) for any convex constraint set $\mathcal{C}$; see (Bertsekas, 1999). If a point satisfies the conditions in (3) and (4), it is an SOSP of Problem (1).
As in the unconstrained setting, the first-order and second-order optimality conditions may not be satisfied in a finite number of iterations, so we focus on finding an approximate SOSP.
Recall the twice continuously differentiable function $f$ and the closed convex set $\mathcal{C}$ introduced in Problem (1). We call $x^*$ an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) if the following conditions are satisfied:
$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon \quad \text{for all } x \in \mathcal{C},$ (5)
$(x - x^*)^\top \nabla^2 f(x^*)\,(x - x^*) \ge -\gamma \quad \text{for all } x \in \mathcal{C} \text{ with } \nabla f(x^*)^\top (x - x^*) \le 0.$ (6)
If a point only satisfies the first condition, we call it an $\epsilon$-first-order stationary point.
To further clarify the conditions in Definition 2, we first elaborate on the conditions in Proposition 2 for stationary points. The condition in (3) ensures that there is no feasible direction that makes the linear term in the Taylor series of $f$ around $x^*$ negative. If there exists a point $x \in \mathcal{C}$ violating this condition, then by choosing a very small stepsize we can find a feasible direction that decreases the objective function value, and, therefore, $x^*$ cannot be a local minimum. The condition in (5) relaxes this requirement and checks that the inner product $\nabla f(x^*)^\top (x - x^*)$ is not too negative for any $x \in \mathcal{C}$. In other words, it ensures that the function value does not decrease by more than $\epsilon$, to first order, along any feasible direction. Note that if $\nabla f(x^*)^\top (x - x^*)$ is strictly positive for all $x \in \mathcal{C} \setminus \{x^*\}$, then in a small neighborhood of $x^*$ any feasible direction increases the function value and hence $x^*$ is a local minimum.

To ensure that $x^*$ satisfies the necessary conditions for a local minimum, among the feasible directions that are orthogonal to the gradient, i.e., $\nabla f(x^*)^\top (x - x^*) = 0$, we must ensure that the function value is non-decreasing. The condition in (4) guarantees that among these directions there is no direction that makes the quadratic term in the Taylor series of $f$ around $x^*$ negative. We would like to emphasize that, since the first-order optimality condition guarantees $\nabla f(x^*)^\top (x - x^*) \ge 0$ for any $x \in \mathcal{C}$, the quadratic condition in (4) only needs to hold for directions with $\nabla f(x^*)^\top (x - x^*) = 0$: for directions with $\nabla f(x^*)^\top (x - x^*) > 0$ the function value is increasing in a small neighborhood of $x^*$. Using the same argument, for the relaxed version of (4) it is not required to check the quadratic condition for points satisfying $\nabla f(x^*)^\top (x - x^*) > 0$; however, as we relax the first-order optimality condition, i.e., the inner product is allowed to be a small negative value, we need to ensure that the quadratic condition in (4) holds for all points $x \in \mathcal{C}$ with $\nabla f(x^*)^\top (x - x^*) \le 0$. The conditions in (5) and (6) together ensure that in a neighborhood of $x^*$ the function value does not decrease by more than $\epsilon$ through the linear term and by more than $\gamma$ through the quadratic term.
We further formally define strict saddle points for the constrained optimization problem in (1).
A point $x^* \in \mathcal{C}$ is a $\delta$-strict saddle point of Problem (1) if (i) for all $x \in \mathcal{C}$ the condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ holds, and (ii) there exists a point $\bar{x} \in \mathcal{C}$ with $\nabla f(x^*)^\top (\bar{x} - x^*) = 0$ such that
$(\bar{x} - x^*)^\top \nabla^2 f(x^*)\,(\bar{x} - x^*) < -\delta.$
We emphasize that in this paper we do not assume that all saddle points are strict in order to prove convergence to an SOSP. We formally define strict saddles only to clarify that if all saddles are strict, then convergence to an approximate SOSP is equivalent to convergence to an approximate local minimum.
Our goal throughout the rest of the paper is to design an algorithm which finds an $(\epsilon,\gamma)$-SOSP of Problem (1). To do so, we first assume the following conditions are satisfied.
The gradients $\nabla f$ are $L$-Lipschitz continuous over the set $\mathcal{C}$, i.e., for any $x, \hat{x} \in \mathcal{C}$,
$\|\nabla f(x) - \nabla f(\hat{x})\| \le L \|x - \hat{x}\|.$ (7)
The Hessians $\nabla^2 f$ are $M$-Lipschitz continuous over the set $\mathcal{C}$, i.e., for any $x, \hat{x} \in \mathcal{C}$,
$\|\nabla^2 f(x) - \nabla^2 f(\hat{x})\| \le M \|x - \hat{x}\|.$ (8)
The diameter of the compact convex set $\mathcal{C}$ is upper bounded by a constant $D$, i.e.,
$\max_{x, \hat{x} \in \mathcal{C}} \|x - \hat{x}\| \le D.$ (9)
3 Main Result
In this section, we introduce a generic framework to reach an $(\epsilon,\gamma)$-SOSP of the non-convex function $f$ over the convex set $\mathcal{C}$, when $\mathcal{C}$ has a specific structure as we describe below. In particular, we focus on the case when we can solve a quadratic program (QP) of the form
$\min_{u \in \mathcal{C}} \ u^\top A u + b^\top u + c$ (11)
up to a constant factor with a finite number of arithmetic operations. Here, $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix, $b \in \mathbb{R}^d$ is a vector, and $c \in \mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a constant factor, consider $u^*$ as a global minimizer of (11). Then, we say Problem (11) is solved up to a constant factor $\rho \in (0,1]$ if we have found a feasible solution $\bar{u} \in \mathcal{C}$ such that
$\bar{u}^\top A \bar{u} + b^\top \bar{u} + c \ \le\ \rho\,\big( u^{*\top} A u^* + b^\top u^* + c \big).$ (12)
Note that here, without loss of generality, we have assumed that the optimal objective function value is non-positive. A larger constant $\rho$ implies that the approximate solution is more accurate. If $\bar{u}$ satisfies the condition in (12), we call it a $\rho$-approximate solution of Problem (11). Indeed, if $\rho = 1$ then $\bar{u}$ is a global minimizer of Problem (11).
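The approximation test in (12) amounts to a one-line check once the optimal value is known; the numbers below are purely illustrative:

```python
# A rho-approximate solution must achieve at least a rho fraction of the
# (non-positive) optimal value: q(u_bar) <= rho * q(u_star).
def is_rho_approximate(val, opt_val, rho):
    assert opt_val <= 0.0      # w.l.o.g. the optimal value is non-positive
    return val <= rho * opt_val

# With optimal value -4 and rho = 1/4, any feasible point whose
# objective value is at most -1 qualifies.
print(is_rho_approximate(-1.5, -4.0, 0.25))  # True
print(is_rho_approximate(-0.5, -4.0, 0.25))  # False
```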
In Algorithm 1, we introduce a generic framework that finds an $(\epsilon,\gamma)$-SOSP of Problem (1) whose running time is polynomial in $1/\epsilon$, $1/\gamma$, and $d$, when we can find a $\rho$-approximate solution of a quadratic problem of the form (11) in a time that is polynomial in $d$. The proposed scheme consists of two major stages. In the first stage, as described in Steps 2-4, we use a first-order update, i.e., a gradient-based update, to find an $\epsilon$-FOSP; that is, we update the decision variable according to a first-order update until we reach a point $x_k$ that satisfies the condition
$\nabla f(x_k)^\top (x - x_k) \ge -\epsilon \quad \text{for all } x \in \mathcal{C}.$ (13)
In Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the first-order phase of the proposed framework. Interestingly, both of these algorithms require at most $\mathcal{O}(\epsilon^{-2})$ iterations to reach an $\epsilon$-first-order stationary point.
The second stage of the proposed scheme uses second-order information of the objective function to escape from the stationary point if it is a local maximum or a strict saddle point. To be more precise, if we assume that $x_k$ is a feasible point satisfying the condition (13), we then aim to find a descent direction by solving the following quadratic program up to a constant factor $\rho$:
$\min_{u} \ (u - x_k)^\top \nabla^2 f(x_k)\,(u - x_k) \quad \text{s.t.} \quad u \in \mathcal{C}, \ \ \nabla f(x_k)^\top (u - x_k) \le 0.$ (14)
If we define $q^*$ as the optimal objective function value of the program in (14), we focus on the cases in which we can obtain a feasible point $u_k$ that is a $\rho$-approximate solution of Problem (14), i.e., satisfies the constraints in (14) and
$(u_k - x_k)^\top \nabla^2 f(x_k)\,(u_k - x_k) \ \le\ \rho\, q^*.$ (15)
The problem formulation in (14) can be transformed into the quadratic program in (11); see Section 5 for more details. Note that the constant $\rho$ is independent of $\epsilon$, $\gamma$, and $d$, and only depends on the structure of the convex set $\mathcal{C}$. For instance, if $\mathcal{C}$ is defined in terms of quadratic constraints, one can find a $\rho$-approximate solution of (14) after a number of arithmetic operations polynomial in $d$ and the number of constraints (Section 5).
After computing a feasible point $u_k$ satisfying the condition in (15), we check the quadratic objective function value at the point $u_k$, and if the inequality $(u_k - x_k)^\top \nabla^2 f(x_k)(u_k - x_k) < -\rho\gamma$ holds, we follow the update
$x_{k+1} = x_k + \eta\,(u_k - x_k),$ (16)
where $\eta$ is a positive stepsize. Otherwise, we stop the process and return $x_k$ as an $(\epsilon,\gamma)$-second-order stationary point of Problem (1). To check this claim, note that Algorithm 1 stops if we reach a point $x_k$ that satisfies the first-order stationary condition (13), and the objective function value for the $\rho$-approximate solution of the quadratic subproblem is not smaller than $-\rho\gamma$, i.e., $(u_k - x_k)^\top \nabla^2 f(x_k)(u_k - x_k) \ge -\rho\gamma$. The second condition, alongside the fact that $u_k$ satisfies (15), implies that $q^* \ge -\gamma$. Therefore, for any $x \in \mathcal{C}$ with $\nabla f(x_k)^\top (x - x_k) \le 0$, it holds that
$(x - x_k)^\top \nabla^2 f(x_k)\,(x - x_k) \ \ge\ q^* \ \ge\ -\gamma.$ (17)
These two observations show that the outcome of the proposed framework in Algorithm 1 is an $(\epsilon,\gamma)$-SOSP of Problem (1). It remains to characterize the number of iterations that Algorithm 1 needs to perform before reaching an $(\epsilon,\gamma)$-SOSP, which we formally state in the following theorem.
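The stopping argument can be spelled out as a short chain of inequalities. This is a sketch, writing $q(u) = (u - x_k)^\top \nabla^2 f(x_k)(u - x_k)$ for the subproblem objective, $q^*$ for its minimum, and $-\rho\gamma$ for the stopping threshold; the notation is assumed from the surrounding discussion:

```latex
% For any feasible u with nonpositive inner product with the gradient:
\begin{aligned}
(u - x_k)^\top \nabla^2 f(x_k)\,(u - x_k)
  &\ge q^* && \text{($q^*$ is the subproblem minimum)} \\
  &\ge \tfrac{1}{\rho}\, q(u_k) && \text{(since $q(u_k) \le \rho\, q^*$ and $q^* \le 0$)} \\
  &\ge -\gamma && \text{(the algorithm stopped, so $q(u_k) \ge -\rho\gamma$)}.
\end{aligned}
```

The middle step is where the constant-factor approximation enters: a smaller $\rho$ forces a more conservative stopping threshold.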
Consider the problem in (1). Suppose the conditions in Assumptions 1-3 are satisfied. If in the first-order stage, i.e., Steps 2-4, we use the update of Frank-Wolfe or projected gradient descent, the framework proposed in Algorithm 1 finds an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.
The result in Theorem 3 shows that if the convex constraint set $\mathcal{C}$ is such that one can solve the quadratic subproblem in (14) $\rho$-approximately, then the proposed generic framework finds an $(\epsilon,\gamma)$-SOSP of Problem (1) after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ first-order and second-order updates.
To prove the claim in Theorem 3, we first review the first-order conditional gradient and projected gradient algorithms and show that if the current iterate is not a first-order stationary point, following either of these updates decreases the objective function value by a constant of order $\epsilon^2$ (Section 4). We then focus on the second stage of Algorithm 1, which corresponds to the case that the current iterate is an $\epsilon$-FOSP and we need to solve the quadratic program in (14) approximately (Section 5). In this case, we show that if the iterate is not an $(\epsilon,\gamma)$-SOSP, following the update in (16) decreases the objective function value by at least a constant of order $\rho^3\gamma^3$. Combining these two results shows that Algorithm 1 finds an $(\epsilon,\gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.
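As an illustration of the two-stage idea, the following toy sketch alternates a Frank-Wolfe phase and a vertex-enumeration escape phase for a two-dimensional concave quadratic over the unit box. The objective, the box constraint, the stepsizes, and the exact vertex enumeration (valid here only because the Hessian is negative definite, so the subproblem minimum over this polytope is attained at a vertex) are all assumptions of the example, not the general algorithm:

```python
import numpy as np

# f(x) = -||x - c||^2 over the box [0,1]^2: the center c is a strict
# saddle/maximizer, and the four corners are local (and global) minima.
c = np.array([0.5, 0.5])
f = lambda x: -np.sum((x - c) ** 2)
grad = lambda x: -2.0 * (x - c)
hess = lambda x: -2.0 * np.eye(2)
vertices = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

eps, gamma, rho = 1e-3, 1e-3, 1.0       # rho = 1: subproblem solved exactly
x = c.copy()                            # start exactly at the saddle
converged = False
for _ in range(200):
    g = grad(x)
    v = vertices[np.argmin(vertices @ g)]   # linear minimization over C
    gap = g @ (x - v)                       # Frank-Wolfe gap
    if gap > eps:                           # first-order phase
        x = x + 0.5 * (v - x)
        continue
    # Second-order phase: minimize (u-x)^T H (u-x) over feasible u with
    # grad f(x)^T (u-x) <= 0, by enumerating the box vertices.
    H = hess(x)
    cand = [v for v in vertices if g @ (v - x) <= 1e-12]
    if not cand:
        converged = True
        break
    q_vals = [(v - x) @ H @ (v - x) for v in cand]
    i = int(np.argmin(q_vals))
    if q_vals[i] >= -rho * gamma:
        converged = True                    # x is an (eps, gamma)-SOSP
        break
    x = x + 0.25 * (cand[i] - x)            # escape direction

print(converged, x)   # ends near a corner local minimum
```

Starting from the strict saddle at the center, the escape step moves toward a corner, after which the first-order phase drives the iterate to the corner local minimum and the second-order check certifies approximate stationarity.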
4 First-Order Step: Convergence to a First-Order Stationary Point
In this section, we study two different first-order methods for the first stage of Algorithm 1. The results in this section can also be used independently for convergence to an FOSP of Problem (1) satisfying
$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon \quad \text{for all } x \in \mathcal{C},$ (18)
where $\epsilon$ is a positive constant. Although for Algorithm 1 we assume that $\mathcal{C}$ has the specific structure described in (11), the results in this section hold for any compact convex set $\mathcal{C}$. To keep our results as general as possible, in this section we study both conditional gradient and projection-based methods when they are used in the first stage of the proposed generic framework.
4.1 Conditional gradient update
The conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program
$v_k = \operatorname*{argmin}_{v \in \mathcal{C}} \ \nabla f(x_k)^\top v.$ (19)
Then, we compute the updated variable according to the update
$x_{k+1} = x_k + \eta_k\,(v_k - x_k),$ (20)
where $\eta_k$ is a stepsize. In the following proposition, we show that if the current iterate $x_k$ is not an $\epsilon$-first-order stationary point, then by updating the variable according to (19)-(20) the objective function value decreases. The proof of the following proposition is adapted from (Lacoste-Julien, 2016).
Consider the optimization problem in (1). Suppose Assumptions 1 and 3 hold, with $L$ the gradient Lipschitz constant and $D$ the diameter of $\mathcal{C}$. Set the stepsize in (20) to $\eta_k = \epsilon/(LD^2)$. Then, if the iterate $x_k$ at step $k$ is not an $\epsilon$-first-order stationary point, the objective function value at the updated variable satisfies the inequality
$f(x_{k+1}) \ \le\ f(x_k) - \frac{\epsilon^2}{2LD^2}.$ (21)
The result in Proposition 4.1 shows that by following the update of the conditional gradient method the objective function value decreases by at least $\epsilon^2/(2LD^2)$ until an $\epsilon$-FOSP is achieved. As a consequence of this result, the FW algorithm reaches an $\epsilon$-first-order stationary point after at most $\mathcal{O}(\epsilon^{-2})$ iterations, or, equivalently, after solving at most $\mathcal{O}(\epsilon^{-2})$ linear programs.
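To make the update concrete, here is a minimal Frank-Wolfe loop over the unit box, where the linear program (19) has a closed-form solution (pick a box vertex coordinatewise by the sign of the gradient). The convex quadratic objective and the box set are illustrative assumptions, and we use the classic diminishing stepsize $2/(k+2)$ rather than the $\epsilon$-dependent constant stepsize of Proposition 4.1:

```python
import numpy as np

# f(x) = 0.5 * ||x - t||^2 over the box [0,1]^2 with t = (2, 0.3);
# the constrained minimizer is x* = (1, 0.3).
t = np.array([2.0, 0.3])
grad = lambda x: x - t

def lmo_box(g, lo=0.0, hi=1.0):
    # Linear minimization oracle: argmin_{v in [lo, hi]^d} <g, v>
    return np.where(g > 0, lo, hi)

x = np.zeros(2)
for k in range(2000):
    v = lmo_box(grad(x))
    x = x + 2.0 / (k + 2) * (v - x)     # classic diminishing stepsize

g = grad(x)
gap = g @ (x - lmo_box(g))              # Frank-Wolfe gap at the last iterate
print(x, gap)
```

The final Frank-Wolfe gap certifies approximate first-order stationarity in exactly the sense of (18): it equals $\max_{v \in \mathcal{C}} \nabla f(x)^\top (x - v)$.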
In Step 3 of Algorithm 1 we first check whether $x_k$ is an $\epsilon$-FOSP. This can be done by evaluating
$\min_{v \in \mathcal{C}} \ \nabla f(x_k)^\top (v - x_k)$ (22)
and comparing the optimal value with $-\epsilon$. Note that the linear program in (22) is the same as the one in (19). Therefore, in checking the first-order optimality condition at $x_k$, the variable $v_k$ is already computed, and we need to solve only one linear program per iteration.
4.2 Projected gradient update
The projected gradient descent (PGD) update consists of two steps: (i) descending along the gradient direction and (ii) projecting the updated variable onto the convex constraint set. These two steps can be combined, and the update can be explicitly written as
$x_{k+1} = \pi_{\mathcal{C}}\big( x_k - \eta\, \nabla f(x_k) \big),$ (23)
where $\pi_{\mathcal{C}}$ denotes the Euclidean projection onto the convex set $\mathcal{C}$ and $\eta$ is a positive stepsize. In the following proposition, we first show that by following the update of PGD the objective function value decreases by a constant until we reach an $\epsilon$-FOSP. Further, we show that the number of iterations required for PGD to reach an $\epsilon$-FOSP is of order $\mathcal{O}(\epsilon^{-2})$.
Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied. Further, assume that the gradient norms are uniformly bounded by a constant $K$ for all $x \in \mathcal{C}$. If the stepsize of the projected gradient descent method defined in (23) is set to $\eta = 1/L$, then, as long as $x_k$ is not an $\epsilon$-FOSP, the objective function value decreases by
$f(x_{k+1}) \ \le\ f(x_k) - c\,\epsilon^2,$ (24)
where $c$ is a constant depending on $L$, $D$, and $K$. Moreover, the iterates reach a first-order stationary point satisfying (18) after at most $\mathcal{O}(\epsilon^{-2})$ iterations.
Proposition 4.2 shows that by following the update of PGD the function value decreases by a constant of order $\epsilon^2$ until we reach an $\epsilon$-FOSP. It further shows that PGD obtains an $\epsilon$-FOSP satisfying (18) after at most $\mathcal{O}(\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence guarantee for PGD in (Ghadimi et al., 2016) is in terms of the number of iterations needed to reach a point whose gradient mapping norm is less than $\epsilon$, while our result characterizes the number of iterations needed to satisfy (18).
To use the PGD update in the first stage of Algorithm 1, one needs a criterion to check whether $x_k$ is an $\epsilon$-FOSP. However, in PGD we do not solve the linear program $\min_{v \in \mathcal{C}} \nabla f(x_k)^\top (v - x_k)$. This issue can be resolved by checking whether the displacement $\|x_{k+1} - x_k\|$ is sufficiently small, which is a sufficient condition for the condition in (18). In other words, if this condition holds we stop and $x_k$ is an $\epsilon$-FOSP; otherwise, the result in (24) holds and the function value decreases. For more details, please check the proof of Proposition 4.2.
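A matching sketch of the PGD update (23) for a box constraint, where the Euclidean projection is a coordinatewise clip; the quadratic objective and the displacement-based stopping rule are illustrative assumptions of the example:

```python
import numpy as np

# Projected gradient descent over the box [0,1]^2 for
# f(x) = 0.5 * ||x - t||^2, t = (2, 0.3); projecting onto a box is a clip.
t = np.array([2.0, 0.3])
grad = lambda x: x - t
L = 1.0                                    # gradient Lipschitz constant of f

project = lambda y: np.clip(y, 0.0, 1.0)   # Euclidean projection onto the box

x = np.zeros(2)
for _ in range(100):
    x_next = project(x - (1.0 / L) * grad(x))
    if np.linalg.norm(x_next - x) <= 1e-8:   # small displacement:
        x = x_next                           # approximate stationarity
        break
    x = x_next

print(x)   # (1, 0.3): projection of the unconstrained minimizer onto the box
```

For this objective the iterate lands on the constrained minimizer in a couple of steps, and the displacement test plays the role of the FOSP check discussed above.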
5 Second-Order Step: Escaping Saddle Points
In this section, we study the second stage of the framework in Algorithm 1, which corresponds to the case that the current iterate $x_k$ is an $\epsilon$-FOSP. Note that when we reach such a point, the goal is to find a feasible point $u$ with $\nabla f(x_k)^\top (u - x_k) \le 0$ that makes the quadratic form $(u - x_k)^\top \nabla^2 f(x_k)(u - x_k)$ smaller than $-\rho\gamma$. To achieve this goal we need to check the minimum value of this quadratic form over the constraints, i.e., we need to solve the quadratic program in (14) up to a constant factor $\rho$. In the following proposition, we show that the variable updated according to (16) decreases the objective function value if the condition $(u_k - x_k)^\top \nabla^2 f(x_k)(u_k - x_k) < -\rho\gamma$ holds.
Consider the quadratic program in (14), and let $u_k$ be a $\rho$-approximate solution of it. Suppose that Assumptions 2 and 3 hold, with $M$ the Hessian Lipschitz constant and $D$ the diameter of $\mathcal{C}$, and set the stepsize $\eta$ in (16) proportional to $\rho\gamma$. If the quadratic objective function value evaluated at $u_k$ satisfies the condition $(u_k - x_k)^\top \nabla^2 f(x_k)(u_k - x_k) < -\rho\gamma$, then the variable updated according to (16) satisfies the inequality
$f(x_{k+1}) \ \le\ f(x_k) - c'\,\rho^3\gamma^3,$ (25)
where $c'$ is a constant depending on $M$ and $D$.
The only remaining question is how to solve the quadratic subproblem in (14) up to a constant factor in polynomial time. For a general set $\mathcal{C}$, this quadratic subproblem could be NP-hard (Murty and Kabadi, 1987); however, for some special choices of the convex constraint set $\mathcal{C}$, e.g., linear and quadratic constraints, this quadratic program (QP) can be solved either exactly or approximately up to a constant factor. In the following section, we focus on the quadratic constraint case. The discussion of the linear constraint case is available in Section 8.1 in the supplementary material.
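To build intuition for why such subproblems can be tractable for special sets, consider the simplest case: minimizing a homogeneous quadratic over a Euclidean ball, which is solved exactly by the smallest eigenvector. This ball example is only an illustration and is not the ellipsoid-based algorithm of Fu et al. (1998):

```python
import numpy as np

# Exact solution of min_{||d|| <= r} d^T H d: if the smallest eigenvalue
# lam_min of H is negative, the minimum is r^2 * lam_min, attained at r
# times the corresponding unit eigenvector; otherwise d = 0 is optimal.
def min_quadratic_over_ball(H, r):
    lam, V = np.linalg.eigh(H)         # eigenvalues in ascending order
    if lam[0] >= 0:
        return 0.0, np.zeros(H.shape[0])
    d = r * V[:, 0]
    return r ** 2 * lam[0], d

H = np.diag([2.0, -3.0])    # indefinite Hessian: negative curvature along e2
val, d = min_quadratic_over_ball(H, 1.0)
print(val, d)               # -3.0, along +/- e2
```

A negative optimal value exposes a direction of negative curvature, which is exactly the descent direction the second-order stage exploits in the update (16).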
5.1 Quadratic constraints case
Consider the case that the convex set $\mathcal{C}$ is the intersection of $m$ ellipsoids, i.e.,
$\mathcal{C} = \big\{ x \in \mathbb{R}^d \ :\ x^\top Q_i\, x + b_i^\top x + c_i \le 0, \ \ i = 1, \dots, m \big\},$ (26)
where each $Q_i$ is positive semidefinite, $b_i \in \mathbb{R}^d$, and $c_i \in \mathbb{R}$. Under this assumption, the QP in (14) can be written as
$\min_{u} \ (u - x_k)^\top \nabla^2 f(x_k)\,(u - x_k) \quad \text{s.t.} \quad u^\top Q_i\, u + b_i^\top u + c_i \le 0 \ \ \forall i, \quad \nabla f(x_k)^\top (u - x_k) \le 0.$ (27)
Note that the linear constraint can easily be handled by writing it in the form of a quadratic constraint. To do so, first define the new optimization variable $d = u - x_k$ to obtain
$\min_{d} \ d^\top H\, d \quad \text{s.t.} \quad (d + x_k)^\top Q_i\,(d + x_k) + b_i^\top (d + x_k) + c_i \le 0 \ \ \forall i, \quad g^\top d \le 0,$ (28)
where $H = \nabla^2 f(x_k)$ and $g = \nabla f(x_k)$. We can simply replace the linear constraint $g^\top d \le 0$ by a quadratic constraint with parameters $Q = 0$, $b = g$, and $c = 0$. Similarly, any other linear constraint $a^\top d \le \beta$ can be written as a quadratic constraint with parameters $Q = 0$, $b = a$, and $c = -\beta$. Therefore, the problem in (28) can be written as
$\min_{d} \ d^\top H\, d \quad \text{s.t.} \quad d^\top \tilde{Q}_i\, d + \tilde{b}_i^\top d + \tilde{c}_i \le 0, \ \ i = 1, \dots, m+1.$ (29)
Note that the matrices $\tilde{Q}_i$ are positive semidefinite, while the Hessian $H$ might be indefinite. Indeed, the optimal objective function value of the program in (29) is equal to the optimal objective function value of (27). The program in (29) is a quadratically constrained quadratic program (QCQP), and one can find an approximate solution for this program, as we state in the following proposition.
[Fu et al. (1998)] Consider Problem (29). There exists a polynomial time method that obtains a $\rho$-approximation, where $\rho$ is a constant determined by $\tau$, using a number of arithmetic operations polynomial in $d$ and $m$, where $\tau$ is the ratio of the radius of the largest inscribed sphere to that of the smallest circumscribing sphere of the feasible set.
The result in Proposition 5.1 indicates that we can solve the QCQP in (29) up to a constant approximation factor determined by the geometry of $\mathcal{C}$. According to this result, the complexity of solving the QCQP in (14) is polynomial in $d$ and $m$ when the constraint set $\mathcal{C}$ is defined by convex quadratic constraints. As the total number of calls to the second-order stage is at most $\mathcal{O}(\rho^{-3}\gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is polynomial in $d$, $m$, $1/\gamma$, and $1/\rho$.
The main idea of the algorithm proposed by Fu et al. (1998) is to approximate the feasible set by an inscribed ellipsoid, minimize the quadratic objective function over this ellipsoid to find a global minimizer $\tilde{d}$ of the relaxed problem, and then use $\tilde{d}$ as an approximate global minimizer for the original QP. If, after increasing the radius of the ellipsoid by a constant factor, the enlarged ellipsoid contains the feasible set, then using the result in (Ye, 1992) it follows that $\tilde{d}$ is a good approximate solution of the original QCQP.
If $m = 1$, one can also use the S-procedure (Pólik and Terlaky, 2007) to solve (29) exactly. Further, Nemirovski et al. (1999) showed that if the sum of the matrices $\sum_i \tilde{Q}_i$ is positive definite and the constraints are homogeneous, i.e., $\tilde{b}_i = 0$, one can find a $\rho$-approximate solution of (29) by solving its SDP relaxation, where $\rho$ is of order $1/\log m$.
6 Stochastic Extension
In this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F(\cdot, \theta)$ with a random input $\theta \in \Theta$. To be more precise, we consider the optimization problem
$\min_{x \in \mathcal{C}} \ f(x) := \mathbb{E}_{\theta}\big[ F(x, \theta) \big].$ (30)
Our goal is to find a point which satisfies the necessary optimality conditions of Problem (30) with high probability.
Consider the vector $g_k$ and the matrix $H_k$ as mini-batch stochastic approximations of the gradient $\nabla f(x_k)$ and the Hessian $\nabla^2 f(x_k)$, respectively. Here $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the samples in each batch are realizations of the random variable $\theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\epsilon,\gamma)$-SOSP of Problem (30). For simplicity, we only focus on a stochastic variant of the Frank-Wolfe method, but a similar stochastic extension can also be established for the projected gradient descent method.
Algorithm 2 differs from Algorithm 1 in using stochastic gradients and Hessians in lieu of exact gradients and Hessians. The second major difference is the inequality constraint in Step 6: instead of the stopping threshold used in Algorithm 1, we use a slightly more conservative threshold shifted by a properly chosen constant. This modification is needed to ensure that if a point satisfies the stochastic version of the constraint, it also satisfies the exact one with high probability.
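A minimal sketch of mini-batch gradient and Hessian estimation for a scalar illustrative model $F(x,\theta) = \tfrac{1}{2}(x-\theta)^2$ with $\theta \sim \mathcal{N}(0,1)$; the model, batch sizes, and random seed are all assumptions of the example:

```python
import numpy as np

# F(x, theta) = 0.5 * (x - theta)^2 with theta ~ N(0, 1), so the exact
# gradient is grad f(x) = x - E[theta] = x and the exact Hessian is 1.
rng = np.random.default_rng(0)

def batch_grad(x, batch_size):
    theta = rng.standard_normal(batch_size)
    return np.mean(x - theta)          # average of per-sample gradients

def batch_hess(x, batch_size):
    # The per-sample second derivative is identically 1 for this model.
    return np.mean(np.ones(batch_size))

x = 0.7
print(batch_grad(x, 100_000))          # concentrates around x = 0.7
print(batch_hess(x, 100_000))          # exactly 1 here
```

The variance of the batch-mean gradient scales as $\sigma_g^2 / b_g$, which is why the theorem below asks for batch sizes growing polynomially in $1/\epsilon$ and $1/\gamma$.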
To prove our main result we assume that the following conditions also hold.
The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\sigma_g^2$ and $\sigma_H^2$, respectively, i.e., for any $x \in \mathcal{C}$ and $\theta \in \Theta$ we can write
$\mathbb{E}\big[ \|\nabla F(x,\theta) - \nabla f(x)\|^2 \big] \ \le\ \sigma_g^2, \qquad \mathbb{E}\big[ \|\nabla^2 F(x,\theta) - \nabla^2 f(x)\|^2 \big] \ \le\ \sigma_H^2.$ (31)
The conditions in Assumption 4, which require access to unbiased estimates of the gradient and Hessian with bounded variances, are customary in stochastic optimization. In the following theorem, we characterize the iteration complexity of Algorithm 2 for reaching an $(\epsilon,\gamma)$-SOSP of Problem (30) with high probability.
Consider the optimization problem defined in (30). Suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes $b_g$ and $b_H$ are chosen sufficiently large, polynomial in $1/\epsilon$ and $1/\gamma$, and the stepsizes are set as in the deterministic case, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\epsilon,\gamma)$-second-order stationary point of Problem (30) with high probability. Further, the total number of iterations to reach such a point is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.
The result in Theorem 6 indicates that the total number of iterations to reach an $(\epsilon,\gamma)$-SOSP is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$. As each iteration requires at most $b_g$ stochastic gradient and $b_H$ stochastic Hessian evaluations, the total number of stochastic gradient and Hessian computations to reach an $(\epsilon,\gamma)$-SOSP is polynomial in $1/\epsilon$ and $1/\gamma$.
7 Conclusion
In this paper, we studied the problem of finding an $(\epsilon,\gamma)$-second-order stationary point (SOSP) of a generic smooth constrained nonconvex minimization problem. We proposed a procedure that obtains an $(\epsilon,\gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations if the constraint set $\mathcal{C}$ is such that one can find a $\rho$-approximate solution of a quadratic optimization problem with the constraint set $\mathcal{C}$. We showed that our results hold when $\mathcal{C}$ is defined by a set of linear constraints or by a finite number of convex quadratic constraints. We further extended our results to the stochastic setting and characterized the number of stochastic gradient and Hessian evaluations needed to reach an $(\epsilon,\gamma)$-SOSP.
8.1 The linear constraints case
Consider the case that the convex set $\mathcal{C}$ is the polyhedron $\{x : Ax \le b\}$. Define the new variable $d = u - x_k$ and the vector $\tilde{b} = b - A x_k$ to rewrite the problem as
$\min_{d} \ d^\top H\, d \quad \text{s.t.} \quad A d \le \tilde{b}, \quad g^\top d \le 0,$ (32)
where $H = \nabla^2 f(x_k)$ and $g = \nabla f(x_k)$. Then, we can simply append the extra linear inequality $g^\top d \le 0$ as an additional row of the constraint matrix. Using these modifications, we can write the problem in (32) as
$\min_{d} \ d^\top H\, d \quad \text{s.t.} \quad \tilde{A}\, d \le \hat{b},$ (33)
where $\tilde{A} = [A;\, g^\top]$ and $\hat{b} = [\tilde{b};\, 0]$. The QP in (33) could, in general, be NP-hard; however, in polynomial time one can find an approximate solution for this program, as we state in the following proposition.
Consider the optimization problem in (33). There exists a polynomial time algorithm, based on solving a ball-constrained quadratic problem, to compute a $\rho$-approximate solution for a constant $\rho$ (Vavasis, 1991; Ye, 1992). Further, if the constraint set is a polytope $\{x : Ax \le b\}$ for some $A$ and $b$, there exists a polynomial time algorithm which reaches a $\rho$-approximate solution (Fu et al., 1998, Section 4).
8.2 Proof of Proposition 2
The claim in (3) follows from Proposition 2.1.2 in (Bertsekas, 1999). The proof of the claim in (4) is similar to the proof of Proposition 2.1.2 in (Bertsekas, 1999), and we include it for completeness.
We prove the claim in (4) by contradiction. Suppose that $(u - x^*)^\top \nabla^2 f(x^*)(u - x^*) < 0$ for some $u \in \mathcal{C}$ satisfying $\nabla f(x^*)^\top (u - x^*) = 0$. By the mean value theorem, for any $\alpha \in (0,1]$ there exists an $s \in [0,1]$ such that
$f\big(x^* + \alpha(u - x^*)\big) = f(x^*) + \alpha\, \nabla f(x^*)^\top (u - x^*) + \frac{\alpha^2}{2}\, (u - x^*)^\top \nabla^2 f\big(x^* + s\alpha(u - x^*)\big)\,(u - x^*).$ (35)
Use the relation $\nabla f(x^*)^\top (u - x^*) = 0$ to simplify the right hand side to
$f\big(x^* + \alpha(u - x^*)\big) = f(x^*) + \frac{\alpha^2}{2}\, (u - x^*)^\top \nabla^2 f\big(x^* + s\alpha(u - x^*)\big)\,(u - x^*).$ (36)
Note that since $(u - x^*)^\top \nabla^2 f(x^*)(u - x^*) < 0$ and the Hessian is continuous, for all sufficiently small $\alpha$ we have $(u - x^*)^\top \nabla^2 f\big(x^* + s\alpha(u - x^*)\big)(u - x^*) < 0$. This observation and the expression in (36) imply that for sufficiently small $\alpha$ we have $f\big(x^* + \alpha(u - x^*)\big) < f(x^*)$. Note that the point $x^* + \alpha(u - x^*)$ belongs to the set $\mathcal{C}$ for all $\alpha \in [0,1]$. Therefore, we obtain a contradiction with the local optimality of $x^*$.
8.3 Proof of Proposition 4.1
First consider the definition