We consider constrained composite optimization problems of the form
in which is a smooth map, is convex, , and . Our main motivating example for such problems is the Burer-Monteiro factorization method for solving the semidefinite optimization problem
for some with . The Burer-Monteiro factorization is particularly appealing because, in addition to its lower storage and computational cost, it can solve the original problem (2). Many problems in science and engineering can be cast as problem (2) with a low-rank solution, including phase retrieval [16, 34], community detection [1, 23], phase synchronization [6, 30], and robust PCA .
The model function linearizes the smooth map , keeps the outer convex function , and is therefore convex. Each iteration thus requires solving a -strongly convex problem, which is (frequently) efficient.
When is Lipschitz and is smooth, the prox-linear algorithm converges globally to a stationary point, as measured by the sub-differential stationarity . The analysis of its local convergence is more delicate: local linear convergence has been established under tilt stability, which requires a unique minimizer and strong growth around it . A naive transformation of the constrained problem (1) into (4), taking , falls outside the scope of such local analyses, as there may be multiple with . We therefore seek alternative approaches for solving (1) with local convergence guarantees.
1.1 Outline and our contribution
In the matrix problem, this corresponds to solving
Such an exact penalty term encourages to fall onto the constraint set: as the norm grows linearly in , one expects it to dominate the objective whenever is off the constraint set, thereby penalizing infeasible ’s .
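The exactness phenomenon described above is easy to see on a one-dimensional toy problem of our own choosing (not from the paper): minimize x² subject to x = 1, penalized as x² + ρ|x − 1|. Once ρ exceeds |f′(1)| = 2, the penalized minimizer sits exactly on the constraint.

```python
import numpy as np

def f_pen(x, rho):
    # Exact penalty for the toy problem: minimize x^2 subject to x = 1.
    return x**2 + rho * np.abs(x - 1.0)

# Grid chosen so that it contains x = 1 and x = 0.5 exactly.
xs = np.linspace(-2.0, 3.0, 200001)
minimizers = {rho: xs[np.argmin(f_pen(xs, rho))] for rho in (1.0, 2.0, 4.0)}
print(minimizers)
# For rho >= 2 = |f'(1)| the penalized minimizer is exactly the feasible
# point x = 1; for rho = 1 it drifts to the infeasible point x = rho/2 = 0.5.
```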
We summarize our contributions.
We define norm convexity, a local geometric property for generic non-smooth functions. Norm convexity is a weak notion of local convexity, and it dovetails with other regularity conditions for composite optimization (e.g. sub-differential regularity). We show that the exact penalty function (6) satisfies norm convexity if it has quadratic growth around the (local) minimizing set (Section 2).
We show that the prox-linear algorithm has local linear convergence for generic composite optimization if has quadratic growth and norm convexity. Our result extends Drusvyatskiy and Lewis  and does not rely on the tilt stability assumption required there. Consequently, the prox-linear algorithm on the exact penalty function (6) has local linear convergence as long as the problem has quadratic growth (Section 3).
We instantiate our result on the factorized matrix problem (7). To verify the assumptions for the convergence result, we study whether the quadratic growth of (6) can be deduced from the quadratic growth of the original SDP (2). We show that quadratic growth is always preserved if the rank is exact (), and is preserved for linear objectives () when the rank is over-specified (). In contrast, when is non-linear, quadratic growth is no longer preserved under rank over-specification (Section 4). This gives a precise characterization for the convergence of the prox-linear algorithm on factorized SDPs and could be of broader interest for understanding matrix factorization.
We provide a roadmap of our main results in Figure 1.
1.2 Related work
The exact penalty method was one particular early motivation for considering convex composite functions . The work of Burke , Lewis and Wright , Drusvyatskiy and Lewis , and Drusvyatskiy et al.  studies the convergence of proximal algorithms on composite problems. In particular, Drusvyatskiy and Lewis  establish local linear convergence of the prox-linear algorithm, which shows the “natural” rate of convergence in the presence of sub-differential regularity and the tilt-stability condition, and poses the question (in its Section 5) of whether sub-differential regularity is implied by quadratic growth for general convex composite problems. We establish the norm convexity condition under quadratic growth, thereby resolving the problem in the special case of penalized objectives.
The idea of solving SDPs by factorizing is due to Burer and Monteiro . This factorization encodes the PSD constraint into the objective, at the cost of making the problem non-convex. When , the enlightening result of Pataki  shows that any SDP with linear constraints always has a solution satisfying , so when the number of constraints is small, one can always set with , yielding large computational and storage savings. The non-convexity of the factorized problem (3) can create spurious local minimizers and degenerate geometries. Boumal et al.  establish benign geometry in a general setting, showing that for generic , all second-order critical points of the linear problem (3) are global minima as long as the rank is overspecified (consistent with our results). For some special problems with non-linear , such as matrix completion and low-rank matrix sensing, a recent line of work [21, 22] shows that there is often no spurious local minimum.
Alternative methods for semidefinite optimization
The majority of early development on SDPs focuses on interior-point algorithms, established by Nesterov and Nemirovskii  and Alizadeh . Interior-point methods are efficient and robust on small-scale problems () but quickly become impractical beyond that, as they must compute matrix inverses or Schur complements. Augmented Lagrangian methods, including ADMM and Newton-CG algorithms, appear to be faster and more scalable, with well-developed software available (e.g. SDPAD  and SDPNAL [37, 36]).
Riemannian methods are suitable for solving problem (3) when the constraint has special structure, such as block-diagonal or orthogonality constraints . These methods are very efficient in practice, and results on their local  and global convergence  are available. For a thorough introduction to Riemannian optimization on matrix manifolds, see the book of Absil et al. .
We usually reserve letters
for vector variables and capital letters for matrices. The space of symmetric matrices is . For a matrix , we let and sorted in decreasing order. The two-norm, Frobenius norm, and operator norm are denoted by , , and . For twice-differentiable , and denote its gradient and Hessian. For a vector-valued function , let be the (transposed) Jacobian, so that the first-order Taylor expansion reads
For a convex composite , where is smooth and is convex, let denote its (Fréchet) sub-differential
We let be the distance of to and denote the -neighborhood of .
2 Geometry of the composite objective
In this section, we analyze the local geometry of , the composite objective. We first show that quadratic growth (Definition 2) is preserved when we reformulate problem (1) into (6). We then show that this quadratic growth implies norm convexity and sub-differential regularity (Definitions 2.2 and 2.2).
Throughout the paper, we let
denote the objective and the penalty function, and
denote the penalized objective (the subscript is omitted when it is clear from the context). Let be a local minimizing set of Problem (1), i.e. for any and for all such that .
A first question is whether minimizing is equivalent to solving the original constrained problem (1), i.e. whether is also the local minimizing set of . On the constraint set, , so the minimizing set is ; off the constraint set, the term has a “pointy” behavior and produces strong growth, so if is sufficiently smooth, this penalty term will intuitively dominate and force to grow off the constraint set as well. We make this argument precise in Section 2.1 and give an affirmative answer under quadratic growth and constraint qualification.
We now define the quadratic growth property. [Quadratic growth] A function is said to have -quadratic growth in around a local minimizing set if
where is the function value on .
We now collect our assumptions for this section. For properties that are required to hold locally, we assume there exists an such that all local properties hold in .
[Smoothness] In a neighborhood , the objective with -bounded and -Lipschitz Hessian. That is,
Further, is -locally Lipschitz. The functions and are also smooth, with corresponding parameters (for example, is -smooth).
[Quadratic growth] There exists some such that locally
holds for all such that . Assumption 2 ensures that the constrained optimization problem has quadratic growth around the minimizing set . This is the main assumption on which we rely to show the various geometric properties.
An additional assumption that we make on the constrained problem (1) is the following. [Constraint qualification] There exists some constant such that for all , the Jacobian of the constraint function has full row rank and satisfies the quantitative bound . Consequently, in the neighborhood , the constraint set
is a smooth manifold. We further assume that the minimizing set is a compact smooth submanifold of . Assumption 2 is known as the Linear Independence Constraint Qualification (LICQ) in the nonlinear programming literature and requires that the normal space of at is -dimensional. (The reader can refer to [2, Chapter 3] for background on smooth manifolds.)
2.1 Preservation of quadratic growth
We start by asking the following question: does the penalty method preserve quadratic growth? Namely, if a constrained problem has quadratic growth on the constraint set, does the penalized objective have quadratic growth in the whole space? The following result gives an affirmative answer. Let Assumptions 2 and 2 hold, and suppose Assumption 2 holds, i.e. for all such that ,
then the penalized objective has local quadratic growth: for any , setting , there exists a neighborhood such that for all and , we have
2.2 Quadratic growth implies norm convexity
We now define norm convexity and sub-differential regularity, two geometric properties that are essential to establishing the convergence of proximal algorithms in Section 3. [Norm convexity] The function is norm convex around the minimizing set with constant if for all near , we have
Next, we recall the definition of sub-differential regularity . [Sub-differential regularity] The function is -subdifferentially regular at if for all near , we have
We now present our main geometric result, that is, for having the penalty structure (8), quadratic growth of implies norm convexity and sub-differential regularity.
To gain some intuition, let us illustrate that quadratic growth implies norm convexity and sub-differential regularity in the convex case. For convex, assuming -quadratic growth, we have
for any choice of subgradient and minimizer . Choosing the minimum-norm subgradient and the minimizer closest to gives sub-differential regularity
Therefore norm convexity generalizes convexity by only requiring .
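To make the convex-case argument above concrete, the chain of inequalities can be written out explicitly. This is a reconstruction in our own notation: $\mathcal{X}^\star$ is the minimizing set, $\bar{x}$ the projection of $x$ onto it, $v \in \partial f(x)$ any subgradient, and $\alpha$ the quadratic-growth constant.

```latex
% Convexity at x, evaluated at the closest minimizer \bar{x}:
f(\bar{x}) \;\ge\; f(x) + \langle v, \bar{x} - x \rangle
\quad\Longrightarrow\quad
\langle v, x - \bar{x} \rangle \;\ge\; f(x) - \min f
\;\ge\; \frac{\alpha}{2}\,\mathrm{dist}^2(x, \mathcal{X}^\star).
% Cauchy--Schwarz on the left-hand side then yields norm convexity:
\|v\|\,\mathrm{dist}(x, \mathcal{X}^\star)
\;\ge\; \frac{\alpha}{2}\,\mathrm{dist}^2(x, \mathcal{X}^\star)
\quad\Longrightarrow\quad
\|v\| \;\ge\; \frac{\alpha}{2}\,\mathrm{dist}(x, \mathcal{X}^\star).
```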
Norm convexity specifies a local regularity condition for non-convex non-smooth functions. While it does not necessarily hold for general composite functions, we show that for our exact penalty objective , quadratic growth does imply norm convexity and hence sub-differential regularity.
3 Local convergence of the prox-linear algorithm
Recall that the prox-linear algorithm iterates
and is a small stepsize.
For the convergence result, we assume that is -Lipschitz and is -smooth, i.e.
An immediate consequence of the smoothness is that
so gives a local quadratic approximation of near . In particular, when , is an upper bound on , implying that the prox-linear algorithm is a descent method.
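As a concrete toy instance of the iteration above, the sketch below runs prox-linear on $f(x) = |c(x)|$ with $c(x) = \|x\|^2 - 1$; the instance, step size, and the closed-form subproblem solution for the outer function $|\cdot|$ are our own choices, not from the paper.

```python
import numpy as np

def prox_linear_abs(c, grad_c, x0, t=0.1, iters=50):
    """Prox-linear iteration for f(x) = |c(x)| with scalar-valued smooth c.

    Each step minimizes |c(x) + grad_c(x)^T d| + ||d||^2 / (2t) over d.
    For h = |.| the minimizer lies along grad_c(x), with the step length
    below (a soft-threshold-type formula derived by hand for this toy h).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, g = c(x), grad_c(x)
        a = g @ g
        if a == 0.0 or r == 0.0:
            break
        s = -np.sign(r) * min(t, abs(r) / a)  # step length along g
        x = x + s * g
    return x

# Toy instance: f(x) = | ||x||^2 - 1 | is minimized on the unit sphere.
x = prox_linear_abs(lambda x: x @ x - 1.0, lambda x: 2.0 * x,
                    x0=np.array([2.0, 0.0]))
print(np.linalg.norm(x))  # converges to a point with norm 1
```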
We now present our main algorithmic result: the prox-linear algorithm has local linear convergence as long as has quadratic growth and norm convexity. [Local linear convergence of the prox-linear algorithm] Suppose satisfies the above assumptions and has a compact local minimizing set . Assume that there exists such that the following hold in :
has -quadratic growth and norm convexity with constant around ;
The prox-linear iterates have the proximity property (see Definition D.2 for a formal definition and discussion).
Then, for sufficiently close to , the prox-linear algorithm (5) has linear convergence:
The proof builds on existing results on composite optimization from [18, Sections 5 and 6], which we review in Appendix D.1, and makes novel use of norm convexity to establish the local linear convergence. The proof can be found in Appendix D.2. The relationship between our result and existing results based on tilt stability is discussed in Appendix D.3.
4 Application on factorized matrix problems
We now instantiate our geometry and convergence results on our main application, the matrix problem (7), and show that the prox-linear method achieves local linear convergence for solving many low-rank semidefinite problems.
For the matrix problem, recall that and
Recall that our algorithmic result (Theorem 3) requires to have quadratic growth. The main focus of this section is to study whether the quadratic growth of (Assumption 2) can be deduced from the quadratic growth of , which can often be verified more straightforwardly.
We establish this connection in two separate cases: factorization with the exact rank () and with rank over-specification (). We show that quadratic growth is transferred from to only when the rank is exact (Section 4.2), or when the rank is over-specified and is linear () (Section 4.3.2). In both cases, we adapt Theorem 3 to provide the convergence result as a corollary. If is not linear, quadratic growth will in general fail to hold on and the prox-linear algorithm no longer has local linear convergence. We demonstrate this via a counter-example (Section 4.3.1).
Throughout this section, we make the following assumptions on the semidefinite objective , which we will use to deduce properties of the factorized objective . [Smoothness] The objective with -bounded and -Lipschitz Hessian, i.e. for all ,
Further, is -locally Lipschitz near .
[Dual optimum] There exists at least one dual optimum associated with , i.e. and that satisfy the KKT conditions
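For orientation, in the linear special case $C(X) = \langle C, X \rangle$ with affine constraints $\mathcal{A}(X) = b$, the KKT system referenced above takes the standard form below. This is a reconstruction in our own notation ($\mathcal{A}^*$ the adjoint map, $\bar{y}$ the dual multiplier, $S$ the dual slack), not the paper's general statement.

```latex
\mathcal{A}(\bar{X}) = b, \quad \bar{X} \succeq 0
\qquad \text{(primal feasibility)}, \\
S \;:=\; C - \mathcal{A}^*(\bar{y}) \;\succeq\; 0
\qquad \text{(dual feasibility)}, \\
\langle S, \bar{X} \rangle = 0
\qquad \text{(complementary slackness)}.
```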
[Rank- quadratic growth] There exists such that
for all in the set
Assumption 4 states that the semidefinite problem has a unique low-rank solution and that there is no duality gap. A number of conditions, such as Slater's condition, guarantee that there is no duality gap and that such a dual optimum exists. Assumption 12 ensures that on low-rank feasible points, has strong growth around . As we only require strong growth on low-rank matrices, this assumption is often more likely to hold than full quadratic growth. We will demonstrate this in Section F. [Rank- constraint qualification] The constraint set
is a smooth manifold. There exists some constant such that the constraint coefficient matrices satisfy for all in a neighborhood of , where is defined via
4.1 Preliminaries: global optimality, matrix distance
Let be the unique minimum of problem (2) and .
Non-uniqueness and matrix distance
The factorization brings in a great deal of non-uniqueness, but this non-uniqueness behaves fairly nicely and satisfies the assumptions required by our geometry and optimization results. One particularly nice property is that the minimizing set , having the form
is a compact smooth manifold, and the distance
is the Procrustes distance between and . More background on the factorization map and the Procrustes distance is provided in Appendix E.1.
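The Procrustes distance mentioned above can be computed in closed form from an SVD; the following minimal sketch is our own implementation of the standard orthogonal-Procrustes formula, not code from the paper.

```python
import numpy as np

def procrustes_dist(U, V):
    """min over orthogonal Q of ||U - V Q||_F (orthogonal Procrustes).

    The optimal rotation is Q = A B^T, where V^T U = A diag(s) B^T is an
    SVD; it maximizes tr(Q^T V^T U) over orthogonal Q.
    """
    A, s, Bt = np.linalg.svd(V.T @ U)
    Q = A @ Bt
    return np.linalg.norm(U - V @ Q)

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))
# A rotated copy of U is at Procrustes distance zero, even though the raw
# Frobenius distance ||U - V||_F is not.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V = U @ R
print(procrustes_dist(U, V), np.linalg.norm(U - V))
```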
4.2 Matrix growth: the exact-rank case
If we know the exact rank , we could set and factorize with . In this case, the Euclidean distances in the space and the space are closely related, as stated in the following bound. [Lemma 41, ] Let be two matrices such that (so that they are aligned in the Procrustes distance), then
Building on this result, we show that when has quadratic growth around in the constraint set, so does . The proof is in Appendix E.3. Let Assumption 12 hold for rank , then for all such that , we have
where . The above Lemma allows us to establish quadratic growth of and based on low-rank quadratic growth of . In particular, applying Lemma 2.1, we get that has local quadratic growth with constant
in a neighborhood of . We could then apply Theorem 2.2 here and obtain norm convexity and sub-differential regularity on the penalized objective . We summarize this in the following theorem. [Geometry of matrix factorization with exact rank] Suppose is convex and satisfies Assumption 4, 4, Assumption 12 with rank and constant , and Assumption 4, then for sufficiently large , satisfies the norm convexity (9) with constant and sub-differential regularity (10) with constant .
4.3 Matrix growth: the rank over-specified case
In many real problems, the true rank cannot be known exactly. In these cases, a common strategy is to conjecture an upper bound on the rank and factorize for . We show that over-specifying the rank preserves quadratic growth when is linear. Hence, for solving SDPs, local linear convergence can still be achieved when we over-specify the rank. In contrast, quadratic growth will not be preserved for generic convex .
4.3.1 Quadratic growth is not preserved in general
This bound requires , so if we use an upper bound and factorize for , then , and the bound becomes vacuous. The following example demonstrates that the growth can indeed be slower when we over-specify the rank.
Let have full column rank and , so that . Let , where is such that . Then,
so and are optimally aligned. However, we have
The distance in the PSD space depends quadratically on the distance in the low-rank space, not linearly.
Taking for example, the factorized version will only have fourth-order growth around in certain directions. The prox-linear algorithm will not have local linear convergence due to this slow growth, for the same reason that gradient descent will not converge linearly on .
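The slow growth described above is easy to observe numerically. The sketch below uses a toy rank-1 point over-parameterized with p = 2 columns (sizes are our own choice): perturbing along the extra column moves the PSD-space point only quadratically in the factor-space distance.

```python
import numpy as np

# True solution has rank 1, but we factorize with p = 2 columns.
Xbar = np.array([[1.0, 0.0],
                 [0.0, 0.0]])          # rank-1 point, over-parameterized
for eps in (1e-1, 1e-2, 1e-3):
    X = np.array([[1.0, 0.0],
                  [0.0, eps]])         # perturbation in the extra column
    d_factor = np.linalg.norm(X - Xbar)              # distance in factor space
    d_psd = np.linalg.norm(X @ X.T - Xbar @ Xbar.T)  # distance in PSD space
    print(f"eps={eps:.0e}: factor dist {d_factor:.1e}, PSD dist {d_psd:.1e}")
# The PSD-space distance scales like eps^2 = (factor distance)^2, so an
# objective with quadratic growth in X X^T has only quartic growth in X.
```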
Knowing that can be (as opposed to linear in the distance), we extend Lemma 4.2 in the following result, showing that a quadratic lower bound is indeed true. Consequently, any problem (2) with quadratic growth will have at least fourth-order growth under rank over-specification. [Matrix growth bound under rank over-specification] Let be such that , and . Then
The proof can be found in Appendix E.4.
4.3.2 Quadratic growth is preserved on linear objectives
In contrast to the generic case, we show that for linear objectives, rank over-specification preserves quadratic growth under some mild additional assumptions. Letting for , problem (2) reads
Problem (13) admits a rank- solution , where has full column rank. Let be the eigenvalue decomposition, where and is diagonal. We can then take . Let be the orthogonal complement of , i.e. such that
is an orthogonal matrix.
[Strict complementarity and dual non-degeneracy] There exists a dual optimal pair such that the -th smallest singular value of is lower bounded by . There exists some constant such that
Assuming this condition, we can lower bound the growth of as follows. For any , , we have . Hence
is a smooth manifold, then there exists a neighborhood of and a positive constant such that for all feasible in this neighborhood,
where the constant is
Consequently, have local quadratic growth around with constant .
By the above lemma, we establish quadratic growth of the rank over-specified objective and thereby verify Assumption 12 for the constrained problem. As a direct consequence, we obtain norm convexity and sub-differential regularity from Theorem 2.2. We summarize this in the following theorem. [Geometry of factorized SDPs with rank over-specification] Suppose and Assumptions 4 and 4.3.2 hold for rank , then the solution to problem (13) is unique, and has local quadratic growth: for all near and feasible,
where depends on but not (so that Assumption 12 holds). Further, for sufficiently large , satisfies the norm convexity (9) and sub-differential regularity (10) (with replaced by ). Theorem 4.3.2 applies generally to the Burer-Monteiro factorization of SDPs, and spells out why over-specifying the rank often works in practice: quadratic growth is carried over from to , as long as dual non-degeneracy holds.
4.4 Algorithmic consequences
We now adapt Theorem 3, in both the exact-rank case and the rank over-specified case with linear objectives, to obtain local linear convergence of the prox-linear algorithm. [Local linear convergence of factorized semidefinite optimization] Under the settings of Theorem 4.2 or 4.3.2, let be the local quadratic growth constant of . Initializing sufficiently close to the minimizing set , the prox-linear algorithm converges linearly:
If we initialize in a sufficiently small neighborhood of and let be the lowest possible choice as provided in Theorem 2.2, then the linear rate is with
The proof as well as discussions on this last linear rate can be found in Appendix E.6.
5 Examples of quadratic growth
In this section, we provide examples of problem (2) that have low-rank quadratic growth, i.e. satisfying Assumption 12. By giving conditions under which these are true, we identify some situations in which the geometric results given in Theorems 4.2 or 4.3.2 will hold.
5.1 Linear objectives
As we saw in Theorem 4.3.2, our sufficient conditions for quadratic growth require checking CQ, strict complementarity, and dual non-degeneracy. We illustrate these conditions for the SDPs for synchronization and SO(d) synchronization when the data contains a strong signal.
[ synchronization] Let be an unknown binary vector. The -synchronization problem is to recover from the matrix of noisy observations
where is a Gaussian Orthogonal Ensemble (GOE): , for , , and these entries are independent. This problem is a simplified model for community detection problems such as in the stochastic block model .
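To make the observation model concrete, the sketch below generates synthetic synchronization data under one common rank-one-spike scaling (our own choice, since the paper's exact scaling is not reproduced here) and, as a cheap proxy for the SDP, rounds the top eigenvector of the observation matrix to signs. At this noise level the spectral estimate already correlates strongly with the ground truth.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 200, 0.2
z = rng.choice([-1.0, 1.0], size=n)          # ground-truth signs

# GOE-style noise: symmetric with independent N(0, 1/n) entries (approx.)
A = rng.standard_normal((n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2.0)
Y = np.outer(z, z) / n + sigma * W           # rank-one signal plus noise

# Spectral proxy (not the SDP itself): round the top eigenvector to signs.
evals, evecs = np.linalg.eigh(Y)             # eigh sorts eigenvalues ascending
z_hat = np.sign(evecs[:, -1])
corr = abs(z_hat @ z) / n
print(f"correlation with the truth: {corr:.3f}")
```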
The maximum likelihood estimate of the above problem is computationally intractable, as it requires a search over possibilities. However, the maximum likelihood problem can be relaxed into an SDP: letting , we solve
With high probability, is the unique solution to (44) and strict complementarity holds. If this happens, dual non-degeneracy also holds: we have , , and the matrices , so they certainly span .
Finally, we note that CQ holds for any MaxCut problem: the constraints are and for . For , constraint qualification requires that are linearly independent, or that have non-zero rows. This has to be true, as the rows have norm one.
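The claim above, that the constraint Jacobian rows are nonzero (and in fact mutually orthogonal with norm 2 on the feasible set), can be checked numerically; the sizes n = 6, p = 3 below are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
U = rng.standard_normal((n, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)  # feasible: each row has norm 1

# Jacobian of G_i(U) = ||u_i||^2 - 1 w.r.t. vec(U): row i is vec(2 e_i u_i^T)
J = np.zeros((n, n * p))
for i in range(n):
    J[i, i * p:(i + 1) * p] = 2.0 * U[i]

sigmas = np.linalg.svd(J, compute_uv=False)
print(sigmas.min(), sigmas.max())
# All singular values equal 2: the rows have disjoint supports (orthogonal)
# and norm 2, so the Jacobian has full row rank with a uniform lower bound.
```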
Putting everything together, the assumptions of Theorem 4.3.2 hold, and the factorized problem with rank will have quadratic growth, norm convexity, and sub-differential regularity.
[SO(d) synchronization] The SO(d) synchronization problem is a multi-dimensional extension of the MaxCut problem: we are interested in recovering orthogonal matrices given their noisy pairwise compositions
Arranging into and forming the decision variable with row blocks , we solve (for )
The SDP relaxation is