1 Introduction
We consider constrained composite optimization problems of the form
(1)  
in which is a smooth map, is convex, , and . Our main motivating example for such problems is the Burer-Monteiro factorization method for solving the semidefinite optimization problem
(2)  
where is smooth convex, is a symmetric linear operator with , and . The celebrated Burer and Monteiro [12] approach proposes to solve (2) by factorizing for and solving
(3)  
This is an instance of the problem (1) with . When the solution of (2) has low rank, satisfying
for some with , the Burer-Monteiro factorization is particularly appealing because, in addition to its lower storage and computational cost, it can solve the original problem (2). Many problems in science and engineering can be cast as problem (2) with a low-rank solution, including phase retrieval [16, 34], community detection [1, 23], phase synchronization [6, 30], and robust PCA [26].
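To make the storage and positivity benefits concrete, here is a small numerical sketch (the dimensions `n`, `r` and all names are ours, purely illustrative): the factor `Y` stores `n*r` entries instead of the `n*n` entries of `X = YY^T`, and `YY^T` is automatically positive semidefinite with rank at most `r`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 5  # ambient dimension and (small) rank, our illustrative choice

# Factorized variable: n*r entries instead of the n*n entries of X.
Y = rng.standard_normal((n, r))
X = Y @ Y.T

# YY^T is positive semidefinite and has rank <= r by construction,
# so the PSD constraint of the SDP is encoded by the factorization.
print(np.linalg.eigvalsh(X).min() >= -1e-8)   # PSD up to round-off
print(np.linalg.matrix_rank(X) <= r)          # low rank
print(Y.size, X.size)                         # n*r = 1000 vs n*n = 40000 entries
```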
Problem (1) is the constrained variant of composite optimization problems [13], a family of structured nonconvex (and potentially nonsmooth) optimization problems of the form
(4) 
where is smooth and is convex. Such composite structure appears naturally in learning problems with a convex but nonsmooth loss, such as robust phase retrieval [20, 17].
A prevailing algorithm for solving the composite problem (4) is the prox-linear algorithm [13, 15], which sequentially minimizes a Taylor-like model of :
(5)  
The model function linearizes the smooth map , keeps the outer convex function , and is therefore convex. Each iteration thus requires solving a strongly convex subproblem, which can frequently be done efficiently.
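For intuition about the subproblem, consider the simplest instance one can write down: a scalar smooth `c` composed with the absolute value. This is our own toy sketch, not the paper's setting; the closed form below comes from restricting the subproblem to the gradient direction and doing a one-dimensional case analysis, and we verify it against a brute-force grid search.

```python
import numpy as np

def prox_linear_step(c_val, grad, t):
    """One prox-linear step for min |c(x)| with scalar smooth c:
    argmin_d |c_val + grad @ d| + ||d||^2 / (2*t).
    The minimizer lies along grad; writing d = beta*grad and
    minimizing over beta gives the case analysis below."""
    G = grad @ grad
    if abs(c_val) > t * G:   # zeroing the model is too expensive: take a fixed step
        beta = -t * np.sign(c_val)
    else:                    # cheap to zero out the linearized model exactly
        beta = -c_val / G
    return beta * grad

# Sanity check against a brute-force grid search in one dimension.
c_val, grad, t = 0.3, np.array([2.0]), 0.05
d_star = prox_linear_step(c_val, grad, t)
grid = np.linspace(-1, 1, 200001)
obj = np.abs(c_val + grad[0] * grid) + grid**2 / (2 * t)
d_grid = grid[np.argmin(obj)]
print(abs(d_star[0] - d_grid) < 1e-4)
```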
When is Lipschitz and is smooth, the prox-linear algorithm converges globally to a stationary point, measured by the subdifferential stationarity . Analysis of its local convergence is more delicate: local linear convergence has been established under tilt stability, which requires a unique minimizer and strong growth around it [18]. A naive transformation of the constrained problem (1) into (4), taking , violates the assumptions of such local analyses, as there may be multiple with . We therefore seek alternative approaches for solving (1) with local convergence guarantees.
1.1 Outline and our contribution
In this paper, we consider the exact penalty method [cf. 8] for solving problem (1), which translates the constraint into an exact penalty term and solves the unconstrained problem
(6) 
In the matrix problem, this corresponds to solving
(7) 
Such an exact penalty term encourages the iterates to fall onto the constraint set: as the norm grows linearly in , one expects it to dominate the objective when is not on the set, thereby penalizing infeasible points [7].
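A one-dimensional sketch (our own toy example, not from the paper) shows this exactness threshold: for `f(x) = x^2` with the constraint `x = 1`, the penalized minimizer lands exactly on the constraint once the penalty weight exceeds the magnitude of the gradient of `f` at the constrained solution.

```python
import numpy as np

# Toy instance (ours): minimize f(x) = x^2 subject to x = 1.
# Exact penalty: F(x) = x^2 + rho * |x - 1|.
f = lambda x: x**2
grid = np.linspace(-2, 3, 500001)

for rho in [1.0, 2.0, 4.0]:
    F = f(grid) + rho * np.abs(grid - 1)
    x_min = grid[np.argmin(F)]
    print(rho, x_min)
# For rho < 2 = |f'(1)| the penalized minimizer sits off the constraint
# (at x = rho/2); once rho >= 2 it lands exactly on x = 1, so the
# penalty is "exact" rather than merely approximate.
```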
Problem (6) is a composite optimization problem, and we therefore study the convergence of the prox-linear algorithm (5), providing arguments on its convergence for matrix problems of the form (7).
We summarize our contributions.

We define norm convexity, a local geometric property for generic nonsmooth functions. Norm convexity is a weak notion of local convexity, and it dovetails with other regularity conditions for composite optimization (e.g. subdifferential regularity). We show that the exact penalty function (6) satisfies norm convexity if it has quadratic growth around the (local) minimizing set (Section 2).

We show that the prox-linear algorithm has local linear convergence for generic composite optimization if has quadratic growth and norm convexity. Our result extends Drusvyatskiy and Lewis [18] and does not rely on the tilt stability assumption required there. Consequently, the prox-linear algorithm on the exact penalty function (6) has local linear convergence as long as the problem has quadratic growth (Section 3).

We instantiate our result on the factorized matrix problem (7). To verify the assumptions of the convergence result, we study whether the quadratic growth of (6) can be deduced from the quadratic growth of the original SDP (2). We show that quadratic growth is always preserved if the rank is exact (), and is preserved for linear objectives () when the rank is overspecified (). In contrast, when is nonlinear, quadratic growth is no longer preserved under rank overspecification (Section 4). This gives a precise characterization of the convergence of the prox-linear algorithm on factorized SDPs and could be of broader interest for understanding matrix factorization.
We provide a roadmap of our main results in Figure 1.
1.2 Related work
Composite optimization
The exact penalty method was one particular early motivation for considering convex composite functions [14]. The work of Burke [13], Lewis and Wright [24], Drusvyatskiy and Lewis [18], and Drusvyatskiy et al. [19] studies the convergence of proximal algorithms on composite problems. In particular, Drusvyatskiy and Lewis [18] establish local linear convergence of the prox-linear algorithm, showing the "natural" rate of convergence in the presence of subdifferential regularity and the tilt-stability condition, and pose the question (in their Section 5) of whether subdifferential regularity is implied by quadratic growth for general convex composite problems. We establish the norm convexity condition under quadratic growth, thereby resolving this question in the special case of penalized objectives.
Matrix factorization
The idea of solving SDPs by factorizing is due to Burer and Monteiro [12]. This factorization encodes the PSD constraint into the objective, at the cost of making the problem nonconvex. When , the enlightening result of Pataki [29] shows that any SDP with linear constraints always has a solution satisfying , so when the number of constraints is small, one can always set with , yielding substantial computational and storage savings. The nonconvexity of the factorized problem (3) makes spurious local minimizers and complicated geometries possible. Boumal et al. [11] establish benign geometry in a general setting, showing that for generic , all second-order critical points of the linear problem (3) are global minima as long as the rank is overspecified (consistent with our results). For some special problems with nonlinear , such as matrix completion and low-rank matrix sensing, a recent line of work [21, 22] shows that there is often no spurious local minimum.
Alternative methods for semidefinite optimization
The majority of early development on SDPs focused on interior-point algorithms, established by Nesterov and Nemirovskii [27] and Alizadeh [3]. Interior-point methods are efficient and robust on small-scale problems () but quickly become infeasible beyond that, as they must compute matrix inverses or Schur complements. Augmented Lagrangian methods, including ADMM and Newton-CG algorithms, appear to be faster and more scalable, with well-developed software available (e.g. SDPAD [35] and SDPNAL [37, 36]).
Riemannian methods are suitable for solving problem (3) when the constraint has special structure, such as block-diagonal or orthogonality constraints [9]. These methods are very efficient in practice, and results on their local [2] and global [10] convergence are available. For a thorough introduction to Riemannian optimization on matrix manifolds, see the book of Absil et al. [2].
1.3 Notation
We usually reserve letters
for vector variables and capital letters
for matrices. The space of symmetric matrices is . For a matrix , we let and denote the eigenvalues / singular values of
sorted in decreasing order. The two norm, Frobenius norm, and operator norm are denoted by , , and . For twice differentiable , and denote its gradient and Hessian. For a vector-valued function , let be the (transposed) Jacobian, so that the first-order Taylor expansion reads . For a convex composite , where is smooth and is convex, let denote its (Fréchet) subdifferential
We let be the distance of to and denote the neighborhood of .
2 Geometry of the composite objective
In this section, we analyze the local geometry of , the composite objective. We first show that quadratic growth (Definition 2) is preserved when we reformulate problem (1) as (6). We then show that such quadratic growth implies norm convexity and subdifferential regularity (Definitions 2.2 and 2.2).
Throughout the paper, we let
denote the objective and the penalty function, and
(8) 
denote the penalized objective (the subscript is omitted when it is clear from the context). Let be a local minimizing set of Problem (1), i.e. for any and for all such that .
A first question is whether minimizing is equivalent to solving the original constrained problem (1), i.e. whether is also the local minimizing set of . On the constraint set, , so the minimizing set is ; off the constraint set, the term has a "pointy" behavior and produces strong growth, so if is sufficiently smooth, intuitively this penalty term dominates and forces to grow off the constraint set as well. We make this argument precise in Section 2.1 and give an affirmative answer under quadratic growth and constraint qualification.
We now define the quadratic growth property. [Quadratic growth] A function is said to have quadratic growth in around a local minimizing set if
where is the function value on .
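Since the displayed inequality above did not survive extraction, here is the standard form of the definition in generic notation (the notation is ours, and conventions differ by a factor of 2 in the constant):

```latex
% Quadratic growth of F around a local minimizing set X* (notation ours):
F(x) \;\ge\; F^\star \;+\; \frac{\alpha}{2}\,
  \operatorname{dist}^2\!\big(x,\mathcal{X}^\star\big)
\qquad \text{for all } x \text{ near } \mathcal{X}^\star,
```

where F* denotes the (common) value of F on the minimizing set.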
We now collect our assumptions for this section. For properties that are required to hold locally, we assume there exists an such that all local properties hold in .
[Smoothness] In a neighborhood , the objective with bounded and Lipschitz Hessian. That is,
Further, is locally Lipschitz. Functions and are also smooth with corresponding parameters (for example, is smooth).
[Quadratic growth] There exists some such that locally
holds for all such that . Assumption 2 ensures that the constrained optimization problem has quadratic growth around the minimizing set . This is the main assumption on which we rely to establish the various geometric properties.
An additional assumption that we make on the constrained problem (1) is the following. [Constraint qualification] There exists some constant such that for all , the Jacobian of the constraint function has full row rank and satisfies the quantitative bound . Consequently, in the neighborhood , the constraint set
is a smooth manifold. We further assume that the minimizing set is a compact smooth submanifold of . Assumption 2 is known as the Linear Independence Constraint Qualification (LICQ) in the nonlinear programming literature and requires that the normal space of at is dimensional. (The reader can refer to [2, Chapter 3] for background on smooth manifolds.)
2.1 Preservation of quadratic growth
We start by asking the following question: does the penalty method preserve quadratic growth? Namely, if a constrained problem has quadratic growth on the constraint set, does the penalized objective have quadratic growth in the whole space? The following result gives an affirmative answer. Let Assumptions 2 and 2 hold. Let Assumption 2 hold, i.e. for all such that ,
then the penalized objective has local quadratic growth: for any , setting , there exists a neighborhood such that for all and , we have
where . Lemma 2.1 says that quadratic growth is preserved in the penalized formulation. In particular, is a local minimum of . The proof can be found in Appendix C.1.
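A tiny numerical sketch of this preservation phenomenon (our own two-dimensional example, not from the paper): take `f(x) = x1^2` with the single constraint `x2 = 0`, so the minimizing set is the origin and `f` grows quadratically along the constraint set. With penalty weight `rho = 1`, the penalized objective grows quadratically over the whole unit box, because the pointy term `|x2|` dominates `x2^2` near the constraint set.

```python
import numpy as np

# Our toy instance: f(x) = x1^2 with constraint x2 = 0.
# Penalized objective with rho = 1:
F = lambda x1, x2: x1**2 + np.abs(x2)

rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(10000, 2))   # random points in the unit box
vals = F(pts[:, 0], pts[:, 1])
dist2 = (pts**2).sum(axis=1)                # squared distance to the minimizer

# Quadratic growth with constant 1 holds on the whole unit box,
# because |x2| >= x2^2 whenever |x2| <= 1.
print(np.all(vals >= dist2 - 1e-12))
```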
2.2 Quadratic growth implies norm convexity
We now define norm convexity and subdifferential regularity, two geometric properties that are essential to establishing the convergence of proximal algorithms in Section 3. [Norm convexity] The function is norm convex around the minimizing set with constant if for all near , we have
Next, we recall the definition of subdifferential regularity [18]. [Subdifferential regularity] The function is subdifferentially regular at if for all near , we have
We now present our main geometric result, that is, for having the penalty structure (8), quadratic growth of implies norm convexity and subdifferential regularity.
To gain some intuition, let us illustrate that quadratic growth implies norm convexity and subdifferential regularity in the convex case. For convex, assuming quadratic growth, we have
for any choice of the subgradient and minimum . Choosing the minimum norm subgradient and closest to gives subdifferential regularity
Therefore norm convexity generalizes convexity by only requiring .
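Writing out the convex chain of inequalities in generic notation (ours, since the displays above were not recoverable): for a subgradient g of f at x, with x̄ the nearest minimizer and α the quadratic growth constant,

```latex
% Convex case (notation ours): g \in \partial f(x), \bar{x} the nearest minimizer.
f^\star = f(\bar x)
  \ \ge\ f(x) + \langle g, \bar x - x\rangle
  \ \ge\ f(x) - \|g\|\,\operatorname{dist}(x,\mathcal X^\star).
% Rearranging and invoking quadratic growth
% f(x) - f^\star \ge \alpha\,\operatorname{dist}^2(x,\mathcal X^\star):
\|g\|\,\operatorname{dist}(x,\mathcal X^\star)
  \ \ge\ f(x) - f^\star
  \ \ge\ \alpha\,\operatorname{dist}^2(x,\mathcal X^\star)
\quad\Longrightarrow\quad
\|g\| \ \ge\ \alpha\,\operatorname{dist}(x,\mathcal X^\star).
```

So the minimum-norm subgradient grows at least linearly in the distance to the minimizing set, which is the kind of inequality norm convexity asks for.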
Norm convexity specifies a local regularity condition for nonconvex nonsmooth functions. While it does not necessarily hold for general composite functions, we show that for our exact penalty objective , quadratic growth does imply norm convexity and hence subdifferential regularity.
3 Local convergence of the prox-linear algorithm
We now analyze the convergence of the prox-linear algorithm (5) for generic composite problems of the form (4).
Recall that the prox-linear algorithm iterates
where
and is a small stepsize.
For the convergence result, we assume that is Lipschitz and is smooth, i.e.
An immediate consequence of the smoothness is that
(11) 
so gives a local quadratic approximation of near . In particular, when , is an upper bound on , implying that the prox-linear algorithm is a descent method.
We now present our main algorithmic result: the prox-linear algorithm has local linear convergence as long as has quadratic growth and norm convexity. [Local linear convergence of the prox-linear algorithm] Suppose satisfies the above assumptions and has a compact local minimizing set . Assume that there exists such that the following hold in :

has quadratic growth and norm convexity with constant around ;

The prox-linear iterates have the proximity property (see Definition D.2 for a formal definition and discussion).
Then, for sufficiently close to , the prox-linear algorithm (5) converges linearly:
where
The proof builds on existing results on composite optimization from [18, Sections 5 and 6], which we review in Appendix D.1, and makes novel use of norm convexity to establish local linear convergence. The proof can be found in Appendix D.2. The relationship between our result and existing results based on tilt stability is discussed in Appendix D.3.
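To see the mechanics on a concrete instance, here is a toy run (our own one-dimensional example, not from the paper): the penalized objective x^2 + rho*|x - 1| with rho large enough that the minimizer is x* = 1. Keeping the penalty term intact and linearizing only the smooth part, each prox-linear step reduces to soft-thresholding, and the distance to x* shrinks at least geometrically (here it even hits zero after finitely many steps, consistent with, and stronger than, a linear rate).

```python
import numpy as np

# Prox-linear on the penalized toy problem (our example)
#   F(x) = x^2 + rho * |x - 1|,   with minimizer x* = 1 for rho >= 2.
# One step solves
#   argmin_y  x^2 + 2x(y - x) + rho*|y - 1| + (y - x)^2 / (2*t),
# which is soft-thresholding of the gradient step around the point 1.
def soft(s, lam):
    return np.sign(s) * max(abs(s) - lam, 0.0)

def prox_linear(x0, rho=4.0, t=0.1, iters=8):
    xs = [x0]
    for _ in range(iters):
        x = xs[-1]
        u = x - t * 2 * x                      # gradient step on the smooth part
        xs.append(1.0 + soft(u - 1.0, rho * t))
    return np.array(xs)

xs = prox_linear(2.0)
errs = np.abs(xs - 1.0)
print(errs)  # distance to the minimizer shrinks at least geometrically
```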
4 Application to factorized matrix problems
We now instantiate our geometric and convergence results on our main application, the matrix problem (7), and show that the prox-linear method achieves local linear convergence on many low-rank semidefinite problems.
For the matrix problem, recall that and
Recall that our algorithmic result (Theorem 3) requires to have quadratic growth. The main focus of this section is to study whether the quadratic growth of (Assumption 2) can be deduced from the quadratic growth of , which can often be verified more straightforwardly.
We build this connection in two separate cases: factorization with the exact rank () and with rank overspecification (). We show that quadratic growth is transferred from to if we have the exact rank (Section 4.2), or if we overspecify the rank and is linear () (Section 4.3.2). In both cases, we adapt Theorem 3 to provide the convergence result as a corollary. If is not linear, quadratic growth in general fails to hold for , and the prox-linear algorithm no longer has local linear convergence. We demonstrate this via a counterexample (Section 4.3.1).
Throughout this section, we make the following assumptions on the semidefinite objective , which we use to deduce properties of the factorized objective . [Smoothness] The objective with bounded and Lipschitz Hessian, i.e. for all ,
Further, is locally Lipschitz near .
[Dual optimum] There exists at least one dual optimum associated with , i.e. and that satisfy the KKT conditions
(12) 
[Rank quadratic growth] There exists such that
for all in the set
Assumption 4 states that the semidefinite problem has a unique low-rank solution and that there is no duality gap. A number of conditions, such as Slater's condition, guarantee that there is no duality gap and that such a dual optimum exists. Assumption 12 ensures that on low-rank feasible points, has strong growth around . As we only require strong growth on low-rank matrices, this assumption is often more likely to hold than full quadratic growth. We demonstrate this in Section F. [Rank constraint qualification] The constraint set
is a smooth manifold. There exists some constant such that the constraint coefficient matrices satisfy for all in a neighborhood of , where is defined via
4.1 Preliminaries: global optimality, matrix distance
Let be the unique minimum of problem (2) and .
Global optimality
Nonuniqueness and matrix distance
The factorization brings in a great deal of nonuniqueness, but this nonuniqueness is fairly benign and satisfies the assumptions required by our geometry and optimization results. One particularly nice property is that the minimizing set , having the form
is a compact smooth manifold, and the distance
is the Procrustes distance between and . More background on the factorization map and the Procrustes distance is provided in Appendix E.1.
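A short sketch of how the Procrustes distance can be computed (this is the classical orthogonal Procrustes solution via the SVD; the helper name is ours): a rotated factor represents the same PSD matrix, and the Procrustes distance sees through the rotation while the plain Frobenius distance does not.

```python
import numpy as np

def procrustes_dist(Y1, Y2):
    """min over orthogonal R of ||Y1 - Y2 R||_F, computed via the SVD
    of Y2^T Y1 (classical orthogonal Procrustes solution)."""
    U, _, Vt = np.linalg.svd(Y2.T @ Y1)
    R = U @ Vt                      # optimal aligning rotation
    return np.linalg.norm(Y1 - Y2 @ R)

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix

# Y and YQ represent the same X = YY^T.
print(np.allclose((Y @ Q) @ (Y @ Q).T, Y @ Y.T))
print(procrustes_dist(Y, Y @ Q) < 1e-10)                         # aligned: ~0
print(procrustes_dist(Y, Y @ Q) <= np.linalg.norm(Y - Y @ Q))    # never larger
```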
4.2 Matrix growth: the exact-rank case
If we know the exact rank , we can set and factorize with . In this case, the Euclidean distances in the space and the space are closely connected, as stated in the following bound. [Lemma 41, [22]] Let be two matrices such that (so that they are aligned in the Procrustes distance); then
Building on this result, we show that when has quadratic growth around in the constraint set, so does . The proof is in Appendix E.3. Let Assumption 12 hold for rank , then for all such that , we have
where . The above lemma allows us to establish quadratic growth of and based on low-rank quadratic growth of . In particular, applying Lemma 2.1, we get that has local quadratic growth with constant
in a neighborhood of . We could then apply Theorem 2.2 here and obtain norm convexity and subdifferential regularity on the penalized objective . We summarize this in the following theorem. [Geometry of matrix factorization with exact rank] Suppose is convex and satisfies Assumption 4, 4, Assumption 12 with rank and constant , and Assumption 4, then for sufficiently large , satisfies the norm convexity (9) with constant and subdifferential regularity (10) with constant .
4.3 Matrix growth: the rank-overspecified case
In many real problems, the true rank cannot be known exactly. In these cases, a common strategy is to conjecture an upper bound on the rank and factorize for . We show that overspecifying the rank preserves quadratic growth when is linear. Hence, for solving SDPs, local linear convergence can still be achieved when we overspecify the rank. In contrast, quadratic growth is not preserved for generic convex .
4.3.1 Quadratic growth is not preserved in general
Recall that when converting quadratic growth of into that of (Lemma 4.2), we relied on Lemma 4.2, which says that grows at least linearly in :
This bound requires , so if we used an upper bound and factorized for , then , and the bound becomes vacuous. The following example demonstrates that the growth can indeed be slower when we overspecify the rank.
Let have full column rank and , so that . Let , where is such that . Then,
so and are optimally aligned. However, we have
and
The distance in the PSD space depends quadratically on the distance in the low-rank space, not linearly.
Taking for example, the factorized version will only have fourth-order growth around in certain directions. The prox-linear algorithm will not have local linear convergence due to this slow growth, for the same reason that gradient descent does not converge linearly on .
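The scaling in this example is easy to reproduce numerically (our own instantiation of the construction above, with illustrative dimensions): padding a full-rank factor with a zero column and perturbing only in the new direction, the PSD-space distance tracks the square of the factor-space distance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
X = rng.standard_normal((n, r))
X_pad = np.hstack([X, np.zeros((n, 1))])   # rank overspecified by one

# Unit vector orthogonal to range(X), mirroring the example above.
w = rng.standard_normal(n)
v = w - X @ np.linalg.lstsq(X, w, rcond=None)[0]   # residual of least squares
v /= np.linalg.norm(v)

for eps in [1e-1, 1e-2, 1e-3]:
    Y = np.hstack([X, eps * v[:, None]])
    d_factor = np.linalg.norm(Y - X_pad)                # = eps
    d_psd = np.linalg.norm(Y @ Y.T - X_pad @ X_pad.T)   # = eps^2 (= ||eps^2 v v^T||_F)
    print(eps, d_psd / d_factor**2)   # ratio stays ~1: quadratic, not linear
```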
Knowing that can be (as opposed to linear in the distance), we extend Lemma 4.2 in the following result, showing that a quadratic lower bound does hold. Consequently, any problem (2) with quadratic growth has at least fourth-order growth under rank overspecification. [Matrix growth bound under rank overspecification] Let be such that , and . Then
The proof can be found in Appendix E.4.
4.3.2 Quadratic growth is preserved on linear objectives
In contrast to the generic case, we show that for linear objectives, rank overspecification preserves quadratic growth under some mild additional assumptions. Letting for , problem (2) reads
(13)  
Problem (13) admits a rank solution , where has full column rank. Let be the eigenvalue decomposition, where and is diagonal. We can then take . Let be the orthogonal complement of , i.e. such that
is an orthogonal matrix.
We make an additional assumption on the dual SDP (cf. (12)), a mild condition that guarantees that a low-rank solution of an SDP is unique [4].
[Strict complementarity and dual nondegeneracy] There exists a pair of dual optimal such that the th smallest singular value of is lower bounded by . There exists some constant such that
Assuming this condition, we can lower bound the growth of as follows. For any , , we have . Hence
(14)  
The following result further lower bounds by , thereby establishing the quadratic growth of . See Appendix E.5 for the proof. Let , and for . Suppose Assumption 4.3.2 holds and the feasible set
is a smooth manifold, then there exists a neighborhood of and a positive constant such that for all feasible in this neighborhood,
where the constant is
Consequently, has local quadratic growth around with constant .
By the above lemma, we establish quadratic growth of the rank-overspecified objective , thereby verifying Assumption 12 for the constrained problem. As a direct consequence, we obtain norm convexity and subdifferential regularity from Theorem 2.2. We summarize this in the following theorem. [Geometry of factorized SDPs with rank overspecification] Suppose and Assumptions 4 and 4.3.2 hold for rank ; then the solution to problem (13) is unique, and has local quadratic growth: for all near and feasible,
where depends on but not (so that Assumption 12 holds). Further, for sufficiently large , satisfies norm convexity (9) and subdifferential regularity (10) (with replaced by ). Theorem 4.3.2 applies generally to the Burer-Monteiro factorization of SDPs, and spells out the reason why overspecifying the rank often works in practice: quadratic growth is carried over from to , as long as dual nondegeneracy holds.
4.4 Algorithmic consequences
We now adapt Theorem 3 in both the exact-rank case and the rank-overspecified case with linear objectives to obtain local linear convergence of the prox-linear algorithm. [Local linear convergence of factorized semidefinite optimization] Under the settings of Theorem 4.2 or 4.3.2, let be the local quadratic growth constant of . Initialized sufficiently close to the minimizing set , the prox-linear algorithm converges linearly:
where
If we initialize in a sufficiently small neighborhood of and let be the lowest possible choice as provided in Theorem 2.2, then the linear rate is with
The proof as well as discussions on this last linear rate can be found in Appendix E.6.
5 Examples of quadratic growth
In this section, we provide examples of problem (2) that have low-rank quadratic growth, i.e. satisfy Assumption 12. By giving conditions under which these properties hold, we identify situations in which the geometric results of Theorems 4.2 and 4.3.2 apply.
5.1 Linear objectives
As we saw in Theorem 4.3.2, our sufficient conditions for quadratic growth require checking constraint qualification, strict complementarity, and dual nondegeneracy. We illustrate these conditions on the SDPs for synchronization and SO(d) synchronization when the data contains a strong signal.
[ synchronization] Let be an unknown binary vector. The synchronization problem is to recover from the matrix of noisy observations
where is drawn from the Gaussian Orthogonal Ensemble (GOE): , for , , and these entries are independent. This problem is a simplified model for community detection problems such as in the stochastic block model [5].
The maximum likelihood estimate of the above problem is computationally intractable for its need to search over
possibilities. However, the maximum likelihood problem can be relaxed into an SDP: letting , we solve
(15)  
We are interested in when the relaxation is tight, that is, when is the unique solution to (15). Recent work [6] establishes that when
with high probability,
is the unique solution to (15) and strict complementarity holds. If this happens, dual nondegeneracy also holds: we have , , and the matrices so they certainly span . Finally, we note that CQ holds for any MaxCut problem: the constraints are and for . For , constraint qualification requires that are linearly independent, or that have nonzero rows. This must be true, as the rows have norm one.
Putting this together, the assumptions of Theorem 4.3.2 hold, and the factorized problem with rank has quadratic growth, norm convexity, and subdifferential regularity.
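Verifying tightness of the SDP itself requires a solver, but the strong-signal regime can be sanity-checked cheaply with a spectral estimator (our own construction; this is not the SDP of (15), just the leading eigenvector of the same observation matrix, which recovers the signs in the same low-noise regime):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 300, 0.5
z = rng.choice([-1.0, 1.0], size=n)     # ground-truth binary vector

# GOE noise: symmetric, variance 2 on the diagonal, 1 off the diagonal.
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2)

Y_obs = np.outer(z, z) + sigma * W      # noisy observation matrix

# Spectral estimator: sign-round the leading eigenvector of Y_obs.
vals, vecs = np.linalg.eigh(Y_obs)
z_hat = np.sign(vecs[:, -1])
agree = max(np.mean(z_hat == z), np.mean(z_hat == -z))  # up to a global flip
print(agree)  # fraction of correctly recovered signs
```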
[SO(d) synchronization] The SO(d) synchronization problem is a multidimensional extension of the MaxCut problem: we are interested in recovering orthogonal matrices given their noisy pairwise compositions
Arranging into and forming the decision variable with row blocks , we solve (for )
The SDP relaxation is
(16)  