Proximal algorithms for constrained composite optimization, with applications to solving low-rank SDPs

03/01/2019, by Yu Bai et al., Stanford University

We study a family of (potentially non-convex) constrained optimization problems with convex composite structure. Through a novel analysis of non-smooth geometry, we show that proximal-type algorithms applied to exact penalty formulations of such problems exhibit local linear convergence under a quadratic growth condition, which the compositional structure we consider ensures. The main application of our results is to low-rank semidefinite optimization with Burer-Monteiro factorizations. We precisely identify the conditions for quadratic growth in the factorized problem via structures in the semidefinite problem, which could be of independent interest for understanding matrix factorization.


1 Introduction

We consider constrained composite optimization problems of the form

(1)

in which is a smooth map, is convex, , and . Our main motivating example for such problems is the Burer-Monteiro factorization method for solving the semidefinite optimization problem

(2)

(3)

Problem (1) is the constrained variant of composite optimization problems [13], a family of structured non-convex (and potentially non-smooth) optimization problems of the form

(4)

where the inner map is smooth and the outer function is convex. Such composite structure appears naturally in learning problems with a convex but non-smooth loss, such as robust phase retrieval [20, 17].

A prevailing algorithm for solving the composite problem (4) is the prox-linear algorithm [13, 15], which sequentially minimizes a Taylor-like model of the objective:

(5)

The model function linearizes the smooth map, keeps the outer convex function, and is therefore convex. Each iteration thus requires solving a strongly convex problem, which is frequently efficient.
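To make the iteration concrete, the following is a minimal sketch of one prox-linear step for a composite objective of the form h(c(x)); the robust-phase-retrieval instance, the stepsize, the helper names (prox_linear_step, h_cvx), and the use of cvxpy for the convex subproblem are illustrative assumptions, not choices made in the paper.

```python
# A minimal sketch of the prox-linear iteration (5), assuming F(x) = h(c(x))
# with h convex and c smooth. The convex subproblem is solved with cvxpy.
import numpy as np
import cvxpy as cp

def prox_linear_step(x, c, Jc, h_cvx, t):
    """Minimize h(c(x) + Jc(x)(y - x)) + (1/(2t)) * ||y - x||^2 over y."""
    y = cp.Variable(x.shape[0])
    model = h_cvx(c(x) + Jc(x) @ (y - x)) + (1.0 / (2.0 * t)) * cp.sum_squares(y - x)
    cp.Problem(cp.Minimize(model)).solve()
    return y.value

# Illustrative instance: robust phase retrieval, h(z) = ||z||_1, c_i(x) = (a_i' x)^2 - b_i.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = (A @ x_true) ** 2
c = lambda x: (A @ x) ** 2 - b
Jc = lambda x: 2 * np.diag(A @ x) @ A        # Jacobian of c at x
h_cvx = lambda z: cp.norm(z, 1)

x = x_true + 0.3 * rng.standard_normal(5)    # start near the solution
for _ in range(20):
    x = prox_linear_step(x, c, Jc, h_cvx, t=0.1)
print("residual:", np.linalg.norm(c(x), 1))
```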

When the outer function is Lipschitz and the inner map is smooth, the prox-linear algorithm converges globally to a stationary point, as measured by sub-differential stationarity. Its local convergence analysis is subtler: local linear convergence has been established under tilt stability, which requires a unique minimizer and strong growth around it [18]. A naive transformation of the constrained problem (1) into the form (4) violates the assumptions of such local analyses, as there may be multiple minimizers. We therefore seek alternative approaches for solving (1) with local convergence guarantees.

1.1 Outline and our contribution

In this paper, we consider the exact penalty method [cf. 8] for solving problem (1), which translates the constraint into an exact penalty term and solves the unconstrained problem

(6)

In the matrix problem, this corresponds to solving

(7)

Such an exact penalty term encourages the iterates to fall onto the constraint set: since the norm penalty grows linearly in the constraint violation, one expects it to dominate the objective whenever the point is infeasible, thereby penalizing infeasible points [7].
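As a concrete illustration, a minimal sketch of evaluating the penalized factorized objective in (7) is given below; the symmetric constraint matrices A_1, …, A_m, the right-hand side b, the linear objective, and the choice of the 2-norm penalty are assumptions made for the example only.

```python
# A minimal sketch of the penalized factorized objective (7),
# F_lambda(U) = f(U U^T) + lambda * ||A(U U^T) - b||, assuming a 2-norm penalty.
import numpy as np

def penalized_objective(U, f, A_list, b, lam):
    X = U @ U.T
    residual = np.array([np.sum(Ai * X) for Ai in A_list]) - b   # A(X) - b via entrywise inner products
    return f(X) + lam * np.linalg.norm(residual)

# Illustrative usage with a linear objective f(X) = <C, X>.
n, r, m = 8, 2, 5
rng = np.random.default_rng(0)
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
A_list = [(M + M.T) / 2 for M in rng.standard_normal((m, n, n))]
U_star = rng.standard_normal((n, r))
b = np.array([np.sum(Ai * (U_star @ U_star.T)) for Ai in A_list])  # makes U_star feasible
f = lambda X: np.sum(C * X)
print(penalized_objective(U_star, f, A_list, b, lam=10.0))  # penalty term vanishes at a feasible point
```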

Problem (6) is itself a composite optimization problem, and we therefore study the convergence of the prox-linear algorithm (5) applied to it, establishing convergence guarantees for matrix problems of the form (7).

We summarize our contributions.

  • We define norm convexity, a local geometric property for generic non-smooth functions. Norm convexity is a weak notion of local convexity, and it dovetails with other regularity conditions for composite optimization (e.g. sub-differential regularity). We show that the exact penalty function (6) satisfies norm convexity if it has quadratic growth around the (local) minimizing set (Section 2).

  • We show that the prox-linear algorithm has local linear convergence for generic composite optimization if the objective has quadratic growth and norm convexity. Our result extends Drusvyatskiy and Lewis [18] and does not rely on the tilt stability assumption required there. Consequently, the prox-linear algorithm on the exact penalty function (6) has local linear convergence as long as the problem has quadratic growth (Section 3).

  • We instantiate our result on the factorized matrix problem (7). To verify the assumptions for the convergence result, we study whether the quadratic growth of (6) can be deduced from the quadratic growth of the original SDP (2). We show that quadratic growth is always preserved if the rank is exact, and is preserved for linear objectives when the rank is over-specified. In contrast, when the objective is non-linear, quadratic growth is no longer preserved under rank over-specification (Section 4). This gives a precise characterization of the convergence of the prox-linear algorithm on factorized SDPs and could be of broader interest for understanding matrix factorization.

  • We provide concrete examples of matrix problems on which our theory is applicable (Appendix F) and numerical experiments verifying our convergence results (Appendix G).

We provide a roadmap of our main results in Figure 1.

Figure 1: A roadmap of the main results. Left: results for generic constrained composite optimization (Sections 2 and 3). Right: application to factorized matrix problems (Section 4).

1.2 Related work

Composite optimization

The exact penalty method was one particular early motivation for considering convex composite functions [14]. The work of Burke [13], Lewis and Wright [24], Drusvyatskiy and Lewis [18], and Drusvyatskiy et al. [19] studies the convergence of proximal algorithms on composite problems. In particular, Drusvyatskiy and Lewis [18] establish local linear convergence of the prox-linear algorithm at the “natural” rate in the presence of sub-differential regularity and the tilt-stability condition, and pose the question (in their Section 5) of whether sub-differential regularity is implied by quadratic growth for general convex composite problems. We establish the norm convexity condition under quadratic growth, thereby resolving this question in the special case of penalized objectives.

Matrix factorization

The idea of solving SDPs by factorizing the decision variable is due to Burer and Monteiro [12]. This factorization encodes the PSD constraint into the objective, at the cost of making the problem non-convex. The enlightening result of Pataki [29] shows that any SDP with m linear constraints has a solution whose rank r satisfies r(r+1)/2 ≤ m; thus when the number of constraints is small, one can choose the factorization rank accordingly, saving substantial computational and storage cost. The non-convexity of the factorized problem (3) makes spurious local minimizers and degenerate geometries possible. Boumal et al. [11] establish benign geometry in a general setting, showing that for generic problem data, all second-order critical points of the linear problem (3) are global minima as long as the rank is over-specified (consistent with our results). For some special problems with a non-linear objective, such as matrix completion and low-rank matrix sensing, a recent line of work [21, 22] shows that there is often no spurious local minimum.

Alternative methods for semidefinite optimization

The majority of early development on SDPs focused on interior-point algorithms, established by Nesterov and Nemirovskii [27] and Alizadeh [3]. Interior-point methods are efficient and robust on small-scale problems but quickly become infeasible beyond that, as they must compute matrix inverses or Schur complements. Augmented Lagrangian methods, including ADMM and Newton-CG algorithms, appear to be faster and more scalable, with well-developed software available (e.g. SDPAD [35] and SDPNAL [37, 36]).

Riemannian methods are suitable for solving problem (3) when the constraints have special structure, such as block-diagonal or orthogonality constraints [9]. These methods are very efficient in practice, and results on their local [2] and global [10] convergence are available. For a thorough introduction to Riemannian optimization on matrix manifolds, see the book of Absil et al. [2].

1.3 Notation

We usually reserve lowercase letters for vector variables and capital letters for matrices. The space of symmetric matrices is denoted S^n. For a matrix, its eigenvalues and singular values are sorted in decreasing order. The two-norm, Frobenius norm, and operator norm are denoted by ‖·‖_2, ‖·‖_F, and ‖·‖_op. For a twice differentiable function, ∇ and ∇² denote its gradient and Hessian. For a vector-valued function, we use the (transposed) Jacobian, so that the first-order Taylor expansion reads c(y) ≈ c(x) + ∇c(x)⊤(y − x). For a convex composite function, where the inner map is smooth and the outer function is convex, ∂ denotes its (Fréchet) sub-differential. We let dist(x, S) be the distance of a point x to a set S and write B(S, ε) for the ε-neighborhood of S.

2 Geometry of the composite objective

In this section, we analyze the local geometry of the composite penalized objective. We first show that quadratic growth (Definition 2) is preserved when we reformulate problem (1) as (6). We then show that such quadratic growth implies norm convexity and sub-differential regularity (Section 2.2).

Throughout the paper, we let

denote the objective and the penalty function, and

(8)

denote the penalized objective (the subscript is omitted when it is clear from the context). Let be a local minimizing set of Problem (1), i.e. for any and for all such that .

A first question is whether minimizing the penalized objective is equivalent to solving the original constrained problem (1), i.e. whether its local minimizing set coincides with that of (1). On the constraint set the penalty vanishes, so the minimizing set is unchanged; off the constraint set, the norm penalty has a “pointy” behavior that produces strong growth, so if the objective is sufficiently smooth, this penalty term will dominate and force the penalized objective to grow off the constraint set as well. We make this argument precise in Section 2.1 and give an affirmative answer under quadratic growth and constraint qualification.

We now define the quadratic growth property. [Quadratic growth] A function is said to have -quadratic growth in around a local minimizing set if

where is the function value on .
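In its standard form (an assumed reconstruction, writing $\mathcal{X}^\star$ for the local minimizing set and $F^\star$ for the common function value on it), the growth condition reads:

```latex
F(x) \;\ge\; F^\star + \frac{\alpha}{2}\,\operatorname{dist}^2\bigl(x, \mathcal{X}^\star\bigr)
\qquad \text{for all } x \text{ in a neighborhood of } \mathcal{X}^\star .
```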

We now collect our assumptions for this section. For properties that are required to hold locally, we assume there exists an such that all local properties hold in .

[Smoothness] In a neighborhood , the objective with -bounded and -Lipschitz Hessian. That is,

Further, is -locally Lipschitz. Functions and are also smooth with parameters accordingly (for example, is -smooth).

[Quadratic growth] There exists some such that locally

holds for all feasible points in the neighborhood. Assumption 2 ensures that the constrained optimization problem has quadratic growth around the minimizing set. This is the main assumption on which we rely to establish the various geometric properties.

An additional assumption that we make on the constrained problem (1) is the following. [Constraint qualification] There exists some constant such that, for all points in the neighborhood, the Jacobian of the constraint function has full row rank and satisfies the quantitative bound . Consequently, in this neighborhood, the constraint set

is a smooth manifold. We further assume that the minimizing set is a compact smooth submanifold of . Assumption 2 is known as the Linear Independence Constraint Qualification (LICQ) in the nonlinear programming literature and requires that the normal space of at is -dimensional. (The reader can refer to [2, Chapter 3] for background on smooth manifolds.)

2.1 Preservation of quadratic growth

We start by asking the following question: does the penalty method preserve quadratic growth? Namely, if a constrained problem has quadratic growth on the constraint set, does the penalized objective have quadratic growth in the whole space? The following result gives an affirmative answer. Let Assumptions 2 and 2 hold. Let Assumption 2 hold, i.e. for all such that ,

then the penalized objective has local quadratic growth: for any , setting , there exists a neighborhood such that for all and , we have

where . Lemma 2.1 says that quadratic growth is preserved in the penalized formulation. In particular, is a local minimum of . The proof can be found in Appendix C.1.

2.2 Quadratic growth implies norm convexity

We now define norm convexity and sub-differential regularity, two geometric properties that are essential to establishing the convergence of proximal algorithms in Section 3. [Norm convexity] The function is norm convex around the minimizing set with constant if for all near , we have

Next, we recall the definition of sub-differential regularity [18]. [Sub-differential regularity] The function is -sub-differentially regular at if for all near , we have

We now present our main geometric result, that is, for having the penalty structure (8), quadratic growth of implies norm convexity and sub-differential regularity.

To gain some intuition, let us illustrate that quadratic growth implies norm convexity and sub-differential regularity in the convex case. For convex, assuming -quadratic growth, we have

for any choice of the subgradient and minimum . Choosing the minimum norm subgradient and closest to gives sub-differential regularity

Therefore norm convexity generalizes convexity by only requiring .
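For completeness, the standard chain of inequalities behind this convex-case illustration is the following (written under assumed standard notation: $v \in \partial F(x)$ is any subgradient and $\bar{x}$ is the point of the minimizing set closest to $x$):

```latex
\frac{\alpha}{2}\,\operatorname{dist}^2(x,\mathcal{X}^\star)
\;\le\; F(x) - F^\star
\;\le\; \langle v,\, x - \bar{x} \rangle
\;\le\; \|v\|\,\operatorname{dist}(x,\mathcal{X}^\star),
\qquad\text{hence}\qquad
\operatorname{dist}\bigl(0, \partial F(x)\bigr) \;\ge\; \frac{\alpha}{2}\,\operatorname{dist}(x,\mathcal{X}^\star).
```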

Norm convexity specifies a local regularity condition for non-convex non-smooth functions. While it does not necessarily hold for general composite functions, we show that for our exact penalty objective , quadratic growth does imply norm convexity and hence sub-differential regularity.

[Norm convexity and sub-differential regularity] Let Assumptions 2, 2 and 2 hold. Then there exists a constant and a neighborhood in which, setting the penalty parameter , we have norm convexity

(9)

Consequently, sub-differential regularity holds:

(10)

where . The proof can be found in Appendix C.2.

3 Local convergence of the prox-linear algorithm

We now analyze the convergence of the prox-linear algorithm (5) for generic composite problems of the form (4).

Recall that the prox-linear algorithm iterates

where

and is a small stepsize.

For the convergence result, we assume that is -Lipschitz and is -smooth, i.e.

An immediate consequence of the smoothness is that

(11)

so gives a local quadratic approximation of near . In particular, when , is an upper bound on , implying that the prox-linear algorithm is a descent method.
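The standard form of this approximation bound (assuming, as in [18], that the outer function h is L-Lipschitz and the Jacobian of the inner map c is β-Lipschitz; the notation here is an assumed standard form, not the paper's own display) is

```latex
\Bigl|\, h\bigl(c(y)\bigr) \;-\; h\bigl(c(x) + \nabla c(x)^{\top}(y - x)\bigr) \Bigr|
\;\le\; \frac{L\beta}{2}\,\| y - x \|^{2},
```

so that adding the proximal term with stepsize t ≤ 1/(Lβ) indeed yields an upper model of the composite objective.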

We now present our main algorithmic result: the prox-linear algorithm has local linear convergence as long as has quadratic growth and norm convexity. [Local linear convergence of the prox-linear algorithm] Suppose satisfies the above assumptions and has a compact local minimizing set . Assume that there exists such that the following happens in :

  1. has -quadratic growth and norm convexity with constant around ;

  2. Prox-linear iterates have the proximity property (see Definition D.2 for a formal definition and discussion).

Then, for sufficiently close to , the prox-linear algorithm (5) has linear convergence:

where

The proof builds on existing results on composite optimization from [18, Sections 5 and 6], which we review in Appendix D.1, and makes novel use of norm convexity to establish local linear convergence. The proof can be found in Appendix D.2. The relationship between our result and existing results based on tilt stability is discussed in Appendix D.3.

To apply Theorem 3 to the exact penalty formulation (6), we only need to verify the quadratic growth assumption (as norm convexity is then implied by Theorem 2.2); we will see more of this for matrix problems in Section 4.

4 Application to factorized matrix problems

We now instantiate our geometry and convergence results on our main application, the matrix problem (7), and show that the prox-linear method achieves local linear convergence for solving many low-rank semidefinite problems.

For the matrix problem, recall that the objective and the penalized objective take the form in (7), namely

Recall that our algorithmic result (Theorem 3) requires the penalized objective to have quadratic growth. The main focus of this section is to study whether the quadratic growth of the factorized problem (Assumption 2) can be deduced from the quadratic growth of the SDP, which can often be verified more directly.

We build this connection in two separate cases: factorization with the exact rank and factorization with rank over-specification. We show that quadratic growth is transferred to the factorized problem only if the rank is exact (Section 4.2), or if the rank is over-specified and the objective is linear (Section 4.3.2). In both cases, we adapt Theorem 3 to provide the convergence result as a corollary. If the objective is not linear, quadratic growth will in general fail to hold in the factorized problem, and the prox-linear algorithm no longer has local linear convergence. We demonstrate this via a counter-example (Section 4.3.1).

Throughout this section, we assume the following assumptions on the semidefinite objective , which we will use to deduce properties on the factorized objective . [Smoothness] The objective with -bounded and -Lipschitz Hessian, i.e. for all ,

Further, is -locally Lipschitz near .

[Dual optimum] There exists at least one dual optimum associated with , i.e. and that satisfy the KKT conditions

(12)

[Rank- quadratic growth] There exists such that

for all in the set

Assumption 4 states that the semidefinite problem has a unique low-rank solution and that there is no duality gap. A number of conditions, such as Slater’s condition, guarantee that there is no duality gap and that such a dual optimum exists. Assumption 12 ensures that on low-rank feasible points, the SDP objective has strong growth around the solution. As we only require strong growth on low-rank matrices, this assumption is often more likely to hold than full quadratic growth. We demonstrate this in Appendix F. [Rank constraint qualification] The constraint set

is a smooth manifold. There exists some constant such that the constraint coefficient matrices satisfy for all in a neighborhood of , where is defined via

4.1 Preliminaries: global optimality, matrix distance

Let be the unique minimum of problem (2) and .

Global optimality

We first show that the (global) minimizing set of the exact penalty function corresponds exactly to the solution of the original semidefinite problem (2) when the penalty parameter is sufficiently large. Under Assumption 4, when and , the minimizing set of is

The proof can be found in Appendix E.2.

Non-uniqueness and matrix distance

The factorization introduces a great deal of non-uniqueness, but this non-uniqueness behaves nicely and satisfies the assumptions required by our geometry and optimization results. One particularly nice property is that the minimizing set, having the form

is a compact smooth manifold, and the distance

is the Procrustes distance between the two factors. More background on the factorization map and the Procrustes distance is provided in Appendix E.1.
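A minimal sketch of computing this distance, assuming the usual definition dist(U, V) = min over orthogonal R of ‖U − VR‖_F (the closed form via the SVD of V⊤U is standard):

```python
import numpy as np

def procrustes_distance(U, V):
    """dist(U, V) = min over orthogonal R of ||U - V R||_F, via the SVD of V^T U."""
    W, _, Zt = np.linalg.svd(V.T @ U)   # V^T U = W diag(s) Z^T
    R = W @ Zt                          # optimal aligning rotation
    return np.linalg.norm(U - V @ R)
```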

4.2 Matrix growth: the exact-rank case

If we know the exact rank, we can set the factorization rank to match it. In this case, the Euclidean distances in the factored space and in the matrix space are nicely connected, as stated in the following bound. [Lemma 41, [22]] Let two factor matrices be aligned in the Procrustes distance; then

Building on this result, we show that when has quadratic growth around in the constraint set, so does . The proof is in Appendix E.3. Let Assumption 12 hold for rank , then for all such that , we have

where . The above lemma allows us to establish quadratic growth of the factorized objective based on the low-rank quadratic growth of the SDP objective. In particular, applying Lemma 2.1, we get that the penalized objective has local quadratic growth with constant

in a neighborhood of the minimizing set. We can then apply Theorem 2.2 and obtain norm convexity and sub-differential regularity of the penalized objective. We summarize this in the following theorem. [Geometry of matrix factorization with exact rank] Suppose the objective is convex and satisfies the smoothness, dual optimum, low-rank quadratic growth, and constraint qualification assumptions above; then for sufficiently large penalty parameter, the penalized objective satisfies the norm convexity (9) and the sub-differential regularity (10).

4.3 Matrix growth: the rank over-specified case

In many real problems, the true rank cannot be known exactly. In these cases, a common strategy is to posit an upper bound on the rank and factorize with that rank. We show that over-specifying the rank preserves quadratic growth when the objective is linear; hence, for solving SDPs, local linear convergence can still be achieved when we over-specify the rank. In contrast, quadratic growth will not be preserved with a generic convex objective.

4.3.1 Quadratic growth is not preserved in general

Recall that when converting quadratic growth of the SDP into that of the factorized problem (Lemma 4.2), we relied on Lemma 4.2, which says that the matrix distance grows at least linearly in the factor distance:

This bound requires the smallest singular value of the factor to be positive, so if we use an upper bound and factorize with an over-specified rank, the bound becomes vacuous. The following example demonstrates that the growth can indeed be slower when we over-specify the rank.

Let have full column rank and , so that . Let , where is such that . Then,

so and are optimally aligned. However, we have

and

The distance in the PSD matrix space depends quadratically on the distance in the factored space, not linearly.

Taking an SDP with quadratic growth, for example, the factorized version will only have fourth-order growth around the solution in certain directions. The prox-linear algorithm will not have local linear convergence due to this slow growth, for the same reason that gradient descent does not converge linearly on a purely quartic function.
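The following numerical sketch (an illustrative construction in the spirit of the example above, not its exact instance) shows the quadratic, rather than linear, dependence when the rank is over-specified by one:

```python
import numpy as np

n, r_star = 6, 2
rng = np.random.default_rng(1)
U_star = np.linalg.qr(rng.standard_normal((n, r_star)))[0]   # exact-rank factor (orthonormal columns)
U_pad = np.hstack([U_star, np.zeros((n, 1))])                # zero-padded to over-specified rank r* + 1
Q = np.linalg.qr(np.hstack([U_star, rng.standard_normal((n, 1))]))[0]
u = Q[:, -1]                                                 # unit vector orthogonal to col(U_star)

for t in [1e-1, 1e-2, 1e-3]:
    V = np.hstack([U_star, t * u[:, None]])                  # perturb only the extra column
    d = np.linalg.norm(V - U_pad)                            # already optimally aligned, so d = t
    gap = np.linalg.norm(V @ V.T - U_star @ U_star.T)        # equals t^2: quadratic in d
    print(f"dist = {d:.1e}   ||VV^T - X*||_F = {gap:.1e}")
```

Since the matrix-space error scales like the squared factor distance, a quadratically growing SDP objective translates into only fourth-order growth along these directions.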

Knowing that the matrix distance can scale quadratically in the factor distance (as opposed to linearly), we extend Lemma 4.2 in the following result, showing that a quadratic lower bound does hold. Consequently, any problem (2) with quadratic growth has at least fourth-order growth under rank over-specification. [Matrix growth bound under rank over-specification] Let the two factors be optimally aligned. Then

The proof can be found in Appendix E.4.

4.3.2 Quadratic growth is preserved on linear objectives

In contrast to the generic case, we show that for linear objectives, rank over-specification preserves quadratic growth under some mild additional assumptions. With a linear objective, problem (2) reads

(13)

Problem (13) admits a rank- solution , where has full column rank. Let be the eigenvalue decomposition, where and is diagonal. We can then take . Let be the orthogonal complement of , i.e. such that

is an orthogonal matrix.

We make an additional assumption on the dual SDP (cf. (12)); this is a mild condition that guarantees that a low-rank solution of an SDP is unique [4].

[Strict complementarity and dual non-degeneracy] There exists a pair of dual optimal such that the -th smallest singular value of is lower bounded by . There exists some constant such that

Assuming this condition, we can lower bound the growth of as follows. For any , , we have . Hence

(14)

The following result provides a further lower bound, thereby establishing the quadratic growth of the factorized problem. See Appendix E.5 for the proof. Let , and for . Suppose Assumption 4.3.2 holds, and the feasible set

is a smooth manifold; then there exists a neighborhood and a positive constant such that for all feasible points in this neighborhood,

where the constant is

Consequently, the factorized objective has local quadratic growth around the minimizing set with the constant above.

By the above lemma, we establish quadratic growth of the rank over-specified objective, thereby verifying Assumption 12 for the constrained problem. As a direct consequence, we obtain norm convexity and sub-differential regularity from Theorem 2.2. We summarize this in the following theorem. [Geometry of factorized SDPs with rank over-specification] Suppose Assumptions 4 and 4.3.2 hold; then the solution to problem (13) is unique, and the factorized objective has local quadratic growth: for all feasible points near the minimizing set,

where the growth constant does not depend on the over-specified rank (so that Assumption 12 holds). Further, for sufficiently large penalty parameter, the penalized objective satisfies the norm convexity (9) and sub-differential regularity (10). Theorem 4.3.2 applies generally to the Burer-Monteiro factorization of SDPs and spells out why over-specifying the rank often works in practice: quadratic growth carries over from the SDP to the factorized problem, as long as dual non-degeneracy holds.

4.4 Algorithmic consequences

We now adapt Theorem 3, in both the exact-rank case and the rank over-specified case with linear objectives, to obtain local linear convergence of the prox-linear algorithm. [Local linear convergence of factorized semidefinite optimization] Under the settings of Theorem 4.2 or 4.3.2, let be the local quadratic growth constant of the penalized objective. Initializing sufficiently close to the minimizing set, the prox-linear algorithm converges linearly:

where

If we initialize in a sufficiently small neighborhood of and let be the lowest possible choice as provided in Theorem 2.2, then the linear rate is with

The proof, together with a discussion of this last linear rate, can be found in Appendix E.6.

5 Examples of quadratic growth

In this section, we provide examples of problem (2) that have low-rank quadratic growth, i.e. that satisfy Assumption 12. By giving conditions under which this holds, we identify situations in which the geometric results of Theorems 4.2 and 4.3.2 apply.

5.1 Linear objectives

As we saw in Theorem 4.3.2, our sufficient conditions for quadratic growth require checking constraint qualification, strict complementarity, and dual non-degeneracy. We illustrate these conditions for the Z2-synchronization and SO(d)-synchronization SDPs when the data contains a strong signal.

[Z2 synchronization] Let x ∈ {−1, +1}^n be an unknown binary vector. The Z2-synchronization problem is to recover x from a matrix of noisy observations of xx⊤,

where the noise matrix W is drawn from the Gaussian Orthogonal Ensemble (GOE): W_ij = W_ji ∼ N(0, 1) for i < j, W_ii ∼ N(0, 2), and these entries are independent. This problem is a simplified model for community detection problems such as in the stochastic block model [5].

The maximum likelihood estimate for this problem is computationally intractable, as it requires searching over 2^n possibilities. However, the maximum likelihood problem can be relaxed into an SDP: letting X = xx⊤, we solve

(15)

We are interested in when the relaxation is tight, that is, when xx⊤ is the unique solution to (15). Recent work [6] establishes that when the noise level is small enough, with high probability, xx⊤ is the unique solution to (15) and strict complementarity holds. When this happens, dual non-degeneracy also holds: the relevant constraint matrices certainly span the required space.

Finally, we note that constraint qualification holds for any MaxCut-type problem: the constraints are the diagonal constraints X_ii = 1 for i = 1, …, n. At a factorized point, constraint qualification requires that the corresponding constraint gradients are linearly independent, which holds exactly when the factor has non-zero rows. This must be true, as the constraints force each row to have norm one.

Putting these together, the assumptions of Theorem 4.3.2 hold, and the factorized problem (with over-specified rank) will have quadratic growth, norm convexity, and sub-differential regularity.
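A minimal sketch of the Z2-synchronization SDP (15), together with a rank-one check on its solution; the problem size, noise scaling, and use of cvxpy's default SDP solver are illustrative assumptions:

```python
import numpy as np
import cvxpy as cp

n, sigma = 30, 0.3
rng = np.random.default_rng(2)
x = rng.choice([-1.0, 1.0], size=n)
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2 * n)                       # GOE-like noise (illustrative normalization)
Y = np.outer(x, x) + sigma * W                       # noisy observation of x x^T

X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Maximize(cp.trace(Y @ X)), [cp.diag(X) == 1])
prob.solve()

evals = np.linalg.eigvalsh(X.value)                  # in the strong-signal regime, the solution
print("top two eigenvalues:", evals[-1], evals[-2])  # should be (nearly) rank one
```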

[SO(d) synchronization] The SO(d) synchronization problem is a multi-dimensional extension of the MaxCut problem: we are interested in recovering orthogonal matrices given their noisy pairwise compositions

Arranging into and forming the decision variable with row blocks , we solve (for )

The SDP relaxation is

(16)