# A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints

We propose a new algorithm to solve optimization problems of the form min f(X) for a smooth function f under the constraints that X is positive semidefinite and the diagonal blocks of X are small identity matrices. Such problems often arise as the result of relaxing a rank constraint (lifting). In particular, many estimation tasks involving phases, rotations, orthonormal bases or permutations fit in this framework, and so do certain relaxations of combinatorial problems such as Max-Cut. The proposed algorithm exploits the facts that (1) such formulations admit low-rank solutions, and (2) their rank-restricted versions are smooth optimization problems on a Riemannian manifold. Combining insights from both the Riemannian and the convex geometries of the problem, we characterize when second-order critical points of the smooth problem reveal KKT points of the semidefinite problem. We compare against state-of-the-art, mature software and find that, on certain interesting problem instances, what we call the staircase method is orders of magnitude faster, more accurate, and scales better. Code is available.


## 1 Introduction

This paper considers the generic problem of estimating matrices Y_1, …, Y_m ∈ R^{d×p} with orthonormal rows, that is, such that Y_iY_i^⊤ = I_d (identity of size d) for all i. We further focus on problems where only relative information is available, that is, information about Y_iY_j^⊤ for some of the pairs (i, j) is available, but there is no information about individual Y_i's. As will be detailed below, particular cases of this come up in a number of applications. For example, when p = d = 1, the variables reduce to ±1 and the products Y_iY_j indicate whether Y_i and Y_j have the same sign or not, allowing to model certain combinatorial problems. When d = 1, the variables reduce to unit-norm vectors in R^p, and the products correspond to inner products between them, allowing to model correlations and proximity on a sphere. Finally, when p = d, the matrices Y_i are orthogonal, and the products Y_iY_j^⊤ represent relative orthogonal transformations, such as rotations, reflections and permutations.

For ease of notation, we stack the orthonormal matrices Y_1, …, Y_m on top of each other to form Y ∈ R^{n×p}, n = md. Then, X = YY^⊤ is a block matrix whose (i, j) block corresponds to the relative product Y_iY_j^⊤. Define the (transposed) Stiefel manifold as

 St(d,p) = {Z ∈ R^{d×p} : ZZ^⊤ = I_d},

and the set of Y's obtained by stacking as

 St(d,p)^m = {Y ∈ R^{md×p} : Y^⊤ = (Y_1^⊤ Y_2^⊤ ⋯ Y_m^⊤) and Y_1, …, Y_m ∈ St(d,p)}
           = {Y ∈ R^{n×p} : (YY^⊤)_ii = I_d for i = 1…m}. (1)
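As a quick numerical illustration of (1), the following sketch (the sizes m, d, p are our own illustrative choices) builds a random element of St(d,p)^m by stacking slices with orthonormal rows and checks the diagonal blocks of YY^⊤:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, p = 5, 3, 4  # illustrative sizes, with n = m*d

# Draw random slices Y_i in St(d, p): d-by-p matrices with orthonormal rows.
slices = []
for _ in range(m):
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))  # p-by-d, orthonormal columns
    slices.append(Q.T)                                # d-by-p, orthonormal rows
Y = np.vstack(slices)                                 # n-by-p with n = m*d

# Check the second characterization in (1): (Y Y^T)_{ii} = I_d for all i.
X = Y @ Y.T
for i in range(m):
    block = X[i*d:(i+1)*d, i*d:(i+1)*d]
    assert np.allclose(block, np.eye(d))
```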

This paper is concerned with solving optimization problems of the form

 min_{Y ∈ R^{n×p}}  g(Y) = f(YY^⊤),  subject to  Y ∈ St(d,p)^m, (RPp)

with f a twice continuously differentiable cost defined over the symmetric matrices. Here, f is for example the negative likelihood of X = YY^⊤ with respect to available data. The restriction that g be only a function of YY^⊤ encodes the property that only relative information is available, through the products Y_iY_j^⊤. This induces invariance of the cost under right-action of the orthogonal group. Indeed, g(YQ) = f(YQQ^⊤Y^⊤) = f(YY^⊤) = g(Y) for any orthogonal matrix Q of size p. Thus, solutions of (RPp) are only defined up to this group action.

Problem (RPp) is computationally hard. In particular, for d = p = 1 and linear f, it covers the NP-hard Max-Cut problem [36]. Following that and other previous work [13, 66, 9], we consider a relaxation through the following observation. For all Y ∈ St(d,p)^m, the matrix X = YY^⊤ is positive semidefinite, its diagonal blocks are identity matrices I_d, and it has rank at most p. Conversely, any matrix X with those properties can be factored as YY^⊤ with Y ∈ St(d,p)^m. In other words, problem (RPp) is equivalent to optimizing f over the convex set

 C = {X ∈ S^{n×n} : X ⪰ 0 and X_ii = I_d for i = 1…m}, (2)

with the additional constraint rank(X) ≤ p. As is often the case, the rank constraint is the culprit. Indeed, continuing with the Max-Cut example (linear f), optimization over C without the rank constraint is a semidefinite program, which can be solved to arbitrary precision in polynomial time [76].

This is motivation to study the relaxation obtained by ignoring the rank constraint (for linear f, it is also the dual of the dual of (RPp)):

 min_{X ∈ S^{n×n}}  f(X),  subject to  X ∈ C. (P)

The optimal cost of (P) is a lower bound on that of (RPp). Furthermore, if (P) admits a solution X = YY^⊤ with Y ∈ St(d,p)^m, that is, a solution of rank at most p, then Y is a solution of (RPp). When that is not the case, a higher-rank solution may still be projected to a (hopefully good) initial guess for the nonconvex problem (RPp). See [49, 9] for a discussion of approximation results related to these projections. The price to pay is that C is much higher-dimensional than St(d,p)^m: this is called a lift [13].

For linear f, solutions of (P) can of course be computed using standard SDP solvers, such as interior point methods (IPM). Unfortunately, as demonstrated in Section 5, IPM's do not scale well. The main reason is that, as the name suggests, IPM's iterate inside the interior of the search space C. The latter is formed by full-rank, dense matrices of size n×n: this quickly becomes unmanageable.

The full-rank operations seem even more wasteful considering that, still for linear f, problem (P) always admits a solution of rank at most

 p* = (√(1 + 4md(d+1)) − 1)/2 < (d+1)√m ≪ n. (3)
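For concreteness, here is a small computation of the bound (3); the sizes m and d below are illustrative choices of ours:

```python
import math

def p_star(m, d):
    """Rank bound (3): (P) with a linear cost admits a solution of rank <= p*."""
    return (math.sqrt(1 + 4 * m * d * (d + 1)) - 1) / 2

# Example: m = 1000 blocks of size d = 3 gives n = 3000, yet p* is about 109.
m, d = 1000, 3
assert p_star(m, d) < (d + 1) * math.sqrt(m) < m * d
```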

Indeed, this follows from a general result of Shapiro [62], Barvinok [12] and Pataki [53] regarding extreme points of spectrahedra,¹ that is, intersections of the positive semidefinite cone with an affine subspace—the geometry of C is discussed in Section 2.2.

¹ The name spectrahedron for the search space of a semidefinite program echoes the name polyhedron for the search space of a linear program.

This prompted Burer and Monteiro [19, 20] to propose SDPLR, a generic SDP solver which exploits the low-rank phenomenon. Applying SDPLR to our problem amounts to computing a local minimizer Y of (RPp) for some small p, using classical nonlinear optimization algorithms and penalizing for the constraints in a Lagrangian way. Then, p is increased as needed until YY^⊤ can be certified as a solution to the SDP.

SDPLR is powerful and generic, and the theory accompanying the algorithm brings great insight into the problem. But it also has some downsides we want to improve on in the context of (P). First, it is not an easy matter to guarantee convergence to (even local) optimizers in the nonlinear subproblems. Furthermore, since constraints are enforced by penalization, they are not accurately satisfied by the returned solution. Finally, we would like to allow for nonlinear f. Nevertheless, Section 5 shows SDPLR improves significantly upon IPM's.

Journée et al. [40] build upon SDPLR, observing that certain SDP's harbor an elegant Riemannian geometry that can be put to good algorithmic use. In particular, they cover what here corresponds to the case d = 1 and observe that, as remains true for d > 1, (RPp) is an optimization problem on a smooth space: St(d,p)^m is a Riemannian manifold—this geometry is detailed in Section 2.1. Allowing for smooth nonlinear f, they apply essentially the SDPLR machinery, replacing the nonlinear programming algorithms for (RPp) by Riemannian optimization algorithms [4]. These algorithms exploit the smooth structure of the nonconvex search space, resulting in constraint satisfaction up to numerical accuracy, as well as notable speedups.

As a further refinement, Journée et al. [40] address the invariance of g under orthogonal group action. Instead of optimizing over St(d,p)^m, they optimize over the quotient space St(d,p)^m_*/∼, where ∼ is an equivalence relation defined over St(d,p)^m_* (the full-rank elements of St(d,p)^m) by Y ∼ YQ for all orthogonal Q of size p. The advantage is that this quotient space, which is still a smooth Riemannian manifold, is now one-to-one with the rank-p matrices in C. Unfortunately, the geometry breaks down at rank-deficient Y's (to see this, notice that equivalence classes of different rank have different dimension; see also Figure 1). The breakdown is problematic since, as will become clear, it is desirable to converge to rank-deficient Y's. Furthermore, that paper too asks for computation of local optimizers of the subproblems, which, on Riemannian manifolds too, is a difficult task.

In both [20] and [40], one of the keys to practical efficiency is (well-justified) optimism: (RPp) is first solved for small values of p, and p is increased only as needed. In both papers, it is observed that, in practice, it often suffices to reach just above the rank of the target solution of (P), which may be quite small; but there is no theory to confirm this. We do not prove such a strong result either, but we give some nontrivial, deterministic bounds on "how high one must lift", refining certain results of [20].

### 1.1 Contribution

In this paper, we describe the Riemannian geometry of St(d,p)^m in order to frame (RPp) as a Riemannian optimization problem. We use existing algorithms [4] and the Manopt toolbox [17] to compute critical points of (RPp), that is, points where the (Riemannian) gradient of the cost function vanishes. In practice, those algorithms tend to converge to second-order critical points, that is, points where the (Riemannian) Hessian is also positive semidefinite, because all other critical points are unstable fixed points of the iteration.

For p > d, St(d,p)^m is a connected,² compact and smooth space. Since we further assume sufficient smoothness in f too, this makes for a nice problem with no delicate limit cases to handle. Furthermore, Riemannian optimization algorithms iterate on the manifold directly: all iterates satisfy the constraints up to numerical accuracy.

² For p = d, St(d,d)^m has 2^m disconnected components, because the orthogonal group O(d) has two components: matrices with determinant +1 and −1. This is a strong incentive to relax at least to p = d + 1.

We then turn our attention to computing Karush-Kuhn-Tucker (KKT) points for (P). These are points that satisfy first-order necessary optimality conditions. If f is convex, the conditions are also sufficient. Our goal is to compute KKT points via the computation of second-order critical points of (RPp), which is lower-dimensional. A key property that makes this possible is the availability of an explicit dual matrix S (21) which intervenes in both sets of conditions.

Using this dual matrix, we show that rank-deficient second-order critical points Y reveal KKT points X = YY^⊤. Furthermore, when a computed second-order critical point Y is full rank, it is shown how to use it as a warm-start for the computation of a second-order critical point of (RPp) with a larger value of p. It is guaranteed that if p is allowed to grow up to n + 1, then all second-order critical points reveal KKT points, so that the procedure terminates. This is formalized in Algorithm 1, which we call the Riemannian Staircase, as it lifts (RPp) to (P) step by step, instead of all at once.
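To fix ideas, here is a deliberately simplified sketch of the staircase idea for the special case d = 1 with a linear cost f(X) = ⟨A, X⟩. Each rank-p problem is (under-)solved by plain Riemannian gradient descent rather than the second-order solvers used in the paper, and the saddle escape is a crude random perturbation rather than a proper second-order descent direction; all names, step sizes and tolerances are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def proj(Y, Z):
    """Tangent projection for d = 1: make each row of Z orthogonal to the
    corresponding unit-norm row of Y."""
    return Z - np.sum(Z * Y, axis=1, keepdims=True) * Y

def retract(Y):
    """Retraction for d = 1: renormalize each row."""
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def staircase(A, p0, p_max, iters=300, tol=1e-6):
    """Toy Riemannian staircase for min <A, Y Y^T> over unit-norm rows (d = 1)."""
    m = A.shape[0]
    step = 0.25 / np.linalg.norm(A, 2)
    Y = retract(rng.standard_normal((m, p0)))
    for p in range(p0, p_max + 1):
        for _ in range(iters):
            G = proj(Y, 2 * A @ Y)    # Riemannian gradient of g(Y) = <A, YY^T>
            if np.linalg.norm(G) < tol:
                break
            Y = retract(Y - step * G)
        if np.linalg.matrix_rank(Y, tol=1e-8) < Y.shape[1] or p == p_max:
            return Y                   # rank deficient: candidate KKT point
        # Crude escape into one more dimension (placeholder for the paper's
        # second-order descent step at a full-rank saddle):
        Y = retract(np.hstack([Y, 0.1 * rng.standard_normal((m, 1))]))
    return Y
```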

The above points rest extensively on work discussed earlier in this introduction [19, 20, 40], and improve upon those works along the lines announced in the same. In particular, we do not require the computation of local optimizers of (RPp), and we avoid the geometry breakdown tied to the quotient approach in [40]. We also stress that the latter reference only covers d = 1, and SDPLR only covers linear f.

We further take particular interest in understanding how large p may grow in the staircase algorithm. We view this part as our principal theoretical contribution. This investigation calls for inspection of the convex geometry of C, with particular attention to its faces and their dimension. To this effect, we use results by Pataki [53] to describe the face F_X of C which contains a given X in its relative interior, and we quote the lower bound on the dimension of that face as a function of rank(X). We further argue that this bound is almost always tight, and we give an essentially tight upper bound on the dimension of a face, generalizing a result of Laurent and Poljak [42] to d > 1.

Using this facial description of C, we establish that for strongly concave f, for p ≥ p* (3), all second-order critical points of (RPp) reveal KKT points. Also, for concave f, we show the same for an explicit threshold on p (Corollary 3.16), and argue that a smaller threshold is sufficient under an additional condition we believe to be mild. Hence:

For linear f, above a certain threshold for p, all second-order critical points of (RPp) are global optimizers.

The condition on p is stronger than the one proposed in [20], and the statement is about second-order critical points, rather than about local optimizers of (RPp). There are no similar results for convex f, as then solutions can have any rank.

We close the paper with numerical experiments showing the efficiency of the staircase algorithm to solve (P) on certain synchronization problems involving rotations and permutations, as compared to IPM’s and SDPLR.

Note that, up to a linear change of variable, problem (P) also encompasses constraints of the form X_ii = B_i where each B_i is positive definite. We assume all diagonal blocks have identical size d, as this simplifies exposition, but the proposed method can easily accommodate inhomogeneous block sizes, and many of the developments go through for complex matrices as well.

### 1.2 Applications

Problem (RPp) and its relaxation (P) appear in numerous applications. Many of those belong to the class of synchronization problems, which consist in estimating group elements from measurements of pairwise ratios. Further applications are also described, e.g., in [65, 49, 9].

##### Combinatorial problems

can be modeled in (RPp) with d = p = 1. A seminal example is Max-Cut: the problem of clustering a graph in two classes, so as to maximize the sum of weights of the edges joining the two classes. The cost is linear, determined by the graph's adjacency matrix. Its relaxation to (P) is the subject of an influential analysis by Goemans and Williamson [36], which helped popularize the type of lifts considered here. See [31, eq. (3)] for a recent application of Max-Cut to genomics. The same setup, but with different linear costs, appears in the stochastic block model [1], in community detection [27], in maximum a posteriori (MAP) inference in Markov random fields with binary variables and pairwise interactions [35], and in robust PCA [46, Alg. 1]. All of these study the effects of the relaxation on the final outcome, mostly under random data models. Their linear cost matrices are often structured (sparse or low-rank), which is easily exploited here.

##### Spherical embeddings

is the general problem of estimating m points on a sphere in R^p, and appears notably in machine learning for classification [81] and in the fundamental problem of packing spheres on a sphere [26]. It is modeled by (RPp) with d = 1. The same setup also models correlation matrix completion and approximation [37]. In the latter, an algorithm to solve (P) is proposed, which inspired [40], which in turn inspired this work.

##### Synchronization of rotations

is the problem of estimating m rotation matrices (orthogonal matrices with determinant 1, to exclude reflections), based on pairwise relative rotation measurements. It is modeled in (RPp) with p = d (often, 2 or 3) and comes up in structure from motion [7], pose graph estimation [21], global registration [24], the generalized Procrustes problem [72] and simultaneous localization and mapping (SLAM) [22]. It serves in global camera path estimation [18], scan alignment [15, 78], and sensor network localization and the molecule problem [29, 30]. In many of these problems, translations must be estimated as well, and it has been shown in practical contexts that rotations and translations are best estimated separately [22, Fig. 1]. Here, (RPp) can easily accommodate the determinant constraint: it comes down to picking one of the connected components of St(d,d)^m, as in [16]. The relaxation (P) ignores this, though; see [60] for relaxations which explicitly model this difference (at additional computational cost). The problem of estimating orthogonal matrices (rather than rotations) appears notably in the noncommutative little Grothendieck problem [49, 9]. In the latter, the relaxation (P) with linear f is called Orthogonal-Cut, and its effect on (RPp) is analyzed. The same relaxation with a nonsmooth cost, for robust estimation, is proposed and analyzed in [78]. See also [8] for another robust formulation of the same problem, based on low-rank–plus–sparse modeling.

##### The common lines problem in Cryo-EM

is an important biomedical imaging instance of (RPp), where orthonormal matrices with d < p are to be estimated [79].

##### Phase synchronization and recovery

can be modeled with d = p = 2 (as phases are rotations in the plane). It is sometimes attractive to model phases as unit-modulus complex numbers instead, as is done in [65] for phase synchronization, with the same SDP relaxation. This can be used for clock synchronization. See [10] for a study of the tightness of this SDP, and [28] for an application to ranking. The Phase-Cut algorithm for phase recovery uses the same SDP [77], with a different linear cost. While not explicitly treated, many of the results in this paper extend to the complex case.

### 1.3 Related work

Problem (RPp) is an instance of optimization on manifolds [4]. Optimization over orthonormal matrices is also studied in, e.g., [34, 80]. Being equivalent to (P) with a rank constraint, (RPp) also falls within the scope of optimization over matrices with bounded rank [61, 47], where the latter is also an extension of [40]. The particular case of optimization over bounded-rank positive semidefinite matrices with linear constraints was already addressed in [71]. The same without the positive semidefiniteness constraint was studied recently in [44], also with a discussion of global optimality of second-order critical points. With a linear cost f, problem (RPp) (which then has a quadratic cost g) is a subclass of quadratically constrained quadratic programming (QCQP). QCQP's and their SDP relaxations have been extensively studied, notably in [50, 67], with particular attention to approximation ratios. For (P), such approximation ratios can be found in [9].

In part owing to the success of (P) with linear f in adequately solving a myriad of hard problems, there has been strong interest in developing fast, large-scale SDP solvers. The present paper is one example of such a solver, restricted to the class of problems (P). SDPLR is a more generic such solver [19, 20]. See also [75] for a review, and [32] for a recent low-complexity example with precise convergence results, but which does not handle constraints.

Much of this paper is concerned with characterizing the rank of solutions of (P), especially with respect to how large p must be allowed to grow in (RPp) to solve (P). There is also considerable value in determining under what conditions (P) admits solutions of the desired rank for a specific application, that is: when is the relaxation tight? This question is partially answered in [10] for the closely related phase synchronization problem, under a stochastic model for the data. See [1] for a proof in the stochastic block model, and [6] for a study of phase transitions in random convex programs. There also exist deterministic tightness results, typically relying on special structure in a graph underlying the problem data. See for example [68, 60, 58]. See also Appendix A for a deterministic proof of tightness in the case of single-cycle synchronization of rotations. The proof rests on the availability of a closed-form expression for the dual matrix S (21) and for the solution to be certified. With the same ingredients, it is easy to show, for example, that (P) is tight for Max-Cut when the graph is bipartite.

Semidefinite relaxations in the form of (P) with additional constraints have also appeared in the literature. In particular, this occurs in estimation of rotations, with explicit care for the determinant constraints: Saunderson et al. [60] explicitly constrain off-diagonal blocks to belong to the convex hull of the rotation group; this is not necessary for the orthogonal group—see Proposition 2.1. Similarly, for synchronization of permutations in joint shape matching, off-diagonal blocks are restricted to be doubly stochastic [25, 38]. Finally, in recent work, Bandeira et al. [11] study a more powerful class of synchronization problems with additional linear constraints of various forms. An example with an additional nonlinear constraint appears in [78], which imposes an upper bound on the spectral norm of X. All of these are motivation to generalize the framework studied here, in future work.

We mention in passing that the MaxBet and MaxDiff problems [73] do not fall within the scope of this paper. Indeed, although they also involve estimating orthonormal matrices as in (), their cost function has a different type of invariance, which would also lead to a different type of relaxation.

### 1.4 Notation

The size parameters obey 1 ≤ d ≤ p ≪ n = md. Matrices X ∈ R^{n×n} are thought of as block matrices with blocks of size d×d. Subscript indexing such as X_ij refers to the block on the ith row and jth column of blocks, 1 ≤ i, j ≤ m. For Y ∈ R^{n×p}, Y_i refers to the ith slice of size d×p, 1 ≤ i ≤ m. The Kronecker product is written ⊗ and vec(M) vectorizes a matrix by stacking its columns on top of each other. A real number x is rounded down as ⌊x⌋. The operator norm ‖M‖_op is the largest singular value of the matrix M, and its Frobenius norm ‖M‖_F is the ℓ²-norm of vec(M). S^{n×n} is the set of symmetric matrices of size n, and X ⪰ 0 means X is positive semidefinite. sym(M) = (M + M^⊤)/2 extracts the symmetric part of a matrix. O(d) is the group of orthogonal matrices of size d×d. ker denotes the null-space, or kernel, of a linear operator.

## 2 Geometry

Both search spaces, of (RPp) and of (P), enjoy rich geometry, which leads to efficient analytical and numerical tools for the study of these optimization problems. The former is a smooth Riemannian manifold, while the latter is a compact convex set. Figure 1 depicts the two.

### 2.1 Smooth geometry of the rank-restricted search space

Endow R^{n×p} with the classical Euclidean metric ⟨A, B⟩ = trace(A^⊤B), corresponding to the Frobenius norm: ‖A‖_F² = ⟨A, A⟩. We view the search space St(d,p)^m of (RPp) as a submanifold of R^{n×p} and endow it with the Riemannian submanifold geometry [4]. First, define a linear operator symblockdiag which symmetrizes the diagonal blocks and zeroes out all other blocks:

 symblockdiag(M)_ij = (M_ii + M_ii^⊤)/2 if i = j, and 0 otherwise. (4)
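In code, the operator (4) might look as follows (a sketch of ours; the function name mirrors the paper's notation):

```python
import numpy as np

def symblockdiag(M, d):
    """Operator (4): keep only the d-by-d diagonal blocks of M, symmetrized."""
    out = np.zeros_like(M)
    for i in range(0, M.shape[0], d):
        B = M[i:i+d, i:i+d]
        out[i:i+d, i:i+d] = (B + B.T) / 2
    return out
```

Note that symblockdiag is self-adjoint for the trace inner product, a fact used in Section 3.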

This allows for a simple definition of the manifold via an equality constraint as

 St(d,p)^m = {Y ∈ R^{n×p} : symblockdiag(YY^⊤) = I_n}. (5)

The set is non-empty if and only if p ≥ d. It is connected if p > d. Counting dimensions yields dim St(d,p)^m = m(dp − d(d+1)/2). The tangent space to St(d,p)^m at Y is a subspace of R^{n×p} obtained by differentiating the equality constraint:

 T_Y St(d,p)^m = {Ẏ ∈ R^{n×p} : symblockdiag(ẎY^⊤ + YẎ^⊤) = 0}. (6)

Among the tangent vectors at Y are all vectors of the form YΩ, for Ω ∈ R^{p×p} skew-symmetric: these correspond to "vertical directions", in the sense that following them does not affect the product YY^⊤ (at first order). Each tangent space is equipped with a restriction of the metric ⟨·, ·⟩, thus making St(d,p)^m a Riemannian submanifold of R^{n×p}. The orthogonal projector from the embedding space to the tangent space at Y is

 Proj_Y(Z) = Z − symblockdiag(ZY^⊤)Y. (7)

The total computational cost of a projection is thus O(ndp) flops, since only the diagonal blocks of ZY^⊤ are needed.
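A direct implementation of (7), with a numerical check that the output satisfies the tangency condition (6); the sizes are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, p = 4, 2, 3

def symblockdiag(M, d):
    out = np.zeros_like(M)
    for i in range(0, M.shape[0], d):
        B = M[i:i+d, i:i+d]
        out[i:i+d, i:i+d] = (B + B.T) / 2
    return out

def proj(Y, Z, d):
    """Orthogonal projector (7) onto the tangent space at Y."""
    return Z - symblockdiag(Z @ Y.T, d) @ Y

# Random feasible Y (slices with orthonormal rows) and an arbitrary Z.
Y = np.vstack([np.linalg.qr(rng.standard_normal((p, d)))[0].T for _ in range(m)])
T = proj(Y, rng.standard_normal((m * d, p)), d)

# T satisfies (6), and projecting twice changes nothing.
assert np.allclose(symblockdiag(T @ Y.T + Y @ T.T, d), 0)
assert np.allclose(proj(Y, T, d), T)
```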

The optimization problem (RPp) involves a function g defined over St(d,p)^m. Denote its classical, Euclidean gradient at Y as ∇g(Y). The Riemannian gradient of g at Y, grad g(Y), is defined as the unique tangent vector at Y such that ⟨grad g(Y), Ẏ⟩ = ⟨∇g(Y), Ẏ⟩ for all tangent Ẏ. Naturally, this is given by the projection of the classical gradient to the tangent space [4, eq. (3.37)]:

 grad g(Y) = Proj_Y(∇g(Y)) = 2(∇f(X) − symblockdiag(∇f(X)X))Y, (8)

where ∇f(X) is the classical gradient of f at X = YY^⊤, a symmetric matrix of size n (symmetric because f is formally defined over the symmetric matrices; if the gradient of f over the square matrices is not symmetric, ∇f(X) is obtained by extracting its symmetric part), and we used (10). Furthermore, denote by ∇²g(Y) the classical Hessian of g at Y. This is a symmetric operator on R^{n×p}. The Riemannian Hessian of g at Y is a symmetric operator on the tangent space at Y obtained as the projection of the derivative of the gradient vector field [4, eq. (5.15)]:

 Hess g(Y)[Ẏ] = Proj_Y(D(Y ↦ Proj_Y(∇g(Y)))(Y)[Ẏ])
             = Proj_Y(∇²g(Y)[Ẏ] − symblockdiag(∇g(Y)Y^⊤)Ẏ), (9)

where D denotes a classical directional derivative and we used that Proj_Y(BY) = 0 for any symmetric, block-diagonal B. For future reference, we note these expressions of the derivatives of g in terms of those of f:

 ∇g(Y) = 2∇f(X)Y, and (10)
 ∇²g(Y)[Ẏ] = 2(∇²f(X)[Ẋ]Y + ∇f(X)Ẏ), with Ẋ = ẎY^⊤ + YẎ^⊤. (11)
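The identity (10) is easy to verify numerically with a linear test cost f(X) = ⟨C, X⟩ (so that ∇f(X) = C), using a central finite difference; everything below is an illustrative construction of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p = 4, 2, 3
n = m * d

# Linear test cost f(X) = <C, X> with C symmetric, so that grad f(X) = C.
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
Y = np.vstack([np.linalg.qr(rng.standard_normal((p, d)))[0].T for _ in range(m)])

g = lambda Y: np.sum(C * (Y @ Y.T))   # g(Y) = f(YY^T)
G = 2 * C @ Y                          # formula (10): grad g(Y) = 2 grad f(X) Y

# Central finite difference of g along a random direction D matches <G, D>.
D = rng.standard_normal((n, p)); t = 1e-6
fd = (g(Y + t * D) - g(Y - t * D)) / (2 * t)
assert abs(fd - np.sum(G * D)) < 1e-5
```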

Optimization algorithms on Riemannian manifolds are typically iterative. As such, they require a means of moving away from a point Y along a prescribed tangent direction Ẏ, to reach a new point on the manifold: the next iterate. Since Y + Ẏ does not, in general, belong to the manifold, extra operations are required. Retractions achieve exactly this [4, § 4.1]. One possible retraction for (1) is as follows. For each slice i = 1, …, m,

 (Retraction_Y(Ẏ))_i = U_iV_i^⊤, with Y_i + Ẏ_i = U_iΣ_iV_i^⊤, (12)

where U_iΣ_iV_i^⊤ is a thin singular value decomposition of the ith slice Y_i + Ẏ_i. This retraction projects each slice of Y + Ẏ to the closest orthonormal matrix. Consequently, it is even a second-order retraction [2]. The total cost of computing a retraction is O(ndp) flops.

### 2.2 Convex geometry of the full search space

The optimization problem (P) is defined over the compact convex set

 C = {X ∈ S^{n×n} : X ⪰ 0 and X_ii = I_d for i = 1…m}.

For d = 1, this is the elliptope, or set of correlation matrices [42]. Often, we hope to recover matrices X in C of rank d, or such that the off-diagonal blocks X_ij are orthogonal. The following proposition shows that these two considerations are equivalent, and that the convex relaxation leading to C is tight in that sense (the tightest relaxation would consider the convex hull of the rank-d matrices in C, but this set is difficult to handle⁴).

⁴ Let C_d denote the convex hull of the rank-d matrices in C. The extreme points of C_d are these matrices [56, Cor. 18.3.1]. Thus, optimizing a linear cost function over C_d solves (RPp) with p = d, which is NP-hard. Hence, there probably does not exist an efficient representation of C_d.

###### Proposition 2.1.

For all X ∈ C, all blocks X_ij are in the convex hull of O(d), that is, ‖X_ij‖_op ≤ 1. Furthermore, rank(X) = d if and only if X_ij ∈ O(d) for all i, j.

###### Proof.

Since X is positive semidefinite, for all i, j, the 2d×2d submatrix formed by the blocks (i,i), (i,j), (j,i) and (j,j) is positive semidefinite. By a Schur complement argument, this holds if and only if I_d − X_ijX_ij^⊤ ⪰ 0, which in turn happens if and only if all singular values of X_ij are at most 1. The set of such matrices is the convex hull of all orthogonal matrices of size d [59].

Consider Y ∈ St(d,p)^m such that X = YY^⊤ and rank(Y) = rank(X). Clearly, if rank(X) = d, each X_ij = Y_iY_j^⊤ is orthogonal, since then all slices Y_i are d×d and hence orthogonal. Conversely, if X_ij is orthogonal, then the rows of Y_i and Y_j span the same subspace. Indeed, Y_iY_j^⊤Y_jY_i^⊤ = X_ijX_ij^⊤ = I_d. Multiply by Y_i on the right. Now notice that Y_j^⊤Y_j is an orthogonal projector onto the subspace spanned by the rows of Y_j (and similarly for Y_i^⊤Y_i). Since Y_i remains unaffected by such a projection first on the subspace of Y_j then again on the subspace of Y_i, they must span the same subspace. Hence, fixing Y_1, for each i, there exists Q_i ∈ O(d) such that Y_i = Q_iY_1. Finally, X = Q̄Q̄^⊤ with Q̄^⊤ = (Q_1^⊤ ⋯ Q_m^⊤), which confirms that X and Y have rank d. Notice that this proof further shows that X has rank d if and only if there is a spanning tree of edges (i, j) on an m-node graph such that the corresponding X_ij's are orthogonal (in which case they all are). ∎

The set C may be decomposed into faces of various dimensions.

###### Definition 2.1 (faces, §18 in [56]).

A face of the convex set C is a convex subset F of C such that every (closed) line segment in C with a relative interior point in F has both endpoints in F. The empty set and C itself are faces of C.

By [56, Thm. 18.2], the collection of the relative interiors of the non-empty faces (the relative interior of a singleton being the singleton itself) forms a partition of C. That is, each X ∈ C is in the relative interior of exactly one face of C, denoted F_X. Furthermore, all faces of C are exposed [55, Cor. 1], that is, for every face F, there exists a linear cost f such that F is the set of solutions of (P). Of particular interest are the zero-dimensional faces of C (singletons), called its extreme points.

###### Definition 2.2 (Extreme and exposed points).

X ∈ C is an extreme point of C if there do not exist X_1 ≠ X_2 ∈ C and t ∈ (0, 1) such that X = tX_1 + (1 − t)X_2. X ∈ C is an exposed point of C if there exists a linear function which X uniquely maximizes over C.

In other words, X is extreme if it does not lie on an open line segment included in C. Since C is compact, it is the convex hull of its extreme points [56, Cor. 18.5.1]. Extreme points are of interest notably because they often arise as solutions of optimization problems. Specifically, if f is a concave function (in particular, if f is linear), then f attains its minimum on C at one of its extreme points [56, Cor. 32.3.2].

Following the construction in the proof of [53, Thm. 2.1], given Y ∈ R^{n×p} of full rank p such that X = YY^⊤ (hence rank(X) = p), we find that

 F_X = {X̃ = Y(I_p + A)Y^⊤ : A ∈ ker L_X and I_p + A ⪰ 0}, with (13)
 L_X : S^{p×p} → (S^{d×d})^m : A ↦ L_X(A) = (Y_1AY_1^⊤, ⋯, Y_mAY_m^⊤). (14)

The dimension of F_X is the dimension of the kernel of L_X. The rank–nullity theorem gives a lower bound (see also Theorem 3.15 for an upper bound):

 dim F_X = p(p+1)/2 − rank L_X ≥ p(p+1)/2 − md(d+1)/2 ≜ Δ. (15)

It follows that extreme points (i.e., points X such that dim F_X = 0) have small rank:

 d ≤ rank(X) ≤ p* := (√(1 + 4md(d+1)) − 1)/2. (16)

Note that Δ > 0 when p > p*.

###### Remark 2.2.

For linear f, (P) admits an extreme point as global optimizer, so that (RPp) and (P) have the same optimal value as soon as p ≥ p* (16). In other words: for linear f, (RPp) is not NP-hard for p ≥ p*.

Not all feasible X's with rank satisfying (16) are extreme. For example, with d and m as in Figure 1, select two distinct, admissible matrices of rank 1, X_1 and X_2. For 0 < t < 1, the matrix X_t = tX_1 + (1−t)X_2, lying on the open line segment between X_1 and X_2, is admissible and has rank 2. Thus, X_t satisfies (16), but it is not an extreme point, by construction. Notwithstanding, the expectation that L_X is generically of full rank suggests that almost all feasible X's satisfying (16) should be extreme; an intuition that is supported by Figure 1. More generally, in Theorem B.1, we prove for d = 1 that dim F_X = Δ for almost all X of rank p.

Many applications look for solutions of rank d. All X's in C of rank d are exposed (hence extreme), meaning they can all be recovered as unique solutions of (P).

###### Proposition 2.3.

For all X ∈ C, ‖X‖_F² ≤ m²d. Furthermore, ‖X‖_F² = m²d if and only if rank(X) = d. In particular, each X ∈ C of rank d is an exposed extreme point of C.

###### Proof.

Let σ_1(X_ij), …, σ_d(X_ij) denote the singular values of X_ij. By Proposition 2.1, σ_k(X_ij) ≤ 1 for all i, j, k. Hence,

 ‖X‖_F² = Σ_{i,j=1}^m ‖X_ij‖_F² = Σ_{i,j=1}^m Σ_{k=1}^d σ_k²(X_ij) ≤ m²d. (17)

The upper bound is attained if and only if σ_k(X_ij) = 1 for all i, j, k, thus, if and only if all X_ij's are orthogonal. By Proposition 2.1, this is the case if and only if rank(X) = d. Now consider X ∈ C of rank d. We show it is exposed (and hence extreme):

 max_{X̂ ∈ C} ⟨X, X̂⟩ ≤ max_{‖X̂‖_F² ≤ m²d} ⟨X, X̂⟩ ≤ ‖X‖_F · max_{‖X̂‖_F² ≤ m²d} ‖X̂‖_F = m²d. (18)

The second inequality follows from Cauchy–Schwarz, and equality is attained throughout if and only if X̂ = X, which is in C. Thus, both max problems admit X as unique solution, confirming that X is an exposed extreme point. ∎
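The norm characterization of Proposition 2.3 is also easy to check numerically (an illustrative construction of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, p = 4, 2, 5

def random_feasible(rank_d):
    """Random Y in St(d,p)^m; if rank_d, all slices share one row space."""
    base = np.linalg.qr(rng.standard_normal((p, d)))[0].T
    slices = []
    for _ in range(m):
        if rank_d:
            slices.append(np.linalg.qr(rng.standard_normal((d, d)))[0] @ base)
        else:
            slices.append(np.linalg.qr(rng.standard_normal((p, d)))[0].T)
    return np.vstack(slices)

Ylow = random_feasible(True);  Xlow = Ylow @ Ylow.T   # rank(X) = d
Yhi  = random_feasible(False); Xhi  = Yhi @ Yhi.T     # rank(X) = p generically

assert np.isclose(np.linalg.norm(Xlow)**2, m * m * d)  # bound (17) attained
assert np.linalg.norm(Xhi)**2 < m * m * d              # strictly below otherwise
```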

###### Remark 2.4.

Proposition 2.3 is an extension of [45, Thm. 1] to the case d > 1. We note that the proof in that reference does not generalize to d > 1.

## 3 From second-order critical points Y to KKT points X

In this section, we show that rank-deficient second-order critical points Y of (RPp) yield KKT points X = YY^⊤ of (P). Furthermore, when Y is second-order critical but X is not KKT, it is shown how to escape the saddle point by increasing p. If p increases all the way to n + 1, then all second-order critical points reveal KKT points. The proofs parallel those in [40]. The main novelty is explicit bounds on p such that all second-order critical points of (RPp) reveal KKT points. The proofs bring us to consider the facial structure of C.

A key ingredient for all proofs in this section is the availability of an explicit matrix S (21) which is positive semidefinite if and only if X = YY^⊤ is KKT (Theorem 3.3). The formula for S is simply read off from the first-order optimality conditions of (RP_p), owing to smoothness of the latter.

###### Lemma 3.1 (Necessary optimality conditions for (P)).

X ∈ C is called a KKT point for (P) if there exist a symmetric matrix Ŝ and a symmetric, block-diagonal matrix Λ̂ (dual variables) such that

 ŜX = 0,  Ŝ = ∇f(X) + Λ̂,  and  Ŝ ⪰ 0.

If X is a local optimizer for (P), then X is a KKT point. If f is convex, all KKT points are global optimizers.

###### Proof.

Apply Theorems 3.25 and 3.34, and Example 3.36 in [57]. KKT conditions are necessary since Slater’s condition holds: the identity matrix I_n is feasible for (P) and it is strictly positive definite. ∎

###### Lemma 3.2 (Necessary optimality conditions for (RPp)).

Let Y ∈ St(d, p)^m and X = YY^⊤. A critical point Y of (RP_p) satisfies grad g(Y) = 0, that is,

 (∇f(X) − symblockdiag(∇f(X)X)) Y = 0. (19)

A second-order critical point is a critical point which further satisfies Hess g(Y) ⪰ 0, that is,

 for all Ẏ ∈ T_Y St(d, p)^m,  ⟨Ẏ, ∇²g(Y)[Ẏ] − symblockdiag(∇g(Y)Y^⊤) Ẏ⟩ ≥ 0. (20)

If Y is a local optimizer for (RP_p), then it is a second-order critical point.

###### Proof.

This is a direct generalization of the classical necessary optimality conditions for unconstrained optimization [14, Prop. 1.1.1] to the Riemannian setting, as per the formalism in [4, 82]. Use equations (7), (8) and (9) and the fact that symblockdiag is self-adjoint to obtain equations (19) and (20). ∎

Lemmas 3.1 and 3.2 suggest the definition of an (as yet merely tentative) formula for the dual certificate S, based on (19):

 S = S(X) = ∇f(X) − symblockdiag(∇f(X)X). (21)

Indeed, for any critical point Y of (RP_p), it holds (with X = YY^⊤) that SY = 0, so that SX = 0. In that case, for linear f(X) = ⟨C, X⟩, S can be advantageously interpreted as a graph Laplacian (up to a change of variable⁶), as is often the case for dual certificates of estimation problems on graphs [10, 33]. We now show that S is indeed the unique possible dual certificate for any feasible X. (This is similar to, but different from, Theorem 4 in [40]; a complex version appears in [10] for d = 1.)

⁶ For a critical point Y of (RP_p) with p = d, we have X = YY^⊤ with orthogonal Y_i's, and S = C − symblockdiag(CX). Write C̃_ij = Y_i^⊤ C_ij Y_j for short. SY = 0 implies, after some algebra, that the C̃_ij's are symmetric. This can be used to see that, after conjugation by blockdiag(Y₁, …, Y_m), S becomes a matrix with off-diagonal blocks equal to C̃_ij and diagonal blocks equal to −∑_{j≠i} C̃_ij. Thus, it is exactly the Laplacian of the graph with m nodes and edge “weights” given by the matrices −C̃_ij.
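
To make formula (21) concrete, here is a minimal numerical sketch. The sizes and the helpers `random_stiefel` and `symblockdiag` are illustrative, and the symmetric matrix G stands in for ∇f(X). It checks that SY is always a tangent direction at Y, so the criticality condition (19), which asks that SY = 0, is a genuinely Riemannian first-order condition.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, p = 4, 2, 5  # illustrative sizes
n = m * d

def random_stiefel(d, p, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q.T  # Y_i with Y_i Y_i^T = I_d

def symblockdiag(M):
    # Keep only the m diagonal d x d blocks of M, symmetrized.
    out = np.zeros_like(M)
    for i in range(m):
        B = M[i*d:(i+1)*d, i*d:(i+1)*d]
        out[i*d:(i+1)*d, i*d:(i+1)*d] = 0.5 * (B + B.T)
    return out

Y = np.vstack([random_stiefel(d, p, rng) for _ in range(m)])
X = Y @ Y.T
G = rng.standard_normal((n, n))
G = 0.5 * (G + G.T)            # stand-in for the gradient of f at X
S = G - symblockdiag(G @ X)    # candidate dual certificate (21)

# Each block V_i of SY satisfies V_i Y_i^T + Y_i V_i^T = 0, i.e., SY is
# tangent to St(d,p)^m at Y, for any feasible Y (critical or not).
V = S @ Y
for i in range(m):
    Vi, Yi = V[i*d:(i+1)*d], Y[i*d:(i+1)*d]
    assert np.allclose(Vi @ Yi.T + Yi @ Vi.T, 0)
```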

###### Theorem 3.3 (S is the right certificate).

X ∈ C is a KKT point for (P) if and only if S = S(X) (21) is positive semidefinite. If so, (Ŝ, Λ̂) = (S, −symblockdiag(∇f(X)X)) is the unique pair of dual certificates for Lemma 3.1.

###### Proof.

We show the “if” and “only if” parts of the first statement separately.

• If S ⪰ 0: by construction, ⟨S, X⟩ = 0, since the diagonal blocks of X are identity matrices. Since both S and X are positive semidefinite, this implies SX = 0. Apply Lemma 3.1 with Ŝ = S and Λ̂ = −symblockdiag(∇f(X)X).

• Only if: since X is a KKT point for (P), there exist Ŝ and symmetric, block-diagonal Λ̂ satisfying the conditions in Lemma 3.1. In particular, ŜX = 0 and Ŝ = ∇f(X) + Λ̂. Thus, ∇f(X)X = −Λ̂X and symblockdiag(∇f(X)X) = −symblockdiag(Λ̂X) = −Λ̂. Here, we used both the fact that Λ̂ is symmetric, block-diagonal and the fact that the diagonal blocks of X are identity matrices. Consequently, S = ∇f(X) + Λ̂ = Ŝ ⪰ 0.

The last point also shows there exists only one pair (Ŝ, Λ̂) certifying that X is a KKT point. ∎
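
The “if” direction hinges on the identity ⟨S, X⟩ = 0, which holds for every feasible X because the diagonal blocks of X are identities, so ⟨symblockdiag(∇f(X)X), X⟩ = trace(∇f(X)X). A standalone numerical check (sizes and helpers are illustrative; G stands in for ∇f(X)):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, p = 4, 2, 6  # illustrative sizes
n = m * d

def random_stiefel(d, p, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q.T

def symblockdiag(M):
    out = np.zeros_like(M)
    for i in range(m):
        B = M[i*d:(i+1)*d, i*d:(i+1)*d]
        out[i*d:(i+1)*d, i*d:(i+1)*d] = 0.5 * (B + B.T)
    return out

Y = np.vstack([random_stiefel(d, p, rng) for _ in range(m)])
X = Y @ Y.T                    # arbitrary feasible point, need not be critical
G = rng.standard_normal((n, n))
G = 0.5 * (G + G.T)            # stand-in for the gradient of f at X
S = G - symblockdiag(G @ X)

# <symblockdiag(GX), X> = trace(GX) because X_ii = I_d, hence <S, X> = 0.
assert abs(np.trace(S @ X)) < 1e-10
```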

Notice how the Riemannian structure underlying problem (P) made it possible to simply read off an analytical expression for a dual certificate from the necessary optimality conditions of (RP_p). This smooth geometry also leads to uniqueness of the dual certificate (this is connected to nondegeneracy [5, Thm. 7]). Theorem 3.3 makes for an unusually comfortable situation and will be helpful throughout the paper.

For convex f, we can make the following statement regarding uniqueness of the solution.

###### Theorem 3.4.

Assume f is convex. If X is an extreme point of C (which is true in particular if rank(X) = d), and S = S(X) ⪰ 0, and rank(X) + rank(S) = n (strict complementarity), then X is the unique global optimizer of (P).

###### Proof.

From Theorem 3.3, it is clear that X is a global optimizer. We prove by contradiction that it is unique. Let X′ ≠ X be another global optimizer and set Ẋ = X′ − X. Since (P) is a convex problem in this setting, f is constant over the whole (optimal) segment X + tẊ for t ∈ [0, 1]. Hence, the directional derivative of f at X along Ẋ is zero:

 0 = ⟨∇f(X), Ẋ⟩
   = ⟨S + symblockdiag(∇f(X)X), Ẋ⟩   (definition (21) of S)
   = ⟨S, Ẋ⟩                          (diagonal blocks of Ẋ are zero)
   = ⟨S, X′⟩.                        (SX = 0)

Since both S and X′ are positive semidefinite, it ensues that SX′ = 0. (Note that for linear f, this shows S is the dual certificate for all global optimizers of (P), not only for X.) Hence, the range of X′ is included in the kernel of S.

Let r = rank(X) and Y ∈ ℝ^{n×r} be a full-rank matrix such that X = YY^⊤. Strict complementarity and SX = 0 imply that the columns of Y form a basis for the kernel of S. Hence, SX′ = 0 ensures that X′ = YMY^⊤ for some symmetric M ⪰ 0 such that X′ ∈ C. In other words, X′ is in the face of C generated by X. This is a contradiction, since X is extreme: that face is then the singleton {X}, yet it contains X′ ≠ X. ∎

###### Remark 3.5.

In general, the condition that X be an extreme point in the previous theorem cannot be removed. Indeed, if f ≡ 0, then all admissible X's are globally optimal and S(X) = 0 for all of them. In particular, X = I_n satisfies strict complementarity, but if m > 1, it is not extreme, and it is not a unique global optimizer. Likewise, any rank-d admissible X is extreme and globally optimal, but it does not satisfy strict complementarity. Similar examples can be built with nonzero linear costs f(X) = ⟨C, X⟩, where the sparsity pattern of C corresponds to a disconnected graph.

Conversely, it is not true in general that uniqueness of the global optimizer implies extremeness or strict complementarity. Simply consider f(X) = ½‖X − X₀‖²_F with X₀ ∈ C: the global optimizer X = X₀ is unique, and ∇f(X₀) = 0, so that S = 0. For m large enough, X₀ can be chosen to be both not extreme and rank deficient. For an illustration of this in semidefinite programs, see the nice example after Theorem 10 in [5].
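
This counterexample can be checked numerically. The sketch below takes d = 1 and m = 4 (illustrative choices) and builds X₀ as the midpoint of two feasible rank-one points, hence not extreme; for f(X) = ½‖X − X₀‖²_F, the certificate vanishes and strict complementarity fails, even though X₀ is the unique optimizer.

```python
import numpy as np

m = 4  # d = 1, so C = {X psd with unit diagonal} and symblockdiag = diag
x1 = np.array([1., 1., 1., 1.])
x2 = np.array([1., 1., -1., -1.])
X0 = 0.5 * (np.outer(x1, x1) + np.outer(x2, x2))  # midpoint: feasible, not extreme

assert np.allclose(np.diag(X0), 1)       # feasible for (P)
assert np.linalg.matrix_rank(X0) == 2    # rank deficient

# f(X) = 0.5 * ||X - X0||_F^2 is minimized over C at X = X0 only, yet:
G = X0 - X0                              # gradient of f at X0 is zero
S = G - np.diag(np.diag(G @ X0))         # certificate (21) with d = 1
assert np.allclose(S, 0)
assert np.linalg.matrix_rank(X0) + np.linalg.matrix_rank(S) < m  # no strict complementarity
```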

Continuing with convex cost functions f, KKT points of (P) coincide with global optimizers. This and the fact that (P) is a relaxation of (RP_p) lead to the following summary regarding global optimality conditions. For (RP_p), these sufficient conditions are conclusive whenever the relaxation is tight.

###### Corollary 3.6 (Global optimality conditions).

Assume f is convex and let Y ∈ St(d, p)^m, X = YY^⊤; X is globally optimal for (P) if and only if S = S(X) ⪰ 0. If so, then Y is a global optimizer for the nonconvex problem (RP_p). Furthermore, if X is extreme (in particular, if rank(X) = d) and if rank(X) + rank(S) = n, then X is the unique global optimizer of (P) and Y is the unique global optimizer of (RP_p), up to the orthogonal action Y ↦ YQ, Q ∈ O(p).

Returning to the general case of f not necessarily convex, we establish links between second-order critical points of (RP_p) and KKT points of (P). In doing so, it is useful to reformulate the second-order optimality condition (20) on (RP_p) in terms of S (21) and X = YY^⊤. Let Y ∈ St(d, p)^m and X = YY^⊤. Then, Y is a second-order critical point for (RP_p) if and only if it is critical and, for all Ẏ ∈ T_Y St(d, p)^m, with Ẋ = ẎY^⊤ + YẎ^⊤, it holds that (using (11)):

 ⟨Ẏ, Hess g(Y)[Ẏ]⟩ = ⟨Ẏ, ∇²g(Y)[Ẏ] − symblockdiag(∇g(Y)Y^⊤) Ẏ⟩
                    = 2⟨Ẏ, ∇²f(X)[Ẋ] Y + ∇f(X) Ẏ − symblockdiag(∇f(X)X) Ẏ⟩
                    = 2⟨Ẏ, ∇²f(X)[Ẋ] Y + S Ẏ⟩
                    = ⟨Ẋ, ∇²f(X)[Ẋ]⟩ + 2⟨Ẏ, S Ẏ⟩ ≥ 0. (22)
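
The last step of (22) uses 2⟨Ẏ, ∇²f(X)[Ẋ]Y⟩ = ⟨Ẋ, ∇²f(X)[Ẋ]⟩, which follows from Ẋ = ẎY^⊤ + YẎ^⊤ and the symmetry of ∇²f(X)[Ẋ]. A numerical check of this identity for the illustrative choice f(X) = ½‖X‖²_F, so that ∇²f(X)[Ẋ] = Ẋ (sizes are arbitrary, and no tangency of Ẏ is needed for this particular step):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p = 3, 2, 4  # illustrative sizes

def random_stiefel(d, p, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q.T

Y = np.vstack([random_stiefel(d, p, rng) for _ in range(m)])
Ydot = rng.standard_normal(Y.shape)   # arbitrary direction
Xdot = Ydot @ Y.T + Y @ Ydot.T        # induced direction on X = Y Y^T

# With Hess f(X)[Xdot] = Xdot, the step reads 2 <Ydot, Xdot Y> = <Xdot, Xdot>.
lhs = 2 * np.sum(Ydot * (Xdot @ Y))
rhs = np.sum(Xdot * Xdot)
assert np.isclose(lhs, rhs)
```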

Since (