# Factorization Approach for Low-complexity Matrix Completion Problems: Exponential Number of Spurious Solutions and Failure of Gradient Methods

It is well-known that the Burer-Monteiro (B-M) factorization approach can efficiently solve low-rank matrix optimization problems under the RIP condition. It is natural to ask whether B-M factorization-based methods can succeed on any low-rank matrix optimization problems with a low information-theoretic complexity, i.e., polynomial-time solvable problems that have a unique solution. In this work, we provide a negative answer to the above question. We investigate the landscape of B-M factorized polynomial-time solvable matrix completion (MC) problems, which are the most popular subclass of low-rank matrix optimization problems without the RIP condition. We construct an instance of polynomial-time solvable MC problems with exponentially many spurious local minima, which leads to the failure of most gradient-based methods. Based on those results, we define a new complexity metric that potentially measures the solvability of low-rank matrix optimization problems based on the B-M factorization approach. In addition, we show that more measurements of the ground truth matrix can deteriorate the landscape, which further reveals the unfavorable behavior of the B-M factorization on general low-rank matrix optimization problems.


## 1 Introduction

The low-rank matrix optimization problem aims to recover a low-rank ground truth matrix $M^*$ through some measurements modeled as $\mathcal{A}(M^*)$, where the measurement operator $\mathcal{A}$ is a function from $\mathbb{R}^{n\times n}$ to $\mathbb{R}^m$. The operator can be either linear, as in the linear matrix sensing problem and the matrix completion problem (Candès & Recht, 2009; Recht et al., 2010), or nonlinear, as in the one-bit matrix sensing problem (Davenport et al., 2014) and the phase retrieval problem (Shechtman et al., 2015). There are two variants of the problem, known as the symmetric and asymmetric problems. The first one assumes that $M^*$ is a positive semi-definite (PSD) matrix, whereas the second one makes no such assumption and allows $M^*$ to be non-symmetric or sign-indefinite. Since the asymmetric problem can be equivalently transformed into a symmetric problem (Zhang et al., 2021), we focus on the symmetric case.

There are in general two different approaches to overcoming the non-convex low-rank constraint. The first approach is to design a convex penalty function that prefers low-rank matrices and then optimize the penalty function under the measurement constraint (Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010). However, this approach works in the full matrix space and has a high computational complexity. The other widely accepted technique is the Burer-Monteiro (B-M) factorization approach (Burer & Monteiro, 2003), which converts the original problem into an unconstrained one by replacing the original PSD matrix variable with the product $XX^T$ of a low-dimensional variable $X\in\mathbb{R}^{n\times r}$ and its transpose. The optimization problem based on the B-M factorization approach can be written as

$$\min_{X\in\mathbb{R}^{n\times r}}\; g\big[\mathcal{A}(XX^T)-\mathcal{A}(M^*)\big],$$

where $g$ is a loss function that penalizes the mismatch between $\mathcal{A}(XX^T)$ and $\mathcal{A}(M^*)$. Using the B-M factorization, the objective function is generally non-convex even if the loss function $g$ is convex. Nonetheless, it has been proved that under certain strong conditions, such as the Restricted Isometry Property (RIP) condition, saddle-escaping methods can converge to the ground truth solution with a random initialization (Zhang et al., 2021; Bi et al., 2021) and first-order methods with spectral initialization converge locally (Tu et al., 2016; Bhojanapalli et al., 2016); see Chi et al. (2019) for an overview.

Then, it is natural to ask whether optimization methods based on the B-M factorization approach can succeed on general low-rank matrix optimization problems with a low information-theoretic complexity (i.e., problems that have a unique global solution and can be solved in polynomial time), especially when the RIP condition does not hold. In this work, we focus on a common class of problems without the RIP condition, namely the matrix completion (MC) problem. For the MC problem, the measurement operator $\mathcal{A}_\Omega$ is given by

$$[\mathcal{A}_\Omega(M)]_{ij}:=\begin{cases} M_{ij} & \text{if } (i,j)\in\Omega,\\ 0 & \text{otherwise},\end{cases}$$

where $\Omega$ is the set of indices of observed entries. We denote the measurement $\mathcal{A}_\Omega(M)$ as $M_\Omega$ for simplicity. An instance of the MC problem, denoted as $\mathcal{P}_{M^*,\Omega,n,r}$, can be formulated as

$$\text{find}\ \ M\in\mathbb{R}^{n\times n}\quad \text{s.t.}\ \ \operatorname{rank}(M)\le r,\ \ M\succeq 0,\ \ M_\Omega=M^*_\Omega. \qquad (\mathcal{P}_{M^*,\Omega,n,r})$$

If $M^*$ is the only solution of this problem, we say that $\mathcal{P}_{M^*,\Omega,n,r}$ has a unique solution. Using the B-M factorization approach, the MC problem can be solved via the optimization problem

$$\min_{X\in\mathbb{R}^{n\times r}}\ f(X), \qquad (1)$$

where $f(X):=g\big[(XX^T-M^*)_\Omega\big]$. For example, if the $\ell_2$-loss function is used, the problem becomes

$$\min_{X\in\mathbb{R}^{n\times r}}\ \big\|(XX^T-M^*)_\Omega\big\|_F^2. \qquad (2)$$
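To make formulation (2) concrete, the following is a minimal numerical sketch (my own illustration, not code from the paper): plain gradient descent on $f(X)=\|(XX^T-M^*)_\Omega\|_F^2$, here on a benign, fully observed instance; the sizes, step size, and iteration count are arbitrary choices.

```python
# Gradient descent on the Burer-Monteiro objective f(X) = ||(X X^T - M*)_Omega||_F^2.
# All problem sizes and hyperparameters below are illustrative, not from the paper.
import numpy as np

def bm_gradient_descent(M_star, mask, r, steps=3000, lr=0.01, seed=0):
    """Plain gradient descent on the factorized objective."""
    n = M_star.shape[0]
    rng = np.random.default_rng(seed)
    X = 0.1 * rng.standard_normal((n, r))
    for _ in range(steps):
        R = (X @ X.T - M_star) * mask   # residual restricted to Omega
        X = X - lr * 4 * R @ X          # grad f(X) = 4 (X X^T - M*)_Omega X
    return X

n, r = 6, 2
rng = np.random.default_rng(1)
U = rng.standard_normal((n, r))
M_star = U @ U.T                        # rank-r PSD ground truth
mask = np.ones((n, n))                  # fully observed (benign) instance
X = bm_gradient_descent(M_star, mask, r)
print(np.linalg.norm((X @ X.T - M_star) * mask))  # small residual
```

With full observations, gradient descent recovers the ground truth from a random initialization; the constructions in this paper show how sparse, structured measurement sets destroy this behavior.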

#### Contributions.

We provide a negative answer to the preceding question by constructing MC problem instances for which the optimization complexity of local search methods using the B-M factorization does not align with the information-theoretic complexity of the underlying MC problem instance. The information-theoretic complexity refers to the minimum number of operations that the best possible algorithm takes to find the ground truth matrix, while the optimization complexity refers to the minimum number of operations that a given optimization method takes to find the ground truth matrix. In general, the optimization complexity of local search methods depends on the properties of spurious solutions of the optimization problem, e.g., the number, the sharpness and the regions of attraction of spurious solutions. The optimization complexity predicts the performance of an algorithm and provides a hint on which algorithm to use for a given problem. Therefore, the results in this work imply that the popular B-M factorization approach is not able to capture the benign properties of the low-rank problem when the RIP condition does not hold. We summarize our contributions as follows:

1. Given natural numbers $n$ and $r$ with $r\le n$, we construct a class of MC problem instances whose ground truth matrix has rank $r$. For every instance in this class, there exists a unique global solution, and the solution can be found in polynomial time via graph-theoretic algorithms.

2. Next, we show the existence of an instance in this class whose B-M factorization formulation (2) has exponentially many equivalence classes of spurious local solutions (a solution is called spurious if it is a local minimum but has a larger objective value than the optimal objective value). Note that this claim holds for general loss functions under a weak assumption.

3. Moreover, for the rank-1 case, we prove that most gradient-based methods with a random initialization converge to a spurious local minimum with high probability. Numerical studies verify that the failure of the gradient-based methods also happens in general rank cases.

4. We present an instance that has no spurious solution under the B-M factorization formulation (2), but for which introducing additional observations of the ground truth matrix leads to exponentially many spurious solutions. This example further reveals the unfavorable behavior of the B-M factorization approach on general low-rank matrix optimization problems.

Based on these results, we define a new complexity metric that potentially captures the optimization complexity of optimization methods based on the B-M factorization.

#### Related Work.

The low-rank optimization problem has been well studied under the RIP condition (Recht et al., 2010). Several recent works (Zhang et al., 2019; Bi & Lavaei, 2021; Zhang et al., 2021) showed that the non-convex formulation has no spurious local minima when the RIP parameter is small. To understand how conservative the RIP condition is, we consider a class of polynomial-time solvable problems without the RIP condition and study the behavior of optimization methods on this class. Specifically, we consider the polynomial-time solvable MC problems. Most existing literature on the MC problem is based on the assumption that the measurement set is randomly constructed and the global solution is incoherent (Candès & Recht, 2009; Candès & Tao, 2010; Ge et al., 2016, 2017; Ma et al., 2019; Chen et al., 2020). In comparison, a smaller body of work has focused on the deterministic MC problem (Bhojanapalli & Jain, 2014; Király et al., 2015; Pimentel-Alarcón et al., 2016; Li et al., 2016). Furthermore, efficient graph-theoretic algorithms utilizing the special structure of a deterministic measurement set can be designed (Ma et al., 2018). Existing works on the deterministic measurement set case have focused on the completability problem and the convex relaxation approach, while the B-M factorization approach has not been analyzed. Moreover, several existing works have also provided negative results on the low-rank matrix optimization problem. The counterexamples in Candès & Tao (2010); Bhojanapalli & Jain (2014) have non-unique global solutions, which makes the recovery of the ground truth matrix impossible. The counterexamples in Waldspurger & Waters (2020) have a unique global solution, but the objective function must be a linear function. We refer the reader to Chi et al. (2019) for a review of the low-rank matrix optimization problem. Our work is the first one in the literature that studies the optimization complexity in the case when the information-theoretic complexity is low.

#### Notations.

The set $[n]$ represents the set of integers from $1$ to $n$. We use lowercase bold letters to represent vectors and capital bold letters to represent matrices. $\|X\|$ and $\|X\|_F$ are the $\ell_2$-norm and the Frobenius norm of the matrix $X$, respectively. Let $\langle A,B\rangle$ be the inner product between matrices $A$ and $B$. The notations $X\succeq 0$ and $X\succ 0$ mean that the matrix $X$ is PSD and positive definite, respectively. The set of $n\times n$ PSD matrices is denoted as $\mathbb{S}^n_+$. For a function $f(X)$, we denote the gradient and the Hessian as $\nabla f(X)$ and $\nabla^2 f(X)$, respectively. The Hessian is a four-dimensional tensor with $[\nabla^2 f(X)]_{ij,k\ell} = \partial^2 f(X)/(\partial X_{ij}\,\partial X_{k\ell})$ for all $i,k\in[n]$ and $j,\ell\in[r]$. The quadratic form of the Hessian in the direction $\Delta$ is defined as $\Delta:\nabla^2 f(X):\Delta$. We use $\lceil\cdot\rceil$ and $\lfloor\cdot\rfloor$ to denote the ceiling and flooring functions, respectively. The cardinality of a set $S$ is shown as $|S|$.

## 2 Exponential Number of Spurious Local Minima

In this section, we show that MC problem instances with a low information-theoretic complexity may have exponentially many spurious local minima if the B-M factorization is employed. We first construct a class of MC problem instances with a low information-theoretic complexity and then identify the problematic instances.

### 2.1 Low-complexity Class of MC Problems

Suppose that $n$ and $r$ are two given integers with $r\le n$. We construct a class of MC problem instances whose ground truth matrix is rank-$r$. For every instance in this class, the global solution is unique and can be found in time polynomial in $n$ and $r$. Let $l:=\lfloor n/r\rfloor$. We divide the first $lr$ rows and the first $lr$ columns of the matrix into $l\times l$ block matrices, where each block has dimension $r\times r$. For every $i,j\in[l]$, we denote the block matrix at position $(i,j)$ as $M_{ij}$. We now define the block measurement patterns induced by a given graph.

###### Definition 1 (Induced measurement set).

Let $\mathcal{G}:=(V,E_1,E_2)$ denote a pair of undirected graphs $(V,E_1)$ and $(V,E_2)$ with the node set $V:=[l]$ and the disjoint edge sets $E_1$ and $E_2$, respectively. The induced measurement set $\Omega(\mathcal{G})$ is defined as follows: if $(i,j)\in E_1$, then the entire block $M_{ij}$ is observed; if $(i,j)\in E_2$, then all nondiagonal entries of the block $M_{ij}$ are observed; otherwise, none of the entries of the block $M_{ij}$ is observed. In addition, the last $n-lr$ rows and the last $n-lr$ columns of the matrix are fully observed. We refer to the graph $\mathcal{G}$ as the block sparsity graph.

The following definition introduces a low-complexity class of MC problem instances.

###### Definition 2 (Low-complexity class of MC problems).

Define the low-complexity class of MC problems as the set of instances with the following properties:

1. The ground truth $M^*$ is rank-$r$.

2. The block matrix $M^*_{ij}$ is rank-$r$ for all $i,j\in[l]$.

3. The measurement set $\Omega$ is induced by $\mathcal{G}$, where the subgraph $(V,E_1)$ is connected and non-bipartite.

The next proposition states that every MC problem instance in this low-complexity class is polynomial-time solvable.

###### Proposition 1.

For an arbitrary instance in the low-complexity class, the ground truth $M^*$ is the unique solution of the problem and can be found in polynomial time.
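The rank-1 case gives a feel for why such instances are easy in the information-theoretic sense. The sketch below is my illustration of the graph-theoretic idea, not the paper's algorithm: it completes a rank-1 PSD matrix from observations along a graph with self-loops. The diagonal gives $|x_i|=\sqrt{M_{ii}}$, and each observed off-diagonal entry fixes the relative sign of its endpoints, so BFS propagation recovers $x$ up to a global sign flip (assuming all entries of $x$ are nonzero and the graph is connected).

```python
# Rank-1 completion by sign propagation over the observation graph.
# A hedged sketch of the idea behind Proposition 1, assuming nonzero entries.
from collections import deque
import math

def complete_rank1(n, obs):
    """obs: dict {(i, j): value} with (i, i) observed for every i.
    Returns x with x x^T matching obs, unique up to a global sign flip."""
    mag = [math.sqrt(obs[(i, i)]) for i in range(n)]   # |x_i| from the diagonal
    sign = [0] * n
    sign[0] = 1
    adj = {i: [] for i in range(n)}
    for (i, j) in obs:
        if i != j:
            adj[i].append(j)
    queue = deque([0])
    while queue:                                       # BFS sign propagation
        i = queue.popleft()
        for j in adj[i]:
            if sign[j] == 0:
                sign[j] = sign[i] * (1 if obs[(i, j)] > 0 else -1)
                queue.append(j)
    return [s * m for s, m in zip(sign, mag)]

x_true = [1.0, -2.0, 3.0, -1.5]
obs = {(i, j): x_true[i] * x_true[j]
       for i in range(4) for j in range(4) if abs(i - j) <= 1}
x_hat = complete_rank1(4, obs)
print(x_hat)  # matches x_true up to a global sign flip
```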

### 2.2 Intuition for Rank-1 Case with ℓ2-loss Function

We start with the case when the rank $r$ is equal to $1$ and the loss function is the $\ell_2$-loss. We study two instances in the low-complexity class, with $\Theta(n)$ and $\Theta(n^2)$ observations, respectively. The B-M formulation (2) of both instances contains exponentially many spurious local minima. Since the decomposition variable is a column vector in the rank-$1$ case, we write it as $x\in\mathbb{R}^n$.

###### Example 1.

We first provide an instance with $\Theta(n)$ observations. Note that the number of blocks, namely $l$, is equal to $n$ in the rank-$1$ case. Let the graph $\mathcal{G}$ be chosen as $V:=[n]$ and

$$E_1:=\{(i,j)\ \mid\ i,j\in[n],\ |i-j|\le 1\},\qquad E_2:=\emptyset.$$

The measurement set is the induced set $\Omega(\mathcal{G})$. Namely, we observe the diagonal, sub-diagonal and super-diagonal entries of the ground truth matrix. One can verify that the subgraph $(V,E_1)$ is connected and non-bipartite. Now, we construct a specific ground truth matrix. We define the vector $x^*\in\mathbb{R}^n$ by

$$x^*_{2k-1}:=1\ \ \forall k\in[\lceil n/2\rceil],\qquad x^*_{2k}:=0\ \ \forall k\in[\lfloor n/2\rfloor],$$

and let $M^*:=x^*(x^*)^T$. For the B-M factorization formulation (2), the set of global minima is given by

$$\mathcal{X}^*:=\big\{x\in\mathbb{R}^n\ \big|\ x_{2k-1}^2=1\ \ \forall k\in[\lceil n/2\rceil],\ \ x_{2k}=0\ \ \forall k\in[\lfloor n/2\rfloor]\big\},$$

which has cardinality $2^{\lceil n/2\rceil}$. For every global solution $\hat{x}\in\mathcal{X}^*$ and every direction $\Delta\in\mathbb{R}^n\setminus\{0\}$, the Hessian satisfies

$$\Delta:\nabla^2 f(\hat{x}):\Delta \,=\, 2\big\|(\hat{x}\Delta^T+\Delta\hat{x}^T)_\Omega\big\|_F^2 \,=\, 8\|\Delta\|^2-\mathbb{1}[n\text{ is even}]\cdot 4\Delta_n^2\,>\,0,$$

where $\mathbb{1}[\cdot]$ is the indicator function. Therefore, the Hessian is positive definite at every global minimum. Then, we perturb the ground truth solution to

$$M^*(\epsilon):=x^*(\epsilon)\,[x^*(\epsilon)]^T=(x^*+\epsilon)(x^*+\epsilon)^T,$$

where $x^*(\epsilon):=x^*+\epsilon$ and $\epsilon\in\mathbb{R}^n$ is a small perturbation. We denote the associated problem (2) as

$$\min_{x\in\mathbb{R}^n}\ \tilde{f}(x;\epsilon), \qquad (3)$$

where $\tilde{f}(x;\epsilon):=\big\|\big(xx^T-M^*(\epsilon)\big)_\Omega\big\|_F^2$. For a generic perturbation $\epsilon$, all components of $x^*(\epsilon)$ are nonzero and the problem belongs to the low-complexity class. This implies that the global solution of problem (3) is unique up to a sign flip.

We analyze the relation between the local minima of the original problem and those of the perturbed problem. Consider the gradient equation $\nabla_x\tilde{f}(x;\epsilon)=0$ near an unperturbed global minimum $\hat{x}\in\mathcal{X}^*$. Since $\hat{x}$ is a solution to the gradient equation at $\epsilon=0$ and the Jacobian matrix with respect to $x$ is equal to the positive definite Hessian $\nabla^2 f(\hat{x})$, the Implicit Function Theorem (IFT) states that there exists a unique solution $\hat{x}(\epsilon)$ in a neighbourhood of $\hat{x}$ for all $\epsilon$ with a sufficiently small norm. In addition, the continuity of the Hessian implies that $\nabla^2\tilde{f}(\hat{x}(\epsilon);\epsilon)\succ 0$. Thus, $\hat{x}(\epsilon)$ is a local minimum of the perturbed problem (3). As a result, we have proved the existence of a local minimum of the perturbed problem corresponding to each of the $2^{\lceil n/2\rceil}$ global minima of the unperturbed problem. Hence, the problem (3) has at least $2^{\lceil n/2\rceil}$ local minima, while only two of them are global minima. In summary, we have constructed an instance in the low-complexity class that has exponentially many spurious local solutions.
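The IFT argument of Example 1 can be checked numerically. In the sketch below (the size, perturbation, and step size are my choices), gradient descent is started from each of the $2^3=8$ unperturbed global minimizers of a perturbed $n=5$ instance with the tridiagonal measurement set; only the two matching sign patterns reach a near-zero objective, and the other six settle at spurious local minima.

```python
# Example 1, numerically: tridiagonal observations, perturbed rank-1 ground truth.
# Descending from each unperturbed global minimizer lands at a nearby stationary
# point; only the sign patterns of +/- x*(eps) attain a (near-)zero objective.
import itertools
import numpy as np

n, eps = 5, 0.05
mask = np.array([[float(abs(i - j) <= 1) for j in range(n)] for i in range(n)])
x_star = np.array([1.0, 0.0, 1.0, 0.0, 1.0])                  # ones at odd positions
x_eps = x_star + eps * np.array([0.7, 1.3, -0.9, 1.1, -0.8])  # generic perturbation
M = np.outer(x_eps, x_eps)

def f(x):
    return np.sum(((np.outer(x, x) - M) * mask) ** 2)

def descend(x, lr=0.01, steps=4000):
    for _ in range(steps):
        x = x - lr * 4 * ((np.outer(x, x) - M) * mask) @ x
    return x

vals = []
for signs in itertools.product([1.0, -1.0], repeat=3):        # 2^3 unperturbed minima
    x0 = x_star.copy()
    x0[[0, 2, 4]] = signs
    vals.append(f(descend(x0)))
print(sorted(round(v, 8) for v in vals))  # two near-zero values, six clearly positive
```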

###### Example 2.

Next, we construct an MC problem instance with exponentially many spurious local minima and $\Theta(n^2)$ observations. We choose the same ground truth matrix as in the last example, but assume that the measurement set is induced by the graph $\mathcal{G}$ with $V:=[n]$, $E_2:=\emptyset$ and

$$E_1:=\{(i,i),\,(i,2k),\,(2k,i)\ \mid\ \forall i\in[n],\ k\in[\lfloor n/2\rfloor]\}.$$

Since the subgraph $(V,E_1)$ is connected and non-bipartite, the perturbed problem belongs to the low-complexity class. Moreover, one can verify that the set of global minima of this problem is still $\mathcal{X}^*$ and that the Hessian at every global solution is positive definite. By the same argument, the IFT implies that problem (3) has at least $2^{\lceil n/2\rceil}-2$ spurious local minima for a generic and small perturbation $\epsilon$.

Note that the instances analyzed in this section, as well as those in the remainder of this paper, satisfy the incoherence condition (Candès & Recht, 2009) with a constant parameter. The results in this subsection will be formalized within a unified framework next.

### 2.3 Rank-1 Case with General Measurement Sets

In this subsection, we estimate the largest lower bound on the number of spurious local minima for a given matrix size $n$ in the rank-$1$ case. We address the problem by first finding a lower bound on the number of spurious local minima for a general measurement set $\Omega$, and then maximizing the lower bound over $\Omega$. The following theorem utilizes the topology of the block sparsity graph $\mathcal{G}$ to quantify a lower bound on the number of spurious solutions for the measurement set $\Omega(\mathcal{G})$.

###### Theorem 1.

Let $\mathcal{G}$ be such that the subgraph $(V,E_1)$ is connected and non-bipartite with $n$ vertices. Assume that there exists a maximal independent set $S$ of $(V,E_1)$ such that every vertex in the set has a self-loop. (For a graph, a set of nodes is called an independent set if no two nodes in it are adjacent; an independent set is maximal if it is not a strict subset of another independent set.) Then, there exists an instance in the low-complexity class for which the problem (2) has at least $2^{|S|}-2$ spurious local minima.

In both Examples 1 and 2, a maximal independent set is $S=\{1,3,5,\dots\}$, i.e., the set of odd numbers in $[n]$. Hence, Theorem 1 implies that there are at least $2^{\lceil n/2\rceil}-2$ spurious local minima, which is consistent with our analysis. Since a maximal independent set of a connected and non-bipartite graph can have up to $n-1$ vertices, the number of spurious local minima can be as large as $2^{n-1}-2$.
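This count is easy to explore computationally. The helper below is my own illustration (not the paper's code): it greedily grows a maximal independent set $S$ of the observation graph and reports the lower bound $2^{|S|}-2$, which for the path-plus-self-loops graph of Example 1 matches the $2^{\lceil n/2\rceil}$ local minima found above, two of which are global.

```python
# Greedy maximal independent set of the block sparsity graph, used to evaluate
# the 2^|S| - 2 lower bound on spurious local minima (illustrative helper).
def maximal_independent_set(n, edges):
    """Greedy maximal independent set of an undirected graph on {0, ..., n-1}."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        if i != j:
            adj[i].add(j)
            adj[j].add(i)
    S = set()
    for v in range(n):                    # scan nodes; add if no neighbor in S
        if all(u not in S for u in adj[v]):
            S.add(v)
    return S

n = 7                                     # Example 1: path edges plus self-loops
edges = [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= 1]
S = maximal_independent_set(n, edges)
print(S, 2 ** len(S) - 2)                 # {0, 2, 4, 6} -> 2^4 - 2 = 14
```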

###### Corollary 1.

There exist a graph $\mathcal{G}$ and an instance in the low-complexity class such that problem (2) has $2^{n-1}-2$ spurious solutions. In addition, there exist a graph $\mathcal{G}$ and an instance in which almost all entries of the matrix are observed such that the problem (2) has spurious solutions.

Corollary 1 implies that the B-M factorization may not be an efficient approach to the MC problem, since it can admit a spurious solution even in the highly ideal case when almost all entries of the matrix are measured. More generally, the proof of Theorem 1 implies that, as a necessary condition for not having a spurious solution in formulation (2), the elements of $x^*$ associated with the nodes outside of the maximal independent set $S$ should not be much smaller than those associated with the nodes in $S$.

###### Corollary 2.

Under the setting of Theorem 1, there exists a function $h_S(\cdot)$ such that the constructed instance has at least $2^{|S|}-2$ spurious local minima in formulation (2) for every generic $x^*$ satisfying

$$\|x^*_{S^c}\|\ \le\ h_S\Big(\min_{i\in S}|x^*_i|\Big)\cdot\|x^*_S\|,$$

where $S^c:=[n]\setminus S$ and $x^*_S$ denotes the subvector of $x^*$ with indices in $S$.

Because the maximal independent set of a graph is not necessarily unique, the set of functions $h_S(\cdot)$ over all maximal independent sets designates a necessary condition for the nonexistence of spurious local minima given a measurement set $\Omega$.

### 2.4 Extension to General Rank-r Case

We generalize the results to the case when the ground truth matrix has a general rank. Eisenberg-Nagy et al. (2013) showed that the rank-$r$ MC problem is NP-hard in the worst case. However, we focus on instances in the low-complexity class and show that there are instances in this class whose B-M factorization formulation (2) has a highly undesirable landscape. We cannot simply extend the proof of the rank-$1$ case to the rank-$r$ case, since for $r\ge 2$ there exist an infinite number of matrices $\tilde{X}$ such that $\tilde{X}\tilde{X}^T=XX^T$. The global optimality of a solution is not lost under any orthogonal transformation $X\mapsto XR$. This implies that the Hessian at the global solutions of problem (2) cannot be positive definite, which precludes the applicability of the IFT. Therefore, we consider the quotient manifold $\mathbb{R}^{n\times r}/\mathcal{O}_r$, where $\mathcal{O}_r$ is the Lie group of $r\times r$ orthogonal matrices. To simplify the analysis, we instead consider the following lower-diagonal subspace

$$\mathcal{W}^{n\times r}:=\big\{X\in\mathbb{R}^{n\times r}\ \big|\ X_{ij}=0,\ \ \forall i\in[n],\ j\in[r]\ \text{ s.t. } i<j\big\}.$$

We define an embedding of the quotient manifold into $\mathcal{W}^{n\times r}$ and compose it with the quotient map.

###### Definition 3 (Restriction map).

Given a matrix $X\in\mathbb{R}^{n\times r}$, we define the embedding $\varphi(X):=W$, where $X=WQ$ is the RQ decomposition with $Q\in\mathcal{O}_r$ being an orthogonal matrix and $W\in\mathcal{W}^{n\times r}$ having non-negative diagonal elements. The restriction map is defined as the map induced by $\varphi$ on the quotient manifold.

When the RQ decomposition is not unique, we choose an arbitrary decomposition for the embedding $\varphi$. However, the properties of the RQ decomposition ensure that $\varphi$ is a bijection in a small neighborhood of each matrix whose first $r$ rows are linearly independent. Consider the restricted version of problem (2):

$$\min_{X\in\mathcal{W}^{n\times r}}\ f(X). \qquad (4)$$

Results in Section 2.3 can be extended to the problem (4) and then translated back to the problem (2).
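The restriction map can be sketched in numpy (my implementation of the idea in Definition 3; the LQ/RQ-style factorization is obtained from numpy's QR applied to the transpose of the first $r$ rows):

```python
# Rotate X by an orthogonal Q in O(r) so that its first r rows become lower
# triangular with nonnegative diagonal: a sketch of the map in Definition 3.
import numpy as np

def restrict(X):
    """Return W, an equivalent factor of X in the lower-diagonal subspace."""
    r = X.shape[1]
    Qt, Rt = np.linalg.qr(X[:r, :].T)        # X[:r].T = Qt Rt, Rt upper triangular
    W = X @ Qt                               # first r rows become Rt.T (lower tri.)
    signs = np.sign(np.diag(W[:r, :]))
    signs[signs == 0] = 1.0
    return W * signs                         # column flips: nonnegative diagonal

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
W = restrict(X)
print(np.allclose(W @ W.T, X @ X.T))                     # True: same matrix W W^T
print(abs(W[0, 1]) < 1e-10, W[0, 0] >= 0, W[1, 1] >= 0)  # True True True
```

Since $W$ differs from $X$ only by an orthogonal transformation and column sign flips, it represents the same point of the quotient manifold.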

###### Lemma 1.

Consider a graph $\mathcal{G}$ with $n$ vertices for which the subgraph $(V,E_1)$ is connected and non-bipartite. Assume that there exists a maximal independent set $S$ of $(V,E_1)$ whose vertices each have a self-loop. If the corresponding induced subgraph (see Harary (2018) for the definition) is connected, then there exists an instance in the low-complexity class for which the problem (4) has at least $2^{|S|}-2$ spurious local minima. In addition, the first $r$ rows of each local minimum are linearly independent.

The Hessian at each local minimum of the unperturbed problem is positive definite along the tangent space of $\mathcal{W}^{n\times r}$, where the off-diagonal observations induced by $E_2$ play a key role. If the first $r$ rows of a local minimum are linearly independent, the diagonal elements of its embedding are nonzero. Therefore, by flipping the signs of columns, we can find an equivalent local minimum with positive diagonal elements, i.e., one that lies in the range of the restriction map. By symmetry, each group of column-sign-equivalent local minima contains exactly one such representative. Since the restriction map is a bijection in a neighborhood of this point, its equivalence class is a local minimum of problem (2) on the quotient manifold, and thus it corresponds to local minima of problem (2). The above argument leads to Theorem 2.

###### Theorem 2.

Consider a graph $\mathcal{G}$ satisfying the conditions of Lemma 1. There exists an instance in the low-complexity class for which problem (2) has at least $2^{|S|}-2$ equivalence classes of spurious solutions.

Finally, we give an estimate on the largest lower bound for the number of spurious local minima.

###### Corollary 3.

There exists an instance in the low-complexity class for which the problem (2) has exponentially many equivalence classes of spurious solutions. In addition, there exists an instance in which almost all entries of the matrix are observed for which the problem (2) has spurious solutions.

### 2.5 General Loss Functions

In this part, we generalize the preceding results to the problem (1). To extend the constructions to a general loss function $g$, we require a few weak assumptions on it.

###### Assumption 1.

The following conditions hold for the function $g$:

1. $g$ is twice continuously differentiable;

2. the zero matrix is the unique minimizer of $g$;

3. the Hessian of $g$ at the zero matrix is positive definite.

Now, we can extend the results in Section 2.4 to the general loss function case under the above assumption.

###### Theorem 3.

Consider a graph $\mathcal{G}$ satisfying the conditions of Lemma 1 and suppose that Assumption 1 holds. There exists an instance in the low-complexity class for which the problem (1) has at least $2^{|S|}-2$ equivalence classes of spurious local minima.

We note that the $\ell_2$-loss function satisfies the conditions in Assumption 1. As another example, regularizers are ubiquitously used in the low-rank matrix optimization literature (Ge et al., 2016, 2017; Fattahi & Sojoudi, 2020). As a corollary of Theorem 3, the regularized version of the problem (2) also suffers from the same issue. In this case, the loss function is equal to

$$g(X):=\big\|(XX^T-M^*)_\Omega\big\|_F^2+Q(X),$$

where $Q(X)$ is the regularizer from Ge et al. (2016) and its parameters are constants. Since the regularizer does not change the landscape around the global solutions when its threshold parameter is large, Assumption 1 is satisfied and Theorem 3 is applicable to the regularized problem.

## 3 More Observations Lead to Spurious Local Minima

In Section 2, we showed that the B-M factorization formulation (2) can have an exponential number of spurious local minima on low-complexity MC problem instances. In this section, we exhibit another unfavorable behaviour of the B-M factorization. We identify an MC problem instance in the low-complexity class with a measurement pattern that has no spurious solution, while adding observations to this pattern leads to spurious solutions. Let $m$ and $r$ be natural numbers. We define $n:=mr$ and let the graph be $\mathcal{G}_k:=(V,E_k,\emptyset)$, where $k\in[m]$ is an arbitrary index and

$$V:=[m],\qquad E_k:=\{(k,j),\,(j,k)\ \mid\ \forall j\in[m]\}.$$

In the measurement set $\Omega(\mathcal{G}_k)$, we observe the blocks $M_{kj}$ and $M_{jk}$ for all $j\in[m]$; see Figure 1. The set $\Omega(\mathcal{G}_k)$ contains only full block observations induced by $E_k$. The next proposition states that if the ground truth matrix is generic, every SOCP (second-order critical point, i.e., a point that satisfies the first-order and the second-order necessary optimality conditions) of the problem (2) is a global minimum.

###### Proposition 2.

Given an index $k\in[m]$, let the measurement set be equal to $\Omega(\mathcal{G}_k)$. Assume that the block $M^*_{kj}$ of the ground truth matrix has rank $r$ for all $j\in[m]$. Then, every SOCP of problem (2) is a global minimum.

Next, we construct a graph $\tilde{\mathcal{G}}_k:=(V,\tilde{E}_k,E_2)$, where

$$\tilde{E}_k:=E_k\cup\{(i,i)\ \mid\ \forall i\in[m]\},\qquad E_2:=\{(i,j)\ \mid\ \forall i,j\in[m],\ i\neq j\}.$$

Namely, we have included all self-loops and the nondiagonal observations of each block in the new graph. A maximal independent set for the subgraph $(V,\tilde{E}_k)$ is $S=[m]\setminus\{k\}$. We define a new measurement set $\tilde{\Omega}:=\Omega(\tilde{\mathcal{G}}_k)$. Since $\tilde{E}_k$ is a superset of $E_k$, the measurement set $\tilde{\Omega}$ is larger than $\Omega(\mathcal{G}_k)$. Using Theorem 2, we obtain the following result.

###### Corollary 4.

Every instance of the MC problem with the measurement set $\tilde{\Omega}$ and full-rank blocks $M^*_{kj}$ for all $j\in[m]$ belongs to the low-complexity class. The formulation (2) of such an instance has at least $2^{m-1}-2$ equivalence classes of spurious local minima, while all spurious solutions disappear when using the smaller measurement set $\Omega(\mathcal{G}_k)$.

Proposition 2 and Corollary 4 together show that the landscape of the problem (2) can deteriorate when the number of observations is increased. This phenomenon further reveals the unfavorable behavior of the B-M factorization on low-rank matrix optimization problems, even if the information-theoretic complexity is low.

## 4 Measure of Complexity for Factorization Approach

In Section 2, we showed that if there is an MC problem with a non-unique completion, a slightly perturbed problem can have exponentially many spurious local minima in the B-M factorization formulation (1). Hence, bifurcation behaviors appear around measurement matrices that are associated with multiple global solutions. For a given measurement operator $\mathcal{A}$, a measurement matrix that allows multiple global solutions designates an unacceptable region in the space of ground truth solutions. Based on this intuition, we define a metric to capture the extent of the bifurcation behavior. We define the set of measurements that allow non-unique solutions:

$$\mathcal{T}_\mathcal{A}:=\big\{\mathcal{A}(M)\ \big|\ \exists X_1,X_2\ \text{s.t.}\ X_1X_1^T\neq X_2X_2^T,\ \ \mathcal{A}(M)=\mathcal{A}(X_1X_1^T)=\mathcal{A}(X_2X_2^T)\big\}.$$

Then, we define a complexity metric below.

###### Definition 4 (Complexity metric).

The complexity metric for the operator $\mathcal{A}$ and the ground truth $M^*$ is defined as

$$\operatorname{dist}\big(\mathcal{A}(M^*),\mathcal{T}_\mathcal{A}\big):=\min_{\mathcal{A}(M)\in\mathcal{T}_\mathcal{A}}\ \big\|\mathcal{A}(M^*)-\mathcal{A}(M)\big\|_F.$$

It is expected that for instances with a large complexity metric, the optimization complexity of algorithms based on the B-M factorization approach will be aligned with the corresponding information-theoretic complexity. For example, if the RIP condition is satisfied, the set $\mathcal{T}_\mathcal{A}$ is empty and therefore the complexity metric is always $+\infty$. Conversely, the instances studied in Section 2 with spurious solutions all have small complexity metrics. Consequently, the complexity metric is a possible measure of the optimization complexity for the MC problem with the B-M factorization: the optimization problem should be more difficult if the complexity metric is smaller.

## 5 Gradient-Based Methods Fail With High Probability

We show that the exponential number of spurious local minima in the preceding instances makes most randomly initialized gradient-based methods fail with high probability. Note that the existence of spurious local minima does not by itself imply the failure of gradient-based methods; see Ma et al. (2018); Chen et al. (2020). The analysis in this section is based on the gradient flow

$$\dot{X}(t)=-\nabla_X f(X(t)),\qquad X(0)=X_0. \qquad (5)$$

It is known that the trajectories of gradient-based methods with a sufficiently small step size are close to those of the gradient flow. We can view a variety of gradient-based methods as ordinary differential equation (ODE) solvers applied to the gradient flow (5). Then, the convergence of the discrete trajectories to the continuous one as the step size goes to $0$ is guaranteed by the consistency and the stability of the ODE solver. Scieur et al. (2017) proved that the consistency and stability conditions are satisfied by several commonly used gradient-based methods, such as the gradient descent, the proximal point and the accelerated gradient descent methods. Although Scieur et al. (2017) considered minimizing a strongly convex function, the consistency and stability conditions only depend on the ODE solver and the Lipschitz continuity of the underlying gradient flow. We need the following assumption on the loss function to characterize the global landscape.
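As a quick illustration of this ODE-solver correspondence (a toy strongly convex function of my choosing, not from the paper), gradient descent is exactly forward Euler applied to the gradient flow, and its trajectory tracks the continuous flow for a small step size:

```python
# Gradient descent as forward Euler on x'(t) = -grad f(x(t)),
# for f(x) = ||x||^2 / 2, whose flow has the closed form x(t) = x0 * exp(-t).
import numpy as np

def grad(x):                                # grad f(x) = x for f = ||x||^2 / 2
    return x

def euler(x0, h, steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - h * grad(x)                 # one gradient-descent / Euler step
    return x

x0 = np.array([1.0, -2.0])
T = 1.0
exact = x0 * np.exp(-T)                     # continuous gradient-flow solution
approx = euler(x0, h=0.001, steps=1000)     # discrete trajectory up to time T
print(np.linalg.norm(approx - exact))       # O(h) discretization error
```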

###### Assumption 2.

The loss function $g$ satisfies the sparse $\delta$-RIP condition in the Frobenius norm for some constant $\delta\in[0,1)$ and a given rank bound. Namely, the inequality

$$(1-\delta)\,\|N_\Omega\|_F^2\ \le\ N:\nabla^2 g(M_\Omega):N\ \le\ (1+\delta)\,\|N_\Omega\|_F^2$$

holds for all matrices $M$ and $N$ of rank at most the given bound.

The sparse RIP condition is remarkably different from the conservative RIP condition. For example, the $\ell_2$-loss function satisfies the sparse RIP condition with $\delta=0$ for every measurement set, while the RIP condition does not hold if the block sparsity graph is not a complete graph. Under the above assumption, we show that for the unperturbed example constructed in Section 2, the gradient flow converges to each global minimum with equal probability in the rank-$1$ case. The main difficulty is to show that all saddle points of the objective function are strict and that, therefore, their regions of attraction (ROAs) have measure zero (Lee et al., 2016).

###### Lemma 2.

Suppose that Assumption 2 holds with the rank bound equal to $n$, where $n$ is the size of the ground truth matrix. There exists an MC problem instance such that the following statements hold for the problem (1):

- there are $2^{\lceil n/2\rceil}$ equivalent global minima;
- if the gradient flow (5) is initialized with an absolutely continuous radial probability distribution, it converges to each global minimum with the equal probability $2^{-\lceil n/2\rceil}$,

where a probability distribution is called radial if its density function at a point $x$ only depends on $\|x\|$, and the absolute continuity is with respect to the Lebesgue measure.

Examples of absolutely continuous radial probability distributions include zero-mean isotropic Gaussian distributions and uniform distributions over a ball centered at the origin. Note that the $\ell_2$-loss function satisfies the assumption of Lemma 2. Next, we show that under a sufficiently small perturbation of the previous instance of the problem, the ROA of each local minimum does not shrink significantly. Therefore, the gradient flow converges to each global minimum or spurious solution with approximately the same probability.

###### Theorem 4.

Under the setting of Lemma 2, consider an absolutely continuous radial probability distribution. There exists an instance in for which the problem (1) satisfies the following properties:


• the global minima are unique up to a sign flip;

• if the gradient flow (5) is initialized with the given distribution, it converges to a global minimum associated with the ground truth solution with probability at most .

The results of Theorem 4 imply that, in the rank- case, most gradient-based methods with a sufficiently small step size and a suitable random initialization converge to a spurious solution with overwhelming probability. The proof extends to the general-rank case if it can be shown that there are no degenerate saddle points for the above-mentioned unperturbed instances of the problem.

###### Remark.

We remark that the trajectories of stochastic gradient descent (SGD) methods cannot be approximated by those of the gradient flow. Hence, our analysis does not automatically imply the failure of SGD. However, the proof of Lemma 2 can be adapted to conclude that SGD methods with a random initialization converge to each global minimum with equal probability in the unperturbed case. Since the trajectories of SGD methods do not vary dramatically under a sufficiently small perturbation, they still converge to each solution with approximately the same probability. Therefore, we also expect SGD methods to fail with high probability.

## 6 Experiments

Numerical results are presented to illustrate the failure of the gradient descent algorithm. Each MC problem in the B-M factorization formulation (2) is solved by the gradient descent algorithm with a constant step size, where the step size is chosen small enough to guarantee that the algorithm converges to a stationary point. Regarding the measurement set, the graph is generated randomly by the Erdős–Rényi model , where and each edge of the graph is included independently with probability . If is not connected or a node in the maximal independent set does not have a self-loop, the missing edges are added to satisfy these conditions. In addition, a connected subtree is generated for the nondiagonal observations. We define and subsequently the measurement set . In addition, the unperturbed ground truth matrix is defined as for all , and otherwise. Lastly, a Gaussian random perturbation matrix is generated and normalized, e.g. . Then, the perturbed ground truth matrix is generated, where is the perturbation size. We evaluate the success rate of the algorithm at equally distributed values of with random initializations of the gradient descent algorithm for each instance and each .
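A reduced version of this experiment can be sketched as follows. The instance below is a hypothetical hand-built rank-1 stand-in (independent set S = {0, 1} with self-loops, vertices 2 and 3 adjacent to all of S, perturbation size 0.1), not the paper's Erdős–Rényi generator, and the step size and iteration counts are ad hoc choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance in the spirit of the experiments: S = {0, 1}
# carries self-loops, and vertices 2, 3 are connected to every vertex of S.
n, S = 4, [0, 1]
W = np.zeros((n, n))
for i in S:
    W[i, i] = 1.0
    for j in range(n):
        if j not in S:
            W[i, j] = W[j, i] = 1.0

x_star = np.array([1.0, 1.0, 0.0, 0.0])   # unperturbed ground truth factor
p = rng.normal(size=n)
p /= np.linalg.norm(p)                    # normalized perturbation direction
eps = 0.1                                 # perturbation size
x_hat = x_star + eps * p                  # perturbed ground truth factor
M_hat = np.outer(x_hat, x_hat)

def grad(x):
    # Rank-one specialization of the gradient (6); the overall constant
    # does not affect the set of stationary points.
    return 2.0 * (W * (np.outer(x, x) - M_hat)) @ x

trials, lr, iters = 20, 0.005, 15000
success = 0
for _ in range(trials):
    x = rng.normal(scale=0.3, size=n)     # absolutely continuous radial init
    for _ in range(iters):
        x -= lr * grad(x)
    # success = convergence to the ground truth factor up to a sign flip
    if min(np.linalg.norm(x - x_hat), np.linalg.norm(x + x_hat)) < 0.05:
        success += 1
print(f"success rate: {success / trials:.2f}")
```

Runs that fail here settle near one of the other sign patterns of the unperturbed solution set, which is exactly the spurious-minimum behavior that Theorem 4 predicts for small perturbations.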

The top figure in Figure 2 illustrates the success rate of the gradient descent algorithm with the rank , dimension and various maximum independent set sizes . The results conform to Theorem 4, which implies that the success rate is less than when the perturbation size is small. The bottom figure in Figure 2 shows similar observations for different ranks and maximum independent set sizes when is equal to . These observations suggest that Theorem 4 can be extended to the general-rank case, since the success rate is less than when is small. We note that the behavior of the algorithm may change when is large. Specifically, significant improvements in the success rate are observed when for most problem instances. This is in accordance with our notion of complexity metric.

## 7 Conclusion

In this paper, we provided a negative answer to the question of whether the B-M factorization approach can capture the benign properties of low-complexity MC problem instances. More specifically, we defined a class of MC problem instances that could be solved in polynomial time. We showed that there exist MC problem instances in this class that have exponentially many spurious local minima in the B-M factorization formulation (1). The results hold for a general class of loss functions, including the commonly used regularized formulation. In addition, for the rank- case, we proved that gradient-based methods fail with high probability for such instances. Numerical results verify that similar behaviors also hold for higher rank cases. These results imply that the optimization complexity of methods based on the factorized problem (1) is not aligned with the information-theoretic complexity of the MC problem. Furthermore, we derived a complexity metric that potentially captures the complexity of the B-M factorization formulation (1).

## References

• Bhojanapalli & Jain (2014) Bhojanapalli, S. and Jain, P. Universal matrix completion. In International Conference on Machine Learning, pp. 1881–1889. PMLR, 2014.
• Bhojanapalli et al. (2016) Bhojanapalli, S., Kyrillidis, A., and Sanghavi, S. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pp. 530–582. PMLR, 2016.
• Bi & Lavaei (2021) Bi, Y. and Lavaei, J. On the absence of spurious local minima in nonlinear low-rank matrix recovery problems. In International Conference on Artificial Intelligence and Statistics, pp. 379–387. PMLR, 2021.
• Bi et al. (2021) Bi, Y., Zhang, H., and Lavaei, J. Local and global linear convergence of general low-rank matrix recovery problems. arXiv preprint arXiv:2104.13348, 2021.
• Burer & Monteiro (2003) Burer, S. and Monteiro, R. D. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
• Candès & Recht (2009) Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
• Candès & Tao (2010) Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
• Chen et al. (2020) Chen, J., Liu, D., and Li, X. Nonconvex rectangular matrix completion via gradient descent without regularization. IEEE Transactions on Information Theory, 66(9):5806–5841, 2020.
• Chi et al. (2019) Chi, Y., Lu, Y. M., and Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.
• Davenport et al. (2014) Davenport, M. A., Plan, Y., Van Den Berg, E., and Wootters, M. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
• Eisenberg-Nagy et al. (2013) Eisenberg-Nagy, M., Laurent, M., and Varvitsiotis, A. Complexity of the positive semidefinite matrix completion problem with a rank constraint. Springer, Fields Institute Communications, vol. 69, 2013.
• Fattahi & Sojoudi (2020) Fattahi, S. and Sojoudi, S. Exact guarantees on the absence of spurious local minima for non-negative rank-1 robust principal component analysis. Journal of Machine Learning Research, 2020.
• Ge et al. (2016) Ge, R., Lee, J. D., and Ma, T. Matrix completion has no spurious local minimum. Advances in Neural Information Processing Systems, pp. 2981–2989, 2016.
• Ge et al. (2017) Ge, R., Jin, C., and Zheng, Y. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pp. 1233–1242. PMLR, 2017.
• Harary (2018) Harary, F. Graph theory. CRC Press, 2018.
• Khalil (2002) Khalil, H. K. Nonlinear systems; 3rd ed. Prentice-Hall, Upper Saddle River, NJ, 2002.
• Király et al. (2015) Király, F. J., Theran, L., and Tomioka, R. The algebraic combinatorial approach for low-rank matrix completion. J. Mach. Learn. Res., 16(1):1391–1436, 2015.
• Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on learning theory, pp. 1246–1257. PMLR, 2016.
• Li et al. (2016) Li, Y., Liang, Y., and Risteski, A. Recovery guarantee of weighted low-rank approximation via alternating minimization. In International Conference on Machine Learning, pp. 2358–2367. PMLR, 2016.
• Ma et al. (2019) Ma, C., Wang, K., Chi, Y., and Chen, Y. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, 2019.
• Ma et al. (2018) Ma, Y., Olshevsky, A., Szepesvari, C., and Saligrama, V. Gradient descent for sparse rank-one matrix completion for crowd-sourced aggregation of sparsely interacting workers. In International Conference on Machine Learning, pp. 3335–3344. PMLR, 2018.
• Pimentel-Alarcón et al. (2016) Pimentel-Alarcón, D. L., Boston, N., and Nowak, R. D. A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing, 10(4):623–636, 2016.
• Recht et al. (2010) Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
• Scieur et al. (2017) Scieur, D., Roulet, V., Bach, F., and d’Aspremont, A. Integration methods and optimization algorithms. Advances in Neural Information Processing Systems, 30, 2017.
• Shechtman et al. (2015) Shechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J., and Segev, M. Phase retrieval with application to optical imaging: a contemporary overview. IEEE signal processing magazine, 32(3):87–109, 2015.
• Tu et al. (2016) Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., and Recht, B. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, pp. 964–973. PMLR, 2016.
• Waldspurger & Waters (2020) Waldspurger, I. and Waters, A. Rank optimality for the burer–monteiro factorization. SIAM journal on Optimization, 30(3):2577–2602, 2020.
• Zhang et al. (2021) Zhang, H., Bi, Y., and Lavaei, J. General low-rank matrix optimization: Geometric analysis and sharper bounds. arXiv preprint arXiv:2104.10356, 2021.
• Zhang et al. (2019) Zhang, R. Y., Sojoudi, S., and Lavaei, J. Sharp restricted isometry bounds for the inexistence of spurious local minima in nonconvex matrix recovery. J. Mach. Learn. Res., 20(114):1–34, 2019.

## Appendix A Proofs in Section 2

### a.1 Proof of Proposition 1

###### Proof.

The condition implies that is connected and non-bipartite. Since the graph is non-bipartite, there exists a cycle with an odd number of vertices in in which . To numerically find an odd cycle, the breadth-first search method requires operations. Without loss of generality, we assume that the set of vertices of the cycle is and the set of edges is , where is a nonnegative integer. Suppose that the matrix satisfies . We denote the -th block of as for all , i.e.,

 M^* = X^*(X^*)^T = \begin{bmatrix} X_1^* \\ \vdots \\ X_m^* \end{bmatrix} \begin{bmatrix} (X_1^*)^T & \cdots & (X_m^*)^T \end{bmatrix}.

Since , the block is nonsingular for every , which further implies that the block is nonsingular for all . Using the relation that , we can calculate that

 \left[\prod_{i=1}^{k} M_{2i-1,2i}^* \left(M_{2i,2i+1}^*\right)^{-T}\right] M_{2k+1,1}^* = X_1^*(X_1^*)^T.

Since the left-hand side only contains observed blocks, the matrix can be computed via observed blocks. Since computing the inverse of an matrix and computing the product of two matrices both require operations, the total number of operations required for computing is . In addition, computing the Cholesky decomposition of requires operations, which produces a matrix for some orthogonal matrix .

With the knowledge of , we can recursively compute the block using the connectivity of . More specifically, we use to denote the set of vertices for which we have computed . We start with . At each iteration, we choose indices and such that . Such a pair of indices always exists unless , since the graph is connected. Then, using the observation

 M_{j,i}^* = X_j^*(X_i^*)^T = (X_j^* R)(X_i^* R)^T,

we first compute the matrix with operations and then add to the set . We stop the iteration when . After this process, we can concatenate for all to obtain the matrix . The number of iterations is and thus the total number of operations is .

Summarizing the two parts, the total number of operations to compute the matrix is .
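The telescoping-product step of this proof can be illustrated numerically. The sketch below assumes square invertible r × r blocks X*_i and an odd cycle of length 2k + 1 with randomly generated data; the diagonal shift is only there to keep the random blocks well-conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k = 2, 2                    # block size and cycle parameter: 2k + 1 = 5 vertices
m = 2 * k + 1

# Hypothetical ground truth: invertible r x r blocks X*_1, ..., X*_m.
Xs = [rng.normal(size=(r, r)) + 3.0 * np.eye(r) for _ in range(m)]

def M(i, j):
    """Observed block M*_{i,j} = X*_i (X*_j)^T, 1-indexed."""
    return Xs[i - 1] @ Xs[j - 1].T

# Telescoping product over the odd cycle, using only observed blocks:
# each factor M*_{2i-1,2i} (M*_{2i,2i+1})^{-T} equals X*_{2i-1} (X*_{2i+1})^{-1},
# so the product times M*_{2k+1,1} collapses to X*_1 (X*_1)^T.
P = np.eye(r)
for i in range(1, k + 1):
    P = P @ M(2 * i - 1, 2 * i) @ np.linalg.inv(M(2 * i, 2 * i + 1).T)

M11 = P @ M(m, 1)              # recovered X*_1 (X*_1)^T
L = np.linalg.cholesky(M11)    # equals X*_1 R for some orthogonal R
```

The Cholesky factor L recovers X*_1 only up to an orthogonal transformation R, exactly as in the proof.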

### a.2 Gradient and Hessian of the Problem (2)

Before proceeding with the analysis for the proofs in the remainder of this paper, we first derive the gradient and the Hessian of the objective function of the problem (2). We omit the proof since the calculation can be done via basic calculus. The gradient of the objective function can be written as

 \nabla f(X) = 2\,(XX^T - M^*)_{\Omega}\, X. \qquad (6)

Similarly, the quadratic variant of the Hessian can be written as

 \Delta : \nabla^2 f(X) : \Delta = 4\left\langle (XX^T - M^*)_{\Omega},\, \Delta\Delta^T \right\rangle + 2\left\| (X\Delta^T + \Delta X^T)_{\Omega} \right\|_F^2. \qquad (7)
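The quadratic form (7) can be sanity-checked against a finite difference of the loss. The sketch below assumes the loss f(X) = ‖(XX^T − M*)_Ω‖_F² with a symmetric 0/1 mask playing the role of Ω; the constants in (6) and (7) depend on exactly this normalization, so treat the setup as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2

# Symmetric 0/1 mask standing in for the measurement set Omega (illustrative).
U = rng.random((n, n)) < 0.5
W = (U | U.T).astype(float)

Z = rng.normal(size=(n, r))
Mstar = Z @ Z.T                          # ground truth matrix

def f(X):
    # Assumed normalization: squared Frobenius norm over observed entries.
    return np.linalg.norm(W * (X @ X.T - Mstar)) ** 2

X = rng.normal(size=(n, r))
D = rng.normal(size=(n, r))

# Quadratic form (7): 4 <(XX^T - M*)_Omega, D D^T> + 2 ||(X D^T + D X^T)_Omega||_F^2
A = W * (X @ X.T - Mstar)
B = W * (X @ D.T + D @ X.T)
quad = 4.0 * np.sum(A * (D @ D.T)) + 2.0 * np.linalg.norm(B) ** 2

# Central second difference of t -> f(X + t D) at t = 0
h = 1e-4
fd = (f(X + h * D) - 2.0 * f(X) + f(X - h * D)) / h**2
```

Since t → f(X + tD) is a quartic polynomial, the central second difference matches the quadratic form up to an O(h²) term.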

### a.3 Proof of Theorem 1

###### Proof.

Let be a maximal independent set of such that every vertex in the set has a self-loop. We define the global solution as , where

 x_i^* := c_i,\ \forall i \in S, \qquad x_i^* := 0,\ \forall i \notin S

and is a set of nonzero constants. We note that in the case when , the factor is a vector. Therefore, we represent it using the notation for vectors, i.e., . Considering the problem instance , the set of global solutions of the problem (2) is given by

 \mathcal{X}^* := \left\{ x \in \mathbb{R}^n \;\middle|\; x_i^2 = c_i^2,\ \forall i \in S,\ x_i = 0,\ \forall i \notin S \right\},

which has the cardinality . For every global solution , we have . Thus, we know that is a first-order critical point of the problem (2). For every , the quadratic form of the Hessian (7) can be written as

 \Delta : \nabla^2 f(\hat{x}) : \Delta = 2\left\| (\hat{x}\Delta^T + \Delta\hat{x}^T)_{\Omega} \right\|_F^2 = \sum_{i \in S} 2(\Delta_i \hat{x}_i + \hat{x}_i \Delta_i)^2 + \sum_{j \notin S} \sum_{\substack{i \in S \\ (i,j) \in E_1}} \left[ 2(\Delta_j \hat{x}_i)^2 + 2(\hat{x}_i \Delta_j)^2 \right] = \sum_{i \in S} 4(\Delta_i \hat{x}_i)^2 + \sum_{j \notin S} \sum_{\substack{i \in S \\ (i,j) \in E_1}} 4(\Delta_j \hat{x}_i)^2 = \sum_{i \in S} 4\hat{x}_i^2 \Delta_i^2 + \sum_{j \notin S} \Bigg( \sum_{\substack{i \in S \\ (i,j) \in E_1}} 4\hat{x}_i^2 \Bigg) \Delta_j^2.

The first term in the above expression corresponds to the self-loops in , while the second term corresponds to the edges between and . We note that the edges whose endpoints are both in do not contribute to the quadratic form. Since is a maximal independent set, we know that

 \left\{ i \in S \;\middle|\; (i,j) \in E_1 \right\} \neq \emptyset, \qquad \forall j \notin S.

As a result, it holds that

 \Delta : \nabla^2 f(\hat{x}) : \Delta > 0, \qquad \forall \Delta \in \mathbb{R}^n \setminus \{0\},

which implies that the Hessian at the global solution is positive definite. Then, we perturb the global solution of the above problem to be

 M^*(\epsilon) := x^*(\epsilon)\,[x^*(\epsilon)]^T = (x^* + \epsilon)(x^* + \epsilon)^T,

where and is a small perturbation. We denote the problem (2) after perturbation as

 \min_{x \in \mathbb{R}^n} \tilde{f}(x; \epsilon),

where . For a generic perturbation , all components of are nonzero and the problem belongs to the class . This implies that the global solution of the problem (3) is unique up to a sign flip.
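The positive-definiteness argument above can also be checked numerically. The sketch below builds a hypothetical rank-1 instance (S = {0, 1} with self-loops, every other vertex adjacent to all of S), assembles the Hessian at a global solution from the bilinear form 2⟨B(d), B(e)⟩ with B(d) = (x̂d^T + dx̂^T)_Ω, and inspects its smallest eigenvalue; the instance and constants are illustrative, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: S = {0, 1} has self-loops, and each vertex outside S
# is connected to every vertex of S.
n, S = 5, [0, 1]
W = np.zeros((n, n))
for i in S:
    W[i, i] = 1.0
    for j in range(n):
        if j not in S:
            W[i, j] = W[j, i] = 1.0

# A global solution: nonzero entries c_i on S, zeros elsewhere.
x_hat = np.zeros(n)
x_hat[S] = rng.normal(size=len(S)) + 2.0

def B(d):
    """Masked symmetrized rank-one perturbation (x_hat d^T + d x_hat^T)_Omega."""
    return W * (np.outer(x_hat, d) + np.outer(d, x_hat))

# At a global solution (x_hat x_hat^T - M*)_Omega = 0, so the Hessian quadratic
# form reduces to 2 ||B(d)||_F^2; H below is the corresponding matrix.
H = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        da = np.zeros(n); da[a] = 1.0
        db = np.zeros(n); db[b] = 1.0
        H[a, b] = 2.0 * np.sum(B(da) * B(db))

eig_min = np.linalg.eigvalsh(H).min()
print(f"smallest Hessian eigenvalue: {eig_min:.3f}")
```

A strictly positive smallest eigenvalue confirms that every global solution of the unperturbed instance is a nondegenerate local minimum, which is what the perturbation step of the proof relies on.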

The earlier argument implies that