Error bound of local minima and KL property of exponent 1/2 for squared F-norm regularized factorization

11/11/2019 ∙ by Ting Tao, et al. ∙ South China University of Technology

This paper is concerned with the squared F(robenius)-norm regularized factorization form for noisy low-rank matrix recovery problems. Under a suitable assumption on the restricted condition number of the Hessian matrix of the loss function, we establish an error bound to the true matrix for those local minima whose ranks are not more than the rank of the true matrix. Then, for the least squares loss function, we achieve the KL property of exponent 1/2 for the F-norm regularized factorization function over its global minimum set under a restricted strong convexity assumption. These theoretical findings are also confirmed by applying an accelerated alternating minimization method to the F-norm regularized factorization problem.




1 Introduction

Low-rank matrix recovery problems aim at recovering an unknown true low-rank matrix from as few observations as possible, and have wide applications in a host of fields such as statistics, control and system identification, signal and image processing, machine learning, and quantum state tomography (see [11, 10, 15, 26]). Generally, when a tight upper estimate, say an integer , is available for the rank of , these problems can be formulated as the following rank constrained optimization problem


where is a loss function. Otherwise, one needs to solve a sequence of rank constrained optimization problems with an updated upper estimate for the rank of . For the latter scenario, one may consider the following rank regularized model


with an appropriate to achieve a desirable low-rank solution. The models (1) and (2) reduce to the rank constrained and the rank regularized least squares problems, respectively, when


where is the sampling operator and is the noisy observation from


Due to the combinatorial nature of the rank function, the rank optimization problems are NP-hard, so one cannot expect to find a global optimal solution with a polynomial-time algorithm. A common way to deal with them is the convex relaxation technique. For the rank regularized problem (2), the popular nuclear norm relaxation method (see, e.g., [11, 26, 7]) yields a desirable solution via a single convex minimization


Over the past decade of active research, this method has made great theoretical progress (see, e.g., [7, 8, 26, 21]). In spite of its favorable theoretical performance, its computational efficiency remains a challenge. In fact, almost all convex relaxation algorithms for (2) require an SVD of a full matrix in each iteration, which forms the major computational bottleneck and restricts their scalability to large-scale problems. Inspired by this, recent years have witnessed a renewed interest in the Burer-Monteiro factorization [4] for low-rank matrix optimization problems. By replacing with its factored form for with , the factorization form of (5) is
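The SVD bottleneck of the convex approach can be made concrete: proximal-type methods for the nuclear norm relaxation repeatedly apply singular value soft-thresholding, which needs a full SVD at every iteration. A minimal NumPy sketch (illustrative only, not the authors' implementation; the matrix size and the threshold `lam` are hypothetical choices):

```python
import numpy as np

def svt(X, lam):
    """Prox of lam * nuclear norm: full SVD, then soft-threshold the singular values."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return (P * np.maximum(s - lam, 0.0)) @ Qt

rng = np.random.default_rng(6)
X = rng.standard_normal((8, 6))
Z = svt(X, lam=1.0)   # one proximal step; the full SVD is the costly part
```

For large matrices this per-iteration SVD dominates the running time, which is exactly what motivates the factored reformulation below.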


Although the factorization form tremendously reduces the number of optimization variables, since is usually much smaller than , the intrinsic bilinearity makes the factored objective functions nonconvex and introduces additional critical points that are not global optima of the factored optimization problems. Hence, one research line for factored optimization problems focuses on the nonconvex landscape, especially the strict saddle property (see, e.g., [25, 13, 14, 3, 20, 39, 40]). Most of these works center around the factored optimization forms of the problem (1) or their regularized forms with a balance term, except the paper [20], in which, for the exact or over-parametrization case , the authors proved that each critical point of (6) either corresponds to a global minimum of (5) or is a strict saddle where the Hessian matrix has a strictly negative eigenvalue. This, along with the equivalence between (5) and (6) (see also Lemma 1 in Appendix C), implies that many local search algorithms such as gradient descent and its variants can converge to a global optimum even with random initialization [12, 19, 29]. Another research line considers the (regularized) factorizations of rank optimization problems from a local view and aims to characterize the growth behavior of objective functions around the set of global optimal solutions (see, e.g., [18, 24, 30, 32, 37, 38]).
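To illustrate that plain local search on the squared F-norm regularized factorization can reach a global optimum from a random start in a benign setting, here is a hedged NumPy sketch for the noiseless, full-sampling least squares loss; the problem sizes, step size, and regularization weight `lam` are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 20, 15, 2

# Ground-truth rank-r matrix (hypothetical sizes for illustration)
M_star = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

lam = 1e-3    # squared F-norm regularization weight (illustrative)
step = 0.01   # fixed step size (illustrative)
U = 0.1 * rng.standard_normal((n1, r))
V = 0.1 * rng.standard_normal((n2, r))

# Gradient descent on f(U, V) = 0.5||U V^T - M*||_F^2 + (lam/2)(||U||_F^2 + ||V||_F^2)
for _ in range(2000):
    R = U @ V.T - M_star
    U, V = U - step * (R @ V + lam * U), V - step * (R.T @ U + lam * V)

err = np.linalg.norm(U @ V.T - M_star) / np.linalg.norm(M_star)
```

Despite nonconvexity, the small random initialization escapes the saddle at the origin and the relative error becomes small, consistent with the strict saddle picture described above.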

In addition, for the problem (1) associated with noisy low-rank matrix recovery, some researchers are interested in the error bound to the true matrix for the local minima of the factorization form or its regularized form with a balance term. For example, for noisy low-rank positive semidefinite matrix recovery, Bhojanapalli et al. [3] showed that all local minima of the nonconvex factorized exact-parametrization are very close to the global minimum (which becomes the true under the noiseless setting) under an RIP condition on the sampling operator; and Zhang et al. [35] obtained the same conclusion for the local minima of a regularized factorization form under a restricted strong convexity and smoothness condition on the loss function. However, few works discuss error bounds for the local minima of the factorization associated with the rank regularized problem (2) or its convex relaxation (5), except [5], in which, for noisy matrix completion, the nonconvex Burer-Monteiro approach is used to demonstrate that the convex relaxation approach achieves near-optimal estimation errors.

This work is concerned with the error bound for the local minima of the nonconvex factorization (6) with and the KL property of exponent 1/2 of with over the global minimum set. Specifically, under a suitable assumption on the restricted condition number of the Hessian matrix , we derive an error bound to the true for the local minima with rank not more than , which is shown to be optimal by using the exact characterization of global optimal solutions in [36] for the ideal noiseless and full sampling setup. Different from [5], we achieve the error bound result for a general smooth loss function by adopting a deterministic, rather than a probabilistic, analysis technique. In addition, for the least squares loss function in (3), under a restricted positive definiteness of , we achieve the KL property of exponent 1/2 of over the global minimum set. This result not only extends the work of [36] to the noisy and partial sampling setting but also, together with the strict saddle property of established in [20], implies that many first-order methods with a good starting point yield an iterate sequence converging to a global optimal solution, and consequently fills in the convergence rate analysis gap of the alternating minimization methods proposed in [26, 17] for solving (6). Although Li et al. mentioned in [20] that the explicit convergence rate of certain algorithms in [12, 29] can be obtained by extending the strict saddle property with an analysis similar to that of [40], to the best of our knowledge, there is no rigorous proof of this; moreover, the analysis in [40] is tailored to the factorization form with a balanced regularization term.
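The alternating minimization idea mentioned above exploits the fact that, with the squared F-norm regularizer, each block subproblem is a ridge regression with a closed-form solution. A minimal sketch under a noiseless, full-sampling assumption (sizes and the weight `lam` are illustrative; this is not the accelerated scheme of [26, 17]):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r, lam = 15, 12, 2, 1e-2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r target

U = rng.standard_normal((n1, r))
V = rng.standard_normal((n2, r))
I = np.eye(r)

for _ in range(50):
    # Each half-step is an exactly solvable ridge-regression subproblem
    U = M @ V @ np.linalg.inv(V.T @ V + lam * I)
    V = M.T @ U @ np.linalg.inv(U.T @ U + lam * I)

err = np.linalg.norm(U @ V.T - M) / np.linalg.norm(M)
```

Because each subproblem is solved in closed form, no SVD of a full matrix is ever needed, in contrast to the convex relaxation algorithms discussed earlier.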

2 Notation and preliminaries

Throughout this paper, represents the vector space of all real matrices, equipped with the trace inner product for and its induced Frobenius norm, and we stipulate . The notation denotes the set of matrices with orthonormal columns, and stands for . Let , I and denote an identity matrix, a matrix of all ones, and a vector of all ones, respectively, whose dimensions are known from the context. For a matrix , denotes the singular value vector of arranged in nonincreasing order, for an integer means the vector consisting of the first entries of , and is the set

where represents a rectangular diagonal matrix with as the diagonal vector. We denote by and the spectral norm and the nuclear norm of , respectively, by the pseudo-inverse of , and by the column space of . Let and denote the linear mappings from to itself defined, respectively, by


For convenience, with a pair , we define by


Unless otherwise stated, in the sequel we denote by the true matrix of rank , with SVD given by for , and define with for . With an arbitrary , we always write

Restricted strong convexity (RSC) and restricted smoothness (RSS) are common requirements on loss functions when handling low-rank matrix recovery problems (see, e.g., [21, 22, 20, 40, 35]). We now recall the notions of RSC and RSS used in this paper.

Definition 2.1

A twice continuously differentiable function is said to satisfy the -RSC of modulus and the -RSS of modulus , respectively, if and for any with and ,


For the least squares loss in (3), the -RSC of modulus and the -RSS of modulus reduce to requiring the -restricted smallest and largest eigenvalues of to satisfy

Consequently, the -RSC of modulus along with the -RSS of modulus for some reduces to the RIP condition on the operator . Thus, by [26], the least squares loss associated with many types of random sampling operators satisfies this property with high probability. In addition, from the discussions in [39, 20], some loss functions are known to possess this property, such as weighted PCA with positive weights, noisy low-rank matrix recovery with a noise matrix obeying a Subbotin density [28, Example 2.13], and one-bit matrix completion with full observations.
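For concreteness, one standard Hessian-based way of writing such restricted conditions (a sketch only; the index sets and constants in the paper's own display may differ from this generic form) is:

```latex
% Rank-restricted strong convexity and smoothness in Hessian form:
% for all X with rank(X) <= k_1 and all directions Z with rank(Z) <= k_2,
\alpha\,\|Z\|_F^2 \;\le\; \big\langle \nabla^2 f(X)[Z],\, Z \big\rangle \;\le\; \beta\,\|Z\|_F^2 .
```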

The following lemma slightly improves the result of [20, Proposition 2.1], which requires to have the -RSC of modulus and the -RSS of modulus .

Lemma 2.1

Let be a twice continuously differentiable function satisfying the -RSC of modulus and the -RSS of modulus . Then, for any with and any with ,

Proof: Fix an arbitrary with and an arbitrary with . If one of and is the zero matrix, the result is trivial, so we assume that and . Write and . Notice that and . Then, we have

Along with ,

The last two inequalities imply the desired inequality, which completes the proof.

Following the reference [33], we recall that a random variable is called sub-Gaussian if

and is referred to as the sub-Gaussian norm of . Equivalently, a sub-Gaussian random variable satisfies the following tail bound for a constant :


We call the smallest satisfying (12) the sub-Gaussian parameter. The tail-probability characterization in (12) enables us to define centered sub-Gaussian random vectors.
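As a quick numerical sanity check of a sub-Gaussian tail bound of the standard form P(|X| ≥ t) ≤ 2 exp(−t²/(2ς²)) (the usual formulation; the constants in the paper's display (12) may differ), a standard normal is sub-Gaussian with parameter 1:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)     # N(0,1) is sub-Gaussian with parameter 1

t = 2.0
empirical = np.mean(np.abs(x) > t)   # empirical tail probability P(|X| > t)
bound = 2 * np.exp(-t**2 / 2)        # sub-Gaussian tail bound with parameter 1
```

The empirical tail (about 0.046 for a standard normal at t = 2) sits well below the bound (about 0.271), as the inequality is not tight for Gaussians.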

Definition 2.2

(see [6]) A random vector is said to be a centered sub-Gaussian random vector if there exists such that for all and all ,

3 Error bound for local minimizers

To achieve the main result of this section, we need to establish several lemmas. We first take a look at the stationary points of . Define by


For a given , the gradient of at takes the form of


and for any with and , it holds that


By invoking (14), it is easy to get the balance property of the stationary points of .

Lemma 3.1

Fix an arbitrary . Any stationary point of belongs to the set


and consequently the set of local minimizers to the problem (6) is included in .
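The balance property of stationary points can be checked numerically. For the full-sampling least squares loss, building balanced factors from soft-thresholded singular values of the data matrix (in line with the well-known soft-thresholding characterization of global optima, as in [36]) yields an exactly stationary point with equal Gram matrices; all sizes and the weight `lam` below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r, lam = 8, 6, 3, 0.1

# A rank-r "true" matrix for the noiseless, full-sampling least squares loss
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
P, s, Qt = np.linalg.svd(M, full_matrices=False)

# Balanced factors from soft-thresholded singular values
d = np.sqrt(np.maximum(s[:r] - lam, 0.0))
U = P[:, :r] * d
V = Qt[:r].T * d

# Gradient of 0.5||U V^T - M||_F^2 + (lam/2)(||U||_F^2 + ||V||_F^2)
R = U @ V.T - M
gU = R @ V + lam * U
gV = R.T @ U + lam * V
```

At this point the gradient vanishes to machine precision and U^T U = V^T V holds exactly, illustrating the balance property stated in the lemma.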

For any stationary point of with , the following lemma implies a lower bound for , whose proof is included in Appendix A.

Lemma 3.2

Suppose has the -RSC of modulus and the -RSS of modulus . Fix an arbitrary and an arbitrary . Then, for any stationary point of with and any column orthonormal spanning ,

Remark 3.1

Write . The result of Lemma 3.2 implies that


Recall that for any and any . Then, it holds that

Together with (17) and , we have


That is, is lower bounded by .

When the stationary point in Lemma 3.2 is strengthened to a local minimizer, we can provide a lower bound for , where is a special direction defined by


This result is stated in the following lemma, whose proof is included in Appendix B.

Lemma 3.3

Suppose that has the -RSC of modulus and the -RSS of modulus . Fix an arbitrary and an arbitrary . Let be a local optimum of (6) with . Then, for the defined by (19) with and ,


Now we are ready to state the main result, which shows that the distance of any local minimizer with rank at most to can be upper bounded via and .

Theorem 3.1

Suppose that satisfies the -RSC of modulus and the -RSS of modulus , respectively, with . Fix an arbitrary and an arbitrary . Then, for any local optimum of (6) with , there exists (depending only on and ) such that the following inequality holds


Proof: Let be given by (19) with and . Note that . It is easy to check that is PSD. By [20, Lemma 3.6], we have


In addition, from the inequality (21) in Lemma 3.3, it follows that

where the third inequality is due to for any and any . From the last two inequalities, it is not hard to obtain that


Combining this inequality with the inequality (18) yields that

Since , we have . So, the desired inequality holds with for and .

Remark 3.2

(i) For the least squares loss (3) with a noiseless and full sampling , the error bound in Theorem 3.1 becomes for each local minimum of (6) with . On the other hand, by the characterization of the global optimal solution set in [36] for (6) with , each global optimal solution with satisfies . This shows that the obtained error bound is optimal.

(ii) It should be emphasized that, under the assumption of Theorem 3.1, the local minimizers of (6) with are not necessarily global ones, unless satisfies the -RSC of modulus and the -RSS of modulus with , by [20, Theorem 4.1].

(iii) By combining Theorem 3.1 with Lemma 1 in Appendix C, it follows that each optimal solution of the convex problem (5) with satisfies

which is consistent with the one in [21, Corollary 1] for the optimal solution (though it is unknown whether its rank is less than or not) of the convex relaxation approach. This implies that the error bound of the convex relaxation approach is near optimal.

Next we illustrate the result of Theorem 3.1 via two specific observation models.

3.1 Matrix sensing

The matrix sensing problem aims to recover the true matrix via the observation model (4), where the sampling operator is defined by for , and the entries of the noise vector are assumed to be i.i.d. sub-Gaussian of parameter . By Definition 2.2 and the discussions in [9, Page 24], for every , there exists an absolute constant such that with probability at least ,
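The Gaussian matrix sensing operator can be sketched as follows; the near-isometry on a fixed low-rank matrix (the population counterpart of the RIP in Assumption 3.1) already shows up at moderate sample size. All sizes and the scaling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r, m = 10, 8, 2, 2000

# m Gaussian sensing matrices, scaled so that E||A(X)||^2 = ||X||_F^2
A = rng.standard_normal((m, n1, n2)) / np.sqrt(m)

# A fixed rank-r test matrix
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

# Sampling operator: the i-th measurement is the inner product <A_i, X>
y = np.tensordot(A, X, axes=([1, 2], [0, 1]))

ratio = np.sum(y**2) / np.sum(X**2)   # concentrates around 1 as m grows
```

A uniform version of this concentration over all low-rank matrices is what the RIP constant in Assumption 3.1 quantifies.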

Assumption 3.1

The sampling operator has the -RIP of constant .

Take for . Then, under Assumption 3.1, the loss function satisfies the conditions in Theorem 3.1 with and . We next upper bound . Let denote the Euclidean sphere in . From the variational characterization of the spectral norm of matrices,

By invoking (25) and the RIP of , with probability at least it holds that

Notice that . By Theorem 3.1, we obtain the following conclusion.

Corollary 3.1

Suppose that the sampling operator satisfies Assumption 3.1. Then, for any local optimum with of the problem (6) for ,


holds with probability at least , where is a nondecreasing positive function of . When for an absolute constant , with probability at least we have

3.2 Weighted principal component analysis

The weighted PCA problem aims to recover an unknown true matrix from an elementwise weighted observation , where is the positive weight matrix, is the noise matrix, and denotes the Hadamard (elementwise) product of matrices. This corresponds to the observation model (4) with and . We assume that the entries of are i.i.d. sub-Gaussian random variables of parameter . By Definition 2.2 and the discussions in [9, Page 24], for every , there exists an absolute constant such that, with probability at least ,


Take for . Then, for each ,

Clearly, satisfies the -RSC of modulus and -RSS of modulus , where and . Notice that
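The restricted eigenvalue claim for weighted PCA can be made concrete: for a weighted least squares loss of the form f(X) = 0.5||W ∘ (X − Y)||_F², the Hessian acts elementwise as Z → (W ∘ W) ∘ Z, so its eigenvalues are exactly the squared weights, and every quadratic form is pinched between the smallest and largest squared weight. A small NumPy check (sizes and the weight range are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 6, 5
W = rng.uniform(0.5, 2.0, size=(n1, n2))   # positive weights (illustrative range)

# Hessian of f acts as Z -> (W*W)*Z elementwise, so its spectrum is {W_ij^2}
Z = rng.standard_normal((n1, n2))
quad = np.sum((W * Z) ** 2)                # the quadratic form <Hess f [Z], Z>
lo, hi = W.min() ** 2, W.max() ** 2
```

This gives RSC and RSS moduli equal to the smallest and largest squared weights, with no rank restriction needed in this fully observed case.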