I. Introduction
Matrix completion deals with the problem of recovering a matrix from its partially observed (and possibly noisy) entries, and it has attracted considerable interest recently [1]–[4]. The matrix completion problem arises in many applications in signal processing, image/video processing, and machine learning, such as rating estimation in recommendation systems [7], friendship prediction in social networks, collaborative filtering [8], image processing [6], [10], video denoising [12], [13], system identification [14], multiclass learning [15], [16], and dimensionality reduction [17]. Specifically, the goal of matrix completion is to recover a matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$ from its partially observed (incomplete) entries

$$[\mathbf{Y}]_{ij} = [\mathbf{M}]_{ij}, \quad (i,j) \in \Omega, \tag{1}$$

where $\Omega \subset \{1,\dots,m\} \times \{1,\dots,n\}$ is a random subset of index pairs and $\mathbf{Y}$ collects the observed entries. Obviously, the completion of an arbitrary matrix is an ill-posed problem. To make the problem well-posed, a commonly used assumption is that the underlying matrix comes from a restricted class, e.g., the class of low-rank matrices, and exploiting the low-rank structure of the matrix is a powerful approach.
Modeling the matrix completion problem as a low-rank matrix recovery problem, a natural formulation is to minimize the rank of $\mathbf{X}$ under the linear constraint (1) as

$$\min_{\mathbf{X}} \ \mathrm{rank}(\mathbf{X}) \quad \mathrm{s.t.} \quad \mathcal{P}_\Omega(\mathbf{X}) = \mathcal{P}_\Omega(\mathbf{Y}), \tag{2}$$

where $\mathcal{P}_\Omega$ denotes the projection onto the set of matrices supported on $\Omega$, i.e., $[\mathcal{P}_\Omega(\mathbf{X})]_{ij} = X_{ij}$ if $(i,j) \in \Omega$ and $[\mathcal{P}_\Omega(\mathbf{X})]_{ij} = 0$ otherwise. Since the rank minimization problem (2) is highly nonconvex and difficult to solve, a popular convex relaxation is to replace the rank function by its convex envelope, the nuclear norm $\|\mathbf{X}\|_*$ (the sum of the singular values),

$$\min_{\mathbf{X}} \ \|\mathbf{X}\|_* \quad \mathrm{s.t.} \quad \mathcal{P}_\Omega(\mathbf{X}) = \mathcal{P}_\Omega(\mathbf{Y}). \tag{3}$$
In most realistic applications, entrywise noise is inevitable. Taking entrywise noise into consideration, a robust variant of (3) is

$$\min_{\mathbf{X}} \ \|\mathbf{X}\|_* \quad \mathrm{s.t.} \quad \|\mathcal{P}_\Omega(\mathbf{X}) - \mathcal{P}_\Omega(\mathbf{Y})\|_F \le \delta, \tag{4}$$

where $\delta \ge 0$ is the noise tolerance. The constrained formulation (4) can be converted into the unconstrained form

$$\min_{\mathbf{X}} \ \frac{1}{2}\|\mathcal{P}_\Omega(\mathbf{X}) - \mathcal{P}_\Omega(\mathbf{Y})\|_F^2 + \lambda \|\mathbf{X}\|_*, \tag{5}$$
where $\lambda > 0$ is a regularization parameter related to the noise tolerance parameter $\delta$ in (4). The unconstrained formulation is favorable in some applications since existing efficient first-order convex algorithms, such as the alternating direction method of multipliers (ADMM) or the proximal gradient descent (PGD) algorithm, can be directly applied. Even in the noise-free case, the solution of (5) can accurately approach that of (3) by choosing a sufficiently small value of $\lambda$, since the solution of (5) approaches a solution of (3) as $\lambda \to 0$. The problems (3) and (4) can be recast as semidefinite programs (SDPs) and solved to global optimality by well-established SDP solvers when the matrix dimension is not large. For problems of larger size, more efficient first-order algorithms have been developed based on the formulation (5), e.g., variants of the proximal gradient method [19], [20].
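To make the first-order approach concrete, the following minimal sketch performs one proximal gradient step for the nuclear-norm formulation (5), whose proximal map is soft-thresholding of the singular values. The function names, the `mask` representation of $\Omega$, and the default step size are illustrative choices, not taken from the paper.

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: the prox of tau * nuclear norm at Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pgd_step_nuclear(X, Y, mask, lam, mu=1.0):
    """One PGD step for 0.5*||P_Omega(X - Y)||_F^2 + lam*||X||_*.
    mask is the 0/1 indicator of the observed set Omega."""
    grad = mask * (X - Y)          # gradient of the data-fitting term
    return svt(X - mu * grad, lam * mu)
```

Iterating this step with $\mu \le 1$ monotonically decreases the objective, since the gradient of the data-fitting term is 1-Lipschitz ($\mathcal{P}_\Omega$ is an entrywise projection).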
Besides the tractability of the convex formulations (3)–(5) employing the nuclear norm, the theoretical guarantees provided in [1], [2], [21], [22] demonstrate that under certain conditions, e.g., when the low-rank matrix satisfies an incoherence condition and the observed entries are sampled uniformly at random, $\mathbf{M}$ can be exactly recovered from a small portion of its entries with high probability by using nuclear norm regularization. However, nuclear norm regularization suffers from a bias problem: it shrinks all singular values equally and thus introduces bias into the recovered singular values [23]–[25]. To alleviate the bias problem and achieve better recovery performance, a nonconvex low-rank penalty can be used instead, such as the Schatten-$q$ norm (which is in fact the $\ell_q$ norm of the matrix singular values, with $0 < q < 1$), the smoothly clipped absolute deviation (SCAD) penalty, the minimax concave (MC) penalty, or the firm-thresholding penalty. In the past few years, nonconvex regularization has shown better performance than convex regularization in many applications involving sparse and low-rank recovery, including compressive sensing, sparse regression, sparse demixing, sparse covariance and precision matrix estimation, and robust principal component analysis [9], [26].
In this work, we consider the following formulation for matrix completion:

$$\min_{\mathbf{X}} \ F(\mathbf{X}) = \frac{1}{2}\|\mathcal{P}_\Omega(\mathbf{X}) - \mathcal{P}_\Omega(\mathbf{Y})\|_F^2 + \lambda R(\mathbf{X}), \tag{6}$$
where $R(\cdot)$ is a generalized nonconvex low-rank-promoting penalty. For the particular case of $R$ being the nuclear norm, i.e., $R(\mathbf{X}) = \|\mathbf{X}\|_*$, this formulation reduces to (5). Existing works considering the nonconvex formulation (6) include [27]–[31]. In [27], [28], the Schatten-$q$ norm has been considered and PGD methods have been proposed. In [29], using a smoothed Schatten-$q$ norm, an iteratively reweighted algorithm has been designed for (6), which involves solving a sequence of linear equations. Another iteratively reweighted algorithm, for the Schatten-$q$ norm regularized matrix minimization problem with a generalized smooth loss function, has been investigated in [30]. More recently, in [31], $R$ being the MC penalty has been considered and an ADMM algorithm has been developed. Besides, for the linearly constrained formulation, an iterative algorithm employing the Schatten-$q$ norm, which monotonically decreases the objective, has been proposed in [32]. Meanwhile, a truncated nuclear norm has been used in [33], and robust matrix completion using Schatten-$q$ regularization has been considered in [34]. Moreover, it has been shown in [35] that the sufficient condition for reliable recovery under Schatten-$q$ norm regularization is weaker than that under nuclear norm regularization.
Among the nonconvex algorithms for problem (6), only subsequence convergence has been proved for the methods of [27]–[31]. In fact, based on recent convergence results for nonconvex and nonsmooth optimization [36]–[38], global convergence of the PGD algorithms [27], [28] and the ADMM algorithm [31] to a stationary point can be guaranteed under some mild conditions. However, for a nonconvex penalty, whether these algorithms converge to a local minimizer is still unclear. Meanwhile, for problem (6), a linear convergence rate of the PGD algorithm has been established when the penalty is the nuclear norm under certain conditions [39], [40], but the convergence rate of PGD in the case of a nonconvex penalty is still an open problem.
To address these problems, this work provides a thorough analysis of the PGD algorithm for the matrix completion problem (6) using a generalized nonconvex penalty. The main contributions are as follows.
I-A. Contribution
First, we derive some properties of the gradient and Hessian of a generalized low-rank penalty, which are important for the convergence analysis. Then, for a popular and important class of nonconvex penalties with discontinuous thresholding functions, we establish the following convergence properties of the PGD algorithm under certain conditions:
1) rank convergence within finitely many iterations;
2) convergence to a restricted strictly local minimizer;
3) convergence to a local minimizer for the hardthresholding penalty;
4) an eventually linear convergence rate.
As the singular value thresholding function depends implicitly on the low-rank matrix, the derivation is nontrivial. Finally, an illustration of the PGD algorithm via inpainting experiments is provided.
It is worth noting that there exists a line of recent works on factorization-based nonconvex algorithms, e.g., [5], [11], [18]. It has been shown that the corresponding nonconvex objective functions have no spurious local minima, and that efficient nonconvex optimization algorithms can converge to a local minimum. While those works focus on matrix-factorization-based methods, this work considers the general matrix completion problem (6). Our result is the first to show that the nonconvex matrix completion problem (6) has only restricted strictly local minimizers, and that the PGD algorithm can converge to such a minimizer at an eventually linear rate under certain conditions.
Outline: The rest of this paper is organized as follows. Section II introduces the proximity operator for a generalized nonconvex penalty and reviews the PGD algorithm for matrix completion. Section III provides the convergence analysis of the PGD algorithm. Section IV provides experimental results on inpainting. Finally, Section V ends the paper with concluding remarks.
TABLE I: Several popular penalties and the corresponding proximity (thresholding) operators; the standard forms are shown with the parameter $\lambda$ as in (7).

Penalty name | Penalty formulation $P(x)$ | Proximity operator
(i) Hard thresholding | $\mathbb{1}\{x \neq 0\}$ | $t$ if $|t| > \sqrt{2\lambda}$, and $0$ otherwise
(ii) Soft thresholding | $|x|$ | $\mathrm{sign}(t)\max(|t| - \lambda, 0)$
(iii) $\ell_q$ norm ($0 < q < 1$) | $|x|^q$ | no closed form in general; closed form for $q = 1/2, 2/3$, otherwise computed iteratively
Notations: For a matrix $\mathbf{X}$, $\mathrm{rank}(\mathbf{X})$, $\mathrm{tr}(\mathbf{X})$, $\|\mathbf{X}\|_F$ and $\mathrm{range}(\mathbf{X})$ stand for the rank, trace, Frobenius norm and range space of $\mathbf{X}$, respectively, whilst $\sigma_i(\mathbf{X})$ denotes the $i$-th largest singular value, and $\boldsymbol{\sigma}(\mathbf{X})$ contains the descendingly ordered singular values. For a symmetric real matrix $\mathbf{A}$, $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$ respectively denote the maximal and minimal eigenvalues, whilst $\boldsymbol{\lambda}(\mathbf{A})$ contains the descendingly ordered eigenvalues. $\mathbf{A} \succeq \mathbf{0}$ and $\mathbf{A} \succ \mathbf{0}$ mean that $\mathbf{A}$ is positive semidefinite and positive definite, respectively. $[\mathbf{X}]_{ij}$ denotes the $(i,j)$-th element. $\mathrm{vec}(\cdot)$ is the "vectorization" operator stacking the columns of a matrix one below another. $\mathrm{diag}(\mathbf{x})$ represents the diagonal matrix generated by the vector $\mathbf{x}$, and $\mathrm{diag}(\mathbf{X})$ represents the vector containing the diagonal elements of $\mathbf{X}$. $\|\cdot\|_2$ denotes the Euclidean norm. $\odot$ and $\otimes$ denote the Hadamard and Kronecker products, respectively. $\langle \cdot, \cdot \rangle$ and $(\cdot)^T$ denote the inner product and transpose, respectively. $\mathrm{sign}(\cdot)$ denotes the sign of a quantity with $\mathrm{sign}(0) = 0$. $\mathbf{I}$ is an identity matrix. $\mathbf{0}$ is a zero vector or matrix of proper size.

II. Proximity Operator and Proximal Gradient Algorithm
This section introduces the proximity operator for nonconvex regularization and the PGD algorithm for the matrix completion problem (6).
II-A. Proximity Operator for Nonconvex Penalties
For a proper and lower semicontinuous penalty function $P(\cdot)$, the corresponding proximity operator is defined as

$$\mathrm{prox}_{P,\lambda}(t) = \arg\min_{x} \left\{ \frac{1}{2}(x - t)^2 + \lambda P(x) \right\}, \tag{7}$$

where $\lambda > 0$ is a penalty parameter.
Table I shows several popular penalties along with their thresholding functions. The proximal minimization problem (7) can be computed efficiently for many popular nonconvex penalties. Hard thresholding is a natural choice for sparsity promotion, while soft thresholding is among the most popular owing to its convexity. The $\ell_q$ penalty with $0 < q < 1$ bridges the gap between the hard- and soft-thresholding penalties. Except for the two known cases $q = 1/2$ and $q = 2/3$, the proximity operator of the $\ell_q$ penalty does not have a closed-form expression, but it can be computed efficiently by an iterative method. Moreover, there also exist other nonconvex penalties, including the shrinkage in [41], [42], SCAD [43], MC [44] and firm thresholding [45].
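The scalar thresholding rules of Table I can be sketched as follows. The closed forms for hard and soft thresholding are standard; since the $\ell_q$ prox has no general closed form, it is computed here by a dense one-dimensional search, whose grid resolution is an illustrative choice rather than anything from the paper.

```python
import numpy as np

def soft_threshold(t, lam):
    """Prox of lam*|x|: constant shrinkage beyond the threshold lam."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard_threshold(t, lam):
    """Prox of the hard-thresholding penalty under the scaling of (7):
    keep or kill, with threshold sqrt(2*lam) and no shrinkage."""
    return np.where(np.abs(t) > np.sqrt(2.0 * lam), t, 0.0)

def lq_prox_numeric(t, lam, q, grid=20001):
    """Numerical prox of lam*|x|^q (0 < q <= 1) by dense 1-D search;
    closed forms exist only for q = 1/2 and q = 2/3."""
    xs = np.linspace(-abs(t) - 1.0, abs(t) + 1.0, grid)
    vals = 0.5 * (xs - t) ** 2 + lam * np.abs(xs) ** q
    return xs[np.argmin(vals)]
```

Note how the hard rule returns the input unchanged above its threshold (no bias), while the soft rule always subtracts $\lambda$; the $\ell_q$ rule interpolates between the two as $q$ moves from $0$ to $1$.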
As shown in Fig. 1, soft thresholding imposes a constant shrinkage on the parameter when the parameter magnitude exceeds the threshold and thus suffers from a bias problem. The hard and SCAD thresholdings are unbiased for large parameters. The other nonconvex thresholding functions are sandwiched between the hard and soft thresholdings, which mitigates the bias problem of soft thresholding. For a generalized nonconvex penalty, we make the following assumptions.
Assumption 1: is an even folded concave function, which satisfies the following conditions:
(i) is nondecreasing on with ;
(ii) for any , there exists a such that for any ;
(iii) is on , and on ;
(iv) the firstorder derivative is convex on and .
This assumption implies that is coercive and weakly sequentially lower semicontinuous in , and that it promotes sparsity.
II-B. Generalized Singular Value Thresholding
For a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, low-rank inducing on $\mathbf{X}$ can be achieved via sparsity inducing on its singular values as

$$R(\mathbf{X}) = \sum_{i=1}^{\min(m,n)} P(\sigma_i(\mathbf{X})), \tag{8}$$

where $P(\cdot)$ is a sparsity-inducing penalty. For the particular cases of $P$ being the $\ell_0$, $\ell_q$ ($0 < q < 1$) and $\ell_1$ norms, $R(\mathbf{X})$ becomes the rank, the Schatten-$q$ norm (to the $q$-th power) and the nuclear norm of $\mathbf{X}$, respectively. For such a low-rank penalty, define the corresponding proximity operator

$$\mathrm{prox}_{R,\lambda}(\mathbf{Z}) = \arg\min_{\mathbf{X}} \left\{ \frac{1}{2}\|\mathbf{X} - \mathbf{Z}\|_F^2 + \lambda R(\mathbf{X}) \right\}. \tag{9}$$
Property 1 [Generalized singular value thresholding]: Let $\mathbf{Z} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be any full singular value decomposition (SVD) of $\mathbf{Z}$, where $\mathbf{U}$ and $\mathbf{V}$ contain the left and right singular vectors, respectively. Then, the proximal minimization problem (9) is solved by the singular value thresholding operator

$$\mathrm{prox}_{R,\lambda}(\mathbf{Z}) = \mathbf{U} \,\mathrm{diag}\big(\mathrm{prox}_{P,\lambda}(\boldsymbol{\sigma}(\mathbf{Z}))\big)\, \mathbf{V}^T, \tag{10}$$

where $\mathrm{prox}_{P,\lambda}(\cdot)$ is the scalar thresholding operator (7) applied elementwise to the singular values.
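A minimal sketch of the operator in (10): compute an SVD and apply any scalar thresholding rule to the singular values. The rule passed in is assumed to preserve the descending order of the singular values, as guaranteed by Assumption 1; the function name is illustrative.

```python
import numpy as np

def generalized_svt(Z, scalar_prox):
    """Apply a scalar thresholding rule to the singular values of Z,
    implementing the generalized SVT operator of (10)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(scalar_prox(s)) @ Vt
```

With a hard rule the output is exactly low rank (small singular values are set to zero), while with a soft rule every retained singular value is additionally shrunk.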
Although this property can be derived by straightforwardly extending Lemma 1 in [7], we provide here a completely different and more intuitive derivation. Assume that the minimizer of (9) is of rank with any truncated SVD , where . Then, the objective in (9) can be equivalently rewritten as
(11) 
By Assumption 1, is differentiable on ; hence, is differentiable with respect to a rank matrix . Denote
where is the first-order derivative of ; then we have (see Appendix A)
(12) 
Let , and use , , it follows from (12) that
Since and are diagonal, and the columns of (also ) are orthogonal, it is easy to see that there exists a full SVD such that
(13) 
Substituting these relations into (11) yields
(14) 
where contains the singular values of . As (14) is separable, it can be solved elementwise as in (7), i.e., . Further, is nondecreasing on by Assumption 1, hence for any . Thus, must contain the largest singular values of in the same descending order as , i.e., . Consequently, we have , which together with and (13) results in (10).
II-C. PGD Algorithm for Matrix Completion
PGD is a powerful optimization algorithm suitable for many large-scale problems arising in signal/image processing, statistics and machine learning. It can be viewed as a variant of the majorization-minimization algorithm with a special choice of quadratic majorization. Let $f(\mathbf{X}) = \frac{1}{2}\|\mathcal{P}_\Omega(\mathbf{X}) - \mathcal{P}_\Omega(\mathbf{Y})\|_F^2$ denote the data-fitting term in (6). The core idea of the PGD algorithm is to consider, at the $k$-th iteration, a linearized approximation of $f$ at the current point $\mathbf{X}^k$ as

$$Q_\mu(\mathbf{X}, \mathbf{X}^k) = f(\mathbf{X}^k) + \langle \nabla f(\mathbf{X}^k), \mathbf{X} - \mathbf{X}^k \rangle + \frac{1}{2\mu}\|\mathbf{X} - \mathbf{X}^k\|_F^2 + \lambda R(\mathbf{X}), \tag{15}$$

where $\nabla f(\mathbf{X}^k) = \mathcal{P}_\Omega(\mathbf{X}^k) - \mathcal{P}_\Omega(\mathbf{Y})$ and $\mu > 0$ is a proximal parameter. Then, minimizing $Q_\mu(\mathbf{X}, \mathbf{X}^k)$ with respect to $\mathbf{X}$ is a form of the proximity operator (9) as

$$\mathbf{X}^{k+1} = \mathrm{prox}_{R,\lambda\mu}\big(\mathbf{X}^k - \mu \nabla f(\mathbf{X}^k)\big), \tag{16}$$

which can be computed as in (10).
In the PGD algorithm, the dominant computational load in each iteration is the SVD calculation. To further improve the efficiency of the algorithm and make it scale well to large-scale problems, techniques such as approximate SVD or PROPACK [7], [19] can be adopted.
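As an illustration of why partial SVDs help, the sketch below uses SciPy's `scipy.sparse.linalg.svds` (an ARPACK-based partial SVD; the paper's experiments may rely on PROPACK itself, so this is only a stand-in) to compute just the top-$k$ singular triplets. Since the PGD iterates are low rank, only the leading singular values matter, and a rank-$k$ partial SVD has a cost that grows with $k$ rather than with $\min(m, n)$.

```python
import numpy as np
from scipy.sparse.linalg import svds

# A rank-12 test matrix (sizes here are arbitrary illustrative choices).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 12)) @ rng.standard_normal((12, 150))

k = 12
U, s, Vt = svds(A, k=k)          # top-k singular triplets only
order = np.argsort(s)[::-1]      # svds returns singular values in ascending order
s, U, Vt = s[order], U[:, order], Vt[order, :]

# Because A has exact rank k, the rank-k truncation reconstructs it.
err = np.linalg.norm(U @ np.diag(s) @ Vt - A) / np.linalg.norm(A)
```

In a PGD iteration one would apply the scalar thresholding rule to `s` before reassembling, exactly as in (10), but without ever forming the full SVD.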
III. Convergence Analysis
This section investigates the convergence properties of the PGD algorithm, with special consideration of the class of nonconvex penalties having discontinuous thresholding functions. First, we make some assumptions on the discontinuity of such thresholding functions.
Assumption 2: satisfies Assumption 1, and the corresponding proximity operator has a formulation as
(17) 
where is defined on as , for any and . is the threshold point given by . is the “jumping” size at the threshold point. is continuous on and the range of is .
A significant property of such a nonconvex penalty is its jumping discontinuity. Typical nonconvex penalties satisfying this discontinuous property include the
, , and log penalties. In the convergence analysis, the Kurdyka–Lojasiewicz (KL) property of the objective function is used; based on a "uniformization" result [36], the KL property considerably simplifies the main arguments and avoids involved induction reasoning.
Definition 1. [KL property]: For a proper function and any , if there exists , a neighborhood of and a continuous concave function such that:
(i) and is continuously differentiable on with positive derivatives;
(ii) for all satisfying , it holds that ;
then is said to have the KL property at . Further, if a proper closed function satisfies the KL property at all points in , it is called a KL function.
Furthermore, we define the restricted strictly local minimizer as follows. Let denote the projection onto the complementary set of .
Definition 2. [Restricted strictly local minimizer]: For a proper function , any and a subset , if there exists a neighborhood of such that for any ,
is said to be a restricted strictly local minimizer of .
It is obvious that, if is a strictly local minimizer of , then is a restricted strictly local minimizer of , but not vice versa.
Meanwhile, we provide three lemmas needed in later analysis. The first lemma is on the distance between the singular values of two matrices.
Lemma 1: For two matrices and , it holds
This result can be derived directly by extending the Hoffman–Wielandt theorem [47]; it indicates that the "distance" between the respective singular values of two matrices is bounded by the "distance" between the matrices themselves.
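A quick numerical check of Lemma 1 on random matrix pairs (the instance sizes and trial count are arbitrary choices for illustration): the $\ell_2$ distance between the sorted singular value vectors never exceeds the Frobenius distance between the matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
max_violation = -np.inf
for _ in range(200):
    A = rng.standard_normal((6, 4))
    B = rng.standard_normal((6, 4))
    sA = np.linalg.svd(A, compute_uv=False)   # descending order
    sB = np.linalg.svd(B, compute_uv=False)
    lhs = np.linalg.norm(sA - sB)             # distance between spectra
    rhs = np.linalg.norm(A - B, 'fro')        # distance between matrices
    max_violation = max(max_violation, lhs - rhs)
```

In every trial `lhs - rhs` stays nonpositive, consistent with the bound; of course, this is a sanity check and not a proof.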
The following two lemmas present some properties of the gradient and Hessian of a generalized low-rank penalty [46] (the derivations are also provided here, in Appendices A and B).
Lemma 2: For a matrix of rank , , with any truncated SVD , , and contains the corresponding singular vectors. Suppose that is on , denote
Then, and
where is a commutation matrix defined as for .
Lemma 3: Under the condition and definition in Lemma 2, if on , then, and the nonzero eigenvalues of are given by
Further suppose that is a nondecreasing function on , then it holds
III-A. Convergence for a Generalized Nonconvex Penalty
In the following, let denote the matrix , such that . Then, the Hessian of can be expressed as
It is easy to see that . Then, for a generalized nonconvex penalty satisfying the KL property, the global convergence of the PGD algorithm to a stationary point can be derived directly from the results in [37], as follows.
Property 2 [37] [Convergence to stationary point]: Let be a sequence generated by the PGD algorithm (16), and suppose that is a closed, proper, lower semicontinuous function. If , then the following hold:
(i) the sequence is nonincreasing as
and there exists a constant such that ;
(ii) as , converges to a cluster point set, and any cluster point is a stationary point of ;
(iii) further, if there exists a point at which satisfies the KL property, has finite length
and converges to .
Property 2(i) establishes the sufficient decrease property of the objective , which is a basic property desired of a descent algorithm. Property 2(ii) establishes the subsequence convergence of the PGD algorithm, whilst Property 2(iii) establishes its global convergence to a stationary point. Property 2(iii) obviously holds if is a KL function. The global convergence result applies to any generalized nonconvex penalty as long as it satisfies the KL property, which is satisfied by most popular nonconvex penalties, such as the hard, , SCAD and firm thresholding penalties.
III-B. Convergence for Discontinuous Thresholding
Among existing nonconvex penalties, there is an important class with discontinuous thresholding functions (also referred to as "jumping thresholding" in [48]–[50]), including the popular , , MC, firm thresholding and log penalties. For such penalties, we present a deeper analysis of the convergence properties of the PGD algorithm.
The first result is on the rank convergence of the sequence generated by the PGD algorithm.
Lemma 4 [Rank convergence]: Let be a sequence generated by the PGD algorithm (16). Suppose that satisfies Assumptions 1 and 2. If , then for any cluster point , there exist two positive integers and such that, when ,
Proof: See Appendix C.
This lemma implies that the rank of changes only finitely many times. By Lemma 4, when , the rank of freezes, i.e., , . Let be a rank matrix; when , minimizing the objective in (6) is equivalent to minimizing the following objective
(18) 
For , we consider the equivalent objective (18), as is when (as is on by Assumption 1), which facilitates further convergence analysis of . By Lemma 4, the convergence of the whole sequence is equivalent to the convergence of the sequence .
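The rank-freezing behavior of Lemma 4 can also be observed numerically. The sketch below runs PGD with the hard-thresholding SVT on a hypothetical small instance (the spectrum, sampling rate, regularization parameter and problem size are all illustrative choices, not from the paper) and records the iterate rank, which settles after finitely many iterations.

```python
import numpy as np

def pgd_hard_step(X, Y, mask, lam, mu=1.0):
    """One PGD step (16) with the hard-thresholding rule as the prox;
    returns the new iterate and its rank."""
    Z = X - mu * mask * (X - Y)                 # gradient step on the data fit
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    keep = s > np.sqrt(2.0 * lam * mu)          # hard rule: keep or kill
    return U[:, keep] @ np.diag(s[keep]) @ Vt[keep, :], int(keep.sum())

# Rank-3 ground truth with a well-separated spectrum, 60% sampling.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((20, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((20, 3)))
M = U0 @ np.diag([10.0, 8.0, 6.0]) @ V0.T
mask = (rng.random((20, 20)) < 0.6).astype(float)

X = np.zeros_like(M)
ranks = []
for _ in range(200):
    X, r = pgd_hard_step(X, M, mask, lam=4.5)   # threshold sqrt(2*lam) = 3
    ranks.append(r)
```

On such an instance the recorded rank stops changing well before the iteration budget is exhausted, matching the finite rank-convergence claim.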
Next, we provide a global convergence result for discontinuous thresholding penalties.
Theorem 1 [Convergence to local minimizer]: Under the conditions of Lemma 4, suppose that is a KL function or satisfies the KL property at a cluster point of the sequence . If , then converges to a stationary point of . Further, let ; if
(19) 
is a local minimizer of .
The convergence to a stationary point follows directly from Property 2. The convergence to a local minimizer is proved in Appendix D. Let ; then a sufficient condition for (19) is
(20) 
This can be justified as follows. By Lemmas 2 and 3, under Assumption 1, the Hessian of at satisfies
which, together with for any nonempty and the Weyl theorem, implies that condition (19) is satisfied if (20) holds. Obviously, the sufficient condition (20) is satisfied by the hard-thresholding penalty, for which .
Corollary 1 [Convergence for hard thresholding]: Let be a sequence generated by the PGD algorithm (16), and let be the hard-thresholding penalty. If , then converges to a local minimizer of .
Next, we show that the nonconvex matrix completion problem (6) does not have strictly local minimizers, but does have restricted strictly local minimizers. Specifically, if is a strictly local minimizer of with , then for any sufficiently small satisfying , it holds that , hence . However, when , by Assumption 1 and Lemma 3, which together with and the Weyl theorem implies that
That is, cannot be positive definite. Thus, cannot be a strictly local minimizer of , and the set of strictly local minimizers of is empty. Despite this, we have the following result on convergence to a restricted strictly local minimizer. In the following, let denote the submatrix of corresponding to the index subset .
Theorem 2 [Convergence to restricted strictly local minimizer]: Under the conditions of Lemma 4, suppose that is a KL function or satisfies the KL property at a cluster point of the sequence ; then converges to a stationary point of . Further, let ; if
(21) 
is a restricted strictly local minimizer of .
The proof is given in Appendix E. Since , it is easy to see that
Then, the condition in (21) is equivalent to
(22) 
By this theorem, we have the following result for the () penalty.
Corollary 2 [Convergence for penalty]: Let be a sequence generated by the PGD algorithm (16), where is the penalty with . If , then converges to a stationary point of . Further, if
(23) 
then is a restricted strictly local minimizer of .
For the () penalty,
which together with (22) results in the left-hand side of (23). The right-hand condition in (23) follows from the property of the thresholding (see Table I) and (16) that
Furthermore, for the hardthresholding penalty, the convergence to a restricted strictly local minimizer is straightforward if .
III-C. Eventually Linear Convergence Rate for Discontinuous Thresholding
This subsection derives the eventually linear convergence of the PGD algorithm for nonconvex penalties with discontinuous thresholding functions. Before proceeding to the analysis, we first show some properties of the sequence in the neighborhood of .
Consider a neighborhood of as
for any , where is the "jumping" size of the thresholding function (corresponding to in (16)) at its threshold point. Under Assumption 1, by Lemma 3 and since is nondecreasing on , there exists a sufficiently small constant , which is dependent on and satisfies as , such that
(24) 
For the second property, we denote and for some , which have the following full SVD
where , and
Let
Then, it follows that , and
where and . When (hence and ), the range space of , denoted by , tends to be orthogonal to the range space of , denoted by . In other words, let be a vector containing the principal angles between the two range spaces and ; then it follows that
Based on this fact, for each there exists a constant which is dependent on , satisfying as , such that
(25) 
For any , when is a stationary point of the (hence a fixed point of the PGD algorithm, i.e.,