Matrix Completion via Nonconvex Regularization: Convergence of the Proximal Gradient Algorithm

03/02/2019, by Fei Wen et al., Shanghai Jiao Tong University

Matrix completion has attracted much interest in machine learning and computer vision over the past decade. For low-rank promotion in matrix completion, the nuclear norm penalty is convenient due to its convexity but suffers from a bias problem. Recently, various algorithms using nonconvex penalties have been proposed, among which the proximal gradient descent (PGD) algorithm is one of the most efficient and effective. For the nonconvex PGD algorithm, whether it converges to a local minimizer, and at what rate, remain unclear. This work provides a nontrivial analysis of the PGD algorithm in the nonconvex case. Besides convergence to a stationary point for a generalized nonconvex penalty, we provide a deeper analysis of a popular and important class of nonconvex penalties whose thresholding functions are discontinuous. For such penalties, we establish finite rank convergence, convergence to a restricted strictly local minimizer, and an eventually linear convergence rate of the PGD algorithm. Meanwhile, convergence to a local minimizer is proved for the hard-thresholding penalty. Our result is the first to show that nonconvex regularized matrix completion only has restricted strictly local minimizers, and that the PGD algorithm can converge to such minimizers with an eventually linear rate under certain conditions. An illustration of the PGD algorithm via experiments is also provided. Code is available at https://github.com/FWen/nmc.


I Introduction

Matrix completion deals with the problem of recovering a matrix from its partially observed (possibly noisy) entries, which has attracted considerable interest recently [1]–[4]. The matrix completion problem arises in many applications in signal processing, image/video processing, and machine learning, such as rating value estimation in recommendation systems [7], friendship prediction in social networks, collaborative filtering [8], image processing [6], [10], video denoising [12], [13], system identification [14], multiclass learning [15], [16], and dimensionality reduction [17]. Specifically, the goal of matrix completion is to recover a matrix $X \in \mathbb{R}^{m \times n}$ from its partially observed (incomplete) entries

$M_{ij} = X_{ij}, \quad (i,j) \in \Omega \qquad (1)$

where $\Omega \subset \{1,\dots,m\} \times \{1,\dots,n\}$ is a random subset. Obviously, the completion of an arbitrary matrix is an ill-posed problem. To make the problem well-posed, a commonly used assumption is that the underlying matrix comes from a restricted class, e.g., the class of low-rank matrices. Exploiting the low-rank structure of the matrix is a powerful approach.
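
To fix ideas, the observation operator $\mathcal{P}_\Omega$ simply keeps the observed entries and zeros out the rest. The following minimal sketch (illustrative Python written for this article, not the authors' released code; the mask variable encodes a hypothetical set $\Omega$) shows the observation model:

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 6))  # a rank-3 matrix
mask = rng.random(X_true.shape) < 0.5        # Omega: observe roughly 50% of the entries

def P_Omega(X, mask):
    # Projection onto the observed index set: keep entries in Omega, zero out the rest.
    return mask * X

M_obs = P_Omega(X_true, mask)                # the data available to the algorithm
```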

Modeling the matrix completion problem as a low-rank matrix recovery problem, a natural formulation is to minimize the rank of $X$ under the linear constraint (1), i.e.,

$\min_{X} \ \operatorname{rank}(X) \quad \text{subject to} \quad \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M) \qquad (2)$

where $\mathcal{P}_\Omega$ denotes the projection onto the set $\Omega$, i.e., $[\mathcal{P}_\Omega(X)]_{ij} = X_{ij}$ if $(i,j) \in \Omega$ and $0$ otherwise. Since the rank minimization problem (2) is nonconvex and difficult to solve, a popular convex relaxation is to replace the rank function by its convex envelope, the nuclear norm $\|X\|_*$,

$\min_{X} \ \|X\|_* \quad \text{subject to} \quad \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M). \qquad (3)$

In most realistic applications, entry-wise noise is inevitable. Taking entry-wise noise into consideration, a robust variant of (3) is

$\min_{X} \ \|X\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F \le \delta \qquad (4)$

where $\delta \ge 0$ is the noise tolerance. This constrained formulation (4) can be converted into an unconstrained form as

$\min_{X} \ \lambda \|X\|_* + \frac{1}{2}\|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2 \qquad (5)$

where $\lambda > 0$ is a regularization parameter related to the noise tolerance parameter $\delta$ in (4). The unconstrained formulation is favorable in some applications since existing efficient first-order convex algorithms, such as the alternating direction method of multipliers (ADMM) or the proximal gradient descent (PGD) algorithm, can be directly applied. Even in the noise-free case, the solution of (5) can accurately approach that of (3) by choosing a sufficiently small value of $\lambda$, since the solution of (5) approaches a solution of (3) as $\lambda \to 0$. The problems (3) and (4) can be recast as semi-definite programming (SDP) problems and solved to a global minimizer by well-established SDP solvers when the matrix dimension is not large. For problems of larger size, more efficient first-order algorithms have been developed based on the formulation (5), e.g., variants of the proximal gradient method [19], [20].

Besides the tractability of the convex formulations (3)–(5) employing the nuclear norm, theoretical guarantees provided in [1], [2], [21], [22] demonstrate that, under certain conditions, e.g., when the low-rank matrix satisfies an incoherence condition and the observed entries are sampled uniformly at random, the matrix can be exactly recovered from a small portion of its entries with high probability by using nuclear norm regularization. However, nuclear norm regularization has a bias problem and introduces bias into the recovered singular values [23]–[25]. To alleviate the bias problem and achieve better recovery performance, a nonconvex low-rank penalty, such as the Schatten-$q$ norm (which is in fact the $\ell_q$ norm of the matrix singular values with $0 < q < 1$), the smoothly clipped absolute deviation (SCAD), the minimax concave (MC), or the firm-thresholding penalty, can be used. In the past few years, nonconvex regularization has shown better performance than convex regularization in many applications involving sparse and low-rank recovery, including compressive sensing, sparse regression, sparse demixing, sparse covariance and precision matrix estimation, and robust principal component analysis [9], [26].

In this work, we consider the following formulation for matrix completion

$\min_{X} \ \lambda \Psi(X) + \frac{1}{2}\|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2 \qquad (6)$

where $\Psi(\cdot)$ is a generalized nonconvex low-rank promoting penalty. For the particular case of $\Psi$ being the nuclear norm, i.e., $\Psi(X) = \|X\|_*$, this formulation reduces to (5). Existing works considering the nonconvex formulation (6) include [27]–[31]. In [27], [28], the Schatten-$q$ norm has been considered and PGD methods have been proposed. In [29], using a smoothed Schatten-$q$ norm, an iteratively reweighted algorithm has been designed for (6), which involves solving a sequence of linear equations. Another iteratively reweighted algorithm, for the Schatten-$q$ norm regularized matrix minimization problem with a generalized smooth loss function, has been investigated in [30]. More recently, in [31], $\Psi$ being the MC penalty has been considered and an ADMM algorithm has been developed.

Besides, for the linearly constrained formulation, an iterative algorithm employing the Schatten-$q$ norm, which monotonically decreases the objective, has been proposed in [32]. Meanwhile, a truncated nuclear norm has been used in [33]. Robust matrix completion using Schatten-$q$ regularization has been considered in [34]. Moreover, it has been shown in [35] that the sufficient condition for reliable recovery under Schatten-$q$ norm regularization is weaker than that under nuclear norm regularization.

Among the nonconvex algorithms for problem (6), only subsequence convergence of the methods in [27]–[31] has been proved. In fact, based on recent convergence results for nonconvex and nonsmooth optimization [36]–[38], global convergence of the PGD algorithm [27], [28] and of the ADMM algorithm [31] to a stationary point can be guaranteed under mild conditions. However, for a nonconvex $\Psi$, whether these algorithms converge to a local minimizer is still unclear. Meanwhile, for problem (6), a linear convergence rate of the PGD algorithm has been established under certain conditions when $\Psi$ is the nuclear norm [39], [40], but the convergence rate of PGD in the case of a nonconvex $\Psi$ is still an open problem.

To address these problems, this work provides a thorough analysis of the PGD algorithm for the matrix completion problem (6) with a generalized nonconvex penalty. The main contributions are as follows.

I-A Contribution

First, we derive some properties of the gradient and Hessian of a generalized low-rank penalty, which are important for the convergence analysis. Then, for a popular and important class of nonconvex penalties whose thresholding functions are discontinuous, we establish the following convergence properties for the PGD algorithm under certain conditions:

1) rank convergence within finitely many iterations;

2) convergence to a restricted strictly local minimizer;

3) convergence to a local minimizer for the hard-thresholding penalty;

4) an eventually linear convergence rate.

As the singular value thresholding function depends implicitly on the low-rank matrix, the derivation is nontrivial. Finally, an illustration of the PGD algorithm via inpainting experiments is provided.

It is worth noting that there exists a line of recent works on factorization-based nonconvex algorithms, e.g., [5], [11], [18]. It has been shown that the corresponding nonconvex objective function has no spurious local minima, and that efficient nonconvex optimization algorithms can converge to a local minimizer. While those works focus on matrix factorization based methods, this work considers the general matrix completion problem (6). Our result is the first to show that the nonconvex matrix completion problem (6) only has restricted strictly local minimizers, and that the PGD algorithm can converge to such minimizers with an eventually linear rate under certain conditions.

Outline: The rest of this paper is organized as follows. Section II introduces the proximity operator for a generalized nonconvex penalty and reviews the PGD algorithm for matrix completion. Section III provides the convergence analysis of the PGD algorithm. Section IV provides experimental results on inpainting. Finally, Section V ends the paper with concluding remarks.

TABLE I: Proximity operators for some popular regularization penalties: (i) the hard-thresholding penalty, (ii) the soft-thresholding penalty, and (iii) the $\ell_q$-norm penalty, each listed with its penalty formulation and proximity operator.

Notations: For a matrix $X$, $\operatorname{rank}(X)$, $\operatorname{tr}(X)$, $\|X\|_F$ and $\operatorname{range}(X)$ stand for the rank, trace, Frobenius norm and range space of $X$, respectively, whilst $\sigma_i(X)$ denotes the $i$-th largest singular value and $\sigma(X)$ denotes the vector of singular values sorted in descending order.

For a symmetric real matrix $X$, $\lambda_{\max}(X)$ and $\lambda_{\min}(X)$ respectively denote the maximal and minimal eigenvalues, whilst $\lambda(X)$ contains the eigenvalues sorted in descending order. $X \succeq 0$ and $X \succ 0$ mean that $X$ is positive semi-definite and positive definite, respectively. $X_{ij}$ denotes the $(i,j)$-th element.

$\operatorname{vec}(X)$ is the "vectorization" operator stacking the columns of the matrix one below another.

$\operatorname{diag}(x)$ represents the diagonal matrix generated by the vector $x$, and $\operatorname{diag}(X)$ represents the vector containing the diagonal elements of $X$. $\|\cdot\|_2$ denotes the Euclidean norm. $\odot$ and $\otimes$ denote the Hadamard and Kronecker products, respectively. $\langle \cdot, \cdot \rangle$ and $(\cdot)^T$ denote the inner product and transpose, respectively. $\operatorname{sign}(\cdot)$ denotes the sign of a quantity, with $\operatorname{sign}(0) = 0$. $I$ is an identity matrix. $0$ is a zero vector or matrix of proper size.

II Proximity Operator and Proximal Gradient Algorithm

This section introduces the proximity operator for nonconvex regularization and the PGD algorithm for the matrix completion problem (6).

II-A Proximal Operator for Nonconvex Penalties

For a proper and lower semicontinuous penalty function $P(\cdot)$, the corresponding proximity operator is defined as

$\operatorname{prox}_{\lambda P}(t) = \arg\min_{x} \left\{ \frac{1}{2}(x - t)^2 + \lambda P(x) \right\} \qquad (7)$

where $\lambda > 0$ is a penalty parameter.

Table I shows several popular penalties along with their thresholding functions. The proximal minimization problem (7) can be computed efficiently for many popular nonconvex penalties. Hard thresholding is a natural choice for sparsity promotion, while soft thresholding is the most popular due to its convexity. The $\ell_q$ penalty with $0 < q < 1$ bridges the gap between the hard- and soft-thresholding penalties. Except for the two known cases $q = 1/2$ and $q = 2/3$, the proximity operator of the $\ell_q$ penalty does not have a closed-form expression, but it can be computed efficiently by an iterative method. Moreover, there also exist other nonconvex penalties, including the $p$-shrinkage penalty [41], [42], SCAD [43], MC [44] and the firm-thresholding penalty [45].
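
To make the scalar thresholding operators concrete, the following Python sketch (an illustration written for this article, not the authors' released implementation) evaluates the soft and hard proximity operators in their standard closed forms and approximates the $\ell_q$ proximity operator by a crude grid search:

```python
import numpy as np

def prox_soft(y, lam):
    # Soft thresholding: proximity operator of lam*|x| (standard closed form).
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_hard(y, lam):
    # Hard thresholding: proximity operator of lam*1{x != 0};
    # the output jumps from 0 to about sqrt(2*lam) at the threshold |y| = sqrt(2*lam).
    return np.where(np.abs(y) > np.sqrt(2.0 * lam), y, 0.0)

def prox_lq(y, lam, q, grid=10000):
    # l_q penalty (0 < q < 1): no closed form for general q, so minimize
    # 0.5*(x - y)^2 + lam*|x|^q on a grid (crude sketch; the paper relies on
    # a more efficient iterative solver).
    y = np.atleast_1d(np.asarray(y, dtype=float))
    out = np.zeros_like(y)
    for i, yi in enumerate(y):
        x = np.linspace(0.0, abs(yi), grid)
        obj = 0.5 * (x - abs(yi)) ** 2 + lam * x ** q
        out[i] = np.sign(yi) * x[np.argmin(obj)]
    return out

if __name__ == "__main__":
    # Bias illustration: soft thresholding shrinks a large input by lam,
    # while hard thresholding leaves it untouched.
    print(prox_soft(5.0, 1.0))   # 4.0 (biased)
    print(prox_hard(5.0, 1.0))   # 5.0 (unbiased)
    print(prox_lq(5.0, 1.0, 0.5))
```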

As shown in Fig. 1, soft thresholding imposes a constant shrinkage on the parameter when the parameter magnitude exceeds the threshold and thus has a bias problem. The hard- and SCAD-thresholding functions are unbiased for large parameter magnitudes. The other nonconvex thresholding functions are sandwiched between the hard- and soft-thresholding functions, which can mitigate the bias problem of soft thresholding. For a generalized nonconvex penalty, we make the following assumptions.

Fig. 1: Thresholding/shrinkage function outputs (with the same threshold).

Assumption 1: $P(\cdot)$ is an even folded concave function which satisfies the following conditions:

(i) $P(x)$ is non-decreasing on $[0, \infty)$ with $P(0) = 0$;

(ii) for any , there exists a such that for any ;

(iii) $P(x)$ is continuous on $[0, \infty)$, and $C^2$ on $(0, \infty)$;

(iv) the first-order derivative $P'(x)$ is convex on $(0, \infty)$ and $P'(x) \to 0$ as $x \to \infty$.

This assumption implies that $P$ is coercive, weakly sequentially lower semi-continuous, and capable of sparsity promotion.
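
As a concrete (partial) check of these conditions, consider the standard $\ell_q$ penalty $P(x) = |x|^q$ with $0 < q < 1$; the short derivation below verifies the conditions that can be stated without the constants elided above:

```latex
\[
P(x)=|x|^{q},\ 0<q<1:\qquad
P'(x)=q\,x^{q-1},\qquad
P''(x)=q(q-1)\,x^{q-2}<0,\qquad
P'''(x)=q(q-1)(q-2)\,x^{q-3}>0 \quad (x>0),
\]
so $P$ is even and non-decreasing on $[0,\infty)$ with $P(0)=0$, it is $C^{2}$ and concave
(folded concave) on $(0,\infty)$, its derivative $P'$ is convex on $(0,\infty)$,
and $P'(x)\to 0$ as $x\to\infty$.
```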

II-B Generalized Singular Value Thresholding

For a matrix $X \in \mathbb{R}^{m \times n}$, low-rank inducing on $X$ can be achieved via sparsity inducing on its singular values as

$\Psi(X) = \sum_{i} P(\sigma_i(X)) \qquad (8)$

where $P(\cdot)$ is a sparsity-inducing penalty. For the particular cases of $P$ being the $\ell_0$, $\ell_q$ and $\ell_1$ norm, $\Psi(X)$ becomes the rank, the Schatten-$q$ norm and the nuclear norm of $X$, respectively. For such a low-rank penalty, define the corresponding proximal operator

$\operatorname{prox}_{\lambda \Psi}(Y) = \arg\min_{X} \left\{ \frac{1}{2}\|X - Y\|_F^2 + \lambda \Psi(X) \right\}. \qquad (9)$

Property 1. [Generalized singular value thresholding]: Let $Y = U \Sigma V^T$ be any full singular value decomposition (SVD) of $Y$, where $U$ and $V$ contain the left and right singular vectors, respectively. Then, the proximal minimization problem (9) is solved by the singular-value thresholding operator

$\operatorname{prox}_{\lambda \Psi}(Y) = U \operatorname{diag}\big(\operatorname{prox}_{\lambda P}(\sigma(Y))\big) V^T \qquad (10)$

where the scalar proximity operator $\operatorname{prox}_{\lambda P}(\cdot)$ in (7) is applied element-wise to the singular values $\sigma(Y)$.

Although this property can be derived by straightforwardly extending Lemma 1 in [7], we provide here a completely different and more intuitive derivation. Assume that the minimizer of (9) is of rank $r$ with any truncated SVD, where the factors collect the leading singular vectors and values. Then, the objective in (9) can be equivalently rewritten as

(11)

By Assumption 1, $P$ is differentiable on $(0, \infty)$; hence, $\Psi$ is differentiable with respect to the rank-$r$ matrix. Denote

where $P'(\cdot)$ is the first-order derivative of $P$; then we have (see Appendix A)

(12)

Let , and use , , it follows from (12) that

Since and are diagonal, and the columns of (also ) are orthogonal, it is easy to see that there exists a full SVD such that

(13)

Substituting these relations into (11) yields

(14)

where the vector on the right-hand side contains the singular values of the corresponding matrix. As (14) is separable, it can be solved element-wise as in (7). Further, the thresholding function is nondecreasing on $[0, \infty)$ by Assumption 1. Thus, the solution must contain the largest singular values in the same descending order. Consequently, this together with (13) results in (10).
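
The generalized singular value thresholding step of Property 1 can be sketched in a few lines of illustrative Python (reusing the scalar proximity operators from the earlier sketch; prox_scalar is a placeholder for any of them, and the function names are assumptions for this article only):

```python
import numpy as np

def svt(Y, lam, prox_scalar):
    # Generalized singular value thresholding, as in eq. (10):
    # apply a scalar proximity operator to the singular values of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = prox_scalar(s, lam)          # element-wise thresholding of singular values
    return (U * s_thr) @ Vt              # U diag(s_thr) V^T

# Example with soft thresholding (the nuclear-norm case):
# X = svt(Y, 0.5, prox_soft)
```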

II-C PGD Algorithm for Matrix Completion

PGD is a powerful optimization algorithm suitable for many large-scale problems arising in signal/image processing, statistics and machine learning. It can be viewed as a variant of majorization-minimization algorithms with a particular choice of quadratic majorization. Let

$f(X) = \frac{1}{2}\|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2 .$

The core idea of the PGD algorithm is to consider, at the $k$-th iteration, a linearized approximation of $f$ at the current point $X^k$ plus a proximal term, as

$Q(X, X^k) = f(X^k) + \langle \nabla f(X^k), X - X^k \rangle + \frac{1}{2\mu}\|X - X^k\|_F^2 + \lambda \Psi(X) \qquad (15)$

where $\nabla f(X^k) = \mathcal{P}_\Omega(X^k) - \mathcal{P}_\Omega(M)$ and $\mu > 0$ is a proximal parameter. Then, minimizing (15) is an instance of the proximity operator (9), i.e.,

$X^{k+1} = \operatorname{prox}_{\lambda \mu \Psi}\big(X^k - \mu \nabla f(X^k)\big) \qquad (16)$

which can be computed as in (10).

In the PGD algorithm, the dominant computational load in each iteration is the SVD calculation. To further improve the efficiency of the algorithm and make it scale well to large-scale problems, techniques such as approximate SVD or PROPACK [7], [19] can be adopted.
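
Putting the pieces together, one PGD iteration for problem (6) is a masked gradient step followed by generalized singular value thresholding. The sketch below is illustrative only, with hypothetical default parameters, and assumes the svt and scalar prox functions from the earlier sketches; the released code at https://github.com/FWen/nmc is the reference implementation:

```python
import numpy as np

def pgd_matrix_completion(M, mask, lam, prox_scalar, mu=0.99,
                          max_iter=500, tol=1e-6):
    # PGD for (6): minimize lam*Psi(X) + 0.5*||P_Omega(X - M)||_F^2.
    # mask is a 0/1 array marking the observed entries (the set Omega);
    # the gradient of the quadratic loss is P_Omega(X - M), which is
    # 1-Lipschitz, so a step size mu < 1 is used here.
    X = mask * M                      # simple initialization from the observed data
    for _ in range(max_iter):
        grad = mask * (X - M)         # gradient of the data-fidelity term
        X_new = svt(X - mu * grad, lam * mu, prox_scalar)   # proximal step (16)
        if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
            X = X_new
            break
        X = X_new
    return X
```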

III Convergence Analysis

This section investigates the convergence properties of the PGD algorithm, with special consideration of the class of nonconvex penalties whose thresholding functions are discontinuous. First, we make some assumptions on the discontinuity of such thresholding functions.

Assumption 2: $P(\cdot)$ satisfies Assumption 1, and the corresponding proximity operator takes the form

(17)

where is defined on as , for any and . is the threshold point given by . is the “jumping” size at the threshold point. is continuous on and the range of is .

A significant property of such a nonconvex penalty is the jump discontinuity of its thresholding function. Typical nonconvex penalties satisfying this property include the $\ell_0$, $\ell_q$, and log penalties.
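
As a concrete instance of such a jump, the standard hard-thresholding operator (the $\ell_0$ case) satisfies:

```latex
\[
\mathrm{prox}_{\lambda\|\cdot\|_0}(t)
  = \begin{cases} 0, & |t| < \sqrt{2\lambda},\\[2pt] t, & |t| > \sqrt{2\lambda},\end{cases}
\qquad\text{so the output jumps from } 0 \text{ to } \pm\sqrt{2\lambda}
\text{ at the threshold } |t| = \sqrt{2\lambda}.
\]
```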

In the analysis, the Kurdyka-Łojasiewicz (KL) property of the objective function is used. Based on a "uniformization" result [36], using the KL property considerably simplifies the main arguments and avoids involved induction reasoning.

Definition 1. [KL property]: For a proper function $F$ and any $\bar{x} \in \operatorname{dom}\partial F$, if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{x}$ and a continuous concave function $\varphi : [0, \eta) \to \mathbb{R}_+$ such that:

(i) $\varphi(0) = 0$ and $\varphi$ is continuously differentiable on $(0, \eta)$ with positive derivatives;

(ii) for all $x \in U$ satisfying $F(\bar{x}) < F(x) < F(\bar{x}) + \eta$, it holds that $\varphi'\big(F(x) - F(\bar{x})\big)\,\operatorname{dist}\big(0, \partial F(x)\big) \ge 1$;

then $F$ is said to have the KL property at $\bar{x}$. Further, if a proper closed function satisfies the KL property at all points in $\operatorname{dom}\partial F$, it is called a KL function.

Furthermore, we define the restricted strictly local minimizer as follows. Let $\mathcal{P}_{\Omega^c}$ denote the projection onto the complement $\Omega^c$ of $\Omega$.

Definition 2. [Restricted strictly local minimizer]: For a proper function $F$, a point $\bar{x}$ and a given index subset, if there exists a neighborhood of $\bar{x}$ such that the strict inequality $F(\bar{x}) < F(x)$ holds for any $x$ in the neighborhood whose component restricted to that subset differs from the corresponding component of $\bar{x}$, then

$\bar{x}$ is said to be a restricted strictly local minimizer of $F$ with respect to that subset.

It is obvious that if $\bar{x}$ is a strictly local minimizer of $F$, then $\bar{x}$ is also a restricted strictly local minimizer of $F$, but not vice versa.
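
A toy illustration of the distinction (constructed here only for intuition; it is not an example from the paper):

```latex
\[
F(x_1, x_2) = x_1^2 \ \text{ on } \mathbb{R}^2 :\quad
\text{no point is a strictly local minimizer, since } F \text{ is constant along } x_2 ,
\]
\[
\text{yet any point } (0, c) \text{ is a strict minimizer with respect to perturbations
restricted to the } x_1 \text{ coordinate.}
\]
```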

Meanwhile, we provide three lemmas needed in the later analysis. The first lemma concerns the distance between the singular values of two matrices.

Lemma 1: For two matrices $A$ and $B$ of the same size, it holds that

$\sum_{i} \big(\sigma_i(A) - \sigma_i(B)\big)^2 \le \|A - B\|_F^2 .$

This result can be directly derived by extending the Hoffman-Wielandt theorem [47], and it indicates that the "distance" between the respective (ordered) singular values of two matrices is bounded by the "distance" between the matrices.
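
A quick numerical sanity check of this bound (illustrative only; random matrices and standard NumPy calls):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 4))
sA = np.linalg.svd(A, compute_uv=False)   # singular values in descending order
sB = np.linalg.svd(B, compute_uv=False)
lhs = np.sum((sA - sB) ** 2)
rhs = np.linalg.norm(A - B, 'fro') ** 2
print(lhs <= rhs + 1e-12)                  # True: Hoffman-Wielandt-type bound
```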

The following two lemmas present some properties of the gradient and Hessian of a generalized low-rank penalty [46] (the derivations are also provided here, in Appendices A and B).

Lemma 2: Consider a matrix of rank $r$ with any truncated SVD, where the factors contain the corresponding singular values and vectors. Suppose that $P$ is differentiable on $(0, \infty)$, and denote

Then, and

where is a commutation matrix defined as for .

Lemma 3: Under the conditions and definitions in Lemma 2, if $P$ is $C^2$ on $(0, \infty)$, then the nonzero eigenvalues of the Hessian are given by

Further suppose that the relevant quantity is a nondecreasing function on $(0, \infty)$; then it holds that

III-A Convergence for a Generalized Nonconvex Penalty

In the following, let denote the matrix , such that . Then, the Hessian of can be expressed as

It is easy to see that . Then, for a generalized nonconvex penalty satisfying the KL property, the global convergence of the PGD algorithm to a stationary point can be directly derived from the results in [37], which is given as follows.

Property 2 [37]. [Convergence to a stationary point]: Let $\{X^k\}$ be a sequence generated by the PGD algorithm (16), and suppose that the objective is a closed, proper, lower semi-continuous function. If the proximal parameter satisfies the step-size condition, the following hold:

(i) the sequence is nonincreasing as

and there exists a constant such that ;

(ii) as $k \to \infty$, the sequence $\{X^k\}$ approaches its (nonempty) set of cluster points, and any cluster point is a stationary point of the objective;

(iii) further, if the objective satisfies the KL property at a cluster point, then the sequence has finite length

and converges to that cluster point.

Property 2(i) establishes the sufficient decrease property of the objective, which is a basic property desired of a descent algorithm. Property 2(ii) establishes the subsequence convergence of the PGD algorithm, whilst (iii) establishes the global convergence of the PGD algorithm to a stationary point. Property 2(iii) obviously holds if the objective is a KL function. The global convergence result applies to a generalized nonconvex penalty as long as it satisfies the KL property. The KL property is satisfied by most popular nonconvex penalties, such as the hard-thresholding, $\ell_q$, SCAD and firm-thresholding penalties.
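
For reference, the sufficient decrease property in Property 2(i) typically takes the following standard form for proximal gradient methods (stated here generically under the step-size condition $\mu < 1/L_f$, where $L_f$ is the Lipschitz constant of $\nabla f$; for the matrix completion loss $f(X) = \tfrac{1}{2}\|\mathcal{P}_\Omega(X - M)\|_F^2$ one has $L_f = 1$):

```latex
\[
F(X^{k+1}) \;\le\; F(X^k) \;-\; \frac{1}{2}\Big(\frac{1}{\mu} - L_f\Big)\,\|X^{k+1} - X^k\|_F^2 ,
\qquad
\sum_{k} \|X^{k+1} - X^k\|_F^2 < \infty .
\]
```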

III-B Convergence for Discontinuous Thresholding

Among existing nonconvex penalties, there is an important class whose thresholding functions are discontinuous (also referred to as "jumping thresholding" in [48, 49, 50]), including the popular $\ell_0$, $\ell_q$, MC, firm-thresholding and log penalties. For such penalties, we present a deeper analysis of the convergence properties of the PGD algorithm.

The first result is on the rank convergence of the sequence generated by the PGD algorithm.

Lemma 4. [Rank convergence]: Let $\{X^k\}$ be a sequence generated by the PGD algorithm (16). Suppose that $P$ satisfies Assumptions 1 and 2 and that the step-size condition holds; then for any cluster point, there exist two positive integers such that, when $k$ is sufficiently large,

Proof: See Appendix C.

This lemma implies that the rank of $X^k$ changes only finitely many times. By Lemma 4, when $k$ is sufficiently large, the rank of $X^k$ freezes, i.e., it remains equal to some constant rank $r$. For rank-$r$ iterates, minimizing the objective in (6) is equivalent to minimizing the following objective

(18)

For sufficiently large $k$, we consider the equivalent objective (18), which is $C^2$ in the relevant variables (as $P$ is $C^2$ on $(0, \infty)$ by Assumption 1); this facilitates further convergence analysis. By Lemma 4, the convergence of the whole sequence $\{X^k\}$ is equivalent to the convergence of the sequence generated for the equivalent objective.

Next, we provide a global convergence result for discontinuous thresholding penalties.

Theorem 1. [Convergence to a local minimizer]: Under the conditions of Lemma 4, suppose that the objective is a KL function or satisfies the KL property at a cluster point of the sequence $\{X^k\}$; then, under the step-size condition, $\{X^k\}$ converges to a stationary point. Further, if

(19)

then the limit point is a local minimizer of the objective.

The convergence to a stationary point follows directly from Property 2. The convergence to a local minimizer is proved in Appendix D. With the notation above, a sufficient condition for (19) is

(20)

This can be justified as follows. By Lemmas 2 and 3, under Assumption 1, the Hessian of the penalty at the limit point satisfies

which, together with the above relation for any nonempty $\Omega$ and the Weyl theorem, implies that condition (19) is satisfied if (20) holds. Obviously, the sufficient condition (20) is satisfied by the hard-thresholding penalty, for which $P''(x) = 0$ for all $x > 0$.

Corollary 1. [Convergence for hard thresholding]: Let $\{X^k\}$ be a sequence generated by the PGD algorithm (16) with $P$ being the hard-thresholding penalty; then, under the step-size condition, $\{X^k\}$ converges to a local minimizer of the objective.

Next, we show that the nonconvex matrix completion problem (6) does not have strictly local minimizers, but does have restricted strictly local minimizers. Specifically, suppose a point is a strictly local minimizer of the objective; then, for any sufficiently small admissible perturbation, the objective strictly increases, and the corresponding second-order condition must hold. However, by Assumption 1 and Lemma 3, together with the Weyl theorem, it follows that

That is, the Hessian cannot be positive definite. Thus, such a point cannot be a strictly local minimizer, and the set of strictly local minimizers of the objective is empty. Despite this, we have the following result on convergence to a restricted strictly local minimizer. In the following, a subscript index set on a matrix denotes the submatrix corresponding to that index subset.

Theorem 2. [Convergence to a restricted strictly local minimizer]: Under the conditions of Lemma 4, suppose that the objective is a KL function or satisfies the KL property at a cluster point of the sequence $\{X^k\}$; then $\{X^k\}$ converges to a stationary point. Further, if

(21)

then the limit point is a restricted strictly local minimizer of the objective.

The proof is given in Appendix E. From the definitions, it is easy to see that

Then, the condition in (21) is equivalent to

(22)

By this theorem, we have the following result for the $\ell_q$ ($0 < q < 1$) penalty.

Corollary 2. [Convergence for the $\ell_q$ penalty]: Let $\{X^k\}$ be a sequence generated by the PGD algorithm (16) with $P$ being the $\ell_q$ penalty, $0 < q < 1$; then, under the step-size condition, $\{X^k\}$ converges to a stationary point of the objective. Further, if

(23)

then the limit point is a restricted strictly local minimizer of the objective.

For the $\ell_q$ ($0 < q < 1$) penalty,

which together with (22) results in the left-hand condition in (23). The right-hand condition in (23) follows from the property of the $\ell_q$-thresholding operator (see Table I) and from (16), namely that

Furthermore, for the hard-thresholding penalty, the convergence to a restricted strictly local minimizer is straightforward if the corresponding condition holds.

III-C Eventually Linear Convergence Rate for Discontinuous Thresholding

This subsection derives the eventually linear convergence of the PGD algorithm for nonconvex penalties with discontinuous thresholding functions. Before proceeding to the analysis, we first show some properties of the sequence in a neighborhood of its limit point.

Consider a neighborhood of as

for any such point, where the relevant constant is the "jumping" size of the thresholding function (corresponding to the operator in (16)) at its threshold point. Under Assumption 1, by Lemma 3 and the nondecreasing property on $(0, \infty)$, there exists a sufficiently small constant, depending on the indicated quantities, such that

(24)

For the second property, we denote two matrices for some index, which have the following full SVDs

where , and

Let

Then, it follows that , and

where the two residual terms are defined as above. When the limit is approached (hence both residuals vanish), the range space of the first matrix tends to be orthogonal to the range space of the second. In other words, letting a vector contain the principal angles between the two range spaces, it follows that

Based on this fact, for each index there exists a constant, which depends on the indicated quantities and satisfies the stated limit as