# Provable Low Rank Plus Sparse Matrix Separation Via Nonconvex Regularizers

This paper considers a large class of problems where we seek to recover a low rank matrix and/or sparse vector from some set of measurements. While methods based on convex relaxations suffer from a (possibly large) estimator bias, and other nonconvex methods require the rank or sparsity to be known a priori, we use nonconvex regularizers to minimize the rank and l_0 norm without the estimator bias from the convex relaxation. We present a novel analysis of the alternating proximal gradient descent algorithm applied to such problems, and bound the error between the iterates and the ground truth sparse and low rank matrices. The algorithm and error bound can be applied to sparse optimization, matrix completion, and robust principal component analysis as special cases of our results.

Comments

There are no comments yet.

## Authors

• 2 publications
• 3 publications
• ### Factor Group-Sparse Regularization for Efficient Low-Rank Matrix Recovery

This paper develops a new class of nonconvex regularizers for low-rank m...
11/13/2019 ∙ by Jicong Fan, et al. ∙ 0

read it

• ### Exactly Robust Kernel Principal Component Analysis

We propose a novel method called robust kernel principal component analy...
02/28/2018 ∙ by Jicong Fan, et al. ∙ 0

read it

• ### A Nonconvex Projection Method for Robust PCA

Robust principal component analysis (RPCA) is a well-studied problem wit...
05/21/2018 ∙ by Aritra Dutta, et al. ∙ 5

read it

• ### Structured Low-Rank Matrix Factorization with Missing and Grossly Corrupted Observations

Recovering low-rank and sparse matrices from incomplete or corrupted obs...
09/03/2014 ∙ by Fanhua Shang, et al. ∙ 0

read it

• ### Global Optimality Guarantees for Nonconvex Unsupervised Video Segmentation

In this paper, we consider the problem of unsupervised video object segm...
07/09/2019 ∙ by Brendon G. Anderson, et al. ∙ 0

read it

• ### Sparse spectral estimation with missing and corrupted measurements

Supervised learning methods with missing data have been extensively stud...
11/26/2018 ∙ by Andreas Elsener, et al. ∙ 0

read it

• ### Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview

Substantial progress has been made recently on developing provably accur...
09/25/2018 ∙ by Yuejie Chi, et al. ∙ 0

read it

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In order to better understand large datasets and to make inferences about them, it is helpful to understand the underlying patterns in the datasets. Even when the underlying pattern is highly nonlinear, the data matrix can be approximated as being low rank, an observation that enables techniques to analyze the data in terms of a low dimensional latent space, such as Principal Component Analysis (PCA), identifying outliers through Robust PCA (RPCA), and accurately inferring data points from very few observations of a data matrix through matrix completion.

Data analysis techniques based upon this low rank property have received much attention in the past decade, with impressive computational results on large matrices and theoretical results guaranteeing the success of RPCA and matrix completion [8][4]

. Many of these results are based on minimizing the nuclear norm of a matrix (defined as the sum of the singular values) as a surrogate for the rank function, similar to minimizing the

norm to promote sparsity in a vector.

While the convex relaxation is an incredibly useful technique in many applications, minimizing the nuclear norm of a matrix has been shown to introduce a (sometimes very large) estimator bias. Intuitively, we expect to see this bias because if we hope to recover a rank matrix, we must impose enough weight on the nuclear norm term so that the th singular value is zero. By the nature of the nuclear norm, this requires also putting weight on minimizing the first singular values, resulting in a bias towards zero proportional to the spectral norm of the noise added to the true data matrix.

Fortunately, recent work has shown that the estimator bias from convex regularizers can be reduced (or even eliminated, for well conditioned matrices) by using nonconvex regularizers, such as the Schatten-p norm or the minimax concave penalty (MCP). It has been shown that for sparse optimization, the nonconvexity introduced from these regularizers does not create a further burden in the the optimization process – in the right circumstances, the nonconvex problem has just one minimizer [21]. Similar results for rank minimization problems have been previously unavailable, a gap that we have aimed to fill in this paper.

### 1.1 Summary of Contributions

In this paper, we focus on the nonconvex, unconstrained optimization problem where we find a low-rank matrix and sparse vector .

 min L,sλLd1d2ΦγL(L)+λsdsϕγs(s)+12n||AL(L)+ASs−b||22 (1)

The linear mappings and serve as the observation models of the underlying low rank matrices and sparse vectors. Most commonly, we are interested in the observation model

 [AΩobs(X)]k=Xik,jk

for , where is the set of indices where we have a measurement of the low rank matrix we hope to reconstruct.

We denote to be a concave function used to promote sparsity in both the singular values of and individual entries in . We overload the notation to allow for to be a function of a vector whose range is , and we denote as a surrogate to the rank function:

 ϕγs(x)=ds∑i=1ϕγs(xi),ΦγL(X)=min(d1,d2)∑i=1ϕγL(σi(X)).

where denotes the th largest singular value of . We restrict our focus to nonconvex regularizers that are amenable regularizers, as described in [20], and defined below.

###### Definition 1.

A function is amenable if it satisfies the following criteria.

1. is non decreasing

2. For , the function is non increasing in .

3. the function is differentiable for all and subdifferentiable at with .

4. The function is weakly convex. That is, the function is convex.

We present a very simple alternating algorithm to find a stationary point of (1). Though similar algorithms have been previously studied and shown to converge, we present a novel analysis of this algorithm and show that not only does this algorithm linearly converge, but it converges to the exact low rank matrix and sparse vector when no Gaussian noise is present. Furthermore, when Gaussian noise is present, we obtain an error bound that matches the minmax optimal rate. Our bound greatly improves upon bounds obtained in previous results analysing the convex relaxation and quantify the observation based on computational results that nonconvex regularizers reduce the impact of noise on the quality of the estimator.

### 1.2 Related Works

The matrix completion problem is as follows: given the values of a low rank matrix for only a sparse set of indices, we seek to determine the rest of the values of the matrix. While the problem of finding the minimum rank matrix that fits the observations is NP hard in general, it has been shown that under some assumptions, the global minimizer to the convex problem

 minX∈Rd1×d2||X||∗s.t.% Xij=Mij∀(i,j)∈Ωobs

is exactly , where is the set of indices of we have observed. If is rank and

incoherent (as defined in section 3.1), then with high probability for a set,

of indices chosen uniformly at random, is the unique minimizer to the convex relaxation so long as for some universal constant [8, 4]. This condition was later improved to [29].

Using a convex relaxation for Robust PCA has similar results. While PCA is a powerful technique, it has been shown to be less reliable when just a sparse set of data points are grossly corrupted, and so the goal of RPCA is to identify and remove such corruptions by separating the data matrix into the sum of a low rank and sparse matrix.

 minL,s||L||∗+λ0||S||1 st Lij+Sij=Mij∀(i,j)∈Ωobs

The convex relaxation was shown to give the exact solution when every entry of is observed in [10], and when only partially observed (under the same assumptions necessary in matrix completion) by [6]. In contrast to this paper, these works assume that there is measurement noise (besides the sparse corrupted entries).

In many cases of practical interest, our measurements may have some level of noise in addition to being only partially observed or having some corrupted entries. For matrix completion, we can relax the constraint using a penalty formulation as follows:

 minX∈Rd1×d2λ||X||∗+∑(i,j)∈Ωobs(Xij−Mij)2

Likewise, RPCA can be formulated as solving:

 minL,S∈Rd1×d2 λL||L||∗+λs||S||1+∑(i,j)∈Ωobs(Lij+Sij−Mij)2 (2)

Statistical guarantees on the performance of the first of these estimators are discussed in [7, 26, 25], and the latter in [2]. Specific bounds and assumptions for these works are discussed in Section 4.

In order to reduce the estimator bias for minimization, [9] proposed an iteratively reweighted norm method to place more weight on minimizing smaller entries, and less on entries further from zero. This idea was generalized to minimizing any amenable regularizes to promote sparsity. Theoretical results on the subject include algorithmic guarantees similar to the ones presented in this paper [34], and a proof that the nonconvex problem has no spurious local minimizers [20, 21].

The same regularizers could be used as a surrogate to the rank function, as originally proposed by [24]. In [22, 23], the authors propose a generalization of the singular value thresholding algorithm proposed by [3], which was later applied to the problem of RPCA in [11, 18]. For the problem of matrix completion, the algorithms proposed by [36] and [30] achieve the fastest computational complexity in other state of the art methods.

Other approaches to low rank optimization rely on, instead of minimizing a surrogate to the rank function, constraining the matrix to be a given rank. This can be done using constrained optimization to optimize over the set of all rank matrices [33], or by using the low-rank factorization of a matrix for and [35, 31].

Previous work on theoretical results pertaining to rank and sparsity constrained methods have consisted of algorithmic guarantees ensuring that we can obtain a matrix sufficiently close to the ground truth low rank and/or sparse matrix for both matrix completion [17] and RPCA [27, 38, 37]. Additionally, both of these problems have been shown to have no spurious local minimizers, and so the ground truth matrices are the only minimizers under some assumptions [13] [12].

## 2 Alternating Proximal Gradient Descent Algorithm

Many different methods have been shown to be effective when minimizing nonconvex relaxations of the and rank functions, including iterative reweighted methods, and methods based on low rank factorization. In this paper, we focus on the most commonly used technique: alternating proximal gradient descent.

Consider the objective function , where is convex and differentiable, and is weakly convex. Instead of minimizing directly, the proximal gradient descent method approximates by a quadratic, strongly convex function centered about the point .

 ¯gk(x)=g(xk)+⟨∇g(xk),x−xk⟩+12τ||x−xk||2

At each iteration, we now minimize the function . For sufficiently small , this function is strongly convex, and the proximal gradient descent algorithm is guaranteed to converge.

The proximal gradient descent method applied to the function iteratively solves the following problem:

where we defined the proximal operator of a function as the minimum of a combination of the function and the distance from a given point. For many functions that we are interested in, the proximal operator has a closed form solution, some of which are shown in Table 1.

For each of the sparsity promoting regularizers in Table 1, the proximal operator is also dubbed as a shrinkage operator or a thresholding operator because when the input is less than , the output is 0. Otherwise, the input is moved towards zero or, for some nonconvex regularizers, is unchanged. So, we can view the proximal gradient algorithm as iteratively taking a step in the gradient direction of , and then applying the proximal operator to promote sparsity.

When applied to the optimization problem in Equation (1), we have

 ~Lk+1= Lk−τLd1d2nA∗L(AL(Lk)+Assk−b) (3a) Lk+1= proxτLλLΦγL(~Lk+1) (3b) ~sk+1= sk−τSdsnATs(AL(Lk+1)+Assk−b) (3c) sk+1= proxτsλsϕγs(~sk+1) (3d)

To further simplify the problem, the following proposition will allow subproblem to be solved in each singular value separately.

###### Proposition 1.

Consider the optimization problem

 min X∑iϕγ(σi(X))+12τ||X−Y||2F (4)

where is a weakly convex function. If , then equation (4) is strongly convex and the minimizer has the same singular vectors as with singular values given by

 σi(X∗)= argmin xϕγ(x)+12τ(x−σi(Y))2 := proxτϕγ(σi(Y))

where is the proximal operator.

###### Proof.

By convexity of ,

 ϕ(λi(X2))≥ϕ(λi(X1))+ϕ′(λi(X1))(λi(X2)−λi(X1))

Summing over all ,

 Φ(X2)≥Φ(X1)+⟨∇ϕ(λ(X1)),λ(X1)−λ(X2)⟩

By Corollary 1,

 Φ(X2)≥Φ(X1)+⟨U1∇ϕ(λ(X1))UT1,X1−X2⟩

And, as , is convex.

Consider if the global optimizer was not of the form , for a diagonal matrix . Let , where . By the of Hoffman-Wielandt inequality (Corollary A.1),

And, because , must also be a global minimizer, contradicting the premise that is the sole global minimizer. ∎

Proposition 1 tells us that has the singular vectors of and singular values given by

 σi(Lk)=proxτLλLϕγ(σi(~Lk)).

Likewise, the subproblem in can be solved in each entry of individually.

The slowest operation in the alternating proximal gradient method is by far the singular value decomposition. However, in practice we can reduce the number of operations by calculating the truncated singular value decomposition only using the first

singular values, where is an upper bound on the rank, and enforce that the remaining singular values are zero. Alternatively, we can calculate each singular value in descending order and stop when a singular value falls below , as all remaining singular values will be set to zero by the proximal operator. So, in the case of RPCA where each entry is observed, each iteration has a computational complexity of , which matches other state of the art methods.

In the case of matrix completion, however, only a sparse set of entries of the low rank matrix are observed, which could be used to increase the efficiency by reducing the amount of computation needed to find the singular value decomposition of at each iterations. For a low rank matrix with a low rank factorization , we refer to the problem of finding the SVD of the rank approximation to the matrix for a sparse matrix as Low Rank plus Sparse SVD (LRSSVD), originally proposed by [16].

The LRSSVD task can be accomplished efficiently using the same methods as if we were to find the SVD of any other matrix, such as the Power Iteration method. Recall that the computational complexity of the Power Iteration is limited by the amount of operations needed to multiply the matrix by a vector. Because the computational complexities of calculating both for and for are , we can calculate the top singular values and vectors of with only operations.

The other operations in Algorithm 1 take no more time than the LRSSVD. The gradient in the direction, , requires calculating for each entry in the support of . In the case of matrix completion, this is operations, which matches the computational complexity per iteration for state of the art matrix completion algorithms.

## 3 Analysis of APGD Algorithm

In this section, we present the main result of the paper: a recursive bound on the difference of the iterates of the alternating proximal gradient algorithm and the ground truth low rank matrix and sparse vector. We present the bound for the most general case, and give results on specific problems in the following section.

### 3.1 Restricted Isometry and Orthogonality Properties

In order to bound the error in the output of our algorithm relative to the underlying ground truth low-rank and sparse matrices and , we must first make a number of assumptions about , , and the observation models and .

First, we must assure that the low rank matrix can be separated from a sparse matrix – that is, is not sparse itself. Not only is this necessary for low-rank plus sparse decomposition, but for the problem of matrix completion, this assumption is necessary to assure that a sparse set of observations is a good representation of the entire matrix. For example, consider the matrix consisting of zeros in every entry besides one entry, where the value is 1. We must observe every entry in the matrix to assure that we can reconstruct the matrix exactly, due to the fact that we must observe the nonzero entry and every entry in its row and column. To exclude such ill-posed problems from our analysis, we will assume that is incoherent, defined as:

###### Definition 2 ([8]).

Let be a rank matrix with singular value decomposition , for orthonormal matrices , and diagonal matrix . The tangent space of is defined as

 T={UA+BV∣∣∣A∈Rr×d2,B∈Rd1×r}. (5)

Furthermore, we say the matrix (or its tangent space ) is incoherent if

 ||U:i||22≤μrd1,||Vj:||22≤μrd2∀i,j (6)

We define the projection of a matrix onto the sparse space as

 PΩ(X)={Xij(i,j)∈Ω0else (7)

and onto the tangent space as

 PT(X)=UUTX+XVVT−UUTXVVT (8)

Next, we discuss the conditions that the observation models and must satisfy in order to recover the ground truth low rank and sparse matrix, known as the restricted isometry property. Loosely, the RIP states that for any two vectors in (or matrices in ), we can obtain a sufficiently accurate estimate of the distance between the two through the observation model (or ).

The RIP was originally proposed for sparse vectors by Candès and Tao in [5]. Original attempts to extend the property to low rank matrices failed to consider coherent matrices, and thus had little practical applications. Candès and Tao [8] later introduced the incoherence assumption and proved that it applied to the problem of matrix completion. Here, we give two versions of a definition of restricted isometry property, one for sparse vectors and one for matrices that have a low-rank, incoherent tangent space.

###### Definition 3.

The linear mapping satisfies the sparse Restricted Isometry Property if, for any satisfying

 (1−κS)||x||2 ≤τsnds||Asx||2≤(1+κS)||x||2

for some constant . Likewise, the linear mapping satisfies the low rank Restricted Isometry Property if for any in a -incoherent rank tangent space ,

 (1−κL)||X||2F ≤τLnd1d2||ALX||2F≤(1+κL)||X||2F

for some constant .

In some cases, it may be more useful to use the following characterization of the RIP, which bounds the difference between the operator and the identity operator when restricted to sparse vectors or low-rank and incoherent matrices.

###### Proposition 2.

For a matrix satisfying the sparse RIP,

 ||τsndsPΩATsAsPΩ−PΩ||2≤κs.

Likewise, for any linear mapping satisfying the low rank RIP,

 ||τLd1d2nPTA∗LALPT−PT||2≤κL

Finally, we discuss the interplay between the set of sparse matrices and the low rank, incoherent tangent space, and their observation models. We hope to be able to separate the measurement vector into two parts: one in the span of , and one in the span of . In order to achieve this quickly, we require that there are no non-trivial vectors in the intersection of the two sets, which is equivalent to saying that . Under some assumptions, this norm is actually close to zero, a concept we refer to as restricted orthogonality, which we define here and verify that it applies to the problems we are interested in in Section 4.

###### Definition 4.

Linear maps and satisfy the restricted orthogonality property over the sets and (respectively) when

 dsn||PΩA∗sALPT||2≤κ,d1d2n||PTA∗LAsPΩ||2≤κ

### 3.2 Main Result

Define the difference between the iterates of the alternating proximal gradient descent algorithm and the ground truth low rank matrix and sparse vector at iteration as and . Our main result in the most general form gives a bound on the norm of and in terms of the differences at the previous iteration, and .

###### Theorem 1.

Let and be the sequences generated by Algorithm 1. Assume that

 b=AL(L∗)+Ass∗+E∈Rn,

where is a rank and incoherent matrix, and is a sparse vector with , and the linear mappings and satisfy the low rank RIP and the sparse RIP respectively, and together satisfy the ROP with constant . If , then

 ||Δk+1L ||2F≤κL||ΔkL||2F+κτL||Δks||2+τLd1d2n||PTA∗LE||2F+λLrϕ′(σr(L∗))

Likewise, if and is the smallest non-zero value of , then

 ||Δk+1s||2 ≤κs||Δks||2+κτs||Δk+1L||2F+τsdsn||PΩATsE||2+λsαdsϕ′(smin−λs)

For each of these bounds, we can think of the third term as the estimation error introduced by the noise, and the fourth term as the approximation error, which accounts for the bias in the regularizer proportional to the derivative of the regularizer. Previous results for the nuclear norm and norm give similar bounds, but make the concession that the approximation error is the dominating term. Under some circumstances, that term is equal to zero in our bound.

### 3.3 Proof of Main Result

We start by presenting the the following two lemmas regarding the proximal operator for the low rank regularizers, which we prove in the following section.

###### Lemma 1.

Let be a rank , -incoherent matrix whose singular vectors form the tangent space , and be defined as

 ¯L=argminL∈Rd1×d2λΦ(L)+||L−L∗+δ||2F (9)

where is an at most weakly convex regularizer satisfying Assumption 1, and satisfies . Define . Then,

 ||ΔL||2F≤2||PT(δ)||2F+λrϕ′(σr(L∗)) (10)

In order to utilize the RIP and ROP conditions, we need to verify that is low rank and incoherent, and that is sparse, which we will do inductively. Assume that is at most rank , and that its tangent space is incoherent. Additionally, assume that . Clearly, these conditions are met at the first iteration by initializing the algorithm with and .

At iteration ,

 L∗ −¯Lk+1 = L∗−(Lk−τLd1d2nA∗L(AL(Lk)+Assk−b))) = L∗−Lk+τLd1d2nA∗LALLk+τLd1d2nA∗LAssk−τLd1d2nA∗L(AL(L∗)+As(s∗)+E) = (ΔkL−τLd1d2nA∗LALΔkL)−τLd1d2n(A∗LAsΔkS−A∗LE)

where the first equality comes from the definition of , and the second inequality substitutes and . By Lemma 1 (along with the triangle inequality), we have that

 ||Δk+1L||2F≤ ||PT(ΔkL−τLd1d2nA∗LALΔkL)||2F+τLd1d2n||PTA∗LAsΔkS||2F +τLd1d2n||PTA∗LE||2F+λLrϕ′(σr(L∗)−2λLτL)

Let denote the union of the tangent space of the rank approximation and so that . Then,

 ||PT(ΔkL−τLd1d2nA∗LALΔkL)||2F ≤||PTk(ΔkL−τLd1d2nA∗LALΔkL)||2F ≤||PTk(τLd1d2nA∗LAL−I)PTkΔkL||2F≤κL||ΔkL||2F

where the first inequality comes from the contractive property of , and the second inequality comes from the fact that . And, because is has incoherence at most , we can apply the low-rank RIP to obtain the third inequality.

By the inductive hypothesis stating that (and thus, that is -sparse), we can use the ROP to claim that

 d1d2n||PTA∗LAsΔkS||2F≤κ||ΔkS||2F.

Combining these gives the desired bound on .

Now, we must show that is rank and incoherent. By Weyl’s inequality ([15], see Appendix A), we know that

 σr+1(~Lk+1)≤ σr+1(L∗)+||L∗−~Lk+1||2 ≤ ||Δks||2+||A∗LE||2

Because this is less than by our assumptions, must be rank .

In order to show that is at most incoherent, we apply the following Lemma, which uses the Davis-Kahan inequality [davis-kahan].

###### Lemma 2.

If (with ) is a rank , incoherent matrix, and satisfies , then the top singular vectors of the matrix form a incoherent tangent space.

Next, we bound and showing in a similar manner. We present the following Lemma, which mirrors Lemma 1.

###### Lemma 3.

Let be an sparse vector with support , and let be defined as

 ¯s=argmin s∈Rdsλϕ(s)+||s−s∗+δ||2 (11)

Where is an at most weakly convex amenable regualarizer, and satisfies . Then, . Furthermore, we can bound the difference as

 ||Δs||2F≤||PΩ(δ)||F+2λ√|Ω|ϕ′(smin−λ). (12)

By the update equation for , we have

 s∗−~sk+1= s∗−(sk−A∗s(As(sk)+AL(Lk)−b))) = s∗−sk+τsdsA∗sAssk+τsdsA∗LALLk−τsdsA∗s(As(s∗)+AL(L∗)+E) = (Δks−τsdsA∗sAsΔks)−τsdsA∗sALΔk+1L−τsdsA∗sE

By Lemma 3,

 ||Δk+1s||2F≤ ||PΩ(Δks−A∗sAsΔks)||2F+||PΩA∗sALΔk+1L||2F +||PΩA∗sE||2F+λs|Ω|ϕ′s(smin−λs)

Applying the RIP to the first term gives:

 ||PΩ(Δks−τsds