# Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization

The Schatten quasi-norm was introduced to bridge the gap between the trace norm and the rank function. However, existing algorithms are too slow or even impractical for large-scale problems. Motivated by the equivalence relation between the trace norm and its bilinear spectral penalty, we define two tractable Schatten quasi-norms, i.e., the bi-trace and tri-trace norms, and prove that they are in essence the Schatten-1/2 and 1/3 quasi-norms, respectively. By applying the two defined Schatten quasi-norms to various rank minimization problems such as matrix completion (MC) and robust principal component analysis (RPCA), we only need to solve for much smaller factor matrices. We design two efficient linearized alternating minimization algorithms to solve our problems and establish that each bounded sequence generated by our algorithms converges to a critical point. We also provide the restricted strong convexity (RSC) based and MC error bounds for our algorithms. Our experimental results verify both the efficiency and effectiveness of our algorithms compared with the state-of-the-art methods.

## 1 Introduction

The rank minimization problem has a wide range of applications in matrix completion (MC) [1], robust principal component analysis (RPCA) [2], low-rank representation [3], multivariate regression [4] and multi-task learning [5]. To efficiently solve these problems, a principled way is to relax the rank function by its convex envelope [6, 7], i.e., the trace norm (also known as the nuclear norm), which also leads to a convex optimization problem. In fact, the trace norm penalty is an $\ell_1$-norm regularization of the singular values, and thus it motivates a low-rank solution. However, [8] pointed out that the $\ell_1$-norm over-penalizes large entries of vectors, and results in a biased solution. Similar to the $\ell_1$-norm case, the trace norm penalty shrinks all singular values equally, which also over-penalizes large singular values. In other words, the trace norm may make the solution deviate from the original solution as the $\ell_1$-norm does. Compared with the trace norm, although the Schatten-$p$ quasi-norm for $0<p<1$ is non-convex, it gives a closer approximation to the rank function. Therefore, Schatten-$p$ quasi-norm minimization has attracted a great deal of attention in image recovery [9, 10], collaborative filtering [11] and MRI analysis [12].

[13] and [14] proposed iterative reweighted least squares (IRLS) algorithms to approximate associated Schatten-$p$ quasi-norm minimization problems. In addition, [10] proposed an iteratively reweighted nuclear norm (IRNN) algorithm to solve non-convex surrogate minimization problems. In some recent work [15, 16, 11, 9, 10], the Schatten-$p$ quasi-norm has been shown to be empirically superior to the trace norm. Moreover, [17] theoretically proved that Schatten-$p$ quasi-norm minimization with small $p$ requires significantly fewer measurements than convex trace norm minimization. However, all existing algorithms have to be solved iteratively and involve singular value decomposition (SVD) or eigenvalue decomposition (EVD) in each iteration. Thus they suffer from high computational cost and are not even applicable to large-scale problems [18].

In contrast, the trace norm has a scalable equivalent formulation, the bilinear spectral regularization [19, 7], which has been successfully applied in many large-scale applications, such as collaborative filtering [20, 21]. Since the Schatten-$p$ quasi-norm is equivalent to the $\ell_p$ quasi-norm on the singular values, it is natural to ask the following question: can we design an equivalent matrix factorization form for some cases of the Schatten-$p$ quasi-norm, e.g., $p=1/2$ or $1/3$?

In this paper we first define two tractable Schatten quasi-norms, the bi-trace (Bi-tr) and tri-trace (Tri-tr) norms. We then prove that they are in essence the Schatten-$1/2$ and $1/3$ quasi-norms, respectively. To minimize them, we only need to perform SVDs on much smaller factor matrices, rather than on the large matrices used in the algorithms mentioned above. Then we design two efficient linearized alternating minimization algorithms with guaranteed convergence to solve our problems. Finally, we provide the sufficient condition for exact recovery, and the restricted strong convexity (RSC) based and MC error bounds.

## 2 Notations and Background

The Schatten-$p$ norm ($0<p<\infty$) of a matrix $X\in\mathbb{R}^{m\times n}$ ($m\ge n$) is defined as

$$\|X\|_{S_p}=\Big(\sum_{i=1}^{n}\sigma_i^p(X)\Big)^{1/p},$$

where $\sigma_i(X)$ denotes the $i$-th singular value of $X$. For $p\ge 1$ it defines a natural norm, for instance, the Schatten-$1$ norm is the so-called trace norm, $\|X\|_{\mathrm{tr}}$, whereas for $p<1$ it defines a quasi-norm. As a non-convex surrogate for the rank function, the Schatten-$p$ quasi-norm with $0<p<1$ is a better approximation of the matrix rank than the trace norm [17] (analogous to the superiority of the $\ell_p$ quasi-norm to the $\ell_1$-norm [14, 22]).
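The definition above can be evaluated directly from the singular values. The following is a minimal NumPy sketch, not code from the paper; the function name `schatten_p` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def schatten_p(X, p):
    """Return (sum_i sigma_i(X)^p)^(1/p) for any p > 0 (hypothetical helper)."""
    if p <= 0:
        raise ValueError("p must be positive")
    sigma = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return float(np.sum(sigma ** p) ** (1.0 / p))

# Toy example with known singular values 4 and 1.
X = np.diag([4.0, 1.0])
trace_norm = schatten_p(X, 1.0)     # 4 + 1 = 5
schatten_half = schatten_p(X, 0.5)  # (sqrt(4) + sqrt(1))^2 = 9
```

Note that this direct evaluation needs an SVD of the full matrix, which is exactly the cost the factored formulations in Section 3 aim to avoid.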

We mainly consider the following Schatten quasi-norm minimization problem to recover a low-rank matrix from a small set of linear observations, $b\in\mathbb{R}^{l}$,

$$\min_{X\in\mathbb{R}^{m\times n}}\big\{\|X\|_{S_p}^{p}:\ \mathcal{A}(X)=b\big\}, \tag{1}$$

where $\mathcal{A}:\mathbb{R}^{m\times n}\rightarrow\mathbb{R}^{l}$ is a linear measurement operator. Alternatively, the Lagrangian version of (1) is

$$\min_{X\in\mathbb{R}^{m\times n}}\Big\{\|X\|_{S_p}^{p}+\frac{1}{\mu}f(\mathcal{A}(X)-b)\Big\}, \tag{2}$$

where $\mu>0$ is a regularization parameter, and the loss function $f(\cdot)$ generally denotes a certain measure for characterizing the loss term (for instance, $\mathcal{A}$ is the linear projection operator $\mathcal{P}_\Omega$, and $f(\cdot)$ is the squared $\ell_2$ loss in MC problems [15, 13, 23, 10]).

The Schatten-$p$ quasi-norm minimization problems (1) and (2) are non-convex, non-smooth and even non-Lipschitz [24]. Therefore, it is crucial to develop efficient algorithms that are specialized to solve some alternative formulations of the Schatten-$p$ quasi-norm minimization (1) or (2). So far, only a few algorithms, such as IRLS [14, 13] and IRNN [10], have been developed to solve such problems. In addition, since all existing Schatten-$p$ quasi-norm minimization algorithms involve SVD or EVD in each iteration, they suffer from a high computational cost of $O(mn^2)$, which severely limits their applicability to large-scale problems.
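For the MC case, the measurement operator simply reads off the observed entries. The sketch below is an illustrative assumption about how such an operator and its adjoint could be coded (the names `sample` and `sample_adjoint` are hypothetical); it also checks the adjoint identity $\langle\mathcal{A}(X),y\rangle=\langle X,\mathcal{A}^*(y)\rangle$ numerically.

```python
import numpy as np

def sample(X, rows, cols):
    """A(X): return the observed entries X[rows[i], cols[i]] as a vector in R^l."""
    return X[rows, cols]

def sample_adjoint(y, rows, cols, shape):
    """A*(y): scatter a length-l vector into an m-by-n matrix of zeros."""
    Z = np.zeros(shape)
    Z[rows, cols] = y  # observed positions are assumed to be distinct
    return Z

rng = np.random.default_rng(0)
m, n, l = 5, 4, 8
idx = rng.choice(m * n, size=l, replace=False)  # l distinct observed positions
rows, cols = np.unravel_index(idx, (m, n))

X = rng.standard_normal((m, n))
y = rng.standard_normal(l)
lhs = float(sample(X, rows, cols) @ y)                             # <A(X), y>
rhs = float(np.sum(X * sample_adjoint(y, rows, cols, (m, n))))     # <X, A*(y)>
```

With distinct positions, $\mathcal{A}^*\mathcal{A}$ is the entrywise projection onto the observed set, so its spectral norm is 1 whenever at least one entry is observed.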

## 3 Tractable Schatten Quasi-Norm Minimization

[19] and [7] pointed out that the trace norm has the following equivalent non-convex formulations.

###### Lemma 1.

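Given a matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r\le d$, the following holds:

$$\|X\|_{\mathrm{tr}}=\min_{U\in\mathbb{R}^{m\times d},\,V\in\mathbb{R}^{n\times d}:\,X=UV^T}\|U\|_F\|V\|_F=\min_{U,V:\,X=UV^T}\frac{\|U\|_F^2+\|V\|_F^2}{2}.$$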
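A quick numerical illustration of Lemma 1: the balanced factorization built from the skinny SVD attains both expressions. This is a sketch under the assumption $d=r$; the variable names are illustrative, and the check only shows that the minimum is attained by this particular factorization, not that no smaller value exists.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-3 matrix

# Skinny SVD X = P diag(s) Q^T, then U = P diag(sqrt(s)), V = Q diag(sqrt(s)).
P, s, Qt = np.linalg.svd(X, full_matrices=False)
P, s, Qt = P[:, :r], s[:r], Qt[:r, :]
U = P * np.sqrt(s)      # scales the columns of P
V = Qt.T * np.sqrt(s)   # scales the columns of Q

trace_norm = float(s.sum())
product_form = float(np.linalg.norm(U, "fro") * np.linalg.norm(V, "fro"))
average_form = float(0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2))
```

All three quantities agree up to floating-point error, since $\|U\|_F^2=\|V\|_F^2=\sum_i\sigma_i(X)$ for this choice.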

### 3.1 Bi-Trace Quasi-Norm

Motivated by the equivalence relation between the trace norm and its bilinear spectral regularization form stated in Lemma 1, our bi-trace (Bi-tr) norm is naturally defined as follows [18].

###### Definition 1.

For any matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r\le d$, we can factorize it into two much smaller matrices $U\in\mathbb{R}^{m\times d}$ and $V\in\mathbb{R}^{n\times d}$ such that $X=UV^T$. Then the bi-trace norm of $X$ is defined as

$$\|X\|_{\text{Bi-tr}}:=\min_{U,V:\,X=UV^T}\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}.$$

In fact, the bi-trace norm defined above is not a real norm, because it is non-convex and does not satisfy the triangle inequality of a norm. Similar to the well-known Schatten-$p$ quasi-norm ($0<p<1$), the bi-trace norm is also a quasi-norm, and their relationship is stated in the following theorem [18].

###### Theorem 1.

The bi-trace norm is a quasi-norm. Surprisingly, it is also the Schatten-$1/2$ quasi-norm, i.e.,

$$\|X\|_{\text{Bi-tr}}=\|X\|_{S_{1/2}},$$

where $\|X\|_{S_{1/2}}$ is the Schatten-$1/2$ quasi-norm of $X$.
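The identity in Theorem 1 can be sanity-checked numerically. The sketch below is an illustration rather than a proof: it evaluates the product $\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}$ at the balanced SVD factorization (which attains the Schatten-$1/2$ value) and at one other, arbitrarily chosen factorization of the same matrix (which should not fall below it). The mixing matrix `A` and all names are assumptions made for the example.

```python
import numpy as np

def trace_norm(M):
    """Sum of singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 2)) @ rng.standard_normal((2, 5))  # rank-2 matrix

P, s, Qt = np.linalg.svd(X, full_matrices=False)
P, s, Qt = P[:, :2], s[:2], Qt[:2, :]
U = P * np.sqrt(s)       # singular values of U are sqrt(s)
V = Qt.T * np.sqrt(s)    # singular values of V are sqrt(s)

schatten_half = float(np.sum(np.sqrt(s)) ** 2)
bi_trace_value = trace_norm(U) * trace_norm(V)

# Another valid factorization X = U2 V2^T, obtained with a fixed invertible A.
A = np.array([[2.0, 1.0], [0.0, 0.5]])
U2 = U @ A
V2 = V @ np.linalg.inv(A).T
other_value = trace_norm(U2) * trace_norm(V2)
```

Here `bi_trace_value` matches `schatten_half`, while `other_value` is an upper bound on it, consistent with the minimum in Definition 1.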

The proof of Theorem 1 can be found in the Supplementary Materials. Due to such a relationship, it is easy to verify that the bi-trace quasi-norm possesses the following properties.

###### Property 1.

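For any matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r\le d$, the following holds:

$$\|X\|_{\text{Bi-tr}}=\min_{U,V:\,X=UV^T}\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}=\min_{U,V:\,X=UV^T}\frac{\|U\|_{\mathrm{tr}}^2+\|V\|_{\mathrm{tr}}^2}{2}=\min_{U,V:\,X=UV^T}\Big(\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}}{2}\Big)^2.$$

###### Property 2.

The bi-trace quasi-norm satisfies the following properties:

1. $\|X\|_{\text{Bi-tr}}\ge 0$, with equality iff $X=0$.

2. $\|X\|_{\text{Bi-tr}}$ is unitarily invariant, i.e., $\|X\|_{\text{Bi-tr}}=\|PXQ^T\|_{\text{Bi-tr}}$, where $P$ and $Q$ have orthonormal columns.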

### 3.2 Tri-Trace Quasi-Norm

Similar to the definition of the bi-trace quasi-norm, our tri-trace (Tri-tr) norm is naturally defined as follows.

###### Definition 2.

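For any matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r\le d$, we can factorize it into three much smaller matrices $U\in\mathbb{R}^{m\times d}$, $V\in\mathbb{R}^{d\times d}$ and $W\in\mathbb{R}^{n\times d}$ such that $X=UVW^T$. Then the tri-trace norm of $X$ is defined as

$$\|X\|_{\text{Tri-tr}}:=\min_{U,V,W:\,X=UVW^T}\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}\|W\|_{\mathrm{tr}}.$$

Like the bi-trace quasi-norm, the tri-trace norm is also a quasi-norm, as stated in the following theorem.

###### Theorem 2.

The tri-trace norm is a quasi-norm. In addition, it is also the Schatten-$1/3$ quasi-norm, i.e.,

$$\|X\|_{\text{Tri-tr}}=\|X\|_{S_{1/3}}.$$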
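To see why the value $\|X\|_{S_{1/3}}$ is attainable, consider the following worked factorization. It is a sketch of the easy direction only, taking $d=r$ and writing the skinny SVD as $X=P\Sigma Q^T$ (the symbols $P$, $\Sigma$, $Q$ are notation assumed here for illustration); the reverse inequality is the substantive part of Theorem 2.

$$U=P\Sigma^{1/3},\quad V=\Sigma^{1/3},\quad W=Q\Sigma^{1/3}\ \Longrightarrow\ UVW^T=P\Sigma Q^T=X,$$

$$\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}\|W\|_{\mathrm{tr}}=\Big(\sum_{i=1}^{r}\sigma_i^{1/3}(X)\Big)^{3}=\|X\|_{S_{1/3}}.$$

Each factor has singular values $\sigma_i^{1/3}(X)$ because $P$ and $Q$ have orthonormal columns, so the minimum in Definition 2 is at most $\|X\|_{S_{1/3}}$.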

The proof of Theorem 2 is very similar to that of Theorem 1 and is thus omitted. According to Theorem 2, it is easy to verify that the tri-trace quasi-norm possesses the following properties.

###### Property 3.

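For any matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r\le d$, the following holds:

$$\|X\|_{\text{Tri-tr}}=\min_{X=UVW^T}\Big(\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}+\|W\|_{\mathrm{tr}}}{3}\Big)^3=\min_{X=UVW^T}\|U\|_{\mathrm{tr}}\|V\|_{\mathrm{tr}}\|W\|_{\mathrm{tr}}=\min_{X=UVW^T}\frac{\|U\|_{\mathrm{tr}}^3+\|V\|_{\mathrm{tr}}^3+\|W\|_{\mathrm{tr}}^3}{3}.$$

###### Property 4.

The tri-trace quasi-norm satisfies the following properties:

1. $\|X\|_{\text{Tri-tr}}\ge 0$, with equality iff $X=0$.

2. $\|X\|_{\text{Tri-tr}}$ is unitarily invariant, i.e., $\|X\|_{\text{Tri-tr}}=\|PXQ^T\|_{\text{Tri-tr}}$, where $P$ and $Q$ have orthonormal columns.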

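The following relationship between the trace norm and the Frobenius norm is well known: $\|X\|_F\le\|X\|_{\mathrm{tr}}\le\sqrt{r}\,\|X\|_F$. Similarly, analogous bounds hold for the bi-trace and tri-trace quasi-norms, as stated in the following property.

###### Property 5.

For any matrix $X\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X)=r$, the following inequalities hold:

$$\|X\|_{\mathrm{tr}}\le\|X\|_{\text{Bi-tr}}\le r\|X\|_{\mathrm{tr}},\qquad \|X\|_{\mathrm{tr}}\le\|X\|_{\text{Bi-tr}}\le\|X\|_{\text{Tri-tr}}\le r^2\|X\|_{\mathrm{tr}}.$$

###### Proof.

The proof of this property involves the following properties of the $\ell_p$ quasi-norm. For any vector $x$ in $\mathbb{R}^n$ and $0<p_2\le p_1\le 1$, we have

$$\|x\|_1\le\|x\|_{p_1},\qquad \|x\|_{p_1}\le\|x\|_{p_2}\le n^{1/p_2-1/p_1}\|x\|_{p_1}.$$

Suppose $X$ is of rank $r$, and denote its skinny SVD by $X=U\Sigma V^T$. By Theorems 1 and 2, and the properties of the $\ell_p$ quasi-norm, we have

$$\|X\|_{\mathrm{tr}}=\|\mathrm{diag}(\Sigma)\|_1\le\|\mathrm{diag}(\Sigma)\|_{1/2}=\|X\|_{\text{Bi-tr}}\le r\|\mathrm{diag}(\Sigma)\|_1=r\|X\|_{\mathrm{tr}},$$

$$\|X\|_{\text{Bi-tr}}=\|\mathrm{diag}(\Sigma)\|_{1/2}\le\|\mathrm{diag}(\Sigma)\|_{1/3}=\|X\|_{\text{Tri-tr}}\le r^2\|\mathrm{diag}(\Sigma)\|_1=r^2\|X\|_{\mathrm{tr}}.$$

∎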

It is easy to see that Property 5 in turn implies that any low bi-trace or tri-trace quasi-norm approximation is also a low trace norm approximation.
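The chain of inequalities in Property 5 is easy to check numerically from the singular values alone. The sketch below uses the Schatten-$1/2$ and Schatten-$1/3$ expressions from Theorems 1 and 2 as stand-ins for the bi-trace and tri-trace quasi-norms; the random test matrix is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
r = 4
X = rng.standard_normal((10, r)) @ rng.standard_normal((r, 9))  # rank-4 matrix
s = np.linalg.svd(X, compute_uv=False)[:r]  # the r nonzero singular values

tr = float(s.sum())                           # trace norm
bi_tr = float(np.sum(s ** 0.5) ** 2)          # Schatten-1/2 (bi-trace) value
tri_tr = float(np.sum(s ** (1.0 / 3.0)) ** 3) # Schatten-1/3 (tri-trace) value

holds = (tr <= bi_tr <= r * tr) and (bi_tr <= tri_tr <= r ** 2 * tr)
```

The upper bounds become tight when all nonzero singular values are equal, and the lower bounds become tight for rank-one matrices.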

### 3.3 Problem Formulations

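Bounding the Schatten quasi-norm of $X$ in (1) by the bi-trace or tri-trace quasi-norm defined above, the noiseless low-rank structured matrix factorization problem is given by

$$\min_{U,V}\big\{R(U,V)=(\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}})/2:\ \mathcal{A}(UV^T)=b\big\}, \tag{3}$$

where $R(U,V)$ can also denote $(\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}+\|W\|_{\mathrm{tr}})/3$, and $UV^T$ is replaced by $UVW^T$. In addition, (3) has the following Lagrangian forms,

$$\min_{U,V}\Big\{F(U,V):=\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}}{2}+\frac{f(\mathcal{A}(UV^T)-b)}{\mu}\Big\}, \tag{4}$$

$$\min_{U,V,W}\Big\{\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}+\|W\|_{\mathrm{tr}}}{3}+\frac{f(\mathcal{A}(UVW^T)-b)}{\mu}\Big\}. \tag{5}$$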

The formulations (3), (4) and (5) can address a wide range of problems, such as MC [13, 10], RPCA [2, 25, 26] (where $\mathcal{A}$ is the identity operator), and low-rank representation [3] or multivariate regression [4] (where $\mathcal{A}(X)=AX$ with $A$ being a given matrix), each with a loss $f(\cdot)$ suited to the task. In addition, $f(\cdot)$ may also be chosen as the hinge loss in [19] or the structured atomic norms in [27].

## 4 Optimization Algorithms

In this section, we propose two efficient algorithms to solve the challenging bi-trace quasi-norm regularized problem (4) with a smooth or non-smooth loss function, respectively. If $f(\cdot)$ is a smooth loss function, e.g., the squared loss $f(\cdot)=\frac{1}{2}\|\cdot\|_2^2$, we employ the proximal alternating linearized minimization (PALM) method as in [28] to solve (4). In contrast, to efficiently solve (4) with a non-smooth loss function, e.g., $f(\cdot)=\|\cdot\|_1$, we need to introduce an auxiliary variable $e$ and obtain the following equivalent form:

$$\min_{U,V,e}\Big\{\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}}{2}+\frac{f(e)}{\mu}:\ e=\mathcal{A}(UV^T)-b\Big\}. \tag{6}$$

To avoid introducing more auxiliary variables, inspired by [29], we propose a linearized alternating direction method (LADM) to solve (6), whose augmented Lagrangian function is given by

$$\mathcal{L}(U,V,e,\lambda,\beta)=\frac{1}{2}(\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}})+\frac{f(e)}{\mu}+\langle\lambda,\mathcal{A}(UV^T)-b-e\rangle+\frac{\beta}{2}\|\mathcal{A}(UV^T)-b-e\|_2^2,$$

where $\lambda\in\mathbb{R}^{l}$ is the Lagrange multiplier, $\langle\cdot,\cdot\rangle$ denotes the inner product, and $\beta>0$ is a penalty parameter. By applying the classical augmented Lagrangian method to (6), we obtain the following iterative scheme:

$$\begin{aligned}
U_{k+1}&=\arg\min_{U}\ \frac{\|U\|_{\mathrm{tr}}}{2}+\frac{\beta_k}{2}\|\mathcal{A}(UV_k^T)-e_k-\tilde{b}_k\|_2^2, &\text{(7a)}\\
V_{k+1}&=\arg\min_{V}\ \frac{\|V\|_{\mathrm{tr}}}{2}+\frac{\beta_k}{2}\|\mathcal{A}(U_{k+1}V^T)-e_k-\tilde{b}_k\|_2^2, &\text{(7b)}\\
e_{k+1}&=\arg\min_{e}\ \frac{f(e)}{\mu}+\frac{\beta_k}{2}\|\mathcal{A}(U_{k+1}V_{k+1}^T)-e-\tilde{b}_k\|_2^2, &\text{(7c)}\\
\lambda_{k+1}&=\lambda_k+\beta_k(\mathcal{A}(U_{k+1}V_{k+1}^T)-b-e_{k+1}), &\text{(7d)}
\end{aligned}$$

where $\tilde{b}_k=b-\lambda_k/\beta_k$. In many machine learning problems [15, 3, 4], $\mathcal{A}$ is not the identity, e.g., the operator $\mathcal{P}_\Omega$. Due to the presence of $V_k$ and $U_{k+1}$, we usually need to introduce some auxiliary variables to achieve closed-form solutions to (7a) and (7b). To avoid introducing additional auxiliary variables, we propose the following linearization technique for (7a) and (7b).

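#### 4.1.1 Updating $U_{k+1}$ and $V_{k+1}$

Let $\varphi_k(U):=\|\mathcal{A}(UV_k^T)-e_k-\tilde{b}_k\|_2^2/2$; then the gradient of $\varphi_k$ is Lipschitz continuous with the constant $t^{\varphi}_k$, i.e., $\|\nabla\varphi_k(U_1)-\nabla\varphi_k(U_2)\|_F\le t^{\varphi}_k\|U_1-U_2\|_F$ for any $U_1,U_2\in\mathbb{R}^{m\times d}$. By linearizing $\varphi_k$ at $U_k$ and adding a proximal term, we have

$$\hat{\varphi}_k(U,U_k)=\varphi_k(U_k)+\langle\nabla\varphi_k(U_k),U-U_k\rangle+\frac{t^{\varphi}_k}{2}\|U-U_k\|_F^2. \tag{8}$$

Therefore, we have

$$U_{k+1}=\arg\min_{U}\ \frac{1}{2}\|U\|_{\mathrm{tr}}+\beta_k\hat{\varphi}_k(U,U_k)=\arg\min_{U}\ \frac{1}{2}\|U\|_{\mathrm{tr}}+\frac{\beta_k t^{\varphi}_k}{2}\Big\|U-U_k+\frac{\nabla\varphi_k(U_k)}{t^{\varphi}_k}\Big\|_F^2. \tag{9}$$

Similarly, we have

$$V_{k+1}=\arg\min_{V}\ \frac{1}{2}\|V\|_{\mathrm{tr}}+\frac{\beta_k t^{\psi}_k}{2}\Big\|V-V_k+\frac{\nabla\psi_k(V_k)}{t^{\psi}_k}\Big\|_F^2, \tag{10}$$

where $\psi_k(V):=\|\mathcal{A}(U_{k+1}V^T)-e_k-\tilde{b}_k\|_2^2/2$ with the Lipschitz constant $t^{\psi}_k$. Using the so-called matrix shrinkage operator [30], we can obtain closed-form solutions to (9) and (10), respectively. Additionally, if $f(\cdot)=\|\cdot\|_1$, the optimal solution to (7c) can be obtained by the well-known soft-thresholding operator [31].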
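Both closed-form steps reduce to thresholding. As a hedged sketch (the helper names below are hypothetical, not from the paper): a problem of the form $\min_V \frac{1}{2}\|V\|_{\mathrm{tr}}+\frac{\beta t}{2}\|V-G\|_F^2$, as in (10), is solved by shrinking the singular values of $G$ by $1/(2\beta t)$, and (7c) with $f=\|\cdot\|_1$ is solved by entrywise soft-thresholding at level $1/(\mu\beta)$.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft-thresholding: argmin_e tau*||e||_1 + 0.5*||e - x||^2."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(G, tau):
    """Matrix shrinkage: argmin_U tau*||U||_tr + 0.5*||U - G||_F^2."""
    P, s, Qt = np.linalg.svd(G, full_matrices=False)
    return (P * np.maximum(s - tau, 0.0)) @ Qt

def linearized_factor_update(F_k, grad_k, t_k, beta):
    """Closed form of a step shaped like (10): threshold 1 / (2 * beta * t_k)."""
    return svt(F_k - grad_k / t_k, 1.0 / (2.0 * beta * t_k))

def l1_error_update(c, mu, beta):
    """Closed form of (7c) with f = ||.||_1, where c = A(U V^T) - b_tilde."""
    return soft_threshold(c, 1.0 / (mu * beta))

# Tiny examples: singular values (3, 1) shrunk by 2, entries shrunk by 1.
shrunk = svt(np.diag([3.0, 1.0]), 2.0)
sparse = soft_threshold(np.array([3.0, -1.5, 0.5]), 1.0)
```

The thresholds follow from dividing each objective by the coefficient of its quadratic term, so they shrink as the penalty parameter grows.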

#### 4.1.2 Computing Step Sizes

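There are two step sizes, i.e., the Lipschitz constants $t^{\varphi}_k$ in (9) and $t^{\psi}_k$ in (10), that need to be set during the iterations. We have

$$\begin{aligned}
\|\nabla\varphi_k(U_1)-\nabla\varphi_k(U_2)\|_F&=\|\mathcal{A}^*\{\mathcal{A}[(U_1-U_2)V_k^T]\}V_k\|_F\le\|\mathcal{A}^*\mathcal{A}\|_2\|V_k^TV_k\|_2\|U_1-U_2\|_F,\\
\|\nabla\psi_k(V_1)-\nabla\psi_k(V_2)\|_F&=\|U_{k+1}^T\mathcal{A}^*\{\mathcal{A}[U_{k+1}(V_1-V_2)^T]\}\|_F\le\|\mathcal{A}^*\mathcal{A}\|_2\|U_{k+1}^TU_{k+1}\|_2\|V_1-V_2\|_F,
\end{aligned}$$

where $\mathcal{A}^*$ denotes the adjoint operator of $\mathcal{A}$. Thus, both step sizes are defined in the following way:

$$\begin{cases}
t^{\varphi}_k\ge\|\mathcal{A}^*\mathcal{A}\|_2\|V_k^TV_k\|_2,\\
t^{\psi}_k\ge\|\mathcal{A}^*\mathcal{A}\|_2\|U_{k+1}^TU_{k+1}\|_2.
\end{cases} \tag{11}$$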

Based on the description above, we develop an efficient LADM algorithm to solve the Bi-tr quasi-norm regularized problem (4) with a non-smooth loss function (e.g., RPCA problems), as outlined in Algorithm 1. To further accelerate the convergence of the algorithm, the penalty parameter is adaptively updated by the strategy as in [32], as well as . Moreover, Algorithm 1 can be used to solve the noiseless problem (3) and also extended to solve the Tri-tr quasi-norm regularized problem (5) with a non-smooth loss function.
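Since Algorithm 1 itself is not reproduced here, the following is only a rough sketch of how the scheme (7a)-(7d) with linearized factor updates might look for an RPCA-type problem. It makes several assumptions that are not from the paper: $\mathcal{A}$ is the identity (so $\|\mathcal{A}^*\mathcal{A}\|_2=1$), $f=\|\cdot\|_1$, the factors are randomly initialized, and the penalty grows geometrically by a factor `rho` instead of following the adaptive rule of [32]. All function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(G, tau):
    P, s, Qt = np.linalg.svd(G, full_matrices=False)
    return (P * np.maximum(s - tau, 0.0)) @ Qt

def ladm_rpca(D, d, mu, iters=100, beta=0.1, rho=1.2, beta_max=1e8, seed=0):
    """Sketch of linearized ADM for min (||U||_tr + ||V||_tr)/2 + ||E||_1/mu
    subject to E = U V^T - D (identity measurement operator assumed)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    U = rng.standard_normal((m, d))
    V = rng.standard_normal((n, d))
    E = np.zeros((m, n))
    Lam = np.zeros((m, n))                      # Lagrange multiplier
    residuals = [np.linalg.norm(U @ V.T - D - E)]
    for _ in range(iters):
        B = D - Lam / beta                      # shifted data, b_tilde_k
        # Linearized U-step with step size from (11).
        t_phi = max(np.linalg.norm(V.T @ V, 2), 1e-12)
        grad_U = (U @ V.T - E - B) @ V
        U = svt(U - grad_U / t_phi, 1.0 / (2.0 * beta * t_phi))
        # Linearized V-step, using the updated U.
        t_psi = max(np.linalg.norm(U.T @ U, 2), 1e-12)
        grad_V = (U @ V.T - E - B).T @ U
        V = svt(V - grad_V / t_psi, 1.0 / (2.0 * beta * t_psi))
        # (7c): soft-thresholding for the l1 loss.
        E = soft_threshold(U @ V.T - B, 1.0 / (mu * beta))
        # (7d): multiplier update, then a simple geometric penalty increase.
        R = U @ V.T - D - E
        Lam = Lam + beta * R
        residuals.append(np.linalg.norm(R))
        beta = min(rho * beta, beta_max)
    return U, V, E, residuals

# Usage on a toy low-rank-plus-sparse matrix.
rng = np.random.default_rng(3)
L0 = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
S0 = np.where(rng.random((30, 20)) < 0.05, 5.0, 0.0)
U, V, E, residuals = ladm_rpca(L0 + S0, d=4, mu=1.0)
```

In this sketch the constraint residual is driven toward zero as the penalty grows; the quality of the recovered factors depends on $\mu$, the rank bound `d`, and the initialization, none of which are tuned here.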

### 4.2 PALM Algorithm

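Using a linearization technique similar to (9) and (10), we design an efficient PALM algorithm to solve (4) with a smooth loss function, e.g., MC problems. Specifically, by linearizing the smooth loss function at $U_k$ and adding a proximal term, we have the following approximation:

$$U_{k+1}=\arg\min_{U}\ \frac{\|U\|_{\mathrm{tr}}}{2}+\Big\langle\frac{\nabla\varphi_k(U_k)}{\mu},U-U_k\Big\rangle+\frac{t^{\varphi}_k}{2\mu}\|U-U_k\|_F^2=\arg\min_{U}\ \frac{\|U\|_{\mathrm{tr}}}{2}+\frac{t^{\varphi}_k}{2\mu}\Big\|U-U_k+\frac{\nabla\varphi_k(U_k)}{t^{\varphi}_k}\Big\|_F^2, \tag{12}$$

where $\nabla\varphi_k(U_k)=\mathcal{A}^*[\mathcal{A}(U_kV_k^T)-b]V_k$. Similarly,

$$V_{k+1}=\arg\min_{V}\ \frac{\|V\|_{\mathrm{tr}}}{2}+\frac{t^{\psi}_k}{2\mu}\Big\|V-V_k+\frac{\nabla\psi_k(V_k)}{t^{\psi}_k}\Big\|_F^2, \tag{13}$$

where $\nabla\psi_k(V_k)=\{\mathcal{A}^*[\mathcal{A}(U_{k+1}V_k^T)-b]\}^TU_{k+1}$.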
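The updates (12) and (13) can be sketched for the MC problem, where the measurement operator is an entrywise mask. This is an illustrative implementation under stated assumptions, not the authors' code: the loss is $\frac{1}{2\mu}\|\mathcal{P}_\Omega(UV^T)-\mathcal{P}_\Omega(D)\|_F^2$, so $\|\mathcal{A}^*\mathcal{A}\|_2=1$ and the step sizes reduce to $\|V_k^TV_k\|_2$ and $\|U_{k+1}^TU_{k+1}\|_2$; the factors are randomly initialized; and names such as `palm_mc` are hypothetical. Each update is a singular value shrinkage with threshold $\mu/(2t)$.

```python
import numpy as np

def svt(G, tau):
    """Matrix shrinkage: argmin_U tau*||U||_tr + 0.5*||U - G||_F^2."""
    P, s, Qt = np.linalg.svd(G, full_matrices=False)
    return (P * np.maximum(s - tau, 0.0)) @ Qt

def trace_norm(M):
    return float(np.linalg.svd(M, compute_uv=False).sum())

def objective(U, V, D, mask, mu):
    """Bi-trace regularized MC objective with the squared loss."""
    resid = mask * (U @ V.T - D)
    return 0.5 * (trace_norm(U) + trace_norm(V)) + 0.5 / mu * float(np.sum(resid ** 2))

def palm_mc(D, mask, d, mu, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = D.shape
    U = rng.standard_normal((m, d))
    V = rng.standard_normal((n, d))
    history = [objective(U, V, D, mask, mu)]
    for _ in range(iters):
        # U-step (12): gradient of the masked squared loss, then shrinkage.
        t_phi = max(np.linalg.norm(V.T @ V, 2), 1e-12)
        grad_U = (mask * (U @ V.T - D)) @ V
        U = svt(U - grad_U / t_phi, mu / (2.0 * t_phi))
        # V-step (13): same pattern, using the updated U.
        t_psi = max(np.linalg.norm(U.T @ U, 2), 1e-12)
        grad_V = (mask * (U @ V.T - D)).T @ U
        V = svt(V - grad_V / t_psi, mu / (2.0 * t_psi))
        history.append(objective(U, V, D, mask, mu))
    return U, V, history

# Usage on a small synthetic problem with 60% observed entries.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
mask = rng.random((30, 20)) < 0.6
U, V, history = palm_mc(X0 * mask, mask, d=4, mu=0.1)
```

Because each step size is an upper bound on the corresponding partial Lipschitz constant, the objective values in `history` should not increase from one iteration to the next. Only SVDs of the thin factor-sized matrices are required.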

### 4.3 Convergence Analysis

In the following, we provide the convergence analysis of our algorithms. First, we analyze the convergence of our LADM algorithm for solving (4) with a non-smooth loss function, e.g., $f(\cdot)=\|\cdot\|_1$.

###### Theorem 3.

Let $\{(U_k,V_k,e_k)\}$ be a sequence generated by Algorithm 1, then we have

1. $\{U_k\}$, $\{V_k\}$ and $\{e_k\}$ are all Cauchy sequences;

2. If , then the accumulation point of the sequence satisfies the KKT conditions for (6).

The proof of Theorem 3 is provided in the Supplementary Materials. Theorem 3 shows that, under mild conditions, each sequence generated by our LADM algorithm converges to a critical point, similar to the LADM algorithms for solving convex problems as in [32].

Moreover, we provide the global convergence of our PALM algorithm for solving (4) with a smooth loss function, e.g., the squared loss $f(\cdot)=\frac{1}{2}\|\cdot\|_2^2$.

###### Theorem 4.

Let $\{(U_k,V_k)\}$ be a sequence generated by our PALM algorithm, then it is a Cauchy sequence and converges to a critical point of (4) with the squared loss, $f(\cdot)=\frac{1}{2}\|\cdot\|_2^2$.

The proof of Theorem 4 can be found in the Supplementary Materials. Theorem 4 shows the global convergence of our PALM algorithm. We emphasize that, different from the general subsequence convergence property, the global convergence property is given by $(U_k,V_k)\rightarrow(\hat{U},\hat{V})$ as the number of iterations $k\rightarrow\infty$, where $(\hat{U},\hat{V})$ is a critical point of (4). In contrast, existing algorithms for solving non-convex and non-smooth problems, such as [14] and [10], have only the subsequence convergence property.

By the Kurdyka-Łojasiewicz (KL) property (for more details, see [28]) and Theorem 2 in [33], our PALM algorithm has the following convergence rate:

###### Theorem 5.

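The sequence $\{(U_k,V_k)\}$ generated by our PALM algorithm converges to a critical point $(\hat{U},\hat{V})$ of $F$ with $f(\cdot)=\frac{1}{2}\|\cdot\|_2^2$, which satisfies the KL property at each point of $\mathrm{dom}\,\partial F$ with $\phi(s)=cs^{1-\theta}$ for $c>0$ and $\theta\in[0,1)$. We have

• If $\theta=0$, $\{(U_k,V_k)\}$ converges to $(\hat{U},\hat{V})$ in finite steps;

• If $\theta\in(0,1/2]$, then $\exists\, C>0$ and $\gamma\in[0,1)$ such that $\|(U_k,V_k)-(\hat{U},\hat{V})\|_F\le C\gamma^k$;

• If $\theta\in(1/2,1)$, then $\exists\, C>0$ such that $\|(U_k,V_k)-(\hat{U},\hat{V})\|_F\le Ck^{-\frac{1-\theta}{2\theta-1}}$.

Theorem 5 gives the convergence rate of our PALM algorithm for solving the non-convex and non-smooth bi-trace quasi-norm problem (4) with the squared loss. Moreover, we can see that the convergence rate of our PALM algorithm is at least sub-linear.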

## 5 Recovery Guarantees

We provide theoretical guarantees for our Bi-tr quasi-norm minimization in recovering low-rank matrices from small sets of linear observations. By using the null-space property (NSP), we first provide a sufficient condition for exact recovery of low-rank matrices. We then establish the restricted strong convexity (RSC) condition based and MC error bounds.

### 5.1 Null Space Property

The wide use of NSP for recovering sparse vectors and low-rank matrices can be found in [22, 34]. We give the sufficient and necessary condition for exact recovery via our bi-trace quasi-norm model (3) that improves the NSP condition for the Schatten- quasi-norm in [34]. Let , and , where and denote the matrices consisting the top left and right singular vectors of the true matrix (which satisfies ) with rank at most . denotes the null space of the linear operator . Then we have the following theorem, the proof of which is provided in the Supplementary Materials.

###### Theorem 6.

can be uniquely recovered by (3), if and only if for any , where , , we have

 r∑i=1σi(W1)+σi(W2)

Remark: Since , where , the sufficient condition in Theorem 6 is weaker than the corresponding sufficient condition for the Schatten- quasi-norm in [34].

### 5.2 RSC based Error Bound

Unlike most existing recovery guarantees, such as those in [17, 34], we do not impose the restricted isometry property (RIP) on the general operator $\mathcal{A}$. Rather, we require the operator $\mathcal{A}$ to satisfy a weaker and more general condition known as restricted strong convexity (RSC) [35], as shown in the following.

###### Assumption 1 (RSC).

We suppose that there is a positive constant $\kappa(\mathcal{A})$ such that the general operator $\mathcal{A}$ satisfies the following inequality

$$\frac{1}{\sqrt{l}}\|\mathcal{A}(\Delta)\|_2\ge\kappa(\mathcal{A})\|\Delta\|_F$$

for all .

We mainly provide the RSC based error bound for robust recovery via our bi-trace quasi-norm algorithm with noisy measurements. To our knowledge, our recovery guarantee analysis is the first one for solutions generated by Schatten quasi-norm algorithms, rather than for the global optima of (4) as in [36, 17, 34]. (It is well known that the Schatten-$p$ quasi-norm ($0<p<1$) problems in [15, 11, 14, 10, 9] are non-convex, non-smooth and non-Lipschitz [24]. The recovery guarantees in [36, 17, 34] are naturally based on the global optimal solutions of the associated models.)

###### Theorem 7.

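Assume $X_0\in\mathbb{R}^{m\times n}$ is a true matrix and the corrupted measurements $\mathcal{A}(X_0)+e=b$, where $e$ is noise with $\|e\|_2\le\epsilon$. Let $(\hat{U},\hat{V})$ be a critical point of (4) with the squared loss, and suppose the operator $\mathcal{A}$ satisfies the RSC condition with a constant $\kappa(\mathcal{A})$. Then

$$\frac{\|X_0-\hat{U}\hat{V}^T\|_F}{\sqrt{mn}}\le\frac{\epsilon}{\kappa(\mathcal{A})\sqrt{lmn}}+\frac{\mu\sqrt{d}}{2C_1\kappa(\mathcal{A})\sqrt{lmn}},$$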

where .

The proof of Theorem 7 and the analysis of the lower-boundedness of $C_1$ are provided in the Supplementary Materials.

### 5.3 Error Bound on Matrix Completion

Although the MC problem is a practically important application of (4), the projection operator $\mathcal{P}_\Omega$ in (15) does not satisfy the standard RIP and RSC conditions in general [1, 37, 38]. Therefore, we also need to provide a recovery guarantee for our Bi-tr quasi-norm minimization when solving the following MC problem.

$$\min_{U,V}\Big\{\frac{\|U\|_{\mathrm{tr}}+\|V\|_{\mathrm{tr}}}{2}+\frac{1}{2\mu}\|\mathcal{P}_\Omega(UV^T)-\mathcal{P}_\Omega(D)\|_F^2\Big\}. \tag{15}$$

Without loss of generality, assume that the observed matrix $D\in\mathbb{R}^{m\times n}$ can be decomposed as a true matrix $X_0$ of rank $r\le d$ and a random Gaussian noise $E$, i.e., $D=X_0+E$. We give the following recovery guarantee for our Bi-tr quasi-norm minimization (15).

###### Theorem 8.

Let be a critical point of the problem (15) with given rank , and . Then there exists an absolute constant

, such that with probability at least

,

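$$\frac{\|X_0-\hat{U}\hat{V}^T\|_F}{\sqrt{mn}}\le\frac{\|E\|_F}{\sqrt{mn}}+C_2\delta\Big(\frac{md\log(m)}{|\Omega|}\Big)^{1/4}+\frac{\mu\sqrt{d}}{2C_3\sqrt{|\Omega|}},$$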

where and .

The proof of Theorem 8 and the analysis of lower-boundedness of can be found in the Supplementary Materials. When the samples size , the second and third terms diminish, and the recovery error is essentially bounded by the “average” magnitude of entries of noise . In other words, only observed entries are needed, significantly lower than in standard matrix completion theories [37, 39, 7], which will be confirmed by the following experimental results.

## 6 Experimental Results

We evaluate both the effectiveness and efficiency of our methods (i.e., the Bi-tr and Tri-tr methods) for solving MC and RPCA problems, such as collaborative filtering and text separation. All experiments were conducted on an Intel Xeon E7-4830V2 2.20GHz CPU with 64G RAM.

### 6.1 Synthetic Matrix Completion

The synthetic matrices with rank are generated randomly by the following procedure: the entries of both random matrices and are first generated as independent and identically distributed (i.i.d.) numbers, and then is assembled. The experiments are conducted on random matrices with different noise factors, or

, where the observed subset is corrupted by i.i.d. standard Gaussian random variables as in

[18]

. In both cases, the sampling ratio (SR) is set to 20% or 30%. We use the relative standard error (

) as the evaluation measure, where denotes the recovered matrix.
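The synthetic setup can be sketched as follows. Several details are assumptions made only for illustration, since they are not legible in the text above: the factor entries are drawn from a standard Gaussian, the observed set is a uniformly random subset of entries of the requested size, noise is omitted, and the relative standard error is taken to be $\|X-\hat{X}\|_F/\|X\|_F$. The function names are hypothetical.

```python
import numpy as np

def make_problem(m, n, r, sampling_ratio, seed=0):
    """Build a rank-r matrix X = U V^T and a boolean mask of observed entries."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, r))   # assumed: i.i.d. standard Gaussian entries
    V = rng.standard_normal((n, r))
    X = U @ V.T
    num_obs = int(round(sampling_ratio * m * n))
    idx = rng.choice(m * n, size=num_obs, replace=False)
    mask = np.zeros(m * n, dtype=bool)
    mask[idx] = True
    return X, mask.reshape(m, n)

def rse(X_true, X_hat):
    """Assumed definition of the relative standard error."""
    return float(np.linalg.norm(X_true - X_hat, "fro") / np.linalg.norm(X_true, "fro"))

X, mask = make_problem(50, 40, 5, 0.2)   # 20% sampling ratio
```

Any completion method can then be scored with `rse(X, X_hat)` on its recovered matrix `X_hat`.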

We compare our methods with two trace norm solvers: NNLS [40] and ALT [4], one bilinear spectral regularization method, LRMF [20], and two Schatten- norm methods, IRLS [14] and IRNN [10]. The recovery results of IRLS and IRNN () on noisy random matrices are shown in Figure 4, from which we can observe that as a scalable alternative to trace norm regularization, LRMF with relatively small ranks often obtains more accurate solutions than its trace norm counterparts, i.e., NNLS and ALT. If is chosen from the range of , IRLS and IRNN have similar performance, and usually outperform NNLS, ALT and LRMF in terms of RSE, otherwise they sometimes perform much worse than the latter three methods, especially . This means that both our methods (which are in essence the Schatten- and quasi-norm algorithms) should perform better than them. As expected, the RSE results of both our methods under all of these settings are consistently much better than those of the other approaches. This clearly justifies the usefulness of our Bi-tr and Tri-tr quasi-norm penalties. Moreover, the running time of all these methods on random matrices with different sizes is provided in the Supplementary Materials, which shows that our methods are much faster than the other methods. This confirms that both our methods have very good scalability and can address large-scale problems.

### 6.2 Collaborative Filtering

We test our methods on the real-world recommendation system datasets: MovieLens1M, MovieLens10M and MovieLens20M, and Netflix [41]. We randomly choose 90% as the training set and the remaining as the testing set, and the experimental results are reported over 10 independent runs. Besides those methods used above, we also compare our methods to one of the fastest methods, LMaFit [42], and use the root mean squared error (RMSE) as evaluation measure.

The testing RMSE of all those methods on the four datasets is reported in Figure 5, where the rank varies from 5 to 20 (the running times of all methods are provided in the Supplementary Materials). From all these results, we can observe that for these fixed ranks, the matrix factorization methods including LMaFit, LRMF and our methods perform significantly better than the trace norm solvers including NNLS and ALT in terms of RMSE, especially on the three larger datasets, as shown in Figures 5(b)-(d). In most cases, the sophisticated matrix factorization based approaches outperform LMaFit, a baseline method without any regularization term. This suggests that those regularized models can alleviate the over-fitting problem of matrix factorization. The testing RMSE of both our methods varies only slightly as the given rank increases, while that of the other matrix factorization methods changes dramatically. This further means that our methods are much more robust to the given rank than the others. More importantly, both our methods under all of the rank settings consistently outperform the other methods in terms of prediction accuracy. This confirms that our Bi-tr or Tri-tr quasi-norm regularized models can provide a good estimate of a low-rank matrix. Note that IRLS and IRNN could not run on the three larger datasets due to runtime exceptions. Moreover, our methods are much faster than LRMF, NNLS, ALT, IRLS and IRNN on all these datasets, and are comparable in speed with LMaFit. This shows that our methods have very good scalability and can solve large-scale problems.

### 6.3 Text Separation

We conducted an experiment on artificially generated data to separate some text from an image. The ground-truth image is of size with rank equal to 10. Figure 6(a) shows the input image together with the original image. The input data are generated by setting 10% of the randomly selected pixels as missing entries. We compare our Bi-tr+, Tri-tr+ and Bi-tr+ methods (see Supplementary Materials for the details) to three state-of-the-art methods, including PCP [2], LRMF+ [43] and + [11] with . For fairness, we set the rank of all methods to 15, and for all these algorithms.

The results of different methods are shown in Figure 6

, where the text detection accuracy (the score Area Under the receiver operating characteristic Curve, AUC) and the RSE of low-rank component recovery are reported. Note that we present the best performance results of

+ with all choices of in . For both low-rank component recovery and text separation, our Bi-tr+ method is significantly better than the other methods, not only visually but also quantitatively. In addition, our Bi-tr+ and Tri-tr+ methods have very similar performance to the + method, and all these three methods outperform PCP and LRMF+ in terms of AUC and RSE. Moreover, the running time of PCP, LRMF+, +, Tri-tr+, Bi-tr+ and Bi-tr+ is 31.57sec, 6.91sec, 163.65sec, 0.96sec, 0.57sec and 1.62sec, respectively. In other words, our three methods are at least 7, 12 and 4 times faster than the other methods, respectively. This is a very impressive result as our three methods are nearly 170, 290 or 100 times faster than the most related + method, which further confirms that our methods have good scalability.

## 7 Conclusions

In this paper, we defined two tractable Schatten quasi-norm formulations, and then proved that they are in essence the Schatten-$1/2$ and $1/3$ quasi-norms, respectively. By applying the two defined quasi-norms to various rank minimization problems, such as MC and RPCA, we obtained some challenging non-smooth and non-convex problems. Then we designed two classes of efficient PALM and LADM algorithms to solve our problems with smooth and non-smooth loss functions, respectively. Finally, we established that each bounded sequence generated by our algorithms converges to a critical point, and also provided the recovery performance guarantees for our algorithms. Experiments on real-world data sets showed that our methods outperform the state-of-the-art methods in terms of both efficiency and effectiveness. For future work, we are interested in analyzing the recovery bound for our algorithms to solve the Bi-tr or Tri-tr quasi-norm regularized problems with non-smooth loss functions.

#### Acknowledgements

We thank the reviewers for their valuable comments. The authors are supported by the Hong Kong GRF 2150851. The project is funded by Research Committee of CUHK.

## References

• [1] E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.
• [2] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):1–37, 2011.
• [3] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
• [4] C. Hsieh and P. A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, pages 575–583, 2014.
• [5] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In NIPS, pages 25–32, 2007.
• [6] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In ACC, pages 4734–4739, 2001.
• [7] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52:471–501, 2010.
• [8] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its Oracle properties. J. Am. Statist. Assoc., 96:1348–1361, 2001.
• [9] Z. Lu and Y. Zhang. Schatten-$p$ quasi-norm regularized matrix optimization via iterative reweighted singular value minimization. arXiv:1401.0869v2, 2015.
• [10] C. Lu, J. Tang, S. Yan, and Z. Lin. Generalized nonconvex nonsmooth low-rank minimization. In CVPR, pages 4130–4137, 2014.
• [11] F. Nie, H. Wang, X. Cai, H. Huang, and C. Ding. Robust matrix completion via joint Schatten $p$-norm and $\ell_p$-norm minimization. In ICDM, pages 566–574, 2012.
• [12] A. Majumdar and R. K. Ward. An algorithm for sparse MRI reconstruction by Schatten $p$-norm minimization. Magn. Reson. Imaging, 29:408–417, 2011.
• [13] K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res., 13:3441–3473, 2012.
• [14] M. Lai, Y. Xu, and W. Yin. Improved iteratively reweighted least squares for unconstrained smoothed $\ell_q$ minimization. SIAM J. Numer. Anal., 51(2):927–957, 2013.
• [15] G. Marjanovic and V. Solo. On $\ell_q$ optimization and matrix completion. IEEE Trans. Signal Process., 60(11):5714–5724, 2012.
• [16] F. Nie, H. Huang, and C. Ding. Low-rank matrix recovery via efficient Schatten $p$-norm minimization. In AAAI, pages 655–661, 2012.
• [17] M. Zhang, Z. Huang, and Y. Zhang. Restricted $p$-isometry properties of nonconvex matrix recovery. IEEE Trans. Inform. Theory, 59(7):4316–4323, 2013.
• [18] F. Shang, Y. Liu, and J. Cheng. Scalable algorithms for tractable Schatten quasi-norm minimization. In AAAI, pages 2016–2022, 2016.
• [19] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, pages 1329–1336, 2004.
• [20] K. Mitra, S. Sheorey, and R. Chellappa. Large-scale matrix factorization with missing data under additional constraints. In NIPS, pages 1642–1650, 2010.
• [21] A. Aravkin, R. Kumar, H. Mansour, B. Recht, and F. J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM J. Sci. Comput., 36(5):S237–S266, 2014.
• [22] S. Foucart and M. Lai. Sparsest solutions of underdetermined linear systems via $\ell_q$-minimization for $0<q\le 1$. Appl. Comput. Harmon. Anal., 26:397–407, 2009.
• [23] Y. Liu, F. Shang, H. Cheng, and J. Cheng. A Grassmannian manifold algorithm for nuclear norm regularized least squares problems. In UAI, pages 515–524, 2014.
• [24] W. Bian, X. Chen, and Y. Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program., 149:301–327, 2015.
• [25] F. Shang, Y. Liu, J. Cheng, and H. Cheng. Robust principal component analysis with missing data. In CIKM, pages 1149–1158, 2014.
• [26] F. Shang, Y. Liu, J. Cheng, and H. Cheng. Recovering low-rank and sparse matrices via robust bilateral factorization. In ICDM, pages 965–970, 2014.
• [27] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435, 2013.
• [28] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146:459–494, 2014.
• [29] J. Yang and X. Yuan. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comp., 82:301–329, 2013.
• [30] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
• [31] I. Daubechies, M. Defrise, and C. DeMol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pur. Appl. Math., 57(11):1413–1457, 2004.
• [32] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, pages 612–620, 2011.
• [33] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program., 116:5–16, 2009.
• [34] S. Oymak, K. Mohan, M. Fazel, and B. Hassibi. A simplified approach to recovery conditions for low rank matrices. In ISIT, pages 2318–2322, 2011.
• [35] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, pages 1348–1356, 2009.
• [36] A. Rohde and A. B. Tsybakov. Estimation of high-dimensional low-rank matrices. Ann. Statist., 39(2):887–930, 2011.
• [37] E. Candès and Y. Plan. Matrix completion with noise. Proc. IEEE, 98(6):925–936, 2010.
• [38] P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, pages 937–945, 2010.
• [39] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inform. Theory, 56(6):2980–2998, 2010.
• [40] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim., 6:615–640, 2010.
• [41] KDDCup. ACM SIGKDD and Netflix. In Proc. KDD Cup and Workshop, 2007.
• [42] Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Math. Prog. Comp., 4(4):333–361, 2012.
• [43] R. Cabral, F. Torre, J. Costeira, and A. Bernardino. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In ICCV, pages 2488–2495, 2013.
• [44] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res., 11:2287–2322, 2010.
• [45] D. P. Bertsekas. Nonlinear Programming. The 2nd edition, Athena Scientific, Belmont, 2004.
• [46] M. C. Yue and A. M. C. So. A perturbation inequality for concave functions of singular values and its applications in low-rank matrix recovery. Appl. Comput. Harmon. Anal., 40(2):396–416, 2016.
• [47] Y. Wang and H. Xu. Stability of matrix factorization for collaborative filtering. In ICML, pages 417–424, 2012.
• [48] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS, pages 1033–1041, 2009.
• [49] J. Zeng, S. Lin, Y. Wang, and Z. Xu. $L_{1/2}$ regularization: Convergence of iterative half thresholding algorithm. IEEE Trans. Signal Process., 62(9):2317–2329, 2014.
• [50] R. Larsen. PROPACK: Software for large and sparse SVD calculations. Available from http://sun.stanford.edu/~rmunk/PROPACK/, 2005.

## 8 More Notations

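$\mathbb{R}^{n}$ denotes the $n$-dimensional Euclidean space, and the set of all $m\times n$ matrices with real entries is denoted by $\mathbb{R}^{m\times n}$. Given matrices $X$ and $Y\in\mathbb{R}^{m\times n}$, the inner product is defined by $\langle X, Y\rangle := \mathrm{Tr}(X^{T}Y)$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. $\|X\|_{2}$ is the spectral norm and is equal to the maximum singular value of $X$.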

$I$ denotes an identity matrix.
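As a quick numerical check of the notation above, the following minimal sketch evaluates the trace inner product and the spectral norm with NumPy. The helper names `inner_product` and `spectral_norm` and the example matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def inner_product(X, Y):
    """Matrix inner product <X, Y> = Tr(X^T Y) (hypothetical helper)."""
    return float(np.trace(X.T @ Y))

def spectral_norm(X):
    """Spectral norm: the largest singular value of X (hypothetical helper)."""
    # np.linalg.svd returns singular values in descending order.
    return float(np.linalg.svd(X, compute_uv=False)[0])

# Small illustrative matrices in R^{3x2}.
X = np.array([[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

ip = inner_product(X, Y)           # Tr(X^T Y) = 3*1 + 4*4 = 19
ip_entrywise = float(np.sum(X * Y))  # same value as a sum of entrywise products
sn = spectral_norm(X)              # singular values of X are 4 and 3

print(ip, ip_entrywise, sn)
```

The trace form and the entrywise sum agree because $\mathrm{Tr}(X^{T}Y)=\sum_{i,j}X_{ij}Y_{ij}$; the spectral norm also matches `np.linalg.norm(X, 2)`.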

For any vector $x\in\mathbb{R}^{n}$, its $\ell_{p}$ quasi-norm for $0<p<1$ is defined as

 ∥x∥