Fast Low-Rank Matrix Learning with Nonconvex Regularization

12/03/2015 · by Quanming Yao, et al. · The Hong Kong University of Science and Technology

Low-rank modeling has many important applications in machine learning, computer vision and social network analysis. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better recovery performance. However, the resultant optimization problem is much more challenging. A very recent state-of-the-art solver is based on the proximal gradient algorithm, but it requires an expensive full SVD in each proximal step. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, a cutoff can be derived to automatically threshold the singular values obtained from the proximal operator. This allows the power method to be used to efficiently approximate the SVD. Moreover, the proximal operator can be reduced to that of a much smaller matrix projected onto this leading subspace. Convergence, with a rate of O(1/T) where T is the number of iterations, can be guaranteed. Extensive experiments are performed on matrix completion and robust principal component analysis. The proposed method achieves significant speedup over the state-of-the-art. Moreover, the matrix solution obtained is more accurate and has a lower rank than that of the traditional nuclear norm regularizer.


I Introduction

The learning of low-rank matrices is a central issue in many machine learning problems. For example, matrix completion [2], which is one of the most successful approaches in collaborative filtering, assumes that the target ratings matrix is low-rank. Besides collaborative filtering, matrix completion has also been used on tasks such as sensor networks [3], social network analysis [4], and image processing [5, 6].

Another important use of low-rank matrix learning is robust principal component analysis (RPCA) [7], which assumes the target matrix is low-rank and also corrupted by sparse data noise. It is now popularly used in various computer vision applications, such as shadow removal of aligned faces and background modeling of surveillance videos [7, 8]. Besides, low-rank minimization has also been used in tasks such as multilabel learning [9] and multitask learning [10].

However, rank minimization is NP-hard. To alleviate this problem, a common approach is to use instead a convex surrogate such as the nuclear norm (which is the sum of the singular values of the matrix). It is known that the nuclear norm is the tightest convex lower bound of the rank. Besides, there are theoretical guarantees that the incomplete matrix can be recovered with nuclear norm regularization [2, 7]. Moreover, though the nuclear norm is non-smooth, the resultant optimization problem can often be solved efficiently using modern tools such as accelerated proximal gradient descent [11], Soft-Impute [12], and active subspace selection methods [13].

Despite its success, recently there have been numerous attempts to use nonconvex surrogates that better approximate the rank function. The key idea is that the larger, and thus more informative, singular values should be less penalized. Example nonconvex low-rank regularizers include the capped-$\ell_1$ penalty [14], log-sum penalty (LSP) [15], truncated nuclear norm (TNN) [16], smoothly clipped absolute deviation (SCAD) [17], and minimax concave penalty (MCP) [18]. Empirically, these nonconvex regularizers achieve better recovery performance than the convex nuclear norm regularizer.

However, the resultant nonconvex optimization problem is much more challenging. One approach is to use the concave-convex procedure [19], which decomposes the nonconvex regularizer into a difference of convex functions [14, 16]. However, a sequence of relaxed problems has to be solved, which can be computationally expensive [20]. A more efficient method is the recently proposed iteratively re-weighted nuclear norm (IRNN) algorithm [21]. It is based on the observation that existing nonconvex regularizers are all concave and their super-gradients are non-increasing. Though IRNN still has to iterate, each of its iterations only involves computing the super-gradient of the regularizer and a singular value decomposition (SVD). However, performing SVD on an $m \times n$ matrix (where $m \ge n$) still takes $O(mn^2)$ time, which can be expensive when the matrix is large.

Recently, the proximal gradient algorithm has also been used on this nonconvex low-rank minimization problem [8, 16, 21, 22]. However, it requires performing the full SVD in each proximal step, which is expensive for large-scale applications. To alleviate this problem, we first observe that for the commonly-used nonconvex low-rank regularizers, the singular values obtained from the corresponding proximal operator can be automatically thresholded. One then only needs to find the leading singular values/vectors in order to generate the next iterate. By using the power method [23], a fast and accurate approximation of such a subspace can be obtained. Moreover, instead of computing the proximal operator on a large matrix, one only needs to compute that of its projection onto this leading subspace. The size of the matrix is significantly reduced and the proximal operator can be made much more efficient. In the context of matrix completion problems, further speedup is possible by exploiting a special “sparse plus low-rank” structure of the matrix iterate.

The rest of the paper is organized as follows. Section II reviews the related work. The proposed algorithm is presented in Section III; Experimental results on matrix completion and RPCA are shown in Section IV, and the last section gives some concluding remarks.

In the sequel, the transpose of a vector/matrix is denoted by the superscript $^\top$. For a matrix $X$, $\text{tr}(X)$ is its trace, $\|X\|_F$ is the Frobenius norm, and $\|X\|_* = \sum_i \sigma_i(X)$ is the nuclear norm, where $\sigma_i(X)$ is the $i$th singular value of $X$. Given $x = [x_i]$, $\text{Diag}(x)$ constructs a diagonal matrix whose $i$th diagonal element is $x_i$, and $I$ denotes the identity matrix. For a differentiable function $f$, we use $\nabla f$ for its gradient; for a nonsmooth function, we use $\partial f$ for its subdifferential.

II Background

II-A Proximal Gradient Algorithms

In this paper, we consider composite optimization problems of the form

$\min_x F(x) \equiv f(x) + g(x),$   (1)

where $f$ is smooth and $g$ is nonsmooth. In many machine learning problems, $f$ is the loss and $g$ a low-rank regularizer. In particular, we make the following assumptions on $f$.

  • A1. $f$, not necessarily convex, is differentiable with $\rho$-Lipschitz continuous gradient, i.e., $\|\nabla f(x_1) - \nabla f(x_2)\| \le \rho\|x_1 - x_2\|$. Without loss of generality, we assume that $\rho \le 1$.

  • A2. $F$ is bounded below, i.e., $\inf F(x) > -\infty$.

In recent years, proximal gradient algorithms [24] have been widely used for solving (1). At each iteration $t$, a quadratic function is used to upper-bound the smooth $f$ at the current iterate $x_t$, while leaving the nonsmooth $g$ intact. For a given stepsize $\tau$, the next iterate is obtained as

$x_{t+1} = \text{prox}_{\tau g}(z_t) \equiv \arg\min_x \tfrac{1}{2}\|x - z_t\|^2 + \tau g(x),$

where $z_t = x_t - \tau\nabla f(x_t)$, and $\text{prox}_{\tau g}(\cdot)$ is the proximal operator [24]. Proximal gradient algorithms can be further accelerated by replacing $z_t$ with a proper linear combination of $x_t$ and $x_{t-1}$. In the sequel, as our focus is on learning low-rank matrices, $x$ in (1) becomes a matrix $X \in \mathbb{R}^{m \times n}$ (in the following, we assume $m \ge n$).
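To make this update concrete, the following is a minimal NumPy sketch of the (unaccelerated) iteration; the function names and the toy $\ell_1$-regularized least-squares example are ours, not from the paper or its released code.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, tau, iters=100):
    """Unaccelerated proximal gradient: x <- prox_{tau*g}(x - tau*grad_f(x))."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - tau * grad_f(x), tau)
    return x

# Toy usage: least-squares loss with an elementwise soft-thresholding prox (l1 regularizer).
A, b, lam = np.random.randn(20, 5), np.random.randn(20), 0.1
x = proximal_gradient(lambda x: A.T @ (A @ x - b),
                      lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0),
                      np.zeros(5), tau=1.0 / np.linalg.norm(A, 2) ** 2)
```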

II-B Convex and Nonconvex Low-Rank Regularizers

An important factor in the success of proximal gradient algorithms is that the proximal operator can be computed efficiently. For example, for the nuclear norm $\|X\|_*$, the following Proposition shows that its proximal operator has a closed-form solution.

Proposition II.1.

[25] $\text{prox}_{\tau\|\cdot\|_*}(X) = U(\Sigma - \tau I)_+ V^\top$, where $U\Sigma V^\top$ is the SVD of $X$, and $(A)_+ = [\max(A_{ij}, 0)]$.
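For concreteness, here is a minimal NumPy sketch of this singular value thresholding (SVT) step; it is illustrative code, not the authors' implementation.

```python
import numpy as np

def prox_nuclear(X, tau):
    """Proximal operator of tau*||.||_*: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```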

While the convex nuclear norm makes the low-rank optimization problem easier, it may not be a good approximation of the matrix rank [8, 16, 21, 22]. As mentioned in Section I, a number of nonconvex surrogates for the rank have been recently proposed. In this paper, we make the following assumption on the low-rank regularizer in (1).

  • A3. $g$ is possibly non-smooth and nonconvex, and of the form $g(X) = \sum_{i=1}^{n} \hat{g}(\sigma_i(X))$, where the $\sigma_i(X)$’s are the singular values of $X$, and $\hat{g}$ is a concave and non-decreasing function of $\sigma \ge 0$ with $\hat{g}(0) = 0$.

All nonconvex low-rank regularizers introduced in Section I satisfy this assumption. Their corresponding $\hat{g}$’s are shown in Table I.

capped-$\ell_1$: $\hat{g}(\sigma_i) = \lambda\min(\sigma_i, \theta)$
LSP: $\hat{g}(\sigma_i) = \lambda\log(1 + \sigma_i/\theta)$
TNN: $\hat{g}(\sigma_i) = \lambda\sigma_i$ if $i > \theta$, and $0$ otherwise
SCAD: $\hat{g}(\sigma_i) = \lambda\sigma_i$ if $\sigma_i \le \lambda$; $\frac{-\sigma_i^2 + 2\theta\lambda\sigma_i - \lambda^2}{2(\theta - 1)}$ if $\lambda < \sigma_i \le \theta\lambda$; and $\frac{(\theta + 1)\lambda^2}{2}$ otherwise
MCP: $\hat{g}(\sigma_i) = \lambda\sigma_i - \frac{\sigma_i^2}{2\theta}$ if $\sigma_i \le \theta\lambda$, and $\frac{\theta\lambda^2}{2}$ otherwise
TABLE I: $\hat{g}$’s for some popular nonconvex low-rank regularizers. For the TNN regularizer, $\theta$ is the number of leading singular values that are not penalized.

The Iteratively Reweighted Nuclear Norm (IRNN) algorithm [21] is a state-of-the-art solver for nonconvex low-rank minimization. It is based on upper-bounding the nonconvex $\hat{g}$, and approximates the matrix rank by a weighted version of the nuclear norm $\sum_{i=1}^{n} w_i\sigma_i(X)$, where $0 \le w_1 \le \dots \le w_n$. Intuitively, this imposes a smaller penalty on the larger (and more informative) singular values. Other solvers that are designed only for specific nonconvex low-rank regularizers include [8] (for the capped-$\ell_1$), [16] (for the TNN), and [26] (for the MCP). All of these (including IRNN) need SVD in each iteration. This takes $O(mn^2)$ time, and thus can be slow.

While proximal gradient algorithms have mostly been used on convex problems, recently they have also been applied to nonconvex ones [8, 16, 21, 22]. In particular, in the very recent generalized proximal gradient (GPG) algorithm [22], it is shown that for any nonconvex $g$ satisfying assumption A3, its proximal operator can be computed by the following generalized singular value thresholding (GSVT) operator.

Proposition II.2.

[22] $\text{prox}_{\tau g}(X) = U\,\text{Diag}(y^*)\,V^\top$, where $U\Sigma V^\top$ is the SVD of $X$, and $y^* = [y_i^*]$ with

$y_i^* \in \arg\min_{y_i \ge 0} \tfrac{1}{2}\left(y_i - \sigma_i(X)\right)^2 + \tau\hat{g}(y_i).$   (2)

In GPG, problem (2) is solved by a fixed-point iteration algorithm. Indeed, closed-form solutions exist for the regularizers in Table I [20]. While the obtained proximal operator can then be immediately plugged into a proximal gradient algorithm, Proposition II.2 still involves SVD.
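As an illustration of the GSVT operator, the sketch below uses the LSP penalty as an example and solves the scalar problem (2) with a generic bounded search (GPG uses a fixed-point iteration, and [20] gives closed forms; a bounded search may also return only a local minimizer for strongly nonconvex $\hat{g}$). The function names are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lsp(y, lam=1.0, theta=1.0):
    """Log-sum penalty (LSP): g_hat(y) = lam * log(1 + y / theta)."""
    return lam * np.log(1.0 + y / theta)

def gsvt(X, tau, g_hat=lsp):
    """Generalized SVT: full SVD, then solve the scalar problem (2) per singular value."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    y = np.empty_like(s)
    for i, sigma in enumerate(s):
        obj = lambda v, sig=sigma: 0.5 * (v - sig) ** 2 + tau * g_hat(v)
        y[i] = minimize_scalar(obj, bounds=(0.0, sigma + 1.0), method="bounded").x
    return U @ np.diag(y) @ Vt
```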

III Proposed Algorithm

In this section, we show that the GSVT operator can be computed more efficiently. It is based on two ideas. First, the singular values in $\text{prox}_{\tau g}(X)$ are automatically thresholded. Second, $\text{prox}_{\tau g}(X)$ can be obtained from the proximal operator on a smaller matrix.

III-A Automatic Thresholding of Singular Values

The following Proposition shows that $y_i^*$ in (2) becomes zero when $\sigma_i(X)$ is smaller than a regularizer-specific threshold. Because of the lack of space, proofs will be reported in a longer version of this paper.

Proposition III.1.

For any $\hat{g}$ satisfying Assumption A3, there exists a threshold $\gamma > 0$ such that $y_i^* = 0$ when $\sigma_i(X) \le \gamma$.

By examining the optimality conditions of (2), simple closed-form solutions can be obtained for the nonconvex regularizers in Table I.

Corollary III.2.

For the nonconvex regularizers in Table I, the corresponding thresholds $\gamma$ are equal to

  • capped-$\ell_1$: ;

  • LSP: ;

  • TNN: ;

  • SCAD: ;

  • MCP: if , and otherwise.
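A quick numerical sanity check of Proposition III.1 (and of such thresholds) is to scan the scalar problem (2) directly. This is a rough sketch with our own names and tolerances; the bounded search may only return a local minimizer for strongly nonconvex $\hat{g}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_threshold(g_hat, tau, sigmas=None):
    """Scan sigma and return the largest value below which (2) is minimized at zero."""
    if sigmas is None:
        sigmas = np.linspace(0.0, 5.0, 501)
    gamma = 0.0
    for sigma in sigmas:
        obj = lambda v, sig=sigma: 0.5 * (v - sig) ** 2 + tau * g_hat(v)
        y = minimize_scalar(obj, bounds=(0.0, sigma + 1.0), method="bounded").x
        if y > 1e-3:          # minimizer has become nonzero
            break
        gamma = sigma
    return gamma
```

For example, `empirical_threshold(lsp, tau=1.0)` with the LSP sketch from Section II-B approximates the corresponding $\gamma$.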

0:  matrix $Z \in \mathbb{R}^{m \times n}$, warm-start $R \in \mathbb{R}^{n \times k}$, number of iterations $J$.
1:  $Y_1 = ZR$;
2:  for $j = 1, 2, \dots, J$ do
3:     $Q_j = \text{QR}(Y_j)$;  // QR decomposition (keep the orthonormal factor)
4:     $Y_{j+1} = Z(Z^\top Q_j)$;
5:  end for
6:  return $Q_J$.
Algorithm 1 Power method to obtain an approximate left subspace of $Z$.

Proposition III.1 suggests that in each proximal iteration, we only need to compute the leading singular values/vectors of the matrix iterate. The power method (Algorithm 1) [23] is a fast and accurate algorithm for obtaining an approximation of such a subspace. Besides the power method, algorithms such as PROPACK [27] have also been used [28]. However, the power method is more efficient than PROPACK [23]. It also allows warm-start, which is particularly useful because of the iterative nature of the proximal gradient algorithm.
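A NumPy rendering of Algorithm 1 follows (the released FaNCL code is in Matlab; this sketch and its defaults are ours).

```python
import numpy as np

def power_method(Z, R, n_iter=3):
    """Algorithm 1: approximate an orthonormal basis of the leading left subspace of Z.

    Z is (m, n); R is an (n, k) warm-start matrix; n_iter power iterations are run.
    """
    Y = Z @ R
    Q = None
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)      # QR decomposition; keep the orthonormal factor
        Y = Z @ (Z.T @ Q)           # one power iteration on Z Z^T
    return Q
```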

III-B Proximal Operator on a Smaller Matrix

Assume that $Z$ has $k$ singular values larger than $\gamma$, and that its rank-$k$ SVD is $U_k\Sigma_k V_k^\top$. The following Proposition shows that $\text{prox}_{\tau g}(Z)$ can be obtained from the proximal operator on a smaller matrix.

Proposition III.3.

Assume that $Q \in \mathbb{R}^{m \times q}$, where $q \ge k$, is orthogonal and $\text{span}(U_k) \subseteq \text{span}(Q)$. Then, $\text{prox}_{\tau g}(Z) = Q\,\text{prox}_{\tau g}(Q^\top Z)$.

Though an SVD is still needed to obtain $\text{prox}_{\tau g}(Q^\top Z)$, $Q^\top Z$ is much smaller than $Z$ ($q \times n$ vs $m \times n$). This smaller SVD takes $O(nq^2)$ time, and the other matrix multiplication steps take $O(mnq)$ time. Thus, the time complexity for this step is reduced from $O(mn^2)$ to $O(mnq)$.
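Combining Propositions III.1 and III.3, an approximate GSVT step only needs a power-method basis $Q$ and an SVD of the small matrix $Q^\top Z$. The sketch below reuses power_method from the previous listing and again solves (2) by a bounded scalar search; the names and threshold handling are ours, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def approx_prox(Z, R, tau, g_hat, gamma, n_iter=3):
    """Approximate prox_{tau*g}(Z) = Q prox_{tau*g}(Q^T Z) (Proposition III.3).

    R warm-starts the power method (Algorithm 1); gamma is the threshold from
    Corollary III.2, so singular values <= gamma are dropped without solving (2).
    """
    Q = power_method(Z, R, n_iter)                 # (m, q) orthonormal basis
    B = Q.T @ Z                                    # small (q, n) matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    keep = s > gamma                               # automatic thresholding (Prop. III.1)
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    y = np.array([minimize_scalar(
            lambda v, sig=sig: 0.5 * (v - sig) ** 2 + tau * g_hat(v),
            bounds=(0.0, sig + 1.0), method="bounded").x for sig in s])
    return (Q @ U) @ np.diag(y) @ Vt
```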

III-C Complete Procedure

The complete procedure (Algorithm 2) will be called FaNCL (Fast NonConvex Lowrank). The core steps are 9–16. We first use the power method to efficiently obtain an approximate left subspace $Q$ of $Z_t$; the singular values of $Q^\top Z_t$ are then thresholded according to Corollary III.2, and the surviving ones are transformed via (2). With Proposition III.3, the rank of $X_{t+1}$ will be equal to that of $\text{prox}_{\tau g}(Z_t)$. In each iteration, we ensure a sufficient decrease of the objective:

$F(X_{t+1}) \le F(X_t) - \tfrac{c_1}{2}\|X_{t+1} - X_t\|_F^2,$   (3)

where $c_1 > 0$ is a constant; otherwise, the power method is restarted. Moreover, similar to [13, 28], steps 6–7 use the column spaces of the previous iterates ($V_t$ and $V_{t-1}$) to warm-start the power method. For further speedup, we employ a continuation strategy as in [12, 21, 28]: specifically, $\lambda$ is initialized to a large value and then decreased gradually.

1:  choose stepsize $\tau \le 1$, constant $c_1 > 0$, and the number of power-method iterations $J$;
2:  initialize $V_0, V_1$ as random Gaussian matrices, and $X_1 = 0$;
3:  for $t = 1, 2, \dots, T$ do
4:     update $\lambda$ by the continuation strategy;
5:     $Z_t = X_t - \tau\nabla f(X_t)$;
6:     $R = [V_t, V_{t-1} - V_t(V_t^\top V_{t-1})]$, and remove any zero columns;
7:     $R = \text{QR}(R)$;
8:     for $p = 1, 2, \dots$ do
9:        $Q = \text{PowerMethod}(Z_t, R, J)$;  // Algorithm 1
10:        $[U, \Sigma, V] = \text{SVD}(Q^\top Z_t)$;
11:        $a$ = number of $\sigma_i(Q^\top Z_t)$’s that are larger than the $\gamma$ in Corollary III.2;
12:        $U_a$ = $a$ leading columns of $U$;
13:        $V_a$ = $a$ leading columns of $V$;
14:        for $i = 1, 2, \dots, a$ do
15:           obtain $y_i$ from (2) with $\sigma_i(Q^\top Z_t)$ and $\hat{g}$;
16:        end for
17:        $X_{t+1} = (QU_a)\,\text{Diag}([y_1, \dots, y_a])\,V_a^\top$;
18:        if $F(X_{t+1}) \le F(X_t) - \frac{c_1}{2}\|X_{t+1} - X_t\|_F^2$ then
19:           $U_{t+1} = QU_a\,\text{Diag}([y_1, \dots, y_a])$,   $V_{t+1} = V_a$;
20:           break;
21:        else
22:           $R = \text{QR}([R, V_a])$;  // restart the power method with an enlarged warm start
23:        end if
24:     end for
25:  end for
26:  return  $X_{T+1}$.
Algorithm 2 FaNCL (Fast NonConvex Low-rank).

Algorithm 2 can also be used with the (convex) nuclear norm $g(X) = \lambda\|X\|_*$. It can be shown that the threshold $\gamma$ is equal to $\tau\lambda$, and $y_i$ in step 15 has the closed-form solution $y_i = \max(\sigma_i(Q^\top Z_t) - \tau\lambda, 0)$. However, since our focus is on nonconvex regularizers, using Algorithm 2 for nuclear norm minimization will not be further pursued in the sequel.
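For instance, with $\hat{g}(\sigma) = \lambda\sigma$, the approximate proximal step sketched in Section III-B could be invoked as follows (illustrative only; approx_prox, $Z$ and $R$ are the hypothetical names used in the earlier sketches):

```python
import numpy as np

# Nuclear-norm special case of the approximate proximal step: threshold tau*lam.
lam, tau = 0.1, 1.0
Z = np.random.randn(200, 100)       # stand-in for the iterate Z_t
R = np.random.randn(100, 10)        # warm-start for the power method
X_next = approx_prox(Z, R, tau, g_hat=lambda y: lam * y, gamma=tau * lam)
```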

The power method has also been recently used to approximate the SVT in nuclear norm minimization [13]. However, [13] is based on active subspace selection (which uses SVT to update the active row and column subspaces of the current solution), and is thus very different from the proposed algorithm (which is a proximal gradient algorithm). In Section IV, it will be shown that the proposed method has better empirical performance. Moreover, [13] is only designed for nuclear norm minimization, and cannot be extended for the nonconvex regularizers considered here.

A breakdown of the time complexity of Algorithm 2 is as follows. For simplicity, assume that the $X_t$’s always have rank $k$. Step 5 takes $O(mnk)$ time; steps 6 and 7 take $O(nk^2)$ time; steps 9 and 10 take $O(mnk)$ time; step 17 takes $O(mnk)$ time; and step 18 takes $O(mnk)$ time. Thus, the per-iteration time complexity is $O(mnk)$. In the experiments, small fixed values are used for $J$ and $c_1$; empirically, this setting is enough to guarantee (3). In contrast, the SVDs in GPG and IRNN take $O(mn^2)$ time, and are thus much slower as $k \ll n$.

III-D Convergence Analysis

The following Proposition shows that the sequence $\{X_t\}$ generated by Algorithm 2 converges to a limit point $\tilde{X}$.

Proposition III.4.

$\lim_{t \to \infty} \|X_{t+1} - X_t\|_F = 0$.

The following shows that it is also a critical point. (Since $g$ is nonconvex and its subdifferential at points in its domain may be empty, we define $\tilde{X}$ as a critical point by extending the definition in [20], namely that $0 \in \nabla f(\tilde{X}) + \partial g_1(\tilde{X}) - \partial g_2(\tilde{X})$, where $g = g_1 - g_2$, and $g_1$ and $g_2$ are convex.)

Theorem III.5.

The sequence $\{X_t\}$ converges to a critical point of problem (1) in a finite number of iterations.

By combining with Proposition III.4, the following shows that $\min_{t=1,\dots,T}\|X_{t+1} - X_t\|_F^2$ converges to zero at a rate of $O(1/T)$.

Corollary III.6.

$\min_{t=1,\dots,T}\|X_{t+1} - X_t\|_F^2 = O(1/T)$.

III-E Further Speedup for Matrix Completion

In matrix completion, one attempts to recover a low-rank matrix $O$ by observing only some of its elements. Let the observed positions be indicated by $\Omega \in \{0, 1\}^{m \times n}$, such that $\Omega_{ij} = 1$ if $O_{ij}$ is observed, and 0 otherwise. Matrix completion can be formulated as an optimization problem of the form (1), with $f(X) = \frac{1}{2}\|P_\Omega(X - O)\|_F^2$, where $[P_\Omega(A)]_{ij} = A_{ij}$ if $\Omega_{ij} = 1$ and 0 otherwise, and $g$ is a low-rank regularizer.

It can be easily seen that step 5 in Algorithm 2 becomes $Z_t = X_t - \tau P_\Omega(X_t - O)$. By observing that $X_t$ is low-rank and $P_\Omega(X_t - O)$ is sparse, Mazumder et al. [12] showed that this “sparse plus low-rank” structure allows matrix multiplications of the form $Z_t R$ and $Z_t^\top Q$ to be computed efficiently. Here, this trick can also be directly used to speed up the computation of $Z(Z^\top Q)$ in Algorithm 1. Since $\Omega$ is very sparse, this step takes $O(\|\Omega\|_1 k + (m + n)k^2)$ time instead of $O(mnk)$, and thus is much faster.
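Concretely, with $X_t$ kept in factored form $UV^\top$ and the residual $P_\Omega(X_t - O)$ kept as a sparse matrix, the products needed by Algorithm 1 never require forming $Z_t$. A SciPy sketch under these assumptions (the names are ours):

```python
import numpy as np
import scipy.sparse as sp

def z_times(U, V, S, tau, R):
    """Z @ R for Z = U V^T - tau * S, without forming Z (S sparse, U V^T low-rank)."""
    return U @ (V.T @ R) - tau * (S @ R)

def zT_times(U, V, S, tau, Q):
    """Z^T @ Q computed analogously."""
    return V @ (U.T @ Q) - tau * (S.T @ Q)

# Toy usage: X_t = U V^T, with S standing in for the sparse residual P_Omega(X_t - O).
m, n, k = 1000, 500, 5
U, V, R = np.random.randn(m, k), np.random.randn(n, k), np.random.randn(n, k)
S = sp.random(m, n, density=0.01, format="csr")
ZR = z_times(U, V, S, tau=1.0, R=R)    # costs O(||Omega||_1 k + (m + n) k^2)
```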

The following Proposition shows that $\|X_{t+1} - X_t\|_F$ in step 18 of Algorithm 2 can also be easily computed.

Proposition III.7.

Let the reduced SVD of $X$ be $U\Sigma V^\top$, and $\hat{U}, \hat{V}$ be orthogonal matrices such that $\text{span}(U) \subseteq \text{span}(\hat{U})$ and $\text{span}(V) \subseteq \text{span}(\hat{V})$. Then $\|X\|_F = \|\hat{U}^\top X\hat{V}\|_F$.

Let the reduced SVDs of $X_{t+1}$ and $X_t$ be $U_{t+1}\Sigma_{t+1}V_{t+1}^\top$ and $U_t\Sigma_t V_t^\top$, respectively. Let $\hat{U}$ and $\hat{V}$ be orthonormal bases of $[U_{t+1}, U_t]$ and $[V_{t+1}, V_t]$. Using Proposition III.7, $\|X_{t+1} - X_t\|_F = \|\hat{U}^\top(X_{t+1} - X_t)\hat{V}\|_F$. This takes $O((m + n)k^2)$ instead of $O(mn)$ time. The per-iteration time complexity is thus reduced from $O(mnk)$ to $O(\|\Omega\|_1 k + (m + n)k^2)$, which is much faster. Table II compares the per-iteration time complexities and convergence rates of the various low-rank matrix completion solvers used in the experiments (Section IV-A).
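For instance, the term $\|X_{t+1} - X_t\|_F^2$ in (3) can be evaluated from the factored iterates alone via the identity $\|A - B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2\,\text{tr}(A^\top B)$; this is the same spirit as Proposition III.7, though not the authors' exact routine. A NumPy sketch:

```python
import numpy as np

def diff_fro_sq(U1, V1, U2, V2):
    """||U1 V1^T - U2 V2^T||_F^2 from the factors only, in O((m + n) k^2) time."""
    a = np.sum((U1.T @ U1) * (V1.T @ V1))    # ||U1 V1^T||_F^2 = tr((U1^T U1)(V1^T V1))
    b = np.sum((U2.T @ U2) * (V2.T @ V2))
    c = np.sum((U1.T @ U2) * (V1.T @ V2))    # tr((U1 V1^T)^T (U2 V2^T))
    return a + b - 2.0 * c
```

In Algorithm 2, the factors of $X_{t+1}$ are readily available from step 17 (e.g., $U_1 = QU_a\,\text{Diag}([y_1, \dots, y_a])$ and $V_1 = V_a$).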

regularizer method complexity rate
(convex) APG [11, 28]
nuclear Soft-Impute [12]
norm active ALT [13]
fixed-rank LMaFit [29]
factorization R1MP [30]
nonconvex IRNN [21]
GPG [22]
FaNCL
TABLE II: Comparison of the per-iteration time complexities and convergence rates of various matrix completion solvers. Here, is a constant.

III-F Handling Multiple Matrix Variables

The proposed algorithm can be extended to optimization problems involving matrix blocks $X_1, \dots, X_s$:

$\min_{X_1, \dots, X_s} F(X_1, \dots, X_s) \equiv f(X_1, \dots, X_s) + \sum_{i=1}^{s} g_i(X_i).$   (4)

Assumptions A1–A3 are analogously extended. In particular, A1 now assumes that $\nabla f_i$ is $\rho_i$-Lipschitz continuous for some $\rho_i \le 1$, where $f_i$ is the function obtained by keeping all the $X_j$’s (where $j \ne i$) in $f$ fixed.

Many machine learning problems can be cast into this form. One example that will be considered in Section IV is robust principal component analysis (RPCA) [7]. Given a noisy data matrix $O$, RPCA assumes that $O$ can be approximated by the sum of a low-rank matrix $X$ and sparse data noise $S$. Mathematically, we have

$\min_{X, S} \tfrac{1}{2}\|X + S - O\|_F^2 + g(X) + \mu\|S\|_1,$   (5)

where $\|S\|_1 = \sum_{ij}|S_{ij}|$, $g$ is a low-rank regularizer on $X$, and the $\ell_1$ term encourages $S$ to be sparse. Since both $\|S\|_1$ and the low-rank regularizer are nonsmooth, (5) does not fit into formulation (1). Besides RPCA, problems such as subspace clustering [31], multilabel learning [9] and multitask learning [10] can also be cast as (4).

For simplicity, we focus on the case with two parameter blocks; extension to multiple blocks is straightforward. To solve the two-block problem in (5), we perform alternating proximal steps on $X$ and $S$ at each iteration $t$:

$X_{t+1} = \text{prox}_{\tau g}\left(X_t - \tau\nabla_X f(X_t, S_t)\right), \qquad S_{t+1} = \text{prox}_{\tau\mu\|\cdot\|_1}\left(S_t - \tau\nabla_S f(X_{t+1}, S_t)\right),$

where $f(X, S) = \frac{1}{2}\|X + S - O\|_F^2$, $\nabla_X f(X_t, S_t) = X_t + S_t - O$, and $\nabla_S f(X_{t+1}, S_t) = X_{t+1} + S_t - O$. $S_{t+1}$ can be easily obtained as $\text{sign}(A) \odot \max(|A| - \tau\mu, 0)$ with $A = S_t - \tau\nabla_S f(X_{t+1}, S_t)$, where $\text{sign}(\cdot)$ denotes the elementwise sign. Similar to (3), we ensure a sufficient decrease of the objective in each iteration:

$F(X_{t+1}, S_{t+1}) \le F(X_t, S_t) - \tfrac{c_1}{2}\left(\|X_{t+1} - X_t\|_F^2 + \|S_{t+1} - S_t\|_F^2\right),$

where $F$ is the objective in (5), and $c_1$ is as in (3). The resultant algorithm is similar to Algorithm 2.
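A sketch of one such alternating iteration is given below (reusing approx_prox from Section III-B; the sparsity weight mu and the other names here are our own notation, not the authors' code).

```python
import numpy as np

def soft_threshold(A, thr):
    """Elementwise prox of thr * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - thr, 0.0)

def rpca_step(X, S, O, R, tau, mu, g_hat, gamma):
    """One alternating proximal iteration for (5), with f(X, S) = 0.5*||X + S - O||_F^2."""
    X_new = approx_prox(X - tau * (X + S - O), R, tau, g_hat, gamma)   # low-rank block
    S_new = soft_threshold(S - tau * (X_new + S - O), tau * mu)        # sparse block
    return X_new, S_new
```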

When the objective is convex, convergence of this alternating minimization scheme has been well studied [32]. However, here $F$ is nonconvex. We extend the convergence results in Section III-D to the following.

Theorem III.8.

With the parameter blocks $\{X_t\}$ and $\{S_t\}$ generated by the algorithm, we have

  1. $\lim_{t \to \infty}\left(\|X_{t+1} - X_t\|_F + \|S_{t+1} - S_t\|_F\right) = 0$;

  2. $\{(X_t, S_t)\}$ converges to a critical point of (4) in a finite number of iterations;

  3. $\min_{t=1,\dots,T}\left(\|X_{t+1} - X_t\|_F^2 + \|S_{t+1} - S_t\|_F^2\right) = O(1/T)$.

IV Experiments

IV-A Matrix Completion

(observed: ) (observed: ) (observed: ) (observed: )
NMSE rank time NMSE rank time NMSE rank time NMSE rank time
nuclear APG 49 4.8 59 59.5 71 469.3 85 1383.3
norm Soft-Impute 49 64.9 59 176.0 71 464.4 85 1090.2
active ALT 49 17.1 59 81.9 71 343.8 85 860.1
fixed LMaFit 5 0.6 5 1.7 5 4.5 5 7.1
rank R1MP 39 0.3 54 0.8 62 1.4 63 3.4
capped IRNN 5 8.5 5 75.5 5 510.8 5 1112.3
GPG 5 8.5 5 72.4 5 497.0 5 1105.8
FaNCL 5 0.3 5 0.9 5 2.6 5 4.1
LSP IRNN 5 21.8 5 223.9 5 720.9 5 2635.0
GPG 5 21.2 5 235.3 5 687.4 5 2612.0
FaNCL 5 0.5 5 2.2 5 3.3 5 7.6
TNN IRNN 5 8.5 5 72.6 5 650.7 5 1104.1
GPG 5 8.3 5 71.7 5 655.3 5 1098.2
FaNCL 5 0.3 5 0.8 5 2.7 5 4.2
TABLE III: Matrix completion performance on the synthetic data. Here, NMSE is scaled by , and CPU time is in seconds.

We compare a number of low-rank matrix completion solvers, including models based on (i) the commonly used (convex) nuclear norm regularizer; (ii) fixed-rank factorization [29, 30], which decomposes the observed matrix into a product of rank-$k$ matrices $U$ and $V$, with optimization problem $\min_{U, V} \frac{1}{2}\|P_\Omega(UV^\top - O)\|_F^2$; and (iii) nonconvex regularizers, including the capped-$\ell_1$, LSP, and TNN penalties in Table I.

The nuclear norm minimization algorithms to be compared include:

  1. Accelerated proximal gradient (APG) algorithm [11, 28] (http://perception.csl.illinois.edu/matrix-rank/Files/apg_partial.zip), with the partial SVD computed by PROPACK [27];

  2. Soft-Impute [12] (http://cran.r-project.org/web/packages/softImpute/index.html), which iteratively replaces the missing elements with those obtained from SVT. The “sparse plus low-rank” structure of the matrix iterate is utilized to speed up computation (Section III-E);

  3. Active alternating minimization (denoted “active ALT”) [13] (http://www.cs.utexas.edu/~cjhsieh/nuclear_active_1.1.zip), which adds/removes rank-one subspaces from the active set in each iteration. The nuclear norm optimization problem is then reduced to a smaller problem defined only on this active set.

We do not compare with the Frank-Wolfe algorithm [33] and stochastic gradient descent [34], as they have been shown to be less efficient [13]. For the fixed-rank factorization models (where the rank is tuned on the validation set), we compare with two state-of-the-art algorithms:

  1. Low-rank matrix fitting (LMaFit) algorithm [29] (http://www.caam.rice.edu/~optimization/L1/LMaFit/download.html); and

  2. Rank-one matrix pursuit (R1MP) [30], which pursues a rank-one basis in each iteration.

We do not compare with the concave-convex procedure [14, 16], since it has been shown to be inferior to IRNN [20].

For models with nonconvex low-rank regularizers, we compare the following solvers:

  1. Generalized proximal gradient (GPG) algorithm [22], with the underlying problem (2) solved more efficiently using the closed-form solutions in [20];

  2. The proposed FaNCL algorithm (Algorithm 2).

All algorithms are implemented in Matlab. The same stopping criterion is used, namely that the algorithm stops when the difference in objective values between consecutive iterations is smaller than a given threshold. Experiments are run on a PC with i7 4GHz CPU and 24GB memory.

IV-A1 Synthetic Data

The observed matrix is generated as $O = UV^\top + G$, where the elements of $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ (with $k = 5$) are sampled i.i.d. from the standard normal distribution, and the elements of the noise matrix $G$ are sampled from a zero-mean Gaussian. A subset of random elements in $O$ is observed. Half of them are used for training, and the rest as a validation set for parameter tuning. Testing is performed on the non-observed (missing) elements.

For performance evaluation, we use (i) the normalized mean squared error (NMSE) on the missing elements, where $X$ is the recovered matrix; (ii) the rank of $X$; and (iii) the training CPU time. We vary the matrix size $m$ over four settings. Each experiment is repeated five times.

Results are shown in Table III. As can be seen, the nonconvex regularizers (capped-$\ell_1$, LSP and TNN) lead to much lower NMSE’s than the convex nuclear norm regularizer and fixed-rank factorization. Moreover, as is also observed in [34], the nuclear norm needs to use a much higher rank than the nonconvex regularizers. In terms of speed, FaNCL is the fastest among the nonconvex low-rank solvers. Figure 1 shows its speedup over GPG (which in turn is faster than IRNN). As can be seen, the larger the matrix, the higher the speedup.

Fig. 1: Speedup of FaNCL over GPG at different matrix sizes.

Recall that the efficiency of the proposed algorithm comes from (i) automatic singular value thresholding; (ii) computing the proximal operator on a smaller matrix; and (iii) exploiting the “sparse plus low-rank” structure in matrix completion. Their individual contributions are examined in Table IV. The baseline is GPG, which uses none of these; while the proposed FaNCL uses all. As all the variants produce the same solution, the obtained NMSE and rank values are not shown. As can be seen, tricks (i), (ii) and (iii) lead to average speedups of about 6, 4, and 3, respectively; and are particularly useful on the large data sets.

solver
capped baseline (GPG) 8.5 72.4 497.0 1105.8
i 5.4 37.6 114.8 203.7
i, ii 0.6 3.2 11.4 25.6
i, ii, iii (FaNCL) 0.3 0.9 2.6 6.8
LSP baseline (GPG) 21.2 235.3 687.4 2612.0
i 4.9 44.0 70.0 154.9
i, ii 1.0 9.7 14.8 31.1
i, ii, iii (FaNCL) 0.5 2.2 3.3 8.2
TNN baseline (GPG) 8.3 71.7 655.3 1098.2
i 5.4 32.5 122.3 194.1
i, ii 0.6 2.8 10.3 15.8
i, ii, iii (FaNCL) 0.3 0.8 2.7 3.3
TABLE IV: Effects of the three tricks on CPU time (in seconds) using the synthetic data. (i) automatic singular value thresholding; (ii) computing the proximal operator on a smaller matrix; and (iii) “sparse plus low-rank” structure.

IV-A2 MovieLens

(a) capped-$\ell_1$.
(b) LSP.
(c) TNN.
Fig. 2: Objective value vs CPU time for the various nonconvex low-rank regularizers on the MovieLens-100K data set.

Experiments are performed on the popular MovieLens data sets (http://grouplens.org/datasets/movielens/) (Table V), which contain ratings of different users on movies. We follow the setup in [30], and use a portion of the observed ratings for training, a portion for validation and the rest for testing. For performance evaluation, we use the root mean squared error (RMSE) on the test set, where the error is measured between the recovered matrix $X$ and the ground-truth ratings. The experiment is repeated five times.

#users #movies #ratings
MovieLens 100K 943 1,682 100,000
1M 6,040 3,449 999,714
10M 69,878 10,677 10,000,054
netflix 480,189 17,770 100,480,507
yahoo 249,012 296,111 62,551,438
TABLE V: Recommendation data sets used in the experiments.

Results are shown in Table VI. Again, the nonconvex regularizers lead to the lowest RMSE’s. Moreover, FaNCL is the fastest among the nonconvex low-rank solvers, and is even faster than the state-of-the-art nuclear norm solvers. In particular, it is the only nonconvex low-rank solver (among those compared) that can be run on the MovieLens-10M data. Table VII examines the usefulness of the three tricks. The behavior is similar to that observed in Table IV. Figures 2 and 3 compare the objective and RMSE vs CPU time for the various methods on the MovieLens-100K data set. As can be seen, FaNCL decreases the objective and RMSE much faster than the others.

MovieLens-100K MovieLens-1M MovieLens-10M
RMSE rank time RMSE rank time RMSE rank time
nuclear norm APG 36 18.9 67 735.8
Soft-Impute 36 13.8 67 311.8
active ALT 36 4.1 67 133.4 119 3675.2
fixed rank LMaFit 2 3.0 6 39.2 9 650.1
R1MP 5 0.1 19 2.9 29 37.3
capped- IRNN 3 558.9
GPG 3 523.6
FaNCL 3 3.2 5 29.4 8 634.6
LSP IRNN 2 195.9
GPG 2 192.8
FaNCL 2 0.7 5 25.6 9 616.3
TNN IRNN 3 621.9
GPG 3 572.7
FaNCL 3 1.9 5 25.8 8 710.7
TABLE VI: Matrix completion results on the MovieLens data sets (time is in seconds).
solver 100K 1M 10M
capped baseline (GPG) 523.6
i 212.2 1920.5
i, ii 29.2 288.8
i, ii, iii (FaNCL) 3.2 29.4 634.6
LSP baseline (GPG) 192.8
i 35.8 2353.8
i, ii 5.6 212.4
i, ii, iii (FaNCL) 0.7 25.6 616.3
TNN baseline (GPG) 572.7
i 116.9 1944.8
i, ii 15.4 256.1
i, ii, iii (FaNCL) 1.9 25.8 710.7
TABLE VII: Effects of the three tricks on CPU time (in seconds) on the MovieLens data.
Fig. 3: RMSE vs CPU time on the MovieLens-100K data set.
(a) netflix.
(b) yahoo.
Fig. 4: RMSE vs CPU time on the netflix and yahoo data sets.

IV-A3 Netflix and Yahoo

Next, we perform experiments on two very large recommendation data sets, Netflix (http://archive.ics.uci.edu/ml/datasets/Netflix+Prize) and Yahoo (http://webscope.sandbox.yahoo.com/catalog.php?datatype=c) (Table V). We randomly use a portion of the observed ratings for training, a portion for validation and the rest for testing. Each experiment is repeated five times.

Results are shown in Table VIII. APG, Soft-Impute, GPG and IRNN cannot be run as the data sets are too large. Figure 4 shows the objective and RMSE vs time for the compared methods. (On these two data sets, R1MP easily overfits as the rank increases. Hence, the validation set selects a rank which is small relative to that obtained by the nuclear norm, and R1MP stops earlier. However, as can be seen, its RMSE is much worse.) Again, the nonconvex regularizers converge faster, and yield lower RMSE’s and solutions of much lower ranks. Moreover, FaNCL remains fast.

netflix yahoo
RMSE rank time RMSE rank time
nuclear norm active ALT 399 47.6 221 118.9
fixed rank LMaFit 16 2.4 10 6.6
R1MP 31 0.2 92 0.3
capped- FaNCL 15 2.5 8 5.9
LSP FaNCL 13 1.9 9 6.1
TNN FaNCL 17 3.3 8 6.2
TABLE VIII: Results on the netflix and yahoo data sets (CPU time is in hours).

IV-B Robust Principal Component Analysis

IV-B1 Synthetic Data

NMSE rank time NMSE rank time NMSE rank time NMSE rank time
nuclear norm APG 5 1.5 10 9.7 15 33.9 20 94.7
capped- GPG 5 0.9 10 6.7 15 18.7 20 60.4
FaNCL 5 0.2 10 1.4 15 2.7 20 6.5
LSP GPG 5 2.7 10 18.5 15 111.2 20 250.2
FaNCL 5 0.4 10 1.8 15 3.9 20 7.1
TNN GPG 5 0.8 10 6.0 15 23.1 20 51.4
FaNCL 5 0.2 10 1.2 15 2.9 20 5.8
TABLE IX: RPCA performance of the various methods on synthetic data. The standard deviations of NMSE are all very small and so are not reported. CPU time is in seconds.

In this section, we first perform experiments on a synthetic data set. The observed matrix is generated as $O = UV^\top + \tilde{S} + G$, where the elements of $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ are sampled i.i.d. from the standard normal distribution, and the elements of $G$ are sampled from a zero-mean Gaussian. Matrix $\tilde{S}$ is sparse, with a fraction of its elements randomly set to a large positive or negative value with equal probabilities. The sparsity regularizer is the standard $\ell_1$ norm, while different convex/nonconvex low-rank regularizers are used.

For performance evaluation, we use (i) the NMSE, where $X$ and $S$ are the recovered low-rank and sparse components in (5), respectively; (ii) the accuracy in locating the sparse support of $\tilde{S}$ (i.e., the percentage of entries on which $S$ and $\tilde{S}$ are both nonzero or both zero); and (iii) the recovered rank. We vary the matrix size $m$. Each experiment is repeated five times.

Note that IRNN and the active subspace selection method cannot be used here. Their objectives are of the form “smooth function plus low-rank regularizer”, while RPCA has a nonsmooth regularizer besides its low-rank regularizer. Similarly, Soft-Impute is for matrix completion only.

Results are shown in Table IX. The accuracy of locating the sparse support is always 100% for all methods, and thus is not shown. As can be seen, while both the convex and nonconvex regularizers can perfectly recover the matrix rank and sparse locations, the nonconvex regularizers have lower NMSE’s. Moreover, as in matrix completion, FaNCL is again much faster. The larger the matrix, the higher the speedup.

IV-B2 Background Removal on Videos

In this section, we use RPCA for background removal on corrupted videos. Four benchmark videos (http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html) from [7, 8] are used (Table X), and example image frames are shown in Figure 5. As discussed in [7], the relatively stable image background can be treated as low-rank, while the foreground moving objects contribute to the sparse component.

bootstrap campus escalator hall
#pixels / frame 19,200 20,480 20,800 25,344
total #frames 9,165 4,317 10,251 10,752
TABLE X: Videos used in the experiment.
(a) bootstrap.
(b) campus.
(c) escalator.
(d) hall.
Fig. 5: Example image frames in the videos.
(a) original.
(b) nuclear norm.
(c) capped-$\ell_1$.
(d) LSP.
(e) TNN.
Fig. 6: Example foreground images in bootstrap, as recovered by using various low-rank regularizers.

Each image frame is reshaped as a column vector, and all frames are then stacked together to form a matrix. The pixel values are normalized to $[0, 1]$, and zero-mean Gaussian noise is added. The experiment is repeated five times.

For performance evaluation, we use the commonly used peak signal-to-noise ratio [35]: $\text{PSNR} = -10\log_{10}\text{MSE}$, where $\text{MSE} = \frac{1}{mn}\|X - X_0\|_F^2$, $X$ is the recovered video, and $X_0$ is the ground truth.

Results are shown in Table XI. As can be seen, the nonconvex regularizers lead to better PSNR’s than the convex nuclear norm. Moreover, FaNCL is often more than ten times faster than GPG. Figure 6 shows an example of the recovered foreground in the bootstrap video. As can be seen, the nonconvex regularizers can better separate the foreground from the background. Figure 7 shows the PSNR vs CPU time on bootstrap. Again, FaNCL converges much faster than the others.

bootstrap campus escalator hall
PSNR time PSNR time PSNR time PSNR time
nuclear norm APG 688.4 102.6 942.5 437.7
capped- GPG 1009.3 90.6 1571.2 620.0
FaNCL 60.4 12.4 68.3 34.7
LSP GPG 1420.2 88.9 1523.1 683.9
FaNCL 56.0 17.4 54.5 35.8
TNN GPG 1047.5 130.3 1857.7 626.2
FaNCL