Fast Low-Rank Matrix Learning with Nonconvex Regularization. Matlab Code
Low-rank modeling has a lot of important applications in machine learning, computer vision and social network analysis. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better recovery performance. However, the resultant optimization problem is much more challenging. A very recent state-of-the-art is based on the proximal gradient algorithm. However, it requires an expensive full SVD in each proximal step. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, a cutoff can be derived to automatically threshold the singular values obtained from the proximal operator. This allows the use of power method to approximate the SVD efficiently. Besides, the proximal operator can be reduced to that of a much smaller matrix projected onto this leading subspace. Convergence, with a rate of O(1/T) where T is the number of iterations, can be guaranteed. Extensive experiments are performed on matrix completion and robust principal component analysis. The proposed method achieves significant speedup over the state-of-the-art. Moreover, the matrix solution obtained is more accurate and has a lower rank than that of the traditional nuclear norm regularizer.READ FULL TEXT VIEW PDF
Low-rank modeling has many important applications in computer vision and...
Matrix learning is at the core of many machine learning problems. To
Matrix and tensor completion aim to recover a low-rank matrix / tensor f...
Recently, neural embeddings of documents have shown success in various l...
Many practical problems involve the recovery of a binary matrix from par...
Matrix completion focuses on recovering a matrix from a small subset of ...
Low-rank matrix completion is a problem of immense practical importance....
Fast Low-Rank Matrix Learning with Nonconvex Regularization. Matlab Code
The learning of low-rank matrices is a central issue in many machine learning problems. For example, matrix completion , which is one of the most successful approaches in collaborative filtering, assumes that the target ratings matrix is low-rank. Besides collaborative filtering, matrix completion has also been used on tasks such as sensor networks , social network analysis , and image processing [5, 6].
Another important use of low-rank matrix learning is robust principal component analysis (RPCA) , which assumes the target matrix is low-rank and also corrupted by sparse data noise. It is now popularly used in various computer vision applications, such as shadow removal of aligned faces and background modeling of surveillance videos [7, 8]. Besides, low-rank minimization has also been used in tasks such as multilabel learning  and multitask learning .
However, rank minimization is NP-hard. To alleviate this problem, a common approach is to use instead a convex surrogate such as the nuclear norm (which is the sum of singular values of the matrix). It is known that the nuclear norm is the tightest convex lower bound of the rank. Besides, there are theoretical guarantees that the incomplete matrix can be recovered with nuclear norm regularization [2, 7]. Moreover, though the nuclear norm is non-smooth, the resultant optimization problem can often be solved efficiently using modern tools such as accelerated proximal gradient descent 
, Soft-Impute, and active subspace selection methods .
Despite its success, recently there have been numerous attempts that use nonconvex surrogates to better approximate the rank function. The key idea is that the larger, and thus more informative, singular values should be less penalized. Example nonconvex low-rank regularizers include the capped- penalty , log-sum penalty (LSP) , truncated nuclear norm (TNN) , smoothly clipped absolute deviation (SCAD) , and minimax concave penalty (MCP) . Empirically, these nonconvex regularizers achieve better recovery performance than the convex nuclear norm regularizer.
However, the resultant nonconvex optimization problem is much more challenging. One approach is to use the concave-convex procedure , which decomposes the nonconvex regularizer into a difference of convex functions [14, 16]. However, a sequence of relaxed problems have to be solved, and can be computationally expensive . A more efficient method is the recently proposed iteratively re-weighted nuclear norm (IRNN) algorithm 
. It is based on the observation that existing nonconvex regularizers are all concave and their super-gradients are non-increasing. Though IRNN still has to iterate, each of its iterations only involves computing the super-gradient of the regularizer and a singular value decomposition (SVD). However, performing SVD on amatrix (where ) still takes time, and can be expensive when the matrix is large.
. However, it requires performing the full SVD in each proximal step, which is expensive for large-scale applications. To alleviate this problem, we first observe that for the commonly-used nonconvex low-rank regularizers, the singular values obtained from the corresponding proximal operator can be automatically thresholded. One then only needs to find the leading singular values/vectors in order to generate the next iterate. By using the power method, a fast and accurate approximation of such a subspace can be obtained. Moreover, instead of computing the proximal operator on a large matrix, one only needs to compute that on its projection onto this leading subspace. The size of the matrix is significantly reduced and the proximal operator can be made much more efficient. In the context of matrix completion problems, further speedup is possible by exploiting a special “sparse plus low-rank” structure of the matrix iterate.
The rest of the paper is organized as follows. Section II reviews the related work. The proposed algorithm is presented in Section III; Experimental results on matrix completion and RPCA are shown in Section IV, and the last section gives some concluding remarks.
In the sequel, the transpose of vector/matrix is denoted by the superscript . For a matrix , is its trace, is the Frobenius norm, and is the nuclear norm. Given , constructs a diagonal matrix whose th diagonal element is . Moreover,
denotes the identity matrix. For a differentiable function, we use for its gradient. For a nonsmooth function, we use for its subdifferential.
In this paper, we consider composite optimization problems of the form
where is smooth and is nonsmooth. In many machine learning problems, is the loss and a low-rank regularizer. In particular, we make the following assumptions on .
, not necessarily convex, is differentiable with -Lipschitz continuous gradient, i.e., . Without loss of generality, we assume that .
is bounded below, i.e., .
In recent years, proximal gradient algorithms  have been widely used for solving (1). At each iteration , a quadratic function is used to upper-bound the smooth at the current iterate , while leaving the nonsmooth intact. For a given stepsize , the next iterate is obtained as
where , and is the proximal operator . Proximal gradient algorithms can be further accelerated, by replacing with a proper linear combination of and . In the sequel, as our focus is on learning low-rank matrices, in (1) becomes a matrix .222In the following, we assume .
An important factor for the success of proximal gradient algorithms is that its proximal operator can be efficiently computed. For example, for the nuclear norm , the following Proposition shows that its proximal operator has a closed-form solution.
 , where is the SVD of , and .
While the convex nuclear norm makes the low-rank optimization problem easier, it may not be a good approximation of the matrix rank [8, 16, 21, 22]. As mentioned in Section I, a number of nonconvex surrogates for the rank have been recently proposed. In this paper, we make the following assumption on the low-rank regularizer in (1).
is possibly non-smooth and nonconvex, and of the form , where are singular values of , and is a concave and non-decreasing function of with .
The Iteratively Reweighted Nuclear Norm (IRNN) algorithm  is a state-of-the-art solver for nonconvex low-rank minimization. It is based on upper-bounding the nonconvex , and approximates the matrix rank by a weighted version of the nuclear norm , where , Intuitively, imposes a smaller penalty on the larger (and more informative) singular values. Other solvers that are designed only for specific nonconvex low-rank regularizers include  (for the capped-),  (for the TNN), and  (for the MCP). All these (including IRNN) need SVD in each iteration. It takes time, and thus can be slow.
While proximal gradient algorithms have mostly been used on convex problems, recently they are also applied to nonconvex ones [8, 16, 21, 22]. In particular, in the very recent generalized proximal gradient (GPG) algorithm , it is shown that for any nonconvex satisfying assumption A3, its proximal operator can be computed by the following generalized singular value thresholding (GSVT) operator.
 , where is the SVD of , and with
In this section, we show that the GSVT operator can be computed more efficiently. It is based on two ideas. First, the singular values in are automatically thresholded. Second, can be obtained from the proximal operator on a smaller matrix.
The following Proposition shows that in (2) becomes zero when is smaller than a regularizer-specific threshold. Because of the lack of space, proofs will be reported in a longer version of this paper.
For any satisfying Assumption A3, there exists a threshold such that when .
For the nonconvex regularizers in Table I, their values are equal to
MCP: if , and otherwise.
Proposition III.1 suggests that in each proximal iteration , we only need to compute the leading singular values/vectors of the matrix iterate . The power method (Algorithm 1)  is a fast and accurate algorithm for obtaining an approximation of such a subspace. Besides the power method, algorithms such as PROPACK  have also been used . However, the power method is more efficient than PROPACK . It also allows warm-start, which is particularly useful because of the iterative nature of the proximal gradient algorithm.
Assume that has singular values larger than , and its rank- SVD is . The following Proposition shows that can be obtained from the proximal operator on a smaller matrix.
Assume that , where , is orthogonal and . Then, .
Though SVD is still needed to obtain , is much smaller than ( vs ). This smaller SVD takes time, and the other matrix multiplication steps take time. Thus, the time complexity for this SVD step is reduced from to .
The complete procedure (Algorithm 2) will be called FaNCL (Fast NonConvex Lowrank). The core steps are 9–16. We first use the power method to efficiently obtain an approximate , whose singular values are then thresholded according to Corollary III.2. With , the rank of will be equal to that of . In each iteration, we ensure a sufficient decrease of the objective:
where ; otherwise, the power method is restarted. Moreover, similar to [13, 28], steps 6-7 use the column spaces of the previous iterates ( and ) to warm-start the power method. For further speedup, we employ a continuation strategy as in [12, 21, 28]. Specifically, is initialized to a large value and then decreases gradually.
Algorithm 2 can also be used with the nuclear norm. It can be shown that the threshold is equal to , and in step 15 has the closed-form solution . However, since our focus is on nonconvex regularizers, using Algorithm 2 for nuclear norm minimization will not be further pursued in the sequel.
The power method has also been recently used to approximate the SVT in nuclear norm minimization . However,  is based on active subspace selection (which uses SVT to update the active row and column subspaces of the current solution), and is thus very different from the proposed algorithm (which is a proximal gradient algorithm). In Section IV, it will be shown that the proposed method has better empirical performance. Moreover,  is only designed for nuclear norm minimization, and cannot be extended for the nonconvex regularizers considered here.
A breakdown of the time complexity of Algorithm 2 is as follows. For simplicity, assume that ’s always have rank . Step 5 takes time; step 6 and 7 take time; step 9 and 10 take time; step 17 takes time; and step 18 takes time. Thus, the per-iteration time complexity is . In the experiment, we set and . Empirically, this setting is enough to guarantee (3). In contrast, SVDs in GPG and IRNN take time, and are thus much slower as .
The following Proposition shows that from Algorithm 2 converges to a limit point .
The following shows that it is also a critical point.333Since is nonconvex and its subdifferential for points in its domain may be empty, we define as a critical point by extending the definition in , namely that , where , and and are convex.
converges to a critical point of problem (1) in a finite number of iterations.
By combining with Proposition III.4, the following shows that converges to zero at a rate of .
In matrix completion, one attempts to recover a low-rank matrix by observing only some of its elements. Let the observed positions be indicated by , such that if is observed, and 0 otherwise. It can be formulated as an optimization problem in (1), with , where if and 0 otherwise, and is a low-rank regularizer.
It can be easily seen that step 5 in Algorithm 2 becomes . By observing that is low-rank and is sparse, Mazumder et al.  showed that this “sparse plus low-rank” structure allows matrix multiplications of the form and to be efficiently computed. Here, this trick can also be directly used to speed up the computation of in Algorithm 1. Since is very sparse, this step takes time instead of , thus is much faster.
The following Proposition shows that in step 18 of Algorithm 2 can also be easily computed.
Let the reduced SVD of be , and be orthogonal matrices such that and . Then .
Let the reduced SVDs of and be and , respectively. Let and . Using Proposition III.7, . This takes instead of time. The per-iteration time complexity is reduced from to and is much faster. Table II compares the per-iteration time complexities and convergence rates for the various low-rank matrix completion solvers used in the experiments (Section IV-A).
|(convex)||APG [11, 28]|
|norm||active ALT |
The proposed algorithm can be extended for optimization problems involving matrices :
Assumptions A1-A3 are analogously extended. In particular, A1 now assumes that for some , where is the function obtained by keeping all the ’s (where ) in fixed.
Many machine learning problems can be cast into this form. One example that will be considered in Section IV is robust principal component analysis (RPCA) . Given a noisy data matrix , RPCA assumes that can be approximated by the sum of a low-rank matrix plus sparse data noise . Mathematically, we have
where , is a low-rank regularizer on , and encourages to be sparse. Since both and the regularizer are nonsmooth, (5) does not fit into formulation (1). Besides RPCA, problems such as subspace clustering , multilabel learning  and multitask learning  can also be cast as (4).
For simplicity, we focus on the case with two parameter blocks. Extension to multiple blocks is straightforward. To solve the two-block problem in (5), we perform alternating proximal steps on and at each iteration :
where , and . can be easily obtained as , where denotes the sign of . Similar to (3), we ensure a sufficient decrease of the objective in each iteration:
where , and . The resultant algorithm is similar to Algorithm 2.
With parameter blocks and generated by the algorithm, we have
converges to a critical point of (4) in a finite number of iterations;
|(observed: )||(observed: )||(observed: )||(observed: )|
We compare a number of low-rank matrix completion solvers, including models based on (i) the commonly used (convex) nuclear norm regularizer; (ii) fixed-rank factorization models [29, 30], which decompose the observed matrix into a product of rank- matrices and . Its optimization problem can be written as: ; and (iii) nonconvex regularizers, including the capped- (with in Table I set to ), LSP (with ), and TNN (with ).
The nuclear norm minimization algorithms to be compared include:
Accelerated proximal gradient (APG)444http://perception.csl.illinois.edu/matrix-rank/Files/apg_partial.zip algorithm [11, 28], with the partial SVD by PROPACK ;
Soft-Impute555http://cran.r-project.org/web/packages/softImpute/index.html , which iteratively replaces the missing elements with those obtained from SVT. The “sparse plus low-rank” structure of the matrix iterate is utilized to speed up computation (Section III-E);
Active alternating minimization666http://www.cs.utexas.edu/~cjhsieh/nuclear_active_1.1.zip (denoted “active ALT”) , which adds/removes rank-one subspaces from the active set in each iteration. The nuclear norm optimization problem is then reduced to a smaller problem defined only on this active set.
We do not compare with the Frank-Wolfe algorithm 34], as they have been shown to be less efficient . For the fixed-rank factorization models (where the rank is tuned by the validation set), we compare with the two state-of-the-art algorithms:
Low-rank matrix fitting (LMaFit) algorithm777http://www.caam.rice.edu/~optimization/L1/LMaFit/download.html ; and
Rank-one matrix pursuit (R1MP) , which pursues a rank-one basis in each iteration.
For models with nonconvex low-rank regularizers, we compare the following solvers:
Iterative reweighted nuclear norm (IRNN)888https://sites.google.com/site/canyilu/file/2014-CVPR-IRNN.zip?attredirects=0&d=1 ;
The proposed FaNCL algorithm (, ).
All algorithms are implemented in Matlab. The same stopping criterion is used, namely that the algorithm stops when the difference in objective values between consecutive iterations is smaller than a given threshold. Experiments are run on a PC with i7 4GHz CPU and 24GB memory.
The observed matrix is generated as , where the elements of (with
) are sampled i.i.d. from the normal distribution, and elements of sampled from . A total of random elements in are observed. Half of them are used for training, and the rest as validation set for parameter tuning. Testing is performed on the non-observed (missing) elements.
For performance evaluation, we use (i) the normalized mean squared error , where is the recovered matrix; (ii) rank of ; and (iii) training CPU time. We vary in the range . Each experiment is repeated five times.
Results are shown in Table III. As can be seen, the nonconvex regularizers (capped-, LSP and TNN) lead to much lower NMSE’s than the convex nuclear norm regularizer and fixed-rank factorization. Moreover, as is also observed in , the nuclear norm needs to use a much higher rank than the nonconvex ones. In terms of speed, FaNCL is the fastest among the nonconvex low-rank solvers. Figure 1 shows its speedup over GPG (which in turn is faster than IRNN). As can be seen, the larger the matrix, the higher is the speedup.
Recall that the efficiency of the proposed algorithm comes from (i) automatic singular value thresholding; (ii) computing the proximal operator on a smaller matrix; and (iii) exploiting the “sparse plus low-rank” structure in matrix completion. Their individual contributions are examined in Table IV. The baseline is GPG, which uses none of these; while the proposed FaNCL uses all. As all the variants produce the same solution, the obtained NMSE and rank values are not shown. As can be seen, tricks (i), (ii) and (iii) lead to average speedups of about 6, 4, and 3, respectively; and are particularly useful on the large data sets.
|i, ii, iii (FaNCL)||0.3||0.9||2.6||6.8|
|i, ii, iii (FaNCL)||0.5||2.2||3.3||8.2|
|i, ii, iii (FaNCL)||0.3||0.8||2.7||3.3|
Experiment is performed on the popular MovieLens999http://grouplens.org/datasets/movielens/ data set (Table V), which contain ratings of different users on movies. We follow the setup in , and use of the observed ratings for training, for validation and the rest for testing. For performance evaluation, we use the root mean squared error on the test set : , where is the recovered matrix. The experiment is repeated five times.
Results are shown in Table VI. Again, nonconvex regularizers lead to the lowest RMSE’s. Moreover, FaNCL is also the fastest among the nonconvex low-rank solvers, even faster than the state-of-the-art. In particular, it is the only solver (among those compared) that can be run on the MovieLens-10M data. Table VII examines the usefulness of the three tricks. The behavior is similar to that as observed in Table IV. Figures 2 and 3 compare the objective and RMSE vs CPU time for the various methods on the MovieLens-100K data set. As can be seen, FaNCL decreases the objective and RMSE much faster than the others.
|i, ii, iii (FaNCL)||3.2||29.4||634.6|
|i, ii, iii (FaNCL)||0.7||25.6||616.3|
|i, ii, iii (FaNCL)||1.9||25.8||710.7|
Next, we perform experiments on two very large recommendation data sets, Netflix101010http://archive.ics.uci.edu/ml/datasets/Netflix+Prize and Yahoo111111http://webscope.sandbox.yahoo.com/catalog.php?datatype=c (Table V). We randomly use of the observed ratings for training, for validation and the rest for testing. Each experiment is repeated five times.
Results are shown in Table VIII. APG, Soft-Impute, GPG and IRNN cannot be run as the data set is large. Figure 4 shows the objective and RMSE vs time for the compared methods.121212On these two data sets, R1MP easily overfits as the rank increases. Hence, the validation set selects a rank which is small (relative to that obtained by the nuclear norm) and R1MP stops earlier. However, as can be seen, its RMSE is much worse. Again, the nonconvex regularizers converge faster, yield lower RMSE’s and solutions of much lower ranks. Moreover, FaNCL is fast.
|nuclear norm||active ALT||399||47.6||221||118.9|
RPCA performance of the various methods on synthetic data. The standard deviations of NMSE are all smaller thanand so not reported. CPU time is in seconds.
In this section, we first perform experiments on a synthetic data set. The observed matrix is generated as , where elements of (with ) are sampled i.i.d. from , and elements of are sampled from . Matrix is sparse, with of its elements randomly set to or
with equal probabilities. The sparsity regularizer is the standard, while different convex/nonconvex low-rank regularizers are used.
For performance evaluation, we use (i) NMSE , where and are the recovered low-rank and sparse components, respectively in (5); (ii) accuracy on locating the sparse support of (i.e., percentage of entries that both and are nonzero or zero together); and (iii) the recovered rank. We vary in . Each experiment is repeated five times.
Note that IRNN and the active subspace selection method cannot be used here. Their objectives are of the form “smooth function plus low-rank regularizer”, while RPCA has a nonsmooth regularizer besides its low-rank regularizer. Similarly, Soft-Impute is for matrix completion only.
Results are shown in Table IX. The accuracy on locating the sparse support are always 100% for all methods, and thus are not shown. As can be seen, while both convex and nonconvex regularizers can perfectly recover the matrix rank and sparse locations, the nonconvex regularizers have lower NMSE’s. Moreover, as in matrix completion, FaNCL is again much faster. The larger the matrix, the higher is the speedup.
In this section, we use RPCA to perform video denoising on background removal of corrupted videos. Four benchmark videos131313http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html in [7, 8] are used (Table X), and example image frames are shown in Figure 5. As discussed in , the stable image background can be treated as low-rank, while the foreground moving objects contribute to the sparse component.
|#pixels / frame||19,200||20,480||20,800||25,344|
Each image frame is reshaped as a column vector, and all frames are then stacked together to form a matrix. The pixel values are normalized to , and Gaussian noise from is added. The experiment is repeated five times.
For performance evaluation, we use the commonly used peak signal-to-noise ratio : PSNR , where , is the recovered video, and is the ground-truth.
Results are shown in Table XI. As can be seen, the nonconvex regularizers lead to better PSNR’s than the convex nuclear norm. Moreover, FaNCL is more than times faster than GPG. Figure 6 shows an example of the recovered foreground in the bootstrap video. As can been seen, the nonconvex regularizers can better separate foreground from background. Figure 7 shows the PSNR vs time on bootstrap. Again, FaNCL converges much faster than others.