Introduction
The sparse and low rank structures have received much attention in recent years. Many applications exploit these two structures, such as face recognition [Wright et al.2009], subspace clustering [Cheng et al.2010, Liu et al.2013b] and background modeling [Candès et al.2011]. To achieve sparsity, a principled approach is to use the convex $\ell_1$ norm. However, the $\ell_1$ minimization may be suboptimal, since the $\ell_1$ norm is a loose approximation of the $\ell_0$ norm and often leads to an overpenalized problem. This brings the attention back to nonconvex surrogates that interpolate between the $\ell_0$ norm and the $\ell_1$ norm. Many nonconvex penalties have been proposed, including the $\ell_p$ norm ($0 < p < 1$) [Frank and Friedman1993], Smoothly Clipped Absolute Deviation (SCAD) [Fan and Li2001], Logarithm [Friedman2012], Minimax Concave Penalty (MCP) [Zhang and others2010], Geman [Geman and Yang1995] and Laplace [Trzasko and Manduca2009]. Their definitions are shown in Table 1. Numerical studies [Candès, Wakin, and Boyd2008] have shown that the nonconvex optimization usually outperforms convex models.
Table 1: Popular nonconvex surrogate functions of the $\ell_0$ norm.
Penalty  Formula $g(\theta)$, $\theta \ge 0$, $\lambda > 0$

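$\ell_p$ norm  $\lambda\theta^p$, $0 < p < 1$.
SCAD  $\lambda\theta$ if $\theta \le \lambda$; $\frac{-\theta^2 + 2\gamma\lambda\theta - \lambda^2}{2(\gamma - 1)}$ if $\lambda < \theta \le \gamma\lambda$; $\frac{\lambda^2(\gamma + 1)}{2}$ if $\theta > \gamma\lambda$.
Logarithm  $\frac{\lambda}{\log(\gamma + 1)}\log(\gamma\theta + 1)$.
MCP  $\lambda\theta - \frac{\theta^2}{2\gamma}$ if $\theta < \gamma\lambda$; $\frac{1}{2}\gamma\lambda^2$ if $\theta \ge \gamma\lambda$.
Geman  $\frac{\lambda\theta}{\theta + \gamma}$.
Laplace  $\lambda\left(1 - \exp\left(-\frac{\theta}{\gamma}\right)\right)$.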
The low rank structure is an extension of sparsity defined on the singular values of a matrix. A principled way is to use the nuclear norm, which is a convex surrogate of the rank function [Recht, Fazel, and Parrilo2010]. However, it suffers from the same suboptimality issue as the $\ell_1$ norm in many cases. Very recently, many popular nonconvex surrogate functions in Table 1 have been extended to the singular values to better approximate the rank function [Lu et al.2014]. However, unlike in convex optimization, the nonconvex low rank minimization is much more challenging than the nonconvex sparse minimization.
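The Iteratively Reweighted Nuclear Norm (IRNN) method is proposed to solve the following nonconvex low rank minimization problem [Lu et al.2014]
$$\min_{X \in \mathbb{R}^{m \times n}} F(X) = \sum_{i=1}^{m} g(\sigma_i(X)) + h(X), \qquad (1)$$
where $\sigma_i(X)$ denotes the $i$-th singular value of $X \in \mathbb{R}^{m \times n}$ (we assume $m \le n$ in this work). $g: \mathbb{R}^+ \to \mathbb{R}^+$ is continuous, concave and nondecreasing on $[0, +\infty)$. Popular nonconvex surrogate functions in Table 1 are some examples. $h: \mathbb{R}^{m \times n} \to \mathbb{R}^+$ is the loss function which has Lipschitz continuous gradient. IRNN updates $X^{k+1}$ by minimizing a surrogate function which upper bounds the objective function in (1). The surrogate function is constructed by linearizing $g$ and $h$ at $X^k$, simultaneously. In theory, IRNN guarantees to decrease the objective function value of (1) in each iteration. However, it may decrease slowly since the upper bound surrogate may be quite loose. It is expected that minimizing a tighter surrogate will lead to a faster convergence.
A possible tighter surrogate function of the objective function in (1) is to keep $g$ and relax $h$ only. This leads to the following updating rule, which is named the Generalized Proximal Gradient (GPG) method in this work
$$X^{k+1} = \arg\min_X \sum_{i=1}^{m} g(\sigma_i(X)) + h(X^k) + \langle \nabla h(X^k), X - X^k \rangle + \frac{\mu}{2}\|X - X^k\|_F^2 = \arg\min_X \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{\mu}{2}\left\|X - X^k + \frac{1}{\mu}\nabla h(X^k)\right\|_F^2, \qquad (2)$$
where $\mu > L(h)$, with $L(h)$ the Lipschitz constant of $\nabla h$, guarantees the convergence of GPG as shown later. It can be seen that solving (2) requires solving the following problem
$$\mathrm{Prox}_g^{\sigma}(B) = \arg\min_X \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|X - B\|_F^2. \qquad (3)$$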
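In this work, the mapping $\mathrm{Prox}_g^{\sigma}(\cdot)$ is called the Generalized Singular Value Thresholding (GSVT) operator associated with the function $\sum_{i=1}^{m} g(\cdot)$ defined on the singular values. If $g(x) = \lambda x$, $\sum_{i=1}^{m} g(\sigma_i(X))$ is degraded to the convex nuclear norm $\lambda\|X\|_*$. Then (3) has a closed form solution $\mathrm{Prox}_g^{\sigma}(B) = U\,\mathrm{Diag}(\mathcal{D}_\lambda(\sigma(B)))\,V^T$, where $\mathcal{D}_\lambda(\sigma(B)) = \{(\sigma_i(B) - \lambda)_+\}_{i=1}^{m}$, and $U$ and $V$ are from the SVD of $B$, i.e., $B = U\,\mathrm{Diag}(\sigma(B))\,V^T$. This is the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm (when $g(x) = \lambda x$) [Cai, Candès, and Shen2010]. More generally, for a convex $g$, the solution to (3) is
$$\mathrm{Prox}_g^{\sigma}(B) = U\,\mathrm{Diag}(\mathrm{Prox}_g(\sigma(B)))\,V^T, \qquad (4)$$
where $\mathrm{Prox}_g(\cdot)$ is defined elementwise as follows,
$$\mathrm{Prox}_g(b) = \arg\min_{x \ge 0} f_b(x) = g(x) + \frac{1}{2}(x - b)^2, \qquad (5)$$
where $\mathrm{Prox}_g(\cdot)$ is the known proximal operator associated with a convex $g$ [Combettes and Pesquet2011]. That is to say, solving (3) is equivalent to performing $\mathrm{Prox}_g(\cdot)$ on each singular value of $B$. In this case, the mapping $\mathrm{Prox}_g(\cdot)$ is unique, i.e., (5) has a unique solution. More importantly, $\mathrm{Prox}_g(\cdot)$ is monotone, i.e., $\mathrm{Prox}_g(x_1) \ge \mathrm{Prox}_g(x_2)$ for any $x_1 \ge x_2$. This guarantees that the nonincreasing order of the singular values is preserved after shrinkage and thresholding by the mapping $\mathrm{Prox}_g(\cdot)$. For a nonconvex $g$, we still call $\mathrm{Prox}_g(\cdot)$ the proximal operator, but note that such a mapping may not be unique. It is still an open problem whether $\mathrm{Prox}_g(\cdot)$ is monotone or not for a nonconvex $g$. Without proving the monotonicity of $\mathrm{Prox}_g(\cdot)$, one cannot simply perform it on the singular values of $B$ to obtain the solution to (3) as in SVT. Even if $\mathrm{Prox}_g(\cdot)$ is monotone, since it is not unique, one also needs to carefully choose the solutions $p_i \in \mathrm{Prox}_g(\sigma_i(B))$ such that $p_1 \ge p_2 \ge \cdots \ge p_m$. Another challenging problem is that there does not exist a general solver to (5) for a general nonconvex $g$.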
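For the convex case $g(x) = \lambda x$, the closed-form SVT step described above can be sketched in NumPy. This is a minimal illustration only: the function name `svt`, the random test matrix and the value of `lam` are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def svt(B, lam):
    """Singular value thresholding: soft-threshold the singular values of B by lam."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # (sigma_i(B) - lam)_+ for each i
    return U @ np.diag(s_shrunk) @ Vt, s_shrunk

# Hypothetical test matrix, used only to exercise the function.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 8))
X, s_shrunk = svt(B, lam=1.0)
```

Because soft-thresholding is monotone, `s_shrunk` stays in nonincreasing order, so it is exactly the vector of singular values of `X`.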
It is worth mentioning that some previous works studied the solution to (3) for some special choices of nonconvex $g$ [Nie, Huang, and Ding2012, Chartrand2012, Liu et al.2013a]. However, none of their proofs was rigorous since they did not prove the monotone property of $\mathrm{Prox}_g(\cdot)$. See the detailed discussions in the next section. Another recent work [Gu et al.2014] considered the following problem related to the weighted nuclear norm:
$$\min_X \sum_{i=1}^{m} w_i \sigma_i(X) + \frac{1}{2}\|X - B\|_F^2, \qquad (6)$$
where $w_i \ge 0$, $i = 1, \ldots, m$. Problem (6) is a little more general than (3) by taking different $g_i(x) = w_i x$. It is claimed in [Gu et al.2014] that the solution to (6) is
$$X^* = U\,\mathcal{S}_w(\Sigma)\,V^T, \qquad (7)$$
where $B = U \Sigma V^T$ is the SVD of $B$, and $\mathcal{S}_w(\Sigma) = \mathrm{Diag}\{(\Sigma_{ii} - w_i)_+\}$. However, such a result and their proof are not correct. A counterexample is as follows:
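where $X^*$ is obtained by (7). The solution $X^*$ is not optimal to (6) since there exists $\hat{X}$ shown above such that the objective value of (6) at $\hat{X}$ is smaller than that at $X^*$. The reason behind is that
$$\mathrm{Prox}_{g_i}(\sigma_i(B)) \ge \mathrm{Prox}_{g_j}(\sigma_j(B)), \quad i < j, \qquad (8)$$
does not guarantee to hold for any $i < j$. Note that (8) holds when $0 \le w_1 \le \cdots \le w_m$, and thus (7) is optimal to (6) in this case.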
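The failure mode can be checked numerically. The sketch below uses a small hypothetical instance (it is not the counterexample from the paper): a diagonal $B$ and weights that put the larger weight on the larger singular value. The names `weighted_objective`, `X_formula` and `X_alt` are illustrative assumptions.

```python
import numpy as np

def weighted_objective(X, B, w):
    """Objective of the weighted problem: sum_i w_i * sigma_i(X) + 0.5 * ||X - B||_F^2."""
    s = np.linalg.svd(X, compute_uv=False)  # returned in nonincreasing order
    return float(w @ s + 0.5 * np.linalg.norm(X - B, "fro") ** 2)

# Hypothetical instance: weights are NOT in ascending order.
B = np.diag([2.0, 1.5])
w = np.array([1.5, 0.1])

# Candidate produced by weighted thresholding: shrink sigma_i(B) by w_i.
U, s, Vt = np.linalg.svd(B)
X_formula = U @ np.diag(np.maximum(s - w, 0.0)) @ Vt

# A competing point with two equal singular values.
X_alt = 0.95 * np.eye(2)

obj_formula = weighted_objective(X_formula, B, w)
obj_alt = weighted_objective(X_alt, B, w)
```

Here the thresholded values $(0.5, 1.4)$ are no longer in nonincreasing order, so the weights get paired with the wrong singular values once the objective is evaluated, and `X_alt` attains a strictly smaller objective than `X_formula`.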
In this work, we give the first rigorous proof that $\mathrm{Prox}_g(\cdot)$ is monotone for any lower bounded function $g$ (regardless of the convexity of $g$). Then solving (3) is reduced to solving (5) for each $b = \sigma_i(B)$. The Generalized Singular Value Thresholding (GSVT) operator $\mathrm{Prox}_g^{\sigma}(\cdot)$ associated with any lower bounded function in (3) is much more general than the known SVT associated with the convex nuclear norm [Cai, Candès, and Shen2010]. In order to compute GSVT, we analyze the solution to (5) for certain types of $g$ (some special cases are shown in Table 1) in theory, and propose a general solver to (5). At last, with GSVT, we can solve (1) by the Generalized Proximal Gradient (GPG) algorithm shown in (2). We test both Iteratively Reweighted Nuclear Norm (IRNN) and GPG on the matrix completion problem. Experiments on both synthetic and real data show that GPG outperforms IRNN in terms of the recovery error and the objective function value.
Generalized Singular Value Thresholding
Problem Reformulation
A main goal of this work is to compute GSVT (3), and use it to solve (1). We will show that, if $\mathrm{Prox}_g(\cdot)$ is monotone, problem (3) can be reformulated into an equivalent problem which is much easier to solve.
Lemma 1.
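(von Neumann’s trace inequality [Rhea2011]) For any matrices $A$, $B \in \mathbb{R}^{m \times n}$ ($m \le n$), $\mathrm{Tr}(A^T B) \le \sum_{i=1}^{m} \sigma_i(A)\sigma_i(B)$, where $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge 0$ and $\sigma_1(B) \ge \sigma_2(B) \ge \cdots \ge 0$ are the singular values of $A$ and $B$, respectively. The equality holds if and only if there exist unitaries $U$ and $V$ such that $A = U\,\mathrm{Diag}(\sigma(A))\,V^T$ and $B = U\,\mathrm{Diag}(\sigma(B))\,V^T$ are the SVDs of $A$ and $B$, simultaneously.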
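As a quick sanity check of Lemma 1, the inequality and its equality case can be verified numerically. The matrices below are random and purely illustrative; all variable names are assumptions introduced for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 6))

# General case: Tr(A^T B) is bounded by the sum of products of sorted singular values.
trace_term = np.trace(A.T @ B)
sv_bound = np.sum(np.linalg.svd(A, compute_uv=False) * np.linalg.svd(B, compute_uv=False))

# Equality case: B2 shares its left and right singular vectors with A.
U, sA, Vt = np.linalg.svd(A, full_matrices=False)
sB2 = np.sort(rng.random(4))[::-1]  # nonincreasing, nonnegative singular values
B2 = U @ np.diag(sB2) @ Vt
trace_eq = np.trace(A.T @ B2)
bound_eq = np.sum(sA * sB2)
```

The equality case is the one exploited in the proof of Theorem 1: the minimizer can be taken to share the singular vectors of $B$.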
Theorem 1.
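Let $g: \mathbb{R}^+ \to \mathbb{R}^+$ be a function such that $\mathrm{Prox}_g(\cdot)$ is monotone. Let $B = U\,\mathrm{Diag}(\sigma(B))\,V^T$ be the SVD of $B \in \mathbb{R}^{m \times n}$. Then an optimal solution to (3) is
$$X^* = U\,\mathrm{Diag}(\rho^*)\,V^T, \qquad (9)$$
where $\rho^*$ satisfies $\rho_1^* \ge \rho_2^* \ge \cdots \ge \rho_m^*$, $i = 1, \ldots, m$, and
$$\rho_i^* \in \mathrm{Prox}_g(\sigma_i(B)) = \arg\min_{\rho_i \ge 0} g(\rho_i) + \frac{1}{2}(\rho_i - \sigma_i(B))^2. \qquad (10)$$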
Proof.
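Denote $\sigma_1(X) \ge \cdots \ge \sigma_m(X) \ge 0$ as the singular values of $X$. Problem (3) can be rewritten as
$$\min_{\rho:\, \rho_1 \ge \cdots \ge \rho_m \ge 0} \left\{ \min_{\sigma(X) = \rho} \sum_{i=1}^{m} g(\rho_i) + \frac{1}{2}\|X - B\|_F^2 \right\}. \qquad (11)$$
By using von Neumann’s trace inequality in Lemma 1, we have
$$\|X - B\|_F^2 = \mathrm{Tr}(X^T X) - 2\,\mathrm{Tr}(X^T B) + \mathrm{Tr}(B^T B) \ge \sum_{i=1}^{m} \sigma_i^2(X) - 2\sum_{i=1}^{m} \sigma_i(X)\sigma_i(B) + \sum_{i=1}^{m} \sigma_i^2(B) = \sum_{i=1}^{m} (\sigma_i(X) - \sigma_i(B))^2.$$
Note that the above equality holds when $X$ admits the singular value decomposition $X = U\,\mathrm{Diag}(\sigma(X))\,V^T$, where $U$ and $V$ are the left and right orthonormal matrices in the SVD of $B$. In this case, problem (11) is reduced to
$$\min_{\rho:\, \rho_1 \ge \cdots \ge \rho_m \ge 0} \sum_{i=1}^{m} \left( g(\rho_i) + \frac{1}{2}(\rho_i - \sigma_i(B))^2 \right). \qquad (12)$$
Since $\mathrm{Prox}_g(\cdot)$ is monotone and $\sigma_1(B) \ge \sigma_2(B) \ge \cdots \ge \sigma_m(B)$, there exist $\rho_i^* \in \mathrm{Prox}_g(\sigma_i(B))$ such that $\rho_1^* \ge \rho_2^* \ge \cdots \ge \rho_m^*$. Such a choice of $\rho^*$ is optimal to (12), and thus (9) is optimal to (3). ∎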
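From the above proof, it can be seen that the monotone property of $\mathrm{Prox}_g(\cdot)$ is a key condition which makes problem (12) separable conditionally. Thus the solution (9) to (3) shares a similar formulation as the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm [Cai, Candès, and Shen2010]. Note that for a convex $g$, $\mathrm{Prox}_g(\cdot)$ is always monotone. Indeed,
$$(\mathrm{Prox}_g(b_1) - \mathrm{Prox}_g(b_2))(b_1 - b_2) \ge (\mathrm{Prox}_g(b_1) - \mathrm{Prox}_g(b_2))^2, \quad \forall\, b_1, b_2 \in \mathbb{R}^+.$$
The above inequality can be obtained by the optimality of $\mathrm{Prox}_g(\cdot)$ and the convexity of $g$.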
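The monotonicity of $\mathrm{Prox}_g(\cdot)$ for a nonconvex $g$ is still unknown. There were some previous works [Nie, Huang, and Ding2012, Chartrand2012, Liu et al.2013a] claiming that the solution (9) is optimal to (3) for some special choices of nonconvex $g$. However, their results are not rigorous since the monotone property of $\mathrm{Prox}_g(\cdot)$ is not proved. Surprisingly, we find that the monotone property of $\mathrm{Prox}_g(\cdot)$ holds for any lower bounded function $g$.
Theorem 2.
For any lower bounded function $g$, its proximal operator $\mathrm{Prox}_g(\cdot)$ is monotone, i.e., for any $p_i^* \in \mathrm{Prox}_g(x_i)$, $i = 1, 2$, $p_1^* \ge p_2^*$, when $x_1 > x_2$.
Note that it is possible that $\sigma_i(B) = \sigma_j(B)$ for some $i < j$ in (10). Since $\mathrm{Prox}_g(\cdot)$ may not be unique, we need to choose $\rho_i^* \in \mathrm{Prox}_g(\sigma_i(B))$ and $\rho_j^* \in \mathrm{Prox}_g(\sigma_j(B))$ such that $\rho_i^* \ge \rho_j^*$. This is the only difference between GSVT and SVT.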
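The monotonicity stated in Theorem 2 can be observed empirically with a brute-force proximal operator. The sketch below minimizes $g(x) + \frac{1}{2}(x - b)^2$ over a fixed grid for the Logarithm penalty; the parameter values, grid resolution and function names are assumptions chosen for illustration, and the grid search is not the solver proposed in the paper.

```python
import numpy as np

def log_penalty(x, lam=1.0, gamma=2.0):
    """Logarithm penalty g(x) = lam / log(gamma + 1) * log(gamma * x + 1)."""
    return lam / np.log(gamma + 1.0) * np.log(gamma * x + 1.0)

def prox_grid(b, g, grid):
    """Brute-force prox: minimise g(x) + 0.5 * (x - b)^2 over a fixed grid of x >= 0."""
    values = g(grid) + 0.5 * (grid - b) ** 2
    return grid[np.argmin(values)]

grid = np.linspace(0.0, 5.0, 50001)
bs = np.linspace(0.0, 5.0, 201)  # increasing inputs b
prox_values = np.array([prox_grid(b, log_penalty, grid) for b in bs])
```

Plotting `prox_values` against `bs` shows a jump from 0 to a positive value at some threshold, yet the curve never decreases, which is what makes the singular-value-wise application in (9) valid.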
Proximal Operator of Nonconvex Function
So far, we have proved that solving (3) is equivalent to solving (5) for each $b = \sigma_i(B)$, $i = 1, \ldots, m$, for any lower bounded function $g$. For a nonconvex $g$, the candidate solutions to (5) have a closed form only in some special cases [Gong et al.2013]. There does not exist a general solver for a more general nonconvex $g$. In this section, we analyze the solution to (5) for a broad choice of the nonconvex $g$. Then a general solver will be proposed in the next section.
Assumption 1.
$g: \mathbb{R}^+ \to \mathbb{R}^+$, $g(0) = 0$. $g$ is concave, nondecreasing and differentiable. The gradient $\nabla g$ is convex.
In this work, we are interested in the nonconvex surrogate of the $\ell_0$ norm. Except the differentiability of $g$ and the convexity of $\nabla g$, all the other assumptions in Assumption 1 are necessary to construct a surrogate of the $\ell_0$ norm. As shown later, these two additional assumptions make our analysis much easier. Note that the assumptions for the nonconvex function considered in Assumption 1 are quite general. It is easy to verify that many popular surrogates of the $\ell_0$ norm in Table 1 satisfy Assumption 1, including the $\ell_p$ norm, Logarithm, MCP, Geman and Laplace penalties. Only the SCAD penalty violates the convex $\nabla g$ assumption, as shown in Figure 1.
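Proposition 1.
Given $g$ satisfying Assumption 1, the optimal solutions to (5) lie in $[0, b]$.
The above fact is obvious since both $g(x)$ and $\frac{1}{2}(x - b)^2$ are nondecreasing on $[b, +\infty)$. Such a result limits the solution space, and thus is very useful for our analysis. Our general solver to (5) is also based on Proposition 1.
Note that the solutions to (5) lie in 0 or the local points $\{x \mid \nabla f_b(x) = \nabla g(x) + x - b = 0\}$. Our analysis is mainly based on the number of intersection points of $D(x) = \nabla g(x)$ and the line $C_b(x) = b - x$. Let $\bar{b} = \sup\{b \mid C_b(x) \text{ and } D(x) \text{ have no intersection}\}$. We have the solution to (5) in different cases. Please refer to the supplementary material for the detailed proofs.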
Proposition 2.
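Given $g$ satisfying Assumption 1 and $\nabla g(0) = +\infty$. Restricted on $[0, +\infty)$, when $b > \bar{b}$, $C_b(x)$ and $D(x)$ have two intersection points, denoted as $P_1^b = (x_1^b, y_1^b)$, $P_2^b = (x_2^b, y_2^b)$, and $x_1^b < x_2^b$. If there does not exist $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, then $\mathrm{Prox}_g(b) = 0$ for all $b \ge 0$. If there exists $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, let $b^* = \inf\{b \mid b > \bar{b},\ f_b(0) = f_b(x_2^b)\}$. Then we have $\mathrm{Prox}_g(b) = x_2^b$ if $b > b^*$, and $\mathrm{Prox}_g(b) \ni 0$ if $b \le b^*$.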
Proposition 3.
Given satisfying Assumption 2 and . Restricted on , if we have for all , then and have only one intersection point when . Furthermore,
Suppose there exists such that . Then, when , and have two intersection points, which are denoted as and such that . When , and have only one intersection point . Also, there exists such that and . Let . We have
Algorithms
In this section, we first give a general solver to (5) in which $g$ satisfies Assumption 1. Then we are able to solve the GSVT problem (3). With GSVT, problem (1) can be solved by the Generalized Proximal Gradient (GPG) algorithm as shown in (2). We also give the convergence guarantee of GPG.
A General Solver to (5)
Given $g$ satisfying Assumption 1, as shown in Corollary 2, 0 and $\hat{x}_b = \max\{x \mid \nabla f_b(x) = 0,\ 0 \le x \le b\}$ are the candidate solutions to (5). The remaining task is to find $\hat{x}_b$, which is the largest local minimum point near $x = b$. So we can start searching for $\hat{x}_b$ from $x^0 = b$ by the fixed point iteration algorithm. Note that it will be very fast since we only need to search within $[0, b]$. The whole procedure to find $\hat{x}_b$ can be found in Algorithm 1. In theory, it can be proved that the fixed point iteration guarantees to find $\hat{x}_b$.
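The procedure just described can be sketched as follows for the Logarithm penalty. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the parameter values `LAM` and `GAMMA`, the stopping rule, and the function names are all hypothetical. The iteration $x \leftarrow b - \nabla g(x)$ starts at $x = b$ and decreases monotonically, so it either reaches the largest stationary point in $[0, b]$ or drops below 0.

```python
import numpy as np

LAM, GAMMA = 1.0, 2.0  # hypothetical parameters of the logarithm penalty

def g(x):
    return LAM / np.log(GAMMA + 1.0) * np.log(GAMMA * x + 1.0)

def grad_g(x):
    return LAM * GAMMA / (np.log(GAMMA + 1.0) * (GAMMA * x + 1.0))

def f(x, b):
    return g(x) + 0.5 * (x - b) ** 2

def prox_fixed_point(b, tol=1e-12, max_iter=10000):
    """Compare 0 with the largest stationary point of f(., b) in [0, b]."""
    if b <= 0:
        return 0.0
    x = b
    for _ in range(max_iter):
        x_new = b - grad_g(x)  # a fixed point satisfies grad_g(x) + x - b = 0
        if x_new < 0:
            return 0.0  # no stationary point in [0, b], so 0 is the solution
        if abs(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x if f(x, b) < f(0.0, b) else 0.0

def prox_brute_force(b, n=200001):
    """Reference solution by grid search on [0, b], used only for checking."""
    grid = np.linspace(0.0, max(b, 0.0), n)
    return float(grid[np.argmin(f(grid, b))])
```

For example, `prox_fixed_point(3.0)` and `prox_brute_force(3.0)` agree to within the grid resolution, while small inputs such as `b = 0.5` are thresholded to 0.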
If $g$ is nonsmooth or $\nabla g$ is nonconvex, the fixed point iteration algorithm may also be applicable. The key is to find all the local solutions with smart initial points. Also, all the nonsmooth points should be considered as candidates.
All the nonconvex surrogates $g$ except SCAD in Table 1 satisfy Assumption 1, and thus the solution to (5) can be obtained by Algorithm 1. Figure 2 illustrates the shrinkage effect of the proximal operators of these functions and the convex $\ell_1$ norm. The shrinkage and thresholding effects of these proximal operators are similar when $b$ is relatively small. However, when $b$ is relatively large, the proximal operators of the nonconvex functions are nearly unbiased, i.e., keeping $b$ nearly the same as the $\ell_0$ norm. In contrast, the proximal operator of the convex $\ell_1$ norm is biased. In this case, the $\ell_1$ norm may be overpenalized, and thus may perform quite differently from the $\ell_0$ norm. This also supports the necessity of using nonconvex penalties on the singular values to approximate the rank function.
Generalized Proximal Gradient Algorithm for (1)
Given $g$ satisfying Assumption 1, we are now able to get the optimal solution to (3) by (9) and Algorithm 1. Now we have a better solver than IRNN to solve (1) by the updating rule (2), or equivalently
$$X^{k+1} = \mathrm{Prox}_{\frac{1}{\mu} g}^{\sigma}\left(X^k - \frac{1}{\mu}\nabla h(X^k)\right).$$
The above updating rule is named Generalized Proximal Gradient (GPG) for the nonconvex problem (1), which generalizes some previous methods [Beck and Teboulle2009, Gong et al.2013]. The main per-iteration cost of GPG is to compute an SVD, which is the same as many convex methods [Toh and Yun2010a, Lin, Chen, and Ma2009]. In theory, we have the following convergence results for GPG.
Theorem 3.
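If $\mu > L(h)$, the sequence $\{X^k\}$ generated by (2) satisfies the following properties:
1. $F(X^k)$ is monotonically decreasing.
2. $\lim_{k \to +\infty} (X^k - X^{k+1}) = 0$.
3. If $F(X) \to +\infty$ when $\|X\|_F \to +\infty$, then any limit point of $\{X^k\}$ is a stationary point.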
It is expected that GPG will decrease the objective function value faster than IRNN since it uses a tighter surrogate function. This will be verified by the experiments.
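The GPG iteration can be sketched on a small matrix completion instance, where $h(X) = \frac{1}{2}\|\mathcal{P}_\Omega(X - M)\|_F^2$ has a gradient with Lipschitz constant 1. Everything below is an illustrative sketch under assumptions: the Logarithm penalty parameters, the step parameter `mu = 1.1`, the problem size, the fixed iteration counts and all function names are hypothetical, and no continuation on $\lambda$ is used.

```python
import numpy as np

LAM, GAMMA = 0.5, 2.0  # hypothetical logarithm-penalty parameters
C = LAM / np.log(GAMMA + 1.0)

def g(x):
    return C * np.log(GAMMA * x + 1.0)

def prox_scaled(b, mu, iters=500):
    """Minimise g(x)/mu + 0.5*(x - b)^2 over x >= 0 via the fixed point x = b - g'(x)/mu."""
    x = b
    for _ in range(iters):
        x = b - C * GAMMA / (mu * (GAMMA * x + 1.0))
        if x < 0:
            return 0.0  # no stationary point in [0, b]
    f_x = g(x) / mu + 0.5 * (x - b) ** 2
    return x if f_x < 0.5 * b ** 2 else 0.0  # compare with the candidate x = 0

def gsvt(Y, mu):
    """Apply the scalar prox to each singular value of Y and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    rho = np.array([prox_scaled(si, mu) for si in s])
    return U @ np.diag(rho) @ Vt

def objective(X, M, mask):
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(g(s)) + 0.5 * np.sum((mask * (X - M)) ** 2))

def gpg(M, mask, mu=1.1, n_iter=50):
    X = np.zeros_like(M)
    history = [objective(X, M, mask)]
    for _ in range(n_iter):
        grad = mask * (X - M)        # gradient of h at the current iterate
        X = gsvt(X - grad / mu, mu)  # GSVT step on the gradient-descent point
        history.append(objective(X, M, mask))
    return X, history

# Hypothetical rank-3 test problem with about half of the entries observed.
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 30))
mask = (rng.random(M.shape) < 0.5).astype(float)
X_hat, history = gpg(M, mask)
```

With `mu` larger than the Lipschitz constant, the recorded objective values in `history` should not increase from one iteration to the next, in line with the first property of Theorem 3.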
Experiments
In this section, we conduct some experiments on the matrix completion problem to test our proposed GPG algorithm
$$\min_X \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2, \qquad (13)$$
where $\Omega$ is the index set, and $\mathcal{P}_\Omega: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ is a linear operator that keeps the entries in $\Omega$ unchanged and sets those outside $\Omega$ to zero. Given $\mathcal{P}_\Omega(M)$, the goal of matrix completion is to recover $M$, which is of low rank. Note that we have many choices of $g$ which satisfy Assumption 1, and we simply test on the Logarithm penalty, since it is suggested in [Lu et al.2014, Candès, Wakin, and Boyd2008] that it usually performs well compared with other nonconvex penalties. Problem (13) can be solved by GPG by using GSVT (9) in each iteration. We compared GPG with IRNN on both synthetic and real data. The continuation technique is used to enhance the low rank matrix recovery in GPG. The initial value of $\lambda$ in the Logarithm penalty is set to $\lambda_0$, and dynamically decreased till reaching $\lambda_t$.
Low-Rank Matrix Recovery on Random Data
We conduct two experiments on synthetic data without and with noises [Lu et al.2014]. For the noise free case, we generate , where , are i.i.d. random matrices, and . The underlying rank varies from 20 to 33. Half of the elements in are missing. We set , and . The relative error RelErr is used to evaluate the recovery performance. If RelErr is smaller than , is regarded as a successful recovery of . We repeat the experiments 100 times for each . We compare GPG by using GSVT with IRNN and the convex Augmented Lagrange Multiplier (ALM) [Lin, Chen, and Ma2009]. Figure 3 (a) plots v.s. the frequency of success. It can be seen that GPG is slightly better than IRNN when is relatively small, while both IRNN and GPG fail when . Both of them outperform the convex ALM method, since the nonconvex logarithm penalty approximates the rank function better than the convex nuclear norm.
For the noisy case, the data matrix is generated in the same way, but are added some additional noises , where
is an i.i.d. random matrix. For this task, we set
, and in GPG. The convex APGL algorithm [Toh and Yun2010b] is compared in this task. Each method is run 100 times for each $r$. Figure 3 (b) shows the mean relative error. It can be seen that GPG by using GSVT in each iteration significantly outperforms IRNN and APGL. The reason is that $\lambda_t$ is not as small as in the noise free case. Thus, the upper bound surrogate of $g$ in IRNN will be much looser than that in GPG. Figure 3 (c) plots some convergence curves of GPG and IRNN. It can be seen that GPG without relaxing $g$ will decrease the objective function value faster.
Applications on Real Data
Matrix completion can be applied to image inpainting since the main information is dominated by the top singular values. For a color image, assume that 40% of pixels are uniformly missing. They can be recovered by applying low rank matrix completion on each channel (red, green and blue) of the image independently. Besides the relative error defined above, we also use the Peak Signal-to-Noise Ratio (PSNR) to evaluate the recovery performance. Figure 4 shows two images recovered by APGL, IRNN and GPG, respectively. It can be seen that GPG achieves the best performance, i.e., the largest PSNR value and the smallest relative error.
We also apply matrix completion for collaborative filtering. The task of collaborative filtering is to predict the unknown preference of a user on a set of unrated items, according to other similar users or similar items. We test on the MovieLens data set [Herlocker et al.1999] which includes three problems, “movie100K”, “movie1M” and “movie10M”. Since only the entries in $\Omega$ of $M$ are known, we use Normalized Mean Absolute Error (NMAE) to evaluate the performance as in [Toh and Yun2010b]. As shown in Table 2, GPG achieves the best performance. The improvement benefits from the GPG algorithm which uses a fast and exact solver of GSVT (9).
Table 2: NMAE of APGL, IRNN and GPG on the MovieLens problems.
Problem  size of $M$: $(m, n)$  APGL  IRNN  GPG

movie100K  (943, 1682)  2.76e-3  2.60e-3  2.53e-3
movie1M  (6040, 3706)  2.66e-1  2.52e-1  2.47e-1
movie10M  (71567, 10677)  3.13e-1  3.01e-1  2.89e-1
Conclusions
This paper studied the Generalized Singular Value Thresholding (GSVT) operator associated with the nonconvex function $g$ on the singular values. We proved that the proximal operator of any lower bounded function $g$ (denoted as $\mathrm{Prox}_g(\cdot)$) is monotone. Thus, GSVT can be obtained by performing $\mathrm{Prox}_g(\cdot)$ on the singular values separately. Given $b \ge 0$, we also proposed a general solver to find $\mathrm{Prox}_g(b)$ for certain types of $g$. At last, we applied the generalized proximal gradient algorithm by using GSVT as the subroutine to solve the nonconvex low rank minimization problem (1). Experimental results showed that it outperformed the previous method with smaller recovery error and objective function value.
For nonconvex low rank minimization, GSVT plays the same role as SVT in convex minimization. One may extend other convex low rank models to nonconvex cases, and solve them by using GSVT in place of SVT. An interesting future work is to solve the nonconvex low rank minimization problem with affine constraint by ALM [Lin, Chen, and Ma2009] and prove the convergence.
Acknowledgements
This research is supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by NSF China (grant nos. 61272341 and 61231002), 973 Program of China (grant no. 2015CB3525) and MSRA Collaborative Research Program. C. Lu is supported by the MSRA fellowship 2014.
References
 [Beck and Teboulle2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
 [Cai, Candès, and Shen2010] Cai, J.-F.; Candès, E. J.; and Shen, Z. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4):1956–1982.
 [Candès et al.2011] Candès, E. J.; Li, X.; Ma, Y.; and Wright, J. 2011. Robust principal component analysis? Journal of the ACM 58(3).
 [Candès, Wakin, and Boyd2008] Candès, E. J.; Wakin, M. B.; and Boyd, S. P. 2008. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications 14(5-6):877–905.
 [Chartrand2012] Chartrand, R. 2012. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Transactions on Signal Processing 60(11):5810–5819.
 [Cheng et al.2010] Cheng, B.; Yang, J.; Yan, S.; Fu, Y.; and Huang, T. S. 2010. Learning with $\ell_1$-graph for image analysis. TIP 19(4):858–866.
 [Clarke1983] Clarke, F. 1983. Nonsmooth analysis and optimization. In Proceedings of the International Congress of Mathematicians.
 [Combettes and Pesquet2011] Combettes, P. L., and Pesquet, J.-C. 2011. Proximal splitting methods in signal processing. Fixed-point algorithms for inverse problems in science and engineering.
 [Fan and Li2001] Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348–1360.
 [Frank and Friedman1993] Frank, L., and Friedman, J. 1993. A statistical view of some chemometrics regression tools. Technometrics.
 [Friedman2012] Friedman, J. 2012. Fast sparse regression and classification. International Journal of Forecasting 28(3):722–738.
 [Geman and Yang1995] Geman, D., and Yang, C. 1995. Nonlinear image recovery with half-quadratic regularization. TIP 4(7):932–946.
 [Gong et al.2013] Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J. 2013. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In ICML.
 [Gu et al.2014] Gu, S.; Zhang, L.; Zuo, W.; and Feng, X. 2014. Weighted nuclear norm minimization with application to image denoising. In CVPR.
 [Herlocker et al.1999] Herlocker, J. L.; Konstan, J. A.; Borchers, A.; and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. In International ACM SIGIR conference on Research and development in information retrieval. ACM.
 [Lewis and Sendov2005] Lewis, A. S., and Sendov, H. S. 2005. Nonsmooth analysis of singular values. Part I: Theory. Set-Valued Analysis 13(3):213–241.
 [Lin, Chen, and Ma2009] Lin, Z.; Chen, M.; and Ma, Y. 2009. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215, Tech. Rep.
 [Liu et al.2013a] Liu, D.; Zhou, T.; Qian, H.; Xu, C.; and Zhang, Z. 2013a. A nearly unbiased matrix completion approach. In Machine Learning and Knowledge Discovery in Databases.
 [Liu et al.2013b] Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013b. Robust recovery of subspace structures by low-rank representation. TPAMI 35(1):171–184.
 [Lu et al.2014] Lu, C.; Tang, J.; Yan, S. Y.; and Lin, Z. 2014. Generalized nonconvex nonsmooth low-rank minimization. In CVPR.
 [Nie, Huang, and Ding2012] Nie, F.; Huang, H.; and Ding, C. H. 2012. Low-rank matrix recovery via efficient Schatten $p$-norm minimization. In AAAI.
 [Recht, Fazel, and Parrilo2010] Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3):471–501.
 [Rhea2011] Rhea, D. 2011. The case of equality in the von Neumann trace inequality. preprint.
 [Toh and Yun2010a] Toh, K., and Yun, S. 2010a. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization.
 [Toh and Yun2010b] Toh, K., and Yun, S. 2010b. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization 6:615–640.
 [Trzasko and Manduca2009] Trzasko, J., and Manduca, A. 2009. Highly undersampled magnetic resonance image reconstruction via homotopic $\ell_0$-minimization. IEEE Transactions on Medical Imaging 28(1):106–121.
 [Wright et al.2009] Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. TPAMI 31(2):210–227.
 [Zhang and others2010] Zhang, C.-H., et al. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2):894–942.
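Analysis of the Proximal Operator of Nonconvex Function
In the following development, we consider the following problem
$$\mathrm{Prox}_g(b) = \arg\min_{x \ge 0} f_b(x) = g(x) + \frac{1}{2}(x - b)^2, \qquad (1)$$
where $g$ satisfies the following assumption.
Assumption 2.
$g: \mathbb{R}^+ \to \mathbb{R}^+$, $g(0) = 0$. $g$ is concave, nondecreasing and differentiable. The gradient $\nabla g$ is convex.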
Set and . Let , and .
Proof of Proposition 2
Proposition 2.
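Given $g$ satisfying Assumption 2 and $\nabla g(0) = +\infty$. Restricted on $[0, +\infty)$, when $b > \bar{b}$, $C_b(x)$ and $D(x)$ have two intersection points, denoted as $P_1^b = (x_1^b, y_1^b)$, $P_2^b = (x_2^b, y_2^b)$, and $x_1^b < x_2^b$. If there does not exist $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, then $\mathrm{Prox}_g(b) = 0$ for all $b \ge 0$. If there exists $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, let $b^* = \inf\{b \mid b > \bar{b},\ f_b(0) = f_b(x_2^b)\}$. Then we have
$$\mathrm{Prox}_g(b) = x_2^b \ \text{ if } b > b^*, \qquad \mathrm{Prox}_g(b) \ni 0 \ \text{ if } b \le b^*. \qquad (2)$$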
Remark: When exists and , because is convex and decreasing, we can conclude that and have exactly two intersection points. When , and may have multiple intersection points.
Proof.
When , since , we can easily see that is increasing on , decreasing on and increasing on . So, and are two local minimum points of on .
Case 1 : If there exists such that , denote .
First, we consider . Let for some . We have
Since is decreasing on , we conclude that . So, when , is the global minimum of on .
Second, we consider . We show that by contradiction. Suppose that there exists such that . Since is strictly increasing on , we have . Because we have
by a direct computation, we get
According to the intermediate value theorem, there exists such that and