1 Introduction
Singular Value Decomposition (SVD) is a powerful matrix decomposition method that has been widely used in many fields [Alter et al.2000, Aharon et al.2006, Zhou et al.2015]. Suppose $X$ is a matrix of size $p \times n$ (e.g., a gene expression matrix); then the classical SVD of $X$ is:
(1) $X = UDV^{\mathsf T} = \sum_{k=1}^{r} d_k u_k v_k^{\mathsf T}$,
where $r$ is the rank of $X$, the $u_k$ ($v_k$) are the left (right) singular vectors, and both $U = [u_1, \dots, u_r]$ and $V = [v_1, \dots, v_r]$ have orthonormal columns. $D = \mathrm{diag}(d_1, \dots, d_r)$ is a diagonal matrix of singular values. However, the non-sparse singular vectors ($u_k$, $v_k$) can sometimes be difficult to interpret. Many studies impose sparsity on the singular vectors to improve this [Shi et al.2013, Jing et al.2015]. The advantage of sparsity is that sparse singular vectors are better able to capture the inherent structures and patterns of the input data.
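To make Eq. (1) concrete, here is a minimal NumPy sketch (illustrative only, not part of the original method) that computes the rank-$r$ decomposition and checks the reconstruction:

```python
import numpy as np

def truncated_svd(X, r):
    """Rank-r SVD of X: returns (U, d, V) with X ~ U @ diag(d) @ V.T."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], d[:r], Vt[:r].T

# When r equals the true rank, the reconstruction error is (numerically) zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # rank-3 matrix
U, d, V = truncated_svd(X, 3)
err = np.linalg.norm(X - U @ np.diag(d) @ V.T)
```

For a rank-deficient matrix such as the one above, `err` is at machine-precision level, while the columns of `U` and `V` are orthonormal.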
Here we introduce a formal framework for the rank-one sparse SVD [Witten et al.2009],
(2) $\min_{d,\,u,\,v}\; \|X - d\,uv^{\mathsf T}\|_F^2$
subject to $u^{\mathsf T}u = 1,\; v^{\mathsf T}v = 1,\; P_1(u) \le c_1,\; P_2(v) \le c_2$,
where $P_1(\cdot)$ and $P_2(\cdot)$ are two penalty functions. Different sparsity penalties can be plugged into this framework for different purposes [Witten et al.2009, Lee et al.2010, Sill et al.2011]. For example, Witten et al. [2009] proposed a penalized matrix decomposition (PMD) that uses an $\ell_1$ penalty to induce sparsity. In addition, Lee et al. [2010] developed a sparse SVD (ALSVD) by imposing an adaptive Lasso penalty on the left/right singular vectors and employed it as a biclustering tool to capture the bicluster structure of gene expression data. However, these sparse SVD models have several limitations. First, they impose sparseness via an $\ell_1$-norm penalty, while little work has considered the $\ell_0$-norm. Compared to the $\ell_1$-norm, the $\ell_0$-norm enforces a desired level of sparsity directly. Second, they ignore prior graph knowledge about the variables. Recent studies have shown that a graph-regularized penalty can yield better interpretability and higher generalization performance for regression and other tasks [Ando and Zhang2007, Li and Yeung2009, Li and Li2008, Li and Li2010, Jiang et al.2013]. To our knowledge, no study has yet incorporated graph structure into the sparse SVD framework.
In this paper, we propose a novel sparse graph-regularized SVD to model high-dimensional data with prior graph knowledge. The method is inspired by recent developments in sparse coding and graph-regularized norms [Lee et al.2006, Ando and Zhang2007, Zheng et al.2011, He et al.2011]. It imposes a novel graph-regularized penalty on the magnitudes of the singular vectors ($|u|^{\mathsf T}L|u|$) instead of the classical one ($u^{\mathsf T}Lu$), together with an $\ell_0$-norm penalty, to induce structured sparsity. In summary, our key contributions are threefold: (1) we impose an $\ell_0$-norm penalty on graph-regularized SVD (sgSVD); (2) we smooth the absolute signal in the graph-regularized penalty (sgSVD*); (3) we design an efficient Alternating Iterative Sparse Projection (AISP) algorithm to solve the model. Finally, we apply sgSVD* and related methods to simulated and real data to demonstrate its efficiency in capturing block data structure.
2 Related Works
Recent studies show a growing interest in regularized SVD, mainly along the following three lines.
(1) Sparse SVD. Witten et al. [2009] proposed the first sparse SVD, using the $\ell_1$-norm and the fused lasso as sparsity-inducing penalties. Lee et al. [2010] recommended using different penalties, such as the Lasso, the adaptive Lasso, and the Smoothly Clipped Absolute Deviation (SCAD), as penalty functions. A natural choice of penalty function is the $\ell_0$-norm. Min et al. [2015] proposed a sparse SVD with an $\ell_0$-norm penalty (L0SVD).
(2) Graph-regularized SVD. Graph-regularized norms have been used in many different techniques, such as non-negative matrix factorization [Cai et al.2011]. However, to our knowledge, no study has yet incorporated a graph-regularized norm into the sparse SVD framework. We believe that sparse graph-regularized SVD is a promising tool, as explored in other problems [Li and Yeung2009, Jenatton et al.2010, Zheng et al.2011], to enforce the smoothness of variable coefficients. In addition, the group Lasso is widely used to enforce structural sparsity at the inter-group level [Jacob et al.2009]; it could also serve as a penalty function for SVD in future studies.
(3) The relationship between SVD and PCA. As is well known, principal component analysis (PCA) can be solved efficiently via SVD. However, the identified non-sparse principal components can sometimes be difficult to interpret. To address this, recent studies have developed several sparse PCA models [Deshpande and Montanari2014, Gu et al.2014, Shen and Huang2008, Witten et al.2009]. However, it is difficult to develop effective algorithms for solving these sparse PCA models. A commonly used family of methods solves them via regularized SVD [Shen and Huang2008]. Moreover, Witten et al. [2009] proposed another model based on regularized SVD to ensure the orthogonality of the left singular vectors.
3 The Proposed Approach
3.1 Sparse Graph-regularized Penalty
Given a simple graph $G = (V, E)$ on $p$ vertices, its Laplacian matrix is defined as $L = D - A$, where $A$ is the adjacency matrix of $G$ ($A_{ij} = 1$ if vertices $i$ and $j$ are connected, and $A_{ij} = 0$ otherwise), and $D$ is a diagonal matrix whose diagonal elements are the vertex degrees (i.e., $d_i = \sum_j A_{ij}$). We first consider the ordinary graph-regularized norm and impose sparsity on the singular vectors in the SVD framework with the following penalty,
(3) $P_1(u) = \lambda_1 \|u\|_1 + \eta_1\, u^{\mathsf T} L u$,
where $\lambda_1$ and $\eta_1$ are two regularization parameters. The first term is the $\ell_1$-norm penalty, which induces sparsity. The second term is a quadratic Laplacian penalty, which forces the coefficients of $u$ to be smooth over the graph. Thus, we can create a sparse graph-regularized SVD by using the penalty of Eq. (3) in the framework of Eq. (2). The $\ell_1$-norm has good properties and has been applied to induce sparsity in SVD [Witten et al.2009], whereas little work has been done with the more natural sparsity regularizer, the $\ell_0$-norm penalty. In this paper, we adopt the $\ell_0$-norm penalty rather than the $\ell_1$-norm to induce sparsity in the singular vectors. Thus, we replace Eq. (3) with the following one,
(4) $P_1(u) = \lambda_1 \|u\|_0 + \eta_1\, u^{\mathsf T} L u$,
where $\|u\|_0$ is the $\ell_0$-norm penalty, i.e., the number of nonzero entries of $u$.
The classical graph-regularized penalty in Eq. (4) ensures that two positively correlated variables tend to influence their coefficients in the same direction. However, it forces two negatively correlated variables to have opposite effects. Thus, the classical graph-regularized penalty may not perform well when the coefficients of a number of variables linked in the prior graph have different signs, since the absolute values of their coefficients are then not encouraged to be locally smooth. To address this drawback, we redefine the sparse graph-regularized penalty by considering only the magnitudes of the coefficients,
(5) $P_1(u) = \lambda_1 \|u\|_0 + \eta_1\, |u|^{\mathsf T} L |u|$,
where $|u| = (|u_1|, \dots, |u_p|)^{\mathsf T}$ and $|u|^{\mathsf T} L |u| = \sum_{i<j} A_{ij}\big(|u_i| - |u_j|\big)^2$. This choice is also motivated by the fact that linked genes in a gene interaction graph are usually significantly positively or negatively correlated. The new penalty in Eq. (5) encourages highly inter-correlated variables corresponding to a dense subnetwork in the graph to be jointly selected.
Mathematically, the normalized Laplacian matrix $\tilde L = D^{-1/2} L D^{-1/2}$ can be used in place of $L$ in Eq. (3), Eq. (4), and Eq. (5). The choice between $L$ and the normalized $\tilde L$ depends on the specific problem. In addition, there is a wide variety of graph matrix representations [Wilson and Zhu2008] that could also replace $L$. In this paper, we apply the sparse graph-regularized penalty with $L$ for biclustering gene expression data.
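The contrast between the classical quadratic penalty and the absolute-value penalty of Eq. (5) can be illustrated numerically. The following sketch (illustrative only; variable names are ours) builds the unnormalized Laplacian and evaluates both quadratic forms for two linked coefficients of opposite sign:

```python
import numpy as np

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A for a 0/1 adjacency matrix A."""
    return np.diag(A.sum(axis=1)) - A

def sg_penalty(u, A, lam1, eta1):
    """Sparse graph-regularized penalty in the spirit of Eq. (5):
    lam1 * ||u||_0 + eta1 * |u|^T L |u|; the absolute values make the
    smoothing term indifferent to the signs of linked coefficients."""
    au = np.abs(u)
    return lam1 * np.count_nonzero(u) + eta1 * (au @ laplacian(A) @ au)

# Two linked variables with equal magnitude but opposite signs:
A = np.array([[0, 1], [1, 0]])
u = np.array([0.5, -0.5])
classical = u @ laplacian(A) @ u                    # (0.5 - (-0.5))^2 = 1.0
absolute = np.abs(u) @ laplacian(A) @ np.abs(u)     # (0.5 - 0.5)^2 = 0.0
```

The classical penalty punishes this pair heavily even though their magnitudes are perfectly smooth; the absolute-value version does not.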
3.2 $\ell_1$-norm Sparse Graph-regularized SVD
The sparse graph-regularized penalty with the $\ell_0$-norm is non-convex, which makes an efficient implementation challenging. Here, we first discuss a sparse graph-regularized SVD obtained by replacing the $\ell_0$-norm in Eq. (5) with the $\ell_1$-norm, denoted $\ell_1$-sgSVD*. Formally, we deal with the following minimization problem,
(6) $\min_{u,\,v,\,d}\; \|X - d\,uv^{\mathsf T}\|_F^2 + \lambda_1\|u\|_1 + \lambda_2\|v\|_1 + \eta_1\,|u|^{\mathsf T}L_u|u| + \eta_2\,|v|^{\mathsf T}L_v|v|$
subject to $u^{\mathsf T}u = 1,\; v^{\mathsf T}v = 1$,
where $d$ is a positive singular value, $u$ is a $p \times 1$ column vector, $v$ is an $n \times 1$ column vector, $L_u$ and $L_v$ are the Laplacian matrices of the prior graphs on the row and column variables, and $\lambda_1$, $\lambda_2$, $\eta_1$, $\eta_2$ are four hyperparameters.
The objective function in Eq. (6) is biconvex with respect to $u$ and $v$. Thus, we propose an alternating iterative learning strategy to solve the optimization problem in Eq. (6). We first consider the optimization over $u$ for a fixed $v$, and obtain the following constrained optimization problem,
(7) $\min_{u}\; \|X - d\,uv^{\mathsf T}\|_F^2 + \lambda_1\|u\|_1 + \eta_1\,|u|^{\mathsf T}L_u|u|$
subject to $u^{\mathsf T}u = 1$.
We note that this objective function equals $\|X\|_F^2 - 2d\,u^{\mathsf T}Xv + d^2 + \lambda_1\|u\|_1 + \eta_1|u|^{\mathsf T}L_u|u|$. Obviously, minimizing it is equivalent to minimizing $-u^{\mathsf T}Xv + \lambda_1\|u\|_1 + \eta_1|u|^{\mathsf T}L_u|u|$, where the constant factor $2d$ has been absorbed into $\lambda_1$ and $\eta_1$. Let $z = Xv$. The problem Eq. (7) can thus be written as
(8) $\min_{u}\; -u^{\mathsf T}z + \lambda_1\|u\|_1 + \eta_1\,|u|^{\mathsf T}L_u|u|$
subject to $u^{\mathsf T}u = 1$.
The term $|u|^{\mathsf T}L_u|u|$ is in general non-convex in $u$, which makes solving the problem Eq. (8) challenging. Next we propose Theorem 1 to cleverly remove the absolute-value operator.
Theorem 1.
Suppose $u^\ast$ is an optimal solution of Eq. (8); then $u_i^\ast z_i \ge 0$ for all $i$.
Proof.
Suppose the theorem is false. Then there exists at least one index $i$ such that $u_i^\ast z_i < 0$. Construct a vector $\tilde u$ which satisfies $\tilde u_j = u_j^\ast$ for all $j \ne i$ and $\tilde u_i = -u_i^\ast$. Obviously, the vector $\tilde u$ satisfies $\tilde u^{\mathsf T}\tilde u = 1$ and $|\tilde u| = |u^\ast|$, so the penalty terms are unchanged, while $-\tilde u_i z_i < -u_i^\ast z_i$ (i.e., $-\tilde u^{\mathsf T}z < -{u^\ast}^{\mathsf T}z$). Thus, we obtain a vector $\tilde u$ with a smaller objective value than that of the optimal solution $u^\ast$, which is the desired contradiction. Hence the theorem is true. ∎
Based on Theorem 1, at the optimum $-u^{\mathsf T}z = -|u|^{\mathsf T}|z|$, so writing $w = |u|$ we can replace Eq. (8) with the following optimization problem,
(9) $\min_{w}\; -w^{\mathsf T}\tilde z + \lambda_1\mathbf{1}^{\mathsf T}w + \eta_1\,w^{\mathsf T}L_u w$
subject to $w^{\mathsf T}w = 1,\; w \ge 0$,
where $\tilde z = |z|$. A standard way to solve the optimization problem Eq. (9) is to formulate the Lagrangian form as follows:
(10) $\mathcal{L}(w) = -w^{\mathsf T}\tilde z + \lambda_1\mathbf{1}^{\mathsf T}w + \eta_1\,w^{\mathsf T}L_u w - \sum_i \mu_i w_i$,
where the $\mu_i \ge 0$ are Lagrangian multipliers for the constraints $w_i \ge 0$. Actually, we can easily establish Theorem 2, which confirms that the function in Eq. (10) is convex (we omit the proof here).
Theorem 2.
The function in Eq. (10) is convex in $w$.
Thus, the optimal solution of Eq. (10) is characterized by its subgradient equation as follows:
(11) $-\tilde z + \lambda_1\mathbf{1} + 2\eta_1 L_u w - \mu = 0$.
Ideally, we could obtain the optimal solution of Eq. (10) by solving the corresponding linear system in Eq. (11). However, this is inefficient due to the numerical difficulty of inverting a Hessian matrix. As an alternative, we adopt a simple coordinate descent method [Friedman et al.2007] to learn the vector $w$. The partial derivative of $\mathcal{L}(w)$ in Eq. (10) with respect to $w_i$ is
(12) $\dfrac{\partial\mathcal{L}}{\partial w_i} = -\tilde z_i + \lambda_1 + 2\eta_1\big(d_i w_i - A_i^{\mathsf T}w\big) - \mu_i$,
where $d_i$ is the degree of node $i$ and $A_i$ is the $i$th column of the adjacency matrix $A$. The complementary slackness KKT condition implies that $\mu_i = 0$ if $w_i > 0$, and $\mu_i \ge 0$ if $w_i = 0$. Thus, setting Eq. (12) to zero gives
(13) $w_i = \big[\tilde z_i - \lambda_1 + 2\eta_1 A_i^{\mathsf T}w\big]_+ \big/ \big(2\eta_1 d_i\big)$, where $[x]_+ = \max(x, 0)$.
Let $\hat w_i$ denote the converged update of Eq. (13) and $\hat w = (\hat w_1, \dots, \hat w_p)^{\mathsf T}$. To meet the normalizing condition, we let $\hat w \leftarrow \hat w / \|\hat w\|_2$. Finally, based on Theorem 1, we get the optimal solution of Eq. (7) as follows:
(14) $u = \mathrm{sign}(z) \odot \hat w$,
where $\odot$ denotes the element-wise product. Likewise, fixing $u$ in Eq. (6), we can obtain the coordinate update for $v$.
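The coordinate update above can be sketched as follows. This is an illustrative reconstruction, not the authors' reference code: the update formula follows our reading of Eq. (13), it assumes an unweighted prior graph with no isolated nodes, and the unit-norm condition is handled by re-normalizing after each sweep:

```python
import numpy as np

def update_w(z, A, lam1, eta1, n_sweeps=50):
    """Coordinate-wise sketch for the nonnegative surrogate w = |u|:
        w_i = max(|z_i| - lam1 + 2*eta1 * A[i] @ w, 0) / (2*eta1 * d_i),
    followed by sign restoration (Eq. (14)). Assumes every node has
    degree d_i > 0, so the denominator never vanishes."""
    deg = A.sum(axis=1)
    w = np.abs(z) / np.linalg.norm(z)  # nonnegative, unit-norm start
    for _ in range(n_sweeps):
        for i in range(len(z)):
            w[i] = max(abs(z[i]) - lam1 + 2 * eta1 * (A[i] @ w), 0.0) / (2 * eta1 * deg[i])
        w /= max(np.linalg.norm(w), 1e-12)  # re-normalize each sweep
    return np.sign(z) * w  # Theorem 1: u and z share signs

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
z = np.array([3.0, -2.0, 0.1])
u = update_w(z, A, lam1=0.5, eta1=1.0)
```

The returned vector is unit-norm and sign-consistent with $z$, as Theorem 1 requires.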
However, two problems remain unresolved: (1) how to solve for the singular value $d$; (2) what termination condition should stop the iterations. Note that the objective function of Eq. (6) contains $\|X - d\,uv^{\mathsf T}\|_F^2 = \|X\|_F^2 - 2d\,u^{\mathsf T}Xv + d^2$. Given $u$ and $v$, this is a quadratic function of $d$. Hence the optimal solution is $d = u^{\mathsf T}Xv$, and the corresponding optimal value is $\|X\|_F^2 - d^2$. Note that this minimum depends only on $d$. Therefore, we can control the iteration by monitoring the change of $d$: once the change of $d$ is smaller than a small positive constant $\epsilon$, where $\epsilon$ represents a tolerance level, we stop the iterations. Finally, we solve the optimization problem of Eq. (6) (Lagrangian form) by using an efficient alternating iterative projection algorithm (Algorithm 1). Note that the time complexity of Algorithm 1 is $O\big(T(np + |E|)\big)$, where $T$ is the number of iterations and $|E|$ is the number of edges in the prior graphs.
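The closed-form update of the singular value and the quantity monitored for convergence can be written directly (a small sketch under the unit-norm assumption on $u$ and $v$; not the authors' code):

```python
import numpy as np

def rank_one_fit(X, u, v):
    """For unit vectors u and v, the singular-value subproblem is quadratic
    in d: the minimizer is d = u^T X v and the attained data-fit value is
    ||X||_F^2 - d^2, the quantity whose change drives the stopping rule."""
    d = u @ X @ v
    value = np.linalg.norm(X, "fro") ** 2 - d ** 2
    return d, value

# On an exactly rank-one matrix the fit is perfect (value 0).
u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0, 0.0])
X = 2.0 * np.outer(u, v)
d, value = rank_one_fit(X, u, v)  # d = 2.0, value = 0.0
```

In the alternating loop, one would stop as soon as `abs(d_new - d_old)` falls below the tolerance.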
3.3 $\ell_0$-norm Sparse Graph-regularized SVD
In this section, we consider a novel sparse graph-regularized SVD with the $\ell_0$-norm (denoted $\ell_0$-sgSVD*, or simply sgSVD*), as follows:
(15) $\min_{u,\,v,\,d}\; \|X - d\,uv^{\mathsf T}\|_F^2 + \eta_1\,|u|^{\mathsf T}L_u|u| + \eta_2\,|v|^{\mathsf T}L_v|v|$
subject to $u^{\mathsf T}u = 1,\; v^{\mathsf T}v = 1,\; \|u\|_0 \le k_u,\; \|v\|_0 \le k_v$.
To solve the problem Eq. (15), we first fix $v$ and let $z = Xv$. Dropping the terms that do not involve $u$, we get the following optimization problem:
(16) $\min_{u}\; -u^{\mathsf T}z + \eta_1\,|u|^{\mathsf T}L_u|u|$
subject to $u^{\mathsf T}u = 1,\; \|u\|_0 \le k_u$.
Similar to Theorem 1, we have Theorem 3 to cleverly remove the absolute-value operator in Eq. (16).
Theorem 3.
Suppose $u^\ast$ is an optimal solution of Eq. (16); then $u_i^\ast z_i \ge 0$ for all $i$.
Therefore, to solve Eq. (16), writing $w = |u|$ and $\tilde z = |z|$, we first solve the following optimization problem,
(17) $\min_{w}\; -w^{\mathsf T}\tilde z + \eta_1\,w^{\mathsf T}L_u w$
subject to $w^{\mathsf T}w = 1,\; w \ge 0,\; \|w\|_0 \le k_u$.
We can further consider the following problem,
(18) $\min_{w}\; \mathcal{L}(w) = -w^{\mathsf T}\tilde z + \eta_1\,w^{\mathsf T}L_u w - \sum_i \mu_i w_i$
subject to $\|w\|_0 \le k_u$,
where $\tilde z = |z|$, $w = |u|$, and the $\mu_i \ge 0$ are Lagrangian multipliers for the constraints $w_i \ge 0$. Similar to Theorem 2, we can also prove that $\mathcal{L}(w)$ is a convex function. Note that when $\eta_1 = 0$, Eq. (18) corresponds to L0SVD. If we drop the constraint $\|w\|_0 \le k_u$ in problem Eq. (18), we can obtain an update rule for $w$ by using the coordinate descent method [Friedman et al.2007] (the derivation is similar to Eq. (13)). The complementary slackness KKT condition implies that $\mu_i = 0$ if $w_i > 0$, and $\mu_i \ge 0$ if $w_i = 0$. Thus, we obtain the following update rule,
(19) $w_i = \big(\tilde z_i + 2\eta_1 A_i^{\mathsf T}w\big) \big/ \big(2\eta_1 d_i\big)$.
Let $\hat w_i$ denote the update of Eq. (19) and $\hat w = (\hat w_1, \dots, \hat w_p)^{\mathsf T}$. To satisfy the condition $\|w\|_0 \le k_u$, we force the $p - k_u$ elements of $\hat w$ with the smallest absolute values to be zero, i.e., we project the vector $\hat w$ onto its closest $k_u$-sparse vector in Euclidean space,
(20) $\hat w \leftarrow \hat w \odot I\big(|\hat w| \ge |\hat w|_{(k_u)}\big)$,
where $I(\cdot)$ is the element-wise indicator function, $\odot$ denotes the element-wise product, and $|\hat w|_{(k_u)}$ denotes the $k_u$-th largest order statistic of $|\hat w|$. To meet the normalizing condition, let $\hat w \leftarrow \hat w / \|\hat w\|_2$. Based on Theorem 3, we can get a feasible solution of Eq. (16) as follows,
(21) $u = \mathrm{sign}(z) \odot \hat w$.
Hence, we obtain an approximate solution of Eq. (16) by using a sparse projection strategy. Note that the iterative scheme of Eq. (20) is consistent with the iterative thresholding algorithm [She2009]. In the same manner, for a fixed $u$, we can obtain the coordinate update for $v$.
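The sparse projection of Eqs. (20) and (21) amounts to hard-thresholding followed by normalization and sign restoration. A minimal sketch (illustrative; tie-breaking at the threshold is left naive here):

```python
import numpy as np

def l0_project(w_hat, k):
    """Eq. (20): keep the k entries of w_hat with largest magnitude, zero the
    rest (naive tie-breaking), then l2-normalize to meet the unit-norm
    condition."""
    thresh = np.sort(np.abs(w_hat))[-k]  # k-th largest order statistic
    w = np.where(np.abs(w_hat) >= thresh, w_hat, 0.0)
    return w / np.linalg.norm(w)

def recover_u(z, w_hat, k):
    """Eq. (21): restore the signs of z on the projected magnitude vector."""
    return np.sign(z) * l0_project(np.abs(w_hat), k)

w_hat = np.array([0.1, 3.0, -2.0, 0.05])
z = np.array([1.0, -1.0, 2.0, -2.0])
u = recover_u(z, w_hat, 2)  # 2-sparse, unit-norm, sign-consistent with z
```

The result keeps exactly $k_u$ nonzero coordinates and satisfies $u_i z_i \ge 0$, as Theorem 3 requires.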
Finally, we propose an Alternating Iterative Sparse Projection (AISP) algorithm (Algorithm 2) to solve the optimization problem (15) (Lagrangian form). When $\eta_1 = \eta_2 = 0$, the sgSVD* algorithm reduces to L0SVD [Min et al.2015]. We can prove that Eq. (21) is the optimal solution of the rank-one L0SVD model for a fixed $v$. In addition, the prior graph information carries more weight for larger $\eta_1$ ($\eta_2$). Moreover, Algorithm 2 has the same time complexity as Algorithm 1.
4 Experiments
4.1 Synthetic Data Experiments
Here, we evaluate the performance of sgSVD* on simulated data and compare it with sgSVD, L0SVD [Min et al.2015], ALSVD [Lee et al.2010] and SCAD-SVD (sparse SVD via the SCAD penalty) [Fan and Li2001].
Without loss of generality, we let the rank-one true signal matrix be $X^\ast = d\,uv^{\mathsf T}$, where $u$ and $v$ are vectors of size $p$ and $n$, respectively. The observed matrix $X$ is generated as the sum of $X^\ast$ and a noise matrix $\epsilon$, i.e., $X = X^\ast + \gamma\epsilon$, where the elements of $\epsilon$ are randomly sampled from a standard normal distribution (i.e., $\epsilon_{ij} \sim N(0,1)$), and $\gamma$ is a non-negative parameter controlling the signal-to-noise ratio. Accordingly, we generate two sparse singular vectors $u$ and $v$ whose first 50 elements are sampled from a normal distribution and whose remaining elements are zero. We create a data matrix for each $\gamma$ ranging from 0.02 to 0.06 in steps of 0.005, and two prior graphs for the row and column variables of $X$, whose first 50 vertices are connected with a higher probability and whose remaining ones are connected with a lower probability. Moreover, to test whether sgSVD depends on the signs of the singular vectors, we also simulate data generated from variables with the same signs by setting $u$ and $v$ with the same parameters. We apply the five methods to the simulated data, with regularization parameters selected by cross-validation. We enforce each singular vector ($u$ or $v$) to contain 50 non-zero elements (the same sparsity level) for each method by tuning its parameters, to obtain comparable results across methods. We first demonstrate, with a simple example corresponding to Figure 1C (Figure 1A), how taking the magnitudes of the singular vectors with the absolute-value operator overcomes the opposite-sign effect of the classical graph-regularized penalty. Moreover, we clearly find that both the sensitivity and the specificity of sgSVD* and sgSVD are superior to those of L0SVD, ALSVD and SCAD-SVD (Figure 1B). For signals with mixed signs, sgSVD* performs distinctly better than sgSVD and the other methods (Figure 1C). These results suggest that (1) sgSVD* is indeed effective compared to other methods (e.g., sgSVD and L0SVD), and (2) sgSVD does suffer distinct limitations when some connected variables in the prior graph have different signal signs.
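The synthetic setup can be reproduced in outline as follows. The dimensions, the noise role of $\gamma$, and the graph connection probabilities are assumptions on our part, since the paper's exact figures are elided in this copy:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 200, 100, 50  # assumed sizes; the first k variables carry signal

# Sparse singular vectors: first k entries Gaussian, the rest zero.
u = np.concatenate([rng.standard_normal(k), np.zeros(p - k)])
v = np.concatenate([rng.standard_normal(k), np.zeros(n - k)])

# Observed matrix: rank-one signal plus gamma-scaled Gaussian noise.
gamma = 0.04  # one value from the paper's sweep range (0.02 to 0.06)
X = np.outer(u, v) + gamma * rng.standard_normal((p, n))

def prior_graph(p, k, p_in, p_out, rng):
    """Random prior graph: pairs among the first k vertices are linked with
    probability p_in, all other pairs with probability p_out (the paper's
    exact probabilities are elided, so p_in and p_out are free parameters)."""
    A = (rng.random((p, p)) < p_out).astype(int)
    A[:k, :k] = (rng.random((k, k)) < p_in).astype(int)
    A = np.triu(A, 1)   # keep the upper triangle, drop self-loops
    return A + A.T      # symmetrize

A_row = prior_graph(p, k, 0.2, 0.01, rng)
A_col = prior_graph(n, k, 0.2, 0.01, rng)
```

This yields one observed matrix and two prior graphs per noise level, matching the structure of the experiment described above.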
4.2 Applications to Gene Expression Data
We apply sgSVD* to two cancer gene expression datasets to demonstrate its effectiveness in deciphering sample-specific gene modules. We download a gene expression dataset from the Cancer Genome Project (CGP) [Garnett et al.2012] and obtain a data matrix with 13,321 genes across 641 cell lines (samples). We further download a breast cancer gene expression dataset (BRCA) from the TCGA database [Koboldt et al.2012], and obtain a data matrix consisting of 11,221 genes and 518 samples. For both applications, we construct a prior gene graph consisting of 13,321 genes and 262,462 interactions (a gene interaction network) from the PathwayCommons database.
We apply sgSVD* to the CGP and BRCA data, using the prior gene graph for the gene (row) variables and no prior graph for the sample (column) variables. For each dataset, we first obtain the first pair of singular vectors $u$ and $v$. Next, we subtract the signal of the current pair of singular vectors from the input data (i.e., $X \leftarrow X - d\,uv^{\mathsf T}$), and apply sgSVD* again to identify the next pair. The first 40 pairs of singular vectors are used to identify 40 gene modules. Similarly, we also apply sgSVD and L0SVD to the CGP and BRCA data for comparison. We find that the gene pairs within each identified module do show distinct expression correlations, as judged by comparing the sum of the absolute Pearson correlation coefficients between gene pairs within a module to those of randomly generated modules. We also test different regularization parameter values and find the corresponding results to be largely consistent; below we therefore only show the results obtained with one representative setting.
For a given module with a certain number of genes and interactions (edges) in the prior graph, we assess the edge-enrichment of the module using a Fold Change (FC) score that compares the module's edge density with the overall edge density of the prior gene graph. We can clearly see that the FC scores of the modules identified from the CGP and BRCA data by sgSVD* are significantly higher than those of sgSVD and L0SVD (Figure 2A). In addition, the statistical significance of the FC is calculated using a right-tailed hypergeometric test. We find that the percentages of edge-enriched gene modules of sgSVD* are much higher than those of sgSVD and L0SVD at different significance levels on both the BRCA and CGP data (Table 1). These results suggest that the modules identified by sgSVD* show more distinct biological relevance than those of the other methods.
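The exact FC formula is elided in this copy of the text; one plausible reading, comparing the module's edge density with the global density of the prior graph, is sketched below (an assumption on our part, not the paper's definitive definition):

```python
from math import comb

def fold_change(n_genes, n_edges, N_genes, N_edges):
    """Assumed edge-enrichment fold change: the module's edge density
    (edges over possible gene pairs) relative to the overall density of
    the prior graph with N_genes vertices and N_edges edges."""
    module_density = n_edges / comb(n_genes, 2)
    global_density = N_edges / comb(N_genes, 2)
    return module_density / global_density
```

Under this reading, a module whose edge density matches the graph's overall density scores FC = 1, and denser modules score proportionally higher.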
Table 1: Percentages (%) of edge-enriched gene modules at different significance levels.

Level  | BRCA: L0SVD  sgSVD  sgSVD* | CGP: L0SVD  sgSVD  sgSVD*
0.10   |       57.5   60.0   70.0   |      57.5   62.5   67.5
0.05   |       57.5   57.5   70.0   |      57.5   60.0   65.0
0.01   |       55.0   57.5   65.0   |      57.5   57.5   62.5
0.005  |       55.0   57.5   65.0   |      57.5   57.5   62.5
0.001  |       55.0   57.5   65.0   |      55.0   57.5   60.0
To further validate our findings with biological functions, we also evaluate the functional enrichment of each module. We find that the gene modules identified by sgSVD* are more significantly enriched in known Gene Ontology (GO) biological process (BP) terms at different significance levels than those of sgSVD and L0SVD on both datasets (Figure 2B). In addition, we find that the number of enriched modules of sgSVD is lower than that of L0SVD at most significance levels on the BRCA data. This is probably because the coefficients of a number of variables linked in the prior graph have different signs in these data. Furthermore, to assess whether the numbers of significant GO biological processes could arise by chance, we also perform the same test on forty random modules whose gene labels are randomly permuted; no significant GO BP term is discovered. All these results demonstrate that sgSVD* can identify more biologically relevant gene modules than the other methods.
5 Conclusion
In this paper, we propose an $\ell_0$-norm sparse graph-regularized SVD for biclustering high-dimensional data. We adopt the graph-regularized penalty and the $\ell_0$-norm to induce structured sparsity of the singular vectors and enhance the interpretability of the data structure. Importantly, we apply the absolute-value operator to the singular vectors to overcome the opposite-sign effect of the classical graph-regularized penalty. To solve this novel model, we first use a simple trick to remove the absolute-value operator in the new graph-regularized penalty, and then design an efficient Alternating Iterative Sparse Projection (AISP) algorithm. We expect the proposed method to be applicable to high-dimensional data from diverse domains. Moreover, it will be valuable to extend the current idea to other statistical learning frameworks, e.g., canonical correlation analysis (CCA) and partial least squares (PLS).
Acknowledgment
This work was supported by the National Science Foundation of China [61379092, 61422309], the Natural Science Foundation of Hubei Province of China [2014CFB194], the Outstanding Young Scientist Program of CAS, and the Key Laboratory of Random Complex Structures and Data Science, CAS.
References
 [Aharon et al.2006] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. on Signal Processing, 54(11):4311–4322, 2006.
 [Alter et al.2000] Orly Alter, Patrick O Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, 2000.
 [Ando and Zhang2007] Rie Kubota Ando and Tong Zhang. Learning on graph with laplacian regularization. NIPS, 19:25, 2007.
 [Cai et al.2011] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.
 [Deshpande and Montanari2014] Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. In NIPS, pages 334–342, 2014.
 [Fan and Li2001] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
 [Friedman et al.2007] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
 [Garnett et al.2012] Mathew J Garnett, Elena J Edelman, Sonja J Heidorn, Chris D Greenman, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483(7391):570–575, 2012.
 [Gu et al.2014] Quanquan Gu, Zhaoran Wang, and Han Liu. Sparse pca with oracle property. In NIPS, pages 1529–1537, 2014.

[He et al.2011]
Ran He, WeiShi Zheng, BaoGang Hu, and XiangWei Kong.
Nonnegative sparse coding for discriminative semisupervised learning.
In CVPR, pages 2849–2856. IEEE, 2011.  [Jacob et al.2009] Laurent Jacob, Guillaume Obozinski, and JeanPhilippe Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440, 2009.
 [Jenatton et al.2010] Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Structured sparse principal component analysis. In AISTATS, pages 366–373, 2010.
 [Jiang et al.2013] Bo Jiang, Chris Ding, Bin Luo, and Jin Tang. Graph-Laplacian PCA: Closed-form solution and robustness. In CVPR, pages 3492–3498, 2013.
 [Jing et al.2015] Liping Jing, Peng Wang, and Liu Yang. Sparse probabilistic matrix factorization by laplace distribution for collaborative filtering. In IJCAI, pages 1771–1777, 2015.
 [Koboldt et al.2012] Daniel C Koboldt, Robert S Fulton, Michael D McLellan, Heather Schmidt, et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.
 [Lee et al.2006] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In NIPS, pages 801–808, 2006.
 [Lee et al.2010] Mihee Lee, Haipeng Shen, Jianhua Z Huang, and JS Marron. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087–1095, 2010.
 [Li and Li2008] Caiyan Li and Hongzhe Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008.

 [Li and Li2010] Caiyan Li and Hongzhe Li. Variable selection and regression analysis for graph-structured covariates with an application to genomics. The Annals of Applied Statistics, 4(3):1498, 2010.
 [Li and Yeung2009] Wu-Jun Li and Dit-Yan Yeung. Relation regularized matrix factorization. In IJCAI, pages 1126–1131, 2009.
 [Min et al.2015] Wenwen Min, Juan Liu, and Shihua Zhang. A novel two-stage method for identifying miRNA-gene regulatory modules in breast cancer. In Bioinformatics and Biomedicine (BIBM), pages 151–156, 2015.
 [She2009] Yiyuan She. Thresholdingbased iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics, 3:384–415, 2009.

[Shen and Huang2008]
Haipeng Shen and Jianhua Z Huang.
Sparse principal component analysis via regularized low rank matrix
approximation.
Journal of multivariate analysis
, 99(6):1015–1034, 2008.  [Shi et al.2013] Jianping Shi, Naiyan Wang, Yang Xia, DitYan Yeung, Irwin King, and Jiaya Jia. Scmf: sparse covariance matrix factorization for collaborative filtering. In IJCAI, pages 2705–2711, 2013.
 [Sill et al.2011] Martin Sill, Sebastian Kaiser, Axel Benner, and Annette Kopp-Schneider. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089–2097, 2011.
 [Wilson and Zhu2008] Richard C Wilson and Ping Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008.
 [Witten et al.2009] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
 [Zheng et al.2011] Miao Zheng, Jiajun Bu, Chun Chen, Can Wang, Lijun Zhang, Guang Qiu, and Deng Cai. Graph regularized sparse coding for image representation. IEEE Trans. on Image Processing, 20(5):1327–1336, 2011.
 [Zhou et al.2015] Xun Zhou, Jing He, Guangyan Huang, and Yanchun Zhang. SVD-based incremental approaches for recommender systems. Journal of Computer and System Sciences, 81(4):717–733, 2015.