Canonical correlation analysis (CCA) is a powerful tool for integrating two data matrices [Klami et al.2013, Sun et al.2008, Yang et al.2017, Cai et al.2016, Wang et al.2017] and has been widely used in many fields. Given two matrices $X$ and $Y$ measured on the same samples, CCA finds two canonical vectors $u$ and $v$ that maximize the correlation between $Xu$ and $Yv$. However, in many real-world problems such as those in bioinformatics [Witten et al.2009, Mizutani et al.2012, Lê Cao et al.2009, Fang et al.2016, Yoshida et al.], the number of variables in each data matrix is usually much larger than the sample size, and classical CCA yields non-sparse canonical vectors that are difficult to interpret biologically. To address this issue, a large number of sparse CCA models [Witten et al.2009, Mizutani et al.2012, Lê Cao et al.2009, Fang et al.2016, Yoshida et al., Parkhomenko et al.2009, Witten and Tibshirani2009, Asteris et al.2016, Chu et al.2013, Hardoon and Shawe-Taylor2011, Gao et al.2015] have been proposed that use regularization penalties (e.g., LASSO and the $L_0$-norm) to obtain sparse canonical vectors for variable selection. Parkhomenko et al. [Parkhomenko et al.2009] first proposed a sparse CCA (SCCA) model with a LASSO ($L_1$-norm) penalty for genomic data integration. Lê Cao et al. [Lê Cao et al.2009] further proposed a regularized CCA with an Elastic-Net penalty for a similar task. Witten et al. [Witten et al.2009] proposed the penalized matrix decomposition (PMD) algorithm to solve sparse CCA with two penalties, LASSO and fused LASSO, in order to integrate DNA copy number and gene expression data from the same samples/individuals. Furthermore, many generalized-LASSO regularized CCA models have been proposed to incorporate prior structural information about the variables [Lin et al.2013, Virtanen et al.2011, Chen et al.2012a, Chen et al.2012b, Du et al.2016]. For example, Lin et al. [Lin et al.2013] proposed a group-LASSO regularized CCA to explore the relationship between two types of genomic data sets. If we regard a pathway as a gene group, then gene pathways form an overlapping group structure [Jacob et al.2009]. Chen et al. [Chen et al.2012b] developed an overlapping-group-LASSO regularized CCA model to exploit such group structure.
These existing sparse CCA models find two sparse canonical vectors, each supported on a small subset of variables, using all samples (Fig.1(a)). However, many real data sets, such as cancer genomic data, show distinct heterogeneity [Dai et al.2015, Chen and Zhang2016]. Current CCA models do not account for such heterogeneity and cannot directly identify a set of sample-specific correlated variables. To this end, we propose a novel sparse weighted CCA (SWCCA) model, in which weights are used to regularize different samples with a typical penalty (e.g., LASSO or the $L_0$-norm) (Fig.1(b)). In this way, SWCCA selects not only two variable sets but also a sample set (Fig.1(b)). We further adopt an efficient alternating iterative algorithm to solve the $L_0$ (or $L_1$) regularized SWCCA model. We apply $L_0$-SWCCA and related methods to two simulated data sets and two real biological data sets to demonstrate its effectiveness in capturing correlated variables across a subset of samples.
2 $L_0$-regularized SWCCA
Here, we assume that there are two data matrices $X \in \mathbb{R}^{n \times p}$ ($n$ samples and $p$ variables) and $Y \in \mathbb{R}^{n \times q}$ ($n$ samples and $q$ variables) measured on the same set of samples. Classical CCA seeks two components ($u$ and $v$) that maximize the correlation between linear combinations of the variables from the two data matrices, as in Eq.(1).
If $X$ and $Y$ are centered, we obtain the empirical covariance matrices $\Sigma_{xy} = \frac{1}{n} X^T Y$, $\Sigma_{xx} = \frac{1}{n} X^T X$ and $\Sigma_{yy} = \frac{1}{n} Y^T Y$. Thus we have the equivalent criterion in Eq.(2).
Clearly, $\rho$ in (2) is invariant to the scaling of $u$ and $v$. Thus, maximizing criterion (2) is equivalent to solving the constrained optimization problem in Eq.(3).
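For reference, the three criteria that Eqs.(1)-(3) refer to can be written in the standard textbook form below. This is a reconstruction from the surrounding description, not a verbatim copy of the original displays; the symbols $X$, $Y$, $u$, $v$, $\rho$ and the covariance notation $\Sigma_{xy}$, $\Sigma_{xx}$, $\Sigma_{yy}$ are assumed.

```latex
\begin{align}
\rho &= \max_{u,v}\;
  \frac{u^{T}\Sigma_{xy}\,v}
       {\sqrt{(u^{T}\Sigma_{xx}\,u)\,(v^{T}\Sigma_{yy}\,v)}} \tag{1}\\[4pt]
\rho &= \max_{u,v}\;
  \frac{u^{T}X^{T}Y\,v}
       {\sqrt{(u^{T}X^{T}X\,u)\,(v^{T}Y^{T}Y\,v)}} \tag{2}\\[4pt]
&\max_{u,v}\; u^{T}X^{T}Y\,v
  \quad \text{s.t.}\quad u^{T}X^{T}X\,u = 1,\;\; v^{T}Y^{T}Y\,v = 1 \tag{3}
\end{align}
```

The normalizing factor of the empirical covariances cancels between numerator and denominator in (2), which is why it does not appear there.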
Previous studies [Witten et al.2009, Witten and Tibshirani2009] have shown that treating the covariance matrices ($\frac{1}{n} X^T X$, $\frac{1}{n} Y^T Y$) as diagonal can yield better results. For this reason, Asteris et al. [Asteris et al.2016] assume that $X^T X = I$ and $Y^T Y = I$, and the $L_0$-regularized sparse CCA ($L_0$-SCCA) model (also called “diagonal penalized CCA”) can be presented as Eq.(4).
where $\|\cdot\|_0$ is the $L_0$-norm penalty function, which returns the number of non-zero entries of its argument. Asteris et al. [Asteris et al.2016] applied a projection strategy to solve $L_0$-SCCA. Let $A = X^T Y$; then the model of Eq.(4) is equivalent to a rank-one $L_0$-SVD model [Min et al.2015].
Let $z_1 = Xu$ and $z_2 = Yv$; then the objective function is $u^T X^T Y v = z_1^T z_2 = \sum_{i=1}^{n} z_{1i} z_{2i}$. To account for the different contributions of samples, we modify the objective function of Eq.(4) to be $\sum_{i=1}^{n} w_i z_{1i} z_{2i}$ with sample weights $w = (w_1, \ldots, w_n)^T$. Thus, we obtain a new objective function as Eq.(5).
Furthermore, we also force $w$ to be sparse in order to select a limited number of samples. Finally, we propose an $L_0$-regularized SWCCA ($L_0$-SWCCA) model as Eq.(6).
where $\mathrm{diag}(w)$ is a diagonal matrix and $\mathrm{diag}(w)_{ii} = w_i$. If $w_1 = \cdots = w_n$, then $L_0$-SWCCA reduces to $L_0$-SCCA.
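The two models referred to as Eq.(4) and Eq.(6) can plausibly be written as follows. This is a reconstruction from the prose, offered as a sketch. The sparsity levels $k_u$, $k_v$, $k_w$ and the unit-norm constraint on $w$ are assumptions: the latter is inferred from the fact that the same sparse projection step is later reused for $u$, $v$ and $w$. The objective of (6) is the weighted objective described as Eq.(5).

```latex
\begin{align}
&\max_{u,v}\; u^{T}X^{T}Y\,v
  \quad \text{s.t.}\quad \|u\|_2 = \|v\|_2 = 1,\;\;
  \|u\|_0 \le k_u,\;\; \|v\|_0 \le k_v \tag{4}\\[4pt]
&\max_{u,v,w}\; u^{T}X^{T}\,\mathrm{diag}(w)\,Y\,v
  \;=\; \sum_{i=1}^{n} w_i\,(Xu)_i\,(Yv)_i \notag\\
&\qquad \text{s.t.}\quad \|u\|_2 = \|v\|_2 = \|w\|_2 = 1,\;\;
  \|u\|_0 \le k_u,\;\; \|v\|_0 \le k_v,\;\; \|w\|_0 \le k_w \tag{6}
\end{align}
```

With constant weights the objective of (6) is a positive multiple of the objective of (4), which matches the stated reduction of $L_0$-SWCCA to $L_0$-SCCA.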
In this section, we design an alternating iterative algorithm to solve (6) using a sparse projection strategy. We start with the sparse projection problem that corresponds to the sub-problem of (6) with fixed $v$ and $w$, as Eq.(7).
For a given column vector $z \in \mathbb{R}^{p}$ and an integer $k \le p$, we define a sparse projection operator $\Pi(z, k)$ as Eq.(8).
where $\mathrm{support}(z, k)$ is defined as the set of indices corresponding to the $k$ largest absolute values of $z$. For example, if , then .
Theorem 1 The solution of problem (7) is
Note that $\|\cdot\|$ denotes the Euclidean norm. Theorem 1 can be proved by contradiction (we omit the proof here). Based on Theorem 1, we design an alternating iterative approach to solve Eq.(6).
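The sparse projection step can be sketched in a few lines of Python. This is a minimal illustration, assuming that problem (7) has the form "maximize $z^T u$ subject to $\|u\|_2 = 1$ and $\|u\|_0 \le k$", so that its solution is the hard-thresholded vector rescaled to unit length. The function names `sparse_project` and `solve_sparse_projection` are hypothetical and not taken from the paper.

```python
import numpy as np


def sparse_project(z, k):
    """Keep the k largest-magnitude entries of z and set the rest to zero."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    if k <= 0:
        return out
    k = min(k, z.size)
    # Indices of the k largest |z_i| (order within the selection is arbitrary).
    keep = np.argpartition(np.abs(z), z.size - k)[z.size - k:]
    out[keep] = z[keep]
    return out


def solve_sparse_projection(z, k):
    """Unit-norm, k-sparse vector maximizing z^T u (assumed form of problem (7))."""
    projected = sparse_project(z, k)
    norm = np.linalg.norm(projected)
    # If z is all zeros, every feasible u gives the same objective; return zeros.
    return projected / norm if norm > 0 else projected


if __name__ == "__main__":
    z = np.array([0.5, -3.0, 2.0, 0.1])
    print(solve_sparse_projection(z, 2))
```

Ties among equal absolute values are broken arbitrarily by `np.argpartition`, which does not change the objective value.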
i) Optimizing $u$ with fixed $v$ and $w$. Fix $v$ and $w$ in Eq.(6) and let $z_u = X^T \mathrm{diag}(w) Y v$; then Eq.(6) reduces to Eq.(10).
Based on Theorem 1, we obtain the update rule of $u$ as Eq.(11).
ii) Optimizing $v$ with fixed $u$ and $w$. Fix $u$ and $w$ in Eq.(6) and let $z_v = Y^T \mathrm{diag}(w) X u$; then Eq.(6) reduces to Eq.(12).
Similarly, we obtain the update rule of $v$ as Eq.(13).
iii) Optimizing $w$ with fixed $u$ and $v$. Fix $u$ and $v$ in Eq.(6); then Eq.(6) reduces to Eq.(14).
Let $t_1 = Xu$, $t_2 = Yv$ and $z_w = t_1 \odot t_2$, where ‘$\odot$’ denotes element-wise multiplication (equivalent to ‘.*’ in Matlab); then we have $u^T X^T \mathrm{diag}(w) Y v = z_w^T w$. Thus, problem (14) reduces to Eq.(15).
Similarly, we obtain the update rule of $w$ as Eq.(16).
Finally, combining (11), (13) and (16), we propose the following alternating iterative algorithm to solve problem (6) as Algorithm 1.
Terminating Condition: We can set different stopping conditions to control the iterations. For example, we can stop when the update lengths of $u$, $v$, and $w$ are smaller than a given threshold (i.e., $\|u^{t+1} - u^{t}\|_2 < \epsilon$, $\|v^{t+1} - v^{t}\|_2 < \epsilon$ and $\|w^{t+1} - w^{t}\|_2 < \epsilon$), when the number of iterations reaches a given maximum (e.g., 1000), or when the change in the objective value is less than a given threshold.
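The three update steps and the stopping rule can be combined into a short sketch of the alternating scheme. This is an illustrative approximation of Algorithm 1 under stated assumptions, not the authors' reference implementation. In particular, the initialization (random unit vectors for $u$ and $v$, uniform weights for $w$), the tolerance `tol`, and the names `l0_swcca` and `_unit_sparse` are all hypothetical choices made for this example.

```python
import numpy as np


def _unit_sparse(z, k):
    """Hard-threshold z to its k largest-magnitude entries, then rescale to unit norm."""
    out = np.zeros_like(z, dtype=float)
    k = max(1, min(k, z.size))
    keep = np.argpartition(np.abs(z), z.size - k)[z.size - k:]
    out[keep] = z[keep]
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out


def l0_swcca(X, Y, ku, kv, kw, max_iter=1000, tol=1e-6, seed=0):
    """Alternate sparse projections for u, v and w until the updates become small."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    # Assumed initialization: dense random unit vectors and uniform sample weights.
    u = _unit_sparse(rng.standard_normal(p), p)
    v = _unit_sparse(rng.standard_normal(q), q)
    w = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(max_iter):
        u_old, v_old, w_old = u, v, w
        # diag(w) is never formed: w * (Y @ v) costs O(nq + n).
        u = _unit_sparse(X.T @ (w * (Y @ v)), ku)
        v = _unit_sparse(Y.T @ (w * (X @ u)), kv)
        w = _unit_sparse((X @ u) * (Y @ v), kw)
        step = max(
            np.linalg.norm(u - u_old),
            np.linalg.norm(v - v_old),
            np.linalg.norm(w - w_old),
        )
        if step < tol:
            break
    return u, v, w
```

Each update only needs matrix-vector products and one top-$k$ selection, which is what keeps the per-iteration cost low.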
Computation Complexity: The complexity of multiplying an $n \times p$ matrix by a $p \times q$ matrix is $O(npq)$. To reduce the computational complexity of $X^T \mathrm{diag}(w) Y v$, we note that $X^T \mathrm{diag}(w) Y v = X^T [w \odot (Yv)]$. Let $s_1 = Yv$, $s_2 = w \odot s_1$ and $s_3 = X^T s_2$. Thus, the complexity of $X^T \mathrm{diag}(w) Y v$ is $O(nq + n + np)$. Similarly, the complexity of $Y^T \mathrm{diag}(w) X u$ is $O(np + n + nq)$, and the complexity of $(Xu) \odot (Yv)$ is $O(np + nq + n)$. In Algorithm 1, we need to obtain the $k$ largest absolute values of a given vector $z$ of size $p$ [i.e., $\mathrm{support}(z, k)$]. We adopt a linear-time selection algorithm called Quick select (QS) to compute $\mathrm{support}(z, k)$; it applies a divide-and-conquer strategy, and its average time complexity is $O(p)$. Thus, the overall time complexity of Algorithm 1 is $O(Tnp + Tnq)$, where $T$ is the number of iterations needed for convergence. In general, $T$ is a small number.
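The selection step can be illustrated with a plain quickselect. The sketch below is a generic textbook variant with a random pivot, written for clarity rather than speed; it is not claimed to be the exact routine used by the authors, and the function names are hypothetical. It first finds the $k$-th largest absolute value in average linear time and then collects the indices at or above that threshold.

```python
import random


def quickselect_kth_largest(values, k, rng=None):
    """Return the k-th largest element (1-indexed) in average O(n) time."""
    rng = rng or random.Random(0)
    items = list(values)
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= len(values)")
    while True:
        pivot = rng.choice(items)
        larger = [x for x in items if x > pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k <= len(larger):
            items = larger  # the answer is among the larger elements
        elif k <= len(larger) + equal_count:
            return pivot  # the pivot itself is the k-th largest
        else:
            k -= len(larger) + equal_count
            items = [x for x in items if x < pivot]


def top_k_abs_indices(z, k):
    """Sorted indices of the k entries of z with the largest absolute values."""
    threshold = quickselect_kth_largest([abs(x) for x in z], k)
    above = [i for i, x in enumerate(z) if abs(x) > threshold]
    ties = [i for i, x in enumerate(z) if abs(x) == threshold]
    # Fill the remaining slots with tied entries, lowest index first.
    return sorted(above + ties[: k - len(above)])


if __name__ == "__main__":
    print(top_k_abs_indices([0.5, -3.0, 2.0, 0.1], 2))
```

In NumPy-based code the same selection is usually delegated to `np.argpartition`, which avoids a full sort in the same way.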
4.1 Synthetic data 1
Here we generate the first synthetic data matrices $X$ and $Y$ with , and using the following two steps:
Step 1: Generate two canonical vectors $u$, $v$ and a weight vector $w$ as Eq.(17).
where denotes a row vector of size , whose elements are equal to , and denotes a row vector of size , whose elements are randomly sampled from a standard normal distribution.
Step 2: Generate two input matrices and as Eq.(18).
where the elements of and are randomly sampled from a standard normal distribution.
We evaluate the performance of $L_0$-SWCCA on the above synthetic data and compare it with typical sparse CCA methods, including $L_0$-SCCA [Asteris et al.2016] and PMD [Witten et al.2009] with the $L_1$-penalty. For comparison, we set parameters , and for $L_0$-SWCCA; , for $L_0$-SCCA; and for PMD. Note that and where for PMD are used to approximately control the sparsity proportion of the canonical vectors ($u$ and $v$).
The true and estimated patterns for $u$, $v$ and $w$ in synthetic data 1 are shown in Fig.2. Compared to PMD, $L_0$-SWCCA and $L_0$-SCCA do fairly well at identifying the local non-zero pattern of the underlying factors (i.e., $u$ and $v$). However, the two traditional SCCA methods ($L_0$-SCCA and PMD) do not distinguish between samples and therefore cannot remove the noisy samples. Interestingly, $L_0$-SWCCA not only discovers the true patterns for $u$ and $v$ (Fig.2(b) and (g)), but also identifies the true non-zero characteristics of the samples ($w$) (Fig.2(e)). Furthermore, to assess whether our approach is indeed able to find a greater correlation level between the two input matrices, we define the correlation criterion as Eq.(19).
where $\mathrm{cor}(\cdot, \cdot)$ is a function that calculates the correlation coefficient of two vectors. For comparison, we set for $L_0$-SCCA and PMD to compute the correlation criterion. $L_0$-SWCCA gets the largest compared to $L_0$-SCCA with and PMD with in the above synthetic data. All results show that $L_0$-SWCCA is more effective at capturing the latent patterns of canonical vectors than the other methods.
4.2 Synthetic data 2
Here we use another procedure to generate synthetic data matrices $X$ and $Y$ with , , and . The following three steps are used to generate the second synthetic data matrices $X$ and $Y$:
Step 1: We first generate two zero matrices as Eq.(20).
Step 2: We then update two sub-matrices in $X$ and $Y$.
Step 3: We add Gaussian noise to $X$ and $Y$.
For simplicity and comparison, we define the true $u$, true $v$ and true $w$ to characterize the patterns of $X$ and $Y$ (Fig.3(a), (f) and (e)). Similarly, we also apply $L_0$-SWCCA, $L_0$-SCCA [Asteris et al.2016] and PMD [Witten et al.2009] to synthetic data 2. For comparison, we set parameters , and for $L_0$-SWCCA; , for $L_0$-SCCA; and for PMD.
The true and estimated patterns for $u$, $v$ and $w$ are shown in Fig.3. $L_0$-SWCCA and $L_0$-SCCA are superior to PMD in identifying the latent patterns of the canonical vectors $u$ and $v$ (Fig.3(b), (c), (d), (g), (h) and (i)). However, $L_0$-SCCA and PMD fail to remove the interfering samples. Compared to $L_0$-SCCA and PMD, $L_0$-SWCCA clearly identifies the true non-zero characteristics of the samples (Fig.3(e)). Similarly, we also compute the correlation criterion based on formula (19). We find that $L_0$-SWCCA gets the largest correlation compared to $L_0$-SCCA with and PMD with . All results show that our method is more effective at capturing the latent patterns of canonical vectors than the other methods.
4.3 Breast cancer data
We first consider a breast cancer dataset [Witten et al.2009, Chin et al.2006] consisting of gene expression and DNA copy number variation data across 89 cancer samples. Specifically, the gene expression data $X$ and the DNA copy number data $Y$ are of size $n \times p$ and $n \times q$ with $n = 89$, and . We apply SWCCA and related methods to these data to identify a gene set whose expression is strongly correlated with copy number changes in some genomic regions.
In PMD [Witten et al.2009], we set and , where $c$ approximately controls the sparsity ratio of the canonical vectors. For a fair comparison, we ensure that the canonical vectors ($u$ and $v$) extracted by the three methods (PMD, $L_0$-SCCA, and $L_0$-SWCCA) have the same sparsity level. We first apply PMD to the breast cancer data to obtain two sparse canonical vectors $u$ and $v$ for each given $c$. Then, we compute the number of nonzero elements in the extracted $u$ and $v$, denoted as and . Finally, we set , in $L_0$-SCCA and $L_0$-SWCCA, and set in $L_0$-SWCCA to identify the sample loading $w$ with sparse ratio .
We adopt two criteria for comparison: the correlation level defined in formula (19) and the objective value defined in Eq.(5). Here we consider different $c$ values (i.e., ) to control the sparsity ratio of the canonical vectors. We find that, compared to PMD and $L_0$-SCCA, $L_0$-SWCCA obtains a higher correlation level and objective value in all cases (Table 1). Since the breast cancer data set does not include any clinical information about the patients, it is very difficult to study the specific characteristics of the selected samples. To this end, we also apply our method to another biological data set with more detailed clinical information.
Table 1. Results on correlation level (CL) and objective value (OV) for different $c$ values.
4.4 TCGA BLCA data
Recently, studying the regulatory relationship between microRNAs (miRNAs) and genes from matched miRNA and gene expression data has become a hot topic [Min et al.2015, Zhang et al.2011]. Here, we apply SWCCA to the bladder urothelial carcinoma (BLCA) miRNA and gene expression data across 405 patients from TCGA (https://cancergenome.nih.gov/) to identify a subtype-specific miRNA-gene co-correlation module. To remove noisy miRNAs and genes, we first use the standard deviation to extract the 200 miRNAs and 5000 genes with the largest variance for further analysis. Finally, we obtain a miRNA expression matrix $X \in \mathbb{R}^{405 \times 200}$, which is standardized for each miRNA, and a gene expression matrix $Y \in \mathbb{R}^{405 \times 5000}$, which is standardized for each gene. We apply $L_0$-SWCCA to the BLCA data with sparsity levels of 10 for $u$, 200 for $v$ and 203 for $w$, that is, to identify a miRNA set with 10 miRNAs, a gene set with 200 genes and a sample set with 203 patients. We also apply PMD with , and $L_0$-SCCA with and to the BLCA data for comparison.
Similarly, $L_0$-SWCCA obtains a larger correlation level (CL) and objective value (OV) than the other methods [(CL, OV): (0.98, 1210) for $L_0$-SWCCA, (0.84, 346) for PMD, (0.86, 469) for $L_0$-SCCA]. More importantly, we also analyze the characteristics of the patients selected by $L_0$-SWCCA. We find that patient survival time differs significantly between the selected 203 patients and the remaining 202 patients, with $p$-value (Fig.4). These results imply that $L_0$-SWCCA can be used to discover BLCA subtype-specific miRNA-gene co-correlation modules.
Furthermore, we also assess whether the genes identified by $L_0$-SWCCA are biologically relevant to BLCA. DAVID (https://david.ncifcrf.gov/) is used to perform Gene Ontology (GO) biological process (BP) and KEGG pathway enrichment analysis. Several significantly enriched GO BPs and pathways related to BLCA are discovered, including GO:0008544:epidermis development (B-H adjusted $p$-value 1.1E-12), hsa00591:Linoleic acid metabolism (B-H adjusted $p$-value 4.8E-3), hsa00590:Arachidonic acid metabolism (B-H adjusted $p$-value 3.6E-3) and hsa00601:Glycosphingolipid biosynthesis-lacto and neolacto series (B-H adjusted $p$-value 2.6E-2). Finally, we also examine whether the miRNAs identified by $L_0$-SWCCA are associated with BLCA. Interestingly, among the 10 identified miRNAs, six (hsa-miR-200a-3p, hsa-miR-200b-5p, hsa-miR-200b-3p, hsa-miR-200a-5p, hsa-miR-200c-3p and hsa-miR-200c-5p) belong to the miR-200 family. Notably, several studies [Wiklund et al.2011, Cheng et al.2016] have also reported that the miR-200 family plays key roles in BLCA. All these results imply that the miRNA-gene module identified by $L_0$-SWCCA may help in finding new therapeutic strategies for BLCA.
5.1 SWCCA with generalized penalties
We first consider a general regularized SWCCA framework as Eq.(23).
where , , are three regularization functions. Depending on the prior knowledge available, we can use different sparsity-inducing penalties.
5.1.1 LASSO regularized SWCCA
If , , and , we obtain an $L_1$ (LASSO) regularized SWCCA ($L_1$-SWCCA). As when solving Eq.(6), we only need to solve the following problem, Eq.(24), in order to solve $L_1$-SWCCA.
where . We first replace the constraint $\|u\|_2 = 1$ with $\|u\|_2 \le 1$ and obtain the following problem as Eq.(25).
It is easy to see that problem (25) is equivalent to (24). Thus, we can obtain its Lagrangian form as Eq.(26).
Thus, we can use a coordinate descent method to minimize Eq.(26) and obtain the following update rule of $u$ as Eq.(27).
where $S_\lambda(\cdot)$ is the soft-thresholding operator and $S_\lambda(z) = \mathrm{sign}(z)(|z| - \lambda)_+$. Based on the above, an alternating iterative strategy can be used to solve $L_1$-SWCCA.
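The soft-thresholding step can be sketched as follows. This is a minimal illustration, assuming the update has the usual PMD-style form in which the soft-thresholded vector is rescaled to unit Euclidean norm. The function names and the symbol `lam` (standing for the Lagrange multiplier $\lambda$) are hypothetical, and choosing $\lambda$ to hit a target sparsity is outside the scope of this sketch.

```python
import numpy as np


def soft_threshold(z, lam):
    """Element-wise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def l1_update(z, lam):
    """Soft-threshold z and rescale to unit norm (assumed form of the L1 update)."""
    s = soft_threshold(z, lam)
    norm = np.linalg.norm(s)
    # If lam is at least max|z_i|, everything is thresholded away; return zeros.
    return s / norm if norm > 0 else s


if __name__ == "__main__":
    print(l1_update([3.0, -1.0, 0.5], 1.0))
```

Unlike the hard-thresholding step used for the $L_0$ penalty, this update shrinks the surviving entries toward zero as well as removing the small ones.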
5.1.2 Group LASSO regularized SWCCA
If , and , problem (24) reduces to the $L_{2,1}$-regularized SWCCA ($L_{2,1}$-SWCCA). Similarly, we need to solve the following projection problem as Eq.(28).
Thus, we obtain its Lagrangian form as Eq.(29).
where is the th group of . We adopt a block-coordinate descent method [Tseng2001] to solve it and obtain the update rule of () as Eq.(30).
By cyclically applying the above updates, we can minimize Eq.(29). Thus, an alternating iterative strategy can be used to solve $L_{2,1}$-SWCCA.
5.2 Multi-view sparse weighted CCA
In various scientific fields, multiple view data (more than two views) can be available from multiple sources or diverse feature subsets. For example, multiple high-throughput molecular profiling data by omics technologies can be produced for the same individuals in bioinformatics [Li et al.2012, Min et al.2015, Sun et al.2015]. Integrating these data together can significantly increase the power of pattern discovery and individual classification.
Here we extend SWCCA to Multi-view SWCCA (MSWCCA) model for multi-view data analysis (Fig.5) as follows:
where . When , we can see that and it reduces to SWCCA. Thus, we can solve MSWCCA in a manner similar to SWCCA.
In this paper, we propose a sparse weighted CCA framework. Compared to SCCA, SWCCA can reveal that the selected variables are strongly related only within a subset of samples. We develop an efficient alternating iterative algorithm to solve the $L_0$-regularized SWCCA. Our tests on both simulated and biological data show that SWCCA obtains more reasonable patterns than typical SCCA methods. Moreover, the key idea of SWCCA is easy to adapt to other penalties such as LASSO and group LASSO. Lastly, we extend SWCCA to MSWCCA for the multi-view setting with multiple data matrices.
Shihua Zhang and Juan Liu are the corresponding authors of this paper. Wenwen Min would like to thank the support of National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, CAS during his visit.
- [Asteris et al.2016] Megasthenis Asteris, Anastasios Kyrillidis, Oluwasanmi Koyejo, and Russell Poldrack. A simple and provable algorithm for sparse diagonal cca. In ICML, pages 1148–1157, 2016.
- [Bolte et al.2014] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
- [Cai et al.2016] Jia Cai, Yi Tang, and Jianjun Wang. Kernel canonical correlation analysis via gradient descent. Neurocomputing, 182:322–331, 2016.
- [Chen and Zhang2016] Jinyu Chen and Shihua Zhang. Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data. Bioinformatics, 32(11):1724–1732, 2016.
- [Chen et al.2012a] Jun Chen, Frederic D Bushman, James D Lewis, Gary D Wu, and Hongzhe Li. Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics, 14(2):244–258, 2012.
- [Chen et al.2012b] Xi Chen, Han Liu, and Jaime G Carbonell. Structured sparse canonical correlation analysis. In AISTATS, pages 199–207, 2012.
- [Cheng et al.2016] Yidong Cheng, Xiaolei Zhang, Peng Li, Chengdi Yang, Jinyuan Tang, Xiaheng Deng, Xiao Yang, Jun Tao, Qiang Lu, and Pengchao Li. Mir-200c promotes bladder cancer cell migration and invasion by directly targeting reck. OncoTargets and therapy, 9:5091–5099, 2016.
- [Chin et al.2006] Koei Chin, Sandy DeVries, Jane Fridlyand, Paul T Spellman, Ritu Roydasgupta, Wen-Lin Kuo, Anna Lapuk, Richard M Neve, Zuwei Qian, Tom Ryder, et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer cell, 10(6):529–541, 2006.
- [Chu et al.2013] Delin Chu, Li-Zhi Liao, Michael K Ng, and Xiaowei Zhang. Sparse canonical correlation analysis: new formulation and algorithm. IEEE transactions on pattern analysis and machine intelligence, 35(12):3050–3065, 2013.
- [Dai et al.2015] Xiaofeng Dai, Ting Li, Zhonghu Bai, Yankun Yang, Xiuxia Liu, Jinling Zhan, and Bozhi Shi. Breast cancer intrinsic subtype classification, clinical use and future trends. American journal of cancer research, 5(10):2929, 2015.
- [Du et al.2016] Lei Du, Heng Huang, Jingwen Yan, Sungeun Kim, Shannon L Risacher, Mark Inlow, Jason H Moore, Andrew J Saykin, Li Shen, and Alzheimer’s Disease Neuroimaging Initiative. Structured sparse canonical correlation analysis for brain imaging genetics: an improved GraphNet method. Bioinformatics, 32(10):1544–1551, 2016.
- [Fang et al.2016] Jian Fang, Dongdong Lin, S Charles Schulz, Zongben Xu, Vince D Calhoun, and Yu-Ping Wang. Joint sparse canonical correlation analysis for detecting differential imaging genetics modules. Bioinformatics, 32(22):3480–3488, 2016.
- [Gao et al.2015] Chao Gao, Zongming Ma, Zhao Ren, Harrison H Zhou, et al. Minimax estimation in sparse canonical correlation analysis. The Annals of Statistics, 43(5):2168–2197, 2015.
- [Hardoon and Shawe-Taylor2011] David R Hardoon and John Shawe-Taylor. Sparse canonical correlation analysis. Machine Learning, 83(3):331–353, 2011.
- [Jacob et al.2009] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440. ACM, 2009.
- [Klami et al.2013] Arto Klami, Seppo Virtanen, and Samuel Kaski. Bayesian canonical correlation analysis. Journal of Machine Learning Research, 14:965–1003, 2013.
- [Lê Cao et al.2009] Kim-Anh Lê Cao, Pascal GP Martin, Christèle Robert-Granié, and Philippe Besse. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC bioinformatics, 10:34, 2009.
- [Li et al.2012] Wenyuan Li, Shihua Zhang, Chun-Chi Liu, and Xianghong Jasmine Zhou. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics, 28(19):2458–2466, 2012.
- [Lin et al.2013] Dongdong Lin, Jigang Zhang, Jingyao Li, Vince D Calhoun, Hong-Wen Deng, and Yu-Ping Wang. Group sparse canonical correlation analysis for genomic data integration. BMC bioinformatics, 14:245, 2013.
- [Min et al.2015] Wenwen Min, Juan Liu, Fei Luo, and Shihua Zhang. A novel two-stage method for identifying microrna-gene regulatory modules in breast cancer. In BIBM, pages 151–156, 2015.
- [Mizutani et al.2012] Sayaka Mizutani, Edouard Pauwels, Véronique Stoven, Susumu Goto, and Yoshihiro Yamanishi. Relating drug–protein interaction network with drug side effects. Bioinformatics, 28(18):i522–i528, 2012.
- [Parkhomenko et al.2009] Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation analysis with application to genomic data integration. Statistical applications in genetics and molecular biology, 8(1:1), 2009.
- [Sun et al.2008] Liang Sun, Shuiwang Ji, and Jieping Ye. A least squares formulation for canonical correlation analysis. In ICML, pages 1024–1031, 2008.
- [Sun et al.2015] Jiangwen Sun, Jin Lu, Tingyang Xu, and Jinbo Bi. Multi-view sparse co-clustering via proximal alternating linearized minimization. In ICML, pages 757–766, 2015.
- [Tseng2001] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494, 2001.
- [Virtanen et al.2011] Seppo Virtanen, Arto Klami, and Samuel Kaski. Bayesian cca via group sparsity. In ICML, pages 457–464, 2011.
- [Wang et al.2017] Caihua Wang, Juan Liu, Wenwen Min, and Aiping Qu. A novel sparse penalty for singular value decomposition. Chinese Journal of Electronics, 26(2):306–312, 2017.
- [Wiklund et al.2011] Erik D Wiklund, Jesper B Bramsen, Toby Hulf, Lars Dyrskjøt, Ramshanker Ramanathan, Thomas B Hansen, Sune B Villadsen, Shan Gao, Marie S Ostenfeld, Michael Borre, et al. Coordinated epigenetic repression of the mir-200 family and mir-205 in invasive bladder cancer. International journal of cancer, 128(6):1327–1334, 2011.
- [Witten and Tibshirani2009] Daniela M Witten and Robert J Tibshirani. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical applications in genetics and molecular biology, 8(1:28), 2009.
- [Witten et al.2009] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
- [Yang et al.2017] Xinghao Yang, Weifeng Liu, Dapeng Tao, and Jun Cheng. Canonical correlation analysis networks for two-view image recognition. Information Sciences, 385:338–352, 2017.
- [Yoshida et al.] Kosuke Yoshida, Junichiro Yoshimoto, and Kenji Doya. Sparse kernel canonical correlation analysis for discovery of nonlinear interactions in high-dimensional data. BMC bioinformatics, 18:108.
- [Zhang et al.2011] Shihua Zhang, Qingjiao Li, Juan Liu, and Xianghong Jasmine Zhou. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microrna-gene regulatory modules. Bioinformatics, 27(13):i401–i409, 2011.