Probabilistic Canonical Correlation Analysis for Sparse Count Data

by   Lin Qiu, et al.
Penn State University

Canonical correlation analysis (CCA) is a classical and important multivariate technique for exploring the relationship between two sets of continuous variables. CCA has applications in many fields, such as genomics and neuroimaging. It can extract meaningful features as well as use these features for subsequent analysis. Although some sparse CCA methods have been developed to deal with high-dimensional problems, they are designed specifically for continuous data and do not consider the integer-valued data from next-generation sequencing platforms that exhibit very low counts for some important features. We propose a model-based probabilistic approach for correlation and canonical correlation estimation for two sparse count data sets (PSCCA). PSCCA demonstrates that correlations and canonical correlations estimated at the natural parameter level are more appropriate than traditional estimation methods applied to the raw data. We demonstrate through simulation studies that PSCCA outperforms other standard correlation approaches and sparse CCA approaches in estimating the true correlations and canonical correlations at the natural parameter level. We further apply the PSCCA method to study the association of miRNA and mRNA expression data sets from a squamous cell lung cancer study, finding that PSCCA can uncover a large number of strongly correlated pairs than standard correlation and other sparse CCA approaches.



There are no comments yet.


page 24


Sparse semiparametric canonical correlation analysis for data of mixed types

Canonical correlation analysis investigates linear relationships between...

Canonical Autocorrelation Analysis

We present an extension of sparse Canonical Correlation Analysis (CCA) d...

Sparse Additive Functional and Kernel CCA

Canonical Correlation Analysis (CCA) is a classical tool for finding cor...

Common Information Components Analysis

We give an information-theoretic interpretation of Canonical Correlation...

Canonical Correlation Analysis in high dimensions with structured regularization

Canonical Correlation Analysis (CCA) is a technique for measuring the as...

Conditional canonical correlation estimation based on covariates with random forests

Investigating the relationships between two sets of variables helps to u...

BLOCCS: Block Sparse Canonical Correlation Analysis With Application To Interpretable Omics Integration

We introduce Block Sparse Canonical Correlation Analysis which estimates...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recent advancements in next-generation sequencing technology have enabled the measurement of multiple high-dimensional data types in a single study, such as genomics, transcriptomics, epigenomics, and metabolomics. Integrative analysis of high-dimensional omics data is becoming increasingly important and popular. It has been shown that combining multiple omics data types can improve analysis and lead to biologically more meaningful results for complex diseases

(safo18; Lu19).

Omics datasets have three main characteristics that pose modeling challenges: (1) they are high-dimensional data, with a large number of variables and small sample size ; (2) the raw data represent count variables which violate the distributional assumptions for standard correlation and canonical correlation analysis, which can lead to invalid inference in the presence of a small sample size; (3) the data are very sparse, with a large proportion of these counts being very close to zero and having random missing values.

1.1 Cca

Canonical correlation analysis (CCA) is a classical multivariate method proposed by Hotelling36

for exploring the relationship between two sets of variables. Consider two random vectors

X, Y. Define = cov(X), = cov(Y), and = cov(X,Y). CCA finds canonical correlation directions (,) that maximize the correlation between and , where the (,) are the linear combinations of X and Y canonical variables. Formally, we can write the first pair of canonical variables as

() = { b: a=1, b=1},

Then the optimization can be attained by applying the singular value decomposition (SVD) and replacing

with their sample estimates . However, in a high-dimensional setting, when the dimensions n, the SVD approach is not applicable because and are not invertible.

1.2 Related work

Motivated by genomics, neuroimaging and other applications, researchers have been working on generalizing CCA to accommodate high dimensions, usually called sparse CCA (Witten09; Avants10; Hardoon11; Gao17). These methods impose sparsity constraints on the canonical directions which effectively can reduce the dimensionality and improve the interpretation of the correlations. Penalized matrix decomposition (PMD) (Witten09)is one of the most popular sparse CCA methods, which uses the penalized matrix decomposition to replace and with identity matrices to avoid singularities. By doing so, PMD can obtain sparse estimates of the canonical directions by penalization. However, PMD may perform poorly on data sets when and are far from diagonal. With respect to genomics data, for example, genes usually have strong correlations among them.

The probabilistic interpretation of CCA was initiated by (Bach06). Later on, several Bayesian versions of CCA were developed (Archambeau08; Virtanen11; Klami13). One of the key promising features of Bayesian CCA is that it enables analysis of high-dimensional data in life sciences (Fujiwara09; Huopaniemi10)

. However, these methods assume the data to follow normal distributions. Thus, the aforementioned Bayesian methods may not work well for non-normally distributed data. PCAN is the first approach that describes a Bayesian correlation analysis method for count data

(Zoh16), in which the correlations are estimated based on the latent weights from the natural parameters of the data generating model, rather than the correlations among the counts. In the latent variable model, priors or strong regularizations are used on the latent weights to induce sparsity West03.

The rise of big datasets with few signals, such as omics datasets, has spurred the study of sparse models. From a Bayesian perspective, discrete mixtures (George93) and shrinkage priors (Tipping01) are the two main sparse estimation methods. In latent variable models, Bayesian shrinkage priors are popular because of their flexible and interpretable solutions (Carvalho08; Knowles11; Bhattacharya11). The spike-and-slab prior is a mixture of a point mass at zero and a flat distribution across the space of real values, in that the excluded loadings are modeled by the "spike" distribution, while the included loadings are modeled by the "slab" distribution (Carvalho08). The disadvantages are that the results can be sensitive to prior choices and it is computationally demanding for posterior inference with a large number of variables due to a huge model space. Scale mixtures of normal priors have been proposed recently as a computationally efficient alternative to the two component spike-and-slab prior (Armagan13; Bhattacharya14)

. These types of priors usually assume normal distributions with a mixed variance term and the mixing variance distribution enables strong shrinkage close to zero. For example, Bayesian canonical correlation analysis (BCCA)

(Klami13) consists of applying an automatic relevance determination (ARD) (Neal96)

prior for the latent weights which is a Normal-gamma prior that imposes an inverse gamma distribution on the variance term. The horseshoe prior is popular due to its good performance in simulations and under theoretical study, which has shown comparable performance to the spike-and-slab prior in a variety of problems where a sparse prior is desirable

(Carvalho08; Carvalho10; Polson11).The horseshoe prior is a scale mixture of normals, with a product of half-Cauchy priors on the variance. It is given by


“global hyperparameter

can shrink all the parameters toward zero, especially if its domain is restricted to a finite interval, while the heavy-tailed half-Cauchy local priors allow some parameters to escape. Different levels of sparsity can be accommodated by changing the value of : the large will have little shrinkage, while small will shrink all the weights to zero. Despite the good performance, there are two shortcomings for the horseshoe prior. First, how to perform inference for the “global hyperparameter” which determines the overall sparsity in the parameter vector is not fully answered yet. Second, parameters far from zero will not be regularized at all. Quite a few researchers have investigated the impact of concerning the resulting posterior distribution both for recovery and for uncertainty quantification, either in a deterministic way or a hierarchical full Bayes approach (Carvalho08; Datta14; Pas14; Pas17). We take the second shortcoming as the key strength of this prior and to incorporate it with latent variable model to infer the feature sparsity jointly. For an omics data set we assume only important variables are strongly identified and the parameters far from zero will not be regularized.

1.3 Our contribution

In this study, we propose a new probabilistic framework of CCA for sparse count data, which we label a probabilistic sparse canonical correlation analysis (PSCCA). Our work contributes several important advances. First, we propose to estimate the canonical correlations at the natural parameter level for data expressed as raw counts, which is lacking in omics analyses. Second, we provide a theoretical justification for estimating the correlations and canonical correlations based on the natural parameters rather than based on the raw data. The former are larger in magnitude than the latter, which is very meaningful for CCA. Because CCA is an exploratory analytical method, larger values of the canonical correlations yield less chance to miss the true correlation pairs. Third, the horseshoe prior is widely studied in the literature, via both simulation studies and theoretical research. Nevertheless, we do not see many examples in applications. We formulate the natural parameters as a latent variable model, and we invoke the horseshoe prior for the latent weight to model the sparsity. To better extract the sparse signals we assume for the “global hyperparameter”. As discussed in Piironen17, this prior results in sensible inference only when is strongly identified by the data. Our simulation study and real data applications show that our approach performs better than existing methods. Lastly, our approach is built on an exponential family and can be easily extended to other formats of data.

The rest of the article is organized as follows. Section 2 contains our model details and inference. Section 3 discusses the theoretical results and Section 4 describes simulation studies. Section 5 presents the real data application. Finally, Section 6 contains a discussion and future directions of PSCCA.

2 Method

2.1 Model

Let be a distribution function from the natural parameter exponential family. The random component of a generalized linear model consists of a response vector

which has a conditional distribution in the exponential family. This family has probability density function or mass function of the form

. The value of the parameter may vary for depending on values of the explanatory variables. The term is the natural parameter; and

are non-negative functions that distinguish one member of the exponential family from another. For our case, assume we have two sets of multivariate random variables,

, and . The observed data samples are expressed as with observations, where is 1 or 2. Let represent the observed value of the individual for the feature (variable) in a set of measured features (variables).

We motivate our formulation in the latent variable interpretation of CCA (Bach06) to model the natural parameters and the ideas from PCAN (Zoh et al., 2016) for the correlation estimation from the natural parameters. We assume each individual data vector follows conditionally an exponential family distribution and here we consider a generalized linear model. The generative model for coupled natural parameters with =1,2 and is


We write the model as a function of latent variables by concatenating the features into the vector


The matrix denotes the loading matrix associated with the latent vector ; denotes the identity matrix; the parameter vector represents the mean of the natural parameters associated with the feature in the data sets ; and is an independently distributed error term, with denoting a normal distribution with null mean and variance . The core generative process is the unobserved shared latent variables , which are transformed via linear mappings to the observation spaces, and can capture the variation common to both data sets and allow for dependency between variables in a specific data set.

We impose horseshoe priors on matrix , let denote the ith row vector of . Then we assume that:


We refer to the as the local shrinkage parameters and to as the global shrinkage parameters. Let

denote the half-Cauchy distribution. The half-Cauchy prior for the local shrinkage parameter

has shown good performance (Carvalho08; Carvalho10). There has been a vast amount of research on how to choose the prior for the global hyperparameter which plays an important role in overall sparsity in the parameter matrix . As discussed in Introduction section, we choose the full Bayesian specification for . Thus, we assume:


For i =1,…,; k=1,…,; and j=1,…,N, we construct and . The vector has a multivariate normal distribution with mean
and covariance matrix


where and . The correlation between and , for any sample can be obtained as


For the canonical correlations, let denote the square-root decomposition of the positive definite matrix . Let

, then the nonnull eigenvalues of

R correspond to the squared canonical coefficients for the natural parameters. Inference on the correlations and canonical correlations will be based on the marginal posterior distribution of , , , and .

2.2 Identifiability and Prior

The latent variable model (Equation 1) is identifiable up to orthonormal rotations, for any invertible with . Then and will produce the same estimate of the data covariance matrix in Equation (5) and has an equal likelihood. Zoh16 recommend imposing a lower triangular structure for the matrices , following the work of (Geweke96), and further require that the diagonal elements are non-negative to remove the non-identifiability related to the sign. This approach relies on the choice of based on the dimensions of W, but we assume the value of will be small so that the impact is negligible Lopes04.


where for , and for , . 1 is an indicator function, 1(W)=1

if W is true and 0 otherwise. We assume conjugate priors for the remaining parameters in the model as


The hyperparameters , , and are known.

2.3 Inference

The form of the full conditional posterior distribution is proportional to the product of (1) the joint conditional likelihood for the data matrices and and (2) the prior distributions:


The posterior distribution in (9) is difficult to directly simulate. We update the parameters in a Markov chain Monte Carlo (MCMC).

3 Theoretical Results

Let and be distribution functions from the natural parameter exponential family as we discussed in the model section. We model and , and we denote and , , as two sets of count random variables. The natural parameters and have the following expressions:


where is and is

. We further assume the following independent probability distributions:

,, , .

Theorem 1. We define the unconditional variance-covariance of and as . Then we have the correlation coefficients, , constructed from , where , , and . The canonical correlation coefficients , where . The detailed proof is in the Appendix.

Corollary 1. Correlation coefficients and canonical correlation coefficients calculated from the raw count data X and Y will be smaller numerically in magnitude than the correlation coefficients and canonical correlation coefficients calculated from the natural parameters and .

4 Simulation

4.1 Settings

In this section we conduct simulations to assess the performance of SPCCA in comparison with several existing methods that have been proposed for probabilistic correlation analysis (PCAN), Bayesian CCA (BCCA), and sparse CCA (PMD), as mentioned in the Introduction. First, we evaluate the performance of SPCCA on correlation analysis, comparing with PCAN because the PCAN paper already demonstrated that it outperforms traditional Spearman and Pearson correlation methods. Second, we compare SPCCA’s performance on the canonical correlation analysis with that of BCCA, PMD, and the modified PCAN, which we name .

To evaluate the methods, let , be the true generated loading matrices. For the all methods with estimates , , we can calculate the correlations and the canonical correlations according to Equations (5) and (6). Let be the true matrix and

be the estimated matrix. Then we construct the Frobenius loss function as

, assuming .

Scenario I: Correlation analysis– In the first scenario, we simulated 100 datasets assuming for each dataset , , and subjects. The weight matrices are and . We consider three correlation matrices for the natural parameters:
(a) the identity correlation matrix assuming the true for and .
(b) a correlation matrix obtained assuming for and .
(c) a correlation matrix obtained assuming for and .

We fit the PSCCA and PCAN to each of these 100 datasets assuming different dimensions of to compute the posterior mean correlation matrices.

Mean 95%CI Mean 95%CI
0 2 25.21 (24.35, 26.54) 28.57 (28.31, 29.20)
0 5 19.55 (19.20, 20.49) 23.30 (21.85, 24.95)
0 10 14.12 (13.23, 14.83) 21.26 (19.66, 22.17)
5 2 4.25 (4.05, 4.34) 22.51 (21.26, 22.87)
5 5 3.89 (3.60, 4.24) 19.88 (18.72, 20.63)
5 10 4.54 (4.21, 4.91) 20.35 (19.45, 22.44)

2 10.31 (9.41, 12.21) 15.55 (14.68, 16.31)
10 5 8.81 (8.26, 9.35) 14.62 (13.34, 15.21)
10 10 7.85 (7.54, 8.06) 13.27 (12.82, 13.51)

Table 1: Summary of the Frobenius loss when estimating the true correlation structure for the natural parameters from the PCAN and our PSCCA model. Here, is the value of assumed for the true correlation matrix; represents the value assumed when fitting the model. Frobenius losses are calculated between the true correlation matrix at the natural parameter level vs the posterior mean correlation estimated based on the posterior of , and the other parameters.

Scenario II: Canonical correlation analysis– In the second scenario, we simulated 100 datasets, for each dataset we set , under moderate and high dimensions of . We use three models for the correlation matrices of the the natural parameters .
Model I (Independent covariances): there is no covariance structure within each of the natural parameters .
Model II (Identity covariances): ,
Model III (Moderate covariances): = 0.5, = 0.5
(1) Moderate dimensions: = 60, 100, 300
(2) High dimensions: = 500, 1000, 2000

Model I is used in PCAN (Zoh16), and similar models of Model II and III have been used to generate the raw data in (Gao17). We fit PSCCA, , PMD, and BCCA to the datasets simulated under the above scenario and compute the canonical correlation matrices for the purpose of comparison.

4.2 Results

We report the Frobenius loss in (Table 1) and Stein loss (Supplementary Table 1) for each of the estimated correlation matrices. Stein loss is defined as for estimating the matrix V and the matrix U.

Overall, under each of the scenarios, PSCCA yields a smaller Frobenius loss and a smaller Stein loss compared to PCAN, and both methods yield much smaller Frobenius losses compared to Stein losses. We found that under the true =5, 10, when the assumed is closer to the truth, PSCCA and PCAN result in smaller Frobenius losses, and smaller losses are preferred. However, when we observe the opposite situation in that the closest value to the truth when =2 yields the largest Frobenius loss for PSCCA and PCAN.

We also estimate the correlations using other standard correlation estimation methods. Because we assumed , other standard correlation estimate methods are valid. We report the summary of the Frobenius loss incurred with estimating the true correlation matrices using Pearson and Spearman approaches based on the raw data in Supplementary Table 2. PSCCA resulted in a smaller Frobenius loss, whereas PCAN and Spearman correlations perform similarly under the true = 0, 10, which is consistent with the results in PCAN paper (Zoh16).

For the canonical correlation analysis comparison, we modified the method of PCAN based on Equations (5) and (6) to render it as an alternative approach for a probabilistic canonical correlation analysis method, which we named . We report the summary of the results compared with , BCCA, and PMD under Models I-III and moderate and high dimensions of in Table 2.

max width=1.1center Mode I Mode II Model III Model I Model II Model III Model I Model II Model III PSCCA 0.8178 1.102 0.8706 1.142 1.287 1.082 1.263 1.124 1.349 (0.1321) (0.075) (0.0191) (0.037) (0.1083) (0.0652) (0.014) (0.0152) (0.0352) PCAN* 0.1258 1.115 1.135 1.597 1.347 1.272 1.594 1.295 1.726 (0.1459) (0.0649) (0.0102) (0.1710) (0.028) (0.0687) (0.022) (0.0159) (0.0913) PMD 0.1246 1.094 1.569 1.445 1.293 1.450 1.495 1.135 1.742 (0.049) (0.0344) (0.0342) (0.033) (0.0253) (0.4562) (0.009) (0.0235) (0.4289) BCCA 1.436 1.181 1.035 1.328 1.362 1.156 1.292 1.7997 1.543 (0.2462) (0.2739) (0.2268) (0.3551) (0.281) (0.1596) (0.276) (0.0197) (0.2039) Mode I Mode II Model III Model I Model II Model III Model I Model II Model III PSCCA 1.124 1.307 1.381 1.448 1.265 1.611 1.220 1.532 1.542 (0.0076) (0.0570) (0.0158) (0.0014) (0.0104) (0.058) (0.0020) (0.0368) (0.0267) PCAN* 1.737 1.407 1.547 1.606 1.842 1.661 1.609 1.842 1.589 (0.0138) (0.0323) (0.0152) (0.0043) (0.0302) (0.0934) (0.0017) (0.0302) (0.0640) PMD 1.471 1.375 1.681 1.518 1.684 1.694 1.263 1.591 1.783 (0.0037) (0.0121) (0.5264) (0.0043) (0.5903) (0.4157) (0.0050) (0.4013) (0.4106) BCCA 1.332 1.497 1.589 1.603 1.717 1.788 1.391 1.669 1.969 (0.4000) (0.4430) (0.2452) (0.3021) (0.1845) (0.2262) (0.3900) (0.2331) (0.3903)

Table 2: Simulation results for Models I-III under moderate and high dimensions of

. The reported numbers are the medians and standard errors (in parentheses) of canonical correlation’s Frobenius loss over 100 replicates.

Moderate dimensions. PSCCA uniformly outperforms the three competitors. The estimator of PSCCA is closer to the truth than the estimates given by , PMD, and BCCA. It also is worth noting that under Model II, PMD performs similarly with PSCCA because we generated the natural parameters with identity variance matrices. However, under Model I and Model III, PMD performs poorly. This confirms that methods with no assumptions on the variance matrices have broader applicability. Another point worth noting is that BCCA produces the largest standard errors compared to other methods, which indicates very unstable estimation for count data.

High dimensions. PSCCA continues to outperform the competitors when the dimensions are very high. PMD still displays similar performance with PSCCA under Model II. This suggests when the identity variance assumption holds, the performance of PMD can be improved. when the dimension exceeds 1000 under Model II, however, PMD displays very large standard errors. Under Model III, performs nearly as well as PSCCA, which indicates there exists a moderate level of correlation and the canonical correlation estimated from the natural parameters is closer to the truth. BCCA still has large standard errors compared to other methods, which indicates it is not a proper method for count data.

5 Real Data Analysis

The Cancer Genome Atlas (TCGA) ( was initiated in 2006 to develop a publicly accessible infrastructure data on an increasing number of well-characterized cancer genomes. TCGA finalized tissue collection with matched tumor and normal tissues from 11,000 patients with 33 cancer types and subtypes, including 10 rare types of cancer. TCGA data has been used to characterize key genomic changes, find novel mutations, define intrinsic tumor types, discover similarities and differences across cancer types, reveal therapy resistance mechanisms, and collect tumor evolution evidence (Tomczak15).

MicroRNAs (miRNA) are very short non-coding RNAs that regulate gene expression at the post-transcriptional level. They bind to mRNAs and inhibit translation or induce mRNA degradation. There are many studies that demonstrate negative correlations in the expression of specific miRNAs and their corresponding target mRNAs, and their interaction in many disease-related regulatory pathways is well established (Ruike08; Wang09; Shah11). In recent years, there have been numerous studies about miRNA and mRNA correlation analysis on different cancer diseases using TCGA data (Ding19; Yu19). However, these correlation analyses all are based on the normalized continuous data, not the raw count data.

In our analysis, we consider the read count next generation sequencing (NGS) expression data from squamous cell lung cancer (LUSC). We downloaded the LUSC dataset from TCGA data portal. LUSC has 504 samples, and we processed the tumor miRNA and mRNA data according to TCGAbiolinks (Colaprico16). Each sample contains 1,881 miRNAs and 56,537 mRNAs. Firstly, we are interested in the correlation analysis between low-expressed mRNAs and a given set of miRNAs. We consider matched miRNA and mRNA samples, and we select miRNAs and mRNAs. For miRNAs we choose some reported with high correlations with mRNAs in PCAN (Zoh16), and among 60 mRNAs we choose 30 mRNAs with average counts between 1 and 2, and the remaining 30 mRNAs are the significant expressed genes reported by (Shah11).

Figure 1: PSCCA heatmap of the posterior mean correlation estimates between the miRNA and mRNA under . Red color indicates the positive correlation and blue color indicates the negative correlation.

We fit the PSCCA model using the priors in Appendix A under . We ran two separate MCMC chains for 10,000 iterations, and monitored them for proper mixing. The first 5,000 iterations were discarded as burn-in and the inference was based on the 10,000 remaining iterations. We estimate the correlations between miRNA and mRNA based on the posterior mean values of natural parameters, and we also report results based on the standard correlation estimation approaches (Spearman and Pearson) applied to the raw data for comparison. We display, as a heatmap, the posterior mean estimates of each correlation in Figure 1.

The correlation results identify very interesting miRNA-mRNA interactions. PSCCA demonstrated the highest power to select the potentially correct miRNA-mRNA interactions. Our results show that miR-539 is negatively correlated with genes , , , and . The gene encoding miR-539 is located on human chromosome 4q32.31, and miR-539 has been reported to be down-regulated in many human cancers, including prostate cancer, nasopharyngeal carcinoma and thyroid cancer, and miR-539 has been reported to play a tumor suppression role in many human malignancies (Guo18). Through TargetScanHuman we confirmed that is the target gene of miR-539, and all the methods estimated the correct negative correlation direction. However, PSCCA has the lowest estimated correlation value -0.3102 between miR-539 and compared to -0.1054 in PCAN, -0.1149 in Spearman, and -0.1218 in Pearson. Meanwhile, from TargetScanHuman, we noticed that miR-539 also regulates , which are the same protein family of , and , respectively. Thus, our findings might add new members for the target gene family of miR-539, and provide more clues for miR-539’s regulation role in lung cancer. Another interesting miRNA is miR-205 which was reported to play a dual role, depending on the specific tumor type and target genes (Nordby17), we found it to be negatively correlated with , , , and in our study. is the target validated through TargetScanHuman, and the estimated correlation value between and miR-205 in PSCCA is -0.2875 compared to -0.0896 in PCAN, -0.00617 in Spearman, and -0.01743 in Pearson. The same situation occurred for the mir-338 and TP63 interaction in which all the methods estimated the correct negative correlation directions, but PSCCA has the most extreme negative value compared to other three methods. Here, we report a few interesting miRNA and their estimated correlation with mRNAs in Figure 2(a). From Figure 2(a), we can see that for the same pair of miRNA-mRNA that our PSCCA can estimate the most extreme correlation values among all the other methods.

Figure 2: Correlation estimates (a) and Venn diagram summarising the positive correlation greater than 0.25 (b) and negative correlation less than -0.25 (c) under PSCCA, PCAN, Spearman, and Pearson.

Figure 2(b) and Figure 2(c) show the venndiagram depicting the overlap between the pairs miRNA-mRNA correlation idenfied by PSCCA, PCAN, Spearman and Pearson approaches. Overall, there is very few pairs jointly identified by four approaches, PSCCA identified 59 negative correlation pairs with estimated correlation values less than -0.25, compared to 36 in PCAN, 9 in Spearman, and 2 in Pearson. For positive correlations, PSCCA identified 319 pairs with estimated correlation values larger than 0.25, compared to 28 in PCAN, 27 in Pearson, and 21 in Spearman. These results suggest PSCCA can identify more extreme negative and positive correlation estimates in sparse count data, and thus have less chance to miss the true correlation pairs for high-dimensional exploratory analysis. For detailed results please check the Supplementary material.

For canonical correlation analysis, we modified PCAN to calculate the canonical correlation from the natural parameters, termed . We apply PSCCA, , PMD, and BCCA on the same LUSC dataset as above for sample size , miRNAs, and mRNAs. Here, for ease of presentation, we focus only on the first two canonical correlations. We fit all the models on , that is, the number of canonical vectors to be obtained. For PMD, we use equal tuning parameters , in which the tuning parameter is chosen by the function CCA.permute in the R package PMA. For BCCA, we use the default settings for initial parameter values. PSCCA and yield high canonical correlations, while PMD and BCCA do not perform very well, both yielding small canonical correlations. In addition, under has slightly larger canonical correlation values than PSCCA, which might be because the small cannot capture the variance about the truth. To check the reasons for poor performance of PMD and BCCA, we found that the variance matrices of the two data sets X and Y are distant from identity matrices, which severely violates the identity variance assumption imposed by PMD. Also, the data are very sparse because we selected 30 out of 60 mRNAs with the average count between 1 and 2 and the remaining 30 mRNAs have large count values which severely violates the standard normal distribution assumption on the two data sets made by BCCA. PMD and BCCA still can run on this low-dimensional dataset; however, when we apply those methods on the miRNA and mRNA data sets with high dimensions (, PMD was shut down directly, while BCCA yields the first canonical correlation value of 0.4174 which is far less than PSCCA 0.95 under . That again indicates that for high-dimensional sparse count data in genomics, efficient and more accurate canonical correlation methods are needed. Here, we display only the results on the low dimension dataset (Table 3).

d=10 d=5 d=2
1st 2nd 1st 2nd 1st 2nd
PSCCA 0.9759 0.9229 0.8732 0.8245 0.7979 0.6791
0.8884 0.8528 0.8343 0.7733 0.8188 0.7179
PMD 0.5858 0.5368 0.5858 0.5368 0.5858 0.5147
BCCA 0.2624 0.2550 0.1981 0.1786 0.1608 0.0069
Table 3: Canonical correlation results on real data. is the value of assumed for the true correlation matrix. We report the 1st and 2nd canonical correlations.

6 Discussion and Future Work

We proposed a probabilistic approach of correlation and canonical correlation analysis for sparse count data. PSCCA is a model-based approach to estimate correlations and canonical correlations at the natural parameter level rather than at the raw data level. Both the simulation study results and the real data application indicate that PSCCA compares favorably to existing methods.

We provided a theoretical justification to prove that correlation coefficients and canonical correlation coefficients calculated from the raw count data X and Y will be smaller in magnitude than the correlation coefficients and canonical correlation coefficients calculated from the natural parameters and in section 3. And this explains why PSCCA achieves more extreme correlation and canonical correlation estimations in real data application. Meanwhile, we demonstrate that horseshoe prior can handle the sparsity very well and this, probably, is due to the horseshoe prior not regularizing the parameters far from zero, which is very important in extracting the important variables that only strongly identified in the NGS data.

As the demand increases for integrative high-dimensional complex data analysis, PSCCA is a linear method which may not be appropriate for fitting the complicated nonlinear situations. Recently, researchers in computer science and engineering developed deep CCA (Benton19)

to extract the nonlinear associations and extended it to multiple views. However, deep CCA benefits from the expressive power of deep neural networks, which have a black box drawback in that they are not easy to interpret and understand. PSCCA is a model-based approach which estimates the dependency within and between two datasets as a joint task, thus, PSCCA is more interpretable. One direction for our work is to develop a probabilistic deep CCA to handle more complex data structures and provide interpretable results. Another possible extension of our model is to build a longitudinal framework to extract the dependency relationship between two or multiple co-occurring time series data.


The authors thank the discussions with Dr.Roger S.Zoh about PCAN during the early stage of the PSCCA study design and anonymous referee for very useful comments that improved the presentation of the paper.