1 Introduction
Recent advancements in next-generation sequencing technology have enabled the measurement of multiple high-dimensional data types in a single study, such as genomics, transcriptomics, epigenomics, and metabolomics. Integrative analysis of high-dimensional omics data is becoming increasingly important and popular. It has been shown that combining multiple omics data types can improve analysis and lead to biologically more meaningful results for complex diseases
(safo18; Lu19). Omics datasets have three main characteristics that pose modeling challenges: (1) they are high-dimensional, with a large number of variables and a small sample size; (2) the raw data are count variables, which violate the distributional assumptions of standard correlation and canonical correlation analysis and can lead to invalid inference when the sample size is small; (3) the data are very sparse, with a large proportion of counts at or near zero and random missing values.
1.1 CCA
Canonical correlation analysis (CCA) is a classical multivariate method proposed by Hotelling36
for exploring the relationship between two sets of variables. Consider two random vectors
$X \in \mathbb{R}^{p}$ and $Y \in \mathbb{R}^{q}$. Define $\Sigma_{XX} = \mathrm{cov}(X)$, $\Sigma_{YY} = \mathrm{cov}(Y)$, and $\Sigma_{XY} = \mathrm{cov}(X, Y)$. CCA finds canonical direction pairs $(a, b)$ that maximize the correlation between the canonical variables $a^\top X$ and $b^\top Y$, which are linear combinations of $X$ and $Y$. Formally, the first pair of canonical directions is
$$(a_1, b_1) = \arg\max_{a, b} \left\{\, a^\top \Sigma_{XY} b \;:\; a^\top \Sigma_{XX} a = 1,\; b^\top \Sigma_{YY} b = 1 \,\right\}.$$
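This optimization has a closed-form solution via whitening and a singular value decomposition. Below is a minimal numerical sketch (our own illustration, not from the paper; the function name and the simulated toy data are arbitrary):

```python
import numpy as np

def classical_cca(X, Y):
    """Classical CCA via an SVD of the whitened cross-covariance.

    X is (n, p), Y is (n, q), with n large enough that the sample
    covariance matrices are invertible (the low-dimensional setting).
    Returns the canonical correlations and direction matrices (A, B).
    """
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse symmetric square root via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, in decreasing order.
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    return rho, Wx @ U, Wy @ Vt.T

# Toy data sharing one latent signal, so the first canonical
# correlation should be large.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(500, 3))
Y = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(500, 4))
rho, A, B = classical_cca(X, Y)
```

The correlation between the first pair of canonical variates, computed directly from the data, matches the first singular value.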
The optimization can then be solved by applying the singular value decomposition (SVD) after replacing $\Sigma_{XX}$, $\Sigma_{YY}$, and $\Sigma_{XY}$ with their sample estimates $\hat\Sigma_{XX}$, $\hat\Sigma_{YY}$, and $\hat\Sigma_{XY}$. However, in a high-dimensional setting, when the dimensions exceed the sample size $n$, the SVD approach is not applicable because $\hat\Sigma_{XX}$ and $\hat\Sigma_{YY}$ are not invertible.
1.2 Related work
Motivated by genomics, neuroimaging, and other applications, researchers have been working on generalizing CCA to accommodate high dimensions, usually called sparse CCA (Witten09; Avants10; Hardoon11; Gao17). These methods impose sparsity constraints on the canonical directions, which effectively reduces the dimensionality and improves the interpretability of the correlations. Penalized matrix decomposition (PMD) (Witten09) is one of the most popular sparse CCA methods; it replaces $\Sigma_{XX}$ and $\Sigma_{YY}$ with identity matrices to avoid singularities and obtains sparse estimates of the canonical directions by penalization. However, PMD may perform poorly on data sets where $\Sigma_{XX}$ and $\Sigma_{YY}$ are far from diagonal. In genomics data, for example, genes usually have strong correlations among them.
The probabilistic interpretation of CCA was initiated by Bach06. Later, several Bayesian versions of CCA were developed (Archambeau08; Virtanen11; Klami13). One key promising feature of Bayesian CCA is that it enables the analysis of high-dimensional data in the life sciences (Fujiwara09; Huopaniemi10). However, these methods assume the data follow normal distributions; thus, the aforementioned Bayesian methods may not work well for non-normally distributed data. PCAN is the first approach that describes a Bayesian correlation analysis method for count data
(Zoh16), in which the correlations are estimated based on the latent weights from the natural parameters of the data-generating model, rather than on the correlations among the counts. In latent variable models, priors or strong regularizations are used on the latent weights to induce sparsity (West03). The rise of big datasets with few signals, such as omics datasets, has spurred the study of sparse models. From a Bayesian perspective, discrete mixtures (George93) and shrinkage priors (Tipping01) are the two main sparse estimation approaches. In latent variable models, Bayesian shrinkage priors are popular because of their flexible and interpretable solutions (Carvalho08; Knowles11; Bhattacharya11). The spike-and-slab prior is a mixture of a point mass at zero and a flat distribution across the space of real values: excluded loadings are modeled by the "spike" distribution, while included loadings are modeled by the "slab" distribution (Carvalho08). Its disadvantages are that the results can be sensitive to prior choices and that posterior inference is computationally demanding for a large number of variables due to the huge model space. Scale mixtures of normal priors have recently been proposed as a computationally efficient alternative to the two-component spike-and-slab prior (Armagan13; Bhattacharya14)
. These priors usually assume a normal distribution with a mixed variance term, where the mixing distribution on the variance enables strong shrinkage close to zero. For example, Bayesian canonical correlation analysis (BCCA)
(Klami13) applies an automatic relevance determination (ARD) (Neal96) prior to the latent weights, a Normal-gamma prior that imposes an inverse gamma distribution on the variance term. The horseshoe prior is popular due to its good performance in simulations and theoretical studies, showing performance comparable to the spike-and-slab prior in a variety of problems where a sparse prior is desirable
(Carvalho08; Carvalho10; Polson11). The horseshoe prior is a scale mixture of normals, with half-Cauchy priors on the scales. It is given by
$$w_i \mid \lambda_i, \tau \sim N(0, \lambda_i^2 \tau^2), \qquad \lambda_i \sim C^{+}(0, 1),$$
where the "global hyperparameter" $\tau$ can shrink all the parameters toward zero, especially if its domain is restricted to a finite interval, while the heavy-tailed half-Cauchy local priors $\lambda_i$ allow some parameters to escape the shrinkage. Different levels of sparsity can be accommodated by changing the value of $\tau$: a large $\tau$ yields little shrinkage, while a small $\tau$ shrinks all the weights toward zero. Despite its good performance, the horseshoe prior has two shortcomings. First, how to perform inference for the global hyperparameter $\tau$, which determines the overall sparsity in the parameter vector, is not yet fully answered. Second, parameters far from zero are essentially not regularized at all. Quite a few researchers have investigated the impact of $\tau$ on the resulting posterior distribution, both for recovery and for uncertainty quantification, in either a deterministic or a hierarchical full-Bayes approach (Carvalho08; Datta14; Pas14; Pas17). We take the second shortcoming as the key strength of this prior and incorporate it into a latent variable model to infer the feature sparsity jointly. For an omics data set we assume that only important variables are strongly identified, so parameters far from zero should not be regularized.
1.3 Our contribution
In this study, we propose a new probabilistic framework of CCA for sparse count data, which we call probabilistic sparse canonical correlation analysis (PSCCA). Our work contributes several important advances. First, we propose to estimate the canonical correlations at the natural-parameter level for data expressed as raw counts, which is lacking in omics analyses. Second, we provide a theoretical justification for estimating the correlations and canonical correlations based on the natural parameters rather than on the raw data: the former are larger in magnitude than the latter, which matters for CCA. Because CCA is an exploratory method, larger values of the canonical correlations reduce the chance of missing true correlation pairs. Third, although the horseshoe prior is widely studied in the literature, via both simulation studies and theoretical research, applications remain scarce. We formulate the natural parameters as a latent variable model and invoke the horseshoe prior on the latent weights to model sparsity. To better extract the sparse signals, we adopt a fully Bayesian specification for the global hyperparameter $\tau$. As discussed in Piironen17, this prior results in sensible inference only when $\tau$ is strongly identified by the data. Our simulation study and real data applications show that our approach performs better than existing methods. Lastly, our approach is built on an exponential family and can easily be extended to other data formats.
The rest of the article is organized as follows. Section 2 contains our model details and inference. Section 3 discusses the theoretical results and Section 4 describes simulation studies. Section 5 presents the real data application. Finally, Section 6 contains a discussion and future directions of PSCCA.
2 Method
2.1 Model
Let $F$ be a distribution function from the natural-parameter exponential family. The random component of a generalized linear model consists of a response vector whose components have conditional distributions in the exponential family. This family has probability density or mass function of the form
$$f(y_i; \theta_i) = a(\theta_i)\, b(y_i)\, \exp\{\, y_i\, Q(\theta_i) \,\}.$$
The value of the parameter $\theta_i$ may vary across observations depending on the values of the explanatory variables. The term $Q(\theta_i)$ is the natural parameter; $a(\cdot)$ and $b(\cdot)$ are nonnegative functions that distinguish one member of the exponential family from another. For our case, assume we have two sets of multivariate random variables, $X^{(1)} \in \mathbb{R}^{p_1}$ and $X^{(2)} \in \mathbb{R}^{p_2}$. The observed data samples are expressed as $x_j^{(m)}$ with $j = 1, \ldots, N$ observations, where $m$ is 1 or 2. Let $x_{ji}^{(m)}$ represent the observed value of the $j$th individual for the $i$th feature (variable) in a set of $p_m$ measured features (variables). We motivate our formulation by the latent variable interpretation of CCA (Bach06) for modeling the natural parameters, and by the ideas from PCAN (Zoh16) for estimating correlations from the natural parameters. We assume each individual data vector conditionally follows an exponential family distribution, and here we consider a generalized linear model. The generative model for the coupled natural parameters $\theta_{ji}^{(m)}$, with $m = 1, 2$, is
(1) $x_{ji}^{(m)} \mid \theta_{ji}^{(m)} \sim a(\theta_{ji}^{(m)})\, b(x_{ji}^{(m)})\, \exp\{\, x_{ji}^{(m)}\, Q(\theta_{ji}^{(m)}) \,\}.$
We write the model as a function of latent variables by concatenating the features into the vector $\theta_j^{(m)} = (\theta_{j1}^{(m)}, \ldots, \theta_{jp_m}^{(m)})^\top$:
(2) $\theta_j^{(m)} = \mu^{(m)} + W^{(m)} z_j + e_j^{(m)}, \qquad z_j \sim N_d(0, I_d), \qquad e_j^{(m)} \sim N(0, \sigma_m^2 I_{p_m}).$
The matrix $W^{(m)} \in \mathbb{R}^{p_m \times d}$ denotes the loading matrix associated with the latent vector $z_j$; $I_d$ denotes the $d \times d$ identity matrix; the parameter vector $\mu^{(m)}$ represents the means of the natural parameters associated with the features in data set $m$; and $e_j^{(m)}$ is an independently distributed error term, with $N(0, \sigma_m^2 I_{p_m})$ denoting a normal distribution with null mean and variance $\sigma_m^2$. The core of the generative process is the unobserved shared latent variables $z_j$, which are transformed via linear mappings to the observation spaces; they capture the variation common to both data sets and allow for dependency between variables within a specific data set.
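For intuition, the generative process above can be sketched in a short simulation. This is our own illustration (not the paper's code): it uses a Poisson observation model with a log link as one concrete member of the exponential family, and all hyperparameter values are arbitrary.

```python
import numpy as np

def simulate_coupled_counts(N, p1, p2, d, sigma1=0.3, sigma2=0.3, seed=0):
    """Simulate coupled count data from the latent-variable model:
    theta_j^(m) = mu^(m) + W^(m) z_j + e_j^(m),  z_j ~ N(0, I_d),
    with a Poisson(exp(theta)) observation model as one
    exponential-family example (the framework covers other members).
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (p1, d))   # loading matrices
    W2 = rng.normal(0.0, 0.5, (p2, d))
    mu1 = rng.normal(0.5, 0.2, p1)       # means of natural parameters
    mu2 = rng.normal(0.5, 0.2, p2)
    Z = rng.normal(size=(N, d))          # shared latent variables
    theta1 = mu1 + Z @ W1.T + rng.normal(0.0, sigma1, (N, p1))
    theta2 = mu2 + Z @ W2.T + rng.normal(0.0, sigma2, (N, p2))
    X = rng.poisson(np.exp(theta1))      # observed counts, set 1
    Y = rng.poisson(np.exp(theta2))      # observed counts, set 2
    return X, Y, W1, W2

X, Y, W1, W2 = simulate_coupled_counts(N=200, p1=30, p2=40, d=3)
```

Because both sets of natural parameters are driven by the same latent $z_j$, the two count matrices are dependent even though the counts are conditionally independent given the natural parameters.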
We impose horseshoe priors on the matrices $W^{(m)}$. Let $w_i^{(m)} = (w_{i1}^{(m)}, \ldots, w_{id}^{(m)})$ denote the $i$th row vector of $W^{(m)}$. Then we assume that:
(3) $w_{ik}^{(m)} \mid \lambda_{ik}^{(m)}, \tau^{(m)} \sim N(0, \lambda_{ik}^{(m)2} \tau^{(m)2}), \qquad \lambda_{ik}^{(m)} \sim C^{+}(0, 1).$
We refer to the $\lambda_{ik}^{(m)}$ as the local shrinkage parameters and to $\tau^{(m)}$ as the global shrinkage parameters. Let $C^{+}(0, 1)$ denote the standard half-Cauchy distribution. The half-Cauchy prior for the local shrinkage parameters has shown good performance (Carvalho08; Carvalho10). There has been a vast amount of research on how to choose the prior for the global hyperparameter, which plays an important role in the overall sparsity of the parameter matrix $W^{(m)}$. As discussed in the Introduction, we choose the full Bayesian specification for $\tau^{(m)}$. Thus, we assume:
(4) $\tau^{(m)} \sim C^{+}(0, 1).$
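A small Monte Carlo sketch (ours, not from the paper) makes the shrinkage behavior of the prior in (3) concrete: under a small global scale most draws concentrate near zero, while the heavy-tailed local scales still let occasional weights escape far from the origin.

```python
import numpy as np

rng = np.random.default_rng(42)

def horseshoe_draws(tau, size, rng):
    """Draw weights w_i ~ N(0, lambda_i^2 * tau^2),
    with local scales lambda_i ~ half-Cauchy(0, 1)."""
    lam = np.abs(rng.standard_cauchy(size))  # half-Cauchy local scales
    return rng.normal(0.0, lam * tau)

w_small = horseshoe_draws(0.01, 100_000, rng)  # strong global shrinkage
w_large = horseshoe_draws(1.0, 100_000, rng)   # weak global shrinkage
# Typical draws shrink with tau, but the Cauchy tail still produces
# occasional large weights even when tau is tiny.
```

This is exactly the property exploited later for omics data: strongly identified signals escape the shrinkage, while the bulk of the weights collapse toward zero.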
For $i = 1, \ldots, p_m$; $k = 1, \ldots, d$; and $j = 1, \ldots, N$, we construct $\theta_j = (\theta_j^{(1)\top}, \theta_j^{(2)\top})^\top$ and $W = (W^{(1)\top}, W^{(2)\top})^\top$. The vector $\theta_j$ has a multivariate normal distribution with mean $\mu = (\mu^{(1)\top}, \mu^{(2)\top})^\top$ and covariance matrix
(5) $\Sigma = W W^\top + \Lambda,$
where $\Lambda = \mathrm{diag}(\sigma_1^2 I_{p_1}, \sigma_2^2 I_{p_2})$, $\Sigma_{11} = W^{(1)} W^{(1)\top} + \sigma_1^2 I_{p_1}$, and $\Sigma_{22} = W^{(2)} W^{(2)\top} + \sigma_2^2 I_{p_2}$. The correlation between $\theta_{ji}^{(1)}$ and $\theta_{jk}^{(2)}$, for any sample $j$, can be obtained as
(6) $\rho_{ik} = \dfrac{(W^{(1)} W^{(2)\top})_{ik}}{\sqrt{(\Sigma_{11})_{ii}\,(\Sigma_{22})_{kk}}}.$
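A short sketch (our own helper, with illustrative names and arbitrary inputs) shows how the model-implied correlations of Equations (5) and (6), and the canonical correlations defined next in the text, follow directly from the loading matrices:

```python
import numpy as np

def natural_param_correlations(W1, W2, sigma1_sq, sigma2_sq):
    """Correlations and canonical correlations implied by the
    latent-variable model: Sigma = W W' + Lambda (block form)."""
    S11 = W1 @ W1.T + sigma1_sq * np.eye(W1.shape[0])
    S22 = W2 @ W2.T + sigma2_sq * np.eye(W2.shape[0])
    S12 = W1 @ W2.T
    # Element-wise correlations between the two sets of natural params
    corr = S12 / np.sqrt(np.outer(np.diag(S11), np.diag(S22)))

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Canonical correlations: singular values of S11^{-1/2} S12 S22^{-1/2}
    R = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    can_corr = np.linalg.svd(R, compute_uv=False)
    return corr, can_corr

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
corr, cc = natural_param_correlations(W1, W2, 0.5, 0.5)
```

Note that with $d = 2$ latent dimensions, only the first two canonical correlations are nonnull: the cross-block $W^{(1)} W^{(2)\top}$ has rank at most $d$.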
For the canonical correlations, let $\Sigma_{mm}^{1/2}$ denote the square-root decomposition of the positive definite matrix $\Sigma_{mm}$. Let
$R = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$, with $\Sigma_{12} = W^{(1)} W^{(2)\top}$; then the nonnull eigenvalues of
$R R^\top$ correspond to the squared canonical correlations for the natural parameters. Inference on the correlations and canonical correlations is based on the marginal posterior distributions of $W^{(1)}$, $W^{(2)}$, $\sigma_1^2$, and $\sigma_2^2$.
2.2 Identifiability and Prior
The latent variable model (Equation 2) is identifiable only up to orthonormal rotations: for any orthogonal matrix $P$ with $P P^\top = I_d$, the loadings $W^{(m)} P$ and latent variables $P^\top z_j$ produce the same estimate of the data covariance matrix in Equation (5) and an equal likelihood. Zoh16 recommend imposing a lower triangular structure on the matrices $W^{(m)}$, following the work of Geweke96, and further requiring that the diagonal elements be nonnegative to remove the non-identifiability related to sign. This approach depends on the ordering of the variables that form the leading block of $W$, but we assume the value of $d$ is small so that the impact is negligible (Lopes04).
(7)  
where $w_{ik}^{(m)} = 0$ for $i < k$, and $w_{ii}^{(m)} \ge 0$ for $i = k$. Here $1(\cdot)$ is an indicator function: $1(A) = 1$ if the event $A$ is true and 0 otherwise. We assume conjugate priors for the remaining parameters in the model as
(8)  
The hyperparameters of these prior distributions are treated as known.
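The rotational non-identifiability that motivates the lower-triangular constraint can be verified numerically; this is a quick illustrative sketch of ours, not part of the paper's procedure:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(6, 2))  # an arbitrary loading matrix, d = 2

# Any orthogonal P yields a rotated loading matrix W P that implies
# exactly the same covariance W W' -- hence the same likelihood.
angle = 0.7
P = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
W_rot = W @ P

print(np.allclose(W @ W.T, W_rot @ W_rot.T))  # True
```

Since $W$ and $W P$ are indistinguishable from the data, some constraint (here, lower-triangularity with nonnegative diagonal) is needed to pin down a unique representative.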
2.3 Inference
The form of the full conditional posterior distribution is proportional to the product of (1) the joint conditional likelihood for the data matrices and and (2) the prior distributions:
(9)  
The posterior distribution in (9) is difficult to simulate from directly. We update the parameters using a Markov chain Monte Carlo (MCMC) algorithm.
3 Theoretical Results
Let $F_1$ and $F_2$ be distribution functions from the natural-parameter exponential family, as discussed in the model section. We model $X \sim F_1$ and $Y \sim F_2$, and we denote $X = (X_1, \ldots, X_{p_1})^\top$ and $Y = (Y_1, \ldots, Y_{p_2})^\top$ as two sets of count random variables. The natural parameters $\theta^{(1)}$ and $\theta^{(2)}$ have the following expressions:
(10) $\theta^{(1)} = \mu^{(1)} + W^{(1)} z + e^{(1)}, \qquad \theta^{(2)} = \mu^{(2)} + W^{(2)} z + e^{(2)},$
where $W^{(1)}$ is $p_1 \times d$ and $W^{(2)}$ is $p_2 \times d$. We further assume the following independent probability distributions: $z \sim N_d(0, I_d)$, $e^{(1)} \sim N(0, \sigma_1^2 I_{p_1})$, $e^{(2)} \sim N(0, \sigma_2^2 I_{p_2})$.
Theorem 1. Define the unconditional variance-covariance matrix of $\theta^{(1)}$ and $\theta^{(2)}$ as $\Sigma = W W^\top + \Lambda$, partitioned into blocks $\Sigma_{11}$, $\Sigma_{12}$, and $\Sigma_{22}$ as in Section 2. Then the correlation coefficients $\rho_{ik}$ constructed from $\Sigma$ are given by Equation (6), and the canonical correlation coefficients are the square roots of the nonnull eigenvalues of $R R^\top$, where $R = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$. The detailed proof is in the Appendix.
Corollary 1. Correlation coefficients and canonical correlation coefficients calculated from the raw count data $X$ and $Y$ will be numerically smaller in magnitude than the corresponding coefficients calculated from the natural parameters $\theta^{(1)}$ and $\theta^{(2)}$.
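A quick simulation of ours illustrates the attenuation stated in Corollary 1, using a Poisson log-link model with a shared latent variable (all numeric settings are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 20_000
z = rng.normal(size=N)
# Correlated natural parameters driven by a shared latent variable
theta_x = 0.5 + 0.8 * z + 0.3 * rng.normal(size=N)
theta_y = 0.5 + 0.8 * z + 0.3 * rng.normal(size=N)
# Counts generated conditionally on the natural parameters
x = rng.poisson(np.exp(theta_x))
y = rng.poisson(np.exp(theta_y))

corr_theta = np.corrcoef(theta_x, theta_y)[0, 1]
corr_counts = np.corrcoef(x, y)[0, 1]
# The Poisson sampling noise adds variance at the count level,
# so the count-level correlation is attenuated toward zero.
```

The extra conditional (Poisson) variability inflates the marginal variances of the counts without contributing to their covariance, which is exactly why the count-level coefficients shrink in magnitude.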
4 Simulation
4.1 Settings
In this section we conduct simulations to assess the performance of PSCCA in comparison with several existing methods mentioned in the Introduction: probabilistic correlation analysis (PCAN), Bayesian CCA (BCCA), and sparse CCA (PMD). First, we evaluate the performance of PSCCA on correlation analysis, comparing it with PCAN, because the PCAN paper already demonstrated that PCAN outperforms traditional Spearman and Pearson correlation methods. Second, we compare PSCCA's performance on canonical correlation analysis with that of BCCA, PMD, and a modified version of PCAN that computes canonical correlations from the natural parameters.
To evaluate the methods, let $W^{(1)}$ and $W^{(2)}$ be the true generating loading matrices. For all methods with estimates $\hat W^{(1)}$ and $\hat W^{(2)}$, we can calculate the correlations and the canonical correlations according to Equations (5) and (6). Let $V$ be the true matrix and $U$ be the estimated matrix. Then the Frobenius loss is $L_F(V, U) = \| V - U \|_F$.
Scenario I: Correlation analysis. In the first scenario, we simulated 100 datasets with fixed dimensions $p_1$, $p_2$ and number of subjects $N$ for each dataset. The weight matrices are $W^{(1)}$ and $W^{(2)}$. We consider three correlation matrices for the natural parameters:
(a) the identity correlation matrix, obtained assuming the true $d = 0$;
(b) a correlation matrix obtained assuming the true $d = 5$;
(c) a correlation matrix obtained assuming the true $d = 10$.
We fit PSCCA and PCAN to each of these 100 datasets, assuming different values of $d$, to compute the posterior mean correlation matrices.

Table 1: Frobenius loss (mean and 95% CI) for the estimated correlation matrices.

True d | Assumed d | PSCCA Mean | PSCCA 95% CI   | PCAN Mean | PCAN 95% CI
0      | 2         | 25.21      | (24.35, 26.54) | 28.57     | (28.31, 29.20)
0      | 5         | 19.55      | (19.20, 20.49) | 23.30     | (21.85, 24.95)
0      | 10        | 14.12      | (13.23, 14.83) | 21.26     | (19.66, 22.17)
5      | 2         | 4.25       | (4.05, 4.34)   | 22.51     | (21.26, 22.87)
5      | 5         | 3.89       | (3.60, 4.24)   | 19.88     | (18.72, 20.63)
5      | 10        | 4.54       | (4.21, 4.91)   | 20.35     | (19.45, 22.44)
10     | 2         | 10.31      | (9.41, 12.21)  | 15.55     | (14.68, 16.31)
10     | 5         | 8.81       | (8.26, 9.35)   | 14.62     | (13.34, 15.21)
10     | 10        | 7.85       | (7.54, 8.06)   | 13.27     | (12.82, 13.51)

Scenario II: Canonical correlation analysis. In the second scenario, we simulated 100 datasets with a fixed sample size for each dataset, under moderate and high dimensions of $p_1 = p_2 = p$. We use three models for the correlation matrices of the natural parameters:
Model I (Independent covariances): no covariance structure within each set of natural parameters.
Model II (Identity covariances): the within-set covariance matrices are identity matrices.
Model III (Moderate covariances): within-set correlations set to 0.5 in each data set.
(1) Moderate dimensions: $p = 60, 100, 300$.
(2) High dimensions: $p = 500, 1000, 2000$.
Model I is used in PCAN (Zoh16), and models similar to Models II and III have been used to generate the raw data in Gao17. We fit PSCCA, the modified PCAN, PMD, and BCCA to the datasets simulated under the above scenarios and compute the canonical correlation matrices for comparison.
4.2 Results
We report the Frobenius loss (Table 1) and the Stein loss (Supplementary Table 1) for each of the estimated correlation matrices. The Stein loss for estimating the true matrix $V$ with the estimate $U$ is defined as $L_S(V, U) = \mathrm{tr}(U V^{-1}) - \log \det(U V^{-1}) - p$, where $p$ is the dimension of $V$.
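For concreteness, both loss criteria can be sketched as follows. This is our own illustration; the Stein loss below is the standard trace-minus-log-determinant form, which we assume matches the paper's definition.

```python
import numpy as np

def frobenius_loss(V, U):
    """Frobenius loss for estimating the true matrix V with U."""
    return np.linalg.norm(V - U, ord="fro")

def stein_loss(V, U):
    """Stein (entropy) loss, tr(U V^{-1}) - log det(U V^{-1}) - p,
    for positive definite truth V and estimate U."""
    p = V.shape[0]
    M = U @ np.linalg.inv(V)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

# Both losses are ~0 when the estimate equals the truth,
# and strictly positive otherwise.
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
```

Both criteria are nonnegative and vanish only when $U = V$, which is why smaller values are preferred throughout the results.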
Overall, under each scenario, PSCCA yields a smaller Frobenius loss and a smaller Stein loss than PCAN, and both methods yield much smaller Frobenius losses than Stein losses. We found that under the true $d = 5$ and $d = 10$, PSCCA and PCAN achieve smaller Frobenius losses when the assumed $d$ is closer to the truth, and smaller losses are preferred. However, when the true $d = 0$ we observe the opposite: the assumed $d = 2$, the value closest to the truth, yields the largest Frobenius loss for both PSCCA and PCAN.
We also estimated the correlations using other standard correlation estimation methods, which are valid under these simulation settings. We report the summary of the Frobenius loss incurred when estimating the true correlation matrices using the Pearson and Spearman approaches on the raw data in Supplementary Table 2. PSCCA resulted in a smaller Frobenius loss, whereas PCAN and the Spearman correlations perform similarly under the true $d = 0$ and $d = 10$, which is consistent with the results in the PCAN paper (Zoh16).
For the canonical correlation analysis comparison, we modified the PCAN method based on Equations (5) and (6) to render it an alternative probabilistic canonical correlation analysis approach, referred to here as the modified PCAN. We report the summary of the results comparing PSCCA with the modified PCAN, BCCA, and PMD under Models I-III and moderate and high dimensions of $p$ in Table 2.
Moderate dimensions. PSCCA uniformly outperforms the three competitors: its estimator is closer to the truth than the estimates given by the modified PCAN, PMD, and BCCA. It is worth noting that under Model II, PMD performs similarly to PSCCA, because we generated the natural parameters with identity variance matrices. However, under Models I and III, PMD performs poorly. This confirms that methods making no assumptions on the variance matrices have broader applicability. Another point worth noting is that BCCA produces the largest standard errors among the methods, which indicates very unstable estimation for count data.
High dimensions. PSCCA continues to outperform the competitors when the dimensions are very high. PMD still performs similarly to PSCCA under Model II, which suggests that when the identity variance assumption holds, PMD's performance can improve. When the dimension exceeds 1000 under Model II, however, PMD displays very large standard errors. Under Model III, the modified PCAN performs nearly as well as PSCCA, indicating that when there is a moderate level of correlation, the canonical correlations estimated from the natural parameters are closer to the truth. BCCA still has large standard errors compared to the other methods, which indicates it is not an appropriate method for count data.
5 Real Data Analysis
The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov) was initiated in 2006 to develop a publicly accessible infrastructure of data on an increasing number of well-characterized cancer genomes. TCGA finalized tissue collection, with matched tumor and normal tissues, from 11,000 patients across 33 cancer types and subtypes, including 10 rare types of cancer. TCGA data have been used to characterize key genomic changes, find novel mutations, define intrinsic tumor types, discover similarities and differences across cancer types, reveal therapy resistance mechanisms, and collect evidence of tumor evolution (Tomczak15).
MicroRNAs (miRNAs) are very short non-coding RNAs that regulate gene expression at the post-transcriptional level. They bind to mRNAs and inhibit translation or induce mRNA degradation. Many studies demonstrate negative correlations between the expression of specific miRNAs and their corresponding target mRNAs, and their interaction in many disease-related regulatory pathways is well established (Ruike08; Wang09; Shah11). In recent years, there have been numerous studies of miRNA-mRNA correlation analysis in different cancers using TCGA data (Ding19; Yu19). However, these correlation analyses are all based on normalized continuous data, not the raw count data.
In our analysis, we consider the read-count next generation sequencing (NGS) expression data from squamous cell lung cancer (LUSC). We downloaded the LUSC dataset from the TCGA data portal. LUSC has 504 samples, and we processed the tumor miRNA and mRNA data with TCGAbiolinks (Colaprico16). Each sample contains 1,881 miRNAs and 56,537 mRNAs. First, we are interested in the correlation analysis between low-expressed mRNAs and a given set of miRNAs. We consider matched miRNA and mRNA samples and select subsets of miRNAs and mRNAs. For the miRNAs, we choose some reported to have high correlations with mRNAs in PCAN (Zoh16); among the 60 mRNAs, we choose 30 with average counts between 1 and 2, and the remaining 30 are significantly expressed genes reported by Shah11.
We fit the PSCCA model using the priors in Appendix A. We ran two separate MCMC chains for 10,000 iterations each and monitored them for proper mixing. The first 5,000 iterations of each chain were discarded as burn-in, and inference was based on the 10,000 remaining iterations. We estimate the correlations between miRNAs and mRNAs based on the posterior mean values of the natural parameters, and we also report results based on the standard correlation estimation approaches (Spearman and Pearson) applied to the raw data for comparison. We display the posterior mean estimates of the correlations as a heatmap in Figure 1.
The correlation results identify very interesting miRNA-mRNA interactions, and PSCCA demonstrated the highest power to select potentially correct ones. Our results show that miR539 is negatively correlated with four genes in our panel. The gene encoding miR539 is located on human chromosome 14q32.31, and miR539 has been reported to be downregulated in many human cancers, including prostate cancer, nasopharyngeal carcinoma, and thyroid cancer, playing a tumor-suppressive role in many human malignancies (Guo18). Through TargetScanHuman we confirmed that one of these genes is a target gene of miR539, and all the methods estimated the correct negative correlation direction. However, PSCCA gives the most extreme negative correlation estimate, -0.3102, between miR539 and this target, compared to -0.1054 in PCAN, -0.1149 in Spearman, and -0.1218 in Pearson. Meanwhile, from TargetScanHuman, we noticed that miR539 also regulates genes from the same protein families. Thus, our findings might add new members to the target gene family of miR539 and provide more clues about miR539's regulatory role in lung cancer. Another interesting miRNA is miR205, which was reported to play a dual role depending on the specific tumor type and target genes (Nordby17); we found it to be negatively correlated with four genes in our study, one of which is a target validated through TargetScanHuman. The estimated correlation between this target and miR205 in PSCCA is -0.2875, compared to -0.0896 in PCAN, -0.00617 in Spearman, and -0.01743 in Pearson. The same pattern holds for the miR338-TP63 interaction, in which all the methods estimated the correct negative correlation direction, but PSCCA gives the most extreme negative value among the four methods. We report a few interesting miRNAs and their estimated correlations with mRNAs in Figure 2(a).
From Figure 2(a), we can see that for the same miRNA-mRNA pair, PSCCA estimates the most extreme correlation values among all the methods.
Figures 2(b) and 2(c) show Venn diagrams depicting the overlap between the miRNA-mRNA correlation pairs identified by the PSCCA, PCAN, Spearman, and Pearson approaches. Overall, very few pairs are jointly identified by all four approaches. PSCCA identified 59 negative correlation pairs with estimated correlation values less than -0.25, compared to 36 in PCAN, 9 in Spearman, and 2 in Pearson. For positive correlations, PSCCA identified 319 pairs with estimated correlation values larger than 0.25, compared to 28 in PCAN, 27 in Pearson, and 21 in Spearman. These results suggest that PSCCA identifies more extreme negative and positive correlation estimates in sparse count data, and thus has less chance of missing true correlation pairs in high-dimensional exploratory analysis. For detailed results, please see the Supplementary material.
For canonical correlation analysis, we again use the modified PCAN, which calculates the canonical correlations from the natural parameters. We apply PSCCA, the modified PCAN, PMD, and BCCA to the same LUSC dataset as above. Here, for ease of presentation, we focus only on the first two canonical correlations. We fit all the models for several values of $d$, the number of canonical vectors to be obtained. For PMD, we use equal tuning parameters for the two views, chosen by the function CCA.permute in the R package PMA. For BCCA, we use the default settings for the initial parameter values. PSCCA and the modified PCAN yield high canonical correlations, while PMD and BCCA do not perform very well, both yielding small canonical correlations. In addition, under $d = 2$ the modified PCAN has slightly larger canonical correlation values than PSCCA, which might be because such a small $d$ cannot capture the variance of the truth. Examining the reasons for the poor performance of PMD and BCCA, we found that the variance matrices of the two data sets X and Y are far from identity matrices, which severely violates the identity variance assumption imposed by PMD. Also, the data are very sparse, because we selected 30 of the 60 mRNAs to have average counts between 1 and 2 while the remaining 30 mRNAs have large count values, which severely violates the normal distribution assumption that BCCA places on the two data sets. PMD and BCCA can still run on this low-dimensional dataset; however, when we applied the methods to the miRNA and mRNA data sets in high dimensions, PMD failed to run, while BCCA yielded a first canonical correlation of 0.4174, far less than the value of about 0.95 attained by PSCCA. This again indicates that efficient and more accurate canonical correlation methods are needed for high-dimensional sparse count data in genomics. Here, we display only the results on the low-dimensional dataset (Table 3).
Table 3: The first two canonical correlations on the LUSC dataset.

Method         | d=10, 1st | d=10, 2nd | d=5, 1st | d=5, 2nd | d=2, 1st | d=2, 2nd
PSCCA          | 0.9759    | 0.9229    | 0.8732   | 0.8245   | 0.7979   | 0.6791
Modified PCAN  | 0.8884    | 0.8528    | 0.8343   | 0.7733   | 0.8188   | 0.7179
PMD            | 0.5858    | 0.5368    | 0.5858   | 0.5368   | 0.5858   | 0.5147
BCCA           | 0.2624    | 0.2550    | 0.1981   | 0.1786   | 0.1608   | 0.0069
6 Discussion and Future Work
We proposed a probabilistic approach of correlation and canonical correlation analysis for sparse count data. PSCCA is a modelbased approach to estimate correlations and canonical correlations at the natural parameter level rather than at the raw data level. Both the simulation study results and the real data application indicate that PSCCA compares favorably to existing methods.
In Section 3 we provided a theoretical justification proving that correlation coefficients and canonical correlation coefficients calculated from the raw count data X and Y are smaller in magnitude than those calculated from the natural parameters $\theta^{(1)}$ and $\theta^{(2)}$. This explains why PSCCA achieves more extreme correlation and canonical correlation estimates in the real data application. Meanwhile, we demonstrated that the horseshoe prior handles sparsity very well, probably because it does not regularize parameters far from zero, which is important for extracting the important variables that are only strongly identified in NGS data.
As the demand for integrative analysis of high-dimensional complex data increases, a limitation is that PSCCA is a linear method, which may not be appropriate for complicated nonlinear settings. Recently, researchers in computer science and engineering developed deep CCA (Benton19) to extract nonlinear associations and extended it to multiple views. However, deep CCA relies on the expressive power of deep neural networks, which have a black-box drawback: they are not easy to interpret or understand. PSCCA is a model-based approach that estimates the dependency within and between two datasets as a joint task; thus, PSCCA is more interpretable. One direction for future work is to develop a probabilistic deep CCA that handles more complex data structures while providing interpretable results. Another possible extension is a longitudinal framework to extract dependency relationships between two or more co-occurring time series.
Acknowledgements
The authors thank Dr. Roger S. Zoh for discussions about PCAN during the early stage of the PSCCA study design, and the anonymous referee for very useful comments that improved the presentation of the paper.