 # Imbalanced Sparse Canonical Correlation Analysis

Classical canonical correlation analysis (CCA) requires matrices to be low dimensional, i.e. the number of features cannot exceed the sample size. Recent developments in CCA have mainly focused on the high-dimensional setting, where the number of features in both matrices under analysis greatly exceeds the sample size. However, these approaches make considerable sparsity assumptions and impose penalties that may be unnecessary for some datasets. We consider an imbalanced setting that is commonly encountered, where one matrix is high dimensional and the other is low dimensional. We provide an explicit link between sparse multiple regression with sparse canonical correlation analysis, and an efficient algorithm that exploits the imbalanced data structure and estimates multiple canonical pairs rather than sequentially. We provide theoretical results on the consistency of canonical pairs. Simulation results and the analysis of several real datasets support the improved performance of the proposed approach.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Canonical correlation analysis (CCA) is a widely used method to determine the relationship between two sets of variables. In CCA, the objective is to find a linear combination of variables from each set of variables such that the correlation is maximized. The vectors consist of coefficients from each linear combination is called canonical pairs. Originally proposed by

Hotelling (1936)

, CCA has been applied to numerous problems, including those of large scale. In large scale problems, including genomic studies, researchers are often faced with high dimensional data. For example,

Chen et al. (2012) studied the association between nutrient intake and human gut microbiome composition, and Wang et al. (2015) studied the group interactions among genes. Projects such as GTEx (Aguet et al., 2019) also provide rich datasets for which CCA might be used to identify important genetic modules relevant to disease. In these works, classical CCA cannot be used to analyze the high dimensional data, where the number of variables exceeds the number of observations.

To study the relationship between two sets of high dimensional variables, many extensions of classical CCA have been proposed. One popular approach, sparse canonical correlation analysis, imposes sparse structure on the canonical vectors. An incomplete list of sparse CCA methods is Parkhomenko et al. (2009); Witten and Tibshirani (2009); Waaijenborg et al. (2008); Lê Cao et al. (2009); Witten et al. (2009); Chen et al. (2012), and references therein. In these works, the dimensions of the two variables being compared are both high.

In some cases, the dimensions of the two variables can be quite different. For example, in the application to gut microbiome data analysis (Chen et al., 2012), the dimensions of two variables were and , while the sample size was . Such datasets have a partially low dimensional structure, that is, the dimension of one variable is larger than the sample size, and the other one has dimension lower than the sample size. The sparse canonical correlation analysis does not utilize the partially low dimensional structure, and may provide biased canonical pairs of canonical vectors.

In order to estimate the canonical pairs in the problems with partially low dimensional structure, in this work, we propose imbalanced sparse canonical correlation analysis. Specifically, we link sparse multiple regression with sparse canonical correlation analysis, and apply Lasso regularization to the estimation of the canonical pairs. We propose an efficient algorithm to provide canonical pairs simultaneously for . This advantage significantly differentiates our method from other sparse canonical correlation analysis methods, which usually estimate multiple canonical pairs sequentially. Importantly, we also provide theoretical guarantees on the consistency of estimated canonical pairs.

We note that the relationship between multiple regression and CCA has been considered previously. In Glahn (1968), multiple regression was considered as be a special case of CCA, but the high dimensional situation was not considered. Lutz and Eckert (1994) analyzed the relationship between multiple regression and CCA via eigenstructure. Yamamoto et al. (2008) applied CCA to multivariate regression. Song et al. (2016) assumed that the responses have a linear relationship with some underlying signals. However, we are not aware of any works that apply sparse multiple regression to canonical correlation analysis.

The rest of this paper is arranged as follows. In Section 2, we introduce classical canonical correlation analysis and sparse canonical correlation analysis. We propose an imbalanced sparse canonical correlation analysis approach, with attendant theoretical properties. In Section 3, we compare the imbalanced sparse canonical correlation analysis to existing methods, and highlight the potential advantages and importance of the proposed method. In Section 4, we conduct numeric simulation studies. In Section 5, we apply imbalanced sparse canonical correlation analysis and competing methods to three applied problems, including human gut microbiome data, GTEx thyroid imaging/expression data, and GTEx liver genotype/expression data. A Discussion with conclusions is provided in Section 6. Technical proofs are provided in the Appendix.

## 2 Background and Proposed Methods

In this section, we provide a brief introduction to classical canonical correlation analysis and sparse canonical correlation analysis, propose imbalanced sparse canonical correlation analysis, and study its theoretical properties.

### 2.1 Classical CCA and Sparse CCA

Suppose we are interested in studying the correlation between two sets of random variables

and . The goal of canonical correlation analysis (CCA) is to find and such that is the solution to

 maxa∈Rd,b∈Rp corr(aTy,bTx). (1)

Without loss of generality, we assume and have mean zero, for otherwise we can shift the mean. Let and be the covariance matrix of and , respectively. Let be the covariance matrix between and . The optimization problem (1) is the same as

 maxa∈Rd,b∈Rp aTΣyxb√aTΣyya√bTΣxxb. (2)

The solution to (2), denoted by and , are called the first pair of canonical vectors, and the new variables and are called the first pair of canonical variables (Mardia et al., 1979). Once the -th pair of canonical vectors and are obtained, the -th pair of canonical vectors is the solution to the optimization problem

 maxa∈Rd,b∈Rp aTΣyxb√aTΣyya√bTΣxxb s.t. aTΣyyal=0,bTΣxxbl=0,1≤l≤k−1, (3)

for .

By basic matrix computation, one can obtain that the solution to the optimization problem (2.1) is the

-th eigenvector of

 Σ−1yyΣyxΣ−1xxΣxy, (4)

and is proportional to

 Σ−1xxΣxyak. (5)

Note that the solution to (2.1) is not unique, because for any constant and , if is the solution to (2.1), then so is . Therefore, we restrict the norms of and such that , and the first nonzero element of () is positive to make the solution to (2.1) unique, where is the Euclidean norm.

Let , be observations, where and . Let and be the sample matrices. In classical CCA, the covariance matrices , , and are replaced by , , and , respectively (González et al., 2008). This approach does not work if the dimension of or is larger than the sample size , because or is singular. To address the case when or is larger than , several modifications of the classical CCA have been proposed. One widely used method, sparse canonical correlation analysis, restricts the estimated canonical vectors to be sparse. In Parkhomenko et al. (2009), the sparse canonical vectors are obtained by an algorithm where a threshold is used to control the sparsity of canonical vectors. Another way to obtain the sparse canonical vectors is via regularization (Witten and Tibshirani, 2009; Witten et al., 2009; Chen et al., 2012), where the -th pair of canonical vectors is obtained by solving

 maxa,b 1naTYXTb s.t. ∥a∥2≤1,∥b∥2≤1,P1(a)≤c1,P2(b)≤c2,aTYYTal=0,bTXXTbl=0,1≤l

where and are two convex penalty functions, and and are two constants. In this model, the sparsity is imposed on the canonical vectors by using different penalty functions. In Section 3, we will provide a more detailed introduction to sparse canonical correlation analysis.

### 2.2 Imbalanced Sparse CCA

In some cases, the dimensions of and are very different. Without loss of generality for our application domain, we assume the dimension of is much larger than the sample size, while the dimension of is relatively small. In this work, we propose imbalanced sparse canonical correlation analysis, which can be used to estimate the canonical vectors under the the setting . This imbalance allows us to use the low dimensional structure of to compute canonical vectors efficiently. One may still use sparse CCA (2.1) in this setting to estimate the canonical vectors, but may lose some efficiency or incur substantial bias, as we will see in the numerical studies. In contrast with existing sparse CCA methods (2.1), we do not approach the problem directly as in terms of the CCA correlation maximization. First, we establish a relationship between multivariate regression and CCA. This relationship allows us to apply existing methodologies from regression, which makes the algorithm of estimating canonical vectors more efficient.

Consider a multivariate linear regression on

with variables ,

 y=B∗x+ϵy, (7)

where is the coefficient matrix. The variable is the residual, and . With the relationship (7), we can compute the covariance matrices and by

 Σxy=ΣxxBT∗,Σyy=B∗ΣxxBT∗+Σϵyϵy, (8)

where the second equality follows from . By the results in classical CCA, the first canonical vector of the -th pair of canonical vectors is the -th eigenvector of

 Σ−1yyΣyxΣ−1xxΣxy =Σ−1yyB∗ΣxxBT∗, (9)

where the equality is because of (8). The second canonical vector is propotional to

 Σ−1xxΣxyak=Σ−1xxΣxxBT∗ak=BT∗ak. (10)

Note that in (9) and (10), we do not need to compute . Therefore, we avoid the problem that is singular. Thus, if is known, we can replace and in (9) by and , respectively, to estimate the canonical vectors.

In practice, is rarely known. Therefore, we need to estimate in order to use (9) and (10) to obtain the canonical vectors. Note that with . One natural idea is to assume the coefficient matrix has some sparse structure, and to use the elementwise penalty as a regularization as in Lasso (Tibshirani, 1996). To be specific, let be an estimator of . We compute by the following optimization problem

 minB∈Rd×pn∑i=1∥Yi−BXi∥22+λ1∥B∥F1, (11)

where

 ∥B∥F1=d∑j=1∥βj∥1,

for , is the norm, and is a tuning parameter. Noting that (11) can be rewritten as

 n∑i=1∥Yi−BXi∥22+λ1∥B∥F1 = n∑i=1d∑j=1(yij−βTjXi)2+λ1d∑j=1∥βj∥1 = d∑j=1(n∑i=1(yij−βTjXi)2+λ1∥βj∥1),

we can decompose (11) into Lasso problems,

 minβjn∑i=1(yij−βTjXi)2+λ1∥βj∥1 (12)

for . Let be the solution to (12), and . Therefore, is an estimator of . By replacing , and in (9) and (10) with , and , respectively, we can obtain the -th pair of estimated canonical vectors and as follows. The first estimated canonical vector is the -th eigenvector of

 (YYT)−1^BX(^BX)T,

and the second canonical vector is proportional to

 ^BT^ak.

Algorithm 1 describes the procedure to obtain the -th pair of estimated canonical vectors.

Compared with existing methods, the proposed algorithm has the following advantages. First, we do not use any iteration in our algorithm, except in solving Lasso, which has been well studied and optimized in the literature (Friedman et al., 2010). Therefore, Algorithm 1 is efficient, since we assume is small. Second, in Algorithm 1, it can be seen that we obtain multiple pairs of estimated canonical vectors simultaneously. This implies that we do not need to estimate multiple pairs of canonical vectors sequentially. In particular, if one is only interested in the -th pair of canonical vectors, one does not need to know -th canonical vectors for . We will discuss the comparison between imbalanced sparse CCA and existing methodologies in greater detail in Section 3.

### 2.3 Theoretical Properties

In this subsection, we present theoretical results of imbalanced sparse CCA. We mainly focus on the consistency of the estimated canonical vectors. We first introduce some technical assumptions. In the rest of this work, we will use the following definitions. For notational simplicity, we will use and to denote the constants, of which the values can change from line to line. For two positive sequences and , we write if, for some constants , . Similarly, we write if for some constant , and if for some constant .

The first assumption is the regularity conditions on the covariance matrices and .

###### Assumption 1.

Let and

be the maximum and minimum eigenvalues of matrix

, respectively. Assume there exist positive constants and such that

 K1≤min{λmin(Σxx),λmin(Σϵyϵy)}≤max{λmax(Σxx),λmax(Σϵyϵy}≤K2.

Assumption 1 assures that the eigenvalues of the covariance matrix are bounded, which is a typical condition for the high dimensional analysis; see Gao et al. (2017); Chen et al. (2013) for example.

The second assumption is on the coefficient matrix .

###### Assumption 2.

Suppose satisfies for some constant , where

is the maximum singular value of

.

As a simple consequence of Assumptions 1 and 2, the eigenvalues of the covariance matrix are bounded above by a constant, and bounded below from zero, as shown in the following proposition.

###### Proposition 1.

Suppose Assumptions 1 and 2 hold. Then there exist positive constants and such that

 K1≤λmin(Σyy)≤λmax(Σyy)≤K2.

The following assumption is on the matrix .

###### Assumption 3.

Let . Suppose the Schur decomposition of with respect to -th eigenvalue and eigenvector is

 QTkΓQk=[λkvTk0Tk],

where is orthogonal (thus, is the -th eigenvector of ). Assume there exist some constants and such that for all , and , where is the minimum singular value of .

Assumption 3 imposes the conditions on the matrix , which ensures the numerical stability of the calculation of the eigenvectors of . The singular value condition is necessary because if is a nondefective, repeated eigenvalue of , there exist infinitely many of eigenvectors corresponding to , thus the consistency of eigenvectors cannot hold. Roughly speaking, Assumption 3 requires that the eigenvalues of are well separated.

The next assumption is related to the tail behaviors of variables , , and .

###### Definition 1.

A vector is sub-Gaussian, if there exist positive constants and such that

 K2(Eev2i/K2−1)≤σ2

holds for all .

###### Assumption 4.

The random variables , , and are all sub-Gaussian. Furthermore, is independent of .

The sub-Gaussian assumption in Assumption 4 is also typical in high dimensional analysis. As a simple example, and are sub-Gaussian, where

is a multivariate normal distribution with mean zero and covariance matrix

. The independence assumption of and is slightly stronger than the condition , which can be always done by projection of onto , if and are normally distributed.

Under Assumptions 1-4, we have the following consistency results, as shown in Theorem 1. The proof of Theorem 1 can be found in Appendix C.

###### Theorem 1.

Let with . Suppose Assumptions 1-4 hold. Furthermore, assume , , and , where and is the cardinality of set

. Then with probability at least

,

 max{∥ak−^ak∥2,∥bk−^bk∥2}≲ √d(d+s∗)logp/n, (13)

for all , where is a positive constant not depending on .

In Theorem 1, it can be seen that if is small, then imbalanced sparse CCA can provide consistent estimators of canonical vectors, under the high dimensional settings with respect to the second random variable.

## 3 Comparison with Existing Methods

In this section, we first review some existing methods, and then compare imbalanced sparse CCA with these existing methods. As introduced in Section 2.1, when the dimension of or is larger than the sample size , the classical CCA cannot be directly applied. One naive method to estimate the canonical vectors is to add diagonal matrices and with such that the estimated covariance matrix and are invertible, where and are two identity matrices of size and , respectively, , and . Following the terminology in spatial statistics (Stein, 1999) and computer experiments (Peng and Wu, 2014), we call and “nugget” parameters, and call the corresponding method CCA with a nugget parameter. CCA with a nugget parameter provides the first canonical vector as an eigenvector of

 (ΣYY+μYId)−1ΣYX(ΣXX+μXIp)−1ΣXY, (14)

and the second canonical vector is proportional to

 (ΣXX+μXIp)−1ΣXYa, (15)

where and . Although using a nugget parameter enables the matrix inverse, it may produce non-sparse canonical vectors, which may hard to interpret. If the dimension of is not large, then the non-sparse canonical vector is reasonable. However, since the dimension of is large, it is desirable to have a sparse canonical vector , and CCA with a nugget parameter may not be appropriate to use.

Many other approaches to generalize classical CCA to high dimensional settings have been proposed. In these works, thresholding or regularization is introduced into the optimization problem (1). For example, Parkhomenko et al. (2009, 2007); Waaijenborg et al. (2008) introduced a soft-thersholding for each element of canonical vectors. Therefore, elements with small absolute value are forced to be zero, and a sparse solution is obtained. Chen et al. (2013) introduced iterative thresholding to estimate the canonical vectors, and showed that the consistency of estimated canonical vectors holds under the assumptions that and (or the inverses of them) are sparse. Tenenhaus and Tenenhaus (2011) proposed a regularized generalized CCA, where the constraint on canonical vectors are changed to be and , where are two tuning parameters.

Regularization-based sparse CCA usually has the form (2.1). This method was proposed by Witten et al. (2009), and has been extended by Witten and Tibshirani (2009). An algorithm based on Witten and Tibshirani (2009) has been proposed by Lee et al. (2011b). In Waaijenborg et al. (2008), the elastic net was also used to obtain sparsity of the estimated canonical vectors. Chen et al. (2012) modified sparse CCA as in Witten and Tibshirani (2009) by adding a structure based constraint to the canonical vectors. Gao et al. (2017) proposed a method called convex program with group-Lasso refinement, which is a two-stage method based on group Lasso, and they proved the consistency of estimated canonical variables.

Another type of sparse CCA methods is via reformulation. In Hardoon and Shawe-Taylor (2011), it was shown that based on a primal-dual framework, (2) is equivalent to the following problem

 minw,e∥XTw−YTYe∥22, (16)

subject to , in the sense that is the solution to (16) if and only if there exists such that is the solution to (2). Then by imposing regularization on and , sparse canonical vectors can be obtained. Recent work by Mai and Zhang (2019) reformulated (2.1) into a constrained quadratic optimization problem, and proposed an iterative penalized least squares algorithm to solve the optimization problem. Theoretical guarantees on the consistency of the estimated canonical vectors were also presented in Mai and Zhang (2019).

Our proposed method, imbalanced sparse CCA, is substantially different from all existing methods from the following perspectives. First, the scope of imbalanced sparse CCA is different from sparse CCA. In imbalanced sparse CCA, we target the estimation of canonical vectors under the settings . Although the case of can be covered by the methods mentioned above, these methods do not utilize the partially low dimensional structure. Imbalanced sparse CCA, on the other hand, utilizes the low dimensional structure. Many methods mentioned above need to solve Lasso in each iteration of the algorithm, until the algorithm converges; see Mai and Zhang (2019); Waaijenborg et al. (2008) for example. In imbalanced sparse CCA, we need only to solve Lasso optimization problems, which is fast, since is small and Lasso can be efficiently solved by existing algorithms.

Second, because of this low dimensional structure, imbalanced sparse CCA can provide pairs of canonical correlation vectors simultaneously, and does not need to obtain pairs of canonical correlation vectors one by one. In particular, if one is only interested in the -th pair of canonical vectors for , one does not need to know all -th canonical vectors for . Other methods mentioned above usually need to solve a separate optimization problem in order to get another pair of canonical vectors, and must know -th pairs of canonical vectors for all in order to get -th pair of canonical vectors. This advantage of imbalanced sparse CCA makes our method extremely powerful when researchers need to estimate a relatively large number of pairs of canonical vectors.

Third, it is mentioned in Mai and Zhang (2019) that one advantage of their method is that they do not need any assumptions on the covariance matrices. In our algorithm, we only use the estimation of the covariance matrix at the last step. Therefore, our approach also does not require assumptions on the covariance matrices.

Fourth, because we solve Lasso problems separately in imbalanced sparse CCA, our method naturally allows parallel computing, which can make our algorithm more efficient. The parallel computing makes imbalanced sparse CCA useful when , , and are extremely large. This trivially parallel computing cannot be used by most methods mentioned above, where iteration is required.

Fifth, imbalanced sparse CCA has theoretical guarantees, and the theoretical development is much different from the other methods mentioned above. The theory underlying imbalanced sparse CCA involves not only statistical theory, but also results from numerical algebra. In our theoretical development, we are dealing with the eigenvalues and eigenvectors directly. Because we can obtain multiple eigenvectors at one time, we can obtain multiple canonical vectors simultaneously. Therefore, our theory and method may be able to inspire new directions of analysis and development of new methodologies for sparse CCA. Because of this new direction, our theory itself may be of interest to researchers studying sparse CCA.

## 4 Numeric Simulation

In this section, we conduct numeric studies on the applications of the imbalanced sparse CCA and sparse CCA methods. We compare the imbalanced sparse CCA (isCCA) with CCA with a nugget parameter (nCCA) as in (14) and (15), CCA with penalty (-CCA) (Witten et al., 2009), sCCA (Lee et al., 2011b), and regularized and sparse generalized CCA (rgCCA) (Tenenhaus and Tenenhaus, 2011). In our simulation study, we first simulate , , with

 Σxx =A1AT1+0.1Ip,Σyy=B∗ΣxxBT∗+σ2Id,Σxy=ΣxxBT∗, (17)

where is a sparse matrix, , is a parameter, and and are identity matrices with size and , respectively.

Given , and , we can calculate the -th true canonical vectors by (4) and (5), denoted by and , respectively. In the numeric simulation, we mainly focus on the first pair of canonical vectors and . We use R packages PMA (Witten and Tibshirani, 2020), sCCA (Lee et al., 2011a), and RGCCA (Tenenhaus and Guillemot, 2017) to implement -CCA, sCCA, and rgCCA, respectively. For isCCA and nCCA, we generate an independent validation set of and with the same sample size as the training set. The validation set is used to select the tuning parameters. Specifically, let be candidates of tuning parameters, and be the canonical vectors obtained by using parameters , respectively. Then we compute corr for , and choose . The tuning parameter then is chosen to be , and the first pair of the estimated canonical vectors are and . After obtaining estimated canonical vectors, we compare the errors and for all five methods.

Note in (17), the parameter controls the correlation between and . Roughly speaking, a larger leads to a smaller correlation between and . Therefore, we choose , for to see the change of and when the correlation of and changes. For each , we run replicates. For -th replicate, we compute the estimated canonical vectors and , and use

 (^E∥^a−a1∥22)1/2= ⎷1NN∑j=1∥^a1,j−a1∥22, (^E∥^b−b1∥22)1/2= ⎷1NN∑j=1∥^b1,j−b1∥22

to approximate the root mean squared prediction error (RMSE)

 (E∥^a−a1∥22)1/2,(E∥^b−b1∥22)1/2,

respectively. We consider two cases, where the matrix is different. In both cases, we use the sample size . The matrix is randomly generated by

 αjk{∼Unif(0,2) % with probability 0.3,=0 with probability 0.7,

where

is the uniform distribution on the interval

.

Case 1:

In (17), we choose , where

 B1=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝211211⋱⋱⋱⋱112⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠∈Rd×d

is a tridiagonal matrix, and

is a zero matrix. The results under different

are shown in Figures 1 - 3. Figure 1: The approximated root mean squared prediction error (^E∥^a−a1∥22)1/2 for Case 1. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3. Figure 2: The approximated root mean squared prediction error (^E∥^b−b1∥22)1/2 for Case 1. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3. Figure 3: The average of approximated root mean squared prediction errors (^E∥^a−a1∥22)1/2+(^E∥^b−b1∥22)1/2)/2 for Case 1. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3.

Case 2: We choose in (17), where

 B2=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝121211\reflectbox$⋱$\reflectbox$⋱$1\reflectbox$⋱$\reflectbox$⋱$21⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠∈Rd×d,

and is a zero matrix. The results under different are shown in Figures 4 - 6. Figure 4: The approximated root mean squared prediction error (^E∥^a−a1∥22)1/2 for Case 2. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3. Figure 5: The approximated root mean squared prediction error (^E∥^b−b1∥22)1/2 for Case 2. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3. Figure 6: The average of approximated root mean squared prediction errors (^E∥^a−a1∥22)1/2+(E∥^b−b1∥22)1/2)/2 for Case 2. Panel 1: p=100,d=5. Panel 2: p=100,d=10. Panel 3: p=500,d=3. Panel 4: p=1,000,d=3.

From Figures 1 - 6, we can see that sCCA performs well in most cases when estimating the first canonical vector . isCCA is comparable to rgCCA on the estimation of . -CCA cannot provide a consistent estimator of . nCCA does not perform well in Case 2. However, when we turn to look at the estimation of the second canonical vector in the first pair of canonical vectors, we can see that nCCA, -CCA, sCCA, and rgCCA cannot provide a consistent estimator. Therefore, the average errors of these methods shown in Figure 3 and Figure 6 are large. This indicates that these methods are not appropriate when the dimensions of and are quite different, because these methods do not utilize the low dimensional structure of . isCCA works well on the estimation of . We can also see the prediction error of isCCA increases as increases, which is natural because the accuracy of the estimation of coefficients using (11

) is influenced by the variance

.

## 5 Real Data Examples

### 5.1 Analysis of Human Gut Microbiome Data

We applied imbalanced sparse CCA to a microbiome study conducted at University of Pennsylvania (Chen et al., 2012). The study profiled 16S rRNA in the human gut and measured components of nutrient intake using a food frequency questionnaire for 99 healthy people. Microbiome OTUs were consolidated at the genus level, with relatively common genera considered (i.e., was a OTU abundance matrix). Following Chen et al. (2012), the daily intake for nutrients was calculated for each person, and regressed upon energy consumption, and the residuals used as a processed nutrient intake matrix .

ssCCA (Chen et al., 2012) identified 24 nutrients and 14 genera whose linear combinations gave a cross-validated canonical correlation of 0.42 between gut bacterial abundance and nutrients. Imbalanced sparse CCA reached a canonical correlation of 0.60. To test the canonical correlation between gut bacterial abundance and nutrients, we permuted columns of the nutrient matrix 1,000 times, and calculated the canonical correlation between them using the five CCA methods described in Section 4. These correlations constitute a null distribution for each method, to which we compared the respective observed canonical correlation. The isCCA method was significant at the 0.05 level, with -value 0.025. Of the remaining methods, only sCCA and nCCA (with a large nugget parameter) also provided significant -values. However, results from nCCA appeared highly sensitive to the nugget parameter, and range of choices for nugget parameters produced nonsignificant -values. -CCA and rgCCA did not appear to provide insightful results for this dataset. The heatmap of the covariance matrix () is shown in Figure 7. The marginal plots of absolute values of and provide insights for the relative weighting of OTUs and nutritional components toward the overall canonical correlation, i.e. larger values correspond to greater weight for that component. Figure 7: The heatmap of the covariance matrix between gut bacterial abundance and nutrients.

### 5.2 Analysis of GTEx Thyroid Histology Images

The GTEx project offers an opportunity to explore the relationship between imaging and gene expression, while also considering the effect of a clinically-relevant trait. We obtained the original GTEx thyroid histology images (see Figure 8 for example) from the Biospecimen Research Database (https://brd.nci.nih.gov/brd/image-search/searchhome). These image files are in Aperio SVS format, a single-file pyramidal tiled TIFF. The RBioFormats R package (https://github.com/aoles/RBioFormats), which interfaces the OME Bio-Formats Java library (https://www.openmicroscopy.org/bio-formats), was used to convert the files to JPEG format. These images were further processed using the Bioconductor package EBImage (Pau et al., 2010). Following the method proposed by Barry et al. (2018) to segment individual tissue pieces, the average intensity across color channels was calculated, and adaptive thresholding was performed to distinguish tissue from background. A total of 108 independent Haralick image features were extracted from each tissue piece by calculating 13 base Haralick features for each of the three RGB color channels and across three Haralick scales by sampling every 1, 10, or 100 pixels. The features were log2-transformed and normalized to ensure feature comparability across samples.

To obtain a trait with clinical relevance, we also downloaded the thyroiditis Hashimoto pathology data from the GTEx Portal (https://www.gtexportal.org/home/histologyPage). Sex and age are also provided. The phenotype Hashimoto’s thyroiditis was the presence (coded 1) or absence (0) of a particular pathology.

For thyroid, a subset of these subjects (570) also had gene expression data from RNA-Seq. The v8 release is available on the GTEx Portal (https://www.gtexportal.org/home/datasets). Gene read counts were normalized between samples using TMM, genes were selected based on expression thresholds explained in their paper (Aguet et al., 2019). In this example, we collect both processed image feature matrix and gene expression data on these overlapped 570 subjects, with 37 cases of Hashimoto’s thyroiditis.

We applied imbalanced sparse CCA and other four methods in Section 4, to study the correlation between the processed image feature matrix and gene expression data. Since imbalanced sparse CCA works well for the low dimensional data

, we first applied principal component analysis (PCA) to reduce the dimension of

. We used the first half of principal components, denoted by , and performed imbalanced sparse canonical correlation analysis on the transformed variables and . We randomly split the data into a training set and testing set with ratio for 500 times. For each run, we obtained the first pair of estimated canonical vectors and by the methods mentioned in Section 4, and compared the correlations on the testing set . We found that nCCA is very sensitive to the value of the nugget parameter, so we did not include it in the comparison. The results obtained by isCCA, -CCA, sCCA and rgCCA are shown in Figure 9. Figure 9: Boxplots of the GTEx thyroid cross-validated canonical correlations of processed image feature matrix and gene expression data.

From Figure 9, we can see that rgCCA does not provide reliable estimation of the canonical variables in this study, and sCCA provides a smaller correlation between the estimated canonical variables on the testing data. isCCA is slightly better than -CCA. In some cases, isCCA provides relatively small correlations between the estimated canonical variables on the testing data. This may be because we fixed the number of principal components. Therefore, a further study on adaptively choosing number of principal components is needed.

In order to explore the effect of a clinical phenotype on isCCA, we performed isCCA separately on the set of individuals without Hashimoto’s thyroiditis (median isCCA of 0.578, nearly the same as for the full dataset), and for individuals with Hashimoto’s thyroiditis (median isCCA of 0.375). The dramatic change in estimated correlation by case/control status provides a window into potential additional uses of sparse CCA methods, e.g. by using the contrast in canonical correlation by phenotype to improve omics-based phenotype prediction.

### 5.3 Analysis of SNP Genotype Data and RNA-seq Gene Expression Data

We tested imbalanced sparse CCA and other four methods in Section 4 using data from the GTEx V8 release (https://www.gtexportal.org/home/), including genotype data from a selected set of SNPs and RNA-seq gene expression data from liver tissue samples. SNPs were coded from - as the number of minor alleles, and RNA-seq expression data were normalized using simple scaling. Among the problems that arise in such datasets is the powerful mapping of sets of SNPs that are collectively associated with expression traits. Here we use CCA to demonstrate a proof of principle for finding such collective association in a biological pathway. We selected SNP sets for each gene by grouping SNPs located within 5kb of a gene’s transcription start site (TSS). Then we grouped genes into gene sets based on the canonical pathways listed in the Molecular Signatures Database (MSigDB) v7.0 (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). These gene sets are canonical representations of a biological process compiled by domain experts. We analyzed 2,072 pathways with a size between 5-200 genes. So for each pathway, we have a genotype matrix with SNPs and samples and expression matrix with genes and samples, with . For each method, a permutation-based -value was calculated after performing 1,000 permutations.

Here we focus on two pathways of potential biological relevance in the liver, with strong eQTL evidence. One is the keratinization pathway (https://www.reactome.org/content/detail/R-HSA-6805567), which included 72 genes and 3,005 SNPs from our dataset. The -values were , and for isCCA, nCCA, sCCA, -CCA, and rgCCA respectively. Three genes, KRT13, KRT4, and KRT5, showed values of that are much larger than those of the remaining genes, while the values of are spread more uniformly across the SNPs. Keratins are important for the mechanical stability and integrity of epithelial cells and liver tissues. They play a role in protecting liver cells from apoptosis, against stress, and from injury, and defects may predispose to liver diseases (Moll et al., 2008). Figure 10 shows heatmap plots and Manhattan-style line plots showing the absolute values of and , in which SNPs (rows) and genes (columns) are ordered by genomic position. Figure 10: The heatmap plots and Manhattan-style line plots showing the absolute values of ^a and ^b, in which SNPs (rows) and genes (columns) are ordered by genomic position.

A smaller pathway is the synthesis of ketone bodies (https://www.reactome.org/content/detail/R-HSA-77111), with 8 genes and 265 SNPs. The -values were 0.003, 0.16, 0.52, 0.35, and 0.66 for isCCA, nCCA, sCCA, -CCA, and rgCCA respectively. Again three genes, ACSS3, BDH2, and BDH1 showed of greater magnitude than the others. Ketone bodies are metabolites derived from fatty and amino acids and are mainly produced in the liver. Both in the biosynthesis of ketone bodies (ketogenesis) and in ketone body utilization (ketolysis), inborn errors of metabolism are known, resulting in various metabolic diseases (Sass, 2012).

## 6 Conclusions and Discussion

In this work, we proposed imbalanced sparse canonical correlation analysis, which can be applied to data where the dimensions of two variables are very different. Our method can provide pairs of canonical vectors simultaneously for , and can be implemented by an efficient algorithm based on Lasso. The implementation is straightforward. The computation time can be significantly decreased if parallel computing is available. We show the consistency of the estimated canonical pairs in the case that the dimension of one variable can increase as an exponential rate in comparison to the sample size. The dimension of the other variable should be smaller than the sample size. We also present numerical studies and real data analysis to validate our methodology.

We consider the imbalanced case, where the dimensions of two variables are much different. In practice, there are also some cases that are balanced, i.e., the dimensions of two variables are comparable but both much larger than the sample size. One straightforward potential extension is to apply two sets of Lasso problems. Specifically, we might set as the dependent variable and as the independent variable in the first set of Lasso problems, and set as the dependent variable and as the independent variable in the second set of Lasso problems. However, the number of Lasso optimizations is very large, which leads to the inefficiency of the algorithm. One possible remedy is to apply principal components analysis to reduce the dimension of one variable. This approach has shown its potential in the real data analysis. However, the theoretical justification is currently lacking. Imbalanced sparse canonical correlation analysis also shows great potential for prediction problems, since it can provides pairs of canonical vectors simultaneously for . These possible extensions of imbalanced sparse canonical correlation analysis to the balanced case will be pursued in the future work.

## Appendix A Notation

We first present some notation used in Appendix. Let . Let be the -norm of a matrix , defined by

 ∥A∥p=supx≠0∥Ax∥p∥x∥p,

where is the norm of a vector . With an abuse of notation, we use for both -norm of a matrix and norm of a vector. Let . In the special cases ,

 ∥A∥1 =max1≤j≤nm∑i=1|aij|,∥A∥2=σmax(A),∥A∥∞=max1≤i≤mn∑j=1|aij|,

where is the largest singular value of matrix . These matrix norms are equivalent, which are implied by the following inequality,

 1√n∥A∥∞ ≤∥A∥2≤√m∥A∥∞, 1√m∥A∥1 ≤∥A∥2≤√n∥A∥1, ∥A∥max ≤∥A∥2≤√mn∥A∥max.

## Appendix B Proof of Proposition 1

Recall in (8), we have

 Σyy=B∗ΣxxBT∗+Σϵyϵy.

By Weyl’s theorem (Horn and Johnson (2012), Theorem 4.3.1), we have

 λmin(Σyy)≥λmin(Σϵyϵy)+λmin(B∗ΣxxBT∗)≥λmin(Σϵyϵy)≥K1,

where the last inequality is because of Assumption 1. Using Weyl’s theorem again, we can bound the largest eigenvalue by

 λmax(Σyy)≤ λmax(Σϵyϵy)+λmax(B∗ΣXXBT∗) ≤ λmax(Σϵyϵy)+∥B∗∥22λmax(ΣXX)≤K2,

where the last inequality is because of Assumptions 1 and 2. This finishes the proof.

## Appendix C Proof of Theorem 1

We first present some lemmas used in this proof. Lemma C.1 states the consistency of obtained by (12). Lemma C.2 is the Bernstein inequality. Lemma C.3 is the concentration inequality for sub-Gaussian random vectors. Lemma C.4 describes the accuracy of solving linear systems; see Theorem 2.7.3 in Van Loan and Golub (1983). Lemma C.5 states the eigenvector sensitivity for a pertubation of a matrix, which is a slight recasting of Theorem 4.11 in Stewart (1973); also see Corollary 7.2.6 in Van Loan and Golub (1983).

###### Lemma C.1.

Suppose the conditions of Theorem 1 hold. Then with probability at least ,

 max1≤k≤d∥^βk−β∗k∥1≤C2s∗√logpn,max1≤k≤d∥^βk−β∗k∥2≤C3√s