1 Introduction
Because of the complexity of most natural phenomena, it is unlikely that a single data source encapsulates all information necessary to fully understand the phenomenon of interest. For example, meteorologists cannot predict the weather based on temperature alone. Audio is only one component in surveillance efforts, and in science, it is known that DNA is not the sole driver of human health. Thus, it is not surprising that data integration, or the strategic analysis of multiple sources of data simultaneously, can lead to discoveries which are not evident in analyses of a single data source. Returning to the previous examples, meteorologists must integrate data from satellites, ground-based sensors, and numerical models (Ghil and Malanotte-Rizzoli 1991). Audio and video are often combined for surveillance as well as speech recognition (Shivappa et al. 2010), and as new technologies arise in biology, scientists are leveraging data integration to better understand complex biological processes (Lock et al. 2013). In general, by exploiting the diversity and commonalities among different datasets, data integration helps to build a holistic and perhaps more realistic model of the phenomena of interest. Furthermore, as technological advances fuel the growing number of available datasets, data integration is becoming increasingly necessary in numerous disciplines.
1.1 Related Work
Though the data deluge is a recent development, the concept of data integration is certainly not new. For instance, it is a fundamental idea in the classical canonical correlation analysis (CCA) (Hotelling 1936), which infers the relationship between features from two datasets.
More recently, data integration has revolved around factorization methods due to their ease of interpretability and computational feasibility. The framework of coupled matrix and tensor (CMTF) factorizations
(Singh and Gordon 2008; Acar et al. 2014) unifies a large class of these factorizations, while Joint and Individual Variation Explained (JIVE) (Lock et al. 2013) is another factorization method that is commonly used in integrative genomics. One general challenge with matrix factorization methods, however, is that they depend on the rank of the factorized matrices, which is frequently unknown and must be specified a priori.

In contrast to the matrix factorization methods, the generalized SVD (GSVD) provides an exact matrix decomposition for coupled data (Alter et al. 2003; Ponnapalli et al. 2011) and is not dependent on the matrix rank. However, it is limited in scope. The GSVD assumes each matrix has full row rank, excluding integrated problems with both high-dimensional and low-dimensional datasets.
There has also been work on extending principal components analysis (PCA) to integrated data problems. This family of methods is known as the multi-block PCA family, and it includes Multiple Factor Analysis (MFA) (Escofier and Pagès 1994; Abdi et al. 2013) and Consensus PCA (Westerhuis et al. 1998). In these methods, each data matrix is normalized according to a specific procedure, then the data matrices are concatenated together, and finally, PCA is performed on the normalized, concatenated data. Closely related to this is Distributed PCA (Fan et al. 2017), which integrates data that are stored across multiple servers.
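To make the normalize-concatenate-decompose recipe concrete, the following sketch (the helper name `multiblock_pca` and the MFA-style normalization by each block's largest singular value are our own choices, one assumption among several possible normalization procedures) performs a multi-block PCA with NumPy:

```python
import numpy as np

def multiblock_pca(blocks, n_components=2):
    """MFA-style multi-block PCA sketch: column-center each block, divide it
    by its largest singular value, concatenate the blocks column-wise, and
    run PCA on the result via the SVD."""
    centered = [B - B.mean(axis=0) for B in blocks]
    normalized = [B / np.linalg.svd(B, compute_uv=False)[0] for B in centered]
    X = np.hstack(normalized)                          # n x (p_1 + ... + p_K)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # sample-space patterns
    loadings = Vt[:n_components].T                     # feature-space patterns
    return scores, loadings
```

Other members of the multi-block PCA family differ essentially only in the normalization step applied to each block.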
Thus far, the existing methods have largely been algorithmic constructions, which lack a rigorous statistical model. As a result, the statistical properties and theoretical foundations of data integration methods are generally unknown. We begin to bridge this gap between data integration methodology and statistical theory by developing a new data integration method, Integrated Principal Components Analysis (iPCA), which extends a model-based framework of PCA to the integrated data setting. The two main building blocks of iPCA are PCA and the matrix-variate normal distribution, which we review next.
1.2 Principal Components Analysis
Given a dataset $X \in \mathbb{R}^{n \times p}$, recall that PCA finds orthogonal directions $v_1, \dots, v_p \in \mathbb{R}^p$, which maximize the covariance $v^T \Delta v$, where $\Delta$ denotes the covariance matrix of the $p$ features. That is, for each $k = 1, \dots, p$,

(1) $\quad v_k = \operatorname*{argmax}_{v \in \mathbb{R}^p} \; v^T \Delta v \quad \text{subject to} \quad v^T v = 1 \ \text{ and } \ v^T v_j = 0 \ \forall \, 1 \le j < k.$
It is well-known that the $k$-th PC loading $v_k$ is the eigenvector of $\Delta$ with the $k$-th largest eigenvalue, and its corresponding PC score is $X v_k$. In practice, since the population covariance $\Delta$ is typically unknown, the sample version of PCA plugs in an estimate $\hat{\Delta} = \frac{1}{n} X^T X$ for $\Delta$ in (1) (assuming $X$ is column-centered). It follows that the PC loadings are the eigenvectors of $X^T X$, and the PC scores are the eigenvectors of $X X^T$. Equivalently, if the SVD gives $X = U D V^T$, then the columns of $V$ and $U$ are the PC loadings and scores, respectively.

To later establish the link between iPCA and PCA, we point out that $\frac{1}{n} X^T X$ is the maximum likelihood estimator (MLE) of $\Delta$ under the multivariate normal model $x_1, \dots, x_n \overset{iid}{\sim} N(0, \Delta)$, and $\frac{1}{p} X X^T$ is the MLE of $\Sigma$ under $x^1, \dots, x^p \overset{iid}{\sim} N(0, \Sigma)$. Here, $x_i$ is the $i$-th row of $X$, and $x^j$ is the $j$-th column of $X$. While this is not the only way of viewing PCA, this framework illustrates two points which we explore further in iPCA: (1) PCA can be framed as maximizing the variance in $X$ under a normal model; (2) eigenvectors correspond to the dominant (or variance-maximizing) patterns in the data.

1.3 Matrix-variate Normal Model
The matrix-variate normal distribution (Gupta and Nagar 1999; Dawid 1981) is an extension of the multivariate normal distribution such that the matrix is the fundamental object of study. We say that an $n \times p$ random matrix $X$ follows a matrix-variate normal distribution and write $X \sim N_{n \times p}(M, \Sigma \otimes \Delta)$ if $\operatorname{vec}(X)$ follows a multivariate normal distribution with Kronecker product covariance structure, $\operatorname{vec}(X) \sim N(\operatorname{vec}(M), \Delta \otimes \Sigma)$, where $\operatorname{vec}(X)$ is the column vector formed by stacking the columns of $X$ below one another. We call $M \in \mathbb{R}^{n \times p}$ the mean matrix, $\Sigma \in \mathbb{S}_{++}^{n}$ the row covariance matrix, and $\Delta \in \mathbb{S}_{++}^{p}$ the column covariance matrix, where $\mathbb{S}_{++}^{d}$ denotes the set of $d \times d$ symmetric positive definite matrices.

Intuitively, the row covariance $\Sigma$ encodes the dependencies between rows of $X$, and the column covariance $\Delta$ encodes the dependencies among the columns of $X$, i.e. the $i$-th row satisfies $x_i \sim N(m_i, \Sigma_{ii} \Delta)$ and the $j$-th column satisfies $x^j \sim N(m^j, \Delta_{jj} \Sigma)$. Given these normally-distributed marginals, it can be shown that if $\Sigma = I_n$ and $M = 0$, we are in the familiar multivariate normal setting, $x_1, \dots, x_n \overset{iid}{\sim} N(0, \Delta)$. Similarly, if $\Delta = I_p$ and $M = 0$, then $x^1, \dots, x^p \overset{iid}{\sim} N(0, \Sigma)$.
An important distinction between the multivariate and matrixvariate normal distributions is that the multivariate normal can only model relationships between elements of a single row or a single column in whereas the matrixvariate normal can model relationships between elements from different rows and columns. The matrixvariate normal is thus far more general than the multivariate normal. With this level of generality, the matrixvariate normal has proven to be a versatile tool in various contexts such as graphical models (Yin and Li 2012; Tsiligkaridis et al. 2013; Zhou 2014a), spatiotemporal models (Greenewald and Hero 2015), and transposable models (Allen and Tibshirani 2010). Our work on iPCA is the first to consider the matrixvariate normal model in light of data integration.
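The Kronecker structure can be made concrete with a short sketch (the helper names `sample_matrix_normal` and `vec` are ours; we assume the convention that $\operatorname{vec}$ stacks columns, so that the covariance of $\operatorname{vec}(X)$ is $\Delta \otimes \Sigma$):

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Delta, rng):
    """Draw one X ~ N(M, Sigma (x) Delta): with Sigma = A A^T and
    Delta = B B^T, the matrix X = M + A Z B^T has row covariance Sigma and
    column covariance Delta when Z has i.i.d. N(0, 1) entries."""
    A = np.linalg.cholesky(Sigma)
    B = np.linalg.cholesky(Delta)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

def vec(X):
    """Column vector formed by stacking the columns of X."""
    return X.reshape(-1, order="F")
```

The identity $\operatorname{vec}(A Z B^T) = (B \otimes A)\operatorname{vec}(Z)$ shows why the covariance of $\operatorname{vec}(X)$ is $(B B^T) \otimes (A A^T) = \Delta \otimes \Sigma$.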
1.4 Overview
iPCA is a generalization of PCA for integrated data via matrix-variate normal models, and it serves as a practical tool to identify and visualize joint patterns that are shared by multiple datasets. We formally introduce iPCA in Section 2, and in Section 3, we discuss covariance estimation methods for iPCA and highlight our two main theoretical contributions: proving subspace consistency of the additive correlation iPCA estimator and proving global convergence of the multiplicative Frobenius iPCA estimator. Beyond these theoretical guarantees, iPCA has several practical advantages: (1) since iPCA is not a matrix factorization method, it is not dependent on the rank of the factorized matrices; instead, iPCA gives ordered, nested, orthogonal PCs as in PCA; (2) iPCA (with the multiplicative Frobenius estimator) converges to the global solution, which is important for reproducibility; and (3) from an interpretability standpoint, iPCA can extract patterns in feature and sample space simultaneously and provides convenient visualizations of these patterns. Furthermore, we will empirically demonstrate in Section 4, through extensive simulations and a case study with integrative genomics for Alzheimer's Disease, the strong performance of iPCA in practice. Finally, we conclude with a discussion of iPCA and future work in Section 5.
2 Integrated PCA
Similar to PCA, iPCA is an unsupervised tool for exploratory data analysis, pattern recognition, and visualization. Unlike PCA, however, iPCA extracts the dominant joint patterns which are common to multiple datasets, not necessarily the variance-maximizing patterns, since the latter might be specific to one dataset. These joint patterns are typically of considerable interest to practitioners, as their common occurrence in multiple datasets may point to some foundational mechanism or structure.

Generally speaking, we find these joint patterns by modeling the dependencies between and within datasets via the matrix-variate normal model. The inherent Kronecker product covariance structure enables us to decompose the total covariance of each data matrix into two separable components: an individual column covariance structure which is unique to each dataset and a joint row covariance structure which is shared among all datasets. The joint row covariance structure is our primary interest, and maximizing this joint variation will yield the dominant patterns which are common across all datasets.
2.1 Population Model of iPCA
Suppose we observe $K$ coupled data matrices, $X_1, \dots, X_K$, of dimensions $n \times p_1, \dots, n \times p_K$, where $n$ is the number of samples and $p_k$ is the number of features in $X_k$. Throughout this paper, we let $p = \sum_{k=1}^{K} p_k$ and $X = [X_1 \cdots X_K] \in \mathbb{R}^{n \times p}$. Suppose also that each of the data matrices has a distinct set of features that are measured on the same samples and that all of the rows of $X_1, \dots, X_K$ perfectly align (see Figure 1). Under the iPCA model, we assume that each dataset arises from a matrix-variate normal distribution. Namely, for each $k = 1, \dots, K$,
(2) $\quad X_k \sim N_{n \times p_k}\!\left(\mathbf{1}_n \mu_k^T, \; \Sigma \otimes \Delta_k\right),$

where $\mathbf{1}_n$ is an $n \times 1$ column vector of ones, $\mu_k \in \mathbb{R}^{p_k}$ is a vector of column means specific to $X_k$, $\Sigma \in \mathbb{S}_{++}^{n}$ is an $n \times n$ row covariance matrix that is jointly shared by all data matrices, and $\Delta_k \in \mathbb{S}_{++}^{p_k}$ is a $p_k \times p_k$ column covariance matrix that is specific to $X_k$.
Given the underlying model in (2), we define the population version of iPCA to maximize the row and column covariances. For $i = 1, 2, \dots$ and $k = 1, \dots, K$, iPCA solves

(3) $\quad u_i = \operatorname*{argmax}_{u \in \mathbb{R}^{n}} \; u^T \Sigma u \quad \text{subject to} \quad u^T u = 1 \ \text{ and } \ u^T u_j = 0 \ \forall \, j < i,$

(4) $\quad v_{k,i} = \operatorname*{argmax}_{v \in \mathbb{R}^{p_k}} \; v^T \Delta_k v \quad \text{subject to} \quad v^T v = 1 \ \text{ and } \ v^T v_{k,j} = 0 \ \forall \, j < i,$

for which we know the solution to be given by the eigendecompositions of $\Sigma$ and $\Delta_k$. That is, $u_i$ is the eigenvector of $\Sigma$ with the $i$-th largest eigenvalue, and $v_{k,i}$ is the eigenvector of $\Delta_k$ with the $i$-th largest eigenvalue. Most notably, $u_1$ maximizes the joint variation and is interpreted as the most dominant pattern shared by all datasets. We call the columns of $U = [u_1, u_2, \dots]$ the integrated principal component (iPC) scores and the columns of $V_k = [v_{k,1}, v_{k,2}, \dots]$ the iPC loadings for the $k$-th dataset. Since we are interested in the joint patterns, we primarily plot the iPC scores to visualize the joint patterns in sample space.
Remark.
The population covariances in (2) are not identifiable (e.g. $\Sigma \otimes \Delta_k = (c\,\Sigma) \otimes (\frac{1}{c}\Delta_k)$ for any $c > 0$), but the iPC scores and loadings are identifiable since eigenvectors are scale-invariant.
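The scale-invariance in the remark is easy to check numerically; in the sketch below (hypothetical helper `top_evec`), rescaling a covariance matrix leaves its leading eigenvector unchanged:

```python
import numpy as np

def top_evec(S):
    """Leading eigenvector of a symmetric matrix, with the sign fixed so the
    largest-magnitude entry is positive (eigenvectors are defined up to sign)."""
    evals, evecs = np.linalg.eigh(S)
    v = evecs[:, -1]
    return v if v[np.argmax(np.abs(v))] > 0 else -v
```

Scaling $\Sigma \mapsto c\,\Sigma$ rescales the eigenvalues by $c$ but leaves `top_evec(S)` unchanged, which is why the iPC scores and loadings are identifiable even though the pair $(\Sigma, \Delta_k)$ is not.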
To build intuition for the iPCA model (2), we note that $\mathbf{1}_n \mu_k^T$ is the $n \times p_k$ matrix of column means of $X_k$. Moreover, by properties of the matrix-variate normal, we interpret $\Delta_k$ as describing the covariance structure among the features in $X_k$, giving rise to patterns unique to $X_k$, and we interpret $\Sigma$ as describing the common covariance structure among the samples, corresponding to joint patterns that are shared by all datasets. Hence, maximizing the joint row covariance in (3) yields the dominant patterns among samples, which are common to all datasets.
We also point out that if $\Sigma = I$ and $K = 1$, then $v_{1,1}, v_{1,2}, \dots$ are the PC loadings from PCA, and if $\Delta_1 = I$ and $K = 1$, then $u_1, u_2, \dots$ are the PC scores from PCA. This reveals the fundamental difference between PCA and iPCA: PCA computes the PC loadings and scores by assuming independent samples and features, respectively, while iPCA models the dependencies among samples and features concurrently, making no assumptions on independence.
2.2 Sample Version of iPCA
To go from the population iPCA model to the sample version, we simply plug in estimators $\hat{\Sigma}$ and $\hat{\Delta}_k$ for $\Sigma$ and $\Delta_k$. The sample version of iPCA can be summarized as follows:

Model each dataset via a matrix-variate normal model with Kronecker product covariance structure: $X_k \sim N_{n \times p_k}(\mathbf{1}_n \mu_k^T, \Sigma \otimes \Delta_k)$ for each $k = 1, \dots, K$.

Estimate the covariance matrices to obtain $\hat{\Sigma}$ and $\hat{\Delta}_1, \dots, \hat{\Delta}_K$. Methods for covariance estimation will be discussed in detail in Section 3.

Compute the eigenvectors of $\hat{\Sigma}$ and $\hat{\Delta}_k$, say $\hat{u}_1, \hat{u}_2, \dots$ and $\hat{v}_{k,1}, \hat{v}_{k,2}, \dots$, respectively. We interpret the $\hat{u}_i$'s as the dominant joint patterns in sample space and the $\hat{v}_{k,i}$'s as the dominant patterns in feature space which are specific to $X_k$.

Visualize the dominant joint patterns by plotting the iPC scores $\hat{u}_1, \hat{u}_2, \dots$ against one another.
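The steps above can be sketched end-to-end as follows (hypothetical helper `ipca_scores`; for brevity this sketch uses the naive estimate that takes each $\Delta_k = I$, whereas a real analysis would use one of the penalized estimators discussed in Section 3):

```python
import numpy as np

def ipca_scores(blocks, n_components=2):
    """Sketch of the sample iPCA pipeline with a naive covariance estimate:
    column-center each block, form Sigma_hat = (1/p) sum_k Xc_k Xc_k^T
    (i.e. assume Delta_k = I), and return its top eigenvectors as the
    joint iPC scores."""
    centered = [B - B.mean(axis=0) for B in blocks]
    p = sum(B.shape[1] for B in centered)
    Sigma_hat = sum(B @ B.T for B in centered) / p    # n x n row covariance
    evals, evecs = np.linalg.eigh(Sigma_hat)          # ascending eigenvalues
    return evecs[:, np.argsort(evals)[::-1][:n_components]]
```

Plotting the first two columns of the returned matrix against each other visualizes the dominant joint patterns in sample space.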
Figure 2 illustrates a motivating example for when iPCA is advantageous. In the example, strong dependencies among features obscure the true joint patterns among the samples. As a result, applying PCA separately to each of the datasets (panels B-D) fails to reveal the joint signal. iPCA can better recover the joint signal because it exploits the known integrated data structure and extracts the shared information among all three datasets simultaneously.
2.2.1 Variance Explained by iPCA
After performing iPCA, one might wonder how to interpret the iPCs. Analogous to PCA, we can define a notion of variance explained.
Definition 1.
Assume that $X_k$ has been column-centered. We define the cumulative proportion of variance explained in dataset $k$ by the top $m$ iPCs to be

(5) $\quad \dfrac{\left\| \tilde{U}^T X_k \tilde{V}_k \right\|_F^2}{\left\| X_k \right\|_F^2},$

where $\tilde{U} = [\hat{u}_1, \dots, \hat{u}_m]$ are the top $m$ iPC scores, and $\tilde{V}_k = [\hat{v}_{k,1}, \dots, \hat{v}_{k,m}]$ are the top $m$ iPC loadings associated with $X_k$.
Definition 2.
The marginal proportion of variance explained in dataset $k$ by the $i$-th iPC is defined as the difference between the cumulative proportions of variance explained by the top $i$ and the top $i - 1$ iPCs.
We verify in Appendix A that (5) is a proportion and monotonically increasing as $m$ increases. Aside from being well-defined, we also show in Appendix A that (5) generalizes the cumulative proportion of variance explained in PCA and is hence a natural definition.
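As a sketch (hypothetical helper `cum_var_explained`, assuming the cumulative proportion in (5) takes the form $\|\tilde{U}^T X_k \tilde{V}_k\|_F^2 / \|X_k\|_F^2$ with orthonormal scores and loadings):

```python
import numpy as np

def cum_var_explained(Xk, U, Vk, m):
    """Cumulative proportion of variance in the column-centered block Xk
    explained by the top m iPCs, computed as
    ||U_m^T Xk V_m||_F^2 / ||Xk||_F^2."""
    Um, Vm = U[:, :m], Vk[:, :m]
    num = np.linalg.norm(Um.T @ Xk @ Vm, "fro") ** 2
    return num / np.linalg.norm(Xk, "fro") ** 2
```

In the single-dataset special case where $U$ and $V$ come from the SVD of $X$, this reduces to the usual PCA ratio $\sum_{i \le m} d_i^2 / \sum_i d_i^2$.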
Remark.
Unlike in PCA, the marginal proportions of variance explained need not be decreasing in iPCA; for example, the second iPC may explain more variance in dataset $k$ than the first. This is because iPCA does not maximize the total variance of each dataset; such a reversal simply means that dataset $k$ contributed more variation to the joint pattern in iPC 2 than in iPC 1.
2.3 Relationship to Existing Methods
Throughout our development of iPCA, we have established the connection between iPCA and PCA, demonstrating that iPCA is indeed a generalization of PCA. We also find it instructive to draw connections between iPCA and other existing data integration methods.
Relationship to Multiblock PCA Family
As discussed in Abdi et al. (2013), multi-block PCA methods reduce to performing PCA on the normalized concatenated data $[\tilde{X}_1 \cdots \tilde{X}_K]$, where each $X_k$ has been normalized to $\tilde{X}_k$ according to some procedure. We show in Proposition 1 that performing PCA on the unnormalized concatenated data $X = [X_1 \cdots X_K]$ (referred to as concatenated PCA) is a special case of iPCA, where we assume $\Delta_k = I$ for each $k = 1, \dots, K$. Proposition 1 can easily be extended to show that multi-block PCA methods are a special case of iPCA for some fixed $\Delta_k$, where the exact form of $\Delta_k$ depends on the normalization procedure. For example, since MFA normalizes $X_k$ by dividing all of its entries by its largest singular value $\sigma_1(X_k)$, MFA is a special case of iPCA, where each $\Delta_k = \sigma_1(X_k)^2 I$. Consequently, iPCA can be viewed as a unifying framework for the multi-block PCA family.

Relationship to Matrix Factorizations
Since coupled matrix factorizations (CMF) decompose the data into the product of low-rank joint and individual factors, an argument similar to Theorem 2 from Hastie et al. (2015) shows that CMF is closely related to the SVD of the concatenated data. More precisely, one solution of the CMF optimization problem is given by the PC scores and loadings from concatenated PCA, which we know to be a special case of iPCA by Proposition 1. JIVE, on the other hand, is an additive model and decomposes coupled data into the sum of a low-rank joint variation matrix, a low-rank individual variation matrix, and an error matrix. While CMF and iPCA are closely related, JIVE and iPCA are quite different models and are each advantageous in different situations.
3 Covariance Estimators for iPCA
Fitting the iPCA model requires estimating the population parameters in (2). As in PCA, we consider maximum likelihood estimation. However, we will see in Section 3.1 that there are substantial challenges with this approach, so we propose new estimators in Section 3.2.
3.1 Unpenalized Maximum Likelihood Estimators
Under the iPCA population model (2), the log-likelihood function reduces to

(6) $\quad \ell(\Sigma, \Delta_1, \dots, \Delta_K, \mu_1, \dots, \mu_K) = c - \frac{1}{2}\left[\, p \log|\Sigma| + n \sum_{k=1}^{K} \log|\Delta_k| + \sum_{k=1}^{K} \operatorname{tr}\!\left( \Sigma^{-1} (X_k - \mathbf{1}_n \mu_k^T)\, \Delta_k^{-1} (X_k - \mathbf{1}_n \mu_k^T)^T \right) \right],$

where $c$ is a constant that does not depend on the parameters.
Taking partial derivatives of (6) with respect to each parameter gives the following:
Lemma 1.
The unpenalized MLEs of $\mu_k$, $\Sigma$, and $\Delta_k$ satisfy

(7) $\quad \hat{\mu}_k = \frac{1}{n} X_k^T \mathbf{1}_n,$

(8) $\quad \hat{\Sigma} = \frac{1}{p} \sum_{k=1}^{K} (X_k - \mathbf{1}_n \hat{\mu}_k^T)\, \hat{\Delta}_k^{-1} (X_k - \mathbf{1}_n \hat{\mu}_k^T)^T,$

(9) $\quad \hat{\Delta}_k = \frac{1}{n} (X_k - \mathbf{1}_n \hat{\mu}_k^T)^T\, \hat{\Sigma}^{-1} (X_k - \mathbf{1}_n \hat{\mu}_k^T).$
Notice that the unpenalized MLE of $\mu_k$ is the vector of column means of $X_k$, so $X_k - \mathbf{1}_n \hat{\mu}_k^T$ is simply $X_k$ with its columns centered to mean $0$. We can proceed to compute the unpenalized MLEs of $\Sigma$ and $\Delta_1, \dots, \Delta_K$ via a Flip-Flop algorithm (also known as block coordinate descent) analogous to Dutilleul (1999), which iteratively optimizes over each of the parameters while keeping all other parameters fixed.
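A minimal sketch of these alternating updates (hypothetical helper `flipflop_mle`), assuming for simplicity that the mean matrices are known to be zero and that the MLEs exist:

```python
import numpy as np

def flipflop_mle(blocks, n_iter=3):
    """Sketch of the unpenalized Flip-Flop (block coordinate descent) updates
    suggested by (8)-(9), assuming zero means and existence of the MLEs:
        Sigma   <- (1/p) sum_k  Xk Delta_k^{-1} Xk^T
        Delta_k <- (1/n) Xk^T Sigma^{-1} Xk
    """
    n = blocks[0].shape[0]
    p = sum(X.shape[1] for X in blocks)
    Deltas = [np.eye(X.shape[1]) for X in blocks]
    for _ in range(n_iter):
        Sigma = sum(X @ np.linalg.solve(D, X.T)
                    for X, D in zip(blocks, Deltas)) / p
        Deltas = [X.T @ np.linalg.solve(Sigma, X) / n for X in blocks]
    return Sigma, Deltas
```

Each update is the conditional MLE of one factor given the other, so the log-likelihood is non-decreasing across iterations.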
However, because we only have one matrix observation per matrixvariate normal model, existence of the unpenalized MLEs is not guaranteed. In fact, the following theorem essentially implies that the unpenalized MLEs do not exist for all practical purposes.
Theorem 1.
The proof of Theorem 1 also shows that if the unpenalized MLEs for $\Sigma$ and $\Delta_1, \dots, \Delta_K$ exist, then necessarily $n \geq p_k$ for each $k$. Thus in summary, if $p_k > n$ for some $k$ or if the population means are unknown (which is almost always the case), then the unpenalized MLEs do not exist. These severe restrictions motivate new covariance estimators.
For example, we can naively estimate $\Sigma$ and $\Delta_k$ by setting their counterparts to $I$.
Proposition 1.

The MLE for $\Delta_k$, assuming that $\Sigma = I$, is

(10) $\quad \hat{\Delta}_k = \frac{1}{n} X_k^T X_k.$

Let $X = [X_1 \cdots X_K]$. The MLE for $\Sigma$, assuming $\Delta_k = I$ for all $k$, is

(11) $\quad \hat{\Sigma} = \frac{1}{p} \sum_{k=1}^{K} X_k X_k^T = \frac{1}{p} X X^T.$
This approach for estimating $\Sigma$ is the familiar MLE for the concatenated data $X$. Hence, the eigenvectors of $\hat{\Sigma}$ from (11) are the PC scores from concatenated PCA (i.e. PCA applied to $X$). While this illuminates another way in which iPCA is related to PCA, we will see in Section 4 that concatenated PCA performs poorly when the datasets are on different scales. In the next section, we discuss more effective methods for estimating $\Sigma$ and $\Delta_k$.
3.2 Penalized Maximum Likelihood Estimators
Since the unpenalized MLEs do not exist for a large class of problems, we instead look to penalized maximum likelihood estimation. To reduce notational clutter, we assume that each $X_k$ has been column-centered. Then, the penalized maximum likelihood problem simplifies to

(12) $\quad \min_{\Sigma \succ 0, \; \Delta_1, \dots, \Delta_K \succ 0} \; p \log|\Sigma| + n \sum_{k=1}^{K} \log|\Delta_k| + \sum_{k=1}^{K} \operatorname{tr}\!\left( \Sigma^{-1} X_k \Delta_k^{-1} X_k^T \right) + P(\Sigma^{-1}, \Delta_1^{-1}, \dots, \Delta_K^{-1}),$

where $P$ is a penalty function.
Similar to previous work on the penalized matrix-variate normal log-likelihood (Yin and Li 2012; Allen and Tibshirani 2010), we can consider an additive penalty term and define the additive iPCA penalty to be $P(\Sigma^{-1}, \Delta_1^{-1}, \dots, \Delta_K^{-1}) = \lambda_{\Sigma}\, g(\Sigma^{-1}) + \sum_{k=1}^{K} \lambda_{\Delta_k}\, g(\Delta_k^{-1})$, where $\lambda_{\Sigma}, \lambda_{\Delta_1}, \dots, \lambda_{\Delta_K} \geq 0$ are tuning parameters. We can take $g(\cdot)$ to be the element-wise $\ell_1$ norm to induce sparsity or the squared Frobenius norm to induce smoothness. However, solving (12) with these additive penalties is a non-convex problem, for which we can only guarantee a local solution.
Nevertheless, even though (12) is non-convex in Euclidean space, Wiesel (2012) showed that the matrix-variate normal log-likelihood is geodesically convex (g-convex) with respect to the manifold of positive definite matrices. G-convexity is a generalized notion of convexity on a Riemannian manifold, and like convexity, all local minima of g-convex functions are globally optimal. We thus exploit the g-convex properties of the matrix-variate normal log-likelihood and construct a new penalty, the multiplicative Frobenius iPCA penalty, $\sum_{k=1}^{K} \lambda_k \|\Sigma^{-1}\|_F^2 \|\Delta_k^{-1}\|_F^2$, which we will show to be g-convex in Theorem 3. Note that because $\|\Sigma^{-1}\|_F \, \|\Delta_k^{-1}\|_F = \|\Sigma^{-1} \otimes \Delta_k^{-1}\|_F$, the multiplicative penalty can be rewritten as $\sum_{k=1}^{K} \lambda_k \|\Sigma^{-1} \otimes \Delta_k^{-1}\|_F^2$, a penalty on the product covariance, giving rise to its name.
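The Kronecker-Frobenius identity behind the penalty's name is easy to verify numerically; the sketch below also computes one plausible form of the multiplicative penalty (both helper names, and the exact form of the penalty, are assumptions of this sketch):

```python
import numpy as np

def frob_kron_identity(A, B):
    """Return (||A kron B||_F, ||A||_F * ||B||_F); the two are equal, which
    is the identity underlying the 'multiplicative' Frobenius penalty."""
    lhs = np.linalg.norm(np.kron(A, B), "fro")
    rhs = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
    return lhs, rhs

def mult_frobenius_penalty(Sigma_inv, Delta_invs, lams):
    """One plausible form of the multiplicative Frobenius iPCA penalty
    (assumed here): sum_k lam_k * ||Sigma^{-1}||_F^2 * ||Delta_k^{-1}||_F^2."""
    s2 = np.linalg.norm(Sigma_inv, "fro") ** 2
    return sum(lam * s2 * np.linalg.norm(D, "fro") ** 2
               for lam, D in zip(lams, Delta_invs))
```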
The Flip-Flop algorithms for solving (12) with various penalties are derived in Appendix B.2, but in general, for the Frobenius penalties, each Flip-Flop update has a closed-form solution determined by a full eigendecomposition. For the $\ell_1$ penalties (also known as the Kronecker Graphical Lasso (Tsiligkaridis et al. 2013)), each update can be solved by the graphical lasso (Hsieh et al. 2011). We provide the multiplicative Frobenius Flip-Flop algorithm here in Algorithm 1, but since the other algorithms take similar forms, we leave them to Appendix B.2. The next theorem guarantees numerical convergence of the Flip-Flop algorithms for the multiplicative Frobenius, additive Frobenius, and additive $\ell_1$ penalties.
Theorem 2.
Suppose that the objective function in (12) is bounded below. Suppose also that the penalty $P$ either (i) is a differentiable convex function with respect to each coordinate or (ii) decomposes as a sum of (non-differentiable) convex functions, one in each coordinate. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (12) converges to a stationary point of the objective.
While Theorem 2 only guarantees convergence to a local solution, we can build upon Wiesel (2012) to prove a much stronger statement for the multiplicative Frobenius iPCA estimator.
Theorem 3.
The objective in (12) with the multiplicative Frobenius iPCA penalty is jointly geodesically convex in $\Sigma$ and $\Delta_1, \dots, \Delta_K$. Because of this, the Flip-Flop algorithm for the multiplicative Frobenius iPCA estimator given in Algorithm 1 converges to the global solution.
There are currently only a handful of non-convex problems for which an achievable global solution exists, so this guarantee that the multiplicative Frobenius iPCA estimator always reaches a global solution is both rare and highly desirable. In Section 4, we will also see that the multiplicative Frobenius iPCA estimator gives the best empirical performance, indicating that in addition to its theoretical advantages from global convergence, there are significant practical advantages associated with the g-convex penalty. A self-contained review of g-convexity and the proof of Theorem 3 are given in Appendix C.
3.2.1 Subspace Consistency of the Additive Correlation iPCA Estimator
In this section, we begin exploring statistical properties of iPCA by proving subspace consistency for a slight variation of the additive $\ell_1$ iPCA estimator. Rather than applying the $\ell_1$ penalty to the inverse covariance matrices, we can apply it to the inverse correlation matrices, as outlined by Algorithm 6 in Appendix D. This approach was adopted in Zhou (2014a) and Rothman et al. (2008), and we extend their idea to integrated data. Throughout this section, we refer to the estimator obtained from Algorithm 6 as the additive correlation iPCA estimator. We will show in Theorem 4 that the one-step additive correlation iPCA estimator consistently estimates the underlying joint subspace.
Though previous works have studied the convergence of related sparse penalties, none are directly applicable to data integration. For instance, Tsiligkaridis et al. (2013) proved convergence rates for KGlasso but assumed multiple matrix observations per matrix-variate normal model. Since iPCA assumes only one matrix observation per model, the results from Tsiligkaridis et al. (2013) do not apply. In fact, the theory suggests it is difficult to consistently estimate the diagonal entries of the covariances with only one matrix observation. This motivated us to follow the approach in Zhou (2014a), where convergence rates were derived for the one-step (non-iterative) version of Algorithm 6. Zhou (2014a) assumed that only one matrix instance was observed from the matrix-variate normal distribution, so we extend the proof techniques and results in Zhou (2014a), where $K = 1$, to iPCA, where we observe one matrix instance for each of the $K$ distinct matrix-variate normal models.
For simplicity, we assume that the initial covariance estimate in Algorithm 6 is the identity matrix, and like Zhou (2014a), we analyze the one-step version of Algorithm 6. That is, we stop the algorithm after one iteration of the while loop. For now, we also assume that $n \geq p_k$ for each $k$, or what is classically known as the "large $n$" setting. We will later discuss how to adapt our results to the case where $p_k > n$ for each $k$, or the "large $p$" setting.
We next collect notation. Suppose that for each $k$, the true population model is given by $X_k \sim N_{n \times p_k}(0, \Sigma_0 \otimes \Delta_{0,k})$, and let $\rho(\Sigma_0)$ and $\rho(\Delta_{0,k})$ denote the correlation matrices associated with $\Sigma_0$ and $\Delta_{0,k}$, respectively. For identifiability, the scales of $\Sigma_0$ and $\Delta_{0,k}$ are fixed so that each Kronecker product $\Sigma_0 \otimes \Delta_{0,k}$ determines its factors uniquely. If $A$ is a matrix, let $\|A\|_2$ denote the operator norm, i.e. the maximum singular value of $A$, and let $\|A\|_F$ denote the Frobenius norm (i.e. $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$). Let $s_k$ denote the number of nonzero off-diagonal entries in $\rho(\Delta_{0,k})^{-1}$, and let $s = \max_k s_k$. Write $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ for the minimum and maximum eigenvalues of $A$. We write $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$, and $a_n \asymp b_n$ if there exist positive constants $c$ and $C$ such that $c \leq a_n / b_n \leq C$ for all sufficiently large $n$.
The following assumptions are needed to establish subspace consistency of .

Assume that the inverse correlation matrices of $\Sigma_0$ and of each $\Delta_{0,k}$ are sparse with respect to each other's dimensions for each $k = 1, \dots, K$.

Assume that we have uniformly bounded spectra: the eigenvalues of $\Sigma_0$ and of each $\Delta_{0,k}$ are bounded away from $0$ and $\infty$ for each $k = 1, \dots, K$.

Assume that the inverse correlation matrices $\rho(\Sigma_0)^{-1}$ and $\rho(\Delta_{0,k})^{-1}$ have uniformly bounded norms for each $k = 1, \dots, K$.

Assume that $K$ is finite and that the growth rates of $n$ and $p_k$ are compatible for each $k = 1, \dots, K$.
Under these four assumptions, we summarize our main statistical convergence result:
Theorem 4.
Suppose that A-D hold and that $n \geq p_k$ for each $k$. Let $\hat{\Sigma}$ denote the one-step additive correlation iPCA estimator, with penalty parameters chosen at the rates specified in Appendix D for each $k$. Then, with high probability,

(13) [an operator-norm error bound on $\hat{\Sigma}$]

(14) [a Frobenius-norm error bound on $\hat{\Sigma}$]
Details of this proof are provided in Appendix D, but the overarching idea is to follow Algorithm 6 and sequentially bound each step of the algorithm. Our proof closely resembles Zhou (2014a), with technical differences due to the estimation of $\Sigma$ from multiple $X_k$'s with different $\Delta_k$'s. Note that when $K = 1$, Theorem 4 gives the same rates as Zhou (2014a).
Remark.
If $p_k > n$ for each $k$ (i.e. the "large $p$" setting), then we modify Algorithm 6 by reversing the order of the updates: we first initialize one covariance estimate at the identity, next estimate the other factor, and then obtain the final estimate of $\Sigma$. The same convergence rates can be obtained for the "large $p$" setting by using this modified algorithm. The only additional assumption required here is a bound on the growth rates of the dimensions, analogous to assumption D, for each $k$.
Therefore, in either the large $n$ or the large $p$ setting, we have a one-step Flip-Flop algorithm such that, under certain assumptions, the estimate of $\Sigma$ converges in the operator and Frobenius norms at the rates given by (13) and (14). Convergence in the operator norm, in particular, has two direct consequences. By Weyl's theorem (Horn and Johnson 2012), the eigenvalues of $\hat{\Sigma}$ are consistent, and by a variant of the Davis-Kahan $\sin \Theta$ theorem (Yu et al. 2015), the eigenvectors of $\hat{\Sigma}$ are consistent. Since the eigenvectors of $\hat{\Sigma}$ define the estimated iPCA subspace, this in turn implies subspace consistency.
3.2.2 Selecting Penalty Parameters
In practice, it is important to select the penalty parameters for (12) in a principled, data-driven manner. We propose to do this via a missing-data imputation framework.
Let $\Lambda$ denote the space of penalty parameters, and let $\lambda \in \Lambda$ be a specific choice of penalty parameters. The idea is to first randomly leave out scattered elements from each $X_k$. Then, for each $\lambda \in \Lambda$, impute the missing elements via an EM-like algorithm, similar to Allen and Tibshirani (2010). Finally, select the $\lambda \in \Lambda$ which minimizes the error between the imputed values and the observed values. Further details, technical derivations, and numerical results regarding our imputation method and penalty parameter selection are provided in Appendix E.
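The selection loop can be sketched as follows (hypothetical helper `select_penalty`; the EM-like imputer itself is abstracted as a user-supplied callable `fit_impute`, since its derivation lives in Appendix E):

```python
import numpy as np

def select_penalty(blocks, lambdas, fit_impute, frac=0.05, seed=0):
    """Sketch of the penalty-selection loop: hide a random scatter of entries
    in each block, impute them with fit_impute(masked_blocks, lam) (an
    EM-like imputer, not shown), and keep the lambda with the smallest
    squared error on the held-out entries."""
    rng = np.random.default_rng(seed)
    masks = [rng.random(B.shape) < frac for B in blocks]
    masked = [np.where(m, np.nan, B) for B, m in zip(blocks, masks)]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        imputed = fit_impute(masked, lam)
        err = sum(np.sum((I[m] - B[m]) ** 2)
                  for I, B, m in zip(imputed, blocks, masks))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

The same machinery can double as a missing-data imputation routine by keeping the imputed matrices instead of the selected penalty parameters.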
Remarks.

If $K$ is large or the datasets are large, it might be computationally intractable to try all combinations of penalty parameters in $\Lambda$. In these cases, we can select the penalty parameters in a greedy manner: fix all but the first penalty parameter and optimize over it, then fix all but the second and optimize over it, and so forth, stopping after the last penalty parameter has been optimized.

Our imputation procedure may also be used to perform iPCA in missing data scenarios.
4 Empirical Results
In the following simulations and case study, we evaluate iPCA against individual PCAs on each of the datasets $X_1, \dots, X_K$, concatenated PCA, distributed PCA, JIVE, and MFA. Note that many data integration methods from the multi-table PCA family are known to perform similarly to MFA (Abdi et al. 2013), so we only include MFA to minimize redundancy. We also omit CMF, as it performs similarly to concatenated PCA, and the GSVD, since it is not applicable for integrated data with both low-dimensional and high-dimensional datasets.
Within the class of iPCA estimators, we focus our attention on the additive and multiplicative Frobenius iPCA estimators, but to also represent the sparse estimators, we include the most commonly used sparse estimator, the additive $\ell_1$ penalty applied to the inverse covariance matrices. In order to reduce the computational burden, we stop the Flip-Flop algorithm after one iteration, and we select the iPCA penalty parameters in a greedy manner, as discussed in Section 3.2.2.
4.1 Simulations
The base simulation is set up as follows: three coupled data matrices, $X_1$, $X_2$, and $X_3$, were simulated according to the iPCA Kronecker covariance model (2). Here, $\Sigma$ is taken to be a spiked covariance with the top two factors forming three clusters, as shown in Figure 2A; $\Delta_1$ is an autoregressive Toeplitz matrix; $\Delta_2$ is the observed covariance matrix of miRNA data from TCGA ovarian cancer (Network 2011); and $\Delta_3$ is a block-diagonal matrix with five equally-sized blocks. We also ensured that the largest eigenvalue of each $\Delta_k$ was larger than that of $\Sigma$ so that the joint patterns are intentionally obscured by individualistic patterns. From this base simulation, we systematically varied the parameters (the number of samples, the number of features, and the strength of the joint signal in $\Sigma$) one at a time while keeping everything else constant.
We evaluate the performance of the various methods using the subspace recovery error: if the true subspace of $\Sigma$ was simulated to be of dimension $d$ with orthonormal eigenbasis $U \in \mathbb{R}^{n \times d}$, and the top $d$ eigenvectors of the estimate $\hat{\Sigma}$ are given by $\hat{U}$, then the subspace recovery error is defined to be $\| P - \hat{P} \|_F^2$, where $P = U U^T$ and $\hat{P} = \hat{U} \hat{U}^T$. This metric simply quantifies the distance between the true subspace of $\Sigma$ and the estimated subspace from $\hat{\Sigma}$. We note that a lower subspace recovery error implies higher estimation accuracy. Although there are other metrics, like canonical angles, which also quantify the distance between subspaces, these metrics behave similarly to the subspace recovery error, so we omit the results for brevity.
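A sketch of such a projection-based subspace distance (hypothetical helper `subspace_recovery_error`; we assume the squared Frobenius distance between projection matrices, and the normalization used in any given reference may differ):

```python
import numpy as np

def subspace_recovery_error(U, U_hat):
    """Squared Frobenius distance between the projections onto the column
    spaces of U and U_hat (both assumed to have orthonormal columns):
    ||U U^T - U_hat U_hat^T||_F^2."""
    P, P_hat = U @ U.T, U_hat @ U_hat.T
    return np.linalg.norm(P - P_hat, "fro") ** 2
```

The error is $0$ when the two bases span the same subspace and $2d$ when the two $d$-dimensional spans are orthogonal.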
The average subspace recovery error, measured over repeated trials, from the various simulations is shown in Figure 3. We clearly see that the additive and multiplicative Frobenius iPCA estimators consistently outperformed all other methods. Since $\Sigma$ was not simulated to be sparse, it is no surprise that the Frobenius iPCA estimators outperformed the additive $\ell_1$ iPCA estimator. It is also expected that distributed PCA performs poorly, since the $\Delta_k$'s are not all identical, violating its basic assumption. What may be surprising is that doing PCA on $X_2$ performed better than its competitors, excluding the Frobenius iPCA estimators. We speculate that this is because the observed miRNA covariance $\Delta_2$ happened to be a very low-rank matrix, and because $\Delta_2$ was low-rank, the signal from $\Sigma$ most likely dominated much of the variation in the second PC. Looking ahead at Figure 4A, as Laplacian error was added to the simulated data, PCA on $X_2$ failed to recover the true signal, since the Laplacian error increasingly contributed to the variation in the data. We also point out that MFA always yielded a lower error than concatenated PCA in Figure 3, indicating that there is value in normalizing datasets to be on comparable scales. On the other hand, we must be wary of this normalization process. In the case of these simulations, PCA on $X_2$ outperformed MFA, illustrating that normalization can sometimes remove important information.
To verify that these simulation results are not heavily dependent on the base simulation setup, we also ran simulations varying the dimension of the true joint subspace and the number of datasets $K$. We provide these results in Appendix F.
Beyond simulating from the iPCA model (2), we also check robustness to departures from the two main iPCA assumptions: normality and the Kronecker covariance structure. To deviate from normality, we add Laplacian noise to the base simulation setup, and to depart from the Kronecker covariance structure, we simulate data from the JIVE model. The results are summarized in Figure 4, and we leave the simulation details to Appendix F.
As seen in Figure 4A, the Frobenius iPCA estimators appear to be relatively robust to the non-Gaussian noise and outperformed their competitors even as the standard deviation of the added Laplacian errors increased. From the simulations under the JIVE model in Figure 4B, we see that as the amount of noise increases, JIVE given the known ranks yields the lowest error, as expected. But similar to how JIVE was comparable to competing methods under the iPCA model (Figure 3), the iPCA estimators are comparable to competing methods at high noise levels under the JIVE model. At low noise levels, however, the Frobenius iPCA estimators are able to recover the true joint subspace better than JIVE with the known ranks. Further investigation into this behavior reveals that the Frobenius iPCA estimators give lower subspace recovery errors but larger matrix approximation errors compared to JIVE with the known ranks. This brings up a subtle but important distinction: iPCA revolves around estimating the underlying subspace, determined by eigenvectors, while JIVE focuses on minimizing the matrix approximation error. These are inherently different objectives, and it is common for iPCA to estimate the eigenvectors well at the cost of a poor matrix approximation due to the regularized eigenvalues.

4.2 Case Study: Integrative Genomics of Alzheimer's Disease
A key motivating example for our research is integrative genomics, where the goal is to combine genetic information from different, yet related, technologies in order to better understand the genetic basis of particular diseases. In particular, apart from the APOE gene, little is known about the genomic basis of Alzheimer's Disease (AD) and the genes which contribute to dominant expression patterns in AD. Thus, in this case study, we delve into the integrative genomics of AD and jointly analyze miRNA expression, gene expression via RNA-Seq, and DNA methylation data obtained from the Religious Orders Study Memory and Aging Project (ROSMAP) Study (Mostafavi et al. 2018). The ROSMAP study is a longitudinal clinical-pathological cohort study of aging and AD, consisting of subjects with data across all three datasets of interest. It is uniquely positioned for the study of AD since its genomics data is collected from post-mortem brain tissue from the dorsolateral prefrontal cortex, an area known to play an important role in cognitive functions.
The ROSMAP data originally contained miRNAs, genes, and CpG sites, so we aggressively preprocessed the RNA-Seq and methylation datasets to reduce the number of features to manageable sizes. First, we transformed the methylation data to m-values and log-transformed the RNA-Seq counts, as is common in most analyses of these data types. Then, we removed batch (experimental) effects from both datasets via ComBat (Johnson et al. 2007). We next filtered the features by taking those with the highest variance (top 20,000 genes for RNA-Seq and top 50,000 CpG sites for methylation). Then, we performed univariate filtering and kept the features with the highest association with the clinician's diagnosis. This left us with , , and features in the miRNA, RNA-Seq, and methylation datasets, respectively.
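Two of these steps are easy to sketch in code. The following is a simplified Python illustration (function names and the pseudocount are ours) of log-transforming counts and variance-based filtering; the ComBat batch correction step is omitted here (it is implemented in the R sva package):

```python
import numpy as np

def log_transform_counts(counts, pseudocount=1.0):
    """Log-transform raw RNA-Seq counts; the pseudocount avoids log(0)."""
    return np.log2(counts + pseudocount)

def top_variance_features(X, k):
    """Keep the k columns (features) of X with the largest variance."""
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[::-1][:k])
    return X[:, keep], keep

# Toy data: 100 subjects by 500 features of Poisson-like counts.
rng = np.random.default_rng(2)
counts = rng.poisson(lam=20, size=(100, 500)).astype(float)
X = log_transform_counts(counts)
X_filtered, kept = top_variance_features(X, k=50)
```

In the actual analysis the variance filter would retain far more features (e.g. 20,000 genes), but the mechanics are the same.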
For our analysis, we consider two clinical outcome variables: clinician's diagnosis and global cognition score. The clinician's diagnosis is the last clinical evaluation prior to the patient's death and is a categorical variable with three levels: Alzheimer's Disease (AD), mild cognitive impairment (MCI), and no cognitive impairment (NCI). Global cognition score, a continuous variable, is the average of 19 cognitive tests and is the last cognitive testing score prior to death. While the clinician's diagnosis is sometimes subjective, global cognition score is viewed as a more objective measure of cognition. Our goal is to find common patterns among patients, which occur in all three datasets, and to understand whether these joint patterns are predictive of AD, as measured by clinician's diagnosis and global cognition score.
To this end, we run iPCA and other existing methods to extract dominant patterns from the ROSMAP data. Figure 5 shows the PC plots obtained from the various methods: each point represents a subject and is colored by either clinician's diagnosis or global cognition score.
Since visuals are a subjective measure of performance, we quantify performance by taking the top PCs and using them as predictors in a random forest to predict the outcome of interest. For the random forest, we split the ROSMAP data into a training set and a test set and used the default random forest settings in R. The test errors, averaged over 100 random training/test splits, are shown in Figure 6. From Figure 6, it is clear that the joint patterns extracted by iPCA with the multiplicative Frobenius penalty were the most predictive of the clinician's diagnosis of AD and the patient's global cognition score. Moreover, most of the predictive power can be attributed to the first three iPCs, which we visualize in Figure 7A-B. We also note that the top iPCs from iPCA with the multiplicative Frobenius penalty were more predictive than combining the PCs from the three individual PCAs performed on each dataset. This showcases empirically that a joint analysis of the integrated datasets can be advantageous over three disparate analyses.
Beyond the high predictive power of iPCA with the multiplicative Frobenius penalty, it is perhaps more important for scientists to be able to interpret the iPCA results. One way is through the proportion of variance explained by the joint iPCs, as defined in Section 2.2.1. Figure 7C shows the marginal proportions for the top 5 iPCs. It reveals that the RNA-Seq dataset contributed the most variation to the joint patterns found by iPC1 and iPC2, and the miRNA dataset contributed the most variation to iPC3. More interestingly, even though iPC2 and iPC3 have relatively small variances, iPCA is able to pick out these weak joint signals, which we found to be predictive of AD. This reiterates that the most variable patterns in the data are not necessarily the most relevant patterns for the question at hand. In this case, our goal was to find joint patterns which occur in all three datasets, and since the joint signal is not the most dominant source of variation in each dataset, no individual PCA analysis would have identified the joint signal found by iPCA.
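The evaluation scheme above can be sketched as follows. This is a Python illustration only (the actual analysis used R's randomForest defaults; the helper name, the 75/25 split, and the toy data are our assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_test_error(pc_scores, y, n_splits=100, test_size=0.25, seed=0):
    """Average random-forest test error when predicting outcome y from the
    top PC scores (rows are subjects, columns are PCs)."""
    errors = []
    for s in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            pc_scores, y, test_size=test_size, random_state=seed + s)
        rf = RandomForestClassifier(random_state=seed + s).fit(X_tr, y_tr)
        errors.append(1.0 - rf.score(X_te, y_te))  # misclassification rate
    return float(np.mean(errors))

# Toy illustration with a synthetic binary outcome driven by the first PC.
rng = np.random.default_rng(0)
scores = rng.standard_normal((80, 3))  # top 3 PC scores for 80 subjects
y = (scores[:, 0] + 0.5 * rng.standard_normal(80) > 0).astype(int)
err = rf_test_error(scores, y, n_splits=10)
```

Averaging over many random splits, as done here, reduces the variance of the estimated test error so that methods can be compared fairly.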
We conclude our ROSMAP analysis by extracting the top genetic features associated with the joint patterns shown in Figure 7. Since iPCA provides estimates of both the sample and feature covariances, we can select the top features by applying sparse PCA to each feature covariance estimate obtained from iPCA. Table 1 lists the top miRNAs, genes, and genes affiliated with the selected CpG sites obtained from sparse PCA. Here, we used the sparseEigen R package (Benidis et al. 2016) and chose the tuning parameter such that there were only nonzero features.
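The analysis itself used the sparseEigen R package; purely as an illustration of the idea, here is a scikit-learn sketch of selecting the nonzero-loading features from the first sparse principal component (the `alpha` and `n_top` values are arbitrary choices of ours):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def top_sparse_features(X, alpha=1.0, n_top=12):
    """Indices of the features with the largest nonzero loadings in the
    first sparse principal component; alpha controls the sparsity level."""
    spca = SparsePCA(n_components=1, alpha=alpha, random_state=0)
    spca.fit(X)
    loadings = spca.components_[0]
    nonzero = np.flatnonzero(loadings)
    order = nonzero[np.argsort(np.abs(loadings[nonzero]))[::-1]]
    return order[:n_top].tolist()

# Toy data in which the first three features carry extra variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 15))
X[:, :3] += 3 * rng.standard_normal((40, 3))
selected = top_sparse_features(X, alpha=1.0)
```

Larger `alpha` values drive more loadings to exactly zero, which is how the number of selected features is controlled.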
Because the RNA-Seq data contributed most of the variation in iPC1, we did a literature search on the top five genes extracted by sparse PCA. Of these five genes, we found evidence in the biological literature linking four (the exception being SVOP) to AD (Carrette et al. 2003; Li et al. 2017; Han et al. 2014; Espuny-Camacho et al. 2017). While this is only a preliminary investigation into the importance of the genetic features obtained from iPCA, it is encouraging evidence and may hint at candidate genes for future research.
Table 1: Top features selected by sparse PCA, by dataset.

      miRNA            RNA-Seq     Methylation
 1    miR-216a         VGF         TMCO6
 2    miR-127-3p       SVOP        PHF3
 3    miR-124          PCDHGC5     BRUNOL4
 4    miR-30c          ADCYAP1     OSCP1
 5    miR-143          LINC01007   GRIN2B
 6    miR-27a          FRMPD2L1    CASP9
 7    miR-603          SLC30A3     ZFP91; LPXN; ZFP91-CNTF
 8    miR-423-3p       NCALD       CNP
 9    miR-204          S100A4      YWHAE
10    miR-128          AZGP1       C11orf73
11    miR-193a-3p      PAK1        TMED10
12    ebv-miR-BART14   MAL2        RELL1
5 Discussion
As showcased in the simulation study and the Alzheimer's Disease case study, iPCA is not simply a theoretical construct that generalizes PCA to the integrated data setting. iPCA is also a useful tool in practice for discovering interesting joint patterns which occur in multiple datasets. We believe that iPCA's strong performance is due in part to the appropriateness of the iPCA model for many integrated data problems. The iPCA model assumes that the variation in each dataset is a function of the dependencies among samples and the dependencies among features, and because each dataset is measured on the same set of samples, the dependencies among samples should be the same across all datasets. We encode these shared sample dependencies in a single covariance matrix shared across datasets and the individualistic feature dependencies in dataset-specific covariance matrices. Then, whereas ordinary PCA estimates each feature covariance independently of the sample covariance, iPCA estimates the sample and feature covariances jointly, inherently accounting for the effects of the other factor.
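This dependence structure can be written compactly as a matrix-variate normal model. The following is a sketch in notation of our own choosing (the paper's model (2) may use different symbols, and the Kronecker ordering depends on the vectorization convention):

```latex
% Sketch (notation ours): K coupled datasets X_k \in \mathbb{R}^{n \times p_k}
% share the sample covariance \Sigma, while each dataset carries its own
% feature covariance \Delta_k.
X_k \sim N_{n \times p_k}\!\left(0,\; \Sigma \otimes \Delta_k\right),
\qquad k = 1, \dots, K.
```

Under such a model, the shared factor ties the datasets together, which is what allows joint estimation to borrow strength across all of them.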
While there are many different penalized iPCA estimators, we recommend that if the covariance matrices are not known to be sparse, practitioners should strongly consider using the multiplicative Frobenius iPCA estimator. The simulations show that the Frobenius penalties are relatively robust to departures from model assumptions, and in theory, the multiplicative Frobenius penalized estimator requires one less penalty parameter to tune and always converges to the global solution. Though it is not obvious why the multiplicative Frobenius iPCA estimator drastically outperforms the other data integration methods in the ROSMAP case study, we speculate that it is related to g-convexity and convergence to the global solution. Alternatively, it may be that the multiplicative Frobenius penalty simply gives the appropriate scalings and interactions between penalty parameters and covariance estimates. We leave this as an open question for future research.
Still, there are many other open avenues for future exploration. Analogous to PCA, one might imagine similar fruitful extensions of iPCA to higher-order data, functional data, and other structured applications. One could continue exploring g-convex penalties in different contexts and problems. Another interesting area for future research would be to develop a general framework for proving the consistency of g-convex estimators using the intrinsic manifold space, rather than the Euclidean space. We believe this intersection of g-convexity and statistical theory is a particularly ripe area for future research. Overall, in this work, we developed a theoretically sound and practical tool for performing dimension reduction in the integrated data setting, thus facilitating holistic analyses at a large scale.
Acknowledgements
The authors acknowledge support from NSF DMS-1264058, NSF DMS-1554821, and NSF NeuroNex-1707400. The authors thank Dr. Joshua Shulman and Dr. David Bennett for help acquiring the ROS/MAP data and acknowledge support from NIH P30AG10161, RF1AG15819, R01AG17917, and R01AG36042 for this data. The authors also thank Dr. Zhandong Liu and Dr. Ying-Wooi Wan for help with processing and interpreting the ROS/MAP data.
Integrated Principal Components Analysis: Supplementary Materials
Tiffany M. Tang, Genevera I. Allen
The supplementary materials are structured as follows. In Appendix A, we verify that the variance explained by iPCA is well defined. In Appendix B, we provide the proofs, derivations, and algorithms related to the unpenalized and penalized MLEs for iPCA. We then review geodesic convexity and prove the global convergence property of the multiplicative Frobenius iPCA estimator in Appendix C. In Appendix D, we prove the subspace consistency of the one-step additive correlation iPCA estimator. We discuss the penalty parameter selection in Appendix E, and we conclude with additional simulation details in Appendix F.
Appendix A Variance Explained by iPCA
To ensure that the cumulative proportion of variance explained from Definition 1 is well defined, we check that it is a proportion and that it is an increasing function of the number of iPCs. This implies that the marginal proportion of variance explained given in Definition 2 is also a proportion.
Lemma 2.
The cumulative proportion of variance explained by the top iPCs, as defined in (5), satisfies the following properties for each dataset and each number of top iPCs:
(i) it is a proportion, i.e., it lies between 0 and 1;
(ii) it is nondecreasing in the number of top iPCs.
Proof.
(i) Since the Frobenius norm is always nonnegative, it is clear that . So it suffices to show that , or equivalently, .
By definition of the Frobenius norm, we have that
Here, [2] holds by the orthogonality of and , and [1] follows from the facts that , and are precisely the first columns of and respectively, and the summand is nonnegative. This concludes the proof of part (i).
(ii) We follow a similar argument as part (i) to see that
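The lemma's two properties can be checked numerically in the classical single-matrix PCA special case, where the cumulative proportion of variance explained by the top k PCs reduces to partial sums of squared singular values over the total squared Frobenius norm. A minimal sketch (this illustrates the properties, not the iPCA definition (5) itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))

# Cumulative proportion of variance explained by the top k PCs.
s = np.linalg.svd(X, compute_uv=False)
cum_prop = np.cumsum(s ** 2) / np.sum(s ** 2)

# Part (i): each cumulative proportion lies in [0, 1].
assert np.all(cum_prop >= 0) and np.all(cum_prop <= 1 + 1e-12)
# Part (ii): the cumulative proportion is nondecreasing in k.
assert np.all(np.diff(cum_prop) >= -1e-12)
```

Since each summand added to the partial sum is nonnegative, monotonicity is immediate, mirroring the argument in the proof above.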