Integrated Principal Components Analysis

10/01/2018
by Tiffany M. Tang, et al.

Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that may be hidden in individualistic analyses of a single data source. We develop a new statistical data integration method named Integrated Principal Components Analysis (iPCA), which is a model-based generalization of PCA and serves as a practical tool to find and visualize common patterns that occur in multiple datasets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each dataset and joint patterns shared by multiple datasets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA and study their theoretical properties. We show that our sparse iPCA estimator consistently estimates the underlying joint subspace, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through simulations and a case study application to integrative genomics for Alzheimer's Disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient's cognition and Alzheimer's diagnosis.


1 Introduction

Because of the complexity of most natural phenomena, it is unlikely that a single data source encapsulates all information necessary to fully understand the phenomena of interest. For example, meteorologists cannot predict the weather based on temperature alone. Audio is only one component in surveillance efforts, and in science, it is known that DNA is not the sole driver of human health. Thus, it is not surprising that data integration, or the strategic analysis of multiple sources of data simultaneously, can lead to discoveries which are not evident in analyses of a single data source. Returning to the previous examples, meteorologists must integrate data from satellites, ground-based sensors, and numerical models (Ghil and Malanotte-Rizzoli 1991). Audio and video are often combined for surveillance as well as speech recognition (Shivappa et al. 2010), and as new technologies arise in biology, scientists are leveraging data integration to better understand complex biological processes (Lock et al. 2013). In general, by exploiting the diversity and commonalities among different datasets, data integration helps to build a holistic and perhaps more realistic model of the phenomena of interest. Furthermore, as technological advances fuel the growing number of available datasets, data integration is becoming increasingly necessary in numerous disciplines.

1.1 Related Work

Though the data deluge is a recent development, the concept of data integration is certainly not new. For instance, it is a fundamental idea in the classical canonical correlation analysis (CCA) (Hotelling 1936), which infers the relationship between features from two datasets.

More recently, data integration has revolved around factorization methods due to their ease of interpretability and computational feasibility. The framework of coupled matrix and tensor factorization (CMTF) (Singh and Gordon 2008; Acar et al. 2014) unifies a large class of these factorizations, while Joint and Individual Variation Explained (JIVE) (Lock et al. 2013) is another factorization method that is commonly used in integrative genomics. One general challenge with matrix factorization methods, however, is that they depend on the rank of the factorized matrices, which is frequently unknown and must be specified a priori.

In contrast to the matrix factorization methods, the generalized SVD (GSVD) provides an exact matrix decomposition for coupled data (Alter et al. 2003; Ponnapalli et al. 2011) and is not dependent on the matrix rank. However, it is limited in scope. The GSVD assumes each matrix has full row rank, excluding integrated problems with both high-dimensional and low-dimensional datasets.

There has also been work on extending principal components analysis (PCA) to integrated data problems. This family of methods is known as the multiblock PCA family, and it includes Multiple Factor Analysis (MFA) (Escofier and Pages 1994; Abdi et al. 2013) and Consensus PCA (Westerhuis et al. 1998). In these methods, each data matrix is normalized according to a specific procedure, then data matrices are concatenated together, and finally, PCA is performed on the normalized concatenated data. Closely related to this is Distributed PCA (Fan et al. 2017), which integrates data that are stored across multiple servers.

Thus far, the existing methods have largely been algorithmic constructions, which lack a rigorous statistical model. As a result, the statistical properties and theoretical foundations of data integration methods are generally unknown. We begin to bridge this gap between data integration methodology and statistical theory by developing a new data integration method, Integrated Principal Components Analysis (iPCA), which extends a model-based framework of PCA to the integrated data setting. The two main building blocks of iPCA are PCA and the matrix-variate normal distribution, which we review next.

1.2 Principal Components Analysis

Given a dataset $\mathbf{X} \in \mathbb{R}^{n \times p}$, recall that PCA finds orthogonal directions $\mathbf{v}_1, \ldots, \mathbf{v}_p \in \mathbb{R}^p$ which maximize the variance $\mathbf{v}_k^T \boldsymbol{\Delta} \mathbf{v}_k$, where $\boldsymbol{\Delta}$ denotes the $p \times p$ population covariance of the features. That is, for each $k = 1, \ldots, p$,

$$\mathbf{v}_k = \operatorname*{argmax}_{\mathbf{v} \in \mathbb{R}^p} \ \mathbf{v}^T \boldsymbol{\Delta} \mathbf{v} \quad \text{subject to} \quad \mathbf{v}^T \mathbf{v} = 1, \ \ \mathbf{v}^T \mathbf{v}_j = 0 \ \text{ for all } j < k. \qquad (1)$$

It is well-known that the $k^{th}$ PC loading $\mathbf{v}_k$ is the eigenvector of $\boldsymbol{\Delta}$ with the $k^{th}$ largest eigenvalue, and its corresponding PC score is $\mathbf{X}\mathbf{v}_k$. In practice, since the population covariance $\boldsymbol{\Delta}$ is typically unknown, the sample version of PCA plugs in the estimate $\hat{\boldsymbol{\Delta}} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ for $\boldsymbol{\Delta}$ in (1) (assuming $\mathbf{X}$ is column-centered). It follows that the PC loadings are the eigenvectors of $\mathbf{X}^T\mathbf{X}$, and the PC scores are the eigenvectors of $\mathbf{X}\mathbf{X}^T$. Equivalently, if the SVD gives $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, then the columns of $\mathbf{V}$ and $\mathbf{U}$ are the PC loadings and scores, respectively.

To later establish the link between iPCA and PCA, we point out that $\frac{1}{n}\mathbf{X}^T\mathbf{X}$ is the maximum likelihood estimator (MLE) of $\boldsymbol{\Delta}$ under the multivariate normal model $\mathbf{X}_{i \cdot} \overset{iid}{\sim} N(\mathbf{0}, \boldsymbol{\Delta})$, and $\frac{1}{p}\mathbf{X}\mathbf{X}^T$ is the MLE of $\boldsymbol{\Sigma}$ under $\mathbf{X}_{\cdot j} \overset{iid}{\sim} N(\mathbf{0}, \boldsymbol{\Sigma})$. Here, $\mathbf{X}_{i \cdot}$ is the $i^{th}$ row of $\mathbf{X}$, and $\mathbf{X}_{\cdot j}$ is the $j^{th}$ column of $\mathbf{X}$. While this is not the only way of viewing PCA, this framework illustrates two points which we explore further in iPCA: (1) PCA can be framed as maximizing the variance in $\mathbf{X}$ under a normal model; (2) eigenvectors correspond to the dominant (or variance-maximizing) patterns in the data.
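
For concreteness, the following minimal sketch (our own illustration, assuming NumPy; all variable names are ours) computes the sample PC loadings and scores of a column-centered data matrix via its SVD and checks the equivalent eigendecomposition view described above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X = X - X.mean(axis=0)                  # column-center, as assumed in (1)

# SVD view: X = U D V^T; columns of V are PC loadings, columns of U are PC scores
U, d, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                          # eigenvectors of X^T X
scores = X @ loadings                    # equivalently U scaled by d (up to sign)

# Eigendecomposition view: loadings are eigenvectors of the sample covariance
evals, evecs = np.linalg.eigh(X.T @ X / X.shape[0])
evecs = evecs[:, np.argsort(evals)[::-1]]
# evecs and loadings agree up to column-wise sign flips
```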

1.3 Matrix-variate Normal Model

The matrix-variate normal distribution (Gupta and Nagar 1999; Dawid 1981) is an extension of the multivariate normal distribution such that the matrix is the fundamental object of study. We say that an $n \times p$ random matrix $\mathbf{X}$ follows a matrix-variate normal distribution, and write $\mathbf{X} \sim N_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma} \otimes \boldsymbol{\Delta})$, if $\operatorname{vec}(\mathbf{X}^T)$ follows a multivariate normal distribution with Kronecker product covariance structure, $\operatorname{vec}(\mathbf{X}^T) \sim N(\operatorname{vec}(\mathbf{M}^T), \boldsymbol{\Sigma} \otimes \boldsymbol{\Delta})$, where $\operatorname{vec}(\cdot)$ is the column vector formed by stacking the columns of its argument below one another. We call $\mathbf{M} \in \mathbb{R}^{n \times p}$ the mean matrix, $\boldsymbol{\Sigma} \in \mathbb{S}^{n}_{++}$ the row covariance matrix, and $\boldsymbol{\Delta} \in \mathbb{S}^{p}_{++}$ the column covariance matrix, where $\mathbb{S}^{d}_{++}$ denotes the set of $d \times d$ symmetric positive definite matrices.

Intuitively, the row covariance $\boldsymbol{\Sigma}$ encodes the dependencies between rows of $\mathbf{X}$, and the column covariance $\boldsymbol{\Delta}$ encodes the dependencies among the columns of $\mathbf{X}$, i.e. $\mathbf{X}_{i \cdot} \sim N(\mathbf{M}_{i \cdot}, \Sigma_{ii}\,\boldsymbol{\Delta})$ and $\mathbf{X}_{\cdot j} \sim N(\mathbf{M}_{\cdot j}, \Delta_{jj}\,\boldsymbol{\Sigma})$. Given these normally-distributed marginals, it can be shown that if $\boldsymbol{\Sigma} = \mathbf{I}_n$ and $\mathbf{M} = \mathbf{1}_n\boldsymbol{\mu}^T$, we are in the familiar multivariate normal setting, $\mathbf{X}_{i \cdot} \overset{iid}{\sim} N(\boldsymbol{\mu}, \boldsymbol{\Delta})$. Similarly, if $\boldsymbol{\Delta} = \mathbf{I}_p$ and $\mathbf{M} = \boldsymbol{\mu}\mathbf{1}_p^T$, then $\mathbf{X}_{\cdot j} \overset{iid}{\sim} N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

An important distinction between the multivariate and matrix-variate normal distributions is that the multivariate normal can only model relationships between elements of a single row or a single column of $\mathbf{X}$, whereas the matrix-variate normal can model relationships between elements from different rows and columns. The matrix-variate normal is thus far more general than the multivariate normal. With this level of generality, the matrix-variate normal has proven to be a versatile tool in various contexts such as graphical models (Yin and Li 2012; Tsiligkaridis et al. 2013; Zhou 2014a), spatio-temporal models (Greenewald and Hero 2015), and transposable models (Allen and Tibshirani 2010). Our work on iPCA is the first to consider the matrix-variate normal model in light of data integration.
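
To make the Kronecker product covariance concrete, the sketch below (our own illustration, not taken from the paper) draws a matrix-variate normal sample by coloring i.i.d. Gaussian noise with matrix square roots of the row and column covariances; this is equivalent to the stacked vectorization having a Kronecker product covariance.

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Delta, rng):
    """Draw X ~ N_{n x p}(M, Sigma (rows), Delta (cols)):
    the vectorized, mean-centered X has covariance given by the
    Kronecker product of Delta and Sigma."""
    def sqrtm_psd(A):
        w, Q = np.linalg.eigh(A)
        return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T
    n, p = M.shape
    Z = rng.normal(size=(n, p))            # i.i.d. standard normal noise
    return M + sqrtm_psd(Sigma) @ Z @ sqrtm_psd(Delta)

rng = np.random.default_rng(1)
n, p = 5, 3
Sigma = 0.5 * np.eye(n) + 0.5              # equicorrelated rows
Delta = np.diag([1.0, 2.0, 3.0])           # heteroskedastic columns
X = sample_matrix_normal(np.zeros((n, p)), Sigma, Delta, rng)
# In expectation, np.cov of the column-stacked samples equals np.kron(Delta, Sigma).
```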

1.4 Overview

iPCA is a generalization of PCA for integrated data via matrix-variate normal models, and it serves as a practical tool to identify and visualize joint patterns that are shared by multiple datasets. We formally introduce iPCA in Section 2, and in Section 3, we discuss covariance estimation methods for iPCA and highlight our two main theoretical contributions - proving subspace consistency of the additive correlation iPCA estimator and proving global convergence of the multiplicative Frobenius iPCA estimator. Beyond these theoretical guarantees, iPCA has several practical advantages: (1) since iPCA is not a matrix factorization method, it is not dependent on the rank of the factorized matrices; instead, iPCA gives ordered, nested, orthogonal PCs as in PCA; (2) iPCA (with the multiplicative Frobenius estimator) converges to the global solution, which is important for reproducibility, and (3) from an interpretability standpoint, iPCA can extract patterns in feature and sample space simultaneously and provides convenient visualizations of these patterns. Furthermore, we will empirically demonstrate in Section 4, through extensive simulations and a case study with integrative genomics for Alzheimer’s Disease, the strong performance of iPCA in practice. Finally, we conclude with a discussion of iPCA and future work in Section 5.

2 Integrated PCA

Similar to PCA, iPCA is an unsupervised tool for exploratory data analysis, pattern recognition, and visualization. Unlike PCA, however, iPCA extracts the dominant joint patterns that are common to multiple datasets, not necessarily the variance-maximizing patterns, since those might be specific to a single dataset. These joint patterns are typically of considerable interest to practitioners, as their common occurrence in multiple datasets may point to some foundational mechanism or structure.

Generally speaking, we find these joint patterns by modeling the dependencies between and within datasets via the matrix-variate normal model. The inherent Kronecker product covariance structure enables us to decompose the total covariance of each data matrix into two separable components - an individual column covariance structure which is unique to each dataset and a joint row covariance structure which is shared among all datasets. The joint row covariance structure is our primary interest, and maximizing this joint variation will yield the dominant patterns which are common across all datasets.

2.1 Population Model of iPCA

Suppose we observe $K$ coupled data matrices, $\mathbf{X}_1, \ldots, \mathbf{X}_K$, of dimensions $n \times p_1, \ldots, n \times p_K$, where $n$ is the number of samples and $p_k$ is the number of features in $\mathbf{X}_k$. Throughout this paper, we let $p = \sum_{k=1}^{K} p_k$ denote the total number of features and $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_K]$ the column-wise concatenated data. Suppose also that each of the data matrices has a distinct set of features that are measured on the same samples and that all of the rows of $\mathbf{X}_1, \ldots, \mathbf{X}_K$ perfectly align (see Figure 1). Under the iPCA model, we assume that each dataset arises from a matrix-variate normal distribution. Namely, for each $k = 1, \ldots, K$,

$$\mathbf{X}_k \sim N_{n \times p_k}\!\left(\mathbf{1}_n \boldsymbol{\mu}_k^T, \ \boldsymbol{\Sigma} \otimes \boldsymbol{\Delta}_k\right), \qquad (2)$$

where $\mathbf{1}_n$ is an $n \times 1$ column vector of ones, $\boldsymbol{\mu}_k \in \mathbb{R}^{p_k}$ is a vector of column means specific to $\mathbf{X}_k$, $\boldsymbol{\Sigma}$ is an $n \times n$ row covariance matrix that is jointly shared by all data matrices, and $\boldsymbol{\Delta}_k$ is a $p_k \times p_k$ column covariance matrix that is specific to $\mathbf{X}_k$.

Figure 1: Integrated Data Setting for iPCA: We observe $K$ coupled data matrices, each with a distinct set of features that are measured on the same set of samples. Assume that the rows align.

Given the underlying model in (2), we define the population version of iPCA to maximize the row and column covariances. For $i = 1, \ldots, n$ and $j = 1, \ldots, p_k$ ($k = 1, \ldots, K$), iPCA solves

$$\mathbf{u}_i = \operatorname*{argmax}_{\mathbf{u} \in \mathbb{R}^{n}} \ \mathbf{u}^T \boldsymbol{\Sigma} \mathbf{u} \quad \text{subject to} \quad \mathbf{u}^T\mathbf{u} = 1, \ \ \mathbf{u}^T\mathbf{u}_l = 0 \ \text{ for all } l < i, \qquad (3)$$
$$\mathbf{v}_j^{(k)} = \operatorname*{argmax}_{\mathbf{v} \in \mathbb{R}^{p_k}} \ \mathbf{v}^T \boldsymbol{\Delta}_k \mathbf{v} \quad \text{subject to} \quad \mathbf{v}^T\mathbf{v} = 1, \ \ \mathbf{v}^T\mathbf{v}_l^{(k)} = 0 \ \text{ for all } l < j, \qquad (4)$$

for which we know the solution to be given by the eigendecompositions of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_k$. That is, $\mathbf{u}_i$ is the eigenvector of $\boldsymbol{\Sigma}$ with the $i^{th}$ largest eigenvalue, and $\mathbf{v}_j^{(k)}$ is the eigenvector of $\boldsymbol{\Delta}_k$ with the $j^{th}$ largest eigenvalue. Most notably, $\mathbf{u}_1$ maximizes the joint variation and is interpreted as the most dominant pattern shared by all datasets. We call the columns of $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_n]$ the integrated principal component (iPC) scores and the columns of $\mathbf{V}_k = [\mathbf{v}_1^{(k)}, \ldots, \mathbf{v}_{p_k}^{(k)}]$ the iPC loadings for the $k^{th}$ dataset. Since we are interested in the joint patterns, we primarily plot the iPC scores to visualize the joint patterns in sample space.

Remark.

The population covariances in (2) are not identifiable (e.g. $\boldsymbol{\Sigma} \otimes \boldsymbol{\Delta}_k = (c\boldsymbol{\Sigma}) \otimes (c^{-1}\boldsymbol{\Delta}_k)$ for any $c > 0$), but the iPC scores and loadings are identifiable since eigenvectors are scale-invariant.

To build intuition for the iPCA model (2), we note that $\mathbf{1}_n\boldsymbol{\mu}_k^T$ is the matrix of column means of $\mathbf{X}_k$. Moreover, by properties of the matrix-variate normal, we interpret $\boldsymbol{\Delta}_k$ as describing the covariance structure among features in $\mathbf{X}_k$, giving rise to patterns unique to $\mathbf{X}_k$, and we interpret $\boldsymbol{\Sigma}$ as describing the common covariance structure among samples, corresponding to joint patterns that are shared by all datasets. Hence, maximizing the joint row covariance in (3) yields the dominant patterns among samples, which are common to all datasets.

We also point out that if $K = 1$ and $\boldsymbol{\Sigma} = \mathbf{I}_n$, then the $\mathbf{v}_j^{(1)}$ are the PC loadings from PCA, and if $K = 1$ and $\boldsymbol{\Delta}_1 = \mathbf{I}_{p_1}$, then the $\mathbf{u}_i$ are the PC scores from PCA. This reveals the fundamental difference between PCA and iPCA: PCA computes the PC loadings and scores by assuming independent samples and features, respectively, while iPCA models the dependencies among samples and features concurrently, making no such independence assumptions.

2.2 Sample Version of iPCA

To go from the population iPCA model to the sample version, we simply plug in estimators $\hat{\boldsymbol{\Sigma}}$ and $\hat{\boldsymbol{\Delta}}_k$ for $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_k$. The sample version of iPCA can be summarized as follows (a minimal numerical sketch of steps 3 and 4 is given after this list):

  1. Model each dataset via a matrix-variate normal model with Kronecker product covariance structure: $\mathbf{X}_k \sim N_{n \times p_k}(\mathbf{1}_n\boldsymbol{\mu}_k^T, \ \boldsymbol{\Sigma} \otimes \boldsymbol{\Delta}_k)$ for each $k = 1, \ldots, K$.

  2. Estimate the covariance matrices to obtain $\hat{\boldsymbol{\Sigma}}, \hat{\boldsymbol{\Delta}}_1, \ldots, \hat{\boldsymbol{\Delta}}_K$. Methods for covariance estimation will be discussed in detail in Section 3.

  3. Compute the eigenvectors of the estimates, say $\hat{\mathbf{U}}$ of $\hat{\boldsymbol{\Sigma}}$ and $\hat{\mathbf{V}}_k$ of $\hat{\boldsymbol{\Delta}}_k$. We interpret $\hat{\mathbf{U}}$ as the dominant joint patterns in sample space and $\hat{\mathbf{V}}_k$ as the dominant patterns in feature space which are specific to $\mathbf{X}_k$.

  4. Visualize the dominant joint patterns by plotting the iPC scores, i.e. the leading columns of $\hat{\mathbf{U}}$.
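
The following sketch (our own illustration, assuming the covariance estimates have already been computed, e.g. via a penalized estimator from Section 3) shows steps 3 and 4 of this workflow.

```python
import numpy as np

def ipca_scores_and_loadings(Sigma_hat, Delta_hats):
    """Steps 3-4 of the sample iPCA workflow: eigendecompose the estimated
    covariances to obtain the joint iPC scores and per-dataset iPC loadings."""
    # Joint iPC scores: eigenvectors of the shared row covariance estimate,
    # ordered by decreasing eigenvalue.
    evals, evecs = np.linalg.eigh(Sigma_hat)
    U_hat = evecs[:, np.argsort(evals)[::-1]]                 # n x n iPC scores

    # Dataset-specific iPC loadings: eigenvectors of each column covariance.
    V_hats = []
    for Delta_k in Delta_hats:
        d_evals, d_evecs = np.linalg.eigh(Delta_k)
        V_hats.append(d_evecs[:, np.argsort(d_evals)[::-1]])  # p_k x p_k loadings
    return U_hat, V_hats

# Step 4: visualize the joint patterns, e.g. scatter U_hat[:, 0] vs U_hat[:, 1],
# coloring each point (sample) by a label of interest.
```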

Figure 2 illustrates a motivating example for when iPCA is advantageous. In the example, strong dependencies among features obscure the true joint patterns among the samples. As a result, applying PCA separately to each of the datasets (panels B-D) fails to reveal the joint signal. iPCA can better recover the joint signal because it exploits the known integrated data structure and extracts the shared information among all three datasets simultaneously.

Figure 2: Three coupled data matrices were simulated from the iPCA model (2), with $\boldsymbol{\Sigma}$ and the $\boldsymbol{\Delta}_k$'s taken to be as in the base simulation described in Section 4.1. (A) plots the top two eigenvectors of $\boldsymbol{\Sigma}$. In separate PCA analyses (B-D), the individual signal in each dataset masks the true joint signal, but (E) iPCA (using the multiplicative Frobenius estimator) exploits the integrated data structure and recovers the true joint signal.

2.2.1 Variance Explained by iPCA

After performing iPCA, one might wonder how to interpret the iPCs. Analogous to PCA, we can define a notion of variance explained.

Definition 1.

Assume that $\mathbf{X}_k$ has been column-centered. We define the cumulative proportion of variance explained in dataset $\mathbf{X}_k$ by the top $M$ iPCs to be

$$\text{PVE}_k(M) = \frac{\big\| \hat{\mathbf{U}}_{(M)} \hat{\mathbf{U}}_{(M)}^T \, \mathbf{X}_k \, \hat{\mathbf{V}}_{k,(M)} \hat{\mathbf{V}}_{k,(M)}^T \big\|_F^2}{\| \mathbf{X}_k \|_F^2}, \qquad (5)$$

where $\hat{\mathbf{U}}_{(M)} = [\hat{\mathbf{u}}_1, \ldots, \hat{\mathbf{u}}_M]$ are the top $M$ iPC scores, and $\hat{\mathbf{V}}_{k,(M)} = [\hat{\mathbf{v}}_1^{(k)}, \ldots, \hat{\mathbf{v}}_M^{(k)}]$ are the top $M$ iPC loadings associated with $\mathbf{X}_k$.

Definition 2.

The marginal proportion of variance explained in dataset $\mathbf{X}_k$ by the $m^{th}$ iPC is defined as $\text{PVE}_k(m) - \text{PVE}_k(m-1)$, with $\text{PVE}_k(0) := 0$.

We verify in Appendix A that (5) is a proportion and is monotonically increasing as $M$ increases. Aside from being well-defined, we also show in Appendix A that (5) generalizes the cumulative proportion of variance explained in PCA and hence is a natural definition.

Remark.

Unlike PCA, the marginal proportions of variance explained need not be monotonically decreasing in iPCA. This is because iPCA does not maximize the total variance within each dataset; e.g., if the marginal proportion of variance explained by iPC2 in dataset $\mathbf{X}_k$ exceeds that of iPC1, this simply means that dataset $\mathbf{X}_k$ contributed more variation to the joint pattern in iPC2 than in iPC1.
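
A short numerical sketch of Definition 1 (our own illustration, following the projection form of (5) as written above):

```python
import numpy as np

def cumulative_pve(X_k, U_hat, V_k_hat, M):
    """Cumulative proportion of variance explained in X_k by the top M iPCs,
    following the projection form of (5). X_k must be column-centered."""
    U_M = U_hat[:, :M]           # top M iPC scores (n x M)
    V_M = V_k_hat[:, :M]         # top M iPC loadings for dataset k (p_k x M)
    proj = U_M @ U_M.T @ X_k @ V_M @ V_M.T
    return np.linalg.norm(proj, "fro") ** 2 / np.linalg.norm(X_k, "fro") ** 2

# Marginal proportion explained by the m-th iPC (Definition 2):
# cumulative_pve(X_k, U_hat, V_k_hat, m) - cumulative_pve(X_k, U_hat, V_k_hat, m - 1)
```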

2.3 Relationship to Existing Methods

Throughout our development of iPCA, we have established the connection between iPCA and PCA, demonstrating that iPCA is indeed a generalization of PCA. We also find it instructive to draw connections between iPCA and other existing data integration methods.

Relationship to Multiblock PCA Family

As discussed in Abdi et al. (2013), multiblock PCA methods reduce to performing PCA on the normalized concatenated data $[\tilde{\mathbf{X}}_1, \ldots, \tilde{\mathbf{X}}_K]$, where each $\mathbf{X}_k$ has been normalized to $\tilde{\mathbf{X}}_k$ according to some procedure. We show in Proposition 1 that performing PCA on the unnormalized concatenated data $\mathbf{X}$ (referred to as concatenated PCA) is a special case of iPCA, where we assume $\boldsymbol{\Delta}_k = \mathbf{I}_{p_k}$ for each $k$. Proposition 1 can easily be extended to show that multiblock PCA methods are a special case of iPCA for some fixed $\boldsymbol{\Delta}_k$, where the exact form of $\boldsymbol{\Delta}_k$ depends on the normalization procedure. For example, since MFA normalizes $\mathbf{X}_k$ by dividing all of its entries by its largest singular value $\sigma_1(\mathbf{X}_k)$, MFA is a special case of iPCA, where each $\boldsymbol{\Delta}_k = \sigma_1^2(\mathbf{X}_k)\,\mathbf{I}_{p_k}$. Consequently, iPCA can be viewed as a unifying framework for the multiblock PCA family.

Relationship to Matrix Factorizations

Since coupled matrix factorizations (CMF) decompose the data into the product of low-rank joint and individual factors, an argument similar to Theorem 2 from Hastie et al. (2015) shows that CMF is closely related to the SVD of the concatenated data. More precisely, one solution of the CMF optimization problem is given by the PC scores and loadings from concatenated PCA, which we know to be a special case of iPCA by Proposition 1. JIVE, on the other hand, is an additive model and decomposes coupled data into the sum of a low-rank joint variation matrix, a low-rank individual variation matrix, and an error matrix. While CMF and iPCA are closely related, JIVE and iPCA are quite different models and are each advantageous in different situations.

3 Covariance Estimators for iPCA

Fitting the iPCA model requires estimating the population parameters in (2). As in PCA, we consider maximum likelihood estimation. However, we will see in Section 3.1 that there are substantial challenges with this approach, so we propose new estimators in Section 3.2.

3.1 Unpenalized Maximum Likelihood Estimators

Under the iPCA population model (2), the log-likelihood function reduces to

$$\ell\big(\boldsymbol{\Sigma}, \{\boldsymbol{\Delta}_k\}, \{\boldsymbol{\mu}_k\}\big) = -\frac{1}{2} \sum_{k=1}^{K} \Big[ p_k \log|\boldsymbol{\Sigma}| + n \log|\boldsymbol{\Delta}_k| + \operatorname{tr}\!\big( \boldsymbol{\Sigma}^{-1} (\mathbf{X}_k - \mathbf{1}_n\boldsymbol{\mu}_k^T)\, \boldsymbol{\Delta}_k^{-1}\, (\mathbf{X}_k - \mathbf{1}_n\boldsymbol{\mu}_k^T)^T \big) \Big] + C, \qquad (6)$$

where $C$ is a constant that does not depend on the parameters. Taking partial derivatives of (6) with respect to each parameter gives the following:

Lemma 1.

The unpenalized MLEs of $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}$, and $\boldsymbol{\Delta}_k$ satisfy

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{n}\mathbf{X}_k^T\mathbf{1}_n, \qquad (7)$$
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{p}\sum_{k=1}^{K} (\mathbf{X}_k - \mathbf{1}_n\hat{\boldsymbol{\mu}}_k^T)\, \hat{\boldsymbol{\Delta}}_k^{-1}\, (\mathbf{X}_k - \mathbf{1}_n\hat{\boldsymbol{\mu}}_k^T)^T, \qquad (8)$$
$$\hat{\boldsymbol{\Delta}}_k = \frac{1}{n} (\mathbf{X}_k - \mathbf{1}_n\hat{\boldsymbol{\mu}}_k^T)^T\, \hat{\boldsymbol{\Sigma}}^{-1}\, (\mathbf{X}_k - \mathbf{1}_n\hat{\boldsymbol{\mu}}_k^T). \qquad (9)$$

Notice that the unpenalized MLE of $\boldsymbol{\mu}_k$ is the vector of column means of $\mathbf{X}_k$, so $\mathbf{X}_k - \mathbf{1}_n\hat{\boldsymbol{\mu}}_k^T$ is simply $\mathbf{X}_k$ with its columns centered to mean $0$. We can proceed to compute the unpenalized MLEs of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_K$ via a Flip-Flop algorithm (also known as block coordinate descent) analogous to Dutilleul (1999), which iteratively optimizes over each of the parameters while keeping all other parameters fixed.

However, because we only have one matrix observation per matrix-variate normal model, existence of the unpenalized MLEs is not guaranteed. In fact, the following theorem essentially implies that the unpenalized MLEs do not exist for all practical purposes.

Theorem 1.
  1. If the population means in (2) are known, , and for , then the unpenalized MLEs for , exist.

  2. If the population means in (2) are unknown, then the unpenalized MLEs for $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_K$ are not positive definite and hence do not exist.

The proof of Theorem 1 also shows that if the unpenalized MLEs for and exist, then for each . Thus in summary, if for some or if the population means are unknown (which is almost always the case), then the unpenalized MLEs do not exist. These severe restrictions motivate new covariance estimators.

For example, we can naively estimate $\boldsymbol{\Delta}_k$ and $\boldsymbol{\Sigma}$ by setting their counterparts to the identity matrix.

Proposition 1.
  1. The MLE for $\boldsymbol{\Delta}_k$, assuming that $\boldsymbol{\Sigma} = \mathbf{I}_n$, is

    $$\hat{\boldsymbol{\Delta}}_k = \frac{1}{n}\mathbf{X}_k^T\mathbf{X}_k. \qquad (10)$$
  2. Let $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_K]$. The MLE for $\boldsymbol{\Sigma}$, assuming $\boldsymbol{\Delta}_k = \mathbf{I}_{p_k}$ for all $k$, is

    $$\hat{\boldsymbol{\Sigma}} = \frac{1}{p}\mathbf{X}\mathbf{X}^T = \frac{1}{p}\sum_{k=1}^{K}\mathbf{X}_k\mathbf{X}_k^T. \qquad (11)$$

This approach for estimating $\boldsymbol{\Sigma}$ gives the familiar MLE for the concatenated data $\mathbf{X}$. Hence, the eigenvectors of $\hat{\boldsymbol{\Sigma}}$ from (11) are the PC scores from concatenated PCA (i.e. PCA applied to $\mathbf{X}$). While this illuminates another way in which iPCA is related to PCA, we will see in Section 4 that concatenated PCA performs poorly when the datasets are of different scales. In the next section, we discuss more effective methods for estimating $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_K$.
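
As a quick illustration of Proposition 1(2) (a sketch with our own toy data and variable names), setting each $\boldsymbol{\Delta}_k$ to the identity reduces the estimate of $\boldsymbol{\Sigma}$ to the scaled Gram matrix of the concatenated data, whose leading eigenvectors are the concatenated-PCA scores.

```python
import numpy as np

rng = np.random.default_rng(2)
Xs = [rng.normal(size=(50, p_k)) for p_k in (20, 35, 15)]   # toy coupled data
Xs = [X - X.mean(axis=0) for X in Xs]                        # column-center each X_k

p = sum(X.shape[1] for X in Xs)
Sigma_hat = sum(X @ X.T for X in Xs) / p                     # MLE under Delta_k = I, eq. (11)

X_concat = np.hstack(Xs)
U, d, Vt = np.linalg.svd(X_concat, full_matrices=False)
# The leading eigenvectors of Sigma_hat match the left singular vectors of the
# concatenated data (up to sign), i.e. the concatenated-PCA scores.
```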

3.2 Penalized Maximum Likelihood Estimators

Since the unpenalized MLEs do not exist for a large class of problems, we instead look to penalized maximum likelihood estimation. To reduce cluttering of notation, we assume that each $\mathbf{X}_k$ has been column-centered. Then, the penalized maximum likelihood problem simplifies to

$$\big(\hat{\boldsymbol{\Sigma}}^{-1}, \hat{\boldsymbol{\Delta}}_1^{-1}, \ldots, \hat{\boldsymbol{\Delta}}_K^{-1}\big) = \operatorname*{argmin}_{\boldsymbol{\Sigma}^{-1} \succ 0,\ \boldsymbol{\Delta}_k^{-1} \succ 0} \ \sum_{k=1}^{K} \Big[ -p_k \log|\boldsymbol{\Sigma}^{-1}| - n \log|\boldsymbol{\Delta}_k^{-1}| + \operatorname{tr}\!\big( \boldsymbol{\Sigma}^{-1}\mathbf{X}_k\boldsymbol{\Delta}_k^{-1}\mathbf{X}_k^T \big) \Big] + P\big(\boldsymbol{\Sigma}^{-1}, \boldsymbol{\Delta}_1^{-1}, \ldots, \boldsymbol{\Delta}_K^{-1}\big), \qquad (12)$$

where $P(\cdot)$ denotes the penalty.

Similar to previous work on the penalized matrix-variate normal log-likelihood (Yin and Li 2012; Allen and Tibshirani 2010), we can consider an additive penalty term and define the additive iPCA penalty to be $P = \lambda_{\Sigma}\, \rho(\boldsymbol{\Sigma}^{-1}) + \sum_{k=1}^{K} \lambda_{\Delta_k}\, \rho(\boldsymbol{\Delta}_k^{-1})$, where $\lambda_{\Sigma}, \lambda_{\Delta_1}, \ldots, \lambda_{\Delta_K} \geq 0$ are tuning parameters and $\rho(\cdot)$ is a matrix penalty. We can take $\rho(\cdot)$ to be the element-wise $\ell_1$ norm to induce sparsity or the squared Frobenius norm to induce smoothness. However, solving (12) with these additive penalties is a non-convex problem, for which we can only guarantee a local solution.

Nevertheless, even though (12) is non-convex in Euclidean space, Wiesel (2012) showed that the matrix-variate normal log-likelihood is geodesically convex (g-convex) with respect to the manifold of positive definite matrices. G-convexity is a generalized notion of convexity on a Riemannian manifold, and like convexity, all local minima of g-convex functions are globally optimal. We thus exploit the g-convex properties of the matrix-variate normal log-likelihood and construct a new penalty, the multiplicative Frobenius iPCA penalty $P = \sum_{k=1}^{K} \lambda_k\, \|\boldsymbol{\Sigma}^{-1}\|_F^2\, \|\boldsymbol{\Delta}_k^{-1}\|_F^2$, which we will show to be g-convex in Theorem 3. Note that because $\|\boldsymbol{\Delta}_k^{-1} \otimes \boldsymbol{\Sigma}^{-1}\|_F^2 = \|\boldsymbol{\Sigma}^{-1}\|_F^2\, \|\boldsymbol{\Delta}_k^{-1}\|_F^2$, a Frobenius penalty on each Kronecker product precision matrix can be rewritten as a product of the two individual Frobenius penalties, giving rise to its name.
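
For completeness, the standard identity behind this factorization is

$$\|\mathbf{A} \otimes \mathbf{B}\|_F^2 = \operatorname{tr}\!\big((\mathbf{A}^T\mathbf{A}) \otimes (\mathbf{B}^T\mathbf{B})\big) = \operatorname{tr}(\mathbf{A}^T\mathbf{A})\,\operatorname{tr}(\mathbf{B}^T\mathbf{B}) = \|\mathbf{A}\|_F^2\, \|\mathbf{B}\|_F^2,$$

so the Frobenius norm of a Kronecker product of precision matrices separates exactly into the product of the two individual Frobenius norms.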

The Flip-Flop algorithms for solving (12) with the various penalties are derived in Appendix B.2, but in general, for the Frobenius penalties, each Flip-Flop update has a closed-form solution determined by a full eigendecomposition. For the $\ell_1$ penalties (also known as the Kronecker Graphical Lasso (Tsiligkaridis et al. 2013)), each update can be solved by the graphical lasso (Hsieh et al. 2011). We provide the multiplicative Frobenius Flip-Flop algorithm here in Algorithm 1, but since the other algorithms take similar forms, we leave them to Appendix B.2. The next theorem guarantees numerical convergence of the Flip-Flop algorithms for the multiplicative Frobenius, additive Frobenius, and additive $\ell_1$ penalties.

1: Center the columns of each $\mathbf{X}_k$, and initialize $\hat{\boldsymbol{\Delta}}_1^{-1}, \ldots, \hat{\boldsymbol{\Delta}}_K^{-1}$ to be positive definite.
2: while not converged do
3:     Take eigendecomposition: $\sum_{k=1}^{K} \mathbf{X}_k \hat{\boldsymbol{\Delta}}_k^{-1} \mathbf{X}_k^T = \mathbf{U}\boldsymbol{\Gamma}\mathbf{U}^T$.
4:     Regularize eigenvalues: shrink the entries of $\boldsymbol{\Gamma}$ via the closed-form solution of the penalized problem in $\boldsymbol{\Sigma}^{-1}$ (Appendix B.2).
5:     Update $\hat{\boldsymbol{\Sigma}}^{-1}$ from $\mathbf{U}$ and the regularized eigenvalues.
6:     for $k = 1, \ldots, K$ do
7:         Take eigendecomposition: $\mathbf{X}_k^T \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{X}_k = \mathbf{V}_k\boldsymbol{\Phi}_k\mathbf{V}_k^T$.
8:         Regularize eigenvalues: shrink the entries of $\boldsymbol{\Phi}_k$ via the closed-form solution of the penalized problem in $\boldsymbol{\Delta}_k^{-1}$ (Appendix B.2).
9:         Update $\hat{\boldsymbol{\Delta}}_k^{-1}$ from $\mathbf{V}_k$ and the regularized eigenvalues.
Algorithm 1 Flip-Flop Algorithm for Multiplicative Frobenius iPCA Estimators
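
The sketch below illustrates Algorithm 1 in Python. The eigenvalue shrinkage formula is our own reconstruction, obtained by solving the stationarity condition of the penalized objective (12) eigenvalue-wise; it is not the authors' reference implementation, and variable names are illustrative.

```python
import numpy as np

def _penalized_inverse(S, dim_count, c):
    """Minimize -dim_count*log|Om| + tr(Om S) + c*||Om||_F^2 over SPD Om.
    The minimizer shares eigenvectors with S; each eigenvalue solves
    2*c*w^2 + gamma*w - dim_count = 0 (positive root)."""
    gamma, U = np.linalg.eigh(S)
    gamma = np.maximum(gamma, 0.0)                 # guard tiny negative eigenvalues
    if c > 0:
        omega = (-gamma + np.sqrt(gamma**2 + 8.0 * c * dim_count)) / (4.0 * c)
    else:
        omega = dim_count / gamma                  # unpenalized limit (needs gamma > 0)
    return U @ np.diag(omega) @ U.T

def flip_flop_mult_frobenius(Xs, lambdas, n_iter=50):
    """Sketch of the Flip-Flop updates for the multiplicative Frobenius iPCA
    estimator. Xs is a list of column-centered n x p_k arrays; lambdas is a
    list of penalty parameters lambda_k."""
    n = Xs[0].shape[0]
    p = sum(X.shape[1] for X in Xs)
    Delta_invs = [np.eye(X.shape[1]) for X in Xs]  # initialize to identity
    Sigma_inv = np.eye(n)
    for _ in range(n_iter):
        # Update the joint row precision Sigma^{-1}
        S = sum(X @ D_inv @ X.T for X, D_inv in zip(Xs, Delta_invs))
        c = sum(lam * np.linalg.norm(D_inv, "fro")**2
                for lam, D_inv in zip(lambdas, Delta_invs))
        Sigma_inv = _penalized_inverse(S, p, c)
        # Update each column precision Delta_k^{-1}
        for k, X in enumerate(Xs):
            S_k = X.T @ Sigma_inv @ X
            c_k = lambdas[k] * np.linalg.norm(Sigma_inv, "fro")**2
            Delta_invs[k] = _penalized_inverse(S_k, n, c_k)
    return Sigma_inv, Delta_invs
```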
Theorem 2.

Suppose that the objective function in (12) is bounded below. Suppose also that either (i) the penalty $P$ is a differentiable convex function with respect to each coordinate, or (ii) $P$ separates into a sum of a (possibly non-differentiable) convex function of $\boldsymbol{\Sigma}^{-1}$ and (possibly non-differentiable) convex functions of each $\boldsymbol{\Delta}_k^{-1}$. If either (i) or (ii) holds, then the Flip-Flop algorithm corresponding to (12) converges to a stationary point of the objective.

While Theorem 2 guarantees convergence to a local solution, we can build upon Wiesel (2012) to prove a much stronger statement for the multiplicative Frobenius iPCA estimator.

Theorem 3.

The objective of the multiplicative Frobenius iPCA estimator is jointly geodesically convex in $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_K$. Because of this, the Flip-Flop algorithm for the multiplicative Frobenius iPCA estimator given in Algorithm 1 converges to the global solution.

There are currently only a handful of non-convex problems where there exists an achievable global solution, so this guarantee that the multiplicative Frobenius iPCA estimator always reaches a global solution is both extremely rare and highly desirable. In Section 4, we will also see that the multiplicative Frobenius iPCA estimator undoubtedly gives the best empirical performance, indicating that in addition to its theoretical advantages from global convergence, there are significant practical advantages associated with the g-convex penalty. A self-contained review of g-convexity and the proof of Theorem 3 are given in Appendix C.

3.2.1 Subspace Consistency of the Additive Correlation iPCA Estimator

In this section, we begin exploring statistical properties of iPCA by proving subspace consistency for a slight variation of the additive $\ell_1$ iPCA estimator. Rather than applying the $\ell_1$ penalties to the inverse covariance matrices, we can apply them to the inverse correlation matrices, as outlined by Algorithm 6 in Appendix D. This approach was adopted in Zhou (2014a) and Rothman et al. (2008), and we extend their idea to integrated data. Throughout this section, we let $\hat{\boldsymbol{\Sigma}}$ denote the estimate of $\boldsymbol{\Sigma}$ obtained from Algorithm 6, and we refer to it as the additive correlation iPCA estimator. We will show in Theorem 4 that the one-step additive correlation iPCA estimator consistently estimates the underlying joint subspace.

Though previous works have studied the convergence of related sparse penalties, none are directly applicable for data integration. For instance, Tsiligkaridis et al. (2013) proved convergence rates for KGlasso but assumed multiple matrix observations per matrix-variate normal model. Since iPCA assumes only one matrix observation per model, the results from Tsiligkaridis et al. (2013) do not apply. In fact, the theory suggests it is difficult to consistently estimate the diagonals of the covariance matrices with only one matrix observation. This motivated us to follow the approach in Zhou (2014a), where convergence rates were derived for the one-step (non-iterative) version of Algorithm 6. Zhou (2014a) assumed that only one matrix instance was observed from the matrix-variate normal distribution, so we extend the proof techniques and results in Zhou (2014a), where $K = 1$, to iPCA, where we observe one matrix instance for each of the $K$ distinct matrix-variate normal models.

For simplicity, we assume the initial estimate in Algorithm 6 is taken to be the identity matrix, and like Zhou (2014a), we analyze the one-step version of Algorithm 6. That is, we stop the algorithm after one iteration of the while loop. For now, we also assume the classical "large n" setting, with more samples than features in each $\mathbf{X}_k$. We will later discuss how to adapt our results to the "large p" setting.

We next collect notation. Suppose for each , the true population model is given by , and the true correlation matrices associated to and are and , respectively. For identifiability, define and so that and for each . If is a matrix, let denote the operator norm or the maximum singular value of . Let denote the Frobenius norm (i.e. ). Let denote the number of non-zero off-diagonal entries in . Let , and let . Write and for the minimum and maximum eigenvalues of . Also write and . If , then as . If , then there exists positive constants such that as .

The following assumptions are needed to establish subspace consistency of .

  1. Assume that and are sparse with respect to each other’s dimensions: and for each .

  2. Assume that we have uniformly bounded spectra: and for each .

  3. Assume that the inverse correlation matrices satisfy and for each .

  4. Assume that is finite and the growth rate of and satisfy for each .

Remarks.
  1. Rather than verifying the sparsity assumption of in A and the growth rate D for each , it is sufficient to check that and , respectively.

  2. A implies that and as .

Under these four assumptions, we summarize our main statistical convergence result:

Theorem 4.

Suppose that A-D hold and that for each . Let denote the one-step additive correlation iPCA estimator, where we choose and for each

. Then with probability

,

(13)
(14)

Details of this proof are provided in Appendix D, but the overarching proof idea is to follow Algorithm 6 and sequentially bound each step of the algorithm. Our proof closely resembles Zhou (2014a), with technical differences due to the estimation of $\boldsymbol{\Sigma}$ from multiple $\mathbf{X}_k$'s with different $\boldsymbol{\Delta}_k$'s. Note that when $K = 1$, Theorem 4 gives the same rates as Zhou (2014a).

Remark.

If for each (i.e. the “large ” setting), then we modify Algorithm 6 to first initialize an estimate for , next estimate , and then obtain the final estimate of . The same convergence rates can be obtained for the “large ” setting by using this modified algorithm. The only additional assumption required here is a bound on , namely, for each and .

Therefore, in either the large $n$ or the large $p$ setting, we have a one-step Flip-Flop algorithm such that, under certain assumptions, the estimate of $\boldsymbol{\Sigma}$ converges in the operator and Frobenius norms at the rates given by (13) and (14). Convergence in the operator norm, in particular, has two direct consequences. By Weyl's Theorem (Horn and Johnson 2012), the eigenvalues of $\hat{\boldsymbol{\Sigma}}$ are consistent, and by a variant of the Davis-Kahan $\sin\Theta$ theorem (Yu et al. 2015), the eigenvectors of $\hat{\boldsymbol{\Sigma}}$ are consistent. Since the eigenvectors of $\hat{\boldsymbol{\Sigma}}$ define the estimated iPCA subspace, this in turn implies subspace consistency.

3.2.2 Selecting Penalty Parameters

In practice, it is important to select the penalty parameters for (12) in a principled, data-driven manner. We propose to do this via a missing data imputation framework.

Let $\Lambda$ denote the space of penalty parameters, and let $\boldsymbol{\lambda} \in \Lambda$ be a specific choice of penalty parameters. The idea is to first randomly leave out scattered elements from each $\mathbf{X}_k$. Then, for each $\boldsymbol{\lambda} \in \Lambda$, impute the missing elements via an EM-like algorithm, similar to Allen and Tibshirani (2010). Finally, select the $\boldsymbol{\lambda}$ which minimizes the error between the imputed values and the observed values. Further details, technical derivations, and numerical results regarding our imputation method and penalty parameter selection are provided in Appendix E.
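
A high-level sketch of the selection loop follows; the estimator `fit_ipca` and imputation routine `impute_missing` are hypothetical placeholders standing in for the penalized Flip-Flop estimator and the EM-like imputation algorithm of Appendix E.

```python
import numpy as np

def select_penalty(Xs, candidate_lams, fit_ipca, impute_missing,
                   frac_missing=0.05, seed=0):
    """Pick the penalty parameters minimizing imputation error on randomly
    held-out entries. `fit_ipca` and `impute_missing` are placeholders."""
    rng = np.random.default_rng(seed)
    masks = [rng.random(X.shape) < frac_missing for X in Xs]   # held-out entries
    Xs_obs = [np.where(m, np.nan, X) for X, m in zip(Xs, masks)]

    best_lam, best_err = None, np.inf
    for lam in candidate_lams:
        model = fit_ipca(Xs_obs, lam)                          # fit on observed entries
        Xs_imp = impute_missing(Xs_obs, model)                 # fill held-out entries
        err = sum(np.sum((Xi[m] - X[m]) ** 2)
                  for Xi, X, m in zip(Xs_imp, Xs, masks))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```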

Remarks.
  1. If is large or the datasets are large, it might be computationally intractable to try all combinations of penalty parameters in . In these cases, we can select the penalty parameters in a greedy manner: first fix and optimize over , then fix and optimize , and so forth, and we stop after optimizing .

  2. Our imputation procedure may also be used to perform iPCA in missing data scenarios.

4 Empirical Results

In the following simulations and case study, we evaluate iPCA against individual PCAs on each of the datasets $\mathbf{X}_1, \ldots, \mathbf{X}_K$, concatenated PCA, distributed PCA, JIVE, and MFA. Note that many data integration methods from the multiblock PCA family are known to perform similarly to MFA (Abdi et al. 2013), so we only include MFA to minimize redundancy. We also omit CMF, as it performs similarly to concatenated PCA, and the GSVD, since it is not applicable for integrated data with both low-dimensional and high-dimensional datasets.

Within the class of iPCA estimators, we focus our attention on the additive and multiplicative Frobenius iPCA estimators, but to also represent the sparse estimators, we include the most commonly used sparse estimator, the additive $\ell_1$ penalty applied to the inverse covariance matrices. In order to reduce the computational burden, we stop the Flip-Flop algorithm after one iteration, and we select the iPCA penalty parameters in a greedy manner, as discussed in Section 3.2.2.

4.1 Simulations

The base simulation is set up as follows: three coupled data matrices, $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3$, were simulated according to the iPCA Kronecker covariance model (2). Here, $\boldsymbol{\Sigma}$ is taken to be a spiked covariance with the top two factors forming three clusters as shown in Figure 2A; one $\boldsymbol{\Delta}_k$ is an autoregressive Toeplitz matrix whose $(i,j)^{th}$ entry decays geometrically in $|i - j|$; another is the observed covariance matrix of miRNA data from TCGA ovarian cancer (The Cancer Genome Atlas Research Network 2011); and the third is a block-diagonal matrix with five equally-sized blocks. We also ensured that the largest eigenvalue of each $\boldsymbol{\Delta}_k$ was larger than that of $\boldsymbol{\Sigma}$ so that the joint patterns are intentionally obscured by individualistic patterns. From this base simulation, we systematically varied the simulation parameters (the number of samples, the number of features, and the strength of the joint signal in $\boldsymbol{\Sigma}$, i.e. its top eigenvalue) one at a time while keeping everything else constant.

We evaluate the performance of the various methods using the subspace recovery error: if the true subspace of $\boldsymbol{\Sigma}$ was simulated to be of dimension $d$ with orthonormal eigenbasis $\mathbf{U}$, and the top $d$ eigenvectors of the estimate $\hat{\boldsymbol{\Sigma}}$ are given by $\hat{\mathbf{U}}$, then the subspace recovery error is defined to be $\|\mathbf{P} - \hat{\mathbf{P}}\|_F^2$, where $\mathbf{P} = \mathbf{U}\mathbf{U}^T$ and $\hat{\mathbf{P}} = \hat{\mathbf{U}}\hat{\mathbf{U}}^T$. This metric simply quantifies the distance between the true subspace of $\boldsymbol{\Sigma}$ and the estimated subspace from $\hat{\boldsymbol{\Sigma}}$. We note that a lower subspace recovery error implies higher estimation accuracy, and in the base simulation, the true joint subspace has dimension $d = 2$. Although there are other metrics like canonical angles, which also quantify the distance between subspaces, these metrics behave similarly to the subspace recovery error, so we omit the results for brevity.
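
One common way to compute such a subspace distance is via projection matrices; the sketch below uses that form for illustration (the paper's exact normalization may differ).

```python
import numpy as np

def subspace_recovery_error(U_true, Sigma_hat, d):
    """Distance between the true rank-d joint subspace (orthonormal basis
    U_true, n x d) and the span of the top d eigenvectors of Sigma_hat,
    measured via the Frobenius distance between projection matrices."""
    evals, evecs = np.linalg.eigh(Sigma_hat)
    U_hat = evecs[:, np.argsort(evals)[::-1][:d]]
    P_true = U_true @ U_true.T
    P_hat = U_hat @ U_hat.T
    return np.linalg.norm(P_true - P_hat, "fro") ** 2
```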

Figure 3: Subspace recovery as simulation parameters vary from the base simulation: (A) As the number of samples increases, it becomes more difficult to estimate the joint row subspace; (B) As the number of features increases, it becomes slightly easier to estimate the joint row subspace; (C) Performance drastically improves as the strength of the joint signal in $\boldsymbol{\Sigma}$ (i.e. the top singular value of $\boldsymbol{\Sigma}$) increases. Moreover, in almost every scenario, the multiplicative and additive Frobenius iPCA estimators outperform their competitors.

The average subspace recovery error, measured over repeated trials, from the various simulations is shown in Figure 3. We clearly see that the additive and multiplicative Frobenius iPCA estimators consistently outperformed all other methods. Since $\boldsymbol{\Sigma}^{-1}$ was not simulated to be sparse, it is no surprise that the Frobenius iPCA estimators outperformed the sparse $\ell_1$ iPCA estimator. It is also expected that distributed PCA performs poorly since the $\boldsymbol{\Delta}_k$'s are not all identical, violating its basic assumption. What may be surprising is that doing PCA on the dataset whose column covariance is the observed miRNA covariance performed better than its competitors, excluding the Frobenius iPCA estimators. We speculate that this is because the observed miRNA covariance happened to be a very low-rank matrix, and because this $\boldsymbol{\Delta}_k$ was low-rank, the signal from $\boldsymbol{\Sigma}$ most likely dominated much of the variation in the second PC. Looking ahead at Figure 4A, as Laplacian error was added to the simulated data, PCA on this dataset failed to recover the true signal since the Laplacian error increasingly contributed to the variation in the data. We also point out that MFA always yielded a lower error than concatenated PCA in Figure 3, indicating that there is value in normalizing datasets to be on comparable scales. On the other hand, we must be wary of this normalization process. In the case of these simulations, PCA on the single dataset just described outperformed MFA, illustrating that normalization can sometimes remove important information.

To verify that these simulation results are not heavily dependent on the base simulation setup, we also ran simulations varying the dimension of the true joint subspace and the number of datasets $K$. We provide these results in Appendix F.

Beyond simulating from the iPCA model (2), we check for robustness to departures from the two main iPCA assumptions: normality and the Kronecker covariance structure. To deviate from normality, we add Laplacian noise to the base simulation setup, and to depart from the Kronecker covariance structure, we simulate data from the JIVE model. The results are summarized in Figure 4, and we leave the simulation details to Appendix F.

As seen in Figure 4A, the Frobenius iPCA estimators appear to be relatively robust to the non-Gaussian noise and outperformed their competitors even as the standard deviation of the added Laplacian errors increased. From the simulations under the JIVE model in Figure 4B, we see that as the amount of noise increases, JIVE given the known ranks yields the lowest error, as expected. But similar to how JIVE was comparable to competing methods under the iPCA model (Figure 3), the iPCA estimators are comparable to competing methods for high noise levels under the JIVE model. At low noise levels, however, the Frobenius iPCA estimators are able to recover the true joint subspace better than JIVE with the known ranks. Further investigation into this peculiar behavior reveals that the Frobenius iPCA estimators give lower subspace recovery errors but larger matrix approximation errors, compared to JIVE with the known ranks. This brings up a subtle but important distinction: iPCA revolves around estimating the underlying subspace, determined by eigenvectors, while JIVE focuses on minimizing the matrix approximation error. These are inherently different objectives, and it is common for iPCA to estimate the eigenvectors well at the cost of a poor matrix approximation due to the regularized eigenvalues.

Figure 4: Robustness Simulations: (A) As Laplacian error is increasingly added to the simulated datasets, the Frobenius iPCA estimators appear to be robust to the departures from Gaussianity; (B) As the amount of noise in the JIVE model increases, iPCA seems to be comparable to existing methods, illustrating its relative robustness to departures from the Kronecker product model.

4.2 Case Study: Integrative Genomics of Alzheimer’s Disease

A key motivating example for our research is in integrative genomics, where the goal is to combine genetic information from different, yet related, technologies in order to better understand the genetic basis of particular diseases. In particular, apart from the APOE gene, little is known about the genomic basis of Alzheimer’s Disease (AD) and the genes which contribute to dominant expression patterns in AD. Thus, in this case study, we delve into the integrative genomics of AD and jointly analyze miRNA expression, gene expression via RNASeq, and DNA methylation data obtained from the Religious Orders Study Memory and Aging Project (ROSMAP) Study (Mostafavi et al. 2018). The ROSMAP study is a longitudinal clinical-pathological cohort study of aging and AD, consisting of subjects with data across all three datasets of interest, and it is uniquely positioned for the study of AD since its genomics data is collected from post-mortem brain tissue from the dorsolateral prefrontal cortex of the brain, an area known to play an important role in cognitive functions.

The ROSMAP data originally contained a large number of miRNAs, genes, and CpG sites, so we aggressively reduced the number of features in the RNASeq and methylation datasets to manageable sizes. First, we transformed the methylation data to m-values and log-transformed the RNASeq counts, as is common in most analyses for these data types. Then, we removed batch (experimental) effects from both datasets via ComBat (Johnson et al. 2007). We next filtered the features by taking those with the highest variance (top 20,000 genes for RNASeq and top 50,000 CpG sites for methylation). Then, we performed univariate filtering and kept the features with the highest association to the clinician's diagnosis. This left us with $p_1$, $p_2$, and $p_3$ features in the miRNA, RNASeq, and methylation datasets, respectively.

For our analysis, we consider two clinical outcome variables: clinician’s diagnosis and global cognition score. The clinician’s diagnosis is the last clinical evaluation prior to the patient’s death and is a categorical variable with three levels - Alzheimer’s Disease (AD), mild cognitive impairment (MCI), and no cognitive impairment (NCI). Global cognition score, a continuous variable, is the average of 19 cognitive tests and is the last cognitive testing score prior to death. While the clinician’s diagnosis is sometimes subjective, global cognition score is viewed as a more objective measure of cognition. Our goal is to find common patterns among patients, which occur in all three datasets, and to understand whether these joint patterns are predictive of AD, as measured by clinician’s diagnosis and global cognition score.

To this end, we run iPCA and other existing methods to extract dominant patterns from the ROSMAP data. Figure 5 shows the PC plots obtained from the various methods - each point represents a subject and is colored by either clinician’s diagnosis or cognition score.

Figure 5: We plot the first two (integrated) principal components from various methods applied to the ROSMAP data. Each point represents a subject, colored by the clinician’s diagnosis in panels A-I and by global cognition score in panels J-R.

Since visuals are a subjective measure of performance, we quantify performance by taking the top PCs and using them as predictors in a random forest to predict the outcome of interest. For the random forest, we split the ROSMAP data into a training set and a test set and used the default random forest settings in R. The test errors, averaged over 100 random training/test splits, are shown in Figure 6. From Figure 6, it is clear that the joint patterns extracted from iPCA using the multiplicative Frobenius penalty were the most predictive of the clinician's diagnosis of AD and the patient's global cognition score. Moreover, most of the predictive power can be attributed to the first three iPCs, which we visualize in Figure 7A-B. We also note that the top iPCs from iPCA with the multiplicative Frobenius penalty were more predictive than combining the PCs from the three individual PCAs performed on each dataset. This showcases empirically that a joint analysis of the integrated datasets can be advantageous over three disparate analyses.
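
An illustrative sketch of this evaluation step (using scikit-learn rather than R's randomForest, with an arbitrary test-set fraction and our own variable names):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_test_error(scores, y, n_pcs, seed):
    """Use the top n_pcs (i)PC scores as predictors in a random forest and
    return the misclassification error on a held-out test set."""
    Z = scores[:, :n_pcs]
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=1/3,
                                              random_state=seed)
    rf = RandomForestClassifier(random_state=seed).fit(Z_tr, y_tr)
    return 1.0 - rf.score(Z_te, y_te)

# Average over repeated random splits, as in Figure 6:
# errs = [rf_test_error(U_hat, diagnosis, n_pcs=3, seed=s) for s in range(100)]
```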

Beyond the high predictive power of iPCA with the multiplicative Frobenius penalty, it is perhaps more important for scientists to be able to interpret the iPCA results. One way is through the proportion of variance explained by the joint iPCs, as defined in Section 2.2.1. Figure 7C shows the marginal proportions for the top 5 iPCs. It reveals that the RNASeq dataset contributed the most variation in the joint patterns found by iPC1 and iPC2, and the miRNA dataset contributed the most variation in iPC3. More interestingly, even though iPC2 and iPC3 have relatively small variances, iPCA is able to pick out these weak joint signals, which we found to be predictive of AD. This reiterates that the most variable patterns in the data are not necessarily the most relevant patterns for the question at hand. In this case, our goal was to find joint patterns which occur in all three datasets, and since the joint signal is not the most dominant source of variation in each dataset, no individualistic PCA analysis would have identified the joint signal found by iPCA.

Figure 6: We took the top PCs and used them as predictor variables in a random forest to predict (A) the clinician's diagnosis and (B) the global cognition score. The average test error from the random forests over 100 random splits is shown as the number of PCs used in the random forest increases.

Figure 7: (A)-(B) We show the top 3 iPCs obtained from iPCA with the multiplicative Frobenius estimator; the points are colored by clinician’s diagnosis and global cognition score in (A) and (B), respectively. (C) We plot the marginal proportion of variance explained by the top iPCs in each dataset in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator).

We conclude our ROSMAP analysis by extracting the top genetic features which are associated with the joint patterns shown in Figure 7. Since iPCA provides an estimate of both $\boldsymbol{\Sigma}$ and the $\boldsymbol{\Delta}_k$'s, we can select the top features by applying sparse PCA to each $\hat{\boldsymbol{\Delta}}_k$ obtained from iPCA. Table 1 lists the top miRNAs, genes, and genes affiliated with the selected CpG sites obtained from sparse PCA. Here, we used the sparseEigen R package (Benidis et al. 2016) and chose the tuning parameter such that only a small number of features had non-zero loadings.

Because the RNASeq data contributed most of the variation in iPC1, we did a literature search on the top five genes extracted by sparse PCA on the RNASeq covariance estimate. Out of the top five genes, we found evidence in the biological literature linking four of the five genes (the exception being SVOP) to AD (Carrette et al. 2003; Li et al. 2017; Han et al. 2014; Espuny-Camacho et al. 2017). While this is only a preliminary investigation into the importance of the genetic features obtained from iPCA, it is encouraging evidence and may potentially hint at candidate genes for future research.

     miRNA            RNASeq      Methylation
1    miR-216a         VGF         TMCO6
2    miR-127-3p       SVOP        PHF3
3    miR-124          PCDHGC5     BRUNOL4
4    miR-30c          ADCYAP1     OSCP1
5    miR-143          LINC01007   GRIN2B
6    miR-27a          FRMPD2L1    CASP9
7    miR-603          SLC30A3     ZFP91; LPXN; ZFP91-CNTF
8    miR-423-3p       NCALD       CNP
9    miR-204          S100A4      YWHAE
10   miR-128          AZGP1       C11orf73
11   miR-193a-3p      PAK1        TMED10
12   ebv-miR-BART14   MAL2        RELL1
Table 1: Top genetic features obtained by applying sparse PCA to each $\hat{\boldsymbol{\Delta}}_k$ in the ROSMAP analysis (using the multiplicative Frobenius iPCA estimator)

5 Discussion

As showcased in the simulation study and the Alzheimer's Disease case study, iPCA is not simply a theoretical construct that generalizes PCA to the integrated data setting. iPCA is also a useful tool in practice to discover interesting joint patterns which occur in multiple datasets. We believe that iPCA's strong performance is due in part to the appropriateness of the iPCA model for many integrated data problems. The iPCA model assumes that the variation in each dataset is a function of the dependencies among samples and the dependencies among features, and because each dataset is measured on the same set of samples, the dependencies among samples should be the same across all datasets. We encode these shared sample dependencies as $\boldsymbol{\Sigma}$ and the individualistic feature dependencies as $\boldsymbol{\Delta}_k$. Then, whereas ordinary PCA estimates the sample and feature covariances independently of one another, iPCA estimates $\boldsymbol{\Sigma}$ and $\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_K$ jointly, inherently accounting for the effects of the other factor.

While there are many different penalized iPCA estimators, we recommend that if the covariance matrices are not known to be sparse, practitioners should strongly consider using the multiplicative Frobenius iPCA estimator. The simulations show that the Frobenius penalties are relatively robust to departures from model assumptions, and in theory, the multiplicative Frobenius penalized estimator requires one fewer penalty parameter to tune and always converges to the global solution. Though it is not obvious why the multiplicative Frobenius iPCA estimator drastically outperforms the other data integration methods in the ROSMAP case study, we speculate that it is related to g-convexity and convergence to the global solution. If not this, then it might be that the multiplicative Frobenius penalty simply gives the appropriate scalings and interactions between the penalty parameters and covariance estimates. We leave this as an open question for future research.

Still, there are many other open avenues for future exploration. Analogous to PCA, one might imagine similarly fruitful extensions of iPCA to higher-order data, functional data, and other structured applications. One could continue exploring g-convex penalties in different contexts and problems. Another interesting area for future research would be to develop a general framework to prove the consistency of g-convex estimators using the intrinsic manifold space, rather than the Euclidean space. We believe this intersection of g-convexity and statistical theory is a particularly ripe area of future research. Overall, in this work, we developed a theoretically sound and practical tool for performing dimension reduction in the integrated data setting, thus facilitating holistic analyses at a large scale.


Acknowledgements

The authors acknowledge support from NSF DMS-1264058, NSF DMS-1554821 and NSF NeuroNex-1707400. The authors thank Dr. Joshua Shulman and Dr. David Bennett for help acquiring the ROS/MAP data and acknowledge support from NIH P30AG10161, RF1AG15819, R01AG17917, and R01AG36042 for this data. The authors also thank Dr. Zhandong Liu and Dr. Ying-Wooi Wan for help with processing and interpreting the ROS/MAP data.

Integrated Principal Components Analysis: Supplementary Materials

Tiffany M. Tang, Genevera I. Allen

The supplementary materials are structured as follows. In Appendix A, we verify that the variance explained by iPCA is a well-defined quantity. In Appendix B, we provide the proofs, derivations, and algorithms related to the unpenalized and penalized MLEs for iPCA. We then review geodesic convexity and prove the global convergence property of the multiplicative Frobenius iPCA estimator in Appendix C. In Appendix D, we prove the subspace consistency of the one-step additive correlation iPCA estimator. We discuss the penalty parameter selection in Appendix E, and we conclude with additional simulation details in Appendix F.

Appendix A Variance Explained by iPCA

To ensure that the cumulative proportion of variance explained from Definition 1 is a well-defined concept, we check that (5) is a proportion and is an increasing function as $M$ increases. This implies that the marginal proportion of variance explained given in Definition 2 is also a proportion.

Lemma 2.

The cumulative proportion of variance explained in $\mathbf{X}_k$ by the top $M$ iPCs, as defined in (5), satisfies the following properties: for each $k = 1, \ldots, K$ and $M \geq 1$,

  1. $0 \leq \text{PVE}_k(M) \leq 1$;

  2. $\text{PVE}_k(M) \leq \text{PVE}_k(M+1)$.

Proof.

(i) Since the Frobenius norm is always non-negative, it is clear that $\text{PVE}_k(M) \geq 0$. So it suffices to show that $\text{PVE}_k(M) \leq 1$, or equivalently, that the numerator of (5) is bounded above by $\|\mathbf{X}_k\|_F^2$.

By definition of the Frobenius norm, we have that

Here, [2] holds by the orthogonality of $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}_k$, and [1] follows from the facts that $\hat{\mathbf{U}}_{(M)}$ and $\hat{\mathbf{V}}_{k,(M)}$ are precisely the first $M$ columns of $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}_k$, respectively, and the summand is non-negative. This concludes the proof of part (i).

(ii) We follow a similar argument as part (i) to see that