Large-scale studies commonly involve multiple sources of data. These data sets can stem from in-house experiments, public data-base resources, studies conducted in different experimental settings or obtained through different technological platforms. There is thus an ever increasing need to provide analysis methods that can integrate heterogeneous data from several cohorts or studies and several sets of features.
The integration of data from multiple cohorts observed across a common set of features is commonly referred to as multi-group integration (Figure 1A). Similarly, the multi-view (multi-source, multi-modal) data integration problem focuses on one cohort where multiple sets of features or measurements are available (Figure 1B). The multi-group and multi-view settings are to some extent equivalent algorithmically. By transposing the data matrices of a multi-group data set, it can be analyzed by multi-view methods and vice versa.
Several methods for multi-view (or multi-group) integrative data analysis are based on PCA or SVD. Here, we provide a brief summary while an extensive literature review can be found in the online supplement (Table 2). Methods that extend to more than two matrices are generally limited to finding a joint structure shared by all matrices. The interpretational value of multi-group (or multi-view) data integration can be improved by also modeling properties that are only present in one data matrix. van_der_kloet_separating_2016 present a comparative review of such methods, including JIVE (lock_joint_2013), DISCO-SCA (schouteden_sca_2013) and O2-PLS (trygg_o2-pls_2002), in the context of multi-view data. The three methods all fit a global model for the joint structure and individual models to the residuals in each matrix. They differ primarily in how strongly they impose assumptions of orthogonality of the individual structures of each matrix. Recently, feng_angle-based_2018
proposed AJIVE which uses the same model as JIVE but with improved computational efficiency for parameter estimation.zhu_generalized_2018 developed GIPCA: a generalization of JIVE to different data types which also allows for the presence of missing values.
To enhance interpretability further still, sparsity constraints can be applied to the structure parameterizations (e.g. the loadings of an SVD-based analysis). For example, wang_sparse_2015 proposed sMVMF to provide a decomposition of multi-group data into global and individual signal while imposing sparsity penalties on the global and individual loadings. This results in a stratification of features into sets that are; (i) globally relevant to explain all data groups; (ii) individually relevant; and (iii) irrelevant.
Simultaneous data integration for multi-group and multi-view data (Figure 1C) was first addressed by lofstedt_bi-modal_2012. The method, bi-modal OnPLS, can decompose such datasets into global and individual signal. Linked matrix factorization by oconnell_linked_2017 later extended JIVE to the same generality, with the additional ability to handle elementwise missing data. BIDIFAC (park_integrative_2019) is another extension of JIVE which, in addition, allows signal to be group-global and view-global.
In this paper, we aim to generalize the multi-group and multi-view integration to allow information to be only partially shared across a subset of groups and/or a subset of views. In many applications, a dataset may contain heterogenous matrices, and it is therefore unlikely that a globally shared signal exists. In such datasets, methods that only look for globally joint signal are forced to explain all variation as individual.
argelaguet_multiomics_2018 recently developed MOFA, which is, to our knowledge, the only method for multi-view data that can find joint signal shared only for a subset of matrices. klami_group-sparse_2014
proposed CMF to perform multi-group and multi-view integration with loadings that are shared by a subset of matrices. Both MOFA and CMF utilize a Bayesian setup, where an automatic relevance determination (ARD) prior is used for each loading so that some are active for only a subset of its associated data matrices. The ARD prior does not generate exact zeros. Consequently, it is difficult to interpret results in terms of which data matrices have a common structure, and how that common structure should be characterized. Furthermore, it does not favor low rank models. CMF does not produce sparse loadings. MOFA combines an ARD prior with a spike-and-slab prior where the role of the spike-and-slab prior is to produce sparse loadings (feature selection) but this setup has not been generalized to a multi-group and multi-view setting.
We propose a frequentist multi-group and multi-view data integration method, MM-PCA, that utilizes a symmetric treatment of views and groups, paired with regularization, to achieve; (i) partial component sharing across subsets of groups and/or views; (ii) orthogonal components through an efficient parametrization; (iii) sparse loadings and scores with respect to each component; and, (iv) adaptive rank selection for the integrative procedure. Comparing MM-PCA to CMF, CMF offers an approximate solution to (i) and does not provide solutions for (ii-iv) which greatly reduces the interpretability of the CMF data integration procedure.
MM-PCA can be used for explorative analysis, for example integrative bi-clustering and the identification of key features that drive cohort stratification. In integrative genomics, such driver-strata components can have relevance for development of new therapeutic solutions. To demonstrate the strength and usefulness of MM-PCA we conduct simulations and apply MM-PCA for the analysis of data from the Cancer Genome Atlas.
The rest of this article is organized as follows. Section 2 defines the model used in MM-PCA and how its parameters are estimated and should be interpreted. In Section 3 we compare MM-PCA with CMF, the only existing method of similar generality, and with JIVE, a popular method for integrative analysis in genomics, in three simulation based experiments. In Section 4 we apply MM-PCA to a data set from the Cancer Genome Atlas (TCGA, tcga_stuart_2013). Section 5 concludes the article and discusses limitations and potential future improvements.
2 Multi-group and Multi-view Principal Component Analysis
To facilitate the generalization of multi-group and multi-view integration, we treat the data integration as symmetric with respect to group or view. klami_group-sparse_2014 termed this approach augmented multi-view data (AMD). To define AMD, let us consider data matrices. Two matrices can be directly related by sharing rows or columns, or indirectly related via other matrices. More precisely, two matrices share rows if row in each of the two matrices concern the same object, and similarly for columns. This defines a partition of the sets of rows/columns into sets. We call such sets views, regardless of whether they consist of observations or features. For example, consider two data cohorts; cohort 1 with patients and cohort 2 with patients. Each cohort has two feature sets; the expression of genes and the methylation state of probes (Figure 1C). There are 8 sets of rows/columns for 4 unique views. The generality of AMD allows for the inclusion of a matrix with prior knowledge of covariance between gene expression and methylation. It also allows for inclusion of feature sets for which data are not available for all cohorts (Figure 1D).
Let be a data matrix, where is the set of view pairs for which a data matrix exists. The cardinality is the number of matrices in the data set in total. MM-PCA aims to find a rank integrative solution to minimize the following loss
subject to . () are the view-specific loading matrices and are diagonal matrices.
The optimal MM-PCA decomposition of a single data matrix is , where is equal to the standard PCA scores for and is equal to the PCA loadings. Since MM-PCA treats rows and columns symmetrically, we call all matrices loadings.
Without further constraints each column of would influence all matrices . By imposing exact zeros on some elements of the matrices we can constrain the MM-PCA components to be shared between only a subset of data matrices. Since the matrices are diagonal and of the same size, , they can be compactly represented by a matrix where each column is equal to the diagonal of one of the matrices and each row corresponds to a MM-component. We call the augmented D matrix.
We now illustrate how a joint structure is defined by . A zero at the element corresponding to component (row) and view (column) means that component is not capturing variation in data matrices with data for view . Thus, all views with non-zero values for component contain variation that is jointly captured by component . For example, consider matrices , and . If and , then component would be shared between matrices and but not . If several rows of have identical patterns of zeros, then the corresponding components make up a subspace where the views with non-zero values co-vary.
defines both the proportion of variation explained by estimated components and how much of the explained variation that is shared between data matrices. Denote the optimal -component MM-PCA solution for matrices by
. The vector
is the vector of singular values of the approximation of. Thus, the proportion of variation explained by the -rank solution of is given by
Similarly, the proportion of variation in a matrix that can be linearly predicted with another matrix can be computed as follows:
where is the set of components that influence . This gives a directed relation between each pair of data matrices, and can be used to form hypotheses of how groups of variables influence each other (Figure 5B).
Two problems arise ; (i) for relevant biological datasets the resulting optimization problem involves a massive amount of parameters (order of to
); and (ii) the constraints are challenging to handle numerically. Massive amount of parameters is not an insurmountable problem in optimization, as seen for example in recent advances in the field of deep neural networks. The constraints are challenging, however, in that they are non-linear and constrain the set of feasible solutions to a manifold so that all solutions are on the border of the set.
To address these problems, we introduce a parsimonious parametrization for the loading matrices. This not only reduces the number of parameters, but, more importantly, also removes the optimization constraints.
2.1.1 Parsimonious Parametrization
Let (e.g. ) be columns of a -dimensional rotation matrix, i.e. it is a matrix of parameters constrained to have orthogonal columns of unit -norm. Such matrices are called k-frames and effectively have free parameters (james_normal_1954).
To find a parsimonious parametrization of we make use of a parametrization based on Givens rotation matrices (shepard_representation_2015). A rotation matrix for two-dimensional rotation can be derived from angles, . Generalizing this definition to -dimensional space requires the multiplication of several rotation matrices, each constituting a two-dimensional rotation in a two-dimensional subspace. Such matrices
are known as Givens rotation matrices.
We include in the supplement a proof in our notation that for , an arbitrary -dimensional -frame can be expressed as a multiplication of Givens rotation matrices as , where are the first columns of the
-dimensional identity matrix. Each elementis an angle, and elements are collected in , a lower triangular matrix with zero diagonal, of dimension . The function thus yields a parsimonious parametrization of k-frames from angles.
This parametrization was recently used by pourzanjani_general_2017
for Bayesian inference of probabilistic PCA to facilitate sampling from a distribution over k-frames. It has not previously been used for optimization based inference of SVD nor in the context of SVD based data integration, and we have adapted the framework to more conveniently be applied for such tasks.
With this parametrization of k-frames we can restate (1) without the challenging orthogonality constraints on the loadings. That is, we write for view , , of sizes . When computing , the product of several large matrices, we make use of the sparse and regular structure of Givens rotation matrices.
We use the inverse of , stated in the supplement, to find initial parameter values for the optimization problem (1).
2.1.2 Loss, penalties and model selection
We now restate the loss function as
with four added penalty terms to achieve; (i) data integration through the identification of partially shared components; (ii) rank selection; (iii) sparse loadings; and (iv) variable selection.
Data integration is achieved via an sparsity penalty on the diagonals of : , where is the number of views and is the sum of the absolute values of the matrix’ elements. This penalty plays a similar role to the ARD prior on loadings used in CMF. Unlike CMF, however, we have separated the loadings () from their norms (). This allows us to define an integration penalty that only involves a small fraction of the parameters.
The rank of solutions is penalized by adding a group penalty on each diagonal position across all views: .
Sparsity of loadings is also achieved with an penalty. To avoid solutions where less important components tend to be more sparse, each is multiplied by its associated : .
Variable selection is achieved with a group penalty on each row of the loading matrices: , where is the dimension of view .
The division by is in both cases used to make all hyper parameters have approximately the same scale.
To allow for missing elements in the data matrices, the corresponding terms of the loss function are removed. For each data matrix define a matrix whose elements are zero if the same position in is missing and one otherwise. The MM-PCA loss function thus becomes
where is the set of pairs for which a data matrix exists with rows concerning view and columns concerning view and is elementwise multiplication. This optimization problem has variables.
The matrices only enter the loss function in pairs, and thus (Figure 6B) are more directly informative than the themselves (Figure 6A). The former contain the joint structure, the amount of variation explained and the amount of shared variation between matrices (as defined above). Still, values in are not arbitrary. The penalty terms promote solutions where the elements of the diagonal matrices are of similar size, and each may partake in several matrices.
Optimization of the objective function (2) is performed with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, a quasi-Newton method that makes use of the gradients of the objective function (gradients shown in the supplement) for iteratively updating an estimate of the Hessian. Details on how the various penalties are scaled to remove their dependencies on scale and size of the data matrices are given in the supplement.
Values for the penalty parameters
are chosen by dividing the elements of the data matrices into a test set and a training set. Non-missing elements of the data matrices are randomly assigned to the test set (with probability 0.1). For given values of, the model is fitted with the elements of the test set treated as missing data. A is sought that minimizes the squared reconstruction error of elements in the test set. The user can specify any sequence of candidate values. To limit the search space we typically choose candidate values by deciding which penalties that should be active (specified as four binary parameters ) and choosing a range and step size of values . The set of candidate values is then given by for each value of . After the optimal has been found, the model is finally estimated using all data (including the test set). Each candidate value of can be tried in parallel, making it possible to utilize many processor cores simultaneously for parameter tuning.
2.1.3 Normalization and Initialization
It is advisable to center both rows and columns of each similarly to normalization for ordinary PCA. One may also normalize data matrices in relation to each other through scaling e.g. to equal Frobenius norm, (square) Frobenius norm proportional to the number of elements, rows or columns, or equal Frobenius norm of the first principal components. Further matrix-wise normalizations are discussed in Section 2.5 of lofstedt_bi-modal_2012. In the end, data normalization must depend on data or domain knowledge.
We initialize the optimization with an approximation of the solution under the assumptions that all components are globally joint. Alternatively, MM-PCA can use CMF to find initial values. The R package allows for both.
2.2 Components of the MM-PCA solution
To summarize the above sections, MM-PCA provides a multi-group and multi-view structure decomposition of data matrices that includes;
Automatic rank selection. MM-PCA estimates the optimal rank by using a group penalty on the rows of . The estimated rank is given by the number of rows in that contain non-zero elements.
Interpretable sparse loadings. Sparsity in loadings, that is in columns of , increases interpretability.
Multi-group multi-view bi-clustering. Sparse loadings facilitate a direct translation from the MM-PCA decomposition to an integrative bi-clustering of the data. Each column of the loading matrix partitions the observations or variables of a view into three groups: positive, zero and negative. The use of sparse loadings for bi-clustering of a single data matrix was explored in lee_biclustering_2010 which, with MM-PCA, is thus generalized to the multi-group and multi-view setting.
Detection of outliers and irrelevant variables.
Removing outliers (row or columns) from the model is achieved via a group penalty on rows of. This penalizes difference in sparsity structure between loadings for the same view. Removing an outlier means that the corresponding row or column is considered to contain noise only.
Imputation of missing values and missing matrices of data.
Data matrix approximations given by the fitted model have no missing values, thus achieving imputation. Furthermore, data matrices that are not present when fitting the model, but that has views that are present through other matrices, can be predicted. E.g. if, and are present, a missing matrix consisting of views and is predicted by .
3 Method Evaluation on Simulated Data
We investigate MM-PCA performance via three simulation studies. Simulation 1 is aimed at demonstrating MM-PCA’s ability, compared with CMF, to correctly identify partially shared structure in a data set. Simulation 2 investigates MM-PCA’s automatic rank selection and ability to identify globally joint signal, where we compare against CMF and JIVE. Finally, simulation 3 demonstrates MM-PCA’s ability to identify sparse loadings.
3.1 Simulation 1
We generate four data matrices of size , thought to represent four different cohorts (views 1 through 4) observed over a common set of features (view 5). Each matrix is the sum of a rank one matrix (the signal) and noise. Data matrices one and two share a rank 1 component, and similarly for matrices three and four. That is,
where each element of each is independent standard normal and
is chosen to obtain a fixed signal-to-noise ratio (SNR).
Both MM-PCA and CMF are run with default settings. The correct noiseless data rank (two) is given as input to both methods. This gives an advantage to CMF since it does not penalize the rank of the estimated model.
We compare the estimated joint structure to the correct structure. Recall from Section 2, the structure of the decomposition is captured by the augmented -matrix, . For row (component) of , the non-zero elements describe which views share this component. Here, the correct structure is that row of is non-zero for only columns (views) or . As a metric of structural error, we thus compute the RMSE distance between the MM-PCA -solution and the correct . For CMF the estimated is directly given by the norm of the columns of its loading matrix solution (see supplement). We also evaluate the reconstruction error with regard to the signal term of each matrix.
Results, presented in Figure 2, are based on 100 simulation runs. MM-PCA outperforms CMF in finding the correct joint structure (lower panel in Figure 2). CMF results do not provide explicit zeroes which makes it difficult for the user to select a threshold to decide if a signal is shared between matrices. Here, a range of threshold values were applied and the best results are reported. By contrast, when SNR is moderate to high, MM-PCA often finds the correct structure exactly.
In terms of reconstruction error (upper panel), CMF performs better than MM-PCA due to MM-PCA adhering to the more constrained (and correct) structure and due to penalization bias. Including a debiasing step in MM-PCA has been left for future work.
3.2 Simulation 2
We generate three data matrices of size , representing three cohorts (views 1, 2 and 3) observed over a common set of features (view 4). Each observation is associated with either a loading that is common to all three matrices (i.e. globally joint) or with a loading that is unique for the matrix of that observation. We vary the proportion of observations that are associated with the globally joint loading. Thus, in each matrix, each row lies either along the globally joint direction or along the individual direction of the matrix. The individual directions are not orthogonal, they have pairwise angle of . Data is generated according to:
where is the proportion of data coming from the joint loading and where each element of each
is independent normal with mean zero and standard deviation. We generate data in two settings; a low-dimensional case, and a high-dimensional case (with respect to the direction of integration).
MM-PCA is run with maximum rank 10 (the correct rank for this problem is four). CMF is run with both rank four and rank 10. Since results were slightly, but significantly, better for CMF with rank 10 these are reported. For CMF we find the approximate rank of the solution by thresholding . As in Simulation 1, a range of threshold values were tried and the best results are reported. JIVE is run with its permutation testing method for rank selection. JIVE has an option to assume that the individual loadings are orthogonal. This assumption makes JIVE more robust to noise, though it is incorrect for this data. JIVE is run in both settings for comparison.
We evaluate the performance of the methods based on the estimated joint direction of the data. If a method selects a model that does not estimate the existence of any globally joint direction, the dominating direction of the residual is used as estimate of the joint direction. This is what a sensible user would do if the goal is to find the globally joint direction rather than to decide if such a signal exists. We choose to focus on this measure in order to facilitate a comparison of methods of different complexity. Accuracy is defined as the estimated joint direction having a smaller angular distance to the correct joint direction than to each individual direction and to the dominating direction of the noise.
Results from 100 simulation runs are shown in Figure 3, the low dimensional case in Figures 3(A,C,E) and the high dimensional case in Figures 3(B,D,F). In the low dimensional case, CMF finds the globally joint loading more often than MM-PCA (Figure 3A). In the high dimensional case, MM-PCA is better (Figure 3B). JIVE without orthogonal constraint is the best performing method for low joint proportion, but the method performs poorly and unpredictably for higher joint proportions. This makes the method difficult to use in practice where the strength of the joint signal is unknown. As the individual signals weaken (due to the joint proportion increasing) JIVE tends to fail to separate the individual signals and misclassify them as joint. JIVE only looks for globally joint signal (here correct) which gives JIVE a smaller search space compared to both MM-PCA and CMF. JIVE with orthogonal constraint shows a robust performance, but the performance is poor compared to MM-PCA and CMF. MM-PCA shows the most robust performance overall.
Figures 3C-D examine the methods’ estimates of the rank of the globally joint signal. The correct rank is one except when the joint proportion is zero where the correct rank is zero. In the low dimensional case CMF performs best while MM-PCA performs best in the high dimensional case where CMF greatly overestimates. JIVE overestimates when the joint and individual signal have near the same magnitude (joint proportion 0.5). Figures 3E-F examine the estimated rank of each matrix. The correct rank is two, the globally joint component and the individual component, except when the joint proportion is zero or one where the correct rank is one. Results show that the rank selection of MM-PCA works as intended in this setting and that the performance of MM-PCA is better than JIVE, despite MM-PCA addressing a much wider range of integrative analysis settings. CMF again performs well in the low dimensional case but overestimates in the high dimensional case. MM-PCA performs well in both settings, albeit being conservative.
3.3 Simulation 3
Simulation 3 demonstrates the ability of MM-PCA to estimate sparse loadings. We generate a matrix of size . The matrix consists of two components and Gaussian noise. Each component is the product of two sparse loadings with ones at three random positions and zeros elsewhere.
We compare the estimated components to true sparse loadings. MM-PCA is run with a maximum rank of two whereas CMF is given the correct rank (two). As in the other simulations, a range of threholds are applied to the CMF loadings to provide a comparison with MM-PCA. Accuracy is measured with Matthews correlation coefficient (MCC).
Results from 100 simulation runs are shown in Figure 4 (upper panel). MM-PCA is able to identify the sparse loadings with high accuracy for moderate to high SNR values. MM-PCA outperforms CMF in the entire SNR range, though this comparison is less meaningful than for Simulations 1 and 2 since CMF has not been developed to explicitly provide sparse loadings.
We also investigate how MM-PCA rank-selection works in a sparse setting. Here, both MM-PCA and CMF, included for comparison, are run with maximum rank 3. In Figure 4 (lower panel), we see that MM-PCA performs quite well for moderate to high SNRs. In a few cases, MM-PCA results in rank zero solutions due to poor initialization and those results are excluded from the analysis. In this setting, thresholding the CMF solutions does not result in a rank selection. This tendency for CMF to overfit was also seen in Simulation 2.
4 Integrative Analysis of Gene Expression and Methylation Data for Three Patient Cohorts
We apply MM-PCA to a well established set of glioblastoma (GBM) data from TCGA (tcga_stuart_2013). Earlier versions of TCGA data were used to identify four subtypes of GBM, termed proneural (PN), classical (CL), neural (NL) and mesenchymal (MS), by clustering of gene expression array profiles (verhaak_integrated_2010). Analysis of methylation data from TCGA subsequently identified a subset of patients within the PN subtype with favorable prognosis termed the G-CIMP (glioma-CpG island methylator phenotype) subgroup, more frequently observed among younger patients (noushmehr_identification_2010). Recent work has questioned the existence of the NL subtype, showing that a clustering using only tumor-intrinsic gene expression identifies three subtypes (wang_tumor_2017).
Here, we analyze gene expression and methylation data centered around four canonical pathways of GBM; mTOR, Ras, MAPK and PI3K-Akt, focusing on processes critical for glioblastoma development and progression.
The data set consists of six data matrices: gene expressions for patient cohorts 1 and 2, methylation data for patient cohorts 2 and 3, and patient covariance matrices between cohorts 1 and 2 and between cohorts 2 and 3. This results in five views: the three patient cohorts constitute three groups of observations, and the genes and methylation probe sites constitute two groups of variables. Data preprocessing and normalization steps are described in detail in the supplement. After filtering on high variance genes and methylation sites, the data set to be modelled comprises three patient cohorts of sizes 96, 76 and 212, 297 genes and 305 DNA sites (Figure5A).
4.1 MM-PCA Integrates Across Views and Cohorts
MM-PCA was applied with ten regularization levels, between and 1 on a logarithmic scale. The lowest cross-validation error was found for . We allowed the solution to have a maximal rank of 40. The selected solution has an effective rank of 24. Figure 5A shows the data as modeled by the MM-PCA solution. The six data matrix estimates shown are . View indices label the five views in the order: cohort 1, cohort 2, cohort 3, genes, and methylation probes.
In Figure 5A matrix columns and rows are ordered according to the bi-clustering given by the MM-PCA solution. The hierarchical nature of this clustering is shown with cluster trees for each view. The importance of each component for each view is shown in Figure 6C. It is apparent in Figure 5A that clusters are pronounced in all matrices associated with each view, i.e. the clustering is integrative. One cluster is particularly pronounced. It can be seen as a bright yellow wide rectangular area in the lower right part of the cohort 3 methylation data. We found that the associated cluster of patients is strongly associated with low age and G-CIMP status (Section 4.2). MM-PCA thus detects the known connection between low age at diagnosis and G-CIMP status in an unsupervised fashion. Looking at the methylation data for cohort 2 and covariance between cohorts 2 and 3, it can also be seen that this patient cluster in cohort 3 is associated with a patient cluster in cohort 2 with strongly correlated methylation. The cluster in cohort 2 is small and its presence is thus boosted through the integration with the larger cohort 3.
The partially shared structure, captured by and (defined in Section 2), is shown in Figures 6A-D. Figure 6A shows the proportion of total variation of each matrix captured by each component. Figure 6B groups components according to which views. Thus the set of components with a given color in Figure 6B constitutes a subspace for variation in a subset of data matrices. Figure 6C shows of the MM-PCA solution. Figure 6D shows the proportion of variation explained by each component, resulting in a scree-plot as commonly used in ordinary PCA. Directed is shown in Figures 5B-C. Methylation data explains variation in gene expression, in agreement with their biological cause-effect relationship.
Figure 6C provides the basis for interpreting the components illustrated in Figure 5D-F. Components 1, 2, 5 and 11 (Figure 5D) capture structure that is shared globally. Components 3, 7 and 8 (Figure 5E) capture structure that is primarily shared by cohorts 1 and 2 through gene expression, but also models variation in methylation data for cohort 2. Components 4, 6, 9 and 10 (Figure 5F) capture structure that is primarily shared by cohort 2 and 3 through methylation, but also additional variation in gene expression for cohort 2. The contribution of the remaining components is small, mainly capturing variation in methylation in cohort 3
As with ordinary PCA, MM-PCA can be used to plot data in lower dimensions. In Figure 7B we plot all observations by their first two loadings. Patients of different subtype and G-CIMP status are separated already in the first two components. Ordinary PCA would not be able to use the entire data set, but would need to base the analysis on either cohorts 1 and 2 and gene expression or cohorts 2 and 3 and methylation data. From the MM-PCA plot one can infer a plausible subtype and G-CIMP status for the observations with unknown status.
The MM-PCA solution can also be used for integrative bi-clustering of all observations and variables (Figure 7A). This is in contrast to the bi-clustering in Figure 5A which clustered the items of each view. We impute the two missing data matrices (cohort 1 methylation data) and (cohort 3 gene expressions) by multiplying the corresponding and matrices and construct the bigger matrix
Similarly to the bi-clustering described above, we reorder columns and rows of according to signs of elements in and (positive, zero or negative). Cluster hierarchies are shown as trees. We cut the trees at two heights, using two and three components, resulting in and clusters, respectively. The low age, G-CIMP bi-cluster considered above is again present, but now includes patients from all cohorts and some genes in addition to its methylation sites (dark blue patient cluster, dark green variable cluster). A closer look at the relation between cluster and subtype shows that CL is predominant in the red rank 2 cluster, PN in the blue and MS in the green and purple clusters (Figure 7C). The split of the rank 2 purple cluster into rank 3 light and dark purple clusters separates PN patients by G-CIMP status. The analysis captures several clinical variables (Section 4.2), and the clustering is thus not primarily focused on separation by subtype. GBM disease subtype has been shown to have little clinical relevance (johansson_large_2018).
4.2 Leading MM-PCA Components Capture Subtype, G-CIMP and Clinical Features of GBM
To elucidate the potential biological and clinical relevance of the MM-PCA components we quantify for each: the proportion of variation in each matrix and view explained by each component (Figure 6A-C), the proportion of total variance explained by each component (Figure 6D), the relationship between each component and clinical parameters subtype, G-CIMP status, sex, age, survival time, tissue type, stemness and pathway enrichment (Figure 6E-F, Table 1, Figures 8-13
). P-values are calculated using standard methods: Kruskal-Wallis test for factor variables, Fisher z-transformation for continuous variables and Cox proportional hazard model for survival times. Tissue type indicates whether a sample is from a primary tumor, recurrent tumor or from normal tissue. We use a machine learning based measure of stemness to quantify each sample’s degree of oncogenic dedifferentiation(malta_machine_2018). Pathway enrichment is assessed using gene set enrichment analysis (GSEA, subramanian_gene_2005) using v3.0 of javaGSEA obtained from the Broad institute. Hallmark pathway annotations where obtained from MSigDB v6.2 (liberzon_molecular_2015).
Several components found with MM-PCA correlate significantly with clinical parameters (Figures 8-13). Component 1 separates patients with MS subtype from patients with PN subtype. The component is also associated with genes defining the epithelial-mesenchymal transition, a hallmark pathway consistent with the gradient between the PN and MS subtype. Component 2 strongly separates G-CIMP patients from non G-CIMP patients as well as patients with CL subtype from other patients. Component 3 captures variation in the expression data that associates with several hallmark pathways including oxidative phosphorylation and mitotic spindle. The component is associated with tissue type and correlates with the stemness signature. Component 4 captures variation in the methylation data and is associated with subtype and G-CIMP status. Interestingly, the component associates some CL subtype patients with G-CIMP patients, suggesting a set of genes and probes where these CL patients are similar to G-CIMP patients. Component 5 captures variation in both data types and describes variation between normal and tumorous samples. Of note, a number of NL samples have high loadings for this component, possibly indicating contamination of normal cells, which has been suggested as the basis for the NL subtype (wang_tumor_2017). Component 6 captures patient sex, primarily by a small number of methylation probes on chromosome X (e.g. IRAK1 and FLNA).
|Hallmark pathways:||Epith. mesench. trans.|
|KRAS signaling DN|
We present a method for multi-group and multi-view data integration, MM-PCA, capable of identifying partially shared components between any subset of data matrices. The goal of MM-PCA is to provide a general data integration pipeline that is applicable to heterogeneous data where global integration would be too restrictive.
MM-PCA provides a sparse, low-rank representation of multiple data matrices. The MM-PCA solution is easily interpreted since; (i) each component is identified as associated with a subset of data matrices; (ii) sparse component loadings directly translates the MM-PCA solution to an integrative bi-clustering procedure; and (iii) MM-PCA automatically selects the rank of the approximation as part of the procedure.
We show through simulations that MM-PCA performs well in a range of settings including low- and high-dimensional tasks, small and large proportion of joint variation, and low to high signal-to-noise ratio. MM-PCA exhibits a robust and stable performance across this range, whereas competitors CMF and JIVE perform poorly in some of these settings. This provides support for using MM-PCA in real data settings where the underlying structure is unknown. We apply MM-PCA to an ’omic data set of glioblastoma to illustrate the power of the method as an explorative tool for complex data integration.
MM-PCA constitutes a complex optimization task. The current implementation takes 20 minutes to analyze the data treated in Section 4 on a 4 core 3.5 GHz desktop computer. On simulated data, runtimes are on par with CMF (time per iteration is somewhat lower for MM-PCA) despite MM-PCA adressing a more complex problem. Future work could be directed at providing approximate solutions or optimization strategies to improve runtimes.
To further extend the applicability of MM-PCA, the method should be generalized to other loss functions (e.g. for binary data) and tensor data. This is left for future work.
MM-PCA is available as an R package on CRAN.
7 Supplementary Material
Supplementary material is available in the appendix.
This work was supported by grants from Vetenskapsrådet (VR 2013-05101) and the Swedish Foundation for Strategic Research (SSF BD15-0088).
Conflict of Interest: None declared.
Appendix A Literature Review
This literature review is summarized in Table 2 and the different data integration problems are illustrated in Figure 1. The simplest case is the integration of data from multiple cohorts observed across a common set of features, commonly referred to as multi-group integration (Figure 1A). An early extension of PCA for multi-group data is CPCA (flury_common_1984). It summarizes a data set consisting of several cohorts with one matrix of loadings that is common for all cohorts. The multi-view (multi-source, multi-modal) data integration problem focuses instead on one cohort where multiple sets of features or measurements are available (Figure 1B). The popular iCluster algorithm (shen_integrative_2009) is an example of multi-view integration for genomic data. Below we review methods for multi-group or multi-view data integration together since they are algorithmically similar.
Several classical methods for multi-view (or multi-group) integrative data analysis are based on PCA or SVD. Canonical correlation analysis (CCA) (hotelling_relations_1936) and partial least squares (PLS) (wold_causal_1974) are perhaps the most used and studied. They can only be applied to data sets comprising two matrices, and they focus only on finding joint structure or parameters to explain both matrices together. Hierarchical PCA, hierarchical PLS (wold_hierarchical_1996), generalized SVD (van_loan_generalizing_1976) and higher order generalized SVD (ponnapalli_higher-order_2011) are extensions to more than two data matrices but still limited to finding a joint structure shared by all matrices in a dataset.
JIVE (lock_joint_2013), DISCO-SCA (schouteden_sca_2013) and O2-PLS (trygg_o2-pls_2002) are methods that models data as a sum of both globally joint signal and signal that is individual to each matrix. They estimate model parameters by iteratively alternating between fitting a global model for the joint structure and fitting individual models to the residuals in each matrix. O2-PLS applies only to datasets of two matrices but OnPLS (lofstedt_onpls-novel_2011) generalizes it to an arbitrary matrix count. AJIVE (feng_angle-based_2018)
uses the same model as JIVE but bases rank estimation on random matrix theory instead of bootstrapping. GIPCA(zhu_generalized_2018) generalizes JIVE to allow for different data types and also allows for the presence of missing values.
The method sMVMF (wang_sparse_2015) provides a decomposition of multi-group data into global and individual signal while imposing sparsity penalties on the global and individual loadings.
Simultaneous data integration for multi-group and multi-view data (Figure 1C) was first addressed by bi-modal OnPLS (lofstedt_bi-modal_2012), an extension of OnPLS. The method can decompose such datasets into global and individual signal. Linked matrix factorization (oconnell_linked_2017) later extended JIVE to the same generality, with the additional ability to handle elementwise missing data. BIDIFAC (park_integrative_2019) allow signal to be group-global and view global, in addition to the global and individual signal. Integrative analysis of data that is both multi-view and multi-group will become increasingly relevant in molecular biology (richardson_statistical_2016).
MOFA (argelaguet_multiomics_2018) is the only method for multi-view data that can find joint signal shared only for a subset of matrices. Like sMVMF, MOFA also performs feature selection. It does so by using a sparsity inducing prior.
CMF (klami_group-sparse_2014) builds on the augmented multi-view data (AMD, Figure 1D) framework to perform multi-group and multi-view integration with loadings that are shared by a subset of matrices. The partial sharing problem for multi-view integration was, as mentioned above, addressed in MOFA. Both of these methods utilize a Bayesian setup. CMF uses the model where each and
follow a Gaussian distribution. The model parameters are estimated using variational Bayesian inference.
|CCA, Hotelling, 1936||Multi-view||2||No||Global||No|
|PLS, H. Wold, 1974||Multi-view||2||No||Global||No|
|GSVD, Van Loan, 1976||Multi-view||2||No||Global||No|
|CPCA, Flury, 1984||Multi-group||Arbit.||No||Global||No|
|HOGSVD, Ponnapalli, 2011||Multi-view||Arbit.||No||Global||No|
|Hier. PCA, S. Wold, 1996||Multi-view||Arbit.||Yes||Global||No|
|Hier. PLS, S. Wold, 1996||Multi-view||Arbit.||Yes||Global||No|
|O2-PLS, Trygg, 2002||Multi-view||2||No||Glob. & ind.||No|
|DISCO-SCA, Schouteden, 2013||Multi-group or -view||2||No||Glob. & ind.||No|
|OnPLS, Löfstedt, 2011||Multi-view||Arbit.||No||Glob. & ind.||No|
|JIVE, Lock, 2013||Multi-view||Arbit.||No||Glob. & ind.||No|
|AJIVE, Feng, 2018||Multi-view||Arbit.||No||Glob. & ind.||No|
|sMVMF, Wang, 2015||Multi-group||Arbit.||No||Glob. & ind.||Yes|
|GIPCA, Zhu, 2018||Multi-view||Arbit.||Yes||Glob. & ind.||No|
|MOFA, Argelaguet, 2018||Multi-view||Arbit.||Yes||Any subset*||Yes|
|LMF, O’Connel, 2017||Multi-group and -view||3||Yes||Glob. & ind.||No|
|Bi-mod. OnPLS, Löfstedt, 2012||Multi-group and -view||Arbit.||No||Glob. & ind.||No|
|BIDIFAC, Park, 2019||Multi-group and -view||Arbit.||Yes||Glob. & ind.||No|
|CMF, Klami, 2014||Augmented multi-view||Arbit.||Yes||Any subset*||No|
|MM-PCA, This article||Augmented multi-view||Arbit.||Yes||Any subset||Yes|
Appendix B Generalized Euler Parametrization
We present the following corollary to the results of (hoffman_generalization_1972). It states that -frames can be factorized into Givens rotation matrices.
For , an arbitrary -dimensional -frame can be expressed as , where , are generalized Euler rotations and are the first columns of .
Let be an arbitrary -frame and let be any rotation such that . Such exist by the definition of -frames. hoffman_generalization_1972 gives which can be factorized into where and since the factors of are ordered so that this can be done without reordering them. Furthermore since all indices in are greater than . Thus . Lastly . ∎
In order to find a good initial value for the optimization problem we need to find the angles that correspond to a given -frame , i.e. . hoffman_generalization_1972 states this inverse for rotations. To utilize the k-frames setup for MM-PCA, we here generalize the rotation inverse to -frames in the following way. Let be the inverse for rotations given in (hoffman_generalization_1972), i.e. the function that gives the angles that correspond to a given rotation matrix . Using the orthogonal complement of and that is a rotation matrix we set to be the first columns of . One step remains. Let be equal to except for the last column, for which the signs are changed. Then is also an orthogonal complement to . Thus, let be the first columns of . Either or . The candidate for which the equality holds is used to invert .
Appendix C Gradients of the Objective Function
Let be the objective function given in the article. The gradient with respect to element of is
where and are defined by
(see corollary 1), is a row vector with a one at position and zeros elsewhere and
If then . The gradient with respect to is
where is elementwise division and is a column vector with each element being 1.
Appendix D Optimization, Algorithmic Details
The most computationally demanding task in MM-PCA is to compute the gradients of the objective with regard to the matrices . This task has been parallelized across views and uses cores simultaneously.
To improve the convergence rate, the penalties are approximated by a smooth function based on tangens hyperbolicus as in (trendafilov_projected_2006).
The matrices contain angles in radians, so they are on the scale . When selecting tolerance and stopping criterion for BFGS it is convenient if all variables are on the same scale. Therefore the data is rescaled so that the biggest singular value, among all the singular values of all data matrices, is . This results in the variables in being on approximately the same scale as .
The penalty parameters are rescaled to remove their dependency on the magnitude and size of the data matrices. In the objective function, each is multiplied with a factor , where , the mean Frobenius norm of the data matrices. The following derives this factor. Scaling all data by some constant should lead to scaling all matrix approximations by the same factor: . Thus, if the sum of all penalty terms before scaling the data was , then after scaling, the sum becomes . The last equality is due to penalties being linear in the scale of . Multiplying the data and approximations by gives the objective which has the same minimizing argument as . Thus, setting to removes the dependency of on the scale of the data. This results in the parameters being on the same scale regardless of the scale and size of the data.
As initial values in optimization, we use an approximation of the solution under the assumptions that all components are globally joint.
We use SVD to set to the loadings of the concatenated matrix consisting of all matrices (some transposed) with a view . We find singular values for each matrix. We factorize into by greedily choosing signs for each diagonal element in each and then finding the magnitude of the elements with continuous optimization. Finally, the initial values of are found using the inverse -frame operation, , defined above.
Appendix E Integrative Analysis of Gene Expression and Methylation Data for Three Patient Cohorts
e.1 Data Selection, Preprocessing and Normalization
Genomic data from GBM cancer patients from TCGA is publicly available through UCSC cancer genomics browser (goldman_zhu_2014). The data contains log-transformed RNASeq counts belonging to 20530 genes for 172 patients. It also contains measurement of methylation at 27578 DNA sites for a partially overlapping group of 288 patients. The measurements are ratios (so-called
-values) with 0 being unmethylated and 1 being methylated. We quantile normalized each RNASeq sample, transforming the counts to follow a standard normal distribution based on the order of genes by count. This removes the dependency of counts on the sequencing depth and ensures that data follows a normal distribution. For each methylation sample we subtracted the sample mean from each measurement. Next, we removed sites having median absolute deviation equal to zero, that is constant across at least half of the patients. This lowered the site count by 2597. We removed half of the genes and half of the sites with lowest variance. We centered genes and scaled them to have unit variance. We also centered the data for each methylation site. We divided the patients into three cohorts based on data availability. Cohort 1 contains 96 patients for which we have gene expression data but not methylation data. Cohort 2 contains 76 patients for which we have both gene expression and methylation data. Cohort 3 contains 212 patients for which we have methylation data but no gene expression data. We focus our analysis on four cellular pathways (mTOR, Ras, MAPK and PI3K-Akt) that are involved in cell proliferation, survival, migration and angiogenesis (forming of new blood vessels) and are involved in GBM pathogenesis(pearson_targeting_2017). We downloaded gene lists for each pathway from KEGG (kanehisa_goto_2000). A total of 760 genes were in at least one of the four pathways. Of these 760 genes 297 remain after the above described gene filtering. Out of the remaining DNA sites 305 sites are believed to be associated with any of these 297 genes, according to the Infinium MethylationEPIC v1.0 B4 Manifest file obtained from Illumina. We use all genes and DNA sites that remain after gene and site filtering to calculate covariance matrices between cohorts 1 and 2 as well as between cohorts 2 and 3. These two matrices are included in our analysis with the role of prior information of covariance between the data matrices of gene expressions, and the data matrices of methylation, respectively. A small proportion (99 elements) of data are missing in the methylation data. Lastly, we scale each matrix so that the Frobenius norm of its first SVD component is proportional to the number of observations (rows) in the matrix. To summarize, the data consists of three patient cohorts of sizes 96, 76 and 212, 297 genes and 305 DNA sites.