Data collection and curation for the Cancer Genome Atlas (TCGA) program completed in 2018, providing a unique and valuable public resource for comprehensive studies of molecular profiles across several types of cancer (hutter2018cancer). The database includes information from several molecular platforms for over tumor samples from individuals representing types of cancer. The molecular platforms capture signal at different ’omics levels (e.g., the genome, epigenome, transcriptome and proteome), which are biologically related and can each influence the behavior of the tumor. Thus, when studying molecular signals in cancer it is often necessary to consider data from multiple omics sources at once. This and other applications have motivated a very active research area in statistical methods for multi-omics integration.
A common task in multi-omics applications is to jointly characterize the molecular heterogeneity of the samples. Several multi-omics methods have been developed for this purpose, which can be broadly categorized by (1) clustering methods that identify molecularly distinct subtypes of the samples (huo2017integrative; lock2013bayesian; gabasova2017clusternomics), (2) factorization methods that identify continuous lower-dimensional patterns of molecular variability (lock2013joint; argelaguet2018multi; gaynanova2017structural), or methods that combine aspects of (1) and (2) (shen2013sparse; mo2017fully; hellton2016integrative)
. These extend classical approaches, such as (1) k-means clustering and (2) principal components analysis, to the multi-omics context, allowing the exploration of heterogeneity that is shared across the different ’omics sources while accounting for their differences.
Several multi-omics analyses have been performed on the TCGA data, including flagship publications for each type of cancer (e.g., see cancer2012comprehensive; cancer2014comprehensive; verhaak2010integrated). These have revealed striking molecular heterogeneity within each classical type of cancer, which is often clinically relevant. However, restricting an analysis to a particular type of cancer sacrifices power to detect important genomic changes that are present across more than one cancer type. In 2013 TCGA began the Pan-Cancer Analysis Project, motivated by the observation that “cancers of disparate organs reveal many shared features, and, conversely, cancers from the same organ are often quite distinct” (weinstein2013cancer). Subsequently, several pan-cancer studies have identified important shared molecular alterations for somatic mutations (kandoth2013mutational), copy number (zack2013pan), mRNA (hoadley2014multiplatform), and protein abundance (akbani2014pan). However, a multi-omics analysis found that pan-cancer molecular heterogeneity is largely dominated by cell-of-origin and other factors that define the classical cancer types (hoadley2018cell).
In this study we do not focus on baseline molecular differences between the cancer types. Rather, we focus on whether patterns of variability within each cancer type are shared across cancer types, i.e., whether multi-omic molecular profiles that drive heterogeneity in one type of cancer also drive heterogeneity in other cancers. Systematic investigations of heterogeneity in a pan-omics and pan-cancer context are presently limited by a lack of principled and computationally feasible statistical approaches for the comprehensive analysis of such data. In particular, the data take the form of bidimensionally linked matrices, i.e., multiple large matrices that may share row sets (here, defined by the omics platforms) or column sets (here, defined by the cancer types); this is illustrated in Figure 1 and the formal framework is described in Section 2.
In this article we propose a flexible approach to the simultaneous factorization and decomposition of variation across bidimensionally linked matrices, BIDIFAC+. This decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). Our approach builds on a growing literature for the factorization and decomposition of linked matrices, which we review in Section 3. Crucially, previous methods have primarily focused on multiple matrices that are linked in one dimension (rows or columns) only.
2 Formal framework and notation
Here we introduce our framework and notation for pan-omics pan-cancer data. Let denote the data matrix for omics data source and sample set (i.e., cancer type) for and . Columns of represent samples, and rows represent variables (e.g., genes, miRNAs, proteins). The sample sets of size are consistent across each omics source, and the features measured for each omics source are consistent across sample sets. As illustrated in Figure 1, the collection of available data can be represented as a single data matrix where and , by horizontally and vertically concatenating its constituent blocks:
Similarly, defines the concatenation of omics source across cancer types and defines the concatenation of cancer type across omics sources:
The notation defines the n’th column of matrix , defines the m’th row, and defines the entry in row and column . In our context, the entries are all quantitative, continuous measurements; missing data are addressed in Section 9.
We will investigate shared or unique patterns of systematic variability (i.e., heterogeneity) among the constituent data blocks. We are not interested in baseline differences between the different omics platforms or sample sets, and so after routine preprocessing the data will be centered so that the mean of the entries within each data block, , is . Moreover, to resolve the disparate scale of the data blocks, each block will be scaled to have comparable variability as described in Section 6.1.
In what follows, denotes the Frobenius norm for any given matrix, so that is the sum of squared entries in . The operator denotes the nuclear norm of
, which is given by the sum of the singular values in; that is, if has ordered singular values , then .
3 Existing integrative factorization methods
There is now an extensive literature on the integrative factorization and decomposition of multiple linked datasets that share a common dimension. Much of this methodology is motivated by multi-omics integration, i.e., vertical integration of multiple matrices with shared columns in the setting of Section 2. For example, the Joint and Individual Variation Explained (JIVE) method (lock2013joint; oconnell2016) decomposes variation into joint components that are shared among multiple omics platforms and individual components that only explain substantial variability in one platform. This distinction simplifies interpretation, and also improves accuracy in recovering underlying signals. Accuracy improves because structured individual variation can interfere with finding important joint signal, just as joint structure can obscure important signal that is individual to a data source. The factorized JIVE decomposition is
Joint structure is represented by the common score matrix , which summarize patterns in the samples that explain variability across multiple omics platforms. The loading matrices indicate how these joint scores are expressed in the rows (variables) of platform . The score matrices summarize sample patterns specific to platform , with loadings . Model (2) can be equivalently represented as a sum of low-rank matrices
where is of rank and is the matrix of rank given by the individual structure for platform and zeros elsewhere:
Several other methods result in a factorized decomposition similar to that in (2) and (3), including approaches that allow for different distributional assumptions on the constituent matrices (li2018general; zhu2018generalized), non-negative factorization (yang2015non), and the incorporation of covariates (li2017incorporating). The Structural Learning and Integrative Decomposition (SLIDE) method (gaynanova2017structural) allows for a more flexible decomposition in which some components are only partially shared across a subset of the constituent data matrices. SLIDE extends model (3) to the more general decomposition
where is a low-rank matrix with non-zero values for some subset of the sources that is identified by a binary matrix : and
Here, gives scores that explain variability for only those patterns for the omics sources identified by .
The BIDIFAC approach (park2019integrative) is designed for the decomposition of bidimensionally linked matrices as in (1). Its low-rank factorization can be viewed as an extension of that for JIVE, decomposing variation into structure that is shared globally (G), across rows (Row), across columns (Col), or individual to the constituent matrices (Ind). Following (3) for JIVE and (4) for SLIDE, its full decomposition can be expressed as
4 Proposed model
We consider a flexible factorization of bidimensionally linked data that combines aspects of the BIDIFAC and SLIDE models. Our full decomposition can be expressed as
and the presence of each is determined by a binary matrix of row indicators and column indicators :
Each gives a low-rank module that explains variability within the submatrix defined by the omics platforms identified by and the cancer types identified by . There are in total such submatrices, so by default we set and let and enumerate all possible modules (see Appendix B). The SLIDE decomposition (4) is a special case when or (i.e., unidimensional integration); the BIDIFAC model (5) is a special case where each column of and contains either entirely ’s (i.e., all rows or columns included) or just one (i.e., just one row set or column set included). The matrix is an error matrix, whose entries are assumed to be sub-Gaussian with mean
Let the rank of each module be rank, so that the dimensions of the non-zero components of the factorization are and . The ’th component of the ’th module gives a (potentially multi-omic) molecular profile that explains variability within those cancer types defined by with corresponding sample scores .
5 Objective function
To estimate model (6), we minimize a least squares criterion with a structured nuclear norm penalty:
subject to if or . The choice of the penalty parameters is critical, and must satisfy the conditions of Proposition 8 to allow for non-zero estimation of each module.
Under objective (7), the following are necessary to allow for each to be non-zero
If for the rows and columns of module are contained within those for module , and , then .
If is any subset of modules that together cover the rows and columns of module , and for positive integers and , then .
(mazumder2010spectral) Let be the SVD of a matrix . The approximation that minimizes
is , where is diagonal with entries .
(rudelson2010non) If is a matrix of independent entries with mean and variance , then provides a tight upper bound on the largest singular value of .
Fixing is a reasonable choice for the matrix approximation task in (8), because it keeps only those components whose signal is greater than that expected for independent error by Proposition 3: (shabalin2013reconstruction). In our context , and thus we set , where gives the dimensions of the non-zero sub-matrix for :
Our choice of is motivated to distinguish signal from noise in module , conditional on the other modules. Moreover, it is guaranteed to satisfy the necessary conditions in Proposition 8, which we establish in Proposition 4.
A similarly motivated choice of penalty weights is used in the BIDIFAC method, which solves an equivalent objective under the restricted scenario where the columns of and are fixed and contain either entirely ’s (i.e., all rows or columns included) or just one (i.e., just one row set or column set included). Thus, we call our more flexible approach BIDIFAC+.
It is often infeasible to explicitly consider each of the possible modules in (7), and the solution is often sparse, with for several . Thus, in practice we also optimize over the row and column sets and for some smaller number of modules :
Note that if is greater than the number of non-zero modules, then the solution to (9) is equivalent to the solution to (7) in which and are fixed and enumerate all possible modules. If is not greater than the number of non-zero modules, then the solution to (9) can still be informative as the set of modules that together give the best structural approximation via (7).
We center each dataset to have mean , and scale each dataset to have residual variance var) of approximately . Such scaling requires an estimate of the residual variance for each dataset. By default we use the median absolute deviation estimator of gavish2017optimal, which is motivated by random-matrix theory under the assumption that is composed of low-rank structure and mean independent noise of variance . We estimate for the unscaled data , and set . An alternative approach is to scale each dataset to have overall variance , var=1, which is more conservative because var; thus, this approach results in relatively larger in the objective function and leads to sparser overall ranks.
6.2 Optimization algorithm
We estimate across all modules simultaneously by iteratively optimizing the objectives in Section 5. First assume the row and column inclusions for each module, defined by and , are fixed as in objective (7). Then, to estimate given the other modules , we can apply the soft-singular value estimator in Proposition 2 to the residual matrix
on the submatrix defined by and . In this way, we iteratively optimize (7) over the modules until convergence. If the row and column inclusions and are not predetermined, then we incorporate additional sub-steps to estimate the non-zero submatrix defined by and for each module to optimize 9. We use a dual forward-selection procedure to iteratively determine the optimal row-set with columns fixed, and the column-set with rows fixed, until convergence prior to estimating each . Further details and pseudocode for the algorithm are provided in Appendix A.
Here we consider the identifiability of the decomposition in (4) under the objective (7). To account for permutation invariance of the modules, throughout this section we assume that and are fixed and that they enumerate all of the possible modules. Without loss of generality, we fix and as in Appendix B. Then, let be the set of possible decompositions that yield a given approximation :
If either or then the cardinality of is infinite, i.e., there are an infinite number of ways to decompose . Thus, model (4) is clearly not identifiable in general, even in the no-noise case . However, optimizing the structured nuclear norm penalty in (7) may uniquely identify the decomposition; let give this penalty:
, gives an equivalence of the left and right singular vectors for any two decompositions that minimize.
Take two decompositions and , and assume that both minimize the structured nuclear norm penalty:
Then, and where and have orthonormal columns, and and are diagonal.
The proof of Proposition 5 uses two novel lemmas (see Appendix C): one establishing that and must be additive in the nuclear norm, , and a general result establishing that any two matrices that are additive in the nuclear norm must have the equivalence in Proposition 5.
Proposition 5 implies that left or right singular vectors of () are either shared with (if ) or orthogonal to (if ). For identifiability, one must establish that for all and . Theorem 2 gives sufficient conditions for identifiability of the decomposition.
Consider and let give the SVD of for . The following three properties uniquely identify .
minimizes over ,
are linearly independent for ,
are linearly independent for .
The linear independence conditions (2. and 3.) are in general not sufficient for identifiability, and several related integrative factorization methods achieve identifiability via stronger orthogonality conditions across the terms of the decomposition (lock2013joint; gaynanova2017structural). Theorem 2 implies that orthogonality is not necessary under the penalty . Conditions 2. and 3. are straightforward to verify for any , and they will generally hold whenever the ranks in the estimated factorization are small relative to the dimensions and . Moreover, the conditions of Theorem 2 are only sufficient for identifiability; there may be cases for which the minimizer of is unique and the terms of its decomposition are not linearly independent.
8 Bayesian interpretation
Express the BIDIFAC+ model (6) in factorized form
for all and , and
for all and . The structured nuclear norm objective (7) can also be represented by penalties on the factorization components and . We formally state this equivalence in Proposition 6, which extends analogous results for a single matrix (mazumder2010spectral) and for the BIDIFAC framework (park2019integrative).