Matrix factorization and its variants (including bi- and tri-matrix factorizations) that utilize genomic data have been widely used in cancer genomics to identify key altered pathways (a pathway is defined as "a series of actions among molecules in a cell that leads to a certain product or a change in the cell", quoted from https://www.genome.gov/27530687/biological-pathways-fact-sheet/; in our setting, a pathway simply means a set of genes that carry out a certain biological function) associated with cancer development and progression, and for patient subgroup identification. One of the key challenges is how to elaborately incorporate biological prior knowledge, e.g., a pathway database or a protein-protein interaction (PPI) network, into the factorization models. Most existing methods are based on regularization, where additional terms that encourage a solution to be similar to what we expect from the prior knowledge are added to an objective function. For example, consider the case where prior knowledge is given in the form of a PPI network. We expect that parameters associated with two genes that are connected in the PPI network should take similar values. We can take this smoothness constraint into account by adding to the objective function a regularization term involving a graph Laplacian matrix constructed from the PPI network. However, this approach yields only a point estimate: the uncertainty of the solution is neglected, and such methods often suffer from overfitting. Bayesian approaches are more appealing because they compute a full distribution over parameters instead of a single point estimate. However, it is not straightforward to take the prior knowledge into account in a factorization model in the Bayesian framework.
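To make the Laplacian smoothness regularizer concrete, here is a minimal numeric sketch, assuming a toy 4-gene PPI network (the adjacency matrix and the `smoothness_penalty` function below are illustrative, not from the paper): the quadratic form $v^{\top} L v$ is small exactly when genes connected in the network carry similar parameter values.

```python
import numpy as np

# Hypothetical toy PPI network over 4 genes, given as an adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Graph Laplacian L = D - A, where D is the diagonal degree matrix.
L = np.diag(A.sum(axis=1)) - A

def smoothness_penalty(v, L):
    """v^T L v = sum over edges (i, j) of (v_i - v_j)^2: small when the
    parameters of genes connected in the PPI network are similar."""
    return float(v @ L @ v)

v_smooth = np.array([1.0, 1.0, 1.0, 1.0])   # identical over all genes
v_rough = np.array([1.0, -1.0, 1.0, -1.0])  # differs across every edge
assert smoothness_penalty(v_smooth, L) < smoothness_penalty(v_rough, L)
```

Adding such a penalty to the factorization objective pulls connected genes toward shared values, which is exactly the smoothness constraint described above.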
We here propose a Bayesian (semi-)nonnegative matrix factorization for human cancer genomic data, where the biological prior knowledge represented by a pathway database and a PPI network is incorporated into the factorization through a finite dependent Beta-Bernoulli prior model. We tested our method on The Cancer Genome Atlas (TCGA) dataset and found that the pathways identified by our method can be used as prognostic biomarkers for patient subgroup identification.
Notation: for a matrix $A$, $a_{n:}$ represents its $n$th row vector and $a_{:d}$ its $d$th column vector. The $(n,d)$th element of the matrix is expressed by $a_{nd}$.
2 Related work
We assume that observations are given in the form of a matrix $X \in \mathbb{R}^{N \times D}$, where $N$ and $D$ are the numbers of samples and input features, respectively. The goal of tri-matrix factorization is to approximate the matrix by a product of three latent sub-matrices:
$$ X \approx U S V, $$
where $U \in \mathbb{R}^{N \times K_1}$, $S \in \mathbb{R}^{K_1 \times K_2}$, and $V \in \mathbb{R}^{K_2 \times D}$. Here, $K_1$ and $K_2$ are set to numbers smaller than $N$ and $D$. One can add further constraints based on the properties of the observations and the available prior knowledge. In fact, most previous work assumes that the observation matrix is nonnegative and finds all-nonnegative latent sub-matrices [4, 10]. In this paper, however, we relax this restriction. Motivated by the semi-nonnegative factorization of [3], we assume that $X$ is real-valued and that the sub-matrices $U$ and $S$ are still nonnegative while $V$ is real-valued, which will be explained in detail in the next section.
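As a shape-level sketch of the tri-factorization above (the symbols $X \approx USV$ and the toy ranks below follow this rewrite's reconstruction, not exact values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 8    # samples x features
K1, K2 = 2, 3  # inner ranks, chosen much smaller than N and D

# Fully nonnegative case, as in [4, 10]: X ~ U S V with nonnegative factors.
U = rng.random((N, K1))
S = rng.random((K1, K2))
V = rng.random((K2, D))

X_hat = U @ S @ V  # low-rank reconstruction, shape (N, D)

# A product of nonnegative factors is itself nonnegative; the semi-nonnegative
# relaxation in the text drops this requirement for V (and hence for X).
assert X_hat.shape == (N, D)
assert (X_hat >= 0).all()
```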
Nonnegative matrix factorization methods, including the bi-matrix and the tri-matrix factorization models, have been applied to many different biological problems (see the references in [2]). In particular, in [7] the authors use a nonnegative tri-matrix factorization model to identify pathways that are relevant to human cancer, which is the same goal as ours. That method also uses a pathway database and a PPI network to learn the latent matrices, via a regularization approach. However, this approach gives only a single point estimate. Furthermore, the regularization coefficients must be pre-specified (tuned) by users; for large-scale problems, finding optimal regularization coefficients via cross-validation incurs a heavy computational burden.
A Bayesian nonnegative tri-matrix factorization approach has been proposed in [1], where exponential prior distributions are placed on the nonnegative latent matrices and their posterior distributions are computed by Gibbs sampling or variational inference. However, that method does not consider how to use available biological prior knowledge when learning the sub-matrices of the factorization model.
3 Bayesian (semi-)nonnegative matrix tri-factorization
We propose a Bayesian (semi-)nonnegative matrix factorization for human cancer genomic data, where the prior knowledge represented by a pathway database and a PPI network is taken into account in the factorization through a finite dependent Beta-Bernoulli prior model. One can consider our method an extension of the work in [7] to the Bayesian framework; thus, our method shares several nice properties inherited from Bayesian approaches. Furthermore, unlike the work in [7], we use gene expression data instead of mutation data (which usually yields a highly sparse observation matrix). For gene expression data, the observation matrix includes negative values (the same holds for other types of genomic data, such as copy number and miRNA expression). As mentioned earlier, we extend the concept of the semi-nonnegative factorization in [3] to the tri-matrix factorization case, where one latent matrix is allowed to have negative values. However, our method can be applied to fully nonnegative cases as well with minor changes (placing nonnegativity priors on all latent matrices).
Now, we assume the matrix $X \in \mathbb{R}^{N \times D}$ contains gene expression data measured from patients: the $(n,d)$th element of $X$, $x_{nd}$, represents the expression value of the $d$th gene in the $n$th patient sample. Furthermore, we construct $U \in \{0,1\}^{N \times K}$ from the patient-cancer-type information: $K$ is the number of known cancer types and $u_{nk} = 1$ indicates that the $n$th patient has the $k$th cancer type. We assume that each patient has a single cancer type, i.e., $u_{nk} \in \{0,1\}$ and $\sum_{k} u_{nk} = 1$. In addition, let $W^{0} \in \{0,1\}^{R \times D}$ denote a given pathway database, where $R$ is the number of pathways in the database and $w^{0}_{rd} = 1$ if the $d$th gene is annotated as a member of the $r$th pathway. With a binary pathway-membership matrix $W \in \{0,1\}^{R \times D}$ (initialized from $W^{0}$; how it is learned is described below), our factorization model is:
$$ X \approx U S (W \circ V), \tag{2} $$
where $\circ$ is the Hadamard operator, i.e., the element-wise multiplication operator. In our model (2), the matrix $V \in \mathbb{R}^{R \times D}$ is a set of basis (row) vectors, and only a few elements of $V$ (those corresponding to the member genes of each pathway) can contribute to the factorization, owing to the element-wise multiplication with the binary matrix $W$. The matrix $S \in \mathbb{R}_{+}^{K \times R}$ then represents associations between cancer types and pathways: its element $s_{kr}$ quantifies the association between the $k$th cancer type and the $r$th pathway, and higher values indicate stronger associations. Thus, our goal amounts to finding an accurate association matrix $S$ from the data.
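A toy reconstruction under this masked model (the symbols as used in this rewrite: one-hot cluster assignments, nonnegative associations, a binary pathway mask, and real-valued bases; all sizes are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, R, D = 5, 2, 3, 4  # patients, cancer types, pathways, genes

# U: one-hot patient-to-cancer-type assignments (each row sums to 1).
U = np.eye(K)[rng.integers(0, K, size=N)]
# S: nonnegative cancer-type-to-pathway association matrix.
S = rng.random((K, R))
# W: binary pathway-membership mask; V: real-valued basis (row) vectors.
W = rng.integers(0, 2, size=(R, D)).astype(float)
V = rng.normal(size=(R, D))

X_hat = U @ S @ (W * V)  # Hadamard product masks out non-member genes
assert X_hat.shape == (N, D)

# A gene absent from every pathway contributes nothing to the reconstruction.
dead = W.sum(axis=0) == 0
assert np.allclose(X_hat[:, dead], 0.0)
```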
We next need to specify the likelihood model, which we assume to be Gaussian:
$$ p(X \mid U, S, V, W, \beta) = \prod_{n=1}^{N} \prod_{d=1}^{D} \mathcal{N}\!\left(x_{nd} \,\middle|\, [U S (W \circ V)]_{nd},\, \beta^{-1}\right), $$
where $\beta$ denotes the noise precision.
Additional assumptions on the latent matrices are made as follows. With $U$ fixed to the given labels, the reconstructed matrix $U S (W \circ V)$ would have identical rows for all samples belonging to the same cancer type. To remedy this, we also learn the cluster matrix $U$ from the data: each row of $U$ is drawn from a distribution that mixes the assignment taken directly from the patient-cancer-type data with a probability vector to be learned (i.e., its elements are nonnegative and sum to one). The closer the mixing weight is to 1, the more we rely on the given information. Next, each element of $S$ is sampled from an exponential distribution (which enforces nonnegativity): $s_{kr} \sim \mathrm{Exponential}(\lambda)$.
Furthermore, as mentioned earlier, since the observation matrix $X$ is real-valued, we allow $V$ to have negative values: we assume each element of $V$ is sampled from a Gaussian distribution.
Since a pathway database may be incomplete, we also learn the binary matrix $W$, based on the given database $W^{0}$ together with a PPI network (given in the form of a graph, $\mathcal{G}$). To do this, we place a finite dependent Beta-Bernoulli prior distribution on $W$. For better understanding, we first consider a finite Beta-Bernoulli prior distribution without the dependency, which is given in a hierarchical way:
$$ \pi_{r} \sim \mathrm{Beta}(\alpha/R,\, 1), \qquad w_{rd} \mid \pi_{r} \sim \mathrm{Bernoulli}(\pi_{r}). $$
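A small numeric sketch of this finite Beta-Bernoulli prior, assuming the standard construction $\pi_r \sim \mathrm{Beta}(\alpha/R, 1)$, $w_{rd} \sim \mathrm{Bernoulli}(\pi_r)$ (toy sizes, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(2)
R, D = 50, 200  # pathways x genes (toy sizes)
alpha = 3.0

# One sparsity level per pathway (row), shared across all genes.
pi = rng.beta(alpha / R, 1.0, size=R)                # pi_r ~ Beta(alpha/R, 1)
W = (rng.random((R, D)) < pi[:, None]).astype(int)   # w_rd ~ Bernoulli(pi_r)

# E[#nonzeros] = R * D * E[pi_r] = D * alpha / (1 + alpha/R), which tends to
# D * alpha as R grows: the finite model approaches the Indian buffet process [5].
expected = D * alpha / (1.0 + alpha / R)
print(W.sum(), "nonzero entries; expected about", expected)
```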
The expected number of non-zero entries in $W$ is $D\alpha/(1 + \alpha/R)$, and this parametric (finite) model becomes the Indian buffet process when $R$ goes to infinity [5]. We can naturally expect that two genes may work together if they are connected in $\mathcal{G}$; in other words, two connected genes in the graph should tend to be active together in a pathway. To encode this, let $F$ be a matrix containing sets of function values and assume that each column of $F$, $f_{:r}$, follows a Gaussian process given by
a zero-mean Gaussian distribution whose covariance is constructed from $L$, the normalized Laplacian matrix of $\mathcal{G}$. We now couple each Bernoulli variable with a function value using the same method as in [8, 9]:
$$ p(w_{rd} = 1 \mid f_{dr}) = \Phi(f_{dr}), $$
where $\Phi$ is the cumulative distribution function of the standard Gaussian. For example, $w_{rd}$ and $w_{rd'}$ are more likely to be both 1 or both 0 if $f_{dr}$ and $f_{d'r}$ are close to each other.
Lastly, we collect the nonzero entries of $W^{0}$ and, during training, encourage all the corresponding entries of $W$ to stay close to 1.
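The effect of the graph coupling can be sketched numerically. The snippet below assumes a toy chain graph, a smooth Gaussian prior with covariance $(L + I)^{-1}$ (one common choice; the paper's exact kernel is not specified in this draft), and hard thresholding of the probit link for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "PPI" graph over 5 genes: a chain 0-1-2-3-4 (a made-up example).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
deg = A.sum(axis=1)

# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
Dhalf_inv = np.diag(1.0 / np.sqrt(deg))
L = np.eye(5) - Dhalf_inv @ A @ Dhalf_inv

# Gaussian prior with covariance (L + I)^{-1}: function values are smooth
# over the graph, so neighboring genes get correlated values.
cov = np.linalg.inv(L + np.eye(5))
F = rng.multivariate_normal(np.zeros(5), cov, size=5000)

# Probit coupling as in [8, 9]: p(w_d = 1 | f_d) = Phi(f_d); thresholding
# f_d at 0 (the median of that Bernoulli) suffices to show the effect.
Wsamp = (F > 0).astype(int)

# Genes adjacent in the graph (0, 1) co-activate more often than distant
# ones (0, 4), which is exactly the dependency the prior is meant to encode.
agree_near = (Wsamp[:, 0] == Wsamp[:, 1]).mean()
agree_far = (Wsamp[:, 0] == Wsamp[:, 4]).mean()
assert agree_near > agree_far
```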
4 Variational inference
We approximately compute posterior distributions of all the variables in the variational inference framework. First, the variational distributions are assumed to be fully factorized:
The form of each variational distribution is as follows
Note that the variational distribution of $W$ is fully determined by that of $F$ (see also Eq. (15)). Denote the set of all the variables and parameters by $\Theta$. Then the variational distributions can be computed by maximizing the lower bound on the likelihood
where the marginal distribution of a Bernoulli variable $w_{rd}$ can be calculated as follows
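The marginal referenced here is presumably the standard probit-Gaussian integral $\mathbb{E}_{q(f)}[\Phi(f)] = \Phi\!\big(\mu/\sqrt{1+s^2}\big)$ for $q(f) = \mathcal{N}(\mu, s^2)$; a quick Monte-Carlo check of that identity (an assumption of this rewrite, not quoted from the paper):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Under a Gaussian (variational) posterior q(f) = N(mu, s2), the marginal of
# a probit-linked Bernoulli has the closed form Phi(mu / sqrt(1 + s2)).
mu, s2 = 0.7, 1.5
closed = Phi(mu / sqrt(1.0 + s2))

# Monte-Carlo estimate of E_q[Phi(f)] to verify the identity numerically.
rng = np.random.default_rng(5)
f = rng.normal(mu, sqrt(s2), size=100_000)
mc = np.mean([Phi(x) for x in f])
assert abs(mc - closed) < 0.01
```

This closed form is what makes the Bernoulli marginals cheap to evaluate inside the lower bound.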
The variational distributions of $U$, $S$, and $V$ can be updated in closed form (in a similar manner to [1]). After defining suitable new variables, the matrix $F$ (obtained by optimizing the lower bound) can be found by any unconstrained gradient method. For the remaining variables, we solve the following regularization problem,
where the regularization coefficient controls the strength of the penalty. Here, the regularization function (a sum of negative cross-entropy terms) is defined on the posterior distributions over $W$ (cf. [11]).
5 Experimental results
We tested our method on gene expression (mRNA) data from TCGA (downloaded from cBioPortal, http://www.cbioportal.org/, with the downloading option 'TCGA_cancer_type_rna_seq_v2_mrna'). We removed cancer types with a small number of patient samples (fewer than 200); $K$ and $N$ denote the numbers of cancer types and patient samples that remained. We used the 'biocarta' pathway database (available from http://software.broadinstitute.org/gsea/msigdb/collections.jsp), which contains $R$ pathways. We also downloaded a PPI network from BioGRID (https://thebiogrid.org/; version BIOGRID-ORGANISM-Homo_sapiens-3.4.153). After intersecting all the datasets, $D$ common genes remained.
After all the parameters (including the latent matrices) were learned, we selected the top-5 ranked pathways for the $k$th cancer type based on the posterior mean values of the elements in the $k$th row vector of $S$. All the member genes of these pathways were used as the gene signature (biomarkers) for patient stratification of that cancer type. For each cancer type, we performed consensus clustering on the TCGA gene expression data (500 K-means repetitions with bootstrapping [6]) and report two representative results. As shown in Figure 1, there is a clear separation between the two patient groups, which suggests that the pathways identified by our method can be used as prognostic biomarkers.
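The consensus-clustering step can be sketched as follows (a minimal stand-in: a tiny k-means, 50 bootstrap runs instead of 500, and synthetic two-group data; `consensus_matrix` and all settings here are illustrative, not the implementation of [6]):

```python
import numpy as np

def kmeans(X, k, rng, iters=20):
    """Minimal k-means; a stand-in for the repeated clustering runs."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_matrix(X, k=2, runs=50, frac=0.8, seed=0):
    """C[i, j] = fraction of bootstrap runs in which samples i and j were
    subsampled together AND assigned to the same cluster."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(runs):
        idx = rng.choice(n, int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    return together / np.maximum(sampled, 1)

# Two well-separated toy "patient groups": consensus should be near-binary.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
C = consensus_matrix(X)
assert C[0, 1] > 0.9 and C[0, 25] < 0.1
```

A near-binary consensus matrix is the kind of "clear separation" the figure refers to; noisy or unstable signatures would instead yield intermediate consensus values.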
 Thomas Brouwer and Pietro Liò. Fast Bayesian non-negative matrix factorisation and tri-factorisation. In NIPS 2016 Workshop: Advances in Approximate Bayesian Inference, 2016.
 Karthik Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Computational Biology, 4, 2008.
 C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. Technical Report 60428, Lawrence Berkeley National Lab, 2006.
 C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Philadelphia, PA, 2006.
 T. L. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.
 Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1):91–118, 2003.
 Sunho Park, Seung-Jun Kim, Donghyeon Yu, Samuel Peña-Llopis, Jianjiong Gao, Jin Suk Park, Beibei Chen, Jessie Norris, Xinlei Wang, Min Chen, Minsoo Kim, Jeongsik Yong, Zabi Wardak, Kevin Choe, Michael Story, Timothy Starr, Jae-Ho Cheong, and Tae Hyun Hwang. An integrative somatic mutation analysis to identify pathways linked with survival outcomes across 19 cancer types. Bioinformatics, 32(11):1643–1651, 2016.
 Erik B. Sudderth and Michael I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems (NIPS), 2009.
 Sinead Williamson, Peter Orbanz, and Zoubin Ghahramani. Dependent Indian buffet processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
 J. Yoo and S. Choi. Probabilistic matrix tri-factorization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
 Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 15(1), 2014.