Directionally Dependent Multi-View Clustering Using Copula Model

by Kahkashan Afrin, et al.

A fundamental problem in recent biomedical research is to integratively cluster a set of objects using data from multiple sources. Such problems are mostly encountered in genomics, where data are collected from various sources and typically represent distinct yet complementary information. Integrating these data sources for multi-source clustering is challenging because of their complex dependence structure, including directional dependency. In genomics studies in particular, a directional dependence is known to exist between DNA expression, DNA methylation, and RNA expression, widely called The Central Dogma. Most existing multi-view clustering methods assume either an independent structure or pair-wise (non-directional) dependency, thereby ignoring the directional relationship. Motivated by this, we propose a copula-based multi-view clustering model in which a copula enables the model to accommodate the directional dependence present in the datasets. We conduct a simulation experiment in which the simulated datasets exhibit inherent directional dependence; it turns out that ignoring the directional dependence negatively affects clustering performance. As a real application, we apply our model to breast cancer tumor samples collected from The Cancer Genome Atlas (TCGA).




1 Introduction

The development of several new genomics technologies over the past decade has greatly expanded our capability to collect genomics data from multiple sources. These datasets often provide different but complementary information, and hence can be thought of as providing different views of the same underlying phenomenon, i.e., they are “multi-view” (with each dataset representing a particular view). Integrating the datasets from different sources can provide a substantial amount of added information that can significantly improve the diagnosis and prognosis of pathologies.

Integrating multiple datasets has been common practice in several fields, and one application in biomedical research is to cluster a set of objects (patients) based on multi-source data, called multi-view clustering or consensus clustering (lock2013bayesian; kirk2012bayesian). Most multi-view clustering methods in the literature fall into one of two approaches:

  1. Separate or source-specific clustering of each dataset, followed by integration of the outcomes (wang2011nonparametric; hubert1985comparing). It is typically assumed that there is no dependence structure between the datasets. Post hoc integration is performed using the level of agreement between the separate clusterings (nguyen2007consensus; wang2011bayesian; wang2011nonparametric).

  2. Integration of all the datasets prior to a joint clustering to obtain a single unified model (kormaksson2012integrative; mo2013pattern). Although an integrative model would be capable of exploring the hidden dependence structure, such an approach requires knowledge of the unknown joint distribution of the data sources. Previous models, such as (mo2013pattern) and (shen2009integrative), have assumed that the data sources are conditionally independent given the clustering in order to estimate the joint likelihood.

In such multi-view clustering, a key consideration is how to accommodate the dependence structure existing across the datasets. Incorporating such dependencies between the multiple datasets is known to encapsulate relevant information (klami2007local). This has made the latter of the two approaches more popular, as joint clustering using the integrated datasets makes it possible to capture this dependency. However, integrating multiple datasets and modeling their dependence remains one of the key challenges. Several pioneering works have addressed this problem over the years. For example, (kirk2012bayesian) proposed the multiple dataset integration method, which models each dataset using a Dirichlet-multinomial mixture model and then uses the pairwise dependencies between the datasets to share information. Their method allows identification of groups of genes that tend to fall together in one cluster. On the other hand, (lock2013bayesian) proposed an integrative statistical model for two or more data sources. The approach is based on defining a source-specific clustering (i.e., separately clustering the samples for each of the datasets) followed by consensus clustering. The dependence between the datasets is captured by a parameter that controls the adherence of the source-specific clusterings to the consensus clustering. They also emphasized the computational scalability of the Bayesian framework for simultaneously estimating the consensus clustering and the source-specific clusterings.

Based on the literature on integrative clustering, we note that existing research has focused on modeling the dependence structure without considering directionality. In contrast, most real-world scenarios, especially genomics datasets, have dependencies that are often directional. For example, in the transcription and translation process, the information in DNA is transcribed to messenger RNA (mRNA), and the mRNA is then translated into protein (alberts2017molecular; kim2017integrative), making these datasets directionally dependent; this flow is widely called The Central Dogma (crick1958protein; crick1970central).

To address this gap, we propose a Bayesian directional multi-view clustering approach that incorporates the directional dependence between the datasets using a copula function. Copulas are multivariate distribution functions whose one-dimensional margins are uniform on the interval [0, 1]; they allow us to model the dependence structure without considering the marginals (nelsen2007introduction). Owing to the great modeling flexibility provided by copulas, they have been used extensively in the literature for capturing the dependency between datasets. For instance, (rey2012copula) used a Gaussian copula to construct a Dirichlet prior mixture of multivariate distributions to perform dependency-seeking clustering and showed significant improvement in the clustering results. Nonetheless, to the best of our knowledge, no extant work in multi-view clustering has used directional dependencies. In this work, we obtain the directional dependence between the datasets using an asymmetric copula regression. As pointed out by (sungur2005note), a symmetric copula can only provide directional dependence in the marginal behavior but not in the joint behavior, and therefore cannot be used to capture the directional dependence. The asymmetric copula is thus crucial for modeling the directionality. Here, we use the Rodriguez-Lallena and Ubeda-Flores (rodriguez2004new) family of asymmetric copulas, described further in Section 2. Further, we analyze the asymmetric copulas from a regression perspective, which allows us not only to detect the existence of dependence between the datasets but also to quantify the directional dependence (sungur2005note) (details in Sections 2 and 3). To evaluate its efficacy, we apply our model to both synthetic data and a real dataset of breast cancer tumor samples, publicly available from The Cancer Genome Atlas (TCGA) (lock2013bayesian). Using the results, we demonstrate that including the directional dependence significantly improves the clustering performance.

The outline of this article is as follows. In Section 2, we describe the modeling background for the integrative mixture model and the measurement of directional dependence using copulas. Section 3 presents estimation and inference for the proposed methodology. In Section 4, we describe the simulation and case study examples along with results and comparative analysis, and we conclude with a brief discussion in Section 5.

2 Background

2.1 Dirichlet Mixture Models

Let $\mathbf{X} = \{X_1, \ldots, X_M\}$ denote a collection of $M$ distinct data sources measured on $N$ objects (e.g., patients with or without a cancer tumor), and for each data source $X_m$ let $x_{mi}$ represent the data corresponding to the $i$-th object. For example, if the first data source (indexed by $m = 1$) is RNA gene expression, then $x_{1i}$ denotes the RNA gene expression profile of the $i$-th patient. For each data source $X_m$, let $c_{mi}$ denote a latent variable corresponding to the data $x_{mi}$, where $c_{mi} = k$ means that the $i$-th object belongs to the $k$-th cluster.

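For each data source $X_m$, we assume that the data $x_{mi}$ of the $i$-th object are generated from a mixture density (mclachlan1988mixture; bishop2006pattern):

$$p(x_{mi}) = \sum_{k=1}^{K} \pi_{mk} \, f(x_{mi} \mid \theta_{mk}), \qquad (1)$$

where $K$ is the number of mixture components, $\pi_{mk}$ represents the probability of $x_{mi}$ belonging to the $k$-th cluster, and $f(\cdot \mid \theta_{mk})$ is a probability density function for the data $x_{mi}$ indexed by the parameters $\theta_{mk}$. If the density $f$ is chosen to be a Gaussian density with mean $\mu_{mk}$ and variance $\sigma_{mk}^2$, then we have $\theta_{mk} = (\mu_{mk}, \sigma_{mk}^2)$, leading to the Gaussian mixture density (rasmussen2000infinite). For a detailed explanation of the mixture model, refer to page 430 of bishop2006pattern.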
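To make the mixture density concrete, the following is a minimal sketch for the univariate Gaussian case. The function names and the two-component parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def mixture_density(x, weights, means, variances):
    """Mixture density: sum over components of weight times component density."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def responsibilities(x, weights, means, variances):
    """Posterior probability that x was generated by each component (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component example (values chosen only for illustration).
weights, means, variances = [0.4, 0.6], [-2.0, 3.0], [1.0, 0.5]
print(mixture_density(0.0, weights, means, variances))
print(responsibilities(2.5, weights, means, variances))
```

The `responsibilities` helper is the quantity a clustering procedure would use to assign an observation to a component once the weights and component parameters are known.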

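For each data source $X_m$, the hierarchical structure of the Dirichlet mixture model (walker2007sampling; hjort2010bayesian; muller2015bayesian) is given by

$$x_{mi} \mid c_{mi} = k, \theta_{mk} \sim F(\theta_{mk}), \qquad (2)$$

$$c_{mi} \mid \pi_m \sim \text{Multinomial}(\pi_{m1}, \ldots, \pi_{mK}), \qquad (3)$$

$$\pi_m \sim \text{Dirichlet}(\alpha / K, \ldots, \alpha / K), \qquad \theta_{mk} \sim G_0, \qquad (4)$$

where $F$ is the (cumulative) distribution function for the density $f$ appearing in the mixture density (1), $\alpha$ denotes the scaling parameter of the Dirichlet distribution in (4), and $G_0$ denotes a base measure for the parameter $\theta_{mk}$. For a detailed description of the Dirichlet mixture model and its practical implementation for clustering a dataset, refer to (gorur2010dirichlet).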
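The generative reading of a finite Dirichlet mixture model can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base measure is assumed to be a Gaussian on the component means with unit component variances, the symmetric Dirichlet weights are drawn by normalizing independent gamma variables, and all names and numerical values are hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw weights from a symmetric Dirichlet(alpha/k, ..., alpha/k).

    Normalizing independent Gamma(alpha/k, 1) draws yields a Dirichlet vector.
    """
    gammas = [rng.gammavariate(alpha / k, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_dataset(n, k, alpha, rng):
    """Generate n observations from a finite Dirichlet mixture of Gaussians."""
    weights = sample_dirichlet(alpha, k, rng)
    # Assumed base measure: component means ~ N(0, 5^2), unit component variances.
    means = [rng.gauss(0.0, 5.0) for _ in range(k)]
    # Latent allocations: one component label per object.
    labels = rng.choices(range(k), weights=weights, k=n)
    data = [rng.gauss(means[c], 1.0) for c in labels]
    return weights, means, labels, data


rng = random.Random(0)
weights, means, labels, data = sample_dataset(n=200, k=5, alpha=2.0, rng=rng)
print("occupied components:", sorted(set(labels)))
```

With a small scaling parameter, most of the weight tends to concentrate on a few components, so the number of occupied components is typically smaller than the number of components specified.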

2.2 Motivation to Accommodate Directional Dependency

One of the fundamental ideas of molecular biology is The Central Dogma (crick1958protein; crick1970central), which describes the two-step process, transcription and translation, by which the information in genes flows into proteins: DNA to RNA, and RNA to protein, but not in the reverse direction. A primary motivation of our research is to exploit this directional dependency in a multi-view clustering problem (consensus clustering problem) (kim2017integrative; lock2013bayesian; wang2014breast), eventually leading to a unified clustering decision.

In this paper, we use the notion of a copula (trivedi2007copula; jaworski2010copula) to accommodate such a directional dependency. Copulas are widely used to model the dependence structure between random variables by decoupling the dependence structure from the marginal distributions (demarta2005t). Technically, an $n$-dimensional copula is an $n$-dimensional joint distribution function with uniform margins. Sklar’s theorem (sklar1959fonctions) is the central theorem in dealing with copulas, and it elucidates the role of copulas in constructing a dependency between multiple random variables. (For a detailed explanation of copulas, refer to nelsen2007introduction.)

Theorem 1 (Sklar’s theorem).

Let $H$ be an $n$-dimensional joint distribution function with margins $F_1, \ldots, F_n$. Then there exists an $n$-dimensional copula $C$ that satisfies the following equality for all $(x_1, \ldots, x_n)$:

$$H(x_1, \ldots, x_n) = C(F_1(x_1), \ldots, F_n(x_n)).$$

Additionally, if all the marginals $F_1, \ldots, F_n$ are continuous, then $C$ is unique. Conversely, if $C$ is a copula and all the margins $F_1, \ldots, F_n$ are distribution functions, then the function $H$ that satisfies the above equation is a joint distribution function with margins $F_1, \ldots, F_n$.
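The converse direction of Sklar's theorem can be illustrated numerically: plugging arbitrary margins into a copula yields a joint distribution function that returns those margins. The sketch below uses the Farlie-Gumbel-Morgenstern copula with an exponential and a Gaussian margin; these choices and the parameter values are assumptions made only for illustration and are not the copula used in this paper.

```python
import math


def fgm_copula(u, v, theta):
    """Farlie-Gumbel-Morgenstern copula, valid for -1 <= theta <= 1."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def exponential_cdf(x, rate):
    """Distribution function of an exponential variable."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def normal_cdf(x, mean, sd):
    """Distribution function of a Gaussian variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def joint_cdf(x, y, theta=0.8):
    """Joint distribution built as copula(margin_1(x), margin_2(y))."""
    return fgm_copula(exponential_cdf(x, 1.5), normal_cdf(y, 0.0, 2.0), theta)


# Sending one argument to a very large value recovers the other margin.
print(joint_cdf(1.0, 1e6), exponential_cdf(1.0, 1.5))
print(joint_cdf(1e6, 0.5), normal_cdf(0.5, 0.0, 2.0))
```

The dependence is controlled entirely by the copula parameter, while the margins can be swapped freely, which is the decoupling that copula-based models rely on.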

Several studies in the literature have aimed at accommodating directional dependency by using a copula (kim2009directional; kim2014analysis; dodge2000direction; jung2008new). Among these, the regression-based approach is widely used. For example, dodge2000direction presented the idea of directional dependence by using an ordinary linear regression. To be specific, the authors showed that when regressing a random variable $Y$ on $X$, there exists a directional dependence of $Y$ on $X$ if the square of the sample skewness of $Y$, denoted by $\gamma_Y^2$, is less than that of $X$, that is, $\gamma_Y^2 < \gamma_X^2$. More recently, sungur2005some; sungur2005note argued that copula regression models may offer a way to capture the directional dependence between variables. This is mostly because the copula regression approach can model the joint dependence structure between the random variables independently of the choice of the marginal distributions.
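The skewness criterion can be checked on simulated data. The sketch below assumes a skewed (exponential) predictor and a response equal to a linear function of the predictor plus symmetric Gaussian noise; the variable names, sample size, and coefficients are hypothetical.

```python
import random


def sample_skewness(values):
    """Moment-based sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
x = [rng.expovariate(1.0) for _ in range(20000)]      # skewed predictor
y = [2.0 + xi + rng.gauss(0.0, 1.0) for xi in x]      # response with symmetric noise

skew_x_sq = sample_skewness(x) ** 2
skew_y_sq = sample_skewness(y) ** 2
# The variable with the smaller squared skewness is taken as the response.
direction = "X -> Y" if skew_y_sq < skew_x_sq else "Y -> X"
print(skew_x_sq, skew_y_sq, direction)
```

Adding symmetric noise dilutes the skewness inherited from the predictor, which is why the response ends up less skewed than the variable that drives it.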

2.3 A Copula for Directional Dependency

A key idea for accommodating a directional dependency between two variables, say $U$ and $V$, is to construct an asymmetric copula (liebscher2008construction). Technically, a bivariate asymmetric copula is any copula $C$ that satisfies $C(u, v) \neq C(v, u)$ for some $u, v \in [0, 1]$.
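The asymmetry condition can be checked numerically. As an illustration only, the sketch below builds an asymmetric copula from a symmetric Clayton copula using Khoudraji's device; this is a different construction from the Rodriguez-Lallena and Ubeda-Flores family considered in this paper, and the parameter values are hypothetical.

```python
def clayton_copula(u, v, theta):
    """Clayton copula (symmetric in its arguments), theta > 0."""
    if u == 0.0 or v == 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def khoudraji_copula(u, v, theta, a, b):
    """Asymmetric copula u^(1-a) * v^(1-b) * C0(u^a, v^b), with 0 <= a, b <= 1."""
    return u ** (1.0 - a) * v ** (1.0 - b) * clayton_copula(u ** a, v ** b, theta)


def max_asymmetry(copula, grid_size=50):
    """Largest |C(u, v) - C(v, u)| over a regular grid in the unit square."""
    points = [i / grid_size for i in range(1, grid_size)]
    return max(abs(copula(u, v) - copula(v, u)) for u in points for v in points)


def symmetric(u, v):
    return clayton_copula(u, v, 2.0)


def asymmetric(u, v):
    return khoudraji_copula(u, v, 2.0, 0.9, 0.3)


print(max_asymmetry(symmetric), max_asymmetry(asymmetric))
```

The asymmetry disappears when the two exponents coincide, so the gap between them plays a role analogous to the asymmetry parameters of a directional copula.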

In this paper, we consider an asymmetric copula from the Rodriguez-Lallena and Ubeda-Flores family (rodriguez2004new):

where , and , . The association parameter measures the dependence between and , while the asymmetry in the copula is owing to the parameters and . Given observations , the maximum likelihood estimates (MLE) of and are given as:


and following is the admissibility bound for shown by bairamov2001new:


Let the value of denote the degree of directional dependency of on , with a higher value indicating a stronger directional dependency. Adopting the idea of (sungur2005note), can be expressed as:


where the expectation is taken with respect to the copula regression function defined by . For the Rodriguez-Lallena and Ubeda-Flores copula, we can express (7) in a closed-form (sungur2005note; sungur2005some):


3 Copula-based Multi-view Clustering

3.1 A Copula-based Dirichlet Mixture Model

Note that the Dirichlet mixture model (2) – (4) has been constructed separately for each data source ($m = 1, \ldots, M$); hence, there is still no borrowing of information between datasets. The key idea of the proposed model is, first, to make use of the formula for the directional dependency (8) between pairs of datasets and, second, to utilize these quantities in clustering through the Dirichlet mixture model (2) – (4).

A hierarchical formulation of a copula-based Dirichlet mixture model based on the Rodriguez-Lallena and Ubeda-Flores copula is



denotes the gamma distribution with shape parameter and rate parameter , and is the indicator such that (otherwise zero) if there is a known directional dependency from the -th dataset towards the -th dataset, while (8) is the degree of directional dependency associated with the two datasets. The notations , , and are used in the same way as in the Dirichlet mixture model (2) – (4). The is a re-written version of from equation (3) where . The parameter is the directional dependence of dataset on dataset , as defined in (7). Finally, is the prior on the parameter , and it is discussed in Section 3.2.

We emphasize that what distinguishes the proposed model (9) from the Dirichlet mixture model (2) – (4) is the cluster allocation procedure induced by , which is analogous to the idea of (kirk2012bayesian), except that we use directional dependency parameters , whereas those authors used a two-directional dependency.

3.2 Directional Dependence Prior

For each , given two datasets and , let the notations and denote the MLEs of the two parameters, and , of the Rodriguez-Lallena and Ubeda-Flores copula, given by (5). Then after averaging each of the quantities over the observations, that is, and , we can express the directional dependence of the -th dataset on the -th dataset (sungur2005some):


where is a measure of association between the random variables and . Note that (10) is a copula-based random variable: first, the two quantities and are derived from the directional copula discussed in Subsection 2.3; and second, is the stochastic part, rendering a random quantity.

We define a Gaussian distribution for , given by , where is again copula-driven:

The rationale for the choice of above is as follows. Since the Gaussian random variable has variance , we have the inequality , which implies that holds with very high probability. Because holds with very high probability, the posterior updates are all conjugate updates, without the added computational burden of truncated distributions. Moreover, as is normally distributed, it is easy to see that follows a gamma distribution. In particular,

This defines our prior on , which is denoted as .

3.3 Posterior Updates

In this section, we present a general Bayesian approach to estimate the posterior updates using a Gibbs sampling approach.

where is observation for dataset , are all the observations not including in dataset associated with component , are all the such that , are all the such that , and is a normalizing constant that ensures .

Please note that a latent variable has been added to help with the computational efficiency and that the updates on will depend on the choice of and .

4 Datasets

To demonstrate the effectiveness of directional dependencies in multi-view clustering, we consider both simulated and real examples which are described in detail in the following subsections.

4.1 Simulated Data

Figure 1: Similarity matrix for the simulation study when (a) the true direction is considered, (b) the direction is reversed, and (c) no directionality is considered. The colormap shows the posterior probability that samples $i$ and $j$ belong to the same cluster.

For the simulation study, we consider two datasets, $X_1$ and $X_2$, each generated from a univariate Gaussian mixture model with 2 components. The corresponding means and standard deviations of the mixture model are and , respectively. The dependence structure is defined by an asymmetric Tawn copula given as:

where is called the Pickands dependence function. For the Tawn copula, the Pickands function is given as,

In this case, we consider the Tawn copula of Type 1 such that . For the current simulation study, we set the values of and to and respectively, and the direction of dependence is set from $X_1$ to $X_2$. Details on the Tawn copula may be found in (kraus2017d).

For each dataset, we generate 500 data points, and the measure of directional dependence in the specified direction, as estimated from Equation (3), is equal to 1.73. Note that the dependence in the opposite direction was estimated to be -1.02, indicating no notable dependence. We first test the performance of the proposed method when the true directionality is used. In this case, the method correctly predicts the 2 clusters and the labels, as shown in the joint similarity matrix in Figure 1(a). A similarity matrix displays the posterior probability that samples $i$ and $j$ belong to the same cluster (see (kirk2012bayesian) for details). The overall clustering accuracy for this case is 98.8%. We test two more scenarios: first, when no direction of dependence is considered, and second, when the direction of dependence is reversed. The clustering results corresponding to these two cases are presented in the joint similarity matrices shown in Figures 1(c) and 1(b), respectively. We note that the clustering performance is affected in both cases, but more so in the second one, where we observe 3 different clusters. In the case where no directionality is considered, we do obtain 2 clusters; however, the accuracy drops to 97.2%. This is intuitive, as this scenario only considers dependence without any directionality (similar to (lock2013bayesian)). In contrast, the second case is similar to assuming no dependence, and therefore the performance is worse than in both of the other cases.
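A similarity matrix of this kind can be estimated from sampler output as the fraction of posterior draws in which two samples share a cluster label. The sketch below assumes the Gibbs sampler returns one label vector per draw; the function name and the toy draws are hypothetical.

```python
def similarity_matrix(label_samples):
    """Fraction of posterior draws in which objects i and j share a cluster."""
    n_draws = len(label_samples)
    n = len(label_samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / n_draws for count in row] for row in counts]


# Hypothetical sampler output: 4 draws of cluster labels for 5 objects.
draws = [
    [0, 0, 1, 1, 1],
    [2, 2, 0, 0, 0],  # same partition as the first draw, different label names
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
sim = similarity_matrix(draws)
for row in sim:
    print([round(value, 2) for value in row])
```

Because only co-membership is counted, the estimate is unaffected by the arbitrary relabeling of clusters between draws.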

4.2 TCGA Breast Cancer Data

For the real example, we consider the multi-source genomic data of breast cancer tumor samples from TCGA, which was also adopted as case study data in (lock2013bayesian). The data are publicly available for download. The TCGA data contain a common set of 348 breast cancer tumor samples ($N = 348$), and our dataset comprises four data sources ($M = 4$):

  • RNA gene expression (GE) data for 171 genes.

  • DNA methylation (ME) data for 171 probes.

  • miRNA expression (miRNA) data for 171 miRNAs.

  • Reverse phase protein array (RPPA) data for 171 proteins.

It is known that these four types of data manifest differently but, at the same time, are highly related in that they are directionally dependent. Figure 2 shows the four data sources, with the yellow arrows representing the direction of dependence between them. The process that determines the direction of dependence from DNA to RNA is called transcription, and the one from RNA to protein is called translation. MiRNAs (or microRNAs) are single-stranded RNAs that exert their regulatory action by binding messenger RNAs and preventing their translation into proteins.

Figure 2: Directional dependencies among biological components

From a statistical perspective, both transcription and translation can be expressed in terms of directional dependency in our copula model because the opposite processes (i.e., protein to RNA, RNA to DNA) do not occur. Since the two directions are a priori facts, we encode them by providing deterministic directional indicators for each process, where and are the corresponding indices of the data sources. The corresponding strengths of the directional dependencies are quantified by . From a numerical perspective, we expect that providing the deterministic directional indicators to the model makes the summation in less burdensome and therefore improves the computation speed. Because we apply the copula at the data level, not at the latent level, the data dimensions of all data sources must match (here, 171 features per source). These four data sources are measured on different platforms and represent different biological components. However, they all represent genomic data for the same sample set, and it is reasonable to expect some shared structure while accounting for the directional dependencies at hand.

As explained in the previous section, we cluster samples based on four datasets: gene expression, DNA methylation, microRNA, and RPPA for breast cancer from the TCGA data. Prior studies have found that the total number of clusters can vary anywhere from 2 (duan2013metasignatures) to 10 (curtis2012genomic). In comparison, four comprehensive sub-types have been identified based on multi-source consensus clustering of the TCGA data: Basal, Luminal A, Luminal B and HER2 (cancer2012comprehensive). We incorporate our prior biological knowledge, namely the directionality based on the central dogma, into our clustering algorithm. Since proteins are the final outcome, we take the consensus (or final) clustering to be the protein clusters, which summarize all the information from the other three datasets. To initialize the Bayesian posterior update algorithm, we take advantage of our finite Dirichlet mixture model to define the number of clusters. Although our model is a finite mixture model, it is equivalent to a Dirichlet process mixture model in the limit of infinitely many components; the specified number of components therefore acts as an upper bound on the number of clusters present in the data. The authors of (rousseau2011asymptotic) argue that if the specified number of clusters is sufficiently large, the posterior updates can automatically determine the true number of clusters present in the data. Based on this analogy, (kirk2012bayesian) suggest a choice of the cluster number that avoids excessive computational burden. Even though we set a very large number, 500, as our cluster number, our algorithm correctly predicts the number of clusters to be four, similar to the sub-typing in TCGA. Since copy number variation is associated with breast cancer risk and prognosis (kumaran2017germline), we calculated the fraction of the genome altered (FGA) for each cluster as a measure of copy number activity, as described in (cancer2012comprehensive) Supplementary Section VII (with threshold T=0.15).
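One common way to compute a fraction of the genome altered from a segmented copy-number profile is sketched below. It assumes that a segment counts as altered when the absolute value of its log2 copy ratio exceeds the threshold and that segments are weighted by their length; the exact procedure in the cited supplementary section may differ, and the function name, input layout, and example segments are hypothetical.

```python
def fraction_genome_altered(segments, threshold=0.15):
    """Length-weighted fraction of segments whose |log2 ratio| exceeds the threshold.

    `segments` is a list of (start, end, log2_ratio) tuples for one sample.
    """
    total = sum(end - start for start, end, _ in segments)
    if total <= 0:
        raise ValueError("segments must cover a positive length")
    altered = sum(end - start for start, end, ratio in segments if abs(ratio) > threshold)
    return altered / total


# Hypothetical segmented copy-number profile (positions in base pairs).
sample_segments = [
    (0, 4_000_000, 0.02),
    (4_000_000, 6_000_000, 0.61),
    (6_000_000, 9_000_000, -0.30),
    (9_000_000, 10_000_000, 0.10),
]
print(fraction_genome_altered(sample_segments))
```

Per-cluster summaries such as the mean FGA or the share of samples above 0.4 can then be obtained by applying this function to every sample and grouping by cluster label.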
Our results are summarized in Table 1 and Figure 3. The TCGA sub-types and our clusters have different structures, but they are not independent according to the Chi-squared test of independence (p-value 0.0001). Cluster 1 is mostly a combination of the Luminal-like breast cancer sub-types (Lum A and Lum B), which are similar to each other, with average FGA (), and almost 10 of their samples have high FGA (more than 0.4). Cluster 2 is mostly Luminal-like breast cancer B with higher FGA (). Moreover, cluster 2 contains the highest FGA among all four clusters, with almost 22. Cluster 3 consists mostly of the HER2 and Basal sub-types, which are more similar to each other, with the lowest FGA () and very few high-FGA samples, at almost only 6. Cluster 4 is more spread over the four known subtypes (Her2, Basal, Lum A and Lum B), but it includes samples with high FGA () and a lower standard deviation than cluster 2. Moreover, cluster 4 ranks second in high-FGA samples, with almost 17.

There are two other state-of-the-art algorithms for clustering based on multiple datasets: Bayesian Consensus Clustering (BCC) (lock2013bayesian) and Multiple Dataset Integration (MDI) (kirk2012bayesian), as explained before. We cannot compare our results with them for the following reasons. First, the MDI algorithm provides separate clusterings rather than a consensus or global clustering. Second, the BCC algorithm calculates a global clustering, but it needs to know the number of clusters beforehand, which is a serious limitation when the number of clusters is unknown. Its authors developed a heuristic to calculate the number of clusters as a pre-processing step, but it wrongly chose three clusters for the TCGA data.

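Sub-type   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Her2           6           0          28           5
Basal          2           1          54          15
Lum A         71          56           3          31
Lum B         54           4           8          10
Table 1: Confusion matrix for the clustering assignment (rows: TCGA sub-types; columns: clusters from the proposed model)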
Figure 3: Distribution of fraction of the genome altered across breast cancer clustering based
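The Chi-squared test of independence on the counts in Table 1 can be reproduced with the short sketch below. It uses only the Python standard library; the tail probability relies on a closed form that holds for odd degrees of freedom, which applies here because a 4-by-4 table has 9 degrees of freedom. The function names are hypothetical.

```python
import math

# Counts from Table 1 (rows: TCGA sub-types; columns: clusters 1 to 4).
table = {
    "Her2": [6, 0, 28, 5],
    "Basal": [2, 1, 54, 15],
    "Lum A": [71, 56, 3, 31],
    "Lum B": [54, 4, 8, 10],
}


def chi_squared_statistic(rows):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi2_sf_odd(x, dof):
    """Upper tail probability of a chi-squared variable with odd degrees of freedom."""
    if dof % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    term, series = math.sqrt(x), 0.0
    for k in range(1, (dof - 1) // 2 + 1):
        series += term            # x^(k - 1/2) / (1 * 3 * ... * (2k - 1))
        term *= x / (2 * k + 1)
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series


stat, dof = chi_squared_statistic(list(table.values()))
p_value = chi2_sf_odd(stat, dof)
print(stat, dof, p_value)
```

The row and column totals add up to the 348 samples in the study, which is a quick consistency check on the table itself.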

5 Conclusion

It is known that genomics datasets collected from multiple sources are often related and, when used jointly, can significantly improve clustering. Nonetheless, several processes in nature are irreversible and move in a particular direction. This is often true in genomics, where one dataset is not only dependent but directionally dependent on another, for example, in the process of conversion of DNA to protein. We utilized this domain knowledge and proposed a novel method for multi-view clustering that incorporates the directional dependence between the datasets using a copula model. The use of copulas to model directionality provides a robust and versatile tool for capturing the directional dependence in joint behavior. Application of the proposed method to synthetic as well as real datasets demonstrates its efficacy. Most importantly, we believe that capturing directional dependence instead of simple dependence can provide an added understanding of the underlying process. A more rigorous and in-depth comparative analysis between dependence-seeking and directional-dependence-seeking multi-view clustering is left for future work.

6 Acknowledgement

Research reported in this publication was partially supported by the National Cancer Institute of the National Institutes of Health under award number R01CA194391 and by NSF grants CCF-1934904 and IIS-1741173.