Learning Gene Regulatory Networks with High-Dimensional Heterogeneous Data

05/07/2018 ∙ by Bochao Jia, et al. ∙ University of Florida 0

The Gaussian graphical model is a widely used tool for learning gene regulatory networks with high-dimensional gene expression data. Most existing methods for Gaussian graphical models assume that the data are homogeneous, i.e., all samples are drawn from a single Gaussian distribution. However, for many real problems, the data are heterogeneous, which may contain some subgroups or come from different resources. This paper proposes to model the heterogeneous data using a mixture Gaussian graphical model, and apply the imputation-consistency algorithm, combining with the ψ-learning algorithm, to estimate the parameters of the mixture model and cluster the samples to different subgroups. An integrated Gaussian graphical network is learned across the subgroups along with the iterations of the imputation-consistency algorithm. The proposed method is compared with an existing method for learning mixture Gaussian graphical models as well as a few other methods developed for homogeneous data, such as graphical Lasso, nodewise regression and ψ-learning. The numerical results indicate superiority of the proposed method in all aspects of parameter estimation, cluster identification and network construction. The numerical results also indicate generality of the proposed method: it can be applied to homogeneous data without significant harms.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The emergence of high-throughput technologies has made it feasible to measure the activities of thousands of genes simultaneously, which provides scientists with a major opportunity to infer gene regulatory networks. Accurate inference of gene regulatory networks is pivotal to gaining a systematic understanding of the molecular mechanism, to shedding light on the mechanism of diseases that occur when cellular processes are dysregulated, and to identifying potential therapeutic targets for the diseases. Given the high dimensionality and complexity of high-throughput data, inference of gene regulatory networks largely relies on the advance of statistical modeling and computation. The Gaussian graphical model is a promising tool to achieve this challenge.

The Gaussian graphical model uses a network to depict conditional independence relationships for a large set of Gaussian random variables, where the absence of an edge between two variables indicate independence of the two variables conditioned on all other variables. In the literature, a variety of methods have been proposed to learn Gaussian graphical networks. To name a few, they include covariance selection (Dempster, 1972), nodewise regression (Meinshausen and Bühlmann, 2006), graphical Lasso (Yuan and Lin, 2007; Friedman et al., 2008), adaptive graphical Lasso (Fan et al., 2009), projected covariance matrix method (Fan et al., 2015), and

-learning (Liang et al., 2015). In general, these methods assume that the data are homogeneous, i.e., all samples are drawn from a single Gaussian distribution. However, in practice, we have often the data that are heterogeneous, i.e., the samples are drawn from a mixture Gaussian distribution, while a single Gaussian graphical network still needs to be learned for all the samples in a fashion of data integration. Here are some examples:

  • Data with hidden biological/clinical subtypes. It is known that complex diseases such as cancer can have significant heterogeneity in response to treatments, and this heterogeneity is often reflected in gene expression. For example, the gene expression patterns can vary with subtypes of the cancer. Since for many types of cancers, the definition of subtypes is still unclear and the number of samples from each subtype can be very small, it is impractical to construct an individual gene regulatory network for each subtype. In this case, we might still be interested in constructing a single gene regulatory network for the heterogeneous data in a fashion of data integration. Such an integrated gene regulatory network can facilitate us to identify fundamental patterns common to the development and progression of the disease.

  • Data with hidden confounding factors. In real-world applications, the gene expression data may contain some systematic differences caused by known or unknown confounding factors, such as study cohorts, sample collection, experimental batches, etc. Due to the limited number of samples from each level of the confounding factors, we also prefer to learn a single gene regulatory network for the heterogeneous data in a fashion of data integration. Moreover, for many problems, the confounding factors can be unknown.

In this paper, we develop a mixture model method to learn Gaussian graphical networks for heterogeneous data with hidden clusters. The new method is developed based on the imputation-consistency (IC) algorithm proposed by Liang et al. (2018)and the

-learning algorithm proposed by Liang et al. (2015). The IC algorithm is a general algorithm for dealing with high-dimensional missing data problems. Like the EM algorithm (Dempster et al., 1977), the IC algorithm works in an iterative manner, iterating between an I-step and a C-step. The I-step is to impute the missing data conditioned on the observed data and the current estimate of parameters, and the C-step is to find a “consistent” estimator for the minimizer of a Kullback-Leibler divergence defined on the pseudo-complete data. For high-dimensional problems, the “consistent” estimate can be found with sparsity constraints or screened data. Refer to Fan and Lv (2008) and Fan and Song (2010) for variable screening methods. Under quite general conditions, Liang et al. (2018)showed that the average of the “consistent” estimators across iterations is consistent to the true parameters. The

-learning algorithm is originally designed for learning Gaussian graphical models for homogeneous data. The proposed method can be viewed as a combination of the IC algorithm and the -learning algorithm, which simultaneously clusters samples to different groups and learn an integrated network across all the groups. When applying the IC algorithm to cluster samples, their cluster membership is treated as missing data.

We note that the proposed mixture model method is different from the methods for joint estimation of multiple Gaussian graphical models, such as fused Lasso (Danaher et al., 2014), Bayesian nodewise regression (Lin et al., 2017), and Bayesian integrated -learning (Jia et al., 2018). For the latter methods, the samples’ cluster membership is known a priori and the goal is to learn an individual network for each cluster of samples. In contrast, the proposed method works for the case that the cluster membership is unknown and the goal is to learn an integrated network across all hidden groups. The proposed method is also different from the methods proposed by Ruan et al. (2011) and Lee et al. (2018). For the former, the goal is to learn an individual network for each cluster of samples, although it assumes that the cluster membership is unknown. The latter is to first group samples to different clusters using an eigen-analysis based approach and then apply the -learning algorithm to learn the network structure. Since the method did not account for the uncertainty of sample clustering, it often performs less well.

The rest part of this paper is organized as follows. In Section 2, we describe the proposed method. In Section 3, we illustrate the performance of the proposed method using simulated examples. In Section 4, we apply the proposed method to learn a gene regulatory network for breast cancer with a heterogeneous gene expression dataset. In Section 5, we conclude the paper with a brief discussion.

2 Mixture Gaussian Graphical Models

2.1 Algorithms for homogeneous data

To have a better description for the proposed method, we first give a brief review for the existing Gaussian graphical model algorithms for homogeneous data.

Let denote a set of Gaussian random variables, where denotes observations of variable . In the context of gene regulatory networks, refers to the expression level of gene measured in experiment . Let denote the expression levels of all genes measured in experiment , which is assumed to follow a Gaussian distribution

with the mean vector

and covariance matrix . Let denote the adjacency matrix, where if the edge is present and 0 otherwise. The adjacency matrix specifies the structure of the Gaussian graphical network. Let denote the partial correlation coefficient of variable and variable conditioned on all other variables. Let denote the concentration matrix, also known as the precision matrix. Let ’s denote the coefficients of the regressions

(1)

where is a zero-mean Gaussian random vector. Since can be expressed as and ’s can be expressed as and , the following relationship holds:

(2)

Based on the relation between partial correlation coefficients and the concentration matrix, Dempster (1972) proposed the covariance selection method, which identifies the edges of the Gaussian graphical network by identifying the nonzero elements of the concentration matrix. However, this method cannot be applied to the problems with , where the sample covariance matrix is nonsingular and thus the concentration matrix cannot be calculated. To tackle this difficulty, Yuan and Lin (2007) proposed to estimate the concentration matrix with

-regularization. Soon, this method was accelerated by Friedman et al. (2008) using the coordinate descent algorithm in a similar way to Lasso regression (Tibishrani, 1996), which leads to the so-called graphical Lasso algorithm. Based on the relation between partial correlation coefficients and regression coefficients, Meinshausen and Bühlmann (2006) proposed the nodewise regression method, which is to learn Gaussian graphical networks by identifying nonzero regression coefficients of the regressions given in (

1) with a sparsity constraint.

Alternative to estimating the concentration matrix and regression coefficients, the -learning algorithm (Liang et al., 2015) is to provide an equivalent measure for the partial correlation coefficient in the sense that

(3)

where is the partial correlation coefficient of variable and variable conditioned on a subset of and the subset is obtained via correlation screening. Since the -learning algorithm is used as a component of the proposed mixture model method for learning Gaussian graphical models with grouped samples, the details of the algorithm are given below.

Algorithm 1.(-learning)

  • (Correlation screening) Determine the reduced neighborhood for each variable .

    • Conduct a multiple hypothesis test to identify the pairs of variables for which the empirical correlation coefficient is significantly different from zero. This step results in a so-called empirical correlation network.

    • For each variable , identify its neighborhood in the empirical correlation network, and reduce the size of the neighborhood to by removing the variables having lower correlation (in absolute value) with . This step results in a so-called reduced correlation network.

  • (-calculation) For each pair of variables and , identify a subset of nodes based on the reduced correlation network resulted in step (a) and calculate , where denotes the partial correlation coefficient of and conditioned on the variables .

    In this paper, we set if and otherwise, where denotes the neighborhood of node in the reduced correlation network, and denotes the cardinality of a set.

  • (-screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which is significantly different from zero, and set the corresponding element of the adjacency matrix to 1.

The multiple hypothesis tests involved in the algorithm can be done using the empirical Bayesian method developed in Liang and Zhang (2008), which allows for the general dependence between test statistics. Other multiple hypothesis testing procedures that account for the dependence between test statistics, e.g., the two-stage procedure of Benjamini et al. (2006), can also be applied here. The correlation screening step involves two procedures, (i) multiple hypothesis test and (ii) sure independence screening, to control the neighborhood size for each variable. The two procedures seem redundant, but actually they are not. Indeed the multiple hypothesis test is able to identify the pairs of independent variables, but the size of each neighborhood cannot be guaranteed to be less than

as established in Liang et al. (2015). We have tried to use the sure independence screening procedure only, which results in the same neighborhood size for each variable. However, in this case, the enlarged neighborhood may contain some variables that are independent of the central one, and thus the power of the followed -screening test will be reduced.

The -learning algorithm consists of two free parameters, namely and , which refer to the significance levels used in correlation screening and -screening, respectively. Following the suggestion of Liang et al. (2015), we specify their values in terms of -values (Storey, 2002); setting and or 0.1 in all computations. In particular, we set for the simulated examples and for the real data example. A large value of avoids to lose more potential interactions between different genes.

Under mild conditions, e.g., the joint Gaussian distribution of satisfies the faithfulness condition, Liang et al. (2015) showed that the -partial correlation coefficient is equivalent to the true partial correlation coefficient in determining the structure of Gaussian graphical models in the sense of (3). Compared to other Gaussian graphical model algorithms, the -learning algorithm has a significant advantage that it has reduced the computation of partial correlation coefficients from a high dimensional problem to a low dimensional problem via correlation screening and thus can be used for very high-dimensional problems. As shown in Liang et al. (2015), the

-learning algorithm is consistent; the resulting network will converge to the true one in probability as the sample size becomes large. The

-learning algorithm tends to produce better numerical performance and cost less CPU time than the existing algorithms, such as gLasso and nodewise regression, especially when is large.

2.2 The Mixture Gaussian Graphical Model Method

Let denote a set of independent samples which are drawn from a mixture Gaussian distribution with components, where the sample size can be much smaller than the dimension . Suppose that is known. Later, we will describe a Bayesian information criterion (BIC) to determine the value of . The log-likelihood function of the samples is given by

(4)

where denotes the collection of unknown parameters, ’s are mixture proportions, ’s are mean vectors, and ’s are covariance matrices of the Gaussian components, respectively; and denotes the density function of the multivariate Gaussian distribution. Let denote the indicator variable for the component/cluster membership of sample , for . That is, and for and . Henceforth, we will use cluster to denote the group of samples assigned to a component of the mixture Gaussian graphical model. Cluster and component are also used exchangeably in this paper.

If the sample size is greater than , then the parameters can be estimated using the EM algorithm as described in what follows. Let , and denote, respectively, the estimates of , and obtained at iteration . Let . The -step calculates the the conditional expectation of given and the current estimate of , i.e.

(5)

which leads to the so-called -function,

(6)

The M-step updates by maximizing the -function, which can be done by maximizing with respect to for each . For each value of , can be updated by setting

(7)

However, this algorithm does not work when , as ’s will be singular in this case.

When , to avoid the issues caused by the singularity of

’s, we propose the following algorithm. For the proposed algorithm, we assume that all components of the mixture Gaussian graphical model share a common adjacency matrix, although their covariance and precision matrices can be different from each other. The new algorithm consists of two stages. The first stage is to apply the Imputation-Consistency (IC) algorithm to generate a series of estimates for the common adjacency matrices, and the second stage is to average the estimates to get a stable estimate for the common adjacency matrix. Note that, as can be seen below, the IC algorithm generates a Markov chain.

To learn the common adjacency matrix at each iteration, a -integration procedure is needed, which is to integrate the adjacency matrices learned for each component into one adjacency matrix. This procedure can be described as follows. Let denote the -partial correlation coefficient calculated for the -th cluster at iteration , which can be transformed to a -score via Fisher’s transformation:

(8)

where denotes the conditioning set used in calculating , and is the number of samples assigned to cluster at iteration . For convenience, we call the -score a -score. The -scores from different clusters can be combined using Stouffer’s meta-analysis method (Stouffer et al., 1949) by setting

(9)

where is a nonnegative weight assigned to cluster at iteration . In this paper, we set . Note that

approximately follows a standard normal distribution under the null hypothesis

. Then a multiple hypothesis test can be conducted on ’s to identify the pairs of nodes for which is differentially distributed from the standard normal , and the adjacency matrix common to all components of the mixture model can be determined thereby. In this paper, we adopted the multiple hypothesis testing procedure of Liang and Zhang (2008) to conduct the test. This testing procedure allows general dependence between test statistics.

Given the -integration procedure, the first stage of the proposed method can be summarized as follows: It starts with an initial estimate , and then iterates between the following steps:

Algorithm 2. (IC estimation for mixture Gaussian graphical models)

  • (imputation) Impute the indicator variable drawn from the probability of multinomial distribution (5) for each .

  • (consistency) Based on the imputed values of ’s, update the estimate by

    • setting , , and
      ;

    • applying the -learning algorithm to learn an adjacency matrix for each cluster of the samples.

    • applying the -integration procedure to integrate the adjacency matrices learned in step (ii) into one.

    • applying the algorithm given in Hastie et al. (2009, page 634) to recover the covariance matrices for each cluster, given the common adjacency matrix learned in step (iii).

Let for . According to the theory of the IC algorithm (Liang et al., 2018), which holds for general conditions, the sequence forms two interleaved Markov chains, and a consistent estimate of can be obtained by averaging ’s. This is similar to the stochastic EM algorithm (Celeux and Diebolt, 1995; Nielsen, 2000). Further, by theory of the IC algorithm, a consistent estimate of the common adjacency matrix can be obtained by averaging the respective estimates along with iterations. More precisely, the adjacency matrix can be averaged in the following way (second stage of the proposed method). Define

where denotes the number of burn-in iterations of the IC algorithm, and then the final estimate of the adjacency matrix can be obtained by conducting another multiple hypothesis test for ’s. As before, under the null hypothesis : , follows the standard normal distribution.

Thus far, we have treated the number of clusters as known. In practice,

can be determined using an information criterion, such as AIC or BIC. Following Ruan et al. (2011), we define the degree of freedom for a model with

components as

(10)

where represents the dimension of the mean vector, and denotes the -th element of the estimated common adjacency matrix. Although we have assumed that the mixture Gaussian graphical model has a common adjacency matrix for all components, it can have a completely different concentration matrix for each component. Hence, for each component, we count each nonzero entry of the concentration matrix as a different parameter. The BIC score is then given by

(11)

where is the log-likelihood function given by equation (4), and can be determined by minimizing .

In (10), we did not count for the parameters . This is due to two reasons. First, the problem is considered under the high-dimensional scenario where is allowed to be greater than and grow with . However, is considered as fixed or to grow at a lower order of . Therefore, including or not in (10) will not affect much the performance of the criterion when becomes large. Second, we ignore in (10) to make the definition of the BIC score (11) consistent with the one used in Ruan et al. (2011), which facilitates comparisons.

3 Simulation Studies

We compare the performance of the proposed method with some methods developed for homogeneous data such as gLasso (Friedman et al.,2008), nodewise regression (Meinshausen and Bühlmann, 2006), and -learning (Liang et al., 2015), as well as the EM-regularization method developed by Ruan et al. (2011) for mixture Gaussian graphical models. As aforementioned, the method by Ruan et al. (2011) is different from the proposed one, as whose goal is to estimate an individual Gaussian graphical network for each cluster. Moreover, since Ruan et al. (2011) applied the gLasso algorithm to learn an individual Gaussian graphical network for each cluster, it will be very hard to integrate those networks into a common one.

3.1 Example 1

We began with the case where the number of clusters of the mixture model is known and the components are different in means. For this simulation study, we fix and the total number of samples , and varied the dimension between and . We set the component means as , and , where denotes a -dimensional vector of ones. We let all the three components share the same precision matrix :

(12)

and generated samples from each component of the mixture model. The samples from different components are combined and shuffled. Three different values of are considered, including , 0.3 and 0.5. Under each setting of and , 50 independent datasets were generated.

The proposed method was applied to this heterogeneous dataset. To initialize ’s and ’s, we randomly grouped the samples into three clusters and calculated their respective proportions and means. To initialize the covariance matrices, we first applied the -learning algorithm to the whole dataset to obtain a common adjacency matrix, and then applied the algorithm by Hastie et al. (2009, page 634) to estimate the covariance matrix for each cluster with the common adjacency matrix. The IC algorithm converges very fast, usually in about 10 iterations. For this example, the algorithm was run for 20 iterations for each dataset.

To access the performance of the proposed method, Figure 1

shows the precision-recall curves obtained for some datasets, where each plot was drawn based on the average of all 50 simulated datasets. The precision and recall are defined by

where , and denote true positives, false positives and false negatives, respectively, and they are defined via a binary decision table (see Table 1). In general, the method producing a larger area under the precision-recall curve is considered as a better method. The area under the precision-recall curve is often denoted by AUC (Area Under Curve) in the literature.

True Positive (TP) False Positive (FP)
False Negative (FN) True Negative (TN)
Table 1: Outcomes of binary decision.
(a) and
(b) and
(c) and
(d) and
(e) and
(f) and
Figure 1: Comparison of different methods for recovering underling networks for heterogeneous data with different cluster means: ‘gLasso’ refers to the graphical Lasso method, ‘NodeReg’ refers to the nodewise regression method, ‘Penalized’ refers to the EM-regularization method of Ruan et al (2011), -learning refers to the -learning methods, and ’IC’ refers to the proposed method. The plots (a) and (b) represent the scenario of homogeneous data.

For comparison, we have applied other methods, including the EM-regularization method of Ruan et al. (2011), -learning, gLasso and nodewise regression, to this example. As shown in Figure 1, under both scenarios with (representing homogeneous data) and , the proposed method outperforms all others. Moreover, when the value of increases, the performance of the - learning, gLasso and nodewise regression methods deteriorates as they are designed for homogeneous data. The EM-regularization method is robust to the value of , and tends to outperform gLasso, nodewise regression and -learning when becomes large. Note that the EM-regularization method produced a different network for each cluster and thus three precision-recall curves in total. Figure 1 showed only the best one, i.e., the curve with the largest value of AUC. Table 2

summarizes the areas under the precision-recall curves produced by different methods, where the average areas (over50 datasets) were reported with the standard deviations given in the corresponding parentheses. The comparison indicates that the proposed method significantly outperforms other methods for the heterogeneous data. For the homogeneous data (i.e.,

), the proposed method performs as well as the -learning method, while significantly outperforms others. This indicates generality of the proposed method, which can be applied to homogeneous data without significant harms. For the EM-regularization method, its poor performance for the homogeneous data may be due to two reasons. Firstly, the gLasso procedure employed there tends to perform less well than the -learning and nodewise regression methods as shown in Figure 1(a) & 1(b). Secondly, the EM-regularization method produced three different networks, which are not allowed to be integrated under its current procedure. For the purpose of comparison, we reported only the result for the network with the largest AUC area. However, this “best” network may still be worse than the properly integrated one for the homogeneous data.

m gLasso NodeReg -learning Penalized IC
0 0.696(0.002) 0.765(0.003) 0.859(0.003) 0.662(0.003) 0.888(0.004)
0.3 0.437(0.002) 0.453(0.003) 0.602(0.003) 0.634(0.003) 0.892(0.004)
0.5 0.084(0.001) 0.112(0.001) 0.095(0.003) 0.459(0.020) 0.876(0.004)
0 0.658(0.002) 0.731(0.002) 0.834(0.002) 0.654(0.002) 0.855(0.004)
0.3 0.402(0.002) 0.421(0.002) 0.585(0.002) 0.597(0.008) 0.857(0.004)
0.5 0.059(0.001) 0.084(0.001) 0.051(0.002) 0.439(0.015) 0.829(0.004)
Table 2: Average AUCs produced by different methods for the heterogeneous data with different cluster means.

In addition to underlying networks, we are interested in parameter estimation and cluster identification for the mixture Gaussian graphical model. To access the accuracy of parameter estimation, we adopt the criteria used by Ruan et al. (2011), which include the averaged spectral norm defined by

(13)

where

is the largest singular value of matrix

; the averaged Frobenius norm defined by

(15)

and the averaged Kullback-Leibler (KL) loss defined by

(16)

where

(17)

To assess the accuracy of cluster identification, we calculated the averaged false and negative selection rates over different clusters. Let denote the index set of observations for cluster , and let denote its estimate. Define

(18)

where denotes the set cardinality. The smaller the values of fsr and nsr are, the better the performance of the method is. The comparison was summarized in Table 3 where, for each setting of and , each method was evaluated based on50 datasets with the averaged evaluation results reported. The numbers in the parentheses represent the standard deviations of the corresponding averages. The comparison indicates that the proposed method significantly outperforms the other methods in both parameter estimation and cluster identification.

m SL FL KL fsr nsr
penalized 0 3.642(0.015) 22.309(0.018) 149.701(1.217)
0.3 3.618(0.008) 22.231(0.004) 149.058(0.569) 0.453(0.004) 0.393(0.005)
0.5 3.488(0.027) 22.414(0.056) 160.736(1.540) 0.014(0.002) 0.016(0.002)
IC 0 3.261(0.045) 11.222(0.072) 24.619(0.281)
0.3 2.984(0.037) 10.508(0.084) 21.439(0.251) 0.008(0.001) 0.008(0.001)
0.5 3.025(0.035) 10.635(0.081) 21.701(0.276) 0(0) 0(0)
penalized 0 3.644(0.009) 31.480(0.007) 296.131(1.483)
0.3 3.578(0.022) 31.529(0.056) 304.948(4.098) 0.512(0.004) 0.533(0.010)
0.5 3.143(0.033) 32.712(0.124) 388.672(5.041) 0.015(0.002) 0.021(0.007)
IC 0 3.437(0.042) 16.102(0.107) 51.161(0.413)
0.3 3.350(0.042) 15.800(0.039) 49.069(0.311) 0(0) 0(0)
0.5 2.732(0.036) 16.312(0.010) 50.177(0.246) 0(0) 0(0)
Table 3: Comparison of different methods in parameter estimation and cluster identification for the heterogeneous data with different cluster means.

3.2 Example 2

To make the problem harder, we consider the model for which each component has a different mean vector as well as a different concentration matrix, although the adjacency matrix is still the same for all components. As for Example 1, we fix and the total sample size , varied the dimension between and , and set the cluster mean vectors as , and , where denotes a -dimensional vector of ones. The common pattern of the concentration matrix is given by

(19)

for . We set , and for the three components, respectively. From each component, we generated 100 samples. Three different values of are considered, which are 0, 0.3 and 0.5. Under each setting of and , 50 independent datasets were generated.

Figure 2 shows the precision-recall curves produced gLasso, nodewise regression, -learning, EM-regularization, and the proposed method. It indicates that the proposed method outperforms others. The two plots in the first row of Figure 2 compares the performance of different methods when , which represents a very difficult scenario that each cluster is only slightly different in precision matrices and thus the samples will be extremely difficult to be clustered. However, the proposed method still outperform others under this scenario.

(a) and
(b) and
(c) and
(d) and
(e) and
(f) and
Figure 2: Comparison of different methods for recovering underlying networks for heterogeneous data with different cluster means as well as different cluster precision matrices: ‘gLasso’ refers to the graphical Lasso method, ‘NodeReg’ refers to the nodewise regression method, -learning refers to the -learning method, ’Penalized’ refers to the EM-regularization method, and ’IC’ refers to the proposed method.

Table 4 compares the areas under the precision-recall curves produced by different methods, and Table 5 compares the performance of different methods in parameter estimation and cluster identification. For each setting of and , each method was evaluated based on50 datasets the averaged evaluation results reported. The numbers in the parentheses of the two tables represent the standard deviations of the corresponding averages. The comparison indicates that the proposed method outperforms others in both parameter estimation and cluster identification.

m gLasso NodeReg -learning Penalized IC
0 0.653(0.002) 0.732(0.002) 0.888(0.003) 0.595(0.003) 0.927(0.003)
0.3 0.416(0.003) 0.429(0.003) 0.624(0.002) 0.571(0.003) 0.926(0.003)
0.5 0.162(0.001) 0.184(0.001) 0.434(0.005) 0.460(0.003) 0.914(0.003)
0 0.625(0.002) 0.711(0.002) 0.858(0.001) 0.573(0.002) 0.896(0.002)
0.3 0.388(0.002) 0.401(0.003) 0.615(0.002) 0.555(0.002) 0.898(0.003)
0.5 0.136(0.001) 0.161(0.001) 0.380(0.004) 0.358(0.018) 0.878(0.003)
Table 4: Comparison of average AUCs produced by different methods for the heterogeneous data with different cluster means as well as different cluster precision matrices.
m SL FL KL fsr nsr
penalized 3.745(0.019) 22.952(0.033) 258.397(3.042) 0.633(0.121) 0.651(0.106)
0.3 3.723(0.022) 22.902(0.038) 257.245(3.395) 0.174(0.009) 0.196(0.013)
0.5 3.749(0.019) 22.973(0.032) 257.954(2.943) 0.159(0.035) 0.177(0.100)
IC 0 3.453(0.049) 11.782(0.069) 42.363(0.369) 0.103(0.017) 0.094(0.011)
0.3 3.387(0.042) 11.572(0.060) 41.161(0.284) 0.010(0.003) 0.010(0.003)
0.5 3.292(0.041) 11.521(0.069) 42.098(0.419) 0(0) 0(0)
penalized 3.797(0.002) 32.411(0.004) 505.035(3.343) 0.597(0.231) 0.678(0.429)
0.3 3.740(0.020) 32.336(0.045) 510.729(4.079) 0.479(0.104) 0.345(0.085)
0.5 3.752(0.012) 32.333(0.034) 508.792(1.768) 0.267(0.009) 0.108(0.005)
IC 0 3.405(0.014) 17.045(0.080) 92.140(0.709) 0.212(0.010) 0.209(0.009)
0.3 3.533(0.043) 16.989(0.075) 91.503(0.746) 0.005(0.001) 0.004(0.001)
0.5 3.573(0.041) 17.194(0.090) 94.059(0.701) 0(0) 0(0)
Table 5: Comparison of different methods in parameter estimation and cluster identification for the heterogeneous data with different cluster means as well as different cluster precision matrices.

3.3 Identification of cluster numbers

When the number of clusters is unknown, we propose to determine its value according to the BIC criterion give in (11). In what follows, we illustrated the performance of the proposed method under this scenario using some simulated examples. We considered the cases with and and and 200. For each combination of , we simulated samples from each cluster with the same precision matrix as defined in (12). For the cluster means, we set and for , and set , and for .

Figure 3 compares the performance of the EM-regularization method and the proposed method in identification of cluster numbers. It indicates that for the simulated example, the proposed method was able to correctly identify the true value of according to the BIC criterion, while the EM-regularization method could not.

(a) and
(b) and
(c) and
(d) and
Figure 3: BIC scores produced by the EM-regularization (Penalized) method and the proposed method (IC) for different settings of .

4 A Real Data Example

Breast cancer is one of the most prevalent types of cancer which can be classified into four molecular subtypes, namely, luminal A, basal-like, HER2-enriched, and luminal B, based on their tumor expression profiles (Haque et al, 2012). In this study, we aim to construct a single gene regulatory network across the four subtypes to discover the overall gene regulation mechanism in breast cancer. The gene expression data for breast cancer are available at The Cancer Genome Atlas (TCGA), which contains 768 patients and 20502 genes. For each patient, some clinical information such as survival time, age, gender and tumor stages are also available, but the cancer subtypes are unknown. Since the data might be heterogeneous given the existence of breast cancer subtypes, the proposed method can be applied here. For this study, we are interested in learning a gene regulatory network related to the survival time of patients. For this reason, we first applied a marginal screening method to select the survival time-related genes. For each gene, we calculated its

-value using the marginal Cox regression after adjusting the effects of age, gender and tumor stages, and then selected 592 genes according to a multiple hypothesis test at a false discovery rate (FDR) level of 0.05. We used the empirical Bayes method of Liang and Zhang (2008) to conduct the test.

Figure 4: BIC scores produced by the proposed method for breast cancer data.

To determine the number of components for the mixture model, we calculated BIC scores for with the results shown in Figure 4. According to the BIC scores, we set . The resulting three clusters consist of 338, 191 and 238 patients, respectively. Figure 5(a) shows the Kaplan-Meier curves of the three clusters. A log-rank test for the three curves produced a -value of 3.89, which indicate that the patients in different clusters have different survival probabilities. Further, for each gene, we conducted a ANOVA test for its mean expression level across the three clusters. The resulting -values are shown in Figure 5(b), where most -values are very close to 0. This implies that the three clusters have different means and thus the data are heterogeneous. We note that the clustering results produced by the proposed method are biologically meaningful, which is likely due to the existence of hidden subtypes of breast cancer. As stated in Haque et al (2012), women with luminal A tumors had the longest survival time, women with HER2-enriched and luminal B tumors had a much shorter survival time, and women with basal-like tumors had an intermediate survival time with deaths occurring earlier than those with luminal A tumors.

(a) Kaplan-Meier curves for three patient groups.
(b) Histogram of log -values.
Figure 5: The left panel shows the Kaplan-Meier curves for three different patient groups, and the right penal shows the the histogram of the logarithm of -values obtained for each gene in the ANOVA test.

With being set to 3, the proposed method produced a gene regulatory network (shown in Figure 6), from which some hub genes can be identified. The hub genes refer to those with high connectivity in the gene regulatory network, and they tend to play important roles in gene regulation. To provide a stable way to identify hub genes, we consider a cross-validation like method. We divide the dataset into five subsets equally and then run the proposed method for five times, each applying to four of the five subsets only. In each run, we identified ten hub genes according to their connectivity. The results were summarized in Table 6, where the genes were ranked by their frequencies being selected as the hub gene among the 5 runs. The results indicate that the performance of the proposed method is quite stable: quite a few genes are frequently selected as the hub gene in different runs.

Rank Gene Freq Links Rank Gene Freq Links
1 LHFPL3 4 49.2 (9.6) 6 KRT12 3 13.4 (5.1)
2 SEPP1 4 8.4 (1.4) 7 FXYD1 2 5.4 (2.3)
3 MYH11 4 8.6 (1.4) 8 SCARA5 2 6.4 (1.6)
4 F13A1 3 12.2 (3.8) 9 CLEC3B 2 7.8 (2.8)
5 MAMDC2 3 5.4 (1.0) 10 LRRC70 2 5.8 (1.7)
Table 6: Top ten hub genes identified by the proposed method, where ‘Freq’ denotes the number of times that the gene was selected as a hub gene in the five subset runs, ‘Links’ denotes the average number of edges connected to the gene in the five networks with its standard deviation given in the parentheses, and the superscript * indicates that this gene has been verified in the literature to be related with breast cancer.

Our findings of hub genes are pretty consistent with the existing knowledge. Among the top 10 hub genes, 8 of them has been verified in the literature to be related with breast cancer. For example, LHFPL3, the gene has the most connectivities in the networks, is characteristic of primary glioblastoma which are important processes for cancer development and progression (Milinkovic et al., 2013). The gene SEPP1 is significantly associated with breast cancer risk among women (Mohammaddoust et al., 2018). The gene F13A1 is known as a thrombotic factor that plays a major role in tumor formation (Ahmadi et al., 2016). In the cancer coexpression network developed by Meng et al., (2016), they found that MAMDC2 plays a key role in the development of breast invasive ductal carcinoma. Our results also reveal some new findings, such as the gene MYH11. Li et al. (2016) reported that MYH11 plays a role in tumor formation by disturbing stem cell differentiation or affecting cellular energy balance and has been identified as a driver gene in human colorectal cancer, although few researches identify its function in breast cancer.

Figure 6: The gene regulatory network constructed by the proposed method for breast cancer.

5 Discussion

In this paper, we have proposed a new method for constructing gene regulatory networks for heterogeneous data, which is able to simultaneously cluster samples to difference groups and learn an integrated network across the groups. The proposed method was illustrated using some simulated examples and a real-world gene expression data example. The numerical results indicate that the proposed method significantly outperforms the existing ones, such as graphical Lasso, nodewise regression,

-learning, and EM-regularization. For the real-world gene expression data example, we conducted a detailed post-clustering analysis, which indicates the heterogeneity of the data and justifies the importance of the proposed method for real problems.

In addition to microarray gene expression data, the proposed method can be applied to next generation sequencing (NGS) data based on the transformations developed in Jia et al. (2017). To learn gene regulatory networks from NGS data, which are often assumed to follow a Poisson or negative binomial distribution, Jia et al. (2017) developed a random effect model-based transformation to continuize the NGS data. Further, the continuized data can be transformed to Gaussian using the nonparanormal transformation (Liu et al., 2012), and the proposed method can be applied then. We expect that the proposed method can also find wide applications in other scientific fields.

Acknowledgment

The authors thank the book editor Dr. Yichuan Zhao and two referees for their constructive comments which have led to significant improvement of this paper. Liang’s research was support in part by the grants DMS-1612924 and DMS/NIGMS R01-GM117597.

References

Ahmadi, M., Nasiri, M., and Ebrahimi, A. (2016). Thrombosis-Related Factors FV and F13A1 Mutations in Uterine Myomas. Zahedan Journal of Research in Medical Sciences, 18(10).

Benjamini, Y., Krieger, A. M., and Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3), 491-507.

Celeux, G., and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern recognition, 28(5), 781-793.

Danaher, P., Wang, P., and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B, 76(2), 373-397.

Dempster, A. P. (1972). Covariance selection. Biometrics, 157-175.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B, 1-38.

Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The annals of applied statistics, 3(2), 521.

Fan, J., Feng, Y., and Xia, L. (2015). A projection based conditional dependence measure with applications to high-dimensional undirected graphical models. arXiv preprint arXiv:1501.01617.

Fan, J. and Lv, J. (2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society, Series B, 70, 849-911.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear model with NP-dimensionality. Annals of Statistics, 38, 3567-3604.

Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432-441.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning (Second Edition). Springer-Verlag, 763 pages.

Jia, B., Tseng, G., and Liang, F. (2018). Fast Bayesian Integrative Analysis for Joint Estimation of Multiple Gaussian Graphical Models. Submitted to Journal of the American Statistical Association.

Jia, B., Xu, S., Xiao, G., Lamba, V., and Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics.

Haque, R., Ahmed, S. A., Inzhakova, G., Shi, J., Avila, C., Polikoff, J., … and Press, M. F. (2012). Impact of breast cancer subtypes and treatment on survival: an analysis spanning two decades. Cancer Epidemiology and Prevention Biomarkers, 21(10), 1848-1855.

Lee, S., Liang, F., Cai, L., and Xiao, G. (2018). A two-stage approach of gene network analysis for high-dimensional heterogeneous data. Biostatistics, 19(2), 216-232.

Li, Y., Tang, X. Q., Bai, Z., and Dai, X. (2016). Exploring the intrinsic differences among breast tumor subtypes defined using immunohistochemistry markers based on the decision tree.

Scientific reports, 6, 35773.

Liang, F., Jia, B., Xue, J., Li, Q., and Luo, Y. (2018). An imputation-consistency algorithm for high-dimensional missing data problems and beyond. arXiv preprint arXiv:1802.02251.

Liang, F., Song, Q. and Qiu, P. (2015). An Equivalent Measure of Partial Correlation Coefficients for High Dimensional Gaussian Graphical Models. Journal of the American Statistical Association, 110, 1248-1265.

Liang, F. and Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika, 95, 961-977.

Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4), 2293-2326.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34, 1436-1462.

Meng, L., Xu, Y., Xu, C., and Zhang, W. (2016). Biomarker discovery to improve prediction of breast cancer survival: using gene expression profiling, meta-analysis, and tissue validation. OncoTargets and therapy, 9, 6177.

Milinkovic, V., Bankovic, J., Rakic, M., Stankovic, T., Skender-Gazibara, M., Ruzdijic, S., and Tanic, N. (2013). Identification of novel genetic alterations in samples of malignant glioma patients. PLoS One, 8(12), e82108.

Mohammaddoust, S., Salehi, Z., and Saeidi Saedi, H. (2018). SEPP1 and SEP15 gene polymorphisms and susceptibility to breast cancer. British Journal of Biomedical Science, 1-11.

Nielsen, S. F. (2000). The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 6(3), 457-489.

Ruan, L., Yuan, M., and Zou, H. (2011). Regularized parameter estimation in high-dimensional gaussian mixture models.

Neural computation, 23(6), 1605-1622.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 479-498.

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., Jr Williams, R. M., (1949), The American Soldier, Vol. 1: Adjustment During Army Life. Princeton, NJ: Princeton University Press.

Tibshirani, R. (1996). Regression analysis and selection via the Lasso.

Journal of the Royal Statistical Society, Series B, 58, 267-288.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19-35.