1 Introduction
The emergence of highthroughput technologies has made it feasible to measure the activities of thousands of genes simultaneously, which provides scientists with a major opportunity to infer gene regulatory networks. Accurate inference of gene regulatory networks is pivotal to gaining a systematic understanding of the molecular mechanism, to shedding light on the mechanism of diseases that occur when cellular processes are dysregulated, and to identifying potential therapeutic targets for the diseases. Given the high dimensionality and complexity of highthroughput data, inference of gene regulatory networks largely relies on the advance of statistical modeling and computation. The Gaussian graphical model is a promising tool to achieve this challenge.
The Gaussian graphical model uses a network to depict conditional independence relationships for a large set of Gaussian random variables, where the absence of an edge between two variables indicates independence of the two variables conditioned on all other variables. In the literature, a variety of methods have been proposed to learn Gaussian graphical networks. To name a few, they include covariance selection (Dempster, 1972), nodewise regression (Meinshausen and Bühlmann, 2006), graphical Lasso (Yuan and Lin, 2007; Friedman et al., 2008), adaptive graphical Lasso (Fan et al., 2009), the projected covariance matrix method (Fan et al., 2015), and ψ-learning (Liang et al., 2015). In general, these methods assume that the data are homogeneous, i.e., all samples are drawn from a single Gaussian distribution. However, in practice, we often have heterogeneous data, i.e., the samples are drawn from a mixture Gaussian distribution, while a single Gaussian graphical network still needs to be learned for all the samples in a fashion of data integration. Here are some examples:
Data with hidden biological/clinical subtypes. It is known that complex diseases such as cancer can have significant heterogeneity in response to treatments, and this heterogeneity is often reflected in gene expression. For example, gene expression patterns can vary with the subtypes of a cancer. Since, for many types of cancer, the definition of subtypes is still unclear and the number of samples from each subtype can be very small, it is impractical to construct an individual gene regulatory network for each subtype. In this case, we might still be interested in constructing a single gene regulatory network for the heterogeneous data in a fashion of data integration. Such an integrated gene regulatory network can help us identify fundamental patterns common to the development and progression of the disease.

Data with hidden confounding factors. In real-world applications, the gene expression data may contain some systematic differences caused by known or unknown confounding factors, such as study cohorts, sample collection, experimental batches, etc. Due to the limited number of samples from each level of the confounding factors, we also prefer to learn a single gene regulatory network for the heterogeneous data in a fashion of data integration. Moreover, for many problems, the confounding factors can be unknown.
In this paper, we develop a mixture model method to learn Gaussian graphical networks for heterogeneous data with hidden clusters. The new method is developed based on the imputation-consistency (IC) algorithm proposed by Liang et al. (2018) and the ψ-learning algorithm proposed by Liang et al. (2015). The IC algorithm is a general algorithm for dealing with high-dimensional missing data problems. Like the EM algorithm (Dempster et al., 1977), the IC algorithm works in an iterative manner, iterating between an I-step and a C-step. The I-step imputes the missing data conditioned on the observed data and the current estimate of the parameters, and the C-step finds a "consistent" estimator for the minimizer of a Kullback-Leibler divergence defined on the pseudo-complete data. For high-dimensional problems, the "consistent" estimate can be found with sparsity constraints or screened data. Refer to Fan and Lv (2008) and Fan and Song (2010) for variable screening methods. Under quite general conditions, Liang et al. (2018) showed that the average of the "consistent" estimators across iterations is consistent for the true parameters. The ψ-learning algorithm was originally designed for learning Gaussian graphical models for homogeneous data. The proposed method can be viewed as a combination of the IC algorithm and the ψ-learning algorithm, which simultaneously clusters samples into different groups and learns an integrated network across all the groups. When applying the IC algorithm to cluster samples, their cluster membership is treated as missing data. We note that the proposed mixture model method is different from the methods for joint estimation of multiple Gaussian graphical models, such as fused Lasso (Danaher et al., 2014), Bayesian nodewise regression (Lin et al., 2017), and Bayesian integrated learning (Jia et al., 2018). For the latter methods, the samples' cluster membership is known a priori and the goal is to learn an individual network for each cluster of samples. In contrast, the proposed method works for the case where the cluster membership is unknown and the goal is to learn an integrated network across all hidden groups. The proposed method is also different from the methods proposed by Ruan et al. (2011) and Lee et al. (2018). For the former, the goal is to learn an individual network for each cluster of samples, although it assumes that the cluster membership is unknown. The latter first groups samples into different clusters using an eigen-analysis based approach and then applies the ψ-learning algorithm to learn the network structure. Since that method does not account for the uncertainty of sample clustering, it often performs less well.
The rest of this paper is organized as follows. In Section 2, we describe the proposed method. In Section 3, we illustrate the performance of the proposed method using simulated examples. In Section 4, we apply the proposed method to learn a gene regulatory network for breast cancer from a heterogeneous gene expression dataset. In Section 5, we conclude the paper with a brief discussion.
2 Mixture Gaussian Graphical Models
2.1 Algorithms for homogeneous data
To better describe the proposed method, we first briefly review the existing Gaussian graphical model algorithms for homogeneous data.
Let X = (X₁, …, X_p) denote a set of p Gaussian random variables, where x_i^(k) denotes the kth observation of variable X_i. In the context of gene regulatory networks, x_i^(k) refers to the expression level of gene i measured in experiment k. Let x^(k) = (x_1^(k), …, x_p^(k)) denote the expression levels of all genes measured in experiment k, which is assumed to follow a Gaussian distribution N_p(μ, Σ) with mean vector μ and covariance matrix Σ. Let A = (a_ij) denote the adjacency matrix, where a_ij = 1 if the edge (i, j) is present and 0 otherwise. The adjacency matrix specifies the structure of the Gaussian graphical network. Let ρ^ij denote the partial correlation coefficient of variable X_i and variable X_j conditioned on all other variables. Let C = Σ⁻¹ = (c_ij) denote the concentration matrix, also known as the precision matrix. Let β_ij's denote the coefficients of the regressions

(1)  X_i = Σ_{j≠i} β_ij X_j + ε_i,  i = 1, …, p,

where ε_i is a zero-mean Gaussian random variable. Since ρ^ij can be expressed as ρ^ij = -c_ij/√(c_ii c_jj), and the β_ij's can be expressed as β_ij = -c_ij/c_ii and β_ji = -c_ij/c_jj, the following relationship holds:

(2)  ρ^ij = sign(β_ij) √(β_ij β_ji).
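The identity linking the concentration matrix to the partial correlations, ρ^ij = -c_ij/√(c_ii c_jj), is easy to check numerically. A minimal sketch in Python (the matrix values are illustrative, not from the paper):

```python
import numpy as np

# A small concentration (precision) matrix C; the values are illustrative.
C = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.0, -0.5],
              [ 0.0, -0.5,  2.0]])

def partial_corr_from_precision(C):
    """Off-diagonal entries give rho_ij = -C_ij / sqrt(C_ii * C_jj),
    the partial correlation of X_i and X_j given all other variables."""
    d = np.sqrt(np.diag(C))
    return -C / np.outer(d, d)

R = partial_corr_from_precision(C)
# A zero off-diagonal entry of C corresponds to a zero partial correlation,
# i.e. a missing edge in the Gaussian graphical network.
```

A zero at c_13 thus translates directly into conditional independence of X₁ and X₃ given X₂, which is exactly the edge-absence criterion that covariance selection exploits.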
Based on the relation between the partial correlation coefficients and the concentration matrix, Dempster (1972) proposed the covariance selection method, which identifies the edges of the Gaussian graphical network by identifying the nonzero elements of the concentration matrix. However, this method cannot be applied to problems with p > n, where the sample covariance matrix is singular and thus the concentration matrix cannot be calculated. To tackle this difficulty, Yuan and Lin (2007) proposed to estimate the concentration matrix with ℓ₁ regularization. Soon, this method was accelerated by Friedman et al. (2008) using the coordinate descent algorithm in a similar way to Lasso regression (Tibshirani, 1996), which leads to the so-called graphical Lasso algorithm. Based on the relation between partial correlation coefficients and regression coefficients, Meinshausen and Bühlmann (2006) proposed the nodewise regression method, which learns Gaussian graphical networks by identifying the nonzero regression coefficients of the regressions given in (1) under a sparsity constraint. As an alternative to estimating the concentration matrix or regression coefficients, the ψ-learning algorithm (Liang et al., 2015) provides an equivalent measure ψ_ij of the partial correlation coefficient in the sense that
(3)  ψ_ij = 0 ⟺ ρ^ij = 0,
where ψ_ij is the partial correlation coefficient of variable X_i and variable X_j conditioned on a subset S_ij of the remaining variables, and the subset is obtained via correlation screening. Since the ψ-learning algorithm is used as a component of the proposed mixture model method for learning Gaussian graphical models with grouped samples, the details of the algorithm are given below.
Algorithm 1. (ψ-learning)

(Correlation screening) Determine the reduced neighborhood for each variable X_i:

Conduct a multiple hypothesis test to identify the pairs of variables for which the empirical correlation coefficient is significantly different from zero. This step results in a so-called empirical correlation network.

For each variable X_i, identify its neighborhood in the empirical correlation network, and reduce the size of the neighborhood by removing the variables having lower correlation (in absolute value) with X_i. This step results in a so-called reduced correlation network.


(ψ-calculation) For each pair of variables X_i and X_j, identify a subset of nodes S_ij based on the reduced correlation network resulting from step (a) and calculate ψ_ij, where ψ_ij denotes the partial correlation coefficient of X_i and X_j conditioned on the variables in S_ij.
In this paper, we set S_ij = N_i ∖ {X_j} if |N_i| ≤ |N_j| and S_ij = N_j ∖ {X_i} otherwise, where N_i denotes the neighborhood of node X_i in the reduced correlation network, and |·| denotes the cardinality of a set.

(ψ-screening) Conduct a multiple hypothesis test to identify the pairs of vertices for which ψ_ij is significantly different from zero, and set the corresponding elements of the adjacency matrix to 1.
The multiple hypothesis tests involved in the algorithm can be done using the empirical Bayesian method developed in Liang and Zhang (2008), which allows for general dependence between the test statistics. Other multiple hypothesis testing procedures that account for the dependence between test statistics, e.g., the two-stage procedure of Benjamini et al. (2006), can also be applied here. The correlation screening step involves two procedures, (i) a multiple hypothesis test and (ii) sure independence screening, to control the neighborhood size of each variable. The two procedures seem redundant, but actually they are not. Indeed, the multiple hypothesis test is able to identify the pairs of independent variables, but the size of each neighborhood cannot be guaranteed to be less than the required bound, as established in Liang et al. (2015). We have tried using the sure independence screening procedure only, which results in the same neighborhood size for each variable. However, in this case, the enlarged neighborhood may contain some variables that are independent of the central one, and thus the power of the subsequent ψ-screening test will be reduced. The ψ-learning algorithm has two free parameters, namely α₁ and α₂, which refer to the significance levels used in correlation screening and ψ-screening, respectively. Following the suggestion of Liang et al. (2015), we specify their values in terms of q-values (Storey, 2002); in particular, a smaller value of α₂ was used for the simulated examples and a value of 0.1 for the real data example, since a large value of α₂ avoids losing potential interactions between different genes.
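The screening logic of Algorithm 1 can be sketched in code. The sketch below replaces the empirical-Bayes multiple hypothesis tests with plain correlation thresholds and uses a simple union rule for the conditioning set, so it illustrates the structure of the method rather than the actual ψ-learning implementation; the thresholds and the union rule are assumptions made here for clarity.

```python
import numpy as np

def partial_corr(X, i, j, S):
    """Partial correlation of columns i and j of X given the columns in S,
    computed from the inverse covariance of the involved variables."""
    idx = [i, j] + sorted(S)
    P = np.linalg.inv(np.cov(X[:, idx], rowvar=False))
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def psi_learning_sketch(X, corr_thresh=0.2, psi_thresh=0.2):
    """Simplified sketch of Algorithm 1: (a) correlation screening with
    neighborhood reduction, (b) psi-calculation, (c) psi-screening.
    Plain thresholds stand in for the multiple hypothesis tests."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    max_nbrs = max(1, int(n / np.log(n)))        # cap on neighborhood size
    nbrs = []
    for i in range(p):                           # (a) correlation screening
        cand = [j for j in range(p) if j != i and abs(R[i, j]) > corr_thresh]
        cand.sort(key=lambda j: -abs(R[i, j]))   # keep strongest correlations
        nbrs.append(set(cand[:max_nbrs]))
    A = np.zeros((p, p), dtype=int)
    for i in range(p):
        for j in range(i + 1, p):                # (b) psi-calculation
            S = (nbrs[i] | nbrs[j]) - {i, j}     # union rule (an assumption)
            psi = partial_corr(X, i, j, S)
            if abs(psi) > psi_thresh:            # (c) psi-screening
                A[i, j] = A[j, i] = 1
    return A

# Demo on a 3-variable chain X0 -> X1 -> X2: the true network has edges
# (0,1) and (1,2) but no edge (0,2), even though X0 and X2 are
# marginally correlated.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.standard_normal(n)
x1 = 0.7 * x0 + rng.standard_normal(n)
x2 = 0.7 * x1 + rng.standard_normal(n)
A = psi_learning_sketch(np.column_stack([x0, x1, x2]))
```

On the chain example, the marginal correlation between X0 and X2 passes the screening threshold, but the partial correlation given X1 is near zero, so the spurious edge is removed in the ψ-screening step.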
Under mild conditions, e.g., the joint Gaussian distribution of X satisfies the faithfulness condition, Liang et al. (2015) showed that the ψ partial correlation coefficient is equivalent to the true partial correlation coefficient in determining the structure of Gaussian graphical models in the sense of (3). Compared to other Gaussian graphical model algorithms, the ψ-learning algorithm has a significant advantage: it reduces the computation of partial correlation coefficients from a high-dimensional problem to a low-dimensional one via correlation screening, and thus can be used for very high-dimensional problems. As shown in Liang et al. (2015), the ψ-learning algorithm is consistent; the resulting network converges to the true one in probability as the sample size becomes large. The ψ-learning algorithm tends to produce better numerical performance and costs less CPU time than existing algorithms, such as gLasso and nodewise regression, especially when p is large.

2.2 The Mixture Gaussian Graphical Model Method
Let x^(1), …, x^(n) denote a set of n independent samples drawn from a mixture Gaussian distribution with K components, where the sample size n can be much smaller than the dimension p. Suppose for now that K is known; later, we will describe a Bayesian information criterion (BIC) for determining the value of K. The log-likelihood function of the samples is given by
(4)  ℓ(θ) = Σ_{k=1}^n log ( Σ_{j=1}^K π_j φ(x^(k); μ_j, Σ_j) ),
where θ = {(π_j, μ_j, Σ_j) : j = 1, …, K} denotes the collection of unknown parameters, the π_j's are the mixture proportions, and the μ_j's and Σ_j's are the mean vectors and covariance matrices of the Gaussian components, respectively; φ(·; μ, Σ) denotes the density function of the multivariate Gaussian distribution. Let z^(k) = (z_1^(k), …, z_K^(k)) denote the indicator variable for the component/cluster membership of sample k, for k = 1, …, n. That is, z_j^(k) = 1 and z_l^(k) = 0 for l ≠ j if sample k belongs to component j. Henceforth, we will use "cluster" to denote the group of samples assigned to a component of the mixture Gaussian graphical model; cluster and component are used interchangeably in this paper.
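Heterogeneous data of this kind are easy to simulate, which also makes the role of the hidden indicators z^(k) concrete. A small sketch (all sizes and parameter values are illustrative):

```python
import numpy as np

# Draw n samples from a K-component Gaussian mixture, keeping the
# (normally hidden) membership indicators z for illustration.
rng = np.random.default_rng(1)
K, n, p = 3, 300, 5                                  # illustrative sizes
pis = np.array([0.3, 0.3, 0.4])                      # mixture proportions
mus = np.array([[-2.0] * p, [0.0] * p, [2.0] * p])   # component means
z = rng.choice(K, size=n, p=pis)                     # hidden membership
X = mus[z] + rng.standard_normal((n, p))             # identity covariance here
# In the estimation problem, only X is observed; z must be imputed.
```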
If the sample size n is greater than the dimension p, then the parameters can be estimated using the EM algorithm as described in what follows. Let π_j^(t), μ_j^(t) and Σ_j^(t) denote, respectively, the estimates of π_j, μ_j and Σ_j obtained at iteration t, and let θ^(t) denote their collection. The E-step calculates the conditional expectation of z^(k) given x^(k) and the current estimate of θ, i.e.,
(5)  τ_kj^(t) = E(z_j^(k) | x^(k), θ^(t)) = π_j^(t) φ(x^(k); μ_j^(t), Σ_j^(t)) / Σ_{l=1}^K π_l^(t) φ(x^(k); μ_l^(t), Σ_l^(t)),
which leads to the so-called Q-function,
(6)  Q(θ | θ^(t)) = Σ_{k=1}^n Σ_{j=1}^K τ_kj^(t) [ log π_j + log φ(x^(k); μ_j, Σ_j) ].
The M-step updates θ by maximizing the Q-function, which can be done by maximizing Q with respect to (π_j, μ_j, Σ_j) for each j. For each value of j, the estimates can be updated by setting
(7)  π_j^(t+1) = (1/n) Σ_{k=1}^n τ_kj^(t),   μ_j^(t+1) = Σ_{k=1}^n τ_kj^(t) x^(k) / Σ_{k=1}^n τ_kj^(t),   Σ_j^(t+1) = Σ_{k=1}^n τ_kj^(t) (x^(k) - μ_j^(t+1))(x^(k) - μ_j^(t+1))ᵀ / Σ_{k=1}^n τ_kj^(t).
However, this algorithm does not work when p > n, as the estimated covariance matrices Σ_j^(t) will be singular in this case.
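For the n > p regime, the E-step (5) and M-step (7) can be sketched directly; the following is a minimal numpy version for illustration, not the paper's implementation.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log-density of a multivariate Gaussian N(mu, Sigma)."""
    p = len(mu)
    d = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, d.T)                  # L^{-1} d^T
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2 * np.pi) + logdet + np.sum(sol ** 2, axis=0))

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture (the n > p case)."""
    n, p = X.shape
    K = len(pis)
    # E-step (5): posterior membership probabilities tau
    logw = np.array([np.log(pis[k]) + log_gauss(X, mus[k], Sigmas[k])
                     for k in range(K)]).T          # n x K
    logw -= logw.max(axis=1, keepdims=True)         # numerical stability
    tau = np.exp(logw)
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step (7): weighted updates of proportions, means, covariances
    nk = tau.sum(axis=0)
    pis = nk / n
    mus = (tau.T @ X) / nk[:, None]
    Sigmas = []
    for k in range(K):
        d = X - mus[k]
        Sigmas.append((tau[:, k, None] * d).T @ d / nk[k])
    return pis, mus, np.array(Sigmas), tau

# Demo: two well-separated clusters; a few EM iterations recover them.
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((150, 2)) - 3.0,
               rng.standard_normal((150, 2)) + 3.0])
pis = np.array([0.5, 0.5])
mus = np.array([[-1.0, -1.0], [1.0, 1.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
for _ in range(20):
    pis, mus, Sigmas, tau = em_step(X, pis, mus, Sigmas)
```

Each call performs one full EM iteration; iterating until the log-likelihood stabilizes gives the usual Gaussian mixture fit. When p > n the covariance updates become singular, which is exactly the failure mode the proposed algorithm is designed to avoid.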
When p > n, to avoid the issues caused by the singularity of the estimated covariance matrices, we propose the following algorithm. For the proposed algorithm, we assume that all components of the mixture Gaussian graphical model share a common adjacency matrix, although their covariance and precision matrices can differ from each other. The new algorithm consists of two stages. The first stage applies the imputation-consistency (IC) algorithm to generate a series of estimates of the common adjacency matrix, and the second stage averages the estimates to get a stable estimate of the common adjacency matrix. Note that, as can be seen below, the IC algorithm generates a Markov chain.
To learn the common adjacency matrix at each iteration, an integration procedure is needed, which integrates the adjacency matrices learned for the individual components into one adjacency matrix. This procedure can be described as follows. Let ψ_ij,k^(t) denote the ψ partial correlation coefficient calculated for the kth cluster at iteration t, which can be transformed to a z-score via Fisher's transformation:
(8)  z_ij,k^(t) = (1/2) √(n_k^(t) - |S_ij,k| - 3) · log( (1 + ψ_ij,k^(t)) / (1 - ψ_ij,k^(t)) ),
where S_ij,k denotes the conditioning set used in calculating ψ_ij,k^(t), and n_k^(t) is the number of samples assigned to cluster k at iteration t. For convenience, we call this score a ψ-score. The ψ-scores from different clusters can be combined using Stouffer's meta-analysis method (Stouffer et al., 1949) by setting
(9)  Z_ij^(t) = Σ_{k=1}^K w_k^(t) z_ij,k^(t) / √( Σ_{k=1}^K (w_k^(t))² ),
where w_k^(t) is a nonnegative weight assigned to cluster k at iteration t. In this paper, we set the weights according to the cluster sample sizes. Note that Z_ij^(t) approximately follows a standard normal distribution under the null hypothesis that the edge (i, j) is absent. Then a multiple hypothesis test can be conducted on the Z_ij^(t)'s to identify the pairs of nodes for which Z_ij^(t) is differentially distributed from the standard normal N(0, 1), and the adjacency matrix common to all components of the mixture model can be determined thereby. In this paper, we adopted the multiple hypothesis testing procedure of Liang and Zhang (2008) to conduct the test; this procedure allows general dependence between the test statistics. Given the integration procedure, the first stage of the proposed method can be summarized as follows: it starts with an initial estimate θ^(0) and then iterates between the following steps:
Algorithm 2. (IC estimation for mixture Gaussian graphical models)

(Imputation) For each sample k, impute the indicator variable z^(k) by drawing from the multinomial distribution with probabilities given in (5).

(Consistency) Based on the imputed values of the z^(k)'s, update the estimate θ^(t+1) by

(i) setting each π_j^(t+1) to the proportion of samples imputed into cluster j, and each μ_j^(t+1) to the sample mean of those samples;
(ii) applying the ψ-learning algorithm to learn an adjacency matrix for each cluster of samples;

(iii) applying the integration procedure to integrate the adjacency matrices learned in step (ii) into one;

(iv) applying the algorithm given in Hastie et al. (2009, page 634) to recover the covariance matrix of each cluster, given the common adjacency matrix learned in step (iii).

Let θ^(t) denote the parameter estimate at iteration t, for t = 1, 2, …. According to the theory of the IC algorithm (Liang et al., 2018), which holds under quite general conditions, the sequence {(θ^(t), z^(t))} forms two interleaved Markov chains, and a consistent estimate of θ can be obtained by averaging the θ^(t)'s. This is similar to the stochastic EM algorithm (Celeux and Diebolt, 1995; Nielsen, 2000). Further, by the theory of the IC algorithm, a consistent estimate of the common adjacency matrix can be obtained by averaging the respective estimates across iterations. More precisely, the adjacency matrix can be averaged in the following way (the second stage of the proposed method). Define

Z̄_ij = (1 / √(T - t₀)) Σ_{t = t₀+1}^T Z_ij^(t),

where t₀ denotes the number of burn-in iterations of the IC algorithm and T the total number of iterations; the final estimate of the adjacency matrix can then be obtained by conducting another multiple hypothesis test on the Z̄_ij's. As before, under the null hypothesis H₀: ρ^ij = 0, Z̄_ij follows the standard normal distribution.
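The transformation, combination, and averaging steps can be sketched directly from (8) and (9). In the averaging helper below, dividing by √T keeps the combined score approximately N(0,1) under the null, which is one natural reading of the averaging scheme; the choice of weights passed to Stouffer's rule is left to the caller.

```python
import numpy as np

def fisher_z(psi, n_k, s_size):
    """Eq (8): Fisher's transformation of a psi partial correlation; the
    sqrt factor makes the score approximately N(0,1) under the null."""
    return 0.5 * np.sqrt(n_k - s_size - 3) * np.log((1 + psi) / (1 - psi))

def stouffer(zs, ws):
    """Eq (9): Stouffer's meta-analysis combination of per-cluster z-scores
    with nonnegative weights ws."""
    zs, ws = np.asarray(zs, float), np.asarray(ws, float)
    return np.sum(ws * zs) / np.sqrt(np.sum(ws ** 2))

def average_scores(z_by_iter):
    """Average z-scores over the post burn-in iterations, scaled by
    1/sqrt(T) so the result stays approximately N(0,1) under the null
    (one plausible reading of the averaging formula)."""
    z = np.asarray(z_by_iter, float)
    return z.sum(axis=0) / np.sqrt(len(z))
```

With equal weights, Stouffer's rule reduces to the sum of the z-scores divided by √K, so evidence for an edge accumulates across clusters while the null distribution remains standard normal.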
Thus far, we have treated the number of clusters K as known. In practice, K can be determined using an information criterion, such as AIC or BIC. Following Ruan et al. (2011), we define the degrees of freedom of a model with K components as

(10)  d_K = K p + K ( p + Σ_{i<j} â_ij ),
where p represents the dimension of the mean vector, and â_ij denotes the (i, j)th element of the estimated common adjacency matrix. Although we have assumed that the mixture Gaussian graphical model has a common adjacency matrix for all components, each component can have a completely different concentration matrix. Hence, for each component, we count each nonzero entry of the concentration matrix as a separate parameter. The BIC score is then given by
(11)  BIC(K) = -2 ℓ(θ̂) + log(n) · d_K,
where ℓ(θ̂) is the log-likelihood given by equation (4), and K can be determined by minimizing BIC(K).
In (10), we did not count the mixture proportions π_j as parameters, for two reasons. First, the problem is considered under the high-dimensional scenario, where p is allowed to be greater than n and to grow with n, whereas K is considered fixed or growing at a much lower rate. Therefore, counting the π_j's in (10) or not will not much affect the performance of the criterion when p becomes large. Second, we ignore the π_j's in (10) to make the definition of the BIC score (11) consistent with the one used in Ruan et al. (2011), which facilitates comparisons.
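Under the reading of (10)-(11) above (K mean vectors plus, per component, the diagonal and the common edges of the precision matrix, with the proportions deliberately uncounted), the criterion is straightforward to compute; a sketch:

```python
import numpy as np

def mixture_df(K, p, A):
    """A plausible reading of eq (10): K*p mean parameters plus, for each
    of the K components, p diagonal entries and one parameter per edge of
    the common adjacency matrix A. Mixture proportions are not counted."""
    n_edges = int(np.sum(np.triu(A, k=1)))
    return K * p + K * (p + n_edges)

def bic_score(loglik, K, p, A, n):
    """Eq (11): BIC(K) = -2 * log-likelihood + log(n) * df; choose the K
    that minimizes this score."""
    return -2.0 * loglik + np.log(n) * mixture_df(K, p, A)
```

In practice one would fit the model for each candidate K, evaluate (4) at the fitted parameters, and pick the K with the smallest score.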
3 Simulation Studies
We compare the performance of the proposed method with some methods developed for homogeneous data, including gLasso (Friedman et al., 2008), nodewise regression (Meinshausen and Bühlmann, 2006), and ψ-learning (Liang et al., 2015), as well as with the EM-regularization method developed by Ruan et al. (2011) for mixture Gaussian graphical models. As aforementioned, the method of Ruan et al. (2011) is different from the proposed one, as its goal is to estimate an individual Gaussian graphical network for each cluster. Moreover, since Ruan et al. (2011) applied the gLasso algorithm to learn an individual Gaussian graphical network for each cluster, it is very hard to integrate those networks into a common one.
3.1 Example 1
We began with the case where the number of clusters of the mixture model is known and the components differ in their means. For this simulation study, we fixed the number of clusters at three, fixed the total number of samples, and varied the dimension p. The three component means were set to differ by multiples of m·1_p, where 1_p denotes a p-dimensional vector of ones, so that m = 0 corresponds to homogeneous data. We let all three components share the same precision matrix:
(12) 
and generated an equal number of samples from each component of the mixture model. The samples from different components were combined and shuffled. Three different values of m were considered: 0, 0.3 and 0.5. Under each setting of the dimension and m, 50 independent datasets were generated.
The proposed method was applied to this heterogeneous dataset. To initialize the π_j's and μ_j's, we randomly grouped the samples into three clusters and calculated their respective proportions and means. To initialize the covariance matrices, we first applied the ψ-learning algorithm to the whole dataset to obtain a common adjacency matrix, and then applied the algorithm of Hastie et al. (2009, page 634) to estimate the covariance matrix of each cluster given the common adjacency matrix. The IC algorithm converges very fast, usually in about 10 iterations. For this example, the algorithm was run for 20 iterations on each dataset.
To assess the performance of the proposed method, Figure 1 shows the precision-recall curves obtained for some datasets, where each plot was drawn based on the average over all 50 simulated datasets. The precision and recall are defined by

precision = TP / (TP + FP),   recall = TP / (TP + FN),

where TP, FP and FN denote true positives, false positives and false negatives, respectively, and they are defined via a binary decision table (see Table 1). In general, the method producing a larger area under the precision-recall curve is considered the better method. The area under the precision-recall curve is often denoted by AUC (Area Under Curve) in the literature.
                     Edge present (truth)   Edge absent (truth)
Edge detected        True Positive (TP)     False Positive (FP)
Edge not detected    False Negative (FN)    True Negative (TN)
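Given a true and an estimated adjacency matrix, the entries of Table 1 and the resulting precision and recall can be computed as follows (edges counted once via the upper triangle):

```python
import numpy as np

def precision_recall(A_true, A_hat):
    """Precision and recall of an estimated adjacency matrix against the
    truth, using the binary decision table of Table 1."""
    iu = np.triu_indices_from(A_true, k=1)       # each edge counted once
    t = A_true[iu].astype(bool)
    h = A_hat[iu].astype(bool)
    tp = np.sum(t & h)                           # edges correctly detected
    fp = np.sum(~t & h)                          # spurious edges
    fn = np.sum(t & ~h)                          # missed edges
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Sweeping the edge-detection threshold of a method and recording (precision, recall) at each threshold traces out the precision-recall curve whose area gives the AUC.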
For comparison, we applied the other methods, including the EM-regularization method of Ruan et al. (2011), ψ-learning, gLasso and nodewise regression, to this example. As shown in Figure 1, under both the scenario with m = 0 (representing homogeneous data) and the scenarios with m > 0, the proposed method outperforms all the others. Moreover, as the value of m increases, the performance of the ψ-learning, gLasso and nodewise regression methods deteriorates, as they are designed for homogeneous data. The EM-regularization method is robust to the value of m, and tends to outperform gLasso, nodewise regression and ψ-learning when m becomes large. Note that the EM-regularization method produced a different network for each cluster and thus three precision-recall curves in total; Figure 1 shows only the best one, i.e., the curve with the largest value of AUC. Table 2 summarizes the areas under the precision-recall curves produced by the different methods, where the average areas (over 50 datasets) are reported with the standard deviations given in parentheses. The comparison indicates that the proposed method significantly outperforms the other methods on the heterogeneous data. On the homogeneous data (i.e., m = 0), the proposed method performs as well as the ψ-learning method, while significantly outperforming the others. This indicates the generality of the proposed method, which can be applied to homogeneous data without significant harm. For the EM-regularization method, its poor performance on the homogeneous data may be due to two reasons. First, the gLasso procedure employed there tends to perform less well than the ψ-learning and nodewise regression methods, as shown in Figure 1(a) & 1(b). Second, the EM-regularization method produced three different networks, which its current procedure does not allow to be integrated. For the purpose of comparison, we reported only the result for the network with the largest AUC. However, this "best" network may still be worse than a properly integrated one for the homogeneous data.

m  gLasso  NodeReg  ψ-learning  Penalized  IC

0  0.696(0.002)  0.765(0.003)  0.859(0.003)  0.662(0.003)  0.888(0.004)  
0.3  0.437(0.002)  0.453(0.003)  0.602(0.003)  0.634(0.003)  0.892(0.004)  
0.5  0.084(0.001)  0.112(0.001)  0.095(0.003)  0.459(0.020)  0.876(0.004)  
0  0.658(0.002)  0.731(0.002)  0.834(0.002)  0.654(0.002)  0.855(0.004)  
0.3  0.402(0.002)  0.421(0.002)  0.585(0.002)  0.597(0.008)  0.857(0.004)  
0.5  0.059(0.001)  0.084(0.001)  0.051(0.002)  0.439(0.015)  0.829(0.004) 
In addition to the underlying networks, we are interested in parameter estimation and cluster identification for the mixture Gaussian graphical model. To assess the accuracy of parameter estimation, we adopt the criteria used by Ruan et al. (2011), which include the averaged spectral norm defined by
(13)  SL = (1/K) Σ_{k=1}^K || Σ̂_k - Σ_k ||₂,

where ||M||₂ is the largest singular value of a matrix M; the averaged Frobenius norm defined by

(15)  FL = (1/K) Σ_{k=1}^K || Σ̂_k - Σ_k ||_F;

and the averaged Kullback-Leibler (KL) loss defined by

(16)  KL = (1/K) Σ_{k=1}^K KL(Σ_k, Σ̂_k),

where

(17)  KL(Σ, Σ̂) = tr(Σ̂⁻¹ Σ) - log |Σ̂⁻¹ Σ| - p.
To assess the accuracy of cluster identification, we calculated the averaged false selection rate (fsr) and negative selection rate (nsr) over the different clusters. Let I_k denote the index set of observations in cluster k, and let Î_k denote its estimate. Define

(18)  fsr = Σ_k |Î_k ∖ I_k| / Σ_k |Î_k|,   nsr = Σ_k |I_k ∖ Î_k| / Σ_k |I_k|,
where |·| denotes the set cardinality. The smaller the values of fsr and nsr, the better the performance of the method. The comparison is summarized in Table 3 where, for each setting of the dimension and m, each method was evaluated on 50 datasets and the averaged evaluation results are reported. The numbers in parentheses represent the standard deviations of the corresponding averages. The comparison indicates that the proposed method significantly outperforms the other methods in both parameter estimation and cluster identification.
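Reading (18) as ratios of misassigned indices summed over clusters (with cluster labels assumed already matched between truth and estimate), fsr and nsr can be computed as:

```python
def fsr_nsr(true_sets, est_sets):
    """fsr: fraction of selected indices that are false selections;
    nsr: fraction of true indices that were missed. Cluster labels are
    assumed to be matched between true_sets and est_sets."""
    false_sel = sum(len(e - t) for t, e in zip(true_sets, est_sets))
    missed = sum(len(t - e) for t, e in zip(true_sets, est_sets))
    fsr = false_sel / sum(len(e) for e in est_sets)
    nsr = missed / sum(len(t) for t in true_sets)
    return fsr, nsr
```

In practice the estimated cluster labels are arbitrary, so one would first match estimated clusters to true clusters (e.g., by maximal overlap) before computing the two rates.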
m  SL  FL  KL  fsr  nsr  

penalized  0  3.642(0.015)  22.309(0.018)  149.701(1.217)  —  —  
0.3  3.618(0.008)  22.231(0.004)  149.058(0.569)  0.453(0.004)  0.393(0.005)  
0.5  3.488(0.027)  22.414(0.056)  160.736(1.540)  0.014(0.002)  0.016(0.002)  
IC  0  3.261(0.045)  11.222(0.072)  24.619(0.281)  —  —  
0.3  2.984(0.037)  10.508(0.084)  21.439(0.251)  0.008(0.001)  0.008(0.001)  
0.5  3.025(0.035)  10.635(0.081)  21.701(0.276)  0(0)  0(0)  
penalized  0  3.644(0.009)  31.480(0.007)  296.131(1.483)  —  —  
0.3  3.578(0.022)  31.529(0.056)  304.948(4.098)  0.512(0.004)  0.533(0.010)  
0.5  3.143(0.033)  32.712(0.124)  388.672(5.041)  0.015(0.002)  0.021(0.007)  
IC  0  3.437(0.042)  16.102(0.107)  51.161(0.413)  —  —  
0.3  3.350(0.042)  15.800(0.039)  49.069(0.311)  0(0)  0(0)  
0.5  2.732(0.036)  16.312(0.010)  50.177(0.246)  0(0)  0(0) 
3.2 Example 2
To make the problem harder, we consider a model for which each component has a different mean vector as well as a different concentration matrix, although the adjacency matrix is still the same for all components. As in Example 1, we fixed the number of components at three and the total sample size at 300, varied the dimension p, and set the cluster mean vectors to differ by multiples of m·1_p, where 1_p denotes a p-dimensional vector of ones. The common pattern of the concentration matrices is given by

(19)

where the three components use different scalings of this pattern. From each component, we generated 100 samples. Three different values of m were considered: 0, 0.3 and 0.5. Under each setting of the dimension and m, 50 independent datasets were generated.
Figure 2 shows the precision-recall curves produced by gLasso, nodewise regression, ψ-learning, EM-regularization, and the proposed method. It indicates that the proposed method outperforms the others. The two plots in the first row of Figure 2 compare the performance of the different methods when m = 0, which represents a very difficult scenario: the clusters differ only slightly in their precision matrices, and the samples are thus extremely difficult to cluster. However, the proposed method still outperforms the others under this scenario.
Table 4 compares the areas under the precision-recall curves produced by the different methods, and Table 5 compares the performance of the different methods in parameter estimation and cluster identification. For each setting of the dimension and m, each method was evaluated on 50 datasets, and the averaged evaluation results are reported. The numbers in the parentheses of the two tables represent the standard deviations of the corresponding averages. The comparison indicates that the proposed method outperforms the others in both parameter estimation and cluster identification.
m  gLasso  NodeReg  ψ-learning  Penalized  IC

0  0.653(0.002)  0.732(0.002)  0.888(0.003)  0.595(0.003)  0.927(0.003)  
0.3  0.416(0.003)  0.429(0.003)  0.624(0.002)  0.571(0.003)  0.926(0.003)  
0.5  0.162(0.001)  0.184(0.001)  0.434(0.005)  0.460(0.003)  0.914(0.003)  
0  0.625(0.002)  0.711(0.002)  0.858(0.001)  0.573(0.002)  0.896(0.002)  
0.3  0.388(0.002)  0.401(0.003)  0.615(0.002)  0.555(0.002)  0.898(0.003)  
0.5  0.136(0.001)  0.161(0.001)  0.380(0.004)  0.358(0.018)  0.878(0.003) 
m  SL  FL  KL  fsr  nsr  

penalized  0  3.745(0.019)  22.952(0.033)  258.397(3.042)  0.633(0.121)  0.651(0.106)  
0.3  3.723(0.022)  22.902(0.038)  257.245(3.395)  0.174(0.009)  0.196(0.013)  
0.5  3.749(0.019)  22.973(0.032)  257.954(2.943)  0.159(0.035)  0.177(0.100)  
IC  0  3.453(0.049)  11.782(0.069)  42.363(0.369)  0.103(0.017)  0.094(0.011)  
0.3  3.387(0.042)  11.572(0.060)  41.161(0.284)  0.010(0.003)  0.010(0.003)  
0.5  3.292(0.041)  11.521(0.069)  42.098(0.419)  0(0)  0(0)  
penalized  0  3.797(0.002)  32.411(0.004)  505.035(3.343)  0.597(0.231)  0.678(0.429)  
0.3  3.740(0.020)  32.336(0.045)  510.729(4.079)  0.479(0.104)  0.345(0.085)  
0.5  3.752(0.012)  32.333(0.034)  508.792(1.768)  0.267(0.009)  0.108(0.005)  
IC  0  3.405(0.014)  17.045(0.080)  92.140(0.709)  0.212(0.010)  0.209(0.009)  
0.3  3.533(0.043)  16.989(0.075)  91.503(0.746)  0.005(0.001)  0.004(0.001)  
0.5  3.573(0.041)  17.194(0.090)  94.059(0.701)  0(0)  0(0) 
3.3 Identification of cluster numbers
When the number of clusters K is unknown, we propose to determine its value according to the BIC criterion given in (11). In what follows, we illustrate the performance of the proposed method under this scenario using some simulated examples. We considered cases with two and three clusters and with the dimension equal to 100 and 200. For each combination of the cluster number and dimension, we simulated samples from each cluster with the same precision matrix as defined in (12). The cluster means were specified as in the earlier examples, with two mean vectors in the two-cluster case and three mean vectors in the three-cluster case.
Figure 3 compares the performance of the EM-regularization method and the proposed method in identifying the number of clusters. It indicates that, for the simulated examples, the proposed method was able to correctly identify the true value of K according to the BIC criterion, while the EM-regularization method could not.
4 A Real Data Example
Breast cancer is one of the most prevalent types of cancer; based on tumor expression profiles, it can be classified into four molecular subtypes, namely luminal A, basal-like, HER2-enriched, and luminal B (Haque et al., 2012). In this study, we aim to construct a single gene regulatory network across the four subtypes to discover the overall gene regulation mechanism in breast cancer. The gene expression data for breast cancer are available at The Cancer Genome Atlas (TCGA) and contain 768 patients and 20,502 genes. For each patient, some clinical information such as survival time, age, gender and tumor stage is also available, but the cancer subtype is unknown. Since the data might be heterogeneous given the existence of breast cancer subtypes, the proposed method can be applied here. For this study, we are interested in learning a gene regulatory network related to the survival time of patients. For this reason, we first applied a marginal screening method to select the survival time-related genes: for each gene, we calculated its p-value using a marginal Cox regression after adjusting for the effects of age, gender and tumor stage, and then selected 592 genes according to a multiple hypothesis test at a false discovery rate (FDR) level of 0.05. We used the empirical Bayes method of Liang and Zhang (2008) to conduct the test.
To determine the number of components of the mixture model, we calculated BIC scores for a range of candidate values, with the results shown in Figure 4. According to the BIC scores, we set the number of clusters to 3. The resulting three clusters consist of 338, 191, and 238 patients, respectively. Figure 5(a) shows the Kaplan-Meier curves of the three clusters. A log-rank test for the three curves produced a value of 3.89, which indicates that the patients in different clusters have different survival probabilities. Further, for each gene, we conducted an ANOVA test for its mean expression level across the three clusters. The resulting p-values are shown in Figure 5(b), where most values are very close to 0. This implies that the three clusters have different means and thus the data are heterogeneous. We note that the clustering results produced by the proposed method are biologically meaningful, which is likely due to the existence of hidden subtypes of breast cancer. As stated in Haque et al. (2012), women with luminal A tumors had the longest survival time, women with HER2-enriched and luminal B tumors had a much shorter survival time, and women with basal-like tumors had an intermediate survival time, with deaths occurring earlier than those with luminal A tumors.
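The per-gene ANOVA used to check heterogeneity across clusters can be sketched as follows. The sketch computes only the one-way ANOVA F statistic for a single gene (converting it to a p-value requires the F distribution, omitted here); the expression values are hypothetical.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for one gene's expression across clusters.
    groups: list of lists of expression values, one inner list per cluster."""
    k = len(groups)                      # number of clusters
    n = sum(len(g) for g in groups)      # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression of one gene in three clusters with shifted means;
# well-separated means yield a large F statistic (small p-value).
f_stat = one_way_anova_f([[1.0, 1.2, 0.9], [2.1, 2.0, 2.2], [3.0, 3.1, 2.9]])
```

A large F statistic for most genes, as in Figure 5(b), is what signals that the cluster means differ and the data are heterogeneous.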
With the number of clusters set to 3, the proposed method produced a gene regulatory network (shown in Figure 6), from which some hub genes can be identified. Hub genes refer to those with high connectivity in the gene regulatory network, and they tend to play important roles in gene regulation. To identify hub genes in a stable way, we considered a cross-validation-like method: we divided the dataset equally into five subsets and then ran the proposed method five times, each time applying it to only four of the five subsets. In each run, we identified ten hub genes according to their connectivity. The results are summarized in Table 6, where the genes are ranked by the frequency with which they were selected as a hub gene among the five runs. The results indicate that the performance of the proposed method is quite stable: quite a few genes were frequently selected as hub genes across different runs.
Rank  Gene     Freq  Links        Rank  Gene     Freq  Links
1     LHFPL3   4     49.2 (9.6)   6     KRT12    3     13.4 (5.1)
2     SEPP1    4     8.4 (1.4)    7     FXYD1    2     5.4 (2.3)
3     MYH11    4     8.6 (1.4)    8     SCARA5   2     6.4 (1.6)
4     F13A1    3     12.2 (3.8)   9     CLEC3B   2     7.8 (2.8)
5     MAMDC2   3     5.4 (1.0)    10    LRRC70   2     5.8 (1.7)
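The cross-validation-like stability procedure above can be sketched in code. The network-learning step itself is not shown; each run is represented by a hypothetical 0/1 adjacency matrix, and the sketch only demonstrates the degree ranking and frequency counting.

```python
from collections import Counter

def degree_ranking(adjacency):
    """Node degrees (connectivity) from a 0/1 adjacency matrix."""
    return [sum(row) for row in adjacency]

def stable_hubs(runs, names, top=2):
    """runs: one adjacency matrix per leave-one-fold-out fit of the method.
    Counts how often each gene lands in a run's top-`top` list by degree,
    then ranks genes by that frequency, as in Table 6."""
    counts = Counter()
    for adj in runs:
        deg = degree_ranking(adj)
        order = sorted(range(len(names)), key=lambda i: -deg[i])
        counts.update(names[i] for i in order[:top])
    return counts.most_common()

# Hypothetical symmetric networks from 3 runs over 4 genes; gene A is the
# dominant hub, so it should top the frequency ranking.
a1 = [[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
a2 = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]]
a3 = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 1, 0]]
freq = stable_hubs([a1, a2, a3], ["A", "B", "C", "D"], top=2)
```

In the real analysis the same logic was applied with five runs and the top ten genes per run, producing the frequencies reported in Table 6.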
Our findings on hub genes are largely consistent with existing knowledge. Among the top 10 hub genes, 8 have been verified in the literature to be related to breast cancer. For example, LHFPL3, the gene with the highest connectivity in the network, is characteristic of primary glioblastoma and is involved in processes important for cancer development and progression (Milinkovic et al., 2013). The gene SEPP1 is significantly associated with breast cancer risk among women (Mohammaddoust et al., 2018). The gene F13A1 is known as a thrombotic factor that plays a major role in tumor formation (Ahmadi et al., 2016). In the cancer coexpression network developed by Meng et al. (2016), MAMDC2 was found to play a key role in the development of breast invasive ductal carcinoma. Our results also reveal some new findings, such as the gene MYH11. Li et al. (2016) reported that MYH11 plays a role in tumor formation by disturbing stem cell differentiation or affecting cellular energy balance, and it has been identified as a driver gene in human colorectal cancer, although few studies have examined its function in breast cancer.
5 Discussion
In this paper, we have proposed a new method for constructing gene regulatory networks from heterogeneous data, which is able to simultaneously cluster samples into different groups and learn an integrated network across the groups. The proposed method was illustrated using some simulated examples and a real-world gene expression data example. The numerical results indicate that the proposed method significantly outperforms existing methods, such as graphical Lasso, nodewise regression, ψ-learning, and EM-regularization. For the real-world gene expression data example, we conducted a detailed post-clustering analysis, which indicates the heterogeneity of the data and justifies the importance of the proposed method for real problems.
In addition to microarray gene expression data, the proposed method can be applied to next-generation sequencing (NGS) data based on the transformations developed in Jia et al. (2017). To learn gene regulatory networks from NGS data, which are often assumed to follow a Poisson or negative binomial distribution, Jia et al. (2017) developed a random-effect-model-based transformation to continuize the NGS data. The continuized data can then be further transformed to Gaussian using the nonparanormal transformation (Liu et al., 2012), after which the proposed method can be applied. We expect that the proposed method can also find wide applications in other scientific fields.
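The nonparanormal step in this pipeline amounts to a rank-based Gaussianization of each gene's (continuized) measurements. The sketch below uses the simplified form Φ⁻¹(rank/(n+1)), without the truncation used by Liu et al. (2012); the input values are hypothetical.

```python
from statistics import NormalDist

def nonparanormal(x):
    """Rank-based Gaussianization of one gene's continuized measurements:
    map each value to Phi^{-1}(rank / (n + 1)). A simplified version of the
    nonparanormal transformation of Liu et al. (2012), omitting truncation."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    nd = NormalDist()  # standard normal
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]

# Hypothetical continuized NGS values for one gene across 5 samples.
z = nonparanormal([3.2, 0.1, 7.8, 1.5, 4.4])
```

The transform preserves the ordering of the observations while forcing their marginal distribution toward standard normal, so the Gaussian graphical model can then be fit to the transformed data.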
Acknowledgment
The authors thank the book editor Dr. Yichuan Zhao and two referees for their constructive comments, which have led to significant improvement of this paper. Liang’s research was supported in part by the grants DMS-1612924 and DMS/NIGMS R01-GM117597.
References

Ahmadi, M., Nasiri, M., and Ebrahimi, A. (2016). Thrombosis-Related Factors FV and F13A1 Mutations in Uterine Myomas. Zahedan Journal of Research in Medical Sciences, 18(10).

Benjamini, Y., Krieger, A. M., and Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3), 491–507.

Celeux, G., and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793.

Danaher, P., Wang, P., and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B, 76(2), 373–397.

Dempster, A. P. (1972). Covariance selection. Biometrics, 28(1), 157–175.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2), 521–541.

Fan, J., Feng, Y., and Xia, L. (2015). A projection based conditional dependence measure with applications to high-dimensional undirected graphical models. arXiv preprint arXiv:1501.01617.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70, 849–911.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38, 3567–3604.

Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.

Haque, R., Ahmed, S. A., Inzhakova, G., Shi, J., Avila, C., Polikoff, J., … and Press, M. F. (2012). Impact of breast cancer subtypes and treatment on survival: an analysis spanning two decades. Cancer Epidemiology, Biomarkers & Prevention, 21(10), 1848–1855.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning (Second Edition). Springer-Verlag, 763 pages.

Jia, B., Tseng, G., and Liang, F. (2018). Fast Bayesian integrative analysis for joint estimation of multiple Gaussian graphical models. Submitted to Journal of the American Statistical Association.

Jia, B., Xu, S., Xiao, G., Lamba, V., and Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics.

Lee, S., Liang, F., Cai, L., and Xiao, G. (2018). A two-stage approach of gene network analysis for high-dimensional heterogeneous data. Biostatistics, 19(2), 216–232.

Li, Y., Tang, X. Q., Bai, Z., and Dai, X. (2016). Exploring the intrinsic differences among breast tumor subtypes defined using immunohistochemistry markers based on the decision tree. Scientific Reports, 6, 35773.

Liang, F., Jia, B., Xue, J., Li, Q., and Luo, Y. (2018). An imputation-consistency algorithm for high-dimensional missing data problems and beyond. arXiv preprint arXiv:1802.02251.

Liang, F., Song, Q. and Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high dimensional Gaussian graphical models. Journal of the American Statistical Association, 110, 1248–1265.

Liang, F. and Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika, 95, 961–977.

Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4), 2293–2326.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34, 1436–1462.

Meng, L., Xu, Y., Xu, C., and Zhang, W. (2016). Biomarker discovery to improve prediction of breast cancer survival: using gene expression profiling, meta-analysis, and tissue validation. OncoTargets and Therapy, 9, 6177–6185.

Milinkovic, V., Bankovic, J., Rakic, M., Stankovic, T., Skender-Gazibara, M., Ruzdijic, S., and Tanic, N. (2013). Identification of novel genetic alterations in samples of malignant glioma patients. PLoS One, 8(12), e82108.

Mohammaddoust, S., Salehi, Z., and Saeidi Saedi, H. (2018). SEPP1 and SEP15 gene polymorphisms and susceptibility to breast cancer. British Journal of Biomedical Science, 1–11.

Nielsen, S. F. (2000). The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 6(3), 457–489.

Ruan, L., Yuan, M., and Zou, H. (2011). Regularized parameter estimation in high-dimensional Gaussian mixture models. Neural Computation, 23(6), 1605–1622.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(3), 479–498.

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., and Williams, R. M., Jr. (1949). The American Soldier, Vol. 1: Adjustment During Army Life. Princeton, NJ: Princeton University Press.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.