Logistic Regression Augmented Community Detection for Network Data with Application in Identifying Autism-Related Gene Pathways

09/07/2018 ∙ by Yunpeng Zhao, et al. ∙ 0

When searching for gene pathways leading to specific disease outcomes, additional information on gene characteristics is often available that may facilitate to differentiate genes related to the disease from irrelevant background when connections involving both types of genes are observed and their relationships to the disease are unknown. We propose method to single out irrelevant background genes with the help of auxiliary information through a logistic regression, and cluster relevant genes into cohesive groups using the adjacency matrix. Expectation-maximization algorithm is modified to maximize a joint pseudo-likelihood assuming latent indicators for relevance to the disease and latent group memberships as well as Poisson or multinomial distributed link numbers within and between groups. A robust version allowing arbitrary linkage patterns within the background is further derived. Asymptotic consistency of label assignments under the stochastic blockmodel is proven. Superior performance and robustness in finite samples are observed in simulation studies. The proposed robust method identifies previously missed gene sets underlying autism related neurological diseases using diverse data sources including de novo mutations, gene expressions and protein-protein interactions.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Community detection is a fundamental question in network analysis (Goldenberg et al., 2010; Newman, 2006; Fortunato, 2010)

. Traditional approaches consider the adjacency matrix, whose elements equal one or zero indicating whether there is a connection between two nodes, as the input. Then the nodes are partitioned into cohesive groups, that is, communities, with homogeneous linkage probabilities within and heterogeneous probabilities between the groups. Current community detection methods assume all nodes belong to certain communities of interests. However, this assumption is not always true in real applications. For example, when we are looking for pathways involving genes related to the risk of target disease, connections between candidate genes regardless their involvement in the disease process are collected. Furthermore, whether a gene has impacts on the disease origination and development is usually unknown. Often, information on the characteristics of the nodes/genes can help to differentiate the nodes related to the outcome of interests and the unrelated ones. Novel two-stage models with one joint likelihood are proposed to incorporate the node-specific information which isolate irrelevant nodes from relevant ones and in return improve detection accuracy of communities related to a specific outcome.

Our study is motivated by the problem to discover gene pathways leading to complex diseases in genomic studies. Highly correlated gene expression levels and experimentally verified protein-protein interactions provide useful information on connections between genes. However, not all genes are related to the disease under study. In fact, most genes are “household” genes with functions to maintain normal metabolic processes within healthy human bodies. Mixing genes and pathways for normal life processes with those leading to the target disease in community detection models will introduce noise to disease-generating pathways which are the true interests of clinicians and biologists. De novo mutations refer to gene mutations that occur for the first time in a family compared to mutations inherited from parents. We believe that discrepancy in the numbers of de novo mutations on the same gene in patients and the number in healthy controls would help differentiate genes related to the disease from those unrelated to the disease, which we call the “background”. The three kinds of data, gene expression, protein-protein interaction and number of de novo mutations, can be downloaded from different online data consortiums and combined using unique gene names.

The stochastic blockmodel is the most used statistical tool for modeling and detecting communities (Holland et al., 1983; Snijders and Nowicki, 1997; Nowicki and Snijders, 2001). We model the relationship between the unobserved indicator whether a gene is related to the target disease or not and gene-specific covariates by logistic regression in the first stage, then cluster disease-related genes into several pathways in the second stage. Both indicators for disease relevance in the first stage and community labels in the second stage are latent variables, and the expectation-maximization algorithm is employed. However, this approach is intractable due to the numerous possible label assignments in the E-step. Amini et al. (2013) proposed a fast pseudo-likelihood algorithm for fitting blockmodels and we adapt this algorithm in Section 3 to the joint pseudo-likelihoods incorporating both the logistic regression and the block models. The pseudo-likelihood may also be optimized by other alternative approaches such as the EMM algorithm by Gormley and Murphy (2008).

Another distinct feature of the proposed method is the extension to the robust community detection allowing heterogeneous linkage probabilities in the background, which relaxes the assumption of homogeneous linkage probability within each group in the stochastic blockmodel. For instance, the background can be a mixture of multiple strongly or weakly connected groups. These groups all belong to the background because they are not related to the target disease, but their linkage rates are not necessarily homogeneous. In Section 4, we further develop the model in section 3 to allow for arbitrary linkage patterns within the background. Interestingly, when the linkage probabilities within the background are unspecified, the pseudo-likelihood algorithm can be modified to leave the likelihood of the links in the background out while the classical likelihood approach cannot.

Recently there have been works on community detection which utilize covariates information. These papers use the additional covariates information to improve the accuracy of community detection. Some papers combine a similarity or kernel matrix based on covariates with the adjacency matrix (Binkiewicz et al., 2014; Zhang et al., 2015; Yan and Sarkar, 2016; Xu et al., 2012). Other papers build likelihoods of linkage probabilities incorporating auxiliary nodal information (Tallberg, 2004; Yang et al., 2013; Newman and Clauset, 2016; Handcock et al., 2007; Krivitsky et al., 2009; Gormley and Murphy, 2010). However, none of these works follow the same framework as our method. In short, in our method, the sole reason of using auxiliary information on nodal characteristics is to distinguish the disease related nodes from unrelated ones, then we carry out community detection within the disease-related nodes. On the contrary, in the literature, auxiliary information is used to facilitate partition of all nodes into communities. For example, Tallberg (2004) used covariates to predict the probabilities into each homogeneous community in a Bayesian framework, while we use covariates to predict the probability into the heterogeneous background in a pseudo-likelihood framework.

2 Methods

We begin by introducing the data structure and notation. A network with nodes can be represented by an adjacency matrix , where

In addition to the adjacency matrix , some covariate information on nodes is also available. These covariates are represented by an matrix , where denotes the value of the th covariate on node .

We model networks with a particular community structure where the network is composed of multiple cohesive communities, together with some background nodes. Unlike the usual definition of background set which is diffuse within itself or weakly connected to other parts of the network (Zhao et al., 2011), we assume that the probability of a node belonging to the background set depends on its covariates. Suppose there are communities besides the background set. Let denote the community that each of the nodes/genes belongs to, thus if nodes belongs to community , for , and if node is a background gene. Moreover, let

be a vector indicating whether the node belongs to one of the

communities or the background, i.e. if , otherwise.

The network is generated in three steps.

  1. The random variable

    is independent for and follows a logistic regression

    where is the coefficients vector, and is the th row of . Here the logistic model has an intercept, that is, the first column of is .

  2. The probability that a node with belongs each of the communities is given by the independent multinomial distribution with parameter ,

    In addition, if .

  3. Conditional on the labels, for are independent Bernoulli variables with

    where is a symmetric matrix.

The total number of genes in the th community is and the number of links between the th and th commuity is given by , where is the indicator function. Moreover, let if , and . Then the joint log-likelihood of and is

(2.1)

3 Estimating Procedures

The community labels are unobserved in a community detection problem. Furthermore, the E-step of such algorithm requires evaluating all possible label assignments, which makes the algorithm intractable (Amini et al., 2013; Zhao et al., 2012). We adopt the idea of pseudo-likelihood in Amini et al. (2013) which partitions each row of into blocks and assumes the independence between rows.

We use the same notation as those in Amini et al. (2013). The vector denotes an initial blocking vector, where . And denotes the number of edges associated with node in the th block, that is, . Let and , where is the expected total number of edges in the -th block for a node in community , i.e., . When is large,

can be approximated by a Poisson distribution given

, and the dependence of between different rows is weak. Assuming are independence for and and using the Poisson approximation, the log-pseudolikelihood of and (up to a constant) is

where . And the log-likelihood for the marginal distribution of (up to a constant) is

(3.1)

Given initial labels , equation (A.1) can be maximized by a standard expectation-maximization algorithm. The details of the E-step and M-step are given in Algorithm 1.

Algorithm 1: (The expectation-maximization algorithm under Poisson distribution)

  • E-step: Let and

    be the estimates at the current iteration, and

    . The posterior probability of label assignment is

  • M-step: Given , and can be updated by the following formulae,

    can be updated by solving the logistic regression,

    Note is the sum of the estimated conditional probabilities of gene belonging to one of the communities.

Once the expectation-maximization algorithm converges, we can update the labels by . We repeat this procedure several times until becomes stable.

Amini et al. (2013) also introduced a pseudo-likelihood conditional on the node degrees. We generalize this conditional pseudo-likelihood to our scenario. Denote the node degree by . Then follows multinomial distribution conditional on label and . The multinomial log pseudo-likelihood (up to a constant) is

(3.2)

where is the parameter in the multimomial distribution satisfying .

The algorithm is similar to that for the Poisson pseudo-likelihood. We give the details of the expectation-maximization algorithm under the multinomial distribution in Algorithm 2.

Algorithm 2: (The expectation-maximization algorithm under multinomial distribution)

  • E-step: Based on current estimates and , the posterior probability of label assignment is

  • M-step: Given , , and can be updated by

4 Robust Community Detection

So far we assume that all the disease-related communities and the background satisfy the stochastic blockmodel assumption. In this section, we propose a new pseudo-likelihood method that allows for arbitrary structure in the background, for example, a mixture of strongly and weakly connected groups, or nodes with high degree variations. In other words, we keep the stochastic blockmodel assumption in the disease-related communities, but make no assumption on the structure within the background. A network with the heterogeneous background is generated in three steps, of which the first two steps are identical to the first two steps in Section 2. The last step has been modified as follows.

Step 3: Conditional on the labels, when or , for are independent Bernoulli variables with

The link probabilities within the background set, i.e., when and , are not specified.

The joint likelihood (2) cannot be used as the criteria to estimate c in this situation because it is maximized when all nodes belong to group . By contrast, the pseudo-likelihood method introduced in Section 3 can be extended to this new scenario. Recall the setup in Section 3. Let be an initial blocking vector. And denotes the number of edges associated with node in the th block . When is a reasonable initial vector, can be approximated by a mixture of Poisson distributions as before when . However, when , the distribution of is unknown since the link probabilities within the background are unspecified. Therefore, we exclude this part of unreliable information, and propose the following pseudo-likelihood for robust community detection,

(4.1)

where .

Notice equation (4) is indeed a valid likelihood function conditional on because the blocking vector and the community labeling vector are treated differently in our algorithm. The proof of its identifiability is given in the supplementary material. The blocking vector partitions the columns of into blocks and is the sum in the th row th block. Likelihood (4) does not include – the last column of since the Poisson approximation are inappropriate. But this does not affect the range of , which is still . Community detection based on (4) can be viewed as a classic clustering problem on . We need to assign a label from 1 to to each row data point, i.e., each , which contains features. But we only use the first features since the last one is not reliable. Then we update the labelling of each gene in the columns and iterate several times. In each iteration, because the groups in the columns and the grouping results in the rows are considered separately, dropping one column of noise will not result in all genes (rows) falling into the th group. After each iteration of the outer loop updating , genes with similar linkage probabilities with the first

groups are classified into the same group with higher and higher accuracy.

The algorithm is therefore similar to Algorithm 1 and given in the following.

Algorithm 3: (The expectation-maximization algorithm for robust community detection)

  • E-step: Let and be the estimates at the current iteration, and . The posterior probability of label assignment is

  • M-step: Given , , and can be updated by,

As before, once the expectation-maximization algorithm converges, is updated by . We repeat this procedure until becomes stable.

We do not consider robust community detection using multinomial approximation because the condition is invalid if the last column is removed.

5 Asymptotic Properties

In this section we study the consistency under stochastic blockmodels. Equation (3.2) has slightly simpler form and theoretical derivations than (A.1). The theoretical analysis in this section will focus on the multinomial pseudo-likelihood.

We begin with the setup, which closely follow Amini et al. (2013). The true community labels are the parameters of interests, where . We focus on the case of directed blockmodel. A coupling technique can be used to extend the result to the undirected case analogous to that in Amini et al. (2013). Consider the edge matrix

where . Here and remain constant, while can scale with . The directed blockmodel assumes that all the entries in the adjacency matrix are independent Bernoulli variables without forcing to be symmetric, that is, . For simplicity, a univariate covariate taking values in is assumed.

We illustrate the consistency of one-step expectation-maximization of the multinomial pseudo-likelihood. Starting from some initial labels and initial estimates , of the parameters , and , the initial estimates of and are obtained from the logistic regression, that is,

Define

Let

and be the 2 by 2 matrix with entries given by . The initial estimates is obtained by row normalization of , that is,

With the notation defined above, the output of one-step expectation-maximization is

We use the mis-classification error rate (Choi et al., 2012; Zhao et al., 2012; Amini et al., 2013) to measure the performance of . That is, define

where is the set of permutations of . In this definition we consider all values that are permutations of each other because they result in the same community structure.

Consider the class of initial labels that correctly classify the node as a member of community . The fraction of such nodes among all nodes belonging to community , , is formally given by

where is the size of community .

An extra condition is introduced to avoid perfect separation of in the logistic fit. We define the following class

where is the size of initial estimate of community .

The uniform consistency of within the class is established by the following theorem.

Theorem 5.1 (Main result).

Assume and . Then under some regularity condition, with sufficiently large , and , for any ,

The details of the regularity condition and the proof is given in the supplementary material.

The proof of the main theorem depends on a key fact that the log ratio of the estimated probabilities and has a uniform bound independent with , for . This is summarized in the following lemma.

Lemma 5.1.

Assume . Then if , there exist such that for sufficiently large ,

where is independent with .

The proof is given in the supplementary material.

6 Simulations

We first examine the performance of the proposed methods under standard stochastic blockmodel. Each network contains nodes and each setup is repeated 500 times. There are three groups including two disease-related communities and one disease-irrelevant background set. The probability a gene is related to the disease follows a logistic regression with . Here is the indicator for the th node belonging to a disease-related community and covariate . And correspond to the percentages background , and , respectively. Nodes with are assigned to two non-overlapping communities with equal probabilities . Pairs within the background, as well as pairs composed of one node in the background and the other node in a disease-related community are linked with probability . The linkage probability between the two non-background communities is 005, while the linkage probability for pairs within the same community ranges from 015 to 025.

With Logistic Models Without Logistic Models
Poisson Multinomial Robust Poisson Multinomial Robust
Background Nodels
15 58 (12) 57 (13) 59 (12) 15 (7) 15 (8) 15 (8)
16 66 (8) 66 (9) 67 (8) 23 (11) 24 (11) 23 (11)
17 72 (7) 72 (6) 73 (7) 34 (13) 33 (13) 33 (13)
18 77 (5) 76 (5) 77 (5) 48 (14) 45 (13) 46 (15)
19 81 (5) 80 (5) 81 (4) 61 (11) 55 (13) 60 (11)
20 85 (4) 83 (4) 85 (4) 70 (8) 66 (9) 70 (9)
21 88 (3) 86 (4) 88 (3) 78 (6) 73 (7) 78 (7)
22 91 (3) 88 (3) 91 (3) 83 (4) 79 (6) 83 (5)
23 93 (3) 90 (3) 93 (3) 87 (4) 83 (5) 87 (4)
24 94 (2) 92 (3) 94 (2) 90 (3) 86 (4) 90 (3)
25 96 (2) 93 (2) 96 (2) 93 (3) 89 (3) 93 (3)
Background Nodels
15 74 (5) 74 (5) 74 (5) 44 (10) 44 (10) 43 (11)
16 78 (4) 78 (4) 79 (4) 56 (8) 56 (8) 55 (10)
17 82 (4) 82 (4) 82 (4) 66 (6) 64 (7) 66 (7)
18 86 (3) 85 (4) 86 (3) 74 (6) 72 (6) 74 (6)
19 89 (3) 88 (3) 89 (3) 80 (5) 78 (5) 80 (5)
20 91 (3) 90 (3) 92 (3) 86 (4) 82 (4) 86 (4)
21 94 (2) 92 (3) 94 (2) 89 (3) 86 (4) 89 (3)
22 95 (2) 93 (2) 95 (2) 92 (3) 89 (3) 92 (3)
23 96 (2) 95 (2) 97 (2) 94 (2) 91 (3) 94 (2)
24 98 (1) 96 (2) 97 (1) 96 (2) 93 (3) 96 (2)
25 98 (1) 96 (2) 98 (1) 97 (1) 94 (2) 97 (1)
Background Nodels
15 82 (4) 82 (4) 82 (3) 67 (6) 67 (6) 67 (6)
16 86 (3) 86 (3) 86 (3) 74 (5) 74 (5) 74 (5)
17 89 (3) 88 (3) 89 (3) 81 (4) 80 (4) 80 (4)
18 91 (3) 91 (3) 91 (3) 85 (3) 84 (4) 85 (3)
19 94 (2) 92 (2) 94 (2) 89 (3) 87 (3) 89 (3)
20 96 (2) 94 (2) 96 (2) 92 (2) 90 (3) 92 (2)
21 97 (1) 95 (2) 97 (1) 95 (2) 92 (2) 95 (2)
22 98 (1) 96 (2) 98 (1) 96 (2) 94 (2) 96 (2)
23 98 (1) 97 (1) 98 (1) 97 (1) 95 (2) 97 (1)
24 99 (1) 97 (1) 99 (1) 98 (1) 96 (2) 98 (1)
25 99 (1) 98 (1) 99 (1) 99 (1) 97 (2) 99 (1)

Table 1: Comparison of average adjusted rand index (ARI)

under stochastic blockmodels. Numbers within parentheses are empirical standard deviations of ARI

.

Table 1 compares the performance of three models - the pseudo-likelihood methods with Poisson and multinomial approximation introduced in Section 3 as well as the robust community detection method introduced in Section 4. For each model, we further compare the two versions where auxiliary nodal information, i.e, logistic regression, is either used or unused. The community detection accuracy is measured by the adjusted rand index (ari) (Vinh et al., 2010), a widely-used measure for comparing two partitions. The value of the index is 0 for two independent partitions, and higher values indicate better agreement. The performance of all methods improves as the linkage probability within disease-related community increases, or as the percentage of background nodes decreases. More importantly, the proposed method incorporating auxiliary information through logistic regression always outperforms its counterpart without logistic regression. Moreover, the robust method gives the same performance as the Poisson pseudo-likelihood which suggests the robust method does not lose discriminatory accuracy when data follow standard stochastic block models. On the other hand, the algorithm fitting multinomial distributions performs slightly worse than the other two methods. Rigorously speaking, the multinomial pseudo-likelihood is an approximation to the degree-corrected blockmodel, which is a generalization of standard blockmodel by allowing more variation on degrees (Zhao et al., 2012; Karrer and Newman, 2011; Amini et al., 2013). Therefore, the finite sample performance of multinomial pseudo-likelihood has slightly lower ARI on average since it fits a more complicated model.

Next we consider the setup with heterogeneous background nodes. For any node in background, we generate from . The linkage probability between a background node and a disease-related node is . For two background nodes and , the linkage probability is . The rest of the model setups such as the generation mechanism of communities labels, the linkage probabilities within/between communities and linkage probabilities between a community and the background remain the same.

With Logistic Models Without Logistic Models
Poisson Multinomial Robust Poisson Multinomial Robust
Background Nodes
15 20 (11) 58 (14) 54 (21) 15 (6) 17 (8) 12 (10)
16 23 (13) 65 (9) 63 (18) 18 (5) 23 (10) 17 (12)
17 25 (13) 70 (8) 69 (16) 20 (5) 30 (12) 24 (17)
18 29 (14) 74 (6) 76 (11) 23 (5) 39 (13) 31 (21)
19 35 (18) 78 (5) 81 (8) 24 (5) 50 (12) 43 (26)
20 39 (20) 80 (5) 85 (6) 27 (6) 57 (12) 50 (27)
21 43 (23) 83 (5) 88 (5) 29 (5) 63 (10) 61 (27)
22 48 (25) 85 (4) 91 (3) 30 (6) 66 (10) 71 (25)
23 53 (27) 86 (4) 93 (3) 32 (7) 69 (10) 78 (22)
24 60 (29) 87 (4) 94 (2) 34 (9) 72 (10) 84 (19)
25 67 (30) 89 (4) 95 (2) 37 (13) 74 (10) 89 (15)
Background Nodes
15 62 (19) 73 (5) 74 (5) 34 (12) 44 (9) 42 (12)
16 70 (15) 77 (4) 79 (4) 41 (15) 53 (9) 55 (11)
17 75 (14) 81 (4) 83 (4) 46 (15) 62 (8) 66 (8)
18 81 (10) 84 (4) 86 (3) 52 (15) 69 (6) 74 (7)
19 85 (9) 86 (4) 89 (3) 58 (15) 74 (6) 80 (6)
20 89 (7) 89 (3) 92 (3) 63 (17) 78 (5) 85 (5)
21 92 (4) 90 (3) 93 (2) 72 (18) 82 (5) 89 (3)
22 95 (2) 92 (3) 95 (2) 82 (16) 84 (5) 92 (2)
23 96 (2) 93 (2) 96 (2) 88 (15) 87 (4) 94 (2)
24 97 (2) 94 (2) 97 (1) 93 (10) 88 (4) 96 (2)
25 98 (1) 94 (2) 98 (1) 96 (7) 90 (4) 97 (1)
Background Nodes
15 81 (5) 82 (4) 82 (4) 65 (8) 66 (6) 67 (7)
16 85 (4) 85 (3) 86 (3) 71 (7) 72 (5) 74 (5)
17 89 (3) 88 (3) 89 (3) 77 (7) 78 (5) 81 (4)
18 91 (3) 90 (3) 91 (2) 83 (6) 82 (4) 85 (4)
19 93 (2) 92 (3) 94 (2) 88 (4) 86 (4) 90 (3)
20 95 (2) 93 (2) 95 (2) 91 (3) 88 (3) 92 (3)
21 96 (2) 94 (2) 97 (2) 94 (2) 90 (3) 94 (2)
22 98 (1) 95 (2) 98 (1) 96 (2) 92 (3) 96 (2)
23 98 (1) 96 (2) 98 (1) 97 (2) 93 (2) 97 (1)
24 99 (1) 97 (2) 99 (1) 98 (1) 94 (2) 98 (1)
25 99 (1) 97 (1) 99 (1) 99 (1) 95 (2) 99 (1)
Table 2: Comparison of average adjusted rand index (ARI) under heterogeneous backgrounds. Numbers within parentheses are empirical standard deviations of ARI .
Figure 1: Comparison of the average ARI for Poisson pseudo-likelihood, multinomial pseudo-likelihood and robust community detection with and without logistic regressions under 62 of background nodes. This figure appears in color in the electronic version of this article.
Figure 2: Comparison of the average ARI for Poisson pseudo-likelihood, multinomial pseudo-likelihood and robust community detection with and without logistic regressions under 50 of background nodes. This figure appears in color in the electronic version of this article.
Figure 3: Comparison of the average ARI for Poisson pseudo-likelihood, multinomial pseudo-likelihood and robust community detection with and without logistic regressions under 38 of background nodes. This figure appears in color in the electronic version of this article.

The ari of the six methods are shown in Table 2 and Figures 1 - 3. Similar to what we observed in Table 1, the average ARIs of all methods increases as the linkage probability within community increases, or as the percentage of background nodes decreases. And the method with logistic regression outperforms their counterparts without logistic regression. The robust method with logistic regression gives the best performance in most scenarios. The Poisson pseudo-likelihood has the worst performance when the stochastic blockmodel assumption is violated in the heterogeneous background. Especially, under the case of high percentage of background nodes, the Poisson pseudo-likelihood performs poorly even when the linkage probability within community is much higher than the linkage probability between communities. The multinomial pseudo-likelihood slightly outperforms the robust method when the percentage of background nodes is high, in which case the robust method discards lots of information, while the multinomial pseudo-likelihood (or correspondingly degree corrected stochastic blockmodel) accounts for high variations on degrees. On the other hand, the robust method outperforms the multinomial pseudo-likelihood in all the other cases. In summary, the robust method has the best performance in terms of both accuracy and efficacy in almost all the setups we examined regardless the data follows stochastic blockmodels or not. In the only exception where the multinomial pseudo-likelihood method with logistic regression performs slightly better, the discrepancies between the two methods are small. Therefore, the robust community detection method is our recommended method.

In real applications, the number of communities is often unknown a priori. Saldana et al. (2017) proposed a modified Bayesian information criterion (BIC) for community detection:

Daudin et al. (2008) proposed another model selection criterion – integrated classification likelihood (ICL) with a heavier penalty:

We use simulation studies to verify the performance of BIC and ICL for our case. Since BIC and ICL are designed for the stochastic blockmodel, we compute by (2) in the present studies although is estimated by the robust method.

We follow the aforementioned setup for heterogeneous background nodes and only consider the case with 50% background nodes. For each network, we vary the assumed number of communities from to , and report in the first three columns of Table 3 the percentages of selected numbers of communities in 50 replicates by BIC and ICL, respectively. Both BIC and ICL perfectly identifies the true community number ().

In the last set of simulation studies, we consider the model selection for networks with 5 clusters plus the background. The setup is the same as the previous study except that and . With this setup, the average size of clusters is approximately 125 as in the previous study. The results are shown in the last three column of Table 3: BIC and ICL almost perfectly identifies the community number except for one replicate.

BIC ICL BIC ICL
1 0 0 1 0 0
2 1 1 2 0 0
3 0 0 3 0 0
4 0 0 4 0 0
5 0 0 5 0.98 0.98
6 0 0 6 0.02 0.02
7 0 0 7 0 0
8 0 0 8 0 0
Table 3: Proportions of the Numbers of Communities Selected by BIC and ICL

7 Application

With the development of improved sequencing techniques, more and more de novo mutations in candidate genes associated with neurodevelopmental or neuropshychiatric diseases are being reported. Here we focus on autism spectrum disorder and related neurological disorders. Most identified de novo mutations are rare and patients with the same clinical symptoms often carry heterogeneous mutation loci on different genes. Most probably, the pathophysiology mechanism underpinning autism involves perturbed molecular pathways. There is evidence of enrichment of de novo mutations in gene groups connected by protein-protein interactions, co-expression patterns, or pathways defined by common functions, annotations or evolutional patterns (Allen et al., 2013). Our study targets at interactive groups of biomarkers that form biological pathways related to autism origination and development.

Autism and related disorder data from Hormozdiari et al. (2015) are employed, which reports four types of information (clinically diagnosed disease status, RNA expression levels, de novo mutations, protein-protein interactions) from three major data consortiums including BrainSpan Atlas, published autism studies, protein-protein interaction databases. There are 52,801 verified protein-protein interaction links and 192,499 mRNA pairs with Pearson’s correlation coefficient between their expression levels higher than 05, with an overlap of 1060 links. Together, there are 244,240 unique links from both data sources. These links involve 13,243 genes. Hormozdiari et al. (2015) further gathered the de novo mutation and length information on 796 out of the 13,243 genes. In total, 796 genes with de novo mutations are employed in our analysis with 1334 mutual links between them, among which 602 genes have at least one link and 194 have none.

Synonymous mutations that differ at the DNA level but produce the same protein products are excluded. The frequencies of each type of mutation in a gene in all cases are summed up as well as the total number in the controls. Two covariates are employed in estimating the probability that a gene is involved in the occurrence or progression of autism and related neurological disorders – frequency of missense or loss of function mutations in cases, and number of mutations in controls. The choice of the covariates is based on biological beliefs on their involvement on autism development, hence decided a priori.

As in the simulation study, the BIC (Saldana et al., 2017) and ICL (Daudin et al., 2008) select the same number of communities for this data – five autism related modules plus one irrelevant background group produces.

We then use three community detection methods to cluster genes into seven clusters: the robust community detection (Section 4), the pseudo-likelihood method with nodal covariates (Section 3) and the standard stochastic blockmodel fitted by the profile-likelihood (Bickel and Chen, 2009). We run the algorithm with a number of random initial values for the pseudo-likelihood method with nodal covariates and pick the solution with the largest value of the likelihood (2). For a fair comparison, we use this solution as the initial value for the other two methods.

Table 4 shows the estimated link probabilities within gene groups and between gene group pairs for the five autism-related gene groups (group 1-5) and the background gene group (group 6). We compare the estimates from the three methods side by side. According to the table, the standard stochastic blockmodel classifies the nodes with zero connection as a cluster. The pseudo-likelihood method with nodal covariates gives a very similar partition. The adjusted Rand index for the partitions of these two methods are 0.963. Higher values of this index indicate better agreement and 1 means perfect agreement (Hubert and Arabie, 1985). The robust method gives a more different partition. The adjusted Rand index for the robust method and the standard stochastic blockmodel is 0.635, which may result from the fact that the robust method allows heterogeneous linkage rates in background genes while the two stochastic blockmodels do not. Table 4

shows the estimated link probabilities for the three methods. Furthermore, the estimated odds ratio from the robust method for mutation numbers in cases and mutation numbers in controls are 1.4874 (P-value=0.2808) and -1.0335 (P-value=0.0173), respectively. On the contrary, the two odds ratio estimates are -0.5332 (P-value=0.0419) and -0.2464 (P-value=0.5770) from the stochastic blockmodels, which disagrees with the prior that genes with higher number of mutations in cases are more likely to be related to neurological disorders. Therefore, we employ the results from the robust method.

Stochastic Blockmodel:
Group 1–5 Group 6
0.377 0.014 0 0.002 0.003 0.002
0.014 0.410 0 0.001 0.034 0.069
0.000 0.000 0 0.000 0.000 0.000
0.002 0.001 0 0.009 0.004 0.000
0.003 0.034 0 0.004 0.027 0.081
0.002 0.069 0 0.000 0.081 0.529
Pseudo Likelihood with Covariates:
Group 1–5 Group 6
0.414 0.014 0 0.002 0.004 0.002
0.014 0.433 0 0.003 0.046 0.065
0.000 0.000 0 0.000 0.000 0.000
0.002 0.003 0 0.007 0.006 0.003
0.004 0.046 0 0.006 0.035 0.095
0.002 0.065 0 0.003 0.095 0.514
Robust Community Detection:
Group 1–5 Group 6
0.414 0.002 0.000 0.004 0.012 0.000
0.002 0.618 0.000 0.006 0.190 0.028
0.000 0.000 0.000 0.001 0.000 0.003
0.004 0.006 0.001 0.009 0.009 0.027
0.012 0.190 0.000 0.009 0.087 0.055
0.000 0.028 0.003 0.027 0.055 0.104
Table 4: Estimated Link Probabilities between Groups

The gene set enrichment analysis (GSEA) of the selected gene modules compared with the curated gene sets in the Molecular Signatures Database are listed in Table 5. P-values are calculated assuming a hypergeometric distribution for the number of overlapping genes between the selected group and the curated gene set. Given the large number of multiple comparisons, stringent P-value threshold

is employed. The five autism-related groups overlaps significantly with gene sets in essential cellular functions or abnormal conditions such as cancer, apoptosis, cell structure, circulatory system, nervous system, multicellular organismal development. Group four overlaps with gene sets related to neurological functions or disorders. Gene set “GO NEUROGENESIS” are genes involved in generation of cells within the nervous system. Gene set “GO REGULATION OF NERVOUS SYSTEM DEVELOPMENT” concerns processes that modulate the frequency, rate or extent of nervous system development, the origin and formation of nervous tissue. Gene sets “GO NEURON PROJECTION” and “GO SYNAPSE” are composed of genes involved in nerve cell prolongation and nerve fiber junction, respectively. Furthermore, our results are compared to those from the Merging Aaffected Genes into Integrated-Nnetworks method in

Hormozdiari et al. (2015). The Merging Affected Genes into Integrated-Networks method was not able to detect group four. P-values from gene set enrichment analysis for the two best sets identified by their method against known neurodevelopmental diseases sets are 42 and 10, failing to reach the threshold.

Group Gene Set Group GeneSet Overlap Nominal FDR
Number Name Size Size Size P-value q-value
1 FISCHER DREAM TARGETS 18 929 16 101 107
1 GOBERT OLIGODENDROCYTE DIFFERENTIATION 18 570 11 286 152
1 DUTERTRE ESTRADIOL RESPONSE 24HR UP 18 324 9 177 629
1 FISCHER G2 M CELL CYCLE 18 225 8 122 326
1 PUJANA BRCA2 PCC NETWORK 18 423 9 197 419
1 PUJANA XPRSS INT NETWORK 18 168 7 237 420
1 GEORGES TARGETS OF MIR192 AND MIR215 18 893 10 278 423
1 NUYTTEN EZH2 TARGETS DN 18 1024 10 108 143
1 PUJANA CHEK2 PCC NETWORK 18 779 9 468 554
1 GO CHROMOSOME 18 880 9 138 148
2 GO CHROMOSOME ORGANIZATION 29 1009 11 131 862
2 GRAESSMANN APOPTOSIS BY DOXORUBICIN DN 29 1781 13 162 862
2 DACOSTA UV RESPONSE VIA ERCC3 DN 29 855 10 686 244
3 GO INTRINSIC COMPONENT OF PLASMA MEMBRANE 518 1649 71 489 521
3 GO RIBONUCLEOTIDE BINDING 518 1860 75 137 729
3 GO TRANSPORTER ACTIVITY 518 1276 60 153