1 Introduction
Community detection is a fundamental question in network analysis (Goldenberg et al., 2010; Newman, 2006; Fortunato, 2010)
. Traditional approaches consider the adjacency matrix, whose elements equal one or zero indicating whether there is a connection between two nodes, as the input. Then the nodes are partitioned into cohesive groups, that is, communities, with homogeneous linkage probabilities within and heterogeneous probabilities between the groups. Current community detection methods assume all nodes belong to certain communities of interests. However, this assumption is not always true in real applications. For example, when we are looking for pathways involving genes related to the risk of target disease, connections between candidate genes regardless their involvement in the disease process are collected. Furthermore, whether a gene has impacts on the disease origination and development is usually unknown. Often, information on the characteristics of the nodes/genes can help to differentiate the nodes related to the outcome of interests and the unrelated ones. Novel twostage models with one joint likelihood are proposed to incorporate the nodespecific information which isolate irrelevant nodes from relevant ones and in return improve detection accuracy of communities related to a specific outcome.
Our study is motivated by the problem to discover gene pathways leading to complex diseases in genomic studies. Highly correlated gene expression levels and experimentally verified proteinprotein interactions provide useful information on connections between genes. However, not all genes are related to the disease under study. In fact, most genes are “household” genes with functions to maintain normal metabolic processes within healthy human bodies. Mixing genes and pathways for normal life processes with those leading to the target disease in community detection models will introduce noise to diseasegenerating pathways which are the true interests of clinicians and biologists. De novo mutations refer to gene mutations that occur for the first time in a family compared to mutations inherited from parents. We believe that discrepancy in the numbers of de novo mutations on the same gene in patients and the number in healthy controls would help differentiate genes related to the disease from those unrelated to the disease, which we call the “background”. The three kinds of data, gene expression, proteinprotein interaction and number of de novo mutations, can be downloaded from different online data consortiums and combined using unique gene names.
The stochastic blockmodel is the most used statistical tool for modeling and detecting communities (Holland et al., 1983; Snijders and Nowicki, 1997; Nowicki and Snijders, 2001). We model the relationship between the unobserved indicator whether a gene is related to the target disease or not and genespecific covariates by logistic regression in the first stage, then cluster diseaserelated genes into several pathways in the second stage. Both indicators for disease relevance in the first stage and community labels in the second stage are latent variables, and the expectationmaximization algorithm is employed. However, this approach is intractable due to the numerous possible label assignments in the Estep. Amini et al. (2013) proposed a fast pseudolikelihood algorithm for fitting blockmodels and we adapt this algorithm in Section 3 to the joint pseudolikelihoods incorporating both the logistic regression and the block models. The pseudolikelihood may also be optimized by other alternative approaches such as the EMM algorithm by Gormley and Murphy (2008).
Another distinct feature of the proposed method is the extension to the robust community detection allowing heterogeneous linkage probabilities in the background, which relaxes the assumption of homogeneous linkage probability within each group in the stochastic blockmodel. For instance, the background can be a mixture of multiple strongly or weakly connected groups. These groups all belong to the background because they are not related to the target disease, but their linkage rates are not necessarily homogeneous. In Section 4, we further develop the model in section 3 to allow for arbitrary linkage patterns within the background. Interestingly, when the linkage probabilities within the background are unspecified, the pseudolikelihood algorithm can be modified to leave the likelihood of the links in the background out while the classical likelihood approach cannot.
Recently there have been works on community detection which utilize covariates information. These papers use the additional covariates information to improve the accuracy of community detection. Some papers combine a similarity or kernel matrix based on covariates with the adjacency matrix (Binkiewicz et al., 2014; Zhang et al., 2015; Yan and Sarkar, 2016; Xu et al., 2012). Other papers build likelihoods of linkage probabilities incorporating auxiliary nodal information (Tallberg, 2004; Yang et al., 2013; Newman and Clauset, 2016; Handcock et al., 2007; Krivitsky et al., 2009; Gormley and Murphy, 2010). However, none of these works follow the same framework as our method. In short, in our method, the sole reason of using auxiliary information on nodal characteristics is to distinguish the disease related nodes from unrelated ones, then we carry out community detection within the diseaserelated nodes. On the contrary, in the literature, auxiliary information is used to facilitate partition of all nodes into communities. For example, Tallberg (2004) used covariates to predict the probabilities into each homogeneous community in a Bayesian framework, while we use covariates to predict the probability into the heterogeneous background in a pseudolikelihood framework.
2 Methods
We begin by introducing the data structure and notation. A network with nodes can be represented by an adjacency matrix , where
In addition to the adjacency matrix , some covariate information on nodes is also available. These covariates are represented by an matrix , where denotes the value of the th covariate on node .
We model networks with a particular community structure where the network is composed of multiple cohesive communities, together with some background nodes. Unlike the usual definition of background set which is diffuse within itself or weakly connected to other parts of the network (Zhao et al., 2011), we assume that the probability of a node belonging to the background set depends on its covariates. Suppose there are communities besides the background set. Let denote the community that each of the nodes/genes belongs to, thus if nodes belongs to community , for , and if node is a background gene. Moreover, let
be a vector indicating whether the node belongs to one of the
communities or the background, i.e. if , otherwise.The network is generated in three steps.

The random variable
is independent for and follows a logistic regressionwhere is the coefficients vector, and is the th row of . Here the logistic model has an intercept, that is, the first column of is .

The probability that a node with belongs each of the communities is given by the independent multinomial distribution with parameter ,
In addition, if .

Conditional on the labels, for are independent Bernoulli variables with
where is a symmetric matrix.
The total number of genes in the th community is and the number of links between the th and th commuity is given by , where is the indicator function. Moreover, let if , and . Then the joint loglikelihood of and is
(2.1) 
3 Estimating Procedures
The community labels are unobserved in a community detection problem. Furthermore, the Estep of such algorithm requires evaluating all possible label assignments, which makes the algorithm intractable (Amini et al., 2013; Zhao et al., 2012). We adopt the idea of pseudolikelihood in Amini et al. (2013) which partitions each row of into blocks and assumes the independence between rows.
We use the same notation as those in Amini et al. (2013). The vector denotes an initial blocking vector, where . And denotes the number of edges associated with node in the th block, that is, . Let and , where is the expected total number of edges in the th block for a node in community , i.e., . When is large,
can be approximated by a Poisson distribution given
, and the dependence of between different rows is weak. Assuming are independence for and and using the Poisson approximation, the logpseudolikelihood of and (up to a constant) iswhere . And the loglikelihood for the marginal distribution of (up to a constant) is
(3.1) 
Given initial labels , equation (A.1) can be maximized by a standard expectationmaximization algorithm. The details of the Estep and Mstep are given in Algorithm 1.
Algorithm 1: (The expectationmaximization algorithm under Poisson distribution)

Estep: Let and
be the estimates at the current iteration, and
. The posterior probability of label assignment is

Mstep: Given , and can be updated by the following formulae,
can be updated by solving the logistic regression,
Note is the sum of the estimated conditional probabilities of gene belonging to one of the communities.
Once the expectationmaximization algorithm converges, we can update the labels by . We repeat this procedure several times until becomes stable.
Amini et al. (2013) also introduced a pseudolikelihood conditional on the node degrees. We generalize this conditional pseudolikelihood to our scenario. Denote the node degree by . Then follows multinomial distribution conditional on label and . The multinomial log pseudolikelihood (up to a constant) is
(3.2)  
where is the parameter in the multimomial distribution satisfying .
The algorithm is similar to that for the Poisson pseudolikelihood. We give the details of the expectationmaximization algorithm under the multinomial distribution in Algorithm 2.
Algorithm 2: (The expectationmaximization algorithm under multinomial distribution)

Estep: Based on current estimates and , the posterior probability of label assignment is

Mstep: Given , , and can be updated by
4 Robust Community Detection
So far we assume that all the diseaserelated communities and the background satisfy the stochastic blockmodel assumption. In this section, we propose a new pseudolikelihood method that allows for arbitrary structure in the background, for example, a mixture of strongly and weakly connected groups, or nodes with high degree variations. In other words, we keep the stochastic blockmodel assumption in the diseaserelated communities, but make no assumption on the structure within the background. A network with the heterogeneous background is generated in three steps, of which the first two steps are identical to the first two steps in Section 2. The last step has been modified as follows.
Step 3: Conditional on the labels, when or , for are independent Bernoulli variables with
The link probabilities within the background set, i.e., when and , are not specified.
The joint likelihood (2) cannot be used as the criteria to estimate c in this situation because it is maximized when all nodes belong to group . By contrast, the pseudolikelihood method introduced in Section 3 can be extended to this new scenario. Recall the setup in Section 3. Let be an initial blocking vector. And denotes the number of edges associated with node in the th block . When is a reasonable initial vector, can be approximated by a mixture of Poisson distributions as before when . However, when , the distribution of is unknown since the link probabilities within the background are unspecified. Therefore, we exclude this part of unreliable information, and propose the following pseudolikelihood for robust community detection,
(4.1) 
where .
Notice equation (4) is indeed a valid likelihood function conditional on because the blocking vector and the community labeling vector are treated differently in our algorithm. The proof of its identifiability is given in the supplementary material. The blocking vector partitions the columns of into blocks and is the sum in the th row th block. Likelihood (4) does not include – the last column of since the Poisson approximation are inappropriate. But this does not affect the range of , which is still . Community detection based on (4) can be viewed as a classic clustering problem on . We need to assign a label from 1 to to each row data point, i.e., each , which contains features. But we only use the first features since the last one is not reliable. Then we update the labelling of each gene in the columns and iterate several times. In each iteration, because the groups in the columns and the grouping results in the rows are considered separately, dropping one column of noise will not result in all genes (rows) falling into the th group. After each iteration of the outer loop updating , genes with similar linkage probabilities with the first
groups are classified into the same group with higher and higher accuracy.
The algorithm is therefore similar to Algorithm 1 and given in the following.
Algorithm 3: (The expectationmaximization algorithm for robust community detection)

Estep: Let and be the estimates at the current iteration, and . The posterior probability of label assignment is

Mstep: Given , , and can be updated by,
As before, once the expectationmaximization algorithm converges, is updated by . We repeat this procedure until becomes stable.
We do not consider robust community detection using multinomial approximation because the condition is invalid if the last column is removed.
5 Asymptotic Properties
In this section we study the consistency under stochastic blockmodels. Equation (3.2) has slightly simpler form and theoretical derivations than (A.1). The theoretical analysis in this section will focus on the multinomial pseudolikelihood.
We begin with the setup, which closely follow Amini et al. (2013). The true community labels are the parameters of interests, where . We focus on the case of directed blockmodel. A coupling technique can be used to extend the result to the undirected case analogous to that in Amini et al. (2013). Consider the edge matrix
where . Here and remain constant, while can scale with . The directed blockmodel assumes that all the entries in the adjacency matrix are independent Bernoulli variables without forcing to be symmetric, that is, . For simplicity, a univariate covariate taking values in is assumed.
We illustrate the consistency of onestep expectationmaximization of the multinomial pseudolikelihood. Starting from some initial labels and initial estimates , of the parameters , and , the initial estimates of and are obtained from the logistic regression, that is,
Define
Let
and be the 2 by 2 matrix with entries given by . The initial estimates is obtained by row normalization of , that is,
With the notation defined above, the output of onestep expectationmaximization is
We use the misclassification error rate (Choi et al., 2012; Zhao et al., 2012; Amini et al., 2013) to measure the performance of . That is, define
where is the set of permutations of . In this definition we consider all values that are permutations of each other because they result in the same community structure.
Consider the class of initial labels that correctly classify the node as a member of community . The fraction of such nodes among all nodes belonging to community , , is formally given by
where is the size of community .
An extra condition is introduced to avoid perfect separation of in the logistic fit. We define the following class
where is the size of initial estimate of community .
The uniform consistency of within the class is established by the following theorem.
Theorem 5.1 (Main result).
Assume and . Then under some regularity condition, with sufficiently large , and , for any ,
The details of the regularity condition and the proof is given in the supplementary material.
The proof of the main theorem depends on a key fact that the log ratio of the estimated probabilities and has a uniform bound independent with , for . This is summarized in the following lemma.
Lemma 5.1.
Assume . Then if , there exist such that for sufficiently large ,
where is independent with .
The proof is given in the supplementary material.
6 Simulations
We first examine the performance of the proposed methods under standard stochastic blockmodel. Each network contains nodes and each setup is repeated 500 times. There are three groups including two diseaserelated communities and one diseaseirrelevant background set. The probability a gene is related to the disease follows a logistic regression with . Here is the indicator for the th node belonging to a diseaserelated community and covariate . And correspond to the percentages background , and , respectively. Nodes with are assigned to two nonoverlapping communities with equal probabilities . Pairs within the background, as well as pairs composed of one node in the background and the other node in a diseaserelated community are linked with probability . The linkage probability between the two nonbackground communities is 005, while the linkage probability for pairs within the same community ranges from 015 to 025.
With Logistic Models  Without Logistic Models  
Poisson  Multinomial  Robust  Poisson  Multinomial  Robust  
Background Nodels  
15  58 (12)  57 (13)  59 (12)  15 (7)  15 (8)  15 (8) 
16  66 (8)  66 (9)  67 (8)  23 (11)  24 (11)  23 (11) 
17  72 (7)  72 (6)  73 (7)  34 (13)  33 (13)  33 (13) 
18  77 (5)  76 (5)  77 (5)  48 (14)  45 (13)  46 (15) 
19  81 (5)  80 (5)  81 (4)  61 (11)  55 (13)  60 (11) 
20  85 (4)  83 (4)  85 (4)  70 (8)  66 (9)  70 (9) 
21  88 (3)  86 (4)  88 (3)  78 (6)  73 (7)  78 (7) 
22  91 (3)  88 (3)  91 (3)  83 (4)  79 (6)  83 (5) 
23  93 (3)  90 (3)  93 (3)  87 (4)  83 (5)  87 (4) 
24  94 (2)  92 (3)  94 (2)  90 (3)  86 (4)  90 (3) 
25  96 (2)  93 (2)  96 (2)  93 (3)  89 (3)  93 (3) 
Background Nodels  
15  74 (5)  74 (5)  74 (5)  44 (10)  44 (10)  43 (11) 
16  78 (4)  78 (4)  79 (4)  56 (8)  56 (8)  55 (10) 
17  82 (4)  82 (4)  82 (4)  66 (6)  64 (7)  66 (7) 
18  86 (3)  85 (4)  86 (3)  74 (6)  72 (6)  74 (6) 
19  89 (3)  88 (3)  89 (3)  80 (5)  78 (5)  80 (5) 
20  91 (3)  90 (3)  92 (3)  86 (4)  82 (4)  86 (4) 
21  94 (2)  92 (3)  94 (2)  89 (3)  86 (4)  89 (3) 
22  95 (2)  93 (2)  95 (2)  92 (3)  89 (3)  92 (3) 
23  96 (2)  95 (2)  97 (2)  94 (2)  91 (3)  94 (2) 
24  98 (1)  96 (2)  97 (1)  96 (2)  93 (3)  96 (2) 
25  98 (1)  96 (2)  98 (1)  97 (1)  94 (2)  97 (1) 
Background Nodels  
15  82 (4)  82 (4)  82 (3)  67 (6)  67 (6)  67 (6) 
16  86 (3)  86 (3)  86 (3)  74 (5)  74 (5)  74 (5) 
17  89 (3)  88 (3)  89 (3)  81 (4)  80 (4)  80 (4) 
18  91 (3)  91 (3)  91 (3)  85 (3)  84 (4)  85 (3) 
19  94 (2)  92 (2)  94 (2)  89 (3)  87 (3)  89 (3) 
20  96 (2)  94 (2)  96 (2)  92 (2)  90 (3)  92 (2) 
21  97 (1)  95 (2)  97 (1)  95 (2)  92 (2)  95 (2) 
22  98 (1)  96 (2)  98 (1)  96 (2)  94 (2)  96 (2) 
23  98 (1)  97 (1)  98 (1)  97 (1)  95 (2)  97 (1) 
24  99 (1)  97 (1)  99 (1)  98 (1)  96 (2)  98 (1) 
25  99 (1)  98 (1)  99 (1)  99 (1)  97 (2)  99 (1) 

under stochastic blockmodels. Numbers within parentheses are empirical standard deviations of ARI
.Table 1 compares the performance of three models  the pseudolikelihood methods with Poisson and multinomial approximation introduced in Section 3 as well as the robust community detection method introduced in Section 4. For each model, we further compare the two versions where auxiliary nodal information, i.e, logistic regression, is either used or unused. The community detection accuracy is measured by the adjusted rand index (ari) (Vinh et al., 2010), a widelyused measure for comparing two partitions. The value of the index is 0 for two independent partitions, and higher values indicate better agreement. The performance of all methods improves as the linkage probability within diseaserelated community increases, or as the percentage of background nodes decreases. More importantly, the proposed method incorporating auxiliary information through logistic regression always outperforms its counterpart without logistic regression. Moreover, the robust method gives the same performance as the Poisson pseudolikelihood which suggests the robust method does not lose discriminatory accuracy when data follow standard stochastic block models. On the other hand, the algorithm fitting multinomial distributions performs slightly worse than the other two methods. Rigorously speaking, the multinomial pseudolikelihood is an approximation to the degreecorrected blockmodel, which is a generalization of standard blockmodel by allowing more variation on degrees (Zhao et al., 2012; Karrer and Newman, 2011; Amini et al., 2013). Therefore, the finite sample performance of multinomial pseudolikelihood has slightly lower ARI on average since it fits a more complicated model.
Next we consider the setup with heterogeneous background nodes. For any node in background, we generate from . The linkage probability between a background node and a diseaserelated node is . For two background nodes and , the linkage probability is . The rest of the model setups such as the generation mechanism of communities labels, the linkage probabilities within/between communities and linkage probabilities between a community and the background remain the same.
With Logistic Models  Without Logistic Models  
Poisson  Multinomial  Robust  Poisson  Multinomial  Robust  
Background Nodes  
15  20 (11)  58 (14)  54 (21)  15 (6)  17 (8)  12 (10) 
16  23 (13)  65 (9)  63 (18)  18 (5)  23 (10)  17 (12) 
17  25 (13)  70 (8)  69 (16)  20 (5)  30 (12)  24 (17) 
18  29 (14)  74 (6)  76 (11)  23 (5)  39 (13)  31 (21) 
19  35 (18)  78 (5)  81 (8)  24 (5)  50 (12)  43 (26) 
20  39 (20)  80 (5)  85 (6)  27 (6)  57 (12)  50 (27) 
21  43 (23)  83 (5)  88 (5)  29 (5)  63 (10)  61 (27) 
22  48 (25)  85 (4)  91 (3)  30 (6)  66 (10)  71 (25) 
23  53 (27)  86 (4)  93 (3)  32 (7)  69 (10)  78 (22) 
24  60 (29)  87 (4)  94 (2)  34 (9)  72 (10)  84 (19) 
25  67 (30)  89 (4)  95 (2)  37 (13)  74 (10)  89 (15) 
Background Nodes  
15  62 (19)  73 (5)  74 (5)  34 (12)  44 (9)  42 (12) 
16  70 (15)  77 (4)  79 (4)  41 (15)  53 (9)  55 (11) 
17  75 (14)  81 (4)  83 (4)  46 (15)  62 (8)  66 (8) 
18  81 (10)  84 (4)  86 (3)  52 (15)  69 (6)  74 (7) 
19  85 (9)  86 (4)  89 (3)  58 (15)  74 (6)  80 (6) 
20  89 (7)  89 (3)  92 (3)  63 (17)  78 (5)  85 (5) 
21  92 (4)  90 (3)  93 (2)  72 (18)  82 (5)  89 (3) 
22  95 (2)  92 (3)  95 (2)  82 (16)  84 (5)  92 (2) 
23  96 (2)  93 (2)  96 (2)  88 (15)  87 (4)  94 (2) 
24  97 (2)  94 (2)  97 (1)  93 (10)  88 (4)  96 (2) 
25  98 (1)  94 (2)  98 (1)  96 (7)  90 (4)  97 (1) 
Background Nodes  
15  81 (5)  82 (4)  82 (4)  65 (8)  66 (6)  67 (7) 
16  85 (4)  85 (3)  86 (3)  71 (7)  72 (5)  74 (5) 
17  89 (3)  88 (3)  89 (3)  77 (7)  78 (5)  81 (4) 
18  91 (3)  90 (3)  91 (2)  83 (6)  82 (4)  85 (4) 
19  93 (2)  92 (3)  94 (2)  88 (4)  86 (4)  90 (3) 
20  95 (2)  93 (2)  95 (2)  91 (3)  88 (3)  92 (3) 
21  96 (2)  94 (2)  97 (2)  94 (2)  90 (3)  94 (2) 
22  98 (1)  95 (2)  98 (1)  96 (2)  92 (3)  96 (2) 
23  98 (1)  96 (2)  98 (1)  97 (2)  93 (2)  97 (1) 
24  99 (1)  97 (2)  99 (1)  98 (1)  94 (2)  98 (1) 
25  99 (1)  97 (1)  99 (1)  99 (1)  95 (2)  99 (1) 
The ari of the six methods are shown in Table 2 and Figures 1  3. Similar to what we observed in Table 1, the average ARIs of all methods increases as the linkage probability within community increases, or as the percentage of background nodes decreases. And the method with logistic regression outperforms their counterparts without logistic regression. The robust method with logistic regression gives the best performance in most scenarios. The Poisson pseudolikelihood has the worst performance when the stochastic blockmodel assumption is violated in the heterogeneous background. Especially, under the case of high percentage of background nodes, the Poisson pseudolikelihood performs poorly even when the linkage probability within community is much higher than the linkage probability between communities. The multinomial pseudolikelihood slightly outperforms the robust method when the percentage of background nodes is high, in which case the robust method discards lots of information, while the multinomial pseudolikelihood (or correspondingly degree corrected stochastic blockmodel) accounts for high variations on degrees. On the other hand, the robust method outperforms the multinomial pseudolikelihood in all the other cases. In summary, the robust method has the best performance in terms of both accuracy and efficacy in almost all the setups we examined regardless the data follows stochastic blockmodels or not. In the only exception where the multinomial pseudolikelihood method with logistic regression performs slightly better, the discrepancies between the two methods are small. Therefore, the robust community detection method is our recommended method.
In real applications, the number of communities is often unknown a priori. Saldana et al. (2017) proposed a modified Bayesian information criterion (BIC) for community detection:
Daudin et al. (2008) proposed another model selection criterion – integrated classification likelihood (ICL) with a heavier penalty:
We use simulation studies to verify the performance of BIC and ICL for our case. Since BIC and ICL are designed for the stochastic blockmodel, we compute by (2) in the present studies although is estimated by the robust method.
We follow the aforementioned setup for heterogeneous background nodes and only consider the case with 50% background nodes. For each network, we vary the assumed number of communities from to , and report in the first three columns of Table 3 the percentages of selected numbers of communities in 50 replicates by BIC and ICL, respectively. Both BIC and ICL perfectly identifies the true community number ().
In the last set of simulation studies, we consider the model selection for networks with 5 clusters plus the background. The setup is the same as the previous study except that and . With this setup, the average size of clusters is approximately 125 as in the previous study. The results are shown in the last three column of Table 3: BIC and ICL almost perfectly identifies the community number except for one replicate.
BIC  ICL  BIC  ICL  

1  0  0  1  0  0 
2  1  1  2  0  0 
3  0  0  3  0  0 
4  0  0  4  0  0 
5  0  0  5  0.98  0.98 
6  0  0  6  0.02  0.02 
7  0  0  7  0  0 
8  0  0  8  0  0 
7 Application
With the development of improved sequencing techniques, more and more de novo mutations in candidate genes associated with neurodevelopmental or neuropshychiatric diseases are being reported. Here we focus on autism spectrum disorder and related neurological disorders. Most identified de novo mutations are rare and patients with the same clinical symptoms often carry heterogeneous mutation loci on different genes. Most probably, the pathophysiology mechanism underpinning autism involves perturbed molecular pathways. There is evidence of enrichment of de novo mutations in gene groups connected by proteinprotein interactions, coexpression patterns, or pathways defined by common functions, annotations or evolutional patterns (Allen et al., 2013). Our study targets at interactive groups of biomarkers that form biological pathways related to autism origination and development.
Autism and related disorder data from Hormozdiari et al. (2015) are employed, which reports four types of information (clinically diagnosed disease status, RNA expression levels, de novo mutations, proteinprotein interactions) from three major data consortiums including BrainSpan Atlas, published autism studies, proteinprotein interaction databases. There are 52,801 verified proteinprotein interaction links and 192,499 mRNA pairs with Pearson’s correlation coefficient between their expression levels higher than 05, with an overlap of 1060 links. Together, there are 244,240 unique links from both data sources. These links involve 13,243 genes. Hormozdiari et al. (2015) further gathered the de novo mutation and length information on 796 out of the 13,243 genes. In total, 796 genes with de novo mutations are employed in our analysis with 1334 mutual links between them, among which 602 genes have at least one link and 194 have none.
Synonymous mutations that differ at the DNA level but produce the same protein products are excluded. The frequencies of each type of mutation in a gene in all cases are summed up as well as the total number in the controls. Two covariates are employed in estimating the probability that a gene is involved in the occurrence or progression of autism and related neurological disorders – frequency of missense or loss of function mutations in cases, and number of mutations in controls. The choice of the covariates is based on biological beliefs on their involvement on autism development, hence decided a priori.
As in the simulation study, the BIC (Saldana et al., 2017) and ICL (Daudin et al., 2008) select the same number of communities for this data – five autism related modules plus one irrelevant background group produces.
We then use three community detection methods to cluster genes into seven clusters: the robust community detection (Section 4), the pseudolikelihood method with nodal covariates (Section 3) and the standard stochastic blockmodel fitted by the profilelikelihood (Bickel and Chen, 2009). We run the algorithm with a number of random initial values for the pseudolikelihood method with nodal covariates and pick the solution with the largest value of the likelihood (2). For a fair comparison, we use this solution as the initial value for the other two methods.
Table 4 shows the estimated link probabilities within gene groups and between gene group pairs for the five autismrelated gene groups (group 15) and the background gene group (group 6). We compare the estimates from the three methods side by side. According to the table, the standard stochastic blockmodel classifies the nodes with zero connection as a cluster. The pseudolikelihood method with nodal covariates gives a very similar partition. The adjusted Rand index for the partitions of these two methods are 0.963. Higher values of this index indicate better agreement and 1 means perfect agreement (Hubert and Arabie, 1985). The robust method gives a more different partition. The adjusted Rand index for the robust method and the standard stochastic blockmodel is 0.635, which may result from the fact that the robust method allows heterogeneous linkage rates in background genes while the two stochastic blockmodels do not. Table 4
shows the estimated link probabilities for the three methods. Furthermore, the estimated odds ratio from the robust method for mutation numbers in cases and mutation numbers in controls are 1.4874 (Pvalue=0.2808) and 1.0335 (Pvalue=0.0173), respectively. On the contrary, the two odds ratio estimates are 0.5332 (Pvalue=0.0419) and 0.2464 (Pvalue=0.5770) from the stochastic blockmodels, which disagrees with the prior that genes with higher number of mutations in cases are more likely to be related to neurological disorders. Therefore, we employ the results from the robust method.
Stochastic Blockmodel:  

Group 1–5  Group 6  
0.377  0.014  0  0.002  0.003  0.002 
0.014  0.410  0  0.001  0.034  0.069 
0.000  0.000  0  0.000  0.000  0.000 
0.002  0.001  0  0.009  0.004  0.000 
0.003  0.034  0  0.004  0.027  0.081 
0.002  0.069  0  0.000  0.081  0.529 
Pseudo Likelihood with Covariates:  
Group 1–5  Group 6  
0.414  0.014  0  0.002  0.004  0.002 
0.014  0.433  0  0.003  0.046  0.065 
0.000  0.000  0  0.000  0.000  0.000 
0.002  0.003  0  0.007  0.006  0.003 
0.004  0.046  0  0.006  0.035  0.095 
0.002  0.065  0  0.003  0.095  0.514 
Robust Community Detection:  
Group 1–5  Group 6  
0.414  0.002  0.000  0.004  0.012  0.000 
0.002  0.618  0.000  0.006  0.190  0.028 
0.000  0.000  0.000  0.001  0.000  0.003 
0.004  0.006  0.001  0.009  0.009  0.027 
0.012  0.190  0.000  0.009  0.087  0.055 
0.000  0.028  0.003  0.027  0.055  0.104 
The gene set enrichment analysis (GSEA) of the selected gene modules compared with the curated gene sets in the Molecular Signatures Database are listed in Table 5. Pvalues are calculated assuming a hypergeometric distribution for the number of overlapping genes between the selected group and the curated gene set. Given the large number of multiple comparisons, stringent Pvalue threshold
is employed. The five autismrelated groups overlaps significantly with gene sets in essential cellular functions or abnormal conditions such as cancer, apoptosis, cell structure, circulatory system, nervous system, multicellular organismal development. Group four overlaps with gene sets related to neurological functions or disorders. Gene set “GO NEUROGENESIS” are genes involved in generation of cells within the nervous system. Gene set “GO REGULATION OF NERVOUS SYSTEM DEVELOPMENT” concerns processes that modulate the frequency, rate or extent of nervous system development, the origin and formation of nervous tissue. Gene sets “GO NEURON PROJECTION” and “GO SYNAPSE” are composed of genes involved in nerve cell prolongation and nerve fiber junction, respectively. Furthermore, our results are compared to those from the Merging Aaffected Genes into IntegratedNnetworks method in
Hormozdiari et al. (2015). The Merging Affected Genes into IntegratedNetworks method was not able to detect group four. Pvalues from gene set enrichment analysis for the two best sets identified by their method against known neurodevelopmental diseases sets are 42 and 10, failing to reach the threshold.Group  Gene Set  Group  GeneSet  Overlap  Nominal  FDR 

Number  Name  Size  Size  Size  Pvalue  qvalue 
1  FISCHER DREAM TARGETS  18  929  16  101  107 
1  GOBERT OLIGODENDROCYTE DIFFERENTIATION  18  570  11  286  152 
1  DUTERTRE ESTRADIOL RESPONSE 24HR UP  18  324  9  177  629 
1  FISCHER G2 M CELL CYCLE  18  225  8  122  326 
1  PUJANA BRCA2 PCC NETWORK  18  423  9  197  419 
1  PUJANA XPRSS INT NETWORK  18  168  7  237  420 
1  GEORGES TARGETS OF MIR192 AND MIR215  18  893  10  278  423 
1  NUYTTEN EZH2 TARGETS DN  18  1024  10  108  143 
1  PUJANA CHEK2 PCC NETWORK  18  779  9  468  554 
1  GO CHROMOSOME  18  880  9  138  148 
2  GO CHROMOSOME ORGANIZATION  29  1009  11  131  862 
2  GRAESSMANN APOPTOSIS BY DOXORUBICIN DN  29  1781  13  162  862 
2  DACOSTA UV RESPONSE VIA ERCC3 DN  29  855  10  686  244 
3  GO INTRINSIC COMPONENT OF PLASMA MEMBRANE  518  1649  71  489  521 
Comments
There are no comments yet.