Community detection refers to the problem of identifying communities in networks, e.g., circles of friends in social networks, or groups of genes in graphs of gene co-occurrences (Lancichinetti and Fortunato, 2009; Newman and Girvan, 2004; Bickel and Chen, 2009; Newman, 2006; Reichardt and Bornholdt, 2006; Girvan and Newman, 2002). Although fueled by the increasing importance of graph models and network structures in applications, and the emergence of large-scale social networks on the Internet, the topic is much older in the social sciences, and the algorithmic aspect is very closely related to graph partitioning, a longstanding area in computer science. We refer the reader to the comprehensive survey paper of Fortunato (2010) for more examples and references.
By community detection we mean, here, something slightly different. Instead of aiming at extracting the community (or communities) from within the network, we simply focus on deciding whether or not there is a community at all. Therefore, instead of considering a problem of graph partitioning, or clustering, we consider a problem of testing statistical hypotheses. We observe an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N := |\mathcal{V}|$ nodes. Without loss of generality, we take $\mathcal{V} = \{1, \dots, N\}$. The corresponding adjacency matrix is denoted $W = (W_{i,j}) \in \{0,1\}^{N \times N}$, where $W_{i,j} = 1$ if, and only if, $(i,j) \in \mathcal{E}$, meaning there is an edge between nodes $i, j \in \mathcal{V}$. Note that $W$ is symmetric, and we assume that $W_{i,i} = 0$ for all $i$. Under the null hypothesis, the graph $\mathcal{G}$ is a realization of $\mathbb{G}(N, p_0)$, the Erdős-Rényi random graph on $N$ nodes with probability of connection $p_0 \in (0,1)$; equivalently, the upper diagonal entries of $W$ are independent and identically distributed with $\mathbb{P}(W_{i,j} = 1) = p_0$ for any $i < j$. Under the alternative, there is a subset of nodes indexed by $S \subset \mathcal{V}$ such that $\mathbb{P}(W_{i,j} = 1) = p_1$ for any $i, j \in S$ with $i < j$, while $\mathbb{P}(W_{i,j} = 1) = p_0$ for any other pair of nodes $i < j$. We assume that $p_1 > p_0$, implying that the connectivity is stronger between nodes in $S$, so that $S$ is an assortative community. The subset $S$ is not known, although in most of the paper we assume that its size $n := |S|$ is known. Let $H_0$ denote the null hypothesis, which consists of $\mathbb{G}(N, p_0)$ and is therefore simple. And let $H_S$ denote the alternative where $S$ is the anomalous subset of nodes. We are testing $H_0$ versus $H_1 := \bigcup_{|S| = n} H_S$. We consider an asymptotic setting where
$$N \to \infty, \quad n \to \infty, \quad n/N \to 0, \quad n/\log N \to \infty, \tag{1}$$
meaning the graph is large in size, and the subgraph is comparatively small, but not too small. Also, the probabilities of connection, $p_0$ and $p_1$, may change with $N$; in fact, they will tend to zero in most of the paper.
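To make the two hypotheses concrete, the following is a minimal simulation sketch. It assumes the notation $N$ (number of nodes), $p_0$ (baseline connection probability), $S$ (anomalous subset) and $p_1$ (connection probability inside $S$); the function name `sample_graph` and its signature are hypothetical and serve only as an illustration.

```python
import random
from itertools import combinations


def sample_graph(N, p0, S=(), p1=None, seed=None):
    """Sample a symmetric 0/1 adjacency matrix with zero diagonal.

    With S empty this draws from the null (Erdos-Renyi with probability p0).
    Otherwise pairs inside S connect with probability p1, all others with p0.
    Names and signature are illustrative assumptions, not the paper's code.
    """
    rng = random.Random(seed)
    S = set(S)
    W = [[0] * N for _ in range(N)]
    for i, j in combinations(range(N), 2):
        # Only pairs with both endpoints in S use the elevated probability.
        p = p1 if (i in S and j in S) else p0
        if rng.random() < p:
            W[i][j] = W[j][i] = 1
    return W
```

For instance, `sample_graph(200, 0.01, S=range(20), p1=0.3, seed=0)` draws one graph under the alternative with a planted community on the first 20 nodes.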
Despite its potential relevance to applications, this problem has received considerably less attention. We mention the work of Wang et al. (2008) who, in a somewhat different model, propose a test based on a statistic similar to the modularity of Newman and Girvan (2004); the test is evaluated via simulations. Sun and Nobel (2008) consider the problem of detecting a clique, a problem that we addressed in detail in our previous paper (Arias-Castro and Verzelen, 2012), and which is a direct extension of the ‘planted clique problem’ (Feige and Ron, 2010; Alon et al., 1998; Dekel et al., 2011). Rukhin and Priebe (2012) consider a test based on the maximum number of edges among the subgraphs induced by the neighborhoods of the vertices in the graph; they obtain the limiting distribution of this statistic in the same model we consider here, with and fixed, and is a power of , and in the process show that their test reduces to the test based on the maximum degree. Closer in spirit to our own work, Butucea and Ingster (2011) study this testing problem in the case where and are fixed. A dynamic setting is considered in (Heard et al., 2010; Mongiovì et al., 2013; Park et al., 2013), where the goal is to detect changes in the graph structure over time.
1.1 Hypothesis testing
We start with some concepts related to hypothesis testing. We refer the reader to (Lehmann and Romano, 2005) for a thorough introduction to the subject. A test $\phi$ is a function that takes the adjacency matrix $W$ as input and returns $\phi = 1$ to claim there is a community in the network, and $\phi = 0$ otherwise. The (worst-case) risk of a test $\phi$ is defined as
$$\gamma_N(\phi) := \mathbb{P}_0(\phi = 1) + \max_{|S| = n} \mathbb{P}_S(\phi = 0), \tag{2}$$
where $\mathbb{P}_0$ is the distribution under the null $H_0$ and $\mathbb{P}_S$ is the distribution under $H_S$, the alternative where $S$ is anomalous. We say that a sequence of tests $(\phi_N)$ for a sequence of problems $(W_N)$ is asymptotically powerful (resp. powerless) if $\gamma_N(\phi_N) \to 0$ (resp. $\gamma_N(\phi_N) \to 1$). We will often speak of a test being powerful or powerless when in fact referring to a sequence of tests and its asymptotic power properties. Then, practically speaking, a test is asymptotically powerless if it does not perform substantially better than any method that ignores the adjacency matrix $W$, i.e., guessing. We say that the hypotheses merge asymptotically if
$$\gamma_N^* := \inf_{\phi} \gamma_N(\phi) \to 1,$$
and that the hypotheses separate completely asymptotically if $\gamma_N^* \to 0$, which is equivalent to saying that there exists a sequence of asymptotically powerful tests. Note that if $\liminf \gamma_N^* > 0$, no sequence of tests is asymptotically powerful, which includes the special case where the two hypotheses are contiguous.
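Our general objective is to derive the detection boundary for the problem of community detection. On the one hand, we want to characterize the range of parameters such that either all tests are asymptotically powerless (the minimax risk, meaning the smallest worst-case risk achievable by any test, tends to 1) or no test is asymptotically powerful (the minimax risk stays bounded away from 0). On the other hand, we want to introduce asymptotically minimax optimal tests, that is, tests whose risk tends to 0 whenever the minimax risk tends to 0, or stays bounded away from 1 whenever the minimax risk does.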
1.2 Our previous work
We recently considered this testing problem in (Arias-Castro and Verzelen, 2012), focusing on the dense regime where or equivalently . (For , denotes the minimum of and and denotes their maximum.) We obtained information theoretic lower bounds, and we proposed and analyzed a number of methods, both when is known and when it is unknown. (None of the methods we considered require knowledge of .) In particular, a combination of the total degree test based on
and the scan test based on
was found to be asymptotically minimax optimal when is known and when is not too small, specifically . This extends the results that Butucea and Ingster (2011) obtained for and fixed (and known). In that same paper, we also proposed and studied a convex relaxation of the scan test, based on the largest
-sparse eigenvalue of, inspired by related work of Berthet and Rigollet (2012).
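As an illustration of the two statistics discussed above, here is a brute-force sketch. It assumes that the total degree statistic is the total number of edges and that the scan statistic is the largest number of edges in a subgraph induced by a subset of a given size $n$; the function names are hypothetical. The exhaustive maximization is only feasible for very small graphs, and the sketch is not meant as a practical algorithm.

```python
from itertools import combinations


def total_degree(W):
    """Total number of edges, i.e. the sum of W[i][j] over pairs i < j."""
    N = len(W)
    return sum(W[i][j] for i in range(N) for j in range(i + 1, N))


def edges_within(W, S):
    """Number of edges of the subgraph induced by the node subset S."""
    return sum(W[i][j] for i, j in combinations(sorted(S), 2))


def scan_statistic(W, n):
    """Maximum of edges_within over all subsets of size n (exhaustive)."""
    return max(edges_within(W, S) for S in combinations(range(len(W)), n))
```

The cost of `scan_statistic` grows like the binomial coefficient $\binom{N}{n}$, which is one motivation for studying relaxations of the scan test.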
Continuing our work, in the present paper we focus on the sparse regime where
Obviously, (5) implies that . We define
and note that and may vary with . Our results can be summarized as follows.
Regime 1: with fixed . Compared to the setting in our previous work (Arias-Castro and Verzelen, 2012), the total degree test (3) remains a contender, but scanning over subsets of size exactly as in (4) no longer seems to be optimal, all the more so when is small. Instead, we scan over subsets of a wider range of sizes, using
where . We call this the broad scan test. In analogy with our previous results in (Arias-Castro and Verzelen, 2012), we find that a combination of the total degree test (3) and the broad scan test based on (7) is asymptotically optimal when , in the following sense. Suppose with . When , the total degree test is asymptotically powerful when and the two hypotheses merge asymptotically when . (For two sequences of reals, and , we write to mean that .) When , that is, for smaller , there exists a sequence of increasing functions (defined in Theorem 1) such that the broad scan test is asymptotically powerful when and the hypotheses merge asymptotically when . Furthermore, as , when remains fixed, while , and for . As a consequence, the broad scan test is asymptotically powerful when is larger than (up to a numerical constant) . See Table 1 for a visual summary. (For two real sequences, and , we write to mean that , and when and .)
|Undetectable||; Exact Eq. in (55)|
|Detectable||; Exact Eq. in (14)|
|Optimal test||Broad Scan test||Total Degree test|
When and with , the total degree test is optimal, in the sense that it is asymptotically powerful for , while the hypotheses merge asymptotically for . This is why we assume in the remainder of this discussion that with .
Regime 2: with . When , the broad scan test is asymptotically powerful when and the hypotheses merge asymptotically when . See the first line of Table 2 for a visual summary.
Regime 3: and are fixed. The Poissonian regime where and are assumed fixed is depicted on Figure 1. When , the broad scan test is asymptotically powerful. When and , no test is able to fully separate the hypotheses. In fact, for any fixed a test based on the number of triangles has some nontrivial power (depending on ), implying that the two hypotheses do not completely merge in this case. The case where is not completely settled. No test is able to fully separate the hypotheses if . The largest connected component test is optimal up to a constant when and a test based on counting subtrees of a certain size bridges the gap in constants for , but not completely. When is bounded from above and , the two hypotheses merge asymptotically.
Regime 4: with . Finally, when , the largest connected component test is asymptotically optimal. See Table 2.
|Optimal test||Largest CC test||Broad Scan test|
1.4 Methodology for the lower bounds
The derivation of the various lower bounds here relies on the same general approach as in our previous work (Arias-Castro and Verzelen, 2012). Let denote the random graph obtained by choosing uniformly at random among subsets of nodes of size , and then generating the graph under the alternative with being the anomalous subset. When deriving a lower bound, we first reduce the composite alternative to a simple alternative, by testing versus . Let denote the corresponding likelihood ratio, i.e., , where is the likelihood ratio for testing versus . Then these hypotheses merge asymptotically if, and only if, in probability under . A variant of the so-called ‘truncated likelihood’ method, introduced by Butucea and Ingster (2011), consists in proving that and , where is a truncated likelihood of the form , where is a carefully chosen event. (For a set or event , denotes the indicator function of .) An important difference with our previous work is the more delicate choice of , which here relies more directly on properties of the graph under consideration. We mention that we use a variant to show that and do not separate in the limit. This could be shown by proving that the two graph models and are contiguous. The ‘small subgraph conditioning’ method of Robinson and Wormald (1994, 1992) (see the more recent exposition in (Wormald, 1999)) was designed for that purpose. For example, this is the method that Mossel et al. (2012) use to compare an Erdős-Rényi graph with a stochastic block model with two blocks of equal size. (The stochastic block model is a popular model of a network with communities, also known as the planted partition model. In this model, the nodes belong to blocks: nodes in the same block connect with some probability , while nodes in different blocks connect with probability .) This method does not seem directly applicable in the situations that we consider here, in part because the second moment of the likelihood ratio tends to infinity at the limit of detection.
The remainder of the paper is organized as follows. In Section 2 we introduce some notation and some concepts in probability and statistics, including concepts related to hypothesis testing and some basic results on the binomial distribution. In Section 3 we study some tests that are near-optimal in different regimes. In Section 4 we state and prove information theoretic lower bounds on the difficulty of the detection problem. In Section 5 we discuss the situations where and/or are unknown, as well as open problems. Section 6 contains some proofs and technical derivations.
In this section, we first define some general assumptions and some notation, although more notation will be introduced as needed. We then list some general results that will be used multiple times throughout the paper.
2.1 Assumptions and notation
We recall that and the other parameters such as may change with , and this dependency is left implicit. Unless otherwise specified, all the limits are with respect to . We assume that , for otherwise the graph (under the null hypothesis) is so sparse that the number of edges remains bounded. Similarly, we assume that , for otherwise there is a non-vanishing chance that the community (under the alternative) does not contain any edges. Throughout the paper, we assume that and are both known, and discuss the situation where they are unknown in Section 5.
which varies with , and notice that with . The dense regime considered in (Arias-Castro and Verzelen, 2012) corresponds to . Here we focus on the sparse regime where . The case where includes the Poisson regime where is constant.
Recall that is the (undirected, unweighted) graph that we observe, and for , let denote the subgraph induced by in .
We use standard notation such as when ; when ; , or equivalently , when ; when and
. We extend this notation to random variables. For example, ifand are random variables, then if in probability.
For , define and , which are the positive and negative parts of . For an integer , let
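Because of its importance in describing the tails of the binomial distribution, the following function, which is the relative entropy or Kullback-Leibler divergence of $\mathrm{Bern}(q)$ to $\mathrm{Bern}(p)$, will appear in our results:
$$H_p(q) := q \log\frac{q}{p} + (1-q) \log\frac{1-q}{1-p}, \qquad p, q \in (0,1). \tag{10}$$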
We let denote .
2.2 Calibration of a test
We say that the test that rejects for large values of a (real-valued) statistic $T$ is asymptotically powerful if there is a critical value $t$ such that the test $\{T \geq t\}$ has risk (2) tending to 0. The choice of $t$ that makes this possible may depend on $p_1$. In practice, $t$ is chosen to control the probability of type I error, which does not necessitate knowledge of $p_1$ as long as $T$ itself does not depend on $p_1$, which is the case for all the tests we consider here. Similarly, we say that the test is asymptotically powerless if, for any sequence of reals $t$, the risk of the test $\{T \geq t\}$ is at least 1 in the limit.
We prefer to leave the critical values implicit as their complicated expressions do not offer any insight into the theoretical difficulty or the practice of testing for the presence of a dense subgraph. Indeed, if a method can run efficiently, then most practitioners will want to calibrate it by simulation (permutation or parametric bootstrap, when is unknown). Besides, the interested reader will be able to obtain the (theoretical) critical values by a cursory examination of the proofs.
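Calibration by simulation can be sketched as follows, under the assumption that $p_0$ is known so that the null distribution can be sampled directly. The helper names (`sample_null`, `critical_value`) and the default values of `alpha` and `B` are hypothetical; any statistic that maps an adjacency matrix to a real number can be passed in.

```python
import math
import random
from itertools import combinations


def sample_null(N, p0, rng):
    """Draw one adjacency matrix from the Erdos-Renyi null with probability p0."""
    W = [[0] * N for _ in range(N)]
    for i, j in combinations(range(N), 2):
        if rng.random() < p0:
            W[i][j] = W[j][i] = 1
    return W


def total_degree(W):
    """Total number of edges (each edge appears twice in the symmetric matrix)."""
    return sum(map(sum, W)) // 2


def critical_value(statistic, N, p0, alpha=0.05, B=200, seed=0):
    """Empirical (1 - alpha) quantile of the statistic over B null simulations.

    Rejecting when the observed statistic strictly exceeds the returned value
    gives an empirical type I error of at most alpha on the simulated sample.
    """
    rng = random.Random(seed)
    sims = sorted(statistic(sample_null(N, p0, rng)) for _ in range(B))
    k = math.ceil((1 - alpha) * B) - 1
    return sims[min(max(k, 0), B - 1)]
```

A test at level roughly `alpha` then rejects when `statistic(W_observed) > critical_value(statistic, N, p0)`; when $p_0$ is unknown, a permutation or bootstrap scheme would replace `sample_null`.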
2.3 Some general results
Remember the definition of the entropy function in (10). The following is a simple concentration inequality for the binomial distribution.
Lemma 1 (Chernoff’s bound).
For any positive integer , any , we have
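The classical form of Chernoff's bound for the upper tail of a binomial, $\mathbb{P}(\mathrm{Bin}(n, p) \geq qn) \leq \exp(-n H_p(q))$ for $q > p$, with $H_p(q)$ the Kullback-Leibler divergence between Bernoulli distributions, can be checked numerically. The sketch below assumes this standard form; the function names are hypothetical.

```python
import math


def kl_bernoulli(q, p):
    """H_p(q) = q log(q/p) + (1-q) log((1-q)/(1-p)), with 0 log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(q, p) + term(1 - q, 1 - p)


def binomial_upper_tail(n, p, k):
    """Exact P(Bin(n, p) >= k), computed by summing the probability mass."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def chernoff_bound(n, p, q):
    """Upper bound exp(-n H_p(q)) on P(Bin(n, p) >= q n), valid for q > p."""
    return math.exp(-n * kl_bernoulli(q, p))
```

For example, comparing `binomial_upper_tail(50, 0.1, 10)` with `chernoff_bound(50, 0.1, 0.2)` shows how much slack the bound leaves at a moderate deviation.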
Here are some asymptotics for the entropy function.
Define . For , we have
The following are standard bounds on the binomial coefficients. Recall that .
For any integers ,
denotes the hypergeometric distribution counting the number of red balls indraws from an urn containing red balls out of .
is stochastically smaller than , where .
3 Some near-optimal tests
In this section we consider several tests and establish their performances. We start by recalling the result we obtained for the total degree test, based on (3), in our previous work (Arias-Castro and Verzelen, 2012). Recalling the definition of and in (6), define
Proposition 1 (Total degree test).
The total degree test is asymptotically powerful if , and asymptotically powerless if .
In view of Proposition 1, the setting becomes truly interesting when , which ensures that the naive total degree test is indeed powerless.
3.1 The broad scan test
In the denser regimes that we considered in (Arias-Castro and Verzelen, 2012), the (standard) scan test based on defined in (4) played a major role. In the sparser regimes we consider here, the broad scan test based on defined in (7) has more power. Assume that , so that is supercritical under . Then it is preferable to scan over the largest connected component in rather than scan itself.
For any , let denote the smallest solution of the equation . Let denote a largest connected component in and assume that is fixed. Then, in probability, and .
By Lemma 5, most of the edges of lie in its giant component, which is of size roughly . This informally explains why a test based on is more promising than the standard scan test based on .
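The size of the giant component can be illustrated numerically. The sketch below assumes the classical result for a supercritical Erdős-Rényi graph with mean degree $\lambda > 1$: the fraction of nodes in the largest connected component converges to $1 - \eta_\lambda$, where $\eta_\lambda$ is the smallest solution in $[0, 1]$ of $\eta = \exp(\lambda(\eta - 1))$. The function names are hypothetical, and the fixed-point iteration is one simple way among others to solve the equation.

```python
import math


def smallest_fixed_point(lam, tol=1e-13, max_iter=100000):
    """Smallest solution in [0, 1] of eta = exp(lam * (eta - 1)).

    Iterating from eta = 0 gives an increasing sequence that converges to
    the smallest fixed point (eta = 1 is always a solution as well).
    Convergence is fast for lam > 1 and slow when lam is close to 1.
    """
    eta = 0.0
    for _ in range(max_iter):
        new = math.exp(lam * (eta - 1.0))
        if abs(new - eta) < tol:
            return new
        eta = new
    return eta


def giant_component_fraction(lam):
    """Limiting fraction of nodes in the largest component, assuming lam > 1."""
    return 1.0 - smallest_fixed_point(lam)
```

For a mean degree of 2, the giant component holds a little under 80% of the nodes, so scanning it rather than the whole community discards relatively few edges.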
In the details, the exact dependency of the optimal subset size to scan over seems rather intricate. This is why in we scan over subsets of size . (Recall that , although the exact form of is not important.) For any subset , let
Theorem 1 (Broad scan test).
The scan test based on is asymptotically powerful if
Note that the quantity does not depend on or . We shall prove in the next section that the power of the broad scan test is essentially optimal: if
or and , then no test is asymptotically powerful (at least when , so that the total degree test is powerless). Regarding (14), we could not get a closed-form expression of this supremum. Nevertheless, we show in the proof that
If , then
Hence, assuming and are fixed and positive, the broad scan test is asymptotically powerful when . In contrast, the scan test was proved to be asymptotically powerful when (Arias-Castro and Verzelen, 2012, Prop. 3), so that we have improved the bound by a factor larger than and smaller than . When converges to one, it was proved in (Arias-Castro and Verzelen, 2012) that the minimax detection boundary corresponds to (at least when ). Thus, for going to one, both the broad scan test and the scan test have comparable power and are essentially optimal. In the dense case, the broad scan test and the scan test also have comparable power, as shown by the next result, which is the counterpart of (Arias-Castro and Verzelen, 2012, Prop. 3).
Assume that is bounded away from one. The broad scan test is powerful if
The proof is essentially the same as the corresponding result for the scan test itself. See (Arias-Castro and Verzelen, 2012).
Proof of Theorem 1.
First, we control under the null hypothesis. For any positive constant , we shall prove that
since . Consequently,
where the is uniform with respect to . Applying a union bound, we conclude that
We now lower bound under the alternative hypothesis. First, assume that (14) holds, so that there exists a positive constant and a sequence of integers such that eventually. In particular, . We then use (20) in the following concentration result for .
For an integer , define . We have the following deviation inequalities
It follows from Lemma 7 that, with probability going to one under ,
Taking in (18) allows us to conclude that the test based on with threshold is asymptotically powerful.
Now, assume that (15) holds. Because is stochastically increasing in under , we may assume that is fixed. We use a different strategy which amounts to scanning the largest connected component of . Let be a largest connected component of .
For a small to be chosen later, assume that and , which happens with high probability under by Lemma 5. Note that, because , we have , and therefore . Consequently, when computing we scan , implying that
Since above may be taken as small as we wish, and in view of (18), it suffices to show that Since converges to one when goes to one, we have . Consequently, it suffices to show that the function is increasing on . By definition of , we have (since ) and . Consequently, . Hence, is positive if . Recall that is the smallest solution of the equation , the largest solution being . Furthermore, we have for any . To conclude, it suffices to prove . This last bound is equivalent to
The function on the LHS is null for . Furthermore, its derivative is positive for , which allows us to conclude. ∎
Proof of Lemma 7.
The proof is based on moment bounds for functions of independent random variables due to Boucheron et al. (2005) that generalize the Efron-Stein inequality.
Recall that is the subgraph induced by . Fix some integer . For any , define the graph by removing from the edge set of . Let be defined as but computed on , and then let . Observe that and that is a measurable function of , the edge set of . Let be a subset of size such that . Then, we have
where the first equality comes from the fact that .
3.2 The largest connected component
This test rejects for large values of the size (number of nodes) of the largest connected component in , which we denoted .
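The statistic behind this test is straightforward to compute. The following sketch uses a breadth-first search on a dense adjacency matrix, which costs on the order of $N^2$ operations; the function name is hypothetical, and an adjacency-list representation would be preferable for large sparse graphs.

```python
from collections import deque


def largest_component_size(W):
    """Number of nodes in a largest connected component of the graph W."""
    N = len(W)
    seen = [False] * N
    best = 0
    for start in range(N):
        if seen[start]:
            continue
        # Explore the component containing `start` by breadth-first search.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in range(N):
                if W[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
        best = max(best, size)
    return best
```

The test then rejects when this size exceeds a critical value, which can be chosen by simulation under the null.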
3.2.1 Subcritical regime
We first study that test in the subcritical regime where . Define
Theorem 2 (Subcritical largest connected component test).
Assume that , , and . The largest connected component test is asymptotically powerful when or
If we further assume that , then the largest connected component test is asymptotically powerless when for all and