In recent years, the problem of detecting communities in networks has received a large amount of attention, with important applications in the social and biological sciences, among others (Fortunato, 2010). The vast majority of this expansive literature focuses on developing realistic models of (random) networks (Albert and Barabási, 2002; Barabási and Albert, 1999), on designing methods for extracting communities from such networks (Newman, 2006; Reichardt and Bornholdt, 2006; Girvan and Newman, 2002) and on fitting models to network data (Bickel et al., 2011).
The underlying model is that of a graph $G = (V, E)$, where $E$ is the set of edges and $V$ is the set of nodes. For example, in a social network, a node would represent an individual and an edge between two nodes would symbolize a friendship or kinship of some sort shared by these two individuals. In the literature just mentioned, almost all the methodology has concentrated on devising graph partitioning methods, with the end goal of clustering the nodes in $V$ into groups with strong inner-connectivity and weak inter-connectivity (Lancichinetti and Fortunato, 2009; Newman and Girvan, 2004; Bickel and Chen, 2009).
In this euphoria, perhaps the most basic problem of actually detecting the presence
of a community in an otherwise homogeneous network has been overlooked. From a practical standpoint, this sort of problem could arise in a dynamic setting where a network is growing over time and monitored for clustering. From a mathematical perspective, probing the limits of detection (i.e., hypothesis testing) often offers insight into what is possible in terms of extraction (i.e., estimation).
Many existing community extraction methods can be turned into community detection procedures. For example, one could decide that a community is present in the network if the modularity of Newman and Girvan (2004) exceeds a given threshold. To set this threshold, one needs to define a null model. Newman and Girvan (2004) implicitly assume a random graph conditional on the node degrees. Here, we make the simplest assumption that the null model is an Erdős–Rényi random graph (Bollobás, 2001).
In this context, we also touch on another line of work, that of detecting a clique in a random graph — the so-called Planted (or Hidden) Clique Problem (Feige and Ron, 2010; Alon et al., 1998; Dekel et al., 2011). Although the emphasis there is to find the detection performance of computationally tractable algorithms, we mostly ignore computational considerations and simply establish the absolute detection limits of any algorithm whatsoever.
1.1 The framework
We address a stylized community detection problem, where the task is to detect the presence of clustering in the network and is formalized as a hypothesis testing problem. We observe an undirected graph $G = (V, E)$ on $n$ nodes. Without loss of generality, we take $V = \{1, \dots, n\}$. The corresponding adjacency matrix is denoted $W = (W_{ij})$, where $W_{ij} = 1$ if, and only if, $(i, j) \in E$, meaning there is an edge between nodes $i$ and $j$. Note that $W$ is symmetric, and we assume that $W_{ii} = 0$ for all $i$. Under the null hypothesis, the graph is a realization of $\mathcal{G}(n, p_0)$, the Erdős–Rényi random graph on $n$ nodes with probability of connection $p_0$; equivalently, the upper-diagonal entries of $W$ are independent and identically distributed with $\mathbb{P}(W_{ij} = 1) = p_0$ for any $i < j$. Under the alternative, there is a subset of nodes indexed by $S \subset V$ such that $\mathbb{P}(W_{ij} = 1) = p_1$ for any $i, j \in S$ with $i \neq j$, with everything else the same. We assume that $p_1 > p_0$, implying that the connectivity is stronger between nodes in $S$. When $p_1 = 1$, the subgraph with node set $S$ is a clique. The subset $S$ is not known, although in most of the paper we assume that its size $N := |S|$ is known.
We study detectability in this framework in asymptotic regimes where $n \to \infty$, and $N$, $p_0$ and $p_1$ may also change; all these parameters are assumed to be functions of $n$. A test $T$ is a function that takes the adjacency matrix $W$ as input and returns $T = 1$ to claim there is a community in the network, and $T = 0$ otherwise. The (worst-case) risk of a test $T$ is defined as $\gamma(T) = \mathbb{P}_0(T = 1) + \max_{|S| = N} \mathbb{P}_S(T = 0),$
where $\mathbb{P}_0$ is the distribution under the null and $\mathbb{P}_S$ is the distribution under the alternative where $S$ indexes the community. We say that a sequence of tests for a sequence of problems is asymptotically powerful (resp. powerless) if its risk tends to 0 (resp. 1). Practically speaking, a sequence of tests is asymptotically powerless if it does not perform substantially better than random guessing that ignores the adjacency matrix $W$. We will often speak of a test being powerful or powerless when in fact referring to a sequence of tests and its asymptotic power properties.
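To make the framework concrete, here is a small simulation sketch in Python with NumPy. The helper name `sample_graph` is ours, not the paper's; it draws an adjacency matrix under the null and under an alternative with a planted dense subset.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(n, p0, community=None, p1=None):
    """Sample the adjacency matrix of an undirected graph on n nodes.

    Null model: each edge is present independently with probability p0.
    Alternative: edges with both endpoints in `community` are present
    with probability p1 instead.
    """
    W = (rng.random((n, n)) < p0).astype(int)
    if community is not None:
        m = len(community)
        W[np.ix_(community, community)] = (rng.random((m, m)) < p1).astype(int)
    W = np.triu(W, 1)   # keep only entries with i < j
    return W + W.T      # symmetric adjacency matrix, zero diagonal

W0 = sample_graph(100, 0.1)                               # null
W1 = sample_graph(100, 0.1, community=range(10), p1=0.9)  # alternative
```

A test is then any function mapping such a matrix to {0, 1}, and its risk is estimated by averaging over repeated draws from both models.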
1.2 Closely related work
We take the beaten path, following the standard approach in statistics for analyzing such composite hypothesis testing problems, in particular, the work of Ingster (1997) and others (Donoho and Jin, 2004; Ingster and Suslina, 2002; Hall and Jin, 2010)
on the detection of a sparse (normal) mean vector. Most closely related to our work is that of Butucea and Ingster (2011). Specializing their results to our setting, they derive lower bounds and upper bounds for the same detection problem when the graph is directed and the probability of connection under the null (denoted $p_0$) is fixed, which is a situation where the graph is extremely dense. Their work leaves out the interesting regime where $p_0 \to 0$, which leads to a null model that is much more sparse.
1.3 Main Contribution
Our main contribution in this paper is to derive a sharp detection boundary for the problem of detecting a community in a network as described above. We focus here on the quasi-normal regime, where $p_0$ is either bounded away from zero or tends to zero slowly; specifically, $\log(1/p_0) = o(\log n)$. (1) (The quasi-Poisson regime, where $p_0 \to 0$ polynomially fast, is qualitatively different and necessitates different proof arguments; this is beyond the scope of this paper and will appear elsewhere.)
On the one hand, we derive an information theoretic bound that applies to all tests, meaning conditions under which all tests are powerless. On the other hand, we display a test that basically achieves the best performance possible. The test is the combination of the two natural tests that arise in Butucea and Ingster (2011) and much of the work in that field (Ingster et al., 2010; Arias-Castro et al., 2011):
Total degree test. This test rejects when the total number of edges is unusually large. This test is global in nature, in that it cannot be directly turned into a method for extraction.
Scan (or maximum modularity) test.
This test amounts to turning modularity into a test statistic by rejecting when its maximum value is unusually large. It is strictly speaking the generalized likelihood ratio test under our framework.
We also consider the situation, common in practice, where $p_0$ is unknown. Interestingly, the detection boundary becomes larger than in the former setting when the graph is moderately sparse. We derive the corresponding lower bound in this situation and design a test that achieves this bound. The test is again the combination of two tests:
Degree variance test.
This test is based on the difference between two estimates of the degree variance, an analysis of variance of sorts. (Note that the total degree test cannot be calibrated without knowledge of $p_0$.)
Scan test. This test can be calibrated in various ways when $p_0$ is unknown, for example by estimation of $p_0$ based on the whole graph, or by permutation. We study the former.
Finally, we consider various polynomial-time algorithms, the main one being a convex relaxation of the scan test based on a sparse eigenvalue problem formulation. Our inspiration there comes from the recent work of Berthet and Rigollet (2012). We discuss the discrepancy between the performances of the scan test and the relaxed scan test and compare it with other polynomial-time tests.
Table 1 (summary): the scan test paired with the total degree test achieves the detection boundary when $p_0$ is known, and the scan test paired with the degree variance test does so when $p_0$ is unknown; their polynomial-time counterparts pair the relaxed scan test with the total degree test or the degree variance test, respectively.
1.4 Finding a clique
We start the paper by addressing the problem of detecting the presence of a large clique in the graph, and treat it separately, as it is an interesting case in its own right. It is simpler and allows us to focus on the regime where $p_1 < 1$ in the rest of the paper. We establish a lower bound and prove that the following (obvious) test achieves that bound:
Clique number test. This test rejects when the clique number of the graph is unusually large. It can be calibrated without knowledge of $p_0$, for example by permutation, but we do not know of a polynomial-time algorithm that comes even close.
In Section 2, we consider the problem of detecting the presence of a large clique and analyze the clique number test. In Section 3, we consider the more general problem of detecting a densely connected subgraph and analyze the total degree test and the scan test. The more realistic situation of unknown $p_0$ is handled in Section 4. In Section 5, we investigate polynomial-time tests. We then discuss our results and the outlook in Section 6. The technical proofs are postponed to Section 7.
1.5 General assumptions and notation
We assume throughout that $n \to \infty$ and that the other parameters ($N$, $p_0$, $p_1$, and more) are allowed to change with $n$, unless specified otherwise. This dependency is left implicit. In particular, we assume that $N = o(n)$, emphasizing the situation where the community to be detected is small compared to the size of the whole network. (When $N$ is of the same order as $n$, the total degree test is basically optimal.) We assume that $p_1$ is bounded away from 1, which is the most interesting case by far, and that $n^2 p_0 \to \infty$, the latter implying that the number of edges in the network (under the null) is not bounded. We also assume that $N^2 p_1 \to \infty$; otherwise, there is a non-vanishing chance that the community does not contain any edges, precluding any test from being powerful.
We use standard notation, such as $a_n \sim b_n$ when $a_n / b_n \to 1$; $a_n = o(b_n)$ when $a_n / b_n \to 0$; $a_n = O(b_n)$ when $a_n / b_n$ is bounded; $a_n \asymp b_n$ when $a_n = O(b_n)$ and $b_n = O(a_n)$; $a_n \gtrsim b_n$ when there exists a positive constant $C$ such that $a_n \geq C b_n$, and $a_n \lesssim b_n$ when there exists a positive constant $C$ such that $a_n \leq C b_n$. For an integer $k$, let $[k] := \{1, \dots, k\}$. For two distributions $P$ and $Q$ on the real line, let $P * Q$ denote their convolution, which is the distribution of the sum of two independent random variables $X \sim P$ and $Y \sim Q$.
2 Detecting a large clique in a random graph
We start by specializing the setting to that of detecting a large clique, meaning we consider the special case where $p_1 = 1$. In this section, $N$ is not necessarily increasing with $n$.
2.1 Lower bound
We establish the detection boundary, giving sufficient conditions for the problem to be too hard for any test, meaning that all tests are asymptotically powerless.
Theorem 1. All tests are asymptotically powerless if $N \leq (2 - \varepsilon) \log_{1/p_0}(n)$ for some fixed $\varepsilon > 0$. (3)
The result is, in fact, very intuitive. Condition (3) implies that, with high probability under the null, the clique number is at least $N$, which is the size of the planted clique under the alternative. This is a classical result in random graph theory, and finer results are known — see (Bollobás, 2001, Chap. 11). The arguments underlying Theorem 1 are, however, based on studying the likelihood ratio test when a uniform prior is assumed on the planted clique $S$, which is the standard approach in detection settings; see (Lehmann and Romano, 2005, Ch. 8)
. In this specific setting, the second moment method — which consists in showing that the variance of the likelihood ratio tends to 0 — suffices.
2.2 The clique number test
Computational considerations aside, the most natural test for detecting the presence of a clique is the clique number test defined in the Introduction. We obtain the following.
Proposition 1. The clique number test is powerful if $N \geq (2 + \varepsilon) \log_{1/p_0}(n)$ for some fixed $\varepsilon > 0$. (4)
The proof is entirely based on the fact that, when (4) holds, the clique number under the null is strictly smaller than $N$ with high probability (Bollobás, 2001, Th. 11.6), while it is at least $N$ under the alternative. (Thus the proof is omitted.) The clique number test is thus seen to achieve the detection boundary established in Theorem 1.
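For intuition, the clique number test can be mimicked by brute force on very small graphs; the search is exponential in $n$, so the sketch below (helper names ours, not the paper's) is purely illustrative.

```python
from itertools import combinations

import numpy as np

def has_clique(W, k):
    """Check whether the graph with adjacency matrix W has a clique on k nodes."""
    n = W.shape[0]
    return any(all(W[i, j] for i, j in combinations(S, 2))
               for S in combinations(range(n), k))

def clique_number(W):
    """Largest k such that the graph contains a k-clique (brute force)."""
    n = W.shape[0]
    k = 1
    while k < n and has_clique(W, k + 1):
        k += 1
    return k

# triangle on nodes {0, 1, 2} plus an isolated node
W = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    W[i, j] = W[j, i] = 1
```

The test would then reject when `clique_number(W)` exceeds a threshold calibrated, for example, by permutation.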
3 Detecting a dense subgraph in a random graph
We now consider the more general setting of detecting a dense subgraph in a random graph. We start with an information bound that applies to all tests, regardless of their computational requirements. We then study the total degree test and the scan test, showing that the test that combines them with a simple Bonferroni correction is essentially optimal.
3.1 Lower bound
When assuming infinite computational power, what is left is the purely statistical challenge of detecting the subgraph. For simplicity, we assume that $N$ is not too small, specifically,
though our result below partially extends beyond this, particularly when $p_0$ is constant. As usual, a minimax lower bound is derived by choosing a prior over the composite alternative. Assuming that $N$ and $p_1$ are known, because of symmetry, the uniform prior over the community $S$ is least favorable, so that we consider testing
where the latter is the model where the community $S$ is chosen uniformly at random among subsets of nodes of size $N$, and then, for $i < j$, $\mathbb{P}(W_{ij} = 1) = p_1$ if $i, j \in S$, while $\mathbb{P}(W_{ij} = 1) = p_0$ otherwise. For this simple-versus-simple testing problem, the likelihood ratio test is optimal, which is what we examine to derive the following lower bound. Remember the entropy function $H_{p_0}$ defined in (2).
Conditions (7) and (8) have their equivalent in the work of Butucea and Ingster (2011). That said, (8) is more complex here because of the different behaviors of the entropy function according to the size of $p_1$ relative to $p_0$ — corresponding to the difference between large deviations and moderate deviations of the binomial distribution. Only when $p_1/p_0 \to 1$ is the normal approximation to the binomial in effect.
To better appreciate (8), note that it is equivalent to
3.2 The total degree test
The total degree test rejects for large values of the total number of edges, $\sum_{i < j} W_{ij}$. (11)
The resulting test is exceedingly simple to analyze, since the total number of edges is binomial under the null and a sum of two independent binomials under the alternative.
Proposition 2. The total degree test is powerful if $\frac{N^2 (p_1 - p_0)}{n \sqrt{p_0 (1 - p_0)}} \to \infty$. (13)
It is equally straightforward to show that the total degree test has risk strictly less than one — meaning it has some non-negligible power — when the same ratio tends to a positive and finite constant, while it is asymptotically powerless when that ratio tends to zero.
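A minimal implementation of the total degree test, calibrated with a one-sided normal approximation to the Bin(n(n−1)/2, p0) null distribution; this is a sketch of the idea, not the paper's exact calibration.

```python
import numpy as np

def total_degree_test(W, p0):
    """Reject when the total edge count is unusually large for G(n, p0).

    Uses a one-sided normal approximation at level ~5%; 1.645 is the
    corresponding standard normal critical value.
    """
    n = W.shape[0]
    m = n * (n - 1) / 2                  # number of node pairs
    edges = np.triu(W, 1).sum()          # observed number of edges
    z = (edges - m * p0) / np.sqrt(m * p0 * (1 - p0))
    return z > 1.645

# example input: a complete graph on 30 nodes (extreme alternative)
W_dense = np.ones((30, 30), dtype=int)
np.fill_diagonal(W_dense, 0)
```

The statistic is computed in time linear in the size of the adjacency matrix, which is what makes this test attractive computationally.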
3.3 The scan test
The scan test is another name for the generalized likelihood ratio test, and corresponds to the test that is based on the maximum modularity. It is particularly simple when $N$ is known, as it rejects for large values of $\max_{|S| = N} W_S$, where $W_S$ denotes the number of edges within the subgraph induced by $S$. (14)
Unlike the total degree (11), the scan statistic (14) has an intricate distribution, as the partial sums $W_S$ are not independent across subsets. Nevertheless, the union bound and standard tail bounds for the binomial distribution lead to the following result.
Proposition 3. The scan test is powerful if $\liminf_{n \to \infty} \frac{(N - 1)\, H_{p_0}(p_1)}{2 \log(n/N)} > 1$. (15)
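The scan statistic itself can be written down directly, though only for tiny graphs, since the maximization runs over all subsets of size $N$ (helper names ours, not the paper's):

```python
from itertools import combinations

import numpy as np

def scan_statistic(W, N):
    """Max over node subsets S of size N of the number of edges inside S.

    Brute force: combinatorial in n, so feasible only for small examples.
    """
    n = W.shape[0]
    return max(sum(W[i, j] for i, j in combinations(S, 2))
               for S in combinations(range(n), N))

# triangle on {0, 1, 2} plus an isolated node
W = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    W[i, j] = W[j, i] = 1
```

The test rejects when this maximum exceeds a threshold calibrated under the null, e.g., via the union bound over subsets.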
3.4 The combined test
Having studied these two tests individually, we are now in a position to consider them together, by which we mean a simple Bonferroni combination which rejects when either of the two tests rejects. Looking back at our lower bound and the performance bounds we established for these tests, we come to the following conclusion. When the limit in (7) is infinite — yielding (13) — then the total degree test is asymptotically powerful by Proposition 2. When the limit inferior in (8) exceeds one — yielding (15) — then the scan test is asymptotically powerful by Proposition 3.
3.5 Adaptation to unknown $N$
The scan statistic in (14) requires knowledge of $N$. When this is unknown, the common procedure is to combine the scan tests at all different community sizes using a simple Bonferroni correction. This is done by Butucea and Ingster (2011), with the conclusion that the resulting test is essentially as powerful as the individual tests. It is straightforward to see that, here too, the tail bound used in the proof of Proposition 3 allows for enough room to scan over all subgraphs of all sizes.
4 When $p_0$ is unknown: the fixed expected total degree model
Although it leads to interesting mathematics, the setting where $p_0$ is known is, for the most part, impractical. In this section, we evaluate how not knowing $p_0$ changes the difficulty of the problem. In fact, it makes the problem strictly more difficult in the denser regime.
There are (at least) two ways of formalizing the situation where $p_0$ is unknown. In the first option, we still consider the exact same hypothesis testing problem, but maximize the risk over relevant subsets of $p_0$'s and $p_1$'s, since now even the null hypothesis is composite. In the second option — which is the one we detail — for a given pair of probabilities $(p_0, p_1)$, we consider testing
Note that, in this setting, we still assume that $N$, $p_0$ and $p_1$ are known to the statistician. By design, the graph has the same expected total degree under the null and under the alternative hypotheses. Let $\mathbb{P}_S$ and $\mathbb{E}_S$ denote the probability distribution and corresponding expectation under the model where, for any $i < j$, $\mathbb{P}(W_{ij} = 1) = p_1$ if $i, j \in S$, while $\mathbb{P}(W_{ij} = 1) = p_0$ otherwise.
The risk of a test for this problem is defined as before, now with the null and alternative just described. We say that a sequence of tests is asymptotically powerful for the problem with fixed expected total degree (resp. powerless) if its risk tends to 0 (resp. 1).
We first compute the detection boundary for this problem and then exhibit some tests achieving this detection boundary. Interestingly, these tests do not require the knowledge of $p_0$ and $p_1$, or even $N$, so that they can be used in the original setting (6) when these parameters are unknown.
4.1 Lower bound
Comparing with Theorem 2, where $p_0$ is assumed to be known, condition (18) is substantially weaker than the corresponding condition (7), while we shall see in the proof that (19) is comparable to (8). That said, when $N$ is sufficiently small, the entropy condition (8) is a stronger requirement than either (7) or (18), implying that the setting where $p_0$ is known and the setting where $p_0$ is unknown are asymptotically as difficult in that case.
4.2 Degree variance test
By construction, the total degree has the same expectation under the null and under the alternative in the testing problem with fixed expected total degree — and the same variance up to second order — making it difficult to see how to fruitfully use this statistic in this context.
We design instead a test based on comparing two estimators of the node degree variance, not unlike an analysis of variance. Let
$d_i := \sum_{j \neq i} W_{ij}$ (20) denote the degree of node $i$ in the whole network. The first estimate is simply the maximum likelihood estimator under the null.
The second estimator is some sort of sample variance, modified to account for the fact that the degrees are not independent.
Both estimators are unbiased for the degree variance under the null. Under the alternative, the second estimator tends to be larger than the first, leading to a test that rejects for large values of their difference. (21)
Proposition 4. The degree variance test is asymptotically powerful under fixed expected total degree if (22) holds.
The test based on the difference of these two estimators achieves part (18) of the detection boundary. We note that computing this statistic does not require knowledge of $N$, $p_0$ or $p_1$, and in fact, its calibration can be done without any knowledge of these parameters via a form of parametric bootstrap, as we do for the scan test below.
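The following sketch captures the idea of the degree variance test — comparing a binomial plug-in estimate of the degree variance with the empirical variance of the observed degrees. It is our rendering of the idea, not the paper's exact statistic (21).

```python
import numpy as np

def degree_variance_statistic(W):
    """Difference between the empirical degree variance and the
    variance predicted by the fitted null model G(n, p_hat)."""
    n = W.shape[0]
    d = W.sum(axis=1)                        # node degrees
    p_hat = d.sum() / (n * (n - 1))          # MLE of p under the null
    v_model = (n - 1) * p_hat * (1 - p_hat)  # variance of Bin(n-1, p_hat)
    v_empir = d.var(ddof=1)                  # sample variance of degrees
    return v_empir - v_model                 # large value suggests a community

# 5-clique planted among 10 otherwise isolated nodes
W = np.zeros((10, 10), dtype=int)
W[:5, :5] = 1
np.fill_diagonal(W, 0)
```

On a homogeneous graph the two estimates roughly agree, while a planted community inflates the empirical degree variance, making the difference positive.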
4.3 The scan test
When $p_0$ is not available a priori, we have at least three options:
Estimate $p_0$. We replace $p_0$ with its maximum likelihood estimator under the null, i.e., the overall edge density $\hat{p}_0 := \sum_{i < j} W_{ij} / \binom{n}{2}$, and then compare the magnitude of the observed scan statistic (14) with what one would get under a random graph model with probability of connection equal to $\hat{p}_0$.
Generalized likelihood ratio test. We simply implement the actual generalized likelihood ratio test (Kulldorff, 1997), which rejects for large values of
where $\hat{p}_0$ is as above, and
which are the maximum likelihood estimates of $p_0$ and $p_1$ for a given subset $S$.
Calibration by permutation. We compare the observed value of the scan statistic to simulated values obtained by generating a random graph with either the same number of edges — which leads to a calibration very similar to the first option — or the same degree distribution — which is the basis for the null model in the modularity function of Newman and Girvan (2004).
We focus on the first option.
Proposition 5. The scan test calibrated by estimation of $p_0$ is asymptotically powerful for fixed expected total degree under essentially the same condition as in Proposition 3.
4.4 Combined test and full adaptation to unknown parameters
A combination of the degree variance test and of the scan test calibrated by estimation of $p_0$ is seen to achieve the detection boundary established in Theorem 3, without requiring knowledge of $p_0$ or $p_1$, or even $N$.
5 Testing in polynomial-time
While computing the total degree (11) or the degree variance statistic (21) can be done in time linear in the size of the network, i.e., in $O(n^2)$ time, computing the scan statistic (14) is combinatorial in nature and there is no known polynomial-time algorithm to compute it. To see this, note that the ability to compute (14) in polynomial time implies the ability to compute the size of the largest clique in the graph, since the clique number is equal to the largest $N$ for which the scan statistic at size $N$ attains its maximal value $\binom{N}{2}$.
A question of particular importance in modern times is determining the tradeoff between statistical performance and computational complexity. At the most basic level, this boils down to answering the following question: What can be done in polynomial-time?
5.1 Convex relaxation scan test
We now suggest a convex relaxation of the problem of computing the scan statistic. To do so, we follow in the footsteps of Berthet and Rigollet (2012), who consider the problem of detecting a sparse principal component based on a sample from a multivariate Gaussian distribution. Assuming the sparse component has at most $k$ nonzero entries, they show that a near-optimal procedure is based on the largest eigenvalue among all $k$-by-$k$ principal submatrices of the sample covariance matrix. Computing this statistic is NP-hard, so they resort to the convex relaxation of d'Aspremont et al. (2007), which they also study. We apply their procedure to the adjacency matrix $W$.
Formally, for a positive semidefinite matrix $M$ and an integer $k$, define $\lambda_{\max}^{(k)}(M) := \max_{|S| = k} \lambda_{\max}(M_S)$, where $M_S$ denotes the principal submatrix of $M$ indexed by $S$ and $\lambda_{\max}(M_S)$ the largest eigenvalue of $M_S$. d'Aspremont et al. (2007) relaxed this to $\mathrm{SDP}_k(M) := \max \{\mathrm{trace}(M Z)\}$, where the maximum is over positive semidefinite matrices $Z$ satisfying $\mathrm{trace}(Z) = 1$ and $\sum_{i,j} |Z_{ij}| \leq k$. We consider the relaxed scan test, which rejects for large values of $\mathrm{SDP}_N(W)$.
When $p_0$ is known, we simply calibrate the procedure by Monte Carlo simulation, effectively generating $B$ i.i.d. graphs from $\mathcal{G}(n, p_0)$, computing the statistic for each of them, and estimating the p-value by the fraction of simulated statistics that exceed the observed one. Typically $B$ is a large number, and below we consider the asymptote where $B \to \infty$.
When $p_0$ is unknown, we estimate it as we did for the scan test in Proposition 5, and then calibrate the statistic by Monte Carlo, effectively using a form of parametric bootstrap.
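Monte Carlo calibration of this kind is generic: any graph statistic can be calibrated against simulated Erdős–Rényi graphs. A sketch follows (function name ours; the add-one p-value estimator is a common convention, not necessarily the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_pvalue(stat, observed, n, p0, B=200):
    """Estimate the p-value of `observed` for the statistic `stat`
    (larger = more extreme) by simulating B graphs from G(n, p0)."""
    null_values = []
    for _ in range(B):
        U = np.triu((rng.random((n, n)) < p0).astype(int), 1)
        null_values.append(stat(U + U.T))
    # add-one estimator keeps the estimated p-value strictly positive
    return (1 + sum(v >= observed for v in null_values)) / (B + 1)
```

For the parametric bootstrap version, one would simply replace `p0` by its estimate computed from the observed graph.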
In either case, we have the following.
Proposition 6. Assume that (1) holds and that $N \leq n^{1 - \kappa}$ for some fixed $\kappa > 0$. Then, the relaxed scan test is powerful if (25) holds.
To gain some insight into the relative performance of the scan test and the relaxed scan test, let us assume that $p_0$ and $p_1$ are fixed. Applying Proposition 3 (or Proposition 5) in this setting, we find that the scan test is asymptotically powerful when
Thus, comparing with (25), we lose a factor in the detection boundary when using the relaxed version. In the denser regime, the total degree test and the degree variance test both have stronger theoretical guarantees, established in Proposition 2 and Proposition 4 respectively. Below we explain why the loss is not unexpected.
When $p_1 = 1$ and $p_0 = 1/2$, this is the Planted (or Hidden) Clique Problem (Feige and Ron, 2010), which has become one of the most emblematic statistical problems where computational constraints seem to substantially affect the difficulty of the problem. Recent advances in compressed sensing and matrix completion have shown that computationally tractable algorithms can achieve the absolute information bounds (up to constants) in most cases. In contrast, in the Planted Clique Problem there is no known polynomial-time algorithm that can detect a clique of size $o(\sqrt{n})$ (Dekel et al., 2011), while the clique number test can detect a clique of size $(2 + \varepsilon) \log_2(n)$ for any fixed $\varepsilon > 0$, as shown in Proposition 1. In fact, the problem is provably hard in some computational models, such as monotone circuits (Rossman, 2010; Feldman et al., 2012). We refer to Berthet and Rigollet (2012) for a thorough discussion.
More generally, we may want to characterize the sequences of parameters for which there are asymptotically powerful tests running in polynomial time. In our findings, the only situation where we found this to be true was in the dense regime, where the total degree test is both powerful in the large-sample limit and computable in polynomial time. (Replace this with the degree variance test when $p_0$ is unknown.)
5.2 Other polynomial-time tests
5.2.1 The maximum degree test
Perhaps the first computationally feasible test that comes to mind in the sparse regime is the test based on the maximum degree, $\max_i d_i$, where $d_i$ is the degree of node $i$ in the graph, defined in (20).
The maximum degree test is asymptotically powerful if
Under condition (1), the maximum degree test is asymptotically powerless if
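A sketch of the maximum degree test, calibrated via a normal approximation to Bin(n−1, p0) with a Bonferroni correction over the n nodes. This is our rendering of the idea, not the paper's exact calibration; the quantity sqrt(2 log(n/α)) is the standard Gaussian tail bound on the needed quantile.

```python
import math

import numpy as np

def max_degree_test(W, p0, alpha=0.05):
    """Reject when some node degree is unusually large for G(n, p0)."""
    n = W.shape[0]
    d = W.sum(axis=1)                        # node degrees
    mu = (n - 1) * p0                        # mean of Bin(n-1, p0)
    sd = math.sqrt((n - 1) * p0 * (1 - p0))  # std dev of Bin(n-1, p0)
    # Bonferroni over nodes: per-node level alpha/n; use the Gaussian
    # tail bound z_u <= sqrt(2 log(1/u)) with u = alpha/n
    z = math.sqrt(2 * math.log(n / alpha))
    return d.max() > mu + z * sd
```

The statistic is computable in $O(n^2)$ time, in contrast with the scan statistic.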
5.2.2 Densest subgraph test
Another possible avenue for designing computationally tractable tests for the problem at hand lies in algorithms for finding dense subgraphs of a given size. We follow Khuller and Saha (2009), where the reader will find appropriate references and additional results. Define the density of a subgraph as its number of edges divided by its number of nodes.
Finding a subset of nodes that maximizes this density may be done in polynomial time.
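The exact maximization is usually done via maximum flow; the simpler greedy peeling procedure shown below (a known 2-approximation; see Khuller and Saha, 2009, for references) illustrates why the computation is tractable. The implementation is ours.

```python
import numpy as np

def densest_subgraph_peel(W):
    """Greedy peeling: repeatedly remove a minimum-degree node and
    return the intermediate node set with the highest density
    (edges / nodes). A known 2-approximation to the densest subgraph."""
    n = W.shape[0]
    alive = list(range(n))

    def density(nodes):
        return np.triu(W[np.ix_(nodes, nodes)], 1).sum() / len(nodes)

    best_set, best_density = list(alive), density(alive)
    while len(alive) > 1:
        deg = W[np.ix_(alive, alive)].sum(axis=1)  # degrees within `alive`
        alive.pop(int(np.argmin(deg)))             # peel a min-degree node
        if density(alive) > best_density:
            best_set, best_density = list(alive), density(alive)
    return best_set, best_density

# triangle on {0, 1, 2} plus an isolated node: the triangle is densest
W = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    W[i, j] = W[j, i] = 1
```

The resulting test would reject when the returned density is unusually large under the null, calibrated for instance by Monte Carlo.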
Assume that .
Under the null hypothesis, and this maximum is achieved at subsets satisfying .
The densest subgraph test is powerful if .
Assume that . Under the alternative hypothesis, and this maximum is achieved at subsets satisfying .
The condition is stronger than what we have obtained for the relaxed scan test (25) in the sparser case, and than what we have obtained for the total degree test (13) and the degree variance test (22) in the less sparse case. If $N$ is of order $n$, then the densest subgraph statistic seems to behave like the total degree statistic, and we therefore expect similar performance, although we have no proof of this statement.
In order to improve the power, we would like to restrict our attention to subgraphs of size $N$ (assumed known for now) and scan the densities of these. Computing this restricted maximum, however, is NP-hard, and there is no known polynomial-time approximation within a constant factor. Nevertheless, a variant of the statistic can be approximated within a constant factor in polynomial time, although the power of the resulting test is not improved: since the statistic may only be approximated within a constant factor, the resulting test is guaranteed to be powerful only under a correspondingly stronger condition, with a constant that depends on the approximation factor.
6 Discussion
With this paper, we have established the fundamental statistical (information-theoretic) difficulty of detecting a community in a network, modeled as the detection of an unusually dense subgraph within an Erdős–Rényi random graph, in the quasi-normal regime where $p_0$ is not too small, as made explicit in (1). The quasi-Poisson regime, where $p_0$ is smaller, requires different arguments and the application of somewhat different tests; this will be detailed in a separate paper under preparation.
For the time being, in the quasi-normal regime, we learned the following. In the moderately sparse setting (with the precise thresholds differing according to whether $p_0$ is known or unknown), the detection boundary is achieved by polynomial-time tests. In the sparser setting, there is a large discrepancy between the information-theoretic boundaries and the performances of known polynomial-time tests, which, in view of the Planted Clique Problem, is not surprising.
It is of great interest to study the optimal detection boundary, this time under computational constraints, a theme of contemporary importance in statistics, machine learning and computer science. This promisingly rich line of research is well beyond the scope of the present paper.
7 Proofs
7.1 Auxiliary results
The following is Chernoff's bound for the binomial distribution. Recall the definition of the entropy function in (2): $H_p(q) = q \log(q/p) + (1 - q) \log((1 - q)/(1 - p))$.
Lemma 1 (Chernoff’s bound).
For any positive integer $m$ and any $0 < p \leq q < 1$, we have $\mathbb{P}(\mathrm{Bin}(m, p) \geq q m) \leq \exp(-m\, H_p(q))$.
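The bound is easy to check numerically against the exact binomial tail (a quick sanity check; function names are ours):

```python
import math

def H(p, q):
    """Entropy (relative entropy) function H_p(q) from (2)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def binom_tail(m, p, k):
    """P(Bin(m, p) >= k), computed exactly from the pmf."""
    return sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j)
               for j in range(k, m + 1))

m, p, q = 50, 0.2, 0.5
exact = binom_tail(m, p, int(q * m))
bound = math.exp(-m * H(p, q))   # Chernoff upper bound on P(Bin(m,p) >= qm)
```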
A consequence of Chernoff’s bound is Bernstein’s inequality for the binomial distribution.
Lemma 2 (Bernstein’s inequality).
For any positive integer $m$, any $p \in (0, 1)$ and any $t > 0$, we have $\mathbb{P}(\mathrm{Bin}(m, p) \geq mp + t) \leq \exp\left(-\frac{t^2}{2(m p (1 - p) + t/3)}\right)$.
We will need the following basic properties of the entropy function.
For fixed $p$, the function $q \mapsto H_p(q)$ is convex and nonnegative, and vanishes only at $q = p$.
We will also use the following upper bound on the binomial coefficients.
For any integers $1 \leq N \leq n$, $\binom{n}{N} \leq \left(\frac{e n}{N}\right)^N$.
The next result bounds the hypergeometric distribution by a binomial distribution. Let $\mathrm{Hyp}(n, K, m)$ denote the hypergeometric distribution counting the number of red balls in $m$ draws without replacement from an urn containing $K$ red balls out of $n$.
$\mathrm{Hyp}(n, K, m)$ is stochastically smaller than $\mathrm{Bin}(m, K/(n - m + 1))$.
Suppose the balls are picked one by one. At each stage, at most $K$ of the remaining balls are red and at least $n - m + 1$ balls remain, so the probability of selecting a red ball is smaller than $K/(n - m + 1)$. The result follows. ∎
7.2 Proof of Theorem 1
Following standard lines, we start by reducing the composite alternative to a simple alternative by considering the uniform prior on subsets $S$ of size $N$. The resulting likelihood ratio is $L = \binom{n}{N}^{-1} \sum_{|S| = N} p_0^{-\binom{N}{2}} \mathbb{1}\{S \text{ induces a clique}\},$
which is the observed number of cliques of size $N$ divided by the expected number of such cliques under the null.
The risk of any test for the original problem is well known to be bounded from below by the risk of the likelihood ratio test for this 'averaged' problem, which is equal to $1 - \frac{1}{2} \mathbb{E}_0 |L - 1|$.
Therefore, it suffices to show that $\mathbb{E}_0 |L - 1| \to 0$. Here we use arguably the simplest method, a second moment argument, which is based on the fact that $\mathbb{E}_0 |L - 1| \leq \left(\mathbb{E}_0[L^2] - 1\right)^{1/2}$ by the Cauchy–Schwarz inequality (since $\mathbb{E}_0 L = 1$), so that it is enough to prove that $\mathbb{E}_0[L^2] \to 1$. We do so by showing that $\mathbb{E}_0[L^2] \leq 1 + o(1)$.
where the expectation is over two independent index sets $S$ and $S'$, each uniformly distributed over subsets of size $N$. Hence, by Fubini's theorem, we have $\mathbb{E}_0[L^2] = \mathbb{E}\left[p_0^{-\binom{K}{2}}\right]$, where $K = |S \cap S'|$. Indeed, the event in question means that all edges between pairs of nodes in $S$ exist, and similarly for $S'$, and there are a total of $2\binom{N}{2} - \binom{K}{2}$ such edges.
In particular, this means that , eventually, and therefore
For fixed, the function is decreasing on and increasing on . Therefore,
By (32), the second term in the maximum tends to the required limit. This is also the case for the first term, since
with the second difference bounded from below. Hence, the sum in (35) is bounded by
We have thus shown that $\mathbb{E}_0[L^2] \leq 1 + o(1)$, and the proof of Theorem 1 is complete.
7.3 Proof of Theorem 2
where is the expectation with respect to , and
which is the moment generating function of.
Still leaving the dependence on $n$ implicit, it is well known that the entropy function $H_p$ is the Fenchel–Legendre transform of the logarithmic moment generating function of the Bernoulli distribution with parameter $p$; more specifically, for $q \in (p, 1)$, $H_p(q) = \sup_{\lambda \geq 0} \left\{\lambda q - \log(1 - p + p e^{\lambda})\right\}$.
The second moment argument used in Section 7.2 is also applicable here, though it does not yield sharp bounds. In Case 1 below (see Subsection 7.3.3), which is the regime where the moderate deviations of the binomial come into play, this method leads to a requirement that the limit superior in (8) be bounded by $1/2$ instead of 1. And, worse than that, in Case 3 below, which is the regime where the large deviations of the binomial are involved, it does not provide any useful bound whatsoever.
Fortunately, a finer approach was suggested by Ingster (1997). The refinement is based on bounding the first and second moments of a truncated likelihood ratio. Here we follow Butucea and Ingster (2011). They work with the following truncated likelihood
where the events will be specified below; we write $\tilde{L}$ for this truncated likelihood ratio. Using the triangle inequality, the fact that $\tilde{L} \leq L$, and the Cauchy–Schwarz inequality, we have the following upper bound:
so that when and . Note that contrary to Butucea and Ingster (2011), we do not require that . More precisely, we shall prove that is an accumulation point of any subsequence of . Adopting this approach allows us to assume that converges to , converges to and that