The Geometric Block Model

09/16/2017 ∙ by Sainyam Galhotra, et al. ∙ University of Massachusetts Amherst

To capture the inherent geometric features of many community detection problems, we propose a new random graph model of communities that we call the Geometric Block Model. The geometric block model generalizes random geometric graphs in the same way that the well-studied stochastic block model generalizes Erdős-Rényi random graphs. It is also a natural extension of random community models, inspired by recent theoretical and practical advances in community detection. While the model is a topic of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse graph), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from optimal for the stochastic block model. We run experiments on both real and synthetic datasets to show the superior performance of both the new model and our algorithm.


1 Introduction

The planted-partition model, or stochastic block model (SBM), is a random graph model for community detection that generalizes the well-known Erdős-Rényi graphs [21, 14, 12, 2, 1, 20, 11, 25]. Consider a graph $G(V, E)$, where $V = V_1 \sqcup V_2 \sqcup \dots \sqcup V_k$ is a disjoint union of $k$ clusters denoted by $V_1, \dots, V_k$. The edges of the graph are drawn randomly: there is an edge between $u \in V_i$ and $v \in V_j$ with probability $q_{i,j}$, $1 \le i, j \le k$. Given the adjacency matrix of such a graph, the task is to find exactly (or approximately) the partition $V_1, \dots, V_k$ of $V$.
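The generative process just described can be sketched in a few lines of Python. This is only an illustrative sampler under assumed names: `sample_sbm`, `sizes` (the cluster sizes) and `q` (the matrix of pairwise edge probabilities between clusters) are hypothetical and not taken from the paper.

```python
import random


def sample_sbm(sizes, q, seed=0):
    """Sample an undirected SBM graph (illustrative sketch).

    sizes: list of cluster sizes (hypothetical parameter name).
    q:     symmetric k x k matrix; q[i][j] is the edge probability
           between a vertex of cluster i and a vertex of cluster j.
    Returns (labels, edges), where edges is a set of pairs (u, v), u < v.
    """
    rng = random.Random(seed)
    # Vertex u belongs to cluster labels[u].
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            # Each edge is drawn independently of all other edges.
            if rng.random() < q[labels[u]][labels[v]]:
                edges.add((u, v))
    return labels, edges
```

Because every pair is sampled independently, the presence of edges (u, w) and (w, v) says nothing about the edge (u, v) beyond what the cluster labels already imply.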

This model has been incredibly popular in both theoretical and practical domains of community detection, and the aforementioned references are just a small sample. Recent theoretical work focuses on characterizing the sharp threshold for recovering the partition in the SBM. For example, when there are only two communities of exactly equal sizes, the inter-cluster edge probability is $\frac{b \log n}{n}$, and the intra-cluster edge probability is $\frac{a \log n}{n}$, it is known that perfect recovery is possible if and only if $\sqrt{a} - \sqrt{b} > \sqrt{2}$ [1, 25]. The regime of the probabilities being $\Theta(\frac{\log n}{n})$ has been put forward as one of the most interesting, because in an Erdős-Rényi random graph this is the threshold for graph connectivity [8]. This result has subsequently been generalized to $k$ communities [2, 3, 19] (for constant $k$ or when $k = o(\log n)$), and under the assumption that the communities are generated according to a probabilistic generative model (there is a prior probability $p_i$ of an element being in the $i$th community) [2]. These results are not only of theoretical interest: many real-world networks exhibit a “sparsely connected” community feature [24], and any efficient recovery algorithm for the SBM has many potential applications.

One aspect that the SBM does not account for is a “transitivity rule” (‘friends having common friends’) inherent to many social and other community structures. To be precise, consider any three vertices $x$, $y$ and $z$. If $x$ and $y$ are connected by an edge (or they are in the same community), and $y$ and $z$ are connected by an edge (or they are in the same community), then it is more likely than not that $x$ and $z$ are connected by an edge. This phenomenon can be seen in many network structures, predominantly in social networks, blog networks and advertising. The SBM, primarily a generalization of the Erdős-Rényi random graph, does not take this characteristic into account; in particular, the probability of an edge between $x$ and $z$ is independent of whether there exist edges between $x$ and $y$ and between $y$ and $z$. However, one needs to be careful that, in allowing such “transitivity”, the simplicity and elegance of the SBM are not lost.

Inspired by the above question, we propose a random graph community detection model analogous to the stochastic block model, that we call the geometric block model (GBM). The GBM depends on the basic definition of the random geometric graph that has found a lot of practical use in wireless networking because of its inclusion of the notion of proximity between nodes [26].

Definition. A random geometric graph (RGG) on $n$ vertices has parameters $n$, an integer $t > 1$ and a real number $\beta \in [-1, 1]$. It is defined by assigning a vector $Z_i \in \mathbb{R}^t$ to vertex $i$, $1 \le i \le n$, where $Z_i$, $1 \le i \le n$, are independent and identical random vectors uniformly distributed in the Euclidean sphere $S^{t-1} \equiv \{x \in \mathbb{R}^t : \|x\|_2 = 1\}$. There will be an edge between vertices $i$ and $j$ if and only if $\langle Z_i, Z_j \rangle \ge \beta$.

Note that the definition can be further generalized by considering the $Z_i$s to have a sample space other than $S^{t-1}$, and by using a notion of distance other than the inner product (e.g., the Euclidean distance). We have simply stated one of the many equivalent definitions [10].
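As a concrete illustration of this definition, the following Python sketch samples a random geometric graph by drawing points uniformly on the unit sphere (via normalized Gaussian vectors) and connecting pairs whose inner product reaches a threshold. The function name `sample_rgg` and the parameter names `n`, `t` and `beta` are assumptions made for this example.

```python
import math
import random


def sample_rgg(n, t, beta, seed=0):
    """Sample a random geometric graph on the unit sphere in R^t (sketch).

    n:    number of vertices.
    t:    ambient dimension of the vectors.
    beta: inner-product threshold; an edge (i, j) exists when the inner
          product of the two vectors is at least beta.
    Returns (points, edges) with edges as a set of pairs (i, j), i < j.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # A normalized standard Gaussian vector is uniform on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(t)]
        norm = math.sqrt(sum(x * x for x in g))
        points.append([x / norm for x in g])
    edges = {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sum(a * b for a, b in zip(points[i], points[j])) >= beta
    }
    return points, edges
```

Unlike the Erdős-Rényi case, the edges here are not independent: two vertices that are both close to a third vertex are themselves likely to be close.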

Random geometric graphs are often proposed as an alternative to Erdős-Rényi random graphs. They are quite well studied theoretically (though not nearly as much as Erdős-Rényi graphs), and very precise results exist regarding their connectivity, clique numbers and other structural properties [17, 27, 13, 5, 16]. For a survey of early results on geometric graphs and the analogy to results on Erdős-Rényi graphs, we refer the reader to [26]. The very interesting question of distinguishing an Erdős-Rényi graph from a random geometric graph has also recently been studied [10]. Such a test offers a way to decide which of the two models better fits a given scenario, which is potentially of great practical use.

As mentioned earlier, the “transitivity” feature led to random geometric graphs being used extensively to model wireless networks (for example, see [18, 7]). Surprisingly, however, to the best of our knowledge, random geometric graphs have never been used to model community detection problems. In this paper we take the first step in this direction. Our main contributions can be classified as follows.

  • We define a random generative model to study canonical problems of community detection, called the geometric block model (GBM). This model takes into account a measure of proximity between nodes and this proximity measure characterizes the likelihood of two nodes being connected when they are in same or different communities. The geometric block model inherits the connectivity properties of the random geometric graphs, in particular the likelihood of “transitivity” in triplet of nodes (or more).

  • We experimentally validate the GBM on various real-world datasets. We show that many practical community structures exhibit properties of the GBM. We also compare these features with the corresponding notions in SBM to show how GBM better models data in many practical situations.

  • We propose a simple, efficient motif-based algorithm for community detection on the GBM. We rigorously show that this algorithm is optimal up to a constant fraction (to be properly defined later) even in the regime of sparse graphs (average degree logarithmic in the number of vertices).

  • The motif-counting algorithms are extensively tested on both synthetic and real-world datasets. They exhibit very good performance on three real datasets compared to the spectral-clustering algorithm (see Section 5). Since simple motif-counting is known to be far from optimal in the stochastic block model (see Section 4), these experiments give further validation to the GBM as a real-world model.

Given any simple random graph model, it is possible to generalize it to a random block model of communities much in line with the SBM. We however stress that the geometric block model is perhaps the simplest possible model of real-world communities that also captures the transitive/geometric features of communities. Moreover, the GBM explains behaviors of many real world networks as we will exemplify subsequently.

Inter-cluster edges:
Area 1   Area 2   same   different
MOD      AI       10     2
ARCH     MOD      6      1
ROB      ARCH     3      0
MOD      ROB      4      0
ML       MOD      7      1

Intra-cluster edges:
Area     same   different
MOD      19     35
ARCH     13     15
ROB      24     16
AI       39     32
ML       14     42

Table 1: The first table counts the number of inter-cluster edges when the authors shared the same affiliation and when they had different affiliations. The second table counts the same for intra-cluster edges.

2 The Geometric Block Model and its Validation

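Let $V \equiv V_1 \sqcup V_2 \sqcup \dots \sqcup V_k$ be the set of vertices that is a disjoint union of $k$ clusters, denoted by $V_1, \dots, V_k$. Given an integer $t \ge 2$, for each vertex $u \in V$, define a random vector $Z_u \in \mathbb{R}^t$ that is uniformly distributed in $S^{t-1} \subset \mathbb{R}^t$, the $(t-1)$-dimensional sphere.

Definition (Geometric Block Model).

Given $V$, $t$ and a set of real numbers $\beta_{i,j} \in [-1, 1]$, $1 \le i \le j \le k$, the geometric block model is a random graph with vertices $V$ in which an edge exists between $v \in V_i$ and $u \in V_j$ if and only if $\langle Z_u, Z_v \rangle \ge \beta_{i,j}$.

The case of $t = 2$: In this paper we analyze our algorithm for $t = 2$ in particular. In this special case, the above definition is equivalent to choosing a random variable $\theta_u$ uniformly distributed in $[0, 2\pi]$, for all $u \in V$. Then there will be an edge between two vertices $u \in V_i$, $v \in V_j$ if and only if $\cos\theta_u \cos\theta_v + \sin\theta_u \sin\theta_v = \cos(\theta_u - \theta_v) \ge \beta_{i,j}$, or equivalently $\min\{|\theta_u - \theta_v|, 2\pi - |\theta_u - \theta_v|\} \le \arccos \beta_{i,j}$. This, in turn, is equivalent to choosing a random variable $X_u$ uniformly distributed in $[0, 1]$ for all $u \in V$, with an edge between two vertices $u \in V_i$, $v \in V_j$ if and only if

$$d_L(X_u, X_v) \equiv \min\{|X_u - X_v|, 1 - |X_u - X_v|\} \le r_{i,j},$$

where $r_{i,j} \in [0, 1/2]$, $1 \le i, j \le k$, are a set of real numbers.

For the rest of this paper, we concentrate on the case when $r_{i,i} = r_s$ for all $i$, which we call the “intra-cluster distance”, and $r_{i,j} = r_d$ for all $i \ne j$, which we call the “inter-cluster distance”, mainly for clarity of exposition. To allow the edge density to be higher inside the clusters than across the clusters, assume $r_s \ge r_d$.

The main problem that we seek to address is the following. Given the adjacency matrix of a geometric block model with $k$ clusters, and $t$, $r_d$, $r_s$ with $r_s \ge r_d$, find the partition $V_1, \dots, V_k$.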
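For the one-dimensional (circle) form of the model, a minimal Python sampler might look as follows. It is a sketch under assumptions: each vertex gets a uniform position on a circle of circumference 1, and `sample_gbm`, `circle_distance`, `sizes`, `r_s` (intra-cluster distance) and `r_d` (inter-cluster distance) are names chosen for this example.

```python
import random


def circle_distance(x, y):
    """Distance between two points on a circle of circumference 1."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def sample_gbm(sizes, r_s, r_d, seed=0):
    """Sample a geometric block model graph on the circle (sketch).

    sizes: list of cluster sizes (hypothetical parameter name).
    r_s:   distance threshold for pairs in the same cluster.
    r_d:   distance threshold for pairs in different clusters.
    Returns (labels, positions, edges), edges as pairs (u, v) with u < v.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    positions = [rng.random() for _ in labels]
    n = len(labels)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            radius = r_s if labels[u] == labels[v] else r_d
            # An edge exists exactly when the two points are close enough.
            if circle_distance(positions[u], positions[v]) <= radius:
                edges.add((u, v))
    return labels, positions, edges
```

The only randomness is in the positions; given the positions, the edge set is deterministic, which is the source of the edge correlations discussed later in the paper.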

We next give two examples of real datasets that motivate the GBM. In particular, we experiment with two different types of real world datasets in order to verify our hypothesis about geometric block model and the role of distance in the formation of edges. The first one is a dataset with academic collaboration, and the second one is a product purchase metadata from Amazon.

2.1 Motivation of GBM: Academic Collaboration

We consider the collaboration network of academicians in Computer Science in 2016 (data obtained from csrankings.org). According to the area of expertise of the authors, we consider five different communities: Data Management (MOD), Machine Learning and Data Mining (ML), Artificial Intelligence (AI), Robotics (ROB), and Architecture (ARCH). If two authors share the same affiliation, or shared an affiliation in the past, we assume that they are geographically close. We hypothesize that two authors in the same community might collaborate even when they are geographically far apart, whereas two authors in different communities are likely to collaborate only if they share the same affiliation (or are geographically close). Table 1 describes the number of edges within and across the communities. It is evident that authors from the same community are likely to collaborate irrespective of affiliation, while authors from different communities collaborate much more frequently when they share affiliations or are geographically close. This indicates that inter-cluster edges are likely to form only when the distance between the nodes is quite small, which is the behavior the GBM is designed to capture.

2.2 Motivation of GBM: Amazon Metadata

The next dataset that we use in our experiments is the Amazon product metadata on SNAP (https://snap.stanford.edu/data/amazon-meta.html), which has 548552 products, each of one of the following types: {Books, Music CD’s, DVD’s, Videos}. Moreover, each product has a list of attributes; for example, a book may have attributes like “General”, “Sermon”, “Preaching”. We consider the co-purchase network over these products. We make two observations here: (1) edges are formed (that is, items are co-purchased) more frequently between similar products, where we measure similarity by the number of attributes the products have in common, and (2) two products that share an edge have more common neighbors (the number of items that are bought along with both products) than two products with no edge between them.

Figures 1 and 2 show, respectively, the average similarity of products that were bought together and of products that were not bought together. From the distributions, it is quite evident that edges in a co-purchase network are formed according to distance, a salient feature of random geometric graphs and of the GBM.

Figure 1: Histogram: similarity of products bought together (mean )
Figure 2: Histogram: similarity of products not bought together (mean)
Figure 3: Histogram of common neighbors of edges and non-edges in the co-purchase network, from left to right: Book-DVD, Book-Book, DVD-DVD

We next take an equal number of product pairs inside Book (also inside DVD, and across Book and DVD) that have an edge between them and that do not have an edge, respectively. Figure 3 shows that the number of common neighbors when two products share an edge is much higher than when they do not; in fact, almost all product pairs that do not have an edge between them also do not share any common neighbor. This again points strongly towards the GBM because of its transitivity property. It also suggests that the SBM is not a good model for this network, since in the SBM, whether two nodes have common neighbors is independent of whether they share an edge.

Difference between SBM and GBM. It is important to stress that the network structures generated by the SBM and the GBM are quite different, and it is significantly more difficult to analyze any algorithm or lower bound on the GBM than on the SBM. This difficulty stems from the highly correlated edge generation in the GBM (while edges are independent in the SBM). For this reason, analyses of the sphere-comparison algorithm and spectral methods for clustering on the GBM cannot be derived as straightforward adaptations. Even for simple algorithms, a property that can be seen immediately for the SBM will still require a proof for the GBM.

3 The Motif-Counting Algorithm

Suppose we are given a graph $G = (V, E)$ with two disjoint clusters, generated according to the GBM with parameters $r_s$ and $r_d$. Our clustering algorithm is based on counting motifs, where a motif is simply defined as a configuration of triplets in the graph. Let us explain this principle with one particular motif, a triangle. For any two vertices $u$ and $v$ in $V$, where $(u, v)$ is an edge, we count the total number of common neighbors of $u$ and $v$. We show that, under a suitable gap between $r_s$ and $r_d$ (made precise in Section 4), this count is different when $u$ and $v$ belong to the same cluster, compared to when they belong to different clusters. We assume $G$ is connected, because otherwise it is impossible to recover the clusters with certainty. For every pair of vertices in the graph that share an edge, we decide whether they are in the same cluster or not by this count of triangles. In reality, we do not have to check every such pair; instead, we can stop once the checked edges form a spanning tree. At this point, we can transitively deduce the partition of the nodes into clusters.

The main new idea of this algorithm is to use this triangle count (or motif count in general), since it carries significantly more information about the connectivity of the graph than an edge count. We could also use higher-order statistics (such as two-hop common neighbors) at the expense of increased complexity. Surprisingly, the simple greedy algorithm that relies on triplets can separate clusters when $r_d$ and $r_s$ are $\Omega(\frac{\log n}{n})$, which is also a minimal requirement for connectivity of random geometric graphs [26]. Therefore this algorithm is optimal up to a constant factor. It is interesting to note that this motif-counting algorithm is not optimal for the SBM (as we observe); in particular, it will not detect the clusters in the sparse regime of logarithmic average degree, whereas it does so for the GBM.

The pseudocode of the algorithm is given in Algorithm 1. The algorithm looks at individual pairs of vertices to decide whether they belong to the same cluster or not. We go over pairs of vertices and label them same/different until we have enough labels to partition the graph into clusters.

At any stage, the algorithm picks an unassigned node $v$ and queries it with an adjacent node $u$ that has already been assigned to one of the clusters. Note that it is always possible to find such a vertex, because otherwise the graph would not be connected. To decide whether these two points $u$ and $v$ belong to the same cluster, the algorithm calls a subroutine named process. The process function counts the number of common neighbors of $u$ and $v$ to make a decision. The node $v$ is assigned to its respective cluster depending on the output of the process subroutine. This procedure continues until all nodes in $V$ are assigned to one of the clusters.

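Algorithm 1 Cluster recovery in GBM
Input: GBM graph $G = (V, E)$, parameters $r_s$, $r_d$
Output: partition $V = V_1 \sqcup V_2$
    Choose any $u \in V$
    $V_1 \leftarrow \{u\}$, $V_2 \leftarrow \emptyset$
    while $V \ne V_1 \sqcup V_2$ do
        Choose $(u, v) \in E$ such that $u \in V_1 \sqcup V_2$ and $v \in V \setminus (V_1 \sqcup V_2)$
        if process($u$, $v$, $r_s$, $r_d$) then
            if $u \in V_1$ then
                $V_1 \leftarrow V_1 \cup \{v\}$
            else
                $V_2 \leftarrow V_2 \cup \{v\}$
            end if
        else
            if $u \in V_1$ then
                $V_2 \leftarrow V_2 \cup \{v\}$
            else
                $V_1 \leftarrow V_1 \cup \{v\}$
            end if
        end if
    end while

Algorithm 2 process
Input: $u$, $v$, $r_s$, $r_d$
Output: true/false
    count $\leftarrow |\{z : (z, u) \in E, (z, v) \in E\}|$
    if count is closer to $E_S(r_d, r_s)$ than to $E_D(r_d, r_s)$ then
        return true
    end if
    return false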

The process function counts the number of common neighbors of two nodes and then compares the difference of the count with two functions of $r_d$ and $r_s$, called $E_D$ and $E_S$.
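The two routines above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the single `threshold` argument is a hypothetical stand-in for the comparison against the two functions used by process, vertices are explored in breadth-first order rather than chosen arbitrarily, and the graph is assumed to be connected with vertices numbered 0 to n-1.

```python
from collections import deque


def common_neighbors(adj, u, v):
    """Number of vertices adjacent to both u and v (triangles on edge u-v)."""
    return len(adj[u] & adj[v])


def recover_two_clusters(n, edges, threshold):
    """Label vertices 0/1 by triangle counting along a spanning tree (sketch).

    n:         number of vertices, numbered 0..n-1.
    edges:     iterable of undirected edges (u, v).
    threshold: hypothetical cut-off; an edge with at least this many
               common neighbors is treated as intra-cluster.
    Returns a dict mapping each reached vertex to a cluster label 0 or 1.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    label = {0: 0}          # start from an arbitrary vertex in cluster 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v in label:
                continue
            # Decide same/different from the triangle count of edge (u, v).
            same = common_neighbors(adj, u, v) >= threshold
            label[v] = label[u] if same else 1 - label[u]
            queue.append(v)
    return label
```

Only the edges of the breadth-first spanning tree are ever tested, so at most n-1 calls to `common_neighbors` are made; vertices outside the component of vertex 0 are left unlabeled.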

We have compiled the distribution of the number of common neighbors, along with other motifs (other patterns of triplets, given the pair $u, v$), in Table 2. We provide the values of $E_S$ and $E_D$ in Theorem 1 for the regime of $r_s, r_d = \Theta(\frac{\log n}{n})$. In this table we have assumed that there are only two clusters of equal size. The functions change when the cluster sizes are different; the analysis described in later sections can be used to calculate the new function values. In the table, $u \sim v$ means $u$ and $v$ are in the same cluster.

Motif: Distribution of count () Distribution of count ()
Motif 1:
Motif 2:
Motif 3:
Motif 4:
Table 2: Distribution of motif count for an edge conditioned on the distance between them , when there are two equal sized clusters. Here denotes a binomial random variable with mean .

Similarly, the process function can be run on other sets of motifs by fixing two nodes. When considering a larger set of motifs, the process function can take a majority vote over the decisions obtained from the different motifs. Note that our algorithm counts motifs only for edges, and does not count motifs for more than $n - 1$ edges, as there are only $n$ vertices to be assigned to clusters.

Remark 1.

If we are given $k$ clusters ($k > 2$), our analysis can be extended to calculate new values of $E_S$ and $E_D$. If there is a clear gap between the two values, we can extend Algorithm 1 to identify the true assignment of each node.

4 Analysis of the Algorithm

The critical observation for analyzing the motif-counting algorithm is the following. Given a GBM graph $G(V, E)$ with two clusters $V = V_1 \sqcup V_2$ and a pair of vertices $u, v \in V$, the events that another vertex $z$ is a common neighbor of both $u$ and $v$, given $(u, v) \in E$, are dependent across different $z$ (this is not true in the SBM); however, given the distance $x$ between the corresponding random variables $X_u$ and $X_v$, the events are independent. Moreover, the probabilities of these events are different when $u$ and $v$ are in the same cluster and when they are in different clusters. Therefore the counts of common neighbors are going to be different, and substantially separated with high probability, in the two cases. This leads the function process to correctly characterize two vertices as being from the same or from different clusters with high probability.

Let us now show this more formally. We have the following lemmas for a GBM graph $G(V, E)$ with two equal-sized (unknown) clusters $V = V_1 \sqcup V_2$, and parameters $r_s, r_d$.

Lemma 1.

For any two vertices belonging to the same cluster, the event is independent with conditional on the distance between and , .

Proof.

Let us assume that belong to the same cluster as that of (the proof is similar for other cases too, and we omit those cases here). The event given is equivalent to having both and (the random variable corresponding to vertices and respectively) within a range of if and can never happen if . Hence for .

On the other hand, the event given is equivalent to having within a range of if and 0 otherwise. Similarly the event given is equivalent to having within a range of . Therefore . ∎

This observation leads to the derivation of distributions of counts of triangles involving for the cases when and are in the same cluster and when they are not.

Lemma 2.

For any two vertices belonging to the same cluster and , the count of common neighbors is a random variable distributed according to if and according to if , where is a binomial random variable with mean .

Lemma 3.

For any two vertices belonging to different clusters and , the count of common neighbors is a random variable distributed according to when and according to when .

Here let us give the proof of Lemma 2. The proof of Lemma 3 will follow similarly. These expressions can also be generalized when the clusters are of unequal sizes, but we omit those for clarity of exposition.

Proof of Lemma 2.

Let be the uniform random variable associated with . Let us also denote by . Without loss of generality, assume . For any vertex , let be the event that is a common neighbor given that the vertices and have an edge and the distance between those vertices is . For ,

For ,

Since we are conditioning on the fact that the vertices and have an edge, can take a maximum value of . Now since there are points in and points in , we have the statement of the lemma. ∎
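One way to see where binomial success probabilities of this kind come from is an arc-overlap computation on the circle. On a circle of circumference 1, if two points are at distance x, the set of positions within distance r of both is an arc of length 2r - x when x is at most 2r, and is empty otherwise (assuming 2r + x is at most 1, so that the two arcs do not also meet on the far side of the circle). The sketch below checks this elementary fact by simulation; the function names are hypothetical and the code is a sanity check rather than part of the proof.

```python
import random


def circle_distance(x, y):
    """Distance between two points on a circle of circumference 1."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def overlap_probability(x, r):
    """Probability that a uniform point lies within r of both endpoints.

    The endpoints are at circle distance x; assumes 2 * r + x <= 1.
    """
    return max(0.0, 2.0 * r - x)


def estimate_overlap(x, r, trials=200000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = rng.random()
        # z is a "common neighbor" if it is within r of both 0 and x.
        if circle_distance(z, 0.0) <= r and circle_distance(z, x) <= r:
            hits += 1
    return hits / trials
```

With the distance x fixed, each remaining vertex falls in the overlap arc independently, which is why the count of common neighbors from a given cluster is binomial once we condition on x.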

The proof of Lemma 3 is similar and we defer it to Appendix 6.1.

Consider the case when . The above lemmas show that, for all values of the distance $x$, the expected count of the number of triangles involving $(u, v)$ is higher when $u$ and $v$ belong to the same cluster than when they belong to different clusters. By leveraging the concentration of binomial random variables, we bound the count of the number of triangles in these two cases. We use Lemma 2 to first estimate the minimum value of the triangle count when $u$ and $v$ belong to the same cluster, and Lemma 3 to estimate the maximum value of the triangle count when $u$ and $v$ belong to different clusters. Our algorithm will correctly resolve whether two points are in the same cluster or not if the minimum value in the former case is higher than the maximum value in the latter. While more general statements are possible, we give a theorem concentrating on the special case when $r_s, r_d = \Theta(\frac{\log n}{n})$, which is of the order of the connectivity threshold of geometric random graphs [26].

Theorem 1.

Let and , , and Algorithm 3 with and

can recover the clusters accurately with a probability of if

Proof.

We need to consider the case of from Lemma 2 and Lemma 3. Let denote the random variable that equals the number of common neighbors of two nodes . Let us also denote and , where means and are in the same cluster. We can easily find and from Lemmas 2, 3. We see that,

The value of is greater than that of for all values of when . We try to bound the values of in these two cases and then achieve the condition of correct resolution. Given a fixed , since is a sum of independent binary random variables, using the Chernoff bound, when . Now when belong to the same cluster and , with probability at least ,

Using Chernoff bound, we also know that when . Hence, with probability at least , is at most when belong to different clusters.

We calculate the minimum value of over all values of to find the value closest to . When , is a decreasing function with the minimum value of at . Plugging in , and we get that the algorithm will be successful to label correctly with probability as long as,

Now we need the correct assignment of vertices for pairs of vertices (according to Algorithm 3). Applying union bound over distinct pairs guarantees the probability of recovery as . ∎

(a) Triangle motif: varying $b$ and the minimum value of $a$ that satisfies the accuracy bound.
(b) f-score with varying $a$, fixing $b$.
(c) Fraction of nodes misclassified.
Figure 4: Results of the motif-counting algorithm on a synthetic dataset with 5000 nodes.
                  Total no.                     Accuracy                     Running Time (sec)
Dataset           of nodes   T_1   T_2   T_3   Motif-Counting   Spectral    Motif-Counting   Spectral
Political Blogs   1222       20    2     1     0.788            0.53        1.62             0.29
DBLP              12138      10    1     2     0.675            0.63        3.93             18.077
LiveJournal       2366       20    1     1     0.7768           0.64        0.49             1.54

Table 3: Performance on real-world networks. "Spectral" denotes spectral clustering; $T_1$, $T_2$, $T_3$ are the three thresholds described in the Experimental Setting.

Instead of relying only on the triangle (or common-neighbor) motif, we can consider other different motifs (as listed in Table 2) and use them to make similar analysis. Aggregating the different motifs by taking a majority vote decision may improve the results experimentally but it is difficult to say anything theoretically since the decisions of the different motifs are not independent. We refer the reader to Appendix 6.2 for the detailed analysis of incorporating other motifs to obtain analogous theorems.

Remark 2.

Instead of using the Chernoff bound, we could have used a better concentration inequality (such as Poisson approximation) in the above analysis to get a tighter condition on the constants. We again preferred to keep things simple.

Remark 3 (GBM for and above).

For GBM with , to find the number of common neighbors of two vertices, we need to find out the area of intersection of two spherical caps on the sphere. It is possible to do that. It can be seen that, our algorithm will successfully identify the clusters as long as again when the constant terms satisfy some conditions. However tight characterization becomes increasingly difficult. For general , our algorithm should be successful when , which is also the regime of connectivity threshold.

Remark 4 (More than two clusters).

When there are more than two clusters, the same analysis technique is applicable and we can estimate the expected number of common neighbors. This generalization can be straightforward but tedious.

Motif counting algorithm for SBM. While our algorithm is near optimal for GBM in the regime of , it is far from optimal for the SBM in the same regime of average degree. Indeed, by using simple Chernoff bounds again, we see that the motif counting algorithm is successful for SBM with inter-cluster edge probability and intra-cluster probability , when . The experimental success of our algorithm in real sparse networks therefore somewhat enforce the fact that GBM is a better model for those network structures than SBM.

5 Experimental Results

In addition to the validation experiments in Sections 2.1 and 2.2, we also conducted in-depth experiments with our proposed model and techniques over a set of synthetic and real-world networks. Additionally, we compared the efficacy and efficiency of our motif-counting algorithm with the popular spectral clustering algorithm using normalized cuts (http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering) and the correlation clustering algorithm [6].

Real Datasets. We use three real datasets described below.

  • Political Blogs. [4] It contains a list of political blogs from 2004 US Election classified as liberal or conservative, and links between the blogs. The clusters are of roughly the same size with a total of 1200 nodes and 20K edges.

  • DBLP. [29] The DBLP dataset is a collaboration network where the ground truth communities are defined by the research community. The original graph consists of roughly 0.3 million nodes. We process it to extract the top two communities of size 4500 and 7500 respectively. This is given as input to our algorithm.

  • LiveJournal. [23] The LiveJournal dataset is a free online blogging social network of around 4 million users. Similar to DBLP, we extract the top two clusters of sizes 930 and 1400 which consist of around 11.5K edges.

We have not used the academic collaboration (Section 2.1) dataset here because it is quite sparse and below the connectivity threshold regime of both GBM and SBM.

Synthetic Datasets. We generate synthetic datasets of different sizes according to the GBM with $k = 2$ and $t = 2$ for a wide spectrum of values of $r_s$ and $r_d$; specifically, we focus on the sparse region where $r_s = \frac{a \log n}{n}$ and $r_d = \frac{b \log n}{n}$, with variable values of $a$ and $b$.

Experimental Setting. For real networks, it is difficult to calculate an exact threshold, as the exact values of $r_s$ and $r_d$ are not known. Hence, we follow a three-step approach. First, using a somewhat large threshold $T_1$, we sample a subgraph $S$ such that $u, v$ will be in $S$ if there is an edge between $u$ and $v$ and they have at least $T_1$ common neighbors. Second, we attempt to recover the subclusters inside this subgraph by following our algorithm with a small threshold $T_2$. Finally, for each node $x$ that is not part of $S$, we select each $u \in S$ that has an edge with $x$ and use a threshold of $T_3$ to decide if $u$ and $x$ should be in the same cluster; the final decision is made by taking a majority vote. More sophisticated methods could be layered over this algorithm to improve the results further, which is beyond the scope of this work.

We use the popular f-score metric which is the harmonic mean of precision (fraction of number of pairs correctly classified to total number of pairs classified into clusters) and recall (fraction of number of pairs correctly classified to the total number of pairs in the same cluster for ground truth), as well as the node error rate for performance evaluation. A node is said to be misclassified if it belongs to a cluster where the majority comes from a different ground truth cluster (breaking ties arbitrarily). Following this, we use the above described metrics to compare the performance of different techniques on various datasets.
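The two evaluation metrics described above can be computed as in the following Python sketch. The function names are hypothetical; `predicted` and `truth` are assumed to be dictionaries mapping each node to its predicted cluster and its ground-truth cluster, and the pairwise definitions of precision and recall follow the description in the text.

```python
from collections import Counter
from itertools import combinations


def pairwise_f_score(predicted, truth):
    """Harmonic mean of pairwise precision and recall (sketch)."""
    tp = fp = fn = 0
    for u, v in combinations(sorted(predicted), 2):
        same_pred = predicted[u] == predicted[v]
        same_true = truth[u] == truth[v]
        if same_pred and same_true:
            tp += 1      # pair correctly placed in one cluster
        elif same_pred:
            fp += 1      # pair clustered together but truly separate
        elif same_true:
            fn += 1      # pair truly together but split by the clustering
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def node_error_rate(predicted, truth):
    """Fraction of nodes whose true label differs from the majority
    true label of their predicted cluster (ties broken arbitrarily)."""
    members = {}
    for node, cluster in predicted.items():
        members.setdefault(cluster, []).append(node)
    errors = 0
    for nodes in members.values():
        majority, _ = Counter(truth[x] for x in nodes).most_common(1)[0]
        errors += sum(1 for x in nodes if truth[x] != majority)
    return errors / len(predicted)
```

For example, with predicted clusters {0, 1, 2} and {3, 4, 5} and ground-truth clusters {0, 1} and {2, 3, 4, 5}, the pairwise precision is 4/6, the recall is 4/7, and exactly one node (node 2) is misclassified.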

Results.

We compared our algorithm with the spectral clustering algorithm, where we extracted two eigenvectors in order to extract two communities. Table 3 shows that our algorithm achieves an accuracy as high as 0.788 (on the Political Blogs dataset). Spectral clustering performed worse than our algorithm on all real-world datasets, with its worst accuracy of 53% on the Political Blogs dataset. The correlation clustering algorithm generates many small clusters, leading to a very low recall, and performs much worse than the motif-counting algorithm for the whole spectrum of parameter values.

We can observe in Table 3 that our algorithm is much faster than the spectral clustering algorithm on the larger datasets (LiveJournal and DBLP), which indicates that the motif-counting algorithm is more scalable than spectral clustering. The spectral clustering algorithm also works very well on synthetically generated SBM networks, even in the sparse regime [22, 28]. The superior performance of the simple motif clustering algorithm on the real networks provides further validation of the GBM over the SBM. Correlation clustering takes 8-10 times longer than the motif-counting algorithm across the range of its parameters. We also compared our algorithm with the Newman algorithm [15], which performs really well on the LiveJournal dataset (98% accuracy) but is extremely slow and performs much worse on the other datasets. This is because the LiveJournal dataset has two well-defined subsets of vertices with very few inter-cluster edges. The reason for the worse performance of our algorithm on this dataset is the sparseness of the graph. If we create a subgraph by removing all nodes of degrees and , we get accuracy with our algorithm. Finally, our algorithm is easily parallelizable, which could yield further improvements. These results support the efficiency and effectiveness of motif-counting.

(a) f-score with varying $a$, fixed $b$.
(b) Fraction of nodes misclassified.
Figure 5: Results of spectral clustering on a synthetic dataset with 5000 nodes.

We observe similar gains on synthetic datasets. Figures 4(a), 4(b) and 4(c) report results on the synthetic datasets with 5000 nodes. Figure 4(a) plots the minimum gap between $a$ and $b$ that guarantees exact recovery according to Theorem 1 against the minimum value of $a$, for varying $b$, for which we were able to recover the clusters exactly in experiments (using only the triangle motif). Empirically, our algorithm performs much better than the theoretical bounds suggest, because the concentration inequalities applied in Theorem 1 assume the worst value of the distance between the pair of vertices under consideration. We also see a clear threshold behavior in both the f-score and the node error rate in Figures 4(b) and 4(c). We have also performed spectral clustering on this 5000-node synthetic dataset (Figures 5(a) and 5(b)). Compared to the plots in Figures 4(b) and 4(c), these show suboptimal performance, indicating the relative ineffectiveness of spectral clustering on the GBM compared to the motif-counting algorithm.

References

  • [1] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Trans. Information Theory, 62(1):471–487, 2016.
  • [2] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 670–688. IEEE, 2015.
  • [3] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems, pages 676–684, 2015.
  • [4] L. A. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In 3rd International Workshop on Link Discovery, pages 36–43. ACM, 2005.
  • [5] C. Avin and G. Ercal. On the cover time and mixing time of random geometric graphs. Theoretical Computer Science, 380(1-2):2–22, 2007.
  • [6] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
  • [7] C. Bettstetter. On the minimum node degree and connectivity of a wireless multihop network. In Proceedings of the 3rd ACM international symposium on Mobile ad hoc networking & computing, pages 80–91. ACM, 2002.
  • [8] B. Bollobás. Random graphs. In Modern Graph Theory, pages 215–252. Springer, 1998.
  • [9] S. Boucheron, G. Lugosi, and O. Bousquet. Concentration inequalities.
  • [10] S. Bubeck, J. Ding, R. Eldan, and M. Z. Rácz. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 2016.
  • [11] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv:1501.05021, 2015.
  • [12] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [13] L. Devroye, A. György, G. Lugosi, F. Udina, et al. High-dimensional random geometric graphs and their clique number. Electronic Journal of Probability, 16:2481–2508, 2011.
  • [14] M. E. Dyer and A. M. Frieze. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms, 10(4):451–489, 1989.
  • [15] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
  • [16] A. Goel, S. Rai, and B. Krishnamachari. Monotone properties of random geometric graphs have sharp thresholds. Annals of Applied Probability, pages 2535–2552, 2005.
  • [17] P. Gupta and P. R. Kumar. Critical power for asymptotic connectivity. In 37th IEEE Conference on Decision and Control, volume 1, pages 1106–1110. IEEE, 1998.
  • [18] M. Haenggi, J. G. Andrews, F. Baccelli, O. Dousse, and M. Franceschetti. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications, 27(7), 2009.
  • [19] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797, 2016.
  • [20] B. E. Hajek, Y. Wu, and J. Xu. Computational lower bounds for community detection on random graphs. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 899–928, 2015.
  • [21] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
  • [22] J. Lei, A. Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
  • [23] J. Leskovec, L. A. Adamic, and B. A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007.
  • [24] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In 17th international conference on World Wide Web, pages 695–704. ACM, 2008.
  • [25] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In 47th Annual ACM Symposium on Theory of Computing, pages 69–75. ACM, 2015.
  • [26] M. Penrose. Random geometric graphs. Number 5. Oxford University Press, 2003.
  • [27] M. D. Penrose. On a continuum percolation model. Advances in applied probability, 23(03):536–556, 1991.
  • [28] K. Rohe, S. Chatterjee, B. Yu, et al. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
  • [29] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1):181–213, 2015.

6 Appendix

For the analysis, let be the uniform random variable associated with . Also recall that .

6.1 Proof of Lemma 3

Proof of Lemma 3.

Here are from different clusters. For any vertex , let be the event that is a common neighbor. For ,

Now since there are points in , we have the statement of the lemma. ∎

6.2 Results for other motifs

Next, we state two lemmas for a GBM graph with two unknown clusters , and parameters , that consider motifs other than triangles (Motif 1). These results are used to populate Table 2. When we run Algorithm 3 with other motifs, the subroutine process uses the corresponding motif to compute the variable ‘count’. Otherwise, the algorithm remains the same.

Motif 2 and Motif 3

Lemma 4.

For any two vertices belonging to the same cluster and , the number of nodes forming Motif 2 (see Table 2) with and (i.e., neighbors of and non neighbors of ), is a random variable distributed according to , where is a binomial random variable with mean .

Proof.

Without loss of generality, assume . For any vertex , let be the event that is a neighbor of and non neighbor of . For ,

For , we have,

Now since there are points in and points in , we have the statement of the lemma. ∎

Lemma 5.

For any two vertices belonging to different clusters and , the number of nodes forming Motif 2 (see Table 2) with and (i.e., neighbors of and non neighbors of ), is a random variable distributed according to , assuming .

Proof.

For any vertex , let be the event that is a neighbor of and a non neighbor of . For

Now for , there cannot be an edge with and no edge with because . Since there are points in , we have the statement of the lemma. ∎
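The geometric probabilities behind Lemmas 4 and 5 can be sanity-checked numerically. The sketch below assumes a one-dimensional GBM in which vertex positions are uniform on a circle of circumference 1 and two vertices are adjacent when their circular distance is at most a connection radius. These modelling details, the symbols `x`, `r_u`, `r_v`, and the function names are assumptions made for illustration, since the corresponding formulas are not legible in the text above.

```python
import random


def circle_distance(a, b):
    """Distance between two points on a circle of circumference 1."""
    gap = abs(a - b) % 1.0
    return min(gap, 1.0 - gap)


def estimate_motif2_probability(x, r_u, r_v, trials=200_000, seed=0):
    """Monte Carlo estimate of P(z is a neighbor of u and a non-neighbor of v).

    Assumptions: u sits at position 0, v at position x, and z is uniform on
    the circle. z is adjacent to u when its distance to u is at most r_u, and
    adjacent to v when its distance to v is at most r_v. The radii depend on
    whether z shares a cluster with u or with v.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = rng.random()
        if circle_distance(z, 0.0) <= r_u and circle_distance(z, x) > r_v:
            hits += 1
    return hits / trials
```

Under these assumptions, a pair at distance 0.03 with a common radius of 0.2 should give an estimate close to 0.03, the length of the arc that is within reach of u but not of v. With `r_u = 0.05` and `r_v = 0.2` the estimate is exactly 0, because every point within 0.05 of u is also within 0.2 of v. This mirrors the "there cannot be an edge" step in the proof of Lemma 5.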

Theorem 2 (Motif 2 or 3).

If and , , Algorithm 3 with and