Neural node embeddings have recently emerged as a powerful representation for supervised learning tasks involving graph-structured data. We leverage this recent advance to develop a novel algorithm for unsupervised community discovery in graphs. Through extensive experimental studies on simulated and real-world data, we demonstrate that the proposed approach consistently improves over the current state of the art. Specifically, our approach empirically attains the information-theoretic limits for community recovery under the benchmark Stochastic Block Model for graph generation and exhibits better stability and accuracy than both Spectral Clustering and Acyclic Belief Propagation near the community recovery limits.
Learning a representation for nodes in a graph, also known as node embedding, has been an important tool for extracting features that can be used in machine learning problems involving graph-structured data [1, 2, 3, 4]. Perhaps the most widely adopted node embedding is the one based on the eigendecomposition of the adjacency matrix or the graph Laplacian [2, 5, 6]. Recent advances in word embeddings for natural language processing, such as word2vec [7], have inspired the development of analogous embeddings for nodes in graphs [8, 3]. These so-called "neural" node embeddings have been applied to a number of supervised learning problems such as link prediction and node classification and have demonstrated state-of-the-art performance [8, 3, 4].
In contrast to applications to supervised learning problems in graphs, in this work we leverage the neural embedding framework to develop an algorithm for the unsupervised community discovery problem in graphs [9, 10, 11, 12]
. The key idea is straightforward: learn node embeddings such that vectors of similar nodes are close to each other in the latent embedding space. Then, the problem of discovering communities in a graph can be solved by finding clusters in the embedding space.
We focus on non-overlapping communities and validate the performance
of the new approach through a comprehensive set of experiments on both
synthetic and real-world data. Results demonstrate that the
performance of the new method is consistently superior to those of
spectral methods across a wide range of graph sparsity levels.
In fact, we find that the proposed algorithm can empirically attain
the information-theoretic phase transition
information-theoretic phase transitionthresholds for exact and weak recovery of communities under the Stochastic Block Model (SBM) [13, 14, 11, 15]. SBM is a canonical probabilistic model for random graphs with latent structure and has been widely used for empirical validation and theoretical analysis of community detection algorithms [16, 17, 10, 9]. In particular, when compared to the best known algorithms based on Acyclic Belief Propagation (ABP) that can provably detect communities at the information-theoretic limits [15, 11, 14], our approach has consistently better accuracy. In addition, we find that ABP is very sensitive to random initialization and exhibits high variability. In contrast, our approach is stable to both random initialization and a wide range of algorithm parameter settings.
Our implementation and scripts to recreate all the results in this paper are available at https://github.com/cy93lin/SBM_node_embedding
The community detection problem has been extensively studied in the literature [9, 10, 12, 18]. It has important applications in various real-world networks encountered in sociology, biology, signal processing, statistics, and computer science. One way to systematically evaluate the performance of a community detection algorithm and establish theoretical guarantees is to consider a generative model for graphs with a latent community structure. The most widely adopted model is the classic Stochastic Block Model. The SBM was first proposed in [16, 20, 21] as a canonical model for studying community structure in networks, and various community detection algorithms based on it have been proposed, e.g., [22, 5, 23, 24]. Among these approaches, algorithms based on the graph spectrum and on semidefinite programming relaxations of suitable graph-cut objectives have been extensively studied [23, 24]. In particular, the phase transition behavior of spectral graph clustering for a generative model that includes the SBM as a special case has also been established recently. Graph-statistics based algorithms such as modularity optimization, and their connection to Bayesian models on graphs, have also been studied. Only very recently have the information-theoretic limits for community recovery under the general SBM been established [13, 11, 15]. In [11, 15], a belief-propagation based algorithm has been shown to asymptotically detect the latent communities in an SBM and achieve the information-theoretic limits. It has also been shown that graph-spectrum based algorithms cannot achieve the information-theoretic limits for recovering communities in SBM models.
The use of a neural network to embed natural-language words into Euclidean space was made popular by the famous "word2vec" algorithm [7, 27]. In these works, each word in a vocabulary is represented as a low-dimensional vector in Euclidean space. These representations are learned in an unsupervised fashion using large text corpora such as Wikipedia articles. The neural word embedding idea was later adapted to embed nodes from a graph into Euclidean space and to use the node embedding vectors to solve supervised learning tasks such as node attribute prediction and link prediction. This method has also been used to solve semi-supervised learning tasks associated with graphs. The node embeddings are computed by viewing nodes as "words", forming "sentences" via random paths on the graph, and then employing a suitable neural embedding technique for words. Different ways of creating "sentences" of nodes were further explored in subsequent work, where a parametric family of node transition probabilities was proposed to generate the random paths. These transition probabilities need node and/or edge labels and are therefore only suitable for supervised tasks.
Our work is most closely related to [3, 8]. While [3, 8] make use of node embeddings in supervised learning problems such as node attribute prediction and link prediction, this paper focuses on the unsupervised community detection problem. We also explore the information-theoretic limits for community recovery under the classic SBM generative model and empirically show that our algorithm can achieve these limits.
Random walks have been used in a number of ways to detect communities. Seminal work proposed to use a random walk and its steady-state distribution for graph clustering. Subsequent work further proposed to exploit multi-step transition probabilities between nodes for clustering. Our work can be viewed as implicitly factorizing a Gram matrix related to the multi-step transition probabilities between nodes (cf. Sec. III). This is different from the prior literature. The idea of converting a graph into a time-series signal, or a time-series signal into a graph, has also been studied in the signal processing community and applied to the problem of graph filtering.
Let G = (V, E) be a graph with n nodes and K latent communities. We focus on non-overlapping communities and denote by c_i the latent community assignment of node i. Given G, the goal is to infer the community assignments of all nodes.
Our approach is to learn, in an unsupervised fashion, a low-dimensional vector representation for each node that captures its local neighborhood structure. These vectors are referred to as node embeddings. The premise is that if done correctly, nodes from the same community will be close to each other in the embedding space. Then, communities can be found via clustering of the node embeddings.
Skip-gram word-embedding framework: In order to construct the node embedding, we proceed as in the skip-gram-based negative sampling framework for word embedding which was recently developed in the natural language processing literature [7, 3]. A document is an ordered sequence of words from a fixed vocabulary. A w-skip-bigram is an ordered pair of words that occur within a distance of w words from each other within a sentence in the document. A document is then viewed as a multiset D of all its w-skip-bigrams, which are assumed to be generated in an independent and identically distributed (IID) fashion according to a joint probability that is related to the embedding vectors, in d-dimensional Euclidean space, of the two words in the pair.
Now consider a multiset D′ of w-skip-bigrams which are generated in an IID fashion according to the product of the unigram (single-word) probabilities of the two words. The unigram probabilities can be approximated via the empirical frequencies of individual words (unigrams) in the document.
The w-skip-bigrams in D are labeled as positive samples and those in D′ are labeled as negative samples. In the negative sampling framework [7, 3], the posterior probability that an observed w-skip-bigram will be labeled as positive is modeled as a logistic (sigmoid) function of the inner product of the embedding vectors of its two words (Eq. (1)).
Under this model, the likelihood ratio of the positive to the negative label becomes proportional to the exponential of the inner product of the two embedding vectors. Thus the negative sampling model posits that the ratio of the odds of observing a w-skip-bigram in a bonafide document to the odds of observing it due to pure chance is exponentially related to the inner product of the underlying embedding vectors of the words in the w-skip-bigram.
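The sigmoid posterior and the resulting odds ratio can be sketched in a few lines of Python. This is an illustrative sketch (the function names and plain-list vector format are our own), not the paper's implementation:

```python
import math

def posterior_positive(u, v):
    """P(label = +1 | word pair) under the negative-sampling model:
    a sigmoid of the inner product of the two embedding vectors."""
    score = sum(ui * vi for ui, vi in zip(u, v))
    return 1.0 / (1.0 + math.exp(-score))

def odds_ratio(u, v):
    """Odds of a positive label over a negative one; equals exp(<u, v>)."""
    p = posterior_positive(u, v)
    return p / (1.0 - p)
```

Note that because the odds ratio equals exp of the inner product, adding a constant to the inner product multiplies the odds by a fixed factor, which is exactly the exponential relationship described above.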
Maximum-likelihood estimation of embeddings: The word embedding vectors, which are the parameters of the posterior distributions, are selected to maximize the posterior likelihood of observing all the positive and negative samples, i.e.,
Substituting from Eq. (1) and taking the negative logarithm, this reduces to
The optimization problem in Eq. (2) can be reformulated as
where the summation is over all distinct pairs of words in the vocabulary, weighted by the number of times each pair occurs in D and D′, respectively. The objective function in Eq. (2), or equivalently Eq. (3), is non-convex with respect to the embedding vectors. Moreover, the solution is not unique because the objective function, which depends only on the pairwise inner products of the embedding vectors, is invariant to any global angle-preserving transformation of the embedding vectors.
One solution approach is to first re-parameterize the objective in terms of the Gram matrix M of the embedding vectors, i.e., replace each inner product with the corresponding entry of M, and then solve for the optimal M by relaxing the requirement that M be symmetric and positive semi-definite. The solution of this relaxed problem can be shown to equal, up to an additive constant, the so-called pointwise mutual information (PMI) matrix, whose (i, j)-th entry is the logarithm of the ratio of the empirical co-occurrence frequency of words i and j in D to the product of their unigram frequencies. The embedding vectors can then be obtained by performing a low-rank matrix factorization of this matrix via, for example, an SVD.
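The PMI-plus-factorization route can be sketched as follows. This is a minimal illustration under assumed inputs (a dictionary of symmetric pair counts); the shift by log of the negative-sampling rate follows the standard shifted-PMI correspondence, and the exact count conventions are our simplification:

```python
import numpy as np

def pmi_embeddings(pair_counts, n_words, dim, k_neg=1):
    """Shifted-PMI matrix from skip-bigram counts, factorized by SVD.

    pair_counts: dict {(i, j): count} of (symmetric) skip-bigram occurrences.
    Returns an (n_words x dim) array of embedding vectors.
    """
    total = sum(pair_counts.values())
    # context counts per word, accumulated from the pair counts
    word_counts = np.zeros(n_words)
    for (i, j), c in pair_counts.items():
        word_counts[i] += c
        word_counts[j] += c
    M = np.zeros((n_words, n_words))
    for (i, j), c in pair_counts.items():
        # pointwise mutual information, shifted by log of the negative-sampling rate
        pmi = np.log(c * total / (word_counts[i] * word_counts[j])) - np.log(k_neg)
        M[i, j] = M[j, i] = pmi
    # low-rank factorization: keep the top `dim` singular directions
    U, S, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(S[:dim])
```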
An alternative solution approach, which we adopt in our node-embedding algorithm described below, is to optimize Eq. (2) using stochastic gradient descent (SGD) [35, 36]. SGD iteratively updates the embedding vectors by moving them along directions of negative gradients of a modified objective function which is constructed (during each iteration) by partially summing over a small, randomly selected batch of terms that appear in the complete summation that defines the original objective function (cf. Eq. (2)). One advantage of SGD is its conceptual simplicity. Another is that it can be parallelized and scaled nicely to large vocabularies. SGD also comes with theoretical guarantees of almost-sure convergence to a local minimum under suitable regularity conditions.
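A minimal sketch of the per-sample SGD updates for the negative-sampling objective follows. The learning rate, single-pair batches, and function name are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def sgd_epoch(emb, positives, negatives, lr=0.025):
    """One pass of stochastic gradient ascent on the negative-sampling log-likelihood.

    emb: (n x d) array of embedding vectors, updated in place one pair at a time.
    positives / negatives: lists of (i, j) index pairs with labels 1 and 0.
    """
    for pairs, label in ((positives, 1.0), (negatives, 0.0)):
        for i, j in pairs:
            score = emb[i] @ emb[j]
            p = 1.0 / (1.0 + np.exp(-score))   # modeled P(label = +1)
            g = label - p                      # d(log-likelihood)/d(score)
            grad_i = g * emb[j]                # gradient w.r.t. emb[i], using old emb[j]
            emb[j] += lr * g * emb[i]
            emb[i] += lr * grad_i
    return emb
```

Each update pushes the vectors of a positive pair toward a larger inner product and those of a negative pair toward a smaller one.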
Proposed node-embedding algorithm: We convert the word embedding framework for documents described above into a node embedding framework for graphs. Our key idea is to view nodes as words and a document as a collection of sentences that correspond to paths of nodes in the graph. To operationalize this idea, we generate multiple paths (sentences) by performing random walks of suitable lengths starting from each node. Specifically, we simulate r random walks of fixed length ℓ starting from each node. In each random walk, the next node is chosen uniformly at random among all the immediate neighbors of the current node in the given graph. The positive multiset D is then taken to be the multiset of all node pairs (i, j) such that nodes i and j occur within w steps of each other in one of the simulated paths. The parameter w controls the size of the local neighborhood of a node in the given graph. The local neighborhood of a node is the counterpart of the context words surrounding a word in a given text document.
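The random-walk "sentence" generation step can be sketched as follows; the adjacency-list input format and the function and parameter names are illustrative assumptions:

```python
import random

def simulate_walks(adj, num_walks, walk_len, seed=0):
    """Launch `num_walks` uniform random walks of length `walk_len` from every node.

    adj: dict mapping each node to the list of its neighbors.
    Returns the list of walks, each a "sentence" of node ids.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj[walk[-1]]
                if not neighbors:       # isolated/dangling node: stop early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```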
The negative multiset D′ is constructed using the following approach: for each node pair in D, we append k node pairs to D′, where the second node of each appended pair is drawn in an IID manner from all the nodes according to the estimated unigram node (word) distribution across the document of node paths. The set D′ captures the behavior of a random walk on a graph which is completely connected. When applied to graphs as we do, the negative sampling model can be viewed as positing that the ratio of the odds of observing a pair of nodes within w steps of each other in a random walk on the given graph to the odds of observing the same pair in a random walk on a (suitably edge-weighted) completely connected graph is exponentially related to the inner product of the underlying embedding vectors of the pair of nodes.
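Forming the positive multiset and the unigram-based negatives can be sketched as below; the parameter names `window` and `k_neg` are illustrative, and drawing only the second node of each negative pair from the unigram distribution is one common convention:

```python
import random
from collections import Counter

def make_training_pairs(walks, window, k_neg, seed=0):
    """Positive pairs: nodes within `window` steps of each other in a walk.
    Negatives: for each positive pair, `k_neg` pairs whose second node is drawn
    from the empirical unigram distribution of nodes across all walks."""
    rng = random.Random(seed)
    positives = []
    for walk in walks:
        for t, u in enumerate(walk):
            for v in walk[t + 1 : t + 1 + window]:
                positives.append((u, v))
    counts = Counter(node for walk in walks for node in walk)
    nodes = list(counts)
    weights = [counts[v] for v in nodes]
    negatives = [(u, rng.choices(nodes, weights)[0])
                 for (u, _) in positives
                 for _ in range(k_neg)]
    return positives, negatives
```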
Once D and D′ are generated, we optimize Eq. (2) using stochastic gradient descent. The per-iteration computational complexity of the SGD algorithm used to solve Eq. (2) is linear in the embedding dimension, and the total number of iterations scales with the total number of positive and negative samples.
Once the embedding vectors are learned, we apply K-means clustering to obtain the community membership of each node. These steps are summarized in Algorithm 1.
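The final clustering step can be sketched with a plain Lloyd's k-means over the rows of the embedding matrix; in practice one would likely use an off-the-shelf k-means implementation, so this is only a self-contained illustration:

```python
import numpy as np

def cluster_embeddings(emb, k, iters=100, seed=0):
    """Plain Lloyd's k-means on the rows of `emb`; returns a label per node."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    labels = np.zeros(len(emb), dtype=int)
    for _ in range(iters):
        # assign each node to its nearest center
        dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers, keeping any center that lost all its points
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels
```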
Selecting algorithm parameters: The proposed algorithm VEC has five tuning parameters: (i) r, the number of random walks launched from each node; (ii) ℓ, the length of each random walk; (iii) w, the local window size; (iv) d, the embedding dimension; and (v) k, the number of negative samples per positive sample. In general, the community recovery performance of VEC will depend on all of these tuning parameters. We do not currently have a rigorous theoretical framework which can guide the optimal joint selection of all these parameters, and an exhaustive exploration of the five-dimensional space of all algorithm parameters to determine which combinations perform well for graphs of different sizes (number of nodes), different sparsities (number of edges), and different numbers of communities is clearly impractical.
In Sec. IV-E we explore the
sensitivity of the community recovery performance of VEC to
perturbations of algorithm parameters around a default setting. We set the number of negative samples per observation to five (k = 5), as suggested in the word2vec literature [7].
The results from Sec. IV-E demonstrate that the
performance of VEC remains remarkably stable across a wide range of
values of algorithm-parameters around the default setting.
The only two parameters that have a noticeable impact on community recovery performance are the window size w and, to a lesser extent, the embedding dimension d. Informally, we can try to make sense of these empirical observations as follows. If the graph is connected and aperiodic, then as the walk length ℓ grows, the node distribution converges to the stationary distribution of the Markov chain defined by the graph. It is therefore not surprising that the dependence of performance on ℓ becomes negligible beyond a point. We may view launching r random walks from each node as a practical way to capture this steady-state behavior with a small ℓ. The most significant parameter appears to be w, which directly controls the size of the local neighborhood around each node from which the set of positive node-pairs is formed. Our results indicate that performance is poor when w is too small, but plateaus as w increases. When w is extremely large, we should expect performance to suffer, since then all node-pairs would appear in the positive set, which would then resemble the positive set of node-pairs from a completely connected graph with no community structure. The performance also appears to improve with increasing embedding dimension d up to a point and then plateaus. Although the node embedding algorithm is not explicitly optimized for community discovery, embeddings that work well for community discovery via Euclidean-space clustering should be such that the embedding vectors of nodes from the same community are roughly equidistant from each other, and the distances from the vectors of one community to those of another community are also roughly similar. These conditions are harder to meet in low dimensions unless the embedding vectors from the same community are all identical. In higher dimensions there are more degrees of freedom available for these distance properties to be satisfied.
Algorithms for performance comparison: In the rest of this paper, we compare the proposed VEC algorithm against two baseline approaches: Spectral Clustering (SC), which is widely adopted in practice [6, 23, 22, 5], and Acyclic Belief Propagation (ABP), which can achieve the information-theoretic limits in SBMs [14, 11, 15]. We also include a limited comparison with another state-of-the-art algorithm, BigClam (BC), suggested by one of the reviewers. For SC we use a state-of-the-art implementation that can handle large-scale sparse graphs. In order to ensure the best performance for ABP, we assume that the ground-truth SBM model parameters are known and adopt the algorithm parameters suggested by its authors, which are functions of the ground-truth parameters. In other words, we allow the competing algorithm ABP additional advantages that are not available to our proposed VEC algorithm. Our implementation is available at https://github.com/cy93lin/SBM_node_embedding.
In this section, we present and discuss a comprehensive set of experimental results on graphs that are synthetically generated using a Stochastic Block Model (SBM). SBMs have been widely used for both theoretical analysis and empirical validation of community detection algorithms [16, 17, 10, 9, 4].
Generative procedure: In an SBM, a random graph with n nodes and K latent communities is generated as follows. Each node is first randomly assigned to one of the K communities according to a vector of community membership probabilities. Then, for each unordered pair of nodes, an edge is formed with a probability that depends only on the pair's community memberships. The self- and cross-community connection probabilities are typically assumed to vanish as n grows in order to capture the sparse connectivity (slowly growing average node degree) of most real-world networks.
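The generative procedure just described, specialized to the planted-partition case used below, can be sketched as follows (function and parameter names are illustrative; `p_in` and `p_out` are the self- and cross-community connection probabilities):

```python
import random

def sample_sbm(n, k, p_in, p_out, weights=None, seed=0):
    """Sample a planted-partition SBM: community labels drawn from `weights`
    (uniform by default), then each node pair connected independently with
    probability `p_in` (same community) or `p_out` (different communities)."""
    rng = random.Random(seed)
    weights = weights or [1.0 / k] * k
    labels = rng.choices(range(k), weights=weights, k=n)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return labels, adj
```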
Weak and exact recovery: We consider two notions of recovery studied in SBMs. Let the accuracy of an algorithm be the fraction of nodes whose estimated communities agree with the ground-truth assignment (under the best permutation of community labels). Then,
Weak recovery is solvable if an algorithm can achieve accuracy at least 1/K + ε, for some ε > 0, with probability converging to 1 as n → ∞.
Exact recovery is solvable if an algorithm can achieve accuracy 1 with probability converging to 1 as n → ∞.
Simulation setting and scaling regimes: In the bulk of our experiments, we synthesize graphs with balanced communities, i.e., uniform community membership probabilities, and equal community connection probabilities. Specifically, we consider the standard planted partition model in which all self-connection probabilities are equal and all cross-connection probabilities are equal. In Sec. IV-C, we study how unbalanced communities and unequal connectivities affect the performance of the different algorithms.
We consider two commonly studied scaling regimes for the connection probabilities, parameterized by constants a > b > 0, namely
constant expected node degree scaling: self- and cross-connection probabilities a/n and b/n, and
logarithmic expected node degree scaling: self- and cross-connection probabilities a log(n)/n and b log(n)/n.
Intuitively, a and b influence the degree of sparsity whereas their separation controls the degree of separation between communities. The constant expected node degree scaling regime is more challenging for community recovery than the logarithmic expected node degree regime. The most recent results in [13, 11, 15], when specialized to the planted partition model, can be summarized as follows:
Condition 1: For constant scaling, weak recovery is guaranteed if (a − b)² > K(a + (K − 1)b). For K = 2, the condition is also necessary.
Condition 2: For logarithmic scaling, exact recovery is solvable if, and only if, (√a − √b)² > K.
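The two recovery conditions can be turned into simple predicate functions. The inequalities below transcribe the commonly cited planted-partition thresholds (weak recovery under constant scaling: (a − b)² > K(a + (K − 1)b); exact recovery under logarithmic scaling: (√a − √b)² > K); treat them as a sketch of the standard statements rather than a substitute for the cited results:

```python
def weak_recovery_guaranteed(a, b, k):
    """Sufficient condition for weak recovery under constant degree scaling
    (also necessary when k == 2): (a - b)^2 > k * (a + (k - 1) * b)."""
    return (a - b) ** 2 > k * (a + (k - 1) * b)

def exact_recovery_solvable(a, b, k):
    """Exact recovery threshold under logarithmic degree scaling:
    (sqrt(a) - sqrt(b))^2 > k."""
    return (a ** 0.5 - b ** 0.5) ** 2 > k
```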
We choose different combinations of (a, b) in order to explore recovery behavior around the weak and exact recovery thresholds. For each combination of model parameters, we synthesize multiple random graphs and report the mean and standard deviation of all the performance metrics (discussed next).
Performance metrics: In all our experiments, since ground-truth community assignments are available, we adopt the commonly used Normalized Mutual Information (NMI) and Correct Classification Rate (CCR) metrics to measure clustering accuracy. For each pair of communities (k, k′), let n_{kk′} denote the number of (ground-truth) community-k nodes that are labeled as community-k′ by a community discovery algorithm. The CCR is then the fraction of nodes whose estimated label matches the ground truth.
To define NMI, consider the empirical joint pmf of the ground-truth and estimated labels of a randomly chosen node, together with its two marginal pmfs. The NMI is the mutual information of this joint pmf (with the convention 0 log 0 = 0) normalized by the entropies of the two marginals.
Both CCR and NMI are symmetric with respect to the ground-truth labels and the estimated labels. However, NMI is invariant to any permutation of labels whereas CCR is not. We therefore calculate CCR based on the best re-labeling of the estimated labels.
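Both metrics can be computed directly from the two label vectors. The sketch below uses a brute-force search over label permutations for CCR (feasible for small K) and normalizes the mutual information by the geometric mean of the two entropies, which is one common convention (the paper's exact normalization may differ):

```python
from collections import Counter
from itertools import permutations
from math import log

def ccr(truth, pred, k):
    """Correct classification rate under the best relabeling of the
    estimated communities (brute force over permutations; fine for small k)."""
    best = 0
    for perm in permutations(range(k)):
        best = max(best, sum(perm[p] == t for t, p in zip(truth, pred)))
    return best / len(truth)

def nmi(truth, pred):
    """Mutual information of the empirical joint pmf of (truth, pred),
    normalized by the geometric mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * log(c / n) for c in pt.values())
    hp = -sum(c / n * log(c / n) for c in pp.values())
    return mi / (ht * hp) ** 0.5 if ht > 0 and hp > 0 else 0.0
```

Note how swapping the two predicted labels leaves both metrics at their maximum, illustrating the permutation invariance discussed above.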
In order to compare the overall relative performance of the different algorithms across a large number of simulation settings, we also compute NMI and CCR Performance Profile (PP) curves across a set of 250 distinct experiments. These curves provide a global performance summary of the compared algorithms.
Table I provides a bird’s-eye view of all our experiments with synthetically generated graphs. The table presents all key problem parameters that are held fixed as well as those which are varied. It also summarizes the main conclusion of each experimental study and includes pointers to the appropriate figures and subsections where the results can be found.
| # | Fig., Table & Sec. | Scaling regime | Sparsity | Graph size | Balanced & uniform? | Main observation |
|---|---|---|---|---|---|---|
| 1 | Fig. 1, Sec. IV-B | constant | variable | fixed | yes | VEC exhibits weak recovery phase transition behavior |
| 2 | Fig. 3, Sec. IV-B | constant | fixed | variable | yes | VEC achieves weak recovery asymptotically when conditions are satisfied |
| 3 | Fig. 4, Sec. IV-B | constant | variable | fixed | yes | VEC can cross the weak recovery limit for K > 2 |
| 4 | Fig. 5, Sec. IV-B | constant | fixed | fixed | yes (variable K) | VEC is robust to the number of communities |
| 5 | Fig. 6, Sec. IV-C | constant | fixed | fixed | unbalanced | VEC is robust to unbalanced communities |
| 6 | Fig. 7, Sec. IV-C | constant | fixed | fixed | non-uniform | VEC is robust to unequal connectivities |
| 7 | Fig. 8, Sec. IV-D | logarithmic | variable | fixed | yes | VEC attains the exact recovery limit |
| 8 | Fig. 9, Sec. IV-D | logarithmic | fixed | variable | yes | VEC achieves exact recovery asymptotically when conditions are satisfied |
| 9 | Fig. 10, Sec. IV-E | logarithmic | fixed | fixed | yes | VEC is robust to algorithm parameters |
| 10 | Table II, Sec. IV-E | both | fixed | fixed | yes (K = 2, 5) | VEC is robust to randomness in creating paths |
| 11 | Fig. 11, Sec. IV-F | constant | variable | variable | yes | VEC consistently outperforms baselines across 240 experiments |
| 12 | Table III, Sec. IV-G | both | variable | variable | yes | VEC outperforms baselines on degree-corrected model |
To understand behavior around the weak recovery limit, we synthesized SBM graphs at various sparsity levels in the constant scaling regime. For these parameter settings, weak recovery is possible if, and only if, the threshold of Condition 1 is exceeded. The results are summarized in Fig. 1.
Figure 1 reveals that the proposed VEC algorithm exhibits weak recovery phase transition behavior: accuracy is well above chance when the weak-recovery condition holds and drops to the level of random guessing when it fails. This behavior can also be observed through the NMI metric. The behavior of ABP, which provably achieves the weak recovery limit, is also shown in Fig. 1. Compared to ABP, VEC has consistently superior mean clustering accuracy over the entire range of sparsity values. In addition, we note that the variance of NMI and CCR for ABP is significantly larger than for VEC; this is discussed later in this section. SC, however, does not achieve weak recovery for sparse graphs (cf. Fig. 1), which is consistent with theory.
Effect of increasing the number of random graph realizations: The curves shown in Fig. 1 are based on averaging performance metrics across multiple, independently generated, random realizations of SBM graphs. In order to understand the impact of the number of realizations on the overall performance trends of the different algorithms, we increased the number of random graphs used to create each data point in Fig. 1. The resulting mean NMI and CCR values and their associated confidence intervals are shown in Fig. 2. Comparing the curves in Fig. 1 and Fig. 2, we see that they are very similar. The confidence intervals of all algorithms have clearly diminished, as expected. They were already very small for VEC with the smaller number of realizations and have all but "disappeared" with the larger one. The only significant change is to the mean curve for ABP, which has become monotonic (as it should) and smoother after increasing the number of random realizations. Since the two sets of curves are so similar, we use the smaller number of random realizations for each SBM parameter setting in the rest of this paper.
Comparison with BigClam: Figure 2 also shows the performance of BigClam (BC), a recent powerful community detection algorithm based on matrix factorization techniques that scales well to large graphs with millions of nodes and edges. We observe that BC outperforms SC when the graph is relatively sparse, but VEC and ABP still outperform BC. Since the performance of BigClam is quite similar to that of SC, we exclude BC from the remainder of our experiments.
Effect of graph size: In order to provide further support for the observations presented above, we also synthesized SBM graphs of increasing size with the sparsity parameters held fixed in the constant degree scaling regime. Since the weak-recovery condition is satisfied, weak recovery is possible asymptotically as n → ∞. As shown in Fig. 3, VEC can empirically achieve weak recovery for both small and large graphs, and consistently outperforms ABP and SC. While ABP can provably achieve weak recovery asymptotically, its performance on smaller graphs is poor.
Crossing below the weak recovery limit for K > 2: Here we explore the behavior of VEC below the weak recovery limit for K > 2, since to date there are no necessary and sufficient weak recovery bounds established for this setting. Similar to Fig. 1, we synthesized SBM graphs in the constant degree scaling regime at various sparsity levels with the remaining parameters fixed. In this setting, the threshold of Condition 1 is sufficient but not necessary for weak recovery. The results are summarized in Fig. 4.
As can be seen in Fig. 4, the VEC algorithm can cross the weak recovery limit: for some parameter settings below the sufficient threshold, it still recovers communities with accuracy significantly above chance. Here too we observe that VEC consistently outperforms ABP and SC by a large margin.
Weak recovery with an increasing number of communities: Next we consider the performance of VEC as the number of communities K increases. In particular, we synthesize planted partition SBMs in the constant scaling regime with fixed sparsity parameters and uniform community membership probabilities. As K increases with the sparsity parameters held fixed, weak recovery eventually becomes impossible to guarantee: according to Condition 1, weak recovery is guaranteed only while the corresponding inequality holds, and it fails for all sufficiently large K.
Figure 5 summarizes the performance of VEC, SC, and ABP as a function of the number of communities K. The performance of the three algorithms can be compared most directly through the NMI metric (upper sub-figure of Fig. 5). As in all the previous studies of this section, the proposed VEC algorithm empirically achieves weak recovery, with a significant margin, whenever the information-theoretic sufficient conditions are satisfied. We note that in terms of the CCR metric, weak recovery corresponds to exceeding 1/K, since 1/K is the CCR of the rule that assigns the a priori most likely community label to all nodes when the community weights are uniform. This level is empirically surpassed by the VEC algorithm, as illustrated in the bottom sub-figure of Fig. 5.
Note that the performance of SC drops significantly as the number of communities grows. We also note that the CCR performance margin between ABP, which is a provably asymptotically consistent algorithm, and the best constant-guess rule (CCR = 1/K) is much smaller than the corresponding margin for VEC.
Here we study how unbalanced communities and unequal connectivities affect the performance of the proposed VEC algorithm. By unbalanced communities we mean an SBM in which the community membership probabilities are not uniform. In this scenario, some clusters will be more dominant than others, making it challenging to detect the small clusters. By unequal connectivities we mean an SBM in which the self-connection probabilities differ across communities or the cross-connection probabilities differ across community pairs. Since it is unwieldy to explore all types of unequal connectivities, our study focuses only on unequal self-connectivities, i.e., self-connection probabilities that differ across communities. In this scenario, the densities of different communities will differ, making it challenging to detect the sparser communities. Here we compare NMI and CCR curves only for VEC and ABP but not SC: when communities are unbalanced or the self-connectivities are unequal, we observed that SC takes an inordinate amount of time to terminate. We therefore omit NMI and CCR plots for SC from the experimental results in this subsection.
Unbalanced communities: We first show results on SBMs with nonuniform community membership weights. For simplicity, we consider SBMs with two communities and parameterize the imbalance by the weight of the first community. From the general weak recovery conditions for nonuniform weights (Condition 1 in Sec. IV-A assumes uniform weights), it can be shown that as the imbalance grows, the threshold for guaranteed weak recovery will eventually be broken; specifically, for our parameter settings, the imbalance must not exceed a critical value for weak recovery.
We summarize the results in Fig. 6. For comparison, the right (CCR) sub-figure of Fig. 6 also shows the CCR of the rule that assigns the a priori most likely community to all nodes. From the figure it is evident that, unlike ABP, the CCR performance of VEC remains stable across a wide range of imbalance values, indicating that it can tolerate significantly unbalanced communities.
Unequal community connectivity: We next consider the situation in which the connectivity constants of different communities are distinct. For simplicity, we consider SBMs with communities and balanced weights . We focus on the constant scaling regime and set
Here, determines the relative densities of communities and . For the other model parameters, we set , as in the previous subsection. From the general weak recovery conditions for unequal community connectivity in , it can be shown that weak recovery requires that . We summarize the results in Fig. 7.
From the figure it is once again evident that the performance of VEC remains stable across a wide range of values when compared to ABP.
We now turn to explore the behavior of VEC near the exact recovery limit. Figure 8 plots NMI and CCR as a function of increasing sparsity level for SBM graphs under logarithmic node degree scaling fixing , , and . In this setting, exact recovery is solvable if, and only if, (cf. Condition 2 in Sec. IV-A). As can be seen in Fig. 8, the CCR and NMI values of VEC converge to as increases far beyond . Therefore, VEC empirically attains the exact recovery limit. We note that SC can match the performance of VEC when is large, but cannot correctly detect communities for very sparse graphs (). Note also that VEC significantly outperforms ABP in this scaling scheme.
We also compared the behavior of VEC, ABP, and SC algorithms for increasing graph sizes . We set , and uniform. Figure 9 illustrates the performance of VEC, SC, and ABP as a function of the number of nodes for three different choices of , and . Since exact recovery requires , only the third choice of guarantees exact recovery asymptotically.
As can be seen in Fig. 9, when is above the exact recovery condition (see Fig. 9 ), the proposed algorithm VEC can achieve exact recovery, i.e., and . In this setting, the proposed VEC algorithm can be observed to achieve exact recovery even when the number of nodes is relatively small. On the other hand, when is below the exact recovery condition (see Figs. 9 and ), as increases, the accuracy of VEC increases and converges to a value that is somewhere between random guessing () and exact recovery ().
We note that among the compared baselines, the performance of SC is similar to that of VEC when is large (relatively dense graph) but its performance deteriorates when is small (sparse graph). As shown in Fig. 9 , when the SBM-synthesized graph is relatively sparse, the performance of SC is close to a random guess while the performance of VEC and ABP increases with the number of nodes . This observation is consistent with known theoretical results [11, 23].
Parameter sensitivity: The performance of VEC depends on the number of random paths per node , the length of each path , the local window size , and the embedding dimension . We synthesized SBM graphs under logarithmic scaling with and applied VEC with different choices of , , , and . The results are summarized in Fig. 10. While the performance of VEC is remarkably insensitive to , , and across a wide range of values, a relatively large local window size appears to be essential for attaining good performance (cf. Fig. 10(c)). This suggests that incorporating a larger graph neighborhood is critical to the success of VEC.
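For context, the corpus-building step that the first two of these parameters govern can be sketched as follows. This is a simplified version in our own notation (`num_walks` random walks of length `walk_len` per node), not the authors' implementation:

```python
import random
from collections import defaultdict

def random_walk_corpus(edges, n, num_walks, walk_len, seed=0):
    """Build the 'document' of node-paths: start `num_walks` simple
    random walks of length `walk_len` from every node; each walk is a
    'sentence' of node ids (as strings, ready for a word2vec-style model)."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    corpus = []
    for _ in range(num_walks):
        for start in range(n):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            corpus.append([str(v) for v in walk])
    return corpus
```

The resulting sentences would then be fed to a word2vec-style embedding model with the local window size and embedding dimension discussed above (e.g., gensim's `Word2Vec`), followed by a standard clustering of the embedding vectors such as k-means.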
Effect of random initialization in VEC and ABP: We also studied the effect of random initialization in VEC and ABP. We synthesized two SBM graphs as described in Table II. For a fixed graph, we run VEC and ABP times and summarize the mean and standard deviation values of NMI and CCR. We observe that the variance of ABP is an order of magnitude higher than VEC indicating its high sensitivity to initialization.
So far we presented and discussed the performance of VEC, ABP, and SC across a wide range of parameter settings. All results indicate that VEC matches or outperforms both ABP and SC in almost all scenarios. In order to summarize and compare the overall performance of all three algorithms across the wide range of parameter settings that we have considered, we adopt the commonly used Performance Profile
as a “global” evaluation metric. Formally, let denote a set of experiments and a specific algorithm. Let denote the value of a performance metric attained by an algorithm in experiment , where higher values correspond to better performance. Then the performance profile of at is the fraction of the experiments in which the performance of is at least a factor times as good as the best performing algorithm in that experiment, i.e.,
The Performance Profile is thus an empirical cumulative distribution function of an algorithm’s performance relative to the best-performing algorithm in each experiment. We calculate for . The higher the curve corresponding to an algorithm, the more often it outperforms the other algorithms.
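The definition above can be computed directly; a minimal sketch (data layout and names are ours):

```python
def performance_profile(results, taus):
    """results[alg][e] = metric value of algorithm `alg` in experiment e
    (higher is better). For each algorithm and each tau in `taus`, return
    the fraction of experiments in which its score is at least tau times
    the best score attained in that experiment."""
    algs = list(results)
    n_exp = len(next(iter(results.values())))
    best = [max(results[a][e] for a in algs) for e in range(n_exp)]
    return {
        a: [sum(1 for e in range(n_exp)
                if best[e] > 0 and results[a][e] >= tau * best[e]) / n_exp
            for tau in taus]
        for a in algs
    }
```

An algorithm that is best in every experiment has a profile identically equal to 1; a uniformly higher curve indicates more frequent dominance, which is how Fig. 11 should be read.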
For simplicity, we only consider the simulation settings for the planted partition model SBMs in the constant degree scaling regime. We set , , and . For each combination of settings , we conduct independent random repetitions of the experiment. Thus overall the Performance Profile is calculated based on experiments.
Figure 11 shows the performance profiles for both NMI and CCR metrics. From the figure it is clear that VEC dominates both ABP and SC and that ABP and SC have similar performance across many experiments.
Our experiments thus far focused on the SBM where the
expected degree is constant across all nodes within the same
community (degree homogeneity). In order to compare the robustness
of different algorithms to degree heterogeneity, we considered
the degree-corrected SBM  (DC-SBM), which
generates edge-weighted graphs with a power law within-community
degree distribution that is observed in many real-world graphs. We
adopted the following generative procedure, which was proposed in : (1) Assign the latent community labels using the same procedure as in the SBM; (2) Within each community, sample a parameter from a power law distribution for each node, and normalize these parameters so that they sum to ; (3) For any two nodes and , sample a weighted edge with weight drawn from a Poisson distribution with mean . We omitted the steps that generate self-loops since self-loops are ignored by the community detection algorithms. Our proposed algorithm VEC can be modified to handle weighted graphs by setting the random walk transition probabilities to be proportional to the edge weights. The ABP and SC algorithms can also be suitably adapted to work with weighted graphs.
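A sketch of step (3), assuming per-node degree parameters `theta` and a community connectivity matrix `B`. This notation is ours, since the corresponding symbols are elided in this excerpt:

```python
import math
import random

def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the small edge-weight means
    that arise here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_dcsbm(labels, theta, B, seed=0):
    """Degree-corrected SBM edges: the weight of (i, j) is Poisson with
    mean theta[i] * theta[j] * B[z_i][z_j]. Self-loops are skipped, as in
    the procedure above. Returns {(i, j): weight} for nonzero weights."""
    rng = random.Random(seed)
    n = len(labels)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            w = poisson(rng, theta[i] * theta[j] * B[labels[i]][labels[j]])
            if w > 0:
                edges[(i, j)] = w
    return edges
```

The weighted random-walk modification mentioned above then amounts to sampling the next node with probability proportional to the incident edge weights, e.g., via `random.choices(neighbors, weights=edge_weights)`.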
We simulated graphs with and the power law distribution for with power and a minimum value of . We considered a number of sparsity settings used to simulate SBM graphs in the previous sections. For each setting, the mean NMI and CCR values and their confidence intervals (based on random graphs) are summarized in Table III. These results indicate that our algorithm still outperforms the competing approaches. We would like to point out that even though the DC-SBM and SBM parameters are similar, the similarity of parameter settings does not imply similarity of information-theoretic limits. To the best of our knowledge, the information-theoretic limit of recovery for the DC-SBM is still open.
[Table III: results for four sparsity settings, two under constant scaling and two under logarithmic scaling.]
Having comprehensively studied the empirical performance of VEC, ABP, and SC on SBM-based synthetic graphs, in this section we turn our attention to real-world datasets that have ground truth (non-overlapping) community labels. Here we use NMI and CCR to measure performance.
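NMI is not defined in this excerpt. One standard variant normalizes the mutual information between the two partitions by the geometric mean of their entropies (other normalizations exist); a sketch under that convention:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information I(A; B) / sqrt(H(A) * H(B)):
    1 for identical partitions (up to relabeling), near 0 for
    independent ones."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```

Unlike CCR, NMI requires no label alignment, which makes it convenient when the number of detected communities differs from the ground truth.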
We consider two benchmark real-world graphs: the Political Blogs network  and the Amazon co-purchasing network . Since the original graphs are directed, we convert it to undirected graphs by forming an edge between two nodes if either direction is part of the original graph. The basic statistics of the datasets are summarized in Table IV. Here, and are the maximum likelihood estimates of and respectively in the planted partition model SBM under logarithmic scaling. Note that in Amazon, the ground truth community proportions are highly unbalanced.
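The symmetrization step can be sketched as:

```python
def to_undirected(directed_edges):
    """Symmetrize a directed edge list: keep the undirected edge {i, j}
    whenever at least one of (i, j) or (j, i) is present; drop self-loops."""
    return {(min(i, j), max(i, j)) for i, j in directed_edges if i != j}
```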
We report NMI and CCR values for VEC, SC, and ABP applied to these datasets. To apply ABP, we set the algorithm parameters using the fitted SBM parameters as suggested in . As shown in Table V, VEC achieves better accuracy compared to SC and ABP.
The performance of SC is noticeably poorer (in terms of both NMI and CCR) than that of both VEC and ABP. Interestingly, the NMI of SC on random graphs synthetically generated from the planted partition SBM that best fits (in a maximum-likelihood sense) the Political Blogs graph is surprisingly good: NMI = (average NMI across 10 random graphs). This suggests that real-world graphs such as Political Blogs have additional characteristics that are not well captured by a planted partition SBM. This is further confirmed by the empirical node degree distributions of the real-world and synthesized graphs in Fig. 12: the degree distributions differ markedly even though the SBM used to generate the synthetic graphs is fitted, in a maximum-likelihood sense, to the real-world graphs. These results also suggest that the performance of SC is sensitive to model mismatch, and that its good performance on synthetically generated SBM graphs may not be indicative of good performance on the real-world graphs they are fitted to. In contrast, VEC and ABP do not seem to suffer from this limitation.
In this work we put forth a novel framework for community discovery in graphs based on node embeddings. We did this by first constructing, via random walks in the graph, a document made up of sentences of node-paths and then applying a well-known neural word embedding algorithm to it. We then conducted a comprehensive empirical study of community recovery performance on both simulated and real-world graph datasets and demonstrated the effectiveness and robustness of the proposed approach over two state-of-the-art alternatives. In particular, the new method is able to attain the information-theoretic limits for recovery in stochastic block models.
There are a number of aspects of the community recovery problem that we have not explored in this work, but which merit further investigation. First, we have focused on undirected graphs, but our algorithm can be applied ‘as-is’ to directed graphs as well. We have assumed knowledge of the number of communities , but the node embedding part of the algorithm itself does not make use of this information. In principle, we can apply any -agnostic clustering algorithm to the node embeddings. We have focused on non-overlapping community detection. It is certainly possible to convert an overlapping community detection problem with communities into a non-overlapping community detection problem with communities, but this approach is unlikely to work well in practice if is large. An alternative approach is to combine the node embeddings with topic models to produce a “soft” clustering. Finally, this study was purely empirical in nature. Establishing theoretical performance guarantees that can explain the excellent performance of our algorithm is an important task which seems challenging at this time. One difficulty is the nonconvex objective function of the word2vec algorithm. This can be partially addressed by constructing a suitable convex relaxation and analyzing its limiting behavior (under suitable scaling) as the length of the random walk goes to infinity. The limiting objective function will still be random since it depends on the observed realization of the random graph. One could then examine if the limiting objective, when suitably normalized, concentrates around its mean value as the graph size goes to infinity.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Physical Review E, vol. 84, no. 6, p. 066106, 2011.
D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
R. Boppana, “Eigenvalues and graph bisection: An average-case analysis,” in Proc. of the 28th Annual Symposium on Foundations of Computer Science (FOCS), 1987, pp. 280–285.
Proceedings of the thirty-sixth annual ACM symposium on Theory of computing. ACM, 2004, pp. 81–90.