1. Introduction
Graphs provide a natural way of representing complex real-world systems. Community detection methods are an essential tool for understanding the structure and behavior of these systems. Detecting communities allows us to analyze social networks (Girvan and Newman, 2002), detect fraud (Pinheiro, 2012), discover functional units of the brain (Garcia et al., 2018), and predict functions of proteins (Song and Singh, 2009). The problem of community detection has attracted significant attention from the research community, and numerous models and algorithms have been proposed (Xie et al., 2013).
In recent years, the emerging field of deep learning for graphs has shown great promise in designing more accurate and more scalable algorithms. While deep learning approaches have achieved unprecedented results in graph-related tasks like link prediction and node classification (Cai et al., 2018), relatively little attention has been dedicated to their application for unsupervised community detection. Several methods have been proposed (Yang et al., 2016; Choong et al., 2018; Cavallari et al., 2017), but they all have a common drawback: they only focus on the special case of disjoint (non-overlapping) communities. However, it is well known that communities in real networks are overlapping (Yang and Leskovec, 2014). Handling overlapping communities is a requirement not yet met by existing deep learning approaches for community detection.
In this paper we address this research gap and propose an end-to-end deep learning model capable of detecting overlapping communities. To summarize, our main contributions are:

Model: We introduce a graph neural network (GNN) based model for overlapping community detection.

Data: We introduce 4 new datasets for overlapping community detection that can act as a benchmark and stimulate future research in this area.

Experiments: We perform a thorough evaluation of our model and show its superior performance compared to established methods for overlapping community detection, both in terms of speed and accuracy. We highlight the importance of the GNN component of our model through an ablation study.
2. Background
Assume that we are given an undirected, unweighted graph $G$, represented as a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$. We denote by $N$ the number of nodes and by $M$ the number of edges. Every node may be associated with a $D$-dimensional attribute vector, which can be represented as an attribute matrix $X \in \mathbb{R}^{N \times D}$. The goal of overlapping community detection is to assign nodes to $C$ communities. Such an assignment can be represented as a non-negative community affiliation matrix $F \in \mathbb{R}_{\geq 0}^{N \times C}$, where $F_{uc}$ denotes the strength of node $u$'s membership in community $c$ (with the notable special case of binary assignment $F \in \{0, 1\}^{N \times C}$). Some nodes may be assigned to no communities, while others may belong to multiple.

Even though the notion of "community" seems rather intuitive, there is no universally agreed-upon definition of it in the literature. However, most recent works tend to agree with the statement that a community is a group of nodes that have a higher probability of forming edges with each other than with other nodes in the graph (Fortunato and Hric, 2016). This way, the problem of community detection can be considered in terms of the probabilistic inference framework: once we posit a community-based generative model for the graph, detecting communities boils down to inferring the unobserved affiliation matrix $F$ given the observed graph $A$.

Besides the traditional probabilistic view, one can also view community detection through the lens of representation learning. The community affiliation matrix $F$ can be considered an embedding of the nodes into $\mathbb{R}_{\geq 0}^{C}$, with the aim of preserving the graph structure. Given the recent success of representation learning for graphs (Cai et al., 2018), a question arises: "Can the advances in deep learning for graphs be used to design better community detection algorithms?" As we show in Section 4.1, simply combining existing node embedding approaches with overlapping K-means doesn't lead to satisfactory results. Instead, we propose to combine the probabilistic and representation points of view, and learn the community affiliations in an end-to-end manner using a graph neural network.
3. The NOCD model
Here, we present the Neural Overlapping Community Detection (NOCD) model. The core idea of our approach is to combine the power of GNNs with the Bernoulli–Poisson probabilistic model.
3.1. Bernoulli–Poisson model
The Bernoulli–Poisson (BP) model (Yang and Leskovec, 2013; Zhou, 2015; Todeschini et al., 2016) is a graph generative model that allows for overlapping communities. According to the BP model, the graph is generated as follows. Given the affiliations $F$, the adjacency matrix entries are sampled i.i.d. as

$A_{uv} \sim \mathrm{Bernoulli}\big(1 - \exp(-F_u F_v^\top)\big)$   (1)

where $F_u$ is the row vector of community affiliations of node $u$ (the $u$-th row of the matrix $F$). Intuitively, the more communities nodes $u$ and $v$ have in common (i.e., the higher the dot product $F_u F_v^\top$ is), the more likely they are to be connected by an edge.
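To make the generative process concrete, here is a toy NumPy sketch of sampling a graph from the BP model. All sizes and the affiliation matrix are made up for illustration; the real model infers $F$, it does not sample it.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 6, 2                              # toy numbers of nodes / communities
F = rng.uniform(0.0, 1.5, size=(N, C))   # made-up non-negative affiliations

# P(A_uv = 1) = 1 - exp(-F_u F_v^T), as in Eq. 1
edge_prob = 1.0 - np.exp(-F @ F.T)
np.fill_diagonal(edge_prob, 0.0)         # no self-loops

# Sample each unordered pair once, then symmetrize (undirected graph)
upper = rng.uniform(size=(N, N)) < edge_prob
A = np.triu(upper, 1).astype(int)
A = A + A.T
```

Nodes sharing more communities get larger dot products and thus higher connection probabilities, which is exactly the intuition stated above.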
This model has a number of desirable properties: it can produce various community topologies (e.g., nested, hierarchical), leads to dense overlaps between communities (Yang and Leskovec, 2014), and is computationally efficient (Section 3.3). Existing works propose to perform inference in the BP model using maximum likelihood estimation with coordinate ascent (Yang and Leskovec, 2013; Yang et al., 2013) or Markov chain Monte Carlo (Zhou, 2015; Todeschini et al., 2016).

3.2. Model definition
Instead of treating the affiliation matrix $F$ as a free variable over which optimization is performed, we generate $F$ with a GNN:

$F := \mathrm{GNN}_\theta(A, X)$   (2)

A ReLU nonlinearity is applied elementwise to the output layer to ensure non-negativity of $F$. See Section 4 and Appendix B for details about the GNN architecture.

The negative log-likelihood of the Bernoulli–Poisson model is
$-\log p(A \mid F) = -\sum_{(u,v) \in E} \log\big(1 - \exp(-F_u F_v^\top)\big) + \sum_{(u,v) \notin E} F_u F_v^\top$   (3)
Real-world graphs are usually extremely sparse, which means that the second term in Equation 3 will provide a much larger contribution to the loss. We counteract this by balancing the two terms, which is a standard technique in imbalanced classification (He and Garcia, 2008):

$\mathcal{L}(F) = -\mathbb{E}_{(u,v) \sim P_E}\big[\log\big(1 - \exp(-F_u F_v^\top)\big)\big] + \mathbb{E}_{(u,v) \sim P_N}\big[F_u F_v^\top\big]$   (4)

where $P_E$ and $P_N$ denote uniform distributions over edges and non-edges, respectively.
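A full-batch version of the balanced objective in Equation 4 can be sketched as follows. This is a dense NumPy sketch with made-up inputs; the actual model evaluates the objective stochastically (Section 3.3), and `eps` is a small constant added for numerical stability.

```python
import numpy as np

def balanced_loss(A, F, eps=1e-10):
    """Balanced BP negative log-likelihood (Eq. 4), full-batch version."""
    iu = np.triu_indices(A.shape[0], k=1)    # each unordered pair (u, v) once
    dots = (F @ F.T)[iu]                     # F_u F_v^T for every pair
    is_edge = A[iu].astype(bool)

    # Expectation over edges of -log(1 - exp(-F_u F_v^T))
    edge_term = -np.log(1.0 - np.exp(-dots[is_edge]) + eps).mean()
    # Expectation over non-edges of F_u F_v^T
    nonedge_term = dots[~is_edge].mean()
    return edge_term + nonedge_term
```

Averaging (rather than summing) each term implements the balancing: the sparse edge set and the large non-edge set contribute equally regardless of graph density.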
Instead of directly optimizing the affiliation matrix $F$, as done by traditional approaches (Yang and Leskovec, 2013; Yang et al., 2013), we search for neural network parameters $\theta$ that minimize the (balanced) negative log-likelihood:

$\theta^\star = \arg\min_\theta \, \mathcal{L}\big(\mathrm{GNN}_\theta(A, X)\big)$   (5)
Using a GNN for community prediction has several advantages. First, due to an appropriate inductive bias, the GNN outputs similar community affiliation vectors for neighboring nodes, which improves the quality of predictions compared to simpler models (Section 4.2). Also, this formulation allows us to seamlessly incorporate the node features into the model. If node attributes are not available, we can simply use the adjacency matrix $A$ as node features (Kipf and Welling, 2017). Finally, with the formulation from Equation 2, it is even possible to predict communities inductively for nodes not seen at training time.
3.3. Scalability
One advantage of the BP model is that it allows us to efficiently evaluate the loss $\mathcal{L}$ and its gradients w.r.t. $F$. By using a caching trick (Yang and Leskovec, 2013), we can reduce the computational complexity of these operations from $O(N^2)$ to $O(N + M)$. While this already leads to large speedups due to the sparsity of real-world networks (typically $M \ll N^2$), we can speed it up even further. Instead of using all entries of $A$ when computing the loss (Equation 4), we sample a minibatch of $S$ edges and $S$ non-edges at each training epoch, thus approximately computing $\mathcal{L}$ in $O(S)$. In Appendix E we show that this stochastic optimization strategy converges to the same solution as the full-batch approach, while keeping the computational cost and memory footprint low.

While we subsample the graph to efficiently evaluate the training objective $\mathcal{L}$, we use the full adjacency matrix $A$ inside the GNN. This doesn't limit the scalability of our model: NOCD is trained on a graph with 800K+ edges in 3 minutes on a single GPU (see Section 4.1). It is straightforward to make the GNN component even more scalable by applying techniques such as (Chen et al., 2018; Ying et al., 2018).
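The stochastic estimate of the loss relies on drawing uniform edges and non-edges. A sketch of such a sampler, assuming a dense adjacency matrix for simplicity: non-edges are drawn by rejection sampling, which is cheap precisely because real graphs are sparse (a random pair is almost never an edge).

```python
import numpy as np

def sample_pairs(A, S, rng):
    """Draw S edges and S non-edges uniformly from adjacency matrix A."""
    N = A.shape[0]
    # Uniform edges: enumerate each undirected edge once, then resample
    edges = np.transpose(np.nonzero(np.triu(A, 1)))
    edge_batch = edges[rng.integers(0, len(edges), size=S)]
    # Uniform non-edges via rejection sampling
    nonedges = []
    while len(nonedges) < S:
        u, v = rng.integers(0, N, size=2)
        if u != v and A[u, v] == 0:
            nonedges.append((u, v))
    return edge_batch, np.array(nonedges)
```

The two batches replace the full edge and non-edge sums in Equation 4 with sample averages, giving an unbiased estimate of the balanced loss.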
4. Evaluation
Table 1. Recovery of ground-truth communities, measured by overlapping NMI. DNF indicates that a method did not finish on that dataset.

| Dataset | BigCLAM | CESNA | EPM | SNetOC | CDE | SNMF | DW/NEO | G2G/NEO | NOCD-G | NOCD-X |
|---|---|---|---|---|---|---|---|---|---|---|
| Facebook 348 | 26.0 | 29.4 | 6.5 | 24.0 | 24.8 | 13.5 | 31.2 | 17.2 | 34.7 | 36.4 |
| Facebook 414 | 48.3 | 50.3 | 17.5 | 52.0 | 28.7 | 32.5 | 40.9 | 32.3 | 56.3 | 59.8 |
| Facebook 686 | 13.8 | 13.3 | 3.1 | 10.6 | 13.5 | 11.6 | 11.8 | 5.6 | 20.6 | 21.0 |
| Facebook 698 | 45.6 | 39.4 | 9.2 | 44.9 | 31.6 | 28.0 | 40.1 | 2.6 | 49.3 | 41.7 |
| Facebook 1684 | 32.7 | 28.0 | 6.8 | 26.1 | 28.8 | 13.0 | 37.2 | 9.9 | 34.7 | 26.1 |
| Facebook 1912 | 21.4 | 21.2 | 9.8 | 21.4 | 15.5 | 23.4 | 20.8 | 16.0 | 36.8 | 35.6 |
| Chemistry | 0.0 | 23.3 | DNF | DNF | DNF | 2.6 | 1.7 | 22.8 | 22.6 | 45.3 |
| Computer Science | 0.0 | 33.8 | DNF | DNF | DNF | 9.4 | 3.2 | 31.2 | 34.2 | 50.2 |
| Engineering | 7.9 | 24.3 | DNF | DNF | DNF | 10.1 | 4.7 | 33.4 | 18.4 | 39.1 |
| Medicine | 0.0 | 14.4 | DNF | DNF | DNF | 4.9 | 5.5 | 28.8 | 27.4 | 37.8 |
Datasets. We use the following real-world graph datasets in our experiments. Facebook (Mcauley and Leskovec, 2014) is a collection of small (50–800 nodes) ego-networks from the Facebook graph. Larger graph datasets (10K+ nodes) with reliable ground-truth overlapping community information and node attributes are not openly available, which hampers the evaluation and development of new methods. For this reason, we have collected and preprocessed four real-world datasets that satisfy these criteria and can act as future benchmarks. Chemistry, Computer Science, Medicine and Engineering are co-authorship networks constructed from the Microsoft Academic Graph (mag, [n. d.]). Communities correspond to research areas in the respective fields, and node attributes are based on keywords of the papers by each author. Statistics for all used datasets are provided in Appendix A.
Model architecture. For all experiments, we use a 2-layer Graph Convolutional Network (GCN) (Kipf and Welling, 2017) as the basis for the NOCD model. The GCN is defined as

$F := \mathrm{ReLU}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(1)})\, W^{(2)}\big)$   (6)

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the normalized adjacency matrix, $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops, and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$. We considered other GNN architectures, as well as deeper models, but none of them led to any noticeable improvements. The two main differences of our model from a standard GCN are (1) batch normalization after the first graph convolution layer and (2) $L_2$ regularization applied to all weight matrices. We found both of these modifications to lead to substantial gains in performance. We optimized the architecture and hyperparameters using only the Computer Science dataset — no additional tuning was done for other datasets. More details about the model configuration and the training procedure are provided in Appendix B. We denote the model working on node attributes as NOCD-X, and the model using the adjacency matrix as input as NOCD-G. In both cases, the feature matrix is row-normalized.

Assigning nodes to communities. In order to compare the detected communities to the ground truth, we first need to convert the predicted continuous community affiliations into binary community assignments. We assign node $u$ to community $c$ if its affiliation strength $F_{uc}$ is above a fixed threshold. We chose the threshold like all other hyperparameters — by picking the value that achieved the best score on the Computer Science dataset, and then using it in further experiments without additional tuning.
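A minimal NumPy sketch of the GCN forward pass of Equation 6, assuming dense matrices and random placeholder weights. The real model additionally uses batch normalization, dropout and $L_2$ regularization, which are omitted here.

```python
import numpy as np

def normalized_adjacency(A):
    """Compute A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} from Eq. 6."""
    A_tilde = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def nocd_forward(A, X, W1, W2):
    """2-layer GCN; the final ReLU keeps the affiliations F non-negative."""
    A_hat = normalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W1, 0.0)            # first graph convolution
    return np.maximum(A_hat @ H @ W2, 0.0)         # output layer: F >= 0
```

The output has one row per node and one column per community, so thresholding it row-wise yields the binary assignments described above.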
Metrics. We found that popular metrics for quantifying agreement between true and detected communities, such as Jaccard and $F_1$ scores (Yang and Leskovec, 2013; Yang et al., 2013; Li et al., 2018), can give arbitrarily high scores for completely uninformative community assignments. See Appendix F for an example and discussion. Instead, we use overlapping normalized mutual information (NMI) (McDaid et al., 2011), as it is more robust and meaningful.
4.1. Recovery of ground-truth communities
We evaluate the NOCD model by checking how well it recovers communities in graphs with known ground-truth communities.
Baselines. In our selection of baselines, we chose methods that are based on different paradigms for overlapping community detection: probabilistic inference, non-negative matrix factorization (NMF) and deep learning. Some methods incorporate the attributes, while others rely solely on the graph structure.
BigCLAM (Yang and Leskovec, 2013), EPM (Zhou, 2015) and SNetOC (Todeschini et al., 2016) are based on the Bernoulli–Poisson model. BigCLAM learns $F$ using coordinate ascent, while EPM and SNetOC perform inference with Markov chain Monte Carlo (MCMC). CESNA (Yang et al., 2013) is an extension of BigCLAM that additionally models node attributes. SNMF (Wang et al., 2011) and CDE (Li et al., 2018) are NMF approaches for overlapping community detection.
We additionally implemented two methods based on neural graph embeddings. First, we compute embeddings for all nodes in the given graph using two established approaches: DeepWalk (Perozzi et al., 2014) and Graph2Gauss (Bojchevski and Günnemann, 2018). Graph2Gauss takes into account both the node features and the graph structure, while DeepWalk only uses the structure. Then, we cluster the nodes using Non-Exhaustive Overlapping (NEO) K-Means (Whang et al., 2015), which allows assigning them to overlapping communities. We denote the methods based on DeepWalk and Graph2Gauss as DW/NEO and G2G/NEO, respectively.
To ensure a fair comparison, all methods were given the true number of communities $C$. Other hyperparameters were set to their recommended values. An overview of all baseline methods, as well as their configurations, is provided in Appendix C.
| Dataset | GNN (attributes) | MLP (attributes) | GNN (adjacency) | MLP (adjacency) | Free variable |
|---|---|---|---|---|---|
| Facebook 348 | 36.4 ± 2.0 | 11.7 ± 2.7 | 34.7 ± 1.5 | 27.7 ± 1.6 | 25.7 ± 1.3 |
| Facebook 414 | 59.8 ± 1.8 | 22.1 ± 3.1 | 56.3 ± 2.4 | 48.2 ± 1.7 | 49.2 ± 0.4 |
| Facebook 686 | 21.0 ± 0.9 | 1.5 ± 0.7 | 20.6 ± 1.4 | 19.8 ± 1.1 | 13.5 ± 0.9 |
| Facebook 698 | 41.7 ± 3.6 | 1.4 ± 1.3 | 49.3 ± 3.4 | 42.2 ± 2.7 | 41.5 ± 1.5 |
| Facebook 1684 | 26.1 ± 1.3 | 17.1 ± 2.0 | 34.7 ± 2.6 | 31.9 ± 2.2 | 22.3 ± 1.4 |
| Facebook 1912 | 35.6 ± 1.3 | 17.5 ± 1.9 | 36.8 ± 1.6 | 33.3 ± 1.4 | 18.3 ± 1.2 |
| Chemistry | 45.3 ± 2.3 | 46.6 ± 2.9 | 22.6 ± 3.0 | 12.1 ± 4.0 | 5.2 ± 2.3 |
| Computer Science | 50.2 ± 2.0 | 49.2 ± 2.0 | 34.2 ± 2.3 | 31.9 ± 3.8 | 15.1 ± 2.2 |
| Engineering | 39.1 ± 4.5 | 44.5 ± 3.2 | 18.4 ± 1.9 | 15.8 ± 2.1 | 7.6 ± 2.2 |
| Medicine | 37.8 ± 2.8 | 31.8 ± 2.1 | 27.4 ± 2.5 | 23.6 ± 2.1 | 9.4 ± 2.3 |
Table 2. Comparison of the GNN-based model against simpler baselines. The multilayer perceptron (MLP) and free variable (FV) models optimize the same objective (Equation 4), but represent the community affiliations differently.

Results: Recovery. Table 1 shows how well different methods recover the ground-truth communities. Either NOCD-X or NOCD-G achieves the highest score for 9 out of 10 datasets. We found that the NMI of both methods is strongly correlated with the reconstruction loss (Equation 4): NOCD-G outperforms NOCD-X in terms of NMI exactly in those cases when NOCD-G achieves a lower reconstruction loss. This means that we can pick the better-performing of the two methods in a completely unsupervised fashion by only considering the loss values.
Results: Hyperparameter sensitivity. It's worth noting again that both NOCD models use the same hyperparameter configuration, which was tuned only on the Computer Science dataset. Nevertheless, both models achieve excellent results on datasets with dramatically different characteristics (e.g., the much smaller Facebook 414 graph).
Results: Scalability. In addition to displaying excellent recovery results, NOCD is highly scalable. NOCD is trained on the Medicine dataset (63K nodes, 810K edges) using a single GTX1080Ti GPU in 3 minutes, while only using 750MB of GPU RAM (out of 11GB available). See Appendix D for more details on hardware.
EPM, SNetOC and CDE don't scale to larger datasets, since they instantiate very large dense matrices during computations. SNMF and BigCLAM, while being the most scalable methods and having lower runtimes than NOCD, achieved relatively low recovery scores. Generating the embeddings with DeepWalk and Graph2Gauss can be done very efficiently. However, overlapping clustering of the embeddings with NEO-K-Means was the bottleneck, which led to runtimes exceeding several hours for the large datasets. As the authors of CESNA point out (Yang et al., 2013), the method scales to large graphs if the number of attributes $D$ is low. However, as $D$ increases, which is common for modern datasets, the method scales rather poorly. This is confirmed by our findings: on the Medicine dataset, CESNA (parallel version with 18 threads) took 2 hours to converge.
4.2. Do we really need a graph neural network?
Our GNN-based model achieved superior performance in community recovery. Intuitively, it makes sense to use a GNN for the reasons laid out in Section 3.2. Nevertheless, we should ask whether it's possible to achieve comparable results with a simpler model. To answer this question, we consider the following two baselines.
Multilayer perceptron (MLP): Instead of a GCN (Equation 6), we use a simple fully-connected neural network to generate $F$:

$F := \mathrm{MLP}_\theta(X)$   (7)

with a ReLU applied to the output layer, as before, to ensure non-negativity. This is related to the model proposed by Hu et al. (2017). As for the GCN-based model, we optimize the weights $\theta$ of the MLP to minimize the objective of Equation 4:

$\theta^\star = \arg\min_\theta \, \mathcal{L}\big(\mathrm{MLP}_\theta(X)\big)$   (8)
Free variable (FV): As an even simpler baseline, we treat $F$ as a free variable in the optimization and solve

$F^\star = \arg\min_{F \geq 0} \, \mathcal{L}(F)$   (9)

We optimize the objective using projected gradient descent with Adam (Kingma and Ba, 2015), and update all entries of $F$ at each iteration. This can be seen as an improved version of the BigCLAM model: the original BigCLAM uses the imbalanced objective (Equation 3) and optimizes $F$ using coordinate ascent with backtracking line search.
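A minimal sketch of one projected-gradient update for the free variable baseline. Plain gradient descent is used here for simplicity, whereas the actual baseline uses Adam; the learning rate is a placeholder.

```python
import numpy as np

def projected_step(F, grad, lr=0.05):
    """One projected-gradient update: descend, then project onto F >= 0."""
    return np.maximum(F - lr * grad, 0.0)
```

The projection `max(F, 0)` after every step is what keeps the iterates inside the non-negative orthant required by Equation 9.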
Setup. For both the MLP and FV models, we tuned the hyperparameters on the Computer Science dataset (just as we did for the GNN model), and used the same configuration for all datasets. Details about the configuration of both models are provided in Appendix B. As before, we consider variants of the GNN-based and MLP-based models that use either $X$ or $A$ as input features. We compare the NMI scores obtained by the models on all 10 datasets.
Results. The results for all models are shown in Table 2. The two neural-network-based models consistently outperform the free variable model. When node attributes are used, the MLP-based model outperforms the GNN version on the Chemistry and Engineering datasets, where the node features alone provide a strong signal. However, the MLP achieves extremely low scores on the Facebook 686 and Facebook 698 datasets, where the attributes are not as reliable. On the other hand, when $A$ is used as input, the GNN-based model always outperforms the MLP. Combined, these findings confirm our hypothesis that a graph-based neural network architecture is indeed beneficial for the community detection task.
5. Related work
The problem of community detection in graphs is well-established in the research literature. However, most works study detection of non-overlapping communities (Abbe, 2018; Von Luxburg, 2007). Algorithms for overlapping community detection can be broadly divided into methods based on non-negative matrix factorization (Li et al., 2018; Wang et al., 2011; Kuang et al., 2012), probabilistic inference (Yang and Leskovec, 2013; Zhou, 2015; Todeschini et al., 2016; Latouche et al., 2011), and heuristics (Gleich and Seshadhri, 2012; Galbrun et al., 2014; Ruan et al., 2013; Li et al., 2015).

Deep learning for graphs can be broadly divided into two categories: graph neural networks and node embeddings. GNNs (Kipf and Welling, 2017; Hamilton et al., 2017; Xu et al., 2018) are specialized neural network architectures that can operate on graph-structured data. The goal of embedding approaches (Perozzi et al., 2014; Kipf and Welling, 2016; Grover and Leskovec, 2016; Bojchevski and Günnemann, 2018) is to learn vector representations of nodes in a graph that can then be used for downstream tasks. While embedding approaches work well for detecting disjoint communities (Cavallari et al., 2017; Tsitsulin et al., 2018), they are not well-suited for overlapping community detection, as we showed in our experiments. This is caused by the lack of reliable and scalable approaches for overlapping clustering of vector data.
Several works have proposed deep learning methods for community detection. Yang et al. (2016) and Cao et al. (2018) use neural nets to factorize the modularity matrix, while Cavallari et al. (2017) jointly learn embeddings for nodes and communities. However, none of these methods can handle overlapping communities. Also related to our model is the approach by Hu et al. (2017), who use a deep belief network to learn community affiliations. However, their neural network architecture does not use the graph, which we have shown to be crucial in Section 4.2, and, just like EPM and SNetOC, it relies on MCMC, which heavily limits the scalability of their approach. Lastly, Chen et al. (2019) designed a GNN for supervised community detection, which is a very different setting.

6. Discussion & Future work
We proposed NOCD, a graph neural network model for overlapping community detection. Our experimental evaluation confirms that the model is accurate, flexible and scalable.
Besides strong empirical results, our work opens interesting follow-up questions. We plan to investigate how the two versions of our model (NOCD-X and NOCD-G) can be used to quantify the relevance of attributes to the community structure. Moreover, we plan to assess the inductive performance of NOCD (Hamilton et al., 2017).
To summarize, the results obtained in this paper provide strong evidence that deep learning for graphs deserves more attention as a framework for overlapping community detection.
Acknowledgments
This research was supported by the German Research Foundation, Emmy Noether grant GU 1409/21.
References
 mag ([n. d.]). Microsoft Academic Graph. https://kddcup2016.azurewebsites.net/.
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).
 Abbe (2018) Emmanuel Abbe. 2018. Community Detection and Stochastic Block Models: Recent Developments. JMLR 18 (2018).
 Bojchevski and Günnemann (2018) Aleksandar Bojchevski and Stephan Günnemann. 2018. Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. In ICLR.
 Cai et al. (2018) Hongyun Cai, Vincent W Zheng, and Kevin Chang. 2018. A comprehensive survey of graph embedding: problems, techniques and applications. TKDD (2018).
 Cao et al. (2018) Jinxin Cao, Di Jin, Liang Yang, and Jianwu Dang. 2018. Incorporating network structure with node contents for community detection on large networks using deep learning. Neurocomputing 297 (2018).
 Cavallari et al. (2017) Sandro Cavallari, Vincent W Zheng, Hongyun Cai, Kevin ChenChuan Chang, and Erik Cambria. 2017. Learning community embedding with community detection and node embedding on graphs. In CIKM.

 Chen et al. (2018) Jianfei Chen, Jun Zhu, and Le Song. 2018. Stochastic Training of Graph Convolutional Networks with Variance Reduction. In ICML.
 Chen et al. (2019) Zhengdao Chen, Xiang Li, and Joan Bruna. 2019. Supervised Community Detection with Hierarchical Graph Neural Networks. In ICLR.

 Choong et al. (2018) Jun Jin Choong, Xin Liu, and Tsuyoshi Murata. 2018. Learning community structure with variational autoencoder. In ICDM.
 Fortunato and Hric (2016) Santo Fortunato and Darko Hric. 2016. Community detection in networks: A user guide. Physics Reports 659 (2016).
 Galbrun et al. (2014) Esther Galbrun, Aristides Gionis, and Nikolaj Tatti. 2014. Overlapping community detection in labeled graphs. Data Mining and Knowledge Discovery 28 (2014).
 Garcia et al. (2018) Javier O Garcia, Arian Ashourvan, Sarah Muldoon, Jean M Vettel, and Danielle S Bassett. 2018. Applications of community detection techniques to brain graphs: Algorithmic considerations and implications for neural function. Proc. IEEE 106 (2018).
 Girvan and Newman (2002) Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. PNAS 99 (2002).
 Gleich and Seshadhri (2012) David F Gleich and C Seshadhri. 2012. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In KDD.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS.
 He and Garcia (2008) Haibo He and Edwardo A Garcia. 2008. Learning from imbalanced data. TKDE 9 (2008).
 Hu et al. (2017) Changwei Hu, Piyush Rai, and Lawrence Carin. 2017. Deep Generative Models for Relational Data with Side Information. ICML.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR (2015).
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational Graph AutoEncoders. NIPS Workshop on Bayesian Deep Learning.
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semisupervised classification with graph convolutional networks. ICLR.
 Kuang et al. (2012) Da Kuang, Chris Ding, and Haesun Park. 2012. Symmetric nonnegative matrix factorization for graph clustering. In SDM.
 Latouche et al. (2011) Pierre Latouche, Etienne Birmelé, Christophe Ambroise, et al. 2011. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics 5 (2011).
 Li et al. (2015) Yixuan Li, Kun He, David Bindel, and John E Hopcroft. 2015. Uncovering the small community structure in large networks: A local spectral approach. In WWW.
 Li et al. (2018) Ye Li, Chaofeng Sha, Xin Huang, and Yanchun Zhang. 2018. Community Detection in Attributed Graphs: An Embedding Approach. In AAAI.
 Mcauley and Leskovec (2014) Julian Mcauley and Jure Leskovec. 2014. Discovering social circles in ego networks. TKDD 8 (2014).
 McDaid et al. (2011) Aaron F McDaid, Derek Greene, and Neil Hurley. 2011. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515 (2011).
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In KDD.
 Pinheiro (2012) Carlos André Reis Pinheiro. 2012. Community detection to identify fraud events in telecommunications networks. SAS SUGI proceedings: customer intelligence (2012).
 Ruan et al. (2013) Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. 2013. Efficient community detection in large networks using content and links. In WWW.
 Song and Singh (2009) Jimin Song and Mona Singh. 2009. How and when should interactomederived clusters be used to predict functional modules and protein function? Bioinformatics 25 (2009).
 Todeschini et al. (2016) Adrien Todeschini, Xenia Miscouridou, and François Caron. 2016. Exchangeable random measures for sparse and modular graphs with overlapping communities. arXiv:1602.02114 (2016).
 Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW.

 Von Luxburg (2007) Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and computing 17 (2007).
 Wang et al. (2011) Fei Wang, Tao Li, Xin Wang, Shenghuo Zhu, and Chris Ding. 2011. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery 22 (2011).
 Whang et al. (2015) Joyce Jiyoung Whang, Inderjit S Dhillon, and David F Gleich. 2015. Nonexhaustive, Overlapping kmeans. In SDM.
 Xie et al. (2013) Jierui Xie, Stephen Kelley, and Boleslaw K Szymanski. 2013. Overlapping community detection in networks: The stateoftheart and comparative study. CSUR 45 (2013).
 Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Kenichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. ICML (2018).
 Yang and Leskovec (2013) Jaewon Yang and Jure Leskovec. 2013. Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM.
 Yang and Leskovec (2014) Jaewon Yang and Jure Leskovec. 2014. Structure and Overlaps of GroundTruth Communities in Networks. ACM TIST 5 (2014).
 Yang et al. (2013) Jaewon Yang, Julian McAuley, and Jure Leskovec. 2013. Community detection in networks with node attributes. In ICDM.
 Yang et al. (2016) Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang. 2016. Modularity Based Community Detection with Deep Learning.. In IJCAI.

 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD.
 Zhou (2015) Mingyuan Zhou. 2015. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS.
Appendix A Datasets
| Dataset | Network type | Communities |
|---|---|---|
| Facebook 348 | Social | 14 |
| Facebook 414 | Social | 7 |
| Facebook 686 | Social | 14 |
| Facebook 698 | Social | 13 |
| Facebook 1684 | Social | 17 |
| Facebook 1912 | Social | 46 |
| Computer Science | Co-authorship | 18 |
| Chemistry | Co-authorship | 14 |
| Medicine | Co-authorship | 17 |
| Engineering | Co-authorship | 16 |
Appendix B Model configuration
B.1. Architecture
We picked the hyperparameters and chose the model architecture for all 3 models by only considering their performance (NMI) on the Computer Science dataset. No additional tuning for other datasets has been done.
GNN-based model. (Equation 6) We use a 2-layer graph convolutional network, with a hidden size of 128 and an output (second) layer of size $C$ (the number of communities to detect). We apply batch normalization after the first graph convolution layer. Dropout with 50% keep probability is applied before every layer. We add weight decay ($L_2$ regularization) to both weight matrices. The feature matrix $X$ (or $A$, in case we are working without attributes) is normalized such that every row has unit norm.
We also experimented with the Jumping Knowledge Network (Xu et al., 2018) and GraphSAGE (Hamilton et al., 2017) architectures, but they led to lower NMI scores on the Computer Science dataset.
MLP-based model. (Equation 7) We found the MLP model to perform best with the same configuration as described above for the GCN model (i.e., the same regularization strength, hidden size, dropout and batch norm).
Free variable model. (Equation 9) We considered two initialization strategies for the free variable model: (1) locally minimal neighborhoods (Gleich and Seshadhri, 2012), the strategy used by the BigCLAM and CESNA models, and (2) initializing $F$ to the output of an untrained GCN. We found strategy (1) to consistently provide better results.
B.2. Training
GNN- and MLP-based models. We train both models using the Adam optimizer (Kingma and Ba, 2015) with default parameters and a fixed learning rate. We use the following early stopping strategy: every 50 epochs, we compute the full training loss (Equation 4). We stop optimization if the loss hasn't improved for a fixed number of consecutive evaluations, or after 5000 epochs, whichever happens first.
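The early-stopping loop described above can be sketched as follows. The `patience` value is a placeholder, since the exact number of evaluations without improvement is not specified here; `step_fn` and `full_loss_fn` stand in for one training step and the full-loss evaluation.

```python
def train_with_early_stopping(step_fn, full_loss_fn,
                              max_epochs=5000, check_every=50, patience=10):
    """Run step_fn each epoch; every `check_every` epochs evaluate the full
    training loss and stop after `patience` checks without improvement."""
    best = float("inf")
    bad_checks = 0
    for epoch in range(max_epochs):
        step_fn(epoch)
        if epoch % check_every == 0:
            loss = full_loss_fn()
            if loss < best:
                best, bad_checks = loss, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break
    return best
```

Evaluating the full loss only periodically keeps the overhead of the stopping criterion small relative to the stochastic training steps.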
Free variable model. We use the Adam optimizer with a fixed learning rate. After every gradient step, we project the matrix $F$ to ensure that it stays non-negative: $F \leftarrow \max(F, 0)$, applied elementwise. We use the same early stopping strategy as for the GNN and MLP models.
Appendix C Baselines
| Method | Model type | Attributed | Scalable |
|---|---|---|---|
| BigCLAM (Yang and Leskovec, 2013) | Probabilistic | – | ✓ |
| CESNA (Yang et al., 2013) | Probabilistic | ✓ | – |
| SNetOC (Todeschini et al., 2016) | Probabilistic | – | – |
| EPM (Zhou, 2015) | Probabilistic | – | – |
| CDE (Li et al., 2018) | NMF | ✓ | – |
| SNMF (Wang et al., 2011) | NMF | – | ✓ |
| DW/NEO (Perozzi et al., 2014; Whang et al., 2015) | Deep learning | – | – |
| G2G/NEO (Bojchevski and Günnemann, 2018; Whang et al., 2015) | Deep learning | ✓ | – |
| NOCD | Deep learning | ✓ | ✓ |

We used the reference C++ implementations of BigCLAM and CESNA provided by the authors (https://github.com/snap-stanford/snap). Both models were run with the default parameter settings for step size, backtracking line search constants, and balancing terms. Since CESNA can only handle binary attributes, we binarize the original attributes (set the nonzero entries to 1) if they have a different type.

We implemented SNMF ourselves in Python. The factor matrix is randomly initialized. We run optimization until the per-iteration improvement in the reconstruction loss falls below a small threshold, or for 300 epochs, whichever happens first. The results for SNMF are averaged over 50 random initializations.
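Since the exact update rule is not spelled out here, the following shows one common way to implement symmetric NMF with damped multiplicative updates. This is a sketch under those assumptions, not the implementation used in the experiments.

```python
import numpy as np

def snmf(A, k, max_iter=300, tol=1e-6, seed=0):
    """Symmetric NMF: minimize ||A - F F^T||_F^2 subject to F >= 0.

    Uses damped multiplicative updates; stops when the per-iteration
    improvement in the reconstruction loss falls below `tol`.
    """
    rng = np.random.default_rng(seed)
    F = rng.random((A.shape[0], k))
    prev_loss = np.inf
    for _ in range(max_iter):
        numer = A @ F
        denom = np.maximum(F @ (F.T @ F), 1e-12)  # avoid division by zero
        F *= 0.5 + 0.5 * numer / denom            # damping improves stability
        loss = np.linalg.norm(A - F @ F.T) ** 2
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return F
```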

We use the Matlab implementation of CDE provided by the authors. We set the hyperparameters to the values recommended in the paper and run optimization for 20 iterations.

For SNetOC and EPM we use the Matlab implementations provided by the authors with the default hyperparameter settings. The implementation of EPM provides two options: EPM and HEPM. We found EPM to produce better NMI scores, so we used it for all experiments.

We use the TensorFlow implementation of Graph2Gauss provided by the authors. We set the dimension of the embeddings to 128 and only use the mean vectors as embeddings.

We implemented DeepWalk ourselves: we sample 10 random walks of length 80 from each node and use the Word2Vec implementation from Gensim (https://radimrehurek.com/gensim/) to generate the embeddings. The dimension of the embeddings is set to 128.
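The walk-sampling step can be sketched as follows. The Word2Vec hyperparameters in the trailing comment (window size, skip-gram mode) are assumptions typical for DeepWalk, not values taken from the text.

```python
import random

def sample_walks(adj, num_walks=10, walk_length=80, seed=0):
    """Sample `num_walks` uniform random walks of length `walk_length` per node.

    adj: dict mapping each node to a list of its neighbors.
    Returns walks as lists of string tokens, ready for Word2Vec.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(adj)
        rng.shuffle(nodes)                 # new node order for every pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:               # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])
    return walks

# The walks can then be fed to Gensim, e.g.:
# from gensim.models import Word2Vec
# model = Word2Vec(walks, vector_size=128, window=10, sg=1, min_count=0)
```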

For NEO-K-Means, we use the Matlab code provided by the authors. We let the parameters be selected automatically using the built-in procedure.
Appendix D Hardware and software
The experiments were performed on a computer running Ubuntu 16.04 LTS with 2x Intel(R) Xeon(R) E5-2630 v4 @ 2.20GHz CPUs, 256GB of RAM, and 4x GTX 1080 Ti GPUs. Note that training and inference were done using only a single GPU at a time for all models. The NOCD model was implemented using TensorFlow v1.1.2 (Abadi et al., 2016).
Appendix E Convergence of the stochastic sampling procedure
Instead of using all pairs of nodes when computing the gradients at every iteration, we sample a batch of edges and an equal number of non-edges uniformly at random. We perform the following experiment to ensure that our training procedure converges to the same result as when using the full objective.
Experimental setup. We train the model on the Computer Science dataset and compare the full-batch optimization procedure with stochastic gradient descent for different choices of the batch size. Starting from the same initialization, we measure the full loss (Equation 4) over the iterations.
Results. Figure 1 shows training curves for different batch sizes, as well as for full-batch training. The horizontal axis of the plot displays the number of entries of the adjacency matrix accessed. One iteration of stochastic training accesses a number of entries proportional to the batch size, while one iteration of full-batch training accesses a number of entries proportional to the number of edges, since we are using the caching trick from (Yang and Leskovec, 2013). As we see, the stochastic training procedure is stable: for sufficiently large batch sizes, the loss converges very closely to the value achieved by full-batch training.
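The edge/non-edge sampling described above can be sketched with simple rejection sampling. This is an illustrative sketch; the actual implementation may sample non-edges differently.

```python
import numpy as np

def sample_batch(edges, num_nodes, batch_size, seed=0):
    """Sample `batch_size` edges and `batch_size` non-edges uniformly at random.

    edges: integer array of shape (num_edges, 2).
    """
    rng = np.random.default_rng(seed)
    edge_set = {tuple(e) for e in edges}
    # Positive examples: uniform sample of existing edges (with replacement).
    pos = edges[rng.integers(0, len(edges), size=batch_size)]
    # Negative examples: rejection-sample node pairs that are not edges.
    neg = []
    while len(neg) < batch_size:
        i, j = rng.integers(0, num_nodes, size=2)
        if i != j and (i, j) not in edge_set and (j, i) not in edge_set:
            neg.append((i, j))
    return pos, np.array(neg)
```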
Appendix F Quantifying agreement between overlapping communities
A popular choice for quantifying the agreement between true and predicted overlapping communities is the symmetric agreement score (Yang and Leskovec, 2013; Yang et al., 2013; Li et al., 2018). Given the ground-truth communities C* and the predicted communities C, the symmetric score is defined as
(1/(2|C*|)) Σ_{C_i* ∈ C*} max_{C_j ∈ C} δ(C_i*, C_j) + (1/(2|C|)) Σ_{C_j ∈ C} max_{C_i* ∈ C*} δ(C_i*, C_j) (10)
where δ is a similarity measure between sets, such as the F1 score or Jaccard similarity.
We discovered that these frequently used measures can assign arbitrarily high scores to completely uninformative community assignments, as the following simple example shows. Let the ground truth consist of two disjoint communities of n nodes each, and let the algorithm assign all 2n nodes to a single community. While this predicted community assignment is completely uninformative, it achieves a symmetric F1 score of 2/3 (each true community of size n overlaps the predicted community of size 2n in n nodes, giving F1 = 2n/(n + 2n) = 2/3) and a symmetric Jaccard similarity of n/(2n) = 1/2. These high numbers might give a false impression that the algorithm has learned something useful, while that clearly isn't the case. As an alternative, we suggest using overlapping normalized mutual information (NMI), as defined in (McDaid et al., 2011). NMI correctly handles degenerate cases like the one above and assigns them a score of 0.
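The degenerate example above can be checked numerically with a direct transcription of the symmetric agreement score, using the F1 score as the set similarity:

```python
def f1(a, b):
    """F1 score between two sets of nodes."""
    inter = len(a & b)
    return 2 * inter / (len(a) + len(b)) if inter else 0.0

def symmetric_score(true_comms, pred_comms, sim=f1):
    """Symmetric agreement score between two overlapping community covers."""
    fwd = sum(max(sim(t, p) for p in pred_comms) for t in true_comms)
    bwd = sum(max(sim(t, p) for t in true_comms) for p in pred_comms)
    return fwd / (2 * len(true_comms)) + bwd / (2 * len(pred_comms))

# Two disjoint ground-truth communities of 5 nodes each; the "algorithm"
# puts all 10 nodes into a single community.
true = [set(range(0, 5)), set(range(5, 10))]
pred = [set(range(0, 10))]
score = symmetric_score(true, pred)  # 2/3, despite an uninformative prediction
```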