Combining Random Walks and Nonparametric Bayesian Topic Model for Community Detection

07/19/2016 ∙ by Ruimin Zhu, et al. ∙ Northwestern University

Community detection has been an active research area for decades. Among probabilistic models, the Stochastic Block Model has been the most popular. This paper introduces a novel probabilistic model, RW-HDP, based on random walks and the Hierarchical Dirichlet Process, for community extraction. In RW-HDP, random walks conducted in a social network are treated as documents and nodes are treated as words. By using the Hierarchical Dirichlet Process, a nonparametric Bayesian model, we are able not only to cluster nodes into communities but also to determine the number of communities automatically. We use Stochastic Variational Inference for model inference, which makes our method time efficient and easy to extend to an online learning algorithm.


1 Introduction

Social networks have been studied intensively for decades. Community detection is one of the most interesting problems in this area. Social networks usually possess community structure: nodes in the network can be grouped such that nodes in the same group are densely connected while nodes in different groups are sparsely connected. We call such groups communities.

Over the years, many methods have been proposed for community structure extraction. One category is based on random walks on networks [4], [5], [6]. In these methods, random walks on the network serve as input data for subsequent mining of network properties. A simple observation explains why random walks give insight into properties such as community structure: because nodes in the same community tend to be densely connected, the path of a random walker tends to stay within a small number of communities. Combining many random walks is therefore a good way of collecting information about the network and revealing its properties. Moreover, a random walker chooses its next move based solely on the neighborhood of its current position, which makes random walks easy to implement.

Traditional community detection methods require the number of communities to be specified manually. The same problem is common throughout statistics and machine learning: how many classes should a mixture model have? How many factors should a factor analysis use [7]? The classical way to handle this problem is model selection: one first fits several models and then chooses the one that fits the data well while having relatively small complexity. Bayesian nonparametric (BNP) models address the problem in a very different way: they can determine such hyperparameters automatically and can increase the model complexity when new data are observed.

Our model RW-HDP is based on random walks and nonparametric Bayesian modeling mentioned above. We first conduct random walks on the network. Each node is treated as a word, and each random walk is treated as a document. Using random walks as the input data, we fit the HDP topic model to reveal community structure. For fast model inference (posterior approximation), we use Stochastic Variational Inference [1].

Our model has many merits:

  • By equipping the random walks with a probabilistic model, we obtain a better interpretation than other methods.

  • HDP, a nonparametric Bayesian model, can automatically determine the number of communities and, if necessary, automatically increase the number of communities when new data are observed.

  • The random walk part can be parallelized which ensures that our method can be time efficient.

  • The posterior approximation part can easily be extended to an online learning scenario where random walks are not generated at once but keep streaming.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 describes our method in detail. Section 4 outlines the experimental studies and compares our method to others. Section 5 closes with a conclusion and a discussion of possible future work.

2 Background and Related Works

Community detection in complex networks has received a lot of attention over the past decades, and a large number of methods have been proposed. For example, centrality- or betweenness-based approaches and minimum-cut methods are two of the most popular families [15], [16], [17], [18], [19], [20].

Although less popular, random walk based methods also enrich the research in this field. Pons and Latapy [5] introduced a measure of similarity between nodes based on random walks. Rosvall and Bergstrom [6] used the probability flow of random walks on a network as a proxy for information flow to reveal community structure. Perozzi, Al-Rfou, and Skiena [4] conducted random walks on a network, treated them as sentences, and then used Natural Language Processing techniques to map the nodes into a Euclidean space for various tasks, including clustering analysis.


The LDA topic model proposed by Blei et al. [13] has been applied in many domains, such as document modeling [13], image processing [22], and information retrieval [21]. Zhang et al. [2] developed a method for network community detection based on LDA. In their model, a social interaction profile is treated as a document and communities are treated as topics, so LDA fits the community detection task naturally.

Nonparametric Bayesian models are becoming popular due to their flexibility [7], [25] and to new posterior inference methods such as Variational Inference [23], [24]. Teh, Jordan, Beal, and Blei [14] extended the Dirichlet Process to the Hierarchical Dirichlet Process, which can also be viewed as an extension of the LDA model. Hoffman, Blei, Wang, and Paisley [1] developed Stochastic Variational Inference for several nonparametric Bayesian models, including HDP. In social network research, Morup and Schmidt [12], [11] formulated a nonparametric Bayesian generative model of communities for social network analysis. Kim, Gopalan, Blei, and Sudderth [10] proposed the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. Guo, Wilson, and Nordman [9] introduced a series of Bayesian nonparametric statistical models for community detection. Blundell and Teh [8] proposed an efficient Bayesian nonparametric model for discovering hierarchical community structure in social networks.

Relative to these works, our method combines the data generation component of the random walk (RW) approach with the inferential component of the HDP topic model. This combination achieves more than the previous methods do individually, e.g., automatic detection of the number of communities and the use of more informative RW “documents” from which the topic model detects the network communities.

Later we will show that the proposed method leads to clear improvements in several performance measures on real networks when compared to several existing methods; first, we describe the details of the proposed method in the next section.

3 Random Walk Hierarchical Dirichlet Process Topic Model

In this section, we describe the RW-HDP model for network community detection in detail. We first introduce related terminologies and notation. Section 3.1 describes the corpus generation using random walks. Section 3.2 introduces the Hierarchical Dirichlet Process topic model.

Formally, let $G = (V, E, W)$ be an undirected network, where $V$ is the set of nodes, $E$ is the set of edges, and $W_{ij}$ is the weight of the edge between nodes $i$ and $j$. $N = |V|$ is the number of nodes. As we treat each node as a word, $N$ is also the vocabulary size. We assume networks are undirected, but our model can easily be extended to directed ones.

Our notation is partly summarized in Table 1.

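| Symbol | Network (random walk) view | Topic model view |
|---|---|---|
| $N$ | number of nodes | size of the vocabulary |
| $D$ | number of random walks | number of documents |
| $L$ | expected random walk length | expected document length |
| $L_d$ | length of the $d$-th random walk | length of the $d$-th document |
| $w_d$ | the $d$-th random walk | the $d$-th document |
| $w_{dn}$ | $n$-th node in the $d$-th random walk | $n$-th word in the $d$-th document |
| $z_{dn}$ | community assignment of $w_{dn}$ | topic assignment of $w_{dn}$ |

Table 1: Notation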

3.1 Random Walk for Corpus Generation

Each random walk is treated as a document and the collection of random walks is treated as the corpus. To carry out a random walk, we need to specify three elements: the starting point, the length, and the transition probability matrix.

The starting points are sampled independently and uniformly from the node set $V$. The lengths of the random walks are sampled independently from a Poisson distribution parametrized by $L$, the expected length of a random walk. For the transition matrix $P$, we take the edge weights into consideration: let $P_{ij} = W_{ij} / \sum_{k} W_{ik}$, where $W_{ij}$ is the weight of the edge between nodes $i$ and $j$. That is, a random walker at node $i$ chooses one of node $i$'s neighbors to visit in the next step with probability proportional to the weight of the edge between node $i$ and that neighbor.
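The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the adjacency-dictionary representation, the names `sample_poisson` and `generate_walks`, the fixed seed, and the choice to force every walk to contain at least one node are all assumptions made for the example.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for means up to a few hundred (exp(-mean) must not underflow).
    """
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_walks(adjacency, num_walks, expected_length, seed=0):
    """Generate random-walk "documents" on a weighted undirected graph.

    adjacency: dict mapping node -> {neighbor: edge weight} (hypothetical format).
    Returns a list of walks; each walk is a list of visited nodes.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        # Walk length ~ Poisson(expected_length); at least one node (assumption).
        length = max(1, sample_poisson(rng, expected_length))
        node = rng.choice(nodes)  # uniform starting point
        walk = [node]
        while len(walk) < length and adjacency[node]:
            neighbors = list(adjacency[node])
            weights = [adjacency[node][v] for v in neighbors]
            # Next node chosen with probability proportional to edge weight.
            node = rng.choices(neighbors, weights=weights, k=1)[0]
            walk.append(node)
        walks.append(walk)
    return walks
```

Because each walk depends only on its own random draws, independent calls with different seeds can be run in parallel and their outputs concatenated into one corpus.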

The success of the proposed method stems from using RW to generate informative “documents” for topic discovery. A word embedded in good documents can easily be assigned its correct topic; the same word embedded in bad documents cannot be assigned any topic with clarity, even when the same topic model (e.g., the HDP topic model) is used. As a simple example, consider a network with two communities of integer-indexed nodes, the odd-indexed nodes and the even-indexed nodes, with many more within-community connections than between-community connections. Completely random (CR) documents, in which nodes are drawn without regard to the edges, would all look like (3, 8, 1, 6, 4, 7, …), a mixture of odd and even indexes that is hard to assign to any topic clearly. RW, on the other hand, generates documents that look either like (1, 5, 3, 9, 7, …) or like (2, 8, 4, 6, 10, …), which have clearly distinct word distributions. The same word, say 3, cannot easily be assigned a topic in CR documents but can easily be assigned a topic in the RW documents.

The RW documents also have a natural interpretation. An RW document can be regarded as recording the experience of a person in making a sequence of friends in an RW fashion in the network, each time with his newest friend introducing a “neighbor” to him as a newer friend. These documents of names of friends, when accumulated over many such experiences as a corpus will, therefore, be very informative in inferring the community structure, unlike CR documents consisting of only random names.

To speed up corpus generation, random walks can be carried out simultaneously using parallel computing. Also, if we keep carrying out random walks and feeding the newly observed walks to the HDP topic model, our model extends naturally to an online learning model. Although we take the length of a random walk to be a Poisson random variable, this choice is not critical to our method.

3.2 Hierarchical Dirichlet Process Topic Model

In this section, we introduce the Hierarchical Dirichlet Process topic model [14] in detail and show how to apply it for network community structure extraction.

The HDP topic model is a mixture model with an unbounded number of mixture components (topics). In mixture models, each observation is assumed to belong to a cluster or group. In our case, each node $w$ belongs to a topic $z$. Note that in topic models, a word may be assigned to different topics in different documents. For the network community detection problem, we go a step further: we use Bayes' rule to calculate the overall probability of a node belonging to each community and assign the node to the community with the largest posterior probability. If we can obtain the conditional probability $p(w \mid z = k)$ and the topic probability $p(z = k)$, then by Bayes' rule the conditional probability $p(z = k \mid w)$ is given by

$$p(z = k \mid w) = \frac{p(w \mid z = k)\, p(z = k)}{\sum_{j} p(w \mid z = j)\, p(z = j)}. \tag{1}$$

The community assignment of node $w$ is $\arg\max_{k} p(z = k \mid w)$.

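For simplicity of implementation, we derive the HDP topic model using the stick-breaking construction. The HDP topic model couples a set of document-level Dirichlet Processes via a single top-level Dirichlet Process. At the top level, the base measure is a symmetric Dirichlet over the vocabulary simplex. The stick-breaking construction for the HDP topic model is given by [1], where $\sigma(\cdot)$ denotes the stick-breaking weights obtained from a sequence of breaking proportions:

  1. Draw an infinite number of topics, $\beta_k \sim \mathrm{Dirichlet}(\eta)$ for $k \in \{1, 2, 3, \ldots\}$.

  2. Draw corpus breaking proportions, $v_k \sim \mathrm{Beta}(1, \omega)$ for $k \in \{1, 2, 3, \ldots\}$.

  3. For each document $d$:

    1. Draw document-level topic indexes, $c_{di} \sim \mathrm{Mult}(\sigma(v))$ for $i \in \{1, 2, 3, \ldots\}$.

    2. Draw document breaking proportions, $\pi_{di} \sim \mathrm{Beta}(1, \alpha)$ for $i \in \{1, 2, 3, \ldots\}$.

    3. For each word $n$:

      1. Draw topic assignment $z_{dn} \sim \mathrm{Mult}(\sigma(\pi_d))$.

      2. Draw word $w_{dn} \sim \mathrm{Mult}(\beta_{c_{d, z_{dn}}})$.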
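A truncated version of this generative process can be sketched as follows. This is an illustrative sampler only, not the inference code used in the paper: the truncation levels `K` and `T`, the hyperparameter names `eta`, `omega`, and `alpha` (following the notation of [1]), the fixed document length, and the convention that the last stick absorbs the remaining mass are assumptions made for the example.

```python
import random


def stick_weights(breaks):
    """Turn breaking proportions into weights: v_k * prod_{j<k} (1 - v_j).

    Under truncation, the last stick takes whatever mass remains (assumption).
    """
    weights, remaining = [], 1.0
    for v in breaks[:-1]:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def sample_dirichlet(rng, concentration, size):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def sample_corpus(num_docs, doc_length, vocab_size, K=20, T=5,
                  eta=0.5, omega=1.0, alpha=1.0, seed=0):
    rng = random.Random(seed)
    # Step 1: topics, each a distribution over the vocabulary (nodes).
    beta = [sample_dirichlet(rng, eta, vocab_size) for _ in range(K)]
    # Step 2: corpus breaking proportions -> corpus-level topic weights.
    corpus_weights = stick_weights(
        [rng.betavariate(1.0, omega) for _ in range(K)])
    docs = []
    for _ in range(num_docs):  # Step 3: each document
        # 3.1: document-level topic indexes (pointers into the corpus topics).
        pointers = rng.choices(range(K), weights=corpus_weights, k=T)
        # 3.2: document breaking proportions -> document-level weights.
        doc_weights = stick_weights(
            [rng.betavariate(1.0, alpha) for _ in range(T)])
        doc = []
        for _ in range(doc_length):  # 3.3: each word
            z = rng.choices(range(T), weights=doc_weights, k=1)[0]
            topic = pointers[z]
            word = rng.choices(range(vocab_size), weights=beta[topic], k=1)[0]
            doc.append(word)
        docs.append(doc)
    return docs, beta, corpus_weights
```

In RW-HDP the documents are not sampled this way but observed as random walks; the sampler is only meant to make the roles of the corpus-level and document-level variables concrete.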

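Note that $\beta_k$ specifies the word distribution of the $k$-th topic, so $\beta_{kw}$ is the probability of word $w$ under topic $k$. By the stick-breaking construction, the corpus-level distribution of topics is given by

$$p(z = k) = \sigma_k(v) = v_k \prod_{j<k} (1 - v_j),$$

and thus, from (1), the overall probability of node $w$ belonging to topic $k$ is given by

$$p(z = k \mid w) = \frac{\beta_{kw}\, \sigma_k(v)}{\sum_{j} \beta_{jw}\, \sigma_j(v)}. \tag{2}$$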
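As a rough sketch of how (2) turns fitted topic parameters into community labels, the snippet below multiplies each topic's word probability by its corpus-level weight and takes the largest product for every node. The function names and the nested-list layout (`beta[k][w]` for the probability of node `w` under topic `k`, `topic_weights[k]` for the corpus-level weight of topic `k`) are assumptions for illustration; in practice these quantities would come from the fitted variational posterior.

```python
def node_posterior(beta, topic_weights, w):
    """Posterior over topics for node w via Bayes' rule."""
    joint = [beta[k][w] * topic_weights[k] for k in range(len(beta))]
    total = sum(joint)
    if total == 0.0:
        # Node has zero probability under every topic; return a flat posterior
        # (an assumption made to keep the sketch well defined).
        return [1.0 / len(joint)] * len(joint)
    return [value / total for value in joint]


def assign_communities(beta, topic_weights):
    """Assign every node to the topic with the largest posterior probability."""
    num_nodes = len(beta[0])
    assignments = []
    for w in range(num_nodes):
        posterior = node_posterior(beta, topic_weights, w)
        assignments.append(max(range(len(posterior)), key=posterior.__getitem__))
    return assignments
```

Only topics that end up as the best choice for at least one node appear as communities, so the number of detected communities can be smaller than the number of topics in the fitted model.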

The HDP topic model is hierarchical, meaning that its variables govern different ranges. Variables such as the topics $\beta$ and the corpus breaking proportions $v$ are corpus-level (global) variables: they govern the distributions of the observations (words) across all documents. Other variables, such as the document breaking proportions $\pi_d$ and the topic assignments $z_{dn}$, are document-level (local) variables because they only govern the distributions of the observations in a particular document. The scopes of the variables are shown in Figure 1.

Figure 1: HDP Topic Model

We choose Stochastic Variational Inference [1], which combines stochastic optimization with Variational Inference, for posterior approximation. At each iteration, we sample only a mini-batch of documents, update their corresponding local variables, and then treat them as if they appeared $D/B$ times, where $D$ is the number of documents and $B$ is the mini-batch size, to update the global variables. Other constructions of the HDP topic model and the details of Stochastic Variational Inference for the model can be found in Hoffman et al. [1].
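The global update can be illustrated for the topic-word parameters alone. The sketch below follows the general recipe in [1]: rescale the mini-batch statistics as if the batch were replicated to the size of the whole corpus, form an intermediate estimate, and blend it with the current value using a decaying step size of the form $(t + \tau)^{-\kappa}$. It is a simplified sketch, not the full HDP update: the names `step_size` and `svi_topic_update`, the default values of `tau` and `kappa`, and the assumption that the expected word counts per topic (`batch_counts`) have already been computed in the local step are all choices made for the example.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Decaying step size (t + tau) ** (-kappa)."""
    return (t + tau) ** (-kappa)


def svi_topic_update(lam, batch_counts, eta, num_docs, batch_size, t,
                     tau=1.0, kappa=0.7):
    """One stochastic update of the topic-word variational parameters.

    lam[k][w]          current Dirichlet parameter of topic k for word w
    batch_counts[k][w] expected count of word w in topic k within the batch
    eta                symmetric Dirichlet prior on topics
    """
    scale = num_docs / batch_size  # treat the batch as the whole corpus
    rho = step_size(t, tau, kappa)
    updated = []
    for k in range(len(lam)):
        row = []
        for w in range(len(lam[k])):
            intermediate = eta + scale * batch_counts[k][w]
            row.append((1.0 - rho) * lam[k][w] + rho * intermediate)
        updated.append(row)
    return updated
```

Smaller batches make `scale` larger and the intermediate estimate noisier, which is the speed versus noise tradeoff mentioned in the discussion of batch size.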

4 Experiments

In this section, we evaluate our model and compare its performance with that of three previously proposed methods on five networks of different types.

4.1 Models

We deliberately choose three models for comparison, each of which shares one characteristic with ours. RW-HDP has three major characteristics: random walks, nonparametric Bayesian modeling, and topic modeling. We therefore compare with the following three models:

  1. SIP2-LDA [2], a topic based community detection model. In this model, each node is associated with a social interaction profile (SIP2), which only takes a node’s immediate and secondary neighbors into consideration. Those social interaction profiles are treated as documents for community detection using Latent Dirichlet Allocation.

  2. Walktrap [5], a random walk based community detection model. This method does not actually carry out random walks on the network; instead, it defines node-to-node and community-to-community distances based on properties of random walks, such as the transition probability between any pair of nodes within $t$ steps. It then merges communities iteratively to obtain a hierarchical tree of partitions and finally cuts the tree to get the best partition.

  3. BCD [12], a nonparametric Bayesian network generative model. The generative process is as follows: first, a cluster assignment is generated using the Chinese Restaurant Process (a commonly used metaphor for the Dirichlet Process); then, within-cluster and between-cluster link probabilities are generated; finally, links between nodes are generated according to these link probabilities.

4.2 Data

We conduct experiments on several different types of networks:

  1. yeast: a yeast protein complex interaction network [28].

  2. GSE: a breast cancer gene co-expression network [29], [30].

  3. ca-GrQc: Arxiv General Relativity and Quantum Cosmology collaboration network. If an author $i$ co-authored a paper with author $j$, the graph contains an undirected edge between $i$ and $j$ [31].

  4. ca-CondMat: Arxiv Condensed Matter Physics collaboration network [31].

  5. US powergrid: the high-voltage power grid in the Western States of the United States of America. The nodes are transformers, substations, and generators, and the ties are high-voltage transmission lines [32].

The statistics of networks are summarized in Table 2.

| statistics | yeast | GSE | ca-GrQc | ca-CondMat | US powergrid |
|---|---|---|---|---|---|
| type | biology | biology | co-authorship | co-authorship | engineering |
| nodes | 1540 | 9112 | 5242 | 16264 | 4941 |
| edges | 8703 | 244928 | 14478 | 47594 | 6594 |

Table 2: Network Statistics

4.3 Evaluation Metrics

To compare our method with others, we use community scoring functions. A good community tends to be densely connected internally and sparsely connected with other parts of the network. The community scoring functions quantify this intuition in different aspects [3].

Given a set of nodes $S$ (a community of the network $G$), let $G_S = (S, E_S)$ be the subgraph induced by $S$, where $E_S = \{(u, v) \in E : u \in S, v \in S\}$. Let $N$ denote the number of nodes in the original network. Let $n_S$ denote the number of nodes in set $S$, and $m_S$ the number of edges in set $S$. Let $c_S$ denote the number of edges with one end in $S$ and the other outside of $S$: $c_S = |\{(u, v) \in E : u \in S, v \notin S\}|$. We use the following four community scoring functions:

  1. Internal density: $m_S / (n_S (n_S - 1) / 2)$. This metric scores the community structure based on its internal connectivity. A larger internal density usually means a better community structure [34].

  2. Cut ratio: $c_S / (n_S (N - n_S))$, which quantifies the community structure based on its external connectivity. A smaller cut ratio usually means a better community structure [35].

  3. Conductance: $c_S / (2 m_S + c_S)$, which measures the fraction of edge volume that points outside the cluster. It combines both internal and external connectivity to give a score. A smaller conductance usually means a better community structure [36].

  4. Modularity: $Q = \sum_{i=1}^{C} (e_{ii} - a_i^2)$, where $C$ is the number of communities, $e_{ij}$ is the fraction of edges with one end in community $i$ and the other in community $j$, and $a_i = \sum_{j} e_{ij}$. This index falls in $[-1/2, 1]$. A larger modularity means a better community structure [37].
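The four scoring functions can be computed directly from an edge list. The sketch below uses the standard definitions (see [3] and [37]) and is meant as an illustration, not as the evaluation code behind the reported numbers: the function names, the edge-list and membership-dictionary formats, the choice to return 0.0 when a denominator vanishes, and the computation of $a_i$ as the fraction of edge ends attached to community $i$ (the usual convention for modularity) are assumptions.

```python
def community_scores(edges, community, num_nodes):
    """Internal density, cut ratio, and conductance of one community.

    edges: list of undirected edges (u, v), each listed once.
    community: iterable of the nodes in the community.
    num_nodes: number of nodes in the whole network.
    """
    members = set(community)
    n_s = len(members)
    # Edges with both ends inside the community.
    m_s = sum(1 for u, v in edges if u in members and v in members)
    # Edges with exactly one end inside the community.
    c_s = sum(1 for u, v in edges if (u in members) != (v in members))
    possible_internal = n_s * (n_s - 1) / 2
    possible_boundary = n_s * (num_nodes - n_s)
    volume = 2 * m_s + c_s
    return {
        "internal_density": m_s / possible_internal if possible_internal else 0.0,
        "cut_ratio": c_s / possible_boundary if possible_boundary else 0.0,
        "conductance": c_s / volume if volume else 0.0,
    }


def modularity(edges, membership):
    """Modularity of a partition; membership maps node -> community label."""
    m = len(edges)
    internal, degree = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Sum over communities of e_ii - a_i^2.
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)
```

For example, on two triangles joined by a single edge, the triangle {0, 1, 2} has internal density 1, cut ratio 1/9, and conductance 1/7, and the two-triangle partition has modularity 5/14.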

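For the two topic based models, SIP2-LDA and RW-HDP, we also compare their perplexity scores [13] on the testing corpus. Perplexity is defined as

$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left( - \frac{\sum_{d \in D_{\mathrm{test}}} \log p(w_d)}{\sum_{d \in D_{\mathrm{test}}} L_d} \right),$$

where $D_{\mathrm{test}}$ is the testing corpus, which consists of either random walks in the RW-HDP model or randomly sampled social interaction profiles in the SIP2-LDA model, $w_d$ is the $d$-th test document, and $L_d$ is its length. A smaller perplexity score corresponds to a better topic model.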
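As a small illustration of the perplexity definition from [13], the function below exponentiates the negative average per-word log-likelihood of the test documents. It assumes the per-document log-likelihoods have already been estimated by the fitted model (for variational methods these are typically lower bounds); the function name and argument layout are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- total log-likelihood / total number of words) on a test corpus."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("test corpus contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)
```

A model that assigns every word a uniform probability over a vocabulary of size $V$ has perplexity exactly $V$, which gives a convenient sanity check for an implementation.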

4.4 Choice of Hyperparameters

There are three types of hyperparameters that are worthy of study. They are corpus, HDP topic model, and Stochastic Variational Inference hyperparameters. Here, we briefly explain the principles and intuitions of choosing proper hyperparameters.


The corpus hyperparameters are the average random walk length $L$ and the number of random walks $D$. If $L$ is too large, a document will contain words from many topics; intuitively, such a document is not well written, and it will be hard for the topic model to find any meaningful topic assignment. $L$ cannot be too small either. An extreme case is $L = 1$, where it is impossible to capture the dependencies between nodes, so community detection by topic models is futile. In our experiments, we set $L$ to be around 100. The same principle applies to the corpus size $D$. We set $D$ to be approximately five times the number of nodes.

The HDP topic model has an infinite number of topics, which makes implementation and variational inference difficult. Instead, we truncate at both the corpus level and the document level. At the corpus level, we fit $K$ breaking points as the topic choices. At the document level, we fit $T$ topic pointers and let the topic assignment variable take on one of $T$ values. $T$ can be much smaller than $K$, as there might be hundreds of topics in the corpus while a single document exhibits only a small number of them. To see that this is still an infinite model, note that when the truncations are set high enough, the variational posterior will not necessarily use all topics [1].

The last type of hyperparameters consists of those in Stochastic Variational Inference, such as the batch size and the number of epochs. The choice of batch size is a tradeoff between speed and noise. We set the batch size to a small number, since noise is tolerable in our community detection task, and make a fixed number of passes (epochs) over the corpus.

4.5 Results and Comparison

| model | yeast | GSE | ca-GrQc | ca-CondMat | US powergrid |
|---|---|---|---|---|---|
| RW-HDP | 0.7605 | 0.5967 | 0.7848 | 0.7588 | 0.9087 |
| SIP2-LDA | 0.6995 | 0.5881 | 0.7479 | 0.6615 | 0.7775 |
| Walktrap | 0.6968 | 0.6014 | 0.7430 | 0.7238 | 0.8953 |
| BCD | 0.6452 | 0.2017 | 0.5378 | 0.5041 | 0.4802 |

Table 3: modularity

| model | yeast | GSE | ca-GrQc | ca-CondMat | US powergrid |
|---|---|---|---|---|---|
| RW-HDP | 62.26 | 1124.51 | 504.16 | 1262.18 | 235.46 |
| SIP2-LDA | 279.95 | 1664.80 | 2902.81 | 41920.72 | 7197.49 |

Table 4: perplexity

Figure 2: Boxplots of internal density, cut ratio and conductance. 1: RW-HDP, 2: SIP2-LDA, 3: Walktrap, 4: BCD.

Our model outperforms the other models in almost every aspect and on almost every dataset. The exceptions are that the modularity of our model on the GSE network is slightly smaller than that of Walktrap, and the internal densities of our model on the US powergrid and ca-CondMat networks are not the largest among all models.

RW-HDP outperforms SIP2-LDA, which we attribute to random walks collecting more information than social interaction profiles, since the latter only cover a node's immediate and secondary neighbors. In addition, the HDP topic model is more flexible than LDA, as it is a nonparametric Bayesian model. Both RW-HDP and Walktrap rely on random walks, but the transition probability between nodes within $t$ steps may not be the best way to capture node similarity, which could explain why Walktrap performs less well. Walktrap also treats community detection as an optimization problem: it merges two communities according to Ward's method, which minimizes the mean of the squared distances between each node and its community, and then cuts the hierarchical tree where modularity is largest, which could lead to less satisfying results on other metrics. RW-HDP, in contrast, is a probabilistic model, which may be why it is more flexible. As for BCD, although it is also a nonparametric Bayesian model, its hierarchy is relatively shallow, which could be why it lacks the power to reveal the network structure.

5 Conclusion and Future Works

Network community detection has been studied intensively by researchers from various disciplines. In this paper, we present the RW-HDP model for this task. As the name suggests, the two pillars of RW-HDP are random walks and the Hierarchical Dirichlet Process topic model. We first conduct random walks on the network and treat them as documents. We then fit the HDP topic model to reveal community structure. As HDP is a nonparametric model, our method finds the number of communities automatically. Our work is a new endeavor in nonparametric Bayesian modeling of networks. It borrows ideas from several previous works and, in our experiments, outperforms them on most metrics.

At the current stage, the choices of hyperparameters are based on intuitive rules; a more rigorous mathematical treatment is left as future work. Our model can also be improved in several ways. One possibility is to allow overlapping community detection. Currently, we assume that each node belongs to a single community, the one that gives the largest conditional probability. We may relax this assumption and assign a node to every community whose conditional probability exceeds some threshold, which would allow overlapping communities. Finally, we believe that by replacing the HDP topic model with other topic models, we can find other types of community structure in networks, such as hierarchical community structure. Such applications of topic models to social networks remain to be explored.

References

  • [1] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. Journal of Machine Learning Research 14(2013) 1303-1347.
  • [2] Haizheng Zhang, Baojun Qiu, C. Lee Giles, Henry C. Foley and John Yen. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In IEEE International Conference on Intelligence and Security Informatics, 2007.
  • [3] Jaewon Yang, and Jure Leskovec. Defining and Evaluating Network Communities based on Ground-truth. IEEE International Conference on Data Mining, 2012.
  • [4] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD, 2014.
  • [5] Pascal Pons, and Matthieu Latapy. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications 10.2 (2006): 191-218.
  • [6] Martin Rosvall, and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123 (2008).
  • [7] Samuel J. Gershman, and David M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology 56.1 (2012): 1-12.
  • [8] Charles Blundell, and Yee Whye Teh. Bayesian hierarchical community discovery. NIPS proceeding, 2013.
  • [9] Jiqiang Guo, Alyson G. Wilson, and Daniel J. Nordman. Bayesian nonparametric models for community detection. Technometrics 55.4 (2013): 390-402.
  • [10] Dae Il Kim, Prem Gopalan, David M. Blei, and Erik B. Sudderth. Efficient Online Inference for Bayesian Nonparametric Relational Models. NIPS, 2013.
  • [11] Mikkel N. Schmidt, and Morten Morup. Non-parametric Bayesian modeling of complex networks. IEEE Signal Processing Magazine, pp. 110–128, 2013.
  • [12] Morten Morup, and Mikkel N. Schmidt. Bayesian community detection. Neural Computation, vol. 24, no. 9, pp. 2434–2456, 2012.
  • [13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [14] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet Processes. Technical report, UC Berkeley Statistics TR-653, 2004.
  • [15] L. Freeman. A set of measures of centrality based on betweenness. Sociometry, 1977, pp. 178-185.
  • [16] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proc Natl Acad Sci USA, vol. 99, no. 12, pp. 7821-7826, June 2002.
  • [17] D. M. Wilkinson and B. A. Huberman. A method for finding communities of related genes. Proc Natl Acad Sci USA, vol. 101 Suppl 1, pp. 5241-5248, April 2004.
  • [18] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman and Co., 1979.
  • [19] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. in KDD ’00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM Press, 2000, pp. 150-160.
  • [20] G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum cut trees. Internet Mathematics, vol. 1, no. 4, pp. 385-408, 2004.
  • [21] X. Wei and W. B. Croft. Lda-based document models for ad-hoc retrieval. in SIGIR, 2006, pp. 178-185.
  • [22] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV ’05: Proceedings of the Tenth IEEE International Conference on Computer Vision. Washington, DC, USA: IEEE Computer Society, 2005, pp. 1331-1338.
  • [23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, vol. 37, pp. 183–233, 1999.
  • [24] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Tech. Rep. 649, U.C. Berkeley, Dept. of Statistics, 2003.
  • [25] N. Hjort, C. Holmes, P. Muller, and S. Walker, Eds. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.
  • [26] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD ’08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577, New York, NY, USA, 2008. ACM.
  • [27] Yee Whye Teh. Dirichlet Process. In Encyclopedia of Machine Learning, eds. Claude Sammut and Geoffrey Webb. New York: Springer, 2010.
  • [28] H. Yu et al. High-quality binary protein interaction map of the yeast interactome network. Science, 322, 104–110, 2008.
  • [29] D. T. Chen et al. Proliferative genes dominate malignancy-risk gene signature in histologically-normal breast tissue. Breast Cancer Res. Tr., 119, 335–346.
  • [30] Y. Chen and D. Xu. Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics, 21, 575–581, 2005.
  • [31] J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007.
  • [32] Watts, D. J., Strogatz, S. H. Collective dynamics of “small-world” networks. Nature 393, 440-442, 1998.
  • [33] J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.
  • [34] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658–2663, 2004.
  • [35] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75 – 174, 2010.
  • [36] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22(8):888–905, 2000.
  • [37] M. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.