Local Partition in Rich Graphs

03/14/2018 · Scott Freitas, et al. · HUAWEI Technologies Co., Ltd.; Arizona State University

Local graph partitioning is a key graph mining tool that allows researchers to identify small groups of interrelated nodes (e.g., people) and their connecting edges (e.g., interactions). Because local graph partitioning focuses primarily on the network structure of the graph (vertices and edges), it often fails to consider the additional information contained in node attributes. In this paper we propose: (i) a scalable algorithm that improves local graph partitioning by taking into account both the network structure of the graph and the attribute data, and (ii) an application of the proposed local graph partitioning algorithm (AttriPart) to predicting the evolution of local communities (LocalForecasting). Experimental results show that our proposed AttriPart algorithm finds up to 1.6× denser local partitions, while running approximately 43× faster than traditional local partitioning techniques (PageRank-Nibble). In addition, our LocalForecasting algorithm shows a significant improvement over baseline methods in the number of nodes and edges correctly predicted.


1. Introduction

Motivation. With the rise of the big data era, an enormous amount of network data is being generated at an unprecedented rate across many disciplines. One of the critical challenges before us is the translation of this large-scale network data into meaningful information. A key task in this translation is the identification of local communities with respect to a given seed node (we use the terms local community and local partition interchangeably). In practical terms, the information discovered in these local communities can be utilized in a wide range of high-impact areas, from the micro (protein interaction networks (Bennett et al., 2014; Ahn et al., 2010)) to the macro (social (Tantipathananandh et al., 2007; Chen et al., 2009) and transportation networks).

Problem Overview. How can we quickly determine the local graph partition around a given seed node? This problem is traditionally solved using an algorithm like Nibble (Spielman and Teng, 2013), which identifies a small cluster in time proportional to the size of the cluster, or PageRank-Nibble (Andersen et al., 2006), which improves the running time and approximation ratio of Nibble with a smaller polylog time complexity. While both of these methods provide powerful techniques for the analysis of network structure, they fail to take into account the attribute information contained in many real-world graphs. Other techniques for finding improved rank vectors, such as attributed PageRank (Hsu et al., 2017), lack a generalized conductance metric for measuring cluster "goodness" that incorporates attribute information. In this paper, we propose a novel method that combines the network structure and attribute information contained in graphs to better identify local partitions using a generalized conductance metric.

Applications. Local graph partitioning plays a central role in many application scenarios. For example, a common problem in recommender systems is determining how a local community in a social media network will evolve over time. The proposed LocalForecasting algorithm can be used to determine the evolution of local communities, which can then assist in user recommendations. Another example utilizing social media networks is ego-centric network identification, where the goal is to identify the locally important neighbors relative to a given person. To this end, we can use our AttriPart algorithm to identify better ego-centric networks using the graph's network structure and attribute information. Finally, newly arrived nodes (i.e., cold-start nodes) often have few connections to their surrounding neighbors, making it difficult to ascertain their membership in various communities. The proposed LocalForecasting algorithm mitigates this problem by introducing additional attribute edges (link prediction), which can assist in determining which local partitions the cold-start nodes will belong to in the future.

Contributions. Our primary contributions are three-fold:

  • The formulation of a graph model and generalized conductance metric that incorporates both attribute and network structure edges.

  • The design and analysis of local clustering algorithm AttriPart and local community prediction algorithm LocalForecasting. Both algorithms utilize the proposed graph model, modified conductance metric and novel subgraph identification technique.

  • The evaluation of the proposed algorithms on three real-world datasets—demonstrating the ability to rapidly identify denser local partitions compared to traditional techniques.

Deployment. The local partitioning algorithm AttriPart is currently deployed to the PathFinder (Freitas et al., 2017) web platform (www.path-finder.io), with the goal of assisting users in mining local network connectivity from large networks. The design and deployment challenges were wide ranging, including (i) the integration of four different programming languages, (ii) obtaining real-time performance with low-cost hardware and (iii) the implementation of a visually appealing and easy-to-use interface. We note that the AttriPart algorithm, as deployed to the web platform, has performance nearly identical to the results presented in Section 4.

Figure 1. Close-up of the AttriPart algorithm on the PathFinder web platform.

This paper is organized as follows—Section 2 defines the problem of local partitioning in rich graphs; Section 3 introduces our proposed model and algorithms; Section 4 presents our experimental results on multiple real-world datasets; Section 5 reviews the related literature; and Section 6 concludes the paper.

2. Problem Definition

In this paper we consider three graphs: (1) an undirected, unweighted structure graph G, (2) an undirected, weighted attribute graph A and (3) an undirected, weighted combined graph consisting of both G and A. Each graph is defined over a set of vertices and a set of edges, and G, A and the combined graph contain the same number of vertices and edges by default. The degree of a vertex denotes its degree centrality. We use bold uppercase letters to denote matrices (e.g., G) and bold lowercase letters to denote vectors (e.g., v).

For ease of description, we define terms that are used interchangeably throughout the literature and this paper: (a) we refer to a network as a graph, (b) node is synonymous with vertex, (c) a local partition is also referred to as a local cluster, (d) seed node is equivalent to query vertex and start vertex, (e) the topological edges of the graph refer to its network structure, and (f) a rich graph is a graph with attributes on the nodes and/or edges.

Having outlined the notation, we define the problem of local partitioning in rich graphs as follows:

Problem 1. Local Partitioning in Rich Graphs

Given: (1) an undirected, unweighted graph, (2) a seed node and (3) attribute information for each node in the form of a k-dimensional attribute vector, with an attribute matrix collecting the attribute vectors of all nodes.

Output: a subset of vertices that best represents the local partition around the seed node.

  • network, attribute & combined graphs
  • number of nodes & edges in the network, attribute & combined graphs
  • number of edges in the combined graph after LocalForecasting
  • number of nodes & edges in the extracted subgraph
  • preference vector, seed node & target conductance
  • lazy random walk transition matrix
  • set of vertices representing the local partition
  • rank truncation and iteration thresholds
  • rank vector iterations; number of vertices to sweep
  • AttriPart & LocalProximity teleport values
  • subgraph relevance threshold & number of random walks
  • subgraph of the combined graph; walk count dictionary & list
  • mean and standard deviation of the walk count list
  • edge addition threshold
Table 1. Symbols and Definitions

3. Methodology

This section first describes the preliminaries for our proposed algorithms, including the graph model and modified conductance metric. Next, we introduce each proposed algorithm—(1) LocalProximity, (2) AttriPart and (3) LocalForecasting. Finally, we provide an analysis of the proposed algorithms in terms of effectiveness and efficiency.

3.1. Preliminaries

Graph Model.

The topological network G represents the network structure of the graph and is formally defined in Eq. (1). The attribute network A represents the attribute structure of the graph and is computed based on the similarity of every edge in G. To determine the similarity between two nodes, we use Jaccard similarity. A is formally defined in Eq. (2), where 0.05 is the default attribute similarity for an edge in G whose endpoints share no attributes. In addition, a similarity threshold governs the addition of edges not present in G. The combined network represents the combination of G and A and is formally defined in Eq. (3).

Figure 2. Example of the three graph models: (a) graph G is the network structure, with nodes and a corresponding attribute set given as input. (b) Graph A is the attribute network, with the same set of edges as G, each edge assigned a positive similarity weight. (c) The combined graph is a linear combination of each respective edge from G and A.

Formally, we define each of the three graph models in Eq. (1), Eq. (2) and Eq. (3). Figure 2 presents an illustrative example.

(1)
(2)
(3)
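The equation bodies above are placeholders; as a hedged sketch of the three definitions described in the text (the symbols W, sim(i, j) and λ are illustrative assumptions, not the paper's notation):

```latex
% Illustrative reconstruction only -- symbols are assumptions.
G(i,j) = \begin{cases}
  1 & \text{if } (i,j) \in E\\
  0 & \text{otherwise}
\end{cases}
\qquad
A(i,j) = \begin{cases}
  \mathrm{sim}(i,j) & \text{if } (i,j) \in E \text{ and } \mathrm{sim}(i,j) > 0\\
  0.05              & \text{if } (i,j) \in E \text{ and } \mathrm{sim}(i,j) = 0\\
  \mathrm{sim}(i,j) & \text{if } (i,j) \notin E \text{ and } \mathrm{sim}(i,j) > \lambda\\
  0                 & \text{otherwise}
\end{cases}
```

with the combined graph then formed edge-wise as a linear combination, e.g. $W(i,j) = G(i,j) + A(i,j)$, per Figure 2.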

Conductance

Conductance is a standard metric for determining how tight-knit a set of vertices is in a graph (Kannan et al., 2004). The traditional conductance metric is defined in Eq. (4), where S is the set of vertices representing the local partition. The lower the conductance value \phi(S), where 0 \le \phi(S) \le 1, the more likely S represents a good partition of the graph.

(4) \phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S),\ \mathrm{vol}(V \setminus S))}

where the cut is \mathrm{cut}(S) = |\{(u,v) \in E : u \in S,\ v \notin S\}| and the volume is \mathrm{vol}(S) = \sum_{u \in S} d(u).

This definition of conductance will serve as the benchmark to compare the results of our parallel conductance metric.
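For concreteness, the traditional metric can be computed directly from an adjacency list; a minimal sketch (function and variable names are our own):

```python
def conductance(adj, S):
    """Standard conductance: phi(S) = cut(S) / min(vol(S), vol(V \\ S)),
    for an undirected graph given as an adjacency-list dict."""
    S = set(S)
    # cut(S): edges with exactly one endpoint in S
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)        # vol(S) = sum of degrees in S
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)
```

For two triangles joined by a single bridge edge, one triangle has conductance 1/7: one crossing edge over a volume of 7.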

Parallel Conductance. We propose a parallel conductance metric that takes into account both the attribute and topological edges in the graph. Instead of simply adding up the cut of each boundary vertex, we want to determine whether each such vertex is more similar to the vertices inside the partition or outside of it, using the additional information provided by the attribute links. The new cut and conductance metrics are formally defined in Eq. (5) and Eq. (6), respectively.

(5)

By definition, the parallel cut can be split into its representative components in G and A. We also note a few key properties of the parallel cut metric below:

  1. means that the vertices in have connections of equal weighting between and .

  2. means that the vertices in have only a few strong connections to .

  3. means that the vertices in are more strongly connected to than .

Eq. (6) uses the cut as defined in Eq. (5) and the volume as defined above, with the modification that the volume is a sum of its components in G and A.

(6)

We note that the parallel conductance metric has a different scale than the traditional conductance metric; for example, a value of 0.3 under the traditional definition does not have the same meaning as a value of 0.3 under the parallel definition. We also bound the volume term, which allows us to reduce the computation.

Figure 3. A toy example calculating the parallel cut and conductance with local partition containing vertices . Parallel cut() = 1.05/2.1 = 0.5, parallel cut() = 0, parallel cut() = 1.05/2.2 = 0.477, parallel cut() = 0, parallel cut() = 0.5 + 0.477 = 0.977. Volume() = 12. Parallel conductance() = 0.977/12 = 0.0814.

3.2. Algorithms

We propose three algorithms in this subsection: (1) LocalProximity, (2) AttriPart and (3) LocalForecasting. First, we introduce the LocalProximity algorithm as a key building block for speeding up the AttriPart and LocalForecasting algorithms by finding a subgraph containing only the nodes and edges relevant to the given seed node. Based on LocalProximity, we further propose the AttriPart algorithm to find a local partition around a seed node by minimizing the parallel conductance metric. Finally, we propose the LocalForecasting algorithm, which builds upon AttriPart, to predict a local community's evolution.

LocalProximity. There are two primary purposes for the LocalProximity algorithm—(i) the requisite computations for the LocalForecasting algorithm require a pairwise similarity calculation of all nodes, which is intractable for large graphs due to the quadratic run time. To make this computation feasible, we use the LocalProximity algorithm to determine a small subgraph of relevant vertices around a given seed node . (ii) We experimentally found that the PageRank vector utilized in the AttriPart algorithm is significantly faster to compute after running the proposed LocalProximity algorithm.

Algorithm Details. The goal is to find a subgraph around the seed node that contains only the nodes and edges likely to be reached in repeated trials of random walk with restart. We base the importance of a vertex on the theory that random walks can measure the importance of nodes and edges in a graph (Dupont et al., 2017; Newman, 2005). We define node relevance as proportional to the number of trials in which a random walk with restart visits a vertex (nodes walked on more than once in a single walk still count once). Instead of using a simple threshold parameter to determine node/edge relevance as in (Dupont et al., 2017), we utilize the mean and standard deviation of the walk distribution, so that the results remain insensitive to the number of walks, provided it is sufficiently large. In conjunction with the mean and standard deviation, we introduce a relevance threshold parameter that determines the size of the resulting subgraph. See Section 3.3 for more details.

Algorithm Description. The LocalProximity algorithm takes a graph, a seed node, a teleport value, the number of walks to simulate and a relevance threshold, and returns a subgraph containing the vertices relevant to the seed node. The algorithm proceeds in three major steps:

  1. Compute the walk distribution around the seed node using random walk with restart (line 2). We omit the random walk subroutine due to space constraints; however, the technique is described above.

  2. Determine the number of vertices to include in the subgraph based on the relevance threshold parameter , mean of the walk distribution list and the standard deviation of the walk distribution list (lines 4-6).

  3. Create a subgraph based on the included vertices (line 8).
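The three steps above can be sketched as follows, assuming the inclusion rule keeps nodes whose walk count exceeds the mean plus a threshold multiple of the standard deviation (the paper's exact rule in Eq. (10) may differ); the walk length and all names are illustrative:

```python
import random
from collections import Counter

def local_proximity(adj, seed, alpha=0.15, num_walks=10000, theta=0.5, max_len=20):
    """Sketch of LocalProximity: run `num_walks` random walks with restart
    from `seed`, count each node at most once per walk, then keep nodes whose
    count exceeds mean + theta * std of the walk-count distribution."""
    counts = Counter()
    for _ in range(num_walks):
        visited, v = set(), seed
        for _ in range(max_len):
            visited.add(v)
            # restart with probability alpha (or when stuck at a dead end)
            v = seed if (random.random() < alpha or not adj[v]) else random.choice(adj[v])
        for u in visited:            # nodes walked more than once still count once
            counts[u] += 1
    walks = [counts[u] for u in adj]  # zero-walk nodes included, as in the paper
    mu = sum(walks) / len(walks)
    sigma = (sum((c - mu) ** 2 for c in walks) / len(walks)) ** 0.5
    keep = {u for u in adj if counts[u] > mu + theta * sigma}
    keep.add(seed)
    # induced subgraph on the relevant vertices
    return {u: [v for v in adj[u] if v in keep] for u in keep}
```

On a graph whose seed-side component is a triangle, the walks concentrate there and the returned subgraph excludes the unreachable nodes.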

Algorithm 1: LocalProximity

AttriPart. Armed with the LocalProximity algorithm, we further propose AttriPart, which takes into account the network structure and attribute information contained in the graph to find denser local partitions than can be found using the network structure alone. The foundation of this algorithm is based on (Spielman and Teng, 2013; Andersen et al., 2006; Zhukov, [n. d.]), with subtle modifications on lines 1, 4 and 9. These modifications incorporate the combined graph model, an approximate PageRank computation using the LocalProximity algorithm, and the parallel cut and conductance metrics. In addition, AttriPart does not depend on reaching a target conductance in order to return a local partition; instead, it returns the best local partition found while sweeping the sorted PageRank vector.

Algorithm Description. Given a graph, a seed node, a target conductance, a rank truncation threshold, the number of iterations to compute the rank vector, a teleport value, a rank iteration threshold and the number of nodes to sweep, AttriPart finds a local partition around the seed node. The algorithm can be viewed in five steps:

  1. Set the truncation and iteration parameters as seen in Eq. (7) and Eq. (9), respectively. We experimentally set the two remaining constants to 0.01. For additional detail on these parameters see (Spielman and Teng, 2013); for all other parameter values see Section 4.

  2. Run LocalProximity around the seed node in order to reduce the run time of the PageRank computations (line 1).

  3. Compute the PageRank vector using a lazy random walk transition matrix with personalized restart, with the preference vector placing all of the probability on the seed node. At each iteration, truncate a vertex's rank if its degree-normalized PageRank score is less than the truncation threshold (lines 2-7).

  4. Divide each vertex's entry in the PageRank vector by its corresponding degree centrality and order the rank vector in descending order (line 8).

  5. Sweep over the first vertices of the sorted PageRank vector, returning the best local partition found (lines 9-10). The sweep takes the re-ordered rank vector and creates a set of vertices by iterating through the rank vector one vertex at a time, each time adding the next vertex to the set and computing the parallel conductance.

(7)
(8)
(9)
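The sweep in step 5 can be sketched as follows; for brevity this sketch scores prefixes with the standard conductance rather than the paper's parallel conductance, but the mechanics (incremental cut and volume updates over the degree-normalized rank ordering) are the same. All names are illustrative:

```python
def sweep_cut(adj, rank, eta=None):
    """Sketch of the sweep: order vertices by degree-normalized rank, then
    return the prefix set with the lowest conductance, updating the cut and
    volume incrementally as each vertex joins the candidate set."""
    order = sorted((u for u in rank if rank[u] > 0),
                   key=lambda u: rank[u] / max(len(adj[u]), 1), reverse=True)
    if eta is not None:              # eta = number of vertices to sweep
        order = order[:eta]
    total_vol = sum(len(adj[u]) for u in adj)
    best_set, best_phi = set(), float("inf")
    S, vol_S, cut = set(), 0, 0
    for u in order:
        internal = sum(1 for v in adj[u] if v in S)
        cut += len(adj[u]) - 2 * internal  # u's crossing edges join; internal ones leave
        vol_S += len(adj[u])
        S.add(u)
        denom = min(vol_S, total_vol - vol_S)
        if denom > 0 and cut / denom < best_phi:
            best_set, best_phi = set(S), cut / denom
    return best_set, best_phi
```

On the two-triangle example, ranking the seed-side triangle highest makes the sweep recover it with conductance 1/7.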

Algorithm 2: AttriPart

LocalForecasting. As a natural application of the AttriPart algorithm, we introduce a method to predict how local communities will evolve over time. This method is based on the AttriPart algorithm with two significant modifications: (i) the required use of the LocalProximity algorithm to create a subgraph around the seed node and (ii) the use of the ExpandedNeighborhood algorithm to predict links between nodes in the subgraph. The idea behind using the ExpandedNeighborhood algorithm is that nodes are often missing many of the connections they will make in the future, which in turn affects their grouping into communities. To aid in predicting future edge connections, we use Jaccard similarity (Liben-Nowell and Kleinberg, 2007) to predict the likelihood of each vertex connecting to the others, with edges added if the similarity between two nodes is greater than the threshold.

Algorithm Description. Given a graph, a seed node, a target conductance, a rank truncation threshold, the number of iterations to compute the rank vector, a teleport value, a rank iteration threshold, a similarity threshold and the number of nodes to sweep, this algorithm finds a predicted local partition around the seed node. As LocalForecasting is similar to AttriPart, we highlight only its three primary steps:

  1. Determine the subgraph around the given seed node using the LocalProximity algorithm (line 1).

  2. Determine the pairwise similarity between all nodes in the subgraph using Jaccard similarity, adding edges that are above the similarity threshold (line 2).

  3. Run the AttriPart algorithm to find the predicted local partition around the seed node (line 3).
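Step 2, the Jaccard-based edge addition at the heart of ExpandedNeighborhood, might look like the following sketch (the threshold name `lam` and all other identifiers are our own):

```python
from itertools import combinations

def expanded_neighborhood(adj, attrs, lam=0.6):
    """Sketch of the ExpandedNeighborhood step: compute pairwise Jaccard
    similarity over node attribute sets and add an (undirected) edge wherever
    the similarity exceeds the threshold."""
    adj = {u: set(vs) for u, vs in adj.items()}
    for u, v in combinations(adj, 2):
        a, b = attrs[u], attrs[v]
        union = len(a | b)
        if union and len(a & b) / union > lam and v not in adj[u]:
            adj[u].add(v)  # predicted future edge
            adj[v].add(u)
    return {u: sorted(vs) for u, vs in adj.items()}
```

The nested pairwise loop is what makes this step quadratic in the subgraph size, which is why LocalProximity is run first.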

Algorithm 3: LocalForecasting

Algorithm 4: ExpandedNeighborhood

3.3. Analysis

Effectiveness

LocalProximity. The objective is to ensure that all relevant nodes in proximity to the seed node are included. We use the fact that many real-world graphs follow a scale-free distribution (Barabási and Albert, 1999; Faloutsos et al., 1999), with many nodes containing only a few links while a handful encompass the majority. In Figure 4, we found that after running many trials of random walk with restart, a scale-free-like distribution formed, with a large majority of the nodes accumulating a small number of 'hits' while a few nodes constituted the bulk.

Figure 4. Random walk with restart: distribution of node walk counts. Number of walks = 10,000, teleport value = 0.15; dataset: Wikipedia, start vertex: 'ewok', y-axis: right; dataset: Aminer, start vertex: 364298, y-axis: left. We omit nodes walked zero times from the plot; however, they are used in calculating the mean and standard deviation.

As the number of random walks is increased, the scale-free-like distribution is maintained, since each node is proportionally walked with the same distribution. We therefore need only some minimum number of walks, which we set to 10,000. We use this skewed scale-free-like distribution in combination with Eq. (10) below to ensure the extraction of relevant nodes in relation to a query vertex.

Mathematically, we define node relevance based on Eq. (10), where a dictionary stores the walk count of each vertex, i.e., the number of times the vertex is visited across all trials of the random walk with restart. From it we form a list of each node's walk count in the graph, the average number of times the nodes in the graph are walked and the standard deviation of those counts. In Section 4 we discuss threshold values that have been shown to be empirically effective.

(10)

After determining the relevant nodes, we create a subgraph from a portion of the long-tail curve, as defined by the relevance threshold in conjunction with the mean and standard deviation. The size of this subgraph increases nearly independently of the graph size (depending on the threshold). As seen in Figure 4, the number of nodes with a non-zero walk count converges nearly independently of graph size.

Efficiency

All algorithms use the same data structure for storing the graph information. If a compressed sparse row (CSR) format is used, the space complexity is linear in the number of nodes and edges. Alternatively, we note that with minor modification to the algorithms above we can use an adjacency list format with comparable space.
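For reference, a CSR encoding of an unweighted graph can be built in a few lines; a minimal sketch (names are our own):

```python
def to_csr(adj, order):
    """Sketch of a CSR encoding: indices[indptr[i]:indptr[i+1]] holds the
    neighbors of the i-th node in `order`. Space is linear in nodes + edges."""
    indptr, indices = [0], []
    pos = {u: i for i, u in enumerate(order)}      # node -> row index
    for u in order:
        indices.extend(pos[v] for v in adj[u])
        indptr.append(len(indices))
    return indptr, indices
```

The two flat arrays replace per-node neighbor lists, trading mutability for compactness and cache-friendly iteration.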

Lemma 3.1 (Time Complexity).

LocalProximity has a time complexity of while AttriPart has a time complexity of and LocalForecasting a time complexity of .

Proof.

LocalProximity: There are three major components to this algorithm: (1) random walks with walk length for a time complexity of (line 2). (2) Linear iteration through the number of nodes taking (lines 4-7). (3) Subgraph creation based on the number of included vertices with node set —requiring iteration through every edge of node for total edges. Iterating through every edge is linear in the number of edges for a time complexity of (line 8). This leads to a total time complexity of

AttriPart: There are six major steps to this algorithm: (1) calling LocalProximity which returns a subgraph containing nodes and edges for a time complexity of (line 1). (2) Creating a diagonal degree matrix by iterating through each node in with time complexity (line 2). (3) Creating the lazy random walk transition matrix , which requires from multiplying the corresponding matrix entries (line 3). (4) In lines 4-7 we iterate for iterations, with each iteration (i) updating the rank vector by multiplying the corresponding edges in the transition matrix , with the rank vector for a time complexity of and (ii) truncating every vertex with rank for a time complexity linear in the number of nodes in the rank vector . (5) Sort the rank vector which will be upper bounded by (line 8). (6) Compute the parallel conductance, which takes time (lines 9-10). Combining each step leads to a total time complexity of .

LocalForecasting: This algorithm has three major steps: (1) run the LocalProximity algorithm, which has a time complexity of . (2) Perform the ExpandedNeighborhood algorithm, which densifies by adding predicted edges for a total of edges in . This algorithm has a time complexity of due to the nested for loops. (3) Run the AttriPart algorithm, which has a time complexity of with the modification of to for the additional edges. This leads to an overall time complexity of . ∎

While AttriPart and LocalForecasting both scale quadratically with respect to the subgraph size, we note that in practice these algorithms are very fast, since the subgraph is small and scales nearly independently of graph size, as shown in Section 3.3.

4. Experiments

In this section, we demonstrate the effectiveness and efficiency of the proposed algorithms on three real-world network datasets of varying scale.

4.1. Experiment setup

Datasets. We evaluate the performance of the proposed algorithms on three datasets—(1) the Aminer co-authorship network (Zhang et al., 2017), (2) a Musician network mined from DBpedia and (3) a subset of Wikipedia entries in DBpedia containing both abstracts and links. All three networks are undirected with detailed information on each below:

  • Aminer. Each node represents an author, with each author containing a set of topic keywords, and each edge representing a co-authorship. To form the attribute network, we compute attribute edges based on the similarity between two authors for every network edge, using Jaccard similarity on the corresponding authors' topic sets.

  • Musician. Nodes represent musicians, with each musician containing a set of music genres, and an edge representing two musicians who have played in the same band. To form the attribute network, we compute attribute edges based on the similarity between two musicians for every network edge, using Jaccard similarity on the corresponding artists' music genre sets.

  • Wikipedia. Nodes represent an entity, place or concept from Wikipedia, which we jointly refer to as an item. Each item contains a set of defining keywords, with edges representing links between items. The dataset originates from DBpedia as a directed graph of links between Wikipedia entries. We modify the graph to be undirected for use with our algorithms, which we believe to be reasonable since each edge denotes a relationship between two items. In addition, this dataset uses only the portion of the Wikipedia entries in DBpedia containing both abstracts and links to other Wikipedia pages. To form the attribute network, we compute attribute edges based on the similarity between two items for every network edge, using Jaccard similarity on the corresponding items' keyword sets.
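The attribute-network construction shared by all three datasets (Jaccard similarity over keyword sets, with the 0.05 default for dissimilar endpoints described in Section 3.1) can be sketched as:

```python
def attribute_edges(edges, attrs, default=0.05):
    """Sketch: weight every structure edge by the Jaccard similarity of its
    endpoints' keyword sets, falling back to the default weight when the
    sets share nothing."""
    weights = {}
    for u, v in edges:
        a, b = attrs[u], attrs[v]
        union = len(a | b)
        sim = len(a & b) / union if union else 0.0
        weights[(u, v)] = sim if sim > 0 else default
    return weights
```

Only existing network edges are weighted here; predicted edges between non-adjacent nodes are handled separately by ExpandedNeighborhood.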

Category Network Nodes Edges
Aminer Co-Author 1,560,640 4,258,946
Musician Co-Musician 6,006 8,690
Wikipedia Link 237,588 1,130,846
Table 2. Network Statistics

Metrics. (1) To benchmark the LocalProximity algorithm's effectiveness and efficiency, we compare (i) the difference between local partitions created with and without the LocalProximity algorithm in AttriPart and (ii) the run time and the difference between the top 20 PageRank vector entries with and without the LocalProximity algorithm. (2) To benchmark the AttriPart algorithm's effectiveness and efficiency, we compare its triangle count, node count, local partition density and run time to PageRank-Nibble. Normally, PageRank-Nibble does not return a local partition if the target conductance is not met; however, we modify it to return the best local partition found even if the target conductance is not met. This modification allows for more comparable results to AttriPart. (3) To provide a baseline for the LocalForecasting algorithm's effectiveness, we compare its local partition results to AttriPart on two graphs missing 15% of their edges.

Repeatability. All data and source code used in this research will be made publicly available. The Aminer co-authorship network can be found on the Aminer website (https://aminer.org/data); the Musician and Wikipedia datasets used in the experiments will be released on the authors' website. All algorithms and experiments were implemented in Python and run in a Windows environment.

4.2. Effectiveness

LocalProximity. In Figure 5, parts (a)-(c), we see that the proposed LocalProximity algorithm significantly reduces the computational run time while maintaining high levels of accuracy across both metrics. Parts (a)-(b) demonstrate to what extent the accuracy of the results is dependent upon the parameter values. In particular, a low random walk teleport value and a high relevance threshold are critical to providing high-accuracy results.

In Figure 5 part (a), we measure accuracy as the number of vertices that differ between the local partitions computed with and without the LocalProximity algorithm in AttriPart. A small partition difference indicates that the LocalProximity algorithm finds a relevant subgraph around the given seed node and that the full graph is unnecessary for accurate results. In part (b), we define accuracy as the difference between the sets of top 20 entries in the PageRank vectors for the full graph and for the subgraph found by the LocalProximity algorithm. Overall, the results from part (b) correlate well with (a), showing that for low teleport values and high relevance thresholds there is a negligible difference between the results computed on the full graph and on the subgraph found using the LocalProximity algorithm.

(a) Y-axis represents the difference in vertices between the local partition calculated w/ and w/o the LocalProximity algorithm.
(b) Y-axis represents the # of vertices differing between the top 20 rank vector entries w/ and w/o the LocalProximity algorithm.
(c) Y-axis represents the difference in run time between the PageRank calculation w/ and w/o the LocalProximity algorithm.
Figure 5. Each data point averages 10 randomly sampled vertices in both the Aminer and Musician datasets. Default parameters (unless swept over): = 0.2, = 0.15, = 0.2, = 2, = 10,000, = 200. Parameter ranges: , and [0.1-0.7] in 0.1 intervals; [1-5] in 0.5 intervals.

AttriPart. In Figure 6, we see that AttriPart finds significantly denser local partitions than PageRank-Nibble, with local partition densities approximately 1.6×, 1.3× and 1.1× higher than PageRank-Nibble in the Aminer, Wikipedia and Musician datasets, respectively. Density is measured as a function of the number of edges and the number of nodes in the partition.

In Figure 6, we observe that the triangle count of the AttriPart algorithm is lower than that of PageRank-Nibble in the Musician and Aminer datasets. We attribute this to the fact that AttriPart finds smaller partitions (as measured by node count) and, therefore, there are fewer possible triangles. We also note that each triangle is counted three times, once for each node in the triangle. While no sweeps across algorithm parameters were performed, we believe that the gathered results provide an effective baseline for parameter selection.

(a) Scalability: Each data point represents the Aminer dataset in 1/10th intervals, with each point averaged over 3 randomly sampled vertices. Parameters: = 0.2, = 0.15, = 0.2, = 2, = 10,000, = 200.
(b)
Figure 6. Effectiveness: results are averaged over 20 and 100 randomly sampled vertices in the Aminer/Wikipedia and Musician datasets, respectively. Parameters: = 0.2, = 0.15, = 0.05, = 2, = 10,000, = 200.

LocalForecasting. In order to measure the effectiveness of the LocalForecasting algorithm, we set up the following experiment with three local partition calculations: (1) calculate the local partition using AttriPart, (2) calculate the local partition using AttriPart with 15% of the edges randomly removed from the graph and (3) calculate the local partition using the LocalForecasting algorithm with 15% of the edges randomly removed from the graph. We treat (1) as the baseline local community and test whether (3) finds better local partitions than (2). The idea behind randomly removing 15% of the edges is to simulate the evolution of the graph over time and to test whether the LocalForecasting algorithm can predict better local communities in the future. Ideally, we would have ground-truth local community data for a rich graph with time-series snapshots; in its absence, we use the above method.

In Figure 7, each data point is generated in three steps—(i) taking the difference between the set of vertices and edges in local partitions (1) and (3), (ii) taking the difference between the set of vertices and edges in local partitions (1) and (2) and (iii) by taking the difference between (ii) and (i). Step (i) tells us how far off the LocalForecasting algorithm is from the baseline, step (ii) tells us how far off the local partition would be from the baseline if no prediction techniques were used and step (iii) tells us the difference between the local partitions with and without the LocalForecasting algorithm (which is what we see graphed in Figure 7).

In Figure 7, we see that the local partition prediction accuracy, for both edges and vertices, is above the baseline in the Aminer dataset for a majority of edge similarity threshold values (). The best results were obtained at a threshold of 0.6, with an average of 1.4 vertices and 2.75 edges predicted over the baseline using the LocalForecasting algorithm. This number, while relatively small, is an average over 20 randomly sampled vertices, with one result reaching up to 14 vertices and 26 edges over baseline. In addition, the Musician dataset does not perform as well as the Aminer dataset, with most of the prediction results falling below the baseline (as indicated by the negative difference). We attribute this to the different nature of each dataset's network structure: the Musician dataset is significantly sparser (no giant connected component) than the Aminer dataset.

Figure 7. Each data point averages 20 randomly sampled vertices in the Aminer and Musician datasets. Default parameters (unless swept across): = 0.2, = 0.15, = 0.2, = 5, = 0.7, = 10,000, = 200. Parameter ranges: [0.1-0.9] in 0.1 intervals, [0.1-0.6] in 0.1 intervals.

4.3. Efficiency

For both the proposed and baseline algorithms, the efficiency results represent only the time taken to run the algorithm itself (i.e. excluding the time to load data into memory). LocalProximity. Across a majority of the parameters, computing the PageRank vector on the full graph takes approximately 450 seconds longer than computing it on the LocalProximity subgraph. AttriPart. In Figure 6, we see that the AttriPart algorithm finds local partitions 43× faster than PageRank-Nibble. LocalForecasting. This algorithm has an expected run time nearly identical to AttriPart; we therefore refer the reader to Figure 6 for run time results.

5. Related Work

We provide a high-level review of both local and global community detection methods, focusing on research that pertains to the algorithms proposed in this paper.

A - Local Community Detection. Given an undirected graph, a start vertex and a target conductance, the goal of Nibble is to find a subset of vertices whose conductance is less than the target conductance (Spielman and Teng, 2013). The algorithm has strong theoretical properties, with a run time bounded in terms of a user-defined constant, the target conductance and the number of edges. PageRank-Nibble builds on Nibble by introducing the use of personalized PageRank (Haveliwala, 2003; Tong et al., 2006), together with an algorithm for computing approximate PageRank vectors (Andersen et al., 2006). Since PageRank-Nibble and Nibble run on undirected graphs, they use truncated random walks to prevent the stationary distribution from becoming proportional to the degree centrality of each node (Grolmusz, 2015). There are also many alternative techniques for local community detection. To name a few, Bagrow and Bollt (Bagrow and Bollt, 2005) introduce a method of local community identification that utilizes an l-shell spreading outward from a start vertex; however, their algorithm requires knowledge of the entire graph and is therefore not truly local. J. Chen et al. (Chen et al., 2009) propose a method for local community identification in social networks that avoids hard-to-obtain parameters and improves the accuracy of identified communities by introducing a new metric. In addition, the work by (Zhou et al., 2017) and (Yin et al., 2017) introduces two methods of local community identification that take high-order network structure into account. In (Zhou et al., 2017), the authors provide mathematical guarantees of the optimality and scalability of their algorithm, along with its generalization to various network types (e.g. signed and multi-partite networks).
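The approximate personalized PageRank computation underlying PageRank-Nibble can be sketched as the well-known "push" procedure of Andersen et al. (2006), where residual mass below a truncation threshold is never propagated. The adjacency-list representation and the parameter values below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

# Sketch of the approximate personalized PageRank "push" routine
# (Andersen et al., 2006). `graph` is an adjacency-list dict for an
# undirected graph; alpha is the teleport probability and eps the
# truncation threshold (both illustrative values).

def approx_ppr(graph, seed, alpha=0.15, eps=1e-4):
    p = defaultdict(float)   # approximate PageRank vector
    r = defaultdict(float)   # residual mass still to be pushed
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r[u] < eps * deg:          # truncation: too little mass to push
            continue
        # Push: keep an alpha fraction, spread half the rest to neighbors.
        p[u] += alpha * r[u]
        share = (1 - alpha) * r[u] / (2 * deg)
        r[u] = (1 - alpha) * r[u] / 2
        for v in graph[u]:
            r[v] += share
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
        if r[u] >= eps * deg:
            queue.append(u)
    return p
```

PageRank-Nibble then sweeps over the vertices in decreasing order of degree-normalized PageRank to find a low-conductance cut; only the nonzero entries of `p` are touched, which is what makes the procedure local.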

B - Global Community Detection. The basic idea behind the Walktrap algorithm is that random walks on a graph tend to get "trapped" in densely connected parts that correspond to communities (Pons and Latapy, 2005). Utilizing the properties of random walks on graphs, the authors define a measure of structural similarity between vertices and between communities, creating a distance metric; the algorithm itself admits a polynomial upper bound on its run time. Another popular choice for global community detection is spectral analysis. In the paper by M. Newman (Newman, 2013) it is shown that community detection by modularity maximization, community detection by statistical inference and normalized-cut graph partitioning, when tackled using spectral methods, are in fact the same problem. The work by S. White et al. (White and Smyth, [n. d.]) attempts to find communities in graphs using spectral clustering, by taking an objective function for graph clustering (Newman and Girvan, 2004) and reformulating it as a spectral relaxation problem, for which they propose two algorithms. A systematic introduction to spectral clustering techniques can be found in (von Luxburg, 2007). Many alternative techniques for global community detection also exist. Among others, two techniques relevant to this work are (Jaewon yang, 2013) and (Takaffoli et al., 2014): in (Jaewon yang, 2013), the authors propose a community detection algorithm that uses the information in both the network structure and the node attributes, while in (Takaffoli et al., 2014) the authors use network feature extraction to predict the evolution of communities. A detailed review of various community detection algorithms can be found in (Zhao Yang, 2016).
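The random-walk distance at the heart of Walktrap can be sketched directly from its definition in Pons and Latapy (2005): two vertices are close if their t-step walk distributions land in similar places, with each coordinate down-weighted by the target vertex's degree. The dense-matrix formulation below is an illustrative sketch (real implementations work on sparse structures):

```python
import numpy as np

# Sketch of the Walktrap vertex distance (Pons & Latapy, 2005):
# r_ij = sqrt( sum_k (P^t[i,k] - P^t[j,k])^2 / d(k) ),
# where P is the random-walk transition matrix and d the degree vector.
# A is a symmetric 0/1 adjacency matrix; t is the walk length.

def walktrap_distance(A, i, j, t=3):
    d = A.sum(axis=1)                  # vertex degrees
    P = A / d[:, None]                 # row-stochastic transition matrix
    Pt = np.linalg.matrix_power(P, t)  # t-step walk distributions
    # Degree-weighted Euclidean distance between the two distributions.
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / d))
```

Walktrap then agglomeratively merges communities that minimize the increase of the mean squared distance, reusing the same metric at the community level.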

6. Conclusion

This paper proposes new algorithms for attributed graphs with the goals of (i) computing denser local graph partitions and (ii) predicting the evolution of local communities. We believe the proposed algorithms will be of particular interest to data mining researchers, given the computational speed-up and the denser local partitions identified. The proposed local partitioning algorithm AttriPart has already been deployed to the web platform PathFinder (www.path-finder.io) (Freitas et al., 2017), allowing users to interactively explore all three datasets presented in the paper. In addition, the source code and datasets will be made publicly available by the conference date.

References

  • Andersen et al. (2006) R. Andersen, F. Chung, and K. Lang. 2006. Local Graph Partitioning using PageRank Vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). 475–486. https://doi.org/10.1109/FOCS.2006.44
  • Bagrow and Bollt (2005) James P. Bagrow and Erik M. Bollt. 2005. Local method for detecting communities. Phys. Rev. E 72 (Oct 2005), 046108. Issue 4. https://doi.org/10.1103/PhysRevE.72.046108
  • Barabási and Albert (1999) Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (1999), 509–512. https://doi.org/10.1126/science.286.5439.509 arXiv:http://science.sciencemag.org/content/286/5439/509.full.pdf
  • Chen et al. (2009) J. Chen, O. Zaïane, and R. Goebel. 2009. Local Community Identification in Social Networks. In 2009 International Conference on Advances in Social Network Analysis and Mining. 237–242. https://doi.org/10.1109/ASONAM.2009.14
  • Dupont et al. (2017) Pierre Dupont, J Callut, G Dooms, J N. Monette, and Yves Deville. 2017. Relevant subgraph extraction from random walks in a graph. (12 2017).
  • Faloutsos et al. (1999) Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On Power-law Relationships of the Internet Topology. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM ’99). ACM, New York, NY, USA, 251–262. https://doi.org/10.1145/316188.316229
  • Freitas et al. (2017) Scott Freitas, Hanghang Tong, Nan Cao, and Yinglong Xia. 2017. Rapid Analysis of Network Connectivity. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM ’17). ACM, New York, NY, USA, 2463–2466. https://doi.org/10.1145/3132847.3133170
  • Grolmusz (2015) Vince Grolmusz. 2015. A Note on the PageRank of Undirected Graphs. Inf. Process. Lett. 115, 6 (June 2015), 633–634. https://doi.org/10.1016/j.ipl.2015.02.015
  • Haveliwala (2003) T. H. Haveliwala. 2003. Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search. IEEE Transactions on Knowledge and Data Engineering 15, 4 (July 2003), 784–796. https://doi.org/10.1109/TKDE.2003.1208999
  • Hsu et al. (2017) Chin-Chi Hsu, Yi-An Lai, Wen-Hao Chen, Ming-Han Feng, and Shou-De Lin. 2017. Unsupervised Ranking Using Graph Structures and Node Attributes. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, USA, 771–779. https://doi.org/10.1145/3018661.3018668
  • Jaewon yang (2013) Jaewon Yang, Julian McAuley, and Jure Leskovec. 2013. Community Detection in Networks with Node Attributes. ICDM (2013).
  • Kannan et al. (2004) Ravi Kannan, Santosh Vempala, and Adrian Vetta. 2004. On Clusterings: Good, Bad and Spectral. J. ACM 51, 3 (May 2004), 497–515. https://doi.org/10.1145/990308.990313
  • Laura Bennett (2014) Laura Bennett, Aristotelis Kittas, Songsong Liu, Lazaros G. Papageorgiou, and Sophia Tsoka. 2014. Community Structure Detection for Overlapping Modules through Mathematical Programming in Protein Interaction Networks. PLOS ONE (2014). https://doi.org/10.1371/journal.pone.0112821
  • Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The Link-prediction Problem for Social Networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019–1031. https://doi.org/10.1002/asi.v58:7
  • Newman (2013) Mark E. J. Newman. 2013. Spectral methods for network community detection and graph partitioning. CoRR abs/1307.7729 (2013).
  • Newman and Girvan (2004) M. E. J. Newman and M. Girvan. 2004. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004).
  • Newman (2005) M.E. J. Newman. 2005. A measure of betweenness centrality based on random walks. Social Networks 27, 1 (2005), 39 – 54. https://doi.org/10.1016/j.socnet.2004.11.009
  • Pons and Latapy (2005) Pascal Pons and Matthieu Latapy. 2005. Computing Communities in Large Networks Using Random Walks. In Proceedings of the 20th International Conference on Computer and Information Sciences (ISCIS’05). Springer-Verlag, Berlin, Heidelberg, 284–293. https://doi.org/10.1007/11569596_31
  • Spielman and Teng (2013) Daniel A. Spielman and Shang-Hua Teng. 2013. A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning. SIAM J. Comput. 42, 1 (2013), 1–26. https://doi.org/10.1137/080744888 arXiv:https://doi.org/10.1137/080744888
  • Takaffoli et al. (2014) M. Takaffoli, R. Rabbany, and O. R. Zaïane. 2014. Community evolution prediction in dynamic social networks. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). 9–16. https://doi.org/10.1109/ASONAM.2014.6921553
  • Tantipathananandh et al. (2007) Chayant Tantipathananandh, Tanya Berger-Wolf, and David Kempe. 2007. A Framework for Community Identification in Dynamic Social Networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07). ACM, New York, NY, USA, 717–726. https://doi.org/10.1145/1281192.1281269
  • Tong et al. (2006) Hanghang Tong, Jingrui He, Mingjing Li, Wei-Ying Ma, Hong-Jiang Zhang, and Changshui Zhang. 2006. Manifold-Ranking-Based Keyword Propagation for Image Retrieval. EURASIP Journal on Applied Signal Processing 2006 (2006), Article ID 79412, 10 pages. https://doi.org/10.1155/ASP/2006/79412
  • von Luxburg (2007) Ulrike von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (01 Dec 2007), 395–416. https://doi.org/10.1007/s11222-007-9033-z
  • White and Smyth ([n. d.]) Scott White and Padhraic Smyth. [n. d.]. A Spectral Clustering Approach To Finding Communities in Graphs. 274–285. https://doi.org/10.1137/1.9781611972757.25 arXiv:http://epubs.siam.org/doi/pdf/10.1137/1.9781611972757.25
  • Yin et al. (2017) Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. 2017. Local Higher-Order Graph Clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 555–564. https://doi.org/10.1145/3097983.3098069
  • Yong-Yeol Ahn (2010) Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. 2010. Link communities reveal multiscale complexity in networks. Nature (August 2010), 761–764. https://doi.org/10.1038/nature09182
  • Zhang et al. (2017) Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, Juanzi Li, Walter Luyten, and Marie-Francine Moens. 2017. Fast and Flexible Top-k Similarity Search on Large Networks. ACM Trans. Inf. Syst. 36, 2, Article 13 (Aug. 2017), 30 pages. https://doi.org/10.1145/3086695
  • Zhao Yang (2016) Zhao Yang, René Algesheimer, and Claudio J. Tessone. 2016. A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific Reports (2016). https://doi.org/10.1038/srep30750
  • Zhou et al. (2017) Dawei Zhou, Si Zhang, Mehmet Yigit Yildirim, Scott Alcorn, Hanghang Tong, Hasan Davulcu, and Jingrui He. 2017. A Local Algorithm for Structure-Preserving Graph Cut. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 655–664. https://doi.org/10.1145/3097983.3098015
  • Zhukov ([n. d.]) Leonid Zhukov. [n. d.]. Structural Analysis and Visualization of Networks. ([n. d.]). https://www.youtube.com/watch?v=jIS5pZ8doH8&list=PLriUvS7IljvkBLqU4nPOZtAkp7rgpxjg1&index=11