Finding densely connected structures in social networks, a.k.a. communities, has been extensively studied in past decades. Most prior research focuses on finding clusters in social networks (Fortunato, 2010; Lancichinetti and Fortunato, 2009; Ahn et al., 2010; Huang et al., 2014). However, some researchers (Fang et al., 2017) have argued that for location-aware applications like location-based event recommendation and market advertisement, each community of people should not only be socially connected but also be in close spatial proximity to each other. This is called the Co-located Community Detection (CCD) problem. One reason for the increased emphasis on CCD problems is data availability: the growing usage of mobile-based services offered by social media applications allows users to publish their real-time locations. Some researchers have considered spatial location attributes to discover co-located communities (Chen et al., 2015; Chen et al., 2018; Fang et al., 2017). In our own prior work (Desai, 2016; Weibel et al., 2017), we investigated the formation of HIV-related communities and determined that geographic proximity is a stronger predictor of community formation among users who tweet about HIV-related health issues than pure network proximity on Twitter.
To define a co-located community, we consider social and spatial cohesiveness constraints separately. Since a significant body of research already exists on Community Detection (CD) and Community Search (CS) in social networks (Newman, 2004; White and Smyth, 2005; Brandes et al., 2007; Cui et al., 2014; Sozio and Gionis, 2010; Cui et al., 2013; Huang et al., 2015), we focus primarily on the hardness of satisfying the spatial constraint aspect of the CCD problem.
To motivate why the hardness problem is interesting to investigate, we present two existing research approaches to the problem.
Fang et al. (2017) apply the spatial minimum covering circle (Elzinga and Hearn, 1972a, b) to ensure that each discovered cluster also maintains high spatial compactness. Hence, they require that the returned community with high social cohesiveness be covered geographically by the minimum covering circle with the smallest radius. However, this method cannot be applied to scenarios where users want to specify a distance threshold for members of a community, and it cannot provide a consistent distance constraint across different communities (see more in the Experiment section).
In this paper, we adopt the second strategy to provide a bounded spatial distance guarantee for each co-located cluster.
The Hardness Issue. To detect groups satisfying this spatial distance constraint, Zhang et al. (2017) and Chen et al. (2018) build a virtual spatial neighbourhood network in which there is an edge between any two users whose spatial distance is less than the distance threshold, and transform the problem into enumerating all maximal cliques in this graph, which is an NP-hard problem. When the geographical distribution of users has high locality in some region, the spatial neighbourhood network for this region can be very dense or even a complete graph, which makes their clique-enumeration algorithm impractical.
The problem that these two papers solve is to detect the single co-located community with the maximum number of users; hence, they develop pruning techniques to reduce the search space and the expected execution time. In this paper, however, we aim to solve a community detection problem in which all co-located communities should be returned, and in this case their strategy would be extremely inefficient and non-scalable. Chen et al. (2018) also provide a more efficient approximation algorithm; however, its approximation ratio to the optimal solution is not bounded by a constant.
Contributions. Our major contributions are listed as follows.
We clarify the hardness of the all-pair spatial constraint problem and provide a truly polynomial-time algorithm, together with several effective pruning rules, to solve it.
To further reduce the time complexity, we design a near-linear approximation algorithm with a constant performance guarantee ($\sqrt{2}$-bounded).
Based on the efficient spatial constraint checker, we design a uniform framework which decouples social and spatial constraints, so that users have great freedom to define social cohesiveness (e.g., $k$-truss or $k$-core) and to choose existing community detection algorithms.
2. Related Work
Even though spatial information is useful and important in many scenarios, to the best of our knowledge, very few prior works take the spatial constraint into account when finding communities. As Table 1 presents, these works can be classified into three categories based on their goals: 1) community detection (CD): finding all co-located communities; 2) community search (CS): finding personalized communities for query vertices; 3) MCM: finding the maximum co-located community with the largest number of members, which is neither CS nor CD. These works can also be categorized into three types based on methodology. The first technique is to define new community criteria by integrating both social and spatial information. Chen et al. (2015) modify the modularity function (Newman, 2004) by introducing a distance decay function and then provide a community detection algorithm based on a fast modularity maximisation algorithm. There are two main limitations of this technique. The first and most serious one is that it cannot provide a geographic distance bound guarantee for members of a community. Secondly, it couples social and spatial information, which is less flexible if users prefer other community detection techniques, e.g., $k$-core or $k$-truss.
| Method | Algo. | CD or CS | Distance Bound |
|---|---|---|---|
| Modify Community Criteria | Modified CNM (Chen et al., 2015) | CD | No guarantee |
| Computational Geometry | AppAcc (Fang et al., 2017) | CS | No guarantee |
| Reduction to Clique | AdvMax (Zhang et al., 2017) | CD | 1 |
| Reduction to Clique | EffiExact (Chen et al., 2018) | MCM | 1 |
| Reduction to Clique | Apx2 (Chen et al., 2018) | MCM | |
Different from the first technique, (Chen et al., 2018; Fang et al., 2017; Zhang et al., 2017) process the spatial constraint and social cohesiveness separately. Since CD and CS techniques are well researched, they adopt existing techniques such as $k$-core and $k$-truss for the social network constraints. Their main focus is to ensure the spatial cohesiveness of communities. Fang et al. (2017) provide a community search algorithm that returns a $k$-core such that a spatial circle with the smallest radius covers all community members. However, it cannot guarantee a consistent distance bound for different query vertices; the case study in the experiments section presents an example that demonstrates this.
Chen et al. (2018) and Zhang et al. (2017) define the spatial constraint in a way similar to ours. They guarantee that, for a returned community, the spatial distance between any two members is within a user-specified constant. In (Zhang et al., 2017), each returned community should be a $k$-core and satisfy pairwise similarity constraints, where the similarity can be distance. That work proves the NP-hardness of the problem by reducing an NP-complete problem, finding a $k$-clique, to it, and then provides a clique-based algorithm. Chen et al. (2018) apply the same methodology to the problem where similarity is defined as spatial distance. They provide the clique-based algorithm as a baseline and also provide a grid-based approximation algorithm. However, there are two major problems. First, they misstate the hardness of the problem. When the pairwise similarity constraint is defined as spatial distance, the problem in (Zhang et al., 2017) is no longer NP-hard and the $k$-clique decision problem can no longer be reduced to it; nevertheless, Chen et al. (2018) still apply an algorithm for enumerating maximal cliques to handle the spatial constraint. Second, the approximation ratio of their approximation algorithm depends on the grid size and the pairwise distance threshold. When the grid size is small, the approximation ratio is small, which is desirable, but more memory is required to store grid information; when the grid size is large, the approximation ratio is large.
In this section, we formally define our data model and problem, and present the framework for co-located community detection.
3.1. Problem Definition
Definition 0 (Geo-Social Network (GeoSN)).
A geo-social network (GeoSN) is a directed graph $G = (V, E)$ where each $v \in V$ denotes a user associated with a spatial location $(x_v, y_v)$, and $E$ maintains the relationship (e.g., friendship) among users.
Given a geo-social network, the objective of this paper is to find all communities that simultaneously satisfy the spatial cohesiveness constraint and the social connectivity constraint. We first introduce the definition of a maximal co-located community.
Definition 0 (Maximal Co-located Community (MCCs)).
Given a GeoSN $G = (V, E)$, a maximal co-located community is a subgraph $H \subseteq G$ satisfying three constraints:
Social Connectivity: $H$ should satisfy a user-specified social constraint over a graph property such as $k$-truss or $k$-core.
Spatial Cohesiveness: Let $\mathrm{dist}(u, v)$ denote the spatial (Euclidean) distance between two users $u$ and $v$. Given a distance threshold $d$, for any two vertices $u, v \in H$, it holds that $\mathrm{dist}(u, v) \le d$.
Maximality: There does not exist a subgraph $H' \supsetneq H$ of $G$ which satisfies the social connectivity and spatial cohesiveness constraints.
The following formally defines the -MCCs Detection problem and presents an example,
Definition 0 (-MCCs Detection).
Given a geo-social network , a distance threshold and social constraint, the problem is to find all maximal co-located communities.
Example 0 ().
Fig. 1 (a) presents a Geo-social network where users are denoted as circles and relationships are denoted as lines. Each user is associated with a location in space. Suppose that high social cohesiveness is defined as a minimum degree of at least 2, then there are two communities found in the GeoSN denoted as blue circles and orange circles respectively. Suppose that the distance threshold is set as 4 grids, then users can be divided into four groups based on their locations denoted as four shadow circles in Fig. 1 (b). Combining spatial and social information, there are two MCCs detected: and .
We first focus on the spatial constraints. To de-couple the spatial constraint from MCCs detection, we provide the definition of Spatial Co-located Cluster merely based on spatial constraints.
Definition 0 (Spatial Co-located Cluster).
A Spatial Co-located Cluster (SCC) is a subset of users $S \subseteq V$ satisfying two constraints:
All-Pair Co-located: Let $\mathrm{dist}(u, v)$ denote the spatial (Euclidean) distance between two users $u$ and $v$. Given a pre-specified distance threshold $d$, for any two users $u, v \in S$, $\mathrm{dist}(u, v) \le d$ holds.
Maximality: If $S$ is an SCC, there does not exist any user $w \in V \setminus S$ such that $\mathrm{dist}(w, v) \le d$ for all $v \in S$.
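The two SCC constraints can be checked directly from the definition. The following is a minimal brute-force sketch, not the detection algorithm developed later; the function names `is_all_pair_colocated` and `is_scc`, the symbol `d` for the distance threshold, and the representation of users as `(x, y)` tuples are assumptions made for illustration.

```python
import math
from itertools import combinations


def is_all_pair_colocated(cluster, d):
    """True if every pair of points in `cluster` is within distance d."""
    return all(math.dist(a, b) <= d for a, b in combinations(cluster, 2))


def is_scc(cluster, all_points, d):
    """Check both SCC constraints by brute force (quadratic, for illustration only)."""
    if not is_all_pair_colocated(cluster, d):
        return False
    members = set(cluster)
    for w in all_points:
        if w in members:
            continue
        # If an outside point is within d of every member, the cluster is not maximal.
        if all(math.dist(w, v) <= d for v in cluster):
            return False
    return True
```

For example, with points `(0, 0)`, `(1, 0)`, `(0, 1)` and threshold `1.5`, the three points together pass both checks, while the pair `(0, 0)`, `(1, 0)` fails maximality because `(0, 1)` can still be added.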
To detect SCCs in GeoSN, we provide an equivalent concept defined on space which is easier to detect,
Definition 0 (Global Maximal Set (GMS)).
Given a set of points $V$ in space and a distance threshold $d$, a global maximal set $S \subseteq V$ is a set of points covered by a circle with diameter $d$ such that no circle of the same size covers a proper superset of $S$. (For brevity, we use the notation $V$ to denote both the set of nodes (users) in the GeoSN and the set of points at their corresponding locations.)
Because of the maximality property of SCCs (or GMS), we have the following lemma,
Lemma 0 ().
Any maximal co-located community must be found in a global maximal set.
Based on this lemma, MCCs can be detected in three steps (Algorithm 1): (1) find all global maximal sets (line 1); (2) for each GMS, get the social subgraph of the GeoSN induced by this set of users and find all local MCCs in the subgraph based on the social constraint (lines 2 - 4); (3) find all MCCs by removing local MCCs which are subgraphs of any other MCC (FindGlobalMCC function). Note that in the first step (line 2), i.e., finding GMSs, the parameters of the social constraint are also passed to the spatial algorithm so that some simple pruning techniques can be implemented.
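The three steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 1: the GMSs are taken as given input, the social constraint is fixed to a $k$-core, the friendship relation is treated as an undirected adjacency dictionary, and local MCCs are taken to be the connected components of the $k$-core. All function names are hypothetical.

```python
def k_core(adj, nodes, k):
    """Peel vertices with fewer than k neighbours inside `nodes` until stable."""
    alive = set(nodes)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for u in adj.get(v, ()) if u in alive) < k:
                alive.discard(v)
                changed = True
    return alive


def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for u in adj.get(v, ()):
                if u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(frozenset(comp))
    return comps


def find_global_mcc(local_mccs):
    """Drop duplicates and any local MCC that is a proper subset of another."""
    uniq = set(local_mccs)
    return [c for c in uniq if not any(c < other for other in uniq)]


def detect_mccs(adj, gms_list, k):
    """Steps (2) and (3): local MCCs per GMS, then global filtering."""
    local = []
    for gms in gms_list:
        local.extend(components(adj, k_core(adj, gms, k)))
    return find_global_mcc(local)
```

Decoupling the steps this way means `k_core` could be swapped for another social constraint (for example a $k$-truss routine) without touching the spatial part.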
Example 0 ().
We still take the GeoSN in Fig. 1 (a) as an example to illustrate the procedures. The first step returns four GMSs detected as the shadow circles in (b) show. In each social subgraph induced by vertices in a GMS, detect the local MCCs based on social constraint, then we get three sets: , , and . By calling the function FindGlobalMCC, covered in the blue shadow circle is removed from MCCs. Thus, we detect two MCCs: and
4. Exact Spatial Algorithm
In this section, we give an exact algorithm with polynomial worst-case time complexity for detecting all global maximal sets in space. The basic idea is to transform the input space from a Cartesian coordinate system to a polar coordinate system, and then to invoke an angular sweep procedure for each node so that no GMS is missed.
4.1. Local Maximal Set
A global maximal set is defined via a covering circle of fixed radius. We now define a more restricted covering circle, the $p$-bounded circle, on which the definition of a local maximal set is based.
Definition 0 ($p$-bounded circle).
Given a point $p$, a circle that passes through $p$ is called a $p$-bounded circle, denoted as $C_p$.
For a given point $p$, take $p$ as the reference point and the $x$-axis as the reference direction to build a polar coordinate system. If the center of a $p$-bounded circle has coordinates $(r, \theta)$ in this polar coordinate system, we denote the circle as $C_p(r, \theta)$, where $\theta \in [0, 2\pi)$. (We use $C_p$ and $C_p(r, \theta)$ interchangeably when the context is clear.) We now define the $p$-local maximal set.
Definition 0 (-Local Maximal Set (LMS)).
Let $r = d/2$ where $d$ is the user-specified distance threshold. A $p$-local maximal set is a set of points $S$ covered by a circle $C_p(r, \theta)$ such that no circle $C_p(r, \theta')$ covers a proper superset of $S$. Denote the set of all $p$-LMSs for a fixed $p$ as $\mathcal{L}_p$.
We then have the following lemma showing the relationship between global and local maximal sets, which is the backbone of the exact spatial algorithm.
Lemma 0 ().
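Given a set of points $V$ in $\mathbb{R}^2$ and a distance threshold $d$, denote the set of all GMSs as $\mathcal{G}_d$ and the set of all $p$-LMSs as $\mathcal{L}_p$; then $\mathcal{G}_d \subseteq \bigcup_{p \in V} \mathcal{L}_p$.
Proof. As shown in Fig. 2, let all the small circles be a GMS $S$. By definition, there is a circle with radius $r = d/2$ covering it, shown as the large black circle centered at $o$. W.l.o.g., assume that $p \in S$ is the farthest point from $o$; then the dashed circle centered at $o$ with radius $\mathrm{dist}(o, p)$ still covers all points in $S$. Find a point $o'$ on the line through $p$ and $o$ such that $\mathrm{dist}(o', p) = r$ and $o$ lies between $o'$ and $p$, and consider the circle centered at $o'$ with radius $r$, shown as the grey circle. For any point $v \in S$, by the triangle inequality, $\mathrm{dist}(o', v) \le \mathrm{dist}(o', o) + \mathrm{dist}(o, v) \le \mathrm{dist}(o', o) + \mathrm{dist}(o, p) = r$. Thus all points in $S$ are covered by the $p$-bounded grey circle, and since $S$ is globally maximal, $S$ is a $p$-LMS. ∎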
4.2. Searching Local Maximal Sets
Lemma 3 shows that the problem of finding all GMSs can be solved by calculating -LMSs for every . To efficiently calculate for a given reference point , in this subsection, we introduce the Angular Sweep-based technique.
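Suppose that the circle $C_p(r, \theta)$ rotates counterclockwise, i.e., $\theta$ increases from $0$ to $2\pi$. For each point within distance $d$ from $p$, we consider two special events: the point first enters the circle, and the point leaves the circle. We call the angles at these two events the start angle and the end angle respectively. When $\theta$ lies between the start angle and the end angle, the circle always covers this point. Figure 4 illustrates this rotation process.
In Figure 4, the circles centered at $o_s$ and $o_e$ correspond to the two special events for point $v$. Denote the polar coordinates of $o_s$ and $o_e$ as $(r, \theta_s)$ and $(r, \theta_e)$ respectively, and the polar coordinates of $v$ as $(\rho_v, \varphi_v)$. Since $v$ lies inside or on $C_p(r, \theta)$ exactly when $\rho_v \le 2r\cos(\varphi_v - \theta)$, we calculate $\theta_s$ and $\theta_e$ using the equations:
\[ \theta_s = \varphi_v - \arccos\frac{\rho_v}{2r} \tag{1} \]
\[ \theta_e = \varphi_v + \arccos\frac{\rho_v}{2r} \tag{2} \]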
Given a reference point $p$ and a set of vertices where each vertex is within distance $d$ from $p$, Algorithm 2 outputs all $p$-local maximal sets. Lines 1 - 5 first calculate the start and end angles for the nodes via Eq. (1) and Eq. (2) and sort the angle intervals by start angle. Lines 6 - 19 present the angular sweep procedure (Fig. 4).
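The start and end angles follow from elementary geometry. A circle of radius $r$ through $p$ whose center is at polar angle $\theta$ contains a point at polar coordinates $(\rho, \varphi)$ relative to $p$ exactly when $\rho \le 2r\cos(\varphi - \theta)$, so the point is covered for $|\varphi - \theta| \le \arccos(\rho / 2r)$. The sketch below computes this interval; the function names, the tuple representation of points, and the choice to return angles in the range produced by `atan2` (so intervals may be negative or wrap around) are assumptions for illustration.

```python
import math


def angle_interval(p, q, r):
    """Return (start, end) angles during which the p-bounded circle of radius r covers q.

    Returns None if q is farther than 2r from p, and the full circle if q
    coincides with p.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    rho = math.hypot(dx, dy)
    if rho > 2 * r:
        return None
    if rho == 0:
        return (-math.pi, math.pi)
    phi = math.atan2(dy, dx)
    half_width = math.acos(rho / (2 * r))
    return (phi - half_width, phi + half_width)


def covers(p, q, r, theta, eps=1e-9):
    """Direct check: does the circle centered at angle theta (distance r from p) cover q?"""
    cx = p[0] + r * math.cos(theta)
    cy = p[1] + r * math.sin(theta)
    return math.hypot(q[0] - cx, q[1] - cy) <= r + eps
```

The helper `covers` is only there to cross-check the closed-form interval against the definition.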
Let the initial state of (shown as the black circle in Fig. 4) be at the place where it just passes the first node ( ) and let the candidate set which records the set of points currently enclosed by . Let keep track of the smallest end angle of points in . Keep rotating counter-clockwisely to next points and adding new points to until some points in will leave . More specifically, denote the next point that is going to reach as , when which means that some points are going to leave , then add to local maximal set. Rotate to reach , add to and remove points whose end angles are less than to form a new candidate set. Keep the above procedure until reach the last point. For example, in Fig. 4, add to step by step and then when is going to enclose , since , the current set should be a LMS. Then remove points from because it has left and add to . Keep rotating and generating LMS until the circle enclose the last point . There are three LMSs detected as the grey dashed circles enclose.
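As an illustration of what the sweep produces, the following sketch computes the local maximal sets for one reference point with a simplified, quadratic event-based variant rather than the sorted sweep of Algorithm 2. It relies on the observation that every arc is at most $\pi$ wide, so any set of points covered at a common angle is also covered at the smallest end angle in that set; it is therefore enough to inspect the covered set at each end angle and keep the maximal ones. The function name, the tolerance `eps`, and the tuple representation of points are assumptions.

```python
import math

TWO_PI = 2 * math.pi


def local_maximal_sets(p, neighbours, r, eps=1e-12):
    """Return the p-local maximal sets as frozensets of points (each includes p)."""
    arcs = []          # (start angle, end angle, point)
    always = {p}       # p itself and points coinciding with p are always covered
    for q in neighbours:
        rho = math.dist(p, q)
        if rho > 2 * r:
            continue   # cannot share a circle of radius r with p
        if rho == 0:
            always.add(q)
            continue
        phi = math.atan2(q[1] - p[1], q[0] - p[0])
        half = math.acos(rho / (2 * r))
        arcs.append((phi - half, phi + half, q))

    def covered_at(theta):
        # Circular containment test: theta lies within [s, e] modulo 2*pi.
        inside = (q for s, e, q in arcs
                  if (theta - s + eps) % TWO_PI <= (e - s) + 2 * eps)
        return frozenset(inside) | always

    # Candidate sets: what the circle covers just as each point is about to leave.
    candidates = {covered_at(e) for _, e, _ in arcs} or {frozenset(always)}
    return [c for c in candidates if not any(c < other for other in candidates)]
```

A production version would sort the intervals and maintain the candidate set incrementally, as the text describes, instead of recomputing it at every end angle.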
Complexity Analysis. Suppose that the input vertex set has a size , then Line 5 takes time by using a conventional sorting algorithm. For the angular sweep shown in lines 6-19, the update of candidate set (lines 13 to 15), which dominates the loop body, is executed in time. Thus, the total worst case time complexity of Algorithm 2 should be .
4.3. Search Global Maximal Sets
An LMS may not be a GMS as it might be a subset of another LMS with different reference node. However, Lemma 3 indicates that any GMS must exist in the set of all LMSs. Thus, by excluding any LMS which is a subset of another LMS, we obtain all GMSs.
The whole algorithm to find GMSs (i.e., Spatial Co-located Clusters) is presented in Algorithm 3. Note that for certain social constraints, e.g., $k$-core or $k$-truss, some simple pruning can be implemented to reduce the search space. Algorithm 3 uses $k$-core as an illustration. In the experiments, we implement both $k$-core and $k$-truss. For each node $p$, to reduce the search space, line 3 applies a range query to find the vertices within distance $d$ from the location of $p$, since any vertex outside this circle cannot be in a $p$-LMS. Since we need to find a $k$-core in the end, if the number of vertices lying in the circle is too small to contain a $k$-core, these LMSs cannot contain any MCC. Line 4 makes use of the social constraint to further reduce the search space. Line 5 invokes Algorithm 2 (LocalMaximalSet) to find all $p$-LMSs. After detecting all LMSs, the function FindGMS is invoked to add the LMSs which are not subsets of any other LMS to the set of GMSs.
Complexity Analysis. Assume that there are vertexes in GeoSN, i.e., , in the worst scenario, for each vertex , there are vertexes within distance to , and thus the worst time complexity for finding all -LMSs (line 5) would be as analyzed in last subsection. Thus, finding all LMSs would cost . There are LMSs in total since the number of -LMS is , thus function FindGMS will do set comparisons where each single comparison takes time . The total time complexity in the worst case should be . However, in practice, the spatial threshold is a small value. Assume that the points density is , and let , then it takes time to get -LMSs, the number of -LMSs would be and each set has points , so the time complexity would be .
5. Pruning and Optimization
In the last section, we introduced a polynomial exact solution for finding all GMSs (i.e., SCCs). However, the high time complexity of Algorithm 3 prevents it from scaling to large datasets. Thus, in this section, we propose several pruning strategies and optimizations for Algorithm 3, which are experimentally shown to reduce the running time by orders of magnitude on some datasets.
In Algorithm 3, in the worst case an LMS needs to be compared with all other LMSs to determine whether or not it is a GMS, which is extremely inefficient and is the dominant part of the time complexity. In this section, we develop pruning rules to dramatically reduce the number of set comparisons.
Pruning rule I: point-wise pruning. Given a $p$-LMS and a $q$-LMS ($q$ is a different point from $p$), a trivial observation is that if $\mathrm{dist}(p, q) > d$, neither of them can be a superset of the other and there is no need to perform an element-wise set comparison.
However, in many situations, even though $\mathrm{dist}(p, q)$ is smaller than $d$, it is very likely that a $p$-LMS can never cover a $q$-LMS, as Fig. 5 (b) shows. In the following, we seek a stronger pruning rule at the granularity of LMSs, so that we only need to check the elements of two LMSs when necessary.
Assume that there is a set of points $S$ and a circle $C$ with radius $r$ covering all points in $S$; we now consider where the center of $C$ can lie. Denote the center of $C$ as $o$. For any point $v \in S$, we have $\mathrm{dist}(o, v) \le r$. If we draw a circle with radius $r$ centered at each point in $S$, then $o$ must lie in the intersection of these circles. We relax these circles to their minimum bounding rectangles, so $o$ must lie in the intersection of these rectangles. The intersection rectangle is trivial to compute: instead of considering all points in $S$, we only need four values, $x_{\max}$, $x_{\min}$, $y_{\max}$ and $y_{\min}$, which are the maximal and minimal $x$ coordinates and $y$ coordinates of points in $S$ respectively. As Fig. 5 (a) shows, three points filled with grey decide the intersection rectangle. The dashed rectangle centered at the uppermost or rightmost point decides the bottom side or left side of the intersection rectangle respectively, while the one centered at the leftmost and also bottom-most point decides the right and upper sides. The rectangle is thus $[x_{\max} - r,\ x_{\min} + r] \times [y_{\max} - r,\ y_{\min} + r]$.
For two LMSs with different reference nodes, we consider the necessary condition for one set to cover the other. As Fig. 5 (b) shows, there are two bounded circles with the threshold $d$ as diameter, shown as the black and grey large circles, covering a $p$-LMS and a $q$-LMS respectively. For each LMS, we calculate the intersection rectangle, shown as the black and grey shaded areas respectively. Since these rectangles do not overlap, it is impossible to find a circle with diameter $d$ that covers all points in both sets, and thus neither of the two LMSs can cover the other. This gives a stricter pruning rule:
Pruning rule II: LMSs-wise pruning. Given an LMS $S$, we only need to do set comparisons between $S$ and those LMSs whose intersection rectangles intersect that of $S$.
Implementation. By applying the pruning rules, we re-implement the function FindGMS in Algorithm 3 as FindGMSPrune. As Algorithm 4 shows, for any point $p$, the point-wise pruning rule is applied first: nearby candidate points are found using a range query, and the LMSs whose reference nodes are among these candidates are gathered for comparison. To further reduce set comparisons, for each $p$-LMS $S$, the set-wise pruning rule is applied so that $S$ is not compared with any gathered LMS whose intersection rectangle does not intersect that of $S$.
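A minimal sketch of the two pruning rules is given below. It is not the paper's Algorithm 4: it scans all LMSs instead of using a range query, and the names `center_rect`, `rects_intersect` and `find_gms_prune`, as well as the input layout (a dictionary from reference point to a list of frozensets of points), are assumptions for illustration. The rectangle for a set is the region in which the center of a covering circle of radius $r = d/2$ must lie.

```python
import math


def center_rect(points, r):
    """Bounding rectangle (x_lo, x_hi, y_lo, y_hi) for the center of a covering circle."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - r, min(xs) + r, max(ys) - r, min(ys) + r)


def rects_intersect(a, b):
    """Axis-aligned rectangle overlap test."""
    return (max(a[0], b[0]) <= min(a[1], b[1])
            and max(a[2], b[2]) <= min(a[3], b[3]))


def find_gms_prune(lms_by_ref, d):
    """Keep the LMSs that are not proper subsets of another LMS."""
    r = d / 2.0
    items = [(ref, s, center_rect(s, r))
             for ref, sets in lms_by_ref.items() for s in sets]
    result = set()
    for ref_a, set_a, rect_a in items:
        dominated = False
        for ref_b, set_b, rect_b in items:
            if math.dist(ref_a, ref_b) > d:          # rule I: point-wise pruning
                continue
            if not rects_intersect(rect_a, rect_b):  # rule II: set-wise pruning
                continue
            if set_a < set_b:                        # element-wise comparison
                dominated = True
                break
        if not dominated:
            result.add(set_a)
    return result
```

Both rules are only filters placed in front of the element-wise comparison, so they cannot change the result, only the number of comparisons performed.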
6. Approximate Spatial Algorithm
In the last section, we proposed powerful pruning rules. Although they work well in practice, a more scalable algorithm is still desirable. In addition, in the exact algorithm, we can decide whether an LMS is global only after all LMSs are detected. However, in many scenarios, users expect to get GMSs in a more interactive way, i.e., some GMSs should be returned before all LMSs are detected. In this section, we show that if we relax the spatial constraint slightly, a much more efficient and interactive algorithm with a constant approximation ratio ($\sqrt{2}$) can be designed.
6.1. The Basic Intuition
In Fig. 6, assume that the small black points constitute a spatial co-located cluster. Then, by definition, there is a circle, shown as the large black circle, whose diameter equals the spatial distance threshold $d$ and which covers this cluster. We relax this circle to its minimum bounding rectangle, shown as the black rectangle in the figure; this rectangle must cover all points in the cluster. Similar to the definition of GMS, we define the Global Approximate Maximal Set (GAMS) based on rectangles.
Definition 0 (Global Approximate Maximal Set).
Given a set of points $S$ and a distance threshold $d$, $S$ is a global approximate maximal set if
there exists a rectangle with side length $d$ covering $S$;
there does not exist a rectangle with side length $d$ covering a set of points $S'$ such that $S \subsetneq S'$.
The rectangle covering $S$ is called a global maximal square.
We give a theoretical bound for using GAMSs to replace GMSs,
Theorem 2 (Sandwich Theorem).
For a distance threshold $d$, denote the set of all exact global maximal sets as $\mathcal{G}_d$ and the set of all GAMSs as $\mathcal{A}_d$. Then the following two properties hold:
For any set $S \in \mathcal{G}_d$, $\exists S' \in \mathcal{A}_d$ such that $S \subseteq S'$.
For any set $S' \in \mathcal{A}_d$, $\exists S'' \in \mathcal{G}_{\sqrt{2}d}$ such that $S' \subseteq S''$.
Fig. 6 illustrates this theorem. The first property is trivial. For the second property, let the black rectangle denote a global maximal square covering a GAMS; then its minimum bounding circle, shown as the black dashed circle, with radius $\sqrt{2}d/2$ must cover this GAMS. ∎
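The constant in the second property can be made explicit with a short worked computation. The symbols $d$ (distance threshold) and $S'$ (a set covered by a square of side $d$) are named here for illustration. The circumscribed circle of the square has the square's diagonal as its diameter, so any two covered points are at most one diagonal apart:

```latex
\[
  \text{diagonal} \;=\; \sqrt{d^{2} + d^{2}} \;=\; \sqrt{2}\,d ,
  \qquad
  \text{circumradius} \;=\; \frac{\sqrt{2}\,d}{2} ,
\]
\[
  \operatorname{dist}(u, v) \;\le\; \sqrt{2}\,d
  \quad \text{for all } u, v \in S' .
\]
```

In the other direction, a circle of diameter $d$ fits inside an axis-aligned square of side $d$, which is why every exact set is contained in some approximate one.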
Based on this theorem, the problem of detecting all SCCs can be approximated by finding all GAMSs with approximation ratio $\sqrt{2}$. Similar to the $p$-bounded circle and the $p$-LMS, we define the square with $p$-bounded left side (in shorthand, $p$-bounded square) and the $p$-Local Approximate Maximal Set ($p$-LAMS) as follows.
Definition 0 (Square with -Bounded Left Side).
Given a square with side length $d$, it is a square with $p$-bounded left side if the left side of this square passes through node $p$.
Definition 0 (Local Approximate Maximal Set).
Given a $p$-bounded square and the set of nodes $S$ covered by it, $S$ is a $p$-Local Approximate Maximal Set ($p$-LAMS) if and only if there does not exist a proper superset of $S$ covered by another $p$-bounded square. Denote the set of all $p$-LAMSs for a fixed $p$ as $\mathcal{A}^{\mathrm{loc}}_p$.
Similar to Lemma 3, we have the following lemma showing the relationship between global approximate maximal sets and local approximate maximal sets.
Lemma 0 ().
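Given a set of points $V$ in $\mathbb{R}^2$ and a distance threshold $d$, denote the set of all global approximate maximal sets as $\mathcal{A}_d$ and the set of all $p$-LAMSs as $\mathcal{A}^{\mathrm{loc}}_p$. It always holds that $\mathcal{A}_d \subseteq \bigcup_{p \in V} \mathcal{A}^{\mathrm{loc}}_p$.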
Based on Lemma 5, the problem of finding all GAMSs can be transformed to finding LAMSs as candidates and then generating GAMSs from the candidate set.
Algorithm 5 presents the whole procedure to detect all GAMSs interactively with a single scan of all nodes. Line 2 first sorts the points by $x$-coordinate. For each point, the algorithm generates all LAMSs and calls the function CheckGlobal to check whether each LAMS is a GAMS. The following explains the details of these two procedures.
Detecting LAMSs (lines 4-13). Fig. 7 illustrates the steps of finding LAMSs. For a point $p$ with coordinates $(x_p, y_p)$, the points that a $p$-bounded square can possibly cover lie in the rectangle $[x_p, x_p + d] \times [y_p - d, y_p + d]$, shown as the grey shaded rectangle. We generate all $p$-LAMSs by moving a square downwards within the shaded area. The points in the shaded area are sorted by $y$ coordinate, and the upper side of a $p$-bounded square is made to pass through the points one by one. Initially, the upper side of the square passes through the first point (line 6). We collect all points covered by this square (lines 12 - 13, which track the first point that has not yet been covered and the current LAMS candidate); this set is a LAMS, since the square cannot cover any more points as it moves downwards. We then move the square downwards so that its upper side passes through the next point, and there are three possible situations: if the last point in the shaded area has already been covered by a LAMS, terminate (line 9); if the new square does not cover any new node, ignore it (line 10); if the new square covers new points, collect all nodes covered by this square as a LAMS (lines 11-13). For example, in Fig. 7, when the square moves to pass through the second point, no new point is covered and the position is skipped, while at a later point new points are included and all the points covered form a LAMS. Since the last point has then been covered, the procedure terminates.
Finding GAMSs. Once a LAMS is found, it is easy to check whether it is a GAMS. For example, in Fig. 7, there is a $p$-LAMS covered by the red dashed rectangle. To check whether it is a GAMS, we only need to compare it with the $q$-LAMSs where $x_q \le x_p$, which have already been detected, since any $q$-LAMS where $x_q > x_p$ cannot contain point $p$. At most three points in the dashed rectangle need to be considered: the point with the maximum $x$ coordinate and the points with the minimum and maximum $y$ coordinates. If these three points are already in a common GAMS, then all points in the rectangle are in it, and this LAMS is discarded. Otherwise, it is a GAMS. The function CheckGlobal of Algorithm 5 implements this process; it maintains, for each point, the set of all GAMSs found so far that cover it. By determining whether the GAMS sets of these three points have a non-empty intersection, we can check whether the LAMS is a global one.
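The sliding-square procedure for one reference point can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 5: the function name, the symbol `d` for the side length, and the assumption that `p` itself appears in `points` are choices made here. Only top-side positions at or above `p` are tried, since the square must contain `p`.

```python
def local_approx_maximal_sets(p, points, d):
    """Return the p-LAMSs as frozensets, scanning candidates from top to bottom."""
    px, py = p
    # Candidates lie in the strip [px, px + d] x [py - d, py + d], sorted by descending y.
    cand = sorted(
        (q for q in points if px <= q[0] <= px + d and py - d <= q[1] <= py + d),
        key=lambda q: -q[1],
    )
    result = []
    nxt = 0  # index of the first candidate not yet covered by any square
    for i, top in enumerate(cand):
        if top[1] < py:
            break  # the square would no longer contain p
        if i > 0 and nxt == len(cand):
            break  # the last candidate is already covered
        start = nxt
        while nxt < len(cand) and cand[nxt][1] >= top[1] - d:
            nxt += 1
        if i == 0 or nxt > start:
            # The square covers new points, so cand[i:nxt] is a LAMS.
            result.append(frozenset(cand[i:nxt]))
    return result
```

Because `nxt` only moves forward, the scan after sorting touches each candidate a bounded number of times.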
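The three-point check can be sketched with a small index that records, for every point, the identifiers of the GAMSs covering it. The class name `GamsIndex` and its fields are hypothetical, and the sketch assumes LAMSs are submitted in ascending order of their reference point's $x$ coordinate, as the text describes.

```python
from collections import defaultdict


class GamsIndex:
    """Tracks accepted GAMSs and, per point, which GAMSs cover it."""

    def __init__(self):
        self.gams = []                     # accepted GAMSs, indexed by id
        self.covering = defaultdict(set)   # point -> ids of GAMSs covering it

    def check_global(self, lams):
        """Accept `lams` as a GAMS unless an existing GAMS covers its extreme points."""
        extremes = {
            max(lams, key=lambda q: q[0]),  # rightmost point
            min(lams, key=lambda q: q[1]),  # bottom-most point
            max(lams, key=lambda q: q[1]),  # top-most point
        }
        common = set.intersection(*(self.covering[q] for q in extremes))
        if common:
            return False                    # contained in an earlier GAMS
        gid = len(self.gams)
        self.gams.append(frozenset(lams))
        for q in lams:
            self.covering[q].add(gid)
        return True
```

Only the three per-point sets are intersected, regardless of how many points the LAMS contains.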
Complexity Analysis The average number of points in rectangle is . For each point , lines 7 to 13 take time to compute all -LAMSs. For function CheckGlobal, the dominate step is set intersections. Suppose there are GAMSs that may contain a point , i.e., the size of is and conducting a set intersection operation would take , then the total time complexity is . In the worst case, , however, since records only global approximate maximal sets found currently for instead of all LAMSs, is practically very small.
7. Experimental Studies
Our experiments contain three parts: we first test and compare the spatial algorithms which find all spatial co-located clusters (SCCs), then test the whole MCC framework to get all maximal co-located communities, and finally conduct case studies to compare our results with two state-of-the-art approaches. All of our algorithms are implemented in Java using JDK 11 and tested on an Ubuntu server with an Intel(R) Xeon(R) CPU X5675 @ 3.07GHz and 64 GB memory.
7.1. Spatial Algorithm Evaluation
Algorithms. We test four algorithms, as shown in Table 2. The clique-based algorithm from (Chen et al., 2018) does not solve exactly the same problem as ours; however, it can be easily adapted to detect all SCCs by enumerating all cliques of a spatial graph.
| Algorithm | Description |
|---|---|
| clique | clique-based algorithm (Chen et al., 2018) adapted to get all SCCs |
| exact+rule1 | Algorithm 3 with pruning rule 1 |
| exact | Algorithm 3 with pruning rules 1 and 2 |
| approx | approximation algorithm (Algorithm 5) |
Dataset. The experiments are conducted on three real-world datasets and two synthetic datasets. Table 3 presents the statistics of the spatial part of the three real geo-social networks. #Neighbors is defined as the number of people within 500 meters of a specific user, and we calculate the average and maximum of #Neighbors. The locality level is defined as the ratio between max. #Neighbors and avg. #Neighbors. For example, for the Weibo dataset, the max. #Neighbors is high while the avg. #Neighbors is low, so it has relatively high spatial locality.
|Brightkite (Cho et al., 2011)||51||1342||55.67||medium|
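The locality statistic defined above can be computed as in the following sketch. It is a brute-force illustration, not the procedure used to produce the table: the function name `locality_stats` is hypothetical, and the points are assumed to be planar coordinates in meters (real check-in data in latitude and longitude would first need a projection or a great-circle distance).

```python
import math


def locality_stats(points, radius=500.0):
    """Return (avg #Neighbors, max #Neighbors, max/avg ratio) for planar points."""
    counts = [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]
    avg = sum(counts) / len(counts)
    mx = max(counts)
    ratio = mx / avg if avg > 0 else float("inf")
    return avg, mx, ratio
```

For large datasets a spatial index or grid would replace the quadratic neighbor count.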
Parameter Settings Table. 4 shows the parameter settings for both synthetic real datasets. For the synthetic datasets, there are three parameters: the number of points , density and distance threshold . For real datasets, we consider two parameters: the percentage of users sampled from the original datasets and distance threshold . At each time, we vary one parameter, while other parameters are set to their underlined default values.
To test the scalability of our algorithms, we vary for synthetic datasets and for real datasets. The results are shown in Fig. 8(a)-8(e). The clique-based algorithm increases exponentially as the number of points increases on all datasets, which demonstrates the NP-hardness nature of the clique enumeration problem. Our exact algorithm significantly outperforms clique-based algorithm: 1) on synthetic datasets, it outperforms clique by one to two orders of magnitudes and clique can not terminate in 8,000 s for more than 500K data points while exact can return results in 100 seconds. 2) on real datasets, our algorithm outperforms clique especially for large-sized and high-locality dataset Weibo where clique can not return results in 8,000 s when sampling 40% data points. Our exact and approximation algorithms show strong scalability on synthetic datasets, since they show near-linear increase when the number of points increases. Notably, on three real datasets, the increase is faster than that on synthetic data since when the becomes larger, both the number of data points and the density increase.
7.1.2. Effect of the distance threshold
Fig. 8(f)-8(j) present the execution time when varying the distance threshold. For clique, the execution time increases dramatically as the threshold increases, e.g., it cannot return results for the larger threshold values on the Brightkite and Weibo datasets. For the exact algorithm, using both pruning rules takes much less time than using only one, and the gap widens as the threshold increases. The execution time of approx changes little with the threshold compared to the other algorithms. For the Weibo dataset, when the threshold is set to 700 or 1000, approx still returns results in a short time while the other three algorithms cannot terminate within 8,000 s.
We briefly give the reason here. For clique, when the distance threshold increases, the virtual spatial neighborhood network becomes more complex, so enumerating all maximal cliques becomes much more time-consuming. For the exact algorithm, as the threshold increases, both the number of LMSs and the number of points in each LMS increase, and the time spent on set comparisons between LMSs becomes the major bottleneck of the total execution time, as the time complexity of the exact algorithm indicates. Pruning rule 1 reduces the number of set comparisons by skipping any pair of LMSs whose reference nodes are farther apart than a threshold-dependent bound. When the threshold increases, the percentage of set comparisons pruned by this rule decreases, so the rule loses its pruning power. Pruning rule 2, however, remains very effective as the threshold grows, since it is a set-wise pruning method rather than a point-wise one.
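To make the cost of the clique baseline concrete, the sketch below builds a spatial neighborhood graph (an edge joins two points whose distance is at most the threshold) and enumerates its maximal cliques with the Bron-Kerbosch algorithm with pivoting. It is an illustration under stated assumptions, not the baseline implementation evaluated here: points are treated as planar coordinates with Euclidean distance, and the names `neighborhood_graph`, `maximal_cliques`, and the parameter `d` are hypothetical.

```python
import math
from itertools import combinations


def neighborhood_graph(points, d):
    """Adjacency sets: i and j are linked when their distance is at most d."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= d:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        # Pick the pivot with the most neighbors among the candidates in p.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
cliques = set(maximal_cliques(neighborhood_graph(pts, d=1.5)))
```

A larger `d` adds edges to the graph, which increases both the number and the size of the maximal cliques, in line with the behavior described above.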
7.1.3. Effect of data density
Fig. 9 shows the execution time w.r.t. different densities of the synthetic datasets, where the density is controlled by changing the variance when generating the data. As the density increases, the execution time of both clique and exact+rule1 increases quickly; with both pruning rules, however, the execution time of the exact algorithm grows much more slowly. The effectiveness of pruning rule 2 becomes more evident as the density increases. For Gaussian distributed datasets, which have higher locality than uniform data, pruning rule 2 saves more than 50% of the execution time of exact+rule1 when the density is set to 0.02. Density affects the exact algorithms for the same reason as the distance threshold does: both increase the number and the size of LMSs, which makes LMS comparisons more costly. The execution time of the approximation algorithm increases very slowly, since even at high density at most three set comparisons need to be conducted for each local approximate maximal set.
7.1.4. Effectiveness of pruning rules
As analyzed before, the bottleneck in the time complexity of the exact algorithms is the set comparisons between local maximal sets, and the two pruning rules save time by reducing set comparisons at different levels. As shown above, when the distance threshold or the density increases, pruning rule 2 becomes more effective in reducing execution time. To further demonstrate the effectiveness of the pruning rules, we record the number of set comparisons when using only the first pruning rule and when using both rules. Fig. 10 shows the results on the Gowalla and uniform synthetic datasets. Pruning rule 2 reduces the number of set comparisons by orders of magnitude, and it is observed to remove more set comparisons as the distance threshold or the density increases. When the distance threshold is set to 1 km on the Gowalla dataset, pruning rule 2 removes a large number of set comparisons compared with exact+rule1.
7.2. Framework Evaluation
The previous subsection presents the results for the spatial algorithm. This part demonstrates the efficiency and effectiveness of the whole framework for detecting all maximal co-located clusters, which is shown in Algorithm 1. The experiments are conducted on three real-world geo-social networks, and Table 5 shows the statistics of their social network information. Note that before running the algorithms, we clean the original datasets, e.g., by deleting all self-loop edges.
|Dataset||#Vertices||#Edges||Avg. Degree||Max Degree|
For the social constraint in the framework, we implement both k-core and k-truss. Due to space limitations, we only present the evaluation results of the framework based on k-core; the results based on k-truss show very similar performance. To make the framework more efficient, we adopt a simple pruning rule similar to the one used in (Chen et al., 2018). The pruning rule is based on the fact that an MCC must be a subset of a k-core (or k-truss). We therefore first generate all k-cores of the social network by applying the core decomposition algorithm, and then apply our framework within each k-core to get all MCCs.
Table 6 shows the settings of the two parameters: k (of the k-core) and the distance threshold.
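The k-core pruning step can be sketched as follows. This is a minimal illustration that assumes an undirected social graph stored as a dictionary of adjacency sets; the function names `k_core_vertices` and `connected_k_cores` are hypothetical, and the spatial algorithm that would then run inside each returned component is not shown.

```python
from collections import deque


def k_core_vertices(adj, k):
    """Vertices that survive repeated removal of vertices with degree < k."""
    deg = {v: len(ns) for v, ns in adj.items()}
    removed = set()
    queue = deque(v for v, dv in deg.items() if dv < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    return set(adj) - removed


def connected_k_cores(adj, k):
    """Connected components of the k-core; each one is a search region."""
    alive = k_core_vertices(adj, k)
    seen, components = set(), []
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            component.add(v)
            for u in adj[v]:
                if u in alive and u not in seen:
                    seen.add(u)
                    stack.append(u)
        components.append(component)
    return components


# Two triangles, one pendant vertex (3), and one isolated vertex (7).
social = {
    0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0},
    4: {5, 6}, 5: {4, 6}, 6: {4, 5}, 7: set(),
}
cores = sorted(sorted(c) for c in connected_k_cores(social, k=2))
```

Restricting the spatial search to each component keeps users who cannot satisfy the social constraint out of the more expensive spatial step.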
7.2.1. Effect of the distance threshold
Fig. 11(a)-11(c) show the total execution time w.r.t. the distance threshold. Since we apply the spatial algorithm within each k-core instead of over all data points, the execution time for detecting MCCs is much less than that of detecting all SCCs presented in the last subsection. Clique is the slowest algorithm on all datasets, and its execution time increases dramatically with the distance threshold. The exact algorithms are efficient on the Brightkite and Gowalla datasets, where the time does not increase much as the threshold increases. On the Weibo dataset, however, the time increases quickly with the threshold. A possible reason is that data points in Weibo have much higher degree, so for the default value of k there can be a k-core consisting of many data points. Applying the spatial algorithm in that core can still be time-consuming, and the change of time w.r.t. the distance threshold is similar to the spatial algorithm result shown in Fig. 8(j).
7.2.2. Effect of k
7.2.3. Correctness of approximation algorithm
The above results have already demonstrated the efficiency and scalability of our approximation algorithm. To further validate its correctness, for each SCC detected by the approximation algorithm we calculate the maximum pairwise distance as the cluster distance, and Fig. 12 presents the average and maximum cluster distance over all clusters. It shows that the cluster distance always stays within the approximation bound, and the average cluster distance is normally smaller than the distance threshold, which means that many clusters have a distance below the threshold.
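The cluster distance check described above amounts to the following computation. This is a small sketch rather than the evaluation code behind Fig. 12: it assumes planar coordinates with Euclidean distance, and the names `cluster_distance`, `summarize`, and the parameter `bound` are hypothetical.

```python
import math
from itertools import combinations


def cluster_distance(points):
    """Maximum pairwise distance within one cluster (0.0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)),
               default=0.0)


def summarize(clusters, bound):
    """Return (average, maximum, all_within_bound) over the cluster distances."""
    dists = [cluster_distance(c) for c in clusters]
    return sum(dists) / len(dists), max(dists), all(x <= bound for x in dists)


clusters = [
    [(0.0, 0.0), (3.0, 4.0)],              # cluster distance 5.0
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],  # cluster distance sqrt(2)
]
avg_d, max_d, ok = summarize(clusters, bound=5.0)
```

Here `bound` stands for whatever distance guarantee the approximation algorithm provides; the check simply confirms that no cluster exceeds it.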
7.3. Case Studies
We implement the algorithms in (Fang et al., 2017; Chen et al., 2018). Both papers define the problem differently from ours. (Fang et al., 2017) provides a community search algorithm in which the distance constraint is not defined in the same way as in our work, and (Chen et al., 2018) finds only the maximum MCC. We conduct two case studies on the Gowalla and Brightkite datasets using our approximation algorithm and compare the results with those of (Fang et al., 2017) and (Chen et al., 2018), respectively, to demonstrate the effectiveness of our problem definition and algorithm.
7.3.1. Bounded Spatial Distance Guarantee
We conduct the experiment on the Gowalla dataset with fixed values of k and the distance threshold. Fig. 13(a) shows all MCCs detected by our algorithm in a region. Each circle marks the location of an MCC center and its color indicates the number of cluster members. There are 20 MCCs in this region. We also present two communities, shown as small red circles in (b) and (c) respectively, retrieved by the community search algorithm of (Fang et al., 2017) with two different query users. In (b), the purple circle covers an MCC found by our algorithm. The method in (Fang et al., 2017) returns only a small subset of our MCC so that the covering circle has minimum radius. In (c), (Fang et al., 2017) returns a community whose minimum covering circle has a much larger diameter, and this community is not detected as an MCC by our algorithm. As Fig. 13 shows, the distance among users in any MCC found by our algorithm is within a user-specified distance. In contrast, (Fang et al., 2017) does not allow the user to specify the distance threshold, and the communities it returns do not share a consistent distance bound. For a query user who has many nearby friends, (Fang et al., 2017) may return only a small subset of them, whereas for a user who has no nearby friends, it still returns a cluster with large distances among its members.
7.3.2. Diverse MCCs
On the Brightkite dataset, with fixed values of the distance threshold and k, we detect 32 MCCs. We conduct a hierarchical cluster analysis on the 32 member sets, using the Jaccard distance to measure the distance between sets. As Fig. 14 shows, there are five clusters that do not share any common user, and there are 9 clusters when the distance is set to 0.6. The results indicate that many MCCs have diverse members. However, the problem in (Chen et al., 2018) finds only one maximum MCC and ignores all others, even though the other MCCs are equally meaningful and have members very different from those of the maximum MCC.
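The diversity analysis can be illustrated with the sketch below, which computes the Jaccard distance between member sets and groups the sets at a chosen cut. The linkage criterion is not stated in the text, so single linkage is an assumption of this sketch, and the names `jaccard_distance` and `single_linkage_groups` are hypothetical. Under single linkage, a cut just below 1.0 groups exactly those sets that are connected through shared members.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_linkage_groups(sets, cut):
    """Group set indices, merging any pair with Jaccard distance <= cut."""
    parent = list(range(len(sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard_distance(sets[i], sets[j]) <= cut:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical member sets, not taken from the Brightkite experiment.
mccs = [{1, 2, 3}, {1, 2, 3, 4}, {4, 9}, {20, 21}]
groups_06 = single_linkage_groups(mccs, cut=0.6)
groups_099 = single_linkage_groups(mccs, cut=0.99)
```

With these toy sets, a cut of 0.6 merges only the two heavily overlapping sets, while a cut of 0.99 also pulls in the set that shares a single member.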
In this paper, we investigate the MCC detection problem on large-scale geo-social networks. Unlike prior work that searches for an MCC for given query nodes or finds one maximum MCC, we solve a community detection problem that detects all communities satisfying both social and spatial cohesiveness constraints. To make our solution compatible with existing community detection techniques, we design a uniform framework so that existing techniques like k-core and k-truss decomposition can be easily plugged in. Besides generality and compatibility, our MCC detection framework improves efficiency thanks to our spatial constraint checking algorithms and several engineering-level optimizations. The effectiveness and efficiency of both the spatial algorithm and the whole MCC detection framework are demonstrated on three real-world datasets and two synthetic datasets with various parameter settings.
- Ahn et al. (2010) Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. 2010. Link communities reveal multiscale complexity in networks. Nature 466, 7307 (2010), 761.
- Brandes et al. (2007) Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2007. On modularity clustering. IEEE Trans. on Knowledge and Data Engg. 20, 2 (2007), 172–188.
- Chen et al. (2018) Lu Chen, Chengfei Liu, Rui Zhou, Jianxin Li, Xiaochun Yang, and Bin Wang. 2018. Maximum co-located community search in large scale social networks. Proc. of the VLDB Endowment 11, 10 (2018), 1233–1246.
- Chen et al. (2015) Yu Chen, Jun Xu, and Minzheng Xu. 2015. Finding community structure in spatially constrained complex networks. Int. J. of Geog. Info. Sc. 29, 6 (2015), 889–911.
- Cho et al. (2011) Eunjoon Cho, Seth A Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In Proc. of the 17th SIGKDD Conf. ACM, 1082–1090.
- Cui et al. (2013) Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, and Wei Wang. 2013. Online search of overlapping communities. In Proc. of the Int. Conf. on SIGMOD. ACM, 277–288.
- Cui et al. (2014) Wanyun Cui, Yanghua Xiao, Haixun Wang, and Wei Wang. 2014. Local search of communities in large graphs. In Proc. of the Int. SIGMOD Conf. ACM, 991–1002.
- Desai (2016) Purvi Jayesh Desai. 2016. PIRCNET: A Data Driven Approach to HIV Risk Analysis. Ph.D. Dissertation. Univ. of California San Diego.
- Elzinga and Hearn (1972a) D Jack Elzinga and Donald W Hearn. 1972a. The minimum covering sphere problem. Management science 19, 1 (1972), 96–104.
- Elzinga and Hearn (1972b) Jack Elzinga and Donald W Hearn. 1972b. Geometrical solutions for some minimax location problems. Transportation Science 6, 4 (1972), 379–394.
- Fang et al. (2017) Yixiang Fang, Reynold Cheng, Xiaodong Li, Siqiang Luo, and Jiafeng Hu. 2017. Effective community search over large spatial graphs. Proc. of the VLDB Endowment 10, 6 (2017), 709–720.
- Fortunato (2010) Santo Fortunato. 2010. Community detection in graphs. Physics Reports 486, 3-5 (2010), 75–174.
- Huang et al. (2014) Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. 2014. Querying k-truss community in large and dynamic graphs. In Proc. of the Int. Conf. on SIGMOD. ACM, 1311–1322.
- Huang et al. (2015) Xin Huang, Laks VS Lakshmanan, Jeffrey Xu Yu, and Hong Cheng. 2015. Approximate closest community search in networks. Proc. of the VLDB Endowment 9, 4 (2015), 276–287.
- Lancichinetti and Fortunato (2009) Andrea Lancichinetti and Santo Fortunato. 2009. Community detection algorithms: a comparative analysis. Physical Review E 80, 5 (2009), 056117.
- Newman (2004) Mark EJ Newman. 2004. Fast algorithm for detecting community structure in networks. Physical Review E 69, 6 (2004), 066133.
- Sozio and Gionis (2010) Mauro Sozio and Aristides Gionis. 2010. The community-search problem and how to plan a successful cocktail party. In Proc. of the 16th Int. SIGKDD Conf. ACM, 939–948.
- Weibel et al. (2017) Nadir Weibel, Purvi Desai, Lawrence Saul, Amarnath Gupta, and Susan Little. 2017. HIV risk on twitter: The ethical dimension of social media evidence-based prevention for vulnerable populations. In Proc. of the 50th Hawaii Int. Conf. on System Sciences.
- White and Smyth (2005) Scott White and Padhraic Smyth. 2005. A spectral clustering approach to finding communities in graphs. In Proc. of the Int. Conf. on Data Mining. SIAM, 274–285.
- Zhang et al. (2017) Fan Zhang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. 2017. When engagement meets similarity: efficient (k, r)-core computation on social networks. Proc. of the VLDB Endowment 10, 10 (2017), 998–1009.