Local Search for Group Closeness Maximization on Big Graphs

November 8, 2019, by Eugenio Angriman et al.

In network analysis and graph mining, closeness centrality is a popular measure to infer the importance of a vertex. Computing closeness efficiently for individual vertices has received considerable attention. The NP-hard problem of group closeness maximization, in turn, is more challenging: the objective is to find a vertex group that is central as a whole, and state-of-the-art heuristics for it do not scale to very big graphs yet. In this paper, we present new local search heuristics for group closeness maximization. By using randomized approximation techniques and dynamic data structures, our algorithms are often able to perform locally optimal decisions efficiently. The final result is a group with high (but not optimal) closeness centrality. We compare our algorithms to the current state-of-the-art greedy heuristic both on weighted and on unweighted real-world graphs. For graphs with hundreds of millions of edges, our local search algorithms take only around ten minutes, while greedy requires more than ten hours. Overall, our new algorithms are between one and two orders of magnitude faster, depending on the desired group size and solution quality. For example, on weighted graphs and k = 10, our algorithms yield solutions of 12.4% higher quality, while also being 793.6× faster. For unweighted graphs and k = 10, we achieve solutions within 99.4% of the state-of-the-art quality while being 127.8× faster.


I Introduction

Identifying important vertices in large networks is one of the main problems in network analysis [newman2018networks]. For this purpose, several centrality measures have been introduced over the past decades [boldi2014axioms]. Among them, one of the most widely-used measures is closeness [bavelas1948mathematical]. For a given vertex $u$, it is defined as the reciprocal of the average shortest-path distance from $u$ to all other vertices. The problem of identifying the vertices with highest closeness centrality has received significant attention [bergamini2019computing, okamoto2008ranking, olsen2014efficient]. In graph mining applications, however, it is often necessary to determine a group of vertices that is central as a whole – an orthogonal problem shown to be NP-hard [chen2016efficient]. One can view group closeness as a special case (on graphs) of the well-known $k$-median problem for facility location. Example applications include: (i) retailers that want to advertise their product via social media; they could select as promoters the group of members with highest centrality (i.e., influence over the other members) [zhu2014maximizing]; (ii) in P2P networks, shared resources could be placed on peers so that they are easily accessible by others [gkantsidis2004random]; (iii) in citation networks, group centrality measures can be employed as alternative indicators for the influence of journals or papers within their field [leydesdorff2007betweenness].

Everett and Borgatti [everett1999centrality] formalized the concept of centrality for groups of vertices; the closeness of a group $S$ is defined as the reciprocal of the average distance from $S$ to all other vertices of the graph. While exact algorithms to find a group with maximal group closeness are known (e.g., algorithms based on integer linear programming (ILP) [bergamini2018scaling]), they do not scale to graphs with more than a few thousand edges. Hence, in practice, heuristics are used to find groups with high group closeness on large real-world data sets. For example, Chen et al. [chen2016efficient] proposed a greedy algorithm that heuristically computes a group $S$ with high closeness $c(S)$. To obtain a group of size $k$, it performs $k$ iterations and adds in each iteration the vertex with the highest marginal contribution to $c(S)$. (While this greedy algorithm was claimed to have a bounded approximation quality, e.g., in [chen2016efficient, bergamini2018scaling], the proof of this bound relied on the assumption that $c(\cdot)$ is submodular. A recent update [DBLP:journals/corr/abs-1710-01144] to the conference version [bergamini2018scaling] revealed that, in fact, $c(\cdot)$ is not submodular. We are not aware of any approximation algorithm for group closeness that scales to large graphs.) It was shown empirically that this greedy algorithm yields solutions of very high quality (close to the optimum) [bergamini2018scaling] – at least for those small graphs where running a comparison against an exact algorithm is still feasible within a few hours. Due to greedy's promising quality, Bergamini et al. [bergamini2018scaling] proposed techniques to speed up the algorithm by pruning, e.g., by exploiting the supermodularity of farness, i.e., $f(S \cup \{v\}) - f(S) \le f(T \cup \{v\}) - f(T)$ for $S \subseteq T$. Pruning yields a significant acceleration (while retaining the same solution); however, graphs with hundreds of millions of edges still require several hours to complete. Indeed, pruning is most effective when the group is already large. When performing the first addition, however, the greedy algorithm has to perform one (pruned) single-source shortest-path (SSSP) computation for each vertex in the graph to compute its marginal contribution, and this phase scales super-linearly in the size of the graph.

Our Contribution

We present two new local search algorithms for group closeness: the first algorithm, Local-Swap, requires little time per iteration but can only exchange vertices locally. Our second algorithm, Grow-Shrink, is able to perform non-local vertex exchanges, but updating its data structures is more involved. The final result of both algorithms is heuristic, i. e., no approximation guarantee is known. Yet, each iteration of Local-Swap maximizes (in approximation) a lower bound on the objective function, while each iteration of Grow-Shrink locally maximizes the objective function itself. Despite these favorable properties, the time complexity of a single iteration of our algorithms matches the time complexity of a single evaluation of the objective function, i. e., for unweighted graphs, it is linear in the size of the graph.

Our experiments show that our best algorithm, extended Grow-Shrink, finds solutions with a closeness score close to that of a greedy solution, while being one to two orders of magnitude faster to compute. We see this algorithm as our main contribution for unweighted graphs. When quality is not a primary concern, our other algorithms can further accelerate the computation: for example, the non-extended variant of Grow-Shrink yields solutions for groups of size 10 whose quality is within 99.4% of the state of the art; in this case, it is 127.8× faster. The speedup is highest for small groups and decreases as the group size grows from 5 to 100.

On weighted graphs, our algorithms even improve both the quality and the running time compared to the state of the art, returning solutions of 12.4% higher quality at a speedup of 793.6× (for k = 10). Other trade-offs between quality and performance are possible and discussed in Section III.

II Local Search for Group Closeness

Let $G = (V, E)$ be an undirected connected graph with $n = |V|$ vertices and $m = |E|$ edges. We allow both unweighted and positively weighted graphs $G = (V, E, \omega)$. Subsets $S \subseteq V$ are called groups. The farness of any given group $S$ is defined as:

$$f(S) := \sum_{v \in V \setminus S} d(S, v).$$

Here, $d(S, v)$ refers to the minimal shortest-path distance from any $s \in S$ to $v$ in $G$. Furthermore, the (group) closeness of $S$ is defined as $c(S) := (n - |S|)/f(S)$, i.e., $c(S)$ is the reciprocal of the average distance of all vertices in $V \setminus S$ to the nearest vertex in $S$. Recall that determining the group that maximizes $c(S)$ over all groups with $|S| = k$ is known to be NP-hard [chen2016efficient]; we are not aware of any algorithm with a bounded approximation ratio.
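
To make these definitions concrete, the following minimal sketch (ours, not taken from the paper's implementation) evaluates the farness and group closeness of a given group on an unweighted graph with one multi-source BFS; all identifiers are illustrative, and the graph is assumed to be connected with $S \subsetneq V$.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Computes f(S) = sum of d(S, w) over w in V \ S and returns
// c(S) = (n - |S|) / f(S), via one BFS that starts from all of S.
double groupCloseness(const std::vector<std::vector<int>>& adj,
                      const std::vector<int>& group) {
    const int n = static_cast<int>(adj.size());
    std::vector<std::int64_t> dist(n, -1);
    std::queue<int> bfs;
    for (int s : group) {    // every group member is a source at distance zero
        dist[s] = 0;
        bfs.push(s);
    }
    std::int64_t farness = 0;
    while (!bfs.empty()) {
        const int u = bfs.front();
        bfs.pop();
        farness += dist[u];  // group members contribute zero
        for (int w : adj[u]) {
            if (dist[w] == -1) {
                dist[w] = dist[u] + 1;
                bfs.push(w);
            }
        }
    }
    const int k = static_cast<int>(group.size());
    return static_cast<double>(n - k) / static_cast<double>(farness);
}
```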

We consider the problem of improving the group closeness of a given set $S$ via local search. More precisely, we consider exchanges of vertices from $S$ and $V \setminus S$. Let $u$ be a vertex in $S$ and $v \in V \setminus S$. To simplify the presentation, we use the notation $S - u + v$ to denote the set that is constructed by exchanging $u$ and $v$. We also use the notations $S + v$ and $S - u$ to denote vertex additions and removals, respectively.

Note that, as our algorithms can only perform vertex exchanges, they require the construction of an initial set $S$ before the algorithms start. To avoid compromising our algorithms' running times, we cannot afford a superlinear initialization step. Thus, in all of our local search algorithms, we simply choose the initial $k$ vertices uniformly at random. For large graphs, this initialization can be expected to cover the graph reasonably well. Exploratory experiments revealed that other obvious initialization techniques (such as selecting the vertices with highest degree) did not improve the performance of the algorithm.

II-A Estimating the Quality of Vertex Exchanges

It is known that a simple greedy ascent algorithm yields results of good quality on real-world graphs [bergamini2018scaling]. This greedy algorithm starts with an empty set $S$ and iteratively adds vertices to $S$ that maximize $c(S)$. Depending on the input graph and the value of $k$, however, the greedy algorithm might need to evaluate the difference in $c(S)$ for a substantial number of vertices – this computation is rather expensive for large real-world graphs.

The algorithms in this paper aim to improve upon the running time of the greedy algorithm. We achieve this by considering only local vertices for exchanges, i.e., vertices that are already "near" $S$. It is clear that selecting only local vertices would decrease the quality of a greedy solution (as the greedy algorithm does not have the ability to eventually correct suboptimal choices). However, this is not necessarily true for our algorithms based on vertex exchanges in Sections II-B and II-C.

To make our notion of locality more concrete, let $D$ be the DAG constructed by running a SSSP algorithm (i.e., BFS or Dijkstra's algorithm) from the vertices in $S$. We remark that we work with the full SSSP DAG here (and not a SSSP tree). Here, the vertices of $S$ are all considered as sources of the SSSP algorithm, i.e., they are at distance zero. Furthermore, define the marginal gain of a vertex $v \in V \setminus S$ as

$$\delta(v) := f(S) - f(S + v). \qquad (1)$$

To compute the greedy solution, it seems to be necessary to compute $\delta(v)$ exactly for a substantial number of vertices $v$. (The techniques of [bergamini2018scaling] can avoid some of the computations; nevertheless, many evaluations of $\delta(\cdot)$ still have to be performed.) As discussed above, this seems to be impractical for large graphs. However, a lower bound for $\delta(v)$ can be computed from the shortest-path DAG $D$. To this end, let $R(v)$ be the set of vertices reachable from $v$ in $D$.

Lemma 1.

It holds that:

$$\delta(v) \ge d(S, v) \cdot |R(v)|.$$

In the unweighted case, equality holds if $v$ is a neighbor of $S$.

This lemma can be proven from the definition of $f$. A formal proof can be found in Appendix A.

The bound of Lemma 1 will be used in the two algorithms in Sections II-B and II-C. Instead of picking vertices that maximize $\delta(v)$, those algorithms pick vertices that maximize the right-hand side of Lemma 1, i.e., $d(S, v) \cdot |R(v)|$. The bound is local in the sense that it is more accurate for vertices near $S$: in particular, the reachability sets of vertices near $S$ are larger in $D$ than those of vertices far away from $S$, as $D$ does not contain back edges. Unfortunately, computing the size of $R(v)$ exactly for all $v$ still seems to be prohibitively expensive: indeed, computing the sizes $|R(v)|$ for all $v$ amounts to computing the size of the transitive closure of the DAG, and the fastest known exact algorithm for this relies on iterated (Boolean) matrix multiplication (hence, it has a complexity of $O(n^{\omega})$ with $\omega < 2.38$ [le2014powers]). However, it turns out that randomized algorithms can be used to approximate the sizes of $R(v)$ for all $v$ at the same time. We employ the randomized algorithm of Cohen [cohen1997size] for this task. In multiple iterations, this algorithm samples a random number for each vertex of the graph, accumulates in each vertex the minimal random number of any vertex reachable from it, and estimates $|R(v)|$ based on this information.
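
The following sketch illustrates this estimation scheme under our own simplifying assumptions (floating-point ranks and a plain method-of-moments inversion); the engineered variant with 16-bit integer ranks is described in Section II-D5. The DAG, its topological order, and all identifiers are inputs we assume to be given; none of the names come from the paper's code.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Cohen-style reachability-set-size estimation on a DAG: each sample draws
// a uniform rank per vertex and propagates minima against the topological
// order, so that minRank[v] ends up as the minimum rank over R(v).
std::vector<double> estimateReachability(const std::vector<std::vector<int>>& dag,
                                         const std::vector<int>& topo,
                                         int samples, std::mt19937& gen) {
    const int n = static_cast<int>(dag.size());
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> sumOfMinima(n, 0.0);
    for (int t = 0; t < samples; ++t) {
        std::vector<double> minRank(n);
        for (int v = 0; v < n; ++v)
            minRank[v] = unif(gen);            // fresh rank per vertex
        // Sweep in reverse topological order: successors are final before
        // their predecessors, so minima flow along reachability.
        for (auto it = topo.rbegin(); it != topo.rend(); ++it)
            for (int succ : dag[*it])
                minRank[*it] = std::min(minRank[*it], minRank[succ]);
        for (int v = 0; v < n; ++v)
            sumOfMinima[v] += minRank[v];
    }
    // For |R(v)| uniform ranks, the expected minimum is 1/(|R(v)| + 1);
    // inverting the empirical mean gives an (averaging) size estimate.
    std::vector<double> estimate(n);
    for (int v = 0; v < n; ++v)
        estimate[v] = static_cast<double>(samples) / sumOfMinima[v] - 1.0;
    return estimate;
}
```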

We remark that since Cohen's algorithm yields an approximation, but not a lower bound, for the right-hand side of Lemma 1, the inequality of the lemma can be violated in our algorithms; in particular, it can happen that our algorithms pick a vertex $v$ such that the resulting exchange would increase $f(S)$. In this case, instead of decreasing the closeness centrality of our current group, our algorithms terminate. Nevertheless, our experiments demonstrate that on real-world instances, a coarse approximation of the reachability set size is enough for Lemma 1 to yield useful candidates for vertex exchanges (see Section II-D5 and Appendix C-B).

II-B Local-Swaps Algorithm

1: repeat
2:     approximate $|R(v)|$ for all $v \in V \setminus S$
3:     select the swap $(u, v)$ that maximizes $d(S, v) \cdot |R(v)| - |D(u)|$
4:     $S \leftarrow S - u + v$
5:     run pruned BFS from $v$ to recompute $d(S, \cdot)$, $P(\cdot)$, $|D(\cdot)|$
6: until previous iteration did not decrease $f(S)$
Algorithm 1: Overview of the Local-Swaps Algorithm

Let us first focus on unweighted graphs. To construct a fast local search algorithm, a straightforward idea is to allow swaps between vertices in $S$ and their neighbors in $V \setminus S$. This procedure can be repeated until no swap can decrease $f(S)$. Let $u$ be a vertex of the group and let $v \in N(u) \setminus S$ be one of its neighbors outside of the group. To determine whether swapping $u$ and $v$ (i.e., replacing $u$ by $v$) is beneficial, we have to check whether $f(S - u + v) < f(S)$, i.e., whether the farness is decreased by the swap. The challenge here is to find a pair $(u, v)$ that satisfies this inequality (without checking all pairs exhaustively) and to compute the difference $f(S - u + v) - f(S)$ quickly. Note that a crucial ingredient that allows us to construct an efficient algorithm is that the distance of $S$ to every vertex can only change by at most one when doing a swap. Hence, we only have to count the number of vertices where the distance changes by $+1$ and the number of vertices where it changes by $-1$.

Our algorithm requires a few auxiliary data structures to compute $f(S - u + v) - f(S)$. In particular, we store the following:

  • the distance $d(S, w)$ from $S$ to all vertices $w \in V$,

  • a set $P(w)$ for each $w \in V$ that contains all vertices in $S$ that realize the shortest distance from $S$ to $w$, i.e., $P(w) := \{s \in S : d(s, w) = d(S, w)\}$,

  • the value $|D(u)|$ for each $u \in S$, where $D(u) := \{w \in V : P(w) = \{u\}\}$ is the set of vertices for which the shortest distance is realized exclusively by $u$.

Note that the sets $P(\cdot)$ consume $O(n \cdot k)$ space in total. However, since $k \ll n$, this can be afforded even for large real-world graphs. In our implementation, we store each $P(w)$ in only $k$ bits.

All of those auxiliary data structures can be maintained dynamically during the entire algorithm with little additional overhead. More precisely, after a $u$-$v$ swap is done, $v$ is added to all $P(w)$ satisfying $d(v, w) = d(S, w)$; for $w$ that satisfy $d(v, w) < d(S, w)$, the set $\{v\}$ replaces $P(w)$. $u$ can be removed from all $P(w)$ by a linear scan through all $w \in V$.
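
As an illustration, here is a sketch of this dynamic update, assuming that each $P(w)$ is stored as a bitmask over group slots and, for this sketch only, that $k \le 64$ so that one machine word per vertex suffices; all identifiers are ours, not the authors'.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Updates the realizer masks after a u-v swap, given the old d(S, .) and
// the freshly computed d(v, .). slotOfU/slotOfV are hypothetical indices
// of u and v within the group. Vertices whose mask becomes empty have no
// realizer left; they are fixed up by the BFS recomputation.
void updateRealizers(std::vector<std::uint64_t>& mask,
                     const std::vector<int>& distS,   // d(S, w) before the swap
                     const std::vector<int>& distV,   // d(v, w)
                     int slotOfU, int slotOfV) {
    for (std::size_t w = 0; w < mask.size(); ++w) {
        if (distV[w] < distS[w])
            mask[w] = std::uint64_t{1} << slotOfV;   // v alone realizes the new distance
        else if (distV[w] == distS[w])
            mask[w] |= std::uint64_t{1} << slotOfV;  // v joins the realizers
        mask[w] &= ~(std::uint64_t{1} << slotOfU);   // the linear scan removes u
    }
}
```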

Algorithm 1 states a high-level overview of the algorithm. In the following, we discuss how to pick a good swap (line 3 of the pseudocode) and how to update the data structures after a swap (line 5). The running time of the algorithm is dominated by the pruned BFS that recomputes $d(S, \cdot)$. Thus, it runs in $O(n + m)$ time for each update.

II-B1 Choosing a Good Swap

Because it would be too expensive to compute the exact difference in $f$ for each possible swap, we find the pair of vertices $u \in S$, $v \in N(u) \setminus S$ that maximizes $d(S, v) \cdot |R(v)| - |D(u)|$. Note that this value is a lower bound for the decrease of $f$ after swapping $u$ and $v$: in particular, Lemma 1 implies that $d(S, v) \cdot |R(v)|$ is a lower bound for the decrease in farness when adding $v$ to $S$. Additionally, $|D(u)|$ is an upper bound for the increase in farness incurred by removing $u$ from $S + v$: only vertices in $D(u)$ can have their distance increased, and (since $v \in N(u)$) each such distance increases by at most one. (Note, however, that this bound is not necessarily tight.) Thus, we can expect this strategy to yield pairs of vertices that lead to a decrease of $f$. To maximize the expression, for each candidate $v$, we compute the neighbor $u \in N(v) \cap S$ that minimizes $|D(u)|$ (in $O(\deg v)$ time). Afterwards, we can maximize over all candidates by a linear scan.
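
A sketch of this selection strategy is given below; reach[] holds the approximated reachability set sizes $|R(v)|$, exclusive[] holds $|D(u)|$, and all names are illustrative, not the authors' code.

```cpp
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Picks the swap (u, v) maximizing d(S, v) * |R(v)| - |D(u)| among pairs
// with u in S and v a neighbor of u outside S.
std::pair<int, int> pickSwap(const std::vector<std::vector<int>>& adj,
                             const std::vector<bool>& inGroup,
                             const std::vector<int>& distS,
                             const std::vector<double>& reach,
                             const std::vector<std::int64_t>& exclusive) {
    double bestScore = -std::numeric_limits<double>::infinity();
    std::pair<int, int> best{-1, -1};
    for (int v = 0; v < static_cast<int>(adj.size()); ++v) {
        if (inGroup[v]) continue;
        // For each candidate v, the cheapest group vertex to give up is an
        // adjacent u with minimal |D(u)|, found in O(deg(v)) time.
        std::int64_t minExclusive = std::numeric_limits<std::int64_t>::max();
        int bestU = -1;
        for (int u : adj[v]) {
            if (inGroup[u] && exclusive[u] < minExclusive) {
                minExclusive = exclusive[u];
                bestU = u;
            }
        }
        if (bestU < 0) continue;  // v is not adjacent to the group
        const double score = distS[v] * reach[v]
                             - static_cast<double>(minExclusive);
        if (score > bestScore) { bestScore = score; best = {bestU, v}; }
    }
    return best;  // {-1, -1} if no candidate swap exists
}
```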

II-B2 Computing the Difference in Farness

Instead of comparing distances, it is sufficient to define the sets of vertices whose distance to the group is increased or decreased (by 1) by the swap:

$$\mathrm{Inc}(u, v) := \{w \in V : d(S - u + v, w) = d(S, w) + 1\},$$
$$\mathrm{Dec}(u, v) := \{w \in V : d(S - u + v, w) = d(S, w) - 1\}.$$

As no distance changes by more than one, it holds that:

Lemma 2.

$$f(S - u + v) - f(S) = |\mathrm{Inc}(u, v)| - |\mathrm{Dec}(u, v)|.$$

Fortunately, computing $|\mathrm{Dec}(u, v)|$ is straightforward: this can be done by running a BFS rooted at $v$; the BFS simply counts those vertices $w$ for which $d(v, w) < d(S, w)$. Hence, to check this condition, we have to store the values of $d(S, w)$ for all $w \in V$. We remark that it is not necessary to run a full BFS: indeed, we can prune the search once $d(v, w) \ge d(S, w)$ (i.e., if the BFS is about to visit a vertex $w$ satisfying this condition, the search continues without visiting $w$). However, as we will see in the following, it makes sense to slightly relax the pruning condition and only prune the BFS if $d(v, w) > d(S, w)$; this allows us to update our auxiliary data structures on the fly.

$|\mathrm{Inc}(u, v)|$ can be computed from $|\mathrm{Dec}(u, v)|$ with the help of the auxiliary data structures. We note that $\mathrm{Inc}(u, v) \subseteq D(u)$, as only vertices $w$ where $d(S, w)$ is uniquely realized by $u$ (out of all vertices in the group) can have their distance from $S$ increased by the swap. As $\mathrm{Inc}(u, v)$ and $\mathrm{Dec}(u, v)$ are disjoint, we can further restrict this inclusion to $\mathrm{Inc}(u, v) \subseteq D(u) \setminus \mathrm{Dec}(u, v)$, but, in general, $D(u) \setminus \mathrm{Dec}(u, v)$ will consist of more vertices than just $\mathrm{Inc}(u, v)$. More precisely, $D(u)$ can be partitioned into $\mathrm{Inc}(u, v)$, $D(u) \cap \mathrm{Dec}(u, v)$ and $N_0(u, v)$, where $N_0(u, v)$ consists only of vertices whose distance is neither increased nor decreased by the swap. By construction, these three sets are pairwise disjoint. This proves that the following holds:

Lemma 3.

$$|\mathrm{Inc}(u, v)| = |D(u)| - |D(u) \cap \mathrm{Dec}(u, v)| - |N_0(u, v)|.$$

We note that $\mathrm{Dec}(u, v)$ (and also $N_0(u, v)$) is completely visited by our BFS. To determine $|D(u) \cap \mathrm{Dec}(u, v)|$, the BFS only has to count the vertices $w$ that satisfy $P(w) = \{u\}$ and $d(v, w) < d(S, w)$. On the other hand, to determine $|N_0(u, v)|$, it has to count the vertices satisfying $P(w) = \{u\}$ and $d(v, w) = d(S, w)$.
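
The following sketch shows how a single pruned BFS from the swap candidate $v$ can gather all three counts needed for Lemmas 2 and 3 under the relaxed pruning condition; the predicate soleRealizer (assumed to be true iff $P(w) = \{u\}$) and the other identifiers are our own.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// With relaxed pruning, the BFS visits exactly the w with d(v, w) <= d(S, w)
// and counts |Dec(u, v)|, |D(u) ∩ Dec(u, v)| and |N0(u, v)|.
struct SwapCounts {
    std::int64_t dec = 0;                 // |Dec(u, v)|
    std::int64_t decExclusive = 0;        // |D(u) ∩ Dec(u, v)|
    std::int64_t unchangedExclusive = 0;  // |N0(u, v)|
};

SwapCounts countSwapEffects(const std::vector<std::vector<int>>& adj,
                            const std::vector<int>& distS,
                            const std::vector<char>& soleRealizer, int v) {
    SwapCounts counts;
    std::vector<int> distV(adj.size(), -1);
    std::queue<int> bfs;
    distV[v] = 0;
    bfs.push(v);
    while (!bfs.empty()) {
        const int w = bfs.front();
        bfs.pop();
        if (distV[w] < distS[w]) {            // distance to the group drops by one
            ++counts.dec;
            if (soleRealizer[w]) ++counts.decExclusive;
        } else if (soleRealizer[w]) {         // here distV[w] == distS[w]
            ++counts.unchangedExclusive;      // w belongs to N0(u, v)
        }
        for (int x : adj[w]) {
            // Relaxed pruning: only expand vertices with d(v, x) <= d(S, x).
            if (distV[x] == -1 && distV[w] + 1 <= distS[x]) {
                distV[x] = distV[w] + 1;
                bfs.push(x);
            }
        }
    }
    return counts;
}
```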

II-C Grow-Shrink Algorithm

1: repeat
2:     approximate $|R(v)|$ for all $v \in V \setminus S$
3:     select $v \in V \setminus S$ maximizing $d(S, v) \cdot |R(v)|$
4:     $S \leftarrow S + v$
5:     run pruned BFS from $v$ to recompute $d_1$, $r_1$, $d_2$, $r_2$
6:     select $u \in S$ minimizing $f(S - u) - f(S)$
7:     $S \leftarrow S - u$
8:     run Dijkstra-like algo. to recompute $d_1$, $r_1$, $d_2$, $r_2$
9: until previous iteration did not decrease $f(S)$
Algorithm 2: Overview of the Grow-Shrink Algorithm

The main issue with the swapping algorithm from Section II-B is that it can only exchange a vertex $u \in S$ with one of its neighbors $v \in N(u)$. Due to this behavior, the algorithm might take a long time to converge to a local optimum. It also makes it hard to escape a local optimum: indeed, the algorithm will terminate if no swap with a neighbor can improve the closeness.

Our second algorithm lifts those limitations. It also allows $G$ to be a weighted graph. In particular, it allows vertex exchanges that change the distances from $S$ to the vertices in $V \setminus S$ by arbitrary amounts. Computing the exact differences in farness for all possible pairs of $u$ and $v$ seems to be impractical in this setting. Hence, we decompose the vertex exchange of $u$ and $v$ into two operations: the addition of $v$ to $S$ and the removal of $u$ from $S$. In particular, we allow the set to grow to a size of $k + 1$ before we shrink the size of $S$ back to $k$. Thus, the cardinality constraint is temporarily violated; eventually, the constraint is restored again. Fortunately, the individual differences $f(S) - f(S + v)$ and $f(S - u) - f(S)$ (or bounds for those differences) turn out to be efficiently computable for all possible $u$ and $v$, at least in approximation. We remark, however, that while this technique does find the vertex $v$ that maximizes (a bound on) $f(S) - f(S + v)$ and the vertex $u$ that minimizes $f((S + v) - u) - f(S + v)$, it does not necessarily find the pair of vertices maximizing $f(S) - f(S - u + v)$. Nevertheless, our experiments in Section III demonstrate that the solution quality of this algorithm is superior to the quality of the Local-Swaps algorithm.

In order to perform these computations, our algorithm maintains the following data structures:

  • the distance $d_1(w)$ of each vertex $w \in V$ to $S$, and a representative $r_1(w) \in S$ that realizes this distance, i.e., it holds that $d(r_1(w), w) = d_1(w) = d(S, w)$,

  • the second-smallest distance $d_2(w) := \min_{s \in S \setminus \{r_1(w)\}} d(s, w)$ from $w$ to $S$ and a representative $r_2(w) \in S$ for this distance (satisfying the analogous equality).

Since the graph is connected, these data structures are well-defined for all groups of size $\ge 2$. Furthermore, the difference between $d_2(w)$ and $d_1(w)$ yields exactly the contribution of $w$ to the increase in farness when $r_1(w)$ is removed from the group. Later, we will use this fact to quickly determine differences in farness.

We remark that it can happen that $d_1(w) = d_2(w)$; nevertheless, $r_1(w)$ and $r_2(w)$ are always distinct. Indeed, there can be two distinct vertices $s$ and $s'$ in $S$ that satisfy $d(s, w) = d(s', w) = d(S, w)$. With $W_1(u)$ and $W_2(u)$, we denote the sets of vertices $w$ with $r_1(w) = u$ and $r_2(w) = u$, respectively.

Algorithm 2 gives a high-level overview of the algorithm. In the following two subsections, we discuss the growing phase (lines 2–5) and the shrinking phase (lines 6–8) individually. The running time of Grow-Shrink is dominated by the Dijkstra-like algorithm (in line 8). Therefore, it runs in $O((n + m) \log n)$ time per update (when using an appropriate priority queue). The space complexity is $O(n)$ in addition to the graph itself.

II-C1 Vertex additions

When adding a vertex $v$ to $S$, we want to select $v$ such that $f(S + v)$ is minimized. Note that minimizing $f(S + v)$ is equivalent to maximizing the difference $f(S) - f(S + v)$. Instead of maximizing this difference, we maximize the lower bound $d(S, v) \cdot |R(v)|$ of Lemma 1. We perform a small number of iterations of the reachability set size approximation algorithm (see Section II-A) to select the vertex $v$ with (approximately) largest $d(S, v) \cdot |R(v)|$.

After $v$ is selected, we perform a BFS from $v$ to compute $f(S + v)$ exactly. As we only need to visit the vertices whose distance to $v$ is smaller than their second-smallest distance to $S$, this BFS can be pruned once a vertex $w$ is reached with $d(v, w) \ge d_2(w)$. During the BFS, the values of $d_1$, $r_1$, $d_2$, $r_2$ are updated to reflect the vertex addition: the only thing that can happen here is that $v$ realizes either of the new distances $d_1(w)$ or $d_2(w)$.
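
A sketch of this pruned update on an unweighted graph is shown below; the arrays d1, r1, d2, r2 mirror the data structures above, and the pruning condition follows the description in the text. This is our reading of the procedure, not the reference implementation.

```cpp
#include <queue>
#include <utility>
#include <vector>

// BFS from the newly added vertex v, pruned where v improves neither
// distance; v becomes the first or second representative where appropriate.
void addVertex(const std::vector<std::vector<int>>& adj,
               std::vector<int>& d1, std::vector<int>& r1,
               std::vector<int>& d2, std::vector<int>& r2, int v) {
    std::vector<char> visited(adj.size(), 0);
    std::queue<std::pair<int, int>> bfs;  // (vertex, d(v, vertex))
    bfs.push({v, 0});
    visited[v] = 1;
    while (!bfs.empty()) {
        auto [w, dist] = bfs.front();
        bfs.pop();
        if (dist < d1[w]) {          // v becomes the nearest group vertex of w
            d2[w] = d1[w]; r2[w] = r1[w];
            d1[w] = dist; r1[w] = v;
        } else if (dist < d2[w]) {   // v becomes the second-nearest
            d2[w] = dist; r2[w] = v;
        }
        for (int x : adj[w]) {
            // Prune: from d2(x) onward, v improves neither distance of x.
            if (!visited[x] && dist + 1 < d2[x]) {
                visited[x] = 1;
                bfs.push({x, dist + 1});
            }
        }
    }
}
```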

II-C2 Vertex removals

For vertex removals, we can efficiently calculate the exact increase in farness for all vertices $u \in S$, even without relying on approximation. In fact, $f(S - u) - f(S)$ is given as:

$$f(S - u) - f(S) = \sum_{w \,:\, r_1(w) = u} \big(d_2(w) - d_1(w)\big).$$

We need to compute $|S|$ such sums (i.e., one for each $u \in S$); but they can all be computed at the same time by a single linear scan through all vertices $w \in V$.
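
Since all removal gains follow from $d_1$, $d_2$ and $r_1$ alone, the single scan can be sketched as follows (slotOf is a hypothetical map from group vertex ids to indices $0, \dots, k-1$; the names are ours). The shrinking phase then simply removes the group vertex with minimal cost.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulates the exact increase f(S - u) - f(S) for every u in S in one
// linear scan: each vertex w falls back to its second representative if
// its first representative r1(w) leaves the group.
std::vector<std::int64_t> removalCosts(const std::vector<int>& d1,
                                       const std::vector<int>& d2,
                                       const std::vector<int>& r1,
                                       const std::vector<int>& slotOf, int k) {
    std::vector<std::int64_t> cost(k, 0);
    for (std::size_t w = 0; w < d1.size(); ++w)
        cost[slotOf[r1[w]]] += d2[w] - d1[w];
    return cost;
}
```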

Fig. 1: $u$, $s$ and $s'$ are vertices in $S$. Vertices within the solid regions belong to $W_1(u)$, $W_1(s)$ and $W_1(s')$, respectively. Vertices within the dashed regions belong to $W_2(s)$ and $W_2(s')$, respectively; the figure also marks the $d_1$- and $d_2$-boundaries between these regions. After removing $u$ from $S$, the vertices in $W_1(u)$ will have an invalid $d_1$ and $r_1$.

It is more challenging, however, to update $d_2$ and $r_2$ after removing a vertex $u$ from $S$. For vertices $w$ with an invalid $r_1$ (i.e., vertices in $W_1(u)$), we can simply update $d_1(w) \leftarrow d_2(w)$ and $r_1(w) \leftarrow r_2(w)$. This update invalidates $d_2(w)$ and $r_2(w)$. In the following, we treat $d_2(w)$ as infinite and $r_2(w)$ as undefined for all updated vertices $w$; eventually those expressions will be restored to valid values using the algorithm that we describe in the remainder of this section. Indeed, we now have to handle all vertices with an invalid $r_2$ (i.e., those in $W_1(u) \cup W_2(u)$). This computation is more involved. We run a Dijkstra-like algorithm (even in the unweighted case) to fix $d_2$ and $r_2$. The following definition yields the starting points for our Dijkstra-like algorithm.

Definition 1.

Let $x$ be any vertex and let $y \in N(x)$ be a neighbor of $x$ that needs to be updated.

  • We call $(x, y)$ a $d_1$-boundary pair for $y$ iff $r_1(x) \ne r_1(y)$. In this case, we set $b(x, y) := d_1(x) + \omega(x, y)$.

  • We call $(x, y)$ a $d_2$-boundary pair for $y$ iff $d_2(x) \ne \infty$ and $r_2(x) \ne r_1(y)$. In this case, we set $b(x, y) := d_2(x) + \omega(x, y)$.

In both cases, $b(x, y)$ is called the boundary distance of $(x, y)$. (For unweighted graphs, $\omega(x, y) = 1$.)

The definition is illustrated in Figure 1. Intuitively, boundary pairs define the boundary between regions of $G$ that have a valid $d_2$ (blue regions in Figure 1) and regions of the graph that have an invalid $d_2$ (orange region in Figure 1). The boundary distance corresponds to the value of $d_2(y)$ that a SSSP algorithm could propagate from $x$ to $y$. We need to distinguish $d_1$-boundary pairs and $d_2$-boundary pairs as the boundary distance can either be propagated on a shortest path from $r_1(x)$ over $x$ to $y$ (in case of a $d_1$-boundary pair) or on a shortest path from $r_2(x)$ over $x$ to $y$ (in case of a $d_2$-boundary pair).

Consider all $y$ such that there exists at least one ($d_1$- or $d_2$-)boundary pair $(x, y)$ for $y$. For each such $y$, let $(x, y)$ be the boundary pair with minimal boundary distance $b(x, y)$. Our algorithm first determines all such $y$ and updates $d_2(y) \leftarrow b(x, y)$. If $(x, y)$ is a $d_1$-boundary pair, we set $r_2(y) \leftarrow r_1(x)$; for $d_2$-boundary pairs, we set $r_2(y) \leftarrow r_2(x)$. After this initial update, we run a Dijkstra-like algorithm starting from these vertices for which a boundary pair exists. This algorithm treats $d_2$ as the distance. Compared to the usual Dijkstra algorithm, our algorithm needs the following modifications: for each vertex $y$, our algorithm only visits those neighbors $z$ that satisfy $r_1(z) \ne r_2(y)$. Furthermore, whenever such a visit results in an update of $d_2(z)$, we propagate $r_2(z) \leftarrow r_2(y)$. Note that these conditions imply that we never update $r_2(z)$ such that $r_2(z) = r_1(z)$.
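
The following is a hedged sketch of the repair phase as we understand it: boundary pairs seed a min-priority queue, and a Dijkstra-like loop then propagates second distances, pruning neighbors whose first representative coincides with the propagated second representative. The real implementation certainly maintains more state; all identifiers are ours.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using WeightedAdj = std::vector<std::vector<std::pair<int, double>>>;
constexpr double INF = std::numeric_limits<double>::infinity();

// Repairs (d2, r2) after a removal. needsUpdate marks vertices whose second
// distance was invalidated (their d2 is INF on entry); d1/r1 are valid.
void repairSecondDistances(const WeightedAdj& adj, const std::vector<double>& d1,
                           const std::vector<int>& r1, std::vector<double>& d2,
                           std::vector<int>& r2, std::vector<char>& needsUpdate) {
    using Entry = std::pair<double, int>;  // (tentative d2, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    // Seed from boundary pairs (x, y) with minimal boundary distance per y.
    for (int y = 0; y < static_cast<int>(adj.size()); ++y) {
        if (!needsUpdate[y]) continue;
        for (auto [x, w] : adj[y]) {
            if (r1[x] != r1[y] && d1[x] + w < d2[y]) {             // d1-pair
                d2[y] = d1[x] + w; r2[y] = r1[x];
            }
            if (d2[x] < INF && r2[x] != r1[y] && d2[x] + w < d2[y]) { // d2-pair
                d2[y] = d2[x] + w; r2[y] = r2[x];
            }
        }
        if (d2[y] < INF) pq.push({d2[y], y});
    }
    while (!pq.empty()) {             // Dijkstra-like propagation of d2
        auto [dist, y] = pq.top(); pq.pop();
        if (dist > d2[y]) continue;   // stale queue entry
        needsUpdate[y] = 0;
        for (auto [z, w] : adj[y]) {
            // Prune neighbors whose first representative equals r2(y).
            if (needsUpdate[z] && r1[z] != r2[y] && dist + w < d2[z]) {
                d2[z] = dist + w; r2[z] = r2[y];
                pq.push({d2[z], z});
            }
        }
    }
}
```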

Lemma 4.

After the Dijkstra-like algorithm terminates, $d_2$ and $r_2$ are correct.

A proof of this lemma can be found in Appendix A.

II-D Variants and Algorithmic Improvements

II-D1 Semi-local Swaps

One weakness of the algorithm in Section II-B is that it only performs local vertex exchanges. In particular, the algorithm always swaps a vertex $u \in S$ and a vertex $v \in N(u) \setminus S$. This condition can be generalized: in particular, it is sufficient that $v$ satisfies $d(S, v) = 1$, i.e., that $v$ is adjacent to some group vertex. In this situation, the distances of all vertices can still only change by a single hop and the proofs of the correctness of the algorithm remain valid. Note that this naturally partitions candidates into two sets: first, the set $N(u) \setminus S$ of candidates that the original algorithm considers, and the set of candidates adjacent to $S \setminus \{u\}$. Candidates in the latter set can be determined independently of $u$; indeed, they can be swapped with any $u \in S$. Hence, our swap selection strategy from Section II-B1 continues to work with little modification.

II-D2 Restricted Swaps

To further improve the performance of our Local-Swap algorithm at the cost of its solution quality, we consider the following variant: instead of selecting the pair of vertices that maximizes $d(S, v) \cdot |R(v)| - |D(u)|$, we just select the vertex $v$ that maximizes $d(S, v) \cdot |R(v)|$ and then choose $u \in N(v) \cap S$ such that $|D(u)|$ is minimized. This restricts the choices for $u$; hence, we expect this Restricted Local-Swap algorithm to yield solutions of worse quality. Due to the restriction, however, it is also expected to converge faster.

II-D3 Local Grow-Shrink

During exploratory experiments, it turned out that the Grow-Shrink algorithm sometimes overestimates the lower bound $d(S, v) \cdot |R(v)|$ on the decrease of the farness after adding an element $v$. This happens because errors in the approximation of $|R(v)|$ are amplified by the multiplication with a large $d(S, v)$. Hence, we found that restricting the algorithm's choices for $v$ to vertices near $S$ improves the solution quality of the algorithm.

It may seem that this modification makes Grow-Shrink vulnerable to the same weaknesses as Local-Swap. Namely, local choices imply that large numbers of exchanges might be required to reach local optima and it becomes hard to escape these local optima. Fortunately, additional techniques discussed in the next section can be used to avoid this problem.

II-D4 Extended Grow-Shrink

Even in the case of Grow-Shrink, the bound of Lemma 1 becomes worse for vertices at long distances from the current group. As detailed in Section II-A, this happens because our reachability set size approximation approach does not take back edges into account. This is a problem especially on graphs with a large diameter, where we have to expect that many back edges exist. We mitigate this problem (as well as the problems mentioned in Section II-D3) by allowing the group to grow by more than one vertex before we shrink it again. In particular, we allow the group to grow to size $k + g$ for some $g \ge 1$, before we shrink it back to size $k$.

In our experiments in Section III, we consider two strategies to choose $g$. First, we consider constant values for $g$. However, this is not expected to be appropriate for all graphs: specifically, we want to take the diameter of the graph into account. Hence, a more sophisticated strategy selects $g = \lceil \mathrm{diam}(G)/d \rceil$ for a fixed $d$. This strategy is inspired by mesh-like graphs (e.g., real-world road networks and some other infrastructure networks): if we divide a quadratic two-dimensional mesh into $d^2$ quadratic sub-meshes (where $d$ is a power of 2), the diameter of the sub-meshes is $\mathrm{diam}(G)/d$. Hence, if we assume that each vertex of the group covers an equal amount of vertices in the remaining graph, $g = \lceil \mathrm{diam}(G)/d \rceil$ vertex additions should be sufficient to find at least one good vertex that will improve a size-$k$ group (e.g., a diameter of 40 and $d = 8$ yield $g = 5$). As we expect that real-world networks deviate from ideal two-dimensional meshes to some extent, we consider not only this idealized choice but also other values of $d$.

II-D5 Engineering the reachability set size approximation algorithm

Cohen’s reachability set size approximation algorithm [cohen1997size]

has multiple parameters that need to be chosen appropriately: in particular, there is a choice of probability distribution (exponential vs. uniform), the estimator function (averaging vs. selection-based), the number of samples and the width of each random number. For the estimator, we use an averaging estimator, as this estimator can be implemented more efficiently than a selection-based estimator (i. e., it only requires averaging numbers instead of finding the

-smallest number). We performed exploratory experiments to determine a good configuration of the remaining parameters. It turns out that, while the exponential distribution empirically offers better accuracy than the uniform distribution, the algorithm can be implemented much more efficiently using the uniform distribution: in particular, for the uniform distribution, it is sufficient to generate and store the per-vertex random numbers as (unsigned) integers, while the exponential distribution requires floating point calculations. We deal with the decrease in accuracy by simply gathering more samples. For the uniform distribution and real-world graphs, 16 bits per integer turns out to yield sufficient accuracy. In this setting, we found that 16 samples are enough to accurately find the vertex with highest reachability set size. In particular, while the theoretical guarantee in 

[cohen1997size] requires the number of samples to grow with , we found this number to have a negligible impact on the solution quality of our group closeness heuristic (see Appendix C-B).

II-D6 Memory latency in reachability set size approximation

It is well-known that the empirical performance of graph traversal algorithms (like BFS and Dijkstra) is often limited by memory latency [bader2005architectural, lumsdaine2007challenges]. Unfortunately, the reachability set size approximation needs to perform multiple traversals of the same graph. To mitigate this issue, we perform multiple iterations of the approximation algorithm at the same time. This technique improves the running time of the algorithm at the cost of its memory footprint. More precisely, during each traversal of the graph, we store 16 random integers per vertex and we aggregate all 16 minimal values per vertex at the same time. This operation can be performed very efficiently by utilizing SIMD vector operations. In particular, we use 256-bit AVX operations of our Intel Xeon CPUs to take the minimum of all 16 values at the same time. As mentioned above, aggregating 16 random numbers per vertex is enough for our use case; thus, using SIMD aggregation, we only need to perform a single traversal of the graph.
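
A minimal sketch of this aggregation step (assuming AVX2, which provides the 16-lane unsigned 16-bit minimum; the function name and layout are ours):

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>

// Each vertex stores 16 unsigned 16-bit ranks in one 256-bit lane; one
// traversal step takes the element-wise minimum of a vertex's ranks and
// those of a reachable neighbor with a single _mm256_min_epu16 instruction.
inline void aggregateMinima(std::uint16_t vertexRanks[16],
                            const std::uint16_t reachableRanks[16]) {
    __m256i own =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(vertexRanks));
    __m256i other =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(reachableRanks));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(vertexRanks),
                        _mm256_min_epu16(own, other));
}
```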

II-D7 Accepting swaps and stopping condition

As detailed in Sections II-B and II-C, our algorithms stop once they cannot find another vertex exchange that improves the closeness score of the current group. Exchanges that worsen the score are not accepted. To prevent vertex exchanges that change the group closeness score only negligibly, we also set a limit on the number of vertex exchanges. In our experiments, we choose a conservative limit that does not impact the solution quality measurably (see Appendix C-A).

III Experiments


In this section, we evaluate the performance of our algorithms against the state-of-the-art greedy algorithm of Bergamini et al. [bergamini2018scaling]. (In our experiments, we do not consider the naive greedy algorithm and the OSA heuristic of [chen2016efficient], because they are both dominated by [bergamini2018scaling].) As mentioned in Section I, it has been shown empirically that the solution quality yielded by the greedy algorithm is often nearly-optimal. We evaluate two variants, LS and LS-restrict (see Section II-D2), of our Local-Swap algorithm, and three variants, GS, GS-local (see Section II-D3) and GS-extended (see Section II-D4), of our Grow-Shrink algorithm. We evaluate these algorithms for group sizes between $k = 5$ and $k = 100$ on the largest connected component of the input graphs. We measure the performance in terms of running time and closeness of the group computed by the algorithms. Because our algorithms construct an initial group by selecting vertices uniformly at random (see Section II), we average the results of five runs, each one with a different random seed, using the geometric mean over speedup and relative closeness. (These five runs are done to average out particularly bad, or good, selections of initial groups; as one can see from Appendix C-B, the variance due to the randomized reachability set size algorithm is negligible.)

Unless stated otherwise, our experiments are based on the graphs listed in Tables II and III. They are all undirected and have been downloaded from the public repositories of the 9th DIMACS Challenge [demetrescu2009shortest] and KONECT [kunegis2013konect]. The running time of the greedy baseline varies between 10 minutes and 2 hours on those instances.

Our algorithms are implemented in the NetworKit [staudt2016networkit] C++ framework and use PCG32 [pcg2014] to generate random numbers. All experiments were managed by the SimexPal software to ensure reproducibility [angrimanGLMNPT19]. Experiments were executed with sequential code on a Linux machine with an Intel Xeon Gold 6154 CPU and 1.5 TiB of memory.

III-A Results for Extended Grow-Shrink

(a) Speedup over the greedy algorithm (geom. mean).
(b) Closeness score relative to the score of the group returned by greedy (geom. mean).
Fig. 2: Performance of the extended Grow-Shrink algorithm for different values of $g$ or $d$ on unweighted graphs (fixed group size $k$).

In a first experiment, we evaluate the performance of our extended Grow-Shrink algorithm and compare it to the greedy heuristic. Because of its ability to escape local optima, we expect this to be the best algorithm in terms of quality; hence, it should be a good default choice among our algorithms. For this experiment, we fix the group size $k$.

As discussed in Section II-D4, we distinguish two strategies to determine $g$: we either fix a constant $g$, or we fix a constant $d$ (and set $g = \lceil \mathrm{diam}(G)/d \rceil$). For each strategy, we evaluate multiple values of $g$ or $d$. Results for both strategies are shown in Figure 2. As expected, higher values of $g$ (or, similarly, lower values of $d$) increase the algorithm's running time (while a larger $g$ allows the algorithm to perform better choices, it does not converge correspondingly faster). Still, for all tested values of $g$ and $d$, the extended Grow-Shrink algorithm is one to two orders of magnitude faster than the greedy baseline. Furthermore, moderate values of $d$ yield results of very good quality while retaining a large speedup over the greedy algorithm. We remark that, for all but the smallest values of $g$ (i.e., those corresponding to the lowest quality), choosing a constant $d$ is a better strategy than choosing a constant $g$: for the same running time, constant $d$ always achieves solutions of higher quality.

III-B Scalability to Large Graphs

We also analyze the running time of our extended Grow-Shrink algorithm on large-scale networks. To this end, we switch to graphs larger than the ones in Table II. We use the constant-$d$ strategy, as Section III-A demonstrated that this setting results in a favorable trade-off between solution quality and running time. The greedy algorithm is not included in this experiment, as it requires multiple hours of running time even for the smallest real-world graphs that we consider in this part. Hence, we also do not compare against its solution quality in this experiment.

III-B1 Results on Synthetic Data

(a) R-MAT networks of increasing size.
(b) Random hyperbolic networks of increasing size.
Fig. 3: Running time (s) of the extended Grow-Shrink algorithm on synthetic graphs (black line = linear regression).

Figure 3 shows the average running time of our algorithm on randomly generated R-MAT [chakrabarti2004r] graphs as well as graphs from a generator [DBLP:conf/hpec/LoozOLM16] for random hyperbolic graphs. Like R-MAT, the random hyperbolic model yields graphs with a skewed degree distribution, similar to the one found in real-world complex networks. In the (log-log) plot, the straight lines represent a linear regression of the running times. In both cases, the running time curves are at most as steep as the regression line, i.e., the running time behaves linearly in the number of vertices for the considered network models and sizes.

III-B2 Results on Large Real-World Data Sets

Network $n$ $m$ Time (s)
soc-LiveJournal1 4,843,953 42,845,684 95.3
livejournal-links 5,189,808 48,687,945 135.6
orkut-links 3,072,441 117,184,899 199.9
dbpedia-link 18,265,512 126,888,089 368.0
dimacs10-uk-2002 18,459,128 261,556,721 333.1
wikipedia_link_en 13,591,759 334,640,259 680.1
TABLE I: Running time of the extended Grow-Shrink algorithm on large real-world networks (fixed $k$ and $d$).

Table I reports the algorithm’s performance on large real-world graphs. In contrast to the greedy algorithm (which would require hours), our extended Grow-Shrink algorithm can handle real-world graphs with hundreds of millions of edges in a few minutes. For the orkut-links network, Bergamini et al. [bergamini2018scaling] report running times for greedy of 16 hours on their machine; it is the largest instance in their experiments.

III-C Accelerating Performance on Unweighted Graphs

(a) Speedup over the greedy algorithm (geom. mean).
(b) Closeness score relative to the score of the group returned by greedy (geom. mean).
Fig. 4: Performance of our local search algorithms for different values of the group size $k$; unweighted graphs.

While the extended Grow-Shrink algorithm yields results of very high quality, if quality is not a primary concern, even faster algorithms might be desirable for very large graphs. To this end, we also evaluate the performance of the non-extended Grow-Shrink and the Local-Swap algorithms. For extended Grow-Shrink, we fix the constant-$d$ strategy again. The speedup and the quality of our algorithms over the greedy algorithm, for different values of the group size $k$, are shown in Figures 4(a) and 4(b), respectively. Note that the greedy algorithm scales well to large $k$, so that the speedup of our algorithms decreases with $k$ (as mentioned in Section I, the main bottleneck of greedy is adding the first vertex to the group). However, even for large groups of size $k = 100$, all of our algorithms are still at least an order of magnitude faster.

After extended Grow-Shrink, our non-extended local variant of Grow-Shrink is the next best algorithm in terms of quality. As explained in Section II-D3, this variant yields better solutions than non-local Grow-Shrink; it trades a small loss in solution quality relative to greedy for an additional speedup over extended Grow-Shrink.

The non-restricted Local-Swap algorithm is dominated by Grow-Shrink, both regarding running time and solution quality. Furthermore, compared to the other algorithms, the restricted Local-Swap algorithm only gives a rough estimate of the group with highest group closeness; however, it is also significantly faster than all other algorithms and may be employed during an exploratory analysis of graph data sets.

III-D Results on Weighted Road Networks

(a) Speedup over the greedy algorithm (geom. mean).
(b) Closeness score relative to the score of the group returned by greedy (geom. mean).
Fig. 5: Performance of our local search algorithms for different values of the group size $k$; weighted graphs.

Recall that the Local-Swaps algorithm does not work for weighted graphs; we thus report only Grow-Shrink data in the weighted case. The performance of Grow-Shrink and local Grow-Shrink on weighted graphs is shown in Figure 5. In contrast to unweighted graphs, the quality of the non-local Grow-Shrink algorithm is superior to the greedy baseline for all considered group sizes. Furthermore, in contrast to the unweighted case, the ability to perform non-local vertex exchanges greatly benefits the non-local Grow-Shrink algorithm compared to local Grow-Shrink. (For this reason, we do not include the extended Grow-Shrink algorithm in this experiment: we expect that it improves only slightly on GS-local (red line/bars in Fig. 5) but cannot compete with non-local GS; the ability to perform non-local vertex exchanges, as done by GS (green line/bars in Fig. 5), appears to be crucial to obtain high-quality results on weighted graphs.) Thus, Grow-Shrink clearly dominates both the greedy algorithm and local Grow-Shrink on the weighted graphs in our benchmark set – both in terms of speed and solution quality.

III-E Summary of Experimental Results

On unweighted graphs, a good trade-off between solution quality and running time is achieved by the extended Grow-Shrink algorithm with a constant $d$. This strategy yields solutions whose closeness score comes close to that of a greedy solution (greedy, in turn, was near-optimal on small networks in previous work [bergamini2018scaling]), while being one to two orders of magnitude faster than greedy. Thus, it is able to handle graphs with hundreds of millions of edges in a few minutes – the state of the art needs multiple hours. If a fast but inaccurate algorithm is needed for exploratory analysis of graph data sets, we recommend to run the non-extended Grow-Shrink algorithm, or, if only a very coarse estimate of the group with maximal closeness is needed, restricted Local-Swap.

On weighted graphs, we recommend to always use our (non-local) Grow-Shrink algorithm, as it outperforms the greedy state of the art both in terms of quality (yielding solutions that are on average 12.4% better than greedy solutions for $k = 10$) and in terms of running time (with a speedup of two orders of magnitude).

IV Conclusions

In this paper, we introduced two families of new local search algorithms for group closeness maximization in large networks. As maximizing group closeness exactly is infeasible for graphs with more than a few thousand edges, our algorithms are heuristics (just like the state-of-the-art greedy algorithm, whose results are empirically known to be close to optimal solutions on small real-world networks [bergamini2018scaling]).

Compared to previous state-of-the-art heuristics, our algorithms (in particular extended Grow-Shrink) allow us to find groups with high closeness centrality in real-world networks with hundreds of millions of edges in seconds to minutes instead of multiple hours, while sacrificing only little solution quality. On weighted graphs, Grow-Shrink (GS) even dominates the best known heuristic: the GS solution quality is more than 12% higher and GS is two orders of magnitude faster. Adapting the algorithm to even larger graphs in distributed memory is left to future work.

References

Appendix A Technical Proofs

Proof of Lemma 1.

Let $w \in R(v)$. Because of the subpath optimality property of shortest paths, it is clear that $d(S, w) = d(S, v) + d(v, w)$ (as $v$ is a predecessor of $w$ on a shortest path from $S$). On the other hand, adding $v$ to $S$ decreases the length of this path (as the distance between $v$ and $S + v$ becomes zero); in other words: $d(S + v, w) \le d(S, w) - d(S, v)$. These observations allow us to express the contribution of $w$ to $\delta(v) = \sum_{w' \in V \setminus S} \big(d(S, w') - d(S + v, w')\big)$ from Eq. (1) as $d(S, w) - d(S + v, w) \ge d(S, v)$. Summing this inequality over all vertices in $R(v)$ yields the term $d(S, v) \cdot |R(v)|$ of the lemma. For vertices $w \notin R(v)$, it holds that $d(S + v, w) \le d(S, w)$, hence the inequality.

For the statement about the unweighted case, we need to show that the contribution of all other vertices is zero, i.e., $d(S + v, w) = d(S, w)$ for all vertices $w \notin R(v)$. Note that $d(v, w) \ge d(S, w)$ (otherwise $w$ would be in $R(v)$) and $d(S, v) = 1$. Thus, $d(S + v, w) = \min(d(S, w), d(v, w)) = d(S, w)$, which completes the proof. ∎

Proof of Lemma 4.

Let $U$ be the set of vertices that need to be updated by the algorithm, i.e., $U$ equals the set $W_1(u) \cup W_2(u)$ before the Dijkstra-like algorithm runs. We have not shown yet whether all vertices in $U$ are indeed updated. For the remainder of this proof, all symbols (such as $d_1$, $d_2$ and $r_2$) refer to the state of our data structures after the algorithm terminates. To prove the lemma, it is sufficient to prove that $d_2(y) < \infty$ for all $y \in U$ (i.e., that no $r_2$ remains undefined, or, in other words, $d_2$ is updated wherever necessary) and that $d_2(y) = \min_{s \in S \setminus \{r_1(y)\}} d(s, y)$ for all $y \in U$ (i.e., that the definition of $d_2$ is respected).

Let us first prove that $d_2(y) < \infty$ for all $y \in U$. Let $y \in U$. There exists a path from every $s \in S$ to $y$ and each such path contained at least one boundary pair before the algorithm started. Indeed, there is a boundary pair for the first vertex on that path that is also in $U$. Thus, the algorithm sets $d_2(y') < \infty$ for some such vertex $y'$ (i.e., $y' \in U$) and propagates the update of $d_2$ along the path from $y'$ to $y$. We have to prove that our pruning condition does not prevent any necessary update along this path. Hence, let $(y', z)$ be a pair of vertices so that the algorithm is pruned before visiting $z$ from $y'$. Only the case $z \in U$ is interesting, as $d_2(z)$ must already be correct otherwise. Pruning only happens if $r_1(z) = r_2(y')$, and since $r_2(y') \ne r_1(y')$, it follows that $r_1(z) \ne r_1(y')$. But in this case, $(y', z)$ was a $d_1$-boundary pair and the preceding argument shows that $d_2(z) < \infty$.

Now consider the second part of the proof. Let $s$ be any group vertex and let $y$ be any vertex that is updated by the algorithm with $r_2(y) = s$. The algorithm guarantees that $d_2(y) \ge d(s, y)$, as $d_2(y)$ (by construction of the algorithm) is the length of a (not yet proven to be shortest) path from $s$ to $y$. It is sufficient to show that this path is a shortest one, i.e., $d_2(y) \le \min_{s' \in S \setminus \{r_1(y)\}} d(s', y)$. We prove this statement for all $y \in U$ by an inductive argument over increasing $d_2(y)$. Fix a vertex $s' \in S \setminus \{r_1(y)\}$ that minimizes $d(s', y)$. We distinguish two cases depending on whether there exists a neighbor of $y$ in $U$ that is on a shortest path from $s'$ to $y$. First, we handle the case that no such neighbor exists. In this case, $x \notin U$ holds for all neighbors $x$ of $y$ on shortest paths from $s'$ to $y$. As $d_2(x)$ did not change during the algorithm, all such $x$ correspond to boundary pairs $(x, y)$ for $y$, and $d(s', y)$ is the minimal boundary distance over all these pairs. Hence, $d_2(y)$ was updated correctly before the Dijkstra-like algorithm ran. On the other hand, let $x \in U$ be a neighbor of $y$ that is on a shortest path from $s'$ to $y$. $s' \ne r_1(y)$ implies $r_2(x) \ne r_1(y)$; thus, the algorithm cannot be pruned when visiting $y$ from $x$. In this case, however, the algorithm sets $d_2(y) \le d_2(x) + \omega(x, y)$. As $d_2(x) < d_2(y)$, the induction yields that $d_2(x)$ is already correct, i.e., $d_2(x) = d(s', x)$. Since $x$ is on a shortest path from $s'$ to $y$, $d_2(y) = d(s', x) + \omega(x, y) = d(s', y)$ is also updated correctly. ∎

Appendix B Details of the Experimental Setup

Network $n$ $m$ Category
dimacs9-NY 264,346 365,050 Road
dimacs9-BAY 321,270 397,415 Road
web-Stanford 255,265 1,941,926 Hyperlink
hyves 1,402,673 2,777,419 Social
youtube-links 1,134,885 2,987,468 Social
com-youtube 1,134,890 2,987,624 Social
web-Google 855,802 4,291,352 Hyperlink
trec-wt10g 1,458,316 6,225,033 Hyperlink
dimacs10-eu-2005 862,664 16,138,468 Hyperlink
soc-pokec-relationships 1,632,803 22,301,964 Social
wikipedia_link_ca 926,588 27,133,794 Hyperlink
TABLE II: Unweighted real-world networks used in the experiments.
State $n$ $m$
DC 9,522 14,807
HI 21,774 26,007
AK 48,560 55,014
DE 48,812 59,502
RI 51,642 66,650
CT 152,036 184,393
ME 187,315 206,176
ND 203,583 249,809
SD 206,998 249,828
WY 243,545 293,825
ID 265,552 310,684
MD 264,378 312,977
WV 292,557 320,708
NE 304,335 380,004
TABLE III: Weighted networks used in the experiments. All networks are road networks of different states of the US.

Tables II and III show details about our real-world instances. To generate the synthetic graphs in Figure 3, we use the same parameter setting as in the Graph 500 benchmark [murphy2010introducing] (i.e., edge factor 16 and probabilities $a = 0.57$, $b = 0.19$, $c = 0.19$, $d = 0.05$) for the R-MAT generator. For the random hyperbolic generator, we fix the average degree and set the exponent of the power-law distribution to 3.

Appendix C Additional Experiments

C-A Impact of the number of vertex exchanges

Figures 7(a) and 7(b) depict the relative closeness (compared to the closeness of the group returned by the greedy algorithm), depending on the progress of the algorithm in terms of vertex exchanges. For extended Grow-Shrink, we fix the constant-$d$ strategy. All of the local search algorithms quickly converge to a value near their final result; additional vertex exchanges improve the group closeness score only by small amounts. In order to avoid an excessive number of iterations, it seems reasonable to set a limit on the number of vertex exchanges. In our experiments we set a conservative limit of 100 exchanges.

C-B Impact of reachability set size approximation

(a) Speedup over the greedy algorithm (geom. mean).
(b) Closeness score relative to the score of the group returned by greedy (geom. mean).
Fig. 6: Performance of the Grow-Shrink algorithm for different numbers of samples to estimate the reachability set size.

As mentioned in Section II-D3, the errors in the approximation of $|R(v)|$ are amplified by the multiplication with $d(S, v)$, and this results in GS-local computing higher-quality solutions than GS. We study how increasing the accuracy of the reachability set size approximation by raising the number of samples impacts the performance of both GS and GS-local. Figure 6(a) shows that GS needs at least 64 samples to converge to a better local optimum than GS-local. However, in both cases increasing the number of samples degrades the speedup without yielding a significant quality improvement (Figure 6(b)).

(a) Unweighted graphs.
(b) Weighted graphs.
Fig. 7: Behavior of the relative closeness score (compared to the group returned by greedy, geom. mean) over the execution of the algorithms (in terms of vertex exchanges).