Group Centrality Maximization for Large-scale Graphs

10/30/2019 · by Eugenio Angriman et al.

The study of vertex centrality measures is a key aspect of network analysis. Naturally, such centrality measures have been generalized to groups of vertices; for popular measures it was shown that the problem of finding the most central group is NP-hard. As a result, approximation algorithms to maximize group centralities were introduced recently. Despite a nearly-linear running time, approximation algorithms for group betweenness and (to a lesser extent) group closeness are rather slow on large networks due to high constant overheads. That is why we introduce GED-Walk centrality, a new submodular group centrality measure inspired by Katz centrality. In contrast to closeness and betweenness, it considers walks of any length rather than shortest paths, with shorter walks having a higher contribution. We define algorithms that (i) efficiently approximate the GED-Walk score of a given group and (ii) efficiently approximate a solution to the (provably NP-hard) problem of finding a group with highest GED-Walk score. Experiments on several real-world datasets show that scores obtained by GED-Walk improve performance on common graph mining tasks such as collective classification and graph-level classification. An evaluation of empirical running times demonstrates that maximizing GED-Walk (in approximation) is two orders of magnitude faster than group betweenness approximation and, for group sizes ≤ 100, one to two orders of magnitude faster than group closeness approximation. For graphs with tens of millions of edges, approximate GED-Walk maximization typically needs less than one minute. Furthermore, our experiments suggest that the maximization algorithms scale linearly with the size of the input graph and the size of the group.


1 Introduction

Large network data sets (we use the terms "graph" and "network" interchangeably in this paper) have become abundant in real-world applications; consider online social networks, web and co-authorship graphs, transportation and biological networks [38], to name just a few. Network analysis techniques have become very important in the last one or two decades to gain insights from these data [38]. One particularly interesting class of analysis methods are centrality measures: they assign each vertex a numerical score that indicates the importance of the vertex based on its position in the network. A few very popular centrality measures are betweenness, closeness, degree, eigenvector, PageRank, and Katz centrality [7, 38].

The notion of centrality has subsequently been extended to groups of vertices [18]; this extension allows us to determine the importance of whole groups, not only of individuals, and, vice versa, to identify important groups. The latter question leads to an optimization problem: find the group of vertices of a given size that maximizes a certain group centrality measure. It is easy to imagine applications for group centrality, e. g., facility location and influence maximization.

Motivation. Such optimization problems are often NP-hard; for example, this has been proven for group betweenness [16], group closeness [14] and even group degree. Since exact results may not be necessary in real-world applications, approximation is a reasonable way of dealing with this complexity. For example, in the case of group betweenness, Mahmoody et al. [32] designed a nearly-linear time (in the number of vertices and edges) approximation algorithm based on random sampling, which has been shown to outperform its competitors, while having similar accuracy. Yet, due to constant factors, it is not applicable to really large real-world instances; also (and in particular), the running time depends quadratically on the group size. Indeed, the largest instances considered in their paper have around half a million edges. While group closeness approximation can already be done faster [6], in the context of big data there is still a need for a meaningful group centrality measure with a truly scalable algorithm for its optimization.

Contribution. In this paper, we introduce a new centrality measure called ED-Walk (for exponentially decaying walk; see Section 2.1). While being inspired by Katz centrality, it allows a more natural extension to the group case, then called GED-Walk. GED-Walk takes walks of any length into account, but considers shorter ones more important. We develop an algorithm to approximate the GED score of a given group in nearly-linear time. As our measure is submodular (see Section 2.2), we can design a greedy algorithm to approximate a group with maximum GED-Walk score (Section 3). To allow the greedy algorithm to quickly find the vertex with highest marginal gain, we adapt a top-k centrality approach. In our experiments, our algorithms are two orders of magnitude faster than the state of the art for group betweenness approximation [32]. Compared to group closeness approximation [6], we achieve a speedup of one to two orders of magnitude for groups with at most 100 vertices. Furthermore, our experiments indicate a linear running-time behavior in practice (despite a higher worst-case time complexity) for greedy GED-Walk approximation. Finally, we show some potential applications of GED-Walk: (i) choosing a training set with high group centrality leads to improved performance on the semi-supervised vertex classification task; and (ii) features derived from GED-Walk improve graph classification performance.

2 Problem Definition and Properties

In this paper, we deal with unweighted graphs G = (V, E) with n := |V| vertices and m := |E| edges; they can be directed or undirected. By A, we denote the adjacency matrix of G.

2.1 Group Centrality Measures.

We start by reviewing existing centrality measures and by defining the new ED-Walk; our main focus is then on the group centrality derived from ED-Walk.

Closeness and betweenness, two of the most popular centrality measures, are both based on shortest paths. More precisely, betweenness is defined as the fraction of shortest paths that cross a certain vertex (or a certain group of vertices, for group betweenness). Hence, betweenness can be computed using all-pairs shortest-path (APSP) techniques and, interestingly, a sub-cubic algorithm for betweenness (on general graphs) would also imply a faster algorithm for APSP [1]. Furthermore, the latter result holds true even if one considers the problem of computing the betweenness centrality of only a single vertex of the graph.

Closeness, on the other hand, is defined as the reciprocal of the average distance from a given vertex (or a group of vertices, for group closeness) to all other vertices of the graph. Hence, computing the closeness of a single vertex (or a single group of vertices) is easy; nonetheless, unless the strong exponential time hypothesis (SETH) fails, there is no sub-quadratic algorithm to find a vertex with maximal closeness [8].

To avoid those complexity-theoretic barriers, instead of deriving our measure from betweenness or closeness, we derive it from Katz centrality (KC). KC is another popular centrality measure but, in contrast to measures based on shortest paths, it can be computed much more efficiently. Let φ_i(u) be the number of walks of length i that start at vertex u (in our notation, we omit the dependence of φ_i(u) on G; obviously, φ_i(u) potentially depends on the entire graph). (Katz centrality is sometimes alternatively defined based on walks ending at a given vertex; from an algorithmic perspective, however, this difference is irrelevant.) For a given parameter α > 0 (which needs to be chosen small enough [26], see also Proposition 2.2), the Katz centrality of u is defined as:

KC(u) := Σ_{i=1}^{∞} α^i φ_i(u).   (1)

Note that KC does not only consider shortest paths but walks of any length. Nevertheless, due to the fact that the contribution of a walk decreases exponentially with its length, short walks are still preferred. As KC does not rely on APSP, it can be computed faster than betweenness or closeness. For example, following the algebraic formulation [26], iterative linear solvers (e. g., conjugate gradient) can be used to compute KC(u) for all vertices u at the same time. Furthermore, efficient nearly-linear time Katz approximation and ranking algorithms exist [49]. In fact, the algorithms that we present in this paper will be based on those Katz algorithms.
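As a concrete illustration of this view (a hedged sketch, not the implementation used in the paper; the function name and the toy graph are ours), Katz scores can be approximated by truncating the series and propagating per-vertex walk counts with repeated matrix–vector products, and then compared against the algebraic closed form (I − αA)^{-1}·1 − 1:

```python
import numpy as np

def katz_truncated(A, alpha, ell):
    """Approximate KC(u) = sum_{i>=1} alpha^i * phi_i(u) by truncating
    the series after ell terms; phi_i (the i-walk counts per start
    vertex) is propagated as x_i = A @ x_{i-1}."""
    n = A.shape[0]
    x = np.ones(n)            # phi_0(u) = 1 for every u (the empty walk)
    kc = np.zeros(n)
    for i in range(1, ell + 1):
        x = A @ x             # x[u] = number of i-walks starting at u
        kc += alpha**i * x
    return kc

# Path graph on 3 vertices: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
scores = katz_truncated(A, alpha=0.1, ell=50)
```

With α well below 1/σ_max, the truncation error after 50 terms is negligible, so the result essentially matches the exact linear-system solution.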

In contrast to KC, however, replacing the single vertex u in Eq. (1) by a group S leads to a rather uninteresting measure: as φ_i(u) and φ_i(v) count distinct walks for u ≠ v, such a "group Katz" measure would just be equal to the sum of individual Katz scores; a group with maximal score would just consist of the vertices appearing in a top-k Katz ranking. Indeed, this contradicts one's intuition of group centralities: parts of a graph that are "covered" by one vertex of the group should not need to be covered by a second group vertex. Fortunately, we can construct a centrality satisfying this intuition by replacing φ_i(u) by a natural variant of it: instead of considering walks that start at a given vertex u, we consider all walks that cross vertex u. This leads to our definition of ED-Walk: let φ_i(S) be the number of i-walks (i. e., walks of length i) that contain at least one vertex of S. Then, the ED-Walk (for exponentially decaying walk) centrality of a vertex u is

ED(u) := Σ_{i=1}^{∞} α^i φ_i({u}),

where α > 0 is a parameter of the centrality.

In the preceding definition, α needs to be chosen appropriately small so that the series converges. In Section 2.2 we will see that it is indeed possible to choose such an α. As claimed above, ED-Walk naturally generalizes to a group measure. Definition (GED-Walk). The GED-Walk centrality of a group S of vertices is given by:

GED(S) := Σ_{i=1}^{∞} α^i φ_i(S).   (2)

Two problems are specifically interesting in the context of group centralities: (i) computing the centrality of a given group and (ii) finding a group that maximizes the group centrality measure. We deal with both in this paper and in particular with the following optimization problem. Problem (GED-Walk maximization). Given a graph G and an integer k > 0, find

S* := argmax_{S ⊆ V, |S| = k} GED(S).   (3)

Note again that Problem (3) is in general different from finding the k vertices whose individual ED-Walk centrality is highest (the top-k problem). In fact, for closeness centrality, it was shown empirically that the top-k problem and the group closeness problem often have very different solutions [6]. One may expect a similar behavior here as well, since a group with high GED-Walk score is unlikely to have many members close to each other. Such a "close" group would share many walks, leading to a relatively small GED score.

2.2 Properties of GED-Walk.

Since GED is based on similar concepts as Katz centrality, it is not surprising that the two measures are closely related. Indeed, the convergence properties of KC and GED-Walk are identical.

Proposition. The following holds:

  1. GED(V) = Σ_{u ∈ V} KC(u).

  2. If α < 1/σ_max, GED(S) is finite for all S ⊆ V, where σ_max is the largest singular value of the adjacency matrix A of G.

Note that the equality of the first claim does not hold for groups smaller than V. To prove the first claim, it is enough to show that the walks that contribute to GED(V) and the walks that contribute to Σ_{u ∈ V} KC(u) are exactly the walks of length ≥ 1 in G. Indeed, each walk of length i contributes α^i to GED(V). It also contributes α^i to exactly one of the KC(u) for u ∈ V, namely to the Katz score of its starting vertex. After applying the first claim (and observing that GED(V) is an upper bound on GED(S) for every S ⊆ V, by monotonicity; see below), the second one follows from the well-known fact that KC(u) is finite for all u iff α < 1/σ_max [26].

Now that we have established that GED-Walk is indeed well-defined, let us prove some basic properties (defined in Appendix A) of the function:

Proposition. GED-Walk is both non-decreasing and submodular as a set function. Monotonicity is obvious from Eq. (2): when the input set S becomes larger, φ_i(S) cannot decrease. To see that GED-Walk is also submodular, consider the marginal gain GED(S ∪ {u}) − GED(S), i. e., the increase of the score if a vertex u is added to the set S. This marginal gain is exactly the sum of the number of walks that contain u but no vertex in S, with each walk weighted by a power of α. As such, the marginal gain can only decrease if S is replaced by a superset S′ ⊇ S.

Given an algorithm to compute GED(S), the submodularity established above would immediately allow us to construct a greedy algorithm to approximate Problem (3) (using a well-known theorem on submodular approximation [37], also see Proposition A in Appendix A). Nevertheless, as we will see in Section 3.2, we can do better than naively applying the greedy approach. Before discussing any algorithms, however, we first demonstrate that maximizing GED-Walk to optimality is NP-hard; hence, approximation is indeed appropriate for Problem (3).

Theorem. Solving Problem (3) to optimality is NP-hard.

Intuitively, the theorem holds because if α is chosen small enough, GED-Walk degrades to group degree (or, equivalently, vertex cover: a set of vertices covers all m edges iff it is a vertex cover). Let G = (V, E) be a graph and let α be sufficiently small (quantified below). We show that G has a vertex cover of size k if and only if there is a group S of k vertices with GED(S) ≥ α·φ_1(V).

First, assume that G has a vertex cover C of size k. To see that a group with GED score at least α·φ_1(V) indeed exists, consider S := C: since S covers all edges, every 1-walk contains a vertex of S, so the contribution of the term corresponding to i = 1 in Eq. (2) is already α·φ_1(V).

Secondly, assume that a group S with GED(S) ≥ α·φ_1(V) exists. To decide whether S is a vertex cover, it is sufficient to show that Σ_{i ≥ 2} α^i φ_i(S) < α: in this case, φ_1(S) must equal φ_1(V), as all walks of length ≥ 2 together cannot match the contribution of even a single missing 1-walk to GED(S). It holds that Σ_{i ≥ 2} α^i φ_i(S) ≤ Σ_{i ≥ 2} α^i · n · deg_max^i = n (α·deg_max)^2 / (1 − α·deg_max). A straightforward calculation shows that the latter term is smaller than α if α is chosen below a suitable threshold depending on n and deg_max.

3 Algorithms for GED-Walk

3.1 Computing GED-Walk Centrality.

In this section, we discuss the problem of computing GED(S) for a given group S. As we are not aware of an obvious translation of our definition to a closed-form expression, we employ an approximation algorithm that computes the infinite series up to an arbitrarily small additive error ε > 0. To see how the algorithm works, let ℓ be a positive integer. We split the series into the first ℓ terms and a tail consisting of an infinite number of terms. This allows us to reduce the problem of approximating GED(S) to the problem of (exactly) computing the ℓ-th partial sum GED_ℓ(S) := Σ_{i=1}^{ℓ} α^i φ_i(S) and to the problem of finding an upper bound on the tail Σ_{i>ℓ} α^i φ_i(S). In particular, we are looking for an upper bound that converges to zero for increasing ℓ. Given such an upper bound, our approximation algorithm chooses ℓ such that this bound is below the specified error threshold ε. The algorithm returns GED_ℓ(S) once such an ℓ is found.

3.1.1 Computation of GED_ℓ(S).

In order to compute GED_ℓ(S), it is obviously enough to compute φ_i(S) for all i ≤ ℓ. The key insight to computing φ_i(S) efficiently is that it can be expressed through a series of recurrences (similar to Katz centrality): let us define φ_i(S, u) as the number of i-walks ending in u and containing at least one vertex from S. Obviously, the walks that contribute to the values φ_i(S, u) partition the walks that contribute to φ_i(S), in the sense that

φ_i(S) = Σ_{u ∈ V} φ_i(S, u).   (4)

On the other hand, φ_i(S, u) can be expressed in terms of φ̄_{i−1}(S, v), i. e., the number of (i−1)-walks that end in v but do not contain any vertex from S:

φ_i(S, u) = Σ_{v : {v,u} ∈ E} ( φ_{i−1}(S, v) + [u ∈ S] · φ̄_{i−1}(S, v) ).

This takes advantage of the fact that any i-walk is the concatenation of an (i−1)-walk and a single edge. The expression considers (i−1)-walks which do not contain a vertex in S only if the concatenated edge ends in u ∈ S. Finally, φ̄_i(S, u) can similarly be expressed as the recurrence:

φ̄_i(S, u) = 0 if u ∈ S, and φ̄_i(S, u) = Σ_{v : {v,u} ∈ E} φ̄_{i−1}(S, v) otherwise.

In the u ∈ S case, all walks ending in u have a vertex in S, so no walks contribute to φ̄_i(S, u). We remark that the base cases φ_0(S, u) = [u ∈ S] and φ̄_0(S, u) = [u ∉ S] can easily be computed directly from S. Collectively, this yields an O(ℓ(n + m))-time algorithm for computing the ℓ-th partial sum of the GED-Walk score of a given group. We also note that the computation of φ_i(S, u) and φ̄_i(S, u) can be trivially parallelized by computing the values of those functions for multiple vertices u in parallel.

3.1.2 Bounding the Tail.

To complete our algorithm, we need a sequence of upper bounds on the tail Σ_{i>ℓ} α^i φ_i(S) such that the upper bound converges to zero for ℓ → ∞. Using the proposition of Section 2.2, we can lift tail bounds of Katz centrality to GED-Walk. In particular, applying the first claim of this proposition shows:

Σ_{i>ℓ} α^i φ_i(S) ≤ Σ_{i>ℓ} α^i φ_i(V) = Σ_{u ∈ V} Σ_{i>ℓ} α^i φ_i(u).   (5)

Bounds for the latter term can be found in the literature on Katz centrality. For example, in [49], it was shown that Σ_{i>ℓ} α^i φ_i(u) ≤ (α·deg_max)^{ℓ+1} / (1 − α·deg_max), for α < 1/deg_max. If we apply Eq. (5), this yields the bound

Σ_{i>ℓ} α^i φ_i(S) ≤ n (α·deg_max)^{ℓ+1} / (1 − α·deg_max)   (6)

for GED-Walk; we call this bound the combinatorial bound (due to the nature of the proof in [49]). Here, deg_max denotes the maximum degree of any vertex in G. We generalize this statement to arbitrary α for which the series converges (i. e., α < 1/σ_max), at the cost of an additional factor in the bound. For completeness, we give the proof.

Lemma. It holds that:

Σ_{i>ℓ} α^i φ_i(S) ≤ n (α·σ_max)^{ℓ+1} / (1 − α·σ_max).   (7)

We call the bound of Eq. (7) the spectral bound. First note that φ_i(V) = ‖A^i 1‖_1, where 1 denotes the vector containing all ones in R^n and ‖·‖_1 denotes the 1-norm. Furthermore, let ‖·‖_2 denote the induced 2-norm on R^{n×n} matrices. Since ‖·‖_2 is compatible with the Euclidean vector norm, it holds that ‖A^i 1‖_1 ≤ √n ‖A^i 1‖_2 ≤ √n ‖A^i‖_2 ‖1‖_2 = n ‖A^i‖_2. Here, the first inequality uses the fact that ‖x‖_1 ≤ √n ‖x‖_2 for x ∈ R^n [19], and the last steps use ‖1‖_2 = √n together with the fact that ‖·‖_2 is sub-multiplicative, so that ‖A^i‖_2 ≤ ‖A‖_2^i. Furthermore, it is well-known that ‖A‖_2 = σ_max; hence

φ_i(V) ≤ n σ_max^i.

Substituting this into Eq. (5), we get Σ_{i>ℓ} α^i φ_i(S) ≤ n Σ_{i>ℓ} (α σ_max)^i. After rewriting the geometric series Σ_{i>ℓ} (α σ_max)^i = (α σ_max)^{ℓ+1} / (1 − α σ_max), we obtain the statement of the lemma.
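To make the comparison concrete (a numerical sketch under the notation above; the toy graph and the choice of α are ours), one can check that on undirected graphs the spectral bound is never worse than the combinatorial one, since σ_max ≤ deg_max:

```python
import numpy as np

# Small undirected graph; deg_max bounds the combinatorial tail bound,
# sigma_max (largest singular value of A) bounds the spectral one.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
deg_max = int(A.sum(axis=1).max())                     # = 3
sigma_max = np.linalg.svd(A, compute_uv=False).max()
alpha = 0.9 / deg_max                                  # small enough for both bounds

def tail(base, ell):
    """n * (alpha*base)^(ell+1) / (1 - alpha*base), as in Eqs. (6)/(7)."""
    return n * (alpha * base) ** (ell + 1) / (1 - alpha * base)

for ell in (5, 10, 20):
    comb, spec = tail(deg_max, ell), tail(sigma_max, ell)
    assert spec <= comb    # sigma_max <= deg_max on undirected graphs
```

Both bounds decay geometrically in ℓ, which is exactly what the stopping rule of the approximation algorithm exploits.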

3.1.3 Complexity Analysis.

Input: Graph G = (V, E), parameters α and ε, and group S
Output: γ with |GED(S) − γ| ≤ ε

1: ℓ ← 1
2: γ ← 0
3: loop
4:      compute φ_ℓ(S)
5:      γ ← γ + α^ℓ · φ_ℓ(S)
6:      compute a bound τ s.t. Σ_{i>ℓ} α^i φ_i(S) ≤ τ
7:      if τ < ε then
8:          return γ
9:      ℓ ← ℓ + 1
Algorithm 1 GED-Walk Computation

Algorithm 1 shows the pseudocode of the algorithm. Line 4 computes φ_ℓ(S) using Equation (4) and the recurrences for φ_ℓ(S, u) and φ̄_ℓ(S, u) from Section 3.1.1. Assuming that the values φ_{ℓ−1}(S, ·) and φ̄_{ℓ−1}(S, ·) are stored, this can be done in time O(n + m). Line 6 computes an upper bound on the tail of the series; we use either the combinatorial or the spectral bound from Section 3.1.2.

Lemma. The score of a given group can be approximated up to an additive error of ε in ℓ = O( log(n / (ε (1 − α σ_max))) / log(1 / (α σ_max)) ) iterations. We assume that the spectral bound is used in the algorithm (similar results can be obtained for other bounds). By construction, the algorithm terminates once n (α σ_max)^{ℓ+1} / (1 − α σ_max) < ε. Note that α σ_max < 1; thus, this term decreases geometrically in ℓ, and a straightforward calculation shows that for the value of ℓ above it becomes smaller than ε. Since each iteration costs O(n + m) time, the score of a given group can hence be approximated up to an additive error of ε in time O(ℓ (n + m)).
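Putting the recurrences of Section 3.1.1 and a deg_max-based tail bound together, the whole computation can be sketched as follows (illustrative Python with hypothetical names, not the NetworKit implementation):

```python
def ged_walk(adj, S, alpha, eps):
    """Approximate GED(S) up to additive error eps (Algorithm 1 sketch,
    using a combinatorial tail bound based on the maximum degree)."""
    n = len(adj)
    deg_max = max(len(adj[u]) for u in range(n))
    assert alpha * deg_max < 1, "alpha must be < 1/deg_max for this bound"
    phi = [1 if u in S else 0 for u in range(n)]       # phi_0(S, u)
    phibar = [0 if u in S else 1 for u in range(n)]    # phibar_0(S, u)
    score, ell = 0.0, 0
    while True:
        ell += 1
        nphi, nphibar = [0] * n, [0] * n
        for u in range(n):
            for v in adj[u]:
                nphi[u] += phi[v] + (phibar[v] if u in S else 0)
                if u not in S:
                    nphibar[u] += phibar[v]
        phi, phibar = nphi, nphibar
        score += alpha**ell * sum(phi)                 # add alpha^ell * phi_ell(S)
        # tail bound: at most n * deg_max^i walks of length i exist
        tail = n * (alpha * deg_max) ** (ell + 1) / (1 - alpha * deg_max)
        if tail < eps:
            return score

# Path graph 0 - 1 - 2 - 3; deg_max = 2, so any alpha < 0.5 converges
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
score = ged_walk(adj, {1}, alpha=0.1, eps=1e-6)
```

Tightening ε only changes the result within the requested tolerance, and monotonicity (GED(S) ≤ GED(V)) is visible directly on this toy instance.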

3.2 Maximizing GED-Walk Centrality.

Let gain(u, S) denote the marginal gain of a vertex u w. r. t. a group S; in other words, gain(u, S) := GED(S ∪ {u}) − GED(S). As observed in Section 2.2, monotonicity and submodularity imply that a greedy algorithm that successively picks a vertex with highest marginal gain yields a (1 − 1/e)-approximation for Problem (3). Note, however, that we do not have an algorithm to compute gain(u, S) exactly. While we could use the approximation algorithm of Section 3.1 to feed approximate values of gain(u, S) into a greedy algorithm, this procedure would involve more work than necessary: indeed, the greedy algorithm does not care about the value of gain(u, S) at all; it just needs to ascertain that there is no v with gain(v, S) > gain(u, S).

This consideration is similar to a top-1 centrality problem (top-1 algorithms have been developed for multiple centrality measures, e. g., in [5]). Hence, we adapt the ideas of the top-k Katz ranking algorithm that was introduced in [49] to the case of marginal gains. (Note that the vertex that maximizes gain(u, S) is exactly the vertex that maximizes GED(S ∪ {u}); however, algorithmically, it is advantageous to deal with marginal gains, as this allows us to construct a lazy greedy algorithm, see Section 3.2.2.) Applied to the marginal gain of GED-Walk, the main ingredients of this algorithm are families (L_i) and (U_i) of lower and upper bounds on gain(u, S), satisfying the following definition:

Definition. We say that families of functions (L_i) and (U_i) are suitable bounds on the marginal gain of GED-Walk if the following conditions are all satisfied:

  1. L_i(u) ≤ gain(u, S) ≤ U_i(u) for all vertices u and all i;

  2. U_i(u) − L_i(u) is non-increasing in i and converges to zero for i → ∞;

  3. the bounds remain valid (i. e., condition 1 is preserved) when S is extended, with U_i non-increasing as a set function.

In addition to (L_i) and (U_i), we need the following definition: Let ε > 0 and fix some i. Let u ≠ v be two vertices. If L_i(u) ≥ U_i(v) − ε, we say that u is ε-separated from v.

Given the definition, it is easy to see that if there is a vertex u that is ε-separated from all other vertices v ≠ u, then either u is the vertex with top-1 marginal gain or the marginal gain of the top-1 vertex exceeds that of u by at most ε (in fact, this follows directly from the first property of the definition). Note that the introduction of ε is required here to guarantee that separation can be achieved for finite i even if u and v have truly identical marginal gains. Furthermore, while the first condition of the definition is required for ε-separation to work, the second and third conditions are required for the correctness of the algorithm that we construct in Section 3.2.2.

3.2.1 Construction of L_i and U_i.

In order to apply the methodology from Section 3.2, we need suitable families (L_i) and (U_i) of bounds. Luckily, it turns out that we can re-use the bounds developed in Section 3.1.2. Indeed, with Δ_i(u) := GED_i(S ∪ {u}) − GED_i(S), it holds that:

gain(u, S) = Σ_{j=1}^{∞} α^j (φ_j(S ∪ {u}) − φ_j(S)) ≥ Δ_i(u).

In the above calculation, every term of the series is non-negative because GED is non-decreasing as a set function. Hence, Δ_i yields a family of lower bounds on the marginal gain of GED. On the other hand, the tail of the series is at most the tail bound of Section 3.1.2, i. e., gain(u, S) ≤ Δ_i(u) + τ_i. Thus, a family of upper bounds on the marginal gain of GED is given by Δ_i(u) + τ_i, where τ_i denotes either the combinatorial or the spectral bound developed in Section 3.1.2.

Lemma. Let Δ_i(u) := GED_i(S ∪ {u}) − GED_i(S). The two families

L_i(u) := Δ_i(u),   U_i(u) := Δ_i(u) + τ_i

form suitable families of bounds on the marginal gain of GED. The discussion at the beginning of this section already implies that properties (i) and (ii) of the definition hold. To see that (iii) also holds, it is enough to show that Δ_i is non-increasing as a set function; this directly follows from the proof of submodularity in Section 2.2.

3.2.2 Lazy-Greedy Algorithm.

Input: Graph G = (V, E), parameter ε, and group size k.
Output:  Group S with at least (1 − 1/e) times the optimum score, up to an additive error of ε.

1: S ← ∅
2: i ← 1
3: L̃(u) ← ∞, Ũ(u) ← ∞ for all u ∈ V
4: Q_L, Q_U ← priority queues over V w. r. t. L̃, Ũ
5: procedure lazyUpdate(Q)
6:      repeat
7:          u ← top of Q
8:          evaluate L_i(u) and U_i(u)
9:          L̃(u) ← L_i(u); Ũ(u) ← U_i(u)
10:         update the position of u in Q_L and Q_U
11:     until the priority of the top of Q is exactly evaluated
12:     return top of Q
13: loop
14:     reset L̃(u) ← ∞, Ũ(u) ← ∞ for all u ∈ V
15:     rebuild Q_L, Q_U
16:     loop
17:         if |S| = k then
18:             return S
19:         u_L ← lazyUpdate(Q_L)
20:         u_U ← lazyUpdate(Q_U)
21:         ⊳ check for (ε/k)-separation
22:         if L̃(u_L) < Ũ(u_U) − ε/k then
23:             ⊳ u_L is not (ε/k)-separated from u_U;
24:             ⊳ the bounds must be tightened with a larger i
25:             break
26:         S ← S ∪ {u_L}
27:     i ← 2i   ⊳ increase i using geometric progression
Algorithm 2 Lazy greedy algorithm for GED-Walk maximization

Taking advantage of the ideas from Section 3.2.1, we can construct a greedy approximation algorithm for GED-Walk maximization. To reduce the number of evaluations of our objectives (without impacting the solution or the worst-case time complexity), we use a lazy strategy inspired by the well-known lazy greedy algorithm for submodular approximation [34]. However, in contrast to the standard lazy greedy algorithm, we do not evaluate our submodular objective function directly; instead, we apply ε-separation to find the vertex with highest marginal gain. To this end, instead of lazily evaluating gain(u, S) for vertices u, we lazily evaluate L_i(u) and U_i(u). In fact, to apply ε-separation, we need to find the vertex u_L that maximizes L_i and the vertex u_U that maximizes U_i. Hence, we rank all vertices according to priorities L̃ and Ũ and lazily compute the true values of L_i and U_i until u_L and u_U are identified.

Algorithm 2 depicts the pseudocode of this algorithm. Procedure lazyUpdate (line 5) lazily updates the priorities L̃ and Ũ until the exact values of L_i and U_i are known for the topmost element of the given priority queue. In line 22, the resulting vertices are checked for (ε/k)-separation. It is necessary to achieve an (ε/k)-separation (and not only an ε-separation) to guarantee that the total absolute error is below ε, even after all k iterations. If separation fails, i is increased in line 27 and we reset L̃(u) and Ũ(u) for all u in line 14; otherwise, u_L (i. e., the vertex with top-1 marginal gain) is added to S.

Theorem. If (L_i) and (U_i) are suitable families of bounds, Algorithm 2 computes a group S such that GED(S) ≥ (1 − 1/e)·GED(S*) − ε, where S* is the group of size k with highest score. First, note that it is indeed sufficient to only consider the vertices u_L and u_U with highest L̃ and Ũ for ε-separation: if those two vertices are not ε-separated, then u_L is not guaranteed to be the top-1 vertex with highest marginal gain (up to ε). Due to property (ii) of the definition of suitable bounds, u_L and u_U will eventually be ε-separated and the algorithm terminates. Hence, it is enough to show that the algorithm indeed finds the vertices with highest L_i and U_i. This follows from the invariant that the priorities always overestimate the current exact values, i. e., L̃(u) ≥ L_i(u) and Ũ(u) ≥ U_i(u). The invariant obviously holds after each reset of L̃ and Ũ. Furthermore, property (iii) of the definition guarantees that it also holds when the set S is extended. The approximation guarantee follows from Proposition A in the appendix. Specifically, it follows from a variant of the classical result on submodular approximation, adapted to the case of an additive error in the objective function.
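For intuition, a plain (non-lazy) greedy variant can be sketched as follows; it evaluates a truncated GED score for every candidate in every round and is therefore far slower than Algorithm 2, but it illustrates the selection rule. Names and the fixed truncation ℓ = 30 are our choices:

```python
def ged_truncated(adj, S, alpha, ell=30):
    """Truncated GED score (illustrative stand-in for Algorithm 1)."""
    n = len(adj)
    phi = [1 if u in S else 0 for u in range(n)]
    phibar = [0 if u in S else 1 for u in range(n)]
    score = 0.0
    for i in range(1, ell + 1):
        nphi, nphibar = [0] * n, [0] * n
        for u in range(n):
            for v in adj[u]:
                nphi[u] += phi[v] + (phibar[v] if u in S else 0)
                if u not in S:
                    nphibar[u] += phibar[v]
        phi, phibar = nphi, nphibar
        score += alpha**i * sum(phi)
    return score

def greedy_ged(adj, k, alpha):
    """Pick k vertices, each maximizing the marginal GED gain."""
    S = set()
    for _ in range(k):
        best = max((u for u in adj if u not in S),
                   key=lambda u: ged_truncated(adj, S | {u}, alpha))
        S.add(best)
    return S

# Star graph: center 0 with leaves 1..4
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

On the star, the first greedy pick is the center, and every further pick has zero marginal gain, since every walk in a star already crosses the center; this is the "coverage" intuition behind GED-Walk.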

Initialization of L̃ and Ũ.

To accelerate the algorithm in practice, it is crucial that the initial values of L̃ and Ũ are chosen appropriately during each reset: if we reset them to ∞ as in Algorithm 2, the algorithm has to evaluate L_i and U_i for all u ∈ V whenever i increases! To avoid this issue, we use φ̄ to provide a better initialization of L̃ and Ũ. Let φ̄ʳ_i(S, u) be the value of φ̄_i(S, u) in the reverse graph of G, i. e., φ̄ʳ_i(S, u) is the number of i-walks that start at u but do not contain any vertex of S. The following lemma summarizes our initialization strategy:

Lemma. Let

b_j(u) := Σ_{a+c=j} φ̄_a(S, u) · φ̄ʳ_c(S, u).

The following holds:

  1. b_j(u) is an upper bound on the number of j-walks that contain u but no vertex of S.

  2. Let τ_i denote the bound from the construction of U_i. Then Σ_{j=1}^{i} α^j b_j(u) + τ_i yields upper bounds on both L_i(u) and U_i(u).

Thus, instead of initializing L̃ and Ũ to ∞, we compute the b_j and use the right-hand side of the second statement of the lemma as initial values for L̃ and Ũ. Note that, using the recurrence for φ̄ from Section 3.1 (and a similar recurrence for φ̄ʳ), the b_j can be computed in time O(i(n + m)), simultaneously for all u ∈ V. To prove the first claim, observe that the difference between b_j(u) and the true count is that b_j(u) counts walks that cross u multiple times once per crossing, while the true count counts each such walk only once. The second claim follows immediately.

Complexity of the lazy algorithm.

Assume that Algorithm 2 uses the spectral bound from Section 3.1.2. The worst-case value of i needed to achieve ε-separation then matches the worst-case ℓ of the lemma in Section 3.1.3. Note that the algorithm does some iterations that do not achieve ε-separation. We calculate the cost of each iteration in terms of the number of evaluations of L_i and U_i; in this measure, the cost of each iteration is O(n). Thus, as i is doubled on unsuccessful ε-separation, the cost of all unsuccessful attempts to achieve ε-separation is always smaller than the cost of the final successful attempt. Hence, with each evaluation costing O(i(n + m)) time, the worst-case running time of Algorithm 2 is O(k · n · i(n + m)): in each of the k successful iterations, it can be necessary to extract all elements of the priority queues and to evaluate L_i and U_i on each of them.

We expect that far fewer evaluations of L_i and U_i will be required in practice than in the worst case: in fact, we expect almost all vertices to have low marginal gains (i. e., only few walks cross those vertices), so that the bounds will never be evaluated on those vertices. Hence, we investigate the practical performance of Algorithm 2 in Section 4.

4 Experiments

Our algorithms for GED-Walk maximization are implemented in C++ on top of the open-source framework NetworKit [48], which also includes implementations of the aforementioned algorithms for group betweenness and group closeness maximization. All experiments are executed on a Linux server with two Intel Xeon Gold 6154 CPUs (36 cores in total) and 1.5 TB of memory. If not stated differently, every experiment uses 36 threads (one per core), and the default algorithm for GED-Walk maximization uses the combinatorial bound (see Section 3.1.2) and the lazy-greedy strategy (see Section 3.2.2). All networks are undirected; real-world networks have been downloaded from the Koblenz Network Collection [29], and synthetic networks have been generated using the graph generators provided in NetworKit (see Appendix B for more details about the parameter settings of the generators). Because group closeness is not defined on disconnected graphs (without modifications), we always consider the largest connected component of our input instances.

4.1 Scalability w. r. t. Group Size.

Figure 1: Scalability w. r. t. the group size k of GED-Walk, group-betweenness, and group-closeness maximization: (a) running time in seconds (log-scale), and (b) length of the walks considered by our algorithm for GED-Walk maximization. Data points are aggregated using the geometric mean over the instances in Table 2.

Figure 1(a) shows the average running time in seconds of group-betweenness, GED-Walk, and group-closeness maximization for increasing group sizes. For small group sizes, GED-Walk can be maximized substantially faster than both group betweenness and group closeness. On the other hand, group-closeness maximization scales better than group-betweenness and GED-Walk maximization w. r. t. the group size (see Appendix C for additional results on large groups). This behavior is expected, since the evaluation of marginal gains of group closeness becomes computationally cheaper for larger groups. This does not apply to our algorithm for maximizing GED-Walk, which instead needs to increase the length of the considered walks while the group grows (see Algorithm 2, line 27).

Yet, one can also observe that the group-closeness score increases only very slowly (and more slowly than GED-Walk's) when increasing the group size (see Figure 4(b) in the appendix). This means that, from a certain group size on, the choice of a new group member hardly makes a difference for group closeness – in most cases, only the distance to very close vertices can be reduced. In that sense, GED-Walk seems to distinguish better between more and less promising candidates. Furthermore, Figure 1(b) shows the length of the walks considered by our algorithm w. r. t. the group size; in accordance with what we stated in Section 3.2.2, this length grows sub-linearly with the group size.

4.2 Scalability to Large (Synthetic) Graphs.

Figure 2: Running time in seconds on 36 cores of GED-Walk maximization on synthetic networks of increasing size: (a) Erdős–Rényi, (b) R-MAT, (c) Barabási–Albert, and (d) random hyperbolic networks. Data points are aggregated using the geometric mean over three different randomly generated networks per size (see Appendix B for further details).

Figure 2 shows the running time in seconds of GED-Walk maximization on randomly generated networks using the Erdős–Rényi, R-MAT [11] and Barabási–Albert [4] models as well as the random hyperbolic generator from von Looz et al. [50]. The thin blue lines represent the linear regression on the running times (all resulting p-values are small enough for statistical relevance); compared to the running-time curves, the regression lines either have a steeper slope (Figures 2(a), 2(b), and 2(c)), or match them almost perfectly in the case of the random hyperbolic generator. Therefore, for the network types and sizes under consideration, GED-Walk maximization scales (empirically) linearly with the size of the input graph.

Experimental results on large real-world networks with hundreds of millions of edges can be found in Table 4 in Appendix D. Our algorithm needs up to six minutes to finish; in most cases, only around one minute.

4.3 Parallel Scalability

Figure 3: Multi-core scalability of GED-Walk, group-betweenness, and group-closeness maximization: (a) multi-core speedups of GED-Walk maximization over single-core execution, and (b) running time of the three algorithms w. r. t. the number of cores. Data points are aggregated using the geometric mean over the instances in Table 2.

As stated above, the GED-Walk algorithm can be parallelized by computing φ and φ̄ in parallel for different vertices. Figure 3(a) shows the parallel speedup of our algorithm for maximizing GED-Walk over itself running on a single core. The scalability is moderate up to 16 cores, while on 32 cores it does not gain much additional speedup. A similar behavior was observed before for group closeness [6]. Figure 3(b) shows the running time in seconds of the three maximization algorithms with an increasing number of cores. On a single core, GED-Walk maximization is on average substantially faster than group-betweenness and group-closeness maximization. As we increase the number of cores, the decrease in running time of the three algorithms is comparable – except on 32 cores: in this case, group-closeness maximization is slower than with 16 cores on the considered instances. The limited scalability affecting all three algorithms is probably due to memory latency becoming a bottleneck in the execution on multiple cores [3, 31]. Figure 3(b) also shows that our algorithm for GED-Walk maximization finishes on average within a few seconds on 36 cores. Here, the running time of the algorithm is dominated by its sequential parts, and it is not surprising that adding more cores does not speed the algorithm up substantially.

4.4 Applications of GED-Walk.

In this section we demonstrate the usefulness of GED-Walk for different applications by showing that it improves the performance on two important graph mining tasks: semi-supervised vertex classification and graph classification. As a preprocessing step for both tasks, before applying GED-Walk we first construct a weighted graph using the symmetrically normalized adjacency matrix $D^{-1/2} A D^{-1/2}$, which is often used in the literature [52, 28]; here, $A$ is the adjacency matrix and $D$ is the diagonal matrix of vertex degrees. Instead of taking the contribution of a walk to be a function of its length alone, we additionally weight it by the product of the weights of its edges. Except for the introduction of edge-weight coefficients in the recurrences, no modifications to our algorithm are required. Compared to our unweighted definition of GED-Walk, this weighted variant converges even faster, as the contribution of each walk is smaller.
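The symmetrically normalized adjacency matrix used in this preprocessing step can be computed in a few lines. This is a minimal dense-matrix sketch (the function name is ours; for large graphs one would use sparse matrices instead):

```python
import numpy as np

def normalized_adjacency(A):
    # Symmetrically normalized adjacency D^{-1/2} A D^{-1/2},
    # where D is the diagonal matrix of vertex degrees.
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    return A * np.outer(d_inv_sqrt, d_inv_sqrt)

# Path graph on 3 vertices: 0 - 1 - 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
S = normalized_adjacency(A)
# The weight of edge (u, v) becomes 1 / sqrt(deg(u) * deg(v)).
```

Since every entry of the normalized matrix is at most 1 and typically much smaller, the weighted walk contributions shrink, which is why the weighted GED-Walk variant converges faster.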

4.4.1 Vertex classification.

Semi-supervised vertex classification is a fundamental graph mining problem where the goal is to predict the class labels of all vertices in a graph, given a small set of labelled vertices and the graph structure. The choice of which vertices we label (i. e., which vertices we include in the training set) before building a classification model can have a significant impact on the test accuracy, especially when the number of labelled vertices is small in comparison to the size of the graph [2, 47]. Since many models rely on diffusion to propagate information on the graph, our hypothesis is that selecting a training set with high group centrality will improve diffusion, and thus also improve accuracy. To test this hypothesis, we evaluate the performance of Label Propagation [52, 12] given different strategies for choosing the training set. We choose the simple Label Propagation model as our benchmark to isolate the effect on accuracy more clearly, but similar conclusions apply to more sophisticated vertex classification models such as graph neural networks [28].

More specifically, we evaluate the classification accuracy on two common benchmark graphs: Cora and Wiki [46]. We use the Normalized Laplacian variant of Label Propagation [52], with a fixed value for the return-probability hyper-parameter. We let the vertices with the highest group centrality according to GED-Walk be in the training set and the rest of the vertices be in the test set. We compare GED-Walk with the following baselines for selecting the training set: RND: select vertices at random (results averaged over 10 trials); DEG: select the vertices with highest degree; NBC: select vertices with highest (non-group) betweenness centrality; GBC: select vertices with highest group betweenness centrality; and PPR: select vertices with highest PageRank.
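The Label Propagation benchmark follows the iteration of Zhou et al. [52]. The sketch below is a minimal dense implementation under assumed defaults (the paper's actual return-probability value is not shown here, so `alpha=0.85` is an assumption), not the code used in the experiments:

```python
import numpy as np

def label_propagation(S, y, labeled, alpha=0.85, iters=100):
    # Zhou et al.-style iteration: F <- alpha * S @ F + (1 - alpha) * Y,
    # where S is the normalized adjacency and Y one-hot encodes the labels.
    n = S.shape[0]
    classes = np.unique(y[labeled])
    Y = np.zeros((n, classes.size))
    for j, c in enumerate(classes):
        Y[labeled & (y == c), j] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return classes[F.argmax(axis=1)]

# Toy graph: two components {0,1} and {2,3}; one labelled vertex per component.
S = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
y = np.array([0, -1, -1, 1])                      # -1 marks unlabelled vertices
labeled = np.array([True, False, False, True])
pred = label_propagation(S, y, labeled)
```

Whichever strategy selects the `labeled` mask (RND, DEG, NBC, GBC, PPR, or GED-Walk) plugs in at exactly this point, which is what makes the comparison clean.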

In Figure 4, we can see that for both datasets and across different numbers of labelled vertices, selecting the training set using GED-Walk leads to the highest (or comparable) test accuracy. Furthermore, while the second-best baseline strategy differs between the datasets (on Cora it is the GBC strategy and on Wiki it is the NBC strategy), GED-Walk is consistently better. Overall, these results confirm our hypothesis.

(a) Cora
(b) Wiki
Figure 4: Semi-supervised vertex classification accuracy for different strategies for choosing the training set.

4.4.2 Graph classification.

Graph classification is another fundamental graph mining problem where the goal is to classify entire graphs based on features derived from their topology/structure. In contrast to the vertex classification task, where each dataset is a single graph, here each dataset consists of many graphs of varying size and their associated ground-truth class labels. Our hypothesis is that the vertices with high group centrality identified by GED-Walk capture rich information about the graph structure and can be used to derive features that are useful for graph classification.

To extract features based on GED-Walk, we first obtain the group of vertices with the highest (approximate) group centrality score. The group centrality score itself is the first feature we extract. In addition, we summarize the marginal gains of all remaining vertices in the graph in a histogram with a fixed number of bins. We concatenate these features to get a feature vector for each graph in the dataset. These features are useful since graphs with similar structure will have similar group scores and marginal gains. We denote this base case by GED.
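The GED feature construction (group score plus a histogram of marginal gains) can be sketched as follows. The function name and the size normalization are our own illustrative choices; `score` and `gains` are assumed to come from the GED-Walk maximization routine:

```python
import numpy as np

def ged_features(score, gains, n_bins=10):
    # Concatenate the group centrality score with a fixed-size histogram
    # of the marginal gains of the remaining vertices.
    hist, _ = np.histogram(gains, bins=n_bins)
    # Normalize by the number of remaining vertices so that graphs of
    # different sizes yield comparable feature vectors.
    hist = hist / max(gains.size, 1)
    return np.concatenate(([score], hist))

# Illustrative inputs: a group score of 2.5 and four marginal gains.
feats = ged_features(2.5, np.array([0.1, 0.2, 0.3, 0.4]), n_bins=2)
```

The resulting vector has length `1 + n_bins` for every graph, which is what allows a standard classifier to consume graphs of varying size.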

In addition, we obtain the topic-sensitive PageRank vector of each graph, where we specify the teleport set to be equal to the vertices in the group with the highest (approximate) group centrality. Then we summarize this vector (a) using a histogram with a fixed number of bins and (b) by extracting the top values, denoted by PPR-H and PPR-T, respectively. Intuitively, these features capture the amount of diffusion on the graph.
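Topic-sensitive PageRank with a restricted teleport set can be approximated by power iteration. This is a hedged sketch (the parameter defaults are assumptions, not the paper's settings, and the function name is ours):

```python
import numpy as np

def personalized_pagerank(A, teleport_set, alpha=0.85, iters=200):
    # Power iteration: pr <- alpha * P^T pr + (1 - alpha) * v,
    # where v is uniform over the teleport set only.
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = A / np.where(deg > 0, deg, 1.0)[:, None]   # row-stochastic transitions
    v = np.zeros(n)
    v[list(teleport_set)] = 1.0 / len(teleport_set)
    pr = v.copy()
    for _ in range(iters):
        pr = alpha * (P.T @ pr) + (1 - alpha) * v
    return pr

# Triangle graph; teleport only to vertex 0 (standing in for the central group).
A = np.ones((3, 3)) - np.eye(3)
pr = personalized_pagerank(A, {0})
```

The PPR-H and PPR-T summaries are then `np.histogram(pr, bins=b)` and the largest entries of `pr`, analogously to the GED features above.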

As a strong baseline, we compute the eigenvalues of the adjacency matrix and summarize them (a) in a histogram with a fixed number of bins and (b) by extracting the top eigenvalues, denoted by Eig-H and Eig-T, respectively. This is inspired by recent work on graph classification [22] showing that spectral features can outperform deep-learning-based approaches. The ability to efficiently compute the eigenvalue histograms further motivates using them as features for graph classification [17]. Similarly, a strong advantage of the features based on GED-Walk is that we can compute them efficiently. Last, we also combine the spectral and GED-based features by concatenation; i. e., Eig-T+GED denotes the combination of Eig-T and GED features.
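The Eig-T and Eig-H baselines can be sketched as follows. The fixed eigenvalue range for the histogram is an illustrative assumption (some common range is needed to make histograms comparable across graphs), and the zero-padding for small graphs is our own choice:

```python
import numpy as np

def spectral_features(A, k=3, n_bins=10, ev_range=(-3.0, 3.0)):
    # Eig-T: the top-k adjacency eigenvalues (zero-padded for small graphs).
    # Eig-H: a fixed-range histogram of the full spectrum, normalized by size.
    ev = np.linalg.eigvalsh(A)                 # A symmetric -> real spectrum
    top = np.sort(ev)[::-1][:k]
    eig_t = np.pad(top, (0, k - top.size))
    hist, _ = np.histogram(ev, bins=n_bins, range=ev_range)
    eig_h = hist / ev.size
    return eig_t, eig_h

# Path graph on 3 vertices; its spectrum is {-sqrt(2), 0, sqrt(2)}.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
eig_t, eig_h = spectral_features(A)
```

Concatenating `eig_t` or `eig_h` with the GED-based vectors gives the combined variants (e.g., Eig-T+GED) evaluated in Table 1.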

In the following experiments we fix the values of the hyper-parameters (group size, number of histogram bins, and number of top values); in practice, these parameters can also be tuned, e. g., using cross-validation. We split the data into an 80% training and a 20% test set, and average the results over 10 independent random splits.
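The evaluation protocol (an 80/20 split averaged over 10 independent random splits) can be sketched as follows; the function name and the seed are illustrative, and in practice one would typically use a library helper such as scikit-learn's split utilities:

```python
import numpy as np

def random_splits(n, train_frac=0.8, n_splits=10, seed=0):
    # Yield (train, test) index arrays for repeated random splits.
    rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        cut = int(train_frac * n)
        yield perm[:cut], perm[cut:]

splits = list(random_splits(10))
```

Averaging accuracy over all ten splits reduces the variance caused by any single lucky or unlucky partition of the (often small) graph datasets.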

Table 1 summarizes the graph classification results. We observe that enhancing the baseline features with our GED-Walk-based features improves the classification performance on all datasets, with the variants using all available features (i. e., Eig+GED+PPR) performing best. Moreover, as shown in Table 6 in the appendix, using the most central group as determined by GED-Walk as the teleport set yields performance improvements over standard PageRank with teleportation to all vertices. See Table 3 in the appendix for an overview of the datasets used. In summary, we have shown that GED-Walk captures meaningful information about the graph structure that is complementary to baseline spectral features. We argue that GED-Walk can be used as a relatively inexpensive additional source of information to enhance any existing graph classification model.

Dataset            ENZ.    IMD.    Mut.    PRO.    RED.
Eig-T              23.02   56.59   56.90   73.35   75.31
Eig-H              23.47   70.28   68.88   72.42   72.02
Ged                19.18   60.64   64.51   71.95   70.59
Ged+PPR-H†         20.85   65.18   65.46   72.18   71.59
Ged+PPR-H‡         20.39   66.27   65.86   72.44   75.95
Eig-T+Ged-T        26.46   63.56   64.14   74.06   80.18
Eig-H+Ged-H        23.14   69.74   69.08   73.13   75.16
Eig-T+Ged+PPR-T†   27.45   69.25   62.78   73.54   76.70
Eig-H+Ged+PPR-H†   24.12   71.62   69.18   73.10   74.17
Eig-T+Ged+PPR-T‡   27.88   68.53   62.43   73.72   80.48
Eig-H+Ged+PPR-H‡   24.78   70.54   68.81   72.97   81.43
†: PageRank teleport probability 0.85; ‡: PageRank teleport probability 0.15
Table 1: Graph classification accuracy (in %). Best performance per dataset marked in bold.

5 Related Work

Several algorithms to maximize group centralities are found in the literature, and many of them are based on greedy approaches. Chehreghani et al. [13] provide an extensive analysis of several algorithms for group betweenness centrality estimation, and present a new algorithm based on an alternative definition of the distance between a vertex and a group of vertices. Greedy approximation algorithms to maximize group centrality also exist for measures like closeness (by Bergamini et al. [6]) and current-flow closeness (by Li et al. [30]). Puzis et al. [43] present an algorithm for group betweenness maximization that does not utilize submodular approximation but relies on a branch-and-bound approach.

Alternative group centrality measures were introduced by Ishakian et al. [25], Puzis et al. [42], and Fushimi et al. [21]. Ishakian et al. defined single-vertex centrality measures based on a generic concept of path, and generalized them to groups of vertices. However, in contrast to GED-Walk, those measures are defined for directed acyclic graphs only. Puzis et al. defined the path betweenness centrality of a group of vertices by counting the fraction of shortest paths that cross all vertices within the group. The proposed algorithms to find a group with high path betweenness are quadratic in the best case (both in time and memory) and therefore cannot scale to large networks. In fact, the largest graphs considered in their paper have only 500 vertices. Fushimi et al. proposed a new measure called connectedness centrality that targets the specific problem of identifying appropriate locations to place evacuation facilities on road networks. For this reason, their measure assumes the existence of edge failure probabilities and is not applicable to general graphs. Due to expensive Monte Carlo simulation, computing their measure also requires many hours on graphs with 100,000 edges, while GED-Walk scales to hundreds of millions of edges.

Network design is a different strategy to address the group centrality maximization problem when the input graph is allowed to be altered. Medya et al. [33] introduced two algorithms that optimize the increment of the coverage centrality of a given group of vertices by iteratively adding new edges to the input graph.

In the context of single-vertex centrality, there exist multiple algebraic measures that do not depend solely on shortest paths. Besides the aforementioned Katz centrality, some examples are current-flow closeness [10], PageRank, flow betweenness [20], and random-walk betweenness [39].

6 Conclusions

We have presented a new group centrality measure, GED-Walk, together with efficient approximation algorithms for maximizing it and for computing the value of a given group. GED-Walk’s descriptive power is demonstrated by experiments on two fundamental graph mining tasks: both semi-supervised vertex classification and graph classification benefit from the new measure. As GED-Walk can be optimized faster than earlier measures, it is often a viable replacement for more expensive measures in performance-sensitive applications.

More precisely, in terms of running time, our algorithm for GED-Walk maximization significantly outperforms the state-of-the-art algorithms for maximizing group closeness and group betweenness centrality when group sizes are at most 100. The fact that our algorithm scales worse w. r. t. the group size may seem like a limitation; however, we expect that many applications are interested in group sizes considerably smaller than that.

Experiments on synthetic networks indicate that our algorithm for GED-Walk maximization scales linearly with the number of vertices. For the largest graphs in this experiment, it needs up to half an hour – often less. In fact, our algorithm can maximize GED-Walk for small groups on real-world graphs with hundreds of millions of edges within a few minutes. In future work, we plan to apply GED-Walk to network and traffic monitoring [16, 41]; to developing immunization strategies that lower the vulnerability of a network to epidemic outbreaks [40]; and to improving landmark-based shortest path queries [23].

References

  • [1] Abboud, A., Grandoni, F., and Williams, V. V. Subcubic equivalences between graph centrality problems, APSP and diameter. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015 (2015), P. Indyk, Ed., SIAM, pp. 1681–1697.
  • [2] Avrachenkov, K., Gonçalves, P., and Sokol, M. On the choice of kernel and labelled data in semi-supervised learning methods. In International Workshop on Algorithms and Models for the Web-Graph (2013), Springer, pp. 56–67.
  • [3] Bader, D. A., Cong, G., and Feo, J. On the architectural requirements for efficient execution of graph algorithms. In Parallel Processing, 2005. ICPP 2005. International Conference on (2005), IEEE, pp. 547–556.
  • [4] Barabási, A.-L., and Albert, R. Emergence of scaling in random networks. Science 286, 5439 (1999), 509–512.
  • [5] Bergamini, E., Borassi, M., Crescenzi, P., Marino, A., and Meyerhenke, H. Computing top-k closeness centrality faster in unweighted graphs. ACM Transactions on Knowledge Discovery from Data (TKDD) 13, 5 (2019), 53.
  • [6] Bergamini, E., Gonser, T., and Meyerhenke, H. Scaling up group closeness maximization. In 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX) (2018), SIAM, pp. 209–222. Updated version: https://arxiv.org/abs/1710.01144.
  • [7] Boldi, P., and Vigna, S. Axioms for centrality. Internet Mathematics 10, 3-4 (2014), 222–262.
  • [8] Borassi, M., Crescenzi, P., and Habib, M. Into the square: On the complexity of some quadratic-time solvable problems. Electr. Notes Theor. Comput. Sci. 322 (2016), 51–67.
  • [9] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics 21, suppl_1 (2005), i47–i56.
  • [10] Brandes, U., and Fleischer, D. Centrality measures based on current flow. In Proceedings of the 22nd Annual Symposium on Theoretical Aspects of Computer Science, STACS 2005 (2005), vol. 3404 of LNCS, Springer, pp. 533–544.
  • [11] Chakrabarti, D., Zhan, Y., and Faloutsos, C. R-mat: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining (2004), SIAM, pp. 442–446.
  • [12] Chapelle, O., Schölkopf, B., and Zien, A., Eds. Semi-Supervised Learning. MIT Press, 2006.
  • [13] Chehreghani, M. H., Bifet, A., and Abdessalem, T. An in-depth comparison of group betweenness centrality estimation algorithms. In 2018 IEEE International Conference on Big Data (Big Data) (2018), IEEE, pp. 2104–2113.
  • [14] Chen, C., Wang, W., and Wang, X. Efficient maximum closeness centrality group identification. In ADC (2016), vol. 9877 of Lecture Notes in Computer Science, Springer, pp. 43–55.
  • [15] Dobson, P. D., and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330, 4 (2003), 771–783.
  • [16] Dolev, S., Elovici, Y., Puzis, R., and Zilberman, P. Incremental deployment of network monitors based on group betweenness centrality. Information Processing Letters 109, 20 (2009), 1172–1176.
  • [17] Dong, K., Benson, A. R., and Bindel, D. Network density of states. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA, 2019), KDD ’19, ACM, pp. 1152–1161.
  • [18] Everett, M. G., and Borgatti, S. P. The centrality of groups and classes. The Journal of Mathematical Sociology 23, 3 (1999), 181–201.
  • [19] Feng, B. Q. Equivalence constants for certain matrix norms. Linear Algebra and its Applications 374 (2003), 247 – 253.
  • [20] Freeman, L. C., Borgatti, S. P., and White, D. R. Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks 13, 2 (1991), 141 – 154.
  • [21] Fushimi, T., Saito, K., Ikeda, T., and Kazama, K. A new group centrality measure for maximizing the connectedness of network under uncertain connectivity. In International Conference on Complex Networks and their Applications (2018), Springer, pp. 3–14.
  • [22] Galland, A., and Lelarge, M. Invariant embedding for graph classification. ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Representations (2019).
  • [23] Goldberg, A. V., and Harrelson, C. Computing the shortest path: A* search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (2005), Society for Industrial and Applied Mathematics, pp. 156–165.
  • [24] Goundan, P. R., and Schulz, A. S. Revisiting the greedy approach to submodular set function maximization. Optimization online (2007), 1–25.
  • [25] Ishakian, V., Erdös, D., Terzi, E., and Bestavros, A. A framework for the evaluation and management of network centrality. In Proceedings of the 2012 SIAM International Conference on Data Mining (2012), SIAM, pp. 427–438.
  • [26] Katz, L. A new status index derived from sociometric analysis. Psychometrika 18, 1 (1953), 39–43.
  • [27] Kazius, J., McGuire, R., and Bursi, R. Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry 48, 1 (2005), 312–320.
  • [28] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • [29] Kunegis, J. KONECT: the koblenz network collection. In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, Companion Volume (2013), International World Wide Web Conferences Steering Committee / ACM, pp. 1343–1350.
  • [30] Li, H., Peng, R., Shan, L., Yi, Y., and Zhang, Z. Current flow group closeness centrality for complex networks? In The World Wide Web Conference (2019), ACM, pp. 961–971.
  • [31] Lumsdaine, A., Gregor, D., Hendrickson, B., and Berry, J. Challenges in parallel graph processing. Parallel Processing Letters 17, 01 (2007), 5–20.
  • [32] Mahmoody, A., Tsourakakis, C. E., and Upfal, E. Scalable betweenness centrality maximization via sampling. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), ACM, pp. 1765–1773.
  • [33] Medya, S., Silva, A., Singh, A., Basu, P., and Swami, A. Group centrality maximization via network design. In Proceedings of the 2018 SIAM International Conference on Data Mining (2018), SIAM, pp. 126–134.
  • [34] Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques. Springer, 1978, pp. 234–243.
  • [35] Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrák, J., and Krause, A. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
  • [36] Murphy, R. C., Wheeler, K. B., Barrett, B. W., and Ang, J. A. Introducing the graph 500. Cray Users Group (CUG) 19 (2010), 45–74.
  • [37] Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming 14, 1 (1978), 265–294.
  • [38] Newman, M. Networks, 2nd ed. Oxford University Press, 2018.
  • [39] Newman, M. J. A measure of betweenness centrality based on random walks. Social Networks 27, 1 (2005), 39 – 54.
  • [40] Pastor-Satorras, R., and Vespignani, A. Immunization of complex networks. Physical review E 65, 3 (2002), 036104.
  • [41] Puzis, R., Altshuler, Y., Elovici, Y., Bekhor, S., Shiftan, Y., and Pentland, A. Augmented betweenness centrality for environmentally aware traffic monitoring in transportation networks. Journal of Intelligent Transportation Systems 17, 1 (2013), 91–105.
  • [42] Puzis, R., Elovici, Y., and Dolev, S. Fast algorithm for successive group betweenness centrality computation. Physical Review E 76 (2007), 056709.
  • [43] Puzis, R., Elovici, Y., and Dolev, S. Finding the most prominent group in complex networks. AI communications 20, 4 (2007), 287–296.
  • [44] Riesen, K., and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In Structural, Syntactic, and Statistical Pattern Recognition (Berlin, Heidelberg, 2008), N. da Vitoria Lobo, T. Kasparis, F. Roli, J. T. Kwok, M. Georgiopoulos, G. C. Anagnostopoulos, and M. Loog, Eds., Springer Berlin Heidelberg, pp. 287–297.
  • [45] Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. Brenda, the enzyme database: updates and major new developments. Nucleic acids research 32, suppl_1 (2004), D431–D433.
  • [46] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine 29, 3 (2008), 93–93.
  • [47] Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
  • [48] Staudt, C. L., Sazonovs, A., and Meyerhenke, H. Networkit: A tool suite for large-scale complex network analysis. Network Science 4, 4 (2016), 508–530.
  • [49] van der Grinten, A., Bergamini, E., Green, O., Bader, D. A., and Meyerhenke, H. Scalable katz ranking computation in large static and dynamic graphs. In 26th Annual European Symposium on Algorithms, ESA 2018, August 20-22, 2018, Helsinki, Finland (2018), Y. Azar, H. Bast, and G. Herman, Eds., vol. 112 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, pp. 42:1–42:14.
  • [50] von Looz, M., Özdayi, M. S., Laue, S., and Meyerhenke, H. Generating massive complex networks with hyperbolic geometry faster in practice. In 2016 IEEE High Performance Extreme Computing Conference (HPEC) (2016), IEEE, pp. 1–6.
  • [51] Yanardag, P., and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2015), KDD ’15, ACM, pp. 1365–1374.
  • [52] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in neural information processing systems (2004), pp. 321–328.

Appendix A Technical Details

The following definition is well-known in the literature [submodularity, monotonicity]: Let $V$ be a non-empty finite set.

  • A set function $f : 2^V \to \mathbb{R}$ is called submodular if for every $A \subseteq B \subseteq V$ and every $x \in V \setminus B$ we have that $f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$. In this context, the value $f(A \cup \{x\}) - f(A)$ is called the marginal gain of $x$ (w. r. t. the set $A$).

  • A set function