1 Introduction
Large network data sets have become abundant in real-world applications (we use the terms “graph” and “network” interchangeably in this paper); consider online social networks, web and co-authorship graphs, transportation and biological networks [38], to name just a few. Network analysis techniques have become very important over the last one or two decades for gaining insights from these data [38]. One particularly interesting class of analysis methods are centrality measures: they assign each vertex a numerical score that indicates the importance of the vertex based on its position in the network. A few very popular centrality measures are betweenness, closeness, degree, eigenvector, PageRank, and Katz centrality [7, 38].
The notion of centrality has subsequently been extended to groups of vertices [18]; this extension makes it possible to determine the importance of whole groups, not only of individuals and, vice versa, to determine important groups. The latter question leads to an optimization problem: find the group of vertices of a given size that maximizes a certain group centrality measure. It is easy to imagine applications for group centrality, e. g., facility location and influence maximization.
Motivation. Such optimization problems are often hard; for example, this has been proven for group betweenness [16], group closeness [14] and even group degree. Since exact results may not be necessary in real-world applications, approximation is a reasonable way of dealing with complexity. For example, in the case of group betweenness, Mahmoody et al. [32] designed a nearly-linear time (in and ) approximation algorithm based on random sampling, which has been shown to outperform its competitors while having similar accuracy. Yet, due to constant factors, it is not applicable to really large real-world instances; also (and in particular), the running time depends quadratically on . Indeed, the largest instances considered in their paper have around half a million edges. While group closeness approximation can already be done faster [6], in the context of big data there is still a need for a meaningful group centrality measure with a truly scalable algorithm for its optimization.
Contribution. In this paper, we introduce a new centrality measure called EDWalk for exponentially decaying walk (see Section 2.1). While being inspired by Katz centrality, it allows a more natural extension to the group case, then called GEDWalk. GEDWalk takes walks of any length into account, but considers shorter ones more important. We develop an algorithm to approximate the GEDWalk score of a given group in nearly-linear time. As our measure is submodular (see Section 2.2), we can design a greedy algorithm to approximate a group with maximum GEDWalk score (in Section 3). To allow the greedy algorithm to quickly find the top-1 vertex with highest marginal gain, we adapt a top-k centrality approach. In our experiments, our algorithms are two orders of magnitude faster than the state of the art for group betweenness approximation [32]. Compared to group closeness approximation [6], we achieve a speedup of one to two orders of magnitude for groups with at most 100 vertices. Furthermore, our experiments indicate a linear running time behavior in practice (despite a higher worst-case time complexity) for greedy GEDWalk approximation. Finally, we show some potential applications of GEDWalk: (i) choosing a training set with high group centrality leads to improved performance on the semi-supervised vertex classification task; and (ii) features derived from GEDWalk improve graph classification performance.
2 Problem Definition and Properties
In this paper, we deal with unweighted graphs $G = (V, E)$; they can be directed or undirected. By $A$, we denote the adjacency matrix of $G$.
2.1 Group Centrality Measures.
We start by reviewing existing centrality measures and by defining the new EDWalk; our main focus is then on the group centrality derived from EDWalk.
Closeness and betweenness, two of the most popular centrality measures, are both based on shortest paths. More precisely, betweenness is defined as the fraction of shortest paths that cross a certain vertex (or a certain group of vertices, for group betweenness). Hence, betweenness can be computed using all-pairs shortest-path (APSP) techniques and, interestingly, a subcubic algorithm for betweenness (on general graphs) would also imply a faster algorithm for APSP [1]. Furthermore, the latter result holds true even if one considers the problem of computing the betweenness centrality of only a single vertex of the graph.
Closeness, on the other hand, is defined as the reciprocal of the average distance from a given vertex (or a group of vertices, for group closeness) to all other vertices of the graph. Hence, computing the closeness of a single vertex (or a single group of vertices) is easy; nonetheless, unless the strong exponential time hypothesis (SETH) fails, there is no subquadratic algorithm to find a vertex with maximal closeness [8].
To avoid those complexity-theoretic barriers, instead of deriving our measure from betweenness or closeness, we will derive it from Katz centrality (KC). KC is another popular centrality measure but, in contrast to measures based on shortest paths, it can be computed much more efficiently. Let $\omega_i(v)$ be the number of walks of length $i$ that start at vertex $v \in V$ (in our notation, we omit the dependence of $\omega_i$ on $G$; obviously, $\omega_i(v)$ potentially depends on the entire graph). (Katz centrality is sometimes alternatively defined based on walks ending at a given vertex; from an algorithmic perspective, however, this difference is irrelevant.) For a given parameter $\alpha \geq 0$ (which needs to be chosen small enough [26], see also Proposition 2.2), the Katz centrality $c_K(v)$ of $v$ is defined as:
(1) $c_K(v) = \sum_{i=1}^{\infty} \alpha^i \, \omega_i(v)$
Note that Katz centrality does not only consider shortest paths but walks of any length. Nevertheless, because the contribution of a walk decreases exponentially with its length, short walks are still preferred. As Katz centrality does not rely on APSP, it can be computed faster than betweenness. For example, following from the algebraic formulation [26], iterative linear solvers (e. g., conjugate gradient) can be used to compute the Katz scores of all vertices at the same time. Furthermore, efficient nearly-linear time Katz approximation and ranking algorithms exist [49]. In fact, the algorithms that we present in this paper will be based on those Katz algorithms.
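As a small illustration of the walk-counting view of Katz centrality, the following Python sketch accumulates the first few terms of the Katz series by iterating over walk lengths. It is not the solver-based or ranking approach cited above; the function name `katz_partial_sums`, the adjacency-dictionary input and the fixed number of terms are assumptions made for this example.

```python
def katz_partial_sums(adj, alpha, num_terms):
    """Truncated Katz scores: sum_{i=1..num_terms} alpha^i * (#walks of length i from v).

    adj maps every vertex to the list of its out-neighbours (hypothetical input format).
    """
    # walks[v] = number of walks of the current length that start at v;
    # there is exactly one walk of length 0 from every vertex.
    walks = {v: 1 for v in adj}
    score = {v: 0.0 for v in adj}
    for i in range(1, num_terms + 1):
        # A walk of length i from v is an edge (v, u) followed by a walk of length i-1 from u.
        walks = {v: sum(walks[u] for u in adj[v]) for v in adj}
        for v in adj:
            score[v] += alpha ** i * walks[v]
    return score
```

On the undirected path 0-1-2 with two terms and a decay of 0.1, the middle vertex receives the largest score because more short walks start there.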
In contrast to betweenness, however, replacing the single vertex $v$ in Eq. (1) by a group leads to a rather uninteresting measure: as $\omega_i(u)$ and $\omega_i(v)$ count distinct walks for $u \neq v$, such a “group Katz” measure would just be equal to the sum of individual Katz scores; a group with maximal score would just consist of the vertices appearing in a top-$k$ Katz ranking. Indeed, this contradicts one’s intuition of group centralities: parts of a graph that are “covered” by one vertex of the group should not need to be covered by a second group vertex. Fortunately, we can construct a centrality satisfying this intuition by replacing $\omega_i$ by a natural variant of it: instead of considering walks that start at a given vertex $v$, we consider all walks that cross vertex $v$. This leads to our definition of EDWalk: Let $\phi_i(S)$ be the number of $i$-walks (i. e., walks of length $i$) that contain at least one vertex of $S \subseteq V$. Then, the EDWalk (for exponentially decaying walk) centrality for $v \in V$ is
$ED(v) = \sum_{i=1}^{\infty} \alpha^i \, \phi_i(\{v\})$
where $\alpha \geq 0$ is a parameter of the centrality.
In the preceding definition, $\alpha$ needs to be chosen appropriately small so that the series converges. In Section 2.2 we will see that it is indeed possible to choose such an $\alpha$. As claimed above, EDWalk naturally generalizes to a group measure: [GEDWalk] The GEDWalk centrality of a group $S \subseteq V$ of vertices is given by:
(2) $GED(S) = \sum_{i=1}^{\infty} \alpha^i \, \phi_i(S)$
Two problems are specifically interesting in the context of group centralities: (i) the problem of computing the centrality of a given group and (ii) the problem of finding a group that maximizes the group centrality measure. We deal with both in this paper and in particular with the following optimization problem: [GEDWalk maximization] Given a graph $G = (V, E)$ and an integer $k \geq 1$, find
(3) $S^{*} = \operatorname{argmax}_{S \subseteq V,\ |S| = k} \; GED(S)$
Note again that Problem (3) is in general different from finding the $k$ vertices whose individual EDWalk centrality is highest (top-$k$ problem). In fact, for closeness centrality, it was shown empirically that the top-$k$ problem and the group closeness problem often have very different solutions [6]. One may expect a similar behavior here as well since a group with high score for GEDWalk is unlikely to have many members close to each other. Such a “close” group would share many walks, leading to a relatively small GEDWalk score.
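To make the difference between a group score and a sum of individual scores concrete, the following Python sketch evaluates a truncated version of the group measure by brute-force enumeration of walks. It is only meant for tiny graphs and is not the algorithm of this paper; the helper names `count_walks_hitting` and `ged_bruteforce`, the adjacency-dictionary input and the truncation after a fixed number of terms are assumptions of this example.

```python
def count_walks_hitting(adj, group, length):
    """Number of walks with `length` edges that contain at least one vertex of `group`."""
    group = set(group)

    def extend(v, remaining, hit):
        # `hit` records whether the walk built so far already touches the group.
        hit = hit or v in group
        if remaining == 0:
            return 1 if hit else 0
        return sum(extend(u, remaining - 1, hit) for u in adj[v])

    return sum(extend(v, length, False) for v in adj)


def ged_bruteforce(adj, group, alpha, num_terms):
    """Truncated group score: sum_{i=1..num_terms} alpha^i * count_walks_hitting(adj, group, i)."""
    return sum(alpha ** i * count_walks_hitting(adj, group, i)
               for i in range(1, num_terms + 1))
```

On the path 0-1-2, every walk that contains vertex 0 and has at least one edge also contains vertex 1, so adding vertex 0 to the group {1} does not change the score; the group score is therefore strictly smaller than the sum of the two individual scores.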
2.2 Properties of GEDWalk.
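Since GEDWalk is based on similar concepts as Katz centrality, it is not surprising that the two measures are closely related. Indeed, the convergence properties of Katz centrality and GEDWalk are identical: The following holds:
(i) $GED(V) = \sum_{v \in V} c_K(v)$, where $c_K(v)$ is the Katz centrality of $v$;
(ii) $GED(S)$ is finite for all $S \subseteq V$ iff $\alpha < 1/\sigma_{\max}$, where $\sigma_{\max}$ is the largest singular value of $A$.
Note that the equality of the first claim of Proposition 2.2 does not hold for groups smaller than $V$. To prove the first claim, it is enough to show that the walks that contribute to $\phi_i(V)$ and the walks that contribute to $\sum_{v \in V} \omega_i(v)$ are exactly the walks of length $i$ in $G$. Indeed, each walk of length $i$ contributes to $\phi_i(V)$. It also contributes to exactly one of the $\omega_i(v)$ for $v \in V$, namely to $\omega_i(s)$, where $s$ is the vertex at which the walk starts. After applying the first claim (and observing that $GED(V)$ is an upper bound on $GED(S)$, also see Proposition 2.2), the second one follows from the well-known fact that $c_K(v)$ is finite for all $v \in V$ iff $\alpha < 1/\sigma_{\max}$ [26].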
Now that we have established that GEDWalk is indeed well-defined, let us prove some basic properties (defined in Appendix A) of the $GED$ function:
GEDWalk is both non-decreasing and submodular as a set function. Monotonicity is obvious from Eq. (2): when the input set $S$ becomes larger, $\phi_i(S)$ cannot decrease. To see that GEDWalk is also submodular, consider the marginal gain $GED(S \cup \{v\}) - GED(S)$, i. e., the increase of the score if a vertex $v$ is added to the set $S$. This marginal gain is exactly the sum of the number of walks that contain $v$ but no vertex in $S$, with each walk weighted by a power of $\alpha$. As such, the marginal gain can only decrease if $S$ is replaced by a superset $S' \supseteq S$.
Given an algorithm to compute $GED(S)$, Proposition 2.2 would immediately allow us to construct a greedy algorithm to approximate Problem (3) (using a well-known theorem on submodular approximation [37], also see Proposition A in Appendix A). Nevertheless, as we will see in Section 3.2, we can do better than naively applying the greedy approach. Before discussing any algorithms, however, we will first demonstrate that maximizing GEDWalk to optimality is hard; hence, approximation is indeed appropriate to solve Problem (3).
Solving Problem (3) to optimality is hard.
Intuitively, Theorem 2.2 holds because if $\alpha$ is chosen small enough, GEDWalk degrades to group degree (or, equivalently, vertex cover; a set of vertices has a group degree of iff it is a vertex cover). Let $G = (V, E)$ be a graph. Let . We show that $G$ has a vertex cover of size $k$ if and only if there is a group of $k$ vertices with .
First, assume that has a vertex cover of size . To see now that a group with GED indeed exists, consider : since covers all edges, the contribution of the term corresponding to in Eq. (2) is already at least (independent of ).
Secondly, assume that a group with exists. Let denote the set of edges incident to at least one vertex in . is given by . Thus, to decide if is a vertex cover, it is sufficient to show that : in this case, needs to be as all walks of length together cannot match the contribution of any walk to . It holds that . A straightforward calculation shows that the latter term is smaller than if .
3 Algorithms for GEDWalk
3.1 Computing GEDWalk Centrality.
In this section, we discuss the problem of computing $GED(S)$ for a given group $S$. As we are not aware of an obvious translation of our definition to a closed-form expression, we employ an approximation algorithm that computes the infinite series up to an arbitrarily small additive error $\epsilon > 0$. To see how the algorithm works, let $\ell$ be a positive integer. We split the series into the first $\ell$ terms, $GED_{\leq \ell}(S) = \sum_{i=1}^{\ell} \alpha^i \phi_i(S)$, and a tail $GED_{> \ell}(S) = \sum_{i=\ell+1}^{\infty} \alpha^i \phi_i(S)$ consisting of an infinite number of terms. This allows us to reduce the problem of approximating $GED(S)$ to the problem of (exactly) computing the $\ell$-th partial sum $GED_{\leq \ell}(S)$ and to the problem of finding an upper bound on the tail $GED_{> \ell}(S)$. In particular, we are looking for an upper bound that converges to zero for increasing $\ell$. Given such an upper bound, our approximation algorithm chooses $\ell$ such that this bound is below the specified error threshold $\epsilon$. The algorithm returns $GED_{\leq \ell}(S)$ once such an $\ell$ is found.
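3.1.1 Computation of $GED_{\leq \ell}(S)$.
In order to compute the partial sum $GED_{\leq \ell}(S) = \sum_{i=1}^{\ell} \alpha^i \phi_i(S)$, it is obviously enough to compute $\phi_i(S)$ for $i \in \{1, \ldots, \ell\}$. The key insight to computing $\phi_i(S)$ efficiently is that it can be expressed as a series of recurrences (similar to Katz centrality): let us define $\phi_i^v(S)$ as the number of $i$-walks ending in $v$ and containing at least one vertex from $S$. Obviously, the walks that contribute to $\phi_i^v(S)$, $v \in V$, partition the walks that contribute to $\phi_i(S)$ in the sense that
(4) $\phi_i(S) = \sum_{v \in V} \phi_i^v(S)$
On the other hand, $\phi_i^v(S)$ can be expressed in terms of $\psi_i^v(S)$, i. e., the number of $i$-walks that end in $v$ but do not contain any vertex from $S$:
$\phi_i^v(S) = \begin{cases} \sum_{(u,v) \in E} \left( \phi_{i-1}^u(S) + \psi_{i-1}^u(S) \right) & \text{if } v \in S \\ \sum_{(u,v) \in E} \phi_{i-1}^u(S) & \text{if } v \notin S \end{cases}$
This takes advantage of the fact that any $i$-walk is the concatenation of an $(i-1)$-walk and a single edge. The expression considers $(i-1)$-walks which do not contain a vertex in $S$ only if the concatenated edge ends in $S$. Finally, $\psi_i^v(S)$ can similarly be expressed as the recurrence:
$\psi_i^v(S) = \begin{cases} 0 & \text{if } v \in S \\ \sum_{(u,v) \in E} \psi_{i-1}^u(S) & \text{if } v \notin S \end{cases}$
In the $v \in S$ case, all walks ending in $v$ have a vertex in $S$, so no walks contribute to $\psi_i^v(S)$. We remark that the base cases of $\phi$ and $\psi$ can easily be computed directly from $G$. Collectively, this yields an $O(\ell (|V| + |E|))$ time algorithm for computing the GEDWalk score of a given group. We also note that the computation of $\phi_i^v(S)$ and $\psi_i^v(S)$ can be trivially parallelized by computing the values of those functions for multiple vertices $v$ in parallel.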
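The recurrences described in this subsection can be sketched in a few lines of Python. The sketch keeps two counters per vertex, one for walks ending there that already contain a group vertex and one for walks that do not, and advances both by one edge per iteration. The function name `ged_partial_sum`, the dictionary of in-neighbours and the choice of walks of length 0 as the base case are assumptions of this example, not details taken from the paper's implementation.

```python
def ged_partial_sum(in_nbrs, group, alpha, num_terms):
    """First `num_terms` terms of the group score, computed via per-vertex recurrences.

    in_nbrs maps every vertex v to the list of vertices u with an edge (u, v);
    for an undirected graph this is simply the neighbour list (hypothetical format).
    """
    group = set(group)
    # Base case (assumption): the walk of length 0 at v contains a group vertex iff v is in the group.
    hit = {v: 1 if v in group else 0 for v in in_nbrs}    # walks ending in v that touch the group
    miss = {v: 0 if v in group else 1 for v in in_nbrs}   # walks ending in v that avoid the group
    score = 0.0
    for i in range(1, num_terms + 1):
        new_hit, new_miss = {}, {}
        for v in in_nbrs:
            hit_in = sum(hit[u] for u in in_nbrs[v])
            miss_in = sum(miss[u] for u in in_nbrs[v])
            if v in group:
                # Every walk that ends in a group vertex contains a group vertex.
                new_hit[v], new_miss[v] = hit_in + miss_in, 0
            else:
                new_hit[v], new_miss[v] = hit_in, miss_in
        hit, miss = new_hit, new_miss
        score += alpha ** i * sum(hit.values())
    return score
```

Each iteration touches every edge once, so the work per term is linear in the size of the graph, and the inner loop over vertices is the part that could be distributed over threads.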
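3.1.2 Bounding $GED_{> \ell}(S)$.
To complete our algorithm, we need a sequence of upper bounds on the tail $GED_{> \ell}(S) = \sum_{i=\ell+1}^{\infty} \alpha^i \phi_i(S)$ such that the upper bound converges to zero for $\ell \to \infty$. Using Proposition 2.2, we can lift tail bounds of Katz centrality to GEDWalk. In particular, applying the first claim of this proposition shows:
(5) $GED_{> \ell}(S) \leq GED_{> \ell}(V) = \sum_{v \in V} \sum_{i=\ell+1}^{\infty} \alpha^i \, \omega_i(v)$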
Bounds for the latter term can be found in the literature on Katz centrality. For example, in [49], it was shown that , for . If we apply Eq. (5), this yields the bound
(6) 
for GEDWalk; we call this bound the combinatorial bound (due to the nature of the proof in [49]). Here, denotes the maximum degree of any vertex in . We generalize this statement to arbitrary (i. e., ), at the cost of a factor of . For completeness, we show the proof in the case of .
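It holds that:
(7) $GED_{> \ell}(S) \leq |V|^{3/2} \cdot \frac{(\alpha \sigma_{\max})^{\ell+1}}{1 - \alpha \sigma_{\max}}$
We call the bound of Eq. (7) the spectral bound. First note that $\sum_{v \in V} \omega_i(v) = \|A^i \mathbf{1}\|_1$, where $\mathbf{1}$ denotes the vector containing all ones in $\mathbb{R}^{|V|}$ and $\|\cdot\|_1$ denotes the 1-norm. Furthermore, let $\|\cdot\|_1$ also denote the induced norm on $\mathbb{R}^{|V| \times |V|}$-valued matrices. Since the induced matrix norm is compatible with the vector norm, it holds that $\|A^i \mathbf{1}\|_1 \leq \|A^i\|_1 \, \|\mathbf{1}\|_1 = |V| \cdot \|A^i\|_1 \leq |V|^{3/2} \, \|A\|_2^i$. Here, the last inequality uses the fact that $\|M\|_1 \leq \sqrt{|V|} \, \|M\|_2$ for $|V| \times |V|$ matrices $M$ [19] and the fact that $\|\cdot\|_2$ is submultiplicative. Furthermore, it is well-known that $\|A\|_2 = \sigma_{\max}$; hence $\sum_{v \in V} \omega_i(v) \leq |V|^{3/2} \sigma_{\max}^i$. Substituting this into Eq. (5), we get: $GED_{> \ell}(S) \leq |V|^{3/2} \sum_{i=\ell+1}^{\infty} (\alpha \sigma_{\max})^i$. After rewriting the geometric series $\sum_{i=\ell+1}^{\infty} (\alpha \sigma_{\max})^i = (\alpha \sigma_{\max})^{\ell+1} / (1 - \alpha \sigma_{\max})$, we obtain the statement of the lemma.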
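The way a geometric tail bound determines the number of terms can be illustrated as follows. The sketch assumes a bound of the form c * (alpha * sigma_max)^(l+1) / (1 - alpha * sigma_max), which is what summing a geometric series with ratio alpha * sigma_max yields; the graph-dependent constant `c` is left as a parameter because its exact value depends on the bound being used. The function name `truncation_length` and the safety limit `max_len` are hypothetical.

```python
def truncation_length(c, alpha, sigma_max, eps, max_len=10_000):
    """Smallest walk length l >= 1 whose assumed geometric tail bound drops below eps."""
    rate = alpha * sigma_max
    if not 0 <= rate < 1:
        raise ValueError("the series only converges for alpha * sigma_max < 1")
    for ell in range(1, max_len + 1):
        # Tail of the geometric series starting at exponent ell + 1, scaled by c.
        tail_bound = c * rate ** (ell + 1) / (1 - rate)
        if tail_bound < eps:
            return ell
    raise RuntimeError("no suitable length found below max_len")
```

Because the bound shrinks by the factor alpha * sigma_max with every additional term, the required length grows only logarithmically in c / eps for a fixed ratio.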
3.1.3 Complexity Analysis.
Algorithm 1 shows the pseudocode of the algorithm. Line 4 computes using Equation (4) and the recurrences for and from Section 3.1.1. Assuming that and are stored, this can be done in time . Line 6 computes an upper bound on ; we use either the combinatorial or the spectral bound from Section 3.1.2.
The score of a given group can be approximated up to an additive error of in iterations. We assume that the bound of Lemma 3.1.2 is used in the algorithm (similar results can be obtained for other bounds). By construction, the algorithm terminates once . Note that ; thus, we can bound the entire term by . A straightforward calculation shows that for , the previous term becomes smaller than . The score of a given group can be approximated up to an additive error of in time .
3.2 Maximizing GEDWalk Centrality.
Let $\Delta(S, v)$ denote the marginal gain of a vertex $v$ w. r. t. a group $S$; in other words, $\Delta(S, v) = GED(S \cup \{v\}) - GED(S)$. As observed in Section 2.2, the properties given by Proposition 2.2 imply that a greedy algorithm that successively picks a vertex with highest marginal gain yields a $(1 - 1/e)$-approximation for Problem (3). Note, however, that we do not have an algorithm to compute $\Delta(S, v)$ exactly. While we could use the approximation algorithm of Section 3.1 to feed approximate values of $\Delta(S, v)$ into a greedy algorithm, this procedure would involve more work than necessary: indeed, the greedy algorithm does not care about the value of $\Delta(S, v)$ at all, it just needs to ascertain that there is no $u \in V \setminus S$ with $\Delta(S, u) > \Delta(S, v)$.
This consideration is similar to a top-1 centrality problem (top-1 algorithms have been developed for multiple centrality measures, e. g., in [5]). Hence, we adapt the ideas of the top-$k$ Katz ranking algorithm that was introduced in [49] to the case of $\Delta(S, v)$. (Note that the vertex $v$ that maximizes $\Delta(S, v)$ is exactly the vertex that maximizes $GED(S \cup \{v\})$. However, algorithmically, it is advantageous to deal with $\Delta(S, v)$, as it allows us to construct a lazy greedy algorithm, see Section 3.2.2.) Applied to the marginal gain of GEDWalk, the main ingredients of this algorithm are families $L_\ell(S, v)$ and $U_\ell(S, v)$ of lower and upper bounds on $\Delta(S, v)$, satisfying the following definition: We say that families of functions $L_\ell$ and $U_\ell$ are suitable bounds on the marginal gain of GEDWalk if the following conditions are all satisfied:
(i) $L_\ell(S, v) \leq \Delta(S, v) \leq U_\ell(S, v)$ for all $S \subseteq V$ and $v \in V \setminus S$
(ii) $\lim_{\ell \to \infty} L_\ell(S, v) = \lim_{\ell \to \infty} U_\ell(S, v) = \Delta(S, v)$
(iii) $L_\ell$ and $U_\ell$ are non-increasing in $S$
In addition to the lower and upper bounds $L_\ell$ and $U_\ell$, we need the following definition: Let $\epsilon > 0$ and fix some $\ell$. Let $u, v \in V \setminus S$ be two vertices. If $L_\ell(S, u) \geq U_\ell(S, v) - \epsilon$, we say that $u$ is $\epsilon$-separated from $v$.
Given the definition, it is easy to see that if there is a vertex $u$ that is $\epsilon$-separated from all other $v$, then either $u$ is the vertex with top-1 marginal gain or the marginal gain of the top-1 vertex is at most $\Delta(S, u) + \epsilon$ (in fact, this follows directly from the first property of Definition 5). Note that the introduction of $\epsilon$ is required here to guarantee that separation can be achieved for finite $\ell$ even if $u$ and $v$ have truly identical marginal gains. Furthermore, while the first condition of Definition 5 is required for separation to work, the second and third conditions are required for the correctness of the algorithm that we construct in Section 3.2.2.
3.2.1 Construction of and .
In order to apply the methodology from Section 3.2, we need suitable families and of bounds. Luckily, it turns out that we can reuse the bounds developed in Section 3.1.2. Indeed, it holds that: . In the above calculation, because is nondecreasing as a set function. Hence yields a family of lower bounds on the marginal gain of . On the other hand, . Thus, a family of upper bounds on the marginal gain of is given by , where denotes either the combinatorial or the spectral bound developed in Section 3.2.
Let . The two families
form suitable families of bounds on the marginal gain of . The discussion at the beginning of this section already implies that properties (i) and (ii) of Definition 5 hold. For seeing that (iii) also holds, it is enough to show that is nonincreasing as a set function; however, this directly follows from the proof of Proposition 2.2).
3.2.2 LazyGreedy Algorithm.
Taking advantage of the ideas from Section 3.2.1, we can construct a greedy approximation algorithm for GEDWalk maximization. To reduce the number of evaluations of our objectives (without impacting the solution or the (worstcase) time complexity), we use a lazy strategy inspired by the wellknown lazy greedy algorithm for submodular approximation [34]. However, in contrast to the standard lazy greedy algorithm, we do not evaluate our submodular objective function directly; instead, we apply Definition 3.2 to find the top1 vertex with highest marginal gain. To this end, instead of lazily evaluating for vertices , we lazily evaluate and . In fact, to apply Definition 3.2, we need to find the vertex that maximizes and the vertex that maximizes . Hence, we rank all vertices according to priorities and and lazily compute the true values of and until and are identified.
Algorithm 2 depicts the pseudocode of this algorithm. Procedure lazyUpdate (line 5) lazily updates the priorities and until the exact values of and are known for the topmost element of the given priority queue. In line 22, the resulting vertices are checked for separation. It is necessary to achieve an separation (and not only an separation) to guarantee that the total absolute error is below , even after all iterations. If separation fails, is incremented in line 27 and we reset and for all in line 14; otherwise, (i. e., the vertex with top1 marginal gain) is added to .
If and are suitable families of bounds, Algorithm 2 computes a group such that , where is the group with highest score. First, note that it is indeed sufficient to only consider the vertices with highest and for separation: if those vertices are not separated, is not the top1 vertex with highest marginal gain (up to ). Due to property (ii) of Definition 5, and will eventually be separated and the algorithm terminates. Hence, it is enough to show that the algorithm indeed finds the vertices with highest and . This follows from the invariant that and . The invariant obviously holds after each reset of and . Furthermore, property (iii) of Definition 5 guarantees that it also holds when the set is extended. The approximation guarantee follows from Proposition A in the appendix. Specifically, it follows from a variant of the classical result on submodular approximation, adapted to the case of an additive error in the objective function.
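For comparison, the classical lazy greedy scheme for submodular maximization, which the algorithm above refines, can be sketched as follows. This is a deliberately simplified stand-in: it evaluates a truncated objective with a fixed number of terms exactly, instead of maintaining lower and upper bounds and testing for separation as Algorithm 2 does. The names `truncated_score` and `lazy_greedy`, the neighbour-dictionary input (undirected graph assumed) and the fixed truncation are assumptions of this example.

```python
import heapq


def truncated_score(adj, group, alpha, num_terms):
    """Truncated group score of an undirected graph via per-vertex walk recurrences."""
    group = set(group)
    hit = {v: 1 if v in group else 0 for v in adj}
    miss = {v: 0 if v in group else 1 for v in adj}
    score = 0.0
    for i in range(1, num_terms + 1):
        new_hit, new_miss = {}, {}
        for v in adj:
            h = sum(hit[u] for u in adj[v])
            m = sum(miss[u] for u in adj[v])
            new_hit[v], new_miss[v] = (h + m, 0) if v in group else (h, m)
        hit, miss = new_hit, new_miss
        score += alpha ** i * sum(hit.values())
    return score


def lazy_greedy(adj, k, alpha, num_terms):
    """Pick k vertices greedily, re-evaluating a marginal gain only when it reaches the queue top."""
    group, score = [], 0.0
    # Max-heap via negated keys; gains w.r.t. the empty group are valid upper bounds later on
    # because marginal gains of a submodular function never increase as the group grows.
    heap = [(-truncated_score(adj, [v], alpha, num_terms), v) for v in adj]
    heapq.heapify(heap)
    while heap and len(group) < k:
        _, v = heapq.heappop(heap)
        gain = truncated_score(adj, group + [v], alpha, num_terms) - score
        if not heap or gain >= -heap[0][0]:
            # The fresh gain of v beats every (possibly stale) upper bound: v is the greedy choice.
            group.append(v)
            score += gain
        else:
            heapq.heappush(heap, (-gain, v))
    return group, score
```

On a star graph the first pick is the centre, since every walk with at least one edge passes through it; vertices whose stale gains are already small are never re-evaluated, which is the effect the lazy strategy relies on.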
Initialization of and .
To accelerate the algorithm in practice, it is crucial that the initial values of and are chosen appropriately during each reset: if we reset as in Algorithm 2, the algorithm has to evaluate and for all whenever increases! To avoid this issue, we use to provide a better initialization of and . Let be the value of in the reverse graph of , i. e., is the number of walks that start at but do not contain any vertex of . The following lemma summarizes our initialization strategy:
Let
The following holds:


Let denote the bound from the construction of . yields the following bounds on and :
Thus, instead of initializing , we compute and use the righthand sides of the second statement of Lemma 3.2.2 as initial values for and . Note that using the recurrence for from Section 3.1 (and a similar recurrence for ), can be computed in time , simultaneously for all . To prove the first claim, observe that the difference between and is that counts walks that contain cycles multiple times while counts those walks only once. The second claim follows immediately.
Complexity of the lazy algorithm.
Assume that Algorithm 2 uses the spectral bound from Lemma 3.1.2. The worstcase value of to achieve separation is given by Lemma 3.1.3, namely . Note that the algorithm does some iterations that do not achieve separation. We calculate the cost of each iteration in terms of the number of evaluations of . In this measure, the cost of each iteration is . Thus, as is doubled on unsuccessful separation, the cost of all unsuccessful attempts to achieve separation is always smaller than the cost of a successful attempt to achieve separation. Hence, the worstcase running time of Algorithm 2 is : in each of the successful iterations, it can be necessary to extract all elements of the priority queues and evaluate and on each of them.
We expect that much fewer evaluations of and will be required in practice than in the worstcase: in fact, we expect almost all vertices to have low marginal gains (i. e., only few walks cross those vertices) and will never be evaluated on those vertices. Hence, we investigate the practical performance of Algorithm 2 in Section 4.
4 Experiments
Our algorithms for GEDWalk maximization are implemented in C++ on top of the open-source framework NetworKit [48], which also includes implementations of the aforementioned algorithms for group betweenness and group closeness maximization. All experiments are executed on a Linux server with two Intel Xeon Gold 6154 CPUs (36 cores in total) and 1.5 TB of memory. If not stated differently, every experiment uses 36 threads (one per core), and the default algorithm for GEDWalk maximization uses the combinatorial bound (see Section 3.1.2) and the lazy-greedy strategy (see Section 3.2.2) with . All networks are undirected; real-world networks have been downloaded from the Koblenz Network Collection [29], and synthetic networks have been generated using the graph generators provided in NetworKit (see Appendix B for more details about the settings of the parameters of the generators). Because group closeness is not defined on disconnected graphs (without modifications), we always consider the largest connected component of our input instances.
4.1 Scalability w. r. t. Group Size.
[Figure: data points are aggregated using the geometric mean over the instances in Table 2.]
Figure 0(a) shows the average running time in seconds of , GEDWalk, and maximization for group sizes from to . For small group sizes GEDWalk can be maximized much faster than and : for our algorithm for GEDWalk maximization is on average faster than maximization, and faster than maximization, while for it is on average faster than maximization and faster than maximization. On the other hand, maximization scales better than and GEDWalk maximization w. r. t. the group size (see Appendix C for additional results on large groups). This behavior is expected since the evaluation of marginal gains of becomes computationally cheaper for larger groups. This does not apply to our algorithm for maximizing GEDWalk, which instead needs to increase the length of the walks while the group grows (see Algorithm 2, line 27).
Yet, one can also observe that the group closeness score increases only very slowly (and more slowly than GEDWalk’s) when increasing the group size (see Figure 4(b) in the appendix). This means that, from a certain group size on, the choice of a new group member hardly makes a difference for group closeness; in most cases only the distance to very close vertices can be reduced. In that sense GEDWalk seems to distinguish better between more and less promising candidates. Furthermore, Figure 0(b) shows the length of the walks considered by our algorithm w. r. t. . In accordance with what we stated in Section 3.2.2, grows sublinearly with .
4.2 Scalability to Large (Synthetic) Graphs.
Figure 2 shows the running time in seconds of GEDWalk maximization with on randomly generated networks using the Erdős-Rényi, R-MAT [11] and Barabási-Albert [4] models as well as the random hyperbolic generator from von Looz et al. [50]. The thin blue lines represent the linear regression on the running times (all resulting values are and therefore small enough for statistical relevance); w. r. t. the running time curves, the regression lines either have a steeper slope (Figures 1(a), 1(b), and 1(c)), or they match it almost perfectly in the case of the hyperbolic random generator. Therefore, for the network types and sizes under consideration, GEDWalk maximization scales (empirically) linearly w. r. t. .
4.3 Parallel Scalability
As stated above, the GEDWalk algorithm can be parallelized by computing and in parallel for different . Figure 2(a) shows the parallel speedup of our algorithm for maximizing GEDWalk over itself running on a single core for . The scalability is moderate up to 16 cores, while on 32 cores it does not gain much additional speedup. A similar behavior was observed before for group closeness [6]. Figure 2(b) shows the running time in seconds of , , and with increasing number of cores, for . On a single core, GEDWalk maximization is on average faster than maximization and faster than maximization. As we increase the number of cores, the decrease in running time of the three algorithms is comparable, except on 32 cores: in this case, GCC is slower than with 16 cores on the considered instances. The limited scalability affecting all three algorithms is probably due to memory latency becoming a bottleneck in the execution on multiple cores [3, 31]. Figure 2(b) also shows that our algorithm for GEDWalk maximization finishes on average within a few seconds on cores. Here the running time of the algorithm is dominated by its sequential parts, and it is not surprising that adding more cores does not speed the algorithm up substantially.
4.4 Applications of GEDWalk.
In this section we demonstrate the usefulness of GEDWalk for different applications by showing that it improves the performance on two important graph mining tasks: semi-supervised vertex classification and graph classification. As a preprocessing step, for both tasks before applying GEDWalk we first construct a weighted graph using the symmetrically normalized adjacency matrix $D^{-1/2} A D^{-1/2}$, which is often used in the literature [52, 28], where $D$ is a diagonal matrix of vertex degrees. Here, instead of taking the contribution of an $i$-walk $(e_1, \ldots, e_i)$ to be $\alpha^i$, we define it to be $\alpha^i \prod_{j=1}^{i} w(e_j)$, where $w(e_j)$ denotes the weight of edge $e_j$. Except for the introduction of coefficients in the recurrences of Section 3.1.1, no modifications to our algorithm are required. Compared to our unweighted definition of GEDWalk, this weighted variant converges even faster, as the contribution of each walk is smaller.
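The preprocessing step can be illustrated with a short sketch that derives edge weights from symmetric degree normalization, assuming the usual form in which the weight of an edge between u and v is 1 / sqrt(deg(u) * deg(v)). The function name `normalized_edge_weights` and the neighbour-dictionary input are hypothetical, and the sketch assumes an undirected graph without isolated vertices.

```python
import math


def normalized_edge_weights(adj):
    """Weights of the symmetrically degree-normalized adjacency matrix (assumed form).

    Returns a dict mapping each ordered pair (u, v) with v in adj[u]
    to 1 / sqrt(deg(u) * deg(v)).
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    return {(u, v): 1.0 / math.sqrt(deg[u] * deg[v])
            for u in adj for v in adj[u]}
```

Since every weight is at most 1, multiplying the weights along a walk can only shrink its contribution, which is consistent with the faster convergence noted above.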
4.4.1 Vertex classification.
Semi-supervised vertex classification is a fundamental graph mining problem where the goal is to predict the class labels of all vertices in a graph given a small set of labelled vertices and the graph structure. The choice of which vertices we label (i. e., which vertices we include in the training set) before building a classification model can have a significant impact on the test accuracy, especially when the number of labelled vertices is small in comparison to the size of the graph [2, 47]. Since many models rely on diffusion to propagate information on the graph, our hypothesis is that selecting a training set with high group centrality will improve diffusion, and thus also improve accuracy. To test this hypothesis we evaluate the performance of Label Propagation [52, 12] given different strategies for choosing the training set. We choose the simple Label Propagation model as our benchmark to isolate the effect on accuracy more clearly, but similar conclusions apply to more sophisticated vertex classification models such as graph neural networks [28].
More specifically, we evaluate the classification accuracy on two common benchmark graphs: Cora () and Wiki () [46]. We use the Normalized Laplacian variant of Label Propagation [52] setting the value for the return probability hyperparameter to . We let the vertices with highest group centrality according to GEDWalk be in the training set and the rest of the vertices be in the test set. We compare GEDWalk with the following baselines for selecting the training set: RND: select vertices at random (results averaged over 10 trials); DEG: select the vertices with highest degree; NBC: select vertices with highest (non-group) betweenness centrality; GBC: select vertices with highest group betweenness centrality; and PPR: select vertices with highest PageRank.
In Figure 4, we can see that for both datasets and across different numbers of labelled vertices, selecting the training set using GEDWalk leads to the highest (or comparable) test accuracy. Furthermore, while the second-best baseline strategy differs between the datasets (on Cora it is the GBC strategy and on Wiki it is the SBC strategy), GEDWalk is consistently better. Overall, these results confirm our hypothesis.
4.4.2 Graph classification.
Graph classification is another fundamental graph mining problem where the goal is to classify entire graphs based on features derived from their topology/structure. In contrast to the vertex classification task, where each dataset is a single graph, here each dataset consists of many graphs of varying size and their associated ground-truth class labels. Our hypothesis is that the vertices with high group centrality identified by GEDWalk capture rich information about the graph structure and can be used to derive features that are useful for graph classification.
To extract features based on GEDWalk, we first obtain the group of vertices with highest (approximate) group centrality score. The group centrality score itself is the first feature we extract. In addition we summarize the marginal gains of all remaining vertices in the graph in a histogram with bins. We concatenate these features to get a feature vector for each graph in the dataset. These features are useful since graphs with similar structure will have similar group scores and marginal gains. We denote this base case by GED.
In addition, we obtain the topic-sensitive PageRank vector of each graph, where we specify the teleport set to be equal to the vertices in the group with highest (approximate) group centrality. Then we summarize this vector by (a) a histogram of bins and (b) its top values, denoted by PPRH and PPRT, respectively. Intuitively, these features capture the amount of diffusion on the graph.
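To make the PPR features concrete, the following sketch computes a topic-sensitive PageRank vector by power iteration, with the random surfer teleporting only to a given set of vertices, and then extracts the top values. The function names, the adjacency-list input, the handling of vertices without neighbors, and the fixed iteration count are assumptions for illustration, not the implementation used in the experiments.

```python
def topic_sensitive_pagerank(adj, teleport_set, teleport_prob, iterations=100):
    """PageRank on an undirected graph, teleporting only to `teleport_set`.

    adj:           dict mapping each vertex to a list of its neighbors
    teleport_set:  non-empty collection of vertices (e.g. the selected group)
    teleport_prob: probability of jumping to the teleport set in each step
    Returns a dict mapping every vertex to its score; scores sum to 1.
    """
    targets = set(teleport_set)
    jump = {v: (1.0 / len(targets) if v in targets else 0.0) for v in adj}
    rank = dict(jump)

    for _ in range(iterations):
        new_rank = {v: teleport_prob * jump[v] for v in adj}
        for v in adj:
            walk_mass = (1.0 - teleport_prob) * rank[v]
            if adj[v]:
                # Distribute the walking mass evenly among the neighbors.
                for u in adj[v]:
                    new_rank[u] += walk_mass / len(adj[v])
            else:
                # Assumption: isolated vertices return their mass to the
                # teleport set so that the scores keep summing to 1.
                for t in targets:
                    new_rank[t] += walk_mass / len(targets)
        rank = new_rank
    return rank


def top_values(rank, k):
    """The k largest PageRank scores in decreasing order (PPRT-style)."""
    return sorted(rank.values(), reverse=True)[:k]


# Toy example: path 0 - 1 - 2 with teleport set {0}.
scores = topic_sensitive_pagerank({0: [1], 1: [0, 2], 2: [1]}, [0], 0.15)
print(top_values(scores, 2))
```

A histogram summary (PPRH-style) can be built from `rank.values()` in the same way as for the marginal gains.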
As a strong baseline we compute the eigenvalues of the adjacency matrix and summarize them (a) in a histogram of bins and (b) by extracting the top eigenvalues, denoted by EigH and EigT, respectively. This is inspired by recent work on graph classification [22] showing that spectral features can outperform deep-learning-based approaches. The ability to efficiently compute the eigenvalue histograms further motivates using them as a feature for graph classification [17]. Similarly, a strong advantage of the features based on GEDWalk is that we can efficiently compute them. Last, we also combine the spectral and GED-based features by concatenation, i.e. EigT+GED denotes the combination of EigT and GED features.
In the following experiments we fix the value of the hyperparameters , , and ; however, in practice these parameters can also be tuned, e.g. using cross-validation. We split the data into 80% training and 20% test set and average the results for 10 independent random splits.
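The evaluation protocol (80% training, 20% test, averaged over 10 independent random splits) can be sketched as below. The helper name and the `evaluate` callback, which is expected to train a classifier on the first argument and return its score on the second, are hypothetical; the seed is only there to make the sketch reproducible.

```python
import random


def average_over_splits(samples, evaluate, n_splits=10, train_fraction=0.8,
                        seed=0):
    """Average `evaluate(train, test)` over independent random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        # The first `cut` samples form the training set, the rest the test set.
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)


# Toy usage with a dummy callback that only reports the test-set share.
share = average_over_splits(list(range(10)),
                            lambda train, test: len(test) / 10)
print(share)
```

Here each sample would be a pair of a per-graph feature vector and its class label.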
Table 1 summarizes the graph classification results. We observe that enhancing the baseline features with our GEDWalk-based features improves the classification performance on all datasets, with variants using all available features, i.e. Eig+GED+PPR, performing best. Moreover, as shown in Table 6 in the appendix, using the most central group as determined by GEDWalk as teleport set yields performance improvements over standard PageRank with teleport to all vertices. See Table 3 in the appendix for an overview of the datasets used. In summary, we have shown that GEDWalk captures meaningful information about the graph structure that is complementary to baseline spectral features. We argue that GEDWalk can serve as an additional source of information that is relatively inexpensive to compute and can enhance any existing graph classification model.
Table 1: Graph classification results.
Dataset        ENZ.   IMD.   Mut.   PRO.   RED.
EigT           23.02  56.59  56.90  73.35  75.31
EigH           23.47  70.28  68.88  72.42  72.02
Ged            19.18  60.64  64.51  71.95  70.59
Ged+PPRH       20.85  65.18  65.46  72.18  71.59
Ged+PPRH       20.39  66.27  65.86  72.44  75.95
EigT+GedT      26.46  63.56  64.14  74.06  80.18
EigH+GedH      23.14  69.74  69.08  73.13  75.16
EigT+Ged+PPRT  27.45  69.25  62.78  73.54  76.70
EigH+Ged+PPRH  24.12  71.62  69.18  73.10  74.17
EigT+Ged+PPRT  27.88  68.53  62.43  73.72  80.48
EigH+Ged+PPRH  24.78  70.54  68.81  72.97  81.43
: PageRank teleport probability 0.85;  
: PageRank teleport probability 0.15 
5 Related Work
Several algorithms to maximize group centralities are found in the literature, and many of them are based on greedy approaches. Chehreghani et al. [13] provide an extensive analysis of several algorithms for group betweenness centrality estimation, and present a new algorithm based on an alternative definition of distance between a vertex and a group of vertices. Greedy approximation algorithms to maximize group centrality also exist for measures like closeness (by Bergamini et al. [6]) and current-flow closeness (by Li et al. [30]). Puzis et al. [43] state an algorithm for group betweenness maximization that does not utilize submodular approximation but relies on a branch-and-bound approach.
Alternative group centrality measures were introduced by Ishakian et al. [25], Puzis et al. [42], and Fushimi et al. [21]. Ishakian et al. defined single-vertex centrality measures based on a generic concept of path, and generalized them to groups of vertices. However, in contrast to GEDWalk, those measures are defined for directed acyclic graphs only. Puzis et al. defined the path betweenness centrality of a group of vertices by counting the fraction of shortest paths that cross all vertices within the group. The proposed algorithms to find a group with high path betweenness are quadratic in the best case (both in time and memory) and therefore cannot scale to large-scale networks. In fact, the largest graphs considered in their paper have only 500 vertices. Fushimi et al. proposed a new measure called connectedness centrality, which targets the specific problem of identifying appropriate locations for evacuation facilities on road networks. For this reason, their measure assumes the existence of edge failure probabilities and is not applicable to general graphs. Due to expensive Monte Carlo simulation, computing their measure also requires many hours on graphs with 100000 edges, while GEDWalk scales to hundreds of millions of edges.
Network design is a different strategy to address the group centrality maximization problem when the input graph is allowed to be altered. Medya et al. [33] introduced two algorithms that optimize the increment of the coverage centrality of a given group of vertices by iteratively adding new edges to the input graph.
6 Conclusions
We have presented a new group centrality measure, GEDWalk, and efficient approximation algorithms for its optimization and for computing the value of a given group. GEDWalk’s descriptive power is demonstrated by experiments on two fundamental graph mining tasks: both semi-supervised vertex classification and graph classification benefit from the new measure. As GEDWalk can be optimized faster than earlier measures, it is often a viable replacement for more expensive measures in performance-sensitive applications.
More precisely, in terms of running time, our algorithm for GEDWalk maximization significantly outperforms the state-of-the-art algorithms for maximizing group closeness and group betweenness centrality when group sizes are at most . The fact that scales worse than w. r. t. may seem like a limitation; however, we expect that many applications are interested in group sizes considerably smaller than .
Experiments on synthetic networks indicate that our algorithm for GEDWalk maximization scales linearly with the number of vertices. For graphs with vertices and more than M edges, it needs up to half an hour – often less. In fact, our algorithm can maximize GEDWalk for small groups on real-world graphs with hundreds of millions of edges within a few minutes. In future work we plan to apply GEDWalk for network and traffic monitoring [16, 41]; for developing immunization strategies to lower the vulnerability of a network to epidemic outbreaks [40]; and for improving landmark-based shortest path queries [23].
References
 [1] Abboud, A., Grandoni, F., and Williams, V. V. Subcubic equivalences between graph centrality problems, APSP and diameter. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4–6, 2015 (2015), P. Indyk, Ed., SIAM, pp. 1681–1697.

 [2] Avrachenkov, K., Gonçalves, P., and Sokol, M. On the choice of kernel and labelled data in semi-supervised learning methods. In International Workshop on Algorithms and Models for the Web-Graph (2013), Springer, pp. 56–67.
 [3] Bader, D. A., Cong, G., and Feo, J. On the architectural requirements for efficient execution of graph algorithms. In Parallel Processing, 2005. ICPP 2005. International Conference on (2005), IEEE, pp. 547–556.
 [4] Barabási, A.-L., and Albert, R. Emergence of scaling in random networks. Science 286, 5439 (1999), 509–512.
 [5] Bergamini, E., Borassi, M., Crescenzi, P., Marino, A., and Meyerhenke, H. Computing top-k closeness centrality faster in unweighted graphs. ACM Transactions on Knowledge Discovery from Data (TKDD) 13, 5 (2019), 53.
 [6] Bergamini, E., Gonser, T., and Meyerhenke, H. Scaling up group closeness maximization. In 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX) (2018), SIAM, pp. 209–222. Updated version: https://arxiv.org/abs/1710.01144.
 [7] Boldi, P., and Vigna, S. Axioms for centrality. Internet Mathematics 10, 3–4 (2014), 222–262.
 [8] Borassi, M., Crescenzi, P., and Habib, M. Into the square: On the complexity of some quadratic-time solvable problems. Electr. Notes Theor. Comput. Sci. 322 (2016), 51–67.
 [9] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics 21, suppl_1 (2005), i47–i56.
 [10] Brandes, U., and Fleischer, D. Centrality measures based on current flow. In Proceedings of the 22nd Annual Symposium on Theoretical Aspects of Computer Science, STACS 2005 (2005), vol. 3404 of LNCS, Springer, pp. 533–544.
 [11] Chakrabarti, D., Zhan, Y., and Faloutsos, C. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining (2004), SIAM, pp. 442–446.
 [12] Chapelle, O., Schölkopf, B., and Zien, A. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20, 3 (2009), 542–542.
 [13] Chehreghani, M. H., Bifet, A., and Abdessalem, T. An in-depth comparison of group betweenness centrality estimation algorithms. In 2018 IEEE International Conference on Big Data (Big Data) (2018), IEEE, pp. 2104–2113.
 [14] Chen, C., Wang, W., and Wang, X. Efficient maximum closeness centrality group identification. In ADC (2016), vol. 9877 of Lecture Notes in Computer Science, Springer, pp. 43–55.
 [15] Dobson, P. D., and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330, 4 (2003), 771–783.
 [16] Dolev, S., Elovici, Y., Puzis, R., and Zilberman, P. Incremental deployment of network monitors based on group betweenness centrality. Information Processing Letters 109, 20 (2009), 1172–1176.
 [17] Dong, K., Benson, A. R., and Bindel, D. Network density of states. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA, 2019), KDD ’19, ACM, pp. 1152–1161.
 [18] Everett, M. G., and Borgatti, S. P. The centrality of groups and classes. The Journal of Mathematical Sociology 23, 3 (1999), 181–201.
 [19] Feng, B. Q. Equivalence constants for certain matrix norms. Linear Algebra and its Applications 374 (2003), 247 – 253.
 [20] Freeman, L. C., Borgatti, S. P., and White, D. R. Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks 13, 2 (1991), 141 – 154.
 [21] Fushimi, T., Saito, K., Ikeda, T., and Kazama, K. A new group centrality measure for maximizing the connectedness of network under uncertain connectivity. In International Conference on Complex Networks and their Applications (2018), Springer, pp. 3–14.
 [22] Galland, A., and Lelarge, M. Invariant embedding for graph classification. ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Representations (2019).
 [23] Goldberg, A. V., and Harrelson, C. Computing the shortest path: A* search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (2005), Society for Industrial and Applied Mathematics, pp. 156–165.
 [24] Goundan, P. R., and Schulz, A. S. Revisiting the greedy approach to submodular set function maximization. Optimization online (2007), 1–25.
 [25] Ishakian, V., Erdös, D., Terzi, E., and Bestavros, A. A framework for the evaluation and management of network centrality. In Proceedings of the 2012 SIAM International Conference on Data Mining (2012), SIAM, pp. 427–438.
 [26] Katz, L. A new status index derived from sociometric analysis. Psychometrika 18, 1 (1953), 39–43.
 [27] Kazius, J., McGuire, R., and Bursi, R. Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry 48, 1 (2005), 312–320.
 [28] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 [29] Kunegis, J. KONECT: the Koblenz network collection. In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13–17, 2013, Companion Volume (2013), International World Wide Web Conferences Steering Committee / ACM, pp. 1343–1350.
 [30] Li, H., Peng, R., Shan, L., Yi, Y., and Zhang, Z. Current flow group closeness centrality for complex networks? In The World Wide Web Conference (2019), ACM, pp. 961–971.
 [31] Lumsdaine, A., Gregor, D., Hendrickson, B., and Berry, J. Challenges in parallel graph processing. Parallel Processing Letters 17, 01 (2007), 5–20.
 [32] Mahmoody, A., Tsourakakis, C. E., and Upfal, E. Scalable betweenness centrality maximization via sampling. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), ACM, pp. 1765–1773.
 [33] Medya, S., Silva, A., Singh, A., Basu, P., and Swami, A. Group centrality maximization via network design. In Proceedings of the 2018 SIAM International Conference on Data Mining (2018), SIAM, pp. 126–134.
 [34] Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques. Springer, 1978, pp. 234–243.

 [35] Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrák, J., and Krause, A. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
 [36] Murphy, R. C., Wheeler, K. B., Barrett, B. W., and Ang, J. A. Introducing the graph 500. Cray Users Group (CUG) 19 (2010), 45–74.
 [37] Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming 14, 1 (1978), 265–294.
 [38] Newman, M. Networks, 2nd ed. Oxford university press, 2018.
 [39] Newman, M. J. A measure of betweenness centrality based on random walks. Social Networks 27, 1 (2005), 39 – 54.
 [40] Pastor-Satorras, R., and Vespignani, A. Immunization of complex networks. Physical review E 65, 3 (2002), 036104.
 [41] Puzis, R., Altshuler, Y., Elovici, Y., Bekhor, S., Shiftan, Y., and Pentland, A. Augmented betweenness centrality for environmentally aware traffic monitoring in transportation networks. Journal of Intelligent Transportation Systems 17, 1 (2013), 91–105.
 [42] Puzis, R., Elovici, Y., and Dolev, S. Fast algorithm for successive group betweenness centrality computation. Physical Review E 76 (2007), 056709.
 [43] Puzis, R., Elovici, Y., and Dolev, S. Finding the most prominent group in complex networks. AI communications 20, 4 (2007), 287–296.

 [44] Riesen, K., and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In Structural, Syntactic, and Statistical Pattern Recognition (Berlin, Heidelberg, 2008), N. da Vitoria Lobo, T. Kasparis, F. Roli, J. T. Kwok, M. Georgiopoulos, G. C. Anagnostopoulos, and M. Loog, Eds., Springer Berlin Heidelberg, pp. 287–297.
 [45] Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., and Schomburg, D. BRENDA, the enzyme database: updates and major new developments. Nucleic acids research 32, suppl_1 (2004), D431–D433.
 [46] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine 29, 3 (2008), 93–93.
 [47] Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
 [48] Staudt, C. L., Sazonovs, A., and Meyerhenke, H. NetworKit: A tool suite for large-scale complex network analysis. Network Science 4, 4 (2016), 508–530.
 [49] van der Grinten, A., Bergamini, E., Green, O., Bader, D. A., and Meyerhenke, H. Scalable Katz ranking computation in large static and dynamic graphs. In 26th Annual European Symposium on Algorithms, ESA 2018, August 20–22, 2018, Helsinki, Finland (2018), Y. Azar, H. Bast, and G. Herman, Eds., vol. 112 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, pp. 42:1–42:14.
 [50] von Looz, M., Özdayi, M. S., Laue, S., and Meyerhenke, H. Generating massive complex networks with hyperbolic geometry faster in practice. In 2016 IEEE High Performance Extreme Computing Conference (HPEC) (2016), IEEE, pp. 1–6.
 [51] Yanardag, P., and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2015), KDD ’15, ACM, pp. 1365–1374.
 [52] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in neural information processing systems (2004), pp. 321–328.
Appendix A Technical Details
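The following definition is well known in the literature: [submodularity, monotonicity] Let $X$ be a nonempty finite set.

A set function $f \colon 2^X \to \mathbb{R}$ is called submodular if for every $A \subseteq B \subseteq X$ and every $x \in X \setminus B$ we have that $f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$. In this context, the value $f(A \cup \{x\}) - f(A)$ is called the marginal gain of $x$ (w. r. t. the set $A$).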

A set function