Novel Adaptive Algorithms for Estimating Betweenness, Coverage and k-path Centralities

10/23/2018 ∙ by Mostafa Haghir Chehreghani, et al. ∙ Télécom ParisTech

An important index widely used to analyze social and information networks is betweenness centrality. In this paper, first, given a directed network G and a vertex r ∈ V(G), we present a novel adaptive algorithm for estimating the betweenness score of r. Our algorithm first computes two subsets of the vertex set of G, called RF(r) and RT(r), that define the sample spaces of the start-points and the end-points of the samples. Then, it adaptively samples from RF(r) and RT(r) and stops as soon as some condition is satisfied. The stopping condition depends on the samples met so far, |RF(r)| and |RT(r)|. We show that, compared to the well-known existing methods, our algorithm gives a more efficient (λ,δ)-approximation. Then, we propose a novel algorithm for estimating the k-path centrality of r. Our algorithm is based on computing two sets, RF(r) and D(r). While RF(r) defines the sample space of the source vertices of the sampled paths, D(r) defines the sample space of the other vertices of the paths. We show that in order to give a (λ,δ)-approximation of the k-path score of r, our algorithm requires considerably fewer samples. Moreover, it processes each sample faster and with less memory. Finally, we empirically evaluate our proposed algorithms and show their superior performance. We also show that they can be used to efficiently compute centrality scores of a set of vertices.


1. Introduction

Graphs (in particular directed graphs) are a widely used tool for modeling data in different domains, including social networks, information networks, road networks and the World Wide Web. Centrality is a structural property of vertices or edges in the graph that indicates their importance. For example, it determines the importance of a person within a social network, or a road within a road network. There are several centrality notions in the literature, including betweenness centrality (DBLP:journals/cj/Chehreghani14, ), coverage centrality (Yoshida:2014:ALA:2623330.2623626, ) and k-path centrality (Alahakoon:2011:KCN:1989656.1989657, ).

Although there exist polynomial time algorithms for computing these indices, the algorithms are expensive in practice. However, there are observations that may improve computation of centrality indices in practice. In several applications it is sufficient to compute the centrality score of only one or a few vertices. For instance, the index might be computed for only core vertices of communities in social/information networks (jrnl:Wang, ) or for only hubs in communication networks. Another example, discussed in (DBLP:conf/complenet/AgarwalSCI15, ; DBLP:journals/corr/AgarwalSCI14, ), is handling cascading failures. It has been shown that the failure of a vertex with a higher betweenness score may cause greater collapse of the network (Stergiopoulos201534, ). Therefore, failed vertices should be recovered in the order of their betweenness scores. This means that betweenness scores need to be computed only for the failed vertices, which usually form a very small subset of vertices. Note that these vertices are not necessarily those that have the highest betweenness scores; hence, algorithms that identify the top-k vertices (Riondato2016, ) are not applicable. Another example, where in a road network it is required to compute the betweenness (k-path) score of only one vertex, is discussed in (DBLP:journals/cj/Chehreghani14, ). In this paper, we exploit these practical observations to develop more effective algorithms.

In recent years, several approximate algorithms have been proposed in the literature to estimate betweenness/coverage/k-path centralities. Some of them are based on sampling shortest paths (Riondato2016, ; DBLP:conf/esa/BorassiN16, ) and some others are based on sampling source vertices (or source-destination vertices) (proc:Bader, ; DBLP:journals/cj/Chehreghani14, ). Very recently, a technique has been proposed that significantly improves the efficiency of the source sampler algorithms in directed graphs. In this technique, for a given vertex r, first the set RF(r) of vertices that have a non-zero contribution (dependency score) to the betweenness score of r is computed. Then, this set is used for sampling source vertices (DBLP:journals/corr/abs-1708-08739, ). RF(r), which can be computed very efficiently, is usually much smaller than the vertex set of the graph; hence, source vertex sampling can be done more effectively. However, the error bounds presented in (DBLP:journals/corr/abs-1708-08739, ) are not adaptive and are not of practical convenience (see Section 6).

In the current paper, to estimate betweenness centrality, we further improve this technique by not only restricting the source vertices to RF(r), but also finding a set RT(r) of vertices that can be a destination of a shortest path that passes over r. Given a directed graph G and a vertex r of G, our algorithm first computes two subsets of the vertex set of G, called RF(r) and RT(r). These subsets can be computed very efficiently and define the sample spaces of the start-points and the end-points of the samples (i.e., shortest paths). Then, it adaptively samples from RF(r) and RT(r) and stops as soon as some condition is satisfied. The stopping condition depends on the samples met so far, |RF(r)| and |RT(r)|. We theoretically analyze our algorithm and show that in order to estimate betweenness of r with a maximum error λ with probability 1−δ, it requires considerably fewer samples than the well-known existing algorithms. In fact, our algorithm tries to collect the advantages of all existing methods. On the one hand, unlike (DBLP:journals/corr/abs-1708-08739, ), it is adaptive and its error bounds are of practical convenience. On the other hand, unlike algorithms such as (Riondato2016, ; DBLP:conf/esa/BorassiN16, ), it uses the sets RF(r) and RT(r) to prune the search space, which gives significant theoretical and empirical improvements. We also discuss how our algorithm can be revised to compute coverage centrality of r.

Then, we propose a novel adaptive algorithm for estimating k-path centrality of r. Our algorithm is based on computing two sets RF(r) and D(r). While RF(r) defines the sample space of the source vertices of the sampled paths, D(r) defines the sample space of the other vertices of the paths. We show that in order to give a (λ,δ)-approximation of the k-path score of r, our algorithm requires considerably fewer samples. Moreover, it can process each sampled path faster and with less memory. We also propose a method to determine the number of samples adaptively and based on the samples met so far and . In the end, we evaluate the empirical efficiency of our centrality estimation algorithms over several real-world datasets. We show that in practice, while our betweenness estimation algorithm is usually faster than the well-known existing algorithms, it generates considerably more accurate results. Furthermore, we show that while our algorithm is intuitively designed to estimate the betweenness score of only one vertex, it can also be used to effectively compute betweenness scores of a set of vertices. Finally, we show that our algorithm for k-path centrality is considerably more accurate than existing methods.

The rest of this paper is organized as follows. In Section 2, we introduce preliminaries and necessary definitions used in the paper. In Section 3, we give an overview of related work. In Section 4, we introduce our betweenness/coverage estimation algorithm and theoretically analyze it. In Section 5, we present our k-path centrality estimation algorithm and its analysis. In Section 6, we empirically evaluate our proposed algorithms and show their high efficiency, compared to well-known existing algorithms. Finally, the paper is concluded in Section 7.

2. Preliminaries

We assume that the reader is familiar with basic concepts in graph theory. Throughout the paper, G refers to a directed graph. For simplicity, we assume that G is a connected and loop-free graph without multi-edges. Throughout the paper, we assume that G is an unweighted graph, unless it is explicitly mentioned that G is weighted. V(G) and E(G) refer to the set of vertices and the set of edges of G, respectively. For a vertex v ∈ V(G), the number of head ends adjacent to v is called its in degree and the number of tail ends adjacent to v is called its out degree. For a vertex v, by we denote the set of outgoing neighbors of v.

A shortest path from u to v is a path whose length is minimum, among all paths from u to v. For two vertices u, v ∈ V(G), if G is unweighted, by d(u,v) we denote the length (the number of edges) of a shortest path connecting u to v. If G is weighted, d(u,v) denotes the sum of the weights of the edges of a shortest path connecting u to v. By definition, d(u,u) = 0. Note that in directed graphs, d(u,v) is not necessarily equal to d(v,u). The vertex diameter of G, denoted by VD(G), is defined as the number of vertices of the longest shortest path of the graph. For s, t ∈ V(G), σ_st denotes the number of shortest paths between s and t, and σ_st(v) denotes the number of shortest paths between s and t that also pass through v. Betweenness centrality of a vertex v is defined as:

$$BC(v) = \frac{1}{|V(G)| \cdot (|V(G)|-1)} \sum_{s,t \in V(G) \setminus \{v\}} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

Coverage centrality of v is defined as follows (Yoshida:2014:ALA:2623330.2623626, ):

Let denote a simple path that starts with vertex and has edges. Let also denote the vertices in the order they appear in , with . We define as . We say returns 1 if lies on , and 0 otherwise. For a vertex v in an unweighted graph G, its k-path centrality is defined as follows (Alahakoon:2011:KCN:1989656.1989657, ). (Footnote 1: The original definition presented in (Alahakoon:2011:KCN:1989656.1989657, ) does not include the normalization part. Here, for consistency with the definitions of betweenness and coverage centralities, we use this normalized definition.)

The k-path centrality of v in a weighted graph G is defined in a similar way (Alahakoon:2011:KCN:1989656.1989657, ). We omit it here due to lack of space.
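As a small illustration of the betweenness definition in this section, the following Python sketch computes the betweenness score of a single vertex by brute force on an unweighted directed graph. It is only a sketch under stated assumptions: it assumes the normalized form of the definition (the sum of σ_st(v)/σ_st over ordered pairs s, t different from v, divided by |V(G)|·(|V(G)|−1)), and the adjacency-dictionary representation and the names `bfs_counts` and `exact_betweenness` are hypothetical, introduced only for this example. It uses the standard identity that σ_st(v) = σ_sv · σ_vt whenever d(s,v) + d(v,t) = d(s,t), and σ_st(v) = 0 otherwise.

```python
from collections import deque


def bfs_counts(adj, source):
    """BFS from `source`: return (dist, sigma), where sigma[w] is the
    number of shortest paths from `source` to w."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def exact_betweenness(adj, v):
    """Normalized betweenness of v (assumed normalization: n * (n - 1))."""
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    n = len(nodes)
    dist_v, sigma_v = bfs_counts(adj, v)
    total = 0.0
    for s in nodes - {v}:
        dist_s, sigma_s = bfs_counts(adj, s)
        if v not in dist_s:
            continue                       # s cannot reach v
        for t in nodes - {v, s}:
            if t in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total / (n * (n - 1))
```

For example, on the path graph `{0: [1], 1: [2], 2: []}` the only pair whose shortest path passes over vertex 1 is (0, 2), so the sketch returns 1/6 for vertex 1. The cost is one BFS per source vertex, which is what the sampling algorithms of the later sections try to avoid.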

3. Related work

Brandes (jrnl:Brandes, ) introduced an efficient algorithm for computing betweenness centrality of all vertices, which is performed respectively in O(|V(G)||E(G)|) and O(|V(G)||E(G)| + |V(G)|² log |V(G)|) times for unweighted and weighted networks with positive weights. The authors of (DBLP:conf/sdm/CatalyurekKSS13, ) presented the compression and shattering techniques to improve the efficiency of Brandes’s algorithm. In (jrnl:Everett, ) and (conf:cbcwsdm, ), the authors respectively studied group betweenness centrality and co-betweenness centrality, two natural extensions of betweenness centrality to sets of vertices. In (jrnl:Brandes3, ) and (proc:Bader, ), the authors proposed approximate algorithms based on selecting some source vertices, computing their dependency scores on the other vertices in the graph, and scaling the results. In the algorithm of Geisberger et al. (conf:Geisberger, ), the method for aggregating dependency scores changes so that vertices do not profit from being near the selected source vertices. Chehreghani (DBLP:journals/cj/Chehreghani14, ) proposed a non-uniform sampler for unbiased estimation of the betweenness score of a vertex. Riondato and Kornaropoulos (Riondato2016, ) presented shortest path samplers for estimating betweenness centrality of all vertices or the vertices that have the highest betweenness scores. Riondato and Upfal (RiondatoKDD20116, ) introduced the ABRA algorithm and used Rademacher averages to determine the number of required samples. Recently, Borassi and Natale (DBLP:conf/esa/BorassiN16, ) presented KADABRA, which is adaptive and uses bb-BFS to sample shortest paths. Finally, in (DBLP:journals/corr/abs-1708-08739, ) the authors presented exact and approximate algorithms for computing betweenness centrality of one vertex or a small set of vertices in directed graphs. As discussed earlier, our betweenness estimation algorithm tries to have the advantages of both algorithms presented in (DBLP:conf/esa/BorassiN16, ) and (DBLP:journals/corr/abs-1708-08739, ).

Yoshida (Yoshida:2014:ALA:2623330.2623626, ) presented an algorithm that uses samples to estimate coverage centrality of a vertex within an additive error with probability . Alahakoon et al. (Alahakoon:2011:KCN:1989656.1989657, ) introduced k-path centrality of a vertex and proposed the RA-kpath algorithm to estimate it. Mahmoody et al. (Mahmoody:2016:SBC:2939672.2939869, ) showed that k-path centrality admits a hyper-edge sampler and proposed an algorithm that picks a source vertex uniformly at random, generates a random simple path of length at most k, and outputs the generated simple path as a hyper-edge. The key difference between our algorithm and these two algorithms is that our algorithm restricts the sample spaces of the source vertices and the other vertices of the paths to the sets RF(r) and D(r), respectively. This considerably improves the error guarantee and empirical efficiency of our algorithm.

4. Betweenness centrality

In this section, we present our adaptive approximate algorithm for estimating betweenness centrality of a given vertex r in a directed graph G. We start by introducing the sets RF(r) and RT(r), which are used to define the sample spaces of start-points and end-points of shortest paths.

Definition 4.1.

Let G be a directed graph and r, v ∈ V(G). We say r is reachable from v if there is a (directed) path from v to r. The set of vertices from which r is reachable is denoted by RF(r).

Definition 4.2.

Let G be a directed graph and r, v ∈ V(G). We say r is reachable to v if there is a (directed) path from r to v. The set of vertices to which r is reachable is denoted by RT(r).

For a given vertex r, RF(r) and RT(r) can be efficiently computed using the reverse graph.

Definition 4.3.

Let G be a directed graph. The reverse graph of G is a directed graph such that: (i) its vertex set is V(G), and (ii) it contains an edge (u,v) if and only if (v,u) ∈ E(G) (DBLP:journals/corr/abs-1708-08739, ).

To compute RF(r), we act as follows (DBLP:journals/corr/abs-1708-08739, ): (i) first, by flipping the direction of the edges of G, the reverse graph of G is constructed, (ii) then, if G is weighted, the weights of the edges are ignored, (iii) finally, a breadth-first search (BFS) or a depth-first search (DFS) on the reverse graph starting from r is performed. All the vertices that are met during the BFS (or DFS), except r, are added to RF(r). To compute RT(r), we act as follows: (i) if G is weighted, the weights of the edges are ignored, (ii) a BFS or a DFS on G starting from r is performed. All the vertices that are met during the BFS (or DFS), except r, are added to RT(r). Both RF(r) and RT(r) can be computed in O(|V(G)| + |E(G)|) time, for both unweighted and weighted graphs. Furthermore, we have the following lemma.

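Lemma 4.4.

Given a directed graph G and r ∈ V(G), the exact betweenness score of r can be computed as follows:

$$BC(r) = \frac{1}{|V(G)| \cdot (|V(G)|-1)} \sum_{s \in RF(r)} \sum_{t \in RT(r)} \frac{\sigma_{st}(r)}{\sigma_{st}}$$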

Lemma 4.4 says that in order to compute the betweenness score of r, we only need to consider those shortest paths that start from a vertex in RF(r) and end with a vertex in RT(r) (and check which ones pass over r). This means methods that sample s and t uniformly at random from V(G) (Riondato2016, ; RiondatoKDD20116, ; DBLP:conf/esa/BorassiN16, ) can be revised to sample s and t from RF(r) and RT(r), respectively. In the current paper, we revise the KADABRA algorithm (DBLP:conf/esa/BorassiN16, ) by restricting s and t to RF(r) and RT(r), respectively. We then analyze the resulting algorithm and show that it can be adaptive where, in order to give a (λ,δ)-approximation, it requires far fewer samples than KADABRA.
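The two reachability sets used here can be obtained with two plain graph traversals, following the procedure described before Lemma 4.4. The Python sketch below is a minimal illustration, assuming an unweighted graph stored as an adjacency dictionary; the function names (`reverse_graph`, `reachable`, `compute_rf_rt`) are hypothetical and not taken from the paper.

```python
from collections import deque


def reverse_graph(adj):
    """Flip the direction of every edge of the graph."""
    rev = {u: [] for u in adj}
    for u, neighbors in adj.items():
        for w in neighbors:
            rev.setdefault(w, []).append(u)
    return rev


def reachable(adj, r):
    """Vertices met by a BFS from r, excluding r itself."""
    seen = {r}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    seen.discard(r)
    return seen


def compute_rf_rt(adj, r):
    """Return (RF(r), RT(r)): the vertices that can reach r and the
    vertices that r can reach."""
    rf = reachable(reverse_graph(adj), r)   # BFS on the reverse graph
    rt = reachable(adj, r)                  # BFS on the graph itself
    return rf, rt
```

For instance, with edges 0→1, 1→2, 4→2 and 2→3, calling `compute_rf_rt` for vertex 2 yields RF = {0, 1, 4} and RT = {3}. Edge weights, if any, play no role here, since only reachability matters.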

Algorithm 1 shows the high level pseudo code of our proposed algorithm, called ABAD (an abbreviation for Adaptive Betweenness Approximation algorithm for Directed graphs). The input parameters of this algorithm are a directed graph G, a vertex r ∈ V(G) for which we want to estimate the betweenness score, and real values λ, δ ∈ (0,1) used to determine the error bound. ABAD first computes RF(r) and RT(r) and stores them in RF and RT, respectively. Then, at each iteration τ of the loop in Lines 8-16, ABAD picks up vertices s ∈ RF and t ∈ RT uniformly at random. Then, it picks up a shortest path p among all possible shortest paths from s to t, uniformly at random. To do so, it uses balanced bidirectional breadth-first search (bb-BFS) where at the same time it performs a BFS from s and another BFS from t and stops when the two BFSs touch each other (bidirectionalsearch, ). Finally, if r is on p, ABAD estimates the betweenness score of r at iteration τ, b_τ, as |RF|·|RT| / (|V(G)|·(|V(G)|−1)); otherwise, b_τ will be 0. The final estimation of the betweenness score of r is the average of the estimations of different iterations.

1:  Input. A directed network G, a vertex r ∈ V(G), and real numbers λ, δ ∈ (0,1).
2:  Output. Betweenness score of r.
3:  if in degree of r is 0 or out degree of r is 0 then
4:     return 0.
5:  end if
6:  b ← 0, τ ← 0.
7:  RF ← compute RF(r), RT ← compute RT(r).
8:  while not Stop do
9:     b_τ ← 0.
10:     Pick up s ∈ RF and t ∈ RT, both uniformly at random.
11:     Pick up a shortest path p from s to t uniformly at random.
12:     if r is on p then
13:        b_τ ← |RF|·|RT| / (|V(G)|·(|V(G)|−1)).
14:     end if
15:     b ← b + b_τ, τ ← τ + 1.
16:  end while
17:  b ← b / τ.
18:  return b
Algorithm 1 High level pseudo code of the ABAD algorithm.
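The sampling loop of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions that are not taken from the source: the adaptive Stop method is replaced by a fixed number of samples `num_samples`; a plain BFS from s with path counting is used instead of bb-BFS to draw a uniformly random shortest path; each successful sample contributes |RF|·|RT| / (n·(n−1)), which is the value that makes the estimate unbiased under the normalized betweenness definition; and all function names are hypothetical.

```python
import random
from collections import deque


def reachable(adj, r):
    """Vertices met by a BFS from r, excluding r itself."""
    seen, queue = {r}, deque([r])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    seen.discard(r)
    return seen


def sample_shortest_path(adj, s, t, rng):
    """Return a uniformly random shortest path from s to t, or None."""
    dist, sigma, preds = {s: 0}, {s: 1}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break                          # all predecessors of t are final
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w], sigma[w], preds[w] = dist[u] + 1, 0, []
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    if t not in dist:
        return None
    path = [t]
    while path[-1] != s:                   # walk back, weighting by path counts
        candidates = preds[path[-1]]
        weights = [sigma[p] for p in candidates]
        path.append(rng.choices(candidates, weights=weights)[0])
    path.reverse()
    return path


def abad_fixed_samples(adj, r, num_samples, seed=0):
    """Estimate the betweenness of r with a fixed number of samples."""
    rng = random.Random(seed)
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    n = len(nodes)
    rev = {}
    for u, ws in adj.items():
        for w in ws:
            rev.setdefault(w, []).append(u)
    rf = sorted(reachable(rev, r))         # start-points (labels assumed sortable)
    rt = sorted(reachable(adj, r))         # end-points
    if not rf or not rt:
        return 0.0                         # in degree or out degree of r is 0
    weight = len(rf) * len(rt) / (n * (n - 1))
    total = 0.0
    for _ in range(num_samples):
        s, t = rng.choice(rf), rng.choice(rt)
        path = sample_shortest_path(adj, s, t, rng)
        if path is not None and r in path:
            total += weight
    return total / num_samples
```

On the path graph 0→1→2 every sample passes over vertex 1, so the sketch returns exactly 1/6 for it regardless of the number of samples. In an adaptive version, the fixed loop bound would be replaced by the stopping condition discussed in the rest of this section.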

Let be the value of that Algorithm 1 finds in the end of the iterations done in Lines 8-16. The value of , i.e., the stopping condition of the sampling part of Algorithm 1, is determined adaptively and depends on the samples observed so far, |RF(r)| and |RT(r)|. The dependence on |RF(r)| and |RT(r)| can be expressed in terms of the parameter , defined as follows: In Theorem 4.8, we discuss the method Stop, which defines the stopping condition. Before that, in Lemmas 4.5 and 4.6, we investigate the expected value and variance of the per-iteration estimates, which are used by Theorem 4.8.

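Lemma 4.5.

In Algorithm 1, the expected value of the returned estimate is equal to the betweenness score of r.

Proof.

For each iteration in the loop in Lines 8-16 of Algorithm 1, we have:

Then, the final estimate is the average of the per-iteration estimates. Therefore, its expected value is also the betweenness score of r. ∎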
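The per-iteration step of this proof can be written out as follows. This is a reconstruction under stated assumptions rather than the original equation: it assumes that s and t are drawn uniformly from RF(r) and RT(r), that a successful sample contributes |RF(r)|·|RT(r)| / (|V(G)|·(|V(G)|−1)), and that the betweenness score is normalized by |V(G)|·(|V(G)|−1); the symbols b_τ and BC(r) are used here as shorthand for the estimate at iteration τ and the exact score. The last equality is the restricted-sum form of Lemma 4.4.

```latex
\begin{align*}
E[b_\tau]
  &= \sum_{s \in RF(r)} \sum_{t \in RT(r)}
     \frac{1}{|RF(r)|} \cdot \frac{1}{|RT(r)|} \cdot
     \frac{\sigma_{st}(r)}{\sigma_{st}} \cdot
     \frac{|RF(r)| \cdot |RT(r)|}{|V(G)| \cdot (|V(G)|-1)} \\
  &= \frac{1}{|V(G)| \cdot (|V(G)|-1)}
     \sum_{s \in RF(r)} \sum_{t \in RT(r)}
     \frac{\sigma_{st}(r)}{\sigma_{st}} \\
  &= BC(r)
\end{align*}
```

Here σ_st(r)/σ_st is the probability that a uniformly random shortest path from s to t passes over r, which is why it appears as the success probability of a sample.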

Lemma 4.6.

In Algorithm 1, for each we have:

Proof.

We have:

Theorem 4.7.

(Theorem 6.1 of (chung2006, ).) Let be a constant and be a martingale, associated with a filter , that satisfies the followings: (i) , for , and (ii) , for . We have

(1)

Let λ and δ be real numbers in (0,1) and assume that is defined as where the constant is a universal positive constant and it is estimated to be approximately 0.5 (Loffler2009, ). By the results in (Riondato2016, ), in Algorithm 1 and after samples (i.e., ), the estimation error of the betweenness score of r will be bounded by λ, with probability . For , we have the following theorem.

Theorem 4.8.

In Algorithm 1, for real values and the value defined above, after iterations of the loop in Lines 8-16, we have:

(2)

and

(3)

where

Proof.
(Footnote 3: Similar to the proof of Theorem 9 of (DBLP:conf/esa/BorassiN16, ), the proof of Theorem 4.8 of the current paper is based on Theorem 6.1 of (chung2006, ). The key difference is that in Theorem 9 of (DBLP:conf/esa/BorassiN16, ), the variance of each random variable is , where is used as an upper bound. In our theorem, the variance of each random variable is the value presented in Lemma 4.6, where we use as an upper bound.)

In the following we prove that Equation 2 holds. The correctness of Equation 3 can be proven in a similar way. We define as and martingale as . Using Theorem 4.7, we get:

(4)

Furthermore, Lemma 4.6 yields:

If in Equation 4 we use this upper bound on the sum of variances, we get:

(5)

Putting the right side of Equation 5 equal to yields:

(6)

Parameter should not be expressed in terms of , as it is unknown. Therefore in Equation 6 we should find its value in terms of the other parameters. To do so, we put the obtained value of into Equation 5 and obtain:

(7)

After solving the quadratic equation (with respect to ) of the event inside of Equation 7 and some simplifications, we get Equation 2. ∎

Now the definition of the Stop method can be indicated by Theorem 4.8. This theorem implies that for the defined value of , the probability of and is at most . Moreover, the probability of and is at most (), and the probability of and is at most (). Therefore and using union bounds, in order to have -approximation for the given values of and , in the beginning of each iteration of the loop in Lines 8-16 of Algorithm 1, the current value of the random variable should be either equal to , or it should make both terms of Equation 2 and of Equation 3 less than or equal to (with ). If these conditions are satisfied, the Stop method returns true and the loop terminates; otherwise, more samples are required, hence, the Stop method returns false.

The main difference between the lower and upper bounds presented in Inequalities 2 and 3 and the lower and upper bounds presented in (DBLP:conf/esa/BorassiN16, ) is that in (DBLP:conf/esa/BorassiN16, ), is replaced by . Since for given values and , and in most cases (for example, in our extensive experiments reported in Section 6, is always less than !), the number of samples (iterations) required by our algorithm is much less than the number of samples required by e.g., KADABRA (DBLP:conf/esa/BorassiN16, ).

Complexity analysis

For unweighted graphs, each iteration of the loop in Lines 8-16 of Algorithm 1 takes O(|V(G)| + |E(G)|) time. For weighted graphs with positive weights, it takes O(|E(G)| + |V(G)| log |V(G)|) time (and for weighted graphs with negative weights, the problem is NP-hard). This is the same as the time complexity of processing each sample by the existing algorithms (Riondato2016, ; DBLP:conf/esa/BorassiN16, ). In a more precise analysis, when balanced bidirectional breadth-first search (bb-BFS) (bidirectionalsearch, ) is used instead of breadth-first search (BFS) to sample a shortest path , the time complexity of each iteration is improved to , where is the maximum degree of the graph (bidirectionalsearch, ). In Algorithm 1, the number of iterations is determined adaptively and, as discussed above, it is considerably less than the number of samples required by the most efficient existing algorithms. Space complexity of our algorithm is .

Coverage centrality

ABAD can be revised to compute some related indices such as coverage centrality (Yoshida:2014:ALA:2623330.2623626, ) and stress centrality. To compute coverage centrality of r, Lines 11-14 of Algorithm 1 are replaced by the following lines:

  if r is on a shortest path between s and t then
     b_τ ← |RF|·|RT| / (|V(G)|·(|V(G)|−1)).
  end if

In other words, instead of sampling a shortest path between s and t, we check whether r is on some shortest path between them, which can be done by conducting a bb-BFS from s to t. If during this procedure r is met, it is on some shortest path from s to t; otherwise, it is not. In a way similar to Lemma 4.5 we can show that this method gives an unbiased estimation of coverage centrality of r. Moreover, similar to Theorem 4.8 we can define the stopping conditions of estimating coverage centrality.
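The membership test used for coverage centrality can be sketched as below. Note that this sketch does not use bb-BFS: as a simpler stand-in, it runs two ordinary BFS traversals and applies the distance criterion that r lies on some shortest path from s to t exactly when d(s,r) + d(r,t) = d(s,t). The function names are hypothetical and the graph is assumed to be unweighted and stored as an adjacency dictionary.

```python
from collections import deque


def bfs_dist(adj, source):
    """Unweighted shortest-path distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def on_some_shortest_path(adj, s, t, r):
    """True iff r lies on at least one shortest path from s to t."""
    dist_s = bfs_dist(adj, s)
    if r not in dist_s or t not in dist_s:
        return False                       # s reaches neither r nor t
    dist_r = bfs_dist(adj, r)
    return t in dist_r and dist_s[r] + dist_r[t] == dist_s[t]
```

For example, in a graph with edges 0→1, 1→3, 0→2 and 2→3, vertex 1 is on a shortest path from 0 to 3; after adding the direct edge 0→3 it no longer is. A bidirectional search would answer the same question while exploring a smaller part of the graph.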

5. k-path centrality

In this section, we present the APAD algorithm for estimating k-path centrality of a given vertex r in a directed graph G. We define the domain of a vertex r, denoted by D(r), as . Furthermore, we say a path belongs to D(r) and denote it with , iff . It is easy to see that for directed graphs, the first vertex of each path must belong to RF(r) and all its vertices must belong to D(r). Otherwise, will be zero and hence will have no contribution to . This motivates us to present the following equivalent definition of k-path centrality.

(8)

Note that in Equation 8, someone may decide to change the definition of to . The algorithm we propose works with both definitions of .

1:  Input. A directed network G, a vertex r ∈ V(G), an integer k, and real numbers λ, δ ∈ (0,1).
2:  Output. k-path centrality of r.
3:  if in degree of r is 0 or out degree of r is 0 then
4:     return 0.
5:  end if
6:  , .
7:  RF ← compute RF(r), D ← compute D(r).
8:  .
9:  while not Stop do
10:     .
11:     Select s ∈ RF uniformly at random.
12:     Select an integer l ≤ k uniformly at random.
13:     Select (with probability ) a random path among all paths .
14:     if r is on  then
15:        .
16:     end if
17:     , .
18:  end while
19:  .
20:  return  .
Algorithm 2 High level pseudo code of the APAD algorithm.

Algorithm 2 shows the high level pseudo code of APAD. It first computes RF(r) and D(r) and stores them respectively in RF and D. Then it starts the sampling part where at each iteration, it selects a source vertex s from RF and an integer l (the number of edges of the path, at most k), and a path p from the set of all paths that belong to D, start with vertex s and have l edges. The estimation at each iteration is and the final estimation is the average of all . While s and l are chosen uniformly at random, p is chosen as follows. First, we initialize p by s. Then, at each step , let , , be the last vertex added to the current p. We add to p a vertex chosen uniformly at random from . This procedure yields the following definition of : .
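The path-growing step described above can be illustrated with the following Python sketch. It is written under assumptions that are not confirmed by the source: the next vertex is drawn uniformly from the outgoing neighbors of the last vertex that belong to the domain and are not already on the path (so that the path stays simple), and the walk simply stops early if no such neighbor exists. The function name `sample_domain_path` and the explicit `domain` argument (standing for D(r), however it is computed) are hypothetical.

```python
import random


def sample_domain_path(adj, domain, s, length, rng):
    """Grow a simple path from s with at most `length` edges, staying
    inside `domain`. Returns (path, prob), where prob is the probability
    with which this procedure generates exactly this path."""
    path = [s]
    on_path = {s}
    prob = 1.0
    for _ in range(length):
        last = path[-1]
        candidates = [w for w in adj.get(last, ())
                      if w in domain and w not in on_path]
        if not candidates:
            break                          # dead end: keep the shorter path
        nxt = rng.choice(candidates)       # uniform choice among candidates
        prob /= len(candidates)
        path.append(nxt)
        on_path.add(nxt)
    return path, prob
```

With edges 0→1, 1→2 and 1→3 and the domain {0, 1, 2}, a path of two edges from 0 is always [0, 1, 2] with probability 1, whereas with the full vertex set as the domain either of the two possible paths is produced with probability 0.5. Restricting the candidates to the domain is what keeps every sampled path inside the part of the graph that can contribute to the score of r.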

Lemma 5.1.

In Algorithm 2, we have: .

Proof.

For each iteration in the loop in Lines 9-18 of Algorithm 2, we have:

where is the probability of choosing the path . Since is the average of ’s, its expected value is the same as the expected value of ’s. ∎

In the rest of this section, we derive error bounds for our estimation of . Before that, we define as the ratio .

Theorem 5.2.

Suppose that in Algorithm 2 the loop in Lines 9-18 is performed for a fixed number of iterations. For a large enough number of iterations, Algorithm 2 gives a (λ,δ)-approximation of the k-path score of r.

Proof.

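Our proof is based on Hoeffding inequality (Hoeffding:1963, ). Let X_1, ..., X_n be independent random variables bounded by the interval [a, b]. Let also X̄ be (X_1 + ... + X_n)/n. Hoeffding inequality (Hoeffding:1963, ) states that for any ε > 0, the following holds:

$$\Pr\left[\,\left|\bar{X} - E[\bar{X}]\right| \ge \epsilon\,\right] \le 2\exp\left(-\frac{2 n \epsilon^2}{(b-a)^2}\right)$$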

For each sampled path , using any of the two aforementioned definitions of , the following holds: . This means each is in the interval . As a result, we can apply Hoeffding inequality on the random variables , with , , and . This yields

(9)

which means if , Algorithm 2 estimates the k-path score of r within an additive error λ with a probability at least 1−δ. ∎
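The way a Hoeffding-type argument turns into a sample size can be sketched numerically. The Python snippet below is a generic illustration and not the bound of Theorem 5.2 itself: it assumes the standard two-sided Hoeffding inequality for the mean of independent samples lying in an interval [lo, hi], and solves 2·exp(−2nλ²/(hi−lo)²) ≤ δ for n. The function name and its parameters are hypothetical.

```python
import math


def hoeffding_samples(lam, delta, lo=0.0, hi=1.0):
    """Smallest n for which the two-sided Hoeffding bound guarantees an
    additive error of at most `lam` with probability at least 1 - delta,
    for samples bounded by the interval [lo, hi]."""
    width = hi - lo
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * lam ** 2))
```

For λ = 0.1 and δ = 0.1, samples in [0, 1] need 150 draws, while samples confined to [0, 0.1] need only 2. The required number of samples grows with the square of the interval width, so an estimator whose per-sample values lie in a narrower interval needs proportionally fewer samples for the same guarantee.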

The difference between the error bounds presented in Theorem 5.2 and those that can be obtained for RA-kpath (Alahakoon:2011:KCN:1989656.1989657, ) is that, on the one hand, for the same error guarantee our algorithm requires times fewer samples. On the other hand, while in our algorithm each iteration (sample) takes time, in RA-kpath it takes time. As a result, for the same error guarantee, our algorithm is times faster than RA-kpath. Let be the set of the edges of whose both end-points are in . While space complexity of our algorithm is , it is for RA-kpath. Note that for vertices of real-world graphs, is usually considerably less than and and are respectively considerably smaller than and . Similar results hold for our algorithm against the algorithm of (Mahmoody:2016:SBC:2939672.2939869, ), as it uses the same sampling strategy as RA-kpath.

In the last part of this section, similar to the case of betweenness centrality, in Theorem 5.4 we discuss how the stopping condition of Algorithm 2 can be determined adaptively. The proof of this theorem is similar to the proof of Theorem 4.8 (hence, we omit it). There are, however, two key differences between Theorems 4.8 and 5.4. First, as shown in Lemma 5.3, in Theorem 5.4 the variance of each is bounded by , where it is in Theorem 4.8. Second, as shown in Theorem 5.2, when the number of samples is the estimation error of the k-path score of r is bounded by λ, with probability . We refer to this quantity as and use it in Theorem 5.4, as the upper bound, for the required number of samples.

Lemma 5.3.

In Algorithm 2, for each we have:

Proof.
(10)

Note that Lemma 5.3 holds for both definitions of .

In Algorithm 2, if the number of samples is equal to , with probability at least we have: . For the case of , we present the following theorem.

Theorem 5.4.

In Algorithm 2, for real values and the value of defined above, after iterations of the loop in Lines 9-18, we have:

where

6. Experimental results

We perform experiments over several real-world networks to assess the quantitative and qualitative behavior of our proposed algorithms: ABAD and APAD. We test the algorithms over several real-world datasets from the SNAP repository (https://snap.stanford.edu/data/), including the com-amazon network (DBLP:conf/icdm/YangL12, ), the com-dblp co-authorship network (DBLP:conf/icdm/YangL12, ), the email-EuAll email communication network (DBLP:journals/tkdd/LeskovecKF07, ), the p2p-Gnutella31 peer-to-peer network (DBLP:journals/tkdd/LeskovecKF07, ) and the web-NotreDame web graph (albert1999dww, ). All the networks are treated as directed graphs. (Footnote 5: The approximate algorithms can be successfully run over larger networks. Here, however, the bottleneck is to run the exact algorithm for measuring the errors.) For a vertex , its empirical approximation error is defined as: where is the calculated approximate score and is its exact score.

6.1. Betweenness centrality

In this section, we present the empirical results of ABAD. We compare ABAD against the most efficient existing algorithms, which are KADABRA (DBLP:conf/esa/BorassiN16, ) and BCD (DBLP:journals/corr/abs-1708-08739, ). The stopping condition used in KADABRA to determine the number of samples can vary to either estimate betweenness centrality of (a subset of size k of) vertices or to find the top k vertices that have the highest betweenness scores. Here, for consistency with the stopping condition used by Algorithm 1, we use the setting of KADABRA that estimates betweenness scores of (a subset of size k of) vertices. Furthermore, KADABRA has a parameter k that determines the number of vertices for which we want to estimate the betweenness score. By increasing the value of k, KADABRA becomes slower. In our experiments, we set k to 10. (Footnote 6: For a given value of k, KADABRA estimates betweenness score of k vertices. On the one hand, when we set k to 1, over several datasets our chosen vertices are not among the vertices for which betweenness centrality is computed. On the other hand, as mentioned before, by increasing the value of k, KADABRA becomes slower. Therefore, we set k to 10, where in most cases our chosen vertices are among the vertices for which KADABRA estimates betweenness centrality and yet, it is not large.) BCD is not adaptive and it does not automatically compute the number of samples required for the given values of λ and δ. Furthermore, unlike ABAD and KADABRA, BCD is a (source) vertex sampler algorithm; hence, it requires much more time to process each sample. Thus, for BCD we cannot use the same number of samples as ABAD (or KADABRA). In order to have fair comparisons, in our tests we let BCD have as many samples as it can process within the running time of ABAD. As a result, while ABAD and BCD will have (almost) the same running times, the number of samples of BCD will be much smaller.

| Dataset | λ | Sec. 4 parameter (avg) | Sec. 4 parameter (max) | ABAD time (avg) | ABAD time (max) | ABAD error (avg) | ABAD error (max) | KADABRA time | KADABRA error (avg) | KADABRA error (max) | BCD error (avg) | BCD error (max) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| com-amazon | 0.001 | 0.0000027 | 0.0000044 | 0.48 | 0.55 | 0.45 | 1.19 | 1.32 | 29.06 | 29.06 | 1.28 | 1.99 |
| | 0.00075 | | | 0.71 | 0.86 | 0.36 | 0.73 | 2.01 | 28.37 | 56.49 | 1.46 | 3.90 |
| | 0.0005 | | | 1.39 | 1.7 | 0.41 | 1.04 | 3.65 | 26.22 | 63.14 | 1.09 | 2.25 |
| com-dblp | 0.001 | 0.0077 | 0.0094 | 5.54 | 6.72 | 3.93 | 7.84 | 17.55 | 9.78 | 16.99 | 6.02 | 15.59 |
| | 0.00075 | | | 9.31 | 11.46 | 3.76 | 7.08 | 28.79 | 9.16 | 14.31 | 4.46 | 6.42 |
| | 0.0005 | | | 19.07 | 23.15 | 3.41 | 7.46 | 54.46 | 8.49 | 14.98 | 3.47 | 4.48 |
| email-EuAll | 0.001 | 0.0012 | 0.0013 | 1.50 | 1.79 | 1.10 | 2.01 | 1.33 | 16.70 | 23.02 | 2.63 | 6.60 |
| | 0.00075 | | | 2.47 | 2.91 | 2.00 | 3.73 | 2.05 | 8.67 | 11.94 | 3.51 | 10.13 |
| | 0.0005 | | | 5.11 | 6.27 | 2.33 | 4.95 | 3.85 | 5.76 | 9.76 | 2.04 | 3.89 |
| p2p-Gnutella31 | 0.001 | 0.01 | 0.01 | 1.36 | 1.59 | 2.71 | 4.24 | 12.44 | 4.94 | 13.98 | 5.88 | 11.55 |
| | 0.00075 | | | 2.27 | 2.69 | 2.36 | 4.32 | 19.71 | 2.93 | 7.46 | 6.18 | 10.62 |
| | 0.0005 | | | 4.75 | 5.57 | 1.16 | 2.31 | 38.24 | 2.52 | 4.73 | 4.29 | 9.28 |
| web-NotreDame | 0.001 | 0.0021 | 0.0033 | 0.91 | 1.25 | 2.17 | 4.45 | 2.80 | 3.59 | 7.14 | 4.59 | 9.75 |
| | 0.00075 | | | 1.5105 | 2.115 | 1.60 | 2.81 | 4.47 | 4.21 | 7.85 | 1.83 | 4.38 |
| | 0.0005 | | | 2.98 | 4.11 | 1.15 | 2.38 | 8.74 | 2.59 | 4.83 | 2.27 | 5.54 |

Table 1. Empirical evaluation of ABAD against KADABRA and BCD for five vertices that have the highest betweenness scores. Times are in seconds. Times reported for ABAD include both computing RF(r) and RT(r) and estimating the betweenness score.

ABAD and BCD need to specify the vertex r for which we want to estimate the betweenness score. For each dataset, we choose the five vertices that have the highest betweenness scores and run ABAD and BCD for each of them. There are two reasons for choosing the vertices that have the highest betweenness scores. First, exact betweenness of the vertices that have a small |RF(r)| (or |RT(r)|) can be computed very efficiently (DBLP:journals/corr/abs-1708-08739, ). Therefore, using approximate algorithms for them is not very meaningful. In fact, by choosing vertices with high betweenness scores we guarantee that the chosen vertices have large enough |RF(r)| (or |RT(r)|), so that it makes sense to run approximate algorithms for them. Second, these vertices usually have a higher value of the parameter defined in Section 4 than the other vertices. Hence, it is likely that ABAD will estimate betweenness centrality of the other vertices more efficiently. For instance, in web-NotreDame consider those five randomly chosen vertices that are used in the experiments of (DBLP:journals/corr/abs-1708-08739, ). These vertices have the following values: , , , and . These values are much smaller than the values of the vertices that have the highest betweenness scores (see Table 1). Therefore, ABAD will have a much better performance for these vertices.

Table 1 reports the experimental results. In each experiment, the bold value shows the one that has the lowest error. In all the experiments, we set to for both ABAD and KADABRA. Since the behavior of these algorithms depends on the value of λ, we run them for three different values of λ: 0.001, 0.00075 and 0.0005. Over com-amazon, some vertices are not among the vertices for which KADABRA computes the betweenness score; hence, they do not contribute to the reported results. Since we set the running time of BCD to be very close to the running time of ABAD, we do not report it in Table 1.

Our results show that ABAD considerably outperforms both KADABRA and BCD. It always performs better than KADABRA; furthermore, in most cases (38 out of 42 cases) it is more accurate than BCD. This is due to very low values of that vertices of real-world graphs have. In our experiments, is always smaller than . This considerably restricts the search space and hence improves the efficiency of ABAD. The vertices tested in our experiments are those that have the highest betweenness scores. This implies that, compared to the other vertices, they usually have a high . As a result, the relative efficiency of ABAD with respect to KADABRA for the other vertices of the graphs (that do not have a very high betweenness score) will be even better. It should also be noted that for a vertex , while is computed, the sets and are computed, too. Therefore, computation of these sets does not impose any extra computational cost on ABAD.
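As a rough illustration of how restricted sample spaces of this kind could be obtained, the sketch below computes, for a vertex r, the vertices that can reach r and the vertices reachable from r with two breadth-first searches. It assumes that the start-point set RF(r) and the end-point set RT(r) named in the abstract correspond to these two sets and that r itself is excluded; the function names and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import deque


def reachable(adj, source):
    """Return the vertices reachable from `source` by BFS, excluding `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    seen.discard(source)
    return seen


def reverse_graph(adj):
    """Build the adjacency lists of the graph with every edge reversed."""
    rev = {}
    for u, nbrs in adj.items():
        rev.setdefault(u, [])
        for v in nbrs:
            rev.setdefault(v, []).append(u)
    return rev


def sample_spaces(adj, r):
    """Assumed reading of RF(r) and RT(r): candidate start-points are the
    vertices that can reach r, candidate end-points are those r can reach."""
    starts = reachable(reverse_graph(adj), r)
    ends = reachable(adj, r)
    return starts, ends


# Hypothetical toy directed graph: b -> a -> r -> c -> d, and e -> d.
adj = {"a": ["r"], "b": ["a"], "r": ["c"], "c": ["d"], "d": [], "e": ["d"]}
starts, ends = sample_spaces(adj, "r")
```

On this toy graph, vertex e lies in neither set, so under the stated assumption no sampled pair would involve it.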

Dataset         Samples                   APAD time        APAD error         RA-kpath  RA-kpath error
                         avg     max      avg     max      avg      max       time      avg      max
com-amazon      50000    0.0176  0.0375   0.6202  0.6271   9.4273   20.3553   0.6201    17.3874  35.9696
                100000                    1.4710  1.4945   8.7553   16.9069   1.4483    23.5959  47.8998
                500000                    6.1657  6.3111   8.5740   16.2172   6.0273    15.0688  28.1798
com-dblp        50000    0.2984  0.5837   0.6120  1.4015   8.3598   14.0485   0.6990    18.8275  39.3276
                100000                    1.4015  1.4227   8.3598   13.5833   1.3864    18.8275  56.3778
                500000                    6.0236  6.1431   7.3327   10.8506   5.8907    7.1508   12.7557
email-EuAll     50000    0.3057  0.4563   0.4517  0.4570   4.4155   9.9514    0.4409    9.0128   26.2509
                100000                    1.0388  1.0904   3.2181   7.6999    1.0128    5.4020   13.1260
                500000                    4.3868  4.4107   1.3479   3.8887    4.3044    2.3709   3.8737
p2p-Gnutella31  50000    0.2050  0.5776   0.1666  0.1715   10.1815  26.4400   0.1633    25.1530  56.1906
                100000                    0.7527  0.3947   9.8433   24.2559   0.3436    12.8763  24.9524
                500000                    1.5625  1.5896   6.5869   20.6747   1.5652    20.1032  32.5738
web-NotreDame   50000    0.3520  0.7182   0.6116  0.6152   4.6194   9.4785    0.6033    7.8810   13.2039
                100000                    1.2208  1.2331   3.4717   11.1948   1.2108    2.3478   5.8867
                500000                    5.9969  6.0496   2.9064   4.9510    6.0111    2.2955   4.9636
Table 2. Empirical evaluation of APAD against RA-kpath. All times are in seconds. Times reported for APAD include both computing the reachable set and estimating the k-path score.

One might object that ABAD (like BCD) estimates the betweenness score of only one vertex, whereas KADABRA can estimate betweenness scores of all vertices. However, as mentioned before, in several applications it is sufficient to compute this index for only one vertex or for a few vertices. Even if an application requires the betweenness scores of a set of vertices, ABAD can still be useful. In this case, after computing RF(r) and RT(r) for every vertex r in the set in parallel, one may run the sampling part of Algorithm 1 (Lines 8-16 of the algorithm) for all the vertices in the set in parallel, as was done in the existing algorithms. Note that even if ABAD is run independently for each vertex in the set, in many cases it will outperform KADABRA. For example, consider our experiments where we run ABAD for the five vertices that have the highest betweenness scores. In the reported experiments, for each dataset, by simply summing up the running times and the numbers of samples of the five vertices, we obtain the running time and the number of samples of the independent runs of ABAD for all five vertices. Even in this case, the number of samples used by ABAD to estimate betweenness centrality of all five vertices is considerably (at least 10 times) smaller than the number of samples used by KADABRA. Moreover, in most cases, ABAD is more accurate than KADABRA. Finally, while over p2p-Gnutella31 ABAD takes less time to compute betweenness scores of all five vertices, over the other datasets KADABRA is faster. A reason for the faster performance of KADABRA over com-amazon, com-dblp, email-EuAll and web-NotreDame (despite the fact that it uses many more samples) is that it processes most of these samples in parallel, whereas in the independent runs of ABAD, fewer samples are processed in parallel. That ABAD uses considerably fewer samples suggests that if it estimates betweenness scores of all vertices of a set in parallel, it might always be faster than KADABRA. Developing a fully parallel version of ABAD that estimates betweenness scores of a set of vertices in parallel is an interesting direction for future work.
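The comparison above rests on simple arithmetic: the cost of independent per-vertex runs is the sum of their running times and sample counts, which is then set against the sample count of a single all-vertex run. A minimal sketch of that bookkeeping follows. The helper names and all numbers are hypothetical placeholders chosen for illustration; they are not values from Table 1.

```python
def aggregate_independent_runs(runs):
    """Sum running time and sample count over independent per-vertex runs.

    `runs` is a list of (time_in_seconds, num_samples) pairs, one per vertex.
    """
    total_time = sum(t for t, _ in runs)
    total_samples = sum(s for _, s in runs)
    return total_time, total_samples


def sample_reduction_factor(per_vertex_runs, all_vertex_samples):
    """How many times fewer samples the per-vertex runs use in total."""
    _, total_samples = aggregate_independent_runs(per_vertex_runs)
    return all_vertex_samples / total_samples


# Hypothetical placeholder measurements for five vertices (not from the paper).
per_vertex_runs = [(1.5, 1000), (2.0, 1500), (0.5, 500), (1.0, 1000), (1.0, 1000)]
total_time, total_samples = aggregate_independent_runs(per_vertex_runs)

# Hypothetical sample count of a single run that estimates all vertices.
factor = sample_reduction_factor(per_vertex_runs, all_vertex_samples=60000)
```

Note that summing times models sequential independent runs; a parallel implementation would be bounded by the slowest run rather than by the sum.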

6.2. k-path centrality

In this section, we present the empirical results of APAD. We compare APAD against RA-kpath (Alahakoon:2011:KCN:1989656.1989657, ), which samples a vertex, an integer, and a random path that starts from the sampled vertex and has the sampled number of edges. This sampling method has been used by some other work, including (Mahmoody:2016:SBC:2939672.2939869, ). The error bound presented in (Alahakoon:2011:KCN:1989656.1989657, ) is not adaptive and has only one parameter (different from the parameter used in this paper), instead of the two parameters λ and δ that we have for APAD. This makes comparison of the methods slightly challenging. In order to have a fair comparison (regardless of the method used to determine the number of samples for a (λ,δ)-approximation), we run each of them for a fixed number of iterations and report the empirical approximation error and running time. APAD requires specifying the vertex for which we want to compute k-path centrality. In order to have experiments consistent with Section 6.1, in this section we use the same vertices used in the evaluation of ABAD. We repeat each experiment three times and report the average results.
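The sampling step described for RA-kpath, run for a fixed number of iterations, can be sketched as follows. This is an illustrative sketch and not the implementation evaluated here: it assumes that the start vertex, the path length (between 1 and k) and each next neighbor are all drawn uniformly at random, that a path stops early at a dead end, and that only vertices after the start are counted. The function names are hypothetical, and the scaling that turns raw visit frequencies into a k-path score is deliberately left out.

```python
import random


def sample_random_path(adj, vertices, k, rng):
    """Draw a start vertex, a length in 1..k, and a random simple path."""
    start = rng.choice(vertices)
    length = rng.randint(1, k)  # inclusive on both ends
    path = [start]
    visited = {start}
    current = start
    for _ in range(length):
        # Only unvisited out-neighbors keep the path simple.
        candidates = [v for v in adj[current] if v not in visited]
        if not candidates:
            break  # dead end: the path is shorter than requested
        current = rng.choice(candidates)
        visited.add(current)
        path.append(current)
    return path


def visit_frequencies(adj, k, num_samples, seed=0):
    """Fraction of sampled paths that pass through each vertex (start excluded)."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    counts = {v: 0 for v in vertices}
    for _ in range(num_samples):
        for v in sample_random_path(adj, vertices, k, rng)[1:]:
            counts[v] += 1
    return {v: counts[v] / num_samples for v in vertices}


# Hypothetical toy directed graph; every vertex appears as a key.
toy = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
freq = visit_frequencies(toy, k=3, num_samples=1000, seed=42)
```

Fixing `num_samples` for both methods, as done in this section, makes the comparison independent of how each method would otherwise choose its sample size.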

Table 2 reports the experimental results of k-path centrality. In each experiment, the bold value shows the one that has the lowest (average/maximum) error. As can be seen in the table, in most cases APAD shows a better average error and maximum error than RA-kpath. This is particularly evident over the datasets (e.g., com-amazon) that have a low value, which empirically validates our theoretical analysis on the connection between and the efficiency of APAD. Note that, as mentioned before, RA-kpath computes k-path centrality of all vertices, whereas APAD requires specifying the vertex for which we want to compute the centrality score. As a result, for different given vertices, a single run of RA-kpath is sufficient, whereas APAD needs a separate run for each vertex. Therefore, in Table 2, for each dataset and sample size we report only one time for RA-kpath. In contrast, for each dataset and sample size we have five running times for APAD, whose average and maximum values are reported in Table 2. The times reported for APAD include both computing the reachable set and estimating the k-path score. These times are only slightly larger than the times reported for RA-kpath. This means that computing the reachable set of a vertex can be done very efficiently and in a time negligible compared to the running time of the whole process.

7. Conclusion

In this paper, we first presented a novel adaptive algorithm for estimating the betweenness score of a vertex r in a directed graph G. Our algorithm computes the sets RF(r) and RT(r), samples from them, and stops as soon as some condition is satisfied. The stopping condition depends on the samples met so far, |RF(r)| and |RT(r)|. We showed that our algorithm gives a more efficient (λ,δ)-approximation than the existing algorithms. Then, we proposed a novel algorithm for estimating the k-path centrality of r and showed that in order to give a (λ,δ)-approximation, it requires considerably fewer samples. Moreover, it processes each sample faster and with less memory. Finally, we empirically evaluated our centrality estimation algorithms and showed their superior performance.

References

  • [1] Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and S. R. S. Iyengar. An efficient estimation of a node’s betweenness. In Proceedings of the 6th Workshop on Complex Networks, New York, USA, March 25-27, 2015, pages 111–121, 2015.
  • [2] Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and Sudarshan Iyengar. Betweenness ordering problem: an efficient non-uniform sampling technique for large graphs. CoRR, abs/1409.6470, 2014.
  • [3] Tharaka Alahakoon, Rahul Tripathi, Nicolas Kourtellis, Ramanuja Simha, and Adriana Iamnitchi. K-path centrality: A new centrality measure in social networks. In Proceedings of the 4th Workshop on Social Network Systems, SNS ’11, pages 1:1–1:6, New York, NY, USA, 2011. ACM.
  • [4] R. Albert, H. Jeong, and A. L. Barabasi. The diameter of the world wide web. Nature, 401:130–131, 1999.
  • [5] D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In Proceedings of 5th International Conference on Algorithms and Models for the Web-Graph, pages 124–137, 2007.
  • [6] Michele Borassi and Emanuele Natale. KADABRA is an adaptive algorithm for betweenness via random approximation. In 24th Annual European Symposium on Algorithms, ESA 2016, August 22-24, 2016, Aarhus, Denmark, pages 20:1–20:18, 2016.
  • [7] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
  • [8] U. Brandes and C. Pich. Centrality estimation in large networks. Intl. Journal of Bifurcation and Chaos, 17(7):303–318, 2007.
  • [9] Ümit V. Çatalyürek, Kamer Kaya, Ahmet Erdem Sariyüce, and Erik Saule. Shattering and compressing networks for betweenness centrality. In Proceedings of the 13th SIAM International Conference on Data Mining, pages 686–694, 2013.
  • [10] Mostafa Haghir Chehreghani. Effective co-betweenness centrality computation. In Seventh ACM International Conference on Web Search and Data Mining (WSDM), pages 423–432, 2014.
  • [11] Mostafa Haghir Chehreghani. An efficient algorithm for approximate betweenness centrality computation. Comput. J., 57(9):1371–1382, 2014.
  • [12] Mostafa Haghir Chehreghani, Albert Bifet, and Talel Abdessalem. Efficient exact and approximate algorithms for computing betweenness centrality in directed graphs. In 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2018.
  • [13] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Math., 3(1):79–127, 2006.
  • [14] M. Everett and S. Borgatti. The centrality of groups and classes. Journal of Mathematical Sociology, 23(3):181–201, 1999.
  • [15] R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments, pages 90–100, 2008.
  • [16] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
  • [17] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.
  • [18] Maarten Löffler and Jeff M. Phillips. Shape Fitting on Point Sets with Probability Distributions, pages 313–324. Berlin, Heidelberg, 2009.
  • [19] Ahmad Mahmoody, Charalampos E. Tsourakakis, and Eli Upfal. Scalable betweenness centrality maximization via sampling. In Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1765–1773. ACM, 2016.
  • [20] Ira Pohl. Bi-directional search. Machine Intelligence, 6:127–140, 1971.
  • [21] Matteo Riondato and Evgenios M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2016.
  • [22] Matteo Riondato and Eli Upfal. ABRA: Approximating betweenness centrality in static and dynamic graphs with Rademacher averages. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1145–1154, 2016.
  • [23] George Stergiopoulos, Panayiotis Kotzanikolaou, Marianthi Theocharidou, and Dimitris Gritzalis. Risk mitigation strategies for critical infrastructures based on graph centrality analysis. International Journal of Critical Infrastructure Protection, 10:34–44, 2015.
  • [24] Y. Wang, Z. Di, and Y. Fan. Identifying and characterizing nodes important to community structure using the spectrum of the graph. PLoS ONE, 6(11):e27418, 2011.
  • [25] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In 12th IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012, pages 745–754, 2012.
  • [26] Yuichi Yoshida. Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 1416–1425, New York, NY, USA, 2014. ACM.