1. Introduction
Graphs (in particular, directed graphs) are a widely used tool for modeling data in different domains, including social networks, information networks, road networks and the World Wide Web. Centrality is a structural property of vertices or edges in a graph that indicates their importance; for example, it quantifies the importance of a person within a social network, or of a road within a road network. Several centrality notions exist in the literature, including betweenness centrality (DBLP:journals/cj/Chehreghani14), coverage centrality (Yoshida:2014:ALA:2623330.2623626) and $k$-path centrality (Alahakoon:2011:KCN:1989656.1989657).
Although there exist polynomial-time algorithms for computing these indices, such algorithms are expensive in practice. However, there are practical observations that may speed up the computation of centrality indices. In several applications it is sufficient to compute the centrality score of only one or a few vertices. For instance, the index might be computed only for the core vertices of communities in social/information networks (jrnl:Wang) or only for hubs in communication networks. Another example, discussed in (DBLP:conf/complenet/AgarwalSCI15; DBLP:journals/corr/AgarwalSCI14), is handling cascading failures. It has been shown that the failure of a vertex with a higher betweenness score may cause a greater collapse of the network (Stergiopoulos201534). Therefore, failed vertices should be recovered in the order of their betweenness scores. This means that betweenness scores must be computed only for the failed vertices, which usually form a very small subset of the vertices. Note that these vertices are not necessarily those with the highest betweenness scores; hence, algorithms that identify the top-$k$ vertices (Riondato2016) are not applicable. Yet another example, where the betweenness (path) score of only one vertex in a road network is required, is discussed in (DBLP:journals/cj/Chehreghani14). In this paper, we exploit these practical observations to develop more effective algorithms.
In recent years, several approximate algorithms have been proposed in the literature to estimate betweenness/coverage/$k$-path centralities. Some of them are based on sampling shortest paths (Riondato2016; DBLP:conf/esa/BorassiN16) and others on sampling source vertices (or source-destination pairs) (proc:Bader; DBLP:journals/cj/Chehreghani14). Very recently, a technique has been proposed that significantly improves the efficiency of source-sampler algorithms on directed graphs. In this technique, for a given vertex $r$, first the set of vertices that have a non-zero contribution (dependency score) to the betweenness score of $r$ is computed. Then, this set is used for sampling source vertices (DBLP:journals/corr/abs170808739). This set, which can be computed very efficiently, is usually much smaller than the vertex set of the graph; hence, source-vertex sampling can be done more effectively. However, the error bounds presented in (DBLP:journals/corr/abs170808739) are not adaptive and are not of practical convenience (see Section 6).
In the current paper, to estimate betweenness centrality, we further improve this technique by not only restricting the source vertices, but also restricting the set of vertices that can be the destination of a shortest path passing over the vertex of interest. Given a directed graph $G$ and a vertex $r$ of $G$, our algorithm first computes two subsets of the vertex set of $G$, called $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$. These subsets can be computed very efficiently and define the sample spaces of the start-points and the end-points of the samples (i.e., shortest paths). Then, it adaptively samples from $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ and stops as soon as a stopping condition is satisfied. The stopping condition depends on the samples met so far, $\epsilon$ and $\delta$. We theoretically analyze our algorithm and show that in order to estimate the betweenness of $r$ with a maximum error $\epsilon$ with probability $1-\delta$, it requires considerably fewer samples than the well-known existing algorithms. In fact, our algorithm tries to combine the advantages of all existing methods. On the one hand, unlike (DBLP:journals/corr/abs170808739), it is adaptive and its error bounds are of practical convenience. On the other hand, unlike algorithms such as (Riondato2016; DBLP:conf/esa/BorassiN16), it uses the sets $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ to prune the search space, which gives significant theoretical and empirical improvements. We also discuss how our algorithm can be revised to compute the coverage centrality of $r$. Then, we propose a novel adaptive algorithm for estimating the $k$-path centrality of $r$. This algorithm is based on computing two sets, $\mathcal{RF}(r)$ and $\mathcal{D}(r)$. While $\mathcal{RF}(r)$ defines the sample space of the source vertices of the sampled paths, $\mathcal{D}(r)$ defines the sample space of the other vertices of the paths. We show that in order to give an $(\epsilon,\delta)$-approximation of the $k$-path score of $r$, our algorithm requires considerably fewer samples. Moreover, it can process each sampled path faster and with less memory. We also propose a method to determine the number of samples adaptively, based on the samples met so far, $\epsilon$ and $\delta$. In the end, we evaluate the empirical efficiency of our centrality estimation algorithms over several real-world datasets. We show that in practice, while our betweenness estimation algorithm is usually faster than the well-known existing algorithms, it generates considerably more accurate results. Furthermore, we show that while our algorithm is intuitively designed to estimate the betweenness score of only one vertex, it can also be used to effectively compute the betweenness scores of a set of vertices. Finally, we show that our algorithm for $k$-path centrality is considerably more accurate than existing methods.
The rest of this paper is organized as follows. In Section 2, we introduce the preliminaries and necessary definitions used in the paper. In Section 3, we give an overview of related work. In Section 4, we introduce our betweenness/coverage estimation algorithm and theoretically analyze it. In Section 5, we present our $k$-path centrality estimation algorithm and its analysis. In Section 6, we empirically evaluate our proposed algorithms and show their high efficiency compared to well-known existing algorithms. Finally, the paper is concluded in Section 7.
2. Preliminaries
We assume that the reader is familiar with basic concepts of graph theory. Throughout the paper, $G = (V, E)$ refers to a directed graph. For simplicity, we assume that $G$ is a connected and loop-free graph without multi-edges. Throughout the paper, we assume that $G$ is unweighted, unless it is explicitly mentioned that it is weighted. $V(G)$ and $E(G)$ refer to the set of vertices and the set of edges of $G$, respectively. For a vertex $v$, the number of head ends adjacent to $v$ is called its in-degree, and the number of tail ends adjacent to $v$ is called its out-degree. For a vertex $v$, by $N(v)$ we denote the set of outgoing neighbors of $v$.
A shortest path from $s$ to $t$ is a path whose length is minimum among all paths from $s$ to $t$. For two vertices $s, t \in V(G)$, if $G$ is unweighted, by $d(s,t)$ we denote the length (the number of edges) of a shortest path connecting $s$ to $t$. If $G$ is weighted, $d(s,t)$ denotes the sum of the weights of the edges of a shortest path connecting $s$ to $t$. By definition, $d(v,v) = 0$. Note that in directed graphs, $d(s,t)$ is not necessarily equal to $d(t,s)$. The vertex diameter of $G$, denoted by $VD(G)$, is defined as the number of vertices of the longest shortest path of the graph. For $s, t \in V(G)$, $\sigma_{st}$ denotes the number of shortest paths between $s$ and $t$, and $\sigma_{st}(r)$ denotes the number of shortest paths between $s$ and $t$ that also pass through $r$. Betweenness centrality of a vertex $r$ is defined as:
$$B(r) = \frac{1}{|V|(|V|-1)} \sum_{s,t \in V \setminus \{r\},\; s \neq t} \frac{\sigma_{st}(r)}{\sigma_{st}}.$$
Coverage centrality of $r$ is defined as follows (Yoshida:2014:ALA:2623330.2623626):
$$C(r) = \frac{1}{|V|(|V|-1)} \sum_{s,t \in V \setminus \{r\},\; s \neq t} \mathbb{1}\left[\sigma_{st}(r) > 0\right].$$
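To make these definitions concrete, the following sketch computes the normalized betweenness of a single vertex by brute force on a small unweighted digraph. The adjacency-list representation (`adj` as a dict of lists) and the function names are our own illustrative choices, not the paper's notation; the test for whether $r$ lies on a shortest $s$-$t$ path uses the standard identity $d(s,r) + d(r,t) = d(s,t)$.

```python
from collections import deque

def bfs_counts(adj, s):
    """Return (dist, sigma): BFS distances and shortest-path counts from s."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, nodes, r):
    """Normalized betweenness of r: sum over pairs s != t (both != r) of
    sigma_st(r) / sigma_st, divided by |V| * (|V| - 1)."""
    n = len(nodes)
    fwd = {u: bfs_counts(adj, u) for u in nodes}  # one BFS per vertex
    dist_r, sigma_r = fwd[r]
    total = 0.0
    for s in nodes:
        if s == r:
            continue
        dist_s, sigma_s = fwd[s]
        if r not in dist_s:          # r unreachable from s: no contribution
            continue
        for t in nodes:
            if t == r or t == s or t not in dist_s:
                continue
            # r lies on a shortest s-t path iff d(s,r) + d(r,t) == d(s,t);
            # then sigma_st(r) = sigma_sr * sigma_rt.
            if t in dist_r and dist_s[r] + dist_r[t] == dist_s[t]:
                total += sigma_s[r] * sigma_r[t] / sigma_s[t]
    return total / (n * (n - 1))
```

On the path a → b → c, only the pair (a, c) routes through b, so B(b) = 1/6 under this normalization.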
Let $P$ denote a simple path that starts at vertex $v_0$ and has $\ell$ edges, and let $v_0, v_1, \ldots, v_\ell$ denote the vertices in the order they appear on $P$. We define $\mathcal{P}_k$ as the set of all simple paths with at most $k$ edges. We say $\mathbb{1}_P(r)$ returns 1 if $r$ lies on $P$, and 0 otherwise. For a vertex $r$ in an unweighted graph $G$, its $k$-path centrality is defined as follows (Alahakoon:2011:KCN:1989656.1989657):
$$P_k(r) = \frac{1}{|V|} \sum_{P \in \mathcal{P}_k} \Pr[P] \cdot \mathbb{1}_P(r).$$
^1 The original definition presented in (Alahakoon:2011:KCN:1989656.1989657) does not include the normalization part $\frac{1}{|V|}$. Here, for consistency with the definitions of betweenness and coverage centralities, we use this normalized definition.
The $k$-path centrality of $r$ in a weighted graph is defined in a similar way (Alahakoon:2011:KCN:1989656.1989657). We omit it here due to space constraints.
3. Related work
Brandes (jrnl:Brandes) introduced an efficient algorithm for computing the betweenness centrality of all vertices, which runs in $O(|V||E|)$ and $O(|V||E| + |V|^2 \log |V|)$ time for unweighted and weighted networks with positive weights, respectively. The authors of (DBLP:conf/sdm/CatalyurekKSS13) presented compression and shattering techniques to improve the efficiency of Brandes's algorithm. In (jrnl:Everett) and (conf:cbcwsdm), the authors respectively studied group betweenness centrality and co-betweenness centrality, two natural extensions of betweenness centrality to sets of vertices. In (jrnl:Brandes3) and (proc:Bader), the authors proposed approximate algorithms based on selecting some source vertices, computing their dependency scores on the other vertices of the graph, and scaling the results. In the algorithm of Geisberger et al. (conf:Geisberger), the method for aggregating dependency scores is changed so that vertices do not profit from being near the selected source vertices. Chehreghani (DBLP:journals/cj/Chehreghani14) proposed a non-uniform sampler for unbiased estimation of the betweenness score of a vertex. Riondato and Kornaropoulos (Riondato2016) presented shortest-path samplers for estimating the betweenness centrality of all vertices or of the vertices that have the highest betweenness scores. Riondato and Upfal (RiondatoKDD20116) introduced the ABRA algorithm, which uses the Rademacher average to determine the number of required samples. Recently, Borassi and Natale (DBLP:conf/esa/BorassiN16) presented KADABRA, which is adaptive and uses balanced bidirectional BFS (bb-BFS) to sample shortest paths. Finally, in (DBLP:journals/corr/abs170808739) the authors presented exact and approximate algorithms for computing the betweenness centrality of one vertex or a small set of vertices in directed graphs. As discussed earlier, our betweenness estimation algorithm tries to combine the advantages of the algorithms presented in (DBLP:conf/esa/BorassiN16) and (DBLP:journals/corr/abs170808739). Yoshida (Yoshida:2014:ALA:2623330.2623626) presented an algorithm that estimates the coverage centrality of a vertex within an additive error $\epsilon$ with probability $1-\delta$. Alahakoon et al. (Alahakoon:2011:KCN:1989656.1989657) introduced $k$-path centrality and proposed the RA-kpath algorithm to estimate it. Mahmoody et al. (Mahmoody:2016:SBC:2939672.2939869) showed that $k$-path centrality admits a hyperedge sampler and proposed an algorithm that picks a source vertex uniformly at random, generates a random simple path of length at most $k$, and outputs the generated path as a hyperedge. The key difference between our algorithm and these two algorithms is that ours restricts the sample spaces of the source vertices and of the other vertices of the paths to the sets $\mathcal{RF}(r)$ and $\mathcal{D}(r)$, respectively. This considerably improves the error guarantee and the empirical efficiency of our algorithm.
4. Betweenness centrality
In this section, we present our adaptive approximate algorithm for estimating the betweenness centrality of a given vertex $r$ in a directed graph $G$. We start by introducing the sets $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$, which are used to define the sample spaces of the start-points and end-points of shortest paths.
Definition 4.1 ().
Let $G$ be a directed graph and $r, u \in V(G)$. We say $r$ is reachable from $u$ if there is a (directed) path from $u$ to $r$. The set of vertices from which $r$ is reachable is denoted by $\mathcal{RF}(r)$.
Definition 4.2 ().
Let $G$ be a directed graph and $r, u \in V(G)$. We say $r$ is reachable to $u$ if there is a (directed) path from $r$ to $u$. The set of vertices to which $r$ is reachable is denoted by $\mathcal{RT}(r)$.
For a given vertex $r$, $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ can be efficiently computed using the reverse graph.
Definition 4.3 ().
Let $G$ be a directed graph. The reverse graph of $G$, denoted by $G^R$, is a directed graph such that: (i) $V(G^R) = V(G)$, and (ii) $(u,v) \in E(G^R)$ if and only if $(v,u) \in E(G)$ (DBLP:journals/corr/abs170808739).
To compute $\mathcal{RF}(r)$, we act as follows (DBLP:journals/corr/abs170808739): (i) first, by flipping the direction of the edges of $G$, $G^R$ is constructed; (ii) then, if $G$ is weighted, the weights of the edges are ignored; (iii) finally, a breadth-first search (BFS) or a depth-first search (DFS) is performed on $G^R$ starting from $r$. All the vertices that are met during the BFS (or DFS), except $r$, are added to $\mathcal{RF}(r)$. To compute $\mathcal{RT}(r)$, we act as follows: (i) if $G$ is weighted, the weights of the edges are ignored; (ii) a BFS or a DFS is performed on $G$ starting from $r$. All the vertices that are met during the BFS (or DFS), except $r$, are added to $\mathcal{RT}(r)$. Both $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ can be computed in $O(|V| + |E|)$ time, for both unweighted and weighted graphs. Furthermore, we have the following lemma.
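The two traversals above can be sketched as follows. This is an illustrative implementation under our own representation (an edge list, with adjacency dicts built on the fly); the names `rf_rt` and `reachable` are ours.

```python
from collections import deque

def reachable(adj, r):
    """Vertices reachable from r via BFS on adj, excluding r itself."""
    seen = {r}
    q = deque([r])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                q.append(v)
    seen.discard(r)
    return seen

def rf_rt(edges, r):
    """Compute RF(r) (vertices from which r is reachable, via a BFS on the
    reverse graph G^R) and RT(r) (vertices reachable from r, via a BFS on G).
    Both traversals run in O(|V| + |E|) time."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)    # G
        radj.setdefault(v, []).append(u)   # reverse graph G^R
    return reachable(radj, r), reachable(adj, r)
```

For the edges a → b, b → c and d → b, we get RF(b) = {a, d} and RT(b) = {c}.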
Lemma 4.4 ().
Given a directed graph $G$ and $r \in V(G)$, the exact betweenness score of $r$ can be computed as follows:
$$B(r) = \frac{1}{|V|(|V|-1)} \sum_{s \in \mathcal{RF}(r)} \sum_{t \in \mathcal{RT}(r)} \frac{\sigma_{st}(r)}{\sigma_{st}}.$$
Lemma 4.4 says that in order to compute the betweenness score of $r$, we only need to consider those shortest paths that start from a vertex in $\mathcal{RF}(r)$ and end at a vertex in $\mathcal{RT}(r)$ (and check which ones pass over $r$). This means that methods which sample $s$ and $t$ uniformly at random from $V(G)$ (Riondato2016; RiondatoKDD20116; DBLP:conf/esa/BorassiN16) can be revised to sample $s$ and $t$ from $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$, respectively. In the current paper, we revise the KADABRA algorithm (DBLP:conf/esa/BorassiN16) by restricting $s$ and $t$ to $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$, respectively. We then analyze the resulting algorithm and show that it remains adaptive while requiring far fewer samples than KADABRA to give an $(\epsilon,\delta)$-approximation.
Algorithm 1 shows the high-level pseudocode of our proposed algorithm, called ABAD^2. The input parameters of this algorithm are a directed graph $G$, a vertex $r$ for which we want to estimate the betweenness score, and real values $\epsilon, \delta \in (0,1)$ used to determine the error bound. ABAD first computes $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$. Then, at each iteration of the loop in Lines 8–16, ABAD picks vertices $s \in \mathcal{RF}(r)$ and $t \in \mathcal{RT}(r)$ uniformly at random. Then, it picks a shortest path $p$, uniformly at random, among all shortest paths from $s$ to $t$. To do so, it uses balanced bidirectional breadth-first search (bb-BFS), where a BFS is performed from $s$ and another BFS from $t$ at the same time, stopping as soon as the two BFSs touch each other (bidirectionalsearch). Finally, if $r$ is on $p$, ABAD sets the betweenness estimate of iteration $i$, $X_i$, to $\frac{|\mathcal{RF}(r)| \cdot |\mathcal{RT}(r)|}{|V|(|V|-1)}$; otherwise, $X_i$ is 0. The final estimation of the betweenness score of $r$ is the average of the estimates over all iterations.
^2 ABAD is an abbreviation for Adaptive Betweenness Approximation algorithm for Directed graphs.
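The sampling iteration can be sketched as follows. This is only an illustration under our own assumptions: we use a plain forward BFS instead of the paper's bb-BFS, a fixed iteration count instead of the adaptive Stop condition, and our own function names; the backward random walk weighted by shortest-path counts makes every shortest path equally likely.

```python
import random
from collections import deque

def sample_shortest_path(adj, s, t):
    """Uniformly sample one shortest s-t path; return None if t is unreachable.
    Plain BFS is used for clarity (the paper uses bb-BFS for speed)."""
    dist, sigma, preds = {s: 0}, {s: 1}, {}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                preds.setdefault(v, []).append(u)
    if t not in dist:
        return None
    # Walk back from t, picking predecessor u with probability sigma[u]/sigma[v];
    # this samples each shortest s-t path with equal probability.
    path, v = [t], t
    while v != s:
        us = preds[v]
        v = random.choices(us, weights=[sigma[u] for u in us])[0]
        path.append(v)
    return path[::-1]

def abad_estimate(adj, RF, RT, r, n, iterations):
    """Average of per-iteration estimates X_i = alpha * 1[r on sampled path]."""
    alpha = len(RF) * len(RT) / (n * (n - 1))
    total = 0.0
    for _ in range(iterations):
        s = random.choice(sorted(RF))
        t = random.choice(sorted(RT))
        p = sample_shortest_path(adj, s, t)
        if p is not None and r in p:
            total += alpha
    return total / iterations
```

On the chain a → b → c with r = b, every sample is the path (a, b, c), so the estimate is exactly alpha = 1/6, matching the brute-force value of B(b).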
Let $\tau$ be the value that Algorithm 1 reaches at the end of the iterations done in Lines 8–16. The value of $\tau$, i.e., the stopping condition of the sampling part of Algorithm 1, is determined adaptively and depends on the samples observed so far, $\epsilon$ and $\delta$. The dependence on $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ can be expressed in terms of the parameter $\alpha$, defined as follows:
$$\alpha = \frac{|\mathcal{RF}(r)| \cdot |\mathcal{RT}(r)|}{|V|(|V|-1)}.$$
In Theorem 4.8, we discuss the method Stop, which defines the stopping condition. Before that, in Lemmas 4.5 and 4.6, we investigate the expected value and the variance of the $X_i$'s, which are used by Theorem 4.8.

Lemma 4.5 ().
In Algorithm 1, we have: $\mathbb{E}[X_i] = B(r)$.
Proof.
Lemma 4.6 ().
In Algorithm 1, for each $i$ we have: $\mathrm{Var}[X_i] = B(r)\left(\alpha - B(r)\right) \leq \alpha B(r)$.
Proof.
We have:
∎
Theorem 4.7 ().
(Theorem 6.1 of (chung2006).) Let $M$ be a constant and let $X_0, X_1, \ldots, X_n$ be a martingale, associated with a filter $\mathcal{F}$, that satisfies the following: (i) $\mathrm{Var}[X_i \mid \mathcal{F}_{i-1}] \leq \sigma_i^2$, for $1 \leq i \leq n$, and (ii) $X_i - X_{i-1} \leq M$, for $1 \leq i \leq n$. We have
(1) $\Pr\left[X_n - X_0 \geq \lambda\right] \leq \exp\left(\frac{-\lambda^2}{2\left(\sum_{i=1}^{n} \sigma_i^2 + M\lambda/3\right)}\right).$
Let $\epsilon$ and $\delta$ be real numbers in $(0,1)$, and assume that $\omega$ is defined as
$$\omega = \frac{c}{\epsilon^2}\left(\lfloor \log_2 (VD(G) - 2) \rfloor + 1 + \ln \frac{1}{\delta}\right),$$
where $c$ is a universal positive constant estimated to be approximately $0.5$ (Loffler2009). By the results in (Riondato2016), in Algorithm 1 and after $\omega$ samples (i.e., $\tau = \omega$), the estimation error of the betweenness score of $r$ is bounded by $\epsilon$, with probability $1-\delta$. For $\tau < \omega$, we have the following theorem.
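The sample budget above can be sketched numerically as follows, assuming the value $c \approx 0.5$ reported in the text; the function name is our own.

```python
import math

def max_samples(eps, delta, vd, c=0.5):
    """Upper bound omega on the number of samples (Riondato2016):
    omega = (c / eps^2) * (floor(log2(vd - 2)) + 1 + ln(1/delta)),
    where vd is the vertex diameter and c ~ 0.5 is a universal constant."""
    return math.ceil((c / eps ** 2) *
                     (math.floor(math.log2(vd - 2)) + 1 + math.log(1 / delta)))
```

For instance, with eps = 0.1, delta = 0.1 and vertex diameter 6, this gives 266 samples; tightening eps grows the budget quadratically.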
Theorem 4.8 ().
Proof.
^3 Similar to the proof of Theorem 9 of (DBLP:conf/esa/BorassiN16), the proof of Theorem 4.8 of the current paper is based on Theorem 6.1 of (chung2006). The key difference is that in Theorem 9 of (DBLP:conf/esa/BorassiN16), the variance of each random variable is bounded in terms of the unrestricted sample space, whereas in our theorem the variance of each random variable is the value presented in Lemma 4.6, where we use $\alpha B(r)$ as an upper bound.
In the following we prove that Equation 2 holds. The correctness of Equation 3 can be proven in a similar way. We define $Y_i$ as $X_i - B(r)$ and the martingale $Z_j$ as $\sum_{i=1}^{j} Y_i$. Using Theorem 4.7, we get:
(4) 
Furthermore, Lemma 4.6 yields:
If in Equation 4 we use this upper bound on the sum of variances, we get:
(5) 
Setting the right-hand side of Equation 5 equal to $\delta$ yields:
(6) 
Parameter $\lambda$ should not be expressed in terms of $B(r)$, as the latter is unknown. Therefore, in Equation 6 we should find its value in terms of the other parameters. To do so, we put the obtained value of $\lambda$ into Equation 5 and obtain:
(7) 
After solving the quadratic equation (with respect to $B(r)$) of the event inside Equation 7 and some simplifications, we get Equation 2. ∎
Now the definition of the Stop method can be derived from Theorem 4.8. This theorem implies that, for the defined value of $\tau$, the probability that the estimate violates the lower bound of Equation 2 is at most $\delta/2$, and the probability that it violates the upper bound of Equation 3 is at most $\delta/2$. Therefore, using the union bound, in order to have an $(\epsilon,\delta)$-approximation for the given values of $\epsilon$ and $\delta$, at the beginning of each iteration of the loop in Lines 8–16 of Algorithm 1, the current value of the random variable $\tau$ should either be equal to $\omega$, or it should make both the bound of Equation 2 and the bound of Equation 3 less than or equal to $\epsilon$ (each with probability parameter $\delta/2$). If these conditions are satisfied, the Stop method returns true and the loop terminates; otherwise, more samples are required, hence the Stop method returns false.
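A stopping rule of this flavor can be sketched as follows. This is only an illustration under our own simplifying assumptions, not the paper's Equations 2 and 3: we solve the tail bound of Theorem 4.7, $\exp(-\lambda^2 / (2(\sum \sigma_i^2 + M\lambda/3))) = \delta$, for $\lambda$, plugging in the variance bound of Lemma 4.6 with the running estimate `b_est` in place of the unknown $B(r)$, and $M = \alpha$; all function names are ours.

```python
import math

def deviation_bound(tau, alpha, b_est, delta):
    """One-sided deviation bound on the sample mean after tau samples,
    obtained by solving exp(-lam^2 / (2*(S + M*lam/3))) = delta for lam,
    with per-sample variance bound alpha*b - b^2 and difference bound M = alpha."""
    L = math.log(1.0 / delta)
    S = tau * (alpha * b_est - b_est ** 2)   # sum of per-sample variance bounds
    M = alpha
    lam = M * L / 3 + math.sqrt((M * L / 3) ** 2 + 2 * S * L)
    return lam / tau                          # deviation of the mean

def should_stop(tau, omega, alpha, b_est, eps, delta):
    """Stop when the maximum sample budget omega is reached, or when the
    deviation bound (failure probability delta/2 per side) is within eps."""
    if tau >= omega:
        return True
    return deviation_bound(tau, alpha, b_est, delta / 2) <= eps
```

As expected, the bound shrinks roughly like $1/\sqrt{\tau}$, so the rule eventually fires well before the worst-case budget $\omega$ when $\alpha$ is small.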
The main difference between the lower and upper bounds presented in Inequalities 2 and 3 and those presented in (DBLP:conf/esa/BorassiN16) is that in (DBLP:conf/esa/BorassiN16), $\alpha$ is replaced by 1. Since $\alpha \leq 1$ for any given $\epsilon$ and $\delta$, and in most cases $\alpha \ll 1$ (for example, in our extensive experiments reported in Section 6, $\alpha$ is always much smaller than 1!), the number of samples (iterations) required by our algorithm is much less than the number of samples required by, e.g., KADABRA (DBLP:conf/esa/BorassiN16).
Complexity analysis
For unweighted graphs, each iteration of the loop in Lines 8–16 of Algorithm 1 takes $O(|V| + |E|)$ time. For weighted graphs with positive weights, it takes $O(|E| + |V| \log |V|)$ time (and for weighted graphs with negative weights, the problem is NP-hard). This is the same as the per-sample time complexity of the existing algorithms (Riondato2016; DBLP:conf/esa/BorassiN16). In a more precise analysis, when balanced bidirectional breadth-first search (bb-BFS) (bidirectionalsearch) is used instead of breadth-first search (BFS) to sample a shortest path, the time complexity of each iteration is further improved by a factor depending on $b$, the maximum degree of the graph (bidirectionalsearch). In Algorithm 1, the number of iterations is determined adaptively and, as discussed above, is considerably less than the number of samples required by the most efficient existing algorithms. The space complexity of our algorithm is $O(|V| + |E|)$.
Coverage centrality
ABAD can be revised to compute related indices such as coverage centrality (Yoshida:2014:ALA:2623330.2623626) and stress centrality. To compute the coverage centrality of $r$, Lines 11–14 of Algorithm 1 are replaced by the following lines:
In other words, instead of sampling a shortest path between $s$ and $t$, we check whether $r$ is on some shortest path between them, which can be done by conducting a bb-BFS from $s$ to $t$. If $r$ is met during this procedure, it is on some shortest path from $s$ to $t$; otherwise, it is not. In a way similar to Lemma 4.5, we can show that this method gives an unbiased estimation of the coverage centrality of $r$. Moreover, similar to Theorem 4.8, we can define the stopping conditions for estimating coverage centrality.
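The membership check can be sketched with two plain BFS traversals instead of a bb-BFS (a simplification of ours, for clarity): $r$ lies on some shortest $s$-$t$ path iff $d(s,r) + d(r,t) = d(s,t)$.

```python
from collections import deque

def bfs_dist(adj, s):
    """BFS distances from s in an unweighted digraph."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def on_some_shortest_path(adj, s, t, r):
    """True iff r lies on at least one shortest s-t path:
    d(s,r) + d(r,t) == d(s,t)."""
    ds = bfs_dist(adj, s)   # distances from s
    dr = bfs_dist(adj, r)   # distances from r
    return (t in ds and r in ds and t in dr
            and ds[r] + dr[t] == ds[t])
```

On a → b → c the check succeeds for r = b, but adding a shortcut edge a → c makes it fail, since b is then off every shortest a-c path.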
5. $k$-path centrality
In this section, we present the APAD algorithm for estimating the $k$-path centrality of a given vertex $r$ in a directed graph $G$. We define the domain of a vertex $r$, denoted by $\mathcal{D}(r)$, as $\mathcal{RF}(r) \cup \{r\} \cup \mathcal{RT}(r)$. Furthermore, we say a path $P$ belongs to $\mathcal{D}(r)$, and denote it by $P \in \mathcal{D}(r)$, iff every vertex of $P$ is in $\mathcal{D}(r)$. It is easy to see that for directed graphs, the first vertex of each path that contributes to the score of $r$ must belong to $\mathcal{RF}(r) \cup \{r\}$ and all its vertices must belong to $\mathcal{D}(r)$; otherwise, $\mathbb{1}_P(r)$ will be zero and hence $P$ will have no contribution to $P_k(r)$. This motivates us to present the following equivalent definition of $k$-path centrality.
(8) 
Note that in Equation 8, one may decide to change the definition of $\Pr[P]$ so that paths are generated only inside $\mathcal{D}(r)$. The algorithm we propose works with both definitions of $\Pr[P]$.
Algorithm 2 shows the high-level pseudocode of APAD. It first computes $\mathcal{RF}(r)$ and $\mathcal{D}(r)$. Then it starts the sampling part, where at each iteration it selects a source vertex $s$ and a length $\ell$ uniformly at random from $\mathcal{RF}(r) \cup \{r\}$ and $\{1, \ldots, k\}$, respectively, and then a path $P$ from the set of all paths that belong to $\mathcal{D}(r)$, start at $s$ and have $\ell$ edges. The estimate of iteration $i$ is $X_i$, and the final estimate is the average of all $X_i$'s. While $s$ and $\ell$ are chosen uniformly at random, $P$ is chosen as follows. First, $P$ is initialized with $s$. Then, at each step, let $v_j$ be the last vertex added to the current $P$. We add to $P$ a vertex chosen uniformly at random from the out-neighbors of $v_j$ that are in $\mathcal{D}(r)$ and not already in $P$. This procedure yields the following definition of $\Pr[P]$: the product, over the steps of this construction, of the inverse of the number of such candidate neighbors.
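The path-growing step can be sketched as follows. This is an illustrative sketch under our own assumptions (function name and the handling of paths that get stuck before reaching $\ell$ edges are ours; the paper's treatment of truncated paths may differ).

```python
import random

def sample_path(adj, domain, s, ell):
    """Grow a random simple path of at most ell edges inside `domain`,
    starting at s.  Returns (path, prob), where prob is the probability
    of generating exactly this path: at every step the next vertex is
    chosen uniformly among out-neighbors of the last vertex that lie in
    the domain and are not already on the path."""
    path, prob = [s], 1.0
    for _ in range(ell):
        cands = [v for v in adj.get(path[-1], ())
                 if v in domain and v not in path]
        if not cands:        # stuck: no admissible extension
            break
        prob *= 1.0 / len(cands)
        path.append(random.choice(cands))
    return path, prob
```

The returned probability is what an inverse-probability-weighted estimator would divide by when the sampled path covers the vertex of interest.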
Lemma 5.1 ().
In Algorithm 2, we have: $\mathbb{E}[X_i] = P_k(r)$.
Proof.
In the rest of this section, we derive error bounds for our estimation of $P_k(r)$. Before that, we define $\beta$ as the ratio $\frac{|\mathcal{RF}(r)|}{|V|}$.
Theorem 5.2 ().
Proof.
Our proof is based on the Hoeffding inequality (Hoeffding:1963). Let $X_1, \ldots, X_T$ be independent random variables bounded by the interval $[a, b]$, and let $\bar{X} = \frac{1}{T}\sum_{i=1}^{T} X_i$. The Hoeffding inequality (Hoeffding:1963) states that for any $\epsilon > 0$, the following holds:
$$\Pr\left[\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \geq \epsilon\right] \leq 2 \exp\left(\frac{-2T\epsilon^2}{(b-a)^2}\right).$$
For each sampled path $P$, using either of the two aforementioned definitions of $\Pr[P]$, the resulting estimate is bounded; this means each $X_i$ lies in a bounded interval $[a, b]$. As a result, we can apply the Hoeffding inequality to the random variables $X_1, \ldots, X_T$. This yields
(9) $\Pr\left[\left|\bar{X} - P_k(r)\right| \geq \epsilon\right] \leq 2 \exp\left(\frac{-2T\epsilon^2}{(b-a)^2}\right),$
which means that if $T \geq \frac{(b-a)^2 \ln(2/\delta)}{2\epsilon^2}$, Algorithm 2 estimates the $k$-path score of $r$ within an additive error $\epsilon$ with probability at least $1-\delta$. ∎
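The sample-size consequence of the Hoeffding bound can be sketched as a small helper (the function name is ours):

```python
import math

def hoeffding_samples(eps, delta, lo, hi):
    """Smallest T with 2*exp(-2*T*eps^2 / (hi - lo)^2) <= delta, i.e.
    T >= (hi - lo)^2 * ln(2/delta) / (2 * eps^2)."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * eps ** 2))
```

For estimates in [0, 1] with eps = 0.1 and delta = 0.05 this gives 185 samples; shrinking the range of the per-sample estimates (as the domain restriction does) shrinks the budget quadratically.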
The difference between the error bounds presented in Theorem 5.2 and those that can be obtained for RA-kpath (Alahakoon:2011:KCN:1989656.1989657) is that, on the one hand, for the same error guarantee our algorithm requires fewer samples, by a factor governed by $\beta$. On the other hand, while in our algorithm each iteration (sample) processes only the restricted domain $\mathcal{D}(r)$, in RA-kpath it may process the whole graph. As a result, for the same error guarantee, our algorithm is considerably faster than RA-kpath. Let $E_{\mathcal{D}}$ be the set of edges of $G$ whose both endpoints are in $\mathcal{D}(r)$. While the space complexity of our algorithm is $O(|\mathcal{D}(r)| + |E_{\mathcal{D}}|)$, it is $O(|V| + |E|)$ for RA-kpath. Note that for vertices of real-world graphs, $\beta$ is usually considerably less than 1, and $|\mathcal{D}(r)|$ and $|E_{\mathcal{D}}|$ are respectively considerably smaller than $|V|$ and $|E|$. Similar results hold for our algorithm against the algorithm of (Mahmoody:2016:SBC:2939672.2939869), as it uses the same sampling strategy as RA-kpath.
In the last part of this section, similar to the case of betweenness centrality, in Theorem 5.4 we discuss how the stopping condition of Algorithm 2 can be determined adaptively. The proof of this theorem is similar to the proof of Theorem 4.8 (hence, we omit it). There are, however, two key differences between Theorems 4.8 and 5.4. First, as shown in Lemma 5.3, in Theorem 5.4 the variance of each $X_i$ is bounded differently from the bound of Lemma 4.6 used in Theorem 4.8. Second, as shown in Theorem 5.2, when the number of samples is $\frac{(b-a)^2 \ln(2/\delta)}{2\epsilon^2}$, the estimation error of the $k$-path score of $r$ is bounded by $\epsilon$, with probability $1-\delta$. We refer to this quantity as $\omega'$ and use it in Theorem 5.4 as the upper bound on the required number of samples.
Lemma 5.3 ().
In Algorithm 2, for each $i$ we have:
Proof.
(10) 
∎
Note that Lemma 5.3 holds for both definitions of $\Pr[P]$.
In Algorithm 2, if the number of samples is equal to $\omega'$, then with probability at least $1-\delta$ the estimation error is at most $\epsilon$. For the case of fewer than $\omega'$ samples, we present the following theorem.
6. Experimental results
We perform experiments over several real-world networks to assess the quantitative and qualitative behavior of our proposed algorithms, ABAD and APAD. We test the algorithms over several real-world datasets from the SNAP repository^4, including the com-amazon network (DBLP:conf/icdm/YangL12), the com-dblp co-authorship network (DBLP:conf/icdm/YangL12), the email-EuAll email communication network (DBLP:journals/tkdd/LeskovecKF07), the p2p-Gnutella31 peer-to-peer network (DBLP:journals/tkdd/LeskovecKF07) and the web-NotreDame web graph (albert1999dww). All the networks are treated as directed graphs^5. For a vertex $v$, its empirical approximation error (in percent) is defined as
$$\frac{|\tilde{B}(v) - B(v)|}{B(v)} \times 100,$$
where $\tilde{B}(v)$ is the calculated approximate score and $B(v)$ is its exact score.
^4 https://snap.stanford.edu/data/
^5 The approximate algorithms can be successfully run over larger networks. Here, however, the bottleneck is running the exact algorithm for measuring the errors!
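Assuming the error above is the relative error in percent (consistent with the values reported in the tables below), it corresponds to the following one-liner; the function name is ours.

```python
def relative_error_pct(approx, exact):
    """Empirical approximation error in percent: |approx - exact| / exact * 100."""
    return abs(approx - exact) / exact * 100.0
```

For example, an estimate of 1.1 against an exact score of 1.0 is a 10% error.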
6.1. Betweenness centrality
In this section, we present the empirical results of ABAD. We compare ABAD against the most efficient existing algorithms, namely KADABRA (DBLP:conf/esa/BorassiN16) and BCD (DBLP:journals/corr/abs170808739). The stopping condition used in KADABRA to determine the number of samples can be set either to estimate the betweenness centrality of (a subset of size $k$ of) the vertices or to find the top-$k$ vertices that have the highest betweenness scores. Here, for consistency with the stopping condition used by Algorithm 1, we use the setting of KADABRA that estimates the betweenness scores of (a subset of size $k$ of) the vertices. Furthermore, KADABRA has a parameter $k$ that determines the number of vertices for which we want to estimate the betweenness score. By increasing the value of $k$, KADABRA becomes slower. In our experiments, we set $k$ to 10.^6 BCD is not adaptive and does not automatically compute the number of samples required for the given values of $\epsilon$ and $\delta$. Furthermore, unlike ABAD and KADABRA, BCD is a (source-)vertex sampler algorithm; hence, it requires much more time to process each sample. Thus, for BCD we cannot use the same number of samples as ABAD (or KADABRA). In order to have fair comparisons, in our tests we let BCD take as many samples as it can process within the running time of ABAD. As a result, while ABAD and BCD have (almost) the same running times, the number of samples of BCD is much smaller.
^6 For a given value of $k$, KADABRA estimates the betweenness scores of $k$ vertices. On the one hand, when we set $k$ to 1, over several datasets our chosen vertices are not among the vertices for which betweenness centrality is computed. On the other hand, as mentioned before, by increasing the value of $k$, KADABRA becomes slower. Therefore, we set $k$ to 10, where in most cases our chosen vertices are among the vertices for which KADABRA estimates betweenness centrality and yet $k$ is not large.
Dataset | ε | α (avg / max) | ABAD time (avg / max) | ABAD error (avg / max) | KADABRA time | KADABRA error (avg / max) | BCD error (avg / max)
com-amazon | 0.001 | 0.0000027 / 0.0000044 | 0.48 / 0.55 | 0.45 / 1.19 | 1.32 | 29.06 / 29.06 | 1.28 / 1.99
 | 0.00075 | | 0.71 / 0.86 | 0.36 / 0.73 | 2.01 | 28.37 / 56.49 | 1.46 / 3.90
 | 0.0005 | | 1.39 / 1.7 | 0.41 / 1.04 | 3.65 | 26.22 / 63.14 | 1.09 / 2.25
com-dblp | 0.001 | 0.0077 / 0.0094 | 5.54 / 6.72 | 3.93 / 7.84 | 17.55 | 9.78 / 16.99 | 6.02 / 15.59
 | 0.00075 | | 9.31 / 11.46 | 3.76 / 7.08 | 28.79 | 9.16 / 14.31 | 4.46 / 6.42
 | 0.0005 | | 19.07 / 23.15 | 3.41 / 7.46 | 54.46 | 8.49 / 14.98 | 3.47 / 4.48
email-EuAll | 0.001 | 0.0012 / 0.0013 | 1.50 / 1.79 | 1.10 / 2.01 | 1.33 | 16.70 / 23.02 | 2.63 / 6.60
 | 0.00075 | | 2.47 / 2.91 | 2.00 / 3.73 | 2.05 | 8.67 / 11.94 | 3.51 / 10.13
 | 0.0005 | | 5.11 / 6.27 | 2.33 / 4.95 | 3.85 | 5.76 / 9.76 | 2.04 / 3.89
p2p-Gnutella31 | 0.001 | 0.01 / 0.01 | 1.36 / 1.59 | 2.71 / 4.24 | 12.44 | 4.94 / 13.98 | 5.88 / 11.55
 | 0.00075 | | 2.27 / 2.69 | 2.36 / 4.32 | 19.71 | 2.93 / 7.46 | 6.18 / 10.62
 | 0.0005 | | 4.75 / 5.57 | 1.16 / 2.31 | 38.24 | 2.52 / 4.73 | 4.29 / 9.28
web-NotreDame | 0.001 | 0.0021 / 0.0033 | 0.91 / 1.25 | 2.17 / 4.45 | 2.80 | 3.59 / 7.14 | 4.59 / 9.75
 | 0.00075 | | 1.5105 / 2.115 | 1.60 / 2.81 | 4.47 | 4.21 / 7.85 | 1.83 / 4.38
 | 0.0005 | | 2.98 / 4.11 | 1.15 / 2.38 | 8.74 | 2.59 / 4.83 | 2.27 / 5.54
ABAD and BCD require specifying the vertex for which we want to estimate the betweenness score. For each dataset, we choose the five vertices that have the highest betweenness scores and run ABAD and BCD for each of them. Choosing the vertices that have the highest betweenness scores has two reasons. First, the exact betweenness of vertices that have a small $|\mathcal{RF}(r)|$ (or $|\mathcal{RT}(r)|$) can be computed very efficiently (DBLP:journals/corr/abs170808739); therefore, using approximate algorithms for them is not very meaningful. In fact, by choosing vertices with high betweenness scores we guarantee that the chosen vertices have a large enough $|\mathcal{RF}(r)|$ (or $|\mathcal{RT}(r)|$), so that it makes sense to run approximate algorithms for them. Second, these vertices usually have a higher $\alpha$ than the other vertices; hence, it is likely that ABAD will estimate the betweenness centrality of the other vertices even more efficiently. For instance, in web-NotreDame, the five randomly chosen vertices used in the experiments of (DBLP:journals/corr/abs170808739) have $\alpha$ values much smaller than those of the vertices with the highest betweenness scores (see Table 1). Therefore, ABAD will have a much better performance for those vertices.
Table 1 reports the experimental results. In each experiment, the bold value shows the algorithm with the lowest error. In all the experiments, we set $\delta$ to the same value for both ABAD and KADABRA. Since the behavior of these algorithms depends on the value of $\epsilon$, we run them for three different values of $\epsilon$: 0.001, 0.00075 and 0.0005. Over com-amazon, some vertices are not among the vertices for which KADABRA computes the betweenness score; hence, they do not contribute to the reported results. Since the running time of BCD is very close to the running time of ABAD (as we set it so), we do not report it in Table 1.
Our results show that ABAD considerably outperforms both KADABRA and BCD. It always works better than KADABRA; furthermore, in most cases (38 out of 42) it is more accurate than BCD. This is due to the very low values of $\alpha$ that vertices of real-world graphs have. In our experiments, $\alpha$ is always much smaller than 1 (see Table 1). This considerably restricts the search space and hence improves the efficiency of ABAD. The vertices tested in our experiments are those with the highest betweenness scores, which implies that, compared to the other vertices, they usually have a high $\alpha$. As a result, the relative efficiency of ABAD with respect to KADABRA will be even better for the other vertices of the graphs (those that do not have a very high betweenness score). It should also be noted that for a vertex $r$, the sets $\mathcal{RF}(r)$ and $\mathcal{RT}(r)$ are computed as part of estimating its score; therefore, the computation of these sets does not impose any extra cost on ABAD.
Dataset | samples | β (avg / max) | APAD time (avg / max) | APAD error (avg / max) | RA-kpath time | RA-kpath error (avg / max)
com-amazon | 50000 | 0.0176 / 0.0375 | 0.6202 / 0.6271 | 9.4273 / 20.3553 | 0.6201 | 17.3874 / 35.9696
 | 100000 | | 1.4710 / 1.4945 | 8.7553 / 16.9069 | 1.4483 | 23.5959 / 47.8998
 | 500000 | | 6.1657 / 6.3111 | 8.5740 / 16.2172 | 6.0273 | 15.0688 / 28.1798
com-dblp | 50000 | 0.2984 / 0.5837 | 0.6120 / 1.4015 | 8.3598 / 14.0485 | 0.6990 | 18.8275 / 39.3276
 | 100000 | | 1.4015 / 1.4227 | 8.3598 / 13.5833 | 1.3864 | 18.8275 / 56.3778
 | 500000 | | 6.0236 / 6.1431 | 7.3327 / 10.8506 | 5.8907 | 7.1508 / 12.7557
email-EuAll | 50000 | 0.3057 / 0.4563 | 0.4517 / 0.4570 | 4.4155 / 9.9514 | 0.4409 | 9.0128 / 26.2509
 | 100000 | | 1.0388 / 1.0904 | 3.2181 / 7.6999 | 1.0128 | 5.4020 / 13.1260
 | 500000 | | 4.3868 / 4.4107 | 1.3479 / 3.8887 | 4.3044 | 2.3709 / 3.8737
p2p-Gnutella31 | 50000 | 0.2050 / 0.5776 | 0.1666 / 0.1715 | 10.1815 / 26.4400 | 0.1633 | 25.1530 / 56.1906
 | 100000 | | 0.7527 / 0.3947 | 9.8433 / 24.2559 | 0.3436 | 12.8763 / 24.9524
 | 500000 | | 1.5625 / 1.5896 | 6.5869 / 20.6747 | 1.5652 | 20.1032 / 32.5738
web-NotreDame | 50000 | 0.3520 / 0.7182 | 0.6116 / 0.6152 | 4.6194 / 9.4785 | 0.6033 | 7.8810 / 13.2039
 | 100000 | | 1.2208 / 1.2331 | 3.4717 / 11.1948 | 1.2108 | 2.3478 / 5.8867
 | 500000 | | 5.9969 / 6.0496 | 2.9064 / 4.9510 | 6.0111 | 2.2955 / 4.9636
One may object that ABAD (like BCD) estimates the betweenness score of only one vertex, whereas KADABRA can estimate the betweenness scores of all vertices. However, as mentioned before, in several applications it is sufficient to compute this index for only one vertex or a few vertices. Even when an application requires the betweenness scores of a set of vertices, ABAD can still be useful. In this case, after computing the required sets for all the vertices in the set in parallel, one may run the sampling part of Algorithm 1 (Lines 8-16 of the algorithm) for all the vertices in the set in parallel, as is done in the existing algorithms. Note that even if ABAD is run independently for each vertex in the set, in many cases it outperforms KADABRA. For example, consider our experiments, where we run ABAD for the five vertices with the highest betweenness scores. For each dataset, by simply summing the running times and the numbers of samples over the five vertices, we obtain the running time and the number of samples of independent runs of ABAD for all five vertices. Even in this case, the number of samples used by ABAD to estimate the betweenness centrality of all five vertices is considerably (at least 10 times!) smaller than the number of samples used by KADABRA. Moreover, in most cases ABAD is more accurate than KADABRA. Finally, while over p2p-Gnutella31 ABAD takes less time to compute the betweenness scores of all five vertices, over the other datasets KADABRA is faster. A reason for KADABRA's faster performance over com-amazon, com-dblp, email-EuAll and web-NotreDame (despite the fact that it uses many more samples) is that it processes most of these samples in parallel, whereas in the independent runs of ABAD, fewer samples are processed in parallel. The fact that ABAD uses considerably fewer samples suggests that if ABAD estimated the betweenness scores of all vertices of a set in parallel, it might always be faster than KADABRA.
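The independent-runs scheme above is embarrassingly parallel, so one natural realization is to map a per-vertex estimator over a worker pool. In this sketch, `estimate_score` is a hypothetical stand-in for the single-vertex sampling routine (e.g., the sampling part of Algorithm 1):

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_scores_parallel(vertices, estimate_score, max_workers=4):
    """Run a single-vertex estimator independently for several vertices.

    estimate_score(v) is assumed to return the estimated score of v;
    since the runs are independent, they can simply be mapped over a pool.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(estimate_score, vertices))
    return dict(zip(vertices, results))
```

Note that this parallelizes only across vertices; a fully parallel variant would additionally process the samples of each vertex in parallel.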
Developing a fully parallel version of ABAD that estimates betweenness scores of a set of vertices in parallel is an interesting direction for future work.
6.2. Path centrality
In this section, we present the empirical results of APAD. We compare APAD against RA-kpath (Alahakoon:2011:KCN:1989656.1989657), which samples a vertex, a path length, and a random path of that length starting from the sampled vertex. This sampling method has also been used in other work, including (Mahmoody:2016:SBC:2939672.2939869). The error bound presented in (Alahakoon:2011:KCN:1989656.1989657) is not adaptive and has only one parameter (playing a role different from the parameters used in this paper), instead of the two parameters, ε and δ, that APAD has. This makes a comparison of the methods slightly challenging. In order to have a fair comparison (regardless of the method used to determine the number of samples for an (ε,δ)-approximation), we run each algorithm for a fixed number of iterations and report the empirical approximation error and running time. APAD requires specifying the vertex for which path centrality is to be computed. In order to keep the experiments consistent with Section 6.1, in this section we use the same vertices as in the evaluation of ABAD. We repeat each experiment three times and report the average results.
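For intuition, the sampling step of this scheme can be sketched as follows. This is a deliberately simplified, unweighted version (the names are ours, and the weighting and scaling used by the actual RA-kpath estimator are omitted): repeatedly draw a start vertex and a length up to k, grow a random simple path, and count visits.

```python
import random

def sample_kpath_counts(adj, k, num_samples, rng=None):
    """Count, per vertex, how often it is visited by random simple paths.

    Each sample draws a start vertex uniformly, a length uniformly from
    1..k, and grows a random simple path by picking unvisited
    out-neighbours; the start vertex itself is not counted.
    """
    rng = rng or random.Random(0)
    vertices = list(adj)
    counts = {u: 0 for u in vertices}
    for _ in range(num_samples):
        u = rng.choice(vertices)
        length = rng.randint(1, k)
        path = [u]
        for _ in range(length):
            unvisited = [w for w in adj.get(u, ()) if w not in path]
            if not unvisited:
                break                # dead end: the path stops early
            u = rng.choice(unvisited)
            path.append(u)
        for w in path[1:]:
            counts[w] += 1
    return counts
```

The raw counts would still have to be rescaled to obtain a centrality estimate; we omit that step since it depends on the exact estimator.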
Table 2 reports the experimental results of path centrality. As can be seen in the table, in most cases APAD shows a better (lower) average error and maximum error than RA-kpath. This is particularly evident over the datasets (e.g., com-amazon) whose vertices have a low parameter value, which empirically validates our theoretical analysis of the connection between this parameter and the efficiency of APAD. Note that, as mentioned before, RA-kpath computes the path centrality of all vertices, whereas APAD requires specifying the vertex for which the centrality score is computed. As a result, for different given vertices, a single run of RA-kpath is sufficient, while a separate run of APAD is required for each vertex. Therefore, in Table 2, for each dataset and sample size we report only one running time for RA-kpath; in contrast, for each dataset and sample size we have five running times for APAD, whose average and maximum values are reported. The times reported for APAD include both computing the reachable set and estimating the path score. These times are only slightly larger than the times reported for RA-kpath, which means that the reachable set of a vertex can be computed very efficiently, in a time negligible compared to the running time of the whole process.
7. Conclusion
In this paper, we first presented a novel adaptive algorithm for estimating the betweenness score of a vertex in a directed graph. Our algorithm computes the set of vertices that can reach the given vertex and the set of vertices reachable from it, samples from them, and stops as soon as a stopping condition is satisfied. The stopping condition depends on the samples observed so far and on the computed sets. We showed that our algorithm gives a more efficient approximation than the existing algorithms. Then, we proposed a novel algorithm for estimating the path centrality of a vertex and showed that, in order to give an (ε,δ)-approximation, it requires considerably fewer samples. Moreover, it processes each sample faster and with less memory. Finally, we empirically evaluated our centrality estimation algorithms and showed their superior performance.
References
 [1] Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and S. R. S. Iyengar. An efficient estimation of a node's betweenness. In Proceedings of the 6th Workshop on Complex Networks, New York, USA, March 25-27, 2015, pages 111–121, 2015.
 [2] Manas Agarwal, Rishi Ranjan Singh, Shubham Chaudhary, and Sudarshan Iyengar. Betweenness ordering problem: an efficient non-uniform sampling technique for large graphs. CoRR, abs/1409.6470, 2014.
 [3] Tharaka Alahakoon, Rahul Tripathi, Nicolas Kourtellis, Ramanuja Simha, and Adriana Iamnitchi. K-path centrality: A new centrality measure in social networks. In Proceedings of the 4th Workshop on Social Network Systems, SNS '11, pages 1:1–1:6, New York, NY, USA, 2011. ACM.
 [4] R. Albert, H. Jeong, and A.-L. Barabási. The diameter of the world wide web. Nature, 401:130–131, 1999.
 [5] D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In Proceedings of the 5th International Conference on Algorithms and Models for the Web-Graph, pages 124–137, 2007.
 [6] Michele Borassi and Emanuele Natale. KADABRA is an adaptive algorithm for betweenness via random approximation. In 24th Annual European Symposium on Algorithms, ESA 2016, August 22-24, 2016, Aarhus, Denmark, pages 20:1–20:18, 2016.
 [7] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
 [8] U. Brandes and C. Pich. Centrality estimation in large networks. Intl. Journal of Bifurcation and Chaos, 17(7):303–318, 2007.
 [9] Ümit V. Çatalyürek, Kamer Kaya, Ahmet Erdem Sariyüce, and Erik Saule. Shattering and compressing networks for betweenness centrality. In Proceedings of the 13th SIAM International Conference on Data Mining, pages 686–694, 2013.
 [10] Mostafa Haghir Chehreghani. Effective co-betweenness centrality computation. In Seventh ACM International Conference on Web Search and Data Mining (WSDM), pages 423–432, 2014.
 [11] Mostafa Haghir Chehreghani. An efficient algorithm for approximate betweenness centrality computation. Comput. J., 57(9):1371–1382, 2014.
 [12] Mostafa Haghir Chehreghani, Albert Bifet, and Talel Abdessalem. Efficient exact and approximate algorithms for computing betweenness centrality in directed graphs. In 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2018.
 [13] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Math., 3(1):79–127, 2006.
 [14] M. Everett and S. Borgatti. The centrality of groups and classes. Journal of Mathematical Sociology, 23(3):181–201, 1999.
 [15] R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments, pages 90–100, 2008.
 [16] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
 [17] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.

 [18] Maarten Löffler and Jeff M. Phillips. Shape fitting on point sets with probability distributions, pages 313–324. Berlin, Heidelberg, 2009.
 [19] Ahmad Mahmoody, Charalampos E. Tsourakakis, and Eli Upfal. Scalable betweenness centrality maximization via sampling. In Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1765–1773. ACM, 2016.
 [20] Ira Pohl. Bidirectional search. Machine Intelligence, 6:127–140, 1971.
 [21] Matteo Riondato and Evgenios M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. Data Mining and Knowledge Discovery, 30(2):438–475, 2016.
 [22] Matteo Riondato and Eli Upfal. Abra: Approximating betweenness centrality in static and dynamic graphs with rademacher averages. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1145–1154, 2016.
 [23] George Stergiopoulos, Panayiotis Kotzanikolaou, Marianthi Theocharidou, and Dimitris Gritzalis. Risk mitigation strategies for critical infrastructures based on graph centrality analysis. International Journal of Critical Infrastructure Protection, 10:34 – 44, 2015.
 [24] Y. Wang, Z. Di, and Y. Fan. Identifying and characterizing nodes important to community structure using the spectrum of the graph. PLoS ONE, 6(11):e27418, 2011.
 [25] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In 12th IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012, pages 745–754, 2012.
 [26] Yuichi Yoshida. Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1416–1425, New York, NY, USA, 2014. ACM.