The importance of a vertex in a graph can be quantified using centrality measures. In this paper we deal with percolation centrality, a measure relevant in applications where graphs are used to model a contagious process in a network (e.g., disease transmission or misinformation spreading). Centrality measures can be defined in terms of local properties, such as the vertex degree, or global properties, such as the betweenness centrality or the percolation centrality. The betweenness centrality of a vertex $v$, roughly speaking, is the fraction of shortest paths containing $v$ as an intermediate vertex. The percolation centrality generalizes the betweenness centrality by allowing weights on the shortest paths, where the weight of a shortest path depends on the disparity between the degrees of contamination of its two end vertices.
The study of the percolation phenomenon in a physical system was introduced by [Broadbent1957] in the context of the passage of a fluid through a medium. In graphs, percolation centrality was proposed by Piraveenan et al. (2013) [Piraveenan2013], where the medium is the set of vertices of a graph $G$ and each vertex $v$ in $G$ has a percolation state $x_v$ (reflecting the "degree of contamination" of $v$). The percolation centrality of $v$ is a function that depends on the topological connectivity and the states of the vertices of $G$ (the formal definition is given in Section 2).
The best known algorithms that exactly compute the betweenness centrality for every vertex of a graph depend on computing all of its shortest paths [Riondato2016] and, consequently, the same applies to the computation of percolation centrality. The best known algorithm for this task in weighted graphs runs in time $O(n^3/2^{c\sqrt{\log n}})$, for some constant $c$ [ryan]. It is currently an open problem whether this problem can be solved in $O(n^{3-c})$, for any constant $c > 0$, and the hypothesis that there is no such algorithm is used in hardness arguments in some works [abboudwilliams, abboudwilliams2]. Note that the betweenness centrality of one given vertex, being a global property, in the worst case might depend on every other vertex of $G$ and, to the best of the authors' knowledge, there is no better strategy for computing this measure for a single vertex. Consequently, the same applies to the percolation centrality.
This paper is inspired by the work of Riondato and Kornaropoulos (2016) [Riondato2016] and Riondato and Upfal (2018) [RiondatoUpfal]. A main theme in their works is that for large scale graphs an exact algorithm is inefficient, and a high-quality approximation obtained with high confidence is sufficient in practice. The authors observe that keeping track of the exact betweenness centrality values, which may change continuously, provides little information gain. The main idea is to sample a subset of all shortest paths so that, for given $0 < \epsilon, \delta < 1$, they obtain values within $\epsilon$ of the exact value with probability at least $1 - \delta$. Experimentally they show that modest requirements for $\epsilon$ and $\delta$ are sufficient in practice [Riondato2016]. Since our work is theoretical, we do not deal with specific values for these constants; however, we assume that $\epsilon$ and $\delta$ are fixed (even though in Section 2.2 we give a precise relation between these constants and the sample size required for the approximation).
The techniques developed by [Riondato2016] for the betweenness problem rely on Vapnik-Chervonenkis (VC) dimension theory and the $\epsilon$-sample theorem. In our work, we use such techniques together with pseudo-dimension theory (a generalization of the VC-dimension) to show that the more general problem of estimating the percolation centrality of every vertex of $G$ can be solved in time $O((m + n\log n)\log n)$ (this complexity is reduced to $O(m\log n)$ for unweighted graphs). We note that in the more recent work of Riondato and Upfal [RiondatoUpfal] they also use pseudo-dimension theory for the betweenness problem, but they obtain different bounds for the sample size and they use pseudo-dimension in order to make use of Rademacher averages. In our work we need pseudo-dimension theory by the very nature of the problem, since percolation functions are real-valued and VC-dimension does not apply in our scenario.
A second contribution that we give in this paper is showing that sample complexity theory can be used to distinguish between two problems that in the exact case might not be distinguishable from each other. As discussed earlier, in the exact case, computing the percolation centrality of one single vertex is not known to be easier than computing the same measure for every vertex of the graph. However, in the sampling complexity scenario, the problem of estimating the percolation of a single vertex $v$, referred to as computing $\tilde{p}(v)$, is shown to be distinct from the problem of estimating the percolation of every vertex of $G$, referred to as computing $\tilde{p}(V)$. More precisely, we show that $\tilde{p}(v)$ and $\tilde{p}(V)$ can be estimated in time $O(m + n\log n)$ and $O((m + n\log n)\log n)$, respectively, in weighted graphs.
Our results also imply a similar "separation" for the percolation centrality estimation in unweighted dense graphs, as well as separations for the estimation of betweenness centrality that hold in any combination of the following scenarios: weighted or unweighted, for either sparse or dense graphs. In fact, for all these problems we show that estimating these measures for any set of vertices of constant size is distinguished from estimating the same measures for all vertices of $G$. The intuition behind these results is that while in the exact case such centrality measures might not be separable, for estimation algorithms, meeting the requirements for the parameters of confidence and quality on a set of smaller size is easier than meeting the same requirements on a set containing every vertex of $G$.
We now introduce the definitions, notation and results we use as the groundwork of our proposed algorithms.
2.1 Graphs and Percolation Centrality
Given a graph $G = (V, E)$ (directed or undirected), the percolation states $x_v$ for each $v \in V$, and a pair $(u, w) \in V \times V$, let $S_{uw}$ be the set of all shortest paths from $u$ to $w$, and let $\sigma_{uw} = |S_{uw}|$. For a given path $p_{uw} \in S_{uw}$, let $\mathrm{Int}(p_{uw})$ be the set of internal vertices of $p_{uw}$, that is, $\mathrm{Int}(p_{uw}) = \{v \in p_{uw} : v \neq u \text{ and } v \neq w\}$. We denote by $\sigma_{uw}(v)$ the number of shortest paths from $u$ to $w$ that $v$ is internal to. Let $P_u(v)$ be the set of (immediate) predecessors of $v$ in the shortest paths from $u$ to $v$, where $E$ is the set of edges of $G$. We call the diameter of $G$, denoted $\mathrm{diam}(G)$, the size of the largest shortest path in $G$. Let $x_v$ be the percolation state of $v$. We say $v$ is fully percolated if $x_v = 1$, non-percolated if $x_v = 0$, and partially percolated if $0 < x_v < 1$. We say that a path from $u$ to $w$ is percolated if $x_u - x_w > 0$. The percolation centrality is defined below.
[Percolation Centrality] Let $R(x) = \max\{x, 0\}$. Given a graph $G = (V, E)$ and percolation states $x_v$ for each $v \in V$, the percolation centrality of a vertex $v \in V$ is defined as
$$p(v) = \frac{1}{n(n-1)} \sum_{\substack{(u,w) \in V \times V \\ u \neq v \neq w}} \frac{\sigma_{uw}(v)}{\sigma_{uw}} \cdot \frac{R(x_u - x_w)}{\sum_{\substack{(f,d) \in V \times V \\ f \neq v \neq d}} R(x_f - x_d)}.$$
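To make the definition concrete, the following is a brute-force computation of $p(v)$ for small unweighted graphs (an illustration only; the function names and the BFS-based path counting are ours, and avoiding exactly this all-pairs computation is the point of the paper):

```python
from collections import deque
from itertools import permutations

def bfs_counts(adj, u):
    """BFS from u on an unweighted graph: distances and numbers of
    shortest paths (sigma) from u to every reachable vertex."""
    dist, sigma = {u: 0}, {u: 1}
    q = deque([u])
    while q:
        a = q.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b], sigma[b] = dist[a] + 1, 0
                q.append(b)
            if dist[b] == dist[a] + 1:
                sigma[b] += sigma[a]
    return dist, sigma

def percolation_centrality(adj, x):
    """Brute-force p(v): for each ordered pair (u, w), the fraction
    sigma_uw(v)/sigma_uw weighted by R(x_u - x_w) over the total
    percolation excluding v, all divided by n(n-1)."""
    R = lambda t: max(t, 0.0)
    n = len(adj)
    p = {}
    for v in adj:
        denom = sum(R(x[f] - x[d]) for f, d in permutations(adj, 2)
                    if v not in (f, d))
        acc = 0.0
        for u, w in permutations(adj, 2):
            if v in (u, w) or denom == 0:
                continue
            du, su = bfs_counts(adj, u)
            dv, sv = bfs_counts(adj, v)
            if w not in du or v not in du or w not in dv:
                continue
            if du[v] + dv[w] != du[w]:
                continue  # v lies on no shortest u-w path
            # sigma_uw(v) = (paths u->v) * (paths v->w)
            acc += (su[v] * sv[w]) / su[w] * R(x[u] - x[w]) / denom
        p[v] = acc / (n * (n - 1))
    return p
```

On the path graph $1 - 2 - 3$ with states $1.0, 0.5, 0.0$, only the middle vertex is ever internal to a shortest path, and only the pair $(1, 3)$ is percolated, giving $p(2) = 1/6$.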
2.2 Sample Complexity and Pseudo-dimension
In sampling algorithms, the sample complexity analysis relates the minimum size of a random sample to the desired parameters of quality and confidence of the estimates (in our case, the minimum number of shortest paths that must be sampled). An upper bound on the Vapnik-Chervonenkis dimension (VC-dimension) of a class of binary functions, defined specifically to model the particular problem one is dealing with, provides an upper bound on the sample size respecting such parameters. Generally speaking, the VC-dimension measures the expressiveness of a class of subsets defined on a set of points [Riondato2016].
For the problem presented in this work, however, the functions that we need to deal with are not binary. Hence, we use the pseudo-dimension, a generalization of the VC-dimension to real-valued functions. An in-depth exposition of the definitions and results presented below can be found in the books of Shalev-Shwartz and Ben-David [ShalevShwartz2014], Anthony and Bartlett [Anthony2009], and Mohri et al. [Mohri2012].
Empirical averages and $\epsilon$-representative samples
Given a domain $D$ and a set $V$, let $\mathcal{F}$ be the family of functions from $D$ to $[0,1]$ such that there is one $f_v \in \mathcal{F}$ for each $v \in V$. Let $S = \{s_1, \ldots, s_r\}$ be a collection of $r$ elements from $D$ sampled with respect to a probability distribution $\pi$.
For each $f_v \in \mathcal{F}$, we define the expectation of $f_v$ and its empirical average on $S$ as $\mathbb{E}[f_v]$ and $\tilde{f}_v(S)$, respectively, i.e.,
$$\mathbb{E}[f_v] = \mathbb{E}_{s \sim \pi}[f_v(s)] \qquad \text{and} \qquad \tilde{f}_v(S) = \frac{1}{r} \sum_{i=1}^{r} f_v(s_i).$$
Given $\epsilon \in (0,1)$, a set $S$ is called $\epsilon$-representative w.r.t. some domain $D$, a set $V$, a family of functions $\mathcal{F}$ and a probability distribution $\pi$ if
$$|\tilde{f}_v(S) - \mathbb{E}[f_v]| \leq \epsilon, \quad \text{for all } f_v \in \mathcal{F}.$$
By the linearity of expectation, the expected value of the empirical average corresponds to $\mathbb{E}[f_v]$. Hence, $\mathbb{E}[\tilde{f}_v(S)] = \mathbb{E}[f_v]$ and, by the law of large numbers, $\tilde{f}_v(S)$ converges to its true expectation as $r$ goes to infinity, since $\tilde{f}_v(S)$ is the empirical average of $r$ random variables sampled independently and identically w.r.t. $\pi$. However, this law provides no information about the value $|\tilde{f}_v(S) - \mathbb{E}[f_v]|$ for any given sample size. Thus, we use results from VC-dimension and pseudo-dimension theory, which provide bounds on the size of the sample that guarantee that the maximum deviation $|\tilde{f}_v(S) - \mathbb{E}[f_v]|$ is within $\epsilon$ with probability at least $1 - \delta$, for given $0 < \epsilon, \delta < 1$.
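The gap between asymptotic convergence and a finite-sample guarantee can be seen in a toy experiment (illustrative only; the function $f(s) = s^2$ with samples uniform on $[0,1]$, so that the true expectation is $1/3$, is our choice):

```python
import random

def empirical_average(f, sample):
    """Empirical average (1/r) * sum_i f(s_i) over a sample of size r."""
    return sum(f(s) for s in sample) / len(sample)

# f(s) = s^2 for s ~ U[0, 1], so E[f] = 1/3.  The law of large numbers
# says the deviation vanishes as r grows, but by itself gives no bound
# for a fixed r; VC/pseudo-dimension theory fills exactly that gap.
rng = random.Random(42)
f = lambda s: s * s
for r in (10, 100, 10_000):
    sample = [rng.random() for _ in range(r)]
    print(r, abs(empirical_average(f, sample) - 1 / 3))
```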
A range space is a pair $\mathcal{R} = (D, \mathcal{I})$, where $D$ is a domain (finite or infinite) and $\mathcal{I}$ is a collection of subsets of $D$, called ranges. For a given $S \subseteq D$, the projection of $\mathcal{I}$ on $S$ is the set $\mathcal{I}_S = \{S \cap I : I \in \mathcal{I}\}$. If $|\mathcal{I}_S| = 2^{|S|}$ then we say $S$ is shattered by $\mathcal{I}$. The VC-dimension of a range space is the size of the largest subset $S$ that can be shattered by $\mathcal{I}$, i.e.,
The VC-dimension of a range space $\mathcal{R} = (D, \mathcal{I})$, denoted by $VCDim(\mathcal{R})$, is $VCDim(\mathcal{R}) = \max\{k : \exists\, S \subseteq D \text{ with } |S| = k \text{ such that } S \text{ is shattered by } \mathcal{I}\}$.
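Projection and shattering can be checked mechanically on a toy range space (a sketch with helper names of our choosing; discrete intervals on a line are a standard example of VC-dimension 2):

```python
from itertools import combinations

def projection(ranges, S):
    """Projection of the range family on S: { S & I : I in ranges }."""
    return {frozenset(S & I) for I in ranges}

def is_shattered(ranges, S):
    """S is shattered iff its projection has all 2^|S| subsets of S."""
    return len(projection(ranges, S)) == 2 ** len(S)

def vc_dimension(domain, ranges):
    """Largest k such that some k-subset of the domain is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(ranges, set(S)) for S in combinations(domain, k)):
            d = k
    return d

# Toy range space: domain {1..5}, ranges = all discrete intervals [a, b].
domain = {1, 2, 3, 4, 5}
intervals = [frozenset(range(a, b + 1)) for a in domain for b in domain if a <= b]
```

Here $\{2, 4\}$ is shattered (intervals realize all four subsets), but no triple $\{a, b, c\}$ with $a < b < c$ is, since no interval contains $a$ and $c$ while avoiding $b$; hence the VC-dimension is 2.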
Let $\mathcal{F}$ be a family of functions from some domain $D$ to the range $[0,1]$. Consider $D' = D \times [0,1]$. For each $f \in \mathcal{F}$, there is a subset $R_f \subseteq D'$ defined as $R_f = \{(x, t) \in D \times [0,1] : t \leq f(x)\}$.
[see [Anthony2009], Section 11.2] Let $\mathcal{R} = (D, \mathcal{F})$ and $\mathcal{R}' = (D', \mathcal{F}^+)$ be range spaces, where $\mathcal{F}^+ = \{R_f : f \in \mathcal{F}\}$. The pseudo-dimension of $\mathcal{F}$, denoted by $PD(\mathcal{F})$, corresponds to the VC-dimension of $\mathcal{R}'$, i.e., $PD(\mathcal{F}) = VCDim(\mathcal{R}')$.
Theorem 2.2 states that having an upper bound on the pseudo-dimension of a range space allows us to build an $\epsilon$-representative sample.
[see [li2001], Section 1] Let $\mathcal{R}' = (D', \mathcal{F}^+)$ be a range space with $PD(\mathcal{F}) \leq d$ and let $\pi$ be a probability distribution on $D$. Given $0 < \epsilon, \delta < 1$, let $S$ be a collection of $r$ elements sampled w.r.t. $\pi$, with
$$r \geq \frac{c}{\epsilon^2}\left(d + \ln\frac{1}{\delta}\right),$$
where $c$ is a universal positive constant. Then $S$ is $\epsilon$-representative with probability at least $1 - \delta$.
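The bound translates directly into a small sample-size calculator (a sketch; the default value of the universal constant $c$ is the estimate discussed next and should be treated as an assumption):

```python
import math

def sample_size(eps, delta, d, c=0.5):
    """Smallest integer r satisfying r >= (c / eps^2) * (d + ln(1/delta)),
    where d is an upper bound on the pseudo-dimension of the function
    family.  The universal constant c defaults to the ~0.5 estimate from
    the literature (an assumption; any valid upper bound on c works)."""
    return math.ceil((c / eps ** 2) * (d + math.log(1 / delta)))
```

For instance, with $\epsilon = 0.05$, $\delta = 0.1$ and $d = 5$, the bound asks for $\lceil 200 \cdot (5 + \ln 10) \rceil = 1461$ samples; halving $\epsilon$ quadruples the requirement.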
In the work of [loffler2009shape], it has been shown that the constant $c$ is approximately $0.5$. Lemmas 2.2 and 2.2, stated and proved by Riondato and Upfal (2018), present constraints on the sets that can be shattered by a range set $\mathcal{F}^+$.
[see [RiondatoUpfal], Section 3.3] Let $B \subseteq D'$ be a set that is shattered by $\mathcal{F}^+$. Then, $B$ can contain at most one pair $(x, t)$ for each $x \in D$ and for a $t \in [0,1]$.
[see [RiondatoUpfal], Section 3.3] Let $B \subseteq D'$ be a set that is shattered by $\mathcal{F}^+$. Then, $B$ does not contain any element of the form $(x, 0)$, for $x \in D$.
3 Pseudo-dimension and percolated paths
In this section we model the percolation centrality in terms of a range set of the percolated shortest paths. That is, for a given graph $G = (V, E)$ and the percolation states $x_v$ for each $v \in V$, let $D = \mathcal{S}_G$, where $\mathcal{S}_G = \bigcup_{(u,w) \in V \times V,\, u \neq w} S_{uw}$ is the set of all shortest paths in $G$. For each $v \in V$, there is a set $\tau_v = \{p_{uw} \in \mathcal{S}_G : v \in \mathrm{Int}(p_{uw})\}$. For a pair $(u, w) \in V \times V$ and a path $p_{uw} \in S_{uw}$, let $f_v : D \to [0,1]$ be the function
$$f_v(p_{uw}) = \begin{cases} \dfrac{R(x_u - x_w)}{\sum_{(f,d) \in V \times V,\, f \neq v \neq d} R(x_f - x_d)}, & \text{if } v \in \mathrm{Int}(p_{uw}),\\[4pt] 0, & \text{otherwise.} \end{cases}$$
The function $f_v$ gives the proportion of the percolation between $u$ and $w$ relative to the total percolation in the graph if $v \in \mathrm{Int}(p_{uw})$. We define $\mathcal{F} = \{f_v : v \in V\}$.
Let $D' = D \times [0,1]$. For each $v \in V$, there is a range $R_v = \{(p_{uw}, t) \in D' : t \leq f_v(p_{uw})\}$. Note that each range contains the pairs $(p_{uw}, t)$, where $p_{uw} \in D$ is such that $v \in \mathrm{Int}(p_{uw})$ and $0 < t \leq f_v(p_{uw})$. We define $\mathcal{F}^+ = \{R_v : v \in V\}$.
The function $\pi(p_{uw}) = \frac{1}{n(n-1)} \cdot \frac{1}{\sigma_{uw}}$, for each $p_{uw} \in \mathcal{S}_G$, is a valid probability distribution.
Let $S_{uw}$ be the set of shortest paths from $u$ to $w$, where $(u, w) \in V \times V$ and $u \neq w$. Then,
$$\sum_{p \in \mathcal{S}_G} \pi(p) = \sum_{\substack{(u,w) \in V \times V \\ u \neq w}} \sum_{p \in S_{uw}} \frac{1}{n(n-1)} \cdot \frac{1}{\sigma_{uw}} = \sum_{\substack{(u,w) \in V \times V \\ u \neq w}} \frac{1}{n(n-1)} = 1.$$
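This computation can be checked numerically on a small connected graph (a sketch with function names of our choosing, assuming every ordered pair of vertices is joined by some path, so that the per-pair masses add up to one):

```python
from collections import deque
from itertools import permutations

def sigma(adj, u, w):
    """Number of shortest u-w paths in an unweighted graph, via BFS."""
    dist, cnt = {u: 0}, {u: 1}
    q = deque([u])
    while q:
        a = q.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b], cnt[b] = dist[a] + 1, 0
                q.append(b)
            if dist[b] == dist[a] + 1:
                cnt[b] += cnt[a]
    return cnt.get(w, 0)

def total_probability(adj):
    """Sum of pi(p) = 1/(n(n-1) * sigma_uw) over every shortest path p:
    each ordered pair (u, w) contributes sigma_uw such paths, i.e. mass
    1/(n(n-1)), and there are n(n-1) ordered pairs."""
    n = len(adj)
    total = 0.0
    for u, w in permutations(adj, 2):
        s = sigma(adj, u, w)
        if s > 0:
            total += s * (1.0 / (n * (n - 1) * s))
    return total
```

On a 4-cycle, where opposite vertices are joined by two shortest paths each, the masses still sum to exactly 1.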
For $v \in V$ and for all $p_{uw} \in \mathcal{S}_G$, such that each $p_{uw}$ is sampled according to the probability function $\pi$,
$$\mathbb{E}[f_v(p_{uw})] = p(v).$$
For a given graph $G$ and for all $v \in V$, we have from Definition 2.2
$$\mathbb{E}[f_v(p_{uw})] = \sum_{\substack{(u,w) \in V \times V \\ u \neq w}} \sum_{p_{uw} \in S_{uw}} \pi(p_{uw})\, f_v(p_{uw}) = \frac{1}{n(n-1)} \sum_{\substack{(u,w) \in V \times V \\ u \neq v \neq w}} \frac{\sigma_{uw}(v)}{\sigma_{uw}} \cdot \frac{R(x_u - x_w)}{\sum_{\substack{(f,d) \in V \times V \\ f \neq v \neq d}} R(x_f - x_d)} = p(v).$$
Let $S = \{p_1, \ldots, p_r\}$ be a collection of $r$ shortest paths sampled independently and identically from $\mathcal{S}_G$ according to $\pi$. Next, we define $\tilde{p}(v)$, the estimation to be computed by the algorithm, as the empirical average from Definition 2.2:
$$\tilde{p}(v) = \frac{1}{r} \sum_{i=1}^{r} f_v(p_i).$$
4 Approximation to the percolation centrality
We present an algorithm whose correctness and running time rely on the sample size given by Theorem 2.2. In order to bound the sample size, in Theorem 4 we prove an upper bound on the pseudo-dimension of the range space $\mathcal{R}' = (D', \mathcal{F}^+)$. We are aware that the main idea in the proof is similar to the proof of a result for a different range space on the shortest paths obtained in [Riondato2016] using VC-dimension. For the sake of clarity, instead of trying to fit their definition to our model and use their result, we found it easier to state and prove the theorem directly for our range space.
Let $\mathcal{R} = (D, \mathcal{F})$ and $\mathcal{R}' = (D', \mathcal{F}^+)$ be the corresponding range spaces for the domain and range sets defined in Section 3, and let $\mathrm{diam}(G)$ be the diameter of $G$. We have
$$PD(\mathcal{F}) = VCDim(\mathcal{R}') \leq \lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1.$$
Let $VCDim(\mathcal{R}') = k$, where $k \in \mathbb{N}$. Then, there is $B \subseteq D'$ such that $|B| = k$ and $B$ is shattered by $\mathcal{F}^+$. From Lemmas 2.2 and 2.2, we know that for each $p_{uw} \in D$, $B$ contains at most one pair $(p_{uw}, t)$ for some $t \in [0,1]$, and there is no pair in $B$ of the form $(p_{uw}, 0)$. By the definition of shattering, each $(p_{uw}, t) \in B$ must appear in $2^{k-1}$ different ranges of $\mathcal{F}^+$. On the other hand, each pair $(p_{uw}, t)$ is in at most $|\mathrm{Int}(p_{uw})| \leq \mathrm{diam}(G) - 2$ ranges of $\mathcal{F}^+$, since $(p_{uw}, t) \notin R_v$ either when $f_v(p_{uw}) = 0$ (i.e., $v \notin \mathrm{Int}(p_{uw})$) or when $t > f_v(p_{uw})$. Considering that $2^{k-1} \leq \mathrm{diam}(G) - 2$, we have $k \leq \log_2(\mathrm{diam}(G) - 2) + 1$.
Since $k$ must be an integer, $k \leq \lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1$. Finally, $PD(\mathcal{F}) = VCDim(\mathcal{R}') \leq \lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1$.
By Theorem 3 and Definition 3, $\mathbb{E}[f_v(p_{uw})] = p(v)$ and $\tilde{p}(v) = \frac{1}{r}\sum_{i=1}^{r} f_v(p_i)$, respectively, for each $v \in V$ and $p_i \in S$. Thus, an $\epsilon$-representative sample w.r.t. $\mathcal{F}$ and $\pi$ gives the desired approximation, and by Theorems 2.2 and 4, we have that a sample of size $r \geq \frac{c}{\epsilon^2}\left(\lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1 + \ln\frac{1}{\delta}\right)$ suffices for our algorithm, for given $0 < \epsilon, \delta < 1$. The problem of computing the diameter of $G$ is not known to be easier than the problem of computing all of its shortest paths [Aingworth:1996:FED:313852.314117], so obtaining an exact value for the diameter would defeat the whole purpose of using a sampling strategy that avoids computing all shortest paths. Hence, we use the 2-approximation for the diameter described in [Riondato2016]. We note that the diameter can be approximated within smaller factors, but even then the complexity of such approximation algorithms [Aingworth:1996:FED:313852.314117] would also be a bottleneck for our algorithm. Furthermore, since in our case we do not need the largest shortest path itself, but simply the value of the diameter, and we take the logarithm of this value, the approximation of [Riondato2016] is sufficient.
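For intuition, the basic single-source version of such a 2-approximation looks as follows (a simplified sketch for connected, unweighted, undirected graphs; [Riondato2016] uses a refinement of this idea, and the hop-count eccentricity here is our simplification):

```python
from collections import deque

def diameter_upper_bound(adj, source=None):
    """Crude 2-approximation of the (hop-count) diameter of a connected,
    unweighted, undirected graph.  For the eccentricity ecc(s) of any
    vertex s, the triangle inequality gives
        ecc(s) <= diam(G) <= 2 * ecc(s),
    so a single BFS yields an upper bound within a factor of 2."""
    s = source if source is not None else next(iter(adj))
    dist = {s: 0}
    q = deque([s])
    while q:
        a = q.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                q.append(b)
    return 2 * max(dist.values())
```

Since only the logarithm of this value enters the sample-size bound, the factor-2 slack costs a single additive unit inside the logarithm.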
4.1 Algorithm description and analysis
Given a weighted graph $G = (V, E)$ and the percolation states $x_v$ for each $v \in V$, as well as the quality and confidence parameters $0 < \epsilon, \delta < 1$, assumed to be constants (they do not depend on the size of $G$), Algorithm 2 works as follows. At the beginning of the execution, the sample $S$ is initialized as $S = \emptyset$ and the approximated value for the diameter of $G$ is obtained, in line 3, by the 2-approximation described in [Riondato2016], as previously mentioned. According to Theorem 4, this value determines the size of $S$, denoted by $r$, in line 4.
The value $\sum_{(f,d) \in V \times V,\, f \neq v \neq d} R(x_f - x_d)$ for each $v \in V$, which is necessary to compute $f_v$, is obtained in line 6 by the linear-time dynamic programming strategy presented in Algorithm 1. The correctness of Algorithm 1 is not self-evident, so we provide a proof of its correctness in Theorem 4.1.
A pair $(u, w)$ is sampled uniformly and independently, and then a shortest path between $u$ and $w$ is sampled uniformly from $S_{uw}$ in lines 11–17. For each vertex $v$ internal to the sampled path, the value $\frac{1}{r} f_v(p_{uw})$ is added to $\tilde{p}(v)$.
For an array $A$ of size $n$ containing the percolation states sorted in non-decreasing order, Algorithm 1 returns the values $\mathrm{sum} = \sum_{(f,d) \in V \times V} R(x_f - x_d)$ and, for each $v \in V$, $\mathrm{minusv}(v) = \sum_{(f,d) \in V \times V,\, f \neq v \neq d} R(x_f - x_d)$.
By the definition of $R$, we have that
$$\mathrm{sum} = \sum_{i=1}^{n} \sum_{j=1}^{n} \max(A[i] - A[j],\, 0).$$
Since $A$ is sorted, $\max(A[i] - A[j], 0) = 0$ if $i \leq j$. Hence, if we consider only the indices $j < i$, this value becomes
$$\mathrm{sum} = \sum_{i=1}^{n} \sum_{j < i} (A[i] - A[j]) = \sum_{i=1}^{n} \left( (i-1)\,A[i] - \sum_{j < i} A[j] \right).$$
A similar step can be applied to the values of the array $\mathrm{minusv}$, and then for all indices $i$,
$$\mathrm{minusv}(i) = \mathrm{sum} - \left( (i-1)\,A[i] - \sum_{j < i} A[j] \right) - \left( \sum_{j > i} A[j] - (n-i)\,A[i] \right).$$
The recurrences below follow directly from lines 5 and 6, where $\mathrm{sum}_i$ denotes the value of the variable $\mathrm{sum}$ at the beginning of the $i$-th iteration of the algorithm.
The solutions to the above recurrences are, respectively,
The value $\mathrm{sum}$ is then correctly computed in lines 4–6, since
Finally, $\mathrm{minusv}(v)$ is also correctly computed in lines 8 and 9, since
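Algorithm 1 itself is not reproduced in this excerpt, but the sorted prefix-sum idea behind it can be sketched as an equivalent computation (variable names are ours; correctness is checked against the brute-force definition):

```python
def exclusion_sums(x):
    """For each vertex v, compute the sum over ordered pairs (f, d) with
    f != v != d of R(x_f - x_d) = max(x_f - x_d, 0), in O(n log n) total.

    Since R(x_f - x_d) > 0 only when x_f > x_d, the total over all pairs
    on the sorted array A is sum_i (i * A[i] - prefix[i]) (0-indexed);
    subtracting each vertex's own pair contributions gives its
    exclusion sum."""
    n = len(x)
    order = sorted(range(n), key=lambda v: x[v])
    A = [x[v] for v in order]
    prefix = [0.0] * (n + 1)
    for i, a in enumerate(A):
        prefix[i + 1] = prefix[i] + a
    total = sum(i * A[i] - prefix[i] for i in range(n))
    minus = [0.0] * n
    for i, v in enumerate(order):
        below = i * A[i] - prefix[i]                              # pairs (v, d)
        above = (prefix[n] - prefix[i + 1]) - (n - 1 - i) * A[i]  # pairs (f, v)
        minus[v] = total - below - above
    return minus
```

The subtraction removes exactly the pairs in which $v$ participates, once as the larger state and once as the smaller, so ties between equal states cancel consistently.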
Let $S$ be a sample of size $r \geq \frac{c}{\epsilon^2}\left(\lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1 + \ln\frac{1}{\delta}\right)$ for a given weighted graph $G = (V, E)$ and for given $0 < \epsilon, \delta < 1$. Algorithm 2 returns, with probability at least $1 - \delta$, an approximation $\tilde{p}(v)$ to $p(v)$, for each $v \in V$, such that $\tilde{p}(v)$ is within $\epsilon$ error.
Each pair $(u, w)$ is sampled with probability $\frac{1}{n(n-1)}$ in lines 8 and 9, and for each pair, the set $S_{uw}$ is computed by Dijkstra's algorithm (line 10). A shortest path $p_{uw}$ is sampled independently and uniformly from $S_{uw}$ (lines 11–17), i.e., with probability $\frac{1}{\sigma_{uw}}$, by a backward traversal starting from $w$ (Lemma 5 in [Riondato2016], Section 5.1). Therefore, $p_{uw}$ is sampled with probability $\pi(p_{uw}) = \frac{1}{n(n-1)} \cdot \frac{1}{\sigma_{uw}}$.
In lines 13–17, each $v$ reached by the backward traversal has its value $\tilde{p}(v)$ increased by $\frac{1}{r} f_v(p_{uw})$. The value of $\sum_{(f,d) \in V \times V,\, f \neq v \neq d} R(x_f - x_d)$ is correctly computed as shown in Theorem 4.1. Let $\tau_v \subseteq S$ be the set of sampled shortest paths that $v$ is an internal vertex of. Then, at the end of the $r$-th iteration,
$$\tilde{p}(v) = \frac{1}{r} \sum_{p_{uw} \in \tau_v} \frac{R(x_u - x_w)}{\sum_{(f,d) \in V \times V,\, f \neq v \neq d} R(x_f - x_d)} = \frac{1}{r} \sum_{i=1}^{r} f_v(p_i).$$
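Putting the pieces together, the sampling loop can be sketched in the unweighted setting (BFS replaces Dijkstra; names and simplifications are ours, so this is an analogue of Algorithm 2, not a transcription; `minus[v]` is the exclusion sum $\sum_{f \neq v \neq d} R(x_f - x_d)$, assumed precomputed):

```python
import random
from collections import deque

def bfs_counts(adj, u):
    """Distances and shortest-path counts from u (unweighted graph)."""
    dist, sigma = {u: 0}, {u: 1}
    q = deque([u])
    while q:
        a = q.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b], sigma[b] = dist[a] + 1, 0
                q.append(b)
            if dist[b] == dist[a] + 1:
                sigma[b] += sigma[a]
    return dist, sigma

def estimate_percolation(adj, x, minus, r, seed=0):
    """Sample an ordered pair (u, w) uniformly, then one shortest u-w
    path uniformly via a backward traversal weighted by path counts,
    crediting each internal vertex v with R(x_u - x_w) / (r * minus[v])."""
    rng = random.Random(seed)
    V = list(adj)
    est = {v: 0.0 for v in V}
    for _ in range(r):
        u, w = rng.sample(V, 2)            # uniform ordered pair, u != w
        dist, sigma = bfs_counts(adj, u)
        gain = max(x[u] - x[w], 0.0)       # R(x_u - x_w)
        if w not in dist or gain == 0.0:
            continue
        t = w
        while True:                        # walk back from w toward u
            preds = [z for z in adj[t] if dist.get(z, -1) == dist[t] - 1]
            t = rng.choices(preds, weights=[sigma[z] for z in preds])[0]
            if t == u:
                break
            est[t] += gain / (r * minus[t])
    return est
```

On the path graph $1 - 2 - 3$ with states $1.0, 0.5, 0.0$, the estimate for the middle vertex concentrates around the exact value $p(2) = 1/6$ as $r$ grows.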
Given a weighted graph $G = (V, E)$ with $n = |V|$ and $m = |E|$, and a sample of size $r = \frac{c}{\epsilon^2}\left(\lfloor \log_2(\mathrm{diam}(G) - 2) \rfloor + 1 + \ln\frac{1}{\delta}\right)$, Algorithm 2 has running time $O(r(m + n \log n))$.
We use the linear-time algorithm of [vose1991] for the sampling steps in lines 8, 9 and 14. The upper bound to the diameter, computed in line 3 and denoted by