Effectively Counting s-t Simple Paths in Directed Graphs

03/10/2021 ∙ by Mostafa Haghir Chehreghani, et al. ∙ AUT

An important tool in analyzing complex social and information networks is s-t simple path counting, which is known to be #P-complete. In this paper, we study efficient s-t simple path counting in directed graphs. For a given pair of vertices s and t in a directed graph, first we propose a pruning technique that can efficiently and considerably reduce the search space. Then, we discuss how this technique can be adjusted with exact and approximate algorithms, to improve their efficiency. In the end, by performing extensive experiments over several networks from different domains, we show high empirical efficiency of our proposed technique. Our algorithm is not a competitor of existing methods, rather, it is a friend that can be used as a fast pre-processing step, before applying any existing algorithm.

1 Introduction

Graphs (networks) are powerful tools that are used to model data in different domains, including social networks, information networks, road networks and the world wide web. A property seen in most of these real-world networks is that the links between the vertices do not always represent reciprocal relations [Newman03thestructure]. Consequently, the generated networks are directed graphs, where any edge has a direction and the edges are not necessarily symmetric.

Path counting is an important problem in analyzing large networks. For example, some network indices, such as k-path centrality, are based on counting paths [DBLP:conf/cikm/ChehreghaniBA19]. A computationally demanding path counting problem is simple path counting. In this problem, given two vertices s and t, the goal is to count the number of all simple paths from s to t. A simple path is a path that contains each node of the graph at most once. For both directed and undirected graphs, this problem is known to be #P-complete [DBLP:journals/siamcomp/Valiant79] (see footnote 1). Hence, it is critical to develop algorithms that work efficiently in practice.

Footnote 1: It is important not to confuse our studied problem with the problem wherein the constraint "simple" is removed, i.e., we count the number of all paths (simple or non-simple) between two vertices s and t. In this problem, a node may appear several times in a path. Unlike our studied problem (which is #P-complete), this problem can be solved efficiently in polynomial time, using e.g., dynamic programming or matrix multiplication. Another relevant problem that, unlike our studied problem, can be solved in polynomial time is counting the number of shortest paths between two vertices.

In this paper, we focus on efficient simple path counting in directed graphs. Our key contributions are as follows.

  • First, for a given pair of vertices s and t, we propose a pruning technique that can reduce the search space significantly. Moreover, it can be computed very efficiently.

  • Then, we discuss how this technique can be adjusted with exact and approximate algorithms, to improve their performance. We show, for example, that under some conditions, applying this pruning technique yields a polynomial time exact algorithm for enumerating simple paths.

  • Finally, by performing extensive experiments over several networks from different domains (social, peer-to-peer, communication, citation, stack exchange, product co-purchasing, …), we show high empirical efficiency of our pruning technique.

Our algorithm is not a competitor of existing simple path counting algorithms. Rather, it is a friend: it can be used as a fast pre-processing step before applying any of the existing algorithms.

The rest of this paper is organized as follows. In Section 2, we introduce preliminaries and necessary definitions used in the paper. In Section 3, we give an overview on related work. In Section 4, we present our pruning technique and discuss how it improves exact and approximate algorithms. In Section 5, we empirically evaluate our pruning technique. Finally, the paper is concluded in Section 6.

2 Preliminaries

We assume that the reader is familiar with basic concepts in graph theory. Throughout the paper, G refers to a directed graph. For simplicity, we assume that G is a connected and loop-free graph without multi-edges. By default, we assume that G is an unweighted graph, unless it is explicitly mentioned that G is weighted. V(G) and E(G) refer to the set of vertices and the set of edges of G, respectively. For a vertex v, the number of head ends adjacent to v is called its in degree and the number of tail ends adjacent to v is called its out degree. For a vertex v, we also consider the set of outgoing neighbors of v in the graph G. A (directed) walk in a directed graph is a sequence of edges that are directed in the same direction and that join a sequence of vertices. A directed trail is a (directed) walk wherein all the edges are distinct. A (simple and directed) path is a (directed) trail wherein all the vertices are distinct. Let S be a subset of V(G). The subgraph of G induced by S is the graph whose vertices are S and whose edges are all the edges in E(G) that have both endpoints in S.

3 Related work

There exist a number of algorithms in the literature for counting/enumerating/listing simple paths. In one of the earlier studies, Valiant [DBLP:journals/siamcomp/Valiant79] showed that it is #P-complete to enumerate simple paths in both directed and undirected graphs. In [HURA1983157], Hura utilized Petri nets to enumerate all simple paths in a directed graph. Bax [DBLP:journals/ipl/Bax94] exploited Hamiltonian path algorithms to derive an time and space algorithm for simple path counting. Knuth [Knuth_4a] presented the simpath algorithm that generates a (not-always-reduced) binary decision diagram for all simple paths from s to t. The constructed diagram is called a zero-suppressed decision diagram (ZDD). In recent years, more efficient algorithms have been proposed to construct ZDDs [DBLP:conf/ambn/YasudaSM17]. Roberts and Kroese [DBLP:journals/jgaa/RobertsK07] presented an approximate algorithm for simple path counting, which is based on sequential importance sampling (we discuss their algorithm in more detail in Section 4.3, and explain how our pruning technique can improve it). In [DBLP:journals/mst/MihalakSW16], Mihalák et al. studied approximately counting approximately shortest paths in directed graphs. However, their algorithm is restricted to directed acyclic graphs (DAGs).

In the literature, there are problems that are close to our studied problem, or are a restricted form of it. In the following, we briefly review them:

  • In a restricted form of our studied problem, the number of simple paths of a (fixed) length are counted. Flum and Grohe [DBLP:journals/siamcomp/FlumG04] proved that counting simple paths of length on both directed and undirected graphs, parameterized by , is #W[1]-complete. Recently, Giscard et.al. [DBLP:journals/algorithmica/GiscardKW19] proposed an algorithm for counting simple paths of length up to . Time complexity of their algorithm is , where is the maximum degree of the graph, is the number of weakly connected induced subgraphs of on at most vertices, and is the exponent of matrix multiplication.

  • The other problem which is close to our studied problem is counting the number of all paths from a vertex s to another vertex t. However, unlike our studied problem (which is #P-complete), this problem can be solved in polynomial time, using e.g., dynamic programming or fast matrix multiplication. For example, as described in e.g., [jrnl:Brandes], the number of all paths of size k (simple or non-simple) from s to t is equal to the (s,t)-th entry of A^k, where A is the adjacency matrix of the graph and A^k is the k-th power of A. Arenas et al. [DBLP:conf/pods/ArenasCJR19] study the problem of counting the number of all paths between two vertices s and t of a length at most a given bound, and show that it admits a fully polynomial-time randomized approximation scheme (FPRAS).

  • Another relevant problem is counting the number of shortest paths between a given pair of vertices. Unlike our studied problem, this problem can be solved efficiently in polynomial time, too. Using breadth first search (BFS), this problem can be solved in time [jrnl:Brandes]. Recently, a number of faster approximate algorithms have been proposed to address this problem in static [DBLP:journals/algorithms/MensahGY20] and dynamic graphs [DBLP:conf/cikm/TretyakovAGVD11]. The notion of shortest paths is used to develop several tools of network analysis, including betweenness centrality [DBLP:journals/cj/Chehreghani14, DBLP:journals/corr/abs-1708-08739, DBLP:conf/edbt/ChehreghaniAB19].
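The adjacency-matrix fact mentioned in the list above can be illustrated with a short sketch. It counts all (simple or non-simple) paths with exactly k edges from s to t by repeatedly multiplying a row vector with the adjacency matrix, which yields the same value as the (s,t)-th entry of A^k. The function name `count_walks`, the list-of-lists matrix format, and the reading of "size k" as "k edges" are assumptions made for this illustration.

```python
def count_walks(adj, s, t, k):
    """Count paths (vertices may repeat) with exactly k edges from s to t.

    adj is an n x n 0/1 adjacency matrix given as a list of lists
    (an assumed input format). The result equals the (s, t) entry of adj^k.
    """
    n = len(adj)
    # row[v] = number of such paths from s to v with the current edge count
    row = [0] * n
    row[s] = 1
    for _ in range(k):
        nxt = [0] * n
        for u in range(n):
            if row[u]:
                for v in range(n):
                    if adj[u][v]:
                        nxt[v] += row[u]
        row = nxt
    return row[t]


# Small example: edges 0->1, 0->2, 1->2, 2->0.
example = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
print(count_walks(example, 0, 2, 4))
```

Each iteration costs O(n^2), so the whole computation is polynomial in n and k, in contrast to the simple path counting problem studied in this paper.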

In the current paper, we propose a pruning technique to improve exact/approximate algorithms of the general form of simple path counting in directed graphs. Our algorithm is not a competitor of these existing algorithms. Rather, it is a friend that can be used as a fast pre-processing step before applying any of them.

4 Counting paths in directed graphs

In this section, given a source vertex s and a target vertex t in a directed graph G, we propose algorithms to count the number of all simple paths from s to t. First, in Section 4.1, we present a pruning technique used to reduce the search space for path counting. Then, in Section 4.2, we discuss how this pruning technique improves exact path counting. Finally, in Section 4.3, we study how our pruning technique can improve approximate algorithms.

4.1 A pruning technique

In this section, we describe our pruning technique. We start with introducing the sets (Definition 4.1) and (Definition 4.1), that are used to reduce the search space.

Let be a directed graph and such that . We say is in the scope of with respect to iff either , or there is at least one directed path from to in that does not pass over (see footnote 2). The set of vertices that are in the scope of with respect to is denoted with .

Footnote 2: The path may end at , however, it cannot pass over .

Let be a directed graph. Inverse graph of , denoted with , is a directed graph such that: (i) , and (ii) if and only if [DBLP:journals/corr/abs-1708-08739].

Let be a directed graph and such that . We say is inversely in the scope of with respect to iff either , or there is at least one directed path from to in that does not pass over (see footnote 3). The set of vertices that are inversely in the scope of with respect to is denoted with .

Footnote 3: The path may end at , however, it cannot pass over .

For example, in the graph of Figure 1, consists of vertices , , , , , and . It does not contain vertex , as the path from to passes over . Moreover, consists of vertices , and . For a given pair of vertices , we can compute and efficiently in time. To compute , we act as follows:

  1. First, if G is weighted, the weights of the edges of G are discarded,

  2. Then, a (revised) breadth-first search (BFS) or depth-first search (DFS) on G starting from s is conducted, with a small change: when t is met, the traversal is not expanded to the children (and descendants) of t. All the vertices that are met during the traversal, including t itself if it is met, are collected into the resulting set.

To compute , we act as follows:

  1. First, by flipping the direction of the edges of G, we construct the inverse graph of G,

  2. Then, if the inverse graph is weighted, the weights of its edges are discarded,

  3. Finally, a (revised) breadth-first search or depth-first search on the inverse graph starting from t is conducted, with a small change: when s is met, the traversal is not expanded to its children (and descendants). All the vertices that are met during the traversal, including s itself if it is met, are collected into the resulting set.
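The two traversals described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `scope` and `reverse`, the dictionary-of-lists graph format, and the example graph are all assumptions introduced here.

```python
from collections import deque


def scope(adj, start, blocker):
    """Vertices reachable from `start` by a path that does not pass over `blocker`.

    adj maps each vertex to a list of its outgoing neighbors (assumed format).
    The blocker is recorded when it is met, but its outgoing edges are not followed.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == blocker:
            continue  # do not expand the traversal beyond the blocker
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    """Build the inverse graph by flipping the direction of every edge."""
    rev = {u: [] for u in adj}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev


# Hypothetical example graph (not the graph of Figure 1).
graph = {"s": ["a", "t"], "a": ["t", "b"], "t": ["c"], "b": [], "c": ["s"]}
forward = scope(graph, "s", "t")             # in the scope of s with respect to t
backward = scope(reverse(graph), "t", "s")   # inversely in the scope of t with respect to s
print(sorted(forward), sorted(backward))
```

In this example, vertex c is excluded from the forward set because it can only be reached from s by passing over t, and vertex b is excluded from the backward set because it has no path to t.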

It is easy to see that both and can be computed in time, for both unweighted and weighted graphs. Furthermore, we have the following lemma.

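Lemma 4.1. Given a directed graph G and vertices s and t, the number of simple paths from s to t is equal to the number of simple paths from s to t whose vertices are all both in the scope of s with respect to t and inversely in the scope of t with respect to s.

Proof.

Each such path starts from s and ends with t and, being simple, contains s and t exactly once. Hence, for every vertex v on the path, the part of the path from s to v does not pass over t, and the part of the path from v to t does not pass over s. Therefore, v belongs to both sets. ∎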

Lemma 4.1 says that in order to count simple paths from s to t, we need to consider only those paths whose vertices belong to both sets. We can use this to prune the vertices of the graph that are missing from at least one of the two sets. As a result, if the intersection of the two sets is (considerably) smaller than the set of vertices of the graph, this pruning technique can enormously improve the efficiency of the path counting algorithm. Note that the time complexity of simple path counting algorithms is exponential in the number of vertices of the graph. Hence, discarding a considerable part of the vertices can hugely improve the efficiency of the algorithms.

In Algorithm 1, we use this optimization technique to make path counting algorithms more efficient. If the out degree of s is 0 or the in degree of t is 0, there is no path from s to t; hence, Algorithm 1 returns 0. Otherwise, it computes the two sets and stores them. Then, for each vertex in the graph, it checks whether the vertex belongs to both sets. We form the subgraph induced by all such vertices. Finally, we apply the path counting algorithm on this induced subgraph. The induced subgraph is usually much smaller than G; therefore, the algorithm can be run much faster.

Figure 1: (a) A directed graph. (b) The graph obtained from it. For this example, the caption contrasts the variance of our algorithm with the variance of the algorithm of [DBLP:journals/jgaa/RobertsK07].
Input. A directed network G, a pair of vertices s and t.
Output. The number of simple paths from s to t.
if out degree of s is 0 or in degree of t is 0 then
   return 0.
end if
compute the set of vertices that are in the scope of s with respect to t.
compute the set of vertices that are inversely in the scope of t with respect to s.
for all vertices v of G do
   if v belongs to both computed sets then
      Mark v.
   end if
end for
form the induced subgraph of G consisting of the marked vertices.
call an (exact/approximate) path counting algorithm on the induced subgraph, with s and t.
return the result of this call.
Algorithm 1 High level pseudo code of computing the number of paths in a directed graph.
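The overall flow of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `prune` and `count_simple_paths_pruned`, the dictionary-of-lists graph format, and the `counter` callback (any exact or approximate counting routine taking a graph, s, and t) are hypothetical.

```python
def prune(adj, s, t):
    """Return the subgraph induced by the vertices in both scope sets.

    adj maps each vertex to a list of outgoing neighbors (assumed format).
    Returns None when no path from s to t can exist.
    """
    def reach(graph, start, blocker):
        # Traversal that records the blocker but never expands it.
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            if u == blocker:
                continue
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    # Inverse graph: flip the direction of every edge.
    rev = {}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)

    # Out degree of s is 0 or in degree of t is 0: no path exists.
    if not adj.get(s) or not rev.get(t):
        return None

    keep = reach(adj, s, t) & reach(rev, t, s)
    if s not in keep or t not in keep:
        return None  # t is not reachable from s
    return {u: [v for v in adj.get(u, ()) if v in keep] for u in keep}


def count_simple_paths_pruned(adj, s, t, counter):
    """Apply the pruning step, then delegate to the supplied counting routine."""
    sub = prune(adj, s, t)
    return 0 if sub is None else counter(sub, s, t)
```

Because the counting routine is passed in as a parameter, the same pruning step can precede an exhaustive enumerator or a sampling-based estimator without modification.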

4.2 Exact algorithm

In an exact algorithm for counting simple paths that exhaustively enumerates all simple paths from s to t, we can use backtracking: we start from s, take a path and walk it (so that the path does not contain repeated vertices). If the path ends at t, we count it, backtrack, and take some other path. If the path does not reach t, we discard it and take some other path. This algorithm enumerates all the simple paths from s to t. We refer to this algorithm as Exhaustive Simple Path Enumerator, or ESPE for short. As discussed in [DBLP:journals/siamcomp/Valiant79], enumerating paths for both directed and undirected graphs is #P-complete. In the following, we show that using our pruning technique and for a certain type of vertices in a directed graph, the simple path enumeration problem can be solved in polynomial time.
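The backtracking scheme described above can be sketched as follows. This is a minimal illustration of the idea, not the implementation used in the experiments; the function name and the dictionary-of-lists graph format are assumptions.

```python
def count_simple_paths(adj, s, t):
    """Count simple paths from s to t by exhaustive backtracking.

    adj maps each vertex to a list of its outgoing neighbors (assumed format).
    """
    on_path = {s}  # vertices of the path currently being walked

    def extend(u):
        if u == t:
            return 1  # the path ends at t: count it and backtrack
        total = 0
        for v in adj.get(u, ()):
            if v not in on_path:  # keep the path simple
                on_path.add(v)
                total += extend(v)
                on_path.remove(v)  # backtrack and try another branch
        return total

    return extend(s)


# Complete DAG on four vertices: edges i -> j for all i < j.
dag = {0: [1, 2, 3], 1: [2, 3], 2: [3], 3: []}
print(count_simple_paths(dag, 0, 3))
```

The recursion depth is bounded by the number of vertices, so for large graphs an explicit stack may be preferable to recursion in Python.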

In the simple path counting problem, the input size consists of the number of vertices and the number of edges of G. In Theorem 4.2, we show that if, for s and t, the number of vertices kept by our pruning technique is a constant, then the simple path counting problem can be solved efficiently. Later, in Section 5, we empirically show that in most real-world networks, this number is usually small for vertices s and t. It should be highlighted that even for such pairs of vertices, applying the above mentioned exhaustive simple path enumeration algorithm without using our pruning technique does not yield a polynomial time algorithm.

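Theorem 4.2. Let G be a directed graph, where for a given pair of vertices s and t, the number of vertices kept by our pruning technique is a constant. The list of simple paths from s to t can be enumerated in polynomial time (in terms of the number of vertices and the number of edges of G).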

Proof.

We can compute the two sets and their intersection in time linear in the number of vertices and edges of G. So, the induced subgraph can be computed in linear time, too. After that, we work with this subgraph, which has only a constant number of vertices. Hence, all the paths can be enumerated in constant time. ∎

In fact, the proof of Theorem 4.2 presents a much better result than polynomial time: it says that under the mentioned condition, and after the linear time spent to construct the induced subgraph, the whole list can be enumerated in constant time (in terms of the number of vertices and the number of edges of G).

4.3 Approximate algorithm

Roberts and Kroese [DBLP:journals/jgaa/RobertsK07] proposed randomized algorithms for estimating the number of simple paths in a graph. In this section, we investigate how the first randomized algorithm of [DBLP:journals/jgaa/RobertsK07] can be improved using our pruning technique (see footnote 4). First, we briefly describe how this algorithm changes if we apply our pruning technique to it. Then, we analyze this revised algorithm and compare it with the first algorithm of [DBLP:journals/jgaa/RobertsK07].

Footnote 4: The second randomized algorithm of [DBLP:journals/jgaa/RobertsK07] and most of the other randomized algorithms in the literature can be improved in a similar way.

In the approximate algorithm, we sample independent (random) paths and compute the probability of sampling each path. The following procedure is used to sample each path:

  1. Start with vertex . Initialize the following variables: (current vertex), (probability), (counter), and (path).

  2. Mark as (to be sure that will not visit again).

  3. Let be the set of possible vertices for the next step of the path.

  4. Choose the next vertex of the path uniformly at random and set to .

  5. Set , , , and . Mark as .

  6. If , then stop. Otherwise go to step .
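The sampling procedure above, together with the averaging of inverse probabilities, can be sketched as follows. This is a hedged illustration, not the algorithm of [DBLP:journals/jgaa/RobertsK07] verbatim: all variable and function names are assumptions, and the sketch defensively handles an empty candidate set by treating the sample as contributing 0 to the estimate.

```python
import random


def sample_path(adj, s, t, rng):
    """Sample one random simple path starting at s.

    Returns (reached_t, probability), where probability is the product of
    1 / (number of candidate next vertices) over the steps taken.
    """
    current, prob, visited = s, 1.0, {s}
    while current != t:
        # Possible vertices for the next step: unvisited outgoing neighbors.
        candidates = [v for v in adj.get(current, ()) if v not in visited]
        if not candidates:
            return False, prob  # no valid path was generated
        current = rng.choice(candidates)  # uniform choice of the next vertex
        prob /= len(candidates)
        visited.add(current)
    return True, prob


def estimate_simple_paths(adj, s, t, samples, seed=0):
    """Average 1/probability over the samples that reach t (0 for the others)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        reached, prob = sample_path(adj, s, t, rng)
        if reached:
            total += 1.0 / prob
    return total / samples


# Diamond graph with exactly two simple paths from s to t.
diamond = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
print(estimate_simple_paths(diamond, "s", "t", samples=10))
```

In the diamond example every sampled path has probability 1/2, so each sample contributes exactly 2 and the estimate has zero variance. Running the estimator on a pruned graph removes candidate vertices that cannot lie on any simple path from s to t.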

If we do not use our pruning technique, instead of working with , we should work (e.g., in step 3) with . This has consequences. For example, while a path sampled from may never reach to (so after some iterations, becomes empty), this never happens when we work with . Hence, as stated in [DBLP:journals/jgaa/RobertsK07], when working with we need to check the following condition in step 3: if , we do not generate a valid path, so stop.

Let be the set of vertices of each path , with . In the above mentioned procedure, path is chosen with probability:

(1)

Let be the set of simple paths from to . Let also be the set of sampled paths. For each , we estimate the number of simple paths as . The final estimation is the average of estimations of different samples:

(2)

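It is easy to see that the final estimation is an unbiased estimator for the number of simple paths from s to t: since it is the average of random variables whose expected value is the number of simple paths from s to t, its own expected value is the number of simple paths from s to t, too.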
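The unbiasedness argument can be written out explicitly. The notation below is an assumption introduced for this sketch, since the original symbols are not available here: \mathcal{P} is the set of simple paths from s to t, p(x) is the probability of sampling path x, X_1, ..., X_N are the sampled paths, and \hat{Z} is the final estimation. The derivation assumes that every sampled path belongs to \mathcal{P} and that every path in \mathcal{P} has positive probability.

```latex
\mathbb{E}\!\left[\frac{1}{p(X)}\right]
  = \sum_{x \in \mathcal{P}} p(x)\,\frac{1}{p(x)}
  = |\mathcal{P}|,
\qquad
\mathbb{E}\big[\hat{Z}\big]
  = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\!\left[\frac{1}{p(X_i)}\right]
  = |\mathcal{P}| .
```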

For the variance of , we have:

Since is the average of independent random variable , we have:
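As a worked sketch of this step, the variance can be derived under the same assumed notation (\mathcal{P} for the set of simple paths from s to t, p(x) for the sampling probability of path x, N for the number of independent samples, and \hat{Z} for the final estimation). The form below follows from the definition of variance and may differ in presentation from the original display equations.

```latex
\operatorname{Var}\!\left[\frac{1}{p(X)}\right]
  = \sum_{x \in \mathcal{P}} p(x)\,\frac{1}{p(x)^{2}} - |\mathcal{P}|^{2}
  = \sum_{x \in \mathcal{P}} \frac{1}{p(x)} - |\mathcal{P}|^{2},
\qquad
\operatorname{Var}\big[\hat{Z}\big]
  = \frac{1}{N}\left(\sum_{x \in \mathcal{P}} \frac{1}{p(x)} - |\mathcal{P}|^{2}\right).
```

The second equality uses the independence of the N samples, so the variance of their average is the single-sample variance divided by N.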

Now let us compare this variance with the variance of the estimator of the algorithm of [DBLP:journals/jgaa/RobertsK07]. Unlike our algorithm, in the algorithm of [DBLP:journals/jgaa/RobertsK07] it is possible that a sampled path is not a simple path from s to t. The sample space of the algorithm of [DBLP:journals/jgaa/RobertsK07] consists of those paths in G that start from s and end at t or at a vertex whose set of possible next vertices is empty. In a way similar to our analysis, its variance is:

(3)

where is defined as follows ( are vertices of ):

(4)

Moreover in Equation 4.3, is an indicator function, which is if is a path from to , and otherwise.

Comparing the variance of our estimator with the variance of the estimator of [DBLP:journals/jgaa/RobertsK07] reveals that our algorithm has a better (lower) variance. In order to have a lower variance, an algorithm should assign probabilities that are as low as possible to the paths that it may sample but that are not simple paths from s to t. Because in this case their inverse will be a very large number and the contribution of these large numbers to the variance will be . Our algorithm assigns probability 0 to such paths, so it manages them very efficiently. However, the algorithm of [DBLP:journals/jgaa/RobertsK07] may assign a (large) non-zero value to many of them. For example, in Figure 1 assume that we want to estimate the number of simple paths from vertex to vertex . The sample space of our algorithm consists of only one path, which is the path . Hence, the variance of its estimation, even when is , is . In contrast, the sample space of the algorithm of [DBLP:journals/jgaa/RobertsK07] consists of paths: , , , , and , where it assigns an equal probability to each path. Among them, only the last one is a simple path between the two chosen vertices. While this algorithm assigns the same probability to each one, our algorithm gives probability 0 to the paths that are not simple paths between the two chosen vertices. With , the variance of the estimation of the algorithm of [DBLP:journals/jgaa/RobertsK07] is . Since our algorithm assigns probability 0 to the paths that are not simple paths from the source to the target, it never finds a variance worse than the algorithm of [DBLP:journals/jgaa/RobertsK07].

Dataset #vertices #edges domain
soc-sign-Slashdot090221 [DBLP:conf/chi/LeskovecHK10] 82140 549202 social
soc-sign-epinions [DBLP:conf/chi/LeskovecHK10] 131828 841372 social
p2p-Gnutella08 [DBLP:journals/tkdd/LeskovecKF07] 6301 20777 peer-to-peer file sharing
p2p-Gnutella30 [DBLP:journals/tkdd/LeskovecKF07] 36682 88328 peer-to-peer file sharing
p2p-Gnutella31 [DBLP:journals/tkdd/LeskovecKF07] 62586 147892 peer-to-peer file sharing
wikiVote [DBLP:conf/chi/LeskovecHK10] 7115 103689 Wikipedia vote
collegeMsg [DBLP:journals/jasis/PanzarasaOC09] 1899 59835 message exchanging
soc-Epinions1 [DBLP:conf/semweb/RichardsonAD03] 75879 508837 who-trust-whom
email-EuAll [DBLP:journals/tkdd/LeskovecKF07] 265214 420045 communication
cit-HepPh [DBLP:conf/kdd/LeskovecKF05] 34546 421578 citation
cit-HepTh [DBLP:conf/kdd/LeskovecKF05] 27770 352807 citation
sx-mathoverflow [DBLP:conf/wsdm/ParanjapeBL17] 24818 506550 stack exchange (Math Overflow)
sx-askubuntu [DBLP:conf/wsdm/ParanjapeBL17] 159316 964437 stack exchange (Ask Ubuntu)
sx-superuser [DBLP:conf/wsdm/ParanjapeBL17] 194085 1443339 stack exchange (Super User)
amazon0302 [DBLP:journals/tweb/LeskovecAH07] 262111 1234877 product co-purchasing
amazon0312 [DBLP:journals/tweb/LeskovecAH07] 400727 3200440 product co-purchasing
Table 1: Specifications of our large networks.

5 Experimental results

In order to evaluate the empirical behaviour of our pruning technique, we perform extensive experiments over several real-world networks from different domains. The objective is to empirically show that in real-world directed networks, for a given pair of vertices s and t, the set of vertices kept by our pruning technique is small (compared to the set of all vertices), and it can be computed very efficiently. These two properties automatically improve the efficiency of any (exact/approximate) simple path counting algorithm.

Dataset   max size ratio   max time (seconds)
soc-sign-Slashdot090221 0.3333 0.0889
soc-sign-epinions 0.3144 0.1172
p2p-Gnutella08 0.3296 0.0014
p2p-Gnutella31 0.2261 0.0150
p2p-Gnutella30 0.2315 0.0081
wikiVote 0.1570 0.0068
collegeMsg 0.6821 0.0004
email-EuAll 0.1289 0.0618
soc-Epinions1 0.4246 0.0761
cit-HepPh 0.0000002 0.2833
cit-HepTh 0.0000002 0.2765
sx-mathoverflow 0.1478 0.0495
sx-askubuntu 0.1160 0.1399
sx-superuser 0.1477 0.2418
amazon0302 0.9229 0.2255
amazon0312 0.9488 0.5525
Table 2: Empirical evaluation of our proposed pruning technique. The first column of values presents the maximum size of the set of vertices kept by our pruning technique divided by the number of vertices (over different pairs of vertices). The second column of values presents the maximum time to compute this set, where the reported times are in seconds.
Dataset #vertices #edges domain
bio-CE-GN [DBLP:conf/aaai/RossiA15] 2220 53683 Gene functional associations
bio-SC-LC [DBLP:conf/aaai/RossiA15] 2004 20452 Gene functional associations
bn-mouse_retina_1 [DBLP:conf/aaai/RossiA15] 1123 90811 Brain network
C1000-9 [DBLP:conf/aaai/RossiA15] 1001 450081 DIMACS
C2000-9 [DBLP:conf/aaai/RossiA15] 2001 1799534 DIMACS
SmaGri [DBLP:conf/aaai/RossiA15] 1060 4921 Citation network
Table 3: Specifications of our smaller networks.
Dataset             max (vertices kept)   max total running time             max #simple paths
                                          pruned-ESPE    ESPE
bio-CE-GN           135                   5.253          not terminated      1,180,840
bio-SC-LC           144                   414.605        not terminated      119,738,817
bn-mouse_retina_1   29                    183.987        not terminated      17,012,976
C1000-9             35                    195.333        448.225             919,343,373
C2000-9             34                    171.182        253.073             887,306,870
SmaGri              52                    0.001          5.973               1,030
Table 4: Empirical comparison of our proposed pruning technique. All the reported times are in seconds, and "not terminated" means the algorithm does not terminate within 12 hours.

The experiments are done on an Intel processor clocked at GHz with GB main memory, running Ubuntu Linux LTS. We perform our tests over 16 real-world datasets from the SNAP repository (https://snap.stanford.edu/data/). The specifications of the datasets are summarized in Table 1. Over each dataset, we choose pairs of vertices s and t, uniformly at random. Then, we run our proposed pruning technique for each pair. In the end, we report the maximum size of the set of vertices kept by our pruning technique divided by the number of vertices, and the maximum time to compute this set, over all the sampled pairs (the two value columns of Table 2).

The results are reported in Table 2. On the one hand, since the set of vertices kept by our pruning technique is usually considerably smaller than the set of vertices of the graph, our pruning technique usually makes the real-world graphs considerably smaller. Hence, any path counting problem (exact, approximate, …) can be solved much faster on the smaller graph. Compared to the number of vertices, we may consider the size of this set as a constant. Therefore, after pruning, the path counting algorithm runs in constant time. Note that the reported values are the maximum ones and most of the other values are much smaller.

On the other hand, the values reported in the column indicate that computing , , and their intersection can be done very quickly. Note that contains all the time required to compute , , and their intersection. As reported in the table, over all the datasets, is less than seconds! This time is quite ignorable, compared to the time of counting all simple paths. Note that while worst case time complexity of our pruning technique is , in practice it is usually performed much faster. The reason is that for most of vertices, the direction of edges makes the sets and small. As a result, BFSs (or DFSs) conducted on these small sets, can be conducted very efficiently.

In our experiments, we tried to compare the ESPE algorithm (discussed in Section 4.2) against the method wherein we first apply our pruning technique and then use ESPE (we refer to this method as pruned-ESPE). However, over any of the above mentioned networks, neither ESPE nor pruned-ESPE finishes within a reasonable time (24 hours!). Therefore, to compare the efficiency of pruned-ESPE against ESPE, we switch to smaller graphs. Our smaller real-world graphs have between 1K and 2.3K vertices. Their specifications are summarized in Table 3 (they can be downloaded from http://networkrepository.com/networks.php). We treat all these networks as directed graphs.

When comparing pruned-ESPE against ESPE, we perform the following scenario: over each dataset, we sample uniformly at random 500 pairs of nodes s and t, such that there exists at least one path from s to t. Then, for each pair, we run pruned-ESPE and ESPE and compute the following statistics: the number of vertices kept by our pruning technique, the total running time of pruned-ESPE (which includes the pre-processing/pruning phase), the running time of ESPE, and the number of simple paths. In the end, among all the sampled pairs, we choose the pair that yields the maximum number of simple paths and report the results of this chosen pair. For both algorithms, this chosen pair usually takes the most time to process. Thus, the first "max" column depicts the number of vertices kept by our pruning technique for the chosen pair, "max total running time" depicts the total time to process the chosen pair (by each of the algorithms), and "max #simple paths" presents the number of simple paths between the two nodes of the chosen pair.

Table 4 reports the results. As seen in the table, using our proposed pruning technique significantly improves the running time. All the times reported in this table are in seconds, and "not terminated" means the algorithm does not terminate within 12 hours. So, for the chosen pair and over bio-CE-GN, bio-SC-LC and bn-mouse_retina_1, while ESPE does not produce the results within a reasonable time (12 hours), pruned-ESPE processes the chosen pairs quickly. Over the other graphs, where both algorithms terminate within a reasonable time, pruned-ESPE is considerably faster. As evidenced by the values of the first "max" column, this is due to the efficiency of our pruning technique in considerably reducing the search space.

6 Conclusion

In this paper, we studied efficient simple path counting in directed graphs. For a given pair of vertices s and t in a directed graph, first we presented a pruning technique that can efficiently reduce the search space. Then, we investigated how this pruning technique can be combined with exact and approximate algorithms to improve their efficiency. In the end, we performed extensive experiments over several real-world networks to show the empirical efficiency of our proposed pruning technique.

References