Query-by-Sketch: Scaling Shortest Path Graph Queries on Very Large Networks

04/20/2021 ∙ by Ye Wang, et al. ∙ Massey University, Australian National University

Computing shortest paths is a fundamental operation in processing graph data. In many real-world applications, discovering shortest paths between two vertices empowers us to make full use of the underlying structure to understand how vertices are related in a graph, e.g. the strength of social ties between individuals in a social network. In this paper, we study the shortest-path-graph problem that aims to efficiently compute a shortest path graph containing exactly all shortest paths between any arbitrary pair of vertices on complex networks. Our goal is to design an exact solution that can scale to graphs with millions or billions of vertices and edges. To achieve high scalability, we propose a novel method, Query-by-Sketch (QbS), which efficiently leverages offline labelling (i.e., precomputed labels) to guide online searching through a fast sketching process that summarizes the important structural aspects of shortest paths in answering shortest-path-graph queries. We theoretically prove the correctness of this method and analyze its computational complexity. To empirically verify the efficiency of QbS, we conduct experiments on 12 real-world datasets, among which the largest dataset has 1.7 billion vertices and 7.8 billion edges. The experimental results show that QbS can answer shortest-path graph queries in microseconds for million-scale graphs and less than half a second for billion-scale graphs.


1. Introduction

Graphs are typical data structures used for representing complex relationships among entities, such as friendships in social networks, connections in computer networks, and links among web pages (Scott, 1988; Boccaletti et al., 2006; Ukkonen et al., 2008). Computing shortest paths between vertices is a fundamental operation in processing graph data, and has been used in many algorithms for graph analytics (Yao et al., 2013; Opsahl et al., 2010; Kolaczyk et al., 2009). These algorithms are often applied to support applications that require low latency on graphs with millions or billions of vertices and edges. Therefore, it is highly desirable – but challenging – to compute shortest paths efficiently on very large graphs.

The problem of point-to-point shortest path queries, which asks for a shortest path between two vertices in a graph, has been well studied (Goldberg and Harrelson, 2005; Goldberg et al., 2006; Bast et al., 2007; Goldberg, 2007; Wagner and Willhalm, 2007; Abraham et al., 2010; Wu et al., 2012; Sankaranarayanan et al., 2009; Sanders and Schultes, 2005). By leveraging specific properties of road networks, such as hierarchical structures and near planarity (Fu et al., 2013; Akiba et al., 2013), previous works have proposed various exact and approximate methods for answering point-to-point shortest path queries (Cowen and Wagner, 2004; Abraham et al., 2012). Nonetheless, these methods often do not perform well on complex networks (e.g., social networks and web graphs) because complex networks exhibit different properties from road networks, such as small diameter and local clustering (Goldberg and Harrelson, 2005; Fu et al., 2013; Akiba et al., 2013). Furthermore, existing methods for point-to-point shortest path queries were designed with the guarantee of finding only one shortest path, which limits their usability in practical applications.

Figure 1. An illustration of shortest paths between two vertices and whose distance is : (a) one shortest path; (b) three shortest paths; (c) seven shortest paths.
Figure 2. An illustration of our method Query-by-Sketch (QbS) for answering all shortest path queries.

Given two vertices and , as depicted in Figure 1(a)-(c), they have the same distance and cannot be distinguished from one another if only one shortest path is considered. However, when considering all shortest paths, the shortest paths between these two vertices indeed exhibit considerably different structures in Figure 1(a)-(c), which can not only distinguish vertices and in different scenarios, but also empower us to make full use of such structures to analyze how they are connected. Thus, in this paper, we study the problem of finding the structure of shortest paths between vertices. Specifically, we use the notion of “shortest path graph” to represent the structure of shortest paths between two vertices, which is a subgraph containing exactly all shortest paths between these two vertices. Accordingly, we term this problem as the shortest-path-graph problem (formally defined in Section 2).

Interestingly, shortest path graph manifests itself as a basis for tackling various shortest path related problems, particularly when investigating the structure of the solution space of a combinatorial problem based on shortest paths, for example, the Shortest Path Rerouting problem (i.e., to find a rerouting sequence from one shortest path to another shortest path that only differs in one vertex) (Kamiński et al., 2011; Bonsma, 2013; Nishimura, 2018), the Shortest Path Network Interdiction problem (i.e., to find critical edges and vertices whose removal can destroy all shortest paths between two vertices) (Khachiyan et al., 2008; Israeli and Wood, 2002), and the variants such as the Shortest Path Common Links problem (i.e., to find links common to all shortest paths between two vertices) (Labbé et al., 1995; Hansen et al., 1986). These shortest path related problems are motivated by a wide range of real-world applications arising in designing and analyzing networks. For example, identifying a rerouting sequence for shortest paths enables the robust design of networks with minimal cost for reconfiguration, and finding critical edges and vertices helps defend critical infrastructures against cyberattacks.

However, computing shortest path graphs is computationally expensive since it requires identifying all shortest paths, not just one, between two vertices. A straightforward solution for answering shortest-path-graph queries is to compute all shortest paths between two vertices on the fly, using Dijkstra's algorithm for weighted graphs (Dijkstra and others, 1959) or a breadth-first search (BFS) for unweighted graphs (Cormen et al., 2009). This is costly on graphs with millions or billions of vertices and edges. Another solution is to precompute all shortest paths for all pairs of vertices in a graph and then assign precomputed labels to vertices such that certain properties hold, e.g. 2-hop distance cover (Cohen et al., 2003). However, for large graphs, storing even just the shortest path distances of all pairs of vertices is prohibitive (Akiba et al., 2013), and storing all shortest paths of all pairs is hardly feasible because it demands much more space. Thus, the question we tackle in this paper is: How can we construct labels for shortest-path-graph queries that are of reasonable size (e.g. not much larger than the original graph), can be built within a reasonable time (e.g. not longer than one day), and speed up query answering as much as possible? In answering this question, we develop an efficient solution for shortest-path-graph queries. It is worth noting that: 1) we do not enumerate all shortest paths to produce a shortest path graph that contains exactly all shortest paths between two vertices; 2) our proposed solution can answer shortest-path-graph queries very efficiently, in microseconds for graphs with millions of edges and in less than half a second for graphs with billions of edges.

Contributions.  In the following, we summarize the contributions of this paper with the key technical details:

(1) We observe that 2-hop distance cover is inadequate for the labelling required by shortest-path-graph queries. To address this limitation and achieve high scalability, we propose a scalable method for answering shortest-path-graph queries, called Query-by-Sketch (QbS). This method consists of three phases, as illustrated in Figure 2: (a) labelling - constructing a compact labelling scheme using a small number of landmarks through precomputation, (b) sketching - using the labelling to efficiently compute a sketch that summarizes the important structure of shortest paths in a query answer, and (c) searching - computing shortest paths on a sparsified graph under the "guide" of the sketch. We develop efficient algorithms for these phases, and combine them effectively to handle shortest-path-graph queries on very large graphs.

(2) We theoretically prove the correctness of our method QbS. In addition, we analyze the complexity of QbS by analysing the time complexities of the algorithms for constructing a labelling scheme, computing a sketch, and performing a guided search for answering queries. We also prove that our labelling scheme is deterministic w.r.t. landmarks. This enables us to leverage thread-level parallelism by performing BFSs from different landmarks simultaneously without considering an order of landmarks, which improves the efficiency of labelling construction and thus achieves better scalability.

(3) We have conducted experiments on 12 real-world datasets, among which the largest dataset ClueWeb09 has 1.7 billion vertices and 7.8 billion edges. The results show that QbS has significantly better scalability than the baseline methods. The labelling construction of QbS can be parallelized, and takes 10 seconds for datasets with millions of edges and half an hour for the largest dataset ClueWeb09. The labellings constructed by QbS are generally smaller than the original graphs. Further, QbS can answer queries much faster than the other methods. For graphs with billions of edges, it takes only around 0.01 - 0.5 seconds to answer a query.

2. Preliminaries

Let $G = (V, E)$ be an unweighted graph, where $V$ and $E$ represent the set of vertices and edges in $G$, respectively. Without loss of generality, we assume that $G$ is undirected and connected since our work can be easily extended to directed or disconnected graphs. We use $V(G)$ and $E(G)$ to refer to the set of vertices and edges in $G$, respectively, $P_G(u, v)$ to refer to the set of all shortest paths between $u$ and $v$, and $d_G(u, v)$ to refer to the shortest path distance between $u$ and $v$ in $G$.

Distance labelling. Let $R \subseteq V$ be a subset of special vertices in $G$, called landmarks. For each vertex $v \in V$, the label of $v$ is a set of labelling entries $L(v) = \{(r_1, \delta_{vr_1}), \dots, (r_n, \delta_{vr_n})\}$, where $r_i \in R$ and $\delta_{vr_i} = d_G(v, r_i)$. We call $L = \{L(v)\}_{v \in V}$ a labelling over $G$. The size of a labelling is defined as $size(L) = \sum_{v \in V} |L(v)|$. Viewing each labelling entry $(r, \delta_{vr})$ as a hop from a vertex $v$ to a landmark $r$ with the distance $\delta_{vr}$, Cohen et al. (Cohen et al., 2003) proposed 2-hop distance cover, which has been widely used in labelling-based approaches for distance queries.

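Definition 2.1 (2-hop distance cover).

A labelling $L$ over a graph $G = (V, E)$ is a 2-hop distance cover iff, for any two vertices $u, v \in V$, the following holds:

$$d_G(u, v) = \min\{\delta_{ur} + \delta_{vr} \mid (r, \delta_{ur}) \in L(u),\ (r, \delta_{vr}) \in L(v)\}$$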

Informally, 2-hop distance cover requires that, for any two vertices in a graph, their labels must contain at least one common landmark that lies on one of their shortest paths.
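
As a minimal sketch of how a 2-hop distance cover answers a distance query, the Python function below takes the minimum of the summed distances over the landmarks shared by two labels. Storing a label as a dictionary from landmark to distance, and the name `label_distance`, are illustrative assumptions and not part of the original method description.

```python
def label_distance(label_u, label_v):
    """Distance implied by two labels, each a dict {landmark: distance}."""
    common = label_u.keys() & label_v.keys()
    if not common:
        return float("inf")  # no shared landmark: the labels give no answer
    return min(label_u[r] + label_v[r] for r in common)


# Path graph 1 - 2 - 3 with hand-written labels that form a 2-hop distance cover.
labels = {1: {2: 1, 1: 0}, 2: {2: 0}, 3: {2: 1, 3: 0}}
d13 = label_distance(labels[1], labels[3])
```

Note that the query only returns a distance. It says nothing about how many shortest paths realise it, which is the gap the following sections address.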

Shortest-path-graph problem. In this work, we study shortest-path-graph queries. We first define the notion of shortest path graph.

Definition 2.2 (Shortest path graph).

Given any two vertices $u$ and $v$ in a graph $G$, the shortest path graph (SPG) between $u$ and $v$ is a subgraph $G_{uv}$ of $G$, where (1) $V(G_{uv}) = \bigcup_{p \in P_G(u, v)} V(p)$ and (2) $E(G_{uv}) = \bigcup_{p \in P_G(u, v)} E(p)$.

A shortest path graph $G_{uv}$ is different from an induced subgraph $G[V']$ where $V' = V(G_{uv})$. Every edge in $G_{uv}$ must lie on at least one shortest path between $u$ and $v$, whereas $G[V']$ may contain edges that do not lie on any shortest path between $u$ and $v$.

Definition 2.3 (Shortest-path-graph problem).

Let $G = (V, E)$ and $u, v \in V$. Then the shortest-path-graph problem is, given a query $SPG(u, v)$, to find the shortest path graph $G_{uv}$ over $G$.
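
To make the problem concrete, here is a hedged Python sketch of the straightforward BFS baseline for unweighted, undirected graphs. It runs one BFS from each query vertex and keeps an edge exactly when it lies on some shortest path. The adjacency-dictionary input and the function names are assumptions made for illustration; this is the costly on-the-fly approach, not QbS.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def shortest_path_graph(adj, u, v):
    """Return (vertices, edges) of the shortest path graph between u and v."""
    du, dv = bfs_distances(adj, u), bfs_distances(adj, v)
    if v not in du:
        return set(), set()  # u and v are disconnected
    d = du[v]
    edges = set()
    for a in du:
        for b in adj[a]:
            # Edge a-b is on a shortest path iff u -> a -> b -> v has length d.
            if b in dv and du[a] + 1 + dv[b] == d:
                edges.add(frozenset((a, b)))
    vertices = {x for e in edges for x in e} | {u, v}
    return vertices, edges


# A 4-cycle 1-2-4-3-1 with a pendant vertex 5 attached to 4.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
vertices, edges = shortest_path_graph(adj, 1, 4)
```

Both BFSs may touch the whole graph, which is why this baseline does not scale to graphs with billions of edges.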

3. Shortest Path Labelling

In this section, we discuss several labelling-based methods for the shortest-path-graph problem. The purpose is to discuss their limitations and possible sources of difficulties.

3.1. 2-Hop Path Cover

Originally, 2-hop distance cover was proposed for reachability and distance queries (Cohen et al., 2003). Below, we discuss why it is insufficient for shortest-path-graph queries.

Example 3.1.

Consider a query on a graph depicted in Figure LABEL:fig:new_ex (a). The query answer is colored in green. In Figure LABEL:fig:new_ex(b), labels of a 2-hop distance cover over are colored in black. Starting from vertices and , we can find vertex because and , . Then, we have to stop since the label of vertex does not contain entries to other vertices. Thus, using the labels of the 2-hop distance cover can compute only one shortest path between and , failing to find vertices , and in the answer.

Finding a shortest path graph that exactly contains all shortest paths between two vertices requires us to accurately encode every shortest path between two vertices into labels. Thus, to answer shortest-path-graph queries, we generalize 2-hop distance cover to a property called 2-hop path cover.

3.2. Path Labelling Methods

To answer shortest-path-graph queries, a naive labelling-based method is, for each vertex , to conduct a breadth-first search (BFS) from and store the distances between and all other vertices in the label of , i.e. , which is a 2-hop path labelling. Although shortest-path-graph queries can be answered using , it is inefficient, particularly when a graph is large. The time and space complexity of constructing such labels are and respectively. Answering one shortest-path-graph query would cost in the worst case. A question that naturally arises is: can we follow the idea of Pruned Landmark Labelling (PLL) (Akiba et al., 2013), which has been shown to be successful for distance queries, to develop a pruning strategy for shortest-path-graph queries for improving efficiency? We will thus introduce two pruned path labelling methods for shortest-path-graph queries in the following.

Pruned path labelling. Inspired by Pruned Landmark Labelling (PLL) (Akiba et al., 2013), we conduct pruning during the breadth-first searches, i.e. pruned BFSs, for shortest-path-graph queries. We abbreviate this pruned path labelling method by PPL.

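PPL works as follows. Given a pre-defined landmark order $[v_1, \dots, v_{|V|}]$ over all vertices in $V$, we conduct a pruned BFS from each vertex one by one as described in Algorithm 1. In each pruned BFS rooted at $v_k$, we use $depth[u]$ to denote the distance between $v_k$ and $u$. Further, $L_{k-1}$ refers to the labels that have been constructed through the previous pruned BFSs from vertices $v_1, \dots, v_{k-1}$, and $d_{L_{k-1}}(v_k, u)$ denotes the distance between $v_k$ and $u$ being queried using labels in $L_{k-1}$. When $d_{L_{k-1}}(v_k, u) < depth[u]$, the label $(v_k, depth[u])$ is pruned (Lines 6-7) because labels in $L_{k-1}$ have already covered the shortest paths between $v_k$ and $u$. In other words, $(v_k, depth[u])$ is added into the label of a vertex $u$ only when $d_{L_{k-1}}(v_k, u) \geq depth[u]$ (Line 8). Note that, unlike PLL, in the case of $d_{L_{k-1}}(v_k, u) = depth[u]$, the label $(v_k, depth[u])$ cannot be pruned in PPL; otherwise, 2-hop path cover is not guaranteed, i.e., not all shortest paths are covered by labels. When $d_{L_{k-1}}(v_k, u) \leq depth[u]$, no further edges are traversed from $u$ because paths in this expansion have already been covered by labels in $L_{k-1}$ (Lines 6-7 and 9-10).

Input: $G = (V, E)$; a landmark $v_k$; a labelling $L_{k-1}$
1 $Q \leftarrow \emptyset$; $Q.push(v_k)$;
2 $depth[v_k] \leftarrow 0$, $depth[v] \leftarrow \infty$ for all $v \in V \setminus \{v_k\}$;
3 $L_k(v) \leftarrow L_{k-1}(v)$ for all $v \in V$;
4 while $Q$ is not empty do
5       dequeue $u$ from $Q$;
6       if $d_{L_{k-1}}(v_k, u) < depth[u]$ then
7             continue;
8       $L_k(u) \leftarrow L_{k-1}(u) \cup \{(v_k, depth[u])\}$;
9       if $d_{L_{k-1}}(v_k, u) = depth[u]$ then
10            continue;
11      for all $w$ with $(u, w) \in E$ s.t. $depth[w] = \infty$ do
12            $depth[w] \leftarrow depth[u] + 1$;
13            enqueue $w$ to $Q$;
14 return $L_k$;
Algorithm 1 PrunedBFS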
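
The pruned BFS can be sketched in Python as follows. This is an illustrative reading of the PPL description above, not the authors' code: labels are kept as dictionaries from landmark to distance, and the helper names `query`, `pruned_bfs` and `build_ppl` are assumptions. The key difference from PLL is that a label is still stored when the earlier labels give an equal distance, while the expansion stops there.

```python
from collections import deque

INF = float("inf")


def query(labels, s, t):
    """Distance between s and t implied by labels (vertex -> {landmark: dist})."""
    common = labels[s].keys() & labels[t].keys()
    return min((labels[s][r] + labels[t][r] for r in common), default=INF)


def pruned_bfs(adj, root, labels):
    """One pruned BFS of PPL from `root`; extends `labels` in place."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        # u has no `root` entry yet, so this query only uses earlier landmarks.
        d_old = query(labels, root, u)
        if d_old < depth[u]:
            continue                 # a strictly shorter path is already covered
        labels[u][root] = depth[u]   # kept even on ties, unlike PLL
        if d_old == depth[u]:
            continue                 # labelled, but not expanded any further
        for w in adj[u]:
            if w not in depth:
                depth[w] = depth[u] + 1
                queue.append(w)


def build_ppl(adj, order):
    """Run pruned BFSs in the given landmark order and return all labels."""
    labels = {x: {} for x in adj}
    for root in order:
        pruned_bfs(adj, root, labels)
    return labels


# 4-cycle 1-2-4-3-1: vertices 1 and 4 are joined by two shortest paths.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
labels = build_ppl(adj, order=[1, 4, 2, 3])
```

In this small example the labels of vertices 1 and 4 share the landmarks 2 and 3 at tight distances, so both shortest paths are represented. Pruning on ties, as PLL does, would drop those entries.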

To answer a query , we need to compute vertices and edges of from a pruned path labelling recursively. Assume that ; otherwise we finish with containing only one edge . We begin with . We find the common landmarks in their labels that are on the shortest paths, e.g., computing a set . Then we query the shortest paths between u, v and these common landmarks, i.e., and for each . The query is computed by combining the shortest paths between u, v and the landmarks, i.e., .

Example 3.2.

When using PPL to answer the query on the graph in Figure LABEL:fig:new_ex(a), we start with and obtain . This leads to four new queries and . The distance between and is 1. Thus, . For the new query , we obtain , leading to another queries and . Similarly, for and we obtain queries , , and . Note that the labels of vertex are visited more than once, i.e. when querying and . Further, because and have multiple shortest paths between them, more than one common vertex on their shortest paths are found from their labels, i.e. . As a result, edges and are handled multiple times, i.e., when querying and .

PPL has the same time and space complexity for constructing labels as the naive labelling-based method. However, due to pruning in BFSs, PPL can construct labels more efficiently with a significantly reduced labelling size. Nonetheless, the query time of PPL is still slow because all shortest paths between two vertices can only be found through searching vertices and edges using labels in a recursive manner. When more than one shortest path exists between query vertices, labels of some vertices are searched repeatedly and edges are found repeatedly, leading to unnecessary computational cost, e.g., vertex and edges as in Example 3.2.

Path labelling with parents. One common technique to accelerate query time for shortest-path-graph queries is to keep additional parent information in labels so as to provide a clearer direction towards shortest paths. For example, Akiba et al. (Akiba et al., 2013) extended the label of each vertex to a set of triples where is the “parent” vertex of on a shortest path from to . To find all shortest paths, this requires us to store all parent vertices of a vertex, rather than just one parent vertex as in the previous work for finding one shortest path. To be precise, we store a set of triples where is a set of “parent” vertices of on a shortest path from to a landmark . To reduce space overhead, for each of such shortest paths, we store the “parent” vertices of , rather than the “child” vertices of , because landmarks often have a high degree (Akiba et al., 2013). To distinguish from PPL, we abbreviate this method with additional parent information by ParentPPL.
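
The idea of storing all parents can be illustrated with a plain (unpruned) BFS from a single landmark. The Python sketch below records, for every vertex, the set of neighbours that precede it on some shortest path to the landmark, and then walks those sets to collect every edge on a shortest path. The function names are hypothetical, and the pruning that ParentPPL applies on top of this is deliberately left out.

```python
from collections import deque


def bfs_with_parents(adj, landmark):
    """Return (dist, parents) where parents[x] holds all BFS parents of x."""
    dist = {landmark: 0}
    parents = {landmark: set()}
    queue = deque([landmark])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                parents[y] = {x}
                queue.append(y)
            elif dist[y] == dist[x] + 1:
                parents[y].add(x)   # another shortest path reaches y via x
    return dist, parents


def edges_to_landmark(parents, v):
    """Edges of all shortest paths from v to the landmark, following parents."""
    edges, stack, seen = set(), [v], {v}
    while stack:
        x = stack.pop()
        for p in parents[x]:
            edges.add(frozenset((x, p)))
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return edges


adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
dist, parents = bfs_with_parents(adj, 1)
spg_edges = edges_to_landmark(parents, 4)
```

Each label entry now carries a set rather than a single number, which is where the additional space cost comes from.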

The time complexity of ParentPPL for constructing labels remains to be but the space complexity becomes . In practice, additional parent information only helps speed up query time on small graphs. Even for a graph with millions of vertices and edges, ParentPPL would run out of time (same as PPL) or space, failing to construct labels. We will discuss this further in Section 6.

3.3. Discussion

For 2-hop labelling-based methods such as PPL and ParentPPL, the structure (i.e. shortest paths) of a graph is encoded into distance information of labels under the guarantee of 2-hop path cover. Although shortest paths can be recovered through computing distances between pairs of vertices, these methods are inefficient. This is because they recursively split each path into two sub-paths and compute vertices on sub-paths via distance information in labels, which leads to redundant or unnecessary searches. Although storing parent information can often accelerate query time, it makes labelling size larger and does not scale over large networks. Therefore, we need to find a method for which (1) the labelling size is small, (2) the structure of shortest paths can be recovered in an efficient way, i.e., reducing redundant and unnecessary computation, and (3) it can scale over large networks.

4. Query-by-Sketch

In this section, we present an efficient and scalable method for solving the shortest-path-graph problem, called Query-by-Sketch (QbS). Conceptually, this method consists of three key components: labelling, sketching and searching, which will be discussed in Sections 4.1, 4.2 and 4.3, respectively. The main idea behind this method is to construct a labelling scheme through precomputation, and then answer shortest-path-graph queries by performing online computation that involves two steps: fast sketching and guided searching.

4.1. Labelling Scheme

Let $G = (V, E)$ be a graph, $R \subseteq V$ be a set of landmarks, and $|R| \ll |V|$ (i.e., $|R|$ is sufficiently smaller than $|V|$). We first preprocess the graph $G$ to obtain a compact representation of the shortest paths among landmarks, called a meta-graph of $G$. Then, based on such a meta-graph, we define a labelling scheme to assign a label to each vertex in $V \setminus R$ such that, given any pair of vertices $u, v \in V \setminus R$, we can efficiently compute a sketch for answering $SPG(u, v)$.

Definition 4.1 (Meta-graph).

A meta-graph is $M = (R, E_R, \sigma)$ where $R$ is a set of landmarks, $E_R \subseteq R \times R$ is a set of edges s.t. $(r, r') \in E_R$ iff at least one shortest path between $r$ and $r'$ does not go through any other landmarks, and $\sigma : E_R \to \mathbb{N}$ assigns each edge in $E_R$ a weight, i.e. $\sigma(r, r') = d_G(r, r')$.

Conceptually, a meta-graph represents how landmarks are connected through their shortest paths in a graph $G$.

Figure 3. (a) A graph with three landmarks (highlighted in green), (b) a meta-graph, and (c) a path labelling.

Definition 4.2 (Labelling scheme).

A labelling scheme $\mathcal{L} = (M, L)$ consists of a meta-graph $M$ and a path labelling $L$ that assigns to each vertex $u \in V \setminus R$ a label $L(u)$ s.t.

$$L(u) = \{(r, \delta_{ur}) \mid r \in R,\ \delta_{ur} = d_G(u, r),\ \exists p \in P_G(u, r) : V(p) \cap R = \{r\}\} \quad (1)$$

Note that, to accurately represent how vertices are linked to landmarks, we only allow that $(r, \delta_{ur})$ is in the label $L(u)$ iff there exists at least one shortest path between $u$ and $r$ that does not contain other landmarks.

Example 4.3.

Figure 3 depicts a graph (a) and the meta-graph (b) and the path labelling (c) of this graph. The edge in the meta-graph is assigned with a weight , i.e. , since there is one shortest path between and which goes through . The label of in the path labelling contains and . The labelling entry is not included in the label of because every shortest path between and goes through another landmark, i.e. or .

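Algorithm 2 describes the pseudo-code of our algorithm for constructing a labelling scheme. Given a graph $G$ and a set of landmarks $R$, we conduct a BFS from each landmark $r \in R$. We use two queues $Q_{label}$ and $Q_{prune}$ to keep track of visited vertices, which respectively need to be labelled and not to be labelled. All vertices, except for $r$, are initialized as being unvisited (Line 5). For each vertex $u \in Q_{label}$ at the $n$-th level of the BFS, we mark its unvisited neighbors $v$ as visited (Line 10). If $v$ is a landmark, we push $v$ into $Q_{prune}$, add an edge $(r, v)$ into $E_R$, and store the distance between $r$ and $v$ as the weight of this edge in $\sigma$. Otherwise, we push $v$ into $Q_{label}$ and add a label $(r, depth[v])$ in $L(v)$ for $v$ (Lines 11-17). Then, we check the unvisited neighbors $v$ of each vertex $u \in Q_{prune}$ at the $n$-th level, and push $v$ into $Q_{prune}$ without adding a label in $L$ or an edge in $E_R$ (Lines 18-21). This process is conducted level-by-level on the BFS (Line 22).

Input: $G = (V, E)$; a set of landmarks $R \subseteq V$
Output: A labelling scheme $\mathcal{L} = (M, L)$ with $M = (R, E_R, \sigma)$.
1 $E_R \leftarrow \emptyset$; $L(v) \leftarrow \emptyset$ for all $v \in V \setminus R$
2 for all $r \in R$ do
3       $Q_{label} \leftarrow \emptyset$; $Q_{prune} \leftarrow \emptyset$;
4       $Q_{label}$.push($r$);
5       $depth[r] \leftarrow 0$; $depth[v] \leftarrow \infty$ for all $v \in V \setminus \{r\}$;
6       $n \leftarrow 0$;
7       while $Q_{label}$ and $Q_{prune}$ are not empty do
8             for all $u \in Q_{label}$ at depth $n$ do
9                   for all unvisited neighbors $v$ of $u$ do
10                        $depth[v] \leftarrow n + 1$;
11                        if $v$ is a landmark then
12                              $Q_{prune}$.push($v$);
13                              $E_R \leftarrow E_R \cup \{(r, v)\}$;
14                              $\sigma(r, v) \leftarrow depth[v]$;
15                        else
16                              $Q_{label}$.push($v$);
17                              $L(v) \leftarrow L(v) \cup \{(r, depth[v])\}$;
18            for all $u \in Q_{prune}$ at depth $n$ do
19                  for all unvisited neighbors $v$ of $u$ do
20                        $depth[v] \leftarrow n + 1$;
21                        $Q_{prune}$.push($v$);
22            $n \leftarrow n + 1$;
Algorithm 2 Constructing a labelling scheme

Figure 4. An illustration of labelling: (a), (b) and (c) describe the BFSs rooted at the landmarks 1, 2 and 3, respectively, where light and dark green vertices denote the landmarks, and yellow vertices denote those being labelled.

Example 4.4.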

Figure 4 shows how our algorithm conducts BFSs to construct labels. The BFS from landmark is depicted in Figure 4(a), in which vertices are labelled because the other vertices are either landmarks or have landmarks in all their shortest paths to landmark . We add edges and into the meta-graph. In the BFS from landmark in Figure 4(b), vertices are labelled because the shortest paths between and vertices in all go through landmark or . The BFS from landmark is depicted in Figure 4(c), which works in a similar manner.
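
The labelling construction can be sketched in Python as a level-by-level BFS with two frontiers per landmark. This is a hedged illustration of the procedure described for Algorithm 2, assuming an undirected graph given as an adjacency dictionary. The function name, the dictionary-based labels, and the use of `frozenset` keys for meta-graph edges are choices made here, not details taken from the paper.

```python
def build_labelling_scheme(adj, landmarks):
    """Return (meta_edges, labels) for the given landmark set.

    meta_edges maps frozenset({r, r2}) to the weight of that meta-graph edge;
    labels maps each non-landmark vertex to a dict {landmark: distance}.
    """
    landmarks = set(landmarks)
    labels = {x: {} for x in adj if x not in landmarks}
    meta_edges = {}
    for r in landmarks:
        depth = {r: 0}
        label_level, prune_level = [r], []   # current level of each frontier
        n = 0
        while label_level or prune_level:
            next_label, next_prune = [], []
            # Vertices here still have a landmark-free shortest path to r.
            for u in label_level:
                for v in adj[u]:
                    if v in depth:
                        continue
                    depth[v] = n + 1
                    if v in landmarks:
                        next_prune.append(v)
                        meta_edges[frozenset((r, v))] = depth[v]
                    else:
                        next_label.append(v)
                        labels[v][r] = depth[v]
            # Vertices here are only reachable through another landmark.
            for u in prune_level:
                for v in adj[u]:
                    if v in depth:
                        continue
                    depth[v] = n + 1
                    next_prune.append(v)
            label_level, prune_level = next_label, next_prune
            n += 1
    return meta_edges, labels


# Path 1 - 2 - 3 - 4 - 5 with landmarks 2 and 4.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
meta_edges, labels = build_labelling_scheme(adj, [2, 4])
```

Processing the labelled frontier before the pruned one at each level matters: a vertex reachable from both at the same depth has at least one landmark-free shortest path and therefore receives a label. Because each BFS only reads the graph and writes entries keyed by its own landmark, the result does not depend on the order in which landmarks are processed.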

Figure 5. An illustration of sketching and searching: (a) the sparsified graph of the graph shown in Figure 3(a); (b) the sketch for SPG(6,11) on the graph ; (c) the bi-directional BFS on , (d) the recover search based on , (e) the reverse search based on , and (f) shows the query answer of SPG(6,11).

4.2. Fast Sketching

Let $\mathcal{L} = (M, L)$ be a labelling scheme on a graph $G$. For a given query $SPG(u, v)$, we proceed to answer $SPG(u, v)$ in two steps: (1) computing a sketch for two vertices $u$ and $v$ from the labelling scheme efficiently; (2) computing the exact answer by conducting a guided search based on the sketch for two vertices $u$ and $v$. Hence, the purpose of such a sketch is to provide an efficient and principled way of searching for the answer of $SPG(u, v)$, which is particularly important on very large networks.

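Definition 4.5 (Sketch).

A sketch for $SPG(u, v)$ on $\mathcal{L}$ is $S_{uv} = (V_S, E_S, \sigma_S)$ where $V_S \subseteq R \cup \{u, v\}$ is a set of vertices, $E_S \subseteq V_S \times V_S$ is a set of edges, and $\sigma_S : E_S \to \mathbb{N}$ assigns a weight to each edge, satisfying the condition that $E_S$ contains only edges lying on the paths between $u$ and $v$ with the minimal length $d^{\top}_{uv}$ as defined below:

$$d^{\top}_{uv} = \min\{\delta_{ur} + d_M(r, r') + \delta_{r'v} \mid (r, \delta_{ur}) \in L(u),\ (r', \delta_{r'v}) \in L(v)\} \quad (2)$$

Here $d_M(r, r')$ denotes the distance between $r$ and $r'$ in the meta-graph $M$. Accordingly, we have the following corollary.

Corollary 4.6.

$d_G(u, v) \leq d^{\top}_{uv}$ holds.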

Algorithm 3 describes how to construct a sketch. Let and be a pair of vertices. We start with and . Then, for each pair of landmarks , we compute the minimum length of paths between and that go through and using the labels in and the meta graph (Lines 2-5). After that, we obtain the minimum length of paths between and that go through at least one landmark, i.e., (Line 6), and add the edges in these paths into , the vertices in these paths into , and the corresponding distances are associated with the edges (Lines 7-13).

Input: , two vertices and .
Output: A sketch
1 , ;
2 for all  do
3       ;
4       if  and  then
5             ;
6            
7      
8 min{};
9 for all and  do
10       ;
11       , ;
12       for all in the shortest path graph of in  do
13             ;
14            
15            
16      ;
17      
Algorithm 3 Computing a sketch
Example 4.7.

Figure 5(b) shows the sketch between two vertices and . The sketch has the edges , , , and because we have the following shortest paths between and with and . We thus have , and .
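
The sketching step can be illustrated with the following Python sketch, which follows the description of Algorithm 3 under some simplifying assumptions: labels are dictionaries from landmark to distance, meta-graph edges are given as a dictionary from vertex pairs to weights, and all-pairs distances on the meta-graph are computed with Floyd-Warshall (reasonable only because the number of landmarks is small). The function names are hypothetical.

```python
from itertools import product

INF = float("inf")


def meta_distances(landmarks, meta_edges):
    """All-pairs distances on the meta-graph via Floyd-Warshall."""
    d = {(a, b): (0 if a == b else INF) for a in landmarks for b in landmarks}
    for (a, b), w in meta_edges.items():
        d[a, b] = d[b, a] = min(d[a, b], w)
    for k, i, j in product(landmarks, repeat=3):   # k is the outermost loop
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d


def compute_sketch(u, v, label_u, label_v, landmarks, meta_edges):
    """Return (upper_bound, sketch_edges) for a query between u and v."""
    d = meta_distances(landmarks, meta_edges)
    # Best path length u -> r -> ... -> r2 -> v for every landmark pair.
    pi = {(r, r2): label_u[r] + d[r, r2] + label_v[r2]
          for r in label_u for r2 in label_v}
    upper = min(pi.values(), default=INF)
    edges = {}
    if upper == INF:
        return upper, edges
    for (r, r2), length in pi.items():
        if length != upper:
            continue
        edges[(u, r)] = label_u[r]
        edges[(r2, v)] = label_v[r2]
        for (a, b), w in meta_edges.items():
            # Keep a meta-graph edge if it lies on a shortest r-r2 path.
            if min(d[r, a] + w + d[b, r2], d[r, b] + w + d[a, r2]) == d[r, r2]:
                edges[(a, b)] = w
    return upper, edges


# Path 1 - 2 - 3 - 4 - 5 with landmarks 2 and 4 (one meta-graph edge of weight 2).
upper, edges = compute_sketch(1, 5, {2: 1}, {4: 1}, [2, 4], {(2, 4): 2})
```

The returned bound is the length of the shortest path that passes through at least one landmark, so it can never be smaller than the true distance. The sketch costs time that depends only on the label sizes and the number of landmarks, not on the size of the graph.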

4.3. Guided Searching

Guided by , we conduct a search to compute the exact answer of , based on the following observations:

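  • Such a search can be conducted on a sparsified graph $G^{-} = G[V \setminus R]$, obtained by removing all landmarks in $R$ and all edges incident to these landmarks from $G$. $d_{G^{-}}(u, v)$ may potentially be greater than $d_G(u, v)$; however, the number of search steps in this sparsified graph can be upper bounded by $d^{\top}_{uv}$, the minimal length given by Eq. (2), due to the fact that $d_G(u, v) \leq d^{\top}_{uv}$.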

  • can guide how to conduct a bi-directional search on the sparsified graph . Specifically, for , we have

    (3)

    which suggests the number of search steps from the and sides, respectively. Here, we subtract 1 because can be found via labels of vertices in at most steps.

Given a query on a graph , the answer can thus be computed by searching over the sparsified graph and the labelling scheme , guided by the sketch , as follows:

(4)

We use to refer to shortest paths between and that go through at least one landmark in .

Generally, a guided search has three stages: (1) Bi-directional search, which has a forward search from the side and a backward search from the side (Goldberg and Harrelson, 2005), under the guide of w.r.t. Eq. 3. This search terminates when common vertices are found or the upper bound is reached. (2) Reverse search, which reverses the previous bi-directional search back to and in order to compute shortest paths in . (3) Recover search, which recovers the relevant labelling information under the guide of in order to compute shortest paths in . As we do not know initially which of the three cases of Eq. 4 holds, a bi-directional search is always performed. This search provides us with , though we abort once can be guaranteed. Then depending on the values of and , a reverse search, a recover search, or both of them are performed to compute and as in Eq. 4.
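
The first stage can be sketched in Python as follows. This is a simplified illustration that covers only the sparsification and a bounded bi-directional BFS: it uses one overall bound on the total number of levels rather than the per-side guidance of Eq. 3, and it omits the reverse and recover searches. The function names and the adjacency-dictionary representation are assumptions.

```python
def sparsify(adj, landmarks):
    """Remove all landmarks and their incident edges from the graph."""
    landmarks = set(landmarks)
    return {x: [y for y in nbrs if y not in landmarks]
            for x, nbrs in adj.items() if x not in landmarks}


def bounded_bidirectional_distance(adj, u, v, upper):
    """Distance between u and v in adj if it is at most `upper`, else None."""
    if u == v:
        return 0
    seen = {u: {u}, v: {v}}        # vertices reached from each side
    frontier = {u: [u], v: [v]}    # current BFS level on each side
    depth = {u: 0, v: 0}           # levels expanded so far on each side
    while frontier[u] and frontier[v] and depth[u] + depth[v] < upper:
        # Expand the side with the smaller frontier by one full level.
        side = u if len(frontier[u]) <= len(frontier[v]) else v
        other = v if side == u else u
        depth[side] += 1
        next_level = []
        for x in frontier[side]:
            for y in adj[x]:
                if y not in seen[side]:
                    seen[side].add(y)
                    next_level.append(y)
        frontier[side] = next_level
        # The two searches were disjoint before this level, so a meeting
        # vertex now means the distance is exactly depth[u] + depth[v].
        if any(y in seen[other] for y in next_level):
            return depth[u] + depth[v]
    return None


# 6-cycle 1-2-3-4-6-5-1 with landmark 2.
adj = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 6], 5: [1, 6], 6: [5, 4]}
sparse = sparsify(adj, {2})
d_sparse = bounded_bidirectional_distance(sparse, 1, 4, upper=3)
```

In this example the sparsified distance between 1 and 4 equals the length of the path through landmark 2, so a complete answer would combine the paths found by the search with those recovered from the labels. When the search returns `None`, every shortest path goes through a landmark and only the recover search is needed.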

Input: , ,
Output: A shortest path graph
1 ;
2 , , , ;
3 Enqueue to and to ;
4 , for all ;
5 , ;
6 while  do
7       ;
8       if  then
9             );
10            
11      if  then
12             );
13            
14      ; ;
15       for ;
16       if  is not empty then
17             break;
18            
19      
20if  then
21       ;
22      
23if  then
24       ;
25       for all with  do
26             ;
27             for all with , ,  do
28                   ;
29                  
30            
31      );
32      
33;
Algorithm 4 Searching on

Algorithm 4 presents our guided search algorithm. We maintain two queues and which contain the set of all vertices traversed from and , respectively. and indicate the levels of traversal being conducted in the BFSs rooted at and , respectively. Two queues and keep vertices being searched from and at the and level, respectively. Initially and are empty, and and are enqueued into and respectively. and denote the depths of all vertices in the BFSs rooted at and .

A bi-directional search is first conducted (Lines 6-15). In each iteration, the bi-directional search is guided by and as well as the relative sizes of and to decide the next step (Line 7). We choose where and . If both and satisfy this condition, or none of them satisfy this condition, then the choice of a forward search () and a backward search () is determined by the sizes of and . Accordingly, or are expanded (Line 12). The bi-directional search terminates either when reaches the upper bound or is not empty. This approach extends the Optimized Bidirectional BFS algorithm of (Hayashi et al., 2016) by incorporating bounds obtained from our sketch.

If is not empty, we have and thus start a reverse search (Lines 16-17). For each vertex , we compute the shortest paths between and and between and according to the depths of vertices in and , respectively. For example, a neighbour of in is on the shortest path between and if , and thus we find such and compute shortest paths between and in the same manner. If , we have and start a recover search (Lines 18-24). For each edge in the sketch and , we search for all vertices with and (Lines 19-23). Each is a vertex closest to landmark among all vertices on at least one shortest path between and in our previous bi-directional search. stores pairs to guide the recover searches. In the recover search (Line 24), for each edge in where , we recover the shortest paths between and according to . For each , we find shortest paths between and according to and labelling information . For example, for a neighbour of in , is on the shortest path between and if and . The shortest paths between and (resp. ) is computed according to (resp. ), but the search for parts of shortest paths that have already been found in the reversed search can be skipped. We also compute the shortest paths between relevant landmarks.

Example 4.8.

Figure 5(c)-(e) illustrates how our guided searching finds the answer for a query SPG(6,11). The sparsified graph is depicted in Figure 5(a) and the sketch is depicted in Figure 5(b). The sketch provides the upper bound , and because and , respectively. The bi-directional BFS is depicted in Figure 5(c), in which , , , and . The queues and meet at vertex , and thus . The reverse search is depicted in Figure 5(e), which goes back to and from . The recover search is depicted in Figure 5(d), which finds shortest paths going through the landmarks with and recovers shortest paths between landmarks in the sketch. The final query answer is depicted in Figure 5(f).

5. Theoretical Discussion

We prove the correctness of QbS and analyze its complexity. We also discuss how to parallelize the labelling construction process.

| Dataset | Network | Type | Vertices | Edges | Edges (undirected) | max. deg | avg. deg | avg. dist | Size |
|---|---|---|---|---|---|---|---|---|---|
| Douban (DO) | social | undirected | 0.2M | 0.3M | 0.3M | 287 | 4.2 | 5.2 | 2.5MB |
| DBLP (DB) | co-authorship | undirected | 0.3M | 1.1M | 1.1M | 343 | 6.6 | 6.8 | 8.0MB |
| Youtube (YT) | social | undirected | 1.1M | 3.0M | 3.0M | 28,754 | 5.27 | 5.3 | 23MB |
| WikiTalk (WK) | communication | directed | 2.4M | 5.0M | 4.7M | 100,029 | 3.89 | 3.9 | 36MB |
| Skitter (SK) | computer | undirected | 1.7M | 11.1M | 11.1M | 35,455 | 13.08 | 5.1 | 85MB |
| Baidu (BA) | web | directed | 2.1M | 17.8M | 17.0M | 97,848 | 15.89 | 4.1 | 130MB |
| LiveJournal (LJ) | social | directed | 4.8M | 68.5M | 43.1M | 20,334 | 17.79 | 5.5 | 329MB |
| Orkut (OR) | social | undirected | 3.1M | 117M | 117M | 33,313 | 76.28 | 4.2 | 894MB |
| Twitter (TW) | social | directed | 41.7M | 1.5B | 1.2B | 2,997,487 | 57.74 | 3.6 | 9.0GB |
| Friendster (FR) | social | undirected | 65.6M | 1.8B | 1.8B | 5,214 | 55.06 | 4.8 | 13.0GB |
| uk2007 (UK) | web | directed | 106M | 3.7B | 3.3B | 979,738 | 62.77 | 5.6 | 24.8GB |
| ClueWeb09 (CW) | computer | directed | 1.7B | 7.8B | 7.8B | 6,444,720 | 9.27 | 7.5 | 58.2GB |

Table 1. Datasets, where "Edges (undirected)" is the number of edges in a graph being treated as undirected, and "Size" denotes the size of a graph with each edge appearing in the adjacency lists and being represented by 8 bytes.

5.1. Proof of Correctness

In the following, we prove the theorem for the correctness of QbS.

Theorem 5.1.

Given any query SPG(u,v) on a graph , the answer can be computed using QbS.

Proof sketch.

We first prove that a labelling scheme constructed by Algorithm 2 satisfies Definition 4.2. Suppose that we conduct a BFS rooted from . Given a landmark , if holds, there must exist with and (Lines 8-9, 11), and accordingly an edge is added into (Lines 13-14). Otherwise, is directly pushed into (Lines 19-21). Given a vertex that is not a landmark, if holds, there must exist with and (Lines 8-9, 15), and accordingly a label is added into (Lines 16-17). Otherwise, is directly pushed into (Lines 19-21).

Now we prove that a sketch constructed by Algorithm 3 satisfies Definition 4.5. First, Algorithm 3 (Lines 2-7) finds pairs of landmarks that minimise and (i.e., satisfying Eq. (3) in Definition 4.5). Then it adds and all edges on the shortest paths between on a meta-graph into the sketch (Lines 8-12).

Finally, we prove that can be constructed by Algorithm 4. Each shortest path between and that does not go through any landmark can be constructed from using a bi-directional BFS and its reverse search (Lines 6-15 and 16-17). For each shortest path between and that goes through at least one landmark, all such landmarks must be included in and such shortest paths are computed using the recover search (Lines 18-24). ∎

5.2. Complexity Analysis

The time complexity of constructing a BFS from one landmark in Algorithm 2 is and the overall time complexity of Algorithm 2 is . The time complexity of constructing a sketch in Algorithm 3 is and can be reduced to by precomputing shortest path distances and shortest paths between landmarks on a meta-graph constructed by Algorithm 3, i.e., computation on Lines 10-12 is saved. The time complexity of conducting a guided search in Algorithm 4 is .

Note that, in our work, the number of landmarks is small, i.e., by default, which is much smaller than the number of vertices or edges in the original graph. Thus, we can see that, constructing a labelling scheme by Algorithm 2 is indeed , computing a sketch is constant time, and performing a guided search becomes where denotes the number of edges in the sparsified graph after removing edges incident to landmarks from .

5.3. Parallelization

Given a graph and a set of landmarks in , a nice property of our labelling scheme is that there is only one such labelling scheme. Formally, we prove the lemma below.

Lemma 5.2.

Let be a labelling scheme on a graph w.r.t. a set of landmarks . is deterministic.

Proof sketch.

A labelling scheme consists of a meta-graph and a path labelling . From Definition 4.1, an edge if and only if there exists at least one shortest path between and that does not go through any other landmarks in . From Definition 4.2, a label if and only if there exists at least one shortest path between and that does not go through any other landmarks in . Therefore, is deterministic w.r.t and . ∎

For a fixed set of landmarks, the labelling construction in Algorithm 2 yields the same labelling scheme, regardless of the ordering of landmarks. This deterministic nature of the labelling scheme enables us to speed up its construction by parallelizing Algorithm 2. If we use one thread for constructing labels from one landmark, then we can leverage thread-level parallelism to perform BFSs from different landmarks simultaneously.
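The order-independence described above can be illustrated with a small sketch. This is not Algorithm 2 itself: instead of the queues used there, each BFS below carries a flag recording whether some shortest path from the landmark avoids all other landmarks, which is the condition stated in the proof of Lemma 5.2. The function names, the dictionary-based output and the use of Python's `ThreadPoolExecutor` are assumptions for illustration; CPython threads do not speed up this CPU-bound work, so the code only demonstrates that the per-landmark BFSs are independent.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bfs_from_landmark(adj, landmarks, r):
    """One BFS rooted at landmark r. clean[x] is True iff some shortest
    r-x path has no landmark other than r and x on it."""
    dist, clean = {r: 0}, {r: True}
    queue = deque([r])
    while queue:
        x = queue.popleft()
        # A landmark-free path may be extended through x only if x is the root
        # or a non-landmark that is itself reached by a landmark-free path
        passable = clean[x] and (x == r or x not in landmarks)
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                clean[y] = False
                queue.append(y)
            if passable and dist[y] == dist[x] + 1:
                clean[y] = True
    labels = {(y, r): d for y, d in dist.items()
              if clean[y] and y not in landmarks}
    meta_edges = {(min(r, y), max(r, y)): d for y, d in dist.items()
                  if clean[y] and y in landmarks and y != r}
    return labels, meta_edges


def build_labelling(adj, landmarks, workers=4):
    """Run one BFS per landmark on a thread pool and merge the results."""
    landmark_set = frozenset(landmarks)
    labels, meta_edges = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: bfs_from_landmark(adj, landmark_set, r), landmarks)
        for part_labels, part_meta in results:
            labels.update(part_labels)
            meta_edges.update(part_meta)
    return labels, meta_edges


# Hypothetical path graph 0-1-2-3-4 with landmarks 1 and 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labelling_a = build_labelling(path, [1, 3])
labelling_b = build_labelling(path, [3, 1])   # same landmarks, different order
```

In this toy graph, vertex 4 receives no label from landmark 1 because its only shortest path to 1 passes through landmark 3, and both landmark orderings produce the same labels and meta-graph edge.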

6. Experiments

We evaluated our method to answer the following questions:

  • How efficiently can our proposed method answer shortest-path-graph queries, while still achieving construction time efficiency and low labelling space overhead?

  • How well can sketching help improve the performance of answering shortest-path-graph queries?

  • How does the number of landmarks affect the performance such as construction time, labelling size and query time?

6.1. Experimental Setup

We implemented our proposed methods in C++11 and compiled them using g++. We performed all experiments on a Linux server with an Intel Xeon W-2175 CPU (2.5GHz) and 512GB of main memory.

Figure 6. Distance distribution of 10,000 randomly selected pairs of vertices on all the datasets.

Datasets. We conducted experiments on 12 real-world graph datasets from various types of complex large networks, including social networks, computer networks, web networks, co-authorship networks and communication networks. Table 1 presents the details of these datasets, among which the largest one has 1.7 billion vertices and 7.8 billion edges. We treated graphs in these datasets as being undirected. All the datasets used in our experiments are publicly available from the Koblenz Network Collection (Kunegis, 2013), the Stanford Network Analysis Project (Leskovec and Krevl, 2014), the Dynamically Evolving Large-scale Information Systems Project (see http://law.di.unimi.it/datasets.php for datasets) and the Lemur Project (see https://lemurproject.org/clueweb09/index.php).

| | Construction Time (sec.) | | | | Average Query Time (ms.) | | | |
| Dataset | QbS-P | QbS | PPL | ParentPPL | QbS | PPL | ParentPPL | Bi-BFS |
|---|---|---|---|---|---|---|---|---|
| Douban | 0.05 | 0.3 | 154 | 2,736 | 0.037 | 1.414 | 0.038 | 0.585 |
| DBLP | 0.12 | 1.1 | 2,610 | 11,049 | 0.097 | 1.782 | 0.052 | 2.995 |
| Youtube | 0.47 | 4.4 | 22,601 | DNF | 0.218 | 5.314 | - | 23.809 |
| WikiTalk | 0.61 | 4.9 | 8,662 | DNF | 0.693 | 3.536 | - | 6.984 |
| Skitter | 1.51 | 12.7 | 86,326 | DNF | 0.951 | 16.978 | - | 44.685 |
| Baidu | 2.04 | 18.9 | DNF | OOE | 0.845 | - | - | 174.412 |
| LiveJournal | 6.48 | 52.2 | DNF | OOE | 1.095 | - | - | 84.967 |
| Orkut | 10.85 | 73.2 | DNF | OOE | 4.237 | - | - | 207.541 |
| Twitter | 199.8 | 1,345 | DNF | OOE | 164.333 | - | - | 4,817.774 |
| Friendster | 416.5 | 2,354 | DNF | OOE | 11.972 | - | - | 3,600.362 |
| uk2007 | 178.5 | 1,485 | OOE | OOE | 77.830 | - | - | 5,264.101 |
| ClueWeb09 | 1,819 | 17,060 | OOE | OOE | 480.443 | - | - | DNF |

Table 2. Comparison of construction time and query time. DNF and OOE refer to running out of time (>24 hours) and running out of memory, respectively.

Queries. We randomly sampled 10,000 pairs of vertices from all pairs of vertices in each graph to evaluate the average query time. Figure 6 shows the distance distribution of these 10,000 randomly sampled pairs of vertices in each graph dataset. We can see that the distances of these pairs of vertices mostly fall into the range of 2-9.
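A query workload of this kind could be generated roughly as follows. This is a minimal sketch, not the authors' evaluation harness: the sampling scheme (distinct endpoints, fixed seed), the plain BFS used to measure distances and all function names are assumptions made for illustration.

```python
import random
from collections import Counter, deque


def bfs_distance(adj, s, t):
    """Unweighted shortest-path distance from s to t, or None if unreachable."""
    if s == t:
        return 0
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                if y == t:
                    return dist[y]
                queue.append(y)
    return None


def distance_distribution(adj, num_pairs, seed=0):
    """Sample vertex pairs uniformly at random and histogram their distances."""
    rng = random.Random(seed)          # fixed seed so the workload is reproducible
    vertices = sorted(adj)
    histogram = Counter()
    for _ in range(num_pairs):
        s, t = rng.sample(vertices, 2)  # assumption: the two endpoints are distinct
        histogram[bfs_distance(adj, s, t)] += 1
    return histogram


# Hypothetical toy input: a cycle on 6 vertices, where distances are 1, 2 or 3
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
histogram = distance_distribution(cycle, num_pairs=100)
```

The resulting counter maps each distance to the number of sampled pairs at that distance, which is the quantity plotted per dataset in Figure 6.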

Baselines. We considered the following baselines:

  • Labelling-based methods. Pruned landmark labelling (PLL) is the state-of-the-art method for computing exact distance queries (Akiba et al., 2013). We thus use the methods Pruned Path Labelling (PPL) and Pruned Path Labelling with Parent information (ParentPPL) as discussed in Section 3 as our baselines.

  • Search-based methods. We use bi-directional BFS as the baseline, which searches from the two query vertices alternately (Goldberg and Harrelson, 2005). We denote it as Bi-BFS.

To evaluate the parallel speed-up of construction time, we use QbS to refer to our method with a sequential labelling construction and QbS-P to refer to our method with a parallel labelling construction, with up to 12 threads in our experiments. In PPL and ParentPPL, we use 32 bits and 8 bits to represent a landmark and a distance in their labels, respectively, and 32 bits to store each parent in ParentPPL. In QbS and QbS-P, we use *8 bits to store the label of each vertex.

Landmarks. In PPL and ParentPPL, landmarks are ordered in descending order of degrees. In QbS, we choose vertices with the largest degrees as landmarks for two reasons: (1) removing high-degree vertices sparsifies a graph much more than removing low-degree vertices; (2) computing distances from two vertices to high-degree landmarks provides a good estimation of the shortest distance between these two vertices (Potamias et al., 2009). We set in QbS by default.
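The degree-based landmark choice and the resulting sparsified graph can be sketched in a few lines. The function names and the tie-breaking rule (smaller vertex id first) are assumptions for illustration, and the number of landmarks `k` is left as a parameter.

```python
import heapq


def select_landmarks(adj, k):
    """Pick the k vertices with the largest degrees.
    Ties are broken by smaller vertex id (an assumption of this sketch)."""
    return heapq.nlargest(k, adj, key=lambda v: (len(adj[v]), -v))


def sparsify(adj, landmarks):
    """Remove the landmarks and all edges incident to them."""
    removed = set(landmarks)
    return {v: [w for w in nbrs if w not in removed]
            for v, nbrs in adj.items() if v not in removed}


# Hypothetical toy graph in which vertex 0 is a hub
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
top_two = select_landmarks(graph, 2)
sparse = sparsify(graph, select_landmarks(graph, 1))
# top_two is [0, 1]; sparse is {1: [2], 2: [1], 3: []}
```

Removing the single hub already leaves vertex 3 isolated in this toy graph, which illustrates reason (1): high-degree vertices account for a large share of the edges.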

6.2. Performance Comparison

We conducted experiments to compare construction time, labelling size and query time of our method against the baselines.

6.2.1. Construction Time

Table 2 shows that our method QbS can efficiently construct a labelling scheme on all the datasets, scaling to large networks with billions of vertices and edges. Compared with PPL and ParentPPL, QbS takes significantly less time (i.e., 2-4 orders of magnitude less) to construct labelling information. Moreover, PPL failed to construct labels for 7 out of 12 datasets and ParentPPL failed for 10 out of 12 datasets. This is because these methods need to meet the 2-hop path cover property. ParentPPL is much slower than PPL because a vertex often has more than one parent and finding all parents takes more time, though the time complexity remains unchanged. We can also see that, compared with QbS, QbS-P further reduces construction time (i.e., 6-12 times faster), leading to much better scalability than QbS.

6.2.2. Labelling Size

Table 3 presents the comparison results for the labelling sizes of QbS, PPL and ParentPPL on all the datasets. We use to denote the size of precomputed shortest path graphs between landmarks as discussed in Section 5.2. We observe that: 1) the labelling sizes of QbS are hundreds of times smaller than the labelling sizes of PPL and ParentPPL; 2) the labelling sizes of ParentPPL are about twice the labelling sizes of PPL. For dense graphs, such as Twitter, the sizes of precomputed shortest paths in QbS are larger than those in sparse graphs. This is due to the existence of many shortest paths between landmarks in dense graphs. Nonetheless, it is important to note that the sizes of precomputed shortest paths between landmarks (i.e. in Table 3) are small in QbS, compared with the sizes of labelling (i.e. in Table 3). For meta-graphs, since each meta-graph contains at most edges, the space overhead for storing edges and weights of a meta-graph is very small. Indeed, even when we have =100, the size of a meta-graph would still be smaller than 0.01MB. In summary, these results show that QbS can scale well over very large networks in terms of the labelling size.

Dataset QbS PPL ParentPPL
Douban 2.95MB 0.03MB 0.4GB 0.8GB
DBLP 6.05MB 0.03MB 1.2GB 2.4GB
Youtube 21.6MB 0.6MB 1.7GB