1. Introduction
Graphs are typical data structures used for representing complex relationships among entities, such as friendships in social networks, connections in computer networks, and links among web pages (Scott, 1988; Boccaletti et al., 2006; Ukkonen et al., 2008). Computing shortest paths between vertices is a fundamental operation in processing graph data, and has been used in many algorithms for graph analytics (Yao et al., 2013; Opsahl et al., 2010; Kolaczyk et al., 2009). These algorithms are often applied to support applications that require low latency on graphs with millions or billions of vertices and edges. Therefore, it is highly desirable – but challenging – to compute shortest paths efficiently on very large graphs.
Previously, the problem of point-to-point shortest path queries, which is to find a shortest path between two vertices in a graph, has been well studied (Goldberg and Harrelson, 2005; Goldberg et al., 2006; Bast et al., 2007; Goldberg, 2007; Wagner and Willhalm, 2007; Abraham et al., 2010; Wu et al., 2012; Sankaranarayanan et al., 2009; Sanders and Schultes, 2005). By leveraging specific properties of road networks, such as hierarchical structures and near planarity (Fu et al., 2013; Akiba et al., 2013), previous works have proposed various exact and approximate methods for answering point-to-point shortest path queries (Cowen and Wagner, 2004; Abraham et al., 2012). Nonetheless, these methods often do not perform well on complex networks (e.g., social networks and web graphs) because complex networks exhibit different properties from road networks, such as small diameter and local clustering (Goldberg and Harrelson, 2005; Fu et al., 2013; Akiba et al., 2013). Furthermore, existing methods for point-to-point shortest path queries were designed with the guarantee of finding only one shortest path, which limits their usability in practical applications.
Consider two vertices $u$ and $v$ in the scenarios depicted in Figure 1(a)-(c). They have the same distance in each scenario, so the scenarios cannot be distinguished from one another if only one shortest path is considered. However, when considering all shortest paths, the shortest paths between these two vertices exhibit considerably different structures in Figure 1(a)-(c), which can not only distinguish vertices $u$ and $v$ in different scenarios, but also empower us to make full use of such structures to analyze how they are connected. Thus, in this paper, we study the problem of finding the structure of shortest paths between vertices. Specifically, we use the notion of “shortest path graph” to represent the structure of shortest paths between two vertices, which is a subgraph containing exactly all shortest paths between these two vertices. Accordingly, we term this problem the shortest-path-graph problem (formally defined in Section 2).
Interestingly, the shortest path graph manifests itself as a basis for tackling various shortest path related problems, particularly when investigating the structure of the solution space of a combinatorial problem based on shortest paths, for example, the Shortest Path Rerouting problem (i.e., to find a rerouting sequence from one shortest path to another shortest path that only differs in one vertex) (Kamiński et al., 2011; Bonsma, 2013; Nishimura, 2018), the Shortest Path Network Interdiction problem (i.e., to find critical edges and vertices whose removal can destroy all shortest paths between two vertices) (Khachiyan et al., 2008; Israeli and Wood, 2002), and variants such as the Shortest Path Common Links problem (i.e., to find links common to all shortest paths between two vertices) (Labbé et al., 1995; Hansen et al., 1986). These shortest path related problems are motivated by a wide range of real-world applications arising in designing and analyzing networks. For example, identifying a rerouting sequence for shortest paths enables the robust design of networks with minimal cost for reconfiguration, and finding critical edges and vertices helps defend critical infrastructures against cyberattacks.
However, computing shortest path graphs is computationally expensive since it requires identifying all shortest paths, not just one, between two vertices. A straightforward solution for answering shortest-path-graph queries is to compute all shortest paths between two vertices on the fly, using Dijkstra's algorithm for weighted graphs (Dijkstra and others, 1959) or a breadth-first search (BFS) for unweighted graphs (Cormen et al., 2009). This is costly on graphs with millions or billions of vertices and edges. Another solution is to precompute all shortest paths for all pairs of vertices in a graph and then assign precomputed labels to vertices such that certain properties hold, e.g. 2-hop distance cover (Cohen et al., 2003). However, for large graphs, storing even just the shortest path distances of all pairs of vertices is prohibitive (Akiba et al., 2013), and storing all shortest paths of all pairs is hardly feasible because it demands much more space. Thus, the question we tackle in this paper is: how can we construct labels for shortest-path-graph queries that are of reasonable size (e.g. not much larger than the original graph), can be built within a reasonable time (e.g. not longer than one day), and speed up query answering as much as possible? In answering this question, we develop an efficient solution for shortest-path-graph queries. It is worth noting that: 1) we do not enumerate all shortest paths to produce a shortest path graph that contains exactly all shortest paths between two vertices; 2) our proposed solution can answer shortest-path-graph queries very efficiently, in microseconds for graphs with millions of edges and in less than half a second for graphs with billions of edges.
Contributions. In the following, we summarize the contributions of this paper with the key technical details:
(1) We observe that 2-hop distance cover is inadequate for the labelling required by shortest-path-graph queries. To alleviate this limitation and achieve high scalability, we propose a scalable method for answering shortest-path-graph queries, called Query-by-Sketch (QbS). This method consists of three phases, as illustrated in Figure 2: (a) labelling: constructing a compact, small labelling scheme using a small number of landmarks through precomputation, (b) sketching: using the labelling to efficiently compute a sketch that summarizes the important structure of shortest paths in a query answer, and (c) searching: computing shortest paths on a sparsified graph under the “guide” of the sketch. We develop efficient algorithms for these phases, and combine them effectively to handle shortest-path-graph queries on very large graphs.
(2) We theoretically prove the correctness of our method QbS. In addition, we conduct a complexity analysis for QbS by analysing the time complexities of the algorithms for constructing a labelling scheme, computing a sketch, and performing a guided search for answering queries. We also prove that our labelling scheme is deterministic w.r.t. landmarks. This enables us to leverage thread-level parallelism by performing BFSs from different landmarks simultaneously without considering an order of landmarks, which improves the efficiency of labelling construction and thus achieves better scalability.
(3) We have conducted experiments on 12 real-world datasets, among which the largest dataset ClueWeb09 has 1.7 billion vertices and 7.8 billion edges. The results show that QbS has significantly better scalability than the baseline methods. The labelling construction of QbS can be parallelized, and takes 10 seconds for datasets with millions of edges and half an hour for the largest dataset ClueWeb09. The labelling sizes constructed by QbS are generally smaller than the original sizes of graphs. Further, QbS can answer queries much faster than the other methods. For graphs with billions of edges, it takes only around 0.01 to 0.5 seconds to answer a query.
2. Preliminaries
Let $G=(V,E)$ be an unweighted graph, where $V$ and $E$ represent the sets of vertices and edges in $G$, respectively. Without loss of generality, we assume that $G$ is undirected and connected since our work can be easily extended to directed or disconnected graphs. We use $V(G')$ and $E(G')$ to refer to the sets of vertices and edges in a graph $G'$, respectively, $P_{uv}$ to refer to the set of all shortest paths between $u$ and $v$, and $d_G(u,v)$ to refer to the shortest path distance between $u$ and $v$ in $G$.
Distance labelling. Let $R \subseteq V$ be a subset of special vertices in $G$, called landmarks. For each vertex $v \in V$, the label of $v$ is a set of labelling entries $L(v)=\{(r_1,\delta_{vr_1}),\dots,(r_n,\delta_{vr_n})\}$, where $r_i \in R$ and $\delta_{vr_i}=d_G(v,r_i)$. We call $L=\{L(v)\}_{v\in V}$ a labelling over $G$. The size of a labelling $L$ is defined as size(L) = $\sum_{v\in V}|L(v)|$. Viewing each labelling entry $(r,\delta_{vr})$ as a hop from a vertex $v$ to a landmark $r$ with the distance $\delta_{vr}$, Cohen et al. (Cohen et al., 2003) proposed 2-hop distance cover, which has been widely used in labelling-based approaches for distance queries.
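Definition 2.1 (2-hop distance cover).
A labelling $L$ over a graph $G=(V,E)$ is a 2-hop distance cover iff, for any two vertices $u,v\in V$, the following holds: $d_G(u,v)=\min\{\delta_{ur}+\delta_{vr} \mid (r,\delta_{ur})\in L(u),\ (r,\delta_{vr})\in L(v)\}$.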
Informally, 2-hop distance cover requires that, for any two vertices in a graph, their labels must contain at least one common landmark that lies on one of their shortest paths.
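The definition can be read directly as a query procedure: scan the landmarks shared by two labels and keep the smallest sum of distances. The following is a minimal Python sketch of that reading, assuming labels are stored as dictionaries from landmark to distance. The function name `query_distance` and the toy labelling are illustrative and not taken from the paper.

```python
from math import inf

def query_distance(label_u, label_v):
    """Smallest delta_ur + delta_vr over landmarks r shared by both labels."""
    best = inf
    for r, d_ur in label_u.items():
        d_vr = label_v.get(r)
        if d_vr is not None:
            best = min(best, d_ur + d_vr)
    return best

# Hypothetical toy path graph a - b - c.
# Every vertex stores itself at distance 0; a and c also store b.
labels = {
    "a": {"a": 0, "b": 1},
    "b": {"b": 0},
    "c": {"b": 1, "c": 0},
}

print(query_distance(labels["a"], labels["c"]))  # 2, via the common landmark b
```

If two labels share no landmark, the sketch returns infinity, which signals that the labelling does not cover that pair.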
Shortest-path-graph problem. In this work, we study shortest-path-graph queries. We first define the notion of shortest path graph.
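Definition 2.2 (Shortest path graph).
Given any two vertices $u$ and $v$ in a graph $G$, the shortest path graph (SPG) between $u$ and $v$ is a subgraph $G_{uv}$ of $G$, where (1) $V(G_{uv})=\bigcup_{p\in P_{uv}}V(p)$ and (2) $E(G_{uv})=\bigcup_{p\in P_{uv}}E(p)$.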
A shortest path graph $G_{uv}$ is different from the induced subgraph $G[V']$ where $V'=V(G_{uv})$. Every edge in $G_{uv}$ must lie on at least one shortest path between $u$ and $v$, whereas $G[V']$ may contain edges that do not lie on any shortest path between $u$ and $v$.
Definition 2.3 (Shortest-path-graph problem).
Let $G=(V,E)$ and $u,v\in V$. Then the shortest-path-graph problem is, given a query SPG(u,v), to find the shortest path graph $G_{uv}$ over $G$.
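To make the definition concrete, the following Python sketch computes a shortest path graph on a small unweighted graph with two plain BFSs (the straightforward online approach, not the method proposed in this paper). It assumes an adjacency-list dictionary. The function names and the example graph are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Plain BFS distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def shortest_path_graph(adj, u, v):
    """Return (vertices, edges) of the shortest path graph between u and v.

    Edges are reported as (x, y) oriented away from u.
    """
    du, dv = bfs_distances(adj, u), bfs_distances(adj, v)
    if v not in du:
        return set(), set()
    d = du[v]
    # A vertex lies on some shortest path iff its two distances add up to d.
    vertices = {x for x in du if x in dv and du[x] + dv[x] == d}
    # An edge lies on some shortest path iff it advances one level from u.
    edges = {(x, y) for x in vertices for y in adj[x]
             if y in vertices and du[y] == du[x] + 1}
    return vertices, edges

# Hypothetical 4-cycle 0-1-3-2-0 with an extra chord 1-2.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(shortest_path_graph(adj, 0, 3))
```

In this example the chord (1, 2) joins two vertices of the answer but lies on no shortest path between 0 and 3, so it belongs to the induced subgraph but not to the shortest path graph.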
3. Shortest Path Labelling
In this section, we discuss several labelling-based methods for the shortest-path-graph problem. The purpose is to discuss their limitations and possible sources of difficulties.
3.1. 2-Hop Path Cover
Originally, 2-hop distance cover was proposed for reachability and distance queries (Cohen et al., 2003). Below, we discuss why it is insufficient for shortest-path-graph queries.
Example 3.1 ().
Consider a query on a graph depicted in Figure LABEL:fig:new_ex (a). The query answer is colored in green. In Figure LABEL:fig:new_ex(b), labels of a 2hop distance cover over are colored in black. Starting from vertices and , we can find vertex because and , . Then, we have to stop since the label of vertex does not contain entries to other vertices. Thus, using the labels of the 2hop distance cover can compute only one shortest path between and , failing to find vertices , and in the answer.
Finding a shortest path graph that contains exactly all shortest paths between two vertices requires us to accurately encode every shortest path between two vertices into labels. Thus, to answer shortest-path-graph queries, we generalize 2-hop distance cover to a property called 2-hop path cover.
3.2. Path Labelling Methods
To answer shortest-path-graph queries, a naive labelling-based method is, for each vertex $v\in V$, to conduct a breadth-first search (BFS) from $v$ and store the distances between $v$ and all other vertices in the label of $v$, i.e. $L(v)=\{(u,d_G(u,v)) \mid u\in V\}$, which is a 2-hop path labelling. Although shortest-path-graph queries can be answered using $L$, this is inefficient, particularly when a graph is large. The time and space complexity of constructing such labels are $O(|V||E|)$ and $O(|V|^2)$ respectively. Answering one shortest-path-graph query is also costly in the worst case. A question that naturally arises is: can we follow the idea of Pruned Landmark Labelling (PLL) (Akiba et al., 2013), which has been shown to be successful for distance queries, to develop a pruning strategy for shortest-path-graph queries that improves efficiency? We thus introduce two pruned path labelling methods for shortest-path-graph queries in the following.
Pruned path labelling. Inspired by Pruned Landmark Labelling (PLL) (Akiba et al., 2013), we conduct pruning during the breadth-first searches, i.e. pruned BFSs, for shortest-path-graph queries. We abbreviate this pruned path labelling method as PPL.
PPL works as follows. Given a predefined landmark order $[v_1,\dots,v_{|V|}]$ over all vertices in $V$, we conduct a pruned BFS from each vertex one by one as described in Algorithm 1. In each pruned BFS rooted at $v_k$, we use $depth[u]$ to denote the distance between $v_k$ and $u$. Further, $L_{k-1}$ refers to the labels that have been constructed through the previous pruned BFSs from vertices $[v_1,\dots,v_{k-1}]$, and $d_{L_{k-1}}(v_k,u)$ denotes the distance between $v_k$ and $u$ being queried using labels in $L_{k-1}$. When $d_{L_{k-1}}(v_k,u) < depth[u]$, the label $(v_k,depth[u])$ is pruned (Lines 6-7) because labels in $L_{k-1}$ have already covered the shortest paths between $v_k$ and $u$. In other words, $(v_k,depth[u])$ is only added into the labels of vertices $u$ when $d_{L_{k-1}}(v_k,u) \geq depth[u]$ (Line 8). Note that, unlike PLL, in the case of $d_{L_{k-1}}(v_k,u) = depth[u]$, the label $(v_k,depth[u])$ cannot be pruned in PPL; otherwise, 2-hop path cover is not guaranteed, i.e., not all shortest paths are covered by labels. When $d_{L_{k-1}}(v_k,u) \leq depth[u]$, no further edges are traversed from $u$ because paths in this expansion have already been covered by labels in $L_{k-1}$ (Lines 6-7 and 9-10).
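Algorithm 1 is not reproduced in this text, so the following Python sketch is only one reading of the prose above: a label is dropped when the existing labels already give a strictly shorter distance, kept without further expansion when they give an equal distance, and kept with expansion otherwise. The names `query` and `pruned_path_labelling` and the example graph are assumptions made for illustration.

```python
from collections import deque
from math import inf

def query(labels, s, t):
    """Distance between s and t using only the labels built so far."""
    best = inf
    for r, d in labels[s].items():
        if r in labels[t]:
            best = min(best, d + labels[t][r])
    return best

def pruned_path_labelling(adj, order):
    """Pruned BFS from each vertex in the given landmark order."""
    labels = {v: {} for v in adj}
    for root in order:
        depth = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            covered = query(labels, root, u)
            if covered < depth[u]:
                continue                 # strictly shorter: prune the label
            labels[u][root] = depth[u]   # equal or longer: keep the label
            if covered == depth[u]:
                continue                 # already covered: do not expand
            for w in adj[u]:
                if w not in depth:
                    depth[w] = depth[u] + 1
                    queue.append(w)
    return labels

# Hypothetical 4-cycle 0-1-2-3-0: two shortest paths between 0 and 2.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
cycle_labels = pruned_path_labelling(cycle, [0, 1, 2, 3])
print(query(cycle_labels, 0, 2))
```

On this cycle, both intermediate vertices 1 and 3 end up in the labels of 0 and 2, so both shortest paths between 0 and 2 are represented. Pruning on equality as in PLL would lose one of them.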
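To answer a query SPG(u,v), we need to compute the vertices and edges of $G_{uv}$ from a pruned path labelling $L$ recursively. Assume that $d_G(u,v) > 1$; otherwise we finish with $G_{uv}$ containing only one edge $(u,v)$. We begin with SPG(u,v). We find the common landmarks in their labels that are on the shortest paths, e.g., computing a set $Z=\{r \mid (r,\delta_{ur})\in L(u),\ (r,\delta_{vr})\in L(v),\ \delta_{ur}+\delta_{vr}=d_G(u,v)\}$. Then we query the shortest paths between $u$, $v$ and these common landmarks, i.e., SPG(u,r) and SPG(r,v) for each $r\in Z$. The query SPG(u,v) is computed by combining the shortest paths between $u$, $v$ and the landmarks, i.e., $G_{uv}=\bigcup_{r\in Z}(G_{ur}\cup G_{rv})$.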
Example 3.2 ().
When using PPL to answer the query on the graph in Figure LABEL:fig:new_ex(a), we start with and obtain . This leads to four new queries and . The distance between and is 1. Thus, . For the new query , we obtain , leading to another queries and . Similarly, for and we obtain queries , , and . Note that the labels of vertex are visited more than once, i.e. when querying and . Further, because and have multiple shortest paths between them, more than one common vertex on their shortest paths are found from their labels, i.e. . As a result, edges and are handled multiple times, i.e., when querying and .
PPL has the same time and space complexity for constructing labels as the naive labelling-based method. However, due to pruning in BFSs, PPL can construct labels more efficiently with a significantly reduced labelling size. Nonetheless, the query time of PPL is still slow because all shortest paths between two vertices can only be found through searching vertices and edges using labels in a recursive manner. When more than one shortest path exists between query vertices, labels of some vertices are searched repeatedly and edges are found repeatedly, leading to unnecessary computational cost, as with the vertex and edges revisited in Example 3.2.
Path labelling with parents. One common technique to accelerate query time for shortest-path-graph queries is to keep additional parent information in labels so as to provide a clearer direction towards shortest paths. For example, Akiba et al. (Akiba et al., 2013) extended the label of each vertex $v$ to a set of triples $(r,\delta_{vr},p_{vr})$ where $p_{vr}$ is the “parent” vertex of $v$ on a shortest path from $v$ to $r$. To find all shortest paths, this requires us to store all parent vertices of a vertex, rather than just one parent vertex as in the previous work for finding one shortest path. To be precise, we store a set of triples $(r,\delta_{vr},W_{vr})$ where $W_{vr}$ is the set of “parent” vertices of $v$ on the shortest paths from $v$ to a landmark $r$. To reduce space overhead, for each of such shortest paths, we store the “parent” vertices of $v$, rather than the “child” vertices of $r$, because landmarks often have a high degree (Akiba et al., 2013). To distinguish it from PPL, we abbreviate this method with additional parent information as ParentPPL.
The time complexity of ParentPPL for constructing labels remains $O(|V||E|)$, but the space complexity becomes $O(|V||E|)$. In practice, additional parent information only helps speed up query time on small graphs. Even for a graph with millions of vertices and edges, ParentPPL would run out of time (same as PPL) or space, failing to construct labels. We will discuss this further in Section 6.
3.3. Discussion
For 2-hop labelling-based methods such as PPL and ParentPPL, the structure (i.e. shortest paths) of a graph is encoded into the distance information of labels under the guarantee of 2-hop path cover. Although shortest paths can be recovered through computing distances between pairs of vertices, these methods are inefficient. This is because they recursively split each path into two sub-paths and compute vertices on sub-paths via distance information in labels, which leads to redundant or unnecessary searches. Although storing parent information can often accelerate query time, it makes the labelling size larger and does not scale over large networks. Therefore, we need to find a method for which (1) the labelling size is small, (2) the structure of shortest paths can be recovered in an efficient way, i.e., reducing redundant and unnecessary computation, and (3) it can scale over large networks.
4. Query-by-Sketch
In this section, we present an efficient and scalable method for solving the shortest-path-graph problem, called Query-by-Sketch (QbS). Conceptually, this method consists of three key components: labelling, sketching and searching, which will be discussed in Sections 4.1, 4.2 and 4.3, respectively. The main idea behind this method is to construct a labelling scheme through precomputation, and then answer shortest-path-graph queries by performing online computation that involves two steps: fast sketching and guided searching.
4.1. Labelling Scheme
Let $G=(V,E)$ be a graph, $R\subseteq V$ be a set of landmarks, and $|R| \ll |V|$ (i.e., $|R|$ is sufficiently smaller than $|V|$). We first preprocess the graph $G$ to obtain a compact representation of the shortest paths among landmarks, called a meta-graph of $G$. Then, based on such a meta-graph, we define a labelling scheme to assign a label to each vertex in $V\setminus R$ such that, given any pair of vertices $u,v\in V\setminus R$, we can efficiently compute a sketch for answering SPG(u,v).
Definition 4.1 (Meta-graph).
A meta-graph is $M=(R,E_R,\sigma)$ where $R\subseteq V$ is a set of landmarks, $E_R\subseteq R\times R$ is a set of edges s.t. $(r,r')\in E_R$ iff at least one shortest path between $r$ and $r'$ does not go through any other landmarks, and $\sigma: E_R\to\mathbb{N}$ assigns each edge in $E_R$ a weight, i.e. $\sigma(r,r')=d_G(r,r')$.
Conceptually, a meta-graph represents how landmarks are connected through their shortest paths in a graph $G$.
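Definition 4.2 (Labelling scheme).
A labelling scheme $\mathcal{L}=(M,L)$ consists of a meta-graph $M$ and a path labelling $L$ that assigns to each vertex $v\in V\setminus R$ a label $L(v)$ s.t.
(1) $L(v)=\{(r,\delta_{vr}) \mid r\in R,\ \delta_{vr}=d_G(v,r),\ \exists p\in P_{vr}: V(p)\cap R=\{r\}\}$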
Note that, to accurately represent how vertices are linked to landmarks, we only allow $(r,\delta_{vr})$ to be in the label $L(v)$ iff there exists at least one shortest path between $v$ and $r$ that does not contain other landmarks.
Example 4.3 ().
Figure 3 depicts a graph (a) and the metagraph (b) and the path labelling (c) of this graph. The edge in the metagraph is assigned with a weight , i.e. , since there is one shortest path between and which goes through . The label of in the path labelling contains and . The labelling entry is not included in the label of because every shortest path between and goes through another landmark, i.e. or .
Algorithm 2 describes the pseudo-code of our algorithm for constructing a labelling scheme. Given a graph $G$ and a set of landmarks $R$, we conduct a BFS from each landmark $r\in R$. We use two queues $Q_{label}$ and $Q_{prune}$ to keep track of visited vertices, which respectively need to be labelled and not to be labelled. All vertices, except for $r$, are initialized as being unvisited (Line 5). For each vertex $u\in Q_{label}$ at the $n$-th level of the BFS, we mark its unvisited neighbours $v$ as visited (Line 10). If $v$ is a landmark, we push $v$ into $Q_{prune}$, add an edge $(r,v)$ into $E_R$, and store the distance between $r$ and $v$ for this edge in $\sigma$. Otherwise, we push $v$ into $Q_{label}$ and add a label $(r,\delta_{vr})$ into $L(v)$ (Lines 11-17). Then, we check the unvisited neighbours of each vertex in $Q_{prune}$ at the $n$-th level, and push them into $Q_{prune}$ without adding a label in $L$ or an edge in $E_R$ (Lines 18-21). This process is conducted level by level on the BFS (Line 22).
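The construction described above can be sketched in Python as follows. This is a minimal illustration written from the prose, not the authors' Algorithm 2 itself: it assumes an adjacency-list dictionary, represents each BFS level as a pair of lists (vertices to label and vertices to prune), and stores meta-graph edges in a dictionary keyed by unordered landmark pairs. All names and the example graph are illustrative.

```python
def build_labelling_scheme(adj, landmarks):
    """Return (meta_edges, labels) for the given landmark set.

    meta_edges maps frozenset({r, r2}) to the landmark-to-landmark distance;
    labels maps each non-landmark vertex to {landmark: distance}.
    """
    landmark_set = set(landmarks)
    labels = {v: {} for v in adj if v not in landmark_set}
    meta_edges = {}
    for r in landmarks:
        visited = {r}
        to_label, to_prune = [r], []   # current BFS level, split in two
        depth = 0
        while to_label or to_prune:
            depth += 1
            next_label, next_prune = [], []
            # Expand labelled vertices first, so that a vertex with at least
            # one landmark-free shortest path to r is labelled.
            for u in to_label:
                for v in adj[u]:
                    if v in visited:
                        continue
                    visited.add(v)
                    if v in landmark_set:
                        meta_edges[frozenset((r, v))] = depth
                        next_prune.append(v)
                    else:
                        labels[v][r] = depth
                        next_label.append(v)
            # Vertices reached only through other landmarks get no label.
            for u in to_prune:
                for v in adj[u]:
                    if v not in visited:
                        visited.add(v)
                        next_prune.append(v)
            to_label, to_prune = next_label, next_prune
    return meta_edges, labels

# Hypothetical path graph 1-2-3-4-5 with landmarks 2 and 4.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
meta_edges, labels = build_labelling_scheme(adj, [2, 4])
print(meta_edges, labels)
```

In this example vertex 5 receives no entry for landmark 2, because every shortest path between them passes through landmark 4. That connection is instead captured by the meta-graph edge between 2 and 4.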
Example 4.4 ().
Figure 4 shows how our algorithm conducts BFSs to construct labels. The BFS from landmark is depicted in Figure 4(a), in which vertices are labelled because the other vertices are either landmarks or have landmarks in all their shortest paths to landmark . We add edges and into the metagraph. In the BFS from landmark in Figure 4(b), vertices are labelled because the shortest paths between and vertices in all go through landmark or . The BFS from landmark is depicted in Figure 4(c), which works in a similar manner.
4.2. Fast Sketching
Let $\mathcal{L}=(M,L)$ be a labelling scheme on a graph $G$. For a given query SPG(u,v), we proceed to answer it in two steps: (1) efficiently computing a sketch for the two vertices $u$ and $v$ from the labelling scheme $\mathcal{L}$; (2) computing the exact answer by conducting a guided search based on the sketch for $u$ and $v$. Hence, the purpose of such a sketch is to provide an efficient and principled way of searching for the answer of SPG(u,v), which is particularly important on very large networks.
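Definition 4.5 (Sketch).
A sketch for SPG(u,v) on $\mathcal{L}$ is $S_{uv}=(V_S,E_S,\sigma_S)$ where $V_S\subseteq R\cup\{u,v\}$ is a set of vertices, $E_S$ is a set of edges, and $\sigma_S: E_S\to\mathbb{N}$ with $\sigma_S(x,y)=d_G(x,y)$, satisfying the condition that $E_S$ contains only edges lying on the paths between $u$ and $v$ with the minimal length $d^{\top}_{uv}$ as defined below:
(2) $d^{\top}_{uv}=\min\{\delta_{ur}+d_M(r,r')+\delta_{r'v} \mid (r,\delta_{ur})\in L(u),\ (r',\delta_{r'v})\in L(v)\}$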
Accordingly, we have the following corollary.
Corollary 4.6.
$d_G(u,v)\leq d^{\top}_{uv}$ holds.
Algorithm 3 describes how to construct a sketch. Let $u$ and $v$ be a pair of vertices. We start with $V_S=\{u,v\}$ and $E_S=\emptyset$. Then, for each pair of landmarks $(r,r')$, we compute the minimum length of paths between $u$ and $v$ that go through $r$ and $r'$ using the labels in $L$ and the meta-graph $M$ (Lines 2-5). After that, we obtain the minimum length of paths between $u$ and $v$ that go through at least one landmark, i.e., $d^{\top}_{uv}$ (Line 6), and add the edges in these paths into $E_S$ and the vertices in these paths into $V_S$, with the corresponding distances associated with the edges (Lines 7-13).
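As an illustration of this procedure, the following Python sketch computes the minimal length through landmarks and the sketch edges from a path labelling and a meta-graph. It is a simplified reading of the description, under stated assumptions: landmark-to-landmark distances on the meta-graph are obtained with Floyd-Warshall (a choice made here for brevity), labels are dictionaries from landmark to distance, and meta-graph edges are keyed by ordered pairs. All function names and the example data are illustrative.

```python
from itertools import product
from math import inf

def meta_distances(landmarks, meta_edges):
    """All-pairs shortest distances on the meta-graph (Floyd-Warshall)."""
    d = {(a, b): (0 if a == b else inf) for a in landmarks for b in landmarks}
    for (a, b), w in meta_edges.items():
        d[a, b] = d[b, a] = min(d[a, b], w)
    for k, i, j in product(landmarks, repeat=3):   # k is the outermost loop
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d

def compute_sketch(u, v, labels, landmarks, meta_edges):
    """Return (minimal length through landmarks, weighted sketch edges)."""
    d_m = meta_distances(landmarks, meta_edges)
    best, pairs = inf, []
    for (r, d_ur), (rp, d_vr) in product(labels[u].items(), labels[v].items()):
        length = d_ur + d_m[r, rp] + d_vr
        if length == inf:
            continue
        if length < best:
            best, pairs = length, [(r, rp)]
        elif length == best:
            pairs.append((r, rp))
    edges = {}
    for r, rp in pairs:
        edges[u, r] = labels[u][r]
        edges[rp, v] = labels[v][rp]
        # Keep meta-graph edges lying on a shortest meta-path from r to rp.
        for (a, b), w in meta_edges.items():
            for x, y in ((a, b), (b, a)):
                if d_m[r, x] + w + d_m[y, rp] == d_m[r, rp]:
                    edges[x, y] = w
    return best, edges

# Hypothetical path graph 1-2-3-4-5 with landmarks 2 and 4.
landmarks = [2, 4]
meta_edges = {(2, 4): 2}
labels = {1: {2: 1}, 3: {2: 1, 4: 1}, 5: {4: 1}}
print(compute_sketch(1, 5, labels, landmarks, meta_edges))
```

For the query between vertices 1 and 5, the minimal length through landmarks is 4, which here coincides with the true distance, and the sketch consists of the edges (1, 2), (2, 4) and (4, 5).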
Example 4.7 ().
Figure 5(b) shows the sketch between two vertices and . The sketch has the edges , , , and because we have the following shortest paths between and with and . We thus have , and .
4.3. Guided Searching
Guided by the sketch $S_{uv}$, we conduct a search to compute the exact answer of SPG(u,v), based on the following observations:

(i) Such a search can be conducted on a sparsified graph $G^-$, obtained by removing all landmarks in $R$ and all edges incident to these landmarks from $G$. $d_{G^-}(u,v)$ may potentially be greater than $d_G(u,v)$; however, the number of search steps in this sparsified graph can be upper bounded by $d^{\top}_{uv}$ due to the fact that $d_G(u,v)\leq d^{\top}_{uv}$.

(ii) $S_{uv}$ can guide how to conduct a bidirectional search on the sparsified graph $G^-$. Specifically, for $t\in\{u,v\}$, we have
(3) $d^{*}_{t}=\max\{\sigma_S(t,r) \mid (t,r)\in E_S\}-1$
which suggests the number of search steps from the $u$ and $v$ sides, respectively. Here, we subtract 1 because $r$ can be found via the labels of vertices in at most $\sigma_S(t,r)-1$ steps.
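Given a query SPG(u,v) on a graph $G$, the answer $G_{uv}$ can thus be computed by searching over the sparsified graph $G^-$ and the labelling scheme $\mathcal{L}$, guided by the sketch $S_{uv}$, as follows:
(4) $G_{uv}=\begin{cases} G^{-}_{uv} & \text{if } d_{G^-}(u,v) < d^{\top}_{uv} \\ G^{L}_{uv} & \text{if } d_{G^-}(u,v) > d^{\top}_{uv} \\ G^{-}_{uv}\cup G^{L}_{uv} & \text{if } d_{G^-}(u,v) = d^{\top}_{uv} \end{cases}$
We use $G^{-}_{uv}$ to refer to the shortest path graph between $u$ and $v$ in $G^-$, and $G^{L}_{uv}$ to refer to the shortest paths between $u$ and $v$ that go through at least one landmark in $R$.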
Generally, a guided search has three stages: (1) Bidirectional search, which has a forward search from the $u$ side and a backward search from the $v$ side (Goldberg and Harrelson, 2005), under the guide of $S_{uv}$ w.r.t. Eq. 3. This search terminates when common vertices are found or the upper bound $d^{\top}_{uv}$ is reached. (2) Reverse search, which reverses the previous bidirectional search back to $u$ and $v$ in order to compute shortest paths in $G^{-}_{uv}$. (3) Recover search, which recovers the relevant labelling information under the guide of $S_{uv}$ in order to compute shortest paths in $G^{L}_{uv}$. As we do not know initially which of the three cases of Eq. 4 holds, a bidirectional search is always performed. This search provides us with $d_{G^-}(u,v)$, though we abort once $d_{G^-}(u,v) > d^{\top}_{uv}$ can be guaranteed. Then, depending on the values of $d_{G^-}(u,v)$ and $d^{\top}_{uv}$, a reverse search, a recover search, or both of them are performed to compute $G^{-}_{uv}$ and $G^{L}_{uv}$ as in Eq. 4.
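The decision made after the bidirectional search can be summarised in a few lines of Python. This is only a sketch of the case analysis as read from the text, assuming `d_sparse` is the distance found in the sparsified graph (infinity if the search was aborted or found no path) and `d_sketch` is the minimal length through landmarks given by the sketch. Both parameter names and the function name are illustrative.

```python
from math import inf

def searches_needed(d_sparse, d_sketch):
    """Which follow-up searches contribute shortest paths to the answer."""
    if d_sparse < d_sketch:
        return {"reverse"}             # all shortest paths avoid landmarks
    if d_sparse > d_sketch:
        return {"recover"}             # all shortest paths use a landmark
    return {"reverse", "recover"}      # both kinds of shortest paths exist

print(searches_needed(3, 4), searches_needed(inf, 4), searches_needed(4, 4))
```

The tie case is the only one in which both partial results must be merged into the final answer.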
Algorithm 4 presents our guided search algorithm. We maintain two queues and which contain the set of all vertices traversed from and , respectively. and indicate the levels of traversal being conducted in the BFSs rooted at and , respectively. Two queues and keep vertices being searched from and at the and level, respectively. Initially and are empty, and and are enqueued into and respectively. and denote the depths of all vertices in the BFSs rooted at and .
A bidirectional search is first conducted (Lines 615). In each iteration, the bidirectional search is guided by and as well as the relative sizes of and to decide the next step (Line 7). We choose where and . If both and satisfy this condition, or none of them satisfy this condition, then the choice of a forward search () and a backward search () is determined by the sizes of and . Accordingly, or are expanded (Line 12). The bidirectional search terminates either when reaches the upper bound or is not empty. This approach extends the Optimized Bidirectional BFS algorithm of (Hayashi et al., 2016) by incorporating bounds obtained from our sketch.
If is not empty, we have and thus start a reverse search (Lines 1617). For each vertex , we compute the shortest paths between and and between and according to the depths of vertices in and , respectively. For example, a neighbour of in is on the shortest path between and if , and thus we find such and compute shortest paths between and in the same manner. If , we have and start a recover search (Lines 1824). For each edge in the sketch and , we search for all vertices with and (Lines 1923). Each is a vertex closest to landmark among all vertices on at least one shortest path between and in our previous bidirectional search. stores pairs to guide the recover searches. In the recover search (Line 24), for each edge in where , we recover the shortest paths between and according to . For each , we find shortest paths between and according to and labelling information . For example, for a neighbour of in , is on the shortest path between and if and . The shortest paths between and (resp. ) is computed according to (resp. ), but the search for parts of shortest paths that have already been found in the reversed search can be skipped. We also compute the shortest paths between relevant landmarks.
Example 4.8 ().
Figure 5(c)(e) illustrates how our guided searching finds the answer for a query SPG(6,11). The sparsified graph is depicted in Figure 5(a) and the sketch is depicted in Figure 5(b). The sketch provides the upper bound , and because and , respectively. The bidirectional BFS is depicted in Figure 5(c), in which , , , and . The queues and meet at vertex , and thus . The reverse search is depicted in Figure 5(e), which goes back to and from . The recover search is depicted in Figure 5(d), which finds shortest paths going through the landmarks with and recovers shortest paths between landmarks in the sketch. The final query answer is depicted in Figure 5(f).
5. Theoretical Discussion
We prove the correctness of QbS and analyze its complexity. We also discuss how to parallelize the labelling construction process.
Table 1. Details of the datasets.
Dataset           Network        Type        $|V|$   $|E|$   $|E^{un}|$  max. deg   avg. deg  avg. dist  $|G|$
Douban (DO)       social         undirected  0.2M    0.3M    0.3M        287        4.2       5.2        2.5MB
DBLP (DB)         co-authorship  undirected  0.3M    1.1M    1.1M        343        6.6       6.8        8.0MB
Youtube (YT)      social         undirected  1.1M    3.0M    3.0M        28,754     5.27      5.3        23MB
WikiTalk (WK)     communication  directed    2.4M    5.0M    4.7M        100,029    3.89      3.9        36MB
Skitter (SK)      computer       undirected  1.7M    11.1M   11.1M       35,455     13.08     5.1        85MB
Baidu (BA)        web            directed    2.1M    17.8M   17.0M       97,848     15.89     4.1        130MB
LiveJournal (LJ)  social         directed    4.8M    68.5M   43.1M       20,334     17.79     5.5        329MB
Orkut (OR)        social         undirected  3.1M    117M    117M        33,313     76.28     4.2        894MB
Twitter (TW)      social         directed    41.7M   1.5B    1.2B        2,997,487  57.74     3.6        9.0GB
Friendster (FR)   social         undirected  65.6M   1.8B    1.8B        5,214      55.06     4.8        13.0GB
uk2007 (UK)       web            directed    106M    3.7B    3.3B        979,738    62.77     5.6        24.8GB
ClueWeb09 (CW)    computer       directed    1.7B    7.8B    7.8B        6,444,720  9.27      7.5        58.2GB
5.1. Proof of Correctness
In the following, we prove the theorem for the correctness of QbS.
Theorem 5.1.
Given any query SPG(u,v) on a graph $G$, the answer $G_{uv}$ can be computed using QbS.
Proof sketch.
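We first prove that a labelling scheme $\mathcal{L}=(M,L)$ constructed by Algorithm 2 satisfies Definition 4.2. Suppose that we conduct a BFS rooted at a landmark $r$, using a queue $Q_{label}$ of vertices to be labelled and a queue $Q_{prune}$ of vertices not to be labelled. Given a landmark $r'\neq r$, if at least one shortest path between $r$ and $r'$ does not go through any other landmark, there must exist a vertex $w\in Q_{label}$ with $(w,r')\in E$ and $d_G(r,w)+1=d_G(r,r')$ (Lines 8-9, 11), and accordingly an edge $(r,r')$ is added into $E_R$ (Lines 13-14). Otherwise, $r'$ is directly pushed into $Q_{prune}$ (Lines 19-21). Given a vertex $v$ that is not a landmark, if at least one shortest path between $r$ and $v$ does not go through any other landmark, there must exist a vertex $w\in Q_{label}$ with $(w,v)\in E$ and $d_G(r,w)+1=d_G(r,v)$ (Lines 8-9, 15), and accordingly a label $(r,\delta_{vr})$ is added into $L(v)$ (Lines 16-17). Otherwise, $v$ is directly pushed into $Q_{prune}$ (Lines 19-21).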
Now we prove that a sketch $S_{uv}$ constructed by Algorithm 3 satisfies Definition 4.5. First, Algorithm 3 (Lines 2-7) finds the pairs of landmarks $(r,r')$ that attain the minimal length $d^{\top}_{uv}$ (i.e., satisfying Eq. (2) in Definition 4.5). Then it adds $(u,r)$, $(r',v)$ and all edges on the shortest paths between $r$ and $r'$ on the meta-graph into the sketch (Lines 8-12).
Finally, we prove that $G_{uv}$ can be constructed by Algorithm 4. Each shortest path between $u$ and $v$ that does not go through any landmark can be constructed from $G^-$ using a bidirectional BFS and its reverse search (Lines 6-15 and 16-17). For each shortest path between $u$ and $v$ that goes through at least one landmark, all such landmarks must be included in $S_{uv}$, and such shortest paths are computed using the recover search (Lines 18-24). ∎
5.2. Complexity Analysis
The time complexity of constructing a BFS from one landmark in Algorithm 2 is and the overall time complexity of Algorithm 2 is . The time complexity of constructing a sketch in Algorithm 3 is and can be reduced to by precomputing shortest path distances and shortest paths between landmarks on a metagraph constructed by Algorithm 3, i.e., computation on Lines 1012 is saved. The time complexity of conducting a guided search in Algorithm 4 is .
Note that, in our work, the number of landmarks is small, i.e., by default, which is much smaller than the number of vertices or edges in the original graph. Thus, we can see that, constructing a labelling scheme by Algorithm 2 is indeed , computing a sketch is constant time, and performing a guided search becomes where denotes the number of edges in the sparsified graph after removing edges incident to landmarks from .
5.3. Parallelization
Given a graph $G$ and a set of landmarks $R$ in $G$, a nice property of our labelling scheme is that there is only one such labelling scheme. Formally, we prove the lemma below.
Lemma 5.2.
Let $\mathcal{L}=(M,L)$ be a labelling scheme on a graph $G$ w.r.t. a set of landmarks $R$. $\mathcal{L}$ is deterministic.
Proof sketch.
A labelling scheme $\mathcal{L}=(M,L)$ consists of a meta-graph $M$ and a path labelling $L$. From Definition 4.1, an edge $(r,r')\in E_R$ if and only if there exists at least one shortest path between $r$ and $r'$ that does not go through any other landmarks in $R$. From Definition 4.2, a label $(r,\delta_{vr})\in L(v)$ if and only if there exists at least one shortest path between $v$ and $r$ that does not go through any other landmarks in $R$. Therefore, $\mathcal{L}$ is deterministic w.r.t. $G$ and $R$. ∎
For a fixed set of landmarks, the labelling construction in Algorithm 2 yields the same labelling scheme, regardless of the ordering of landmarks. This deterministic nature of the labelling scheme enables us to speed up its construction by parallelizing Algorithm 2. If we use one thread for constructing labels from one landmark, then we can leverage thread-level parallelism to perform BFSs from different landmarks simultaneously.
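The independence of the per-landmark BFSs can be sketched in Python as below. This is an illustration only, assuming an adjacency-list dictionary that is read but never modified by the workers. It uses `concurrent.futures.ThreadPoolExecutor` to show the structure of the computation rather than to obtain a speed-up, since CPython threads do not run CPU-bound code in parallel. All function names and the example graph are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def bfs_from_landmark(adj, landmark_set, r):
    """One landmark's BFS: returns (r, label entries, meta-graph entries)."""
    visited = {r}
    to_label, to_prune = [r], []
    entries, meta = {}, {}
    depth = 0
    while to_label or to_prune:
        depth += 1
        next_label, next_prune = [], []
        for u in to_label:
            for v in adj[u]:
                if v in visited:
                    continue
                visited.add(v)
                if v in landmark_set:
                    meta[v] = depth          # landmark reached landmark-free
                    next_prune.append(v)
                else:
                    entries[v] = depth       # non-landmark gets a label
                    next_label.append(v)
        for u in to_prune:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_prune.append(v)
        to_label, to_prune = next_label, next_prune
    return r, entries, meta

def build_parallel(adj, landmarks, workers=4):
    """Run one BFS task per landmark and merge the independent results."""
    landmark_set = frozenset(landmarks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda r: bfs_from_landmark(adj, landmark_set, r), landmarks))
    labels = {v: {} for v in adj if v not in landmark_set}
    meta_edges = {}
    for r, entries, meta in results:
        for v, d in entries.items():
            labels[v][r] = d
        for r2, d in meta.items():
            meta_edges[frozenset((r, r2))] = d
    return meta_edges, labels

# Hypothetical path graph 1-2-3-4-5 with landmarks 2 and 4.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(build_parallel(adj, [2, 4]) == build_parallel(adj, [4, 2]))
```

Because each task writes only to its own result and the merge step combines disjoint entries, the outcome does not depend on the order in which landmarks are processed.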
6. Experiments
We evaluated our method to answer the following questions:

(1) How efficiently can our proposed method answer shortest-path-graph queries, while still achieving construction time efficiency and low labelling space overhead?
(2) How well can sketching help improve the performance of answering shortest-path-graph queries?
(3) How does the number of landmarks affect performance, such as construction time, labelling size and query time?
6.1. Experimental Setup
We implemented our proposed methods in C++ 11 and compiled using g++. We performed all experiments on a Linux server which has Intel Xeon W2175 with 2.5GHz and 512GB of main memory.
Datasets. We conducted experiments on 12 real-world graph datasets from various types of large complex networks, including social networks, computer networks, web networks, co-authorship networks and communication networks. Table 1 presents the details of these datasets, among which the largest one has 1.7 billion vertices and 7.8 billion edges. We treated the graphs in these datasets as undirected. All the datasets used in our experiments are publicly available from the Koblenz Network Collection (Kunegis, 2013), the Stanford Network Analysis Project (Leskovec and Krevl, 2014), the Dynamically Evolving Large-scale Information Systems Project (see http://law.di.unimi.it/datasets.php) and the Lemur Project (see https://lemurproject.org/clueweb09/index.php).
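| Dataset | Construction time (sec.): QbSP | Construction time (sec.): QbS | Construction time (sec.): PPL | Construction time (sec.): ParentPPL | Avg. query time (ms.): QbS | Avg. query time (ms.): PPL | Avg. query time (ms.): ParentPPL | Avg. query time (ms.): BiBFS |
|---|---|---|---|---|---|---|---|---|
| Douban | 0.05 | 0.3 | 154 | 2,736 | 0.037 | 1.414 | 0.038 | 0.585 |
| DBLP | 0.12 | 1.1 | 2,610 | 11,049 | 0.097 | 1.782 | 0.052 | 2.995 |
| Youtube | 0.47 | 4.4 | 22,601 | DNF | 0.218 | 5.314 | - | 23.809 |
| WikiTalk | 0.61 | 4.9 | 8,662 | DNF | 0.693 | 3.536 | - | 6.984 |
| Skitter | 1.51 | 12.7 | 86,326 | DNF | 0.951 | 16.978 | - | 44.685 |
| Baidu | 2.04 | 18.9 | DNF | OOE | 0.845 | - | - | 174.412 |
| LiveJournal | 6.48 | 52.2 | DNF | OOE | 1.095 | - | - | 84.967 |
| Orkut | 10.85 | 73.2 | DNF | OOE | 4.237 | - | - | 207.541 |
| Twitter | 199.8 | 1,345 | DNF | OOE | 164.333 | - | - | 4,817.774 |
| Friendster | 416.5 | 2,354 | DNF | OOE | 11.972 | - | - | 3,600.362 |
| uk2007 | 178.5 | 1,485 | OOE | OOE | 77.830 | - | - | 5,264.101 |
| ClueWeb09 | 1,819 | 17,060 | OOE | OOE | 480.443 | - | - | DNF |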
Queries. We randomly sampled 10,000 pairs of vertices from all pairs of vertices in each graph to evaluate the average query time. Figure 6 shows the distance distribution of these 10,000 randomly sampled pairs in each graph dataset. The distances of these pairs mostly fall in the range of 2 to 9.
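A measurement of this kind could be set up roughly as below. This is a sketch under stated assumptions rather than the harness used in the experiments: pairs are drawn independently and uniformly with replacement (so a pair may repeat, or have identical endpoints), the seed is arbitrary, and `sample_pairs`, `average_query_ms` and the `query` callback are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using VertexPair = std::pair<std::uint32_t, std::uint32_t>;

// Draw `count` pairs (u, v) with u and v uniform over {0, ..., n-1}.
// Assumption: sampling with replacement; the source does not specify this.
std::vector<VertexPair> sample_pairs(std::uint32_t n, std::size_t count,
                                     std::uint64_t seed) {
    assert(n > 0);
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint32_t> pick(0, n - 1);
    std::vector<VertexPair> pairs;
    pairs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t u = pick(rng);  // drawn first, so the order is well defined
        std::uint32_t v = pick(rng);
        pairs.emplace_back(u, v);
    }
    return pairs;
}

// Average wall-clock time per query in milliseconds.
// `query` is any callable taking (u, v), e.g. a wrapper around a query method.
template <typename Query>
double average_query_ms(const std::vector<VertexPair>& pairs, Query query) {
    assert(!pairs.empty());
    auto start = std::chrono::steady_clock::now();
    for (const VertexPair& p : pairs) {
        query(p.first, p.second);
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(pairs.size());
}
```

Fixing the seed means every method is timed on the same 10,000 pairs, which keeps the per-method averages comparable.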
Baselines. We considered the following baselines:

Labelling-based methods. Pruned landmark labelling (PLL) is the state-of-the-art method for answering exact distance queries (Akiba et al., 2013). We thus use Pruned Path Labelling (PPL) and Pruned Path Labelling with Parent information (ParentPPL), as discussed in Section 3, as our baselines.

Search-based methods. We use bidirectional BFS, which searches from the two query vertices alternately (Goldberg and Harrelson, 2005), as the baseline. We denote it as BiBFS.
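For readers unfamiliar with the search-based baseline, the alternating expansion can be sketched as follows. This is a simplified illustration, not the BiBFS implementation evaluated here: it returns only the shortest distance rather than a shortest path graph, it expands one whole BFS level per turn, and the names `Graph` and `bibfs_distance` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Assumption: an undirected, unweighted graph stored as adjacency lists.
using Graph = std::vector<std::vector<int>>;

// Shortest distance between s and t, or -1 if they are disconnected.
int bibfs_distance(const Graph& g, int s, int t) {
    assert(s >= 0 && static_cast<std::size_t>(s) < g.size());
    assert(t >= 0 && static_cast<std::size_t>(t) < g.size());
    if (s == t) return 0;

    std::vector<int> dist_s(g.size(), -1), dist_t(g.size(), -1);
    std::vector<int> frontier_s{s}, frontier_t{t};
    dist_s[s] = 0;
    dist_t[t] = 0;

    bool expand_from_s = true;  // the two sides take turns, one level each
    while (!frontier_s.empty() && !frontier_t.empty()) {
        std::vector<int>& frontier = expand_from_s ? frontier_s : frontier_t;
        std::vector<int>& mine = expand_from_s ? dist_s : dist_t;
        const std::vector<int>& other = expand_from_s ? dist_t : dist_s;

        int best = std::numeric_limits<int>::max();
        std::vector<int> next;
        for (int u : frontier) {
            for (int v : g[u]) {
                if (other[v] != -1) {
                    // The two searches meet across edge (u, v).
                    best = std::min(best, mine[u] + 1 + other[v]);
                } else if (mine[v] == -1) {
                    mine[v] = mine[u] + 1;
                    next.push_back(v);
                }
            }
        }
        // Finish the whole level before answering, so the minimum is exact.
        if (best != std::numeric_limits<int>::max()) return best;

        frontier = std::move(next);
        expand_from_s = !expand_from_s;
    }
    return -1;  // one side exhausted its component without meeting the other
}
```

Taking the minimum over a complete level, instead of stopping at the first meeting edge, is what keeps the returned distance exact when several meeting points of different total length appear in the same level.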
To evaluate the parallel speedup of construction time, we use QbS to refer to our method with a sequential labelling construction and QbSP to refer to our method with a parallel labelling construction, with up to 12 threads in our experiments. In PPL and ParentPPL, we use 32 bits and 8 bits to represent a landmark and a distance in their labels, respectively, and 32 bits to store each parent in ParentPPL. In QbS and QbSP, we use *8 bits to store the label of each vertex.
Landmarks. In PPL and ParentPPL, landmarks are ordered in descending order of degree. In QbS, we choose vertices with the largest degrees as landmarks for two reasons: (1) removing high-degree vertices sparsifies a graph much more than removing low-degree vertices; (2) computing distances from two vertices to high-degree landmarks provides a good estimation of the shortest distance between these two vertices (Potamias et al., 2009). We set in QbS by default.
6.2. Performance Comparison
We conducted experiments to compare construction time, labelling size and query time of our method against the baselines.
6.2.1. Construction Time
Table 2 shows that our method QbS can efficiently construct a labelling scheme on all the datasets, scaling to large networks with billions of vertices and edges. Compared with PPL and ParentPPL, QbS takes significantly less time (i.e., 2-4 orders of magnitude faster) to construct labelling information. Moreover, PPL failed to construct labels for 7 out of 12 datasets and ParentPPL failed for 10 out of 12 datasets. This is because these methods need to satisfy the 2-hop path cover property. ParentPPL is much slower than PPL because a vertex often has more than one parent, and finding all parents takes more time even though the time complexity remains unchanged. We can also see that, compared with QbS, QbSP further improves construction time (i.e., 6-12 times faster), leading to much better scalability than QbS.
6.2.2. Labelling Size
Table 3 presents the comparison results for the labelling sizes of QbS, PPL and ParentPPL on all the datasets. We use to denote the size of precomputed shortest path graphs between landmarks as discussed in Section 5.2. We observe that: 1) the labelling sizes of QbS are hundreds of times smaller than those of PPL and ParentPPL; 2) the labelling sizes of ParentPPL are about twice those of PPL. For dense graphs, such as Twitter, the precomputed shortest paths in QbS are relatively larger than in sparse graphs, because many shortest paths exist between landmarks in dense graphs. Nonetheless, it is important to note that the sizes of precomputed shortest paths between landmarks (i.e. in Table 3) are small in QbS compared with the sizes of labelling (i.e. in Table 3). For metagraphs, since each metagraph contains at most edges, the space overhead for storing the edges and weights of a metagraph is very small. Indeed, even when we have =100, the size of a metagraph would still be smaller than 0.01MB. In summary, these results show that QbS scales well to very large networks in terms of labelling size.
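| Dataset | QbS (labelling) | QbS (precomputed shortest paths) | PPL | ParentPPL |
|---|---|---|---|---|
| Douban | 2.95MB | 0.03MB | 0.4GB | 0.8GB |
| DBLP | 6.05MB | 0.03MB | 1.2GB | 2.4GB |
| Youtube | 21.6MB | 0.6MB | 1.7GB | |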