A Highly Scalable Labelling Approach for Exact Distance Queries in Complex Networks

12/06/2018 ∙ by Muhammad Farhan, et al. ∙ Australian National University

Answering exact shortest path distance queries is a fundamental task in graph theory. Despite a tremendous amount of research on the subject, there is still no satisfactory solution that can scale to billion-scale complex networks. Labelling-based methods are well-known for rendering fast response time to distance queries; however, existing works can only construct labelling on moderately large networks (million-scale) and cannot scale to large networks (billion-scale) due to their prohibitively large space requirements and very long preprocessing time. In this work, we present novel techniques to efficiently construct distance labelling and process exact shortest path distance queries for complex networks with billions of vertices and billions of edges. Our method is based on two ingredients: (i) a scalable labelling algorithm for constructing minimal distance labelling, and (ii) a querying framework that supports fast distance-bounded search on a sparsified graph. Thus, we first develop a novel labelling algorithm that can scale to graphs at the billion-scale. Then, we formalize a querying framework for exact distance queries, which combines our proposed highway cover distance labelling with distance-bounded searches to enable fast distance computation. To speed up the labelling construction process, we further propose a parallel labelling method that can construct labelling simultaneously for multiple landmarks. We evaluated the performance of the proposed methods on 12 real-world networks. The experiments show that the proposed methods can not only handle networks with billions of vertices, but also be up to 70 times faster in constructing labelling and save up to 90% of labelling space. In particular, our method can answer distance queries on a billion-scale network of around 8B edges in less than 1ms, on average.

1. Introduction

Finding the shortest-path distance between a pair of vertices is a fundamental task in graph theory, and has a broad range of applications (Backstrom et al., 2006; Freeman, 1977; Sabidussi, 1966; Yahia et al., 2008; Ukkonen et al., 2008; Vieira et al., 2007; Maniu and Cautis, 2013). For example, in web graphs, ranking of web pages based on their distances to recently visited web pages helps in finding the more relevant pages and is referred to as context-aware search (Ukkonen et al., 2008). In social network analysis, distance is used as a core measure in many problems such as centrality (Freeman, 1977; Sabidussi, 1966) and community identification (Backstrom et al., 2006), which require distances to be computed for a large number of vertex pairs. However, despite extensive efforts in addressing the shortest-path distance problem for many years, there is still a high demand for scalable solutions that can be used to support analysis tasks over large and ever-growing networks.

Traditionally, one can use the Dijkstra algorithm (Tarjan, 1983) for weighted graphs or a breadth-first search (BFS) algorithm for unweighted graphs to query shortest-path distances. However, these algorithms are not scalable, i.e., for large graphs with billions of vertices and edges, they may take seconds or even longer to find the shortest-path distance between one pair of vertices, which is not acceptable for large-scale network applications where distances need to be provided in the order of milliseconds. To improve query time, a well-established approach is to precompute and store shortest-path distances between all pairs of vertices in an index, also called distance labelling, and then answer a distance query (i.e., find the distance between two vertices) in constant time with a single lookup in the index. Recent work (Hayashi et al., 2016) shows that such labelling-based methods are the fastest known exact distance querying methods on moderately large graphs (million-scale) having millions of edges, but still fail to scale to large graphs (billion-scale) due to quadratic space requirements and unbearable indexing construction time.

(a), (b): plots, described in the caption of Figure 1

(c)
Method                       Ordering-dependent?  2HC-minimal?  HWC-minimal?  Parallel?
HL (ours)                    no                   n/a           yes           landmarks
FD (Hayashi et al., 2016)    no                   no            no            neighbours
IS-L (Fu et al., 2013)       yes                  no            no            no
PLL (Akiba et al., 2013)     yes                  yes           no            neighbours
HDB (Jiang et al., 2014)     yes                  no            no            no
HHL (Abraham et al., 2012)   yes                  no            no            no
Figure 1. High-level overview of the state-of-the-art methods and our proposed method (HL) for exact distance queries: (a) performance w.r.t. query time and labelling size on networks of size up to 400M, (b) scalability w.r.t. labelling construction time and network size, and (c) several important properties related to labelling methods.

Thus, the question is still open as to how scalable solutions to answer exact distance queries in billion-scale networks can be developed. Essentially, there are three computational factors to be considered concerning the performance of algorithms for answering distance queries: construction time, index size, and query time. Much of the existing work has focused on exploring trade-offs among these computational factors (Abraham et al., 2011, 2012; Akiba et al., 2013, 2012; Wei, 2010; Hayashi et al., 2016; Tretyakov et al., 2011; Potamias et al., 2009; Fu et al., 2013; Jin et al., 2012; Qiao et al., 2014; Gubichev et al., 2010; Li et al., 2017; Chang et al., 2012), especially for the 2-hop cover distance labelling (Cohen et al., 2003; Akiba et al., 2013). Nonetheless, to handle large graphs, we believe that a scalable solution for answering exact distance queries needs to have the following desirable characteristics: (1) the construction time of a distance labelling is scalable with the size of a network; (2) the size of a distance labelling is minimized so as to reduce the space overhead; (3) the query time remains in the order of milliseconds, even in graphs with billions of nodes and edges.

In this work, we aim to develop a scalable solution for exact distance queries which can meet the aforementioned characteristics. Our solution is based on two ingredients: (i) a scalable labelling algorithm for constructing minimal distance labelling, and (ii) a querying framework that supports fast distance-bounded search on a sparsified graph. More specifically, we first develop a novel labelling algorithm that can scale to graphs at the billion-scale. We observed that, for a given number of landmarks, the distance entries from these landmarks to other vertices in a graph can be further minimized if the definition of 2-hop cover distance labelling is relaxed. Thus, we formulate a relaxed notion for labelling in this paper, called the highway cover distance labelling, and develop a simple yet scalable labelling algorithm that adds a significantly small number of distance entries into the label of each vertex. We prove that the distance labelling constructed by our labelling algorithm is minimal, and also experimentally verify that the construction process is scalable.

Then, we formalize a querying framework for exact distance queries, which combines our proposed highway cover distance labelling with distance-bounded searches to enable fast distance computation. This querying framework is capable of balancing the trade-off between construction time, index size and query time through an offline component (i.e. the proposed highway cover distance labelling) and an online component (i.e. distance-bounded searches). The basic idea is to select a small number of highly central landmarks that allow us to efficiently compute the upper bounds of distances between all pairs of vertices using an offline distance labelling, and then conduct distance-bounded search over a sparsified graph to find exact distances efficiently. Our experimental results show that the query time of distance queries within this framework is still in milliseconds for large graphs with billions of vertices and edges.

Figure 1 summarizes the performance of the state-of-the-art methods for exact distance queries (Akiba et al., 2013; Fu et al., 2013; Abraham et al., 2012; Hayashi et al., 2016; Jiang et al., 2014; Tarjan, 1983; Pohl, 1971; Chang et al., 2012), as well as our proposed method in this paper, denoted as HL. In Figure 1(a)-1(b), we can see that the labelling-based methods PLL (Akiba et al., 2013), HDB (Jiang et al., 2014), and HHL (Abraham et al., 2012) can answer distance queries in a very small amount of time. However, they have very large space requirements and very long labelling construction time. In contrast, traditional online search methods such as Dijkstra (Tarjan, 1983) and bidirectional BFS (denoted as Bi-BFS) (Pohl, 1971) are not applicable to large-scale networks where distances need to be provided in the order of milliseconds, because of their very high response time. The hybrid methods FD (Hayashi et al., 2016), IS-L (Fu et al., 2013) and HL (our method) combine an offline labelling and an online graph traversal technique, which can often provide better trade-offs between query response time and labelling size. In Figure 1(b), we can also see that only our proposed method HL can handle networks of size 8B, and is scalable to perform distance queries on networks with billions of vertices and billions of edges.

Figure 1(c) presents a high-level overview for several important properties of labelling methods. The column ordering dependent refers to whether a distance labelling depends on the ordering of landmarks when being constructed by a method. Only our method HL and FD are not ordering-dependent. The columns 2HC-minimal and HWC-minimal refer to whether a distance labelling constructed by a method is minimal in terms of the 2-hop cover (2HC) and highway cover (HWC) properties, respectively. PLL is 2HC-minimal, but not HWC-minimal. Our method HL is the only method that is HWC-minimal. The column Parallel refers to what kind of parallelism a method can support. FD and PLL support bit-parallelism for up to 64 neighbours of a landmark. Our method HL supports parallel computation for multiple landmarks, depending on the number of processors. Other methods did not mention any parallelism.

In summary, our contributions in this paper are as follows:

  • We introduce a new labelling property, namely highway cover labelling, which relaxes the standard notion of 2-hop cover labelling. Based on this new labelling property, we propose a highly scalable labelling algorithm that can scale to construct labellings for graphs with billions of vertices and billions of edges.

  • We prove that the proposed labelling algorithm constructs HWC-minimal labellings, which are independent of any ordering of landmarks. Then, due to this deterministic nature of labelling, we further develop a parallel algorithm which can run parallel BFSs from multiple landmarks to speed up labelling construction.

  • We combine our novel labelling algorithm with online bounded-distance graph traversal to efficiently answer exact distance queries. This querying framework enables us to balance the trade-offs among construction time, labelling size and query time.

  • We have experimentally verified the performance of our methods on 12 large-scale complex networks. The results show that our methods can not only handle networks with billions of vertices, but also be up to 70 times faster in constructing labelling and save up to 90% of labelling space.

The rest of the paper is organized as follows. In Section 2, we present basic notations and definitions used in this paper. Then, we discuss a novel labelling algorithm in Section 3, formulate the querying framework in Section 4, and introduce several optimization techniques in Section 5. In Section 6 we present our experimental results and in Section 7 we discuss other works that are related to our work here. The paper is concluded in Section 8.

2. Preliminaries

Let $G = (V, E)$ be a graph where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. We have $n = |V|$ and $m = |E|$. Without loss of generality, we assume that the graph $G$ is connected and undirected in this paper. Let $V' \subseteq V$ be a subset of vertices of $G$. Then the induced subgraph $G[V']$ is a graph whose vertex set is $V'$ and whose edge set consists of all of the edges in $E$ that have both endpoints in $V'$. Let $N_G(v) = \{u \in V \mid (u, v) \in E\}$ denote the set of neighbors of a vertex $v \in V$ in $G$.

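The distance between two vertices $s$ and $t$ in $G$, denoted as $d_G(s, t)$, is the length of the shortest path from $s$ to $t$. We consider $d_G(s, t) = \infty$, if there does not exist a path from $s$ to $t$. For any three vertices $s, u, t \in V$, the following triangle inequalities are satisfied:

(1) $d_G(s, t) \le d_G(s, u) + d_G(u, t)$
(2) $d_G(s, t) \ge |d_G(s, u) - d_G(u, t)|$

If $u$ belongs to one of the shortest paths from $s$ to $t$, then $d_G(s, t) = d_G(s, u) + d_G(u, t)$ holds.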

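Given a special subset of vertices $R \subseteq V$ of $G$, so-called landmarks, a label $L(v)$ for each vertex $v \in V$ can be precomputed, which is a set of distance entries $\{(u_1, \delta_L(u_1, v)), \dots, (u_k, \delta_L(u_k, v))\}$ where $u_i \in R$ and $\delta_L(u_i, v) = d_G(u_i, v)$ for $i = 1, \dots, k$. The set of labels $L = \{L(v)\}_{v \in V}$ is called a distance labelling over $G$. The size of a distance labelling $L$ is defined as $size(L) = \sum_{v \in V} |L(v)|$.

Using such a distance labelling $L$, we can query the distance between any pair of vertices $s$ and $t$ in graph $G$ as follows,

(3) $Q(s, t, L) = \min\{\delta_L(u, s) + \delta_L(u, t) \mid (u, \delta_L(u, s)) \in L(s),\ (u, \delta_L(u, t)) \in L(t)\}$

We define $Q(s, t, L) = \infty$, if $s$ and $t$ do not share any landmark. If $Q(s, t, L) = d_G(s, t)$ holds for any two vertices $s$ and $t$ of $G$, $L$ is called a 2-hop cover distance labelling over $G$ (Cohen et al., 2003; Abraham et al., 2012).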
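The label-based query in Equation (3) can be illustrated with a minimal Python sketch. The function name `query_2hop`, the dictionary representation of labels, and the toy path graph are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def query_2hop(label_s, label_t):
    """Equation (3): minimum of d(u, s) + d(u, t) over landmarks u in both labels.

    Each label is a dict mapping a landmark to its distance from the vertex.
    Returns math.inf when the two labels share no landmark.
    """
    # Iterate over the smaller label and probe the larger one.
    if len(label_s) > len(label_t):
        label_s, label_t = label_t, label_s
    best = math.inf
    for u, d_us in label_s.items():
        d_ut = label_t.get(u)
        if d_ut is not None:
            best = min(best, d_us + d_ut)
    return best


# Hypothetical labels for the path graph 0 - 1 - 2 - 3 with landmarks 1 and 2.
# These labels form a 2-hop cover: every pair of vertices shares a landmark
# that lies on a shortest path between them.
labels = {
    0: {1: 1},
    1: {1: 0},
    2: {1: 1, 2: 0},
    3: {1: 2, 2: 1},
}
```

For instance, `query_2hop(labels[0], labels[3])` combines the two entries for landmark 1, while `query_2hop(labels[2], labels[3])` picks the shorter route through landmark 2.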

Given a graph $G$ and a set of landmarks $R \subseteq V$, the distance querying problem is to efficiently compute the shortest path distance $d_G(s, t)$ between any two vertices $s$ and $t$ in $G$, using a distance labelling $L$ over $G$ in which labels may contain distance entries from landmarks in $R$.

3. Highway Cover Labelling

In this section, we formulate the highway cover labelling problem and propose a novel algorithm to efficiently construct the highway cover distance labelling over graphs. Then, we provide theoretical analysis of our proposed algorithm.

3.1. Highway Cover Labelling Problem

We begin with the definitions of highway and highway cover.

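Definition 3.1 (Highway).

A highway $H$ is a pair $(R, \delta_H)$, where $R$ is a set of landmarks and $\delta_H$ is a distance decoding function, i.e. $\delta_H : R \times R \to \mathbb{N}^+$, such that for any $\{r_1, r_2\} \subseteq R$ we have $\delta_H(r_1, r_2) = d_G(r_1, r_2)$.

Given a landmark $r \in R$ and two vertices $s, t \in V \setminus R$ (i.e. two non-landmark vertices), an $r$-constrained shortest path between $s$ and $t$ is a path between $s$ and $t$ satisfying two conditions: (1) it goes through the landmark $r$, and (2) it has the minimum length among all paths between $s$ and $t$ that go through $r$. We use $P_{st}$ to denote the set of vertices in a shortest path between $s$ and $t$, and $P^r_{st}$ to denote the set of vertices in an $r$-constrained shortest path between $s$ and $t$.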

Definition 3.2 (Highway Cover).

Let $G = (V, E)$ be a graph and $H = (R, \delta_H)$ a highway. Then for any two vertices $s, t \in V \setminus R$ and for any $r \in R$, there exist $(r_i, \delta_L(r_i, s)) \in L(s)$ and $(r_j, \delta_L(r_j, t)) \in L(t)$ such that $r_i \in P_{rs}$ and $r_j \in P_{rt}$, where $r_i$ and $r_j$ may be equal to $r$.

If the label of a vertex $v$ contains a distance entry $(r, \delta_L(r, v))$, we also say that the vertex $v$ is covered by the landmark $r$ in the distance labelling $L$. Intuitively, the highway cover property guarantees that, given a highway $H$ with a set of landmarks $R$ and $r \in R$, any $r$-constrained shortest path distance between two vertices $s$ and $t$ can be found using only the labels of these two vertices and the given highway. A distance labelling $L$ is called a highway cover distance labelling if $L$ satisfies the highway cover property.

Example 3.3.

Consider the graph depicted in Figure 2(a), the highway has three landmarks as highlighted in red in Figure 2(b). Based on the graph in Figure 2(a) and the highway in Figure 2(b), we have which is a shortest path between the vertices and constrained by the landmark , i.e. -constrained shortest path between and . In contrast, neither of the paths and is a -constrained shortest path between and .

In Figure 2(b), the outgoing arrows from each landmark point to vertices in $V \setminus R$ that are covered by this landmark in the highway. The distance labelling in Figure 2(c) satisfies the highway cover property because for any two vertices that are not landmarks and any landmark $r \in R$, we can find the $r$-constrained shortest path distance between these two vertices using their labels and the highway.

Figure 2. An illustration of highway cover distance labelling: (a) an example graph $G$, (b) a highway structure $H$ and (c) a distance labelling that fulfills the highway cover property over $(G, H)$.

Definition 3.4 (Highway Cover Labelling Problem).

Given a graph $G$ and a highway $H$ over $G$, the highway cover labelling problem is to efficiently construct a highway cover distance labelling $L$.

Several choices naturally come up: (1) One is to add a distance entry for each landmark $r \in R$ into the label of every vertex in $V \setminus R$, as in the approach proposed in (Hayashi et al., 2016); (2) Another is to use the pruned landmark labelling approach (Akiba et al., 2013) to add the distance entry of a landmark $r$ into the labels of vertices in $V \setminus R$ if the landmark has not been pruned during a BFS rooted at $r$; (3) Alternatively, we can also extend the pruned landmark labelling approach to construct the highway cover labelling by replacing the 2-hop cover pruning condition with the one required by the highway cover as defined in Definition 3.2 at each step of checking possible labels to be pruned.

In all these cases, the labelling construction process would be neither scalable nor suitable for large-scale complex networks with billions of vertices and edges. Moreover, these approaches would potentially lead to the construction of distance labellings with different sizes. A question arising naturally is how to construct a minimal highway cover distance labelling without redundant labels. In a nutshell, it is a challenging task to construct a highway cover distance labelling that can scale to very large networks, ideally in linear time, and that also has the minimal labelling size.

3.2. A Novel Algorithm

We propose a novel algorithm for solving the highway cover labelling problem, which can construct labellings in linear time.

The key idea of our algorithm is to construct a label $L(v)$ for vertex $v$ such that the distance entry $(r, \delta_L(r, v))$ of each landmark $r$ is added into the label $L(v)$ iff there does not exist any other landmark that appears in the shortest path between $r$ and $v$, i.e. $P_{rv} \cap R = \{r\}$. In other words, if there is another landmark $r'$ in the shortest path between $r$ and $v$, then only the landmark "closest" to $v$ in that path has its distance entry added into $L(v)$. To compute such labels efficiently, we conduct a breadth-first search from every landmark and add distance entries into labels of vertices that do not have any other landmark in their shortest paths from that landmark.

Example 3.5.

Consider vertex 7 in Figure 2(c). The label $L(7)$ contains the distance entries of landmarks 5 and 9, but no distance entry of landmark 1. This is because 5 and 9 are the closest landmarks to vertex 7 in their respective shortest paths to vertex 7. However, in either of the two shortest paths between 7 and 1, there is another landmark (i.e. 5 or 9) that is closer to 7 than landmark 1. Thus the distance entry of landmark 1 is not added into $L(7)$.

Our highway cover labelling approach is described in Algorithm 1. Given a graph $G$ and a highway $H$ over $G$, we start with an empty highway cover distance labelling $L$, where $L(v) = \emptyset$ for every $v \in V$. Then, for each landmark $r \in R$, we compute the corresponding distance entries as follows. We use two queues $Q_{label}$ and $Q_{prune}$ to process vertices to be labeled or pruned at each level of a breadth-first search (BFS) tree, respectively. We start by processing vertices in $Q_{label}$. For each vertex $u \in Q_{label}$ at depth $n$, we examine the children of $u$ at depth $n+1$ that are unvisited. For each unvisited child vertex $v$ at depth $n+1$, if $v \in R$ then we prune $v$, i.e., we do not add a distance entry of the current landmark into $L(v)$ and we also enqueue $v$ to the pruned queue $Q_{prune}$ (Line 11). Otherwise, we add $(r, \delta_{BFS}(r, v))$ to the label of $v$, i.e., we add it into $L(v)$ and we also enqueue $v$ to the labeled queue $Q_{label}$ (Lines 13-14). Here, $\delta_{BFS}(r, v)$ refers to the BFS decoded distance from root $r$ to $v$. Then we process the pruned vertices in $Q_{prune}$. These vertices are either landmarks or have landmarks in their shortest paths from $r$, and thus do not need to be labeled. Therefore, for each vertex $v \in Q_{prune}$ at depth $n$, we enqueue all unvisited children of $v$ at depth $n+1$ to the pruned queue $Q_{prune}$. We keep processing these two queues, one after the other, until $Q_{label}$ is empty.

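Input: $G = (V, E)$, $H = (R, \delta_H)$
Output: $L$
1 $L(v) \leftarrow \emptyset$, $\forall v \in V$
2 foreach $r \in R$ do
3      
4      
5      
6       Enqueue $r$ to $Q_{label}$ and set $r$ as the root of BFS
7       while $Q_{label}$ is not empty do
8             foreach $u \in Q_{label}$ at depth $n$ do
9                   foreach unvisited child $v$ of $u$ at depth $n+1$ do
10                         if $v$ is a landmark then
11                               Enqueue $v$ to $Q_{prune}$
12                         else
13                               Enqueue $v$ to $Q_{label}$
14                               Add {($r$, $\delta_{BFS}(r, v)$)} to $L(v)$
15                         end if
16                        
17                   end foreach
18                  
19             end foreach
20            
21             foreach $v \in Q_{prune}$ at depth $n$ do
22                   Enqueue unvisited children of $v$ at depth $n+1$ to $Q_{prune}$
23             end foreach
24            
25       end while
26      
27 end foreach
return $L$
Algorithm 1 Constructing the highway cover labelling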
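
The two-queue pruned BFS of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the graph is an adjacency dictionary, the names `highway_cover_labelling` and `highway_distances` are hypothetical, and the highway distances are obtained by a separate plain BFS per landmark for simplicity rather than inside the pruned BFS. The toy graph is not the graph of Figure 2.

```python
from collections import deque


def highway_cover_labelling(adj, landmarks):
    """Pruned BFS from every landmark, using a labelled and a pruned frontier.

    adj maps each vertex to a list of neighbours (undirected graph).
    Returns labels, where labels[v][r] is the BFS distance from landmark r to v.
    """
    landmark_set = set(landmarks)
    labels = {v: {} for v in adj}
    for r in landmarks:
        visited = {r}
        q_label, q_prune = [r], []
        depth = 0
        while q_label:  # stop once no vertex can be labelled any more
            next_label, next_prune = [], []
            # Expand the labelled frontier first.
            for u in q_label:
                for v in adj[u]:
                    if v in visited:
                        continue
                    visited.add(v)
                    if v in landmark_set:
                        next_prune.append(v)  # prune: v is a landmark
                    else:
                        next_label.append(v)
                        labels[v][r] = depth + 1  # add (r, d(r, v)) to L(v)
            # Then expand the pruned frontier; its children are pruned too.
            for u in q_prune:
                for v in adj[u]:
                    if v not in visited:
                        visited.add(v)
                        next_prune.append(v)
            q_label, q_prune = next_label, next_prune
            depth += 1
    return labels


def highway_distances(adj, landmarks):
    """Plain BFS from each landmark to get landmark-to-landmark distances."""
    highway = {}
    for r in landmarks:
        dist = {r: 0}
        queue = deque([r])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        highway[r] = {r2: dist[r2] for r2 in landmarks if r2 in dist}
    return highway


# Hypothetical toy graph: path 0 - 1 - 2 - 3 - 4 plus the branch 2 - 5 - 6.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2, 6], 6: [5]}
landmarks = [1, 3]
labels = highway_cover_labelling(adj, landmarks)
highway = highway_distances(adj, landmarks)
```

In this toy graph, vertex 4 receives no entry for landmark 1 because landmark 3 lies on its only shortest path to 1, and vertex 0 likewise receives no entry for landmark 3.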
Example 3.6.

We illustrate how our algorithm conducts pruned BFSs in Figure 3. The pruned BFS from landmark is depicted in Figure 3(a), which labels only four vertices because the other vertices are either landmarks or contain other landmarks in their shortest paths to landmark . Similarly, in the pruned BFS from landmark depicted in Figure 3(b), only vertices are labelled, and none of the vertices , , and is labelled because of the presence of landmark in their shortest paths to landmark . Indeed, we can get the distance between landmark to these vertices by using the highway, i.e. , and distance entries in their labels to landmark . The pruned BFS from landmark 9 is depicted in Figure 3(c), which works in a similar fashion.

Note that, although a highway $H$ is given in Algorithm 1, we can indeed compute the distances $\delta_H$ for a given set of landmarks $R$ along with Algorithm 1.

Figure 3. An illustration of the highway cover labelling algorithm: (a), (b) and (c) describe the pruned BFSs that are rooted at the landmarks 1, 5 and 9, respectively, where yellow vertices denote roots, green vertices denote those being labeled, red vertices denote landmarks, and white vertices are not labelled. LS and ET at the top right corner denote the labelling size and the number of edges traversed during the pruned BFSs, respectively.

3.3. Correctness

Here we prove the correctness of our labelling algorithm.

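Lemma 3.7.

In Algorithm 1, for each pruned BFS rooted at $r \in R$, $(r, \delta_L(r, v))$ is added into the label of a vertex $v \in V \setminus R$ iff there is no other landmark appearing in the shortest path between $r$ and $v$, i.e., $P_{rv} \cap R = \{r\}$.

Proof.

Suppose that Algorithm 1 is conducting a pruned BFS rooted at $r$ and $v$ is an unvisited child of another vertex $u$ in $Q_{label}$ (starting from $Q_{label} = \{r\}$) (Lines 6-9). If $v \in R$ (Line 10), then we have $v \in Q_{prune}$ (Lines 11, 19-21), and a distance entry of $r$ cannot be added into the label of any child of $v$, i.e., each such child is put into $Q_{prune}$. Otherwise, since $v \notin R$ and $v$ is an unvisited child of a vertex in $Q_{label}$ (Lines 8-9), we know that $P_{rv} \cap R = \{r\}$ and thus $(r, \delta_L(r, v))$ is added into $L(v)$ (Lines 12-14). ∎

Then, by Lemma 3.7, we have the following corollary.

Corollary 3.8.

Let $r \in R$ be a landmark, $v \in V \setminus R$ a vertex, and $L$ a distance labelling constructed by Algorithm 1. If $(r, \delta_L(r, v)) \notin L(v)$, then there must exist a landmark $r'$ such that $r' \in P_{rv}$ and $(r', \delta_L(r', v)) \in L(v)$.

Theorem 3.9.

The highway cover distance labelling $L$ over $(G, H)$ constructed using Algorithm 1 satisfies the highway cover property over $(G, H)$.

Proof.

To prove that, for any two vertices $s, t \in V \setminus R$ and for any $r \in R$, there exist $(r_i, \delta_L(r_i, s)) \in L(s)$ and $(r_j, \delta_L(r_j, t)) \in L(t)$ such that $r_i \in P_{rs}$ and $r_j \in P_{rt}$, we consider the following 4 cases: (1) If $(r, \delta_L(r, s)) \in L(s)$ and $(r, \delta_L(r, t)) \in L(t)$, then $r_i = r_j = r$. (2) If $(r, \delta_L(r, s)) \in L(s)$ and $(r, \delta_L(r, t)) \notin L(t)$, then $r_i = r$ and by Corollary 3.8, there exists another landmark $r_j$ such that $r_j$ is in the shortest path between $t$ and $r$ and $(r_j, \delta_L(r_j, t)) \in L(t)$. (3) If $(r, \delta_L(r, s)) \notin L(s)$ and $(r, \delta_L(r, t)) \in L(t)$, then similarly we have $r_j = r$, and by Corollary 3.8, there exists another landmark $r_i$ such that $r_i$ is in the shortest path between $s$ and $r$ and $(r_i, \delta_L(r_i, s)) \in L(s)$. (4) If $(r, \delta_L(r, s)) \notin L(s)$ and $(r, \delta_L(r, t)) \notin L(t)$, then by Corollary 3.8 there exist another two landmarks $r_i$ and $r_j$ such that $r_i$ is in the shortest path between $s$ and $r$ and $(r_i, \delta_L(r_i, s)) \in L(s)$, and $r_j$ is in the shortest path between $t$ and $r$ and $(r_j, \delta_L(r_j, t)) \in L(t)$. The proof is done. ∎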

Figure 4. An illustration of the pruned landmark labelling algorithm (Akiba et al., 2013): (a)-(c) show an example of constructing labels through pruned BFSs from three landmarks in the labelling order ; (d)-(f) show an example of constructing labels using the same three landmarks but in a different labelling order . Yellow vertices denote landmarks that are the roots of pruned BFSs, green vertices denote those being labeled, grey vertices denote vertices being visited but pruned, and red vertices denote landmarks which have already been visited.

3.4. Order Independence

In previous studies (Abraham et al., 2011; Akiba et al., 2013; Abraham et al., 2012; Cohen et al., 2003), given a graph , a distance labelling algorithm builds a unique canonical distance labelling subject to a labelling order (i.e., the order of landmarks used for constructing a distance labelling). It has been well known that such a labelling order is decisive in determining the size of the constructed distance labelling (Qin et al., 2017). For the same set of landmarks, when using different labelling orders, the sizes of the constructed distance labelling may vary significantly.

The following example shows how different labelling orders in the pruned landmark labelling approach (Akiba et al., 2013) can lead to the distance labelling of different sizes.

Example 3.10.

In Figure 4, the size of the distance labelling constructed using the labelling order in Figure 4(a)-4(c) is different from the size of the distance labelling constructed using the labelling order in Figure 4(d)-4(f). In both cases, the first BFS adds the distance entry of the current landmark into all the vertices in the graph. Then, the following BFSs check each visited vertex whether the shortest path distance between the current landmark and the visited vertex can be computed via the 2-hop cover property based on their labels added by the previous BFSs. A distance entry is only added into the label of a vertex if the shortest path distance cannot be computed by applying the 2-hop cover over the existing labels. Thus, the choice of the labelling order could affect the size of labels significantly. Take the vertex for example, its label contains only one distance entry using the labelling order depicted in Figure 4(a)-4(c), but contains three distance entries , , and when the labelling order depicted in Figure 4(d)-4(f) is used.

Unlike all previous approaches taken with distance labelling, our highway cover labelling algorithm is order-invariant. That is, regardless of the labelling order, the distance labellings constructed by our algorithm using different labelling orders over the same set of landmarks always have the same size. In fact, we can show that our algorithm has the following stronger property: the distance labelling constructed using our algorithm is deterministic (i.e., the same label for each vertex) for a given set of landmarks.

Lemma 3.11.

Let $G$ be a graph and $H$ a highway over $G$. For any two different labelling orders over $R$, the highway cover distance labellings $L_1$ and $L_2$ over $(G, H)$ constructed by these two different labelling orders using Algorithm 1 satisfy $L_1(v) = L_2(v)$ for every $v \in V \setminus R$.

Proof.

Let $\sigma_1$ and $\sigma_2$ be two different labelling orders over $R$. For any landmark $r$ in $\sigma_1$ and $\sigma_2$, Algorithm 1 generates exactly the same pruned BFS tree. This implies that, for each vertex $v \in V \setminus R$, either the same distance entry $(r, \delta_L(r, v))$ is added into $L_1(v)$ and $L_2(v)$, or no distance entry is added to $L_1(v)$ and $L_2(v)$. Thus, the labellings constructed by Algorithm 1 satisfy $L_1(v) = L_2(v)$ for every $v \in V \setminus R$. ∎

3.5. Minimality

Here we discuss the question of minimality, i.e., whether the highway cover distance labelling constructed by our algorithm is always minimal in terms of the labelling size. We first prove the following theorem.

Theorem 3.12.

The highway cover distance labelling $L$ over $(G, H)$ constructed using Algorithm 1 is minimal, i.e., for any highway cover distance labelling $L'$ over $(G, H)$, $size(L') \ge size(L)$ must hold.

Proof.

We prove this by contradiction. Let us assume that there is a highway cover distance labelling $L'$ with $size(L') < size(L)$. Then, this would imply that there must exist a vertex $v \in V \setminus R$ and a landmark $r \in R$ such that $(r, \delta_L(r, v)) \in L(v)$ and $(r, \delta_{L'}(r, v)) \notin L'(v)$. By Lemma 3.7 and $(r, \delta_L(r, v)) \in L(v)$, we know that there is no other landmark in $R$ that is in the shortest path between $r$ and $v$. However, by the definition of the highway cover property (i.e. Definition 3.2) and $(r, \delta_{L'}(r, v)) \notin L'(v)$, we also know that there must exist another landmark $r' \in R$ with $r' \in P_{rv}$, which contradicts the previous conclusion that there is no other landmark in the shortest path between $r$ and $v$. Thus, $size(L') \ge size(L)$ must hold for any highway cover distance labelling $L'$. ∎

The state-of-the-art approaches for distance labelling are primarily based on the idea of 2-hop cover (Akiba et al., 2013; Fu et al., 2013; Abraham et al., 2011). One may ask the question: how is the highway cover labelling different from the 2-hop cover labelling, such as the pruned landmark labelling (Akiba et al., 2013)? It is easy to verify the following lemma, which states that each pruned landmark labelling satisfies the highway cover property for the same set of landmarks.

Lemma 3.13.

Let $L$ be a pruned landmark labelling over graph $G$ constructed using a set of landmarks $R$. Then $L$ also satisfies the highway cover property over $(G, H)$ where $H = (R, \delta_H)$.

The pruned landmark labelling algorithm (Akiba et al., 2013) prunes labels based on the 2-hop cover property, whereas our highway cover labelling algorithm prunes labels based on the property described in Lemma 3.7. By Theorem 3.12, we therefore also have the following corollary, stating that, for the same set of landmarks, the size of the highway cover labelling is never larger than the size of any pruned landmark labelling.

Corollary 3.14.

For a highway cover distance labelling $L_1$ produced by Algorithm 1 over $(G, H)$, where $H = (R, \delta_H)$, and a pruned landmark labelling $L_2$ over $G$ using any labelling order over $R$, we always have $size(L_1) \le size(L_2)$.

Example 3.15.

Figure 4 shows the labelling size (LS) of the pruned landmark labelling at the top right corner, which is constructed using two different orderings. The first ordering labels 25 vertices whereas the second ordering labels 30 vertices. On the other hand, the LS of the highway cover distance labelling is 13 as shown in Figure 3. Note that the LS of the highway cover distance labelling does not change, irrespective of ordering. Since the highway cover distance labelling constructed by our algorithm is always minimal, the LS of the highway cover distance labelling in Figure 3 is much smaller than the LS of either pruned landmark labelling in Figure 4.

4. Bounded Distance Querying

In this section, we describe a bounded distance querying framework that allows us to efficiently compute exact shortest-path distances between two arbitrary vertices in a massive network.

4.1. Querying Framework

We start by presenting a high-level overview of our querying framework. To compute the shortest path distance between two vertices $s$ and $t$ in graph $G$, our querying framework proceeds in two steps: (1) an upper bound of the shortest path distance between $s$ and $t$ is computed using the highway cover distance labelling; (2) the exact shortest path distance between $s$ and $t$ is computed using a distance-bounded shortest-path search over a sparsified graph $G'$ obtained from $G$.

Given a graph $G$ and a highway $H$ over $G$, we can precompute a highway cover distance labelling $L$ using the landmarks in $R$, which enables us to efficiently compute the length of any $r$-constrained shortest path between two vertices in $V \setminus R$. The length of such an $r$-constrained shortest path must be greater than or equal to the exact shortest path distance between these two vertices and can thus serve as an upper bound in Step (1). On the other hand, since the length of such an $r$-constrained shortest path between two vertices in $V \setminus R$ can always be efficiently computed by the highway cover distance labelling $L$, the distance-bounded shortest-path search only needs to be conducted over a sparsified graph $G'$ obtained by removing all landmarks in $R$ from $G$, i.e. $G' = G[V \setminus R]$.
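
Removing the landmarks to obtain the sparsified graph can be sketched in a few lines of Python. The function name `sparsify` and the adjacency-dictionary representation are assumptions for illustration, and the toy graph is hypothetical.

```python
def sparsify(adj, landmarks):
    """Return the adjacency of the graph induced on the non-landmark vertices.

    Every landmark and every edge touching a landmark is dropped.
    """
    removed = set(landmarks)
    return {
        v: [w for w in nbrs if w not in removed]
        for v, nbrs in adj.items()
        if v not in removed
    }


# Hypothetical toy graph: path 0 - 1 - 2 - 3 - 4 plus the branch 2 - 5 - 6.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2, 6], 6: [5]}
sparse = sparsify(adj, [1, 3])
```

With landmarks 1 and 3 removed, vertices 0 and 4 become isolated, so any shortest path between them has to pass through a landmark and is accounted for by the labelling rather than by the online search.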

More precisely, we define the bounded distance querying problem in the following.

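Definition 4.1 (Bounded Distance Querying Problem).

Given a sparsified graph $G' = G[V \setminus R]$, a pair of vertices $\{s, t\} \subseteq V \setminus R$, and an upper (distance) bound $d^{\top}_{st}$, the bounded distance querying problem is to efficiently compute the shortest path distance $d_{st}$ between $s$ and $t$ over $G'$ under the upper bound $d^{\top}_{st}$ such that,

$$d_{st} = \begin{cases} d_{G'}(s, t) & \text{if } d_{G'}(s, t) \le d^{\top}_{st}, \\ d^{\top}_{st} & \text{otherwise.} \end{cases}$$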

In the following, we discuss the two steps of this framework in detail.

4.2. Computing Upper Bounds

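Given any two vertices $s$ and $t$, we can use a highway cover distance labelling $L$ to compute an upper bound $d^{\top}_{st}$ for the shortest path distance between $s$ and $t$ as follows:

$$d^{\top}_{st} = \min \{ \delta_L(r_i, s) + \delta_H(r_i, r_j) + \delta_L(r_j, t) \mid (r_i, \delta_L(r_i, s)) \in L(s),\ (r_j, \delta_L(r_j, t)) \in L(t) \} \tag{4}$$

This corresponds to the length of a shortest path from $s$ to $t$ passing through landmarks $r_i$ and $r_j$, where $\delta_L(r_i, s)$ is the shortest path distance from $r_i$ to $s$ in $L(s)$, $\delta_H(r_i, r_j)$ is the shortest path distance from $r_i$ to $r_j$ through highway $H$, and $\delta_L(r_j, t)$ is the shortest path distance from $r_j$ to $t$ in $L(t)$.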

Example 4.2.

Consider the graph in Figure 2(a). We may use the labels $L(2)$ and $L(11)$ to compute the upper bound for the shortest path distance between vertices 2 and 11. There are two cases: (1) the path that goes through landmarks 5 and 1, and (2) the path that goes through landmarks 9 and 1. We evaluate the sum in Equation 4 for each case and take the minimum of the two distances as the upper bound, which is 3 in this case.
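The upper-bound computation of Equation 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `upper_bound`, the dictionary-based representation of labels and highway, and the toy landmark names and distances are all assumptions made for this example.

```python
from itertools import product
from math import inf


def upper_bound(label_s, label_t, highway):
    """Minimum of d(r_i, s) + d_H(r_i, r_j) + d(r_j, t) over all label entries.

    label_s, label_t: dicts mapping a landmark to its stored distance.
    highway: nested dict with highway[r_i][r_j] = distance between landmarks.
    Returns inf if either label is empty.
    """
    best = inf
    for (ri, ds), (rj, dt) in product(label_s.items(), label_t.items()):
        best = min(best, ds + highway[ri][rj] + dt)
    return best


# Hypothetical toy data: three landmarks with pairwise highway distances.
highway = {
    "a": {"a": 0, "b": 1, "c": 2},
    "b": {"a": 1, "b": 0, "c": 1},
    "c": {"a": 2, "b": 1, "c": 0},
}
label_s = {"a": 1, "c": 2}  # s is 1 hop from a and 2 hops from c
label_t = {"b": 1}          # t is 1 hop from b

bound = upper_bound(label_s, label_t, highway)
print(bound)
```

Because each label holds only a handful of entries, the double loop touches a small number of landmark pairs per query.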

4.3. Distance-Bounded Shortest Path Search

We conduct a bidirectional search on the sparsified graph $G[V \setminus R]$, bounded by the upper bound $d^{\top}_{st}$ obtained from the highway cover distance labelling. For a pair of vertices $\{s, t\}$, we run breadth-first searches from $s$ and $t$ simultaneously (Hayashi et al., 2016). Algorithm 2 shows the pseudo-code of our distance-bounded shortest path search algorithm. We use two sets of vertices $P_s$ and $P_t$ to keep track of the vertices visited from $s$ and $t$. We use two queues $Q_s$ and $Q_t$ to conduct a forward search from $s$ and a reverse search from $t$. Furthermore, we use two integers $d_s$ and $d_t$ to maintain the current distances from $s$ and $t$, respectively.

During initialization, we set $P_s$ and $P_t$ to $\{s\}$ and $\{t\}$, and enqueue $s$ and $t$ into $Q_s$ and $Q_t$, respectively. In each iteration, we increment $d_s$ or $d_t$ and expand $P_s$ or $P_t$ by running either a forward search (FS) or a reverse search (RS), as long as $P_s$ and $P_t$ have no common vertex, $d_s + d_t$ is less than the upper bound $d^{\top}_{st}$, and $Q_s$ and $Q_t$ are not empty. In the forward search from $s$, we examine the neighbors of each vertex $v \in Q_s$. Suppose we are visiting a neighbor $w$ of $v$. If $w$ is included in the vertex set $P_t$, then we have found a shortest path from $s$ to $t$ through $w$ of length $d_s + d_t + 1$, because the reverse search from $t$ has already visited $w$ with distance $d_t$. At this stage, we return $d_s + d_t + 1$ as the answer since we already know $d_s + d_t + 1 \le d^{\top}_{st}$. Otherwise, we add $w$ to $P_s$ and enqueue $w$ into a new queue $Q_{new}$. When we cannot find the shortest distance in the iteration, we replace $Q_s$ with $Q_{new}$, increase $d_s$ by 1, and check whether $d_s + d_t = d^{\top}_{st}$. If it holds, then we return $d^{\top}_{st}$, since no shorter path exists in the sparsified graph.

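Input: sparsified graph $G[V \setminus R]$, vertices $s$ and $t$, upper bound $d^{\top}_{st}$
Output: $d_{st}$
$P_s \leftarrow \{s\}$, $P_t \leftarrow \{t\}$, $d_s \leftarrow 0$, $d_t \leftarrow 0$
Enqueue $s$ to $Q_s$, $t$ to $Q_t$
while $Q_s$ and $Q_t$ are not empty do
      if the forward search is selected for expansion then
            found $\leftarrow$ FS($Q_s$)
      else
            found $\leftarrow$ RS($Q_t$)
      end if
      if found = true then
            return $d_s + d_t + 1$
      else if $d_s + d_t = d^{\top}_{st}$ then
            return $d^{\top}_{st}$
      end if
end while
return $d^{\top}_{st}$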
Algorithm 2 Distance-Bounded Shortest Path Search
Figure 5. An illustration of the distance-bounded shortest path search algorithm (Hayashi et al., 2016): (a) shows the sparsified graph after removing three landmarks from the graph in Figure 2(a); (b) shows an example of computing the bounded distance between vertices 2 and 11 as highlighted in yellow, and green vertices denote the visited vertices in the forward and reverse searches.
Example 4.3.

In Figure 5(b), the upper distance bound between vertices 2 and 11 is 3, as computed in Example 4.2. Suppose that we run BFSs from vertices 2 and 11, respectively. First, a forward search from 2 enqueues its neighbors 7, 12 and 14 into $Q_s$ and increases $d_s$ by 1. Then a reverse search from 11 enqueues 4 and 10 into $Q_t$ and also sets $d_t$ to 1. At this stage, we have not found any common vertex between $P_s$ and $P_t$, and $d_s + d_t = 2$, which is less than the upper bound 3. Therefore, we continue with another search step, which enqueues 5 and increases the corresponding distance counter to 2. Now we have $d_s + d_t = 3$, reaching the upper bound; hence we do not need to continue our search.
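The search described in this section can be sketched in Python as follows. This is a hedged illustration, not the authors' C++ code: the name `bounded_distance`, the adjacency-dictionary input, the toy path graph, and the rule of expanding the side with the smaller frontier are assumptions. The loop stops as soon as the next possible meeting could not be shorter than the bound, so the function returns the smaller of the true distance in the given graph and the bound.

```python
from math import inf


def bounded_distance(adj, s, t, upper):
    """Bidirectional BFS over `adj`, the adjacency dict of the sparsified graph.

    Returns the s-t distance if it is smaller than `upper`, otherwise `upper`.
    """
    if s == t:
        return 0
    seen = [{s}, {t}]      # vertices visited by the forward / reverse search
    frontier = [[s], [t]]  # current BFS level of each search
    depth = [0, 0]         # number of levels fully expanded on each side
    # A meeting found in the next expansion has length depth[0] + depth[1] + 1.
    while frontier[0] and frontier[1] and depth[0] + depth[1] + 1 < upper:
        # Assumption: expand the side whose frontier is currently smaller.
        side = 0 if len(frontier[0]) <= len(frontier[1]) else 1
        other = 1 - side
        next_level = []
        for v in frontier[side]:
            for w in adj.get(v, ()):
                if w in seen[other]:
                    return depth[0] + depth[1] + 1
                if w not in seen[side]:
                    seen[side].add(w)
                    next_level.append(w)
        frontier[side] = next_level
        depth[side] += 1
    return upper


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4 plus an isolated vertex 5.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
print(bounded_distance(path, 0, 4, inf))  # unbounded search
print(bounded_distance(path, 0, 4, 3))    # search cut off by the bound
```

Passing the label-based upper bound as `upper` means the traversal never explores deeper than the bound allows, which is what keeps the online part of a query cheap.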

4.4. Correctness

The correctness of our querying framework can be proven based on the following two lemmas. More specifically, Lemma 4.4 can be derived from the highway cover property and the definition of $d^{\top}_{st}$. Lemma 4.5 can be proven from the properties of shortest paths and the definition of the sparsified graph $G[V \setminus R]$.

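Lemma 4.4.

For a highway cover distance labelling $L$ over $G$, we have $d^{\top}_{st} \ge d_G(s, t)$ for any two vertices $s$ and $t$ of $G$, where $d^{\top}_{st}$ is computed using $L(s)$ and $L(t)$.

Lemma 4.5.

For any two vertices $\{s, t\} \subseteq V \setminus R$, if there is a shortest path between $s$ and $t$ in $G$ that does not include any vertex in $R$, then $d_{G[V \setminus R]}(s, t) = d_G(s, t)$ holds.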

Thus, the following theorem holds:

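Theorem 4.6.

Let $G$ be a graph, $H$ a highway over $G$ and $L$ a highway cover distance labelling. Then, for any two vertices $\{s, t\} \subseteq V \setminus R$, the querying framework over $L$ yields $d_G(s, t)$.

Proof.

We consider two cases: (1) Some shortest path between $s$ and $t$ contains at least one landmark. In this case, by Lemma 4.4 and the definition of the highway cover property, we have $d^{\top}_{st} = d_G(s, t)$. (2) No shortest path between $s$ and $t$ contains a landmark. By Lemma 4.5, we have $d_{G[V \setminus R]}(s, t) = d_G(s, t)$. ∎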

5. Optimization Techniques

In this section, we discuss optimization techniques for label construction, label compression, and query processing.

| Dataset | Network | Type | $|V|$ | $|E|$ | $|E|/|V|$ | avg. deg | max. deg | $|G|$ | Sources |
|---|---|---|---|---|---|---|---|---|---|
| Skitter | computer | undirected | 1.7M | 11M | 6.5 | 13.081 | 35455 | 85 MB | (the Koblenz Network Collection, 2017) |
| Flickr | social | undirected | 1.7M | 16M | 9.1 | 18.133 | 27224 | 119 MB | (the Koblenz Network Collection, 2017) |
| Hollywood | social | undirected | 1.1M | 114M | 49.5 | 98.913 | 11467 | 430 MB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| Orkut | social | undirected | 3.1M | 117M | 38.1 | 76.281 | 33313 | 894 MB | (the Koblenz Network Collection, 2017) |
| enwiki2013 | social | directed | 4.2M | 101M | 21.9 | 43.746 | 432260 | 701 MB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| LiveJournal | social | directed | 4.8M | 69M | 8.8 | 17.679 | 20333 | 327 MB | (the Koblenz Network Collection, 2017) |
| Indochina | web | directed | 7.4M | 194M | 20.4 | 40.725 | 256425 | 1.1 GB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| it2004 | web | directed | 41M | 1.2B | 24.9 | 49.768 | 1326744 | 7.7 GB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| Twitter | social | directed | 42M | 1.5B | 28.9 | 57.741 | 2997487 | 9.0 GB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| Friendster | social | undirected | 66M | 1.8B | 22.5 | 45.041 | 4006 | 13 GB | (Leskovec and Krevl, 2015) |
| uk2007 | web | directed | 106M | 3.7B | 31.4 | 62.772 | 979738 | 25 GB | (Boldi and Vigna, 2004; Boldi et al., 2011) |
| ClueWeb09 | computer | directed | 2B | 8B | 5.98 | 11.959 | 599981958 | 55 GB | (Rossi and Ahmed, 2015) |

Table 1. Datasets, where $|G|$ denotes the size of a graph with each edge appearing in the forward and reverse adjacency lists and being represented by 8 bytes.

5.1. Label Construction

A technique called Bit-Parallelism (BP) has been previously used in several methods (Akiba et al., 2013; Hayashi et al., 2016) to speed up the label construction process. The key idea of BP is to perform BFSs from a given landmark $r$ and up to 64 of its neighbors simultaneously, and encode the relative distances (-1, 0 or 1) of these neighbors w.r.t. the shortest paths between $r$ and each vertex $v$ into a 64-bit unsigned integer. In the work (Akiba et al., 2013), BP was applied to construct bit-parallel labels from initial vertices without pruning, which aimed to leverage the information from these bit-parallel labels to cover more shortest paths between vertices. Then, both bit-parallel labels and normal labels are constructed in the pruned BFSs. The work in (Hayashi et al., 2016) also used BP to construct thousands of bit-parallel shortest-path trees (SPTs) because it is very costly to construct thousands of normal SPTs in memory owing to their prohibitively large space requirements and very long construction time.

In our work, we develop a simple yet rigorous parallel algorithm (HL-P) which can run parallel BFSs from multiple landmarks (depending on the number of processors) to construct labelling in an extremely efficient way for massive networks, with much less time as will be demonstrated in our experiments.
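The reason such a scheme parallelizes well is that the BFS rooted at one landmark does not depend on the BFS rooted at another. The sketch below illustrates only this independence and makes several assumptions: it runs plain (unpruned) BFSs rather than the labelling BFSs of HL-P, the function names and the toy ring graph are hypothetical, and Python threads are used purely for illustration, since CPython's global interpreter lock prevents real speed-ups for this kind of CPU-bound work.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bfs_distances(adj, root):
    """Plain BFS: hop distance from `root` to every reachable vertex."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def distances_from_landmarks(adj, landmarks, workers=4):
    """Run one independent BFS per landmark on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: bfs_distances(adj, r), landmarks)
        return dict(zip(landmarks, results))


# Hypothetical toy graph: a ring of six vertices with two landmarks.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
dist = distances_from_landmarks(ring, [0, 3], workers=2)
print(dist[0][3], dist[3][5])
```

Each worker only reads the shared graph and writes to its own distance map, so no locking is needed in this sketch.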

5.2. Label Compression

The choice of the data structure for labels may significantly affect the performance of index size and memory usage. As noted in (Li et al., 2017), some works (Abraham et al., 2012; Delling et al., 2014) did not elaborate on what data structure they have used for representing labels. Nonetheless, for the works that are most relevant to ours, such as FD (Hayashi et al., 2016) and PLL (Akiba et al., 2013), they used 32-bit integers to represent vertices and 8-bit integers to represent distances for normal labels. In addition to this, they also used 64-bits to encode the distances from a landmark to up to 64 of its neighbors in their shortest paths to other vertices. Since our approach only selects a very small number of landmarks to construct the highway cover labelling (usually no more than 100 landmarks), we may use 8 bits to represent landmarks and another 8 bits to store distances for labels. In order to fairly compare methods from different aspects, we have implemented our methods using both 32 bits and 8 bits for representing vertices in labels. However, different from the BP technique that uses 64-bits to encode the distance information of up to 64 neighbours of a landmark, our parallel algorithm (HL-P) does not use a different data structure for labels constructed in parallel BFSs.
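The space difference between the two representations can be illustrated with Python's `struct` module. This is only a sketch under assumptions: the helper names `pack_label` and `unpack_label` and the sample entries are hypothetical, and the byte layout (one landmark identifier followed by one 8-bit distance, without padding) is chosen for illustration rather than taken from the authors' implementation.

```python
import struct


def pack_label(entries, landmark_code):
    """Serialize (landmark, distance) pairs.

    landmark_code: "B" for an 8-bit landmark id, "I" for a 32-bit id.
    Distances are always stored as 8-bit unsigned integers.
    """
    fmt = "<" + landmark_code + "B"  # "<" disables alignment padding
    return b"".join(struct.pack(fmt, r, d) for r, d in entries)


def unpack_label(blob, landmark_code):
    """Inverse of pack_label."""
    fmt = "<" + landmark_code + "B"
    return [tuple(pair) for pair in struct.iter_unpack(fmt, blob)]


# Hypothetical label with three entries (landmark id, distance).
entries = [(0, 3), (7, 2), (19, 5)]
small = pack_label(entries, "B")  # 2 bytes per entry
wide = pack_label(entries, "I")   # 5 bytes per entry
print(len(small), len(wide))
```

The 8-bit variant is only valid while landmark identifiers and distances both fit in the range 0 to 255, which matches the setting of at most about 100 landmarks described above.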

5.3. Query Processing

We show that computing the upper bound can be optimized based on the observation, captured by the following lemma.

Lemma 5.1.

For a highway cover distance labelling $L$ over $G$ and any two vertices $s$ and $t$, if a landmark $r$ appears in both $L(s)$ and $L(t)$, then $\delta_L(r, s) + \delta_L(r, t) \le \delta_L(r, s) + \delta_H(r, r') + \delta_L(r', t)$ holds for any other landmark $r'$.

Proof.

By the definition of the highway cover property, we know that $r'$ is not in the shortest path between $r$ and $t$. Then by triangle inequality in Equation 1, this lemma can be proven. ∎

Thus, in order to efficiently compute the upper bound $d^{\top}_{st}$, for any landmark $r$ that appears in both $L(s)$ and $L(t)$, we compute the $r$-constrained shortest path distance between $s$ and $t$ using Equation 2, while for a landmark that only appears in one of $L(s)$ and $L(t)$, we use Equation 4 to calculate the $r$-constrained shortest path distance between $s$ and $t$. This leads to more efficient query computation when landmarks appear in the labels of both vertices.
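One possible reading of this optimization is sketched below in Python. It is an illustration under assumptions, not the authors' code: the function names, the dictionary-based labels and the toy distances are hypothetical, and the sketch assumes that labels and highway store exact distances, so that by the triangle inequality a landmark shared by both labels never needs to be paired with a different landmark.

```python
from math import inf


def upper_bound_naive(label_s, label_t, highway):
    """Evaluate every landmark pair (r_i, r_j) across the two labels."""
    return min(
        (ds + highway[ri][rj] + dt
         for ri, ds in label_s.items()
         for rj, dt in label_t.items()),
        default=inf,
    )


def upper_bound_fast(label_s, label_t, highway):
    """Use the two-hop sum for shared landmarks, pair up only the rest."""
    common = label_s.keys() & label_t.keys()
    best = min((label_s[r] + label_t[r] for r in common), default=inf)
    for ri in label_s.keys() - common:
        for rj in label_t.keys() - common:
            best = min(best, label_s[ri] + highway[ri][rj] + label_t[rj])
    return best


# Hypothetical toy data: landmark "b" appears in both labels.
highway = {
    "a": {"a": 0, "b": 1, "c": 2},
    "b": {"a": 1, "b": 0, "c": 2},
    "c": {"a": 2, "b": 2, "c": 0},
}
label_s = {"a": 1, "b": 1}
label_t = {"b": 2, "c": 1}

fast = upper_bound_fast(label_s, label_t, highway)
naive = upper_bound_naive(label_s, label_t, highway)
print(fast, naive)
```

On this toy input the fast variant evaluates one two-hop sum and one landmark pair, whereas the naive variant evaluates four pairs, and both return the same bound.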

6. Experiments

To compare the proposed method with baseline approaches, we implemented our method in C++11 using STL libraries and compiled it using gcc 5.5.0 with the -O3 option. All experiments were run on a Linux server (64 AMD Opteron(tm) 6376 processors at 2.30GHz and 512GB of main memory), using a single thread for the sequential version of the proposed method and up to 64 threads for the parallel version.

Figure 6. Distance distribution of 100,000 random pairs of vertices on all the datasets.
| Dataset | CT[s] HL-P | CT[s] HL | CT[s] FD | CT[s] PLL | CT[s] IS-L | QT[ms] HL | QT[ms] FD | QT[ms] PLL | QT[ms] IS-L | QT[ms] Bi-BFS | ALS HL | ALS FD | ALS PLL | ALS IS-L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Skitter | 2 | 13 | 30 | 638 | 1042 | 0.067 | 0.043 | 0.008 | 3.556 | 3.504 | 12 | 20+64 | 138+50 | 51 |
| Flickr | 2 | 14 | 41 | 1330 | 8359 | 0.015 | 0.028 | 0.01 | 33.760 | 4.155 | 10 | 20+64 | 290+50 | 50 |
| Hollywood | 3 | 17 | 107 | 31855 | DNF | 0.047 | 0.075 | 0.051 | - | 6.956 | 12 | 20+64 | 2206+50 | - |
| Orkut | 10 | 62 | 366 | DNF | DNF | 0.224 | 0.251 | - | - | 21.086 | 11 | 20+64 | - | - |
| enwiki2013 | 9 | 77 | 308 | 22080 | DNF | 0.190 | 0.131 | 0.027 | - | 19.423 | 10 | 20+64 | 471+50 | - |
| LiveJournal | 9 | 77 | 166 | DNF | 20583 | 0.088 | 0.111 | - | 56.847 | 17.264 | 13 | 20+64 | - | 69 |
| Indochina | 8 | 50 | 144 | 9456 | DNF | 1.905 | 1.803 | 0.02 | - | 9.734 | 5 | 20+64 | 441+50 | - |
| it2004 | 66 | 304 | 1623 | DNF | DNF | 2.684 | 2.118 | - | - | 92.187 | 10 | 20+64 | - | - |
| Twitter | 133 | 1380 | 1838 | DNF | DNF | 1.424 | 0.432 | - | - | 426.949 | 14 | 20+64 | - | - |
| Friendster | 135 | 2229 | 9661 | DNF | DNF | 1.091 | 1.435 | - | - | 534.576 | 19 | 20+64 | - | - |
| uk2007 | 110 | 1124 | 6201 | DNF | DNF | 11.841 | 18.979 | - | - | 355.688 | 8 | 20+64 | - | - |
| ClueWeb09 | 4236 | 28124 | DNF | DNF | DNF | 0.309 | - | - | - | - | 2 | - | - | - |

Table 2. Comparison of construction times and query times between our methods, i.e., HL-P and HL, and the state-of-the-art methods, where CT denotes the CPU clock time in seconds for labelling construction, QT denotes the average query time in milliseconds, and ALS denotes the average number of entries per label.
| Dataset | HL(8) | HL | FD | PLL | IS-L |
|---|---|---|---|---|---|
| Skitter | 42MB | 102MB | 202MB | 2.5GB | 507MB |
| Flickr | 34MB | 81MB | 178MB | 3.7GB | 679MB |
| Hollywood | 28MB | 67MB | 293MB | 13GB | - |
| Orkut | 70MB | 170MB | 756MB | - | - |
| enwiki2013 | 83MB | 200MB | 743MB | 12GB | - |
| LiveJournal | 123MB | 299MB | 778MB | - | 3.8GB |
| Indochina | 81MB | 192MB | 999MB | 21GB | - |
| it2004 | 855MB | 2GB | 5.6GB | - | - |
| Twitter | 1.2GB | 2.8GB | 4.8GB | - | - |
| Friendster | 2.5GB | 5.2GB | 11.8GB | - | - |
| uk2007 | 1.8GB | 4.3GB | 14.1GB | - | - |
| ClueWeb09 | 4.7GB | 9GB | - | - | - |

Table 3. Comparison of labelling sizes between our methods, i.e., HL(8) and HL, and the state-of-the-art methods.

6.1. Datasets

In our experiments, we used 12 large-scale real-world complex networks, which are detailed in Table 1. These networks have vertices and edges ranging from millions to billions. Among them, the largest network is ClueWeb09, which has 2 billion vertices and 8 billion edges. We included this network in our experiments for the purpose of evaluating the robustness and scalability of the proposed method. In previous works, the largest dataset that has been reported is uk2007, which has only around 100 million vertices and 3.7 billion edges. We treated all these networks as undirected and unweighted graphs.

To investigate the query time of finding the distance between two vertices, we randomly sampled 100,000 pairs of vertices from all pairs of vertices in each network, i.e., from $V \times V$. The distance distribution of these 100,000 randomly sampled pairs of vertices is shown in Figure 6(a)-6(b), from which we can confirm that most pairs of vertices in these networks have a small distance ranging from 2 to 8.

6.2. Baseline Methods

We compared our proposed method with three state-of-the-art methods. Two of these methods, namely fully dynamic (FD) (Hayashi et al., 2016) and IS-L (Fu et al., 2013), combine a distance labelling algorithm with a graph traversal algorithm for distance queries on complex networks. The third one is pruned landmark labelling (PLL) (Akiba et al., 2013) which is completely based on distance labelling to answer distance queries. Besides these, there are a number of other methods for answering distance queries, such as HDB (Jiang et al., 2014), RXL and CRXL (Delling et al., 2014), HCL (Jin et al., 2012), HHL (Abraham et al., 2012) and TEDI (Wei, 2010). However, since the experimental results of the previous works (Hayashi et al., 2016; Akiba et al., 2013) have shown that FD outperforms HDB, RXL and CRXL, and PLL outperforms HCL, HHL and TEDI, we omit the comparison with these methods.

In our experiments, the implementations of the baseline methods FD, IS-L and PLL were provided by their authors and were all written in C++. We used the parameter settings suggested by their authors. For instance, the number of landmarks is set to 20 for FD (Hayashi et al., 2016), the number of bit-parallel BFSs is set to 50 for PLL (Akiba et al., 2013), and the IS-L parameter is set to 6 for graphs larger than 1 million vertices (Fu et al., 2013).

6.3. Comparison with Baseline Methods

To evaluate the performance of our proposed approach, we compared our approach with the baseline methods in terms of the construction time of labelling, the size of labelling, and querying time. The experimental results are presented in Tables 2 and 3, where DNF denotes that a method did not finish in one day or ran out of memory. In order to make a consistent comparison with the baseline methods (Hayashi et al., 2016; Akiba et al., 2013; Fu et al., 2013), we chose top 20 vertices as landmarks after sorting based on decreasing order of their degrees, and also used 32-bit integers to represent vertices and 8-bit integers to represent distances.
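The degree-based landmark selection used in this comparison is simple to express in code. The following Python sketch is illustrative only: the function name `top_degree_landmarks` and the toy adjacency dictionary are assumptions, and tie-breaking between vertices of equal degree is left to the order in which vertices are stored.

```python
import heapq


def top_degree_landmarks(adj, k):
    """Return the k vertices with the highest degree, in decreasing order."""
    return heapq.nlargest(k, adj, key=lambda v: len(adj[v]))


# Hypothetical toy graph given as an adjacency dict.
graph = {
    0: [1, 2, 3, 4],
    1: [0],
    2: [0, 3],
    3: [0, 2, 4],
    4: [0, 3],
}
landmarks = top_degree_landmarks(graph, 2)
print(landmarks)
```

For the experiments described here, `k` would be 20; `heapq.nlargest` avoids sorting the full vertex set when `k` is small relative to the number of vertices.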

6.3.1. Construction Time

As shown in Table 2, our proposed method (HL) successfully constructed the distance labelling on all the datasets in significantly less time than the state-of-the-art methods. Compared with FD, our method is on average 5 times faster and produces results on all the datasets, whereas FD failed to construct labelling for the largest dataset, ClueWeb09. PLL failed for 7 out of 12 datasets, including Orkut and LiveJournal, which have fewer than 120 million edges, due to its prohibitively high preprocessing time and memory requirements for building labelling. IS-L failed to construct labelling for all the datasets that have more than 100 million edges, i.e., 9 out of 12 datasets, due to its very high cost for computing independent sets on massive networks. We can also see from Table 2 that the parallel version of our method (HL-P) is much faster than the sequential version (HL). Compared with FD, HL-P is 50-70 times faster on the two large datasets Friendster and uk2007. This confirms that our method can construct labelling very efficiently and scales to large networks with billions of vertices and edges.

Figure 7. (a)-(d) Construction times using our method HL under 10-50 landmarks on all the datasets; (e)-(g) Query times using our method HL under 10-50 landmarks on all the datasets.

6.3.2. Labelling Size

As we can see from Table 3, the labelling sizes of all the datasets constructed by the proposed method are significantly smaller than the labelling sizes of FD and much smaller than those of PLL and IS-L. Specifically, our labelling sizes using a 32-bit representation of vertices (HL) are 2-5 times smaller than FD except for ClueWeb09 (as discussed before, FD failed to construct labelling for ClueWeb09), 7 times smaller than IS-L on Skitter, Flickr and LiveJournal, and more than 60 times smaller than PLL for Skitter, Flickr, Hollywood, enwiki2013 and Indochina. The compressed version of our method that uses an 8-bit representation of vertices (i.e., HL(8)) produces even smaller index sizes than the uncompressed version (HL). It is important to note that the labelling sizes of almost all the datasets are also significantly smaller than the original sizes of the datasets shown in Table 1. This also shows that our method is highly scalable on large networks in terms of labelling sizes.

6.3.3. Query Time

The average query times of our method (HL) are comparable with FD and PLL and faster than IS-L. In particular, the average query time of our method on Hollywood is faster than both FD and PLL. This is due to a very small average label size (i.e., 12) compared with FD and PLL (i.e., 20+64 and 2206+50, respectively) and a very small average distance. The average query time of HL on Twitter is 3 times slower than FD. This may be due to the large portion of pairs covered by FD, as shown in Figure 9, which contributes towards an effective bounded traversal on the sparsified network, since the landmarks of Twitter have very high degrees and the average distance is also very small. Moreover, the average query times of HL and FD on Indochina, it2004, Friendster and uk2007 are more than 1ms due to comparatively larger average distances than in the other datasets, as shown in Figure 6(b). Note that none of the baseline methods is scalable enough to produce results for ClueWeb09, and the average query time of our method HL on ClueWeb09 is small because of a very large portion of covered pairs and a small average label size. We also report the average query time of the online bidirectional BFS algorithm (Bi-BFS) using 1000 randomly selected pairs of vertices in Table 2. Bi-BFS has considerably longer query times, which are not practical for applications that perform distance queries in real time.

6.4. Performance under Varying Landmarks

We have also evaluated the performance of our method (HL) by varying the number of landmarks between 10 and 50, which are again selected based on highest degrees.

6.4.1. Construction Time

The construction times of our method HL against different numbers of landmarks (from 10 to 50) are shown in Figure 7(a)-7(d). We can see that the construction times are linear in the number of landmarks, which confirms the scalability of our method. In Figure 7(a)-7(b), our method is able to construct labelling for 7 datasets under 50 landmarks within 20 seconds to 2 minutes, which is not possible with any state-of-the-art method. In Figure 7(c), the construction time of Friendster using 50 landmarks is 3 times faster, and the construction time of uk2007 is 4 times faster, than FD using only 20 landmarks as shown in Table 2. Figure 7(d) shows the construction time for ClueWeb09, which has 2 billion vertices and 8 billion edges. The significant improvement in construction time allows us to compute labelling for a large number of landmarks, leading to better pair coverage ratios that tighten upper distance bounds (further discussed in Section 6.4.4).

Figure 8. Labelling sizes using our method HL under 10-50 landmarks and using FD on all the datasets.
Figure 9. Pair coverage ratios using our method HL under 10-50 landmarks and using FD on all the datasets.

6.4.2. Labelling Size

Figure 8 shows the labelling sizes of HL using 10, 20, 30, 40 and 50 landmarks on all the datasets, and of FD using only 20 landmarks on all the datasets except for ClueWeb09 (as discussed before, FD failed to construct labelling for ClueWeb09). It can be seen that the labelling sizes of HL increase linearly with the number of landmarks, and even the labelling sizes of HL using 50 landmarks are almost always smaller than the labelling sizes constructed by FD using only 20 landmarks. This reduction in labelling sizes saves space and memory, and thus makes our method scalable on large networks.

6.4.3. Query Time

Figure 7(e)-7(g) shows the impact of using different numbers of landmarks between 10 and 50 on the average query time of our method. The average query times either decrease or remain the same when the number of landmarks increases, except for Orkut when using 30 landmarks and for Friendster when using more than 20 landmarks. In particular, on Friendster, labelling sizes are very large as shown in Figure 8 and the fraction of covered pairs (i.e., pair coverage ratio) is very small as shown in Figure 9, which may have slowed down our query processing due to a longer time for computing upper distance bounds and ineffective use of bounded-distance traversal.

6.4.4. Pair Coverage

Figure 9 presents the ratios of pairs of vertices covered by at least one landmark (i.e., pair coverage ratios) in HL using 10-50 landmarks and in FD using 20 landmarks. We can observe that the pair coverage ratios for HL increase when the number of landmarks increases, and 40 turns out to be the better choice for the number of landmarks on most of the datasets. Specifically, pair coverage ratios on Orkut, enwiki2013, Indochina and uk2007 with 40 landmarks are good, resulting in better query times than using 20 landmarks, as shown in Figure 7. On datasets such as Hollywood and it2004, 30 landmarks are a better option than 40 landmarks because the pair coverage ratios and query times differ only slightly from those with 40 landmarks, while the labelling sizes are smaller. The pair coverage ratios of FD are greater than those of HL on all the datasets except for ClueWeb09, which may be the reason behind its better query times for some datasets as shown in Table 2. Note that, on ClueWeb09, we obtain almost one hundred percent pair coverage due to its very high degree landmarks.

7. Related Work

A naive solution for exact shortest-path distance computation is to run the Dijkstra search for weighted graphs or BFS for unweighted graphs, from a source vertex to a destination vertex (Tarjan, 1983). To improve search efficiency, a bidirectional scheme can be used to run two such searches: one from the source vertex and the other from the destination vertex (Pohl, 1971). Later on, Goldberg et al. (Goldberg and Harrelson, 2005) combined the bidirectional search technique with the A* algorithm to further improve the search performance. In their method, they precomputed labelling based on landmarks to estimate the lower bounds, and used that estimate with a bidirectional A* search for efficient computation of shortest-path distances. However, this method is known to work only for road networks and does not scale well on complex networks (Hayashi et al., 2016).

To efficiently answer exact shortest-path distance queries on graphs, labelling-based methods have been developed with great success (Akiba et al., 2013; Abraham et al., 2012; Fu et al., 2013; Jin et al., 2012; Abraham et al., 2011; Li et al., 2017). Most of them construct a labeling based on the idea of 2-hop cover (Cohen et al., 2003). It has also been shown that computing a minimal 2-hop cover labeling is NP-hard (Abraham et al., 2012; Cohen et al., 2003). In (Abraham et al., 2011), the authors proposed a hub-based labeling algorithm (HL) which constructs hub labelling by processing contraction hierarchies (CH) and is among the fastest known algorithms for distance queries in road networks. However, the method is not feasible for complex networks as reported by the same authors and they thus proposed a hierarchical hub-labeling (HHL) algorithm for complex networks in (Abraham et al., 2012). In this work, a top-down method was used to maintain a shortest-path tree for every vertex in order to indicate all uncovered shortest-paths at each vertex. Due to very high storage and computation requirements, the method is also not scalable for handling large graphs. Another method called Highway Centric Labeling (HCL) was proposed by Jin et al. (Jin et al., 2012) which exploits highway structure of a graph. This method aimed to find a spanning tree which can assist in optimal distance labelling and used that spanning tree as a highway to compute a highway-based 2-hop labelling for fast distance computation. After that, in (Akiba et al., 2013), Akiba et al. proposed the pruned landmark labeling (PLL) method which precomputes a distance-aware 2-hop cover index by performing a pruned breadth-first search (BFS) from every vertex. The idea is to prune vertices whose distance information can be obtained using a partially available 2-hop index constructed via previous BFSs. 
This work helps to achieve low construction cost and smaller index size due to reduced search space on million-scale networks. It has been shown that PLL outperforms other state-of-the-art methods available at the time of publication, including HHL (Abraham et al., 2012), HCL (Jin et al., 2012) and TEDI (Wei, 2010). However, PLL is still not feasible for constructing 2-hop cover indices for billion-scale networks due to a very high memory requirement for labelling construction.

Fu et al. (Fu et al., 2013) proposed IS-Label (IS-L) which gained significant scalability in precomputing 2-hop cover distance labellings for large graphs with hundreds of millions of vertices and edges. IS-L uses the notion of an independent set of vertices in a graph. First, it computes an independent set of vertices from a graph, then it constructs a graph by removing the independent set of vertices from the previous graph recursively and augments edges that preserve distance information after the removal of the independent set of vertices. All the vertices in the remaining graph preserve their distance information to/from each other. Generally, IS-L is regarded as a hybrid method that combines distance labelling with graph traversal for complex networks (Li et al., 2017). Following the same line of thought, very recently, Hayashi et al. (Hayashi et al., 2016) proposed a method to accelerate shortest-path distance computation on large-scale complex networks. To the best of our knowledge, this work is most closely related to our work presented in this paper. The key idea of the method in (Hayashi et al., 2016) is to select a small set of landmarks $R$ and precompute shortest-path trees (SPTs) rooted at each $r \in R$. Given any two vertices $s$ and $t$, it first computes the upper bound by taking the minimum length among the paths that pass through $R$. Then a bidirectional BFS from $s$ to $t$ is conducted on the subgraph $G[V \setminus R]$ to compute the shortest-path distances that do not pass through $R$, and the minimum of these two results is taken as the answer to an exact distance query. 
The experiments in (Hayashi et al., 2016) showed that this method can scale to graphs with millions of vertices and billions of edges, and outperforms the state-of-the-art exact methods PLL (Akiba et al., 2013), HDB (Jiang et al., 2014), RXL and CRXL (Delling et al., 2014) with significantly reduced construction time and index size, while the query times are higher but still remain among 0.01-0.06 for most of graphs with less than 5M vertices.

Although the method proposed in (Hayashi et al., 2016) has been tested on a large network with millions of vertices and billions of edges, it still fails to construct labelling on billion-scale networks in general, particularly with billions of vertices. In contrast, our proposed method not only constructs labellings linearly with the number of landmarks in large networks with billions of vertices, but also enables the sizes of labellings to be significantly smaller than the original network sizes. In addition to these, the deterministic nature of labelling allows us to achieve further gains in computational efficiency using parallel BFSs over multiple landmarks, which is highly scalable for handling billion-scale networks.

8. Conclusion

We have presented a scalable solution for answering exact shortest path distance queries on very large (billion-scale) complex networks. The proposed method is based on a novel labelling algorithm that can scale to graphs at the billion-scale, and a querying framework that combines a highway cover distance labelling with distance-bounded searches to enable fast distance computation. We have proven that the proposed labelling algorithm can construct HWC-minimal labellings that are independent of the ordering of landmarks, and have further developed a parallel labelling method to speed up the labelling construction process by conducting BFSs simultaneously for multiple landmarks. The experimental results showed that the proposed methods significantly outperform the state-of-the-art methods. For future work, we plan to investigate landmark selection strategies for further improving the performance of labelling methods.

References

  • Abraham et al. (2011) Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck. 2011. A hub-based labeling algorithm for shortest paths in road networks. In SEA. 230–241.
  • Abraham et al. (2012) Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck. 2012. Hierarchical hub labelings for shortest paths. In ESA. 24–35.
  • Akiba et al. (2013) Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. 2013. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In ACM SIGMOD. 349–360.
  • Akiba et al. (2012) Takuya Akiba, Christian Sommer, and Ken-ichi Kawarabayashi. 2012. Shortest-path queries for complex networks: exploiting low tree-width outside the core. In EDBT. 144–155.
  • Backstrom et al. (2006) Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: membership, growth, and evolution. In ACM SIGKDD. 44–54.
  • Boldi et al. (2011) Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks. In WWW. 587–596.
  • Boldi and Vigna (2004) Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I: Compression Techniques. In WWW. 595–601.
  • Chang et al. (2012) Lijun Chang, Jeffrey Xu Yu, Lu Qin, Hong Cheng, and Miao Qiao. 2012. The exact distance to destination in undirected world. The VLDB Journal 21, 6 (2012), 869–888.
  • Cohen et al. (2003) Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. 2003. Reachability and distance queries via 2-hop labels. SIAM J. Comput. 32, 5 (2003), 1338–1355.
  • Delling et al. (2014) Daniel Delling, Andrew V Goldberg, Thomas Pajor, and Renato F Werneck. 2014. Robust distance queries on massive networks. In ESA. 321–333.
  • Freeman (1977) Linton C Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry (1977), 35–41.
  • Fu et al. (2013) Ada Wai-Chee Fu, Huanhuan Wu, James Cheng, and Raymond Chi-Wing Wong. 2013. Is-label: an independent-set based labeling scheme for point-to-point distance querying. VLDB 6, 6 (2013), 457–468.
  • Goldberg and Harrelson (2005) Andrew V Goldberg and Chris Harrelson. 2005. Computing the shortest path: A search meets graph theory. In SODA. 156–165.
  • Gubichev et al. (2010) Andrey Gubichev, Srikanta Bedathur, Stephan Seufert, and Gerhard Weikum. 2010. Fast and accurate estimation of shortest paths in large graphs. In CIKM. 499–508.
  • Hayashi et al. (2016) Takanori Hayashi, Takuya Akiba, and Ken-ichi Kawarabayashi. 2016. Fully Dynamic Shortest-Path Distance Query Acceleration on Massive Networks. In CIKM. 1533–1542.
  • Jiang et al. (2014) Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong, and Yanyan Xu. 2014. Hop doubling label indexing for point-to-point distance querying on scale-free networks. VLDB 7, 12 (2014), 1203–1214.
  • Jin et al. (2012) Ruoming Jin, Ning Ruan, Yang Xiang, and Victor Lee. 2012. A highway-centric labeling approach for answering distance queries on large sparse graphs. In ACM SIGMOD. 445–456.
  • Leskovec and Krevl (2015) Jure Leskovec and Andrej Krevl. 2015. SNAP Datasets:Stanford Large Network Dataset Collection. (2015).
  • Li et al. (2017) Ye Li, Man Lung Yiu, Ngai Meng Kou, et al. 2017. An experimental study on hub labeling based shortest path algorithms. VLDB 11, 4 (2017), 445–457.
  • Maniu and Cautis (2013) Silviu Maniu and Bogdan Cautis. 2013. Network-aware search in social tagging applications: instance optimality versus efficiency. In CIKM. 939–948.
  • Pohl (1971) Ira Pohl. 1971. Bi-derectional search. Machine intelligence 6 (1971), 127–140.
  • Potamias et al. (2009) Michalis Potamias, Francesco Bonchi, Carlos Castillo, and Aristides Gionis. 2009. Fast shortest path distance estimation in large networks. In CIKM. 867–876.
  • Qiao et al. (2014) Miao Qiao, Hong Cheng, Lijun Chang, and Jeffrey Xu Yu. 2014. Approximate shortest distance computing: A query-dependent local landmark scheme. IEEE TKDE 26, 1 (2014), 55–68.
  • Qin et al. (2017) Yongrui Qin, Quan Z Sheng, Nickolas JG Falkner, Lina Yao, and Simon Parkinson. 2017. Efficient computation of distance labeling for decremental updates in large dynamic graphs. WWW 20, 5 (2017), 915–937.
  • Rossi and Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI. http://networkrepository.com
  • Sabidussi (1966) Gert Sabidussi. 1966. The centrality index of a graph. Psychometrika 31, 4 (1966), 581–603.
  • Tarjan (1983) Robert Endre Tarjan. 1983. Data structures and network algorithms. Vol. 44. Siam.
  • the Koblenz Network Collection (2017) KONECT the Koblenz Network Collection. 2017. http://konect.uni-koblenz.de/networks/
  • Tretyakov et al. (2011) Konstantin Tretyakov, Abel Armas-Cervantes, Luciano García-Bañuelos, Jaak Vilo, and Marlon Dumas. 2011. Fast fully dynamic landmark-based estimation of shortest path distances in very large graphs. In CIKM. 1785–1794.
  • Ukkonen et al. (2008) Antti Ukkonen, Carlos Castillo, Debora Donato, and Aristides Gionis. 2008. Searching the wikipedia with contextual information. In CIKM. 1351–1352.
  • Vieira et al. (2007) Monique V Vieira, Bruno M Fonseca, Rodrigo Damazio, Paulo B Golgher, Davi de Castro Reis, and Berthier Ribeiro-Neto. 2007. Efficient search ranking in social networks. In CIKM. 563–572.
  • Wei (2010) Fang Wei. 2010. TEDI: efficient shortest path query answering on graphs. In ACM SIGMOD. 99–110.
  • Yahia et al. (2008) Sihem Amer Yahia, Michael Benedikt, Laks VS Lakshmanan, and Julia Stoyanovich. 2008. Efficient network aware search in collaborative tagging sites. VLDB 1, 1 (2008), 710–721.