Efficient Maintenance of Distance Labelling for Incremental Updates in Large Dynamic Graphs

02/17/2021 · by Muhammad Farhan, et al. · Australian National University

Finding the shortest path distance between an arbitrary pair of vertices is a fundamental problem in graph theory. A tremendous amount of research has been successfully attempted on this problem, most of which is limited to static graphs. Due to the dynamic nature of real-world networks, there is a pressing need to address this problem for dynamic networks undergoing changes. In this paper, we propose an online incremental method to efficiently answer distance queries over very large dynamic graphs. Our proposed method incorporates incremental update operations, i.e. edge and vertex additions, into a highly scalable framework of answering distance queries. We theoretically prove the correctness of our method and the preservation of labelling minimality. We have also conducted extensive experiments on 12 large real-world networks to empirically verify the efficiency, scalability, and robustness of our method.

1. Introduction

Given a very large graph with billions of vertices and edges, how efficiently can we find the shortest path distance between any two vertices? If such a graph is dynamically changing over time (e.g. inserting edges or vertices), how can we not only efficiently but also accurately find the shortest path distance between any two vertices? These questions are intimately related to distance queries on dynamic graphs. As one of the most fundamental operations on graphs, distance queries have a wide range of real-world applications that operate on increasingly large dynamic graphs, such as context-aware search in web graphs (ukkonen2008searching), social network analysis in social networks (vieira2007efficient; backstrom2006group), management of resources in computer networks (boccaletti2006complex), and so on. Many of these applications use distance queries as a building block to realise more complicated tasks, and require distance queries to be answered instantly, e.g. in the order of milliseconds.

Figure 1. Distribution of affected vertices by a single graph change in various networks, where the results for 1000 graph changes are sorted in descending order.

Previous studies have primarily focused on distance queries on static graphs (akiba2013fast; fu2013label; jin2012highway; abraham2011hub; abraham2012hierarchical; wei2010tedi; farhan2018highly), with little attention being paid to dynamics on graphs. To speed up query response time, a key technique is to precompute a data structure called distance labelling that satisfies certain properties such as 2-hop cover (cohen2003reachability), and then use this data structure to answer distance queries efficiently. However, when a graph dynamically changes, its distance labelling needs to be changed accordingly; otherwise, distance queries may yield overestimated distances. Although it is possible to recompute a distance labelling from scratch, this leads to inefficiency. As shown in Figure 1, only a fraction of vertices is typically affected by a single graph change in various real-world networks. Recomputing a distance labelling from scratch for each single change therefore not only wastes computing resources, but may also produce inaccurate query results during the recomputation. The question arising is thus: how can a distance labelling be changed efficiently and accurately on dynamic graphs in order to support distance queries?

In this paper, we aim to develop an online incremental method that can dynamically maintain a distance labelling on graphs being changed by edge and vertex insertions. Typically, real-world dynamic networks undergo insertions far more often than removals, and a plethora of such networks are large and frequently updated, primarily accommodating insertions (leskovec2007graph; viswanath2009evolution). Thus, an online incremental method for dynamic graphs should possess the following desirable characteristics: (1) time efficiency - it can answer distance queries and update the distance labelling efficiently (in the order of milliseconds); (2) space efficiency - it guarantees the minimum size of the distance labelling to reduce storage costs; (3) scalability - it can scale to very large networks with billions of vertices and edges.

Challenges. Designing online incremental methods for distance queries on dynamic graphs is known to be challenging (akiba2014dynamic). When an edge or a vertex is inserted into a graph, outdated and redundant entries may occur in the distance labelling. It has been reported that removing such entries is a complicated task (akiba2014dynamic) because affected vertices need to be precisely identified so as to update their labels without violating the original properties of a distance labelling, such as minimality. Further, although query time and update time are both critical for answering distance queries on dynamic graphs, it is not easy (if not impossible) to design a solution that is efficient in both. This requires us to find new insights into dynamic properties of a distance labelling, as well as a good trade-off between query time and update time. Last but not least, scaling distance queries to dynamic graphs with billions of vertices and edges is hard. Previous work (akiba2014dynamic; hayashi2016fully) mostly considered 2-hop labelling, which has very high space requirements and index construction time; as a result, query and update performance are dramatically degraded on large-scale dynamic graphs. Ideally, the labelling size of a graph should be much smaller than its original size. However, the state-of-the-art distance labelling technique, i.e. the pruned landmark labelling method (PLL) (akiba2013fast), still yields a distance labelling whose size is 20-30 times larger than the original size of a dataset.

Contributions. Our contributions are summarised as follows:

  • Our method overcomes the challenge of eliminating outdated and redundant distance entries. None of the previous studies have addressed this challenge because detecting such entries was considered too costly (akiba2014dynamic; d2019fully). When an edge or a vertex is inserted, previous studies only add new distance entries or modify existing distance entries. This, however, leads to an ever-increasing labelling size, particularly when a graph is frequently updated by newly added edges or vertices. Accordingly, both query performance and space efficiency deteriorate over time.

  • We prove the correctness of our proposed method and show that it preserves the desirable property of minimality on our distance labelling. Due to a property called highway cover (farhan2018highly), the minimal size of a distance labelling in this work is much smaller than the size of a 2-hop labelling in previous work (akiba2014dynamic; hayashi2016fully). Preserving minimality on a distance labelling thus improves space efficiency and query performance, as well as update performance. We also provide a complexity analysis of our proposed method.

  • We conducted experiments using 12 large real-world networks across different domains to show the efficiency, scalability and robustness of our method. In particular, our method can perform updates in under one second, on average, even on billion-scale networks, while still answering queries efficiently in the order of milliseconds and keeping the labelling size much smaller than the original graph size.

2. Related Work

Answering shortest-path distance queries in graphs has been an active research topic for many years. Traditionally, a distance query can be answered using Dijkstra's algorithm (tarjan1983data) on positively weighted graphs or breadth-first search (BFS) on unweighted graphs. However, these traditional algorithms fail to achieve the desired response time for distance queries on large graphs. Labelling-based methods have since emerged as an attractive way of accelerating response time to distance queries (cohen2003reachability; akiba2013fast; jin2012highway; fu2013label; abraham2012hierarchical; abraham2011hub; farhan2018highly), among which Akiba et al. (akiba2013fast) proposed pruned landmark labelling (PLL) to precompute a 2-hop cover distance labelling (cohen2003reachability). This method serves as the state of the art for labelling-based distance queries and can handle graphs with hundreds of millions of edges.

So far, only a few attempts have been made to study distance queries over dynamic graphs (akiba2014dynamic; hayashi2016fully), all of which are based on the idea of 2-hop distance labelling or its variants. Akiba et al. (akiba2014dynamic) studied the problem of updating a pruned landmark labelling for incremental updates (i.e. vertex additions and edge additions). This work however does not remove redundant entries in distance labels because the authors considered detecting such outdated entries to be too costly. This inevitably breaks the minimality of the pruned landmark labelling, leading to an ever-increasing labelling size and deteriorating query performance over time. To accelerate shortest-path distance queries on large networks, another line of research combines a partial distance labelling with online shortest-path searches. Hayashi et al. (hayashi2016fully) proposed a fully dynamic approach that selects a small set of landmarks and precomputes a shortest-path tree (SPT) rooted at each landmark. Then, an online search is conducted on a sparsified graph under an upper distance bound computed via the SPTs. Nevertheless, this method still fails to construct a labelling on networks with billions of vertices. Following the same line, a recent work by Farhan et al. (farhan2018highly) introduced a highway cover labelling method (HL), which can provide fast response time (milliseconds) for distance queries even on billion-scale graphs. However, this approach only works for static graphs.

3. Problem Formulation

Let $G = (V, E)$ be an undirected graph where $V$ is a set of vertices and $E$ is a set of edges. We denote by $N(v)$ the set of neighbors of a vertex $v \in V$, i.e. $N(v) = \{u \in V \mid (u, v) \in E\}$. Given two vertices $u$ and $v$ in $G$, the distance between $u$ and $v$, denoted as $d_G(u, v)$, is the length of the shortest path from $u$ to $v$. If there does not exist a path from $u$ to $v$, then $d_G(u, v) = \infty$. We use $P_G(u, v)$ to denote the set of all shortest paths between $u$ and $v$ in $G$. Given a graph $G = (V, E)$, an edge insertion is to add an edge $(a, b)$ into $G$ where $a, b \in V$ and $(a, b) \notin E$. Accordingly, a node insertion is to add a new node $v$ into $G$ together with a set of edge insertions that connect $v$ to existing vertices in $G$. The following fact is critical for designing algorithms for an edge insertion.

Fact 3.1 ().

Let $G'$ be the graph after inserting an edge $(a, b)$ into $G$. Then $d_{G'}(u, v) \le d_G(u, v)$ holds for any two vertices $u, v \in V$.

That is, the distance between any two vertices never increases after inserting edges or vertices in a graph.

Highway cover labelling. Unlike the previous work (akiba2014dynamic; hayashi2016fully; d2019fully) that uses 2-hop cover labelling (cohen2003reachability), we develop our method using a highly scalable labelling approach, called highway cover labelling (farhan2018highly). Let $R \subseteq V$ be a small set of landmarks in a graph $G$. For each vertex $v \in V \setminus R$, the label of $v$ is a set of distance entries $L(v) = \{(r_i, \delta_L(r_i, v))\}$, where $r_i \in R$ and $\delta_L(r_i, v) = d_G(r_i, v)$. We call $L$ a distance labelling over $G$ whose size is defined as $size(L) = \sum_{v \in V} |L(v)|$. A highway $H = (R, \delta_H)$ consists of a set of landmarks $R$ and a distance decoding function $\delta_H : R \times R \to \mathbb{N}^+$ such that, for any two landmarks $r_1, r_2 \in R$, $\delta_H(r_1, r_2) = d_G(r_1, r_2)$ holds.

Definition 3.2 ().

A highway cover labelling is a pair $\Gamma = (H, L)$ where $H$ is a highway and $L$ is a distance labelling s.t. for any vertex $v \in V \setminus R$ and any landmark $r \in R$, we have:

(1)   $d_G(r, v) = \min\{\delta_H(r, r_i) + \delta_L(r_i, v) \mid (r_i, \delta_L(r_i, v)) \in L(v)\}$

Highway cover labelling enjoys several nice theoretical properties, such as minimality and order independence. A minimal highway cover labelling can be efficiently constructed, independently of the order of applying landmarks (farhan2018highly).
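To make Definition 3.2 concrete, the following C++ sketch decodes $d_G(r, v)$ for a landmark $r$ and a non-landmark vertex $v$ from the highway and the label of $v$ alone, following Eq. (1). The data layout (a label as a list of (landmark, distance) pairs and the highway as a dense matrix) is a simplifying assumption of this sketch, not the authors' implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

static const uint32_t kInf = std::numeric_limits<uint32_t>::max();

// Label of a vertex v: entries (r_i, delta_L(r_i, v)) for a subset of landmarks.
typedef std::vector<std::pair<int, uint32_t> > Label;
// Highway: highway[r1][r2] = d_G(r1, r2) for all pairs of landmarks.
typedef std::vector<std::vector<uint32_t> > Highway;

// Decode d_G(r, v) from the highway and L(v), following Eq. (1):
//   d_G(r, v) = min over (r_i, d_i) in L(v) of delta_H(r, r_i) + d_i,
// where delta_H(r, r) = 0 covers the case that r itself appears in L(v).
uint32_t DecodeLandmarkDistance(int r, const Label& label_v, const Highway& highway) {
  uint32_t best = kInf;
  for (size_t i = 0; i < label_v.size(); ++i) {
    uint32_t via = highway[r][label_v[i].first];
    if (via == kInf) continue;                       // unreachable landmark pair
    best = std::min(best, via + label_v[i].second);
  }
  return best;   // kInf if v cannot reach r via any labelled landmark
}
```

For instance, if $L(v)$ contains the single entry $(r, 3)$, the call returns $0 + 3 = 3$ since $\delta_H(r, r) = 0$.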

Given a highway cover labelling $\Gamma = (H, L)$, an upper bound on the distance between any two vertices $u, v \in V \setminus R$ is computed as:

(2)   $d^{\top}_{uv} = \min\{\delta_L(r_i, u) + \delta_H(r_i, r_j) + \delta_L(r_j, v) \mid (r_i, \delta_L(r_i, u)) \in L(u),\ (r_j, \delta_L(r_j, v)) \in L(v)\}$

An exact distance query $Q(u, v, \Gamma)$ can then be answered by conducting a distance-bounded shortest-path search over a sparsified graph $G[V \setminus R]$ (i.e., removing all landmarks in $R$ from $G$) under the upper bound $d^{\top}_{uv}$ such that:

$Q(u, v, \Gamma) = \min\{d^{\top}_{uv},\ d_{G[V \setminus R]}(u, v)\}$
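The query procedure can be sketched as follows: compute the upper bound of Eq. (2) from the two labels and the highway, then run a BFS on the landmark-free graph that is cut off once it can no longer beat that bound. This is a minimal sketch under the same simplified data layout as above (adjacency lists plus a landmark flag); it assumes both query vertices are non-landmarks and is not the authors' optimised implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

static const uint32_t kInf = std::numeric_limits<uint32_t>::max();

typedef std::vector<std::pair<int, uint32_t> > Label;      // (landmark, distance) entries
typedef std::vector<std::vector<uint32_t> > Highway;       // delta_H over landmark pairs

struct Graph {
  std::vector<std::vector<int> > adj;   // adjacency lists
  std::vector<bool> is_landmark;        // true for vertices in R
};

// Upper bound of Eq. (2): min over L(u) x L(v) of
//   delta_L(r_i, u) + delta_H(r_i, r_j) + delta_L(r_j, v).
uint32_t UpperBound(const Label& lu, const Label& lv, const Highway& h) {
  uint32_t best = kInf;
  for (size_t i = 0; i < lu.size(); ++i)
    for (size_t j = 0; j < lv.size(); ++j) {
      uint32_t mid = h[lu[i].first][lv[j].first];
      if (mid == kInf) continue;
      best = std::min(best, lu[i].second + mid + lv[j].second);
    }
  return best;
}

// Exact query for non-landmark vertices u, v: a BFS over the sparsified graph
// G[V \ R], pruned as soon as it cannot improve on the upper bound.
uint32_t Query(const Graph& g, int u, int v, const Label& lu, const Label& lv,
               const Highway& h) {
  if (u == v) return 0;
  uint32_t bound = UpperBound(lu, lv, h);
  std::vector<uint32_t> dist(g.adj.size(), kInf);
  std::queue<int> q;
  dist[u] = 0;
  q.push(u);
  while (!q.empty()) {
    int x = q.front();
    q.pop();
    if (bound != kInf && dist[x] + 1 >= bound) continue;   // bounded search
    for (size_t i = 0; i < g.adj[x].size(); ++i) {
      int y = g.adj[x][i];
      if (g.is_landmark[y] || dist[y] != kInf) continue;   // stay inside G[V \ R]
      dist[y] = dist[x] + 1;
      if (y == v) return dist[y];                          // shorter landmark-free path found
      q.push(y);
    }
  }
  return bound;   // every shortest path goes through a landmark (or u, v are disconnected)
}
```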
Problem definition. In this work, we study the problem of answering distance queries over a graph that is dynamically changed by edge and vertex insertions over time. Since a vertex insertion can be treated as a set of edge insertions, without loss of generality, below we define the problem based on edge insertions.

Definition 3.3 ().

Let $G \Rightarrow G'$ denote that a graph $G$ is changed to a graph $G'$ by an edge insertion. The dynamic distance querying problem is: given any two vertices $u$ and $v$ in the changed graph $G'$, efficiently compute the distance $d_{G'}(u, v)$.

Figure 2. An illustration of our online incremental algorithm: (a) a graph with three landmarks (colored in yellow); (b) and (d) the BFSs for finding affected vertices (colored in green) w.r.t. two of the landmarks, respectively; (c) and (e) the BFSs for repairing affected vertices w.r.t. these two landmarks, respectively, where vertices with added/modified entries are colored in blue, and vertices with removed entries are colored in red.

4. Online Incremental Algorithm

In this section, we propose an algorithm to incrementally update the highway cover labelling to reflect graph changes. Algorithm 1 describes the main steps of our algorithm. Below, we discuss them in detail.

4.1. Finding Affected Vertices

When an update operation occurs on a graph $G$, there exists a subset of "affected" vertices in $G$ whose labels need to be updated as a consequence of this update operation on the graph.

Definition 4.1 ().

A vertex $v \in V$ is affected by $G \Rightarrow G'$ iff $d_{G'}(r, v) \neq d_G(r, v)$ for at least one $r \in R$; unaffected otherwise.

We use $\Lambda_r$ to denote the set of all affected vertices w.r.t. a landmark $r \in R$, and $\Lambda = \bigcup_{r \in R} \Lambda_r$ the set of all affected vertices.

Example 4.2 ().

Consider Figure 2(a), in which two of the landmarks are considered. After inserting an edge, the affected vertices w.r.t. the first landmark are shown (in green) in Figure 2(b), and those w.r.t. the second landmark in Figure 2(d).

The following lemma states how affected vertices relate to an edge being inserted.

Lemma 4.3 ().

When $G \Rightarrow G'$ for an edge insertion $(a, b)$, a vertex $v \in \Lambda_r$ iff there exists a shortest path between $r$ and $v$ in $G'$ passing through the edge $(a, b)$.

Following Lemma 4.3, we can reduce the search space of affected vertices by eliminating landmarks $r$ for which the inserted edge lies on no shorter path from $r$ (i.e., $|d_G(r, a) - d_G(r, b)| \le 1$), since $\Lambda_r = \emptyset$ in such a case. Thus, we assume w.l.o.g. that $d_G(r, a) + 1 < d_G(r, b)$ w.r.t. a landmark $r$ in the rest of this section. Further, by the lemma below, we can also reduce the search space by "jumping" from the root $r$ of a BFS to vertex $b$.

Lemma 4.4 ().

When $G \Rightarrow G'$ with an inserted edge $(a, b)$, we have $d_{G'}(r, v) \ge d_G(r, a) + 1$ for any affected vertex $v \in \Lambda_r$.

Proof.

By Lemma 4.3, there exists a shortest path from any affected vertex $v \in \Lambda_r$ to $r$ going through the edge $(a, b)$ and thus through $a$. Since $a$ is unaffected and the distance from $a$ to $v$ is equal to or greater than 1, $d_{G'}(r, v) \ge d_{G'}(r, a) + 1 = d_G(r, a) + 1$ holds. ∎

Algorithm 2 describes our algorithm for finding affected vertices. Given a graph $G'$ with an inserted edge $(a, b)$ and a highway cover labelling over $G$, we conduct a jumped BFS w.r.t. a landmark $r$ starting from the vertex $b$ with its new depth $d_G(r, a) + 1$ (Lines 3-4). For every vertex $v$ dequeued from the queue $Q$, we enqueue all the neighbors of $v$ that are affected into $Q$ with their new distances (Lines 7-8) and add $v$ to $\Lambda_r$ as an affected vertex (Line 9). This process continues until $Q$ is empty.

Example 4.5 ().

Figure 2 illustrates how our algorithm finds affected vertices as a result of inserting an edge. The BFS rooted at the first landmark is depicted in Figure 2(b), which jumps to an endpoint of the inserted edge and finds six affected vertices. Similarly, the BFS rooted at the second landmark is depicted in Figure 2(d), which jumps to an endpoint of the inserted edge and finds three affected vertices.

Input: a graph $G'$ obtained from $G$ by an edge insertion $(a, b)$, a highway $H$, a distance labelling $L$
Output: an updated highway cover labelling $(H, L)$ over $G'$
1 foreach $r \in R$ do
2       $\Lambda_r \leftarrow$ FindAffected($G'$, $(a, b)$, $r$, $L$)
3       RepairAffected($G'$, $(a, b)$, $r$, $\Lambda_r$, $L$)
Algorithm 1 Incremental update algorithm.
1 Function FindAffected($G'$, $(a, b)$, $r$, $L$)
2       $\Lambda_r \leftarrow \emptyset$, $Q \leftarrow$ an empty queue
3       $d_{new}(b) \leftarrow d_G(r, a) + 1$
4       Enqueue $(b, d_{new}(b))$ to $Q$
5       while $Q$ is not empty do
6             Dequeue $(v, d_{new}(v))$ from $Q$
7             foreach $w \in N_{G'}(v)$ s.t. $d_{new}(v) + 1 < d_G(r, w)$ and $w \notin Q \cup \Lambda_r$ do
8                   Enqueue $(w, d_{new}(v) + 1)$ to $Q$
9             Add $v$ to $\Lambda_r$
10      return $\Lambda_r$
Algorithm 2 Finding affected vertices.
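For illustration, a C++ rendering of the jumped BFS in Algorithm 2 might look as follows. The callable old_dist, which returns $d_G(r, w)$ decoded on demand from the labelling as in Eq. (1), and the data layout are assumptions of this sketch rather than part of the authors' code.

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Jumped BFS of Algorithm 2 (sketch). Assumes the edge (a, b) has just been
// inserted into the graph whose adjacency lists are given by adj, and that
// d_G(r, a) + 1 < d_G(r, b) for the landmark r under consideration, so the
// search can start ("jump") directly at b with new depth d_G(r, a) + 1.
// old_dist(w) must return the pre-update distance d_G(r, w).
// Returns the affected vertices w.r.t. r together with their new distances.
template <typename OldDist>
std::unordered_map<int, uint32_t> FindAffected(
    const std::vector<std::vector<int> >& adj, int b, uint32_t new_dist_b,
    OldDist old_dist) {
  std::unordered_map<int, uint32_t> affected;            // vertex -> d_{G'}(r, vertex)
  std::queue<std::pair<int, uint32_t> > q;
  affected[b] = new_dist_b;
  q.push(std::make_pair(b, new_dist_b));
  while (!q.empty()) {
    std::pair<int, uint32_t> front = q.front();
    q.pop();
    const std::vector<int>& nbrs = adj[front.first];
    for (size_t i = 0; i < nbrs.size(); ++i) {
      int w = nbrs[i];
      // w is affected iff the inserted edge yields a strictly shorter path to r.
      if (affected.count(w) == 0 && front.second + 1 < old_dist(w)) {
        affected[w] = front.second + 1;
        q.push(std::make_pair(w, front.second + 1));
      }
    }
  }
  return affected;
}
```

The new distances recorded here can be handed to the repair step, so that they do not have to be decoded from the labelling again.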

4.2. Repairing Affected Vertices

Now we propose a repair strategy to efficiently update the labels of affected vertices in order to reflect graph changes. The key idea is that, instead of conducting a full BFS on all vertices, we conduct a partial BFS from $b$ only on affected vertices. Further, to avoid unnecessary computations, we distinguish two kinds of affected vertices: (1) affected vertices that are covered by other landmarks and can thus be easily repaired by removing an entry from their labels; (2) affected vertices whose labels need to be repaired with accurately calculated distances on the changed graph. The following lemma characterizes the first kind according to the definition of highway cover labelling.

Lemma 4.6 ().

An affected vertex $v \in \Lambda_r$ is covered by a landmark $r' \in R \setminus \{r\}$ iff $r'$ exists in some path of $P_{G'}(r, v)$. If an affected vertex $v$ is covered by $r'$, then any affected vertex $w \in \Lambda_r$ satisfying $d_{G'}(r, w) = d_{G'}(r, v) + d_{G'}(v, w)$ must also be covered by $r'$.

By Lemma 4.6, we can efficiently repair affected vertices as follows. If an affected vertex $v$ is covered by a landmark (i.e., one of the unaffected parents of $v$ does not contain $r$ in its label) and $v$ is also a landmark, we only update the highway; otherwise, we remove the entry of $r$ from $L(v)$. If $v$ is not covered by any landmark, we add/modify the entry of $r$ in $L(v)$. If $v$ is a descendant of covered vertices, we simply remove the entry of $r$ from $L(v)$ (if it exists).
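The coverage test described above, checking whether one of the unaffected parents of an affected vertex misses an entry for $r$, can be sketched as follows. The data layout and the precomputed inputs (distances of unaffected vertices from $r$, an affected flag) are assumptions of this sketch; in the full algorithm (Algorithm 3 below) the case of a covered affected parent is propagated through a second queue rather than re-tested here.

```cpp
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

static const uint32_t kInf = std::numeric_limits<uint32_t>::max();

typedef std::vector<std::pair<int, uint32_t> > Label;   // (landmark, distance) entries

// True iff the label carries an entry for landmark r.
bool ContainsLandmark(const Label& label, int r) {
  for (size_t i = 0; i < label.size(); ++i)
    if (label[i].first == r) return true;
  return false;
}

// Coverage test sketched from Lemma 4.6: an affected vertex v with new
// distance new_dist_v from r is covered if some unaffected parent u of v
// (i.e. d_{G'}(r, u) == new_dist_v - 1) has no entry for r in its label --
// another landmark already lies on a shortest path from r to u, and hence
// on a shortest path from r to v.
bool IsCoveredByUnaffectedParent(
    int v, uint32_t new_dist_v, int r,
    const std::vector<std::vector<int> >& adj,   // adjacency lists of G'
    const std::vector<Label>& labels,            // labels before the repair
    const std::vector<uint32_t>& dist_r,         // d(r, .) for unaffected vertices
    const std::vector<bool>& is_affected) {
  for (size_t i = 0; i < adj[v].size(); ++i) {
    int u = adj[v][i];
    if (is_affected[u]) continue;                            // only unaffected parents here
    if (dist_r[u] == kInf || dist_r[u] + 1 != new_dist_v) continue;  // u must be a parent of v
    if (!ContainsLandmark(labels[u], r)) return true;
  }
  return false;
}
```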

1 Function RepairAffected($G'$, $(a, b)$, $r$, $\Lambda_r$, $L$)
2       $Q \leftarrow$ an empty queue, $Q^* \leftarrow$ an empty queue
3       $d \leftarrow d_G(r, a) + 1$
4       Enqueue $(b, d)$ to $Q^*$ if $b$ is covered; otherwise to $Q$
5       while $Q$ is not empty do
6             while $v \in Q$ at depth $d$ do
7                   forall $w \in N_{G'}(v)$ s.t. $w \in \Lambda_r$ at depth $d + 1$ do
8                         if $w$ is covered then
9                               if $w$ is a landmark then
10                                    $\delta_H(r, w) \leftarrow d + 1$
11                              else
12                                    Remove the entry of $r$ from $L(w)$
13                              Enqueue $(w, d + 1)$ to $Q^*$
14                        else
15                              Add/Modify $(r, d + 1)$ in $L(w)$
16                              Enqueue $(w, d + 1)$ to $Q$
17                        Remove $w$ from $\Lambda_r$
18                  Dequeue $v$ from $Q$
19            while $v \in Q^*$ at depth $d$ do
20                  forall $w \in N_{G'}(v)$ s.t. $w \in \Lambda_r$ at depth $d + 1$ do
21                        Remove the entry of $r$ from $L(w)$
22                        Remove $w$ from $\Lambda_r$
23                        Enqueue $(w, d + 1)$ to $Q^*$
24                  Dequeue $v$ from $Q^*$
25            $d \leftarrow d + 1$
26      Remove the entry of $r$ from the labels of the remaining vertices in $\Lambda_r$
Algorithm 3 Repairing affected vertices.

Algorithm 3 describes our algorithm for repairing affected vertices. Given a graph $G'$ with an inserted edge $(a, b)$ and a set of affected vertices $\Lambda_r$, we conduct a BFS w.r.t. a landmark $r$ starting from the vertex $b$ with its new distance $d_G(r, a) + 1$ (Lines 3-4). We use two queues $Q$ and $Q^*$ to process uncovered and covered vertices, respectively. If $b$ is covered, we enqueue $b$ to $Q^*$ and remove the entry of $r$ from the labels of affected vertices (Line 26). Otherwise, we enqueue $b$ to $Q$ and start processing vertices in $Q$ (Line 5). For each vertex $v$ at depth $d$, we examine its affected neighbors $w$ at depth $d + 1$. If $w$ is covered, then if $w$ is a landmark, we update the highway (Line 10); otherwise we remove the entry of $r$ from $L(w)$ (Line 12), because there must exist another landmark on a shortest path from $r$ to $w$, and add $w$ to $Q^*$ (Line 13). Otherwise, we add/modify the entry of $r$ with the new distance in $L(w)$ and enqueue $w$ to $Q$ (Lines 15-16). After that, we remove $w$ from $\Lambda_r$ (Line 17). Then, for each vertex in $Q^*$ at depth $d$, we remove the entry of $r$ from the labels of its affected neighbors, remove these affected vertices from $\Lambda_r$ and enqueue them to $Q^*$ (Lines 19-24). We process these two queues, one after the other, until $Q$ is empty. Finally, we remove the entry of $r$ from the labels of the remaining vertices in $\Lambda_r$ (Line 26).
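The label and highway edits that Algorithm 3 performs at Lines 10, 12, 15, 21 and 26 reduce to three small primitives, sketched below under the same simplified data layout as before (an actual implementation would likely keep labels sorted or hashed for faster lookups).

```cpp
#include <cstdint>
#include <utility>
#include <vector>

typedef std::vector<std::pair<int, uint32_t> > Label;    // (landmark, distance) entries
typedef std::vector<std::vector<uint32_t> > Highway;     // delta_H over landmark pairs

// Remove the entry for landmark r from a label, if present (Lines 12, 21, 26).
void RemoveEntry(Label& label, int r) {
  for (size_t i = 0; i < label.size(); ++i) {
    if (label[i].first == r) {
      label.erase(label.begin() + i);
      return;
    }
  }
}

// Add an entry for landmark r, or overwrite its distance if one exists (Line 15).
void AddOrModifyEntry(Label& label, int r, uint32_t d) {
  for (size_t i = 0; i < label.size(); ++i) {
    if (label[i].first == r) {
      label[i].second = d;
      return;
    }
  }
  label.push_back(std::make_pair(r, d));
}

// Update the highway when a covered affected vertex is itself a landmark (Line 10).
void UpdateHighway(Highway& highway, int r, int w, uint32_t d) {
  highway[r][w] = d;
  highway[w][r] = d;   // the graph is undirected, so keep the matrix symmetric
}
```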

Example 4.7 ().

Figure 2 illustrates how our algorithm repairs labels as a result of inserting an edge. The BFS for the first landmark is depicted in Figure 2(c), which jumps to an endpoint of the inserted edge and repairs three affected vertices, while the remaining affected vertices are covered by other landmarks. Similarly, the BFS for the second landmark is depicted in Figure 2(e), in which two affected vertices are repaired and one is covered by other landmarks.

5. Theoretical Results

Dataset       Update Time (ms)              Query Time (ms)               Labelling Size
              Ours    IncFD   IncPLL        Ours    IncFD   IncPLL        Ours     IncFD    IncPLL
Skitter 0.194 0.444 2.05 0.027 0.019 0.047 42 MB 153 MB 2.44 GB
Flickr 0.006 0.074 1.73 0.007 0.012 0.064 34 MB 152 MB 3.69 GB
Hollywood 0.031 0.101 48 0.027 0.037 0.109 27 MB 263 MB 12.58 GB
Orkut 2.026 2.049 - 0.101 0.103 - 70 MB 711 MB -
Enwiki 0.134 0.163 5.91 0.054 0.035 0.071 82 MB 608 MB 12.57 GB
Livejournal 0.245 0.268 - 0.044 0.046 - 122 MB 663 MB -
Indochina 5.443 158 2018 0.737 0.839 0.063 81 MB 838 MB 18.64 GB
IT 95.92 224 - 1.069 1.013 - 854 MB 4.74 GB -
Twitter 0.027 0.134 - 0.863 0.177 - 1.14 GB 3.83 GB -
Friendster 0.159 0.419 - 0.814 0.904 - 2.43 GB 9.14 GB -
UK 11.49 384 - 3.443 5.858 - 1.78 GB 11.8 GB -
Clueweb09 40.68 - - 16.93 - - 163 GB - -
Table 1. Comparing the update time, query time and labelling size of our method with the baseline methods.
Dataset      Network      |V|      |E|      avg. deg      avg. dist
Skitter comp (u) 1.7M 11M 13.081 5.1
Flickr social (u) 1.7M 16M 18.133 5.3
Hollywood social (u) 1.1M 114M 98.913 3.9
Orkut social (u) 3.1M 117M 76.281 4.2
Enwiki social (d) 4.2M 101M 43.746 3.4
Livejournal social (d) 4.8M 69M 17.679 5.6
Indochina web (d) 7.4M 194M 40.725 7.7
IT web (d) 41M 1.2B 49.768 7.0
Twitter social (d) 42M 1.5B 57.741 3.6
Friendster social (u) 66M 1.8B 55.056 5.0
UK web (d) 106M 3.7B 62.772 6.9
Clueweb09 web (d) 1.7B 7.8B 9.27 7.4
Table 2. Summary of datasets.

Proof of correctness. For $G \Rightarrow G'$ where our method updates a highway cover labelling $\Gamma = (H, L)$ over $G$ into a highway cover labelling $\Gamma' = (H', L')$ over $G'$, we consider our method to be correct iff, whenever $Q(u, v, \Gamma) = d_G(u, v)$ holds for any two vertices $u$ and $v$ in $G$, $Q(u, v, \Gamma') = d_{G'}(u, v)$ also holds for any two vertices $u$ and $v$ in $G'$. We prove the theorem below for our method.

Theorem 5.1 ().

Our incremental algorithm is correct.

Proof.

First, we prove that FindAffected returns the set of all affected vertices as a result of an edge insertion. The condition in Lines 7-8 of Algorithm 2 guarantees that any vertex being added to $Q$ has a shortest path to the landmark $r$ which goes through the inserted edge $(a, b)$. By Lemma 4.3, such vertices are affected vertices, and thus a vertex $v$ is added to $\Lambda_r$ in Algorithm 2 iff $v$ is affected w.r.t. $r$. Then, we prove that RepairAffected repairs the labelling s.t. (1) for any affected vertex $v \in \Lambda_r$, $(r, d_{G'}(r, v)) \in L'(v)$ iff no path in $P_{G'}(r, v)$ contains a landmark other than $r$; (2) $\delta_{H'}(r, r') = d_{G'}(r, r')$ for any affected landmark $r' \in R$. Starting from $b$ with the new distance $d_G(r, a) + 1$, the distances of affected vertices in $\Lambda_r$ are iteratively inferred on $G'$ and reflected into their labels if these affected vertices are not covered (Lines 15-16 of Algorithm 3). If an affected vertex is covered, it is kept in $Q^*$; if it is also a landmark, the highway is updated (Lines 9-10). Thus, the distance entry of $r$ is removed from the labels of affected vertices appearing in $Q^*$, whereas any vertex $v$ appearing in $Q$ must have $(r, d_{G'}(r, v)) \in L'(v)$. ∎

Preservation of minimality.  It has been reported in (farhan2018highly) that, given a graph $G$, a minimal highway cover labelling $L$ of $G$ can be constructed using the algorithm proposed in their work, i.e., $size(L) \le size(L^*)$ holds for any highway cover labelling $L^*$ of $G$. For $G \Rightarrow G'$ where our method updates $\Gamma = (H, L)$ over $G$ into $\Gamma' = (H', L')$ over $G'$, we prove that our method preserves the minimality of labelling.

Theorem 5.2 ().

If $L$ is minimal on $G$, then $L'$ is minimal on $G'$.

Proof.

By Lemma 4.6, for an affected vertex $v \in \Lambda_r$, $(r, d_{G'}(r, v)) \in L'(v)$ iff $P_{G'}(r, v)$ does not contain any landmark other than $r$; otherwise we remove the entry of $r$ from the label of $v$ (Lines 12, 21 and 26 of Algorithm 3). Thus, the labels of all affected vertices must be minimal after applying our method. For unaffected vertices, their labels remain unchanged. Hence, $L'$ must be minimal. ∎

Complexity analysis. Let $m$ be the total number of affected vertices, $l$ be the average size of labels (i.e. $l = size(L)/|V|$), and $d$ be the average degree of vertices. For a landmark, Algorithm 2 takes $O(m \cdot d \cdot l)$ time to find all affected vertices and Algorithm 3 takes $O(m \cdot d)$ time to repair the labels of all affected vertices. We omit the factor $l$ for Algorithm 3 because the distances of all unaffected neighbors of affected vertices are already stored by Algorithm 2. Therefore, our algorithm has time complexity $O(|R| \cdot m \cdot d \cdot l)$. In our experiments, we notice that $m$ is usually orders of magnitude smaller than $|V|$ and $l$ is also significantly smaller than $|R|$.

Directed and weighted graphs.  For directed graphs, we can store two sets of labels, a forward label and a backward label, for each vertex, which contain distance entries computed from forward and backward BFSs w.r.t. each landmark. Accordingly, we can store a forward and a backward highway. Then, we conduct two BFSs to update these labels and highways: one in the forward direction and the other in the backward direction. Our method can also be easily extended to handle weighted graphs by using Dijkstra's algorithm instead of BFS.

6. Experiments

We have evaluated our method to answer the following questions: (Q1) How efficiently can our method perform against state-of-the-art methods? (Q2) How does the number of landmarks affect the performance of our method? (Q3) How does our method scale to perform updates occurring rapidly in large dynamic networks?

Datasets. We used 12 large real-world networks as detailed in Table 2. These networks are accessible at the Stanford Network Analysis Project (leskovec2015snap), the Laboratory for Web Algorithmics (BoVWFI), the Koblenz Network Collection (kunegis2013konect), and the Network Repository (rossi2015network). We treated these networks as undirected and unweighted graphs.

Updates and queries. For each network, we randomly sampled 1,000 vertex pairs that do not already form edges and inserted them as new edges to evaluate the average update time. Further, we evaluated the average query time using 100,000 randomly sampled pairs of vertices from each network, and we report the labelling size after reflecting all the updates.

Baseline methods. We compared our method with two state-of-the-art methods: (1) IncPLL: an online incremental algorithm proposed in (akiba2014dynamic) which is based on the 2-hop cover labelling to answer distance queries; (2) IncFD: an online incremental algorithm proposed in (hayashi2016fully) which combines a 2-hop cover labelling with a graph traversal algorithm to answer distance queries. The codes of these methods were provided by their authors and implemented in C++. We used the same parameter settings for these methods as suggested by their authors unless otherwise stated. For a fair comparison, following (hayashi2016fully) we used the same number of landmarks for IncFD and our method, except for Clueweb09, for which we used a smaller number of landmarks due to its billion-scale vertex set. Our method was implemented in C++11 and compiled using gcc 5.5.0 with the -O3 option. We performed all the experiments using a single thread on a Linux server (Intel Xeon W-2175 with 2.50GHz and 512GB of main memory).

6.1. Performance Comparison

Figure 3. Average update time of our method (in colored bars) and the baseline method IncFD (in colored plus grey bars) under 10-50 landmarks. There are no results of IncFD for Clueweb09 due to the scalability issue.
Figure 4. Update time of our method for performing up to 10,000 updates against the construction time of the labelling from scratch.

6.1.1. Update Time

Table 1 shows that our method outperforms the state-of-the-art methods IncFD and IncPLL in average update time on all datasets. This is due to the novel repair strategy utilized by our method. Further, only our method can scale to very large networks with billions of vertices and edges. IncFD fails to scale to Clueweb09, and IncPLL fails on 7 out of 12 datasets due to very high preprocessing time and memory requirements.

6.1.2. Labelling Size

From Table 1, we see that our method has significantly smaller labelling sizes than IncFD and IncPLL. When updates occur on a graph, the labelling sizes of IncFD and our method remain stable because their average label sizes are bounded by the number of landmarks. Moreover, IncFD stores complete shortest-path trees w.r.t. landmarks, while our method stores pruned shortest-path trees, which leads to a labelling of much smaller size than IncFD. For IncPLL, the labelling sizes may increase because IncPLL does not remove outdated and redundant entries.

6.1.3. Query Time

In Table 1, the query times of our method are comparable with IncFD and IncPLL. It has been shown in (d2019fully) that query time depends on labelling size. As discussed in Section 6.1.2, the update operations do not considerably affect the labelling sizes of IncFD and our method, and thus their query times remain stable. However, the query times of IncPLL may increase over time because of the presence of outdated and redundant entries, which result in a labelling of increasing size.

6.2. Performance with Varying Landmarks

Figure 3 shows the average update time of our method against the baseline method IncFD under a varying number of landmarks, i.e., 10-50 landmarks. As we can see, our method outperforms IncFD on all the datasets for almost every selection of landmarks. We can also see that the performance gap remains stable for most of the datasets when increasing the number of landmarks. This empirically verifies the efficiency of our repair strategy.

6.3. Scalability Test

We conducted a scalability test on the update time of our method by starting with 500 updates and then iteratively adding 500 updates each time until reaching 10,000 updates. Figure 4 shows the results. We observe that the update time of our method on almost all the datasets is considerably below the construction time of the labelling. On Indochina and IT, our method performs relatively worse because these networks have large average distances, as depicted in Table 2, which lead to high percentages of affected vertices as shown in Figure 1. In contrast, our method performs well on graphs with small average distances such as Twitter. Overall, our method can scale to perform a large number of updates efficiently.

7. Conclusion

This paper has studied the problem of answering distance queries on large dynamic networks. Our proposed algorithm exploits properties of a recent labelling technique called highway cover labelling (farhan2018highly) to efficiently process incremental graph updates, and preserves the minimality property of the labelling after each update operation. We have empirically evaluated the efficiency and scalability of the proposed algorithm, and the results show that it outperforms the state-of-the-art methods. In future work, we plan to further investigate the effects of decremental updates on graphs, since they are also common in practice.

References