1. Introduction and Background
Point-to-Point Shortest Distance (PPSD) computation is one of the most important primitives encountered in graph databases. It is used for similarity analysis on biological and social networks, context-aware search on knowledge graphs, route navigation on roads etc. These applications generate a very large number of PPSD queries and making online query response fast is absolutely crucial to their performance.
One way to answer a PPSD query is to run a traversal algorithm such as Dijkstra, Bellman-Ford or Delta-Stepping. However, even state-of-the-art traversal algorithms and implementations (ligra, ; galois, ; julienne, ; gemini, ; asynch, ) have query response times on the order of hundreds of milliseconds on large graphs. This is prohibitively slow, especially if the application generates a large number of queries. Another solution is to pre-compute and store all-pairs shortest paths. While this approach allows queries to be answered in constant time, it incurs quadratic pre-processing time and storage complexity and hence is not feasible for large graphs.
Hub-labeling is a popular alternative approach for PPSD computation. It trades off pre-processing costs against query performance by pre-computing, for each vertex, the distance to a ‘small’ subset of vertices known as hubs. The set of (hub, distance)-tuples are known as the hub-labels of vertex with its label size. A hub-labeling can correctly answer any PPSD query if it satisfies the following cover property: every connected pair of vertices is covered by a hub vertex from their shortest path, i.e. there exists an such that and are in the label set of and , respectively.¹ Now, a PPSD query for vertices and can be answered by finding the common hub with minimum cumulative distance to and .
¹ Table 1 lists some frequently used notations in this paper. For ease of description, we consider to be weighted and undirected. However, all labeling approaches described here can be easily extended to directed graphs by using forward and backward labels for each vertex (abrahamCHL, ). Our implementation is, indeed, compatible with directed graphs.
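The query step can be illustrated with a minimal sketch. It assumes each label set is held as a Python dict mapping a hub to its distance; the function name `ppsd_query` and the toy labels are illustrative choices, not the paper's implementation.

```python
import math

def ppsd_query(labels_u, labels_v):
    """Return the shortest distance implied by two hub-label sets.

    Each argument maps hub -> distance from the labeled vertex to that hub.
    Returns math.inf when the two vertices share no hub (not connected).
    """
    # Scan the smaller label set and probe the larger one.
    if len(labels_u) > len(labels_v):
        labels_u, labels_v = labels_v, labels_u
    best = math.inf
    for hub, dist_u in labels_u.items():
        dist_v = labels_v.get(hub)
        if dist_v is not None:
            best = min(best, dist_u + dist_v)
    return best

# Toy path a -(2)- b -(3)- c, with b acting as the hub that covers (a, c).
labels = {
    "a": {"a": 0, "b": 2},
    "b": {"b": 0},
    "c": {"c": 0, "b": 3},
}
print(ppsd_query(labels["a"], labels["c"]))  # 5
```

The cost of a query is linear in the size of the smaller label set here, which is why the average label size governs query response time.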
Query response time is clearly dependent on average label size. However, finding the optimum labeling (with minimum average label size) is known to be NP-hard (cohen2hop, ). Let denote a total order on all vertices, i.e. a ranking function, also known as network hierarchy. Rather than find the optimal labeling, Abraham et al. (abrahamCHL, ) conceptualize Canonical Hub Labeling (CHL) in which, for a given shortest path , only the highest-ranked hub is added to the labels of and . CHL satisfies the cover property and is minimal for a given , as removing any label from it results in a violation of the cover property. Intuitively, a good ranking function will prioritize highly central vertices (such as highways vs. residential streets). Such vertices are good candidates for being hubs: a large number of shortest paths in the graph can be covered with a few labels. Therefore, a labeling which is minimal for a good will be quite efficient overall. Abraham et al. (abrahamCHL, ) develop a sequential polynomial-time algorithm for computing CHL in which paths are recursively shortcut by vertices in rank order and label sets are pulled from high-ranked reachable vertices in the modified graph. However, this algorithm incurs large pre-processing time, rendering it infeasible for labeling large graphs (vldbExperimental, ; akibaPLL, ).
Table 1. Frequently used notations
|a weighted undirected graph with vertex set and edges|
|number of vertices and edges|
|neighboring vertices of|
|weight of edge|
|ranking function or network hierarchy|
|(vertices in) shortest path(s) between , (inclusive)|
|shortest path tree rooted at vertex|
|shortest path distance between vertices and|
|a hub label for vertex with as the hub|
|set of hub labels for vertex|
|number of nodes in a multi-node cluster|
Akiba et al. (akibaPLL, ) propose Pruned Landmark Labeling (PLL), which is arguably the most efficient sequential algorithm to compute the CHL. PLL iteratively computes Shortest Path Trees (SPTs) from roots selected in decreasing order of rank. The efficiency of PLL comes from heavy pruning of the SPTs. As shown in fig. (b), for every vertex visited in , PLL initiates a pre-processing Distance-Query to determine if there exists a hub in both and such that , where is the distance to as found in . Hub-label is added to only if the query is unable to find such a common hub (we say that in such a case, the query returns false). Otherwise, further exploration from is pruned and is not added to . (Note: every node is its own hub by default.) Despite the heavy pruning, PLL is computationally very demanding. Dong et al. (dongPLL, ) show that PLL takes several days to process large weighted graph datasets: coPaper (15M edges) and Actor (33M edges). Note that the ranking also affects the performance of PLL. Intuitively, ranking vertices with high degree or high betweenness centrality higher should lead to better pruning when processing lower-ranked vertices. Optimizing is of independent interest and not the focus of (akibaPLL, ) or this study.
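The pruning described above can be sketched as follows. This is a minimal sequential illustration, assuming an adjacency-list dict with positive integer weights and an `order` list sorted from highest to lowest rank; the function names are hypothetical and the code is not the implementation of (akibaPLL, ).

```python
import heapq
import math

def query(labels_u, labels_v):
    """Minimum distance over common hubs (math.inf if there is none)."""
    return min((d + labels_v[h] for h, d in labels_u.items() if h in labels_v),
               default=math.inf)

def pruned_landmark_labeling(graph, order):
    """graph: {u: [(v, weight), ...]} with positive weights.
    order: vertices listed from highest to lowest rank."""
    labels = {v: {} for v in graph}
    for root in order:
        dist = {root: 0}
        heap = [(0, root)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue  # stale heap entry
            # Distance-Query: an existing common hub already covers (root, v).
            if query(labels[root], labels[v]) <= d:
                continue  # prune: no label, no further exploration from v
            labels[v][root] = d
            for nbr, w in graph[v]:
                nd = d + w
                if nd < dist.get(nbr, math.inf):
                    dist[nbr] = nd
                    heapq.heappush(heap, (nd, nbr))
    return labels

# Star with centre c ranked highest: a -(1)- c -(2)- b.
star = {"c": [("a", 1), ("b", 2)], "a": [("c", 1)], "b": [("c", 2)]}
labels = pruned_landmark_labeling(star, ["c", "a", "b"])
print(query(labels["a"], labels["b"]))  # 3
```

In this toy run the trees rooted at a and b are pruned as soon as they reach c, so each leaf stores only c and itself as hubs.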
Parallelizing CHL construction/PLL comes with a myriad of challenges. Most existing approaches (parapll, ; parallelPLLThesis, ) attempt to construct multiple trees concurrently using parallel threads. However, such simple parallelization violates the network hierarchy and results in larger label sizes. Many mission-critical applications may require a CHL for a specific network hierarchy. Further, larger label sizes directly impact query performance. Parallelizing label construction over multiple nodes in a cluster exacerbates these problems because the hubs generated by a node are not immediately visible to other nodes. Most importantly, the size of the labeling can be significantly larger than the graph itself, stressing the available main memory on a single machine. Although a disk-based labeling algorithm has been proposed previously (jiangDisk, ), it is substantially slower than PLL and ill-suited to processing large networks. To the best of our knowledge, none of the existing parallel approaches resolve this issue.
In this paper, we systematically address the multiple challenging facets of the parallel CHL construction problem. Two key perspectives drive the development of our algorithmic innovations and optimizations. First, we approach simultaneous construction of multiple SPTs in PLL as an optimistic parallelization that can result in mistakes. We develop PLL-inspired shared-memory parallel Global Local Labeling (GLL) and Distributed-memory Global Local Labeling (DGLL) algorithms that (i) only make mistakes from which they can recover, and (ii) efficiently correct those mistakes.
Second, we note that mistake correction in DGLL generates a huge amount of label traffic, thus limiting its scalability. Therefore, we shift our focus from parallelizing PLL to the primary problem of parallel CHL construction. Drawing insights from the fundamental concepts behind canonical labeling, we develop an embarrassingly parallel and communication-avoiding distributed algorithm called PLaNT (Prune Labels and (do) Not (Prune) Trees). Unlike PLL, which prunes SPTs using distance queries but inserts labels for all vertices explored, PLaNT does not prune SPTs but inserts labels selectively. PLaNT ensures correctness and minimality of output hub labels (for a given ) generated from an SPT without consulting previously discovered labels, as shown in fig. 1. This allows labeling to be partitioned across multiple nodes without an increase in communication traffic and enables us to simultaneously scale effective parallelism and memory capacity using multiple cluster nodes. By seamlessly transitioning between PLaNT and DGLL, we achieve both computational efficiency and high scalability.
Overall, our contributions can be summarized as follows:
We develop parallel shared-memory and distributed algorithms that output the minimal hub labeling (CHL) for a given graph and network hierarchy . None of the existing parallel algorithms guarantee the CHL as output.
We develop a new embarrassingly parallel algorithm for distributed CHL construction, called PLaNT. PLaNT completely avoids inter-node communication to achieve high scalability at the cost of additional computation. We further propose a hybrid algorithm for efficient and scalable CHL construction.
Our algorithms use the memory of multiple cluster nodes in a collaborative fashion to enable completely in-memory processing of graphs whose labels cannot fit on the main memory of a single node. To the best of our knowledge, this is the first work to accomplish this task.
We develop different schemes for label data distribution in a cluster to increase query throughput by utilizing parallel processing power of multiple compute nodes. To the best of our knowledge, none of the existing works use multiple machines to store labeling and compute query response.
We use real-world datasets to evaluate our algorithms. Label construction using PLaNT is, on average, faster on nodes compared to single-node execution. Further, our distributed implementation is able to process the LiveJournal (liveJournal, ) graph with GB output label size in minutes on nodes.
2. Problem Description and Challenges
In this paper, given a weighted graph with positive edge weights and a ranking function , we develop solutions to the following three objectives:
P1 Efficiently utilize the available parallelism in shared-memory (multicore CPUs) and distributed-memory (multi node clusters) systems to accelerate CHL construction.
P2 Scale (in-memory) processing to large graphs whose main memory requirements cannot be met by a single machine.
P3 Given a labeling , accelerate query response time by using multi-node parallelism in a cluster.
The challenges for parallel construction of canonical labels are manifold, the foremost being the dependency of pruning in an SPT on the labels previously generated by all SPTs rooted at higher ranked vertices. This lends the problem its inherently sequential nature, requiring SPTs to be constructed in a successive manner. For a distributed system, this also leads to the high label traffic that is required to efficiently prune the trees on each node. Also note that the average label size can be orders of magnitude greater than the average degree of the graph. Hence, even if the main memory of a single node can accommodate the graph, it may not be feasible to store the complete labeling on each node.
3. Related Work
Recently, researchers have proposed parallel approaches for hub labeling (parallelPLLThesis, ; dongPLL, ; parapll, ), many of which are based on PLL. Ferizovic et al. (parallelPLLThesis, ) construct a task queue from such that each thread pops the highest ranked vertex still in the queue and constructs a pruned SPT from that vertex. They only process unweighted graphs, which allows them to use the bit-parallel labeling optimization of (akibaPLL, ). Very recently, Li et al. (liSigmod, ) proposed a highly scalable Parallel Shortest distance Labeling (PSL) algorithm. In a given round , PSL generates hub labels with distance in parallel, replacing the sequential node-order label dependency of PLL with a distance label dependency. PSL is very efficient on, and is explicitly designed for, unweighted small-world networks with low diameter. In contrast, we target the generalized problem of labeling weighted graphs with arbitrary diameters, where these approaches are either not applicable or not effective.
Qiu et al. (parapll, ) argue that existing graph frameworks parallelize a single instance of SSSP and are not suitable for parallelization of PLL. They propose the paraPLL framework that launches concurrent instances of pruned Dijkstra (similar to (parallelPLLThesis, )) to process weighted graphs. By using a clever idea of hashing root labels prior to launching an SPT construction, they ensure that despite concurrent tree constructions, the labeling will satisfy the cover property even though the labeling may not be consistent with . While this approach benefits from the order of task assignment and dynamic scheduling, it can lead to a significant increase in label size if the number of threads is large.
Dong et al. (dongPLL, ) observe that the sizes of SPTs with high-ranked roots are quite large. They propose a hybrid intra- and inter-tree parallelization scheme utilizing parallel Bellman-Ford for the large SPTs initially, and concurrent Dijkstra instances for small SPTs in the later half of the execution, ensuring an average label size close to that of CHL. Their hybrid algorithm works well for scale-free graphs but fails to accelerate high-diameter graphs, such as road networks, due to the high complexity of Bellman-Ford.
paraPLL (parapll, ) also provides a distributed-memory implementation that statically divides the tasks (root IDs) across multiple nodes. It periodically synchronizes the concurrent processes and exchanges the labels generated so that they can be used by every process for pruning. This generates a large amount of label traffic and also introduces a pre-processing vs. query performance tradeoff, as reducing synchronizations improves labeling time but drastically increases label size. Moreover, paraPLL stores all the labels generated on every node and hence cannot scale to large graphs, despite the cumulative memory of all nodes being enough to store the labels.
4. Shared-memory parallel labeling
4.1. Label Construction and Cleaning
In this section, we discuss LCC, a two-step Label Construction and Cleaning algorithm that generates the CHL for a given graph and ordering . LCC utilizes shared-memory parallelism and forms the basis for the other algorithms discussed in this paper.
We first define some labeling properties. Recall that a labeling algorithm produces correct results if it satisfies the cover property.
Definition 1.
A hub label is said to be redundant if it can be removed from without violating the cover property.
Definition 2.
A labeling satisfies the minimality property if it has no redundant labels.
Let be any network hierarchy. For any pair of connected vertices and , let .
Definition 3.
A labeling respects if and , for all connected vertices , .
Lemma 4.
A hub label in a labeling that respects is redundant if is not the highest ranked vertex in .
WLOG, let . By assumption, . By definition, , , where }. Since the labeling respects , for any , we must have and also , where . Clearly, which implies that . Thus, for every , there exists a hub that covers and and can be removed without affecting the cover property. ∎
Lemma 5.
Given a ranking and a labeling that respects , a redundant label can be detected by a PPSD query between the vertex and the hub .
Let . By Lemma 4, . Further, since the labeling respects , must be a hub for and . Thus a PPSD query between and with rank priority will return hub and distance , allowing us to detect redundant label in . ∎
Lemmas 4 and 5 show that redundant labels (if any) in a labeling can be detected if it respects . Next, we describe our parallel LCC algorithm and show how it outputs the CHL. Note that the CHL (abrahamCHL, ) respects and is minimal.
The main insight underlying LCC is that simultaneous construction of multiple SPTs can be viewed as an optimistic parallelization of sequential PLL - that allows some ‘mistakes’ (generate labels not in CHL) in the hub labeling. However, only those mistakes shall be allowed that can be corrected to obtain the CHL. LCC addresses two major parallelization challenges:
Label Construction Construct in parallel, a labeling that respects .
Label Cleaning Remove all redundant labels in parallel.
Label Construction: To obtain a labeling that respects , LCC’s label construction incorporates a crucial element. In addition to Distance-Query pruning, LCC also performs Rank-Query pruning (algorithm 1, line 7). Specifically, during construction of , if a higher ranking vertex is visited, we 1) prune at and 2) do not insert a label for into even if the corresponding Distance-Query might have returned false. Since LCC constructs multiple SPTs in parallel, it is possible that the SPT of a higher ranked vertex which should be a hub for (for example above) is still in the process of construction and thus the hub list of is incomplete. Step 2) above guarantees that for any pair of connected vertices with , either is labeled a hub of or they both share a higher ranked hub. This fact will be crucial in proving the minimal covering property of LCC after its label cleaning phase. Note that might get unnecessarily inserted as a hub for some other vertex due to Rank Pruning at . However, as we will show subsequently, such ‘optimistic’ labels can be cleaned (deleted).
The parallel label construction in LCC is shown in algorithm 2. Similar to (parapll, ; parallelPLLThesis, ), each concurrent thread selects the most important unselected vertex from (by atomic updates to a global counter), and constructs the corresponding using pruned Dijkstra. However, unlike previous works, LCC’s pruned Dijkstra is also enabled with Rank Queries in addition to Distance Queries.² This parallelization strategy exhibits good load balance as all threads are working until the very last SPT and there is no global synchronization barrier where threads may stall. Moreover, pruned Dijkstra is computationally efficient for various network topologies compared to the pruned Bellman-Ford of (dongPLL, ), which performs poorly on large-diameter graphs.
² Initialization steps only touch array elements that have been modified in the previous run of Dijkstra. We use atomics to avoid race conditions.
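The effect of Rank-Query pruning can be sketched with a deterministic, single-threaded simulation. The sketch assumes that concurrency is modelled simply by processing roots out of rank order (so that a higher ranked tree is "late"); the function names, the `rank` dict (larger value means higher rank) and the toy triangle are illustrative assumptions, not algorithm 1 or 2 themselves.

```python
import heapq
import math

def query(labels_u, labels_v):
    """Minimum distance over common hubs (math.inf if there is none)."""
    return min((d + labels_v[h] for h, d in labels_u.items() if h in labels_v),
               default=math.inf)

def lcc_pruned_dijkstra(graph, rank, labels, root):
    """Pruned Dijkstra from `root` with Rank-Query and Distance-Query pruning."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        if rank[v] > rank[root]:
            continue  # Rank-Query: prune, and never label v with a lower ranked root
        if query(labels[root], labels[v]) <= d:
            continue  # Distance-Query: (root, v) is already covered
        labels[v][root] = d
        for nbr, w in graph[v]:
            nd = d + w
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))

# Triangle: a -(1)- h -(1)- v and a -(3)- v, with rank h > a > v.
graph = {"a": [("h", 1), ("v", 3)],
         "h": [("a", 1), ("v", 1)],
         "v": [("a", 3), ("h", 1)]}
rank = {"h": 3, "a": 2, "v": 1}
labels = {x: {} for x in graph}
for root in ["a", "h", "v"]:  # a is processed before h: h's tree is "late"
    lcc_pruned_dijkstra(graph, rank, labels, root)
print(labels["v"])  # contains the optimistic label for hub a with distance 3
```

Because h's labels are missing when a's tree runs, v receives the extra label (a, 3) even though the true distance is 2 via h. The result is a superset of the canonical labeling in which queries are still answered correctly through h, and the extra label is exactly the kind of mistake that the cleaning step removes.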
Claim 1.
The labeling generated by LCC’s label construction step (LCC-I) satisfies the cover property and respects .
Let (, resp.) denote the set of hub vertices of a vertex after LCC-I (sequential PLL, resp.). We will show that . Suppose for some vertex . Consider three cases:
Case 1: because a Rank-Query pruned at in LCC-I. Thus we must have . Since sequential PLL is also the CHL, also.
Case 2: because a Distance-Query pruned at in LCC-I. This can only happen if LCC found a shorter distance through a hub vertex (alg. 1 : lines 15-16). Since LCC with Rank-Querying identified as a hub for both and , we must have and thus .
Case 3: because was not discovered by due to some vertex being pruned. Similar to Case 2 above, this implies with and therefore .
Combining these cases, we can say that . Since sequential PLL also generates the CHL for , the claim follows. ∎
Label Cleaning: Note that LCC creates some extra labels due to the parallel construction of . For example, might get (incorrectly) inserted as a hub for vertex if the for a higher ranked vertex is still under construction and has not yet been inserted as a hub for and . These extra labels are redundant, since there exists a canonical subset of LCC (i.e. ) satisfying the cover property, and so do not affect PPSD queries. LCC eliminates redundant labels using the DQ_Clean function (algorithm 2, lines 15-20).³ For vertex , a label is redundant if a Distance-Query returns true with a hub with .
³ Instead of computing the full set intersection, the actual implementation of DQ_Clean stops at the first common hub (also the highest ranked) in sorted and which satisfies the condition in line 20 of algorithm 2.
Claim 2.
The final labeling generated by LCC after the Label Cleaning step (LCC-II) is the CHL.
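The cleaning rule can be illustrated with a short sketch. It assumes labels are stored as dicts (hub to distance) and that redundancy is decided for all labels against a snapshot before any deletion, as a parallel cleaner would do; `is_redundant`, `clean_labels` and the hand-built labeling are hypothetical names and data, not the DQ_Clean routine itself.

```python
def is_redundant(labels, rank, v, hub):
    """True if a higher ranked common hub covers (v, hub) at least as well."""
    d = labels[v][hub]
    for h, dist_hub_h in labels[hub].items():
        if rank[h] > rank[hub] and h in labels[v]:
            if dist_hub_h + labels[v][h] <= d:
                return True  # stop at the first witness
    return False

def clean_labels(labels, rank):
    """Remove every redundant label in place and return the labeling."""
    # Decide first, delete afterwards, so all checks see the same snapshot.
    redundant = [(v, hub) for v in labels for hub in labels[v]
                 if is_redundant(labels, rank, v, hub)]
    for v, hub in redundant:
        del labels[v][hub]
    return labels

# Labeling of the triangle a -(1)- h -(1)- v, a -(3)- v with rank h > a > v,
# containing one optimistic label: hub a at vertex v with distance 3.
rank = {"h": 3, "a": 2, "v": 1}
labels = {"a": {"a": 0, "h": 1}, "h": {"h": 0}, "v": {"a": 3, "h": 1, "v": 0}}
clean_labels(labels, rank)
print(labels["v"])  # {'h': 1, 'v': 0}
```

The label (a, 3) at v is dropped because h is a higher ranked hub of both a and v with 1 + 1 <= 3; every other label survives because no higher ranked common hub matches its distance.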
Lemma 6.
LCC is work-efficient. It performs work, generates hub labels and answers each query in time, where is the tree-width of .
Consider the centroid decomposition of a minimum-width tree decomposition of the input graph , where maps vertices in (bags) to subsets of vertices in (akibaPLL, ). Let be determined by the minimum depth bag, i.e. vertices in the root bag are ranked highest, followed by vertices in children of the root and so on. Since we prune using Rank-Query, will never visit vertices beyond the parent of . A bag is mapped to at most vertices and the depth of is . Since the only labels inserted at a vertex are its ancestors in the centroid tree, there are labels per vertex.
Each time a label is inserted at a vertex, we evaluate all its neighbors in the distance queue. Thus the total number of distance queue operations is . Further, distance queries are performed on vertices that cannot be pruned by rank queries. This results in work.
Label Cleaning step sorts the label sets and executes PPSD queries performing work. Thus, overall work complexity of LCC is which is the same as the sequential algorithm (parapll, ), making LCC work-efficient. ∎
Note that the labeling generated by paraPLL (parapll, ) is not guaranteed to respect and hence, doing Label Cleaning after paraPLL may result in a labeling that violates the cover property. However, Label Cleaning can be used to clean the output of the inter-tree parallel algorithm by Dong et al. (dongPLL, ).
Although LCC is theoretically efficient, in practice, the Label cleaning step adds non-trivial overhead to the execution time. In the next subsection, we describe a Global Local Labeling (GLL) algorithm that drastically reduces the overhead of cleaning.
4.2. Global Local Labeling (GLL)
The main goal of the GLL algorithm is to severely restrict the size of label sets used for PPSD queries during Label Cleaning. A natural way to accelerate label cleaning is by avoiding futile computations (in DQ_Clean) over hub labels that were already consulted during label construction. However, to achieve notable speedup, these pre-consulted labels must be skipped in constant time without actually iterating over all of them.
GLL overcomes this challenge by using a novel Global Local Label Table data structure and interleaved cleaning strategy. As opposed to LCC, GLL utilizes multiple synchronizations where the threads switch between label construction and cleaning. We denote the combination of a Label Construction and corresponding Label Cleaning step as a superstep. During label construction, the newly generated labels are pushed to a Local Label Table and the volume of labels generated is tracked. Once the number of labels in the local table becomes greater than , where is the synchronization threshold, the threads synchronize, sort and clean the labels in local table and commit them to the Global Label Table.
In the next superstep, it is known that all labels in the global table are consulted⁴ during the label generation. Therefore, the label cleaning only needs to query for redundant labels on the local table, thus dramatically reducing the number of repeated computations in PPSD queries. After a label construction step, the local table holds a total of labels. Assuming average labels per vertex (we empirically observe that labels are almost uniformly distributed across the vertices except the few highest ranked vertices), each cleaning step should perform on average work. The number of cleaning steps is and thus we expect the total complexity of cleaning to be in GLL as opposed to in LCC. If the constant , cleaning in GLL is more efficient than LCC.
⁴ For effective pruning, the Label Construction step uses both the global and local tables to answer distance queries.
Using two tables also drastically reduces locking during pruning queries. Both paraPLL and LCC have to lock label sets before reading because label sets are dynamic arrays that can undergo memory (de)allocation when a label is appended. However, GLL only appends to the local table. Most pruning queries are answered by label sets in the global table that do not need to be locked.
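A single-threaded sketch of the two-table idea follows. The class name, method names and the plain integer `threshold` argument are assumptions made for illustration; the sketch shows only the superstep logic (insert locally, query both tables, clean against the local table, commit) and omits threading and locking entirely.

```python
import math

class GlobalLocalLabelTable:
    def __init__(self, vertices, rank):
        self.rank = rank
        self.global_table = {v: {} for v in vertices}  # committed, already cleaned
        self.local_table = {v: {} for v in vertices}   # current superstep only
        self.local_count = 0

    def insert(self, v, hub, dist):
        """New labels always go to the local table."""
        self.local_table[v][hub] = dist
        self.local_count += 1

    def should_synchronize(self, threshold):
        return self.local_count > threshold

    def distance_query(self, u, v):
        """Construction-time query: consults both tables."""
        lu = {**self.global_table[u], **self.local_table[u]}
        lv = {**self.global_table[v], **self.local_table[v]}
        return min((d + lv[h] for h, d in lu.items() if h in lv), default=math.inf)

    def clean_and_commit(self):
        """End of a superstep: drop redundant local labels, then commit the rest."""
        local, rank = self.local_table, self.rank
        # Witness hubs are searched in the local table only.
        redundant = [
            (v, hub)
            for v in local for hub, d in local[v].items()
            if any(rank[h] > rank[hub] and h in local[v] and dh + local[v][h] <= d
                   for h, dh in local[hub].items())
        ]
        for v, hub in redundant:
            del local[v][hub]
        for v in local:
            self.global_table[v].update(local[v])
            local[v].clear()
        self.local_count = 0

# One superstep on the triangle a -(1)- h -(1)- v, a -(3)- v with rank h > a > v.
table = GlobalLocalLabelTable("ahv", {"h": 3, "a": 2, "v": 1})
for v, hub, d in [("a", "a", 0), ("v", "a", 3), ("h", "h", 0),
                  ("a", "h", 1), ("v", "h", 1), ("v", "v", 0)]:
    table.insert(v, hub, d)
if table.should_synchronize(5):
    table.clean_and_commit()
print(table.global_table["v"])  # {'h': 1, 'v': 0}
```

Merging the two dicts on every query is done here only for brevity; the point of the design is that committed labels are never re-examined as cleaning candidates, and that only the local table is ever appended to.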
5. Distributed-memory Hub Labeling
A distributed algorithm allows the application to scale beyond the levels of parallelism and the main memory offered by a single node. This is particularly useful for hub labeling as it is extremely memory intensive and computationally demanding, rendering off-the-shelf shared-memory systems inapt for processing large-scale graphs. However, a distributed-memory system also presents strikingly different challenges than a shared-memory system, in general as well as in the specific context of hub labeling. Therefore, a trivial extension of GLL algorithm is not suitable for a multi-node cluster.
Particularly, the labels generated on a node are not readily available to other nodes until nodes synchronize and exchange labels. Further, unlike paraPLL, our aim is to harness not just the compute but also the collective memory capability of multiple nodes to construct CHL for large graphs. This mandates that labels be partitioned and distributed across multiple nodes at all times, and severely limits the knowledge of SPTs created by other nodes even after synchronization. This absence of labels dramatically affects the pruning efficiency during label construction, resulting in large number of redundant labels and consequently, huge communication volume that bottlenecks the pre-processing.
In this section, we will present novel algorithms and optimizations that systematically conquer these challenges. We begin the discussion with a distributed extension of GLL that highlights the basic data distribution and parallelization approach.
5.1. Distributed GLL (DGLL)
The distributed GLL (DGLL) algorithm divides the task queue for SPT creation uniformly among nodes in a rank-circular manner. The set of root vertices assigned to node is . Every node loads the complete graph instance and executes GLL on its allotted task queue (every node also stores a copy of the complete ranking for rank queries). DGLL has two key optimizations tailored for distributed implementation:
1. Label Set Partitioning: In DGLL, nodes only store labels generated locally i.e. all labels at node are of the form , where . Equivalently, the labels of a vertex are disjoint and distributed across nodes i.e. . Thus, all the nodes collaborate to provide main memory space for storing the labels and the effective memory scales in proportion to the number of nodes. This is in stark contrast with paraPLL that stores on every node, rendering effective memory same as that of a single node.
2. Synchronization and Label Cleaning: For every superstep in DGLL, we decide the synchronization point a priori in terms of the number of SPTs to be created. The synchronization point computation is motivated by the label generation behavior of the algorithm. Fig. 2 shows that the number of labels generated by initial SPTs rooted at high rank vertices is very large and it drops exponentially as the rank decreases. To maintain cleaning efficiency with few synchronizations, we increase the number of SPTs constructed in the supersteps by a factor of , i.e. if superstep constructs SPTs, superstep will construct SPTs. This is unlike distributed paraPLL (parapll, ), where the same number of trees is constructed in every superstep.
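The rank-circular task assignment and the geometrically growing superstep sizes can be sketched as below. The function names, the integer `growth` factor and the choice of the first superstep size are illustrative assumptions; the text above does not fix these values.

```python
def roots_for_node(order, node_id, num_nodes):
    """Rank-circular assignment: node i takes roots i, i + q, i + 2q, ...
    `order` lists vertices from highest to lowest rank."""
    return order[node_id::num_nodes]

def superstep_sizes(num_roots, first, growth):
    """Number of SPTs built in each superstep, growing by `growth` each time.
    Assumes first >= 1 and an integer growth >= 1."""
    sizes, size, remaining = [], first, num_roots
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= growth
    return sizes

order = list("abcdefg")  # a has the highest rank
for node in range(3):
    print(node, roots_for_node(order, node, 3))
print(superstep_sizes(20, 2, 2))  # [2, 4, 8, 6]
```

Since every hub is a root owned by exactly one node, the same assignment also determines which node stores a given label, which is what makes the per-node label sets disjoint.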
After synchronization, the labels generated in a superstep are broadcast to all nodes. Each node creates a bitvector containing the responses of all cleaning queries. The bitvectors are then combined using an all-reduce operation to obtain the final redundancy information.
Note that DGLL uses both global and local tables to answer cleaning queries. Yet, interleaved cleaning is beneficial as it removes redundant labels, thereby reducing query response time for future cleaning steps. While label construction queries only use tables on the generator node, cleaning queries use tables on all nodes for every query. The presence of redundant labels can thus radically slow down cleaning. For some datasets, we empirically observe redundancy in labels generated in some supersteps.
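The distributed redundancy check can be simulated in a few lines. This sketch assumes that the all-reduce combines the per-node bitvectors with a logical OR (the text does not name the reduction operator) and models each node as a plain dict holding only the labels whose hubs it owns; no real message passing is involved and all names are hypothetical.

```python
def local_redundancy_bits(part, rank, candidates):
    """One node's answers to the cleaning queries.

    part: this node's share of the labeling, {vertex: {hub: dist}}, restricted
          to the hubs this node owns.
    candidates: broadcast labels as (vertex, hub, dist) tuples.
    """
    bits = []
    for v, hub, d in candidates:
        at_v, at_hub = part.get(v, {}), part.get(hub, {})
        bits.append(any(
            rank[h] > rank[hub] and h in at_v and dh + at_v[h] <= d
            for h, dh in at_hub.items()
        ))
    return bits

def all_reduce_or(per_node_bits):
    """Element-wise OR across nodes, standing in for an all-reduce."""
    return [any(column) for column in zip(*per_node_bits)]

# Triangle a -(1)- h -(1)- v, a -(3)- v with rank h > a > v on two nodes:
# node 0 owns hubs h and v, node 1 owns hub a.
rank = {"h": 3, "a": 2, "v": 1}
parts = [
    {"a": {"h": 1}, "h": {"h": 0}, "v": {"h": 1, "v": 0}},
    {"a": {"a": 0}, "v": {"a": 3}},
]
candidates = [("a", "a", 0), ("v", "a", 3), ("a", "h", 1), ("v", "h", 1)]
bits = all_reduce_or([local_redundancy_bits(p, rank, candidates) for p in parts])
print(bits)  # [False, True, False, False]
```

A witness hub's labels at both endpoints always live on the same node because labels are partitioned by hub, so each node can answer its part of a cleaning query without further communication; only the candidate labels and the bitvectors travel.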
5.2. Prune Labels and (do) Not (prune) Trees (PLaNT)
The redundancy check in DGLL can severely restrict scalability of the algorithm due to huge label broadcast traffic (both redundant and non-redundant labels), motivating the need for an algorithm that can avoid redundancy without communicating with other nodes.
To this end, we propose the Prune Labels and (do) Not (prune) Trees (PLaNT) algorithm, which accepts some loss in pruning efficiency to achieve a dramatic decrease in communication across nodes by outputting completely non-redundant labels without additional label cleaning. We note that the redundancy of a hub label is only determined by whether or not is the highest ranked vertex in . This is the key idea behind PLaNT: when constructing , if embedded information about high-ranked vertices on paths can be retrieved while resolving distance queries, will intrinsically have the requisite information to detect the redundancy of as a hub.
Algorithm 3 (PLaNTDijkstra) depicts the construction of a shortest path tree using PLaNT, which we call PLaNTing trees. Instead of pruning using distance or rank queries, PLaNTDijkstra tracks the most important ancestor encountered on the path from to by allowing ancestor values to propagate along with distance values. When is popped from the distance queue, a label is added to if neither nor is ranked higher than the root. Thus, for any shortest path , only succeeds in adding itself to the labels of and , guaranteeing minimality of the labeling while simultaneously respecting . Figure (c) provides a detailed graphical illustration of label generation using PLaNT and shows that it generates the same labeling as the canonical PLL.
If there are multiple shortest paths from to , the path with the highest-ranked ancestor is selected. This is achieved in the following manner: when a vertex is popped from the Dijkstra queue and its edges are relaxed, the ancestor of a neighbor is allowed to update even if the newly calculated tentative distance to is equal to the currently assigned distance to (line 12 of algorithm 3). For example, in fig. (c), the shortest paths to , and have the same length and is selected by setting because .
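The ancestor propagation and the tie-breaking rule can be sketched as follows. This is an illustrative reading of the description above, not a transcription of algorithm 3: it stores, for every vertex, the highest ranked vertex on the selected path including both endpoints, so that a label is emitted exactly when that vertex is the root. It assumes positive integer weights (so that equal distances compare exactly) and a `rank` dict in which larger values mean higher rank; all names are hypothetical.

```python
import heapq
import math

def plant_dijkstra(graph, rank, root):
    """Return {v: dist} for every v that should receive the label (root, dist)."""
    dist = {root: 0}
    anc = {root: root}  # highest ranked vertex on the chosen path root..v
    heap = [(0, root)]
    done = set()
    new_labels = {}
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        if anc[v] == root:  # nothing on the path outranks the root
            new_labels[v] = d
        for nbr, w in graph[v]:
            if nbr in done:
                continue
            nd = d + w
            cand = max(anc[v], nbr, key=rank.__getitem__)
            if nd < dist.get(nbr, math.inf):
                dist[nbr], anc[nbr] = nd, cand
                heapq.heappush(heap, (nd, nbr))
            elif nd == dist[nbr] and rank[cand] > rank[anc[nbr]]:
                anc[nbr] = cand  # equal length: prefer the higher ranked ancestor

    return new_labels

def plant_labeling(graph, rank):
    """Build the full labeling; trees are independent, so any root order works."""
    labels = {v: {} for v in graph}
    for root in graph:
        for v, d in plant_dijkstra(graph, rank, root).items():
            labels[v][root] = d
    return labels

# Diamond r-x-t and r-y-t with unit weights and rank y > r > x > t.
diamond = {"r": [("x", 1), ("y", 1)], "x": [("r", 1), ("t", 1)],
           "y": [("r", 1), ("t", 1)], "t": [("x", 1), ("y", 1)]}
rank = {"y": 4, "r": 3, "x": 2, "t": 1}
print(plant_dijkstra(diamond, rank, "r"))  # {'r': 0, 'x': 1}
```

In the diamond, t is first reached through x with ancestor r, but the equally long path through y replaces the ancestor with y, so the tree rooted at r does not label t. Without the tie-breaking branch, r would be inserted as a redundant hub of t.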
Note that PLaNT not only avoids dependency on labels on remote nodes, but it rids SPT construction of any dependency on the output of other SPTs, effectively providing an embarrassingly parallel solution for CHL construction with depth (complexity of a single instance of Dijkstra) and work. Due to its embarrassingly parallel nature, PLaNT does not require SPTs to be constructed in a specific order. However, to enable optimizations discussed later, we follow the same rank-determined order in PLaNT as used in DGLL (section 5.1).
Early Termination: To improve the computational efficiency of PLaNT and prevent it from exploring the full graph for every SPT, we propose the following simple early termination strategy: stop further exploration when the rank of either the ancestor or the vertex itself is higher than the root for all vertices in Dijkstra’s distance queue.⁶ Early termination has the potential to dramatically cut down traversal in SPTs with low-ranked roots.
⁶ Further exploration from such vertices will only result in shortest paths with at least one vertex ranked higher than the root and hence, no labels will be generated.
Despite early termination, PLaNTed trees can possibly explore a large part of the graph which PLL would have pruned. Fig. 3 shows that in PLaNT, the number of vertices explored in an SPT per label generated () can be . A large value of implies a lot of exploration overhead that the PLL algorithm would have avoided by pruning.
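Early termination can be added to a PLaNT-style Dijkstra with a single counter, as in the sketch below. It assumes the same conventions as before (positive integer weights, larger `rank` value means higher rank, the stored ancestor includes the vertex itself) and keeps a count, here called `live`, of queued vertices whose path still contains nothing ranked above the root; the names and the returned `explored` counter are illustrative additions.

```python
import heapq

def plant_tree(graph, rank, root, early_stop=True):
    """Return (labels, explored): labels is {v: dist}, explored counts popped vertices."""
    dist, anc = {root: 0}, {root: root}
    heap, done, labels = [(0, root)], set(), {}
    live = 1      # queued vertices that could still produce a label
    explored = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        explored += 1
        if anc[v] == root:
            labels[v] = d
            live -= 1  # v has left the queue
        for nbr, w in graph[v]:
            if nbr in done:
                continue
            nd = d + w
            cand = max(anc[v], nbr, key=rank.__getitem__)
            old = anc.get(nbr)
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
            elif not (nd == dist[nbr] and rank[cand] > rank[old]):
                continue  # neither shorter nor a better tie: nothing changes
            anc[nbr] = cand
            live += (cand == root) - (old == root)
        if early_stop and live == 0:
            break  # every queued vertex is already outranked on its path
    return labels, explored

# Path 0-1-2-3-4-5 with unit weights; vertex 1 outranks everything, root 0 is lowest.
path = {i: [(j, 1) for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
rank = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 0: 1}
print(plant_tree(path, rank, 0))                    # ({0: 0}, 1)
print(plant_tree(path, rank, 0, early_stop=False))  # ({0: 0}, 6)
```

Both runs produce the same single label, but the early-terminated tree stops after one vertex instead of six. Dividing `explored` by the number of labels gives the exploration-per-label ratio discussed above for a single tree.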
5.2.1. Hybrid PLaNT + DGLL
Apart from its embarrassingly parallel nature, an important virtue of PLaNT is its compatibility with DGLL. Since PLaNT also constructs SPTs in rank order and generates labels with the root as the hub, we can seamlessly transition between PLaNT and DGLL to enjoy the best of both worlds. We propose a Hybrid algorithm that initially uses PLaNT and switches to DGLL after a certain number of SPTs. The initial SPTs rooted at high-ranked vertices generate most of the labels in CHL (fig. 2) and exhibit a low value (fig. 3). By PLaNTing these SPTs, we efficiently parallelize the bulk of the computation and avoid communicating a large fraction of the overall label set, at the cost of little additional exploration in the trees. By doing PLaNT for the initial SPTs, we also avoid a large number of distance queries that PLL or DGLL would have done on all the visited vertices in those SPTs. In the later half of execution, becomes high and very few labels are generated per SPT. The Hybrid algorithm uses DGLL in this half to exploit the heavy pruning and avoid the inefficiencies associated with PLaNT.
The Hybrid algorithm is a natural fit for scale-free networks. These graphs may have a large tree-width but they exhibit a core-fringe structure with a small dense core whose removal leaves a fringe like structure with very low tree-width (weiCoreFringe, ; akibaCoreFringe, ). Typical degree and centrality based ordering schemes also tend to rank the vertices in the dense core highly. In such graphs, the Hybrid algorithm uses PLaNT to construct SPTs rooted at core vertices which generate a large number of labels. SPTs rooted on fringe vertices generate few labels and are constructed using DGLL which exploits heavy pruning to limit computation.
For graphs with a core-fringe structure, a relaxed tree decomposition parameterized by an integer can be computed such that , where is the root of and maps vertices in (bags) to subset of vertices in (akibaCoreFringe, ). In other words, except root bag, is bounded by a parameter .
Lemma 0 ().
The hybrid algorithm performs work, broadcasts only data, generates hub labels and answers each query in time.
Consider the relaxed tree decomposition
with root and perform centroid decomposition on all subtrees rooted at the children of to obtain tree . The height of any tree in the forest generated by removing from is . Hence, the height of .
Consider a ranking where is determined by the minimum depth bag . For GLL, the number of labels generated by SPTs from vertices in root bag is . Combining this with lemma 6, we can say that total labels generated by GLL is and query complexity is . The same also holds for the Hybrid algorithm since it outputs the same CHL as GLL.
If Hybrid algorithm constructs SPTs using PLaNT and rest using DGLL, the overall work-complexity is .
The Hybrid algorithm only communicates labels generated after switching to DGLL, resulting in
data broadcast. In comparison, doing only DGLL for the same ordering will broadcast data. ∎
In practice, we use the ratio of vertices explored to labels generated in an SPT as a heuristic, dynamically switching from PLaNT to DGLL when it becomes greater than a threshold.
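The switching heuristic can be sketched as follows. This is a minimal illustration, assuming that the ratio is the number of vertices explored per label generated in an SPT (section 5.2) and that per-tree statistics are available in rank order. The function name `switch_point` and its input format are hypothetical.

```python
def switch_point(tree_stats, threshold):
    """Return how many SPTs are PLaNTed before switching to DGLL.

    tree_stats : iterable of (vertices_explored, labels_generated), one entry
                 per SPT in rank order (hypothetical input format)
    threshold  : switching threshold on explored / labels
    """
    planted = 0
    for explored, labels in tree_stats:
        planted += 1                      # this tree was built with PLaNT
        ratio = explored / max(labels, 1)
        if ratio > threshold:
            break                         # all later trees use DGLL
    return planted
```

The tree whose ratio first exceeds the threshold has already been PLaNTed when the ratio is observed, so DGLL starts with the next root.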
Lemma 0 ().
The Hybrid algorithm consumes
main memory per node, where is the number of nodes used.
Distributed labels use space per node and storing the graph requires space. ∎
5.3. Enabling efficient Multi-node pruning
We propose an optimization that simultaneously solves the following two problems:
1. Pruning traversal in PLaNT: PLaNT cannot prune using rank or distance queries because, when pruning with partial label information, an SPT can still visit vertices that would have been pruned if all prior labels were available, possibly through non-shortest paths with the wrong ancestor information. This can lead to redundant label generation and defeat the purpose of PLaNT.
In general, if a node prunes using , it must have to guarantee non-redundant labels. When this condition is met, a vertex is either visited through the shortest path with correct ancestor or is pruned. We skip the proof details for brevity.
2. Redundant labels in DGLL: Fig. 4 shows the label count generated by PLL if pruning queries are restricted to use hub labels from only a few top-ranked hubs. We observe that the label count decreases dramatically even if pruning utilizes only a few of the highest-ranked hubs.
Thus, for a given integer , if we store all labels from most important hubs on every compute node i.e. , we can
use distance queries on to prune PLaNTed trees, and
drastically increase pruning efficiency of DGLL.
To this end, we allocate a Common Label Table on every node that stores common labels . These labels are broadcast even if they are generated by PLaNT. For , using common labels incurs additional broadcast traffic, queries of complexity each, and consumes more memory per node. Thus, it does not alter the theoretical bounds on work-complexity, communication volume and space requirements of the Hybrid algorithm. In our implementation, we store labels from highest ranked hubs in the Common Label Table.
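A distance-query pruning test against the Common Label Table might look like the sketch below. It assumes the table is a dictionary mapping each vertex to its labels from the top-ranked hubs only. The name `prune_with_common_labels` and the data layout are illustrative assumptions.

```python
def prune_with_common_labels(common, root, v, d):
    """Return True if the traversal from `root` can be pruned at `v`.

    common : dict vertex -> dict hub -> distance, restricted to the
             highest-ranked hubs (assumed layout of the Common Label Table)
    d      : tentative distance from root to v found by the traversal
    """
    small = common.get(root, {})
    large = common.get(v, {})
    if len(small) > len(large):
        small, large = large, small       # iterate over the shorter label set
    # prune if some common hub already certifies a path of length <= d
    return any(h in large and dh + large[h] <= d for h, dh in small.items())
```

Because the table only holds labels from a few hubs, each such query touches a short label set regardless of how large the full labeling grows.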
The ideas discussed in sections 4 and 5 are not restricted to multi-node clusters and can be used for any massively parallel system, such as GPUs. On GPUs, simply parallelizing PLL is not enough, because the first tree constructed by every concurrent thread will not have any label information from higher ranked SPTs and will not prune at all on distance queries. For a GPU which can run thousands of concurrent threads, this can lead to an unacceptable blowup in label size, making Label Cleaning extremely time consuming. Even worse, the system may simply run out of memory. Instead, we can use PLaNT to construct the first few SPTs for every thread and switch to GLL afterwards. Our approach can also be extended to disk-based processing where the access cost to labels is very high. The Common Label Table can be mapped to faster memory in the hierarchy (DRAM) to accelerate pre-processing. Finally, we note that by storing the parent of each vertex in an SPT along with the corresponding hub label, CHL can also be used to compute shortest paths in time linear in the number of edges in the paths.
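The path recovery mentioned at the end of the paragraph above can be sketched as follows, assuming each label stores a (distance, parent) pair where the parent is the next vertex on the SPT path towards the hub. The function name `shortest_path` and the label layout are illustrative.

```python
def shortest_path(labels, u, v):
    """Recover a shortest u-v path from hub labels that store SPT parents.

    labels : dict vertex -> dict hub -> (distance, parent), where parent is the
             next vertex towards the hub and None at the hub itself (assumed)
    Returns a list of vertices from u to v, or None if no common hub exists.
    """
    common = labels[u].keys() & labels[v].keys()
    if not common:
        return None
    hub = min(common, key=lambda h: labels[u][h][0] + labels[v][h][0])

    def walk(x):
        # follow parent pointers from x up to the hub
        path = [x]
        while x != hub:
            x = labels[x][hub][1]
            path.append(x)
        return path

    to_hub = walk(u)
    from_hub = list(reversed(walk(v)[:-1]))   # drop the duplicated hub
    return to_hub + from_hub
```

Every vertex on the SPT path between a hub and a labeled vertex also carries a label for that hub in a canonical labeling, so each step is a single lookup and the walk is linear in the number of edges on the path.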
We provide three modes to the user for distance queries:
Querying with Fully Distributed Labels (QFDL): In this mode, the label set of every vertex is partitioned between all nodes . A query is broadcast to all nodes and the individual responses of the nodes are reduced using MPI_MIN to obtain the shortest path distance. This mode utilizes the parallel processing power of multiple nodes and consumes only memory per node, but incurs high communication costs.
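The QFDL query flow can be mimicked on a single machine with the sketch below. It replaces the broadcast and the MPI_MIN reduction with plain Python calls, and it assumes that all labels sharing a hub are stored on the same node, so each node can evaluate its common hubs locally. The name `qfdl_query` and the data layout are illustrative.

```python
import math


def qfdl_query(node_labels, u, v):
    """Simulate a QFDL query over per-node label partitions.

    node_labels : list with one entry per node; each entry is a
                  dict vertex -> dict hub -> distance holding that node's
                  share of the labels (assumed to be partitioned by hub)
    """
    def local_min(part):
        lu = part.get(u, {})
        lv = part.get(v, {})
        return min((d + lv[h] for h, d in lu.items() if h in lv),
                   default=math.inf)

    # every node answers with its local minimum; taking the minimum over
    # nodes stands in for the reduction with MPI_MIN
    return min(local_min(part) for part in node_labels)
```

A node that holds no common hub for the pair contributes infinity, which leaves the global minimum unchanged.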
Querying with Distributed Overlapping Labels (QDOL): In this mode, we divide the vertex set into partitions. For every possible partition pair, a node is allocated that stores the complete label set of all vertices in that pair. Thus, a given query is answered completely by a single node but not by every node. Unlike QFDL, this mode utilizes the more efficient P2P communication instead of broadcasting. Each query is mapped to the node that has labels for vertex partitions containing and . The query is then communicated to this node, which single-handedly computes and sends back the response. In this case, multi-node parallelism is exploited in a batch of queries where different nodes simultaneously compute responses to the respective queries mapped to them.
For a cluster of nodes, can be computed as follows:
Storing labels of two vertex partitions consumes
memory per node (much larger than QFDL).
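As a hedged illustration of the partition-pair scheme, the sketch below sizes the partitions and maps a query to a node. It does not reproduce the paper's closed form; it assumes the natural rule of choosing the largest partition count whose number of unordered pairs does not exceed the number of nodes, and it assigns vertices to partitions by integer id modulo the partition count. Both rules and the function names are assumptions.

```python
import math


def num_partitions(q):
    """Largest zeta with zeta * (zeta - 1) / 2 <= q, for q >= 1 (assumed rule)."""
    return (1 + math.isqrt(1 + 8 * q)) // 2


def node_for_query(u, v, zeta):
    """Node id that stores the labels of both partitions touched by (u, v).

    Vertices are integers assigned to partitions by id modulo zeta (assumption).
    """
    i, j = sorted((u % zeta, v % zeta))
    if i == j:
        # both endpoints share a partition: any node holding it can answer
        i, j = sorted((i, (i + 1) % zeta))
    # position of the unordered pair (i, j), i < j, in lexicographic order
    return i * zeta - i * (i + 1) // 2 + (j - i - 1)
```

Under this rule a 64-node cluster would use 11 partitions and 55 partition pairs, leaving the remaining nodes free, for example to replicate heavily queried pairs.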
We conduct shared-memory experiments on a 36-core, 2-way hyperthreaded, dual-socket Linux server with two Intel Xeon E5-2695 v4 processors @ 2.1 GHz and 1 TB DRAM. For the distributed-memory experiments, we use a 64-node cluster with each node having an 8-core, 2-way hyperthreaded Intel Xeon E5-2665 @ 2.4 GHz processor and 64 GB DRAM. We use OpenMP v4.5 and OpenMPI v3.1.2 for parallelization. Our shared-memory implementations use all cores with hyperthreading, and the distributed implementations use all cores with hyperthreading on each node.
Baselines: We use sequential PLL (seqPLL) and the state-of-the-art paraPLL shared-memory (SparaPLL) and distributed (DparaPLL) versions for comparison of pre-processing efficiency. We enable SparaPLL with the dynamic task assignment policy for good load balancing. Our implementation of DparaPLL (the paraPLL code is not publicly available) executes SparaPLL on every compute node using a task queue with circular allocation (section 5.1). We observed that DparaPLL scales poorly as the number of compute nodes increases. This is because of high communication overhead and a label size explosion that can be attributed to the absence of rank queries. Therefore, we also plot the performance of DGLL for better baselining. Both the DGLL and DparaPLL implementations synchronize times to exchange labels among the nodes.
Implementation Details: We vary the synchronization threshold in GLL and switching threshold in the Hybrid algorithm to empirically assess the impact of these parameters on algorithm performance. Figure 5 shows the impact of on GLL. We note that the execution time is robust to significant variations in within a range of to . Intuitively, a small value of reduces cleaning time (section 4.2) but making it too small can lead to frequent synchronizations that hurt parallel performance. Based on our observations, we set for further experiments. Figure 6 shows the effect of on the performance of hybrid algorithm. Intuitively, keeping too large increases the computation overhead (seen in scale-free networks) because even low-ranked SPTs that generate few labels, are PLaNTed. On the other hand, keeping too small results in poor scalability (seen in road networks) as the algorithm switches to DGLL quickly and parallelism and communication avoidance of PLaNT remain underutilized. Based on these findings, we set for scale-free networks and for road networks.
We evaluate our algorithms on real-world graphs with varied topologies, as listed in table 2. The scale-free networks do not have edge weights from the download sources. For such graphs, we assign edge weights between uniformly at random.
The ranking is determined by betweenness for road networks(vldbExperimental, ) and degree for scale-free networks(akibaPLL, ). Betweenness is approximated by sampling a few shortest path trees and both methods are inexpensive to compute.
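For the degree-based ordering, a minimal sketch is given below. It assumes an adjacency-list dictionary and breaks ties by vertex id, which is an arbitrary choice made only to keep the ranks distinct. The function name `degree_ranking` is illustrative.

```python
def degree_ranking(adj):
    """Rank vertices by degree; a larger rank value means a more important vertex.

    adj : dict vertex -> list of neighbors
    Ties are broken by vertex id (assumption) so that all ranks are distinct.
    """
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    n = len(order)
    return {v: n - position for position, v in enumerate(order)}
```

The resulting dictionary uses the same convention as a rank lookup in which the highest-degree vertex is processed first.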
| Dataset | # Vertices | # Edges | Description | Type |
|---|---|---|---|---|
| CAL (roadDimacs, ) | 1,890,815 | 4,657,742 | California Road network | Undirected |
| EAS (roadDimacs, ) | 3,598,623 | 8,778,114 | East USA Road network | Undirected |
| CTR (roadDimacs, ) | 14,081,816 | 34,292,496 | Center USA Road network | Undirected |
| USA (roadDimacs, ) | 23,947,347 | 58,333,344 | Full USA Road network | Undirected |
| SKIT (skitter, ) | 192,244 | 636,643 | Skitter Autonomous Systems | Undirected |
| WND (und, ) | 325,729 | 1,497,134 | Univ. Notre Dame webpages | Directed |
| AUT (coauth, ) | 227,320 | 814,134 | Citeseer Collaboration | Undirected |
| YTB (konect, ) | 1,134,890 | 2,987,624 | Youtube Social network | Undirected |
| ACT (konect, ) | 382,219 | 33,115,812 | Actor Collaboration Network | Undirected |
| BDU (konect, ) | 2,141,300 | 17,794,839 | Baidu HyperLink Network | Directed |
| POK (konect, ) | 1,632,803 | 30,622,564 | Social network Pokec | Directed |
| LIJ (konect, ) | 4,847,571 | 68,993,773 | LiveJournal Social network | Directed |
7.2. Evaluation of Shared-memory Algorithms
Table 3 compares the performance of GLL with LCC, SparaPLL and seqPLL. It also shows the Average Label Size (ALS) per vertex in CHL (GLL, LCC and seqPLL) and labeling generated by SparaPLL. The query response time is directly proportional to Average Label Size (ALS) per vertex and hence, ALS is a crucial parameter for any hub labeling algorithm. In case of LIJ graph, none of the shared-memory algorithms finished execution and its CHL ALS was obtained from the distributed algorithms.
We observe that on average, GLL generates fewer labels than paraPLL which can be quite significant for an application that generates many PPSD queries. GLL is only slower than paraPLL on average even though it re-checks every label generated and the cleaning queries use linear-merge based querying as opposed to the more efficient hash-join label construction queries. (For space efficiency, the labels are only stored as (ordered by vertex) and there is no copy of labels stored as (ordered by hubs).)
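The two query styles contrasted above can be sketched as follows. The sketch assumes that each label set is a list of (hub, distance) pairs sorted by hub id, and that the hash-join variant has the root's labels loaded into a dictionary once per SPT. The function names are illustrative.

```python
import math


def merge_query(lu, lv):
    """Linear-merge distance query over two label lists sorted by hub id."""
    i = j = 0
    best = math.inf
    while i < len(lu) and j < len(lv):
        hu, du = lu[i]
        hv, dv = lv[j]
        if hu == hv:                  # common hub found
            best = min(best, du + dv)
            i += 1
            j += 1
        elif hu < hv:
            i += 1
        else:
            j += 1
    return best


def hash_query(root_dist, lv):
    """Hash-join distance query: root_dist is a dict hub -> distance for the root."""
    return min((root_dist[h] + d for h, d in lv if h in root_dist),
               default=math.inf)
```

The merge variant scans both label lists, whereas the hash-join variant scans only the labels of the visited vertex once the root's dictionary has been built, which is why it suits construction queries that share one root across a whole SPT.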
For some graphs such as CAL, GLL is even faster than SparaPLL. This is because of rank queries, faster label construction queries due to smaller sized label sets and lock avoidance in GLL. Although not shown in table 3 for brevity, we observed that ALS and scalability of paraPLL worsen as number of threads increase. Hence, we expect the relative performance of GLL to be even better on systems with more parallelism.
Fig. 7 shows execution time breakup for LCC and GLL. GLL cleaning is significantly faster than LCC because of the reduced cleaning complexity (section 4.2). Overall, GLL is faster than LCC on average. However, for some graphs such as CAL, fraction of cleaning time is even for GLL. This is because in the first superstep of GLL, number of labels generated is more than as there are no labels available for distance query pruning and number of simultaneous SPTs under construction is ( is # threads). This problem can be circumvented by using PLaNT for the first superstep in shared-memory implementation as well.
7.3. Evaluation of Distributed-memory Algorithms
To assess the scalability of distributed hub labeling algorithms, we vary from 1 to 64 (# compute cores from 8 to 512). Fig. 8 shows the strong scaling of different algorithms in terms of labeling construction time.
We note that both DparaPLL and DGLL do not scale well as increases. DparaPLL often runs out-of-memory when is large. This is because in the first superstep itself, a large number of hub labels are generated that when exchanged, overwhelm the memory of the nodes. DGLL, on the other hand, limits the amount of labels per superstep by synchronizing frequently in the early stage of execution and increasing the synchronization point later.
Moreover, due to the absence of rank queries, the label size of DparaPLL explodes as increases (fig. 9). The efficiency of distance query based pruning in DparaPLL suffers because on every compute node, labels from several high-ranked hubs (that cover a large number of shortest paths) are missing. As the label size explodes, distance queries become expensive and the pre-processing becomes dramatically slower. On the other hand, rank queries in DGLL allow pruning even at those hubs whose SPTs were not created on the querying node. Further, DGLL periodically cleans redundant labels, thus retaining the performance of distance queries. Yet, DGLL incurs significant communication and slows down as more compute nodes are involved in pre-processing. Neither DparaPLL nor DGLL is able to process the large CTR, USA, POK and LIJ datasets, the former running out of memory and the latter failing to finish execution within the time limit.
Table 4: Query throughput, query latency and memory usage (GB) of the query modes for each dataset.
PLaNT, on the other hand, paints a completely different picture. Owing to its embarrassingly parallel nature, PLaNT exhibits excellent near-linear speedup up to for almost all datasets. On average, PLaNT is able to achieve speedup on nodes compared to single node execution. However, for scale-free graphs, PLaNT is not efficient. It is unable to process LIJ and takes more than an hour to process the POK dataset even on nodes.
The Hybrid algorithm combines the scalability of PLaNT with the pruning efficiency of DGLL (powered by Common Labels). It scales well up to and for most datasets, achieves speedup over single node execution. At the same time, for the large scale-free datasets ACT, BDU and POK, it is able to construct CHL , and faster than PLaNT, respectively, on nodes. When processing scale-free datasets on a small number of compute nodes (, or nodes), Hybrid beats PLaNT by more than an order of magnitude in execution time. Compared to DparaPLL, the Hybrid algorithm is faster on average when run on compute nodes. For SKIT and WND, the Hybrid algorithm is and faster, respectively, than DparaPLL on nodes.
Although fig. 9 only plots ALS for the Hybrid algorithm, PLaNT and DGLL generate the same CHL and hence have the same label size for any dataset irrespective of (section 5). We also observe superlinear speedup in some cases (Hybrid CAL and AUT node vs nodes; PLaNT CTR nodes vs nodes). This is because these datasets generate a huge number of labels that exert a lot of pressure on the memory. When running on few nodes, main memory utilization on every node is very high, slowing down the memory allocation operations that happen when new labels are appended. In such cases, increasing the number of nodes not only increases compute cores but also releases pressure on the memory management system, resulting in a superlinear speedup.
Graph Topologies: We observe that PLaNT alone not only scales well but is also extremely efficient for road networks. On the other hand, in scale-free networks, PLaNT although scalable is not efficient as it incurs large overhead of additional exploration in low-ranked SPTs. This is consistent with our observations in figure 3 where the maximum value of for SKIT was that of maximum in CAL road network. Consequently, the Hybrid algorithm that cleverly manages the trade-off between additional exploration and communication avoidance, is significantly faster than PLaNT for most scale-free networks.
We also observe that the Hybrid algorithm does not scale equally well for small datasets when the number of compute nodes is high. PLaNT eventually catches up with the Hybrid, even beating it in several cases. This is because even few synchronizations of large number of nodes completely dominate the small pre-processing time. Perhaps, scalability for small datasets can be improved by making the number of synchronizations and switching point from PLaNT to DGLL, a function of both and .
7.4. Evaluation of Query Modes
In this section, we assess the different query modes on the basis of their memory consumption, query response latency and query processing throughput. Table 4 shows the memory consumed by label storage under different modes. QLSN requires all labels to be stored on every node and is the most memory hungry mode. Both QDOL and QFDL distribute the labels across multiple nodes enabling queries on large graphs where QLSN fails. Our experiments also confirm the theoretical insights into the memory usage of QFDL and QDOL presented in section 6. On average, QDOL requires more main memory for label storage than QFDL. This is because the label size per partition in QDOL scales with and every compute node has to further store label set of such partitions.
To evaluate the latency of various query modes, we generate million PPSD queries and compute their responses one at a time. In QFDL (QDOL) mode, one query is transmitted per MPI_Bcast (MPI_Send, respectively) and inter-node communication latency becomes a major contributor to the overall query response latency. This is evident from the experimental results (table 4) where the latency of QFDL shows little variation across different datasets. In contrast, QLSN does not incur inter-node communication and, compared to QDOL and QFDL, has significantly lower latency although it increases proportionally with ALS. For most datasets, QDOL latency is compared to QFDL, because of the cheaper point-to-point communication as opposed to more expensive broadcasts (section 6). An exception is POK, where the average label size is huge (table 3) and QFDL takes advantage of multi-node parallelism to reduce latency.
To evaluate the query throughput, we create a batch of million PPSD queries and compute their responses in parallel. For most datasets, the added multi-node parallelism of QFDL and QDOL overcomes the query communication overhead and results in higher throughput than QLSN. (In QDOL mode, prior to communicating the queries, we sort them based on the nodes that they map to. After receiving query responses, we rearrange them in the original order. The throughput reported in table 4 also takes into account the time for sorting and rearranging.) QDOL is further faster than QFDL because of reduced communication overhead. (QDOL also has better memory access locality as every node traverses all hub labels of vertices in queries assigned to it. In contrast, each node in QFDL only traverses a part of the hub labels for all queries, frequently jumping from one vertex's labels to another.)
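The sort-and-rearrange step used for QDOL batches can be sketched as below. The callbacks `node_of` and `answer_on_node` are hypothetical stand-ins for the query-to-node mapping and for the point-to-point exchange with a node; the sketch only shows the bookkeeping that restores the original query order.

```python
from itertools import groupby


def answer_batch(queries, node_of, answer_on_node):
    """Answer a batch of (u, v) queries, contacting each node once.

    node_of(u, v)              -> id of the node that owns the query
    answer_on_node(node, qs)   -> list of distances, in the order of qs
    """
    # sort query indices by owning node so each node receives one chunk
    order = sorted(range(len(queries)), key=lambda i: node_of(*queries[i]))
    responses = [None] * len(queries)
    for node, group in groupby(order, key=lambda i: node_of(*queries[i])):
        idx = list(group)
        results = answer_on_node(node, [queries[i] for i in idx])
        for i, r in zip(idx, results):
            responses[i] = r          # scatter back to the original position
    return responses
```

Sorting indices rather than the queries themselves keeps the mapping back to the original order implicit, so no separate permutation has to be inverted after the responses arrive.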
8. Conclusion and Future Work
In this paper, we address the problem of efficiently constructing Hub Labeling and answering shortest distance queries on shared and distributed memory parallel systems. We outline the multifaceted challenges associated with the algorithm in general, and those specific to the parallel processing platforms. We propose novel algorithmic innovations and optimizations that systematically resolve these challenges. Our embarrassingly parallel algorithm, PLaNT, dramatically increases the scalability of hub labeling, making it feasible to utilize the massive parallelism in a cluster of compute nodes.
We show that our approach exhibits good theoretical and empirical performance. Our algorithms are able to scale significantly better than the existing approaches, with orders of magnitude faster pre-processing and capability to process very large graphs.
There are several interesting directions to pursue in the context of this work. We will explore the use of distributed atomics and RMA calls to dynamically allocate tasks even on multiple nodes. This can improve load balance across nodes and further boost the performance of PLaNT and Hybrid algorithms. Another important area for research is development of heuristics to compute switching point between PLaNT and DGLL and to autotune the parameters and to adapt to the graph topology and number of compute nodes .
-  The skitter as links dataset, 2019. [Online; accessed 8-April-2019].
-  I. Abraham, D. Delling, A. V. Goldberg, and R. F. Werneck. Hierarchical hub labelings for shortest paths. In European Symposium on Algorithms, pages 24–35. Springer, 2012.
-  T. Akiba, Y. Iwata, and Y. Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 349–360. ACM, 2013.
-  T. Akiba, C. Sommer, and K.-i. Kawarabayashi. Shortest-path queries for complex networks: exploiting low tree-width outside the core. In Proceedings of the 15th International Conference on Extending Database Technology, pages 144–155. ACM, 2012.
-  R. Albert, H. Jeong, and A.-L. Barabási. Internet: Diameter of the world-wide web. Nature, 401(6749):130, 1999.
-  E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal on Computing, 32(5):1338–1355, 2003.
-  C. Demetrescu, A. Goldberg, and D. Johnson. 9th DIMACS implementation challenge: shortest paths. American Mathematical Society, 2006.
-  L. Dhulipala, G. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 293–304. ACM, 2017.
-  Q. Dong, K. Lakhotia, H. Zeng, R. Karman, V. Prasanna, and G. Seetharaman. A fast and efficient parallel algorithm for pruned landmark labeling. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–7. IEEE, 2018.
-  D. Ferizovic and G. E. Blelloch. Parallel pruned landmark labeling for shortest path queries on unit-weight networks. 2015.
-  J. S. Firoz, M. Zalewski, T. Kanewala, and A. Lumsdaine. Synchronization-avoiding graph algorithms. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pages 52–61. IEEE, 2018.
-  R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Meeting on Algorithm Engineering & Experiments, pages 90–100. Society for Industrial and Applied Mathematics, 2008.
-  M. Jiang, A. W.-C. Fu, R. C.-W. Wong, and Y. Xu. Hop doubling label indexing for point-to-point distance querying on scale-free networks. Proceedings of the VLDB Endowment, 7(12):1203–1214, 2014.
-  J. Kunegis. Konect: the koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350. ACM, 2013.
-  J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th international conference on World Wide Web, pages 695–704. ACM, 2008.
-  W. Li, M. Qiao, L. Qin, Y. Zhang, L. Chang, and X. Lin. Scaling distance labeling on small-world networks. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, pages 1060–1077, New York, NY, USA, 2019. ACM.
-  Y. Li, M. L. Yiu, N. M. Kou, et al. An experimental study on hub labeling based shortest path algorithms. Proceedings of the VLDB Endowment, 11(4):445–457, 2017.
-  D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471. ACM, 2013.
-  K. Qiu, Y. Zhu, J. Yuan, J. Zhao, X. Wang, and T. Wolf. Parapll: Fast parallel shortest-path distance query on large-scale weighted graphs. In Proceedings of the 47th International Conference on Parallel Processing, page 2. ACM, 2018.
-  J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices, volume 48, pages 135–146. ACM, 2013.
-  F. Wei. Tedi: efficient shortest path query answering on graphs. In Graph Data Management: Techniques and Applications, pages 214–238. IGI Global, 2012.
-  X. Zhu, W. Chen, W. Zheng, and X. Ma. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 301–316, 2016.