Locality-based Graph Reordering for Processing Speed-Ups and Impact of Diameter

11/24/2021
by   Vedant Satav, et al.

Abstract

Graph analysis involves a high number of random memory access patterns. Earlier research has shown that the cache miss latency is responsible for more than half of the graph processing time, with the CPU execution having the smaller share. There has been significant study on decreasing the CPU computing time, for example by employing better cache prefetching and replacement policies. In this paper, we study the various methods that do so by attempting to decrease the CPU cache miss ratio.

Graph Reordering attempts to exploit the power-law distribution of graphs (in these sparse graphs, a few vertices have a very high number of connections) to keep the frequently accessed vertices together locally and hence decrease the cache misses. However, reordering the graph by keeping the hot vertices together may affect the spatial locality of the graph, and thus add to the total CPU compute time. We also need to control the total reordering time and its inverse relation with the final CPU execution time.

In order to exploit this trade-off between reordering as per vertex hotness and spatial locality, we introduce the light-weight Community-based Reordering. We attempt to maintain the community structure of the graph by storing the hot members of each community locally together. The implementation also takes into consideration the impact of graph diameter on the execution time. We compare our implementation with other reordering implementations and find significantly better results on five graph processing algorithms: BFS, CC, CC_SV, PR and BC. LOrder achieved a speed-up of up to 7x and an average speed-up of 1.2x compared to other reordering algorithms.

Index terms: graph analysis, power-law distribution, ground-truth communities

2.1 Structure of Natural Graphs

Real-world graphs are large in size and have connections governed by natural interaction patterns; they thus carry inherent connectivity information and display community structures. The memory access patterns are greatly governed by these neighbourhood laws. Such natural graphs can represent many of our network applications, like social networks, computer networks, maps, etc. Two prevalent properties of natural graphs that we build upon are:
Power-law distribution: Only a small fraction of the vertices have a high degree of connectivity. This skewed degree distribution is an important property with respect to cache reuse, since these few hot vertices have a high chance of reuse, being neighbours of a large number of vertices. A memory block holding many hot vertices would likely be queried a large number of times, and if it were maintained in the cache, the cache miss rate could be reduced. Table 1 quantifies this skew for the datasets used in this paper. The table contains the percentage of hot vertices in the entire graph along with the number of edges that come in or out of these vertices. Here, a vertex is considered hot if its degree is higher than the average degree of the graph (a minimal sketch of this hotness test appears below, after the description of community structures).


Community structures: Graphs based on natural connection patterns display densely inter-connected regions. These clusters are called communities. For example, in a social network, members sharing similar likes and preferences are bound to converge into groups; the corresponding graph would represent these groups as closely-knit communities of vertices. In geographical graphs used for navigation, smaller towns are positioned around important towns or cities.
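As referenced in the power-law item above, a minimal C++ sketch of the hotness test (degree above the graph's average degree) could look as follows. The function and array names are illustrative only and are not taken from any existing framework.

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Flag vertices whose degree exceeds the graph's average degree,
    // the hotness threshold used in this paper (assumes a non-empty graph).
    std::vector<bool> FlagHotVertices(const std::vector<int64_t>& degree) {
        const double avg_degree =
            std::accumulate(degree.begin(), degree.end(), 0.0) / degree.size();
        std::vector<bool> is_hot(degree.size());
        for (size_t v = 0; v < degree.size(); ++v)
            is_hot[v] = degree[v] > avg_degree;  // hot = above-average degree
        return is_hot;
    }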

2.2 Graph Representation

For the purposes of representation, graphs are modelled by a set of vertices (V) and the set of their corresponding directed edges (E), G = (V, E). Each of these vertices would have some properties associated with it, which form the basis of graph analysis.

In a natural graph, the edges are not uniformly distributed across the vertices but are in fact sparse due to the underlying community structures in the graph. Thus, in an adjacency matrix form of representing graphs, most of the elements would be null, leading to a wastage of expensive shared-memory space.

Figure 2.2.1: Example of the Compressed Sparse Row (CSR) representation of a graph: (a) graph structure; (b) corresponding CSR format

An alternative is found in the Compressed Sparse Row (CSR) format for representing graphs. The CSR format uses a vertex and an edge array to encode the graph, with an additional array to store the properties corresponding to each of the vertices. The edge array could represent an in-edge or an out-edge (in Figure 2.2.1, the out-edges are considered) depending on the type of computations being carried out. In pull-type computations, the parent vertex collects the property values from its in-neighbours. Hence, the CSR format should maintain the in-edges for pull-type computations. In push-type computations, the parent vertex pushes its property values to its out-neighbours for which, the CSR format should maintain the out-edges in the edge array.
How is the CSR format queried? We take a pull-type algorithm as an example. For every vertex, the vertex array element corresponding to the vertex ID contains the index of the first in-neighbour of that vertex in the edge array. Thus, each element in the edge array is the source vertex of an edge. The number of edges (the degree) of any vertex can be determined from the difference of consecutive elements in the vertex array. In Figure 2.2.1, the Reuse row shows the number of queries to the property element of each graph vertex during one graph traversal. This number is equal to the number of in-edges of the corresponding vertex.
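To make this query pattern concrete, the following is a minimal C++ sketch of a pull-style sweep over a CSR graph as described above. The array names vertex_offsets, in_edges and property are our own and do not come from any particular framework.

    #include <cstdint>
    #include <vector>

    // Pull-style sweep over a CSR graph: for each vertex v, read the
    // properties of its in-neighbours. vertex_offsets has |V|+1 entries;
    // the in-neighbours of v occupy in_edges[vertex_offsets[v] .. vertex_offsets[v+1]).
    std::vector<double> PullSweep(const std::vector<int64_t>& vertex_offsets,
                                  const std::vector<int32_t>& in_edges,
                                  const std::vector<double>& property) {
        const size_t num_vertices = vertex_offsets.size() - 1;
        std::vector<double> next(num_vertices, 0.0);
        for (size_t v = 0; v < num_vertices; ++v) {
            // Degree of v = difference of consecutive offsets.
            for (int64_t e = vertex_offsets[v]; e < vertex_offsets[v + 1]; ++e) {
                int32_t u = in_edges[e];   // in-neighbour of v
                next[v] += property[u];    // property[u] is read once per out-edge of u
            }
        }
        return next;
    }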

2.3 Graph Traversal and Memory Use

For processing a vertex, the graph processing algorithm queries all its neighbours from the edge array. The property arrays corresponding to each of these neighbours are then queried to process the parent vertex. In a traversal of the entire graph, therefore, elements in the vertex and edge arrays are queried only once, whereas elements in the property arrays are queried multiple times, depending on the degree of the vertex. Note that this degree depends on the type of computation: the out-degree for pull-based computations and the in-degree for push-based ones. The number of reuses of each element can be understood from the figure given earlier. Hence, out of the three arrays, the property arrays show temporal locality. In order to improve cache utilization from these high-reuse vertices, we try to keep the hotter vertices together. Graph reordering provides the means to optimise the CPU access patterns without affecting the graph structure or the algorithm being executed. The differences in cache utilization can be understood more easily by observing the two cases below:

3.1 Statement

The problem statement of graph reordering is as follows:
Problem statement: Given a graph modelled as G = (V, E), we need a permutation function phi() that assigns a unique vertex ID in {1, 2, 3, …, n} to each vertex of the original graph, where n = |V| is the size of the vertex set. The optimal permutation function is the one that minimises the total cache misses during the execution of a typical graph processing algorithm.

The problem statement can be divided into three objective points for designing, testing and comparing any graph reordering scheme:

  1. Low reordering overhead: Although our objective is to minimise the total execution time of the graph processing algorithm, we also need to consider the end-to-end time, including the time to reorder the entire graph. However, we need not reorder the graph every time we carry out graph analysis, since adding a few vertices to the graph does not cause significant changes in the structure or degree distribution of the graph. Hence, the reordering time can be amortized over multiple instances. The number of traversals required to amortize the reordering depends on the total reordering time, which thus needs to be curtailed

  2. High cache utilization: Keeping as many hot vertices together as possible leads to better temporal locality and thus to better cache utilization for memory blocks holding these hot vertices

  3. Preserving the structure: Many graph analysis algorithms exhibit high spatio-temporal locality, i.e., they tend to process first those vertices that are in the neighbourhood of the current vertices. Hence, the underlying structures present in the graph are crucial for determining the flow of data through the algorithm. Any reordering scheme should preserve these structures to prevent adverse impacts on the performance gained from improving the temporal locality alone

We now critique some existing reordering schemes against the above-mentioned objective points.

3.2 Graph Reordering (GOrder)

In GOrder, the authors observe that the data access patterns are greatly influenced by the neighbourhood relations in the graph. In addition to the dependency of neighbouring vertices on the current vertex, it also observes that relations between vertices that share common in-neighbours, called sibling relationships, are of consequence.
To formalise the graph reordering problem, GOrder uses a score function to measure the closeness between any two vertices:

    S(u, v) = S_s(u, v) + S_n(u, v)        (3.1)

where
S_s(u, v) is the number of common in-neighbours of u and v, and
S_n(u, v) is the number of times u and v are direct neighbours.
Thus, two vertices that are densely connected with each other have a greater score. If, by some means, say a permutation function phi, we ensure that such pairs of vertices are kept close to each other in the cache, the number of cache hits would increase. Mathematically, GOrder defines an accumulated locality score over a window of size w that should be maximised:

    F(phi) = Σ_{0 < phi(v) − phi(u) ≤ w} S(u, v)        (3.2)

Maximising F(phi) is proved to be NP-hard by GOrder, which proposes a greedy algorithm as a solution. Although this greedy algorithm has a high reordering overhead, the execution time achieved after the reordering is considerably lower. A closer analysis of the results from GOrder suggests that it reorders the hot vertices first as well as keeping the neighbouring vertices together. The combination of these results makes GOrder a very good basis for comparison.
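As a concrete reading of Equation 3.2, the accumulated locality score of a given ordering can be evaluated directly. The sketch below is our own illustration of that evaluation (a naive scorer, not GOrder's greedy optimiser); the names Score and LocalityScore and the sorted in-neighbour-list layout are assumptions made for the example.

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    using Vertex = int32_t;
    // in_nbrs[v] is the sorted list of in-neighbours of v (illustrative layout).
    using InLists = std::vector<std::vector<Vertex>>;

    // S(u, v) = S_s(u, v) + S_n(u, v): common in-neighbours plus direct links
    // (an edge in either direction contributes to S_n).
    static int64_t Score(const InLists& in_nbrs, Vertex u, Vertex v) {
        const auto& a = in_nbrs[u];
        const auto& b = in_nbrs[v];
        std::vector<Vertex> common;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(common));
        const int64_t s_s = static_cast<int64_t>(common.size());
        const int64_t s_n = std::binary_search(a.begin(), a.end(), v) +
                            std::binary_search(b.begin(), b.end(), u);
        return s_s + s_n;
    }

    // F(phi): sum of S(u, v) over pairs whose new positions differ by at most w.
    // order[i] is the vertex placed at position i by the permutation phi.
    int64_t LocalityScore(const InLists& in_nbrs,
                          const std::vector<Vertex>& order, int w) {
        int64_t total = 0;
        for (size_t i = 0; i < order.size(); ++i) {
            const size_t lo = (i > static_cast<size_t>(w)) ? i - w : 0;
            for (size_t j = lo; j < i; ++j)
                total += Score(in_nbrs, order[j], order[i]);
        }
        return total;
    }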

3.3 Structure-preserved Reordering (S-Order)

S-Order proposes the concept of a hypernode: an aggregate of adjacent unvisited cold vertices, gathered starting from a seed vertex into a single virtual vertex. Adjacency is defined as the set of vertices that lie within a fixed number of hops of the seed vertex. The neighbours of the vertices in the hypernode are then split into two groups based on their hotness. The reordering begins with the vertices in the hypernode, followed by the hot neighbours of the hypernode and finally the cold neighbours.
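The following is a rough C++ sketch of the hypernode idea as described above, under our own assumptions about the data layout (an adjacency list and a per-vertex hotness flag). It illustrates the ordering of hypernode members, hot neighbours and cold neighbours; it is not a faithful reproduction of S-Order.

    #include <cstdint>
    #include <deque>
    #include <utility>
    #include <vector>

    using Vertex = int32_t;
    using AdjList = std::vector<std::vector<Vertex>>;

    // Gather unvisited cold vertices reachable within `hops` hops of the seed
    // into one hypernode, then emit: hypernode members, hot neighbours,
    // cold neighbours (illustrative names and layout).
    void AppendHypernodeOrder(const AdjList& adj, const std::vector<bool>& is_hot,
                              Vertex seed, int hops, std::vector<bool>& visited,
                              std::vector<Vertex>& new_order) {
        std::vector<Vertex> members, hot_nbrs, cold_nbrs;
        std::deque<std::pair<Vertex, int>> frontier = {{seed, 0}};
        visited[seed] = true;
        while (!frontier.empty()) {
            auto [v, d] = frontier.front();
            frontier.pop_front();
            members.push_back(v);
            if (d == hops) continue;
            for (Vertex u : adj[v])
                if (!visited[u] && !is_hot[u]) {  // cold vertices join the hypernode
                    visited[u] = true;
                    frontier.push_back({u, d + 1});
                }
        }
        for (Vertex v : members)                  // split remaining neighbours by hotness
            for (Vertex u : adj[v])
                if (!visited[u]) {
                    visited[u] = true;
                    (is_hot[u] ? hot_nbrs : cold_nbrs).push_back(u);
                }
        new_order.insert(new_order.end(), members.begin(), members.end());
        new_order.insert(new_order.end(), hot_nbrs.begin(), hot_nbrs.end());
        new_order.insert(new_order.end(), cold_nbrs.begin(), cold_nbrs.end());
    }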

3.4 Neighbourhood Reordering (N-Order)

NOrder takes a very different approach to balancing the trade-offs. It first creates a list of vertices in descending order of their hotness. In the next pass, it carries out reordering in BFS fashion, serially taking elements from the previously arranged list as seed vertices for the BFS; the new ordering follows the order of traversal. Since the entire graph is traversed twice during NOrder's reordering, the reordering time is expected to be proportionally high as well, as discussed later in the results comparison.
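A compact sketch of this two-pass scheme, with names of our own choosing, might look as follows: degree-sorted seeds in the first pass and BFS-order ID assignment in the second.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <queue>
    #include <vector>

    using Vertex = int32_t;
    using AdjList = std::vector<std::vector<Vertex>>;

    // Pass 1: rank vertices by descending degree (hotness).
    // Pass 2: BFS from each ranked vertex in turn; new IDs follow traversal order.
    std::vector<Vertex> NOrderLike(const AdjList& adj,
                                   const std::vector<int64_t>& degree) {
        const size_t n = adj.size();
        std::vector<Vertex> seeds(n);
        std::iota(seeds.begin(), seeds.end(), 0);
        std::stable_sort(seeds.begin(), seeds.end(),
                         [&](Vertex a, Vertex b) { return degree[a] > degree[b]; });

        std::vector<Vertex> new_order;
        new_order.reserve(n);
        std::vector<bool> visited(n, false);
        for (Vertex s : seeds) {
            if (visited[s]) continue;
            std::queue<Vertex> q;
            q.push(s);
            visited[s] = true;
            while (!q.empty()) {
                Vertex v = q.front();
                q.pop();
                new_order.push_back(v);  // new ID = position in traversal order
                for (Vertex u : adj[v])
                    if (!visited[u]) {
                        visited[u] = true;
                        q.push(u);
                    }
            }
        }
        return new_order;  // new_order[new_id] = original vertex ID
    }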

3.5 Degree-Based Grouping (DBG)

DBG is a skew-aware coarse reordering technique that rearranges the graph such that each cache block holds vertices of similar degree. As opposed to Sort, which simply reorders the entire graph with new vertex IDs assigned in descending order of hotness, DBG bins the vertices into different partitions while keeping the original vertex numbering order, i.e., within each partition DBG maintains the original relative order of the vertices. But that is the extent to which it ensures the integrity of the original structure of the graph. The partitions are defined with degree ranges that follow the power-law distribution. Since this technique involves only binning and not a full-blown sort of the vertices, the reordering time incurred by DBG is found to be the lowest in most cases. However, since it conforms to the power-law distribution directly, the results from DBG reordering align quite well with our objective points.
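The sketch below illustrates DBG-style coarse binning under our own assumptions (power-of-two degree ranges, highest-degree bins placed first); the exact ranges and bin placement used by DBG may differ.

    #include <cstdint>
    #include <vector>

    using Vertex = int32_t;

    // Coarse, skew-aware binning in the spirit of DBG: vertices fall into
    // partitions by power-of-two degree ranges, and the original relative
    // order is preserved inside each partition.
    std::vector<Vertex> DegreeBasedGrouping(const std::vector<int64_t>& degree,
                                            int num_bins) {
        std::vector<std::vector<Vertex>> bins(num_bins);
        for (Vertex v = 0; v < static_cast<Vertex>(degree.size()); ++v) {
            int b = 0;
            while (b + 1 < num_bins && degree[v] >= (int64_t{1} << (b + 1))) ++b;
            bins[b].push_back(v);        // relative order within a bin is kept
        }
        std::vector<Vertex> new_order;
        new_order.reserve(degree.size());
        for (int b = num_bins - 1; b >= 0; --b)  // highest-degree bins first
            for (Vertex v : bins[b]) new_order.push_back(v);
        return new_order;
    }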

Some key observations after studying the above schemes:

  1. Since one of the objectives is to maintain the neighbourhood structures of the original graph, many of the reordering algorithms make use of a BFS-style traversal of the entire graph for indexing

  2. Although GOrder shows promising results for execution time, the reordering time is quite high

  3. By forming hypernodes, SOrder tends to prefer preserving neighbourhood structure over improving the temporal locality

  4. NOrder, on the other hand, trades off temporal locality against preserving the community structure, while also incurring significant reordering time due to the double traversal

  5. Design of DBG limits its structure-preservation to just maintaining the relative order of the reordered vertices. The serial vertex IDs of the original graph may not be the best representative of the underlying cross connections in the graph

5.1 Experimental Setup

We evaluate the effectiveness of LOrder by running six different graph processing algorithms on graphs reordered by four graph reordering schemes. The evaluations are run on a machine with the following configuration:
Algorithms: We make use of the GAP Benchmark Suite for carrying out the graph processing. It has C++ implementations of the following algorithms:

  • Breadth-First Search: A fundamental graph traversal technique that starts from a single source and visits all the vertices of the current level before moving on to the next one (a minimal sketch on the CSR layout follows this list)

  • Page Rank: Iteratively computes the rank of vertices using their connectivity properties

  • Betweenness Centrality: Uses BFS traversals to determine the paths originating from a particular source that pass through a particular vertex, and thereby identifies central vertices

  • Single Source Shortest Path: Using the Bellman-Ford algorithm, it finds the shortest paths from a particular vertex to all connected vertices

  • Connected Components: Assigns unique labels to all the connected components and the vertices associated with each of these components

  • CC_SV: Connected Components implemented with the Shiloach-Vishkin algorithm
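As referenced in the BFS item above, a minimal queue-based top-down BFS over the CSR layout of Section 2.2 might look as follows; this is our own sketch, not the benchmark suite's optimised implementation.

    #include <cstdint>
    #include <queue>
    #include <vector>

    // Top-down BFS over a CSR graph with out-edges; returns the BFS level
    // (depth) of every vertex, or -1 if unreachable from the source.
    std::vector<int32_t> BFS(const std::vector<int64_t>& vertex_offsets,
                             const std::vector<int32_t>& out_edges,
                             int32_t source) {
        const size_t n = vertex_offsets.size() - 1;
        std::vector<int32_t> depth(n, -1);   // -1 marks unvisited vertices
        std::queue<int32_t> frontier;
        depth[source] = 0;
        frontier.push(source);
        while (!frontier.empty()) {
            int32_t v = frontier.front();
            frontier.pop();
            for (int64_t e = vertex_offsets[v]; e < vertex_offsets[v + 1]; ++e) {
                int32_t u = out_edges[e];
                if (depth[u] == -1) {        // first visit fixes the BFS level
                    depth[u] = depth[v] + 1;
                    frontier.push(u);
                }
            }
        }
        return depth;
    }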

Dataset   Type            # of vertices   # of edges
lj        Social Network  4,847,571       68,993,773
orkut     Community       3,072,441       117,185,083
pld_arc   Hyperlink
kron23    Kronecker
youtube   Community       1,134,890       2,987,624
pokec     Social Network  1,632,803       30,622,564
Table 5.1: Dataset details

Graph datasets: A total of six graph datasets are used, summarised in Table 5.1. Apart from the Kronecker graph from the Graph500 benchmark, all the other graphs are real-world graphs. The average degree of each graph forms the threshold for defining hotness for its vertices. We also report the diameter of the graphs, i.e., the longest shortest path between connected vertices, for those belonging to the SNAP Stanford dataset repository. The diameter plays an integral part in the analysis of LOrder, as summarised in the results.
Evaluation method: For the various reordering schemes, the parameters are set as per the inferences of their authors; the SOrder parameters, for example, are set to the values its authors recommend. For LOrder, we set the hotness threshold to the average degree of the graph. LOrder also has a hop-count parameter, which varies from graph to graph; its value is chosen such that the final execution time of the algorithms is the lowest. The comparisons across the reordering schemes are based on the execution speed-ups, the cache statistics and the reordering time, as well as interesting observations about the chosen hop-count values. We report execution times averaged over 16 iterations to incorporate the time required for cache warm-up.

5.2 Results

The value of the hop-count parameter. The number of hops from the seed vertex determines the locality captured around any seed vertex. The value is set such that the post-reordering execution time is minimal; the results are shown in Table 5.2. The chosen value is observed to be half of the diameter of the graph, as reported on the SNAP dataset repository. Hence, this parameter is also referred to as the radius in this paper. For the remaining datasets, the diameter can be taken to be twice the radius.

Dataset   Type            Diameter   Chosen hop count (radius)
lj        Social Network  16         8
orkut     Community       9          4
youtube   Community       20         10
pokec     Social Network  11         5
Table 5.2: The hop-count parameter is set to the value that achieves the lowest execution time

Comparing processing speed-ups. The proposed reordering scheme is implemented on the GAP Benchmark Suite and the execution times after reordering are compared to the execution times without reordering. The speed-ups are plotted in Figure 5.2.2 for six different datasets and five graph processing kernels. Since DBG provides some of the best processing speed-ups, we compare our results against it and find that LOrder beats DBG in 77% of the comparisons (i.e., 23 times out of 30). Likewise, it beats SOrder 60% of the time (i.e., 18 times out of 30). To better analyse the overall behaviour of the reordering schemes, we plot their geometric means for the different graph processing kernels in Figure 5.2.3. We can see that LOrder outright beats all the other reordering schemes for the BFS, CC and PR applications, while it is marginally slower than DBG for BC. However, LOrder consistently performs poorly on the CC (Shiloach-Vishkin) kernel. This can be attributed to the uncertainty in which vertices will be part of the processing at any given time.

It is also observed that, apart from a few cases, reordering graphs leads to significant processing speed-ups. For applications involving multiple iterations or traversals, these speed-ups can greatly reduce the total processing time. However, we also need to account for the reordering times.
Comparing reordering times. Out of the four reordering schemes discussed, DBG and SOrder require a single traversal of the graph while NOrder and LOrder require two graph traversals. The latter two are therefore expected to have almost double the reordering time of the former, as corroborated by Figure 5.2.1. The size of the graph also determines the time taken to reorder it; the reordering times follow the graph-size trends of Table 5.1.
However, the actual reordering time also depends on factors like the number of hot vertices processed initially, etc. The cost of reordering is not incurred in every iteration of processing: a reordered graph is stored in memory and can thus be used multiple times. Also, the addition and removal of vertices in a reordered graph with a significantly high number of vertices does not affect the processing time much, as shown by a previous study. Hence, the reordering time is amortized over multiple iterations.
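To make the amortization argument concrete, the break-even number of traversals can be estimated as the one-time reordering cost divided by the per-traversal time saved. The helper below is a back-of-the-envelope sketch with illustrative names, not a measured result.

    #include <cmath>
    #include <limits>

    // Number of traversals needed before a one-time reordering cost is paid
    // back by the per-traversal time saved (original vs. reordered execution).
    // Returns +infinity if reordering gives no per-traversal benefit.
    double BreakEvenTraversals(double reorder_seconds,
                               double original_traversal_seconds,
                               double reordered_traversal_seconds) {
        const double saving = original_traversal_seconds - reordered_traversal_seconds;
        if (saving <= 0.0) return std::numeric_limits<double>::infinity();
        return std::ceil(reorder_seconds / saving);
    }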

Figure 5.2.1: Time taken (in seconds) to reorder graphs with each of the four reordering schemes
Figure 5.2.2: Speed-up comparison for the different reordering algorithms across datasets: (a) soc-LiveJournal1, (b) com-Orkut, (c) kron23, (d) pld_arc, (e) com-Youtube, (f) soc-Pokec. A basic implementation of LOrder without any embellishments performs better in many cases, as discussed in the results
Figure 5.2.3: Geometric means of graph processing speed-ups for various kernels across different datasets