Accelerating PageRank using Partition-Centric Processing

09/21/2017 ∙ by Kartik Lakhotia, et al. ∙ University of Southern California

PageRank is a fundamental link analysis algorithm and a key representative of the performance of other graph algorithms and Sparse Matrix Vector (SpMV) multiplication. Calculating PageRank on sparse graphs generates a large number of random memory accesses, resulting in low cache line utilization and poor use of memory bandwidth. In this paper, we present a novel Partition-Centric Processing Methodology (PCPM) that drastically reduces the amount of communication with DRAM and achieves high memory bandwidth. Similar to the state-of-the-art Binning with Vertex-centric Gather-Apply-Scatter (BVGAS) method, PCPM performs partition-wise scatter and gather of updates, with both phases enjoying full cache line utilization. However, BVGAS suffers from random memory accesses and redundant reads and writes of update values from nodes to their neighbors. In contrast, PCPM propagates a single update from a source node to all destinations in a partition, thus decreasing redundancy effectively. We make use of this characteristic to develop a novel bipartite Partition-Node Graph (PNG) data layout for PCPM that enables streaming memory accesses with very little generation overhead. We perform a detailed analysis of PCPM and provide theoretical bounds on the amount of communication and random DRAM accesses. We experimentally evaluate our approach using 6 large graph datasets and demonstrate an average 2.7x speedup in execution time and 1.7x reduction in communication, compared to the state of the art. We also show that, unlike the BVGAS implementation, PCPM is able to take advantage of intelligent node labeling that enhances locality in graphs, by further reducing the amount of communication with DRAM. Although we use PageRank as the target application in this paper, our approach can be applied to generic SpMV computation.

1 Introduction

Graphs are the preferred choice of data representation in many fields such as web and social network analysis [9, 3, 29, 10], biology [17] and transportation [15, 4]. The growing scale of problems in these areas has generated substantial research interest in high performance graph analytics. A large fraction of this research is focused on shared memory platforms because of their low communication overhead compared to distributed systems [26]. High DRAM capacity in modern systems further allows in-memory processing of large graphs on a single server [35, 33, 37]. However, efficient utilization of compute power is challenging even on a single node because of the low computation-to-communication ratio and irregular memory access patterns of graph algorithms. The growing disparity between CPU speed and DRAM bandwidth, termed the memory wall [42], has become a key issue in high performance graph analytics.

PageRank is a quintessential algorithm that exemplifies the performance challenges posed by graph computations. It iteratively performs Sparse Matrix-Vector (SpMV) multiplication over the adjacency matrix of the target graph and the current PageRank vector to generate new PageRank values. The irregularity in adjacency matrices leads to random accesses to the PageRank vector with poor spatial and temporal locality. The resulting cache misses and communication volume become the performance bottleneck for PageRank computation. Since many graph algorithms can be similarly modeled as a series of SpMV operations [37], optimizations on PageRank can be easily generalized to other algorithms.

Recent works have proposed the use of the Gather-Apply-Scatter (GAS) model to improve locality and reduce communication for SpMV and PageRank [43, 11, 5]. This model splits computation into two phases: scatter current source node values on edges, and gather propagated values on edges to compute new values for destination nodes. The 2-phased approach restricts access to either the current or the new PageRank values at a time. This provides opportunities for cache-efficient and lock-free parallelization of the algorithm.

We observe that although this approach exhibits several attractive features, it also has some drawbacks leading to inefficient memory accesses, both quantitative and qualitative. First, we note that while scattering, a vertex repeatedly writes its value on all outgoing edges, resulting in a large number of reads and writes. We also observe that the Vertex-centric graph traversal in [11, 5] results in random DRAM accesses, and the Edge-centric traversal in [34, 43] scans the edge list in coordinate format, which increases the number of reads.

Our premise is that by changing the focus of computation from a single vertex or edge to a cacheable group of vertices (partition), we can effectively identify and reduce redundant edge traversals as well as avoid random accesses to DRAM, while still retaining the benefits of the GAS model. Based on these insights, we develop a new Partition-Centric approach to compute PageRank. The major contributions of our work are:

- We propose a Partition-Centric Processing Methodology (PCPM) that propagates updates from nodes to partitions and reduces the redundancy associated with the GAS model.
- By carefully evaluating how a PCPM based implementation impacts algorithm behavior, we develop several system optimizations that substantially accelerate the computation, namely, a new data layout that drastically reduces communication and random memory accesses, and branch avoidance mechanisms to remove unpredictable branches.
- We demonstrate that PCPM can take advantage of intelligent node labeling to further reduce the communication volume. Thus, PCPM is suitable even for high locality graphs.
- We conduct extensive analytical and experimental evaluation of our approach using 6 large datasets. On a 16-core shared memory system, PCPM achieves a speedup in execution time and a reduction in main memory communication over the state of the art (2.7x and 1.7x on average, respectively).

We show that PCPM can be easily extended to weighted graphs and generic SpMV computation (section 3.5) even though it is described in the context of the PageRank algorithm in this paper.

2 Background and Related Work

2.1 PageRank Computation

In this section, we describe how PageRank is calculated and what makes it challenging for the conventional implementation to achieve high performance. Table 1 lists a set of notations that we use to mathematically represent the algorithm.

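Symbol                      Description
$G(V, E)$                   Input directed graph
$A$                         adjacency matrix of $G$
$N_i(v)$                    in-neighbors of vertex $v$
$N_o(v)$                    out-neighbors of vertex $v$
$\overrightarrow{PR_i}$     PageRank value vector after iteration $i$
$\overrightarrow{SPR_i}$    scaled PageRank vector ($PR_i(v) / |N_o(v)|$)
$d$                         damping factor in PageRank algorithm
Table 1: List of graph notations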

PageRank is computed iteratively. In each iteration, all vertex values are updated by the new weighted sum of their in-neighbors’ PageRank, as shown in equation 1.

$PR_{i+1}(v) = \frac{1-d}{|V|} + d \sum_{u \in N_i(v)} \frac{PR_i(u)}{|N_o(u)|}$    (1)

To assist visualization of some techniques and optimizations, we also use the Linear Algebraic perspective where a PageRank iteration can be re-written as follows:

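$\overrightarrow{PR_{i+1}} = \frac{1-d}{|V|} + d \cdot A^T \cdot \overrightarrow{SPR_i}$    (2)

The most computationally intensive term that dictates the performance of computing this equation is the Sparse Matrix-Vector (SpMV) multiplication $A^T \cdot \overrightarrow{SPR_i}$, where $A$ is the adjacency matrix and $\overrightarrow{SPR_i}$ is the PageRank vector scaled by out-degree. Henceforth, we will focus on the SpMV term to improve the performance of the PageRank algorithm.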

PageRank is typically computed in pull direction [35, 38, 37, 30] where each vertex pulls the value of its in-neighbors and accumulates it into its own value, as shown in algorithm 1. This corresponds to traversing the adjacency matrix $A$ in column-major order and computing the dot product of each column with the scaled PageRank vector $\overrightarrow{SPR_i}$.

for v ∈ V do
    temp = 0
    for all u ∈ N_i(v) do
        temp += PR[u]
    PR_next[v] = ((1 − d)/|V| + d × temp) / |N_o(v)|
swap(PR, PR_next)
Algorithm 1 Pull Direction PageRank (PDPR) Iteration
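
The pull-direction iteration can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ implementation: the function name `pdpr_iteration`, the list-of-lists graph representation and the default damping factor of 0.85 are assumptions made for the example, as is the handling of vertices without out-edges.

```python
def pdpr_iteration(in_neighbors, out_degree, spr, d=0.85):
    """One pull-direction PageRank iteration over scaled values.

    in_neighbors[v] lists the in-neighbors of v (a CSC-like view);
    spr[u] holds PR(u) / outdeg(u). Returns the new scaled vector.
    """
    n = len(in_neighbors)
    spr_next = [0.0] * n
    for v in range(n):
        temp = 0.0
        for u in in_neighbors[v]:      # each read of spr[u] is a potential random access
            temp += spr[u]
        pr_v = (1.0 - d) / n + d * temp
        # Assumption: a vertex with no out-edges keeps its unscaled value.
        spr_next[v] = pr_v / out_degree[v] if out_degree[v] else pr_v
    return spr_next
```

Each vertex owns its output element, so the outer loop could run in parallel without locks; the cost is the irregular read pattern into `spr`.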

In the pull direction implementation, each column completely owns the computation of the corresponding element in the output vector. This enables all columns of the adjacency matrix $A$ to be traversed asynchronously in parallel without the need to store partial sums in memory. On the contrary, in the push direction, each node updates its out-neighbors by adding its own value to them. This requires a row-major traversal of $A$ and storage for partial sums since each row contributes partially to multiple elements in the output vector. Further, synchronization is needed to ensure conflict-free processing of multiple rows that update the same output element.

Performance Challenges: Sparse matrix layouts like Compressed Sparse Column (CSC) store all non-zero elements of a column sequentially in memory, allowing fast column-major traversal of the adjacency matrix $A$ [36]. However, as shown in fig. 2, the neighbors of a node (non-zero columns in the adjacency matrix) can be scattered anywhere in the graph, and reading their values results in random accesses (single or double word) to the scaled PageRank vector in pull direction computation. Similarly, the push direction implementation uses a Compressed Sparse Row (CSR) format for fast row-major traversal of $A$ but suffers from random accesses to the partial sums vector. These low locality and fine granularity accesses incur a high cache miss ratio and contribute a large fraction of the overall memory traffic, as shown in fig. 1.

Figure 1: Percentage contribution of vertex value accesses to the total DRAM traffic in a PageRank iteration.
Figure 2: Access locations into the scaled PageRank vector are derived from non-zero column indices in rows of $A^T$ and tend to be highly irregular.

2.2 Related Work

The performance of PageRank depends heavily on the locality in memory access patterns of the graph (which we refer to as graph locality). Since node labeling has a significant impact on graph locality, many prior works have investigated the use of node reordering or clustering [7, 22, 6, 2] to improve the performance of graph algorithms. Reordering based on spatial and temporal locality aware placement of neighbors [39, 20] has been shown to further outperform the well known clustering and tree-based techniques. Such sophisticated algorithms provide significant speedup but also introduce substantial pre-processing overhead which limits their practicability. In addition, scale-free graphs like social networks are less tractable by reordering transformations because of their skewed degree distribution.

Cache Blocking (CB) is another technique used to accelerate graph processing [41, 32, 45]. CB attempts to induce locality by restricting the range of randomly accessed nodes and has been shown to reduce cache misses [24]. CB partitions the adjacency matrix along rows, columns or both into multiple block matrices. Each block matrix can be stored in CSR or coordinate (COO) format. However, SpMV computation with CB requires the partial sums to be re-read for each block. The extremely sparse nature of these block matrices also reduces the locality in partial sum accesses [31].

Gather-Apply-Scatter (GAS) is another popular model incorporated in many graph analytics frameworks  [23, 34, 13]. It splits the analytic computation into scatter and gather phases. In the scatter phase, source vertices transmit updates on all of their outgoing edges and in the gather phase, these updates are processed to compute new values for corresponding destination vertices. The updates for PageRank algorithm correspond to scaled PageRank values defined earlier in section 2.1.

Binning exploits the 2-phased computation model by storing the updates in a semi-sorted manner. This induces spatio-temporal locality in access patterns of the algorithm. Binning can be used in conjunction with either Vertex-centric or Edge-centric paradigms. Zhou et al. [43, 44] use a custom sorted edge list with Edge-centric processing to reduce DRAM row activations and improve memory performance. However, their sorting mechanism introduces a non-trivial pre-processing cost and imposes the use of COO format. This results in larger communication volume and execution time than the CSR based Vertex-centric implementations [5, 11].

GAS model is also inherently sub-optimal when used with either Vertex-centric or Edge-centric abstractions. This is because it traverses the entire graph twice in each iteration. Nevertheless, Binning with Vertex-centric GAS (BVGAS) is the state-of-the-art methodology on shared memory platforms [5, 11] and we use it as baseline for comparison in this paper.

3 Partition-Centric Processing

We propose a new Partition-Centric Processing Methodology (PCPM) that significantly improves the efficiency of processor-memory communication over that achievable with current Vertex-centric or Edge-centric methods. We define partitions as disjoint sets of contiguously labeled nodes. The Partition-Centric abstraction then perceives the graph as a set of links from each node to the partitions corresponding to the neighbors of the node. We use this abstraction in conjunction with the 2-phased Gather-Apply-Scatter (GAS) model.

During the PCPM scatter phase, each thread processes one partition $p$ at a time. Processing a partition $p$ means propagating messages from nodes in $p$ to the neighboring partitions. A message to a partition $p'$ comprises the update value of a source node $u$ ($PR[u]$) and the list of out-neighbors of $u$ that lie in $p'$. PCPM caches the vertex data of $p$ and streams the messages to the main memory. The messages from $p$ are generated in a Partition-centric manner, i.e. messages from all nodes in $p$ to a neighboring partition $p'$ are generated consecutively and are not interleaved with messages to any other partition.

During the gather phase, each thread scans all messages destined to one partition $p$ at a time. A message scan applies the update value to all nodes in the neighbor list of that message. Partial sums of nodes in $p$ are cached and messages are streamed from the main memory. After all messages to $p$ are scanned, the partial sums (new PageRank values) are written back to DRAM.

With static pre-allocation of distinct memory spaces for each partition to write messages, PCPM can asynchronously scatter or gather multiple partitions in parallel. In this section, we provide a detailed discussion on PCPM based computation and the required data layout.

3.1 Graph Partitioning

We employ a simple approach to divide the vertex set into partitions. We create equisized partitions of size $q$ where partition $p_i$ owns all the vertices with index in $[i \cdot q, (i+1) \cdot q)$ as shown in fig. 3(a). As discussed later, the PCPM abstraction is built to easily take advantage of more sophisticated partitioning schemes and deliver further performance improvements (the trade-off is the time complexity of partitioning versus performance gains). As we show in the results section, even the simple partitioning approach described above delivers significant performance gains over state-of-the-art methods.

Each partition is also allocated a contiguous memory space called a bin to store updates (update_bins) and the corresponding list of destination nodes (destID_bins) in the incoming messages. Since each thread in PCPM scatters or gathers only one partition at a time, the random accesses to vertex values or partial sums are limited to an address range equal to the partition size. This improves temporal locality in the access pattern and, in turn, the overall cache performance of the algorithm.

Before beginning PageRank computation, each partition calculates the offsets (addresses in bins where it must start writing from) into all update_bins and destID_bins. Our scattering strategy dictates that the partitions write to bins in the order of their IDs. Therefore, the offset for a partition $p_i$ into any given bin is the sum of the number of values that all partitions with ID smaller than $i$ are writing into that bin. For instance, in fig. 3, the offset of partition into is  (since partitions and do not write to bin ). Similarly, its offset into and is  (since writes one update to bin and writes one update to bin ). Offset computation provides each partition fixed and disjoint locations to write messages. This allows PCPM to parallelize partition processing without the need for locks or atomics.
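
The offset computation described above amounts to a per-bin prefix sum over source partitions. The sketch below is only an illustration under assumed names (`bin_offsets`, a list-of-lists adjacency structure, partition size `q`); it counts one update per (node, neighboring partition) pair, which corresponds to the update bins. Offsets into the destination ID bins would count one entry per edge instead.

```python
def bin_offsets(out_neighbors, q):
    """Return offsets[s][b]: index in bin b where partition s starts writing updates."""
    n = len(out_neighbors)
    k = (n + q - 1) // q                        # number of partitions
    counts = [[0] * k for _ in range(k)]        # counts[s][b]: updates from s into bin b
    for v, nbrs in enumerate(out_neighbors):
        for b in {u // q for u in nbrs}:        # one update per neighboring partition
            counts[v // q][b] += 1
    offsets = [[0] * k for _ in range(k)]
    for b in range(k):
        running = 0
        for s in range(k):                      # partitions write in order of their IDs
            offsets[s][b] = running
            running += counts[s][b]
    return offsets
```

Because every partition gets a disjoint range in every bin, the scatter phase can run in parallel across partitions without synchronization.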

(a) Example graph with partitions of size
(b) Bins store update value and list of destination nodes
Figure 3: Graph Partitioning and messages inserted in bins during scatter phase

Note that since the destination node IDs written in the first iteration remain unchanged over the course of algorithm, they are written only once and reused in subsequent iterations. The reuse of destination node IDs along with the specific system optimizations discussed in section 3.2 and  3.3 enables PCPM to traverse only a fraction of the graph during scatter phase. This dramatically reduces the number of DRAM accesses and gets rid of the inherent sub-optimality of GAS model.

3.2 Partition-Centric Update Propagation

The unique abstraction of PCPM naturally leads to transmitting a single update from a node to a neighboring partition. In other words, even if a node has multiple neighbors in a partition, it inserts only one update value in the corresponding update_bins during the scatter phase (algorithm 2). Fig. 4 illustrates the difference between Partition-Centric and Vertex-centric scatter for the example graph shown in fig. 3(a).

PCPM manipulates the Most Significant Bit (MSB) of destination node IDs to indicate the range of nodes in a partition that use the same update value. In the destID_bins, it consecutively writes IDs of all nodes in the neighborhood of the same source vertex and sets the MSB of the first ID in this range to 1 for demarcation (fig. 4(b)). Since the MSB is reserved for this functionality, PCPM supports graphs with up to 2 billion nodes instead of 4 billion for 4 Byte node IDs. However, to the best of our knowledge, this is enough to process most of the large publicly available datasets.

(a) Scatter in Vertex-centric GAS
(b) Scatter in PCPM
Figure 4: PCPM decouples update_bins and destID_bins to avoid redundant update value propagation

The gather phase starts only after all partitions are processed in the scatter phase. The PCPM gather function sequentially reads updates and node IDs from the bins of the partition being processed. When gathering partition $p$, an update value $PR[u]$ should be applied to all out-neighbors of $u$ that lie in $p$. This is done by checking the MSB of node IDs to determine whether to apply the previously read update or to read the next update, as shown in algorithm 2. The MSB is then masked to generate the true ID of the destination node whose partial sum is updated.

Algorithm 2 describes PCPM based PageRank computation using a row-wise partitioned CSR format for the adjacency matrix $A$. Note that PCPM only writes updates for some edges in a node's adjacency list, specifically the first outgoing edge to a partition. The remaining edges to that partition are unused. Since CSR stores adjacencies of a node contiguously, the set of first edges to neighboring partitions is interleaved with other edges. Therefore, we have to scan all outgoing edges of each vertex during the scatter phase to access this set, which decreases efficiency. Moreover, the algorithm can potentially switch bins for each update insertion, leading to random writes to DRAM. Finally, the manipulation of the MSB in node indices introduces additional data dependent branches which hurt performance. Clearly, the CSR adjacency matrix is not an efficient data layout for graph processing using PCPM. In the next section, we propose a PCPM-specific data layout.

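q ← partition size, P ← set of partitions
for all p ∈ P do                                  ▷ Scatter
    for all u ∈ p do
        prev_bin ← ∞
        for all v ∈ N_o(u) do
            if ⌊v/q⌋ ≠ prev_bin then
                insert PR[u] in update_bins[⌊v/q⌋]
                prev_bin ← ⌊v/q⌋
PR[:] = 0
for all partitions p ∈ P do                       ▷ Gather
    while destID_bins[p] ≠ ∅ do
        pop id from destID_bins[p]
        if MSB(id) ≠ 0 then
            pop update from update_bins[p]
        PR[id & bitmask] += update
for all v ∈ V do                                  ▷ Apply
    PR[v] = ((1 − d)/|V| + d × PR[v]) / |N_o(v)|
Algorithm 2 PageRank iteration in PCPM using CSR format. Writing of destID_bins is not shown here.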
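
To make the MSB demarcation concrete, the following Python sketch runs a CSR-style PCPM scatter and gather on plain lists. It is an illustration under several assumptions: the names `pcpm_scatter_csr` and `pcpm_gather` are hypothetical, neighbor lists are assumed sorted so that neighbors in the same partition are adjacent, bins are Python lists appended sequentially rather than pre-allocated arrays with offsets, and node IDs are treated as 32-bit values.

```python
MSB = 1 << 31          # flag bit marking the first destination of a new update
MASK = MSB - 1         # clears the flag to recover the true node ID

def pcpm_scatter_csr(out_neighbors, spr, q):
    """Write one update per (source, destination partition) and all destination IDs."""
    k = (len(out_neighbors) + q - 1) // q
    update_bins = [[] for _ in range(k)]
    destid_bins = [[] for _ in range(k)]
    for u, nbrs in enumerate(out_neighbors):
        prev_bin = None
        for v in nbrs:                          # every edge is scanned, even unused ones
            b = v // q
            if b != prev_bin:                   # first edge from u into partition b
                update_bins[b].append(spr[u])
                destid_bins[b].append(v | MSB)
                prev_bin = b
            else:
                destid_bins[b].append(v)
    return update_bins, destid_bins

def pcpm_gather(update_bins, destid_bins, n):
    """Accumulate partial sums; a set MSB means 'read the next update first'."""
    sums = [0.0] * n
    for updates, ids in zip(update_bins, destid_bins):
        next_update, update = 0, 0.0
        for id_ in ids:
            if id_ & MSB:
                update = updates[next_update]
                next_update += 1
            sums[id_ & MASK] += update
    return sums
```

For a node with two neighbors in the same partition, the update bin receives one value while the destination ID bin receives two IDs, the first with its MSB set.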

3.3 Data Layout Optimization

In this subsection, we describe a new bipartite Partition-Node Graph (PNG) data layout that brings out the true Partition-Centric nature of PCPM. During the scatter phase, PNG prevents unused edge reads and ensures that all updates to a bin are streamed together before switching to another bin.

We exploit the fact that once destID_bins are written, the only required information in PCPM is the connectivity between nodes and partitions. Therefore, edges going from a source to all destination nodes in a single partition can be compressed into one edge whose new destination is the corresponding partition number. This gives rise to a bipartite graph $G'$ with disjoint vertex sets $V$ and $P$ (where $P$ represents the set of partitions in the original graph), and a set of directed edges $E'$ going from $V$ to $P$. Such a transformation has the following effects:

  Eff1: the unused edges in the original graph are removed.

  Eff2: the range of destination IDs reduces from $|V|$ to $|P|$.

The advantages of Eff1 are obvious but those of Eff2 will become clear when we discuss the storage format and construction of PNG.

The compression step reduces memory traffic by eliminating unused edge traversal. However, note that scatters to a bin from source vertices in a partition are still interleaved with scatters to other bins. This can lead to random DRAM accesses during the scatter phase processing of a (source) partition. We resolve this problem by transposing the adjacency matrix of the bipartite graph $G'$. The rows of the transposed matrix represent edges grouped by destination partitions, which enables streaming updates to one bin at a time. This advantage comes at the cost of random accesses to source node values during the scatter phase. To prevent these random accesses from going to DRAM, we construct PNG on a per-partition basis, i.e. we create a separate bipartite graph for each partition $p$ with edges between $P$ and the nodes in $p$ (fig. 5). By carefully choosing the partition size $q$ to make partitions cacheable, we ensure that all requests to source nodes are served by the cache, resulting in zero random DRAM accesses.

Figure 5: Partition-wise construction of PNG for graph $G$ (fig. 3(a)). $|E'|$ is much smaller than $|E|$.

Eff2 is crucial for transposition of bipartite graphs in all partitions. The number of offsets required to store a transposed matrix in CSR format is equal to the range of destination node IDs. By reducing this range, Eff2 reduces the storage requirement for offsets of each matrix from $O(|V|)$ to $O(|P|)$. Since there are $|P|$ partitions, each having one bipartite graph, the total storage requirement for edge offsets in PNG is $O(|P|^2)$ instead of $O(|V| \cdot |P|)$.

Although PNG construction looks like a 2-step approach, we actually merge compression and transposition into a single step. We first scan the outgoing edges of all nodes in a partition and individually compute the in-degree of all the destination partitions while discarding unused edges. A prefix sum of these degrees is carried out to compute the offsets array for the CSR matrix. The same offsets can also be used to allocate disjoint writing locations into the bins of destination partitions. In the next scan, the edge array in CSR is filled with source node IDs, completing both compression and transposition. PNG construction can be easily parallelized over all partitions to accelerate the pre-processing effectively.

Algorithm 3 shows the pseudocode for PCPM scatter phase using PNG layout. Unlike algorithm 2, the scatter function in algorithm 3 does not contain data dependent branches to check and discard unused edges. Using PNG provides drastic performance gains in PCPM scatter phase with little pre-processing overhead.

G'(P, V, E') ← PNG, N_i^p(p') ← in-neighbors of partition p' in bipartite graph of partition p
for all p ∈ P do                                  ▷ Scatter
    for all p' ∈ P do
        for all u ∈ N_i^p(p') do
            insert PR[u] into update_bins[p']
Algorithm 3 PCPM scatter phase using PNG layout
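
The two-scan PNG construction and the resulting branch-free scatter can be sketched as follows. This is a simplified illustration, not the authors' implementation: `build_png` and `png_scatter` are hypothetical names, each per-partition bipartite graph is stored as a pair of Python lists (offsets over destination partitions, source node IDs), and the set of destination partitions per node is recomputed in both scans for brevity.

```python
def build_png(out_neighbors, q):
    """Build one transposed, compressed bipartite graph per source partition."""
    n = len(out_neighbors)
    k = (n + q - 1) // q
    png = []
    for p in range(k):
        nodes = range(p * q, min((p + 1) * q, n))
        # Scan 1: in-degree of each destination partition, unused edges discarded.
        degree = [0] * k
        for u in nodes:
            for b in {v // q for v in out_neighbors[u]}:
                degree[b] += 1
        # Prefix sum gives the CSR offsets of the transposed matrix (k + 1 entries).
        offsets = [0] * (k + 1)
        for b in range(k):
            offsets[b + 1] = offsets[b] + degree[b]
        # Scan 2: fill the edge array with source node IDs, grouped by destination.
        sources = [0] * offsets[k]
        cursor = offsets[:k]
        for u in nodes:
            for b in {v // q for v in out_neighbors[u]}:
                sources[cursor[b]] = u
                cursor[b] += 1
        png.append((offsets, sources))
    return png

def png_scatter(png, spr, k):
    """Stream all updates for one bin before moving to the next bin."""
    update_bins = [[] for _ in range(k)]
    for offsets, sources in png:                 # one source partition at a time
        for b in range(k):
            for u in sources[offsets[b]:offsets[b + 1]]:
                update_bins[b].append(spr[u])    # reads stay within the cached partition
    return update_bins
```

Only `k + 1` offsets are stored per partition, which reflects the reduced destination ID range discussed above.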

3.4 Branch Avoidance

Data dependent branches have been shown to have significant impact on performance of graph algorithms [14] and PNG removes such branches in PCPM scatter phase. In this subsection, we propose a branch avoidance mechanism for the PCPM gather phase. Branch avoidance enhances the sustained memory bandwidth but does not impact the amount of DRAM communication.

Note that the pop operations shown in algorithm 2 are implemented using pointers that increment after reading an entry from the respective bin. Let destID_ptr and update_ptr be the pointers to destID_bins[p] and update_bins[p], respectively. Note that destID_ptr is incremented in every iteration whereas update_ptr is only incremented if MSB(id) ≠ 0.

To implement the branch avoiding gather function, instead of using a condition check over MSB(id), we add it directly to update_ptr. When MSB(id) is 0, the pointer is not incremented and the same update value is read from cache in the next iteration; when MSB(id) is 1, the pointer is incremented, executing the pop operation on update_bins[p]. The modified pseudocode for the gather phase is shown in algorithm 4.

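PR[:] = 0
for all partitions p ∈ P do                       ▷ Gather
    destID_ptr ← 0, update_ptr ← −1
    while destID_ptr < size(destID_bins[p]) do
        id ← destID_bins[p][destID_ptr++]
        update_ptr += MSB(id)
        id ← id & bitmask
        PR[id] += update_bins[p][update_ptr]
Algorithm 4 Branch Avoiding gather function in PCPM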
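
The pointer arithmetic behind branch avoidance can be shown in a short Python sketch. The function name `gather_branch_avoiding` and the 32-bit ID width are assumptions for this example. Python itself does not expose hardware branch behavior, so the code only demonstrates the arithmetic: the MSB is added to the update pointer instead of being tested.

```python
MSB_SHIFT = 31
MASK = (1 << MSB_SHIFT) - 1

def gather_branch_avoiding(update_bin, destid_bin, sums):
    """Gather one partition's bin without a data dependent 'if' on the MSB."""
    upd_ptr = -1                                # the first ID has its MSB set, moving this to 0
    for id_ in destid_bin:
        upd_ptr += id_ >> MSB_SHIFT             # adds 1 for a new update, 0 otherwise
        sums[id_ & MASK] += update_bin[upd_ptr]
    return sums
```

Starting the update pointer at -1 is a choice made for this sketch so that the first flagged ID selects the first update.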

3.5 Weighted Graphs and SpMV

PCPM can be easily extended for computation on weighted graphs by storing the edge weights along with destination IDs in destID_bins. These weights can be read in the gather phase and applied to the source node value before updating the destination node. PCPM can also be extended to generic SpMV with non-square matrices by partitioning the rows and columns separately. In this case, the outermost loops in the scatter phase (algorithm 3) and gather phase (algorithm 4) will iterate over row partitions and column partitions of the matrix, respectively.

3.6 Comparison with Vertex-centric GAS

The Binning with Vertex-centric GAS (BVGAS) method allocates multiple bins to store incoming messages ((update, destID) pairs). If the bin width is $q$, then all messages destined to a vertex $v$ are written in bin $\lfloor v/q \rfloor$. The scatter phase traverses the graph in a Vertex-centric fashion and inserts the messages in respective bins of the destination vertices. The number of bins is kept small to allow insertion points for all bins to fit in cache, providing good spatial locality. The gather phase processes one bin at a time as shown in algorithm 5, and thus enjoys good temporal locality if the bin width is small.

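q ← bin width, P ← no. of bins
for v ∈ V do                                      ▷ Scatter
    PR[v] = PR[v] / |N_o(v)|
    for all u ∈ N_o(v) do
        insert (PR[v], u) into bins[⌊u/q⌋]
PR[:] = 0
for b = 0 to P − 1 do                             ▷ Gather
    for all (update, dest) in bins[b] do
        PR[dest] = PR[dest] + update
for all v ∈ V do                                  ▷ Apply
    PR[v] = (1 − d)/|V| + d × PR[v]
Algorithm 5 PageRank Iteration using BVGAS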
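
For comparison with the partition-centric sketches, a BVGAS iteration can be written in Python as below. The name `bvgas_iteration`, the list-based bins and the default damping factor are assumptions for illustration; unlike the optimized implementation described in the text, this sketch rewrites destination IDs every iteration and uses no cached write buffers. The returned message count makes the per-edge redundancy visible.

```python
def bvgas_iteration(out_neighbors, pr, q, d=0.85):
    """One BVGAS iteration; returns (new PageRank values, number of messages written)."""
    n = len(out_neighbors)
    num_bins = (n + q - 1) // q
    bins = [[] for _ in range(num_bins)]
    for v, nbrs in enumerate(out_neighbors):        # scatter, vertex-centric
        if nbrs:
            update = pr[v] / len(nbrs)
            for u in nbrs:
                bins[u // q].append((update, u))    # one message per edge
    sums = [0.0] * n
    for b in bins:                                  # gather, one bin at a time
        for update, dest in b:
            sums[dest] += update
    new_pr = [(1.0 - d) / n + d * s for s in sums]  # apply
    return new_pr, sum(len(b) for b in bins)
```

The message count always equals the number of edges, whereas a partition-centric scatter writes one update per (node, neighboring partition) pair.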

Unlike algorithm 5, in our BVGAS implementation, we write the destination IDs only in the first iteration. We also use small cached buffers to store updates before writing to DRAM. This ensures full cache line utilization and reduces communication during scatter phase [5].

Irrespective of all the locality advantages and optimizations, BVGAS inherently suffers from redundant reads and writes of a vertex value on all of its outgoing edges. This redundancy manifests itself in the form of BVGAS’ inability to utilize high locality in graphs with optimized node labeling. PCPM on the other hand, uses graph locality to reduce the fraction of graph traversed in scatter phase. Unlike PCPM, the Vertex-centric traversal in BVGAS can also insert consecutive updates into different bins. This leads to random DRAM accesses and poor bandwidth utilization. We provide a quantitative analysis of these differences in the next section.

4 Analytical Evaluation

We derive performance models to compare PCPM against conventional Pull Direction PageRank (PDPR) and BVGAS. Our models provide valuable insights into the behavior of different methodologies with respect to varying graph structure and locality. Table 2 defines the parameters used in the analysis. We use a synthetic kronecker graph [28] of scale 25 (kron) as an example for illustration purposes.

Original Graph $G(V, E)$
    $n$         no. of vertices ($|V|$)
    $m$         no. of edges ($|E|$)
PNG layout $G'(P, V, E')$
    $k$         no. of partitions ($|P|$)
    $r$         compression ratio ($|E| / |E'|$)
Architecture
    $c_{mr}$    cache miss ratio for source value reads in PDPR
    $l$         sizeof (cache line)
Software
    $d_v$       sizeof (updates/PageRank value)
    $d_i$       sizeof (node or edge index)
Table 2: List of model parameters

We analyze the amount of data exchanged with main memory per iteration of PageRank. We assume that data is accessed in quantum of one cache line and BVGAS exhibits full cache line utilization. Since destination indices are written only in the first iteration for PCPM and BVGAS, they are not accounted for in this model.

PDPR: The pull technique scans all edges in the graph once (algorithm 1). For a CSR format, this requires reading $n$ edge offsets and $m$ source node indices, each of $d_i$ Bytes. PDPR also reads $m$ source node values that incur cache misses with ratio $c_{mr}$, generating $m c_{mr} l$ Bytes of DRAM traffic for a cache line of $l$ Bytes. Outputting new PageRank values generates $n d_v$ Bytes of writes to DRAM. The total communication volume for PDPR is:

$PDPR_{comm} = m(d_i + c_{mr} l) + n(d_i + d_v)$    (3)

BVGAS: The scatter phase (algorithm 5) scans the graph and writes updates on all outgoing edges of the source node, thus communicating $(m + n) d_i + (m + n) d_v$ Bytes. The gather phase loads updates and destination node IDs on all the edges, generating $m(d_i + d_v)$ Bytes of read traffic. At the end of the gather phase, $n d_v$ Bytes of new PageRank values are written to the main memory. The total communication volume for BVGAS is therefore given by:

$BVGAS_{comm} = 2m(d_i + d_v) + n(d_i + 2 d_v)$    (4)

PCPM with PNG: The number of edge offsets in the bipartite graph of each partition is $k$. Thus, in the scatter phase (algorithm 3), a scan of PNG reads $(k^2 + m/r) d_i$ Bytes. The scatter phase further reads $n$ PageRank values and writes updates on $m/r$ edges. The gather phase (algorithm 4) reads $m$ destination IDs and $m/r$ updates, followed by $n$ new PageRank value writes. The net communication volume in PCPM is given by:

$PCPM_{comm} = m\left(d_i\left(1 + \frac{1}{r}\right) + \frac{2 d_v}{r}\right) + k^2 d_i + 2 n d_v$    (5)

Comparison: Performance of the pull technique depends heavily on $c_{mr}$. In the worst case, all accesses are cache misses, i.e. $c_{mr} = 1$, and in the best case, only cold misses are encountered to load the PageRank values in cache, i.e. $c_{mr} = n d_v / (m l)$. Assuming $n \ll m$, we get $PDPR_{comm} \in [m d_i, m(d_i + l)]$. On the other hand, communication for BVGAS stays constant. With its additional loads and stores, $BVGAS_{comm}$ can never reach the lower bound of $PDPR_{comm}$. Comparatively, $PCPM_{comm}$ achieves optimality when, for every vertex, all outgoing edges can be compressed into a single edge, i.e. $r = m/n$. In the worst case when $r = 1$, PCPM is still as good as BVGAS and we get $PCPM_{comm} \in [m d_i, 2m(d_i + d_v)]$. Unlike BVGAS, $PCPM_{comm}$ achieves the same lower bound as $PDPR_{comm}$.
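
The three communication models can be compared numerically with a small Python sketch. The formulas below are a reading of the prose derivation above (edge offsets, indices, values and updates counted per phase) and should be treated as an assumption rather than the authors' exact expressions; the function names and the default sizes of 4 Byte indices and values and a 64 Byte cache line are likewise illustrative choices.

```python
def pdpr_comm(n, m, c_mr, d_i=4, d_v=4, l=64):
    """Bytes moved per iteration by pull-direction PageRank."""
    return m * (d_i + c_mr * l) + n * (d_i + d_v)

def bvgas_comm(n, m, d_i=4, d_v=4):
    """Bytes moved per iteration by BVGAS (independent of graph locality)."""
    return 2 * m * (d_i + d_v) + n * (d_i + 2 * d_v)

def pcpm_comm(n, m, k, r, d_i=4, d_v=4):
    """Bytes moved per iteration by PCPM with PNG; r is the compression ratio."""
    return m * (d_i * (1 + 1 / r) + 2 * d_v / r) + k * k * d_i + 2 * n * d_v
```

With hypothetical values such as `n = 1000`, `m = 20000` and `k = 4`, increasing `r` lowers the PCPM estimate while the BVGAS estimate stays fixed, which mirrors the comparison in the text.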

Analyzing equations 3 and 4, we see that BVGAS is profitable compared to PDPR when:

$c_{mr} > \frac{d_i + 2 d_v}{l}$    (6)

In comparison, PCPM offers a more relaxed constraint on $c_{mr}$ (by a factor of $1/r$), becoming advantageous when:

$c_{mr} > \frac{d_i + 2 d_v}{r l}$    (7)

The RHS in eq. 6 is constant, indicating that BVGAS is advantageous for low locality graphs. With optimized node ordering, we can reduce $c_{mr}$ so that PDPR outperforms BVGAS. On the contrary, $r$ in the RHS of eq. 7 is a function of locality. With an optimized node labeling, $r$ also increases and enhances the performance of PCPM. Fig. 6 shows the effect of $r$ on predicted DRAM communication for the kron graph. Obtaining an optimal node labeling that makes $r = m/n$ might be very difficult or even impossible for some graphs. However, as can be observed from fig. 6, DRAM traffic decreases rapidly at small values of $r$ and converges slowly as $r$ grows. Therefore, a node reordering that achieves even a moderate $r$ is good enough to optimize communication volume in PCPM.

Figure 6: Predicted DRAM traffic for kron graph with  M,  M, and  Bytes.

4.1 Random Memory Accesses

We define a random access as a non-sequential jump in the address of memory location being read from or written to DRAM. Random accesses can incur latency penalties and negatively impact the sustained memory bandwidth. In this subsection, we model the amount of random accesses performed by different methodologies in a single PageRank iteration.

PDPR: Reading edge offsets and source node IDs in pull technique is completely sequential because of the CSR format. However, all accesses to source node PageRank values served by DRAM contribute to potential random accesses resulting in:

$PDPR_{ra} = O(m \cdot c_{mr})$    (8)

BVGAS: In the scatter phase of algorithm 5, updates can potentially be inserted at random memory locations. Assuming full cache line utilization for BVGAS, for every $l$ Bytes written, there is at most 1 random DRAM access. In the gather phase, all DRAM accesses are sequential if we assume that the bin width is smaller than the cache. Total random accesses for BVGAS are then given by:

$BVGAS_{ra} = \frac{m d_v}{l}$    (9)

PCPM: With the PNG layout (algorithm 3), there are at most bin switches when scattering updates from a partition. Since there are such partitions, the total number of random accesses in PCPM is bounded by:

(10)

Comparison: BVGAS exhibits fewer random accesses than PDPR. However, the number of random accesses in PCPM is much smaller than in both BVGAS and PDPR. For instance, in the kron dataset with  Bytes,  Bytes and ,  M whereas  M.

Although it is not indicated in algorithm 5, the number of data dependent unpredictable branches in cache bypassing BVGAS implementation is also . For every update insertion, the BVGAS scatter function has to check if the corresponding cached buffer is full (section 3.6). In contrast, the number of branch mispredictions for PCPM (using branch avoidance) is with misprediction for every destination partition () switch in algorithm 3. The derivations are similar to random access model and for the sake of brevity, we do not provide a detailed deduction.

5 Experimental Evaluation

5.1 Experimental Setup and Datasets

We conduct experiments on a dual-socket Ivy Bridge server equipped with two 8-core Intel Xeon E5-2650 v2 processors running at 2.6 GHz, with Ubuntu 14.04 as the OS. Table 3 lists important characteristics of our machine. Memory bandwidth is measured using the STREAM benchmark [25]. All codes are written in C++ and compiled using G++ 4.7.1 with the -O3 optimization flag. The memory statistics are collected using Intel Performance Counter Monitor [40]. All data types used for indices and PageRank values are  Bytes.

Component   Parameter         Value
Socket      no. of cores      8
            shared L3 cache   25 MB
Core        L1d cache         32 KB
            L2 cache          256 KB
Memory      size              128 GB
            Read BW           59.6 GB/s
            Write BW          32.9 GB/s
Table 3: System Characteristics

We use 6 large real-world and synthetic graph datasets from different applications for performance evaluation. Table 4 summarizes the size and sparsity characteristics of these graphs. Gplus and twitter are follower graphs on social networks; pld, web and sd1 are hyperlink graphs obtained by web crawlers; and kron is a scale 25 graph generated using the Graph500 Kronecker generator. The web graph is very sparse but has high locality obtained by a very expensive pre-processing of node labels [6]. The kron graph has a higher edge density than the other datasets.

Dataset        Description         # Nodes (M)   # Edges (M)   Avg. Degree
gplus [12]     Google Plus         28.94         462.99        16
pld [27]       Pay-Level-Domain    42.89         623.06        14.53
web [6]        Webbase-2001        118.14        992.84        8.4
kron [28]      Synthetic graph     33.5          1047.93       31.28
twitter [19]   Follower network    61.58         1468.36       23.84
sd1 [27]       Subdomain graph     94.95         1937.49       20.4
Table 4: Graph Datasets

5.2 Implementation Details

We use a simple hand coded implementation of algorithm 1 for PDPR and parallelize it over vertices with static load balancing on the number of edges traversed. Our baseline does not incur overheads associated with similar implementations in frameworks [35, 30, 37] and hence, is faster than framework based programs [5].
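A minimal sketch of static load balancing on edge counts is given below. It assumes a CSR-style offset array (for the pull direction, these would be in-edge offsets) and splits the vertices into contiguous ranges with roughly equal numbers of edges. The function name `edge_balanced_ranges` and the use of binary search are illustrative assumptions, not a description of the actual implementation.

```python
from bisect import bisect_left


def edge_balanced_ranges(offsets, num_threads):
    """Split vertices into contiguous ranges with roughly equal edge counts.

    offsets is a CSR-style array of length n + 1, so vertex v owns the
    edges offsets[v] .. offsets[v + 1] - 1.
    """
    n = len(offsets) - 1
    total_edges = offsets[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = (total_edges * t) // num_threads
        # First vertex whose starting edge offset reaches the target.
        v = bisect_left(offsets, target, lo=bounds[-1], hi=n)
        bounds.append(v)
    bounds.append(n)
    return [(bounds[i], bounds[i + 1]) for i in range(num_threads)]


# Vertex 0 has 8 edges, vertices 1-4 have 1 edge each, vertex 5 has 4 edges.
offsets = [0, 8, 9, 10, 11, 12, 16]
for start, end in edge_balanced_ranges(offsets, 2):
    print(start, end, offsets[end] - offsets[start])
```

Splitting by vertex count alone would give the first thread 10 of the 16 edges in this example; splitting on the offset array gives each thread 8.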

To parallelize the BVGAS scatter phase (algorithm 5), we give each thread a fixed range of nodes to scatter. Work per thread is statically balanced in terms of the number of edges processed. We also give each thread distinct memory spaces corresponding to all bins to avoid atomicity concerns in the scatter phase. We use the Intel AVX non-temporal store instructions [1] to bypass the cache while writing updates and use  Bytes cache line aligned buffers to accumulate the updates for streaming stores [5]. The BVGAS gather phase is parallelized over bins, with load balancing done using OpenMP dynamic scheduling. The optimal bin width is empirically determined and set to  KB (K nodes). As the bin width is a power of 2, we use bit shift instructions instead of integer division to compute the destination bin from the node ID.
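The shift-based bin computation can be sketched as follows. The constant `BIN_WIDTH_LOG2 = 16` (65536 nodes per bin) is a hypothetical value chosen only for the example, and `destination_bin` is an illustrative name.

```python
BIN_WIDTH_LOG2 = 16  # hypothetical: 2**16 = 65536 nodes per bin


def destination_bin(node_id, bin_width_log2=BIN_WIDTH_LOG2):
    """Return the bin index of a node when bins hold 2**bin_width_log2 nodes."""
    return node_id >> bin_width_log2


# The shift gives the same result as integer division by the bin width.
for node_id in (0, 65535, 65536, 1_000_000):
    assert destination_bin(node_id) == node_id // (1 << BIN_WIDTH_LOG2)
    print(node_id, destination_bin(node_id))
```

The equivalence holds only for non-negative node IDs and power-of-two bin widths, which is why the bin width is restricted to a power of 2.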

The PCPM scatter and gather phases are parallelized over partitions and load balancing in both the phases is done dynamically using OpenMP. Partition size is empirically determined and set to  KB. A detailed design space exploration of PCPM is discussed in section 5.3.2.

All the implementations mentioned in this section execute PageRank iterations on cores. For accuracy of the collected information, we repeat these algorithms times and report the average values.

5.3 Results

5.3.1 Comparison with Baselines

Execution Time: Fig. 7 gives a comparison of the GTEPS (computed as the ratio of giga edges in the graph to the runtime of a single PageRank iteration) achieved by different implementations. We observe that PCPM is roughly 2.2x to 3.9x faster than the state-of-the-art BVGAS implementation and up to about 4.1x faster than PDPR (ratios of the per-iteration times in table 5). BVGAS achieves constant throughput irrespective of the graph structure and is able to accelerate computation on low-locality graphs. However, it is worse than PDPR for high-locality (web) and dense (kron) graphs. PCPM is able to outperform PDPR and BVGAS on all datasets, though the speedup on the web graph is marginal because of the high performance of PDPR. Detailed results for execution time of BVGAS and PCPM during different phases of computation are given in table 5. The PCPM scatter phase benefits from a multitude of optimizations to achieve a dramatic speedup over the BVGAS scatter phase.

Figure 7: Performance in GTEPS. PCPM provides substantial speedup over BVGAS and PDPR.

           PDPR      BVGAS                          PCPM
Dataset    Total     Scatter   Gather    Total      Scatter   Gather    Total
           Time(s)   Time(s)   Time(s)   Time(s)    Time(s)   Time(s)   Time(s)
gplus      0.44      0.26      0.12      0.38       0.06      0.1       0.16
pld        0.68      0.33      0.15      0.48       0.09      0.13      0.22
web        0.21      0.58      0.23      0.81       0.04      0.17      0.21
kron       0.65      0.5       0.22      0.72       0.07      0.18      0.25
twitter    1.83      0.79      0.32      1.11       0.18      0.27      0.45
sd1        1.97      1.07      0.42      1.49       0.24      0.35      0.59
Table 5: Execution time per iteration of PageRank for PDPR, BVGAS and PCPM
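The GTEPS metric and the speedups can be recomputed from the published numbers. The sketch below uses the edge counts of table 4 and the total per-iteration times of table 5; because those times are rounded to two decimals, the results are approximate. The helper names `gteps` and `speedup` are illustrative.

```python
# Edge counts in millions (table 4) and total time per iteration in seconds (table 5).
EDGES_M = {"gplus": 462.99, "pld": 623.06, "web": 992.84,
           "kron": 1047.93, "twitter": 1468.36, "sd1": 1937.49}
TIME_S = {
    "PDPR":  {"gplus": 0.44, "pld": 0.68, "web": 0.21,
              "kron": 0.65, "twitter": 1.83, "sd1": 1.97},
    "BVGAS": {"gplus": 0.38, "pld": 0.48, "web": 0.81,
              "kron": 0.72, "twitter": 1.11, "sd1": 1.49},
    "PCPM":  {"gplus": 0.16, "pld": 0.22, "web": 0.21,
              "kron": 0.25, "twitter": 0.45, "sd1": 0.59},
}


def gteps(edges_millions, seconds):
    """Giga edges traversed per second for one PageRank iteration."""
    return edges_millions * 1e6 / seconds / 1e9


def speedup(baseline, method, dataset):
    """Ratio of baseline time to method time on one dataset."""
    return TIME_S[baseline][dataset] / TIME_S[method][dataset]


for name, edges in EDGES_M.items():
    print(name,
          round(gteps(edges, TIME_S["PCPM"][name]), 2),
          round(speedup("BVGAS", "PCPM", name), 2),
          round(speedup("PDPR", "PCPM", name), 2))

avg_vs_bvgas = sum(speedup("BVGAS", "PCPM", d) for d in EDGES_M) / len(EDGES_M)
print("average speedup over BVGAS:", round(avg_vs_bvgas, 1))
```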

Communication and Bandwidth: Fig. 8 shows the amount of data communicated with main memory normalized by the number of edges in the graph. Average communication in PCPM is about 1.7x and 2.2x less than BVGAS and PDPR, respectively (consistent with the per-iteration transfer volumes in table 7). Further, PCPM memory traffic per edge for web and kron is lower than for other graphs because of their high compression ratio (table 6). The normalized communication for BVGAS is almost constant and therefore, its utility depends on the efficiency of the pull direction baseline.

Figure 8: Main memory traffic per edge. PCPM communicates the least for all datasets except the web graph.

Note that the speedup obtained by PCPM is larger than the reduction in communication volume. This is because by avoiding random DRAM accesses and unpredictable branches, PCPM is able to efficiently utilize the available DRAM bandwidth. As shown in fig. 9, PCPM can sustain an average  GB/s bandwidth compared to  GB/s and  GB/s of PDPR and BVGAS, respectively. For large graphs like sd1, PCPM achieves of the peak read bandwidth (table 3) of our system. Although both PDPR and BVGAS suffer from random memory accesses, the former executes very few instructions and therefore, has better bandwidth utilization.

Figure 9: Sustained Memory Bandwidth for different methods. PCPM achieves highest bandwidth utilization.

                                  Original Labeling             GOrder Labeling
Dataset    #Edges in Graph (M)    #Edges in PNG (M)   Ratio     #Edges in PNG (M)   Ratio
gplus      463                    243.8               1.9       157.4               2.94
pld        623.1                  347.7               1.79      166.7               3.73
web        992.8                  118.1               8.4       126.8               7.83
kron       1047.9                 342.7               3.06      169.7               6.17
twitter    1468.4                 722.4               2.03      386.2               3.8
sd1        1937.5                 976.9               1.98      366.2               5.29
Table 6: Locality vs. compression ratio (Ratio = #Edges in Graph / #Edges in PNG). GOrder improves locality in neighbors and increases compression
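The compression ratios in table 6 can be checked as the number of edges in the graph divided by the number of edges in the PNG. The sketch below takes the graph edge counts from table 4 and the PNG edge counts from table 6; the dictionary and function names are illustrative.

```python
# Edge counts in millions: graph edges from table 4, PNG edges from table 6.
GRAPH_EDGES_M = {"gplus": 462.99, "pld": 623.06, "web": 992.84,
                 "kron": 1047.93, "twitter": 1468.36, "sd1": 1937.49}
PNG_EDGES_M = {
    "Orig":   {"gplus": 243.8, "pld": 347.7, "web": 118.1,
               "kron": 342.7, "twitter": 722.4, "sd1": 976.9},
    "GOrder": {"gplus": 157.4, "pld": 166.7, "web": 126.8,
               "kron": 169.7, "twitter": 386.2, "sd1": 366.2},
}


def compression_ratio(graph_edges, png_edges):
    """Edges in the original graph per edge in the partition-node graph."""
    return graph_edges / png_edges


for dataset, edges in GRAPH_EDGES_M.items():
    orig = compression_ratio(edges, PNG_EDGES_M["Orig"][dataset])
    gorder = compression_ratio(edges, PNG_EDGES_M["GOrder"][dataset])
    print(dataset, round(orig, 2), round(gorder, 2))
```

For kron, this gives about 3.06 with the original labeling and about 6.17 with GOrder, which matches the ratios listed in the table when the graph has 1047.93 M edges as reported in table 4.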

The reduced communication and streaming access patterns in PCPM also enhance its energy efficiency, resulting in lower DRAM energy consumption per edge as compared to BVGAS and PDPR, as shown in fig. 10. Energy efficiency is important from an eco-friendly computing perspective, as highlighted by the Green Graph500 benchmark [16].

Figure 10: DRAM energy consumption per edge. PCPM benefits from reduced communication and random memory accesses.

Effect of Locality: To assess the impact of locality on different methodologies, we relabel the nodes in our graph datasets using the GOrder [39] algorithm. We refer to the original node labeling in the graph as Orig and the GOrder labeling as simply GOrder. GOrder increases spatial locality by placing nodes with common in-neighbors closer in memory. As a result, outgoing edges of the nodes tend to be concentrated in a few partitions, which increases the compression ratio as shown in table 6. However, the web graph exhibits near-optimal compression with Orig (a compression ratio of 8.4, equal to its average degree) and does not show improvement with GOrder.

Table 7 shows the impact of GOrder on DRAM communication. As expected, BVGAS communicates a nearly constant amount of data for a given graph irrespective of the labeling scheme used. On the contrary, memory traffic generated by PDPR and PCPM decreases because of the reduced cache miss ratio and the increased compression ratio, respectively. These observations are in complete accordance with the performance models discussed in section 4. The effect on PCPM is not as drastic as on PDPR because after the compression ratio becomes greater than a threshold, PCPM communication decreases slowly as shown in fig. 6. Nevertheless, for almost all of the datasets, the net data transferred in PCPM is remarkably less than in both PDPR and BVGAS for either of the vertex labelings.

PDPR BVGAS PCPM
Dataset Orig GOrder Orig GOrder Orig GOrder
gplus 13.1 7.4 9.3 9.3 6.6 5.1
pld 24.5 10.7 12.6 12.5 9.4 6.1
web 7.5 7.6 21.6 21.3 8.5 8.4
kron 18.1 10.8 19.9 19.5 10.4 7.5
twitter 68.2 31.6 28.8 28.2 19.4 13.4
sd1 65.1 23.8 37.8 37.8 26.9 15.6
Table 7: DRAM data transfer per iteration (in GB). PDPR and PCPM benefit from optimized node labeling
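The average reduction in DRAM communication can be recomputed from table 7. The sketch below averages the per-dataset ratios of baseline traffic to PCPM traffic for each labeling; the names `TRAFFIC_GB` and `average_reduction` are illustrative, and the results are approximate because the table values are rounded.

```python
# DRAM data transfer per iteration in GB (table 7), as (Orig, GOrder) pairs.
TRAFFIC_GB = {
    "PDPR":  {"gplus": (13.1, 7.4), "pld": (24.5, 10.7), "web": (7.5, 7.6),
              "kron": (18.1, 10.8), "twitter": (68.2, 31.6), "sd1": (65.1, 23.8)},
    "BVGAS": {"gplus": (9.3, 9.3), "pld": (12.6, 12.5), "web": (21.6, 21.3),
              "kron": (19.9, 19.5), "twitter": (28.8, 28.2), "sd1": (37.8, 37.8)},
    "PCPM":  {"gplus": (6.6, 5.1), "pld": (9.4, 6.1), "web": (8.5, 8.4),
              "kron": (10.4, 7.5), "twitter": (19.4, 13.4), "sd1": (26.9, 15.6)},
}
ORIG, GORDER = 0, 1


def average_reduction(baseline, labeling):
    """Mean over datasets of baseline traffic divided by PCPM traffic."""
    ratios = [TRAFFIC_GB[baseline][d][labeling] / TRAFFIC_GB["PCPM"][d][labeling]
              for d in TRAFFIC_GB["PCPM"]]
    return sum(ratios) / len(ratios)


for baseline in ("BVGAS", "PDPR"):
    print(baseline,
          round(average_reduction(baseline, ORIG), 1),
          round(average_reduction(baseline, GORDER), 1))
```

With the original labeling, the averages come out to roughly 1.7x over BVGAS and 2.2x over PDPR. The web graph is the one case where PDPR transfers less data than PCPM.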
Figure 11: Compression ratio increases with partition size

5.3.2 PCPM Design Space Exploration

As we increase the partition size, the neighbors of each node are forced to fit in fewer partitions, resulting in better compression as shown in fig. 11. The web graph is an exception for which the compression ratio remains almost constant because its node labeling provides high spatial locality and close to optimal compression even for small partition sizes. The kron dataset exhibits larger compression than other graphs because of its high edge density.
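The relation between partition size and compression can be illustrated on a toy graph. The sketch below assumes index-based partitioning (partition ID = node ID divided by partition size) and counts one PNG edge per distinct (source node, destination partition) pair. The edge list and the function names are made up for illustration and do not reproduce the actual PNG construction.

```python
def png_edge_count(edges, partition_size):
    """Count PNG edges under index-based partitioning.

    All edges from one source node into the same destination partition
    collapse into a single (source, partition) edge.
    """
    return len({(src, dst // partition_size) for src, dst in edges})


def compression_ratio(edges, partition_size):
    """Original edge count divided by PNG edge count."""
    return len(edges) / png_edge_count(edges, partition_size)


# Toy directed graph on 8 nodes, given as (source, destination) pairs.
edges = [(0, 2), (0, 3), (0, 5), (0, 6), (3, 0), (3, 4), (3, 7), (4, 2)]
for partition_size in (1, 2, 4, 8):
    print(partition_size,
          png_edge_count(edges, partition_size),
          round(compression_ratio(edges, partition_size), 2))
```

On this toy input the PNG shrinks from 8 edges with single-node partitions to 3 edges with one partition of 8 nodes, so the compression ratio grows with partition size.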

A direct consequence of the increase in compression ratio is observed in the amount of DRAM communication, which reduces as we increase the partition size (fig. 12). However, if the partition size exceeds the cache capacity, the cache is unable to accommodate all the nodes of a partition, resulting in cache misses. This drastically increases the main memory traffic because, for each cache miss, one complete cache line is transferred from DRAM.

Figure 12: Impact of partition size on communication volume. Very large partitions result in cache misses and increased DRAM traffic.

The execution time (fig. 13) also benefits from communication reduction and is penalized by cache misses for large partitions. Note that for partition sizes  KB and  MB, communication volume decreases but execution time increases. This is because in this range, many requests are served from the larger shared L3 cache which is slower than the private L1 and L2 caches. This phenomenon decelerates the computation but does not add to DRAM traffic.

Figure 13: Impact of partition size on execution time.

Partition size represents an important trade off in PCPM. Large partitions result in better compression but poor locality for random accesses to nodes within the partition. Fig. 14 shows the effect of altering partition size separately on scatter and gather phases for PageRank computation on sd1 dataset. Both scatter and gather phases benefit from higher compression as partition size increases. However, gather phase performance saturates early because its memory accesses are proportional to , as compared to scatter phase where accesses are proportional to . However, in both the phases, nodes within a partition are randomly accessed and hence, the performance declines if partition size grows beyond what fits in cache. Based on our observations, we chose the  KB as the optimal partition size for our platform.

Figure 14: Time consumption of scatter and gather phases for sd1 graph. Both the phases execute fastest for partition size KB

5.3.3 Pre-processing Time

We assume that the adjacency matrix in CSR and CSC format is available and hence, PDPR does not need any pre-processing. Both BVGAS and PCPM, however, require a prior computation of bin size and write offsets, incurring non-zero pre-processing time as shown in table 8. In addition, PCPM also constructs the PNG layout. Fortunately, the computation of write offsets is merged with PNG construction (section 3.3) to reduce the overhead. The pre-processing time is less than the execution time of a single iteration for BVGAS and less than that of two iterations for PCPM (table 5), and is easily amortized over multiple iterations of PageRank.

Dataset PCPM BVGAS PDPR
gplus 0.25s 0.1s 0s
pld 0.32s 0.15s 0s
web 0.26s 0.18s 0s
kron 0.43s 0.22s 0s
twitter 0.7s 0.27s 0s
sd1 0.95s 0.32s 0s
Table 8: Pre-processing time of different methodologies. PNG construction increases the overhead of PCPM
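To put the overhead in perspective, the sketch below expresses each pre-processing time from table 8 as a multiple of the corresponding per-iteration time from table 5. The dictionary and function names are illustrative.

```python
# Pre-processing time (table 8) and total time per iteration (table 5), in seconds.
PREPROCESS_S = {
    "PCPM":  {"gplus": 0.25, "pld": 0.32, "web": 0.26,
              "kron": 0.43, "twitter": 0.7, "sd1": 0.95},
    "BVGAS": {"gplus": 0.1, "pld": 0.15, "web": 0.18,
              "kron": 0.22, "twitter": 0.27, "sd1": 0.32},
}
ITERATION_S = {
    "PCPM":  {"gplus": 0.16, "pld": 0.22, "web": 0.21,
              "kron": 0.25, "twitter": 0.45, "sd1": 0.59},
    "BVGAS": {"gplus": 0.38, "pld": 0.48, "web": 0.81,
              "kron": 0.72, "twitter": 1.11, "sd1": 1.49},
}


def preprocessing_in_iterations(method, dataset):
    """Pre-processing cost expressed in iterations of the same method."""
    return PREPROCESS_S[method][dataset] / ITERATION_S[method][dataset]


for method in ("PCPM", "BVGAS"):
    for dataset in PREPROCESS_S[method]:
        print(method, dataset, round(preprocessing_in_iterations(method, dataset), 2))
```

By this measure, the pre-processing cost stays below one iteration for BVGAS and below two iterations for PCPM on every dataset.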

6 Conclusion and Future Work

In this paper, we formulated a Partition-Centric Processing Methodology (PCPM) that perceives a graph as a set of links between nodes and partitions instead of nodes and their individual neighbors. We presented several features of this abstraction and developed system level optimizations to exploit them.

We developed a novel PNG data layout for efficient processing using PCPM. The idea for this layout originated when we were trying to relax the Vertex-centric programming constraint that all outgoing edges of a node should be traversed consecutively. The additional freedom arising from treating edges individually allowed us to group edges in a way that eliminates random memory accesses to sustain of peak memory bandwidth.

We conducted extensive analytical and experimental evaluation of our approach. Using a simple index-based partitioning, we observed an average 2.7x speedup in execution time and 1.7x reduction in DRAM communication volume over the state-of-the-art. In the future, we will explore edge partitioning models [21, 8] to further reduce communication and improve load balancing for PCPM.

Although we demonstrate the advantages of PCPM on PageRank, we show that it can be easily extended to generic SpMV computation. We believe that PCPM can be an efficient programming model for other graph algorithms or graph analytics frameworks. In this context, there are many promising directions for further exploration. For instance, the streaming memory access patterns of PNG-enabled PCPM are highly suitable for High Bandwidth Memory (HBM) and disk-based systems. Exploring PCPM as a programming model for heterogeneous memory or processor architectures is an interesting avenue for future work.

PCPM accesses nodes from only one graph partition at a time. Hence, G-Store’s smallest number of bits representation [18] can be used to reduce the memory footprint and DRAM communication even further. Devising novel methods for enhanced compression can also make PCPM amenable to be used for large-scale graph processing on commodity PCs.

Acknowledgements:

This material is based on work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Number FA8750-17-C-0086, National Science Foundation (NSF) under Contract Numbers CNS-1643351 and ACI-1339756 and Air Force Research Laboratory under Grant Number FA8750-15-1-0185. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, NSF or AFRL. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

  • [1] Intel® c++ compiler 17.0 developer guide and reference, 2016. Available at https://software.intel.com/en-us/intel-cplusplus-compiler-17.0-user-and-reference-guide.
  • [2] Abou-Rjeili, A., and Karypis, G. Multilevel algorithms for partitioning power-law graphs. In Proceedings of the 20th International Conference on Parallel and Distributed Processing (2006), IPDPS’06, IEEE Computer Society, pp. 124–124.
  • [3] Albert, R., Jeong, H., and Barabási, A.-L. Internet: Diameter of the world-wide web. Nature 401, 6749 (1999), 130.
  • [4] Aldous, J. M., and Wilson, R. J. Graphs and applications: an introductory approach, vol. 1. Springer Science & Business Media, 2003.
  • [5] Beamer, S., Asanović, K., and Patterson, D. Reducing pagerank communication via propagation blocking. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International (2017), IEEE, pp. 820–831.
  • [6] Boldi, P., Rosa, M., Santini, M., and Vigna, S. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th international conference on World wide web (2011), ACM, pp. 587–596.
  • [7] Boldi, P., Santini, M., and Vigna, S. Permuting web and social graphs. Internet Mathematics 6, 3 (2009), 257–283.
  • [8] Bourse, F., Lelarge, M., and Vojnovic, M. Balanced graph edge partition. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), KDD ’14, ACM, pp. 1456–1465.
  • [9] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. Graph structure in the web. Computer networks 33, 1-6 (2000), 309–320.
  • [10] Bronson, N., Amsden, Z., Cabrera, G., Chakka, P., Dimov, P., Ding, H., Ferris, J., Giardullo, A., Kulkarni, S., Li, H. C., et al. Tao: Facebook’s distributed data store for the social graph. In USENIX Annual Technical Conference (2013), pp. 49–60.
  • [11] Buono, D., Petrini, F., Checconi, F., Liu, X., Que, X., Long, C., and Tuan, T.-C. Optimizing sparse matrix-vector multiplication for large-scale data analytics. In Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 37.
  • [12] Gong, N. Z., Xu, W., Huang, L., Mittal, P., Stefanov, E., Sekar, V., and Song, D. Evolution of social-attribute networks: measurements, modeling, and implications using google+. In Proceedings of the 2012 Internet Measurement Conference (2012), ACM, pp. 131–144.
  • [13] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. Powergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12) (2012), USENIX, pp. 17–30.
  • [14] Green, O., Dukhan, M., and Vuduc, R. Branch-avoiding graph algorithms. In Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures (2015), ACM, pp. 212–223.
  • [15] Haklay, M., and Weber, P. Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7, 4 (2008), 12–18.
  • [16] Hoefler, T. Green graph500. Available at http://green.graph500.org/.
  • [17] Huber, W., Carey, V. J., Long, L., Falcon, S., and Gentleman, R. Graphs in molecular biology. BMC bioinformatics 8, 6 (2007), S8.
  • [18] Kumar, P., and Huang, H. H. G-store: high-performance graph store for trillion-edge processing. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for (2016), IEEE, pp. 830–841.
  • [19] Kwak, H., Lee, C., Park, H., and Moon, S. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web (2010), ACM, pp. 591–600.
  • [20] Lakhotia, K., Singapura, S., Kannan, R., and Prasanna, V. Recall: Reordered cache aware locality based graph processing. In High Performance Computing (HiPC), 2017 IEEE 24th International Conference on (2017), IEEE, pp. 273–282.
  • [21] Li, L., Geda, R., Hayes, A. B., Chen, Y., Chaudhari, P., Zhang, E. Z., and Szegedy, M. A simple yet effective balanced edge partition model for parallel computing. In Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems (2017), SIGMETRICS ’17 Abstracts, ACM, pp. 6–6.
  • [22] Liu, W.-H., and Sherman, A. H. Comparative analysis of the cuthill–mckee and the reverse cuthill–mckee ordering algorithms for sparse matrices. SIAM Journal on Numerical Analysis 13, 2 (1976), 198–213.
  • [23] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (2010), ACM, pp. 135–146.
  • [24] Malicevic, J., Lepers, B., and Zwaenepoel, W. Everything you always wanted to know about multicore graph processing but were afraid to ask. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (2017), USENIX, pp. 631–643.
  • [25] McCalpin, J. D. Stream benchmark. Link: www.cs.virginia.edu/stream/ref.html#what 22 (1995).
  • [26] McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost? In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems (2015), HOTOS’15, USENIX Association, pp. 14–14.
  • [27] Meusel, R., Vigna, S., Lehmberg, O., and Bizer, C. The graph structure in the web: Analyzed on different aggregation levels. The Journal of Web Science 1, 1 (2015), 33–47.
  • [28] Murphy, R. C., Wheeler, K. B., Barrett, B. W., and Ang, J. A. Introducing the graph 500. Cray Users Group (CUG) 19 (2010), 45–74.
  • [29] Newman, M. E., Watts, D. J., and Strogatz, S. H. Random graph models of social networks. Proceedings of the National Academy of Sciences 99, suppl 1 (2002), 2566–2572.
  • [30] Nguyen, D., Lenharth, A., and Pingali, K. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 456–471.
  • [31] Nishtala, R., Vuduc, R. W., Demmel, J. W., and Yelick, K. A. When cache blocking of sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication and Computing 18, 3 (2007), 297–311.
  • [32] Penner, M., and Prasanna, V. K. Cache-friendly implementations of transitive closure. Journal of Experimental Algorithmics (JEA) 11 (2007), 1–3.
  • [33] Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., and Haradasan, M. Managing large graphs on multi-cores with graph awareness. 41–52.
  • [34] Roy, A., Mihailovic, I., and Zwaenepoel, W. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 472–488.
  • [35] Shun, J., and Blelloch, G. E. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices (2013), vol. 48, ACM, pp. 135–146.
  • [36] Siek, J. G., Lee, L.-Q., and Lumsdaine, A. The Boost Graph Library: User Guide and Reference Manual, Portable Documents. Pearson Education, 2001.
  • [37] Sundaram, N., Satish, N., Patwary, M. M. A., Dulloor, S. R., Anderson, M. J., Vadlamudi, S. G., Das, D., and Dubey, P. Graphmat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (2015), 1214–1225.
  • [38] Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., and Owens, J. D. Gunrock: A high-performance graph processing library on the gpu. In ACM SIGPLAN Notices (2016), vol. 51, ACM, p. 11.
  • [39] Wei, H., Yu, J. X., Lu, C., and Lin, X. Speedup graph processing by graph ordering. In Proceedings of the 2016 International Conference on Management of Data (2016), ACM, pp. 1813–1828.
  • [40] Willhalm, T., Dementiev, R., and Fay, P. Intel performance counter monitor-a better way to measure cpu utilization. 2012. URL: http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpuutilization (2016).
  • [41] Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., and Demmel, J. Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Computing 35, 3 (2009), 178–194.
  • [42] Wulf, W. A., and McKee, S. A. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news 23, 1 (1995), 20–24.
  • [43] Zhou, S., Chelmis, C., and Prasanna, V. K. Optimizing memory performance for fpga implementation of pagerank. In ReConFigurable Computing and FPGAs (ReConFig), 2015 International Conference on (2015), IEEE, pp. 1–6.
  • [44] Zhou, S., Lakhotia, K., Singapura, S. G., Zeng, H., Kannan, R., Prasanna, V. K., Fox, J., Kim, E., Green, O., and Bader, D. A. Design and implementation of parallel pagerank on multicore platforms. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE (2017), IEEE, pp. 1–6.
  • [45] Zhu, X., Han, W., and Chen, W. Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC 15) (2015), USENIX Association, pp. 375–386.