PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs

05/10/2019 ∙ by Kiran Kumar Matam, et al. ∙ University of Southern California

Graph analytics is at the heart of a broad range of applications such as drug discovery, page ranking, transportation systems, and recommendation systems. When the graph size exceeds the memory size, out-of-core graph processing is needed. For the widely used external memory graph processing systems, accessing storage becomes the bottleneck. We make the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. However, existing graph processing frameworks, such as GraphChi, load the entire graph in each iteration even if only a small fraction of the graph is active. This limitation is due to the structure of the data storage used by these systems. In this work, we propose to use a compressed sparse row (CSR) based graph storage that is more amenable to selectively loading only the few active vertices in each iteration. But CSR based graph processing suffers from random update propagation to many target vertices. To solve this challenge, we propose a multi-log update mechanism that logs updates separately, rather than directly updating edges in the graph. Our proposed multi-log system maintains a separate log per vertex interval. This separation enables us to efficiently process each vertex interval by loading just the corresponding log. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to 16.40×, 1.13×, 1.64×, 1.38×, and 2.76× for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.


1. Introduction

Graph analytics is at the heart of a broad range of applications. The size of the graphs in many of these domains exceeds the size of main memory. Hence, many out-of-core (also called external memory) graph processing systems have been proposed. These systems primarily operate on graphs by splitting the graph into chunks and operating on each chunk that fits in main memory. An alternative approach is to distribute the graph processing across multiple systems to fit the graph across their distributed memory. In this work, we focus on single-system graph processing.

Many popular graph processing systems use the vertex-centric programming paradigm. This computational paradigm uses bulk synchronous parallel (BSP) processing, where each vertex is processed once during a superstep and may generate new updates to its connected vertices, which are then iteratively processed in the following superstep. A vertex can modify its state, generate updates to another vertex, or even mutate the topology of the graph.

GraphChi is an external memory graph analytics system that supports a vertex-centric programming model (graphchi, ). GraphChi was originally designed for graph processing with hard disks, so it tries to minimize the number of random disk accesses. As such, GraphChi uses a custom graph structure called a shard to split the graph into chunks such that each chunk fits in main memory (more details in the next section). GraphChi loads into main memory a shard containing a set of vertices, along with all the outgoing edges from these vertices that are located in other shards, to process vertices in batches.

As we will describe in more detail in the next section, the loading time for a graph dominates the total execution time due to the repeated fetching of shards from storage. GraphChi has to load all the shards that make up the full graph during every iteration of the BSP compute model, even if certain vertices/edges do not have any new updates. This limitation is primarily due to the structure of the shard, which stores a vertex's outgoing edge information across multiple shards. Our data shows that the number of vertices that receive an update from a previous BSP iteration changes dynamically as the graph algorithm converges. In some algorithms, such as breadth-first search, the active graph starts out very small and then grows with each superstep. In other algorithms, such as pagerank, the active graph size shrinks with each superstep. Hence, there is a significant opportunity to reduce the cost of graph processing if only the active vertices are loaded from storage.

GraphChi's shard structure was the right choice for hard disk based systems that heavily penalize random accesses. With the advent of solid state disks (SSDs), we need to evaluate new graph storage structures, such as the compressed sparse row (CSR) format, that are better suited for incrementally loading active vertices. In fact, GraphChi considered the CSR format in its original design (graphchi, ), but the main drawback of the CSR format is that updating the outgoing edges of a vertex during each BSP iteration leads to significant random access traffic. This limitation can be mitigated by using a message passing based log structure. The log structure, in essence, records all the updates to various target vertices as a set of sequential log writes, rather than directly updating the graph itself, which mitigates the random access concern. One drawback of the log structure is that messages bound to a single target vertex may be interspersed throughout the log. Hence, the log structure itself must be sorted (or traversed multiple times) to process the updates in the next iteration.

Given these limitations, this paper proposes a new graph processing framework that loads only the active vertices in each superstep. In particular, we demonstrate how CSR formatted graphs can be exploited to load only the active vertices, reducing the graph load time in iterative graph processing algorithms. Second, to tackle the large overheads of managing a single log structure, we propose a split log structure that divides the log across multiple vertex intervals. All the updates destined for a given vertex interval are stored in that interval's log. When an interval of vertices is scheduled for processing, all the updates it needs are located in a single log.

Our main contributions in this work are:

  • We propose an efficient external memory graph analytics system, called PartitionedVC, which reduces read amplification when accessing storage for active vertices' data and the updates sent between them. To realize this, we use a compressed graph storage format suitable for accessing active vertices' data, and we log the updates sent between the vertices instead of directly updating the target vertex location. We show that the log-based updates significantly reduce the number of random accesses while performing a wide range of graph analytics.

  • To efficiently log and access the updates in a superstep, we partition the graph into multiple vertex intervals for processing. To further reduce the log management overhead we split the update log into multiple logs each associated with one vertex interval.

  • While processing these vertex intervals, we efficiently schedule graph accesses in order to access only active graph pages from SSD.

  • We also support efficient graph structural updates to the compressed graph format by logging the graph structural changes for each interval.

    2. Background and Motivation

    Figure 1. Modern SSD platform architecture

    2.1. Modern SSD platforms

Since much of our work uses SSD-specific optimizations, we provide a very brief overview of SSD architecture. Figure 1 illustrates the architecture of a modern SSD platform. An SSD has multiple flash memory channels to support high data bandwidth. Multiple dies are integrated into a single NAND flash package using die stacking to fit more storage capacity on the limited platform board. Data parallelism can be achieved per die with a multi-plane or multi-way composition. Each plane or way is divided into multiple blocks, each of which has dozens of physical pages. A page is the basic physical storage unit that can be read or written by one flash command. The SSD uses firmware to manage all the idiosyncrasies of accessing a page. To execute the SSD firmware, the major components of an SSD include an embedded processor, DRAM, an on-chip interconnection network, and a flash controller.

    2.2. Graph Computational model

In this work, we support the commonly used vertex-centric programming model for graph analytics. The input to a graph computation is a directed graph G = (V, E). Each vertex v in the graph has an id between 1 and |V| and a modifiable, user-defined value associated with it. For a directed edge e = (u, v), we refer to e as the out-edge of u and the in-edge of v. Also, for e = (u, v), we refer to u as the source vertex and v as the target vertex, and e may be associated with a modifiable, user-defined value.

    A typical vertex-centric computation consists of input, where the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishes with output. Within each superstep vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its neighboring edges, or generate updates to another vertex, or even mutate the topology of the graph. Edges are not first-class citizens in this model, having no associated computation.

    Depending on when the updates are visible to the target vertices, the computational model can be either synchronous or asynchronous. In the synchronous computational model, the updates generated to a target vertex are available to it in the next superstep. Graph systems such as Pregel (malewicz2010pregel, ), and Apache Giraph (apache_giraph, ) use this approach. In an asynchronous computational model, an update to a target vertex is visible to that vertex in the current superstep. So if the target vertex is scheduled after the source vertex in a superstep, then the current superstep’s update is available to the target vertex. GraphChi (graphchi, ) and Graphlab (low2014graphlab, ) use this approach. An asynchronous computational model is shown to be useful for accelerating the convergence of many numerical algorithms (graphchi, ). PartitionedVC supports asynchronous updates.

Graph algorithms can also be broadly classified into two categories, based on how the updates are handled. A certain class of graph algorithms exhibits the associative and commutative property: updates to a target vertex can be combined into a single value, and they can be combined in any order before processing the vertex. Algorithms such as pagerank, BFS, and single-source shortest path fall in this category. Many other graph algorithms require the update order to be preserved, and each update is individually applied. Algorithms such as community detection (FLP_implementation, ), graph coloring (gonzalez2012powergraph, ), and maximal independent set (malewicz2010pregel, ) fall in this category. PartitionedVC supports both these types of graph algorithms.

    2.3. Out-of-core graph processing

In the out-of-core graph processing context, graphs are larger than main memory but fit within the storage capacity of current SSDs (terabytes). As described earlier, GraphChi (graphchi, ) is an out-of-core vertex-centric programming system. GraphChi partitions the graph into several vertex intervals and stores all the incoming edges to a vertex interval as a shard. Figure 2(b) shows the shard structure for the illustrative graph shown in Figure 2(a). For instance, shard1 stores all the incoming edges of the first vertex interval, shard2 stores the second interval's incoming edges, and shard3 stores the incoming edges of all the vertices in the third interval. While incoming edges are closely packed in a shard, the outgoing edges of a vertex are dispersed across other shards. In this example, the outgoing edges of a single vertex may be dispersed across shard1, shard2, and shard3. Another unique property of the shard organization is that each shard stores all its in-edges sorted by source vertex.

GraphChi relies on this shard organization to process vertices in intervals. It first loads into memory the shard corresponding to one vertex interval, as well as all the outgoing edges of those vertices, which may be stored across multiple shards. Updates generated during processing are asynchronously and directly passed to the target vertices through the outgoing edges in other shards in memory. Once the processing for a vertex interval in a superstep is finished, its corresponding shard and its outgoing edges in other shards are written back to the disk.

Using the above approach, GraphChi primarily relies on sequential accesses to disk data and minimizes random accesses. However, in the following superstep, a subset of vertices may become active (if they received any messages on their in-edges). The in-edges of a vertex are stored in a shard and can come from any source vertex, and the in-edges in a shard are sorted by source vertex id. Hence, even if a single vertex is active within a vertex interval, the entire shard must be loaded, since the in-edges for that vertex may be dispersed throughout the shard. For instance, if any vertex in the third interval is active, the entire shard3 must be loaded. Loading a shard may be avoided only if none of the vertices in the associated vertex interval are active. However, in real-world graphs the vertex intervals typically span tens of thousands of vertices, and during each superstep the probability that at least one vertex in a given interval is active is very high. As a result, GraphChi in practice ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.

    2.4. Active graph

To quantify the amount of superfluous loading that must be performed, we counted the active vertex and active edge counts in each superstep while running the graph coloring application described in section 5 over the datasets shown in Table 1. For this application, we ran a maximum of 15 supersteps. Figure 3 shows the active vertex and active edge counts over these supersteps. The x-axis indicates the superstep number, the primary y-axis shows the number of active vertices divided by the total number of vertices, and the secondary y-axis shows the number of active edges (updates sent over an edge) divided by the total number of edges in the graph. The fraction of active vertices and active edges shrinks dramatically as the supersteps progress. However, at the granularity of a shard, even a few active vertices lead to loading many shards, since the active vertices are spread across the shards.

    3. CSR format in the era of SSDs

    (a) CSR format representation for the example graph
    (b) GraphChi shard structure for the example graph
    Figure 2. Graph storage formats

Given the large loading bandwidth demands of GraphChi, we evaluated a compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information more efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors. The value vector, val, stores all the non-zero values from each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector. The CSR representation of the example graph is shown in Figure 2(a). The edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the adjacent outgoing vertices of a vertex in the CSR format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of that vertex are stored contiguously. Because all the outgoing edges of a vertex are stored in contiguous locations, the CSR format minimizes the number of SSD pages accessed, and hence the read amplification, when fetching adjacency information for the active vertices.
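As a concrete illustration, the following self-contained C++ sketch builds a small CSR graph and walks the out-neighbors of one vertex. The container and member names (CSRGraph, rowPtr, colIdx, val) simply mirror the vectors described above; this is illustrative code, not the framework's storage implementation.

#include <cstdint>
#include <iostream>
#include <vector>

// Minimal CSR container: rowPtr has |V|+1 entries; the out-neighbors of
// vertex v occupy colIdx[rowPtr[v] .. rowPtr[v+1]) and their edge weights
// occupy the same index range in val.
struct CSRGraph {
  std::vector<uint64_t> rowPtr;  // 8-byte row pointers
  std::vector<uint32_t> colIdx;  // 4-byte adjacent vertex ids
  std::vector<float>    val;     // edge weights
};

int main() {
  // Tiny example: 0 -> 1 (w=1.0), 0 -> 2 (w=2.0), 2 -> 1 (w=0.5)
  CSRGraph g;
  g.rowPtr = {0, 2, 2, 3};
  g.colIdx = {1, 2, 1};
  g.val    = {1.0f, 2.0f, 0.5f};

  uint32_t v = 0;
  // A single rowPtr lookup gives a contiguous range in colIdx/val, so reading
  // the adjacency of an active vertex touches only the SSD pages that
  // actually hold its edges.
  for (uint64_t i = g.rowPtr[v]; i < g.rowPtr[v + 1]; ++i)
    std::cout << v << " -> " << g.colIdx[i] << " (w=" << g.val[i] << ")\n";
  return 0;
}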

    Figure 3. Active vertices and edges over supersteps

    3.1. Challenges for graph processing with a CSR format

While the CSR format looks appealing, it suffers from one fundamental challenge. One can maintain either the in-edge list or the out-edge list in the colIdx vector, but not both (due to the coherency issues of keeping the same information in two different vectors). Consider the case where the adjacency list stores only in-edges: during a superstep, propagating updates over the out-edges generates many random accesses to the adjacency lists to extract the out-edge information.

    3.2. Two Key Observations: Using a log and splitting the log

To avoid the random access problem with the CSR format, we make our first key observation: updates to the out-edges do not need to be propagated through the adjacency list directly. Instead, these updates can simply be logged. Thus, we propose to log the updates sent between the vertices, instead of directly updating the target vertex location. In a superstep, we log all the vertex updates; in the next superstep, we group the messages by target vertex and pass them to their target vertices.

One could maintain a single log for all the updates and parse it in the next superstep. However, as the messages sent to a vertex may be spread all over the log, one may need to perform external sorting over a large number of updates. Our second key observation is that we can instead maintain a separate log for a collection of vertices. As such, we create a coarse-grain log per interval of vertices that stores all the updates bound to those vertices.

    We partition the graph into several vertex intervals and use a log for each interval. We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory, and used for processing by the vertices in that interval.

    3.3. Multi-log Architecture

    1:for all the vertex intervals in the superstep do
    2:     Load vertex interval’s update log into the buffer
    3:     Group updates based on target vertex id
    4:     repeat/* repeat ends in line 8*/
    5:         For the active vertices load the required vertex data (vertex values, in-edge and out-edge lists, in-edge and out-edge weights) into the buffer
    6:         for each of the active vertices do
    7:              ProcessVertex(VertexData)          
    8:     until Entire update buffer is processed
    Algorithm 1 Overview of a superstep in PartitionedVC
    Figure 4. Internal components of PartitionedVC framework

Given the multi-log architecture, the vertex-centric programming model can be implemented as follows. In a superstep, we loop over each of the vertex intervals for processing. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight in the algorithm is that we load only active vertex data in step 5, rather than the entire graph. There is, however, a small penalty we pay for logging the updates, which requires us to sort each vertex interval's log based on the target vertex id of the update. As long as each vertex interval log fits in main memory, we can do in-memory sorting. Furthermore, as we described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, thereby shrinking the cost of managing and sorting the log.
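The following C++ sketch mirrors the structure of Algorithm 1 for a toy in-memory setting. The names used here (interval_logs, group_by_target, process_vertex) are placeholders for the framework units described below, not PartitionedVC's actual API.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Update { uint32_t target; float data; };

// Toy stand-in for the per-interval update logs (the real system buffers
// these in memory and spills full pages to the SSD).
static std::vector<std::vector<Update>> interval_logs;

// Lines 2-3 of Algorithm 1: group one interval's update log by target vertex.
std::map<uint32_t, std::vector<Update>> group_by_target(const std::vector<Update>& log) {
  std::map<uint32_t, std::vector<Update>> grouped;
  for (const auto& u : log) grouped[u.target].push_back(u);
  return grouped;
}

// Line 7 of Algorithm 1: the user-defined vertex function would run here.
void process_vertex(uint32_t vertex, const std::vector<Update>& updates) {
  std::cout << "vertex " << vertex << " received " << updates.size() << " update(s)\n";
}

void run_superstep(uint32_t num_intervals) {
  for (uint32_t interval = 0; interval < num_intervals; ++interval) {
    auto grouped = group_by_target(interval_logs[interval]);
    // Lines 4-8: only vertices that actually received updates (active
    // vertices) are loaded and processed; inactive vertices are never touched.
    for (const auto& [vertex, updates] : grouped) process_vertex(vertex, updates);
    interval_logs[interval].clear();  // this superstep's log has been consumed
  }
}

int main() {
  interval_logs = {{{1, 0.5f}, {1, 0.25f}}, {{5, 1.0f}}};
  run_superstep(2);
  return 0;
}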

    Figure 4 shows the software components used to realize the framework, which is described in detail in the following paragraphs. First, we describe the framework for the synchronous computation model (described in subsection 2.2) and then later extend it to asynchronous computation model.

Multi-Log Unit: Handling Updates - This component handles storing and retrieving updates generated by the vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass the update to the target vertex. The update function calls PartitionedVC's runtime system transparently to the programmer. The runtime system invokes the multi-log unit to log the update. A log is maintained for each vertex interval, and an update generated to a target vertex in the current superstep is appended to the target vertex's interval log. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.

To implement logging efficiently, PartitionedVC first caches the log writes in main memory buffers, called the multi-log memory buffers. Buffering helps reduce fine-grained writes, which in turn reduces write amplification to the SSD storage. Note that flash memory in SSDs can only be written at page granularity. As such, PartitionedVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few (<5) thousand. Hence, at least several thousand pages may be allocated in the multi-log memory buffer at one time.

For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multi-log unit, the top page of the vertex interval the update is bound for is first identified. As updates are simply appended to the log, an update that fits in the available space of the top page is written into it. If there is not enough space in the top page, then a new page is allocated by the multi-log unit, and that new page becomes the top page for that vertex interval's log. We maintain a simple mapping table indexed by the vertex interval to identify the top page.
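A minimal sketch of this append path is shown below, assuming a 16KB SSD page size; MultiLogBuffer, Page, and append are illustrative names, and the eviction of full pages to the per-interval log files on the SSD is omitted.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;  // assumed SSD page size

struct Page {
  size_t used = 0;
  char bytes[kPageSize];
};

// One in-memory log per vertex interval; the last page of each log is the
// "top page" that new updates are appended to.
class MultiLogBuffer {
 public:
  explicit MultiLogBuffer(size_t num_intervals) : logs_(num_intervals) {}

  void append(size_t interval, const void* update, size_t len) {
    auto& log = logs_[interval];
    // Allocate a fresh top page if the log is empty or the current top page
    // cannot hold this update.
    if (log.empty() || log.back().used + len > kPageSize) log.emplace_back();
    Page& top = log.back();
    std::memcpy(top.bytes + top.used, update, len);
    top.used += len;
  }

  // Completely written pages are the candidates for eviction to the
  // interval's log file when the buffer exceeds its memory budget.
  size_t full_pages(size_t interval) const {
    const auto& log = logs_[interval];
    return log.empty() ? 0 : log.size() - 1;
  }

 private:
  std::vector<std::vector<Page>> logs_;
};

int main() {
  MultiLogBuffer buf(/*num_intervals=*/4);
  float update = 0.5f;
  for (int i = 0; i < 10000; ++i) buf.append(/*interval=*/2, &update, sizeof(update));
  std::printf("interval 2 has %zu full page(s) ready to evict\n", buf.full_pages(2));
  return 0;
}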

When the available free space in the multi-log buffer drops below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode of graph updates, one log file per vertex interval is stored in the SSD. When a log page is evicted from memory, it is appended to the corresponding vertex interval's log file. While the multi-log architecture may theoretically store many log pages to the SSD, in practice we noticed that evictions to the SSD are needed only when the multi-log buffer overflows main memory. But as we mentioned earlier, the total size of the active graph changes in each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log file size is proportional to the number of updates, the log file also shrinks when the active graph size is small. Hence the log for each vertex interval is mostly cached in memory as the supersteps progress.

VC Unit – Handling Data Retrievals - The updates that are bound to each vertex interval are logged, either in memory or, on overflow, in the SSD, as described above. When the next superstep begins, the updates received by each vertex in the previous iteration must be processed by that vertex. Note that the updates bound to a given vertex are dispersed throughout the log associated with that vertex's interval. Hence, it is necessary to group all the updates in that log before initiating a vertex processing function. The VC unit is responsible for this task. At the start of each vertex interval's processing, the VC unit reads the corresponding log and groups all the messages bound for each vertex in that interval.

As described in the background section, some graph algorithms have the associative and commutative property on updates, so the updates can be merged in any order. For such programs, along with the vertex processing function, we provide an accumulation function. The programmer uses the accumulation function to specify the combine operation for the updates. The VC unit applies this function to all the incoming updates in a superstep: the accumulation function is applied to all the updates bound to a target vertex before that vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the pagerank application. Hence, the VC unit can optimize performance automatically whenever an accumulation function is defined for a graph algorithm. For programs that are not associative and commutative, the updates are grouped based on the target vertex id, and the update function is called individually for each update.
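For associative and commutative algorithms, the grouping-plus-accumulation step could look like the sketch below, where accumulate stands in for the user-supplied combine operation (here a pagerank-style delta sum). The names are illustrative, not the framework's API.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Update { uint32_t target; float delta; };

// Hypothetical user-supplied accumulation: addition is associative and
// commutative, so the updates can be folded in any order.
float accumulate(float acc, const Update& u) { return acc + u.delta; }

int main() {
  std::vector<Update> interval_log = {{3, 0.10f}, {5, 0.20f}, {3, 0.05f}};

  // The VC unit folds all updates bound to a target vertex into one value
  // before the target vertex's processing function is called.
  std::unordered_map<uint32_t, float> combined;
  for (const auto& u : interval_log)
    combined[u.target] = accumulate(combined[u.target], u);

  for (const auto& [vertex, value] : combined)
    std::cout << "vertex " << vertex << " combined update = " << value << "\n";
  return 0;
}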

Graph Loader Unit: As is the case with any graph algorithm, the programmer decides when a vertex becomes active or inactive. This is typically specified as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active-vertex bit for each vertex in main memory and updates that bit during each superstep. This active-vertex bitmask is used to decide which vertices to process in the next superstep.

As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data of the vertices present in the active vertex list (line 5 in Algorithm 1). It maintains a buffer for the row pointers and a buffer for each type of vertex data (adjacency edge lists/weights). The graph loader unit loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many row pointers as fit in the row pointer buffer. For the vertices in the row pointer buffer that are active, the vertex data (in-edge/out-edge neighbors, in-edge/out-edge weights) is fetched from the colIdx or val vector stored in the SSD, accessing only the SSD pages that contain active vertices' data. The VC unit indicates which vertex data to load, such as in-edge or out-edge adjacency neighbors or edge weights, as not all of the vertex data may be required by the application program. The graph loader unit uses double buffering so that it can overlap loading vertex data from storage with the VC unit's processing of the vertex data already loaded into the buffer.
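The page-granular selection of graph data can be sketched as follows, assuming 16KB SSD pages and 4-byte colIdx entries. pages_to_fetch is a hypothetical helper that computes which colIdx pages hold the edges of active vertices, so pages holding only inactive vertices' edges are never read.

#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

constexpr uint64_t kPageSize = 16 * 1024;        // assumed SSD page size
constexpr uint64_t kIdsPerPage = kPageSize / 4;  // 4-byte vertex ids per colIdx page

// Given row pointers and the active-vertex bitmask for one interval, return
// the set of colIdx pages that must be fetched from the SSD.
std::set<uint64_t> pages_to_fetch(const std::vector<uint64_t>& rowPtr,
                                  const std::vector<bool>& active) {
  std::set<uint64_t> pages;
  for (size_t v = 0; v + 1 < rowPtr.size(); ++v) {
    if (!active[v] || rowPtr[v + 1] == rowPtr[v]) continue;  // inactive or no edges
    uint64_t firstPage = rowPtr[v] / kIdsPerPage;
    uint64_t lastPage = (rowPtr[v + 1] - 1) / kIdsPerPage;
    for (uint64_t p = firstPage; p <= lastPage; ++p) pages.insert(p);
  }
  return pages;
}

int main() {
  std::vector<uint64_t> rowPtr = {0, 3, 3, 10000};  // toy row pointers for 3 vertices
  std::vector<bool> active = {true, false, true};
  for (uint64_t p : pages_to_fetch(rowPtr, active))
    std::printf("fetch colIdx page %llu\n", static_cast<unsigned long long>(p));
  return 0;
}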

    3.4. Design Choices: Graph Vertex Interval

The first design choice for the PartitionedVC implementation is the size of each vertex interval. If each vertex interval has only a few vertices, there will be many vertex intervals. Recall that during update propagation any vertex interval may update a target in any other vertex interval. Hence, having more intervals increases the overhead of vertex interval processing and also requires more time to route updates to a target vertex interval. On the other hand, having too many vertices in a single vertex interval can lead to memory overflow: while processing a vertex interval, all the updates to that interval should fit in main memory. Typically, in vertex-centric programming, updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most its number of incoming edges. Since fitting each vertex interval's updates in main memory is a critical need, we conservatively assume that there may be an update on every incoming edge of a vertex when determining the vertex interval size. We statically partition the vertices into contiguous segments such that the worst-case number of incoming updates to each segment fits within the provided main memory size. This size limit could be set by an administrator or the application programmer, or simply by the size of the virtual machine allocated for graph processing.
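A simple sketch of this static partitioning is shown below: vertices are grouped into contiguous intervals whose worst-case update volume (one update per in-edge) stays within the memory budget. The function name make_intervals and its parameters are assumptions made for illustration.

#include <cstdint>
#include <cstdio>
#include <vector>

// Returns the first vertex id of each interval, packing contiguous vertices
// until the worst-case update log (one update per incoming edge) would
// exceed the memory budget.
std::vector<uint32_t> make_intervals(const std::vector<uint64_t>& inDegree,
                                     uint64_t bytesPerUpdate,
                                     uint64_t memoryBudgetBytes) {
  std::vector<uint32_t> intervalStarts = {0};
  uint64_t bytesInInterval = 0;
  for (size_t v = 0; v < inDegree.size(); ++v) {
    uint64_t need = inDegree[v] * bytesPerUpdate;
    if (bytesInInterval + need > memoryBudgetBytes && bytesInInterval > 0) {
      intervalStarts.push_back(static_cast<uint32_t>(v));  // start a new interval here
      bytesInInterval = 0;
    }
    bytesInInterval += need;
  }
  return intervalStarts;
}

int main() {
  std::vector<uint64_t> inDegree = {100, 5000, 70, 3000, 10};
  auto starts = make_intervals(inDegree, /*bytesPerUpdate=*/8, /*memoryBudgetBytes=*/32 * 1024);
  for (uint32_t s : starts) std::printf("interval starts at vertex %u\n", s);
  return 0;
}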

Due to our conservative assumption that there may be a message on each incoming edge, the size of a vertex interval may be small. But at runtime, the updates received by a vertex can be fewer than its number of incoming edges. For each of these vertex intervals, we keep a counter that tracks the number of updates sent to that vertex interval in the current superstep. At the beginning of the next superstep, the PartitionedVC runtime may dynamically fuse contiguous vertex intervals into a single large interval to process at once. Such dynamic fusing enables efficient use of the memory during each superstep.

    3.5. Design Choices: Graph Data Sizes

Currently, our system supports any arbitrary structure of data associated with a vertex. But for efficiency reasons, we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data, our system can easily handle that algorithm. However, if the structure of the data morphs, and in particular if its size grows arbitrarily, it is less efficient to dynamically alter the CSR organization in the SSD. PartitionedVC still works correctly in this case, but the graph loading process may be slower. Hence, we recommend that the programmer conservatively assign a large enough structure statically to accommodate future growth of the data associated with a vertex.

    3.6. Graph structural updates

    In vertex-centric programming, graph structure can be updated during the supersteps. Graph structure updates in a superstep can be applied at the end of the superstep.

    In the CSR format, merging the graph structural updates into the column index or value vectors is a costly operation, as one needs to re-shuffle the entire column vectors. To minimize the costly merging operation, we partition the CSR format graph based on the vertex intervals, so that each vertex interval’s graph data is stored separately in the CSR format.

    Instead of merging each update directly into the vertex interval’s graph data, we batch several structural updates for a vertex interval and later merge them into the graph data after a certain threshold number of structural updates. As graph structural updates generated during the vertex processing can be to any vertex, we buffer each vertex interval’s structural updates in memory. The multi-log and VC units always access these buffered updates to accurately fetch the most current graph data for processing.
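The batching described above could be organized as in the following sketch; StructuralUpdateBuffer and merge_into_csr are hypothetical names, and the merge body is left as a placeholder since it simply rebuilds one interval's CSR arrays.

#include <cstddef>
#include <cstdint>
#include <vector>

struct EdgeUpdate { uint32_t src, dst; bool isDelete; };

// Per-interval buffer of pending structural updates. Because the graph is
// stored per vertex interval, merging only needs to re-shuffle one interval's
// colIdx/val vectors rather than the whole graph.
class StructuralUpdateBuffer {
 public:
  StructuralUpdateBuffer(size_t numIntervals, size_t threshold)
      : pending_(numIntervals), threshold_(threshold) {}

  void add(size_t interval, EdgeUpdate u) {
    pending_[interval].push_back(u);
    if (pending_[interval].size() >= threshold_) {
      merge_into_csr(interval, pending_[interval]);  // costly, so done rarely
      pending_[interval].clear();
    }
  }

 private:
  void merge_into_csr(size_t interval, const std::vector<EdgeUpdate>& batch) {
    (void)interval; (void)batch;  // placeholder: rebuild this interval's CSR arrays
  }

  std::vector<std::vector<EdgeUpdate>> pending_;
  size_t threshold_;
};

int main() {
  StructuralUpdateBuffer buf(/*numIntervals=*/4, /*threshold=*/1024);
  buf.add(/*interval=*/1, EdgeUpdate{10, 20, false});  // buffered, not yet merged
  return 0;
}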

    3.7. Support for asynchronous computation

In asynchronous vertex-centric programming, if a target vertex is scheduled later in a superstep than the source vertex generating an update, then the update is available to the target vertex in the same superstep. Therefore, we maintain a single multi-log file for each vertex interval. The current superstep's updates are appended to the same log as the previous superstep's. Hence, a vertex interval can load all the updates generated to it in the previous or current superstep; the current-superstep updates come from the vertex intervals scheduled earlier in that superstep. Note that in asynchronous operation the current and previous supersteps' updates need not be isolated from each other.

To facilitate routing updates to target vertices that are scheduled within the same vertex interval but later than the source vertex, we keep two arrays. All the updates to active target vertices in the same vertex interval that are not yet scheduled are kept in one array as linked lists, with each target vertex having a separate linked list. A second array, with an entry for each vertex in that vertex interval, points to the head of that vertex's linked list in the first array.
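The two arrays can be realized as a per-vertex head array plus a pool of linked-list nodes, as in the sketch below; the names are illustrative, not the framework's actual data structures.

#include <cstdint>
#include <cstdio>
#include <vector>

// Same-interval asynchronous routing: updates to not-yet-scheduled vertices
// of the current interval are chained into per-vertex linked lists held in a
// single pool array (array 1); a second array stores each vertex's list head.
struct PendingUpdate {
  float data;
  int32_t next;  // index of the next update for the same vertex, or -1
};

struct SameIntervalRouter {
  std::vector<PendingUpdate> pool;  // array 1: all pending in-interval updates
  std::vector<int32_t> head;        // array 2: head index per local vertex

  explicit SameIntervalRouter(size_t intervalSize) : head(intervalSize, -1) {}

  void push(uint32_t localVertex, float data) {
    pool.push_back({data, head[localVertex]});        // prepend to the vertex's list
    head[localVertex] = static_cast<int32_t>(pool.size()) - 1;
  }

  // Called when localVertex is scheduled later in the same superstep.
  void consume(uint32_t localVertex) {
    for (int32_t i = head[localVertex]; i != -1; i = pool[i].next)
      std::printf("vertex %u gets in-superstep update %f\n", localVertex, pool[i].data);
    head[localVertex] = -1;
  }
};

int main() {
  SameIntervalRouter router(/*intervalSize=*/8);
  router.push(3, 1.0f);
  router.push(3, 2.0f);
  router.consume(3);  // vertex 3 is scheduled later in the same superstep
  return 0;
}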

In a similar fashion, graph structural updates within the same interval are passed to the target vertex using the two arrays. As described in subsection 3.6, these graph structural updates are written to storage as a log at the end of the interval to make them persistent.

    3.8. Programming model

For each vertex, a vertex processing function is provided. The main logic for a vertex is written in this function. In this function, each vertex can access its vertex data, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes vertex values, in-edge/out-edge lists, and in-edge/out-edge labels. Communication between vertices is implemented by sending updates; each vertex can send an update to any vertex. For the synchronous computation model, updates are delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices are delivered to the target vertices, which can be either from the current superstep or the previous superstep. Vertices can modify the graph structure, and these graph modifications are completed by the start of the next superstep. For mutating the graph, we provide basic graph structure modification functions, add/delete edge/vertex, which can modify any part of the graph, not just the local vertex structure. A vertex also indicates in its processing function whether it wants to be deactivated. If a vertex is deactivated, it is re-activated if it receives an update from any other vertex. Algorithm 2 shows the pseudo-code for the community detection program using the most frequent label propagation method.

A programmer can give the framework several hints to further optimize performance: whether the program (a) requires adjacency lists, (b) requires adjacency edge weights, or (c) performs any graph structural changes. Note that a compiler could also infer these hints.

1:function ProcessVertex(VertexData v)
2:     for each update m in v.updates() do
3:         v.edge(m.source_id).set_label(m.data)
4:     new_label ← frequent_label(v.edges_label)
5:     old_label ← v.get_value()
6:     if old_label ≠ new_label then
7:         v.set_value(new_label)
8:         for each edge in v.edges() do
9:              update m; m.source_id = v.id()
10:              m.target_id = edge.id(), m.data = new_label
11:              send_update(m)
12:     deactivate(v.id())
Algorithm 2 Code snippet of community detection program
1:function Accumulate(VertexValue val, update m)
2:     val.change ← val.change + m.data.change
3:     if is_set(m.data.activate) then
4:         activate(m.target_id)
5:function ProcessVertex(VertexData v)
6:     for each edge in v.edges() do
7:         update m; m.target_id = edge.id()
8:         m.data = v.val.change / v.num_edges()
9:         if v.val.change > Threshold then
10:              m.data.activate = 1
11:         send_update(m)
12:     val.page_rank ← val.page_rank + val.change
13:     val.change = 0
14:     v.set_value(val)
15:     deactivate(v.id())
Algorithm 3 Code snippet of Pagerank - an associative and commutative program

    3.9. Analysis of I/O costs

SSDs have a hierarchical organization, and the minimum granularity at which they can be accessed is the NAND page size. So we perform our I/O analysis based on the number of NAND pages accessed. Note that we assume all I/O goes to NAND pages and ignore the DRAM buffer inside the SSD, as it is small and typically used for staging a NAND page before transferring it to the host system.

In each iteration, storage is accessed for logging the updates and for accessing the graph data. As updates are appended as a log, and the log is read sequentially, the amount of storage accessed is proportional to the number of updates generated by the active vertices. Note that the log is initially appended to the main memory buffer, and only fully written pages are evicted to storage to create space in the buffer; partially written pages stay in main memory, so storage is accessed only for fully written pages. During an iteration, graph data is accessed only for the active vertices. Since the active vertices' graph data may be spread across several SSD pages and the minimum granularity for accessing storage is an SSD page, the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active vertices' data. In an iteration, if V_a vertices are active, each active vertex has on average d edges of data, and r is the read-amplification factor due to reading data from the SSD at page granularity, then the amount of storage accessed for graph data in an iteration is r * V_a * d, which is O(E). This is optimal, as the edge data corresponding to the active vertices has to be accessed at least once in each superstep. If each active vertex generates on average u updates, then the number of updates generated is V_a * u, and the amount of data accessed from storage in an iteration for the updates is at most 2 * V_a * u, once for writing and once for reading.
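Written out with symbols introduced only for this exposition (V_a for the active vertex count, d for the average amount of edge data per active vertex, r for the page-granularity read amplification factor, and u for the average number of updates emitted per active vertex), the per-superstep I/O bound is:

\begin{align*}
\text{graph data read per superstep} &\le r \cdot V_a \cdot d = O(E),\\
\text{update data accessed per superstep} &\le 2 \cdot V_a \cdot u \quad \text{(written once, read once).}
\end{align*}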

    4. System design and Implementation

We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB DDR3 DRAM. We use a 2TB Samsung 860 EVO SSD (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.

To simultaneously load pages from several non-contiguous locations in the SSD using minimal host-side resources, we use asynchronous kernel I/O. To match the SSD page size and load data efficiently, we perform all I/O accesses at a granularity of 16KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations. SSD page sizes may keep increasing to accommodate higher capacities and I/O speeds, as SSD vendors pack more bits into a cell to increase density, which leads to larger SSD page sizes (enterprise_storage, ).
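A minimal sketch of issuing one such 16KB page-granular asynchronous read with Linux libaio (link with -laio) is shown below. The file name graph.colidx and the fixed offset are hypothetical, and a real loader would queue many reads at non-contiguous offsets before collecting completions.

#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

constexpr size_t kPageSize = 16 * 1024;  // assumed SSD page size

int main() {
  // O_DIRECT requires the buffer, offset, and length to be page aligned.
  int fd = open("graph.colidx", O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, kPageSize, kPageSize) != 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(/*maxevents=*/64, &ctx) != 0) { std::fprintf(stderr, "io_setup failed\n"); return 1; }

  // Queue one 16KB read at an arbitrary page-aligned offset.
  struct iocb cb;
  struct iocb* cbs[1] = {&cb};
  io_prep_pread(&cb, fd, buf, kPageSize, /*offset=*/3 * kPageSize);
  if (io_submit(ctx, 1, cbs) != 1) { std::fprintf(stderr, "io_submit failed\n"); return 1; }

  // Wait for the completion event; a real loader would overlap this wait
  // with processing of previously loaded pages (double buffering).
  struct io_event events[1];
  io_getevents(ctx, 1, 1, events, nullptr);

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}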

We used OpenMP to parallelize the code for running on multiple cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for the vertex id. Locks are used sparingly, only as necessary to synchronize between threads. With our implementation, our system can achieve 80% of the peak bandwidth between the storage and the host system.

Baseline: We compare our results with the popular out-of-core GraphChi framework. When comparing with GraphChi, we use the same host-side memory cache size as the size of the multi-log buffer used in the PartitionedVC system. In both our implementation and GraphChi's implementation, we limit memory usage to the same budget. In our implementation, we limit memory usage by limiting the total size of the multi-log buffer; GraphChi provides an option to specify the memory budget that it can use. We maximized GraphChi's performance by enabling the multiple auxiliary threads that GraphChi may launch. As such, GraphChi also achieves peak storage access bandwidth.

Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets: one from the popular SNAP dataset collection (snapnets, ), and another a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). These graphs are undirected, and for each edge, each of its end vertices appears in the neighbor list of the other end vertex. Table 1 shows the number of vertices and edges for these graphs.

    Dataset name Number of vertices Number of edges
    com-friendster (CF) 124,836,180 3,612,134,270
    YahooWebScope (YWS) 1,413,511,394 12,869,122,070
    Table 1. Graph dataset

    5. Applications

    To illustrate the benefits of our framework, we evaluate several graph applications, which are:

    BFS: We consider whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end and destination id at several levels along the longest path, at 3 equal intervals, visiting around the same number of vertices in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.

Page rank (PR): (pagerank_implementation, ) Pagerank is a classic graph algorithm. In our implementation, a vertex receives and accumulates delta updates from its neighbors, and it is activated if it receives a delta update greater than a certain threshold value (0.4). As described earlier, PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.

Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned the community label to which most of its neighbors belong. This algorithm uses asynchronous propagation of updates between supersteps so that it does not lead to oscillation of labels in graphs that have a bi-partite or similar structure.

Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm. In this greedy algorithm, in each iteration a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color ids of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color send updates to their neighbors.
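The per-vertex greedy step can be sketched as follows; pick_color is a hypothetical helper that returns the smallest color id not used by any neighbor, where the neighbor colors are read from the vertex's locally stored in-edge labels.

#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// One greedy-coloring step for a single vertex: choose the minimum color id
// that no neighbor currently uses. Only a vertex whose color changes needs to
// send updates to its neighbors in the next superstep.
uint32_t pick_color(const std::vector<uint32_t>& neighborColors) {
  std::set<uint32_t> used(neighborColors.begin(), neighborColors.end());
  uint32_t c = 0;
  while (used.count(c)) ++c;
  return c;
}

int main() {
  std::vector<uint32_t> neighborColors = {0, 1, 3};  // colors stored on in-edge labels
  std::printf("chosen color = %u\n", pick_color(neighborColors));  // prints 2
  return 0;
}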

Maximal independent set (MIS): (salihoglu2014optimizing, ) Maximal independent set algorithms are based on the classical Luby's algorithm. In this algorithm, nodes are randomly selected (with the selection probability prescribed by Luby's algorithm), and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes, or 2) it has the minimum id among its selected neighboring nodes. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until each node is in one of the two lists. In this algorithm, as successive supersteps perform different operations, it is necessary to use synchronous propagation of updates between supersteps for functionally correct execution. Hence, GraphChi and PartitionedVC both use the synchronous update scheme.

K-Core: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, then the node deletes itself and its neighboring edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between supersteps.

Due to the extremely high computational load, for all the applications we ran 15 supersteps, or fewer if the problem converged earlier. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).

    6. Experimental evaluation

    (a) BFS relative to GraphChi
    (b) Pagerank relative to GraphChi
    (c) FLP relative to GraphChi
    Figure 5. Application performance
    (a) GC relative to GraphChi
    (b) MIS relative to GraphChi
    Figure 6. Application performance
    Figure 7. Ratio of page access counts relative to GraphChi
    Figure 8. Storage access times and compute times

Figure 5(a) shows the performance comparison of the BFS application on the PartitionedVC and GraphChi frameworks. The x-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size. Hence, an x-axis value of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present the performance normalized to GraphChi's performance. Therefore, the y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by the application execution time on the PartitionedVC framework.

On average, BFS performs substantially better on PartitionedVC when compared to GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access pattern: initially, as the search starts from the source node, it keeps widening. Consequently, the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later supersteps. Figure 7 validates this assertion. Figure 7 shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals. However, as the traversal need increases, GraphChi loads only 5X more pages. As such, the performance improvements seen with PartitionedVC on BFS are much higher when only a small fraction of the graph needs to be traversed. Figure 8 shows the distribution of the total execution time, split between storage access time (the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when a smaller fraction of the graph must be traversed, the storage access time is about 75%; however, as the traversal demands increase, the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.

Figure 5(b) shows the performance comparison of the pagerank application on the PartitionedVC framework with the GraphChi framework. The x-axis shows the two graph datasets that we used, and the y-axis is the performance normalized to GraphChi. On average, pagerank performs better with PartitionedVC. Unlike BFS, pagerank has the opposite traversal pattern. In the early supersteps, many of the vertices are active and many updates are generated. But during later supersteps, the number of active vertices reduces and PartitionedVC performs better when compared to GraphChi. Figure 9(a) shows the performance of PartitionedVC compared to GraphChi over several supersteps. Here the x-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps, PartitionedVC has similar performance to GraphChi, or, in the case of the YWS dataset, worse. The reason is that the size of the log generated is large. But as the supersteps progress and the update size decreases, the performance of PartitionedVC gets better.

    (a) Pagerank performance over several supersteps
    (b) FLP performance over several supersteps
    (c) GC performance over several supersteps
    Figure 9. Performance comparisons over supersteps
    Figure 10. MIS performance over several supersteps

From Figures 9(a), 9(b), 9(c), and 10, we can observe that as we keep increasing the number of supersteps, the performance benefits compared to GraphChi increase.

PartitionedVC uses the CSR format, which is suitable for accessing a small number of vertices' data but is costly to merge into, as one has to re-shuffle the entire column vectors. Using multiple intervals helps reduce this merge cost. We also compared the performance of K-core on PartitionedVC and GraphChi. In K-core, as delete operations are used, GraphChi can directly update the delete bit in its outgoing edge's shard, whereas in PartitionedVC we log the structural update and later merge it into the graph. As graph updates are passed in an asynchronous fashion, K-core takes only one iteration and all the vertices are active, so GraphChi performs better than PartitionedVC for the K-core application. However, for other structural update operations, such as adding an edge or a vertex, we expect PartitionedVC to perform similarly to GraphChi, as both frameworks handle structural updates in a similar fashion: buffer the updates and merge after a threshold.

    7. Related work

Graph analytics systems are widely deployed in important domains such as drug discovery and social networking; concomitantly, there has been a considerable amount of research on graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements an external memory graph analytics system (GraFBoost, ). It logs updates to storage and passes the vertex updates in a superstep. It uses a sort-reduce technique for efficiently combining the updates and applying them to vertex values. In a superstep, it accesses the graph pages in storage corresponding to the active vertex list only once. However, it may access storage multiple times for the updates, as it sort-reduces a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages corresponding to the active vertex list once. Also, our PartitionedVC system supports the complete vertex-centric programming model rather than just associative and commutative combine functions, which gives better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends GraFBoost for vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. This work improves performance by avoiding the sorting phase of the log. To do so, it partitions the destination vertices such that they fit in main memory and replicates the source vertices across the partitions, so that when processing a partition, all the source vertices' graph data can be streamed in and updates to destination vertices can be performed entirely in main memory. However, with this scheme, one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support the complete vertex-centric programming model, computation may become prohibitively expensive: as the number of partitions grows, the replication cost grows as well. Linearly extrapolating based on the data presented in their paper, the replication overhead for 1000 partitions is prohibitively large.

X-stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric external memory systems that aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs that require streaming in all the edge data and performing vertex value updates based on it. However, they are inefficient for programs that require sparse accesses to graph data, such as BFS, or programs that require access to the adjacency lists of specific vertices, such as random walk.

GraphChi (graphchi, ) is the only external memory vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, namely shards. When processing shards, updates by a vertex are passed using shared-memory communication and all the updates are applied in memory, accessing storage efficiently in a coarse-grained manner. However, when processing with GraphChi, one has to read most of the graph in each iteration, which is not suitable for graph algorithms that access only a part of the graph, such as the widely used BFS, or for algorithms where not all vertices are active during an iteration. In this work, we compare against GraphChi as a baseline and show considerable performance improvements.

Several works extend GraphChi by trying to use all of the loaded shard or by minimizing the data to load in a shard (ai2017squeezing, ; vora2016load, ). In this work, however, we avoid loading data in bulky shards in the first place and access only the graph pages of the active vertices in a superstep.

Semi-external memory systems such as FlashGraph (flashgraph, ) store the vertex data in main memory and achieve high performance. When processing on a low-cost system whose available main memory is smaller than the vertex value data, these systems suffer performance degradation due to fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, graph frameworks have been developed for a wide variety of system settings. In the distributed computing setting, there are popular vertex-centric graph analytics systems including Pregel (malewicz2010pregel, ), Graphlab (low2014graphlab, ), and PowerGraph (gonzalez2012powergraph, ). In the single-node in-memory setting, Ligra (shun2013ligra, ) provides a graph processing framework optimized for multi-core processing.

    8. Conclusion

Graph analytics is at the heart of a broad set of applications. In external-memory based graph processing systems, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices of the graph. In this paper, we use a CSR-based graph format that is more amenable to selectively loading only the active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to 16.40×, 1.13×, 1.64×, 1.38×, and 2.76× for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

    References

    • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
• (2) “Apache Flink: Iterative Graph Processing,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
• (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
    • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
• (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
• (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
    • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
    • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
    • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
    • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
    • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
    • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
    • (13) “Pagerank application,,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
    • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
    • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
    • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
    • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
    • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
    • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
    • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
    • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
    • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
    • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
    • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.

2. Background and Motivation

Figure 1. Modern SSD platform architecture

2.1. Modern SSD platforms

Since much of our work uses SSD-specific optimizations, we provide a very brief overview of SSD architecture. Figure 1 illustrates the architecture of a modern SSD platform. An SSD equips multiple flash memory channels to support high data bandwidth. Multiple dies are integrated into a single NAND flash package by employing die-stacking structure to integrate more storage space on the limited platform board. Data parallelism can be achieved per die with multi-plane or multi-way composition. Each plane or way is divided into multiple blocks which have dozens of physical pages. A page is a basic physical storage unit that can be read or written by one flash command. SSD uses firmware to manage all the idiosyncrasies while accessing a page. To execute the SSD firmware, major components of SSD include an embedded processor, DRAM, on-chip interconnection network, and a flash controller.

2.2. Graph Computational model

In this work, we support the commonly used vertex-centric programming model for graph analytics. The input to a graph computation is a directed graph G = (V, E). Each vertex in the graph has an id between 1 and |V| and a modifiable user-defined value associated with it. For a directed edge e = (u, v), we refer to e as the out-edge of u and the in-edge of v. Also for e = (u, v), we refer to u as the source vertex and v as the target vertex, and e may be associated with a modifiable, user-defined value.

A typical vertex-centric computation consists of input, where the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishes with output. Within each superstep vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its neighboring edges, or generate updates to another vertex, or even mutate the topology of the graph. Edges are not first-class citizens in this model, having no associated computation.

Depending on when the updates are visible to the target vertices, the computational model can be either synchronous or asynchronous. In the synchronous computational model, the updates generated to a target vertex are available to it in the next superstep. Graph systems such as Pregel (malewicz2010pregel, ), and Apache Giraph (apache_giraph, ) use this approach. In an asynchronous computational model, an update to a target vertex is visible to that vertex in the current superstep. So if the target vertex is scheduled after the source vertex in a superstep, then the current superstep’s update is available to the target vertex. GraphChi (graphchi, ) and Graphlab (low2014graphlab, ) use this approach. An asynchronous computational model is shown to be useful for accelerating the convergence of many numerical algorithms (graphchi, ). PartitionedVC supports asynchronous updates.

Graph algorithms can also be broadly classified into two categories, based on how the updates are handled. A certain class of graph algorithms exhibits the associative and commutative property: updates to a target vertex can be combined into a single value, and they can be combined in any order before processing the vertex. Algorithms such as pagerank, BFS, and single-source shortest path fall in this category. Many other graph algorithms require the update order to be preserved, and each update is individually applied. Algorithms such as community detection (FLP_implementation, ), graph coloring (gonzalez2012powergraph, ), and maximal independent set (malewicz2010pregel, ) fall in this category. PartitionedVC supports both these types of graph algorithms.

2.3. Out-of-core graph processing

In the out-of-core graph processing context, graph sizes are considered to be large when compared to the main memory size but can fit in the storage size of current SSDs (in terabytes). As described earlier, GraphChi (graphchi, ) is an out-of-core vertex-centric programming system. GraphChi partitions the graph into several vertex intervals and stores all the incoming edges to a vertex interval as a shard. Figure 2(b) shows the shard structure for the illustrative graph shown in Figure 2(a). For instance, shard1 stores all the incoming edges of the first vertex interval, shard2 stores the second interval's incoming edges, and shard3 stores the incoming edges of all the vertices in the third interval. While incoming edges are closely packed in a shard, the outgoing edges of a vertex are dispersed across other shards; in this example, the outgoing edges of a single vertex may be dispersed across shard1, shard2, and shard3. Another unique property of the shard organization is that each shard stores all its in-edges sorted by source vertex.

GraphChi relies on this shard organization to process vertices in intervals. It first loads into memory a shard corresponding to one vertex interval, as well as all the outgoing edges of those vertices that may be stored across multiple shards. Updates generated during processing are asynchronously and directly passed to the target vertices through the out-going edges in other shards in the memory. Once the processing for a vertex interval in a superstep is finished, its corresponding shard and its out-going edges in other shards are written back to the disk.

Using the above approach GraphChi primarily relies on sequential accesses to disk data and minimizes random accesses. However, in the following superstep, only a subset of vertices may become active (if they received any messages on their in-edges). The in-edges to a vertex are stored in a shard, they can come from any source vertex, and within a shard they are sorted by source vertex id. Hence, even if a single vertex is active within a vertex interval, the entire shard must be loaded, since the in-edges for that vertex may be dispersed throughout the shard. For instance, if any vertex in the third interval is active, the entire shard3 must be loaded. Loading a shard may be avoided only if none of the vertices in the associated vertex interval are active. However, in real-world graphs, the vertex intervals typically span tens of thousands of vertices, and during each superstep the probability of at least one vertex being active in a given interval is very high. As a result, GraphChi in practice ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.

2.4. Active graph

To quantify the amount of superfluous loading that must be performed, we counted the active vertex and active edge counts in each superstep while running the graph coloring application described in Section 5 over the datasets shown in Table 1. For this application, we ran a maximum of 15 supersteps. Figure 3 shows the active vertex and active edge counts over these supersteps. The x-axis indicates the superstep number, the primary y-axis shows the number of active vertices divided by the total number of vertices, and the secondary y-axis shows the number of active edges (updates sent over an edge) divided by the total number of edges in the graph. The fraction of active vertices and active edges shrinks dramatically as supersteps progress. However, at the granularity of a shard, even a few active vertices lead to loading many shards, since the active vertices are spread across the shards.

3. CSR format in the era of SSDs

(a) CSR format representation for the example graph
(b) GraphChi shard structure for the example graph
Figure 2. Graph storage formats

Given the large loading bandwidth demands of GraphChi, we evaluated a compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information more efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors. The value vector, val, stores all the non-zero values from each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector. The CSR representation for the example graph is shown in Figure 2(a). The edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the outgoing neighbors of a vertex in the CSR format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of that vertex are stored contiguously. Because all the outgoing edges of a vertex are stored in contiguous locations, the CSR format minimizes the number of SSD pages accessed, and hence the read amplification, when fetching the adjacency information of only the active vertices.
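As a concrete illustration, the following C++ sketch shows how out-neighbors are located in a CSR structure. The type names (CSRGraph, forEachOutEdge) and the tiny example graph are ours for illustration only, not part of PartitionedVC's code.

#include <cstdint>
#include <iostream>
#include <vector>

struct CSRGraph {
    std::vector<uint64_t> rowPtr;  // |V|+1 entries; 8-byte offsets as in Section 4
    std::vector<uint32_t> colIdx;  // out-neighbor ids; 4-byte vertex ids
    std::vector<float>    val;     // per-edge weights, parallel to colIdx
};

// The out-neighbors of v occupy colIdx[rowPtr[v] .. rowPtr[v+1]); one rowPtr
// lookup yields a contiguous range, so only the SSD pages overlapping that
// range need to be read for an active vertex.
template <typename F>
void forEachOutEdge(const CSRGraph& g, uint32_t v, F&& visit) {
    for (uint64_t i = g.rowPtr[v]; i < g.rowPtr[v + 1]; ++i)
        visit(g.colIdx[i], g.val[i]);
}

int main() {
    CSRGraph g;                       // tiny example: 0->1, 0->2, 1->2, 2->0
    g.rowPtr = {0, 2, 3, 4};
    g.colIdx = {1, 2, 2, 0};
    g.val    = {1.f, 1.f, 1.f, 1.f};
    forEachOutEdge(g, 0, [](uint32_t nbr, float w) {
        std::cout << "0 -> " << nbr << " (weight " << w << ")\n";
    });
    return 0;
}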

Figure 3. Active vertices and edges over supersteps

3.1. Challenges for graph processing with a CSR format

While the CSR format looks appealing, it suffers from one fundamental challenge. One can maintain either the in-edge list or the out-edge list in the colIdx vector, but not both (keeping the same information in two different vectors creates coherency issues). Consider the case where the adjacency list stores only in-edges: during a superstep, the updates sent along out-edges then generate many random accesses to the adjacency lists to extract the out-edge information.

3.2. Two Key Observations: Using a log and splitting the log

To avoid the random access problem with the CSR format, we make the first key observation. Namely, updates to the out-edges do not need to be propagated using the adjacency list directly. Instead these updates can be simply logged. Thus, we propose to log the updates sent between the vertices, instead of directly updating at the target vertex location. In a superstep, we log all the vertex updates, and group the target vertex messages in the next superstep and pass them to that target vertex.

One can maintain a single log for all the updates that can be parsed in the next superstep. However, as multiple messages sent to a vertex may be spread all over the log, one may need to do external sorting over a large number of updates. The second key observation is that we maintain a separate log for a collection of vertices. As such we create a coarse-grain log for an interval of vertices that stores all the updates bound to those vertices.

We partition the graph into several vertex intervals and use a log for each interval. We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory, and used for processing by the vertices in that interval.

3.3. Multi-log Architecture

1:for all the vertex intervals in the superstep do
2:     Load vertex interval’s update log into the buffer
3:     Group updates based on target vertex id
4:     repeat/* repeat ends in line 8*/
5:         For the active vertices load the required vertex data (vertex values, in-edge and out-edge lists, in-edge and out-edge weights) into the buffer
6:         for each of the active vertices do
7:              ProcessVertex(VertexData)          
8:     until Entire update buffer is processed
Algorithm 1 Overview of a superstep in PartitionedVC
Figure 4. Internal components of PartitionedVC framework

Given the multi-log architecture design, we propose the vertex-centric programming model that can be implemented as follows. In a superstep, we loop over each of the vertex intervals for processing. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight in the algorithm is that we load only the active vertex data in step 5, rather than the entire graph. There is, however, a small penalty we pay for logging the updates, which requires us to sort each vertex interval's log based on the target vertex id of the update. As long as each vertex interval log fits in main memory, we can do in-memory sorting. Furthermore, as we described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, thereby shrinking the cost of managing and sorting the log.
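The following C++ sketch mirrors Algorithm 1 for logs that fit in memory; storage I/O and the active-vertex graph loads are elided, and the Update type and processVertex callback are illustrative assumptions rather than the framework's actual interfaces.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct Update { uint32_t target; float data; };

using ProcessFn =
    std::function<void(uint32_t vertex, const std::vector<Update>& updates)>;

// One superstep: per interval, load its log, group updates by target vertex,
// and invoke the vertex program only for vertices that received updates.
void runSuperstep(std::vector<std::vector<Update>>& intervalLogs,
                  const ProcessFn& processVertex) {
    for (auto& log : intervalLogs) {
        std::sort(log.begin(), log.end(),                  // line 3: group by target id
                  [](const Update& a, const Update& b) { return a.target < b.target; });
        size_t i = 0;
        while (i < log.size()) {                           // lines 4-8 of Algorithm 1
            uint32_t v = log[i].target;
            std::vector<Update> mine;
            while (i < log.size() && log[i].target == v) mine.push_back(log[i++]);
            processVertex(v, mine);                        // line 7: active vertices only
        }
        log.clear();                                       // log consumed this superstep
    }
}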

Figure 4 shows the software components used to realize the framework, which are described in detail in the following paragraphs. First, we describe the framework for the synchronous computation model (described in subsection 2.2) and later extend it to the asynchronous computation model.

Multi-Log Unit: Handling Updates - This component handles storing and retrieving the updates generated by vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass an update to a target vertex. The update function calls PartitionedVC's runtime system transparently to the programmer, and the runtime system invokes the multi-log unit to log the update. A log is maintained for each vertex interval, and an update generated to a target vertex in the current superstep is appended to the target vertex interval's log. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.

To efficiently implement logging, PartitionedVC first caches the log writes in main memory buffers, called the multi-log memory buffers. Buffering helps to reduce fine-grained writes, which in turn reduces write amplification to the SSD storage. Note that flash memory in SSDs can only be written at page granularity. As such, PartitionedVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few (<5) thousand. Hence, at least several thousand pages may be allocated in the multi-log memory buffer at one time.

For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multi-log unit, the top page of the vertex interval that the update is bound for is identified first. As updates are simply appended to the log, an update that fits in the available space in the top page is written into it. If there is not enough space in the top page, then a new page is allocated by the multi-log unit, and that new page becomes the top page for that vertex interval's log. We maintain a simple mapping table indexed by the vertex interval to identify the top page.
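A minimal sketch of this append path is shown below, assuming 16KB pages and a fixed number of vertices per interval; the MultiLog name, the byte-level update encoding, and the eviction of sealed pages to SSD (not shown) are illustrative assumptions.

#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;          // SSD page granularity (Section 4)

struct LogPage {
    size_t  used = 0;
    uint8_t bytes[kPageSize];
};

struct MultiLog {
    uint32_t verticesPerInterval;
    std::vector<std::vector<LogPage>> log;       // log[interval]; newest (top) page is last

    MultiLog(uint32_t numIntervals, uint32_t ivlSize)
        : verticesPerInterval(ivlSize), log(numIntervals) {}

    // Append one serialized update to the log of the interval owning 'target'.
    void append(uint32_t target, const void* update, size_t len) {
        auto& pages = log[target / verticesPerInterval];
        if (pages.empty() || pages.back().used + len > kPageSize)
            pages.emplace_back();                // top page full: allocate a fresh top page
        LogPage& top = pages.back();
        std::memcpy(top.bytes + top.used, update, len);
        top.used += len;
    }
};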

When the available free space in the multi-log buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode of graph updates, there is one log file per vertex interval stored on the SSD, and when a log page is evicted from memory it is appended to the corresponding vertex interval's log file. While the multi-log architecture may theoretically store many log pages on the SSD, in practice we noticed that evictions to the SSD are needed only when the multi-log buffer overflows the size of main memory. But as we mentioned earlier, the total size of the active graph changes in each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log size is proportional to the number of updates, the log file size also shrinks when the active graph size is small. Hence, the log for each vertex interval is mostly cached in memory as the supersteps progress.

VC Unit – Handling Data Retrievals - The updates bound to each vertex interval are logged, either in memory or on the SSD when there is an overflow, as described above. When the next superstep begins, the updates received by each vertex in the previous superstep must be processed by that vertex. Note that the updates bound to a given vertex are dispersed throughout the log associated with that vertex's interval. Hence, it is necessary to group all the updates in that log before initiating a vertex processing function. The VC unit is responsible for this task: at the start of each vertex interval's processing, it reads the corresponding log and groups all the messages bound for each vertex in that interval.

As described in the background section, some graph algorithms support the associative and commutative property on updates, so the updates can be merged in any order. For such programs, along with the vertex processing function, we provide an accumulation function. The programmer uses the accumulation function to specify the combine operation for the updates. The VC unit applies this function to all the incoming updates in a superstep: the accumulation function is applied to all the updates to a target vertex before the target vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the pagerank application. Hence, the VC unit can automatically optimize performance whenever an accumulation function is defined for a graph algorithm. For non-associative and non-commutative programs, the updates are grouped based on the target vertex id, and the update function is called individually for each update.
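The sketch below shows this associative-and-commutative path in C++, folding a pagerank-like accumulate() over an interval's log before any vertex program runs; the types and field names are illustrative, not PartitionedVC's actual API.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Update    { uint32_t target; float data; };
struct VertexVal { float page_rank = 0.f; float change = 0.f; };

// User-supplied combine step; associativity and commutativity mean the
// application order over a vertex's updates does not matter.
void accumulate(VertexVal& val, const Update& u) { val.change += u.data; }

void applyIntervalLog(std::unordered_map<uint32_t, VertexVal>& values,
                      const std::vector<Update>& intervalLog) {
    for (const Update& u : intervalLog)
        accumulate(values[u.target], u);  // one pass; per-update callbacks are avoided
    // ...the vertex processing function then runs once per active vertex.
}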

Graph Loader Unit: As is the case with any graph algorithm, the programmer decides when a vertex becomes active or inactive. This is typically specified as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active vertex bit for each vertex in main memory and updates that bit during each superstep. This active vertex bitmask is used to decide which vertices to process in the next superstep.

As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data for the vertices present in the active vertex list (line 5 in Algorithm 1). The graph loader unit maintains a row buffer for loading the row pointers and a buffer for each kind of vertex data (adjacency edge lists/weights). It loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many vertices as fit in the row pointer buffer. For the vertices that are active in the row pointer buffer, the vertex data - in-edge/out-edge neighbors and in-edge/out-edge weights - is fetched from the colIdx or val vector stored on the SSD, accessing only the SSD pages that contain active vertex data. The VC unit indicates which vertex data to load, such as in-edge or out-edge adjacency lists or edge weights, as not all the vertex data may be required by the application program. The graph loader unit uses double buffering so that it can overlap loading vertex data from storage with the processing of the previously loaded vertex data by the VC unit.
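The page-selection step can be sketched as follows: given rowPtr and the active-vertex bitmask, only the 16KB colIdx pages that overlap an active vertex's neighbor range need to be read. The function name and page-size constant are illustrative assumptions.

#include <cstdint>
#include <set>
#include <vector>

constexpr uint64_t kPageBytes  = 16 * 1024;
constexpr uint64_t kIdsPerPage = kPageBytes / sizeof(uint32_t);  // 4-byte vertex ids

std::set<uint64_t> colIdxPagesToRead(const std::vector<uint64_t>& rowPtr,
                                     const std::vector<bool>& active) {
    std::set<uint64_t> pages;
    for (uint32_t v = 0; v + 1 < rowPtr.size(); ++v) {
        if (!active[v] || rowPtr[v] == rowPtr[v + 1]) continue;
        uint64_t first = rowPtr[v] / kIdsPerPage;             // page of first neighbor
        uint64_t last  = (rowPtr[v + 1] - 1) / kIdsPerPage;   // page of last neighbor
        for (uint64_t p = first; p <= last; ++p) pages.insert(p);
    }
    return pages;  // inactive vertices contribute no page reads
}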

3.4. Design Choices: Graph Vertex Interval

The first design choice in the PartitionedVC implementation is the size of each vertex interval. If each vertex interval has only a few vertices, there will be many vertex intervals. Recall that during update propagation any vertex interval may update a target in any other vertex interval. Hence, having more intervals increases the overhead of per-interval processing and also requires more time to route updates to a target vertex interval. On the other hand, having too many vertices in a single vertex interval can lead to memory overflow: while processing a vertex interval, the updates to that interval should all fit in main memory. Typically, in vertex-centric programming updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most its number of incoming edges. Since fitting the updates of each vertex interval in main memory is a critical requirement, we conservatively assume that there may be an update on each incoming edge of a vertex when determining the vertex interval size. We statically partition the vertices into contiguous segments such that the sum of the number of incoming edges of the vertices in a segment is less than the main memory budget. This budget could be set by an administrator or the application programmer, or it could simply be limited by the size of the virtual machine allocated for graph processing.
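A sketch of this static split is shown below: intervals are contiguous vertex ranges grown until the worst-case update volume (one update per in-edge) would exceed the memory budget. The per-update size and the function signature are illustrative assumptions.

#include <cstdint>
#include <vector>

// Returns the first vertex id of each interval.
std::vector<uint32_t> partitionIntervals(const std::vector<uint64_t>& inDegree,
                                         uint64_t memoryBudgetBytes,
                                         uint64_t bytesPerUpdate) {
    std::vector<uint32_t> intervalStart = {0};
    uint64_t bytesInInterval = 0;
    for (uint32_t v = 0; v < inDegree.size(); ++v) {
        uint64_t worstCase = inDegree[v] * bytesPerUpdate;   // one update per in-edge
        if (bytesInInterval + worstCase > memoryBudgetBytes && bytesInInterval > 0) {
            intervalStart.push_back(v);                      // start a new interval at v
            bytesInInterval = 0;
        }
        bytesInInterval += worstCase;
    }
    return intervalStart;
}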

Due to our conservative assumption that there may be a message on each incoming edge, the size of a vertex interval may be small. But at runtime, the number of updates received by a vertex can be smaller than its number of incoming edges. For each vertex interval, we keep a counter that tracks the number of updates sent to that interval in the current superstep. At the beginning of the next superstep, the PartitionedVC runtime may dynamically fuse contiguous vertex intervals into a single large interval to process at once, as sketched below. Such dynamic fusing enables efficient use of memory during each superstep.
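A sketch of the fusing decision, driven by the per-interval counters from the previous superstep, is shown below; the pair-of-interval-indices output format and the budget expressed in update counts are illustrative assumptions.

#include <cstdint>
#include <utility>
#include <vector>

// updatesLogged[i] = updates actually sent to interval i in the previous superstep.
// Returns ranges [first, last] of contiguous intervals to process together.
std::vector<std::pair<uint32_t, uint32_t>>
fuseIntervals(const std::vector<uint64_t>& updatesLogged, uint64_t budgetInUpdates) {
    std::vector<std::pair<uint32_t, uint32_t>> fused;
    if (updatesLogged.empty()) return fused;
    uint32_t first = 0;
    uint64_t pending = 0;
    for (uint32_t i = 0; i < updatesLogged.size(); ++i) {
        if (pending + updatesLogged[i] > budgetInUpdates && pending > 0) {
            fused.push_back({first, i - 1});   // close the current fused group
            first = i;
            pending = 0;
        }
        pending += updatesLogged[i];
    }
    fused.push_back({first, static_cast<uint32_t>(updatesLogged.size()) - 1});
    return fused;
}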

3.5. Design Choices: Graph Data Sizes

Currently our system supports an arbitrary structure of data associated with each vertex, but for efficiency reasons we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data, our system can easily handle that algorithm. If the structure of the data does morph, and in particular if its size grows arbitrarily, it is less efficient to dynamically alter the CSR organization on the SSD; PartitionedVC still works correctly, but the graph loading process may be slower. Hence, we recommend that the programmer conservatively assign a large enough structure statically to accommodate future growth of the data associated with a vertex.

3.6. Graph structural updates

In vertex-centric programming, graph structure can be updated during the supersteps. Graph structure updates in a superstep can be applied at the end of the superstep.

In the CSR format, merging the graph structural updates into the column index or value vectors is a costly operation, as one needs to re-shuffle the entire column vectors. To minimize the costly merging operation, we partition the CSR format graph based on the vertex intervals, so that each vertex interval’s graph data is stored separately in the CSR format.

Instead of merging each update directly into the vertex interval's graph data, we batch several structural updates for a vertex interval and merge them into the graph data only after a certain threshold number of structural updates accumulates, as sketched below. As graph structural updates generated during vertex processing can target any vertex, we buffer each vertex interval's structural updates in memory. The multi-log and VC units always consult these buffered updates so that they fetch the most current graph data for processing.
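The sketch below illustrates the batching policy; which interval a structural update is buffered with (here, the source vertex's interval), the EdgeOp encoding, and the merge threshold are illustrative assumptions, and the CSR rewrite itself is elided.

#include <cstdint>
#include <vector>

struct EdgeOp { bool isDelete; uint32_t src; uint32_t dst; };

struct StructuralUpdateBuffer {
    uint32_t verticesPerInterval;
    size_t   mergeThreshold;
    std::vector<std::vector<EdgeOp>> pending;   // one batch per vertex interval

    StructuralUpdateBuffer(uint32_t numIntervals, uint32_t ivlSize, size_t threshold)
        : verticesPerInterval(ivlSize), mergeThreshold(threshold), pending(numIntervals) {}

    void add(const EdgeOp& op) {
        auto& batch = pending[op.src / verticesPerInterval];
        batch.push_back(op);
        if (batch.size() >= mergeThreshold) {
            // mergeIntoIntervalCSR(op.src / verticesPerInterval, batch);
            // only this interval's colIdx/val vectors would be rewritten
            batch.clear();
        }
    }
};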

3.7. Support for asynchronous computation

In asynchronous vertex-centric programming, if a target vertex is scheduled later within a superstep than the source vertex generating an update, then the update is available to the target vertex in the same superstep. Therefore, we maintain a single multi-log file for each vertex interval: the current superstep's updates are appended to the same log as the previous superstep's. Hence, a vertex interval can load all the updates generated for it in either the previous or the current superstep, where the current superstep's updates come from vertex intervals scheduled earlier in that superstep. Note that, unlike in the synchronous mode, the current and previous supersteps' updates are therefore not kept in isolated logs.

To facilitate routing updates to target vertices that are in the same vertex interval but scheduled after the source vertex, we keep two arrays. All the updates to active target vertices that are in the same vertex interval but not yet scheduled are kept in one array as linked lists, with each target vertex having a separate linked list. Another array, with an entry for each vertex in the interval, points to the head of that vertex's linked list in the first array. A sketch of this structure follows.
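Assuming 4-byte local vertex indices and a simple float payload, the two arrays can look like the following; the names and the prepend order are illustrative assumptions.

#include <cstdint>
#include <vector>

struct UpdateSlot { float data; int32_t next; };   // next == -1 terminates a vertex's list

struct IntraIntervalUpdates {
    std::vector<int32_t>    head;   // array 2: per-vertex pointer into 'slots'
    std::vector<UpdateSlot> slots;  // array 1: updates chained per target vertex

    explicit IntraIntervalUpdates(uint32_t verticesInInterval)
        : head(verticesInInterval, -1) {}

    // Called when a source vertex sends to a later-scheduled vertex in the same interval.
    void push(uint32_t localTarget, float data) {
        slots.push_back({data, head[localTarget]});
        head[localTarget] = static_cast<int32_t>(slots.size()) - 1;
    }

    // Called when the target vertex is finally scheduled in this superstep.
    template <typename F>
    void consume(uint32_t localTarget, F&& visit) {
        for (int32_t i = head[localTarget]; i != -1; i = slots[i].next) visit(slots[i].data);
        head[localTarget] = -1;
    }
};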

In a similar fashion, graph structural updates within the same interval are passed to the target vertex using the two arrays. As described in subsection 3.6, these graph structural updates are written to storage as a log at the end of the interval's processing to make them persistent.

3.8. Programming model

For each vertex, a vertex processing function is provided; the main logic for a vertex is written in this function. In this function, each vertex can access its vertex data, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes vertex values, in-edge/out-edge lists, and in-edge/out-edge labels. Communication between vertices is implemented by sending updates, and each vertex can send an update to any vertex. In the synchronous computation model, updates are delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices are delivered to the target vertices, which can be from either the current or the previous superstep. Vertices can modify the graph structure, and these modifications are completed by the start of the next superstep. For mutating the graph, we provide basic graph structure modification functions - add/delete edge/vertex - which can modify any part of the graph, not just the local vertex structure. A vertex also indicates in its vertex processing function whether it wants to be deactivated; a deactivated vertex is re-activated if it receives an update from any other vertex. Algorithm 2 shows the pseudo-code for the community detection program using the most frequent label propagation method.

A programmer can provide several hints to the framework to further optimize performance, indicating whether the program a) requires adjacency lists, b) requires adjacency edge weights, or c) performs any graph structural changes. Note that a compiler could also infer these hints.

1:function ProcessVertex(VertexData v)
2:     for each update m in v.updates() do
3:         v.edge(m.source_id).set_label(m.data)      
4:     new_label ← frequent_label(v.edges_label)
5:     old_label ← v.get_value()
6:     if old_label ≠ new_label then
7:         v.set_value(new_label)
8:         for each edge in v.edges() do
9:              update m; m.source_id = v.id()
10:              m.target_id = edge.id(), m.data = new_label
11:              send_update(m)               
12:     deactivate(v.id())
Algorithm 2 Code snippet of community detection program
1:function Accumulate(VertexValue val, update m)
2:     val.change ← val.change + m.data.change
3:     if is_set(m.data.activate) then
4:         activate(m.target_id)      
5:function ProcessVertex(VertexData v)
6:     for each edge in v.edges() do
7:         update m; m.target_id = edge.id()
8:         m.data = v.val.change / v.num_edges()
9:         if v.val.change > Threshold then
10:              m.data.activate = 1          
11:         send_update(m)      
12:     value val = v.get_value(); val.page_rank ← val.page_rank + val.change
13:     val.change = 0
14:     v.set_value(val)
15:     deactivate(v.id())
Algorithm 3 Code snippet of Pagerank - an associative and commutative program

3.9. Analysis of I/O costs

SSDs have a hierarchical organization, and the minimum granularity at which one can access them is the NAND page size, so we perform our I/O analysis based on the number of NAND pages accessed. Note that here we assume all I/O accesses go to NAND pages and ignore the buffer present in the SSD DRAM, as it is small and typically used only for staging a NAND page before transferring it to the host system.

In each superstep, storage is accessed for logging the updates and for accessing the graph data. As updates are appended to a log, and the log is read sequentially, the amount of storage accessed is proportional to the number of updates generated by the active vertices. Note that the log is first appended to the main memory buffer, and to create space in the buffer we evict only fully written pages to storage; partially written pages remain in main memory, so storage is accessed only for fully written pages. During a superstep, graph data is accessed only for the active vertices. The graph data of the active vertices may be spread across several SSD pages, and the minimum granularity for accessing storage is an SSD page, so the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active vertex data. If $V_a$ vertices are active in a superstep, each active vertex has on average $e$ edges of data, and the read amplification factor due to reading data from the SSD at page granularity is $r$, then the amount of storage accessed for graph data in a superstep is $O(V_a \cdot e \cdot r)$, which is $O(E)$. This is optimal, as the edge data corresponding to the active vertices has to be accessed at least once in each superstep. If each active vertex generates $u$ updates on average, then the number of updates generated is $V_a \cdot u$, and the amount of data accessed from storage for updates in a superstep is at most $2 \cdot V_a \cdot u$ update records, once each for writing and reading.

4. System design and Implementation

We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use a 2 TB Samsung 860 EVO SSD (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.

To simultaneously load pages from several non-contiguous locations on the SSD while using minimal host-side resources, we use asynchronous kernel I/O, as sketched below. To match the SSD page size and load data efficiently, we perform all I/O accesses at a granularity of 16KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations: SSD page sizes may keep increasing to accommodate higher capacities and I/O speeds, as SSD vendors pack more bits per cell to increase density, which leads to larger SSD pages (enterprise_storage, ).
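A minimal sketch of issuing a batch of non-contiguous 16KB page reads with Linux libaio is shown below (link with -laio); the file name, page list, and queue depth are illustrative assumptions, and the real framework additionally overlaps these reads with processing via double buffering.

#include <fcntl.h>
#include <libaio.h>
#include <cstdio>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;

int main() {
    std::vector<long long> pageIds = {3, 17, 42};        // pages holding active-vertex data
    int fd = open("graph.colIdx", O_RDONLY);             // illustrative file name
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(64, &ctx) != 0) { perror("io_setup"); return 1; }

    std::vector<iocb>  cbs(pageIds.size());
    std::vector<iocb*> cbps(pageIds.size());
    std::vector<std::vector<char>> bufs(pageIds.size(), std::vector<char>(kPageSize));
    for (size_t i = 0; i < pageIds.size(); ++i) {
        io_prep_pread(&cbs[i], fd, bufs[i].data(), kPageSize, pageIds[i] * kPageSize);
        cbps[i] = &cbs[i];
    }
    io_submit(ctx, cbps.size(), cbps.data());            // issue all page reads at once

    std::vector<io_event> events(pageIds.size());
    io_getevents(ctx, pageIds.size(), pageIds.size(), events.data(), nullptr);
    std::printf("read %zu pages\n", pageIds.size());

    io_destroy(ctx);
    return 0;
}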

We used OpenMP to parallelize the code across multiple cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for vertex ids. Locks were used sparingly, as necessary, to synchronize between threads. With our implementation, our system can achieve 80% of the peak bandwidth between the storage and the host system.

Baseline: We compare our results with the popular out-of-core GraphChi framework. When comparing with GraphChi, we use the same host-side memory cache size as the size of the multi-log buffer used in the PartitionedVC system, and we limit both implementations to the same memory budget. In our implementation, we limit memory usage by limiting the total size of the multi-log buffer; GraphChi provides an option to specify the amount of memory budget it can use. We maximized GraphChi's performance by enabling the multiple auxiliary threads that GraphChi may launch. As such, GraphChi also achieves peak storage access bandwidth.

Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets: one from the popular SNAP dataset collection (snapnets, ), and a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). These graphs are undirected, and for each edge, each of its end vertices appears in the neighbor list of the other end vertex. Table 1 shows the number of vertices and edges for these graphs.

Dataset name Number of vertices Number of edges
com-friendster (CF) 124,836,180 3,612,134,270
YahooWebScope (YWS) 1,413,511,394 12,869,122,070
Table 1. Graph dataset

5. Applications

To illustrate the benefits of our framework, we evaluate several graph applications, which are:

BFS: We consider whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end and destination id at several levels along the longest path, at 3 equal intervals, visiting around the same number of vertices in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.

Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph update algorithm and in our implementation, a vertex receives and accumulates delta updates from its neighbors, and it gets activated if it receives a delta update greater than a certain threshold value (0.4). As described earlier PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.

Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned the community label to which most of its neighbors belong. This algorithm uses asynchronous propagation of updates between supersteps so that it does not lead to oscillations of labels in graphs that have a bi-partite or similar structure.

Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm. In this greedy algorithm, in each iteration, a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color id of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color can send updates to their neighbors.

Maximal independent set (MIS): (salihoglu2014optimizing, ) Maximal independent set algorithms are based on the classical Luby's algorithm. In this algorithm, nodes are selected with a certain probability, and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes, or 2) it has the minimum id among its selected neighboring nodes. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until each node is in one of the two lists. In this algorithm, as successive supersteps perform different operations, it is necessary to use synchronous propagation of updates between the supersteps for functionally correct execution. Hence, GraphChi and PartitionedVC both use the synchronous update scheme.

K-Core: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, the node deletes itself and its neighboring edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between the supersteps.

Due to extremely high computational load, for all the applications we ran 15 supersteps or less than that if the problem converges before that. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).

6. Experimental evaluation

(a) BFS relative to GraphChi
(b) Pagerank relative to GraphChi
(c) FLP relative to GraphChi
Figure 5. Application performance
(a) GC relative to GraphChi
(b) MIS relative to GraphChi
Figure 6. Application performance
Figure 7. Ratio of page access counts relative to GraphChi
Figure 8. Storage access times and compute times

Figure 5(a) shows the performance comparison of the BFS application on the PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size; an X-axis value of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present performance normalized to GraphChi's performance: the Y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by the application execution time on the PartitionedVC framework.

On average, BFS performs several times better on PartitionedVC when compared to GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access pattern: initially, as the search starts from the source node, it keeps widening, so the size of the graph accessed and, correspondingly, the update log size grow after each superstep. As such, the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later supersteps. Figure 7 validates this assertion: it shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals; as the traversal need increases, GraphChi loads only 5X more pages. As such, the performance improvements seen with PartitionedVC are much higher when only a small fraction of the graph needs to be traversed. Figure 8 shows the total execution time split between storage access time (the time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when a smaller fraction of the graph must be traversed, the storage access time is about 75%; as the traversal demands increase, the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.

Figure 5(b) shows the performance comparison of the pagerank application on the PartitionedVC and GraphChi frameworks. The X-axis shows the two graph datasets that we used, and the Y-axis is the performance normalized to GraphChi. On average, pagerank performs better with PartitionedVC. Unlike BFS, pagerank has the opposite traversal pattern: in the early supersteps, many of the vertices are active and many updates are generated, but during later supersteps the number of active vertices reduces, and PartitionedVC performs better when compared to GraphChi. Figure 9(a) shows the performance of PartitionedVC compared to GraphChi over several supersteps; here the X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps PartitionedVC has similar, or in the case of the YWS dataset worse, performance than GraphChi, because the size of the generated log is large. But as the supersteps progress and the update size decreases, the performance of PartitionedVC gets better.

(a) Pagerank performance over several supersteps
(b) FLP performance over several supersteps
(c) GC performance over several supersteps
Figure 9. Performance comparisons over supersteps
Figure 10. MIS performance over several supersteps

From Figures 9(a), 9(b), 9(c), and 10, we can observe that as the number of supersteps increases, the performance benefits compared to GraphChi increase.

PartitionedVC uses the CSR format, which is suitable for accessing the data of a few vertices but is costly to merge into, as one has to shuffle the entire graph; using multiple intervals helps reduce the merge cost of the CSR format. We also compared the performance of K-core on PartitionedVC and GraphChi. In K-core, as delete operations are used, GraphChi can directly update the delete bit in its outgoing edge's shard, whereas in PartitionedVC we log the structural update and later merge it into the graph. As graph updates are passed in an asynchronous fashion, K-core takes only one iteration and all the vertices are active, so GraphChi performs better than PartitionedVC for the K-core application. However, for other structural update operations, such as adding an edge or a vertex, we expect PartitionedVC to perform in a similar fashion to GraphChi, as both handle structural updates similarly: they buffer the updates and merge them after a threshold.

7. Related work

Graph analytics systems are widely deployed in important domains such as drug discovery and social networking; concomitantly, there has been a considerable amount of research in graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements an external memory graph analytics system (GraFBoost, ). It logs updates to storage and passes vertex updates once per superstep, and it uses the sort-reduce technique to efficiently combine the updates and apply them to vertex values. In a superstep, it accesses the graph pages in storage corresponding to the active vertex list only once. However, it may access storage multiple times for the updates, as it sort-reduces a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages corresponding to the active vertex list only once. Also, our PartitionedVC system supports the complete vertex-centric programming model rather than only associative and commutative combine functions, which gives better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends the GraFBoost work for vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. This work improves performance by avoiding the sorting phase of the log. For this, it partitions the destination vertices such that they fit in main memory and replicates the source vertices across the partitions, so that when processing a partition, the graph data of all the source vertices can be streamed in and the updates to destination vertices can be performed entirely in main memory. However, with this scheme, one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support the complete vertex-centric programming model, the computation may become prohibitively expensive: the number of partitions may be high, so the replication cost will also be high. Linearly extrapolating based on the data presented in their paper, the replication overhead for 1000 partitions is prohibitively high.

X-Stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs that require streaming in all the edge data and performing vertex value updates based on it. However, they are inefficient for programs that require sparse accesses to graph data, such as BFS, or programs that require access to the adjacency lists of specific vertices, such as random walk.

GraphChi (graphchi, ) is the only external-memory vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals stored as shards. When processing based on shards, updates by a vertex are passed using shared-memory communication and all the updates are performed in memory, so storage is accessed efficiently in a coarse-grained manner. However, with GraphChi one has to read most of the graph in each iteration, which is not suitable for graph algorithms that access only a part of the graph, such as the widely used BFS, or for algorithms where not all vertices are active during an iteration. In this work, we compare with GraphChi as a baseline and show considerable performance improvements.

Several works extend GraphChi by trying to use all of the loaded shard or by minimizing the data to load per shard (ai2017squeezing, ; vora2016load, ). In this work, however, we avoid loading data in bulky shards in the first place and access only the graph pages for the active vertices in a superstep.

Semi-external memory systems such as FlashGraph (flashgraph, ) store the vertex data in main memory and achieve high performance. However, when processing on a low-cost system where the available main memory is smaller than the vertex value data, these systems suffer from performance degradation due to fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, graph frameworks have been developed in a wide variety of system settings. In the distributed computing setting, there are popular vertex-centric graph analytics systems including Pregel (malewicz2010pregel, ), GraphLab (low2014graphlab, ), PowerGraph (tiwari_hotpower12, ), etc. In the single-node in-memory setting, Ligra (shun2013ligra, ) provides a graph processing framework optimized for multi-core processing.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory graph processing systems, accessing storage becomes the bottleneck. Existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices of the graph. In this paper, we use a CSR-based graph format that is more amenable to selectively loading only the active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to 16.40×, 1.13×, 1.64×, 1.38×, and 2.76× for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.

3. CSR format in the era of SSDs

(a) CSR format representation for the example graph
(b) GraphChi shard structure for the example graph
Figure 2. Graph storage formats

Given the large loading bandwidth demands of GraphChi, we evaluated a compressed sparse row (CSR) format for graph processing. CSR format has the desirable property that one may load just the active vertex information more efficiently. Large graphs tend to be sparsely connected. CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors. The value vector,

val, stores all the non-zero values from each column sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector. CSR format representation for the example graph is shown in Figure (a)a. The edge weights on the graph are stored in val vector, and the adjacent outgoing vertices are stored in colIdx vector. To access adjacent outgoing vertices associated with a vertex in CSR graph storage format, we first need to access the rowPtr vector to get the starting index in the colIdx vector where the adjacent vertices associated with the vertex are stored in a contiguous fashion. As all the outgoing edges connected to a vertex are stored in a contiguous location, while accessing the adjacency information for the active vertices CSR format is suitable for minimizing the number of pages accessed in an SSD and reducing the read amplification while accessing the SSD.

Figure 3. Active vertices and edges over supersteps

3.1. Challenges for graph processing with a CSR format

While CSR format looks appealing, it suffers one fundamental challenge. One can either maintain an in-edge list in the colIdx vector or the out-edge list, but not both (due to coherency issues with having the same information in two different vectors). Consider the case that adjacency list stores only in-edges and during the superstep all the updates on the out-edges generate many random accesses to the adjacency lists to extract the out-edge information.

3.2. Two Key Observations: Using a log and splitting the log

To avoid the random access problem with the CSR format, we make the first key observation. Namely, updates to the out-edges do not need to be propagated using the adjacency list directly. Instead these updates can be simply logged. Thus, we propose to log the updates sent between the vertices, instead of directly updating at the target vertex location. In a superstep, we log all the vertex updates, and group the target vertex messages in the next superstep and pass them to that target vertex.

One can maintain a single log for all the updates that can be parsed in the next superstep. However, as multiple messages sent to a vertex may be spread all over the log, one may need to do external sorting over a large number of updates. The second key observation is that we maintain a separate log for a collection of vertices. As such we create a coarse-grain log for an interval of vertices that stores all the updates bound to those vertices.

We partition the graph into several vertex intervals and use a log for each interval. We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory, and used for processing by the vertices in that interval.

3.3. Multi-log Architecture

1:for all the vertex intervals in the superstep do
2:     Load vertex interval’s update log into the buffer
3:     Group updates based on target vertex id
4:     repeat/* repeat ends in line 8*/
5:         For the active vertices load the required vertex data (vertex values, in-edge and out-edge lists, in-edge and out-edge weights) into the buffer
6:         for each of the active vertices do
7:              ProcessVertex(VertexData)          
8:     until Entire update buffer is processed
Algorithm 1 Overview of a superstep in PartitionedVC
Figure 4. Internal components of PartitionedVC framework

Given the multi-log architecture design, the vertex-centric programming model can be implemented as follows. In a superstep, we loop over each of the vertex intervals for processing. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight in the algorithm is the fact that we load only active vertex data in step 5, rather than the entire graph. There is, however, a small penalty we pay for logging the updates, which requires us to sort each of the vertex interval logs based on the target vertex id of the update. As long as each vertex interval log fits in main memory, we can do in-memory sorting. Furthermore, as we described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, thereby shrinking the cost of managing and sorting the log.

Figure 4 shows the software components used to realize the framework, which are described in detail in the following paragraphs. First, we describe the framework for the synchronous computation model (described in subsection 2.2) and later extend it to the asynchronous computation model.

Multi-Log Unit: Handling Updates - This component handles storing and retrieving the updates generated by the vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass the update to the target vertex. The update function calls the PartitionedVC runtime system transparently to the programmer. The runtime system invokes the multi-log unit to log the update. A log is maintained for each vertex interval, and an update generated to a target vertex in the current superstep is appended to the log of that target vertex's interval. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.

To efficiently implement logging, PartitionedVC first caches the log writes in main memory buffers, called the multi-log memory buffers. Buffering helps to reduce fine-grained writes, which in turn reduces write amplification to the SSD storage. Note that flash memory in SSDs can only be written at page granularity. As such, PartitionedVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few thousand (fewer than 5,000). Hence, at least several thousand pages may be allocated in the multi-log memory buffer at any time.

For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multi-log unit, the top page of the vertex interval that the update is bound for is identified first. As updates are simply appended to the log, an update that fits in the available space of the top page is written into it. If there is not enough space in the top page, a new page is allocated by the multi-log unit, and that new page becomes the top page for that vertex interval log. We maintain a simple mapping table indexed by the vertex interval to identify the top page.
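The following is a simplified C++ sketch of the per-interval top-page bookkeeping described above, assuming a 16KB page size and a fixed number of vertices per interval; the structure and names (MultiLog, kPageSize, kIntervalSize) are illustrative rather than the framework's actual implementation.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A serialized update message; the real framework may use richer payloads.
struct Update { uint32_t target; float data; };

constexpr size_t   kPageSize     = 16 * 1024;  // assumed SSD page size (16KB)
constexpr uint32_t kIntervalSize = 1 << 20;    // assumed vertices per interval

struct Page { size_t used = 0; char bytes[kPageSize]; };

// One in-memory log per vertex interval; each log keeps a "top" (current) page.
class MultiLog {
    std::vector<std::vector<Page>> logs_;  // logs_[interval] = list of pages
public:
    explicit MultiLog(size_t numIntervals) : logs_(numIntervals) {}

    void append(const Update& u) {
        size_t interval = u.target / kIntervalSize;   // route by target vertex id
        auto& log = logs_[interval];
        if (log.empty() || log.back().used + sizeof(Update) > kPageSize)
            log.emplace_back();                       // allocate a new top page
        Page& top = log.back();
        std::memcpy(top.bytes + top.used, &u, sizeof(Update));
        top.used += sizeof(Update);
        // Fully written pages are the candidates for eviction to the
        // per-interval log file on the SSD when memory runs low.
    }
};

int main() {
    MultiLog mlog(8);
    mlog.append({5, 0.25f});
    mlog.append({kIntervalSize + 3, 0.5f});  // lands in interval 1
    return 0;
}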

When the available free space in the multi-log buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode of graph updates, one log file per vertex interval is stored in the SSD; when a log page is evicted from memory, it is appended to the corresponding vertex interval's log file. While the multi-log architecture may theoretically store many log pages to the SSD, in practice we noticed that evictions to the SSD are needed only when the multi-log buffer overflows the available main memory. But as we mentioned earlier, the total size of the active graph changes in each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log file size is proportional to the number of updates, it also shrinks when the active graph size is small. Hence, the log for each vertex interval is mostly cached in memory as the supersteps progress.

VC Unit – Handling Data Retrievals - The updates bound to each vertex interval are logged, either in memory or, on overflow, in the SSD, as described above. When the next superstep begins, the updates received by each vertex in the previous iteration must be processed by that vertex. Note that the updates bound to a given vertex are dispersed throughout the log associated with that vertex's interval. Hence, it is necessary to group all the updates in that log before initiating a vertex processing function. The VC unit is responsible for this task. At the start of each vertex interval's processing, the VC unit reads the corresponding log and groups all the messages bound for each vertex in that interval.

As described in the background section, some graph algorithms support the associative and commutative property on updates, so the updates can be merged in any order. For such programs, along with the vertex processing function, we provide an accumulation function. The programmer uses the accumulation function to specify the combine operation for the updates. The VC unit applies this function to all the updates bound for a target vertex in a superstep before the target vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the pagerank application. Hence, the VC unit can optimize the performance automatically whenever an accumulation function is defined for a graph algorithm. For non-associative and non-commutative programs, the updates are grouped based on the target vertex id, and the update function is called individually for each update.
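As a rough illustration of the VC unit's grouping step for associative and commutative programs, the C++ sketch below (our own simplification, not the framework's code) folds a vertex interval's logged updates per target vertex using a caller-supplied accumulate function; a PageRank-style accumulation would simply sum the delta contributions.

#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct Update { uint32_t target; float data; };

// Group a vertex interval's log by target vertex and fold the updates with an
// accumulate function, so each active vertex sees a single combined value.
std::unordered_map<uint32_t, float> groupAndAccumulate(
        const std::vector<Update>& log,
        const std::function<float(float, float)>& accumulate) {
    std::unordered_map<uint32_t, float> combined;
    for (const Update& u : log) {
        auto it = combined.find(u.target);
        if (it == combined.end())
            combined.emplace(u.target, u.data);           // first update is stored as-is
        else
            it->second = accumulate(it->second, u.data);  // merge in any order
    }
    return combined;
}

int main() {
    std::vector<Update> log = {{7, 0.1f}, {3, 0.2f}, {7, 0.3f}};
    // PageRank-style accumulation: sum the delta contributions.
    auto sums = groupAndAccumulate(log, [](float a, float b) { return a + b; });
    return (int)sums.size();  // 2 distinct target vertices
}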

Graph Loader Unit: As is the case with any graph algorithm, the programmer decides when a vertex becomes active or inactive. It is typically specified as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active vertex bit for each vertex in main memory and updates that bit during each superstep. This active vertex bit mask is used to decide which vertices to process in the next superstep.

As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data of the vertices present in the active vertex list (line 5 in Algorithm 1). It maintains a buffer for loading the row pointer array and a buffer for each kind of vertex data (adjacency edge lists/weights). The graph loader unit loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many entries as fit in the row pointer buffer. For the vertices that are active in the row pointer buffer, the vertex data (in-edge/out-edge neighbors and in-edge/out-edge weights) is fetched from the colIdx or val vector stored in the SSD, accessing only the SSD pages that contain active vertex data. The VC unit indicates which vertex data to load, such as in-edge or out-edge adjacent neighbors or edge weights, as not all the vertex data may be required by the application program. The graph loader unit uses double buffering so that it can overlap loading vertex data from storage with the VC unit's processing of the vertex data already loaded into the buffer.
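A minimal sketch of the page-selection logic follows, assuming a 16KB SSD page and 4-byte colIdx entries (both illustrative); it computes which colIdx pages actually need to be fetched for a given set of active vertices, which is the source of the read-amplification savings discussed above.

#include <cstdint>
#include <set>
#include <vector>

constexpr uint64_t kPageSize   = 16 * 1024;        // assumed SSD page size
constexpr uint64_t kEntryBytes = sizeof(uint32_t); // one colIdx entry

// Given rowPtr offsets and the active vertex ids, return the set of colIdx
// pages that actually have to be read; inactive vertices contribute nothing.
std::set<uint64_t> activePages(const std::vector<uint64_t>& rowPtr,
                               const std::vector<uint32_t>& activeVertices) {
    std::set<uint64_t> pages;
    for (uint32_t v : activeVertices) {
        uint64_t firstByte = rowPtr[v] * kEntryBytes;
        uint64_t lastByte  = rowPtr[v + 1] * kEntryBytes;  // exclusive
        if (firstByte == lastByte) continue;               // no out-edges
        for (uint64_t p = firstByte / kPageSize; p <= (lastByte - 1) / kPageSize; ++p)
            pages.insert(p);
    }
    return pages;
}

int main() {
    std::vector<uint64_t> rowPtr = {0, 2, 3, 4};
    std::vector<uint32_t> active = {0, 2};
    return (int)activePages(rowPtr, active).size();
}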

3.4. Design Choices: Graph Vertex Interval

The first design choice in the PartitionedVC implementation is the size of each vertex interval. If each vertex interval has only a few vertices, there will be many vertex intervals. Recall that during update propagation any vertex interval may update a target in any other vertex interval. Hence, having more intervals increases the overhead of vertex interval processing and also requires more time to route updates to a target vertex interval. On the other hand, having too many vertices in a single vertex interval can lead to memory overflow: while processing a vertex interval, the updates to that interval should all fit in main memory. Typically, in vertex-centric programming updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most the number of its incoming edges. Since fitting the updates of each vertex interval in main memory is a critical need, we conservatively assume that there may be an update on each incoming edge of a vertex when determining the vertex interval size. We statically partition the vertices into contiguous segments such that the sum of the number of incoming updates to the vertices in a segment is less than the provided main memory size. This size may be set by the administrator or the application programmer, or it may simply be limited by the size of the virtual machine allocated for graph processing.
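The static partitioning step can be sketched as below (our own simplification, with hypothetical parameters memBudgetBytes and updateBytes): vertices are swept in order and a new interval is started whenever the worst-case update volume of the current interval would exceed the memory budget.

#include <cstdint>
#include <vector>

// Split vertices into contiguous intervals so that the worst-case update log
// of an interval (one update per in-edge, updateBytes each) fits in memBudgetBytes.
std::vector<uint32_t> partitionIntervals(const std::vector<uint64_t>& inDegree,
                                         uint64_t memBudgetBytes,
                                         uint64_t updateBytes) {
    std::vector<uint32_t> intervalStart = {0};  // first vertex of each interval
    uint64_t used = 0;
    for (uint32_t v = 0; v < inDegree.size(); ++v) {
        uint64_t need = inDegree[v] * updateBytes;
        if (used + need > memBudgetBytes && used > 0) {
            intervalStart.push_back(v);  // start a new interval at vertex v
            used = 0;
        }
        used += need;
    }
    return intervalStart;
}

int main() {
    std::vector<uint64_t> inDeg = {100, 50, 400, 10, 300};
    auto starts = partitionIntervals(inDeg, 3200 /* bytes */, 8);
    return (int)starts.size();  // intervals start at vertices 0, 2, 3
}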

Due to our conservative assumption that there may be a message on each incoming edge, the size of a vertex interval may be small. But during runtime the number of updates received by a vertex can be less than its number of incoming edges. For each vertex interval, we keep a counter that tracks the number of updates sent to that interval in the current superstep. At the beginning of the next superstep, PartitionedVC's runtime may dynamically fuse contiguous vertex intervals into a single large interval to process at once. Such dynamic fusing enables efficient use of the memory during each superstep.

3.5. Design Choices: Graph Data Sizes

Currently our system supports an arbitrary structure of data associated with a vertex, but for efficiency reasons we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data, our system can easily handle that algorithm. If the structure of the data morphs, and in particular if its size grows arbitrarily, it is less efficient to dynamically alter the CSR organization in the SSD. PartitionedVC still works correctly in this case, but the graph loading process may be slower. Hence, we recommend that the programmer conservatively assign a large enough structure statically to accommodate future growth of the data associated with a vertex.

3.6. Graph structural updates

In vertex-centric programming, graph structure can be updated during the supersteps. Graph structure updates in a superstep can be applied at the end of the superstep.

In the CSR format, merging the graph structural updates into the column index or value vectors is a costly operation, as one needs to re-shuffle the entire column vectors. To minimize the costly merging operation, we partition the CSR format graph based on the vertex intervals, so that each vertex interval’s graph data is stored separately in the CSR format.

Instead of merging each update directly into the vertex interval's graph data, we batch several structural updates for a vertex interval and merge them into the graph data only after a certain threshold number of structural updates. As graph structural updates generated during vertex processing can target any vertex, we buffer each vertex interval's structural updates in memory. The multi-log and VC units always consult these buffered updates so that they fetch the most current graph data for processing.

3.7. Support for asynchronous computation

In asynchronous vertex-centric programming, if a target vertex is scheduled later in a superstep than the source vertex generating the update, then the update is available to the target vertex in the same superstep. Therefore, we maintain a single multi-log file for each vertex interval, and the current superstep's updates are appended to the same log as the previous iteration's. Hence, the vertex intervals can load all the updates generated for them in either the previous or the current superstep; the current updates are those made by the previously scheduled vertex intervals in the current superstep. Note that in asynchronous operation the current and previous superstep updates thus share the same log.

To facilitate routing of updates to target vertices that are in the same vertex interval but scheduled later than the source vertex, we keep two arrays. All the updates to active target vertices in the same vertex interval that are not yet scheduled are kept in one array as linked lists, with each target vertex having a separate linked list. A second array, with an entry for each vertex in that vertex interval, points to the start of its linked list in the first array.
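A minimal sketch of this two-array structure follows (the type and field names are ours): updates are chained per target vertex inside a single pool array, and a head array gives each local vertex the start of its chain.

#include <cstdint>
#include <vector>

// Updates destined to later-scheduled vertices of the *same* interval are kept
// in one array and chained per target vertex; a second array holds the head of
// each vertex's chain. -1 marks the end of a chain / an empty chain.
struct InIntervalUpdates {
    struct Node { float data; int32_t next; };
    std::vector<Node>    pool;   // all updates, in arrival order
    std::vector<int32_t> head;   // head[localVertex] -> index into pool, or -1

    explicit InIntervalUpdates(size_t intervalVertices)
        : head(intervalVertices, -1) {}

    void push(uint32_t localVertex, float data) {
        pool.push_back({data, head[localVertex]});      // prepend to the chain
        head[localVertex] = (int32_t)pool.size() - 1;
    }

    // Walk the chain of a vertex when it is scheduled later in the superstep.
    template <typename F>
    void forEach(uint32_t localVertex, F&& fn) const {
        for (int32_t i = head[localVertex]; i != -1; i = pool[i].next)
            fn(pool[i].data);
    }
};

int main() {
    InIntervalUpdates upd(4);
    upd.push(2, 1.0f);
    upd.push(2, 0.5f);
    float sum = 0;
    upd.forEach(2, [&](float d) { sum += d; });
    return sum > 1.4f ? 0 : 1;
}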

In a similar fashion, graph structural updates that stay within the same interval are passed to the target vertex using the two arrays. As described in subsection 3.6, these graph structural updates are written to storage as a log at the end of the interval to make them persistent.

3.8. Programming model

For each vertex, a vertex processing function is provided; the main logic for a vertex is written in this function. Within it, each vertex can access its vertex data, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes vertex values, in-edge/out-edge lists, and in-edge/out-edge labels. Communication between vertices is implemented by sending updates; each vertex can send an update to any other vertex. In the synchronous computation model, updates are delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices are delivered to the target vertices, which can be from either the current or the previous superstep. Vertices can modify the graph structure, and these graph modifications are completed by the start of the next superstep. For mutating the graph, we provide basic graph structure modification functions (add/delete an edge/vertex), which can modify any part of the graph, not just the local vertex structure. A vertex also indicates in its vertex processing function whether it wants to be deactivated; a deactivated vertex is re-activated if it receives an update from any other vertex. Algorithm 2 shows the pseudo-code for the community detection program using the most frequent label propagation method.

A programmer can provide several hints to the framework to further optimize performance, indicating whether the program a) requires adjacency lists, b) requires adjacency edge-weights, or c) performs any graph structural changes. Note that a compiler could also be used to infer these hints.

1:function ProcessVertex(VertexData v)
2:     for each update m in v.updates() do
3:         v.edge(m.source_id).set_label(m.data)      
4:     new_label = frequent_label(v.edges_label)
5:     old_label = v.get_value()
6:     if old_label ≠ new_label then
7:         v.set_value(new_label)
8:         for each edge in v.edges() do
9:              update m.source_id = v.id()
10:              m.target_id = edge.id(), m.data = new_label
11:              send_update(m)               
12:     deactivate(v.id())
Algorithm 2 Code snippet of community detection program
1:function Accumulate(VertexValue val, update m)
2:     val.change += m.data.change
3:     if is_set(m.data.activate) then
4:         activate(m.target_id)      
5:function ProcessVertex(VertexData v)
6:     for each edge in v.edges() do
7:         update m.target_id = edge.id()
8:         m.data = v.val.change / v.num_edges()
9:         if v.val.change > Threshold then
10:              m.data.activate = 1          
11:         send_update(m)      
12:     val.page_rank += val.change
13:     val.change = 0
14:     v.set_value(val)
15:     deactivate(v.id())
Algorithm 3 Code snippet of Pagerank - an associative and commutative program

3.9. Analysis of I/O costs

SSDs have a hierarchical organization, and the minimum granularity at which one can access them is the NAND page size, so we perform our I/O analysis based on the number of NAND pages accessed. Note that we assume all I/O accesses reach the NAND pages and ignore the buffer present in the SSD DRAM, as it is small and typically used only for buffering a NAND page before transferring it to the host system.

In each iteration, storage is accessed for logging the updates and for accessing the graph data. As updates are appended as a log, and the log is read sequentially, the amount of storage accessed is proportional to the number of updates generated by the active vertices. Note that the log is initially appended in the main memory buffer, and to create space in the buffer we evict only fully written pages to storage; partially written pages remain in main memory, so storage is accessed only for fully written pages. During an iteration, graph data is accessed only for the active vertices. As the graph data of the active vertices may be spread across several SSD pages, and the minimum granularity for accessing storage is an SSD page, the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active vertex data. In an iteration, if |V_a| vertices are active, each active vertex has on average d_avg edges worth of graph data, and the read amplification factor due to reading data from the SSD at page granularity is r, then the amount of storage accessed for graph data is proportional to r · |V_a| · d_avg, which is O(E). This is optimal, as the edge data corresponding to the active vertices has to be accessed at least once in each superstep. If each active vertex generates on average u_avg updates, then the number of updates generated is |V_a| · u_avg, and the amount of data accessed from storage for updates in an iteration is at most proportional to 2 · |V_a| · u_avg, once each for writing and reading.
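As a purely illustrative worked example (the numbers are hypothetical, not measured): with |V_a| = 10^7 active vertices, d_avg = 30 edges and u_avg = 30 updates per active vertex, 8-byte edge and update records, and a read amplification factor r = 1.5, the per-superstep storage traffic would be roughly

r · |V_a| · d_avg · s_edge = 1.5 × 10^7 × 30 × 8 B ≈ 3.6 GB for graph data, and
2 · |V_a| · u_avg · s_upd = 2 × 10^7 × 30 × 8 B ≈ 4.8 GB for writing and then reading the update logs.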

4. System design and Implementation

We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use 2TB Samsung 860 EVO SSDs (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.

To simultaneously load pages from several non-contiguous locations in the SSD using minimal host-side resources, we use asynchronous kernel IO. To match the SSD page size and load data efficiently, we perform all IO accesses at a granularity of 16KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations. The SSD page size may keep increasing to accommodate higher capacities and IO speeds: SSD vendors are packing more bits per cell to increase density, which leads to larger SSD page sizes (enterprise_storage, ).
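For illustration only, a synchronous C++ sketch of reading at 16KB page granularity is shown below; the actual framework issues such requests through asynchronous kernel IO, and the function readPages and its parameters are ours.

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;  // assumed SSD page size

// Read the SSD pages covering byte range [offset, offset + len) at 16KB
// granularity. The real framework issues these as asynchronous kernel IO
// requests; plain pread() is used here only to keep the sketch short.
std::vector<char> readPages(int fd, uint64_t offset, uint64_t len) {
    uint64_t first = (offset / kPageSize) * kPageSize;
    uint64_t last  = ((offset + len + kPageSize - 1) / kPageSize) * kPageSize;
    std::vector<char> buf(last - first);
    for (uint64_t pos = first; pos < last; pos += kPageSize)
        if (pread(fd, buf.data() + (pos - first), kPageSize, (off_t)pos) < 0)
            return {};  // error/short-read handling elided in this sketch
    return buf;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    std::vector<char> data = readPages(fd, 5000, 40000);  // spans three pages
    close(fd);
    return data.empty() ? 1 : 0;
}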

We used OpenMP to parallelize the code for running on multiple cores. We use an 8-byte data type for the rowPtr vector entries and 4 bytes for vertex ids. Locks were used sparingly, only as necessary to synchronize between threads. With our implementation, the system achieves 80% of the peak bandwidth between the storage and the host system.

Baseline: We compare our results with the popular out-of-core GraphChi framework. When comparing with GraphChi, we use the same host-side memory cache size as the size of the multi-log buffer used in the PartitionedVC system, so both implementations are limited to the same memory budget. In our implementation, we limit memory usage by limiting the total size of the multi-log buffer; GraphChi provides an option to specify the amount of memory budget it can use. We maximized GraphChi performance by enabling the multiple auxiliary threads that GraphChi may launch; as such, GraphChi also achieves peak storage access bandwidth.

Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets, one from the popular SNAP collection (snapnets, ) and another a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). Both graphs are undirected: for each edge, each of its end vertices appears in the adjacency list of the other. Table 1 shows the number of vertices and edges for these graphs.

Dataset name Number of vertices Number of edges
com-friendster (CF) 124,836,180 3,612,134,270
YahooWebScope (YWS) 1,413,511,394 12,869,122,070
Table 1. Graph dataset

5. Applications

To illustrate the benefits of our framework, we evaluate several graph applications, which are:

BFS: We consider whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end and destination id at several levels along the longest path, at 3 equal intervals, visiting around the same number of vertices in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.

Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph update algorithm and in our implementation, a vertex receives and accumulates delta updates from its neighbors, and it gets activated if it receives a delta update greater than a certain threshold value (0.4). As described earlier PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.

Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned the community label to which most of its neighbors belong. This algorithm uses asynchronous propagation of updates between the supersteps so that it does not lead to oscillations of labels in graphs that have a bi-partite or similar structure.

Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm. In this greedy algorithm, in each iteration, a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color id of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color can send updates to their neighbors.
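As an illustration of the greedy rule, the small C++ sketch below (the function pickColor is ours, not part of the framework) selects the smallest color id not present among a vertex's neighbors' colors; only when the chosen color differs from the vertex's previous one would it send updates to its neighbors, matching the activation rule above.

#include <cstdint>
#include <set>
#include <vector>

// One vertex's step of the greedy coloring described above: given the colors
// currently recorded on its in-edges, pick the smallest unused color id.
uint32_t pickColor(const std::vector<uint32_t>& neighborColors) {
    std::set<uint32_t> used(neighborColors.begin(), neighborColors.end());
    uint32_t color = 0;
    while (used.count(color)) ++color;   // first gap in the neighbor colors
    return color;
}

int main() {
    std::vector<uint32_t> neighbors = {0, 1, 3};
    return (int)pickColor(neighbors);  // returns 2
}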

Maximal independent set (MIS): (salihoglu2014optimizing, ) Our maximal independent set implementation is based on the classical Luby's algorithm. In this algorithm, nodes are randomly selected with a given probability, and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes or 2) it has the minimum id among its selected neighbors. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until every node is in one of the two lists. As successive supersteps perform different operations, synchronous propagation of updates between the supersteps is necessary for functionally correct execution. Hence, GraphChi and PartitionedVC both use the synchronous update scheme.

K-Core: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, the node deletes itself and its neighboring edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between the supersteps.

Due to the extremely high computational load, for all applications we ran 15 supersteps, or fewer if the problem converged earlier. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).

6. Experimental evaluation

(a) BFS relative to GraphChi
(b) Pagerank relative to GraphChi
(c) FLP relative to GraphChi
Figure 5. Application performance
(a) GC relative to GraphChi
(b) MIS relative to GraphChi
Figure 6. Application performance
Figure 7. Ratio of page access counts relative to GraphChi
Figure 8. Storage access times and compute times

Figure 5(a) shows the performance comparison of the BFS application on our PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size. Hence, an X-axis value of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present the performance normalized to GraphChi's performance. Therefore, the Y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by the application execution time on PartitionedVC.

On average, BFS performs considerably better on PartitionedVC than on GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access pattern: initially, as the search starts from the source node, it keeps widening. Consequently, the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later supersteps. Figure 7 validates this assertion: it shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals. However, as the traversal need increases, GraphChi loads only 5X more pages. As such, the performance improvements seen in BFS are much higher with PartitionedVC when only a small fraction of the graph needs to be traversed. Figure 8 shows the total execution time split between storage access time (the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when a smaller fraction of the graph must be traversed, the storage access time is about 75%; as the traversal demands increase, the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.

Figure 5(b) shows the performance comparison of the pagerank application on the PartitionedVC and GraphChi frameworks. The X-axis shows the two graph datasets that we used, and the Y-axis is the performance normalized to GraphChi. On average, pagerank performs better with PartitionedVC. Unlike BFS, pagerank has the opposite traversal pattern: in the early supersteps, many of the vertices are active and many updates are generated, but during later supersteps the number of active vertices reduces and PartitionedVC performs better when compared to GraphChi. Figure 9(a) shows the performance of PartitionedVC compared to GraphChi over several supersteps. Here the X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps, PartitionedVC has similar, or in the case of the YWS dataset worse, performance than GraphChi, because the size of the log generated is large. But as the supersteps progress and the update size decreases, the performance of PartitionedVC gets better.

(a) Pagerank performance over several supersteps
(b) FLP performance over several supersteps
(c) GC performance over several supersteps
Figure 9. Performance comparisons over supersteps
Figure 10. MIS performance over several supersteps

From Figures 9(a), 9(b), 9(c), and 10 we can observe that as we keep increasing the number of supersteps, the performance benefits compared to GraphChi increase.

PartitionedVC uses the CSR format, which is suitable for accessing the data of a few vertices but is costly to merge into, as one has to shuffle the entire graph; using multiple intervals helps in reducing this merge cost. We also compare the performance of K-core on PartitionedVC with GraphChi. In K-core, as delete operations are used, GraphChi can directly update the delete bit in its outgoing edge's shard, whereas in PartitionedVC we log the structural update and later merge it into the graph. As graph updates are passed in an asynchronous fashion, K-core takes only one iteration and all the vertices are active, so GraphChi performs better than PartitionedVC for the K-core application. However, for other structural update operations, such as adding an edge or a vertex, we expect PartitionedVC to perform similarly to GraphChi, as both tackle structural updates in a similar fashion: buffer the updates and merge them after a threshold.

7. Related work

Graph analytics systems are widely deployed in important domains such as drug discovery and social networking; concomitantly, there has been a considerable amount of research on graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements an external memory graph analytics system (GraFBoost, ). It logs updates to storage and passes vertex updates in a superstep. It uses a sort-reduce technique for efficiently combining the updates and applying them to vertex values. In a superstep, GraFBoost accesses the graph pages in storage corresponding to the active vertex list only once; however, it may access storage multiple times for the updates, as it sort-reduces a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages once, corresponding to the active vertex list. Also, our PartitionedVC system supports the complete vertex-centric programming model rather than just associative and commutative combine functions, which provides better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends GraFBoost for vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. It improves performance by avoiding the sorting phase of the log: the destination vertices are partitioned so that they fit in main memory, and the source vertices are replicated across the partitions so that, when processing a partition, all the source vertices' graph data can be streamed in and the updates to destination vertices can be performed entirely in main memory. However, with this scheme one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support the complete vertex-centric programming model, the computation may become prohibitively expensive: when the number of partitions is high, the replication cost is also high. Linearly extrapolating from the data presented in their paper, the replication overhead for 1000 partitions is prohibitively high.

X-stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric based external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs which require streaming in all the edge data and performing vertex value updates based on them. However, they are inefficient for programs which require sparse accesses to graph data such as BFS, or programs which require access to adjacency lists for specific vertices such as random-walk.

GraphChi (graphchi, ) is the only external memory vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, namely shards. When processing based on shards, updates by a vertex are passed using shared-memory communication and all the updates are done in memory, accessing storage efficiently in a coarse-grained manner. However, with GraphChi one has to read most of the graph in each iteration, which is not suitable for graph algorithms that may access only a part of the graph, such as the widely used BFS, or for algorithms where not all vertices are active during an iteration. In this work, we compare with GraphChi as a baseline and show considerable performance improvements.

There are several works that extend GraphChi by trying to use all of the loaded shard or by minimizing the data to load in a shard (ai2017squeezing, ; vora2016load, ). In this work, however, we avoid loading data in bulky shards in the first place and access only the graph pages for the active vertices in the superstep.

Semi-external memory systems such as FlashGraph (flashgraph, ) store the vertex data in main memory and achieve high performance. However, when processing on a low-cost system where the available main memory is smaller than the vertex value data, these systems suffer performance degradation due to fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, frameworks have been developed in a wide variety of system settings. In the distributed computing setting, there are popular vertex-centric graph analytics systems including Pregel (malewicz2010pregel, ), GraphLab (low2014graphlab, ), and PowerGraph (gonzalez2012powergraph, ). In the single-node in-memory setting, Ligra (shun2013ligra, ) provides a graph processing framework optimized for multi-core processing.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory graph processing systems, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices of the graph. In this paper, we use a CSR format for graphs, which is more amenable to selectively loading only active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves the performance by up to 16.40×, 1.13×, 1.64×, 1.38×, and 2.76× for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.

4. System design and Implementation

We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz and 16 GB DDR3 DRAM. We use TB EVO SSDs (SSD860EVO, ). We use Ubuntu operating system which runs on Linux Kernel version 3.19.

To simultaneously load pages from several non-contiguous locations in SSD using minimum host side resources, we use asynchronous kernel IO. To match SSD page size and load data efficiently, we perform all the IO accesses in granularities of 16KB, typical SSD page size (ssd_page_size, ). Note that the load granularity can be increased easily to keep up with future SSD configurations. The SSD page size may keep increasing to accommodate higher capacities and IO speeds, SSD vendors are packing more bits in a cell to increase the density, which leads to higher SSD page sizes (enterprise_storage, ).

We used OpenMP to parallelize the code for running on multi-cores. We use 8-byte data type for the rowPtr vector and 4 bytes for the vertex id. Locks were sparingly used as necessary to synchronize between the threads. With our implementation, our system can achieve 80% of the peak bandwidth between the storage and host system.

Baseline: We compare our results with the popular out-of-core GraphChi framework. While comparing with GraphChi, we use the same host-side memory cache size as the size of the multi-log buffer used in the PartitionedVC system. In both our implementation and GraphChi’s implementation, we limit the memory usage to GB. In our implementation, we limit memory usage by limiting the total size of the mulit-log buffer. GraphChi provides an option to specify the amount of memory budget that it can use. We maximized GraphChi performance by enabling multiple auxiliary threads that GraphChi may launch. As such GraphChi also achieves peak storage access bandwidth.

Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets, one from the popular SNAP dataset (snapnets, ), and another is a popular web graph from Yahoo Webscope dataset (yahooWebscopre_graph, ). These graphs are all undirected graphs and for an edge, each of its end vertices appears in the neighboring list of the other end vertex. Table 1 shows the number of vertices and edges for these graphs.

Dataset name Number of vertices Number of edges
com-friendster (CF) 124,836,180 3,612,134,270
YahooWebScope (YWS) 1,413,511,394 12,869,122,070
Table 1. Graph dataset

5. Applications

To illustrate the benefits of our framework, we evaluate several graph applications, which are:

BFS: We consider whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end and destination id at several levels along the longest path, at 3 equal intervals, visiting around the same number of vertices in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.

Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph update algorithm and in our implementation, a vertex receives and accumulates delta updates from its neighbors, and it gets activated if it receives a delta update greater than a certain threshold value (0.4). As described earlier PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.

Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned a community label, to which most neighbors belong to. This algorithm uses asynchronous propagation of updates between the supersteps so that it doesn’t lead to oscillations of labels in graphs that have a bi-partite or similar structure.

Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm. In this greedy algorithm, in each iteration, a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color id of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color can send updates to their neighbors.

Maximal independent set (MIS): (salihoglu2014optimizing, ) Maximal independent set algorithms are based on the classical Luby’s algorithm. In this algorithm nodes are selected with a probability of , and these selected nodes are added to the independent list if either 1) nodes don’t have any of its neighbors among the selected nodes or 2) it has the minimum id among its selected neighboring nodes. Neighbors of independent list nodes are kept in the dependent list. The algorithm runs until each node is in either of the lists. In this algorithm, as successive supersteps have different operations, it is necessary to use synchronous propagation of updates between the supersteps for functionally correct execution. Hence, GraphChi and PartitionedVC both use synchronous update scheme.

K-Core: (Quick:2012:UPL:2456719.2457085, ) For each superstep in this algorithm, if a node has fewer than neighbors then the node deletes itself, its neighboring edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between the supersteps.

Due to extremely high computational load, for all the applications we ran 15 supersteps or less than that if the problem converges before that. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).

6. Experimental evaluation

(a) BFS relative to GraphChi
(b) Pagerank relative to GraphChi
(c) FLP relative to GraphChi
Figure 5. Application performance
(a) GC relative to GraphChi
(b) MIS relative to GraphChi
Figure 6. Application performance
Figure 7. Ratio of page access counts relative to GraphChi
Figure 8. Storage access times and compute times

Figure (a)a shows the performance comparison of BFS application on our PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size. Hence, an X-axis of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present the performance normalized to GraphChi’s performance. Therefore, the Y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by application execution time on PartitionedVC framework.

On average BFS performs times better on PartitionedVC when compared to GraphChi. Performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access patterns. Initially as the search starts from the source node it keeps widening. Consequently, the size of the graph accessed and correspondingly the update log size grows during after each superstep. As such the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later intervals. Figure 7 validates this assertion. Figure 7 shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals. However, as the traversal need increases GraphChi loads only 5X more pages. As such the performance improvements seen in BFS are much higher with PartitionedVC when only a small fraction of the graph needs to traversed. Figure 8 shows the distribution of the total execution time split between storage access time (which is the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when there is a smaller fraction of graph that must be traversed the storage access time is about 75%, however as the traversal demands increase the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.

Figure (b)b shows the performance comparison of pagerank application on PartitionedVC framework with GraphChi framework. The X-axis shows the two graph datasets that we used and Y-axis is the performance normalized to GraphChi. On average, pagerank performs times better with PartitionedVC. Unlike BFS, pagerank has an opposite traversal pattern. In the early supersteps, many of the vertices are active and many updates are generated. But during later supersteps, the number of active vertices reduces and PartitionedVC performs better when compared to GraphChi. Figure (a)a shows the performance of PartitionedVC compared to GraphChi over several supersteps. Here X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps PartitionedVC has similar, or in the case of YWS dataset worse performance than GraphChi. The reason is that the size of the log generated is large. But as the supersteps progress and the update size decreases the performance of PartitionedVC gets better.

(a) Pagerank performance over several supersteps
(b) FLP performance over several supersteps
(c) GC performance over several supersteps
Figure 9. Performance comparisons over supersteps
Figure 10. MIS performance over several supersteps

Form the Figures (a)a, (b)b, (c)c, and 10 we can observe that as we keep increasing the number of supersteps, the performance benefits when compared to GraphChi increase.

PartitionedVC uses CSR format, which is suitable for accessing fewer vertices data but is costly to merge into it as one has to shuffle the entire graph. Using multiple intervals helps in reducing the merge cost of CSR format. Figure LABEL:fig:KCore_algoritm shows the performance of K-core on PartitionedVC compared to it on GraphChi. In K-core as delete operations are used, GraphChi can directly update the delete bit in its outgoing edge’s shard, whereas in PartitionedVC we log the structural update and later update it in the graph. As graph updates are passed using asynchronous fashion, K-core takes only one iteration and all the vertices are active, so GraphChi performs better than PartitionedVC for K-core application. However, for other structural update operations like add edge, add a vertex, we expect PartitionedVC to perform in a similar fashion, as GraphChi and PartitionedVC tackle the structural updates in a similar fashion, buffer the updates and merge after a threshold.

7. Related work

Graph analytics systems are widely deployed in important domains such as drug processing, social networking, etc., concomitantly there has been a considerable amount of research in graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements external memory graph analytics system (GraFBoost, ). It logs updates to the storage and pass vertex updates in a superstep. They use sort-reduce technique for efficiently combining the updates and applying them to vertex values. In a superstep, they access graph pages in storage corresponding to the active vertex list only once. However, they may access storage multiple times for the updates, as they sort-reduce on a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages once, corresponding to the active vertex list. Also, our PartitionedVC system supports, complete vertex-centric programming model rather than just associative and commutative combine functions, which has better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends GraFBoost work for the vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. In this work, they improve performance by avoiding the sorting phase of the log. For this, they partition the destination vertices such that they can fit in the main memory and replicate the source vertices across the partitions so that when processing a partition, all the source vertices graph data can be streamed in and updates to destination vertex can be performed in the main memory itself. However, with this scheme, one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support complete vertex-centric programming, computing with this scheme may be prohibitively expensive, as the number of partitions may be high, the replication cost will also be high. Linearly extrapolating based on the data presented in their paper, the replication overhead for 1000 partitions is around , which is prohibitively expensive.

X-stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric based external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs which require streaming in all the edge data and performing vertex value updates based on them. However, they are inefficient for programs which require sparse accesses to graph data such as BFS, or programs which require access to adjacency lists for specific vertices such as random-walk.

GraphChi (graphchi, ) is the only external memory based vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, namely shards. Processing based on shards, updates by a vertex are passed using shared memory communication and all the updates are done in memory only, accessing storage efficiently in a coarse-grained manner. However, when processing with GraphChi, one has to read most of the graph in each iteration and is not suitable for processing graph algorithms which may access only a part of the graph, such as widely used BFS, or for processing algorithms where not all vertices are active during an iteration. In this work, we compare with GraphChi as a baseline and show considerable performance improvements.

There are several works which extend GraphChi by trying to use all the loaded shard or minimizing the data to load in shard (ai2017squeezing, ; vora2016load, ). However, in this work, we avoid loading data in bulky shards at the first place and access only graph pages for the active vertices in the superstep.

Semi-external memory systems such as Flash graph (flashgraph, ) stores the vertex data in main memory and achieve high performance. When processing with a low-cost system and available main memory is less than the vertex value data, these systems suffer from performance degradation due to the fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, they have been developed in a wide variety of system settings. In distributed computing setting, there are popular vertex-centric programming based graph analytic systems including Pregel (malewicz2010pregel, ), Graphlab (low2014graphlab, ), PowerGraph (tiwari_hotpower12, ), etc. In the single-node in-memory setting, Ligra (shun2013ligra, ) provides graph processing framework optimized for multi-core processing.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory graph processing systems, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random-access reads to storage at the cost of loading many inactive vertices of the graph. In this paper, we use a CSR-based graph format that is more amenable to selectively loading only the active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge with a multi-log update system that logs updates into several log files, where each log file is associated with a single vertex interval. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to 16.40×, 1.13×, 1.64×, 1.38×, and 2.76× for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.
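
For completeness, a minimal Python sketch of the multi-log idea summarized above follows; the interval arithmetic and the names logs, interval_of, and process_interval are illustrative assumptions rather than our exact implementation.

    from collections import defaultdict

    def interval_of(vertex, interval_size):
        return vertex // interval_size

    def log_update(logs, dst, value, interval_size):
        # Rather than updating the destination vertex in the CSR structure (a random
        # access), append the update to the log of the destination's vertex interval.
        logs[interval_of(dst, interval_size)].append((dst, value))

    def process_interval(interval_id, logs, vertex_value, combine):
        # Processing an interval needs only that interval's log, so the logged updates
        # for the interval are read sequentially and consumed once per superstep.
        for dst, value in logs.pop(interval_id, []):
            vertex_value[dst] = combine(vertex_value[dst], value)

    # Example: logs = defaultdict(list); log_update(logs, dst=7, value=0.5, interval_size=4)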

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.


6. Experimental evaluation

(a) BFS relative to GraphChi
(b) Pagerank relative to GraphChi
(c) FLP relative to GraphChi
Figure 5. Application performance
(a) GC relative to GraphChi
(b) MIS relative to GraphChi
Figure 6. Application performance
Figure 7. Ratio of page access counts relative to GraphChi
Figure 8. Storage access times and compute times

Figure (a)a shows the performance comparison of BFS application on our PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size. Hence, an X-axis of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present the performance normalized to GraphChi’s performance. Therefore, the Y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by application execution time on PartitionedVC framework.

On average BFS performs times better on PartitionedVC when compared to GraphChi. Performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access patterns. Initially as the search starts from the source node it keeps widening. Consequently, the size of the graph accessed and correspondingly the update log size grows during after each superstep. As such the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later intervals. Figure 7 validates this assertion. Figure 7 shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals. However, as the traversal need increases GraphChi loads only 5X more pages. As such the performance improvements seen in BFS are much higher with PartitionedVC when only a small fraction of the graph needs to traversed. Figure 8 shows the distribution of the total execution time split between storage access time (which is the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when there is a smaller fraction of graph that must be traversed the storage access time is about 75%, however as the traversal demands increase the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.

Figure (b)b shows the performance comparison of pagerank application on PartitionedVC framework with GraphChi framework. The X-axis shows the two graph datasets that we used and Y-axis is the performance normalized to GraphChi. On average, pagerank performs times better with PartitionedVC. Unlike BFS, pagerank has an opposite traversal pattern. In the early supersteps, many of the vertices are active and many updates are generated. But during later supersteps, the number of active vertices reduces and PartitionedVC performs better when compared to GraphChi. Figure (a)a shows the performance of PartitionedVC compared to GraphChi over several supersteps. Here X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps PartitionedVC has similar, or in the case of YWS dataset worse performance than GraphChi. The reason is that the size of the log generated is large. But as the supersteps progress and the update size decreases the performance of PartitionedVC gets better.

(a) Pagerank performance over several supersteps
(b) FLP performance over several supersteps
(c) GC performance over several supersteps
Figure 9. Performance comparisons over supersteps
Figure 10. MIS performance over several supersteps

Form the Figures (a)a, (b)b, (c)c, and 10 we can observe that as we keep increasing the number of supersteps, the performance benefits when compared to GraphChi increase.

PartitionedVC uses CSR format, which is suitable for accessing fewer vertices data but is costly to merge into it as one has to shuffle the entire graph. Using multiple intervals helps in reducing the merge cost of CSR format. Figure LABEL:fig:KCore_algoritm shows the performance of K-core on PartitionedVC compared to it on GraphChi. In K-core as delete operations are used, GraphChi can directly update the delete bit in its outgoing edge’s shard, whereas in PartitionedVC we log the structural update and later update it in the graph. As graph updates are passed using asynchronous fashion, K-core takes only one iteration and all the vertices are active, so GraphChi performs better than PartitionedVC for K-core application. However, for other structural update operations like add edge, add a vertex, we expect PartitionedVC to perform in a similar fashion, as GraphChi and PartitionedVC tackle the structural updates in a similar fashion, buffer the updates and merge after a threshold.

7. Related work

Graph analytics systems are widely deployed in important domains such as drug processing, social networking, etc., concomitantly there has been a considerable amount of research in graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements external memory graph analytics system (GraFBoost, ). It logs updates to the storage and pass vertex updates in a superstep. They use sort-reduce technique for efficiently combining the updates and applying them to vertex values. In a superstep, they access graph pages in storage corresponding to the active vertex list only once. However, they may access storage multiple times for the updates, as they sort-reduce on a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages once, corresponding to the active vertex list. Also, our PartitionedVC system supports, complete vertex-centric programming model rather than just associative and commutative combine functions, which has better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends GraFBoost work for the vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. In this work, they improve performance by avoiding the sorting phase of the log. For this, they partition the destination vertices such that they can fit in the main memory and replicate the source vertices across the partitions so that when processing a partition, all the source vertices graph data can be streamed in and updates to destination vertex can be performed in the main memory itself. However, with this scheme, one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support complete vertex-centric programming, computing with this scheme may be prohibitively expensive, as the number of partitions may be high, the replication cost will also be high. Linearly extrapolating based on the data presented in their paper, the replication overhead for 1000 partitions is around , which is prohibitively expensive.

X-stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric based external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs which require streaming in all the edge data and performing vertex value updates based on them. However, they are inefficient for programs which require sparse accesses to graph data such as BFS, or programs which require access to adjacency lists for specific vertices such as random-walk.

GraphChi (graphchi, ) is the only external memory based vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, namely shards. Processing based on shards, updates by a vertex are passed using shared memory communication and all the updates are done in memory only, accessing storage efficiently in a coarse-grained manner. However, when processing with GraphChi, one has to read most of the graph in each iteration and is not suitable for processing graph algorithms which may access only a part of the graph, such as widely used BFS, or for processing algorithms where not all vertices are active during an iteration. In this work, we compare with GraphChi as a baseline and show considerable performance improvements.

There are several works which extend GraphChi by trying to use all the loaded shard or minimizing the data to load in shard (ai2017squeezing, ; vora2016load, ). However, in this work, we avoid loading data in bulky shards at the first place and access only graph pages for the active vertices in the superstep.

Semi-external memory systems such as Flash graph (flashgraph, ) stores the vertex data in main memory and achieve high performance. When processing with a low-cost system and available main memory is less than the vertex value data, these systems suffer from performance degradation due to the fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, they have been developed in a wide variety of system settings. In distributed computing setting, there are popular vertex-centric programming based graph analytic systems including Pregel (malewicz2010pregel, ), Graphlab (low2014graphlab, ), PowerGraph (tiwari_hotpower12, ), etc. In the single-node in-memory setting, Ligra (shun2013ligra, ) provides graph processing framework optimized for multi-core processing.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory based graph processing system, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices in a graph. In this paper, we use CSR format for graphs that are more amenable for selectively loading only active vertices in each superstep of graph processing. However, CSR format leads to random accesses to the graph during update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state of the art out-of-core graph processing framework, our evaluation results show that PartitionedVC framework improves the performance by up to , , , , and for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “ Introduction to Apache Giraph,,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.

7. Related work

Graph analytics systems are widely deployed in important domains such as drug processing, social networking, etc., concomitantly there has been a considerable amount of research in graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements external memory graph analytics system (GraFBoost, ). It logs updates to the storage and pass vertex updates in a superstep. They use sort-reduce technique for efficiently combining the updates and applying them to vertex values. In a superstep, they access graph pages in storage corresponding to the active vertex list only once. However, they may access storage multiple times for the updates, as they sort-reduce on a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages once, corresponding to the active vertex list. Also, our PartitionedVC system supports, complete vertex-centric programming model rather than just associative and commutative combine functions, which has better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).

A recent work (elyasi2019large, ) extends GraFBoost work for the vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value. In this work, they improve performance by avoiding the sorting phase of the log. For this, they partition the destination vertices such that they can fit in the main memory and replicate the source vertices across the partitions so that when processing a partition, all the source vertices graph data can be streamed in and updates to destination vertex can be performed in the main memory itself. However, with this scheme, one may need to replicate the source vertices and access edge data multiple times. When extending this scheme to support complete vertex-centric programming, computing with this scheme may be prohibitively expensive, as the number of partitions may be high, the replication cost will also be high. Linearly extrapolating based on the data presented in their paper, the replication overhead for 1000 partitions is around , which is prohibitively expensive.

X-stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric based external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs which require streaming in all the edge data and performing vertex value updates based on them. However, they are inefficient for programs which require sparse accesses to graph data such as BFS, or programs which require access to adjacency lists for specific vertices such as random-walk.

GraphChi (graphchi, ) is the only external memory based vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, namely shards. Processing based on shards, updates by a vertex are passed using shared memory communication and all the updates are done in memory only, accessing storage efficiently in a coarse-grained manner. However, when processing with GraphChi, one has to read most of the graph in each iteration and is not suitable for processing graph algorithms which may access only a part of the graph, such as widely used BFS, or for processing algorithms where not all vertices are active during an iteration. In this work, we compare with GraphChi as a baseline and show considerable performance improvements.

There are several works which extend GraphChi by trying to use all the loaded shard or minimizing the data to load in shard (ai2017squeezing, ; vora2016load, ). However, in this work, we avoid loading data in bulky shards at the first place and access only graph pages for the active vertices in the superstep.

Semi-external memory systems such as Flash graph (flashgraph, ) stores the vertex data in main memory and achieve high performance. When processing with a low-cost system and available main memory is less than the vertex value data, these systems suffer from performance degradation due to the fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, they have been developed in a wide variety of system settings. In distributed computing setting, there are popular vertex-centric programming based graph analytic systems including Pregel (malewicz2010pregel, ), Graphlab (low2014graphlab, ), PowerGraph (tiwari_hotpower12, ), etc. In the single-node in-memory setting, Ligra (shun2013ligra, ) provides graph processing framework optimized for multi-core processing.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory based graph processing system, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices in a graph. In this paper, we use CSR format for graphs that are more amenable for selectively loading only active vertices in each superstep of graph processing. However, CSR format leads to random accesses to the graph during update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state of the art out-of-core graph processing framework, our evaluation results show that PartitionedVC framework improves the performance by up to , , , , and for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “ Introduction to Apache Giraph,,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.

8. Conclusion

Graph analytics are at the heart of a broad set of applications. In external-memory based graph processing system, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random access reads to storage at the cost of loading many inactive vertices in a graph. In this paper, we use CSR format for graphs that are more amenable for selectively loading only active vertices in each superstep of graph processing. However, CSR format leads to random accesses to the graph during update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state of the art out-of-core graph processing framework, our evaluation results show that PartitionedVC framework improves the performance by up to , , , , and for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.

References

  • (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
  • (2) “Apache Flink, Iterative Graph Processing,” https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/iterative_graph_processing.html, 2019.
  • (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
  • (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
  • (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storage-hardware/nand-flash-memory.html, 2019.
  • (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/, 2014.
  • (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: distributed graph-parallel computation on natural graphs.” in OSDI, vol. 12, no. 1, 2012, p. 2.
  • (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
  • (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12.   Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
  • (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data.
  • (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
  • (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.   ACM, 2010, pp. 135–146.
  • (13) “Pagerank application,” https://github.com/GraphChi/graphchi-cpp/blob/master/example_apps/streaming_pagerank.cpp.
  • (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
  • (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
  • (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.   ACM, 2013, pp. 472–488.
  • (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
  • (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8.   ACM, 2013, pp. 135–146.
  • (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019.
  • (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12.   Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
  • (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing.” in USENIX Annual Technical Conference, 2016, pp. 507–522.
  • (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
  • (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15.   Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng
  • (24) X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning.” in USENIX Annual Technical Conference, 2015, pp. 375–386.