1. Introduction
Graph analytics is at the heart of a broad range of applications. In many of these domains, the size of the graph exceeds the size of main memory. Hence, many out-of-core (also called external memory) graph processing systems have been proposed. These systems primarily operate by splitting the graph into chunks and processing each chunk that fits in main memory. An alternative approach is to distribute graph processing across multiple machines so that the graph fits in their aggregate memory. In this work, we focus on single-machine graph processing.
Many popular graph processing systems use the vertex-centric programming paradigm. This paradigm uses bulk synchronous parallel (BSP) processing, where each vertex is processed once during a single superstep and may generate new updates for its connected vertices, which are then iteratively processed in the following superstep. A vertex can modify its state, generate updates to another vertex, or even mutate the topology of the graph.
GraphChi is an external memory graph analytics system that supports a vertex-centric programming model (graphchi, ). GraphChi was originally designed for graph processing on hard disks, so it strives to minimize the number of random disk accesses. To that end, GraphChi uses a custom graph structure called a shard to split the graph into chunks such that each chunk fits in main memory (more details in the next section). To process vertices in batches, GraphChi loads into main memory a shard containing a set of vertices, along with all the outgoing edges of these vertices that are located in other shards.
As we will describe in more detail in the next section, graph loading time dominates the total execution time due to the repeated fetching of shards from storage. GraphChi has to load all the shards that make up the full graph during every iteration of the BSP compute model, even if certain vertices/edges have no new updates. This limitation is primarily due to the shard structure, which spreads a vertex's outgoing edge information across multiple shards. Our data shows that the number of vertices that receive an update from a previous BSP iteration changes dynamically as the graph algorithm converges. In some algorithms, such as breadth-first search, the active graph starts very small and then grows with each superstep. In other algorithms, such as PageRank, the active graph shrinks with each superstep. Hence, there is a significant opportunity to reduce the cost of graph processing if only the active vertices are loaded from storage.
GraphChi’s shard structure was the right choice for hard-disk-based systems that heavily penalize random accesses. With the advent of solid state drives (SSDs), we need to evaluate new graph storage structures, such as the compressed sparse row (CSR) format, that are better suited for incrementally loading active vertices. In fact, GraphChi considered the CSR format in its original design (graphchi, ), but its main drawback is that updating the outgoing edges of a vertex during each BSP iteration leads to significant random access traffic. This limitation can be mitigated by using a message-passing-based log structure. The log structure in essence records all the updates to various target vertices as a set of sequential log writes, rather than directly updating the graph itself, and thereby mitigates the random access concern. One drawback of the log structure is that messages bound for a single target vertex may be interspersed throughout the log. Hence, the log must be sorted (or traversed multiple times) to process the updates in the next iteration.
Given these limitations, this paper proposes a new graph processing framework that loads only the active vertices in each superstep. In particular, we demonstrate how CSR-formatted graphs can be exploited to load only the active vertices and thereby reduce graph load time in iterative graph processing algorithms. Second, to tackle the large overhead of managing the log structure, we propose a split log structure that divides the log across multiple vertex intervals. All the updates generated for a given vertex interval are stored in that interval's log. When an interval of vertices is scheduled for processing, all the updates it needs are located in a single log.
Our main contributions in this work are:

We propose an efficient external memory graph analytics system, called PartitionedVC, which reduces read amplification when accessing storage for active vertices' data and for the updates sent between them. To realize this, we use a compressed graph storage format suited to accessing active vertices' data, and we log the updates sent between vertices instead of directly updating the target vertex location. We show that log-based updates significantly reduce the number of random accesses across a wide range of graph analytics.

To efficiently log and access the updates in a superstep, we partition the graph into multiple vertex intervals for processing. To further reduce log management overhead, we split the update log into multiple logs, each associated with one vertex interval.

While processing these vertex intervals, we efficiently schedule graph accesses so that only the SSD pages containing active graph data are read.

We also support efficient graph structural updates to the compressed graph format by logging the graph structural changes for each interval.
2. Background and Motivation
2.1. Modern SSD platforms
Since much of our work uses SSD-specific optimizations, we provide a very brief overview of SSD architecture. Figure 1 illustrates the architecture of a modern SSD platform. An SSD employs multiple flash memory channels to support high data bandwidth. Multiple dies are integrated into a single NAND flash package using die stacking to fit more storage onto the limited platform board. Data parallelism can be achieved per die with a multi-plane or multi-way composition. Each plane or way is divided into multiple blocks, each of which holds dozens of physical pages. A page is the basic physical storage unit that can be read or written by one flash command. The SSD uses firmware to manage all the idiosyncrasies of accessing a page. To execute this firmware, the major components of an SSD include an embedded processor, DRAM, an on-chip interconnection network, and a flash controller.
2.2. Graph Computational model
In this work, we support the commonly used vertex-centric programming model for graph analytics. The input to a graph computation is a directed graph G = (V, E). Each vertex in the graph has a unique id and a modifiable, user-defined value associated with it. For a directed edge e = (u, v), we refer to e as the out-edge of u and the in-edge of v; u is the source vertex, v is the target vertex, and e may be associated with a modifiable, user-defined value.
A typical vertex-centric computation consists of an input phase, where the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishes with an output phase. Within each superstep, vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its neighboring edges, generate updates to another vertex, or even mutate the topology of the graph. Edges are not first-class citizens in this model and have no associated computation.
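As a concrete (and deliberately simplified) illustration of this model, the following sketch runs a synchronous vertex-centric computation in Python. The function names and the BFS example are our own illustrative choices, not part of any system described in this paper.

```python
from collections import defaultdict

def run_supersteps(state, out_edges, vertex_program, initial_updates, max_supersteps=30):
    """Synchronous vertex-centric execution: each superstep processes every
    vertex with pending updates; updates it emits become visible in the
    next superstep, after a global synchronization point."""
    pending = defaultdict(list)
    for v, msg in initial_updates:
        pending[v].append(msg)
    for _ in range(max_supersteps):
        if not pending:              # no active vertices: algorithm has converged
            break
        next_pending = defaultdict(list)
        for v, msgs in sorted(pending.items()):   # conceptually parallel
            for target, msg in vertex_program(v, msgs, state, out_edges):
                next_pending[target].append(msg)
        pending = next_pending       # global synchronization point
    return state

def bfs_program(v, msgs, dist, out_edges):
    """User-defined function: keep the smallest distance seen so far and,
    if it improves, propagate distance + 1 over the out-edges."""
    d = min(msgs)
    if d < dist[v]:
        dist[v] = d
        return [(w, d + 1) for w in out_edges[v]]
    return []
```

Running this on a three-vertex path 0 -> 1 -> 2, with distances initialized to infinity and a single initial update (0, 0), produces BFS levels [0, 1, 2] and then terminates once no vertex is active.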
Depending on when updates become visible to the target vertices, the computational model is either synchronous or asynchronous. In the synchronous model, updates generated for a target vertex become available to it in the next superstep. Graph systems such as Pregel (malewicz2010pregel, ) and Apache Giraph (apache_giraph, ) use this approach. In the asynchronous model, an update to a target vertex is visible in the current superstep: if the target vertex is scheduled after the source vertex within a superstep, the current superstep's update is already available to it. GraphChi (graphchi, ) and GraphLab (low2014graphlab, ) use this approach. The asynchronous model has been shown to accelerate the convergence of many numerical algorithms (graphchi, ). PartitionedVC supports asynchronous updates.
Graph algorithms can also be broadly classified into two categories based on how updates are handled. One class of algorithms exhibits the associative and commutative property: updates to a target vertex can be combined into a single value, in any order, before the vertex is processed. Algorithms such as PageRank, BFS, and single-source shortest path fall in this category. Many other algorithms require the update order to be preserved, with each update applied individually. Algorithms such as community detection (FLP_implementation, ), graph coloring (gonzalez2012powergraph, ), and maximal independent set (malewicz2010pregel, ) fall in this category. PartitionedVC supports both types of graph algorithms.
2.3. Out-of-core graph processing
In the out-of-core graph processing context, graphs are large relative to main memory but fit within the storage capacity of current SSDs (terabytes). As described earlier, GraphChi (graphchi, ) is an out-of-core vertex-centric programming system. GraphChi partitions the graph into several vertex intervals and stores all the incoming edges of a vertex interval as a shard. Figure 2(b) shows the shard structure for the illustrative graph shown in Figure 2(a). For instance, shard1 stores all the incoming edges of the first vertex interval, shard2 stores those of the second interval, and shard3 stores the incoming edges of all the vertices in the third interval. While incoming edges are closely packed in a shard, the outgoing edges of a vertex are dispersed across the other shards; in this example, a single vertex's outgoing edges may be spread over shard1, shard2, and shard3. Another distinctive property of the shard organization is that each shard stores its in-edges sorted by source vertex.
GraphChi relies on this shard organization to process vertices in intervals. It first loads into memory the shard corresponding to one vertex interval, as well as all the outgoing edges of those vertices, which may be stored across multiple shards. Updates generated during processing are passed asynchronously and directly to the target vertices through the outgoing edges of the in-memory shards. Once the processing of a vertex interval in a superstep is finished, its shard and its outgoing edges in other shards are written back to disk.
With this approach, GraphChi relies primarily on sequential accesses to disk data and minimizes random accesses. However, in the following superstep, only a subset of vertices may become active (those that received messages on their in-edges). The in-edges of a vertex are stored in a shard, can come from any source vertex, and are sorted by source vertex id. Hence, even if a single vertex in a vertex interval is active, the entire shard must be loaded, since the in-edges for that vertex may be dispersed throughout the shard. For instance, if any vertex of the third interval is active, the entire shard3 must be loaded. Loading a shard can be avoided only if no vertex in the associated vertex interval is active. However, in real-world graphs the vertex intervals typically span tens of thousands of vertices, and during each superstep the probability that at least one vertex in a given interval is active is very high. As a result, GraphChi in practice ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.
2.4. Active graph
To quantify the amount of superfluous loading, we counted the active vertices and active edges in each superstep while running the graph coloring application described in section 5 over the datasets shown in Table 1. For this application, we ran a maximum of 15 supersteps. Figure 3 shows the active vertex and active edge counts over these supersteps. The x-axis indicates the superstep number, the primary y-axis shows the fraction of vertices that are active, and the secondary y-axis shows the number of active edges (updates sent over an edge) divided by the total number of edges in the graph. The fractions of active vertices and active edges shrink dramatically as the supersteps progress. However, at the granularity of a shard, even a few active vertices lead to loading many shards, since the active vertices are spread across the shards.
3. CSR format in the era of SSDs
Figure 2. Graph storage formats
Given the large loading-bandwidth demands of GraphChi, we evaluated the compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it into three vectors. The value vector, val, stores the non-zero values of each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector. The CSR representation of the example graph is shown in Figure 2. The edge weights are stored in the val vector, and the adjacent outgoing vertices in the colIdx vector. To access the outgoing neighbors of a vertex, we first read the rowPtr vector to obtain the starting index in the colIdx vector, where that vertex's adjacent vertices are stored contiguously. Because all the outgoing edges of a vertex are stored in contiguous locations, the CSR format minimizes the number of SSD pages accessed, and thus the read amplification, when fetching adjacency information for the active vertices.
3.1. Challenges for graph processing with a CSR format
While the CSR format looks appealing, it suffers from one fundamental challenge: the colIdx vector can hold either the in-edge list or the out-edge list of each vertex, but not both (keeping the same information coherent in two different vectors is problematic). Consider the case where the adjacency list stores only in-edges: during a superstep, propagating updates over the out-edges then generates many random accesses to the adjacency lists to extract the out-edge information.
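To make the layout concrete, here is a minimal CSR sketch with the three vectors described above (the function names are illustrative). Note how all out-neighbors of a vertex sit in one contiguous slice of colIdx, while recovering the in-neighbors of a vertex from this same structure would require scanning every row.

```python
def build_csr(num_vertices, edges):
    """Build (rowPtr, colIdx, val) from a weighted edge list of (src, dst, weight)."""
    row_ptr = [0] * (num_vertices + 1)
    for s, _, _ in edges:
        row_ptr[s + 1] += 1
    for i in range(num_vertices):          # prefix sum: start index of each row
        row_ptr[i + 1] += row_ptr[i]
    col_idx, val = [0] * len(edges), [0] * len(edges)
    fill = row_ptr[:-1].copy()             # next free slot per row
    for s, d, w in edges:
        col_idx[fill[s]], val[fill[s]] = d, w
        fill[s] += 1
    return row_ptr, col_idx, val

def out_neighbors(row_ptr, col_idx, v):
    # one contiguous slice: fetching it touches few SSD pages
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```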
3.2. Two Key Observations: Using a log and splitting the log
To avoid the random access problem with the CSR format, we make our first key observation: updates do not need to be propagated through the adjacency list directly. Instead, these updates can simply be logged. Thus, we propose to log the updates sent between vertices instead of directly updating the target vertex location. In a superstep, we log all the vertex updates; in the next superstep, we group the messages by target vertex and pass them to those vertices.
One could maintain a single log of all the updates and parse it in the next superstep. However, as the multiple messages sent to a vertex may be spread all over the log, one may need to perform external sorting over a large number of updates. Our second key observation is that we can instead maintain a separate log per collection of vertices. As such, we create a coarse-grain log for each interval of vertices that stores all the updates bound for those vertices.
We partition the graph into several vertex intervals and use a log for each interval. We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory, and used for processing by the vertices in that interval.
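The split-log idea can be sketched as follows (a simplified in-memory model with a fixed interval size; the real system works with SSD pages and per-interval log files):

```python
class SplitLog:
    """One append-only update log per vertex interval, so every update
    bound for an interval lands in that interval's log."""
    def __init__(self, num_vertices, interval_size):
        self.interval_size = interval_size
        num_intervals = (num_vertices + interval_size - 1) // interval_size
        self.logs = [[] for _ in range(num_intervals)]

    def append(self, target_vertex, value):
        # route the update by its *target* vertex interval, not its source
        self.logs[target_vertex // self.interval_size].append((target_vertex, value))

    def read_interval(self, i):
        # everything interval i needs for the next superstep is in one log
        return self.logs[i]
```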
3.3. Multilog Architecture
Given the multilog design, the vertex-centric programming model can be implemented as follows. In a superstep, we loop over each of the vertex intervals. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight is that we load only active vertex data in step 5, rather than the entire graph. There is, however, a small penalty for logging the updates: each vertex interval's log must be sorted by the target vertex id of the update. As long as each vertex interval log fits in main memory, we can sort it in memory. Furthermore, as described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, shrinking the cost of managing and sorting the log.
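Our reading of this per-interval loop can be sketched as follows; the helper names (read_log, load_vertex_data, emit) are illustrative stand-ins, not the framework's actual API:

```python
from itertools import groupby

def superstep(num_intervals, read_log, load_vertex_data, vertex_program, emit):
    """One superstep: for each interval, sort its update log by target
    vertex id, group the updates per vertex, and run the vertex program
    on active vertices only. emit(target, value) logs an update for the
    next superstep."""
    for i in range(num_intervals):
        log = sorted(read_log(i), key=lambda u: u[0])   # in-memory sort
        for v, group in groupby(log, key=lambda u: u[0]):
            msgs = [value for _, value in group]
            data = load_vertex_data(v)   # only active vertices are loaded
            vertex_program(v, data, msgs, emit)
```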
Figure 4 shows the software components used to realize the framework, which is described in detail in the following paragraphs. First, we describe the framework for the synchronous computation model (described in subsection 2.2) and then later extend it to asynchronous computation model.
MultiLog Unit: Handling Updates. This component handles storing and retrieving the updates generated by vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass an update to the target vertex. The update function calls into PartitionedVC's runtime system, transparently to the programmer, and the runtime invokes the multilog unit to log the update. A log is maintained for each vertex interval, and an update generated for a target vertex in the current superstep is appended to that target's vertex interval log. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.
To implement logging efficiently, PartitionedVC first caches the log writes in main memory buffers, called the multilog memory buffers. Buffering reduces fine-grained writes, which in turn reduces write amplification on the SSD; note that flash memory in SSDs can only be written at page granularity. PartitionedVC therefore maintains the memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few (<5) thousand. Hence, at least several thousand pages may be allocated in the multilog memory buffer at one time.
For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multilog unit, the top page of the vertex interval that the update is bound for is identified first. Since updates are simply appended to the log, an update that fits in the available space of the top page is written into it. If there is not enough space, the multilog unit allocates a new page, which becomes the top page for that vertex interval log. A simple mapping table, indexed by vertex interval, identifies the top page.
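The buffering scheme can be sketched as follows, under simplifying assumptions: update records are byte strings, and "sealing" a full top page stands in for making it a candidate to flush to the interval's log file on the SSD.

```python
PAGE_SIZE = 16 * 1024   # typical SSD page size

class MultiLogBuffer:
    """Per-interval top pages filled by appended update records; when a
    record does not fit, the top page is sealed and a fresh one allocated."""
    def __init__(self, num_intervals, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.top = [bytearray() for _ in range(num_intervals)]   # mapping: interval -> top page
        self.sealed = [[] for _ in range(num_intervals)]         # full pages, ready to flush

    def append(self, interval, record):
        page = self.top[interval]
        if len(page) + len(record) > self.page_size:
            self.sealed[interval].append(bytes(page))            # page full: seal it
            page = self.top[interval] = bytearray()
        page.extend(record)
```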
When the available free space in the multilog buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode of graph updates, there is one log file per vertex interval stored on the SSD; when a log page is evicted from memory, it is appended to the corresponding vertex interval's log file. While the multilog architecture may theoretically store many log pages on the SSD, in practice we noticed that evictions are needed only when the multilog buffer overflows main memory. As mentioned earlier, the total size of the active graph changes in each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log is proportional to the number of updates, the log file size also shrinks when the active graph is small. Hence the log for each vertex interval is mostly cached in memory as the supersteps progress.
VC Unit: Handling Data Retrievals. The updates bound for each vertex interval are logged, either in memory or, on overflow, on the SSD, as described above. When the next superstep begins, the updates received by each vertex in the previous iteration must be processed by that vertex. Note that the updates bound for a given vertex are dispersed throughout its vertex interval's log. Hence, all the updates in that log must first be grouped before initiating a vertex processing function. The VC unit is responsible for this task: at the start of each vertex interval's processing, it reads the corresponding log and groups all the messages bound for each vertex in that interval.
As described in the background section, some graph algorithms have the associative and commutative property on updates, so updates can be merged in any order. For such programs, we provide an accumulation function along with the vertex processing function. The programmer uses the accumulation function to specify the combine operation for updates. The VC unit applies this function to all the incoming updates of a target vertex in a superstep before the target vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the PageRank application. Hence, the VC unit can optimize performance automatically whenever an accumulation function is defined for a graph algorithm. For non-associative, non-commutative programs, the updates are grouped by target vertex id, and the update function is called individually for each update.
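As an illustration (our own sketch, not the paper's Algorithm 3), a PageRank-style accumulation simply sums incoming rank contributions, so the VC unit may fold the updates in any order before the vertex program runs:

```python
def accumulate(current, update):
    """Combine operation for an associative and commutative algorithm:
    PageRank-style rank contributions are simply summed."""
    return current + update

def combine_updates(updates, accumulate, initial=0.0):
    """Fold all updates bound for one target vertex into a single value
    before the vertex processing function is called."""
    acc = initial
    for u in updates:
        acc = accumulate(acc, u)   # applied once per incoming update
    return acc
```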
Graph Loader Unit: As with any graph algorithm, the programmer decides when a vertex becomes active, typically as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active-vertex bit for each vertex in main memory and updates that bit during each superstep. This active-vertex bit mask decides which vertices to process in the next superstep.
As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data of the vertices present in the active vertex list (line 5 in Algorithm 1). It maintains a buffer for the row pointers and a buffer for each kind of vertex data (adjacency edge lists/weights). The graph loader unit loops over the rowPtr array for the range of vertices in the active vertex list, each time fetching as many entries as fit in the row pointer buffer. For the vertices that are active in the row pointer buffer, the vertex data (in-edge/out-edge neighbors and in-edge/out-edge weights) is fetched from the colIdx or val vectors stored on the SSD, accessing only the SSD pages that contain active vertex data. The VC unit indicates which vertex data to load, such as in-edge or out-edge neighbors or edge weights, as not all the vertex data may be required by the application. The graph loader unit uses double buffering so that loading vertex data from storage overlaps with the VC unit's processing of the data already loaded.
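The page-selection step can be sketched as follows: given the rowPtr vector and the active vertex list, compute exactly which colIdx pages must be fetched (a sketch assuming fixed-size colIdx entries; the names are illustrative):

```python
def active_pages(row_ptr, active_vertices, entries_per_page):
    """Return the sorted set of colIdx page numbers that hold adjacency
    data for the active vertices; only these pages are read from the SSD."""
    pages = set()
    for v in active_vertices:
        start, end = row_ptr[v], row_ptr[v + 1]
        if start == end:
            continue                       # vertex has no edges: nothing to load
        first = start // entries_per_page
        last = (end - 1) // entries_per_page
        pages.update(range(first, last + 1))
    return sorted(pages)
```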
3.4. Design Choices: Graph Vertex Interval
The first design choice in the PartitionedVC implementation is the size of each vertex interval. If each interval has only a few vertices, there will be many intervals; having more intervals increases the overhead of interval processing and the time spent routing updates to a target interval. On the other hand, having too many vertices in a single interval can overflow memory: while processing a vertex interval, all the updates bound for that interval should fit in main memory. In vertex-centric programming, updates are typically sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most its number of incoming edges. Since fitting each interval's updates in main memory is critical, we conservatively assume, for the purpose of sizing intervals, that there may be an update on each incoming edge of a vertex. We statically partition the vertices into contiguous segments such that the sum of possible incoming updates to a segment's vertices is less than the available main memory. This size may be set by an administrator or the application programmer, or simply bounded by the size of the virtual machine allocated for graph processing.
Due to our conservative assumption that there may be a message on each incoming edge, vertex intervals may be small, while at runtime the updates received by a vertex can be far fewer than its incoming edges. For each vertex interval, we therefore keep a counter that tracks the number of updates sent to that interval in the current superstep. At the beginning of the next superstep, PartitionedVC's runtime may dynamically fuse contiguous vertex intervals into a single large interval and process it at once. Such dynamic fusing enables efficient use of memory during each superstep.
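The static partitioning step can be sketched as follows, with the memory budget expressed as a number of in-edge updates (a simplification of the actual sizing policy described above):

```python
def partition_intervals(in_degree, budget):
    """Split vertices into contiguous intervals [start, end) such that the
    worst case of one update per incoming edge fits within the budget."""
    intervals, start, load = [], 0, 0
    for v, deg in enumerate(in_degree):
        if load + deg > budget and v > start:   # close the current interval
            intervals.append((start, v))
            start, load = v, 0
        load += deg
    intervals.append((start, len(in_degree)))
    return intervals
```

A vertex whose in-degree alone exceeds the budget still gets its own interval (the `v > start` guard prevents empty intervals); the runtime's dynamic fusing then compensates when the actual update counts are small.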
3.5. Design Choices: Graph Data Sizes
Currently our system supports an arbitrary structure of data associated with a vertex, but for efficiency we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data, our system can easily handle the algorithm. If the structure morphs and, in particular, its size grows arbitrarily, it is less efficient to dynamically alter the CSR organization on the SSD; PartitionedVC still works correctly, but graph loading may be slower. Hence, we recommend that the programmer conservatively assign a structure large enough to accommodate future growth of the data associated with a vertex.
3.6. Graph structural updates
In vertex-centric programming, the graph structure can be updated during the supersteps. Structural updates made in a superstep are applied at the end of that superstep.
In the CSR format, merging graph structural updates into the column index or value vectors is a costly operation, as the entire column vectors must be reshuffled. To reduce this cost, we partition the CSR-format graph by vertex interval, so that each interval's graph data is stored separately in CSR format.
Instead of merging each update directly into the interval's graph data, we batch several structural updates per interval and merge them into the graph data only after a threshold number of updates accumulates. As structural updates generated during vertex processing can target any vertex, we buffer each interval's structural updates in memory. The multilog and VC units always consult these buffered updates so that they fetch the most current graph data for processing.
3.7. Support for asynchronous computation
In asynchronous vertex-centric programming, if a target vertex is scheduled later in a superstep than the source vertex generating the update, the update is available to the target vertex in the same superstep. We therefore maintain a single multilog file for each vertex interval: the current superstep's updates are appended to the same log as the previous iteration's. Hence, a vertex interval can load all the updates generated for it in the previous superstep as well as those generated in the current superstep by previously scheduled intervals. Note that in asynchronous operation the current and previous superstep logs are kept isolated.
To route updates to target vertices that are in the same vertex interval but scheduled later than the source vertex, we keep two arrays. The first array holds, as linked lists, all the updates to active target vertices in the same interval that have not yet been scheduled, with a separate linked list per target vertex. The second array has an entry for each vertex in the interval, pointing to the start of that vertex's linked list in the first array.
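The two-array scheme can be sketched as follows (our own minimal model): `head` has one entry per vertex in the interval, pointing into an append-only slot array of `(value, next)` pairs that chains all pending updates for that vertex.

```python
class IntervalUpdates:
    """Route updates to not-yet-scheduled vertices within the same interval
    using two arrays: per-vertex list heads and an append-only slot array."""
    def __init__(self, num_vertices):
        self.head = [-1] * num_vertices   # -1 means no pending updates
        self.slots = []                   # (value, index of next slot) pairs

    def push(self, v, value):
        self.slots.append((value, self.head[v]))
        self.head[v] = len(self.slots) - 1

    def drain(self, v):
        """Collect v's pending updates (most recent first) and clear the list."""
        out, i = [], self.head[v]
        while i != -1:
            value, i = self.slots[i]
            out.append(value)
        self.head[v] = -1
        return out
```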
In a similar fashion, graph structural updates within the same interval are passed to the target vertex using the two arrays. As described in subsection 3.6, these structural updates are written to storage as a log at the end of the interval's processing to make them persistent.
3.8. Programming model
For each vertex, a vertex processing function is provided; the main logic for a vertex is written in this function. In it, each vertex can access its vertex data, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes vertex values, in-edge/out-edge lists, and in-edge/out-edge labels. Communication between vertices is implemented by sending updates; each vertex can send an update to any vertex. In the synchronous computation model, updates are delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices are delivered to the target vertices, whether from the current superstep or the previous one. Vertices can modify the graph structure, and these modifications complete by the start of the next superstep. For mutating the graph, we provide basic structure-modification functions, add/delete edge/vertex, which can modify any part of the graph, not just the local vertex structure. A vertex also indicates in its processing function whether it wants to be deactivated; a deactivated vertex is reactivated if it receives an update from any other vertex. Algorithm 2 shows the pseudocode for the community detection program using the most-frequent-label propagation method.
A programmer can give the framework several hints to further optimize performance, indicating whether the program a) requires adjacency lists, b) requires adjacency edge weights, or c) performs any graph structural changes. Note that a compiler could also infer these hints.
3.9. Analysis of I/O costs
SSDs have a hierarchical organization, and the minimum granularity at which they can be accessed is a NAND page. We therefore base our I/O analysis on the number of NAND pages accessed. Note that we assume all I/O reaches the NAND pages and ignore the buffer in the SSD's DRAM, as it is small and typically used for staging a NAND page before transferring it to the host system.
In each iteration, storage is accessed for logging the updates and for reading the graph data. Since updates are appended to a log and the log is read sequentially, the amount of storage accessed for updates is proportional to the number of updates generated by the active vertices. Note that the log is first appended to the main memory buffer, and only fully written pages are evicted to storage to create space; partially written pages remain in main memory, so storage is accessed only for full pages. During an iteration, graph data is accessed only for the active vertices. Because an active vertex's graph data may be spread across several SSD pages and the minimum access granularity is an SSD page, the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active vertices' data. If V_a vertices are active in an iteration, each active vertex has on average d edges worth of data, and r is the read amplification factor due to reading data from the SSD in page granularities, then the amount of storage accessed in an iteration is r * V_a * d, which is O(E). This is optimal, as the edge data of the active vertices must be accessed at least once in each superstep. If each active vertex generates on average u updates, then V_a * u updates are generated, and the amount of data accessed from storage for updates in an iteration is at most 2 * V_a * u update records, once each for writing and reading.
4. System design and Implementation
We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use Samsung 860 EVO SSDs (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.
To simultaneously load pages from several non-contiguous locations in the SSD using minimal host-side resources, we use asynchronous kernel IO. To match the SSD page size and load data efficiently, we perform all IO accesses in granularities of 16 KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations. The SSD page size may keep increasing to accommodate higher capacities and IO speeds: SSD vendors are packing more bits into each cell to increase density, which leads to larger SSD page sizes (enterprise_storage, ).
We used OpenMP to parallelize the code across multiple cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for the vertex id. Locks were used sparingly, only as necessary to synchronize between threads. With our implementation, our system can achieve 80% of the peak bandwidth between the storage and the host system.
Baseline: We compare our results with the popular out-of-core GraphChi framework. While comparing with GraphChi, we use the same host-side memory cache size as the size of the multilog buffer used in the PartitionedVC system. In both our implementation and GraphChi's, we limit the memory usage to GB. In our implementation, we limit memory usage by limiting the total size of the multilog buffer. GraphChi provides an option to specify the memory budget it can use. We maximized GraphChi's performance by enabling the multiple auxiliary threads that GraphChi may launch. As such, GraphChi also achieves peak storage access bandwidth.
Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets: one from the popular SNAP collection (snapnets, ), and a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). Both graphs are undirected; for each edge, each end vertex appears in the neighbor list of the other end vertex. Table 1 shows the number of vertices and edges for these graphs.
Dataset name          Number of vertices   Number of edges
com-friendster (CF)   124,836,180          3,612,134,270
YahooWebScope (YWS)   1,413,511,394        12,869,122,070
Table 1. Graph dataset

5. Applications
To illustrate the benefits of our framework, we evaluate several graph applications, which are:
BFS: We consider whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end and destination ids at several levels along the longest path, at 3 equal intervals, so that around the same number of vertices is visited in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.
Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph algorithm. In our implementation, a vertex receives and accumulates delta updates from its neighbors, and it is activated if it receives a delta update greater than a certain threshold value (0.4). As described earlier, PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.
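A minimal sketch of this delta-accumulation logic. Only the 0.4 activation threshold comes from the text; the damping constant, function names, and return convention are illustrative assumptions.

```python
DAMPING = 0.85    # assumed damping factor (not specified in the text)
THRESHOLD = 0.4   # activation threshold from the text

def accumulate(updates):
    # Associative/commutative combine: deltas can be merged in any order.
    return sum(updates)

def vertex_program(rank, delta, out_degree):
    """Apply an accumulated delta; return the new rank and the delta to
    forward to each out-neighbor (None if neighbors stay inactive)."""
    new_rank = rank + delta
    out_delta = DAMPING * delta / max(out_degree, 1)
    # Neighbors are activated only if the forwarded delta is significant.
    return new_rank, (out_delta if out_delta > THRESHOLD else None)
```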
Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned the community label to which most of its neighbors belong. This algorithm uses asynchronous propagation of updates between supersteps so that it does not lead to oscillations of labels in graphs that have a bipartite or similar structure.
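The label-selection step can be sketched as follows; breaking ties toward the smaller label is an assumption, not specified in the text.

```python
from collections import Counter

def flp_step(my_label, neighbor_labels):
    """Return the label held by the most neighbors (ties broken toward
    the smaller label); keep the current label if there are no neighbors."""
    if not neighbor_labels:
        return my_label
    counts = Counter(neighbor_labels)
    # Rank by count descending, then by smaller label id.
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]
```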
Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm. In each iteration of this greedy algorithm, a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color ids of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color need to send updates to their neighbors.
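The greedy color choice reduces to picking the smallest non-negative integer absent from the neighbors' colors; a minimal sketch:

```python
def pick_color(neighbor_colors):
    """Return the minimum color id not used by any neighbor."""
    used = set(neighbor_colors)
    c = 0
    while c in used:
        c += 1
    return c
```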
Maximal independent set (MIS): (salihoglu2014optimizing, ) Our maximal independent set algorithm is based on the classical Luby's algorithm. In this algorithm, nodes are selected randomly (in classical Luby's algorithm, with probability 1/(2d(v)) for a node of degree d(v)), and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes, or 2) it has the minimum id among its selected neighboring nodes. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until each node is in one of the lists. In this algorithm, as successive supersteps perform different operations, it is necessary to use synchronous propagation of updates between supersteps for functionally correct execution. Hence, GraphChi and PartitionedVC both use the synchronous update scheme.
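One round of this selection could look as follows. The 1/(2*degree) selection probability is the classical Luby choice and an assumption here; `rand` is injectable only to make the sketch testable.

```python
import random

def luby_round(graph, undecided, rand=random.random):
    # graph: dict vertex -> set of neighbors; undecided: vertices not yet
    # placed in the independent or dependent list.
    selected = {v for v in undecided
                if rand() < 1.0 / (2 * max(len(graph[v]), 1))}
    independent = set()
    for v in selected:
        nbr_sel = selected & graph[v]
        # Keep v if no neighbor was selected, or v has the minimum id
        # among its selected neighbors.
        if not nbr_sel or v < min(nbr_sel):
            independent.add(v)
    # Neighbors of independent vertices move to the dependent list.
    dependent = set()
    for v in independent:
        dependent |= graph[v] & undecided
    return independent, dependent
```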
KCore: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, the node deletes itself and its neighboring edges, and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between supersteps.
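A sketch of one synchronous k-core superstep over an in-memory adjacency representation; this is illustrative only, since the real system operates through logged updates rather than direct neighbor mutation.

```python
def kcore_superstep(graph, k):
    """Delete every vertex with degree < k, notifying its neighbors.
    graph: dict vertex -> set of neighbors (mutated in place).
    Returns True if anything was deleted (another superstep is needed)."""
    # Deletions are decided first, mirroring end-of-superstep semantics.
    doomed = [v for v, nbrs in graph.items() if len(nbrs) < k]
    for v in doomed:
        for u in graph[v]:
            graph[u].discard(v)  # the "update" sent to each neighbor
        del graph[v]
    return bool(doomed)
```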
Due to the extremely high computational load, for all the applications we ran 15 supersteps, or fewer if the problem converged before that. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).
6. Experimental evaluation
Figure 5. Application performance
Figure 6. Application performance
Figure (a) shows the performance comparison of the BFS application on our PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph size. Hence, an X-axis value of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present the performance normalized to GraphChi's performance. Therefore, the Y-axis indicates the performance ratio, which is the application execution time on GraphChi divided by the application execution time on the PartitionedVC framework.
On average, BFS performs times better on PartitionedVC when compared to GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access pattern: initially, as the search starts from the source node, it keeps widening. Consequently, the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance of PartitionedVC is much higher in the initial supersteps and then reduces in later intervals. Figure 7 validates this assertion: it shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data when using 0.1 (10%) traversals. However, as the traversal need increases, GraphChi loads only 5X more pages. As such, the performance improvements seen in BFS are much higher with PartitionedVC when only a small fraction of the graph needs to be traversed. Figure 8 shows the distribution of the total execution time, split between storage access time (the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when a smaller fraction of the graph must be traversed, storage access time is about 75% of the total; however, as the traversal demands increase, storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.
Figure (b) shows the performance comparison of the pagerank application on the PartitionedVC and GraphChi frameworks. The X-axis shows the two graph datasets that we used, and the Y-axis is the performance normalized to GraphChi. On average, pagerank performs times better with PartitionedVC. Unlike BFS, pagerank has the opposite traversal pattern: in the early supersteps, many of the vertices are active and many updates are generated, but during later supersteps the number of active vertices reduces, and PartitionedVC performs better when compared to GraphChi. Figure (a) shows the performance of PartitionedVC compared to GraphChi over several supersteps. Here the X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps, PartitionedVC has similar performance to GraphChi, or worse in the case of the YWS dataset. The reason is that the size of the log generated is large. But as the supersteps progress and the update size decreases, the performance of PartitionedVC gets better.
Figure 9. Performance comparisons over supersteps
2. Background and Motivation
2.1. Modern SSD platforms
Since much of our work uses SSD-specific optimizations, we provide a very brief overview of SSD architecture. Figure 1 illustrates the architecture of a modern SSD platform. An SSD is equipped with multiple flash memory channels to support high data bandwidth. Multiple dies are integrated into a single NAND flash package using die stacking to fit more storage on the limited platform board. Data parallelism can be achieved per die with a multi-plane or multi-way composition. Each plane or way is divided into multiple blocks, each containing dozens of physical pages. A page is the basic physical storage unit that can be read or written by one flash command. The SSD uses firmware to manage all the idiosyncrasies of accessing a page. To execute this firmware, the major components of an SSD include an embedded processor, DRAM, an on-chip interconnection network, and flash controllers.
2.2. Graph Computational model
In this work, we support the commonly used vertex-centric programming model for graph analytics. The input to a graph computation is a directed graph G = (V, E). Each vertex in the graph has an id between 0 and |V| - 1 and a modifiable, user-defined value associated with it. For a directed edge e = (u, v), we refer to e as the out-edge of u and the in-edge of v. Also, for e = (u, v), we refer to u as the source vertex and v as the target vertex; e may be associated with a modifiable, user-defined value.
A typical vertex-centric computation consists of an input phase, where the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm terminates, and finishes with an output phase. Within each superstep, vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its neighboring edges, generate updates to another vertex, or even mutate the topology of the graph. Edges are not first-class citizens in this model and have no associated computation.
Depending on when the updates are visible to the target vertices, the computational model can be either synchronous or asynchronous. In the synchronous computational model, the updates generated to a target vertex are available to it in the next superstep. Graph systems such as Pregel (malewicz2010pregel, ), and Apache Giraph (apache_giraph, ) use this approach. In an asynchronous computational model, an update to a target vertex is visible to that vertex in the current superstep. So if the target vertex is scheduled after the source vertex in a superstep, then the current superstep’s update is available to the target vertex. GraphChi (graphchi, ) and Graphlab (low2014graphlab, ) use this approach. An asynchronous computational model is shown to be useful for accelerating the convergence of many numerical algorithms (graphchi, ). PartitionedVC supports asynchronous updates.
Graph algorithms can also be broadly classified into two categories based on how the updates are handled. One class of graph algorithms exhibits the associative and commutative property: updates to a target vertex can be combined into a single value, in any order, before processing the vertex. Algorithms such as pagerank, BFS, and single-source shortest path fall in this category. Many other graph algorithms require the update order to be preserved, with each update applied individually. Algorithms such as community detection
(FLP_implementation, ), graph coloring (gonzalez2012powergraph, ), and maximal independent set (malewicz2010pregel, ) fall in this category. PartitionedVC supports both these types of graph algorithms.

2.3. Out-of-core graph processing
In the out-of-core graph processing context, graph sizes are large compared to the main memory size but can fit in the storage capacity of current SSDs (terabytes). As described earlier, GraphChi (graphchi, ) is an out-of-core vertex-centric programming system. GraphChi partitions the graph into several vertex intervals and stores all the incoming edges to a vertex interval as a shard. Figure (b) shows the shard structure for the illustrative graph shown in Figure (a). For instance, shard1 stores all the incoming edges of the first vertex interval, shard2 stores the second interval's incoming edges, and shard3 stores the incoming edges of all the vertices in the last interval. While incoming edges are closely packed in a shard, the outgoing edges of a vertex are dispersed across other shards. In this example, a vertex's outgoing edges may be dispersed across shard1, shard2, and shard3. Another unique property of the shard organization is that each shard stores all its in-edges sorted by source vertex.
GraphChi relies on this shard organization to process vertices in intervals. It first loads into memory the shard corresponding to one vertex interval, as well as all the outgoing edges of those vertices, which may be stored across multiple shards. Updates generated during processing are asynchronously and directly passed to the target vertices through the outgoing edges in other in-memory shards. Once the processing of a vertex interval in a superstep is finished, its corresponding shard and its outgoing edges in other shards are written back to the disk.
Using the above approach, GraphChi primarily relies on sequential accesses to disk data and minimizes random accesses. However, in the following superstep, a subset of vertices may become active (if they received any messages on their in-edges). The in-edges to a vertex are stored in a shard, and they can come from any source vertex. Since the in-edges in a shard are sorted by source vertex id, even if a single vertex is active within a vertex interval, the entire shard must be loaded, because the in-edges for that vertex may be dispersed throughout the shard. For instance, if any vertex of the last interval is active, the entire shard3 must be loaded. Loading a shard may be avoided only if none of the vertices in the associated vertex interval is active. However, in real-world graphs, the vertex intervals typically span tens of thousands of vertices, and during each superstep the probability that at least one vertex in a given interval is active is very high. As a result, GraphChi in practice ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.
2.4. Active graph
To quantify the amount of superfluous loading that must be performed, we counted the active vertex and active edge counts in each superstep while running the graph coloring application described in section 5 over the datasets shown in Table 1. For this application, we ran a maximum of 15 supersteps. Figure 3 shows the active vertex and active edge counts over these supersteps. The X-axis indicates the superstep number, the primary Y-axis shows the number of active vertices divided by the total number of vertices, and the secondary Y-axis shows the number of active edges (updates sent over an edge) divided by the total number of edges in the graph. The fractions of active vertices and active edges shrink dramatically as the supersteps progress. However, at the granularity of a shard, even a few active vertices lead to loading many shards, since the active vertices are spread across the shards.
3. CSR format in the era of SSDs
Given the large loading bandwidth demands of GraphChi, we evaluated a compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information more efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors. The value vector, val, stores the nonzero values of each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row's entries in the val vector. The CSR representation of the example graph is shown in Figure (a). The edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the adjacent outgoing vertices of a vertex in the CSR format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of that vertex are stored contiguously. Because all the outgoing edges of a vertex are stored in a contiguous location, the CSR format minimizes the number of SSD pages accessed, and thus the read amplification, when accessing the adjacency information of the active vertices.

3.1. Challenges for graph processing with a CSR format
While the CSR format looks appealing, it suffers from one fundamental challenge. One can maintain either the in-edge list or the out-edge list in the colIdx vector, but not both (due to coherency issues with keeping the same information in two different vectors). Consider the case where the adjacency list stores only in-edges: during a superstep, the updates sent on out-edges generate many random accesses to the adjacency lists to extract the out-edge information.
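For concreteness, the CSR layout described above can be sketched as follows. The rowPtr/colIdx/val naming follows the text; the construction routine itself is illustrative, not the system's code.

```python
def build_csr(num_vertices, edges):
    """Build rowPtr, colIdx, and val from (src, dst, weight) triples,
    grouping each vertex's out-edges contiguously."""
    row_ptr = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        row_ptr[src + 1] += 1
    for i in range(num_vertices):          # prefix sum: start offsets
        row_ptr[i + 1] += row_ptr[i]
    col_idx = [0] * len(edges)
    val = [0] * len(edges)
    fill = row_ptr[:-1].copy()             # next free slot per vertex
    for src, dst, w in edges:
        pos = fill[src]
        col_idx[pos] = dst
        val[pos] = w
        fill[src] += 1
    return row_ptr, col_idx, val

def out_neighbors(row_ptr, col_idx, v):
    # One rowPtr lookup yields the contiguous slice of v's out-edges.
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```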
3.2. Two Key Observations: Using a log and splitting the log
To avoid the random access problem with the CSR format, we make our first key observation: updates to the out-edges do not need to be propagated using the adjacency list directly. Instead, these updates can simply be logged. Thus, we propose to log the updates sent between the vertices, instead of directly updating the target vertex location. In a superstep, we log all the vertex updates; in the next superstep, we group the messages by target vertex and pass them to that target vertex.
One could maintain a single log for all the updates and parse it in the next superstep. However, as the multiple messages sent to a vertex may be spread all over the log, one may need to perform an external sort over a large number of updates. Our second key observation is that we can instead maintain a separate log for each collection of vertices. As such, we create a coarse-grained log for each interval of vertices that stores all the updates bound to those vertices.
We partition the graph into several vertex intervals and use a log for each interval. We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory, and used for processing by the vertices in that interval.
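The per-interval logging can be sketched as follows; the interval size and routing function are hypothetical, chosen only for illustration.

```python
INTERVAL_SIZE = 4  # vertices per interval (tiny, for the example)

def interval_of(vertex):
    return vertex // INTERVAL_SIZE

def log_update(logs, target, value):
    # Append to the target interval's log instead of random-accessing
    # the target vertex; logs are grouped by target in the next superstep.
    logs.setdefault(interval_of(target), []).append((target, value))
```

In the next superstep, each interval's log is loaded and grouped by target vertex, which remains an in-memory sort as long as the interval's log fits in host memory.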
3.3. Multilog Architecture
Given the multilog architecture, the vertex-centric programming model can be implemented as follows. In a superstep, we loop over each of the vertex intervals for processing. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight in the algorithm is that we load only active vertex data in step 5, rather than the entire graph. There is, however, a small penalty we pay for logging the updates, which requires us to sort each vertex interval's log based on the target vertex id of each update. As long as each vertex interval log fits in main memory, we can do in-memory sorting. Furthermore, as we described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, thereby shrinking the cost of managing and sorting the log.
Figure 4 shows the software components used to realize the framework, which is described in detail in the following paragraphs. First, we describe the framework for the synchronous computation model (described in subsection 2.2) and then later extend it to asynchronous computation model.
MultiLog Unit: Handling Updates - This component handles storing and retrieving the updates generated by the vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass an update to the target vertex. The update function calls MultiLogVC's runtime system, transparently to the programmer. The runtime system invokes the multilog unit to log the update. A log is maintained for each vertex interval, and an update generated to a target vertex in the current superstep is appended to the target vertex interval's log. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.
To implement logging efficiently, PartitionedVC first caches the log writes in main memory buffers, called the multilog memory buffers. Buffering helps to reduce fine-grained writes, which in turn reduces write amplification to the SSD storage. Note that flash memory in SSDs can only be written at page granularity. As such, PartitionedVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex present in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few (<5) thousand. Hence, at least several thousand pages may be allocated in the multilog memory buffer at one time.
For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multilog unit, the top page of the vertex interval that the update is bound for is first identified. As updates are simply appended to the log, an update that fits in the available space of the top page is written into it. If there is not enough space in the top page, a new page is allocated by the multilog unit, and that new page becomes the top page for that vertex interval's log. We maintain a simple mapping table, indexed by vertex interval, to identify the top page.
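The top-page append path can be sketched as follows. The 16 KB page size follows the granularity mentioned earlier; the class and record layout are illustrative assumptions.

```python
PAGE_BYTES = 16 * 1024  # assumed SSD page size

class IntervalLog:
    """Per-interval log: a partially filled top page plus sealed pages."""

    def __init__(self):
        self.top = bytearray()   # top page; stays in memory until full
        self.sealed_pages = []   # full pages, candidates for eviction

    def append(self, record: bytes):
        if len(self.top) + len(record) > PAGE_BYTES:
            # Top page cannot fit the update: seal it and allocate a new
            # top page (sealed pages may later be evicted to the SSD log).
            self.sealed_pages.append(bytes(self.top))
            self.top = bytearray()
        self.top += record
```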
When the available free space in the multilog buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode of graph updates, there is one log file per vertex interval stored in the SSD. When a log page is evicted from memory, it is appended to the corresponding vertex interval's log file. While the multilog architecture may theoretically store many log pages in the SSD, in practice we noticed that evictions to the SSD are needed only when the multilog buffer overflows the size of main memory. But as we mentioned earlier, the total size of the active graph changes in each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log file size is proportional to the number of updates, it also shrinks when the active graph size is small. Hence the log for each vertex interval is mostly cached in memory as the supersteps progress.
VC Unit: Handling Data Retrievals - The updates bound to each vertex interval are logged, either in memory or in the SSD on overflow, as described above. When the next superstep begins, the updates received by each vertex in the previous superstep must be processed by that vertex. Note that the updates bound to a given vertex are dispersed throughout the log associated with that vertex's interval. Hence, it is necessary to first group all the updates in that log before initiating a vertex processing function. The VC unit is responsible for this task. At the start of each vertex interval's processing, the VC unit reads the corresponding log and groups all the messages bound for each vertex in that interval.
As described in the background section, some graph algorithms have the associative and commutative property on updates, so the updates can be merged in any order. For such programs, along with the vertex processing function, we provide an accumulation function. The programmer uses the accumulation function to specify the combine operation for the updates. The VC unit applies this function to all the incoming updates to a target vertex in a superstep before the target vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the page rank application. Hence, the VC unit can optimize performance automatically whenever an accumulation function is defined for a graph algorithm. For programs without the associative and commutative property, the updates are grouped based on target vertex id, and the update function is called individually for each update.
Graph Loader Unit: As is the case with any graph algorithm, the programmer decides when a vertex becomes active or inactive. This is typically specified as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active vertex bit for each vertex in main memory and updates that bit during each superstep. This active vertex bit mask is used to decide which vertices to process in the next superstep.
As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data of the vertices present in the active vertex list (line 5 in Algorithm 1). It maintains a row buffer for loading the row pointers and a buffer for each kind of vertex data (adjacency edge lists/weights). The graph loader unit loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many entries as fit in the row pointer buffer. For the vertices that are active in the row pointer buffer, vertex data (in-edge/out-edge neighbors and in-edge/out-edge weights) is fetched from the colIdx or val vectors stored in the SSD, accessing only the SSD pages that contain active vertex data. The VC unit indicates which vertex data to load, such as in-edge or out-edge adjacent neighbors or edge weights, as not all the vertex data may be required by the application program. The graph loader unit uses double buffering so that it can overlap loading vertex data from storage with the VC unit's processing of the vertex data already loaded into the buffer.
3.4. Design Choices: Graph Vertex Interval
The first design choice in the PartitionedVC implementation is the size of each vertex interval. If each vertex interval has only a few vertices, there will be many vertex intervals. Recall that during update propagation, any vertex interval may update a target in any other vertex interval. Hence, having more intervals increases the overhead of vertex interval processing and also requires more time to route updates to a target vertex interval. On the other hand, having too many vertices in a single vertex interval can lead to memory overflow: while processing a vertex interval, it is important that the updates to that vertex interval all fit in main memory. Typically, in vertex-centric programming, updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most its number of incoming edges. Since fitting each vertex interval's updates in main memory is a critical need, we conservatively assume, for the purpose of determining the vertex interval size, that there may be an update on each incoming edge of a vertex. We statically partition the vertices into contiguous segments such that the sum of the worst-case incoming updates to the vertices in a segment is less than the main memory size provided. This size could be limited by the administrator or the application programmer, or simply by the size of the virtual machine allocated for graph processing.
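The static partitioning rule above can be sketched as follows; the function and parameter names are illustrative.

```python
def partition_intervals(in_degrees, update_bytes, mem_budget):
    """Greedily grow contiguous vertex intervals until the worst-case
    update volume (sum of in-degrees * update size) would exceed the
    memory budget. Returns half-open [start, end) intervals."""
    intervals, start, acc = [], 0, 0
    for v, deg in enumerate(in_degrees):
        cost = deg * update_bytes
        if acc + cost > mem_budget and v > start:
            intervals.append((start, v))
            start, acc = v, 0
        acc += cost
    intervals.append((start, len(in_degrees)))
    return intervals
```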
Due to our conservative assumption that there may be a message on each incoming edge, the size of a vertex interval may be small. But at runtime, the updates received by a vertex can be fewer than its number of incoming edges. For each vertex interval, we keep a counter that tracks the number of updates sent to that vertex interval in the current superstep. At the beginning of the next superstep, PartitionedVC's runtime may dynamically fuse contiguous vertex intervals into a single large interval to process at once. Such dynamic fusing enables efficient use of memory during each superstep.
3.5. Design Choices: Graph Data Sizes
Currently, our system supports an arbitrary structure of data associated with a vertex, but for efficiency reasons we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data, our system can easily handle the algorithm. However, if the structure of the data morphs, and in particular if its size grows arbitrarily, it is less efficient to dynamically alter the CSR organization in the SSD. PartitionedVC still works correctly in this case, but the graph loading process may be slower. Hence, we recommend that the programmer conservatively assign a large enough structure statically to accommodate future growth of the data associated with a vertex.
3.6. Graph structural updates
In vertex-centric programming, the graph structure can be updated during the supersteps. Graph structural updates generated in a superstep are applied at the end of that superstep.
In the CSR format, merging graph structural updates into the column index or value vectors is a costly operation, as one needs to reshuffle the entire vectors. To minimize this costly merging operation, we partition the CSR-format graph based on the vertex intervals, so that each vertex interval's graph data is stored separately in CSR format.
Instead of merging each update directly into the vertex interval's graph data, we batch several structural updates for a vertex interval and merge them into the graph data only after a certain threshold number of structural updates. As graph structural updates generated during vertex processing can target any vertex, we buffer each vertex interval's structural updates in memory. The multilog and VC units always consult these buffered updates to fetch the most current graph data for processing.
3.7. Support for asynchronous computation
In asynchronous vertex-centric programming, if a target vertex is scheduled later than the source vertex generating the update within a superstep, then the update is available to the target vertex in the same superstep. Therefore, we maintain a single multilog file for each vertex interval. The current superstep's updates are appended to the same log as the previous superstep's. Hence, a vertex interval can load all the updates generated for it in either the previous or the current superstep. Current-superstep updates come from the previously scheduled vertex intervals of the current superstep. Note that in asynchronous operation, the current and previous superstep logs are isolated.
To route updates to target vertices that are in the same vertex interval as the source vertex but scheduled after it, we keep two arrays. The first array stores, as linked lists, all the updates to active target vertices in the interval that are not yet scheduled, with a separate linked list per target vertex. The second array, with one entry per vertex in the interval, points to the head of that vertex's linked list in the first array.
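The two-array scheme can be sketched with array-indexed linked lists; the class and method names are illustrative, not the actual implementation:

```python
# Sketch of the two-array routing structure for same-interval updates.
# 'head' has one entry per vertex in the interval; 'pool' stores
# (update, next_index) linked-list nodes for all pending updates.
NIL = -1

class SameIntervalRouter:
    def __init__(self, num_vertices):
        self.head = [NIL] * num_vertices  # per-vertex list head into pool
        self.pool = []                    # (update_value, next) nodes

    def push(self, target, update):
        # Prepend the update to target's linked list in O(1).
        self.pool.append((update, self.head[target]))
        self.head[target] = len(self.pool) - 1

    def drain(self, target):
        # Collect all pending updates for 'target' when it is scheduled.
        out, i = [], self.head[target]
        while i != NIL:
            upd, i = self.pool[i]
            out.append(upd)
        self.head[target] = NIL
        return out
```

Because `push` prepends, `drain` returns updates in reverse arrival order; an order-sensitive program would reverse the list before applying.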
In a similar fashion, graph structural updates within the same interval are passed to the target vertex using the two arrays. As described in subsection 3.6, these structural updates are written to storage as a log at the end of the interval's processing to make them persistent.
3.8. Programming model
A vertex processing function is provided for each vertex; the main logic for a vertex is written in this function. In this function, each vertex can access its vertex data, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes vertex values, in-edge/out-edge lists, and in-edge/out-edge labels. Communication between vertices is implemented by sending updates; each vertex can send an update to any vertex. In the synchronous computation model, updates are delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices are delivered to the target vertices, which can be from either the current superstep or the previous superstep. Vertices can modify the graph structure, and these modifications are completed by the start of the next superstep. For mutating the graph, we provide basic graph structure modification functions, add/delete edge/vertex, which can modify any part of the graph, not just the local vertex structure. A vertex can also indicate in its processing function that it wants to be deactivated; a deactivated vertex is reactivated if it receives an update from any other vertex. Algorithm 2 shows the pseudocode for the community detection program using the most frequent label propagation method.
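A minimal sketch of what a vertex program looks like under this model, using most frequent label propagation in the spirit of the paper's Algorithm 2. The framework hooks (`send_update`, `deactivate`) and the data layout are illustrative assumptions, not the actual API:

```python
# Sketch of a vertex processing function: adopt the most frequent label
# among neighbors, propagate on change, deactivate when stable.
from collections import Counter

class Ctx:                      # stand-in for the framework runtime
    def __init__(self):
        self.sent, self.deactivated = [], []
    def send_update(self, target, msg):
        self.sent.append((target, msg))
    def deactivate(self, vid):
        self.deactivated.append(vid)

class Vertex:                   # stand-in for local vertex data
    def __init__(self, vid, label, out_neighbors):
        self.id, self.label = vid, label
        self.out_neighbors = out_neighbors
        self.neighbor_labels = {}   # stored on in-edges in the real system

def vertex_program(vertex, updates, ctx):
    # 'updates' carries (source id, label) pairs delivered by the framework.
    for src, label in updates:
        vertex.neighbor_labels[src] = label
    if vertex.neighbor_labels:
        new_label = Counter(vertex.neighbor_labels.values()).most_common(1)[0][0]
    else:
        new_label = vertex.label
    if new_label != vertex.label:
        vertex.label = new_label
        for nbr in vertex.out_neighbors:
            ctx.send_update(nbr, (vertex.id, new_label))
    else:
        ctx.deactivate(vertex.id)   # reactivated on a future update
```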
A programmer can provide several hints to the framework to further optimize performance: whether the program a) requires adjacency lists, b) requires adjacency edge weights, or c) performs any graph structural changes. Note that a compiler could also infer these hints.
3.9. Analysis of I/O costs
SSDs have a hierarchical organization, and the minimum granularity at which they can be accessed is a NAND page. So we perform our I/O analysis based on the number of NAND pages accessed. Note that we assume all I/O accesses reach the NAND pages and ignore the buffer present in the SSD DRAM, as it is small and typically used only for staging a NAND page before transferring it to the host system.
In each iteration, storage is accessed for logging the updates and for accessing the graph data. As updates are appended to a log, and the log is read sequentially, the amount of storage accessed is proportional to the number of updates generated by the active vertices. Note that the log is first appended to the main memory buffer, and to create space in the buffer we evict only fully written pages to storage; partially written pages remain in main memory, so storage is accessed only for fully written pages. During an iteration, graph data is accessed only for the active vertices. As the active vertices' graph data may be spread across several SSD pages, and the minimum access granularity is an SSD page, the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active vertices' data. In an iteration, if V_a vertices are active, each with d edges of data on average, and the read amplification factor due to reading from the SSD in page granularities is r, then the amount of storage accessed for graph data is r * V_a * d, which is O(E). This is optimal, as the edge data corresponding to the active vertices has to be accessed at least once in each superstep. If each active vertex generates u updates on average, then V_a * u updates are generated, and the amount of data accessed from storage for updates in an iteration is at most proportional to V_a * u, once each for writing and reading.
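The analysis above can be captured in a small back-of-the-envelope model; all parameter values below are purely illustrative:

```python
# Per-superstep I/O cost model following the analysis above:
# graph bytes = read_amp * active * avg_degree * edge_bytes,
# log bytes   = 2 * active * updates_per_vertex * update_bytes
# (written once, read once, both sequential).
def superstep_io_bytes(active, avg_degree, edge_bytes, read_amp,
                       updates_per_vertex, update_bytes):
    graph = read_amp * active * avg_degree * edge_bytes
    log = 2 * active * updates_per_vertex * update_bytes
    return graph + log

# e.g. 1M active vertices, average degree 20, 8-byte edge entries,
# 1.5x read amplification, 20 updates/vertex of 8 bytes each
total = superstep_io_bytes(1_000_000, 20, 8, 1.5, 20, 8)
```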
4. System design and Implementation
We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use Samsung 860 EVO SSDs (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.
To simultaneously load pages from several non-contiguous locations in the SSD using minimal host-side resources, we use asynchronous kernel I/O. To match the SSD page size and load data efficiently, we perform all I/O accesses in granularities of 16 KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations: SSD page sizes may keep increasing to accommodate higher capacities and I/O speeds, as SSD vendors pack more bits per cell to increase density, which leads to larger SSD pages (enterprise_storage, ).
We used OpenMP to parallelize the code for running on multiple cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for the vertex id. Locks were used sparingly, only as necessary to synchronize between threads. With our implementation, the system achieves 80% of the peak bandwidth between the storage and the host system.
Baseline: We compare our results with the popular out-of-core GraphChi framework. When comparing with GraphChi, we use the same host-side memory cache size as the size of the multilog buffer used in the PartitionedVC system. In both implementations, we limit the memory usage to GB. In our implementation, we limit memory usage by limiting the total size of the multilog buffer; GraphChi provides an option to specify its memory budget. We maximized GraphChi's performance by enabling the multiple auxiliary threads that GraphChi may launch. As such, GraphChi also achieves peak storage access bandwidth.
Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets: one from the popular SNAP collection (snapnets, ), and a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). Both graphs are undirected: for each edge, each end vertex appears in the neighbor list of the other. Table 1 shows the number of vertices and edges for these graphs.
Dataset name | Number of vertices | Number of edges
com-friendster (CF) | 124,836,180 | 3,612,134,270
YahooWebScope (YWS) | 1,413,511,394 | 12,869,122,070
5. Applications
To illustrate the benefits of our framework, we evaluate the following graph applications:
BFS: We test whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end of the longest path and destination ids at several levels along that path, at 3 equal intervals, so that roughly the same number of vertices is visited in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.
Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph algorithm. In our implementation, a vertex receives and accumulates delta updates from its neighbors, and it is activated if it receives a delta update greater than a threshold value (0.4). As described earlier, PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.
Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm. With this algorithm, each node is assigned the community label to which most of its neighbors belong. The algorithm uses asynchronous propagation of updates between supersteps so that it does not lead to oscillation of labels in graphs that have a bipartite or similar structure.
Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy algorithm: in each iteration, a node picks the minimum color id that has not been used by its neighbors. As its local data, each node stores the color ids of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color send updates to their neighbors.
Maximal independent set (MIS): (salihoglu2014optimizing, ) Our maximal independent set algorithm is based on the classical Luby's algorithm. In this algorithm, nodes are selected with a fixed probability, and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes, or 2) it has the minimum id among its selected neighboring nodes. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until each node is in one of the two lists. As successive supersteps perform different operations, synchronous propagation of updates between supersteps is necessary for functionally correct execution; hence, GraphChi and PartitionedVC both use the synchronous update scheme.
KCore: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, the node deletes itself and its neighboring edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between supersteps.
Due to the extremely high computational load, we ran each application for 15 supersteps, or fewer if the problem converged earlier. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).
6. Experimental evaluation
Figure (a) shows the performance comparison of the BFS application on our PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from the given source by traversing a fraction of the total graph; an X-axis value of 0.1 means that the selected source-target pair requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present performance normalized to GraphChi's; the Y-axis thus indicates the performance ratio, i.e., the application execution time on GraphChi divided by that on the PartitionedVC framework.
On average, BFS performs times better on PartitionedVC than on GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a unique access pattern: the search starts from the source node and keeps widening, so the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance advantage of PartitionedVC is much higher in the initial supersteps and then shrinks in later ones. Figure 7 validates this assertion: it shows the ratio of page accesses in GraphChi divided by the page accesses in PartitionedVC. GraphChi loads nearly 80X more data for 0.1 (10%) traversals; however, as the traversal demand increases, GraphChi loads only 5X more pages. As such, the performance improvements of PartitionedVC on BFS are much higher when only a small fraction of the graph needs to be traversed. Figure 8 shows the total execution time split between storage access time (the load time to fetch all the active vertices) and the compute time to process these vertices. The data shows that when a smaller fraction of the graph must be traversed, the storage access time is about 75%; as the traversal demand increases, the storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.
Figure (b) shows the performance comparison of the page rank application on the PartitionedVC and GraphChi frameworks. The X-axis shows the two graph datasets that we used, and the Y-axis is performance normalized to GraphChi. On average, page rank performs times better with PartitionedVC. Unlike BFS, page rank has the opposite traversal pattern: in the early supersteps, many vertices are active and many updates are generated, but during later supersteps the number of active vertices drops and PartitionedVC performs better than GraphChi. Figure (a) shows the performance of PartitionedVC relative to GraphChi over several supersteps, where the X-axis is the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps, PartitionedVC has similar, or in the case of the YWS dataset worse, performance than GraphChi, because the generated log is large. As the supersteps progress and the update size decreases, PartitionedVC's performance improves.
3. CSR format in the era of SSDs
Given the large loading bandwidth demands of GraphChi, we evaluated a compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency-matrix representation of a graph and compresses it using three vectors. The value vector, val, stores all the non-zero values from each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector. The CSR representation of the example graph is shown in Figure (a). The edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the adjacent outgoing vertices of a vertex in the CSR format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of that vertex are stored contiguously. Because all the outgoing edges of a vertex are stored in contiguous locations, the CSR format minimizes the number of SSD pages accessed, and the resulting read amplification, when fetching the adjacency information of the active vertices.
3.1. Challenges for graph processing with a CSR format
While the CSR format looks appealing, it suffers from one fundamental challenge: one can maintain either the in-edge list or the out-edge list in the colIdx vector, but not both (due to the coherency issues of keeping the same information in two different vectors). Consider the case where the adjacency list stores only in-edges: during a superstep, the updates sent along out-edges then generate many random accesses to the adjacency lists to extract the out-edge information.
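The structure just described, and the random-access problem it creates for the mirrored edge direction, can be illustrated with a minimal in-memory sketch (toy Python, not the on-SSD layout):

```python
# Minimal CSR sketch: out-neighbors of a vertex are one contiguous slice
# (few pages touched), while recovering the opposite direction from the
# same structure requires scanning every adjacency list.
def build_csr(num_vertices, edges):
    # edges: list of (src, dst, weight)
    row_ptr, col_idx, val = [0], [], []
    by_src = sorted(edges)
    i = 0
    for v in range(num_vertices):
        while i < len(by_src) and by_src[i][0] == v:
            col_idx.append(by_src[i][1])
            val.append(by_src[i][2])
            i += 1
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, val

def out_neighbors(row_ptr, col_idx, v):
    return col_idx[row_ptr[v]:row_ptr[v + 1]]      # one contiguous slice

def in_neighbors(row_ptr, col_idx, v):
    # Without a second (mirrored) structure, this is a full scan:
    # the random-access problem discussed above.
    return [u for u in range(len(row_ptr) - 1)
            if v in out_neighbors(row_ptr, col_idx, u)]
```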
3.2. Two Key Observations: Using a log and splitting the log
To avoid the random access problem with the CSR format, we make a first key observation: updates to the out-edges do not need to be propagated through the adjacency list directly. Instead, these updates can simply be logged. Thus, we propose to log the updates sent between the vertices, instead of directly updating the target vertex location. In a superstep, we log all the vertex updates; in the next superstep, we group the messages by target vertex and pass them to that target vertex.
One could maintain a single log for all the updates, to be parsed in the next superstep. However, as the multiple messages sent to a vertex may be spread all over such a log, one may need to perform external sorting over a large number of updates. Our second key observation is that we can instead maintain a separate log per collection of vertices. As such, we create a coarse-grained log for each interval of vertices that stores all the updates bound to those vertices.
We partition the graph into several vertex intervals and use one log per interval. We choose the size of a vertex interval such that the entire update log corresponding to that interval can typically be loaded into host memory and used for processing the vertices in that interval.
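Assuming contiguous, fixed-span vertex intervals (the real system sizes intervals by expected update volume, as discussed later), routing updates into per-interval logs might look like the following sketch:

```python
# Sketch of per-interval update logs: the target interval is a simple
# index computation, and each interval's log can be loaded and sorted
# in memory on its own. Span and structures are illustrative.
INTERVAL_SPAN = 4   # vertices per interval (chosen so logs fit in memory)

def interval_of(vertex_id):
    return vertex_id // INTERVAL_SPAN

def log_update(logs, target, value):
    # Append to the log of the interval containing the target vertex.
    logs.setdefault(interval_of(target), []).append((target, value))

def load_interval(logs, interval):
    # One interval's updates, grouped by target via an in-memory sort.
    return sorted(logs.get(interval, []))
```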
3.3. Multilog Architecture
Given the multilog design, the vertex-centric programming model can be implemented as follows. In a superstep, we loop over each of the vertex intervals. For each vertex interval, we load its update log and schedule the vertex processing functions for each of the active vertices in that interval. Algorithm 1 shows the overall framework functionality. What is important to highlight in the algorithm is that we load only active vertex data in step 5, rather than the entire graph. There is, however, a small penalty for logging the updates: each vertex interval's log must be sorted based on the target vertex id of each update. As long as each vertex interval log fits in main memory, we can sort in memory. Furthermore, as described earlier, the active graph size shrinks dramatically with each superstep, and the total log size is proportional to the active graph size. Hence the log size also shrinks with each superstep, shrinking the cost of managing and sorting the log.
Figure 4 shows the software components used to realize the framework, which are described in detail in the following paragraphs. We first describe the framework for the synchronous computation model (described in subsection 2.2) and later extend it to the asynchronous computation model.
MultiLog Unit: Handling Updates. This component handles storing and retrieving the updates generated by the vertices in a superstep. While processing a vertex in a superstep (line 7 in Algorithm 1), the programmer invokes the update function as usual to pass an update to the target vertex. The update function calls MultiLogVC's runtime system, transparently to the programmer. The runtime system invokes the multilog unit to log the update. A log is maintained for each vertex interval, and an update generated for a target vertex in the current superstep is appended to the target vertex interval's log. As we will describe later, these updates are retrieved in the next superstep and processed by the corresponding target vertices.
To implement logging efficiently, PartitionedVC first caches the log writes in main memory buffers, called the multilog memory buffers. Buffering reduces fine-grained writes, which in turn reduces write amplification to the SSD; note that flash memory in SSDs can only be written at page granularity. As such, PartitionedVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few (<5) thousand. Hence, at least several thousand pages may be allocated in the multilog memory buffer at one time.
For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multilog unit, the top page of the vertex interval the update is bound for is identified first. As updates are simply appended to the log, an update that fits in the available space of the top page is written into it. If there is not enough space, the multilog unit allocates a new page, which becomes the top page for that vertex interval's log. We maintain a simple mapping table, indexed by vertex interval, to identify the top page.
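A minimal sketch of this top-page append path; the page size constant and in-memory structures are illustrative:

```python
# Sketch of the multilog buffer's top-page append: each interval has a
# current top page; an update that fits is appended, otherwise the full
# page is queued for eviction and a fresh page becomes the top page.
PAGE_BYTES = 16 * 1024

class MultilogBuffer:
    def __init__(self, num_intervals):
        # top[i] = (page_contents, used_bytes) for interval i's top page
        self.top = [([], 0) for _ in range(num_intervals)]
        self.full_pages = []            # candidates for eviction to SSD

    def append(self, interval, update, size):
        page, used = self.top[interval]
        if used + size > PAGE_BYTES:
            # Top page is full: queue it and start a new top page.
            self.full_pages.append((interval, page))
            page, used = [], 0
        page.append(update)
        self.top[interval] = (page, used + size)
```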
When the available free space in the multilog buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. In the synchronous mode, there is one log file per vertex interval stored in the SSD; when a log page is evicted from memory, it is appended to the corresponding vertex interval's log file. While the multilog architecture may in theory store many log pages to the SSD, in practice we observed that evictions are needed only when the multilog buffer overflows main memory. As mentioned earlier, the total size of the active graph changes each superstep, and in the majority of supersteps the active size is much smaller than the total graph size. Since the log file size is proportional to the number of updates, it also shrinks when the active graph is small. Hence, the log for each vertex interval is mostly cached in memory as the supersteps progress.
VC Unit: Handling Data Retrievals. The updates bound to each vertex interval are logged, in memory or, on overflow, in the SSD, as described above. When the next superstep begins, the updates received by each vertex in the previous superstep must be processed by that vertex. Note that the updates bound to a given vertex are dispersed throughout the log associated with its vertex interval. Hence, all the updates in that log must be grouped before initiating the vertex processing functions. The VC unit is responsible for this task: at the start of each vertex interval's processing, it reads the corresponding log and groups all the messages bound for each vertex in that interval.
As described in the background section, some graph algorithms have associative and commutative updates, which can therefore be merged in any order. For such programs, along with the vertex processing function, we provide an accumulation function: the programmer uses it to specify the combine operation for updates. The VC unit applies this function to all the incoming updates to a target vertex in a superstep before the target vertex's processing function is called. Algorithm 3 shows how the accumulation function is specified for the page rank application. Hence, the VC unit can optimize performance automatically whenever an accumulation function is defined for a graph algorithm. For non-associative, non-commutative programs, the updates are grouped based on the target vertex id, and the update function is called individually for each update.
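A sketch of this accumulate path, in the spirit of the paper's Algorithm 3 for page rank; the function names are illustrative:

```python
# Sketch of the VC unit's accumulate path: associative, commutative
# updates to the same target are folded into one combined update before
# the target's vertex processing function runs.
def accumulate_pagerank(acc, delta):
    # Page rank delta updates combine by simple addition.
    return acc + delta

def fold_updates(log, accumulate):
    # log: iterable of (target_vertex, update_value) from the interval log
    merged = {}
    for target, value in log:
        if target in merged:
            merged[target] = accumulate(merged[target], value)
        else:
            merged[target] = value
    return merged   # one combined update per target vertex
```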
Graph Loader Unit: As with any graph algorithm, the programmer decides when a vertex becomes active, typically as part of the vertex processing function or the accumulation function. PartitionedVC maintains an active-vertex bit for each vertex in main memory and updates it during each superstep. This active-vertex bitmask determines which vertices to process in the next superstep.
As described earlier, PartitionedVC uses the CSR format to store graphs, since CSR is more efficient for loading a collection of active vertices. The graph loader unit is responsible for loading the graph data for the vertices present in the active vertex list (line 5 in Algorithm 1). It maintains a row buffer for loading the row pointers and a buffer for each kind of vertex data (adjacency edge lists/weights). The unit loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many row pointers as fit in the row pointer buffer. For the vertices in the row pointer buffer that are active, the vertex data (in-edge/out-edge neighbors and in-edge/out-edge weights) is fetched from the colIdx or val vector stored in the SSD, accessing only the SSD pages that have active vertex data. The VC unit indicates which vertex data to load (for example, in-edge or out-edge neighbors, or edge weights), as not all the vertex data may be required by the application program. The graph loader unit uses double buffering so that loading vertex data from storage overlaps with the VC unit's processing of the data already loaded into the buffer.
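How the loader can map active vertices to the SSD pages worth fetching might be sketched as follows; the entry and page sizes are illustrative assumptions:

```python
# Sketch: translate an active-vertex bitmap plus the rowPtr vector into
# the set of colIdx pages to fetch, touching only pages that hold active
# vertices' edge data.
PAGE_BYTES = 16 * 1024
EDGE_BYTES = 4                      # 4-byte vertex ids in colIdx

def pages_to_fetch(row_ptr, active):
    pages = set()
    for v, is_active in enumerate(active):
        if not is_active or row_ptr[v] == row_ptr[v + 1]:
            continue                # inactive, or no edges to load
        first = row_ptr[v] * EDGE_BYTES // PAGE_BYTES
        last = (row_ptr[v + 1] - 1) * EDGE_BYTES // PAGE_BYTES
        pages.update(range(first, last + 1))
    return sorted(pages)            # issue one read per page, in order
```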
3.4. Design Choices: Graph Vertex Interval
The first design choice in the PartitionedVC implementation is the size of each vertex interval. If each vertex interval has only a few vertices, there will be many vertex intervals. Recall that during update propagation any vertex interval may update a target in any other vertex interval; hence, having more intervals increases the overhead of vertex interval processing and the time spent routing updates to target vertex intervals. On the other hand, having too many vertices in a single interval can overflow memory: while processing a vertex interval, the updates to that interval should all fit in main memory. Typically, in vertex-centric programming, updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most its number of incoming edges. Since fitting each vertex interval's updates in main memory is critical, we conservatively assume, for the purpose of sizing the intervals, that there may be an update on each incoming edge of a vertex. We statically partition the vertices into contiguous segments such that the sum of the worst-case incoming updates to the vertices in a segment is less than the provided main memory size. This size could be set by administrators or the application programmer, or simply limited by the size of the virtual machine allocated for graph processing.
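This static partitioning rule can be sketched as follows, with illustrative sizes:

```python
# Sketch of static interval sizing: cut contiguous vertex ranges so that
# the worst-case update volume (one update per in-edge) fits in the
# given memory budget.
def partition_intervals(in_degrees, update_bytes, mem_budget_bytes):
    intervals, start, used = [], 0, 0
    for v, deg in enumerate(in_degrees):
        cost = deg * update_bytes       # worst case: update on every in-edge
        if used + cost > mem_budget_bytes and v > start:
            intervals.append((start, v))        # half-open [start, v)
            start, used = v, 0
        used += cost
    intervals.append((start, len(in_degrees)))
    return intervals
```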
Due to our conservative assumption that there may be a message on each incoming edge, vertex intervals may be small, while at runtime the updates received by a vertex can be far fewer than its incoming edges. For each vertex interval, we therefore keep a counter that tracks the number of updates sent to that interval in the current superstep. At the beginning of the next superstep, PartitionedVC's runtime may dynamically fuse contiguous vertex intervals into a single large interval that is processed at once. Such dynamic fusing makes efficient use of memory during each superstep.
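The counter-driven fusing might look like the following sketch, assuming half-open intervals and illustrative sizes:

```python
# Sketch of dynamic interval fusing: per-interval counters of updates
# actually received this superstep let contiguous intervals be merged
# until the real (not worst-case) update volume fills the budget.
def fuse_intervals(intervals, update_counts, update_bytes, mem_budget):
    fused, cur, used = [], None, 0
    for iv, count in zip(intervals, update_counts):
        cost = count * update_bytes
        if cur is None:
            cur, used = iv, cost
        elif used + cost <= mem_budget:
            cur, used = (cur[0], iv[1]), used + cost   # extend fused range
        else:
            fused.append(cur)
            cur, used = iv, cost
    if cur is not None:
        fused.append(cur)
    return fused
```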
3.5. Design Choices: Graph Data Sizes
Currently our system supports any arbitrary structure of data associated with a vertex. But for efficiency reasons we assume that the structure does not morph dynamically. The graph loader unit loops over the active vertex list and loads each structure into memory. Hence, as long as the programmer assigns a large enough structure to hold all the vertex data our system can easily handle that algorithm. However, if the structure of the data morphs and in particular the size grows arbitrarily it is less efficient to dynamically alter the CSR format organization in SSDs. However, PartitionedVC still works functionally correctly but the graph loading process may be slower. Hence, we recommend that the programmer may conservatively assign a large enough structure statically to enable future growth of data associated with a vertex.
3.6. Graph structural updates
In vertexcentric programming, graph structure can be updated during the supersteps. Graph structure updates in a superstep can be applied at the end of the superstep.
In the CSR format, merging the graph structural updates into the column index or value vectors is a costly operation, as one needs to reshuffle the entire column vectors. To minimize the costly merging operation, we partition the CSR format graph based on the vertex intervals, so that each vertex interval’s graph data is stored separately in the CSR format.
Instead of merging each update directly into the vertex interval’s graph data, we batch several structural updates for a vertex interval and later merge them into the graph data after a certain threshold number of structural updates. As graph structural updates generated during the vertex processing can be to any vertex, we buffer each vertex interval’s structural updates in memory. The multilog and VC units always access these buffered updates to accurately fetch the most current graph data for processing.
3.7. Support for asynchronous computation
In asynchronous vertexcentric programming, in a superstep, if a target vertex is scheduled later than the source vertex generating the update, then the update is available to the target vertex in the same superstep. Therefore, we maintain a single multilog file for each vertex interval. The current superstep’s updates are also appended to the same log as the previous iteration. Hence, the vertex interval’s can load all the updates which are generated to them in the previous or current superstep. Current updates are made by the previously scheduled vertex intervals in the current superstep. Note that in asynchronous operation the current and previous superstep logs are isolated.
To facilitate routing of the updates to the target vertices that are scheduled within the same vertex interval, but later than the source vertex, we keep two arrays. All the updates to active target vertices which are in the same vertex interval, but are not yet scheduled are kept in one array as a linked list, each target vertex having a separate linked list. Another array, having an entry for each vertex in that vertex interval, points to the start of it’s linked list in the first array.
In a similar fashion, for the graph structural updates that are within the same interval are passed to the target vertex using the two arrays. As described in the subsection 3.6 these graph structural updates are written to the storage as a log at the end of the interval to make them persistent.
3.8. Programming model
For each vertex, vertex processing function is provided. The main function logic for a vertex is written in this function. Each vertex can access it’s vertex data in this function, send updates to other vertices, and mutate the graph. Vertices can access and modify their local vertex data, which includes: vertex values, inedge/outedge lists, inedge/outedge labels. Communication between the vertices is implemented using sending updates, each vertex can send an update to any vertex. For the synchronous computation model, updates will be delivered to the target vertex by the start of its vertex processing in next superstep. In the asynchronous computation model, the latest updates from the source vertices will be delivered to the target vertices, which can either be from the current superstep or the previous superstep. Vertices can modify the graph structure and these graph modifications will be finished by the start of next superstep. For the mutating graph, we provide a basic graph structure modification functions, add/delete the edge/vertex, which can modify any part of the graph, not just local vertex structure. The vertex also indicates in the vertex processing function if it wants to be deactivated. If a vertex is deactivated, it will be reactivated again if it receives an update from any other vertex. Algorithm 2 shows the pseudocode for the community detection program using the most frequent label propagation method.
A programmer can indicate several hints to the framework to further optimize performance. A programmer can indicate whether the program, a) requires adjacency lists, b) requires adjacency edgeweights, c) performs any graph structural changes. Note that one can use a compiler also to infer these hints.
3.9. Analysis of I/O costs
SSDs have a hierarchical organization and the minimum granularity at which one can access it is NAND page size. So we perform our I/O analysis based on the number of NAND pages accessed. Note that here we assume that all the I/O accesses the NAND pages and ignore the buffer present in the SSD DRAM, as it is small and typically used for buffering the NAND page before transferring it to the host system.
In each iteration, storage is accessed for logging the updates and for accessing the graph data. Since updates are appended as a log, and the log is read sequentially, the amount of storage accessed is proportional to the number of updates generated by the active vertices. Note that the log is first appended to a main-memory buffer; to create space in the buffer, we evict fully written pages to storage, so partially written pages remain in main memory only and storage is accessed only for fully written pages. During an iteration, graph data is accessed only for the active vertices. Since the graph data of the active vertices may be spread across several SSD pages, and the minimum granularity for accessing storage is an SSD page, the amount of storage accessed for graph data in a superstep is proportional to the number of SSD pages containing active-vertex data. If V_a vertices are active in an iteration, each active vertex has on average d edges, and the read amplification factor due to reading data from the SSD at page granularity is r, then the amount of storage accessed in an iteration is O(V_a · d · r), which is O(E). This is optimal, as the edge data of the active vertices has to be accessed at least once in each superstep. If each active vertex generates on average u updates, then V_a · u updates are generated, and the amount of data accessed from storage for updates in an iteration is at most proportional to V_a · u, once each for writing and reading.
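The per-superstep I/O cost above can be turned into a back-of-the-envelope model. The function below is our own sketch under assumed notation (active-vertex count, average degree, bytes per edge and per update, read-amplification factor), not a formula taken from the system:

```python
import math

PAGE_BYTES = 16 * 1024  # assumed SSD page size (16 KB)

def superstep_io_bytes(active_vertices, avg_degree, edge_bytes,
                       updates_per_vertex, update_bytes, read_amp):
    """Model of storage traffic in one superstep, following the analysis
    above. Graph data: O(active * degree) bytes, scaled by the page-level
    read amplification. Update log: written once and read once, rounded
    up to full pages since only full pages reach storage."""
    graph_bytes = active_vertices * avg_degree * edge_bytes * read_amp
    update_log_bytes = active_vertices * updates_per_vertex * update_bytes
    update_pages = math.ceil(update_log_bytes / PAGE_BYTES)
    return graph_bytes + 2 * update_pages * PAGE_BYTES  # write + read of log
```

For example, 1000 active vertices with average degree 10, 8-byte edges, read amplification 1.5, and two 8-byte updates per vertex give 120000 bytes of graph traffic plus one log page written and read back.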
4. System design and Implementation
We implemented the PartitionedVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use Samsung 860 EVO SSDs (SSD860EVO, ). We use the Ubuntu operating system running Linux kernel version 3.19.
To simultaneously load pages from several non-contiguous locations in the SSD with minimal host-side resources, we use asynchronous kernel I/O. To match the SSD page size and load data efficiently, we perform all I/O accesses at a granularity of 16 KB, a typical SSD page size (ssd_page_size, ). Note that the load granularity can easily be increased to keep up with future SSD configurations: SSD page sizes may keep growing to accommodate higher capacities and I/O speeds, as SSD vendors pack more bits into each cell to increase density, which leads to larger SSD pages (enterprise_storage, ).
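Page-granular loading can be sketched as follows. This is a simplified, synchronous stand-in for the asynchronous kernel I/O (io_submit/io_getevents) the system actually uses; the function name and interface are our own:

```python
import os

PAGE = 16 * 1024  # assumed SSD page size (16 KB)

def read_pages(fd, byte_offsets):
    """Fetch the distinct 16 KB pages covering the requested byte offsets.
    Deduplicating to whole pages means each page is read from storage at
    most once, regardless of how many requested offsets fall inside it."""
    pages = sorted({off // PAGE for off in byte_offsets})
    return {p: os.pread(fd, PAGE, p * PAGE) for p in pages}
```

In the real runtime, many such page requests for non-contiguous locations would be submitted in one batch so the SSD can serve them in parallel.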
We used OpenMP to parallelize the code across cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for vertex ids. Locks were used sparingly, only as necessary to synchronize between threads. With our implementation, the system achieves 80% of the peak bandwidth between storage and the host system.
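The CSR layout implied by the rowPtr vector, with the stated type widths (8-byte row pointers, 4-byte vertex ids), can be sketched with the standard-library `array` module; this is an illustrative reconstruction, not the implementation's actual code:

```python
from array import array

def build_csr(num_vertices, edges):
    """Build a CSR adjacency: 'q' gives 8-byte row pointers (rowPtr) and
    'I' gives 4-byte unsigned vertex ids, matching the widths stated above.
    edges: iterable of (src, dst) pairs."""
    deg = [0] * num_vertices
    for s, _ in edges:
        deg[s] += 1
    row_ptr = array('q', [0] * (num_vertices + 1))
    for v in range(num_vertices):          # prefix sum of degrees
        row_ptr[v + 1] = row_ptr[v] + deg[v]
    col_idx = array('I', [0] * len(edges))
    fill = list(row_ptr[:-1])              # next free slot per row
    for s, d in edges:
        col_idx[fill[s]] = d
        fill[s] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Adjacency list of v: the contiguous slice [rowPtr[v], rowPtr[v+1])."""
    return list(col_idx[row_ptr[v]:row_ptr[v + 1]])
```

The contiguous per-vertex slices are what make it cheap to load only the pages holding active vertices' edges.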
Baseline: We compare our results with the popular out-of-core GraphChi framework. When comparing with GraphChi, we use the same host-side memory cache size as the multi-log buffer used in the PartitionedVC system. In both our implementation and GraphChi's, we limit memory usage to the same budget: in our implementation by limiting the total size of the multi-log buffer, and in GraphChi through its option to specify a memory budget. We maximized GraphChi's performance by enabling the multiple auxiliary threads that GraphChi may launch; as such, GraphChi also achieves peak storage access bandwidth.
Graph dataset: To evaluate the performance of PartitionedVC, we selected two real-world datasets: one from the popular SNAP collection (snapnets, ) and a popular web graph from the Yahoo Webscope dataset (yahooWebscopre_graph, ). Both graphs are undirected: for each edge, each endpoint appears in the neighbor list of the other. Table 1 shows the number of vertices and edges for these graphs.
Dataset name         | Number of vertices | Number of edges
com-friendster (CF)  | 124,836,180        | 3,612,134,270
YahooWebScope (YWS)  | 1,413,511,394      | 12,869,122,070
5. Applications
To illustrate the benefits of our framework, we evaluate the following graph applications:
BFS: We test whether a given target node is reachable from a given source node. For evaluating BFS, we select the source id at one end of the longest path and destination ids at several levels along it, at 3 equal intervals, so that roughly the same number of vertices is visited in each interval. Each superstep explores the next level of vertices. We terminate the search in the superstep in which the destination id is found.
Page rank (PR): (pagerank_implementation, ) Page rank is a classic graph algorithm; in our implementation, a vertex receives and accumulates delta updates from its neighbors, and it is activated if it receives a delta update greater than a threshold value (0.4). As described earlier, PartitionedVC and GraphChi both use asynchronous propagation of updates between supersteps.
Community detection (CD): (FLP_implementation, ) We implement community detection using the most frequent label propagation (FLP) algorithm, in which each node is assigned the community label to which most of its neighbors belong. The algorithm uses asynchronous propagation of updates between supersteps so that it does not lead to oscillation of labels in graphs that have a bipartite or similar structure.
Graph coloring (GC): (gonzalez2012powergraph, ) We implement graph coloring using the greedy graph coloring algorithm, in which, in each iteration, a node picks the minimum color id not used by any of its neighbors. As its local data, each node stores the color ids of its neighbors on the corresponding in-edge weights, so that in a superstep only nodes that have changed their color send updates to their neighbors.
Maximal independent set (MIS): (salihoglu2014optimizing, ) Our maximal independent set algorithm is based on the classical Luby's algorithm. In this algorithm, nodes are selected at random with a fixed probability, and a selected node is added to the independent list if either 1) none of its neighbors is among the selected nodes, or 2) it has the minimum id among its selected neighbors. Neighbors of independent-list nodes are kept in the dependent list. The algorithm runs until each node is in one of the two lists. Because successive supersteps perform different operations, synchronous propagation of updates between supersteps is necessary for functionally correct execution; hence both GraphChi and PartitionedVC use the synchronous update scheme.
KCore: (Quick:2012:UPL:2456719.2457085, ) In each superstep of this algorithm, if a node has fewer than k neighbors, the node deletes itself and its incident edges and sends an update to its neighbors. As the deletions happen at the end of the superstep, we implemented the algorithm using synchronous propagation of updates between supersteps.
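The greedy coloring rule above (pick the minimum color id unused by any neighbor) reduces to a tiny per-vertex step. The sketch below is illustrative, not the evaluated implementation; the neighbor colors would come from the in-edge weights described above:

```python
def greedy_color_step(neighbor_colors):
    """Greedy GC rule: return the minimum non-negative color id not
    used by any neighbor. neighbor_colors: colors currently stored on
    the vertex's in-edge weights."""
    used = set(neighbor_colors)
    c = 0
    while c in used:
        c += 1
    return c
```

A vertex runs this in its processing function and, only if its color changed, sends the new color to its neighbors, which keeps later supersteps sparse.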
Due to the extremely high computational load, for all applications we ran 15 supersteps, or fewer if the problem converged earlier. Many prior graph analytics systems also evaluate their approach by limiting the superstep count (GraFBoost, ).
6. Experimental evaluation
Figure (a) shows the performance comparison of the BFS application on the PartitionedVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a given fraction of the total graph; an X-axis value of 0.1 means that the selected source-target pair requires traversing 10% of the graph before the target is reached. We ran BFS with different traversal demands. In all the charts we present performance normalized to GraphChi; the Y-axis therefore shows the performance ratio, i.e., application execution time on GraphChi divided by execution time on PartitionedVC.
On average, BFS performs several times better on PartitionedVC than on GraphChi. The performance benefits come from the fact that PartitionedVC accesses only the required graph pages from storage. BFS has a distinctive access pattern: the search starts from the source node and keeps widening, so the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance advantage of PartitionedVC is much higher in the initial supersteps and shrinks in later ones. Figure 7 validates this assertion: it shows the ratio of page accesses in GraphChi to page accesses in PartitionedVC. GraphChi loads nearly 80X more data for 0.1 (10%) traversals; however, as the traversal demand increases, GraphChi loads only 5X more pages. Hence the performance improvements of PartitionedVC on BFS are much higher when only a small fraction of the graph needs to be traversed. Figure 8 shows the total execution time split between storage access time (the load time to fetch all the active vertices) and the compute time to process them. When a smaller fraction of the graph must be traversed, storage access time is about 75%; as the traversal demand increases, storage access time reaches nearly 90% even with PartitionedVC. Note that with GraphChi the storage access time stays nearly constant at well over 90% of the total execution time.
Figure (b) shows the performance comparison of the pagerank application on the PartitionedVC and GraphChi frameworks. The X-axis shows the two graph datasets we used, and the Y-axis shows performance normalized to GraphChi. On average, pagerank performs several times better with PartitionedVC. Unlike BFS, pagerank has the opposite traversal pattern: in the early supersteps many vertices are active and many updates are generated, but in later supersteps the number of active vertices shrinks and PartitionedVC performs better than GraphChi. Figure (a) shows the performance of PartitionedVC relative to GraphChi over the supersteps, with the X-axis showing the superstep number as a fraction of the total supersteps executed. During the first half of the supersteps, PartitionedVC has similar performance to GraphChi, or worse in the case of the YWS dataset, because the generated log is large. As the supersteps progress and the update size decreases, PartitionedVC's performance improves.
7. Related work
Graph analytics systems are widely deployed in important domains such as drug discovery and social networking; concomitantly, there has been a considerable amount of research on graph analytics systems (low2014graphlab, ; gonzalez2012powergraph, ).
For vertex-centric programs with associative and commutative functions, where a combine function can accumulate vertex updates into a vertex value, GraFBoost implements an external-memory graph analytics system (GraFBoost, ). It logs updates to storage and passes vertex updates in a superstep, using a sort-reduce technique to efficiently combine the updates and apply them to vertex values. In a superstep, GraFBoost accesses the graph pages in storage corresponding to the active vertex list only once; however, it may access storage multiple times for the updates, as it sort-reduces a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages corresponding to the active vertex list once. Also, PartitionedVC supports the complete vertex-centric programming model rather than just associative and commutative combine functions, which has better expressiveness and makes computation patterns intuitive for many problems (apache_flink, ).
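The multi-log contrast with GraFBoost's single giant log can be sketched as follows. Class and method names are our own illustrative choices, and in-memory lists stand in for the on-SSD log files; in the real system each log is an append-only file associated with one vertex interval:

```python
class MultiLog:
    """Sketch of the multi-log update scheme: an update is appended to
    the log of the partition that owns its destination vertex. A
    superstep then drains each partition's log in one sequential pass,
    so update pages, like graph pages, are read once per superstep."""
    def __init__(self, num_vertices, num_partitions):
        # equal-sized vertex intervals (last one may be smaller)
        self.interval = (num_vertices + num_partitions - 1) // num_partitions
        self.logs = [[] for _ in range(num_partitions)]

    def append(self, dst_vertex, update):
        self.logs[dst_vertex // self.interval].append((dst_vertex, update))

    def drain(self, partition):
        """Read and clear one partition's log (one sequential read)."""
        entries, self.logs[partition] = self.logs[partition], []
        return entries
```

Because updates for one interval are never interleaved with another interval's, no global sort is needed before applying them.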
A recent work (elyasi2019large, ) extends GraFBoost for vertex-centric programs with associative and commutative functions, where a combine function can accumulate vertex updates into a vertex value. It improves performance by avoiding the sorting phase of the log: destination vertices are partitioned so that each partition fits in main memory, and source vertices are replicated across the partitions, so that when processing a partition all the source-vertex graph data can be streamed in and updates to destination vertices are performed entirely in main memory. However, this scheme may need to replicate the source vertices and access edge data multiple times. Extending it to support the complete vertex-centric programming model may be prohibitively expensive: since the number of partitions may be high, the replication cost will also be high. Linearly extrapolating from the data presented in their paper, the replication overhead for 1000 partitions is prohibitively large.
X-Stream (roy2013x, ) and GridGraph (zhu2015gridgraph, ) are edge-centric external-memory systems that aim to access the graph data stored in secondary storage sequentially. Edge-centric systems perform well for programs that stream in all the edge data and update vertex values based on it. However, they are inefficient for programs that make sparse accesses to graph data, such as BFS, or that require access to the adjacency lists of specific vertices, such as random walk.
GraphChi (graphchi, ) is the only external-memory vertex-centric programming system that supports more than associative and commutative combine programs. GraphChi partitions the graph into several vertex intervals, called shards. With shard-based processing, updates by a vertex are passed using shared-memory communication and all updates are performed in memory, so storage is accessed efficiently in a coarse-grained manner. However, GraphChi has to read most of the graph in each iteration, which makes it unsuitable for algorithms that access only part of the graph, such as the widely used BFS, or for algorithms in which not all vertices are active during an iteration. In this work, we compare against GraphChi as a baseline and show considerable performance improvements.
Several works extend GraphChi by trying to use all of the loaded shard or by minimizing the data loaded per shard (ai2017squeezing, ; vora2016load, ). In contrast, we avoid loading data in bulky shards in the first place and access only the graph pages of the active vertices in a superstep.
Semi-external memory systems such as FlashGraph (flashgraph, ) store the vertex data in main memory and achieve high performance. However, when processing on a low-cost system whose available main memory is smaller than the vertex-value data, these systems suffer performance degradation due to fine-grained accesses to the vertex-value vector.
Due to the popularity of graph processing, systems have been developed in a wide variety of settings. In the distributed setting, popular vertex-centric graph analytics systems include Pregel (malewicz2010pregel, ), GraphLab (low2014graphlab, ), and PowerGraph (tiwari_hotpower12, ). In the single-node in-memory setting, Ligra (shun2013ligra, ) provides a graph processing framework optimized for multicore processing.
8. Conclusion
Graph analytics are at the heart of a broad set of applications. In external-memory graph processing systems, accessing storage becomes the bottleneck. However, existing graph processing systems optimize random-access reads to storage at the cost of loading many inactive vertices of the graph. In this paper, we use the CSR format for graphs, which is more amenable to selectively loading only the active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge with a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the PartitionedVC framework improves performance by up to , , , , and for the widely used breadth-first search, pagerank, community detection, graph coloring, and maximal independent set applications, respectively.
References
 (1) Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 2017, pp. 125–137.
 (2) “Apache Flink, Iterative Graph Processing,” https://ci.apache.org/projects/flink/flinkdocsstable/dev/libs/gelly/iterative_graph_processing.html, 2019.
 (3) “Introduction to Apache Giraph,” https://giraph.apache.org/intro.html, 2019.
 (4) N. Elyasi, C. Choi, and A. Sivasubramaniam, “Large-scale graph processing on emerging storage devices,” in 17th USENIX Conference on File and Storage Technologies (FAST 19), 2019, pp. 309–316.
 (5) “NAND Flash Memory,” https://www.enterprisestorageforum.com/storagehardware/nandflashmemory.html, 2019.
 (6) “Understanding Flash: Blocks, Pages and Program / Erases,” https://flashdba.com/2014/06/20/understandingflashblockspagesandprogramerases/, 2014.
 (7) J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: distributed graph-parallel computation on natural graphs,” in OSDI, vol. 12, no. 1, 2012, p. 2.
 (8) S.-W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind, “GraFBoost: Accelerated Flash Storage for External Graph Analytics,” ISCA, 2018.
 (9) A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-scale Graph Computation on Just a PC,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI ’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 31–46. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387880.2387884
 (10) J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.”
 (11) Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
 (12) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010, pp. 135–146.
 (13) “PageRank application,” https://github.com/GraphChi/graphchicpp/blob/master/example_apps/streaming_pagerank.cpp.
 (14) L. Quick, P. Wilkinson, and D. Hardcastle, “Using Pregel-like large scale graph processing frameworks for social network analysis,” in Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ser. ASONAM ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 457–463. [Online]. Available: http://dx.doi.org/10.1109/ASONAM.2012.254
 (15) U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical review E, vol. 76, no. 3, p. 036106, 2007.
 (16) A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-Stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013, pp. 472–488.
 (17) S. Salihoglu and J. Widom, “Optimizing graph algorithms on Pregel-like systems,” Proceedings of the VLDB Endowment, vol. 7, no. 7, pp. 577–588, 2014.
 (18) J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, vol. 48, no. 8. ACM, 2013, pp. 135–146.
 (19) “Samsung SSD 860 EVO 2TB,” https://www.amazon.com/SamsungInchInternalMZ76E2T0BAM/dp/B0786QNSBD, 2019.
 (20) D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers, “Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD,” in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower ’12. Berkeley, CA, USA: USENIX Association, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387873
 (21) K. Vora, G. H. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing,” in USENIX Annual Technical Conference, 2016, pp. 507–522.
 (22) “Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002,” http://webscope.sandbox.yahoo.com/, 2018.
 (23) D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay, “FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), ser. FAST ’15. Santa Clara, CA: USENIX Association, 2015, pp. 45–58. [Online]. Available: https://www.usenix.org/conference/fast15/technicalsessions/presentation/zheng
 (24) X. Zhu, W. Han, and W. Chen, “GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in USENIX Annual Technical Conference, 2015, pp. 375–386.