Working with large graphs that change continuously in real-time, under an unbounded stream of updates, is an increasingly important and challenging problem. Not only is the graph topology constantly changing with the addition and removal of edges and vertices, but query response times must also be interactive and meet stringent latency demands. Common application domains for such graphs include social networks, recommendation systems, and people and vehicle position tracking. In these domains, the ability to react quickly to change allows for the timely detection of trends.
Graph Processing Engines (GPEs) often resort to approximate computing techniques in order to provide timely query responses over very large graphs without adding extra resources. Approximate computing trades query result accuracy for lower latencies. Under specific error bounds, approximate results can be as acceptable as exact answers in many scenarios. Approximate results may allow for considerable improvements in speed (e.g., reduced latency and processing time, increased throughput) and resource efficiency (e.g., reduced cloud computing costs and energy footprint). In this domain, three common techniques have been employed: sampling, where queries are executed on a sampled summarization of the graph; task dropping, which consists of discarding parts of a partitioned global task processing list; and load shedding, which partially discards input data according to a shedding scheme. Developing novel techniques for approximate graph processing can strongly benefit many systems and applications. It would pave the road for high-level optimizations such as Service-Level Agreements (SLAs) for graph processing, with different tiers of accuracy and resource efficiency. These are relevant for applications like product recommendation and monitoring user influence.
We introduce GraphBolt, a novel execution model for GPEs that enables approximate computations on general graph applications. Our model features an abstraction that flexibly allows the expression of custom vertex impact estimators for random walk based algorithms. With this abstraction, we build a representative graph summarization that comprises only the subset of vertices estimated to yield high impact. This way, GraphBolt is capable of delivering lower latencies in a resource-efficient manner, while maintaining query result accuracy within acceptable limits. As a concrete instance, we integrated GraphBolt with a modern and popular GPE, Gelly/Apache Flink. Experimental results indicate that our approximate computing model can achieve a 2-fold latency improvement over the base (exact) computing model, while degrading result accuracy by no more than 5%.
The rest of the paper is organized as follows. An overview of the GraphBolt model is provided in Section II. Section III describes the architecture. In Section IV we present the experimental evaluation, followed by an analysis of improvements. Section V analyzes state-of-the-art GPEs that approach similar challenges. We summarize our contribution and future research in Section VI.
II Model: Big Vertex
We explicitly separate the graph algorithm expression paradigm from the underlying summarization model. By allowing for different graph summarization models, the goal is to enable different approximate computation strategies in exchange for result accuracy. In this work, we implemented and analyzed what we call the big vertex model. When processing a stream of new edges, the parameters of our model highlight a subset of the graph's vertices, known as hot vertices. The aim of this set is to reduce the number of processed vertices to as close as possible to those whose results actually change. These vertices are used to update the algorithm output.
II-A Not all vertices are equal
In order to use only a subset K of the vertices, it is necessary to employ approximate computing techniques using a fraction of the total data. In this model there is an aggregating vertex b. We refer to b as the big vertex: a single vertex representing all the vertices outside K (in this model, the values are not updated for vertices fused into b). For the original graph G = (V, E) and hot vertex set K ⊆ V, we define a summary graph G' = (V', E'), where V' = K ∪ {b}. We define E' = E_K ∪ E_b, where E_K = {(u, v) ∈ E : u ∈ K, v ∈ K} is the set of edges with both source and target vertices contained in K, and E_b = {(b, v) : (u, v) ∈ E, u ∉ K, v ∈ K} is the set of edges with sources contained inside b and targets in K. Conceptually, this consists in replacing all vertices of V which are not in K by a single big vertex b and rerouting the edges whose targets are hot vertices and whose sources are now in b. The summary graph does not contain vertices outside of K (again, those are represented by b).
It is relevant to retain that, for each iteration, the impact of a vertex depends on what is received through its incoming edges. By definition, b represents all vertices whose impact is not expected to change significantly. The contribution of each vertex represented by b is therefore constant between iterations, so it can be registered initially and used afterward. As a consequence, the summary graph does not contain edges targeting vertices represented by b. However, their existence must be recorded: even if the edges going from K into b are irrelevant for the computation, they still matter for the vertex degree, which influences the emitted scores of the vertices in K. Although edges targeting b are discarded when building G', the summarized computation must occur as if nothing had happened. To ensure correctness, for each edge (u, v) with u ∉ K and v ∈ K, we store D(u), the out-degree of u before discarding the outgoing edges of u targeting vertices outside K.
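To make the construction concrete, the following is a minimal sketch of building the summary graph from an edge list and a hot vertex set (denoted K here). This is an illustration, not GraphBolt's actual implementation: the function name, the dictionary-based representation, and the use of a single sentinel value for the big vertex are all assumptions made for the example.

```python
def build_summary(edges, out_degree, K):
    """Build the big-vertex summary: keep edges inside K, fuse the sources of
    edges entering K from outside into the big vertex, and record original
    out-degrees D(u) before edges targeting non-hot vertices are discarded.

    edges:      iterable of (u, v) pairs
    out_degree: dict mapping each vertex to its original out-degree
    K:          set of hot vertices
    """
    summary_edges = []   # E_K: edges with both endpoints in K
    big_edges = {}       # for each v in K, the fused sources u outside K
    stored_degree = {}   # D(u): original out-degree of each fused source
    for u, v in edges:
        if v not in K:
            continue     # edges targeting the big vertex are dropped
        if u in K:
            summary_edges.append((u, v))
        else:
            big_edges.setdefault(v, []).append(u)
            stored_degree[u] = out_degree[u]
    return summary_edges, big_edges, stored_degree
```

On a toy graph with K = {1, 2}, edges (3, 1) and (4, 1) are fused into the big vertex while (1, 2) survives in the summary, and the original out-degrees of 3 and 4 are preserved.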
It is also necessary to record the contribution of all the vertices fused together in b. For each edge (u, v) whose source u is inside b and whose target v is in K, we store the contribution that would originally be sent from u to v as R(u)/D(u), where R(u) is the stored score/value of u and D(u) is its recorded out-degree. The contribution of b as a single vertex in G' to a hot vertex v is then defined as:

F(b, v) = Σ_{(u, v) ∈ E, u ∉ K} R(u) / D(u)
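The constant contribution of the big vertex can be precomputed once per summarization, as in the sketch below (illustrative names; the per-target grouping mirrors the `big_edges` structure assumed in the previous sketch):

```python
def big_vertex_contribution(big_edges, score, stored_degree):
    """For each hot vertex v, sum the fixed contributions R(u)/D(u) of all
    sources u fused into the big vertex that point at v.

    big_edges:     dict {v in K: [fused sources u]}
    score:         dict of stored scores R(u)
    stored_degree: dict of recorded out-degrees D(u)
    """
    return {v: sum(score[u] / stored_degree[u] for u in sources)
            for v, sources in big_edges.items()}
```

Since these values do not change between iterations of the summarized computation, they are registered once and reused, which is precisely what makes the fused vertices cheap to account for.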
The fusion of vertices into b is performed while preserving the global influence of the vertices placed inside b on the vertices in K. The intuition behind our model is that vertices receiving more updates have a greater probability of having their measured impact change between execution points. Their neighboring vertices are also likely to incur changes, but as we consider vertices further and further away from the updates, contributions are likely to remain the same [5, 4].
GraphBolt aims to enable approximate computing on a stream of incoming edge updates. While we focused on the class of random walk algorithms, GraphBolt's architecture has the potential to be applied to other classes of graph algorithms (and potentially other models, depending on the relation between a given graph algorithm and the summarization model) using the same principles. Our contribution strikes a balance between two strategies for serving a query: a) recomputing all graph properties when a query arrives; b) returning a previous query result without incurring any additional computation. While the former is obviously much more time-consuming, it preserves result accuracy; the latter would quickly lower the accuracy of the algorithm's results. Throughout this document, we use the term query to state that a graph algorithm's results are required. So when we say a query is executed, it means that the algorithm was executed, whether over the complete graph or over our summarization model.
II-B Building the model
The model builds on techniques such as defining and determining a confidence threshold for the calculation error, graph sampling, and other hybrid approaches. GraphBolt registers updates as they arrive, for both statistical and processing purposes. Vertex and edge changes are kept until updates are formally applied to the graph. Until they are applied, statistics such as the total change in the number of vertices and edges (with respect to accumulated updates) are readily available. When applying the generic concepts of our technique, a useful insight is that, most likely, not all vertex scores will need to be recalculated. In our big vertex model, generating the hot vertex subset K of a graph depends on three parameters (r, n, Δ) used in a two-step procedure. From the client perspective, we consider a query to be an updated view of the algorithm information pertaining to the graph. As each individual query represents an important instant as far as computation is concerned, we refer to each query as a measurement point τ. For any measurement point τ, the whole graph is represented as G_τ = (V_τ, E_τ).
Update ratio threshold r. This parameter defines the minimal amount of change in a vertex v's degree required to include v in K. Parameters r, n and Δ are parameters of GraphBolt's big vertex model used to harness a graph algorithm's heuristics to approximate a result. We adopt the notation where the set of neighbors of vertex v in a directed graph at measurement instance τ is written as N_τ(v). We further write the degree of vertex v at measurement instance τ as deg_τ(v). The function d(u, v) represents the length (number of hops) of the minimum path between vertices u and v, and d_τ(u, v) represents the same concept at measurement instance τ. It is not required to maintain shortest paths between vertices (that would be a whole different problem). This model is based on a vertex-centric breadth-first neighborhood expansion. Let us define V_r as the set of vertices which satisfy parameter r, where τ represents the current measurement point and τ−1 is the previous one:

V_r = { v ∈ V_τ : |deg_τ(v) − deg_{τ−1}(v)| / deg_{τ−1}(v) > r }

New vertices are always included in V_r. The subtraction in the formula registers the degree change ratio with respect to the previous value.
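The degree-change filter can be sketched as follows (an illustration, with the two measurement points represented as plain degree dictionaries; brand-new vertices are always included, as the model prescribes):

```python
def update_ratio_filter(deg_prev, deg_curr, r):
    """Select vertices whose degree changed by more than ratio r between the
    previous and current measurement points. Vertices absent from the previous
    measurement (new vertices) are always selected."""
    hot = set()
    for v, d in deg_curr.items():
        prev = deg_prev.get(v)
        if not prev:                       # new vertex (or previous degree 0)
            hot.add(v)
        elif abs(d - prev) / prev > r:     # degree change ratio exceeds r
            hot.add(v)
    return hot
```

Raising r shrinks the selected set and thus the scope of recomputation, which is the performance/accuracy knob evaluated in Section IV.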
Neighborhood diameter n. This parameter controls an expansion around the neighborhood of the vertices in V_r. It aims to capture the locality of graph updates: vertices neighboring the ones beyond the threshold are still likely to suffer relevant modifications when the vertices in V_r are recalculated (attenuating as distance increases). At measurement point τ, for each vertex v ∈ V_r, we expand a neighborhood of size n, starting from v and including every additional vertex found in the neighborhood diameter expansion. The expansion is then defined as:

V_n = V_r ∪ { u ∈ V_τ : ∃ v ∈ V_r, d_τ(u, v) ≤ n }
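The n-hop expansion is a plain breadth-first traversal bounded by hop count, as in this sketch (the adjacency-dict representation is an assumption for illustration):

```python
from collections import deque

def expand_neighborhood(adj, seeds, n):
    """Return the seed vertices plus every vertex reachable within n hops,
    via breadth-first expansion over adj (dict: vertex -> neighbors)."""
    seen = set(seeds)
    frontier = deque((v, 0) for v in seeds)
    while frontier:
        v, hops = frontier.popleft()
        if hops == n:                 # stop expanding at the diameter bound
            continue
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append((u, hops + 1))
    return seen
```

With n = 0 the function returns only the seeds, matching the performance-oriented setting discussed next.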
V_τ is the set of vertices of the graph at measurement point τ. Setting n = 0 promotes performance, while a greater value of n is expected to trade performance for more accuracy.
Result-specific neighborhood extension Δ. This last parameter allows users to extend the functionality of n by further expanding the neighborhood size as a function of the vertices' results. This is achieved by accounting for specific properties of the underlying algorithm which may be used as heuristics. It allows updating vertex results around those vertices that, while not included by Eq. 2 or Eq. 3, are neighbors to vertices subject to change. We use the relative change of a vertex's score between the two consecutive measurement points τ−1 and τ. The expansion of the neighborhood with Δ is:

K = V_n ∪ { u ∈ V_τ : ∃ v ∈ V_n, d_τ(u, v) ≤ x_Δ(v) }
where x_Δ is the Δ-expansion function, which yields the number of additional hops over which the score of v still contributes at least a fraction Δ of itself:

x_Δ(v) = max { h ∈ ℕ : R(v) / d̄^h ≥ Δ · R(v) }
In Eq. 5, R(v) is the result (score) of vertex v and d̄ is the average degree of the currently accumulated vertices with respect to stream S. This allows us to have a fine-grained neighborhood expansion starting at n and limited by Δ on the maximum contribution of the score of v. The intuition here is that vertex v contributes to the value of its immediate neighbors with a value of R(v)/D(v). For its neighbors' neighbors, the influence of v is further diminished (the contribution becomes R(v)/(D(v) · D(w)), where w is a direct neighbor of v). Additional expansions further dilute the contribution that v could possibly have. For example, when evaluating GraphBolt with a bound of Δ = 0.10, we keep considering further neighborhood expansion hops from v until the contribution from v drops below 10% of its score.
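One way to read the Δ-expansion is as a hop count derived from the dilution rate: if each hop divides a vertex's relayed contribution by the average degree, the number of hops before the contribution falls below a fraction Δ of the original score follows from a logarithm. The sketch below is our interpretation of that rule, not GraphBolt's exact code:

```python
import math

def delta_hops(avg_degree, delta):
    """Largest hop count h such that (1/avg_degree)**h >= delta, i.e. the
    number of expansion hops before a score's contribution, diluted by the
    average degree at each hop, drops below fraction delta of itself."""
    if avg_degree <= 1:
        return 0   # no dilution: the bound is never crossed by this rule
    return max(0, math.floor(math.log(1.0 / delta, avg_degree)))
```

For instance, with an average degree of 3 and Δ = 0.10, two hops still carry at least 10% of the score ((1/3)² ≈ 0.11) while a third hop does not.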
We then have a set of hot vertices K which is used as part of a graph summary model (deriving from techniques in iterative aggregation), written as G'. An example of how the parameters influence the selection of vertices is depicted in Fig. 1. The left side represents a zoomed-in view of a small portion of the complete graph (which may be composed of millions of vertices). Part a) shows the vertices whose amount of change satisfied the threshold ratio r, leading to V_r: in this case, only the top vertex was included; its contour is a solid line with a gray fill. Part b) then shows the usage of the neighborhood expansion parameter n over the vertex of a): the two middle vertices are now gray as well (the one included in the previous part is colored black). This makes up V_n. Lastly, part c) accounts for the per-vertex neighborhood expansion parameter Δ and represents the set of hot vertices K. This is depicted by coloring the bottom vertex in gray and the remaining ones in black (they are already part of V_n). Some dashed arrows remain to illustrate that the vertices on the other end of those edges were not included in K.
III GraphBolt Architecture
The architecture of GraphBolt was designed taking into account three information types which are inherent to the workflow, presented in Fig. 2. The GraphBolt module constantly monitors one or more streams of data and tracks the changes made to the graph. When the data is queried, GraphBolt executes the request by submitting a job to a Flink cluster. In our experiments, we trigger the incorporation of updates into the graph whenever a client query arrives. (While outside the scope of this work, a live scenario would have a more elaborate ingestion scheme, possibly using dedicated ingestion nodes as in Kineograph.) The main elements of the flow of information in an execution are:
Initial graph G. The original graph upon which updates and queries will be performed.
Stream of updates S. Our model of updates covers the removal and addition of edges, and likewise for vertices. We make as few assumptions as possible regarding S: the supplied data need not respect any defined order. In our experiments we used both edge additions and removals.
Result R. Information produced by the system as an answer to the queries received in S.
GraphBolt was designed to allow programmers to define fine-grained logic for the approximate computation when necessary. This is achieved through user-defined functions. As a design decision, there are five distinct functions which are articulated in a fixed program structure. GraphBolt's API uses them to define the execution logic that guides the processing strategy. They are key points in the execution where important decisions take place (e.g., how to apply updates, how to perform monitoring tasks). To implement other graph algorithms, users can simply extend the GraphBolt Java class implementing the logic shown in Algorithm 1 to enable our model's functionality, while abstracting away many implementation details (inherent to our architecture) unrelated to the graph processing itself.
Additional behavior control is possible by customizing the model through user-implemented functions (left as abstract methods of the class implementing the architecture logic). Overall, this approach has the advantage of abstracting away the API's complexity, while still empowering power users who wish to create fine-tuned policies. GraphBolt's architecture creates a separation between the graph model, the way the graph processing is expressed (e.g., vertex-centric) and the function logic to apply on vertices.
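The shape of this design, a fixed driver with overridable hooks, can be sketched as follows. The hook names here are hypothetical (the actual GraphBolt API is a Java class and its five functions are not reproduced in this text); the sketch only illustrates the template-method structure:

```python
from abc import ABC, abstractmethod

class ApproximateGraphJob(ABC):
    """Fixed program structure with user-overridable hooks, mirroring the
    idea of five user-defined functions articulated by a driver loop."""

    def run(self, graph, stream):
        for chunk in stream:
            self.apply_updates(graph, chunk)      # how updates are applied
            summary = self.build_summary(graph)   # big-vertex summarization
            result = self.execute(summary)        # run the graph algorithm
            self.monitor(result)                  # monitoring tasks
            yield self.answer_query(result)       # produce the query answer

    @abstractmethod
    def apply_updates(self, graph, chunk): ...
    @abstractmethod
    def build_summary(self, graph): ...
    @abstractmethod
    def execute(self, summary): ...
    @abstractmethod
    def monitor(self, result): ...
    @abstractmethod
    def answer_query(self, result): ...
```

A user subclasses the driver and fills in only the hooks, leaving the orchestration (update ingestion, summarization, execution, monitoring) to the fixed structure.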
GraphBolt was implemented over Apache Flink, a framework built for distributed stream processing (https://flink.apache.org/). Flink has many different libraries, among which Gelly, its official graph processing library. Gelly features algorithms such as PageRank, Single-Source Shortest Paths and Community Detection, among others. Overall, it empowers researchers and engineers to express graph algorithms in familiar ways such as gather-sum-apply or the vertex-centric approach of Google Pregel, while providing a powerful abstraction with respect to the underlying scheme of distributed computation. We employ Flink's mechanism for efficient dataflow iterations with intermediate result usage. To employ our module, the user can express the algorithm using Flink dataflow programming primitives.
The source of GraphBolt is available online (https://fenix.tecnico.ulisboa.pt/homepage/ist162460/graphbolt). We provide an API allowing programmers to implement their logic succinctly. GraphBolt was evaluated with the PageRank power method algorithm. The PageRank logic is succinctly implemented as a user-defined function, which is then passed on to the underlying graph processing paradigm. While we focus our evaluation on PageRank, we note that other random walk based algorithms can be expressed just as easily.
In our PageRank implementation, all vertices are initialized with the same value at the beginning. We focus on a vertex-centric implementation of PageRank, where in each iteration, each vertex sends its value (divided by its outgoing degree) through each of its outgoing edges. A vertex defines its score as the sum of the values received from its incoming edges, multiplied by a constant factor β and then summed with a constant value (1 − β)/|V|, with 0 < β < 1. PageRank, based on the random surfer model, uses β as a dampening factor. For our work, this means that whether one considers one-time offline processing or online processing over a stream of graph updates, the underlying computation of PageRank is an approximate numerical version well known in the literature.
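The per-iteration update just described can be sketched as follows. This is a plain single-machine illustration of the vertex-centric step, not the Flink/Gelly implementation used by GraphBolt:

```python
def pagerank_iteration(adj, scores, beta=0.85):
    """One vertex-centric PageRank step: each vertex sends score/out-degree
    along its outgoing edges; the new score of a vertex is the received sum
    multiplied by beta, plus the constant (1 - beta)/N.

    adj:    dict vertex -> list of out-neighbors
    scores: dict vertex -> current score
    """
    n = len(scores)
    received = {v: 0.0 for v in scores}
    for v, outs in adj.items():
        if outs:
            share = scores[v] / len(outs)   # value divided by out-degree
            for u in outs:
                received[u] += share
    return {v: (1 - beta) / n + beta * received[v] for v in scores}
```

On a two-vertex cycle with uniform initial scores, one step leaves the scores unchanged, as expected of a stationary distribution.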
This distinction is important: when we state that GraphBolt enables approximate computing, we are considering a potential for applicability to a scope of graph algorithms, such as algorithms for computing eigenvector centrality and optimization algorithms for finding communities in networks. Whether the specific graph algorithm itself incurs numerical approximations (such as the power method) or not is orthogonal to our model.
Experiments. Experiments were performed on an SMP machine with 256 GB RAM and 8 Intel(R) Xeon(R) CPU E7-4830 @ 2.13GHz processors with eight cores each. Each dataset execution was performed with a parallelism of four (see https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html), meaning a single JobManager and four TaskManagers (master and worker nodes in Flink). The workers were set to use either 4 GB or 8 GB of memory, depending on the dataset.
In our scenario, PageRank is initially computed over the complete graph, and then a stream of chunks of incoming edge updates is processed in GraphBolt. For each chunk received in GraphBolt we: 1) integrate the edge updates into the graph; 2) compute the summarized graph G' as described in Section II-B and execute PageRank over G'. Henceforth, we say that we are processing a query whenever a PageRank version (summarized or complete) is executed after integrating a chunk of updates.
To reduce variability, the stream of edge updates was set up so that the number of queries for each dataset and parameter combination is always the same: fifty. For each dataset and stream size, we defined (offline) a tab-separated file containing the stream of edge updates. For each dataset, streams were generated by uniformly sampling from the edges in the original dataset file, with 800 edges added before executing every query. We test with both edge additions and deletions. Every time we add edges, we remove an amount equal to 20% of the number of edges added. The edges to remove are chosen at random with equal probability. When additions and removals are applied before an execution, the removal only targets remaining edges which either existed in the original graph or were added in an older update that preceded a previous execution.
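A simplified sketch of this stream-generation procedure is shown below. It is an assumption-laden illustration (function name, representation, and the treatment of the initial graph are ours): additions are drawn uniformly from the dataset's edges, and each chunk removes 20% as many edges, chosen among edges applied in earlier chunks.

```python
import random

def make_stream(all_edges, num_queries, adds_per_query,
                removal_ratio=0.2, seed=7):
    """Build num_queries chunks of (additions, removals). Removals are drawn
    only from edges applied in strictly earlier chunks; in the paper's setup
    the original graph's edges are removal candidates too, which this sketch
    simplifies away."""
    rng = random.Random(seed)
    remaining = list(all_edges)      # dataset edges not yet streamed in
    rng.shuffle(remaining)
    present = []                     # edges applied so far (removal candidates)
    stream = []
    for _ in range(num_queries):
        adds = [remaining.pop() for _ in range(adds_per_query)]
        k = min(len(present), int(adds_per_query * removal_ratio))
        removals = rng.sample(present, k)
        for e in removals:
            present.remove(e)
        present.extend(adds)
        stream.append((adds, removals))
    return stream
```

In the evaluated configuration this corresponds to 50 chunks of 800 additions and 160 removals each.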
For each dataset and stream S, each combination of parameters is tested against a replay of the same stream. Essentially, each execution (representing a unique combination of parameters) begins with a complete PageRank execution followed by summarized PageRank executions. This initial computation represents the real-world situation where the results have already been calculated for the whole graph, and one is focused on the incoming updates. For each dataset and stream S, we also execute a scenario which does not use the parameters: it likewise starts with a complete execution of PageRank, but the complete PageRank is executed for all queries. This is required to obtain ground-truth results against which to measure the accuracy and performance of the summarized implementation of the model. Many datasets such as web graphs are usually provided in an incidence model, in which the out-edges of a vertex are provided together sequentially. This may lead to an unrealistically favorable scenario, as it is a property that will not necessarily hold in online graphs and which may benefit performance measurements. To account for this fact, we tested the same parameter combinations with a prior shuffling of stream S (a single shuffle was performed offline a priori, so that the randomized stream is the same for the different parameter combinations that were tested). This increases the entropy and allows us to validate our model under fewer assumptions.
The datasets' vertex and edge counts are shown in Table I. We evaluate results over two types of graphs, web graphs and social networks, both obtained from the Laboratory for Web Algorithmics. These datasets were used to evaluate the model against different types of real-world networks.
IV-B Assessment Metrics
We measure the results of our approach in terms of: a) the ability to delay computation in light of result accuracy (top graph of each figure); b) the obtained execution speedup (middle graph of each figure); c) the reduction in the number of processed edges (bottom graph of each figure). Accuracy in our case takes on special importance and requires additional attention to detail. The PageRank score itself is a measure of importance, and we wish to compare rankings obtained on a summarized execution against rankings obtained on the non-summarized graph. As such, what is needed is a method to compare rankings.
Rank comparison can incur different pitfalls. If we order the list of PageRank results in decreasing order, only a set of top-ranked vertices is relevant. After a given index in the ranking, the centrality of the vertices is so low that they are not worth considering for comparative purposes. But where should the ranking be truncated? The decision to truncate at a specific position of the rank is arbitrary and leads to the list being incomplete. Furthermore, the contention between ranking positions is not constant: competition is much more intense between the first and second-ranked vertices than between the two-hundredth and the two-hundred-and-first.
We employed Rank-Biased Overlap (RBO) as an evaluation metric (representing relative accuracy) developed to deal with these inherent issues of rank comparison. RBO has useful properties such as weighting higher ranks more heavily than lower ranks, which is a natural match for PageRank as a vertex centrality measure. It can also handle rankings of different lengths. This is in tune with the output of a centrality algorithm such as PageRank. The RBO value obtained from two rank lists is a scalar in the [0, 1] interval: it is zero if the lists are completely disjoint and one if they are completely equal. While more recent comparison metrics have been proposed, we believe they go beyond the scope of what is required in our comparisons. Effectively, the quality of our case study algorithm's accuracy is itself produced by a comparison (between sequences of rank lists). As far as we know, and due to the specificity of our evaluation, there is no better-suited baseline in the literature against which to compare our own ranking comparison results.
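For concreteness, the core of RBO is a geometrically weighted average of the overlap between the two rankings' prefixes. The sketch below computes the truncated prefix sum, which is a lower bound of the full metric (the published formulation additionally extrapolates the residual beyond the evaluated depth, and is not reproduced here):

```python
def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated Rank-Biased Overlap: at each depth d, take the fraction of
    items shared by the two d-prefixes, weight it by p**(d-1), and normalize
    by (1 - p). Higher ranks thus weigh more. For two identical length-k
    lists this prefix sum equals 1 - p**k, approaching 1 as k grows."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        score += (p ** (d - 1)) * (len(seen_a & seen_b) / d)
    return (1 - p) * score
```

The persistence parameter p controls how steeply the weights decay: lower p concentrates the comparison on the very top of the rankings.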
Performance-wise, we test values of r associated with different levels of sensitivity to vertex degree change (the higher the value, the fewer objects are expected to be processed per query). With n = 0, we minimize the expansion around the initial set V_r. With higher values of n, we take a more conservative approach regarding result accuracy. An overall tendency to expect is that the higher the value of n, the higher the RBO (this is demonstrated in our results). The values of Δ were chosen to evaluate different individual weight schemes applied to vertex score changes. The relation between parameters r and n has a greater impact on performance and accuracy than the relation of either of these parameters with Δ. We tested with two sets of parameter combinations:
RBO-oriented combinations. These use a very low threshold of sensitivity to the ratio of vertex degree change (low r).
Performance-oriented combinations. With a higher r, the goal is to be less sensitive to the degree change ratio.
For both of these sets, we test with low and high values of n to examine how expanding the neighborhood of vertices complements the initial degree change ratio filter.
Using a higher number of ranks for the RBO evaluation favors a comparison of calculated ranks with greater resolution, as more vertices are being compared. In our evaluation, the RBO of each execution is calculated using 10% of the complete graph's vertex count as the number of top ranks to compare. This means, for example, that for dataset cnr-2000, which has 325,557 vertices, we use the top 32,555 produced vertex ranks. Every 10 executions, we calculate RBO using all of the vertices of the graph to ensure no artifact is being masked in the lower rank values.
Results are tailored to illustrate the impact of the GraphBolt model based on three metrics: a) the obtained speedup for each execution (compared against the execution over the complete graph); b) the RBO evolution as the number of executions increases (it starts with a value of 1, as the complete version of PageRank is executed initially for all execution strategy scenarios); c) the summary graph edge count as a percentage of the original graph's edge count.
For these three metric categories, we present the best three and worst three results obtained within each category. This means, for example, that the parameter combination (r, n, Δ) that produces the best accuracy result is not necessarily the same one producing the best speedups. The horizontal axis represents the same for all plots: the sequence of queries from 1 up to 50. Each figure has three graphics in a column, corresponding to speedup (top), accuracy measured with RBO (middle) and summary graph edge count as a percentage of the complete graph edge count (bottom). Due to how the dynamics of parameter combinations and the structure of the datasets behave, some parameter combinations produced extremely similar values, leading to almost-overlapping plots. We first describe the results observed for the web graphs cnr-2000 and eu-2005, and then for the social graphs dblp-2010 and amazon-2008.
One needs to take into account that this is a challenging assessment context for GraphBolt. Between each consecutive pair of the 50 queries (i.e., at any of the 800 edge/vertex updates we ingest between them), if the user prompted a query execution, GraphBolt could offer near-instant results against the previously summarized graph, contrary to a full graph execution (thus yielding several 100-fold speedups each time), and still provide results with very high RBO (in line with those of the queries immediately preceding and following the update).
cnr-2000: For this dataset the results are shown in Fig. 3. The best speedup achieved was 1.20, as shown at the top of the figure with the blue star markers. For that combination, increasing n produced an execution speed similar to the baseline (the execution time of complete executions). This is to be expected, as the scope of computation increases due to a bigger hot vertex set K. The worst speedups were obtained with the combination which promotes accuracy above everything else. These executions were actually slower than the baseline complete executions, due to the overhead of summarization model construction and computation.
Regarding RBO values (middle of Fig. 3), we observe that the relation between the impact of parameters r and n did not behave entirely as one would expect. Initially, the best accuracy, an RBO of around 99%, is achieved by the parameter combination shown in the blue star plot. Later in the execution sequence, the RBO value of this combination drops and stabilizes at 90%, from then on barely surpassed by the combinations shown in the yellow cross and the green diamond plots. One combination is interesting because increasing n while keeping r and Δ constant causes a massive drop in RBO, resulting in the purple circle marker plot. We attribute this to graph topology and the way value propagation affects accuracy: when using n = 0, GraphBolt disregards the neighbors of the vertices whose degree change ratio was at least r. By not including them in the computation, the excluded immediate neighbors retain their old values (by virtue of inclusion in the big vertex b). Paradoxically, this is a case where promoting performance also led to higher accuracy, by limiting the scope of error propagation. Increasing n even further, the increased computational scope again promotes higher accuracy.
Regarding the summary graph edge percentage, the higher parameter n was, the bigger the percentage. This is in line with the speedups: overall, for this dataset, higher n values lead to increased accuracy and reduced speedup.
eu-2005: Results for this dataset are shown in Fig. 4. Speedups of around 3.00 were achieved (see the blue star and yellow cross markers at the top of Fig. 4). These are parameter combinations which promote speed, only considering for the hot vertex set the vertices whose degree changed by at least 20%.
Parameter combinations with lower values of n achieved the best RBO throughout all executions for this dataset. This was due to the same phenomenon which arose with dataset cnr-2000.
In terms of the number of edges in the summary graph, the parameters with more conservative values (bigger n and lower r) led to the biggest summary graphs. However, these were not the ones producing the highest RBO values throughout the executions.
dblp-2010: Results are shown in Fig. 5. The best speedups obtained were around 1.60-1.80, for high values of the update ratio threshold r and lower levels of neighborhood expansion n. These are the markers with blue stars, yellow crosses and green diamonds. For result accuracy, the parameter combination with the red left-facing triangles started at an RBO of around 85% and decreased steadily as executions progressed. Adjusting this combination by increasing n produced a plot with RBO values above 90%, shown in the green diamond marker plot. The best RBO values were produced by the bigger values of n. These same parameter combinations also led to a summary graph edge count very close to the complete graph's edge count. A more balanced combination (slightly lower n and higher r) produced a summary graph whose edge fraction (with respect to the total graph) hovered around 70%.