Graph-based data is found almost everywhere, with examples such as analyzing the structure of the World Wide Web [DBLP:conf/www/BoldiV04], bio-informatics data representation via de Bruijn graphs [ACombinatorialProblem1946] in metagenomics [Li01022010, 18349386] and the structure of distributed computation itself [Malewicz:2010:PSL:1807167.1807184]. There are many usages of graph-structured data such as large-scale web search engines [brin1998anatomy], massive parallel learning of tree ensembles [panda2009planet] and parallel topic models [smola2010architecture]. Industry players like Facebook, Microsoft and Google have rolled out their own graph processing systems, contributing to the development of several open-source frameworks. They need to deal with huge graphs, such as the case of the Facebook graph with billions of vertices and hundreds of billions of edges (https://www.wordstream.com/blog/ws/2017/11/07/facebook-statistics).
More traditional approaches to graph processing focus on using a single machine with as much main memory as possible. An example of this is Ringo (http://snap.stanford.edu/ringo/) [DBLP:conf/sigmod/PerezSBPRSL15], which provides many different functionalities to explore graphs. Additionally, compression techniques have been successfully applied to manipulate graphs on single machines with limited memory. Perhaps the most famous instance of this is WebGraph [DBLP:conf/www/BoldiV04], a framework that accounts for properties of scale-free graphs and exploits techniques like gap compression, intervals and lazy decompression to achieve high compression ratios balanced with performance.
A way to process graphs which we pay dedicated attention to in this document is to use distributed systems. Such systems, which have been built to process big data, have also been applied to graph processing (e.g. attempts have been made to employ MapReduce for this purpose [DBLP:conf/sigmod/QinYCCZL14]). While there is a plethora of solutions and techniques tailor-made for single-machine systems (branching into high versus low available memory categories), the fact is that distributed solutions have continued to improve. Many use-cases can be satisfied or explored by deploying execution in a cluster.
As far as we are aware, while there has been a focus on keeping the graph updated (against a stream of incoming updates) and available for continuous processing in a cluster (without intermediate access to storage), existing solutions incur design limitations, although different architectures have been proposed as seen in Section 2. The read-eval-write loop approaches this problem by breaking down the flow of computation and establishing a dynamic of reading from storage or memory, processing and writing back. Even if data exists in memory, synchronization boundaries between cluster job management and explicit API calls to retain datasets in memory incur overheads. This is the case for example with Spark’s cache functionality applied in the context of the GraphX graph library. Reading from disk or distributed storage into a system, processing and then writing back results will lead to I/O overheads in storage and communications. The work done on processing graph streams is also limited in that it typically only maintains a limited view of the graph to support a specific type of property. Some systems have, at best, provided some low-level primitives for the user to specify when (thus implying lock-in to a specific graph algorithm) graph results are outdated (even partially) [Cheng:2012:KTP:2168836.2168846, Iyer:2016:TGP:2960414.2960419].
A new class of system is needed to finally integrate the semantics of stream processing windows with the maintenance of a graph in cluster memory. (While most systems we mention approach fault-tolerance in some way, that is not the focus of this work.) This would allow composing algorithm executions over evolving graphs, one after the other and also with some overlap. Herein we detail the status quo in distributed graph processing while highlighting an emerging use-case that, as far as we know, has not been sufficiently satisfied by any existing distributed system. We base this statement on having surveyed many different systems in the literature, which we detail and compare throughout this document.
1.2 Document structure
Section 3 goes over the most recent systems, classifies them in the context of the categories and summarizes the presence of relevant attributes in graph processing systems. Section 4 further details the systems we consider most desirable to pursue the use-case we mentioned. That section also discusses the limitations of the described systems in the context of maintaining a given graph updated and efficiently available across the cluster. Section 5 summarizes the most important insights.
2 Domain: Graph OLAP vs OLTP
It is important to consider the definitions of online analytical processing (OLAP) and online transaction processing (OLTP); see https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphOLTPvsOLAP.html. Among graph OLAP systems, we may count systems such as Flink and Spark, which are general-purpose dataflow programming systems that also have graph libraries to express graph computation. There are other systems designed for graph processing from the start like Giraph [DBLP:books/sp/SOAK2016], Hama [DBLP:journals/tjs/SiddiqueAKJY17], PowerGraph [Gonzalez:2012:PDG:2387880.2387883], X-Stream [Roy:2013:XEG:2517349.2522740], Chaos [Roy:2015:CSG:2815400.2815408], GraphLab [Low_graphlab:a], Tornado [Shi:2016:TSR:2882903.2882950], KickStarter [Vora:2017:KFA:3037697.3037748], Kineograph [Cheng:2012:KTP:2168836.2168846] and Ringo [DBLP:conf/sigmod/PerezSBPRSL15] (this last one is a single-machine system).
We propose that the first generation of distributed systems specialized on graphs was marked by a focus on how to express computations. Pregel [Malewicz:2010:PSL:1807167.1807184] introduced the vertex-centric approach, where computation is expressed from the point-of-view of a vertex. For a given vertex $v$, a vertex function will read messages from incoming edges, perform some computation and then update the value of $v$. This is done globally and possibly in parallel for all vertices, in a sequence of global steps known as supersteps. It hides the partitioning of the graph (across the cluster) from users. This model has been adopted in many state-of-the-art graph processing systems.
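As an illustration of the vertex-centric model, the following minimal single-process Python sketch runs single-source shortest paths as a sequence of supersteps. It is not Pregel's API: the function name `pregel_sssp`, the adjacency-dictionary input and the in-memory inbox are assumptions made for this example, and the partitioning and parallelism that Pregel hides from users are not modelled.

```python
import math
from collections import defaultdict


def pregel_sssp(edges, source):
    """Vertex-centric shortest paths (illustrative sketch, not Pregel's API).

    edges: dict mapping a vertex to a list of (neighbor, weight) outgoing edges.
    Returns the distance of every vertex from `source` and the superstep count.
    """
    vertices = set(edges)
    for outgoing in edges.values():
        vertices.update(dst for dst, _ in outgoing)
    value = {v: math.inf for v in vertices}

    # Messages waiting to be read at the start of the first superstep.
    inbox = {source: [0.0]}
    supersteps = 0
    while inbox:  # the computation halts once no vertex has pending messages
        outbox = defaultdict(list)
        # Vertex function: conceptually executed in parallel for each active vertex.
        for v, messages in inbox.items():
            candidate = min(messages)          # read messages from incoming edges
            if candidate < value[v]:
                value[v] = candidate           # update the value of the vertex
                for dst, weight in edges.get(v, []):
                    outbox[dst].append(candidate + weight)
        inbox = outbox                         # global barrier between supersteps
        supersteps += 1
    return value, supersteps
```

A vertex that receives no message in a superstep stays inactive, which is why the loop only visits the keys of `inbox`.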
Still, in this first generation, other systems focused on optimizing performance (random access and I/O latency) first and then adjusting to it the way to express graph computations. Such is the case of X-Stream [Roy:2013:XEG:2517349.2522740], which focused not on vertices but on processing streams of edges (think-like-an-edge). Other paradigms were proposed, such as think-like-a-graph [Tian:2013:TLV:2732232.2732238] which was tested over Giraph [DBLP:books/sp/SOAK2016], an open-source implementation of Pregel. The paradigm allows users to specify the partition structure in a way to enable communication within a partition to bypass message passing or scheduling mechanisms. The type of use-case found in graph OLAP activity requires processing the complete graph (e.g., computing PageRank or detecting communities). It consists in the execution of algorithms over the whole graph, representing the aforementioned read-eval-write loop.
OLTP systems focus on ingesting information and processing transactions. Examples of graph OLTP systems are Neo4j [Webber:2012:PIN:2384716.2384777], JanusGraph [janus] and DataStax Enterprise Graph, the last two deriving from the now-inactive Titan (http://titan.thinkaurelius.com/) [titan]. Other examples are Google Cayley [cayley], Dgraph (https://github.com/dgraph-io/dgraph), Graph Engine [graphengine], InfiniteGraph [infinitegraph] and AllegroGraph [allegrograph], these last two having been previously compared [DBLP:conf/data/FernandesB18]. These systems are graph databases tuned to serve queries for parts of the graph (known as traversals), focusing on semantic relations of a given business logic (e.g., retrieving from a social graph all consumers who know someone who bought a specific product). They support dedicated query languages to write traversals, such as Cypher and Gremlin, allowing for the integration of different languages [holzschuher2013performance]. Systems such as Neo4j implement their own low-level storage mechanisms. Others, like those derived from Titan, abstract the storage layer into an interchangeable module, allowing for example the use of Apache HBase [dimiduk2012hbase] or Apache Cassandra [Lakshman:2010:CDS:1773912.1773922].
3 System Classification
We present in Table 1 an overview of the graph processing systems we considered most relevant and their attributes. Systems are considered to be active if they have received any source code update in the last three months as of September, 2019.
Giraph, Flink and Spark in the OLAP group have a crossed check mark in dynamism because while their APIs support updating a graph (or results of an algorithm executed over it), they incur the read-eval-write loop. In the case of Flink and Spark, reusing their respective dataset APIs on the same data (without writing and reading from disk) will lead to dataflow plans with increasing complexity, leading to prohibitively unscalable execution times.
Effectively, at the $k$-th time we update the graph (reusing the Java/Scala object reference to the distributed dataset), these systems incur an overhead of $\sum_{i=1}^{k} u_i$, where $u_i$ is the $i$-th update: updates that have already occurred are reapplied in sequence every time a new update is to be integrated. In this way, these OLAP systems support dynamism, although it quickly becomes prohibitive and unsuitable for scenarios where a stream of graph updates arrives and we need to process results very quickly. For these systems in their current form, the complexity of this dynamism may be mitigated by writing to disk and reading back into the cluster between updates (so every time a new update is applied, at most one update integration is executed). However, this incurs secondary storage overheads (which are subject to communication overhead too even if using distributed storage) while failing to solve the limitation.
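To make the growth of this overhead concrete, the short Python sketch below compares two cost models: one where every earlier update is reapplied whenever a new update is integrated, and one where the graph is materialized between updates. The function names and the notion of an abstract per-update "size" are assumptions of this example, and the storage and communication costs of materialization are deliberately left out.

```python
def replay_costs(update_sizes):
    """Work per integration when all earlier updates are reapplied each time."""
    costs = []
    replayed = 0
    for size in update_sizes:
        replayed += size        # every update seen so far is applied again
        costs.append(replayed)
    return costs


def materialized_costs(update_sizes):
    """Work per integration when the graph is written out and read back
    between updates (the I/O cost itself is not modelled here)."""
    return list(update_sizes)   # only the newest update is integrated
```

For five updates of unit size, `replay_costs` yields 1, 2, 3, 4, 5 units of work (15 in total), whereas `materialized_costs` yields one unit per update (5 in total). The accumulated work in the replay model therefore grows quadratically with the number of updates.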
The OLTP systems also have a crossed check mark on dynamism. By definition they support dynamism as they perform transactions of updates, meaning the graphs evolve. However, these same transactions imply the use of secondary storage, making this a type of dynamism which does not satisfy the needs we mention in Section 1.1. While some OLTP systems may have mechanisms to support quick access to parts of the graph, secondary storage still becomes a bottleneck in this type of system as well. Ultimately, we consider dynamism as the ability to quickly update the graph for applying new algorithms (or updating existing values) in the context of a stream of updates (possibly under real-time constraints and window semantics).
Graph computation paradigms
As far as we know, Flink is the one offering the most variety. Furthermore, extensions have been built over it (GRADOOP) to express logic through query languages like Cypher (seen in Neo4j).
Number of algorithms
We consider the number of algorithms that a system offers to be a useful metric in assessing its usability for long-term projects and development.
A high and diverse number of algorithms usually implies an active community, but not always – in the case of X-Stream and Chaos, those systems boast a large number of algorithms but were projects isolated in time.
In the OLAP category for actively-developed distributed systems, Flink wins in the number of algorithms. Effectively, its graph library also has variants of the same algorithm implemented using the different graph computation paradigms it offers. For the OLTP category, Neo4j with its graph algorithm library also offers a rich set of algorithms out-of-the-box. We note that the algorithms offered in Neo4j execute in a different way than in Flink or Spark. Neo4j is a graph database which will always go through a storage layer, while Flink and Spark will translate the algorithm logic into a dataflow program which will ingest the whole graph into the cluster and then execute it without optimization mechanisms offered by a potentially-distributed storage layout (present in Neo4j).
Employed graph partitioning strategies vary, with different systems offering different solutions. Among performance-impacting factors [soudani2019investigation], we have the number of active vertices and edges influencing machine load. At the same time, communication will be more expensive depending on how replication of edges and vertices is performed.
Partitioning must balance communication and machine loads. The partitioning challenge in vertex-centric systems is relevant due to how widespread this model is. The authors of [soudani2019investigation] note three major approaches for big graph partitioning: a) partitioning the graph serially in a single pass and permanently assigning the partition the first time an edge or vertex is assigned (stream-based); b) methods that partition in a distributed way; c) dynamic methods that adapt the partitions based on monitoring the load and communication of machines during algorithm execution. The way the distribution is achieved and data is represented will be a factor in going beyond the read-eval-write loop. In this scope, a dynamic method would be necessary as a basis to develop the properties we described.
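The stream-based approach in a) can be sketched as follows. This is a simple greedy heuristic written for illustration and is not taken from [soudani2019investigation]: each edge is seen once, is assigned permanently, and goes to the partition that already holds the most of its endpoints, with ties broken by the lowest load. The function name and the returned structures are assumptions of this example.

```python
def stream_partition(edge_stream, num_partitions):
    """Single-pass greedy edge partitioning (illustrative heuristic).

    Returns the edge assignment, the number of edges per partition and, for
    each vertex, the set of partitions that hold a replica of it.
    """
    loads = [0] * num_partitions
    replicas = {}    # vertex -> set of partitions holding a copy of it
    assignment = {}  # (u, v) -> partition, never revisited once set

    for u, v in edge_stream:
        def score(p):
            # Prefer partitions that already hold u and/or v (less replication),
            # then the least loaded partition, then the lowest partition id.
            overlap = (p in replicas.get(u, ())) + (p in replicas.get(v, ()))
            return (-overlap, loads[p], p)

        best = min(range(num_partitions), key=score)
        assignment[(u, v)] = best
        loads[best] += 1
        replicas.setdefault(u, set()).add(best)
        replicas.setdefault(v, set()).add(best)
    return assignment, loads, replicas
```

Because assignments are never revised, this sketch cannot react to load or communication observed later during algorithm execution, which is what distinguishes it from the dynamic methods in c). It also places no cap on partition size, so a skewed stream can produce imbalanced loads.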
Programming models for graph processing have been studied and documented in the literature [kalavri2017high, heidari2018scalable]. They define properties such as the granularity of the unit of computation, how to distribute it across the cluster and how communication is performed to synchronize computational state across machines.
The vertex-centric model was introduced in Pregel, where a function is executed from the point-of-view of a vertex. It was then extended to the concept of vertex scope, which includes the adjacent edges of the vertex. A vertex is the unit of parallelization and a vertex program receives a directed graph and a vertex function as input.
Scatter-gather shares the same idea behind vertex-centric but separates message sending from message collecting and update application [stutz2010signal]. In the scatter phase, vertices execute a user-defined function that sends messages along outgoing edges. In the gather phase, each vertex collects received messages and applies a user-defined function to update vertex state.
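The separation between the two phases can be sketched in Python as below. This is a single-process illustration under assumed names (`scatter_gather`, `scatter`, `gather`), not the API of any of the systems discussed. The example in the usage note propagates the minimum vertex identifier, which labels connected components when every undirected edge is listed in both directions.

```python
def scatter_gather(out_edges, initial_state, scatter, gather, max_iters=100):
    """Run alternating scatter and gather phases until no vertex changes.

    out_edges: dict vertex -> list of neighbors reached by outgoing edges.
    initial_state: dict with one entry per vertex of the graph.
    scatter(state) -> message sent along each outgoing edge.
    gather(state, messages) -> new vertex state.
    """
    state = dict(initial_state)
    active = set(state)
    for _ in range(max_iters):
        if not active:
            break
        # Scatter phase: active vertices send messages along outgoing edges.
        inbox = {}
        for v in active:
            for dst in out_edges.get(v, []):
                inbox.setdefault(dst, []).append(scatter(state[v]))
        # Gather phase: each vertex collects its messages and updates its state.
        active = set()
        for v, messages in inbox.items():
            new_state = gather(state[v], messages)
            if new_state != state[v]:
                state[v] = new_state
                active.add(v)   # only changed vertices scatter in the next round
    return state
```

With `scatter=lambda s: s` and `gather=lambda s, msgs: min(s, min(msgs))`, and every vertex initialized to its own identifier, each vertex ends up holding the smallest identifier in its component.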
Gather-Sum-Apply-Scatter (GAS) was introduced by PowerGraph [Gonzalez:2012:PDG:2387880.2387883] and was aimed at solving the limitations encountered by vertex-centric or scatter-gather when operating on power-law graphs. The discrepancy between the ratios of high-degree and low-degree vertices leads to imbalanced computational loads during a superstep, with high-degree vertices being more computationally-heavy and becoming stragglers. GAS consists of decomposing the vertex program in several phases, such that computation is more evenly distributed across the cluster. This is achieved by parallelizing the computation over the edges of the graph. In the gather phase, a user-defined function is applied to each of the adjacent edges of each vertex in parallel.
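A minimal single-process sketch of this decomposition, using PageRank as the vertex program, is given below. It is not PowerGraph's API. The function name, the input layout and the fixed iteration count are assumptions of this example, and the phase semantics beyond gather follow the usual reading of the model: sum combines per-edge partial results with a commutative and associative operator, apply updates the vertex value, and scatter decides which neighbors to activate.

```python
from functools import reduce


def gas_pagerank(in_edges, out_degree, num_iters=20, damping=0.85):
    """PageRank written as gather, sum and apply phases (illustrative sketch).

    in_edges: dict vertex -> list of source vertices of its incoming edges.
    out_degree: dict vertex -> number of outgoing edges (one entry per vertex).
    """
    vertices = list(out_degree)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_iters):
        new_rank = {}
        for v in vertices:
            # Gather: one independent partial result per adjacent (incoming) edge.
            # Each of these could be computed on a different machine.
            partials = [rank[u] / out_degree[u] for u in in_edges.get(v, [])]
            # Sum: combine the partial results with a commutative, associative operator.
            total = reduce(lambda a, b: a + b, partials, 0.0)
            # Apply: update the vertex value from the combined result.
            new_rank[v] = (1.0 - damping) / n + damping * total
        # Scatter: omitted in this sketch, every vertex stays active each iteration.
        rank = new_rank
    return rank
```

Because the gather step produces one value per edge and the sum is order-independent, the work for a high-degree vertex can be split across machines instead of being handled by a single straggler.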
Subgraph-centric. The previous models are subjected to higher communication overheads due to being fine-grained. It is possible to use subgraph structure to reduce these overheads. In this category, the work of [kalavri2017high] denotes two subgraph-centric approaches: partition-centric and neighborhood-centric. Partition-centric instead of focusing on a collection of unassociated vertices, considers subgraphs of the original graph. Information from any vertex can be freely propagated within its physical partition, as opposed to the vertex-centric approach where a vertex only accesses the information of its most immediate neighbors. This allows for reduction in communication overheads. Ultimately, the partition becomes the unit of parallel execution, with each subgraph being exposed to a user function. This subgraph-centric approach is also known as think-like-a-graph [Tian:2013:TLV:2732232.2732238]. Neighborhood-centric, on the other hand, allows for a physical partition to contain more than one subgraph. Shared state updates exchange information between subgraphs of the same partition, with replicas and messages for sharing between subgraphs that aren’t in the same partition.
3.1 High-performance computing
The systems mentioned so far may be grouped together due to being loosely-coupled systems. For completeness, here we first mention relevant high-performance tightly-coupled systems:
CombBLAS
A parallel graph distributed-memory library offering linear algebra primitives based on sparse arrays for graph analytics. This system considers the adjacency matrix of the graph as a sparse matrix data structure. CombBLAS is edge-based in the sense that each element of the matrix represents an edge and the computation is defined over it. It decouples the parallel logic from the sequential parts of the computation and makes use of MPI. However, its MPI implementation does not take advantage of flexible shared-memory operations. Its authors targeted hierarchical parallelism of supercomputers for future work.
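The linear-algebra view of graph analytics can be illustrated with breadth-first search, where each level is obtained by multiplying the adjacency matrix by a frontier vector over a Boolean semiring. The sketch below uses plain Python dictionaries and sets as a stand-in for sparse arrays; it does not use CombBLAS, and the function name and data layout are assumptions of this example.

```python
def bfs_levels(adjacency, source):
    """Breadth-first search phrased as repeated sparse matrix-vector products.

    adjacency: dict row -> set of columns, a sparse Boolean adjacency matrix
    in which column j is present in row i when there is an edge i -> j.
    Returns a dict vertex -> BFS level for every vertex reachable from source.
    """
    level = {source: 0}
    frontier = {source}          # sparse Boolean vector of the current frontier
    depth = 0
    while frontier:
        depth += 1
        # Product over the (OR, AND) semiring: the union of the rows selected
        # by the non-zero entries of the frontier vector.
        reached = set()
        for i in frontier:
            reached |= adjacency.get(i, set())
        # Mask out vertices that already have a level.
        frontier = reached.difference(level)
        for j in frontier:
            level[j] = depth
    return level
```

Every non-zero matrix element touched in the product corresponds to one edge, which is the sense in which this formulation is edge-based.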
Parallel Boost Graph Library (PBGL) [gregor2005parallel]
It is an extension of Boost’s graph library. It is a distributed graph computation library, also offering abstractions over the communication medium (e.g. MPI). The graph is represented as a distributed adjacency list, which is distributed across multiple processors. In PBGL, vertices are divided among the processors, and each vertex’s outgoing edges are stored on the processor storing that vertex. PBGL was evaluated on a system composed of 128 compute nodes connected via Infiniband.
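The distributed adjacency list described above can be sketched as follows. This is plain Python for illustration and not the PBGL interface: the modulo ownership rule, the function names and the use of integer vertex identifiers are assumptions of this example.

```python
def modulo_owner(num_procs):
    """Hypothetical ownership rule: vertex v lives on processor v mod num_procs."""
    return lambda v: v % num_procs


def distribute_adjacency(edges, num_procs, owner):
    """Build one local adjacency list per processor.

    Each vertex is stored on its owner, together with all of its outgoing edges.
    """
    local = [dict() for _ in range(num_procs)]
    for src, dst in edges:
        local[owner(src)].setdefault(src, []).append(dst)
        local[owner(dst)].setdefault(dst, [])   # make sure dst exists on its owner
    return local


def count_remote_edges(local, owner):
    """Count edges whose target vertex is stored on a different processor."""
    remote = 0
    for rank, adjacency in enumerate(local):
        for targets in adjacency.values():
            remote += sum(1 for dst in targets if owner(dst) != rank)
    return remote
```

Edges counted by `count_remote_edges` are the ones that would require communication when an algorithm follows them, so the ownership rule directly affects communication volume.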
HavoqGT
A system with techniques for processing scale-free graphs using distributed memory. To handle the scale-free properties of the graph, it uses edge list partitioning to deal with high-degree vertices (hubs) and dummy vertices to represent them to reduce communication hot spots. HavoqGT allows algorithm designers to define vertex-centric procedures in what they name a distributed asynchronous visitor queue. This queue is part of an asynchronous visitor pattern designed to tackle load imbalance and memory latencies. HavoqGT targets supercomputers and clusters with local NVRAM.
These systems are hallmarks of high-performance computing solutions applied to graph processing. Their merits encompass algebraic decomposition of the major graph operations, implementing them and translating them across different homogeneous layers of parallelism (across cores, across CPUs). While they are able to achieve scalability in processing edges in the orders of trillions, their tight-coupling and specific hardware target place them outside the scope of candidate systems for our study. However, they retain their value as some of their conceptual ideas could be applied to existing systems which we detail next.
3.2 Relevant OLAP systems
We list here some of the most relevant state-of-the-art systems used for graph processing in the scope of OLAP. We first describe Giraph, a system designed for graph processing, followed by Flink and Spark, two general-purpose data processing systems with libraries to express graph computations.
Giraph [DBLP:books/sp/SOAK2016]
An open-source implementation of Pregel [Malewicz:2010:PSL:1807167.1807184], tailor-made for graph algorithms. It was created as an efficient and scalable fault-tolerant implementation on clusters with thousands of commodity machines, hiding implementation details underneath abstractions. Work has been done to extend Giraph from the think-like-a-vertex model to think-like-a-graph [Tian:2013:TLV:2732232.2732238]. It uses Hadoop’s MapReduce implementation to process graphs. It was inspired by the Bulk-Synchronous-Parallel model.
Giraph allows for master computation and sharded aggregators, has edge-oriented input, and also uses out-of-core computation, keeping a limited number of partitions in memory. Partitions are stored in local disks, and for cluster computing settings, the out-of-core partitions are spread out across all disks. Giraph’s authors use the concept of supersteps, which are sequences of iterations for graph processing.
In a superstep $S$, a user-supplied function is executed for each vertex (this can be done in parallel) that has a status of active. When $S$ terminates, all vertices may send messages which can be processed by user-defined functions at step $S+1$. Giraph attempts to keep vertices and edges in memory and uses only the network for the transfer of messages. Improving Giraph’s performance by optimizing its messaging overhead has also been studied [Liu:2016:GSO:2983323.2983726]. It is interesting to note that single-machine large-memory systems such as Ringo highlight the message overhead as one of the major reasons to avoid a distributed processing scheme.
Flink
Formerly known as Stratosphere, it is a tool which supports built-in iterations [carbone2015apache] (and delta iterations) to efficiently aid in graph processing and machine learning algorithms. It has a graph processing API called Gelly, which comes packaged with algorithms such as PageRank, Single-Source Shortest Paths and Community Detection, among others. Flink was built with the aim of supporting all data types and providing seamless code integration. It supports all Hadoop file systems as well as the Amazon S3 web service, among others. Delta iterations are quite relevant as they take advantage of computational dependencies to improve performance. Flink also has flexible windowing mechanisms to operate on incoming data (the windowing mechanism can also be based on user-specific logic). Researchers have also looked into extending its DataStream constructs and its streaming engine to deal with applications where the incoming flow of data is graph-based (https://github.com/vasia/gelly-streaming).
Spark
Spark has a graph processing library, GraphX [graphx]. It is a graph processing framework built on top of Spark, enabling low-cost fault-tolerance. The authors target graph processing by expressing graph-specific optimizations as distributed join optimizations and graph views’ maintenance. In GraphX, the property graph is reduced to a pair of collections. This way, the authors are able to compose graphs with other collections in a distributed dataflow framework. Operations such as adding additional vertex properties are then naturally expressed as joins against the collection of vertex properties. Graph computations and comparisons are thus an exercise in analyzing and joining collections.
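The idea of a property graph as a pair of collections can be sketched with ordinary Python collections. The code below is not the GraphX or Spark API: the function names and the dictionary and list layouts are assumptions of this example, and nothing here is distributed.

```python
def join_vertex_property(vertices, prop_name, prop_values, default=None):
    """Left join of the vertex collection against a (vertex id -> value) collection.

    vertices: dict vertex id -> dict of existing properties.
    Returns a new vertex collection with one extra property per vertex.
    """
    return {
        vid: {**attrs, prop_name: prop_values.get(vid, default)}
        for vid, attrs in vertices.items()
    }


def triplets(vertices, edges):
    """Join every edge with the properties of its source and destination.

    edges: list of (source id, destination id, edge properties).
    """
    return [(vertices[src], attrs, vertices[dst]) for src, dst, attrs in edges]
```

Adding an `age` property to a vertex collection is then a single call such as `join_vertex_property(vertices, "age", {1: 34})`, and the edge-with-endpoints view used by many graph algorithms is another join, here `triplets`.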
X-Stream [Roy:2013:XEG:2517349.2522740]
This system and its follow-up Chaos are in a separate group due to their novelty. X-Stream provided an alternative view to the traditional vertex-centric approach. It is based on considering computation from the perspective of edges instead of vertices, and its experiments optimized the use of storage I/O both locally and on the cloud. X-Stream is an open-source (https://github.com/epfl-labos/x-stream) system which introduced the concept of edge-centric graph processing via streaming partitions. X-Stream exposes a scatter-gather programming model (a model that on its own can be vertex, edge or partition-centric) that is edge-centric and was motivated by the lack of access locality when traversing edges, which makes it difficult to obtain good performance.
State is maintained in vertices. This tool uses the streaming partition, which works well with RAM and secondary (SSD and Magnetic Disk) storage types. It does not provide any way by which to iterate over the edges or updates of a vertex. A sequential access to vertices leads to random access of edges which decreases performance. X-Stream is innovative in the sense that it enforces sequential processing of edges (edge-centric) in order to improve performance.
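The edge-centric flavor of scatter-gather can be sketched as below, using connected-component labeling as the example computation. This is a single-process illustration and not X-Stream's implementation: streaming partitions and storage are not modelled, and the function name and input layout are assumptions of this example. Each undirected edge is expected to appear in both directions.

```python
def edge_centric_components(num_vertices, edges):
    """Label connected components by streaming over edges (illustrative sketch).

    State lives in the vertices (`label`), while both phases iterate
    sequentially over a list: first the edge list, then the update list.
    """
    label = list(range(num_vertices))   # vertex state: current component label
    changed = True
    while changed:
        # Scatter: one sequential pass over the edge list, emitting an update
        # (destination, candidate label) for every edge.
        updates = [(dst, label[src]) for src, dst in edges]
        # Gather: one sequential pass over the update list, applying each
        # update to the state of its destination vertex.
        changed = False
        for dst, candidate in updates:
            if candidate < label[dst]:
                label[dst] = candidate
                changed = True
    return label
```

Neither phase ever asks for the edges of a particular vertex, so the edge list is only read front to back, which is the access pattern that favors sequential bandwidth.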
Chaos [Roy:2015:CSG:2815400.2815408]
A system which had its foundations on X-Stream. On top of the secondary storage studies performed in the past, graph processing in Chaos achieves scalability with multiple machines in a cluster computing system. It is based on different functionalities: load balancing, randomized work stealing, sequential access to storage and an adaptation of X-Stream’s streaming partitions to enable parallel execution. Chaos uses an edge-centric Gather-Apply-Scatter (GAS) model of computing. It is composed of a storage sub-system and a computation sub-system. The former exists concretely as a storage engine in each machine. Its concern is that of providing edges, vertices and updates to the computation sub-system. Previous work on X-Stream highlighted that the primary resource bottleneck is the storage device bandwidth. In Chaos, the storage and computation engines’ communication is designed in a way that storage devices are busy all the time, thus optimizing for the bandwidth bottleneck.
The following OLAP graph processing systems were grouped together because each of the improvements they proposed are important concerns to be aware of in designing a graph processing system with the capabilities we described in Section 1.1. We believe each of their results and conclusions should be considered for a new system or paradigm that: maintains the graph in cluster memory (GraphX offers this albeit in a very limited fashion with cache functions over distributed datasets), whether resorting to compact representations or not; harnesses awareness of graph topology to minimize communication overheads (PowerLyra); ingests a stream of data with potentially-high throughput requirements (Kineograph); explores trade-offs between lowering result accuracy in exchange for avoiding or delaying computation (Tornado, KickStarter).
PowerLyra
A graph computation engine which adopts different partitioning and computing strategies depending on vertex types (https://github.com/realstolz/powerlyra). The authors note that most systems use a one-size-fits-all approach. They note that Pregel and GraphLab focus on hiding latency by evenly distributing vertices to machines, making resources locally accessible. This may result in imbalanced computation and communication for vertices with higher degrees. Another provided design example is that of PowerGraph and GraphX, which focus on evenly parallelizing the computation by partitioning edges among machines, incurring communication costs on vertices, even those with low degrees.
Kineograph [Cheng:2012:KTP:2168836.2168846]
A system which combines snapshots allowing full processing in the background and explicit alternative/custom functions that, besides assessing updates’ impact, also apply them incrementally, propagating their outcome across the graph. It is a distributed system to capture the relations in incoming data feeds, built to maintain timely updates against a continuous flux of data. Its architecture uses ingest nodes to register graph update operations as identifiable transactions, which are then distributed to graph nodes. Nodes of the latter type form a distributed in-memory key/value store. Kineograph performs computation on static snapshots, which simplifies algorithm design.
Tornado [Shi:2016:TSR:2882903.2882950]
A system for real-time iterative analysis over evolving data. It was implemented over Apache Storm and provides an asynchronous bounded iteration model, offering fine-grained updates while ensuring correctness. It is based on the observations that: 1) loops starting from good enough guesses usually converge quickly; 2) for many iterative methods, the running time is closely related to the approximation error. From this, an execution model was built where a main loop continuously gathers incoming data and instantly approximates the results. Whenever a result request is received, the model creates a branch loop from the main loop. This branch loop uses the most recent approximations as a guess for the algorithm.
KickStarter [Vora:2017:KFA:3037697.3037748]
Debuted a runtime technique for trimming approximation values for subsets of vertices impacted by edge deletions. The removal of edges may invalidate the convergence of approximate values pertaining to monotonic algorithms. KickStarter deals with this by identifying values impacted by edge deletions and adapting the network impacts before the following computation, achieving good results on real-world use-cases. Despite this, by focusing on monotonic graph algorithms, its scope is narrowed to selection-based algorithms. (For this class, updating a vertex value implies choosing a neighbor under some criteria.)
3.3 Relevant OLTP systems
JanusGraph [janus]
An open-source project licensed under the Apache License 2.0. A database optimized for storing and querying large graphs with (billions of) edges and vertices distributed across a multi-machine cluster. JanusGraph, which debuted in 2017, is based on the source Java code base of the Titan graph database project and is supported by the likes of Google, IBM and the Linux Foundation, to name a few. Like Titan, it supports Cassandra, HBase and BerkeleyDB. It was designed with a focus on scalability and it is in fact a transactional database aimed at handling many concurrent users, complex traversals and analytic queries. JanusGraph can integrate platforms such as Apache Spark, Giraph and Hadoop. It also natively integrates with the TinkerPop graph stack, supporting Gremlin applications, the query language and its graph server.
Neo4j [Webber:2012:PIN:2384716.2384777]
A NoSQL graph database implemented in Scala and Java. There is a Community Edition licensed under the free GNU General Public License (GPL) v3. Neo4j is optimized for highly-connected data. It relies on methods of data access for graphs without considering data locality. Neo4j’s graph processing consists of mostly random data access. For large graphs which require out-of-memory processing, the major performance bottleneck becomes the random access to secondary storage. The authors created a system which supports ACID transactions and high availability, with operations that modify data occurring within transactions to guarantee consistency. As previously mentioned, it uses the query language Cypher. Neo4j has a library offering many different graph algorithms (https://neo4j.com/developer/graph-algorithms/). As far as we know, Neo4j’s scale-out capabilities are only true for read operations. All writes are directed to the Neo4j cluster master, an architecture which has its limitations.
Google Cayley [cayley]
An open-source database behind Google’s Knowledge Graph, having been tested at least since 2014. It is a community-driven database written in Go, including a REPL, a RESTful API, a query editor and visualizer. Cayley supports multiple storage backends such as LevelDB, Bolt, PostgreSQL, MongoDB (distributed stores) and also an ephemeral in-memory storage. Cayley being distributed depends on the underlying storage being distributed as well. It was also in active development as of 2019 (https://github.com/cayleygraph/cayley).
AllegroGraph
A commercial graph database [allegrograph] which supports many programming languages (https://franz.com/agraph/allegrograph/). It is in a category of its own, as the numbers boasted on the company website were obtained on uniform machine architectures (e.g., 240 core Intel x5650, 2.66GHz, 1.28TB RAM, 88TB Disk) and not clusters.
Weaver
An open-source project under a custom permissive license. The authors describe it as a distributed graph database for efficient, transactional graph analytics. They introduced the concept of refinable timestamps. It is a mechanism to obtain a rough ordering of distributed operations if that is sufficient, but also fine-grained orderings when they become necessary. It is capable of distributing a graph across multiple shards while supporting concurrency. Refinable timestamps allow for the existence of a multi-version graph: write operations use their timestamps as a mark for vertices and edges. This allows for the existence of consistent versions of the graph so that long-running analysis queries can operate on a consistent version of the graph, as well as historical queries. Weaver is written in C++, offering binding options for different languages. We attempted to test this platform in a comparative study, which led to discovering some of its technical limitations [2017arXiv170310628C].
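The multi-version idea can be sketched in Python as below. This is a deliberately simplified, single-process illustration and not Weaver's design or API: plain integer timestamps stand in for refinable timestamps, only edges are versioned, and the class and method names are assumptions of this example.

```python
class MultiVersionGraph:
    """Edges marked with the timestamps of the writes that created or deleted them."""

    def __init__(self):
        # (src, dst) -> list of [created_ts, deleted_ts or None] validity intervals
        self._intervals = {}

    def add_edge(self, src, dst, ts):
        """Record that the edge exists from timestamp `ts` onwards."""
        self._intervals.setdefault((src, dst), []).append([ts, None])

    def delete_edge(self, src, dst, ts):
        """Close the most recent validity interval of the edge at timestamp `ts`."""
        self._intervals[(src, dst)][-1][1] = ts

    def neighbors(self, src, as_of):
        """Neighbors of `src` in the version of the graph visible at `as_of`."""
        visible = []
        for (s, dst), intervals in self._intervals.items():
            if s != src:
                continue
            if any(start <= as_of and (end is None or as_of < end)
                   for start, end in intervals):
                visible.append(dst)
        return sorted(visible)
```

A long-running analysis that fixes `as_of` when it starts keeps reading the same version of the graph even while later writes arrive, and a historical query is simply a read with an older `as_of`.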
Dgraph is a distributed graph database written in Go (https://dgraph.io/), offering horizontal scaling and ACID properties. It is built to reduce disk seeks and minimize network usage footprint in cluster scenarios. It automatically moves data to rebalance cluster shards. It uses a simplified version of the GraphQL query language. Support for Gremlin or Cypher has been mentioned for the future but will depend on community efforts (https://docs.dgraph.io/faq/). Dgraph has a scalability advantage over Neo4j: the latter may have multiple servers, but they are merely replicas, while the former can grow horizontally (vertical scaling is expensive).
4 Analysis and Discussion
We provide an analysis of the most relevant of the aforementioned systems and their relationships. For completeness, we also include in Figure 1 some systems which were not detailed in Section 3. They are DataStax Enterprise Graph, InfiniteGraph, Graph Engine, and Weaver. As stated, we depict the fact that the Titan database has led to two active projects, JanusGraph and DataStax Enterprise Graph.
4.1 Bridging the gap for OLAP systems
We have presented members of a rich graph processing landscape. Many of the systems we discussed pioneered new approaches which were adopted by more popular ones. For instance, Giraph, as an instance of Pregel, has some of its concepts present in Flink and Spark. Systems such as KickStarter, Kineograph and Tornado, although their source is not available as far as we know, validated approaches to dimensions such as stream processing and expressing approximations over graphs. Despite the existence of all these systems, we select candidate systems primarily on the criteria of active open-source communities and continuous innovation. To our knowledge, Flink and Spark are the most relevant in this respect. Notwithstanding the importance of other OLAP graph processing systems, these two are particularly interesting for potential developments due to their large and active communities as well as the higher-level libraries built on them (e.g., GRADOOP and GraphFrames). While Table 1 shows that Giraph remains active, its number of contributors and its activity are much lower.
Flink was designed from the start as a stream-processing engine with rich semantics. It supports batch and stream processing through its DataSet and DataStream APIs, respectively. The internal implementations of these APIs are separate. For batch processing, it has libraries for machine learning and also for graph processing (Gelly). Unfortunately, processing evolving graphs with this library falls under the read-eval-write loop, because Gelly is implemented over the DataSet API. As it does not use the DataStream API, all the window semantics of that API become unusable. While Gelly has seen a recent freeze regarding new features (most algorithms were last changed one or two years ago, with the most recent updates dating back six months and touching on performance improvements), it has been used in GRADOOP, which is an open-source distributed graph analytics research framework [DBLP:journals/pvldb/JunghannsKTGPR18] under active development. GRADOOP provides an even higher level of abstraction for expressing graph manipulation.
Spark was designed as the next-generation big data processing system of its time. As stream processing gained additional attention, support for it was implemented as the Spark Streaming library. Stream and batch processing have different APIs which were implemented differently. Its graph processing library, GraphX, was built over the system’s batch processing API, like Flink’s Gelly, and therefore suffers from the same previously mentioned limitations. A higher-level library, GraphFrames, was created to extend the functionalities of GraphX while harnessing Spark’s DataFrame API. It is relevant to mention that in GraphX, there is a method to cache the results of a dataflow computation so that they can be reused. While not a solution to those limitations, it avoids the recalculation of results and allows the flow of execution to skip intermediate I/O overheads.
We display in Figure 2 parallels between Flink, Spark and the graph processing ecosystems built on top of them. Gelly is implemented in Java (https://github.com/apache/flink/tree/master/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph) and expresses building-blocks of graph computation in dataflow programming over Flink’s DataSet API (https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/). GRADOOP extends it with additional specialized operators such as a graph pattern matching operator (which abstracts a cost-based query engine) and a graph grouping operator (implemented as a composition of map, filter, group and join transformations on Flink’s DataSet). GRADOOP also adopts the Cypher query language (typically found in OLTP graph systems like Neo4j) to express logic that is translated to the relational algebra that underlies Flink’s DataSet [DBLP:conf/grades/JunghannsKAPR17].
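The idea of a graph grouping operator composed from map, filter, group and join steps can be sketched as follows. This is an illustration only, not GRADOOP's API: plain Python collections stand in for Flink's DataSet, the function name `group_graph` is hypothetical, and we assume grouping by vertex label with a simple count as the aggregate.

```python
from collections import Counter
from typing import Dict, List, Tuple

Vertex = Tuple[int, str]  # (vertex id, label)
Edge = Tuple[int, int]    # (source id, target id)


def group_graph(
    vertices: List[Vertex], edges: List[Edge]
) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Summarize a graph by vertex label.

    Returns one super-vertex per label (with a vertex count) and one
    super-edge per (source label, target label) pair (with an edge count).
    """
    # map: vertex id -> grouping key (here, the label)
    label_of: Dict[int, str] = dict(vertices)

    # group + aggregate: count the vertices that share each label
    super_vertices = Counter(label for _, label in vertices)

    # filter: drop dangling edges whose endpoints are not in the vertex set
    valid_edges = [(s, t) for s, t in edges if s in label_of and t in label_of]

    # join each edge with the labels of its endpoints, then group + aggregate
    super_edges = Counter((label_of[s], label_of[t]) for s, t in valid_edges)

    return dict(super_vertices), dict(super_edges)
```

For example, grouping two `Person` vertices and one `City` vertex connected by three edges would produce two super-vertices and the super-edges between their labels, which is the kind of summary graph a grouping operator is meant to return.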
Gelly’s equivalent in Spark is GraphX, implemented in Scala (https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx). A look at its implementation reveals that it has fewer high-level operations than Gelly. Effectively, without simulating some of Gelly’s API, equivalent programs in GraphX tend to be more verbose due to the lack of syntactic sugar. Vertices and edges are manipulated by using Spark’s Resilient Distributed Datasets (RDDs), which can be viewed as a conceptual precursor to Flink’s DataSet. Spark also offers the DataFrame API to enable tabular manipulation of data. GraphFrames is another graph processing library for Spark. While it has interoperability and a certain overlap with the functionality offered in GraphX, it integrates the tabular perspective supported by Spark’s DataFrame API and also supports performing traversal-like queries of the graph via SparkSQL. In this way, GraphFrames provides graph analytics capabilities in Spark much the same way GRADOOP does in Flink.
Overall, Spark and Flink are general-purpose dataflow processing systems with graph libraries. Due to the size of their communities and their activity as open-source projects, they become very interesting target candidates to incorporate the overlapping concerns of stream semantics, graph evolution and maintenance across cluster memory. These concerns would have to be implemented across different components of the systems, from the way dataflow jobs are broken down into tasks, to task-scheduling itself, all the way to job execution status (Spark and Flink have state machines to control the evolution of a job). Orthogonally to this, a new logic would be needed to avoid triggering massive communication overheads when the topology of the maintained graph (in the cluster) changes due to the windows of arriving updates. When it becomes apparent that a vertex in the graph is exhibiting scale-free properties (by virtue of its degree growing), special strategies must be taken to ensure stability of the graph maintained in the cluster. The results achieved in PowerLyra provide some preliminary insights on how to approach this challenge.
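One way to picture a special strategy for high-degree vertices is a degree-threshold placement rule, loosely in the spirit of the differentiated treatment of low-degree and high-degree vertices explored in PowerLyra. The sketch below is a simplification under stated assumptions and not PowerLyra's implementation: the function `place_edges`, the threshold parameter and the modulo-based placement are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source id, target id)


def place_edges(
    edges: List[Edge], num_parts: int, threshold: int
) -> Dict[int, List[Edge]]:
    """Assign each edge to a partition based on the in-degree of its target."""
    in_degree = Counter(dst for _, dst in edges)
    parts: Dict[int, List[Edge]] = {p: [] for p in range(num_parts)}

    for src, dst in edges:
        if in_degree[dst] > threshold:
            # High-degree target: spread its in-edges across partitions
            # by source, so no single partition holds the whole hub.
            part = src % num_parts
        else:
            # Low-degree target: keep all its in-edges in one partition,
            # which keeps its computation local.
            part = dst % num_parts
        parts[part].append((src, dst))

    return parts
```

Here the in-degrees are computed once over a static edge list. In an evolving graph, a vertex could cross the threshold as windows of updates arrive, and its edges would then need to be migrated, which is precisely the communication overhead the discussion above warns about.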
4.2 Bridging the gap for OLTP systems
The 2010s saw the launch of many open-source and commercial graph database technologies. Overall, they focus on aspects of efficient low-level graph representation, with some offering only vertical scaling (Neo4j) while others provide horizontal scaling (sharding with JanusGraph, Dgraph, ArangoDB and Cayley). Some of these database technologies delegate the ability of sharding to an underlying interchangeable storage technology: for example, those derived from Titan can switch between HBase and Cassandra.
The graph dynamism in these technologies is inherent in the fact that transactions may be applied to update the graph. However, as far as we know, none of the OLTP systems detailed herein considers stream semantics or time-bounded conditions in the way the graph can evolve and algorithm results may be updated. We envision the possibility of enriching graph databases with such semantics (e.g. taking inspiration from systems such as BlinkDB [agarwal2013blinkdb]).
To this end, under the OLTP category, we focus on systems with a high number of algorithms (Neo4j), the ability to plug in different distributed back-ends to support the graph database semantics (JanusGraph) and inherent sharding capabilities (e.g. the open-source Dgraph). Neo4j’s biggest limitation is the lack of graph sharding. For Cayley, we did not find any list of algorithms implemented over it. Still, like JanusGraph, it supports sharding by delegating that responsibility to the underlying storage module. The ArangoDB website claims (https://docs.arangodb.com/3.4/Manual/Scaling/) that it automatically performs sharding of the data. Together with its number of algorithms, this also makes it a valid candidate for developing new approaches to what is discussed in this document. We do not focus on Weaver because it is not active. AllegroGraph is unsuitable as it is neither distributed nor open-source. Dgraph is open-source, supports sharding and, although it does not have many graph algorithms, has an active open-source community.
Regarding Neo4j, while its most relevant capabilities are only available in the Enterprise Edition (such as storing more than 34 billion vertices in a graph), its active community and broadening scope of use cases make it interesting. If Neo4j’s development is invested in scale-out capabilities for sharding a graph across multiple cluster nodes (increasing write performance compared to read operations), it will be closer to the possibility of offering a graph that is in fact maintained in a cluster. By using smart caching mechanisms, even if some storage I/O is necessary, it could become a strong candidate to bridge the best features of OLAP and OLTP systems. Access latency to secondary storage in this scenario would be further mitigated due to the fact that the Neo4j process in each worker node could be responsible for both distributed graph processing and mediating I/O access.
Moving towards a system that supports executing over an evolving graph that is quickly accessible on OLTP-type systems would require several key steps: incorporating stream processing windowing semantics in the API; exploiting low-level graph representation optimizations (as happens in Neo4j); and researching optimal strategies of graph sharding coupled with caching mechanisms (JanusGraph, ArangoDB and Dgraph already have sharding). Of the latter, JanusGraph and Dgraph already have an active community, making them relevant targets to apply our vision.
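As a rough illustration of what windowing semantics over graph updates could look like, the sketch below groups a timestamp-ordered stream of edge updates into tumbling (consecutive, non-overlapping) windows and applies each window to an in-memory adjacency structure. It does not correspond to the API of any of the systems discussed: the `Update` tuple layout and the functions `tumbling_windows` and `apply_window` are hypothetical, and the stream is assumed to arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple

# (timestamp, "add" or "remove", source vertex, target vertex)
Update = Tuple[int, str, str, str]


def tumbling_windows(
    updates: Iterable[Update], size: int
) -> Iterator[Tuple[int, List[Update]]]:
    """Yield (window index, updates) for consecutive windows of `size` time units.

    Windows that contain no updates are skipped.
    """
    current: Optional[int] = None
    batch: List[Update] = []
    for update in updates:
        window = update[0] // size
        if current is not None and window != current:
            yield current, batch
            batch = []
        current = window
        batch.append(update)
    if batch:
        yield current, batch


def apply_window(adjacency: Dict[str, Set[str]], batch: List[Update]) -> None:
    """Apply one window of edge updates to the adjacency sets, in place."""
    for _, op, src, dst in batch:
        if op == "add":
            adjacency.setdefault(src, set()).add(dst)
        else:
            adjacency.get(src, set()).discard(dst)
```

A caller would iterate over `tumbling_windows(stream, size=10)`, call `apply_window` for each batch, and then recompute or incrementally refresh algorithm results once per window rather than once per individual transaction.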
This survey aimed to explain the landscape of graph processing and important vectors of research. We conclude by detailing the efforts we have undertaken so far to bridge the gap in the scope of OLAP systems.
Planning implementation on dataflow systems: Flink. We have studied the Flink system as a candidate to implement the vision we described in this document. Maintaining data for iterative computation is supported in the batch processing DataSet API, but manipulating data streams with rich windowing and data aggregation logic is exclusive to the DataStream API.
Analysis of mailing list discussions and ongoing brainstorms (Flink design proposal for unified stream and batch processing: https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit#heading=h.ob9i0lcn7ulz) led us to the conclusion that Flink, if it unifies stream and batch processing at all, does so only at a very low implementation level. The contract offered to programmers in the form of the aforementioned APIs establishes a clear separation between them.
We see the lack of representation of this use-case in the Flink API as another indicator that this vector of research in graph processing systems is still in early development. To stimulate the development of Flink in the direction of this use-case, we proposed a feature request using the project’s open-source collaboration system, JIRA (issue FLINK-10867: https://issues.apache.org/jira/browse/FLINK-10867?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel). Some initial positive feedback was received.
On the needs of graph processing users. A recent survey on the needs and challenges of graph processing [Sahu:2017:ULG:3186728.3164139] highlighted scalability and visualization as the most pressing challenges in industry and academia. In particular, the authors identify scalability, the ability to process very large graphs efficiently, as the biggest limitation of existing software.
Finishing remarks. The point between the capabilities of OLAP and OLTP systems represented in the middle of Figure 1 is the ideal target of bridging from either type of system. The main factor of dynamism is a need in both OLAP and OLTP systems, for each in its own particular way, but generally facing I/O overhead as an obstacle. Regarding the stream windowing semantics, in OLAP systems we saw that the APIs must be integrated in the existing graph processing libraries. This same type of semantics is completely missing in the OLTP systems as far as we are aware. Lastly, we reiterate that for any instance of a future research vector towards this middle-ground, the architectures of Kineograph, KickStarter and Tornado are important to keep in mind due to how they approach different aspects of the ideal distributed system, such as how to strategically partition the graph across the cluster, and how the question of data ingestion should be posed in relation to the maintenance of the graph.