Science is, of course, driven by observation. It is now also becoming ever more data driven. Some of the datasets involved are unimaginably large. The data is often wildly heterogeneous, and rarely as well structured as in business applications. This demands new skills, methods, and approaches of scientists, and challenges computer scientists to devise new data models, query languages, systems, and tools that better support scientific workloads. Graph-like data has become prevalent in scientific data stores and elsewhere. The data-science research community has begun to focus on how best to support the management and analysis of graph data. One data model for graph databases is the Resource Description Framework (RDF) (W3C, 2004), paired with the SPARQL query language (Eric Prud’hommeaux, 2008). These have evolved as W3C standards, initially to address the Semantic Web. An RDF store conceptually consists of a set of triples that represent a directed, edge-labeled multi-graph. The triple ⟨s, p, o⟩ represents the directed, labeled edge from subject node “s” to target node “o” with label (predicate) “p”. In RDF, nodes have unique identity. The semantics, however, is carried by the labels and by how the nodes are connected. The UniProt (UniProt, 2019) SPARQL endpoint (UniProt Endpoint, 2020), for example, consists of 63,376,853,475 triples as of this writing. UniProt (Universal Protein Resource) is a freely accessible, popular repository of protein data. The SPARQL query language provides a formal way to query such graph databases. Queries can be thought of as small graphs themselves, so-called query graphs. In a conjunctive query (CQ), the “nodes” are the query’s binding variables and the “edges” between them are the labels to be matched. An “answer” to a CQ on a data graph (that is, the graph database), denoted as G, is a homomorphic embedding of the query graph into the data graph that matches the query’s labels to the data graph’s labeled edges.
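To make these definitions concrete, the following is a minimal, illustrative sketch in Python (the toy graph, the variable names, and the brute-force evaluator are ours for exposition, not part of any RDF or SPARQL library): a data graph as a set of ⟨s, p, o⟩ triples, and a CQ evaluated by testing every candidate variable binding for a homomorphic embedding.

```python
from itertools import product

# A toy data graph: a set of (subject, predicate, object) triples.
G = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
}

def evaluate_cq(query_edges, graph):
    """Brute-force CQ evaluation: try every assignment of graph nodes to
    query variables and keep those whose edges all appear in the graph."""
    variables = sorted({v for s, _, o in query_edges for v in (s, o)})
    nodes = {s for s, _, _ in graph} | {o for _, _, o in graph}
    answers = []
    for assignment in product(nodes, repeat=len(variables)):
        binding = dict(zip(variables, assignment))
        if all((binding[s], p, binding[o]) in graph for s, p, o in query_edges):
            answers.append(tuple(binding[v] for v in variables))
    return answers

# CQ: ?x knows ?y, and both work at the same place ?z.
cq = [("?x", "knows", "?y"), ("?x", "worksAt", "?z"), ("?y", "worksAt", "?z")]
print(evaluate_cq(cq, G))  # [('alice', 'bob', 'acme')]
```

Each answer tuple binds the query variables (here ?x, ?y, ?z) to node IDs; a real evaluator would, of course, use joins and indexes rather than enumeration.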
An answer is then a tuple of node IDs as a binding of the query’s node variables. As such, each answer can be considered a sub-graph matched in the data graph. A CQ can be quite expensive to evaluate and can require extreme resources, given both the potentially immense size of the data graph and the relative complexity of the CQ’s query graph. The challenge is to reduce the expense and the needed machinery. We present a novel approach to query optimization and evaluation for CQs that consists of two parts: 1) factorization; and 2) cost-based plan enumeration via dynamic programming. The answer set of a CQ is a set of embeddings. This in itself is not a graph. Instead, as an intermediate step, we find an answer graph, the subset of edges from the data graph that suffices to compose the CQ’s embeddings. A number of RDF systems have been developed with quite different architectures. These include triple stores (e.g., RDF-3X (Neumann and Weikum, 2008)), property tables (e.g., BitMat (Atre et al., 2010)), column-based stores (e.g., (Abadi et al., 2007)), and graph-based stores (e.g., gStore (Zou et al., 2011)). Our answer-graph approach is generic and can be implemented within any of these architectures. We posed this idea initially in (Godfrey et al., 2017). We develop it here, and demonstrate its key advantages via our prototype system, WireFrame, and a micro-benchmark.
2. The Answer-graph Approach
In (Bakibayev et al., 2012), the authors introduce the concept of factorization as a query-optimization technique for relational databases. Their technique is designed, and works exceptionally well, for schemas and queries in which cross products of projections of the answer tuples all show up as answer tuples. This happens, for instance, in schemas not in fourth normal form. Evaluating these projected tuples first and cross-producting them later can be a much more efficient strategy. Deciding how best to factorize—how to project into sub-tuples—is difficult, however. For CQs, this last part is trivial: the factorization of the embedding tuples is fully down to component node pairs, corresponding to the labeled edges. This is our answer graph. (This is demonstrably true when the CQ is tree shaped; it is arguable when the CQ has cycles. In the latter case, the factorization can be characterized as projections to tuples of node pairs and node triples (triangles).) While factorization is sometimes a significant win for evaluating relational queries, it is virtually always a win for evaluating graph CQs. An answer graph, AG, for a CQ is a subset of the data graph that suffices to compute the embeddings of the CQ. We call the minimum such subset the ideal answer graph, iAG. The iAG is often quite small: significantly smaller than the set of embeddings, and vastly smaller than the data graph G. Thus, evaluating a CQ’s embeddings in two steps—first, find its iAG; then compose the embeddings (which we call defactorization) from the iAG rather than from G—can be significantly more efficient. Consider the data graph G and the chain query Q[C] in Fig. 1, which finds all node-tuples ⟨w, x, y, z⟩ from G such that ⟨w, x⟩ is connected by an edge labeled A, ⟨x, y⟩ by B, and ⟨y, z⟩ by C. Due to multiplicity from A-edges fanning in to, and C-edges fanning out of, B values, the embedding set is twelve tuples. Meanwhile, our answer graph consists of eight labeled node pairs (shown in red).
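The arithmetic behind this gap can be reproduced with a small sketch (the specific fan-in/fan-out graph below is our own illustration, not necessarily the one drawn in Fig. 1): three A-edges fan in to a single B-source, and four C-edges fan out of the B-target, so the embedding count multiplies while the answer-graph edge count only adds.

```python
from itertools import product

# Our own illustrative data graph: three A-edges fan in to x0,
# one B-edge, and four C-edges fan out of y0.
A = [(f"w{i}", "x0") for i in range(3)]   # 3 edges labeled A
B = [("x0", "y0")]                        # 1 edge labeled B
C = [("y0", f"z{i}") for i in range(4)]   # 4 edges labeled C

# Embeddings of the chain CQ ?w -A-> ?x -B-> ?y -C-> ?z.
embeddings = [(w, x, y, z)
              for (w, x), (x2, y), (y2, z) in product(A, B, C)
              if x == x2 and y == y2]

# The answer graph keeps only the labeled edges that participate
# in some embedding.
answer_graph = ({("A", w, x) for (w, x, _, _) in embeddings}
                | {("B", x, y) for (_, x, y, _) in embeddings}
                | {("C", y, z) for (_, _, y, z) in embeddings})

print(len(embeddings), len(answer_graph))  # 12 8
```

The embedding count is the product of the fan degrees (3 × 1 × 4 = 12), while the answer graph grows only with their sum (3 + 1 + 4 = 8); at realistic scales the product dwarfs the sum.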
Such differences are greatly magnified at larger scale. Our answer-graph approach affords us a second key advantage: we can devise a cost-based query optimizer based on dynamic programming to construct a query plan. A plan for us is simply a specified order in which to evaluate the CQ’s query edges against matching answer-graph edges. Our evaluation strategy for such plans is explained next, and our optimizer for choosing plans in Section 4.
3. The Evaluation Model
Our evaluation model for CQs is thus two-phase: answer-graph generation and embedding generation. Answer-graph generation. For each query edge of the plan, in turn, our answer graph (AG) is populated with the matching labeled edges from G that meet the join constraints with the current state of the AG. Call this an edge-extension step. Then nodes in the AG that failed to extend are removed. This “node burnback” cascades. Consider the CQ with query edges ⟨?w, A, ?x⟩, ⟨?x, B, ?y⟩, and ⟨?y, C, ?z⟩. Fig. 2 illustrates the interleaved edge-extension and burnback steps over a sample data graph G. Embedding generation. The embedding tuples are then generated over the answer graph by joining the answer edges appropriately. Given the ideal answer graph and an acyclic CQ, the order in which we join is immaterial: no n-ary tuple is ever eliminated during a join with a next query edge from the AG. This step is often quite fast, given the AG is small. Evaluating this directly from the data graph G, on the other hand—which is what other evaluation methods for CQs do—can be exceedingly expensive. (Fig. 3 illustrates, comparing a standard evaluation with ours.)
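The answer-graph generation phase can be sketched for the chain case as follows (the function, its representation of AG “columns”, and the sample graph are our own simplification; for a chain evaluated left to right, one right-to-left pass per step suffices for burnback to cascade):

```python
def answer_graph_chain(plan, data):
    """Sketch of answer-graph generation for a chain CQ.
    plan: edge labels in evaluation order (left to right);
    data: maps a label to its set of (src, tgt) node pairs.
    Returns one AG "column" of surviving answer edges per query edge."""
    ag = []
    for i, label in enumerate(plan):
        edges = set(data.get(label, set()))
        if i > 0:
            # Edge extension: keep only edges that join the current frontier.
            frontier = {t for (_, t) in ag[-1]}
            edges = {(s, t) for (s, t) in edges if s in frontier}
        ag.append(edges)
        # Node burnback: remove edges whose target node failed to extend;
        # one right-to-left pass cascades fully for a chain.
        for j in range(len(ag) - 1, 0, -1):
            alive = {s for (s, _) in ag[j]}
            ag[j - 1] = {(s, t) for (s, t) in ag[j - 1] if t in alive}
    return ag

# ?w -A-> ?x -B-> ?y -C-> ?z: the A-edge into x2 is burned back,
# since x2 has no outgoing B-edge.
G = {"A": {("w1", "x1"), ("w2", "x2")},
     "B": {("x1", "y1")},
     "C": {("y1", "z1"), ("y1", "z2")}}
print(answer_graph_chain(["A", "B", "C"], G))
```

In the sample run, edge ⟨w2, x2⟩ survives the A-extension but is removed by burnback after the B-extension, since x2 fails to extend.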
4. The Planners
I. The Answer-graph Planner. Plan Cost. The edge walk
is our unit for estimating a plan’s cost: the retrieval of a matching edge from G. To estimate the number of edge walks, node and edge cardinality estimations are made for each successive edge extension. Note that the cost of node burnback is amortized: every edge added that does not survive to the AG is at some point removed. WireFrame employs cardinality estimators drawn from a catalog consisting of 1-gram and 2-gram edge-label statistics computed offline (Yakovets, 2016; Yakovets et al., 2016; Mannino et al., 1988; Christodoulakis, 1989). The Edgifier. A plan is a sequence of the CQ’s query edges to be materialized. We employ a bottom-up, dynamic-programming algorithm to construct the edge order based on cost estimation (which relies upon the cardinality estimations). When the query graph of a CQ has cycles—a cyclic query—there is an additional part to planning. Node burnback suffices to generate the ideal answer graph for acyclic queries, but not for cyclic ones. The example in Fig. 4 illustrates why. Spurious edges—e.g., ⟨1, 6⟩ and ⟨5, 2⟩—can remain that do not participate in any embedding. Culling spurious edges requires an edge-burnback procedure in addition to node burnback. This requires that the CQ’s cycles have been triangulated; node triples are materialized in addition to the node doubles (the edges) during evaluation. Triangulation is the choice of which additional “query edges”, which we call chords, to add.
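A bottom-up, dynamic-programming edge ordering of this flavor can be sketched as follows. This is a Held-Karp-style enumeration over (subset of chosen edges, last edge) states; the cost model here is a deliberately simplified stand-in for the 1-gram/2-gram estimators (the statistics below are hypothetical numbers, and the function name is ours):

```python
from itertools import combinations

def best_edge_order(edges, card1, card2):
    """DP over (chosen subset, last edge) states. Stand-in cost model:
    card1[e] estimates the matches of query edge e alone (a 1-gram
    statistic); card2[(e, f)] is a selectivity factor for evaluating
    f right after e (a 2-gram statistic), defaulting to 1.0."""
    dp = {(frozenset([e]), e): (card1[e], (e,)) for e in edges}
    for size in range(2, len(edges) + 1):
        for subset in map(frozenset, combinations(edges, size)):
            for last in subset:
                rest = subset - {last}
                best = None
                for prev in rest:
                    cost, order = dp[(rest, prev)]
                    # Extending by `last` walks its candidate edges,
                    # scaled by the 2-gram factor against `prev`.
                    cand = (cost + card1[last] * card2.get((prev, last), 1.0),
                            order + (last,))
                    if best is None or cand < best:
                        best = cand
                dp[(subset, last)] = best
    full = frozenset(edges)
    return min(dp[(full, e)] for e in edges)

# Hypothetical statistics for the chain A-B-C.
card1 = {"A": 100, "B": 10, "C": 50}
card2 = {("A", "B"): 0.05, ("B", "A"): 0.1, ("B", "C"): 0.2, ("C", "B"): 0.1}
print(best_edge_order(["A", "B", "C"], card1, card2))  # (61.0, ('C', 'B', 'A'))
```

Keying the state on the last edge chosen, not just the subset, matters: the cheapest order for a subset may not be a prefix of the cheapest order overall when extension costs depend on the preceding edge.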
The Triangulator. For cyclic CQs, in addition to the query-edge enumeration, cycles in the query graph of length greater than three are triangulated by adding chord edges. We employ a bottom-up, dynamic-programming algorithm to generate a bushy plan that dictates the order and choice of chord bisection of cycles (down to triangles). During evaluation, a chord is maintained as the intersection of the materialized joins of the opposite two edges for each triangle in which it participates. The chordified plan, when executed with node burnback, guarantees that the node sets will always be minimal. A correct answer graph, AG, will be found, but it is no longer guaranteed to be ideal, because spurious edges may remain in the AG, as demonstrated in Fig. 4. The embeddings can, of course, be found from the non-ideal AG so established. Edge Burnback. With the addition of an edge-burnback mechanism at runtime, we can again guarantee that we find the ideal AG (iAG). This works by checking the chords’ materializations to chase what needs to be removed on cascade, ensuring that spurious edges are removed. The additional overhead of edge burnback must be balanced against the benefit of obtaining the iAG versus a larger, non-ideal AG. This is work in progress; in our experiments, our evaluation over cyclic CQs is without edge burnback.
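Why node burnback alone falls short on cycles can be demonstrated with a small instance in the spirit of Fig. 4 (the graph below is our own construction, not the figure’s): every edge is locally consistent at both endpoints, so the node-level fixpoint removes nothing, yet two edges participate in no embedding.

```python
from itertools import product

# Our own cyclic instance. The CQ is the 4-cycle
# ?w -A-> ?x -B-> ?y -C-> ?z -D-> ?w.
labels = ["A", "B", "C", "D"]
data = {
    "A": {(1, 2), (5, 6), (1, 6), (5, 2)},
    "B": {(2, 3), (6, 7)},
    "C": {(3, 4), (7, 8)},
    "D": {(4, 1), (8, 5)},
}

# Node burnback alone: keep an edge while both endpoints remain locally
# consistent with the neighbouring query edges (a fixpoint computation).
ag = {l: set(data[l]) for l in labels}
changed = True
while changed:
    changed = False
    for i, l in enumerate(labels):
        prev, nxt = labels[i - 1], labels[(i + 1) % 4]
        keep = {(s, t) for (s, t) in ag[l]
                if any(t2 == s for (_, t2) in ag[prev])
                and any(s2 == t for (s2, _) in ag[nxt])}
        if keep != ag[l]:
            ag[l], changed = keep, True

# Ground truth: the embeddings, and the ideal AG built from them.
embeddings = [(w, x, y, z)
              for (w, x), (x2, y), (y2, z), (z2, w2)
              in product(data["A"], data["B"], data["C"], data["D"])
              if x == x2 and y == y2 and z == z2 and w == w2]
ideal = ({("A", w, x) for (w, x, _, _) in embeddings}
         | {("B", x, y) for (_, x, y, _) in embeddings}
         | {("C", y, z) for (_, _, y, z) in embeddings}
         | {("D", z, w) for (w, _, _, z) in embeddings})

ag_edges = {(l, s, t) for l in labels for (s, t) in ag[l]}
print(sorted(ag_edges - ideal))  # [('A', 1, 6), ('A', 5, 2)]
```

The two cycles 1→2→3→4→1 and 5→6→7→8→5 are genuine embeddings, so every node survives; but the “crossing” A-edges ⟨1, 6⟩ and ⟨5, 2⟩ close no cycle, and only edge-level reasoning (edge burnback over triangulated chords) can cull them.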
[Table 1: query execution times for the snowflake-shaped and diamond-shaped queries on PG, WF, VT, MD, and NJ, together with |iAG| (snowflake) / |AG| (diamond) and |Embeddings| counts.]
II. The Embedding Planner. Plan Cost. When generating the embeddings for an acyclic CQ from its iAG, the order in which we join (connected) answer edges is immaterial. As the n-ary tuples are extended, no intermediate results are ever lost. Thus, for this, no planning is required. The Defactorizer. On the other hand, when the CQ is cyclic or the provided AG is non-ideal, intermediate results can be lost, so the join order matters. We call this process defactorization. Alternative plans for embedding materialization are synonymous with choosing this join order. It is possible to do this again via a cost-based approach, with a bottom-up, dynamic-programming algorithm that uses our catalog statistics.
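For the acyclic case, defactorization reduces to joining the answer-graph columns in any connected order; a minimal sketch (the function and the column representation are ours, continuing the earlier chain example):

```python
def defactorize_chain(ag):
    """Compose embeddings from a chain CQ's answer-graph columns by
    joining left to right. From the iAG of an acyclic CQ, any join
    order yields the same result with no wasted intermediate tuples."""
    tuples = [(s, t) for (s, t) in ag[0]]
    for column in ag[1:]:
        tuples = [row + (t,) for row in tuples
                  for (s, t) in column if row[-1] == s]
    return tuples

# The eight-edge answer graph from the fan-in/fan-out chain example
# defactorizes into the twelve embedding tuples.
ag = [{(f"w{i}", "x0") for i in range(3)},
      {("x0", "y0")},
      {("y0", f"z{i}") for i in range(4)}]
print(len(defactorize_chain(ag)))  # 12
```

For cyclic CQs or a non-ideal AG, the same joins can produce dangling intermediates that are later discarded, which is exactly why the join order, and hence a planner, matters there.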
Prototype. We have implemented a prototype, WireFrame, which runs on top of PostgreSQL, a popular relational database system. WireFrame implements the two phases described in Section 4, each with a separate planner and evaluator. The planner for the first phase outputs an optimal left-deep tree plan that indicates the execution order of the query edges. The evaluator then takes the tree plan and evaluates the query edges in sequence. For the second phase, we presently use a greedy approach to generate a tree plan based on the statistics available from the answer-graph phase. The node-burnback procedure is implemented via procedural SQL. Environment. To evaluate WireFrame’s performance, we use the YAGO2s dataset (Yago2s). With a select set of five acyclic and five cyclic CQs, we compare query execution times against PostgreSQL v11.0 (PG), Virtuoso v6.01 (VT), MonetDB v11.31 (MD), and Neo4j v3.5 (NJ). All experiments were conducted on a server running Ubuntu 18.04 LTS with two Intel Xeon X5670 processors and 192GB of RAM. After preprocessing the dataset (242M triples with 104 distinct predicates), we imported it into each of the systems. For the queries, we implemented a query miner that generates queries over a dataset using query templates (with placeholders for edge labels); the query miner then generates valid, non-empty queries. For our experiments, we use two templates, snowflake-shaped and diamond-shaped, as shown in Figures 3 and 4, respectively. With these two templates, we mined 218,014 snowflake-shaped queries and 18,743 diamond-shaped queries. For our preliminary experimental study, we chose five queries of each shape that could be attributed to real-life use cases. For the relational systems, the dataset was imported as a triple store, with indexes on the string dictionary and six composite indexes over the permutations of subject, predicate, and object. We set the size of the memory pool to eight GB for all of the systems, except for Neo4j (which sets its own resource allocations based on the server).
We repeated the execution of each query five times, taking the average of the last four runs (i.e., warm cache), as reported in Table 1. The execution time is the time spent to retrieve all the result tuples for a query. Results. One can observe from Table 1 that the size of the answer graph is vastly smaller than the number of embeddings. For instance, for the second snowflake-shaped query, the iAG is 2,867 times smaller than the number of embeddings. It is no surprise, therefore, that WireFrame (WF) achieves good performance; it avoids the redundant edge walks that arise from many-to-many joins. While the second snowflake-shaped query took 88 seconds on PostgreSQL, it took only five seconds on WireFrame. The answer-graph approach requires a much smaller memory footprint, which can be beneficial for traditional database systems that make heavy use of secondary storage. The approach also competes well against main-memory-intensive systems such as Virtuoso and MonetDB. For the cyclic, diamond-shaped queries, employing only node burnback does not guarantee the ideal answer graph, as discussed above. We have found that the resulting AGs can be significantly larger than the ideal, sometimes close to the number of embeddings. For this reason, WireFrame was slower for some of the cyclic queries, notably 1 and 2. Even so, its performance over cyclic queries is quite good. With further plan-time and runtime optimization via edge burnback, we believe that the performance will be stellar.
We have clear objectives for our next steps. First, one has a richer plan space when considering bushy plans for both our first and second phases. The challenge is to devise a suitable cost model for searching the bushy-plan space via dynamic programming. Second, when the size of an answer graph is far from the ideal, generating the embeddings can be costly. Triangulation promises to reduce this cost significantly. This requires investigating the trade-off between the added cost of maintaining the triangle materializations and the reduced cost of generating the embeddings from the significantly smaller iAG. Lastly, we plan to explore further optimizations within this space. Large graphs are meant to be queried.
- Abadi et al. (2007). Scalable semantic web data management using vertical partitioning. In VLDB ’07, pp. 411–422.
- Atre et al. (2010). Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In WWW ’10, New York, NY, USA, pp. 41–50.
- Bakibayev et al. (2012). FDB: a query engine for factorised relational databases. PVLDB 5(11), pp. 1232–1243.
- Christodoulakis (1989). On the estimation and use of selectivities in database performance evaluation. Technical report, CS Dept., U. of Waterloo.
- Prud’hommeaux and Seaborne (2008). SPARQL query language for RDF. W3C Recommendation, 15 January 2008.
- Godfrey et al. (2017). WIREFRAME: two-phase, cost-based optimization for conjunctive regular path queries. In AMW.
- Mannino et al. (1988). Statistical profile estimation in database systems. ACM Computing Surveys 20(3), pp. 191–221.
- Neumann and Weikum (2008). RDF-3X: a RISC-style engine for RDF. PVLDB 1(1), pp. 647–659.
- UniProt SPARQL endpoint (2020). https://sparql.uniprot.org/.
- The UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47(D1), pp. D506–D515.
- W3C (2004). Resource Description Framework (RDF). http://www.w3.org/TR/rdf-concepts/.
- YAGO2s: a high-quality knowledge base. Max Planck Institut für Informatik. http://yago-knowledge.org/resource/.
- Yakovets et al. (2016). Query planning for evaluating SPARQL property paths. In SIGMOD, pp. 1–15.
- Yakovets (2016). Optimization of regular path queries in graph databases. Ph.D. thesis, York University.
- Zou et al. (2011). gStore: answering SPARQL queries via subgraph matching. PVLDB 4(8), pp. 482–493.