Memory-Efficient Group-by Aggregates over Multi-Way Joins

Aggregate computation in relational databases has long been done using the standard unary aggregation and binary join operators. These implement the classical model of computing joins between relations two at a time, materializing the intermediate results, and then proceeding to do either sort-based or hash-based aggregate computation to derive the final result. This approach, however, can be dramatically sub-optimal in the case of low-selectivity joins, and often ends up generating large intermediate results even if the relations involved as well as the final result sets themselves are quite small. Moreover, many of the computed intermediate results may never be part of the final result. In this paper we propose a novel aggregate query processing technique that leverages graph data structures towards efficiently answering aggregate queries over joins, without materializing intermediate join results. We wrap this technique inside a multi-way composite database operator called JOIN-AGG that combines Join and Aggregation. We provide a general framework for executing aggregation queries over arbitrary acyclic joins, involving any number of group-by attributes from any relation. We also present a thorough experimental evaluation on both real world and synthetic datasets. Our experiments show that our operators can achieve orders of magnitude lower query times and memory requirements compared to the traditional approach, even when implemented outside of the database system.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

09/05/2017

Covers of Query Results

We introduce succinct lossless representations of query results called c...
05/23/2019

Conjunctive Queries with Theta Joins Under Updates

Modern application domains such as Composite Event Recognition (CER) and...
05/31/2019

Efficient Multiway Hash Join on Reconfigurable Hardware

We propose the algorithms for performing multiway joins using a new type...
10/01/2020

Sort-based grouping and aggregation

Database query processing requires algorithms for duplicate removal, gro...
12/04/2020

Hiperfact: In-Memory High Performance Fact Processing – Rethinking the Rete Inference Algorithm

The Rete forward inference algorithm forms the basis for many rule engin...
02/12/2019

Generalized Lineage-Aware Temporal Windows: Supporting Outer and Anti Joins in Temporal-Probabilistic Databases

The result of a temporal-probabilistic (TP) join with negation includes,...
11/27/2018

Efficiently Charting RDF

We propose a visual query language for interactively exploring large-sca...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Traditionally, aggregate processing over conjunctive queries in RDBMSs has been done through the use of simple binary operators for executing joins, followed by a (typically separate) unary aggregation operator. The simplicity of these operators has proven invaluable throughout the development of modern RDBMSs. Each simple operator enables the optimization of a very specific operation with a concise set of parameters, inputs and outputs. This enabled simpler query optimization, since it is easier to create good cost models for simple operators than for complex ones.

Simplicity however often comes at the cost of performance. It is known that binary operators can lead to sub-optimal performance regardless of the query plan used [5, 8]. The main drawback with binary join operators in RDBMSs specifically, is the generation of intermediate join results, potentially with materialization at the granularity of every join. Each individual join within a query plan may output an increasingly larger number of tuples, making the latter intermediate results unwieldy, especially as they start becoming larger than memory. Pipelining operators help in some cases, but enumeration of the full intermediate result set though joining every single tuple is necessary and cannot be easily avoided (e.g., bloom filters can help in some cases to filter out results, but break the classical model as well). These issues become significantly more pronounced in analytics settings where the joins are often done on non-key attributes to derive higher-level insights (see examples below).

This has led to an increasing interest in the idea of multi-way database operators. Eddies was one example of such an operator [5], where the benefits of combining multiple operators into one came from the ability to choose different execution paths for different tuples. More recently, breakthroughs in worst-case optimal join algorithms [23] have shown that one can put tight bounds on the maximum possible number of tuples generated by a query, and then develop algorithms whose runtime guarantees match those worst-case bounds. These breakthroughs have led to a variety of different query operators that take a multi-way join approach over the traditional binary operator. The benefits seen by many of these proposed operators typically come from the fact that the operator takes multiple relations that are part of a large conjunctive query into account simultaneously. This allows for avoiding the materialization of large intermediate results [31], enables pruning out various portions of the computation based on complex conditions [33], or allows for exploiting more parallelism and fast set intersections toward the join result [3].

In this paper, we focus on another very common combination of operators, namely a series of joins followed by a group-by aggregate. The data graph paradigm proposed here is reminiscent of factorized representation of conjunctive query results, by Olteanu et al. [26], and the idea of a tuple hypergraph that can cover all tuples in a query result [16]. All of these provide compact representations of the underlying join result, especially in presence of low-selectivity joins, with minor differences because of the specific goals behind their genesis. Our key contributions here are a novel way to use such a structure for computing group by aggregates efficiently over complex acyclic joins.

Several different works have considered the problem of executing group by aggregate queries against a factorized representation of a conjunctive query [6, 18, 17, 19, 29, 28, 25]. The key guarantees like constant-delay enumeration, however, do not extend to the kind of group by queries we focus on in this work, e.g., the “branching” query . Because all of (group by attributes) need to be present in the output, either (a) one of the other attributes needs to be eliminated (which requires generation of a large intermediate result), or (b) we have to iterate over all combinations of values for and compute the aggregate value for each combination (which can be prohibitively expensive if either the sizes or the number of group by attributes is large). Our work here, thus, can be seen as exploring an alternative approach to computing aggregates over the factorized representation.

As we discussed earlier, recent work on worst-case optimal joins [23, 32, 20, 24] shows how to avoid large intermediate results during execution of multi-way join queries; Joglekar et al., [14, 15] discuss how to generalize that to aggregate queries. Their approach is largely complementary to  [6], as well as our line of work. Recent work on FAQ [4] proposed a generalized way of viewing a very common type of aggregation query called a Functional Aggregate Query which they see parallels in multiple scenarios other than databases, e.g., matrix multiplication, probabilistic graphical models, and logic. For acyclic join queries, those approaches effectively reduce to pushing aggregates below joins in the query plans, similar to the “aggressive pre-aggregation” approach we analyzed and compared against in this paper.

  SELECT ps_suppkey, c_zipcode, COUNT(*)
  FROM partsupp, lineitem, orders, customer
  WHERE ps_partkey = l_partkey AND
        o_orderkey = l_orderkey AND
        o_custkey = c_custkey
  GROUP BY ps_suppkey, c_zipcode;

Consider a query like [Q1] in Listing LABEL:lst:q1 over the standard TPC-H dataset of parts, customers and suppliers. The lineitem table includes all orders of parts that were supplied, the individual parts each order contains, as well as which supplier each part was bought from. The goal of [Q1] is to compute the number of parts that a supplier could provide to a certain zipcode, given the transaction data that we already have. Note that c_zipcode isn’t a distinct field in the customer table, but is typically extracted from the c_address attribute. This type of complex decision-support query requires a non-key join which could yield very large intermediate results that will be fed as input to the aggregation operator. As shown in Figure 1, running [Q1] over TPC-H (using scale factor SF=1), the intermediate join result for this query contains over 24 million tuples. The size of the result post-aggregation would be bounded by the number of distinct zip codes times the number of suppliers, and therefore is highly likely to be orders of magnitude smaller than the join result.

Fig. 1: Query plan for query [Q1]. Aggregate queries can have very large intermediate results even though the number of output groups could be small

Another very common situation where a decision-support query would require a non-key, large-output join is that of self-joins, where a table is joined with itself. For example, on the standard TPC-H dataset, a classic data-mining task is to compute the number of times pairs of parts appeared in the same order; this is a standard frequent pattern mining query. A self-join between the lineitem table and itself (on partkey) is required to compute this.

Another example of queries that require large-output self-joins include path aggregation queries in graphs. Any graph stored inside a relational database in the form of a Nodes and Edges table, is conducive to queries that e.g. count the number of paths that follow a certain pattern in terms of the nodes. If we had an edge table Edges(src,dst), and a Nodes(id,label), a query like [Q2] in Listing LABEL:lst:q2, counts the paths between nodes with certain labels. Queries like this one end up outputting a huge number of intermediate results which constitute the sub-paths for each intermediate stage of the graph traversal.

  SELECT n1.label, n2.label, COUNT(*)
  FROM Nodes n1, Edges e1,
       Edges e2, Nodes n2
  WHERE n1.id = e1.src AND
        e1.dst = e2.src AND
        n2.id = e2.dst
  GROUP BY n1.label, n2.label;

Our main contributions in this paper are twofold; first, we propose a new multi-way database operator called Join-Agg, which enables the efficient computation of aggregation queries, without materializing any intermediate join results, by computing the join and aggregation simultaneously. We describe a novel general framework for executing aggregation over conjunctive queries of arbitrary numbers of relations, and numbers of group by attributes that may be derived from any participating relation, by leveraging a graph representation of the underlying data. We restrict our formal development to acyclic queries – although our algorithms can be adapted to handle cyclic queries, systematically combining our data-level optimizations with the recent work on cyclic joins raises complex issues that are beyond the scope of this paper. We implement a prototype of the Join-Agg operator outside of the RDBMS and experimentally showcase the benefits of our operator over synthetic and real datasets.

Second, we provide a comprehensive complexity analysis of common example queries that benefit from our Join-Agg operator in comparison to executing them using the classical RDBMS model, or other less general techniques such as partial pre-aggregation [21] which only looks at reducing intermediate data size at the level of each individual join instead of looking at the join as a whole. We show that in terms of computational complexity Join-Agg is comparable or asymptotically better than those techniques, particularly in the general case of complex acyclic branching join queries. We also show that Join-Agg is overall better than those techniques in terms of memory complexity.

The rest of this paper is organized as follows: Section II introduces some preliminary definitions required later on and provides a high level description of the Join-Agg operator, Section III describes the process with which we load the data into memory in a data graph representation which is stage 1 of the Join-Agg algorithm. Section IV describes the traversal and result generation stages of the algorithm after defining basic concepts used the traversal process. Furthermore, Section V presents a comprehensive complexity analysis comparing the complexity of Join-Agg with other traditional techniques, Section VI goes deeper into the implementation details of the prototype we implemented, while Section VII presents the experimental evaluation of the aforementioned prototype. We finally place this work into context with other related work in Section VIII, and conclude in Section IX.

   SELECT $A.a$,$B.b$,$C.c$, COUNT(*)
   FROM $R_1~A$, $R_2~J$,$R_3~B$,$R_4~C$
   WHERE $A.j_1$=$J.j_1$ AND $J.j_2$=$B.j_2$ AND $J.j_3$=$C.j_3$
   GROUP BY $A.a$,$B.b$,$C.c$;

Ii Preliminaries and Overview

Fig. 2: Inner-workings of the Join-Agg operator.

In this section we formally describe the general framework for efficiently answering queries such as [Q3] in Listing LABEL:lst:q3. Our framework views the join between a series of relations in the form of a graph structure of interconnected tuples that we call the data graph. For the sake of simplicity, we use the COUNT(*) aggregation function in our explanations and examples. We provide a discussion on how more standard aggregation functions can be supported using the same framework in Section IV-D.

Ii-a Preliminaries

Let be an aggregate query over a join between a set of relations , where is the set of group-by attributes of this query. For now, assume that we only need to count the number of tuples in each group (COUNT(*)). We assume without loss of generality that a group by attribute corresponds to a single attribute in relation , and that none of the participate in a join condition. We also assume that all joins are natural joins. All of these can be relaxed easily through standard tuple-level transformations (e.g. if a group-by attribute participates in a join, we can (implicitly) create a copy of that column). As mentioned earlier, we restrict our attention to acyclic joins in this paper.

We represent the overall join-aggregation query as a hyper graph where is the set of all attributes that take part in the join conditions between the relations in and hyperedges, containing one hyperedge per relation , i.e., . Note that the only attributes that are relevant here are either join condition attributes, or group attributes–as the result is a set of tuples that represent groups (grouped by ). Let denote attribute (or set of attributes) from relation .

For every , we partition the attributes of into two disjoint groups . We describe the specifics in Section III-B, but intuitively this is done so as to reduce the size of the “data graph” that we load into memory, while also capturing enough information to execute the query.

Ii-B JOIN-AGG Operator

We propose a new database operator called Join-Agg that receives a set of input relations and outputs a single set of result tuples, i.e., after the appropriate grouping and aggregation, as the output. The decision of whether to use the operator is made by the query optimizer in a cost-based manner; in essence, if at least one of the joins in the query is a non-key join or a join that may result in a large output compared to the input relations, then this new opeartor should be considered. We develop the necessary cost models to make this decision. When the operator is chosen, instead of conducting a series of binary joins as traditional RDBMSs do, we would instead go through each relation, and load each one into the same in-memory data graph which is then traversed to output the resulting grouped tuples.

Prior to the instantiation phase, the operator creates a query hypergraph that captures the joins in . This query hypergraph is then turned into a query decomposition tree, which is traversed in order to transform each individual relation into a set of edges in the data graph. Based on the final decomposition, during the execution phase, the operator constructs the edges that correspond to each relation as the data graph. Finally, this in-memory data graph structure is used (and potentially re-used) to directly compute and output the grouped tuples.

Iii Data Graph Representation and Construction

In this section, we begin with describing how a query decomposition tree is constructed and how it is used to split the attributes of each relation into two groups that form the edges of the data graph. We then describe the basic representation of a data graph and explain how it is constructed by loading in relations from the underlying database.

Fig. 3: A Data graph created by a set of joining relations (after projections have been applied). Relation B has multiple attributes as part of , which merge into the multi-node (jc1,jd1). In the relations involved in the join, we have four different group attributes , one of which is a source attribute (). Node 1a is a source node, 2a, 2b, 3a, 3b, 4a, 4b are all group nodes, and (jc1,jd1) and je1 are both branching nodes. The rest are all intermediate nodes.

Iii-a Query Decomposition

A query decomposition of a hypergraph is defined as a tree where each node corresponds to a hyperedge . We create a pure query decomposition of where each node in the decomposition directly corresponds to a single relation. In this work, we focus on acyclic queries, i.e., queries for which there exists a tree decomposition [10]; in future work, we are planning to extend our approach to handle cyclic queries by combining it with recently proposed techniques for optimal worst-case join algorithms.

We construct the query decomposition tree using the standard elimination algorithm [30]. First, we note that, all of the relations that contain at least one attribute not present in any other relation must contain a group attribute; we’ll call these group relations. We start with one of those as the root of the tree, and traverse the hypergraph in a breadth-first manner to construct a query decomposition tree. An example of such a decomposition can be seen in Figure 3d. Given , to build the query decomposition tree, we can start from any group relation; here we picked . The hypergraph is then traversed in a breadth-first manner starting from using a standard queue. We start by creating a root node in the decomposition tree for – let that be the current node. Then, for every neighbor relation of , if it has not been visited, we add it to the queue. We then pop the queue and add the popped relation as a child to the current node in the decomposition tree. Thereafter we continue with neighbor which becomes the current node, it is added as a child node to and so on until the queue is empty.

Fig. 4: Derivation of a Query Decomposition tree from a Query Hypergraph

Iii-B Splitting Attributes

As previously mentioned, the attributes of each relation are partitioned into a pair of attribute sets . This is done in order to properly view every relation as a set of edges for the data graph–an edge has two entities it connects, here it connects and . To do this, we simply traverse the query decomposition tree starting from the root. As we discuss in further detail later on, this splitting mechanism is a data reduction mechanism (similar to pre-aggregation [21]) for reducing the input data as much as possible before the query is executed.

To construct these pairs, we traverse the decomposition tree in order to partition the set of attributes for each relation into a pair. In a slight abuse of notation, let also represent the set of attributes relevant to the query from relation . We start the traversal at the root; let that be , for which we set . Afterwards, we set where is the set of attributes in the children relations of in the decomposition tree, plus (union) itself. Next, for every relation , if is not a group relation, we set , and again . If is a group relation, we set and . As we will later describe, nodes created from group attributes need to be sinks in the DAG structure that we will be building. This same process is recursively executed on all relations in .

Below we provide a few standard examples of how the aforementioned algorithm is used to split the attribute sets.

Example III.1.

Looking at the decomposition tree in Figure 3c, a simple and common case is that of D. Because D’s parent relation is B, we have . Also, the children of D are E,F, therefore . In the data graph constructed later, a single node will be created for each value in and for each .

Example III.2.

Again regarding the tree in Figure 3c, for relation B, its value will be , and . This multi-valued value means that a multi-node will be created for each value in .

Example III.3.

Looking at the decomposition tree in Figure 4, for relation , since it’s a group relation (one that contains at least one group attribute), we always split its attributes by setting the group attribute as ,and .

Iii-C Data Graph Representation

Next, we formally define the data graph representation for a join-aggregate query. For a given query , a data graph captures the underlying data in the relations and the interconnections between those data elements. Let , and the nodes and (directed) edges respectively that make up graph .

Relation & Node Types: At a high level, we partition the relations in four (overlapping) groups: , which dictates how the corresponding nodes are handled during execution of the Join-Agg algorith:

  1. denotes the group relations in , each containing a group by attribute .

  2. is the source relation , which we choose to start the computation from. The algorithm we develop is based on a traversal of the data graph. As we describe in Section IV, nodes originating from the source relation are the anchors of the traversal and therefore do not get visited multiple times. As a result, no additional data needs to be maintained for them.

  3. a set of branching relations in . A relation is marked as a branching relation if its corresponding node in the query decomposition tree: (a) has more than one child and therefore branches out the join execution, or (b) is not a leaf node or the root node, and is also a group relation. In other words, either the tuples in these relations need to sequentially be joined with multiple tables, one at a time in the context of the overall conjunctive query, or there is a grouping attribute in the relation that needs to be separated out so that we can exploit caching effects (as discussed later). An example of a type (a) branching relation is in Figure 3d, and a type (b) would be relation in Figure 4.

  4. a set of intermediate relations in . These are relations that only have exactly one child and one parent in the query decomposition tree, and are not group relations.

Consequently, there are four types of nodes in the data graph, each originating from its adjacent relation type: source nodes , group nodes , intermediate nodes , and branching nodes . Each of the aforementioned relations all portray a pair of attributes that are relevant to the query. As discussed in Section III-E, source nodes are always loaded from the attribute in the source relation, while group and branching nodes from the attribute in their respective relations. Group nodes are always sinks in the data graph, while we made the choice to always set branching nodes on the right hand side of the split for the sake of simplicity. All remaining nodes in the data graph are intermediate nodes.

Note that a relation can have multiple types (e.g., is both a source relation and a group relation). Similarly, based on the specific query, a branching relation can also be a group relation if it satisfies the criteria for both relation types simultaneously. This would occur for instance if a relation includes a group attribute , but also joins with more than two relations in the query described by a hypergraph . The nodes derived from relations with multiple types naturally also inherit the same set of node types.

Attribute splitting (Section III-B) enables us to conduct a pre-aggragation step in which we group tuples with the same values (after projection) into a single edge with a multiplicity value. For example, in regards to relation in Figure 3, splitting attributes this way not only allows us to pre-aggregate , but to also only load in a single node for .

Iii-D Mapping Relations to a Data-Graph

We now formally describe how we map rows in the underlying relations to nodes and edges in the data graph. Let denote the set of values of attributes in relation , where indicates bag semantics for the projection. Also, let denote the set of unique values of the attribute in relation , and let denote the set of attributes in relation that take part in a join condition. We create the nodes in the data graph in two simple steps; for every relation :

  1. We create a hyperedge for every tuple in relation , as is seen in Figure 3 (hyperedges with only nodes are shown as regular directed edges). Every hyperedge describes a set of values from attribute sets . A unique value in attribute set corresponds to a single node in such a hyperedge.

  2. For every set of nodes in the intersection of a set of hyperedges, create a multi-node that includes all values in the intersection (also shown in Figure 3 for relation ). The result is a set of regular directed edges between nodes and/or multi-nodes. For all purposes moving forward, multi-nodes function exactly the same way as regular nodes in the data graph. In general, the node created from an attribute set is simply , and its set of values are denoted as .

We now define the edges in . Let denote the multiplicity of an edge . The multiplicity of an edge is a numeric value associated with each edge and is defined as the number of times the tuple that edge corresponds to exists in the relation. contains a directed edge iff one of the following applies:

  1. There exists a tuple . If , the set of tuples in with values in , the multiplicity .

  2. A tuple in joins with one in on attribute , such that . In this case, the multiplicity of the edge is always . An example of such an edge is in Figure 3a.

For the sake of simplicity, and without loss of generality we can assume that any , corresponds to a single attribute, i.e. relations only join with one another through single attribute join conditions. In practice, can be a set of attributes, in which case (the value for node ) would constitute of a bag of values and be described as a multi-node in the data graph. Formally, in that case we simply have that where and and .

Iii-E Join-Agg Stage 1: Loading Data Graph

To summarize, the input to the overall load process is the hypergraph . We initially need to partition all sets of relation attributes to pairs, by first creating a query decomposition of , and then using that decomposition to do the partitioning.

The data graph is then loaded into memory by simply scanning the input relations, sorting them, and creating nodes as described above. If there are any attributes in the input relations that don’t participate in the query, we push down appropriate projections (without duplicate elimination) to the underlying database to minimize the amount of data transferred over the network.

Iv Traversing The Data Graph

In this section, we describe our algorithm that computes the aggregated groups of such a query , by traversing a data graph . For the sake of simplicity we will focus on the query that counts the number of tuples in each group and discuss how it is generalized in Section IV-D.

The high level idea behind Join-Agg is it to traverse the data graph, which represents the underlying data being joined, starting for one source node at a time and maintain certain partial aggregate values (in this case, counts) at all reachable group nodes in each iteration. We can later combine these values in order to obtain the final aggregate value of each group, instead of materializing the join at any point. The way this happens at a high level is by propagating the counts along the data graph, starting from each unique source node, to the group nodes, while keeping track of certain path information (which we refer to as path-ids) along the way. These path-ids allow us to figure out which counts are derived from which paths in the data-graph and enable us to properly combine them to compute the correct count for each group.

Iv-a Definitions & Axioms

Before we formally describe our general algorithm for executing these queries over a data graph , we enumerate a few core definitions and axioms for concepts that we be regularly reference in the algorithm description. The execution algorithm we propose revolves around traversing the data graph and maintaining certain information along the way in order to directly output the groups in the result.

Definition IV.1.

A rooted tree (also formally known in the context of directed graphs as an arborescence), is defined as a directed subgraph that consists of a tree, with a single root node, therefore containing exactly one path between that root node and every leaf node.

Definition IV.2.

Let denote a count between nodes , . We conceptualize the traversal of the data graph as equivalent to conducting joins between the tuple that each element of the data graph represents, thus generating new tuples which we want to avoid materializing. A count represents the number of tuples generated along all paths between node to . Any such path cannot include a branching node ( may themselves be branching nodes, but there cannot exist a branching node in any path between them). More formally, using Axiom IV.1 we have , where the set of intermediate nodes in one of the unique paths between and .

Definition IV.3.

Let denote a path-id. A path-id is a list of branching nodes found in a unique path from a source node to a group node . We maintain path-ids in order to logically re-construct all possible rooted trees which have as their root, and include all group nodes in their leaves in order to compute the number of tuples within each group . Path-ids are unique identifiers for unique paths in the data graph, and are always paired with a path-id count described below.

Definition IV.4.

A path-id count denoted by , is defined as the count between two branching nodes and is equal to , where path-id . The path-id maintains information about the rooted trees this certain path is part of. The path-id count itself however represents the count between the last two branching nodes in the path-id (even though the path-id might include more than two branching nodes). In the case where then where a source node. The intuition here, thinking about this from a query processing perspective, is that we need to keep track of how many tuples were generated at the point where a relation joins with more than one other relation. Once we join with one of the relations, we need to go back and join with the rest one at a time. In order to do that properly (without actually materializing the join result) we need to know how many tuples were generated at that time in the query before it branches off to multiple joins.

Definition IV.5.

A group-node count denoted is the count between the last branching node of a path, and a group node . Intuitively, a group-node count represents the number of tuples generated by joining the tuples in the underlying relation that contain value , with all intermediate tuples, and then also joining them with all tuples that contain .

Definition IV.6.

A c-pair denoted by , is a pair consisting of a path-id and a group-node count. These pairs are recorded at every group node during the traversal of the data graph described in the algorithm in Section IV-B.

Axiom IV.1.

Let , denote the number of tuples generated when there exists a path from to which includes a set of in-between (either branching or intermediate) nodes . By definition of the data graph (Section III-C), there must exist tuples , (where a tuple appears times in its corresponding relation), such that . The number of tuples generated is , and is derived by taking the product of all edge multiplicities along the path.

Iv-B Join-Agg Stage 2: Traversal and Multiplicities

Stage two of this algorithm traverses in a depth first fashion starting from each source node, properly keeping track of the cumulative edge multiplicity along the way, and finally setting the appropriate c-pairs at all reachable group nodes.

A depth first traversal starting from source node to the rest of the group nodes consists of multiple different rooted trees, each of which ends up at a potentially different set of leaf nodes (group nodes). Every rooted tree that reaches exactly one of each type of group node, corresponds to a tuple (or set of identical tuples) in the result of the join . The purpose of this algorithm is to count the number of such rooted trees that each combination of group nodes has in common.

The result of the traversal step is a set of lists , containing one list of c-pairs associated with each group node reachable from – let denote each such list. Again, denotes a c-pair, comprising of a path-id , and a group-node count . There is also a path-id count associated with each unique path-id (we define the terms path-id, path-id count and group-node count in Section IV-A).

We now outline the process that traverses the data graph and sets the appropriate c-pair lists at every group node . We start at a source node , and conduct a DFS traversal. Let denote the current path-id, denote the current count, and the current node being visited. Also let denote the ’th neighbor of , and denote the multiplicity of the edge between them.

We now define a recursive visit() function: if is a group node, record , and return. This is the base case of the recursion. If is not a group node, for each , if is a branching node, update with the current neighbor’s multiplicity, append to the path-id , and reset . The reason we reset the current count is because we now need to keep track of the count along the new path since we encountered a branching node. If has already been visited by this traversal (the traversal starting from ), simply update that path-id count to and return. Next, recursively visit every . This can be seen as a form of computation caching. If we’ve been through a path in a current traversal, we don’t need to go through it again, whereas in traditional execution, this path would be computed multiple times (in the form of joining tuples).

Iv-C Join-Agg Stage 3: Result Generation

Finally, we end up with a list for every reachable from – let denote this set of group nodes. We utilize these lists in order to generate the final result groups. The intuition behind this process is that a group in the final result will only have a non-zero count value iff there is at least one rooted-tree in the data graph with as the root, and as leaves. Every c-pair set during the traversal stage of the algorithm will contain a path-id that is part of such a rooted-tree. There is a count computed for every such rooted-tree. The goal of this stage of the algorithm is to use these c-pairs set at every in order to re-construct all of the rooted-trees that contribute to the result, and finally compute the sum of all of their counts. That sum is equal to the size of the output group.

First, we separate the group nodes reached by the traversal into a set of buckets. The combination of all c-pairs found in all nodes in , will result in the final count for the group ; if this count is non-zero, the group is output to the final result. We will now properly explain how this combination of c-pairs is conducted.

We partition nodes into buckets, one for each group relation , by adding a node into a bucket if it was derived from group relation . Let denote this set of group node buckets. Next, for each bucket , we output a list of tuples that we will combine in order to generate the final result. The way we output these tuples is the following: For each node , for every c-pair , we output a tuple , so that we keep track of which group node each c-pair in came from. Let denote the set of lists output from all buckets in this step.

Lastly, in order to construct and aggregate all distinct groups that are in the final output and their associated counts, we conduct a prefix-join (denoted as ) of the lists on the path-id in a pair-wise fashion. In this prefix-join, two tuples match if their path-ids share a common prefix.

More specifically, let define a binary relationship between path-ids, that indicates they share a common path prefix. Let path-ids where are their respective lengths, and . We say that iff .

Therefore, for every tuple in , we compute a value that will contribute to a group in the final result. Say we’re computing ; Let a tuple and . If we output where and iff or iff i.e. the path-id with the smallest length. We only multiply the result with the path-id count of each unique path-id, once – if e.g. we joined (a tuple with the exact same path-id), the output tuple would be where .

For every iteration of the algorithm, we start from a source node , we end up getting an set of c-pairs, out of which we output all tuples resulting from the prefix-join described above iff they have non-zero counts. After the end of step 3, we will have output all groups, that have any combination of values where every value comes from a different group relation.

Fig. 5: A rooted tree in the data graph corresponds to at least one tuple in that contain the values at the root and the leaves of the rooted tree (the source node and the group node values)
Example IV.1.

Consider the example on Figure 5, showing the data graph and the join result for this query. The red arrows showcase an example of a rooted tree, with source node 1a as its root. Every possible rooted tree in which includes one of each type of group node in its leaves directly corresponds to a tuple in the join result . Since the idea is to avoid materializing , we instead traverse this graph, and set a c-pair at every group node every time it is visited, identifying the path that reached a given node by its unique path-id. Here we can see that for group node , its c-pair list is , and for we have accordingly. We will transform these lists into sets of tuples where and . We compute the prefix-join , which outputs the tuple with the value and the count (the count of times the count of times the path-id count for the path-id ). Finally, is output. There is such a tuple computed for every rooted-tree in the data graph that has as its root, and as its leaves. The sum of the counts for every unique group is equal to the size of the group in the final result.

Iv-D Other Aggregation Functions

The list of basic aggregation functions supported by most SQL execution engines includes COUNT,SUM, MIN,MAX and AVG. We argue that our ability to execute COUNT without outputting individual intermediate results generalizes directly to the rest of these basic aggregation functions.

SUM: Firstly, COUNT can be thought of as a special case of SUM, if we assume that every single tuple in the group includes an attribute for which the value is always 1. If the value of such an attribute is not 1, while executing the query , we can simply keep track of the running sum of tuples, that include the attribute values being summed over, instead of just the running multiplicity of generated tuples. The sum would then need to be multiplied by the count for a specific group, which would then output the correct result.

MIN, MAX: These two functions would only require keeping track of a single value and do not require maintaining counts at all.

AVG: This requires keeping track of the sum of the certain combination of attribute values that need to be averaged over, as well as the count that the current version of our algorithm is maintaining.

V Complexity Analysis

Here, we provide a high-level analysis of the computational complexity of executing a join-aggregate query, with the goal of showing the asymptotic benefits of our approach. We make several simplifying assumptions for clarity. For any relation that we use in the examples below, we make a uniformity assumption about all join condition attributes. Moreover, any join between relations in the below examples are natural joins. Again, let denote the domain of values for attribute in relation . We assume that all relations in any example are of a constant size .

We contrast the time and space complexity of the algorithm traditional RDBMSs would use to compute join-aggregate queries, our Join-Agg operator, and an idealized pre-aggregation approach. We do this by choosing three simplified example queries. For the traditional RDBMS execution we assume that a sort-merge join operator is used to compute the join between any two joining relations and that neither of the joining relations have indexes on the join condition attribute. In most modern database systems, a hash-join

operator of some sort would be chosen by the query optimizer, but only if the optimizer accurately estimates the amount of memory required for storing the hash-table given the amount of memory available. In terms of how

aggregation is performed in the RDBMS, we assume again that a sort-aggregate operator is used instead of a hash-aggregate operator.

It’s also important to note that we are loading in a node in the data graph for every unique value in each distinct relation. For the sake of simplicity we assume set semantics for any relation, i.e., we assume that every tuple in a relation is unique. Note that there are various optimizations possible for the first two examples below, which nonetheless don’t change the asymptotic complexity of the algorithm.

Self-Join: Here, for a single relation , the join-aggregate query is to do a self-join on , and a group-by on the two copies of the attribute (shown in Figure 6a). More specifically, let the two copies of , and let , giving us the Join-Agg query .

Traditional RDBMS: A traditional RDBMS would compute the join and then aggregate the results. Let and . The join computation requires us to first sort both of the relations (), then compute the join between them and output all result tuples. The complexity of computing the join will be equal to the number of output tuples in the result of the join, which is . Overall, the join process takes steps. The aggregation process that follows the join requires a sort of the join result, giving us a total time of .

Join-Agg Operator: The number of vertices in the data graph . The number of edges in the data graph is at most . At the traversal stage of the algorithm, we need to conduct a full dfs traversal of the graph for every source node, of which we have here. A single dfs requires , therefore overall, we will have: . Since there is no merging step for this query, the result generation requires a pass over the reachable group nodes, and there may be at most different results. Overall we have a time complexity of .

Pre-aggregation: For this query, we can do a partialy pre-aggregation on by aggregating on , thus reducing its size to at most . The total execution time then reduces to , and is thus comparable to the Join-Agg operator runtime. However, the maximum memory consumed by this approach at any point is , whereas the Join-Agg operator consumes at most memory.

Comparison: It is easy to see that, the Join-Agg operator performs better asymptotically than the traditional approach if , i.e., if the number of unique values of and/or the number of unique values of is small relative to the relation.

Chain Join: Next we consider a simple chain join between four relations, (shown in Figure 6b), still maintaining group attributes in total.

Traditional RDBMS: Computation of the join result is again dominated by the generation of the result tuples, and requires steps. For aggregation, we again would sort the join result and scan it to output the result groups, overall requiring .

Join-Agg Operator: The number of vertices in the data graph . The number of edges in the data graph is at most . Similarly as the above case, the traversal stage will take . Overall, we again have the total time complexity of .

Pre-aggregation: With aggressive pre-aggregation over the input relations and all intermediate results after they are generated, the time complexity of the join-at-a-time approach can be reduced to . However, the memory consumption of this approach reaches at various points during execution.

Comparison: As our experimental results also validate, the benefits of a single operator are clearly apparent here, with potentially very large gains coming from more careful and combined evaluation.

Chain Join w/ 4 Grouping Attributes: Next we consider a chain join between four relations, , but with a total of 4 grouping attributes.

Traditional RDBMS: Since we assume the relation sizes and selectivities are unchanged, the total time complexity here remains .

Join-Agg Operator: The number of vertices in the data graph is here because there will be two sets of multi-nodes here, one for and each. The number of edges in the data graph is at most . Similarly as the above case, the traversal stage will take time. However, the result generation stage is more complex here as we have to maintain “paths” at the reachable group nodes and merge them at the end. Both the space and time complexity here is dominated by the number of different paths to the nodes. In the worst case, there may be such paths per node, giving us a total time complexity of per source node. The overall complexity then is , and the space complexity is .

Pre-aggregation: The partial pre-aggregation possibilities are somewhat limited here since the intermediate results contain a larger number of attributes, and thus have limited duplicity. If we assume there is no reduction due to partial pre-aggregration, then the time complexity here is similar to the traditional approach, giving us time complexity. However, another lower bound can be calculated using the similar worst-case assumption as above for the Join-Agg operator, where we assume all possible combinations of values exists in at least one join result, giving us a time complexity of , with a space complexity of .

Comparison: The complexities of Join-Agg and partial pre-aggregation approaches are very different in this case. The partial pre-aggregation approach may perform somewhat better if the number of unique values for the group attributes is small relative to the join attributes; however in that scenario, we expect the number of different paths to a node to be significantly smaller than (which assumes a worst-case situation that won’t occur in practice). As above, we see that the Join-Agg space complexity is lower by a factor.

Branching Join: Next we consider a 5-relation branching query , with a group by aggregate on four attributes from four different relations.

Traditional RDBMS: As above, the join computation time is dominated by the generation of the result tuples, giving us a total time of .

Join-Agg Operator: The number of vertices in the data graph (since every tuple from will be a different node). The number of edges in the data graph is at most . The traversal stage would again take in total. The result generation however requires merging the lists of paths at each of the reachable grouping attribute nodes (). It is easy to see that maximum number of different paths from a given source node to any of the destination grouping nodes (say a node) is , thus giving us a result generation time of per source node. Since this has to be done for each of the nodes, the total time for result generation is bounded by . Thus the overall complexity is , with a space complexity of . Unlike the bounds so far, we don’t attempt to substitute with and as the bounds become very loose in that case.

Pre-aggregation: The partial pre-aggregation possibilities are somewhat limited here (outside of the input relations) since the intermediate results contain a larger number of attributes, and thus have limited duplicity. The largest intermediate result we may generate here is , assuming we join with followed by in that order (with aggressive pre-aggregation at every step). The size of can be estimated at , and is the result of projecting out and from that join result (and any other attributes from those relations that did not participate in the join). However, it is difficult to estimate the reduction in size from that projection. If is sufficiently large compared to (i.e., ), then under uniformity assumptions, we expect minimal reduction in the size. Thus, in general, we expect the total time and space complexity of the pre-aggregation approach to be very high compared to the Join-Agg operator.

Comparison: Join queries with branching really illustrate the benefits of a holistic approach to executing such queries. The benefits over the traditional approach, even with aggressive pre-aggregation, come from the ability to avoid generating large intermediate results, and exploit “caching effects”.

Fig. 6: Hypergraphs of example queries
Fig. 7: Hypergraphs of real world queries in the experiments

Vi Implementation Details

The data graph we load into memory is stored in a data structure resembling a Compressed Sparse Row (CSR) graph representation. We store a list of all Edge objects in the graph called outNeighbors, and a list of Node objects. Each Node object contains the List of attribute values the node is comprised of (note that a Node can have values from multiple attributes in its relation if it is a multi-node). Nodes also include one Integer for the offset value, and one Integer that stores the number of neighbors that particular node has. The offset value points to the outgoing edges that correspond to the particular Node; i.e. the outgoing edges of a node n would start at outNeighbors[n.offset] and end at outNeighbors[n.offset+ n.numNeighbors]. Nodes also include an Integer value representing the type of node (source, branching, group etc.). Group nodes in particular, are assigned a unique Integer that references the relation they came from in this field. Edge objects store a reference to their outgoing neighbor Node, and a single Integer value for their multiplicity. Path-ids are an integral part of the algorithm and are also stored as explicit objects, containing a List of branching node values, as well as an Integer for the path-id count.

Stage 1: Data Graph Loading: During the loading for the data graph, each relation is sorted by PostgreSQL, read in using a JDBC connector as a List of tuples, and each tuple is partitioned into its two subsets. Duplicate tuples (after projection) in each relation are also pre-aggregated by PostgreSQL itself before being loaded into the data graph. A HashMap index is used to keep track of Node objects already loaded and access them in order to incrementally add each additional Edge to the data graph. For each relation being loaded, all children relations in the decomposition tree are subsequently loaded, as well as Edges between Nodes with overlapping values; these Nodes intuitively map to joining tuples between the original relations. The CSR representation we use for our graph data structure is generally immutable, we therefore make sure to properly load in each Node and all of its Edges entirely before moving on to the next one so as to never require to shift anything in the outNeighbors list.

Stage 2: Traversal: During the second step of the algorithm, the data graph is traversed in a dfs fashion starting from each source node. The visit() method is recursively called over the neighbors of the current source node, properly propagating the multiplicity as well as the path-id along the way. We keep track of the path-ids in each iteration inside of a HashMap, therefore a single hash-lookup is required to check if the current path-id has already been visited by this current traversal. If so, we simply need to update its path-id count and continue with the next neighbor without continuing the traversal beyond that path since it has already been explored (for the current source node). This caching effect is one of the crucial optimizations that sets Join-Agg apart from other approaches such as partial pre-aggregation [21].

Stage 3: Result Generation: After a full traversal of the graph starting from a single source node concludes, we now have enough information to output all the groups that contain that source node as a value. First we separate the set of reached nodes into buckets, based on their type. If and only if at least one node from every relation in was touched, do we take any further action in this stage.

Next we merge every c-pair in every node in each bucket into a single list of tuples ordered by path-id. We use a k-way merge algorithm to do this since c-pairs are all naturally sorted by path-id at the end of Stage 2. Next, for every list generated by the previous step, we conduct a sort-merge join starting from the smallest list that contains path-ids of the longest size. We therefore sort the lists first by path-id length (in a decreasing order), and then sort them by list size (in an increasing order). After the sort-merge join is completed, the result is sorted by the value of each output group lexicographically.

Vi-A Pre-aggregation Implementation

In order to experimentally support our hypothesis described in Section V, we implemented a simple in-memory database in Java which allowed us to manually describe query plans. We stored in-memory rows as Java LinkedLists, and stored all values as String objects, as we did in the Join-Agg implementation, for the sake of consistency. We implemented a hash-join over two sets of tuples, project over a set of tuples, as well as a hash-aggregate group by operation over a set of tuples. We use the standard algorithms for hash-join. In particular, we create a HashMap on the join condition value for every tuple in the smallest of the two sets of tuples, and probe that HashMap for every tuple in the larger set, to generate all combinations ,.

Optimizations: We included a few optimizations in order for our code to be as comparable as possible to a real in-database implementation. Firstly, we combined the project and hash-aggregate operators so that tuples are only read once, unnecessary columns are projected out, and the column is then hashed for aggregation in the same step. Moreover, due to the fact that each tuple’s values are static (before it is joined), we compute the hashCode() of every tuple only once, upon its creation so that it doesn’t need to be computed again when hashing the tuple (either at the join or aggregation step). At the hash-join stage, we allocate new memory for the output tuples only when outputting the join result. We store the pre-aggregated count at every stage in a separate field for each tuple.

Vii Experimental Evaluation

We present an experimental evaluation over a series of synthetic and real datasets that showcase the benefits and trade-offs of the Join-Agg operator. We’ve generated synthetic datasets for three types of queries described in Section V, the hypergraphs for which can be seen in Figure 6. We also present experiments on queries over TPCH [1] (using scale factor SF=1), DBLP [2], ORDS [27] and IMDB [12]. Each dataset is associated with a specific query, query hypergraphs for which are shown in Figure 7. Datasets DBLP and ORDS are both simple self-joins. Additional information about these datasets is shown in Table I.

We implemented a prototype of the Join-Agg operator entirely in Java. We load the data directly from PostgreSQL into the JVM by using the JDBC connector. Our aim with this prototype is to showcase that the execution of aggregate queries over large-output joins can, in many situations, be evaluated more efficiently even outside of the RDBMS including the often substantial overhead of loading the data from PostgreSQL into the JVM. We advocate that a native version of Join-Agg implemented natively within an RDBMS itself in a lower level language would demonstrate an even wider performance gap in favor of Join-Agg. The main reason is that loading the data graph would naturally be significantly faster, because reading the data tuple-at-a-time using JDBC is a significant portion of the loading time overhead.

These experiments were all done on a single machine running Red Hat Enterprise Linux Server 6.9, with 64GB of RAM, and an Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz, using PostgreSQL version 9.4.18 and Java 8.

numbers/synthetic-datasets.csv

TABLE I: Characteristics about all synthetic and real datasets used in the experiments. JoinR shows the size of the join result before aggregation in Million (M) or Billion (B) tuples. Groups shows the number of groups output for each query in each dataset. Load is the total time required (in seconds) to load the data from PostreSQL to the in-memory data graph.

numbers/sample-datasets.csv

TABLE II: Samples from the B2 dataset, the max memory consumption (max heap used in GB) when running Join-Agg or pre-aggregation respectively, as well as the size of the max intermediate result (in rows) that needed to be processed when using pre-agg.

Vii-A Synthetic Datasets

The synthetic datasets that were used for studying the behavior on the example queries showcased in Section V

, were generated by pulling from a uniform distribution (using Java’s

Random class) of a certain set of values, based on the selectivity we wanted to emulate each time. We define the term selectivity as , where the domain of unique values of attribute in relation . For each S1,S2,S3 dataset, we generate a single relation for which the join selectivity when joining with itself is roughly the one reported in Table I. Similarly, for each C1,C2,C3 dataset, we again generate a single relation with the specified join selectivity and use copies of that relation for each part of the chain–therefore all joins in the chain portray the same selectivity. For the B1,B2,B3 datasets, there are two different selectivities specified, the first is for the join and the second for the joins and . Again, for each of the join condition attributes in each table, we generated each tuple by drawing from a uniform distribution of integers in the range . Group attributes were generated the exact same way. The range that we used for generating the group attribute in each of these relations is roughly reflected by the number of output groups generated by the queries. For the sake of simplicity all generated relations are of size tuples.

Dataset S1 S2 S3
(groups/size) (6.25 M/80) (6.25 M/26) (3.4 M/1)
PostgreSQL 499 s 181 s 11 s
JOIN-AGG 38 s 28 s 33 s
TABLE III: Experiment for the Self-join example
Dataset C1 C2 C3
(groups/size) (5 M/165) (1.7 M/37) (1 M/22)
PostgreSQL 512 s 65 s 18 s
JOIN-AGG 21 s 22 s 24 s
TABLE IV: Experiment for the Chain example
Dataset B1 B2 B3
(groups/size) (125 K/11 K) (125 K/4 K) (976 K/1)
PostgreSQL 1104 s 393 s 18 s
JOIN-AGG 136 s 226 s 55 s
TABLE V: Experiment for the Branching example
Dataset TPCH DBLP ORDS IMDB
(groups/size) (23 M/1) (87 M/1) (7.5 M/7) (13 M/340)
PostgreSQL 71 s 172 s 95 s 3422 s
JOIN-AGG 248 s 384 s 31 s 1156 s
TABLE VI: Experiment for queries over real datasets
Fig. 8: Maximum memory consumption (max heap used), at any point during execution. Each value in the y-axis represents the largest intermediate result we needed to store when using pre-aggregation at every stage of the join
Fig. 9: Total computation time spent when using pre-aggregation per sample, showing the portion of the computation time spent on garbage collection (GC)
Fig. 10: Total computation time spent when using join-agg per sample, showing the portion of the computation time spent on garbage collection (GC)
Fig. 11: Only computation time (excluding GC time) for every sample dataset.

Vii-B Tuning PostgreSQL

We evaluate the performance of Join-Agg by comparing it to running these queries directly over a state of the art RDBMS; PostgreSQL. One of the database parameters that proved crucial for these queries is work_mem, which specifies the amount of memory every distinct query operator can utilize within a single query. In a data warehouse setting, given the specifications of the server machine we used, work_mem would normally be set to around 256MB. Setting work_mem to a very high value is generally not recommended because it increases the risk of the database running out of usable memory very quickly as multiple user queries are executed simultaneously. Join-Agg on the other hand only asymptotically needs as much memory as is required to store the data graph, per query, thus enabling multiple such queries to practically be run simultaneously and efficiently whereas PostreSQL would need to use slow methods (e.g. use SortMerge Joins and GroupAggregate for aggregation).

Nevertheless, to showcase the best possible performance we could get out of PostgreSQL on this specific machine, we set work_mem to 10GB. This essentially allowed the PostgreSQL query planner to mostly choose the HashAggregate operator instead of GroupAggregate which can be orders of magnitude slower, depending on whether the Sort phase happens in memory or on disk. The query plan generated by PostgreSQL when running these aggregate queries, showed that it always chooses to use SortMerge Joins, and GroupAggregate, when it estimates the value of work_mem isn’t high enough to fit the hash-table based on the estimated number of output groups.

We also observe that PostgreSQL is completely unable to estimate the number of tuples in the result set, and uses the same cardinality estimate as the result of the join, for estimating the number of groups in the result. Anecdotally, we estimate this is as the primary reason PostgreSQL may choose to use GroupAggregate and SortMerge joins, to ensure that the query will not run out of memory instead of trying to use operators that require hashing, which are faster but require significantly larger amounts of memory.

Vii-C Join-Agg Performance Analysis

We studied the three basic types of queries that constitute the baseline for most join-aggregation queries over a database. Our overall conclusion was that Join-Agg can make a huge difference in query execution time for a query, as that query outputs larger groups. The larger the size of the groups in the output, the more there is to gain from Join-Agg. In cases where the output is comprised of small groups (i.e. of size 1), Join-Agg portrays comparable performance to the traditional approach when taking into account the fact that a large portion of the execution time in Join-Agg is taken loading the data out of the database.

Table III, showcases the performance of a join-aggregation query over a single self-join. We can see that Join-Agg performs over to an order of magnitude better than PostgreSQL when we have a relatively large group size and the gap between the two systems closes as that average size leans towards . This makes sense since outputting many groups of size indicates the intermediate result is close in size to the final result, thus materializing it is mostly inevitable. Similar behavior is seen for the chain example shown in Table IV. Note that when we have multiple non-key joins in a row as is the case with this example, the selectivities of those joins don’t need to be absurdly low for Join-Agg to have a substantial difference in performance. This is because the intermediate results keep expanding as non-key joins progress resulting in the output of a very large set of tuples that then need to be aggregated.

In Table V we can see that for datasets B1,B2,B3, where we have three group attributes from different relations, showcase a similar performance trend as the other examples. In the datasets that output large groups, Join-Agg performs up to an order of magnitude better whereas the performance of B3 which outputs groups of size on average, is comparable to PostgreSQL. Particularly for dataset B1, we have a very low selectivity () join for whereas the other joins portray a high selectivity (). We can see that even a single low-selectivity join in this complex query, can result in a huge ( tuples) intermediate output which Join-Agg helps to avoid materializing.

In the real datasets we experimented with, showcased in Table VI, we observe results consistent with the synthetic datasets. The DBLP (rf. Figure 6a) and TPCH (rf. Figure 7a) queries output very small groups, causing the time for loading the data graph to dominate the computation. The dataset ORDS [27] is a typical market basket dataset of invoices that contain multiple items. We are querying all item pairs and counting how many times they were bought together. IMDB, is graph pattern counting query over the IMDB movie graph as seen in Figure 7b. This query counts the number of paths between an actor and a genre, two hops away from that actor, i.e. even genres of movies that co-actors of theirs played in. For both of the latter queries the groups portray a higher average size and the benefits of Join-Agg start becoming apparent.

Vii-D Pre-aggregation Performance Analysis

To experimentally validate our hypothesis in regards to how using pre-aggregation stacks up against our approach, we sampled the B2 dataset – incrementally taking a larger sample. Information about the samples can be seen in Table II.

Figure 8 showcases the difference in memory requirements between Join-Agg and pre-aggregation. We can see that in the case of pre-aggregation, as the size of the largest intermediate result required for the query after using aggressive pre-aggregation at every stage of the join increases, the maximum amount of memory required to complete the query increases rapidly. The memory required when it comes to executing Join-Agg however increases slowly since it only has to do with the size of the input data in combination with the largest amount of c-pairs that need to be stored at a single iteration (after we process any one source node).

Figures 9 and 10 showcase the computation time required for the execution of the branching query shown in Figure 6c. Due to the fact that our pre-aggregation implementation is relatively simple and done in Java (as discussed in detail in Section VI), a large portion of the computation comes down to garbage collection time. If however we only look at the amount of time spent doing actual computation as shown in Figure 11, we can clearly see the gap in performance between the two techniques, as was expected based on the complexity analysis in Section V.

Viii Related Work

Here we discuss the closely related work on executing these types of queries in different contexts within RDBMSs.

Factorized Databases and Worst-Case Optimal Joins: The data-graph paradeigm that we propose in this paper is reminiscent of the factorized representation of conjunctive query results previously proposed by Olteanu et al. [26]. Both representations aim at representing the underlying join while reducing the amount of data stored in order to do so. The data graph can also be connected to the idea of a tuple hypergraph which can cover all tuples in a query result  [16]; it however serves a very different purpose.

Our main objective in this work is to be able to compute aggregations over a representation like the data graph, especially in the case of complex acyclic joins.

Several different works have considered the problem of executing group by aggregate queries against a factorized representation of a conjunctive query [6, 18, 17, 19, 29, 28, 25]. The key guarantees like constant-delay enumeration, however, do not extend to the kind of group by queries we focus on in this work, e.g., the “branching” query . Because all of (group by attributes) need to be present in the output, either (a) one of the other attributes needs to be eliminated (which requires generation of a large intermediate result), or (b) we have to iterate over all combinations of values for and compute the aggregate value for each combination (which can be prohibitively expensive if either the sizes or the number of group by attributes is large). Our work here, thus, can be seen as exploring an alternative approach to computing aggregates over the factorized representation.

As we discussed earlier, recent work on worst-case optimal joins [23, 32, 20, 24] shows how to avoid large intermediate results during execution of multi-way join queries; Joglekar et al., [14, 15] discuss how to generalize that to aggregate queries. Their approach is largely complementary to  [6], as well as our line of work. Recent work on FAQ [4] proposed a generalized way of viewing a very common type of aggregation query called a Functional Aggregate Query which they see parallels in multiple scenarios other than databases e.g. matrix multiplication, probabilistic graphical models, and logic. The InsideOut algorithm proposed in FAQ however is not focused on executing SQL queries, like our work as well as the factorized databases work is aimed at doing. FAQ also assumes an optimal variable order, while this paper does not explore the benefits of choosing the optimal variable order (tree decomposition in our case).

Iceberg Queries: An iceberg query is a particular class of SQL queries, defined as an aggregate query, counting occurrences of target group instances of the GROUP BY clause columns, and filtering the results post-aggregation using a HAVING clause. These queries typically return a small fraction of the overall (potentially large) join result, (the tip of the iceberg). Iceberg queries are clearly a special case of the queries we’re studying in this paper.

Fang et al. [9] propose a wide array of techniques for computing iceberg queries which focus on minimizing the passes done over the data (Disk I/O), being able to answer such queries in reasonable time, and doing so with a small amount of memory. The authors focus on combining two techniques: coarse counting (probabilistic counting), and sampling. These techniques may start causing issues as the final result increases in size. In a similar setting there has been work on efficiently computing the iceberg CUBE [7, 11, 36], which is largely orthogonal to ours, since this paper focuses on the general case of outputting all groups. Developing techniques for more efficient iceberg queries using our Join-Agg operator are delegated to future work.

Walenz et al. [33], presented a series of optimizations applicable to certain types of iceberg queries. The main focus of this work is to use formal methods towards automatically identifying whether a general SQL query would benefit from certain specialized optimizations for evaluating certain types of iceberg queries, as well as towards automatically using such optimizations during evaluation. The optimizations they consider involve pruning techniques based on memoization and complex non-equality join conditions. Given a general SQL query, their methods systematically identify if any optimization technique is applicable, and use it during execution of the query. Similarly to us, the authors implement and wrap the above optimizations into a custom database join operator. The work in this paper is largely orthogonal to ours since it mainly deals with complex join conditions, it does not focus on minimizing extra memory consumption during execution, and is more aimed at providing formal methods for automatically identifying queries that would benefit from these specialized optimizations.

Similarity Joins: Work on similarity joins [13, 35, 34, 22] uses various techniques to prune join computation. In a similarity join between two relations, (on a string join condition), a pair of tuples join if their join attribute similarity surpasses a threshold. This can be directly mapped onto the iceberg query problem where the aggregation function is COUNT. From this perspective, iceberg queries aim at finding the tuples in the result that have a certain number of join condition attributes in common, which surpasses a threshold. Similarity join techniques are almost exclusively signature-based (strings are collapsed into smaller signature sets). In a lot of these approaches, an “inverted index” is built beforehand, which in a sense resembles our in-memory graph structure. These join algorithms are however only studied for binary operations, similar to the self-join case.

Data Reduction Operators: Work on lazy vs eager aggregation [37, 39, 38] aims at re-arranging group-by operators in the logical query plan tree, moving them after or before joins accordingly. These techniques don’t deal with avoiding materialization of intermediate results in situations when group by operators cannot be pushed down. Aggregation can only be pushed down if it can be partially applied to a single relation, thus reducing that relation’s cardinality. In the general case however when the query contains a series of group by attributes, each one coming from a different relation, there’s no way to apply any complex aggregation to a single relation because the aggregation applies to the result.

Larson et al., describe techniques for doing partial pre-aggregation [21]. They describe a way to apply pre-aggregation to input relations when another aggregation is conducted on their join result. A simple hash table is used to aggregate tuples in the same relation, thus reducing the number of tuples joining with the next relation. As groups are pre-aggregated sequentially, if the number of pre-aggregated groups surpasses the memory capacity, partially pre-aggregated tuples are output to make room for new groups; therefore the pre-aggregation can be incomplete. Those same-group tuples will be aggregated later on at the final aggregation step. They also describe techniques to combine this pre-aggregation process with a join by pre-aggregating while reading the relation and joining the output partially pre-aggregated tuples with the tuples from the inner relation. These techniques however apply to a single binary join at a time, and as we show in Figure 8, Join-Agg provides substantial memory benefits than partial pre-aggregation especially when the two are combined and we use pre-aggregation before loading in the data graph.

As previously mentioned, the way we load the data graph into memory is reminiscent of these data reduction operators since we are pre-aggregating all relations to compute the multiplicity of each edge in the data graph. The creation of multi-nodes in the data graph can also be seen as an even more effective form of pre-aggregation. For example in Figure 3, looking at relation , we can see that any standard pre-aggregation operator would reduce the relation to at least tuples with jc1,jd1 appearing twice whereas we load a single jc1,jd1 node. Our techniques are comparable with partial pre-aggregation in the case where there are no branching relations. As branching relations and multiple group by attributes are included in a complex join, our technique enables computation caching at the level of path-ids which can reduce the number of paths taken in the data graph during Stage 2 of the algorithm. The partial pre-aggregation technique has no means of skipping these paths and may require computing all of the joins associated with those paths potentially multiple times.

Ix Conclusion

In this paper, we proposed a multi-way database operator called Join-Agg that enables the memory-efficient execution of aggregation queries over joins that output large intermediate results, by executing the query over a graph representation of the underlying data called the data graph. We presented a detailed complexity analysis comparing our approach to the traditional binary joins-based approach as well as an idealized partial pre-aggregation approach. Our experiments show that Join-Agg operator can be over an order of magnitude more efficient than the traditional approach for a wide variety of queries, even when implemented outside of the RDBMS. We advocate that multi-way database operators may be the answer to dealing with the “non-normalized” data of the real world, which often leads to expensive non-key joins along with other aforementioned issues. They will enable users to continue leveraging RDBMSs for their OLAP analyses, requiring smaller amounts of resources.

References

  • [1] http://www.tpc.org/tpch/.
  • [2] Dblp dataset. https://dblp.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.
  • [3] C. R. Aberger, S. Tu, K. Olukotun, and C. Ré. EmptyHeaded: A relational engine for graph processing. In SIGMOD, 2016.
  • [4] M. Abo Khamis, H. Q. Ngo, and A. Rudra. FAQ: questions asked frequently. In PODS, pages 13–28. ACM, 2016.
  • [5] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, 2000.
  • [6] N. Bakibayev, T. Kočiskỳ, D. Olteanu, and J. Závodnỳ. Aggregation and ordering in factorised databases. PVLDB, 6(14), 2013.
  • [7] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube. In ACM Sigmod Record, pages 359–370, 1999.
  • [8] A. Deshpande. An initial study of overheads of eddies. ACM SIGMOD Record, 33(1):44–49, 2004.
  • [9] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In VLDB, 1999.
  • [10] G. Gottlob, G. Greco, N. Leone, and F. Scarcello. Hypertree decompositions: Questions and answers. In PODS, 2016.
  • [11] J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In ACM SIGMOD Record, 2001.
  • [12] O. S. Jan Motl. Relational dataset repository. https://relational.fit.cvut.cz/.
  • [13] Y. Jiang, G. Li, J. Feng, and W.-S. Li. String similarity joins: An experimental evaluation. Proceedings of the VLDB Endowment, 7(8):625–636, 2014.
  • [14] M. Joglekar, R. Puttagunta, and C. Ré. Aggregations over generalized hypertree decompositions. arXiv preprint arXiv:1508.07532, 2015.
  • [15] M. R. Joglekar, R. Puttagunta, and C. Ré. Ajar: Aggregations and joins over annotated relations. In PODS, 2016.
  • [16] A. Kara and D. Olteanu. Covers of query results. In 21st International Conference on Database Theory, 2018.
  • [17] M. A. Khamis, R. R. Curtin, B. Moseley, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. On functional aggregate queries with additive inequalities. arXiv preprint arXiv:1812.09526, 2018.
  • [18] M. A. Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. Ac/dc: In-database learning thunderstruck. In

    Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning

    , page 8. ACM, 2018.
  • [19] M. A. Khamis, H. Q. Ngo, D. Olteanu, and D. Suciu.

    Boolean tensor decomposition for conjunctive queries with negation.

    In 22nd International Conference on Database Theory, 2019.
  • [20] P. Koutris, P. Beame, and D. Suciu. Worst-case optimal algorithms for parallel query processing. In LIPIcs-Leibniz International Proceedings in Informatics, volume 48. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
  • [21] P.-A. Larson. Data reduction by partial preaggregation. In ICDE, 2002.
  • [22] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. Proceedings of the VLDB Endowment, 5(3):253–264, 2011.
  • [23] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. Worst-case optimal join algorithms:[extended abstract]. In PODS, 2012.
  • [24] H. Q. Ngo, C. Ré, and A. Rudra. Skew strikes back: new developments in the theory of join algorithms. ACM SIGMOD Record, 42(4):5–16, 2014.
  • [25] D. Olteanu and M. Schleich. Factorized databases. SIGMOD Record, 45(2), 2016.
  • [26] D. Olteanu and J. Závodnỳ. Factorised representations of query results: size bounds and readability. In ICDT, 2012.
  • [27] U. M. L. Repository. Online retail data set. https://archive.ics.uci.edu/ml/datasets/Online+Retail.
  • [28] M. Schleich, D. Olteanu, and R. Ciucanu.

    Learning linear regression models over factorized joins.

    In Proceedings of the 2016 International Conference on Management of Data, pages 3–18. ACM, 2016.
  • [29] M. Schleich, D. Olteanu, and H. Ngo. A layered aggregate engine for analytics workloads. 2019.
  • [30] R. E. Tarjan and M. Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on computing, 13(3):566–579, 1984.
  • [31] T. L. Veldhuizen. Leapfrog triejoin: A simple, worst-case optimal join algorithm. arXiv preprint arXiv:1210.0481, 2012.
  • [32] T. L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In ICDT, pages 96–106, 2014.
  • [33] B. Walenz, S. Roy, and J. Yang. Optimizing iceberg queries with complex joins. In SIGMOD, 2017.
  • [34] J. Wang, J. Feng, and G. Li. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 3(1-2):1219–1230, 2010.
  • [35] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3):15, 2011.
  • [36] D. Xin, J. Han, X. Li, and B. W. Wah. Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. In VLDB, 2003.
  • [37] W. Yan and P.-A. Larson. Interchanging the order of grouping and join. Technical report, Technical Report CS 95-09, Dept. of Computer Science, University of Waterloo, Canada, 1995.
  • [38] W. P. Yan and P.-A. Larson. Performing group-by before join. In ICDE, 1994.
  • [39] W. P. Yan and P.-A. Larson. Eager aggregation and lazy aggregation. Group, 1:G2, 1995.