In this paper, we study the problem of constructing space-efficient compressed representations of the output of conjunctive query results, with the goal of efficiently supporting a given access pattern directly over the compressed result, instead of the original input database. In many data management tasks, the data processing pipeline repeatedly accesses the result of a conjunctive query (CQ) using a particular access pattern. In the simplest case, this access pattern can be to enumerate the full result (e.g., in a multiquery optimization context). Generally, the access pattern can specify, or bound, the values of some variables, and ask to enumerate the values of the remaining variables that satisfy the query.
Currently, there are two extremal solutions for this problem. In one extreme, we can materialize the full result of the CQ and index the result according to the access pattern. However, since the output result can often be extremely large, storing this index can be prohibitively expensive. In the other extreme, we can service each access request by executing the CQ directly over the input database every time. This solution does not need extra storage, but can lead to inefficiencies, since computation has to be done from scratch and may be redundant. In this work, we explore the design space between these two extremes. In other words, we want to compress the query output such that it can be stored in a space-efficient way, while we can support a given access pattern over the output as fast as possible.
Suppose we want to perform an analysis about mutual friends of users in a social network. The friend relation is represented by a symmetric binary relation of size , where a tuple denotes that user a is a friend of user b. The data analysis involves accessing the database through the following pattern: given any two users and who are friends, return all mutual friends . We formalize this task through an adorned view . The above formalism says that the view of the database will be accessed as follows: given values for the bound (b) variables , we have to return the values for the free (f) variable such that the tuple is in the view . The sequence is called the access pattern for the adorned view.
One option to solve this problem is to satisfy each access by evaluating a query on the input database. This approach is space-efficient, since we work directly on the input and need space . However, we may potentially have to wait time to even learn whether there is any returned value for . A second option is to materialize the view and build a hash index with key : in this case, we can satisfy any access optimally with constant delay . (Footnote 1: the notation includes a poly-logarithmic dependence on .) On the other hand, the space needed for storing the view can be .
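The two options above can be sketched in a few lines of Python. This is a minimal illustration and not the data structure developed in this paper: the names `build_adjacency`, `mutual_friends_lazy`, and `materialize_index` are hypothetical, and the friend relation is assumed to be given as a set of ordered pairs that is already symmetric.

```python
from collections import defaultdict


def build_adjacency(friends):
    """Index the symmetric friend relation by its first column (linear space)."""
    adj = defaultdict(set)
    for a, b in friends:
        adj[a].add(b)
    return adj


def mutual_friends_lazy(adj, x, y):
    """First option: answer each request on the input by intersecting neighbor sets.

    No storage beyond the adjacency index, but a single request may scan an
    entire neighbor set before it finds (or fails to find) a first answer.
    """
    if y not in adj.get(x, ()):
        return []  # x and y are not friends, so the view has no matching tuple
    small, large = sorted((adj[x], adj.get(y, set())), key=len)
    return sorted(z for z in small if z in large)


def materialize_index(adj):
    """Second option: store every (x, y) -> [z, ...] entry of the view in a hash index."""
    index = {}
    for x in adj:
        for y in adj[x]:
            common = sorted(adj[x] & adj.get(y, set()))
            if common:
                index[(x, y)] = common
    return index
```

The lazy variant keeps only the input, while the index holds one entry per pair of friends with a common neighbor, which is where the space blow-up of the second option comes from.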
In this scenario, we would like to construct representations that trade off between space and delay (or answer time). As we will show later, for this particular example we can construct a data structure for any parameter that needs space , and can answer any access request with delay .
The idea of efficiently compressing query results has recently gained considerable attention, both in the context of factorized databases , as well as constant-delay enumeration [32, 5]. In these settings, the focus is to construct compressed representations that allow for enumeration of the full result with constant delay: this means that the time between outputting two consecutive tuples is , independent of the size of the data. Using factorization techniques, for any input database , we can construct a compressed data structure for any CQ without projections, called a -representation, using space , where is the fractional hypertree width of the query . Such a -representation guarantees constant-delay enumeration of the full result. In [31, 5], the compression of CQs with projections is also studied, but the setting is restricted to time preprocessing, which also restricts the size of the compressed representation to .
In this work, we show that we can dramatically decrease the space for the compressed representation by both taking advantage of the access pattern, and tolerating a possibly increased delay. For instance, a -representation for the query in Example 1 needs space, while no linear-time preprocessing can support constant delay enumeration (under reasonable complexity assumptions ). However, we show that if we are willing to tolerate a delay of , we can support the access pattern of Example 1 using only space, linear in the input size.
Applications. We illustrate the applicability of compressed representations of conjunctive queries on two practical problems: processing graph queries over relational databases, and scaling up statistical inference engines.
In the context of graph analytics, the graph to be analyzed is often defined as a declarative query over a relational schema [34, 35, 36, 2]. For instance, consider the DBLP dataset, which contains information about which authors write which papers through a table . To analyze the relationships between co-authors, we will need to extract the co-author graph, which we can express as the view . Most graph analytics algorithms typically access such a graph through an API that asks for the set of neighbors of a given vertex, which corresponds to the adorned view . Since the option of materializing the whole graph (here defined as the view ) may require prohibitively large space, it is desirable to apply techniques that compress , while we can still answer any access request efficiently. Recent work  has proposed compression techniques for this particular domain, but these techniques are limited to adorned views of the form , rely on heuristics, and do not provide any formal analysis on the tradeoff between space and runtime.
The second application of query compression is in statistical inference. For example, Felix  is an inference engine for Markov Logic Networks over relational data, which provides scalability by optimizing the access patterns of logical rules that are evaluated during inference. These access patterns over rules are modeled exactly as adorned views. Felix groups the relations in the body of the view in partitions (optimizing for some cost function), and then materializes each partition (which corresponds to materializing a subquery). In the one extreme, it will eagerly materialize the whole view, and in the other extreme it will lazily materialize nothing. The materialization in Felix is discrete, in that it is not possible to partially materialize each subquery. In contrast, we consider materialization strategies that explore the full continuum between the two extremes.
Our Contribution. In this work, we study the design space for compressed representations of conjunctive queries in the full continuum between optimal space and optimal runtime, when our goal is to optimize for a specific access pattern.
Our main contribution is a novel data structure that can compress the result for every CQ without projections according to the access pattern given by an adorned view, and can be tuned to trade off space for delay and answer time. At the one extreme, the data structure achieves constant delay ; at the other extreme it uses linear space , but provides a worse delay guarantee. Our proposed data structure includes as a special case the data structure developed in  for the fast set intersection problem.
To construct our data structure, we need two technical ingredients. The first ingredient (Theorem 1) is a data structure that trades space with delay with respect to the worst-case size bound of the query result. As an example of the type of tradeoffs that can be achieved, for any CQ without projections and any access pattern, the data structure needs space to achieve delay , where is the fractional edge cover number of , and the size of the input database. In many cases and for specific access patterns, the data structure can substantially improve upon this tradeoff. To prove Theorem 1, we develop novel techniques on how to encode information about expensive sub-instances of the problem in a balanced way.
However, Theorem 1 on its own gives suboptimal tradeoffs, since it ignores structural properties of the query (for example, for constant delay it materializes the full result). Our second ingredient (Theorem 2) combines the data structure of Theorem 1 with a type of tree decomposition called connex tree decomposition . This tree decomposition has the property of restricting the tree structure such that the bound variables in the adorned view always form a connected component at the top of the tree.
Finally, we discuss the complexity of choosing the optimal parameters for our two main theorems, when we want to optimize for delay given a space constraint, or vice versa.
Organization. We present our framework in Section 2, along with the preliminaries and basic notation. Our two main results (Theorems 1 and 2) are presented in Section 3. We then present the detailed construction of the data structure of Theorem 1 in Section 4, and of Theorem 2 in Section 5. Finally, in Section 6, we discuss some complexity results for optimizing the choice of parameters.
2 Problem Setting
In this section we present the basic notions and terminology, and then discuss in detail our framework.
2.1 Conjunctive Queries
In this paper we will focus on the class of conjunctive queries (CQs), which are expressed as
Here, the symbols
are vectors that contain variables or constants, the atom is the head of the query, and the atoms form the body. The variables in the head are a subset of the variables that appear in the body. A CQ is full if every variable in the body appears also in the head, and it is boolean if the head contains no variables, i.e. it is of the form . We will typically use the symbols to denote variables, and to denote constants. If is an input database, we denote by the result of running over .
Natural Joins. If a CQ is full, has no constants and no repeated variables in the same atom, then we say it is a natural join query. For instance, the triangle query is a natural join query. A natural join can be represented equivalently as a hypergraph , where is the set of variables, and for each hyperedge there exists a relation with variables . We will write the join as . The size of relation is denoted by . Given a set of variables , we define .
Valuations. A valuation over a subset of the variables is a total function that maps each variable to a value , where is a domain of constants. Given a valuation of the variables , we denote .
Join Size Bounds. Let be a hypergraph, and . A weight assignment is called a fractional edge cover of if for every and for every . The fractional edge cover number of , denoted by is the minimum of over all fractional edge covers of . We write .
In a celebrated result, Atserias, Grohe and Marx  proved that for every fractional edge cover of , the size of a natural join is bounded using the following inequality, known as the AGM inequality:
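Writing |R_F| for the size of the relation on hyperedge F and u_F for its weight (notation assumed here), the inequality bounds the size of the join by the product of |R_F|^(u_F) over all hyperedges F. The following Python sketch, whose function names are hypothetical, checks that a weight assignment is a fractional edge cover and evaluates that product. It assumes the hypergraph is given as a dictionary from relation names to sets of variables.

```python
import math


def is_fractional_edge_cover(edges, weights, variables):
    """Check that weights are non-negative and every variable gets total weight >= 1."""
    if any(w < 0 for w in weights.values()):
        return False
    return all(
        sum(weights[name] for name, attrs in edges.items() if x in attrs) >= 1 - 1e-9
        for x in variables
    )


def agm_bound(sizes, weights):
    """Right-hand side of the AGM inequality: product of |R_F| ** u_F."""
    return math.prod(sizes[name] ** weights[name] for name in sizes)
```

For the triangle query with three relations of size 100 and weight 1/2 on each hyperedge, the cover is valid and the bound evaluates to 100^(3/2) = 1000.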
Tree Decompositions. Let be a hypergraph of a natural join query . A tree decomposition of is a tuple where is a tree, and every is a subset of , called the bag of , such that
each edge in is contained in some bag ; and
for each , the set of nodes is connected in .
The fractional hypertree width of a tree decomposition is defined as , where is the minimum fractional edge cover of the vertices in . The fractional hypertree width of a query , denoted , is the minimum fractional hypertree width among all tree decompositions of its hypergraph.
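The two conditions of a tree decomposition are easy to check mechanically. The sketch below is an illustration under stated assumptions: the function name and input encoding are hypothetical, and `tree_edges` is assumed to already form a tree over the keys of `bags` (the code does not verify this).

```python
def is_tree_decomposition(hyperedges, tree_edges, bags):
    """Check the two tree-decomposition conditions.

    hyperedges: iterable of sets of variables
    tree_edges: iterable of (node, node) pairs, assumed to form a tree
    bags: dict mapping each tree node to its set of variables
    """
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(set(e) <= bag for bag in bags.values()) for e in hyperedges):
        return False

    neighbors = {t: set() for t in bags}
    for s, t in tree_edges:
        neighbors[s].add(t)
        neighbors[t].add(s)

    # Condition 2: for each variable, the nodes whose bags contain it are connected.
    variables = set().union(*bags.values())
    for x in variables:
        holders = {t for t, bag in bags.items() if x in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            t = stack.pop()
            for s in neighbors[t] & holders:
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        if seen != holders:
            return False
    return True
```

The connectivity test restricts a depth-first search to the nodes that hold the variable, so a variable that appears in two bags separated by a bag without it is rejected.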
Computational Model. To measure the running time of our algorithms, we will use the uniform-cost RAM model , where data values as well as pointers to databases are of constant size. Throughout the paper, all complexity results are with respect to data complexity (unless explicitly mentioned), where the query is assumed fixed.
We use the notation to hide a polylogarithmic factor for some constant , where is the input database.
2.2 Adorned Views
In order to model access patterns over a view defined over the input database, we will use the concept of adorned views . In an adorned view, each variable in the head of the view definition is associated with a binding type, which can be either bound (b) or free (f). A view is then written as , where is called the access pattern. We denote by (resp. ) the set of bound (resp. free) variables from .
We can interpret an adorned view as a function that maps a valuation over the bound variables to a relation over the free variables . In other words, for each valuation over , the adorned view returns the answer for the query , which we will also refer to as an access request.
captures the following access pattern: given values , list all the -values that form a triangle with the edge . As another example, simply captures the case where we want to perform a full enumeration of all the triangles in the result. Finally, expresses the access pattern where given a node with , we want to know whether there exists a triangle that contains it or not.
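The reading of an adorned view as a function from bound values to a relation over the free variables can be made concrete as follows. This is a naive sketch over an already materialized view, intended only to fix the semantics: the function name is hypothetical, and the access pattern is assumed to be encoded as a string over the symbols b and f, one per head variable.

```python
def answer_access_request(view_tuples, pattern, bound_values):
    """Evaluate an adorned view on one access request.

    view_tuples: iterable of tuples over the head variables
    pattern: string over {'b', 'f'}, one symbol per head variable
    bound_values: tuple of values for the 'b' positions, in head order
    """
    bound_pos = [i for i, c in enumerate(pattern) if c == "b"]
    free_pos = [i for i, c in enumerate(pattern) if c == "f"]
    if len(bound_values) != len(bound_pos):
        raise ValueError("one value is required per bound variable")
    answers = set()
    for t in view_tuples:
        if all(t[i] == v for i, v in zip(bound_pos, bound_values)):
            answers.add(tuple(t[i] for i in free_pos))
    return sorted(answers)
```

With every position free the function returns the whole view, and with every position bound it returns either a single empty tuple (the request is in the view) or nothing.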
An adorned view is boolean if every head variable is bound, it is non-parametric if every head variable is free, and it is full if the CQ is full (i.e., every variable in the body also appears in the head). Of particular interest is the adorned view that is full and non-parametric, which we call the full enumeration view, and simply asks to output the whole result.
2.3 Problem Statement
Given an adorned view and an input database , our goal is to answer any access request that conforms to the access pattern . The view can be expressed through any type of query, but in this work we will focus on the case where is a conjunctive query.
There are two extremal approaches to handle this problem. The first solution is to answer any such query directly on the input database , without materializing . This solution is efficient in terms of space, but it can lead to inefficient query answering. For instance, consider the adorned view . Then, every time we are given new values , we would have to compute all the nodes that form a triangle with , which can be very expensive.
The second solution is to materialize the view , and then answer any incoming query over the materialized result. For example, we could choose to materialize all triangles, and then create an appropriate index over the output result. The drawback of this approach is that it requires a lot of space, which may not be available.
We propose to study the solution space between these two extremal solutions, that is, instead of materializing all of , we would like to store a compressed representation of . The compression function must guarantee that the compression is lossless, i.e., there exists a decompression function such that for every database , it holds that . We compute the compressed representation during a preprocessing phase, and then answer any access request in an online phase.
Parameters. Our goal is to construct a compression that is as space-efficient as possible, while it guarantees that we can efficiently answer any access query. In particular, we are interested in measuring the tradeoff between the following parameters, which are also depicted in Figure 1:
Compression Time (): the time to compute during the preprocessing phase.
Space (): the size of .
Answer Time: this parameter measures the time to enumerate a query result, where the query is of the form . The enumeration algorithm must enumerate the query result without any repetitions of tuples, and use only extra memory. (Footnote 2: The memory requirement also depends on the memory required for executing the join algorithm. Note that worst-case optimal join algorithms such as NPRR  can be executed using memory, assuming the query size is constant and all relations are sorted and indexed.) We will measure answer time in two different ways.
delay (): the maximum time to output any two consecutive tuples (and also the time to output the first tuple, and the time to notify that the enumeration has completed).
total answer time (): the total time to output the result.
In the case of a boolean adorned view, the delay and the total answer time coincide. In an ideal situation, both the compression time and the space are linear in the input size and any query can be answered with constant delay . As we will see later, this is achievable in certain cases, but in most cases we have to trade off space and preprocessing time for delay and total answer time.
2.4 Some Basic Results
We present here some basic results that set up a baseline for our framework. We will study the case where the given view definition is a conjunctive query.
Our first observation is that if we allow the compression time to be at least , we can assume without loss of generality that the adorned view has no constants or repeated variables in a single atom. Indeed, we can first do a linear time computation to rewrite the adorned view to a new view where constants and repeated variables are removed, and then compute the compressed representation for this new view (with the same adornment).
Consider . We can first compute in linear time and , and then rewrite the adorned view as .
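The rewriting step can be sketched as a single pass per atom. The code below is an illustration, not the paper's procedure verbatim: the `Var` marker class and the function `normalize_atom` are hypothetical names, and each relation is assumed to be a set of tuples whose positions line up with the terms of the atom.

```python
class Var(str):
    """Marks a term as a variable; any other term is treated as a constant."""


def normalize_atom(relation, terms):
    """Rewrite one atom so it has no constants and no repeated variables.

    Returns (new_variables, new_relation). The new relation keeps one column
    per distinct variable and only the tuples that agree with the constants
    and with the equalities forced by repeated variables.
    """
    first_pos = {}
    for i, term in enumerate(terms):
        if isinstance(term, Var) and term not in first_pos:
            first_pos[term] = i
    new_vars = list(first_pos)
    new_relation = set()
    for t in relation:
        consistent = all(
            t[i] == (t[first_pos[term]] if isinstance(term, Var) else term)
            for i, term in enumerate(terms)
        )
        if consistent:
            new_relation.add(tuple(t[first_pos[v]] for v in new_vars))
    return new_vars, new_relation
```

Each input tuple is inspected once, so the rewriting stays linear in the size of the relation for a fixed query.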
Hence, whenever the adorned view is a full CQ, we can w.l.o.g. assume that it is a natural join query. We now state a simple result for the case where the adorned view is full and every variable is bound.
Suppose that the adorned view is a natural join query with head . Then, in time , we can construct a data structure with space , such that we can answer any access request over with constant delay .
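One way to realize such a structure, sketched here under assumptions, is to hash every input relation and probe each atom once per request. The function names are hypothetical, relations are assumed to be sets of tuples, and each relation name is assumed to appear in at most one atom.

```python
def build_membership_index(relations):
    """Store each relation as a hash set of tuples (space linear in the input)."""
    return {name: set(tuples) for name, tuples in relations.items()}


def contains(index, atoms, head, request):
    """Answer an access request in which every head variable is bound.

    atoms: dict mapping relation name -> tuple of variable names
    head: tuple of head variables, which include every variable of the body
    request: tuple of values, one per head variable
    """
    valuation = dict(zip(head, request))
    return all(
        tuple(valuation[x] for x in variables) in index[name]
        for name, variables in atoms.items()
    )
```

The number of probes equals the number of atoms, which is a constant under data complexity, and each probe is an expected constant-time hash lookup.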
Next, consider the full enumeration view . A first observation is that if we store the materialized view, we can enumerate the result in constant delay. From the AGM bound, to achieve this we need space , where is the hypergraph of . However, it is possible to improve upon this naive solution using the concept of a factorized representation . Let denote the fractional hypertree width of . Then, the result from  can be translated in our terminology as follows.
Proposition 2 ().
Suppose that the adorned view is a natural join query with head . Then, in compression time , we can construct a data structure with space , such that we can answer any access request over with constant delay .
Since every acyclic query has , for acyclic CQs without projections both the compression time and space become linear, . In the next section, we will see how we can generalize the above result to an arbitrary adorned view that is full.
3 Main Results and Application
In this section we present our two main results, and show how they can be applied. The first result (Theorem 1) is a compression primitive that can be used with any full adorned view. The second result (Theorem 2) builds upon Theorem 1 and query decomposition techniques to obtain an improved tradeoff between space and delay.
3.1 First Main Result
Consider a full adorned view , where is a natural join query expressed by the hypergraph . Recall that are the bound and free variables respectively. Since the query is a natural join and there are no projections, we have . We will denote by the number of free variables. We also impose a lexicographic order on the enumeration order of the output tuples. Specifically, we equip the domain with a total order , and then extend this to a total order for output tuples in using some order of the free variables. (Footnote 3: There is no restriction imposed on the lexicographic ordering of the free variables.)
As a running example, consider
We have and . To keep the exposition simple, assume that .
If we materialize the result and create an index with composite key , then in the worst case we need space , but we will be able to enumerate the output for every access request with constant delay. On the other hand, if we create three indexes, one for each with key , we can compute each access request with worst-case running time and delay of . Indeed, once we fix the bound variables to constants , we need to compute the join , which needs time using any worst-case optimal join algorithm.
For any fractional edge cover of , and , we define the slack of for as:
Intuitively, the slack is the maximum positive quantity such that is still a fractional edge cover of . By construction, the slack is always at least one, . For our running example, suppose that we pick a fractional edge cover for with . Then, the slack of for is .
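Following the intuition above, the largest factor by which a cover can be scaled down while still covering a set of variables is the smallest total weight received by any variable in that set. The sketch below computes this quantity; the function name, the input encoding, and the small hypergraph in the usage note are hypothetical and are not taken from the text.

```python
def slack(edges, weights, variables):
    """Largest alpha such that weights / alpha still covers every variable in `variables`.

    edges: dict mapping relation name -> set of variables
    weights: dict mapping relation name -> weight of a fractional edge cover
    variables: the subset of variables for which the slack is computed
    """
    return min(
        sum(weights[name] for name, attrs in edges.items() if x in attrs)
        for x in variables
    )
```

As a usage example, take three hypothetical relations over {w1, x, y}, {w2, y, z}, and {w3, x, z}, each with weight 1. Every one of x, y, z lies in two hyperedges, so the slack for {x, y, z} is 2, whereas the slack for the full variable set is 1 because w1 is covered only once.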
Let be an adorned view over a natural join query with hypergraph . Let be any fractional edge cover of . Then, for any input database and parameter we can construct a data structure with
such that for any access request , we can enumerate its result in lexicographic order with
Let us apply Theorem 1 to our running example for and . The slack for the free variables is . The theorem tells us that we can construct in time a data structure with space , such that every access request can be answered with delay and answer time .
Applying Theorem 1. We start with the observation that we can always apply Theorem 1 by choosing to be the fractional edge cover with optimal value . Since the slack is always , we obtain the following result.
Let be an adorned view over a natural join query with hypergraph . Then, for any input database and parameter , we can construct a data structure with
such that for any access request , we can enumerate its result in lexicographic order with
Proposition 3 tells us that the data structure has a linear tradeoff between space and delay. Also, to achieve (almost) constant delay , the space requirement becomes ; in other words, the data structure will essentially materialize the whole result. Our second main result will allow us to exploit query decomposition to avoid this case.
Consider the following adorned view over the Loomis-Whitney join:
The minimum fractional edge cover assigns weight to each hyperedge and has . Then, Proposition 3 tells us that for , we can construct a compressed representation with space and delay . Notice that if we aim for linear space, we can choose and achieve a small delay of .
Proposition 3 ignores the effect of the slack for the free variables. The next example shows that taking slack into account is critical in obtaining better tradeoffs.
Consider the adorned view over the star join
The star join is acyclic, which means that the -representation of the full result takes only linear space. This -representation can be used for any adornment of where is a bound variable; hence, in all these cases we can guarantee delay using linear compression space. However, we cannot get any guarantees when is free, as is in the adornment used above.
We should note here that our data structure strictly generalizes the data structure proposed in  for the problem of fast set intersection. Given a family of sets , the goal in this problem is to construct a space-efficient data structure, such that given any two sets we can compute their intersection as fast as possible. It is easy to see that this problem is captured by the adorned view , where is a relation that describes set membership ( means that ).
3.2 Second Main Result
The direct application of Theorem 1 can lead to suboptimal tradeoffs between space and time/delay, since it ignores the structural properties of the query. In this section, we show how to overcome this problem by combining Theorem 1 with tree decompositions.
We first need to introduce a variant of a tree decomposition of a hypergraph , defined with respect to a given subset .
Definition 1 (Connex Tree Decomposition ).
Let be a hypergraph, and . A -connex tree decomposition of is a tuple , where:
is a tree decomposition of ; and
is a connected subset of such that .
In a -connex tree decomposition, the existence of the set forces the set of nodes that contain some variable from to be connected in the tree.
Consider the hypergraph in Figure 2. The decomposition depicted on the left is a -connex tree decomposition for . The -connex tree decomposition on the right is for . In both cases, consists of a single bag (colored grey) which contains exactly the variables in .
In , -connex decompositions were used to obtain compressed representations of CQs with projections (where is the set of the head variables). In our setting, we will choose to be the set of bound variables in the adorned view, i.e., . Additionally, we will use a novel notion of width, which we introduce next.
Given a -connex tree decomposition , we orient the tree from some node in . For any node , we denote by the union of all the bags for the nodes that are the ancestors of . Define and . Intuitively, (resp. ) are the bound (resp. free) variables for the bag as we traverse the tree top-down. Figure 2 depicts each bag as .
Given a -connex tree decomposition, a delay assignment is a function that maps each bag to a non-negative number, such that for . Intuitively, this assignment means that we want to achieve a delay of for traversing this particular bag. For a bag , define
where is a fractional edge cover of the bag . The -connex fractional hypertree -width of is defined as . It is critical that we ignore the bags in the set in the max computation. We also define where is the fractional edge cover of bag that minimizes .
When for every bag , the -width of any -connex tree decomposition becomes , where is the fractional edge cover number of . Define as the smallest such quantity among all -connex tree decompositions of . When , then , thus recovering the notion of fractional hypertree width. Appendix D shows the relationship between and other hypergraph related parameters.
Finally, we define the -height of a -connex tree decomposition to be the maximum weight root-to-leaf path, where the weight of a path is defined as .
Consider the decomposition on the right in Figure 2, and a delay assignment that assigns to node with , to the bag with , and to the node with . The -height of the tree is . To compute the fractional hypertree -width, observe that we can cover the bag by assigning weight of 1 to the edges , in which case . We also have , and . Hence, the fractional hypertree -width is . Also, observe that and .
Let be an adorned view over a natural join query with hypergraph . Suppose that admits a -connex tree decomposition. Fix any delay assignment , and let be the -connex fractional hypertree -width, the -height of the decomposition, and .
Then, for any input database , we can construct a data structure in compression time with space , such that we can answer any access request with delay .
If we write the delay in the above result as , where is the maximum-weight path, Theorem 2 tells us that the delay is essentially multiplicative in the same branch of the tree, but additive across branches. Unlike Theorem 1, the lexicographic ordering of the result for Theorem 2 now depends on the tree decomposition.
For our running example, Theorem 2 implies a data structure with space and delay . This data structure can be computed in time . Notice that this is much smaller than the time required to compute the worst case output. We prove the theorem in detail in Section 5, and we discuss the complexity of choosing the optimal parameters in Section 6. Next, we delve deeper into Theorem 2 and how to apply it.
Consider the following adorned view:
A direct application of Theorem 1 results in a tradeoff of space with delay . On the other hand, we can construct a connex tree decomposition where has a single bag , which is connected to , which is in turn connected to , and so on. Consider the delay assignment that assigns to each bag . The -width of this decomposition is , while the -height is . Hence, Theorem 2 results in a tradeoff of space with delay .
Suppose now that our goal is to achieve constant delay. From Theorem 2, in order to do this we have to choose the delay assignment to be 0 everywhere. In this case, we have the following result (which slightly strengthens Theorem 2 in this special case by dropping the polylogarithmic dependence).
Let be a full adorned view over a hypergraph . Then, for any input database , we can construct a data structure in compression time and space , such that we can answer any access request with delay .
Observe that when all variables are free, then , in which case , thus recovering the compression result of a -representation. Moreover, since the delay assignment is 0 for all bags, the compression time .
Beyond full adorned views. Our work provides compression strategies for queries that do not admit out-of-the-box factorization (such as Loomis-Whitney joins), and can also recover the result of compressed -representations as a special case when all variables are free (Proposition 4). On the other hand, factorized databases support a much richer set of queries such as projections, aggregations [7, 6] and analytical tasks such as learning regression models [30, 27]. One possible approach to handling projections in our setting is to force a variable ordering in the -connex decomposition: more precisely, we can force projection variables to appear first in any root to leaf path. This idea of fixing variable ordering would be similar to how order-by clauses are handled in -tree query plans . Remarkably, the -connex decomposition in our setting also corresponds to the tree decompositions used to compute aggregations and orderings with group-by attributes as . This points to a deeper connection between our compressed representation and -tree representations used to compute group-by aggregates. We defer the study of these connections and extension of our framework to incorporate more expressive queries to future work.
3.3 A Remark on Optimality
So far we have not discussed the optimality of our results. We remark why proving tight lower bounds might be a hard problem.
The problem of k-SetDisjointness is defined as follows. Given a family of sets of total size , we want to ask queries of the following form: given as input a subset of size , is the intersection empty? The goal is to construct a space-efficient data structure such that we can answer as fast as possible. Note that k-SetDisjointness corresponds to the following adorned view: , where has size . One can see that we can use the data structure for the corresponding full view with head (see Example 7) to answer k-SetDisjointness queries in time , using space .
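For intuition, the case of two query sets admits a folklore heavy/light structure, sketched below. It is not the data structure of this paper, and the class name and the threshold parameter `tau` are hypothetical. With total input size N, fewer than N / tau sets can have more than tau elements, so the sketch stores fewer than (N / tau)^2 precomputed answers and otherwise scans a set of at most tau elements.

```python
from itertools import combinations


class TwoSetDisjointness:
    """Heavy/light structure for disjointness queries on pairs of sets."""

    def __init__(self, sets, tau):
        self.sets = {name: frozenset(s) for name, s in sets.items()}
        # A set is heavy if it has more than tau elements.
        heavy = [name for name, s in self.sets.items() if len(s) > tau]
        # Precompute the answer for every pair of distinct heavy sets.
        self.heavy_disjoint = {
            frozenset((a, b)): self.sets[a].isdisjoint(self.sets[b])
            for a, b in combinations(heavy, 2)
        }

    def disjoint(self, a, b):
        """Return True iff sets a and b have an empty intersection."""
        if a == b:
            return not self.sets[a]
        key = frozenset((a, b))
        if key in self.heavy_disjoint:
            return self.heavy_disjoint[key]  # both heavy: one lookup
        # At least one set is light: scan the smaller one, at most tau probes.
        small, large = sorted((self.sets[a], self.sets[b]), key=len)
        return not any(x in large for x in small)
```

Raising `tau` shrinks the table of heavy pairs and lengthens the worst-case scan, which is the same space-for-time exchange discussed in this section.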
In a recent work, Goldstein et al.  conjecture the following lower bound:
(due to ) Consider a data structure that preprocesses a family of sets of total size . If the data structure can answer k-SetDisjointness queries in time (or delay) , then it must use space. (Footnote 4: For boolean queries, answer time and delay coincide.)
The above conjecture is a generalization of a conjecture from  for the case , which in turn generalizes a folklore conjecture of Patrascu and Roditty , which was stated only for the case where and . Applied in our setting, Conjecture 1 implies that for the adorned view , the tradeoff between space and delay (or answer time) is essentially optimal when all relations have equal size. Unfortunately, proving even the weaker conjecture of  is considered a hard open problem.
4 A Compression Primitive
In this section, we describe the detailed construction of our data structure for Theorem 1.
4.1 Intervals and Boxes
Before we present the compression procedure, we first introduce two important concepts in our construction, -intervals and -boxes, both of which describe subspaces of the space of all possible tuples in the output.
Intervals. The active domain of each variable is equipped with a total order induced from the order of . We will use to denote the smallest and largest element of the active domain respectively (these will always exist, since we assume finite databases). An interval for variable is any subset of of the form , where , denoted by . We adopt the standard notation for closed and open intervals and write , and . The interval is called the unit interval and represents a single value. We will often write for the interval , and the symbol for the interval .
By lifting the order from a single domain to the lexicographic order of tuples in , we can also define intervals over , which we call -intervals. For instance, if and , the -interval represents all valuations over that are lexicographically at least , but strictly smaller than .
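Membership in such an interval is a pair of lexicographic comparisons. In the sketch below (hypothetical function name, valuations encoded as Python tuples in the chosen order of the free variables), the built-in tuple comparison plays the role of the lifted order, under the assumption that the domain values themselves are comparable.

```python
def in_f_interval(valuation, low, high):
    """Membership in the half-open interval [low, high) under lexicographic order.

    Python compares tuples lexicographically, which matches the order obtained
    by lifting the domain order to tuples over the free variables.
    """
    return low <= valuation < high
```

Note that a valuation such as (2, 99) lies in the interval from (1, 5) to (3, 2) even though its second coordinate is outside the range spanned by 5 and 2: only the first differing coordinate decides the comparison.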
Boxes. It will be useful to consider another type of subsets of , which we call -boxes.
Definition 2 (-box).
An -box is defined as a tuple of intervals , where is an interval of . The -box represents all valuations over , such that for every .
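In contrast to a lexicographic interval, membership in a box is checked coordinate by coordinate. The following sketch uses a hypothetical function name and, as a simplifying assumption, represents every interval of the box by a closed pair of endpoints.

```python
def in_box(valuation, box):
    """Membership of a valuation in a box given as one closed interval per variable.

    box: sequence of (low, high) pairs, both endpoints included; a unit interval
    is (a, a), and a full-domain interval uses the smallest and largest value.
    """
    return len(valuation) == len(box) and all(
        low <= value <= high for value, (low, high) in zip(valuation, box)
    )
```

For the box with intervals [1, 3] and [2, 5], the valuation (2, 3) is a member, while (2, 0) is not, because its second coordinate falls outside [2, 5].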
We say that a -box is canonical if whenever , then every with is a unit interval. A canonical -box is always of the form . For ease of notation, we will omit the intervals in the end of a canonical -box, and simply write .
A -box satisfies the following important property:
For every -box , .
Suppose that the -box is .
Consider some valuation over that belongs in . Then, for every we have , and also for every variable we have . Since for every variable in we have as well, we conclude that . Thus, belongs in as well.
For the opposite direction, consider some valuation over that belongs in . Since , we have that for every ,