Recent work has made remarkable progress in developing data structures and algorithms for answering set intersection problems [goldstein2017conditional], reachability oracles and directed reachability [agarwal2011approximate, agarwal2014space, cohen2010hardness], histogram indexing [chan2015clustered, kociumaka2013efficient], and problems related to document retrieval [afshani2016data, larsen2015hardness]. This class of problems splits an algorithmic task into two phases: the preprocessing phase, which computes a space-efficient data structure, and the answering phase, which uses the data structure to answer requests as quickly as possible. A fundamental algorithmic question related to these problems is the tradeoff between the space necessary for data structures and the answering time for requests.
For example, consider the -Set Disjointness problem: given a universe of elements and a collection of sets , we want to create a data structure such that for any pair of integers , we can efficiently decide whether is empty or not. Previous work [cohen2010hardness, goldstein2017conditional] has shown that the space-time tradeoff for -Set Disjointness is captured by the equation , where is the total size of all sets. The data structure obtained is conjectured to be optimal [goldstein2017conditional], and its optimality was used to develop conditional lower bounds for other problems, such as approximate distance oracles [agarwal2011approximate, agarwal2014space]. Similar tradeoffs have been independently established for other data structure problems as well. In the -Reachability problem [goldstein2017conditional, Cohen2010] we are given as input a directed graph and an arbitrary pair of vertices , and the goal is to decide whether there exists a path of length between and . In the edge triangle detection problem [goldstein2017conditional], we are given as input an undirected graph , and the goal is to develop a data structure that takes space and can answer in time whether a given edge participates in a triangle or not. Each of these problems has been studied in isolation and, as a result, the algorithmic solutions do not generalize, because a comprehensive framework is lacking.
In this paper, we cast many of the above problems into answering Conjunctive Queries (CQs) over a relational database. CQs are a powerful class of relational queries with widespread applications in data analytics and graph exploration [graphgen2015, graphgen2017, deep2018compressed]. For example, by using the relation to encode that element belongs to set , -Set Disjointness can be captured by the following CQ: . As we will see later, -Reachability can also be naturally captured by a CQ.
The insight of casting data structure problems into CQs over a database allows for a unified treatment for developing algorithms within the same framework, which in turn allows for improved algorithms and data structures. In particular, we can leverage the techniques developed by the data management community through a long line of research on efficient join evaluation [yannakakis1981algorithms, skewstrikesback, ngo2012worst], including worst-case optimal join algorithms [ngo2012worst] and tree decompositions [gottlob2014treewidth, robertson1986graph]. The use of these techniques has been a subject of previous work [abo2020decision, greco2013structural, deep2018compressed, olteanu2016factorized, kara19, kara2019counting] for enumerating query results under static and dynamic settings. In this paper, we build upon the aforementioned techniques to develop a framework that allows us to obtain general space-time tradeoffs for any Boolean CQ (a Boolean CQ is one that outputs only true or false). As a consequence, we recover state-of-the-art tradeoffs for several existing problems (e.g., -Set Disjointness as well as its generalization -Set Disjointness and -Reachability) as special cases of the general tradeoff. We can even obtain improved tradeoffs for some specific problems, such as edge triangles detection, thus falsifying existing conjectures.
Our Contribution. We summarize our main technical contributions below.
A Comprehensive Framework. We propose a unified framework that captures several widely-studied data structure problems. More specifically, we resort to the formalism of CQs and the notion of Boolean adorned queries, where the values of some variables in the query are fixed by the user (denoted as an access pattern), and aim to evaluate the Boolean query. We then show how this framework captures the -Set Disjointness and -Reachability problems. Our first main result (Subsection 4.1) is an algorithm that builds a data structure to answer any Boolean CQ under a specific access pattern. Importantly, the data structure can be tuned to trade off space for answering time, thus capturing the full continuum between optimal space and answering time. At one extreme, the data structure achieves constant answering time by explicitly storing all possible answers. At the other extreme, the data structure stores nothing, but we execute each request from scratch. We show how to recover existing and new tradeoffs using this general framework. The first main result may sometimes lead to suboptimal tradeoffs since it does not take into account the structural properties of the query. Our second main result (Subsection 4.2) combines tree decompositions of the query structure with access patterns to improve space efficiency. We then show how this algorithm can handle Boolean CQs with negation.
Improved Algorithms. In addition to the main result above, we explicitly improve the best-known space-time tradeoff for the -Reachability problem for . For any , the tradeoff of was conjectured to be optimal by [goldstein2017conditional], where is the number of edges in the graph, and was used to conditionally prove other lower bounds on space-time tradeoffs. We show that for a regime of answer time , it can be improved to , thus breaking the conjecture. To the best of our knowledge, this is the first non-trivial improvement for the -Reachability problem. We also refute a lower bound conjecture for the edge triangles detection problem established by [goldstein2017conditional].
Conditional Lower Bounds. Finally, we show a reduction between lower bounds for the problem of -Set Disjointness for different values of , which generalizes the -Set Disjointness to computing the intersection between given sets, for .
Organization. We introduce the basic terminology and problem definition in Section 2 and Section 3. We present our main results for Boolean adorned queries in Section 4 and our improved result for path queries in Section 5. We discuss the lower bounds and related work in Section 6 and Section 7, and finally conclude in Section 8 with promising future research directions and open problems.
In this section we present the basic notation and terminology.
Data Model. A schema is a non-empty ordered set of distinct variables. Each variable has a discrete domain . A tuple over schema is an element from . A relation over schema (denoted ) is a function such that the multiplicity is non-zero for finitely many . A tuple exists in , denoted by , if . The size of relation , denoted as , is the size of set . A database is a set of relations and the size of the database is the sum of sizes of all its relations. Given a tuple over schema and a set of variables , denotes the restriction of to and the values of follow the same variable ordering as . We also define the selection operator and projection operator .
Conjunctive Queries. A Conjunctive Query (CQ) is an expression of the form
Expressions are called atoms or relations. The atom is the head of the query, while the atoms form the body. Here is the set of all variables occurring in , i.e., . The existentially quantified variables are the set of variables . Throughout the paper, we will omit the existentially quantified part whenever and are mentioned in the query. A CQ is full if every variable in the body also appears in the head (a.k.a. quantifier-free), and Boolean if the head contains no variables, i.e., it is of the form (a.k.a. fully-quantified). We will typically use the symbols to denote variables, and to denote constants. Given an input database , we use to denote the result of the query over the database. In this paper, we will consider CQs that have no constants and no repeated variables in the same atom. Such a query can be represented equivalently as a hypergraph , where is the set of variables, and for each hyperedge there exists a relation with variables .
Suppose that we have a directed graph that is represented through a binary relation : this means that there exists an edge from node to node . We can compute the pairs of nodes that are connected by a directed path of length using the following CQ, which we call a path query:
A CQ with negation, denoted as , is a CQ where some of the atoms can be negative, i.e., is allowed. For , we denote by the conjunction of the positive atoms in . A is said to be safe if every variable appears in at least one positive atom. In this paper, we restrict our scope to the class of safe , a standard assumption [wei2003containment, nash2004processing] ensuring that query results are well-defined and do not depend on domains.
Join Size Bounds. Let be a hypergraph. A weight assignment is called a fractional edge cover of if for every and for every . The fractional edge cover number of , denoted by is the minimum of over all fractional edge covers of . We write . In a celebrated result, Atserias, Grohe and Marx [AGM] proved that for every fractional edge cover of , the size of join result is bounded by the AGM inequality:
The above bound is constructive [skewstrikesback, ngo2012worst]: there exists an algorithm that computes the result of in time for every fractional edge cover of .
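To make the AGM bound concrete, the following sketch evaluates the triangle query over three binary relations and compares the output size with the bound given by the fractional edge cover that assigns weight 1/2 to every atom. The function names `triangle_join` and `agm_bound` and the toy data in the usage note are illustrative assumptions; the evaluation strategy (hash indexes plus a scan of the smaller candidate set) is only a simple stand-in and is not claimed to be the algorithm of [skewstrikesback, ngo2012worst].

```python
from collections import defaultdict
from math import prod

def triangle_join(R, S, T):
    """Enumerate {(a, b, c) : (a, b) in R, (b, c) in S, (c, a) in T}."""
    s_index = defaultdict(set)   # b -> {c : (b, c) in S}
    for b, c in S:
        s_index[b].add(c)
    t_index = defaultdict(set)   # a -> {c : (c, a) in T}
    for c, a in T:
        t_index[a].add(c)
    out = []
    for a, b in R:
        # Scan the smaller candidate set for c and probe the larger one.
        small, large = sorted((s_index[b], t_index[a]), key=len)
        out.extend((a, b, c) for c in small if c in large)
    return out

def agm_bound(sizes, weights):
    """Product of |R_F| ** u_F for a fractional edge cover u (assumed valid)."""
    return prod(n ** w for n, w in zip(sizes, weights))
```

For instance, taking all three relations to be the 12 ordered pairs of distinct nodes over a 4-node domain, the join returns 24 tuples, while `agm_bound([12, 12, 12], [0.5, 0.5, 0.5])` evaluates to roughly 41.6.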
Tree Decompositions. Let be a hypergraph of a CQ . A tree decomposition of is a tuple where is a tree, and every is a subset of , called the bag of , such that
Each edge in is contained in some bag; and
For each variable , the set of nodes form a connected subtree of .
The fractional hypertree width of a decomposition is defined as , where is the minimum fractional edge cover of the vertices in . The fractional hypertree width of a query , denoted , is the minimum fractional hypertree width among all tree decompositions of its hypergraph. We say that a query is acyclic if .
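The two conditions in the definition of a tree decomposition can be checked mechanically. The sketch below is one way to do so; the function name `is_tree_decomposition` and the input encoding (bags as a dictionary from tree nodes to variable sets, the tree as a list of node pairs) are assumptions made for illustration, and the code assumes without verifying that the given node pairs really form a tree.

```python
def is_tree_decomposition(hyperedges, bags, tree_edges):
    """hyperedges: iterable of variable sets; bags: dict node -> set of variables;
    tree_edges: list of (node, node) pairs, assumed to form a tree over the nodes."""
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(set(e) <= bag for bag in bags.values()) for e in hyperedges):
        return False
    # Condition 2: for each variable, the nodes whose bags contain it are connected.
    adj = {t: set() for t in bags}
    for s, t in tree_edges:
        adj[s].add(t)
        adj[t].add(s)
    for x in set().union(*bags.values()):
        nodes = {t for t, bag in bags.items() if x in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:                       # search restricted to `nodes`
            cur = stack.pop()
            for nxt in adj[cur] & nodes:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != nodes:
            return False
    return True
```

For the hypergraph of a 3-path with hyperedges {x1, x2}, {x2, x3}, {x3, x4}, the chain of bags {x1, x2}, {x2, x3}, {x3, x4} passes both checks, whereas reordering the chain so that the two bags containing x2 are no longer adjacent fails the connectivity condition.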
Computational Model. To measure the running time of our algorithms, we will use the uniform-cost RAM model [hopcroft1975design], where data values and pointers to databases are of constant size. Throughout the paper, all complexity results are with respect to data complexity, where the query is assumed fixed. Each relation over schema is implemented via a data structure that stores all entries in space, which supports look-up, insertion, and deletion of entries in time. For a schema , we use an index structure that for some defined over can (i) check if ; and (ii) return in constant time.
In this section, we discuss the concept of adorned queries and present our framework.
3.1 Adorned Queries
In order to model different access patterns, we will use the concept of adorned queries introduced by [ullman1986approach]. In an adorned query, each variable in the head of the query is associated with a binding type, which can be either bound () or free (). We denote this as , where is called the access pattern. The access pattern tells us for which variables the user must provide a value as input. Concretely, let be the bound variables. An instantiation of the bound variables to constants, such as , is an access request: we need to return the query results where we have replaced each bound variable with the corresponding constant . In the next few examples, we demonstrate how to capture several data structure problems by adorned queries.
[Set Disjointness and Set Intersection] In the set disjointness problem, we are given sets drawn from the same universe . Let be the total size of input sets. Each access request is a pair of indexes , for which we need to decide whether is empty or not. To cast this problem as an adorned query, we encode the family of sets as a binary relation , such that element belongs to set . Note that the relation will have size . Then, the set disjointness problem corresponds to:
An access request in this case specifies two sets , and issues the (Boolean) query .
In the related set intersection problem, given a pair of indexes for , we instead want to enumerate the elements in the intersection , which can be captured by the following adorned query: .
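The preprocessing/answering split for set disjointness can be illustrated with the classical heavy/light technique. This is a minimal sketch under stated assumptions, not the construction developed in this paper: the class name `SetDisjointnessIndex` and the parameter `threshold` are hypothetical. Sets with more than `threshold` elements are treated as heavy; there are at most N/threshold of them, so materializing one bit per heavy pair costs at most (N/threshold)^2 space, and any request involving a light set is answered by scanning at most `threshold` elements.

```python
from itertools import combinations

class SetDisjointnessIndex:
    def __init__(self, sets, threshold):
        self.sets = [frozenset(s) for s in sets]
        # Heavy sets: more than `threshold` elements.
        self.heavy = [i for i, s in enumerate(self.sets) if len(s) > threshold]
        # Preprocessing: materialize the answer for every pair of heavy sets.
        self.heavy_disjoint = {
            (i, j): self.sets[i].isdisjoint(self.sets[j])
            for i, j in combinations(self.heavy, 2)   # yields i < j
        }

    def disjoint(self, i, j):
        """Answer the access request (i, j): is S_i ∩ S_j empty?"""
        if i == j:
            return not self.sets[i]
        key = (min(i, j), max(i, j))
        if key in self.heavy_disjoint:
            return self.heavy_disjoint[key]           # constant-time lookup
        # At least one set is light: scan the smaller set, probe the larger.
        small, large = sorted((self.sets[i], self.sets[j]), key=len)
        return all(x not in large for x in small)
```

Raising `threshold` shrinks the materialized table and lengthens the scan, which is the kind of space-time knob that the framework in this paper generalizes to arbitrary Boolean adorned queries.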
[-Set Disjointness] The -set disjointness problem is a generalization of the 2-set disjointness problem, where each request asks whether the intersection between sets is empty or not. Again, we can cast this problem into the following adorned query:
[-Reachability] Given a directed graph , the -reachability problem asks, given a pair of vertices , to check whether they are connected by a path of length . Representing the graph as a binary relation (which means that there is an edge from to ), we can model this problem through the following adorned query:
Observe that we can also check whether there is a path of length at most by combining the results of such queries (one for each length ).
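As a baseline for the reachability example, the sketch below answers a request with no preprocessing at all, by expanding the frontier one edge at a time. The function names are hypothetical. Note the assumption that "path" is read as in the conjunctive query, i.e., as a join of edge atoms, so vertices may repeat along the path.

```python
from collections import defaultdict

def has_path_of_length(edges, u, v, k):
    """True iff k edges (a join of k edge atoms) lead from u to v."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    frontier = {u}
    for _ in range(k):
        # One join step: follow every outgoing edge of the current frontier.
        frontier = {b for a in frontier for b in succ.get(a, ())}
        if not frontier:
            return False
    return v in frontier

def has_path_of_length_at_most(edges, u, v, k):
    """Combine the answers of the k queries for lengths 1, ..., k."""
    return any(has_path_of_length(edges, u, v, length)
               for length in range(1, k + 1))
```

Each call touches every edge at most once per step, so this baseline uses no extra space beyond the input but spends time linear in the graph size for fixed k.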
[Edge Triangles Detection] Given a graph , this problem asks, given an edge as the request, whether participates in a triangle or not. This task can be expressed as the following adorned query
In the reporting version, the goal is to enumerate all triangles that contain edge , which can also be expressed by the following adorned query .
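Both versions of the edge triangle problem reduce to intersecting the neighborhoods of the two endpoints. The following sketch shows this without any tradeoff parameter; the function names are hypothetical, and the code assumes an undirected graph without self-loops, so that a common neighbor is always a third vertex.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Adjacency sets of an undirected graph given as a list of vertex pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def edge_in_triangle(adj, u, v):
    """Boolean version: does the edge (u, v) have a common neighbor?"""
    small, large = sorted((adj.get(u, set()), adj.get(v, set())), key=len)
    return any(w in large for w in small)

def triangles_through_edge(adj, u, v):
    """Reporting version: all third vertices w closing a triangle on (u, v)."""
    return sorted(adj.get(u, set()) & adj.get(v, set()))
```

The Boolean check stops at the first common neighbor, while the reporting variant must materialize the whole intersection.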
We say that an adorned query is Boolean if every head variable is bound. In this case, the answer for every access request is also Boolean, i.e., true or false.
3.2 Problem Statement
Given an adorned query and an input database , our goal is to construct a data structure, such that we can answer any access request that conforms to the access pattern as fast as possible. In other words, an algorithm can be split into two phases:
Preprocessing phase: we compute a data structure using space .
Answering phase: given an access request, we compute the answer using the data structure built in the preprocessing phase, within time .
In this work, our goal is to study the relationship between the space of the data structure and the answering time for a given adorned query . We will focus on Boolean adorned queries, where the output is just true or false.
4 General Space-Time Tradeoffs
4.1 Space-Time Tradeoffs via Worst-case Optimal Algorithms
Let be an adorned query, and let be the corresponding hypergraph. Let denote the bound variables in the head of the query. For any fractional edge cover of , we define the slack of as: In other words, the slack is the maximum factor by which we can scale down the fractional cover so that it remains a valid edge cover of the non-bound variables in the query. (We will omit the parameter from the notation of whenever it is clear from the context.) Hence is a fractional cover of the nodes in . We always have .
Consider with the optimal fractional edge cover , where for . The slack is , since the fractional edge cover , where covers the only non-bound variable .
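The slack can be computed directly from its description as the largest factor by which the cover can be scaled down while still covering every non-bound variable: scaling by 1/α keeps a variable covered exactly when α is at most the total weight of the hyperedges containing it, so the slack is the minimum such total over the non-bound variables. The sketch below encodes this reading; the function names are hypothetical, and the instance in the usage note (three atoms sharing one non-bound variable `y`) is an assumed example in the style of set disjointness.

```python
from fractions import Fraction

def coverage(hyperedges, weights, x):
    """Total weight of the hyperedges that contain variable x."""
    return sum(w for e, w in zip(hyperedges, weights) if x in e)

def is_fractional_edge_cover(hyperedges, weights, variables):
    """Non-negative weights that cover each listed variable by at least 1."""
    return all(w >= 0 for w in weights) and all(
        coverage(hyperedges, weights, x) >= 1 for x in variables)

def slack(hyperedges, weights, bound):
    """Minimum coverage over non-bound variables (at least one is assumed)."""
    free = set().union(*hyperedges) - set(bound)
    return min(coverage(hyperedges, weights, x) for x in free)
```

With hyperedges `[{"y", "x1"}, {"y", "x2"}, {"y", "x3"}]`, unit weights, and bound variables `x1, x2, x3`, the slack is 3: dividing every weight by 3 (for example with `Fraction(1, 3)`) still covers `y`, although it no longer covers the bound variables.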
We can now state our first main theorem.
Let be a Boolean adorned query with hypergraph . Let be any fractional edge cover of . Then, for any input database , we can construct a data structure that answers any access request in time and takes space
We should note that Subsection 4.1 applies when the relation sizes are different; this gives us sharper upper bounds compared to the case where each relation is bounded by the total size of the input. Indeed, if we use as an upper bound on each relation, we obtain a space requirement of for achieving answering time , where is the fractional edge cover number. Since , this gives us at worst a linear tradeoff between space and time, i.e., . For cases where , we can obtain a much better tradeoff.
Continuing the example in this section, we obtain an improved tradeoff: . Note that this result matches the best-known space-time tradeoff for the -Set Disjointness problem [goldstein2017conditional]. (Note that all atoms use the same relation symbol , so for every .)
We next present a few more applications of Subsection 4.1.
[Edge Triangles Detection] For the Boolean version, it was shown in [goldstein2017conditional] that – conditioned on the strong set disjointness conjecture – any data structure that achieves answering time needs space . A matching upper bound can be constructed by using a fractional edge cover with slack . Thus, Subsection 4.1 can be applied to achieve answering time using space . Careful inspection reveals that a different fractional edge cover with slack , achieves a better tradeoff. Thus, Subsection 4.1 can be applied to obtain the following corollary.
For a graph , there exists a data structure of size that can answer the edge triangles detection problem in time.
The data structure implied by Subsection 4.1 is always better when , thus refuting the conditional lower bound in [goldstein2017conditional]. (All answering times are trivial to achieve using linear space by using the data structure for and holding the result back until time has passed. However, in certain practical settings, such as transmitting the data structure over a network, it is beneficial to construct sublinear-sized structures. In those settings, is useful.) We should note that this does not imply that the strong set disjointness conjecture is false, as we have observed an error in the reduction used by [goldstein2017conditional].
[Square Detection] Beyond triangles, we consider the edge square detection problem, which checks whether a given edge belongs to a square pattern in a graph . Using the fractional edge cover with slack , we obtain a tradeoff .
4.2 Space-Time Tradeoffs via Tree Decompositions
Subsection 4.1 does not always give us the optimal tradeoff. For the -reachability problem with the adorned query , Subsection 4.1 gives a tradeoff , by taking the optimal fractional edge covering number and slack , which is far from efficient. In this section, we will show how to leverage tree decompositions to further improve the space-time tradeoff in Subsection 4.1.
Again, let be an adorned query, and let be the corresponding hypergraph. Given a set of nodes , a -connex tree decomposition of is a pair , where is a tree decomposition of , and is a connected subset of the tree nodes such that the union of their variables is exactly . For our purposes, we choose . Given a -connex tree decomposition, we orient the tree from some node in . We then define the bound variables for the bag , as the variables in that also appear in the bag of some ancestor of . The free variables for the bag are the remaining variables in the bag, .
Consider the -path query . Here, and are the bound variables. Figure 1 shows the unconstrained decomposition as well as the -connex decomposition for , where . The root bag contains the bound variables . Bag contains as bound variables and as the free variables. Bag contains as bound variables for and as free variables.
Next, we define a parameterized notion of width for the -connex tree decomposition. The width is parameterized by a function that maps each node in the tree to a non-negative number, such that whenever . The intuition here is that we will spend in the node while answering the access request. The parameterized width of a bag is now defined as: where is a fractional edge cover of the bag , and is the slack (on the bound variables of the bag). The -width of the decomposition is then defined as . Finally, we define the -height as the maximum-weight path from the root to any leaf, where the weight of a path is . We now have all the necessary machinery to state our second main theorem.
Let be a Boolean adorned query with hypergraph . Consider any -connex tree decomposition of . For some parametrization of the decomposition, let be its -width, and be its -height. Then, for any input database , we can construct a data structure that answers any access request in time in space .
The function allows us to trade off between time and space. If we set for every node in the tree, then the -height becomes , while the -width equals the fractional hypertree width of the decomposition. As we increase the values of in each bag, the -height increases while the -width decreases, i.e., the answer time increases while the space decreases. Additionally, we note that the tradeoff from Subsection 4.2 is at least as good as the one from Subsection 4.1. Indeed, we can always construct a tree decomposition where all variables reside in a single node of the tree. In this case, we recover exactly the tradeoff from Subsection 4.1. Due to a lack of space, we refer the reader to Appendix B for details.
We continue with the -path query. Since , we assign . For , the only valid fractional edge cover assigns weight 1 to both and has slack 1. Hence, if we assign for some parameter , the width is . For , the only fractional cover also assigns weight 1 to both , with slack again. Assigning , the width becomes for as well. Hence, the -width of the tree decomposition is , while the -height is . Plugging this into Subsection 4.2 gives us a tradeoff with answering time and space usage , which matches the state-of-the-art result in [goldstein2017conditional]. The above argument can be generalized to the -path query with answering time and space usage .
Consider a variant of the square detection problem: given two vertices, the goal is to decide whether they occur in two opposite corners of a square, which can be captured by the following adorned query:
Subsection 4.1 gives a tradeoff with answering time and space . But we can obtain a better tradeoff using Subsection 4.2. Indeed, consider the tree decomposition where we have a root bag with , and two children of with Boolean and . For , we can see that if we assign a weight of to both hyperedges, we get a slack of . Hence, if , the -width is . Similarly for , we assign , for a -width with . Applying Subsection 4.2, we obtain a tradeoff with time (since both root-leaf paths have only one node), and space . So the space usage can be improved from to .
4.3 Extension to CQs with Negation
In this section, we present a simple but powerful extension of our result to adorned Boolean CQs with negation. Given a query , we build the data structure from Subsection 4.2 for but impose two constraints on the decomposition: (i) no leaf node contains any free variable, and (ii) for every negated relation , all variables of must appear together as bound variables in some leaf node. In other words, there exists a leaf node such that are present in it. It is easy to see that such a decomposition always exists. Indeed, we can fix the root bag to be , its child bag with free variables as and bound variables as , and the leaf bag, which is connected to the child of the root, with bound variables as without free variables. Observe that the bag containing free variables can be covered by only using the positive atoms since is safe. The intuition is the following: during the query answering phase, we wish to find the join result over all variables before reaching the leaf nodes; and then, we can check whether the tuples satisfy the negated atoms or not, in time. The next example shows the application of the algorithm to adorned path queries containing negation.
Consider the query . Using the decomposition in Figure 2, we can now apply Subsection 4.2 to obtain the tradeoff and . Both leaf nodes only require linear space since a single atom covers the variables. Given an access request , we check whether the answer for this request has been materialized or not. If not, we proceed to the query answering phase and find at most answers after evaluating the join in the middle bag. For each of these answers, we can now check in constant time whether the tuples formed by values for and are not present in relations and respectively.
For adorned queries where , we can further simplify the algorithm. In this case, we no longer need to create a constrained decomposition since the check to see if the negated relations are satisfied or not can be done in constant time at the root bag itself. Thus, we can directly build the data structure from Subsection 4.2 using the query .
[Open Triangle Detection] Consider the query , where is and is with the adorned view as . Observe that . We apply Subsection 4.2 to obtain the tradeoff and with root bag , its child bag with and , and the leaf bag to be and . Given an access request , we check whether the answer for this request has been materialized or not. If not, we traverse the decomposition and evaluate the join to find if there exists a connecting value for . For the last bag, we simply check whether exists in or not in time.
A note on optimality. It is easy to see that the algorithm obtained for Boolean CQs with negation is conditionally optimal assuming the optimality of Subsection 4.2. Indeed, if all negated relations are empty, the join query is equivalent to and the algorithm now simply applies Subsection 4.2 to . In example Figure 2, assuming relation is empty, the query is equivalent to set intersection whose tradeoffs are conjectured to be optimal.
5 Path Queries
In this section, we present an algorithm for the adorned query that improves upon the conjectured optimal solution. Before diving into the details, we first state the upper bound on the tradeoff between space and query time.
[due to [goldstein2017conditional]] There exists a data structure for solving with space and answering time such that .
Note that for , the problem is equivalent to SetDisjointness with the space/time tradeoff as . [goldstein2017conditional] also conjectured that the tradeoff is essentially optimal.
Conjecture (due to [goldstein2017conditional]).
Any data structure for with answering time must use space .
If is not a constant, Section 5 implies that space is needed for achieving answering time. Building upon Section 5, [goldstein2017conditional] also showed a result on the optimality of approximate distance oracles. Our results imply that Section 5 can be improved further, thus refuting Section 5. The first observation is that the tradeoff in Section 5 is only useful when . Indeed, we can always answer any Boolean path query in linear time using breadth-first search. Surprisingly, it is also possible to improve Section 5 for the regime of small answering time as well. In what follows, we will show the improvement for paths of length 4; we will generalize the algorithm for any length in the next section.
5.1 Length-4 Path
There exists a parameterized data structure for solving that uses space and answering time that satisfies the tradeoff .
Preprocessing Phase. Consider . Let be a degree threshold. We say that a constant is heavy if its frequency on attribute is greater than in both relations and ; otherwise, it is light. In other words, is heavy if and . We distinguish two cases based on whether a constant for is heavy or light. Let denote the unary relation that contains all heavy values, and the one that contains all light values. Observe that we can compute both of these relations in time by simply iterating over the active domain of variable and checking the degree in relations and . We compute two views:
We store the views as a hash index that, given a value of (or ), returns all matching values of . Both views take space . Indeed, . Since we can construct a fractional edge cover for by assigning a weight of 1 to and , this gives us an upper bound of for the query output. The same argument holds for . We also compute the following view for light values: This view requires space , since the degree of the light constants is at most . We can now rewrite the original query as
The rewritten query is a three path query. Hence, we can apply Subsection 4.1 to create a data structure with answering time and space .
Query Answering. Given an access request, we first check whether there exists a 4-path that goes through some heavy value in . This can be done in time using the views and . Indeed, we obtain at most values for using the index for , and values for using the index for . We then intersect the results in time by iterating over the values for and checking if the bound values for and form a tuple in and respectively. If we find no such 4-path, we check for a 4-path that uses a light value for . From the data structure we have constructed in the preprocessing phase, we can do this in time .
Tradeoff Analysis. From the above, we can compute the answer in time . From the analysis in the preprocessing phase, the space needed is . Thus, whenever , the space becomes , completing our analysis.
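The heavy/light case split on the middle variable can be sketched as follows. This is a simplified illustration under several assumptions, not the data structure analyzed above: the names `FourPathIndex`, `index`, `delta`, and the variable names x1 through x5 are hypothetical, relations are given as sets of pairs, and the light case is answered by a naive scan of the rewritten 3-path rather than by the tuned structure from Subsection 4.1, so the stated answering-time guarantee is not reproduced.

```python
from collections import defaultdict

def index(pairs):
    """Hash index: first component -> set of second components."""
    idx = defaultdict(set)
    for a, b in pairs:
        idx[a].add(b)
    return idx

class FourPathIndex:
    def __init__(self, R1, R2, R3, R4, delta):
        self.r1, self.r4 = index(R1), index(R4)
        r2, r3 = index(R2), index(R3)
        in_deg = defaultdict(int)          # degree of x3 in R2(x2, x3)
        for _, x3 in set(R2):
            in_deg[x3] += 1
        # x3 is heavy if its degree exceeds delta in both R2 and R3.
        heavy = {x3 for x3, out in r3.items()
                 if in_deg.get(x3, 0) > delta and len(out) > delta}
        self.v1 = defaultdict(set)         # x1 -> heavy x3 reachable via R1, R2
        for x1, x2 in set(R1):
            self.v1[x1] |= r2.get(x2, set()) & heavy
        self.v2 = defaultdict(set)         # x5 -> heavy x3 reaching it via R3, R4
        for x3 in heavy:
            for x4 in r3[x3]:
                for x5 in self.r4.get(x4, ()):
                    self.v2[x5].add(x3)
        self.v3 = defaultdict(set)         # x2 -> x4 connected through a light x3
        for x2, x3 in set(R2):
            if x3 not in heavy:
                self.v3[x2] |= r3.get(x3, set())

    def answer(self, x1, x5):
        # Heavy case: intersect the heavy x3 values seen from both endpoints.
        if not self.v1.get(x1, set()).isdisjoint(self.v2.get(x5, set())):
            return True
        # Light case: naive scan of the 3-path R1(x1,x2), V3(x2,x4), R4(x4,x5).
        return any(x5 in self.r4.get(x4, ())
                   for x2 in self.r1.get(x1, ())
                   for x4 in self.v3.get(x2, ()))
```

A larger `delta` moves more values of the middle variable into the light view, which grows that view while shrinking the two heavy views, mirroring the role of the degree threshold in the analysis.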
5.2 General Path Queries
We can now use the algorithm for the 4-path query to improve the space-time tradeoff for general path queries of length greater than four.
Let be an input instance. For , there is a data structure for with space and answer time for .
Fix some . We construct the data structure for a path of length recursively. The base case is when , with answer time and space .
In the recursive step, similar to the previous section, we set as the degree threshold for any constant that variables and can take. Let be unary relations that store the heavy values for respectively. We compute and store the result of
This view has size bounded by . We consider the following queries:
both of which correspond to the -path, so we can recursively apply the data structure here. Let be the space and time for -path. For space, we have the following observation: