When can we allow for direct access to a ranked list of answers to a database query without (and considerably faster than) materializing all answers? To illustrate the concrete instantiation of this question, assume the following simple relational schema for information about pandemic spread and relevant activity of residents:
Here, mentions, for each person, the cities that the person visits regularly (e.g., for work and relatives) and the age of the person (for risk assessment); the relation specifies the number of new infection cases in specific cities at specific dates (a measure that is commonly used for spread assessment albeit being sensitive to the amount of testing).
Suppose that we wish to efficiently compute the natural join based on equality of the city attribute, so that we have all combinations of people (with their age), the cities they regularly visit, and the city’s daily new cases. For example,
While the number of such answers could be quadratic in the size of the database, the seminal work of Bagan, Durand, and Grandjean (Bagan et al., 2007) has established that the join can be evaluated using an enumeration algorithm with a constant delay between consecutive answers, after a linear-time preprocessing phase. This is due to the fact that this join is a special case of a free-connex acyclic Conjunctive Query (CQ). In the case of CQs without self-joins, being free-connex acyclic is a sufficient and necessary condition for such efficient evaluation (Brault-Baron, 2013; Bagan et al., 2007). The necessity requires conventional assumptions in fine-grained complexity [Footnote 1: For the sake of simplicity, throughout this section we make all of these complexity assumptions. In Section 2 we give their formal statements.] and it holds even if we multiply the preprocessing and delay by a logarithmic factor in the size of the database. [Footnote 2: We refer to those as quasilinear preprocessing and log delay, respectively.]
To realize the constant (or logarithmic) delay, the preprocessing phase constructs a structure that allows for efficient iteration over the answers in the enumeration phase. Brault-Baron (Brault-Baron, 2013) showed that in the linear preprocessing phase, we can construct a structure with better guarantees: not only log-delay enumeration, but even log-time direct access, that is, a structure that allows us to directly retrieve the $k$th answer in the enumeration, given $k$, without needing to enumerate the preceding answers first. [Footnote 3: “Direct access” is also widely known as “random access.”] Later, Carmeli et al. (Carmeli et al., 2020) showed how such a structure can be used for enumerating answers in a random order (random permutation) [Footnote 4: Not to be confused with “random access.”] with the statistical guarantee that the order is uniformly distributed. In particular, in the above example we can enumerate the answers of the join in a provably uniform random permutation (hence, ensuring statistical validity of each prefix) with logarithmic delay, after a linear-time preprocessing phase. Their direct-access structure also allows for inverted access: given an answer, return the index of that answer (or determine that it is not a valid answer).
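To illustrate how a direct-access structure can drive a uniformly random enumeration, the following minimal sketch runs a lazy Fisher-Yates shuffle on top of an access routine. The names `random_order`, `access`, and `n` are hypothetical, indices are assumed to be 0-based, and the sketch is one standard way to obtain a uniform permutation rather than a transcription of the cited construction.

```python
import random

def random_order(n, access, rng=None):
    """Yield access(0), ..., access(n-1) in a uniformly random order."""
    if rng is None:
        rng = random.Random()
    moved = {}  # sparse view of the shuffled index array: position -> index
    for i in range(n):
        j = rng.randrange(i, n)          # pick a not-yet-emitted position
        chosen = moved.get(j, j)         # index currently stored at position j
        moved[j] = moved.get(i, i)       # move the index at position i to j
        moved.pop(i, None)               # position i is never read again
        yield access(chosen)

# Usage with a toy list standing in for the (implicit) array of answers.
answers = ["a", "b", "c", "d", "e"]
out = list(random_order(len(answers), lambda k: answers[k], random.Random(7)))
```

Only the positions touched so far are stored, so each step costs one call to `access` plus constant expected overhead for the dictionary operations.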
The direct-access structures of Brault-Baron (Brault-Baron, 2013) and Carmeli et al. (Carmeli et al., 2020) have the byproduct that they allow the answers to be sorted by some lexicographic order. For instance, in our example, the structure could be such that the tuples are in the (descending) order of and then by date, or in the order of and then by . Hence, in logarithmic time we can evaluate quantile queries (find the $k$th answer in order) and determine the position of a tuple inside the sorted list. From this we can also conclude (fairly easily) that we can enumerate the answers ordered by where ties are broken randomly, again provably uniformly. Carmeli et al. (Carmeli et al., 2020) have also shown how the order of the answers can be useful for generalizing direct-access algorithms from CQs to UCQs. Note that direct access to the sorted list of answers is a stronger requirement than the ranked enumeration that has been studied in previous work (Tziavelis et al., 2020a, b; Deep and Koutris, 2019; Yang et al., 2018).
Yet, the choice of which lexicographic order is taken is an artefact of the structure construction (e.g., the elimination order (Brault-Baron, 2013) or the join tree (Carmeli et al., 2020)). If the application requires a specific lexicographic order, we can only hope to find a matching construction, which does not necessarily exist. For example, could we construct in (quasi) linear time a direct-access structure for ordered by and then by ? Interestingly, it turns out the answer is negative: it is impossible to build in quasilinear time a direct-access structure with logarithmic access time for this order.
Getting back to the question posed at the beginning of this section, in this paper we embark on the challenge of identifying, for each CQ, the orders that allow for efficiently constructing a direct-access structure. We adopt the tractability yardstick of quasilinear construction (preprocessing) time and logarithmic access time. In addition, we focus on two types of orders: lexicographic orders, and scoring by the sum of attribute scores.
Contributions. Our first main result is an algorithm for direct access for lexicographic orders, including ones that are not achievable by past structures. We further show that within the class of CQs without self-joins, our algorithm covers all the tractable cases (in the sense adopted here), and we establish a decidable (and easy to test) classification of the lexicographic orders over the free variables into tractable and intractable ones. For instance, in the case of the lexicographic order
is intractable. It is classified as such because and are non-neighbors (i.e., do not co-occur in the same atom), but , which comes after and in the order, is a neighbor of both. This is what we call a disruptive trio. [Footnote 5: One could argue that, in reality, this example involves functional dependencies, such as , which could invalidate the lower bounds. Indeed, our classification does not account for constraints. Yet, all hardness statements mentioned about this example in this section can be shown to follow from the results of this paper. We further discuss constraints in the Conclusions (Section 7).] The lexicographic order is also intractable since the query is not -connex. In contrast, the lexicographic order is tractable. We also show that within the tractable side, the structure we construct allows for inverted access in constant time.
Our classification is proved in two steps. We begin by considering the complete lexicographic orders (that involve all variables). We show that for free-connex CQs without self-joins, the absence of a disruptive trio is a sufficient and necessary condition for tractability. We then generalize to partial lexicographic orders over a subset of the variables. There, the condition is that there is no disruptive trio and that the query is -connex. Interestingly, it turns out that a partial lexicographic order is tractable if and only if it is the prefix of a complete tractable lexicographic order.
A lexicographic order is a special case of an ordering by the sum of attribute scores, where every database value is mapped to some number. Hence, a natural question now is which CQs have a tractable direct access by the order of sum. For example, what about with the order ? It is easy to see that this order is intractable because the lexicographic order is intractable. In fact, it is easy to show that an order by sum is intractable whenever some lexicographic order is intractable (e.g., there is a disruptive trio). However, the situation is worse: the only tractable case is the one where the CQ is acyclic and there is an atom that contains all of the free variables. In particular, ordering by sum is intractable already for the Cartesian product , even though every lexicographic order is tractable (according to our aforementioned classification). This daunting hardness also emphasizes how ranked direct access is fundamentally harder than ranked enumeration where, in the case of the sum of attributes, the answers of every full CQ can be enumerated with logarithmic delay after a quasilinear preprocessing time (Tziavelis et al., 2020a).
To understand the root cause of the hardness of sum, we narrow our question to a considerably weaker guarantee. Our notion of tractability so far requires the construction of a structure in quasilinear time and a direct access in logarithmic time. In particular, if our goal is to compute just a single quantile, say the $k$th answer, then it takes quasilinear time. Computing a single quantile is known as the selection problem (Blum et al., 1973). The question we ask is to what extent selection is a weaker requirement than direct access in the case of CQs. That is, how much larger is the class of CQs with quasilinear selection than that of CQs with a quasilinear construction of a logarithmic-access structure?
We answer the above question for the class of full CQs without self-joins by establishing the following dichotomy for the order by sum (again assuming fine-grained hypotheses): the selection problem can be solved in time, where is the size of the database, if and only if the hypergraph of the CQ contains at most two maximal hyperedges (w.r.t. containment). The tractable side is applicable even in the presence of self-joins, and it is achieved by adopting an algorithm by Frederickson and Johnson (Frederickson and Johnson, 1984). For illustration, the selection problem is solvable in quasilinear time for the query ordered by sum.
Outline. The remainder of the paper is organized as follows. Section 2 gives the necessary background. In Section 3 we consider direct access by lexicographic orders that include all the free variables, and Section 4 extends the results to partial ones. We move on to the (for the most part) negative results for direct access by sum orderings in Section 5 and then study the selection problem in Section 6. Section 7 concludes and gives some directions for future work. Due to space constraints, some proofs are in the Appendix.
2.1. Basic Notions
Database. A schema is a set of relational symbols . We use for the arity of a relational symbol . A database instance contains a finite relation for each , where dom is a set of constant values called the domain. We use for the size of the database, i.e., the total number of tuples.
Queries. A conjunctive query (CQ) over schema is an expression of the form , where the tuples hold variables, every variable in appears in some , and . Each is called an atom of the query , and denotes the set of all atoms. We use or for the set of variables that appear in an atom or query , respectively. The variables are called free and are denoted by . A CQ is full if and Boolean if . Sometimes, we say that CQs that are not full have projections. A repeated occurrence of a relational symbol is a self-join and if no self-joins exist, a CQ is called self-join-free. A homomorphism from a CQ to a database is a mapping of to constants from dom, such that every atom of maps to a tuple in the database . A query answer is such a homomorphism followed by a projection of on the free variables, denoted by . The answer to a Boolean CQ is whether such a homomorphism exists. The set of query answers is .
Hypergraphs. A hypergraph is a set of vertices and a set of subsets of called hyperedges. Two vertices in a hypergraph are neighbors if they appear in the same edge. A path of is a sequence of vertices such that every two succeeding variables are neighbors. A chordless path is a path in which no two non-succeeding vertices appear in the same atom (in particular, no vertex appears twice). A join tree of a hypergraph is a tree where the nodes are the hyperedges of and the running intersection property holds, namely: for all the set forms a (connected) subtree in . An equivalent phrasing of the running intersection property is that given two vertices of the tree, for any vertex on the simple path between them, we have that . A hypergraph is acyclic if there exists a join tree for . We associate a hypergraph to a CQ where the vertices are the variables of , and every atom of corresponds to a hyperedge with the same set of variables. Stated differently, and . With a slight abuse of notation, we identify atoms of with hyperedges of . A CQ is acyclic if is acyclic, otherwise it is cyclic.
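As a concrete reading of the running intersection property, the following minimal sketch checks whether a candidate tree over the hyperedges is a join tree. The function name `is_join_tree` and the representation (hyperedges as Python sets, tree edges as pairs of node indices) are assumptions made for illustration.

```python
from collections import defaultdict

def is_join_tree(nodes, tree_edges):
    """nodes: list of sets of variables; tree_edges: pairs of node indices."""
    n = len(nodes)
    if n == 0 or len(tree_edges) != n - 1:
        return False
    adj = defaultdict(set)
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)

    def connected(allowed):
        # Depth-first search restricted to the node indices in `allowed`.
        allowed = set(allowed)
        start = next(iter(allowed))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in allowed and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen == allowed

    # n - 1 edges plus connectivity means the graph is a tree.
    if not connected(range(n)):
        return False
    # Running intersection: the nodes containing each variable are connected.
    variables = set().union(*nodes)
    return all(connected(i for i in range(n) if v in nodes[i]) for v in variables)

# A hypothetical path-shaped hypergraph with three hyperedges.
path = [{"x", "y"}, {"y", "z"}, {"z", "w"}]
ok = is_join_tree(path, [(0, 1), (1, 2)])
bad = is_join_tree(path, [(0, 2), (1, 2)])
```

In the second candidate tree, the two nodes containing `y` are not adjacent, so the running intersection property fails.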
Free-connex CQs. A hypergraph is an inclusive extension of if every edge of appears in , and every edge of is a subset of some edge in . Given a subset of the vertices of , a tree is an ext--connex tree (i.e., extension--connex tree) for a hypergraph if: (1) is a join tree of an inclusive extension of , and (2) there is a subtree of that contains exactly the vertices (Bagan et al., 2007). We say that a hypergraph is -connex if it has an ext--connex tree (Bagan et al., 2007). A hypergraph is -connex iff it is acyclic and it remains acyclic after the addition of a hyperedge containing exactly (Brault-Baron, 2013). Given a hypergraph and a subset of its vertices, an -path is a chordless path in with , such that , and . A hypergraph is -connex iff it has no -path (Bagan et al., 2007). A CQ is free-connex if is -connex (Bagan et al., 2007).
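The characterization via an added hyperedge suggests a simple test, sketched below under stated assumptions: acyclicity is decided with the classical GYO reduction (a standard technique that is not spelled out in the text above), and the helper names `is_acyclic` and `is_s_connex` are hypothetical. The sketch favors brevity over the linear running time that dedicated algorithms achieve.

```python
def is_acyclic(hyperedges):
    """GYO reduction: the hypergraph is acyclic iff it reduces to nothing."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: delete vertices that occur in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Rule 2: delete empty hyperedges and hyperedges contained in another
        # one (among equal hyperedges, the first copy is kept).
        kept = []
        for i, e in enumerate(edges):
            absorbed = any(
                e < f or (e == f and j < i)
                for j, f in enumerate(edges) if j != i
            )
            if not e or absorbed:
                changed = True
            else:
                kept.append(e)
        edges = kept
    return not edges

def is_s_connex(hyperedges, s):
    """Acyclic, and still acyclic after adding a hyperedge containing exactly s."""
    return is_acyclic(hyperedges) and is_acyclic(list(hyperedges) + [set(s)])

# A hypothetical two-atom path query over variables x, y, z.
path = [{"x", "y"}, {"y", "z"}]
full = is_s_connex(path, {"x", "y", "z"})   # all variables free
endpoints = is_s_connex(path, {"x", "z"})   # the middle variable projected away
```

Projecting away the middle variable adds the hyperedge `{x, z}`, which closes a triangle, so the second call reports that the query is not free-connex.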
2.2. Problem Definitions
Orders of Answers. For a CQ and database instance , a ranking function compares two query answers and returns the smaller one according to some underlying total order. [Footnote 6: WLOG, we assume that the order is ascending, but all results hold if rank returns the bigger answer instead of the smaller one.] We consider two types of orders in this paper. Assuming that the domain values are ordered, a lexicographic order is an ordering of such that first compares on the value of the first variable, and if they are equal on the value of the second variable, and so on. A lexicographic order is called partial if the variables in are a subset of .
The second type of order assumes a given weight function that assigns a real-valued weight to the domain values of each variable. More precisely, for a variable , we define and then the weight of a query answer is computed by aggregating the weights of the assigned values of free variables. In a sum-of-weights order, denoted by , we have and compares with . To simplify notation, we refer to all and together as one weight function . If two query answers have the same weight, then we break ties arbitrarily but consistently, e.g., according to a lexicographic order on their assigned values.
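The sum-of-weights order with consistent tie-breaking can be made concrete with a small sketch. It materializes and sorts a handful of answers purely to illustrate the order itself, not an efficient access method; the function name `sum_order_key`, the dictionary-based weight tables, and the toy data are all assumptions.

```python
def sum_order_key(answer, free_vars, weight):
    """Sort key: total weight first, then a lexicographic tie-break."""
    total = sum(weight[x][answer[x]] for x in free_vars)
    tie_break = tuple(answer[x] for x in free_vars)
    return (total, tie_break)

# Hypothetical per-variable weight functions over a tiny domain.
free_vars = ["x", "y"]
weight = {"x": {1: 0.5, 2: 2.0}, "y": {1: 3.0, 2: 1.0}}
answers = [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 2, "y": 2}]
ranked = sorted(answers, key=lambda a: sum_order_key(a, free_vars, weight))
```

Here the three answers have weights 3.5, 1.5, and 3.0, so the answer with `x = 1, y = 2` comes first.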
Attribute Weights vs. Tuple Weights. Notice that in the definition above, we assume that the input weights are assigned to the domain values of the attributes. Alternatively, the input weights could be assigned to the relation tuples, a convention that has been used in past work on ranked enumeration (Tziavelis et al., 2020a). Since there are several reasonable semantics for interpreting a tuple-weight ranking for CQs with projections and/or self-joins, we elect to present our results for the case of attribute weights. For self-join-free CQs, attribute weights can easily be transformed to tuple weights in linear time such that the weights of the query answers remain the same. This works by assigning each variable to one of the atoms that it appears in, and computing the weight of a tuple by aggregating the weights of the assigned attribute values. Therefore, our hardness results for sum-of-weights orders directly extend to the case of tuple weights. Moreover, note that our positive results on selection (Section 6.2) rely on algorithms that innately operate on tuple weights, thus we cover that case too.
Direct access vs. Selection. In the problem of direct access by an underlying order, we are given as an input a query , and a database , and the goal is to construct a data structure which then allows us to support accesses on the sorted array of query answers. Specifically, an access asks for the query answer at index on the (implicit) array containing sorted via rank comparisons, for a given integer . This data structure is built in a preprocessing phase, after which we have to be able to support multiple such accesses. Our goal is to achieve efficient access (in polylogarithmic time) with a preprocessing phase that is significantly smaller than (quasilinear in the database size).
The problem of selection (Blum et al., 1973; Floyd and Rivest, 1975; Frederickson, 1993) is a computationally easier task that requires only a single direct access, hence does not make a distinction between preprocessing and access phases. For example, a special case of the problem is to find the median query result.
2.3. Complexity Framework and Sorting
We measure asymptotic complexity in terms of the size of the database , while the size of the query is considered constant. The model of computation is the RAM model with uniform cost measure. In particular, it allows for linear time construction of lookup tables, which can be accessed in constant time. We would like to point out that some past works (Bagan et al., 2007; Carmeli et al., 2020) have assumed that in certain variants of the model, sorting can be done in linear time (Grandjean, 1996). Since we consider problems related to summation and sorting (Frederickson and Johnson, 1984) where a linear-time sort would improve otherwise optimal bounds, we adopt a more standard assumption that sorting is comparison-based and possible only in quasilinear time. As a consequence, some upper bounds mentioned in this paper are weaker than the original sources which assumed linear-time sorting (Brault-Baron, 2013; Carmeli et al., 2020).
2.4. Hardness Hypotheses and Background
Hardness Hypotheses. Denote by sparseBMM the hypothesis that two Boolean matrices and , represented as lists of their non-zero entries, cannot be multiplied in time , where is the number of non-zero entries in , , and . A consequence of this hypothesis is that we cannot answer the query with quasilinear preprocessing and polylogarithmic delay. In more general terms, no self-join-free acyclic non-free-connex CQ can be enumerated with quasilinear preprocessing time and polylogarithmic delay, assuming the sparseBMM hypothesis (Bagan et al., 2007; Berkholz et al., 2020). [Footnote 7: Works in the literature typically phrase this as linear, yet any logarithmic factor increase is still covered by the hypotheses.]
A -hyperclique is a set of vertices in a hypergraph such that every -element subset is a hyperedge. Denote by Hyperclique the hypothesis that for every there is no algorithm for deciding the existence of a -hyperclique in a -uniform hypergraph with hyperedges. When , this follows from the -Triangle hypothesis (Abboud and Williams, 2014) for any . When , this is a special case of the Hyperclique Hypothesis (Lincoln et al., 2018). A known consequence is that Boolean cyclic and self-join-free CQs cannot be answered in quasilinear time (see footnote 7) (Brault-Baron, 2013). Moreover, cyclic and self-join-free CQs do not admit enumeration with quasilinear preprocessing time and polylogarithmic delay assuming the Hyperclique hypothesis (Brault-Baron, 2013).
In its simplest form, the 3SUM problem asks for three distinct real numbers from a set with elements that satisfy . There is a simple algorithm for the problem, but it is conjectured that in general, no truly subquadratic solution exists (Patrascu, 2010). The significance of this conjecture has been highlighted by many conditional lower bounds for problems in computational geometry (Gajentaan and Overmars, 1995) and within the P class in general (Williams, 2015). Note that the problem remains hard even for integers provided that they are sufficiently large (i.e., in the order of ) (Patrascu, 2010). We denote by 3sum the following equivalent hypothesis (Baran et al., 2005) that uses three different sets of numbers: Deciding whether there exist from three sets of integers such that cannot be done in time for any . This lower bound has been confirmed in some restricted models of computation (Erickson, 1995; Ailon and Chazelle, 2005).
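For reference, the simple quadratic-time baseline for the three-set variant can be sketched as follows. The sketch assumes the formulation in which we look for one element from each set with a + b = c (an assumption about the exact equation), and the function name `three_sum` is hypothetical.

```python
def three_sum(A, B, C):
    """Return (a, b, c) with a in A, b in B, c in C and a + b == c, else None."""
    A, B = sorted(A), sorted(B)
    for c in C:
        # Two-pointer scan: i moves up in A, j moves down in B.
        i, j = 0, len(B) - 1
        while i < len(A) and j >= 0:
            s = A[i] + B[j]
            if s == c:
                return (A[i], B[j], c)
            if s < c:
                i += 1
            else:
                j -= 1
    return None

found = three_sum([1, 5], [2, 9], [7])
missing = three_sum([1, 2], [3, 4], [100])
```

After sorting, each element of the third set costs one linear scan, which gives quadratic time overall; the hypothesis states that no algorithm is polynomially faster.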
Known Results for CQs. We now provide some background that relates to the efficient handling of CQs. For a query with projections, a standard strategy is to reduce it to an equivalent one where techniques for acyclic full CQs can be leveraged. The following proposition, that is widely known and used (Berkholz et al., 2020), shows that this is possible for free-connex CQs.
Proposition 2.1 (Folklore).
Given a CQ over a database , a join tree of an inclusive extension of and a subtree of that contains all free variables, it is possible to compute in linear time a database over the schema of the CQ consisting of the nodes of such that .
This reduction is done by first creating a relation for every node in using projections of existing relations, then performing the classic semi-join reduction by Yannakakis (Yannakakis, 1981) to filter the relations of according to the relations of , and then we can simply ignore all relations that do not appear in and obtain the same answers. Afterwards, they can be handled efficiently, e.g. their answers can be enumerated with constant delay (Bagan et al., 2007).
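The semi-join filtering step can be sketched as a bottom-up pass followed by a top-down pass over the join tree. This is a minimal illustration under assumptions: relations are sets of tuples aligned with a per-node variable list, the tree is given by a hypothetical `parent` map together with a `top_down` node order in which every parent precedes its children, and the function names are not taken from the cited work.

```python
def semijoin(r_vars, r_rows, s_vars, s_rows):
    """Keep the rows of r that agree with some row of s on the shared variables."""
    shared = [v for v in r_vars if v in s_vars]
    r_idx = [r_vars.index(v) for v in shared]
    s_idx = [s_vars.index(v) for v in shared]
    keys = {tuple(row[i] for i in s_idx) for row in s_rows}
    return {row for row in r_rows if tuple(row[i] for i in r_idx) in keys}

def full_reduce(schema, rows, parent, top_down):
    """Remove dangling tuples from all relations of a join tree."""
    rows = {n: set(r) for n, r in rows.items()}
    # Bottom-up: filter each parent by its (already filtered) children.
    for n in reversed(top_down):
        p = parent.get(n)
        if p is not None:
            rows[p] = semijoin(schema[p], rows[p], schema[n], rows[n])
    # Top-down: filter each child by its (already filtered) parent.
    for n in top_down:
        p = parent.get(n)
        if p is not None:
            rows[n] = semijoin(schema[n], rows[n], schema[p], rows[p])
    return rows

# Hypothetical two-node join tree with R(x, y) as the root and S(y, z) as its child.
schema = {"R": ["x", "y"], "S": ["y", "z"]}
rows = {"R": {(1, 10), (2, 20)}, "S": {(10, "a"), (30, "b")}}
reduced = full_reduce(schema, rows, parent={"S": "R"}, top_down=["R", "S"])
```

After both passes, every remaining tuple participates in at least one answer of the full join over the tree.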
For direct access, past work has identified the tractable queries, yet there is no guarantee on the order of the query answers.
Let be a CQ. If is free-connex, then direct access (in some order) is possible with preprocessing and delay. Otherwise, if it is also self-join-free, then direct access (in any order) is not possible with preprocessing and delay, assuming sparseBMM and Hyperclique.
The established direct access algorithms are allowed to internally choose any order, while in this paper, we receive a desired order as input. Even though these algorithms do not explicitly discuss the order of the answers, a closer look shows that they produce a lexicographic order. The algorithm of Carmeli et al. (Carmeli et al., 2020, Algorithm 3) assumes that a join tree is given with the CQ, and the variable order is imposed by the join tree. Specifically, it is the one achieved by a preorder depth-first traversal of the tree. The algorithm of Brault-Baron (Brault-Baron, 2013, Algorithm 4.3) assumes that an elimination order is given along with the CQ. The resulting lexicographic order is affected by that elimination order, but is not necessarily the same. Moreover, there exist orders (which we show in this paper to be tractable) that these algorithms cannot produce. For instance, these include lexicographic orders that interleave variables from different atoms, such as the order for the query of Section 1.
3. Direct Access by Lexicographic Orders
In this section, we answer the following question: for which underlying lexicographic orders can we achieve “tractable” direct access to ranked CQ answers, i.e. with quasilinear preprocessing and polylogarithmic time per answer?
Example 3.1 (No direct access).
Consider the lexicographic order for the query . Direct access to the query answers according to that order would allow us to “jump over” the values via binary search and essentially enumerate the answers to . However, we know that is not free-connex and that it is impossible to achieve enumeration with quasilinear preprocessing and polylogarithmic delay (if sparseBMM holds). Therefore, the bounds we are hoping for are out of reach for the given query and order. The core difficulty is that the joining variable appears after the other two in the lexicographic order.
We formalize this notion of “variable in the middle” in order to detect similar situations in more complex queries.
Definition 3.2 (Disruptive Trio).
Let be a CQ and a lexicographic order of its free variables. We say that three free variables are a disruptive trio in with respect to if and are not neighbors (i.e. they don’t appear together in an atom), is a neighbor of both and , and appears after and in .
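The definition translates directly into a brute-force test, sketched below. The function name `find_disruptive_trio`, the representation of atoms as sets of variable names, and the two-atom example query are assumptions made for illustration.

```python
from itertools import combinations

def find_disruptive_trio(atoms, order):
    """Return a disruptive trio (v1, v2, v3) for the given order, or None."""
    pos = {v: i for i, v in enumerate(order)}

    def neighbors(a, b):
        return any(a in atom and b in atom for atom in atoms)

    for v3 in order:
        earlier = [v for v in order if pos[v] < pos[v3]]
        for v1, v2 in combinations(earlier, 2):
            # v1 and v2 are not neighbors, but the later v3 neighbors both.
            if not neighbors(v1, v2) and neighbors(v1, v3) and neighbors(v2, v3):
                return (v1, v2, v3)
    return None

# Hypothetical path query with atoms over {x, y} and {y, z}.
atoms = [{"x", "y"}, {"y", "z"}]
trio = find_disruptive_trio(atoms, ["x", "z", "y"])
none = find_disruptive_trio(atoms, ["x", "y", "z"])
```

Since the query size is treated as a constant, the cubic number of candidate trios does not affect the data complexity.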
As it turns out, when considering free-connex and self-join-free CQs, the tractable CQs are precisely captured by this simple criterion. Regarding self-join-free CQs that are not free-connex, their known intractability of enumeration implies that direct access is also intractable. This leads to the following dichotomy:
Theorem 3.3.
Let be a CQ and be a lexicographic order.
If is free-connex and does not have a disruptive trio with respect to , then direct access by is possible with preprocessing and time per access.
Otherwise, if is also self-join-free, then direct access by is not possible with preprocessing and time per access, assuming sparseBMM and Hyperclique.
Remark 1.
On the positive side of Theorem 3.3, the preprocessing time is dominated by sorting the input relations, which we assume requires time. If we assume instead that sorting takes linear time (as assumed in some related work (Brault-Baron, 2013; Carmeli et al., 2020; Grandjean, 1996)), then the time required for preprocessing is only instead of .
In Section 3.1, we provide an algorithm for this problem for full acyclic CQs that have a particular join tree that we call layered. Then, we show how to find such a layered join tree whenever there is no disruptive trio in Section 3.2. In Section 3.3, we explain how to adapt our solution for CQs with projections, and in Section 3.4 we prove a lower bound which establishes that our algorithm applies to all cases where direct access is tractable.
3.1. Layer-Based Algorithm
Before we explain the algorithm, we first define one of its main components. A layered join tree is a join tree of an inclusive extension of a hypergraph, where each node belongs to a layer. The layer number matches the position in the lexicographic order of the last variable that the node contains. Intuitively, “peeling” off the outermost (largest) layers must result in a valid join tree (for a hypergraph with fewer variables).
Definition 3.4 (Layered Join Tree).
Let be a full acyclic CQ, and let be a lexicographic order. A layered join tree for with respect to is a join tree of an inclusive extension of where (1) every vertex is assigned to layer , (2) there is exactly one vertex for each layer, and (3) for all the induced subgraph with only the vertices that belong to the first layers is a tree.
Example 3.5.
Consider the CQ
and the lexicographic order . To support that order, we first take an inclusive extension of its hypergraph, shown in Figure 1(a). Notice that we added two hyperedges that are strictly contained in the existing ones. A layered join tree constructed from that hypergraph is depicted in Figure 1(b). There are four layers, one for each vertex of the join tree. The layer of the vertex containing is because appears after in the order and it is the third variable. If we remove the last layer, then we obtain a join tree for the induced hypergraph where the last variable is removed.
We now describe an algorithm that takes as an input a CQ , a lexicographic order , and a corresponding layered join tree and provides direct access to the query answers after a preprocessing phase. For preprocessing, we leverage a construction from Carmeli et al. (Carmeli et al., 2020, Algorithm 2) and apply it to our layered join tree. For completeness, we briefly explain how it works below. Subsequently, we describe the access phase that takes into account the layers of the tree to accommodate the provided lexicographic order. Thus, the way we access the structure is different from that of past work (Carmeli et al., 2020). This allows us to support lexicographic orders that were impossible for the existing algorithms (e.g. that of Example 3.5).
Preprocessing. The preprocessing phase (1) creates a relation for every vertex of the tree, (2) removes dangling tuples, (3) sorts the relations, (4) partitions the relations into buckets, and (5) uses dynamic programming on the tree to compute and store certain counts. After preprocessing, we are guaranteed that for all , the vertex of layer has a corresponding relation where each tuple participates in at least one query answer; this relation is partitioned into buckets by the assignment of the variables preceding . In each bucket, we sort the tuples lexicographically by . Each tuple is given a weight that indicates the number of different answers this tuple agrees with when only joining its subtree. The weight of each bucket is the sum of its tuple weights. We denote both by the function weight. Moreover, for every tuple , we compute the sum of weights of the preceding tuples in the bucket, denoted by start. We use end for the sum that corresponds to the tuple following in the same bucket; if is last, we set this to be the bucket weight. If we think of the query answers in the subtree sorted in the order of values, then start and end distribute the indices between and the bucket weight to tuples. The number of indices within the range of each tuple corresponds to its weight.
Example 3.6 (Continued).
The result of the preprocessing phase on an example database for our query is shown in Figure 2. Notice that has been split into two buckets according to the values of its parent , one for value and one for . For tuple , we have because this is the number of answers that agree on that value in its subtree: the left subtree has such answers which can be combined with any of the possible answers of the right subtree. The start index of tuple is the sum of the previous weights within the bucket: . Not shown in the figure is that every bucket stores the sum of weights it contains.
Access. The access phase works by going through the tree layer by layer. When resolving a layer , we select a tuple from its corresponding relation, which sets a value for the th variable in , and also determines a bucket for each child. Then, we erase the vertex of layer and its outgoing edges.
The access algorithm maintains a directed forest and an assignment to a prefix of the variables. Each tree in the forest represents the answers obtained by joining its relations. Each root contains a single bucket that agrees with the already assigned values, thus every answer agrees on the prefix. Due to the running intersection property, different trees cannot share unassigned variables. As a consequence, any combination of answers from different trees can be added to the prefix assignment to form an answer to . The answers obtained this way are exactly the answers to that agree with the already set assignment. Since we start with a layered join tree, we are guaranteed that at each step, the next layer (which corresponds to the variable following the prefix for which we have an assignment) appears as a root in the forest.
Recall that from the preprocessing phase, the weight of each root is the number of answers in its tree. When we are at layer , we have to take into account the weights of all the other roots in order to compute the number of query answers for a particular tuple. More specifically, the number of answers to containing the already selected attributes (smaller than ) and some value contained in a tuple is found by multiplying the tuple weight with the weights of all other roots. That is because the answers from all trees can be combined into a query answer. Let be the selected tuple when resolving the layer. The number of answers to that have a value of smaller than that of and a value of equal to that of for all is then:
where ranges over tuples preceding in its bucket. Denote by factor the product of all root weights. Then we can rewrite as:
Therefore, when resolving layer we select the last tuple such that the index we want to access is at least .
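The counting argument above can be written out as a short derivation. The symbols are chosen here for illustration and are assumptions: $t$ is the selected tuple, $t'$ ranges over the tuples preceding $t$ in its bucket, $r$ ranges over the other roots of the forest, and $B$ is the bucket of the layer being resolved. The second step uses only the definitions of start (the sum of the preceding tuple weights) and factor (the product of all root weights, including that of $B$).

```latex
\sum_{t'} \Bigl( \mathrm{weight}(t') \cdot \prod_{r} \mathrm{weight}(r) \Bigr)
  \;=\; \Bigl( \sum_{t'} \mathrm{weight}(t') \Bigr) \cdot \frac{\mathrm{factor}}{\mathrm{weight}(B)}
  \;=\; \mathrm{start}(t) \cdot \frac{\mathrm{factor}}{\mathrm{weight}(B)}
```

Under this reading, the tuple chosen at a layer is the last one whose value of $\mathrm{start}(t) \cdot \mathrm{factor} / \mathrm{weight}(B)$ does not exceed the index being accessed.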
Algorithm 1 summarizes the process we described where is the index to be accessed and is the number of variables. Iteration resolves layer . Pointers to the selected buckets from the roots are kept in a bucket array. The product of the weights of all roots is kept in a factor variable. In each iteration, the variable is updated to the index that should be accessed among the answers that agree with the already selected attribute values. Note that is always initialized when accessed since layer is guaranteed to be a child of a smaller layer.
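The preprocessing and access phases described above can be sketched end to end as follows. This is a simplified illustration under several assumptions, not a transcription of Algorithm 1: the class name `LayeredAccess` and its parameters are hypothetical, indices are 0-based, the layered join tree is given as a `parent` array over layers, and the caller supplies one relation per tree node (including the projections that form the inclusive extension).

```python
from bisect import bisect_right
from math import prod

class LayeredAccess:
    def __init__(self, order, node_vars, parent, rows):
        self.order = order
        f = len(order)
        self.children = [[] for _ in range(f)]
        for i in range(1, f):
            self.children[parent[i]].append(i)
        # Bucket key of layer i: the node's variables other than order[i].
        self.key_vars = [tuple(v for v in node_vars[i] if v != order[i])
                         for i in range(f)]
        self.buckets = [dict() for _ in range(f)]
        for i in reversed(range(f)):  # children have larger layers: done first
            grouped = {}
            for row in rows[i]:
                asg = dict(zip(node_vars[i], row))
                w = prod(self._bucket_weight(c, asg) for c in self.children[i])
                if w > 0:  # tuples with weight 0 are dangling and are dropped
                    key = tuple(asg[v] for v in self.key_vars[i])
                    grouped.setdefault(key, []).append((asg[order[i]], w))
            for key, items in grouped.items():
                items.sort()
                values, starts, total = [], [], 0
                for value, w in items:
                    values.append(value)
                    starts.append(total)  # sum of the preceding tuple weights
                    total += w
                self.buckets[i][key] = (values, starts, total)

    def _bucket_weight(self, c, asg):
        bucket = self.buckets[c].get(tuple(asg[v] for v in self.key_vars[c]))
        return bucket[2] if bucket else 0

    def count(self):
        return self._bucket_weight(0, {})

    def access(self, k):
        if not 0 <= k < self.count():
            raise IndexError(k)
        factor, asg = self.count(), {}
        for i, var in enumerate(self.order):
            key = tuple(asg[v] for v in self.key_vars[i])
            values, starts, total = self.buckets[i][key]
            factor //= total  # product of the weights of all other roots
            j = bisect_right(starts, k // factor) - 1  # binary search
            k -= starts[j] * factor
            asg[var] = values[j]
            for c in self.children[i]:  # children of the erased node become roots
                factor *= self._bucket_weight(c, asg)
        return tuple(asg[v] for v in self.order)

# Hypothetical example: a Cartesian product of R(a, c) and S(b, d) accessed in
# the interleaved order a, b, c, d.
R = {(1, 10), (1, 20), (2, 10)}
S = {(5, 7), (6, 8), (5, 9)}
da = LayeredAccess(
    order=["a", "b", "c", "d"],
    node_vars=[("a",), ("b",), ("a", "c"), ("b", "d")],
    parent=[None, 0, 0, 1],
    rows=[{(a,) for a, _ in R}, {(b,) for b, _ in S}, R, S],
)
expected = sorted((a, b, c, d) for a, c in R for b, d in S)
```

The example order interleaves variables from the two atoms; the sketch answers each access with one binary search per layer, and the comparison against the sorted materialized join is only a sanity check.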
Example 3.7 (Continued).
We demonstrate how the access algorithm works for index . When resolving , the tuple is chosen since ; then, the single bucket in and the bucket containing in are selected. The next iteration resolves . When it reaches line 7, and (since this is the bucket weight of ). As , the tuple is selected. Next, is resolved, which we depict in Figure 3. The current index is . The weights of the other roots (only here) gives us . To make our choice in , we multiply the weights of the tuples by . Then, we find that the index we are looking for falls into the range of because . Next, is resolved, , and . As , the tuple is selected. Overall, answer number (the answer) is .
Lemma 3.8.
Let be a full acyclic CQ, and be a lexicographic order. If there is a layered join tree for with respect to , then direct access is possible with preprocessing and time per access.
The correctness of Algorithm 1 follows from the discussion above. For the time complexity, note that it contains a constant number of operations (assuming the number of attributes is fixed). Line 7 can be done in logarithmic time using binary search, while all other operations only require constant time in the RAM model. Thus, we obtain direct access in logarithmic time per answer after the quasilinear preprocessing (dominated by sorting). ∎
With minor modifications, the algorithm we presented in this section can be used for the (reverse) task of inverted access. We describe this variation in Appendix B.
3.2. Finding Layered Join Trees
We now have an algorithm that can be applied whenever we have a layered join tree. We next show that the existence of such a join tree relies on the disruptive trio condition we introduced earlier. In particular, if no disruptive trio exists, we are able to construct a layered join tree for full acyclic CQs.
Lemma 3.9.
Let be a full acyclic CQ, and be a lexicographic order. If does not have a disruptive trio with respect to , then there is a layered join tree for with respect to .
We show by induction on that there exists a layered join tree for the hypergraph containing the hyperedges with respect to the prefix of containing its first elements. The induction base is the tree that contains the vertex and no edges.
In the inductive step, we assume a layered join tree with layers for , and we build a layer on top of it. Denote by the sets of that contain (these are the sets that need to be included in the new layer). First note that is acyclic. Indeed, by the running intersection property, the join tree for has a subtree with all the vertices that contain . This subtree forms a join tree for after projecting out all variables that occur after in the ordering.
We next claim that some set in contains all the others; that is, there exists such that for all , we have that . Consider a join-tree for . Every variable of defines a subtree induced by the vertices that contain this variable. If two variables are neighbors, their subtrees share a vertex. It is known that every collection of subtrees of a tree satisfies the Helly property (Golumbic, 1980): if every two subtrees share a vertex, then some vertex is shared by all subtrees. In particular, since is acyclic, if every two variables of are neighbors, then some element of contains all variables that appear in (elements of) . Thus, if, by way of contradiction, there is no such , there exist two non-neighboring variables and that appear in (elements of) . Since appears in all elements of , this means that there exist with and . Since and are not neighbors, these three variables are a disruptive trio: and are both neighbors of the later variable . The existence of a disruptive trio contradicts the assumption of the lemma we are proving, and so we conclude that there is such that for all , we have that .
With at hand, we can now add the additional layer to the tree given by the inductive hypothesis. Insert with an edge to a vertex containing (which exists by the inductive hypothesis). This results in the join tree we need: (1) the hyperedges are all contained in vertices, since the ones that do not appear in the tree from the inductive hypothesis are contained in the new vertex; (2) it is a tree since we add one leaf to an existing tree; and (3) the running intersection property holds since the added vertex is connected to all of its variables that already appear in the tree. ∎
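The premise of Lemma 3.9 is easy to test by brute force. The following sketch assumes the query is given as a list of hyperedges (sets of variable names) and the order as a list of variables; the function name `find_disruptive_trio` is hypothetical. It looks for two non-neighboring variables that are both neighbors of a later variable.

```python
from itertools import combinations


def find_disruptive_trio(hyperedges, order):
    """Return (u, v, later) if it is a disruptive trio w.r.t. order, else None."""
    edges = [frozenset(e) for e in hyperedges]

    def neighbors(u, v):
        # Two variables are neighbors if some hyperedge contains both.
        return any(u in e and v in e for e in edges)

    for j, later in enumerate(order):
        # Earlier variables (in the order) that are neighbors of the later one.
        earlier = [v for v in order[:j] if neighbors(v, later)]
        for u, v in combinations(earlier, 2):
            if not neighbors(u, v):
                return (u, v, later)
    return None
```

On a hypothetical 2-path with hyperedges {x, y} and {y, z}, the order x, z, y yields the trio (x, z, y), while the order x, y, z yields none.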
3.3. Supporting Projection
Next, we show how to support CQs that have projections. A free-connex CQ can be efficiently reduced to a full acyclic CQ using Proposition 2.1. We next show that the resulting CQ contains no disruptive trio if the original CQ does not.
Lemma 3.10.
Given a database instance and a free-connex CQ with no disruptive trio, an equivalent pair of database instance and full acyclic CQ with no disruptive trio can be computed in linear time, and the new CQ does not depend on the database instance.
Lemma 3.11.
Let be a CQ, and be a lexicographic order. If does not have a disruptive trio with respect to , direct access by is possible with preprocessing and access time.
3.4. Lower Bound for Conjunctive Queries
Next, we show that our algorithm supports all feasible cases (for self-join-free CQs); we prove that all unsupported cases are intractable.
Lemma 3.12.
Let be a self-join-free CQ, and be a lexicographic order. If has a disruptive trio with respect to , then direct access by is not possible with preprocessing and time per access, assuming sparseBMM.
Lemma 3.12 is a special case of the more general Lemma 4.5 that we prove later when we discuss partial lexicographic orders. Since has a disruptive trio, two non-neighboring variables , are both neighbors of a later variable in . Thus, is a chordless path, and Lemma 4.5 implies the correctness of Lemma 3.12.
By combining Lemma 3.11 and Lemma 3.12 together with the known hardness results for non-free-connex CQs (Theorem 2.2), we prove the dichotomy given in Theorem 3.3: direct access by a lexicographic order for a self-join-free CQ is possible with quasilinear preprocessing and polylogarithmic time per answer if and only if the query is free-connex and does not have a disruptive trio with respect to the required order.
4. Partial Lexicographic Orders
We now investigate the case where the desired lexicographic order is partial, i.e., it contains only some of the free variables. This means that there is no particular order requirement for the rest of the variables. One way to achieve direct access to a partial order is to complete it into a full lexicographic order and then leverage the results of the previous section. If such a completion is impossible, we have to consider cases where tie-breaking between the non-ordered variables is done in an arbitrary way. However, we will show in this section that the tractable partial orders are precisely those that can be completed into a full lexicographic order. In particular, we will prove the following dichotomy, which also gives an easy-to-detect criterion for the tractability of direct access.
Theorem 4.1.
Let be a CQ and be a partial lexicographic order.
If is free-connex and -connex and does not have a disruptive trio with respect to , then direct access by is possible with preprocessing and time per access.
Otherwise, if is also self-join-free, then direct access by is not possible with preprocessing time and time per access, assuming the sparseBMM and Hyperclique hypotheses.
Example 4.2.
Consider the CQ . If the free variables are exactly and , then the query is not free-connex, and so it is intractable. Next assume that all variables are free. If , then the query is not -connex, and so it is intractable. If , then is a disruptive trio, thus the query is intractable. However, if or , then the query is free-connex, -connex and has no disruptive trio, so it is tractable.
4.1. Tractable Cases
For the positive side, we can solve our problem efficiently if the CQ is free-connex and there is a completion of the lexicographic order to all free variables with no disruptive trio. Lemma 4.4 identifies these cases with a connexity criterion. To prove it, we first need a way to combine two different connexity properties. The proof of the following proposition uses ideas from a proof of the characterization of free-connex CQs in terms of the acyclicity of the hypergraph obtained by including a hyperedge with the free variables (Berkholz et al., 2020).
Proposition 4.3.
If a CQ is both -connex and -connex where , then there exists a join tree of an inclusive extension of with a subtree containing exactly the variables and a subtree of containing exactly the variables .
We are now in position to show the following:
Lemma 4.4.
Let be a CQ and be a partial lexicographic order. If is free-connex and -connex and does not have a disruptive trio with respect to , then there is an ordering of that starts with such that has no disruptive trio with respect to .
Take a tree for given by Proposition 4.3 with a subtree containing exactly the free variables, and a subtree of containing exactly the variables . We assume that contains at least one vertex; otherwise (this can only happen in case is empty), we can introduce a vertex with no variables to all of , and and connect it to any one vertex of . We describe a process of extending while traversing . Consider the vertices of as handled, and initialize . Then, repeatedly handle a neighbor of a handled vertex until all vertices are handled. When handling a vertex, append to all of its variables that are not already there. We prove by induction that has no disruptive trio w.r.t. any prefix of . The base case is guaranteed by the premises of this lemma since (hence each of its prefixes) has no disruptive trio.
Let be a new variable added to a prefix of . Let be the subtree of with the handled vertices when adding to and let be the vertex being handled. Note that, since is being added, but is not in any vertex of .
We first claim that every neighbor of with is in . Our arguments are illustrated in Figure 4. Since and are neighbors, they appear together in a node outside of . Let be a node in containing (such a node exists since appears before in ). Consider the path from to . Let be the last node of this path not in . If , the path between and goes only through vertices of (except for the end-points). Thus, concatenating the path from to with the path from to results in a simple path. By the running intersection property, all vertices on this path contain . In particular, the vertex following contains in contradiction to the fact that does not appear in . Therefore, . By the running intersection property, since is on the path between and , we have that contains .
We now prove the induction step. We know by the inductive hypothesis that have no disruptive trio. Assume by way of contradiction that appending introduces a disruptive trio. Then, there are two variables with such that are neighbors, are neighbors, but are not neighbors. As we proved, since and are neighbors of preceding it, we have that all three of them appear in the handled vertex . This is a contradiction to the fact that and are not neighbors. ∎
4.2. Intractable Cases
For the negative part, we prove a generalization of Lemma 3.12. For that, we use the hardness of Boolean matrix multiplication with a construction that is similar to that of Bagan et al. (Bagan et al., 2007) for the hardness of enumeration on acyclic CQs that are not free-connex.
Lemma 4.5.
Let be a self-join-free CQ and be a partial lexicographic order. If there is a chordless path such that and appear in and no variable appears in before any of them, then direct access by is not possible with preprocessing and time per access, assuming sparseBMM.
Let . We encode Boolean matrix multiplication with such that, in the answers to , the assignments to and form the answers to the given matrix multiplication instance, the assignments to variables of can be skipped using binary search (given direct access), and all other variables are assigned a constant value .
Let and be Boolean matrices represented as binary relations. That is, , and means that the entry in the th row and th column is . We define a partition of the atoms of where is the set of all atoms that contain , and holds all other atoms. Note that no atom in contains (since and are not neighbors) and no atom in contains . Given three values , we define a function as follows:
For a vector, we denote by the vector obtained by element-wise application of . We define a database instance over as follows: For every atom , if we set , and if we set . Note that we do not define relations twice since and are disjoint and is self-join-free.
Since is connected, our construction guarantees that in every answer to all variables are assigned the same value. Since and are neighbors, we are guaranteed that there is an atom that contains them both in . The same holds for and in . Therefore, the answers to describe the matrix multiplication. Consider a query answer . We have that , for all and for some and . All other variables are mapped to the constant . Note that the answers projected to and are the answers to the matrix multiplication problem.
Assume, by way of contradiction, that direct access to the answers of by a lexicographic order in which no variable of occurs before any of and is possible with preprocessing and time per access. We show how to find all the unique values of and in the answers efficiently. Perform the following starting with and until there are no more answers. Access answer number and print its assignment to . Then, set to be the index of the next answer that assigns different values to and repeat. Finding the next index can be done using binary search with a logarithmic number of direct accesses, each taking polylogarithmic time. Overall, we solve Boolean matrix multiplication in time, contradicting sparseBMM. ∎
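The skipping procedure in this proof can be sketched as follows. The sketch assumes a hypothetical oracle `access(k)` that returns the answer at index `k` as a tuple in lexicographic order, or `None` when `k` is out of bounds, and it assumes that the two variables of interest occupy the first `prefix_len` positions of each answer. Since no upper bound on the number of answers is assumed, an exponential search precedes the binary search.

```python
def distinct_prefixes(access, prefix_len):
    """List the distinct prefixes of the answers using few direct accesses."""
    out = []
    k = 0
    while (answer := access(k)) is not None:
        prefix = answer[:prefix_len]
        out.append(prefix)
        # Exponential search: find a step where the prefix changes (or we run out).
        step = 1
        while (a := access(k + step)) is not None and a[:prefix_len] == prefix:
            step *= 2
        # Invariant: access(lo) has the current prefix, access(hi) does not.
        lo, hi = k + step // 2, k + step
        while hi - lo > 1:
            mid = (lo + hi) // 2
            a = access(mid)
            if a is not None and a[:prefix_len] == prefix:
                lo = mid
            else:
                hi = mid
        k = hi  # first index with a different prefix
    return out
```

Each distinct prefix costs a logarithmic number of calls to `access`, which matches the accounting in the proof.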
The negative part of the dichotomy has three cases. First, if is not free-connex, then we know that direct access by any order is intractable according to Theorem 2.2. Next, if has a disruptive trio with respect to , then is a chordless path satisfying the conditions of Lemma 4.5. The last case is that is not -connex. In this case, there is an -path, and this path satisfies the conditions of Lemma 4.5. Therefore, we obtain that the last two cases are hard too, assuming the sparseBMM hypothesis.
5. Direct Access by Sum of Weights
We now consider direct access for the more general orderings based on (the sum of attribute weights). As with lexicographic orderings, we are able to exhaustively characterize the class of self-join-free CQs, even those with projections, in terms of tractability. We will show that direct access for is significantly harder and tractable only for a small class of queries.
5.1. Overview of Results
The complexity of direct access depends on the ability of the query to express certain combinations of weights. If the query contains independent free variables, then its answers may contain all possible combinations of their corresponding attribute weights. Our characterization is based on this independence measure.
Definition 5.1 (Independent free variables).
A set of vertices of a hypergraph is called independent iff no pair of these vertices appears in the same hyperedge, i.e., for all . For a CQ , we denote by the maximum number of variables among that are independent in .
Intuitively, we can construct a database instance where each independent free variable is assigned to different domain values with different weights. By appropriately choosing the assignment of the other variables, all possible combinations of these weights will appear in the query answers.
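For small queries, the measure of Definition 5.1 can be computed by exhaustive search. The sketch below is only an illustration: the function name `max_independent_free` and the input encoding (hyperedges as sets of variable names) are assumptions, and the running time is exponential in the number of free variables.

```python
from itertools import combinations


def max_independent_free(hyperedges, free_vars):
    """Return (size, witness) for a largest independent set of free variables."""
    edges = [frozenset(e) for e in hyperedges]

    def independent(vs):
        # No pair of the chosen variables may appear in the same hyperedge.
        return all(
            not (u in e and v in e)
            for u, v in combinations(vs, 2)
            for e in edges
        )

    free = sorted(free_vars)
    for size in range(len(free), 0, -1):
        for vs in combinations(free, size):
            if independent(vs):
                return size, set(vs)
    return 0, set()
```

On a hypothetical 2-path with hyperedges {x, y} and {y, z} and all variables free, the result is 2 with witness {x, z}.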
Example 5.2.
For , we have , namely for variables . If the database instance is , , , then the query answers are .
The main result of this section is a dichotomy for direct access by ordering:
Theorem 5.3.
Let be a CQ and be a weight function.
If is acyclic and , then direct access by is possible with preprocessing and time per answer.
Otherwise, if is also self-join-free, direct access by is not possible with preprocessing and time per answer, assuming 3sum and Hyperclique.
For the hardness results, we rely mainly on the 3sum hypothesis. To more easily relate our direct-access problem to 3sum, which asks for the existence of a particular sum of weights, it is useful to define an auxiliary problem:
Definition 5.4 (weight lookup).
Given a CQ , weight function , and , weight lookup by returns the first position of a query answer of weight in the sorted array of answers.
The following lemma associates direct access with weight lookup via binary search on the query answers:
Lemma 5.5.
If the query answer according to some ranking function can be directly accessed in time for every , then weight lookup can be performed in .
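The binary search behind Lemma 5.5 can be sketched as follows. The oracle `access(k)`, the answer count `n`, and the function `weight` are assumptions made for illustration; in particular, the sketch assumes that the number of answers is known, for example from the preprocessing phase.

```python
def weight_lookup(access, n, weight, target):
    """First position of an answer of weight `target`, or None if there is none.

    access(k) -- hypothetical oracle returning the k-th answer by weight
    n         -- number of answers (assumed known)
    weight    -- maps an answer to its weight
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if weight(access(mid)) < target:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first position whose weight is at least target.
    if lo < n and weight(access(lo)) == target:
        return lo
    return None
```

The loop performs a logarithmic number of direct accesses, so the lookup is slower than a single access by a logarithmic factor only.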
Lemma 5.5 implies that whenever we are able to support efficient direct access on the sorted array of query answers, weight lookup increases time complexity only by a logarithmic factor, i.e., it is also efficient. The main idea behind our reductions is that via weight lookups on a CQ with an appropriately constructed database, we can decide the existence of a zero-sum triplet over three distinct sets of numbers, thus hardness follows from 3sum. First, we consider the case of three independent variables that are free. These three variables are able to simulate a three-way Cartesian product in the query answers. This allows us to directly encode the 3sum triplets using attribute weights, obtaining a lower bound for direct access.
Lemma 5.6.
If a CQ is self-join-free and , then direct access by is not possible with preprocessing and time per access for any assuming 3sum.
Assume for the sake of contradiction that the lemma does not hold. We show that this would imply an -time algorithm for 3sum. To this end, consider an instance of 3sum with integer sets , , and of size , given as arrays. We reduce 3sum to direct access over the appropriate query and input instance by using a construction similar to Example 5.2. Let , , and be free and independent variables of , which exist because . This also implies that contains at least 3 atoms , , and , with variable , , and , respectively. Note that variables other than , , and may exist in these or other atoms. We create a database instance where , , and take on each value in , while all the other attributes have value . This ensures that has exactly answers, one for each combination in , no matter the number of other atoms and other attributes in any of the atoms (including in , , and ). To see this, note that since , , and are independent, they never appear together in a relation. Thus each relation either contains tuple (if neither , , nor is present) or tuples (if one of , , or is present). No matter on which attributes these relations are joined (including Cartesian products), the output result is always the “same” set of size , where is the number of free variables other than , , and . (We use the term “same” loosely for the sake of simplicity. Clearly, for different values of the query-result schema changes, e.g., consider Example 5.2 with removed from the head. However, this only affects the number of additional 0s in each of the answer tuples, therefore it does not impact our construction.)
For the reduction from 3sum, weights are assigned to the attribute values as , , , , and for all other attributes . By our weight assignment, the weights of the answers are , , and thus in one-to-one correspondence with the possible value combinations in the 3sum problem. We first perform the preprocessing for direct access in , which enables direct access to any position in the sorted array of query answers in . By Lemma 5.5, weight lookup for a query result with zero weight is possible in . Thus, we answer the original 3sum problem in for any , violating the 3sum hypothesis. ∎
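The correspondence between answers and 3sum triplets can be illustrated with a small sketch. It rests on two assumptions that are made here for illustration only: each of the three independent variables carries its own value as its weight, and all other attributes have weight zero, so an answer's weight is the sum of its triplet. The sorted list stands in for the direct-access structure and is materialized in full, which the actual reduction avoids.

```python
from bisect import bisect_left
from itertools import product


def three_sum_via_lookup(A, B, C):
    """Return a triplet (a, b, c) with a + b + c == 0, or None.

    Illustration only: the answers are materialized and sorted by weight,
    whereas the reduction relies on direct access without materialization.
    """
    answers = sorted(product(A, B, C), key=sum)  # one answer per triplet
    weights = [sum(t) for t in answers]          # answer weight = a + b + c
    pos = bisect_left(weights, 0)                # weight lookup for weight 0
    if pos < len(weights) and weights[pos] == 0:
        return answers[pos]
    return None
```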
For queries that do not have three independent free variables we need a different construction. We show next that two variables are sufficient to encode partial 3sum solutions (i.e., pairs of elements), enabling a full solution of 3sum via weight lookups. This yields a weaker lower bound than Lemma 5.6, but is still sufficient to prove intractability according to our yardstick.
Lemma 5.7.
If a CQ is self-join-free and , then direct access by is not possible with preprocessing and time per access for any assuming 3sum.
A special case of Lemma 5.7 is closely related to the problem of selection in X+Y (Johnson and Mizoguchi, 1978), where we want to access the smallest sum of pairs between two sets and . This is equivalent to accessing the answers to by ordering. It has been shown that if and are given sorted, then selection is possible even in linear time (Frederickson and Johnson, 1984; Mirzaian and Arjomandi, 1985). Thus, for direct access by is possible with preprocessing (where we simply sort the input relations) and per access.
Next, we show that the remaining acyclic CQs (those with ) are tractable. For these queries, a single relation contains all the answers, and so direct access can be supported by simply sorting that relation.
Lemma 5.8.
If a CQ is acyclic and , then direct access by is possible with preprocessing and time per answer.
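The idea of sorting the single relation that holds all answers can be sketched as follows. The sketch assumes that the relation has already been reduced so that every tuple participates in an answer (for example, by a semi-join reduction), that `free_positions` gives the columns of the free variables, and that `weight(i, v)` returns the weight of value `v` in column `i`; all of these names are hypothetical.

```python
def preprocess(relation, free_positions, weight):
    """Project onto the free variables, deduplicate, and sort by sum of weights."""
    answers = {tuple(t[i] for i in free_positions) for t in relation}

    def total(answer):
        return sum(weight(i, v) for i, v in zip(free_positions, answer))

    # Ties are broken by the answer itself to make the order deterministic.
    return sorted(answers, key=lambda a: (total(a), a))


def access(sorted_answers, k):
    """Return the k-th answer in the order, or None if k is out of bounds."""
    return sorted_answers[k] if 0 <= k < len(sorted_answers) else None
```

Preprocessing is dominated by the sort, and each access is a constant-time array lookup.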
Combining these lemmas with the hardness of Boolean self-join-free cyclic CQs based on Hyperclique gives a proof of Theorem 5.3.
6. Selection by Sum of Weights
Given that direct access by order with quasilinear preprocessing and polylogarithmic delay is possible only in very few cases, we next investigate the tractability of a simpler version of the problem: When is selection, i.e., direct access to a single query answer, possible in quasilinear time? We further simplify the problem by not allowing any projections in the query, i.e., we limit our attention to full CQs. Our main result is a dichotomy theorem that covers all full self-join-free CQs. We show that the simplifications move only a narrow set of queries to the tractable side. For example, the 2-path query is tractable for selection (single direct access), even though it is not for direct access.
6.1. Overview of Results
We first introduce necessary terminology. For a CQ with hypergraph , the number of maximal hyperedges w.r.t. containment is , i.e., . An atom is absorbed by an atom if . A query is a contraction of if every atom of appears in , and all the rest of the atoms of are absorbed by some atom of . is a maximal contraction of if it is a contraction and there is no that is a contraction of except itself. It is easy to see that the number of atoms of is .
Example 6.1.
Consider . Here, is absorbed by and , and the latter two absorb each other. There are two maximal contractions that we can obtain from : either or . The number of maximal hyperedges of is .
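The notions of absorption and maximal hyperedges can be made concrete with a short sketch. It assumes that a query is given as a mapping from atom names to variable sets and that an atom is absorbed by another when its variable set is contained in the other's; the function names and the sample query in the usage note are hypothetical.

```python
def maximal_hyperedges(atoms):
    """Distinct variable sets that are maximal w.r.t. containment."""
    sets = {frozenset(vs) for vs in atoms.values()}
    return {s for s in sets if not any(s < t for t in sets)}


def one_contraction(atoms):
    """Keep one atom per maximal hyperedge; every other atom is absorbed."""
    maximal = maximal_hyperedges(atoms)
    kept, seen = {}, set()
    for name, vs in atoms.items():
        s = frozenset(vs)
        if s in maximal and s not in seen:
            kept[name] = s
            seen.add(s)
    return kept
```

For a hypothetical query with atoms R(x, y), S(x, y), T(x), and U(y, z), there are two maximal hyperedges, and the sketch keeps R and U; keeping S instead of R would give the other contraction of the same size.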
We summarize the results of this section in the following theorem, which characterizes the class of full CQs based on