Enumeration Complexity of Conjunctive Queries with Functional Dependencies

12/21/2017
by   Nofar Carmeli, et al.
Technion

We study the complexity of enumerating the answers of Conjunctive Queries (CQs) in the presence of Functional Dependencies (FDs). Our focus is on the ability to list output tuples with a constant delay in between, following a linear-time preprocessing. A known dichotomy classifies the acyclic self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. That is, some queries that are classified as hard are in fact tractable if dependencies are accounted for. We establish a generalization of the dichotomy to accommodate FDs; hence, our classification determines which combination of a CQ and a set of FDs admits constant-delay enumeration with a linear-time preprocessing. In addition, we generalize a hardness result for cyclic CQs to accommodate a common type of FDs. Further conclusions of our development include a dichotomy for enumeration with linear delay, and a dichotomy for CQs with disequalities. Finally, we show that all our results apply to the known class of "cardinality dependencies" that generalize FDs (e.g., by stating an upper bound on the number of genres per movie, or friends per person).

1 Introduction

When evaluating a non-boolean Conjunctive Query (CQ) over a database, the number of results can be huge. Since this number may be larger than the size of the database itself, we need to use specific measures of enumeration complexity to describe the hardness of such a problem. In this perspective, the best we can hope for is to constantly output results, in such a way that the delay between them is unaffected by the size of the database instance. For this to be possible, we need to allow a precomputation phase before printing the first result, as linear time preprocessing is necessary to read the input instance.

A known dichotomy determines when the answers to self-join-free acyclic CQs can be enumerated with constant delay after linear time preprocessing [3]. This class of enumeration problems, denoted by $\mathsf{DelayC_{lin}}$, can be regarded as the most efficient class of nontrivial enumeration problems and therefore current work on query enumeration has focused on this class [9, 14, 5]. Bagan et al. [3] show that a subclass of acyclic queries, called free-connex, are exactly those that are enumerable in $\mathsf{DelayC_{lin}}$, under the common assumption that boolean matrix multiplication cannot be solved in quadratic time. An acyclic query is called free-connex if the query remains acyclic when treating the head of the query as an additional atom. This and all other results in this paper hold under the RAM model [15].

The above-mentioned dichotomy only holds when applied to databases with no additional assumptions, but oftentimes this is not the case. In practice, there is usually a connection between different attributes, and Functional Dependencies (FDs) and Cardinality Dependencies (CDs) are widely used to model situations where some attributes imply others. As the following example shows, these constraints also have an immediate effect on the complexity of enumerating answers for queries over such a schema.

Example 1. For a list of actors and the production companies they work with, we have the query: $Q(\mathit{actor}, \mathit{production}) \leftarrow \mathit{Cast}(\mathit{movie}, \mathit{actor}), \mathit{Release}(\mathit{movie}, \mathit{production})$. At first glance, it appears as though this query is not in $\mathsf{DelayC_{lin}}$, as it is acyclic but not free-connex. Nevertheless, if we take the fact that a movie has only one production company into account, we have the FD $\mathit{Release}\colon \mathit{movie} \rightarrow \mathit{production}$, and the enumeration problem becomes easy: we only need to iterate over all tuples of Cast and replace the $\mathit{movie}$ value with the single $\mathit{production}$ value that the relation Release assigns to it. This can be done in linear time by first sorting (in linear time [10]) both relations according to $\mathit{movie}$. ∎
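The idea of Example 1 can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the function name and the tuple layouts are assumptions made for illustration, a hash map stands in for the linear-time sort, and duplicates (an actor appearing in two movies of the same company) are removed during preprocessing so that each answer is emitted once.

```python
from typing import Iterable, Iterator


def enumerate_actor_production(
    cast: Iterable[tuple[str, str]],     # assumed layout: (movie, actor)
    release: Iterable[tuple[str, str]],  # assumed layout: (movie, production)
) -> Iterator[tuple[str, str]]:
    """Enumerate Q(actor, production) <- Cast(movie, actor), Release(movie, production)."""
    # Preprocessing: the FD movie -> production makes this lookup single-valued.
    production_of: dict[str, str] = {}
    for movie, production in release:
        if production_of.setdefault(movie, production) != production:
            raise ValueError(f"FD movie -> production violated for {movie!r}")

    # Replace the movie value of each Cast tuple by its production; movies that
    # are missing from Release contribute no answer. dict.fromkeys removes
    # duplicate (actor, production) pairs while keeping first-seen order.
    answers = dict.fromkeys(
        (actor, production_of[movie])
        for movie, actor in cast
        if movie in production_of
    )

    # Enumeration phase: one stored answer per step.
    yield from answers
```

For instance, calling `list(enumerate_actor_production(cast, release))` on small lists of tuples returns each (actor, production) pair exactly once.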

Example 1 shows that the dichotomy by Bagan et al. [3] does not hold in the presence of FDs. In fact, we believe that dependencies between attributes are so common in real life, that ignoring them in such dichotomies can lead to missing a significant portion of the tractable cases. Therefore, to get a realistic picture of the enumeration complexity of CQs, we have to take dependencies into account. The goal of this work is to generalize the dichotomy to fully accommodate FDs.

Towards this goal, we introduce an extension of a query $Q$ according to the FDs. The extension is called the FD-extended query, and denoted $Q^+$. In this extension, each atom, as well as the head of the query, contains all variables that can be implied by its variables according to some FD. This way, instead of classifying every combination of CQ and FDs directly, we encode the dependencies within the extended query, and use the classification of $Q^+$ to gain insight regarding $Q$. This approach draws inspiration from the proof of a dichotomy in the complexity of deletion propagation, in the presence of FDs [11]. However, the problem and consequently the proof techniques are fundamentally different.

The FD-extension is defined in such a way that if $Q$ is satisfied by an assignment, then the same assignment also satisfies the extension $Q^+$, as the underlying instance is bound by the FDs. In fact, we can show that enumerating the solutions of $Q$ under FDs can be reduced to enumerating the solutions of $Q^+$. Therefore, tractability of $Q^+$ ensures that $Q$ can be efficiently solved as well. By using the positive result in the known dichotomy, $Q^+$ is tractable w.r.t. enumeration if it is free-connex. Moreover, it can be shown that the structural restrictions of acyclicity and free-connexity are closed under taking FD-extensions. Hence, the class of all queries $Q$ such that $Q^+$ is free-connex is an extension of the class of free-connex queries, and this extension is in fact proper. We denote the classes of queries $Q$ such that $Q^+$ is acyclic or free-connex as FD-acyclic and FD-free-connex, respectively.

To reach a dichotomy, we now need to answer the following question: Is it possible that $Q$ can be enumerated efficiently even if $Q^+$ is not free-connex? To show that an enumeration problem is not within a given class, enumeration complexity has few tools to offer. One such tool is a notion of completeness for enumeration problems [8]. However, this notion focuses on problems with a complexity corresponding to higher classes of the polynomial hierarchy. So in order to deal with this problem, Bagan et al. [3] reduced the matrix multiplication problem to enumerating the answers to any query that is acyclic but not free-connex. This reduction fails, however, when dependencies are imposed on the data, as the constructed database instance does not necessarily satisfy the underlying dependencies.

As it turns out, however, the structure of the FD-extended query allows us to extend this reduction to our setting. By carefully expanding the reduced instance such that on the one hand, the dependencies hold and on the other hand, the reduction can still be performed within linear time, we establish a dichotomy. That is, we show that the tractability of enumerating the answers of a self-join-free query $Q$ in the presence of FDs is exactly characterized by the structure of $Q^+$: Given an FD-acyclic query $Q$, we can enumerate the answers to $Q$ within the class $\mathsf{DelayC_{lin}}$ iff $Q$ is FD-free-connex.

The resulting extended dichotomy, as well as the original one, brings insight to the case of acyclic queries. Concerning unrestricted CQs, providing even a first solution of a query in linear time is impossible in general. This is due to the fact that the parameterized complexity of answering boolean CQs, taking the query size as the parameter, is $\mathsf{W[1]}$-hard [13]. This does not imply, however, that there are no cyclic queries with the corresponding enumeration problems in $\mathsf{DelayC_{lin}}$. The fact that no such queries exist requires an additional proof, which was presented by Brault-Baron [6]. This result holds under a generalization of the triangle finding problem, which is considered not to be solvable within linear time [16]. As before, this proof no longer applies in the presence of FDs. Moreover, it is possible for $Q$ to be cyclic and $Q^+$ acyclic. In fact, $Q^+$ may even be free-connex, and therefore tractable in $\mathsf{DelayC_{lin}}$. We show that, under the same assumptions used by Brault-Baron [6], the evaluation problem for a self-join-free CQ $Q$ in the presence of unary FDs where $Q^+$ is cyclic cannot be solved in linear time. As linear time preprocessing is not enough to achieve the first result, a consequence is that enumeration within $\mathsf{DelayC_{lin}}$ is impossible in that case. This covers all types of CQs and shows a full dichotomy, at least for the case of unary FDs.

The results we present here are not limited to FDs. CDs (Cardinality Dependencies) [7, 2] are a generalization of FDs, denoted $(R_i\colon A \rightarrow B,\, c)$. Here, the right-hand side does not have to be unique for every assignment to the left-hand side, but there can be at most $c$ different values to the variables of $B$ for every value of the variables of $A$. FDs are in fact a special case of CDs where $c = 1$. Constraints of that form appear naturally in many applications. For example: a movie has only a handful of directors, there are at most 200 countries, and a person is typically limited to at most 5000 friends in (some) social networks. We show that all results described in this paper also apply to CDs. Moreover, we show how our results can be easily used to yield additional results, such as a dichotomy for CQs with disequalities, and a dichotomy to evaluate CQs with linear delay.

Contributions.

Our main contributions are as follows.

  • We extend the class of queries that can be evaluated in $\mathsf{DelayC_{lin}}$ by incorporating the FDs. This extension is the class of FD-free-connex CQs.

  • We establish a dichotomy for the enumeration complexity of self-join-free FD-acyclic CQs. Consequently, we get a dichotomy for self-join-free acyclic CQs under FDs.

  • We show a lower bound for FD-cyclic CQs. In particular, we get a dichotomy for all self-join-free CQs in the presence of unary FDs.

  • We extend our results to CDs.

This work is organized as follows: In Section 2 we provide definitions and state results that we will use. Section 3 introduces the notion of FD-extended queries and establishes the equivalence between a query and its FD-extension. The generalized version of the dichotomy is shown in Section 4. In Section 5, a lower bound for cyclic queries under unary FDs is shown, and Section 6 shows that all results from the previous sections extend to CDs. Concluding remarks are given in Section 7. Full proofs for all of our results are given in the appendix.

2 Preliminaries

In this section we provide preliminary definitions as well as state results that we will use throughout this paper.

Schemas and Functional Dependencies.

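A schema $\mathcal{S}$ is a pair $(\mathcal{R}, \Delta)$ where $\mathcal{R}$ is a finite set of relational symbols and $\Delta$ is a set of Functional Dependencies (FDs). We denote the arity of a relational symbol $R_i$ as $\mathrm{arity}(R_i)$. An FD $\delta \in \Delta$ has the form $R_i\colon A \rightarrow B$, where $R_i \in \mathcal{R}$ and $A, B$ are non-empty with $A, B \subseteq \{1, \ldots, \mathrm{arity}(R_i)\}$.

Let $\mathit{dom}$ be a finite set of constants. A database $I$ over schema $\mathcal{S}$ is called an instance of $\mathcal{S}$, and it consists of a finite relation $R_i^I \subseteq \mathit{dom}^{\mathrm{arity}(R_i)}$ for every relational symbol $R_i \in \mathcal{R}$, such that all FDs in $\Delta$ are satisfied. An FD $\delta = R_i\colon A \rightarrow B$ is said to be satisfied if, for all tuples $u, v \in R_i^I$ that are equal on the indices of $A$, $u$ and $v$ are equal on the indices of $B$. Here we assume that all FDs are of the form $R_i\colon A \rightarrow b$, where $b \in \{1, \ldots, \mathrm{arity}(R_i)\}$, as we can replace an FD of the form $R_i\colon A \rightarrow B$ where $|B| > 1$ by the set of FDs $\{R_i\colon A \rightarrow b \mid b \in B\}$. If $|A| = 1$, we say that $\delta$ is a unary FD.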
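The satisfaction condition for a single FD with one index on the right-hand side can be checked in one pass over the relation. The following Python sketch is an illustration only; the function name is hypothetical, and it assumes 0-based positions and hashable attribute values.

```python
from typing import Hashable, Iterable, Sequence


def satisfies_fd(
    relation: Iterable[Sequence[Hashable]],
    lhs: Sequence[int],  # 0-based positions of the left-hand side (assumption)
    rhs: int,            # 0-based position of the right-hand side (assumption)
) -> bool:
    """Return True iff tuples that agree on `lhs` also agree on `rhs`."""
    seen: dict[tuple, Hashable] = {}
    for t in relation:
        key = tuple(t[i] for i in lhs)
        # setdefault stores the first right-hand value seen for this key and
        # returns the stored one, so a mismatch reveals a violation.
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False
    return True
```

An FD with several indices on the right-hand side can be checked by calling the function once per index, mirroring the replacement of such an FD by a set of FDs with a single right-hand index.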

Conjunctive Queries.

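Let $\mathit{var}$ be a set of variables disjoint from $\mathit{dom}$. A Conjunctive Query (CQ) over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$ is an expression of the form $Q(\vec{p}) \leftarrow R_1(\vec{v}_1), \ldots, R_m(\vec{v}_m)$, where $R_1, \ldots, R_m$ are relational symbols of $\mathcal{R}$, the tuples $\vec{p}, \vec{v}_1, \ldots, \vec{v}_m$ hold variables, and every variable in $\vec{p}$ appears in at least one of $\vec{v}_1, \ldots, \vec{v}_m$. We often denote this query as $Q(\vec{p})$ or even $Q$. Define the variables of $Q$ as $\mathrm{var}(Q) = \bigcup_{i=1}^{m} \vec{v}_i$, and define the free variables of $Q$ as $\mathrm{free}(Q) = \vec{p}$. We call $Q(\vec{p})$ the head of $Q$, and the atomic formulas $R_i(\vec{v}_i)$ are called atoms. We further use $\mathrm{atoms}(Q)$ to denote the set of atoms of $Q$. A CQ is said to contain self-joins if some relation symbol appears in more than one atom.

For the evaluation $Q(I)$ of a CQ $Q$ with free variables $\vec{p}$ over a database $I$, we define $Q(I)$ to be the set of all mappings $\mu|_{\vec{p}}$ such that $\mu$ is a homomorphism from $R_1(\vec{v}_1), \ldots, R_m(\vec{v}_m)$ into $I$, where $\mu|_{\vec{p}}$ denotes the restriction (or projection) of $\mu$ to the variables $\vec{p}$. The corresponding decision problem is, given a database instance $I$, determining whether such a mapping exists.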

Given a query over a schema , we often identify an FD as a mapping between variables. That is, if has the form for , we sometimes denote it by , where is the -th variable of . To distinguish between these two representations, we usually denote subsets of integers by , integers by , and variables by letters from the end of the alphabet.

Hypergraphs.

A hypergraph $\mathcal{H} = (V, E)$ is a pair consisting of a set $V$ of vertices, and a set $E$ of non-empty subsets of $V$ called hyperedges (sometimes edges). A join tree of a hypergraph $\mathcal{H} = (V, E)$ is a tree $T$ where the nodes are the hyperedges of $\mathcal{H}$, and the running intersection property holds, namely: for all $u \in V$ the set $\{e \in E \mid u \in e\}$ forms a connected subtree in $T$. A hypergraph $\mathcal{H}$ is said to be acyclic if there exists a join tree for $\mathcal{H}$. Two vertices in a hypergraph are said to be neighbors if they appear in the same edge. A clique of a hypergraph is a set of vertices, which are pairwise neighbors in $\mathcal{H}$. A hypergraph $\mathcal{H}$ is said to be conformal if every clique of $\mathcal{H}$ is contained in some edge of $\mathcal{H}$. A chordless cycle of $\mathcal{H}$ is a tuple $(x_1, \ldots, x_n)$ such that the set of neighboring pairs of variables of $\{x_1, \ldots, x_n\}$ is exactly $\{\{x_i, x_{i+1}\} \mid 1 \le i \le n-1\} \cup \{\{x_n, x_1\}\}$. It is well known (see [4]) that a hypergraph is acyclic iff it is conformal and contains no chordless cycles.

A pseudo-minor of a hypergraph $\mathcal{H} = (V, E)$ is a hypergraph obtained from $\mathcal{H}$ by a finite series of the following operations: (1) vertex removal: removing a vertex from $V$ and from all edges in $E$ that contain it. (2) edge removal: removing an edge $e$ from $E$ provided that some other $e' \in E$ contains it. (3) edge contraction: replacing all occurrences of a vertex $v$ (within every edge) with a vertex $u$, provided that $u$ and $v$ are neighbors.

Classes of CQs.

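To a CQ $Q$ we associate a hypergraph $\mathcal{H}(Q) = (V, E)$ where the vertices $V$ are the variables of $Q$ and every hyperedge in $E$ is a set of variables occurring in a single atom of $Q$, that is $E = \{\{v_1, \ldots, v_n\} \mid R_i(v_1, \ldots, v_n) \in \mathrm{atoms}(Q)\}$. With a slight abuse of notation, we also identify atoms of $Q$ with edges of $\mathcal{H}(Q)$. A CQ $Q$ is said to be acyclic if $\mathcal{H}(Q)$ is acyclic, and it is said to be free-connex if both $\mathcal{H}(Q)$ and $(V, E \cup \{\mathrm{free}(Q)\})$ are acyclic.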
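Both structural conditions can be tested mechanically. The sketch below uses the classical GYO reduction (repeatedly drop vertices that occur in a single edge, and edges contained in another edge) as the acyclicity test; this is a standard technique swapped in for illustration rather than the join-tree definition used here, and the function names are hypothetical.

```python
from typing import Hashable, Iterable


def is_acyclic(edges: Iterable[Iterable[Hashable]]) -> bool:
    """GYO reduction: the hypergraph is acyclic iff it reduces to nothing."""
    current = [set(e) for e in edges if e]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove vertices that occur in exactly one edge.
        counts: dict[Hashable, int] = {}
        for e in current:
            for v in e:
                counts[v] = counts.get(v, 0) + 1
        for e in current:
            lonely = {v for v in e if counts[v] == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: remove empty edges and edges contained in another edge
        # (of several equal edges, only the first one is kept).
        kept = []
        for i, e in enumerate(current):
            redundant = not e or any(
                e < f or (e == f and j < i)
                for j, f in enumerate(current)
                if j != i
            )
            if redundant:
                changed = True
            else:
                kept.append(e)
        current = kept
    return not current


def is_free_connex(edges: list[set], free: set) -> bool:
    """Acyclic, and still acyclic after adding the head as an extra edge."""
    return is_acyclic(edges) and is_acyclic(edges + [free])
```

With the query of Example 1, the edges `{movie, actor}` and `{movie, production}` form an acyclic hypergraph, but adding the head `{actor, production}` creates a triangle, so `is_free_connex` returns `False`.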

A head-path for a CQ $Q$ is a sequence of variables $(x, z_1, \ldots, z_k, y)$ with $k \ge 1$, such that: (1) $\{x, y\} \subseteq \mathrm{free}(Q)$; (2) $\{z_1, \ldots, z_k\} \subseteq V \setminus \mathrm{free}(Q)$; (3) it is a chordless path in $\mathcal{H}(Q)$, that is, two succeeding variables appear together in some atom, and no two non-succeeding variables appear together in an atom. Bagan et al. [3] showed that an acyclic CQ has a head-path iff it is not free-connex.

Enumeration Complexity.

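Given a finite alphabet $\Sigma$ and binary relation $R \subseteq \Sigma^* \times \Sigma^*$, we denote by $\mathrm{Enum}\langle R\rangle$ the enumeration problem of given an instance $x \in \Sigma^*$, to output all $y \in \Sigma^*$ such that $(x, y) \in R$. In this paper we adopt the Random Access Machine (RAM) model (see [15]). Previous results in the field assume different variations of the RAM model. Here we assume that the length of memory registers is linear in the size of value registers, that is, the accessible memory is polynomial. For a class $\mathcal{C}$ of enumeration problems, we say that $\mathrm{Enum}\langle R\rangle \in \mathcal{C}$, if there is a RAM that, on input $x \in \Sigma^*$, outputs all $y \in \Sigma^*$ with $(x, y) \in R$ without repetition such that the first output is computed in time $p(|x|)$ and the delay between any two consecutive outputs after the first is $d(|x|)$, where:

  • For $\mathcal{C} = \mathsf{DelayC_{lin}}$, we have $p(|x|) \in O(|x|)$ and $d(|x|) \in O(1)$.

  • For $\mathcal{C} = \mathsf{DelayLin}$, we have $p(|x|), d(|x|) \in O(|x|)$.

Let $\mathrm{Enum}\langle R_1\rangle$ and $\mathrm{Enum}\langle R_2\rangle$ be enumeration problems. We say that there is an exact reduction from $\mathrm{Enum}\langle R_1\rangle$ to $\mathrm{Enum}\langle R_2\rangle$, written as $\mathrm{Enum}\langle R_1\rangle \le_e \mathrm{Enum}\langle R_2\rangle$, if there are mappings $\sigma$ and $\tau$ such that for every $x \in \Sigma^*$ the mapping $\sigma(x)$ is computable in $O(|x|)$, for every $y$ with $(\sigma(x), y) \in R_2$, $\tau(y)$ is computable in constant time and $\{\tau(y) \mid y \in \Sigma^* \text{ with } (\sigma(x), y) \in R_2\} = \{y' \in \Sigma^* \mid (x, y') \in R_1\}$ in multiset notation. Intuitively, $\sigma$ is used to map instances of $\mathrm{Enum}\langle R_1\rangle$ to instances of $\mathrm{Enum}\langle R_2\rangle$, and $\tau$ is used to map solutions to $\mathrm{Enum}\langle R_2\rangle$ to solutions of $\mathrm{Enum}\langle R_1\rangle$. An enumeration class $\mathcal{C}$ is said to be closed under exact reduction if for every $\mathrm{Enum}\langle R_1\rangle$ and $\mathrm{Enum}\langle R_2\rangle$ such that $\mathrm{Enum}\langle R_1\rangle \le_e \mathrm{Enum}\langle R_2\rangle$ and $\mathrm{Enum}\langle R_2\rangle \in \mathcal{C}$, we have $\mathrm{Enum}\langle R_1\rangle \in \mathcal{C}$. Bagan et al. [3] proved that $\mathsf{DelayC_{lin}}$ is closed under exact reduction. The same proof holds for any meaningful enumeration complexity class that guarantees generating all unique answers with at least linear preprocessing time and at least constant delay between answers.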

Enumerating Answers to CQs.

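For a CQ $Q$ over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$, we denote by $\mathrm{Enum}_{\Delta}\langle Q\rangle$ the enumeration problem $\mathrm{Enum}\langle R\rangle$, where $R$ is the binary relation between instances $I$ over $\mathcal{S}$ and sets of mappings $Q(I)$. We consider the size of the query as well as the size of the schema to be fixed. Bagan et al. [3] showed that a self-join-free acyclic CQ is in $\mathsf{DelayC_{lin}}$ iff it is free-connex:

Theorem 2 ([3]). Let $Q$ be an acyclic CQ without self-joins over a schema $\mathcal{S} = (\mathcal{R}, \emptyset)$.

  1. If $Q$ is free-connex, then $\mathrm{Enum}_{\emptyset}\langle Q\rangle \in \mathsf{DelayC_{lin}}$.

  2. If $Q$ is not free-connex, then $\mathrm{Enum}_{\emptyset}\langle Q\rangle \notin \mathsf{DelayC_{lin}}$, assuming the product of two $n \times n$ boolean matrices cannot be computed in time $O(n^2)$.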

3 FD-Extended CQs

In this section, we formally define the extended query $Q^+$. We then discuss the relationship between $Q$ and $Q^+$: their equivalence w.r.t. enumeration and the possible structural differences between them. As a result, we obtain that if $Q^+$ is in a class of queries that allows for tractable enumeration, then $Q$ is tractable as well.

We first define $Q^+$. The extension of an atom $R_i(\vec{v}_i)$ according to an FD $R_j\colon X \rightarrow y$ is possible if $X \subseteq \vec{v}_i$ but $y \notin \vec{v}_i$. In that case, $y$ is added to the variables of $R_i$. The FD-extension of a query is defined by iteratively extending all atoms as well as the head according to every possible dependency in the schema, until a fixpoint is reached. The schema extends accordingly: the arities of the relations increase as their corresponding atoms extend, and dummy variables are added to adjust to that change in case of self-joins. The FDs apply in every relation that contains all relevant variables.

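Definition 3 (FD-Extended Query). Let $Q(\vec{p}) \leftarrow R_1(\vec{v}_1), \ldots, R_m(\vec{v}_m)$ be a CQ over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$. We define two types of extension steps:

  • The extension of an atom $R_i(\vec{v}_i)$ according to an FD $R_j\colon X \rightarrow y$.
    Prerequisites: $X \subseteq \vec{v}_i$ and $y \notin \vec{v}_i$.
    Effect: The arity of $R_i$ increases by one, and $R_i(\vec{v}_i)$ is replaced by $R_i(\vec{v}_i, y)$. In addition, every $R_k(\vec{v}_k)$ such that $R_k = R_i$ and $k \neq i$ is replaced with $R_k(\vec{v}_k, t_k)$, where $t_k$ is a fresh variable.

  • The extension of the head $Q(\vec{p})$ according to an FD $R_j\colon X \rightarrow y$.
    Prerequisites: $X \subseteq \vec{p}$ and $y \notin \vec{p}$.
    Effect: The head is replaced by $Q(\vec{p}, y)$.

The FD-extension of $Q$ is the query $Q^+$, obtained by performing all possible extension steps on $Q$ according to FDs of $\Delta$ until a fixpoint is reached. The extension is defined over the schema $\mathcal{S}^+ = (\mathcal{R}^+, \Delta^+)$, where $\mathcal{R}^+$ is $\mathcal{R}$ with the extended arities, and $\Delta^+$ consists of the FDs of $\Delta$ applied in every relation of $\mathcal{R}^+$ that contains all relevant variables.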

Given a query, its FD-extension is unique up to a permutation of the added variables, and renaming of the new variables. As the order of the variables and the naming make no difference w.r.t. enumeration, we can treat the FD-extension as unique.

Consider a schema with , and the query . As the FDs are and , the FD-extension is . We first apply on the head, and then and consequently on . These two FDs now appear in the schema also for , and the FDs of the extended schema are . ∎
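The fixpoint computation of the FD-extension can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the query is self-join-free (so no fresh variables are needed), each atom lists its variables without repetition, and FDs are already given on variables as pairs (left-hand variables, right-hand variable). The function name and the data layout are hypothetical.

```python
def fd_extension(head, atoms, fds):
    """Return the extended head and atoms of a self-join-free CQ.

    head:  list of free variables
    atoms: dict mapping a relation name to its list of variables
    fds:   list of (lhs_variables, rhs_variable) pairs (assumed layout)
    """
    head = list(head)
    atoms = {name: list(variables) for name, variables in atoms.items()}
    changed = True
    while changed:  # repeat extension steps until a fixpoint is reached
        changed = False
        for lhs, rhs in fds:
            for target in [head, *atoms.values()]:
                # Prerequisites of an extension step: the target contains
                # all of lhs but not yet rhs.
                if set(lhs) <= set(target) and rhs not in target:
                    target.append(rhs)
                    changed = True
    return head, atoms
```

For the query $Q(\mathit{actor}, \mathit{production}) \leftarrow \mathit{Cast}(\mathit{movie}, \mathit{actor}), \mathit{Release}(\mathit{movie}, \mathit{production})$ with the FD $\mathit{movie} \rightarrow \mathit{production}$, the sketch leaves the head unchanged and appends `production` to the variables of `Cast`.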

We later show that the enumeration complexity of a CQ $Q$ over a schema with FDs $\Delta$ only depends on the structure of $Q^+$, which is implicitly given by $\Delta$. Therefore, we introduce the notions of acyclic and free-connex queries for FD-extensions:

Let $Q$ be a CQ over a schema $\mathcal{S}$, and let $Q^+$ be its FD-extension.

  • We say that $Q$ is FD-acyclic, if $Q^+$ is acyclic.

  • We say that $Q$ is FD-free-connex, if $Q^+$ is free-connex.

  • We say that $Q$ is FD-cyclic, if $Q^+$ is cyclic.

The following proposition shows that the classes of acyclic queries and free-connex queries are both closed under constructing FD-extensions.

Let $Q$ be a CQ over a schema $\mathcal{S}$.

  • If the query $Q$ is acyclic, then it is FD-acyclic.

  • If the query $Q$ is free-connex, then it is FD-free-connex.

Proof.

We prove that if is acyclic, then is also acyclic (the case where is free-connex follows along the same lines). Denote by a sequence of queries such that is the result of extending all possible relations of according to a single FD . By induction, it suffices to show that if is acyclic, then is acyclic as well. So consider an acyclic query extended to the query according to the FD . Further let be the join tree of . We claim that the same tree (but with the extended atoms), is a join tree for . More formally, define such that and . Next we show that the running intersection property holds in , and therefore it is a join tree of .

For the new variables introduced in the extension, every such variable appears only in one atom, so the subtree of containing such a variable contains one node and is trivially connected. For any other variable , the attribute appears in the same atoms in and . Therefore, the subgraph of containing is isomorphic to the subgraph of containing , and since is a join tree, it is connected. It is left to show that the subtree of containing is connected. let be the atom in containing . Note that corresponds to vertices in and containing and . Let be some vertex in containing . We will show that all vertices on the path between and contain . If appears in the vertex in , then it also appears in since is a join tree. Since the extension doesn’t remove occurrences of variables, appears in these vertices in as well. Otherwise, was added to via . Since is a join tree, the vertices all contain the variables . Thus by the definition of , is added to each of (if it was not already there) in . Thus also the subtree of containing is connected. Therefore is indeed a join tree. ∎

Example 1 shows that the converse of the proposition above does not hold. This means that, by Theorem 2, there are queries $Q$ such that we can enumerate the answers to $Q^+$ in $\mathsf{DelayC_{lin}}$, but we cannot enumerate the answers to $Q$ with the same complexity, if we do not assume the FDs. The following lemma shows that enumerating the answers of $Q$ (when relying on the FDs) is in fact equally hard as enumerating the answers of $Q^+$.

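Let $Q$ be a CQ over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$, and let $Q^+$ be its FD-extended query. Then $\mathrm{Enum}_{\Delta}\langle Q\rangle \le_e \mathrm{Enum}_{\Delta^+}\langle Q^+\rangle$ and $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \le_e \mathrm{Enum}_{\Delta}\langle Q\rangle$.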

We first sketch the reduction . Given an instance for the problem , we set as described next. We start by removing tuples that interfere with the extended dependencies. For every dependency and every atom that contains the variables , we only keep tuples of that agree with some tuple of over the values of . Next, we follow the extension of the schema, and in each step we extend some to according to some FD . For each tuple , if there is no tuple that agrees with over the values of , then we remove altogether. Otherwise, we copy to and assign with the same value that assigns it. Given an answer , we set to be the projection of to . To show that , we describe the construction of an instance by “reversing” the extension steps. If an atom was extended, we simply remove the added attribute. If the head was extended using some , then for each tuple in that assigns and with the values and respectively, we add the value to a lookup table with pointer . For every , is defined as extended by the values from the lookup table.
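A single step of the instance construction, extending the tuples of one relation with the value implied by an FD that holds in another relation, might look as follows in Python. This is a sketch under assumptions: relations are lists of tuples, each atom lists its variables without repetition, and a hash map replaces the sorting-based scan. All names are hypothetical.

```python
def extend_relation(ri, ri_vars, rj, rj_vars, lhs, rhs):
    """Extend relation Ri by the value of `rhs`, using an FD lhs -> rhs on Rj.

    ri, rj:           lists of tuples (the relation instances)
    ri_vars, rj_vars: variable names of the corresponding atoms, by position
    lhs, rhs:         FD on variables; lhs must occur in both atoms
    """
    ri_pos = [ri_vars.index(x) for x in lhs]
    rj_pos = [rj_vars.index(x) for x in lhs]
    rhs_pos = rj_vars.index(rhs)

    # The FD guarantees one rhs value per lhs value in Rj.
    value_of = {tuple(t[p] for p in rj_pos): t[rhs_pos] for t in rj}

    extended = []
    for t in ri:
        key = tuple(t[p] for p in ri_pos)
        # Tuples with no agreeing tuple in Rj are removed altogether;
        # the others are copied and receive the value that Rj assigns.
        if key in value_of:
            extended.append(tuple(t) + (value_of[key],))
    return extended
```

Mapping an answer of the extended query back to an answer of the original query is then a projection onto the original free variables.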

Proof.

Let and . We first show that . Given an instance for the problem , we set as described next. We start by removing tuples that interfere with the extended dependencies. For every dependency and every atom that contains the corresponding variables (i.e., ), we correct according to : We only keep tuples of that agree with some tuple of over the values of . We say that a tuple agrees with a tuple on the value of a variable if for every pair of indices such that we have that . This check can be done in linear time by first sorting both and according to , and then performing one scan over both of them. Next, we follow the extension of the schema, and in each step we extend some to according to some FD as described in Definition 3. For each tuple , if there is no tuple that agrees with over the values of , then we remove altogether. Otherwise, we copy to and assign with the same value that assigns it. We say that a tuple assigns a variable with the value if for every index such that we have that . Given an answer , we set to be the projection of to . The projection is computable in constant time.

For the correctness, we need to show that in multiset notation. The easy direction is that if then . Since is a homomorphism from to , and since all tuples of appear (perhaps projected) in , then is also a homomorphism from to . We now show the opposite direction, that if then . Consider a sequence of queries such that each one is the result of extending an atom or the head of the previous query according to an FD . We claim that if is an answer for , then is an answer for . This claim is trivial in case the head was extended. Note also that there cannot be two answers and to such that , as the added variable is bound by the FD to have only one possible value. Now consider the case where an atom was extended since . Denote by and the tuples that are mapped by from and respectively. The construction guarantees that and agree on the value of , so can still map the extended to the extended . In case of self-joins, other atoms with the relation are extended with a new and distinct variable, and the new variable can be mapped to any value appearing in the extension. Therefore if then .

To show that , we now define the mapping between instances. Let be an instance of . First, we “clean” from any tuples that disagree with original FDs. That is, for every FD and every atom such that , remove all tuples that agree with some tuple over but disagree with over . This can be done in linear time by first sorting both and according to . Next, we construct a lookup table . For every added to the head due to an FD , denote by a vector containing the variables of in lexicographic order. For each tuple in that assigns and with the values and respectively, we add the value to the lookup table with pointer . Note that due to the FD, a pointer cannot map to two different values. Lastly, we project the relations to . These steps result in the construction of an instance and a lookup table in linear time. Given , we now define . We define a mapping for the variables added to the head using the lookup table. For every added due to some FD , we add to . We define . Note that is computable in constant time since we can use the lookup table in constant time.

We need to show that in multiset notation. First we claim that given , we have that . If maps to , then was added to the head due to some FD , and there is some tuple in that assigns and with the values and respectively. Due to the dependency, all tuples of which assign with , also assign with , and this is also true in . Therefore also maps to . This means that . We now show the first direction, that given we have that . We now claim that is (a subset of) a homomorphism from to . We know that is a homomorphism from to . For any , denote by the tuple . If an atom was extended due to an FD , then and the extension of must agree on , otherwise this would have been deleted in the cleaning phase. In case of self-joins, additional atoms such that may have been extended with new variables. As each new variable has only one occurrence, the extension of these atoms does not interfere with , as the new variables can map to any value present in the tuple that was mapped by from . We conclude that . The second direction is that given , we have that and . It is only left to show that . Indeed, if maps an atom to a tuple , then it maps to the same (perhaps projected) tuple in . This tuple was not removed during the cleaning phase, as the only removed tuples do not have a tuple of agreeing with them on the value of , and therefore cannot map to them. ∎

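The direction $\mathrm{Enum}_{\Delta}\langle Q\rangle \le_e \mathrm{Enum}_{\Delta^+}\langle Q^+\rangle$ of Theorem 3 proves that FD-extensions can be used to expand tractable enumeration classes, as the following corollary states.

Let $\mathcal{C}$ be an enumeration class that is closed under exact reduction. Let $Q$ be a CQ and let $Q^+$ be its FD-extended query. If $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \in \mathcal{C}$, then $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathcal{C}$.

Proof.

According to Theorem 3, $\mathrm{Enum}_{\Delta}\langle Q\rangle \le_e \mathrm{Enum}_{\Delta^+}\langle Q^+\rangle$. Since $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \in \mathcal{C}$ and $\mathcal{C}$ is closed under exact reduction, we have that $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathcal{C}$. ∎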

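Since free-connex queries are in $\mathsf{DelayC_{lin}}$ and $\mathsf{DelayC_{lin}}$ is closed under exact reduction, if $Q$ is an FD-free-connex query, then the corresponding enumeration problem is in $\mathsf{DelayC_{lin}}$. This follows from Theorem 2 and the fact that $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \le_e \mathrm{Enum}_{\emptyset}\langle Q^+\rangle$.

Let $Q$ be a CQ over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$. If $Q$ is FD-free-connex, then $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathsf{DelayC_{lin}}$.

Proof.

According to Theorem 2, we have that $\mathrm{Enum}_{\emptyset}\langle Q^+\rangle \in \mathsf{DelayC_{lin}}$ as $Q^+$ is free-connex. Given an instance over the schema $\mathcal{S}^+ = (\mathcal{R}^+, \Delta^+)$, the same instance is also over $(\mathcal{R}^+, \emptyset)$, and any query has the same answers over both schemas. Therefore, we have the reduction $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \le_e \mathrm{Enum}_{\emptyset}\langle Q^+\rangle$ by using the identity mapping. Overall, we conclude that $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \in \mathsf{DelayC_{lin}}$, and using Corollary 3 we get that $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathsf{DelayC_{lin}}$. ∎

We can now revisit Example 1. The query $Q(\mathit{actor}, \mathit{production}) \leftarrow \mathit{Cast}(\mathit{movie}, \mathit{actor}), \mathit{Release}(\mathit{movie}, \mathit{production})$ is not free-connex. Therefore, disregarding the FDs, according to Theorem 2 it is not in $\mathsf{DelayC_{lin}}$. However, given the FD $\mathit{Release}\colon \mathit{movie} \rightarrow \mathit{production}$, the FD-extended query is $Q^+(\mathit{actor}, \mathit{production}) \leftarrow \mathit{Cast}(\mathit{movie}, \mathit{actor}, \mathit{production}), \mathit{Release}(\mathit{movie}, \mathit{production})$. As it is free-connex, enumerating $Q$ is in $\mathsf{DelayC_{lin}}$ by Corollary 3.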

4 A Dichotomy for Acyclic CQs

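In this section, we characterize which self-join-free FD-acyclic queries are in $\mathsf{DelayC_{lin}}$. We use the notion of FD-extended queries defined in the previous section to establish a dichotomy stating that enumerating the answers to an FD-acyclic query is in $\mathsf{DelayC_{lin}}$ iff the query is FD-free-connex. We will prove the following theorem:

Let $Q$ be an FD-acyclic CQ without self-joins over a schema $\mathcal{S} = (\mathcal{R}, \Delta)$.

  • If $Q$ is FD-free-connex, then $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathsf{DelayC_{lin}}$.

  • If $Q$ is not FD-free-connex, then $\mathrm{Enum}_{\Delta}\langle Q\rangle \notin \mathsf{DelayC_{lin}}$, assuming that the product of two $n \times n$ boolean matrices cannot be computed in time $O(n^2)$.

Proof.

The positive case for this dichotomy was given in Corollary 3. The lower bound is a consequence of Theorem 3 and Lemma 4.2, stating that $\mathrm{Enum}_{\Delta^+}\langle Q^+\rangle \le_e \mathrm{Enum}_{\Delta}\langle Q\rangle$ and $\mathrm{Enum}\langle \Pi\rangle \le_e \mathrm{Enum}_{\Delta^+}\langle Q^+\rangle$. Therefore having $\mathrm{Enum}_{\Delta}\langle Q\rangle \in \mathsf{DelayC_{lin}}$ would mean that $\mathrm{Enum}\langle \Pi\rangle \in \mathsf{DelayC_{lin}}$, which is in contradiction to the conjecture. ∎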

The positive case for the dichotomy was described in Corollary 3. Note that the restriction of considering only self-join-free queries is required only for the negative side. This assumption is standard [3, 6, 11], as it allows assigning different atoms with different relations independently. The hardness result described here builds on that of Bagan et al. [3] for databases that are assumed not to have FDs, and it relies on the hardness of the boolean matrix multiplication problem. This problem is defined as the enumeration of the query $\Pi(x, y) \leftarrow A(x, z), B(z, y)$ over the schema $(\{A, B\}, \emptyset)$ where $A, B \subseteq \{1, \ldots, n\}^2$. It is strongly conjectured that this problem is not computable in $O(n^2)$ time and currently, the best known algorithms require $O(n^{\omega})$ time for some $\omega > 2$ [12, 1].
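To make the connection between boolean matrix multiplication and query evaluation concrete, the following Python sketch encodes two boolean matrices as binary relations and evaluates $\Pi(x, y) \leftarrow A(x, z), B(z, y)$ by a straightforward join. It only illustrates the encoding; the function names are hypothetical and the join is a naive one, not a fast matrix multiplication algorithm.

```python
def matrices_to_relations(a, b):
    """Encode boolean matrices as relations holding the positions of 1-entries."""
    rel_a = {(i, k) for i, row in enumerate(a) for k, bit in enumerate(row) if bit}
    rel_b = {(k, j) for k, row in enumerate(b) for j, bit in enumerate(row) if bit}
    return rel_a, rel_b


def evaluate_pi(rel_a, rel_b):
    """Answers of Pi(x, y) <- A(x, z), B(z, y): the 1-entries of the product."""
    # Index B by its first attribute z.
    by_z = {}
    for z, y in rel_b:
        by_z.setdefault(z, set()).add(y)
    # The set comprehension projects z away and removes duplicate (x, y) pairs.
    return {(x, y) for x, z in rel_a for y in by_z.get(z, ())}
```

An entry $(x, y)$ is an answer exactly when some $z$ has a 1 in row $x$ of the first matrix and in column $y$ of the second, which is the definition of the boolean product.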

The original proof describes an exact reduction $\mathrm{Enum}\langle \Pi\rangle \le_e \mathrm{Enum}_{\emptyset}\langle Q\rangle$. Since $Q$ is acyclic but not free-connex, it contains a head-path $(x, z_1, \ldots, z_k, y)$. Given an instance of the matrix multiplication problem, an instance of $Q$ is constructed, where the variables $x$, $z_1, \ldots, z_k$ and $y$ of the head-path respectively encode the variables $x$, $z$ and $y$ of $\Pi$, while all other variables of $Q$ are assigned constants. This way, $A$ is encoded by an atom containing $x$ and $z_1$, and $B$ is encoded by an atom containing $z_k$ and $y$. Atoms containing some $z_i$ and $z_{i+1}$ only propagate the value of $z$. Since $x$ and $y$ are in $\mathrm{free}(Q)$, but $z_1, \ldots, z_k$ are not, the answers to $Q$ correspond to those of $\Pi$. As no atom of