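This paper considers the problem of answering conjunctive queries with negation of the form

$$Q(\mathbf{X}_F) \leftarrow \text{body} \wedge \bigwedge_{S \in \overline{\mathcal{E}}} \neg R_S(\mathbf{X}_S), \tag{1}$$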
where body is the body of an arbitrary conjunctive query, $\mathbf{X}_F$ denotes a tuple of variables (or attributes) indexed by a set $F$ of positive integers, and $\overline{\mathcal{E}}$ is the set of hyperedges of a multi-hypergraph $\overline{\mathcal{H}} = (\overline{\mathcal{V}}, \overline{\mathcal{E}})$ (in a multi-hypergraph, each hyperedge can occur multiple times; all hypergraphs in this paper are multi-hypergraphs). Every hyperedge $S \in \overline{\mathcal{E}}$ corresponds to a bounded-degree relation $R_S$ on attributes $\mathbf{X}_S$. Section 2 formalizes this notion of bounded degree. For instance, the equality ($=$) relation is a bounded-degree (binary) relation, because every element in the active domain has degree one; the edge relation of a graph with bounded maximum degree is also a bounded-degree relation. Eq. (1) captures SQL queries with NOT EXISTS clauses and disequality predicates ($\neq$). In the rule mining problem [DBLP:conf/sigmod/ChenGWJ16], one must count the number of violations of a conjunctive rule, which also leads to Eq. (1). We next exemplify using three Boolean queries (we denote a Boolean query $Q(\mathbf{X}_F)$ with $F = \emptyset$ by $Q$ instead of $Q()$; we also use $[n] = \{1, \ldots, n\}$) over a directed graph $G = (V, E)$ with $|V|$ nodes and $N = |E|$ edges: the $k$-walk query $W$ (unlike a path, in a walk some vertices may repeat), the $k$-path query $P$, and the induced (or chordless) $k$-path query $I$. They have the same body and encode graph problems of increasing complexity:

$$W \leftarrow E(X_1, X_2) \wedge E(X_2, X_3) \wedge \cdots \wedge E(X_k, X_{k+1})$$

$$P \leftarrow E(X_1, X_2) \wedge \cdots \wedge E(X_k, X_{k+1}) \wedge \bigwedge_{i, j \in [k+1],\; i+1 < j} X_i \neq X_j \tag{2}$$

$$I \leftarrow E(X_1, X_2) \wedge \cdots \wedge E(X_k, X_{k+1}) \wedge \bigwedge_{i, j \in [k+1],\; i+1 < j} \big(\neg E(X_i, X_j) \wedge X_i \neq X_j\big) \tag{3}$$

The hypergraph for the -walk query is empty since it has no negated relations. This query can be answered in time using the Yannakakis dynamic programming algorithm [Yannakakis:VLDB:81]. The -path query has the hypergraph . It can be answered in -time [PlehnV90] and even better in -time using the color-coding technique [DBLP:journals/jacm/AlonYZ95]. The induced -path query has the hypergraph similar to that of , but every edge has now multiplicity two due to the negated edge relation and also the disequality. This query is -hard [DBLP:conf/coco/ChenF07], but it can be answered in , where is the number of nodes in the input data graph [PlehnV90].
Our results imply the above complexity results for the three queries.
1.1 Main Contribution
In this paper we propose an algorithm for answering arbitrary conjunctive queries with negation on bounded-degree relations of arbitrary arities, as defined by Eq. (1). Its data complexity matches that of the best known algorithms for the positive subquery and is expressed in terms of the fractional hypertree width [DBLP:journals/talg/GroheM14] and the submodular width [Marx:subw].
Let be a query of the form (1), where for each the relation has bounded degree. Then, the data complexity of answering over a database of size is and .
Our work is the first to exploit the bounded degree of the negated relations. Existing algorithms for positive queries can also answer queries with negation, albeit with much higher complexity since already one negation can increase their worst-case runtime. For example, the Boolean path query with a disequality between the two end points takes linear time with our approach, but quadratic time with existing approaches [faq, panda].
Theorem 1.1 draws on a number of conceptual and technical contributions:
(ii) A generalization of color coding from cliques of disequalities to arbitrary conjunctions of not-all-equal predicates; and
(iii) An alternative view of color coding via Boolean tensor decomposition of conjunctions of not-all-equal predicates (Lemma 4.1). This decomposition admits a probabilistic construction that can be derandomized efficiently (Corollary 4.7).
Our algorithm proceeds as follows. We first rewrite the query into an equivalent disjunction of queries of the form (cf. Proposition 3.3)
For each query in this disjunction, its body may differ from body in the original query, since fresh variables and unary predicates may be introduced. Its fractional hypertree and submodular widths, however, remain at most those of body. We thus rewrite the conjunction of the negated relations into a much simpler conjunction of NAE predicates without increasing the data complexity of the query. The number of such queries depends exponentially on the arities and the degrees of the negated relations, hence the necessary constant bound on these degrees.
The second step is based on the observation that a conjunction of NAE predicates can be answered by an adaptation of the color-coding technique [DBLP:journals/jacm/AlonYZ95], which has been used so far for checking cliques of disequalities. The crux of this technique is to randomly color each value in the active domain with one color from a set whose size is much smaller than the size of the active domain, and to use these colors instead of the values themselves to check the disequalities. We generalize this idea to conjunctions of NAE predicates and show that such conjunctions can be expressed equivalently as disjunctions of simple queries over the different possible colorings of the variables in these queries.
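The classic use of color coding for disequalities can be illustrated on the simple-path problem. The following Python sketch is only an illustration under stated assumptions, not the algorithm of this paper: the function names are hypothetical, a "path on $k$ vertices" is used as the target pattern, and the number of random trials is an arbitrary choice. Each trial colors the nodes at random with $k$ colors and then checks, by dynamic programming over sets of colors, whether some walk uses $k$ pairwise distinct colors; such a walk is necessarily a simple path.

```python
import random

def has_colorful_path(edges, coloring, k):
    """True if some walk on k vertices uses k pairwise distinct colors."""
    nodes = {u for e in edges for u in e}
    # reach[v] holds the color sets (as bitmasks) of colorful walks ending at v
    reach = {v: {1 << coloring[v]} for v in nodes}
    for _ in range(k - 1):
        nxt = {v: set() for v in nodes}
        for u, v in edges:
            bit = 1 << coloring[v]
            for mask in reach[u]:
                if not mask & bit:          # color of v not used so far
                    nxt[v].add(mask | bit)
        reach = nxt
    return any(reach[v] for v in nodes)

def has_simple_path(edges, k, trials=200, seed=0):
    """One-sided randomized test: True is always correct, False may be wrong."""
    rng = random.Random(seed)
    nodes = sorted({u for e in edges for u in e})
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in nodes}
        if has_colorful_path(edges, coloring, k):
            return True
    return False
```

The colors stand in for the values themselves: distinct colors certify distinct values, which is exactly how the disequalities are checked without comparing values pairwise.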
We further explain color coding by providing an alternative view of it: Color coding is a Boolean tensor decomposition of the Boolean tensor defined by the conjunction $\bigwedge_{S} \text{NAE}(\mathbf{X}_S)$ of NAE predicates. As a tensor, this conjunction is a multivariate function over the variables it mentions. The tensor decomposition rewrites it into a disjunction of conjunctions of univariate functions over individual variables (Lemma 4.1). That is,

$$\bigwedge_{S} \text{NAE}(\mathbf{X}_S) \;\equiv\; \bigvee_{j \in [r]} \bigwedge_{i} f_i^{(j)}(X_i),$$

where $r$ is the rank of the tensor decomposition, and for each $j \in [r]$, the inner conjunction can be thought of as a rank-1 tensor of inexpensive Boolean univariate functions ($f_i^{(j)}$). The key advantages of this decomposition are that (i) the addition of univariate conjuncts to body does not increase its (fractional hypertree and submodular) width and (ii) the dependency of the rank of the decomposition on the database size is only a $\log N$ factor. Lemma 4.1 shows that the rank is bounded by the product of two quantities: $r \leq P(\mathcal{H}, c) \cdot |\mathcal{F}|$. The first is the chromatic polynomial of the hypergraph $\mathcal{H}$ of the conjunction using $c$ colors. The second is the size of a family $\mathcal{F}$ of hash functions that represent proper $c$-colorings of homomorphic images of the input database. The number $c$ of needed colors is at most the number of variables in the conjunction. We show it to be the maximum chromatic number of a hypergraph defined by any homomorphic image of the database.
The construction of the Boolean decomposition in the above step is non-trivial. We give a probabilistic construction that generalizes the construction used by the color-coding technique. It selects a color distribution dependent on the query structure, which allows the Boolean tensor rank of to take a wide range of query complexity asymptotics, from polynomial to exponential in the query size. This is more refined than the previously known bound [DBLP:journals/jacm/AlonYZ95], which amounts to a tensor rank that is exponential in the query size. Furthermore, our approach shaves off a factor in the number of colors used for color coding. Recall that the RAM model of computation comes in two variants [DBLP:books/cu/MotwaniR95]: the bit and the unit models, where the cost of a single operation is defined to be and respectively. We show that the operations on colors can be encoded as bit operations that only take step in the unit model. We further derandomize this construction by adapting ideas from derandomization for -restrictions [DBLP:journals/talg/AlonMS06] (with being related to the tensor rank).
The third and final step uses known query evaluation techniques to compute the Boolean tensor decomposition expression: It can achieve the fractional hypertree width using the InsideOut algorithm [faq] and the submodular width using the PANDA algorithm [panda].
1.2 Related Work
The color-coding technique [DBLP:journals/jacm/AlonYZ95] underlies existing approaches to answering queries with disequalities [Papadimitriou:JCSS:99, Bagan:CSL:07, Koutris:TCS:17], the homomorphic embedding problem [DBLP:series/txtcs/FlumG06], and motif finding and counting in computational biology [DBLP:journals/tit/AlonBNNR92]. This technique was originally proposed for checking cliques of disequalities. It is typically used in conjunction with a dynamic programming algorithm, whose analysis involves combinatorial arguments that make it difficult to apply and generalize to problems beyond the path query from Eq. (2). For example, it is unclear how to use color coding to recover the Plehn and Voigt result for the induced path query from Eq. (3). In this paper, we generalize the technique to arbitrary conjunctions of NAE predicates and from graph coloring to hypergraph coloring.
Our work also generalizes prior work on answering queries with disequalities, which are a special case of negated relations of bounded degree. Papadimitriou and Yannakakis showed that any acyclic join query with an arbitrary set of disequalities on variables can be evaluated in time over any database [Papadimitriou:JCSS:99]. This builds on, yet uses more colors than, the color-coding technique. Bagan, Durand and Grandjean [Bagan:CSL:07] extended this result to free-connex acyclic queries and also shaved off a factor as in our approach. Koutris, Milo, Roy and Suciu [Koutris:TCS:17] introduced a practical algorithm for conjunctive queries with disequalities: Given a select-project-join (SPJ) plan for the conjunctive query without disequalities, the disequalities can be solved uniformly using an extended projection operator. Differently from prior work and in line with our work, they investigate query structures for which the combined complexity becomes polynomial: This is the case for queries whose augmented hypergraphs have bounded treewidth (an augmented hypergraph is the hypergraph of the skeleton conjunctive query augmented with one hyperedge per disequality). The rank of our tensor decomposition also depends on the query structure, as discussed above for the second step of our algorithm. The reliance on SPJ plans is a limitation, since it is known that such plans are suboptimal for join processing [skew] and are inadequate to achieve fhtw and subw bounds. Our approach adapts the InsideOut [faq] and PANDA [panda] algorithms to negation of bounded-degree relations and inherits their low data complexity for arbitrary conjunctive queries, thus achieving both bounds, as stated in Theorem 1.1. Koutris et al [Koutris:TCS:17] further proposed an alternative query answering approach that uses the probabilistic construction of the original color-coding technique coupled with any query evaluation algorithm.
Our Boolean tensor decomposition for conjunctions of NAE predicates draws on the general framework of tensor decomposition used in signal processing and machine learning [TensorDecomp:2009, TensorDecomp:2017]. It is a special case of sum-product decomposition and a powerful tool. Typical dynamic programming algorithms solve subproblems by combining relations and eliminating variables [Yannakakis:VLDB:81, FDB:TODS:2015, faq]. The sum-product decomposition is the dual approach that decomposes a formula and introduces new variables. The PANDA algorithm [panda] achieves a generalization of the submodular width by rewriting a conjunction as a sum-product over tree decompositions. By combining PANDA with our tensor decomposition, we can answer queries with negation within submodular-width time.
We connect two notions of sparsity in this work. One is the bounded degree of the input relations that are negated in the query. The other is the sparsity of the conjunction of NAE predicates and is captured by the rank of its Boolean tensor decomposition. There are several notions of graph sparsity proposed in the literature, cf. [Sparsity:2017] for an excellent and comprehensive course on sparsity. The most refined sparsity notion is that of nowhere denseness [Grohe:JACM:2017], which characterizes the input monotone graph classes on which FO model checking is fixed-parameter tractable. We leave as future work the generalization of our work to queries with negated nowhere-dense relations. We note that the relation represented by the conjunction of NAE predicates is not necessarily nowhere dense.
While close in spirit to -restrictions [DBLP:journals/talg/AlonMS06], our approach to derandomization of the construction of the Boolean tensor decomposition is different since it has a strict runtime budget defined by the fhtw-bound for computing body. Our derandomization uses a code-concatenation technique where the outer-code is a linear error-correcting code on the Gilbert-Varshamov boundary [DBLP:journals/tit/PoratR11] that can be constructed in linear time. As a byproduct, the code enables an efficient construction of an -perfect hash family of size . To the best of our knowledge, the prior constructions yield families of size [DBLP:journals/talg/AlonMS06].
We illustrate our algorithm using a Boolean query:
where all input relations are materialized and have sizes upper bounded by and thus the active domain of any variable has size at most . The query can be answered trivially in time by joining and first, and then, for each triple in the join, by verifying whether with a (hash) lookup. Define the degree of relation
Suppose we know that . Can we do better than ? The answer is YES.
Rewriting to not-all-equal predicates
By viewing as a bipartite graph of maximum degree two, it is easy to see that can be written as a disjoint union of two relations and that represent matchings in the following sense: for any , if and , then either or and . Let denote the active domain of the variable . Define, for each , a singleton relation . Clearly, and given , can be computed in preprocessing time. For each , create a new variable with domain . Then,
The predicate NAE stands for not-all-equal: It is the negation of the conjunction of pairwise equalities on its variables. For arity two, as in the rewriting above, NAE is simply the disequality ($\neq$).
From and (5), we can rewrite the original query from (4) into a disjunction of Boolean conjunctive queries without negated relations but with one or two extra existential variables that are involved in disequalities (): , where
It takes linear time to compute the decomposition of the negated relation into its two matchings since: (1) the relation is a bipartite graph with degree at most two, and it is thus a union of even cycles and paths; and (2) we can trace the cycles and paths and assign their edges alternately to the two matchings. In general, when the maximum degree is higher and when the negated relation is not a binary predicate, we show in Proposition 2.3 how to decompose a relation into high-dimensional matchings efficiently. The number of queries in the resulting disjunction depends exponentially on the arities and degrees of the negated relations.
Boolean tensor decomposition
The acyclic query can be answered in time, where the factor is due to sorting of relations. The query can be answered as follows. Let denote the function such that is the th bit of in its binary representation. Then, by noticing that
we can break up the query into the disjunction of acyclic queries of the form
For a fixed , both and are singleton relations on and , respectively. Then, can be answered in time . The same applies to . We can use the same trick to answer in time . However, we can do better than that by observing that when viewed as a Boolean tensor in (6), the disequality tensor has the Boolean rank bounded by . In order to answer in time , we will show that the three-dimensional tensor has the Boolean rank bounded by as well. To this end, we extend the color-coding technique. We can further shave off a factor in the complexities of , , and , as explained in Section 5.
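The bitwise view of a single disequality can be checked mechanically. The Python sketch below assumes the decomposition has the usual bitwise shape: two values differ iff, for some bit position $b$ and some $c \in \{0, 1\}$, bit $b$ of the first value is $c$ and bit $b$ of the second value is $1 - c$. The helper names are hypothetical and the domain is assumed to be $\{0, \ldots, 2^{\text{n\_bits}} - 1\}$.

```python
from itertools import product

def bit(x, b):
    """The b-th bit of x in its binary representation."""
    return (x >> b) & 1

def rank1_terms(n_bits):
    """One rank-1 term per pair (b, c): [bit_b(X) = c] AND [bit_b(Y) = 1 - c]."""
    return [(b, c) for b in range(n_bits) for c in (0, 1)]

def decomposed_neq(x, y, n_bits):
    """Disjunction of the rank-1 terms, evaluated at the pair (x, y)."""
    return any(bit(x, b) == c and bit(y, b) == 1 - c
               for b, c in rank1_terms(n_bits))

# Exhaustive check on a domain of size 16 (4 bits per value).
assert all(decomposed_neq(x, y, 4) == (x != y)
           for x, y in product(range(16), repeat=2))
```

Each term is a conjunction of two unary predicates, so the number of terms, here $2 \cdot \text{n\_bits}$, is an upper bound on the Boolean rank of the disequality tensor in this sketch.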
Construction of the Boolean tensor decomposition
We next explain how to compute a tensor decomposition for the conjunction of disequalities in . There exists a family of functions satisfying the following conditions:
for every triple for which , there is a function such that , and
can be constructed in time .
We think of each function as a “coloring” that assigns a “color” in to each element of . Assuming to hold, it follows that
where ranges over all triples in such that and . Given this Boolean tensor decomposition, we can solve in time .
We prove these conditions using a combinatorial object called a disjunct matrix. Disjunct matrices are the central subject of combinatorial group testing [NgoSurvey1999, MR1742957].
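Definition 1.2 ($d$-disjunct matrix).
A $t \times N$ binary matrix $\mathbf{A} = (a_{ij})$ is called a $d$-disjunct matrix if for every column $j \in [N]$ and every set $S \subseteq [N]$ such that $|S| \leq d$ and $j \notin S$, there exists a row $i \in [t]$ for which $a_{ij} = 1$ and $a_{ij'} = 0$ for all $j' \in S$.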
It is known that for every integer , there exists a -disjunct matrix (or equivalently a combinatorial group testing [NgoSurvey1999]) with rows that can be constructed in time [DBLP:journals/tit/PoratR11]. (If
, we can just use the identity matrix.) In particular, forand , a -disjunct matrix of size can be constructed in time . From the matrix we define the function family by associating a function to each row of the matrix, and every member to a distinct column of the matrix. Define and – straightforwardly follow.
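To make the link between disjunct matrices and the coloring family concrete, the following Python sketch checks the standard $d$-disjunct property by brute force and turns each row of a matrix into a $\{0,1\}$-valued function on column indices. It is an illustration only: the helper names are hypothetical, the separation condition is assumed to read "$f(a) \neq f(b)$ and $f(a) \neq f(c)$ for some $f$ whenever $a \neq b$ and $a \neq c$", and the trivial identity matrix is used instead of an efficient construction.

```python
from itertools import combinations

def is_d_disjunct(matrix, d):
    """Brute-force check: every column j is isolated from every set S of at
    most d other columns by some row with a 1 in column j and 0s on S."""
    n_cols = len(matrix[0])
    for j in range(n_cols):
        others = [c for c in range(n_cols) if c != j]
        for size in range(d + 1):
            for subset in combinations(others, size):
                if not any(row[j] == 1 and all(row[c] == 0 for c in subset)
                           for row in matrix):
                    return False
    return True

def row_functions(matrix):
    """One function per row; it maps a column index to the row's entry."""
    return [lambda x, row=row: row[x] for row in matrix]

def separates(functions, a, b, c):
    """Some function gives a a color different from those of b and c."""
    return any(f(a) != f(b) and f(a) != f(c) for f in functions)

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
family = row_functions(identity)
```

With a 2-disjunct matrix, the row isolating column $a$ from $\{b, c\}$ yields a function with $f(a) = 1$ and $f(b) = f(c) = 0$, which is the separation property needed above.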
In this paper we consider arbitrary conjunctive queries with negated relations of the form (1). We make use of the following naming convention. Capital letters with subscripts such as $X_i$ or $A_j$ denote variables. For any set $S$ of positive integers, $\mathbf{X}_S = (X_i)_{i \in S}$ denotes a tuple of variables indexed by $S$. Given a relation $R$ over variables $\mathbf{X}_S$ and $J \subseteq S$, $\pi_J R$ denotes the projection of $R$ onto variables $\mathbf{X}_J$, i.e., we write $\pi_J R$ instead of $\pi_{\mathbf{X}_J} R$. If $X_i$ is a variable, then the corresponding lower-case $x_i$ denotes a value from the active domain $\mathsf{Dom}(X_i)$ of $X_i$. Bold-face $\mathbf{x}_S = (x_i)_{i \in S}$ denotes a tuple of values in $\prod_{i \in S} \mathsf{Dom}(X_i)$.
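For any relation $R(\mathbf{X}_S)$, we associate a hypergraph $\mathcal{H}(R)$ defined as follows. The vertex set is $\{(i, v) \mid i \in S, v \in \mathsf{Dom}(X_i)\}$. Each tuple $\mathbf{x}_S \in R$ corresponds to an edge $\{(i, x_i) \mid i \in S\}$. Note that $\mathcal{H}(R)$ is a $|S|$-uniform hypergraph (all hyperedges have size $|S|$).
Hypergraph coloring. Let $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ denote a multi-hypergraph and $c$ be a positive integer. A proper $c$-coloring of $\mathcal{H}$ is a mapping $h : \mathcal{V} \to [c]$ such that for every edge $S \in \mathcal{E}$, there exist $u, v \in S$ with $u \neq v$ such that $h(u) \neq h(v)$. The chromatic polynomial $P(\mathcal{H}, c)$ [MR95h:05067] of $\mathcal{H}$ is the number of proper $c$-colorings of $\mathcal{H}$. We use $\chi(\mathcal{H})$ and $\chi'(\mathcal{H})$ to denote the chromatic number of $\mathcal{H}$ and the chromatic index (the edge coloring number) of $\mathcal{H}$, respectively. Coloring a (hyper)graph is equivalent to coloring it without singleton edges.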
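For small instances, proper colorings, the chromatic polynomial and the chromatic number can be computed by exhaustive enumeration. The Python sketch below is such a brute-force illustration with hypothetical names; vertices are assumed to be numbered $0, \ldots, n-1$, hyperedges are given as tuples of vertices, and singleton edges are ignored, following the convention stated above.

```python
from itertools import product

def is_proper(coloring, edges):
    """Every non-singleton hyperedge must see at least two distinct colors."""
    return all(len({coloring[v] for v in e}) > 1
               for e in edges if len(set(e)) > 1)

def count_proper_colorings(n_vertices, edges, c):
    """Value of the chromatic polynomial at c, by exhaustive enumeration."""
    return sum(1 for coloring in product(range(c), repeat=n_vertices)
               if is_proper(coloring, edges))

def chromatic_number(n_vertices, edges):
    """Smallest c for which at least one proper c-coloring exists."""
    c = 1
    while count_proper_colorings(n_vertices, edges, c) == 0:
        c += 1
    return c

triangle = [(0, 1), (1, 2), (0, 2)]   # three binary NAE (disequality) edges
single_nae = [(0, 1, 2)]              # one ternary NAE edge
```

The triangle needs three colors, whereas the single ternary hyperedge is already properly colored with two, which illustrates why hypergraph coloring can require fewer colors than the clique of disequalities on the same variables.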
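Bounded-degree relation. The maximum degree of a vertex in a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ is denoted by $\Delta(\mathcal{H})$: $\Delta(\mathcal{H}) := \max_{v \in \mathcal{V}} |\{S \in \mathcal{E} \mid v \in S\}|$. For a relation $R(\mathbf{X}_S)$, its maximum degree $\Delta(R) := \Delta(\mathcal{H}(R))$ is the maximum number of tuples in $R$ with the same value for a variable $X_i$: $\Delta(R) = \max_{i \in S} \max_{v \in \mathsf{Dom}(X_i)} |\{\mathbf{x}_S \in R \mid x_i = v\}|$. We will use a slightly different notion of degree of a relation, denoted by $\deg(R)$, which also accounts for the arity of the relation $R$. Proposition 2.3 connects the two notions.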
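Definition 2.1 (Matching).
A $k$-ary relation $M(X_1, \ldots, X_k)$ is called a ($k$-dimensional) matching if for every two tuples $\mathbf{x}, \mathbf{x}' \in M$, either $\mathbf{x} = \mathbf{x}'$, i.e., $\mathbf{x}$ and $\mathbf{x}'$ are the same tuple, or it holds that $x_i \neq x'_i$ for all $i \in [k]$.
Definition 2.2 (Degree).
The degree of a $k$-ary relation $R$, denoted by $\deg(R)$, is the smallest integer $d$ for which $R$ can be written as the disjoint union of $d$ matchings. The degree is bounded if there is a constant $d_R$ such that $\deg(R) \leq d_R$.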
It is easy to see that $\deg(R) = \chi'(\mathcal{H}(R))$. If $R$ is a binary relation, then $\mathcal{H}(R)$ is a bipartite graph and $\deg(R) = \Delta(R)$. This follows from König’s line coloring theorem [konig1916], which states that the chromatic index of a bipartite graph is equal to its maximum degree. When the arity is higher than two, to the best of our knowledge there does not exist such a nice characterization of the chromatic index of $\mathcal{H}(R)$ in terms of the maximum degree of individual vertices in the graph, although there has been some work on bounding the chromatic index of (linear) uniform hypergraphs [MR1426745, MR3324967, 2016arXiv160304938F, MR993646]. In our setting, we are willing to live with a sub-optimal matching decomposition, which can be computed straightforwardly in linear time. A bounded-degree relation can be decomposed into a disjunction of matchings in linear time.
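Proposition 2.3. Let $R$ denote a $k$-ary relation of size $N$. The following holds:
$\Delta \leq \deg(R) \leq k(\Delta - 1) + 1$, where $\Delta = \Delta(R)$;
We can compute in linear time disjoint $k$-ary matchings $M_1, \ldots, M_\ell$ with $\ell \leq k(\Delta - 1) + 1$ such that $R = \bigcup_{j \in [\ell]} M_j$.
The fact that $\Delta \leq \deg(R)$ is obvious. To show that $\deg(R) \leq k(\Delta - 1) + 1$, note that any edge in $\mathcal{H}(R)$ is adjacent to at most $k(\Delta - 1)$ other edges of $\mathcal{H}(R)$, hence greedy coloring can color the edges of $\mathcal{H}(R)$ in linear time using $k(\Delta - 1) + 1$ colors. ∎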
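The greedy argument in the proof can be sketched in a few lines of Python. This is a first-fit illustration with hypothetical names, assuming the relation is given as a list of distinct tuples of equal arity; it is not tuned to achieve the linear-time bound in all parameters. Each tuple is placed into the first matching in which none of its (position, value) pairs is already taken.

```python
def decompose_into_matchings(relation):
    """First-fit decomposition of a relation into disjoint matchings."""
    matchings = []   # matchings[m] is a list of tuples
    used = []        # used[m] is the set of (position, value) pairs in matchings[m]
    for t in relation:
        keys = set(enumerate(t))
        for m, taken in enumerate(used):
            if taken.isdisjoint(keys):       # t conflicts with no tuple of matching m
                matchings[m].append(t)
                taken |= keys
                break
        else:                                # no existing matching fits: open a new one
            matchings.append([t])
            used.append(set(keys))
    return matchings

def is_matching(tuples):
    """Distinct tuples form a matching iff no value repeats within a position."""
    arity = len(tuples[0]) if tuples else 0
    return all(len({t[i] for t in tuples}) == len(tuples) for i in range(arity))

R = [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'c')]
parts = decompose_into_matchings(R)
```

A tuple conflicts with at most $k(\Delta - 1)$ other tuples, so first-fit never opens more than $k(\Delta - 1) + 1$ matchings; on the example relation, with $k = 2$ and $\Delta = 2$, it uses two.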
3 Untangling bounded-degree relations
In this section we introduce a rewriting of queries defined by Eq. (1), where for every hyperedge $S \in \overline{\mathcal{E}}$ the relation $R_S$ has bounded degree, into queries with so-called not-all-equal predicates.
Definition 3.1 (Not-all-equal).
Let $k$ be an integer, and $S$ be a set of $k$ integers. The relation $\text{NAE}_k(\mathbf{X}_S)$, or $\text{NAE}(\mathbf{X}_S)$ for simplicity, holds true iff not all variables in $\mathbf{X}_S$ are equal:

$$\text{NAE}(\mathbf{X}_S) := \neg \bigwedge_{i, j \in S} (X_i = X_j).$$

The disequality ($\neq$) relation is exactly $\text{NAE}_2$. The negation of a matching is connected to NAE predicates as follows.
Let be a -ary matching, where . For any , define the unary relation and the binary relation . For any , it holds that
The intuition for this rewriting is as follows. A value occurs in at most one tuple in the matching . Therefore, any value in a tuple determines the rest of the tuple. The rewriting in (9) first turns every tuple in into a tuple of equal values. The negation of consists of tuples of not-all-equal values.
We next prove that the rewriting is correct.
In one direction, consider a tuple , i.e., holds, and suppose for all . This means, for every , there is a unique tuple such that . Define for all . The tuple satisfies , for all . Moreover, one can verify that holds. In particular, if for all , then all tuples are the same tuple (since is a matching) and that tuple is . Hence which is a contradiction.
Conversely, suppose there exists a tuple satisfying the right hand side of (9). If for any , then , i.e., satisfies the left hand side of (9). Now, suppose for all . Suppose to the contrary that . Then, for all we have since must hold. This means that does not hold. This contradicts our hypothesis. ∎
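The rewriting can be sanity-checked by brute force on a small matching. The Python sketch below assumes one concrete shape of the rewriting, which is an assumption of this illustration and may differ in detail from Eq. (9): the unary relation $W_i$ holds the domain values that do not occur at position $i$ of the matching, the binary relation $M_i$ pairs the first component of each tuple with its $i$-th component, and the negation of the matching holds iff some $x_i$ is in $W_i$ or there is a tuple $\mathbf{y}$ with $M_i(y_i, x_i)$ for all $i$ and $\text{NAE}(\mathbf{y})$. All names are hypothetical, and a single shared domain is used for simplicity.

```python
from itertools import product

def nae(values):
    """Not-all-equal: at least two of the values differ."""
    return len(set(values)) > 1

def negated_matching_rhs(x, matching, domain):
    """Right-hand side of the assumed rewriting, evaluated at the tuple x."""
    k = len(x)
    # W[i]: domain values that never occur at position i of the matching
    W = [set(domain) - {t[i] for t in matching} for i in range(k)]
    # M[i]: pairs (first component, i-th component) of the matching tuples
    M = [{(t[0], t[i]) for t in matching} for i in range(k)]
    if any(x[i] in W[i] for i in range(k)):
        return True
    # existential quantification over the fresh variables y_1, ..., y_k
    return any(all((y[i], x[i]) in M[i] for i in range(k)) and nae(y)
               for y in product(domain, repeat=k))

domain = range(4)
matching = {(0, 1, 2), (1, 2, 3), (2, 3, 0)}
```

Because a value at a given position determines the whole tuple of a matching, the fresh variables are in fact functionally determined by $\mathbf{x}$; the exhaustive search over $\mathbf{y}$ is kept only to mirror the existential quantifier.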
We use the connection to NAE predicates to decompose a query containing a conjunction of negated bounded-degree relations as follows. In the following, let and denote the fractional hypertree width and submodular width of the conjunctive query (see Definitions A.11 and A.16 respectively).
Let be the query defined in Eq. (1): . We can compute in linear time a collection of hypergraphs such that
and is the body of a conjunctive query satisfying
Furthermore, the number of queries is bounded by .
From Proposition 2.3, each relation can be written as a disjoint union of matchings , . These matchings can be computed in linear time. Hence, the second half of the body of query can be rewritten equivalently as
To simplify notation, let denote the multiset of edges obtained from by duplicating the edge exactly times. Furthermore, for the -th copy of , associate the matching with the copy of in ; use to denote the matching corresponding to that copy. Then, we can write equivalently
For each , fix an arbitrary integer . From Proposition 3.2, the negation of can be written as
where is a unary relation on variable , and is a tuple of fresh variables, only associated with (the copy of) . In particular, if and are two distinct items in the multiset , then and are two distinct variables.
Each negated term is thus expressed as a disjunction of positive terms. We can then express the conjunction of negated terms as the disjunction of conjunctions. For this, define a collection of tuples . In particular, every member is a tuple where . The second half of the body of query can be rewritten equivalently as
The original query is equivalent to the disjunction
of up to queries defined by
In the above definition of , let us denote all but the last conjunction of NAE predicates by . From Lemma A.21, we have , and . The second line is a conjunction of NAE predicates. Since each is repeated at most times in , it follows that the number of conjunctive queries is at most . ∎
4 Boolean tensor decomposition
Thanks to the untangling result in Proposition 3.3, this section concentrates on answering queries of the reduced form (10). To deal with the conjunction of NAE predicates, we first describe how to construct a Boolean tensor decomposition of a conjunction of NAE predicates. This conjunction has the multi-hypergraph , where is the set of all of its variables and is the multi-set of NAE predicates.
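Lemma 4.1.
Let $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ be the multi-hypergraph of a conjunction $\bigwedge_{S \in \mathcal{E}} \text{NAE}(\mathbf{X}_S)$, $N$ an upper bound on the domain sizes for variables $(X_i)_{i \in \mathcal{V}}$, and $c$ a positive integer. Suppose there exists a family $\mathcal{F}$ of functions $f : [N] \to [c]$ satisfying the following property:

for any proper $N$-coloring $h : \mathcal{V} \to [N]$ of $\mathcal{H}$ there exists a function $f \in \mathcal{F}$ such that $f \circ h$ is a proper $c$-coloring of $\mathcal{H}$. (12)

Then, the following holds:

$$\bigwedge_{S \in \mathcal{E}} \text{NAE}(\mathbf{X}_S) \;\equiv\; \bigvee_{g} \bigvee_{f \in \mathcal{F}} \bigwedge_{i \in \mathcal{V}} f(X_i) = g(i), \tag{13}$$

where $g$ ranges over all proper $c$-colorings of $\mathcal{H}$. In particular, the Boolean tensor rank of the left-hand side of (13) is bounded by $P(\mathcal{H}, c) \cdot |\mathcal{F}|$.
Let $\mathbf{x}$ denote any tuple satisfying the LHS of (13). Define $h : \mathcal{V} \to [N]$ by setting $h(i) = x_i$. Then $h$ is a proper $N$-coloring of $\mathcal{H}$, which means there exists $f \in \mathcal{F}$ such that $g = f \circ h$ is a proper $c$-coloring of $\mathcal{H}$. Then the conjunct on the RHS corresponding to this particular pair $(g, f)$ is satisfied.
Conversely, let $\mathbf{x}$ denote any tuple satisfying the RHS of (13). Then, there is a pair $(g, f)$ whose corresponding conjunct on the RHS of (13) is satisfied, i.e., $f(x_i) = g(i)$ for all $i \in \mathcal{V}$. Recall that $g$ is a proper $c$-coloring of $\mathcal{H}$. If there exists $S \in \mathcal{E}$ such that $\text{NAE}(\mathbf{x}_S)$ does not hold, then $x_i = x_j$ for all $i, j \in S$, implying $g(i) = f(x_i) = f(x_j) = g(j)$ for all $i, j \in S$, contradicting the fact that $g$ is a proper coloring.
For the Boolean rank statement, note that (13) is a Boolean tensor decomposition of the formula $\bigwedge_{S \in \mathcal{E}} \text{NAE}(\mathbf{X}_S)$, because $f(X_i) = g(i)$ is a unary predicate on variable $X_i$. This predicate is of size bounded by $N$. ∎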
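The statement can be verified exhaustively on a toy instance. The Python sketch below takes a path of two disequalities on three variables over a domain of size 4 and uses two colors; the family of indicator functions "is the value equal to $a$?" plays the role of the function family, and, as a contrast, the family of bit functions is shown to violate property (12) for this hypergraph. All names are hypothetical, values and vertices are 0-based, and each function is represented as a tuple indexed by domain value.

```python
from itertools import product

def is_proper(coloring, edges):
    """Every hyperedge sees at least two distinct colors."""
    return all(len({coloring[v] for v in e}) > 1 for e in edges)

def proper_colorings(n_vertices, edges, n_colors):
    return [h for h in product(range(n_colors), repeat=n_vertices)
            if is_proper(h, edges)]

def satisfies_12(family, n_vertices, edges, N):
    """Property (12): each proper N-coloring h has some f with f∘h proper."""
    return all(any(is_proper(tuple(f[value] for value in h), edges) for f in family)
               for h in proper_colorings(n_vertices, edges, N))

def rhs_13(x, family, colorings):
    """Disjunction over pairs (g, f) of the conjunction of [f(x_i) = g(i)]."""
    return any(all(f[x[i]] == g[i] for i in range(len(x)))
               for g in colorings for f in family)

N, c, n_vertices = 4, 2, 3
edges = [(0, 1), (1, 2)]                     # X0 != X1 and X1 != X2
indicator_family = [tuple(1 if v == a else 0 for v in range(N)) for a in range(N)]
bit_family = [tuple((v >> b) & 1 for v in range(N)) for b in range(2)]
two_colorings = proper_colorings(n_vertices, edges, c)
```

In this instance the decomposition has at most $|\text{two\_colorings}| \cdot |\text{indicator\_family}| = 2 \cdot 4$ terms, matching the rank bound of the lemma.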
To explain how Lemma 4.1 can be applied, we present two techniques, showing the intimate connections of this problem to combinatorial group testing and perfect hashing.
Example 4.2 (Connection to group testing).
Consider the case when the graph is a -star, i.e., a tree with a center vertex and leaf vertices. Let be a binary -disjunct matrix, which can be constructed in time (cf. Section A.4). We can assume to avoid triviality. Consider a family of functions constructed as follows: there is a function for every row of , where , for all . The family has size . We show that satisfies condition (12). Let denote any coloring of the star. Let be the color assigns to the center, and be the set of colors assigned to the leaf nodes. Clearly . Hence, there is a function for which and for all , implying is a proper -coloring of .
A consequence of our observation is that for a -star the conjunction has Boolean rank bounded by . ∎
Example 4.3 (Connection to perfect hashing).
Consider now the case when the graph is a -clique. Let denote any -perfect hash family, i.e., a family of hash functions from such that for every subset of size , there is a function in the family for which its image is also of size . It is easy to see that this hash family satisfies (12). From [DBLP:journals/talg/AlonMS06], it is known that we can construct in polytime an -perfect hash family of size . However, it is not clear what the runtime exponent of their construction is. What we need for our application is that the construction should run in linear data-complexity and polynomial in query complexity. We use below a result from [DBLP:journals/tit/PoratR11] to exhibit such a construction; furthermore, our hash family has size only . ∎
We next bound the size of the smallest family satisfying Lemma 4.1 using the probabilistic method [AlonSpencer:probabilistic]. We also specify how to derandomize the probabilistic construction of to obtain a deterministic algorithm. For this, we need some terminology.
Every coloring of induces a homomorphic image