Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems

03/27/2018 ∙ by Hung Q. Ngo, et al.

Worst-case optimal join algorithms are the class of join algorithms whose runtimes match the worst-case output size of a given join query. While the first provably worst-case optimal join algorithm was discovered relatively recently, the techniques and results surrounding these algorithms grow out of decades of research from a wide range of areas, intimately connecting graph theory, algorithms, information theory, constraint satisfaction, database theory, and geometric inequalities. These ideas are not just paperware: in addition to academic project implementations, two variations of such algorithms are the work-horse join algorithms of commercial database and data analytics engines. This paper aims to be a brief introduction to the design and analysis of worst-case optimal join algorithms. We discuss the key techniques for proving runtime and output size bounds. We particularly focus on the fascinating connection between join algorithms and information-theoretic inequalities, and on the idea of how one can turn a proof into an algorithm. Finally, we conclude with a representative list of fundamental open problems in this area.


1. Introduction

1.1. Overview

Relational database query evaluation is one of the most well-studied problems in computer science. Theoretically, even special cases of the problem are already equivalent to fundamental problems in other areas; for example, queries on one edge relation can already express various graph problems, conjunctive query evaluation is deeply rooted in finite model theory and constraint satisfaction [18, 43, 23, 65, 57], and the aggregation version is inference in discrete graphical models [5]. Practically, relational database management systems (RDBMS) are ubiquitous and commercially very successful, with almost 50 years of finely-tuned query evaluation algorithms and heuristics [61, 31, 13].

In the last decade or so there have emerged fundamentally new ideas on the three key problems of a relational database engine: (1) constructing query plans, (2) bounding intermediate or output size, and (3) evaluating (intermediate) queries. The new query plans are based on variable elimination and equivalently tree decompositions [5, 6, 29]. The new (tight) size bounds are information-theoretic, taking into account in a principled way input statistics and functional dependencies [32, 12, 30, 7, 8]. The new algorithms evaluate the multiway join operator in a worst-case optimal manner [52, 66, 51, 7, 8], which is provably asymptotically better than the one-pair-at-a-time join paradigm.

These fresh developments are exciting both from the theoretical and from the practical standpoint. On the theory side, these results demonstrate a beautiful synergy and interplay of ideas from many different research areas: algorithms, extremal combinatorics, parameterized complexity, information theory, databases, machine learning, and constraint satisfaction. We will briefly mention some of these connections in Sec. 1.2 below. On the practice side, these results offer their assistance “just in time” for the ever-demanding modern data analytics workloads. The generality and asymptotic complexity advantage of these algorithms open wider the Pandora’s box of true “in-database” graph processing, machine learning, large-scale inference, and constraint solving [50, 2, 3, 22, 25, 36, 11, 54, 55, 59, 45, 34].

The reader is referred to [5, 6] for descriptions of the generality of the types of queries the new style of query plans can help answer. In particular, one should keep in mind that the bounds and algorithms described in this paper apply to aggregate queries in a very general setting, of which conjunctive queries form a special case. The focus of this paper is on the other two developments: output size bounds and worst-case optimal join (WCOJ) algorithms.

Roughly speaking, a WCOJ algorithm is a join algorithm evaluating a full conjunctive query in time that is proportional to the worst-case output size of the query. More precisely, we are given a query along with a set DC of “constraints” the input database is promised to satisfy. The simplest form of constraints consists of the sizes of the input relations; these are called cardinality constraints. The second form of constraints is prevalent in RDBMSs, that of functional dependencies (FD). We shall refer to them as FD constraints. These constraints say that, if we fix the bindings of a set X of variables, then there is at most one binding for the variables in another set Y. More generally, there are degree constraints, which guarantee that for any fixed binding of the variables in X, there are at most some given number N_{Y|X} of bindings of the variables in Y. Degree constraints generalize both cardinality and FD constraints, because cardinality constraints correspond to degree constraints when X = ∅.

We write D ⊨ DC to denote the fact that the database D satisfies the degree constraints in DC. The worst-case output size problem is to determine the quantity

(1)    max_{D ⊨ DC} |Q(D)|,

and a WCOJ algorithm runs in time Õ(max_{D ⊨ DC} |Q(D)|), where Õ hides a logarithmic factor in the data size and some query-size dependent factor. In what follows we present a brief overview of the history of results on determining (1) and on associated WCOJ algorithms.

Independent of WCOJ algorithms, the role of bounding and estimating the output size in a query optimizer is of great importance, as estimation errors propagate and sometimes are as good as random guesses, leading to bad query plans [39]. Hence, as we enrich the class of constraints allowable in the DC set (say, from upper degree bounds to histogram information or, more generally, various statistical properties of the input), one should expect the problem of determining (1) or its expectation to gain more prominence in any RDBMS.

The role of determining and computing (1) in the design of WCOJ algorithms, on the other hand, has a fascinating and different bent: we can turn a mathematical proof of a bound for (1) into an algorithm; different proof strategies yield different classes of algorithms with their own pros and cons. Deriving the bound is not only important for analyzing the runtime of the algorithm, but also instrumental in how one thinks about designing the algorithm in the first place. Another significant role that the problem (1) plays is in its deep connection with information theory and (abstract algebraic) group theory. This paper aims to be a guided tour of these connections.

The notion of worst-case optimality has influenced a couple of other lines of inquiries, in parallel query processing [42, 44], and join processing in the IO model [37]. Furthermore, more than just paperware, WCOJ algorithms have found their way to academic data management and analytic systems [19, 1, 41, 10, 56], and are part of two commercial data analytic engines at LogicBlox [11] and RelationalAI.

The author is deeply indebted to Mahmoud Abo Khamis and Dan Suciu, whose insights, enthusiasm, and collaborative effort (both on [7, 8] and off official publication records) in the past few years have helped form the skeleton of the story that this article is attempting to tell. The technical landscape has evolved drastically from an early exposition on the topic [52].

1.2. A brief history of bounds and algorithms

We start our history tour with the bound (1) in the simple setting when all constraints in DC are cardinality constraints. Consider, for example, the following “triangle query”:

(2)    Q△(A, B, C) ← R(A, B) ∧ S(B, C) ∧ T(A, C)

While simple, this is not a toy query. In social network analysis, counting and enumerating the triangles in a large graph is an important problem, which corresponds to (2) with R = S = T being the edge relation of the graph. There is a large literature on trying to speed up this one query; see, e.g., [64, 63, 15] and references therein.

One way to think about the output size bound is to think of Q as containing points (a, b, c) in a three-dimensional space, whose projection onto the AB-plane is contained in R, onto the BC-plane is contained in S, and onto the AC-plane is contained in T. There is a known geometric inequality shown by Loomis and Whitney in 1949 [46] which addresses a more general problem: bound the volume of a convex body in space whose shadows on the coordinate hyperplanes have bounded areas. The triangle query above corresponds to the discrete measure case, where “volume” becomes “count”. Specializing to the triangle, Loomis-Whitney states that |Q| ≤ √(|R| · |S| · |T|). Thus, while studied in a completely different context, Loomis-Whitney’s inequality is our earliest known answer to determining (1) for a special class of join queries. In [51, 52], we referred to these as the Loomis-Whitney queries: those are queries where every input atom contains all but one variable.

In a different context, in 1981 Noga Alon [9] studied problem (1) in the case where we want to determine the maximum number of occurrences of a given subgraph H in a large graph G. (H is the query’s body, and G is the database.) Alon’s interest was to determine the asymptotic behavior of this count, but his formula was also asymptotically tight. In the triangle case, for example, Alon’s bound is O(m^{3/2}), the same as that of Loomis-Whitney. Here, m is the number of edges in G. Alon’s general bound is O(m^{ρ*(H)}), where ρ*(H) denotes the “fractional edge cover number” of H (see Section 3). However, his results were not formulated in this more modern language.

A paper by Chung et al. [20] on extremal set theory was especially influential in our story. The paper proved the “Product Theorem” which uses the entropy argument connecting a count estimation problem to an entropic inequality. We will see this argument in action in Section 4. The Product theorem is proved using what is now known as Shearer’s lemma; a clean formulation and a nice proof of this lemma was given by Radhakrishnan [58].

In 1995, Bollobás and Thomason [16] proved a vast generalization of Loomis-Whitney’s result. Their bound, when specialized down to the discrete measure and our problem, implies what is now known as the AGM-bound (see below and Corollary 4.2). The equivalence was shown in Ngo et al. [51]. The key influence of Bollobás-Thomason’s result on our story was not the bound, which can be obtained through Shearer’s lemma already, but the inductive proof based on Hölder’s inequality. Their inductive proof suggests a natural recursive algorithm and its analysis, which led to the algorithms in [51, 52].

Independently, in 1996 Friedgut and Kahn [27] generalized Alon’s earlier result from graphs to hypergraphs, showing that the maximum number of copies of a hypergraph H inside another hypergraph with m edges is Θ(m^{ρ*(H)}). Their argument uses the Product Theorem from Chung et al. [20]. The entropic argument was used in Friedgut’s 2004 paper [26] to prove a beautiful inequality, which we shall call Friedgut’s inequality. In Theorem 4.1 we present an essentially equivalent version of Friedgut’s inequality formulated in a more database-friendly way, and prove it using the inductive argument from Bollobás-Thomason. Friedgut’s inequality not only implies the AGM-bound as a special case, but also can be used in analyzing the backtracking search algorithm presented in Section 5. Theorem 4.1 was stated and used in Beame et al. [14] to analyze parallel query processing; Friedgut’s inequality is starting to take root in database theory.

Grohe and Marx ([32, 33], 2006) were pushing boundaries on the parameterized complexity of constraint satisfaction problems. One question they considered was to determine the maximum possible number of solutions of a sub-problem defined within a bag of a tree decomposition, given that the input constraints were presented in the listing representation. This is exactly our problem (1) above, and they proved the bound of N^{ρ*(H)} using Shearer’s lemma, where N is the input size and ρ*(H) denotes the fractional edge cover number of the query’s hypergraph H. They also presented a join-project query plan running in time O(N^{ρ*(H)+1}), which is almost worst-case optimal.

Atserias et al. ([12], 2008) applied the same argument to conjunctive queries, showing what is now known as the AGM-bound. More importantly, they proved that the bound is asymptotically tight and studied the average-case output size. The other interesting result from [12] was on the algorithmic side. They showed that there is a class of queries for which a join-project plan evaluates them in time polynomial in the input size N, while any join-only plan requires time N^{Ω(log k)}, where k is the query size. In particular, join-project plans are strictly better than join-only plans.

Continuing with this line of investigation, Ngo et al. (2012, [51]) presented the NPRR algorithm, which runs in time matching the AGM bound, and presented a class of queries for which any join-project plan is asymptotically worse by a polynomial (in the input size) factor that depends on the query size. The class of queries contains the Loomis-Whitney queries. The NPRR algorithm and its analysis were overly complicated. Upon learning about this result, Todd Veldhuizen of LogicBlox realized that his Leapfrog Triejoin algorithm (LFTJ) can also achieve the same runtime, with a simpler proof. LFTJ is the work-horse join algorithm of LogicBlox’s Datalog engine, and had been implemented since about 2009. Veldhuizen published his findings in 2014 [66]. Inspired by LFTJ and its simplicity, Ngo et al. [52] presented a simple recursive algorithm called Generic-Join, which also has a compact analysis.

The next set of results extends DC to more than just cardinality constraints. Gottlob et al. [30] extended the AGM bound to handle FD constraints, using the entropy argument. They also proved that the bound is tight if all FDs are simple FDs. Abo Khamis et al. [7] observed that the same argument generalizes the bound to general degree constraints, and addressed the corresponding algorithmic question. The bound was studied under the FD-closure lattice formalism, where they showed that the bound is tight if the lattice is a distributive lattice. This result is a generalization of Gottlob et al.’s result on the tightness of the bound under simple FDs. The connection to information-theoretic inequalities and the idea of turning an inequality proof into an algorithm were also developed with the CSMA algorithm in [7]. However, that algorithm was also rather complicated, and its analysis was somewhat convoluted in places.

Finally, in [8] we developed a new collection of bounds and proved their tightness and looseness for disjunctive datalog rules, a generalization of conjunctive queries. It turns out that under general degree constraints there are two natural classes of bounds for the quantity (1): the entropic bounds are tight but we do not know how to compute them, and the relaxed versions, called polymatroid bounds, are computable via a linear program. When there are only cardinality constraints, these two bounds collapse into one (the AGM bound). We discuss some of these results in Section 4. The idea of reasoning about algorithms’ runtimes using Shannon-type inequalities was also developed to a much more mature and more elegant degree, with an accompanying algorithm called PANDA. We discuss what PANDA achieves in more detail in Section 5.

1.3. Organization

The rest of the paper is organized as follows. Section 2 gently introduces two ways of bounding the output size and two corresponding algorithms using the triangle query. Section 3 presents notations, terminology, and brief background material on information theory and properties of entropy functions. Section 4 describes two bounds and two methods for proving output size bounds on a query given degree constraints. This section contains some proofs and observations that have not appeared elsewhere. Section 5 presents two algorithms evolving naturally from the two bound-proving strategies presented earlier. Finally, Section 6 lists selected open problems arising from the topics discussed in the paper.

2. The triangle query

The simplest non-trivial example illustrating the power of WCOJ algorithms is the triangle query (2). We use this example to illustrate several ideas: the entropy argument, two ways of proving an output size bound, and how to derive algorithms from them. The main objective of this section is to gently illustrate some of the main reasoning techniques involved in deriving the bound and the algorithm; we purposefully do not present the most compact proofs. At the end of the section, we raise natural questions regarding the assumptions made on the bound and algorithm to motivate the more general problem formulation discussed in the rest of the paper.

The bound

Let Dom(A), Dom(B), and Dom(C) denote the domains of attributes A, B, and C. Construct a distribution on Dom(A) × Dom(B) × Dom(C) by selecting a triple (a, b, c) uniformly from the output Q. Let h denote the entropy function of this distribution; namely, for any F ⊆ {A, B, C}, h(F) denotes the entropy of the marginal distribution on the variables in F. Then, the following hold:

h(ABC) = log |Q|,    h(AB) ≤ log |R|,    h(BC) ≤ log |S|,    h(AC) ≤ log |T|.

The first inequality holds because the support of the marginal distribution on (A, B) is a subset of R, and the entropy of a distribution is bounded by the log of the size of its support (see Section 3.2). The other two inequalities hold for the same reason. Hence, whenever there are non-negative coefficients δ_R, δ_S, δ_T for which

(3)    δ_R · h(AB) + δ_S · h(BC) + δ_T · h(AC) ≥ h(ABC)

holds for all entropy functions h, we can derive an output size bound for the triangle query:

(4)    |Q| ≤ |R|^{δ_R} · |S|^{δ_S} · |T|^{δ_T}.

In Section 4 we will show that (3) holds for all entropy functions h if and only if δ_R + δ_S ≥ 1, δ_S + δ_T ≥ 1, and δ_R + δ_T ≥ 1, for non-negative δ_R, δ_S, δ_T. This fact is known as Shearer’s inequality, though Shearer’s original statement is weaker than what was just stated.

One consequence of Shearer’s inequality is that, to obtain the best possible bound, we want to minimize the right-hand side (RHS) of (4) subject to the above constraints:

(5)    min    δ_R · log |R| + δ_S · log |S| + δ_T · log |T|
(6)    s.t.
(7)           δ_R + δ_S ≥ 1
(8)           δ_S + δ_T ≥ 1
(9)           δ_R + δ_T ≥ 1,    δ_R, δ_S, δ_T ≥ 0

This bound is known as the AGM-bound for the triangle query. It is a direct consequence of Friedgut’s inequality (Theorem 4.1).
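To make the optimization concrete, here is a small Python sketch (ours, not from any cited system; the function name and the direct vertex enumeration are illustrative choices) that evaluates the objective of LP (5) at the four vertices of its feasible region (listed in the next paragraph) and returns the optimal exponents together with the implied output size bound.

import math

def triangle_agm_bound(nR, nS, nT):
    """Evaluate the LP (5) objective at the four vertices of its feasible
    region and return the best (delta_R, delta_S, delta_T) plus the implied
    bound |R|^dR * |S|^dS * |T|^dT.  nR, nS, nT are relation cardinalities."""
    vertices = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (0.5, 0.5, 0.5)]
    best = min(vertices,
               key=lambda d: d[0] * math.log(nR) + d[1] * math.log(nS) + d[2] * math.log(nT))
    bound = nR ** best[0] * nS ** best[1] * nT ** best[2]
    return best, bound

# Example: three relations of one million tuples each.
# The optimum is (1/2, 1/2, 1/2), giving the bound 10^9 rather than 10^12.
print(triangle_agm_bound(10**6, 10**6, 10**6))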

Algorithms

Let (δ_R, δ_S, δ_T) denote an optimal solution to the LP (5) above. A WCOJ algorithm needs to be able to answer Q in time Õ(|R|^{δ_R} · |S|^{δ_S} · |T|^{δ_T}), where Õ hides a single log factor. The optimum of (5) is attained at a vertex of its feasible region; the vertices are (1, 1, 0), (1, 0, 1), (0, 1, 1), and (1/2, 1/2, 1/2). Without loss of generality, we can assume that (δ_R, δ_S, δ_T) is one of these vertices. If it is one of the first three, say (1, 1, 0), then a traditional join plan (e.g., computing R ⋈ S and filtering against T) has the desired runtime of Õ(|R| · |S|), modulo preprocessing time.

Consequently, the only interesting case is when (δ_R, δ_S, δ_T) = (1/2, 1/2, 1/2). It is easy to see that this vertex is optimal for LP (5) when the product of the sizes of any two relations from R, S, and T is greater than the size of the third relation. To design an algorithm running in Õ(√(|R| · |S| · |T|))-time, we draw inspiration from two different proofs of the bound (5).

First Algorithm. Write 1_E to denote the indicator variable for the Boolean event E; for example, 1_{(a,b)∈R} is 1 if (a, b) ∈ R and 0 otherwise. Let σ denote the relational selection operator. The Bollobás-Thomason argument for proving (5) goes as follows.

(10)    |Q| = Σ_{a,b,c} 1_{(a,b)∈R} · 1_{(b,c)∈S} · 1_{(a,c)∈T}
(11)        = Σ_a Σ_b 1_{(a,b)∈R} · Σ_c 1_{(b,c)∈S} · 1_{(a,c)∈T}
(12)        ≤ Σ_a Σ_b 1_{(a,b)∈R} · (Σ_c 1_{(b,c)∈S})^{1/2} · (Σ_c 1_{(a,c)∈T})^{1/2}
(13)        = Σ_a (Σ_c 1_{(a,c)∈T})^{1/2} · Σ_b 1_{(a,b)∈R} · |σ_{B=b}S|^{1/2}
(14)        ≤ Σ_a |σ_{A=a}T|^{1/2} · (Σ_b 1_{(a,b)∈R})^{1/2} · (Σ_b |σ_{B=b}S|)^{1/2}
(15)        = |S|^{1/2} · Σ_a |σ_{A=a}T|^{1/2} · |σ_{A=a}R|^{1/2}
(16)        ≤ |S|^{1/2} · (Σ_a |σ_{A=a}T|)^{1/2} · (Σ_a |σ_{A=a}R|)^{1/2}
(17)        = |S|^{1/2} · |T|^{1/2} · |R|^{1/2}
(18)        = √(|R| · |S| · |T|)

All three inequalities follow from Cauchy-Schwarz. Tracing the inequalities back to (10), Algorithm 1 emerges.

for a ∈ π_A(R) ∩ π_A(T) do
       for b ∈ π_B(σ_{A=a}R) ∩ π_B(S) do
             for c ∈ π_C(σ_{B=b}S) ∩ π_C(σ_{A=a}T) do
                    Report (a, b, c);
             end for
       end for
end for
Algorithm 1 based on Hölder’s inequality proof

The analysis is based on only a single assumption: that we can loop through the intersection of two sets X and Y in time bounded by Õ(min(|X|, |Y|)). This property can be satisfied with sort-merge or with a simple hash join, where we iterate through the smaller of the two sets and look up each element in a hash table of the other. For a fixed binding (a, b), the inner-most loop runs in time

Õ(min(|σ_{B=b}S|, |σ_{A=a}T|)).

A binding (a, b) gets to the inner-most loop only if (a, b) ∈ R, and so the total amount of work is

(19)    Õ( Σ_{(a,b)∈R} min(|σ_{B=b}S|, |σ_{A=a}T|) )  ≤  Õ( Σ_{(a,b)∈R} |σ_{B=b}S|^{1/2} · |σ_{A=a}T|^{1/2} ).

Compare this with (13), and the runtime analysis is complete.
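For concreteness, the following Python sketch (our own illustration, assuming the relations are given as in-memory sets of pairs; the index-building helpers are not part of any cited system) implements Algorithm 1 with hash indexes.

from collections import defaultdict

def triangle_join(R, S, T):
    """Worst-case optimal triangle join in the style of Algorithm 1.
    R, S, T are sets of pairs over attributes (A,B), (B,C), (A,C)."""
    R_a = defaultdict(set)   # a -> {b : (a,b) in R}
    S_b = defaultdict(set)   # b -> {c : (b,c) in S}
    T_a = defaultdict(set)   # a -> {c : (a,c) in T}
    for a, b in R: R_a[a].add(b)
    for b, c in S: S_b[b].add(c)
    for a, c in T: T_a[a].add(c)

    out = []
    # The analysis assumes each intersection can be enumerated in time about
    # the size of the smaller operand; Python's set intersection is close
    # enough for illustration.
    for a in R_a.keys() & T_a.keys():          # a in pi_A(R) ∩ pi_A(T)
        for b in R_a[a] & S_b.keys():          # b in pi_B(sigma_{A=a}R) ∩ pi_B(S)
            for c in S_b[b] & T_a[a]:          # c in pi_C(sigma_{B=b}S) ∩ pi_C(sigma_{A=a}T)
                out.append((a, b, c))
    return out

# Usage on a tiny instance containing exactly one triangle (1, 2, 3):
R = {(1, 2), (1, 4)}; S = {(2, 3), (4, 5)}; T = {(1, 3)}
print(triangle_join(R, S, T))   # [(1, 2, 3)]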

Second Algorithm. This algorithm is inspired by a proof of inequality (3), which implies (5). In this particular case (3) can be written as

(20)    (1/2) · h(AB) + (1/2) · h(BC) + (1/2) · h(AC) ≥ h(ABC).

Using the chain rule (eq. (29)) and the submodularity rule (eq. (33)) for entropic functions, the inequality can be proved as follows.

(21)    h(AB) + h(BC) + h(AC) = h(AB) + [h(B) + h(C | B)] + h(AC)
(22)                          ≥ h(AB) + [h(B) + h(C | AB)] + h(AC)
(23)                          = [h(AB) + h(C | AB)] + [h(B) + h(AC)]
(24)                          ≥ h(ABC) + h(ABC) = 2 · h(ABC)

Dividing both sides by 2 yields (20).

The first replacement, h(BC) = h(B) + h(C | B), is interpreted as a decomposition of the relation S into two parts, “heavy” and “light”. After applying submodularity, two compositions are performed to obtain two copies of h(ABC): h(AB) + h(C | AB) and h(B) + h(AC). These correspond to join operators. Algorithm 2 has the pseudo-code. It is remarkable how closely the algorithm mimics the entropy proof.

B^heavy ← { b ∈ π_B(S) : |σ_{B=b}S| ≥ √(|S| · |T| / |R|) };
Q_1 ← σ_{(a,b)∈R ∧ (b,c)∈S}( B^heavy × T );
Q_2 ← σ_{(a,c)∈T}( R ⋈ σ_{B ∉ B^heavy}(S) );
return Q_1 ∪ Q_2;
Algorithm 2 based on entropy inequality proof

The analysis is also compact. Note that |B^heavy| ≤ |S| / √(|S| · |T| / |R|) = √(|R| · |S| / |T|), and thus computing Q_1 takes time Õ(|B^heavy| · |T|) ≤ Õ(√(|R| · |S| · |T|)). In the other case, every b ∉ B^heavy satisfies |σ_{B=b}S| < √(|S| · |T| / |R|), and thus computing Q_2 takes time Õ( Σ_{(a,b)∈R} |σ_{B=b}S| ) ≤ Õ( |R| · √(|S| · |T| / |R|) ) = Õ(√(|R| · |S| · |T|)). This completes the analysis.
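A possible Python rendering of this heavy/light strategy (our own sketch; the threshold √(|S|·|T|/|R|) and the set-based representation are illustrative assumptions) is the following.

import math
from collections import defaultdict

def triangle_join_heavy_light(R, S, T):
    """Triangle join in the style of Algorithm 2: partition S by the degree of
    its B-value into 'heavy' and 'light' parts and handle the two parts with
    different join orders.  R, S, T are sets of pairs over (A,B), (B,C), (A,C)."""
    S_b = defaultdict(set)
    for b, c in S: S_b[b].add(c)

    # Threshold balancing the two cases: |S||T|/theta = |R|*theta.
    theta = math.sqrt(len(S) * len(T) / max(len(R), 1))
    heavy_b = {b for b, cs in S_b.items() if len(cs) >= theta}

    out = set()
    # Heavy part: at most |S|/theta heavy b's; pair each with T, filter by R and S.
    for b in heavy_b:
        for a, c in T:
            if (a, b) in R and (b, c) in S:
                out.add((a, b, c))
    # Light part: join R with the light part of S (each light b has < theta
    # C-neighbors), then filter by T.
    for a, b in R:
        if b in heavy_b:
            continue
        for c in S_b.get(b, ()):
            if (a, c) in T:
                out.add((a, b, c))
    return out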

Follow-up questions

In a more realistic setting, we know more about the input than just the cardinalities. In a database there may (and will) be FDs. In a graph we may know the maximum degree of a vertex. How do the bounds and algorithms change when we take such information into account? Which of the above two bounds and algorithms generalize better in the vastly more general setting? We explore these questions in the remainder of this paper.

3. Preliminaries

Throughout the paper, we use the following convention. The non-negative reals, rationals, and integers are denoted by ℝ_+, ℚ_+, and ℤ_+, respectively. For a positive integer n, [n] denotes the set {1, …, n}.

Logarithms without a base specified are base 2, i.e., log x = log_2 x. Uppercase X_i denotes a variable/attribute, and lowercase x_i denotes a value in the discrete domain Dom(X_i) of the variable. For any subset F ⊆ [n], define X_F = (X_i)_{i ∈ F} and x_F = (x_i)_{i ∈ F}. In particular, X_F is a tuple of variables and x_F is a tuple of specific values with support F. We also use X, Y, Z to denote sets of variables and x, y, z to denote the corresponding value tuples in the same way.

3.1. Queries and degree constraints

A multi-hypergraph is a hypergraph whose edges may occur more than once. We associate a full conjunctive query Q to a multi-hypergraph H = ([n], E), where E is a multiset of hyperedges F ⊆ [n]; the query is written as

(25)    Q(X_[n]) ← ⋀_{F ∈ E} R_F(X_F),

with variables X_i, i ∈ [n], and atoms R_F, F ∈ E.

Definition 1 (Degree constraint).

A degree constraint is a triple (X, Y, N_{Y|X}), where X ⊊ Y ⊆ [n] and N_{Y|X} ∈ ℤ_+. The relation R_F is said to guard the degree constraint (X, Y, N_{Y|X}) if Y ⊆ F and

(26)    deg_{R_F}(Y | X) := max_{x} | π_Y( σ_{X=x}(R_F) ) | ≤ N_{Y|X}.

Note that a given relation may guard multiple degree constraints. Let DC denote a set of degree constraints. The input database D is said to satisfy DC if every constraint in DC has a guard, in which case we write D ⊨ DC.

A cardinality constraint is an assertion of the form |R_F| ≤ N_F, for some F ∈ E; it is exactly the degree constraint (∅, F, N_F) guarded by R_F. A functional dependency X → Y is a degree constraint of the form (X, X ∪ Y, 1), i.e., with N_{Y|X} = 1. In particular, degree constraints strictly generalize both cardinality constraints and FDs.
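As a concrete illustration, the following Python helper (ours; the list-of-tuples representation and the function name are assumptions) computes deg_R(Y | X) from a listed relation, which is all that is needed to check whether a relation guards a given degree constraint.

from collections import defaultdict

def degree(relation, attrs, X, Y):
    """Compute deg_R(Y | X): the maximum, over bindings x of X, of the number
    of distinct Y-projections among tuples of `relation` agreeing with x.
    `relation` is a list of tuples, `attrs` names its columns, X ⊆ Y ⊆ attrs."""
    xi = [attrs.index(a) for a in X]
    yi = [attrs.index(a) for a in Y]
    groups = defaultdict(set)
    for t in relation:
        groups[tuple(t[i] for i in xi)].add(tuple(t[i] for i in yi))
    return max((len(s) for s in groups.values()), default=0)

# A cardinality constraint is degree(R, attrs, X=[], Y=attrs) <= N_F, and a
# functional dependency X -> Y holds iff degree(R, attrs, X, X+Y) <= 1.
R = [(1, 'a'), (1, 'b'), (2, 'a')]
print(degree(R, ['A', 'B'], [], ['A', 'B']))     # 3  (cardinality of R)
print(degree(R, ['A', 'B'], ['A'], ['A', 'B']))  # 2  (A does not determine B)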

Our problem setting is general: we are given a query of the form (25) and a set DC of degree constraints satisfied by the input database D. The first task is to find a good upper bound on, or to determine exactly, the quantity (1), the worst-case output size of the query given that the input satisfies the degree constraints. The second task is to design an algorithm running in time as close to the bound as possible.

Given a multi-hypergraph H = ([n], E), define its corresponding “fractional edge cover polytope” as the set of all λ = (λ_F)_{F ∈ E} such that λ ≥ 0 and Σ_{F : i ∈ F} λ_F ≥ 1 for every i ∈ [n]. Every point λ in this polytope is called a fractional edge cover of H. The quantity

ρ*(H) := min { Σ_{F ∈ E} λ_F : λ is a fractional edge cover of H }

is called the fractional edge cover number of H.
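As an illustration, the fractional edge cover number can be computed with an off-the-shelf LP solver; the following sketch (ours, using scipy.optimize.linprog; the hypergraph encoding is an assumption) sets up exactly the covering LP above.

from scipy.optimize import linprog

def fractional_edge_cover_number(n, edges):
    """rho*(H) of a (multi-)hypergraph with vertices 0..n-1 and hyperedges
    given as lists of vertices.  Solves:
        min sum_F lambda_F  s.t.  sum_{F : i in F} lambda_F >= 1 for all i, lambda >= 0."""
    c = [1.0] * len(edges)                       # objective: sum of lambda_F
    A_ub, b_ub = [], []
    for i in range(n):                           # one covering constraint per vertex
        A_ub.append([-1.0 if i in F else 0.0 for F in edges])
        b_ub.append(-1.0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(edges))
    return res.fun

# Triangle query: vertices {0, 1, 2} = {A, B, C}, edges AB, BC, AC; rho* = 3/2.
print(fractional_edge_cover_number(3, [[0, 1], [1, 2], [0, 2]]))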

3.2. Information theory

The books [21, 67] are good references on information theory. We extract only the simple facts needed for this paper. Consider a joint probability distribution D on n discrete variables X_1, …, X_n with probability mass function p. The entropy function associated with D is the function h : 2^[n] → ℝ_+ defined by

(27)    h(F) := − Σ_{x_F} p(x_F) · log p(x_F),

the entropy of the marginal distribution on X_F. To simplify notation, we will also write h(X_F) for h(F), turning h into a set function on [n]. For any F ⊆ [n], define the “support” of the marginal distribution on X_F to be

(28)    supp(X_F) := { x_F : p(x_F) > 0 }.

Given A, B ⊆ [n], define the conditional entropy to be

(29)    h(B | A) := h(A ∪ B) − h(A).

This is also known as the chain rule of entropy. The following facts are basic and fundamental in information theory:

(30)    h(∅) = 0 and h(F) ≥ 0 for all F ⊆ [n]
(31)    h(F) ≤ log |supp(X_F)|
(32)    h(F) ≤ h(G) whenever F ⊆ G
(33)    h(F ∪ G) + h(F ∩ G) ≤ h(F) + h(G)

Inequality (31) follows from Jensen’s inequality and the concavity of the log function. Equality holds if and only if the marginal distribution on X_F is uniform. Entropy measures the “amount of uncertainty” we have: the more uniform the distribution, the less certain we are about where a random point is in the space. Inequality (32) is the monotonicity property: adding more variables increases uncertainty. Inequality (33) is the submodularity property: conditioning on more variables reduces uncertainty. (Writing F = A ∪ B and G = A ∪ C, inequality (33) is equivalent to h(B | A ∪ C) ≤ h(B | A).)
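The following Python sketch (ours; the uniform-distribution-over-a-relation setup mirrors the entropy argument of Section 2) computes the empirical entropy function h of a small relation and numerically checks the triangle instance of Shearer’s inequality.

import math
from collections import Counter
from itertools import combinations

def entropy_function(tuples, n):
    """Entropy h(F), in bits, of each marginal of the uniform distribution over
    `tuples` (a list of n-ary tuples).  Returns a dict: frozenset F -> h(F)."""
    def h(F):
        counts = Counter(tuple(t[i] for i in F) for t in tuples)
        total = len(tuples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())
    subsets = [frozenset(F) for k in range(n + 1) for F in combinations(range(n), k)]
    return {F: h(sorted(F)) for F in subsets}

# Uniform distribution over the output of a tiny triangle query.
Q = [(1, 2, 3), (1, 2, 4), (2, 2, 3)]
h = entropy_function(Q, 3)
AB, BC, AC, ABC = frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2}), frozenset({0, 1, 2})
# h(ABC) = log2 |Q|, and Shearer for the triangle: h(AB) + h(BC) + h(AC) >= 2 h(ABC).
print(h[ABC], math.log2(len(Q)))
print(h[AB] + h[BC] + h[AC] >= 2 * h[ABC] - 1e-9)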

A function f : 2^[n] → ℝ_+ is called a (non-negative) set function on [n]. A set function f on [n] is modular if f(F) = Σ_{i ∈ F} f({i}) for all F ⊆ [n], is monotone if f(F) ≤ f(G) whenever F ⊆ G, is subadditive if f(F ∪ G) ≤ f(F) + f(G) for all F, G ⊆ [n], and is submodular if f(F ∪ G) + f(F ∩ G) ≤ f(F) + f(G) for all F, G ⊆ [n]. Let n be a positive integer. A set function f on [n] is said to be entropic if there is a joint distribution on (X_1, …, X_n) with entropy function h such that f(F) = h(F) for all F ⊆ [n]. We will write f(X_F) and f(F) interchangeably, depending on context.

Unless specified otherwise, we will only consider non-negative and monotone set functions f for which f(∅) = 0; this assumption will be implicit in the entire paper.

Definition 2.

Let M_n, SA_n, and Γ_n denote the sets of all (non-negative and monotone) modular, subadditive, and submodular set functions on [n], respectively. Let Γ*_n denote the set of all entropic functions on n variables, and let Γ̄*_n denote its topological closure. The set Γ_n is called the set of polymatroidal functions, or simply polymatroids.

These notations are standard in information theory. It is known [67] that Γ*_n is a cone which is not topologically closed; hence, when optimizing over this cone we take its topological closure Γ̄*_n, which is convex. The sets M_n, SA_n, and Γ_n are clearly polyhedral cones.

As mentioned above, entropic functions satisfy non-negativity, monotonicity, and submodularity. Linear inequalities on entropic functions derived from these three properties are called Shannon-type inequalities. For a very long time, it was widely believed that Shannon-type inequalities form a complete set of linear inequalities satisfied by entropic functions, namely that Γ̄*_n = Γ_n. This indeed holds for n ≤ 3, for example. However, in 1998, in a breakthrough paper in information theory, Zhang and Yeung [68] presented a new inequality which is not implied by Shannon-type inequalities. Their result proved that Γ̄*_n ⊊ Γ_n for any n ≥ 4. Lastly, the following chain of inclusions is known [67]:

(34)    M_n ⊆ Γ̄*_n ⊆ Γ_n ⊆ SA_n.

When n ≥ 4, all of the containments are strict.

4. Output size bounds

This section addresses the following question: given a query Q and a set of degree constraints DC, determine the quantity (1), or at least a good upper bound on it.

4.1. Cardinality constraints only

Friedgut’s inequality is essentially equivalent to Hölder’s inequality. Following Beame et al. [14], who used the inequality to analyze parallel query processing algorithms, we present here a version that is more database-friendly. We also present a proof of Friedgut’s inequality using Hölder’s inequality, applying the same induction strategy used in the proof of Bollobás-Thomason’s inequality [16] and the “query decomposition lemma” in [52].

Theorem 4.1 (Friedgut [26]).

Let Q denote a full conjunctive query with (multi-)hypergraph H = ([n], E) and input relations R_F, F ∈ E. Let λ = (λ_F)_{F ∈ E} denote a fractional edge cover of H. For each F ∈ E, let w_F : Dom(X_F) → ℝ_+ denote an arbitrary non-negative weight function. Then, the following holds:

(35)    Σ_{x ∈ Q} Π_{F ∈ E} w_F(x_F)  ≤  Π_{F ∈ E} ( Σ_{x_F ∈ R_F} w_F(x_F)^{1/λ_F} )^{λ_F}.
Proof.

We induct on n. When n = 1, the inequality is exactly the generalized Hölder inequality [35]. Suppose n ≥ 2 and – for induction purposes – define a new query Q' on the variables X_{[n−1]} whose hypergraph is H' = ([n−1], E'), a new fractional edge cover λ' for H', and new relations and weight functions for each F' ∈ E', as follows:

(36)    E' := { F ∖ {n} : F ∈ E }
(37)    R'_{F∖{n}} := π_{F∖{n}}(R_F),    for F ∈ E with n ∈ F
(38)    R'_F := R_F,    for F ∈ E with n ∉ F
(39)    λ'_{F∖{n}} := λ_F,    for F ∈ E
(40)    w'_{F∖{n}}(x_{F∖{n}}) := ( Σ_{x_n : x_F ∈ R_F} w_F(x_F)^{1/λ_F} )^{λ_F},    for F ∈ E with n ∈ F
(41)    w'_F := w_F,    for F ∈ E with n ∉ F.

Then, by noting that a tuple x = (x_{[n−1]}, x_n) belongs to Q if and only if x_{[n−1]} ∈ Q' and x_F ∈ R_F for every F ∈ E with n ∈ F, we have

Σ_{x ∈ Q} Π_{F ∈ E} w_F(x_F)
    = Σ_{x_{[n−1]} ∈ Q'} ( Π_{F : n ∉ F} w_F(x_F) ) · Σ_{x_n} Π_{F : n ∈ F} 1_{x_F ∈ R_F} · w_F(x_F)
    ≤ Σ_{x_{[n−1]} ∈ Q'} ( Π_{F : n ∉ F} w_F(x_F) ) · Π_{F : n ∈ F} ( Σ_{x_n : x_F ∈ R_F} w_F(x_F)^{1/λ_F} )^{λ_F}
    = Σ_{x_{[n−1]} ∈ Q'} Π_{F' ∈ E'} w'_{F'}(x_{F'})
    ≤ Π_{F' ∈ E'} ( Σ_{x_{F'} ∈ R'_{F'}} w'_{F'}(x_{F'})^{1/λ'_{F'}} )^{λ'_{F'}}
    = Π_{F ∈ E} ( Σ_{x_F ∈ R_F} w_F(x_F)^{1/λ_F} )^{λ_F}.

The first inequality follows from Hölder’s inequality and the fact that λ is a fractional edge cover; in particular, Σ_{F : n ∈ F} λ_F ≥ 1. The second inequality is the induction hypothesis. ∎

By setting all weight functions to be identically 1, we obtain

Corollary 4.2 (AGM-bound [12]).

Given the same setting as that of Theorem 4.1, we have

(42)    |Q| ≤ Π_{F ∈ E} |R_F|^{λ_F}.

In particular, letting N := max_{F ∈ E} |R_F| and taking λ to be an optimal fractional edge cover, we obtain |Q| ≤ N^{ρ*(H)}.

4.2. General degree constraints

To obtain a bound in the general case, we employ the entropy argument, which by now is widely used in extremal combinatorics [40, 20, 58]. In fact, Friedgut [26] proved Theorem 4.1 using an entropy argument too. The particular argument below can be found in the first paper mentioning Shearer’s inequality [20], and a line of follow-up work [27, 58, 30, 7, 8].

Let D be any database instance satisfying the input degree constraints. Construct a distribution on Dom(X_1) × ⋯ × Dom(X_n) by picking a tuple uniformly from the output Q(D). Let h denote the corresponding entropy function. Then, due to uniformity we have h([n]) = log |Q(D)|. Now, consider any degree constraint (X, Y, N_{Y|X}) ∈ DC guarded by an input relation R_F. From (31) it follows that h(Y | X) ≤ log N_{Y|X}. Define the collection HDC of set functions satisfying the degree constraints DC:

HDC := { h : h(Y) − h(X) ≤ log N_{Y|X} for all (X, Y, N_{Y|X}) ∈ DC }.

Then, the entropy argument immediately gives the following result, first explicitly formulated in [8]:

Theorem 4.3 (From [7, 8]).

Let Q be a conjunctive query and DC a given set of degree constraints. Then, for any database D satisfying DC, we have

(43)    log |Q(D)| ≤ max { h([n]) : h ∈ Γ̄*_n ∩ HDC }    (entropic bound)
(44)    log |Q(D)| ≤ max { h([n]) : h ∈ Γ_n ∩ HDC }    (polymatroid bound)

Furthermore, the entropic bound is asymptotically tight and the polymatroid bound is not.

The polymatroid relaxation follows from the chain of inclusions (34); the relaxation is necessary because we do not know how to compute the entropic bound. Also from the chain of inclusions, we remark that while the set M_n is not relevant to our story, we can further relax from Γ_n to SA_n and end up with the integral edge cover number [8].
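To make the linear program behind (44) concrete, here is a sketch (ours; the elemental-inequality encoding and the scipy formulation are assumptions, and the LP has one variable per subset of [n], hence exponential size) of how the polymatroid bound can be computed for small queries.

from itertools import combinations
from scipy.optimize import linprog
import math

def polymatroid_bound(n, constraints):
    """Polymatroid bound (44): maximize h([n]) over monotone, submodular h with
    h(empty) = 0, subject to degree constraints h(Y) - h(X) <= log2(N).
    `constraints` is a list of (X, Y, logN) with X ⊆ Y ⊆ range(n).
    Returns the bound on log2 |Q(D)|."""
    subsets = [frozenset(S) for k in range(n + 1) for S in combinations(range(n), k)]
    idx = {S: i for i, S in enumerate(subsets)}
    V = frozenset(range(n))
    A_ub, b_ub = [], []

    def add(coeffs, rhs):                        # encode sum coeff * h(S) <= rhs
        row = [0.0] * len(subsets)
        for S, c in coeffs: row[idx[S]] += c
        A_ub.append(row); b_ub.append(rhs)

    for i in range(n):                           # monotonicity at the top: h(V\{i}) <= h(V)
        add([(V - {i}, 1.0), (V, -1.0)], 0.0)
    for i, j in combinations(range(n), 2):       # elemental submodularity
        for A in subsets:
            if i in A or j in A: continue
            add([(A | {i, j}, 1.0), (A, 1.0), (A | {i}, -1.0), (A | {j}, -1.0)], 0.0)
    for X, Y, logN in constraints:               # degree constraints: h(Y) - h(X) <= logN
        add([(frozenset(Y), 1.0), (frozenset(X), -1.0)], logN)

    c = [0.0] * len(subsets); c[idx[V]] = -1.0   # maximize h(V)
    A_eq = [[1.0 if i == idx[frozenset()] else 0.0 for i in range(len(subsets))]]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0])
    return -res.fun

# Triangle with |R| = |S| = |T| = N: the bound is (3/2) log2 N, i.e. |Q| <= N^{3/2}.
logN = math.log2(10**6)
print(polymatroid_bound(3, [([], [0, 1], logN), ([], [1, 2], logN), ([], [0, 2], logN)]))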

Table 1, extracted from [8], summarizes our current state of knowledge on the tightness and looseness of these two bounds.

Bound                                           | Entropic Bound                                  | Polymatroid Bound
Definition                                      | max { h([n]) : h ∈ Γ̄*_n ∩ HDC }  (See [30, 7])  | max { h([n]) : h ∈ Γ_n ∩ HDC }  (See [30, 7])
DC contains only cardinality constraints        | AGM bound [33, 12]  (Tight [12])                | AGM bound [33, 12]  (Tight [12])
DC contains only cardinality and FD constraints | Entropic Bound for FD [30]  (Tight [28])        | Polymatroid Bound for FD [30]  (Not tight [8])
DC is a general set of degree constraints       | Entropic Bound for DC [7]  (Tight [8])          | Polymatroid Bound for DC [7]  (Not tight [8])

Table 1. Summary of entropic and polymatroid size bounds for full conjunctive queries along with their tightness properties.

The entropic bound is asymptotically tight, i.e., there are arbitrarily large databases D for which log |Q(D)| approaches the entropic bound. The polymatroid bound is not tight, i.e., there exist a query and degree constraints for which its distance from the entropic bound is arbitrarily large.

The tightness of the entropic bound is proved using a very interesting connection between information theory and group theory, first observed by Chan and Yeung [17]. Basically, given any entropic function h satisfying the degree constraints, one can construct a database instance D which satisfies all degree constraints in DC and for which log |Q(D)| essentially matches h([n]). The database instance is constructed from a system of (algebraic) groups derived from the entropic function.

The looseness of the polymatroid bound follows from the Zhang-Yeung result [68] mentioned in Section 3.2. In [8], we exploited the Zhang-Yeung non-Shannon-type inequality and constructed a query for which the optimal polymatroid solution to problem (44) belongs to Γ_n ∖ Γ̄*_n. This in particular proves a gap between the two bounds, which we can then magnify to an arbitrary degree by scaling up the degree constraints.

In addition to being not tight for general degree constraints, the polymatroid bound has another disadvantage: the linear program (44) has exponential size in query complexity. While this is “acceptable” in theory, it is simply not acceptable in practice. Typical OLAP queries we have seen at LogicBlox or RelationalAI have many variables on average, and an exponential in the number of variables certainly cannot be considered a “constant” factor, let alone for analytic and machine learning workloads, which have hundreds if not thousands of variables. We next present a sufficient condition allowing the polymatroid bound to not only be tight, but also computable in polynomial time in query complexity.

Definition 3 (Acyclic degree constraints).

Associate a directed graph G_DC = ([n], E_DC) to the degree constraints DC by adding to E_DC all directed edges (x, y) for every x ∈ X, y ∈ Y ∖ X, and (X, Y, N_{Y|X}) ∈ DC. If G_DC is acyclic, then DC is said to be a set of acyclic degree constraints, in which case any topological ordering (or linear ordering) of G_DC is said to be compatible with DC. The graph G_DC is called the constraint dependency graph associated with DC.

Note that if there are only cardinality constraints, then E_DC is empty and thus DC is acyclic. In particular, acyclicity of the constraints does not imply acyclicity of the query, and the cardinality constraints do not affect the acyclicity of the degree constraints. In a typical OLAP query, if in addition to cardinality constraints we have FD constraints including non-circular key-foreign key lookups, then DC is acyclic. Also, verifying if DC is acyclic can be done efficiently in
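To make the check concrete, here is a minimal sketch (ours; graphlib is the Python standard-library topological sorter, and the encoding of DC as (X, Y, N) triples is an assumption) that builds G_DC and returns a compatible ordering, or None if DC is cyclic.

from graphlib import TopologicalSorter, CycleError

def acyclic_degree_constraints(constraints):
    """Build the constraint dependency graph of Definition 3 (an edge x -> y for
    every constraint (X, Y, N) and every x in X, y in Y \\ X) and test acyclicity.
    Returns a compatible variable ordering, or None if DC is cyclic."""
    ts = TopologicalSorter()
    for X, Y, _ in constraints:
        for y in set(Y) - set(X):
            ts.add(y, *X)          # y depends on every x in X
    try:
        return list(ts.static_order())
    except CycleError:
        return None

# Cardinality constraints (X = empty) add no edges, so they never create cycles.
DC = [([], ['A', 'B'], 10**6),      # |R(A,B)| <= 10^6
      (['A'], ['A', 'C'], 1),       # FD  A -> C
      (['C'], ['C', 'B'], 5)]       # every C-value has at most 5 B-neighbors
print(acyclic_degree_constraints(DC))   # e.g. ['A', 'B', 'C']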