1. Introduction
We consider the problem of computing functional aggregate queries with inequality joins, or FAQAI queries for short. This is a fundamental computational problem that goes beyond databases: core computation for supervised and unsupervised machine learning can be formulated in FAQAI.
Inequalities occur naturally in scenarios involving temporal and spatial relationships between objects in databases. In a retail scenario (e.g., TPC-H), we would like to compute the revenue generated by a customer’s orders whose dates closely precede the ship dates of their lineitems. In streaming scenarios, we would like to detect patterns of events whose timestamps follow a particular order (Golab and Özsu, 2010). In spatial data management scenarios, we would like to retrieve objects whose coordinates are within a multidimensional range or in close proximity of other objects (Mamoulis, 2011). The evaluation of Core XPath queries over XML documents amounts to the evaluation of a special class of conjunctive queries with inequalities expressing tree relationships in the pre/post plane (Grust, 2002).
1.1. Motivating examples
A key insight of this paper is that the efficient computation of inequality joins can reduce the computational complexity of supervised and unsupervised machine learning.
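Example 1.1.
The $k$-means algorithm divides the input dataset $X$ into $k$ clusters of similar data points (Jain, 2010). Each cluster $X_i$ has a mean $\mu_i \in \mathbb{R}^n$, which is chosen according to the following optimization (similarity is defined here with respect to the $\ell_2$ norm):
(1)  $\min_{(X_1, \ldots, X_k)} \sum_{i \in [k]} \sum_{x \in X_i} \|x - \mu_i\|_2^2$
Let $\mu_{i,d}$ be the $d$’th component of mean vector $\mu_i$. For a data point $x \in X$, the function $c_{ij}(x)$ computes the difference between the squares of the $\ell_2$ distances from $x$ to $\mu_i$ and from $x$ to $\mu_j$:
$c_{ij}(x) = \sum_{d \in [n]} \left[ \mu_{i,d}^2 - 2 x_d (\mu_{i,d} - \mu_{j,d}) - \mu_{j,d}^2 \right]$
A data point $x \in X$ is closest to mean $\mu_i$ from the set of $k$ means iff $c_{ij}(x) \le 0$ for all $j \in [k]$.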
To compute the mean vector , we need to compute the sum of values for each dimension over . If the dataset is the join of database relations over schemas , we can formulate this sum computation as a dataloglike query with aggregates (Halpin and Rugaber, 2015):
Section 4 gives further queries necessary to compute the means. As we show in this paper, such queries with aggregates and inequalities can be computed asymptotically faster than the join defining . ∎
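The closest-mean test in Example 1.1 can be checked numerically. The sketch below assumes squared Euclidean distance and writes the difference of squared distances as a sum of per-dimension terms, which is what makes the test an additive inequality. The function names are illustrative and do not come from the paper.

```python
def sq_dist(x, mu):
    """Squared Euclidean distance between point x and mean mu."""
    return sum((xl - ml) ** 2 for xl, ml in zip(x, mu))


def c(x, mu_i, mu_j):
    """Difference of squared distances from x to mu_i and from x to mu_j.

    Written as a sum of terms that each depend on one coordinate of x,
    so that c(x) <= 0 is an additive inequality over the dimensions.
    """
    return sum(mi * mi - 2 * xl * (mi - mj) - mj * mj
               for xl, mi, mj in zip(x, mu_i, mu_j))


def is_closest(x, i, means):
    """True iff c(x, means[i], means[j]) <= 0 for every j."""
    return all(c(x, means[i], mu_j) <= 0 for mu_j in means)
```

With integer coordinates the identity `c(x, mu_i, mu_j) == sq_dist(x, mu_i) - sq_dist(x, mu_j)` holds exactly; with floating-point data it holds only up to rounding.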
Simple queries with inequalities can already show the limitations of current evaluation techniques, as highlighted next.
1.2. The FaqAi problem
One way to answer the above queries is to view them as functional aggregate queries (FAQ) (Abo Khamis et al., 2016) formulated in sum-product form over (potentially many) semirings. We therefore briefly introduce FAQ over a single semiring.
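First we establish notation. For any positive integer $n$, let $[n] = \{1, \ldots, n\}$. For $i \in [n]$, let $X_i$ denote a variable/attribute, and $x_i$ denote a value in the discrete domain $\mathrm{Dom}(X_i)$ of $X_i$. For any $K \subseteq [n]$, define $\mathbf{X}_K = (X_i)_{i \in K}$, $\mathbf{x}_K = (x_i)_{i \in K} \in \prod_{i \in K} \mathrm{Dom}(X_i)$. That is, $\mathbf{X}_K$ is a tuple of variables and $\mathbf{x}_K$ is a tuple of values for these variables.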
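Let $\mathbf{D} = (D, \oplus, \otimes, \mathbf{0}, \mathbf{1})$ be a semiring and $H = (V, E)$ a multi-hypergraph (that is, $E$ is a multiset). To each edge $K \in E$ we associate a function $\psi_K : \prod_{v \in K} \mathrm{Dom}(X_v) \to D$ called a factor. (The naming is borrowed from the graphical models literature, where FAQ has its roots.) A (single-semiring) FAQ query $Q$ with free variables $F \subseteq V$ has the form:
(2)  $Q(\mathbf{x}_F) = \bigoplus_{\mathbf{x}_{V \setminus F}} \bigotimes_{K \in E} \psi_K(\mathbf{x}_K)$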
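Under the Boolean semiring $(\{\text{true}, \text{false}\}, \vee, \wedge)$, the query in (2) becomes a conjunctive query: The factors $\psi_K$ represent input relations, where $\psi_K(\mathbf{x}_K) = \text{true}$ iff $\mathbf{x}_K \in R_K$, with some notational overloading. For counting the number of tuples in the result of a join query, we can use instead the sum-product semiring $(\mathbb{R}, +, \times)$ and define an indicator function $\psi_K(\mathbf{x}_K) = 1_{\mathbf{x}_K \in R_K}$ for every input relation $R_K$. To aggregate over some input variable, say $X_k$, we can designate an identity factor $\psi_k(x_k) = x_k$.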
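To illustrate how one query shape yields different answers under different semirings, here is a minimal brute-force sketch for queries of the form (2) without free variables. It enumerates every assignment, so it is not InsideOut or any algorithm from the paper. The `faq` helper, the toy edge relation `E`, and the triangle query are all assumptions made for this illustration.

```python
from itertools import product


def faq(domains, factors, add, mul, zero, one):
    """Brute-force sum-product evaluation with no free variables.

    domains: list of value lists, one per variable.
    factors: list of (variable_indices, function) pairs.
    add, mul, zero, one: the semiring operations and their identities.
    """
    total = zero
    for x in product(*domains):
        term = one
        for var_ids, psi in factors:
            term = mul(term, psi(tuple(x[v] for v in var_ids)))
        total = add(total, term)
    return total


# Hypothetical input: one edge relation, used three times in a triangle query.
E = {(0, 1), (1, 2), (2, 0), (0, 2)}
dom = [[0, 1, 2]] * 3
edges = [(0, 1), (1, 2), (2, 0)]

# Boolean semiring: factors are membership tests; the result says whether
# some assignment satisfies all three atoms.
bool_factors = [(e, lambda t: t in E) for e in edges]
exists = faq(dom, bool_factors,
             lambda a, b: a or b, lambda a, b: a and b, False, True)

# Sum-product semiring: factors are 0/1 indicators; the result counts the
# satisfying assignments.
count_factors = [(e, lambda t: 1 if t in E else 0) for e in edges]
count = faq(dom, count_factors,
            lambda a, b: a + b, lambda a, b: a * b, 0, 1)
```

Only the semiring arguments change between the two calls; the query structure is the same.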
It is known (Abo Khamis et al., 2016) that over an arbitrary semiring, the query (2) can be answered in time $\tilde{O}(N^{\mathrm{fhtw}} + |Q|)$, where fhtw denotes the fractional hypertree width of the query and $Q$ has no free variables (Grohe and Marx, 2014). If $Q$ does have free variables, fhtw is replaced by the FAQ-width (Abo Khamis et al., 2016). Here $N$ is the size of the largest factor $\psi_K$. Over the Boolean semiring, the time can be lowered to $\tilde{O}(N^{\mathrm{subw}} + |Q|)$ (Abo Khamis et al., 2017b), where subw is the submodular width (Marx, 2013) and $\tilde{O}$ hides a polylogarithmic factor in $N$.
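Motivated by the examples in Section 1.1, we formulate a class of FAQ queries called FAQAI: the hyperedge multiset $E$ is partitioned into two multisets $E = S \cup L$, where $S$ stands for “skeleton” and $L$ stands for “ligament”. The input to our class of queries consists of the following: (1) to each hyperedge $K \in S$, there corresponds a function $\psi_K : \prod_{v \in K} \mathrm{Dom}(X_v) \to D$, as in the FAQ case; (2) to each hyperedge $S' \in L$, there correspond $|S'|$ functions $\theta_v^{S'} : \mathrm{Dom}(X_v) \to \mathbb{R}$, one for every variable $v \in S'$. The query we want to compute is the following:
(3)  $Q(\mathbf{x}_F) = \bigoplus_{\mathbf{x}_{V \setminus F}} \Big( \bigotimes_{K \in S} \psi_K(\mathbf{x}_K) \Big) \otimes \Big( \bigotimes_{S' \in L} 1_{\sum_{v \in S'} \theta_v^{S'}(x_v) \le 0} \Big)$
The summation is over tuples $\mathbf{x}_{V \setminus F}$. The notation $1_A$ denotes the indicator function of the event $A$ in the semiring $(D, \oplus, \otimes)$: $1_A = \mathbf{1}$ if $A$ holds, and $\mathbf{0}$ otherwise. The (univariate) functions $\theta_v^{S'}$ can be user-defined functions or binary predicates with one key in $\mathrm{Dom}(X_v)$ and a numeric value. The only requirement we impose is that, given $x_v$, the value $\theta_v^{S'}(x_v)$ can be accessed/computed in $O(1)$ time.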
Note that if $L = \emptyset$, then we get back the FAQ formulation (2). Thus, FAQAI is a superclass of FAQ queries: every FAQ query is an FAQAI query with no ligament edges.
1.3. Our contributions
To answer FAQ queries of the form (2), currently there are two dominant width parameters: fractional hypertree width (fhtw (Grohe and Marx, 2014)) and submodular width (subw (Marx, 2013)); Section 2.1 overviews other notions of widths. It is known that $\mathrm{subw} \le \mathrm{fhtw}$ for any query, and in the Boolean semiring we can answer (2) in time $\tilde{O}(N^{\mathrm{subw}} + |Q|)$ (Abo Khamis et al., 2017b; Marx, 2013). For non-Boolean semirings, the best known algorithm, called InsideOut (Abo Khamis et al., 2016; Abo Khamis et al., 2017a), evaluates (2) in time $\tilde{O}(N^{\mathrm{fhtw}} + |Q|)$. For queries with free variables, fhtw is replaced by the more general notion of FAQ-width (faqw) (Abo Khamis et al., 2016); however, for brevity we discuss only the case without free variables here.
Following (Abo Khamis et al., 2017a), both width parameters subw and fhtw can be defined via two constraint sets: the first is the set TD of all tree decompositions of the query hypergraph $H$, and the second is the set $\Gamma_n$ of polymatroids on the $n$ vertices of $H$. The widths subw and fhtw are then defined as maximin or minimax optimization problems on the domain pair TD and $\Gamma_n$, subject to “edge domination” constraints for $\Gamma_n$. Section 2 presents these notions and other related preliminary concepts in detail.
Our contributions include the following:
Answering FaqAi over Boolean semiring
On the Boolean semiring, one way to answer query (3) is to apply the PANDA algorithm (Abo Khamis et al., 2017b), using edge domination constraints on the skeleton edges and the set TD of all tree decompositions of the query hypergraph. However, this is suboptimal. Therefore, in Section 3.2 we define a new notion of tree decomposition: relaxed tree decomposition, in which the ligament hyperedges only have to be covered by adjacent TD bags. Then, we present a variant of the InsideOut algorithm running on these relaxed TDs, exploiting Chazelle’s classic geometric data structure (Chazelle, 1988) for solving the semigroup range search problem. We show that our InsideOut variant meets the “relaxed fhtw” runtime, which is the analog of fhtw on relaxed TD. The PANDA algorithm can use the InsideOut variant as a black box to meet the “relaxed subw” runtime. The relaxed widths are never larger than their non-relaxed counterparts, and are strictly smaller for some classes of queries, which means our algorithms yield asymptotic improvements over existing ones on those classes.
Answering Faq over other semirings
Next, to prepare the stage for answering FAQAI over non-Boolean semirings, in Section 3.3 we revisit FAQ over non-Boolean semirings, where no known algorithm can achieve the subw runtime. Here, we relax the set of polymatroids to a superset of relaxed polymatroids. Then, by relaxing the subw definition over relaxed polymatroids, we obtain a new width parameter called “sharp submodular width” (#subw). We show how a variant of PANDA, called #PANDA, can achieve the #subw runtime for evaluating FAQ over non-Boolean semirings. We prove that $\mathrm{subw} \le \#\mathrm{subw} \le \mathrm{fhtw}$, and that there are classes of queries where #subw is unboundedly smaller than fhtw.
Answering FaqAi over other semirings
Getting back to FAQAI, we apply the #subw result under both relaxations, relaxed TD and relaxed polymatroids, to obtain a new width parameter called the relaxed #subw. We show that the new variants of PANDA and InsideOut can achieve the relaxed #subw runtime. We also show that there are queries for which relaxed #subw is essentially the best we can hope for, modulo $k$-sum hardness.
Applications in relational Machine Learning
Equipped with the algorithms for answering FAQAI, in Section 4 we return to relational machine learning applications over datasets defined by feature extraction queries over relational databases. We show how one can train linear SVM, $k$-means, and ML models over Huber/hinge loss functions without completely materializing the output of the feature extraction queries. In particular, this shows that for these important classes of ML models, one can sometimes train models in time sublinear in the training dataset size.
1.4. Related work
Appendix C revisits two prior results on the evaluation of queries with inequalities through FAQAI lenses: Core XPath queries over XML documents and inequality joins over tuple-independent probabilistic databases (Olteanu and Huang, 2009). Throughout the paper, we contrast our new width notions with fhtw and subw and our new algorithm #PANDA with the state-of-the-art algorithms PANDA and InsideOut for FAQ and FAQAI queries. A seminal work considers the containment and minimization problem for queries with inequalities (Klug, 1988). There is a large body of work on queries with disequalities (not-equal), e.g., (Abo Khamis et al., 2019) and references therein, which are at times referred to as inequalities.
Section 4 sets the context for our results on machine learning.
2. Preliminaries
Throughout the paper, we use the following convention. For any Boolean event/variable $A$ and a given semiring $(D, \oplus, \otimes)$, let $1_A$ denote the indicator variable, which takes the value $\mathbf{1}$ if $A$ holds (or is true), and $\mathbf{0}$ otherwise. We assume without loss of generality in the paper that the semiring operations $\oplus$ and $\otimes$ can be performed in $O(1)$ time. (When the assumption does not hold, for the set semiring for instance, we can multiply the claimed runtime by the real operation’s runtime.)
2.1. Tree decompositions and polymatroids
We briefly define tree decompositions, fhtw and subw parameters. We refer the reader to the recent survey by Gottlob et al. (Gottlob et al., 2016) for more details and historical contexts. In what follows, the hypergraph should be thought of as the hypergraph of the input query, although the notions of tree decomposition and width parameters are defined independently of queries.
A tree decomposition of a hypergraph $H = (V, E)$ is a pair $(T, \chi)$, where $T$ is a tree and $\chi : V(T) \to 2^V$ maps each node $t$ of the tree to a subset $\chi(t)$ of vertices such that
(a) every hyperedge $F \in E$ is a subset of some $\chi(t)$, $t \in V(T)$ (i.e., every edge is covered by some bag),
(b) for every vertex $v \in V$, the set $\{t \mid v \in \chi(t)\}$ is a nonempty (connected) subtree of $T$. This is called the running intersection property.
The sets $\chi(t)$ are often called the bags of the tree decomposition.
Let $\mathrm{TD}(H)$ denote the set of all tree decompositions of $H$. When $H$ is clear from context, we use TD for brevity.
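The two defining properties of a tree decomposition can be checked mechanically. The following sketch is a plain validator written for illustration only. It assumes that `tree_edges` really forms a tree over the keys of `bags` (this is not verified), and all names are hypothetical.

```python
def _connected(nodes, tree_edges):
    """True if `nodes` is nonempty and induces a connected subgraph."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        t = stack.pop()
        for a, b in tree_edges:
            if t in (a, b):
                other = b if t == a else a
                if other in nodes and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == nodes


def is_tree_decomposition(hyperedges, tree_edges, bags):
    """Check edge coverage and the running intersection property.

    hyperedges: list of vertex sets.
    tree_edges: list of (t1, t2) pairs over the keys of `bags`.
    bags: dict mapping each tree node to its set of vertices.
    """
    # Every hyperedge must be contained in at least one bag.
    if not all(any(set(e) <= bags[t] for t in bags) for e in hyperedges):
        return False
    # For every vertex, the tree nodes whose bags contain it must be connected.
    vertices = set().union(*map(set, hyperedges))
    return all(
        _connected({t for t in bags if v in bags[t]}, tree_edges)
        for v in vertices
    )
```

For a path query with hyperedges {a,b}, {b,c}, {c,d}, the chain of bags {a,b}, {b,c}, {c,d} passes the check, while reordering the chain to {a,b}, {c,d}, {b,c} breaks the running intersection property for vertex b.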
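To define width parameters, we use the polymatroid characterization from Abo Khamis et al. (2017b). A function $h : 2^V \to \mathbb{R}_+$ is called a (non-negative) set function on $V$. A set function $h$ on $V$ is modular if $h(S) = \sum_{v \in S} h(\{v\})$ for all $S \subseteq V$, is monotone if $h(X) \le h(Y)$ whenever $X \subseteq Y$, and is submodular if $h(X \cup Y) + h(X \cap Y) \le h(X) + h(Y)$ for all $X, Y \subseteq V$. A monotone, submodular set function $h$ with $h(\emptyset) = 0$ is called a polymatroid. Let $\Gamma_n$ denote the set of all polymatroids on $V$ with $|V| = n$.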
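The polymatroid conditions can be verified by exhaustive enumeration on small ground sets. The sketch below is only meant to make the three conditions concrete: it is exponential in the size of the ground set, and the helper names are assumptions of this illustration.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_polymatroid(h, ground):
    """Check h(empty) = 0, monotonicity, and submodularity by brute force."""
    all_sets = list(subsets(ground))
    if h(frozenset()) != 0:
        return False
    for s in all_sets:
        for t in all_sets:
            if s <= t and h(s) > h(t):
                return False  # monotonicity violated
            if h(s | t) + h(s & t) > h(s) + h(t):
                return False  # submodularity violated
    return True
```

For example, `h(S) = min(|S|, 2)` on a three-element ground set passes all three checks, whereas `h(S) = |S|**2` fails submodularity on two disjoint singletons.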
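Given some hypergraph $H = (V, E)$, define the set of edge dominated set functions:
(6)  $\mathrm{ED} := \{ h : 2^V \to \mathbb{R}_+ \mid h(F) \le 1 \text{ for all } F \in E \}$
With this, we define the fractional hypertree width and submodular width of a given hypergraph $H$:
(7)  $\mathrm{fhtw}(H) := \min_{(T, \chi) \in \mathrm{TD}} \; \max_{h \in \mathrm{ED} \cap \Gamma_n} \; \max_{t \in V(T)} h(\chi(t))$
(8)  $\mathrm{subw}(H) := \max_{h \in \mathrm{ED} \cap \Gamma_n} \; \min_{(T, \chi) \in \mathrm{TD}} \; \max_{t \in V(T)} h(\chi(t))$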
It is known (Marx, 2013) that $\mathrm{subw}(H) \le \mathrm{fhtw}(H)$, and there are classes of hypergraphs with bounded subw and unbounded fhtw. Furthermore, fhtw is never larger than other width notions such as (generalized) hypertree width and tree width, and can be strictly smaller.
Remark 2.1.
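Prior to Abo Khamis et al. (2017b), the commonly used definition of $\mathrm{fhtw}(H)$ is $\mathrm{fhtw}(H) := \min_{(T, \chi) \in \mathrm{TD}} \max_{t \in V(T)} \rho^*_E(\chi(t))$, where $\rho^*_E(B)$ is the fractional edge cover number of a vertex set $B$ using the hyperedge set $E$. It is straightforward to show, using linear programming duality (Abo Khamis et al., 2017b), that
(9)  $\max_{h \in \mathrm{ED} \cap \Gamma_n} h(B) = \rho^*_E(B)$,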
proving the equivalence of the two definitions. However, the characterization (7) has two primary advantages: (i) it exposes the minimax / maximin duality between fhtw and subw, and more importantly (ii) it makes it completely straightforward to relax the definitions by replacing the constraints by other applicable constraints, as shall be shown in later sections.∎
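Definition 2.2 (F-connex tree decomposition (Bagan et al., 2007; Segoufin, 2013)).
Given a hypergraph $H = (V, E)$ and a set $F \subseteq V$, a tree decomposition $(T, \chi)$ of $H$ is $F$-connex if $F = \emptyset$ or the following holds: There is a nonempty subset $V' \subseteq V(T)$ that forms a connected subtree of $T$ and satisfies $\bigcup_{t \in V'} \chi(t) = F$.
We use $\mathrm{TD}_F$ to denote the set of all $F$-connex tree decompositions of $H$. (Note that when $F = \emptyset$, $\mathrm{TD}_F = \mathrm{TD}$.)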
2.2. InsideOut and Panda
To answer the FAQ query (2), we need a model for the representation of the input factors $\psi_K$. The support of the function $\psi_K$ is the set of tuples $\mathbf{x}_K$ such that $\psi_K(\mathbf{x}_K) \neq \mathbf{0}$. We use $|\psi_K|$ to denote the size of its support. For example, if $\psi_K$ represents an input relation, then $|\psi_K|$ is the number of tuples in that relation. In practice, there often are factors with infinite support, e.g., $\psi_K$ represents a built-in function in a database, an arithmetic operator, or a comparison operator as in (3). To deal with this more general setting, the edge set $E$ is partitioned into two sets $E = E_{\not\infty} \cup E_\infty$, where $|\psi_K|$ is finite for all $K \in E_{\not\infty}$ and $|\psi_K| = \infty$ for all $K \in E_\infty$. For simplicity, we often state runtimes of algorithms in terms of the “input size” $N := \max_{K \in E_{\not\infty}} |\psi_K|$. Moreover, we use $|Q|$ to denote the output size of $Q$.
InsideOut (Abo Khamis et al., 2016; Abo Khamis et al., 2017a)
To answer (2), the InsideOut algorithm works by eliminating variables, along with an idea called the “indicator projection”. Its runtime is described by the FAQwidth of the query, a slight generalization of fhtw. In the context of one semiring, we can define by applying Definition (7) over a restricted set of tree decompositions and edge dominated polymatroids. In particular, let denote the set of free variables in (2), and recall from Definition 2.2. Then,
(10)  
(11)  
(12) 
Note that when and (i.e. ). A simple result from Abo Khamis et al. (2016) is the following:
Proposition 2.3.
InsideOut answers query (2) in time .
Panda (Abo Khamis et al., 2017b)
In case of the Boolean semiring, i.e., when the FAQ query (2) is of the form
(13) 
we can do much better than Proposition 2.3. When , Marx (Marx, 2013) showed that (13) can be answered in time . The PANDA algorithm (Abo Khamis et al., 2017b) generalizes Marx’s result to deal with general degree constraints, and to meet precisely the runtime. In fact, PANDA works with queries such as (13) with free variables as well. In the context of this paper, we can define the following notion of submodular FAQwidth in a very natural way:
(14) 
Then, the results from Abo Khamis et al. (2017b) imply:
Proposition 2.4.
PANDA answers query (13) in time .
These results only work for the Boolean semiring. Section 3 introduces a variant of PANDA, called #PANDA, that also works for non-Boolean semirings.
2.3. Semigroup range searching
Orthogonal range counting (and searching) is a classic and ubiquitous problem in computational geometry (de Berg et al., 2008): given a set $P$ of $N$ points in a $d$-dimensional space, build a data structure that, given any $d$-dimensional rectangle, can efficiently return the number of enclosed points. More generally, there is the semigroup range searching problem (Chazelle, 1988), where each point $p \in P$ of the $N$ input points also has a weight $w(p) \in G$, where $(G, \oplus)$ is a semigroup. (In a semigroup we can add two elements using $\oplus$, but there is no additive inverse.) The problem is: given a $d$-dimensional rectangle $R$, compute $\bigoplus_{p \in P \cap R} w(p)$.
Classic results by Chazelle (Chazelle, 1988) show that there are data structures for semigroup range searching which can be constructed in time , and answer rectangular queries in time. Also, this is almost the best we can hope for (Chazelle, 1990). There are more recent improvements to Chazelle’s result (see, e.g., Chan et al. (Chan et al., 2011)), but they are minor (at most a factor), as the original results were already very close to matching the lower bound.
Most of these range search/counting problems can be reduced to the dominance range searching problem (on semigroups), where the query is represented by a point $q$, and the objective is to return $\bigoplus_{p \in P : q \preceq p} w(p)$. Here, $\preceq$ denotes the “dominance” relation (coordinate-wise $\le$). We can think of $q$ as the lower corner of an infinite rectangle query.
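As a concrete illustration of dominance range searching in two dimensions, the sketch below answers a batch of queries offline with a plane sweep and a Fenwick tree. It is a simpler, swapped-in technique, not Chazelle's data structure. It combines weights using only the supplied `add` operation and never subtracts, but it does assume that `add` is commutative and that an identity element `zero` is available. All names are illustrative.

```python
from bisect import bisect_left


def dominance_sums(points, queries, add=lambda a, b: a + b, zero=0):
    """For each query (qx, qy), combine the weights of all points
    (px, py, w) with qx <= px and qy <= py.
    """
    ys = sorted({py for _, py, _ in points})
    m = len(ys)
    tree = [zero] * (m + 1)  # Fenwick tree indexed by reversed y-rank

    def insert(py, w):
        i = m - bisect_left(ys, py)  # the largest y value gets rank 1
        while i <= m:
            tree[i] = add(tree[i], w)
            i += i & -i

    def prefix(qy):
        i = m - bisect_left(ys, qy)  # number of distinct y values >= qy
        acc = zero
        while i > 0:
            acc = add(acc, tree[i])
            i -= i & -i
        return acc

    # Sweep x from large to small, so that when a query is answered the
    # tree holds exactly the points with px >= qx.
    pts = sorted(points, key=lambda p: -p[0])
    order = sorted(range(len(queries)), key=lambda k: -queries[k][0])
    answers = [zero] * len(queries)
    j = 0
    for k in order:
        qx, qy = queries[k]
        while j < len(pts) and pts[j][0] >= qx:
            insert(pts[j][1], pts[j][2])
            j += 1
        answers[k] = prefix(qy)
    return answers
```

Reversing the y-ranks turns the condition `py >= qy` into a prefix query, which is why no subtraction (additive inverse) is needed.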
3. Relaxed tree decompositions and relaxed polymatroids
3.1. Connection to a geometric data structure
We start with a special case of (3) in which the skeleton part contains only two hyperedges and . Formally, consider the aggregate query of the form
(15) 
where and are two input functions/relations over variable sets and , respectively. We prove the following very simple but important lemma:
Lemma 3.1.
Let , and , then when , query (15) can be answered in time .
Proof.
If there is a hyperedge for which , then in a time preprocessing step we can “absorb” the factor into the factor , by replacing with . A similar absorption can be done with . Hence, without loss of generality we can assume that and for all . Furthermore, we only need to show that we can compute (15) for , because after is computed, we can marginalize away variables in time.
Abusing notation somewhat, for each and each , define the function by
(16) 
Fix a tuple such that . A tuple is said to be adjacent if . We show how to compute the following sum in polylogarithmic time:
(17)  
(18) 
where the inner sum ranges over only tuples which are adjacent; nonadjacent tuples contribute .
Now, for each define two dimensional points:
(19)  
(20) 
We write to say that is dominated by coordinatewise: . Assign to each point a “weight” of . Now, taking (18),
(21)  
(22) 
Example 3.2.
Let be a binary relation. Suppose we want to count the number of tuples satisfying , then by setting , , , it is easy to see that the problem can be reduced to the form (15) with , . We can thus compute this count in time .∎
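The idea behind Lemma 3.1 is easiest to see in its simplest instance: two inputs that share no variables, joined by a single additive inequality. The sketch below counts the pairs `(u, v)` with `u + v <= 0` by sorting and binary search, rather than with the geometric data structure used in the proof. The function name and input format are assumptions of this illustration: each input list holds the precomputed value of one side of the inequality for each tuple.

```python
from bisect import bisect_right


def count_pairs(r_values, s_values):
    """Count pairs (u, v), u from r_values and v from s_values, with u + v <= 0.

    Sorting costs O(N log N) and each lookup costs O(log N), so no
    enumeration of all |R| * |S| pairs is needed.
    """
    s_sorted = sorted(s_values)
    # For a fixed u we need v <= -u; bisect_right counts exactly those v.
    return sum(bisect_right(s_sorted, -u) for u in r_values)
```

With several inequalities, the one-dimensional sorted list no longer suffices and a multidimensional dominance structure takes its place.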
3.2. Relaxed tree decompositions
Equipped with this basic case, we can now proceed to solve the general setting of (3). To this end, we define a new width parameter.
Definition 3.3 (Relaxed tree decomposition).
Let $H = (V, E = S \cup L)$ denote a multi-hypergraph whose edge multiset is partitioned into $S$ and $L$. A relaxed tree decomposition of $H$ (with respect to the partition $S \cup L$) is a pair $(T, \chi)$, where $T = (V(T), E(T))$ is a tree, and $\chi : V(T) \to 2^V$ satisfies the following properties:
(a) The running intersection property holds: for each vertex $v \in V$ the set $\{t \in V(T) \mid v \in \chi(t)\}$ is a connected subtree in $T$.
(b) Every “skeleton” edge $F \in S$ is covered by some bag $\chi(t)$, $t \in V(T)$.
(c) Every “ligament” edge $F \in L$ is covered by the union of two adjacent bags: $F \subseteq \chi(s) \cup \chi(t)$, where $\{s, t\} \in E(T)$.
Let $\mathrm{TD}^\ell(H)$ denote the set of all relaxed tree decompositions of $H$ (with respect to the skeleton-ligament partition). When $H$ is clear from context we use $\mathrm{TD}^\ell$ for the sake of brevity. Let $\mathrm{TD}^\ell_F$ denote the set of all relaxed $F$-connex tree decompositions of $H$.
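A small validator shows the difference between ordinary and relaxed tree decompositions. The sketch assumes that `tree_edges` forms a tree over the keys of `bags` (not verified) and, as a convenience not stated in the definition, also accepts a ligament edge that fits inside a single bag. Function and parameter names are illustrative.

```python
def _connected(nodes, tree_edges):
    """True if `nodes` is nonempty and induces a connected subgraph."""
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        t = stack.pop()
        for a, b in tree_edges:
            if t in (a, b):
                other = b if t == a else a
                if other in nodes and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == nodes


def is_relaxed_td(skeleton, ligaments, tree_edges, bags):
    """Check the three properties of a relaxed tree decomposition."""
    vertices = set().union(*map(set, list(skeleton) + list(ligaments)))
    # Running intersection property.
    for v in vertices:
        if not _connected({t for t in bags if v in bags[t]}, tree_edges):
            return False
    # Skeleton edges need a single covering bag.
    if not all(any(set(e) <= bags[t] for t in bags) for e in skeleton):
        return False
    # Ligament edges may instead be covered by two adjacent bags.
    for f in ligaments:
        in_one = any(set(f) <= bags[t] for t in bags)
        in_two = any(set(f) <= bags[s] | bags[t] for s, t in tree_edges)
        if not (in_one or in_two):
            return False
    return True
```

For a hypothetical query with skeleton edges {a,b} and {c,d} and one ligament edge {a,c}, the two bags {a,b} and {c,d} joined by a tree edge form a relaxed tree decomposition, although no single bag contains {a,c}.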
3.2.1. FaqAi on a general semiring
We use relaxed TDs in conjunction with Lemma 3.1 to answer FAQAI with a relaxed notion of faqw. In particular, the relaxed width parameters of are defined in exactly the same way as the usual width parameters defined in Section 2, except we allow the TDs to range over relaxed ones.
Definition 3.4 (Relaxed faqw).
Let be an FAQAI query (3), and be its hypergraph. Furthermore, let denote the set of hyperedges for which . Then, the relaxed FAQwidth of is defined by
(23) 
When , collapses back to , in which case we define the relaxed fhtw for FAQAI without free variables:
(24) 
A relaxed tree decomposition of is optimal if its width is equal to , i.e., .
Theorem 3.5.
Any FAQAI query of the form (3) on any semiring can be answered in time , where $k$ is the maximum number of additive inequalities covered by a pair of adjacent bags in an optimal relaxed tree decomposition. (Note that $k$ can be a lot smaller than $|L|$.)
Proof.
We first consider the case of no free variables (i.e. ), because this case captures the key idea. Fix an optimal relaxed TD . We first compute, for each bag of the tree decomposition, a factor such that
(25)  
(26) 
To define the factors , we need the notion of the indicator projection (Abo Khamis et al., 2017a; Abo Khamis et al., 2016). For a given and such that , the indicator projection of onto the bag is a function defined by
(27) 
Recall from Definition 3.3 that every is covered by at least one bag for . Fix an arbitrary coverage assignment , where is covered by the bag . Then, the factors are defined by:
(28) 
It is straightforward to verify that (26) holds. Using any worstcase optimal join algorithm (Ngo, 2018; Ngo et al., 2012; Veldhuizen, 2014) we can compute (28) in time
(29) 
Over all , our runtime is bounded by , where
(30) 
In addition, the support of each factor has size bounded by .
Next we compute (26) in time . We will make use of the fact that is a relaxed TD. Fix an arbitrary root of the tree decomposition ; following InsideOut, we compute (26) by eliminating variables from the leaves of up to the root. Without loss of generality, we assume that the tree decomposition is nonredundant, i.e., no bag is a subset of another in the tree decomposition (otherwise the contained bag factor can be “absorbed” into the containee bag factor). Let be any leaf of , be its parent, where and . Now write (26) as follows:
(31)  
(32) 
The third equality uses the semiring’s distributive law. (Note that implies that thanks to Definition 3.3 and the fact that is the only neighbor of .) Lemma 3.1 implies that we can compute the subquery in the allotted time. The above step eliminates all variables in . Repeatedly applying the above step yields the desired output .
Example 3.6.
Given 3 binary relations and , consider a query about the number of tuples that satisfy:
(33) 
The query has and . Let . Note that . In fact, any of the previously known algorithms, e.g. (Abo Khamis et al., 2016; Abo Khamis et al., 2017a), would take time to answer . However, this query has , and by Theorem 3.5, it can be answered in time . (Note that here .) An optimal relaxed tree decomposition is shown in Figure 1.∎
We next give a couple of simple lower and upper bounds for . The upper bound shows that, effectively is the best we can hope for, if the FAQAI query is arbitrary. The lower bound shows that, while the relaxed tree decomposition idea can improve the runtime by a polynomial factor, it cannot improve the runtime over straightforwardly applying InsideOut (over nonrelaxed tree decompositions) by more than a polynomial factor.
Proposition 3.7.
For any positive integer $k$, there exists an FAQAI query of the form (3) for which , and it cannot be answered in time , modulo $k$-sum hardness.
Proposition 3.8.
For any FAQAI query of the form (3), we have ; in particular, when has no free variables .
3.2.2. FaqAi on the Boolean semiring
Before formally explaining how we can adapt PANDA to solve an FAQAI query on the Boolean semiring, we give the intuition with an example.
Example 3.9.
Consider the following FAQAI (written in Datalog):
(34) 
Here . Using fractional hypertree width measure and InsideOut (even with relaxed TDs and Theorem 3.5), the best runtime is , because no matter which (relaxed) TD we choose, the worst-case bag relation size is . A key idea of the PANDA framework (Abo Khamis et al., 2017b) is the use of a disjunctive Datalog rule. Consider the following disjunctive Datalog rule:
(35) 
There are two relations in the head and , and they form a solution to the rule iff the following holds: if satisfies the body, then either or . Via informationtheoretic inequalities (Abo Khamis et al., 2017b), we are able to show that PANDA can compute a solution to the above disjunctive Datalog rule in time . In particular, both and are bounded by .
Given the solution to (35), it is straightforward to verify that the following also holds, using the distributivity of over :
(36) 
By semijoin-reducing against , and semijoin-reducing against , we conclude that
Finally, we have a rewrite of the original body:
(37) 
By defining intermediate rules, we can compute from them:
(38)  
(39)  
(40) 
The key point is that and are of the form (15), and thus they each can be answered in time (since ). This implies that can be answered in time overall.∎
The strategy outlined in the above example uses PANDA to evaluate an FAQAI query over the Boolean semiring. The resulting algorithm achieves a natural generalization of the submodular FAQwidth defined in (14):
Definition 3.10.
Given an FAQAI query (3) over the Boolean semiring. The relaxed submodular FAQwidth of is defined by
(41) 
(Recall that the set of relaxed tree decompositions was defined in Definition 3.3.)
Theorem 3.11.
Any FAQAI query of the form (3) on the Boolean semiring can be answered in time