On Functional Aggregate Queries with Additive Inequalities

12/22/2018 ∙ by Mahmoud Abo Khamis, et al.

Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering Functional Aggregate Queries (FAQ) in which some of the input factors are defined by a collection of Additive Inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define "relaxed" tree decompositions and "relaxed" submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle's geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering. Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by "#subw". This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time governed by #subw and by the relaxed version of #subw, respectively. We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.


1. Introduction

We consider the problem of computing functional aggregate queries with inequality joins, or FAQ-AI queries for short. This is a fundamental computational problem that goes beyond databases: core computation for supervised and unsupervised machine learning can be formulated in FAQ-AI.

Inequalities occur naturally in scenarios involving temporal and spatial relationships between objects in databases. In a retail scenario (e.g., TPC-H), we would like to compute the revenue generated by a customer’s orders whose dates closely precede the ship dates of their lineitems. In streaming scenarios, we would like to detect patterns of events whose time stamps follow a particular order (Golab and Özsu, 2010). In spatial data management scenarios, we would like to retrieve objects whose coordinates are within a multi-dimensional range or in close proximity of other objects (Mamoulis, 2011). The evaluation of Core XPath queries over XML documents amounts to the evaluation of a special class of conjunctive queries with inequalities expressing tree relationships in the pre/post plane (Grust, 2002).

1.1. Motivating examples

A key insight of this paper is that the efficient computation of inequality joins can reduce the computational complexity of supervised and unsupervised machine learning.

Example 1.1.

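The k-means algorithm divides the input dataset $G$ into $k$ clusters $G_1, \dots, G_k$ of similar data points (Jain, 2010). Each cluster $G_j$ has a mean $\mu_j$, which is chosen according to the following optimization (similarity is defined here with respect to the $\ell_2$ norm):

(1) $\min_{(G_1,\dots,G_k)} \sum_{j=1}^{k} \sum_{x \in G_j} \|x - \mu_j\|_2^2$

Let $\mu_{j,l}$ be the $l$'th component of mean vector $\mu_j$. For a data point $x$, the function $c_{ij}(x)$ computes the difference between the squares of the $\ell_2$-distances from $x$ to $\mu_j$ and from $x$ to $\mu_i$:

$c_{ij}(x) = \sum_{l} \left[\mu_{j,l}^2 - 2 x_l \mu_{j,l}\right] - \sum_{l} \left[\mu_{i,l}^2 - 2 x_l \mu_{i,l}\right]$

A data point $x$ is closest to mean $\mu_j$ from the set of $k$ means iff $c_{ij}(x) \le 0$ for all $i = 1, \dots, k$.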

To compute the mean vector , we need to compute the sum of values for each dimension over . If the dataset is the join of database relations over schemas , we can formulate this sum computation as a datalog-like query with aggregates (Halpin and Rugaber, 2015):

Section 4 gives further queries necessary to compute the means. As we show in this paper, such queries with aggregates and inequalities can be computed asymptotically faster than the join defining . ∎
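To make the closest-mean test concrete, here is a minimal Python sketch (not taken from the paper). It writes the difference of squared distances as a sum of one-variable terms, which is what turns the test into an additive inequality. The names `c` and `closest_mean` are hypothetical, and the sign convention (a value of at most zero means the point is at least as close to `mu_j` as to `mu_i`) is an assumption.

```python
def c(x, mu_i, mu_j):
    """Squared distance from x to mu_j minus squared distance from x to mu_i,
    written as a sum of per-dimension terms (the x_l^2 terms cancel)."""
    return sum(mj * mj - 2 * xl * mj - mi * mi + 2 * xl * mi
               for xl, mi, mj in zip(x, mu_i, mu_j))

def closest_mean(x, means):
    """Smallest index j such that c(x, mu_i, mu_j) <= 0 for every mean mu_i."""
    for j, mu_j in enumerate(means):
        if all(c(x, mu_i, mu_j) <= 0 for mu_i in means):
            return j
    raise ValueError("no closest mean found (possible with inexact arithmetic)")

means = [(0, 0), (10, 0), (0, 10)]
print(closest_mean((1, 2), means))  # 0
print(closest_mean((9, 1), means))  # 1
```

Each inequality involves only a sum of terms that depend on a single coordinate of the point, so it never requires the full joined tuple to be materialized before it can be phrased.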

Simple queries with inequalities can already show the limitations of current evaluation techniques, as highlighted next.

Example 1.2.

State-of-the-art techniques take time to compute the following query over relations of size at most :

Example 3.9 (3.19) shows how to compute (its counting version) in time using the techniques introduced in this paper.∎

1.2. The FAQ-AI problem

One way to answer the above queries is to view them as functional aggregate queries (FAQ) (Abo Khamis et al., 2016) formulated in sum-product form over (potentially many) semirings. We therefore briefly introduce FAQ over a single semiring.

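First we establish notation. For any positive integer $n$, let $[n] = \{1, \dots, n\}$. For $i \in [n]$, let $X_i$ denote a variable/attribute, and $x_i$ denote a value in the discrete domain $\mathrm{Dom}(X_i)$ of $X_i$. For any $S \subseteq [n]$, define $X_S = (X_i)_{i \in S}$, $x_S = (x_i)_{i \in S} \in \prod_{i \in S} \mathrm{Dom}(X_i)$. That is, $X_S$ is a tuple of variables and $x_S$ is a tuple of values for these variables.

Let $(D, \oplus, \otimes)$ be a semiring and $H = (V = [n], E)$ a multi-hypergraph (that is, $E$ is a multiset). To each edge $K \in E$ we associate a function $\psi_K : \prod_{i \in K} \mathrm{Dom}(X_i) \to D$ called a factor. (The naming is borrowed from the graphical models literature, where FAQ has its roots.) A (single-semiring) FAQ query with free variables $F \subseteq V$ has the form:

(2) $Q(x_F) = \bigoplus_{x_{V \setminus F}} \bigotimes_{K \in E} \psi_K(x_K)$

Under the Boolean semiring $(\{\text{true}, \text{false}\}, \vee, \wedge)$, the query in (2) becomes a conjunctive query: the factors $\psi_K$ represent input relations $R_K$, where $\psi_K(x_K) = \text{true}$ iff $x_K \in R_K$, with some notational overloading. For counting the number of tuples in the result of a join query, we can use instead the sum-product semiring $(\mathbb{R}, +, \times)$ and define $\psi_K$ to be the indicator function of membership in $R_K$ for every input relation $R_K$. To aggregate over some input variable, say $X_k$, we can designate an identity factor $\psi_k(x_k) = x_k$.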
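The following Python sketch illustrates how one query shape yields different answers under different semirings. It is a brute-force evaluator for the case of no free variables, written only to show the semantics; it enumerates all assignments and so says nothing about the running times discussed in this paper. The function name `faq_no_free_vars` and the toy relations are assumptions made for illustration.

```python
from itertools import product

def faq_no_free_vars(domains, factors, add, mul, zero, one):
    """Semiring sum, over all variable assignments, of the semiring product of the factors.

    domains: dict mapping a variable name to a list of values.
    factors: list of (variables, function) pairs; each function takes a tuple of values.
    """
    variables = sorted(domains)
    total = zero
    for values in product(*(domains[v] for v in variables)):
        x = dict(zip(variables, values))
        term = one
        for vars_, f in factors:
            term = mul(term, f(tuple(x[v] for v in vars_)))
        total = add(total, term)
    return total

R = {(1, 2), (2, 3)}
S = {(2, 5), (3, 5), (3, 6)}
domains = {"a": [1, 2], "b": [2, 3], "c": [5, 6]}

# Boolean semiring: is the join of R(a, b) and S(b, c) non-empty?
bool_factors = [(("a", "b"), lambda t: t in R), (("b", "c"), lambda t: t in S)]
nonempty = faq_no_free_vars(domains, bool_factors,
                            lambda p, q: p or q, lambda p, q: p and q, False, True)

# Sum-product semiring: how many tuples does the join contain?
count_factors = [(("a", "b"), lambda t: int(t in R)), (("b", "c"), lambda t: int(t in S))]
count = faq_no_free_vars(domains, count_factors,
                         lambda p, q: p + q, lambda p, q: p * q, 0, 1)

print(nonempty, count)  # True 3
```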

It is known (Abo Khamis et al., 2016) that over an arbitrary semiring, the query (2) can be answered in time , where fhtw denotes the fractional hypertree width of the query and has no free variables (Grohe and Marx, 2014). If does have free variables, fhtw-width becomes FAQ-width instead (Abo Khamis et al., 2016). Here is the size of the largest factor . Over the Boolean semiring, the time can be lowered to  (Abo Khamis et al., 2017b), where subw is the submodular width (Marx, 2013) and hides a polylogarithmic factor in .

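Motivated by the examples in Section 1.1, we formulate a class of FAQ queries called FAQ-AI: the hyperedge multiset $E$ is partitioned into two multisets $E = S \cup L$, where $S$ stands for “skeleton” and $L$ stands for “ligament”. The input to our class of queries consists of the following: (1) to each hyperedge $K \in S$, there corresponds a function $\psi_K$, as in the FAQ case; (2) to each hyperedge $S' \in L$, there correspond $|S'|$ functions $\theta^{S'}_v$, one for every variable $v \in S'$. The query we want to compute is the following:

(3) $Q(x_F) = \bigoplus_{x_{V \setminus F}} \Big( \bigotimes_{K \in S} \psi_K(x_K) \Big) \otimes \Big( \bigotimes_{S' \in L} \mathbf{1}_{\sum_{v \in S'} \theta^{S'}_v(x_v) \le 0} \Big)$

The summation is over tuples $x_{V \setminus F}$, where $F$ is the set of free variables. The notation $\mathbf{1}_{\mathcal{E}}$ denotes the indicator function of the event $\mathcal{E}$ in the semiring: it is the multiplicative identity if $\mathcal{E}$ holds, and the additive identity otherwise. The (uni-variate) functions $\theta^{S'}_v$ can be user-defined functions, or binary predicates with one key in the domain of the variable $v$ and a numeric value. The only requirement we impose is that, given $x_v$, the value $\theta^{S'}_v(x_v)$ can be accessed/computed in $O(1)$-time.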

Note that if there are no ligament edges, then we get back the FAQ formulation (2). Thus, FAQ-AI can be considered a super-class of FAQ queries: every FAQ query is an FAQ-AI query with an empty set of additive inequalities.

Example 1.3.

The queries from Section 1.1 are instances of (3):

(4)
(5)

Query (4) is over the sum-product semiring. Query (5) can be over any semiring: Example 3.9 discusses the case of the Boolean semiring while Example 3.19 discusses the sum-product semiring. ∎

1.3. Our contributions

To answer FAQ queries of the form (2), currently there are two dominant width parameters: fractional hypertree width (fhtw; Grohe and Marx, 2014) and submodular width (subw; Marx, 2013). (Section 2.1 overviews other notions of widths.) It is known that subw ≤ fhtw for any query, and in the Boolean semiring we can answer (2) in time governed by subw (Abo Khamis et al., 2017b; Marx, 2013). For non-Boolean semirings, the best known algorithm, called InsideOut (Abo Khamis et al., 2016; Abo Khamis et al., 2017a), evaluates (2) in time governed by fhtw. For queries with free variables, fhtw is replaced by the more general notion of FAQ-width (faqw; Abo Khamis et al., 2016); however, for brevity we discuss the non-free variable case here.

Following (Abo Khamis et al., 2017a), both width parameters subw and fhtw can be defined via two constraint sets: the first is the set TD of all tree decompositions of the query hypergraph, and the second is the set of polymatroids on the vertices of that hypergraph. The widths subw and fhtw are then defined as maximin and minimax optimization problems, respectively, on the domain pair of TD and the polymatroids, subject to “edge domination” constraints on the polymatroids. Section 2 presents these notions and other related preliminary concepts in detail.

Our contributions include the following:

Answering FAQ-AI over the Boolean semiring

On the Boolean semiring, one way to answer query (3) is to apply the PANDA algorithm (Abo Khamis et al., 2017b), using edge domination constraints on the skeleton edges and the set TD of all tree decompositions of the query hypergraph. However, this is sub-optimal. Therefore, in Section 3.2 we define a new notion of tree decomposition: relaxed tree decomposition, in which the ligament hyperedges only have to be covered by adjacent TD bags. Then, we present a variant of the InsideOut algorithm running on these relaxed TDs, exploiting Chazelle’s classic geometric data structure (Chazelle, 1988) for solving the semigroup range search problem. We show that our InsideOut variant meets the “relaxed fhtw” runtime, which is the analog of fhtw on relaxed TD. The PANDA algorithm can use the InsideOut variant as a blackbox to meet the “relaxed subw” runtime. The relaxed widths are no larger than their non-relaxed counterparts, and are strictly smaller for some classes of queries, which means our algorithms yield asymptotic improvements over existing ones.

Answering FAQ over other semirings

Next, to set the stage for answering FAQ-AI over non-Boolean semirings, in Section 3.3 we revisit FAQ over non-Boolean semirings, where no known algorithm can achieve the subw-runtime. Here, we relax the set of polymatroids to a superset of relaxed polymatroids. Then, by relaxing the subw definition over relaxed polymatroids, we obtain a new width parameter called “sharp submodular width” (#subw). We show how a variant of PANDA, called #PANDA, can achieve a runtime governed by #subw for evaluating FAQ over non-Boolean semirings. We prove that subw ≤ #subw ≤ fhtw, and that there are classes of queries where #subw is unboundedly smaller than fhtw.

Answering FAQ-AI over other semirings

Getting back to FAQ-AI, we apply the #subw result under both relaxations: relaxed TD and relaxed polymatroids, to obtain a new width parameter called the relaxed #subw. We show that the new variants of PANDA and InsideOut can achieve the relaxed #subw runtime. We also show that there are queries for which relaxed #subw is essentially the best we can hope for, modulo k-sum hardness.

Applications in relational Machine Learning

Equipped with the algorithms for answering FAQ-AI, in Section 4 we return to relational machine learning applications over datasets defined by feature extraction queries over relational databases. We show how one can train linear SVM, k-means, and ML models over Huber/hinge loss functions without completely materializing the output of the feature extraction queries. In particular, this shows that for these important classes of ML models, one can sometimes train models in time sub-linear in the training dataset size.

1.4. Related work

Appendix C revisits two prior results on the evaluation of queries with inequalities through the lens of FAQ-AI: Core XPath queries over XML documents and inequality joins over tuple-independent probabilistic databases (Olteanu and Huang, 2009). Throughout the paper, we contrast our new width notions with fhtw and subw and our new algorithm #PANDA with the state-of-the-art algorithms PANDA and InsideOut for FAQ and FAQ-AI queries. A seminal work considers the containment and minimization problem for queries with inequalities (Klug, 1988). There is a large body of work on queries with disequalities (not-equal), e.g., (Abo Khamis et al., 2019) and references therein; disequalities are at times referred to as inequalities.

Section 4 sets the context for our results on machine learning.

2. Preliminaries

Throughout the paper, we use the following convention. For any Boolean event/variable $E$ and a given semiring $(D, \oplus, \otimes)$, let $\mathbf{1}_E$ denote the indicator variable, which takes the value $\mathbf{1}$ (the multiplicative identity) if $E$ holds (or is true), and $\mathbf{0}$ (the additive identity) otherwise. We assume without loss of generality in the paper that the semiring operations $\oplus$ and $\otimes$ can be performed in $O(1)$-time. (When the assumption does not hold, for the set semiring for instance, we can multiply the claimed runtime with the real operation’s runtime.)

2.1. Tree decompositions and polymatroids

We briefly define tree decompositions and the fhtw and subw parameters. We refer the reader to the recent survey by Gottlob et al. (2016) for more details and historical context. In what follows, the hypergraph should be thought of as the hypergraph of the input query, although the notions of tree decomposition and width parameters are defined independently of queries.

A tree decomposition of a hypergraph $H = (V, E)$ is a pair $(T, \chi)$, where $T$ is a tree and $\chi : V(T) \to 2^{V}$ maps each node $t$ of the tree to a subset $\chi(t)$ of vertices such that

  1. every hyperedge $F \in E$ is a subset of some $\chi(t)$, $t \in V(T)$ (i.e., every edge is covered by some bag),

  2. for every vertex $v \in V$, the set $\{t \mid v \in \chi(t)\}$ is a non-empty (connected) sub-tree of $T$. This is called the running intersection property.

The sets $\chi(t)$ are often called the bags of the tree decomposition.

Let $\mathrm{TD}(H)$ denote the set of all tree decompositions of $H$. When $H$ is clear from context, we use TD for brevity.
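As an illustration of the two properties, the sketch below checks whether a candidate pair of tree and bags is a tree decomposition of a small hypergraph. The function name and the input encoding are assumptions made for this example, and the code assumes (without checking) that the given edges really form a tree on the bag identifiers.

```python
def is_tree_decomposition(tree_edges, bags, hyperedges):
    """Check the two tree-decomposition properties.

    tree_edges: list of (s, t) pairs, assumed to form a tree on the keys of `bags`.
    bags:       dict mapping each tree node to a set of hypergraph vertices.
    hyperedges: list of sets of hypergraph vertices.
    """
    # Property 1: every hyperedge is contained in some bag.
    for e in hyperedges:
        if not any(e <= bag for bag in bags.values()):
            return False

    # Property 2 (running intersection): for every vertex, the tree nodes
    # whose bags contain it induce a connected subtree.
    adj = {t: set() for t in bags}
    for s, t in tree_edges:
        adj[s].add(t)
        adj[t].add(s)
    for v in set().union(*hyperedges):
        nodes = {t for t, bag in bags.items() if v in bag}
        start = next(iter(nodes))  # non-empty thanks to property 1
        seen, stack = {start}, [start]
        while stack:
            cur = stack.pop()
            for nxt in (adj[cur] & nodes) - seen:
                seen.add(nxt)
                stack.append(nxt)
        if seen != nodes:
            return False
    return True

path_edges = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
tree = [(1, 2), (2, 3)]
good = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
bad = {1: {"a", "b"}, 2: {"c", "d"}, 3: {"b", "c"}}
print(is_tree_decomposition(tree, good, path_edges))  # True
print(is_tree_decomposition(tree, bad, path_edges))   # False
```

In the second candidate, vertex `b` appears in bags 1 and 3 but not in bag 2, which lies between them, so the running intersection property fails.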

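To define width parameters, we use the polymatroid characterization from Abo Khamis et al. (2017b). A function $f : 2^{V} \to \mathbb{R}_+$ is called a (non-negative) set function on $V$. A set function $f$ on $V$ is modular if $f(S) = \sum_{v \in S} f(\{v\})$ for all $S \subseteq V$, is monotone if $f(X) \le f(Y)$ whenever $X \subseteq Y$, and is submodular if $f(X \cup Y) + f(X \cap Y) \le f(X) + f(Y)$ for all $X, Y \subseteq V$. A monotone, submodular set function $h$ with $h(\emptyset) = 0$ is called a polymatroid. Let $\Gamma_n$ denote the set of all polymatroids on $V$ with $|V| = n$.

Given some hypergraph $H = (V, E)$, define the set of edge dominated set functions:

(6) $\mathrm{ED} = \{ h \mid h : 2^{V} \to \mathbb{R}_+,\ h(S) \le 1 \text{ for all } S \in E \}$

With this, we define the fractional hypertree width and the submodular width of a given hypergraph $H$:

(7) $\mathrm{fhtw}(H) = \min_{(T,\chi) \in \mathrm{TD}} \ \max_{h \in \mathrm{ED} \cap \Gamma_n} \ \max_{t \in V(T)} h(\chi(t))$
(8) $\mathrm{subw}(H) = \max_{h \in \mathrm{ED} \cap \Gamma_n} \ \min_{(T,\chi) \in \mathrm{TD}} \ \max_{t \in V(T)} h(\chi(t))$

It is known (Marx, 2013) that $\mathrm{subw}(H) \le \mathrm{fhtw}(H)$, and there are classes of hypergraphs with bounded subw and unbounded fhtw. Furthermore, fhtw is never larger than (generalized) hypertree width, and is at most tree width plus one.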
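The three conditions that define a polymatroid (zero on the empty set, monotone, submodular) can be checked by brute force on tiny ground sets. The sketch below does so; the function name `is_polymatroid` and the two test functions are assumptions chosen for illustration, and the check is exponential in the number of vertices.

```python
from itertools import combinations

def is_polymatroid(h, vertices):
    """Brute-force check that h (a function on frozensets) is zero on the empty set,
    monotone and submodular. Intended for tiny examples only."""
    vs = list(vertices)
    subsets = [frozenset(c) for r in range(len(vs) + 1) for c in combinations(vs, r)]
    if h(frozenset()) != 0:
        return False
    for x in subsets:
        for y in subsets:
            if x <= y and h(x) > h(y):             # monotonicity violated
                return False
            if h(x | y) + h(x & y) > h(x) + h(y):  # submodularity violated
                return False
    return True

print(is_polymatroid(lambda s: min(len(s), 2), "abc"))  # True
print(is_polymatroid(lambda s: len(s) ** 2, "abc"))     # False
```

The first function is the rank function of a uniform matroid, which is a polymatroid. The second is monotone but not submodular: two disjoint singletons have values summing to 2 while their union has value 4.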

Remark 2.1.

Prior to Abo Khamis et al. (2017b), the commonly used definition of $\mathrm{fhtw}(H)$ is $\min_{(T,\chi) \in \mathrm{TD}} \max_{t \in V(T)} \rho^*_E(\chi(t))$, where $\rho^*_E(B)$ is the fractional edge cover number of a vertex set $B$ using the hyperedge set $E$. It is straightforward to show, using linear programming duality (Abo Khamis et al., 2017b), that

(9) $\max_{h \in \mathrm{ED} \cap \Gamma_n} h(B) = \rho^*_E(B)$

proving the equivalence of the two definitions. However, the characterization (7) has two primary advantages: (i) it exposes the minimax / maximin duality between fhtw and subw, and more importantly (ii) it makes it completely straightforward to relax the definitions by replacing the constraints by other applicable constraints, as shall be shown in later sections.∎

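Definition 2.2 ($F$-connex tree decomposition (Bagan et al., 2007; Segoufin, 2013)).

Given a hypergraph $H = (V, E)$ and a set $F \subseteq V$, a tree decomposition $(T, \chi)$ of $H$ is $F$-connex if $F = \emptyset$ or the following holds: There is a nonempty subset $V' \subseteq V(T)$ that forms a connected subtree of $T$ and satisfies $\bigcup_{t \in V'} \chi(t) = F$.

We use $\mathrm{TD}_F$ to denote the set of all $F$-connex tree decompositions of $H$. (Note that when $F = \emptyset$, $\mathrm{TD}_F = \mathrm{TD}$.)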

2.2. InsideOut and PANDA

To answer the FAQ query (2), we need a model for the representation of the input factors $\psi_K$. The support of the function $\psi_K$ is the set of tuples $x_K$ such that $\psi_K(x_K) \neq \mathbf{0}$. We use $|\psi_K|$ to denote the size of its support. For example, if $\psi_K$ represents an input relation, then $|\psi_K|$ is the number of tuples in that relation. In practice, there often are factors with infinite support, e.g., a factor that represents a built-in function in a database, an arithmetic operator, or a comparison operator as in (3). To deal with this more general setting, the edge set $E$ is partitioned into two sets: the edges $K$ for which $|\psi_K|$ is finite, and the edges $K$ for which $|\psi_K| = \infty$. For simplicity, we often state runtimes of algorithms in terms of the “input size” $N$, the size of the largest finite factor. Moreover, we use $|Q|$ to denote the output size of $Q$.

InsideOut (Abo Khamis et al., 2016; Abo Khamis et al., 2017a)

To answer (2), the InsideOut algorithm works by eliminating variables, along with an idea called the “indicator projection”. Its runtime is described by the FAQ-width of the query, a slight generalization of fhtw. In the context of one semiring, we can define by applying Definition (7) over a restricted set of tree decompositions and edge dominated polymatroids. In particular, let denote the set of free variables in (2), and recall from Definition 2.2. Then,

(10)
(11)
(12)

Note that when and (i.e. ). A simple result from Abo Khamis et al. (2016) is the following:

Proposition 2.3.

InsideOut answers query (2) in time .

To solve the FAQ-AI (3), we can apply Proposition 2.3, treating the ligament edges as the infinite-support edges (because all ligament factors are infinite). But this is suboptimal; later, we show a new InsideOut variant that is polynomially better.

PANDA (Abo Khamis et al., 2017b)

In case of the Boolean semiring, i.e., when the FAQ query (2) is of the form

(13)

we can do much better than Proposition 2.3. When , Marx (Marx, 2013) showed that (13) can be answered in time . The PANDA algorithm (Abo Khamis et al., 2017b) generalizes Marx’s result to deal with general degree constraints, and to meet precisely the -runtime. In fact, PANDA works with queries such as (13) with free variables as well. In the context of this paper, we can define the following notion of submodular FAQ-width in a very natural way:

(14)

Then, the results from Abo Khamis et al. (2017b) imply:

Proposition 2.4.

PANDA answers query (13) in time .

These results only work for the Boolean semiring. Section 3 introduces a variant of PANDA, called #PANDA, that also works for non-Boolean semirings.

2.3. Semigroup range searching

Orthogonal range counting (and searching) is a classic and ubiquitous problem in computational geometry (de Berg et al., 2008): given a set of $N$ points in a $d$-dimensional space, build a data structure that, given any $d$-dimensional rectangle, can efficiently return the number of enclosed points. More generally, there is the semigroup range searching problem (Chazelle, 1988), where each point $p$ of the $N$ input points also has a weight $w(p) \in G$, where $(G, \oplus)$ is a semigroup. (In a semigroup we can add two elements using $\oplus$, but there is no additive inverse.) The problem is: given a $d$-dimensional rectangle $B$, compute $\bigoplus_{p \in B} w(p)$.

Classic results by Chazelle (Chazelle, 1988) show that there are data structures for semigroup range searching which can be constructed in time , and answer rectangular queries in -time. Also, this is almost the best we can hope for (Chazelle, 1990). There are more recent improvements to Chazelle’s result (see, e.g., Chan et al. (Chan et al., 2011)), but they are minor (at most a factor), as the original results were already very close to matching the lower bound.

Most of these range search/counting problems can be reduced to the dominance range searching problem (on semigroups), where the query is represented by a point $q$, and the objective is to return $\bigoplus_{p \,:\, q \preceq p} w(p)$. Here, $\preceq$ denotes the “dominance” relation (coordinate-wise $\le$). We can think of $q$ as the lower-corner of an infinite rectangle query.
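The sketch below illustrates dominance range searching in two dimensions. It is not Chazelle's data structure: it is a simpler offline technique (sort by the first coordinate, then sweep with a Fenwick tree over the second coordinate), and it assumes the semigroup operation is commutative. Like the semigroup setting described above, it only ever combines weights with `op` and never uses an inverse. The function name and input encoding are hypothetical.

```python
from bisect import bisect_left

def dominance_sums(points, queries, op=lambda a, b: a + b):
    """For each query (qx, qy), combine with `op` the weights of all points (x, y)
    with x >= qx and y >= qy. Returns None for a query that no point dominates.

    points:  list of ((x, y), weight) pairs.
    queries: list of (qx, qy) pairs, answered offline.
    """
    ys = sorted({p[1] for p, _ in points})
    rank = {y: len(ys) - i for i, y in enumerate(ys)}  # largest y gets rank 1
    tree = [None] * (len(ys) + 1)                      # Fenwick tree, None = empty

    def combine(a, b):
        if a is None:
            return b
        if b is None:
            return a
        return op(a, b)

    def update(i, w):
        while i < len(tree):
            tree[i] = combine(tree[i], w)
            i += i & -i

    def prefix(i):
        acc = None
        while i > 0:
            acc = combine(acc, tree[i])
            i -= i & -i
        return acc

    pts = sorted(points, key=lambda pw: pw[0][0], reverse=True)
    order = sorted(range(len(queries)), key=lambda k: queries[k][0], reverse=True)
    answers = [None] * len(queries)
    j = 0
    for k in order:
        qx, qy = queries[k]
        while j < len(pts) and pts[j][0][0] >= qx:     # insert points with x >= qx
            update(rank[pts[j][0][1]], pts[j][1])
            j += 1
        answers[k] = prefix(len(ys) - bisect_left(ys, qy))  # ranks of all y >= qy
    return answers

points = [((1, 1), 1), ((2, 3), 10), ((3, 2), 100), ((4, 4), 1000)]
queries = [(2, 2), (0, 0), (3, 3), (5, 5)]
print(dominance_sums(points, queries))          # [1110, 1111, 1000, None]
print(dominance_sums(points, queries, op=max))  # [1000, 1000, 1000, None]
```

Passing `op=max` shows why the semigroup view matters: maximum has no inverse, so a subtraction-based prefix-sum trick would not apply, yet the same sweep still works.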

3. Relaxed tree decompositions and relaxed polymatroids

3.1. Connection to a geometric data structure

We start with a special case of (3) in which the skeleton part contains only two hyperedges and . Formally, consider the aggregate query of the form

(15)

where and are two input functions/relations over variable sets and , respectively. We prove the following very simple but important lemma:

Lemma 3.1.

Let , and , then when , query (15) can be answered in time .

Proof.

If there is a hyperedge for which , then in a -time pre-processing step we can “absorb” the factor into the factor , by replacing with . A similar absorption can be done with . Hence, without loss of generality we can assume that and for all . Furthermore, we only need to show that we can compute (15) for , because after is computed, we can marginalize away variables in -time.

Abusing notation somewhat, for each and each , define the function by

(16)

Fix a tuple such that . A tuple is said to be -adjacent if . We show how to compute the following sum in poly-logarithmic time:

(17)
(18)

where the inner sum ranges over only tuples which are -adjacent; non-adjacent tuples contribute .

Now, for each define two -dimensional points:

(19)
(20)

We write to say that is dominated by coordinate-wise: . Assign to each point a “weight” of . Now, taking (18),

(21)
(22)

The expression thus computes, for a given “query point” , the weighted sum over all points that dominate the query point. This is precisely the dominance range counting problem, which—modulo a -preprocessing step—can be solved in time  (Chazelle, 1988), as reviewed in Section 2.3.

To conclude the proof, note that (15) can be written as (assuming as is the case in Lemma 3.1)

where the outer sum ranges over tuples in . ∎

Example 3.2.

Let be a binary relation. Suppose we want to count the number of tuples satisfying , then by setting , , , it is easy to see that the problem can be reduced to the form (15) with , . We can thus compute this count in time .∎
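To illustrate the idea behind Lemma 3.1 in its simplest setting, the sketch below counts the results of a two-relation join with a single additive inequality without enumerating the join. The query shape (two binary relations sharing one variable and the condition f(a) + g(c) ≤ 0) is a hypothetical instance chosen for this example, not the query of Example 3.2, and with one inequality the geometric data structure degenerates to a sorted list with binary search.

```python
from bisect import bisect_right
from collections import defaultdict

def count_join_with_inequality(R, S, f, g):
    """Count triples (a, b, c) with (a, b) in R, (b, c) in S and f(a) + g(c) <= 0."""
    # Group S by the join variable b and sort the g-values inside each group.
    g_by_b = defaultdict(list)
    for b, c_val in S:
        g_by_b[b].append(g(c_val))
    for vals in g_by_b.values():
        vals.sort()

    total = 0
    for a, b in R:
        vals = g_by_b.get(b, [])
        # f(a) + g(c) <= 0 is the same as g(c) <= -f(a): a one-dimensional range count.
        total += bisect_right(vals, -f(a))
    return total

R = [(1, "x"), (5, "x"), (2, "y")]
S = [("x", -3), ("x", -1), ("y", -10), ("z", 0)]
identity = lambda v: v
print(count_join_with_inequality(R, S, identity, identity))  # 3
```

The running time is dominated by sorting and one binary search per tuple of `R`, whereas the join itself can be quadratically large when many tuples share the same value of `b`.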

3.2. Relaxed tree decompositions

Equipped with this basic case, we can now proceed to solve the general setting of (3). To this end, we define a new width parameter.

Definition 3.3 (Relaxed tree decomposition).

Let denote a multi-hypergraph whose edge multiset is partitioned into and . A relaxed tree decomposition of (with respect to the partition ) is a pair , where is a tree, and satisfies the following properties:

  • The running intersection property holds: for each node the set is a connected subtree in .

  • Every “skeleton” edge is covered by some bag , .

  • Every “ligament” edge is covered by the union of two adjacent bags: , where .

Let denote the set of all relaxed tree decompositions of (with respect to the skeleton-ligament partition). When is clear from context we use for the sake of brevity. Let denote the set of all relaxed -connex tree decompositions of .
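The difference between the skeleton and ligament coverage conditions can be seen in a short sketch. The function name `covers_relaxed` and the example are hypothetical. The running intersection property is assumed to be checked separately, and the code additionally accepts a ligament edge that already fits in a single bag, which is an assumption that only matters for single-node trees.

```python
def covers_relaxed(tree_edges, bags, skeleton, ligament):
    """Coverage conditions of a relaxed tree decomposition.

    tree_edges: list of (s, t) pairs of adjacent tree nodes.
    bags:       dict mapping each tree node to a set of vertices.
    skeleton, ligament: lists of sets of vertices.
    """
    # Skeleton edges must fit inside a single bag.
    if not all(any(e <= bag for bag in bags.values()) for e in skeleton):
        return False
    # Ligament edges only need the union of two adjacent bags.
    unions = [bags[s] | bags[t] for s, t in tree_edges] + list(bags.values())
    return all(any(e <= u for u in unions) for e in ligament)

bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
tree = [(1, 2), (2, 3)]
skeleton = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
print(covers_relaxed(tree, bags, skeleton, [{"a", "c"}]))  # True: bags 1 and 2 are adjacent
print(covers_relaxed(tree, bags, skeleton, [{"a", "d"}]))  # False: bags 1 and 3 are not adjacent
```

In a non-relaxed decomposition the edge on `a` and `c` would force a bag containing both variables, so the relaxation can keep bags smaller.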

3.2.1. FAQ-AI on a general semiring

We use relaxed TDs in conjunction with Lemma 3.1 to answer FAQ-AI with a relaxed notion of faqw. In particular, the relaxed width parameters of are defined in exactly the same way as the usual width parameters defined in Section 2, except we allow the TDs to range over relaxed ones.

Definition 3.4 (Relaxed faqw).

Let be an FAQ-AI query (3), and be its hypergraph. Furthermore, let denote the set of hyperedges for which . Then, the relaxed FAQ-width of is defined by

(23)

When , collapses back to , in which case we define the relaxed fhtw for FAQ-AI without free variables:

(24)

A relaxed tree decomposition of is optimal if its width is equal to , i.e., .

Theorem 3.5.

Any FAQ-AI query of the form (3) on any semiring can be answered in time , where is the maximum number of additive inequalities covered by a pair of adjacent bags in an optimal relaxed tree decomposition. (Note that can be a lot smaller than .)

Proof.

We first consider the case of no free variables (i.e. ), because this case captures the key idea. Fix an optimal relaxed TD . We first compute, for each bag of the tree decomposition, a factor such that

(25)
(26)

To define the factors , we need the notion of the indicator projection (Abo Khamis et al., 2017a; Abo Khamis et al., 2016). For a given and such that , the indicator projection of onto the bag is a function defined by

(27)
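As a sketch of the indicator projection, the code below assumes the standard definition from the InsideOut line of work: the projection takes the value 1 on a tuple over the variables shared by the factor and the bag exactly when that tuple extends to some tuple in the factor's support. The function name and the listing representation of a factor are hypothetical.

```python
def indicator_projection(edge_vars, support, bag):
    """Project the support of a factor onto the variables it shares with a bag.

    edge_vars: tuple of variable names of the factor.
    support:   iterable of value tuples on which the factor is non-zero.
    bag:       set of variable names.
    Returns (shared_vars, tuples); the indicator projection is 1 exactly on `tuples`.
    """
    shared = tuple(v for v in edge_vars if v in bag)
    idx = [edge_vars.index(v) for v in shared]
    return shared, {tuple(row[i] for i in idx) for row in support}

shared, tuples = indicator_projection(("a", "b"), {(1, 2), (1, 3), (4, 5)}, {"a", "c"})
print(shared, sorted(tuples))  # ('a',) [(1,), (4,)]
```

Multiplying a bag's factor by such projections only removes tuples that cannot contribute to the final answer, which is how they help keep intermediate results small.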

Recall from Definition 3.3 that every is covered by at least one bag for . Fix an arbitrary coverage assignment , where is covered by the bag . Then, the factors are defined by:

(28)

It is straightforward to verify that (26) holds. Using any worst-case optimal join algorithm (Ngo, 2018; Ngo et al., 2012; Veldhuizen, 2014) we can compute (28) in time

(29)

Over all , our runtime is bounded by , where

(30)

In addition, the support of each factor has size bounded by .

Next we compute (26) in time . We will make use of the fact that is a relaxed TD. Fix an arbitrary root of the tree decomposition ; following InsideOut, we compute (26) by eliminating variables from the leaves of up to the root. Without loss of generality, we assume that the tree decomposition is non-redundant, i.e., no bag is a subset of another in the tree decomposition (otherwise the factor of the contained bag can be “absorbed” into the factor of the containing bag). Let be any leaf of , be its parent, where and . Now write (26) as follows:

(31)
(32)

The third equality uses the semiring’s distributive law. (Note that implies that thanks to Definition 3.3 and the fact that is the only neighbor of .) Lemma 3.1 implies that we can compute the sub-query in the allotted time. The above step eliminates all variables in . Repeatedly applying the above step yields the desired output .

When the query has free variables, the algorithm proceeds similarly to the case of an FAQ query with free variables (Abo Khamis et al., 2017a; Abo Khamis et al., 2016). ∎

Example 3.6.

Given 3 binary relations and , consider a query about the number of tuples that satisfy:

(33)

The query has and . Let . Note that . In fact, any of the previously known algorithms, e.g. (Abo Khamis et al., 2016; Abo Khamis et al., 2017a), would take time to answer . However, this query has , and by Theorem 3.5, it can be answered in time . (Note that here .) An optimal relaxed tree decomposition is shown in Figure 1.∎

Figure 1. An optimal relaxed tree decomposition for the query in Example 3.6. Ligament edges are dashed. Each skeleton edge is held in one bag.

We next give a couple of simple lower and upper bounds for . The upper bound shows that, effectively is the best we can hope for, if the FAQ-AI query is arbitrary. The lower bound shows that, while the relaxed tree decomposition idea can improve the runtime by a polynomial factor, it cannot improve the runtime over straightforwardly applying InsideOut (over non-relaxed tree decompositions) by more than a polynomial factor.

Proposition 3.7.

For any positive integer , there exists an FAQ-AI query of the form (3) for which , and it cannot be answered in time , modulo -sum hardness.

Proposition 3.8.

For any FAQ-AI query of the form (3), we have ; in particular, when has no free variables .

3.2.2. FAQ-AI on the Boolean semiring

Before formally explaining how we can adapt PANDA to solve an FAQ-AI query on the Boolean semiring, we give the intuition with an example.

Example 3.9.

Consider the following FAQ-AI (written in Datalog):

(34)

Here . Using fractional hypertree width measure and InsideOut (even with relaxed TDs and Theorem 3.5), the best runtime is , because no matter which (relaxed) TD we choose, the worst-case bag relation size is . A key idea of the PANDA framework (Abo Khamis et al., 2017b) is the use of a disjunctive Datalog rule. Consider the following disjunctive Datalog rule:

(35)

There are two relations in the head and , and they form a solution to the rule iff the following holds: if satisfies the body, then either or . Via information-theoretic inequalities (Abo Khamis et al., 2017b), we are able to show that PANDA can compute a solution to the above disjunctive Datalog rule in time . In particular, both and are bounded by .

Given the solution to (35), it is straightforward to verify that the following also holds, using the distributivity of $\wedge$ over $\vee$:

(36)

By semijoin-reducing against , and semijoin-reducing against , we conclude that

Finally, we have a rewrite of the original body:

(37)

By defining intermediate rules, we can compute from them:

(38)
(39)
(40)

The key point is that and are of the form (15), and thus they each can be answered in -time (since ). This implies that can be answered in -time overall.∎

The strategy outlined in the above example uses PANDA to evaluate an FAQ-AI query over the Boolean semiring. The resulting algorithm achieves a natural generalization of the submodular FAQ-width defined in (14):

Definition 3.10.

Let $Q$ be an FAQ-AI query of the form (3) over the Boolean semiring. The relaxed submodular FAQ-width of $Q$ is defined by

(41)

(Recall that the set of relaxed tree decompositions was defined in Definition 3.3.)

Theorem 3.11.

Any FAQ-AI query of the form (3) on the Boolean semiring can be answered in time