A Layered Aggregate Engine for Analytics Workloads

06/20/2019 ∙ Maximilian Schleich et al.

This paper introduces LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory optimization and execution engine for batches of aggregates over the input database. The primary motivation for this work stems from the observation that for a variety of analytics over databases, their data-intensive tasks can be decomposed into group-by aggregates over the join of the input database relations. We exemplify the versatility and competitiveness of LMFAO for a handful of widely used analytics: learning ridge linear regression, classification trees, regression trees, and the structure of Bayesian networks using Chow-Liu trees; and data cubes used for exploration in data warehousing. LMFAO consists of several layers of logical and code optimizations that systematically exploit sharing of computation, parallelism, and code specialization. We conducted two types of performance benchmarks. In experiments with four datasets, LMFAO outperforms by several orders of magnitude, on one hand, a commercial database system and MonetDB for computing batches of aggregates, and, on the other hand, TensorFlow, Scikit, R, and AC/DC for learning a variety of models over databases.


1. Introduction

This work has its root in two observations. First, the majority of practical analytics tasks involve relational data, with the banking or retail domains exceeding 80% (Kaggle, 2018). Second, for a variety of such analytics tasks, their data-intensive computation can be reformulated as batches of group-by aggregates over the join of the database relations (Schleich et al., 2016; Abo Khamis et al., 2018a).

We introduce LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory optimization and execution engine for batches of aggregates over relational data. We exemplify the versatility and competitiveness of LMFAO for a handful of widely used analytics: learning ridge linear regression, classification trees, regression trees, and the structure of Bayesian networks using Chow-Liu trees; and data cubes used for exploration in data warehousing.

Query processing lies at the core of database research, with four decades of innovation and engineering on query engines for relational databases. Without doubt, the efficient computation of a handful of group-by aggregates over a join is well-supported by mature academic and commercial systems and also widely researched. There is relatively less development for large batches of such queries, with initial work in the context of data cubes (Gray et al., 1996; Harinarayan et al., 1996; Mumick et al., 1997) and SQL-aware data mining systems (Chaudhuri, 1998; Chaudhuri et al., 1999) from two decades ago.

We show that by designing for the workload required by analytics tasks, LMFAO can outperform general-purpose mature database systems such as PostgreSQL, MonetDB, and a commercial database system by orders of magnitude. This is not only a matter of query optimization, but also of execution. Aspects of LMFAO's optimized execution for query batches can be cast in SQL and fed to a database system. Such SQL queries capture the decomposition of aggregates into components that can be pushed past joins and shared across aggregates, and as such they may create additional intermediate aggregates. This poses scalability problems to these systems due to, e.g., design limitations such as the maximum number of columns or the lack of efficient query batch processing, and leads to larger compute times than for the plain unoptimized queries. This hints at LMFAO's distinct design that departs from mainstream query processing.

The performance advantage brought by LMFAO's design becomes even more apparent for end-to-end applications. For the aforementioned use cases in machine learning, the application layer takes relatively insignificant time as it offloads all data-intensive computation to LMFAO. LMFAO computes from the input database sufficient statistics whose size ranges from tens of KBs to hundreds of MBs (Table 2) and that are used for learning regression and classification models. Mainstream solutions, e.g., MADlib (Hellerstein et al., 2012), R (R Core Team, 2013), Scikit-learn (Pedregosa et al., 2011), and TensorFlow (Abadi et al., 2016), either take orders of magnitude more time than LMFAO to train the same model or do not work due to various design limitations. These solutions use data systems to materialize the training dataset, which is defined by a feature extraction query over a database of multiple relations, and ML libraries to learn models over this dataset. We confirm experimentally that the main bottleneck of these solutions is this materialization: The training datasets can be an order of magnitude larger than the input databases used to create them (Table 1). In addition to being expected to work on much larger inputs, the ML libraries are less scalable than the data systems. Furthermore, these solutions inherit the limitations of both of their underlying systems, e.g., the maximum data frame size in R and the maximum number of columns in PostgreSQL are much smaller than typical database sizes and respectively numbers of model features.

1.1. Problem Statement

LMFAO evaluates batches of queries of the following form:

SELECT F, SUM(α_1), …, SUM(α_ℓ)
FROM R_1 NATURAL JOIN … NATURAL JOIN R_m
GROUP BY F;

where F is a set of group-by attributes and R_1, …, R_m are the database relations. The user-defined aggregate functions (UDAFs), or simply aggregates, α_1, …, α_ℓ can be sums of products of functions over the attributes X_1, …, X_n:

α = Σ_{i∈[s]} ∏_{j∈[n]} f_{i,j}(X_j)

We next give examples of such aggregates. To express count and sum aggregates, i.e., SUM(1) and SUM(X_j) for some attribute X_j, we take s = 1 and then f_{1,j} is the constant function f_{1,j}(x) = 1 and respectively the identity function f_{1,j}(x) = x for X_j (with the constant function 1 for the other attributes). To encode a selection condition X_j op t that defines a decision tree node, where X_j is an attribute, op is a binary operator, and t is a value in the domain of X_j, we use the Kronecker delta 1_{X_j op t}, which evaluates to 1 if the condition is true and 0 otherwise. A further example is given by s = n, f_{i,i}(x) = θ_i · x, and f_{i,j}(x) = 1 for j ≠ i, for the constant parameters θ_i and the identity functions. Then, α is Σ_{i∈[n]} θ_i · X_i and captures the linear regression function with parameters θ_i and features X_i. A final example is that of an exponential n-ary function f(x_1, …, x_n) = e^{Σ_i θ_i · x_i}, which is used for logistic regression.
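To make the shape of these aggregates concrete, here is a minimal Python sketch (illustrative only, not LMFAO's interface) of count, sum, and decision-tree-node aggregates as products of per-attribute functions; the attribute names are invented for the example:

```python
import operator

# A minimal sketch of aggregates as products of per-attribute functions.

def const_one(_):          # f(x) = 1, used for COUNT
    return 1

def identity(x):           # f(x) = x, used for SUM(X)
    return x

def indicator(op, t):      # Kronecker delta for a condition "X op t"
    return lambda x: 1 if op(x, t) else 0

def product_aggregate(fs):
    """One product of functions, one function per attribute."""
    def alpha(tup):
        r = 1
        for attr, f in fs.items():
            r *= f(tup[attr])
        return r
    return alpha

count = product_aggregate({"units": const_one})
sum_u = product_aggregate({"units": identity})
node  = product_aggregate({"units": identity,
                           "promo": indicator(operator.eq, 1)})

tuples = [{"units": 3, "promo": 1}, {"units": 5, "promo": 0}]
assert sum(count(t) for t in tuples) == 2   # COUNT(*)
assert sum(sum_u(t) for t in tuples) == 8   # SUM(units)
assert sum(node(t) for t in tuples) == 3    # SUM(units * 1[promo = 1])
```

Summing the product over all tuples yields the SUM aggregate of the query head.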

Applications, such as those in Section 2, may generate batches of tens to thousands of aggregates (Table 2). They share the same join of database relations, possibly extended with relations defined by queries of the same form as above.

1.2. The Layers of LMFAO

To evaluate aggregate batches, LMFAO employs a host of techniques, either novel or adaptations of known concepts to our specific workload. The layered architecture of LMFAO is given in Figure 1 and highlighted next. Section 3 expands on the key design choices behind LMFAO.

The Join Tree layer takes as input the batch of aggregates, the database schema, and cardinality constraints (e.g., sizes of relations and attribute domains) and produces one join tree that is used to compute all aggregates. This step uses state-of-the-art techniques (Abiteboul et al., 1995). (For cyclic queries, we first compute a hypertree decomposition and materialize its bags (cycles) to obtain a join tree.)

Figure 1. The Optimization Layers of LMFAO: the application emits a batch of aggregates over a join tree; the logical optimization layers (Find Roots, Aggregate Pushdown, Merge Views) are followed by the code optimization layers (Group Views, Multi-Output Optimization, Parallelization, Compilation).

The Find Roots layer is novel and affects the design of all subsequent layers. By default, LMFAO computes each group-by aggregate in one bottom-up pass over the join tree, by decomposing the aggregate into views computed along each edge in the join tree. We allow for different traversals of the join tree: different aggregates may be computed over the same join tree rooted at different nodes. This can reduce the overall compute time for the batch as it can reduce the number of views and increase the sharing of their computation. In our experiments, the use of multiple roots for the computation of aggregate batches led to considerable speedups.

LMFAO uses directional views to support different traversals of the join tree: For each edge between two nodes, there may be views flowing in both directions. Directional views are similar in spirit to messages in the message passing algorithm used for inference in graphical models (Pearl, 1982). Figure 3 (middle) depicts directional views along a join tree.

In the Aggregate Pushdown layer, each aggregate is decomposed into one directional view per edge in the join tree. The view at an edge going out of a node n computes the aggregate restricted to the subtree rooted at n, and is defined over the join of the views at the incoming edges of n and of the relation at n. The directions of these views are from the leaves to the root chosen for the aggregate. The rationale for this decomposition is twofold. First, it partially pushes the aggregates past joins (represented by edges in the tree), as in prior work on eager computation of one aggregate (Yan and Larson, 1995) and its generalization to factorized databases (Bakibayev et al., 2013). Second, it allows for sharing common views across the aggregates.

The Merge Views layer consolidates the views generated in the previous layer. There are three types of merging possible for views along the same edge in the join tree, depending on whether they have the same: group-by attributes; aggregates; and body. Views with the same direction are defined over the same subtree of the join tree. We first identify identical views constructed for different aggregates and only keep one copy. We then merge the views with the same group-by attributes and body but different aggregates. We finally merge views with the same group-by attributes and different bodies. This consolidation is beneficial. For instance, there are 814 aggregates to compute for learning a linear regression model over the join of five relations in our Retailer dataset, and one view per aggregate and edge in the join tree. These are consolidated into a much smaller set of views that have between themselves 1,468 aggregates.
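The first two merging steps can be sketched as follows; the view representation as a (direction, group-by, body, aggregates) record is our own simplification, not LMFAO's internal data structure:

```python
# Sketch of view consolidation: identical views are deduplicated, and
# views with the same direction, group-by, and body are merged by
# concatenating their aggregate lists.
from collections import OrderedDict

def merge_views(views):
    merged = OrderedDict()
    for direction, group_by, body, aggs in views:
        key = (direction, tuple(group_by), body)
        merged.setdefault(key, [])
        for a in aggs:
            if a not in merged[key]:        # drop exact duplicates
                merged[key].append(a)
    return [(d, list(g), b, aggs) for (d, g, b), aggs in merged.items()]

views = [
    ("O->T", ["date"], "join(Oil)", ["SUM(1)"]),
    ("O->T", ["date"], "join(Oil)", ["SUM(1)"]),       # identical copy
    ("O->T", ["date"], "join(Oil)", ["SUM(price)"]),   # same group-by/body
]
out = merge_views(views)
assert len(out) == 1
assert out[0][3] == ["SUM(1)", "SUM(price)"]
```

The third step, merging views with the same group-by attributes but different bodies, additionally rewrites the view bodies and is not shown here.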

The previous three layers are concerned with logical transformations of view expressions. The remaining layers consider optimizations not expressible at the syntactic level.

In the Group Views layer, we group the views going out of the same node, possibly along different edges, such that there is no dependency between them. No dependency means that they can be evaluated together once the incoming views used in their joins are computed. The views in a group do not necessarily have the same group-by attributes, so a view group has multiple outputs. To continue our example, the remaining views are clustered into view groups.

The view group is a computational unit in LMFAO. At the Multi-Output Optimization layer, we construct the execution plan for each view group at a node. This plan needs one pass over the relation at that node, with lookups using the join keys into the incoming views to fetch aggregates needed for the computation of the views in the group. This is yet another instance of sharing in LMFAO: The computation of different views shares the scan of the relation at the node. This is particularly beneficial for snowflake schemas with large fact relations, e.g., Inventory in the Retailer and Sales in the Favorita datasets. This scan sees the relation organized logically as a trie: first grouped by one attribute, then by the next in the context of values for the first, and so on. This trie organization is reminiscent of factorized databases (Bakibayev et al., 2012) and LeapFrog TrieJoin (Veldhuizen, 2014) and can visit up to three times fewer values than a standard row-based scan for our datasets. In our experiments, this layer brought additional speedup.
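A minimal sketch of the multi-output idea, with toy data and illustrative view contents: one shared scan over the relation computes two outputs at once while probing an incoming view by join key:

```python
# Sketch of a multi-output plan: one scan over a relation (shown sorted
# by (item, date), mimicking the trie-style scan) computes two views at
# once, probing an incoming view by join key.
from collections import defaultdict

sales = [  # (item, date, units); schema is illustrative
    ("i1", "d1", 3), ("i1", "d2", 5), ("i2", "d1", 2),
]
v_holidays = {"d1": 1, "d2": 0}   # incoming view: date -> holiday count

v_by_item = defaultdict(int)      # output 1: SUM(units) per item
v_total = 0                       # output 2: SUM(units * holiday count)
for item, date, units in sales:   # single shared scan
    h = v_holidays[date]          # lookup into the incoming view
    v_by_item[item] += units
    v_total += units * h

assert dict(v_by_item) == {"i1": 8, "i2": 2}
assert v_total == 5               # 3*1 + 5*0 + 2*1
```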

The Parallelization layer addresses task and domain parallelism. LMFAO parallelizes the computation of multi-output plans for view groups that do not depend on each other. For this, it computes the dependency graph of the view groups. LMFAO also partitions the largest input relations and allocates a thread per partition to compute the multi-output plan on that partition. This layer brought additional speedup on a machine with four vCPUs (AWS d2.xlarge).

Finally, the Compilation layer generates C++ code for the parallel execution of multi-output plans. This code is specialized to the join tree and database schema, with separate code for each view group and also for general tasks such as data loading. The separate code chunks are compiled in parallel. The code layout for each view group is designed to maximize the computation sharing across many aggregates with different group-by and UDAFs via the introduction of local variables, and to minimize the number of accesses (initialization, update, lookup) to these local variables. LMFAO adopts various low-level code optimizations: inlining function calls; organization of the aggregates for each view in a contiguous fixed-size array and ordered to allow sequential read/write; reuse of arithmetic operations, e.g., repeating multiplication of entries in the aggregate array; and synthesis of loops from long sequences of lockstep computations. The latter two optimizations are enabled by sorted input relations and views that are accessed in lockstep. The organization of aggregates allows us to manage them in contiguous batches. This is reminiscent of vectorization (Zukowski et al., 2012), now applied to aggregates instead of data records.

Some applications require the computation of UDAFs that change between iterations depending on the outcome of computation. For instance, the nodes in a decision tree are iteratively constructed in the context of conditions that are selected based on the data. The application tags these functions as dynamic to instruct LMFAO to avoid inlining their calls and instead generate separate code that is compiled between iterations and linked dynamically.

1.3. Contributions

To sum up, the contributions of this work are as follows:

1. We introduce LMFAO, a principled layered approach to computing large batches of group-by aggregates over joins. Its layers encompass several stages of logical and code optimization that come with novel contributions as well as adaptations of known techniques to a novel setting. The novel contributions are on: using different traversals of the same join tree to solve many aggregates with different group-by clauses; synthesizing directional views out of large sets of views representing components of aggregate queries; and the multi-output execution plans for computing groups of directional views using one pass over the input relations. It adapts compilation techniques to generate specialized code for the parallel computation of multi-output plans for aggregate queries with static and dynamic user-defined functions.

2. We show the versatility of LMFAO for a range of analytics applications built on top of it.

3. We implemented LMFAO in C++ and conducted two kinds of performance benchmarks: the computation of aggregate batches and of end-to-end applications using these aggregates. In experiments with four datasets, LMFAO outperforms by several orders of magnitude, on one hand, PostgreSQL, MonetDB, and a commercial DBMS for computing aggregate batches, and, on the other hand, TensorFlow, Scikit, R, and AC/DC for learning models over databases.

2. Applications

LMFAO encompasses a unified formulation and processing of core data processing tasks in database, data mining, and machine learning problems. We exemplify with a small sample of such problems: data cubes; gradients and covariance matrices used for linear regression, polynomial regression, factorization machines; classification and regression trees; and mutual information of pairwise variables used for learning the structure of Bayesian networks.

We next introduce a compact query syntax and use it to formulate the above-mentioned data processing tasks.

Query Language

We are given a database D of (materialized) relations R_1, …, R_m over relation schemas Ω_1, …, Ω_m. For convenience, we see relation schemas, which are lists of attributes, also as sets. The list of attributes in the database is denoted by Ω = Ω_1 ∪ … ∪ Ω_m.

We would like to compute a set of group-by aggregates over the natural join of these relations. This join may represent the training dataset for machine learning models, the input for multi-dimensional data cubes, or the joint probability distribution to be approximated by a Bayesian network.

We use the following query formulation, which is more compact than the SQL form from Section 1.1:

Q(F; α_1, …, α_ℓ) += R_1(Ω_1), …, R_m(Ω_m)    (1)

In the head of Q, the group-by attributes F are separated from the aggregate functions by a semicolon; we omit the semicolon if there are no group-by attributes. The aggregate functions are as defined in Section 1.1. We use += to capture the SUM over each aggregate function. In the query body, we make explicit the attributes Ω_i of each relation R_i for a clearer understanding of the definitions of the aggregate functions. By definition, there is a functional dependency F → (α_1, …, α_ℓ).

Our queries generalize FAQ-SS (Abo Khamis et al., 2016) and MPF (Marginalize a Product Function) (Aji and McEliece, 2006) by allowing tuples of arbitrary UDAFs.

Ridge Linear Regression

Assuming one parameter θ_j per attribute (feature) X_j, the linear regression model is given by:

LR(x_1, …, x_n) = Σ_{j∈[n]} θ_j · x_j

In practice, features may be defined by attributes in both the input relations and results of queries over these relations.

We assume without loss of generality that (1) X_1 only takes value 1 and then θ_1 is the so-called intercept and (2) X_n is the label and has a corresponding new parameter θ_n = −1.

The error of the model is given by an objective function J(θ) that is the sum of the least squares loss function and of the penalty term that is the ℓ_2 norm of the parameter vector θ:

J(θ) = (1 / 2|D|) · Σ_{(x_1,…,x_n)∈D} (Σ_{j∈[n]} θ_j · x_j)² + (λ/2) · ‖θ‖_2²

We optimize the model using batch gradient descent (BGD), which updates the parameters in the direction of the gradient vector ∇J(θ) of J(θ) using a step size η:

repeat until convergence:
  θ := θ − η · ∇J(θ)

The above update relies on the aggregates for the size |D| of the dataset and the product of each x_k with the inner product Σ_j θ_j · x_j. There are two ways to express these aggregates. The common approach, which we call the gradient vector, is to compute this inner product and then, for each gradient entry k, multiply it with the corresponding x_k. This requires recomputation for each new vector of parameters. The second approach (Schleich et al., 2016) is to rewrite Σ_{x∈D} (Σ_j θ_j · x_j) · x_k as Σ_j θ_j · (Σ_{x∈D} x_j · x_k) and compute the non-centered covariance matrix (the covar matrix hereafter) of entries Σ_{x∈D} x_j · x_k.

The covar matrix accounts for all pairwise multiplications X_j · X_k. Each entry can be computed as the aggregate query:

Q_{j,k}( ; SUM(X_j · X_k)) += R_1(Ω_1), …, R_m(Ω_m)    (2)

Categorical attributes are one-hot encoded in a linear regression model. In our formalism, such attributes become group-by attributes. If only X_j is categorical, we get:

Q_{j,k}(X_j; SUM(X_k)) += R_1(Ω_1), …, R_m(Ω_m)    (3)

If both X_j and X_k are categorical, we get instead:

Q_{j,k}(X_j, X_k; SUM(1)) += R_1(Ω_1), …, R_m(Ω_m)    (4)

The computation of the covar matrix does not depend on the parameters θ and can be done once for all BGD iterations.

Higher-degree Regression Models

A polynomial regression model of degree d is defined as follows:

PR_d(x_1, …, x_n) = Σ_{d_1 + … + d_n ≤ d} θ_{(d_1,…,d_n)} · x_1^{d_1} · … · x_n^{d_n}

The covar matrix for PR_d has the following aggregates in the gradient of the square loss function:

Q( ; SUM(X_1^{e_1} · … · X_n^{e_n})) += R_1(Ω_1), …, R_m(Ω_m), for e_1 + … + e_n ≤ 2d    (5)

A similar generalization works for factorization machines (Olteanu and Schleich, 2016; Abo Khamis et al., 2018b). Categorical attributes can be accommodated as for linear regression and then each categorical attribute with non-zero exponent becomes a group-by attribute.

Data Cubes

Data cubes (Gray et al., 1996) are popular in data warehousing scenarios. For a set {X_1, …, X_d} of attributes or dimensions, a d-dimensional data cube is a shorthand for the union of 2^d cube aggregates with the same aggregation function over the same (measure) attribute M. We define one aggregate for each of the 2^d possible subsets S ⊆ {X_1, …, X_d}:

Q(S; SUM(M)) += R_1(Ω_1), …, R_m(Ω_m)    (6)
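A sketch of the 2^d cube aggregates, one group-by per subset of the dimensions, over toy rows with illustrative attribute names:

```python
# Sketch: one aggregate per subset of the dimension attributes; the
# empty tuple () plays the role of the special ALL value.
from itertools import combinations
from collections import defaultdict

rows = [{"store": "s1", "item": "i1", "units": 3},
        {"store": "s1", "item": "i2", "units": 5},
        {"store": "s2", "item": "i1", "units": 2}]
dims, measure = ["store", "item"], "units"

cube = {}
for k in range(len(dims) + 1):
    for sub in combinations(dims, k):        # one aggregate per subset
        agg = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in sub)
            agg[key] += r[measure]
        cube[sub] = dict(agg)

assert cube[()][()] == 10                    # grand total
assert cube[("store",)][("s1",)] == 8        # per-store rollup
assert cube[("store", "item")][("s2", "i1")] == 2
```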

The cube aggregates have a similar structure to the covar matrices for polynomial regression models. They both represent sets of group-by aggregates over the same join. However, the two constructs compute different aggregates and use different data representations. Whereas all cube aggregates use the same measure aggregation, the covar aggregates sum over different attributes. Data cubes are represented as tables in 1NF using a special ALL value standing for a set of values, whereas the covar matrices for regression models are matrices whose entries are the regression aggregates, whose outputs have varying sizes and arities.

A polynomial regression model of degree d (PR_d) over categorical features given by n attributes requires regression aggregates whose group-by clauses are over all subsets of size at most 2d of the n attributes. In contrast, a d-dimensional data cube for a given set of d (dimension) attributes defines 2^d aggregates whose group-by clauses are over all subsets of the d attributes. The set of group-by clauses used by the aggregates for PR_d is thus captured by all 2d-dimensional data cubes constructed using the n attributes.

Mutual Information

The mutual information of two distinct discrete random variables X and Y is a measure of their mutual dependence and determines how similar the joint distribution is to the factored marginal distribution. In our database setting, we capture the distributions of two attributes X and Y using the following count queries that group by any subset of {X, Y} (thus expressible as a 2-dimensional data cube with a count measure):

Q_∅( ; SUM(1))   Q_X(X; SUM(1))   Q_Y(Y; SUM(1))   Q_{X,Y}(X, Y; SUM(1))    (7)

each with the body R_1(Ω_1), …, R_m(Ω_m). The mutual information of X and Y is then given by the following query with a 4-ary aggregate function g over the aggregates of the queries defined above:

Q( ; SUM(g)), where g = (Q_{X,Y} / Q_∅) · log( (Q_{X,Y} · Q_∅) / (Q_X · Q_Y) ), summed over the values of X and Y.

Mutual information has many applications as it is used: as cost function in learning decision trees; in determining the similarity of two different clusterings of a dataset; as criterion for feature selection; and in learning the structure of Bayesian networks. The Chow-Liu algorithm (Chow and Liu, 2006) constructs an optimal tree-shaped Bayesian network T with one node for each input attribute. It proceeds in rounds and in each round it adds to T an edge between two nodes X_i and X_j such that the mutual information of X_i and X_j is maximal among all pairs of attributes not chosen yet.
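A sketch of the mutual information computation from the four count queries, on toy data where X fully determines Y:

```python
# Sketch: mutual information of two attributes from the count queries
# grouped by {}, {X}, {Y}, and {X, Y}; data is illustrative.
from collections import Counter
from math import log

pairs = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]   # (X, Y) tuples
n = len(pairs)                                     # Q_emptyset
cx = Counter(p[0] for p in pairs)                  # Q_X
cy = Counter(p[1] for p in pairs)                  # Q_Y
cxy = Counter(pairs)                               # Q_XY

mi = sum((c / n) * log((c / n) / ((cx[x] / n) * (cy[y] / n)))
         for (x, y), c in cxy.items())

# X determines Y here, so I(X; Y) = H(Y) = log 2
assert abs(mi - log(2)) < 1e-12
```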

Classification and Regression Trees

Decision trees are popular machine learning models that use trees with inner nodes representing conditional control statements to model decisions and their consequences. Leaf nodes represent predictions for the label. If the label is continuous, we learn a regression tree and the prediction is the average of the label values in the fragment of the training dataset that satisfies all control statements on the root to leaf path. If the label is categorical, the tree is a classification tree, and the prediction is the most likely category for the label in the dataset fragment. Figure 2 shows an example of a regression tree.

The CART algorithm (Breiman et al., 1984) constructs the tree one node at a time. Given an input dataset D, CART repeatedly finds a condition X op t on one of the attributes X of D that splits D so that a given cost function over the label is minimized. For categorical attributes (e.g., city), t may be a set of categories and op denotes inclusion. For continuous attributes (e.g., age), t is a real number and op is inequality. Once this condition is found, a new node is constructed and the algorithm proceeds recursively to construct the subtree rooted at this node for the dataset representing the fragment of D satisfying the conditions at the new node and at its ancestors.

Figure 2. Example of a regression tree. Classification trees replace numerical leaves by categorical values.

Practical implementations of CART compute at each node the cost for 20–100 conditions per continuous attribute; for categorical attributes, the best subset of categories is chosen based on the cost of splitting on each individual category.

For regression trees, the cost is given by the variance:

variance = Σ_{i∈[N]} (y_i − avg(y))² = SUM(Y · Y) − SUM(Y)² / COUNT()

It is computed over the fragment D_j of the dataset D that satisfies the conditions along the path from the root to the current node. For the tree depicted in Figure 2, D_j is defined by the new condition X op t, for which we compute the cost of the split, in the context of the conjunction of conditions representing its ancestors in the tree. The computation of this cost needs the aggregates COUNT(), SUM(Y), and SUM(Y · Y) over D_j:

Q( ; SUM(δ), SUM(Y · δ), SUM(Y · Y · δ)) += R_1(Ω_1), …, R_m(Ω_m)    (8)
where δ = ∏_{X op t in the conditions defining D_j} 1_{X op t}

The product aggregate δ evaluates to 1 whenever all conditions in the subscript are satisfied and to 0 otherwise.
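A sketch of the variance computed from the three aggregates, checked against the direct definition; the label values are illustrative:

```python
# Sketch: the split cost (variance) derived from the three aggregates
# COUNT(), SUM(Y), and SUM(Y*Y) over the dataset fragment.
ys = [1.0, 3.0, 5.0, 7.0]            # label values in the fragment
cnt = sum(1 for _ in ys)             # COUNT()
s = sum(y for y in ys)               # SUM(Y)
s2 = sum(y * y for y in ys)          # SUM(Y*Y)

variance = s2 - s * s / cnt          # = sum_i (y_i - avg)^2

avg = s / cnt
assert abs(variance - sum((y - avg) ** 2 for y in ys)) < 1e-12
assert abs(variance - 20.0) < 1e-12
```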

For a categorical attribute X, the variance for all split conditions on X can be expressed using a single query of the form (8) extended with the group-by attribute X.

For classification trees, the label Y has a set of categories c_1, …, c_k. The cost is given by the entropy or the Gini index:

entropy = −Σ_{j∈[k]} p_j · log p_j        gini = 1 − Σ_{j∈[k]} p_j²

The aggregates for both compute the frequencies of each category for the label in the dataset fragment D', i.e., for category c_j this frequency p_j is the fraction of the number of tuples in D' where Y = c_j and of the number of all tuples in D'. These frequencies can all be computed with the following two aggregate queries:

Q( ; SUM(δ)) += R_1(Ω_1), …, R_m(Ω_m)    (9)
Q(Y; SUM(δ)) += R_1(Ω_1), …, R_m(Ω_m)    (10)

where δ is the product of Kronecker deltas for the conditions defining D'. For a categorical attribute X, the cost of all split conditions on X can be expressed using two queries of the form (9) and (10) extended with the group-by attribute X.
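A sketch of entropy and Gini computed from the counts of queries (9) and (10), on a toy fragment:

```python
# Sketch: entropy and Gini index from the total count (query (9)) and
# the per-category counts of the label (query (10)).
from collections import Counter
from math import log

labels = ["yes", "yes", "yes", "no"]   # label values in the fragment
total = len(labels)                    # query (9): COUNT()
per_cat = Counter(labels)              # query (10): COUNT() grouped by label

freqs = {c: k / total for c, k in per_cat.items()}
entropy = -sum(p * log(p) for p in freqs.values())
gini = 1 - sum(p * p for p in freqs.values())

assert abs(gini - (1 - (0.75 ** 2 + 0.25 ** 2))) < 1e-12
assert entropy > 0
```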

Applications need a large number of aggregates. The number of aggregates in a batch is a function of the number n of attributes in the database: quadratic in n for linear regression; on the order of n^{2d} for polynomial regression of degree d; m · 2^d for d-dimensional data cubes with m measures; quadratic in n for Chow-Liu trees; and, for classification trees, proportional to the number of tree nodes, the number of conditions tried per attribute, and the number of label categories (the formula for regression trees drops the last factor). Table 2 gives the number of aggregates for these applications and our four datasets, whose details are in Table 1. This number ranges from tens to tens of thousands.

Further Applications

Virtually any in-database machine learning setting can benefit from an efficient processor for aggregate batches over joins. Although not reported in this work, we also investigated SVM, k-means clustering, and low-rank models such as quadratically regularized PCA and Non-Negative Matrix Factorization, as well as linear algebra operations such as QR and SVD decompositions of matrices defined by the natural join of database relations. All these applications decompose into batches of aggregates of a similar form to those mentioned here.

3. The LMFAO Engine

In this section we discuss key design choices for LMFAO and motivate them using examples. Section 1 already provided an overview of its layers that are depicted in Figure 1.

3.1. Aggregates over Join Trees

LMFAO evaluates a batch of aggregate queries of the form (1) over a join tree of the database schema, or equivalently of the natural join of the database relations. We next recall the notion of join trees and exemplify the evaluation of aggregates over join trees by decomposing them into views.

Sales: date, store, item, units, promo
Holidays: date, htype, locale, transferred
StoRes: store, city, state, stype, cluster
Items: item, family, class, perishable
Transactions: date, store, txns
Oil: date, price

Figure 3. (left) The schema for the Favorita dataset. (middle) A join tree for this schema with directional views and four queries, partitioned in 7 groups. (right) The dependency graph of the view groups.

The join tree T of the natural join of the database relations R_1, …, R_m is an undirected tree such that (Abiteboul et al., 1995):

  • The set of nodes of T is {R_1, …, R_m}.

  • For every pair of nodes R_i and R_j, their common attributes are in the schema of every node along the distinct path from R_i to R_j, i.e., Ω_i ∩ Ω_j ⊆ Ω_k for every R_k on this path.
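The connectivity property can be checked mechanically; the sketch below does so for the Favorita schema, with an edge set that is our reading of Figure 3:

```python
# Sketch: verifying the join-tree connectivity property for the
# Favorita relations (abbreviated names); edges are our reading of
# Figure 3, not stated explicitly in the text.
import itertools

schema = {"S": {"date", "store", "item", "units", "promo"},
          "T": {"date", "store", "txns"},
          "R": {"store", "city", "state", "stype", "cluster"},
          "O": {"date", "price"},
          "I": {"item", "family", "class", "perishable"},
          "H": {"date", "htype", "locale", "transferred"}}
edges = {("S", "T"), ("T", "R"), ("T", "O"), ("S", "I"), ("S", "H")}

adj = {n: set() for n in schema}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def path(u, v, seen=()):            # the unique path between two tree nodes
    if u == v:
        return [u]
    for w in adj[u] - set(seen):
        p = path(w, v, seen + (u,))
        if p:
            return [u] + p
    return None

def is_join_tree():
    for u, v in itertools.combinations(schema, 2):
        common = schema[u] & schema[v]
        if any(not common <= schema[w] for w in path(u, v)):
            return False
    return True

assert is_join_tree()
```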

Figure 3 shows a possible join tree for the natural join of the six relations in the Favorita dataset (Favorita, 2017) (details are given in Appendix A). Instead of showing join attributes on the edges, we underline them in the schema (left) to avoid clutter.

Acyclic joins always admit join trees. Arbitrary joins are transformed into acyclic ones by means of hypertree decompositions and materialization of their nodes (called bags) using worst-case optimal join algorithms (Marx, 2010; Veldhuizen, 2014). We next exemplify the computation of aggregates over a join tree (Bakibayev et al., 2013; Abo Khamis et al., 2016).

Example 3.1.

Let us compute the sum of the product of two aggregate functions f(units) and g(price) over the natural join of the Favorita relations:

Q( ; SUM(f(units) · g(price))) += S(…), T(…), R(…), O(…), I(…), H(…)

We abbreviated the names of the Favorita relations as highlighted in Figure 3. The aggregate functions f and g are over the attributes units in Sales and price in Oil. We can rewrite Q to push these functions down to the relations and also follow the structure of the join tree in Figure 3 (for brevity, we only list the join attributes in the bodies):

V_O(date; SUM(g(price))) += O(date, price)
V_R(store; SUM(1)) += R(store, …)
V_T(date, store; SUM(v_O · v_R)) += T(date, store, …), V_O(date; v_O), V_R(store; v_R)
V_I(item; SUM(1)) += I(item, …)
V_H(date; SUM(1)) += H(date, …)
Q( ; SUM(f(units) · v_T · v_I · v_H)) += S(date, store, item, units, …), V_T(date, store; v_T), V_I(item; v_I), V_H(date; v_H)

Except for Sales and Oil, which have attributes in the aggregate functions, we only need to count the number of tuples with the same join key in each of the other relations.∎

The computation of several aggregates over the same join tree may share views between themselves.

Example 3.2.

Consider now the query Q_2(family; SUM(f(units) · g(price))) over the same join. This query reports the sum of f(units) · g(price) for each item family. We can rewrite it similarly to Q in Example 3.1; the view over Items and the query itself now have the group-by attribute family:

V'_I(item, family; SUM(1)) += I(item, family, …)
Q_2(family; SUM(f(units) · v_T · v'_I · v_H)) += S(date, store, item, units, …), V_T(date, store; v_T), V'_I(item, family; v'_I), V_H(date; v_H)

We can share the view over the subtree rooted at Transactions, and thus its underlying views over StoRes and Oil, as well as the view over Holidays, between Q and Q_2. ∎

3.2. Directional Views over Join Trees

An alternative evaluation for Q_2 in Example 3.2 would not create the view over Items and instead create a view over the subtree rooted at Sales and then join it with Items in the body of Q_2. This effectively means that we use the same join tree but rooted at different nodes: Sales for Q and Items for Q_2. This also means that the edge between Sales and Items has two views, yet with different direction.

To accommodate this evaluation approach, we introduce directional views: These are queries of the form (1) where we also specify their direction. They flow along an edge from a source node to a neighboring target node and are computed using the relation at the source node and some of its incoming views. Examples 3.1 and 3.2 showed views whose directions are always towards the root Sales. The view over Items there has direction Items → Sales, while the view used in the alternative evaluation of Q_2 has direction Sales → Items.

Consider a join tree T rooted at a relation R with children R_1, …, R_k, where child R_i is the root of a subtree T_i in T. We use Ω_R and Ω_i to denote the schema of the relation R and respectively the union of the schemas of all relations in T_i.

We decompose a query Q with group-by attributes F and aggregate function α as follows:

Q(F; SUM(α)) += R(Ω_R), V_1(F_1; SUM(α_1)), …, V_k(F_k; SUM(α_k))

The view V_i for a child R_i of R is defined as the "projection" of Q onto T_i as follows. Its group-by attributes are F_i = (Ω_R ∩ Ω_i) ∪ (F ∩ Ω_i); here, Ω_R ∩ Ω_i are the attributes shared between R and the child R_i and F ∩ Ω_i are the group-by attributes from Q present in T_i. Its body is the natural join of the relations in T_i. If all attributes of α are (are not) in Ω_i, then α_i = α (respectively α_i = 1). Otherwise, T_i has some of the attributes required to compute α, in which case we add them as group-by attributes, i.e., to F_i, and use the aggregate SUM(1) to count. We can now decompose the views V_i recursively as explained for Q.

Using different roots for different queries may lower the overall complexity of evaluating a batch of aggregates. At the same time, we would like to share computation as much as possible, which is intuitively maximized if all queries are computed at the same root. We next discuss our solution to this tension between complexity and sharing.

3.3. Each Aggregate to Its Own Root

We next exemplify the advantage of evaluating a batch of queries, which are common in linear regression and mutual information settings where all attributes are categorical, at different roots in the join tree and then explain how to find a root for a given aggregate in a batch of aggregates.

Example 3.3 ().

Consider the following count queries over the join of relations of size , :

We first explain how to compute these queries by decomposing them into directional views that are over the join tree with root and have the same direction along this path towards the root.

For simplicity, we denote by V^k_j the view constructed for Q_k with direction from R_j to R_{j−1}. The views are defined as follows, with the addition of the trivial view V^k_{n+1} that associates each value in the domain of A_{n+1} with the count 1:

V^k_j(a_j, a_k) = Σ_{a_{j+1}} R_j(a_j, a_{j+1}) · V^k_{j+1}(a_{j+1}, a_k)   for 2 ≤ j ≤ n,

where A_k is dropped from the group-by attributes of V^k_j whenever it does not occur in R_j ⋈ ⋯ ⋈ R_n, i.e., for j > k (for j = k, the two group-by attributes coincide).

The above decomposition proceeds as follows. Q_{n+1} counts the number of occurrences of each value for A_{n+1} in the join. We start with 1, as provided by the trivial view V^{n+1}_{n+1}, and progress to the left towards the root R_1. The view V^{n+1}_j computes the counts for A_{n+1} in the context of each value for A_j, as obtained from R_j and V^{n+1}_{j+1}. We need to keep the values for A_j to connect with R_{j−1}. Eventually, we reach V^{n+1}_2, which gives the counts for A_{n+1} in the context of A_2, and we sum them over the values of A_2 at the root. The same idea applies to any Q_k, with one additional optimization: instead of starting from the leaf R_n, we can jump-start at R_k and reuse the computation of the counts for A_{k+1} over the subjoin R_{k+1} ⋈ ⋯ ⋈ R_n, as provided by V^k_{k+1}; this view is the same for all queries Q_1, …, Q_k. We need O(n²) many views, and those of them that have group-by attributes from two different relations take O(N²) time.

We can lower this complexity to O(n · N) by using different roots for different queries. We show the effect of using the root R_k for query Q_k (for Q_{n+1}, we use the root R_n). For each query Q_k, we construct two directional views: the view V→_k from R_{k−1} to R_k (i.e., from left to right) and the view V←_k from R_{k+1} to R_k (i.e., from right to left). The counts for the values of A_k are products of the counts in the left view V→_k and the right view V←_k, combined at the root R_k:

Q_k(a_k) = V→_k(a_k) · Σ_{a_{k+1}} R_k(a_k, a_{k+1}) · V←_k(a_{k+1}),
V→_k(a_k) = Σ_{a_{k−1}} R_{k−1}(a_{k−1}, a_k) · V→_{k−1}(a_{k−1}),
V←_k(a_{k+1}) = Σ_{a_{k+2}} R_{k+1}(a_{k+1}, a_{k+2}) · V←_{k+1}(a_{k+2})

We also use two trivial views V→_1 and V←_n, which assign the count 1 to each value of A_1 and respectively A_{n+1}. Note how the left view V→_k is expressed using the left view V→_{k−1} coming from the node below it. Similarly for the right views. Each of the views takes linear time. Moreover, they share much more computation among them than the views used in the first scenario.
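The left/right-view scheme for a path of binary relations can be sketched as follows (an illustrative Python re-implementation under the assumption that each relation is a list of (a, b) value pairs; names are hypothetical, not LMFAO's API):

```python
from collections import defaultdict

def all_single_attribute_counts(relations):
    """For a path of binary relations R_1(A_1,A_2), ..., R_n(A_n,A_{n+1}),
    compute every Q_k(A_k) = COUNT grouped by A_k over the full join in
    O(n * N): left views count join prefixes, right views count suffixes."""
    n = len(relations)
    # left[k]: counts over R_1 join ... join R_k grouped by A_{k+1}; left[0] is trivial (count 1).
    left = [None] * (n + 1)
    for k, R in enumerate(relations, start=1):
        v = defaultdict(int)
        for a, b in R:
            v[b] += 1 if left[k - 1] is None else left[k - 1][a]
        left[k] = v
    # right[k]: counts over R_k join ... join R_n grouped by A_k; right[n+1] is trivial (count 1).
    right = [None] * (n + 2)
    for k in range(n, 0, -1):
        v = defaultdict(int)
        for a, b in relations[k - 1]:
            v[a] += 1 if right[k + 1] is None else right[k + 1][b]
        right[k] = v
    # Q_k is the pointwise product of the prefix and suffix counts for A_k.
    queries = {}
    for k in range(1, n + 2):
        if k == 1:
            queries[k] = dict(right[1])
        elif k == n + 1:
            queries[k] = dict(left[n])
        else:
            queries[k] = {a: left[k - 1][a] * right[k][a]
                          for a in left[k - 1].keys() & right[k].keys()}
    return queries
```

Every left and right view is reused by all queries, which is the source of the shared computation noted above.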

The second approach, which chooses the root R_k for query Q_k, can also be used for count queries over all pairs of attributes:

∀ 1 ≤ i < j ≤ n+1 :  Q_{i,j}(A_i, A_j) = SELECT A_i, A_j, SUM(1) FROM R_1 NATURAL JOIN ⋯ NATURAL JOIN R_n GROUP BY A_i, A_j

Each of these queries takes O(N²) time for j > i+1 and O(N) time otherwise, i.e., when A_i and A_j occur in the same relation. At each node R_j, we compute a left view V→_{i,j}, for any i < j, that counts the number of tuples for (A_i, A_j) over the path R_1 − ⋯ − R_{j−1}:

V→_{i,j}(a_i, a_j) = Σ_{a_{j−1}} R_{j−1}(a_{j−1}, a_j) · V→_{i,j−1}(a_i, a_{j−1})   for j > i+1, with V→_{i,i+1}(a_i, a_{i+1}) = R_i(a_i, a_{i+1}) · V→_i(a_i),

where V→_i is the single-attribute left view from above. Then, the overall count in Q_{i,j} is computed as the product of the count for A_j given by the right view W_j, which counts the tuples for A_j over R_j ⋈ ⋯ ⋈ R_n, and the count for (A_i, A_j) given by the left view V→_{i,j} (i < j):

Q_{i,j}(a_i, a_j) = V→_{i,j}(a_i, a_j) · W_j(a_j)

The trivial views V→_1 and W_{n+1} assign a count of 1 to each value of A_1 and respectively A_{n+1}. ∎
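The pairwise variant can be sketched in the same toy representation of a path of binary relations (illustrative code, not LMFAO's; left pair views carry prefix counts grouped by an attribute pair, right views carry suffix counts):

```python
from collections import defaultdict

def all_pairwise_counts(relations):
    """Q_{i,j}(A_i, A_j) = COUNT over the join of a path of binary relations
    R_1(A_1,A_2), ..., R_n(A_n,A_{n+1}), for all 1 <= i < j <= n+1."""
    n = len(relations)
    right = {n + 1: None}                       # trivial right view: count 1
    for k in range(n, 0, -1):                   # suffix counts grouped by A_k
        v = defaultdict(int)
        for a, b in relations[k - 1]:
            v[a] += 1 if right[k + 1] is None else right[k + 1][b]
        right[k] = v
    left = {1: None}                            # trivial left view: count 1
    for k in range(1, n + 1):                   # prefix counts grouped by A_{k+1}
        v = defaultdict(int)
        for a, b in relations[k - 1]:
            v[b] += 1 if left[k] is None else left[k][a]
        left[k + 1] = v
    result = {}
    for i in range(1, n + 1):
        pair = defaultdict(int)                 # prefix counts by (A_i, A_{i+1})
        for a, b in relations[i - 1]:
            pair[a, b] += 1 if left[i] is None else left[i][a]
        for j in range(i + 1, n + 2):
            rv = right.get(j)                   # suffix counts for A_j (None = trivial)
            result[i, j] = {(x, y): c * (1 if rv is None else rv[y])
                            for (x, y), c in pair.items()
                            if rv is None or rv[y]}
            if j <= n:                          # extend the pair view by R_j
                nxt = defaultdict(int)
                for a, b in relations[j - 1]:
                    for (x, y), c in pair.items():
                        if y == a:
                            nxt[x, b] += c
                pair = nxt
    return result
```

Extending a pair view is the quadratic step, matching the O(N²) bound for non-adjacent attribute pairs.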

LMFAO chooses the root in the join tree for each query in a batch using a simple and effective approximation for the problem of minimizing the overall size of the views used to compute the entire batch. For each query Q in the batch, we assign a weight to each relation in the join tree that contains group-by attributes of Q: the weight is the fraction of Q's group-by attributes present in the schema of that relation. If Q has no group-by attribute, then any relation is a possible root, and we assign to each relation the same weight, an equal fraction of the number of relations. At the end of this weight-assignment phase, each relation has accumulated some weight. We then assign roots in decreasing order of weight: a relation with the largest weight is assigned as root to all queries that considered it a possible root. We break ties by choosing a relation with the largest size. The rationale for this choice is threefold. First, choosing the largest relation as root avoids the creation of possibly large views over it. Second, if the root for Q has no group-by attribute of Q, then we would create views carrying around values for these attributes, hence larger views. Third, a root with a large weight ensures that many views share the same direction towards it, so their computation may be shared and they may be merged or grouped (as explained in the next sections).
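One possible reading of this weight-based root assignment is sketched below (function and parameter names are made up for illustration; LMFAO's actual data structures may differ):

```python
def choose_roots(queries, relations, sizes):
    """Assign a root relation to each query in a batch.
    `queries` maps a query id to its set of group-by attributes;
    `relations` maps a relation name to its schema (a set of attributes);
    `sizes` maps a relation name to its tuple count (used to break ties)."""
    weights = {r: 0.0 for r in relations}
    candidates = {}
    for q, group_by in queries.items():
        if group_by:
            # Candidate roots: relations containing group-by attributes of q.
            cand = {r for r, schema in relations.items() if schema & group_by}
            for r in cand:
                weights[r] += len(relations[r] & group_by) / len(group_by)
        else:
            # No group-by attribute: every relation is a possible root.
            cand = set(relations)
            for r in cand:
                weights[r] += 1.0 / len(relations)
        candidates[q] = cand
    # Assign roots in decreasing order of weight, breaking ties by size.
    order = sorted(relations, key=lambda r: (weights[r], sizes[r]), reverse=True)
    roots = {}
    for r in order:
        for q, cand in candidates.items():
            if q not in roots and r in cand:
                roots[q] = r
    return roots
```

A relation that appears in many queries' group-by lists accumulates a large weight and attracts those queries, which maximizes the number of views sharing a direction towards it.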

3.4. Merging and Grouping Views

The views generated for a batch of aggregates can be consolidated or merged if they have in common: (1) only the group-by attributes and direction, (2) also the body, and (3) also the aggregates. The common case (3), which is also the most restrictive one, has been seen in Example 3.2: The same view is created for several queries, in which case we only need to compute it once. Case (2) concerns views with the same group-by attributes and join but different aggregates. Such views are merged into a single one that keeps the same group-by attributes and join but merges the lists of aggregates. Case (1) is the most general form of merging supported by LMFAO and consolidates the views into a new view that is a join of these views on their (same) group-by attributes. The reason why this merging is sound is twofold. First, these views are over the same join, so they have the same set of tuples over their group-by attributes. Second, the aggregates are functionally determined by the group-by attributes.

Example 3.4.

We continue Examples 3.1 and 3.2 and add a third count query Q_3 over the same join body as the first two queries. Q_3 is decomposed into views over the same join tree rooted at Sales, some of which coincide with views already created for the earlier queries: one of its views is shared with both of them, and another is shared with one of them. These shared views only need to be computed once (Case (3)).

One of the remaining views for Q_3 can be merged with a view for an earlier query into a new view, since the two have the same group-by attributes and body but different aggregates (Case (2)). The views that were defined over the two original views are now defined over the merged view instead.

Two further views have the same group-by attributes and direction but different bodies, as they join over different subtrees. We can merge them following Case (1) into a view that joins them on their common group-by attributes.
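The most common merging cases, namely views with identical group-by attributes, direction, and body whose aggregate lists are consolidated (Cases (2) and (3)), can be sketched as follows; the tuple representation of a view is made up for illustration:

```python
def merge_views(views):
    """Merge views that share group-by attributes, direction, and body
    (Cases (2) and (3)): the merged view keeps the union of the aggregate
    lists, and duplicate aggregates are computed only once.
    Each view is a tuple (group_by, direction, body, aggregates)."""
    merged = {}
    for group_by, direction, body, aggs in views:
        key = (frozenset(group_by), direction, body)
        merged.setdefault(key, [])
        for agg in aggs:
            if agg not in merged[key]:
                merged[key].append(agg)
    return [(set(gb), d, b, aggs) for (gb, d, b), aggs in merged.items()]
```

Case (1) merging (same group-by attributes and direction, different bodies) would additionally join the surviving views on their common group-by attributes; it is omitted here for brevity.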

Besides merging, grouping is another way of clustering views that can share computation: we form groups of views that go out of the same node, regardless of their group-by attributes and bodies. We group the views as follows. We first compute a topological order of the views: if a view V_2 uses a view V_1 in its body, i.e., V_2 depends directly on V_1, then V_2 appears after V_1 in this order. We then traverse this order and create a group with all views such that (1) no view in the group depends on another view in the group, and (2) all views within the group go out of the same relation in the join tree.
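A simplified sketch of this grouping, checking only direct dependencies and using Python's standard graphlib for the topological order (LMFAO's actual criterion and data structures may differ):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def group_views(deps, out_of):
    """Group views for shared execution: walk the views in a topological
    order of the dependency graph and open a new group for a relation
    whenever a view directly depends on a view in that relation's open group.
    `deps[v]` is the set of views v uses; `out_of[v]` is v's source node."""
    groups = []
    open_group = {}                       # source relation -> index into groups
    for v in TopologicalSorter(deps).static_order():
        rel = out_of[v]
        i = open_group.get(rel)
        if i is None or deps.get(v, set()) & groups[i]:
            groups.append(set())          # close the old group, open a new one
            i = len(groups) - 1
            open_group[rel] = i
        groups[i].add(v)
    return groups
```

The topological order guarantees that every view is placed after all views it depends on, so each group can be evaluated as a unit.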

Figure 3(center) shows a scenario with directional views for four queries along the edges of our Favorita join tree. Their grouping is shown in Figure 3(right).

In the next section, we show how to compute all views within a group in one scan over their common relation.

3.5. Multi-Output Optimization

The view group is a computational unit in LMFAO. We introduce a new optimization that constructs a plan that computes all views in a group in one scan over their common input relation. Since this plan outputs the results for several views, we call it multi-output optimization, or MOO for short.

One source of complexity in MOO is that the views in a group are defined over different incoming views. While scanning the common relation, the multi-output plan looks up into the incoming views to fetch the aggregates needed to compute the views in the group. A second challenge is to update the aggregates of each view in the group as early as possible and with a minimal number of computation steps.
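The single-scan, multi-output idea can be illustrated with a toy example: one pass over a common relation produces two output views at once, with lookups into an incoming view (the schema and aggregates here are made up, not the paper's running example):

```python
from collections import defaultdict

def multi_output_scan(sales, incoming):
    """Compute two output views in a single scan over a common relation:
    V1(item) = SUM(units * incoming_count(date)) and V2(store) = COUNT(*).
    `incoming` is an incoming view mapping a date to a count."""
    v1 = defaultdict(int)
    v2 = defaultdict(int)
    for t in sales:                       # single pass over the common relation
        c = incoming.get(t['date'], 0)    # lookup into the incoming view
        v1[t['item']] += t['units'] * c
        v2[t['store']] += 1
    return dict(v1), dict(v2)
```

Both views amortize the cost of the scan and of the lookups into the incoming view, which is the point of treating the group as one computational unit.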

MOO has three steps: (1) find an order of the join attributes of the common relation; (2) register the incoming and outgoing views to attributes in this order; and (3) register the aggregate functions to attributes in this order. We next present each of these steps and exemplify them using the following group of three views over the common relation Sales: