Deep Neural Networks (DNNs) have not only been shown to solve many complex problems such as image classification or machine translation, but are applied in many other domains, too. This is also the case for DBMSs, where DNNs have successfully been used to replace existing DBMS components with learned counterparts such as learned cost models[kipf2019learned, sun2019an] as well as learned query optimizers [DBLP:journals/corr/abs-1904-03711], or even learned indexes [kraska2018thecase] or query scheduling and query processing schemes [DBLP:conf/sigmod/MaT19, DBLP:conf/sigmod/ShengTZP19].
The predominant approach for learned DBMS components is that they capture the behavior of a component by running a representative set of queries over a given database and use the observations to train the model. For example, for learned cost models such as [kipf2019learned, sun2019an] different query plans need to be executed to collect the training data, which captures the runtime (or cardinalities), to then learn a model that can estimate costs for new query plans. This observation also holds for the other approaches such as learned query optimizers or the learned query processing schemes, which are also based on collected training data that requires the execution of a representative workload.
A major obstacle of this workload-driven approach to learning is that collecting the training data is typically very expensive since many queries need to be executed to gather enough training data. For example, approaches like [kipf2019learned, sun2019an] have shown that the runtime of hundreds of thousands of query plans is needed for the model to provide a high accuracy. Still, the training corpora often only cover a limited set of query patterns to avoid even higher training costs. For example, in [kipf2019learned] the training data covers only queries up to two joins (three tables) and filter predicates with a limited number of attributes.
Moreover, the training data collection is not a one-time effort since the same procedure needs to be repeated over and over if the workload changes or if the current database is not static and the data is constantly being updated as it is typical for OLTP. Otherwise, without collecting new training data and retraining the models for the characteristics of the changing workload or data, the accuracies of these models degrade with time. To illustrate this, consider Figure 1. Here, we see (orange bars) the error of the cardinality estimation model of Kipf et al. [kipf2019learned], called MCSN, that was trained on queries with three joined tables only. On queries with four and more tables that the model has not seen, however, the error quickly increases.
In this paper, we take a different route. Instead of learning a model over the workload, we propose to learn a purely data-driven model that captures the joint probability distribution of the data and reflects important characteristics such as correlations across attributes but also the data distribution of single attributes. Another important difference to existing approaches is that our data-density approach supports direct updates; i.e., inserts, updates, and deletes on the underlying database can be absorbed by the model without the need to retrain the model.
As a result, since our model only captures information of the data (and this is workload-independent) it can not only be used for one task but supports many different tasks ranging from query answering, over cardinality estimation to machine learning tasks such as classification or regression. One could now think that this all comes at a price and that the accuracy of our approach must be lower since the workload-driven approaches get more information than a pure data-driven approach. However, as we demonstrate in our experiments, this is not the case. Our approach actually outperforms many state-of-the-art workload-driven approaches. Furthermore, it generalizes much better. Reconsider Figure 1. The blue bars show the results when using our model for cardinality estimation, proving that it provides an order-of-magnitude better accuracies.
Indeed, we do not argue that data-driven models are a silver bullet to solve all possible tasks in a DBMS. Instead, we think that data-driven models should be combined with workload-driven models when it makes sense. For example, a workload-driven model for a learned query optimizer might use the cardinally estimates of our model as input features. This combination of data-driven and workload-driven models provides an interesting avenue for future work but is beyond the scope of this paper.
To summarize, the main contributions of this paper are:
We developed a new class of deep probabilistic models over databases, called Relational Sum Product Networks (RSPNs), that can capture important characteristics of a database.
To support different tasks, we devise a probabilistic query compilation approach that translates incoming database queries into probability and expectations for RSPNs (that are learned over a given database).
We implemented our data-driven approach in a prototypical DBMS architecture, called DeepDB, and evaluated it against state-of-the-art learned and non-learned approaches that are workload-aware, showing the benefits of our approach over these baselines.
The remainder of the paper is organized as follows. In Section 2 we first present an overview of DeepDB and then discuss details of our models and the query compilation in Sections 3 and 4. Afterwards, we explain further extensions of DeepDB in Section 5 before we show an extensive evaluation comparing DeepDB against state-of-the art approaches for various tasks. Finally, we iterate over related work in Section 7 and conclude in Section 8.
2 Overview and Applications
As shown in Figure 2, the main idea of DeepDB is to learn a distribution of the data. An important aspect of DeepDB is that we do not aim to replace the original data with a model (or a set of models as we discuss later). Instead, a model created in DeepDB augments a database similar to indexes to speed-up query processing and to provide additional query capabilities while we can still run standard SQL queries over the original database.
To optimally capture relevant characteristics of relational data, we developed a new class of models called Relational Sum Product Networks (RSPNs). RSPNs are based on the basic structure of Sum Product Networks (SPNs) [domingos2011spn]. In a nutshell, SPNs are deep probabilistic models that capture the joint probability distribution of a given data set. RSPNs extend SPNs to optimize them for the use in a relational DBMS. First, RSPNs provide additional algorithms to support a wider class of applications. Also database-specific extensions such as correct NULL-value handling etc. are handled by RSPNs. Most importantly, what differentiates RSPNs from many other ML models, is that they support direct updates; i.e., the model does not need to be retrained but can be updated directly if new tuples are inserted into or tuples are deleted from the underlying database.
In DeepDB, we create an ensemble of RSPNs that represents a given database in an offline learning procedure (similar to bulk loading an index). Once the RSPNs are created, the models can be leveraged at runtime for different tasks. Since RSPNs capture the joint probability distribution of the underlying database, they can support a wide variety of different applications, ranging from user-facing tasks (e.g., to provide fast approximate answers for SQL queries or to execute ML tasks on the model) to system-internal tasks (e.g., to provide estimates for cardinalities). In order to support these tasks, DeepDB provides a new so called probabilistic query compilation that translates a given task into products of expectations and probability queries on the RSPNs.
In the following, to show that our approach is capable of supporting a range of applications, we give a brief overview of how we support these tasks in DeepDB. The main goal of this paper is to show the potentials of our data-driven learning approach to support a wide variety of different applications. However, DeepDB is not limited to the applications presented next and can be easily extended to other applications by providing a translation of queries into products of expectations and probabilities.
The first task DeepDB supports is cardinality estimation for a query optimizer. Cardinality estimation is needed to provide cost estimates but also to find the correct join order during query optimization. Since DeepDB learns a representation of the data, it can also be leveraged to provide precise cardinality estimates. A particular advantage of DeepDB is that we do not have to create dedicated training data, i.e. pairs of queries and cardinalities. Instead, since RSPNs capture information of the data, we can support arbitrary queries without the need to train the model for the particular workload. Moreover, since RSPNs are easy to update they can be kept up to date at low costs similar to traditional histogram-based approaches, which is different from existing learned approaches for cardinality estimation such as [kipf2019learned].
Approximate Query Processing (AQP)
The second task we currently support in DeepDB is AQP. AQP aims to provide approximate answers to support faster query response times on large data sets. Currently, aggregate queries with equi-joins and typical selection predicates with group-by clauses are supported. The basic idea of how a query on a single table is executed inside DeepDB is simple: for example, an aggregate query with a where condition is equal to the conditional expectation . These conditional expectations can be approximated with RSPNs. This principle can also be applied to approximate join queries. In the simplest case, a full model was learned already on the join of the corresponding tables. An alternative is to use multiple but smaller RSPNs that have to be combined to execute a join. All these cases are supported by our probabilistic query compilation engine, which is explained in more detail in Section 4.
Machine Learning (ML)
Finally, many ML tasks can also directly be conducted in DeepDB based on our models without any further learning. For instance, DeepDB can provide answers for regression or classification tasks for every column of the database using any set of columns as features.
3 Learning a Data Model
In this section, we introduce Relational Sum Product Networks (RSPNs), which we use to learn a representation of a database and, in turn, to answer queries using our query engine explained in the next section. We first review Sum Product Networks (SPNs) and then introduce RSPNs. Afterwards, we describe how an ensemble of RSPNs can be created to encode a given database multiple tables.
3.1 Sum Product Networks
Sum-Product Networks (SPNs) [domingos2011spn] learn the joint probability distribution of the variables , which are present in the data set. They are an appealing choice because probabilities for arbitrary conditions can be computed very efficiently. We will later make use of these probabilities for our applications like AQP and cardinality estimation.
(Tree-structured) SPNs are trees with sum and product nodes as internal nodes and leaves. Intuitively, sum nodes split the population (i.e., the rows of data set) into clusters and product nodes split independent variables of a population (i.e., the columns of a data set). Leaf nodes represent a single attribute and approximate the distribution of that attribute either using histograms for discrete domains or piecewise linear functions for continuous domains [molina2017mixed].
For instance, in Figure 2(c), an SPN was learned over the variables region and age of the corresponding customer table in Figure 2(a). The top sum node splits the data into two groups: The left group contains of the population which is dominated by older European customers (corresponding to the first rows of the table) and the right group contains of the population with younger Asian customers (corresponding to the last rows of the table). In both groups region and age are independent and thus split by a product node each. The leaf nodes determine the probability distributions of the variables region and age for every group.
With an SPN at hand, one can compute probabilities for conditions on arbitrary columns. Intuitively, the conditions are first evaluated on every relevant leaf. Afterwards, the SPN is evaluated bottom up. For instance in Figure 2(d), to estimate how many customers are from Europe and younger than 30, we compute the probability of European customers in the corresponding blue region leaf nodes (80% and 10%) and the probability of a customer being younger than 30 (15% and 20%) in the green age leaf nodes. These probabilities are then multiplied at the product node level above, resulting in probabilities of 12% and 2%, respectively. Finally, at the root level (sum node), we have to consider the weights of the clusters, which leads to Multiplied by the number of rows in the table, we get an approximation of 50 European customers who are younger than 30.
3.2 Relational Sum-Product Networks
Using standard SPNs directly as models for DeepDB is insufficient due to the following problems: they cannot be updated easily, leading to an obsolete data representation over time. Moreover, for our applications it is insufficient to just compute probabilities; we require extended inference algorithms, which in particular consider database-specifics like NULL values and functional dependencies. This led us to develop Relational SPNs (RSPNs)111Nath et al. [nath2015learning] also modified SPNs to deal with relational data. Different from RSPNs, they did neither handle updates, nor NULL-values or functional dependencies. The relational structure was exploited solely for the learning process (i.e., to avoid joining the tables before building an SPN)..
This is the most important extensions of RSPNs over SPNs. If the underlying database tables are updated, the model might become inaccurate. For instance, if we insert more young European customers in the table in Figure 2(a), the probability computed in Figure 2(d) is too low and thus the RSPN needs to be updated. As described before, an RSPN consists of product and sum nodes, as well as leaf nodes, which represent probability distributions for individual variables. The key-idea to support direct updates of an existing RSPN is to traverse the RSPN tree top-down and update the value distribution of the weights of the sum-nodes during this traversal. For instance, the weight of a sum node for a subtree of younger European customers could be increased to account for updates. Finally, the distributions in the leaf-nodes are adjusted. The detailed algorithm of how to directly update RSPNs is discussed in Section 5.2.
First, SPNs do not provide mechanisms for handling NULL values. Hence, we developed an extension where NULL values are represented as a dedicated value for both discrete and continuous columns at the leaves during learning. Furthermore, when computing conditional probabilities and expectations, NULL values must be handled according to the three-valued logic of SQL.
Second, SPNs aim to generalize the data distribution and thus approximate the leaf distribution abstracting away specifics of the data set to generalize. For instance, in the leaf nodes for the age in Figure 2(c), a piecewise linear function would be used to approximate the distribution [molina2017mixed]. Instead, we want to represent the data as accurate as possible. Hence, for continuous values, we store each individual value and its frequency. If the number of distinct values exceeds a given limit, we also use binning for continuous domains.
Third, functional dependencies between non-key attributes are not well captured by SPNs. We could simply ignore these and learn the RSPN with both attributes and but this often leads to large SPNs since the data would be split into many small clusters (to achieve independence of and ). Hence, we allow users to define additional functional dependencies along with a table schema. If a functional dependency is defined, we store the mapping from values of to values of in a separate dictionary of the RSPN and omit the column when learning the RSPN. At runtime, queries with filter predicates for are translated to queries with filter predicates for .
Extended Inference Algorithms
A last important extension is that for many queries such as AVG and SUM expectations are required (e.g., to answer a SQL aggregate query which computes an average over a column). In order to answer these queries, RSPNs allows computing expectations over the variables on the leaves to answer those aggregates. To additionally apply a filter predicate, we still compute probabilities on the leaves for the filter attribute and propagate both values up in the tree. At product nodes, we multiply the expectations and probabilities coming from child nodes whereas on sum nodes the weighted average is computed. In Figure 4 we show an example how the average age of European customers is computed. The ratio of both terms yields the correct conditional expectation.
A related problem is that SPNs do not provide confidence intervals. We also developed corresponding extensions on SPNs in Section5.1.
3.3 Learning Ensembles of RSPNs
An RSPN can easily be used to represent the attributes of a single table. However, given a more complex database with multiple tables, we have to decide which RSPNs to learn. Naively, one could learn a single RSPN per table. However, then important information about dependencies between tables might be lost and lead to inaccurate approximations. For learning an ensemble of RSPNs for a given database in DeepDB, we thus take into account if tables of a schema are correlated.
In the following, we describe our procedure that constructs a so called base ensemble for a given database scheme. In this procedure, for every foreign keyprimary key relationship we learn an RSPN over the corresponding full outer join of two tables if there is a correlation between attributes of the different tables. Otherwise, RSPNs for the single tables will be learned. For instance, if the schema consists of a Customer and an Order table as shown in Figure 5, we could either learn two independent RSPNs (one for each table) or a joint RSPN (over the full outer join). In order to test independence of two tables and thus to decide if one or two RSPNs are more appropriate, we check for every pair of attributes from these tables if they can be considered independent or not. In order to enable an efficient computation, this test can be done on a small random sample. As a correlation measure that does not make major distributional assumptions, we compute RDC values [lopez2013randomized] between two attributes, which are also used in the MSPN learning algorithm [molina2017mixed]. If the maximum pairwise RDC value between all attributes of two tables exceeds a threshold (where we use the standard thresholds of SPNs), we assume that two tables are correlated and learn an RSPN over the join. Otherwise, single RSPNs are learned.
In the base ensemble only correlations between two tables are captured. While in our experiments, we see that this already leads to highly accurate answers, there might also be correlations not only between directly neighboring tables. Learning these helps to further improve the accuracy of queries that span more than two tables. For instance, if there was an additional Product table that can be joined with the Orders table and the product prize is correlated with the customers region, this would not be taken into account in the base ensemble. In Section 5.3, we extend our basic procedure for ensemble creation to take dependencies among multiple tables into account.
4 Probabilistic Query Compilation
The main challenge of probabilistic query compilation is to translate an incoming query (e.g., for AQP) into an inference procedure against an ensemble of RSPNs. To this end, recall that an ensemble for a given database either consists of RSPNs for single tables or spanning two (or more) tables.
In the following, we first describe how the translation procedure for a COUNT query works (which can be used either for AQP or for cardinality estimation) and then extend it to more complex queries (e.g., AVG and SUM). We then show how machine learning tasks can be supported with the help of RSPNs.
4.1 COUNT Queries
In this section, we explain how we can translate simple COUNT queries with and without filter predicates over single tables as well as COUNT queries that join multiple tables using inner joins (equi-joins). For filter predicates we support conjunctions of predicates of the form where is an attribute, a constant, and one of the comparison operators or an IN-comparison (e.g., age IN (20, 30, 40)). String or arithmetic expressions, as well as user-defined functions are currently not supported. Disjunctions could be realized using the inclusion-exclusion principle.
These types of queries can be used already for cardinality estimation but also cover some cases of aggregate queries for AQP. We later show the extensions to support a broader set of queries for AQP including other aggregates (AVG and SUM) as well as group-by statements. For answering the simple COUNT queries, we distinguish three cases of how queries can be mapped to RSPNs: (1) an RSPN exists that exactly matches the tables of the query, (2) the RSPN is larger and covers more tables, and (3) we need to combine multiple RSPNs since there is no single RSPN that contains all tables of the query.
Case 1: Exact matching RSPN available
The simplest case is a single table COUNT query with (or without) a filter predicate. If an RSPN is available for this table and denotes the number of rows in the table, the result is simply . For instance, the query
can be answered with the CUSTOMER RSPN in Figure 5. The result is . Note that
denotes the random variable being one if the conditionis fulfilled and thus .
A natural extension for COUNT queries over joins could be to learn an RSPN for the underlying join and use the formula where the size of the joined tables without applying a filter predicate is . For instance, the query
could be represented as which is .
However, joint RSPNs over multiple tables are learned over the full outer join. By using full outer joins we preserve all tuples of the original tables and not only those that have one or more join partner in the corresponding table(s). This way we are able for example to answer also single table queries from a joint RSPN, as we will see in Case 2. The additional NULL tuples that result from a full outer join must be taken into account when answering an inner join query. For instance, the second customer in Figure 5 does not have any orders and thus should not be counted for query To make it explicit which tuples have no join partner and thus would not be in the result of an inner join, we add an additional column for every table such as in the ensemble in Figure 5. This column is also learned by the RSPN and can be used as an additional filter column to eliminate tuples that do not have a join partner for the join query given. Hence, the complete translation of query for the RSPN learned over the full outer join in Figure 5 is .
Case 2: Larger RSPN available
The second case is that we have to use an RSPN that was created on a set of joined tables, however, the query only needs a subset of those tables. For example, let us assume that the query asking for European customers is approximated using an RSPN learned over a full outer join of customers and orders such as the one in Figure 5. The problem here is that customers with multiple orders would appear several times in the join and thus be counted multiple times. For instance, the ratio of European customers in the full outer join is though two out of three customers in the data set are European.
To address this issue, for each foreign keyprimary key relationship between tables and we add a column to table denoting how many corresponding join partners a tuple has. We call these tuple factors and later use them as correction factor. For instance, in the customer table in Figure 5 for the first customer the tuple factor is two since there are two tuples in the order table for this customer. It is important to note that tuple factors have to be computed only once per pair of tables that can be joined via a foreign key. In DeepDB, we do this when the RSPNs for a given database are created and our update procedure changes those values as well. Tuple factors are included as additional column and learned by the RSPNs just as usual columns. When used in a join, we denote them as . Since we are working with outer joins, the value of is at least 1.
We can now express the query that asks for the count of customers from Europe as
which results in . First, this query both includes the first customer (who has no orders) because the RSPN was learned on the full outer join. Second, the query also takes into account that the second and third customer have two orders each by normalizing them with their tuple factor .
In general, we can define the procedure to compile a query requiring only a part of an RSPN as follows:
Let be a COUNT query with a filter predicate which only queries a subset of the tables of a full outer join Let denote the product of all tuple factors that cause result tuples of to appear multiple times in The result of the query is equal to:
For an easier notation, we write the required factors of query as . The expectation of theorem 1 can be computed with an RSPN because all columns are learned.
Case 3: Combination of multiple RSPNs
As the last case, we handle a COUNT query that needs to span over multiple RSPNs. We first handle the case of two RSPNs and extend the procedure to RSPNs later. In this case, the query can be split into two subqueries and , one for each RSPN. There can also be an overlap between and which we denote as (i.e., a join over the shared common tables). The idea is first to estimate the result of using the first RSPN. We then multiply this result by the ratio of tuples in vs. tuples in the overlap . Intuitively, this expresses how much the missing tables not in increase the COUNT value of the query result.
For instance, there is a separate RSPN available for the Customer and the Order table in Figure 5. The query , as shown before, would be split into two queries and , one against the RSPN built over the Customer table and the other one over the RSPN for the Order table. is empty in this case. The query result of can thus be expressed using all these sub-queries as:
which results in The intuition of this query is that the left-hand side that uses computes the orders of European customers while the right-hand side computes the fraction of orders that are ordered online out of all orders.
We now handle the more general case that the overlap is not empty and that there is a foreign key relationship between a table in (and ) and a table in (but not in ). In this case, we exploit the tuple factor in the left RSPN. We now do not just estimate the result of but of joined with the table Of course this increases the overlap which we now denote as As a general formula for this case, we obtain Theorem 2:
Let the filter predicates and tuple factors of and be conditionally independent given the filter predicates of . Let be the foreign key relationship between a table in and a table in that we want to join. The result of is equal to
Independence across RSPNs is often given since our ensemble creation procedure preferably learns RSPNs over correlated tables as discussed in Section 3.
Alternatively, we can start the execution with . In our example query where is the query over the orders table, we can remove the corresponding tuple factor from the left expectation. However, we then need to normalize by the tuple factors to correctly compute the fraction of customers who come from Europe. To that end, the query can alternatively be computed using:
If multiple RSPNs are required to answer a query, we have several possible execution strategies. Our goal should be to handle as many correlations between filter predicates as possible because predicates across RSPNs are considered independent. For instance, assume we have both the Customer, Order and Customer-Order RSPNs of Figure 5 in our ensemble, and a join of customers and orders would have filter predicates on c_age and In this case, we would prefer the Customer-Order RSPN because it can handle all pairwise correlations between filter columns (c_region-c_age, c_region-o_channel, c_age-c_channel). Hence, at runtime we greedily use the RSPN that currently handles the filter predicates with the highest sum of pairwise RDC values. We also experimented with strategies enumerating several probabilistic query compilations and using the median of their predictions. However, this was not superior to our RDC-based strategy. Moreover, the RDC values have already been computed to decide which RSPNs to learn. Hence, at runtime this strategy is very compute-efficient.
The final aspect is how to handle joins spanning over more than two RSPNs. To support this, we can apply Theorem 2 several times.
4.2 Other AQP Queries
So far, we only looked into COUNT queries without group-by statements. In the following, we first discuss how we extend our query compilation to also support AVG and SUM queries before we finally explain group-by statements as well as outer joins.
We again start with the case that we have an RSPN that exactly matches the tables of a query and later discuss the other cases. For this case, queries with AVG aggregates can be expressed as conditional expectations. For instance, the query
can be formulated as with the ensemble in Figure 5.
However, for the case that an RSPNs spans more tables than required, we cannot directly use this conditional expectation because otherwise customers with several orders would be weighted higher. Again, normalization by the tuple factors is required. For instance, if the RSPN spans customers and orders as in Figure 5 for query we use
In general, if an average query for the attribute should be computed for a join query with filter predicates on an RSPN on a full outer join , we use the following expectation to answer the average query:
The last case is where the query needs more than one RSPN to answer the query. In this case, we only use one RSPN that contains and ignore some of the filter predicates that are not in the RSPN. As long as is independent of these attributes, the result is correct. Otherwise, this is just an approximation. For selecting which RSPN should be used, we again prefer RSPNs handling stronger correlations between and quantified by the RDC values. The RCDs can also be used to detect cases where the approximation would ignore strong correlations with the missing attributes in .
For handling SUM queries we run two queries: one for the COUNT and AVG queries. Multiplying them yields the correct result for the SUM query.
Finally, a group by query can be handled also by several individual queries with additional filter predicates for every group. This means that for groups we have to compute times more expectations than for the corresponding query without grouping. In our experimental evaluation, we show that this does not cause performance issues in practice if we compute the query on the model.
Query compilation can be easily extended to support outer joins (left/right/full). The idea is that we only filter out tuples that have no join partner for all inner joins (case 1 and 2 in Section 4.1) but not for outer joins (depending on the semantics of the outer join). Moreover, in case 3, the tuple factors with value zero have to be handled as value one to support the semantics of the corresponding outer join.
4.3 Machine Learning (ML) Tasks
Many ML tasks can directly also be expressed using RSPNs. For example, regression tasks can directly be translated into conditional expectations. For classification we can use most probable explanation (MPE) algorithms [molina2017mixed]. RSPNs are optimized to accurately represent the data which is beneficial for AQP and cardinality estimation. However, they still generalize since the dependency structure of the data is identified and thus the regression and classification performances are competitive as we show in our experiments.
5 DeepDB Extensions
We now describe important extensions of our basic framework presented before. We first explain how confidence intervals are provided, which is especially important for AQP. We then discuss how RSPNs can be updated if the database is changed. Finally, we present how we can optimize the basic ensemble of RSPNs by additional RSPNs that can span more than two tables.
5.1 Support for Confidence Intervals
Especially for AQP confidence intervals are important. However, SPNs do not provide those. After the probabilistic query compilation the query is expressed as a product of expectations. We first describe how to estimate the uncertainty for each of those factors and eventually how a confidence interval for the final estimate can be derived.
First, we split up expectations as a product of probabilities and conditional expectations. For instance, the expectation would be turned into . This allows us to treat all probabilities for filter predicates as a single binomial variable with probability and the amount of training data of the RSPN as
. Hence, the variance is. For the conditional expectations, we use the Koenig-Huygens formula . Note that also squared factors can be computed with RSPNs since the square can be pushed down to the leaf nodes. We now have a variance for each factor in the result.
For the combination we need two simplifying assumptions: (i) the estimates for the expectations and probabilities are independent, and (ii) the resulting estimate is normally distributed. In our experimental evaluation, we show that despite these assumptions our confidence intervals match those of typical sample-based approaches.
We can now approximate the variance of the product using the independence assumption by recursively applying the standard equation for the product of independent random variables: . Since we know the variance of the entire probabilistic query compilation and we assume that this estimate is normally distributed we can provide confidence intervals.
5.2 Support for Updates
The intuition of our update algorithm is to regard RSPNs as indexes. Similar to these, insertions and deletions only affect subtrees and can be performed recursively. Hence, the updated tuples recursively traverse the tree and passed weights of sum nodes and the leaf distributions are adapted. Our approach supports insert and delete operations, where an update-operation is mapped to a pair of delete and insert operations.
The update algorithm is summarized in Algorithm 1. Since it is recursive, we have to handle sum, product and leaf nodes. At sum nodes (line 4) we have to identify to which child node the inserted (deleted) tuple belongs to determine which weight has to be increased (decreased). Since children of sum nodes represent row clusters found by KMeans during learning [molina2017mixed], we can compute the closest cluster center (line 5), increase (decrease) its weight (line 6) and propagate the tuple to this subtree (line 7). In contrast, product nodes (line 8) split the set of columns. Hence, we do not propagate the tuple to one of the children but split it and propagate each tuple fragment to the corresponding child node (lines 9-11). Arriving at a leaf node, only a single column of the tuple is remaining. We now update the leaf distribution according to the column value (line 2).
This approach does not change the structure of the RSPN, but only adapts the weights and the histogram values. If there are new dependencies as a result of inserts they are not represented in the RSPN. As we show in Section 6.1 on a real-word data set, this typically does not happen, even for high incremental learning rates of 40%. Nevertheless, in case of new dependencies the RSPNs have to be rebuilt. This is solved by checking the database cyclically for changed dependencies by calculating the pairwise RDC values as explained in Section 5.3 on column splits of product nodes. If changes are detected in the dependencies, the affected RSPNs are regenerated. As for traditional indexes, this can be done in the background.
5.3 Ensemble Optimization
As mentioned before, we create an ensemble of RSPNs for a given database. The base ensemble contains either RSPNs for single tables or they span over two tables connected by a foreign key relationship if they are correlated. Correlations occurring over more than two tables are ignored so far since they lead to larger models and higher training times. In the following, we thus discuss an extension of our ensemble creation procedure that allows a user to specify a training budget (in terms of time or space) and DeepDB selects the additional larger RSPNs that should be created. We formulate the problem of which additional RSPNs to learn as constrained optimization problem.
To quantify the correlations between tables, as mentioned already before, we compute the pairwise RDC values for every pair of attributes in the schema. For every pair of tables, we define the maximum RDC value between two columns as the dependency value. The dependency value indicates which tables should appear in the same RSPN and which not. An example is given in Figure 5(a). Here, the Customer, Order and Orderline tables have high pairwise correlations while the State table is only highly correlated with the Customer table.
For every RSPN the goal is to achieve a high mean of these pairwise maximal RDC values. This ensures that only tables with high pairwise correlation are merged in an RSPN. For instance, the mean RDC value for the RSPN learned over the full outer join of the tables Customer, Order and Orderline would be . This RSPN is more valuable than an RSPN learned over the State, Customer and Order tables with a mean RDC value of . The overall objective function for our optimization procedure to create an ensemble is thus to maximize the sum of all mean RDC values of the RSPNs in the ensemble.
The limiting factor (i.e., the constraint) for the additional RSPN ensemble selection should be the budget (i.e., extra time compared to the base ensemble) we allow for the learning of additional RSPNs. For the optimization procedure, we define the maximum learning costs as a factor relative to the learning costs of the base ensemble . Hence, a budget factor means that only the base ensemble would be created. For higher budget factors , additional RSPNs over more tables are learned in addition. If we assume that an RSPN among the set of all possible unique RSPNs has a cost , then we could formulate the optimization problem as follows:
However, estimating the real cost (i.e., time) to build an RSPN is hard and thus we can not directly solve the optimization procedure. Instead, we estimate the relative cost to select the RSPN that has the highest mean RDC value and the lowest relative creation cost. To model the relative creation cost, we assume that the costs grow quadratic with the number of columns since the RDC values are created pairwise and linear in the number of rows . Consequently, we pick the RSPN with highest mean RDC and lowest cost which is as long as the maximum training time is not exceeded.
6 Experimental Evaluation
In this Section, we show that DeepDB outperforms state-of-the-art systems for both cardinality estimation and AQP, where we not only demonstrate the performance of DeepDB for both tasks but also show the capabilities of updating RSPNs. Moreover, we also study the performance of DeepDB for different ML tasks.
The RSPNs we used in all experiment were implemented in Python as extensions of SPFlow [molina2019spflow]
. As hyperparameters, we used an RDC threshold ofand a minimum instance slice of of the input data, which determines the granularity of clustering. Moreover, we used a budget factor of 0.5, i.e. the training of the larger RSPNs takes approximately 50% more training time than the base ensemble. We determined the hyperparameters using a grid-search, which gave us the best results across different data sets.
6.1 Exp. 1: Cardinality Estimation
First, we compare the prediction quality of DeepDB which is purely data-driven with state-of-the-art learned cardinality estimation techniques that take the workload into account.
In addition to the learned baselines, we also compare against non-learned baselines. First we trained a Multi-Set Convolutional Network (MCSN) [kipf2019learned] as a learned baseline. MCSNs are specialized deep neural networks using the join paths, tables and filter predicates as inputs. Also, we use Index-Based Join Sampling [leis2017ibjs] as a non-learned baseline. This algorithm exploits secondary indexes to estimate the full join size using sampling. Furthermore, the standard cardinality estimation of Postgres 11.5 was employed as another non-learned baseline. Additionally, we implemented random sampling.
As in [kipf2019learned, leis2015how], the JOB-light benchmark is used. The benchmark uses the real-world IMDb database and defines 70 queries. Furthermore, we additionally defined a synthetic query set of 200 queries were joins from three to six tables and one to five filter predicates appear uniformly on the IMDb data set. We use this query set to compare the generalization capabilities of the learned approaches.
In contrast to other learned approaches for cardinality estimation [kipf2019learned, sun2019an], no dedicated training data is required for DeepDB. Instead, we just learn a representation of the data. The training of the base ensemble takes 48 minutes. The creation time includes the data preparation time to compute the tuple factors as introduced in Section 4.1.
In contrast, for the MCSN [kipf2019learned] approach, 100k queries need to be executed to collect cardinalities resulting in 34 hours of training data preparation time (when using Postgres). Moreover, the training of the neural network takes about 15 minutes on a Nvidia V100 GPU.
We see that our training time is much lower since we do not need to collect any training data for the workload. Another advantage is that we do not have to re-run the queries once the database is modified. Instead, we provide an efficient algorithm to update RSPNs in DeepDB as discussed in Section 3.2.
The prediction quality of cardinality estimators is usually evaluated using the q-error which is the factor by which an estimate differs from the real execution join size. For example, if the real result size of a join is 100, the estimates of 10 or 1000 tuples both have a q-error of 10. Using the ratio instead of an absolute or quadratic error captures the intuition that for making optimization decisions only relative differences matter.
In Table 1 we depicted the median, 90-th and 95-th percentile and max q-errors for the JOB-light benchmark of our approach compared to the baselines. As we can see DeepDB
outperforms the best competitors in every percentile often by orders of magnitude. In the median it outperforms the best competitor Index Based Join Sampling (1.23 vs. 1.59). The advantage of the learned approach MCSN is that it outperforms traditional approaches by orders of magnitude for the higher percentiles and is thus more robust. Even for these outliers,DeepDB provides additional robustness having a 95-th percentile for the q-errors of 3.16 vs. 143 (MCSN). The q-errors of both Postgres and random sampling are again significantly larger both for the medians and the higher percentiles. The estimation latencies for cardinalities using DeepDB are currently in the order of s to ms which suffices for complex join queries that can often run multiple on larger data sets. By using smaller RSPNs or an optimized implementation of SPNs such as [sommer2018automatic], the latencies could further be reduced.
Especially for learned approaches the question of generalization is important, i.e. how well the models perform on previously unseen queries. For instance, by default the MCSN approach is only trained with queries up to three joins because otherwise the training data generation would be too expensive [kipf2019learned]. Similarly in our approach, in the ensemble only few RSPNs with large joins occur because otherwise the training would also be too expensive. However, both approaches support cardinality estimates for unseen queries.
To compare both learned approaches, we randomly generated queries for joins with four to six tables and one to five selection predicates for the IMDb data set. In Figure 7, we plot the resulting median q-errors for both learned approaches: DeepDB and MCSN [kipf2019learned]. The median q-errors of DeepDB are orders of magnitude lower for larger joins. Additionally, we can observe that for the MCSN approach the estimates tend to become less accurate for queries with fewer selection predicates. One possible explanation is that more tuples qualify for such queries and thus higher cardinalities have to be estimated. However, since there are at most three tables joined in the training data such higher cardinality values are most likely not predicted. We can conclude that using RSPNs leads to superior generalization capabilities.
In this experiment, we show that updated RSPN ensembles can precisely estimate cardinalities. To this end, we first learn the base RSPN ensemble on a certain share of the full IMDb data set (95%, 90%, 80% and 60%) and then update it using the remaining tuples. In a first experiment, the IMDb data set is randomly split while in the second experiment we learn the initial RSPNs on all movies up to a certain production year. Both experiments show that the q-error does not change significantly for the updated RSPN ensembles. Detailed results are given in Table 2. We use zero as the budget factor to demonstrate that even base ensembles provide good estimates after updates. This is also the reason that the estimation errors slightly deviate from Table 1.
Since in the initial learning of the RSPN ensemble we learn the RSPN on a sample of the full outer join, the same sample rate has to be used for the updates, i.e. we only update the RSPN with a sample of all inserted tuples. Using a sampling rate of 1%, we can handle up to 55,000 updates per second. The structure of the RSPN tree is not changed during updates, but only the parameters are updated according to the new tuples. However, in the experiments we could show that this does not impair the accuracy on a real-world data set.
The updateability is a clear advantage of DeepDB
compared to deep-learning based approaches for cardinality estimation[kipf2019learned, sun2019an]. Since these model the problem end-to-end all training data queries would have to be run again on the database to gather the updated cardinalities.
Finally, in the last experiment we explore the tradeoff between ensemble training time and prediction quality of DeepDB. We first vary the budget factor used in the ensemble selection between zero (i.e. learning only the base ensemble with one RSPN per join of two tables) and B=3 (i.e. the training of the larger RSPNs takes approximately three times longer than the base ensemble) while using samples per RSPN. We then use the resulting ensemble to evaluate 200 queries with three to six tables and one to five selection predicates. The resulting median q-errors are shown in Figure 8. For higher budget factors the means are improving but already saturate at . This is because there are no strong correlations in larger joins that have not already been captured in the base ensemble.
We moreover evaluate the effect of the sampling to reduce the training time. In this experiment we vary the sample size from 1000 to 10 million. We observe that while the training time increases, the higher we choose this parameter, the prediction quality improves (from 2.5 to 1.9 in the median). In summary, the training time can be significantly reduced if slight compromises in prediction quality are acceptable. When minimization of training time is the more important objective we can also fall back and only learn RSPNs for all single tables and no joins at all. This reduces the ensemble training time to just five minutes. However, even this cheap strategy is still competitive. For JOB-light this ensemble has a median q-error of 1.98, a 90-th percentile of 5.32, a 95-th percentile of 8.54 and a maximum q-error of 186.53. Setting this in perspective to the baselines, this ensemble still outperforms state of the art for the higher percentiles and only Index Based Join Sampling is slightly superior in the median. This again proves the robustness of RSPNs.
6.2 Exp. 2: AQP
In this Section, we compare DeepDB with state-of-the-art systems for AQP.
We evaluated the approaches on both a synthetic data set and a real-world data set. As synthetic data set, we used the Star Schema Benchmark (SSB) [o2009star] with a scale factor of with the standard queries (denoted by S1.1-S4.3). As the real-world data set, we used the Flights data set222https://www.kaggle.com/usdot/flight-delays with queries ranging from selectivities between 5% an 0.01% covering a variety of group by attributes, AVG, SUM and COUNT queries (denoted by F1.1-F5.2). To scale the data set up to 1 billion records we used IDEBench [eichmann2018idebench].
As baselines we used VerdictDB [park2018verdict], Wander Join [li2016wander] and the Postgres TABLESAMPLE command (which uses random samples). VerdictDB is a middleware that can be used with any database system. It creates a stratified and a uniform sample for the fact tables to provide approximate queries. For VerdictDB, we used the default sample size (1% of the full data set) for the Flights data set. For the SSB benchmark, this led to high query latencies and we thus decided to choose a sample size such that the query processing time was two seconds on average. Wander Join is a join sampling algorithm leveraging secondary indexes to generate join samples quickly. We set the time bound also to two seconds for a fair comparison and only evaluated this algorithm for data sets with joins. To this end, we created all secondary indexes for foreign key relationships and predicates on the dimension tables. For the TABLESAMPLE command we chose a sample percentage such that the queries take two seconds on average as well.
For DeepDB the same hyperparameters were used as for the previous experiment (Exp. 1). The training took just 17 minutes for the SSB data set and 3 minutes for the Flights data set. The shorter training times compared to the IMDb data set are due to fewer cross-table correlations and hence fewer large join models in the ensemble. For VerdictDB, scrambles have to be created, i.e. uniform and stratified samples from the data set. This took 10 hours for the flights data set and 6 days for the SSB benchmark using the standard implementation.333https://docs.verdictdb.org/reference/pyverdict/ For wander join, secondary indexes had to be created also requiring several hours for the SSB data set.
Accuracy and Latency
For AQP two dimensions are of interest. First, the quality of the approximation quantified with the relative error. Second, the latency of the result is relevant when evaluating AQP systems. The relative error is defined as where and are the true and predicted aggregate function, respectively. If the query is a group by query, several aggregates have to be computed. In this case, the relative error is averaged over all groups. The results for the Flights data set are given in Figure 9.
For the Flights data set, we can observe that DeepDB always has the lowest average relative error. This is often the case for queries with lower selectivities where sample-based approaches have few tuples that satisfy the selection predicates and thus the approximations are very inaccurate. In contrast, DeepDB does not rely on samples but models the data distribution and leverages the learned representation to provide estimates. For instance, for query 11 with a selectivity of 0.5% VerdictDB and the TABLESAMPLE strategy have an average relative error of 15.6% and 13.6%, respectively. In contrast, the average relative error of DeepDB is just 2.6%.
Moreover, the latencies for both TABLESAMPLE and VerdictDB are between one and two seconds on average. In contrast, DeepDB does not rely on sampling but on evaluating the RSPNs. This is significantly faster resulting in a maximum latency of 31ms. This even holds true for queries with several groups where more expectations have to be computed (at least one additional per different group).
The higher accuracies of DeepDB are even more severe for the SSB benchmark. The queries have even lower selectivities between 3.42% and 0.0075% for queries 1 to 12 and 0.00007% for the very last query. This results in very inaccurate predictions of the sample-based approaches. Here, the average relative errors are orders of magnitude lower for DeepDB always being less than 6%. In contrast, VerdictDB, Wander Join and the TABLESAMPLE approach often have average relative errors larger than 100%. Moreover, for some queries no estimate can be given at all because no samples are drawn that satisfy the filter predicates. However, while the other approaches take two seconds to provide an estimate, DeepDB requires no more than 293ms in the worst case. In general the latencies for DeepDB are lower for queries with fewer groups because less expectations have to be computed.
In this experiment, we evaluate how accurate the confidence intervals predicted by DeepDB are. To this end, we measure the relative confidence interval length defined as: , where is the prediction and is the lower bound of the confidence interval.
This relative confidence interval length is compared to the confidence interval of a sample-based approach. For this we draw 10 million samples (as many samples as our models use for learning in this experiment) and compute estimates for the average, count and sum aggregates. We then compute the confidence intervals of these estimates using standard statistical methods. For COUNT aggregates, the estimator is simply a binomial variable with parameters and for which we can compute a confidence interval. For AVG
queries we exploit the central limit theorem stating that the estimator is normally distributed. We then compute the standard deviation on the sample and derive the confidence interval for a normal variable having this standard deviation and the mean of our estimate. ForSUM queries we model the estimator as a product of both estimators. The resulting confidence interval lengths can be seen as ground truth and are compared to the confidence intervals of our system in Figure 11. Note that we excluded queries where less than 10 samples fulfilled the filter predicates. In these cases the estimation of a standard deviation has itself a too high variance.
In all cases, the confidence intervals of DeepDB are very good approximations of the true confidence intervals. The only exception is query F5.2 for the Flights data set which is a difference of two SUM aggregates. In this case, assumption (i) of Section 5.1 does not hold: the probabilities and expectation estimates cannot be considered independent. This is the case because both SUM aggregates contain correlated attributes and thus the confidence intervals are overestimated. However, note that in the special case of the difference of two sum aggregates the AQP estimates are still very precise as shown in Figure 9 for the same query F5.2. Only the confidence interval is overestimated. Such cases can easily be identified and only occur when arithmetic expressions of several aggregates should be estimated.
Other ML-based Approaches
The only learned approach for AQP that was recently published is DBEst [ma2019dbest]. Other approaches like [thirumuruganathan2019approximate] cannot provide estimates for joins and are thus similarly excluded. DBEst creates density and regression models for popular queries. They can be reused if only where conditions on numeric attributes or ordinal categorical attributes are changed. But if an unseen new query arrives and there is no model available we have to create a biased sample fulfilling the non-ordinal conditions on categorical columns.
Depending on the selectivity, this comes at a cost. Afterwards, the density and regression models on the sample have to be learned. In contrast, in our approach, we learn an RSPN ensemble once and can provide estimates for arbitrary queries immediately. In Figure 12, we thus compare the cumulative training time including sampling and data preparation times of DBEst and DeepDB for SSB. As we can see, for query S1.2 and S1.3 the model of query S1.1 can be reused and thus the cumulative training time does not increase. In contrast, for some selective queries like S3.3 the biased sampling and training takes very long ( 3 hours). For DeepDB the ensemble has to be trained just once and any query can be answered ad-hoc.
6.3 Exp. 3: Machine Learning
In this experiment we show that RSPNs are competitive ML regression models. We first predict all different numeric attributes for the Flights data set using all other columns as features with the same RSPN we used for the AQP queries. As baselines we trained standard ML models on the same training data to solve the same prediction task and compare the training time and Root Mean Squared Error (RMSE) on the test set in Figure 13. The advantage of DeepDB is that no additional training is required to execute the regression task while the RMSE is comparable to standard models. Consequently, using RSPNs we obtain a free classification and regression model for any combination of features.
7 Related Work
Before concluding, we discuss further related work on using ML for cardinality estimation and AQP and SPNs.
Learned Cardinality Estimation
The problem of selectivity estimation for single tables is a special case of cardinality estimation. There is a large body of work applying different ML approaches including probabilistic graphical models [tzoumas2013efficiently, getoor2001selectivity, getoor2011learning], neural networks [lakshmi1998selectivity, liu2015cardinality] and specialized deep learning density models [shohedul2019multiattribute] to this problem. Recently, Dutt et al. [selectivity2019lightweight] suggested using lightweight tree-based models in combination with log-transformed labels leading to superior predictions.
The first works applying ML to cardinality estimation including joins used simple regression models [akdere2012learning, malik07ablackbox]. More recently, Deep Learning was specifically proposed to solve cardinality estimation end-to-end [kipf2019learned, sun2019an]. Woltmann et al. [woltmann2019cardinality] also separate the problem of cardinality estimation on a large schema. To this end, deep learning models similar to [kipf2019learned] are learned for certain schema sub-parts. However, two models for schema sub-parts cannot be combined to provide estimates for a larger join. Other techniques exploit learned models for overlapping subgraph templates for recurring cloud workloads [wu2018towards]. All these models need a workload to be executed and used as training data which is different from our data-driven approach.
Early work [shanmugasundaram1999compressed] suggests to approximate OLAP cubes by mixture models based on found clusters in the data. Though greatly reducing the required storage, the approximation errors are relatively high. FunctionDB [thiagarajan2008querying] constructs piecewise linear functions as approximation. In contrast to DeepDB, only continuous variables are supported. DBEst [ma2019dbest] builds models for popular queries and thus samples do not have to be kept. However, in contrast to DeepDB only those popular queries and no ad-hoc queries are supported. Park et al. suggested Database Learning [park2019database] which builds a model from query results that is leveraged to provide approximate results for future queries. In contrast, DeepDB is data-driven and does not require past query results. Moreover, specialized generative models were suggested to draw samples for AQP [thirumuruganathan2019approximate]. However, this technique does not work for joins.
Sum Product Networks [domingos2011spn, molina2017poisson, molina2017mixed] have recently gained attention because these graphical models allow an efficient inference process. Furthermore, our update process can be seen as an orthogonal approach to online learning for SPNs [kalra2018online]. In contrast to incremental learning schemes [lee2013online] for SPNs, we do not change the structure if new tuples are inserted for performance reasons.
8 Conclusion and Future work
In this work we have proposed DeepDB which is a data-driven approach for learned database components. We have shown that our approach is general and can be used to support various tasks including cardinality estimation and approximate query processing. Our experiments demonstrate that DeepDB outperforms both traditional and learned state-of-the-art techniques often by orders of magnitude. In addition, we leveraged the same approach to support ML tasks on the data set with accuracies competitive with neural networks while not requiring any additional training time.
We believe our data-driven approach for learning can also be exploited to improve other database internals. For instance, it has already been shown that column correlations can be exploited to improve indexing [wu2019designing]. In addition, SPNs naturally provide a notion of correlated clusters that can also be used for suggesting using interesting patterns in data exploration. Finally, we believe that it is an interesting avenue of future work to combine data-driven and workload-driven approaches to combine the best of both worlds.