A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation

07/26/2021
by Peizhi Wu et al.
Nanyang Technological University

Cardinality estimation is a fundamental problem in database systems. To capture the rich joint data distributions of a relational table, most existing work uses either data as unsupervised information or the query workload as supervised information. Very little work uses both types of information, and what exists cannot fully exploit both to learn the joint data distribution. In this work, we aim to close the gap between data-driven and query-driven methods by proposing a new unified deep autoregressive model, UAE, that learns the joint data distribution from both the data and the query workload. First, to enable using supervised query information in a deep autoregressive model, we develop differentiable progressive sampling using the Gumbel-Softmax trick. Second, UAE is able to utilize both types of information to learn the joint data distribution in a single model. Comprehensive experimental results demonstrate that UAE achieves single-digit multiplicative error at tail and better accuracy than state-of-the-art methods, and is both space and time efficient.



1. Introduction

Cardinality estimation, i.e., estimating the result size of a SQL predicate, is a critical component of query optimization, which aims to identify a good execution plan based on cardinality estimates (Hu et al., 2018). Despite its importance, modern DBMSs may still produce large estimation errors on complex queries and on datasets with strong correlations (Leis et al., 2018). Additionally, cardinality estimation can also be used for approximate query processing (Hilprecht et al., 2020).

The fundamental difficulty of cardinality estimation is to construct or learn an accurate and compact representation of the joint data distribution of a relational table (the frequency of each unique tuple normalized by the table's cardinality). Most existing work on cardinality estimation falls into two broad classes: data-driven and query-driven estimation. Data-driven methods summarize the joint data distribution for cardinality estimation. Traditional data-driven methods include data-driven histograms, sampling, and sketching; however, they usually make independence and uniformity assumptions that do not hold in complex real-world datasets. Learning-based methods formulate data-driven cardinality estimation as a machine learning problem, but traditional machine learning methods have their own shortcomings. For example, kernel density estimation (Heimel et al., 2015; Kiefer et al., 2017) handles high-dimensional data poorly, and probabilistic graphical models (Chow and Liu, 1968; Getoor et al., 2001; Spiegel and Polyzotis, 2006; Tzoumas et al., 2013) are inefficient at estimation time.

Recent advances in deep learning offer promising tools in this regard. A Sum-Product Networks-based model (Hilprecht et al., 2020) has been applied to approximate the joint distribution, but it does not handle the strong attribute correlations of real-world datasets well. A promising recent advance is the application of deep autoregressive models (Nash and Durkan, 2019; Salimans et al., 2017; Germain et al., 2015a) to cardinality estimation (Yang et al., 2020; Hasan et al., 2020); these models capture attribute correlations and offer reasonable estimation efficiency. However, the optimization target (loss function) of deep autoregressive models minimizes the average error over the entire data, and can thus neglect tricky (e.g., long-tail) data regions: the model may be dominated by a few head data regions while degrading on many tail regions.

Alternatively, query-driven cardinality estimation utilizes the query workload (from either a query log or a generated workload) to estimate cardinalities without seeing the data, and is expected to have more focused information on workload queries than data-driven methods (Bruno et al., 2001). Traditional query-driven histograms (Aboulnaga and Chaudhuri, 1999; Stillger et al., 2001; Bruno et al., 2001; Lim et al., 2003; Anagnostopoulos and Triantafillou, 2015) suffer from the same problems as other histogram-based methods. More recently, deep learning (DL)-based estimators can estimate complex joins without independence assumptions, thanks to the representation power of deep neural networks (Schmidhuber, 2015). However, query-driven models assume that test queries share similar properties with training queries, i.e., that they are drawn from the same distribution. This may not hold under workload shifts: test and training workloads may focus on different data regions. Moreover, generating workload queries that sufficiently cover all data regions to train a model is expensive.

It is natural, then, to utilize both data and query workload for cardinality estimation. In fact, a few proposals (e.g., DeepDB) consider this combination an interesting avenue for future work, and several solutions (Kipf et al., 2019; Dutt et al., 2019; Heimel et al., 2015; Kiefer et al., 2017) have been proposed that utilize both data and workload. However, these pioneering studies combine the two types of information in simple ways that do not suffice to capture both for learning the joint data distribution. As discussed in more detail in related work, these solutions simply use one side (data or queries) as auxiliary information to enhance a model of the other side. Consequently, they cannot model the data as unsupervised information and the query workload as supervised information in one unified model to learn the joint data distribution for cardinality estimation.

Goals To solve the aforementioned problems, we identify four design goals:


  • G1. Capturing data correlations without independence or uniformity assumption;

  • G2. Utilizing both data and query workload for model training;

  • G3. Incrementally ingesting new data and query workload;

  • G4. Time and space efficient estimation.

Our Solution To achieve the four goals, in this paper we propose a new unified deep autoregressive estimator, UAE, that utilizes data as unsupervised information and the query workload as supervised information for learning the joint data distribution. Deep autoregressive models have demonstrated superior effectiveness and efficiency in the pioneering work (Yang et al., 2020; Hasan et al., 2020) on training autoregressive models for cardinality estimation. However, no existing deep autoregressive model in the literature can incorporate the query workload as supervised information for learning the joint data distribution, much less support both data as unsupervised information and queries as supervised information in the same model. To enable the former, we propose a novel idea: we utilize the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) to make the categorically sampled variables differentiable, so that the deep autoregressive model can learn the joint data distribution directly from queries. Furthermore, we combine the unsupervised and supervised losses, produced from data and queries respectively, with a trade-off hyper-parameter, and train the deep autoregressive model to learn the joint data distribution with a single set of model parameters by minimizing the combined loss. Therefore, UAE can learn from both data and queries simultaneously using the same set of model parameters. Moreover, since UAE is trained with both data and queries, it is naturally capable of incorporating incremental data and query workload.

Contributions This work makes the following contributions:


  • We propose a novel approach, UAE-Q, to incorporating query workload as supervised information in a deep autoregressive model. This is the first deep autoregressive model capable of learning a density distribution from query workload.

  • We propose the first unified deep autoregressive model, UAE, that uses both data as unsupervised information and query workload as supervised information for learning the joint data distribution. To the best of our knowledge, this is the first deep model that is truly capable of doing so.

  • We conduct comprehensive experiments comparing with 9 baseline methods on three real-life datasets. The baselines cover data-driven, query-driven, and hybrid methods, including recent deep learning based methods. The experimental results show that UAE achieves single-digit multiplicative error at tail and better accuracy than other state-of-the-art estimators, and is both space and time efficient. The results demonstrate that our method achieves the four aforementioned goals. Interestingly, they also show that UAE-Q, which is trained on queries only, outperforms the state-of-the-art query-driven method.

2. Related Work

Selectivity or cardinality estimation has been an active area of research for decades  (Cormode et al., 2012). We present the previous solutions in three categories: data-driven estimators, query-driven estimators, and hybrid estimators as summarized in Table 1.

Table 1 compares the methods along six dimensions: without assumptions, learning from data, learning from queries, incorporating incremental data, incorporating incremental query workload, and efficient estimation.

  • Data-driven: Sampling (Lipton et al., 1990; Haas et al., 1994; Riondato et al., 2011); Histograms (Poosala et al., 1996; Deshpande et al., 2001; Ilyas et al., 2004; Lynch, 1988; Muralikrishna and DeWitt, 1988; Van Gelder, 1993; Jagadish et al., 2001; Thaper et al., 2002; To et al., 2013); KDE (Gunopulos et al., 2000, 2005); PGM (Chow and Liu, 1968; Getoor et al., 2001; Spiegel and Polyzotis, 2006; Tzoumas et al., 2013); RSPN model (Hilprecht et al., 2020); DL models (Yang et al., 2020; Hasan et al., 2020)

  • Query-driven: Histograms (Aboulnaga and Chaudhuri, 1999; Stillger et al., 2001; Bruno et al., 2001; Lim et al., 2003; Anagnostopoulos and Triantafillou, 2015); Mixture models (Park et al., 2020); DL models (Wu et al., 2018; Ortiz et al., 2019; Hasan et al., 2020; Kipf et al., 2019; Sun and Li, 2019)

  • Hybrid: Sampling-enhanced ML models (Kipf et al., 2019); Histogram-enhanced ML models (Dutt et al., 2019); Query-enhanced KDE (Heimel et al., 2015; Kiefer et al., 2017); UAE (Ours)

Table 1. A summary of existing cardinality estimation methods.

Data-driven Cardinality Estimation Data-driven methods construct estimation models from the underlying data. First, sampling-based methods (Lipton et al., 1990; Haas et al., 1994; Riondato et al., 2011) estimate cardinalities by scanning a sample of the data, which incurs space overhead and can be expensive. Histogram methods (Poosala et al., 1996; Deshpande et al., 2001; Ilyas et al., 2004; Lynch, 1988; Muralikrishna and DeWitt, 1988; Van Gelder, 1993; Jagadish et al., 2001; Thaper et al., 2002; To et al., 2013) construct histograms to approximate the data distribution. However, most of these methods make partial or conditional independence and uniformity assumptions, e.g., that the data is uniformly distributed within a bucket. A host of unsupervised machine learning based methods have been developed for data-driven cardinality estimation. Probabilistic graphical models (PGM) (Chow and Liu, 1968; Getoor et al., 2001; Spiegel and Polyzotis, 2006; Tzoumas et al., 2013) use Bayesian networks to model the joint data distribution, which also relies on conditional independence assumptions. Kernel density estimation (KDE)-based methods (Gunopulos et al., 2000, 2005) avoid independence assumptions, but their accuracy is not very competitive due to the difficulty of tuning the bandwidth parameter. Recently, Naru (Yang et al., 2020) and MADE (Hasan et al., 2020) utilize unsupervised deep autoregressive models to learn the conditional probability distributions and use them to answer point queries. For range queries, Naru uses progressive sampling and MADE uses an adaptive importance sampling algorithm, achieving comparable results. Neither Naru nor MADE makes any independence assumption.

Query-driven Cardinality Estimation Supervised query-driven approaches build models by leveraging the query workload. As opposed to data-driven histograms, query-driven histograms (Aboulnaga and Chaudhuri, 1999; Stillger et al., 2001; Bruno et al., 2001; Lim et al., 2003; Anagnostopoulos and Triantafillou, 2015) build histogram buckets from the query workload without seeing the underlying data. Recently, QuickSel (Park et al., 2020) fits the data distribution with a uniform mixture model trained on every query in the workload, which avoids the overhead of multi-dimensional histograms; QuickSel still relies on uniformity assumptions. Deep learning models have also been employed for query-driven cardinality estimation. Ortiz et al. (Ortiz et al., 2019) evaluate multi-layer perceptrons and recurrent neural networks on encoded queries for cardinality estimation. Sup (Hasan et al., 2020) encodes queries as a set of features and learns weights for these features with a fully connected neural network to estimate selectivity. In addition, Wu et al. (Wu et al., 2018) consider a relevant but different problem, estimating the cardinality at each point of a query plan graph by training a traditional machine learning model. Sun et al. (Sun and Li, 2019) estimate both the execution cost of a query plan and its cardinality together in a multi-task learning framework.

Hybrid Cardinality Estimation A few proposals leverage both the query workload and the underlying data to predict cardinalities. Query-enhanced approaches (Heimel et al., 2015; Kiefer et al., 2017) leverage the query workload to numerically optimize the bandwidth parameter of a KDE model for better accuracy; however, KDE-based models do not work well for high-dimensional data (Heimel et al., 2015; Kiefer et al., 2017). More recently, selectivity estimates from data-driven models have been used together with encoded queries as input features to machine learning models (Kipf et al., 2019; Dutt et al., 2019). Dutt et al. include the cardinality estimates of histograms as extra features in addition to query features, and use neural networks and tree-based ensemble models for cardinality estimation. Kipf et al. (Kipf et al., 2019) include estimates from sampling as extra features in addition to query features and use convolutional neural networks. However, these two approaches have the following problems: (1) they cannot be trained on data directly and do not fully capture the benefits of both types of information; (2) their combination methods significantly increase the model's space budget (for storing samples or histograms) and hurt training and estimation efficiency; (3) they cannot directly ingest incremental data, because they must be retrained with new queries whose cardinalities are obtained on the updated data.

Autoregressive Models Autoregressive models capture the joint data distribution by decomposing it into a product of conditional distributions. Recent deep autoregressive models include the Masked Autoencoder (Germain et al., 2015b), Masked Autoregressive Flow (Papamakarios et al., 2017), and Autoregressive Energy Machines (Nash and Durkan, 2019).

Remark Our UAE estimator belongs to the hybrid family and is based on deep autoregressive models. To our knowledge, no existing deep autoregressive model in the machine learning literature can use query workload as supervised information to train the model, much less support both data as unsupervised information and queries as supervised information in one model. In UAE, we propose a novel solution to enable using query workload as supervised information, as well as a unified model that utilizes both data as unsupervised information and query workload as supervised information in a deep autoregressive model.

3. Problem Statement

Consider a relation $T$ that consists of $n$ columns (or attributes) $A_1, \dots, A_n$. A tuple (or data point) $x = (x_1, \dots, x_n)$ is an $n$-dimensional vector. The row count of $T$ is denoted by $|T|$. The domain region of attribute $A_i$ is given by $Dom(A_i)$, the set of distinct values in $A_i$.

Predicates A query $Q$ is a conjunction of predicates, each of which contains an attribute, an operator, and a value. A predicate indicates a constraint on an attribute (e.g., an equality constraint $A_i = v$, or a range constraint $l \le A_i \le r$).

Cardinality The cardinality of a query $Q$, denoted $card(Q)$, is defined as the number of tuples of $T$ that satisfy the query. A related term, selectivity, is the fraction of the rows of $T$ that satisfy the query, i.e., $sel(Q) = card(Q) / |T|$.

Supported Queries Our proposed estimator supports cardinality estimation for queries with conjunctions of predicates. Each predicate contains a range constraint (<, ≤, >, ≥), an equality constraint (=), or an IN clause on a numeric or categorical column. For a numerical column, we assume the domain region is finite and use the values present in that column as the attribute domain. Moreover, the estimator can also support disjunctions via the inclusion-exclusion principle. Note that our formulation follows a large body of previous work on cardinality estimation (Yang et al., 2020; Hasan et al., 2020; Getoor et al., 2001; Park et al., 2020). For joins, UAE supports multi-way and multi-key equi-joins, following (Yang et al., 2021). Moreover, group-by queries could be supported by learning query containment rates (Hayek and Shmueli, 2020).

Problem Consider (1) the underlying tabular data $T$ and (2) a set of queries $Q_1, \dots, Q_m$ with their cardinalities $card(Q_1), \dots, card(Q_m)$. This work aims to build a model that leverages both the labeled queries and the underlying data to predict the cardinalities of incoming queries. Furthermore, after training, it is desirable that the model can ingest new data and query workload incrementally rather than be retrained from scratch. Note that such labeled queries can be collected as feedback from prior query executions (the query log).

Formulation as Distribution Estimation. Consider the attributes $A_1, \dots, A_n$ of relation $T$ and an indicator function $I_Q(x)$ for a query $Q$, which produces 1 if the tuple $x$ satisfies the predicate of $Q$, and 0 otherwise. The joint data distribution of $T$ is given by $P(x) = P(A_1 = x_1, \dots, A_n = x_n)$, which is a valid distribution. We can then express the selectivity as $sel(Q) = \sum_{x} I_Q(x) \cdot P(x)$. Thus, under this formulation, the key problem of selectivity estimation is obtaining the joint data distribution $P(x)$.
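This formulation can be checked on a toy example. The dictionary-based `joint` and the `selectivity` helper below are illustrative sketches, not part of the paper's implementation:

```python
def selectivity(joint, predicate):
    """sel(Q) = sum over x of I_Q(x) * P(x): the probability mass satisfying Q."""
    return sum(p for x, p in joint.items() if predicate(x))

# A valid joint distribution over two binary attributes (probabilities sum to 1).
joint = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}
sel = selectivity(joint, lambda x: x[0] == 0)  # 0.1 + 0.4 = 0.5
```

The predicate plays the role of the indicator function $I_Q$; summing the masked probabilities is exactly the formulation above.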

4. Proposed Model

We present our proposed unified deep autoregressive estimator, UAE, which is capable of learning from both data and query workload to support cardinality estimation. We first present an overview of UAE (Section 4.1). Then we introduce how a trained autoregressive model is used for cardinality estimation (Section 4.2). We then present our idea of differentiating progressive sampling to enable deep autoregressive models to be trained with query workload (Section 4.3). Next, a hybrid training procedure (Section 4.4) uses data as unsupervised information and queries as supervised information to jointly train UAE. We present the approaches to incorporating incremental data and query workload in Section 4.5. Finally, we discuss several miscellaneous issues (Section 4.6) and make several remarks (Section 4.7).

4.1. Overview

Figure 1. Overview of UAE

Motivations On the one hand, data-driven methods have been claimed to be more general and robust to workload shifts than query-driven methods (Hilprecht et al., 2020; Yang et al., 2020). On the other hand, a query workload with true selectivities provides additional information about the workload (Bruno et al., 2001). It is therefore natural to combine data-driven and query-driven models. As discussed above, the existing proposals leveraging both data and query workload (Kipf et al., 2019; Dutt et al., 2019; Heimel et al., 2015; Kiefer et al., 2017) are insufficient in this direction.

One idea for overcoming data-driven methods' weakness at the tail of the distribution (caused by their averaging optimization target) is to use ensemble methods, with each component targeting a different part of the distribution. However, (1) it is not easy to define a good partition; (2) it is nontrivial to integrate the results of different components, since queries may span multiple of them; for example, (Hilprecht et al., 2020) uses an SPN to combine components and consequently makes independence assumptions; and (3) ensembling is orthogonal to UAE: we could integrate UAE with ensemble methods if good ones were designed.

Challenges In this work, to achieve the four goals listed in the Introduction, we resort to deep autoregressive models, since they have demonstrated superior performance in expressiveness (capturing attribute correlations) and efficiency (Yang et al., 2020; Hasan et al., 2020). This is challenging in two aspects: (1) off-the-shelf deep autoregressive models in the machine learning literature cannot incorporate the query workload as supervised information for training; (2) since a naive combination of data-driven and query-driven models is undesirable, we aim to develop a unified autoregressive model with a single set of parameters that uses both data as unsupervised information and queries as supervised information to learn the joint data distribution.

Overview of High-level Idea Both challenges call for designing new deep autoregressive models. First, deep autoregressive models rely on sampling techniques to answer range queries (Yang et al., 2020). However, such a model cannot be trained with queries, because the sampled categorical variables are not differentiable (explained in detail in Section 4.3). Therefore, to enable the deep autoregressive model to incorporate the query workload as supervised information, we propose a novel idea: we utilize the Gumbel-Softmax trick to make the sampled variables differentiable, so that the deep autoregressive model can learn the joint data distribution directly from queries. In this way, our proposed model can also incorporate incremental query workload, as discussed later.

Second, to fully leverage data as unsupervised information and queries as supervised information in the hybrid training setting, we combine the unsupervised and supervised losses, produced from data and queries respectively, with a trade-off hyper-parameter. This enables us to jointly train the deep autoregressive model to learn the joint data distribution by minimizing the combined loss. Therefore, the deep autoregressive model can learn from both data and queries simultaneously using the same set of model parameters.
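In code, the combined objective is simply a weighted sum. The name `lam` for the trade-off hyper-parameter and the exact additive form below are our illustrative assumptions:

```python
def hybrid_loss(data_nll, query_err, lam=1.0):
    """Single scalar objective over one set of model weights:
    unsupervised negative log-likelihood plus lam times the supervised query error."""
    return data_nll + lam * query_err
```

Setting `lam = 0` recovers a purely data-driven model, while a large `lam` emphasizes the workload.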

Figure 1 shows the workflow of our proposed estimator UAE. We can train UAE with data only, in which case batches of random tuples are fetched from the table for learning the joint data distribution. We can also train UAE with queries as supervised information only, in which case batches of random (query, cardinality) pairs are read from the query workload log. UAE is able to learn the joint data distribution with a single autoregressive model from both data and queries.

4.2. Preliminary: Deep Autoregressive Models for Cardinality Estimation

Autoregressive Decomposition Naively, one could store the point distribution of all tuples in a table for exact selectivity estimation. However, the number of entries grows exponentially with the number of attributes, which is infeasible. Many previous methods approximate the joint distribution via factorization with Bayesian Networks (BN) (Getoor et al., 2001; Tzoumas et al., 2011). However, (1) they still make some conditional independence assumptions, and (2) the expense of learning and inference with a BN is often prohibitive (we empirically found that the estimation time of a BN could be 110-120s on the DMV dataset).

To achieve a better trade-off between the ability to capture attribute correlations and the space budget, while keeping model training and inference tractable and efficient, we utilize the autoregressive decomposition mechanism, which factorizes the joint distribution in an autoregressive manner without any independence assumption:

(1)   $P(x) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})$

After training on the underlying tuples with a neural network architecture, only the model weights of the deep autoregressive model need to be stored to compute the conditional distributions $P(x_i \mid x_1, \dots, x_{i-1})$. Note that we use the left-to-right attribute ordering in this work, which was demonstrated to be effective in previous work (Yang et al., 2020). More strategies for choosing a good ordering can be found in (Yang et al., 2020; Hasan et al., 2020).
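The decomposition is exact, as a toy sketch over empirical frequencies illustrates (the two-attribute table below is hypothetical, and the conditionals are computed by counting rather than by a neural network):

```python
# Chain rule on a toy 2-attribute table: P(x1, x2) = P(x1) * P(x2 | x1).
table = [("a", 1), ("a", 2), ("b", 1), ("a", 1)]
n = len(table)

def p_x1(v1):
    # Marginal frequency of the first attribute.
    return sum(1 for t in table if t[0] == v1) / n

def p_x2_given_x1(v2, v1):
    # Conditional frequency of the second attribute given the first.
    rows = [t for t in table if t[0] == v1]
    return sum(1 for t in rows if t[1] == v2) / len(rows)

def joint_prob(v1, v2):
    return p_x1(v1) * p_x2_given_x1(v2, v1)

# joint_prob("a", 1) = 3/4 * 2/3 = 0.5, the true frequency of ("a", 1) in the table.
```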

Encoding Tuples Deep autoregressive models treat a table as a multi-dimensional discrete distribution. We first scan all attributes to obtain their attribute domains. Next, for each attribute $A_i$, its values are encoded into integers in the range $[0, |Dom(A_i)|)$ in a natural order; for a string attribute, for instance, the distinct strings are mapped to consecutive integers in alphabetical order. This is a bijective transformation without any information loss.

After the integer transformation, for each attribute a specific encoder further encodes these integers into vectors for training the neural networks. The simplest method is one-hot encoding. For an attribute with three distinct values, encoded as $\{0, 1, 2\}$, one-hot encoding represents them as $\{100, 010, 001\}$. However, this naive method is not storage-efficient because the encoded vector is $|Dom(A_i)|$-dimensional. We thus use binary encoding, which encodes the same attribute into $\{00, 01, 10\}$, a $\lceil \log_2 |Dom(A_i)| \rceil$-dimensional vector.
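A minimal sketch of the two encoding steps; the column values and helper names below are hypothetical:

```python
import math

def encode_column(values):
    """Step 1: map each distinct value to an integer ID in natural (sorted) order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def binary_encode(code, domain_size):
    """Step 2: encode an integer ID as a ceil(log2(|domain|))-bit vector."""
    width = max(1, math.ceil(math.log2(domain_size)))
    return [(code >> b) & 1 for b in reversed(range(width))]

city = ["Paris", "London", "Tokyo", "London"]
codes = encode_column(city)                      # {'London': 0, 'Paris': 1, 'Tokyo': 2}
vec = binary_encode(codes["Tokyo"], len(codes))  # [1, 0]: 2 bits instead of a 3-dim one-hot
```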

Model Architectures We use ResMADE (Nash and Durkan, 2019), a multi-layer perceptron with an information-masking technique that masks out the influence of $x_i, \dots, x_n$ on the output for $P(x_i \mid x_1, \dots, x_{i-1})$. Exploring advanced architectures of deep autoregressive models (Papamakarios et al., 2017; Germain et al., 2015b) is orthogonal to our work.

Model Training In a nutshell, the input to a deep autoregressive estimator is a data tuple and its output is the corresponding predicted density. In the training phase, the weights (or parameters) of the estimator are learned from data tuples by minimizing the cross-entropy between the real and estimated data distributions (Germain et al., 2015a):

(2)   $\mathcal{L}_{data}(\theta) = -\frac{1}{|T|} \sum_{x \in T} \log \hat{P}_\theta(x)$

where $\theta$ are the model weights. Normally, gradient updates for neural networks are performed by stochastic gradient descent (SGD) (Bottou, 2010) using backpropagation (Rumelhart and McClelland, 1987; Goodfellow et al., 2016) as the gradient-computing technique. Backpropagation is an efficient method for computing gradients in directed graphs of computations, such as neural networks, using the chain rule (Rudin and others, 1964). It is a simple implementation of the chain rule of derivatives, computing all required partial derivatives in time linear in the graph size, as shown in (1) of Figure 2. An important characteristic of backpropagation is that it requires each node of the computation graph involved in the flow to be deterministic and differentiable, i.e., to have a well-defined gradient.

Answering Range Queries with Sampling After training, deep autoregressive models can directly answer point queries (e.g., $A_1 = v_1$ AND $A_2 = v_2$ for a relation with two attributes), because a deep autoregressive model is essentially a point density estimator. However, it is not easy to use a point estimator to answer range queries. Estimating the selectivity of a range query is equivalent to estimating the sum of selectivities over the set of data points the query contains. Suppose the region of a query $Q$ is $R(Q) = R_1 \times \dots \times R_n$, where $n$ denotes the number of attributes and $R_i \subseteq Dom(A_i)$. A naive approach is exhaustive enumeration:

(3)   $\hat{sel}(Q) = \sum_{x \in R(Q)} \hat{P}_\theta(x)$

where $R(Q)$ represents the set of distinct tuples contained in $Q$ and $\hat{sel}(Q)$ is the estimated selectivity of query $Q$. However, this method is computationally prohibitive because, in the worst case, the number of enumerated tuples grows exponentially with the number of attributes. We thus resort to sampling techniques to efficiently approximate the selectivity, as follows.


  • Uniform Sampling samples $m$ tuples $x^{(1)}, \dots, x^{(m)}$ uniformly at random from the query region $R(Q)$ and computes the estimated selectivity as

    (4)   $\hat{sel}(Q) = \frac{|R(Q)|}{m} \sum_{j=1}^{m} \hat{P}_\theta(x^{(j)})$

    However, uniform sampling can have large variance if the data distribution is skewed.

  • Progressive Sampling is a Monte Carlo integration approach (Yang et al., 2020) that samples each tuple attribute by attribute, concentrating on the regions of high probability. Specifically, to sample a tuple $x$, we sequentially sample its attribute values $x_1, \dots, x_n$ from the distributions $\hat{P}(X_1), \hat{P}(X_2 \mid x_1), \dots, \hat{P}(X_n \mid x_1, \dots, x_{n-1})$, each restricted to the query region and predicted by the deep autoregressive model. Therefore, tuples having higher probabilities in the query region are more likely to be sampled. The selectivity estimate made by a single sampled tuple is $\prod_{i=1}^{n} \hat{P}(X_i \in R_i \mid x_1, \dots, x_{i-1})$, and the estimate from multiple sampled tuples is obtained by averaging the single-tuple estimates. It is easy to verify that progressive sampling estimates are unbiased. This method is more robust to skewed data distributions than uniform sampling, so we adopt progressive sampling in our work.

4.3. Training Deep Autoregressive Models with Queries

Figure 2. Gradient estimation in stochastic computation graphs for different models. (1) The autoregressive model trained with data: all nodes are deterministic, so gradients can easily flow from the loss to the model weights. (2) The original autoregressive model fed with queries: the presence of stochastic, non-differentiable nodes prevents backpropagation, because the categorically sampled variables do not have a well-defined gradient. (3) Our UAE trained with queries: UAE allows gradients to flow from the loss to the model weights by using a continuous variable to approximate each sampled one; the stochastic variable for each attribute is not involved in the gradient flow.

We proceed to present our idea of empowering the autoregressive model with the ability to learn from queries. The existing autoregressive models cannot learn from queries via backpropagation in an end-to-end manner: in principle, gradients cannot flow through sampled discrete random variables, so the process of progressive sampling is not differentiable, and differentiability is a prerequisite of backpropagation as explained earlier. Specifically, given a set of training queries, we define the query loss for autoregressive models as:

(5)

where each query's predicted selectivity is compared with its actual selectivity. There are many possible choices for the per-query error function, e.g., root mean square error (RMSE) and Q-error (Moerkotte et al., 2009):

(6)
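For concreteness, Q-error can be computed as below; the floor of 1 on both cardinalities is a common guard against division by zero (our illustrative convention, not prescribed by the text).

```python
def q_error(actual, estimated, floor=1.0):
    """Q-error of Eq. 6: the multiplicative factor by which the estimate
    deviates from the truth, in whichever direction. Both values are
    floored to avoid division by zero (an illustrative convention)."""
    a, e = max(actual, floor), max(estimated, floor)
    return max(a / e, e / a)
```

A perfect estimate gives a Q-error of 1, and over- and under-estimation by the same factor are penalized equally, which is why the metric is well suited to multiplicative tail analysis.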

Next, let us focus on (2) in Figure 2, which illustrates the gradient flow of a deep autoregressive model with progressive sampling, trained with queries. In each forward pass, the autoregressive model uses progressive sampling to successively draw a one-hot vector for each attribute (in practice, the result of sampling from a categorical distribution in a deep autoregressive model is a one-hot vector representing the sampled value) and uses these vectors to compute the selectivity estimate, from which the query loss is obtained. However, during backpropagation, gradients cannot flow all the way from the query loss to the model weights: the sampled stochastic variables do not have well-defined gradients with respect to the distributions they are drawn from. The case of multiple samples is no different, since only an averaging operation is needed to combine the per-sample estimates, which does not change the non-differentiable nature of progressive sampling. Consequently, the model weights cannot be trained on the query workload with the existing techniques.

Our key insight is that if we can make the process of progressive sampling differentiable, then deep autoregressive models can be trained directly from queries by minimizing, via backpropagation, the discrepancy between the actual selectivities and those estimated through progressive sampling.

The key challenge in differentiating progressive sampling is differentiating the non-differentiable samples drawn from the categorical distributions. To this end, we consider two ideas, score function estimators and the Gumbel-Softmax trick, and analyze which is more suitable for our work.

Score Function Estimators. The score function estimator (SF), also known as REINFORCE (Williams, 1992), derives the gradient of the query loss of autoregressive models with respect to the model weights by:

(7)

where the tuple is sampled from the model's distribution and the loss term inside the expectation is a function of the sampled tuple and the model weights. With SF, we only require the loss and the log-probability to be continuous in the model weights (which holds), without backpropagating through the sampled tuple.

However, SF often suffers from high variance, even when improved with variance-reduction techniques (Gu et al., 2016). Moreover, SF does not scale to categorical distributions, because its variance grows linearly with the dimensionality of the distribution (Rezende et al., 2014). Consequently, a better method is needed.
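As a sanity check on the score-function idea of Eq. 7, the sketch below estimates the gradient of E[f(X)] with respect to the class probabilities of a small categorical distribution; the toy distribution and payoff values are our own, purely for illustration.

```python
import random

def sf_gradient(probs, f, n_samples=100000, seed=1):
    """Score-function (REINFORCE) estimate of d E[f(X)] / d p_i for a
    categorical X: average f(x) * d log p(x) / d p_i, where the score
    d log p(x) / d p_i equals 1[x = i] / p_i.
    """
    rng = random.Random(seed)
    k = len(probs)
    grads = [0.0] * k
    for _ in range(n_samples):
        # Draw x ~ Categorical(probs) by inverse transform.
        r, acc, x = rng.random(), 0.0, k - 1
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                x = i
                break
        # The per-sample term f(x) / p_x is large for rare classes: the
        # source of the high variance discussed above.
        grads[x] += f[x] / probs[x]
    return [g / n_samples for g in grads]

# True gradient of E[f(X)] = sum_i p_i f_i w.r.t. p_i is simply f_i.
grad = sf_gradient([0.5, 0.3, 0.2], [1.0, 2.0, 3.0])
```

The estimator is unbiased, but notice how each sample contributes `f[x] / probs[x]` to only one coordinate; for low-probability classes these contributions are large and sparse, which is exactly the variance problem that motivates the Gumbel-Softmax alternative.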

The Gumbel-Softmax Trick.

1: Input: the temperature; a categorical distribution over k items;
2: Output: a differentiable (relaxed) one-hot vector sampled from the distribution;
3: Draw u_i ~ Uniform(0, 1) for i = 1, ..., k;
4: Compute Gumbel noise g_i = -log(-log(u_i)) for i = 1, ..., k;
5: Sample the relaxed one-hot vector according to Eq. 10;
6: return the sampled differentiable one-hot vector;
Algorithm 1 The Gumbel-Softmax Trick (GS-Sampling)

The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) was originally used to differentiate through discrete latent variables in variational auto-encoders (Kingma and Welling, 2014); it is summarized in Algorithm 1. We leverage this technique as a simple and effective way to draw a differentiable sample for each attribute.

Consider a categorical distribution with k-dimensional class probabilities pi = (pi_1, ..., pi_k). To sample a one-hot vector z from pi, the key idea of the Gumbel-Softmax trick is

(8) z = one_hot( argmax_i [ g_i + log pi_i ] ),

where g_1, ..., g_k are independent and identically distributed samples drawn from a Gumbel(0, 1) distribution, which can be sampled by

(9) g = -log(-log(u)), u ~ Uniform(0, 1).

Since the argmax operator is non-differentiable, we can use the differentiable softmax as a continuous approximation to sample a relaxed one-hot vector y:

(10) y_i = exp((log pi_i + g_i) / tau) / sum_j exp((log pi_j + g_j) / tau), for i = 1, ..., k,

where tau is an adjustable hyper-parameter referred to as the temperature. As the temperature approaches 0, samples from the Gumbel-Softmax distribution become one-hot; the temperature thus trades off gradient variance against the closeness of the approximation to a one-hot vector. Crucially, a vector sampled via the Gumbel-Softmax trick is differentiable, and the trick has been shown to have lower variance and better scalability than SF. Therefore, in this work we use the Gumbel-Softmax trick as the core technique for differentiating progressive sampling. Note that pi can be any categorical distribution, including the conditionals predicted by the autoregressive model.
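A pure-Python sketch of Algorithm 1 follows; in a deep learning framework the same arithmetic would be expressed on tensors so that gradients flow through the softmax.

```python
import math
import random

def gumbel_softmax_sample(probs, temperature=1.0, rng=random):
    """One pass of the Gumbel-Softmax trick: perturb log-probabilities with
    Gumbel(0, 1) noise and push them through a temperature-scaled softmax."""
    # g_i = -log(-log(u_i)), u_i ~ Uniform(0, 1): standard Gumbel noise.
    gumbels = [-math.log(-math.log(rng.random())) for _ in probs]
    logits = [(math.log(p) + g) / temperature for p, g in zip(probs, gumbels)]
    # Numerically stable softmax; the output is a relaxed one-hot vector.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

soft = gumbel_softmax_sample([0.7, 0.2, 0.1], temperature=1.0, rng=random.Random(0))
hard = gumbel_softmax_sample([0.7, 0.2, 0.1], temperature=0.05, rng=random.Random(0))
```

As the temperature shrinks, the output approaches a one-hot vector; PyTorch exposes the same computation as `torch.nn.functional.gumbel_softmax` for use inside differentiable graphs.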

Based on the procedure of the Gumbel-Softmax trick, we now describe in detail how to use it to differentiate progressive sampling. As shown in (3) of Figure 2, the key idea of differentiable progressive sampling is to use deterministic, continuous variables to approximate the stochastic discrete variables so that gradients can flow completely from the query loss to the model weights. Specifically, for each attribute, in a forward pass, a stochastic Gumbel noise vector is first generated from Eq. 9. Next, we sample a relaxed one-hot vector from the softmax according to Eq. 10, with the categorical distribution set to the conditional predicted by the model, and use the sampled vector to continue the forward pass. In doing so, gradients from the query loss to the model weights can be computed completely, because the stochastic nodes are excluded from the gradient flow.

1: Input:
2: Temperature; number of samples;  // used in GS-Sampling
3: Query region; the autoregressive model's density estimates;
4: Output: Estimated selectivity;
5: Initialize the overall density estimate to zero;
6: for each of the samples do
7:     Initialize this sample's density estimate to one;
8:     for each attribute, in autoregressive order, do
9:         Forward pass the model and obtain the conditional distribution of the current attribute;
10:        Multiply this sample's density estimate by the probability mass inside the query region;
11:        Zero out the probabilities outside the query region;
12:        Normalize the masked distribution;
13:        Sample a differentiable one-hot vector via the Gumbel-Softmax trick: GS-Sampling(masked distribution, temperature);  // Alg. 1
14:     end for
15:     Add this sample's density estimate to the overall estimate;
16: end for
17: Average the density estimates of the samples;
18: return the estimated selectivity
Algorithm 2 Differentiable Progressive Sampling (DPS)

We present the flow of differentiable progressive sampling (DPS) in Algorithm 2. Note that in practice we can perform DPS for all samples in a batch. Also note that in the zero-out step (line 7 of Algorithm 2), we can simply mask out the out-of-region probabilities by setting the corresponding logits to negative infinity (-inf); this does not change the categorical nature of the distribution and does not affect GS-Sampling.
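The -inf masking mentioned above can be sketched as follows; the logit values and region flags are illustrative.

```python
import math

NEG_INF = float("-inf")

def mask_and_normalize(logits, in_region):
    """Mask out values outside the query region by setting their logits to
    -inf before the softmax, as in the zero-out step of DPS. At least one
    value must lie inside the region."""
    masked = [l if keep else NEG_INF for l, keep in zip(logits, in_region)]
    m = max(masked)
    exps = [0.0 if l == NEG_INF else math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = mask_and_normalize([1.0, 2.0, 0.5], [True, False, True])
```

Setting logits to -inf rather than zeroing probabilities after the softmax keeps the renormalization exact and numerically stable.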

4.4. Hybrid Training

1: Input:
2: Temperature; number of samples;  // used in DPS
3: The trade-off hyper-parameter; the underlying data; the query workload;
4: Output: Trained model weights;
5: Randomly initialize the model weights;
6: while JointTraining() do
7:     Draw a batch of tuples from the underlying data;
8:     Compute the unsupervised data loss on the batch (Eq. 2);
9:     Draw a batch of queries from the workload;
10:    Compute the supervised query loss on the batch via DPS (Eq. 5);
11:    Perform SGD via backpropagation for the combined loss (Eq. 11);
12: end while
13: return the trained model weights
Algorithm 3 Hybrid Training of UAE
Algorithm 3 Hybrid Training of UAE

Now, UAE is able to learn from either data or queries. To achieve our ultimate goal, which is to bring both data (as unsupervised information) and queries (as supervised information) into the training of UAE, we propose a hybrid training method, which trains the UAE model by minimizing an overall loss function that combines the data loss (Eq. 2) and the query loss (Eq. 5) via a trade-off hyper-parameter:

(11)

The adjustable hyper-parameter rescales the relative contributions of the two losses. In doing so, UAE captures both the data and the query-workload information simultaneously while learning the joint data distribution. We summarize the workflow of hybrid training in Algorithm 3.
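The effect of the trade-off hyper-parameter can be illustrated with a scalar stand-in for the two losses: gradient descent on (w - 1)^2 + lam * (w - 3)^2, where the two quadratics play the roles of the data and query losses. The functions are ours, purely for illustration of the weighting in Eq. 11.

```python
def hybrid_optimum(lam, steps=2000, lr=0.01):
    """Gradient descent on a toy hybrid objective (w - 1)^2 + lam * (w - 3)^2,
    a scalar stand-in for Eq. 11: the first term plays the data loss with
    optimum w = 1, the second the query loss with optimum w = 3."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 1.0) + lam * 2.0 * (w - 3.0)
        w -= lr * grad
    return w

# lam = 0 recovers purely data-driven training; growing lam pulls the
# solution toward the query-driven optimum.
```

The converged solution (1 + 3 * lam) / (1 + lam) interpolates between the two optima, mirroring how the hyper-parameter balances fidelity to the overall data distribution against accuracy on workload-relevant regions.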

4.5. Incorporating Incremental Data and Query Workload

We now describe UAE's advantages in efficiently and effectively ingesting incremental data or query workloads.

Incremental Data denotes the tuples newly added to the database after the model is trained. UAE can perform incremental training on the incremental data by minimizing the unsupervised loss produced from the new data.

Incremental Query Workload is a set of queries drawn from a distribution different from that of the training workload (i.e., they focus on different data regions). For example, on the IMDB dataset, one workload might focus on the region where title.production_year > 1975 while another focuses on title.production_year < 1954. To adapt to a new query workload after training, UAE only needs to minimize the supervised loss on the new queries. In our experiments, we find that a small number of training epochs (10-20) is enough to ingest the new workload while preventing UAE from catastrophic forgetting.

4.6. Miscellaneous Issues

Supporting Join Queries. A natural idea of supporting multi-table joins for UAE is to train UAE on join results offered by join samplers (Leis et al., 2017; Huang et al., 2019). We follow the idea (Yang et al., 2021; Hilprecht et al., 2020) to handle join queries, which adds virtual indicator and fanout columns into the model architecture of UAE. Then we train UAE on tuples sampled by the Exact Weight algorithm (Zhao et al., 2018) and queries with fanout scaling. Interested readers may refer to (Yang et al., 2021; Hilprecht et al., 2020) for details.

Handling Columns with Large NDVs. A problem with UAE's autoregressive architecture is that when the number of distinct values (NDVs) in a column is very large, storing the model parameters consumes substantial space. Hence, for columns with large NDVs, we leverage 1) an embedding method (which embeds one-hot column vectors using learnable embedding matrices) for tuple encoding and decoding; and 2) column factorization, which slices a value into groups of bits and then transforms each group into a base-10 integer (Yang et al., 2021).
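Column factorization can be sketched as follows; the 8-bit chunk size and two chunks are illustrative choices, not the paper's exact configuration.

```python
def factorize(value, bits_per_chunk=8, n_chunks=2):
    """Slice an integer-coded column value into groups of bits, most
    significant group first, and return each group as a base-10 integer."""
    mask = (1 << bits_per_chunk) - 1
    return [
        (value >> (bits_per_chunk * (n_chunks - 1 - i))) & mask
        for i in range(n_chunks)
    ]

def defactorize(chunks, bits_per_chunk=8):
    """Reassemble the original value from its bit-group chunks."""
    v = 0
    for c in chunks:
        v = (v << bits_per_chunk) | c
    return v
```

Each sub-column now has at most 2^bits_per_chunk distinct values, so the autoregressive model's output layers (and any embedding matrices) shrink accordingly, at the cost of more autoregressive steps.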

Handling Unqueried Columns (Wildcards). We use wildcard skipping (Liang et al., 2020; Yang et al., 2020), which randomly masks attributes of tuples and replaces them with special tokens as inputs to UAE's data-driven part during training. This improves the efficiency of both training and query inference, because for omitted columns we can skip DPS during training and skip progressive sampling during inference.

4.7. Remarks

We refer to UAE trained with data only and with queries only as UAE-D and UAE-Q, respectively. We make several remarks as follows:


  • UAE can accurately capture complex attribute correlations without independence or uniformity assumptions because of its deep autoregressive model architecture;

  • By learning from both data and query workload, UAE is further forced to produce more accurate estimates in the data regions accessed by the workload. Meanwhile, UAE can maintain the knowledge of overall data distribution;

  • In fact, UAE-D is equivalent to Naru (Yang et al., 2020). We thus claim that UAE generalizes Naru;

  • We opt for Q-error as the per-query loss function for UAE-Q because it is consistent with our evaluation metric.

  • A distinct feature of UAE-Q is that, unlike other supervised methods (Wu et al., 2018; Ortiz et al., 2019; Hasan et al., 2020; Sun and Li, 2019) and sampling-enhanced ML models (Kipf et al., 2019) for cardinality estimation, all of which are discriminative, UAE-Q is a generative model built on a deep autoregressive architecture. To the best of our knowledge, UAE-Q is the first supervised deep generative model for cardinality estimation.

  • When used to estimate the cardinality of a query, UAE relies only on its model weights, without scanning the data. The estimation process is thus convenient and efficient, especially when accelerated by modern GPU architectures.

  • By switching between UAE-D and UAE-Q, UAE can learn from new data or new query workloads incrementally, without retraining. To the best of our knowledge, no other single deep learning-based model for cardinality estimation achieves both of these incremental-learning goals, which follow naturally from UAE's construction.

5. Experimental Results

We conduct comprehensive experiments to answer the following research questions.


  • RQ1:  Compared to state-of-the-art cardinality estimation models, how does UAE perform in accuracy (Section 5.2)?

  • RQ2:  How do different hyper-parameters (e.g., the temperature and the trade-off parameter) affect the results of UAE (Section 5.3)?

  • RQ3:  How well can UAE incrementally incorporate new data and query workload (Section 5.4)?

  • RQ4:  How long does it take to train UAE, and how efficiently does it produce a cardinality estimate (Section 5.5)?

  • RQ5:  How does UAE impact a query optimizer (Section 5.6)?

5.1. Experimental Settings

5.1.1. Datasets

We use three real-world datasets with different characteristics for single-table experiments as follows:


  • DMV (Zanettin, 2019). This dataset consists of vehicle registration information in New York. We follow the preprocessing strategy of previous work (Yang et al., 2020), obtaining 11.6M tuples and 11 columns after preprocessing. The 11 columns have widely different data types, with domain sizes ranging from 2 to 2101. We also use DMV-large, which has 16 columns and includes columns with very large NDVs (e.g., the 100% unique VIN column and the 31K-value CITY column); it is used to evaluate the sensitivity of the compared methods to very large NDVs. We find that its results provide similar insights to those on DMV, and thus omit them due to the space limit.

  • Census (72). This dataset was extracted from the 1994 Census database and contains personal income information. It has 48K tuples and 14 columns, a mix of categorical and numerical columns with domain sizes ranging from 2 to 123.

  • Kddcup98 (72). This dataset was used in the KDD Cup 98 Challenge. We use 100 columns for the experiments and use this dataset to evaluate the sensitivity of the various methods to very high-dimensional data. It contains 95K tuples with domain sizes ranging from 2 to 43.

We use two statistics from probability theory to measure the skewness and correlation of the datasets: the Fisher–Pearson standardized moment coefficient (Doane and Seward, 2011) for skewness and Nonlinear Correlation Information Entropy (NCIE) (Wang et al., 2005) for correlation. Smaller values of either measure indicate weaker skewness or correlation. The skewness measures are 4.9, 2.1, and 4.7, and the correlation measures are 0.23, 0.15, and 0.32 for DMV, Census, and Kddcup98, respectively. Therefore, DMV and Kddcup98 exhibit relatively strong skewness and attribute correlation, while Census is weaker on both. In addition, we use the real-world IMDB dataset for experiments on join queries; IMDB has been reported to have strong attribute correlations (Leis et al., 2015).
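For reference, the Fisher–Pearson standardized moment coefficient used above is the third standardized moment of a sample, computable as:

```python
def fisher_pearson_skewness(xs):
    """Third standardized moment m3 / m2^(3/2) of a sample (the
    Fisher-Pearson coefficient, without the small-sample adjustment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5
```

Symmetric data scores zero, a long right tail scores positive, and a long left tail scores negative, which is why larger magnitudes indicate stronger skew.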

5.1.2. Query Workload


Training Queries. We follow previous work (Bruno et al., 2001; Kipf et al., 2019) to generate query workloads, as no real query log is available for the datasets we use. Specifically, we first choose an attribute with a relatively large domain size as the bounded attribute. The bounded attribute is specified by a distribution for the centers and a target measure (Bruno et al., 2001), which is based on real usage scenarios. The distribution center is chosen uniformly within a specific range, and the target measure is a target volume of 1% of the distinct values. We have also varied the selection method for the centers (e.g., following the data distribution) and the target measure (e.g., a target selectivity), and the experimental results turned out to be qualitatively similar; we thus omit them due to the page limit. Next, for the other attributes (i.e., random attributes), we follow the method of (Kipf et al., 2019; Yang et al., 2020) to generate queries. We draw the number of filters at random, then uniformly sample the columns and the corresponding filter operators. Finally, the filter literals are set from the values of a randomly sampled tuple. We generate 20K training queries for each dataset. For the join experiments, we use one template (a join-table subset) out of the 18 templates in JOB-light, a 70-query benchmark used by a number of previous works, to generate 10K training queries. This template includes 3 tables: title, movie_companies, and movie_info. We set title.production_year as the bounded attribute and then choose 2-5 filters on other content columns as discussed above. This generation procedure follows (Yang et al., 2021), which produces more diversified queries than JOB-light using the same join template. We term this benchmark JOB-light-ranges-focused.

Test Queries. Apart from performance on the training workload (i.e., in-workload queries), we also evaluate whether the estimators are robust to out-of-workload queries. We therefore generate two kinds of test queries: (1) In-workload queries: 2K test queries generated using the same procedure as the training workload; for the join experiments we generate 1K test queries from JOB-light-ranges-focused. (2) Random queries: 2K test queries without bounded attributes, i.e., all attribute filters are generated randomly, to evaluate the robustness of the models to workload shifts; for the join experiments we use JOB-light, as it contains no focused information.

Workload Characteristics. Figure 3 plots the selectivity distributions of the 2K in-workload and random queries on all datasets. We observe that: 1) the selectivities of the generated workloads are widely spread; 2) random queries have much wider selectivity spectra than in-workload queries, because in-workload queries have an additional bounded column. Note that although the training and test in-workload queries share the same generation procedure, we manually ensure that each training query differs from every test query.

5.1.3. Performance Metric

Following previous work (Kipf et al., 2019; Yang et al., 2020; Hasan et al., 2020), we evaluate all models with Q-error, the popular metric for cardinality estimation defined in Eq. 6.

5.1.4. Baseline Methods

We compare UAE111The source code is available at https://github.com/pagegitss/UAE. to 9 cardinality estimation methods, including state-of-the-art and the most recent methods.

Query-driven Models:


  • MSCN-base (Kipf et al., 2019). This query-driven deep learning (DL) method uses a multi-set convolutional neural network, originally designed for answering correlated joins. For each predicate, it featurizes the attribute and operator as one-hot vectors and normalizes the value, then concatenates the averaged representations over the predicate set as the query encoding. We use two layers (256 hidden units) of multilayer perceptrons on the query encoding, the default setting of (Kipf et al., 2019), and apply the code from (48). Note that the original MSCN was proposed to handle join queries; we adapt it to single-table queries by dropping the join module. We also evaluated another query-driven DL method, Sup (Hasan et al., 2020), and found that it performs similarly to MSCN.

  • LR (Kutner et al., 2005). This method represents a query as the concatenation of the domain ranges of its predicates (following (Dutt et al., 2019)) and trains a linear regression model on this representation. We use it as a non-DL query-driven counterpart to assess the effectiveness of DL-based query-driven methods (MSCN-base).

Data-driven Models:


  • Sampling. This method uniformly samples a portion of tuples from the dataset and scans them to estimate query cardinalities.

  • BayesNet (Chow and Liu, 1968). We follow the same setting as (Yang et al., 2020) for this method for a fair comparison.

  • KDE (Gunopulos et al., 2005). This method leverages a kernel density estimator to estimate the data distribution. Gaussian kernels are adopted in the experiments, and the bandwidth is computed from the data by Scott's rule (Scott, 2015).

  • DeepDB (Hilprecht et al., 2020). This method models the joint data distribution with relational sum-product networks, based on the structure of Sum-Product Networks (SPNs) (Poon and Domingos, 2011). The number of samples per SPN for learning the ensemble is set to 1M, and we use its open-sourced code (10). DeepDB is a deep but non-neural model, serving as a proxy to compare neural deep models (Naru, UAE) against non-neural deep models.

  • Naru (Yang et al., 2020). Naru is equivalent to UAE-D. We extend the open-sourced code from (50), because the original code does not support two-sided queries. We also compare with MADE (Hasan et al., 2020), which also uses deep autoregressive models; its performance is close to Naru's (Yang et al., 2020). For join queries, we compare UAE with NeuroCard (Yang et al., 2021), a concurrent work that extends Naru to joins.

Hybrid Models:


  • MSCN+sampling (Kipf et al., 2019). This method uses estimates on materialized sampled tuples as additional inputs to MSCN-base. We use this method to demonstrate the advantages of leveraging both data and workload information.

  • Feedback-KDE (Heimel et al., 2015). This method further utilizes query feedback to adjust the bandwidth in KDE (Gunopulos et al., 2005). We apply the authors' code (14) and modify it to run with more than 10 columns. The SquaredQ loss function and the Batch variant are adopted for bandwidth optimization.

We also compared with STHoles (Bruno et al., 2001), Postgres (57), and MHIST (Poosala et al., 1996); their performance is worse than that of the 9 methods above, so we do not report it here. The numbers of sampled tuples in the two KDE-based methods (KDE and Feedback-KDE) and the two sampling-based methods (Sampling and MSCN+sampling) are set to match the memory budget of our model for a fair comparison. For the Sampling method, the sample ratios are 0.2%, 9%, and 4.6% for DMV, Census, and Kddcup98, respectively. We also tested two other sampling budgets, one smaller and one larger; the results do not change the conclusions of this paper, so we omit them. Additionally, for a fair comparison, the two autoregressive models (UAE and Naru) share the same architecture of 2 hidden layers. We also turn on column factorization on IMDB due to its high-cardinality columns. The resulting space consumption of the autoregressive model is 2.0MB, 0.5MB, 3.45MB, and 4.1MB on DMV, Census, Kddcup98, and IMDB, respectively. Both deep autoregressive methods, Naru and UAE, rely on progressive sampling to answer range queries. For a fair comparison, the number of estimation samples is set to 200 on DMV in-workload queries and Census, and to 1K on DMV random queries, Kddcup98, and IMDB, for both methods; we empirically found that this setting strikes a good balance between estimation accuracy and overhead (further increasing the sample count does not significantly improve accuracy but does increase the estimation cost). In UAE, the temperature and the number of samples in DPS are set to 1 and 200, respectively, on all datasets. The trade-off parameter is set to … on the three single-table datasets and to 10 on the IMDB dataset.

All the experiments were run on a machine with a Tesla V100 GPU and a 20 cores E5-2698 v4 @ 2.20GHz CPU.

5.2. Performance Comparison

Model Size In-workload Queries Random Queries Mean Median 95th MAX Mean Median 95th MAX LR 17KB 85.51 3.683 21.31 MSCN-base 0.5MB 5.474 2.219 9.641 2580 2537 235.3 UAE-Q 2MB 2.78 3.14 3.72 108.0 317 1.68 68.35 Sampling 2MB 24.66 1.104 65.25 2179 21.96 1.036 56.1 2086 BayesNet 1.9MB 1.233 1.044 1.325 174.0 3.106 1.041 9.364 480.0 KDE 2.1MB 22.48 1.299 4.302 70.16 1.190 46.06 DeepDB 1.3MB 1.454 1.120 1.519 293.5 836.3 1.058 5347 Naru 2MB 1.113 1.029 1.160 108.0 1.062 1.021 1.213 7.0 MSCN+sampling 2MB 1.943 1.068 4.942 122.6 25.31 6.037 109.3 760.5 Feedback-KDE 2.1MB 27.63 1.288 4.20 110.3 1.184 54.7 2041 (Ours) 2MB 1.058 1.027 1.143 5.0 1.062 1.024 1.20 6.0
Table 2. Estimation Errors on DMV
Model Size In-workload Queries Random Queries Mean Median 95th MAX Mean Median 95th MAX LR 14KB 6.876 2.963 20.154 467.0 3949 200.2 MSCN-base 0.3MB 5.90 2.563 17.72 77.14 318.7 35.44 1909 4992 UAE-Q 0.5MB 2.41 1.50 7.0 36.0 17.4 1.98 34 2694 Sampling 0.5MB 2.73 1.333 12.0 49.0 2.08 1.353 6.0 41.0 BayesNet 0.6MB 1.734 1.239 3.786 70.0 1.822 1.240 4.0 98.0 KDE 0.5MB 4.249 1.865 14.38 215.0 4.374 1.996 14.13 177.6 DeepDB 0.5MB 1.876 1.333 4.586 15.66 1.660 1.216 3.776 21.0 Naru 0.5MB 1.775 1.258 4.0 14.5 1.925 1.324 4.565 35.0 MSCN+sampling 0.5MB 2.495 1.729 5.818 71.2 11.45 3.11 52.33 338.0 Feedback-KDE 0.5MB 4.214 1.861 14.0 215.0 4.335 1.962 14.02 177.6 (Ours) 0.5MB 1.462 1.196 3.0 9.0 1.333 1.138 2.25 7.0
Table 3. Estimation Errors on Census
Model Size In-workload Queries Random Queries Mean Median 95th MAX Mean Median 95th MAX LR 17KB 3.65 2.97 14.3 314 674 406 3461 MSCN-base 0.5MB 1.990 1.412 4.00 139 538 219 2132 7231 UAE-Q 3.4MB 1.528 1.174 3.0 56.0 2.50 1.378 5.0 690 Sampling 3.4MB 3.51 1.19 17.0 99.0 2.71 1.09 10.05 106 BayesNet 4.4MB 1.632 1.385 3.754 56.4 1.97 1.203 4.10 690 KDE 3.3MB 1.519 1.161 3.000 28.67 2.145 1.166 4.000 690 DeepDB 3.2MB 1.385 1.10 2.032 18.29 1.281 1.112 2.0 14.18 Naru 3.4MB 1.594 1.258 3.0 23.429 2.156 1.286 3.808 690 MSCN+sampling 3.4MB 1.717 1.257 3.882 26.0 68.43 8.61 299 9045 Feedback-KDE 3.3MB 1.528 1.166 3.037 28.67 2.142 1.167 4.0 690 (Ours) 3.4MB 1.305 1.138 2.0 13.0 1.562 1.157 2.167 345
Table 4. Estimation Errors on Kddcup98
Model Size JOB-light-ranges-focused JOB-light Median 95th MAX Median 95th MAX DeepDB 4.0MB 2.96 22.29 2435 1.26 376 MSCN+sampling 4.0MB 1.33 11.34 32 NeuroCard 4.1MB 3.58 28.07 234 1.39 6.53 31.1 (Ours) 4.1MB 1.08 9.0 63 1.56 6.29 33.0
Table 5. Estimation Errors on IMDB (Join Queries)

Tables 2-5 show the experimental results of all models on both in-workload queries and random (out-of-workload) queries. The results show that UAE matches or significantly exceeds the best estimator across the board, not only in mean and median error but also in max error, which demonstrates UAE's robustness in handling difficult tail queries. From these results, we draw several major findings.

(1) UAE-Q outperforms other supervised methods in most cases, but all supervised methods are vulnerable to workload shifts. We observe that the proposed UAE-Q outperforms LR and MSCN-base in most cases. We also observe that for all supervised methods, accuracy on random queries is much worse than on in-workload queries. This indicates that supervised models may learn the data distribution in the regions the training workload focuses on, but are vulnerable to workload shifts. For example, on DMV, when moving from in-workload to random queries, MSCN-base degrades dramatically in both mean and max error.

(2) Unsupervised methods are more robust to workload shifts but still produce large errors at the tail. The performance gap between in-workload and random queries is much smaller for unsupervised methods than for supervised ones. Nevertheless, unsupervised models still exhibit large worst-case errors, likely because their optimization target minimizes the average error. For instance, on the DMV dataset, Naru produces a max error of 108.

(3) DL-based methods outperform non-DL methods, especially at the tail. Among supervised methods, the deep learning (DL) model MSCN-base performs significantly better than the traditional machine learning method LR. Among unsupervised methods, the deep models Naru and DeepDB usually perform better than non-DL methods (e.g., BayesNet), especially in mean and max error. These results suggest that DL models can better capture complex data correlations than non-DL models.

(4) KDE-based methods suffer from large domain sizes. The KDE-based methods (KDE and Feedback-KDE) perform poorly on datasets with large domain sizes (e.g., DMV). Moreover, we find that Feedback-KDE does not significantly improve over KDE, likely because KDE-based methods inherently struggle on these datasets or because the bandwidths computed by Feedback-KDE are suboptimal for them.

(5) DeepDB performs relatively well on datasets with weak attribute correlation but degrades substantially on datasets with strong correlation. On the dataset with weaker attribute correlation (i.e., Census), DeepDB offers accurate estimates across all error quantiles. However, on DMV, which has strong attribute correlations, DeepDB's performance drops quickly, especially at the tail, since the independence assumptions in its sum-product networks do not hold for this dataset.

(6) Deep autoregressive models suffer on high-dimensional data, while SPNs may suffer from high NDVs. The comparison between the deep autoregressive methods (Naru, UAE) and the SPN-based method (DeepDB) yields interesting conclusions. On the two datasets with relatively large domain sizes (2K for DMV, 100% unique for DMV-large), DeepDB can suffer at the tail (e.g., its max error on DMV random queries), while the deep autoregressive methods achieve much more stable accuracy. This is likely because the histograms in DeepDB's leaf nodes cannot accurately model the distributions of attributes with high NDVs, whereas deep autoregressive models capture them well by modeling the probability of each distinct value at the output layer. Conversely, on Kddcup98 with 100 attributes, deep autoregressive models can degrade at the tail (e.g., Naru's max error on random queries is 690; although UAE improves on Naru, its max error is still 345), while DeepDB achieves a very low max error. This is likely because a higher-dimensional dataset contains more independent attributes; in this case, the autoregressive decomposition may introduce noise into model learning, since it "forces" the model to learn correlations between independent attributes. DeepDB does not have this problem, as its SPNs successfully separate the independent attributes into different groups on this dataset. This result suggests a promising direction for future work: combining the strengths of deep autoregressive models and SPNs for high-dimensional datasets with high NDVs.

(7) Additional data information boosts supervised methods by a large margin. Compared to MSCN-base, MSCN+sampling achieves much better performance on all datasets, and the improvement is more pronounced on random queries. This demonstrates that including sampling-based estimates as extra features alongside query features improves the accuracy of the neural networks in MSCN. We also note that integrating queries as supervised information into KDE does not help on the three datasets.

(8) UAE outperforms both of its two component modules. For example, on DMV both UAE-D and UAE-Q have a max error of 108, whereas UAE achieves a max error of 5.0, greatly improving tail behaviour. This demonstrates the effectiveness of UAE's unified modeling and training for cardinality estimation.

(9) UAE achieves the best performance on in-workload queries while remaining robust on random queries. UAE achieves the best overall results (in mean and max error) on in-workload queries on all single-table datasets; for instance, on DMV, it clearly outperforms the second-best method, Naru, at the tail. UAE also produces the lowest median errors on in-workload queries on most datasets. Additionally, UAE achieves the best or comparable overall performance on join queries. As shown in Table 5, on JOB-light-ranges-focused, UAE produces the lowest median error and beats the two newest data-driven models (DeepDB, NeuroCard) by a large margin across all error quantiles. Although MSCN+sampling outperforms UAE at the tail on this workload, its performance drops on random queries (JOB-light), while UAE's does not. Surprisingly, UAE even achieves the best overall performance on random queries on DMV and Census, e.g., a max error of 7.0 vs. 21.0 on Census compared to the second-best method, DeepDB. This is likely because the supervised component forces UAE to produce accurate estimates on some tricky data regions (e.g., long-tail data) that no other model handles successfully. In addition, UAE's performance on random queries on the other datasets matches or outperforms that of Naru (or NeuroCard). These results demonstrate that UAE can effectively leverage the query workload as supervised information to enhance the unsupervised autoregressive model, while preserving its knowledge of the overall data distribution.

5.3. Hyper-parameter Studies

We report in-depth experimental studies to explore the best hyper-parameter settings of UAE. The results on all the datasets are qualitatively similar, so we report the results on DMV only.

Impact of Temperature. The temperature parameter is used in the Gumbel-Softmax algorithm, which operates in the supervised part of UAE (i.e., UAE-Q). We have to isolate the influence of UAE's unsupervised part (i.e., UAE-D) when training with UAE-Q. To this end, we first train UAE with UAE-D only to obtain overall data knowledge; the model is then refined by UAE-Q under various temperature settings. Specifically, 10K training queries are generated following the procedure described in Section 5.1.2, and evaluation is conducted on 2K in-workload queries, since we are interested in the effect of the temperature on the performance of UAE's supervised part. As discussed in (Jang et al., 2017), a fixed temperature between 0.5 and 1.0 yields good results empirically, so we evaluate the candidate values {0.5, 0.75, 1.0, 1.25}. In addition, we use 2K samples for estimation because we are interested in the limits of UAE as influenced by the temperature; lower numbers of estimation samples show the same trend. We empirically find that a temperature of 1.0 gives UAE the lowest estimation errors.
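
As a reference for the role of the temperature, the following NumPy sketch (our own illustration, not the paper's implementation) draws a relaxed categorical sample via the Gumbel-Softmax trick of Jang et al. (2017). Lower temperatures push the sample toward a discrete one-hot vector; higher temperatures smooth it out, trading bias for lower gradient variance.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable (relaxed) one-hot sample from the categorical
    distribution defined by `logits`, via the Gumbel-Softmax trick."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())          # numerically stable softmax
    return e / e.sum()

logits = np.log(np.array([0.7, 0.2, 0.1]))
warm = gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng(0))
cold = gumbel_softmax(logits, tau=0.1, rng=np.random.default_rng(0))
```

With identical noise, the low-temperature sample is strictly more peaked than the high-temperature one, which is the behaviour the temperature study above is probing.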

Figure 4. The effect of UAE's hyper-parameters on DMV: (a) impact of the number of training samples in DPS; (b) impact of the trade-off parameter.

Impact of the number of training samples in DPS. Like the temperature, this parameter belongs to UAE-Q, so we use the same experimental setting as above. We evaluate the values {50, 100, 200, 400}, as values beyond 400 significantly increase the training overhead (i.e., consume more GPU memory). From Figure 4 (a), we observe that if the number of estimation samples is set to 200, the best setting is 200 as well; if the number of estimation samples is increased to 2K, 400 becomes the best setting. In most cases UAE can offer very accurate estimates on in-workload queries with 200 samples. Considering both the training and estimation overheads, we therefore suggest 200 training samples on these datasets.

Impact of the Trade-off Parameter. The trade-off parameter rescales the losses produced by the two parts of UAE, UAE-D and UAE-Q. We use the same query workload as in Section 5.1.3 and evaluate five candidate settings of the parameter. Figure 4 (b) shows the performance of UAE on both in-workload and random queries as the trade-off parameter is varied, from which we identify its best setting. Moreover, beyond that setting the performance drops quickly on both kinds of queries, indicating that putting too much emphasis on UAE-Q negatively affects model training and should be avoided.
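
Conceptually, the trade-off parameter weighs UAE-Q's supervised term against UAE-D's unsupervised term in a single training objective. A minimal sketch of this combination (our own simplification; the paper's actual per-term losses may differ, e.g., in how per-query errors are measured):

```python
def unified_loss(data_nll, query_losses, lam):
    """Combine UAE-D's data loss (average negative log-likelihood over tuples)
    with UAE-Q's supervised loss (average per-query loss), rescaled by lam.
    A larger lam puts more emphasis on fitting the query workload."""
    supervised = sum(query_losses) / len(query_losses)
    return data_nll + lam * supervised

total = unified_loss(data_nll=2.3, query_losses=[0.5, 1.5], lam=0.01)
```

The quick degradation observed for large settings is consistent with this form: as lam grows, the data term is drowned out and the model overfits the workload's data regions.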

5.4. Incremental Data and Query Workload

This experiment studies the incremental learning ability of UAE. Since the ability of autoregressive models to incorporate incremental data has been demonstrated in previous work (Yang et al., 2020; Hasan et al., 2020), UAE, being based on an autoregressive model, can inherently handle incremental data, so we do not repeat that experiment in this paper. Beyond the previous work, we aim to show that UAE is also able to incorporate an incremental query workload, which previous work (Yang et al., 2020; Hasan et al., 2020) cannot. To this end, using the same procedure as in Section 5.1.3, we generate 5 parts of query workload with different query centers for the bounded column, i.e., each part of the workload focuses on a different data region. Each part consists of 4K training queries and 200 in-workload test queries, since our goal is to demonstrate UAE's effectiveness at incorporating new query workload. After training with the underlying data, we ingest each partition of the query workload in order, following the experimental setting for incremental data in (Yang et al., 2020). Evaluations are conducted after each ingestion using the test queries in that partition.

Ingested Partitions   1      2      3      4      5
Naru: mean            1.035  1.047  1.152  1.197  2.903
UAE: mean             1.031  1.039  1.095  1.132  1.073
Table 6. Effectiveness of incorporating incremental query workload: stale Naru vs. refined UAE.

We compare the refined UAE to the model trained only with the underlying data (i.e., Naru), which cannot further ingest incremental query workload, on DMV. Table 6 shows the mean errors of both methods, estimated with 200 samples. From the table we observe: (1) due to its inability to leverage the query workload, Naru's performance is not stable across queries of the various workloads; (2) UAE offers consistently accurate estimates after being refined by each query workload, which demonstrates its ability to effectively ingest an incremental query workload.

5.5. Training Time and Estimation Efficiency

An epoch of UAE training takes about 363 seconds, 62 seconds, and 657 seconds on DMV, Census, and Kddcup98, respectively. Figure LABEL:fig.timings (1) reports how the max error (estimated with 200 samples) evolves as training progresses on the Census in-workload queries. We observe that about 13 epochs suffice for UAE to reach a single-digit max error of 9.0.

On all datasets, UAE can produce estimates in around 10ms on a V100 GPU. Figure LABEL:fig.timings (2) shows the estimation latencies of different estimators on DMV. As shown in the figure, UAE produces estimates with reasonable efficiency, much faster than the sampling-based methods (MSCN+sampling, Sampling).

5.6. Impact on Query Optimization

We proceed to evaluate the impact of UAE on query optimization, compared to PostgreSQL and NeuroCard. Following the procedure of (Cai et al., 2019), we modify the source code of PostgreSQL so that it accepts external cardinality estimates. Then, for each query we collect the cardinality estimates of its subqueries returned by the different estimators and inject them into the modified PostgreSQL. We use the JOB-M (Leis et al., 2015) benchmark as the testbed for this case because it has a more complex join schema, which is more challenging for query optimization. We generate 50 test queries using a template of JOB-M (including 6 tables and multi-way joins), following the generation procedure of JOB-light-ranges-focused. For training UAE, we use the same template to randomly generate 10K subqueries (including 25 tables). Figure LABEL:fig.qo shows the impact of the cardinality estimates from NeuroCard and UAE on query performance compared to PostgreSQL.

We have two major findings. First, more accurate cardinality estimates from deep autoregressive model-based estimators can translate into better query plans in PostgreSQL's query optimization. Second, for in-workload queries, UAE yields equivalent or better query plans, improving the quality of query optimization without any significant slowdown compared with PostgreSQL and NeuroCard.

6. CONCLUSIONS AND FUTURE WORK

We propose a novel unified deep autoregressive model that is able to utilize both data as unsupervised information and query workload as supervised information for cardinality estimation. Experiments demonstrate that UAE achieves the four goals set out in Section 1.

We see this work as a first step toward a unified deep learning approach that trains a single model exploiting both data information and workload information for cardinality estimation. We expect our unified model to inspire future developments of cardinality estimation models that fuse data information and query workloads, and we believe it opens interesting and promising research directions. For example, exploring the power of UAE-Q for database generation is very promising. The generative nature of UAE-Q allows us to efficiently sample tuples from the model; this is not the case for other supervised models, because it is hard to obtain the normalizing constant (Chow and Teicher, 2003) of the data probability for those models. This characteristic makes UAE-Q suitable for generating databases for DBMS testing and benchmarking (Arasu et al., 2011; Lo et al., 2010; Li et al., 2018), another important task in big data management.

Acknowledgements.
This research was conducted at Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund ‐ Industry Collaboration Projects Grant. This work was also supported in part by a MOE Tier-2 grant MOE2019-T2-2-181, a MOE Tier-1 grant RG114/19, and an NTU ACE grant. We would like to thank Zizhong Meng (NTU) for helping with some of the experiments, and the anonymous reviewers for providing constructive feedback and valuable suggestions.

References

  • A. Aboulnaga and S. Chaudhuri (1999) Self-tuning histograms: building histograms without looking at data. ACM SIGMOD Record 28 (2), pp. 181–192. Cited by: §1, Table 1, §2.
  • C. Anagnostopoulos and P. Triantafillou (2015) Learning to accurately count with query-driven predictive analytics. In 2015 IEEE international conference on big data (big data), pp. 14–23. Cited by: §1, Table 1, §2.
  • A. Arasu, R. Kaushik, and J. Li (2011) Data generation using declarative constraints. In SIGMOD, pp. 685–696. Cited by: §6.
  • L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §4.2.
  • N. Bruno, S. Chaudhuri, and L. Gravano (2001) STHoles: a multidimensional workload-aware histogram. In SIGMOD, pp. 211–222. Cited by: §1, Table 1, §2, §4.1, §5.1.2, §5.1.4.
  • W. Cai, M. Balazinska, and D. Suciu (2019) Pessimistic cardinality estimation: tighter upper bounds for intermediate join cardinalities. In SIGMOD, pp. 18–35. Cited by: §5.6.
  • C. Chow and C. Liu (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14 (3), pp. 462–467. Cited by: §1, Table 1, §2, item 4..
  • Y. S. Chow and H. Teicher (2003) Probability theory: independence, interchangeability, martingales. Springer Science & Business Media. Cited by: §6.
  • G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4 (1–3), pp. 1–294. External Links: ISSN 1931-7883, Link, Document Cited by: §2.
  • [10] () DeepDB. Note: https://github.com/DataManagementLab/deepdb-public/ Cited by: item 6..
  • A. Deshpande, M. Garofalakis, and R. Rastogi (2001) Independence is good: dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record 30 (2), pp. 199–210. Cited by: Table 1, §2.
  • D. P. Doane and L. E. Seward (2011) Measuring skewness: a forgotten statistic?. Journal of statistics education 19 (2). Cited by: §5.1.1.
  • A. Dutt, C. Wang, A. Nazi, S. Kandula, V. Narasayya, and S. Chaudhuri (2019) Selectivity estimation for range predicates using lightweight models. VLDB 12 (9), pp. 1044–1057. Cited by: §1, Table 1, §2, §4.1, item 2..
  • [14] () Feedback-kde. Note: https://bitbucket.org/mheimel/feedback-kde/ Cited by: item 9..
  • M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015a) Made: masked autoencoder for distribution estimation. In ICML, pp. 881–889. Cited by: §1, §4.2.
  • M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015b) MADE: masked autoencoder for distribution estimation. Cited by: §2, §4.2.
  • L. Getoor, B. Taskar, and D. Koller (2001) Selectivity estimation using probabilistic models. In SIGMOD, pp. 461–472. Cited by: §1, Table 1, §2, §3, §4.2.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §4.2.
  • S. Gu, S. Levine, I. Sutskever, and A. Mnih (2016) Muprop: unbiased backpropagation for stochastic neural networks. In ICLR, Cited by: §4.3.
  • D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi (2000) Approximating multi-dimensional aggregate range queries over real attributes. Acm Sigmod Record 29 (2), pp. 463–474. Cited by: Table 1, §2.
  • D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi (2005) Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14 (2), pp. 137–154. Cited by: Table 1, §2, item 5., item 9..
  • P. J. Haas, J. F. Naughton, and A. N. Swami (1994) On the relative cost of sampling for join selectivity estimation. In PODS, pp. 14–24. Cited by: Table 1, §2.
  • S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das (2020) Deep learning models for selectivity estimation of multi-attribute queries. In SIGMOD, pp. 1035–1050. Cited by: §1, §1, Table 1, §2, §2, §3, 5th item, §4.1, §4.2, item 1., item 7., §5.1.3, §5.4.
  • R. Hayek and O. Shmueli (2020) Improved cardinality estimation by learning queries containment rates. In EDBT, Cited by: §3.
  • M. Heimel, M. Kiefer, and V. Markl (2015) Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pp. 1477–1492. Cited by: §1, §1, Table 1, §2, §4.1, item 9..
  • B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig (2020) DeepDB: learn from data, not from queries!. Vol. 13, pp. 992–1005. Cited by: §1, §1, Table 1, §4.1, §4.1, §4.6, item 6..
  • R. Hu, Z. Wang, W. Fan, and S. Agarwal (2018) Cost based optimizer in apache spark 2.2. Note: https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html Cited by: §1.
  • D. Huang, D. Y. Yoon, S. Pettie, and B. Mozafari (2019) Joins on samples: a theoretical guide for practitioners. VLDB. Cited by: §4.6.
  • I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga (2004) CORDS: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pp. 647–658. Cited by: Table 1, §2.
  • H. Jagadish, H. Jin, B. C. Ooi, and K. Tan (2001) Global optimization of histograms. ACM SIGMOD Record 30 (2), pp. 223–234. Cited by: Table 1, §2.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §1, §4.3, §5.3.
  • M. Kiefer, M. Heimel, S. Breß, and V. Markl (2017) Estimating join selectivities using bandwidth-optimized kernel density models. VLDB 10 (13), pp. 2085–2096. Cited by: §1, §1, Table 1, §2, §4.1.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §4.3.
  • A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper (2019) Learned cardinalities: estimating correlated joins with deep learning. In CIDR, Cited by: §1, Table 1, §2, 5th item, §4.1, item 1., item 8., §5.1.2, §5.1.3.
  • M. H. Kutner, C. J. Nachtsheim, J. Neter, W. Li, et al. (2005) Applied linear statistical models. Vol. 5, McGraw-Hill Irwin New York. Cited by: item 2..
  • V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann (2015) How good are query optimizers, really?. Proceedings of the VLDB Endowment 9 (3), pp. 204–215. Cited by: §5.1.1, §5.6.
  • V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann (2017) Cardinality estimation done right: index-based join sampling.. In CIDR, Cited by: §4.6.
  • V. Leis, B. Radke, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann (2018) Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal 27 (5), pp. 643–668. Cited by: §1.
  • Y. Li, R. Zhang, X. Yang, Z. Zhang, and A. Zhou (2018) Touchstone: generating enormous query-aware test databases. In 2018 USENIX Annual Technical Conference (USENIXATC 18), pp. 575–586. Cited by: §6.
  • E. Liang, Z. Yang, I. Stoica, P. Abbeel, Y. Duan, and X. Chen (2020) Variable skipping for autoregressive range density estimation. Cited by: §4.6.
  • L. Lim, M. Wang, and J. S. Vitter (2003) SASH: a self-adaptive histogram set for dynamically changing workloads. In Proceedings 2003 VLDB Conference, pp. 369–380. Cited by: §1, Table 1, §2.
  • R. J. Lipton, J. F. Naughton, and D. A. Schneider (1990) Practical selectivity estimation through adaptive sampling. In SIGMOD, pp. 1–11. Cited by: Table 1, §2.
  • E. Lo, N. Cheng, and W. Hon (2010) Generating databases for query workloads. VLDB 3 (1-2), pp. 848–859. Cited by: §6.
  • C. A. Lynch (1988) Selectivity estimation and query optimization in large databases with highly skewed distribution of column values.. In VLDB, pp. 240–251. Cited by: Table 1, §2.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §1, §4.3.
  • [46] (2017) Microsoft database. Note: https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017 Cited by: §1.
  • G. Moerkotte, T. Neumann, and G. Steidl (2009) Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2 (1), pp. 982–993. Cited by: §4.3.
  • [48] () MSCN. Note: https://github.com/andreaskipf/learnedcardinalities Cited by: item 1..
  • M. Muralikrishna and D. J. DeWitt (1988) Equi-depth multidimensional histograms. In SIGMOD, pp. 28–36. Cited by: Table 1, §2.
  • [50] () Naru-project. Note: https://github.com/naru-project/naru/ Cited by: item 7..
  • C. Nash and C. Durkan (2019) Autoregressive energy machines. Cited by: §1, §2, §4.2.
  • J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi (2019) An empirical analysis of deep learning for cardinality estimation. arXiv preprint arXiv:1905.06425. Cited by: Table 1, §2, 5th item.
  • G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In NIPS, pp. 2338–2347. Cited by: §2, §4.2.
  • Y. Park, S. Zhong, and B. Mozafari (2020) Quicksel: quick selectivity learning with mixture models. In SIGMOD, pp. 1017–1033. Cited by: Table 1, §2, §3.
  • H. Poon and P. Domingos (2011) Sum-product networks: a new deep architecture. In ICCV Workshops, pp. 689–690. Cited by: item 6..
  • V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita (1996) Improved histograms for selectivity estimation of range predicates. ACM Sigmod Record 25 (2), pp. 294–305. Cited by: Table 1, §2, §5.1.4.
  • [57] () PostgreSQL. Note: https://www.postgresql.org/ Cited by: §1, §5.1.4.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §4.3.
  • M. Riondato, M. Akdere, U. Çetintemel, S. B. Zdonik, and E. Upfal (2011) The vc-dimension of sql queries and selectivity estimation through sampling. In ECML PKDD, pp. 661–676. Cited by: Table 1, §2.
  • W. Rudin et al. (1964) Principles of mathematical analysis. Vol. 3, McGraw-hill New York. Cited by: §4.2.
  • D. E. Rumelhart and J. L. McClelland (1987) Learning internal representations by error propagation. Cited by: §4.2.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) Pixelcnn++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. Cited by: §1.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
  • D. W. Scott (2015) Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Cited by: item 5..
  • J. Spiegel and N. Polyzotis (2006) Graph-based synopses for relational selectivity estimation. In SIGMOD, pp. 205–216. Cited by: §1, Table 1, §2.
  • M. Stillger, G. M. Lohman, V. Markl, and M. Kandil (2001) LEO-db2’s learning optimizer. In VLDB, Vol. 1, pp. 19–28. Cited by: §1, Table 1, §2.
  • J. Sun and G. Li (2019) An end-to-end learning-based cost estimator. VLDB 13 (3), pp. 307–319. Cited by: Table 1, §2, 5th item.
  • N. Thaper, S. Guha, P. Indyk, and N. Koudas (2002) Dynamic multidimensional histograms. In SIGMOD, pp. 428–439. Cited by: Table 1, §2.
  • H. To, K. Chiang, and C. Shahabi (2013) Entropy-based histograms for selectivity estimation. In CIKM, pp. 1939–1948. Cited by: Table 1, §2.
  • K. Tzoumas, A. Deshpande, and C. S. Jensen (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4 (11), pp. 852–863. Cited by: §4.2.
  • K. Tzoumas, A. Deshpande, and C. S. Jensen (2013) Efficiently adapting graphical models for selectivity estimation. The VLDB Journal 22 (1), pp. 3–27. Cited by: §1, Table 1, §2.
  • [72] () UCI machine learning repository. Note: https://archive.ics.uci.edu/ml/index.php Cited by: item 2., item 3..
  • A. Van Gelder (1993) Multiple join size estimation by virtual domains. In Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 180–189. Cited by: Table 1, §2.
  • Q. Wang, Y. Shen, and J. Q. Zhang (2005) A nonlinear correlation measure for multivariable data set. Physica D: Nonlinear Phenomena 200 (3-4), pp. 287–295. Cited by: §5.1.1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §4.3.
  • C. Wu, A. Jindal, S. Amizadeh, H. Patel, W. Le, S. Qiao, and S. Rao (2018) Towards a learning optimizer for shared clouds. VLDB 12 (3), pp. 210–222. Cited by: Table 1, §2, 5th item.
  • Z. Yang, A. Kamsetty, S. Luan, E. Liang, Y. Duan, X. Chen, and I. Stoica (2021) NeuroCard: one cardinality estimator for all tables. PVLDB. Cited by: §3, §4.6, §4.6, item 7., §5.1.2.
  • Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica (2020) Deep unsupervised cardinality estimation. Vol. 13, pp. 279–292. Cited by: §1, §1, Table 1, §2, §3, 2nd item, 3rd item, §4.1, §4.1, §4.1, §4.2, §4.6, item 1., item 7., item 4., §5.1.2, §5.1.3, §5.4.
  • F. Zanettin (2019) State of new york. vehicle, snowmobile, and boat registrations. Note: catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations Cited by: item 1..
  • Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi (2018) Random sampling over joins revisited. In SIGMOD, pp. 1525–1539. Cited by: §4.6.