1. Introduction
Cardinality estimation — estimating the result size of a SQL predicate — is a critical component of query optimization, which aims to identify a good execution plan based on cardinality estimates (Hu et al., 2018; 46; 57). In spite of its importance, modern DBMSs may still produce large estimation errors on complex queries and on datasets with strong correlations (Leis et al., 2018). Additionally, cardinality estimation can be used for approximate query processing (Hilprecht et al., 2020).
The fundamental difficulty of cardinality estimation is to construct or learn an accurate and compact representation of the joint data distribution of a relational table (the frequency of each unique tuple normalized by the table's row count). Most existing work on cardinality estimation falls into two broad classes: data-driven and query-driven estimation.

Data-driven methods summarize the joint data distribution for cardinality estimation. Traditional data-driven methods include data-driven histograms, sampling, and sketching. However, they usually make independence and uniformity assumptions that do not hold in complex real-world datasets. Learning-based methods formulate cardinality estimation as a machine learning problem, but traditional machine learning methods have their own shortcomings. For example, kernel density estimation (Heimel et al., 2015; Kiefer et al., 2017) is vulnerable to high-dimensional data, and probabilistic graphical models (Chow and Liu, 1968; Getoor et al., 2001; Spiegel and Polyzotis, 2006; Tzoumas et al., 2013) are inefficient at estimation time. Recent advances in deep learning offer promising tools in this regard. The Sum-Product Network-based model of (Hilprecht et al., 2020) has been applied to approximating the joint distribution, but it does not handle strong attribute correlations in real-world datasets well. A promising recent direction is applying deep autoregressive models (Nash and Durkan, 2019; Salimans et al., 2017; Germain et al., 2015a) to cardinality estimation (Yang et al., 2020; Hasan et al., 2020); these models capture attribute correlations and offer reasonable estimation efficiency. However, the optimization target (loss function) of deep autoregressive models is to minimize the average error over the whole dataset, which may neglect tricky (e.g., long-tail) data regions, because the model can be dominated by a few head regions while degrading on many tail regions.

Alternatively, query-driven cardinality estimation utilizes a query workload (from either a query log or a generated workload) to perform cardinality estimation without seeing the data, and is expected to capture more focused information about the workload queries than data-driven methods (Bruno et al., 2001). Traditional query-driven histograms (Aboulnaga and Chaudhuri, 1999; Stillger et al., 2001; Bruno et al., 2001; Lim et al., 2003; Anagnostopoulos and Triantafillou, 2015) suffer from the same problems as other histogram-based methods. Recently, deep learning (DL)-based estimators have been shown to estimate complex joins without independence assumptions, based on the powerful representation ability of deep neural networks (Schmidhuber, 2015). However, query-driven models assume that test queries share similar properties with training queries, i.e., that they are drawn from the same distribution. This may not hold under workload shifts: test and training workloads may focus on different data regions. Moreover, it would be expensive to generate workload queries that sufficiently cover all data regions to train a model.

It is therefore natural to utilize both the data and the query workload for cardinality estimation. In fact, a few proposals (e.g., DeepDB) consider this combination an interesting avenue for future work, and several solutions (Kipf et al., 2019; Dutt et al., 2019; Heimel et al., 2015; Kiefer et al., 2017) have been proposed to utilize both data and workload. However, the ways these pioneering studies combine the two types of information are simple, and they are insufficient for capturing both types of information to learn the joint data distribution. As discussed in more detail in the related work, these solutions simply use one side (data or queries) as auxiliary information to enhance a model of the other side. Consequently, they cannot model the data as unsupervised information and the query workload as supervised information in a unified model to learn the joint data distribution for cardinality estimation.
Goals To address the aforementioned problems, we identify four design goals:


G1. Capturing data correlations without independence or uniformity assumptions;

G2. Utilizing both data and query workload for model training;

G3. Incrementally ingesting new data and query workload;

G4. Time- and space-efficient estimation.
Our Solution To achieve the four goals, in this paper we propose UAE, a new unified deep autoregressive estimator that utilizes data as unsupervised information and query workload as supervised information for learning the joint data distribution. Deep autoregressive models have demonstrated superior effectiveness and efficiency in the pioneering work (Yang et al., 2020; Hasan et al., 2020) on training autoregressive models for cardinality estimation. However, no existing deep autoregressive model in the literature is able to incorporate the query workload as supervised information for learning the joint data distribution, much less support both data as unsupervised information and query workload as supervised information in the same model. To enable the deep autoregressive model to incorporate query workload as supervised information, we propose a novel idea: we utilize the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) to differentiate the categorically sampled variables so that the deep autoregressive model can learn the joint data distribution directly from queries. Furthermore, we propose to combine the unsupervised and supervised losses produced from data and queries, respectively, with a trade-off hyperparameter, and thus we are able to train the deep autoregressive model to learn the joint data distribution with a single set of model parameters by minimizing the combined loss. Therefore, UAE can learn from both data and queries simultaneously using the same set of model parameters. Moreover, since UAE is trained with both data and queries, it is naturally capable of incorporating incremental data and query workload.
Contributions This work makes the following contributions:


We propose a novel approach, UAE-Q, for incorporating query workload as supervised information into deep autoregressive models. This yields the first deep autoregressive model that is capable of learning a density distribution from query workload.

We propose the first unified deep autoregressive model, UAE, to use both data as unsupervised information and query workload as supervised information for learning the joint data distribution. To the best of our knowledge, this is the first deep model that is truly capable of doing so.

We conduct comprehensive experiments comparing UAE with 9 baseline methods on three real-life datasets. The 9 baselines cover data-driven methods, query-driven methods, and hybrid methods, including recent deep learning based methods. The experimental results show that UAE achieves single-digit multiplicative error at the tail, better accuracy than other state-of-the-art estimators, and both space and time efficiency. The results demonstrate that our method can well achieve the four aforementioned goals. Interestingly, the experimental results also show that UAE-Q, which is trained on queries only, outperforms the state-of-the-art query-driven method.
2. Related Work
Selectivity or cardinality estimation has been an active area of research for decades (Cormode et al., 2012). We present previous solutions in three categories: data-driven estimators, query-driven estimators, and hybrid estimators, as summarized in Table 1.


[Table 1. Comparison of existing estimators along six criteria: Without Assumptions, Learning from Data, Learning from Queries, Incorporating Incremental Data, Incorporating Incremental Query Workload, and Efficient Estimation.]
Data-driven Cardinality Estimation Data-driven cardinality estimation methods construct estimation models from the underlying data. First, sampling-based methods (Lipton et al., 1990; Haas et al., 1994; Riondato et al., 2011) estimate cardinalities by scanning a sample of the data, which incurs space overhead and can be expensive. Histogram methods (Poosala et al., 1996; Deshpande et al., 2001; Ilyas et al., 2004; Lynch, 1988; Muralikrishna and DeWitt, 1988; Van Gelder, 1993; Jagadish et al., 2001; Thaper et al., 2002; To et al., 2013) construct histograms to approximate the data distribution. However, most of these methods make partial or conditional independence and uniformity assumptions, i.e., that the data is uniformly distributed within a bucket. A host of unsupervised machine learning methods have also been developed for data-driven cardinality estimation. Probabilistic graphical models (PGM) (Chow and Liu, 1968; Getoor et al., 2001; Spiegel and Polyzotis, 2006; Tzoumas et al., 2013) use Bayesian networks to model the joint data distribution, which also relies on conditional independence assumptions. Kernel density estimation (KDE)-based methods (Gunopulos et al., 2000, 2005) do not need independence assumptions, but their accuracy is not very competitive due to the difficulty of tuning the bandwidth parameter. Recently, Naru (Yang et al., 2020) and MADE (Hasan et al., 2020) utilize unsupervised deep autoregressive models to learn the conditional probability distributions and use them to answer point queries. Naru uses progressive sampling and MADE uses an adaptive importance sampling algorithm to answer range queries, and they achieve comparable results. Neither Naru nor MADE makes any independence assumption.

Query-driven Cardinality Estimation Supervised query-driven cardinality estimation approaches build models by leveraging the query workload. As opposed to data-driven histograms, query-driven histograms (Aboulnaga and Chaudhuri, 1999; Stillger et al., 2001; Bruno et al., 2001; Lim et al., 2003; Anagnostopoulos and Triantafillou, 2015) build histogram buckets from the query workload, without seeing the underlying data. Recently, QuickSel (Park et al., 2020) uses a uniform mixture model to fit the data distribution using every query in the workload, which avoids the overhead of multidimensional histograms; however, QuickSel also relies on uniformity assumptions. Deep learning models have recently been employed for query-driven cardinality estimation. Ortiz et al. (Ortiz et al., 2019) evaluate multilayer perceptrons and recurrent neural networks on encoded queries for cardinality estimation.
Sup (Hasan et al., 2020) encodes queries as a set of features and learns weights for these features with a fully connected neural network to estimate selectivity. In addition, Wu et al. (Wu et al., 2018) consider a relevant but different problem: estimating the cardinality for each point of a query plan graph by training a traditional machine learning model. Sun et al. (Sun and Li, 2019) estimate both the execution cost of a query plan and its cardinality together using a multi-task learning framework.

Hybrid Cardinality Estimation A few proposals leverage both the query workload and the underlying data to predict cardinalities. Query-enhanced approaches (Heimel et al., 2015; Kiefer et al., 2017) leverage the query workload to numerically optimize the bandwidth parameter of a KDE model for better accuracy. However, KDE-based models do not work well for high-dimensional data (Heimel et al., 2015; Kiefer et al., 2017). Recently, selectivity estimates from data-driven models have been used together with encoded queries as input features to machine learning models (Kipf et al., 2019; Dutt et al., 2019). Dutt et al. include the cardinality estimates of histograms as extra features in addition to query features, and use neural networks and tree-based ensemble models for cardinality estimation. Kipf et al. (Kipf et al., 2019) include estimates from sampling as extra features in addition to query features and use convolutional neural networks for cardinality estimation. However, these two approaches have the following problems: (1) they cannot be trained on the data directly and do not fully capture the benefits of the two types of information; (2) their combination methods significantly increase the model budget (for storing samples or histograms) and hurt the training and estimation efficiency of the model; (3) they cannot directly ingest incremental data, because they have to be trained with new queries whose cardinalities are obtained on the updated data.

Autoregressive Models Autoregressive models capture the joint data distribution by decomposing it into a product of conditional distributions. Recent deep autoregressive models include the Masked Autoencoder (Germain et al., 2015b), Masked Autoregressive Flow (Papamakarios et al., 2017), and Autoregressive Energy Machines (Nash and Durkan, 2019).

Remark Our estimator UAE belongs to the hybrid family and is based on deep autoregressive models. To our knowledge, no existing deep autoregressive model in the machine learning literature supports using query workload as supervised information to train the model, much less both data as unsupervised information and query workload as supervised information in one model. In UAE, we propose a novel solution that enables using query workload as supervised information, as well as a unified model that utilizes both data as unsupervised information and query workload as supervised information with deep autoregressive models.
3. Problem Statement
Consider a relation $T$ that consists of $n$ columns (or attributes) $A_1, \ldots, A_n$. A tuple (or data point) $x = (x_1, \ldots, x_n)$ is an $n$-dimensional vector. The row count of $T$ is denoted by $|T|$. The domain region $R_i$ of attribute $A_i$ is the set of distinct values in $A_i$.

Predicates A query is a conjunction of predicates, each of which contains an attribute, an operator, and a value. A predicate indicates a constraint on an attribute (e.g., an equality constraint $A_i = v$, or a range constraint $v_1 \le A_i \le v_2$).

Cardinality The cardinality of a query $q$, denoted $\mathrm{card}(q)$, is defined as the number of tuples of $T$ that satisfy the query. A related term, the selectivity, is defined as the fraction of the rows of $T$ that satisfy the query, i.e., $\mathrm{sel}(q) = \mathrm{card}(q) / |T|$.

Supported Queries Our proposed estimator supports cardinality estimation for queries with conjunctions of predicates. Each predicate contains a range constraint ($<$, $\le$, $>$, $\ge$), an equality constraint ($=$), or an IN clause on a numeric or categorical column. For a numerical column, we make the assumption that the domain region is finite and use the values present in that column as the attribute domain. Moreover, the estimator can also support disjunctions via the inclusion-exclusion principle. Note that our formulation follows a large body of previous work on cardinality estimation (Yang et al., 2020; Hasan et al., 2020; Getoor et al., 2001; Park et al., 2020). For joins, UAE supports multi-way and multi-key equi-joins, as in (Yang et al., 2021). Moreover, group-by queries could be supported by learning query containment rates (Hayek and Shmueli, 2020).
Problem Consider (1) the underlying tabular data $T$ and (2) a set of queries $Q = \{q_1, \ldots, q_m\}$ with their cardinalities $\{\mathrm{card}(q_1), \ldots, \mathrm{card}(q_m)\}$. This work aims to build a model that leverages both the set of queries with their cardinalities and the underlying data to predict the cardinalities of incoming queries. Furthermore, after training, it is desirable that the model can ingest new data and query workload incrementally, rather than being retrained. Note that such labeled queries can be collected as feedback from prior query executions (the query log).
Formulation as Distribution Estimation. Consider the attributes $A_1, \ldots, A_n$ of relation $T$ and an indicator function $I_q(x)$ for a query $q$, which produces 1 if the tuple $x$ satisfies the predicates of $q$, and 0 otherwise. The joint data distribution of $T$ is given by $P(x) = P(A_1 = x_1, \ldots, A_n = x_n)$, which is a valid distribution. We can then express the selectivity as $\mathrm{sel}(q) = \sum_{x} I_q(x) \cdot P(x)$. Thus, under this formulation, the key problem of selectivity estimation is obtaining the joint data distribution.
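To make this formulation concrete, the following sketch builds the joint distribution of a hypothetical four-row, two-attribute relation by counting, and derives the selectivity and cardinality of an invented illustrative query via the indicator-function sum:

```python
from collections import Counter

# Hypothetical four-row relation T with attributes (A1, A2).
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 2)]
n = len(rows)

# Joint data distribution P(x): tuple frequency normalized by |T|.
joint = {t: c / n for t, c in Counter(rows).items()}

def indicator(t):
    # I_q(t) for an illustrative query q: A1 = 'a' AND A2 <= 1.
    return 1 if (t[0] == "a" and t[1] <= 1) else 0

# sel(q) = sum_x I_q(x) * P(x); card(q) = sel(q) * |T|.
sel = sum(indicator(t) * p for t, p in joint.items())
card = sel * n
```

Here two of the four rows satisfy the predicate, so the selectivity is 0.5 and the cardinality is 2.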
4. Proposed Model
We present the proposed unified deep autoregressive estimator, called UAE, which is capable of learning from both data and query workload to support cardinality estimation. We first present an overview of UAE (Section 4.1). Then we introduce how to use a trained autoregressive model for cardinality estimation (Section 4.2). We then present our idea of differentiating progressive sampling to enable deep autoregressive models to be trained with query workload (Section 4.3). Next, a hybrid training procedure (Section 4.4) is proposed to use data as unsupervised information and queries as supervised information to jointly train UAE. We present the approaches to incorporating incremental data and query workload in Section 4.5. Finally, we discuss several miscellaneous issues (Section 4.6) and make several remarks (Section 4.7).
4.1. Overview
Motivations On the one hand, data-driven methods have been claimed to be more general and more robust to workload shifts than query-driven methods (Hilprecht et al., 2020; Yang et al., 2020). On the other hand, a query workload with true selectivities provides additional information about the workload (Bruno et al., 2001). Therefore, it is natural to combine data-driven and query-driven models. As discussed before, the existing proposals that leverage both data and query workload (Kipf et al., 2019; Dutt et al., 2019; Heimel et al., 2015; Kiefer et al., 2017) are insufficient in this direction.
One idea to overcome the problem that data-driven methods suffer at the tail of the distribution, due to their averaging optimization target, would be to use ensemble methods with each component targeting a different part of the distribution. However, (1) it is not easy to define a good partition; (2) it is non-trivial to integrate the results of different ensembles, since queries may span multiple ensembles (for example, (Hilprecht et al., 2020) uses an SPN to combine different ensembles and consequently makes independence assumptions); and (3) using ensembles is orthogonal to UAE: we could integrate UAE with ensemble methods if good ones were designed.
Challenges In this work, to achieve the four goals stated in the Introduction, we resort to deep autoregressive models, since they have demonstrated superior performance in expressiveness for capturing attribute correlations and in efficiency (Yang et al., 2020; Hasan et al., 2020). This is, however, challenging in two respects: (1) off-the-shelf deep autoregressive models in the machine learning literature are not able to incorporate query workload as supervised information for training the model; (2) as a naive combination of data-driven and query-driven models is undesirable, we aim to develop a unified autoregressive model with a single set of model parameters that uses both data as unsupervised information and query workload as supervised information to learn the joint data distribution.
Overview of High-level Idea Both challenges call for designing new deep autoregressive models. First, deep autoregressive models rely on sampling techniques to answer range queries (Yang et al., 2020). However, they cannot be trained with queries because the sampled categorical variables are not differentiable (explained in detail in Section 4.3). Therefore, to enable the deep autoregressive model to incorporate query workload as supervised information to learn the joint data distribution, we propose a novel idea: we utilize the Gumbel-Softmax trick to differentiate the sampled variables so that the deep autoregressive model can learn the joint data distribution directly from queries. In this way, our proposed model can also incorporate incremental query workload, as discussed later. Second, to fully leverage data as unsupervised information and queries as supervised information in the hybrid training setting, we combine the unsupervised and supervised losses produced from data and queries, respectively, with a trade-off hyperparameter. This enables us to jointly train the deep autoregressive model to learn the joint data distribution by minimizing the combined loss. Therefore, the deep autoregressive model can learn from both data and queries simultaneously using the same set of model parameters.
Figure 1 shows the workflow of our proposed estimator UAE. We can train UAE with data only, in which case batches of random tuples are fetched from the table for learning the joint data distribution. We can also train UAE with queries as supervised information only, in which case batches of random (query, cardinality) pairs are read from the query workload log to learn the joint data distribution. UAE is able to learn the joint data distribution with a single autoregressive model from both data and queries.
4.2. Preliminary: Deep Autoregressive Models for Cardinality Estimation
Autoregressive Decomposition Naively, one could store the point distribution of all tuples in a table for exact selectivity estimation. However, the number of entries grows exponentially with the number of attributes, which is infeasible. Many previous methods have attempted to use Bayesian networks (BN) to approximate the joint distribution via factorization (Getoor et al., 2001; Tzoumas et al., 2011). However, (1) they still make some conditional independence assumptions, and (2) the expense of learning and inference with a BN is often prohibitive (we empirically found that the estimation time of a BN could be 110-120 seconds on the DMV dataset).
To achieve a better trade-off between the ability to capture attribute correlations and the space budget, while keeping model training and inference tractable and efficient, we utilize the autoregressive decomposition mechanism, which factorizes the joint distribution in an autoregressive manner without any independence assumption:

(1) $P(x) = P(x_1) \cdot P(x_2 \mid x_1) \cdots P(x_n \mid x_1, \ldots, x_{n-1})$
After training on the underlying tuples using neural network architectures, only the weights of the deep autoregressive model need to be stored to compute the conditional distributions $P(x_i \mid x_1, \ldots, x_{i-1})$. Note that we use the left-to-right attribute order in this work, which was demonstrated to be effective in previous work (Yang et al., 2020). More strategies for choosing a good ordering can be found in (Yang et al., 2020; Hasan et al., 2020).
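As a quick sanity check of the decomposition in Eq. (1), the sketch below estimates the conditionals of a hypothetical two-attribute relation by counting (standing in for a learned model's outputs) and verifies that their product recovers the joint probability of a tuple:

```python
from collections import defaultdict

# Hypothetical four-row relation with two attributes.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 2)]
n = len(rows)

# Count-based estimates of P(x1) and P(x2 | x1), standing in for the
# conditionals predicted by a trained autoregressive model.
c1, c12 = defaultdict(int), defaultdict(int)
for x1, x2 in rows:
    c1[x1] += 1
    c12[(x1, x2)] += 1

def p1(x1):
    return c1[x1] / n

def p2_given_1(x2, x1):
    return c12[(x1, x2)] / c1[x1]

# Left-to-right decomposition: P(x1, x2) = P(x1) * P(x2 | x1).
joint_ar = p1("a") * p2_given_1(1, "a")                     # (3/4) * (2/3)
joint_direct = sum(1 for r in rows if r == ("a", 1)) / n    # 2/4
```

Both quantities equal 0.5, illustrating that the factorization loses no information.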
Encoding Tuples Deep autoregressive models treat a table as a multidimensional discrete distribution. We first scan all attributes to obtain their domains. Next, for each attribute $A_i$, its values are encoded into integers in the range $[0, |R_i| - 1]$ in their natural order; for a string attribute, this is a dictionary mapping each distinct string to an integer. It is a bijective transformation without any information loss.

After the integer transformation, for each attribute a specific encoder further encodes these integers into vectors for training the neural networks. The simplest method would be one-hot encoding. Specifically, for an attribute with three distinct values encoded as integers $\{0, 1, 2\}$, one-hot encoding represents them as $[1,0,0]$, $[0,1,0]$, $[0,0,1]$. However, this naive method is not storage-efficient because the encoded vector is $|R_i|$-dimensional. We thus use binary encoding, which encodes the same attribute into $[0,0]$, $[0,1]$, $[1,0]$, i.e., $\lceil \log_2 |R_i| \rceil$-dimensional vectors.

Model Architectures We use ResMADE (Nash and Durkan, 2019), a multilayer perceptron with an information-masking technique that masks out the influence of $x_i, \ldots, x_n$ on the conditional $P(x_i \mid x_1, \ldots, x_{i-1})$. Exploring advanced architectures of deep autoregressive models (Papamakarios et al., 2017; Germain et al., 2015b) is orthogonal to our work.
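The binary-encoding step can be sketched as follows; `binary_encode` is a hypothetical helper (not part of any library), and the domain sizes are illustrative:

```python
import math

def binary_encode(value_index, domain_size):
    """Encode an integer attribute value into ceil(log2(domain_size)) bits."""
    width = max(1, math.ceil(math.log2(domain_size)))
    return [(value_index >> b) & 1 for b in reversed(range(width))]

# A 3-value attribute needs only 2 bits instead of a 3-dimensional one-hot.
codes = [binary_encode(v, 3) for v in range(3)]   # [[0,0], [0,1], [1,0]]
```

For larger domains the savings grow quickly: a 1,000-value attribute needs 10 bits instead of a 1,000-dimensional one-hot vector.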
Model Training In a nutshell, the input to a deep autoregressive estimator is a data tuple, and its output is the corresponding predicted density. In the training phase, the weights (or parameters) of the estimator are learned from data tuples by minimizing the cross-entropy between the real and estimated data distributions (Germain et al., 2015a):

(2) $\mathcal{L}_{data}(\theta) = -\frac{1}{|T|} \sum_{x \in T} \log \hat{P}_\theta(x)$

where $\theta$ denotes the model weights. Normally, gradient updates for neural networks are performed by stochastic gradient descent (SGD) (Bottou, 2010), using backpropagation (Rumelhart and McClelland, 1987; Goodfellow et al., 2016) as the gradient computing technique. Backpropagation is an efficient method for computing gradients in directed computation graphs, such as neural networks, via the chain rule (Rudin and others, 1964): it computes all required partial derivatives in time linear in the graph size, as shown in (1) of Figure 2. An important characteristic of backpropagation is that it requires each node of the computation graph involved in the flow to be deterministic and differentiable, i.e., to have a well-defined gradient.

Answering Range Queries with Sampling After being trained, deep autoregressive models can directly answer point queries (e.g., $A_1 = a$ AND $A_2 = b$ for a relation with two attributes), because a deep autoregressive model is essentially a point density estimator. However, it is not easy to use a point estimator to answer range queries. Estimating the selectivity of a range query is equivalent to estimating the sum of the selectivities of the set of data points the query contains. Suppose the region of a query $q$ is $R(q) = R_1(q) \times \cdots \times R_n(q)$, where $n$ denotes the number of attributes. A naive approach to estimating the range query is exhaustive enumeration:
(3) $\widehat{\mathrm{sel}}(q) = \sum_{x \in R(q)} \hat{P}_\theta(x)$
where $R(q)$ represents the set of distinct tuples contained in $q$'s region and $\widehat{\mathrm{sel}}(q)$ is the estimated selectivity of query $q$. However, this method is computationally prohibitive: in the worst case, the number of enumerated tuples grows exponentially with the number of attributes. We thus resort to sampling techniques to efficiently approximate the selectivity, as follows.


Uniform Sampling samples $m$ tuples $x^{(1)}, \ldots, x^{(m)}$ uniformly at random from the query region $R(q)$ and then computes the estimated selectivity as

(4) $\widehat{\mathrm{sel}}(q) = \frac{|R(q)|}{m} \sum_{i=1}^{m} \hat{P}_\theta(x^{(i)})$
Progressive Sampling is a Monte Carlo integration approach (Yang et al., 2020) that samples each tuple attribute by attribute, in order, concentrating on the regions of high probability. Specifically, to sample a tuple $x$, we sequentially sample its attribute values $x_1, \ldots, x_n$ from the distributions $\hat{P}(x_1), \hat{P}(x_2 \mid x_1), \ldots, \hat{P}(x_n \mid x_1, \ldots, x_{n-1})$ restricted to the query region, where each categorical distribution is predicted by the deep autoregressive model given the previously sampled attributes. Therefore, tuples with higher probability under the model are more likely to be sampled. The selectivity estimate made by a single sampled tuple is $\hat{P}(x_1 \in R_1(q)) \cdot \hat{P}(x_2 \in R_2(q) \mid x_1) \cdots \hat{P}(x_n \in R_n(q) \mid x_1, \ldots, x_{n-1})$. The estimate from multiple sampled tuples is obtained by averaging the estimates of the individual tuples. It is easy to verify that progressive sampling estimates are unbiased. This method is more robust to skewed data distributions than uniform sampling, so we adopt progressive sampling in our work.
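The following sketch illustrates the progressive sampling estimate on a hypothetical two-attribute model; the conditional probability tables below are invented stand-ins for the autoregressive model's predictions, and each sample's estimate is the product of the in-region probability masses:

```python
import random

random.seed(0)

# Hypothetical conditionals standing in for the model's predictions:
# P(x1) and P(x2 | x1) for a two-attribute relation.
P1 = {"a": 0.75, "b": 0.25}
P2 = {"a": {1: 2 / 3, 2: 1 / 3}, "b": {1: 0.0, 2: 1.0}}

def progressive_sample(region1, region2):
    # Probability mass of attribute 1 inside the query region R1(q).
    m1 = sum(p for v, p in P1.items() if v in region1)
    # Sample x1 from the in-region distribution, renormalized by m1.
    vals = [v for v in P1 if v in region1]
    x1 = random.choices(vals, weights=[P1[v] / m1 for v in vals])[0]
    # In-region mass of attribute 2, conditioned on the sampled x1.
    m2 = sum(p for v, p in P2[x1].items() if v in region2)
    # Per-sample selectivity estimate: product of in-region masses.
    return m1 * m2

# Query region: x1 in {'a'} and x2 in {1}; average over 100 samples.
est = sum(progressive_sample({"a"}, {1}) for _ in range(100)) / 100
```

Here every sample yields 0.75 * (2/3) = 0.5, which matches the true mass of the region; for wider regions, different samples contribute different products and the average converges to the true selectivity.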
4.3. Training Deep Autoregressive Models with Queries
We proceed to present our idea for empowering the autoregressive model with the ability to learn from queries. The existing autoregressive models cannot learn from queries via backpropagation in an end-to-end manner because, in principle, gradients cannot flow through sampled discrete random variables; hence the process of progressive sampling is not differentiable, which is a prerequisite of backpropagation, as explained earlier. Specifically, consider a set of queries $Q = \{q_1, \ldots, q_m\}$; we define the query loss for autoregressive models as:

(5) $\mathcal{L}_{query}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \Delta\big(\mathrm{sel}(q_i), \widehat{\mathrm{sel}}(q_i)\big)$
where $\widehat{\mathrm{sel}}(q_i)$ is the predicted selectivity of $q_i$. There are many choices for the discrepancy function $\Delta$, e.g., root mean square error (RMSE) and Q-error (Moerkotte et al., 2009):

(6) $\mathrm{Qerror}(q) = \max\big(\widehat{\mathrm{sel}}(q) / \mathrm{sel}(q),\ \mathrm{sel}(q) / \widehat{\mathrm{sel}}(q)\big)$
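The Q-error metric can be computed as follows; `q_error` and the sample selectivity values are illustrative, and the small epsilon guards against division by zero:

```python
def q_error(est_sel, true_sel, eps=1e-12):
    """Multiplicative Q-error: max(est/true, true/est); 1.0 is a perfect estimate."""
    est_sel, true_sel = max(est_sel, eps), max(true_sel, eps)
    return max(est_sel / true_sel, true_sel / est_sel)

perfect = q_error(0.1, 0.1)    # 1.0
under = q_error(0.001, 0.1)    # 100.0 -- 100x underestimation
over = q_error(10.0, 0.1)      # 100.0 -- overestimation penalized symmetrically
```

Unlike RMSE, Q-error is relative and symmetric, so a 100x underestimate and a 100x overestimate are penalized identically regardless of the query's absolute selectivity.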
Next, let us focus on (2) in Figure 2, which illustrates the gradient flow of a deep autoregressive model with progressive sampling using one sample, trained with queries. In each forward pass, the autoregressive model utilizes progressive sampling to successively sample a one-hot vector $z_i$ for each attribute (in practice, the result of sampling from $\hat{P}(x_i \mid x_1, \ldots, x_{i-1})$ in deep autoregressive models is a one-hot vector that represents $x_i$; we thus denote it by $z_i$ hereinafter for clarity) and uses these vectors to compute the per-attribute probability masses, after which $\widehat{\mathrm{sel}}(q)$ can be obtained. However, we observe that during backpropagation, gradients cannot completely flow from $\widehat{\mathrm{sel}}(q)$ back to the model weights $\theta$. This is because gradients cannot flow through the sampled vectors $z_i$: these stochastic variables do not have a well-defined gradient w.r.t. $\theta$. The argument easily generalizes to multiple samples, as only an averaging operation is needed to combine the estimates of individual samples, which does not change the non-differentiable property of progressive sampling. Consequently, the model weights cannot be trained using the query workload with current techniques.
Our key insight is that if we can find a method making the process of progressive sampling differentiable, the deep autoregressive models can be trained directly from queries by minimizing the discrepancy between the actual selectivities and the estimated selectivities through progressive sampling via backpropagation.
The key challenge of differentiating progressive sampling is differentiating the non-differentiable sample drawn from the categorical distribution $\hat{P}(x_i \mid x_1, \ldots, x_{i-1})$. To this end, we consider two ideas, score function estimators and the Gumbel-Softmax trick, and analyze which is more suitable for our work.
Score Function Estimators. The score function estimator (SF), also known as REINFORCE (Williams, 1992), derives the gradient of the query loss for autoregressive models w.r.t. the model weights by:

(7) $\nabla_\theta \, \mathbb{E}_{x \sim \hat{P}_\theta}[f(x)] = \mathbb{E}_{x \sim \hat{P}_\theta}\big[f(x) \, \nabla_\theta \log \hat{P}_\theta(x)\big]$

where $x$ is sampled from $\hat{P}_\theta$ and $f$ is a function of $x$ and $\theta$. With SF, we only need $\hat{P}_\theta(x)$ to be continuous in $\theta$ (which holds), without requiring backpropagation through the sampled tuple $x$.

However, SF often suffers from high variance, even when improved with variance-reduction techniques (Gu et al., 2016). Moreover, SF does not scale well to categorical distributions, because the variance grows linearly with the number of dimensions of the distribution (Rezende et al., 2014). Consequently, a better method is needed.
The Gumbel-Softmax Trick.
The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) was originally used to differentiate discrete latent variables in variational autoencoders (Kingma and Welling, 2014); it is summarized in Algorithm 1. We leverage this technique to provide a simple and effective way to draw a differentiable sample for each attribute.

Consider a categorical distribution with $k$-dimensional class probabilities $\pi = (\pi_1, \ldots, \pi_k)$. To sample a one-hot vector $z$ from $\pi$, the key idea of the Gumbel-Softmax trick is:

(8) $z = \mathrm{one\_hot}\big(\arg\max_i (g_i + \log \pi_i)\big)$

where $g_1, \ldots, g_k$ are independent and identically distributed samples drawn from a Gumbel(0, 1) distribution, which can be sampled by:

(9) $g = -\log(-\log(u)), \quad u \sim \mathrm{Uniform}(0, 1)$

Since $\arg\max$ is non-differentiable, we can use the differentiable softmax as a continuous approximation to sample a relaxed vector $y$:

(10) $y_i = \frac{\exp\big((\log \pi_i + g_i) / \tau\big)}{\sum_{j=1}^{k} \exp\big((\log \pi_j + g_j) / \tau\big)}, \quad i = 1, \ldots, k$

where $\tau$ is an adjustable hyperparameter referred to as the temperature. As the temperature approaches 0, samples from the Gumbel-Softmax distribution become one-hot. Hence, the temperature trades off gradient variance against the degree of approximation to a one-hot vector. Crucially, a sample drawn via the Gumbel-Softmax trick is differentiable, and it has been shown to have lower variance and better scalability than SF. Therefore, in this work we use the Gumbel-Softmax trick as the core technique for differentiating progressive sampling. Note that $\pi$ can be any categorical distribution, including the conditionals $\hat{P}(x_i \mid x_1, \ldots, x_{i-1})$ predicted by the autoregressive model.
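A pure-Python sketch of the two steps (Gumbel noise, then the temperature-controlled softmax); the function names are ours and the class probabilities are illustrative. Reusing the same noise at two temperatures shows the relaxed sample sharpening toward one-hot as the temperature drops:

```python
import math
import random

def sample_gumbel(k, rng=random):
    """i.i.d. Gumbel(0, 1) noise: g_i = -log(-log(u_i)), u_i ~ Uniform(0, 1)."""
    return [-math.log(-math.log(rng.random())) for _ in range(k)]

def gumbel_softmax(probs, gumbels, tau):
    """Relaxed categorical sample: softmax((log pi_i + g_i) / tau)."""
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = [0.2, 0.5, 0.3]
g = sample_gumbel(len(probs), random.Random(0))
y_sharp = gumbel_softmax(probs, g, tau=0.01)   # nearly one-hot as tau -> 0
y_smooth = gumbel_softmax(probs, g, tau=10.0)  # smooth, low-variance relaxation
```

Because the relaxed sample is a deterministic, differentiable function of `probs` once the noise is drawn, gradients can flow through it back to the model parameters, which is exactly what the one-hot argmax sample does not allow.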
Based on the procedure of the Gumbel-Softmax trick, we now introduce in detail how to use it to differentiate progressive sampling. As shown in (3) of Figure 2, the key idea of differentiable progressive sampling is to use deterministic, continuous variables to approximate stochastic discrete variables so that gradients can flow through the entire sampling process. Specifically, for each attribute, in a forward pass, a vector of Gumbel noise is first generated from Eq. 9. Next, we sample a relaxed one-hot vector from the softmax according to Eq. 10, where the categorical distribution in Eq. 10 is set to the model's conditional distribution for that attribute. We can then use the sampled vector to continue the forward pass. In doing so, we find that the gradients can be computed completely, because the stochastic nodes are disconnected from the gradient flow.
We present the flow of differentiable progressive sampling (DPS) in Algorithm 2. Note that in practice we can perform DPS with samples in batches. Also note that in line 7 of Algorithm 2, we can simply mask out the probabilities of values excluded by the predicate by setting the corresponding logits to negative infinity (-inf). This does not change the categorical property of the distribution and does not affect GS-Sampling.
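The masking step in line 7 of Algorithm 2 can be sketched as follows (an illustrative NumPy version; the real implementation would apply it to the model's output logits in batch):

```python
import numpy as np

def mask_logits(logits, valid):
    """Restrict a categorical distribution to predicate-satisfying values
    (Algorithm 2, line 7): excluded values get a logit of -inf, hence
    zero probability after the softmax, while gradients w.r.t. the kept
    logits are unaffected."""
    masked = np.where(valid, logits, -np.inf)
    z = masked - masked.max()   # stable softmax
    p = np.exp(z)
    return p / p.sum()
```

The surviving probabilities are simply renormalized over the valid values, so the result is still a proper categorical distribution.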
4.4. Hybrid Training
Now, UAE is able to learn from either data or queries. To achieve our ultimate goal, which is to take both data as unsupervised information and queries as supervised information into the training of UAE, we propose a hybrid training method, which trains the model of UAE by minimizing an overall loss function combining the data loss (Eq. 2) and the query loss (Eq. 5) via a hyperparameter λ:
(11)  \mathcal{L} = \mathcal{L}_{data} + \lambda \cdot \mathcal{L}_{query}
The adjustable hyperparameter λ rescales the relative values of the two losses. In doing so, UAE captures both the data and the query workload simultaneously while learning the joint data distribution. We summarize the workflow of the hybrid training of UAE in Algorithm 3.
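A toy sketch of one hybrid-training step on a single-column model may help fix ideas. The loss stand-ins below are simplified assumptions, not the paper's exact formulations: a negative log-likelihood over tuples plays the role of Eq. 2, and a squared selectivity error over range predicates plays the role of Eq. 5.

```python
import numpy as np

def softmax(t):
    z = np.exp(t - t.max())
    return z / z.sum()

def hybrid_step(theta, data_batch, query_batch, lam, lr=0.1):
    """One gradient step on the combined loss L_data + lam * L_query
    for a toy categorical model with logits `theta`. `data_batch` holds
    observed values; `query_batch` holds ((lo, hi), true_selectivity)
    range predicates."""
    p = softmax(theta)
    # gradient of the mean NLL under a softmax: p - empirical distribution
    emp = np.bincount(data_batch, minlength=len(theta)) / len(data_batch)
    g_data = p - emp
    g_query = np.zeros_like(theta)
    for (lo, hi), sel in query_batch:
        pred = p[lo:hi].sum()
        mask = np.zeros_like(p)
        mask[lo:hi] = 1.0
        # d pred / d theta = p * (mask - pred)  (softmax Jacobian)
        g_query += 2.0 * (pred - sel) * p * (mask - pred)
    return theta - lr * (g_data + lam * g_query)
```

Both gradients act on the same parameters, which is the essence of the unified model: one step moves the distribution toward the observed tuples and toward the observed query selectivities at once.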
4.5. Incorporating Incremental Data and Query Workload
We now describe the advantages of UAE in efficiently and effectively ingesting incremental data or query workloads.
Incremental Data denotes the tuples newly added to the database after the model is trained. UAE can perform incremental training on the incremental data by minimizing the unsupervised loss produced from the new data.
Incremental Query Workload is a set of queries drawn from a distribution different from the training workload's (i.e., they focus on different data regions). For example, on the IMDB dataset, one workload might focus on the data region where title.production_year > 1975 while another might focus on title.production_year < 1954. To adapt to a new query workload after being trained, UAE only needs to minimize the supervised loss on the newly ingested queries. In our experiments, we find that a small number of training epochs (10~20) is enough to prevent UAE from catastrophic forgetting.
4.6. Miscellaneous Issues
Supporting Join Queries. A natural idea for supporting multi-table joins in UAE is to train it on join results offered by join samplers (Leis et al., 2017; Huang et al., 2019). We follow the idea of (Yang et al., 2021; Hilprecht et al., 2020) to handle join queries, which adds virtual indicator and fan-out columns to the model architecture of UAE. We then train UAE on tuples sampled by the Exact Weight algorithm (Zhao et al., 2018) and on queries with fan-out scaling. Interested readers may refer to (Yang et al., 2021; Hilprecht et al., 2020) for details.
Handling Columns with Large NDVs. A problem with UAE's autoregressive architecture is that when the number of distinct values (NDVs) in a column is very large, storing the model parameters consumes a large amount of space. Hence, for columns with large NDVs, we leverage 1) an embedding method (which embeds one-hot column vectors via learnable embedding matrices) for tuple encoding and decoding; and 2) column factorization, which slices a value into groups of bits and then transforms each group into a base-10 integer (Yang et al., 2021).
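Column factorization can be sketched as follows (the group count and bit width here are illustrative choices, not the paper's settings):

```python
def factorize_column(value, n_groups=2, bits_per_group=8):
    """Split one high-NDV integer into `n_groups` fixed-width bit groups
    (most-significant group first), each read back as a base-10 integer,
    so one wide column becomes several narrow subcolumns."""
    mask = (1 << bits_per_group) - 1
    groups = [(value >> (bits_per_group * i)) & mask for i in range(n_groups)]
    return list(reversed(groups))

def defactorize_column(groups, bits_per_group=8):
    """Reassemble the original value from its bit groups."""
    value = 0
    for g in groups:
        value = (value << bits_per_group) | g
    return value
```

Each subcolumn's domain shrinks to 2^bits_per_group values, so the output layer over each factorized piece stays small regardless of the original column's NDV.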
Handling Unqueried Columns (Wildcards). We use wildcard skipping (Liang et al., 2020; Yang et al., 2020), which randomly masks attribute values and replaces them with special tokens in the inputs to UAE's data part during training. This improves the efficiency of both training and query inference, because for omitted columns we can skip DPS during training and skip progressive sampling during query inference.
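Wildcard skipping amounts to randomly replacing attribute values with a special token during training; a minimal sketch (the token encoding and masking rate are assumptions for illustration):

```python
import numpy as np

MASK_TOKEN = -1  # hypothetical encoding of the special wildcard token

def wildcard_skip(tuple_row, rng, p_mask=0.3):
    """Randomly replace attribute values with the wildcard token, so the
    model learns to answer queries that leave those columns unfiltered."""
    row = np.array(tuple_row)
    mask = rng.random(len(row)) < p_mask
    row[mask] = MASK_TOKEN
    return row
```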
4.7. Remarks
We call the versions of UAE trained only with data and only with queries UAE-D and UAE-Q, respectively. We make several remarks as follows:


UAE can accurately capture complex attribute correlations without independence or uniformity assumptions because of its deep autoregressive model architecture;

By learning from both data and query workload, UAE is further forced to produce more accurate estimates in the data regions accessed by the workload. Meanwhile, UAE can maintain the knowledge of overall data distribution;

In fact, UAE-D is equivalent to Naru (Yang et al., 2020). We thus claim that UAE generalizes Naru;

We opt for Q-error as the loss function for UAE-Q because it is consistent with our evaluation metric.

A distinct feature of UAE-Q is that, unlike other supervised methods (Wu et al., 2018; Ortiz et al., 2019; Hasan et al., 2020; Sun and Li, 2019) or sampling-enhanced ML models (Kipf et al., 2019) for cardinality estimation, which are all discriminative, the deep autoregressive model-based UAE-Q is a generative model. To the best of our knowledge, UAE-Q is the first supervised deep generative model for cardinality estimation.

When used to estimate the cardinality of a query, UAE relies only on its model weights, without scanning the data. Thus the estimation process is convenient and efficient, especially when accelerated by advanced GPU architectures.

By switching between UAE-D and UAE-Q, UAE can learn from new data or query workloads incrementally, without being retrained. To the best of our knowledge, no other single deep learning-based model for cardinality estimation achieves both goals of incremental learning, although this follows naturally from UAE's construction strategy.
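The Q-error metric referred to in the remarks is the multiplicative factor by which an estimate deviates from the true cardinality; a minimal implementation (clamping both sides at 1 to avoid division by zero is a common convention we assume here):

```python
def q_error(estimate, truth):
    """Q-error: max(est/true, true/est); 1.0 is a perfect estimate."""
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)
```

Because it is symmetric in over- and under-estimation and multiplicative, it is well suited both as an evaluation metric and as a training loss.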
5. Experimental Results
We conduct comprehensive experiments to answer the following research questions.


RQ1: Compared to state-of-the-art cardinality estimation models, how does UAE perform in accuracy (Section 5.2)?

RQ2: How do different hyperparameters (e.g., the temperature τ and the trade-off parameter λ) affect the results of UAE (Section 5.3)?

RQ3: How well can UAE incrementally incorporate new data and query workloads (Section 5.4)?

RQ4: How long does it take to train UAE and how efficiently can it produce a cardinality estimate (Section 5.5)?

RQ5: How does UAE impact a query optimizer (Section 5.6)?
5.1. Experimental Settings
5.1.1. Datasets
We use three real-world datasets with different characteristics for single-table experiments as follows:


DMV (Zanettin, 2019). This dataset consists of vehicle registration information in New York. We follow the preprocessing strategy of previous work (Yang et al., 2020) and obtain 11.6M tuples and 11 columns after preprocessing. The 11 columns have widely different data types and domain sizes ranging from 2 to 2101. We also use DMV-large, which includes columns with very large NDVs (e.g., a 100% unique VIN column and a 31K-value CITY column) and has 16 columns. This dataset is used to evaluate the sensitivity of the compared methods to very large NDVs. We find that the results provide similar conclusions to those on DMV, and thus we do not report them here due to the space limit.

Census (72). This dataset was extracted from the 1994 Census database and consists of personal income information. It contains 48K tuples and 14 columns. The 14 columns contain a mix of categorical and numerical columns with domain sizes ranging from 2 to 123.

Kddcup98 (72). This dataset was used in the KDD Cup 98 Challenge. We use 100 columns for the experiments and use this dataset to evaluate the sensitivity of various methods to very high-dimensional data. It contains 95K tuples with domain sizes ranging from 2 to 43.
We use two statistics from probability theory to measure the skewness and correlation of the datasets: the Fisher–Pearson standardized moment coefficient (Doane and Seward, 2011) for skewness and Nonlinear Correlation Information Entropy (NCIE) (Wang et al., 2005) for correlation. Smaller values of the two measures indicate weaker skewness or correlation. The skewness measures are 4.9, 2.1, 4.7 and the correlation measures are 0.23, 0.15, 0.32 for DMV, Census and Kddcup98, respectively. Therefore, DMV and Kddcup98 have relatively strong skewness and attribute correlation while Census has weaker skewness and attribute correlation. In addition, we use the real-world IMDB dataset for experiments on join queries. IMDB was reported to have strong attribute correlation (Leis et al., 2015).
5.1.2. Query Workload
Training Queries. We follow previous work (Bruno et al., 2001; Kipf et al., 2019) to generate query workloads, as there is no real query log available for the datasets we use. Specifically, we first choose an attribute with a relatively large domain size as the bounded attribute. The bounded attribute is specified by a distribution for the centers and a target measure (Bruno et al., 2001), which is based on real usage scenarios. The distribution center is chosen uniformly within a specific range and the target measure is a target volume of 1% of the distinct values. We have also varied the selection method for the centers (e.g., following the data distribution) and the target measure (e.g., target selectivity), and the experimental results turned out to be qualitatively similar; we thus do not report these results due to the page limit. Next, for the other attributes (i.e., random attributes), we follow the method of (Kipf et al., 2019; Yang et al., 2020) to generate queries. We draw the number of filters at random, then uniformly sample columns and the corresponding filter operators. Finally, the filter literals are set from the values of a randomly sampled tuple. We generate 20K training queries for each dataset. For the join experiments, we use one template (a join table subset) out of the 18 templates in JOB-light, a 70-query benchmark used by a number of previous works, to generate 10K training queries. This template includes 3 tables: title, movie_companies, and movie_info. We set title.production_year as the bounded attribute and then choose 2~5 filters on other content columns as discussed above. This generation procedure follows (Yang et al., 2021), which produces more diversified queries than JOB-light using the same join template. We term this benchmark JOB-light-ranges-focused.
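The random-attribute step described above can be sketched as follows (the operator set, column layout, and filter bounds are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def generate_query(table, rng, max_filters=5):
    """Draw the number of filters at random, uniformly sample distinct
    columns and operators, and take filter literals from one randomly
    sampled tuple. Returns a list of (column, operator, literal)."""
    n_rows, n_cols = table.shape
    n_filters = rng.integers(1, max_filters + 1)
    cols = rng.choice(n_cols, size=n_filters, replace=False)
    ops = rng.choice(["<=", ">=", "="], size=n_filters)
    literals = table[rng.integers(n_rows)]  # one random tuple supplies literals
    return [(int(c), str(op), int(literals[c])) for c, op in zip(cols, ops)]
```

Taking literals from an actual tuple guarantees every generated predicate is satisfiable by at least one row, which keeps the training workload informative.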
Test Queries. Apart from performance on the training workload (i.e., in-workload queries), we also evaluate whether the estimators are robust to out-of-workload queries. Therefore, we generate two kinds of test queries to thoroughly evaluate the estimators: (1) In-workload queries: 2K test queries generated by the same procedure as the training workload; for the join experiments we generate 1K test queries from JOB-light-ranges-focused. (2) Random queries: 2K test queries without bounded attributes, i.e., all attribute filters are generated randomly, to evaluate the robustness of different models to workload shifts; for the join experiments we use JOB-light as it contains no focused information.
Workload Characteristics. Figure LABEL:fig.cards plots the selectivity distributions of the 2K in-workload and random queries on all datasets. We observe that 1) the selectivities of the generated workloads are widely spread, and 2) random queries have much wider selectivity spectra than in-workload queries because in-workload queries have an additional bounded column. Note that although the training and test in-workload queries share the same generation procedure, we manually ensure that every training query differs from every test query.
5.1.3. Performance Metric
We use Q-error, the larger ratio between the estimated and the true cardinality, as the performance metric.
5.1.4. Baseline Methods
We compare UAE (source code available at https://github.com/pagegitss/UAE) to 9 cardinality estimation methods, including state-of-the-art and the newest methods.
Query-driven Models:


MSCN-base (Kipf et al., 2019). This query-driven deep learning (DL)-based method uses a multi-set convolutional neural network for answering correlated joins. For each predicate, it featurizes the attribute and operator using one-hot vectors and normalizes the value. It then concatenates the averaged results over the predicate set as the query encoding. We use two layers (256 hidden units) of multi-layer perceptrons, the default setting of (Kipf et al., 2019), on the query encoding. We apply the code from (48). Note that the original MSCN was proposed to handle join queries; we adapt MSCN to single-table queries by dropping the join module. We also evaluate another query-driven DL method, Sup (Hasan et al., 2020), and find that its performance is similar to MSCN's.
LR (Kutner et al., 2005). This method first represents a query as the concatenation of the domain ranges of its predicates (following (Dutt et al., 2019)) and trains a linear regression model on the query representation. We use this method as a non-DL query-driven counterpart to demonstrate the effectiveness of DL-based query-driven methods (MSCN-base).
Data-driven Models:


Sampling. This method keeps a portion of tuples sampled uniformly from the dataset and scans them to estimate query cardinalities.

DeepDB (Hilprecht et al., 2020). This method models the joint data distribution by learning relational sum-product networks, based on the structure of Sum-Product Networks (SPNs) (Poon and Domingos, 2011). The number of samples per SPN for learning the ensemble is set to 1M. We use its open-sourced code (10). DeepDB is a deep but non-neural model, which serves as a proxy for comparing neural deep models (Naru, UAE) against non-neural deep models.
Naru (Yang et al., 2020). Naru is equivalent to UAE-D. We extend the open-sourced code from (50) because the original code does not support two-sided queries. We also compare with MADE (Hasan et al., 2020), which also uses deep autoregressive models and whose performance is close to Naru's (Yang et al., 2020). For join queries, we compare UAE with NeuroCard (Yang et al., 2021), a concurrent work that extends Naru to joins.
Hybrid Models:


MSCN+sampling (Kipf et al., 2019). This method uses estimates on materialized sampled tuples as additional inputs to MSCN-base. We use this method to demonstrate the advantages of leveraging both data and workload information.

FeedbackKDE (Heimel et al., 2015). This method utilizes query feedback to adjust the bandwidth in KDE (Gunopulos et al., 2005). We apply the code from the authors (14) and modify it to run with more than 10 columns. The squared Q-loss function and the Batch variant are adopted for bandwidth optimization.
We also compared with STHoles (Bruno et al., 2001), Postgres (57) and MHIST (Poosala et al., 1996). Their performance is worse than that of the 9 methods above, and thus we do not report them here. The numbers of sample tuples in the two KDE-based methods (KDE and FeedbackKDE) and the two sampling-based methods (Sampling and MSCN+sampling) are set to match the memory budget of our model for a fair comparison. For the Sampling method, the sample ratios are 0.2%, 9%, and 4.6% for DMV, Census and Kddcup98, respectively. We also test two other sampling budgets, one smaller and one larger; the results do not change the conclusions in this paper, and thus we do not report them here. Additionally, for a fair comparison, the two autoregressive-based models (UAE and Naru) share the same model architecture with 2 hidden layers. We also turn on column factorization on IMDB due to its high-cardinality columns. The space consumption of the autoregressive model is then 2.0MB, 0.5MB, 3.45MB and 4.1MB on DMV, Census, Kddcup98 and IMDB, respectively. In addition, the two deep autoregressive-based methods, Naru and UAE, rely on progressive sampling for answering range queries. For a fair comparison, the number of estimation samples is set to 200 on DMV in-workload queries and Census, and to 1K on DMV random queries, Kddcup98 and IMDB, for both of them, because we empirically found that this setting strikes a good balance between estimation accuracy and overhead; further increasing the number of samples does not significantly improve accuracy but increases the estimation overhead. In UAE, the temperature τ and the number of samples in DPS are set to 1 and 200, respectively, on all datasets. The trade-off parameter λ is set to its tuned best value (Section 5.3) on the three single-table datasets and to 10 on the IMDB dataset.
All the experiments were run on a machine with a Tesla V100 GPU and a 20 cores E52698 v4 @ 2.20GHz CPU.
5.2. Performance Comparison
Tables 3~5 show the experimental results of all models on both in-workload queries and random (out-of-workload) queries. The results show that UAE matches or significantly exceeds the best estimator across the board, not only in terms of mean and median but also in terms of max error, which demonstrates the robustness of UAE in handling the tail of difficult queries. From these experimental results, we draw several major findings as follows.
(1) UAE-Q outperforms other supervised methods in most cases, and all of them are vulnerable to workload shifts. We observe that the proposed UAE-Q outperforms LR and MSCN-base in most cases. We also observe that for all the supervised methods, the accuracy on random queries is much worse than that on in-workload queries. This indicates that supervised models may learn the data distribution in the data regions the training workload focuses on, but are vulnerable to workload shifts. For example, on DMV, when moving from in-workload queries to random queries, MSCN-base degrades substantially in both mean and max error.
(2) Unsupervised methods are more robust to workload shifts but still produce large errors at the tail. The performance gap of unsupervised methods between in-workload and random queries is much smaller than that of supervised methods. Nevertheless, these unsupervised models still have large worst-case errors, likely due to their optimization target of minimizing the average error. For instance, on the DMV dataset, Naru produces a max error of 108.
(3) DL-based methods outperform non-DL methods, especially at the tail. For supervised methods, the deep learning (DL) model MSCN-base performs significantly better than the traditional machine learning method LR. Also, for unsupervised methods, the deep learning methods Naru and DeepDB usually perform better than non-DL methods (e.g., BayesNet), especially in mean and max errors. These results demonstrate that DL models can better capture complex data correlations than non-DL models.
(4) KDE-based methods suffer from large domain sizes. The KDE-based methods (KDE and FeedbackKDE) perform poorly on datasets with large domain sizes (e.g., DMV). Moreover, we find that FeedbackKDE cannot significantly enhance the performance of KDE. It is likely that KDE-based methods suffer on these datasets inherently or that the bandwidths computed by FeedbackKDE are not optimal on them.
(5) DeepDB performs relatively well on datasets with weak attribute correlation but degrades markedly on datasets with strong attribute correlation. On the dataset with weaker attribute correlation (i.e., Census), DeepDB offers accurate estimates across all error quantiles. Nevertheless, on DMV, which has strong attribute correlations, DeepDB's performance drops quickly, especially at the tail, since the independence assumption in DeepDB's sum-product networks does not hold for this dataset.
(6) Deep autoregressive models suffer from high-dimensional data and SPNs might suffer from high NDVs. We draw interesting conclusions from the comparison between the deep autoregressive model-based methods (Naru, UAE) and the SPN-based method (DeepDB). On the two datasets with relatively high domain sizes (2K for DMV, 100% unique for DMV-large), DeepDB may suffer at the tail (e.g., a large max error on DMV random queries) while the deep autoregressive model-based methods achieve much more stable accuracy. This is likely because the histograms used in the leaf nodes of DeepDB cannot accurately model the distributions of attributes with high NDVs, while deep autoregressive models can capture them well because they consider the probability of each distinct value at the output layer. On the contrary, on Kddcup98 with 100 attributes, deep autoregressive models may degrade at the tail (e.g., Naru makes a max error of 690 on random queries; although UAE improves on Naru, it still makes a max error of 345), but DeepDB achieves a very low max error. This is likely because a higher-dimensional dataset contains more independent attributes. In this case, the autoregressive decomposition might introduce noise into model learning, since it "forces" the model to learn correlations between independent attributes. DeepDB does not have this problem, as its SPNs successfully separate those independent attributes into different groups for this dataset. The result indicates that a promising direction for future work is to combine the best of deep autoregressive models and SPNs for high-dimensional datasets with high NDVs.
(7) Additional data information boosts supervised methods by a large margin. Compared to MSCN-base, MSCN+sampling achieves much better performance on all datasets. The improvements become more obvious on random queries. The results demonstrate that including the estimates from sampling as extra features alongside query features improves the accuracy of the neural networks in MSCN. We also note that integrating queries as supervised information into KDE does not help on the three datasets.
(8) UAE outperforms both of its two modules. For example, on DMV both UAE-D and UAE-Q have a max error of 108, whereas UAE achieves a max error of 5.0, which greatly improves the tail behaviour. This demonstrates the effectiveness of UAE's unified modeling and training for cardinality estimation.
(9) UAE achieves the best performance on in-workload queries while maintaining robustness on random queries. UAE achieves the best overall results (in mean and max errors) on in-workload queries on all single-table datasets. For instance, on DMV, UAE outperforms the second best method, Naru, at the tail. Also, UAE produces the lowest median errors on in-workload queries on most datasets. Additionally, UAE achieves the best or comparable overall performance on join queries. As shown in Table 5, on JOB-light-ranges-focused, UAE produces the lowest median error and beats the two newest data-driven models (DeepDB, NeuroCard) by a large margin across all error quantiles. Although MSCN+sampling outperforms UAE at the tail on this workload, its performance drops on random queries (JOB-light) while UAE's does not. Surprisingly, UAE can even achieve the best overall performance on random queries on DMV and Census, e.g., 7.0 vs. 21.0 in max error on Census compared to the second best method, DeepDB. This is likely because the supervised component using the query workload forces UAE to offer accurate estimates on some tricky data regions (e.g., long-tail data) that all other models fail to handle. In addition, UAE's performance on random queries on the other datasets matches or outperforms that of Naru (or NeuroCard). These results demonstrate that UAE can effectively leverage the query workload as supervised information to enhance the unsupervised autoregressive model, while keeping the knowledge of the overall data distribution.
5.3. Hyperparameter Studies
We report in-depth experimental studies to explore the best hyperparameter settings of UAE. The results on all the datasets are qualitatively similar, so we report the results on DMV only.
Impact of Temperature τ. The temperature τ is used in the Gumbel-Softmax algorithm, which works in the supervised part of UAE (i.e., UAE-Q). We have to isolate the influence of UAE's unsupervised part (i.e., UAE-D) when training with UAE-Q. To this end, we first train UAE only with UAE-D to obtain the overall data knowledge. UAE is then refined by UAE-Q with various settings of τ. Specifically, 10K training queries are generated following the procedure described in Section 5.1.2, and the evaluation is conducted on 2K in-workload queries, since we are interested in the effect of τ on the performance of UAE's supervised part. As discussed in (Jang et al., 2017), a fixed τ between 0.5 and 1.0 yields good results empirically. We thus evaluate the candidate values {0.5, 0.75, 1.0, 1.25} for τ. In addition, we use 2K samples for estimation because we are interested in the limits of UAE as influenced by τ; lower numbers of estimation samples share the same trend. We empirically find that at τ = 1.0, UAE achieves the lowest estimation errors.
Impact of the Number of Training Samples in DPS. Like τ, this parameter also belongs to UAE-Q. Therefore, we use the same experimental setting as for τ. We evaluate the values {50, 100, 200, 400}, as larger numbers beyond 400 significantly increase the training overhead (i.e., consume more GPU memory). From Figure 4 (a), we observe that if the number of estimation samples is set to 200, the best setting for the number of training samples is 200 as well; if the number of estimation samples is increased to 2K, 400 becomes the best setting. In most cases UAE can offer very accurate estimates on in-workload queries with 200 samples. Furthermore, considering the training and estimation overheads, we suggest 200 training samples for DPS on these datasets.
Impact of the Trade-off Parameter λ. The trade-off parameter λ rescales the losses produced by the two parts of UAE, UAE-D and UAE-Q. We thus use the same query workload as described in Section 5.1.2. Figure 4 (b) shows the performance of UAE on both in-workload and random queries as λ is varied, from which we identify UAE's best setting of λ. Moreover, when λ grows beyond this value, the performance drops quickly on both kinds of queries, indicating that putting too much emphasis on UAE-Q negatively affects model training and is not encouraged.
5.4. Incremental Data and Query Workload
This experiment studies the incremental learning ability of UAE. Since the ability of autoregressive models to incorporate incremental data has been demonstrated in previous work (Yang et al., 2020; Hasan et al., 2020), UAE, being based on an autoregressive model, can inherently handle incremental data; we therefore do not repeat that experiment in this paper. Beyond the previous work, we aim to show that UAE is also able to incorporate incremental query workloads, which previous work (Yang et al., 2020; Hasan et al., 2020) cannot. To this end, using the same procedure as in Section 5.1.2, we generate 5 parts of query workload with different query centers for the bounded column, i.e., each part of the workload focuses on a different data region. Each part consists of 4K training queries and 200 in-workload test queries, since our goal is to demonstrate UAE's effectiveness at incorporating new query workloads. After training UAE with the underlying data, we ingest each partition of the query workload in order, following the experimental setting for incremental data in (Yang et al., 2020). Evaluations are conducted after each ingestion using the test queries in that partition.
Ingested Partitions  1  2  3  4  5 
Naru: mean  1.035  1.047  1.152  1.197  2.903 
UAE: mean  1.031  1.039  1.095  1.132  1.073 
We compare the refined UAE to the model trained only with the underlying data (i.e., Naru), which cannot further ingest incremental query workloads, on DMV. Table 6 shows the mean errors of both methods, estimated with 200 samples. From the table, we observe: (1) due to its inability to leverage query workloads, the performance of Naru is not stable across queries of various workloads; (2) UAE offers consistently accurate estimates after being refined by each query workload, which demonstrates UAE's ability to effectively ingest incremental query workloads.
5.5. Training Time and Estimation Efficiency
An epoch of UAE training takes about 363 seconds, 62 seconds, and 657 seconds on DMV, Census, and Kddcup98, respectively. We report how the max error (estimated with 200 samples) changes as training progresses on Census in-workload queries in Figure LABEL:fig.timings (1). We observe that about 13 epochs of training yields a single-digit max error for UAE, namely 9.0.
On all datasets, UAE can produce estimates in around 10ms on a V100 GPU. Figure LABEL:fig.timings (2) shows the estimation latencies of different estimators on DMV. As shown in the figure, UAE produces estimates with reasonable efficiency, much faster than the sampling-based methods (MSCN+sampling, Sampling).
5.6. Impact on Query Optimization
We proceed to evaluate the impact of UAE on query optimization, compared to PostgreSQL and NeuroCard. We follow the procedure of (Cai et al., 2019) and modify the source code of PostgreSQL to allow it to accept external cardinality estimates. Then, for each query, we collect the cardinality estimates of its subqueries returned by different estimators and inject them into the modified PostgreSQL. We use the JOB-M (Leis et al., 2015) benchmark as the testbed for this case because it has a more complex join schema, which is more challenging for query optimization. We generate 50 test queries using a template of JOB-M (including 6 tables and multi-way joins), following the generation procedure of JOB-light-ranges-focused. For training UAE, we use the same template to randomly generate 10K subqueries (including 2~5 tables). Figure LABEL:fig.qo shows the impact of the cardinality estimates from NeuroCard and UAE on query performance compared to PostgreSQL.
We have two major findings. First, more accurate cardinality estimates from deep autoregressive model-based estimators translate into better query plans in PostgreSQL's query optimization. Second, for in-workload queries, UAE results in equivalent or better query plans, improving the quality of query optimization without any significant slowdown compared with PostgreSQL and NeuroCard.
6. Conclusions and Future Work
We propose a novel unified deep autoregressive model, UAE, that is able to utilize both data as unsupervised information and query workloads as supervised information for cardinality estimation. Experiments demonstrate that UAE achieves the four goals set out in Section 1.
We see this work as a first step toward a unified deep learning model that trains a single model exploiting both data information and workload information for cardinality estimation. We expect our unified model to inspire future cardinality estimation models that fuse data information and query workloads, and we believe it opens interesting and promising research directions. For example, exploring the power of UAE-Q for database generation is a very promising direction. The generative nature of UAE-Q allows us to efficiently sample tuples from the model. This is not the case for other supervised models, because it is hard to obtain the normalizing constant (Chow and Teicher, 2003) of the data probability for those models. This characteristic makes UAE-Q suitable for database generation for DBMS testing and benchmarking (Arasu et al., 2011; Lo et al., 2010; Li et al., 2018), another important task in big data management.
Acknowledgements.
This research was conducted at the Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund ‐ Industry Collaboration Projects Grant. This work was also supported in part by a MOE Tier-2 grant MOE2019-T2-2-181, a MOE Tier-1 grant RG114/19, and an NTU ACE grant. We would like to thank Zizhong Meng (NTU) for helping with some of the experiments, and the anonymous reviewers for providing constructive feedback and valuable suggestions.
References
 Self-tuning histograms: building histograms without looking at data. ACM SIGMOD Record 28 (2), pp. 181–192. Cited by: §1, Table 1, §2.
 Learning to accurately count with query-driven predictive analytics. In 2015 IEEE International Conference on Big Data (Big Data), pp. 14–23. Cited by: §1, Table 1, §2.
 Data generation using declarative constraints. In SIGMOD, pp. 685–696. Cited by: §6.
 Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §4.2.
 STHoles: a multidimensional workload-aware histogram. In SIGMOD, pp. 211–222. Cited by: §1, Table 1, §2, §4.1, §5.1.2, §5.1.4.
 Pessimistic cardinality estimation: tighter upper bounds for intermediate join cardinalities. In SIGMOD, pp. 18–35. Cited by: §5.6.
 Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14 (3), pp. 462–467. Cited by: §1, Table 1, §2, item 4.
 Probability theory: independence, interchangeability, martingales. Springer Science & Business Media. Cited by: §6.
 Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4 (1–3), pp. 1–294. External Links: ISSN 1931-7883, Link, Document Cited by: §2.
 [10] () DeepDB. Note: https://github.com/DataManagementLab/deepdb-public/ Cited by: item 6.
 Independence is good: dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record 30 (2), pp. 199–210. Cited by: Table 1, §2.
 Measuring skewness: a forgotten statistic?. Journal of statistics education 19 (2). Cited by: §5.1.1.
 Selectivity estimation for range predicates using lightweight models. VLDB 12 (9), pp. 1044–1057. Cited by: §1, Table 1, §2, §4.1, item 2..
 [14] () Feedback-KDE. Note: https://bitbucket.org/mheimel/feedback-kde/ Cited by: item 9.
 MADE: masked autoencoder for distribution estimation. In ICML, pp. 881–889. Cited by: §1, §4.2.
 MADE: masked autoencoder for distribution estimation. Cited by: §2, §4.2.
 Selectivity estimation using probabilistic models. In SIGMOD, pp. 461–472. Cited by: §1, Table 1, §2, §3, §4.2.
 Deep learning. MIT press. Cited by: §4.2.
 MuProp: unbiased backpropagation for stochastic neural networks. In ICLR, Cited by: §4.3.
 Approximating multidimensional aggregate range queries over real attributes. ACM SIGMOD Record 29 (2), pp. 463–474. Cited by: Table 1, §2.
 Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14 (2), pp. 137–154. Cited by: Table 1, §2, item 5., item 9..
 On the relative cost of sampling for join selectivity estimation. In PODS, pp. 14–24. Cited by: Table 1, §2.
 Deep learning models for selectivity estimation of multi-attribute queries. In SIGMOD, pp. 1035–1050. Cited by: §1, §1, Table 1, §2, §2, §3, 5th item, §4.1, §4.2, item 1., item 7., §5.1.3, §5.4.
 Improved cardinality estimation by learning queries containment rates. In EDBT, Cited by: §3.
 Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pp. 1477–1492. Cited by: §1, §1, Table 1, §2, §4.1, item 9.
 DeepDB: learn from data, not from queries!. Vol. 13, pp. 992–1005. Cited by: §1, §1, Table 1, §4.1, §4.1, §4.6, item 6..
 Cost-based optimizer in Apache Spark 2.2. Note: https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html Cited by: §1.
 Joins on samples: a theoretical guide for practitioners. VLDB. Cited by: §4.6.
 CORDS: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pp. 647–658. Cited by: Table 1, §2.
 Global optimization of histograms. ACM SIGMOD Record 30 (2), pp. 223–234. Cited by: Table 1, §2.
 Categorical reparameterization with Gumbel-Softmax. In ICLR, Cited by: §1, §4.3, §5.3.
 Estimating join selectivities using bandwidth-optimized kernel density models. VLDB 10 (13), pp. 2085–2096. Cited by: §1, §1, Table 1, §2, §4.1.
 Auto-encoding variational Bayes. In ICLR, Cited by: §4.3.
 Learned cardinalities: estimating correlated joins with deep learning. In CIDR, Cited by: §1, Table 1, §2, 5th item, §4.1, item 1., item 8., §5.1.2, §5.1.3.
 Applied linear statistical models. Vol. 5, McGraw-Hill Irwin New York. Cited by: item 2.
 How good are query optimizers, really?. Proceedings of the VLDB Endowment 9 (3), pp. 204–215. Cited by: §5.1.1, §5.6.
 Cardinality estimation done right: index-based join sampling. In CIDR, Cited by: §4.6.
 Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal 27 (5), pp. 643–668. Cited by: §1.
 Touchstone: generating enormous query-aware test databases. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 575–586. Cited by: §6.
 Variable skipping for autoregressive range density estimation. Cited by: §4.6.
 SASH: a self-adaptive histogram set for dynamically changing workloads. In Proceedings 2003 VLDB Conference, pp. 369–380. Cited by: §1, Table 1, §2.
 Practical selectivity estimation through adaptive sampling. In SIGMOD, pp. 1–11. Cited by: Table 1, §2.
 Generating databases for query workloads. VLDB 3 (12), pp. 848–859. Cited by: §6.
 Selectivity estimation and query optimization in large databases with highly skewed distribution of column values. In VLDB, pp. 240–251. Cited by: Table 1, §2.
 The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §1, §4.3.
 [46] (2017) Microsoft database. Note: https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-2017 Cited by: §1.
 Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2 (1), pp. 982–993. Cited by: §4.3.
 [48] () MSCN. Note: https://github.com/andreaskipf/learnedcardinalities Cited by: item 1..
 Equidepth multidimensional histograms. In SIGMOD, pp. 28–36. Cited by: Table 1, §2.
 [50] () Naru project. Note: https://github.com/naru-project/naru/ Cited by: item 7.
 Autoregressive energy machines. Cited by: §1, §2, §4.2.
 An empirical analysis of deep learning for cardinality estimation. arXiv preprint arXiv:1905.06425. Cited by: Table 1, §2, 5th item.
 Masked autoregressive flow for density estimation. In NIPS, pp. 2338–2347. Cited by: §2, §4.2.
 QuickSel: quick selectivity learning with mixture models. In SIGMOD, pp. 1017–1033. Cited by: Table 1, §2, §3.
 Sum-product networks: a new deep architecture. In ICCV Workshops, pp. 689–690. Cited by: item 6.
 Improved histograms for selectivity estimation of range predicates. ACM SIGMOD Record 25 (2), pp. 294–305. Cited by: Table 1, §2, §5.1.4.
 [57] () PostgreSQL. Note: https://www.postgresql.org/ Cited by: §1, §5.1.4.
 Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §4.3.
 The VC-dimension of SQL queries and selectivity estimation through sampling. In ECML PKDD, pp. 661–676. Cited by: Table 1, §2.
 Principles of mathematical analysis. Vol. 3, McGraw-Hill New York. Cited by: §4.2.
 Learning internal representations by error propagation. Cited by: §4.2.
 PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. Cited by: §1.
 Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
 Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Cited by: item 5..
 Graph-based synopses for relational selectivity estimation. In SIGMOD, pp. 205–216. Cited by: §1, Table 1, §2.
 LEO - DB2's learning optimizer. In VLDB, Vol. 1, pp. 19–28. Cited by: §1, Table 1, §2.
 An end-to-end learning-based cost estimator. VLDB 13 (3), pp. 307–319. Cited by: Table 1, §2, 5th item.
 Dynamic multidimensional histograms. In SIGMOD, pp. 428–439. Cited by: Table 1, §2.
 Entropy-based histograms for selectivity estimation. In CIKM, pp. 1939–1948. Cited by: Table 1, §2.
 Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4 (11), pp. 852–863. Cited by: §4.2.
 Efficiently adapting graphical models for selectivity estimation. The VLDB Journal 22 (1), pp. 3–27. Cited by: §1, Table 1, §2.
 [72] () UCI machine learning repository. Note: https://archive.ics.uci.edu/ml/index.php Cited by: item 2., item 3..
 Multiple join size estimation by virtual domains. In Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 180–189. Cited by: Table 1, §2.
 A nonlinear correlation measure for multivariable data set. Physica D: Nonlinear Phenomena 200 (3–4), pp. 287–295. Cited by: §5.1.1.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256. Cited by: §4.3.
 Towards a learning optimizer for shared clouds. VLDB 12 (3), pp. 210–222. Cited by: Table 1, §2, 5th item.
 NeuroCard: one cardinality estimator for all tables. PVLDB. Cited by: §3, §4.6, §4.6, item 7., §5.1.2.
 Deep unsupervised cardinality estimation. Vol. 13, pp. 279–292. Cited by: §1, §1, Table 1, §2, §3, 2nd item, 3rd item, §4.1, §4.1, §4.1, §4.2, §4.6, item 1., item 7., item 4., §5.1.2, §5.1.3, §5.4.
 State of New York. Vehicle, snowmobile, and boat registrations. Note: catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations Cited by: item 1.
 Random sampling over joins revisited. In SIGMOD, pp. 1525–1539. Cited by: §4.6.