Selectivity Estimation of Inequality Joins In Databases

06/15/2022
by   Diogo Repas, et al.

Selectivity estimation refers to the ability of the SQL query optimizer to estimate the size of the results of a predicate in the query. It is the main calculation based on which the optimizer can select the cheapest plan to execute. While the problem has been known since the mid-1970s, we were surprised to find no solutions in the literature for the selectivity estimation of inequality joins. By testing four common database systems: Oracle, SQL-Server, PostgreSQL, and MySQL, we found that the open-source systems PostgreSQL and MySQL lack this estimation. Oracle and SQL-Server make fairly accurate estimations, yet their algorithms are secret. This paper thus proposes an algorithm for inequality join selectivity estimation. The proposed algorithm has been implemented in PostgreSQL and submitted as a patch for inclusion in a future release.


1. Introduction

Query optimization is the overall process of generating the most efficient query plan given an SQL statement. The query optimizer, responsible for this process, applies equivalence rules to reorganize and merge the operations in the query, finds the fastest execution plan, and feeds it to the executor. It examines multiple access methods, such as sequential table scans or index scans, different join methods such as nested loops and hash joins, different join orders, sub-query normalization, materialized views, and other possible transformations. Starting with a naively generated query plan, the optimizer generates a set of equivalent plans. To choose the most efficient plan, almost all systems adopt a cost-based approach, which dates back to the architectures of System R (Chamberlin et al., 1981) and Volcano/Cascades (Graefe, 2014; Graefe and McKenna, 1991).

In cost-based query optimization, the optimizer estimates the cost of the alternative query plans and chooses the plan with minimum cost. The cost is estimated in terms of the CPU and I/O resources that the query plan will use. A central component in cost estimation is the Selectivity Estimation (SE). SE collects statistics for all attributes in a relation, such as data distribution histograms, most common values, null percentage, etc. These statistics are then used during planning time to estimate the number of tuples generated by a predicate in the query. A smaller selectivity value means a smaller size for intermediate results, which is favorable for a more efficient execution. The cost-based optimizer thus reorders the selection and join predicates to quickly reduce the sizes of intermediate results.

Since, in general, the cost of each operator depends on the size of its input relations, it is important to provide good estimations of their selectivity, that is, of their result size, to the query optimizer (Pitoura, 2009). Inaccurate selectivity estimations can cause inefficient query plans to be chosen, sometimes leading to orders of magnitude longer execution times (Lan et al., 2021).

There is a trade-off between the size of the stored statistics and the complexity of the estimation algorithm on the one hand, and the estimation accuracy on the other. Recent research thus focuses on using machine learning methods to capture the data distribution into compact models. While there are good results in this research direction (Wang et al., 2021), common relational database systems continue to use traditional statistics structures, mostly based on histograms. A histogram can be used as a discrete approximation of the probability density function of an attribute.

Despite the popularity of histograms, there is a lack of theory on how to use them in estimating inequality join selectivity. This paper aims at filling this gap, and presents the following main contributions:

  • A novel algorithm for join selectivity estimation of inequality operators using histogram statistics

  • The implementation of this algorithm in PostgreSQL both for scalar inequality joins, as well as for multiple operators of range types

  • An extension of the algorithm that also takes advantage of additional statistics, when available.

  • The proposed algorithm has been implemented in PostgreSQL and submitted as a patch for inclusion in a future release (https://github.com/DRepas/postgres/tree/rangejoinsel).

Section 2 starts by reviewing existing work in selectivity estimation. A running example to be used throughout the paper is given in Section 3.1. Then, definitions, notation, and terminology are introduced in Section 3.2. The algorithm presented in this paper consists of mapping the problem of selectivity estimation to a probability theory problem, as described in Section 4. The algorithm for join selectivity estimation is developed in Section 5. That section also develops a way to incorporate null values and Most Common Values (MCV) statistics in the estimation model. A mapping of this model to several range operators is also presented. Finally, an implementation in PostgreSQL and the experimental evaluation are provided in Section 6.

2. Review of Related Work

A survey on DBMS query optimizers was published in 2021 (Lan et al., 2021), which categorizes cardinality estimation methods into synopsis-based methods, sampling-based methods, and learning-based methods. Many learning-based methods (Hasan et al., 2020; Kipf et al., 2018; Yang et al., 2019) have been proposed in recent years and show better accuracy than traditional methods, but several issues remain to be solved before they can be used in real systems, such as the cost of model training and updating, and the black-box nature of learning algorithms (Wang et al., 2021). Sampling-based methods estimate selectivity by executing a (sub)query on samples collected from the tables; their accuracy depends on the degree to which the samples fit the original data distribution (Lan et al., 2021). These methods, however, suffer from a high cost of storage and retrieval time, especially when the tables are very large. Another limitation of sampling-based methods is that they currently only support equality join selectivity estimation (Lan et al., 2021). Histograms, as a form of synopsis-based method, have been extensively studied (Lan et al., 2021; Hasan et al., 2020) and are widely adopted for selectivity estimation in common database systems (Cormode et al., 2012), including MySQL, PostgreSQL, Oracle, and SQL Server (team at Oracle, ; Group, 1996d; Oracle, 2021; Server, 2022).

MySQL uses two histogram types for selectivity estimation (team at Oracle, ). One is the singleton histogram, which stores the distinct values and their cumulative frequencies. The other is the equi-depth histogram, called equi-height in the MySQL documentation; it stores the lower and upper bounds, cumulative frequency, and the number of distinct values for each bucket. However, the usage of histograms is limited to restriction selectivity estimation (team at Oracle, ), i.e., the selection operator. For join selectivity estimation, MySQL naively returns a constant: 0.1 for equality joins and 0.3333 for inequality joins (team at Oracle, 2000). The following is an excerpt of the MySQL source code in which these constants are defined (team at Oracle, ):

/// Filtering effect for equalities: col1 = col2
#define COND_FILTER_EQUALITY 0.1f
/// Filtering effect for inequalities: col1 > col2
#define COND_FILTER_INEQUALITY 0.3333f

PostgreSQL also uses histograms as optimizer statistics (Group, 1996d). An analysis of its manual (Group, 1996a), as well as its source code, shows that it uses equi-depth histograms. In contrast to MySQL, the number of distinct values in each bucket is not stored. For this reason, PostgreSQL does not use these histogram statistics in estimating equi-join selectivity. It rather uses a singleton histogram of Most Common Values (MCV) (Group, 1996a). As for inequality join selectivity estimation (<, <=, >, >=), a default constant value of 0.3333 is returned (Group, 1996b).

The following is an excerpt of the PostgreSQL source code in which these constants are defined (Group, 1996c):

/* default selectivity estimate for equalities such as "A = b" */
#define DEFAULT_EQ_SEL  0.005
/* default selectivity estimate for inequalities such as "A < b" */
#define DEFAULT_INEQ_SEL  0.3333333333333333
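To illustrate what such a constant implies (the numbers below are ours, chosen purely for illustration), the estimated result size of an inequality join becomes the constant multiplied by the size of the Cartesian product, regardless of the actual data distribution. For example, for two relations of 1,000 and 200 rows, PostgreSQL's default gives

$$\hat{n} \approx 0.3333 \times 1{,}000 \times 200 \approx 66{,}667 \text{ rows}$$

whatever the values stored in the two attributes happen to be.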

Oracle Database uses three types of histograms to capture the data distribution of a relation's attribute (Oracle, 2021): singleton histograms (referred to as frequency histograms and top frequency histograms in the official documentation), equi-depth histograms (referred to as height-balanced in the official documentation), and hybrid histograms (a combination of equi-depth and frequency histograms). The type of histogram is determined based on specific criteria to fit different situations. The official documentation (Oracle, 2021) also states some factors behind their selectivity estimation algorithms, such as endpoint numbers (the unique identifier of a bucket, e.g., the cumulative frequency of buckets in frequency and hybrid histograms) and endpoint values (the highest value in a bucket), and whether column values are popular (an endpoint value that appears multiple times in a histogram) or non-popular (every column value that is not popular). However, the details of these estimation algorithms are not published. A few online articles, in the form of hacker blogs, have attempted experimental analyses to guess how selectivity estimation works in Oracle Database, but they did not yield a clear algorithm (Lewis, 2006; Dell'Era, 2008).

SQL-Server is another popular closed-source DBMS. Due to its proprietary nature, implementation details are scarce. According to the official documentation (Server, 2022), a proprietary kind of histogram, with an associated density vector, is built in three steps for each attribute. The official documentation (Server, 2021; Sack, 2014) describes four core assumptions for the selectivity estimation: independence when no correlation information is available, uniformity within histogram bins, inclusion when filtering a column with a constant, and containment when joining distinct values from two histograms (Bruno and Chaudhuri, 2002). Although the white paper (Sack, 2014) is a publication from SQL Server that deals with the problem of selectivity estimation, it does not explain the algorithm used for join selectivity. As with Oracle, the implemented algorithm is a secret.

To identify whether any informed selectivity estimation takes place when performing inequality joins in SQL-Server and Oracle, we performed the following experiment. Two attributes, T1 and T2, with 1000 and 200 rows respectively, were randomly generated by sampling the range [0, 100] uniformly. They were then joined using the less than (<) operator. Both databases made fairly accurate selectivity estimations for this inequality join: Oracle Database had an estimation error of 3%, and SQL Server a smaller error of just 0.29%. As such, we know that both systems implement good estimation algorithms.

In conclusion, although learning-based methods have become a popular research direction for selectivity estimation in recent years, histograms are still the most commonly used statistics for this purpose in existing DBMS. The recurring types of histogram statistics are equi-depth histograms approximating the distribution of values, and singleton histograms of Most Common Values. As our investigation indicates, MySQL and PostgreSQL do not implement algorithms for join selectivity estimation and instead use predefined constants. On the other hand, the popular commercial DBMS (SQL-Server and Oracle) implement algorithms based on histograms, but we could not find any source describing them. This paper addresses this gap by proposing such an algorithm.

3. Preliminaries

This paper presents a formal model to reason about two different selectivity estimation types:

  • Restriction selectivity estimation: when one of the sides of the operator is an attribute of a relation and the other is a constant value.

    • Example: SELECT * FROM R1 WHERE R1.X < 100

  • Join selectivity estimation: when both sides of the operator are attributes of different relations.

    • Example: SELECT * FROM R1, R2 WHERE R1.X < R2.Y

Selectivity estimation of operations where both sides of the operator are attributes of the same relation, with no join or Cartesian product involved (Example: SELECT * FROM R1 WHERE R1.x < R1.y), is not addressed by this paper.

The selectivity of an operator is the fraction of values kept in the result after selection. In the case of restriction selectivity, the denominator is the input relation’s size. In the case of join selectivity, the denominator is the input relations’ Cartesian product size (their sizes multiplied). This fraction can be interpreted as the probability that a given randomly selected tuple from the input relation, or from the Cartesian product of input relations in the case of joins, is selected by the operator being considered.

The focus of the next sections will be on the restriction selectivity estimation of the less than (<) operator. The restriction selectivity estimation of all other scalar inequality operators will be derived from this initial estimation. We will also generalize it to develop the join selectivity estimation. The restriction selectivity estimation in the next section is already implemented by all common database systems, and is thus not a novel contribution of this work. We however formulate it as a probability problem, and develop the join selectivity estimation on top of it, to maximize code reuse in these systems.

The attributes being restricted or joined will be treated as random variables that follow a distribution modeled by a Probability Density Function (PDF) and/or a Cumulative Distribution Function (CDF).

3.1. Running Example

For demonstration purposes, relations R1 and R2 will be used throughout this paper. For each relation, 12 integers were manually selected to cover as many corner cases as possible when using equi-depth histograms (introduced in section 3.2), such as skew and common bin boundaries.

3.2. Histogram Statistics

Histograms are commonly used to approximate the PDF of an attribute by grouping the values in uniform bins. Each bin is an interval of the form $[b_i, b_{i+1})$, where $b_i$ and $b_{i+1}$ are values from the domain of the attribute. It is important to note that the side on which the interval is open or closed is not relevant for the purposes of this paper, as all estimations correspond to integration over a continuous domain, where singular points do not affect the final result. By defining a bin this way (using intervals), this paper restricts itself to domains where a total order exists. This excludes categorical data, for which the idea of an equi-depth histogram does not apply.

Let $h_i$ be the fraction of values of the attribute that is represented by the histogram bin $i$, i.e., the height/depth of the histogram bin:

  • Singleton histograms are such that each bin refers to the frequency of a single element. They are typically used to collect Most Common Values statistics (introduced in section 5.3).

  • Equi-width histograms are such that each bin has the same width. That is, $b_{i+1} - b_i$ is constant.

  • Equi-depth histograms are such that each bin has the same depth (height) but varying width. That is, $h_i = 1/n$, where $n$ is the total number of bins.

For selectivity estimation, equi-depth histograms are favoured because the resolution of the histogram adapts to skewed value distributions. Typically, the histogram is smaller than the original data. Thus, it cannot represent the true distribution entirely and some assumptions are induced, e.g., uniformity on a single attribute, and independence assumption among different attributes. The use of equi-depth histograms is also motivated by their prevalence in current RDBMS. These are usually constructed through the use of random sampling of the original relations.

For demonstration purposes, Figure 1 shows a graphical representation of the equi-depth histograms of the attributes R1.X and R2.Y of the running example. Both histograms have three bins, meaning that the fraction accounted for by each bin is 1/3. The bin boundaries of the two histograms can be read from the figure.

Figure 1. Equi-depth histograms of R1.X and R2.Y with 3 bins each.

Using histograms as a statistical representation of attributes involves the following implicit assumptions:

  • The data is distributed uniformly inside each bin

  • The histograms are complete (they account for all the data points), that is, $\sum_{i=1}^{n} h_i = 1$

In practice, these two assumptions do not strictly hold. The data is usually not uniformly distributed inside each bin. The more bins used in the histograms, the smaller the error introduced by this assumption. Database systems, e.g., PostgreSQL, typically create the histogram using a random sample of the attribute values, especially when the number of tuples is too large. The assumption of completeness of the histogram might be broken in the presence of sampling. When the sample is representative of the underlying data, the estimation is still fairly accurate.

Given the equi-depth histogram of an attribute $X$, with $n$ bins delimited by the boundaries $b_0 < b_1 < \dots < b_n$, one can derive its approximate PDF and CDF as shown next. Let $f_X(c)$ and $F_X(c)$ denote the PDF and the CDF of $X$, respectively, at a given value $c$; then:

$$f_X(c) = \begin{cases} \dfrac{1}{n\,(b_{i+1} - b_i)} & \text{if } b_i \le c < b_{i+1} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

$$F_X(c) = \begin{cases} 0 & \text{if } c < b_0 \\ \dfrac{i}{n} + \dfrac{1}{n} \cdot \dfrac{c - b_i}{b_{i+1} - b_i} & \text{if } b_i \le c < b_{i+1} \\ 1 & \text{if } c \ge b_n \end{cases} \qquad (2)$$

When $b_i \le c < b_{i+1}$, formula 1 is derived from the definition of an equi-depth histogram, where each bin represents $1/n$ of the data, spread over a width of $b_{i+1} - b_i$.

Formula 2 is derived from the following:

$$F_X(c) = \int_{-\infty}^{c} f_X(x)\,dx = \int_{b_0}^{c} f_X(x)\,dx = \sum_{k=0}^{i-1} \frac{1}{n} + \int_{b_i}^{c} f_X(x)\,dx,$$

where $b_i \le c < b_{i+1}$. This last formula performs linear interpolation within the bin where $c$ is contained, thus assuming a uniform distribution of values within the bin. The assumption that the histogram is complete is reflected in substituting the infinite lower bound by $b_0$. In the running example, the PDF and CDF of R1.X are obtained by instantiating these formulas with the bin boundaries shown in Figure 1.
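As a small worked instance of formulas 1 and 2 (with hypothetical bin boundaries, not those of the running example), consider a histogram with $n = 4$ bins in which the value $c = 30$ falls into the bin $[20, 40)$, preceded by $i = 2$ whole bins:

$$f_X(30) = \frac{1}{4\,(40 - 20)} = 0.0125, \qquad F_X(30) = \frac{2}{4} + \frac{1}{4}\cdot\frac{30 - 20}{40 - 20} = 0.625$$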

4. A Formal Model for Selectivity Estimation

We first start by formalizing the problem of restriction selectivity estimation for the less than (<) operator. Suppose the goal is to estimate the selectivity of the following operation (expressed in SQL):

SELECT *
FROM R1
WHERE R1.X < c

where c is a constant. Treating the attribute R1.X as a random variable $X$, estimating the selectivity of the above operation is equivalent to finding $P(X < c)$. Given the PDF or the CDF of $X$, $f_X$ or $F_X$ respectively, the selectivity of the operation above can be formalized as:

$$P(X < c) = \int_{-\infty}^{c} f_X(x)\,dx = F_X(c) \qquad (3)$$

Suppose now that the goal is estimating the join selectivity for the less than (<) operator. That is, we want to estimate the selectivity of the following operation (expressed in SQL):

SELECT *
FROM R1, R2
WHERE R1.X < R2.Y

Treating the attributes R1.X and R2.Y as random variables $X$ and $Y$, respectively, estimating the selectivity of the above operation can be formulated as finding $P(X < Y)$.

Consider the joint distribution of $X$ and $Y$, $f_{X,Y}(x, y)$. The probability that a sample $(a, b)$, taken at random from the Cartesian product of the values in R1.X and R2.Y, satisfies $a = x$ and $b = y$ can be defined as follows:

$$f_{X,Y}(x, y) = P(X = x)\,P(Y = y)$$

or equivalently:

$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y),$$

which is the definition of independent random variables. Note that when a Cartesian product is involved, either explicitly as in the SQL statement above, or implicitly through a join clause, the two variables are independent.

Given a joint PDF of $X$ and $Y$, $f_{X,Y}$, with $X$ and $Y$ being independent, it is known that $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$, with $f_X$ and $f_Y$ the PDFs of $X$ and $Y$, respectively. Considering $F_X$ to be the CDF of $X$, the selectivity of the less than (<) operator can be formalized as follows:

$$P(X < Y) = \iint_{x < y} f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{+\infty} f_Y(y)\int_{-\infty}^{y} f_X(x)\,dx\,dy = \int_{-\infty}^{+\infty} F_X(y)\,f_Y(y)\,dy \qquad (4)$$

This formula thus presents a solution for the join selectivity estimation. Next, we discuss how to translate it into an algorithm.

5. Implementation in a Database System

In RDBMS implementations, histograms are used as a discrete approximation of the PDF and CDF of attributes. This section maps the theory above into an implementable solution in databases using equi-depth histograms.

5.1. Selectivity Estimation

Restriction selectivity estimation

Recall that restriction selectivity estimation is about estimating the selectivity of a predicate in the following form:

SELECT *
FROM R1
WHERE R1.X < c

As described in section 4, restriction selectivity can be calculated using the CDF of $X$ (equation 3). Deriving the CDF from an equi-depth histogram is presented in equation 2. To find the bin $j$ where $c$ is contained, one can perform a binary search over the histogram boundaries.

Algorithm 1 illustrates the estimation of restriction selectivity. The array hist represents the equi-depth histogram, stored as an ordered array of bin boundaries. The function binary_search returns the greatest index in this array whose value is less than or equal to a given constant, effectively finding the bin into which that constant falls. The rest of the algorithm computes the CDF at c using equation 2.

Input: hist, an array of length n representing the equi-depth histogram of R1.X; c, the scalar literal in the query
Output: the estimated selectivity
/* Identify preceding whole bins */
j ← binary_search(hist, c) ;
/* Corner cases */
if j < 0 then
      return 0
if j >= n - 1 then
      return 1
/* Estimate using preceding bins */
selectivity ← j / (n - 1) ;
/* Adjust using linear interpolation */
selectivity += (c - hist[j]) / (hist[j+1] - hist[j]) / (n - 1) ;
return selectivity
Algorithm 1 Restriction selectivity estimation for the expression R1.X < c
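The following is a minimal Python sketch of Algorithm 1, useful for experimenting with the estimation outside the database; the function name and the boundary values in the usage comment are illustrative and not part of the PostgreSQL patch.

# Minimal sketch of Algorithm 1: restriction selectivity P(X < c) from an
# equi-depth histogram stored as a sorted list of bin boundaries.
from bisect import bisect_right

def restriction_selectivity(hist, c):
    n = len(hist)                         # n boundaries -> n - 1 bins
    # Corner cases: the constant lies outside the histogram range
    if c <= hist[0]:
        return 0.0
    if c >= hist[-1]:
        return 1.0
    # Greatest index j such that hist[j] <= c (the bin containing c)
    j = bisect_right(hist, c) - 1
    # Whole bins preceding c, each holding 1 / (n - 1) of the values
    selectivity = j / (n - 1)
    # Linear interpolation inside the bin that contains c
    selectivity += (c - hist[j]) / (hist[j + 1] - hist[j]) / (n - 1)
    return selectivity

# Example with hypothetical boundaries:
# restriction_selectivity([10, 20, 25, 45], 30) -> 0.75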

Using the running example, and the following query:

SELECT *
FROM R1
WHERE R1.X < 30

this query yields 8 rows, which corresponds to a selectivity of 8/12 ≈ 0.67. Using Algorithm 1 with the histograms presented in the running example, the value 30 falls in the last bin of the histogram of R1.X, and the estimated selectivity is 0.75. After multiplying by the attribute's cardinality (12), we get an estimated row count of 9, which is a close estimate of the actual result size.

Figure 2 shows a graphical depiction of the PDF of R1.X, which is directly obtainable from its equi-depth histogram. The integral in equation 3, as well as the estimation calculated above using Algorithm 1, correspond to the highlighted area in the figure.

Figure 2. Restriction selectivity estimation of R1.X < 30

Join selectivity estimation

Join selectivity estimation is mapped into a double integral involving the two PDFs. Equation 4 illustrates that join selectivity can be estimated by using the CDF of X and the PDF of Y. These can be calculated using equations 1 and 2.

The CDF of X is piece-wise linear; each piece is defined in a bin of X's histogram. The PDF of Y is a step function, i.e., piece-wise constant, where each piece is defined in a bin of Y's histogram. This leads to the conclusion that their product, which is needed in equation 4, is a piece-wise linear function, with every piece being defined in an intersection of X's and Y's bins (see Figure 3 for a graphical depiction of this using the running example introduced in section 3.1). By merging the bounds of the two histograms of X and Y into a single sorted array, sync, the intersections of X's and Y's bins will be of the form $[sync_k, sync_{k+1})$.

Figure 3. Piece of product used for Join Selectivity Estimation

From equation 1, we know that $f_Y$ is 0 for all values before the first bound of Y's histogram, and it is 0 after its last bound. From equation 2, we know that $F_X$ is 0 for all values before the first bound of X's histogram, and it is 1 after its last bound.

Analyzing the product piece-wise, for $y \in [sync_k, sync_{k+1})$:

$$F_X(y)\,f_Y(y) = f_Y(sync_k)\,F_X(y) \qquad (5)$$

which is a linear function of $y$, since $f_Y$ is constant and $F_X$ is linear within the piece.

Given that the product, $F_X(y)\,f_Y(y)$, in equation 4 is linear in each interval of the form $[sync_k, sync_{k+1})$, equation 4 can be discretized as follows:

$$P(X < Y) = \int_{-\infty}^{+\infty} F_X(y)\,f_Y(y)\,dy = \sum_{k} \int_{sync_k}^{sync_{k+1}} F_X(y)\,f_Y(y)\,dy \qquad (6)$$

To maximize code re-use, it is possible to reorganize this equation into a sum of the CDFs of both X and Y, i.e., so that we can reuse Algorithm 1. The following is derived directly from equation 3:

$$\int_{sync_k}^{sync_{k+1}} F_X(y)\,f_Y(y)\,dy = f_Y(sync_k)\int_{sync_k}^{sync_{k+1}} F_X(y)\,dy = f_Y(sync_k)\,\frac{F_X(sync_k) + F_X(sync_{k+1})}{2}\,(sync_{k+1} - sync_k) = \frac{F_X(sync_k) + F_X(sync_{k+1})}{2}\,\big(F_Y(sync_{k+1}) - F_Y(sync_k)\big) \qquad (7)$$

The first two steps rely on the fact that $f_Y$ is constant and $F_X$ is linear in the interval $[sync_k, sync_{k+1})$; the last step uses $f_Y(sync_k)\,(sync_{k+1} - sync_k) = F_Y(sync_{k+1}) - F_Y(sync_k)$, which follows from equation 3.

Equation 6 can now be re-written using only CDFs of X and Y as follows:

$$P(X < Y) = \sum_{k} \frac{F_X(sync_k) + F_X(sync_{k+1})}{2}\,\big(F_Y(sync_{k+1}) - F_Y(sync_k)\big) \qquad (8)$$

Algorithm 2 illustrates the estimation of the join selectivity of the less than (<) operator using this equation. It thus has the advantage of re-using Algorithm 1 to compute $F_X$ and $F_Y$. The creation of the sync array in Algorithm 2 comes at the expense of O(n + m) time and space, for duplicating and merging the two sorted histograms. To optimize this, the two histograms can be scanned in parallel, without the need to materialize the sync array. This also allows for a further optimization: the algorithm only needs to iterate over the overlapping region of both histograms. All the partial sums before that region are zero, as can be verified in Figure 4. After the overlapping region, the remaining partial sums add up to the remaining fraction of Y's histogram, because all remaining values of Y are greater than the maximum value in X's histogram. We adopt these optimizations in our implementation, yet we omit them here for clarity of presentation.

Input:
histogram of X, hist_X, an array of length n
histogram of Y, hist_Y, an array of length m
Output: Join selectivity estimation of X < Y
selectivity ← 0 ;
sync ← merge_sorted(hist_X, hist_Y) ;
cur_F_X ← F_X(sync[0]) ;
  // always zero
cur_F_Y ← F_Y(sync[0]) ;
  // always zero
for k ← 1 to n + m - 1 do
       next_F_X ← F_X(sync[k]);
        // using Algorithm 1
       next_F_Y ← F_Y(sync[k]);
        // using Algorithm 1
       selectivity += (cur_F_X + next_F_X) * (next_F_Y - cur_F_Y) ;
       cur_F_X ← next_F_X ;
       cur_F_Y ← next_F_Y ;
      
return selectivity / 2
Algorithm 2 Join selectivity estimation algorithm for the less than (<) operator re-using Algorithm 1 as F_X and F_Y
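For completeness, here is a matching Python sketch of Algorithm 2; it assumes the restriction_selectivity() function from the previous sketch is in scope and reuses it as the CDFs F_X and F_Y.

def join_selectivity_lt(hist_x, hist_y):
    # Estimate P(X < Y) from the equi-depth histograms of X and Y,
    # given as sorted boundary arrays.
    F_x = lambda v: restriction_selectivity(hist_x, v)   # CDF of X
    F_y = lambda v: restriction_selectivity(hist_y, v)   # CDF of Y
    # Merge the boundaries of both histograms into the sync array
    sync = sorted(set(hist_x) | set(hist_y))
    selectivity = 0.0
    cur_fx, cur_fy = F_x(sync[0]), F_y(sync[0])          # both zero
    for k in range(1, len(sync)):
        next_fx, next_fy = F_x(sync[k]), F_y(sync[k])
        # Equation 8: trapezoid of F_X times the mass of Y in the piece
        selectivity += (cur_fx + next_fx) * (next_fy - cur_fy)
        cur_fx, cur_fy = next_fx, next_fy
    return selectivity / 2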

The goal of Algorithm 2 is to calculate the area under the curve of the product $F_X \cdot f_Y$, represented in Figure 4. Taking the 3-bin histograms calculated in section 3.2 from the running example attributes R1.X and R2.Y, the materialized sync array would have the values [10, 15, 20, 25, 39, 45, 50] after merging, sorting, and removing duplicates. These correspond to the boundaries of the pieces in which the product is linear. Stepping through Algorithm 2 with k from 1 to 6, corresponding to the 6 pieces represented by the sync array, we arrive at the estimated selectivity.

The correct result has 95 rows, which corresponds to a selectivity of 95/144 ≈ 0.66. Multiplying the estimated selectivity by the cardinality of the Cartesian product of both attributes (144) and rounding to the nearest integer gives an estimated row count close to the correct result size.

Figure 4 shows a graphical depiction of the product of the CDF of R1.X and the PDF of R2.Y, which is directly obtainable from their equi-depth histograms. The integral in equation 4, as well as the estimation calculated above using Algorithm 2, correspond to the area under this curve.

Figure 4. Product used for Join Selectivity Estimation

Note that the code re-use in Algorithm 2 has a small performance impact. The algorithm has a time complexity of O((n + m) log(n + m)), since it performs a binary search (twice) for each element of each histogram. This binary search is not necessary, because the two histograms are scanned sequentially and the current indices are known at each iteration. One way to avoid this overhead would be to optionally pass the current bin index as an input parameter of Algorithm 1, thus reducing the time complexity to O(n + m).

5.2. Extending to all scalar inequality operators

Given the restriction and join selectivity estimators for the less than inequality, all scalar inequality operators can be implemented by noting the following equivalences:

Restriction selectivity:

$P(X \le c) = P(X < c) + P(X = c)$
$P(X > c) = 1 - P(X < c) - P(X = c)$
$P(X \ge c) = 1 - P(X < c)$

Join selectivity:

$P(X > Y) = P(Y < X)$
$P(X \le Y) = P(X < Y) + P(X = Y)$
$P(X \ge Y) = 1 - P(X < Y)$

Estimators for equality selections and joins are already implemented by almost all common systems. In case they are missing, one could assume $P(X = c)$ and $P(X = Y)$ to be zero, at the cost of slightly under- or over-estimating the selectivity.
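A minimal sketch of these equivalences as code, assuming estimator callables for P(X < c), P(X = c), P(X < Y), P(Y < X), and P(X = Y); the function and parameter names below are illustrative.

def restriction_selectivity_op(op, p_lt, p_eq, c):
    # p_lt(c) estimates P(X < c); p_eq(c) estimates P(X = c), or returns 0
    if op == "<":  return p_lt(c)
    if op == "<=": return p_lt(c) + p_eq(c)
    if op == ">":  return 1.0 - p_lt(c) - p_eq(c)
    if op == ">=": return 1.0 - p_lt(c)
    raise ValueError(op)

def join_selectivity_op(op, p_x_lt_y, p_y_lt_x, p_x_eq_y):
    # p_x_lt_y = P(X < Y), p_y_lt_x = P(Y < X), p_x_eq_y = P(X = Y)
    if op == "<":  return p_x_lt_y
    if op == ">":  return p_y_lt_x
    if op == "<=": return p_x_lt_y + p_x_eq_y
    if op == ">=": return p_y_lt_x + p_x_eq_y
    raise ValueError(op)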

5.3. Making Use of Other Statistics

Typically, RDBMS will collect statistics about nulls, in the form of a fraction of null values, and Most Common Values (MCV), in the form of a singleton histogram. Histograms will thus be constructed for the remaining part of the data. When the histogram statistics only refer to a fraction of the data, the methods described up to this point only provide an estimation for this fraction. The final estimation must thus take nulls and MCV into account.

As a general way to integrate such other statistics in the estimation besides the histograms, we note the following: given a non-overlapping partitioning of the data, if each partition $i$ corresponds to a fraction $p_i$ of the original data, and the selectivity within that partition is $s_i$, the final selectivity can be calculated by the inner product $\sum_i p_i\, s_i$.

Since a value is either null, a most common value, or accounted for by the histogram, the overall restriction selectivity can be calculated by the following formula:

$$selectivity = null\_frac \cdot S_{null} + mcv\_frac \cdot S_{mcv} + hist\_frac \cdot S_{hist} \qquad (9)$$

Null Values

All inequality operators are strict, meaning that comparisons involving null values never match; the selectivity for null values is therefore 0, and so is the first term in equation 9.

Most Common Values

MCV statistics maintain pairs of values and their frequencies in the table. They are maintained for the top k frequent values, where k is a statistics collection parameter. Since MCV represent the data in its original form, it is possible to accurately compute the selectivity for these values.

To estimate the restriction selectivity of an operator using the most common values, algorithm 3 can be used. This algorithm computes the selectivity of a Boolean operator on a list of most common values by adding the frequencies of the most common values satisfying this Boolean condition.

Input:
MCV statistics of X (array of values and corresponding array of fractions)
Length of the MCV arrays, n
constant, c
operator, op
Output: Restriction selectivity estimation of X <op> c
selectivity ← 0 ;
for i ← 0 to n - 1 do
       if op(values[i], c) then
             selectivity += fractions[i] ;
            
return selectivity
Algorithm 3 mcv_selectivity(values, fractions, n, c, op), estimates the restriction selectivity, using only the MCV statistics, for a given Boolean operator op
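A one-line Python sketch of Algorithm 3 (the explicit length parameter is unnecessary in Python and dropped here); the example MCV arrays are illustrative.

import operator

def mcv_selectivity(values, fractions, c, op):
    # Sum the frequencies of the common values satisfying op(value, c)
    return sum(f for v, f in zip(values, fractions) if op(v, c))

# Example: mcv_selectivity([1, 5, 9], [0.2, 0.1, 0.05], 6, operator.lt) -> 0.3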

For join selectivity estimation, since there is a need to combine statistics from null values, MCV, and equi-depth histograms for both X and Y, there are 9 cases that need consideration, depending on the combination of values of X and Y, as shown in Figure 5.

Y is null Y in MCV Y in histogram
X is null case1 case2 case3
X in MCV case4 case5 case6
X in histogram case7 case8 case9
Figure 5. Nine cases for join selectivity estimation

For strict operators, such as inequalities, only cases 5, 6, 8, and 9 need to be calculated, because comparisons involving nulls produce no matches. Case 9 has already been handled in Algorithm 2. For case 5, we iterate over the values in the MCV of Y; for each such value, we multiply its fraction by the MCV restriction selectivity of the operator in question, using the current value of Y's MCV as the constant. This process is described in Algorithm 4.

Input:
MCV statistics of X (array of values_X and corresponding array of fractions_X)
Length of the MCV arrays of X, n_X
MCV statistics of Y (array of values_Y and corresponding array of fractions_Y)
Length of the MCV arrays of Y, n_Y
operator, op
Output: Join selectivity estimation of X <op> Y
selectivity ← 0;
for i ← 0 to n_Y - 1 do
       selectivity += fractions_Y[i] * mcv_selectivity(values_X, fractions_X, n_X, values_Y[i], op) ;
      
return selectivity
Algorithm 4 Join selectivity estimation algorithm for any binary Boolean operator re-using Algorithm 3 as mcv_selectivity(values, fractions, n, c, op)

For cases 6 and 8, Algorithm 5 is used, swapping the arguments as needed. For each common value in the MCV statistics of X, its fraction is multiplied by the histogram restriction selectivity of Y, using the current value of X as the constant.

Input:
MCV statistics of X (array of values and corresponding array of fractions)
Length of the MCV arrays, n
histogram of Y, hist_Y
operator, op
Output: Join selectivity estimation of X < Y
selectivity ← 0 ;
for i ← 0 to n - 1 do
       selectivity += fractions[i] * F_Y(values[i]) ;
      
return selectivity
Algorithm 5 Join selectivity estimation algorithm for the less than (<) operator re-using Algorithm 1 as F_Y

Given Algorithms 4 and 5, the selectivity of the less than (<) operator considering histograms and most common values can be estimated by the following formula, where $mcv\_frac$ and $hist\_frac$ denote the fractions of the non-null values of each attribute covered by its MCV list and its histogram, respectively, and $S_5$, $S_6$, $S_8$, $S_9$ denote the selectivities computed for cases 5, 6, 8, and 9:

$$S_{mcv+hist} = mcv\_frac_X\, mcv\_frac_Y\, S_5 + mcv\_frac_X\, hist\_frac_Y\, S_6 + hist\_frac_X\, mcv\_frac_Y\, S_8 + hist\_frac_X\, hist\_frac_Y\, S_9$$

The final selectivity, taking null values into account, can be estimated as follows:

$$selectivity = (1 - null\_frac_X)\,(1 - null\_frac_Y)\, S_{mcv+hist} \qquad (10)$$
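The following Python sketch puts the pieces together for the less than (<) operator, reusing mcv_selectivity, restriction_selectivity, and join_selectivity_lt from the earlier sketches. It assumes, for illustration, that the MCV fractions are normalized within the MCV partition and that mcv_frac and hist_frac refer to the non-null part of each attribute, matching the notation above; it is not the PostgreSQL implementation.

import operator

def join_selectivity_lt_with_stats(values_x, fractions_x, values_y, fractions_y,
                                   hist_x, hist_y,
                                   mcv_frac_x, mcv_frac_y,
                                   null_frac_x, null_frac_y):
    hist_frac_x, hist_frac_y = 1.0 - mcv_frac_x, 1.0 - mcv_frac_y
    # Case 5: MCV(X) vs MCV(Y) -- Algorithm 4 re-using Algorithm 3
    s5 = sum(fy * mcv_selectivity(values_x, fractions_x, vy, operator.lt)
             for vy, fy in zip(values_y, fractions_y))
    # Case 6: MCV(X) vs histogram(Y) -- P(Y > x) for each common value x of X
    s6 = sum(fx * (1.0 - restriction_selectivity(hist_y, vx))
             for vx, fx in zip(values_x, fractions_x))
    # Case 8: histogram(X) vs MCV(Y) -- Algorithm 5: P(X < y) for each y
    s8 = sum(fy * restriction_selectivity(hist_x, vy)
             for vy, fy in zip(values_y, fractions_y))
    # Case 9: histogram(X) vs histogram(Y) -- Algorithm 2
    s9 = join_selectivity_lt(hist_x, hist_y)
    # Weight each case by the fraction of the non-null data it covers
    s = (mcv_frac_x * mcv_frac_y * s5 + mcv_frac_x * hist_frac_y * s6 +
         hist_frac_x * mcv_frac_y * s8 + hist_frac_x * hist_frac_y * s9)
    # Nulls never satisfy a strict operator (equation 10)
    return (1.0 - null_frac_x) * (1.0 - null_frac_y) * s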

5.4. Implementation for Ranges and Multi-Ranges

The algorithms described above are for scalar types. An advanced type implemented by many database systems is the range type. A range type is a tuple (left, right, lc, rc), where left <= right are two values of a domain with a total order, and lc and rc specify whether the left and right bounds, respectively, are included in the range. The range type can be parameterized by the type of its bounds, e.g., range(float), range(timestamp), etc. In this section, we describe how the selectivity estimation of the previous sections can be applied to the range type and the respective operators.

PostgreSQL, as an example of a DBMS that has range types, collects the statistics for range attributes in the form of two equi-depth histograms: one for the lower bounds of the ranges, and one for the upper bounds. In the following, let X and Y be attributes of the same range type. Also, let X_lower be the variable that represents all the lower bounds of X, X_upper be the variable that represents all the upper bounds of X, and similarly for Y_lower and Y_upper. It is then possible to estimate the selectivity of the different range operators, as listed below (a code sketch of this mapping is given after the list):

  • X << Y, where <<, read strictly left of, yields true when X ends before Y starts, is estimated as P(X_upper < Y_lower)

  • X >> Y, where >>, read strictly right of, yields true when X starts after Y ends, is estimated as P(X_lower > Y_upper)

  • X &< Y, where &<, read does not extend to the right of, yields true when X does not end after the end of Y, is estimated as P(X_upper <= Y_upper)

  • X &> Y, where &>, read does not extend to the left of, yields true when X does not start before the start of Y, is estimated as P(X_lower >= Y_lower)

  • X && Y, where && indicates that X and Y overlap, is estimated as 1 - P(X_upper < Y_lower) - P(X_lower > Y_upper)

  • and so on
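The following Python sketch makes this mapping explicit, reusing join_selectivity_lt from section 5.1 on the bound histograms; it ignores the probability that two bounds are exactly equal, as well as the handling of empty ranges, which is discussed next.

def range_join_selectivity(op, hist_x_lower, hist_x_upper, hist_y_lower, hist_y_upper):
    if op == "<<":   # X strictly left of Y: upper(X) < lower(Y)
        return join_selectivity_lt(hist_x_upper, hist_y_lower)
    if op == ">>":   # X strictly right of Y: upper(Y) < lower(X)
        return join_selectivity_lt(hist_y_upper, hist_x_lower)
    if op == "&<":   # X does not extend to the right of Y: upper(X) <= upper(Y)
        return join_selectivity_lt(hist_x_upper, hist_y_upper)
    if op == "&>":   # X does not extend to the left of Y: lower(X) >= lower(Y)
        return join_selectivity_lt(hist_y_lower, hist_x_lower)
    if op == "&&":   # X overlaps Y: neither strictly left nor strictly right of Y
        return (1.0
                - join_selectivity_lt(hist_x_upper, hist_y_lower)
                - join_selectivity_lt(hist_y_upper, hist_x_lower))
    raise ValueError(op)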

It is, however, not possible to accurately estimate the join selectivity of the operators that express total or partial containment. This is mainly because the lower and upper histograms assume independence between the range bounds. For containment operators, we need to relate the two bounds of the same range, which explicitly breaks this assumption.

Another consideration in estimating the selectivity of range operators is the fraction of empty ranges since these are not accounted for by the histograms. Depending on the operator, empty ranges are either always included or always excluded when compared to non-empty ranges and similarly when compared to other empty ranges.


6. Experiments

This section evaluates the selectivity estimation accuracy of the proposed algorithm, and its relation to the size of the histogram statistics, i.e., the number of bins. The proposed algorithm has been implemented in PostgreSQL 14, including support for range operators as described in section 5.4. We prepared a patch, which is currently under review for inclusion in the next release of PostgreSQL. The patch is also included as an artifact with this paper.

The experiment described in this section is thus run using our implementation in PostgreSQL 15-develop on a virtual Debian machine with 32 GB of disk and 8 GB of RAM. We created two relations, R1 and R2, with range attributes R1.X and R2.Y, and cardinalities of 20390 and 20060 rows, respectively. Note that for range types we use the very same algorithm as for scalar types, so the results of this experiment hold for both.

The range values in the two relations were generated to cover a mixture of short, medium, and long ranges. We also included corner cases such as ranges with infinite bounds, empty ranges, and null values in the two relations. The following query was executed varying the number of histogram bins:

SELECT *
FROM R1, R2
WHERE x << y

Note that << denotes the "strictly left of" operator, which returns true if and only if the upper bound of x is less than the lower bound of y. PostgreSQL collects bounds histogram statistics for range attributes. To estimate the selectivity of the query above, histograms of the upper bound of X and the lower bound of Y are used as input for Algorithm 2.

The optimizer statistics collector of PostgreSQL has a parameter called statistics target, that controls the number of bins in the equi-depth histograms. The default value is 100, and it can be increased up to 10000. The experiments described in this section were run by incrementing the statistics target by steps of 100 starting with the default value till the maximum value.

In the experiment, we observe two quantities: (1) the planning time, which is the time taken by the query optimizer to enumerate the alternative query plans and estimate their costs, and (2) the selectivity estimation error, defined as the absolute difference between the estimated and the actual number of rows returned by the query, divided by the cardinality of the Cartesian product of the two relations, which is 409013259.
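For a sense of scale of this error metric (the figure below uses a made-up estimate, purely for illustration), an estimate that is off by one million rows on this Cartesian product corresponds to an error of

$$\frac{1{,}000{,}000}{409{,}013{,}259} \approx 0.24\%$$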

Figure 6 shows the change in planning time (in milliseconds) as the statistics target increases. Apart from two outliers, the planning time shows approximately linear behavior, indicating that the binary search does not have a significant impact in the allowed range of statistics targets. Recall that the analytical complexity is O((n + m) log(n + m)).

Figure 6. Query planning time (ms) vs. the number of histogram bins in Algorithm 2

Figure 7 shows the estimation error (on a logarithmic scale) against the number of histogram bins, i.e., by varying the statistics target. As expected, using more bins leads to a lower estimation error. The largest error observed, when using only 100 histogram bins, was 1.112%. The error then drops rapidly to less than 0.002% at 900 bins, which corresponds to less than 5% of each relation's size, and it stays consistently around this value for histograms larger than 900 bins.

Figure 7. Log selectivity estimation error vs. the number of histogram bins

Figure 8 plots the selectivity estimation error against the planning time in milliseconds. This figure illustrates the trade-off between the planning time spent and the gain in terms of reduced estimation error. It shows that, for the relations used, the planning time does not need to exceed 2 milliseconds to obtain extremely accurate estimations.

Figure 8. Log selectivity estimation error vs. planning time (ms)

7. Conclusions

This paper proposed an algorithm for estimating the selectivity of inequality join predicates. To the best of our knowledge, this fundamental database problem had no previously published solution. The common open-source databases PostgreSQL and MySQL lack implementations for this functionality. We have implemented the proposed algorithm and submitted it as a patch for inclusion in PostgreSQL. The proprietary databases Oracle and SQL-Server return fairly accurate estimations, but their algorithms are not known. Our experiments show that the proposed algorithm provides comparable, and in our experiments slightly better, estimation accuracy. To produce these estimations, the algorithm uses equi-depth histogram statistics, which are adopted in all these systems. In a practical setting, the planning time remains within 2 milliseconds, which is the accepted norm in common databases.


References

  • N. Bruno and S. Chaudhuri (2002) Exploiting statistics on query expressions for optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 263–274. Cited by: §2.
  • D. Chamberlin, M. Astrahan, M. Blasgen, J. Gray, W. III, B. Lindsay, R. Lorie, J. Mehl, T. Price, G. Putzolu, P. Selinger, M. Schkolnick, D. Slutz, I. Traiger, B. Wade, and R. Yost (1981) A history and evaluation of system r.. Commun. ACM 24, pp. 632–646. Cited by: §1.
  • G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4, pp. 1–294. Cited by: §2.
  • A. Dell’Era (2008) External Links: Link Cited by: §2.
  • G. Graefe and W. McKenna (1991) The volcano optimizer generator. pp. 21. Cited by: §1.
  • G. Graefe (2014) The cascades framework for query optimization.. IEEE Data Eng. Bull. 18, pp. 19–29. Cited by: §1.
  • P. G. D. Group (1996a) External Links: Link Cited by: §2.
  • P. G. D. Group (1996b) External Links: Link Cited by: §2.
  • P. G. D. Group (1996c) External Links: Link Cited by: §2.
  • P. G. D. Group (1996d) External Links: Link Cited by: §2, §2.
  • S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das (2020) Deep learning models for selectivity estimation of multi-attribute queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, New York, NY, USA, pp. 1035–1050. External Links: ISBN 9781450367356, Link, Document Cited by: §2.
  • A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper (2018) Learned cardinalities: estimating correlated joins with deep learning. CoRR abs/1809.00677. External Links: Link, 1809.00677 Cited by: §2.
  • H. Lan, Z. Bao, and Y. Peng (2021) A survey on advancing the dbms query optimizer: cardinality estimation, cost model, and plan enumeration. Data Science and Engineering 6, pp. 86–101. Cited by: §1, §2.
  • J. Lewis (2006) Join cardinality. In Cost-Based Oracle Fundamentals, pp. 265–305. External Links: ISBN 978-1-4302-0087-1, Document, Link Cited by: §2.
  • Oracle (2021) External Links: Link Cited by: §2, §2.
  • E. Pitoura (2009) Selectivity estimation. In Encyclopedia of Database Systems, pp. 2548–2548. Cited by: §1.
  • J. Sack (2014) Optimizing your query plans with the sql server 2014 cardinality estimator. Cited by: §2.
  • S. Server (2021) External Links: Link Cited by: §2.
  • S. Server (2022) External Links: Link Cited by: §2, §2.
  • [20] M. team at Oracle(Website) External Links: Link Cited by: §2, §2.
  • [21] M. team at Oracle(Website) External Links: Link Cited by: §2.
  • M. team at Oracle (2000) External Links: Link Cited by: §2.
  • X. Wang, C. Qu, W. Wu, J. Wang, and Q. Zhou (2021) Are we ready for learned cardinality estimation?. Proc. VLDB Endow. 14 (9), pp. 1640–1654. External Links: ISSN 2150-8097, Link, Document Cited by: §1, §2.
  • Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica (2019) Selectivity estimation with deep likelihood models. CoRR abs/1905.04278. External Links: Link, 1905.04278 Cited by: §2.