The Case for Deep Query Optimisation

08/22/2019 ∙ by Jens Dittrich, et al. ∙ 0

Query Optimisation (QO) is the most important optimisation problem in databases. The goal of QO is to compute the best physical plan under a given cost model. In that process, physical operators are used as building blocks for the planning and optimisation process. In this paper, we propose to deepen that process. We present Deep Query Optimisation (DQO). In DQO, we break up the abstraction of a 'physical' operator to consider more fine-granular subcomponents. These subcomponents are then used to enumerate (sub-)plans both offline and at query time. This idea triggers several exciting research directions: (1) How exactly can DQO help to compute better plans than (shallow) QO and at which costs? (2) DQO can be used to precompute and synthesise database operators and any other database component as Algorithmic Views (AVs). (3) We identify the Algorithmic View Selection Problem (AVSP), i.e. which AVs should be materialized when? This paper presents the high-level idea of DQO using an analogy inspired from biology. Then we proceed to question the terms 'physical' and 'physical operator'. We present experiments with a 'physical operator' formerly known as 'hash-based grouping'. We benchmark that operator both independently as well as in the context of DQO-enabled dynamic programming. We conclude by sketching a DQO research agenda.



There are no comments yet.


page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Query Optimisation (QO) is at the heart of any query engine. The core task of QO is to find an efficient plan under a given cost model. A major difficulty of this task is to compute not only a logical plan (an extended relational algebra expression, a DAG of logical operators) specifying which relations to join in which order but additionally a physical plan (a DAG of physical operators) that additionally specifies which access methods (e.g. unclustered B-tree vs scan) and algorithms to use (e.g. sort-merge vs hash join). The physical plan can then either be interpreted or compiled.

A major problem of this approach is the hidden legacy of relational algebra. For example, at some point in QO, a logical join, i.e. , is translated to a physical join, i.e.  or . In other words, the abstraction used during query optimisation is the following: a physical operator in relational algebra receives one or two inputs from outside, does some well-defined processing inside, and produces one output dataset to the outside. In summary, a physical plan is DAG-structured ‘algorithmic recipe’ where the nodes are physical operators and the vertices symbolise producer-consumer relationships.

There has been considerable work on examining the pipelining aspects of this legacy, i.e. how to effectively implement the producer-consumer relationships from good old volcano-style ONC-iterators [5], via vectorisation [3] to breaking the boundaries of physical operators to run plans until the next pipeline breaker [15], to include parallelism [14], and combinations thereof [10].

However, these works neglect that the algorithms used to implement an operator can be considered a query plan in itself. For instance, consider the physical grouping operator222We focus on grouping here as joins have very similar algorithmic issues and solutions. This becomes clear if you consider that a join is merely a co-group-operation with exactly two inputs followed by an aggregation on each co-group.Vice versa a grouping operation is merely a co-group operation with a single input.. From a 10,000 feet perspective, it can be implemented either using a sort-based- or a hash-based algorithm. In hash-based grouping (as shown in Figure 1) we initialise an empty hash table (Line 1). Then we insert each tuple from the input into that hash table using the grouping key as the key to the hash table and the set of tuples having that key as the value to the hash table (Lines 2–6). Afterwards, for each existing key in the hash table, we compute the aggregate on the set of tuples pointed to (Lines 7–8).

HashBasedGouping (Relation R, groupingKey): [itemsep=0pt] HashMap hm; Relation result = {}; //Insert all tuple from input R into HashMap hm: For each r in R: .7cm If r.groupingKey in hm: 1.5cm hm.probe(r.groupingKey) {r}; .7cm Else: 1.5cm hm.insert(r.groupingKey, {r}); //Build aggregates for each existing key in hm: For each key in hm.keySet(): .7cm result = aggregate(hm.probe(key)); Return result;

Figure 1: Textbook-style pseudo-code for hash-based grouping

This algorithm can be found in almost all database textbooks, lectures, and even conference talks. The problem with this algorithm is is that it implies a couple of algorithmic and physical design decisions: (1) As an internal index structure a hash table is used, but which one exactly? As already observed in [17] a hash table has many different dimensions which influence performance dramatically. (2) The insert operations to the hash table are implicitly assumed to occur in serial. (3) The aggregation operations in the hash table are implicitly assumed to occur in serial, i.e. group-wise. (4) The ‘Relation R’ parameter in the function signature implies that the entire result set is passed to the algorithm fully materialized. Likewise, the final aggregates are collected in a result set before passing them outside which again implies that the result set is materialised. In summary, this induces two unnecessary pipeline breakers. (5) The two phases in the algorithm (first load the hash table, then compute the aggregates) forbids any kind of non-blocking behaviour, e.g. like in any kind of online aggregation algorithm [6, 4].

A much better description of this algorithm is shown in Figure 2. The two LOCs in Figure 2 basically say: (1) we will partition the data produced by R into a bundle of independent producers. If the input produces 42 different groups, partitionBy creates 42 different producers. Semantically each producer will deliver the tuples belonging to its group. Notice that there is no need here to shoehorn the result into one relation as in relational algebra or SQL. In (2) we specify that a bundle of independent producers is aggregated with the same aggregation function, but possibly independently.

What if we depict this ‘code’, i.e. the insides of the logical operator, as a query plan itself? This is shown in Figure 3. Figure 3(a) shows logical grouping as found in (extended) relational algebra. If we open up that box, i.e. if we unnest the operator into a more fine-granular query plan, we obtain Figure 3(b). The latter corresponds to the pseudo-code given in Figure 2. The more we unnest, we increase the physicality of the query plan as for each unnest we have to make a decision on how exactly to implement a certain ‘bubble’. Only after some recursive unnesting, we eventually obtain at Figure 3(d) which corresponds to the pseudocode of hash-based grouping shown in Figure 1. In other words, hash-based grouping is just one of many special cases in a partition-based grouping algorithm. Figure 3(e) shows another unnest using static perfect hashing (SPH, in the experiments we will even use minimal SPH) as well as a parallel load. And from that we could continue further…

In Figure 3, the arrows denote a specific path we followed at each unboxing step. In turn, at each unnest we discard several options. In this example we unnest four times. However, this figure just visualises the principle. We do not imply that it is exactly four unnest operations to get from a logical operator to a ‘physical’ plan. However, what we do imply is that the current state-of-the-art to translate from one extreme (a logical operator) in a single step to another extreme (a blackbox ‘physical operator’) misses several interesting optimisation opportunities (cf. SQO in Figure 3).

PartitionBasedGrouping(Producer R, Consumer R’, groupingKey): [itemsep=0pt] //partition the input: R partitionBy(groupingKey) R_partitions //aggregate each partition: R_partitions aggregate(…) R’

Figure 2: Partition-based grouping. Here each ‘line of code’ is written as a producer-consumer pattern, i.e. a line of code consumes some input and creates one or multiple producers. This does not make any algorithmic decision whatsoever on how this producer-consumer pattern will be implemented physically. denotes that an operation provides a bundle of independent producers.

This paper is structured as follows: in the next section we introduce Deep Query Optimisation (DQO). In Section 3 we introduce Algorithmic Views (AVs) and the Algorithmic View Selection Problem (AVSP). Section 4 presents early experimental results of our idea. Section 5 contrasts DQO to related work. Finally, Section 6 presents a research agenda.

Figure 3: Standard (‘shallow’) query optimisation (SQO) vs deep query optimisation (DQO)

2 Deep Query Optimisation

The core idea of Deep Query Optimisation (DQO) is the following: rather than computing a plan using only coarse-granular logical and/or ‘physical’ operators (as done in Shallow Query Optimisation, SQO), in DQO we consider more fine-granular components in the optimisation process.

As already shown in the previous sections, when inspecting a single group-by operator, there are many different hidden levels of nested query plans. Table 1 presents another view on these granularity levels using an example from biology, a living cell, as an analogy. A living cell is composed of organelles which consist of (macro-)molecules which consist of atoms. Using the living cell analogy, we can phrase the key idea of DQO as follows: extend SQO to also assemble organelles and macro-molecules from molecules rather than only living cells from organelles.

Table 1: Granularity concepts in biology vs their counterparts in query optimisation

2.1 Local vs Global Effects of Deep Query Optimisation

DQO can be considered local if the types of subcomponents used to assemble a specific granule do not have any effect on the context of that granule. In contrast, DQO must be considered global (or at least non-local) if the types of subcomponents used to assemble a specific granule do (or even may) have an effect on the context of that granule.

For instance, assume we want to find the optimal index to be used in hash-based grouping. Let’s take a look at two major options for the indexes to use here:

(1) We use an out-of-the-box hash-table. Then, the output of the operator will not be sorted (technically it is often sorted in the order as the groups appear in the hash table. That order depends heavily on the hash function used. If we do not know exactly which order is produced by a blackbox hash table, we have to assume that the data is unordered to be on the safe side.).

(2) We use a static perfect hash-function (SPH). SPH can simply be an array of groups of tuples (or running aggregates in the case of distributive and/or decomposable aggregation functions). The grouping key then serves as the index into that array. Here, the linear array slot computation works like a perfect hash function. If all array slots are used, the SPH is even minimal. This is only applicable if the key domain of the grouping key is (relatively) dense. This situation is not as rare as one might think. For instance, the keys of a dictionary-compressed column are a natural candidate for this and can directly be used for SPH.

In summary: any optimisation step in DQO can be considered local, if the subplan produced at that step has the same properties as any other plan at that granule. In contrast if, for a given granule different subplans have different properties, its optimisation effects may be considered global.

2.2 DQO Plan Properties

DQO plan properties have similarities to interesting orders in sort-based operators. However, in DQO, an “interesting order” is just one tiny special case. Other cases include, but are not limited to, any statistical or physical property of the input data like: sparse vs dense, clustered, partitioned, correlated, compressed (and how exactly?), layout (row, col, PAXish [1], in-between), etc. Any of these DQO plan properties may have a non-local effect.

In DQO, in a first shot, these properties can be considered and handled very similarly to how interesting properties are handled in dynamic programming. If any subcomponent in DQO produces an output with such a property, we must not discard that information.

In the experiments in Section 4, we show how we modified dynamic programming to factor in one of these properties: dense vs sparse and how that makes a difference in optimisation and runtime performance. Examining the other properties in more detail is an avenue for future work.

3 Algorithmic View Selection

There is another interesting similarity when thinking about materialized views (precomputed query results) and also prepared statements (pre-optimised queries).

In DQO, in particular for local effects but also for the non-local ones, it makes sense to precompute certain granules offline (before a query comes in). We coin these precomputed components Algorithmic Views (AV). AVs can be precomputed for any level not only “physical” operators. Like that AVs can be used as building blocks for DQO at query time to speed-up plan enumeration. There is a natural trade-off here: how much time do I want to spend on DQO offline vs at query time? Again, inspired my the materialized view selection problem [2], we coin this the Algorithmic View Selection Problem (AVSP). And like with MVs there is no need in AVSP to make any manual decision about which granules to precompute and which not. This is simply adding a new AVSP-dimension to the physical design problem. AVs trigger a couple of interesting research challenges and directions, in particular when applied to indexing. We will discuss them in Section 6.

4 Experiments

DQO can be applied in all parts of query optimisation that include algorithmic design decisions. In this section, we demonstrate in a small domain that DQO can have a significant impact on query execution.

4.1 Setup and Methodology

All experiments were conducted on a Linux machine with an AMD Ryzen Threadripper 1900X 8-Core processor with 32 GiB memory. All algorithms are implemented single-threaded in C++ and compiled with Clang 8.0.1, -O3.

We consider five different implementation variants of grouping333The implementation of our algorithms will be made available on GitHub . Each implementation computes the aggregates COUNT and SUM on the fly and stores a mapping from grouping key to aggregate data inside an array. We always assume the number of distinct values to be known.

Hash-based Grouping (HG). We use std::unordered_map as the underlying hash table and the Murmur3 finaliser as hash function. Every input element is inserted individually into the hash table.

Static Perfect Hash-based Grouping (SPHG). We use the grouping key as offset into the array storing the groups, acting as a static and perfect hash function.

Order-based Grouping (OG). This implementation requires the input data to be partitioned by the grouping key. We iterate sequentially over the input data, create a group for the very first occurrence of a grouping key, and insert this group at the first empty slot in the array. As long as the grouping key remains the same, the corresponding aggregates are updated.

Sort & Order-based Grouping (SOG). We do not require that the input data is partitioned by the grouping key. Therefore, we first sort the data then we apply OG.

Binary Search-based Grouping (BSG). We store a mapping from grouping key to aggregate data inside a sorted array. This allows us to perform binary search to lookup a group by its key.

The datasets consist of 100 million 4 byte unsigned integer values representing the grouping key. Each dataset is uniformly distributed and has two properties, sortedness and density. Taking all combination of those properties, we end up with four different datasets.

4.2 Performance of Physical Grouping Implementations

This section compares the performance of the aforementioned algorithms by their execution time on the four different datasets. Figure 4 shows one plot for each of the the four different datasets. Each plot depicts the execution time in milliseconds for an increasing number of groups.

Hash-based Grouping

Order-based Grouping

Sort & Order-based Grouping

Static Perfect Hash-based Grouping

Binary Search-based Grouping
sparse dense



Figure 4: Grouping performance of four different grouping algorithms on four different input datasets.

Sorted & Dense. For this combination OG and SPHG exhibit the best performance at roughly 250 ms being more than four times as fast as HG. SOG performs even worse because the already sorted input data is again (unnecessarily) sorted incurring a considerable overhead. Furthermore, the execution time of each algorithm is mostly independent of the number of groups.

Sorted & Sparse. This experiment shows that the execution time of HG, OG, and SOG is essentially independent of the data density. OG again performs best with around 250 ms. Since we have a sparse data domain, we cannot use SPHG. Instead, we use BSG, which incurs logarithmic costs relative to the number of groups. However, no algorithm comes close to the performance of OG.

Unsorted & Dense. In this case, SPHG is the best performing algorithm at a constant execution time of roughly 250 ms, being unaffected by the sortedness of the data. The execution time of HG grows with an increasing number of groups because of caching effects. For less than about 500 groups, SOG displays a steep rise in execution time and afterwards merely a modest increase.

Unsorted & Sparse. For this setting, we can neither exploit the sortedness nor the density of the data. Without these properties, HG is superior in a wide range of number of groups. However, for up to 14 groups (see zoom-in in the respective plot), BSG outperforms HG. This opens up another optimisation dimension in which the number of distinct values should be considered.

In summary, this experiment shows that not only sortedness determines the fastest algorithm but also other properties like in this case density.

4.3 DQO-enabled Dynamic Programming

This section shows how classical Dynamic Programming can lead to better query execution plans when extending it with DQO. Consider the following example query:


For the physical implementations of the joins, we assume the algorithmic counterparts of our grouping implementations. The corresponding cost models are shown in Table 2. We assume the output-size of the join to be because of the foreign-key constraint and the output-size to be .

Grouping Join
sort & order-based
static perfect hash-based
binary search-based
Table 2: Cost models for grouping and join-algorithms used.
Figure 5:

Improvement factors for the estimated plan costs of DQO over SQO.

While SQO only considers data sortedness as in traditional dynamic programming, DQO also considers other DQP plan properties (cf. Section 2.2), here: the density of the grouping keys. This allows us to use o static perfect hash-based algorithmic designs (SPH) for our join and grouping implementations. Figure 5 shows the improvement factors for the estimated plan costs of DQO over SQO. Since SPH can only be used in a dense domain, for sparse data DQO generates the same plans as SQO, resulting in no improvement. In case both inputs are sorted, the order-based implementations achieve the cheapest plans regardless of the data density. However, if at least one input is unsorted, DQO generates plans with an improvement factor of up to 4x. In this case, DQO chooses plans that use the SPHJ and SPHG algorithms.

In summary, the experiments show that depending on the underlying data properties, different physical implementations lead to the cheapest plans and achieve the best execution time. This supports our claim that more fine-granular optimisation can lead to better query plans.

5 Related Work

Notice, that the bulk of related work has already been discussed inline above or will be discussed in the Research Agenda in Section 6.

Stratos’ Idreos work [8, 7] is the closest work to ours when it comes to physical ‘indexing operators’, see Section 6/Algorithmic Index Views for details.

Christoph Koch’s work [11] discusses the problems with missing abstractions in the database community in particular when communicating our findings to the systems community. The Introduction of that paper is a must-read for every researcher: ‘… What is frequently lacking is a conclusive deconstruction of the research contribution [of a paper] into first principles and fundamental patterns from which the contribution is composed. …’ In his work, Christoph Koch develops methods for synthesising and composing algorithms [12]. That work is however much more PL- and compiler-oriented than DQO. In particular, we argue that database domain-specific decisions should not be delegated to the PL-compiler.

6 Research Agenda

Revisit SQO Algorithms. We envision that many existing SQO algorithms can easily be extended to support DQO. Just like extending SQO to large queries [16], the challenge will be to extend them to deep queries and find the right sweet-spots. Recall that, in the history of SQO, initially only relatively small queries could be optimised, but over time the queries became bigger and bigger. We foresee the same to happen with DQO: over time deeper and deeper queries will become optimisable. As long as optimisation time in DQO is an issue, we need AVs to the rescue.

Physiological Algebra. An interesting research subspace will be to identify the right components to use in DQO, i.e. what are suitable organelles and macro-molecules to consider? We envision that this will lead to a physiological component set akin to relational algebra yet including both logical and physical aspects.

Algorithmic Views Selection. A promising direction is to systematically research and evaluate AVs and the Algorithmic Views Selection Problem (AVSP). When to materialise which algorithm into an AV? Beforehand or at query time? And what do we possibly win or lose at query time? These trade-offs have to be explored carefully. And, yes, for sure: these trade-offs are absolutely workload-dependent. For which parts of a query plan should we consider DQO?

Partial Algorithmic Views. Rather than fully materialising parts of a deep query plan into an AV, or, if we pick the other extreme, not materialising it at all, there is an interesting middle-ground: It makes sense to partially optimise an AV offline and leave some flexibility for DQO at query time. Which portions should be left up for DQO at query time? Again, these trade-offs have to be explored carefully. Actually, there are many interesting lessons here that can be adapted from compiler construction in this space.

Algorithmic Index Views. An entire interesting research subspace is to apply DQO to indexing. It would be exciting to explore DQO in that context. In database literature, we witness the birth of about a dozen index structures every year. Most indexes are basically composed of substructures (atoms), i.e. different nodes and leaf-types. Class-book index structures like a B-trees and BSTs use a tiny set of allowed node-and leaf-types, other indexes extend that set slightly allowing for more heterogeneous trees, e.g. [13].

An extreme version of this is [8, 7]. However, a synthesised data structure (SDS) is simply one special case of what we propose. Basically, in DQO, a synthesised data structure is one particular type of an AV. In addition, for an SDS all of the optimisation happens offline. That is an unnecessary restriction as already outlined in Section 3. In DQO we do not need to synthesise the entire index or any other AV beforehand. Which implies the following:

Runtime-Adaptivity and Reoptimisation of AVs. So far we suggested to optimise deep query plans and then execute these plans. As with shallow query plans, the literature on reoptimisation (during query time) as well as adaptivity should be revisited in the light of DQO. For instance, in traditional indexing, for each column, the decision whether to create an index is binary. What if we make that decision continuous? Like that different parts of a column are not, slightly, or fully indexed. That is the core idea of adaptive indexing [9, 18]

. An adaptive index has built-in heuristics to make these decisions at runtime based on the incoming queries. And even those heuristics may be meta-adapted 

[19]. In the DQO universe a (meta-)adaptive index is simply a partial AV where some optimisation decisions have been delegated to query time and baked into that AV. This idea should be revisited for all physical components currently used in SQO; not only indexes.

Longterm Vision. Build a system that integrates all of the above. That system should be able to make a smooth transition from SQO to DQO and find the sweet-spots for any given workload.

7 Conclusions

This paper made several contributions: we opened the book for Deep Query Optimisation (DQO). We presented the general idea, contrasted it to SQO, and showed the high potential of DQO. In addition, we introduced the concept of Algorithmic Views and the Algorithmic View Selection Problem. We presented early experimental results with DQO. In addition, we compiled a research agenda.


  • [1] A. Ailamaki et al. Weaving relations for cache performance. VLDB, 2001.
  • [2] E. Baralis et al. Materialized views selection in a multidimensional database. VLDB, 1997.
  • [3] P. A. Boncz et al. Monetdb/x100: Hyper-pipelining query execution. CIDR, 2005.
  • [4] J. Dittrich et al. Progressive merge join: A generic and non-blocking sort-based join algorithm. VLDB, 2002.
  • [5] G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2), 1993.
  • [6] J. M. Hellerstein et al. Online aggregation. SIGMOD, 1997.
  • [7] S. Idreos et al. The data calculator: Data structure design and cost synthesis from first principles and learned cost models. SIGMOD, 2018.
  • [8] S. Idreos et al. The periodic table of data structures. IEEE Data Eng. Bull., 41(3), 2018.
  • [9] M. L. Kersten et al. Cracking the database store. CIDR, 2005.
  • [10] T. Kersten et al.

    Everything you always wanted to know about compiled and vectorized queries but were afraid to ask.

    PVLDB, 11(13), 2018.
  • [11] C. Koch. Abstraction without regret in data management systems. CIDR, 2013.
  • [12] C. Koch. Abstraction without regret in database systems building: a manifesto. IEEE Data Eng. Bull., 37(1), 2014.
  • [13] V. Leis et al. The adaptive radix tree: Artful indexing for main-memory databases. ICDE, 2013.
  • [14] V. Leis et al. Morsel-driven parallelism: a numa-aware query evaluation framework for the many-core age. SIGMOD, 2014.
  • [15] T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9), 2011.
  • [16] T. Neumann et al. Adaptive optimization of very large join queries. SIGMOD, 2018.
  • [17] S. Richter et al. A seven-dimensional analysis of hashing methods and its implications on query processing. PVLDB, 9(3), 2015.
  • [18] F. M. Schuhknecht et al. The uncracked pieces in database cracking. PVLDB, 7(2), 2013.
  • [19] F. M. Schuhknecht et al. Adaptive adaptive indexing. ICDE, 2018.