Towards a Hands-Free Query Optimizer through Deep Learning

by Ryan Marcus, et al.
Brandeis University

Query optimization remains one of the most important and well-studied problems in database systems. However, traditional query optimizers are complex heuristically-driven systems, requiring large amounts of time to tune for a particular database and requiring even more time to develop and maintain in the first place. In this vision paper, we argue that a new type of query optimizer, based on deep reinforcement learning, can drastically improve on the state-of-the-art. We identify potential complications for future research that integrates deep learning with query optimization and describe three novel deep learning based approaches that can lead the way to end-to-end learning-based query optimizers.






1. Introduction

Query optimization, i.e., transforming SQL queries into physical execution plans with good performance, is a critical and well-studied problem in database systems (e.g. (volcano, ; systemr, )). Despite this long research history, the majority of existing query optimization systems share two problematic properties:

  1. They are, or are composed of, carefully tuned and complex heuristics designed using many years of developer-based experience. Furthermore, these heuristics often require even more tuning by expert DBAs to improve query performance on each individual database (e.g. tweaking optimization time cutoffs, adding query hints, updating statistics).

  2. They take a “fire and forget” approach in which the observed performance of an execution plan is never leveraged by the optimization process in the future, hence preventing query optimizers from systematically “learning from their mistakes.”

Of course, there are several notable exceptions. Many optimizers use feedback from query execution to update cardinality estimates 

(AboulnagaSelftuningHistogramsBuilding1999, ; ChenAdaptiveSelectivityEstimation1994, ; leo, ), and many adaptive query processing systems incorporate feedback (adaptive_qp_rl, ; cuttlefish, ). However, in this vision paper, we argue that recent advances in deep reinforcement learning (DRL) (deep_rl, ) can be applied to query optimization, resulting in a “hands-free” optimizer that (1) can tune itself for a particular database automatically without requiring intervention from expert DBAs, and (2) tightly incorporates feedback from past query optimizations and executions in order to improve the performance of future queries.

Deep reinforcement learning (DRL) is a process in which a machine learns a task through continuous feedback with the help of a neural network. It is an iterative learning process in which the machine (an agent) repeatedly selects actions and receives feedback about the quality of the actions selected. DRL algorithms train a neural network model over multiple rounds (episodes), aiming to maximize the performance of their selected actions (policies). This performance feedback, the indicator of whether an agent is performing well or poorly, is referred to as the reward signal.

While deep learning has been applied to database systems (e.g. indexes (ml_index, ), physical design (selfdrivingcidr, ), and entity matching (deep_entity, )), deep reinforcement learning has not received much attention. Despite applications in multiple domains (deep_rl, ), applying DRL algorithms to query optimization generates a number of research challenges. First, DRL algorithms require extensive training data before achieving good performance, and initially perform very poorly. Second, a general assumption is that the reward signal is cheap to calculate. In query optimization, the most natural performance indicator to use is the query latency. However, training on (and hence executing) large numbers of query plans (especially poorly optimized query plans) and collecting their latency for feedback as a reward signal to a DRL agent can be extremely expensive. Using the optimizer’s cost model as a performance indicator is also problematic, as cost models are themselves complex, brittle, and often rely on inaccurate statistics and oversimplified assumptions. Hence, optimizers output a cost metric that is not strongly correlated with actual query execution times.

Third, the enormous size of the query plan search space for any given query causes naive applications of DRL to fail. For instance, while DRL can be used to learn policies that tackle join order enumeration (rejoin, ), training these models to additionally capture physical operator and access path selection dramatically lengthens the training process and hinders convergence to an effective policy.

In this vision paper, we describe and analyze potential solutions to the above challenges, each representing a direction for future research that tightly integrates deep learning with query optimization. We first propose two novel DRL approaches: learning from demonstration and cost model bootstrapping. The first approach involves initially training a model to imitate the behavior of a state-of-the-art query optimizer, and then fine-tuning that model for increased performance. The second approach involves using existing cost models as guides to help DRL models learn more quickly. Finally, we propose and analyze the design space of an incremental training approach that involves learning the complexities of query optimization in a step-by-step fashion, mirroring the way humans learn algebra before calculus.

We start in Section 2 with a brief introduction to DRL and an overview of a case-study DRL-based join enumerator. In Section 3, we detail the three main challenges that DRL-based query optimizers need to overcome. In Section 4, we analyze our proposed future research directions, and we conclude in Section 5.

2. Background: DRL and REJOIN

Here, we give a brief overview of DRL and discuss the lessons we learned from applying DRL to join ordering enumeration (rejoin, ).

Deep Reinforcement Learning

Reinforcement learning (q, ) assumes that an agent interacts with an environment. The environment tells the agent its current state, s_t, and a set of potential actions, A_t, the agent may perform. The agent selects an action a_t ∈ A_t, and the environment gives the agent a reward r_t based on that action. The environment additionally gives the agent a new state s_{t+1} and a new action set A_{t+1}. This process repeats until the agent reaches a terminal state, where no more actions are available. This marks the end of an episode, after which a new episode begins. The agent’s goal is to maximize its reward over episodes by learning from its experience (previous actions, states, and rewards). This is achieved by balancing the exploration of new, never-before-tried actions with the exploitation of knowledge collected from past actions.

One of the key challenges in applying reinforcement learning to a particular domain is “massaging” the problem into these terms (i.e., designing its actions, states, and rewards). Once formalized in this way, reinforcement learning techniques aim to improve upon an initially random policy, a function that selects an action in a given state. In DRL (ppo, ; dqn, ), this policy is represented as a neural network (dnn, ), and it is improved by adjusting the weights of the network based on the reward signal from the environment.
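The agent/environment interaction loop described above can be sketched in a few lines. The environment below is a deliberately trivial toy (all names are illustrative, not from the paper); a DRL algorithm would replace the random policy with a neural network whose weights are adjusted using the reward signal:

```python
import random

class ToyEnv:
    """A toy 'environment' in the RL sense: it exposes the current state,
    the set of available actions, and a reward per action. The state is
    simply a counter that must reach n (the terminal state)."""
    def __init__(self, n=3):
        self.n = n
        self.state = 0

    def actions(self):
        # An empty action set marks a terminal state (end of the episode).
        return [] if self.state >= self.n else ["advance", "stay"]

    def step(self, action):
        # Reward progress toward the terminal state; penalize stalling.
        if action == "advance":
            self.state += 1
            return self.state, 1.0
        return self.state, -1.0

def run_episode(env, policy):
    """One episode: repeatedly observe the state and action set, select an
    action via the policy, and accumulate reward until no actions remain."""
    total = 0.0
    while env.actions():
        action = policy(env.state, env.actions())
        _, reward = env.step(action)
        total += reward
    return total

random.seed(0)
# An initially random policy, as in the text; learning would improve it
# episode by episode based on the accumulated reward.
total_reward = run_episode(ToyEnv(), lambda s, acts: random.choice(acts))
```

The key point the sketch makes concrete is that the policy only ever sees states and action sets, and learning must come entirely from the scalar reward.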

Figure 1. ReJOIN example
ReJOIN: DRL-based Join Enumeration

Recently, we introduced ReJOIN (rejoin, ), a DRL-based join enumerator. While ReJOIN focused exclusively on join order enumeration (it did not perform operator or index selection), it represents an example of how query optimization may be framed in the terms of reinforcement learning.

ReJOIN performs join ordering in a bottom-up fashion, modeling the problem in the terms of reinforcement learning. Figure 1 shows an example of this process. Each of the relations in the SQL query is initially treated as a subtree. At each step, the set of possible actions contains every possible pair of subtrees. For example, in Figure 1, ReJOIN selects the action [1, 3], so the first and third subtrees are joined. At the next step, ReJOIN selects the action [2, 3], so the second and third of the remaining subtrees are joined. Finally, the action [1, 2] is selected, and the two remaining subtrees are joined. The resulting state of the system is a terminal state, as no more actions can be selected. The resulting join ordering is sent to a traditional query optimizer, and the optimizer’s cost model is used to determine the quality of the join ordering (the reward).
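ReJOIN's bottom-up formulation can be sketched as follows (a loose, 0-indexed reconstruction of the Figure 1 walk-through; function names and the four-relation query are illustrative, not from the ReJOIN implementation):

```python
from itertools import combinations

def available_actions(subtrees):
    """In ReJOIN's formulation, every pair of current subtrees is a
    candidate action: the indices of the two subtrees to join next."""
    return list(combinations(range(len(subtrees)), 2))

def apply_action(subtrees, action):
    """Joining two subtrees removes them and adds a single join node."""
    i, j = action
    joined = ("join", subtrees[i], subtrees[j])
    rest = [t for k, t in enumerate(subtrees) if k not in (i, j)]
    return rest + [joined]

# Start with one subtree per relation and join pairs until a single
# tree remains (the terminal state: no more actions to select).
state = ["A", "B", "C", "D"]
for action in [(0, 2), (1, 2), (0, 1)]:   # illustrative action sequence
    state = apply_action(state, action)
assert len(state) == 1  # a complete join ordering
```

Each episode thus consists of exactly n-1 join actions for an n-relation query, which is why the episode structure stays short even as the action set at each step grows.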

Our analysis of ReJOIN (rejoin, ) yielded interesting conclusions:

  1. While ReJOIN is eventually able to learn a join ordering policy that outperforms PostgreSQL (both in terms of optimizer cost and query latency), doing so requires a substantial, but not prohibitive, training overhead.

  2. ReJOIN’s use of a traditional query optimizer’s cost model as a reward signal allowed join orderings to be evaluated quickly, but it meant that ReJOIN depended on the existence of a well-tuned cost model.

  3. Counter-intuitively, ReJOIN’s DRL algorithm is faster than PostgreSQL’s built-in join order enumerator in many cases. Notably, ReJOIN’s bottom-up algorithm makes only a linear number of decisions per query (one join per step), whereas PostgreSQL’s greedy bottom-up enumerator must score every candidate pair of subtrees at every step.

3. Research Challenges

Inspired by our experience with ReJOIN (rejoin, ) as well as other existing work in the area (qo_state_rep, ), we argue that applying DRL to query optimization is both promising and feasible. However, we next identify three key research challenges that must be overcome in order to achieve an end-to-end DRL-powered query optimizer. Afterwards, we propose a number of new research directions to potentially address these challenges.

Search Space Size

While previous work (rejoin, ) has demonstrated that reinforcement learning techniques can find good policies in limited search spaces (e.g., join order enumeration in isolation), the entire search space for execution plans is significantly larger. The ReJOIN prototype required 8000 iterations to become competitive with the PostgreSQL optimizer, and in that case only join ordering was considered (no index or operator selection, etc.). Accounting for operator selection, access path selection, etc. creates such a large search space that approaches from earlier work cannot be easily scaled up. In fact, a naive extension of ReJOIN to cover the entire execution plan search space yielded a model that did not out-perform random choice even with 72 hours of training time. Theoretical results (near_opt_rl, ) support this observation, suggesting that adding additional non-trivial dimensions to the problem increases convergence time drastically.

Performance Indicator

Deep reinforcement learning algorithms generally make several assumptions about the metric to optimize, i.e., the reward signal, that are difficult to guarantee in the context of query optimization. Abstractly, the metric to optimize in query optimization is the latency of the resulting execution plan. However, we next discuss why using latency as a reward signal leads to two unfortunate complications: query latency offers neither a dense nor a linear reward signal.

Many deep reinforcement learning algorithms (dqn, ; ppo, ) assume that, or perform substantially better when, the reward signal is dense: provided progressively as the environment is navigated, e.g. each action taken by a reinforcement learning agent achieves some reward. Furthermore, they often assume that rewards are linear, i.e. the algorithms attempt to maximize the sum of many small rewards within an episode. Neither of these assumptions hold in the context of query optimization: query latency is not dense (it can only be measured after a plan has been executed), and it is not linear (e.g., subtrees may be executed in parallel). Such a reward function can be difficult to engineer in the context of query optimization, as evaluating the quality of a partial execution plan is a difficult task.

One may reasonably consider using a traditional query optimizer’s cost model as a reward signal instead of query latency directly, as the optimizer’s cost model may appear to provide a dense linear reward. This approach has two major drawbacks. First, these cost models tend to be complex, hand-tuned (both by database engineers and DBAs) heuristics. Using a cost model as the reward signal for a DRL query optimizer simply “kicks the can down the road,” moving complexity and human effort from designing optimization heuristics to tweaking optimizer cost models. Second, the cost model’s estimation of the quality of an execution plan may not always accurately represent the latency of the execution plan (e.g., a query with a high optimizer cost might outperform a query with lower optimizer cost). Therefore, using DRL to find execution plans with a low cost as determined by a cost model might not always achieve the best possible results.

Performance Evaluation Overhead

An often-unstated assumption made by many DRL algorithms is that the reward of an action can be determined in constant time – e.g., that determining the performance of an agent for a particular episode in which the agent performs poorly is no more time-consuming than calculating the reward for an episode in which the agent performs well. For example, the time to determine the current score of a player in a video game does not change based on whether the score is high or low. If the latency of an execution plan is used as a reward signal, this assumption does not hold: poor execution plans can take significantly longer to evaluate than good execution plans (hours vs. seconds). Since traditional DRL algorithms start with no information, their initial policies cannot be better than random choice, which will often result in very poor plans (howgood, ). Hence, a naive DRL approach that simply uses query latency as the reward signal would take a prohibitive amount of time to converge to good results. (We have confirmed this experimentally by swapping the reward signal in ReJOIN (rejoin, ) with query latency. The initial query plans produced were so sub-optimal that they could not be executed in any reasonable amount of time.)

4. Research Directions

Here, we outline potential approaches to handle the challenges we highlighted. First, we discuss two drastically different approaches, demonstration learning and cost-model bootstrapping, which both avoid the pitfalls identified in Section 3 in interesting ways. We then touch upon incremental learning, and propose three techniques that decompose the problem of query optimization in a principled way across various axes, and analyze its design space.

The common intuition behind all of these approaches is to split the learning process into phases, where early phases involve less complexity than later phases. By carefully incrementally increasing the complexity of the task in specific ways, the search space can be narrowed and the difficulties of the optimization metric/reward signal can be more easily managed.

4.1. Learning From Demonstration

Figure 2. Learning from demonstration

One way to avoid the pitfalls of using query latency directly as the performance indicator (reward) for DRL algorithms is learning from demonstration (LfD) (demonstration, ; dqfd, ). Intuitively, this approach works by first training a model to imitate the behavior of an expert. Once this mimicry reaches acceptable levels, the model is fine-tuned by applying it to the actual environment. This learn-by-imitation technique intuitively mirrors how humans learn basic behaviors like language and walking by watching adults, and then fine-tune those behaviors by practicing themselves.

Here, we propose using a traditional DBMS’ query optimizer – such as the PostgreSQL query optimizer – as an expert. In this approach, illustrated in Figure 2, a model is initially allowed to observe how the traditional query optimizer (the expert) optimizes a query. During this phase, the model is trained to mimic the optimizer’s selected actions (e.g., indexes, join orderings, pruning of bad plans, etc.). Assuming that a traditional optimizer will prune out infeasible plans, this process allows a DRL model to learn by observing the execution time of only feasible plans.

Once the model achieves good mimicry, it is then used to optimize queries directly, bypassing the optimizer. In this second phase, the model initially closely matches the actions of the traditional query optimizer, but now begins to slowly fine-tune itself based on the observed query latency. Here, the agent updates its neural network based on the latency of the execution plans it constructs. If the performance of the model begins to slip, it is re-trained to match the traditional query optimizer until performance improves. In practice, choosing the point at which the model is again trained to mimic the traditional query optimizer is critical to improve the performance of the algorithm (dqfd, ). By leveraging learning from demonstration, one can train a query optimization model that learns with small overhead, without having to execute a large number of bad plans, therefore massively accelerating learning.

While specific techniques and formalizations vary (dqfd, ; pretrain_demonstration, ; q, ; demonstration, ), we outline the general process here.

  1. A large query workload, Q, is executed one query at a time. Each query q ∈ Q is transformed by the traditional query optimizer into a physical plan through a number of actions a_i taken at various intermediary states s_i, which are recorded as an episode history:

    H_q = ((s_0, a_0), (s_1, a_1), …, (s_n, a_n))

    For example, at the initial state s_0, a query optimizer performing a greedy bottom-up join order selection process may choose an action signifying that two particular relations should be joined, or a query optimizer that first performs storage selection may choose an action signifying that data for a certain relation should come from a particular index. All episode histories are saved.

  2. The resulting physical plans are executed, and the latency of each query q, denoted L(q), is measured and saved.

  3. Next, the agent is trained, for each q ∈ Q, on the H_q and L(q) data (Phase 1 in Figure 2). Specifically, for each state/action pair (s_i, a_i) ∈ H_q, the agent is taught to predict that taking action a_i in state s_i eventually results in a query latency of L(q). Similar to the off-policy learning approach of (qo_state_rep, ), the agent thus learns a reward prediction function: a function that guesses the quality of a given action at a given state.

  4. Once the agent is proficient at guessing the outcome of the traditional optimizer’s actions, the agent can fine-tune itself. Now, the agent itself creates a query plan for each incoming query q. For a given state s_i, an action is selected by running every possible action through the reward prediction function and selecting the action which is predicted to result in the lowest latency. (In many implementations, an action besides the one predicted to result in the lowest latency may be selected with some small probability (dqn, ) to enable additional exploration.) This process repeats until a physical execution plan is created and executed. The model is then trained (fine-tuned) on the resulting history H_q and the observed latency L(q).

  5. Hopefully, the performance of the model will eventually exceed the performance of the traditional query optimizer. However, if the model’s performance slips, it is partially re-trained with samples from the traditional query optimizer’s choices when processing the queries in the initial workload Q.
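The steps above can be sketched as follows. This is a minimal stand-in, not the paper's implementation: a tabular running average plays the role of the neural-network reward predictor, and all names (`record_demonstrations`, `RewardPredictor`, `choose_action`) are hypothetical:

```python
def record_demonstrations(workload, expert, execute):
    """Phase 1 data collection: for each query, log the expert optimizer's
    (state, action) trace H_q and the measured latency L(q) of its plan."""
    data = []
    for q in workload:
        history, plan = expert(q)          # [(state, action), ...], plan
        data.append((history, execute(plan)))
    return data

class RewardPredictor:
    """Stand-in for the neural network: predicts the eventual latency of
    taking an action in a state (here, a running average per pair)."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def train(self, history, latency):
        # Every (state, action) pair in the episode is credited with the
        # final observed latency, as described in step 3.
        for state, action in history:
            key = (state, action)
            self.sums[key] = self.sums.get(key, 0.0) + latency
            self.counts[key] = self.counts.get(key, 0) + 1

    def predict(self, state, action):
        key = (state, action)
        if key not in self.counts:          # unseen pair: pessimistic guess
            return float("inf")
        return self.sums[key] / self.counts[key]

def choose_action(model, state, actions):
    """Step 4: pick the action predicted to yield the lowest latency."""
    return min(actions, key=lambda a: model.predict(state, a))
```

A real implementation would replace the table with a network generalizing across states, and would occasionally pick a non-greedy action for exploration, as noted in step 4.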

Since the behavior of the model in the second phase should not initially stray too far from the behavior of the expert system (dqfd, ), we do not have to worry about executing any exceptionally poor query plans. Additionally, since the second training phase only needs to fine-tune an already-performant model, the delayed reward signal is of far less consequence. In fact, the initial behavior of the model may outperform the traditional query optimizer in certain circumstances: for example, if the trained model observes a systemic error in the behavior of the traditional optimizer, such as handling two similar situations in two significantly different ways, one of which causes substantially increased query latency. In this case, the trained model may automatically avoid the errors of the traditional optimizer (which has no capability to learn from its mistakes) through observation alone.

An important issue here is that, since the experience collected from the traditional optimizer necessarily covers a narrow part of the action space (it excludes “bad” plans, and thus also excludes the corresponding sequences of actions that would produce them), many state-action pairs have never been taken and have no training data to ground them in realistic costs. For instance, a nested-loop join or a table scan may never or rarely be picked by the traditional optimizer for a particular workload/database, and hence the model does not learn how to evaluate these actions correctly. However, since the model is trained on experiences containing significantly faster execution plans, there is no reason for the model to attempt to explore these extremely poor plans.

Experimental results from other problem domains (e.g. arcade games and a few systems applications) (dqfd, ; demonstration, ; lift, ), show that deep reinforcement learning agents which initially learn from demonstration can master tasks with significantly less training time than their tabula rasa counterparts. This result holds even when the expert is flawed (e.g. when the expert is a human player who does not know a particular shortcut or strategy), implying that learning-from-demonstration techniques can improve upon, and not just imitate, existing expert systems.

4.2. Cost Model Bootstrapping

Figure 3. Cost Model Bootstrapping

A traditional, but still widely used and researched, approach to improving the performance of reinforcement learning algorithms on problems where the performance indicator (reward) is only available at the end of an episode (i.e., sparse) is to craft a heuristic reward function. This heuristic reward function estimates the utility of a given state using a heuristic constructed by a human: for example, a robot learning to navigate a maze may use an “as-the-crow-flies” heuristic to estimate its proximity to the maze’s exit. In the game of chess, a popular heuristic to evaluate a particular board position is to count the number of pieces captured by both sides. Sometimes, this heuristic may be incorrect (e.g., it may rate a dead-end very near the exit as a desirable position, or it may highly rate a board position in which many pieces have been captured but the opponent has an obvious winning move), but in general there is a strong relationship between the value of the heuristic function and the actual reward.

Luckily, the database community has invested significant effort into designing optimizer cost functions, which can be used for exactly this purpose. While not perfect, modern cost models, like “as-the-crow-flies” distance, can normally differentiate between good and catastrophic actions. We thus propose using these cost models as heuristic reward functions. This approach, depicted in Figure 3, involves initially using the optimizer’s cost model as a reward signal (Phase 1) and then, once training has converged, switching the reward signal to the observed query latency (Phase 2). In this way, the optimizer’s cost model acts as “training wheels,” allowing the DRL model to explore strategies that produce catastrophic query plans without having to execute them. Once the DRL model has stabilized and starts to pick predominantly good plans, the “training wheels” can be removed and the DRL model can fine-tune itself using the “true” reward signal, query latency.
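The two-phase reward switch can be sketched as a simple closure (a hypothetical sketch: in practice the switch-over would be triggered by a convergence test rather than a fixed episode count, and `cost_model`/`measure_latency` stand in for DBMS hooks):

```python
def make_reward(cost_model, measure_latency, warmup_episodes):
    """Phase 1: reward comes from the optimizer's cost model (cheap, no
    plan execution needed). Phase 2: after the warmup, reward switches to
    measured latency. Rewards are negated so lower cost/latency is better."""
    episode = 0
    def reward(plan):
        nonlocal episode
        episode += 1
        if episode <= warmup_episodes:
            return -cost_model(plan)       # "training wheels" phase
        return -measure_latency(plan)      # true objective
    return reward
```

The design choice worth noting is that catastrophic plans are only ever scored, never executed, during Phase 1.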

Cost model bootstrapping brings about a number of complications which require further exploration by the database community. Generally, an optimizer’s cost model output is a unitless value, meant to compare alternative query plans but not meant to directly correlate with execution latency. For example, an optimizer’s cost estimate for a set of query plans may range from 10 to 50, but the latency of these query plans may range from 100s to 200s. Switching the range of the reward signal from 10–50 to 100–200 will cause the DRL model to assume that its performance has suddenly decreased (the DRL model was getting query plans with costs in the range 10–50 in Phase 1, and at the start of Phase 2 the costs suddenly jump to the range 100–200; the change in variance will also have a significant detrimental effect (batchnorm, )). This sudden change could cause the DRL model to begin exploring previously-discarded strategies, requiring the execution of poor execution plans.

One way to potentially fix this issue would be to tune the units of the cost model to more precisely match execution latency, but the presence of cardinality estimation errors makes this difficult (howgood, ). Instead of adjusting the optimizer’s estimates to match the query latency, another approach could be to adjust the query latency to match the optimizer cost. This could be implemented by simply scaling the query latency observed in Phase 2 to fall within the range of cost model estimates observed in Phase 1.

One could implement this scaling by noting the optimizer cost estimates and query execution latencies during the end of Phase 1 (when the DRL model has converged). Let C_max and C_min be the maximum and minimum observed optimizer costs, and let L_max and L_min be the maximum and minimum observed query execution times. Then, in Phase 2, when the DRL model proposes an execution plan with an observed latency of L, the reward could be

    C_min + ((L − L_min) / (L_max − L_min)) × (C_max − C_min)
This scaling could be done linearly, as above, or using a more complex (but probably monotonic) function. This simple solution would likely need to be adjusted to handle workload shifts, changes in hardware, changes in physical design, etc.
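As a concrete sketch of this linear rescaling (function and parameter names are illustrative), using the 10–50 cost range and 100s–200s latency range from the example above:

```python
def scale_latency(latency, c_min, c_max, l_min, l_max):
    """Linearly map an observed latency into the range of optimizer cost
    estimates seen at the end of Phase 1, so the reward signal's scale
    does not jump when switching from cost to latency."""
    frac = (latency - l_min) / (l_max - l_min)
    return c_min + frac * (c_max - c_min)

# A latency in the middle of the observed range maps to the middle of
# the observed cost range.
scaled = scale_latency(150.0, c_min=10.0, c_max=50.0, l_min=100.0, l_max=200.0)
```

A production version would need to clamp or re-fit the observed ranges as the workload, hardware, or physical design shifts, as noted below.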

Another potential approach, partially suggested in (KrishnanLearningOptimizeJoin2018, ), is to first train a neural network model to optimize for the optimizer’s cost, and then transfer the weights of the middle layers of that network into a new network that trains directly on query latency. Hopefully, the transferred weights help the new network converge to a good policy more quickly. However, there are no guarantees that selectively transferring the weights of some layers from one network to another will lead to this behavior – it is possible that this new network will initially produce extremely poor plans that cannot be realistically executed. In the broader machine learning literature, this technique is known as “transfer learning” (transfer, ; transfer2, ).

4.3. Incremental Learning

In this section, we discuss potential techniques to incrementally learn query optimization by first training a model to handle simple cases and slowly introducing more complexity. This approach makes the extremely large search space more manageable by dividing it into smaller pieces. Similar incremental approaches have shown success in other applications of reinforcement learning (dex, ; incremental_rl, ; transfer_rl, ).

Figure 4. Complexity diagram

We begin by first explaining how the task of query optimization can be decomposed into simpler pieces in a number of ways. We note that the difficulty of a query optimization task is primarily controlled by two dimensions: the number of relations in the query, and the number of optimization tasks that need to be performed. This is illustrated in Figure 4. The first axis is the number of relations in the query. If a DRL model must optimize queries containing only a single relation, then the search space of query plans is very small (there are no join orderings or join operators to consider, etc.). However, if a DRL model must optimize queries containing many relations, then the search space is much larger.

The second axis is the number of optimization tasks to perform. Consider a simplified query optimization pipeline (illustrated in Figure 6) containing four phases: join ordering, index selection, join operator selection, and aggregate operator selection. Performing any prefix of the pipeline is a simpler task than performing the entire pipeline: e.g., determining a join ordering and selecting indexes is a simpler task than determining a join ordering, selecting indexes, and determining join operators.

Figure 5. Potential decompositions

Thus, the lower-left hand side of Figure 4 corresponds to “easy” cases, e.g. few stages of the pipeline and few relations. The upper-right hand side of Figure 4 corresponds to “hard” cases, e.g. most stages of the pipeline and many relations. This insight illuminates a large design space for incremental learning approaches. In general, an incremental learning approach will be divided into phases. The first phase will use “easier” cases (the bottom left-hand part of the chart), training until relatively good performance is achieved. Then, subsequent phases will introduce more complex examples to the model, allowing the model to slowly and smoothly learn more complex cases (the top right-hand part of the chart).

Figure 5 illustrates three simple incremental learning approaches, with light colors representing the initial training phases and dark colors representing the subsequent training phases. We next discuss each of these approaches in detail.

4.3.1. Increasing optimization actions (pipeline)

Figure 6. Learning incrementally

Our first proposed approach is pipeline-based incremental learning, illustrated in Figure 6. A model is first trained on a small piece of the query optimization pipeline, e.g. join order selection. During this first phase, traditional query optimization techniques are used to take the output of the model and construct a complete execution plan (ReJOIN (rejoin, ) is essentially this first phase). Once the model achieves good performance in this first phase, the model is then slightly modified and trained on the first two phases of the query optimization pipeline, e.g. join order selection and index selection. This process is repeated until the model has learned the entire pipeline.

Extending ReJOIN to support this approach would be relatively straightforward. As shown in (rejoin, ), the first phase of query optimization (join order enumeration) can be effectively learned. Once this initial training is complete, the action space can be extended to support index selection: instead of having one action per relation, the extended action space would have one action per relational data structure, e.g. one action for a relation’s B-tree index, one action for a relation’s row-order storage, one action for a relation’s hash index, etc. The knowledge gained from the previous training phase should help the model train significantly faster in subsequent phases.
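The action-space extension described above can be sketched as follows (a hypothetical illustration: the access-path names and the function are ours, not ReJOIN's API):

```python
def action_space(relations, phase):
    """Pipeline-based incremental learning, sketched: in the first phase
    the model only chooses relations (join ordering); in the next phase
    each relation expands into one action per relational data structure
    (access path), as described in the text."""
    access_paths = ["row_storage", "btree_index", "hash_index"]  # illustrative
    if phase == 1:
        return list(relations)
    return [(r, p) for r in relations for p in access_paths]

phase1 = action_space(["A", "B"], phase=1)   # one action per relation
phase2 = action_space(["A", "B"], phase=2)   # one action per (relation, path)
```

The intent is that weights learned over the phase-1 actions transfer to phase 2, since every phase-2 action is a refinement of a phase-1 action.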

The pipeline approach has the advantages of incremental learning (e.g., a manageable growth of the state space), but comes with several drawbacks that need further investigation. First, the early training phases require access to a traditional implementation of the later stages of the query optimization pipeline. While such implementations are available in a range of DBMSes today, the dependency on a traditional query optimizer is not ideal. Second, each phase of the training process will not bring a uniform increase in complexity. It is conceivable that some stages of the pipeline are fundamentally more complex than others (for example, join order selection might be more difficult than aggregate operator selection). This non-linearity of complexity across the query optimization pipeline means that some training phases will require overcoming much larger jumps in complexity than others. This could result in unpredictable training times or, in the worst case, a jump in complexity too large to learn all at once.

4.3.2. Increasing relations

Figure 7. Learning from small examples

While the previous approach reduces the size of the search space by focusing on larger and larger parts of the query optimization pipeline, this section proposes limiting the search space by focusing on larger and larger queries. The proposed approach is depicted in Figure 7. In the first training phase, the model learns to master queries over a single relation. In subsequent training phases, the model is trained on queries over two relations, then three relations, etc. In each phase, the entire query optimization pipeline is performed.

This approach avoids some pitfalls of the pipeline stage approach. Generally, the increase in complexity between optimizing a query with n relations and optimizing a query with n+1 relations is small. Even though there is an exponential increase in the number of potential join orderings, this is a “quantitative” change as opposed to a “qualitative” change – intuitively, it is easier to learn how to create a join plan with a single additional relation than it is to learn how to perform a new pipeline step.

A major challenge of this approach is finding candidate queries. Generally, real-world workloads will contain very few queries over a single relation. Even synthetic workloads have very few low-relation-count queries (TPC-H (tpch, ) has only two such templates, JOB (howgood, ) has none). Queries with low relation counts could be synthetically generated, but doing so while matching the characteristics of real-world workloads is a complex task.
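Assuming a workload where each query exposes the set of relations it touches (a hypothetical representation, chosen here only for illustration), organizing it into the curriculum described in this section is straightforward: group queries by relation count and train on the groups in increasing order.

```python
from collections import defaultdict

def bucket_by_relation_count(queries):
    """Group queries by how many relations they join; the curriculum then
    trains on single-relation queries first, then two-relation queries,
    and so on. Returns the buckets ordered by relation count."""
    buckets = defaultdict(list)
    for q in queries:
        buckets[len(q["relations"])].append(q)
    return [buckets[k] for k in sorted(buckets)]
```

As noted above, the practical difficulty is not this bookkeeping but the scarcity of low-relation-count queries in real and synthetic workloads to fill the early buckets.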

4.3.3. Hybrid

The last approach we explicitly discuss is the hybrid approach, depicted on the right-hand side of Figure 5. In this hybrid approach, the initial training phase learns only the first step of the query optimization pipeline (e.g. join order selection) using only queries over two or fewer relations. The next training phase introduces both another step of the pipeline (e.g. index selection) and queries over three or fewer relations. After all stages of the query optimization pipeline have been incorporated, subsequent training phases increase only the number of relations considered. The advantage of this approach is that it provides the smallest increase in complexity from one training phase to the next. However, the hybrid approach suffers from some of the disadvantages of both the relation-based and the pipeline-based approaches.
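The hybrid schedule just described can be written down explicitly. The sketch below, under the assumption of a fixed ordered list of pipeline stages, emits one (enabled stages, relation cap) pair per training phase: each early phase adds one stage and one relation, and once every stage is enabled, later phases only raise the relation cap.

```python
def hybrid_schedule(pipeline_stages, max_relations, start_relations=2):
    """Return the hybrid curriculum as (stages_enabled, relation_cap)
    pairs. Phase 1 enables only the first stage with queries over at
    most `start_relations` relations; subsequent phases grow both axes
    until the pipeline is complete, then grow only the relation cap."""
    schedule = []
    stages, cap = 1, start_relations
    while cap <= max_relations:
        schedule.append((min(stages, len(pipeline_stages)), cap))
        stages += 1
        cap += 1
    return schedule
```

For a three-stage pipeline (e.g. join ordering, index selection, operator selection) and queries of up to five relations, this yields four phases, the last of which exercises the full pipeline.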

5. Conclusion

We have argued that recent advances in deep reinforcement learning open up new research avenues towards a “hands-free” query optimizer, potentially improving the speed of relational queries and significantly reducing the time spent tuning heuristics by both DBMS designers and DBAs. We have identified how the large search space, delayed reward signal, and costly performance indicators pose substantial hurdles to naive applications of DRL to query optimization. Finally, we have analyzed how recent advances in reinforcement learning, from learning from demonstration to bootstrapping to incremental learning, open up new research directions for directly addressing these challenges. While query optimization is a good candidate for applying DRL to database internals, a wide variety of other core DBMS concepts (e.g., cache management, concurrency control) could benefit from applications of machine learning as well. Careful applications of machine learning across the entire DBMS, not just the query optimizer, could bring about a massive increase in performance and capability.


  • [1] A. Aboulnaga and S. Chaudhuri. Self-tuning Histograms: Building Histograms Without Looking at Data. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99, pages 181–192, New York, NY, USA, 1999. ACM.
  • [2] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A Brief Survey of Deep Reinforcement Learning. IEEE Signal Processing Magazine, 34(6):26–38, Nov. 2017.
  • [3] Y. Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, ICML WUTL ’12, pages 17–36, June 2012.
  • [4] O. Buffet, A. Dutech, and F. Charpillet. Incremental Reinforcement Learning for Designing Multi-agent Systems. In Proceedings of the Fifth International Conference on Autonomous Agents, AGENTS ’01, pages 31–32, New York, NY, USA, 2001. ACM.
  • [5] C. M. Chen and N. Roussopoulos. Adaptive Selectivity Estimation Using Query Feedback. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, SIGMOD ’94, pages 161–172, New York, NY, USA, 1994. ACM.
  • [6] G. V. de la Cruz Jr, Y. Du, and M. E. Taylor. Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning. arXiv:1709.04083 [cs], Sept. 2017.
  • [7] N. Erickson and Q. Zhao. Dex: Incremental Learning for Complex Environments in Deep Reinforcement Learning. arXiv:1706.05749 [cs, stat], June 2017.
  • [8] G. Graefe and W. J. McKenna. The Volcano Optimizer Generator: Extensibility and Efficient Search. In Proceedings of the Ninth International Conference on Data Engineering, ICDE ’93, pages 209–218, Washington, DC, USA, 1993. IEEE Computer Society.
  • [9] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from Demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, AAAI ’18, New Orleans, 2018.
  • [10] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML ’15, pages 448–456, Lille, France, 2015.
  • [11] T. Kaftan, M. Balazinska, A. Cheung, and J. Gehrke. Cuttlefish: A Lightweight Primitive for Adaptive Query Processing. arXiv preprint, Feb. 2018.
  • [12] M. Kearns and S. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 49(2-3):209–232, Nov. 2002.
  • [13] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 489–504, New York, NY, USA, 2018. ACM.
  • [14] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to Optimize Join Queries With Deep Reinforcement Learning. arXiv:1808.03196 [cs], Aug. 2018.
  • [15] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How Good Are Query Optimizers, Really? Proc. VLDB Endow., 9(3):204–215, Nov. 2015.
  • [16] R. Marcus and O. Papaemmanouil. Deep Reinforcement Learning for Join Order Enumeration. In First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM ’18, Houston, TX, June 2018.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [18] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, pages 19–34, New York, NY, USA, 2018. ACM.
  • [19] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning State Representations for Query Optimization with Deep Reinforcement Learning. In 2nd Workshop on Data Management for End-to-End Machine Learning, DEEM ’18, 2018.
  • [20] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-Driving Database Management Systems. In 8th Biennial Conference on Innovative Data Systems Research, CIDR ’17, 2017.
  • [21] M. Poess and C. Floyd. New TPC Benchmarks for Decision Support and Web Commerce. SIGMOD Record, 29(4):64–71, Dec. 2000.
  • [22] S. Schaal. Learning from Demonstration. In Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS’96, pages 1040–1046, Cambridge, MA, USA, 1996. MIT Press.
  • [23] M. Schaarschmidt, A. Kuhnle, B. Ellis, K. Fricke, F. Gessert, and E. Yoneki. LIFT: Reinforcement Learning in Computer Systems by Learning From Demonstrations. arXiv:1808.07903 [cs, stat], Aug. 2018.
  • [24] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, Jan. 2015.
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], July 2017.
  • [26] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In J. Mylopolous and M. Brodie, editors, SIGMOD ’89, pages 511–522, San Francisco, CA, 1989. Morgan Kaufmann.
  • [27] M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2’s LEarning Optimizer. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages 19–28, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
  • [28] M. E. Taylor and P. Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
  • [29] K. Tzoumas, T. Sellis, and C. Jensen. A Reinforcement Learning Approach for Adaptive Query Processing. Technical Report, 08, June 2008.
  • [30] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • [31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How Transferable Are Features in Deep Neural Networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS ’14, pages 3320–3328, Cambridge, MA, USA, 2014. MIT Press.