The rise of cloud-computing platforms has offered new capabilities and benefits to users, including the capability to provision the appropriate amount of resources necessary for a given task. In order to effectively utilize this capability, Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) users need tools and techniques to quantify the tradeoffs between resource costs and performance benefits. Such techniques are also invaluable to PaaS and Software-as-a-Service (SaaS) cloud providers, since they can reduce operational costs while meeting Service-Level Agreements (SLAs).
Identifying optimal or near-optimal resource configurations is a difficult task. The challenges are three-fold: unknown workload characteristics, a large state space, and costly evaluation of what-if scenarios. Although resource utilization may be observed for repeating jobs, one may have to resort to intrusive profiling (not feasible in production environments) or A/B testing (which requires additional resources) to estimate tradeoffs with different configurations. Resource provisioning for new jobs is even more challenging. For these reasons, configuration tuning is traditionally performed by database administrators. However, this approach is tedious, requires a high level of domain expertise, and does not scale to cloud platforms that need to handle millions of databases and different service tiers.
In this paper, we tackle this problem by studying how Machine Learning (ML) techniques can be applied to estimate cost-performance tradeoffs for query processing in an RDBMS. To focus our study, we use Microsoft SQL Server as the RDBMS, which is available as cloud IaaS and PaaS offerings [azuresql] and as an on-premises solution [sqlserver-2019]. We also choose to study the configuration of a particular resource, the query Degree Of Parallelism (DOP): this is the maximum number of hardware threads (logical cores) that can be used at any time for executing the query. Estimates of cost-benefit tradeoffs for different DOPs allow the provisioning for on-premises server configurations, IaaS compute sizes, and PaaS service tier selections—Azure SQL Database [azuresqldb-paas] offers different service tiers that allow users to choose a specific number of (logical) cores [azuresql-vcore-service-tiers, azuresql-vcore-resource-limits], or a range of cores [azuresql-serverless-tier], or a pre-configured size with cores bundled with other resources [azuresql-dtu-service-tiers, azuresql-dtu-resource-limits].
Figure 1 shows the speedup and costup of 22 TPC-H1000 queries (on non-partitioned data) on SQL Server, running on our system (details in Section 4.1), for different values of the total number of logical cores allowed for the workload. Here the speedup and cost are normalized with respect to the performance and cost for a single core (DOP1). For this example, we define cost as the number of provisioned cores multiplied by the total time. The figure assumes optimal per-query DOP selections, with DOP #provisioned cores. Doubling the number of cores from 20 to 40 improves performance by 76% with a 14% increase in cost, while a further doubling to 80 cores additionally increases performance by 8% with an 85% increase in cost. These estimates can be used to select service tiers depending on whether the business value of the additional performance outweighs the costs.
Choosing a DOP Query DOP is a critical knob that affects query performance and resource utilization of multicore servers. However, choosing the optimal or near-optimal DOP is not easy. For well-parallelizable queries, performance improves as DOP increases, but diminishing returns are observed beyond some point depending on the database instance and query characteristics. For other queries, increasing DOP could be detrimental to performance due to synchronization overheads and additional resource requirements (e.g., memory), resulting in potential loss of business value, in addition to increasing cost of operations. In general, there is no single DOP choice that is optimal across all queries: as we will see in Section 2, one can obtain significant improvements with a per-query DOP selection. As a result, developing automated techniques that can make fine-grained resource allocation decisions can unlock further opportunities for performance benefits.
Recent work [akdere2012learning, li2012robust, marcus2019plan] has explored the use of various ML techniques on the problem of query performance prediction (QPP). On the surface, the problem of DOP tuning at query-level can be reduced to QPP with DOP being considered as an additional feature of the model: given a query and a set of DOP values of interests, we can then select the DOP value such that the estimated query execution time of the ML model satisfies the specified resource and performance requirements. However, directly applying such techniques, that consider only a single DOP value (usually, 1), is not feasible in our setting. In particular, both [li2012robust] and [marcus2019plan] estimate the overall query resource/latency by aggregating/combining predicted per-operator latencies. But such information may not be readily available or easy to use with complex systems that have multithreaded query execution (due to the potential for inter-operator overlap), runtime optimizations, multiple processing modes (e.g., row/batch mode) per operator, etc.
Based on the above observation, we posit that it is necessary to consider the query plan as a whole without using operator-level execution time information for ML-based DOP tuning. An interesting outcome of our work is that approaches that use a simple plan representation are very effective in many cases when they are applied to DOP tuning while having been dismissed in previous relevant work such as QPP.
Our Contribution In this work, we study the problem of automating fine-grained DOP tuning using ML-based techniques. To the best of our knowledge, this is the first work to study this problem in the context of an RDBMS. Our contributions can be summarized as follows:
We emphasize the importance of DOP tuning and show the potential benefits of fine-grained per-query DOP tuning (Section 2).
We formulate the DOP tuning problem as a regression task and study how to best represent each query as a feature vector (Section3). We perform a detailed experimental evaluation of different ML models using well-known decision support benchmarks (TPC-H [poess2000new] and TPC-DS [nambiar2006making]) consisting of queries generated from different query templates, database instances and scale factors.
In order to evaluate the performance of the ML models, we use four generalization levels based on different application scenarios: to queries from the same template, to queries from a new template, to instances of different scale, and to different instances and queries. We also use both task-agnostic and task-specific performance metrics to evaluate the utility of each model. Our experiments show that a simple while proper featurization along with tree-ensemble models is effective for DOP Tuning in most cases. We also experimentally analyze the causes of systematic failures, and discuss about possible improvement as our future work (Section 4).
2 Degree of Parallelism
In SQL Server, the query Degree Of Parallelism (DOP) is the maximum number of hardware threads or logical processors that can be used at any time to execute the query. Intra-query parallel execution is inspired by the Volcano [graefe1990encapsulation] operator model. The physical operators in a query plan can be both serial and parallel, but all parallel operators have the same DOP, which is fixed at the start of the query execution. Being able to determine the optimal DOP in one shot at (or before) query startup is important since a suboptimal DOP cannot be fixed without terminating query execution and restarting it with a different DOP.
SQL Server provides mechanisms to configure the DOP both at query optimization time and query startup time. In this work, we focus only on the latter option. The default DOP in the version of SQL Server we use, and for the machines used for our experiments, is 64. Thus, we choose to use the query performance at DOP 64 as the baseline in our experiments. More details regarding our experimental setup are in Section 4.1.
As we discuss in the introduction, each query performs differently as the DOP increases. Figure 2 shows the performance profiles for 4 different queries on TPC-DS with scale factor 1000 (see Section 4 for details about our experimental setup and workloads). Queries Qb and Qc are well-parallelizable and their performance improves as DOP increases, but with diminishing returns beyond some point. Query Qa sees no benefit from parallelism. Query Qd loses performance at high DOPs due to disk spills caused by increased memory requirement that exceeds the default limit for a single query. Query Qd(m) shows how Qd perform differently as the DOP increases by intentionally giving enough memory to avoid disk spilling, indicating that memory is an important factor which can influence the query behavior at different DOP. One can also observe that there is no single DOP choice that is optimal across all four queries.
Figure 3 shows workload performance (details in Section 4.1) as a function of the DOP value, with the same DOP for all parallel queries, relative to the performance with the default DOP of 64 for our setup. For most workloads, performance improves with DOP, starting from DOP=1, and attains the peak at DOP values that are less than the default (64) and the maximum value (80). The optimal value is workload-specific and can result in substantial performance gains over the default value for some workloads.
Further gains are possible by selecting optimal DOPs on a per-query basis. Figure 4 compares workload-level speedups possible with per-query DOP selections, per-workload DOP selections, and a static selection of DOP=40 (equal to the number of physical cores in our server). All speedups values are with respect to the static (default) value of DOP=64. In line with observations from Figure 3, DOP=40 speeds up some workloads (e.g., H 1000), but slows down some others (e.g., H 300), and a per-workload optimal DOP can significantly improve performance (e.g., DS 300). All workloads also show additional speedups, ranging from 4% (H 300) to 15% (DS 1000) beyond what is possible with a workload-optimal selection.
Just as the per-workload optimal DOP differs across workloads, the per-query optimal DOP values within each workload also show substantial variation. Figure 5 shows more variation for the DS workloads than the H workloads, which is due to a larger variety of query templates in DS. No workload has a single per-query optimal DOP value, indicating the potential for speedup with per-query DOP selections as we observed in Figure 4. The average and median shift towards larger DOP values as scale factors, and consequently, size of datasets increase.
Based on the above, we can observe the following:
3 The Main Components
In this section, we formally describe our approach to solving the DOP tuning problem. We first formally state the ML task we are solving. Then, we describe how we featurize the input data to the ML models and present the ML models we consider in this work. Finally, we discuss different ways to evaluate how well our ML models generalize under different application scenarios.
Regression over Classification At first glance, it is natural to think of tuning query DOP as a classification
problem: given a query and a set of DOP values, classify the query into the DOP class that achieves optimal performance.
We choose instead to cast the problem as a regression problem for two reasons.
First, the performance difference of a query at neighboring DOP values can be very small (see for example Figure
At first glance, it is natural to think of tuning query DOP as a classification problem: given a query and a set of DOP values, classify the query into the DOP class that achieves optimal performance. We choose instead to cast the problem as a regression problem for two reasons. First, the performance difference of a query at neighboring DOP values can be very small (see for example Figure2), which means that accurate classification may not be possible. Second, formulating DOP tuning as a regression problem allows us to use the learned model in applications beyond DOP selection, such as resource provisioning. We also note that the capability to predict the estimated time of a query at different DOPs might be helpful in order to perform DOP selection for concurrent query execution, which we consider as future work.
3.1 Problem Formulation
We are given a database instance over a schema , and a workload over the instance , where each is a compiled query plan generated by the RDBMS. For our purposes, a query plan is a tree where each node is a physical operator (e.g., Index Scan, Hash Join, Sort); additionally, for each node we are given certain information about the operator execution (e.g., estimated rows read, whether it runs in parallel mode, etc). Figure 6 shows an example of such a query plan. In the next section, we will show how to map the query plan to a feature representation .
For each plan , we also have a measurement of the execution time of the plan over instance with DOP , where comes from a set of DOP values of interest. We denote this time with . Every such measurement corresponds to a training point , where , and .
We can now cast our problem as a regression task. In particular, the goal is to use the training points in order to learn a function (model instance) of a model that minimizes the quantity
is a loss function. In this paper, we will use as a loss function themean squared error (MSE), . We will use to denote the estimated execution time using the learned model instance.
Using the learned model instance , we can now solve the DOP tuning problem. We distinguish two subproblems.
DOP Selection at Workload Level In practice, users in many cases configure the DOP at workload-level, in which the same DOP is selected for all the queries in the workload. In this case, the task is to choose a single DOP,
DOP Selection at Per-Query Level Further performance gains are possible if we perform DOP selection on a per-query basis. In this case, the task is to select for each query plan a DOP
In this section, we describe how we map a query plan to a feature vector of fixed dimension in a similar manner to what is presented in [ding2019ai]. Compared to Ding et al. [ding2019ai], our featurization contains richer information by which the characteristics of a given query plan could be more accurately captured. We propose different featurization alternatives for a query plan – we study the effects of different featurization choices in Section 4.2.
We next detail our featurization process. A simplified example of the process is depicted in Figure 6. Recall that is a tree, where each node in the tree consists of a physical operator along with its processing mode and information about its estimated runtime performance. Each physical operator can execute in different modes: parallel or serial, and batch or row. Since different modes handle parallelism differently, it is critical that we encode them in the feature vector. For example, it is important to distinguish between operators that use a parallel implementation versus the ones with a non-parallel implementation. To achieve this, we construct a composite key for each operator instance in the form of
The vector adds additional attributes that are only applicable for certain operators. For example, IsAdaptive and EstimatedJoinType apply only for the operator AdaptiveJoin. In addition, in the case where the same physical operator can be used to implement different logical operators (e.g., HashMatch can be used for join, aggregation or union), encodes the logical type as well.
For each composite key, the feature vector allocates space to encode different types of execution measures – each measure is summed across all nodes in the plan with the same composite key.
Cardinality (card): cardinality estimates computed by the optimizer expressed in bytes (e.g., estimated output size in bytes).
Cost (cost): cost-based estimates calculated by the optimizer’s cost model (e.g., estimated CPU cost, estimated I/O cost).
Count (count): takes value 1 if the operator has the composite key, otherwise 0.
Weight (weight): each node is assigned a weight which is computed recursively from the leaf nodes. The weight of a leaf node is the estimated output size in bytes, while for a non-leaf node it is the sum of product of weights and height of all its children. The weight feature encodes structural information (see [ding2019ai]).
3.3 ML Models
We briefly describe the ML models we use in our approach.
Linear Regression (LR)
We consider Linear Regression as one of the baseline ML models due to its simplicity and efficiency. Specifically, we use its regularized version,elastic net, which linearly combines L1 and L2 regularization.
Random Forest (RF) Random Forest is an ensemble learning method exploiting bagging
as its ensemble algorithm. Since multiple decision trees are built/traversed independently during the model construction/inference phase, this allows for efficient training/testing in a multicore setting. The fact that multiple trees built on different randomly sampled data also makes it more robust compared to a single decision tree, enabling it to consider complex interactions among different features while mitigating overfitting.
XGBoost XGBoost [chen2016xgboost]
is an efficient implementation of gradient boosting with a set of optimizations that exploit any available parallelism, achieve faster model convergence by using the second-order gradients approach Newton Boosting and mitigate the overfitting issue of gradient boosting by imposing better regularization techniques.
Multi-Layer Perceptron (MLP)hidden layers in which each layer has neurons. RELU (
) is used as the activation function. We use the Adam optimizer for training[kingma2014adam].
Exploration of Other Models Motivated by the promising results in [marcus2019plan]
with explicit exploitation of the query plan structure and operator-level query latency, we investigated a rich set of DNN-based models that are designed to capture the spatial (e.g., nearby pixels in images) or structural (e.g., syntactic tree structure in natural languages) information of the data. Specifically, we explored models that have been applied successfully to a set of natural language processing (NLP) related tasks including convolutional neural network (CNN)[kim2014convolutional]sundermeyer2012lstm] and Tree-LSTM [tai2015improved], regarding each operator instance as a special word. We observed that none of these models outperform MLP, the simplest DNN-based model, considering both runtime efficiency and prediction performance. We also implemented and explored QPPNet [marcus2019plan], including the operator-level elapsed time. QPPNet failed to learn the relation between query latency and DOP (see example given in Figure 7). The failure could be attributed to a series of possible reasons, such as the difference of the studied system environment (e.g., PostgreSQL vs SQL Server ), the assumptions QPPNet relying on as discussed in the introduction, the not well-defined per-operator latency in our study environment, and the different sizes of query plans (more details in Figure 18). Our models exploit the schema and database instance agnostic featurization, while QPPNet is tested on schema-dependent features. Deeper investigation of the failure of applying QPPNet in DOP tuning is future work. Due to the above observations, we choose to use MLP as the DNN-based model representative and only present the comparison results of its with that of other models.
shows QPPNet does converge after training for less than 25 epochs on 80% of the TPC-DS1000 plan-dop pairs (Generalization Lvl1). But(b)b suggests that QPPNet is ignorant of different DOP values - it always gives the same latency prediction for the same query regardless of DOP changes, while RF is able to capture the actual speedup curve accurately.
Different application scenarios exhibit different degree of commonality, i.e., how much similarity is observed across a query workload. The commonality here refers to both the queries themselves (e.g., SQL statements and query plans), as well as the input data. Given a specific application, having a good understanding of the degree of commonality is crucial when attempting to apply ML-based techniques for DOP tuning, since it helps practitioners to understand the capabilities and limitations of the models that are learned from the underlying data and queries. More specifically, given training data (pairs of query plans plus runtime), we should first try to answer the following questions:
How similar are future queries to queries in the training data?
How much can the input relations change (in terms of both schema and scale) compared to the input relations in the training data?
The answers to the above questions are different when considering different applications. For example, a supermarket chain might run the same set of queries daily to analyze its sales, and the size and the distribution of the input data to the queries might be relatively stable across most of the days, with the exception of a few promotion days such as Black Friday and Christmas. On the other hand, analysts in different teams might run different queries on the same internal data for different data analysis tasks. In addition, the scale of the input data may increase after a period of time as new data arrives daily.
To capture this differentiation, we categorize the training data and test data into four generalization levels, considering the query templates, the schema of input tables and the scale of data. Suppose that we train our model on instance with schema , where the workload is drawn from a distribution of templates (see Figure 8).
same templates & same schema & same scale
The test workload comes from the same distribution of templates and runs on the same database instance . As an example, the training and test data are gathered from the executions of two sets of TPC-DS queries generated from the same set of templates but with different random seeds running against the same instance.
different templates & same schema & same scale
The test workload comes from a different distribution of templates , but runs on the same instance . As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from a different subset of templates running against the same instance.
same template & same schema & different scale
The test workload comes from the same distribution of templates , but runs on an instance of different scale using the same schema . As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from the same set of query templates running against the and scale instances respectively.
different template & different schema
The test workload comes from a different distribution of templates and runs on a different instance defined on a different schema . As an example, the training and test data are gathered from running two sets of queries generated from TPC-DS templates and TPC-H templates running against the corresponding database instances.
We assume a fixed hardware configuration and do not study generalization to different hardware configurations in this work. Cloud platforms generally have a restricted set of known hardware configurations on which services are deployed, and separate models could be trained and deployed for each configuration.
4 Experimental Evaluation
In this section, we present and discuss the results of our experimental evaluation. We design our experiments with the goal of answering the following questions:
What are the effects of different featurization alternatives in model performance?
How do different ML models perform for different generalization levels?
What are the root causes behind wrong predictions in the ML models in our problem formulation?
What trade-offs should be considered when choosing between different ML models?
At a high level, our experiments show that the featurization using the count, card and weight features (excluding cost) leads to the overall best model prediction performance. RF shows the best overall performance considering both task-agnostic and task-specific metrics. However, when there is significant distribution mismatch between the test queries and training queries, no model is able to perform DOP selection that gives performance close to optimal. One concrete explanation for such distribution mismatch is the difference of memory requirements.
Comparatively speaking, hyper-parameter tuning of XGBoost and MLP is time-consuming while additional regularization is hardly to be imposed for better generalization (e.g., to achieve better performance on test data) when the training data is limited; RF is easier to be configured and it is more robust while it has relatively higher model inference overhead.
4.1 Experimental Setup
System Configuration Our queries were executed on a dual-socket Intel Xeon Broadwell server with a total of physical cores and GB main memory. We have hyper-threading enabled, resulting in a total of logical cores which, consequently, is the maximum DOP possible on this server. We run SQL Server 2019 CTP 2.0 Developer edition. By default, SQL Server chooses a DOP value of 64 for queries running on this server. SQL Server sets aside part of the available memory for the buffer pool and shared objects; the remainder can be granted as working memory to queries up to a certain limit (by default, 25% of the working memory, corresponding to 90GB on our server). Using query hints, we explicitly request this maximum memory grant for queries that spill. The data for the databases we study are striped across several NVMe SSDs and their logs files are on another NVMe SSD.
Workload and Datasets We use both the TPC-H and TPC-DS workloads for our experiments, with scale factors 100, 300, and 1000 for each. The following table gives the detailed information for the dataset construction.
For each TPC-H query template, we generated queries using different random seeds with duplicate queries being removed. TPC-DS queries are generated in a similar manner, but with different random seeds used for each query template. We execute each query in the workloads for different DOP values: . We focus on warm-cache query execution, since caching is almost always used in practice whenever it is feasible. Thus, to obtain each data point, we perform one cold-cache query execution followed by ten warm-cache runs. The average time of the warm-cache runs is used as the ground truth of the latency for the executed query. For the purpose of our study, we execute one query at a time. We also only consider queries executed using query plans with at least one physical operator that uses intra-parallelism. We use the clustered columnstore index [cci] organization for each database.
Training and Testing Splits We perform 5-fold cross-validation to evaluate the model performance for each generalization level. For Generalization Lvl1, each fold uses of TPC-DS 1000 queries for training and the remaining for testing. For Generalization Lvl2, the data is split based on the query templates: each fold uses of the query templates from TPC-DS1000 for training, and the remaining for testing. The folds for Generalization Lvl3 and Generalization Lvl4 are as follows:
|Generalization Lvl3||Generalization Lvl4|
Hyper-Parameter Tuning We tune the selected models in a standard manner, similar to [ding2019ai], based on the mean absolute error in cross-validation on the training data. For tree-ensemble models (XGBoost and RF), we limit the maximal number of trees to be
. We note that spending additional effort tuning the parameters of different models might lead to a narrower gap between the evaluation metrics of these models. However, the main goal of this study is to to explore the usability and trade-offs present in the compared techniques in a standard training configuration. Hence, for each model, we stop tuning the hyper-parameters (e.g., searching larger hyper-parameter space) when reasonably satisfying results are observed.
Comparison Metrics To compare the performance of different ML models, we use both task-agnostic metrics (e.g., mean absolute error), as well as task-specific metrics that are tied to a specific application (e.g., workload throughput at optimal DOP selection). We provide the detailed definitions of these metrics in Table 3.
|Task-Agnostic||Mean Absolute Error|
|Relative Prediction Error|
|Speedup Prediction Error|
|Task-Specific||Throughput with per-Query DOP|
|Throughput with per-Workload DOP|
4.2 Featurization Analysis
To compare the featurization alternatives, we run RF with all four features , and also excluding one feature at a time (, , , ). Tables 8-14 show the distributions of relative prediction error (RPE) and speedup prediction error (SPE). We observe that for Generalization Lvl1 there is little difference across different featurization alternatives (Tables 8,8). For the other generalization levels, leads to the best performance: for the low relative error range (e.g., RPE 0.80), the percentage given by using is larger than other featurization alternatives. We also observe that including all features is rarely optimal, while having count and weight is beneficial most of the time. As a result of our analysis, we present our experimental results using the features for the remaining experiments.
|Model||Generalization Lvl1||Generalization Lvl2||Generalization Lvl3||Generalization Lvl4|
4.3 Training and Inference Time
In this section, we evaluate the training and inference overheads for the ML models we tested. It is critical that both overheads are small for the model to be usable in practice. We report below (Table 5) the training and inference overheads for Generalization Lvl1.
|Model||Training (sec)||Inference (sec)|
The training overhead includes the time spent on preprocessing (e.g., featurization, training/test data splitting, etc). MLP is trained on NVIDIA GeForce GTX 1080Ti, while other models are trained on the same machine in which the queries are executed with all physical cores given. The inference time of all models is measured by running each model on a single data point using a single CPU core. We observe that the inference time of all models except for RF is very small ( 10ms ), while the training time varies, with MLP being the most expensive model. RF spent more time ( ms) on inference due to the traversal of large number of deep trees. However, our work focus on relatively long-running queries (e.g., OLAP), for which DOP tuning is mostly beneficial to, and thus we see the inference overhead of RF as insignificant.
4.4 Model Comparison
4.4.1 Task-Agnostic Metrics
Mean Absolute Error MAE represents the average absolute difference between the actual query latency and predicted query latency of all plan-dop pairs. A nice property of MAE is that it has the same unit as the regression target time, which is also noted in [marcus2019plan].
In Table 4, we observe that except for LR, all other models exhibit relatively small MAEs on the training data after learning across all generalization levels, suggesting the necessity of nonlinearity and large model capacity. Among these models, XGBoost and RF show the best performance considering only MAEs on the training data, which are close to zero, suggesting the number/depth of the trees in tree-ensemble models are large/deep enough.
Switching to test data, the comparison between results of LR and other models in Generalization Lvl1 are relatively consistent with what is being observed in training data. However, we see much larger gaps across different generalization levels in all models, and this suggests the different degree of difficulty of applying ML models for DOP tuning at different generalization levels. MAEs of different testing folds in each generalization level suggest that RF generalizes better than XGBoost and MLP. This result can be explained by the fact that RF is less sensitive to overfitting compared to XGBoost and MLP, as well as easier to tune. While it is possible to apply better regularization techniques on XGBoost and MLP to further reduce the generalization error, we note that the current architecture/hyperparameters of XGBoost, MLP, and RF (which are tuned via cross-validation on the training data) show comparable performance on different validation splits. The generalization error gap between different models might be narrowed down when more training data from queries of larger variety becomes available.
Relative Prediction Error We show the distribution of the per-query RPE of each model at different generalization levels in Figure 10. Not surprisingly, LR consistently exhibits high RPE () in most of the plan-dop pairs across all generalization levels due to its inability of capturing the complex relationship between the plan-dop pair features and the query latency. Looking at the relative error distribution of other models, we observe that in most cases, RF exhibits lower RPE for a larger portion of test plan-dop pairs compared to other models.
Speedup Prediction Error While relative error might be a good metric for evaluating the accuracy of the query latency prediction, it does not directly infer the speedup of the query execution at different DOP values. Intuitively, when selecting the DOP for a single query execution, users should know the pattern or trend of the performance curve of the query at DOP values of interests. We look at a simple metric called speedup prediction error (Table 3) that captures this property. Figure 10 presents the per-query based SPE distribution of each model considering different generalization levels. We observe that RF consistently shows the best performance for SPE.
The relationships between relative error/speedup error and the query latency (at DOP 40) are shown in Figure 14 and 14. We observe that large errors are less likely to occur in long-running queries () for both metrics. The observation is positive, since intra-parallelism is more beneficial to long running queries.
|Model||Generalization Lvl1||Generalization Lvl2||Generalization Lvl3||Generalization Lvl4|
4.4.2 Task-Specific Metrics
DOP Selection at Individual Query Level We now consider the task defined in Section 3.1 of selecting the DOP per query Level. We use the metric as defined in Table 3. We compare given by different models and the query throughputs given by executing queries at DOP 40, 80, the actual optimal DOP (OPT) for each , the workload optimal DOP (WORKLOAD) for all , with all results being normalized over the throughput at DOP 64 (default value). The comparison results at different generalization levels are shown in Figure 11. We first observe that in Generalization Lvl1, all models except LR lead to query throughput close to OPT.
At Generalization Lvl2, we observe that no model is able to predict DOPs that lead to throughput comparable to WORKLOAD in most of the cases. Though XGBoost seems to give performance much closer to WORKLOAD in some cases (Fold 2-5), the results are more likely to be accidental. After investigation, we have found that the root cause behind this performance gap is the mismatch between the data distribution in training and test data: non-trivial disk spilling is observed in queries from the same query template in which (as shown in Figure 2) comes from (generated from different random seeds) due to the increase of memory requirements at high DOP values (DOP 20), which in turn causes a long running time that dominates the whole test workload latency. The observation suggests that a good understanding of the structural differences between the training and test workloads is important before applying ML-based approach. Though many recent works [ding2019ai, marcus2018deep] exploring ML-based techniques in DBMS problems mention the distribution difference between the training and test data (queries), to the best of our knowledge, we are the first one to explicitly point out such a discrepancy. Meanwhile, the result also reveals that being able predict the memory requirement for a given query with a known query plan, is helpful for DOP tuning. We have verified our analysis by intentionally removing -type queries and the results are shown in Figure 12. Though the performance gain opportunity itself becomes significantly smaller (the performance gap between actual optimal DOP values and the default DOP), in most cases, most models do not select DOPs that result in performance regression and the best-performing model RF often select DOPs giving performance better than WORKLOAD.
At Generalization Lvl3, performance given by models except for LR often matches up OPT or WORKLOAD (Fold 1-4). For Fold 5, since models are trained on plan-dop pairs generated by TPC-DS 100 queries & database instance and tested on TPC-DS 1000 queries & database instance (in which disk-spilling is more severe than that being observed in TPC-DS 100), the distribution mismatch (similar to previous reasoning for Generalization Lvl2) results in performance regression.
At Generalization Lvl4, the opportunity for throughput improvement over DOP 64 is small (the gap between OPT and throughput at DOP 64 is only ), no model selects DOP values giving performance comparable to OPT, and sometimes models show performance regression compared to the default DOP configuration (64). Interestingly, LR seems to be the best-performing model if solely looking at this task-specific performance metric at Generalization Lvl4. The cause of this phenomenon is that TPC-H queries, from which the test data is generated, scale very well with increasing DOP on SQL Server. And LR always chooses DOP 80 due to its linearity. At first it may seem like the poor performance shown in other models is due to the fact that the degree of commonality between training and testing queries in Generalization Lvl4 is low, however, it is noticeable that there is no distinguishable difference between the SPE distributions of Generalization Lvl3 and Generalization Lvl4 (Figure (c)c (d)d), while RF does a fairly good job in DOP selection for Generalization Lvl3. After careful investigation, we find out that for most of the queries, RF and XGBoost are able to capture the speedup trend accurately, and the relatively poor performance seems more likely to be caused by the specific characteristics of TPC-H queries (testing queries in Generalization Lvl4) - they scale really well with increasing DOP (see Figure (e)e), suggesting large performance gap between query executions at different DOP values and small latency at high DOP (e.g., nearly 30X speedup at DOP 80 compared to performance at DOP 1). Keeping this fact in mind and considering the DS1000/H1000 training/testing split (Fold 2), RF and XGBoost failed to select the near-optimal DOP for two relatively long-running queries in the testing split (RF selects DOP 20 and XGBoost selects DOP 16 while the actual optimal DOPs are 80), resulting in the performance gaps observed in Figure 11.
4.4.3 Cost-Performance Trade-off DOP Selection
While making the optimal choice of DOP configuration at individual query is useful, selecting the optimal DOP at workload-level, although is sub-optimal, could ease DOP configuration. Figure 17 visualizes the actual speedup curve and the predicted speedup curves given by different models with increasing degree of parallelism. The speedup curve can be used for selecting DOP for a given set of queries based on a resource budget, or by simply choosing the optimal point without any constraints. For example, looking at Figure (a)a, the speedup by increasing DOP from to is insignificant but the hardware provision cost could increase by (assuming the hardware provision cost is proportion to the number of logical cores).
We observe that the best-performing model RF is able to approximately capture the trend of the actual speedup curves in Generalization Lvl1, 2M and 3 considering both per-query optimal-DOP (Figure 17) and workload-level DOP (Figure 17). Besides, RF is also able to select per-query level DOPs for a given workload, leading to performance that is close to the performance given by running each query at the actual optimal-DOP given the specified maximum DOP possible (Figure 17). However, there is a large gap between the actual speedup curve and the curve predicted by RF in Generalization Lvl2 and Generalization Lvl4, where XGBoost is performing better in predicting the speedup curve in Generalization Lvl4 (Figure (e)e,(e)e )– in contrast to our observations for task-agnostic metrics. This gap is caused by the heterogeneous predictions on different plan-DOP pairs in the test data: the prediction output of a few plan-DOP pairs dominates the predicted performance, and hence the actual performance of other plan-DOP pairs is not accurately reflected. On the other hand, XGBoost fails to capture the performance difference (e.g., query performance improves) for a few plan-DOP pairs that are important for DOP selection (Figure (e)e), which though are not critical in predicting the overall speedup trend (Figure (e)e,(e)e). Based on this observation, we argue that merely looking at the task-agnostic metrics might not be always the best way to evaluate the performance of the model, but rather the task-specific ones should be considered together. For example, RF is doing well in terms of per-query DOP selection in Generalization Lvl4 while XGBoost is preferred when predicting the view of cost-performance trade-offs is important. Meanwhile, a better learning objective, that forces the model to more accurately learn the absolute and relative difference between the query latency at different DOPs is critical to make ML-based techniques for query DOP tuning more feasible.
5 Related Work
Query Performance Prediction (QPP) is a well-studied problem [akdere2012learning, ganapathi2009predicting, li2012robust, wu:qpp-concurrent-queries:vldb:2013, marcus2019plan], but not in a setting with intra-query parallelism. Early work in DOP management [mehta1995managing] studied the exploitation of intra-operator parallelism in a multi-query environment for shared-nothing parallel database systems using a simplified simulation model. More recent works have developed analytical models for query parallelism in BigData execution frameworks [Rajan:perforator:socc:2016]. To the best of our knowledge, this is the first paper to do a comparative study of ML techniques for QPP for multithreaded query execution and evaluate query DOP tuning for a commercial-grade RDBMS on a modern multicore server.
ML for Query Performance Prediction Prior work has explored the use of per-operator models [akdere2012learning, li2012robust, marcus2019plan] for QPP. The per-operator latency estimates can be added together [akdere2012learning, li2012robust] or combined usings plan/subplan-level estimates [akdere2012learning] or DNNs [marcus2019plan] to get query-level estimates. While operator-level modeling may have the potential for better generalization to unknown plans, there are several challenges in using this approach for highly-optimized RDBMSs.
First, query estimates based on per-operator modeling can incur significant complexity and cost for plans with a large number of operators. Figure 18 shows the distributions of the number of operators (plan size) and height of the relational operator tree (plan depth) in the plans for TPC-DS 1000 queries that we study. About 6.2% of plans have size of more than 100, with a maximum of 197, and about 10% have a depth of 20 or more. Large plans have high training costs while deep plans increase difficulty of training QPPNet-like DNN-based models, critical path lengths and inference overhead at query execution time.
Second, building a query-level combined model from per-operator models is not always feasible when there is pipelined parallel execution, because of the temporal overlap of and contention for shared resources by inter- and intra-operator threads. The semantics of isolated per-operator latencies are not well-defined in such a setting, and it may prevents from extracting and composing operator-level estimates to derive query-level estimates as Figure 7 suggests.
Automated Database Tuning Prior work has also explored techniques for automatically tuning database systems to improve query performance [Chaudhuri:self-tuning-db:vldb:2007, Duan:ituned:vldb:2009, pavlo2017self, VanAken:ottertune:sgmod:2017]. Our work on DOP tuning can benefit such approaches by providing what-if analyses for different DOP settings and thereby potentially eliminating the need for runtime exploration of this parameter space.
Dynamic Parallelism Recent work on new database systems use dynamic/elastic parallelism where the the parallelism can be increased/decreased at runtime [Leis:morsel:sigmod:2014, Patel:quickstep:vldb:2018]. Although our ML-based DOP selection approach for performance improvement focuses on a setting with static parallelism, we believe that our approach on providing cost-benefit tradeoff estimates would be useful for resource provisioning in elastic systems as well.
ML for Query Optimization
Conventional query optimization in RDBMSs is based on manually constructed cost models and rule-based heuristics. Such strategies usually make certain assumptions about the plan search space (e.g., left-deep tree) and rely heavily on the estimation of statistics required by the cost models such as cardinality estimation– which can be quite inaccurate, leading to query plans that are far away from optimal. In addition, query optimizers built this way never learn from past experience, and the same query plan with poor performance can be repeatedly generated[marcus2018towards]. Recent research efforts attempt to enhance and even replace the core optimizer components seen in most of RDBMSs today using deep learning (DL) techniques. For example, [marcus2018deep, krishnan2018learning] exploit (DRL) to optimize join queries, while [kipf2018learned, ortiz2018learning] propose to use DRL to improve the cardinality estimation. Neo, a query optimizer built based on a set of DL-based models [marcus2019neo], has been shown to even outperform state-of-the-art commercial optimizers in some cases. Compared to this line of research, our work focuses on optimizing the query performance outside of the RDBMS (an example of resource tuning) rather than touching the optimizer internals, similar in scope to [pavlo2017self].
Our comparative exploration of ML techniques for tuning DOP in SQL Server indicates that performance gains are possible by using simple tree-ensemble models along with query plan featurization. The performance profiling for queries showing different behaviors at different DOP values suggests that DOP tuning is an important while challenging problem. We have also identified a set of important issues that raise concerns when applying the ML-based techniques for DOP tuning in practice, along with the possible improvements to this work:
Memory consideration: in our present study, we run queries in an environment in which the available memory is fixed (i.e., static). The utility of ML models learned without considering memory information is limited, preventing their use in new environments with different memory requirements (e.g., machines with different memory sizes, or concurrent query execution in which each stream is under strict memory constraints). We plan to address this deficiency as part of improvements to this work.
Concurrent query execution: we plan to utilize the single-query based ML models for DOP tuning in a concurrent environment by an analytical approach, expanding the applicability of our study.
Model interpretability: the results given by the black-box ML algorithm should be interpretable in a way similar to that of the rule-based algorithm. The DBAs should be able to see the decision path suggested by the algorithm in order to decide whether to accept/reject the model output. Meanwhile, interpretability could also help with possible implementation issues inside the targeted RDBMS (e.g., the poor parallelism could be caused by the behavior of operators of certain types).
Studying the effects of hyper-threading: since enabling or disabling hyper-threading will affect query performance characteristics, it is important to study such differences and maybe even encode this piece of information into ML models.
We studied the problem of tuning the degree of parallelism via statistical machine learning. We focus our evaluation on Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute on multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We performed an extensive experimental study comparing these models against a list of performance metrics, and tested how well they generalize in different settings: to queries from the same template, to queries from a new template, to instances of different scale, and to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
We thank Alan Halverson for insightful discussions about the DOP problem and about baseline experimental setup during the initial phase of this work. Zhiwei was supported by Microsoft’s Gray Systems Lab (GSL) through a summer internship and Microsoft RA-ships for this research work. We thank Carlo Curino and other members of GSL, and members of the SQL Server team for discussions and feedback on this work. The manuscript was improved by detailed and thoughtful comments from Remzi Arpaci-Dusseau and other, anonymous, reviewers.