
Flexible Operator Embeddings via Deep Learning

by Ryan Marcus, et al.
Brandeis University

Integrating machine learning into the internals of database management systems requires significant feature engineering, a human effort-intensive process to determine the best way to represent the pieces of information that are relevant to a task. In addition to being labor intensive, the process of hand-engineering features must generally be repeated for each data management task, and may make assumptions about the underlying database that are not universally true. We introduce flexible operator embeddings, a deep learning technique for automatically transforming query operators into feature vectors that are useful for multiple data management tasks and are custom-tailored to the underlying database. Our approach works by taking advantage of an operator's context, resulting in a neural network that quickly transforms sparse representations of query operators into dense, information-rich feature vectors. Experimentally, we show that our flexible operator embeddings perform well across a number of data management tasks, using both synthetic and real-world datasets.





1 Introduction

As database management systems and their applications grow increasingly complex, researchers have turned to machine learning to solve numerous data management problems, such as query admission control [60], query scheduling [41], cluster sizing [46, 34, 39], index selection [49], cardinality estimation [33, 26], and query optimization [37, 29, 45]. One of the most critical steps of integrating machine learning techniques into data management systems is the process of feature engineering. Generally, feature engineering involves human experts deciding (1) if a particular piece of information is relevant to a task (too many features may increase inference time or decrease the efficacy of some machine learning models), (2) how to combine available domain information into useful features (many models desirable for their fast inference time are unable to learn arbitrary combinations of features, e.g. linear regression), and (3) how to encode these features into usable input (most machine learning algorithms require vector inputs).

In addition to being a difficult task to undertake, feature engineering has many disadvantages specific to data management tasks:

  • Time/cost intensive: feature engineering requires experts in machine learning and database systems to spend a long time testing and experimenting with different combinations of features. For example, there were 18 engineered features in [2] for latency prediction, and 41 engineered features in [25] used for database intrusion detection.

  • Non-task transferable: generally, features engineered by hand for a specific task will not provide good performance on a different task – the process must be repeated for each task (e.g. the features developed for query admission control in [60] cannot be used for latency prediction [2], nor for database intrusion detection [25], and vice versa).

  • Non-data transferable: features engineered for a particular task may work well for one dataset (e.g. TPC-H), but those same features may fail for another dataset (e.g. a real-world dashboard system).

In order to ease these burdens, this work introduces flexible operator embeddings: a general technique to perform multi-purpose, database-tailored feature engineering with minimal human effort. As query operators lie at the heart of numerous complex tasks, including but not limited to resource consumption prediction [32], query optimization [55], query performance prediction [2], and query scheduling [60], our embeddings offer feature engineering at the operator level. Here, an operator embedding is a mapping from a query operator to a low-dimensional feature (vector) space that captures rich, useful information about the operator. By leveraging deep learning techniques, these embeddings act as vectorized representations of an operator that (a) can be generated and tailored to a specific database, (b) can provide useful features for a variety of data management tasks, and (c) provide usable input to numerous off-the-shelf machine learning algorithms.

Our technique for learning representations of query operators is based on the intuition that the behavior of an operator is context sensitive: a join of two large relations will behave differently than a join of two small relations (e.g. greater memory usage, possible buffer spilling, etc.). We capture this intuition by training a deep neural network with a specialized, hour-glass architecture to predict the context of a given query operator: in our case, the operator’s children. The training set for this neural network is a set of query plans (i.e., trees of query operators). Once trained, we use the internal representation of a particular operator learned by the neural network as that operator’s embedding. Intuitively, training the neural network to predict an operator’s children ensures that the internal representation learned by the neural network carries a high amount of semantic information about the operator. For example, if the network can predict that a join operator with a particular predicate normally has two very large inputs, then the internal representation learned by the network is likely to carry semantic information relevant to whether or not that join operator will use a significant amount of memory.

One unique characteristic of our operator embeddings is that they produce features that can be fed to traditional, off-the-shelf machine learning models with fast inference time. Furthermore, they can be useful for integrating machine learning techniques into a variety of data management tasks such as query admission, query outlier detection, and detecting cardinality estimation errors. Finally, once learned by the neural network, our flexible operator embeddings are significantly information rich. This implies that, when learned embeddings are used as input to a traditional machine learning model, only a relatively small amount of training data is required to learn a particular task. Overall, we argue that learned embeddings represent a valuable addition to the DBMS designer’s toolbox, enabling accurate, light-weight, custom-tailored models for a wide variety of tasks that can be generated with minimal human effort.

The contributions of this paper are:

  • We present flexible operator embeddings, a novel technique to automatically map query operators to useful features tailored to a particular database, alleviating the need for costly human feature engineering.

  • We demonstrate that our embeddings can be effectively combined with simple, low-latency, and off-the-shelf machine learning models.

  • We demonstrate the efficacy and flexibility of our custom embeddings across a variety of datasets and tasks.

The remainder of this paper is organized as follows. In Section 2, we give an intuitive overview of why effectively representing query operators as vectors is difficult, and how deep learning can be used to learn effective embeddings. Section 3 describes our framework for learning and applying database-tailored operator embeddings. Section 4 describes how to train a custom operator embedding and apply it to a particular task. We present experimental results in Section 5, demonstrating the efficacy of our approach across a variety of workloads and tasks. Related work is presented in Section 6, and concluding remarks are given in Section 7.

2 Context-aware Embeddings

Machine learning is not magic, and like all algorithms, obeys the maxim of “garbage in, garbage out.” If one provides a meaningless or nearly-meaningless representation of the domain, a machine learning algorithm will show poor performance [12]. If one wants a machine learning model to generalize well to unseen data, one needs to provide a good representation of the data. Here, “good” inputs to a machine learning algorithm means inputs that are generally dense (i.e., every dimension is more-or-less continuous and meaningful) and information-rich (i.e., the distance between points is meaningful). Unfortunately, coming up with a dense, information-rich representation of query operators is often not easy. Next, we motivate our context-aware operator embeddings by discussing alternative approaches and their drawbacks when used for data management tasks.

(a) One-hot encodings of words (left) and query operators (right)
(b) Example embedded space word vectors (left) and operator embeddings (right)
Figure 1: Encodings and embeddings for words and query operators

One-hot encodings

A commonly used information encoding approach (particularly in the domain of natural language processing, NLP) is using a one-hot vector, i.e., a vector where there is a single high (generally 1) value and all the other values are low values (generally 0). For example, in NLP, where the main units of analysis are words, a given word can be represented with a high value in a specific position in a vector with the size of the vocabulary. Figure 1(a) shows an example of how the words king, queen, apple, and banana might be represented as part of an eight-word vocabulary. Here, the word “king” is represented by a high value in the fourth position of the vector, and the word “queen” is represented by a high value in the first position.

The right side of Figure 1(a) shows how a similar one-hot encoding strategy can be used to encode query operators. Here, four operators – an index nested loop (natural) join on one attribute, a merge (natural) join on another attribute, and two selection operators with different predicates – are represented with a straightforward “combined” one-hot encoding which captures information about the operator type (in the first three dimensions), join attribute (in the next two dimensions), and predicate (in the last two dimensions). For example, the appearance of a nested loop join is captured by the value of 1 in the first position, a merge join is represented by the value of 1 in the second position, and the selection operator is captured in the third position. Similarly, the join attributes “g” and “h” are captured by a value of 1 in the fourth and fifth positions, respectively.

For both English words and query operators, it is immediately clear that this representation is (1) not dense, as each dimension is used minimally, and (2) not information-rich, as all three operator types (and all four English words) are equidistant from each other in the embedded space. For example, the first three operators (the nested loop join, the merge join, and the first selection operator) are all equidistant from one another. As a result, this representation encodes very little semantic information. While not particularly useful (indeed, both database-related and NLP models trained directly on one-hot vectors perform poorly), this sort of encoding is very easy to come up with, and requires almost no human effort.
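The equidistance claim can be checked directly. The sketch below is only an illustration: the seven-dimensional layout follows the description of the combined encoding, but the operator names and the assignment of attributes and predicates to particular operators are assumptions made for the example.

```python
import math

# Hypothetical layout of the combined one-hot encoding:
# [nested loop join, merge join, selection, attr g, attr h, predicate 1, predicate 2]
OPERATORS = {
    "index_nl_join_g": [1, 0, 0, 1, 0, 0, 0],
    "merge_join_h":    [0, 1, 0, 0, 1, 0, 0],
    "select_pred1":    [0, 0, 1, 0, 0, 1, 0],
    "select_pred2":    [0, 0, 1, 0, 0, 0, 1],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pairwise distances between all distinct operators.
names = list(OPERATORS)
distances = {
    (a, b): euclidean(OPERATORS[a], OPERATORS[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Under this layout the two joins are exactly as far from each other as either is from the first selection, so the encoding carries no signal that the two joins implement the same logical operator.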

Information-rich embeddings Ideally, our vectorized representations would contain significant semantic information. In the NLP case, what is desired is the representation depicted in the left side of Figure 1(b): “king” and “queen” are neighbors, as are “apple” and “banana.” For query operators, we want the vectors representing the merge join and index nested loop join operators to be closer to each other than to the selections, as depicted in the right side of Figure 1(b), since both merge join and index nested loop join implement the same logical operator (a join). However, unlike one-hot encodings, it is not trivial to construct such an embedding. Thus, we ask a natural question: can we automatically transform the one-hot, sparse, easily-engineered, information-anemic representation into a dense, information-rich representation useful for machine learning algorithms?

Context-aware representations In order to facilitate learning a dense, information-rich representation from an easy-to-construct one-hot encoding, the notion of context is often leveraged [42, 43, 36, 51]. For instance, word vectors, invented to transform words into vectors, take advantage of the context in which a word appears to represent that word as a vector. For example, in the sentence “Long live the ____!”, we expect to see a word like “king” or “queen”, as opposed to a word like “apples” or “bananas.” However, in the sentence “____ are a tasty fruit.”, the converse is true: one would expect “apples” or “bananas” instead of “king” or “queen.”

In a query processing context, if we know that the child of an unknown unary operator is a scan operator reading phone number data, we know it is unlikely that the unknown operator is an average aggregate (as the numerical average of a set of phone numbers is nonsensical). If we know that both the children of an unknown binary operator are sort operators, we know the operator is more likely to be a merge join than a hash join (a query optimizer would have little reason to sort the input to a hash join operator). Hence, in the next paragraphs, we demonstrate how we can build these context-aware embeddings via deep neural networks (DNNs).

2.1 Context-aware Learning via DNNs

Context and deep neural networks (DNNs) can be leveraged to learn a semantically-rich, low-dimensional vector space from sparse, one-hot vectors [42]. At a high level, our approach works by training a neural network to predict the context (children) of an operator. In practice, this neural network maps one-hot representations of query operators to one-hot representations of their children. Once trained, the internal representation learned by the neural network is used as an operator embedding.

To facilitate the discussion in the next paragraphs we first provide necessary background on neural networks.

Neural networks Deep neural networks are composed of multiple layers of nodes, in which each node takes input from the previous layer, transforms it via a differentiable function and a set of weights, and passes the transformed data to the next layer. Thus, the output of each node can be thought of as a feature: a piece of information derived from the input data. As you advance into the subsequent layers of the neural net, nodes begin to represent more complex features, since each node aggregates and recombines features from the previous layer. Because each node applies a differentiable transformation, and thus the entire network can be viewed as applying a single complex differentiable transformation, neural networks can be trained using gradient descent [52]. This training is done by defining a loss function, a function that measures the prediction accuracy of the neural network on the desired task, and then using gradient descent to modify the parameters (i.e., weights) of the neural network to minimize this loss function. Unlike most traditional machine-learning algorithms, deep neural networks learn complex features automatically, without human intervention. Readers unfamiliar with deep neural networks may wish to see [54] for an overview.

Hour-glass DNN architecture Next, we discuss how deep neural networks can be used to learn context-aware operator embeddings. Our framework takes advantage of contextual relationships between query operators. The critical component is a deep neural network architecture that is trained to predict the context of a particular entity. For example, in the domain of query processing, the neural network, given a merge join operator, would be trained to predict the two sorted inputs (details in Section 4).

Intuitively, this process works by first training a neural network to map a simple, one-hot encoding of a query operator to some contextual information about that operator: for example, the network would be trained to map a one-hot representation of a merge join operator to the one-hot representation of the merge join’s two (sorted) inputs. Here, there is no separate process needed apart from training the neural network for this auxiliary task. Once the network is trained, each hidden layer outputs combinations of its input features that are useful for this predictive task. Hence, these outputs can be seen as an embedding of the (sparse) input vectors of the neural network.

Figure 2: Hourglass embedding neural network

To ensure that the final hidden layer of the network offers an output (an embedding) that is dense and information-rich, the architecture of this neural network is designed with an information bottleneck. We use an “hour-glass” shaped network, depicted in Figure 2. Here, one-hot vectors of operator information are fed into an input layer and then projected down into lower and lower dimensional spaces through a series of hidden layers. Then, a very low dimensional layer is used to represent the learned embedding. At the end of the network, a final prediction layer is responsible for mapping the low-dimensional learned embedding into a prediction of the context of the input operator’s feature vector.

During training, the neural network strives to identify complex transformations of its input that improve the accuracy of its output prediction. Implicitly, this forces the neural network to learn a representation (an embedding) of the input one-hot vectors that can be used to predict the correct operator context at the final layer. After the network is trained, the prediction layer is “cut off”, resulting in a network which takes in a one-hot encoded, sparse vector and outputs a lower-dimensional embedding. Because this low-dimension learned embedding was trained to be useful for predicting the context of the operator, we know it is information-rich. The low-dimension layer representing the learned embedding serves as an information bottleneck, forcing the learned representation to be dense.

Of course, precisely predicting the context of a query operator is an impossible task. After all, the children of a merge join operator will not always be sort operators. Since ultimately we will cut off the final prediction layer, we do not care much about its accuracy, as long as its output somewhat matches the distribution of potential contexts. For example, when fed a merge join operator, the network predicts a higher likelihood that the children of the merge join operator were sort operators than hash operators. Once trained to have this property, the resulting network (with the prediction layer cut off), serves as a mapping from a sparse, information-anemic, easy-to-engineer representation to a dense, information-rich representation.
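The shape of such a network, and what “cutting off” the prediction layer means, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the layer sizes, the tanh activation, and the names `embed` and `predict_children` are assumptions, and the weights are random (untrained), so only the dimensions are meaningful.

```python
import math
import random

random.seed(0)

def make_layer(n_in, n_out):
    """Random weight matrix (n_out rows of n_in weights) and a zero bias vector."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def apply_layer(layer, x, activation=math.tanh):
    """Compute activation(W x + b) for one fully connected layer."""
    weights, bias = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# Assumed sizes: 7-dimensional sparse input, 2-dimensional embedding.
INPUT_DIM, EMBED_DIM = 7, 2
LAYER_SIZES = [INPUT_DIM, 4, EMBED_DIM]   # the encoder narrows toward the bottleneck
encoder = [make_layer(a, b) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
prediction = make_layer(EMBED_DIM, 2 * INPUT_DIM)   # predicts two children at once

def embed(x):
    """Encoder only: the network that remains after the prediction layer is cut off."""
    for layer in encoder:
        x = apply_layer(layer, x)
    return x

def predict_children(x):
    """Full hour-glass: embedding followed by the (linear) prediction layer."""
    out = apply_layer(prediction, embed(x), activation=lambda z: z)
    return out[:INPUT_DIM], out[INPUT_DIM:]

merge_join = [0, 1, 0, 0, 1, 0, 0]            # sparse one-hot style input
left, right = predict_children(merge_join)    # needed only while training
embedding = embed(merge_join)                 # dense, low-dimensional output
print(len(merge_join), "->", len(embedding))
```

Only `embed` is needed once training is finished; `predict_children` exists solely to provide a training signal.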

The rest of this paper explains how we employ this learned embedding framework to automate and custom-tailor the feature engineering process to a particular database, and how the output of our feature engineering process can be useful for a number of data management tasks.

3 Learning Framework Overview

Figure 3: Three phase learning framework

Learning database-specific operator embeddings is the first step towards automating the feature engineering process. Once operator embeddings are obtained, an off-the-shelf machine learning algorithm can utilize them as an input for a specific task. Next, we provide a brief overview of our framework before describing its technical details in Section 4.

Our proposed framework operates in three phases, depicted in Figure 3. In the first embedding learning step, an operator embedding is learned by training the “hour-glass” neural network. This training uses previously-observed query plans (i.e., operator trees) from a particular database, allowing the network to learn to predict an operator’s children (i.e., context). In the task-based model training phase, the learned embeddings are used to train an off-the-shelf machine learning model for a specific task (e.g., cardinality estimation error prediction). Specifically, the trained cut-off neural network is used to map the training set for this task into the dense embedded vector space and these dense vectors are fed into a traditional machine learning model to train the model for the particular task. Finally, at runtime, observed query operators are embedded through the cut-off neural network and the embedded output is sent through the traditional machine learning model, resulting in a prediction for the particular task and for the observed operator.

Embedding learning This first phase focuses on learning the operator embedding for a particular database workload. Here, we assume the availability of a long history of executed queries from the database, which we call a sample workload. This sample workload can be extracted from database logs or through other means. Many DBMSs automatically store such logs for auditing and debugging purposes. The only information needed from these queries is their query plans, which allow one to extract information about each plan’s query operators, e.g., the type of operator, the expected cost of the operator according to the optimizer’s cost model, etc. (this information is made available in many DBMSes via EXPLAIN queries). These queries need no additional annotation or tagging, i.e. they do not have to be hand-labeled with any particular piece of information.
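As a concrete illustration of extracting operators and their children from a plan, the sketch below walks a plan in the JSON shape produced by PostgreSQL's `EXPLAIN (FORMAT JSON)`, where each node carries a "Node Type" and its children appear under "Plans". The sample plan is made up for the example, and the function name `operator_records` is hypothetical; other DBMSes expose plans in different formats.

```python
import json

# Made-up plan in the shape of PostgreSQL's EXPLAIN (FORMAT JSON) output.
EXPLAIN_OUTPUT = """
[{"Plan": {"Node Type": "Merge Join", "Total Cost": 120.5, "Plan Rows": 1000,
   "Plans": [
     {"Node Type": "Sort", "Total Cost": 60.0, "Plan Rows": 1000,
      "Plans": [{"Node Type": "Seq Scan", "Relation Name": "orders",
                 "Total Cost": 20.0, "Plan Rows": 1000}]},
     {"Node Type": "Index Scan", "Relation Name": "customer",
      "Total Cost": 35.0, "Plan Rows": 150}
   ]}}]
"""

def summary(node):
    """Copy a node's own properties, dropping its nested children."""
    return {key: value for key, value in node.items() if key != "Plans"}

def operator_records(node):
    """Yield (operator, children) pairs for every node, parents before children."""
    children = node.get("Plans", [])
    yield summary(node), [summary(child) for child in children]
    for child in children:
        yield from operator_records(child)

plan = json.loads(EXPLAIN_OUTPUT)[0]["Plan"]
records = list(operator_records(plan))

for operator, children in records:
    print(operator["Node Type"], "->", [c["Node Type"] for c in children])
```

Each record pairs an operator with its context, which is exactly the unlabeled training signal the embedding network needs.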

In this phase, we train a deep neural network with an hour-glass architecture (orange in Figure 3). The first layer takes as input any available piece of information about a query operator in the training set. The subsequent, hidden layers of the network slowly decrease in size until the penultimate layer of the network, called the embedding layer. After the embedding layer, an additional prediction layer is added, responsible for taking the information from the embedding layer and predicting the input operator’s children. The embedding layer, the smallest layer of the network, serves as an information bottleneck, forcing the neural network to learn a compact representation of the input operators.

Given the trained neural network, the last step of the embedding learning stage is to “cut off” the final prediction layer of the network. This results in a network that maps an input operator to a low-dimensional vector, i.e. the embedding layer becomes the output layer of the network.

Task-based model training During the model training phase, depicted in the second row of Figure 3, the learned embeddings are used to train an off-the-shelf machine learning model from a small training set. For example, suppose one is interested in training a machine learning model to predict cardinality estimation errors, as are common in queries with many joins [31]. One would then collect a small training set consisting of query operators and labels specifying whether the estimated cardinality of the operator was too high or too low (these labels are represented with green and purple in Figure 3). These collected operators are then encoded into one-hot sparse vectors. These sparse vectors are run through the trained hour-glass neural network (with its prediction layer cut off), generating one dense vector for each operator. These generated vectors, which can be thought of as embedded versions of the labeled operators, are dense and low-dimensional. These vectors, and their corresponding labels, are then used to train an off-the-shelf machine learning model (e.g., logistic regression, random forest [10]) to predict whether the cardinality estimate of a particular operator will be too high or too low.

Runtime Predictions

At runtime, the specialized model trained in the previous step is used on new, unlabeled operators. First, these previously-unseen operators are sent through the cut-off neural network. This produces an embedded dense vector which can be classified by the traditional machine learning model trained in the previous stage. For example, the specialized model may classify the new operator as having an under- or over-estimated cardinality. The “Runtime Prediction” stage in Figure 3 shows an example of this process, where an initially unlabeled (uncolored) operator is fed through the embedding network to produce an embedded vector, which is subsequently classified by the traditional ML model. The result of this classification can then be utilized by the DBMS, e.g. if the specialized model predicts that a particular operator’s cardinality has been underestimated, the DBMS could perform sampling, or replace the particular operator with one less sensitive to cardinality estimation errors (e.g. replace a loop join with a hash join).

4 Operator Embeddings Training

In this section, we give a technical description of the embedding learning, task-based model training, and runtime predictions phases of the framework we discussed above.

4.1 Embedding Learning

The first phase of our framework involves the training of the embedding network. During this step, we train a neural network to emit operator embeddings using a large history of previously-executed queries. Let us assume a sample query workload $W$ from the target database. We treat each query operator $x \in W$ as a large, sparse vector containing information about $x$ such as the operator type, join algorithm, index used, table scanned, column predicates, etc. Categorical variables are one-hot encoded, and vector entries corresponding to properties that a certain operator does not have are set to zero, e.g. for table scan operators, the vector entry for “join type” is set to zero.

We train our embedding network to predict the children, $c_1(x)$ and $c_2(x)$, of a given query operator $x$. If a query operator has no children, we define both $c_1(x)$ and $c_2(x)$ as the zero vector, denoted $\vec{0}$, and if a query operator has only one child, we define $c_2(x)$ to be $\vec{0}$ and $c_1(x)$ to be that child. Query operators with more than two children are uncommon, and can either be ignored or truncated (e.g., ignore the 3rd child onward).
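A minimal sketch of this encoding step follows. The vocabularies, the dictionary keys (`type`, `table`, `cost`), and the choice to place a lone child in the first slot are assumptions for illustration; a real system would derive the vocabulary from the sample workload and would likely normalize scalar entries such as cost.

```python
# Assumed, hypothetical vocabularies for the categorical properties.
OP_TYPES = ["Seq Scan", "Index Scan", "Sort", "Merge Join", "Hash Join"]
TABLES = ["orders", "customer"]
DIM = len(OP_TYPES) + len(TABLES) + 1   # the final slot holds a scalar cost estimate
ZERO = [0.0] * DIM                      # stands in for a missing child

def encode(op):
    """One-hot encode categorical properties; properties the operator lacks stay zero."""
    vec = [0.0] * DIM
    vec[OP_TYPES.index(op["type"])] = 1.0
    if "table" in op:                   # only scans read a table directly
        vec[len(OP_TYPES) + TABLES.index(op["table"])] = 1.0
    vec[-1] = op.get("cost", 0.0)
    return vec

def training_example(op, children):
    """Return (input vector, first child target, second child target)."""
    targets = [encode(child) for child in children[:2]]   # ignore the 3rd child onward
    targets += [ZERO] * (2 - len(targets))                # pad missing children with zeros
    return encode(op), targets[0], targets[1]

scan = {"type": "Seq Scan", "table": "orders", "cost": 20.0}
sort = {"type": "Sort", "cost": 60.0}

x, first, second = training_example(sort, [scan])          # one child: second slot is zero
leaf_x, leaf_first, leaf_second = training_example(scan, [])  # no children: both zero
print(x, first, second)
```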

Figure 4: Training the embedding

Given a sample workload $W$, we next construct an hour-glass shaped neural network, as shown in Figure 4, which we train to learn a mapping from each parent to its children. The initial input layer for the embedding network is as large as the sparse encoding of each query operator. The subsequent layers, known as hidden layers, map the output of the previous layer into smaller and smaller layers, corresponding with lower and lower dimensional vectors. The penultimate layer, the embedding layer, is the smallest layer in the network. Collectively, these layers are referred to as the encoder. Immediately following the embedding layer are the output layers, which attempt to map the dense, compact representation outputted by the embedding layer to the prediction targets, i.e. the children of the input operator.

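We refer to the encoder as a function $E_\theta$ parameterized by the network weights $\theta$, and we refer to $P_1$ and $P_2$ as the functions representing the output layers for predicting the first and second child of the input operator from the output of the encoder, respectively. Thus, the embedding network is trained by adjusting the weights $\theta$ to minimize the network’s prediction error:

$$\min_\theta \sum_{x \in W} \Big[ \mathrm{err}\big(P_1(E_\theta(x)),\, c_1(x)\big) + \mathrm{err}\big(P_2(E_\theta(x)),\, c_2(x)\big) \Big]$$

where $W$ is the sample workload, $c_1(x)$ and $c_2(x)$ are the first and second child of operator $x$, and $\mathrm{err}$ represents an error criterion. For vector elements representing scalar variables (such as cardinality and resource usage estimates), we use mean squared error, e.g. $\mathrm{err}(\hat{y}, y) = (\hat{y} - y)^2$. For vector elements representing categorical variables (such as join algorithm type, or hash function choice), we use cross entropy loss [30].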
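The mixed error criterion can be sketched as follows. This is an illustrative simplification, not the authors' code: it assumes each child vector consists of one categorical variable in its first `n_categorical` entries followed by scalar entries, applies a softmax to the categorical outputs, and weights the two terms equally. All function names are hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error over scalar entries."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    """Cross entropy between softmax(logits) and a one-hot target."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot))

def child_error(pred, target, n_categorical):
    """Cross entropy on the categorical entries plus MSE on the scalar entries."""
    error = cross_entropy(pred[:n_categorical], target[:n_categorical])
    if len(pred) > n_categorical:
        error += mse(pred[n_categorical:], target[n_categorical:])
    return error

def operator_loss(pred_first, pred_second, first, second, n_categorical):
    """Prediction error for one operator: the sum of the errors for both children."""
    return (child_error(pred_first, first, n_categorical)
            + child_error(pred_second, second, n_categorical))

# Two operator types (categorical) followed by one scalar estimate per child.
loss = operator_loss([0.0, 0.0, 3.0], [0.0, 0.0, 3.0],
                     [1, 0, 3.0], [1, 0, 3.0], n_categorical=2)
print(loss)
```

Summing `operator_loss` over every operator in the sample workload gives the quantity that gradient descent minimizes. Note that an all-zero categorical target (a missing child) contributes zero cross entropy in this sketch, which is one of several details a full implementation would need to handle deliberately.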

By minimizing this loss function via stochastic gradient descent [52], we train the neural network to accurately predict the contextual information (children) of each input operator. It is important to note that the network will never achieve very high accuracy – in fact, it is quite likely that the same operator has different children in different queries within the sample workload, thus making perfect accuracy impossible. When this is the case, the best the network can do is match the distribution of the operator’s children, e.g. to predict the average of the correct outputs (as this will minimize the loss function), which still requires that the narrow learned embedding contain information about the input operator. The point is not for the embedding network to achieve a high accuracy, but for the network to learn a representation of each query operator that carries semantic information.
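To see why predicting the average minimizes the loss, consider (as a simplifying assumption) a single scalar output and one operator that appears $n$ times in the sample workload with observed child values $y_1, \dots, y_n$. The symbols $p$, $n$, and $y_j$ are introduced only for this sketch. Setting the derivative of the summed squared error to zero gives:

```latex
\frac{d}{dp} \sum_{j=1}^{n} (p - y_j)^2
  = 2 \sum_{j=1}^{n} (p - y_j) = 0
\quad\Longrightarrow\quad
p = \frac{1}{n} \sum_{j=1}^{n} y_j
```

The analogous calculation for cross entropy yields the empirical frequency of each category, so in both cases the best single prediction for a repeated operator summarizes the distribution of its observed children.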

Intuitively, the learned representation is information-rich and dense: the narrow embedding layer forces the neural network to make its prediction of the operator’s children based on a compact vector, so the network must densely encode information in the embedding layer in order to make accurate predictions, and thus minimize the prediction error. Since the network can make accurate predictions about the context of a query operator given the dense encoding, the dense encoding must be information-rich.

Example Figure 4 shows an example of how the embedding network is trained for a single operator: a merge join (MJ) with two children, a table scan (Tscn) and an index scan (Iscn). The network is trained so that when the sparse, vectorized representation of the merge join operator is fed into the input layer, the encoder (which contains a sequence of layers with decreasing sizes) maps the sparse vector into a dense, semantically rich output, which is finally fed into the output layers to predict the merge join’s children, the table scan and the index scan. Because the final output layers must make their prediction of the input operator’s children based only on the small vector produced by the encoder, the output of the encoder must contain semantic contextual information about the input operator.
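The training procedure can be illustrated end to end on a toy workload. The sketch below is a deliberately small stand-in, not the authors' network: it uses a single tanh embedding layer of width 2, a linear output layer, squared error for every output (in place of cross entropy for categorical entries), and made-up operators with a three-type vocabulary.

```python
import math
import random

random.seed(1)
IN, EMB, OUT = 3, 2, 6   # one-hot operator size, bottleneck width, two children

# Toy vocabulary and workload (made up): each entry is (parent, first child + second child).
SCAN, SORT, MJ, ZERO = [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]
workload = [(MJ, SORT + SORT), (MJ, SORT + SCAN), (SORT, SCAN + ZERO), (SCAN, ZERO + ZERO)]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W1, b1 = rand_matrix(EMB, IN), [0.0] * EMB    # encoder (embedding layer)
W2, b2 = rand_matrix(OUT, EMB), [0.0] * OUT   # prediction layer

def forward(x):
    """Return (embedding, predicted children) for one operator."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, o

def total_loss():
    """Summed squared prediction error over the whole workload."""
    return sum(sum((oi - yi) ** 2 for oi, yi in zip(forward(x)[1], y)) for x, y in workload)

def sgd_step(x, y, lr=0.1):
    """One stochastic gradient descent update on a single (operator, children) pair."""
    h, o = forward(x)
    d_out = [2 * (oi - yi) for oi, yi in zip(o, y)]
    # Backpropagate through the prediction layer and the tanh before updating W2.
    d_hid = [sum(W2[k][j] * d_out[k] for k in range(OUT)) * (1 - h[j] ** 2)
             for j in range(EMB)]
    for k in range(OUT):
        for j in range(EMB):
            W2[k][j] -= lr * d_out[k] * h[j]
        b2[k] -= lr * d_out[k]
    for j in range(EMB):
        for i in range(IN):
            W1[j][i] -= lr * d_hid[j] * x[i]
        b1[j] -= lr * d_hid[j]

loss_before = total_loss()
for _ in range(500):
    x, y = random.choice(workload)
    sgd_step(x, y)
loss_after = total_loss()

embedding_of_merge_join = forward(MJ)[0]   # encoder output, i.e. the learned embedding
print(loss_before, loss_after, embedding_of_merge_join)
```

Because the merge join appears with two different sets of children in this toy workload, the loss cannot reach zero, which mirrors the observation that perfect accuracy is neither possible nor the goal.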

4.2 Task-based Model Training

After a good encoder has been trained, a user can identify a task (e.g. cardinality estimation error prediction) and build a small training set of labeled operators. For the cardinality estimation error prediction task, these labels would indicate whether an operator’s cardinality was under- or over-estimated. We denote each operator in the training set as $x_i$, with the label of $x_i$ being $y_i$. Using the embedding network, the user can transform each labeled query operator into a labeled vector, e.g. for any $x_i$, we compute $E_\theta(x_i)$, the output of the trained encoder $E_\theta$. These input vectors and their labels can be used as a training set for a traditional, off-the-shelf ML model such as a logistic regression or a random forest.

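We denote this off-the-shelf model with parameters $\phi$ as $M_\phi$, and note that, in general, $M_\phi$ will be trained to minimize the model’s classification error:

$$\min_\phi \sum_i \mathrm{err}\big(M_\phi(E_\theta(x_i)),\, y_i\big)$$

where $x_i$ is a labeled operator, $y_i$ is its label, $E_\theta$ is the trained encoder, and $\mathrm{err}$ is the classification error criterion.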

Figure 5: Task-based model

Virtually any machine learning model can be used in place of the model $M_\phi$, and the corresponding learning algorithm can be used to find a good value for its parameters $\phi$ (e.g. random forest tree induction for building a random forest model [10], or gradient descent for finding coefficients for a logistic regression [21]). Because the vectors produced by a well-trained encoder are information-rich and dense, operators with interesting properties (such as those with cardinality estimation errors) may be somewhat linearly separable, allowing for fast, simple models like logistic regression to be used with reasonable accuracy. We demonstrate this experimentally in Section 5.
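As a small illustration of the task-based model, the sketch below fits a logistic regression by stochastic gradient descent on two-dimensional embedded vectors. The points and labels are made up to be linearly separable, and the function names, learning rate, and epoch count are assumptions; in practice one would more likely use an existing library implementation.

```python
import math

# Made-up 2-D embedded operators; label 1 = cardinality overestimated, 0 = underestimated.
embedded = [(-0.9, -0.4), (-0.7, -0.8), (-0.5, -0.2), (0.6, 0.3), (0.8, 0.9), (0.4, 0.7)]
labels = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, targets, lr=0.5, epochs=200):
    """Fit weights and bias by per-example gradient steps on the log loss."""
    weights, bias = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(points, targets):
            # Gradient of the log loss with respect to the pre-activation.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def classify(model, x):
    """Predict 1 when the point lies on the positive side of the linear boundary."""
    weights, bias = model
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

model = train_logistic_regression(embedded, labels)
predictions = [classify(model, x) for x in embedded]
print(model, predictions)
```

With only two coefficients and a bias to fit, such a model needs very few labeled examples, which is the practical benefit of doing the heavy lifting in the embedding.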

Training set size One extra advantage of using the embedded operator vectors as input to traditional machine learning models is that it reduces the amount of data required to train an effective model compared to training a model directly on labeled sparse operator vectors. While many deployed database systems have a large log of executed queries available, acquiring labeled query operators, i.e. query operators that have been tagged (possibly by hand) with the additional piece of information one wishes to predict, is generally more difficult than analyzing logs. For example, for the cardinality estimation error prediction task, acquiring a large number of query plans from logs is straightforward, but determining whether the optimizer under- or overestimated the cardinality for each operator requires re-executing the query and recording the estimated and actual cardinalities. Clearly, acquiring this information by re-executing the entire query log is untenable.

By training the embedding network to predict the context of a query operator using the large supply of easily-acquired unlabeled data, we ensure that the embedded operator vectors contain information about patterns in the query workload. Using these embedded vectors to train a traditional machine learning model removes the need for the traditional model to re-learn workload-level information, and can thereby achieve strong performance without a large supply of labeled training data.

Example An example of task-specific model training for the cardinality estimation error prediction task is depicted in Figure 5. Here, a small set of training data is represented by squares on the left side of the figure. The color corresponds to the prediction target: whether the cardinality estimate for the operator was too high or too low. The training set is then fed through the learned embedding. The scatterplot on the right side of Figure 5 depicts the resulting dense, information-rich vectors (in this case, of dimension 2). A classifier can then be trained to find a simple linear boundary (e.g., logistic regression, perceptron, SVM) which nearly perfectly separates the two classes.
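To make the example concrete, the following minimal sketch fits a logistic regression by gradient descent on two synthetic 2-D clusters that stand in for embedded operators. The data is randomly generated for illustration, the function names are our own, and a real deployment would more likely use an existing library implementation.

```python
import math
import random


def train_logistic_regression(points, labels, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Classify x by the side of the linear boundary it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy stand-ins for embedded operators: two well-separated 2-D clusters.
rng = random.Random(0)
under = [(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(50)]
over = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(50)]
points = under + over
labels = [0] * 50 + [1] * 50  # 0 = underestimated, 1 = overestimated

w, b = train_logistic_regression(points, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(points, labels)) / len(points)
```

Because the two clusters are far apart, a single linear boundary is enough to separate them, which mirrors the situation sketched on the right side of Figure 5.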

4.3 Runtime Prediction

At runtime, an unlabeled operator can be run through the encoder, then through the traditional ML model, in order to produce a predicted label for that operator. The resulting classification decision can be used by the DBMS. For example, for the cardinality estimation error prediction task, if the model predicts that an operator’s cardinality has been underestimated, the DBMS could perform additional sampling, refuse to run the query, or ask the user for confirmation. In a more advanced setting, the model’s prediction could be combined with a rule-based system to avoid catastrophic query plans, e.g. if the model predicts that a loop join operator has an underestimated cardinality, replace that loop join with a hash join.
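The rule-based combination mentioned above could look roughly like the sketch below. The `Operator` class, the operator names, and the label strings are hypothetical and exist only for illustration; a real DBMS would apply such a rule inside its optimizer.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    """Hypothetical plan operator; `kind` names the physical operator."""
    kind: str
    estimated_rows: int


def apply_cardinality_rule(op: Operator, predicted_label: str) -> Operator:
    """Swap a loop join for a hash join when the model predicts an
    underestimated cardinality; leave every other operator untouched."""
    if op.kind == "loop_join" and predicted_label == "under":
        return Operator(kind="hash_join", estimated_rows=op.estimated_rows)
    return op


risky = apply_cardinality_rule(Operator("loop_join", 100), "under")
safe = apply_cardinality_rule(Operator("loop_join", 100), "correct")
```

Here `risky` becomes a hash join, while `safe` is returned unchanged.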

We note that because this inference is happening during query processing, inference time matters. As a result, users may wish to select a model with sufficiently fast inference time for their application: if model inference time is too high, the net effect on the system may be negative, even if the model provided accurate predictions. We experimentally analyze the inference time required by the encoder, and the encoder combined with various off-the-shelf machine learning models, in Section 5.3.

5 Experimental Analysis

Here, we present an experimental analysis of the operator embedding framework. First, we analyze the embeddings themselves, investigating the properties of the learned vector space that operators are embedded into. This analysis allows us to build an intuition for why the learned embedding approach offers effective feature vectors for use with traditional machine learning algorithms. Then, we measure the effectiveness and efficiency of flexible operator embeddings for several data management tasks.

Neural network setup Unless otherwise stated, our hourglass embedding network generates embeddings of size 32, meaning that query operators are mapped into 32 floating-point numbers. This neural network has six hidden layers (of size 256, 256, 128, 128, 64, and 64 nodes). Layer normalization [7] (a technique to stabilize the training of neural networks) and ReLUs [18] (an activation function) are used after all layers except for the output layer. The network was trained using stochastic gradient descent over 100 epochs (passes over the training data). The embeddings were trained using a GeForce GTX TITAN GPU and the PyTorch [48] deep learning toolkit.

Database setup Unless otherwise stated, all queries are executed using PostgreSQL 10.5 [1], running on Linux kernel 4.18. PostgreSQL was run from inside a virtual machine with 8GB of RAM, a configured buffer size of 6GB, and two virtualized CPU cores. The underlying machine has 64GB of RAM and an Intel Xeon E5-2640 v4.

Input vectors setup Queries are initially encoded into a sparse representation based on features we extracted from the output of PostgreSQL’s EXPLAIN functionality. The number of features (i.e., the size of the input sparse vectors) depends on the dataset, as different datasets have different numbers of relations, attributes, etc. Across our datasets, the number of features varies between 280 and 478. The features contain information such as the optimizer’s estimated cost (for all operators) and the expected number of hash buckets required (for hashing operators). Details about the initial sparse encoding can be found in Appendix A.
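The sketch below shows, in simplified form, how numeric and one-hot features might be concatenated into a sparse operator vector. The feature names follow Table 1 in Appendix A, but the `encode_operator` helper, the lowercase join type values, the choice of only four features, and the all-zeros convention for inapplicable features are assumptions made for illustration.

```python
JOIN_TYPES = ["semi", "inner", "anti", "full"]  # values listed in Table 1


def one_hot(value, vocabulary):
    """Return a list with a single 1.0 at the position of `value`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode_operator(op: dict, relations: list) -> list:
    """Concatenate numeric and one-hot features into one sparse vector.

    Features that do not apply to an operator (e.g. a join type on a scan)
    are encoded as all zeros, which is an assumption of this sketch.
    """
    vec = [float(op.get("Plan Rows", 0)), float(op.get("Total Cost", 0.0))]
    vec += one_hot(op.get("Join Type"), JOIN_TYPES)
    vec += one_hot(op.get("Relation Name"), relations)
    return vec


relations = ["store_sales", "item", "date_dim"]  # a few TPC-DS relations
scan = {"Plan Rows": 1200, "Total Cost": 45.5, "Relation Name": "item"}
join = {"Plan Rows": 300, "Total Cost": 910.0, "Join Type": "inner"}
scan_vec = encode_operator(scan, relations)
join_vec = encode_operator(join, relations)
```

The vector length grows with the number of relations (and, in the full encoding, attributes and indexes), which is why the input size differs between datasets.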

Dataset We conducted our experiments over one synthetic and two real-world datasets, which were provided by a large corporation on the condition of anonymity:

  • TPC-DS Workload: this workload includes 21,688 TPC-DS query instances generated from the 70 TPC-DS templates that are compatible with PostgreSQL without modification. These queries were executed on a TPC-DS dataset with a scale factor of 10, resulting in a total database size of 25GB. We have made execution traces of these queries publicly available for replication and analysis.

  • Online Workload: we also used a real-world, online workload extracted from the execution logs of a large corporation. The dataset contains 8,000 analytic (OLAP) queries sent by 4 different users in a 48-hour period. The total database size is 5TB, and the average query reads approximately 300GB of data.

  • Batch Workload: finally, we used a real-world, batch workload executed weekly at a large corporation for report generation. The dataset contains 1,500 analytic (OLAP) queries sent by 98 different users. The total database size is 2.5TB, and the average query reads approximately 350GB of data.

5.1 Analysis of Operator Embeddings

(a) Embedding under t-SNE
(b) Cardinality errors
(c) Query latency
Figure 6: t-SNE plots of a learned embedding for TPC-DS (best viewed in color)

First, we explore and analyze the embeddings learned by the hourglass neural network. This is the output of the first phase of our framework. Here, we trained the neural network on the query plans we collected from the TPC-DS queries. We then ran all the query operators present in those query plans through the trained embedding network, resulting in one 32-dimensional vector for each operator. We refer to these as the embedded operators.

Visualizing embedded operators To visualize the embedded operators, we used the t-Distributed Stochastic Neighbor Embedding (t-SNE) [63] technique. The t-SNE technique projects high dimensional vector spaces (in our case, the 32-dimensional embedding space), into lower dimensional spaces (2D, for visualization), while striving to keep data that is clustered together in the high dimensional space clustered together in the low dimensional space as well.

Figure 6(a) shows the t-SNE of the 32-dimensional TPC-DS operator vectors. Note that the X and Y values of a particular point have no meaning on their own – the t-SNE only attempts to preserve clusters in the higher dimensional space. On its own, Figure 6(a) does not give much insight into the shape of the learned embedding. However, several clusters and shapes appear, which we analyze next.

Groupings in the t-SNE To confirm that the learned embedding carries information about query operator semantics, we next color each embedded operator based on the accuracy of PostgreSQL’s cardinality estimate. In Figure 6(b), operators for which the optimizer correctly estimated the cardinality within a factor of two are colored gray (Correct). When the cardinality is overestimated by at least a factor of two, the operator is colored blue (Over). When the cardinality is underestimated by at least a factor of two, the operator is colored red (Under).
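The three-way labeling used for this coloring can be written as a small function. This is a minimal sketch: the function name and the clamping of zero-row values to one row are our own choices for illustration.

```python
def cardinality_label(estimated: float, actual: float, factor: float = 2.0) -> str:
    """Label an estimate as "Correct", "Over", or "Under".

    An estimate is "Over" if it is at least `factor` times the actual
    cardinality and "Under" if the actual cardinality is at least `factor`
    times the estimate. Both values are clamped to 1 row to avoid degenerate
    comparisons with zero (a choice made for this sketch).
    """
    est, act = max(estimated, 1.0), max(actual, 1.0)
    if est >= factor * act:
        return "Over"
    if act >= factor * est:
        return "Under"
    return "Correct"


labels = [cardinality_label(e, a) for e, a in [(100, 120), (500, 100), (10, 4000)]]
```

For the three (estimated, actual) pairs above, the labels are Correct, Over, and Under, respectively.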

We make two observations about Figure 6(b).

  1. Even in the 2D space used for visualization, there are many apparent clusters of cardinality under and over estimations. This demonstrates that the operator embedding – which was trained with no knowledge of cardinality estimation errors – still learned an embedding that preserves semantic information about the query operators. In other words, by learning an embedding useful for predicting the context (children) of an operator, the neural network learned to embed query operators into a vector space with semantic meaning.

  2. The fact that operators with cardinality estimation errors are clustered together indicates that a machine learning model should be able to learn underlying patterns in the embedded data (e.g., the clusters) and make useful predictions. We investigate this directly in Section 5.2.2.

We also colored the t-SNE plot by operator latency, as shown in Figure 6(c). Here, we color each query operator based on the percentile of its latency, so that the fastest query operator, the 0th percentile, is colored with dark blue and the slowest query operator, the 100th percentile, is colored with yellow. When comparing Figure 6(b) and Figure 6(c) we can observe, unsurprisingly, that many of the slowest query operators correspond with cardinality underestimations, possibly resulting in spills to disk or suboptimal join orderings. Figure 6(c) demonstrates similar behavior to the previous plot. Long-running query operators tend to be neighbors with other long-running query operators, forming clusters in the 2D visualization. These clusters demonstrate that the learned embeddings carry semantic information (e.g., information related to operator latency), which can be taken advantage of by a machine learning model.

5.2 Operator Embeddings Effectiveness

Next, we evaluated the task-based model training component of our framework. Here, we trained a number of machine learning models using our learned operator embeddings for a number of different data management tasks, and we evaluate the effectiveness of these models. These models are trained using a labeled set of query operators for a particular task. To show the flexibility of our approach, we measure the performance of the learned embeddings across three different tasks:

  1. Query admission: the model is trained to predict whether a particular query contains an operator that will take an extraordinary amount of time to execute. The decision is used to admit or reject the query.

  2. Cardinality boosting: the model is trained to predict whether the cardinality estimate of a particular operator’s output is too high, too low, or correct.

  3. User identification: the model is trained to identify the user that submitted a particular query. One application of such a model is to test for outlier queries.

For each task, we used a number of off-the-shelf machine learning models: (1) logistic regression, (2) random forest [10] (RF) with 100 trees, (3) k-nearest neighbors [11] (kNN) configured to account for 6 neighbors using weighted distance (the best value found after an extensive hyperparameter search), and (4) support vector machines [20] (SVM). In order to demonstrate that task-based models can be trained with relatively little training data, each model is evaluated using 5-fold cross validation, in which one-fifth of the data is used for training and four-fifths are used for testing at a time (the final number reported is thus the median of 5 runs). For the TPC-DS dataset, cross validation folds are chosen based on query templates, so that the query templates in each training set are distinct from those in each testing set. For the online workload dataset, training and testing sets are chosen so that all queries in the training set precede all queries in the testing set (i.e., the training set represents the “past”, and the test set represents the “future”). For the batch workload dataset, folds are chosen using uniform random sampling without replacement.
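The two less common splitting schemes described above can be sketched as follows: "inverted" folds that train on one fold and test on the other four while keeping each query template on one side of the split, and a temporal split that trains on the past and tests on the future. The helper names, the toy workload, and the 20% training fraction used for the temporal split are assumptions made for illustration.

```python
import random
from collections import defaultdict


def inverted_group_folds(items, group_of, k=5, seed=0):
    """Yield (train, test) splits where one fold trains and k-1 folds test.

    All items that share a group (e.g. a query template) land in the same
    fold, so no group appears on both sides of a split.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    folds = [keys[i::k] for i in range(k)]
    for i in range(k):
        train = [x for g in folds[i] for x in groups[g]]
        test = [x for j in range(k) if j != i for g in folds[j] for x in groups[g]]
        yield train, test


def temporal_split(items, timestamp_of, train_fraction=0.2):
    """Train on the earliest queries and test on everything after them."""
    ordered = sorted(items, key=timestamp_of)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy workload: (template_id, query_id) pairs, 10 templates x 4 queries each.
queries = [(t, q) for t in range(10) for q in range(4)]
splits = list(inverted_group_folds(queries, group_of=lambda x: x[0]))
past, future = temporal_split(queries, timestamp_of=lambda x: x[1])
```

With 10 templates and 5 folds, each training set holds the queries of 2 templates and each test set holds the queries of the remaining 8.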

We compare the performance of each trained machine learning model when trained using a number of different input feature vectors. The feature engineering techniques we used for extracting these vectors are the following:

  1. Sparse: this is the raw, unreduced sparse encoding taken directly from PostgreSQL’s EXPLAIN functionality (see Appendix A for a listing).

  2. Neural: this is the automatically-learned operator embeddings generated using the “hourglass” neural network approach presented here.

  3. PCA: here we use feature vectors from the 32 leading principal components (vectors that explain a high percentage of the variance in the data) of the original sparse vectors. These components are found by performing (automated) principal components analysis [23] on the sparse input vectors.

  4. FA: this is an automatic feature engineering process that uses feature agglomeration [50], a technique similar to hierarchical clustering. This technique builds up features by combining redundant (measured by correlation) features in the sparse input vectors together until the desired number of features (32) is reached.

5.2.1 Query Admission Task

(a) TPC-DS
(b) Online Workload
(c) Batch Workload
Figure 7: Query admission prediction accuracy for different models and feature vectors. The models predict if a query does or does not contain an operator with latency above the 95th percentile.

For the query admission task, we trained various machine learning models to predict whether or not any operator in a query plan would exceed the 95th percentile for latency. In other words, the model tries to predict whether any operator in a given query plan will take longer than 95% of previously seen operators. To do this, we applied the trained model to each operator in an incoming query, and if the model predicts that any operator in the query would exceed the 95th percentile threshold, the query is flagged. DBMSes may wish to reject such flagged queries, or prompt the user for confirmation before utilizing a large amount of (potentially shared) resources to execute them [60, 68, 66].
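The flagging logic can be sketched in a few lines. The nearest-rank percentile, the dictionary-based operators, and the `toy_model` stand-in (which peeks at a known latency instead of predicting from an embedding) are all assumptions made for illustration.

```python
import math


def percentile_threshold(latencies, pct=95.0):
    """Return the latency at the given percentile (nearest-rank method)."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def flag_query(operators, is_slow):
    """Flag a query if the model predicts any operator exceeds the threshold."""
    return any(is_slow(op) for op in operators)


# Previously seen operator latencies (ms) define the 95th-percentile cutoff.
history = list(range(1, 101))
cutoff = percentile_threshold(history)


def toy_model(op):
    """Stand-in for the trained classifier: it reads a known latency, whereas
    the real model would predict from the operator's embedding."""
    return op["latency_ms"] > cutoff


fast_query = [{"latency_ms": 5}, {"latency_ms": 40}]
slow_query = [{"latency_ms": 5}, {"latency_ms": 99}]
decisions = [flag_query(fast_query, toy_model), flag_query(slow_query, toy_model)]
```

Only the second query is flagged, because one of its operators lies above the cutoff.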

Figure 7 shows the average 5-fold cross validation accuracy for each model on all three datasets. The red line at 75% represents the prior (i.e., since 75% of queries do not have an operator exceeding the 95th latency percentile, a model that always guessed this most common class would achieve an accuracy of 75%).

For all three datasets, the specialized models trained using the Neural input (the learned operator embeddings) outperformed models trained with the other input vectors. While the combination of the operator embeddings (Neural) and logistic regression saw the highest performance and outperformed other combinations across all three datasets by 10-15%, the operator embeddings (Neural) also produced the most effective models compared with any other feature engineering approach. The only exception is the SVM model on the online workload, where the model trained on the Neural inputs was within 5% of the most effective model.

Surprising performance of logistic regression For the query admission task (and the other tasks analyzed next), the combination of the learned embeddings with a logistic regression classifier is surprisingly effective (97% accuracy for TPC-DS). This is due to the fact that the logistic regression model has a very similar mathematical form to the cross-entropy and mean-squared error loss functions used to train the neural network. Training a logistic regression model on the embedded data is equivalent to re-training the last layer of the embedding network for a different prediction target, a technique called knowledge distillation or transfer learning, which has been shown to be extremely effective [8, 69]. This also explains the large gap (15% - 18%) between the performance of logistic regression when using the Neural featurization and with the other featurizations.

5.2.2 Cardinality Boosting Task

(a) TPC-DS
(b) Online Workload
(c) Batch Workload
Figure 8: Cardinality over/under estimation prediction accuracy for different models and feature vectors. The models predict if the optimizer’s cardinality estimation is within a factor of two, over by a factor of two, or under by a factor of two.

Next, we trained various machine learning models to predict whether or not the PostgreSQL query optimizer’s cardinality estimate would be correct, too high (by at least a factor of two), or too low (by at least a factor of two). We call this task cardinality boosting because of the similarities to “boosting” techniques in machine learning [53], i.e. using one model to predict the errors of another model. Providing good cardinality estimates is extremely important for query optimization [31], and many recent works have applied machine learning to this problem [33, 26, 45].

Figure 8 shows the performance of the trained machine learning models for the cardinality boosting task across each dataset with each feature engineering technique. Again, the combination of the Neural featurization and logistic regression produced the most effective model for all three datasets, achieving over 93% accuracy on the TPC-DS dataset. While the gap between the Neural featurization and the other featurizations appears significantly higher for the cardinality boosting task than the query admission task, this is attributable to the change in the prior (i.e., 75% for the query admission task and strictly less than 50% for the cardinality boosting task).

The strong performance of the Neural featurization is well-explained by the t-SNE plot in Figure 6(b). Since we know the learned embedding is clustering operators with similar cardinality estimation errors together, it is not surprising that machine learning models can find separation boundaries/cluster centers within the data.

5.2.3 User Identification Task

(a) TPC-DS
(b) Online Workload
(c) Batch Workload (note y-axis scale)
Figure 9: Query user identification prediction accuracy for different models and feature vectors. The specialized model tries to predict the user that sent a particular query (number of classes varies by dataset).

The last task we evaluate is user identification. In this task, the model’s goal is to predict the user who submitted the query containing a particular operator. While the DBMS generally knows the user submitting a query, such a model is useful for determining when a user-submitted query does not match the queries usually submitted by that user, a common learning task in database intrusion detection [6, 19, 35, 25] or query outlier detection [56].

For the TPC-DS data, we use the query template as the “user” for each query (and therefore we use random cross-validation folds). For the online and batch datasets, user information was provided by their respective corporations. The online dataset contains queries submitted from 4 users, whereas the batch dataset contains queries submitted by 98 users. In all cases, the number of queries submitted by each user is approximately equal. Figure 9 shows the results.

The TPC-DS and batch workloads show an extremely large gap between the logistic regression performance using the Neural featurization and the other featurizations, again because of the significantly lower prior. Overall, the Neural featurization outperforms all other baselines, demonstrating that the learned embedding contains rich, semantic information about each query operator.

Two exceptions, however, are notable: first, for the TPC-DS dataset, the kNN model with the Sparse featurization outperforms the Neural featurization by approximately 1% (Figure 9(a)). This is due to certain operators – such as max aggregates over specific columns – appearing in only one query template. Such an operator will have a very close neighbor in the sparse vector space, allowing the kNN model to easily classify it. This advantage, however, does not extend to real-world data (e.g. Figures 9(b) and 9(c)), where such uniquely identifying operators do not exist.

Second, the random forest algorithm exhibits surprisingly good performance using sparse inputs for the batch workload (Figure 9(c)) – significantly better than the Neural featurization and random forest, although not as good as the Neural featurization and the logistic regression. The Sparse encoding works so well with the random forest model because of the specifics of the batch dataset: while most users access every table, almost all users can be uniquely identified by the set of tables they accessed. The random forest algorithm, which builds a tree of rules based on discrete splitting points, is especially well-suited to identifying the table usage patterns of each user in the one-hot encoding. We note, however, that the logistic regression combined with the Neural feature vectors still outperformed all random forest models on this task.

5.2.4 Impact of Embedding Size

(a) Query Admission
(b) Cardinality Boosting
(c) User Identification
Figure 10: Accuracy of various models on different tasks for the TPC-DS dataset, varying the size of the embedding layer.

Up until this point, we have only used learned embeddings with embedding layers of size 32, i.e. query operators are mapped into a vector space with 32 dimensions. Here, we evaluate changing this hyperparameter. Figure 10 shows how the performance of various models changes for the TPC-DS dataset when the size of the embedding is set to 64, 32, 16, and 8 dimensions. We observed similar results for both the online and batch workloads, but these plots are omitted due to space constraints.

Generally, the performance of the size-64 embeddings and the size-32 embeddings is nearly identical, indicating that adding additional dimensions beyond 32 to the embedding space does not cause the embedding network to learn a better compressed representation. On the other hand, the performance of all models drops significantly between the size-32 and size-16 embeddings, and again between the size-16 embeddings and size-8 embeddings (best depicted in Figure 10(c)). This suggests that when the embedding size is smaller than 32, the “information bottleneck” (see Section 4) becomes too narrow, and the neural network is unable to learn a sufficiently rich description of each query operator in such a small vector.

The ideal embedding size is hard to predict ahead of time, and although our experiments show that an embedding size of 32 or 64 provides good results on a number of datasets, we suggest that users test several configurations. This can be done automatically by training embeddings of multiple sizes, and then selecting the one that results in the best cross-validated model performance.
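One possible shape for such an automatic search is sketched below. The `cross_validated_scores` callback is hypothetical: it is assumed to train an embedding of the given size, fit the task model, and return one accuracy per fold. The scores in the example are placeholders for illustration only, not measured results.

```python
from statistics import mean


def select_embedding_size(candidate_sizes, cross_validated_scores):
    """Pick the embedding size with the best mean cross-validated accuracy.

    `cross_validated_scores(size)` is a caller-supplied function (hypothetical
    here) that returns one accuracy per fold. Ties go to the smaller
    embedding, since smaller vectors are cheaper at inference time.
    """
    results = {size: mean(cross_validated_scores(size)) for size in candidate_sizes}
    best = max(sorted(results), key=lambda size: results[size])
    return best, results


# Placeholder scores for illustration only; these are not measured results.
fake_scores = {
    8: [0.70, 0.72, 0.71],
    16: [0.80, 0.82, 0.81],
    32: [0.93, 0.94, 0.92],
    64: [0.93, 0.94, 0.92],
}
best_size, summary = select_embedding_size([64, 32, 16, 8], fake_scores.__getitem__)
```

Breaking ties in favor of the smaller size is a design choice of this sketch; it reflects the observation that larger embeddings beyond a certain point add cost without adding accuracy.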

5.3 Runtime Efficiency

(a) Inference time
(b) Cardinality boosting, TPC-DS
(c) Training time
Figure 11: Analysis of inference time, training time, and accuracy

After an embedding has been trained and a subsequent machine learning model has been trained for a particular task, the DBMS applies that model at runtime. Because the model is being applied at runtime, inference time matters, as we do not want to unnecessarily slow down query processing. Next, we discuss the inference time when these models are trained using our learned operator embeddings. We also performed a Pareto analysis of the efficiency (inference time) vs. the effectiveness (accuracy) of these models.

Inference time Figure 11(a) shows the amount of time it takes to perform model inference on a single operator for each machine learning model and feature engineering technique we used. Note the log scale on the y-axis. The Sparse featurization has exceptionally high inference time, especially for kNN and SVM models, because these models require measuring the distance between the input vector and a large number of other points (for kNN, this could potentially be every point in the training set; for SVM, this could be a large number of support vectors).

The PCA and FA featurizations have slightly lower inference time than the Neural featurization. This is because running the sparse feature vector through the multi-layer deep neural network takes slightly longer than the simple dot product computations required for PCA and FA. However, the difference is minimal (ms). It is important to note that operator embeddings (and other dimensionality reduction techniques) lead to faster inference time than the original sparse vector encoding, because the embedding process is often asymptotically cheaper than the inference process. This is especially true for models such as SVM and kNN, which scale poorly with dimensionality – for example, inference time using Sparse featurization with an SVM model (130ms) is almost an order of magnitude more than using the Neural featurization (18ms).

Pareto analysis Of course, inference time is not the only important factor when considering a runtime model: accuracy is obviously important as well. Here, operator embeddings’ natural synergy with logistic regression (see Section 5.2.1) gives the Neural featurization a massive advantage, as illustrated in the Pareto plot in Figure 11(b).

Figure 11(b) shows all the models trained for the cardinality boosting task for the TPC-DS dataset, plotted based on their inference time (x-axis, log scale) and their testing accuracy (y-axis). The Pareto front, the models for which no other model is both faster and more accurate, contains only the logistic regression classifiers using the Neural, FA, and PCA featurizations. While the logistic regression classifiers using the FA and PCA featurizations have slightly faster inference time (the Neural featurization still produces an inference time under 0.1ms), the accuracy of the model produced by the Neural featurization is substantially higher: approximately 94% versus 73%. Therefore, with the exception of the two logistic regression models with significantly lower accuracy, we conclude that the models produced by the learned operator embeddings are Pareto dominant in terms of inference time and accuracy.
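For readers unfamiliar with Pareto fronts, the following sketch computes one over (inference time, accuracy) pairs. It uses the standard dominance definition (at least as good on both criteria and strictly better on one), and the model names and numbers are made up for illustration; they are not our measurements.

```python
def pareto_front(models):
    """Return the names of models that no other model dominates.

    Each model is (name, inference_time, accuracy). A model is dominated if
    another model is at least as fast and at least as accurate, and strictly
    better on one of the two criteria.
    """
    front = []
    for name, t, acc in models:
        dominated = any(
            (t2 <= t and a2 >= acc) and (t2 < t or a2 > acc)
            for _, t2, a2 in models
        )
        if not dominated:
            front.append(name)
    return front


# Made-up numbers for illustration; they are not the paper's measurements.
models = [
    ("A", 0.05, 0.70),  # fastest, least accurate
    ("B", 0.08, 0.94),  # slightly slower, most accurate
    ("C", 1.00, 0.90),  # slower and less accurate than B
    ("D", 20.0, 0.85),  # slowest
]
front = pareto_front(models)
```

Models A and B remain on the front: A is the fastest and B is the most accurate, while C and D are each beaten by B on both criteria.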

5.4 Training Time

Finally, we analyze the time required to train the hourglass-shaped neural network used to represent an embedding. We compare the training time required for both a CPU (Intel Xeon E5-2640 v4) and a GPU (GeForce GTX TITAN). The results are shown in Figure 11. The time to train the network is a function of several parameters, the most notable being the size of the training set. Since our TPC-DS workload has the most queries (21,688), it has a longer training time than the online (8,000 queries) or batch (1,500 queries) workloads. For TPC-DS, the largest workload, training time on a CPU required 10 minutes, whereas on a GPU the training time dropped to 2.8 minutes.

Since the learned embedding does not need to be frequently retrained (as demonstrated by the consistent model performance on the online workload), we conclude that the training overhead of the proposed learned embeddings is manageable even for systems without a GPU, although we note that a GPU greatly accelerates training. Of course, training time can be reduced by sampling from the available training set, and we leave studying the effects of such a strategy on model performance to future work.

6 Related work

Machine learning in DBMSes A recent groundswell of work has focused on applying machine learning to data management problems. Recent works in intrusion detection [6, 19], index structures [28], SLA management [46, 41, 39, 58, 47, 34, 15], entity matching [44], physical design [38, 49], and latency prediction [66, 32, 65, 17, 14, 2, 64, 13] have all employed machine learning techniques. SageDB [59] proposes building an entire database system around machine learning concepts, including integrating machine learning techniques into join processing, sorting, and indexing. With little exception, each of these works has included hand-engineered features derived for each particular task, an arduous process.

Notably, recent work has applied reinforcement learning to various problems in data management systems, including join order enumeration [37, 40, 29], cardinality estimation [45], and adaptive query processing [62, 24, 61]. Recent work [33, 26] has also used more traditional supervised learning approaches, using a specialized neural network architecture, to perform join cardinality estimation. Predominantly, these techniques are custom-tailored to a problem at hand, and while several of the systems mentioned utilize deep learning (and thus learn useful features automatically for their specific task), they do not generally decrease the burden of applying machine learning to new problems in data management.

Feature engineering Recent works related to feature engineering, such as ZOMBIE [4], Brainwash [3], and Ringtail [5] take a “human-in-the-loop” approach to feature engineering, assisting data scientists in selecting good features. These techniques all seek to automate or shorten the process of evaluating the utility of a particular feature (e.g., optimizing model testing), a time-consuming task in feature engineering. In contrast, our technique takes a completely automatic approach, custom-tailoring a featurization to a specific database without any user interaction.

Automatic machine learning As machine learning grows more popular and complex, recent research has also focused on automating the entire machine learning pipeline [27, 67, 9]. These systems are generally designed for use by data science practitioners applying machine learning to external problems, and assist data scientists with model selection, hyperparameter tuning, etc. In contrast, our approach focuses on automatically generating features for machine learning applications within the DBMS, and on generating those features in an entirely automated fashion.

Learned embeddings This work is not the first to apply learned embeddings to database systems. In [44], the authors show how embeddings learned on the rows of a table (as opposed to this work, which learns an embedding of query operators) can be used for entity matching. In preliminary work [16], the authors present Termite, a system that helps users navigate and explore heterogeneous databases by building a multi-faceted embedding on structured and unstructured data. In [56], the authors present a method for embedding the text of SQL queries (the actual query as written by the user) using recurrent neural networks, and show that the features learned are useful for various textual tasks (such as detecting syntax errors). Most research into embeddings has focused on natural language processing [42, 57] or computer vision [22]. In contrast, our work demonstrates the power of deep learning as a feature engineering tool for operator-level data management tasks.

7 Conclusions

We have presented flexible operator embeddings, an automatic technique to custom-tailor a multi-purpose featurization to a particular database. Utilizing deep learning, our technique learns semantically rich, information dense embeddings of query operators automatically, and with minimal human interaction. We have shown that our technique produces features that can be utilized by simple, well-studied, off-the-shelf machine learning models such as logistic regression, and that the resulting trained models are both accurate and fast.

Moving forward, we plan to investigate additional applications of operator embeddings, especially in parallel databases. We are also considering new ways to integrate the embedding training process into modern query optimizers, and whether the training process could exploit recorded partial execution information about past queries.

| Feature | PostgreSQL Ops | Encoding | Description |
|---|---|---|---|
| Plan Width | All | Numeric | Optimizer’s estimate of output row width |
| Plan Rows | All | Numeric | Optimizer’s cardinality estimate |
| Plan Buffers | All | Numeric | Optimizer’s estimate of operator memory requirements |
| Estimated I/Os | All | Numeric | Optimizer’s estimate of the number of I/Os performed |
| Total Cost | All | Numeric | Optimizer’s cost estimate for this operator, plus the subtree |
| Join Type | Joins | One-hot | One of: semi, inner, anti, full |
| Parent Relationship | Joins | One-hot | When the child of a join. One of: inner, outer, subquery |
| Hash Buckets | Hash | Numeric | # hash buckets for hashing |
| Hash Algorithm | Hash | One-hot | Hashing algorithm used |
| Sort Key | Sort | One-hot | Key for sort operator |
| Sort Method | Sort | One-hot | Sorting algorithm, e.g. “quicksort”, “top-N”, “external sort” |
| Relation Name | All Scans | One-hot | Base relation of the leaf |
| Attribute Mins | All Scans | Numeric | Vector of minimum values for relevant attributes |
| Attribute Medians | All Scans | Numeric | Vector of median values for relevant attributes |
| Attribute Maxs | All Scans | Numeric | Vector of maximum values for relevant attributes |
| Index Name | Index Scans | One-hot | Name of index |
| Scan Direction | Index Scans | Boolean | Direction to read the index (forward or backwards) |
| Strategy | Aggregate | One-hot | One of: plain, sorted, hashed |
| Partial Mode | Aggregate | Boolean | Eligible to participate in parallel aggregation |
| Operator | Aggregate | One-hot | The aggregation to perform, e.g. max, min, avg |

Table 1: Features used for naive encoding

Appendix A Encoding from PostgreSQL

Here, we describe the naive, sparse encoding we derive from the PostgreSQL EXPLAIN output.

Table 1 describes the values used for our naive encoding. The first column lists the name of the quantity. The second column lists the PostgreSQL operators to which the value applies. The third column describes how the value is encoded into an input suitable for a neural network, and the fourth column gives a short description of the value. The encoding strategies are:

  • Numeric: the value is encoded as a numeric value, occupying a single vector entry.

  • Boolean: the value is encoded as either a zero or a one, occupying a single vector entry.

  • One-hot: the value is categorical, and is encoded as a one-hot vector, i.e. a vector with a single “1” element where the rest of the elements are “0”, occupying one vector entry per category.
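The three encoding strategies above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration, and the category list is taken from the "Strategy" row of Table 1; this is not the authors' implementation.

```python
from typing import List, Sequence


def encode_numeric(value: float) -> List[float]:
    """Numeric: the value itself, occupying a single vector entry."""
    return [float(value)]


def encode_boolean(flag: bool) -> List[float]:
    """Boolean: a zero or a one, occupying a single vector entry."""
    return [1.0 if flag else 0.0]


def encode_one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """One-hot: a single 1 at the position of `value`, zeros elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]


# Category list from the "Strategy" row of Table 1.
STRATEGIES = ["plain", "sorted", "hashed"]

print(encode_numeric(1200))                  # [1200.0]
print(encode_boolean(True))                  # [1.0]
print(encode_one_hot("sorted", STRATEGIES))  # [0.0, 1.0, 0.0]
```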

A particular operator is encoded using all of its applicable values, and using zeros for all inapplicable values. For example, a join operator will be encoded as a vector that has zeros for the “Strategy,” “Partial Mode,” and “Operator” entries, as these apply only to aggregate operators. An aggregate operator, on the other hand, would have zeros for entries such as “Join Type,” which apply only to joins. See Section 2 for additional details on the sparse encoding.
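As a rough sketch of this zero-filling scheme, the Python below lays out a small subset of the Table 1 features in a fixed order and fills the slots of inapplicable features with zeros. The `LAYOUT` structure, the `encode_operator` function, and the dictionary-based operator description are assumptions made for illustration and do not reflect the authors' code.

```python
from typing import Any, Dict, List, Optional, Tuple

# Fixed vector layout: (feature name, encoding, categories if one-hot).
# Only a subset of the Table 1 features is included to keep the sketch short.
LAYOUT: List[Tuple[str, str, Optional[List[str]]]] = [
    ("Plan Rows", "numeric", None),
    ("Join Type", "one-hot", ["semi", "inner", "anti", "full"]),
    ("Strategy", "one-hot", ["plain", "sorted", "hashed"]),
    ("Partial Mode", "boolean", None),
]


def encode_operator(props: Dict[str, Any]) -> List[float]:
    """Encode one operator; features missing from `props` become zeros."""
    vec: List[float] = []
    for name, kind, categories in LAYOUT:
        width = len(categories) if categories is not None else 1
        if name not in props:
            vec.extend([0.0] * width)  # inapplicable feature: all zeros
        elif categories is not None:
            vec.extend(1.0 if c == props[name] else 0.0 for c in categories)
        else:
            vec.append(float(props[name]))  # numeric or boolean entry
    return vec


# A join has no "Strategy" or "Partial Mode"; an aggregate has no "Join Type".
join_vec = encode_operator({"Plan Rows": 500, "Join Type": "inner"})
agg_vec = encode_operator(
    {"Plan Rows": 20, "Strategy": "hashed", "Partial Mode": True}
)
print(join_vec)
print(agg_vec)
```

Every operator maps to a vector of the same length, so the vectors can be fed to a single network. In this sketch a Boolean feature that is false and one that is inapplicable both produce a zero entry.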


References

  • [1] PostgreSQL database,
  • [2] M. Akdere et al. Learning-based query performance modeling and prediction. In ICDE ’12.
  • [3] M. Anderson et al. Brainwash: A Data System for Feature Engineering. In CIDR ’13.
  • [4] M. R. Anderson et al. Input selection for fast feature engineering.
  • [5] D. Antenucci et al. Ringtail: A Generalized Nowcasting System. VLDB ’13.
  • [6] A. Sallam et al. DBSAFE—An Anomaly Detection System to Protect Databases From Exfiltration Attempts. Systems ’17.
  • [7] J. L. Ba et al. Layer Normalization. arXiv ’16.
  • [8] Y. Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning. In ICML WUTL ’12.
  • [9] C. Binnig et al. Towards Interactive Curation & Automatic Tuning of ML Pipelines. In DEEM ’18.
  • [10] L. Breiman. Random Forests. Machine Learning ’01.
  • [11] T. Cover et al. Nearest neighbor pattern classification. Information Theory ’67.
  • [12] P. Domingos. A Few Useful Things to Know About Machine Learning. Comm. ACM ’12.
  • [13] J. Duggan et al. Contender: A Resource Modeling Approach for Concurrent Query Performance Prediction. In EDBT ’14.
  • [14] J. Duggan et al. Performance Prediction for Concurrent Database Workloads. In SIGMOD ’11.
  • [15] A. J. Elmore et al. Characterizing Tenant Behavior for Placement and Crisis Mitigation in Multitenant DBMSs. In SIGMOD ’13.
  • [16] R. C. Fernandez et al. Termite: A System for Tunneling Through Heterogeneous Data. In Preprint, 2019.
  • [17] A. Ganapathi et al. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE ’09.
  • [18] X. Glorot et al. Deep Sparse Rectifier Neural Networks. In PMLR ’11.
  • [19] H. Grushka-Cohen et al. CyberRank: Knowledge Elicitation for Risk Assessment of Database Security. In CIKM ’16.
  • [20] M. A. Hearst et al. Support vector machines. ISA ’98.
  • [21] D. W. Hosmer Jr et al. Applied Logistic Regression. 2013.
  • [22] Y. Jia et al. Caffe: Convolutional Architecture for Fast Feature Embedding. In MM ’14.
  • [23] I. Jolliffe. Principal Component Analysis. In IESS ’11.
  • [24] T. Kaftan et al. Cuttlefish: A Lightweight Primitive for Adaptive Query Processing. arXiv ’18.
  • [25] L. Khan et al. A new intrusion detection system using support vector machines and hierarchical clustering. VLDB ’07.
  • [26] A. Kipf et al. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In CIDR ’19.
  • [27] T. Kraska. Northstar: An Interactive Data Science System. VLDB ’18.
  • [28] T. Kraska et al. The Case for Learned Index Structures. In SIGMOD ’18.
  • [29] S. Krishnan et al. Learning to Optimize Join Queries With Deep Reinforcement Learning. arXiv ’18.
  • [30] Y. LeCun et al. Deep learning. Nature ’15.
  • [31] V. Leis et al. How Good Are Query Optimizers, Really? VLDB ’15.
  • [32] J. Li et al. Robust estimation of resource consumption for SQL queries using statistical techniques. VLDB ’12.
  • [33] H. Liu et al. Cardinality Estimation Using Neural Networks. In CASCON ’15.
  • [34] K. Lolos et al. Elastic management of cloud applications using adaptive reinforcement learning. In Big Data ’17.
  • [35] L. Bossi et al. A System for Profiling and Monitoring Database Access Patterns by Application Programs for Anomaly Detection. ToSE ’14.
  • [36] A. L. Maas et al. Learning word vectors for sentiment analysis. In HLT ’11.
  • [37] R. Marcus et al. Deep Reinforcement Learning for Join Order Enumeration. In aiDM ’18.
  • [38] R. Marcus et al. NashDB: An Economic Approach to Fragmentation, Replication and Provisioning for Elastic Databases. In SIGMOD ’18.
  • [39] R. Marcus et al. Releasing Cloud Databases from the Chains of Performance Prediction Models. In CIDR ’17.
  • [40] R. Marcus et al. Towards a Hands-Free Query Optimizer through Deep Learning. In CIDR ’19.
  • [41] R. Marcus et al. WiSeDB: A Learning-based Workload Management Advisor for Cloud Databases. VLDB ’16.
  • [42] T. Mikolov et al. Efficient Estimation of Word Representations in Vector Space. arXiv ’13.
  • [43] T. Mikolov et al. Linguistic Regularities in Continuous Space Word Representations. In HLT ’13.
  • [44] S. Mudgal et al. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD ’18.
  • [45] J. Ortiz et al. Learning State Representations for Query Optimization with Deep Reinforcement Learning. In DEEM ’18.
  • [46] J. Ortiz et al. PerfEnforce Demonstration: Data Analytics with Performance Guarantees. In SIGMOD ’16.
  • [47] J. Ortiz et al. SLAOrchestrator: Reducing the Cost of Performance SLAs for Cloud Data Analytics. In USENIX ATC ’18.
  • [48] A. Paszke et al. Automatic differentiation in PyTorch. In NIPS-W ’17.
  • [49] A. Pavlo et al. Self-Driving Database Management Systems. In CIDR ’17.
  • [50] F. Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR ’11.
  • [51] M. Peters et al. Deep Contextualized Word Representations. In NAACL ’18.
  • [52] S. Ruder. An overview of gradient descent optimization algorithms. arXiv ’16.
  • [53] R. E. Schapire. The Strength of Weak Learnability. Machine Learning ’90.
  • [54] J. Schmidhuber. Deep learning in neural networks: An overview. NN ’15.
  • [55] P. G. Selinger et al. Access Path Selection in a Relational Database Management System. In SIGMOD ’89.
  • [56] S. Jain et al. Database-Agnostic Workload Management. In CIDR ’19.
  • [57] D. Snyder et al. Deep neural network-based speaker embeddings for end-to-end speaker verification. In SLT ’16.
  • [58] R. Taft et al. STeP: Scalable Tenant Placement for Managing Database-as-a-Service Deployments. In SoCC ’16.
  • [59] T. Kraska et al. SageDB: A Learned Database System. In CIDR ’19.
  • [60] S. Tozer et al. Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads. In ICDE ’10.
  • [61] I. Trummer et al. SkinnerDB: Regret-bounded Query Evaluation via Reinforcement Learning. VLDB ’18.
  • [62] K. Tzoumas et al. A Reinforcement Learning Approach for Adaptive Query Processing. Technical Report, ’08.
  • [63] L. van der Maaten et al. Visualizing Data using t-SNE. JMLR ’08.
  • [64] S. Venkataraman et al. Ernest: Efficient performance prediction for large-scale advanced analytics. In NSDI ’16.
  • [65] W. Wu et al. Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable? In ICDE ’13.
  • [66] W. Wu et al. Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads. VLDB ’13.
  • [67] D. Xin et al. Helix: Accelerating Human-in-the-loop Machine Learning. VLDB ’18.
  • [68] P. Xiong et al. ActiveSLA: A Profit-oriented Admission Control Framework for Database-as-a-Service Providers. In SoCC ’11.
  • [69] J. Yosinski et al. How Transferable Are Features in Deep Neural Networks? In NIPS ’14.