Extracting patterns from a query workload has been an important technique in database systems research for decades, used for a variety of tasks including workload compression , index recommendation , modeling user and application behavior [49, 23, 51], query recommendation , predicting cache performance [44, 14], and designing benchmarks .
We see a need for generalized, automated techniques that can support all of these applications with a common framework, due to three trends: First, workload heterogeneity is increasing. In loosely structured analytics environments (e.g., “data lakes”), ad hoc queries over ad hoc datasets tend to dominate routine queries over engineered schemas , increasing heterogeneity and making heuristic-based pattern analysis more difficult. Second, workload scale is increasing. With the advent of cloud-hosted, multi-tenant databases like Snowflake which receive tens of millions of queries every day, database administrators can no longer rely on manual inspection and intuition to identify query patterns queries. Third, new use cases for workload analysis are emerging. User productivity enhancements, for example SQL debugging  and database forensics , are emerging, motivating a more automated analysis of user behavior patterns.
To mine for patterns in large, unstructured data sources, data items must be represented in a standard form. Representation learning 
aims to find semantically meaningful embeddings of semi-structured and unstructured data in a high-dimensional vector space to support further analysis and prediction tasks. The area has seen explosive growth in recent years, especially in text analytics and natural language processing (NLP). These learned embeddings aredistributional
in that they produce dense vectors capable of capturing nuanced relationships — the meaning of a document is distributed over the elements of a high-dimensional vector, as opposed to encoded in a sparse, discrete form such as bag-of-tokens. In NLP applications, distributed representations have been shown to capture latent semantics of words, such that relationships between words correspond (remarkably) to arithmetic relationships between vectors. For example, ifrepresents the vector embedding of the word , then one can show that , demonstrating that the vector embedding has captured the relationships between gendered nouns. Other examples of semantic relationships that the representation can capture include relating capital cities to countries, and relating the superlative and comparative forms of words [36, 35, 29]. Outside of NLP, representation learning has been shown to be useful for understanding nuanced features of code samples, including finding bugs or identifying the programmer’s intended task .
In this paper, we apply representation learning approaches to SQL workloads, with an aim of automating and generalizing database administration tasks. Figure 1 illustrates the general workflow for our approach. For all applications, we consume a corpus of SQL queries as input, from which we learn a vector representation for each query in the corpus using one of the methods described in Section 2
. We then use these representations as input to a machine learning algorithm to perform each specific task.
We consider two primary applications of this workflow: workload summarization for index recommendation , and identifying patterns of queries that produce runtime errors. Workload summarization involves representing each query with a set of specific features based on syntactic patterns. These patterns are typically identified with heuristics and extracted with application-specific parsers: for example, for workload compression for index selection, Surajit et al.  identify patterns like query type (SELECT, UPDATE, INSERT or DELETE), columns referenced, selectivity of predicates and so on. Rather than applying domain knowledge to extract specific features, we instead learn a generic vector representation, then cluster and sample these vectors to compress the workload.
For query debugging, consider a DBA trying to understand the source of out-of-memory errors in a large cluster. Hypothesizing that group by queries on high-cardinality, low-entropy attributes may trigger the memory bug, they tailor a regex to search for candidate queries among those that generated errors. But for large, heterogeneous workloads, there may be thousands of other hypotheses that they need to check in a similar way. Using learned vector approaches, any syntactic patterns that correlate with errors can be found automatically. As far as we know, this task has not been considered in the literature, though we encounter this problem daily in production environments. We use real customer workloads from Snowflake Elastic Data Warehouse  for training and evaluating our model. We describe these applications in more detail in Section 3.
A key benefit of our approach in both applications is that a single pre-trained model can be used for a variety of applications. We show this effect empirically in Section 4.5; we use a model trained on one workload to support an application on a completely unrelated workload. This transfer learning capability opens up new use cases, where a model can be trained on a large private workload but shared publicly, in the same way that Google shares pre-trained models of text embbeddings.
We make the following contributions:
We describe a general model for workload analysis tasks based on representation learning.
We adapt several NLP vector learning approaches to SQL workloads, considering the effect of pre-processing strategies.
We propose new algorithms based on this model for workload summarization and query error prediction.
We demonstrate that it is possible to pre-train models that generate query embeddings and use them for workload analytics on unseen query workloads.
This paper is structured as follows: We begin by discussing three methods for learning vector representations of queries, along with various pre-processing strategies (Section 2). Next, we present new algorithms for query recommendation, workload summarization, and two classification tasks that make use of our proposed representations (Section 3). We then evaluate our proposed algorithms against prior work based on specialized methods and heuristics (Section 4). We position our results against related work in Section 5. Finally, we present some ideas for future work in this space and some concluding remarks in Sections 6 and 7 respectively.
2 Learning Query Vectors
Representation learning  aims to map some semi-structured input (e.g., text at various resolutions [36, 35, 29], an image, a video , a tree , a graph, or an arbitrary byte sequence [3, 41]) to a dense, real-valued, high-dimensional vector such that relationships between the input items correspond to relationships between the vectors (e.g., similarity).
For example, NLP applications seek vector representations of textual elements (words, paragraphs, documents) such that semantic relationships between the text elements correspond to arithmetic relationships between vectors.
As a strawman solution to the representation problem for text, consider a one-hot encoding of the wordcar: A vector the size of the dictionary, with all positions zero except for the single position corresponding to the word car. In this representation, the words car and truck are as dissimilar (i.e., orthogonal) as are the words car and dog.
A better representation is suggested by the distributional hypothesis 
: that the meaning of a word is carried by the contexts in which it appears. To take advantage of this hypothesis, one can set up a prediction task: given a context, predict the word (or alternatively, given a word, predict its context.) The weights learned by training a neural network to make accurate predictions become an alternative representation of each word: a vector that captures all of the contexts in which the word appears. This approach has led to striking results in NLP, where semantic relationships between words can be shown to approximately correspond to arithmetic relationships between vectors.
In this section, we discuss alternatives for adapting representation learning for SQL queries. We first discuss pre-processing strategies, including removing literals and the tradeoffs between using raw SQL or optimized plans as inputs. Next we describe a direct application of Doc2Vec by predicting each token in the query from its context, i.e. “continuous bag-of-words” [35, 36]
). Finally we use Long Short Term Memory networks (LSTMs) to implement an autoencoder; this approach allows us to avoid explicitly defining the size of the context to be used and thereby better adapt to heterogeneous queries.
Preprocessing: Query Plan Templates
The optimized plan rather than the SQL text can be used as input to the network, but this strategy has strengths and weaknesses. There are variations in how a query may be expressed in SQL (e.g., expressing a join using JOIN ... ON ... vs. the WHERE clause). For some workload analysis tasks, these variations may become confounding factors, since two equivalent queries can produce very distinct representations. For example, deriving a set of indexes to improve performance for a given workload has little to do with how the SQL was authored, and everything to do with the plan selected by the optimizer. However, for other tasks, the ability to capture patterns in how users express queries as opposed to how the database evaluates the queries can be important. For example, query recommendation tasks that aim to improve user productivity  and should therefore be responsive to stylistic differences in the way various users express queries.
We remove literal values in the plan before training the network, primarily because the network will not be able to learn numeric relationships. We do not remove attribute and table identifiers: the usage patterns of specific tables and attributes in a workload may be informative. In some multi-tenant situations, table names and attribute names can also help identify patterns even across different customer schemas. For example, in the SQLShare workload  we found that the table name phel_clc_blastx_uniprot _sprot_sqlready_1.tab was considered most similar to the table name phel_deseq2_sig_results _c.sig in one data sharing workload. These tables correspond to different users and are not otherwise obviously related, but they were routinely queried the same way in the workload. Investigating, we found that these tables were uploaded by different users in the same community who were performing similar experiments.
After removing literals and constants, we obtain an optimized query plan from the database (or logs), which also contains estimated result sizes and estimated runtimes for each operator. The plan is represented as an XML document. We remove angle brackets and other syntactic features from the XML before learning a representation, but we found that this transformation did not significantly affect the results.
We now explore different strategies for learning representations for queries: A learned representation approach using context windows of a fixed size (i.e., Doc2Vec), and an approach based on LSTMs which automatically learns an appropriate context window (more precisely: the relative influence of tokens in the surrounding context is itself learned.) All of these approaches can work on either the raw query string or the processed query plan template.
2.1 Method 1: Doc2Vec
To improve on previous token-frequency based representations, Mikolov et al. [36, 35, 29] learn a vector representation for words in a document corpus by predicting the next word in context, then derive a vector representation for larger semantic units (sentences, paragraphs, documents) by adding a vector representing the paragraph to each context as an additional “word” to provide memory across context windows (see Figure 3). The learned vector for this virtual word is used as a representation for the entire paragraph. Vectors learned in this manner have been shown to work remarkably well for sentiment classification and clustering tasks [26, 30].
This approach can be applied directly for learning representations of queries: We can use fixed-size context windows to learn a representation for each token in the query, and include a query identifier to learn a representation of entire query. This approach implies a hypothesis that the semantics of a query is an aggregation of the semantics of a series of local contexts; evaluating this hypothesis for workload analytics tasks is one of the goals of this paper.
Although the local contexts are defined assuming a simple sequence of tokens (the previous tokens and the subsequent tokens), this approach is not necessarily inappropriate even for tree-structured inputs like queries: local sequential contexts are well-defined in any reasonable linearization of the tree. In this paper, we linearize the query plan and query plan templates using an in-order traversal. We describe this process in Algorithm 1.
To qualitatively evaluate whether these learned vector representation for queries can recover semantics, we can generate embeddings for queries with known patterns and see if we can recover those patterns in the derived vector space. In Figure 2, we show clusters derived from embedding (a) the raw SQL, (b) the query plan, and (c) the templatized query plan. Using the 21 TPC-H queries as patterns, we generated 200 random instances of each TPC-H query pattern. Next we learned vector embeddings model on raw query strings (1(a)) and generated 300 dimensional vectors for each query. We then used t-SNE  to reduce these embeddings to two dimensions in order to visualize them. We repeated the same experiments on linearized query plans (1(b)) and linearized query plan templates (1(c)). The color of each point represents the ground truth of the original TPCH query type, unknown to the model. All algorithms are able to recover the original structure in this contrived workload, but the clusters resulting from the raw SQL are less cohesive, suggesting that the use of optimized plans may be more effective.
2.2 Method 2: LSTM Autoencoder
The paragraph vector approach in the previous section is viable, but it requires a hyper-parameter for the context size. There is no obvious way to determine a fixed context size for queries, for two reasons: First, there may be semantic relationships between distant tokens in the query. As an illustrative example, attribute references in the SELECT clause correspond to attribute references in the GROUP BY clause, but there may be arbitrary levels of nesting in between. Second, the length of queries vary widely in ad hoc workloads [25, 24]. Figure 4 illustrates the query length distribution for two real SQL workloads .
To avoid setting a context size, we can use Long Short-Term Memory (LSTM) networks, which are modified Recurrent Neural Networks (RNN) that can automatically learn how much context to remember and how much of it to forget, thereby removing the dependence on a fixed context size (and avoid theexploding or vanishing gradient problem 
). LSTMs have successfully been used in sentence classification, semantic similarity between sentences and sentiment analysis.
Background: Long Short Term Memory (LSTM) networks
Zaremba and Sutskever define an LSTM unit at time (time is equivalent to position of tokens in the processed query) to contain a memory cell , a hidden state , an input gate , a forget gate and an output gate . The forget gate scales the output from the previous cell output which is passed to the memory cell at time t. The transition equations governing these values are as follows:
where x is the input at current time step,
denotes logistic sigmoid function anddenotes the Hadamard product. The matrices and represent weights and vector represents the bias. The superscript , , and in the weight and bias matrices represent the corresponding weight matrices and biases for input gate, forget gate, output gate and cell state respectively. These are the parameters that are learned by the network using backpropagation . The state passed from one cell to the next are the cell state and the hidden state . The hidden state represents the an encoding of the input seen until the cell.
LSTM based autoencoder
We use a standard LSTM encoder decoder network [54, 32] with architecture as illustrated in Figure 5. A typical autoencoder trains the network to reproduce the input. The encoder LSTM part of the network encodes the query (or query plan), and this encoding is subsequently used by the later half of the network (decoder) which tries to reproduce the input query (or query plan). More concretely, we train the network by using the hidden state of the final LSTM cell (denoted by in the figure) to reproduce the input query (or query plan). The decoder network outputs one token at a time, and also uses that as an input to the subsequent LSTM cell to maintain context.
Our LSTM-based approach is given in Algorithm 2. An LSTM autoencoder is trained by sequentially feeding words from the pre-processed queries to the network one word at a time, and then reproduce the input. The LSTM network not only learns the encoding for the samples, but also the relevant context window associated with the samples as the forget gate scales the output through each step. The final hidden state () on the encoder network gives us an encoding for the samples.
Once this network has been trained, an embedded representation for a query can be computed by passing the query to the encoder network, completing a forward pass, and using the hidden state of the final encoder LSTM cell as the learned vector representation. Lines 9 through 13 in Algorithm 2 describe this process procedurally.
In this section, we use the algorithms from Section 2 as the basis for new algorithms for workload analysis tasks. We will evaluate these applications in Section 4. We consider two applications in detail: workload summarization for index selection  and resource error prediction. Then, in Section 6, we consider some novel applications enabled by these Query2Vec representations.
In each case, a key motivation is to provide services for a multi-tenant data sharing system, where many users are working with many schemas. In these situations, it is difficult to exploit heuristics to identify workload features since the schemas, query “style,” and application goals can vary widely. Learned vector representations can potentially offer a generic approach to workload analytics that adapts automatically to patterns in the workload, even across schemas, users, and applications. Our underlying hypothesis is that a common representation of queries provides a basis for algorithms for a variety of workload analysis tasks,and that these new algorithms can compete favorably with more specialized algorithms tailored for each specific task. We use the architecture proposed in Figure 1 for the applications that follow in the later parts of this section.
In this paper, we generally assume that we will train the model on a corpus that is specific to the application or system we are trying to support. However, we consider it likely that one can pre-train models on a large shared public query corpus (e.g., SDSS  or SQLShare ), and then re-use the pre-trained vectors for specific analysis tasks, in the same way that NLP applications and image analysis applications frequently re-use models and learned vectors pre-trained on large text and image corpora. We demonstrate some preliminary results of this approach in Section 4.5 and leave an analysis of the tradeoffs of this transfer learning approach to future work.
3.1 Workload Summarization
The goal of workload summarization [11, 28] is to find a representative sample of the whole workload as input to further database administration and tuning tasks, including index selection, approximate query processing, histogram tuning, and statistics selection . We focus here on index recommendation as a representative application of workload summarization. The goal is to find a set of indexes that will improve performance for a given workload . Most commercial database systems include index recommendation as a core feature; Microsoft SQL Server has included the feature as part of its Tuning Advisor since SQL Server 2000 . We will not further discuss the details of the index recommendation problem; following previous work on workload summarization, we make the assumption that it is a black box algorithm that accepts a workload as input and produces a set of indexes as an output.
The workload summarization task (with respect to index recommendation) is as follows: Given a query workload , find a representative subset of queries such that the set of indexes recommended based on approximate the set of indexes of the overall workload . Rather than attempt to measure the similarity of two sets of indexes (which can vary in syntactic details that may or may not significantly affect performance), we will follow the literature and measure the performance of the overall workload as the indicator of the quality of the indexes. Workload summarization is desirable not only to improve recommendation runtime (a roughly task), but also potentially to optimize query runtime performance during ordinary database operations .
Previous techniques for workload summarization are primarily variants of workload summarization algorithms described by Chaudhury et al. . They describe three main approaches to the problem: 1) a stratified random sampling approach, 2) a variant of K-Mediod algorithm which produces clusters and automatically selects an appropriate and 3) a greedy approach which prunes the workload of queries which do not satisfy the required constraints. Case 2) is highly dependent on the distance function between queries; the authors emphasize that a custom distance function should be developed for specific workloads. As we have argued, this approach is increasingly untenable for multi-tenant situations with highly dynamic schemas and workloads. Of these, the K-Mediod approach, equipped with a custom distance function that looks for join and group by patterns, performs the best in terms of quality of compression and time taken for compression.
Our observation is that the custom distance function is unnecessary: we can embed each query into a high-dimensional vector space in a manner that preserves its semantics, such that simple cosine distance between the vectors performs as well as application-specific distance functions over applications-specific feature vectors.
Since we no longer need the custom distance function, we can also use the much faster k-means algorithm rather than the more flexible (but much slower) K-Mediods algorithm originally proposed: K-Mediods selects an element in the dataset as the centroid, meaning that distances are always computed between two elements as opposed to between an element and the mean of many elements. The upshot is that K-Mediods supports any arbitrary distance function, a feature on which the original method depends crucially since these functions are black boxes. We do still need to select a member of the cluster to return at the last step; we cannot just return the centroid vector, since an arbitrary vector cannot be inverted back into an actual SQL query. Therefore we perform an extra linear scan to find the nearest neighbor to the computed centroid.
We illustrate our algorithm in figure LABEL:ws and provide a procedural description in Algorithm 3. We use K-means to find query clusters and then pick the closest query to the centroid in each cluster as a representative query for that cluster in the summary. To determine the optimal we use the Elbow method  which runs the K-means algorithm in a loop with increasing till the rate of change of the sum of squared distances from centroids ‘elbows’ or plateaus out. We use this method due to its ease of implementation, but our algorithm is agnostic to the choice of the method to determine .
3.2 Classification by Error
An individual query is unlikely to cause a mature relational database system to generate an error. However, modern MPP databases and big data systems are significantly more complex than their predecessors and operate at significantly larger scale. In the real workload we study (from Snowflake Elastic Data Warehouse ), from a 10-day window from one deployment consisted of about million select queries, about 100,000 queries resulted in some kind of error (). Moreover, systems have become increasingly centralized (e.g., as data lakes) and multi-tenant (e.g., cloud-hosted services.) In this heterogeneous regime, errors are more common, and identifying the syntactic patterns that tend to produce errors can not be performed manually.
In figure 6, we show a clustering of error-generating SQL queries from a large-scale cloud-hosted multi-tenant database system. Color represents the type of error; there are over 20 different types of errors ranging from out-of-memory errors to hardware failures to query execution bugs. The syntax patterns in the workload are complex (as one would expect), but there are obvious clusters, some of which are strongly associated with specific error types. For example, the cluster at the upper right corresponds to errors raised when a timeout was forced on dictionary queries.
Using an interactive visualization based on these clusterings, the analyst can inspect syntactic clusters of queries to investigate problems rather than inspecting individual queries, for two benefits: First, the analyst can prioritize large clusters that indicate a common problem. Second, the analyst can quickly identify a number of related examples in order to confirm a diagnosis. For example, when we first showed this visualization to our colleagues, they were able to diagnose the problem associated with one of the clusters immediately.
4.1 Experimental Datasets
Datasets for training query2vec
We evaluate our summarization approach against TPC-H in order to facilitate comparisons with experiments in the literature, and to facilitate reproducibility. We use scale factor 1 (dataset size 1GB) in order to compare with prior work. We generated queries with varying literals for 21 different TPC-H query templates. We ignored query 15 since it includes a CREATE VIEW statement.
We evaluate error forensics using a real workload from Snowflake , a commercial large-scale, cloud-hosted, multi-tenant database service. This service attracts millions of queries per day across many customers, motivating a comprehensive approach to workload analytics. For training the query2vec algorithms, we used a random sample with select queries over a 10-day window of all customer queries. The following query over the jobs table in the commercial database service was used:
select * from ( select * from ( select query, id from jobs -- logs table where created_on > ’1/09/2018’ and statement_properties = ’select’ ) tablesample (16) -- random sample ) limit 500000;
The statistics for these workloads appear in Table 1.
Datasets for experimental evaluation
For workload summarization experiment, we evaluate the query2vec model (trained on the TPC-H dataset in Table 1) by generating summary for another subset of TPC-H queries. For error forensics, we use 2 datasets. First dataset contains an even mix of queries that failed and queries with no errors (Snowflake-MultiError). The Second dataset contains queries that failed due to out-of-memory errors and queries with no errors. Details for these evaluation workloads appear in Table 2.
4.2 Workload Summarization and Indexing
In this section we measure the performance of our workload summarization algorithm. Following the evaluation strategy of Chaudhuri et al., we first run the index selection tool on the entire workload , create the recommended indexes, and measure the runtime for the original workload. Then, we run our workload summarization algorithm to produce a reduced set of queries , re-run the index selection tool, create the recommended indexes, and again measure the runtime of the entire original workload.
We have performed experiments to directly compare with prior work and found significant benefits, but since the original paper reports on results using a much earlier version of SQL Server, the results were difficult to interpret. Since the earlier compression techniques have since been incorporated directly into SQL Server , we instead use our method as a pre-processing step to see if it adds any additional benefit beyond the state-of-the-art built-in compression approach. The documentation for SQL Server states that workload compression is applied; we did not find a way to turn this feature off.
We further apply a time budget to show how giving the index recommendation tool more time produces higher quality recommendations and improved performance on the workload.
For this experiment, we use SQL Server 2016 which comes with the built-in Database Engine Tuning Advisor. We set up the SQL Server on a AWS EC2 instance. We set up TPC-H with scale factor 1 on this server. The experimental workflow involved submitting the workload (or summary) to Database Engine Tuning Advisor (with varying time budget), getting recommendations for the indexes, creating these indexes, running the complete TPC-H workload (Table 2) and measuring the total time taken, clearing the cache and finally dropping the indexes (for the next run of the experiment).
Figure 7 shows the results of the experiment. The x-axis is the time budget provided to the index recommender (a parameter supported by SQL Server). The y-axis is the runtime for the entire workload, after building the recommended indexes. The times for the learned vector embedding methods are bi-modal: For time budgets less than 3 minutes, our methods do not finish (and SQL server produces no recommendations as well), and we default to recommending zero indexes. These numbers do not include query2vec training time: in practice, the model would only be trained once and used for a variety of applications. This time only includes the time to process the queries to produce a learned vector representation. This step is still expensive due to the multi-layer neural network used (e.g., a chain of vector-matrix multiply operations must be performed.) We have not attempted to optimize this step, though it is trivial to do so: the workload can be processed in parallel, and a number of high-performance engines for neural networks are available.
Surprisingly, under tight time budgets, the index recommendations made by the native system can actually hurt performance relative to having no indexes at all! The optimizer chooses a bad plan based on the suboptimal indexes. In Figure 7(a) and Figure 7(b), we show the sequence of queries in the workload on the x-axis, and the runtime for each query on the y-axis. The indexes suggested under a 3 minute time budget result in all instance of TPC-H query 18 (queries 640-680 in Figure 7(a)) taking much longer than they would take when run without these indexes.
The key conclusion is that pre-compression can help achieve the best possible recommendations in significantly less time, even though compression is already being applied by the engine itself.
Figure 9 shows the compression (percentage reduction in workload size) achieved by our algorithms on the TPC-H workload. We see that as the number of queries in the workload increases, the overall compression improves. This increase in compression this because our workloads at different were generated using 21 TPC-H query types. Since our algorithms looks for syntactic equivalences, they result in a summary which is always around 20-30 queries in size.
Time taken for summarization
All of our methods, take about 2-3 minutes to summarize the workload. This excludes the offline one-time training step for learning the query2vec model (we discuss query2vec training time in Section 4.4). The index suggestion task is roughly quadratic and takes more than an hour on the complete workload, so the time taken for summarization is amply justified.
4.3 Classifying Queries by Error
Classifying multiple errors
In this experiment, we use the workload Snowflake-MultiError as described in Table 2 to train a classifier which can predict an error type (or no_error) given an input query. We use the pre-trained LSTM autoencoder based Query2Vec model on the Snowflake dataset111The performance on Doc2Vec based model is similar.. We evaluate only the models trained on query strings, because plans for some queries were not available since the corresponding database objects were deleted or the schema was altered.
We use the datasets in (Table 1) to learn the query representations for all queries in the Snowflake-MultiError workload. Next, we randomly split the learned query vectors (and corresponding error codes) into training () and test () sets. We use the training set to train a classifier using the Scikit-learn implementation of Extremely Randomized Trees [39, 9]. We present the performance of this classifier on the test set.
|Error Code||precision||recall||f1-score||# queries|
|avg / total||0.979||0.979||0.978||15000|
We summarize the performance of classifier on error classes with more than 10 queries each in Table 3. The classifier performs well for the errors that occur sufficiently frequently, suggesting that the syntax alone can indicate queries that will generate errors. This mechanism can be used in an online fashion to route queries to specific resources with monitoring and debugging enabled to diagnose the problem. Offline, query error classification can be used for forensics; it is this use case that was our original motivation. Although individual bugs are not difficult to diagnose, there is a long tail of relatively rare errors; manual inspection and diagnosis of these cases is prohibitively expensive. With automated classification, the patterns can be presented in bulk.
Classifying out-of-memory errors
In this experiment, we compare the classification performance of our method for one type of error considered a high priority for our colleagues — queries running out of memory (OOM). We compare to a baseline heuristic method developed in collaboration with Snowflake based on their knowledge of problematic queries. We use the workload Snowflake-OOM as described in Table 2 to train a classifier to predict out-of-memory errors. Following the methodology in the previous classification task, we use the pre-trained Query2Vec model to generate query representations for the workload, randomly split the learned query vectors into training () and test () sets, train a classifier using the Scikit-learn implementation of Extremely Randomized Trees, and present the performance on the test set.
Heuristic Baselines: We interviewed our collaborators at Snowflake and learned that the presence of window functions or joins between large tables in the queries tend to be associated with OOM errors. We implement four naïve baselines that looks for the presence of window functions or a join between at least 3 of the top largest tables in Snowflake. The first baseline looks for the presence of heavy joins, the second baseline looks for window functions, and the third baseline looks for the presence of either one of the indicators: heavy joins or window functions, and the fourth baseline looks for the presence of both heavy joins and window functions. The baselines predicts that the query will run out of memory if the corresponding indicator is present in the query text.
Table 4 shows the results. We find that our method significantly outperforms the baseline heuristics, without requiring any domain knowledge or custom feature extractors. We do find that the presence of heavy joins and window functions in the queries are good indicators of OOM errors (specially if they occur together) given the precision of these baselines, however, the low recall suggests that such hard-coded heuristics would miss a other causes of OOM errors. Query2Vec obviates the need for such hard-coded heuristics. As with any errors, this mechanism can be used to route potentially problematic queries to clusters instrumented with debugging or monitoring harnesses, or potentially clusters with larger available main memories. We see Query2Vec as a component of a comprehensive scheduling and workload management solution; these experiments show the potential of the approach.
|Contains heavy joins||0.729||0.115||0.198|
|Contains window funcs||0.762||0.377||0.504|
|Contains heavy joins OR window funcs||0.724||0.403||0.518|
|Contains heavy joins AND window funcs||0.931||0.162||0.162|
4.4 Training Time
In this experiment, we evaluated the training time of the embedding model against datasets of varying sizes. The hete Figure 10 shows the trend for training times for learning the Query2Vec models on the two dataset we use (Table 1) for varying workload size. All of the models were trained on a personal computer (MacBook Pro with 2.8 GHz Intel Core i7 processor, 16 GB 1600 MHz DDR3 and NVIDIA GeForce GT 750M 2048 MB graphics card). We use the highly parallelized publicly available implementation of doc2vec . For small workloads, training time is a few seconds. The entire Snowflake training set of queries takes less than 10 minutes.
We implemented the LSTM autoencoder in PyTorch, without any attempt to optimize the runtime. This method takes longer to train, going up to 20 minutes for workload sizes of 4200 queries. The total training time using the entire Snowflake workload was roughly 14 hours. We find that this time is justified, given the increase in performance and that this is onetime offline step. Furthermore, faster neural architectures can be deployed to greatly improve this time. We leave this optimization for future work.
We find that the training time is a function of average query length in the workload. More concretely, training time increases with the increase in average query length.
4.5 Transfer Learning and Model Reuse
In this section we evaluate if the Query2Vec models trained on a large number of queries from one workload (or a combination of workloads) can be used to learn query representations for separate previously unseen workload. To this end, we measure the performance of workload summarization algorithm on TPC-H dataset, while using the Query2Vec model trained on Snowflake dataset (Table 1. The experimental workflow is as follows: 1) Train Query2Vec models on Snowflake workload, 2) Infer query representations for TPC-H workload, 3) Generate a workload summary, 4) Measure the performance of this summary against the results in Section 4.2.
We summarize the results for this experiment in Figure 11. We find that even with no prior knowledge of the TPC-H workload, query2vec model trained on Snowflake workload aids workload summarization and even outperforms the index recommendation by SQL Server till the time budget minutes. This validates our hypothesis that we can train these generic querv2vec models on a large number of queries offline, and use these pre-trained models for workload analytics on unseen query datasets.
5 Related Work
Word Embeddings and representation learning
Word embeddings were introduced by Bengio et al.  in 2003, however the idea of distributed representations for symbols is much older and was proposed earlier by Hinton in 1986 . Mikolov et al. demonstrated word2vec [35, 36], an efficient algorithm which uses negative sampling to generate distributed representations for words in a corpus. word2vec was later extended to finding representations for complete documents and sentences in a follow-up work, doc2vec . Levy et al. provided a theoretical overview of why these techniques like word2vec [31, 17, 40] work so well. Generic methods for learning representations for complex structures were introduced in [41, 7]. In contrast, our work provided specialized algorithms for learning representations for SQL queries. Representation learning for queries has been implicitly used by Iyer et el. 
in some recent automated code summarization tasks, using a neural attention model, whereas we evaluate general query embedding strategies explicitly and explore a variety of tasks these embeddings enable. Zamani et al. and Grbovic et al.  proposed a method to learn embeddings for natural language queries or to aid information retrieval tasks, however we consider learning embeddings for SQL queries and their applications in workload analytics. LSTMs have been used various text encoding tasks like sentiment classification by Wang et el. , machine translation by Luong et el.  and as text autoencoders by Li et el. . Our work is inspired by the success of these approaches and demonstrates their utility for SQL workload analytics.
Compressing SQL Workloads
Workload summarization has been done in the past by Surajit’s et al. , however their method does not use query embeddings. Our evaluation strategy (index selection as a method to evaluate the compression) for workload summarization remains the same as the one proposed in . Piotr Kolaczkowski et al. provided a similar solution to workload compression problem , however their work doesn’t provide a detailed evaluation or the dataset they perform the evaluation on, therefore we do not provide a comparison with their approach.
Workload analytics and related tasks
Decades ago, Yu et al. characterized relational workloads as a step toward designing benchmarks to evaluate alternative design tradeoffs for database systems . Jain et al. report on the SQL workload generated from a multi-year deployment of a database-as-a-service system , finding that the heterogeneity of queries was higher than for conventional workloads and identifying common data cleaning idioms . We use this dataset in our study. Grust et al. use a query workload to support SQL debugging strategies  ; we envision that our embedding approaches could be used to identify patterns of mistakes and recommend fixes as a variant of our query recommendation task. We envision workload analytics as a member of a broader class of services for weakly structured data sharing environments (i.e., “data lakes”). Farid et al. propose mining for integrity constraints in weakly structured “load first” environments (i.e., “data lakes”) . Although the authors do not assume access to a query workload, we consider the workload, if available, a source of candidate integrity constraints.
6 Future Work
We are interested in adapting this approach for other perennial database challenges, including query optimization and data integration. Query embeddings can simplify the process of incorporating past experience into optimization decisions by avoiding an ever-expanding set of heuristics. We also envision mapping SQL vectors to plan vectors automatically, short circuiting the optimization process when a workload is available. Pre-trained Query2Vec models could assist in optimizing complex queries when the optimizer has incomplete information: in heterogeneous “polystore” environments  where an unproven system’s performance may be unknown, or in real time situations where data statistics are not available. In these low-information situations, Query2Vec models can be used as another source of information to offset the lack of a robust cost model and complete statistics. Looking further ahead, given enough (query, optimized plans) tuples, it could be possible to train a model to directly generate an optimized plan using an RNN in the same way a caption for an image can be synthesized!
Related to query optimization, we also intend to study query runtime prediction. To the extent that runtime is a function of data statistics and the structure of the query, vector embeddings should be able to capture the patterns given a workload and enable predicting it. The size of the workload required for training and the sensitivity to changes in the data are potential challenges.
For data integration, we hypothesize that query embeddings can help uncover relationships between schema elements (attributes, tables) by considering how similar elements have been used in other workloads. For example, if we watch how users manipulate a newly uploaded table, we may learn that “la” corresponds to “latitude” and “lo” corresponds to “longitude.” In weakly structured data sharing environments, there are thousands of tables, but many of them are used in the same way. Conventional data integration techniques consider the schema and the data values; we consider the query workload to be a novel source of information to guide integration.
We will also explore transfer learning further, i.e. training a Query2Vec model on a large shared public query corpus then reusing it in more specialized workloads and situations. We will also publish a pretrained query2vec model which database developers and researchers can utilize to assist with a variety of workload analytics tasks.
In this paper we only considered embeddings for queries, but we intend to explore the potential for learning representations of the database elements as well: rows, columns, and complete tables. This approach could enable new applications, such as automatically triggering relevant queries for never-before-seen datasets. That is, we could compare the vector representation for the newly ingested dataset with vector representations of queries to find suitable candidates.
We are also exploring other neural architectures for query2vec, including TreeLSTMs  which can take query plan tree as an input in order to produce query embeddings, thus learning similarities not just in query text, but also in the plan trees.
Finally, wherever applicable, we will add these as features to existing database systems and services and conduct user studies to measure their effect on productivity.
We presented Query2Vec, a general family of techniques for using NLP methods for generalized workload analytics. This approach captures the structural patterns present in the query workload automatically, largely eliminating the need for the specialized syntactic feature engineering that has motivated a number of papers in the literature.
To evaluate this approach, we derived new algorithms for two classical database tasks: workload summarization, which aims to find a representative sample of a workload to inform index selection, and error forensics, which aims to assist DBAs and database developers in debugging errors in queries and to assist users when authoring queries by predicting that query might result in an error. On these tasks, the general framework outperformed or was competitive with previous approaches that required specialized feature engineering, and also admitted simpler classification algorithms because the inputs are numeric vectors with well-behaved algebraic properties rather than result of arbitrary user-defined functions for which few properties can be assumed. We find that even with out-of-the-box representation learning algorithms like Doc2Vec, we can generate query embeddings and enable various new machine learning based workload analytics tasks. To further improve on the accuracy of these analytics tasks, we can use more sophisticated training methods like LSTM based autoencoders.
Finally, we demonstrate that learnings from models on large SQL workloads can be transfered to other workloads and applications.
We believe that this approach provides a new foundation for a variety of database administration and user productivity tasks, and will provide a mechanism by which to automatically adapt database operations to a specific query workload.
We would like to thank our collaborators at Snowflake for valuable feedback and datasets which made this work possible. We would also like to thank Louis M Burger and Doug Brown from Teradata for their feedback and helpful discussion on the topics covered in this paper. This work is sponsored by the National Science Foundation award 1740996 and Microsoft.
-  Database engine tuning advisor features. https://technet.microsoft.com/en-us/library/ms174215(v=sql.105).aspx. [Online;].
-  Gensim doc2vec implementation. https://radimrehurek.com/gensim/models/doc2vec.html.
-  Towards Anything2Vec. http://allentran.github.io/graph2vec.
-  TPC Benchmark H (Decision Support) Standard Specification Revision 2.14.0. http://www.tpc.org.
-  S. Agrawal, S. Chaudhuri, L. Kollar, A. Marathe, V. Narasayya, and M. Syamala. Database tuning advisor for microsoft sql server 2005: Demo. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 930–932, New York, NY, USA, 2005. ACM.
-  J. Akbarnejad, G. Chatzopoulou, M. Eirinaki, S. Koshy, S. Mittal, D. On, N. Polyzotis, and J. S. V. Varman. Sql querie recommendations. Proceedings of the VLDB Endowment, 3(1-2):1597–1600, 2010.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
-  L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
-  S. Chaudhuri, P. Ganesan, and V. Narasayya. Primitives for workload summarization and implications for sql. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 730–741. VLDB Endowment, 2003.
-  S. Chaudhuri, A. K. Gupta, and V. Narasayya. Compressing sql workloads. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 488–499. ACM, 2002.
-  S. Chaudhuri and V. R. Narasayya. An efficient, cost-driven index selection tool for microsoft sql server. In VLDB, volume 97, pages 146–155. Citeseer, 1997.
-  B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. Munir, S. Pelley, P. Povinec, G. Rahn, S. Triantafyllis, and P. Unterbrunner. The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 215–226, New York, NY, USA, 2016. ACM.
A. Dan, P. S. Yu, and J. Y. Chung.
Characterization of database access pattern for analytic prediction of buffer hit probability.The VLDB Journal, 4(1):127–154, Jan. 1995.
-  J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik. The bigdawg polystore system. SIGMOD Rec., 44(2):11–16, Aug. 2015.
-  M. Farid, A. Roatis, I. F. Ilyas, H.-F. Hoffmann, and X. Chu. Clams: Bringing quality to data lakes. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 2089–2092, New York, NY, USA, 2016. ACM.
-  Y. Goldberg and O. Levy. word2vec explained: Deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
-  M. Grbovic, N. Djuric, V. Radosavljevic, F. Silvestri, and N. Bhamidipati. Context-and content-aware embeddings for query rewriting in sponsored search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 383–392. ACM, 2015.
-  T. Grust and J. Rittinger. Observing sql queries in their natural habitat. ACM Trans. Database Syst., 38(1):3:1–3:33, Apr. 2013.
-  G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society, volume 1, page 12. Amherst, MA, 1986.
-  S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
-  S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Summarizing source code using a neural attention model.
-  S. Jain and B. Howe. Data cleaning in the wild: Reusable curation idioms from a multi-year sql workload. In Proceedings of the 11th International Workshop on Quality in Databases, QDB 2016, at the VLDB 2016 conference, New Delhi, India, September 5, 2016, 2016.
-  S. Jain and B. Howe. SQLShare Data Release. https://uwescience.github.io/sqlshare//data_release.html, 2016. [Online;].
-  S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. Sqlshare: Results from a multi-year sql-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 281–293. ACM, 2016.
-  Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
-  T. M. Kodinariya and P. R. Makwana. Review on determining number of cluster in k-means clustering. International Journal, 1(6):90–95, 2013.
-  P. Kołaczkowski. Compressing very large database workloads for continuous online index selection. In Database and Expert Systems Applications, pages 791–799. Springer, 2008.
-  Q. V. Le and T. Mikolov. Distributed representations of sentences and documents.
The mnist database of handwritten digits.
-  O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representations. In CoNLL, pages 171–180, 2014.
-  J. Li, M. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. CoRR, abs/1506.01057, 2015.
-  M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin.
Convolutional neural networks over tree structures for programming
Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  K. E. Pavlou and R. T. Snodgrass. Generalizing database forensics. ACM Trans. Database Syst., 38(2):12:1–12:43, July 2013.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation.
-  M. Rudolph, F. Ruiz, S. Mandt, and D. Blei. Exponential family embeddings. In Advances in Neural Information Processing Systems, pages 478–486, 2016.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research. chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
-  M. Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008.
-  C. Sapia. Promise: Predicting query behavior to enable predictive caching strategies for olap systems. In Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2000, pages 224–233, London, UK, UK, 2000. Springer-Verlag.
-  V. Singh, J. Gray, A. Thakar, A. S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, and B. Yanny. Skyserver traffic report-the first five years. arXiv preprint cs/0701173, 2007.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2015.
-  K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015.
-  D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification.
-  Q. T. Tran, K. Morfonios, and N. Polyzotis. Oracle workload intelligence. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1669–1681, New York, NY, USA, 2015. ACM.
-  Y. Wang, M. Huang, L. Zhao, et al. Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615, 2016.
-  P. S. Yu, M.-S. Chen, H.-U. Heiss, and S. Lee. On workload characterization of relational database environments. IEEE Trans. Softw. Eng., 18(4):347–355, Apr. 1992.
-  H. Zamani and W. B. Croft. Estimating embedding vectors for queries. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR ’16, pages 123–132, New York, NY, USA, 2016. ACM.
-  W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
-  R. S. Zemel. Autoencoders, minimum description length and helmholtz free energy. NIPS, 1994.