Bidirectional Attention for SQL Generation

12/30/2017 ∙ by Tong Guo, et al. ∙ 0

Generating structural query language (SQL) queries from natural language is a long-standing open problem and has been attracting considerable interest recently, driven by the explosive development of deep learning techniques. Toward solving the problem, we leverage the structure of SQL queries and present a sketch-based approach or synthesizing way to solve this problem, which turns the solving method to a sequence-to-set problem and word order generation problem. We employ the bidirectional attention mechanisms and character level embedding to further improve the result. Experimental evaluations show that our model achieves the state-of-the-art results in WikiSQL dataset.



There are no comments yet.


page 3

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In recent years, with the explosive development of deep learning techniques [1]

, the problem of generating SQL from natural language has been attracting considerable interest recently. We refer to this problem as the natural-language-to-SQL problem (NL2SQL). Relational databases store the structured data, which is a large amount of entities or numbers. Due to the large number of entities or numbers, which enlarge the word vocabulary for deep learning for natural language processing. And the larger vocabulary will make it harder to find the exact data for a natural language query. For example, in question-answering (QA) problem, we use matching network 

[2] to get the best answer given a question. In the QA problem, the larger amount of entities will lead to the larger class number of answer and will decrease the accuracy of other deep learning models. But generation of SQL from natural language is a good way to handle this application, which could leverage the benefit of relational database itself.

The study of translating natural language into SQL queries has a long history. Recent works consider deep learning as the main technique.  [3]

employs an improved encoder-decoder framework based on neural machine translation

[4, 5] to solve this problem.  [6] uses augmented pointer network [7] to tackle this task, which is a attentional sequence to sequence model as sequence neural semantic parser and achieves state-of-the-art results on a variety of semantic parsing datasets. In [8]

, Seq2SQL breaks down this task to several sub-modules or sub-SQL to solve and incorporates execution rewards in reinforcement learning. But Seq2SQL has the ”order-matter” problem, which means the order of the two conditions in the WHERE clause does not affect the execution results of the query. It is well-known that the order affects the performance of a sequence-to-sequence style model 

[9]. SQLNet [10] introduces the attention mechanisms [11] for the model of [8] and solve the ”order-matter” problem by proposing the sequence-to-set technique for the WHERE clause of SQL.

The problem of NL2SQL can be considered as a special instance to the more generic semantic parsing problem. There are many works considering parsing natural language into a logical form [12, 13, 14, 15, 16, 17, 18, 19]. Parsing natural language into a logical form has a wide application in question answering and dialog system. And there are some datasets such as GeoQuery [20] and ATIS [21].

Our main contributions in this work are three-fold. First, we apply bidirectional attention to add the relevant information between two sequences for prediction. Second, we leverage character-level embedding and convolutional neural networks (CNNs) to maps each word to a vector space as the new embedding input. Last, the model we design achieves the state-of-the-art on the WikiSQL dataset. The code is available.


2 Task Description

In the NL2SQL task, given a question and a database table, the machine needs to generate a SQL to query the database table, and find the answer to the question. The question is described as a sequence of word tokens: where is the number of words in the question, and the table is described as a sequence of columns , where is the number of columns in the table. The table also has a number of data rows which contains the answer to the question.

We now explain the WikiSQL dataset [8], a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia. We present an example in Fig. 1. It has several features:

(1) Our model synthesizes a SQL query under the condi-tion that the table which the SQL will execute is determined. In other words, our model need not predict the exact table in the all tables.

(2) The SQL has a fixed structure: SELECT $COLUMN1 [$AGG] FROM TABLE WHERE $COLUMN2 EQUALS $VALUE1 [AND $COLUMN3 EQUALS $VALUE2], where there are one column or one aggregator in the SELECT clause and there are 0-4 conditions in the WHERE clause. Although there is no JOIN clause in the SQL, the task is still challenging as the baseline achieves.

Figure 1: An example of the WikiSQL task.

3 Model

We present the overall solution for this problem in Fig. 2. Before we describe the main SQL generation part of our model, we first describe the bi-directional attention mechanism [22] for two sequences.

Figure 2: The overall problem and solution description.

The reason and inspiration to use bi-directional attention is from the machine comprehension task. In the SQuAD [23] machine comprehension task, we input the paragraph and question to the model and find the answer string in the paragraph. And in the SQL generation task, we input the question and columns and find the answer column in the columns. The two tasks are very similar in this perspective.

3.1 Bi-attention

Bi-attention is an extended form of attention mechanism. The attention mechanism is the information of the most relevant words of second sequence to each word of first sequence. The bi-attention also computes the information signifies which words of first sequence have the closest similarity to one of the words of second sequence.

3.1.1 Forward Attention

Suppose we have two sequence of vector representation, and , which dimension is and , where is the features dimension size. Then we compute the co-attention matrix :


where is a trainable matrix. Eq. 1 contains the similarities information of each token between the two sequences. Then we apply softmax operation to the second dimension of the matrix :


Then we reshape to and reshape to and apply element-wise multiplication to get :


Note that the operation contains the broadcasting mechanism of NumPy. Then we reduce the sum of the second dimension of to get the representation of forward attention information :


3.1.2 Backward Attention

Suppose we already have the co-attention matrix in Eq. 1. Then we reduce the max value of the first dimension of :


where Then we apply softmax to :


Then we reshape to and apply element-wise multiplication with broadcasting mechanism:


where Then we reduce the sum of the first dimension of to get :


Then we compute the element-wise multiplication to get the representation of backward attention information . Note that the dimension of back-ward attention representation and forward attention representation are equal and are the same as the sequence dimension. In the next section we use the bi-attention mechanism for several components of our model.

3.2 Our Model

In this section, we present our model to tackle the WikiSQL task. As shown in Fig. 3, our model contains four modules:

Figure 3: Overall architecture of our model.

(1) The character-level and word-level embedding layer to map each character and word to vector space. The embedding layer is shared by the next three modules. (2) The COLUMN-SELECT module which predict the column of SELECT clause. (3) The AGGREGATOR-SELECT module which predict the aggregator of SELECT clause. (4) The WHERE module which predict the conditions of WHERE clause.

The detailed description of our model is provided as follows.

3.2.1 Character-level embedding and word-level embedding

We use the character-level GloVe [24] pre-trained 300 dimension to initialize the character-level embedding , where is the word number and is the character number of each word and is 300. We leverage convolutional neural networks to get the next representation of

. We use three convolution kernels, which sizes are height 1 * width 5, height 1 * width 4, height 1 * width 3. The convolution layer and max-pooling layer are 1 as 

[25] did. The input channel is 100 and output channel is 100 so the last dimension of 3 convolution results can concatenate to 300. After the max pooling and concatenation, the dimension of the final result is , which dimension is the same as the word embedding dimension.

We use the word-level GloVe pre-trained with 300 size to initialize the word-level embedding . As for the words which are not in the GloVe, we initialize them to 0. The experiment shows that if we initialize the words which are not in GloVe to a random value and make them trainable, the result decreases.

As one column contains several words, we encode the words of one column into one vector representation by running after a LSTM [26]. We take the last state of the LSTM for the representation of one column and consider it as one item of columns, which is the same as one word in the question.

3.2.2 COLUMN-SELECT module

In this sub-task, the inputs are questions and column names, the outputs are one column in the column names. So we need to capture the attention info of questions and column names and then leverage the attention info to output a prediction over the column names. We did it as follows.

Each token is represented as a one-hot vector and fed into a word embedding vector before feeding them into the bi-directional LSTM. We have the question embedding and the column embedding , where is the max word number of questions and is the columns number of a table. The embedding and can be computed as the hidden states of a bi-directional LSTM respectively and get the bi-LSTM encoded representation and , where /2 is the hidden size of LSTM.

Then we apply bi-directional attention to and according to Eq. 1 to Eq. 8, where is the first sequence and is the second sequence , to get the bi-attention info and . Then we concatenate the last dimension of forward attention info and the backward attention info to and apply the operations below to get the final prediction for column in the SELECT clause:


where , and are all trainable weights and

is the probability distribution over all columns of one table.


There are 5 types of aggregation keywords in SQL: ’MAX’, ’MIN’, ’COUNT’, ’SUM’, ’AVG’. The experiment of SQLNet shows that the column name input do not impact the prediction result. So we only need to input the question and predict the class of aggregation keywords. So we consider this sub-task as a text classification problem. Then we have the question embedding and the Bi-LSTM encoded representation . Then we compute the final prediction for aggregator :


where and are all trainable weights and sum apply to the first dimension and is the probability distribution over 6 choices of SQL aggregators.

3.2.4 WHERE module

The WHERE clause is the most challenging part. The order of conditions does not matter, so we predict the probability of column slots and choose the top columns as a set. We predict the number of conditions and the column slots first. Then we leverage the column predictions to choose from the columns candidates. Then we use the chosen columns as embedding input to predict the operator slots and value slots for each column. We describe them below.

(1) Condition number

Suppose the number of conditions of WikiSQL is ranging from 0 to and we consider it as a (+1)-class classification problem. We compute the bi-LSTM hidden states of the question embedding and compute the number of conditions as below:


where and are all trainable weights.

(2) Column slots

In this sub-task, taking questions and column names as input, we need to predict the top column names. So we leverage the bi-attention info of the questions and column names to output the probability over the column names. We compute bi-attention information of the question hidden states and column hidden states . Then we compute the bi-attention information according to Eq. 1 to Eq. 8 and compute final column prediction , which is the same computation as COLUMN-SELECT module. We choose the top scores of as the prediction of

columns and pad 0 to

columns. We leverage the chosen and padded LSTM representation for the predictions of operator slots and value slots.

(3) Operator slots

There are 3 type of operator keywords in SQL: ’GT’, ’LT’, ’EQL’, which indicates ’greater than’, ’less than’, ’equal’ separately. We start from the hidden states and . And we use the result of column name slots predictions to choose the top column from and get . Then we apply the bi-attention of Eq. 1 to Eq. 8 to get the final attention representation for and , which is the concatenation of the last dimension of the two sequence representation. Then the computation is to get predictions for each condition:


where , and are all trainable weights and is the probability distribution over 4 choices of condition operator for each column.

(4) Value slots

As one value slot must corresponding to one column slot and we need to leverage the predicted columns info, so we employ the sequence-to-sequence structure to generate the values by taking the predicted columns info as input. The structure is well-developed: suppose we have an input sequence, and we employ an encoder to encode the input sequence into a vector. Then we employ a decoder to decode the output sequence from the vector.

We employ bi-LSTM to be the encoder which take the question embedding and predicted column info as input and the encoder’s output is . At decoder phase we need to predict the value which is a sub-string of the question, so we use pointer network [7] to point to the sub-string of question. The LSTM decoder of pointer network is unidirec-tional and 2 layers. In training, the LSTM decoder takes the ground truth as input and outputs the , where

is the max word number and is one-hot encoding. Then

and participate the next computation:


Where the inputs , and are expanded to the same dimension and , , and are all separated trainable weights. The output of the pointer network is , where is the question length. In engineering we flat the specific dimension for the computation. For example, suppose we have batch size dimension and conditions as the second dimension, then we flat the dimension to * as the first dimension. Note that we generate the condition values for each of the conditions. The END token also appears in the question and the model stops generating for this slot when the END token is predicted. We prepare the exact ground truth for each sub-module of WHERE module and give each sub-module of WHERE module a separated loss.

3.2.5 Loss function

We use the cross-entropy loss for the prediction of COLUMN-SELECT module and AGGREGATOR-SELECT module. As for the WHERE module, we also use cross-entropy loss for the value slots and operator slots.

As the prediction for the columns in the WHERE module is a target set, we need to penalize the predicted columns that are not in the ground truth. So we design the loss function for the prediction of columns set:


where we choose =3 and =1 if the ground truth contains the -column, =0 if the ground truth does not contain the -column. The final objective function is:


where , , are the loss of AGGREGATOR-SELECT, COLUMN-SELECT and WHERE module separately.

4 Experiments

In this section, we present more details of the model and the evaluation on the dataset. We also analyze the evaluation result.

4.1 Experimental Setting

We tokenize the sentences using Stanford CoreNLP [27]. The LSTM contains 2 layers and the size of LSTM hidden states h is 50 and the output size of bi-LSTM is 100. The dropout [28] for LSTM cell is 0.3. We use different LSTM weights for predicting different slots. Each LSTM which encodes the embedding is an independent weight. Although we do not share the bi-LSTM weight, we find that sharing the same word em-bedding vector is better. Therefore, different components in our model only share the word embedding. We use the Adam optimizer [29]

with learning rate 0.001 and 0 weight decay to minimize the loss of Eq. 15. We train the model for 100 epochs with fixed word embedding and trainable character embedding. Then we use the pre-trained 100 epoch model to train the next 100 epoch with all traina-ble embeddings. The character-level embedding are all trainable in 0 to 200 epoch. The batch size is 64. We ran-domly re-shuffle the training data in each epoch. In addition, our final model is chosen as the models that perform the best on development set in each part in the process of training. We implement all models using PyTorch 


4.2 Evaluation

We evaluate our model on the WikiSQL dataset. The decomposition results are presented in Tab. 1 and the overall results are presented in Tab. 2. We display the separated results of each module and the query-match accuracy which compare whether two SQL queries match exactly. From the evaluation result we find that bi-attention mechanisms mainly improve the WHERE clause result and character-level embedding mainly improve the COLUMN-SELECT clause. The execution result is higher because different SQL may obtains the same result. For example, the two SQL queries SELECT COUNT (player) WHERE No. = 23 and SELECT COUNT (No.) WHERE No. = 23 produce the same result in the table of Fig. 1.

dev test
Seq2SQL 90.0 89.6 62.1 90.1 88.9 60.2
SQLNet 90.1 91.5 74.1 90.3 90.9 71.9
Our Model 90.1 91.1 74.6 90.3 90.8 72.8
+char-emb 90.1 92.5 74.7 90.3 91.9 72.8
Table 1: Our baselines are Seq2SQL [8] and SQLNet [10]. The third row is our model with bi-attention and +char-emb means our model with CNN-based character-level embedding. and indicate the accuracy on the aggregator and column prediction accuracy on the SELECT clause, and indicates the accuracy to generate the WHERE clause.
dev test
Result match Query string match Result match Query string match
Seq2SQL 62.1 53.5 60.4 51.6
SQLNet 69.8 63.2 68.0 61.3
Our Model 70.3 63.5 68.2 61.5
+char-emb 71.1 64.1 69.0 62.5
Table 2: Overall result on the WikiSQL task. The ”result match” indicates the execution of database accuracy and the ”query string match” is to compare whether predicted SQL and ground truth SQL match exactly.

4.3 Analysis

The improvement of COLUMN-SELECT clause which is attributed by CNN-based character-level embedding is around 2, as the baseline result is already 90. We think it is because with the help of the character-level embedding, the model can be more robust to the minor difference of a word between training data and test data. The improvement of attention is 2.5 and the improvement of the bi-attention mechanisms is 3 to 3.5. The improvement from attention to bi-attention is 0.5 to 1

. We also observe that if we initialize the words which are not in the GloVe the random initialization and train the embedding, the result does not improve. The reason is that we do not add the mask technique which set the rare words to a minimal value in the model in order that the rare words do not participate in the activation function such as sigmoid. We consider the mask technique as a future work.

5 Conclusion

In this paper, based on the structure of SQL and the observation that a sequence-to-sequence model suffer from the ”order-matters” problem, we design specific deep neural network for each sub-string of SQL. In the WHERE prediction module, we choose the top probabilities of the column candidates as the chosen set for the prediction of conditions. We apply the bi-directional attention mechanism and the CNN-based character-level embedding to improve the result. The experimental evaluations show that our model achieves the state-of-the-art results in the WikiSQL dataset.

We observe that the accuracy is around 90

on the COLUMN-SELECT clause prediction and AGGREGATOR-SELECT clause prediction because the number of candidate column in the SELECT clause is limited to one. The task will be more challenging if the SQL extends to more than one column candidates and more complex cases like ORDER-BY, GROUP-BY or even JOIN. And the technique of NL2SQL can be applied to Knowledge Graph query or other semantic parsing tasks. There will be a lot of work to research.