1 Introduction
In recent years, with the rapid development of deep learning techniques [1], the problem of generating SQL from natural language has attracted considerable interest. We refer to this problem as the natural-language-to-SQL (NL2SQL) problem. Relational databases store structured data containing large numbers of entities and numbers. These entities and numbers enlarge the word vocabulary that deep learning models for natural language processing must handle, and a larger vocabulary makes it harder to find the exact data for a natural language query. For example, in the question-answering (QA) setting, a matching network [2] is used to select the best answer for a given question; a larger number of entities leads to a larger number of answer classes and decreases the accuracy of such models. Generating SQL from natural language is a good way to handle this application, as it leverages the strengths of the relational database itself.
The study of translating natural language into SQL queries has a long history, and recent works use deep learning as the main technique. [3] employs an improved encoder-decoder framework based on neural machine translation [4, 5] to solve this problem. [6] uses an augmented pointer network [7], an attentional sequence-to-sequence model serving as a neural semantic parser, and achieves state-of-the-art results on a variety of semantic parsing datasets. Seq2SQL [8] breaks the task down into several sub-modules, each generating a part of the SQL, and incorporates execution rewards through reinforcement learning. However, Seq2SQL suffers from the "order-matters" problem: the order of the two conditions in the WHERE clause does not affect the execution result of the query, yet it is well known that order affects the performance of a sequence-to-sequence model [9]. SQLNet [10] introduces attention mechanisms [11] into the model of [8] and solves the "order-matters" problem by proposing a sequence-to-set technique for the WHERE clause of the SQL.
The NL2SQL problem can be considered a special instance of the more general semantic parsing problem. Many works consider parsing natural language into a logical form [12, 13, 14, 15, 16, 17, 18, 19], which has wide applications in question answering and dialog systems. Datasets such as GeoQuery [20] and ATIS [21] target this setting.
Our main contributions in this work are threefold. First, we apply bidirectional attention to capture the relevant information between the two input sequences for prediction. Second, we leverage character-level embeddings and convolutional neural networks (CNNs) to map each word to a vector space as an additional embedding input. Third, our model achieves state-of-the-art results on the WikiSQL dataset. The code is available.¹
¹https://github.com/guotong1988/NL2SQL

2 Task Description
In the NL2SQL task, given a question and a database table, the machine needs to generate a SQL query over the table and find the answer to the question. The question is described as a sequence of word tokens $Q = \{w_1, w_2, \dots, w_n\}$, where $n$ is the number of words in the question, and the table is described as a sequence of columns $C = \{c_1, c_2, \dots, c_m\}$, where $m$ is the number of columns in the table. The table also has a number of data rows which contain the answer to the question.
We now describe the WikiSQL dataset [8], a dataset of 80,654 hand-annotated examples of questions and SQL queries distributed across 24,241 tables from Wikipedia. We present an example in Fig. 1. The task has several properties:
(1) Our model synthesizes a SQL query under the condition that the table on which the SQL will be executed is given. In other words, our model need not predict which table to query among all tables.
(2) The SQL has a fixed structure: SELECT $COLUMN1 [$AGG] FROM TABLE WHERE $COLUMN2 EQUALS $VALUE1 [AND $COLUMN3 EQUALS $VALUE2], where the SELECT clause contains one column and optionally one aggregator, and the WHERE clause contains 0–4 conditions. Although there is no JOIN clause in the SQL, the task is still challenging, as the baseline results show.
3 Model
We present the overall solution for this problem in Fig. 2. Before describing the main SQL generation part of our model, we first describe the bidirectional attention mechanism [22] for two sequences.
The inspiration for using bidirectional attention comes from the machine comprehension task. In the SQuAD [23] machine comprehension task, the model takes a paragraph and a question as input and finds the answer string within the paragraph. In the SQL generation task, we input the question and the columns and find the answer column among the columns. From this perspective, the two tasks are very similar.
3.1 Biattention
Bi-attention is an extended form of the attention mechanism. Standard attention captures, for each word of the first sequence, the information of the most relevant words of the second sequence. Bi-attention additionally computes which words of the first sequence have the closest similarity to one of the words of the second sequence.
3.1.1 Forward Attention
Suppose we have two sequences of vector representations, $U \in \mathbb{R}^{a \times d}$ and $V \in \mathbb{R}^{b \times d}$, where $d$ is the feature dimension size. Then we compute the co-attention matrix $M \in \mathbb{R}^{a \times b}$:

$M = U W V^{\top}$    (1)

where $W \in \mathbb{R}^{d \times d}$ is a trainable matrix. Eq. 1 contains the similarity information of each token pair between the two sequences. Then we apply a softmax operation over the second dimension of the matrix $M$:

$M_1 = \mathrm{softmax}(M)$ (over the second dimension)    (2)

Then we reshape $M_1$ to $\mathbb{R}^{a \times b \times 1}$ and reshape $V$ to $\mathbb{R}^{1 \times b \times d}$ and apply element-wise multiplication to get $M_2 \in \mathbb{R}^{a \times b \times d}$:

$M_2 = M_1 \circ V$    (3)

Note that this operation uses the broadcasting mechanism of NumPy. Then we sum over the second dimension of $M_2$ to get the representation of the forward attention information $U_{fw} \in \mathbb{R}^{a \times d}$:

$U_{fw} = \sum_{j=1}^{b} (M_2)_{:,j,:}$    (4)
3.1.2 Backward Attention
Suppose we already have the co-attention matrix $M$ from Eq. 1. Then we take the max value over the first dimension of $M$:

$m = \max_{i} M_{i,:}$    (5)

where $m \in \mathbb{R}^{1 \times b}$. Then we apply softmax to $m$:

$m_1 = \mathrm{softmax}(m)$    (6)

Then we reshape $m_1$ to $\mathbb{R}^{b \times 1}$ and apply element-wise multiplication with the broadcasting mechanism:

$M_3 = m_1 \circ V$    (7)

where $M_3 \in \mathbb{R}^{b \times d}$. Then we sum over the first dimension of $M_3$ to get $c \in \mathbb{R}^{1 \times d}$:

$c = \sum_{j=1}^{b} (M_3)_{j,:}$    (8)

Then we compute the element-wise multiplication $U_{bw} = U \circ c$, broadcasting $c$ over the first dimension, to get the representation of the backward attention information $U_{bw} \in \mathbb{R}^{a \times d}$. Note that the dimensions of the backward attention representation and the forward attention representation are equal, and both match the first-sequence dimension. In the following sections we use this bi-attention mechanism in several components of our model.
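To make the two attention directions concrete, the following NumPy sketch implements Eq. 1 to Eq. 8 under our reading of the text; the function and symbol names (`U`, `V`, `W`) are our own illustration, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(U, V, W):
    """Forward and backward attention between U (a x d) and V (b x d).

    Returns two (a x d) arrays: forward and backward attention info."""
    M = U @ W @ V.T                      # co-attention matrix (a x b), Eq. 1
    # --- forward attention ---
    M1 = softmax(M, axis=1)              # softmax over second dimension, Eq. 2
    M2 = M1[:, :, None] * V[None, :, :]  # broadcast product (a x b x d), Eq. 3
    fwd = M2.sum(axis=1)                 # reduce over second dimension, Eq. 4
    # --- backward attention ---
    m = M.max(axis=0)                    # max over first dimension (b,), Eq. 5
    m1 = softmax(m)                      # Eq. 6
    c = (m1[:, None] * V).sum(axis=0)    # weighted sum of V (d,), Eq. 7-8
    bwd = U * c[None, :]                 # element-wise product (a x d)
    return fwd, bwd
```

Both outputs share the first-sequence length, which is what lets the model concatenate them along the feature dimension later.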
3.2 Our Model
In this section, we present our model to tackle the WikiSQL task. As shown in Fig. 3, our model contains four modules:
(1) The character-level and word-level embedding layer, which maps each character and word to a vector space. The embedding layer is shared by the following three modules.
(2) The COLUMN-SELECT module, which predicts the column of the SELECT clause.
(3) The AGGREGATOR-SELECT module, which predicts the aggregator of the SELECT clause.
(4) The WHERE module, which predicts the conditions of the WHERE clause.
The detailed description of our model is provided as follows.
3.2.1 Character-level embedding and word-level embedding
We use 300-dimensional pretrained character-level GloVe [24] vectors to initialize the character-level embedding $E_{char} \in \mathbb{R}^{n \times k \times 300}$, where $n$ is the number of words and $k$ is the number of characters in each word. We leverage convolutional neural networks to get the next representation of $E_{char}$. We use three convolution kernels of sizes 1×5, 1×4 and 1×3 (height × width), with one convolution layer and one max-pooling layer, as in [25]. The input channel size is 100 and the output channel size is 100, so the last dimensions of the 3 convolution results can be concatenated to 300. After max-pooling and concatenation, the final result has dimension $\mathbb{R}^{n \times 300}$, the same as the word embedding dimension.

We use 300-dimensional pretrained word-level GloVe vectors to initialize the word-level embedding. Words that are not in GloVe are initialized to 0. Our experiments show that initializing the words that are not in GloVe to random values and making them trainable decreases the result.
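As a sketch of this character-level encoder, the code below slides three window sizes over a word's character embeddings and max-pools each feature map over character positions. The kernels here are plain dense projections standing in for convolution weights, and all names and shapes are our own assumptions, not the paper's code:

```python
import numpy as np

def char_cnn_word_vectors(E_char, kernels):
    """E_char: (n_words, n_chars, d_char) character embeddings.
    kernels: dict mapping window width -> weight of shape (width * d_char, 100).
    Returns (n_words, 300): three 100-channel convolutions (widths 5, 4, 3),
    each max-pooled over character positions, then concatenated."""
    n_words, n_chars, d_char = E_char.shape
    pooled = []
    for width, K in sorted(kernels.items(), reverse=True):
        # every window of `width` consecutive characters, flattened
        windows = np.stack(
            [E_char[:, i:i + width, :].reshape(n_words, -1)
             for i in range(n_chars - width + 1)], axis=1)
        feat = np.maximum(windows @ K, 0.0)  # ReLU, (n_words, positions, 100)
        pooled.append(feat.max(axis=1))      # max-pool over positions
    return np.concatenate(pooled, axis=-1)   # (n_words, 300)
```

The concatenated 300-dimensional output matches the word-level GloVe dimension, so the two embeddings can be combined as the model's input.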
As one column name contains several words, we encode the words of one column into one vector representation by running an LSTM [26] over them. We take the last state of the LSTM as the representation of the column and treat it as one item of the column sequence, analogous to one word in the question.
3.2.2 COLUMN-SELECT module
In this subtask, the inputs are the question and the column names, and the output is one column among the column names. So we need to capture the attention information between the question and the column names and then leverage it to output a prediction over the column names. We proceed as follows.
Each token is represented as a one-hot vector and mapped to a word embedding vector before being fed into a bidirectional LSTM. We have the question embedding $E_Q \in \mathbb{R}^{q \times 300}$ and the column embedding $E_C \in \mathbb{R}^{c \times 300}$, where $q$ is the max number of words in the question and $c$ is the number of columns in the table. The embeddings $E_Q$ and $E_C$ are encoded by bidirectional LSTMs to obtain the representations $H_Q \in \mathbb{R}^{q \times d}$ and $H_C \in \mathbb{R}^{c \times d}$, where $d/2$ is the hidden size of each LSTM.
Then we apply bidirectional attention to the column representation $H_C$ and the question representation $H_Q$ according to Eq. 1 to Eq. 8, with $H_C$ as the first sequence and $H_Q$ as the second sequence, to get the forward and backward attention information $H_{fw}$ and $H_{bw}$. Then we concatenate the forward attention information and the backward attention information along the last dimension to get $B = [H_{fw}; H_{bw}] \in \mathbb{R}^{c \times 2d}$ and apply the operations below to get the final prediction for the column of the SELECT clause:

$P_{selcol} = \mathrm{softmax}\big(W_2 \tanh(W_1 B_i + b)\big)$    (9)

where $W_1 \in \mathbb{R}^{d \times 2d}$, $b \in \mathbb{R}^{d}$ and $W_2 \in \mathbb{R}^{1 \times d}$ are all trainable weights, the score is computed for each column $i$, the softmax is taken over the columns, and $P_{selcol}$ is the probability distribution over all columns of one table.
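Since the exact form of the scoring equation was lost in extraction, the following is only a plausible sketch of this head: it concatenates the two attention representations per column and scores each column with a small tanh network. The weight shapes and names are our assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_column_probs(fwd, bwd, W1, b, w2):
    """fwd, bwd: (n_cols, d) forward/backward bi-attention info per column.
    W1: (2d, d), b: (d,), w2: (d,) trainable parameters.
    Returns a probability distribution over the table's columns."""
    B = np.concatenate([fwd, bwd], axis=-1)  # (n_cols, 2d)
    scores = np.tanh(B @ W1 + b) @ w2        # one scalar score per column
    return softmax(scores)                   # distribution over columns
```

At prediction time the column with the highest probability is emitted into the SELECT clause.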
3.2.3 AGGREGATOR-SELECT module
There are 5 types of aggregation keywords in SQL: MAX, MIN, COUNT, SUM and AVG. The experiments of SQLNet show that the column name input does not impact the prediction result, so we only input the question and predict the class of the aggregation keyword, treating this subtask as a text classification problem. Given the question embedding $E_Q$ and its bi-LSTM encoded representation $H_Q$, we compute the final prediction for the aggregator:

$P_{agg} = \mathrm{softmax}\big(W_2 \tanh\big(W_1 \sum_{i} (H_Q)_i\big)\big)$    (10)

where $W_1$ and $W_2$ are trainable weights, the sum is applied over the first (word) dimension, and $P_{agg}$ is the probability distribution over the 6 choices of SQL aggregator (the five keywords plus the empty aggregator).
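A minimal sketch of this classifier, treating the aggregator choice as a 6-way text classification over the summed question states; the weight shapes are assumptions since the original equation was lost:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregator_probs(H_q, W1, W2):
    """H_q: (n_words, d) bi-LSTM question states.
    W1: (d, d_hidden), W2: (d_hidden, 6) trainable parameters.
    Sum over the word dimension, then a two-layer scoring network
    yields a distribution over the 6 aggregator classes."""
    h = np.tanh(H_q.sum(axis=0) @ W1)  # pooled question representation
    return softmax(h @ W2)             # (6,) probabilities
```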
3.2.4 WHERE module
The WHERE clause is the most challenging part. The order of conditions does not matter, so we predict the probability for each column slot and choose the top columns as a set. We first predict the number of conditions and the column slots. Then we leverage the column predictions to choose from the column candidates. Then we use the chosen columns as embedding input to predict the operator slot and value slot for each column. We describe each step below.
(1) Condition number
Suppose the number of conditions in WikiSQL ranges from 0 to $K$; we treat this as a ($K$+1)-class classification problem. We compute the bi-LSTM hidden states $H_Q$ of the question embedding and predict the number of conditions as:

$P_{num} = \mathrm{softmax}\big(W_2 \tanh\big(W_1 \sum_{i} (H_Q)_i\big)\big)$    (11)

where $W_1$ and $W_2$ are trainable weights.
(2) Column slots
In this subtask, taking the question and the column names as input, we need to predict the top $K$ column names. So we leverage the bi-attention information of the question and the column names to output a probability distribution over the column names. We compute the bi-attention information of the question hidden states $H_Q$ and the column hidden states $H_C$ according to Eq. 1 to Eq. 8, and compute the final column prediction $P_{wherecol}$ with the same computation as the COLUMN-SELECT module. We choose the columns with the top $K$ scores of $P_{wherecol}$ as the predicted columns and pad with zeros up to $K$ columns. We leverage the chosen and padded LSTM representations for the prediction of the operator slots and value slots.

(3) Operator slots
There are 3 types of operator keywords in SQL: GT, LT and EQL, which indicate 'greater than', 'less than' and 'equal' respectively. We start from the hidden states $H_Q$ and $H_C$, and use the column slot predictions to choose the top $K$ columns from $H_C$, obtaining $H_{topK}$. Then we apply the bi-attention of Eq. 1 to Eq. 8 to $H_{topK}$ and $H_Q$, and concatenate the two attention representations along the last dimension to get the final attention representation $B_{op}$. Then we compute the prediction for the operator of each condition:

$P_{op} = \mathrm{softmax}\big(W_2 \tanh(W_1 B_{op} + b)\big)$    (12)

where $W_1$, $b$ and $W_2$ are all trainable weights and $P_{op}$ is the probability distribution over the 4 choices of condition operator for each column.
(4) Value slots
As each value slot must correspond to one column slot and we need to leverage the predicted column information, we employ a sequence-to-sequence structure to generate the values, taking the predicted column information as input. The structure is well developed: given an input sequence, an encoder encodes the input sequence into a vector, and a decoder decodes the output sequence from that vector.
We employ a bi-LSTM encoder which takes the question embedding $E_Q$ and the predicted column information as input; the encoder's output is $H_{enc}$. In the decoding phase we need to predict a value which is a substring of the question, so we use a pointer network [7] to point to the substring of the question. The LSTM decoder of the pointer network is unidirectional with 2 layers. In training, the LSTM decoder takes the one-hot encoded ground truth as input and outputs $G$, whose first dimension is the max output length. Then $G$ and $H_{enc}$ participate in the next computation:

$P_{val} = \mathrm{softmax}\big(v^{\top} \tanh(W_1 H_{enc} + W_2 H_{col} + W_3 G)\big)$    (13)

where the inputs $H_{enc}$, $H_{col}$ (the predicted column information) and $G$ are expanded to the same dimension, and $W_1$, $W_2$, $W_3$ and $v$ are all separate trainable weights. The output of the pointer network is $P_{val} \in \mathbb{R}^{t \times q}$, where $t$ is the max output length and $q$ is the question length. In engineering terms, we flatten the extra dimensions for this computation: for example, with a batch dimension and the conditions as the second dimension, we flatten them into a single first dimension of size batch × conditions. Note that we generate a condition value for each of the conditions. The END token also appears in the question, and the model stops generating for a slot when the END token is predicted. We prepare the exact ground truth for each sub-module of the WHERE module and give each sub-module a separate loss.
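One decoding step of such a pointer network can be sketched as follows: the encoder states, the chosen-column representation, and the decoder state are projected to a common dimension, combined, and scored against every question position. The weight names and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_step(H_enc, h_col, g_t, W1, W2, W3, v):
    """Score every question position for one output step.

    H_enc: (q, d) encoder states; h_col: (d,) predicted-column info;
    g_t: (d,) decoder state at step t; W1, W2, W3: (d, d); v: (d,).
    Returns a distribution over the q question tokens; generation for
    the slot stops once the END token position is predicted."""
    s = np.tanh(H_enc @ W1 + h_col @ W2 + g_t @ W3) @ v  # (q,) scores
    return softmax(s)
```

Because the output distribution ranges over question positions rather than a fixed vocabulary, the predicted value is guaranteed to be a substring of the question.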
3.2.5 Loss function
We use the cross-entropy loss for the predictions of the COLUMN-SELECT module and the AGGREGATOR-SELECT module. For the WHERE module, we also use cross-entropy loss for the value slots and operator slots.
As the prediction for the columns in the WHERE module is a target set, we need to penalize the predicted columns that are not in the ground truth. So we design the following loss function for the prediction of the column set:

$L_{wherecol} = -\sum_{j} \big(\gamma\, y_j \log p_j + (1 - y_j) \log(1 - p_j)\big)$    (14)

where we choose $\gamma = 3$; $y_j = 1$ if the ground truth contains column $j$ and $y_j = 0$ if it does not; and $p_j$ is the predicted probability for column $j$. The final objective function is:

$L = L_{agg} + L_{col} + L_{whe}$    (15)

where $L_{agg}$, $L_{col}$ and $L_{whe}$ are the losses of the AGGREGATOR-SELECT, COLUMN-SELECT and WHERE modules respectively.
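A sketch of the weighted column-set loss (the symbol names are ours): columns that appear in the ground-truth WHERE set are up-weighted by $\gamma = 3$ inside an otherwise standard binary cross-entropy over the candidates:

```python
import numpy as np

def where_column_loss(p, y, gamma=3.0):
    """p: predicted probability per candidate column.
    y: 1.0 if the ground-truth WHERE clause contains that column, else 0.0.
    gamma up-weights the loss on true columns, penalizing a missed
    ground-truth column more than a spurious prediction."""
    eps = 1e-9  # avoid log(0)
    return float(-np.sum(gamma * y * np.log(p + eps)
                         + (1.0 - y) * np.log(1.0 - p + eps)))
```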
4 Experiments
In this section, we present more details of the model and its evaluation on the dataset. We also analyze the evaluation results.
4.1 Experimental Setting
We tokenize the sentences using Stanford CoreNLP [27]. The LSTMs contain 2 layers, the size of the LSTM hidden state h is 50, and the output size of the bi-LSTM is 100. The dropout [28] rate for the LSTM cells is 0.3. We use different LSTM weights for predicting different slots: each LSTM that encodes an embedding has independent weights. Although we do not share bi-LSTM weights, we find that sharing the same word embedding is better; therefore, the different components of our model share only the word embedding. We use the Adam optimizer [29] with learning rate 0.001 and 0 weight decay to minimize the loss of Eq. 15. We train the model for 100 epochs with fixed word embeddings and trainable character embeddings, then use the pretrained 100-epoch model to train for another 100 epochs with all embeddings trainable; the character-level embeddings are trainable throughout epochs 0 to 200. The batch size is 64. We randomly shuffle the training data in each epoch. Our final model is chosen as the model that performs best on the development set for each part during training. We implement all models using PyTorch [30].

4.2 Evaluation
We evaluate our model on the WikiSQL dataset. The breakdown results are presented in Tab. 1 and the overall results in Tab. 2. We display the separate results of each module and the query-match accuracy, which compares whether two SQL queries match exactly. From the evaluation results we find that the bi-attention mechanism mainly improves the WHERE clause result and the character-level embedding mainly improves the COLUMN-SELECT result. The execution accuracy is higher than the query-string match accuracy because different SQL queries may obtain the same result. For example, the two queries SELECT COUNT (player) WHERE No. = 23 and SELECT COUNT (No.) WHERE No. = 23 produce the same result on the table of Fig. 1.
Tab. 1: Breakdown results on the dev and test sets (Acc_agg / Acc_sel / Acc_where).

            | dev                           | test
            | Acc_agg  Acc_sel  Acc_where   | Acc_agg  Acc_sel  Acc_where
Seq2SQL     | 90.0     89.6     62.1        | 90.1     88.9     60.2
SQLNet      | 90.1     91.5     74.1        | 90.3     90.9     71.9
Our Model   | 90.1     91.1     74.6        | 90.3     90.8     72.8
+char-emb   | 90.1     92.5     74.7        | 90.3     91.9     72.8
Tab. 2: Overall results on the dev and test sets.

            | dev                                | test
            | Result match  Query string match   | Result match  Query string match
Seq2SQL     | 62.1          53.5                 | 60.4          51.6
SQLNet      | 69.8          63.2                 | 68.0          61.3
Our Model   | 70.3          63.5                 | 68.2          61.5
+char-emb   | 71.1          64.1                 | 69.0          62.5
4.3 Analysis
The improvement in the COLUMN-SELECT clause attributable to the CNN-based character-level embedding is around 2%, as the baseline result is already above 90%. We think this is because, with the help of the character-level embedding, the model is more robust to minor differences of a word between training data and test data. The improvement from attention is 2.5%, and the improvement from the bi-attention mechanism is 3% to 3.5%, so the gain from attention to bi-attention is 0.5% to 1%. We also observe that if we initialize the words that are not in GloVe with random values and train the embedding, the result does not improve. The reason is that we do not apply a mask technique which sets the rare words to a minimal value in the model, so that the rare words do not participate in activation functions such as the sigmoid. We consider the mask technique as future work.
5 Conclusion
In this paper, based on the structure of SQL and the observation that a sequence-to-sequence model suffers from the "order-matters" problem, we design a specific deep neural network for each sub-part of the SQL query. In the WHERE prediction module, we choose the column candidates with the top probabilities as the chosen set for the prediction of conditions. We apply the bidirectional attention mechanism and the CNN-based character-level embedding to improve the results. The experimental evaluation shows that our model achieves state-of-the-art results on the WikiSQL dataset.

We observe that the accuracy is around 90% on the COLUMN-SELECT clause and AGGREGATOR-SELECT clause predictions because the number of candidate columns in the SELECT clause is limited to one. The task will become more challenging if the SQL is extended to more than one column candidate and to more complex cases like ORDER BY, GROUP BY or even JOIN. The techniques of NL2SQL can also be applied to knowledge graph querying and other semantic parsing tasks. There remains a lot of work to research.
References

[1] A. Krizhevsky, I. Sutskever, and G. Hinton: ImageNet classification with deep convolutional neural networks. In NIPS (2012)
[2] Y. Wu, W. Wu, Z. Li and M. Zhou: Response Selection with Topic Clues for Retrieval-based Chatbots. arXiv preprint arXiv:1605.00090 (2016)
[3] Ruichu Cai, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang, Zijian Li: An Encoder-Decoder Framework Translating Natural Language to Database Queries. arXiv preprint arXiv:1711.06061, Nov 2017
[4] Casacuberta F, Vidal E.: Machine Translation using Neural Networks and Finite-State Models. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation (1997)
[5] Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
[6] Li Dong and Mirella Lapata: Language to logical form with neural attention. Meeting of the Association for Computational Linguistics, January 2016
[7] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly: Pointer networks. International Conference on Neural Information Processing Systems. MIT Press (2015)
[8] Victor Zhong, C. Xiong, and R. Socher: Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint arXiv:1709.00103, Nov 2017
[9] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur: Order matters: Sequence to sequence for sets. In ICLR (2016)
[10] Xiaojun Xu, Chang Liu and Dawn Song: SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv preprint arXiv:1711.04436, Nov 2017
[11] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio: Neural machine translation by jointly learning to align and translate. ICLR (2015)
[12] Luke S Zettlemoyer and Michael Collins: Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420 (2012)
[13] Yoav Artzi and Luke Zettlemoyer: Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 421–432. Association for Computational Linguistics (2011)
[14] Yoav Artzi and Luke Zettlemoyer: Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62 (2013)
[15] Qingqing Cai and Alexander Yates: Large-scale semantic parsing via schema matching and lexicon extension. In ACL (2013)
[16] Siva Reddy, Mirella Lapata, and Mark Steedman: Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392 (2014)
[17] Percy Liang, Michael I Jordan, and Dan Klein: Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446 (2011)
[18] Chris Quirk, Raymond J Mooney, and Michel Galley: Language to code: Learning semantic parsers for if-this-then-that recipes. In ACL (1), pp. 878–888 (2015)
[19] Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, and Mingcheng Chen: Latent attention for if-then program synthesis. arXiv preprint arXiv:1611.01867 (2016)
[20] Lappoon R. Tang and Raymond J. Mooney: Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of the 12th European Conference on Machine Learning, pp. 466–477, Freiburg, Germany (2001)
[21] P. J. Price: Evaluation of spoken language systems: the ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, pp. 91–95 (1990)
[22] M. Seo, A. Kembhavi, A. Farhadi, et al: Bidirectional Attention Flow for Machine Comprehension. In Proceedings of ICLR (2017)
[23] Rajpurkar P, Zhang J, Lopyrev K, et al: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
[24] Jeffrey Pennington, Richard Socher, and Christopher Manning: GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
[25] Yoon Kim: Convolutional neural networks for sentence classification. In EMNLP (2014)
[26] Sepp Hochreiter and Jürgen Schmidhuber: Long short-term memory. Neural Computation (1997)
[27] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky: The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60 (2014)
[28] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
[29] Diederik P. Kingma and Jimmy Ba: Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego (2015)
[30] PyTorch. URL http://pytorch.org/ (2017)