1 Introduction
The goal of the SQL-to-text task is to automatically generate human-like descriptions interpreting the meaning of a given structured query language (SQL) query (Figure 1 gives an example). This task is critical to natural language interfaces to databases, since it helps non-expert users understand the esoteric SQL queries that are used to retrieve answers through the question-answering process (Simitsis and Ioannidis, 2009) using various text embedding techniques (Kim, 2014; Arora et al., 2017; Wu et al., 2018a).
Earlier attempts at the SQL-to-text task are rule-based and template-based (Koutrika et al., 2010; Ngonga Ngomo et al., 2013). Despite requiring intensive human effort to design templates or rules, these approaches still tend to generate rigid and stylized language that lacks the naturalness of human language. To address this, Iyer et al. (2016) propose a sequence-to-sequence (Seq2Seq) network to model the SQL query and natural language jointly. However, since SQL is designed to express graph-structured query intent, a sequence encoder may need an elaborate design to fully capture the global structure information. Intuitively, graph encoding techniques based on deep neural networks (Kipf and Welling, 2016; Hamilton et al., 2017; Song et al., 2018) or on graph kernels (Vishwanathan et al., 2010; Wu et al., 2018b), whose goal is to learn node-level or graph-level representations for a given graph, are better suited to this problem.

In this paper, we first introduce a strategy to represent the SQL query as a directed graph (see Section 2) and then make full use of a novel graph-to-sequence (Graph2Seq) model (Xu et al., 2018) that encodes this graph-structured SQL query and decodes its interpretation (see Section 3). On the encoder side, we extend the graph encoding work of Hamilton et al. (2017) by encoding the edge direction information into the node embedding. Our encoder learns the representation of each node by aggregating information from its $K$-hop neighbors. Unlike Hamilton et al. (2017), which neglects the edge direction, we classify the neighbors of a node $v$ according to the edge direction into two classes: forward nodes (nodes that $v$ directs to) and backward nodes (nodes that direct to $v$). We apply two distinct aggregators to aggregate the information of these two types of nodes, resulting in two representations. The node embedding of $v$ is the concatenation of these two representations. Given the learned node embeddings, we further introduce a pooling-based and an aggregation-based method to generate the graph embedding.

On the decoder side, we develop an RNN-based decoder which takes the graph vector representation as the initial hidden state and generates the output sequence while employing an attention mechanism over all node embeddings. Experimental results show that our model achieves state-of-the-art performance on the WikiSQL and Stackoverflow datasets. Our code and data are available at https://github.com/IBM/SQL-to-Text.

2 Graph Representation of SQL Query
Representing the SQL query as a graph instead of a sequence better preserves the inherent structural information in the query. An example is illustrated in the blue dashed frame in Figure 2. One can see that representing these conditions as a graph instead of a sequence helps the model better learn the correlation between this graph pattern and the interpretation "…both X and Y higher than Z…". This observation motivates us to represent the SQL query as a graph. In particular, we use the following method to transform the SQL query to a graph (this method can easily be extended to cope with more general SQL queries that have complex syntax such as JOIN and ORDER BY):
SELECT Clause. For a SELECT clause such as "SELECT company", we first create a node assigned the text attribute select. This SELECT node connects with column nodes whose text attributes are the selected column names, such as company. For SQL queries that contain aggregation functions such as count or max, we add an aggregation node that is connected with the column nodes; its text attribute is the aggregation function name.
WHERE Clause. The WHERE clause usually contains more than one condition. For each condition, we use the same process as for the SELECT clause to create nodes. For example, in Figure 2, we create the node assets and an operator node for the first condition, and the node sales and an operator node for the second condition. We then integrate the constraint nodes that have the same text attribute (e.g., the shared value node in Figure 2). For a logical operator such as AND, OR, or NOT, we create a node that connects with all the column nodes that the operator works on. Finally, these logical operator nodes connect with the SELECT node.
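The construction rules above can be sketched in code. The following is a minimal, hypothetical illustration rather than the authors' implementation: the function name, the adjacency-list graph format, the node-naming scheme, and the edge directions are all our own assumptions.

```python
# Minimal sketch of the SQL-to-graph construction described above.
# Node names, edge directions, and the dict-based graph format are
# illustrative choices, not the paper's actual code.

def build_sql_graph(select_cols, conditions, logic_op="AND", agg=None):
    """Build a directed graph for a simple SQL query.

    select_cols: list of selected column names
    conditions:  list of (column, operator, value) triples
    logic_op:    logical operator joining the conditions
    agg:         optional aggregation function name (e.g. "count")
    """
    edges = []                    # (src, dst) directed edges
    nodes = {"select": "select"}  # node id -> text attribute

    # SELECT clause: the select node connects with column nodes,
    # optionally through an aggregation node.
    for col in select_cols:
        nodes[f"col:{col}"] = col
        if agg is not None:
            nodes[f"agg:{agg}"] = agg
            edges.append(("select", f"agg:{agg}"))
            edges.append((f"agg:{agg}", f"col:{col}"))
        else:
            edges.append(("select", f"col:{col}"))

    # WHERE clause: one operator node per condition, a logical operator
    # node linked to the select node, and constraint nodes integrated
    # (merged) when they share the same text attribute.
    op_node = f"logic:{logic_op}"
    nodes[op_node] = logic_op
    edges.append(("select", op_node))
    for i, (col, op, val) in enumerate(conditions):
        cond_col, cond_op = f"wcol:{col}", f"op{i}:{op}"
        nodes[cond_col], nodes[cond_op] = col, op
        val_node = f"val:{val}"   # same value -> same node (integration)
        nodes[val_node] = val
        edges.append((op_node, cond_col))
        edges.append((cond_col, cond_op))
        edges.append((cond_op, val_node))
    return nodes, edges

nodes, edges = build_sql_graph(
    ["company"],
    [("assets", ">", "val"), ("sales", ">", "val")],
)
```

Note that the two conditions share a single value node, mirroring the integration of constraint nodes with the same text attribute described above.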
3 Graph-to-Sequence Model
Based on the constructed graphs for the SQL queries, we make full use of a novel graph-to-sequence model (Xu et al., 2018), which consists of a graph encoder that learns the embedding for the graph-structured SQL query and a sequence decoder with an attention mechanism that generates sentences. Conceptually, the graph encoder generates the node embedding for each node by accumulating information from its $K$-hop neighbors, and produces a graph embedding for the entire graph by abstracting over all node embeddings. Our decoder takes the graph embedding as the initial hidden state and computes attention over all node embeddings on the encoder side to generate natural language interpretations.
Node Embedding.
Given the graph $G = (V, E)$, since the text attribute of a node $v$ may include a list of words, we first use a Long Short-Term Memory (LSTM) network to generate a feature vector $\mathbf{a}_v$ for each node $v$ from its text attribute. We use these feature vectors as initial node embeddings. Then, our model incorporates information from a node's neighbors within $K$ hops into its representation by repeating the following process $K$ times:

$\mathbf{h}^{0}_{\vdash v} = \mathbf{h}^{0}_{\dashv v} = \mathbf{a}_v$   (1)
$\mathbf{h}^{k}_{N_{\vdash}(v)} = M_{\vdash}(\{\mathbf{h}^{k-1}_{\vdash u}, \forall u \in N_{\vdash}(v)\})$   (2)
$\mathbf{h}^{k}_{\vdash v} = \sigma(\mathbf{W}^{k} \cdot \mathrm{CONCAT}(\mathbf{h}^{k-1}_{\vdash v}, \mathbf{h}^{k}_{N_{\vdash}(v)}))$   (3)
$\mathbf{h}^{k}_{N_{\dashv}(v)} = M_{\dashv}(\{\mathbf{h}^{k-1}_{\dashv u}, \forall u \in N_{\dashv}(v)\})$   (4)
$\mathbf{h}^{k}_{\dashv v} = \sigma(\mathbf{W}^{k} \cdot \mathrm{CONCAT}(\mathbf{h}^{k-1}_{\dashv v}, \mathbf{h}^{k}_{N_{\dashv}(v)}))$   (5)
where $k$ is the iteration index, $N_{\vdash}(v)$ and $N_{\dashv}(v)$ are the neighborhood functions ($N_{\vdash}(v)$ returns the nodes that $v$ directs to and $N_{\dashv}(v)$ returns the nodes that direct to $v$), $\mathbf{h}^{k}_{\vdash v}$ ($\mathbf{h}^{k}_{\dashv v}$) is node $v$'s forward (backward) representation, which aggregates the information of the nodes in $N_{\vdash}(v)$ ($N_{\dashv}(v)$), $M_{\vdash}$ and $M_{\dashv}$ are the forward and backward aggregator functions, $\mathbf{W}^{k}$ denotes the weight matrices, and $\sigma$ is a nonlinearity function.
For example, for node $v$, we first aggregate the forward representations of its immediate neighbors $\{\mathbf{h}^{k-1}_{\vdash u}, \forall u \in N_{\vdash}(v)\}$ into a single vector $\mathbf{h}^{k}_{N_{\vdash}(v)}$ (equation 2). Note that this aggregation step only uses the representations generated at the previous iteration, and the initial representation is $\mathbf{a}_v$. Then we concatenate $v$'s current forward representation $\mathbf{h}^{k-1}_{\vdash v}$ with the newly generated neighborhood vector $\mathbf{h}^{k}_{N_{\vdash}(v)}$. This concatenated vector is fed into a fully connected layer with nonlinear activation function $\sigma$, which updates the forward representation of $v$ to be used at the next iteration (equation 3). Next, we update the backward representation of $v$ in a similar fashion (equations 4-5). Finally, the concatenation of the forward and backward representations at the last iteration $K$ is used as the resulting representation of $v$. Since the neighbor information from different hops may have a different impact on the node embedding, we learn a distinct aggregator function at each step. This aggregator feeds each neighbor's vector to a fully connected neural network, and an element-wise max-pooling operation is applied to capture different aspects of the neighbor set.
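One hop of this bidirectional aggregation can be sketched in pure Python. This is an illustrative toy, not the paper's implementation: the max-pooling aggregator follows the description above, but the fixed weight matrix, the 2-dimensional features, and the zero-vector fallback for nodes without neighbors are placeholder assumptions (a real model learns distinct aggregators and weights per hop).

```python
# Sketch of one iteration (hop) of the bidirectional aggregation in
# equations (2)-(5). Fixed toy weights; real models learn them.

def elementwise_max(vectors):
    """Max-pooling aggregator over a set of neighbor vectors."""
    return [max(col) for col in zip(*vectors)]

def relu_linear(weights, vec):
    """sigma(W . vec) with a ReLU nonlinearity."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]

def aggregate_step(h_fwd, h_bwd, fwd_nbrs, bwd_nbrs, W):
    """One hop: update forward/backward representations of every node.

    h_fwd, h_bwd: node id -> current forward/backward representation
    fwd_nbrs(v):  nodes that v directs to; bwd_nbrs(v): nodes directing to v
    """
    new_fwd, new_bwd = {}, {}
    for v in h_fwd:
        # eq. (2)/(4): pool neighbor representations from the previous hop
        # (zero vector if a node has no neighbors in that direction)
        f_nb = [h_fwd[u] for u in fwd_nbrs(v)] or [[0.0] * len(h_fwd[v])]
        b_nb = [h_bwd[u] for u in bwd_nbrs(v)] or [[0.0] * len(h_bwd[v])]
        # eq. (3)/(5): concatenate own representation with the pooled
        # neighborhood vector, then apply the FC layer with ReLU
        new_fwd[v] = relu_linear(W, h_fwd[v] + elementwise_max(f_nb))
        new_bwd[v] = relu_linear(W, h_bwd[v] + elementwise_max(b_nb))
    return new_fwd, new_bwd

# Toy graph a -> b -> c with 2-dim features.
edges = {("a", "b"), ("b", "c")}
fwd = lambda v: [d for s, d in edges if s == v]
bwd = lambda v: [s for s, d in edges if d == v]
feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
W = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]  # concat(2+2) -> 2
h_fwd, h_bwd = aggregate_step(dict(feats), dict(feats), fwd, bwd, W)
```

Repeating `aggregate_step` $K$ times and concatenating `h_fwd[v] + h_bwd[v]` at the last hop yields the final node representation described above.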
Graph Embedding.
Most existing work on graph convolutional neural networks focuses on node embeddings rather than graph embeddings (GE), since the emphasis there is on node-wise classification tasks. However, a graph embedding that conveys the information of the entire graph is essential to the downstream decoder, and therefore crucial to our task. For this purpose, we propose two ways to generate graph embeddings, namely the pooling-based and node-based methods.
Pooling-based GE. This method feeds the obtained node embeddings into a fully connected neural network and applies an element-wise max-pooling operation over all node embeddings. In our experiments, we did not observe significant performance improvements using min-pooling or average-pooling.
Node-based GE. Following Scarselli et al. (2009), this method adds a super node $v_s$ that is connected to all other nodes by a special type of edge. The embedding of $v_s$, which is treated as the graph embedding, is produced using the node embedding generation algorithm described above.
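Of the two, the pooling-based method is simple enough to sketch directly. The snippet below is an illustrative toy with a placeholder projection matrix, not the paper's code:

```python
# Sketch of the pooling-based graph embedding: project each node
# embedding through a fully connected layer, then take an element-wise
# max over all nodes. The weight matrix here is an illustrative
# placeholder (identity); a real model learns it.

def pooling_graph_embedding(node_embeddings, W):
    """node_embeddings: node id -> vector; W: FC weight matrix (rows)."""
    projected = [
        [sum(w * x for w, x in zip(row, h)) for row in W]
        for h in node_embeddings.values()
    ]
    # element-wise max-pooling over all projected node embeddings
    return [max(col) for col in zip(*projected)]

embs = {"select": [1.0, 0.0], "company": [0.0, 2.0], "and": [0.5, 0.5]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection for illustration
g = pooling_graph_embedding(embs, W)
```

Swapping `max` for `min` or a mean gives the min- and average-pooling variants mentioned above.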
Sequence Decoding.
The decoder is an RNN which predicts the next token $y_t$ given all the previous words $y_{<t} = y_1, \ldots, y_{t-1}$, the RNN hidden state $s_t$ for time step $t$, and a context vector $c_t$ that captures the attention over the encoder side. In particular, the context vector $c_t$ depends on the set of node representations $(\mathbf{h}_1, \ldots, \mathbf{h}_{|V|})$ to which the encoder maps the input graph, and is dynamically computed using an attention mechanism over the node representations. Our model is jointly trained to maximize the conditional log-probability of the correct description given a source graph with respect to the parameters $\theta$ of the model:

$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{L_n} \log p(y^{n}_{t} \mid y^{n}_{<t}, X^{n}; \theta)$

where $(X^{n}, Y^{n})$ is the $n$-th SQL-interpretation pair in the training set and $L_n$ is the length of the $n$-th target sentence $Y^{n}$. In the inference phase, we use the beam search algorithm with beam size 5.
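The attention computation for the context vector can be sketched as follows. Since the text above does not pin down the exact scoring function, this toy uses a dot-product score, which is one common choice (an additive Bahdanau-style score would also fit); treat it as an assumption:

```python
# Sketch of the attention mechanism over node embeddings: dot-product
# scores against the decoder state, softmax normalization, and a
# weighted sum as the context vector. The scoring function is an
# assumption; the paper does not specify it here.
import math

def attention_context(decoder_state, node_embeddings):
    scores = [sum(s * h for s, h in zip(decoder_state, emb))
              for emb in node_embeddings]
    m = max(scores)                       # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(decoder_state)
    return [sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
            for i in range(dim)]

ctx = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The node embedding aligned with the decoder state receives the larger weight, so the context vector leans toward it.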
4 Experiments
We evaluate our model on two datasets, WikiSQL (Zhong et al., 2017) and Stackoverflow (Iyer et al., 2016). WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples), and test (17,284 examples) sets. Stackoverflow consists of 32,337 SQL query and natural language question pairs, and we use the same train/development/test split as Iyer et al. (2016). We use the BLEU-4 score (Papineni et al., 2002) as our automatic evaluation metric and also perform a human study. For the human evaluation, we randomly sampled 1,000 predicted results and asked three native English speakers to rate each interpretation for both correctness with respect to the input SQL query and grammaticality, on a scale from 1 to 5. We compare variants of our model against the template, Seq2Seq, and Tree2Seq baselines.
Graph2Seq-PGE. This variant uses the pooling-based method to generate the graph embedding.
Graph2Seq-NGE. This variant uses the node-based graph embedding.
Template. We implement a template-based method which first maps each element of a SQL query to an utterance and then uses simple rules to assemble these utterances. For example, we map SELECT to "which", WHERE to "where", and the operator > to "more than". This method translates the SQL query in Figure 1 to "which company where assets more than val and sales more than val and industry less than or equal to val and profits equals val".
Seq2Seq. We choose two Seq2Seq models as baselines. The first is the attention-based Seq2Seq model proposed by Bahdanau et al. (2014), and the second additionally introduces a copy mechanism on the decoder side (Gu et al., 2016). To evaluate these models, we employ a template to convert the SQL query into a sequence: "SELECT + aggregation function + Split Symbol + selected column + WHERE + condition + Split Symbol + condition + …".
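This linearization template can be sketched as a small helper. The function name, the placeholder token `SEP` for the split symbol, and the handling of a missing aggregation function are our own assumptions, not the authors' exact preprocessing:

```python
# Sketch of the linearization template used to feed SQL queries to the
# Seq2Seq baselines. "SEP" stands in for the split symbol; the behavior
# when there is no aggregation function is an assumption.

def linearize(select_col, conditions, agg=None, sep="SEP"):
    """Flatten a simple query into the template:
    SELECT + aggregation + SEP + column + WHERE + cond + SEP + cond + ...
    conditions: list of (column, operator, value) string triples."""
    tokens = ["SELECT"]
    if agg:
        tokens.append(agg)
    tokens += [sep, select_col, "WHERE"]
    conds = [" ".join(c) for c in conditions]
    tokens.append((" %s " % sep).join(conds))
    return " ".join(tokens)

seq = linearize("company", [("assets", ">", "val"), ("sales", ">", "val")])
# -> "SELECT SEP company WHERE assets > val SEP sales > val"
```

Note how the flat sequence loses the explicit grouping of conditions that the graph representation preserves, which is the motivation for the graph encoder above.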
Tree2Seq. We also choose the tree-to-sequence model proposed by Eriguchi et al. (2016) as a baseline. We use the SQL Parser tool (http://www.sqlparser.com) to convert a SQL query into a tree structure (see Appendix for details), which is fed to the Tree2Seq model.
Our proposed models are trained using the Adam optimizer (Kingma and Ba, 2014) with a mini-batch size of 30. Hyperparameters are set based on performance on the validation set. The learning rate is set to 0.001. We apply dropout (Srivastava et al., 2014) with ratio 0.5 at the decoder layer to avoid overfitting. Gradients are clipped when their norm exceeds 20. We initialize word embeddings using GloVe word vectors (Pennington et al., 2014); the word embedding dimension is 300. For the graph encoder, the hop size $K$ is set to 6, the nonlinearity function $\sigma$ is implemented as ReLU (Glorot et al., 2011), and the parameters of the weight matrices $\mathbf{W}^{k}$ are randomly initialized. The decoder has one layer, and its hidden state size is 300.

Results and Discussion
Table 1: Evaluation results on WikiSQL (upper part) and Stackoverflow (lower part).

Method               BLEU-4   Grammaticality   Correctness
WikiSQL
  Template            15.71        1.50            --
  Seq2Seq             20.91        2.54           62.1%
  Seq2Seq + Copy      24.12        2.65           64.5%
  Tree2Seq            26.67        2.70           66.8%
  Graph2Seq-PGE       38.97        3.81           79.2%
  Graph2Seq-NGE       34.28        3.26           75.3%
Stackoverflow
  Iyer et al. (2016)  18.4         3.16           64.2%
  Graph2Seq-PGE       23.3         3.23           70.2%
  Graph2Seq-NGE       21.9         2.97           65.1%
Table 1 summarizes the results of our models and the baselines. Although the template-based method achieves a decent BLEU score, its grammaticality score is substantially worse than the other baselines'. We can see that on both datasets, our Graph2Seq models perform significantly better than the Seq2Seq and Tree2Seq baselines. One possible reason is that in our graph encoder, the node embedding retains the information of neighbor nodes within $K$ hops, whereas in the tree encoder the node embedding only aggregates information from descendants while losing the knowledge of ancestors. The pooling-based graph embedding is found to be more useful than the node-based graph embedding, because Graph2Seq-NGE adds a nonexistent node to the graph, which introduces noise into the embeddings of the other nodes. We also conducted an experiment that treats the SQL query graph as an undirected graph and found that the performance degrades.
By manually analyzing the cases in which the Graph2Seq model performs better than Seq2Seq, we find that Graph2Seq is better at interpreting two classes of queries: (1) complicated queries that have more than two conditions (Query 1), and (2) queries whose columns have implicit relationships (Query 2). Table 2 lists such SQL queries and their interpretations. One possible reason is that the Graph2Seq model can better learn the correlation between the graph pattern and the natural language by utilizing the global structure information.
Table 2: Sample SQL queries and their interpretations (S: Seq2Seq output, G: Graph2Seq output).

Query 1. SELECT COUNT Player WHERE starter = val AND touchdowns = val AND position = val
S: how many players played in position val
G: number of players with starter val and get touchdowns val for val

Query 2. SELECT Tires WHERE engine = val AND chassis = val AND team = val
S: which tire has engine val and chassis val and val
G: which tire does val run with val engine and val chassis
We find that the hop size has a significant impact on our model, since it determines how many neighbor nodes are considered during node embedding generation. As the hop size increases, the performance improves significantly. However, after the hop size reaches 6, further increasing it no longer boosts performance on WikiSQL. By analyzing the most complicated queries in WikiSQL (around 6.2%), we find there are on average six hops between a node and its most distant neighbor. This result indicates that the selected hop size should guarantee that each node can receive information from the other nodes in the graph.
5 Conclusions
Previous work approaches the SQL-to-text task using a Seq2Seq model, which does not fully capture the global structure information of the SQL query. To address this, we proposed a Graph2Seq model consisting of a graph encoder and an attention-based sequence decoder. Experimental results show that our model significantly outperforms the Seq2Seq and Tree2Seq models on the WikiSQL and Stackoverflow datasets.
Appendix
We apply the SQL Parser tool (http://www.sqlparser.com) to convert an SQL query to a tree whose structure is illustrated in Figure 3. More specifically, the root has two child nodes, namely Select List and Where Clause. The child nodes of Select List represent the selected columns in the SQL query. The Where Clause node has the logical operators that occur in the SQL query as its children. The children of a logical operator node are the conditions on which this operator works.
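The tree structure just described can be sketched with nested dicts. This is a stand-in for the SQL Parser tool's output; the field names and dict layout are our own illustrative choices, not the tool's actual schema:

```python
# Sketch of the tree fed to Tree2Seq: root -> {Select List, Where Clause};
# Where Clause -> logical operator -> condition leaves, as in Figure 3.
# Field names are illustrative, not the SQL Parser tool's real schema.

def sql_to_tree(select_cols, logic_op, conditions):
    """select_cols: selected column names; conditions: (col, op, val)
    triples joined by the logical operator logic_op."""
    return {
        "root": {
            "Select List": [{"column": c} for c in select_cols],
            "Where Clause": {
                logic_op: [{"condition": " ".join(c)} for c in conditions]
            },
        }
    }

tree = sql_to_tree(["company"],
                   "AND",
                   [("assets", ">", "val"), ("sales", ">", "val")])
```

As noted in Section 4, such a tree only propagates information from descendants to ancestors in the tree encoder, which is one reason the graph representation performs better.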
References
 Arora et al. (2017) Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
 Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. arXiv preprint arXiv:1603.06075.
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 315–323.
 Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
 Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1025–1035.
 Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083.
 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Kipf and Welling (2016) Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 Koutrika et al. (2010) Georgia Koutrika, Alkis Simitsis, and Yannis E Ioannidis. 2010. Explaining structured queries in natural language. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 333–344. IEEE.
 Ngonga Ngomo et al. (2013) Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Christina Unger, Jens Lehmann, and Daniel Gerber. 2013. Sorry, I don't speak SPARQL: translating SPARQL queries into natural language. In Proceedings of the 22nd International Conference on World Wide Web, pages 977–988. ACM.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
 Simitsis and Ioannidis (2009) Alkis Simitsis and Yannis Ioannidis. 2009. Dbmss should talk back too. arXiv preprint arXiv:0909.1786.
 Song et al. (2018) Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-sequence model for AMR-to-text generation. arXiv preprint arXiv:1805.02473.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
 Vishwanathan et al. (2010) S. Vichy N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242.
 Wu et al. (2018a) Lingfei Wu, Ian E.H. Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J. Witbrock. 2018a. Word mover's embedding: From word2vec to document embedding. In EMNLP.
 Wu et al. (2018b) Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, Pradeep Ravikumar, and Michael Witbrock. 2018b. D2KE: From distance to kernel and embedding. arXiv preprint arXiv:1802.04956.
 Xu et al. (2018) Kun Xu, Lingfei Wu, Zhiguo Wang, and Vadim Sheinin. 2018. Graph2Seq: Graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823.
 Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.