Semantic parsing is the task of translating natural language into machine-understandable formal logical forms. With the help of recent advances in deep learning, neural semantic parsers have achieved state-of-the-art results on many tasks Dong and Lapata (2016); Jia and Liang (2016); Iyer et al. (2017b). However, training them requires a large amount of labeled data (questions and their corresponding logical forms), which is often not scalable because writing logical forms demands expert knowledge.
Here, we develop a novel approach, Syntactic Question Abstraction & Retrieval (SQAR), for semantic parsing in data-scarce settings. The model constrains the logical-form search space by retrieving logical patterns from the train set via natural language similarity, with the assistance of a pre-trained language model. The subsequent grounding module then only needs to map the retrieved pattern to the final logical form.
We evaluate SQAR on various subsets of the WikiSQL train data Zhong et al. (2017) consisting of 850–2,750 samples, which amounts to 1.5–4.9% of the full train data. SQAR shows up to 4.9% higher logical form accuracy compared to the previous best open-sourced model, SQLova Hwang et al. (2019). We also show that a natural language sentence similarity dataset can be leveraged in SQAR by pre-training its backbone on Quora paraphrasing data, which yields up to 5.9% higher logical form accuracy.
In general, the retrieval approach limits a parser's ability to deal with unseen logical patterns. In contrast, we show that SQAR can generate unseen logical patterns simply by collecting new examples, without re-training, opening an interesting possibility of a generalizable retrieval-based semantic parser.
Our contributions are summarized as follows:
Compared to the previous best open-sourced model Hwang et al. (2019), SQAR achieves state-of-the-art performance on the WikiSQL test data in a data-scarce environment.
We show that SQAR can leverage natural language query similarity datasets to improve logical form generation accuracy.
We show that a retrieval-based parser can handle new, unseen logical patterns on the fly without re-training.
For maximum cost-effectiveness, we find that it is important to carefully design the train data distribution rather than merely following the (approximated) data distribution.
2 Related work
WikiSQL Zhong et al. (2017) is a large semantic parsing dataset consisting of 80,654 natural language utterances and corresponding SQL annotations. Its massive size has spurred the development of many neural semantic parsing models Xu et al. (2017); Yu et al. (2018); Dong and Lapata (2018); Wang et al. (2017, 2018); McCann et al. (2018); Shi et al. (2018); Yin and Neubig (2018); Xiong and Sun (2018); Hwang et al. (2019); He et al. (2019). Berant and Liang (2014) built a semantic parser that uses the query similarity between an input question and paraphrased canonical natural language representations generated from candidate logical forms. In our study, candidate logical forms and corresponding canonical forms do not need to be generated, as input questions are directly compared to the questions in the training data, circumventing the burden of full logical form generation. Dong and Lapata (2018) developed a two-step approach to logical form generation, similar to SQAR, using sketch representations as intermediate logical forms. In SQAR, intermediate logical forms are retrieved from the train set using question similarity, which specializes the method for data-scarce settings. Finegan-Dollak et al. (2018) developed a model that first finds the corresponding logical pattern and then fills the slots in the template. While their work resembles SQAR, there is a fundamental difference between the two approaches: the model of Finegan-Dollak et al. (2018) classifies the input query into a logical pattern, whereas we use query-to-query similarity to retrieve the logical pattern non-parametrically. By retrieving logical patterns via similarity in natural language space, paraphrasing datasets, which are relatively easy to label compared to semantic parsing datasets, can be employed during training.
Also, in contrast to classification methods, SQAR can handle unseen logical patterns by including new examples in the train set at inference time, without re-training the model (see Section 5.5). In addition, our focus is on developing a competent model with a small amount of data, which was not studied in Finegan-Dollak et al. (2018). Hwang et al. (2019) developed SQLova, which achieves state-of-the-art results on the WikiSQL task. SQLova consists of a table-aware BERT encoder and an NL2SQL module that generates SQL queries via a slot-filling approach.
The model generates the logical form (SQL query) for a given NL query and its corresponding table headers (Fig. 1). First, the logical pattern is retrieved from the train set by finding the NL query most similar to the input. In the example of Fig. 1, the input query is “What is the points of South Korea player?”. To generate the logical form, SQAR retrieves the logical pattern SELECT #1 WHERE #2 = #3 by finding the most similar NL query in the train set, for instance [“Which fruit has yellow color?”, SELECT Fruit WHERE Color = Yellow]. Then #1, #2, and #3 are grounded to Point, Country, and South Korea, respectively, by the grounding module using information from the input query and the table headers. The process is depicted schematically in Fig. 2a. The details of each step are explained below.
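The patterns above are obtained from labeled examples by delexicalization, i.e. replacing columns and values with numbered slots. A minimal sketch for the single-condition case (the real procedure also covers aggregations and up to four conditions; the function name is illustrative):

```python
import re

def delexicalize(sql: str) -> str:
    """Toy delexicalizer: strip the column name and condition value
    from a simple WikiSQL-style query, leaving its logical pattern."""
    m = re.match(r"SELECT (.+) WHERE (.+) (=|>|<) (.+)", sql)
    if m:
        _, _, op, _ = m.groups()  # only the operator survives in the pattern
        return f"SELECT #1 WHERE #2 {op} #3"
    return "SELECT #1" if sql.startswith("SELECT ") else sql

print(delexicalize("SELECT Fruit WHERE Color = Yellow"))
# SELECT #1 WHERE #2 = #3
```

With this mapping, any retrieved neighbor immediately yields a candidate pattern for the grounder to fill.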
3.1 Syntactic Question Abstractor
The syntactic question abstractor generates two vector representations of an input NL query (Fig. 2b): a syntactic vector and a lexical vector. The syntactic vector is trained to represent syntactic information of the query and is used in the retriever module (Fig. 2c). The lexical vector is trained to represent lexical information of the query and is used in the grounder (Fig. 2d).
The logical patterns of the WikiSQL dataset consist of combinations of six aggregation operators (none, max, min, count, sum, and avg) and three where-clause operators (=, >, and <). The number of conditions in the where clause ranges from 0 to 4, and multiple conditions are joined by AND. In total, there are 210 possible SQL patterns (6 select-clause patterns × 35 where-clause patterns; see Fig. A1).
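The count of 210 patterns can be reproduced with a short calculation, assuming a where-clause pattern is determined by the multiset of its operators (i.e. operator order is ignored), which matches the figure of 35 where-clause patterns in the text:

```python
from math import comb

select_patterns = 6  # none, max, min, count, sum, avg

# 0-4 conditions, each with one of 3 operators (=, >, <); counting
# multisets of size n drawn from 3 operators gives C(n+2, 2).
where_patterns = sum(comb(n + 2, 2) for n in range(5))  # 1+3+6+10+15 = 35

total = select_patterns * where_patterns
print(total)
# 210
```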
To extract this syntactic information, both the input NL query and the queries in the train set are mapped to a vector space via the table-aware BERT encoder Devlin et al. (2018); Hwang et al. (2019) (Fig. 2b).
The input of the encoder consists of the following tokens:

[CLS], SQL vocabulary tokens, [SEP], question tokens, [SEP], header tokens, [SEP]

where the SQL vocabulary tokens are SQL language element tokens (such as [SELECT], [MAX], and [COL]) separated by [SEP] (a special token in BERT), the question tokens represent the NL question, and the header tokens denote the tokens of the table headers, in which each header is separated by [SEP]. The headers are included so that they are contextualized and can be used during the grounding process (Section 3.3). Segment ids distinguish the SQL vocabulary tokens (id = 0) from the question tokens (id = 1) and header tokens (id = 1), as in BERT Devlin et al. (2018). Next, the syntactic and lexical vectors are extracted as two disjoint slices of the (linearly projected) encoding vector of the [CLS] token.
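As a rough illustration of the input layout, the sequence and its segment ids could be assembled as below. This is a sketch: the function name is hypothetical, and the exact placement of [SEP] tokens within the SQL vocabulary part follows Hwang et al. (2019) and may differ in detail.

```python
def build_encoder_input(sql_tokens, question_tokens, headers):
    """Assemble a table-aware BERT-style input with segment ids:
    [CLS] + SQL vocabulary (segment 0) + question + headers (segment 1)."""
    tokens, seg = ["[CLS]"], [0]
    tokens += sql_tokens + ["[SEP]"]           # SQL vocabulary part
    seg += [0] * (len(sql_tokens) + 1)
    tokens += question_tokens + ["[SEP]"]      # question part
    seg += [1] * (len(question_tokens) + 1)
    for h in headers:                          # each header ends with [SEP]
        tokens += h + ["[SEP]"]
        seg += [1] * (len(h) + 1)
    return tokens, seg

tokens, seg = build_encoder_input(
    ["[SELECT]", "[MAX]"],
    ["what", "is", "the", "points"],
    [["country"], ["points"]],
)
assert len(tokens) == len(seg) and seg[0] == 0 and seg[-1] == 1
```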
To retrieve the logical pattern of the input query, the questions in the train set are also mapped to the vector space using the syntactic question abstractor. The logical pattern is then found by measuring the Euclidean distance between the syntactic vector of the input query and those of the train-set questions.
Since each question in the train set has a corresponding logical form, its logical pattern can be obtained by delexicalization. The process is depicted in Fig. 2c. In SQAR, at most 10 closest questions are retrieved, and the most frequently appearing logical pattern among them is selected for the subsequent grounding process. SQAR is trained using negative sampling: one positive sample (having the same logical pattern as the input query) and 5 negative samples (having different logical patterns) are randomly drawn from the train set. The six distances are then computed as above and interpreted as approximate probabilities by applying a softmax to the negated distances. The cross entropy function is employed as the training loss.
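The retriever objective can be sketched numerically as follows (a minimal sketch of the description above; any temperature or scaling of the distances is an assumption):

```python
import numpy as np

def retrieval_loss(d_pos, d_negs):
    """Softmax over negated distances (1 positive + negatives),
    followed by cross entropy against the positive sample."""
    d = np.array([d_pos] + list(d_negs))
    logits = -d                          # smaller distance -> higher score
    logits -= logits.max()               # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                 # positive sample sits at index 0

# A positive sample that is much closer than the negatives yields a
# smaller loss than one that is nearly as far as the negatives:
assert retrieval_loss(0.1, [2.0] * 5) < retrieval_loss(1.5, [2.0] * 5)
```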
To ground the retrieved logical pattern, the following LSTM-based pointer network is used Vinyals et al. (2015). At each decoding step t, the network consumes a one-hot vector (a pointer to an input token), updates the hidden and cell vectors of the LSTM decoder through mutually different affine transformations, and outputs the probability of pointing at the i-th input token at step t. The hidden dimension of the LSTM is 100. Compared to a conventional pointer network, our grounder has three custom properties: (1) since the logical pattern is already found by the retriever, the grounder does not feed the output as the next input when the input token is already present in the logical pattern, whereas lexical outputs such as columns and where-clause values are fed into the next step as inputs (Fig. 2d); (2) to generate condition values for the where clause, the grounder infers only the beginning and end token positions within the given question; (3) generating the same column multiple times in the where clause is avoided by constraining the search space. The syntactic question abstractor, the retriever, and the grounder are together named Syntactic Question Abstraction & Retrieval (SQAR).
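One pointer-decoding step can be sketched as below, using the standard additive scoring of Vinyals et al. (2015). All parameters here are random stand-ins: the affine maps W_e and W_d correspond to the "mutually different" transformations mentioned above, and the exact parameterization in SQAR may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden = 12, 100     # hidden dimension matches the value above

E = rng.normal(size=(n_tokens, hidden))        # encoder outputs per input token
d_t = rng.normal(size=hidden)                  # decoder hidden state at step t
W_e = rng.normal(size=(hidden, hidden)) * 0.01
W_d = rng.normal(size=(hidden, hidden)) * 0.01
w = rng.normal(size=hidden)

# Additive pointer scores s_{t,i} = w^T tanh(W_e e_i + W_d d_t),
# normalized with a softmax into p_t(i), the probability of pointing
# at the i-th input token at step t.
scores = np.tanh(E @ W_e.T + d_t @ W_d.T) @ w
p_t = np.exp(scores - scores.max())
p_t /= p_t.sum()
assert np.isclose(p_t.sum(), 1.0)
```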
To train SQAR and SQLova, the PyTorch version of the pre-trained BERT model (BERT-Base-Uncased; https://github.com/huggingface/transformers, https://github.com/google-research/bert) is loaded and fine-tuned using the ADAM optimizer. The NL query is first tokenized using Stanford CoreNLP Manning et al. (2014). Each token is further tokenized (into sub-word level) by the WordPiece tokenizer Devlin et al. (2018); Wu et al. (2016). FAISS Johnson et al. (2017) is employed for the retrieval process. For the experiments with Train-Uniform-85P-850, Train-Rand-881, Train-Hybrid-85P-897, and Train-Rand-3523, only a single logical pattern is retrieved due to the scarcity of examples per pattern; otherwise, 10 logical patterns are retrieved. All experiments were performed with WikiSQL ver. 1.1 (https://github.com/salesforce/WikiSQL). Unless noted otherwise, accuracy is measured over three independent experiments per condition with different random seeds. To further pre-train the BERT backbone of SQAR, we use the Quora paraphrase detection dataset Iyer et al. (2017a). Further experimental details are summarized in the Appendix.
5 Result and Analysis
5.1 Preparation of data scarce environment
The WikiSQL dataset consists of 80,654 examples (56,355 in the train set, 8,421 in the dev set, and 15,878 in the test set). The examples are not uniformly distributed over the 210 possible SQL logical patterns, although the train, dev, and test sets have similar logical pattern distributions (see Fig. A1, Table 6). To mimic the original pattern distribution while preparing data-scarce environments, we prepare Train-Rand-881 by randomly sampling 881 examples from the original WikiSQL train set (1.6%). The validation set Dev-Rand-132 is prepared in the same way from the WikiSQL dev set.
5.2 Accuracy Measurement
SQAR retrieves the SQL logical pattern for a given question by finding the most syntactically similar question in the train set and grounds the retrieved pattern using the LSTM-based grounder (Fig. 2a). The model performance is tested over the full WikiSQL test set using two metrics: (1) logical pattern accuracy (P) and (2) logical form accuracy (LF). P is computed by ignoring differences in lexical information such as predicted columns and conditional values, whereas LF is calculated by comparing full logical forms. The execution accuracy of SQL queries is not compared, as different logical forms can produce identical answers, hindering fair comparison. Table 1 shows P and LF of several models on the WikiSQL test set, conveying the following messages: (1) SQAR outperforms SQLova by +4.0% in LF (3rd and 4th rows); (2) Quora pre-training improves the performance of SQAR by a further 0.9% (4th and 5th rows); (3) under the data-scarce condition, the use of a pre-trained language model (BERT) is critical (1st and 2nd rows vs. 3rd–5th rows).
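The two metrics can be computed as below, given any delexicalizer that maps a query to its pattern. The toy delexicalizer here is an illustrative stand-in, not the paper's actual implementation:

```python
def accuracies(preds, golds, delex):
    """Logical pattern accuracy (P) compares delexicalized queries;
    logical form accuracy (LF) compares full queries."""
    n = len(golds)
    p = sum(delex(a) == delex(b) for a, b in zip(preds, golds)) / n
    lf = sum(a == b for a, b in zip(preds, golds)) / n
    return p, lf

# Toy delexicalizer that keeps only the clause skeleton.
toy_delex = lambda s: "SELECT #1 WHERE #2 = #3" if " WHERE " in s else "SELECT #1"

preds = ["SELECT Points WHERE Country = South Korea",
         "SELECT Points WHERE Country = Japan"]
golds = ["SELECT Points WHERE Country = South Korea",
         "SELECT Points WHERE Country = China"]
p, lf = accuracies(preds, golds, toy_delex)
print(p, lf)
# 1.0 0.5
```

The second prediction gets the pattern right but the value wrong, so P exceeds LF, as is generally the case in the tables below.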
| Model | Train set | Dev set | P (%) | LF (%) |
| SQAR w/o Quora | Train-Rand-881 | Dev-Rand-132 | | |
The source code is downloaded from https://github.com/donglixp/coarse2fine.
The source code is downloaded from https://github.com/naver/sqlova.
Comparison of models in a data-scarce environment. Logical pattern accuracy (P) and full logical form accuracy (LF) on the WikiSQL test set are shown. The errors are estimated from three independent experiments with different random seeds, except for SQLova-GloVe, where the error is estimated from two independent experiments.
It is of note that Coarse2Fine Dong and Lapata (2018) shows much lower accuracy than SQLova-GloVe, although both models use GloVe Pennington et al. (2014). One possible explanation is that Coarse2Fine first classifies the SQL pattern of the where clause (sketch generation), while SQLova generates the SQL query via a slot-filling approach. The classification involves abstraction of the whole sentence, and this process can be data-hungry.
5.3 Generalization test I: dependency on logical pattern distribution
When the size of the train set is fixed, assigning more examples of frequently appearing logical patterns (in the test environment) to the train set increases the chance of correct SQL query generation, as the trained model will perform better on frequent patterns (Train-Rand-881 is constructed with this in mind). On the other hand, including diverse patterns in the train set helps the model distinguish similar patterns. Considering these two aspects, we prepare two additional subsets, Train-Uniform-85P-850 and Train-Hybrid-85P-897. Train-Uniform-85P-850 consists of 850 examples uniformly distributed over 85 patterns, whereas Dev-Uniform-80P-320 consists of 320 examples uniformly distributed over 80 patterns. Train-Hybrid-85P-897 is prepared by randomly sampling examples from the 85 most frequent logical patterns. Each pattern has approximately 128 times fewer examples than in the full WikiSQL train set, as in Train-Rand-881. In addition, every pattern is forced to have at least 7 examples for diversity (Fig. A1, Table 6), resulting in 897 examples in total. Only 85 of the 210 patterns are considered because (1) these 85 patterns cover 98.6% of the full train set, and (2) only these patterns have at least 30 corresponding examples (Fig. A1, Table 6). A dev set, Dev-Hybrid-223, is constructed similarly by extracting 223 examples from the WikiSQL dev set (Fig. A1, Table 6). The difference between the three types of train sets is shown schematically in Fig. 3 (orange: Train-Uniform-85P-850, purple: Train-Rand-881, black: Train-Hybrid-85P-897).
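The hybrid construction can be sketched as a sampling rule: keep roughly 1/128 of each pattern's examples, but never fewer than 7 when available. The function below is a hypothetical sketch of that rule, with `shrink` and `floor` taken from the numbers in the text:

```python
import random

def hybrid_sample(by_pattern, shrink=128, floor=7, seed=0):
    """Sample ~1/shrink of each pattern's examples, with a minimum
    of `floor` examples per pattern (when that many exist)."""
    rng = random.Random(seed)
    subset = []
    for pattern, examples in by_pattern.items():
        k = max(len(examples) // shrink, min(floor, len(examples)))
        subset += rng.sample(examples, k)
    return subset

by_pattern = {
    "SELECT #1": list(range(1280)),               # frequent pattern -> 10 kept
    "SELECT #1 WHERE #2 = #3": list(range(40)),   # rare pattern -> floor of 7
}
assert len(hybrid_sample(by_pattern)) == 10 + 7
```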
Table 2 shows the following: (1) SQAR again outperforms SQLova, by +4.1% LF on Train-Uniform-85P-850 (3rd and 5th rows of the upper panel) and +4.0% LF on Train-Hybrid-85P-897 (3rd and 5th rows of the bottom panel); (2) Quora pre-training improves model performance by +5.9% LF on Train-Uniform-85P-850 and by +0.5% LF on Train-Hybrid-85P-897 (4th and 5th rows of each panel).
| Model | Train set | Dev set | P (%) | LF (%) |
| SQAR w/o Quora | Train-Uniform-85P-850 | Dev-Uniform-80P-320 | | |
| SQAR w/o Quora | Train-Hybrid-85P-897 | Dev-Hybrid-223 | | |
Both SQAR and SQLova perform well when trained on either Train-Rand-881 or Train-Hybrid-85P-897 (3rd and 5th columns of Tables 1, 2). In a real service scenario, the data distribution in the test environment could vary with time. With this in mind, we prepare an additional test set, Test-Uniform-81P-648, by extracting 8 examples from each of the 81 most frequent logical patterns in the WikiSQL test set. The resulting test set has a completely different logical pattern distribution from the WikiSQL test set. Table 3 shows that both models perform best overall when trained on Train-Hybrid-85P-897, remaining robust to the change of test environment (4th columns). The result highlights two important properties for a train set: reflecting the test environment (more examples for frequent logical patterns) and including diverse patterns.
| Model & Test set | Train-Rand-881 | Train-Uniform-85P-850 | Train-Hybrid-85P-897 |
5.4 Generalization test II: dependency on dataset size
To further test the generality of our findings under changes in train set size, we prepare three additional train sets: Train-Uniform-85P-2550, Train-Rand-2677, and Train-Hybrid-96P-2750 (Table 6). Train-Uniform-85P-2550 consists of 2,550 examples uniformly distributed over 85 patterns, Train-Rand-2677 consists of 2,677 examples randomly sampled from the WikiSQL train data, and Train-Hybrid-96P-2750 is a larger version of Train-Hybrid-85P-897 in which each of 96 logical patterns includes at least 15 examples (Table 6). Table 4 shows the following: (1) SQAR performs marginally better than SQLova, with +1.9%, +0.5%, and -0.7% in LF when Train-Rand-2677, Train-Uniform-85P-2550, and Train-Hybrid-96P-2750 are used as the train sets (1st and 3rd rows of each panel); (2) pre-training with the Quora paraphrasing dataset again increases LF, by +0.5%, +3.3%, and +2.7% on Train-Rand-2677, Train-Uniform-85P-2550, and Train-Hybrid-96P-2750, respectively (2nd and 3rd rows of each panel); (3) both SQAR and SQLova perform best when trained on the hybrid dataset. Observing that the performance gap between SQAR and SQLova becomes marginal as the train set grows, we train both models on the full WikiSQL train set; again, there is only a marginal difference between the two models. The overall results are summarized in Fig. 4.
| Model | Train set | Dev set | P (%) | LF (%) |
| SQAR w/o Quora | Train-Rand-2677 | Dev-Rand-527 | | |
| SQAR w/o Quora | Train-Uniform-85P-2550 | Dev-Uniform-80P-320 | | |
| SQAR w/o Quora | Train-Hybrid-96P-2750 | Dev-Hybrid-446 | | |
5.5 Generalization test III: parsing unseen logical forms
In general, a retrieval-based approach cannot handle new types of questions when the corresponding logical patterns are not present in the train set. However, unlike a simple classification approach Finegan-Dollak et al. (2018), SQAR has an interesting generalization ability originating from the use of query-to-query similarity in natural language space. The train data in SQAR plays two roles: (1) supervision examples at training time, and (2) a database (a retrieval set) from which the most similar natural language query and its logical pattern are found at inference time. Once the model is trained, the second role can be improved by adding more examples to the train set. In particular, by adding examples with new logical patterns, the model can handle questions with unseen logical patterns without re-training.
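The second role can be illustrated with a minimal nearest-neighbor index; the class and the 2-D vectors are stand-ins for the syntactic question abstractor's output, not the paper's implementation:

```python
import numpy as np

class PatternRetriever:
    """A growable index of (question vector, logical pattern) pairs.
    Because retrieval is non-parametric, adding entries makes new
    patterns reachable without touching any trained weights."""
    def __init__(self):
        self.vecs, self.patterns = [], []

    def add(self, vec, pattern):
        self.vecs.append(np.asarray(vec, dtype=float))
        self.patterns.append(pattern)

    def retrieve(self, vec):
        d = [np.linalg.norm(v - np.asarray(vec, dtype=float)) for v in self.vecs]
        return self.patterns[int(np.argmin(d))]

r = PatternRetriever()
r.add([0.0, 0.0], "SELECT #1")
# Later, without re-training, add an example of a previously unseen pattern:
r.add([1.0, 1.0], "SELECT #1 WHERE #2 = #3")
assert r.retrieve([0.9, 1.1]) == "SELECT #1 WHERE #2 = #3"
```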
| Model | Train set | Set for retrieval | P (%) | LF (%) | R-capacity | RG-capacity |
| SQAR | R-881 | R-881 + H-897 | | | | |
| SQAR | R-881 | R-881 + H-2750 | | | | |
To show this experimentally, we measured P and LF of SQAR while changing the retrieval set at inference time (Table 5). The train set is fixed to Train-Rand-881, which contains 67 logical patterns. Upon addition of Train-Hybrid-85P-897 to the retrieval set, which includes 18 more logical patterns than Train-Rand-881, P and LF increase by 1.1% and 0.6%, respectively (2nd row of the table). Similar results are observed with Train-Hybrid-96P-2750 (+2.0% in P and +0.7% in LF, 3rd row) and with Test-Full-15878 (+4.1% in P and +1.7% in LF, 4th row). To further demonstrate the power of query-to-query similarity, we replaced the entire retrieval set, swapping Train-Rand-881 for Train-Hybrid-96P-2750, which share only 43 examples. Again, P and LF increase, by 1.7% and 0.5%, respectively (5th row). To confirm that the addition of examples enables parsing of unseen logical patterns, we introduce two additional metrics: R-capacity, the number of logical pattern types successfully retrieved by SQAR on the test set, and RG-capacity, the number of logical pattern types successfully generated (retrieved and grounded). The table shows that both R- and RG-capacity increase upon addition of examples to the retrieval set (5th and 6th columns). It should be emphasized that SQAR observed only 67 logical patterns during training. Collectively, these results show that SQAR can be easily generalized to handle new logical patterns by simply adding new examples, without re-training. This also suggests the possibility of transfer learning, even between semantic parsing tasks with different logical forms, as intermediate logical patterns can be obtained from the natural language space.
We found that our retrieval-based model using query-to-query similarity achieves high performance on the WikiSQL semantic parsing task even when labeled data is scarce. We also found that pre-training on natural language paraphrasing data helps logical form generation in our query-similarity-based retrieval approach. Furthermore, a retrieval-based semantic parser can generate logical patterns unseen during training. Finally, careful design of the train data distribution is necessary for optimal model performance in a data-scarce environment.
- Berant and Liang (2014) Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland, June 2014. Association for Computational Linguistics. doi: 10.3115/v1/P14-1133. URL https://www.aclweb.org/anthology/P14-1133.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. NAACL, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
- Dong and Lapata (2016) Li Dong and Mirella Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1004. URL https://www.aclweb.org/anthology/P16-1004.
- Dong and Lapata (2018) Li Dong and Mirella Lapata. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P18-1068.
- Finegan-Dollak et al. (2018) Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. Improving text-to-sql evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1033.
- He et al. (2019) Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. X-sql: Reinforce context into schema representation. Technical report, 2019. URL https://www.microsoft.com/en-us/research/uploads/prod/2019/03/X_SQL-5c7db555d760f.pdf.
- Hwang et al. (2019) Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. A comprehensive exploration on wikisql with table-aware word contextualization. CoRR, abs/1902.01069, 2019. URL http://arxiv.org/abs/1902.01069.
- Iyer et al. (2017a) Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. First quora dataset release: Question pairs. 2017a. URL https://data.quora.com.
- Iyer et al. (2017b) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973, Vancouver, Canada, July 2017b. Association for Computational Linguistics. doi: 10.18653/v1/P17-1089. URL https://www.aclweb.org/anthology/P17-1089.
- Jia and Liang (2016) Robin Jia and Percy Liang. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1002. URL https://www.aclweb.org/anthology/P16-1002.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.
- McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
- Shi et al. (2018) Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. Incsql: Training incremental text-to-sql parsers with non-deterministic oracles. CoRR, abs/1809.05054, 2018. URL http://arxiv.org/abs/1809.05054.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5866-pointer-networks.pdf.
- Wang et al. (2017) Chenglong Wang, Marc Brockschmidt, and Rishabh Singh. Pointing out SQL queries from text. Technical Report MSR-TR-2017-45, Microsoft, November 2017. URL https://www.microsoft.com/en-us/research/publication/pointing-sql-queries-text/.
- Wang et al. (2018) Wenlu Wang, Yingtao Tian, Hongyu Xiong, Haixun Wang, and Wei-Shinn Ku. A transfer-learnable natural language interface for databases. CoRR, abs/1809.02649, 2018.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.
- Xiong and Sun (2018) Hongyu Xiong and Ruixiao Sun. Transferable natural language interface to structured queries aided by adversarial generation. CoRR, abs/1812.01245, 2018. URL http://arxiv.org/abs/1812.01245.
- Xu et al. (2017) Xiaojun Xu, Chang Liu, and Dawn Song. Sqlnet: Generating structured queries from natural language without reinforcement learning. CoRR, abs/1711.04436, 2017. URL http://arxiv.org/abs/1711.04436.
- Yin and Neubig (2018) Pengcheng Yin and Graham Neubig. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 7–12, Brussels, Belgium, November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D18-2002.
- Yu et al. (2018) Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588–594, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2093. URL https://www.aclweb.org/anthology/N18-2093.
- Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017.
Appendix A Appendix
a.1.1 Model training
To train SQAR, the pre-trained BERT model (BERT-Base-Uncased; https://github.com/google-research/bert) is loaded and fine-tuned using the ADAM optimizer, with a separate learning rate for the grounding module. The batch size is set to 12 for all experiments. SQLova is trained similarly from the pre-trained BERT model (BERT-Base-Uncased), with a separate learning rate for the NL2SQL layer. Its batch size is set to 32 for all experiments.
The natural language utterance is first tokenized using Stanford CoreNLP Manning et al. (2014). Each token is further tokenized (into sub-word level) by the WordPiece tokenizer Devlin et al. (2018); Wu et al. (2016). The table headers and the SQL vocabulary are tokenized by the WordPiece tokenizer directly. FAISS Johnson et al. (2017) is employed for the retrieval process. The PyTorch version of the BERT code (https://github.com/huggingface/pytorch-pre-trained-BERT) is used. The performance of Coarse2Fine was calculated using the code published by the original authors (https://github.com/donglixp/coarse2fine) Dong and Lapata (2018), trained with the full WikiSQL train data to obtain its logical form accuracy on the WikiSQL test set.
All experiments were performed with WikiSQL ver. 1.1 (https://github.com/salesforce/WikiSQL). The performance of SQAR, SQLova, and Coarse2Fine was measured over three independent experiments per condition with different random seeds; errors are the estimated standard deviations. The performance of SQLova-GloVe was measured over two independent experiments with different random seeds. For the experiments with Train-Uniform-85P-850, Train-Rand-881, Train-Hybrid-85P-897, and Train-Rand-3523, only a single logical pattern is retrieved due to the scarcity of examples per pattern; otherwise, 10 logical patterns are retrieved. The models are trained until the logical form accuracy saturates, for a maximum of 1000 epochs.
a.1.2 Pre-training with Quora dataset
To further pre-train the BERT backbone used in SQAR, we use the Quora paraphrase detection dataset Iyer et al. (2017a). The dataset contains more than 405,000 question pairs, each with a binary label indicating whether the two questions are paraphrases. The task setting is analogous to the retriever of SQAR, which detects the similarity of two input NL queries, and can be seen as fine-tuning from the perspective of the paraphrase detection task. During training, the two queries are given to the BERT model along with [CLS] and [SEP] tokens, as in the original BERT setting Devlin et al. (2018). The output vector of the [CLS] token is used for binary classification to predict whether the two queries are a paraphrase pair. The model is trained with the ADAM optimizer until the classification accuracy converges.
Appendix B Supplementary tables