Effective natural language interfaces to databases (NLIDB) would give lay people access to vast amounts of data stored in relational databases. This paper identifies key oversights in current evaluation methodology for this task. In the process, we (1) introduce a new, challenging dataset, (2) standardize and fix many errors in existing datasets, and (3) propose a simple yet effective baseline system. (Code and data are available at https://github.com/jkkummerfeld/text2sql-data/)
First, we consider query complexity, showing that human-written questions require more complex queries than automatically generated ones. To illustrate this challenge, we introduce Advising, a dataset of questions from university students about courses that lead to particularly complex queries.
Second, we identify an issue in the way examples are divided into training and test sets. The standard approach, shown at the top of Fig. 1, divides examples based on the text of each question. As a result, many of the queries in the test set are seen in training, albeit with different entity names and with the question phrased differently. This means metrics are mainly measuring robustness to the way a set of known SQL queries can be expressed in English—still a difficult problem, but not a complete test of ability to compose new queries in a familiar domain. We introduce a template-based slot-filling baseline that cannot generalize to new queries, and yet is competitive with prior work on multiple datasets. To measure robustness to new queries, we propose splitting based on the SQL query. We show that state-of-the-art systems with excellent performance on traditional question-based splits struggle on query-based splits. We also consider the common practice of variable anonymization, which removes a challenging form of ambiguity from the task. In the process, we apply extensive effort to standardize datasets and fix a range of errors.
Previous NLIDB work has led to impressive systems, but current evaluations provide an incomplete picture of their strengths and weaknesses. In this paper, we provide new and improved data, a new baseline, and guidelines that complement existing metrics, supporting future work.
2 Related Work
The task of generating SQL representations from English questions has been studied in the NLP and DB communities since the 1970s Androutsopoulos et al. (1995). Our observations about evaluation methodology apply broadly to the systems cited below.
Within the DB community, systems commonly use pattern matching, grammar-based techniques, or intermediate representations of the query Pazos Rangel et al. (2013). Recent work has explored incorporating user feedback to improve accuracy Li and Jagadish (2014). Unfortunately, none of these systems are publicly available, and many rely on domain-specific resources.
In the NLP community, there has been extensive work on semantic parsing to logical representations that query a knowledge base Zettlemoyer and Collins (2005); Liang et al. (2011); Beltagy et al. (2014); Berant and Liang (2014), while work on mapping to SQL has recently increased Yih et al. (2015); Iyer et al. (2017); Zhong et al. (2017). One of the earliest statistical models for mapping text to SQL was the PRECISE system Popescu et al. (2003, 2004), which achieved high precision on queries that met constraints linking tokens and database values, attributes, and relations, but did not attempt to generate SQL for questions outside this class. Later work considered generating queries based on relations extracted by a syntactic parser Giordani and Moschitti (2012) and applying techniques from logical parsing research Poon (2013). However, none of these earlier systems are publicly available, and some required extensive engineering effort for each domain, such as the lexicon used by PRECISE.
More recent work has produced general purpose systems that are competitive with previous results and are also available, such as that of Iyer et al. (2017). We also adapt a logical form parser with a sequence-to-tree approach that makes very few assumptions about the output structure Dong and Lapata (2016).
One challenge for applying neural models to this task is annotating large enough datasets of question-query pairs. Recent work Cai et al. (2017); Zhong et al. (2017) has automatically generated large datasets using templates to form random queries and corresponding natural-language-like questions, and then having humans rephrase the question into English. Another option is to use feedback-based learning, where the system alternates between training and making predictions, which a user rates as correct or not Iyer et al. (2017). Other work seeks to avoid the data bottleneck by using end-to-end approaches Yin et al. (2016); Neelakantan et al. (2017), which we do not consider here. One key contribution of this paper is standardization of a range of datasets, to help address the challenge of limited data resources.
3 Datasets
For our analysis, we study a range of text-to-SQL datasets, standardizing them to have a consistent SQL style.
Scholar
Iyer et al. (2017) User questions about academic publications, with automatically generated SQL that was checked by asking the user whether the output was correct.
Academic
Li and Jagadish (2014) Questions about the Microsoft Academic Search (MAS) database, derived by enumerating every logical query that could be expressed using the search page of the MAS website and writing sentences to match them. The domain is similar to that of Scholar, but their schemas differ.
Yelp and IMDB
Yaghmazadeh et al. (2017) Questions about the Yelp website and the Internet Movie Database, collected from colleagues of the authors who knew the type of information in each database, but not their schemas.
WikiSQL
Zhong et al. (2017) A large collection of automatically generated questions about individual tables from Wikipedia, paraphrased by crowd workers to be fluent English.
Advising (This Work)
Our dataset of questions over a database of course information at the University of Michigan, but with fictional student records. Some questions were collected from the EECS department Facebook page and others were written by CS students with knowledge of the database who were instructed to write questions they might ask in an academic advising appointment.
The authors manually labeled the initial set of questions with SQL. To ensure high quality, at least two annotators scored each question-query pair on a two-point scale for accuracy—did the query generate an accurate answer to the question?—and a three-point scale for helpfulness—did the answer provide the information the asker was probably seeking? Cases with low scores were fixed or removed from the dataset.
We collected paraphrases using the method of Jiang et al. (2017), with manual inspection to ensure accuracy. For a given sentence, this produced paraphrases with the same named entities (e.g. course number EECS 123). To add variation, we annotated entities in the questions and queries with their types—such as course name, department, or instructor—and substituted randomly-selected values of each type into each paraphrase and its corresponding query. This combination of paraphrasing and entity replacement means an original question of “For next semester, who is teaching EECS 123?” can give rise to “Who teaches MATH 456 next semester?” as well as “Who’s the professor for next semester’s CHEM 789?”
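As a rough sketch of this substitution step (the data structures here are illustrative stand-ins, not the dataset's actual annotation format):

```python
import random

def substitute_entities(question, query, entity_types, values_by_type, rng):
    """Replace each typed entity placeholder with a random value of its type.

    `entity_types` maps a placeholder to its type, e.g. {"course0": "course"};
    `values_by_type` maps a type to candidate surface values. Both are
    hypothetical structures standing in for the real annotations.
    """
    new_question, new_query = question, query
    for placeholder, entity_type in entity_types.items():
        value = rng.choice(values_by_type[entity_type])
        # Substitute the same value into both sides, keeping the pair consistent.
        new_question = new_question.replace(placeholder, value)
        new_query = new_query.replace(placeholder, value)
    return new_question, new_query
```

Substituting into both the question and the query at once is what keeps the generated pairs aligned.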
3.1 SQL Canonicalization
SQL writing style varies. To enable consistent training and evaluation across datasets, we canonicalized the queries: (1) we alphabetically ordered fields in SELECT, tables in FROM, and constraints in WHERE; (2) we standardized table aliases in the form <TABLE_NAME>alias<N> for the Nth use of the same table in one query; and (3) we standardized capitalization and spaces between symbols. We confirmed these changes do not alter the meaning of the queries via unit tests of the canonicalization code and manual inspection of the output. We also manually fixed some errors, such as ambiguous mixing of AND and OR (30 ATIS queries).
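A minimal sketch of this kind of canonicalization, assuming simple single-clause queries (real queries need a proper SQL parser; this covers only whitespace normalization and alphabetical ordering of SELECT fields):

```python
import re

def canonicalize(sql):
    """Normalize whitespace and alphabetically order SELECT fields.

    A toy version of canonicalization steps (1) and (3); alias renaming
    and ordering of FROM/WHERE follow the same idea but require parsing.
    """
    sql = re.sub(r"\s+", " ", sql.strip())  # collapse runs of whitespace
    match = re.match(r"SELECT (.*?) FROM (.*)", sql, flags=re.IGNORECASE)
    if match is None:
        return sql  # not a SELECT; leave untouched in this sketch
    fields = sorted(field.strip() for field in match.group(1).split(","))
    return "SELECT " + " , ".join(fields) + " FROM " + match.group(2)
```

Unit tests over pairs of equivalent queries, as described above, are what give confidence that such rewrites preserve meaning.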
3.2 Variable Annotation
Existing SQL datasets do not explicitly identify which words in the question are used in the SQL query. Automatic methods to identify these variables, as used in prior work, do not account for ambiguities, such as words that could be either a city or an airport. To provide accurate anonymization, we annotated query variables using a combination of automatic and manual processing.
Our automatic process extracted terms from each side of comparison operations in SQL: one side contains quoted text or numbers, and the other provides a type for those literals. Often quoted text in the query is a direct copy from the question, while in some cases we constructed dictionaries to map common acronyms, like american airlines–AA, and times, like 2pm–1400. The process flagged cases with ambiguous mappings, which we then manually processed. Often these were mistakes, which we corrected, such as missing constraints (e.g., papers in 2015 with no date limit in the query), extra constraints (e.g., limiting to a single airline despite no mention in the question), inaccurate constraints (e.g., more than 5 as > 4), and inconsistent use of this year to mean different years in different queries.
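The core of the automatic pass can be sketched as follows; the regular expression and function name are illustrative, and the acronym and time dictionaries plus the manual review described above would sit on top of this:

```python
import re

# Match comparisons where one side is a quoted string or a number and the
# other side is a (possibly table-qualified) column that types the literal.
COMPARISON = re.compile(r"(\w+(?:\.\w+)?)\s*=\s*('[^']*'|\d+)")

def extract_variables(sql):
    """Return (column, literal) pairs found in simple equality comparisons."""
    pairs = []
    for column, literal in COMPARISON.findall(sql):
        pairs.append((column, literal.strip("'")))
    return pairs
```

Each extracted pair gives a literal and the column that types it, which is exactly the information needed to decide what kind of variable the literal should become.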
3.3 Query Deduplication
[Table 1: number of duplicate-query sets identified and number of affected queries, per dataset.]
Three of the datasets had many duplicate queries (i.e., semantically equivalent questions with different SQL). To avoid this spurious ambiguity we manually grouped the data into sets of equivalent questions (Table 1). A second person manually inspected every set and ran the queries. Where multiple queries were valid, we kept them all, though we used only the first for the rest of this work.
[Table 2: redundancy measures (questions per unique query, queries per pattern) and complexity measures (patterns, tables per query, unique tables per query, SELECTs per query, nesting depth), per dataset.]
4 Evaluating on Multiple Datasets Is Necessary
For evaluation to be informative it must use data that is representative of real-world queries. If datasets have biases, robust comparisons of models will require evaluation on multiple datasets. For example, some datasets, such as ATIS and Advising, were collected from users and are task-oriented, while others, such as WikiSQL, were produced by automatically generating queries and engaging people to express the query in language. If these two types of datasets differ systematically, evaluation on one may not reflect performance on the other. In this section, we provide descriptive statistics aimed at understanding how several datasets differ, especially with respect to query redundancy and complexity.
We consider a range of measures that capture different aspects of data complexity and diversity:
Question / Unique Query Counts
We measure dataset size and how many distinct queries there are when variables are anonymized. We also present the mean number of questions per unique query; a larger mean indicates greater redundancy.
SQL Patterns
Complexity can be described as the answer to the question, “How many query-form patterns would be required to generate this dataset?” Fig. 2 shows an example of a pattern, which essentially abstracts away from the specific table and field names. Some datasets were generated from patterns similar to these, including WikiSQL and Cai et al. (2017). This enables the generation of large numbers of queries, but limits the variation between them to only that encompassed by their patterns. We count the number of patterns needed to cover the full dataset, where larger numbers indicate greater diversity. We also report mean queries per pattern; here, larger numbers indicate greater redundancy, showing that many queries fit the same mold.
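To make the abstraction concrete, a toy version of pattern extraction might look like this (the keyword list and regular expressions are simplifications; the paper's patterns come from the annotated data):

```python
import re

# Words that survive abstraction; everything else becomes a generic slot.
KEYWORDS = r"SELECT|FROM|WHERE|AS|AND|OR|MAX|MIN|COUNT|value|table"

def query_pattern(sql):
    # Quoted strings and numbers become the generic token `value`.
    sql = re.sub(r"'[^']*'|\b\d+\b", "value", sql)
    # Table names after FROM become `table`.
    sql = re.sub(r"(?<=FROM )\w+", "table", sql)
    # Remaining non-keyword identifiers become `col`.
    return re.sub(rf"\b(?!(?:{KEYWORDS})\b)\w+\b", "col", sql)
```

Counting the distinct strings this produces over a dataset gives the pattern count; dividing dataset size by that count gives queries per pattern.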
Tables per Query
We consider the total number of tables and the number of unique tables mentioned in a query. These numbers differ in the event of self-joins. In both cases, higher values imply greater complexity.
Nesting
A query with nested subqueries may be more complex than one without nesting. We count SELECT statements within each query to determine the number of sub-queries. We also report the depth of query nesting. In both cases, higher values imply greater complexity.
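Both counts can be computed with a simple token scan, assuming sub-queries are parenthesized as in the canonicalized data:

```python
def nesting_stats(sql):
    """Count SELECT keywords (sub-queries) and the maximum parenthesis
    depth at which a SELECT occurs."""
    tokens = sql.replace("(", " ( ").replace(")", " ) ").split()
    depth = max_depth = selects = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif token.upper() == "SELECT":
            selects += 1
            max_depth = max(max_depth, depth)
    return selects, max_depth
```

A top-level query with one nested sub-query thus reports two SELECTs and depth one.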
The statistics in Table 2 show several patterns.
First, dataset size is not the best indicator of dataset diversity. Although WikiSQL contains fifteen times as many question-query pairs as ATIS, ATIS contains significantly more patterns than WikiSQL; moreover, WikiSQL’s queries are dominated by one pattern that accounts for more than half of the dataset (SELECT col AS result FROM table WHERE col = value). The small, hand-curated datasets developed by the database community—Academic, IMDB, and Yelp—have noticeably less redundancy as measured by questions per unique query and queries per pattern than the datasets the NLP community typically evaluates on.
Second, human-generated datasets exhibit greater complexity than automatically generated data. All of the human-generated datasets except Yelp demonstrate at least some nesting. The average query from any of the human-generated datasets joins more than one table.
In particular, task-oriented datasets require joins and nesting. ATIS and Advising, which were developed with air-travel and student-advising tasks in mind, respectively, both score in the top three for multiple complexity scores.
To accurately predict performance on human-generated or task-oriented questions, it is thus necessary to evaluate on datasets that test the ability to handle nesting and joins. Training and testing NLP systems, particularly deep learning-based methods, benefits from large datasets. However, at present, the largest dataset available does not provide the desired complexity.
Evaluate on multiple datasets, some with nesting and joins, to provide a thorough picture of a system’s strengths and weaknesses.
5 Current Data Splits Only Partially Probe Generalizability
It is standard best practice in machine learning to divide data into disjoint training, development, and test sets. Otherwise, evaluation on the test set will not accurately measure how well a model generalizes to new examples. The standard splits of GeoQuery, ATIS, and Scholar treat each pair of a natural language question and its SQL query as a single item. Thus, as long as each question-query pair appears in only one set, the test set is not tainted with training data. We call this a question-based data split.
However, many English questions may correspond to the same SQL query. If at least one copy of every SQL query appears in training, then the task evaluated is classification, not true semantic parsing, of the English questions. We can increase the number of distinct SQL queries by varying what entities our questions ask about; the queries for what states border Texas and what states border Massachusetts are not identical. Adding this variation changes the task from pure classification to classification plus slot-filling. Does this provide a true evaluation of the trained model’s performance on unseen inputs?
It depends on what we wish to evaluate. If we want a system that answers questions within a particular domain, and we have a dataset that we are confident covers everything a user might want to know about that domain, then evaluating on the traditional question-based split tells us whether the system is robust to variation in how a request is expressed. But compositionality is an essential part of language, and a system that has trained on What courses does Professor Smith teach? and What courses meet on Fridays? should be prepared for What courses that Professor Smith teaches meet on Fridays? Evaluation on the question split does not tell us about a model’s generalizable knowledge of SQL, or even its generalizable knowledge within the present domain.
To evaluate the latter, we propose a complementary new division, where no SQL query is allowed to appear in more than one set; we call this the query split. To generate a query split, we substitute variables for entities in each query in the dataset, as described in § 3.2. Queries that are identical when thus anonymized are treated as a single query and randomly assigned—with all their accompanying questions—to train, dev, or test. We include the original question split and the new query split labeling for the new Advising dataset, as well as ATIS, GeoQuery, and Scholar. For the much smaller Academic, IMDB, Restaurant, and Yelp datasets, we include question- and query-based buckets for cross validation.
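A sketch of query-based splitting, assuming questions have already been paired with their anonymized queries (the 80/10/10 proportions here are illustrative, not the released splits):

```python
import random

def query_split(pairs, seed=0):
    """Assign question-query pairs to train/dev/test by anonymized query.

    `pairs` is a list of (question, anonymized_sql) tuples. All questions
    sharing an anonymized query move together, so no query crosses splits.
    """
    groups = {}
    for question, sql in pairs:
        groups.setdefault(sql, []).append((question, sql))
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # deterministic for a fixed seed
    splits = {"train": [], "dev": [], "test": []}
    for i, key in enumerate(keys):
        name = "train" if i % 10 < 8 else ("dev" if i % 10 == 8 else "test")
        splits[name].extend(groups[key])
    return splits
```

Because assignment happens at the level of anonymized queries, the test set can only be solved by composing queries not seen in training.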
5.1 Systems
Recently, a great deal of work has used variations on the seq2seq model. We compare the performance of a basic seq2seq model Sutskever et al. (2014) and of seq2seq with attention over the input Bahdanau et al. (2015), implemented with TensorFlow seq2seq Britz et al. (2017). We also extend that model to include an attention-based copying option, similar to Jia and Liang (2016). Our output vocabulary for the decoder includes a special token, COPY. If COPY has the highest probability at step t, we replace it with the input token that has the highest normalized attention score. Our loss function is the sum of two terms: first, the categorical cross-entropy for the model’s probability distribution over the output vocabulary tokens; and second, the loss for word copying. When the correct output token is COPY, the second loss term is the categorical cross-entropy of the distribution of attention scores at time t. Otherwise it is zero.
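The per-step loss can be sketched as follows; the function and argument names are illustrative, and a real implementation would operate on batched tensors rather than Python lists:

```python
import math

def step_loss(vocab_probs, attention_scores, gold_token, gold_input_index, vocab):
    """Loss for one decoding step of a copying decoder.

    `vocab_probs` is the decoder's distribution over output tokens and
    `attention_scores` its normalized attention over input positions.
    If the gold token is in the vocabulary, the loss is plain cross-entropy;
    if it must be copied, we score COPY in the vocabulary and add
    cross-entropy over the attention distribution.
    """
    if gold_token in vocab:
        return -math.log(vocab_probs[vocab.index(gold_token)])
    loss = -math.log(vocab_probs[vocab.index("COPY")])       # predict COPY
    loss += -math.log(attention_scores[gold_input_index])    # attend to source
    return loss
```

The second term only fires for copied tokens, so the model is never penalized on its attention distribution when generating from the vocabulary.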
For comparison, we include systems from two recent papers. Dong and Lapata (2016) used an attention-based seq2tree model for semantic parsing of logical forms; we apply their code here to SQL datasets. Iyer et al. (2017) use a seq2seq model with automatic dataset expansion through paraphrasing and SQL templates. (We enable Iyer et al.'s paraphrasing data augmentation, but not their template-based augmentation, because templates do not exist for most of the datasets; they also found that it did not significantly improve performance. Note that on ATIS and Geo their evaluation assumed no ambiguity in entity identification, which is equivalent to our Oracle Entities condition (§5.3).)
We could not find publicly available code for the non-neural text-to-SQL systems discussed in Section 2. Also, most of those approaches require development of specialized grammars or templates for each new dataset they are applied to, so we do not compare such systems.
[Table 3: accuracy of all systems, including Iyer et al. (2017), on question-based and query-based splits of each dataset, both without variable anonymization and with oracle entities.]
5.2 New Template Baseline
In addition to the seq2seq models, we develop a new baseline system for text-to-SQL parsing which exploits repetitiveness in data. First, we automatically generate SQL templates from the training set. The system then makes two predictions: (1) which template to use, and (2) which words in the sentence should fill slots in the template. This system is not able to generalize beyond the queries in the training set, so it will fail completely on the new query-split data setting.
Fig. 3 presents the overall architecture, which we implemented in DyNet Neubig et al. (2017). A bidirectional LSTM provides a prediction for each word, either O if the word is not used in the final query, or a symbol such as city1 to indicate that it fills a slot. The hidden states of the LSTM at each end of the sentence are passed through a small feed-forward network to determine the SQL template to use. This architecture is simple and enables a joint choice of the tags and the template, though we do not explicitly enforce agreement.
To train the model, we automatically construct a set of templates and slots. Slots are determined based on the variables in the dataset, with each SQL variable that is explicitly given in the question becoming a slot. We can construct these templates because our new version of the data explicitly defines all variables, their values, and where they appear in both question and query.
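Template construction from the annotated data might be sketched like this, assuming questions arrive as (token, tag) pairs in the style the tagger predicts (a hypothetical format, not the released one):

```python
def build_templates(training_data):
    """Build the baseline's template inventory from annotated data.

    `training_data` holds (tagged_question, anonymized_sql) pairs: the
    question is a list of (token, tag) pairs where tag is "O" or a slot
    name like "course0". Each distinct anonymized query becomes one
    template; its slots are the variables appearing in the question.
    """
    templates = {}
    for tagged_question, sql in training_data:
        slots = tuple(sorted({tag for _, tag in tagged_question if tag != "O"}))
        templates.setdefault(sql, slots)
    return templates
```

At prediction time the model picks one key of this dictionary and fills its slots with tagged question words; by construction it can never produce a query outside the inventory.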
For completeness, we also report on an oracle version of the template-based system (performance if it always chose the correct template from the train set and filled all slots correctly).
5.3 Oracle Entity Condition
Some systems, such as Dong and Lapata’s model, are explicitly designed to work on anonymized data (i.e., data where entity names are replaced with a variable indicating their type). Others, such as attention-based copying models, treat identification of entities as an inextricable component of the text-to-SQL task. We therefore describe results on both the actual datasets with entities in place and a version anonymized using the variables described in § 3.2. We refer to the latter as the oracle entity condition.
5.4 Results and Analysis
We hypothesized that even a system unable to generalize can achieve good performance on question-based splits of datasets, and the results in Table 3 substantiate that for the NLP community’s datasets. The template-based, slot-filling baseline was competitive with state-of-the-art systems for question split on the four datasets from the NLP community. The template-based oracle performance indicates that for these datasets anywhere from 70-100% accuracy on question-based split could be obtained by selecting a template from the training set and filling in the right slots.
For the three datasets developed by the databases community, the gap between question-split and query-split performance is far less pronounced. The small sizes of these datasets cannot account for the difference, since even the oracle baseline did not have much success on these question splits, and since the baseline was able to handle the small Restaurants dataset. Looking back at Section 4, however, we see that these are the datasets with the least redundancy in Table 2. Because their question:unique-query ratios are nearly 1:1, the question splits and query splits of these datasets were quite similar.
Reducing redundancy does not improve performance on query split, though; at most, it reduces the difference between performance on the two splits. IMDB and Yelp both show weak results on query split despite their low redundancy. Experiments on a non-redundant version of query split for Advising, ATIS, GeoQuery, and Restaurant that contained only one question for each query confirmed this: in each case, accuracy remained the same or declined relative to regular query split.
Having ruled out redundancy as a cause for the exceptional performance on Academic’s query split, we suspect the simplicity of its questions and the compositionality of its queries may be responsible. Every question in the dataset begins return me followed by a phrase indicating the desired field, optionally followed by one or more constraints; for instance, return me the papers by ‘author_name0’ and return me the papers by ‘author_name0’ on journal_name0.
None of this, of course, is to suggest that question-based split is an easy problem, even on the NLP community’s datasets. Except for the Advising and Restaurants datasets, even the oracle version of the template-based system is far from perfect. Access to oracle entities helps performance of non-copying systems substantially, as we would expect. Entity matching is thus a non-trivial component of the task.
But the query-based split is certainly more difficult than the question-based split. Across datasets and systems, performance suffered on query split. Access to oracle entities did not remove this effect.
Many of the seq2seq models do show some ability to generalize, though. Unlike the template-based baseline, many were able to eke out some performance on query split.
On question split, ATIS is the most difficult of the NLP datasets, yet on query split, it is among the easiest. To understand this apparent contradiction, we must consider what kinds of mistakes systems make and the contexts in which they appear. We therefore analyze the output of the attention-based-copying model in greater detail.
We categorize each output as shown in column one of Table 4. The “Correct” category is self-explanatory. “Entity problem only” means that the query would have been correct but for a mistake in one or more entity names. “Different template” means that the system output was the same as another query from the dataset but for the entity names; however, it did not match the correct query for this question. “No template match” contains both the most mundane and the most interesting errors. Here, the system output a query that is not copied from training data. Sometimes, this is a simple error, such as inserting an extra comma in the WHERE clause. Other times, it is recombining segments of queries it has seen into new queries. This behavior is necessary, but not sufficient, for a model to do well on the query split. In at least one case, this category includes a semantically equivalent query marked as incorrect by the exact-match accuracy metric. (For the question which of the states bordering pennsylvania has the largest population, the gold query ranked the options by population and kept the top result, while the system output used a subquery to find the max population and then selected states that had that population.) Table 4 shows the number of examples from the test set that fell into each category, as well as the mean length of gold queries (“length”) for each category.
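The categorization can be sketched as a simple cascade of checks, with a hypothetical `anonymize` helper standing in for the variable anonymization of § 3.2:

```python
def categorize(predicted, gold, train_templates, anonymize):
    """Assign one system output to a category from Table 4.

    `anonymize` maps a query to its entity-free template (a hypothetical
    helper); `train_templates` is the set of anonymized training queries.
    """
    if predicted == gold:
        return "correct"
    if anonymize(predicted) == anonymize(gold):
        return "entity problem only"  # right template, wrong entity names
    if anonymize(predicted) in train_templates:
        return "different template"   # reproduced a training query, but the wrong one
    return "no template match"        # a novel (possibly recombined) query
```

The ordering matters: exact match is checked before template match, so entity mistakes are only counted when the underlying template is right.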
Short queries are easier than long ones in the question-based condition. In most cases, length in “correct” is shorter than length in either “different template” or “no template match” categories.
In addition, for short queries, the model seems to prefer to copy a query it has seen before; for longer ones, it generates a new query. In every case but one, mean length in “different template” is less than in “no template match.”
Interestingly, in ATIS and GeoQuery, where the model performs tolerably well on query split, the length for correct queries in query split is higher than the length for correct queries from the question split. Since, as noted above, recombination of template pieces (as we see in “no template match”) is a necessary step for success on query split, it may be that longer queries have a higher probability of recombination, and therefore a better chance of being correct in query split. The data from Scholar does not support this position; however, note that only 17 queries were correct in Scholar query split, suggesting caution in making generalizations from this set.
These results also seem to indicate that our copying mechanism effectively deals with entity identification. Across all datasets, we see only a small number of entity-problem-only examples. However, comparing the rows from Table 3 for seq2seq+Copy at the top and seq2seq+Attention in the oracle entities condition, it is clear that having oracle entities provides a useful signal, with consistent gains in performance.
Evaluate on both question-based and query-based dataset splits. Additionally, variable anonymization noticeably decreases the difficulty of the task; thus, thorough evaluations should include results on datasets without anonymization.
5.5 Logic Variants
To see if our observations on query and question split performance apply beyond SQL, we also considered the logical form annotations for ATIS and GeoQuery Zettlemoyer and Collins (2005, 2007). We retrained Jia and Liang (2016)’s baseline and full system. Interestingly, we found limited impact on performance, measured with either logical forms or denotations. To understand why, we inspected the logical form datasets. In both ATIS and GeoQuery, the logical form version has a larger set of queries after variable identification. This seems to be because the logic abstracts away from the surface form less than SQL does. For example, these questions have the same SQL in our data, but different logical forms:
what state has the largest capital
(A, (state(A), loc(B, A), largest(B, capital(B))))
which state ’s capital city is the largest
(A, largest(B, (state(A), capital(A, B), city(B))))
Both share this SQL query in our data:

SELECT CITYalias0.STATE_NAME
FROM CITY AS CITYalias0
WHERE CITYalias0.POPULATION = (
    SELECT MAX( CITYalias1.POPULATION )
    FROM CITY AS CITYalias1 ,
         STATE AS STATEalias0
    WHERE STATEalias0.CAPITAL =
        CITYalias1.CITY_NAME ) ;
Other examples include variation in the logical form between sentences with largest and largest population even though the associated dataset only has population figures for cities (not area or any other measure of size). Similarly in ATIS, the logical form will add (flight $0) if the question mentions flights explicitly, making these two queries different, even though they convey the same user intent:
what flights do you have from bwi to sfo
i need a reservation from bwi to sfo
By being closer to a syntactic representation, the queries end up being more compositional, which encourages the model to learn more compositionality than the SQL models do.
6 Conclusion
In this work, we identify two issues in current datasets for mapping questions to SQL queries. First, by analyzing question and query complexity we find that human-written datasets require properties that have not yet been included in large-scale automatically generated query sets. Second, we show that the generalizability of systems is overstated by the traditional data splits. In the process we also identify and fix hundreds of mistakes across multiple datasets and homogenize the SQL query structures to enable effective multi-domain experiments.
Our analysis has clear implications for future work. Evaluating on multiple datasets is necessary to ensure coverage of the types of questions humans generate. Developers of future large-scale datasets should incorporate joins and nesting to create more human-like data. And new systems should be evaluated on both question- and query-based splits, guiding the development of truly general systems for mapping natural language to structured database queries.
We would like to thank Laura Wendlandt, Walter Lasecki, and Will Radford for comments on an earlier draft and the anonymous reviewers for their helpful suggestions. This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM.
- Androutsopoulos et al. (1995) I. Androutsopoulos, G. D. Ritchie, and P. Thanisch. 1995. Natural Language Interfaces to Databases - An Introduction. Natural Language Engineering, 1(709):29–81.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR, pages 1–15, San Diego, California.
- Beltagy et al. (2014) Islam Beltagy, Katrin Erk, and Raymond Mooney. 2014. Semantic parsing using distributional semantics and probabilistic logic. Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 7–11.
- Berant and Liang (2014) Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1415–1425.
- Britz et al. (2017) Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark. Association for Computational Linguistics.
- Cai et al. (2017) Ruichu Cai, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang, and Zijian Li. 2017. An encoder-decoder framework translating natural language to database queries. ArXiv e-prints.
- Dahl et al. (1994) Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. Proceedings of the workshop on Human Language Technology, pages 43–48.
- Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1:33–43.
- Giordani and Moschitti (2012) Alessandra Giordani and Alessandro Moschitti. 2012. Translating questions to SQL queries with generative parsers discriminatively reranked. In COLING 2012, pages 401–410.
- Iyer et al. (2017) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973, Vancouver, Canada.
- Jia and Liang (2016) Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22.
- Jiang et al. (2017) Youxuan Jiang, Jonathan K. Kummerfeld, and Walter S. Lasecki. 2017. Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 103–109, Vancouver, Canada.
- Li and Jagadish (2014) Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. In Proceedings of the VLDB Endowment, pages 73–84.
- Liang et al. (2011) Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 590–599, Portland, Oregon.
- Neelakantan et al. (2017) Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. Proceedings of the ICLR, pages 1–10.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
- Pazos Rangel et al. (2013) Rodolfo A. Pazos Rangel, Juan Javier González Barbosa, Marco Antonio Aguirre Lam, José Antonio Martínez Flores, and Héctor J. Fraire Huacuja. 2013. Natural language interfaces to databases: An analysis of the state of the art. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Poon (2013) Hoifung Poon. 2013. Grounded unsupervised semantic parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 933–943.
- Popescu et al. (2004) Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern natural language interfaces to databases: composing statistical parsing with semantic tractability. In Proceedings of the 20th International Conference on Computational Linguistics, pages 141–147.
- Popescu et al. (2003) Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. Proceedings of the 8th International Conference on Intelligent User Interfaces IUI 03, pages 149–157.
- Price (1990) Patti J. Price. 1990. Evaluation of spoken language systems: The ATIS domain. Proceedings of the Speech and Natural Language Workshop, pages 91–95.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.
- Tang and Mooney (2000) Lappoon R. Tang and Raymond J. Mooney. 2000. Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 133–141.
- Yaghmazadeh et al. (2017) Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. Sqlizer: Query synthesis from natural language. In International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM, pages 63:1–63:26.
- Yih et al. (2015) Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China.
- Yin et al. (2016) Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2016. Neural Enquirer: Learning to query tables in natural language. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), pages 2308–2314.
- Zelle and Mooney (1996) John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1050–1055.
- Zettlemoyer and Collins (2005) Luke Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 658–666.
- Zettlemoyer and Collins (2007) Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 678–687, Prague, Czech Republic.
- Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. ArXiv e-prints, pages 1–12.