Semantic parsing, which translates a natural language sentence into its corresponding executable logic form (e.g. Structured Query Language, SQL), relieves users of the burden of learning the techniques behind the logic form. The majority of previous studies on semantic parsing assume that queries are context-independent and analyze them in isolation. In reality, however, users prefer to interact with systems in a dialogue, where they are allowed to ask context-dependent, incomplete questions. This gives rise to the task of Semantic Parsing in Context (SPC), which is quite challenging because of the complex contextual phenomena involved. In general, there are two sorts of contextual phenomena in dialogues: coreference and ellipsis. Figure 1 shows a dialogue from the SParC dataset. After the question “What is id of the car with the max horsepower?”, the user poses an elliptical question, “How about with the max mpg?”, and a question containing a pronoun, “Show its Make!”. Only by fully understanding the context can a parser successfully parse the incomplete questions into their corresponding SQL queries.
A number of context modeling methods have been suggested in the literature to address SPC [11, 17, 23, 24, 22]. These methods leverage two categories of context: recent questions and the precedent logic form. It is natural to leverage recent questions as context. Taking the example from Figure 1, when parsing the current question, we also need to take the preceding questions as input. We can either simply concatenate the input questions, or use a model to encode them hierarchically. The second category, instead of taking a bag of recent questions as input, only considers the precedent logic form. For instance, when parsing a question, we only need to take the logic form of the preceding question as context. With such a context, the decoder can attend over it, or reuse it via a copy mechanism [17, 24]. Intuitively, methods that fall into this category enjoy better generalizability, as they rely only on the last logic form as context, no matter at which turn. Notably, these two categories of context can be used simultaneously.
However, it remains unclear how far we are from effective context modeling. First, there is a lack of thorough comparisons of typical context modeling methods on complex SPC (e.g. cross-domain). Second, no previous work has verified its proposed context modeling methods with the grammar-based decoding technique, which has been developed for years and proven to be highly effective in semantic parsing [13, 20, 7]. To obtain better performance, it is worthwhile to study how context modeling methods work together with grammar-based decoding. Last but not least, there is limited understanding of how context modeling methods perform on various contextual phenomena. An in-depth analysis can shed light on potential research directions.
In this paper, we try to address the above shortcomings via an exploratory study on real-world semantic parsing in context. Concretely, we present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it. Through experiments on two large, complex cross-domain datasets, SParC and CoSQL, we carefully compare and analyze the performance of different context modeling methods. Our best model achieves state-of-the-art (SOTA) performance on both datasets with significant improvements. Furthermore, we summarize and generalize the most frequent contextual phenomena, with a fine-grained analysis of representative models. Through the analysis, we obtain some interesting findings, which may benefit the community in identifying potential research directions. We will open-source our code and materials upon acceptance to facilitate future work.
In the task of semantic parsing in context, we are given a dataset composed of dialogues. Each dialogue consists of a sequence of natural language questions and their corresponding SQL queries. Each SQL query is conditioned on a multi-table database schema, and the databases used at test time do not appear in training. In this section, we first present a base model that does not consider context. Then we introduce typical context modeling methods and describe how we equip the base model with these methods. Finally, we present how to augment the model with BERT.
2.1 Base Model
We employ the widely used attention-based sequence-to-sequence architecture [18, 2] to build our base model. As shown in Figure 2, the base model consists of a question encoder and a grammar-based decoder. For each question, the encoder provides contextual representations, while the decoder generates the corresponding SQL query according to a predefined grammar.
2.1.1 Question Encoder
2.1.2 Grammar-based Decoder
The decoder is grammar-based, with attention over the input question. Rather than producing a SQL query word by word, our decoder outputs a sequence of grammar rules (i.e. actions). Such a sequence has a one-to-one correspondence with the abstract syntax tree of the SQL query. Taking the SQL query in Figure 2 as an example, it is transformed into its action sequence by a left-to-right depth-first traversal of the tree. At each decoding step, a nonterminal is expanded using one of its corresponding grammar rules. The rules are either schema-specific or schema-agnostic. More specifically, as shown at the top of Figure 2, we slightly modify a subset of the rules in the grammar proposed by Guo et al. (2019), which has been shown to perform better than the vanilla SQL grammar. At each decoding step of a turn, the unidirectional LSTM used in the decoder takes the embedding of the previously generated grammar rule (indicated by the dashed lines in Figure 2) and updates its hidden state as:
Here, the context vector is produced by attending over each encoder hidden state in the previous step:
where the attention weight matrix is learned. The initial decoder hidden state is initialized by the final encoder hidden state, while the initial context vector is a zero vector. For each schema-agnostic grammar rule, the action embedding function returns a learned embedding. For a schema-specific rule, the embedding is obtained by passing its schema (i.e. table or column) through another unidirectional LSTM, namely the schema encoder. For example, the embedding of a schema-specific rule is computed as:
As for the output, if the expanded nonterminal corresponds to schema-agnostic grammar rules, we can obtain the output probability of an action as:
where the output weight matrix is learned. When it comes to schema-specific grammar rules, the main challenge is that, due to the cross-domain setting, the model may encounter schemas that never appeared in training. To deal with this, we do not directly compute the similarity between the decoder hidden state and the schema-specific grammar rule embedding. Instead, we first obtain an unnormalized linking score between each token in the question and the schema in each action. It is computed from both handcrafted features (e.g. word exact match) and learned similarity (i.e. the dot product between the word embedding and the grammar rule embedding). With the input question as a bridge, we reuse the attention score in Equation 3 to measure the probability of outputting a schema-specific action as:
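As an illustration of the two ingredients just described, the sketch below computes bilinear attention weights over the question tokens and then combines them with token-to-schema linking scores to score schema-specific actions. It is a minimal sketch under stated assumptions, not the exact formulation of the model: the function names, the weighted sum of linking scores, and the final softmax are all illustrative choices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_attention(encoder_states, decoder_state, W):
    """Attend over encoder states with scores h_k^T W d (W is a learned matrix).

    Returns the attention weights and the resulting context vector.
    """
    # W has one row per encoder dimension and one column per decoder dimension.
    Wd = [sum(w * d for w, d in zip(row, decoder_state)) for row in W]
    scores = [sum(h_j * wd_j for h_j, wd_j in zip(h, Wd)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[j] for a, h in zip(weights, encoder_states)) for j in range(dim)]
    return weights, context


def schema_action_probs(attn_weights, linking_scores):
    """Assumed combination: attention-weighted sum of linking scores, then softmax.

    linking_scores[k][g] is the unnormalized score between question token k
    and the schema item used by schema-specific action g.
    """
    n_actions = len(linking_scores[0])
    logits = [
        sum(a * row[g] for a, row in zip(attn_weights, linking_scores))
        for g in range(n_actions)
    ]
    return softmax(logits)


# Toy example: two question tokens, two candidate schema items.
weights, context = bilinear_attention(
    encoder_states=[[1.0, 0.0], [0.0, 1.0]],
    decoder_state=[1.0, 0.0],
    W=[[1.0, 0.0], [0.0, 1.0]],
)
probs = schema_action_probs(weights, [[2.0, 0.0], [0.0, 2.0]])
```

Because the schema items enter only through the linking scores, nothing in this computation depends on a fixed schema vocabulary, which is the property the cross-domain setting requires.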
2.2 Recent Questions as Context
To take advantage of the question context, we provide the base model with recent questions as additional input. As shown in Figure 3, we summarize and generalize three ways to incorporate recent questions as context.
The first method, Concat, concatenates recent questions with the current question in order and uses the result as the input of the question encoder, while the architecture of the base model remains the same. We do not insert special delimiters between questions, as punctuation marks already separate them.
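The concatenation described above can be sketched in a few lines. This is only an illustration: the function name `build_concat_input`, the window parameter `h` (the number of recent questions kept), and whitespace tokenization are assumptions introduced here.

```python
def build_concat_input(questions, turn, h):
    """Return the tokens of the h most recent questions followed by the current one.

    questions: list of question strings in dialogue order.
    turn: 0-based index of the current question.
    h: number of preceding questions to keep (hypothetical parameter name).
    """
    start = max(0, turn - h)
    tokens = []
    for question in questions[start:turn + 1]:
        # No delimiter token is inserted; punctuation already separates questions.
        tokens.extend(question.split())
    return tokens


dialogue = [
    "What is id of the car with the max horsepower ?",
    "How about with the max mpg ?",
    "Show its Make !",
]
encoder_input = build_concat_input(dialogue, turn=2, h=1)
```

With `h=1`, the encoder input for the third question consists of the second and third questions only.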
A dialogue can be seen as a sequence of questions which, in turn, are sequences of words. Considering this hierarchy, Suhr et al. (2018) employed a turn-level encoder (i.e. a unidirectional LSTM) to encode recent questions hierarchically; we refer to this method as Turn. At each turn, the turn-level encoder takes the previous question vector as input and updates its hidden state, which is then fed into the question encoder as an implicit context. Accordingly, Equation 1 is rewritten as:
Similar to Concat, Suhr et al. (2018) allowed the decoder to attend over all encoder hidden states. To help the decoder distinguish hidden states from different turns, they further proposed a relative distance embedding in the attention computation. Taking the above into account, Equation 3 is rewritten as:
where represents the relative distance.
To jointly model the decoder attention at the token level and the question level, and inspired by advances in open-domain dialogue, we propose a gate mechanism (Gate) that automatically computes the importance of each question. The importance is computed by:
where the weights are learned parameters. As in Equation 8, but without the relative distance embedding, the decoder of Gate also attends over all the encoder hidden states, and the question-level importance is employed as the coefficient of the attention scores of the corresponding turn.
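The interplay between question-level importance and token-level attention can be illustrated as follows. This is a sketch under explicit assumptions: the importance values are taken to be a softmax over one logit per question, and the token-level weights a softmax over all tokens of all recent questions; how the logits are produced from learned parameters is not modeled, and the function name is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(token_scores, question_logits):
    """Scale token-level attention by a question-level importance.

    token_scores[j][k]: unnormalized attention score of token k in question j.
    question_logits[j]: unnormalized importance of question j (assumed given).
    Returns the importance per question and the gated weight per token.
    """
    importance = softmax(question_logits)  # assumption: normalized across questions
    flat_weights = softmax([s for scores in token_scores for s in scores])
    gated, offset = [], 0
    for j, scores in enumerate(token_scores):
        gated.append([importance[j] * flat_weights[offset + k] for k in range(len(scores))])
        offset += len(scores)
    return importance, gated


# Two questions of two tokens each; the second question is three times as important.
importance, gated = gated_attention([[0.0, 0.0], [0.0, 0.0]], [0.0, math.log(3.0)])
```

In this toy case every token has the same token-level weight, so the gated weights differ only through the question-level coefficient.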
2.3 Precedent SQL as Context
Besides recent questions, as mentioned in Section 1, the precedent SQL can also serve as context. As shown in Figure 4, using the precedent SQL requires a SQL encoder, for which we employ another BiLSTM. The contextual representation of each action in the precedent SQL is obtained by passing its action sequence through the SQL encoder.
To reuse the precedent generated SQL, Zhang et al. (2019) presented a token-level copy mechanism on their non-grammar-based parser. Inspired by them, we propose an action-level copy mechanism suited to grammar-based decoding. It enables the decoder to copy actions appearing in the precedent SQL when the actions are compatible with the currently expanded nonterminal. As the copied actions lie in the same semantic space as the generated ones, the output probability for an action is a mix of generating and copying. The generating probability follows Equations 5 and 6, while the copying probability is:
where the weight matrix is learned. The probability of copying at each decoding step of a turn is obtained through a sigmoid function with learned parameters. The final probability is computed by:
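One common way to mix a generating distribution with a copying distribution is a convex combination controlled by a sigmoid gate. The sketch below shows that form as an assumption; the exact combination used by the model may differ, and all names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function, used here to squash a gate logit into [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_generate_and_copy(p_generate, p_copy, copy_gate):
    """Convex combination of two action distributions (assumed form).

    p_generate, p_copy: dicts mapping an action to its probability.
    copy_gate: probability of copying at this decoding step, in [0, 1].
    Actions absent from one distribution get probability 0 there.
    """
    actions = set(p_generate) | set(p_copy)
    return {
        action: (1.0 - copy_gate) * p_generate.get(action, 0.0)
        + copy_gate * p_copy.get(action, 0.0)
        for action in actions
    }


# Action names are placeholders, not rules from the grammar in Figure 2.
final = mix_generate_and_copy(
    p_generate={"action_a": 0.6, "action_b": 0.4},
    p_copy={"action_a": 1.0},
    copy_gate=sigmoid(0.0),  # gate logit 0 gives a copy probability of 0.5
)
```

Since both inputs are distributions over the same action space and the gate lies in [0, 1], the mixture is itself a valid distribution.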
Besides the action-level copy, we also introduce a tree-level copy mechanism. As illustrated in Figure 4, the tree-level copy mechanism enables the decoder to copy action subtrees extracted from the precedent SQL, which shrinks the number of decoding steps by a large margin. A similar idea has been proposed for a non-grammar-based decoder. In fact, a subtree is an action sequence starting from specific nonterminals. For a subtree, its representation is the final hidden state of the SQL encoder, which encodes its corresponding action sequence. Then we can obtain the output probability of a subtree as:
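To make the relationship between an abstract syntax tree, its action sequence, and its subtrees concrete, the following sketch traverses a toy tree depth-first from left to right and enumerates the action sequence of every subtree. The rule names and the tuple-based tree encoding are hypothetical and do not reproduce the grammar in Figure 2.

```python
def to_actions(node):
    """Left-to-right depth-first (pre-order) traversal: one action per tree node.

    A node is a pair (rule, children), where rule is the grammar rule applied
    at that node and children is a list of nodes.
    """
    rule, children = node
    actions = [rule]
    for child in children:
        actions.extend(to_actions(child))
    return actions


def subtrees(node):
    """Yield the action sequence of every subtree, including the whole tree."""
    yield to_actions(node)
    for child in node[1]:
        yield from subtrees(child)


def is_contiguous_span(sub, full):
    """Check that sub occurs as a contiguous slice of full."""
    n = len(sub)
    return any(full[i:i + n] == sub for i in range(len(full) - n + 1))


# Hypothetical rules for a query selecting the maximum horsepower.
tree = (
    "Root -> Select",
    [("Select -> Agg", [("Agg -> max Column Table", [
        ("Column -> horsepower", []),
        ("Table -> cars_data", []),
    ])])],
)
actions = to_actions(tree)
candidates = list(subtrees(tree))
```

Under a pre-order traversal every subtree corresponds to a contiguous span of the full action sequence, which is why copying a subtree can replace several decoding steps with one.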
2.4 BERT Enhanced Embedding
We employ BERT to augment our model by enhancing the embeddings of questions and schemas. We first concatenate the input question and all the schemas in a deterministic order, with [SEP] as the delimiter. For instance, the input for the first question in Figure 1 is “What is id … max horsepower? [SEP] CARS_NAMES [SEP] MakeId … [SEP] Horsepower”. Feeding it into BERT, we obtain schema-aware question representations and question-aware schema representations. These contextual representations then replace the original question and schema embeddings, while the other parts of the model remain the same.
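The input format described above can be assembled with a small helper. This is a sketch only: the function name is hypothetical, the schema list is a shortened stand-in for the full schema of the example, and tokenization and special tokens such as [CLS] are left to the BERT tokenizer.

```python
def build_bert_input(question, schema_items, sep="[SEP]"):
    """Join the question and the schema items with the separator token.

    schema_items must be given in a deterministic order (e.g. the order in
    which tables and columns are declared in the database).
    """
    return f" {sep} ".join([question] + list(schema_items))


# Shortened, illustrative schema list for the first question of Figure 1.
text = build_bert_input(
    "What is id of the car with the max horsepower?",
    ["CARS_NAMES", "MakeId", "Horsepower"],
)
```

Keeping the schema order fixed ensures that the same database always produces the same input layout across questions.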
3 Experiment & Analysis
We conduct experiments to study whether the introduced methods are able to effectively model context in the task of SPC (Section 3.2), and further perform a fine-grained analysis on various contextual phenomena (Section 3.3).
3.1 Experimental Setup
We evaluate each predicted SQL query using exact set match accuracy. Based on it, we consider three metrics: Question Match (Ques.Match), the match accuracy over all questions; Interaction Match (Int.Match), the match accuracy over all dialogues; and Turn Match, the match accuracy over questions at a given turn. Int.Match is much more challenging, as it requires every predicted SQL query in a dialogue to be correct.
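Given a per-question correctness flag, the three metrics can be computed as sketched below. The exact set match comparison itself is assumed to be done elsewhere; the function names and the 0-based turn index are choices made for this illustration.

```python
def question_match(dialogues):
    """Fraction of correct questions over all questions in all dialogues."""
    flags = [ok for dialogue in dialogues for ok in dialogue]
    return sum(flags) / len(flags)


def interaction_match(dialogues):
    """Fraction of dialogues in which every question is correct."""
    return sum(all(dialogue) for dialogue in dialogues) / len(dialogues)


def turn_match(dialogues, turn):
    """Fraction of correct questions at a given 0-based turn.

    Dialogues shorter than turn + 1 are skipped; returns None if none remain.
    """
    flags = [dialogue[turn] for dialogue in dialogues if len(dialogue) > turn]
    return sum(flags) / len(flags) if flags else None


# Each inner list holds the exact-set-match result of one dialogue, turn by turn.
results = [[True, True], [True, False, True]]
ques = question_match(results)      # 4 of 5 questions are correct
inter = interaction_match(results)  # 1 of 2 dialogues is fully correct
```

A single wrong prediction makes the whole dialogue count as a miss for Int.Match, while it costs only one question for Ques.Match.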
Our implementation is based on PyTorch, AllenNLP and the transformers library. We adopt the Adam optimizer, with one learning rate for all modules except BERT and a separate learning rate for BERT. The word embedding, action embedding and distance embedding share the same dimension, as do the hidden states of the question encoder, grammar-based decoder, turn-level encoder and SQL encoder. We initialize the word embedding using GloVe for non-BERT models. For methods which use recent questions, the same setting is used on both datasets.
[Table 1. Rows include: EditSQL + BERT; Ours + BERT.]
We consider three models as our baselines. SyntaxSQL-con and CD-Seq2Seq are two strong baselines introduced in the SParC dataset paper. SyntaxSQL-con employs a BiLSTM model to encode dialogue history on top of the SyntaxSQLNet model (analogous to our Turn), while CD-Seq2Seq is adapted from Suhr et al. (2018) for cross-domain settings (analogous to our Turn+Tree Copy). EditSQL is a SOTA baseline which mainly makes use of SQL attention and token-level copy (analogous to our Turn+SQL Attn+Action Copy).
3.2 Model Comparison
Taking Concat as a representative, we compare the performance of our model with other models, as shown in Table 1. As illustrated, our model outperforms the baselines by a large margin with or without BERT, achieving new SOTA performance on both datasets. Compared with the previous SOTA without BERT on SParC, our model improves both Ques.Match and Int.Match.
To conduct a thorough comparison, we evaluate different context modeling methods on the same parser, including the methods introduced in Section 2 and selected combinations of them (e.g., Concat+Action Copy). The experimental results are presented in Figure 5. Taken as a whole, it is surprising that none of these methods is consistently superior to the others. The experimental results on BERT-based models show the same trend. Looking at the methods that use only recent questions as context, we observe that Concat and Turn perform competitively, outperforming Gate by a large margin. Among the methods that use only the precedent SQL as context, Action Copy significantly surpasses Tree Copy and SQL Attn on all metrics. In addition, we observe little difference between the performance of Action Copy and Concat, which implies that using the precedent SQL as context has almost the same effect as using recent questions. As for the combinations of different context modeling methods, they do not improve the performance as significantly as we expected.
As mentioned in Section 1, methods which use only the precedent SQL intuitively enjoy better generalizability. To validate this, we further conduct an out-of-distribution experiment to assess the generalizability of different context modeling methods. Concretely, we select three representative methods, train them on questions from the early turns, and test them on questions from later turns. As shown in Figure 6, Action Copy has consistently comparable or better performance, validating the intuition. Meanwhile, Concat appears to be strikingly competitive, demonstrating that it also generalizes well. Compared with them, Turn is more vulnerable to out-of-distribution questions.
In conclusion, existing context modeling methods in the task of SPC are not as effective as expected, since they do not show a significant advantage over the simple concatenation method.
| Contextual Phenomena | Fine-grained Types | Example: Precedent Question | Example: Current Question |
|---|---|---|---|
| Semantically Complete | Context Independent | Show the nationality of each person. | Group people by their nationality. |
| Coreference | Bridging Anaphora | Show the version number for all templates. | What is the smallest value? |
| Coreference | Definite Noun Phrases | Which country has a head of state named Beatrix? | What languages are spoken in that country? |
| Coreference | One Anaphora | Order the pets by age. | How much does each one weigh? |
| Coreference | Demonstrative Pronoun | Which students have pets? | Of those, whose last name is smith? |
| Coreference | Possessive Determiner | How many highschoolers are liked by someone else? | What are their names? |
| Ellipsis | Continuation | What are all the flight numbers? | Which land in Aberdeen? |
| Ellipsis | Substitution (Explicit) | What is id of the car with the max horsepower? | How about with the max MPG? |
| Ellipsis | Substitution (Implicit) | Find the names of museums opened before 2010. | How about after? |
| Ellipsis | Substitution (Schema) | How many losers participated in the Australian Open? | Winners? |
| Ellipsis | Substitution (Operator) | Who was the last student to register? | Who was the first to register? |
3.3 Fine-grained Analysis
Through a careful investigation of contextual phenomena, we summarize them in multiple hierarchies. Roughly, there are three kinds of contextual phenomena in questions: semantically complete, coreference and ellipsis. Semantically complete means a question reflects all the meaning of its corresponding SQL. Coreference means a question contains pronouns, while ellipsis means the question cannot reflect all of its SQL even after its pronouns are resolved. At the fine-grained level, coreference can be divided into five types according to its pronoun. Ellipsis can be characterized by its intention: continuation and substitution (the fine-grained types of ellipsis are proposed by us because there is no consensus yet). Continuation augments the precedent question with extra semantics, and substitution refers to the situation where the current question is intended to substitute particular semantics in the precedent question. Substitution can be further branched into four types: explicit vs. implicit and schema vs. operator. Explicit means the current question provides contextual clues (i.e. partial context that overlaps with the precedent question) to help locate the substitution target, while implicit does not. In most cases, the target is a schema or an operator. In order to study the effect of context modeling methods on various phenomena, as shown in Table 2, we take the development set of SParC as an example to perform our analysis. The analysis begins by presenting the Ques.Match of three representative models on the above fine-grained types in Figure 7. As shown, though different methods have different strengths, they all perform poorly on certain types, which will be elaborated below.
Looking more closely at coreference (left of Figure 7), we observe that all methods struggle with two fine-grained types: definite noun phrases and one anaphora. Through our study, we find that the scope of the antecedent is a key factor. An antecedent is one or more entities referred to by a pronoun. Its scope is either whole, where the antecedent is the precedent answer, or partial, where the antecedent is part of the precedent question. The above-mentioned fine-grained types are more challenging because a much larger proportion of their antecedents are partial than is the case for demonstrative pronouns. This is reasonable, as a partial antecedent requires complex inference over the context. Considering the example in Table 2, “one” refers to “pets” instead of “age” because the accompanying verb is “weigh”. From this observation, we conclude that current context modeling methods do not succeed on pronouns which require complex inference over the context.
As for ellipsis (right of Figure 7), we obtain three interesting findings from comparisons in three aspects. The first finding is that all models perform better on continuation than on substitution. This is expected, since there are redundant semantics in substitution but not in continuation. Considering the example in Table 2, “horsepower” is a redundant semantic which may introduce noise in SQL prediction. The second finding comes from the unexpected drop from implicit (substitution) to explicit (substitution). Intuitively, explicit should surpass implicit on substitution, as it provides more contextual clues. The finding demonstrates that contextual clues are not well utilized by the context modeling methods. Third, compared with schema (substitution), operator (substitution) consistently achieves comparable or better performance. We believe this is caused by the cross-domain setting, which makes schema-related substitution more difficult.
4 Related Work
The most related work is the line of research on semantic parsing in context. On the topic of SQL, Zettlemoyer and Collins (2009) proposed a context-independent CCG parser and then applied it to context-dependent substitution, Iyyer et al. (2017) applied a search-based method to sequential questions, and Suhr et al. (2018) provided the first sequence-to-sequence solution in the area. More recently, Zhang et al. (2019) presented an edit-based method to reuse the precedent generated SQL. With respect to other logic forms, Long et al. (2016) focus on understanding execution commands in context, Guo et al. (2018) on question answering over a knowledge base in a conversation, and [10] on code generation in an environment context. Our work differs from theirs in that we perform an exploratory study, which previous work has not done.
There are also several related works that provide studies on context. Hwang et al. (2019) explored contextual representations in context-independent semantic parsing, and Sankar et al. (2019) studied how conversational agents use conversation history to generate responses. Different from them, our task focuses on context modeling for semantic parsing. Under the same task, Androutsopoulos et al. (1995) summarized contextual phenomena at a coarse-grained level, while Bertomeu et al. (2006) performed a Wizard-of-Oz experiment to study the most frequent phenomena. What distinguishes our work from theirs is that we not only summarize contextual phenomena by fine-grained types, but also analyze context modeling methods.
5 Conclusion & Future Work
This work conducts an exploratory study on semantic parsing in context, to understand how far we are from effective context modeling. Through a thorough comparison, we find that existing context modeling methods are not as effective as expected, and that a simple concatenation method can be highly competitive. Furthermore, by performing a fine-grained analysis, we identify two potential directions for future work: incorporating common sense for better pronoun inference, and modeling contextual clues in a more explicit manner. By open-sourcing our code and materials, we believe our work can help the community debug models at a fine-grained level and make more progress.
[1] (1995) Natural Language Interfaces to Databases–An Introduction. Natural Language Engineering.
[2] (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
[3] (2006) Contextual phenomena and thematic relations in database QA dialogues: results from a wizard-of-Oz experiment. In NAACL.
[4] (2019) Representing schema structure with graph neural networks for text-to-SQL parsing. In ACL.
[5] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
[6] AllenNLP: a deep semantic natural language processing platform. In ACL.
[7] (2019) Towards complex text-to-SQL in cross-domain database with intermediate representation. In ACL.
[8] (1997) Long short-term memory. Neural Computation, Volume 9.
[9] (2019) A comprehensive exploration on WikiSQL with table-aware word contextualization. arXiv.
[10] (2018) Mapping language to code in programmatic context. In EMNLP.
[11] (2017) Search-based neural structured learning for sequential question answering. In ACL.
[12] (2015) Adam: a method for stochastic optimization. In ICLR.
[13] (2017) Neural semantic parsing with type constraints for semi-structured tables. In EMNLP.
[14] (2017) Automatic differentiation in PyTorch. In NIPS.
[15] (2014) GloVe: global vectors for word representation. In EMNLP.
[16] (1997) Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, Volume 45.
[17] (2018) Learning to map context-dependent sentences to executable formal queries. In NAACL.
[18] (2014) Sequence to sequence learning with neural networks. In NIPS.
[19] (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv.
[20] (2018) TRANX: a transition-based neural abstract syntax parser for semantic parsing and code generation. In EMNLP.
[21] (2018) SyntaxSQLNet: syntax tree networks for complex and cross-domain text-to-SQL task. In EMNLP.
[22] (2019) CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In EMNLP-IJCNLP.
[23] (2019) SParC: cross-domain semantic parsing in context. In ACL.
[24] (2019) Editing-based SQL query generation for cross-domain context-dependent questions. In EMNLP-IJCNLP.
[25] (2018) Context-sensitive generation of open-domain conversational responses. In COLING.