An Imitation Game for Learning Semantic Parsers from User Interaction

05/02/2020 ∙ by Ziyu Yao, et al. ∙ Facebook ∙ The Ohio State University

Despite the widely successful applications, bootstrapping and fine-tuning semantic parsers are still a tedious process with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so it also gets to imitate the user behavior and continue improving itself autonomously with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.




1 Introduction

Figure 1: A semantic parser proactively interacts with the user in a friendly way to resolve its uncertainties. In doing so it also gets to imitate the user behavior and continue improving itself autonomously with the hope that eventually it may become as good as the user in interpreting their questions.

Semantic parsing has found tremendous applications in building natural language interfaces that allow users to query data and invoke services without programming Woods (1973); Zettlemoyer and Collins (2005); Berant et al. (2013); Su et al. (2017); Yu et al. (2018). The lifecycle of a semantic parser typically consists of two stages: (1) bootstrapping, where we keep collecting labeled data via trained annotators and/or crowdsourcing for model training until it reaches commercial-grade performance (e.g., 95% accuracy on a surrogate test set), and (2) fine-tuning, where we deploy the system, analyze the usage, and collect and annotate new data to address the identified problems or emerging needs. However, this lifecycle poses several challenges for scaling up or building semantic parsers for new domains: (1) high bootstrapping cost, because mainstream neural parsing models are data-hungry and the annotation cost of semantic parsing data is relatively high, (2) high fine-tuning cost from continuously analyzing usage and annotating new data, and (3) privacy risks arising from exposing private user conversations to annotators and developers Lomas (2019).

In this paper, we suggest an alternative methodology for building semantic parsers that could potentially address all the aforementioned problems. The key is to involve human users in the learning loop. A semantic parser should be introspective of its uncertainties and proactively prompt for demonstration from the user, who knows the question best, to resolve them. In doing so, the semantic parser would be able to accumulate targeted training data and continue improving itself autonomously without involving any annotators or developers, hence also minimizing privacy risks. The bootstrapping cost could also be significantly reduced because an interactive system need not be almost perfectly accurate to be deployed. In addition, such interaction opens up the black box and allows users to know more about the reasoning underneath the system and better interpret the final results Su et al. (2018). A human-in-the-loop methodology like this also opens the door for domain adaptation and personalization.

This work builds on the recent line of research on interactive semantic parsing Li and Jagadish (2014); Chaurasia and Mooney (2017); Gur et al. (2018); Yao et al. (2019b). Specifically, Yao et al. (2019b) provide a general framework, MISP (Model-based Interactive Semantic Parsing), which handles uncertainty modeling and natural language generation. We will leverage MISP for user interaction to prove the feasibility of the envisioned methodology. However, existing studies only focus on interacting with users to resolve uncertainties. None of them has answered the crucial question of how to learn from user interaction, which is the technical focus of this study.

One form of user interaction explored for learning semantic parsers is asking users to validate the execution results Clarke et al. (2010); Iyer et al. (2017). While appealing, in practice it may be a difficult task for real users because they would not need to ask the question if they knew the answer in the first place. We instead aim to learn semantic parsers from fine-grained interaction where users only need to answer simple questions covered by their background knowledge (Figure 1). However, learning signals from such fine-grained interactions are bound to be sparse because the system needs to avoid asking too many questions and overwhelming the user, which poses a challenge for learning.

To this end, we propose a novel annotation-efficient imitation learning algorithm for learning semantic parsers from such sparse, fine-grained demonstration: The agent (semantic parser) only requests demonstration when it is uncertain about a state (parsing step). For the certain/confident states, the actions chosen by the current policy are deemed correct. The policy is updated iteratively in a Dataset Aggregation fashion Ross et al. (2011). At each iteration, all the state-action pairs, demonstrated or confident, are included to form a new training set and train a new policy in a supervised way. Intuitively, using confident predictions for training mitigates the sparsity issue, but it may also introduce noise. We provide a theoretical analysis of the proposed algorithm and show that, under mild assumptions, the quality of the final policy is mainly determined by the quality of the initial policy and confidence estimation accuracy.

Using simulated users, we also empirically compare our method with a number of baselines on the text-to-SQL parsing problem, including the powerful but costly baseline of full expert annotation. On the WikiSQL Zhong et al. (2017) dataset, compared with the full annotation baseline, we show that, when bootstrapped using only 10% of the training data, our method can achieve almost the same test accuracy (2% absolute loss) while using less than 10% of the annotations, without even taking into account the different unit cost of annotation from users vs. domain experts. We also show that the quality of the final policy is largely determined by the quality of the initial policy, further confirming the theoretical analysis. Finally, we demonstrate that the system can generalize to more complicated semantic parsing tasks such as Spider Yu et al. (2018).

2 Related Work

  • Our work extends interactive semantic parsing, a recent idea that leverages system-user interactions to improve semantic parsing on the fly Li and Jagadish (2014); He et al. (2016); Chaurasia and Mooney (2017); Su et al. (2018); Gur et al. (2018); Yao et al. (2019a, b). As an example, Gur et al. (2018) built a neural model to identify and correct error spans in a generated SQL query via dialogues. Yao et al. (2019b) further generalized the interaction framework by formalizing a model-based intelligent agent called MISP. Our system leverages MISP to support interactivity but focuses on developing an algorithm for continually improving the base parser from end user interactions, which has not been accomplished by previous work.

  • Learning interactively from user feedback has been studied for machine translation Nguyen et al. (2017); Petrushkov et al. (2018); Kreutzer and Riezler (2019) and other NLP tasks Sokolov et al. (2016); Gao et al. (2018); Hancock et al. (2019). Most relevant to this work, Hancock et al. (2019) constructed a chatbot that learns to request feedback when the user is unsatisfied with the system response, and then further improves itself periodically from the satisfied responses and feedback responses. The work reaffirms the necessity of human-in-the-loop autonomous learning systems like ours.

    In the field of semantic parsing, Clarke et al. (2010) and Iyer et al. (2017) learned semantic parsers from user validation on the query execution results. However, it is often impractical to expect end users to be able to validate answer correctness (e.g., consider validating an answer “103” for the question “how many students have a GPA higher than 3.5” from a massive table). Active learning has also been leveraged to selectively obtain gold labels for semantic parsing and save human annotations Duong et al. (2018); Ni et al. (2020). Our work is complementary to this line of research as we focus on learning interactively from end users (not “teachers”).

  • Traditional imitation learning algorithms Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011); Ross and Bagnell (2014) iteratively execute and train a policy by collecting expert demonstrations for every policy decision. Despite its efficacy, this kind of learning demands costly annotations from experts. In contrast, we save expert effort by selectively requesting demonstrations. This idea is related to active imitation learning Chernova and Veloso (2009); Kim and Pineau (2013); Judah et al. (2014); Zhang and Cho (2017). For example, Judah et al. (2014) used active learning to select informative trajectories from the unlabeled data pool for expert demonstrations. However, their setting assumes a “teacher” who intentionally provides labels and an unlabeled data pool, while our algorithm targets end users who are using the system. Similar to our approach, Chernova and Veloso (2009) solicited expert demonstrations only for uncertain states. However, their algorithm simply abandons policy actions that are confident, leading to sparse training data. Instead, our algorithm utilizes confident policy actions to combat the sparsity issue and is additionally provided with a theoretical analysis.

3 Preliminaries

Formally, we assume the semantic parsing model generates a semantic parse by executing a sequence of actions $a_t$ (parsing decisions), one at each time step $t$. In practice, the definition of action depends on the specific semantic parsing model, as we will illustrate shortly. A state $s_t$ is then defined as a tuple $(q, a_{1:t-1})$, where $q$ is the initial natural language question and $a_{1:t-1} = (a_1, \ldots, a_{t-1})$ is the current partial parse. In particular, the initial state $s_1 = (q, \emptyset)$ contains only the question. Denote a semantic parser as $\hat{\pi}$, which is a policy function Sutton and Barto (2018) that takes a state $s_t$ as input and outputs a probability distribution over the action space. The semantic parsing process can be formulated as sampling a trajectory $\tau$ by alternately observing a state and sampling an action from the policy, i.e., $\tau = (s_1, a_1 \sim \hat{\pi}(s_1), \ldots, s_T, a_T \sim \hat{\pi}(s_T))$, assuming a trajectory length $T$. The probability of the generated semantic parse becomes:

$$p_{\hat{\pi}}(a_{1:T} \mid s_1) = \prod_{t=1}^{T} p_{\hat{\pi}}(a_t \mid s_t).$$

An interactive semantic parser typically follows the aforementioned definition and requests the user’s validation of a specific action $a_t$. Based on the feedback, a correct action $a_t^*$ can be inferred to replace the original $a_t$. The parsing process continues with $a_t^*$ afterwards. In this work, we adopt MISP Yao et al. (2019b) as the back-end interactive semantic parsing framework, which enables system-user interaction via a policy probability based uncertainty estimator, a grammar-based natural language generator, and a multi-choice question-answering interaction design, as shown in Figure 1.

  • Consider the SQLova parser Hwang et al. (2019), which generates a query by filling “slots” in a pre-defined SQL sketch “SELECT Agg SCol WHERE WCol OP VAL”. To complete the SQL query in Figure 1, it first takes three steps: SCol=“School/Club Team” ($a_1$), Agg=“COUNT” ($a_2$) and WCol=“School/Club Team” ($a_3$). MISP detects that $a_3$ is uncertain because its probability is lower than a pre-specified threshold. It validates $a_3$ with the user and corrects it with WCol=“Player” ($a_3^*$). The parsing continues with OP=“=” ($a_4$) and VAL=“jalen rose” ($a_5$). The trajectory length is $T = 5$ in this case.
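The definitions in this section can be illustrated with a minimal sketch. The per-step probabilities and the threshold below are made up for illustration (the source does not report them), and `trajectory_probability` and `uncertain_steps` are hypothetical helper names, not part of MISP or SQLova.

```python
import math

# Hypothetical per-step policy probabilities for the five parsing steps of
# the Figure 1 example (illustrative values, not taken from the paper).
steps = [
    ("SCol", "School/Club Team", 0.98),
    ("Agg", "COUNT", 0.97),
    ("WCol", "School/Club Team", 0.61),
    ("OP", "=", 0.99),
    ("VAL", "jalen rose", 0.96),
]

def trajectory_probability(step_probs):
    """Probability of a full parse: product of per-step action probabilities."""
    return math.prod(step_probs)

def uncertain_steps(steps, threshold):
    """1-based indices of steps whose probability falls below the threshold."""
    return [t for t, (_, _, p) in enumerate(steps, start=1) if p < threshold]

probs = [p for _, _, p in steps]
print(trajectory_probability(probs))
print(uncertain_steps(steps, threshold=0.95))  # only step 3 (WCol) is flagged
```

In this toy setting only the WCol step would trigger a question to the user, mirroring the example above.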

4 Learning Semantic Parsers from User Interaction

In this section, we present an imitation learning algorithm for learning semantic parsers from user interactions. The algorithm is annotation-efficient and can train a parser without requiring a large amount of user feedback (or “annotations”), an important property for practical use in an end-user-facing system. Note that while we apply this algorithm to semantic parsing in this work, in principle the algorithm can be applied to other structured prediction tasks (e.g., text summarization or machine translation) as well.

4.1 An Imitation Learning Formulation

Under the interactive semantic parsing framework, a learning algorithm can intuitively aggregate $(s_t, a_t^*)$ pairs collected from user interactions and train the parser to enforce $a_t^*$ under the state $s_t$. However, this is not achievable by conventional supervised learning since the training needs to be conducted in an interactive environment, where the partial parse $a_{1:t-1}$ is generated by the parser itself.

Instead, we formulate it as an imitation learning problem Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011); Ross and Bagnell (2014). Consider the user as a demonstrator; then the derived action $a_t^*$ can be viewed as an expert demonstration which is interactively sampled from the demonstrator’s policy (or expert policy) $\pi^*$, i.e., $a_t^* \sim \pi^*(s_t)$. (Footnote 2: We follow the imitation learning literature and use “expert” to refer to the imitation target, but the user in our setting by no means needs to be a “domain (SQL) expert”.) The goal of our algorithm is thus to train policy $\hat{\pi}$ to imitate the expert policy $\pi^*$. A general procedure is described in Algorithm 1, where $\hat{\pi}$ is learned iteratively for every $m$ user questions (Line 1–9).

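Require: Initial training data $D_0$, confidence threshold $\mu$.
Ensure: A trained policy $\hat{\pi}$.
1: Initialize $D \leftarrow D_0$.
2: Initialize $\hat{\pi}_1$ by training it on $D_0$.
3: for $i = 1$ to $N$ do
4:      Observe user questions $Q_i = \{q_1, \ldots, q_m\}$;
5:      $D_i \leftarrow \bigcup_{q \in Q_i}$ Parse&Collect($\mu$, $q$, $\hat{\pi}_i$, $\pi^*$);
6:      Aggregate dataset $D \leftarrow D \cup D_i$;
7:      Train policy $\hat{\pi}_{i+1}$ on $D$ using Eq. (1).
8: end for
9: return best $\hat{\pi}_i$ on validation.
10: function Parse&Collect($\mu$, $q$, $\hat{\pi}_i$, $\pi^*$)
11:      Initialize $s_1 = (q, \emptyset)$, $D' = \emptyset$.
12:      for $t = 1$ to $T$ do
13:          Preview action $a_t = \arg\max_a \hat{\pi}_i(s_t)$;
14:          if $p_{\hat{\pi}_i}(a_t \mid s_t) \geq \mu$ then
15:               $w_t \leftarrow 1$;
16:               Collect $D' \leftarrow D' \cup \{(s_t, a_t, w_t)\}$;
17:               Execute $a_t$;
18:          else
19:               Trigger user interaction and derive expert demonstration $a_t^* \sim \pi^*(s_t)$;
20:               $w_t \leftarrow 1$ if $a_t^*$ is valid; $0$ otherwise;
21:               Collect $D' \leftarrow D' \cup \{(s_t, a_t^*, w_t)\}$;
22:               Execute $a_t^*$.
23:          end if
24:      end for
25:      return $D'$.
26: end function
Algorithm 1 Learning from User Interaction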
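The Parse&Collect routine can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy` is a hypothetical callable returning a dict of action probabilities, `ask_user` is a hypothetical stand-in for the MISP interaction that returns a demonstrated action and whether it is valid, and the toy policy and simulated user exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[str, Tuple[str, ...]]   # (question, partial parse)
Example = Tuple[State, str, float]    # (state, action, weight)

def parse_and_collect(
    question: str,
    policy: Callable[[State], Dict[str, float]],
    ask_user: Callable[[State], Tuple[str, bool]],
    threshold: float,
    length: int,
) -> List[Example]:
    """Collect weighted state-action pairs while parsing one question."""
    partial: Tuple[str, ...] = ()
    collected: List[Example] = []
    for _ in range(length):
        state = (question, partial)
        dist = policy(state)
        action = max(dist, key=dist.get)      # preview the policy action
        if dist[action] >= threshold:         # confident: trust the policy
            weight = 1.0
        else:                                 # uncertain: ask the user
            action, valid = ask_user(state)
            weight = 1.0 if valid else 0.0
        collected.append((state, action, weight))
        partial = partial + (action,)         # execute the chosen action
    return collected

# Toy setup (hypothetical): gold parse and a policy that is unsure at step 2.
GOLD = ("a", "b", "c")

def toy_policy(state: State) -> Dict[str, float]:
    table = [{"a": 0.99, "x": 0.01}, {"x": 0.6, "b": 0.4}, {"c": 0.97, "x": 0.03}]
    return table[len(state[1])]

def simulated_user(state: State) -> Tuple[str, bool]:
    t = len(state[1])
    return GOLD[t], state[1] == GOLD[:t]      # valid only on a gold prefix

data = parse_and_collect("q", toy_policy, simulated_user, threshold=0.95, length=3)
print([(a, w) for _, a, w in data])  # [('a', 1.0), ('b', 1.0), ('c', 1.0)]
```

Only the second step triggers an interaction here; the other two actions are collected directly from the policy with weight 1.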

4.2 Annotation-efficient Imitation Learning

Consider parsing a user question and collecting training data using the parser $\hat{\pi}_i$ in the $i$-th iteration (Line 5). A standard imitation learning algorithm such as DAgger Ross et al. (2011) usually requests expert demonstration for every state in the sampled trajectory. However, it requires a considerable amount of user annotations, which may not be practical when interacting with end users.

We propose an annotation-efficient imitation learning algorithm, which saves user annotations by selectively requesting user intervention, as shown in function Parse&Collect. Specifically, in each parsing step, the system first previews whether it is confident about its own decision $a_t$ (Line 13–14), which is determined when its probability is no less than a threshold, i.e., $p_{\hat{\pi}_i}(a_t \mid s_t) \geq \mu$. In this case, the algorithm executes and collects the policy action $a_t$ (Line 15–16); otherwise, a system-user interaction will be triggered and the derived demonstration $a_t^* \sim \pi^*(s_t)$ will be collected and executed to continue parsing (Line 17–22).

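Denote a collected state-action pair as $(s_t, \tilde{a}_t, w_t)$, where $\tilde{a}_t$ could be $a_t$ or $a_t^*$ depending on whether an interaction is requested. To train $\hat{\pi}_{i+1}$ (Line 7), our algorithm adopts a reduction-based approach similar to DAgger and reduces imitation learning to iterative supervised learning. Formally, we define our training loss function as a weighted negative log-likelihood:

$$\mathcal{L}(\hat{\pi}_{i+1}) = -\frac{1}{|D|} \sum_{(s_t, \tilde{a}_t, w_t) \in D} w_t \log p_{\hat{\pi}_{i+1}}(\tilde{a}_t \mid s_t), \qquad (1)$$

where $D$ is the aggregated training data over $i$ iterations and $w_t$ denotes the weight of $(s_t, \tilde{a}_t)$.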

We consider assigning weight $w_t$ in three cases: (1) For confident actions $a_t$, we set $w_t = 1$. This essentially treats confident actions as gold decisions, which resembles self-training Nigam and Ghani (2000). (2) For user-confirmed decisions (valid demonstrations $a_t^*$), such as enforcing a WHERE condition on “Player” in Figure 1, $w_t$ is also set to 1 to encourage the parser to imitate the correct decisions from users. (3) For uncertain actions that cannot be addressed via human interactions (invalid demonstrations $a_t^*$), we assign $w_t = 0$. This could happen when some of the incorrect precedent actions are not fixed. For example, in Figure 1, if the system missed correcting the WHERE condition on “School/Club Team”, then whatever value it generates after “WHERE School/Club Team=” is wrong, and thus any action derived from human feedback would be invalid. A possible training strategy in this case is to set $w_t$ to be negative, similar to Welleck et al. (2020). However, empirically we find that this strategy fails to train the parser to correct its mistake in generating School/Club Team and instead disturbs model training. To solve this problem, we directly set $w_t = 0$ to remove the impact of unaddressed actions. A similar solution is also adopted in Petrushkov et al. (2018); Kreutzer and Riezler (2019). As shown in Section 6, this way of assigning training weights enables stable improvement in iterative model learning while requiring fewer user annotations.
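The weighted negative log-likelihood and the three weighting cases can be sketched as follows. This is a minimal illustration, assuming the loss is averaged over the collected pairs; `weighted_nll` is a hypothetical helper and the probabilities are made up.

```python
import math

def weighted_nll(examples):
    """Average of -w * log p over collected (probability, weight) pairs.

    Each pair holds p, the current policy's probability of the collected
    action, and w, its training weight (1 or 0 in the scheme above).
    """
    if not examples:
        return 0.0
    return -sum(w * math.log(p) for p, w in examples) / len(examples)

batch = [
    (0.97, 1.0),  # confident policy action, treated as gold
    (0.40, 1.0),  # valid user demonstration
    (0.10, 0.0),  # invalid demonstration, masked out by w = 0
]
print(weighted_nll(batch))
```

Because the third pair has weight 0, changing its probability leaves the loss unchanged, which is the intended effect of removing unaddressed actions from training.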

5 Theoretical Analysis

While our system enjoys the benefit of learning from a small amount of user feedback, one crucial question is whether it can still achieve the same level of performance as a system trained on full expert annotations, if one could afford that and manage the privacy risk. In other words, what is the performance gap between our system and a fully supervised system? In this section, we answer this question by showing that the performance gap is mainly bounded by the learning policy’s probability of trusting a confident action that turns out to be wrong, which can be controlled in practice.

In the analysis, we follow prior work Ross and Bagnell (2010); Ross et al. (2011) to assume a unified trajectory length $T$ and focus the proof on the “infinite sample” case, which assumes an infinite number of samples in each iteration (i.e., $m = \infty$ in Algorithm 1), such that the state space can be fully explored by the current policy. An analysis under the “finite sample” case can be found in Appendix A.5.

5.1 Cost Function for Analysis

Different from typical imitation learning tasks (e.g., Super Tux Kart Ross et al. (2011)), in semantic parsing, there exists only one gold trajectory that is semantically identical to the question and can return correct execution results. (Footnote 3: We assume a canonical order for swappable components in a parse. In practice, it may be possible, though rare, for one question to have multiple gold parses.) Whenever a policy action is different from the gold one, the whole trajectory will not yield the correct semantic meaning. Therefore, we analyze a policy’s performance only when it is conditioned on a gold partial parse, i.e., $s_t \sim d^t_{\pi^*}$, where $d^t_{\pi^*}$ is the state distribution in step $t$ when executing the expert policy $\pi^*$ for the first $t-1$ steps. Let $\ell(s, \hat{\pi})$ be the loss of making a mistake at state $s$. By summing up a policy’s expected loss over $T$ steps, we define the cost of the policy as:

$$J(\hat{\pi}) = \sum_{t=1}^{T} \mathbb{E}_{s_t \sim d^t_{\pi^*}}\left[\ell(s_t, \hat{\pi})\right] = T\, \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \hat{\pi})\right], \qquad (2)$$

where $d_{\pi^*} = \frac{1}{T} \sum_{t=1}^{T} d^t_{\pi^*}$ denotes the average expert state distribution (assuming time step $t$ is a random variable uniformly sampled from $1, \ldots, T$). A detailed derivation is shown in Appendix A.1.
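The cost defined here can be estimated empirically by scoring every policy decision on a gold partial parse. The sketch below is an illustration under assumptions the text does not fix: it uses a 0-1 loss for "making a mistake", and `estimate_cost` and the toy data are hypothetical.

```python
def estimate_cost(policy, gold_trajectories):
    """Estimate the cost as the average number of wrong decisions per question,
    where each decision is made from the gold partial parse (0-1 loss)."""
    if not gold_trajectories:
        return 0.0
    total_loss = 0.0
    for question, gold_actions in gold_trajectories:
        for t, gold in enumerate(gold_actions):
            state = (question, tuple(gold_actions[:t]))  # gold prefix only
            total_loss += 0.0 if policy(state) == gold else 1.0
    return total_loss / len(gold_trajectories)

# Toy policy (hypothetical) that always predicts "a".
always_a = lambda state: "a"
gold = [("q1", ["a", "b"]), ("q2", ["a", "a"])]
print(estimate_cost(always_a, gold))  # 0.5: one mistake over two questions
```

Note that the policy's own earlier mistakes never enter the state, which is what distinguishes this cost from the full-trajectory accuracy reported in the experiments.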

The better $\hat{\pi}$ is, the smaller this cost becomes. Although it is not exactly the same as the objective evaluated in experiments, which measures the correctness of a complete trajectory (rather than a single policy action) sampled from $\hat{\pi}$, this simplified version makes theoretical analysis easier and reflects a consistent relative performance among algorithms. Next, we will derive the bound of each policy’s cost in order to compare their performance.

5.2 Cost Bound of Supervised Approach

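A fully supervised system trains a parser on expert-annotated $(q, a^*_{1:T})$ pairs, where the gold semantic parse $a^*_{1:T}$ can be viewed as generated by executing the expert policy $\pi^*$. This gives the policy $\hat{\pi}_{sup}$:

$$\hat{\pi}_{sup} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \pi)\right],$$

where $\Pi$ is the policy space induced by the model architecture. A detailed derivation in Appendix A.2 shows the cost bound of the supervised approach:

Theorem 5.1.

For the supervised approach, let $\epsilon_N = \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \pi)\right]$, then $J(\hat{\pi}_{sup}) = T \epsilon_N$.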

The theorem gives an exact bound (as shown by the equality) since the supervised approach, given the “infinite sample” assumption, trains a policy under the same state distribution as the one being evaluated in the cost function (Eq. (2)). As we will show next, when adopting an annotation-efficient learning strategy, our proposed algorithm breaks this consistency and thus induces a performance gap compared with the supervised approach.

5.3 Cost Bound of Our Proposed Algorithm

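During its iterative learning, Algorithm 1 produces a sequence of policies $\hat{\pi}_{1:N} = (\hat{\pi}_1, \ldots, \hat{\pi}_N)$, where $N$ is the number of training iterations, and returns the one with the best test-time performance on validation as $\hat{\pi}$ (Line 9). Recall that our algorithm samples a trajectory by executing actions from both the previously learned policy $\hat{\pi}_i$ and the expert policy $\pi^*$ (when an interaction is requested). Let $\pi_i$ denote such a “mixture” policy and $d_{\pi_i}$ its state distribution. The cost of the learned policy $\hat{\pi}$ can be bounded as:

$$J(\hat{\pi}) \leq \frac{T}{N} \sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \hat{\pi}_i)\right] \leq \frac{T}{N} \sum_{i=1}^{N} \left[ \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \hat{\pi}_i)\right] + \ell_{max} \lVert d_{\pi_i} - d_{\pi^*} \rVert_1 \right].$$

The above derivation shows that the bound comprises two parts. The first term calculates the expected training loss of $\hat{\pi}_i$. Notice that, in training, each trajectory is sampled from the mixture policy ($s \sim d_{\pi_i}$), while in evaluation, we measure a policy’s performance conditioned on a gold partial parse ($s \sim d_{\pi^*}$ in Eq. (2)). This discrepancy, which does not exist in the supervised approach, explains the performance loss of our algorithm, which is bounded by the second term $\ell_{max} \lVert d_{\pi_i} - d_{\pi^*} \rVert_1$, the distance between $d_{\pi_i}$ and $d_{\pi^*}$ weighted by the maximum loss value $\ell_{max}$ that $\hat{\pi}_i$ encounters over the training. Bounding the two terms gives the following theorem: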
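The decomposition described above, a training loss term plus a distribution distance weighted by the maximum loss, rests on a standard inequality: the expected loss under one state distribution is at most the expected loss under another plus the maximum loss times their L1 distance. The sketch below checks this numerically on made-up discrete distributions; all names and values are hypothetical.

```python
def expected_loss(dist, loss):
    """Expected loss under a discrete state distribution."""
    return sum(dist[s] * loss[s] for s in dist)

def l1_distance(p, q):
    """L1 distance between two distributions over the same states."""
    return sum(abs(p[s] - q[s]) for s in p)

# Hypothetical distributions over three states and per-state losses.
d_expert = {"s1": 0.5, "s2": 0.3, "s3": 0.2}   # evaluation (gold partial parses)
d_mixture = {"s1": 0.6, "s2": 0.3, "s3": 0.1}  # training (mixture policy)
loss = {"s1": 0.0, "s2": 0.2, "s3": 1.0}

lhs = expected_loss(d_expert, loss)
rhs = expected_loss(d_mixture, loss) + max(loss.values()) * l1_distance(d_mixture, d_expert)
print(lhs, rhs, lhs <= rhs)
```

The gap between the two sides shrinks as the mixture distribution approaches the expert distribution, which is the intuition behind controlling how often the policy is confident but wrong.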

Theorem 5.2.

For the proposed annotation-efficient imitation learning algorithm, if is , there exists a policy s.t. .

Here, denotes the best expected policy loss in hindsight, and denotes the probability that does not query the expert policy (i.e., being confident) but its own action is wrong under . A detailed derivation can be found in Appendix A.3A.4.

  • A comparison of Theorem 5.1 and Theorem 5.2 shows that the performance gap introduced by our algorithm is mainly bounded by the learning policy’s probability of trusting a confident action that turns out to be wrong. Intuitively this is because whenever a learning policy in our algorithm collects its own, but wrong, action as the gold one for training, it introduces noise that does not exist in the supervised approach’s training set. This finding inspires us to restrict the gap by lowering the learning policy’s error rate when it does not query the expert. Empirically this can be achieved by:

    • Accurate policy confidence estimation, such that actions regarded as confident are generally correct.

    • Moderate model initialization, such that generally the policy is less likely to make wrong actions throughout the iterative training.

    For the first point, we set a high confidence threshold $\mu$, which has been demonstrated to be reliable for MISP Yao et al. (2019b). In the future, it can even be replaced by a machine learning module (see a discussion in Section 7). We empirically validate the second point in our experiments.
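The trade-off behind the confidence threshold can be illustrated with a small sketch: raising the threshold lowers the rate of confident but wrong decisions at the price of more user interactions. The helper names and the (probability, correctness) pairs are hypothetical and purely illustrative.

```python
def confident_error_rate(predictions, threshold):
    """Fraction of decisions that are confident (p >= threshold) yet wrong."""
    if not predictions:
        return 0.0
    bad = sum(1 for p, correct in predictions if p >= threshold and not correct)
    return bad / len(predictions)

def query_rate(predictions, threshold):
    """Fraction of decisions that would trigger a user interaction."""
    if not predictions:
        return 0.0
    return sum(1 for p, _ in predictions if p < threshold) / len(predictions)

# Hypothetical (probability, is_correct) pairs measured on gold partial parses.
preds = [(0.99, True), (0.97, True), (0.96, False),
         (0.80, False), (0.70, True), (0.55, False)]

for mu in (0.5, 0.9, 0.98):
    print(mu, confident_error_rate(preds, mu), query_rate(preds, mu))
```

On this toy data the confident error rate falls from 0.5 to 0 as the threshold rises, while the share of decisions that need a user answer grows.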

6 Experiments

In this section, we conduct experiments to demonstrate the annotation efficiency of our algorithm (Section 4) and that it can train semantic parsers to reach high performance when the system is reasonably instantiated, consistent with our theoretical analysis in Section 5.

6.1 Experimental Setup

We test our system on the WikiSQL dataset Zhong et al. (2017). The dataset contains a large number of annotated question-SQL pairs (56,355 pairs for training) and thus serves as a good resource for experimenting with iterative learning. For the base semantic parser, we choose SQLova Hwang et al. (2019), one of the top-performing models on WikiSQL, to ensure a reasonable model capacity in terms of data utility along iterative training.

To instantiate the proposed algorithm, we set a high confidence threshold following Yao et al. (2019b) and experiment with different initialization settings as suggested by our analysis in Section 5, using 10%, 5% and 1% of the total training data. During iterative learning, questions from the remaining training data arrive in a random order to simulate user questions. The parser is trained iteratively on simulated user feedback (obtained by directly comparing the synthesized query with the gold one) after every $m$ questions. We test systems under different training iterations and report results averaged over three random runs. More implementation details are included in Appendix B.1.
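The simulation protocol can be sketched as a data split followed by batching. This is only an illustration of the described setup: `simulate_stream`, the seed, and the batch size are hypothetical, and the source does not say how the initialization subset is sampled.

```python
import random

def simulate_stream(train_set, init_fraction, batch_size, seed=0):
    """Split data into an initialization set and batches of simulated user questions."""
    rng = random.Random(seed)
    data = list(train_set)
    rng.shuffle(data)                        # questions arrive in random order
    n_init = int(len(data) * init_fraction)  # e.g., 10%, 5% or 1% of the data
    init, rest = data[:n_init], data[n_init:]
    batches = [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return init, batches

init, batches = simulate_stream(range(100), init_fraction=0.1, batch_size=30)
print(len(init), [len(b) for b in batches])  # 10 [30, 30, 30]
```

Each batch would correspond to one training iteration: the parser answers the batch interactively, the collected pairs are aggregated, and the policy is retrained.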

Figure 2: System parsing accuracy on the WikiSQL test set when systems are trained with various numbers of user/expert annotations (top) and for different iterations (bottom). We experiment with three initialization settings, using 10%, 5% and 1% of the training data respectively.

6.2 System Comparison

We compare our system (denoted as MISP-L since it builds a Learning algorithm upon MISP) with the traditional supervised approach (denoted as Full Expert). To investigate the skyline capability of our system, we also present a variant called MISP-L*, which assumes perfect confidence measurement and interaction design, so that it can precisely identify and correct its mistakes during parsing. This is implemented by allowing the system to compare its synthesized query with the gold one. Note that this is not a realized automatic system; we show its performance as an upper bound of MISP-L.

On the other hand, while the learning systems by Clarke et al. (2010) and Iyer et al. (2017), which request user validation on query execution results, may not be very practical for interacting with end users, we include them nonetheless in the interest of comprehensive comparison. This leads to two baseline systems. The Binary User system requests binary user feedback on whether executing the generated SQL query returns correct database results and collects only queries with correct execution results to further improve the parser, similar to Clarke et al. (2010). The Binary User+Expert system additionally collects full expert SQL annotations when the execution results of the generated SQL queries are wrong, similar to Iyer et al. (2017).
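The data collection rules of the two execution-feedback baselines can be sketched as below. This is a simplified illustration, assuming the binary feedback is simulated by comparing execution results with those of the gold query; the function names and the toy `execute` lookup are hypothetical.

```python
def collect_binary_user(predicted_sql, gold_sql, execute):
    """Keep the predicted query only if its execution result is judged correct."""
    if execute(predicted_sql) == execute(gold_sql):  # simulated yes/no feedback
        return predicted_sql
    return None

def collect_binary_user_expert(predicted_sql, gold_sql, execute):
    """As above, but fall back to a full expert annotation on a wrong result."""
    kept = collect_binary_user(predicted_sql, gold_sql, execute)
    return kept if kept is not None else gold_sql

# Toy executor (hypothetical): maps a query string to its result.
results = {"gold": [3], "same_result": [3], "wrong": [7]}
execute = results.get

print(collect_binary_user("wrong", "gold", execute))         # None
print(collect_binary_user_expert("wrong", "gold", execute))  # gold
print(collect_binary_user("same_result", "gold", execute))   # same_result
```

The last call shows why execution feedback can be noisy: a query that differs from the gold one is still kept whenever it happens to return the same result.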

6.3 Experimental Results

We evaluate each system by answering the two research questions (RQs):

  • RQ1: Can the system learn a semantic parser without requiring a large amount of annotations?

  • RQ2: For interactive systems, while requiring weaker supervision, can they train the parser to reach a performance comparable to the traditional supervised system?

For RQ1, we measure the number of user/expert annotations a system requires to train a parser. For Full Expert, this number is equal to the trajectory length of the gold query (e.g., 5 for the query in Figure 1); for MISP-L and MISP-L*, it is the number of user interactions during training. For Binary User(+Expert), it is hard to quantify “one annotation”, which varies according to the actual database size and the query difficulty. In experiments, we approximate this number by calculating it in the same way as Full Expert, with the assumption that in general validating an answer is as hard as validating the SQL query itself. More accurate metrics can be explored by conducting a user study, as we discussed in Section 7. Note that while we do not differentiate the actual cost (e.g., time and financial cost) of users and experts in this aspect, we emphasize that our system enjoys an additional benefit of collecting training examples from a much cheaper and more abundant source while serving end users’ needs at the same time.

Figure 2 (top) shows each system’s parsing accuracy on WikiSQL test set after they have been trained on certain amounts of annotations. Consistently under all initialization conditions, MISP-L consumes a comparable or smaller amount of annotations to train the parser to reach the same parsing accuracy. As shown in Figure 5 in Appendix, MISP-L requires an average of no more than one interaction for most questions along the iterative training. Given the limited size of WikiSQL training set, the simulation experiments currently can only show the system’s performance under a small number of annotations. However, we expect this gain to continue as it receives more user questions in the long-term deployment.

To answer RQ2, Figure 2 (bottom) compares each system’s parsing accuracy after they have been trained for the same number of iterations. The results demonstrate that when a semantic parser is moderately initialized (10%/5% initialization setting), MISP-L can further improve it to reach an accuracy comparable to Full Expert (0.776/0.761 vs. 0.794 in the last iteration). In the extremely weak 1% initialization setting (using only around 500 initial training examples), all interactive learning systems suffer from a huge performance loss. This is consistent with our finding in theoretical analysis (Section 5). In Appendix C, we plot the probability that the learned policy makes a confident but wrong decision given a gold partial parse, showing that a better initialized policy generally obtains a smaller value of this probability throughout the training and thus a tighter cost bound.

For both RQ1 and RQ2, our system surpasses Binary User, the execution feedback-based system. In experiments, we find that the inferior performance of Binary User is mainly due to the “spurious program” issue Guu et al. (2017), i.e., a SQL query having correct execution results can still be incorrect in terms of semantics. (Footnote 4: For example, contrast “WHERE C1=A” with “WHERE C1=A and C2=B”. They can give the same execution results when all records satisfying “C1=A” also meet “C2=B” by accident. However, semantically the latter includes an extra condition which may not be specified by the question.) MISP-L circumvents this issue by directly validating the semantic meaning of intermediate parsing decisions.
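The “spurious program” issue can be reproduced with a tiny in-memory database. The table, column names, and rows below are made up for illustration; the two queries follow the contrast between “WHERE C1=A” and “WHERE C1=A and C2=B” described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, c1 TEXT, c2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("r1", "A", "B"), ("r2", "A", "B"), ("r3", "Z", "B")],
)

intended = "SELECT name FROM t WHERE c1 = 'A' ORDER BY name"
spurious = "SELECT name FROM t WHERE c1 = 'A' AND c2 = 'B' ORDER BY name"

# On the current rows both queries return the same result by accident.
same_now = conn.execute(intended).fetchall() == conn.execute(spurious).fetchall()

# A new row with c1 = 'A' but c2 != 'B' separates the two queries.
conn.execute("INSERT INTO t VALUES ('r4', 'A', 'C')")
same_later = conn.execute(intended).fetchall() == conn.execute(spurious).fetchall()

print(same_now, same_later)  # True False
```

A user who only sees the first result cannot tell the two queries apart, so binary feedback on execution results would accept the spurious one as training data.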

Finally, when perfect interaction design and confidence estimation are assumed, MISP-L* shows striking superiority in both aspects. Since it always corrects wrong decisions immediately, MISP-L* can collect and derive the same training examples as Full Expert, and thus trains the parser to Full Expert’s performance level. Meanwhile, it requires only 10% of the annotations that Full Expert consumes. These observations imply large room for MISP-L to be improved in the future.

6.4 Generalize to Complex SQL Queries

Since queries in WikiSQL are generally simple and follow a pre-specified “SELECT … WHERE …” sketch, the last part of our experiments investigates whether our system can generalize to complex SQL queries. To this end, we test our system on the Spider dataset Yu et al. (2018), where SQL queries can contain complicated keywords like GROUP BY. For the base semantic parser, we choose EditSQL Zhang et al. (2019), one of the open-sourced top models on Spider. Given the small size of Spider (7,377 question-SQL query pairs for training after data cleaning), we only experiment with one initialization setting, using 10% of the training set. Since Spider models do not predict the specific values in a SQL query (e.g., “jalen rose” in Figure 1), we cannot execute the generated query to simulate the binary execution feedback. (Footnote 5: Yu et al. (2018) promoted that to encourage a focus on more fundamental semantic parsing issues. The evaluation does not count the specific values either.) Therefore, we only compare our system with the Full Expert baseline. Parsers are evaluated on the Spider Dev set since the test set is not publicly available. We include all implementation details in Appendix B.2.

Figure 3: System parsing accuracy on Spider Dev set when they are trained with various numbers of user/expert annotations and for different iterations.

Figure 3 (top) shows that our system and its variant consistently achieve comparable or better annotation efficiency. We expect this advantage to continue as the system receives more questions and interactions from users beyond the Spider dataset. However, we also notice that the gain is smaller and MISP-L suffers from a larger performance loss compared with Full Expert (Figure 3, bottom), due to the poor parser initialization and the SQL query complexity. This can be addressed by adopting better interaction designs and more accurate confidence estimation, as shown by MISP-L*.

7 Conclusion and Future Work

In this paper, we explore building an interactive semantic parser that continually improves itself from end user interaction, without involving annotators or developers. To this end, we propose an annotation-efficient imitation learning algorithm to learn from sparse, fine-grained demonstrations. We analyze the cost bound of the algorithm theoretically and show its advantage over the traditional full expert annotation approach via experiments.

As a pilot study on this research topic, we train systems with simulated user feedback. One important direction for future work is to conduct a large-scale user study and collect interactions from real users. This is not trivial and has to account for uncertainties such as noisy user feedback. By analyzing real users’ statistics (e.g., average time spent on each question), we believe a more accurate and realistic formulation of user/expert annotation cost can be derived to guide future research.

Besides, we would like to explore more accurate confidence measurement to improve our system, as suggested by our theoretical analysis. In experiments, we observe that the two neural semantic parsers (especially the more complicated EditSQL) tend to be overconfident, and training them with more data does not mitigate this issue. To address that, future directions include neural network calibration Guo et al. (2017) and using machine learning components (e.g., a reinforcement learning-based active selector Fang et al. (2017)) to replace the confidence threshold.

Finally, the proposed annotation-efficient imitation learning algorithm can be generalized to other NLP tasks Sokolov et al. (2016) and classical imitation learning problems Ross et al. (2011). We expect this algorithm to save human annotation effort particularly for low-resource tasks Mayhew et al. (2019).


Acknowledgments

This research was sponsored in part by the Army Research Office under cooperative agreements W911NF-17-1-0412, NSF Grant IIS1815674, Fujitsu gift grant, and Ohio Supercomputer Center (1987). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.


References

  • K. Azuma (1967) Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series 19 (3), pp. 357–367. Cited by: §A.5.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544. Cited by: §1.
  • Ohio Supercomputer Center (1987) Ohio Supercomputer Center. Cited by: Acknowledgments.
  • S. Chaurasia and R. J. Mooney (2017) Dialog for language to code. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 175–180. Cited by: §1, §2.
  • S. Chernova and M. Veloso (2009) Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research 34, pp. 1–25. Cited by: §2.
  • J. Clarke, D. Goldwasser, M. Chang, and D. Roth (2010) Driving semantic parsing from the world’s response. In Proceedings of the fourteenth conference on computational natural language learning, pp. 18–27. Cited by: §1, §2, §6.2.
  • H. Daumé, J. Langford, and D. Marcu (2009) Search-based structured prediction. Machine learning 75 (3), pp. 297–325. Cited by: §2, §4.1.
  • L. Duong, H. Afshar, D. Estival, G. Pink, P. R. Cohen, and M. Johnson (2018) Active learning for deep semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 43–48. Cited by: §2.
  • M. Fang, Y. Li, and T. Cohn (2017) Learning how to active learn: a deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 595–605. Cited by: §7.
  • Y. Gao, C. M. Meyer, and I. Gurevych (2018) APRIL: interactively learning to summarise by combining active preference learning and reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4120–4130. Cited by: §2.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. Cited by: §7.
  • I. Gur, S. Yavuz, Y. Su, and X. Yan (2018) DialSQL: dialogue based structured query generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1339–1349. Cited by: §1, §2.
  • K. Guu, P. Pasupat, E. Liu, and P. Liang (2017) From language to programs: bridging reinforcement learning and maximum marginal likelihood. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1051–1062. Cited by: §6.3.
  • B. Hancock, A. Bordes, P. Mazare, and J. Weston (2019) Learning from dialogue after deployment: feed yourself, chatbot!. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3667–3684. Cited by: §2.
  • E. Hazan, A. Agarwal, and S. Kale (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning 69 (2-3), pp. 169–192. Cited by: §A.3.
  • L. He, J. Michael, M. Lewis, and L. Zettlemoyer (2016) Human-in-the-loop parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2337–2342. Cited by: §2.
  • W. Hoeffding (1994) Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Cited by: §A.5.
  • W. Hwang, J. Yim, S. Park, and M. Seo (2019) A comprehensive exploration on WikiSQL with table-aware word contextualization. arXiv preprint arXiv:1902.01069. Cited by: §3, §6.1.
  • S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer (2017) Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 963–973. Cited by: §1, §2, §6.2.
  • K. Judah, A. P. Fern, T. G. Dietterich, and P. Tadepalli (2014) Active imitation learning: formal and practical reductions to iid learning. Journal of Machine Learning Research 15, pp. 4105–4143. Cited by: §2.
  • S. M. Kakade and A. Tewari (2009) On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pp. 801–808. Cited by: §A.3.
  • B. Kim and J. Pineau (2013) Maximum mean discrepancy imitation learning. Robotics: Science and systems. Cited by: §2.
  • J. Kreutzer and S. Riezler (2019) Self-regulated interactive sequence-to-sequence learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 303–315. Cited by: §2, §4.2.
  • F. Li and H. Jagadish (2014) Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment 8 (1), pp. 73–84. Cited by: §1, §2.
  • N. Lomas (2019) Google ordered to halt human review of voice AI recordings over privacy risks. Note: 2020-04-28 Cited by: §1.
  • S. Mayhew, S. Chaturvedi, C. Tsai, and D. Roth (2019) Named entity recognition with partially annotated training data. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 645–655. Cited by: §7.
  • K. Nguyen, H. Daumé III, and J. Boyd-Graber (2017) Reinforcement learning for bandit neural machine translation with simulated human feedback. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1464–1474. Cited by: §2.
  • A. Ni, P. Yin, and G. Neubig (2020) Merging weak and active supervision for semantic parsing. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), New York, USA. Cited by: §2.
  • K. Nigam and R. Ghani (2000) Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, pp. 86–93. Cited by: §4.2.
  • P. Petrushkov, S. Khadivi, and E. Matusov (2018) Learning from chunk-based feedback in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 326–331. Cited by: §2, §4.2.
  • S. Ross and D. Bagnell (2010) Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668. Cited by: §A.1, §2, §4.1, §5.
  • S. Ross and J. A. Bagnell (2014) Reinforcement and imitation learning via interactive no-regret learning. ArXiv abs/1406.5979. Cited by: §2, §4.1.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §A.1, Appendix A, An Imitation Game for Learning Semantic Parsers from User Interaction, §1, §2, §4.1, §4.2, §5.1, §5, §7.
  • A. Sokolov, J. Kreutzer, C. Lo, and S. Riezler (2016) Learning structured predictors from bandit feedback for interactive nlp. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1610–1620. Cited by: §2, §7.
  • Y. Su, A. H. Awadallah, M. Khabsa, P. Pantel, M. Gamon, and M. Encarnacion (2017) Building natural language interfaces to web apis. In Proceedings of the International Conference on Information and Knowledge Management, Cited by: §1.
  • Y. Su, A. H. Awadallah, M. Wang, and R. W. White (2018) Natural language interfaces with fine-grained user interaction: a case study on web APIs.. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §1, §2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.
  • S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston (2020) Neural text generation with unlikelihood training. In International Conference on Learning Representations, Cited by: §4.2.
  • W. A. Woods (1973) Progress in natural language understanding: an application to lunar geology. In Proceedings of the American Federation of Information Processing Societies Conference, Cited by: §1.
  • Z. Yao, X. Li, J. Gao, B. Sadler, and H. Sun (2019a) Interactive semantic parsing for if-then recipes via hierarchical reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2547–2554. Cited by: §2.
  • Z. Yao, Y. Su, H. Sun, and W. Yih (2019b) Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5450–5461. Cited by: §B.1, §B.1, §1, §2, §3, §5.3, §6.1.
  • T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al. (2018) Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911–3921. Cited by: §1, §1, §6.4, footnote 5.
  • L. S. Zettlemoyer and M. Collins (2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 658–666. Cited by: §1.
  • J. Zhang and K. Cho (2017) Query-efficient imitation learning for end-to-end simulated driving. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.
  • R. Zhang, T. Yu, H. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, and D. Radev (2019) Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5341–5352. Cited by: §B.2, §6.4.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103. Cited by: §1, §6.1.

Appendix A Theoretical Analysis in Infinite Sample Case

In this section, we give a detailed theoretical analysis to derive the cost bound of the supervised approach and our proposed annotation-efficient imitation learning algorithm in Section 4.

Following Ross et al. (2011), we first focus the proof on an infinite sample case, which assumes an infinite number of samples to train a policy in each iteration of Algorithm 1. As an overview, we start the analysis in Appendix A.1 by introducing the “cost function” we use to analyze each policy, which represents an inverse quality of a policy. In Appendix A.2, we derive the bound of the cost of the supervised approach. Appendix A.3 and Appendix A.4 then discuss the cost bound of our proposed algorithm. Finally, in Appendix A.5, we show the cost bound of our algorithm in the finite sample case.

A.1 Cost Function for Analysis

In a semantic parsing task, whenever a policy action is different from the gold one, the whole trajectory cannot yield the correct semantic meaning. Therefore, we analyze a policy’s performance only when it is conditioned on a gold partial parse, i.e., , where is the state distribution in step when executing the expert policy for first -1 steps. Given a question and denoting as the gold partial trajectory sampled by the expert policy , we define the cost of sampling a partial trajectory as:

Based on this definition, we further define the expected cost of in a single time step , given the question and the gold partial parse , as:

where denotes the probability of sampling a gold action from the policy . By taking an expectation over all questions , we have the following derivations:

The second equality holds by the definition . In this analysis, we follow Ross and Bagnell (2010); Ross et al. (2011) to assume a unified decision length . By summing up the above expected cost over the steps, we define the total cost of executing policy for steps as:

Denote as the “loss function” in our analysis, which is bounded within , then the cost of policy can be simplified as:


where is the average expert state distribution, when we assume the time step to be a random variable under the uniform distribution (the second equality).

The better a policy is, the smaller this cost becomes. Our analysis thus compares each policy by deriving the “bound” of their costs.
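As a concrete reading of this cost, the sketch below assumes that the loss at each step is one minus the probability that the policy assigns to the gold action (given a gold partial parse), and that the total cost sums this loss over the decision steps. The function names and the probabilities are hypothetical and only serve to illustrate that a better policy yields a smaller cost.

```python
from typing import Sequence


def step_loss(p_gold: float) -> float:
    """Loss at one step: probability of NOT sampling the gold action."""
    if not 0.0 <= p_gold <= 1.0:
        raise ValueError("p_gold must be a probability in [0, 1]")
    return 1.0 - p_gold


def total_cost(p_gold_per_step: Sequence[float]) -> float:
    """Sum of step losses, each conditioned on a gold partial parse."""
    return sum(step_loss(p) for p in p_gold_per_step)


# Hypothetical gold-action probabilities over three decision steps.
weak = total_cost([0.6, 0.7, 0.5])
strong = total_cost([0.95, 0.9, 0.99])
print(weak > strong)
```

Each step loss lies in [0, 1], so under this assumption the total cost of a trajectory with a fixed number of steps is bounded by that number of steps.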

A.2 Derivation of Cost Bound for Supervised Approach

In this section, we analyze the cost bound for the supervised approach. Recall that the supervised approach trains a policy using the standard supervised learning algorithm with supervision from the expert policy at every decision step. Therefore, it finds the best policy on infinite samples as:


where denotes the policy space induced by the model architecture, and the expectation over is sampled from the whole state space because of the “infinite sample” assumption. The supervised approach thus obtains the following cost bound:

This gives the following theorem:

Theorem A.1.

For supervised approach, let , then .

The cost bound of the supervised approach represents its exact performance as implied by the equality. This is because the approach trains a policy (Eq. (4)) under the same state distribution (given the “infinite sample” assumption) as in evaluation (Eq. (3)). As we will show next, the proposed annotation-efficient imitation learning algorithm breaks this consistency while enjoying the benefit of high annotation efficiency, which explains the performance gap.

A.3 No-regret Assumption

The derivation of our proposed annotation-efficient imitation learning algorithm’s cost bound leverages a “no-regret” assumption:

Assumption A.1.

No-regret assumption. Define and , then

for (usually ).

Many no-regret algorithms Hazan et al. (2007); Kakade and Tewari (2009) that provide this guarantee require convexity or strong convexity of the loss function. However, the loss function used in our application, which is built on top of a deep neural network model, does not satisfy this requirement. In this analysis, we simplify the setting and directly make this assumption for convenience of the proof. A more accurate regret bound for non-convex neural networks can be researched in the future.

Another concern is that the collected online training labels come from not only the expert policy (when it is queried), but also the learning policy (when the agent has high confidence in its policy action). Labels from the learning policy may introduce noise when the model is fit to the expert policy. However, in practice the impact of such noisy labels is limited when the confidence threshold is set at a high value (e.g., 0.95). In this case, labels from the learning policy are generally clean and lead to increasing performance during iterative training. Therefore, it is still safe to make this no-regret assumption.
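The mixing of labels described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `collect_labels`, the `(action, confidence)` step format, and `expert_action` are hypothetical names, the threshold 0.95 is the example value from the text, and the sketch ignores that later predictions depend on earlier corrected decisions.

```python
from typing import Callable, List, Tuple


def collect_labels(
    policy_steps: List[Tuple[str, float]],
    expert_action: Callable[[int], str],
    threshold: float = 0.95,
) -> Tuple[List[str], int]:
    """Return training labels for one trajectory and the number of expert queries.

    policy_steps: (predicted action, confidence) for each decision step.
    expert_action: returns the demonstrated action for a step index.
    """
    labels: List[str] = []
    n_queries = 0
    for t, (action, confidence) in enumerate(policy_steps):
        if confidence >= threshold:
            # Confident: keep the policy's own prediction as the label.
            labels.append(action)
        else:
            # Uncertain: query the expert for a demonstration.
            labels.append(expert_action(t))
            n_queries += 1
    return labels, n_queries


# Hypothetical three-step parse; only the second step is uncertain.
gold = ["SELECT name", "WHERE player", "= 'jalen rose'"]
steps = [("SELECT name", 0.99), ("WHERE team", 0.60), ("= 'jalen rose'", 0.97)]
labels, n_queries = collect_labels(steps, lambda t: gold[t])
print(labels == gold, n_queries)
```

In this sketch a confident but wrong prediction would be kept as a noisy label, which is the case a high threshold is meant to make rare.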

A.4 Derivation of Cost Bound for Our Proposed Algorithm

As shown in Algorithm 1, our algorithm produces a sequence of policies , where is the number of training iterations, and the algorithm returns the one with the best test-time performance on validation as . In training, our algorithm executes actions from both the learning policy (when the model is confident) and the expert policy . We denote this “mixture” policy as . Then for the first iterations, we have the cost bound of our algorithm as:


From the last inequality, we can see that the cost bound of our algorithm is restricted by two terms. The first term denotes the expected loss of under the state induced by during training (under the “infinite sample” assumption, as mentioned in the beginning of the analysis). By applying the no-regret assumption (Assumption A.1), this term can be bound by . Here, denotes the best expected training loss in hindsight.

The second term denotes the distance between state distributions induced by and , weighted by the maximum loss value that encounters over the training. As we notice, unlike the supervised approach, our algorithm trains a policy under , which is different from the state distribution used to evaluate the policy (Eq. (3)). This discrepancy explains the performance loss of our algorithm compared to the supervised approach and is bounded by the aforementioned distance. To further bound this term, we define as the probability that makes a confident (i.e., without querying the expert policy) but wrong action under , and introduce the following lemma:

Lemma A.1.



Let be the probability of querying the expert policy under , the error rate of under , and any state distribution besides . We can then express by:

The distance between and thus becomes

The second inequality uses , which holds when . ∎

By applying Assumption A.1 and Lemma A.1 to Eq. (3), we derive the following inequality:

Given a large enough (), by the no-regret assumption, we can further simplify the above as:

which leads to our theorem:

Theorem A.2.

For our proposed annotation-efficient imitation learning algorithm, if is , there exists a policy s.t. .

In experiments, we consider a skyline instantiation of the proposed algorithm, called MISP-L*. This instantiation assumes perfect confidence estimation and interaction design, such that it can precisely detect and correct its intermediate mistakes during parsing. Therefore, MISP-L* presents an upper bound performance (i.e., the tightest cost bound) of our algorithm. This can be interpreted theoretically. In fact, for MISP-L*, the probability of a confident but wrong action is always zero since the system has ensured that its policy action is correct when it does not query the expert policy. In this case, , so . Therefore, according to Theorem A.2, MISP-L* has a cost bound of:

where .

By comparing this bound with the cost bound in Theorem A.1, it is observed that MISP-L* shares the same cost bound as the supervised approach (except for the inequality relation and the constant). This is explainable since MISP-L* indeed collects exactly the same training labels (from ) as the supervised approach.

A.5 Cost Bound of Our Proposed Algorithm in Finite Sample Case

The theorems in previous sections hold when the algorithm observes infinite trajectories. However, in practice, our algorithm will observe the training loss from only a finite set of trajectories at each iteration using . For this consideration, in the following discussion, we provide a proof of the cost bound of our proposed algorithm under the finite sample case.

In the finite sample setting, our algorithm observes the training loss from a finite number of trajectories. We define as the trajectories collected in the iteration. In every iteration, the algorithm observes loss . By the no-regret assumption (Assumption A.1), the average observed loss for each iterations can still be bounded by the following inequality: We use to denote the loss of the best policy on the finite samples.

Following Eq. (A.4), we need to switch the derivation from the expected loss of over (i.e., ) to that over (i.e., ), the actual state distribution that is trained on. To fill this gap, we introduce to denote the difference between the expected loss of under and the average loss of under the sample trajectory with at iteration . The random variables over all and are all zero mean, bounded in and form a martingale in the order of . By Azuma-Hoeffding’s inequality Azuma (1967); Hoeffding (1994), with probability . Following the derivations in Eq. (A.4) and by introducing , with probability of , we obtain the following inequalities by definition:

Notice that we need to be at least , so that and are negligible. This leads to the following theorem:

Theorem A.3.

For our proposed annotation-efficient imitation learning algorithm, with probability at least , when is , there exists a policy s.t. .

The above shows that the cost of our algorithm can still be bounded in the finite sample setting. Comparing this bound with the bound under the infinite sample setting, we can observe that the bound is still related to , the probability that takes a confident but incorrect action under .

Appendix B Implementation Details

B.1 Interactive Semantic Parsing Framework

Our system assumes an interactive semantic parsing framework to collect user feedback. In experiments, this is implemented by adapting MISP Yao et al. (2019b), an open-sourced framework that has demonstrated a strong ability to improve test-time parsing accuracy. In this framework, an agent comprises three components: a world model that wraps the base semantic parser and a feedback incorporation module to interpret user feedback and update the semantic parse, an error detector that decides whether to request user intervention, and an actuator that delivers the agent’s request by asking a natural language question, such that users without domain expertise can understand.

We follow MISP’s instantiation for text-to-SQL tasks to adopt a probability-based uncertainty estimator as the error detector, which triggers user interactions when the probability of the current decision is lower than a threshold. (Footnote 7: While a dropout-based error detector is also possible, empirically we found it much slower than the probability-based one and thus is not preferable.) The actuator is instantiated by a grammar-based natural language generator. We use the latest version of MISP that allows multi-choice interactions to improve the system efficiency, i.e., when the parser’s current decision is validated as wrong, the system presents multiple alternative options for user selection. An additional “None of the above options” option is included in case all top options from the system are wrong. Figure 1 shows an example of the user interaction. From there, the system can derive a correct decision to address its uncertainty (e.g., taking “Player” as a WHERE column).
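The probability-based error detector and the multi-choice interaction can be sketched as below. This is a minimal illustration and not MISP's actual API: the function names, the candidate columns other than “Player”, their probabilities, and the number of options `k` are all assumptions made for the example.

```python
from typing import Dict, List

NONE_OF_ABOVE = "None of the above options"


def needs_interaction(prob: float, threshold: float) -> bool:
    """Probability-based error detector: ask the user when confidence is low."""
    return prob < threshold


def build_choices(candidates: Dict[str, float], rejected: str, k: int = 3) -> List[str]:
    """Top-k alternatives (excluding the rejected decision) plus a fallback option."""
    ranked = sorted(
        (c for c in candidates if c != rejected),
        key=lambda c: candidates[c],
        reverse=True,
    )
    return ranked[:k] + [NONE_OF_ABOVE]


# Hypothetical probabilities for the WHERE column decision.
candidates = {"Team": 0.48, "Player": 0.40, "Position": 0.09, "Nationality": 0.03}
top = max(candidates, key=candidates.get)
ask = needs_interaction(candidates[top], threshold=0.95)
choices = build_choices(candidates, rejected=top, k=2)
print(top, ask, choices)
```

Here the top decision falls below the threshold, so the user is asked to validate it. If the user rejects it, the remaining candidates are offered in order of probability, with the fallback option last.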

Our experiments train each system with simulated user feedback. To this end, we build a user simulator similar to the one used by Yao et al. (2019b), which can access the ground-truth SQL queries. It gives a yes/no answer or selects a choice by directly comparing the sampled policy action with the true one in the gold query.
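A user simulator of this kind can be sketched in a few lines. The functions below are hypothetical and assume that policy actions and gold actions are represented as comparable strings. The simulator of Yao et al. (2019b) may differ in its details.

```python
from typing import List, Optional


def answer_yes_no(sampled_action: str, gold_action: str) -> bool:
    """Simulated yes/no feedback: 'yes' only if the action matches the gold one."""
    return sampled_action == gold_action


def select_choice(options: List[str], gold_action: str) -> Optional[int]:
    """Index of the gold action among the options, or None for 'none of the above'."""
    try:
        return options.index(gold_action)
    except ValueError:
        return None


gold = "WHERE Player"
confirmed = answer_yes_no("WHERE Team", gold)
picked = select_choice(["WHERE Player", "WHERE Position"], gold)
missing = select_choice(["WHERE Position", "WHERE Nationality"], gold)
print(confirmed, picked, missing)
```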

B.2 EditSQL Experiment Details

In the data preprocessing step, EditSQL Zhang et al. (2019) transforms each gold SQL query into a sequence of tokens, where the From clause is removed and each column Col is prepended by its paired table name, i.e., Tab.Col. However, we observe that sometimes this transformation is not invertible. For example, consider the question “what are the first name and last name of all candidates?” and its gold SQL query: “SELECT T2.first_name , T2.last_name FROM candidates AS T1 JOIN people AS T2 ON T1.candidate_id = T2.person_id”. EditSQL transforms this query into: “select people.first_name , people.last_name”. The transformed sequence accidentally removes the information about table candidates in the original SQL query, leading to semantic meaning inconsistent with the question. When using such erroneous sequences as the gold targets in model training, we cannot simulate consistent user feedback, e.g., when the user is asked whether her query is relevant to the table candidates, the simulated user cannot give an affirmative answer given the transformed sequence. To avoid inconsistent user feedback, we remove from the training data the question-SQL pairs whose transformed sequence is inconsistent with the original gold SQL query. This reduces the size of the training set from 8,421 to 7,377. The validation set is kept untouched for fair evaluation.
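One possible way to detect such inconsistent pairs is to compare the tables named in the gold query with the tables that survive in the transformed sequence. The sketch below is an assumption about how this check could be done, not the procedure used in the paper: the function names are hypothetical, and the regular expressions only handle simple, non-nested queries.

```python
import re
from typing import Set


def tables_in_sql(sql: str) -> Set[str]:
    """Table names that follow FROM or JOIN in a simple, non-nested SQL query."""
    found = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, flags=re.IGNORECASE)
    return {name.lower() for name in found}


def tables_in_sequence(seq: str) -> Set[str]:
    """Table names that appear as the prefix of Tab.Col tokens."""
    found = re.findall(r"\b([A-Za-z_]\w*)\.[A-Za-z_]\w*", seq)
    return {name.lower() for name in found}


def is_consistent(gold_sql: str, transformed: str) -> bool:
    """True if the transformed sequence mentions exactly the gold query's tables."""
    return tables_in_sql(gold_sql) == tables_in_sequence(transformed)


gold_sql = (
    "SELECT T2.first_name , T2.last_name FROM candidates AS T1 "
    "JOIN people AS T2 ON T1.candidate_id = T2.person_id"
)
transformed = "select people.first_name , people.last_name"
consistent = is_consistent(gold_sql, transformed)
print(consistent)
```

For the example from the text, the gold query names both `candidates` and `people` while the transformed sequence keeps only `people`, so the pair would be filtered out under this heuristic.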

The implementation of interactive semantic parsing for EditSQL is the same as Section B.1, except that, in order to cope with the complicated structure of Spider SQL queries, for columns in WHERE, GROUP BY, ORDER BY and HAVING clauses, we additionally provide an option for the user to “remove” the clause, e.g., removing a WHERE clause by picking the “The system does not need to consider any conditions.” option. The confidence threshold is 0.995 as we observe that EditSQL tends to be overconfident.

Appendix C Additional Experimental Results

C.1 SQLova Results on Dev set

Figure 4 shows MISP-L’s performance on WikiSQL validation set. We also show in Figure 5 the average number of annotations (i.e., user interactions) per question during the iterative training. Overall, as the base parser is further trained, the system tends to request fewer user interactions. In most cases throughout the training, the system requests no more than one user interaction, demonstrating the annotation efficiency of our algorithm.

Figure 4: System parsing accuracy on WikiSQL validation set when they are trained with various numbers of user/expert annotations (top) and for different iterations (bottom). We experiment with systems under three initialization settings, using 10%, 5% and 1% of the training data respectively.
Figure 5: Average number of user annotations per question along training iterations (on WikiSQL), when the parser is initialized using 10%, 5% and 1% of training data.

C.2 SQLova Results in Theoretical Analysis

As we proved in Section 5, the performance gap between our proposed algorithm and the supervised approach is mainly decided by , an average probability that makes a confident but wrong decision under (i.e., given a gold partial parse) over training iterations. More specifically, from our proof of Lemma A.1, can be expressed as:

where denotes policy ’s conditional error rate under when it does not query the expert (i.e., being confident about its own action) at step