1 Introduction
Table 1: Dataset statistics. Avg and Median refer to the size of the precomputed solution set $|Z|$.
Task & Dataset | # Train | # Dev | # Test | Avg | Median
1. Multi-mention reading comprehension
TriviaQA (Joshi et al., 2017) | 61,888 | 7,993 | 7,701 | 2.7 | 2
NarrativeQA (Kočiskỳ et al., 2018) | 32,747 | 3,461 | 10,557 | 4.3 | 5
TriviaQA-open (Joshi et al., 2017) | 78,785 | 8,837 | 11,313 | 6.7 | 4
NaturalQuestions-open (Kwiatkowski et al., 2019) | 79,168 | 8,757 | 3,610 | 1.8 | 1
2. Reading comprehension with discrete reasoning
DROP (Dua et al., 2019) | 46,973 | 5,850 |  | 8.2 | 3
3. Semantic Parsing
WikiSQL (Zhong et al., 2017) | 56,355 | 8,421 | 15,878 | 346.1 | 5
A natural setting in many question answering (QA) tasks is to provide weak supervision to determine how the question should be answered given the evidence text. For example, as seen in Figure 1, TriviaQA answers are entities that can be mentioned multiple times in supporting documents, while DROP answers can be computed by deriving many different equations from numbers in the reference text. Such weak supervision is attractive because it is relatively easy to gather, allowing for large datasets, but it complicates learning because there are many different spurious ways to derive the correct answer. It is natural to model such ambiguities with a latent variable during learning, but most prior work on reading comprehension has instead focused on the model architecture and used heuristics to map the weak signal to full supervision (e.g. by selecting the first answer span in TriviaQA (Joshi et al., 2017; Tay et al., 2018; Talmor and Berant, 2019)). Some models are trained with maximum marginal likelihood (MML) (Kadlec et al., 2016; Swayamdipta et al., 2018; Clark and Gardner, 2018; Lee et al., 2019), but it is unclear whether it gives a meaningful improvement over the heuristics.
In this paper, we show it is possible to formulate a wide range of weakly supervised QA tasks as discrete latent-variable learning problems. First, we define a solution to be a particular derivation of a model to predict the answer (e.g. a span in the document or an equation to compute the answer). We demonstrate that for many recently introduced tasks, which we group into three categories as given in Table 1, it is relatively easy to precompute a discrete, task-specific set of possible solutions that contains the correct solution along with a modest number of spurious options. The learning challenge is then to determine which solution in the set is the correct one, while estimating a complete QA model.
We model the set of possible solutions as a discrete latent variable, and develop a learning strategy that uses hard-EM-style parameter updates. This algorithm repeatedly (i) predicts the most likely solution according to the current model from the precomputed set, and (ii) updates the model parameters to further encourage its own prediction. Intuitively, these hard updates more strongly enforce our prior beliefs that there is a single correct solution. This method can be applied to any problem that fits our weak supervision assumptions and can be used with any model architecture.
We experiment on six different datasets (Table 1) using strong task-specific model architectures (Devlin et al., 2019; Dua et al., 2019; Hwang et al., 2019). Our learning approach significantly outperforms previous methods which use heuristic supervision and MML updates, including absolute gains of 2–10%, and achieves the state-of-the-art on five datasets. It outperforms recent state-of-the-art reward-based semantic parsing algorithms (Liang et al., 2018; Agarwal et al., 2019) by 13% absolute percentage on WikiSQL, strongly suggesting that having a small precomputed set of possible solutions is a key ingredient. Finally, we present a detailed analysis showing that, in practice, the introduction of hard updates encourages models to assign much higher probability to the correct solution.
2 Related Work
Reading Comprehension.
Large-scale reading comprehension (RC) tasks that provide full supervision for answer spans (Rajpurkar et al., 2016) have seen significant progress recently (Seo et al., 2017; Xiong et al., 2018; Yu et al., 2018a; Devlin et al., 2019). More recently, the community has moved towards more challenging tasks such as distantly supervised RC (Joshi et al., 2017), RC with free-form human-generated answers (Kočiskỳ et al., 2018) and RC requiring discrete or multi-hop reasoning (Dua et al., 2019; Yang et al., 2018). These tasks introduce new learning challenges since the gold solution that is required to answer the question (e.g. a span or an equation) is not given.
Nevertheless, not much work has been done on this particular learning challenge. Most work on RC focuses on the model architecture and simply chooses the first span or a random span from the document (Joshi et al., 2017; Tay et al., 2018; Talmor and Berant, 2019), rather than modeling this uncertainty as a latent choice. Others maximize the sum of the likelihood of multiple spans (Kadlec et al., 2016; Swayamdipta et al., 2018; Clark and Gardner, 2018; Lee et al., 2019), but it is unclear whether it gives a meaningful improvement. In this paper, we highlight the learning challenge and show that our learning method, independent of the model architecture, can give a significant gain. Specifically, we assume that one of the mentions is related to the question and the others are false positives because (i) this happens in most cases, as in the first example in Table 2, and (ii) even in the case where multiple mentions contribute to the answer, there is often a single span which fits the question best.
Semantic Parsing.
Latent-variable learning has been extensively studied in the literature of semantic parsing (Zettlemoyer and Collins, 2005; Clarke et al., 2010; Liang et al., 2013; Berant et al., 2013; Artzi and Zettlemoyer, 2013). For example, a question and an answer pair $(x, y)$ is given but the logical form that is used to compute the answer is not. Two common learning paradigms are maximum marginal likelihood (MML) and reward-based methods. In MML, the objective maximizes $\sum_{z \in \hat{Z}} P(z \mid x)$, where $\hat{Z}$ is an approximation of a set of logical forms executing $y$ (Liang et al., 2013; Berant et al., 2013; Krishnamurthy et al., 2017). In reward-based methods, a reward function is defined as a prior, and the model parameters are updated with respect to it (Iyyer et al., 2017; Liang et al., 2017, 2018). Since it is computationally expensive to obtain a precomputed set in semantic parsing, these methods typically recompute the set of logical forms with respect to the beam at every parameter update. In contrast, our learning method targets tasks for which a set of solutions can be precomputed, which include many recent QA tasks such as reading comprehension, open-domain QA and a recent SQL-based semantic parsing task (Zhong et al., 2017).
3 Method
In this section, we first formally define our general setup, which we will instantiate for specific tasks in Section 4 and then we describe our learning approach.
3.1 Setup
Let $x$ be the input of a QA system (e.g. a question and a document) and $y$ be the answer text (e.g. ‘Robert Schumann’ or ‘4’). We define a solution $z$ as a particular derivation that a model is supposed to produce for the answer prediction (e.g. a span in the document or an equation to compute the answer, see Table 2). Let $f$ denote a task-specific, deterministic function which maps a solution to the textual form of the answer (e.g. by simply returning the string associated with a particular selected mention or solving an equation to get the final number, see Table 2). Our goal is to learn a model $P(z \mid x; \theta)$ (with parameters $\theta$) which takes an input $x$ and outputs a solution $z$ such that $f(z) = y$.
In a fully supervised scenario, a true solution $\bar{z}$ is given, and $\theta$ is estimated based on a collection of $(x, \bar{z})$ pairs. In this work, we focus on a weakly supervised setting in which $\bar{z}$ is not given, and we define $Z_{tot}$ as a finite set of all the possible solutions. In the case that the search space is very large or infinite, we can usually approximate $Z_{tot}$ with high coverage in practice. Then, we obtain $Z = \{z \in Z_{tot} : f(z) = y\}$ by enumerating all $z \in Z_{tot}$. This results in a set of all the possible solutions that lead to the correct answer. We assume it contains one solution that we want to learn to produce, and potentially many other spurious ones. In practice, $Z_{tot}$ is defined in a task-specific manner, as we will see in Section 4.
At inference time, the model produces a solution $\tilde{z} \in Z_{tot}$ from an input $x$ with respect to $\theta$ and predicts the final answer as $f(\tilde{z})$. Note that we cannot compute $Z$ at inference time because the ground-truth $y$ is not given. (Footnote 2: This is a main difference from multi-instance learning (MIL) (Zhou et al., 2009), since a bag of input instances is given at inference time in MIL.)
3.2 Learning Method
In a fully-supervised setting where $\bar{z}$ is given, we can learn $\theta$ by optimizing the negative log likelihood of $\bar{z}$ given the input $x$ with respect to $\theta$:
$$J_{Sup}(\theta \mid x, \bar{z}) = -\log P(\bar{z} \mid x; \theta)$$
In our weak supervision scenario, the model has access to $x$ and $Z$, and the selection of the best solution in $Z$ can be modeled as a latent variable. We can compute the maximum marginal likelihood (MML) estimate, which marginalizes the likelihood of each $z_i \in Z$ given the input $x$ with respect to $\theta$. Formally,
$$P(y \mid x; \theta) = \sum_{z_i \in Z} P(z_i \mid x; \theta)$$
is used to compute the objective as follows:
$$J_{MML}(\theta \mid x, Z) = -\log \sum_{z_i \in Z} P(z_i \mid x; \theta)$$
However, there are two major problems with the MML objective in our settings. First, MML can be maximized by assigning high probabilities to any subset of $Z$, whereas in our problems, instances in $Z$ other than the one correct $z$ are spurious solutions to which the model should ideally assign very low probability. Second, in MML we optimize the sum over probabilities of $Z$ during training but typically predict the maximum probability solution during inference, creating a discrepancy between training and testing.
Table 2: Examples of the input, the answer text ($y$), $f$, $Z_{tot}$ and $Z$ for each task.
1. Multi-Mention Reading Comprehension (TriviaQA, NarrativeQA, TriviaQA-open & NaturalQuestions-open)
Question: Which composer did pianist Clara Wieck marry in 1840?
Document: Robert Schumann was a German composer and influential music critic. He is widely regarded as one of the greatest composers of the Romantic era. (…) Robert Schumann himself refers to it as “an affliction of the whole hand”. (…) Robert Schumann is mentioned in a 1991 episode of Seinfeld “The Jacket”. (…) Clara Schumann was a German musician and composer, considered one of the most distinguished pianists of the Romantic era. Her husband was the composer Robert Schumann. Childhood (…) At the age of eight, the young Clara Wieck performed at the Leipzig home of Dr. Ernst Carus. There she met another gifted young pianist who had been invited to the musical evening, named Robert Schumann, who was nine years older. Schumann admired Clara’s playing so much that he asked permission from his mother to discontinue his law studies. (…) In the spring of 1853, the then unknown 20-year-old Brahms met Joachim in Hanover, made a very favorable impression on him, and got from him a letter of introduction to Robert Schumann.
Answer ($y$): Robert Schumann
$f$: Text match
$Z_{tot}$: All spans in the document
$Z$: Spans which match ‘Robert Schumann’ (the six mentions in the document)
2. Reading Comprehension with Discrete Reasoning (DROP)
Question: How many yards longer was Rob Bironas’ longest field goal compared to John Carney’s only field goal?
Document: (…) The Chiefs tied the game with QB Brodie Croyle completing a 10 yard td pass to WR Samie Parker. Afterwards the Titans responded with Kicker Rob Bironas managing to get a 37 yard field goal. Kansas city would take the lead prior to halftime with croyle completing a 9 yard td pass to FB Kris Wilson. In the third quarter Tennessee would draw close as Bironas kicked a 37 yard field goal. The Chiefs answered with kicker John Carney getting a 36 yard field goal. Afterwards the Titans would retake the lead with Young and Williams hooking up with each other again on a 41 yard td pass. (…) Tennessee clinched the victory with Bironas nailing a 40 yard and a 25 yard field goal. With the win the Titans kept their playoff hopes alive at 8-6.
Answer ($y$): 4
$f$: Equation executor
$Z_{tot}$: Equations with two numeric values and one arithmetic operation
$Z$: { 41-37, 40-36, 10-6, … }
3. SQL Query Generation (WikiSQL)
Question: What player played guard for Toronto in 1996-1997?
Table Header: player, year, position, ...
Answer ($y$): John Long
$f$: SQL executor
$Z_{tot}$: Non-nested SQL queries with up to 3 conditions
$Z$: Select player where position=guard and year in toronto=1996-97
Select max(player) where position=guard and year in toronto=1996-97
Select min(player) where position=guard
Select min(player) where year in toronto=1996-97
Select min(player) where position=guard and year in toronto=1996-97
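We introduce a learning strategy with a hard-EM approach. First, the model computes the likelihood of each $z_i$ given the input $x$ with respect to $\theta$, $P(z_i \mid x; \theta)$, and picks the one in $Z$ with the largest likelihood:
$$\tilde{z} = \operatorname{argmax}_{z_i \in Z} P(z_i \mid x; \theta)$$
Then, the model optimizes a standard negative log likelihood objective, assuming $\tilde{z}$ is a true solution. The objective can be rewritten as follows:
$$J_{Hard}(\theta \mid x, Z) = -\log P(\tilde{z} \mid x; \theta) = -\log \max_{z_i \in Z} P(z_i \mid x; \theta)$$
This objective can be seen as a variant of MML, where the $\sum$ is replaced with a $\max$.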
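To make the contrast between the two objectives concrete, the following minimal sketch computes both losses from the probabilities a model assigns to the solutions in the precomputed set. The function names and the two example probability vectors are hypothetical and chosen only for illustration; a real model would produce these probabilities from its parameters.

```python
import math

def mml_loss(probs):
    """MML objective: negative log of the total probability mass on the set Z."""
    return -math.log(sum(probs))

def hard_em_loss(probs):
    """Hard-EM objective: negative log of the single most likely solution in Z."""
    return -math.log(max(probs))

def hard_em_choice(probs):
    """Index of the solution that the hard-EM update treats as the true one."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical probabilities assigned to three solutions in Z by two models.
spread = [0.30, 0.30, 0.30]  # mass spread evenly over all solutions
peaked = [0.88, 0.01, 0.01]  # mass concentrated on a single solution

for name, probs in [("spread", spread), ("peaked", peaked)]:
    print(name, round(mml_loss(probs), 4), round(hard_em_loss(probs), 4))
```

In this toy case both models place the same total mass on the set, so MML does not distinguish them, while the hard-EM loss is lower for the model that commits to one solution. This mirrors the first problem with MML described above.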
4 Task Setup
We apply our approach to three different types of QA tasks: multi-mention reading comprehension, RC with discrete reasoning and a semantic parsing task. In this section, we describe each task in detail: how we define a solution $z$ and precompute a set $Z$ based on the input $x$ and answer $y$. The statistics of $|Z|$ and examples on each task are shown in Table 1 and Table 2 respectively.
4.1 Multi-Mention Reading Comprehension
Multi-mention reading comprehension naturally occurs in several QA tasks such as (i) distantly-supervised reading comprehension where a question and answer are collected first before the evidence document is gathered (e.g. TriviaQA), (ii) abstractive reading comprehension which requires free-form text to answer the question (e.g. NarrativeQA), and (iii) open-domain QA where only question-answer pairs are provided.
Given a question $Q = [q_1, \dots, q_l]$ and a document $D = [d_1, \dots, d_L]$, where $q_i$ and $d_j$ denote the tokens in the question and document, the output $y$ is an answer text, which is usually mentioned multiple times in the document.
Previous work has dealt with this setting by detecting spans in the document through text matching (Joshi et al., 2017; Clark and Gardner, 2018). Following previous approaches, we define a solution as a span in the document. We obtain a set of possible solutions $Z = \{z_1, \dots, z_n\}$ by finding exact match or similar mentions of $y$, where $z_i = (s_i, e_i)$ is a span of text with start and end token indices $s_i$ and $e_i$. Specifically,
$$g_{max} = \max_{1 \le s \le e \le L} g([d_s, \dots, d_e], y), \qquad Z = \{ z_i = (s_i, e_i) \ \text{s.t.}\ g([d_{s_i}, \dots, d_{e_i}], y) = g_{max} \},$$
where $g$ is a string matching function. If the answer is guaranteed to be a span in the document $D$, $g$ is a binary function which returns $1$ if two strings are the same, and $0$ otherwise. If the answer is free-form text, we choose $g$ as the ROUGE-L metric (Lin, 2004).
This complicates learning because the given document contains many spans matching the answer text, while most of them are not related to answering the question. In the example shown in Table 2, only the fourth span out of six is relevant to the question.
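As an illustration of how such a set of candidate spans might be precomputed in the exact-match case, the sketch below scans a tokenized document for every occurrence of the answer. The function name, the whitespace tokenization, the case-insensitive comparison and the shortened two-sentence document are all assumptions made for this example, not details taken from the setup above.

```python
def find_answer_spans(doc_tokens, answer_tokens):
    """Return every (start, end) token span, end inclusive, that matches the answer."""
    n = len(answer_tokens)
    if n == 0:
        return []
    target = [t.lower() for t in answer_tokens]  # assumption: ignore case
    spans = []
    for start in range(len(doc_tokens) - n + 1):
        window = [t.lower() for t in doc_tokens[start:start + n]]
        if window == target:
            spans.append((start, start + n - 1))
    return spans

# A made-up, shortened document used only to exercise the function.
doc = ("Robert Schumann was a German composer . Clara Wieck married "
       "Robert Schumann in 1840 .").split()
Z = find_answer_spans(doc, ["Robert", "Schumann"])
print(Z)
```

Every span returned this way yields the right answer string, so the weak supervision alone cannot tell which mention actually answers the question.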
4.2 Reading Comprehension with Discrete Reasoning
Some reading comprehension tasks require reasoning in several discrete steps by finding clues from the document and aggregating them. One such example is mathematical reasoning, where the model must pick numbers from a document and compute an answer through arithmetic operations (Dua et al., 2019).
In this task, the input is also a question $Q$ and a document $D$, and the output $y$ is given as a numeric value. We define a solution to be an executable arithmetic equation. Since there is an infinite set of potential equations, we approximate $Z_{tot}$ as a finite set of arithmetic equations with two numeric values and one operation, following Dua et al. (2019). (Footnote 3: This approximation covers 93% of the examples in the development set.) Specifically,
$$Z_{tot} = \{ z_i = (n_1, o, n_2) \ \text{s.t.}\ n_1, n_2 \in N_D \cup N_Q \cup S \},$$
where $o$ is an arithmetic operation, $N_D$ and $N_Q$ are all numeric values appearing in $D$ and $Q$, respectively, and $S$ is a set of predefined special numbers. Then
$$Z = \{ z_i \in Z_{tot} \ \text{s.t.}\ f(z_i) = y \},$$
where $f$ is an execution function of equations.
Figure 1 shows an example given a question and a document. We can see that one equation is correct, while the others are false positives which coincidentally lead to the correct answer.
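The enumeration can be sketched in a few lines. The example below is a simplified illustration that assumes only addition and subtraction over number mentions in the document (no question numbers or special numbers), and it identifies each operand by its position so that repeated mentions of the same value count as different solutions. The function name and these simplifications are assumptions for the sketch.

```python
from itertools import permutations

def enumerate_equations(numbers, answer):
    """Return all (i, op, j) over distinct number mentions that evaluate to `answer`."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
    solutions = []
    for i, j in permutations(range(len(numbers)), 2):
        for symbol, fn in ops.items():
            if fn(numbers[i], numbers[j]) == answer:
                solutions.append((i, symbol, j))
    return solutions

# Number mentions from the DROP example in Table 2, in document order.
numbers = [10, 37, 9, 37, 36, 41, 40, 25, 8, 6]
Z = enumerate_equations(numbers, 4)
for i, op, j in Z:
    print(f"{numbers[i]}{op}{numbers[j]}")
```

Several equations reach the answer 4, but only the one pairing the 40-yard and 36-yard field goals reflects the reasoning the question asks for.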
4.3 SQL Query Generation
To evaluate if our training strategy generalizes to other weak supervision problems, we also study a semantic parsing task where a question and an answer are given but the logical form to execute the answer is not. In particular, we consider a task of answering questions about a given table by generating SQL queries.
The input is a question $Q = [q_1, \dots, q_l]$ and a table header $H = [h_1, \dots, h_{n_L}]$, where $q_i$ is a token, $h_i$ is a multi-token title of each column, and $n_L$ is the number of headers. The supervision is given as the SQL query result $y$, which is always a text string.
We define a solution to be an SQL query. Since the set of potential queries is infinite, we approximate $Z_{tot}$ as a set of non-nested SQL queries with at most three conditions. (Footnote 4: This approximation covers 99% of the examples in the development set.) Specifically, given $A$ as a set of aggregating operators and $C$ as a set of possible conditions, we define $Z_{tot}$:
$$Z_{tot} = \{ z_i = (z_i^{sel}, z_i^{agg}, \{z_{i,j}^{cond}\}_{j=1}^{3}) \ \text{s.t.}\ z_i^{sel} \in [1, n_L],\ z_i^{agg} \in A,\ z_{i,j}^{cond} \in \{\text{none}\} \cup C \},$$
then,
$$Z = \{ z_i \in Z_{tot} \ \text{s.t.}\ f(z_i) = y \},$$
where $f$ is an SQL executor. The third example in Table 2 shows that $Z$ may contain many spurious SQL queries, e.g. the third query in $Z$ coincidentally executes the answer because ‘John Long’ is ranked first among all the guards in alphabetical order.
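The following sketch shows how a spurious query can pass the execution check. It uses Python's built-in sqlite3 module on a tiny in-memory table; apart from ‘John Long’, the rows, the table name `t`, the column name `years` and the three candidate queries are invented for illustration and are not taken from WikiSQL.

```python
import sqlite3

def executes_to(conn, query, answer):
    """True if the query returns exactly one row whose first column equals the answer."""
    rows = conn.execute(query).fetchall()
    return len(rows) == 1 and rows[0][0] == answer

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (player TEXT, years TEXT, position TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("John Long", "1996-97", "guard"),    # the row the question is about
    ("Mike Smith", "1995-96", "guard"),   # made-up row
    ("Alan Jones", "1996-97", "forward"), # made-up row
])

candidates = [
    "SELECT player FROM t WHERE position = 'guard' AND years = '1996-97'",
    "SELECT MIN(player) FROM t WHERE position = 'guard'",
    "SELECT MAX(player) FROM t WHERE position = 'guard'",
]
# Keep only the candidates whose execution result matches the answer text.
Z = [q for q in candidates if executes_to(conn, q, "John Long")]
print(len(Z))
```

The second candidate ignores the year entirely and still returns the right string, because the answer happens to sort first among the guards; execution alone cannot separate it from the intended query.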
5 Experiments
We experiment on a range of question answering tasks with varied model architectures to demonstrate the effectiveness of our approach. Built on top of strong base models, our learning method is able to achieve state-of-the-art on NarrativeQA, TriviaQA-open, NaturalQuestions-open, DROP and WikiSQL.
Table 3: Results on multi-mention reading comprehension datasets and DROP.
Method | TriviaQA (F1) Dev | TriviaQA (F1) Test | NarrativeQA (ROUGE-L) Dev | NarrativeQA (ROUGE-L) Test | TriviaQA-open (EM) Dev | TriviaQA-open (EM) Test | NaturalQ-open (EM) Dev | NaturalQ-open (EM) Test | DROP w/ BERT (EM) Dev | DROP w/ QANet (EM) Dev
First Only | 64.4 | 64.9 | 55.3 | 57.4 | 48.6 | 48.1 | 23.6 | 23.6 | 42.9 | 36.1
MML | 64.8 | 65.5 | 55.8 | 56.1 | 47.0 | 47.4 | 26.6 | 25.8 | 39.7 | 43.8
Ours | 66.9 | 67.1 | 58.1 | 58.8 | 50.7 | 50.9 | 28.8 | 28.1 | 52.8 | 45.0
SOTA |  | 71.4 |  | 54.7 | 47.2 | 47.1 | 24.8 | 26.5 | 43.8
Table 4: Results on WikiSQL.
Model | Accuracy (Dev) | Accuracy (Test)
Weakly-supervised setting
REINFORCE (Williams, 1992) |  | 
Iterative ML (Liang et al., 2017) | 70.1 | 
Hard EM (Liang et al., 2018) | 70.2 | 
Beam-based MML (Liang et al., 2018) | 70.7 | 
MAPO (Liang et al., 2018) | 71.8 | 72.4
MAPOX (Agarwal et al., 2019) | 74.5 | 74.2
MAPOX+MeRL (Agarwal et al., 2019) | 74.9 | 74.8
MML | 70.6 | 70.5
Ours | 84.4 | 83.9
Fully-supervised setting
SQLNet (Xu et al., 2018) | 69.8 | 68.0
TypeSQL (Yu et al., 2018b) | 74.5 | 73.5
Coarse2Fine (Dong and Lapata, 2018) | 79.0 | 78.5
SQLova (Hwang et al., 2019) | 87.2 | 86.2
X-SQL (He et al., 2019) | 89.5 | 88.7
5.1 Multi-Mention Reading Comprehension
We experiment on two reading comprehension datasets and two open-domain QA datasets. For reading comprehension, we evaluate on TriviaQA (Wikipedia) (Joshi et al., 2017) and NarrativeQA (summary) (Kočiskỳ et al., 2018).
For open-domain QA, we follow the settings in Lee et al. (2019) and use the QA pairs from TriviaQA-unfiltered (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019) with short answers and discard the given evidence documents. We refer to these two datasets as TriviaQA-open and NaturalQuestions-open. (Footnote 5: Following Lee et al. (2019), we treat the dev set as the test set and split the train set into 90/10 for training and development. Datasets and their split can be downloaded from https://bit.ly/2HK1Fqn.)
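We experiment with three learning methods as follows.
First Only: $J(\theta) = -\log P(z_1 \mid x; \theta)$, where $z_1$ appears first in the given document among all $z_i \in Z$.
MML: $J(\theta) = -\log \sum_{z_i \in Z} P(z_i \mid x; \theta)$.
Ours: $J(\theta) = -\log \max_{z_i \in Z} P(z_i \mid x; \theta)$.
$P(z_i \mid x; \theta)$ can be obtained by any model which outputs the start and end positions of a span in the input document. In this work, we use a modified version of BERT (Devlin et al., 2019) for multi-paragraph reading comprehension (Min et al., 2019).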
Training details.
We use the uncased version of BERT. For all datasets, we split documents into a set of segments of up to 300 tokens because BERT limits the size of the input. We use separate batch sizes for the two reading comprehension tasks and for the two open-domain QA tasks. Following Clark and Gardner (2018), we filter a subset of segments in TriviaQA through TF-IDF similarity between a segment and a question to maintain a reasonable length. For open-domain QA tasks, we retrieve 50 Wikipedia articles through TF-IDF (Chen et al., 2017) and further run BM25 (Robertson et al., 2009) to retrieve 20 (for train) or 80 (for development and test) paragraphs. We try 10, 20, 40 and 80 paragraphs on the development set to choose the number of paragraphs to use on the test set.
To avoid local optima, we perform annealing: at each training step, the model optimizes the MML objective with a step-dependent probability and otherwise uses our objective; the annealing schedule is controlled by a single hyperparameter. We observe that the performance is improved by annealing while not being overly sensitive to this hyperparameter. We include full hyperparameters and detailed ablations in Appendix B.
Results.
Table 3 compares the results of baselines, our method and the state-of-the-art on four datasets. (Footnote 6: For NarrativeQA, we compare with models trained on NarrativeQA only. For open-domain QA, we only compare with models using a pipeline approach.) First of all, we observe that First Only is a strong baseline across all the datasets. We hypothesize that this is due to the bias in the dataset that answers are likely to appear earlier in the paragraph. Second, while MML achieves results comparable to the First Only baseline, our learning method outperforms the others in F1/ROUGE-L/EM consistently on all datasets. Lastly, our method achieves the new state-of-the-art on NarrativeQA, TriviaQA-open and NaturalQuestions-open, and is comparable to the state-of-the-art on TriviaQA, despite our aggressive truncation of documents.
5.2 Reading Comprehension with Discrete Reasoning
We experiment on a subset of DROP (Dua et al., 2019) with numeric answers (67% of the entire dataset) focusing on mathematical reasoning. We refer to this subset as DROP. The current state-of-the-art model is an augmented version of QANet (Yu et al., 2018a) which selects two numeric values from the document or the question and performs addition or subtraction to get the answer. The equation to derive the answer is not given, and Dua et al. (2019) adopted the MML objective.
$P(z \mid x; \theta)$ can be any model which generates equations based on the question and document. Inspired by Dua et al. (2019), we take a sequence tagging approach on top of two competitive models: (i) augmented QANet, the same model as Dua et al. (2019) but only supporting addition, subtraction and counting, and (ii) augmented BERT, which supports addition, subtraction and percentiles. (Footnote 7: As we use a different set of operations for the two models, they are not directly comparable. Details of the model architecture are shown in Appendix A.)
Training details.
We truncate the document to at most 400 words. We use separate batch sizes for QANet and BERT.
Results.
Table 3 shows the results on DROP. Our training strategy outperforms the First Only baseline and MML by a large margin, consistently across two base models. In particular, with BERT, we achieve an absolute gain of 10%.
Statistics of the WikiSQL training subsets used in Figure 3 (Avg and Median refer to $|Z|$):
Group | Avg | Median | # train
3 | 3.0 | 3 | 10k
10 | 10.2 | 9 | 10k
30 | 30.0 | 22 | 10k
100 | 100.6 | 42 | 10k
300 | 300.0 | 66 | 10k
5.3 SQL Query Generation
Finally, we experiment on the weakly-supervised setting of WikiSQL (Zhong et al., 2017), in which only the question-answer pair is used and the SQL query is treated as a latent variable.
$P(z \mid x; \theta)$ can be computed by any query generation or semantic parsing model. We choose SQLova (Hwang et al., 2019), a competitive model on WikiSQL (designed for the fully supervised setting), as our base model. We modify the model to incorporate either the MML objective or our hard-EM learning approach for the weakly-supervised setting.
We compare with both traditional and recently-developed reward-based algorithms for weak supervision, including beam-based MML (MML which keeps a beam during training), conventional hard EM (Footnote 8: This method differs from ours in that it does not have a precomputed set, and uses a beam of candidate predictions to execute for each update.), REINFORCE (Williams, 1992), iterative ML (Liang et al., 2017; Abolafia et al., 2018) and a family of MAPO (Memory-augmented policy optimization) methods (Liang et al., 2018; Agarwal et al., 2019). For a fair comparison, we only consider single models without execution-guided decoding.
Training details.
We adopt the same set of hyperparameters as in Hwang et al. (2019), except that we change the batch size to 10 and truncate the input to be up to 180 words.
Results.
Table 4 shows that our training method significantly outperforms all the weakly-supervised learning algorithms, including a 10% gain over the previous state of the art. These results indicate that precomputing a solution set and training a model through hard updates play a significant role in the performance. Given that our method does not require SQL executions at training time (unlike MAPO), it provides a simpler, more effective and time-efficient strategy. Compared to previous models with full supervision, our results are still on par and outperform most of the published results.
6 Analysis
In this section, we will conduct thorough analyses and ablation studies, to better understand how our model learns to find a solution from a precomputed set of possible solutions. We also provide more examples and analyses in Appendix C.
Table 5: An example from DROP with the model's top-1 prediction and a subset of $Z$ at different training steps $t$.
Q: How many yards longer was Rob Bironas’ longest field goal compared to John Carney’s only field goal? (Answer: 4)
P: … The Titans responded with Kicker Rob Bironas managing to get a 37 yard field goal. … Tennessee would draw close as Bironas kicked a 37 yard field goal. The Chiefs answered with kicker John Carney getting a 36 yard field goal. The Titans would retake the lead with Young and Williams hooking up with each other again on a 41 yard td pass. … Tennessee clinched the victory with Bironas nailing a 40 yard and a 25 yard field goal.
$t$ | Pred | $Z$ (ordered by $P(z \mid x; \theta)$)
1k | 10-9 | 10-6, 41-37, 40-36, 41-37
2k | 37-36 | 40-36, 41-37, 41-37, 10-6
4k | 40-36 | 40-36, 41-37, 41-37, 10-6
8k | 40-36 | 40-36, 41-37, 41-37, 10-6
16k | 37-36 | 40-36, 41-37, 41-37, 10-6
32k | 40-36 | 40-36, 41-37, 41-37, 10-6
Varying the size of solution set at inference time.
Figure 2 shows a breakdown of the model accuracy with respect to the size of a solution set ($|Z|$) at test time. We observe that the model with our training method outperforms the model with the MML objective consistently across different values of $|Z|$. The gap between MML and our method is marginal for the smallest values of $|Z|$, and gradually increases as $|Z|$ grows.
Varying the size of solution set at training.
To see how our learning method works with respect to the size of a solution set ($|Z|$) of the training data, particularly with large $|Z|$, we take 5 subsets of the training set on WikiSQL with $|Z| = 3, 10, 30, 100, 300$. We train a model on each of those subsets and evaluate it on the original development set, both with our training method and with the MML objective. Figure 3 shows statistics of each subset and results. We observe that (i) our learning method outperforms MML consistently over different values of $|Z|$, and (ii) the gain is particularly large when $|Z|$ is large.
Model predictions over training.
We analyze the top-1 prediction and the likelihood of each $z \in Z$ assigned by the model on DROP with different numbers of training iterations (steps from 1k to 32k). Table 5 shows one example on DROP with the answer text ‘4’, along with the model’s top-1 prediction and a subset of $Z$. We observe that the model first begins by assigning a small, uniform probability distribution to $Z$, but gradually learns to favor the true solution. The model sometimes gives the wrong prediction (for example, at $t = 16$k it changes its prediction from the true solution to the wrong solution, ‘37-36’) but again changes its prediction to be a true solution afterward. In addition, its intermediate wrong solution, ‘37-36’, indicates that the model had difficulty identifying the longest field goal of Rob Bironas (40 vs. 37), which is an understandable mistake.
We also compare the predictions from the model with our method to those from the model with MML, which is shown in Appendix C.
Quality of the predicted solution.
We analyze whether the model outputs the correct solution, since a solution that executes to the correct answer could be spurious. First, on NarrativeQA and DROP, we manually analyze 50 samples from the development set and find that 98% and 92% of correct cases produce the correct solution, respectively. Next, on WikiSQL, we compare the predictions from the model to the annotated SQL queries on the development set. This is possible because gold SQL queries are available in the dataset for full supervision. Out of 8,421 examples, 7,110 predictions execute the correct answers. Among those, 88.5% of the predictions are exactly the same as the annotated queries. The others are cases where (i) both queries are correct, (ii) the model prediction is correct but the annotated query is incorrect, and (iii) the annotated query is correct and the model prediction is spurious. We show a full analysis in Appendix C.
Robustness to the noise in $Z$.
Sometimes noise arises during the construction of $Z$, such as $Z$ constructed based on ROUGE-L for NarrativeQA. To explore the effect of noise in $Z$, we experiment with a noisier solution set by picking all the spans with scores equal to or larger than the 5th highest. The new construction method increases the average $|Z|$ from 4.3 to 7.1 on NarrativeQA. The result with the MML objective drops significantly (56.07 → 51.14) while the result with ours drops marginally (58.77 → 57.97), suggesting that MML suffers more with a noisier $Z$ while ours is more robust.
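The noisier construction can be sketched as a simple threshold on span scores. The helper below is a hypothetical illustration: it assumes each candidate span already carries a matching score (for instance from ROUGE-L), and it counts tied scores separately when locating the k-th highest score, which is one possible reading of the description above.

```python
def spans_at_or_above_kth(scored_spans, k):
    """Keep every span whose score is at least the k-th highest score."""
    if not scored_spans:
        return []
    scores = sorted((score for _, score in scored_spans), reverse=True)
    # If there are fewer than k spans, fall back to the lowest score.
    threshold = scores[min(k, len(scores)) - 1]
    return [span for span, score in scored_spans if score >= threshold]

# Made-up (span, score) pairs used only to exercise the function.
scored = [((0, 2), 0.9), ((5, 7), 0.9), ((9, 11), 0.6), ((14, 15), 0.4)]
print(spans_at_or_above_kth(scored, 1))
print(spans_at_or_above_kth(scored, 3))
```

With k = 1 only the top-scoring spans are kept; raising k admits lower-scoring spans and so enlarges the solution set with more potentially spurious candidates.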
7 Conclusion
In this paper, we demonstrated that, for many QA tasks which only provide the answer text as supervision, it is possible to precompute a discrete set of possible solutions that contains one correct option. Then, we introduced a discrete latent variable learning algorithm which iterates a procedure of predicting the most likely solution in the precomputed set and further increasing the likelihood of that solution. We showed that this approach significantly outperforms previous approaches on six QA tasks including reading comprehension, open-domain QA, a discrete reasoning task and semantic parsing, achieving absolute gains of 2–10% and setting the new state-of-the-art on five well-studied datasets.
Acknowledgements
This research was supported by ONR (N000141812826, N0001417SB001), DARPA N66001192403, NSF (IIS1616112, IIS1252835, IIS1562364), ARO (W911NF1610121), an Allen Distinguished Investigator Award, Samsung GRO and gifts from Allen Institute for AI, Google and Amazon.
The authors would like to thank the anonymous reviewers, Eunsol Choi, Christopher Clark, Victor Zhong and UW NLP members for their valuable feedback.
References
 Abolafia et al. (2018) Daniel A Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, and Quoc V Le. 2018. Neural program synthesis with priority queue training. arXiv preprint arXiv:1801.03526.
 Agarwal et al. (2019) Rishabh Agarwal, Chen Liang, Dale Schuurmans, and Mohammad Norouzi. 2019. Learning to generalize from sparse and underspecified rewards. In ICML.
 Alberti et al. (2019) Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.
 Artzi and Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. In ACL.
 Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.
 Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
 Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.
 Clarke et al. (2010) James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In CoNLL.
 Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
 Dong and Lapata (2018) Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In ACL.
 Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
 He et al. (2019) Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: reinforce schema representation with context. arXiv preprint arXiv:1908.08113.
 Hurley and Rickard (2009) Niall Hurley and Scott Rickard. 2009. Comparing measures of sparsity. IEEE Transactions on Information Theory.
 Hwang et al. (2019) Wonseok Hwang, Jinyeung Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on WikiSQL with table-aware word contextualization. arXiv preprint arXiv:1902.01069.
 Iyyer et al. (2017) Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In ACL.
 Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
 Kadlec et al. (2016) Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In ACL.
 Kočiskỳ et al. (2018) Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. TACL.
 Krishnamurthy et al. (2017) Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In EMNLP.
 Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. TACL.
 Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In ACL.
 Liang et al. (2017) Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In ACL.
 Liang et al. (2018) Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V Le, and Ni Lao. 2018. Memory augmented policy optimization for program synthesis and semantic parsing. In NIPS.
 Liang et al. (2013) Percy Liang, Michael I Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.
 Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
 Min et al. (2019) Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
 Nishida et al. (2019) Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2019. Multi-style generative reading comprehension. In ACL.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
 Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
 Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval.
 Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
 Swayamdipta et al. (2018) Swabha Swayamdipta, Ankur P Parikh, and Tom Kwiatkowski. 2018. Multi-mention learning for reading comprehension with neural cascades. In ICLR.
 Talmor and Berant (2019) Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In ACL.
 Tay et al. (2018) Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. Densely connected attention propagation for reading comprehension. In NIPS.
 Wang et al. (2018) Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In ACL.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
 Xiong et al. (2018) Caiming Xiong, Victor Zhong, and Richard Socher. 2018. DCN+: Mixed objective and deep residual coattention for question answering. In ICLR.
 Xu et al. (2018) Xiaojun Xu, Chang Liu, and Dawn Song. 2018. SQLNet: Generating structured queries from natural language without reinforcement learning. In ICLR.
 Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
 Yu et al. (2018a) Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018a. Fast and accurate reading comprehension by combining self-attention and convolution. In ICLR.
 Yu et al. (2018b) Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018b. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In NAACL.
 Zettlemoyer and Collins (2005) Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.
 Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
 Zhou et al. (2009) Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. 2009. Multi-instance learning by treating instances as non-i.i.d. samples. In ICML.
Appendix A Model details
We describe the detailed architecture of the base models. In other words, we describe how we obtain the probability that the model assigns to each solution.
The following paragraphs describe (i) the BERT extractive model used for multi-mention RC (Section 5.1) and (ii) the BERT sequence tagging model used for the discrete reasoning task (Section 5.2), respectively. For QANet for discrete reasoning and the model for SQL generation (Section 5.3), we use the open-sourced code of the original implementations of Dua et al. (2019) and Hwang et al. (2019) (https://github.com/allenai/allennlp/blob/master/allennlp/models/reading_comprehension/naqanet.py and https://github.com/naver/sqlova) and make no modification except to the objective function, so we refer readers to the original papers.
All implementations are done in PyTorch (Paszke et al., 2017). For BERT, we modify the open-sourced PyTorch implementation (https://github.com/huggingface/pytorch-pretrained-BERT) and use the uncased version of BERT.
Extractive QA model for multi-mention RC
The model architecture is close to that of Min et al. (2019) and Alberti et al. (2019), where the model operates independently on each paragraph, and selects the best matching paragraph and its associated answer span.
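The input is a question $Q$ and a set of paragraphs $\{P_1, \dots, P_N\}$, and the desired output is a span from one of the paragraphs. Since our goal is to compute the probability of a specific span $z$, let us say $z$ is the $s$-th through $e$-th word in the $k$-th paragraph.
The model receives a question $Q$ and a single paragraph $P_i$ in parallel. Then, $S_i = Q : [\mathrm{SEP}] : P_i$ is a list of $m + n_i + 1$ words, where : indicates a concatenation, $[\mathrm{SEP}]$ is a special token, $m$ is the length of $Q$, and $n_i$ is the length of $P_i$. This is fed into BERT:
\[ \bar{S}_i = \mathrm{BERT}(S_i) \in \mathbb{R}^{h \times (m + n_i + 1)}, \]
where $h$ is the hidden dimension of BERT. Then,
\[ p_{i,\mathrm{start}} = \mathrm{Softmax}(\bar{S}_i^{\top} W_1) \in \mathbb{R}^{m + n_i + 1}, \qquad p_{i,\mathrm{end}} = \mathrm{Softmax}(\bar{S}_i^{\top} W_2) \in \mathbb{R}^{m + n_i + 1}, \]
where $W_1, W_2 \in \mathbb{R}^{h}$ are learnable vectors.
Finally, the probability of $z$, the $s$-th through $e$-th word in the $i$-th paragraph, is obtained by:
\[ \mathbb{P}(z \mid Q, P_i) = p_{i,\mathrm{start}}^{s} \times p_{i,\mathrm{end}}^{e}, \]
where $p^{d}$ denotes the $d$-th element of the vector $p$.
Separately, a paragraph selector is trained through $p_{i,\mathrm{exit}} = \mathrm{Softmax}(W_3^{\top} \mathrm{maxpool}(\bar{S}_i)) \in \mathbb{R}^{2}$, where $W_3 \in \mathbb{R}^{h \times 2}$ is learnable. At inference time, $k = \arg\max_i p_{i,\mathrm{exit}}^{1}$ is computed and only $P_k$ is considered to output a span.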
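The span scoring described above can be sketched in plain Python. This is a minimal illustration, not the original implementation: it assumes the common factorization in which a span's probability is the product of a softmax start probability and a softmax end probability, the function names and toy logits are hypothetical, and the paragraph selector is reduced to an argmax over scores that are simply given.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_probability(start_logits, end_logits, s, e):
    """Probability of the span covering tokens s..e (0-indexed, inclusive).

    Assumes start and end positions are scored independently.
    """
    if not 0 <= s <= e < len(start_logits):
        raise ValueError("span indices out of range")
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    return p_start[s] * p_end[e]


def select_paragraph(selector_scores):
    """Index of the paragraph with the highest selector score."""
    return max(range(len(selector_scores)), key=lambda i: selector_scores[i])


# Toy logits for one 4-token input (made-up numbers, not model outputs).
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [-0.5, 0.2, 1.5, 0.0]
p = span_probability(start_logits, end_logits, 1, 2)

# Made-up selector scores for three paragraphs.
best_paragraph = select_paragraph([0.2, 0.9, 0.4])
```

In a real model the logits would come from the encoder, and spans would be scored only within the selected paragraph.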
Sequence Tagging model for discrete reasoning
The basic idea of the model is close to that of Dua et al. (2019). The input is a question $Q$ and a paragraph $P$. Our goal is to compute the probability of an equation $z$ over the numeric values appearing in $P$ and $Q$ and a set of predefined special numbers.
First, the BERT encodings of the question and the paragraph are obtained via
\[ \bar{S} = \mathrm{BERT}(Q : [\mathrm{SEP}] : P) \in \mathbb{R}^{h \times (m + n + 1)}, \]
where : indicates a concatenation, $[\mathrm{SEP}]$ is a special token, $m$ is the length of $Q$, $n$ is the length of $P$, and $h$ is the hidden dimension of BERT. Then,
where and are learnable matrices. Then,
where
Here, the subscript $i$ of a vector or a sequence indicates the $i$-th dimension of the vector or the $i$-th element of the sequence, respectively.
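Table 6: Ablations on the annealing hyperparameter $\tau$ (first two blocks) and the batch size and $\tau$ chosen for each dataset (last block).
$\tau$  TriviaQA F1 
None  55.83 
20k  58.05 
30k  57.99 
40k  56.66 
$\tau$  DROP EM 
None  52.31 
5k  51.98 
10k  52.82 
20k  51.74 
Dataset  Batch size  $\tau$ 
TriviaQA  20  15K 
NarrativeQA  20  20K 
TriviaQAopen  192  4K 
NaturalQuestionsopen  192  8K 
DROP with BERT  14  10K 
DROP with QANet  14  None 
WikiSQL  10  None 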
Appendix B Annealing
To prevent the model from being optimized toward its own early decisions, we perform annealing: at training step $t$, the model optimizes the MML objective with a probability determined by $t$ and a hyperparameter $\tau$, and otherwise uses our objective. We observe that annealing improves performance while not being sensitive to the choice of $\tau$. Ablations and the value of $\tau$ chosen for each dataset are shown in Table 6. Note that for DROP with QANet and WikiSQL, we do not ablate over $\tau$ and just report the number without annealing.
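The mixing of the two objectives can be sketched as follows. This is a hedged illustration: `mml_loss`, `most_likely_solution_loss`, and `annealed_loss` are hypothetical names, the solution probabilities are made-up numbers, and the probability of using MML (`p_mml`) is passed in by the caller because the exact schedule is not restated here.

```python
import math
import random


def mml_loss(solution_probs):
    """Negative log of the total probability mass on the solution set."""
    return -math.log(sum(solution_probs))


def most_likely_solution_loss(solution_probs):
    """Negative log-probability of the single most likely solution."""
    return -math.log(max(solution_probs))


def annealed_loss(solution_probs, p_mml, rng):
    """Use the MML loss with probability p_mml, otherwise the max-based loss.

    p_mml is assumed to be computed elsewhere from the training step
    and the annealing hyperparameter.
    """
    if rng.random() < p_mml:
        return mml_loss(solution_probs)
    return most_likely_solution_loss(solution_probs)


# Made-up model probabilities for three precomputed solutions.
probs = [0.5, 0.2, 0.1]
rng = random.Random(0)
loss = annealed_loss(probs, p_mml=0.5, rng=rng)
```

With `p_mml=1.0` every step uses MML, and with `p_mml=0.0` every step uses the most-likely-solution objective, so the two extremes recover the unmixed training regimes.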
Appendix C Examples
Question & Document 
Q: What is the state capital of Alabama? (Groundtruth: Montgomery) 
D: Alabama is nicknamed the Yellowhammer State, after the state bird. Alabama is also known as the “Heart of Dixie” and 
the “Cotton State”. The state tree is the longleaf pine, and the state flower is the camellia. Alabama’s capital is Montgomery. 
(…) From 1826 to 1846, Tuscaloosa served as Alabama’s capital. On January 30, 1846, the Alabama legislature announced 
it had voted to move the capital city from Tuscaloosa to Montgomery. The first legislative session in the new capital met in 
December 1847. A new capitol building was erected under the direction of Stephen Decatur Button of Philadelphia. The first 
structure burned down in 1849, but was rebuilt on the same site in 1851. This second capitol building in Montgomery remains 
to the present day. 
Question & Passage 
Q: How many sports are not olympic sports but are featured in the asian games ? (A: 10) 
P: The first 30 sports were announced by the singapore national olympic council on 10 december 2013 on the sidelines of 
the 27th sea games in myanmar. It announced then that there was room for as many as eight more sports. On 29 april 2014 
the final six sports namely boxing equestrian floorball petanque rowing and volleyball were added to the programme. 
Floorball will feature in the event for the first time after being a demonstration sport in the 2013 edition. In its selection of 
events the organising committee indicated their desire to set a model for subsequent games in trimming the number of ‘traditional’ 
sports to refocus on the seag’s initial intent to increase the level of sporting excellence in key sports. Hence despite 
room for up to eight traditional sports only two floorball and netball were included in the programme. Amongst the other 34 
sports 24 are olympic sports and all remaining sports are featured in the asian games. 
Ours  
Step  Pred  Solutions (ordered by likelihood)  
k  10+two  eight+two  eight+two  34-24  10 
k  30-24  34-24  eight+two  eight+two  10 
k  34+24  34-24  eight+two  eight+two  10 
k  34+24  34-24  eight+two  10  eight+two 
k  34-24  34-24  eight+two  eight+two  10 
k  34-24  34-24  eight+two  eight+two  10 
MML  
Step  Pred  Solutions (ordered by likelihood)  
k  10+two  eight+two  eight+two  10  34-24 
k  eight+two  eight+two  eight+two  34-24  10 
k  24+5  eight+two  eight+two  34-24  10 
k  24+5  eight+two  eight+two  34-24  10 
k  34-24  34-24  eight+two  eight+two  10 
k  24+5  34-24  eight+two  eight+two  10 
Q  How many times was the # of total votes 2582322? 
H  Election, # of candidates nominated, # of seats won, # of total votes, % of popular vote 
A  Select count(# of seats won) where # of total votes = 2582322 
P  Select count(Election) where # of total votes = 2582322 
Q  What official or native languages are spoken in the country whose capital city is Canberra? 
H  Country (exonym), Capital (exonym), Country (endonym) Capital (endonym), Official or native language 
A  Select Official or native languages where Capital (exonym) = Canberra 
P  Select Official or native languages where Capital (endonym) = Canberra 
Q  What is the episode number that has production code 8abx15? 
H  No. in set, No. in series, Title, Directed by, Written by, Original air date, Production code 
A  Select min(No.in series) where Production code = 8ABX15 
P  Select No.in series where Production code = 8abx15 
Q  what is the name of the battleship with the battle listed on May 13, 1915? 
H  Estimate, Name, Nat., Ship Type, Principal victims, Date 
A  Select Name where Ship Type = battleship and Date = may 13, 1915 
P  Select Name where Date = may 13, 1915 
To see if the prediction from the model is the correct solution to derive the answer, we analyze outputs from the model.
TriviaQA.
Table 7 shows one example from TriviaQA where the answer text (Montgomery) is mentioned in the paragraph multiple times. Predictions from the model trained with our method and from the model trained with the MML objective are shown in red and blue text, respectively. The span predicted by the model with our method actually answers the question, while the other spans containing the answer text are not related to the question.
DROP.
Table 8 shows predictions from the model with our method and from the model with the MML objective over the course of training. We observe that the model with our method learns to assign a high probability to the best solution (‘34-24’), while the model with the MML objective fails to do so. Another notable observation is that the model with our method assigns a sparser distribution of likelihood over the solution set than the model with the MML objective. We quantify sparsity following Hurley and Rickard (2009) and show that the model with our method gives higher sparsity than the model with MML (59 vs. 36 and 54 vs. 17 at two settings of the sparsity measure on DROP).
WikiSQL.
WikiSQL provides annotated SQL queries, which makes it easy to compare the model's predictions to the annotated queries. Out of 8421 examples from the development set, 7110 predictions execute to the correct answers. Among those, 6296 predictions are exactly the same as the annotated queries. For cases where the predictions execute to the correct answers but are not exactly the same as the ground-truth queries, we show four examples in Table 9. In the first example, both the annotated query and the prediction are correct, because the selected column does not matter for counting. Similarly, in the second example, both queries are correct because Capital (exonym) and Capital (endonym) both indicate the capital city. In the third example, the prediction makes more sense than the annotated query because the question does not imply anything about min. In the last example, the annotated query makes more sense than the prediction because the prediction misses Ship Type = battleship. We conjecture that the model might learn to ignore some information in the question if the table header implies that the table is specific to that information, so that the query does not need to condition on it.
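The first case, where two count queries over different columns return the same answer, can be reproduced with a small in-memory SQLite table. This is only a sketch: the table, its simplified column names, and its rows are made up for illustration and are not taken from WikiSQL. It also shows a caveat of standard SQL: `count(column)` skips NULL values, so the two queries agree only when the counted columns have no NULLs in the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (election TEXT, seats_won INTEGER, total_votes INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("1990", 12, 2582322), ("1994", 15, 3000000), ("1998", 9, 2582322)],
)


def count_column(column):
    """Count non-NULL values of `column` among rows with the target vote total."""
    # `column` is one of two fixed names used below, so formatting is safe here.
    query = f"SELECT count({column}) FROM results WHERE total_votes = 2582322"
    return conn.execute(query).fetchone()[0]


# No NULLs in the matching rows: both queries return the same count.
same_a, same_b = count_column("election"), count_column("seats_won")

# Add a matching row whose seats_won is NULL: the two counts now diverge.
conn.execute("INSERT INTO results VALUES ('2002', NULL, 2582322)")
diff_a, diff_b = count_column("election"), count_column("seats_won")
```

Under these assumptions, an execution-based check would accept either query on the first table but would distinguish them on the second.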