AmbigQA: Answering Ambiguous Open-domain Questions

04/22/2020 ∙ by Sewon Min, et al. ∙ University of Washington

Ambiguity is inherent to open-domain question answering; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. In this paper, we introduce AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. To study this task, we construct AmbigNQ, a dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark. We find that over half of the questions in NQ-open are ambiguous, exhibiting diverse types of ambiguity. We also present strong baseline models for AmbigQA which we show benefit from weakly supervised learning that incorporates NQ-open, strongly suggesting our new task and data will support significant future research effort. Our data is available at




1 Introduction

In the open-domain setting, it can be difficult to formulate clear and unambiguous questions. For example, Figure 1 shows a Google search query (Kwiatkowski et al., 2019) that, perhaps surprisingly, has two possible interpretations given the evidence in Wikipedia. Although open-domain question answering (QA) systems aim to answer any factoid question (Voorhees et al., 1999), existing methods assume all questions have a single well-defined answer. In this paper, we define the task of ambiguous question answering, and present the first data and baseline methods for the study of how to disambiguate and answer such questions.

Figure 1: An AmbigNQ example where the prompt question (top) appears to have a single clear answer, but is actually ambiguous upon reading Wikipedia. AmbigQA requires producing the full set of acceptable answers while differentiating them from each other using disambiguated rewrites of the question.
Type | Example
Event references (39%) | What season does meredith and derek get married in grey’s anatomy?
    Q: In what season do Meredith and Derek get informally married in Grey’s Anatomy? / A: Season 5
    Q: In what season do Meredith and Derek get legally married in Grey’s Anatomy? / A: Season 7
Properties (27%) | How many episode in seven deadly sins season 2?
    Q: How many episodes were there in seven deadly sins season 2, not including the OVA episode? / A: 25
    Q: How many episodes were there in seven deadly sins season 2, including the OVA episode? / A: 26
Entity references (23%) | How many sacks does clay matthews have in his career?
    Q: How many sacks does Clay Matthews Jr. have in his career? / A: 69.5
    Q: How many sacks does Clay Matthews III have in his career? / A: 91.5
Answer types (16%) | Who sings the song what a beautiful name it is?
    Q: Which group sings the song what a beautiful name it is? / A: Hillsong Live
    Q: Who is the lead singer of the song what a beautiful name it is? / A: Brooke Ligertwood
Time-dependency (13%) | When does the new family guy season come out?
    Q: When does family guy season 16 come out? / A: October 1, 2017
    Q: When does family guy season 15 come out? / A: September 25, 2016
    Q: When does family guy season 14 come out? / A: September 27, 2015
Multiple sub-questions (3%) | Who was british pm and viceroy during quit india movement?
    Q: Who was british viceroy during quit India movement? / A: Victor Hope
    Q: Who was british pm during quit India movement? / A: Winston Churchill
Table 1: Breakdown of the types of ambiguity based on 100 random samples on AmbigNQ development data.

Ambiguity arises frequently in open-domain QA, where questions are written during information gathering (e.g. search queries) without knowledge of the answer. As we will see in Section 4, over 50% of the questions we sampled from a set of Google search queries are ambiguous. Furthermore, identifying ambiguities is difficult both for humans and machines. As we saw in Figure 1, ambiguity is a function of both the question and the evidence provided by a very large text corpus.

To study this challenge, we introduce AmbigQA (Answering Ambiguous Open-domain Questions), a new task which involves disambiguating and answering a potentially ambiguous question. Specifically, the model must (1) find a set of distinct, equally plausible answers to the question, and (2) provide minimal yet unambiguous rewrites of the question that clarify the interpretation which leads to each answer. Figure 1 provides an example of two such disambiguated questions and their associated answers.

To support the study of this task, we construct a dataset called AmbigNQ using 14,042 questions from an open-domain version of Natural Questions (Kwiatkowski et al., 2019), denoted NQ-open. For each question, annotators search for, navigate, and read multiple Wikipedia pages to find as many answers as possible. The high prevalence of ambiguity makes the annotation task difficult even for human experts; it is inherently difficult to know if you have found every possible interpretation of a question. Nonetheless, we were able to collect high quality data that covers high levels of ambiguity (2.1 distinct answers per question on average) with high estimated agreement (89.0 F1) on valid answers. The types of ambiguity are diverse and sometimes subtle (Table 1), including ambiguous entity or event references, or ambiguity over the answer type, many of which are only apparent after examining one or more Wikipedia pages.

To establish initial performance levels on this data, we present results for a set of strong baseline methods. We extend a state-of-the-art model for NQ-open (Karpukhin et al., 2020) with three new components: (1) set-based question answering with a BART-based sequence-to-sequence model (Lewis et al., 2020), (2) a question disambiguation model also based on BART, and (3) a modification to democratic co-training (Zhou and Goldman, 2004) which allows this model to leverage the partial supervision available in the full NQ-open dataset. We also conduct an ablation study and a qualitative analysis, which suggest there is significant room for future work on this task.

To summarize, our contributions are threefold.

  1. We introduce AmbigQA, a new task which requires identifying all plausible answers to an open-domain question, along with disambiguated questions to differentiate them.

  2. We construct AmbigNQ, a dataset with 14,042 annotations on NQ-open questions containing diverse types of ambiguity.

  3. We introduce the first baseline models that produce multiple answers to open-domain questions, with experiments showing their effectiveness in learning from our data while highlighting avenues for future work.

2 Related Work

Open-domain Question Answering requires a system to answer any factoid question based on evidence provided by a large text collection such as Wikipedia (Voorhees et al., 1999; Chen et al., 2017). Existing benchmarks include various kinds of questions, from open-ended information-seeking (Berant et al., 2013; Kwiatkowski et al., 2019; Clark et al., 2019) to more specialized trivia/quiz (Joshi et al., 2017; Dunn et al., 2017). To the best of our knowledge, all existing formulations of open-domain QA assume each question has a single clear answer.

Our work is built upon an open-domain version of Natural Questions (Kwiatkowski et al., 2019), denoted NQ-open, composed of questions posed by real users of Google search, each with an answer drawn from Wikipedia. NQ-open has promoted several recent advances in open-domain question answering (Lee et al., 2019; Asai et al., 2020; Min et al., 2019a, b; Guu et al., 2020; Karpukhin et al., 2020). Nonetheless, Kwiatkowski et al. (2019) report that the answers to such questions are often debatable, and the average agreement rate on NQ-open test data is 49.2%, in large part due to ambiguous questions. (The NQ-open test data is derived from Natural Questions development data, which has 5-way annotations; we compute their pairwise agreement based on string match.) In this paper, we embrace this ambiguity as inherent to information seeking open QA, and present the first methods for returning sets of answers paired with different interpretations of the question.

Clarification Questions have been used to study and mitigate question ambiguity in other related settings. Research on community question answering (Braslavski et al., 2017; Rao and Daumé III, 2018, 2019) models ambiguity that arises from common information gaps in the question, which are often specific to the community being studied.

Recently, Xu et al. (2019) study clarification of questions deliberately constructed to have pre-specified entity reference ambiguities. Aliannejadi et al. (2019) investigate using clarification questions to iteratively refine intents of simple queries where the information need is not immediately apparent (e.g., single keywords like dinosaur).

In contrast, we study open-domain factoid questions asked by real users: these present clear information needs, but carry diverse naturally occurring ambiguities, as shown in Table 1. Furthermore, instead of prolonging the user’s information-seeking session with clarification questions, our task formulation provides a complete and immediate solution with a set of disambiguated answers.

3 Task: AmbigQA

3.1 AmbigQA Setup

Figure 1 depicts the AmbigQA task. The input is a prompt question $q$, and the output is a list of $n$ question-answer pairs $(x_1, y_1), \dots, (x_n, y_n)$, where each $y_i$ is an equally plausible answer to $q$, and each $x_i$ is a minimally edited modification of $q$ whose answer is unambiguously $y_i$. We consider two subtasks.

Multiple Answer Prediction.

Given a question $q$, output a set of semantically distinct and equally plausible answers $y_1, \dots, y_n$, where $n$ is unknown.

Question Disambiguation.

Given $q$ and a set of answers $y_1, \dots, y_n$, generate disambiguated questions $x_1, \dots, x_n$, where each $x_i$ is a minimal edit of $q$ which makes it more specific so that $y_i$ is a correct answer and all $y_j$ for $j \neq i$ are incorrect. When $n = 1$, this task is trivial, as $x_1 = q$.

We choose to represent ambiguity with a set of disambiguated questions because it is well-defined, immediately human-interpretable, and allows for straightforward annotation of a wide range of ambiguities without complex guidelines.

3.2 Evaluation Metrics

To evaluate model performance, we present several ways to compare a model prediction with $m$ question-answer pairs $(x_1, y_1), \dots, (x_m, y_m)$ against a gold reference set with $n$ pairs $(\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_n, \bar{y}_n)$. Since there may be more than one way to refer to a single answer (e.g., Michael Jordan versus Michael Jeffrey Jordan), each gold answer $\bar{y}_i$ is a set of acceptable answer strings, where all $\bar{y}_i$ are disjoint.

We assign each predicted question-answer pair $(x_i, y_i)$ a correctness score $c_i = \max_{1 \le j \le n} \mathbb{I}[y_i \in \bar{y}_j] \, f(x_i, \bar{x}_j)$, based on a string similarity function $f$ valued in $[0, 1]$. Intuitively, $c_i$ counts correct answers by the similarity between the predicted and reference question, to account for the fact that there can be many acceptable paraphrases of each reference question. We calculate F1 treating the $c_i$ as measures of correctness:

$$\text{prec}_f = \frac{\sum_i c_i}{m}, \qquad \text{rec}_f = \frac{\sum_i c_i}{n}, \qquad \text{F1}_f = \frac{2 \cdot \text{prec}_f \cdot \text{rec}_f}{\text{prec}_f + \text{rec}_f}$$

We consider three choices of the question similarity function $f$. $\text{F1}_\text{ans}$ is the F1 score on answers only, where $f$ always yields 1. This may be used without the question disambiguation step. $\text{F1}_\text{BLEU}$ accounts for string similarity between questions, calculating $f$ with BLEU (Papineni et al., 2002). $\text{F1}_\text{EDIT-F1}$ uses Edit-F1 as $f$. Edit-F1 is a new measure that represents each disambiguated question by its added and deleted unigrams compared to the prompt question, and computes the F1 score between them. For example, consider the prompt question “Who made the play the crucible?”, the reference “Who wrote the play the crucible?” and the prediction “Who made the play the crucible in 2012?”. The gold edits (represented as multisets) here are $\{-\text{made}, +\text{wrote}\}$ while the predicted edits are $\{+\text{in}, +\text{2012}\}$. So even though the questions are similar, their Edit-F1 is zero. Unlike BLEU, which we use to directly measure similarity to the gold question, this metric only gives credit for getting the key semantic differences between the original question and the clarification correct.
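The metric description above can be made concrete with a small sketch. It is an illustration only, not the official evaluation script: the whitespace tokenization, the function names (`edits`, `edit_f1`, `f1_f`) and the handling of empty inputs are assumptions made here, and the aggregation follows the prose reading that each prediction is scored by its best-matching reference.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, drop question marks, split on whitespace.
    return text.lower().replace("?", " ").split()

def edits(prompt, question):
    """Multiset of added (+) and deleted (-) unigrams relative to the prompt."""
    p, q = Counter(tokenize(prompt)), Counter(tokenize(question))
    out = Counter()
    for tok, k in (q - p).items():
        out["+" + tok] = k
    for tok, k in (p - q).items():
        out["-" + tok] = k
    return out

def multiset_f1(pred, gold):
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

def edit_f1(prompt, predicted, reference):
    return multiset_f1(edits(prompt, predicted), edits(prompt, reference))

def f1_f(predictions, references, f):
    """predictions: list of (question, answer).
    references: list of (question, set of acceptable answer strings).
    f: similarity function on (predicted question, reference question)."""
    if not predictions or not references:
        return 0.0
    # Each prediction scores f against the reference whose answer set contains it.
    scores = [
        max((f(x, ref_x) if y in ref_ys else 0.0) for ref_x, ref_ys in references)
        for x, y in predictions
    ]
    total = sum(scores)
    if total == 0:
        return 0.0
    precision = total / len(predictions)
    recall = total / len(references)
    return 2 * precision * recall / (precision + recall)
```

Passing `lambda a, b: 1.0` as `f` gives the answers-only variant, while wrapping `edit_f1` with a fixed prompt gives the edit-based variant. On the crucible example from the text, `edit_f1` returns zero because the gold and predicted edit multisets share no element.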

4 Data: AmbigNQ

4.1 Data Collection

We construct AmbigNQ using prompt questions from NQ-open and English Wikipedia as the evidence corpus. We use Amazon Mechanical Turk for crowdsourcing.

The crucial challenge in our annotation task is maximizing recall: finding all possible distinct answers to a question. This is difficult even for humans, as ambiguities are often only apparent after carefully searching the evidence for multiple possible answers. However, we were able to collect high quality data with high levels of ambiguity using careful worker selection and an annotation pipeline with two stages: generation and validation.


Generation.

Workers in the first stage are given a prompt question and provided access to a search box that uses Google Search to return results restricted to English Wikipedia. They may write any number of search queries to find, read, and navigate Wikipedia pages in search of answers to the prompt question. By allowing annotators to find Wikipedia pages on their own, this closely approximates the real process people use to answer open-ended questions, an approach for which there is no existing large-scale dataset. (For instance, answers in NQ-open are annotated over pre-specified Wikipedia pages from the Google search engine.)

Workers annotate answers to the prompt question as free text, which we instruct them to copy and paste from Wikipedia. A single distinct answer may be annotated with multiple possible spans (e.g., Michael Jordan versus Michael Jeffrey Jordan). We ask workers to consider all possible user intents, or plausible answers to the question. For questions with multiple distinct answers, each answer must be annotated with a minimal edit of the prompt question which differentiates it from the other answers, in line with our task requirements.

As a special case, some questions contain temporal deixis which depends on the time of writing, e.g., “When does the new family guy season come out?” To avoid unmanageably large sets of answers, we instruct workers to remove the time-dependence by refining the prompt question to be based on up to the three most recent events before Jan 1, 2018, e.g., “When does family guy season 16 come out?” (example in Table 1).

Generators may skip the prompt question if the answer is not found in Wikipedia, or the question is ill-formed, too subjective or too ambiguous, e.g., “When did the new tax cuts go into effect?”

Split | # data | # QAs (%): 1 | 2 | 3 | 4+
Train | 10,036 | 53 | 24 | 14 | 10
Dev | 2,002 | 49 | 23 | 14 | 13
Test | 2,004 | 44 | 24 | 16 | 16
Table 2: Data statistics. For the number of QA pairs (# QAs), the minimum is taken when there is more than one accepted annotation.
Figure 2: Data analysis on the development data. (a) Number of unique Wikipedia pages visited by crowdworkers. (b) Number of search queries written by crowdworkers. (c) Word cloud of the edits made in questions, distinguishing added from deleted unigrams.


Validation.

Workers in the validation stage review the complete annotations provided by multiple generators. Validators mark each generator’s annotations as correct or incorrect; the correct ones are provided as gold references in the final dataset. If only some question-answer pairs from each generator were correct, the validator provides a new set of question-answer pairs by combining the valid ones from each generator and disambiguating the questions accordingly. In this case, only the new set of question-answer pairs is used as gold.

Just like in the generation stage, validators have access to a search engine over Wikipedia, and are additionally given the Wikipedia pages that the generators viewed in order to speed up the process. Validation is skipped when all annotated answers from all generators exactly match (37% of cases).

Quality control.

We recruit highly qualified workers through a qualification test (details in Appendix A). Although the task was difficult for most workers, we found that our highly qualified full-time workers, given quick and detailed feedback on their work, produced high accuracy and recall.

In the beginning of data collection, we used a more complex pipeline consisting of three generators and two validators. However, we eventually found that managing a small pool of highly qualified workers allowed us to collect high-quality data with a simpler pipeline and lower costs.

We pay 0.75 and 0.15 USD per prompt question for generation and validation, respectively. For the development and test data, we use two generators and one validator per prompt question. For the training data, we skip validation and only use one generator per prompt question.

Inter-annotator agreement.

Evaluating both generators against each other on the development set yields 60.8 $\text{F1}_\text{ans}$. All annotations passed validation for 76% of questions, while validators made changes (edits or exclusions) in the remaining 24%. This indicates that the task is hard even for humans, but the validation stage enables aggregation of work from multiple generators.

The average $\text{F1}_\text{ans}$ between the co-authors and workers on a sample of 50 validations was 89.0%. This indicates that, despite the intrinsic difficulty and the subjectivity of the task, humans agree on the boundary between valid and invalid answers in most cases.

4.2 Data Analysis

The final dataset contains 14,042 annotated examples. As shown in Table 2, over 50% of development and test examples contain multiple question-answer pairs. This indicates a surprisingly high rate of ambiguity in NQ-open, even though previous work has studied it with the assumption that each question has a single answer. We also find a discrepancy between development and test; this is likely due to the way in which NQ-open is constructed, which over-samples difficult questions in the test set (see Appendix B for details). In comparison to development and test, fewer training examples contain multiple question-answer pairs (47%), presumably because using only one worker per training example yielded slightly lower recall.

Types of ambiguity.

Table 1 shows a breakdown of the types of ambiguity, where a single example may fall into multiple categories. They are diverse, including ambiguity in entity references, event references, properties, and answer types, with a relatively uniform distribution between them. It is worth comparing with Xu et al. (2019), who intentionally elicited questions with ambiguous entity references; our analysis shows that unintended ambiguity comes from more diverse sources. In many cases, the ambiguity is not apparent from the prompt question alone, but only after researching the question on Wikipedia, as evidenced by differences in model performance (Section 6.2).

Annotator behavior.

Figures 2(a) and 2(b) show the number of unique Wikipedia pages visited from the search interface and the number of unique search queries written by workers during the annotation process. (The page count is actually an underestimate; we could not track when annotators viewed pages by following hyperlinks, because our interface opens Wikipedia in an iframe.) More often than not, workers used multiple search queries and navigated multiple Wikipedia pages. This shows how our setup captures ambiguity in the retrieval step of open-domain question answering, which is missed in annotation approaches that assume a pre-specified evidence document.

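procedure Democratic co-training with weak supervision($D_\text{full}$, $D_\text{partial}$)
    // Each question in $D_\text{full}$ has an answer list annotated
    // Each question in $D_\text{partial}$ has one answer annotated
    $\hat{D}_\text{full} \leftarrow D_\text{full}$
    for $\text{iter} \in \{1..N\}$ do
        for $i \in \{1..C\}$ do
            $\phi_i \leftarrow \text{train}(\hat{D}_\text{full})$ // Train sequence-to-sequence QA models
        $\hat{D}_\text{full} \leftarrow D_\text{full}$
        for $(q_j, y_j) \in D_\text{partial}$ do // Get predictions by using $y_j$ as prefix of the sequence-to-sequence output
            $\hat{Y}_{\phi_i} \leftarrow \phi_i(q_j \mid y_j)$ and $\hat{Y}_j \leftarrow \{\hat{y} \mid \hat{y} \neq y_j \text{ and } |\{i \mid \hat{y} \in \hat{Y}_{\phi_i}\}| > C/2\}$
            if $|\hat{Y}_j| > 0$ then // Add if there is an additional answer by the majority of the models
                $\hat{D}_\text{full} \leftarrow \hat{D}_\text{full} \cup \{(q_j, \{y_j\} \cup \hat{Y}_j)\}$
            else if $\forall i, \ |\hat{Y}_{\phi_i} - \{y_j\}| = 0$ then // Add if all models predict a single answer
                $\hat{D}_\text{full} \leftarrow \hat{D}_\text{full} \cup \{(q_j, \{y_j\})\}$
Algorithm 1: Democratic co-training with weak supervision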

Distribution of edits.

Figure 2(c) shows a word cloud of unigram edits made to questions in the development data, where we remove stopwords except wh-words and group numeric values by the number of digits. We find that adding a numeric value such as a year is common, as it is an easy way to disambiguate entity or event references and to remove time dependence. Edits to the wh-word are also common, especially for specifying the answer type (e.g., from “who” to “which group”; see Table 1). The distribution of edits is fairly long-tailed, with the top 100 most frequent edits covering 36% of the total, and the top 1,000 covering 69%.

Mismatches with NQ-open.

29.4% of our development examples do not include the NQ-open answer. We report a breakdown of a random sample of 50 such questions in Appendix C. In short, the mismatch cases do not indicate low precision or low recall; in many cases both AmbigNQ answers and NQ-open answers are correct (60%), or AmbigNQ answers include a better answer than NQ-open answers (32%).

5 Model

To set initial performance levels on AmbigNQ, we present a baseline AmbigQA model combining ideas from recent advances in open-domain QA (Karpukhin et al., 2020) and generative pretraining (Lewis et al., 2020). Given a prompt question $q$, our model predicts answers $y_1, \dots, y_n$ and generates corresponding questions $x_1, \dots, x_n$ conditioning on $q$, the answers $y_1, \dots, y_n$, and the passages containing them. A novel co-training step also allows the model to leverage the partial supervision available in NQ-open.

Multiple Answer Prediction.

Here we describe SpanSeqGen, our model for multiple answer prediction. Following Karpukhin et al. (2020), a state-of-the-art model on NQ-open, SpanSeqGen first retrieves 100 passages with a BERT-based (Devlin et al., 2019) dual encoder, and reranks them using a BERT-based cross encoder. Then, instead of predicting an answer span from the top 1 passage as Karpukhin et al. (2020) does, SpanSeqGen uses another sequence-to-sequence model based on BART (Lewis et al., 2020). Specifically, it conditions on the concatenation of $q$ and the top passages in order up to 1024 tokens, and sequentially generates distinct answers token-by-token, separated by [SEP], in the order in which they appear in the input passages. We pretrain SpanSeqGen on NQ-open and fine-tune it on AmbigNQ.
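The input and output format described above can be sketched as two small helpers. This is an illustration under stated assumptions rather than the authors' implementation: tokens are approximated by whitespace splitting instead of BART's subword vocabulary, and the names `build_input` and `parse_answers` are hypothetical.

```python
def build_input(question, passages, max_tokens=1024):
    """Concatenate the question with ranked passages until the token budget is spent.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    tokens = question.split()
    for passage in passages:
        remaining = max_tokens - len(tokens)
        if remaining <= 0:
            break
        # Truncate the last passage that only partially fits.
        tokens.extend(passage.split()[:remaining])
    return " ".join(tokens)

def parse_answers(generated, sep="[SEP]"):
    """Split a generated string on the separator, dropping blanks and repeats in order."""
    seen, answers = set(), []
    for piece in generated.split(sep):
        answer = piece.strip()
        if answer and answer not in seen:
            seen.add(answer)
            answers.append(answer)
    return answers
```

For instance, a decoded string such as `"1624 [SEP] 1664"` would be parsed into the two answers `1624` and `1664`.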

We develop SpanSeqGen primarily because Karpukhin et al. (2020) is designed for generating a single answer, but SpanSeqGen also boosts the performance on NQ-open (41.5 → 42.2 on the test data). We include ablations on different approaches and models in Section 6.2.

Question Disambiguation.

We design a question disambiguation (QD) model based on BART. The model generates each question $x_i$ ($i = 1, \dots, n$) conditioning on the concatenation of $q$, the target answer $y_i$, the other answers $y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_n$, and the top passages as used by SpanSeqGen. We pretrain on NQ-open to generate questions given an answer and passage, and then finetune it on the full task data in AmbigNQ. We include ablations on different variants of the model in Section 6.2.

Model | $\text{F1}_\text{ans}$ (all) dev | test | $\text{F1}_\text{ans}$ (multi) dev | test | $\text{F1}_\text{BLEU}$ dev | test | $\text{F1}_\text{EDIT-F1}$ dev | test
Disambig-first | 28.1 | 24.8 | 21.9 | 18.8 | 4.2 | 4.0 | 2.7 | 2.2
Thresholding + QD | 37.1 | 32.3 | 28.4 | 24.8 | 13.4 | 11.3 | 6.6 | 5.5
SpanSeqGen + QD | 39.7 | 33.5 | 29.3 | 24.5 | 13.4 | 11.4 | 7.2 | 5.8
SpanSeqGen† + QD | 41.2 | 35.2 | 29.8 | 24.5 | 13.6 | 10.6 | 7.4 | 5.7
SpanSeqGen† (Co-training) + QD | 42.3 | 35.9 | 31.7 | 26.0 | 14.3 | 11.5 | 8.0 | 6.3
Table 3: Results on the development and test data. all and multi indicate all examples and examples with multiple question-answer pairs only, respectively. QD indicates a question disambiguation model. † indicates ensemble.

Co-training with weak supervision.

Given the prevalence of unmarked ambiguity in NQ-open, we introduce a method for treating the original annotations as weak supervision, while learning to correct for some of the false negatives in the data. We modify a democratic co-training algorithm (Zhou and Goldman, 2004) as described in Algorithm 1. We iteratively grow the training set $\hat{D}_\text{full}$ from AmbigNQ ($D_\text{full}$) with silver data from NQ-open ($D_\text{partial}$) predicted by a majority of a set of $C$ SpanSeqGen models trained on $\hat{D}_\text{full}$. The key step is injecting the known answer $y_j$ from NQ-open as a prefix to SpanSeqGen’s output during prediction. (For generating silver data, we modify SpanSeqGen to decode answers in random order.) In each step, if a majority of the $C$ models predict an additional answer, we assume we have found a false negative and add the result to the training set $\hat{D}_\text{full}$. On the other hand, if all models predict no additional answer, we add the example to $\hat{D}_\text{full}$ with $y_j$ as a single answer.
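The per-question voting rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `silver_label`, the list-of-lists input and the choice to skip examples where the models disagree without a majority are assumptions made for the sketch, and the prefix-conditioned decoding is taken as already done.

```python
from collections import Counter

def silver_label(known_answer, model_predictions):
    """Decide how one NQ-open question enters the grown training set.

    model_predictions: one list of answers per model, each decoded with
    known_answer injected as the output prefix.
    Returns the answer list to add, or None to leave the example out.
    """
    extras = [set(preds) - {known_answer} for preds in model_predictions]
    votes = Counter(ans for extra in extras for ans in extra)
    majority = sorted(a for a, v in votes.items() if v > len(model_predictions) / 2)
    if majority:
        # A majority found an additional answer: treat it as a false negative.
        return [known_answer] + majority
    if all(not extra for extra in extras):
        # Every model predicts only the known answer: keep it as single-answer.
        return [known_answer]
    return None
```

With three models, an extra answer needs at least two votes to be added, and a question on which only one model proposes an extra answer is neither extended nor confirmed as single-answer.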

6 Experiments

We describe the baseline models used in our experiments, followed by results and ablations. Implementation details and hyperparameters of all models are provided in Appendix 


6.1 Baselines


Disambig-first.

This baseline disambiguates the prompt question without any context from plausible answers or reference passages. Specifically, it implements the following pipeline: (1) Feed the prompt question $q$ into a BERT-based binary classifier to determine whether it is ambiguous. (2) If $q$ is ambiguous, pass it into a BART-based model which generates a sequence of disambiguated questions $x_1, \dots, x_n$ ($n > 1$), separated by [SEP]; otherwise, consider only $x_1 = q$. (3) Feed each $x_i$ into a state-of-the-art model on NQ-open (Karpukhin et al., 2020) to produce its answer $y_i$.
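The three-step pipeline above can be sketched as a single function over pluggable components. This is only a control-flow illustration: `is_ambiguous`, `rewrite` and `answer` are hypothetical callables standing in for the BERT classifier, the BART rewriter and the single-answer QA model, none of which is implemented here.

```python
def disambig_first(prompt, is_ambiguous, rewrite, answer, sep="[SEP]"):
    """Classify, rewrite if ambiguous, then answer each resulting question.

    is_ambiguous(prompt) -> bool
    rewrite(prompt) -> str of disambiguated questions joined by sep
    answer(question) -> str
    """
    if is_ambiguous(prompt):
        questions = [q.strip() for q in rewrite(prompt).split(sep) if q.strip()]
    else:
        # Unambiguous prompts are answered as they are.
        questions = [prompt]
    return [(q, answer(q)) for q in questions]
```

Because the rewriter never sees passages or candidate answers, nothing in this flow can check a rewrite against the evidence.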

Thresholding + QD.

We also include a model based on Karpukhin et al. (2020), with thresholding for multiple answer prediction and our BART-based question disambiguation (QD) model. The Karpukhin et al. (2020) model outputs a passage selection probability for each passage, and probabilities for the start and end tokens of the answer span. We treat the product of these three probabilities as the span’s likelihood of being the answer, and obtain $y_1, \dots, y_n$ by taking valid spans with likelihood larger than a hyperparameter $\gamma$. The model is trained to maximize the marginal likelihood of any span in the gold answer set. As with SpanSeqGen, we pretrain on NQ-open and finetune on AmbigNQ. Finally, we produce disambiguated questions using our BART-based QD model (Section 5).
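The thresholding rule described above can be sketched in a few lines. This is an assumption-laden illustration: the tuple layout, the function name `select_answers` and the merging of repeated answer strings by their best score are choices made here, not details given in the text.

```python
def select_answers(spans, threshold):
    """Keep answer spans whose likelihood exceeds the threshold.

    spans: iterable of (text, p_passage, p_start, p_end), where the three
    probabilities are the passage selection, start token and end token scores.
    Returns distinct answer strings, most likely first.
    """
    best = {}
    for text, p_passage, p_start, p_end in spans:
        score = p_passage * p_start * p_end  # product of the three probabilities
        if score > threshold and score > best.get(text, 0.0):
            best[text] = score
    return sorted(best, key=best.get, reverse=True)
```

Raising the threshold trades recall on multi-answer questions for precision on single-answer ones, which is why it is treated as a tuned hyperparameter.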

Model | Full task $\text{F1}_\text{BLEU}$ | Full task $\text{F1}_\text{EDIT-F1}$ | Gold answers given $\text{F1}_\text{BLEU}$ | Gold answers given $\text{F1}_\text{EDIT-F1}$
QD model | 14.3 | 8.0 | 40.1 | 19.2
- prompt question | 6.7 | 7.7 | 15.1 | 19.2
- untargeted answers | 14.2 | 7.3 | 41.2 | 17.2
Always prompt question | 15.9 | 0.0 | 47.4 | 0.0
Table 4: Ablations on question disambiguation (development data, multiple answers only). QD model refers to the question disambiguation model described in Section 5. For multiple answer prediction, we use SpanSeqGen with co-training (Full task) or the gold answers (Gold answers given).
Model | NQ-open EM | $\text{F1}_\text{ans}$ (all) | $\text{F1}_\text{ans}$ (multi)
Development data
Min et al. (2019b) | 34.7 | 30.8 | 20.4
Asai et al. (2020) | 31.7 | 29.7 | 19.7
Karpukhin et al. (2020) | 39.8 | 35.2 | 26.5
SpanSeqGen | 42.0 | 36.4 | 24.8
Test data
Min et al. (2019b) | 34.5 | 27.5 | 17.0
Asai et al. (2020) | 32.6 | 27.9 | 17.7
Karpukhin et al. (2020) | 41.5 | 30.1 | 23.2
SpanSeqGen | 42.2 | 30.8 | 20.7
Table 5: Zero-shot performance on multiple answer prediction of the models trained on NQ-open. We report Exact Match (EM) on NQ-open and $\text{F1}_\text{ans}$ on AmbigNQ.
Prompt question #1: Where was snow white and the huntsman filmed?
Q: Where were beach scenes for snow white and huntsman predominantly filmed? / A: Marloes Sands Beach
Q: Where was principal photography for snow white and huntsman filmed? / A: United Kingdom
Q: Where was castle in snow white and huntsman filmed? / A: Gateholm island
Prediction of Disambig-first: ($\text{F1}_\text{ans}$=0.40, $\text{F1}_\text{EDIT-F1}$=0.00)
Q: Where was snow white and the huntsman filmed in 2017? / A: Marloes Sands Beach
Q: Where was snow white and the huntsman filmed during the filming of Season 1 of the TV series? / A: Marloes Sands Beach
Prediction of SpanSeqGen: ($\text{F1}_\text{ans}$=0.80, $\text{F1}_\text{EDIT-F1}$=0.69)
Q: Where was snow white and huntsman principal photography filmed / A: United Kingdom
Q: Where were beach scenes for snow white and huntsman mostly filmed / A: Marloes Sands Beach
Prompt question #2: When was the city of new york founded?
Q: When was city of new york founded by dutch and initially called new amsterdam? / A: 1624
Q: When was city of new york under english control and renamed to new york? / A: 1664
Prediction of SpanSeqGen: ($\text{F1}_\text{ans}$=1.00, $\text{F1}_\text{EDIT-F1}$=0.67)
Q: When was city of new york city founded with dutch protection? / A: 1624
Q: When was city of new york city founded and renamed with english name? / A: 1664
Prompt question #3: When does the fifty shades of grey come out?
Q: When did book fifty shades of grey come out? / A: June 20, 2011
Q: When did movie fifty shades of grey come out in los angeles? / A: February 9, 2015
Q: When did movie fifty shades of grey come out all over us? / A: February 13, 2015
Prediction of SpanSeqGen: ($\text{F1}_\text{ans}$=0.40, $\text{F1}_\text{EDIT-F1}$=0.21)
Q: When does movie fifty shades of grey (2015) come out in imax across us? / A: February 13, 2015
Q: When does film 50 shades of grey (2018) come out in united states? / A: February 9, 2018
Table 6: Model predictions on samples from the development data. (#1) Disambig-first generates questions that look reasonable on the surface but don’t match the facts. SpanSeqGen produces reasonable, although not perfect, answers and questions. (#2) SpanSeqGen produces correct answers and questions. (#3) SpanSeqGen produces the incorrect answer “February 9, 2018”, which is the release date of Fifty Shades Freed.

6.2 Results

Table 3 reports the performance of our baselines; example model outputs are provided in Table 6.

Main results.

We first find that Disambig-first is significantly worse than other models. In particular, classification accuracy on whether the prompt question is ambiguous is 67%, close to the majority baseline (60%). (The discrepancy between 60% for the majority baseline and 51% listed as ambiguous in Table 2 is because 9% have multiple reference answer sets that include both single-answer annotations and multiple question-answer annotations. These were considered to have a single answer in Table 2.) When the model does identify an ambiguous question, its rewrites often look reasonable on the surface, but do not match the facts. For instance, in example 1 of Table 6, it asks about filming in 2017 and during season 1 for Snow White and the Huntsman, which was actually a film released in 2012. This shows that reading evidence documents is crucial for identifying and characterizing ambiguities.

While SpanSeqGen outperforms Karpukhin et al. (2020) with thresholding, the difference is not as great as we expected. This suggests two things. First, thresholding may be a surprisingly effective baseline for outputting multiple answers, even though the answers must compete with each other for probability mass in order to surpass the threshold. Second, maximizing likelihood in a sequence-to-sequence model like SpanSeqGen may not produce well-calibrated results for question answering. For instance, the model seems to suffer due to variation in the length of the output sequence, outputting shorter sequences on average (3.0 tokens) than gold (6.7). This problem has also been reported in other conditional generation tasks (Sountsov and Sarawagi, 2016; Stahlberg and Byrne, 2019); we leave it for future work.

Overall, SpanSeqGen achieves reasonable $\text{F1}_\text{ans}$ scores. $\text{F1}_\text{ans}$ on examples with multiple question-answer pairs ($\text{F1}_\text{ans}$ (multi)) is lower, indicating that predicting all plausible answers is more challenging than predicting a single answer, as expected. SpanSeqGen also obtains the best performance in $\text{F1}_\text{BLEU}$ and $\text{F1}_\text{EDIT-F1}$, although their absolute values are low in general; we investigate this in our question disambiguation ablations below. See Table 6 for example predictions and error cases.

There is a substantial difference in performance between development and test overall, likely due to distributional differences in the original questions in NQ-open; detailed discussion is in Appendix B.

Effect of co-training.

To see the effect of our co-training method, we compare with a naive ensemble, as co-training also requires multiple trained models. While we see gains from ensembling alone, an ensemble trained with the co-training method achieves the best performance on all metrics. This result demonstrates the importance of jointly using AmbigNQ and partial supervision from NQ-open.

Ablations on question disambiguation.

Table 4 reports results of an ablation experiment on question disambiguation. Among our ablations, we include models without the prompt question or untargeted answers as input, and a naive baseline that always outputs the prompt question. We report the metrics both on the full task and with the gold answers given, to measure performance that depends on and is independent of multiple answer prediction, respectively. [Footnote 8: For the $\mathrm{F1}_{\text{BLEU}}$ and $\mathrm{F1}_{\text{EDIT-F1}}$ scores, errors do not propagate through the pipeline in the normal way. For instance, if a model correctly predicts one out of two answers, it will not perform any edits to the question, likely resulting in an $\mathrm{F1}_{\text{EDIT-F1}}$ of zero.]

Simply copying the prompt question gives high $\mathrm{F1}_{\text{BLEU}}$, which is natural since the questions were disambiguated using minimal edits. This justifies using $\mathrm{F1}_{\text{EDIT-F1}}$ to evaluate semantic differences from the prompt question. In addition, we find that our QD model conditioned on all available context is better than other variants in overall metrics.

Performance is low overall, even given the gold answers, highlighting the challenge of the task. We think there are two major reasons. First, maximizing the likelihood of the output sequence can miss the importance of edits to the prompt question, leading the QD model to miss the information that is most important to differentiate one answer from the others. Second, there is a lack of annotated data, especially for question disambiguation which does not benefit from weakly supervised learning with NQ-open; future work can explore how to maximize the use of supervision from other available data. It is also worth noting that the metric may miss edits that are semantically correct, but phrased differently (see Table 6, example 2).

6.3 Zero-shot multiple answer prediction

Since AmbigNQ provides an evaluation set with explicit sets of multiple answers, we can also test if models trained on partial supervision only (NQ-open) are capable of producing full answer sets. This may be important for modeling in domains where single-answer datasets are available but full annotations like in AmbigNQ are not. To this end, we present a zero-shot setting where a system predicts multiple distinct answers without using AmbigNQ training data. We include four NQ-open models including ours, consisting of diverse approaches and model architectures, as baselines. These models, when trained on NQ-open, may be made to predict multiple answers via thresholding as described in Section 6.1. [Footnote 9: We allow using development data to tune the threshold, although this arguably makes our setting not zero-shot in the strictest sense.] Table 5 reports zero-shot performance. Although Min et al. (2019b) outperforms Asai et al. (2020) on NQ-open, it is worse on AmbigNQ. In addition, although SpanSeqGen outperforms Karpukhin et al. (2020) in the standard setting, it is worse in zero-shot $\mathrm{F1}_{\text{ans}}$ (multi), potentially because thresholding exacerbates the problems that SpanSeqGen has with long sequences (Section 6.2).

7 Conclusion & Future Work

We introduced AmbigQA, a new task that involves providing multiple possible answers to a potentially ambiguous open-domain question, and providing a disambiguated question corresponding to each answer. We constructed AmbigNQ, a dataset with 14,042 annotations on NQ-open questions. Our analysis shows the dataset contains diverse types of ambiguity, often not visible from the prompt question alone but only found upon reading evidence documents. We also introduced a first baseline model for producing multiple answers to open-domain questions, with experiments showing its effectiveness in learning from our data while highlighting avenues for future work. Future work may investigate (1) more effective ways of dealing with highly ambiguous questions (e.g., returning tables or other structures), (2) providing information related to the inferred information need when no answers are found, or (3) dealing with ill-formed questions.


Appendix A Data Collection Details

We use Amazon Mechanical Turk and Spacro (Michael et al., 2018) for crowdsourcing. All data was collected in February and March of 2020. We use Google Search restricted to English Wikipedia as the search tool.

Crowdsourcing interface.

Figure 3 shows the interface used for generation and validation. We use an iframe to render Wikipedia pages in a mobile view, in order to provide the document format that workers are familiar with, rather than plain text with no formatting. When workers write the questions and the answers in the generation stage, we show appropriate error messages (e.g., when the written question is the same as the prompt question) or warning messages (e.g., when the answer is composed of more than 20 words) in order to give tight feedback.

Quality control.

We only recruit full-time workers who are dedicated to our task. We were able to recruit full-time workers by requiring a minimum number of HITs that can be achieved by working 40 hours a week. We also host a public website where workers can monitor the validation status of their annotations, ask questions about examples whose validated result they do not understand, or contest validations that they believe are incorrect. We found it very useful to communicate with workers, give feedback, and fix incorrect annotations.

Inter-annotator agreement.

When two independent generators are evaluated on the answer list from each other, they obtain 60.8 $\mathrm{F1}_{\text{ans}}$. Specifically, for 76% of questions, all annotations passed validation, either automatically because they exactly matched (37%) or because they were both accepted by validators (39%). In the remaining 24% of cases, one annotator missed a possible question-answer pair that the other one found, or included an invalid question-answer pair.

To assess validation quality, two co-authors annotated a random sample of 50 validations. The average $\mathrm{F1}_{\text{ans}}$ between the co-authors and workers was 89.0%.

Appendix B Discrepancy between development and test in NQ-open

In our experiments on AmbigNQ, we found a significant discrepancy between the development and test sets. Upon further investigation, we identified that this is at least in part due to a distributional difference between the development and test sets of NQ-open, upon which we built the data. As this may be important for other researchers working on NQ-open, we detail our findings here.

Following Lee et al. (2019), NQ-open is constructed by filtering Natural Questions to questions where at least one annotator provided a non-null short answer to the question. [Footnote 13: Natural Questions annotators answered each question with a set of short answers, which could be empty if there was no reasonable short answer. We refer to the empty cases as null answers. See Kwiatkowski et al. (2019) for details.] While the training and development sets of NQ-open were all drawn from the training set of Natural Questions, in which one annotator answered each question, the test set of NQ-open is taken from the development set of Natural Questions, which had five annotators per question.

This difference in the number of annotators introduces a sampling bias: questions for which an annotator is less likely to find an answer are overrepresented in the NQ-open test set, in comparison to training and development. Suppose, for example, that a randomly sampled annotator has a 50% chance of producing a short answer for some question $q$. Then $q$ has a 50% chance of making it into NQ-open’s development set, but a ($1 - 0.5^5 \approx$) 97% chance of making it into test. Concretely, when each annotator is considered independently, 34.6% of the short answer annotations in the test set of NQ-open are null answers, and the majority of annotations are null for 33.9% of questions.
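The sampling bias can be checked with a short calculation. The sketch below assumes annotators answer independently with the same probability, which is the simplification used in the example above; the function name is illustrative.

```python
def inclusion_probability(p_answer: float, n_annotators: int) -> float:
    """Probability that at least one of n independent annotators gives a
    non-null short answer, so that the question is kept in NQ-open."""
    return 1.0 - (1.0 - p_answer) ** n_annotators


# One annotator (NQ-open train/dev) versus five annotators (NQ-open test).
for n in (1, 5):
    print(n, f"{inclusion_probability(0.5, n):.1%}")

# Harder questions are affected more strongly: with a 20% chance per annotator,
# the question is far more likely to appear in test than in development.
print(f"{inclusion_probability(0.2, 1):.1%}", f"{inclusion_probability(0.2, 5):.1%}")
```

The gap between the one-annotator and five-annotator probabilities widens as the per-annotator answer rate drops, which is why hard-to-answer questions are overrepresented in the test set.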

| Model | Any (dev) | Any (test) | First (dev) | First (test) |
| --- | --- | --- | --- | --- |
| Min et al. (2019b) | 34.7 | 34.5 | 32.4 | 25.7 |
| Asai et al. (2020) | 31.7 | 32.6 | 28.9 | 23.8 |
| Karpukhin et al. (2020) | 39.8 | 41.5 | 37.0 | 29.8 |
| SpanSeqGen | 42.0 | 42.2 | 38.8 | 31.1 |

Table 7: Exact Match (EM) on NQ-open of different models, counting a prediction as correct if it matches Any gold reference, or only the First non-null one.

As a consequence, there is a significant gap in model performance between development and test when they are evaluated under the same conditions. The official evaluation protocol for NQ-open counts a prediction as correct if it matches any of the gold reference answers. Under these conditions, the gap between development and test appears marginal (Table 7, first two columns). However, as the NQ-open test set was more comprehensively annotated than development, it has a more generous evaluation; the average number of unique reference answers is 1.2 and 1.8 on development and test, respectively. In order to make the evaluation more consistent, we try evaluating models against the first reference answer only, and find a significant gap between development and test (5–8%) across all models (Table 7, last two columns). [Footnote 14: It is unlikely that this discrepancy is due to overfitting on development, because the effect is consistent across models and not present on the other datasets that they are evaluated on.]
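The two evaluation protocols compared in Table 7 can be sketched as follows. The normalization here (lowercasing and collapsing whitespace) is a simplifying assumption; the official NQ-open scorer applies its own answer normalization, and the function names are illustrative.

```python
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Simplified normalization (assumption): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match_any(prediction: str, references: Sequence[Optional[str]]) -> bool:
    """Correct if the prediction matches any non-null reference answer."""
    return any(
        normalize(prediction) == normalize(ref) for ref in references if ref is not None
    )


def exact_match_first(prediction: str, references: Sequence[Optional[str]]) -> bool:
    """Correct only if the prediction matches the first non-null reference."""
    for ref in references:
        if ref is not None:
            return normalize(prediction) == normalize(ref)
    return False


# None stands for a null annotation from one annotator.
refs = [None, "Gerald Scarfe", "Gerald Anthony Scarfe"]
print(exact_match_any("gerald anthony scarfe", refs))
print(exact_match_first("gerald anthony scarfe", refs))
```

With several references per question, the "Any" protocol accepts predictions that the "First" protocol rejects, so a test set with more references per question is scored more generously.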

Despite this discrepancy, AmbigNQ follows the setup and data split from NQ-open, providing consistency with prior work. Since the AmbigNQ development and test sets were annotated under the same conditions, this discrepancy now shows up in the metrics. We leave the distribution shift of questions on the test data as one of the challenges on AmbigNQ.

Appendix C Data Analysis Details

Mismatches with NQ-open.

29.4% of AmbigNQ development examples do not include the NQ-open answer. We analyze a random sample of 50 such questions, and present a breakdown in Table 8. We find that our answers are correct in 92% of cases, including 44% of disagreements due to mismatched spans, 22% due to the NQ-open answer being incorrect, and 14% due to time-dependence in the question. Of the 8% of cases where our answer is incorrect, the NQ-open answers are also incorrect over half the time, indicating that these may be difficult questions.

(a) Interface in the generation stage when the workers write a query and see the search results.
(b) Interface in the generation stage when the workers click and read one of Wikipedia pages from the search results.
(c) Interface in the validation stage when the workers are given annotations from two generation workers and click the Wikipedia page that the generation workers have read.
Figure 3: Interface for crowdsourcing.
Answer span mismatch (44%)
Q: Who did the artwork for pink floyd’s wall?
NQ-open answer: Gerald Anthony Scarfe
AmbigNQ answer:
Q: Who did the art work for the album cover of Pink Floyd’s The Wall? / A: Gerald Scarfe
Q: Who was the cinematographer for Pink Floyd - The Wall (1982 film)? / A: Peter Biziou
NQ-open answer incorporated as a question (2%)
Q: What award did leonardo dicaprio won for the revenant?
NQ-open answer: BAFTA Award; Academy Award for Best Actor; Golden Globe Award
AmbigNQ answer:
Q: What British Academy Film Awards award did leonardo dicaprio won for the revenant? / A: Best Actor in a Leading Role
Q: What Academy award did leonardo dicaprio won for the revenant? / A: Best Actor
Q: What Golden Globe award did leonardo dicaprio won for the revenant? / A: Best Actor in a Motion Picture – Drama
(Other question-answer pairs omitted)
NQ-open answer less specific (10%)
Q: When was the nba 3 point line introduced?
NQ-open answer: 1979
AmbigNQ answer: June 1979
NQ-open answer incorrect and our answers include all possible answers (22%)
Q: Who was inducted into the national inventors hall of fame first?
NQ-open answer: John Fitch
AmbigNQ answer: Thomas Edison
Comment: Thomas Edison inducted in 1973, John Fitch inducted in 2006. John Fitch is mentioned as the earliest born
Mismatch from time-dependence (14%)
Q: Who has the most home runs in the home run derby?
NQ-open answer: Todd Frazier
AmbigNQ answer:
Q: Who has the most home runs in the the TV show the home run derby? / A: Mickey Mantle; Mickey Charles Mantle
Q: Who has the most home runs in the annual competition the home run derby? / A: Joc Russell Pederson; Joc Pederson
NQ-open answer is reasonable and our answers miss it (4%)
Q: Who was the first person to settle dodge city?
NQ-open answer: civilians
AmbigNQ answer: Henry J. Sitler
NQ-open answer incorrect but our answers miss another possible answer (4%)
Q: In which year were chips used inside the computer for the first time?
NQ-open answer: 1975
AmbigNQ answer: 1962
Comment: The years that the chips were used for the first time in the prototype and the production are 1962 and 1974, respectively, and can be both
Table 8: Breakdown of cases in which the NQ-open answer is not included in the AmbigNQ answers.

Appendix D Baseline Implementation Details

Evidence corpus.

We use the English Wikipedia dumps from 2018-12-20 and 2020-01-20 for NQ-open and AmbigNQ, respectively. Following Karpukhin et al. (2020), we take the plain text and split it into passages of up to 100 words each.
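A minimal sketch of this kind of passage splitting is shown below. It assumes whitespace tokenization and non-overlapping chunks; the actual preprocessing of Karpukhin et al. (2020) may differ in details such as how article titles and boundaries are handled.

```python
from typing import List


def split_into_passages(text: str, max_words: int = 100) -> List[str]:
    """Split plain text into consecutive passages of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Toy document of 250 whitespace-separated tokens.
document = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(document)
print([len(p.split()) for p in passages])
```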

Model implementation.

All models are implemented in PyTorch (Paszke et al., 2017), PyTorch-Transformers (Wolf et al., 2019) (for BERT) and fairseq (Ott et al., 2019) (for BART). We use BERT and BART for all models. For any process in which we follow Karpukhin et al. (2020), we use exactly the same setup and hyperparameters. For the passage retrieval through a dual encoder, we use the provided multi-setting trained model. For all BART-based models, we follow the default hyperparameters from the BART summarization code in fairseq, using one 32GB GPU. For finetuning, we change the learning rate to be on both tasks. We use beam search for decoding the sequence. We train the model for epochs (when trained on NQ-open or pseudo-labelled data) or epochs (when trained on AmbigNQ), and take the best checkpoint based on the development data. Note that the perplexity of the output sequence does not correlate with the metric of interest (Exact Match,  or ) as briefly discussed in Section 6.2, so using the metric of interest instead of perplexity is important for hyperparameter tuning or the choice of the best checkpoint.

Details in ensemble and co-training.

We use an ensemble based on voting; the answers that are predicted by the highest number of models are chosen as the final answers. The number of models used in ensemble () is before co-training and after co-training. For co-training, we use and , where is the number of iterations and is the number of models, in line with Algorithm 1. The choice of is determined by taking the best combination of the models as follows. We train sixteen different models, using different hyperparameters including checkpoints from NQ-open, learning rates, the order of the answers in the output sequence and the random seed. We then measure the development  on different combinations of the models with varying () and take the best one.
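The voting rule can be illustrated with a short sketch. It reads "predicted by the highest number of models" as keeping every answer that ties for the maximum vote count, and it compares answers by exact string; both choices are assumptions of this sketch rather than details given above.

```python
from collections import Counter
from typing import List, Sequence


def vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Return the answers predicted by the largest number of models.

    `predictions` holds one answer list per model. Each model votes at most
    once per answer; ties for the top count are all kept, in sorted order.
    """
    counts = Counter()
    for model_answers in predictions:
        counts.update(set(model_answers))
    if not counts:
        return []
    top = max(counts.values())
    return sorted(answer for answer, count in counts.items() if count == top)


# Hypothetical answer sets from three models for one question.
model_outputs = [["1979", "June 1979"], ["1979"], ["1979", "1980"]]
print(vote(model_outputs))
```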