Tackling Multi-Answer Open-Domain Questions via a Recall-then-Verify Framework

10/16/2021
by   Zhihong Shao, et al.
Tsinghua University

Open-domain questions are likely to be open-ended and ambiguous, leading to multiple valid answers. Existing approaches typically adopt the rerank-then-read framework, where a reader reads top-ranking evidence to predict answers. According to our empirical analyses, this framework faces three problems: to leverage the power of a large reader, the reranker is forced to select only a few relevant passages that cover diverse answers, which is non-trivial because its effect on the reader's performance is unknown beforehand; the small reading budget also prevents the reader from making use of valuable retrieved evidence filtered out by the reranker; and because the reader generates predictions all at once based on all selected evidence, it may learn pathological dependencies among answers, i.e., whether to predict an answer may also depend on evidence of the other answers. To avoid these problems, we propose to tackle multi-answer open-domain questions with a recall-then-verify framework, which separates the reasoning process of each answer so that we can make better use of retrieved evidence while also leveraging the power of large models under the same memory constraint. Our framework achieves new state-of-the-art results on two multi-answer datasets, and predicts significantly more gold answers than a rerank-then-read system with an oracle reranker.


1 Introduction

Open-domain question answering Voorhees (1999); Chen et al. (2017) is a long-standing task where a question answering system searches a large-scale corpus to answer information-seeking questions. Previous work typically assumes that there is only one well-defined answer for each question, or only requires systems to predict one correct answer, which largely simplifies the task. In practice, humans may lack sufficient knowledge or patience to frame very specific information-seeking questions, leading to open-ended and ambiguous questions with multiple valid answers. According to Min et al. (2020b), over 50% of a sampled set of Google search queries Kwiatkowski et al. (2019) are ambiguous. Table 1 shows an example with at least three interpretations. As can be seen from this example, the number of valid answers depends on both the question and the relevant evidence, which challenges a system's ability to comprehensively exploit evidence from a large-scale corpus.

Original Question: When did [You Don’t Know Jack] come out?
Interpretation #1: When did the first video game called [You Don’t Know Jack] come out?
Evidence #1: You Don’t Know Jack is a video game released in 1995, and the first release in …
Answer #1: 1995
Interpretation #2: When did the Facebook game [You Don’t Know Jack] come out on Facebook?
Evidence #2: In 2012, Jackbox Games developed and published a social version of the game on Facebook …
Answer #2: 2012
Interpretation #3: When did the film [You Don’t Know Jack] come out?
Evidence #3: “You Don’t Know Jack” premiered April 24, 2010 on HBO.
Answer #3: April 24, 2010
Table 1: An example of a multi-answer open-domain question. We display only a subset of the valid answers; in fact, [You Don’t Know Jack] can also be a song.

Existing approaches mostly adopt the rerank-then-read framework. A retriever retrieves hundreds or thousands of relevant passages, which are then reranked by a reranker; a reader predicts all answers in sequence conditioned on the top-ranking passages. With a fixed memory constraint (following Min et al. (2021), we constrain memory usage, which is usually a performance bottleneck in open-domain question answering), there is a trade-off between the size of the reader and the number of passages the reader can process at a time. According to Min et al. (2021), provided that the reranker is capable of selecting a small set of highly relevant passages with high coverage of diverse answers, a larger reader can outperform a smaller reader that uses more passages. However, as shown in Section 4.4, this framework faces three problems: first, due to the small reading budget, the reranker has to balance relevance and diversity, which is non-trivial because it is often unknown beforehand how many or which relevant passages are sufficient for the reader to predict a particular answer; second, the reader has no access to retrieved evidence that may be valuable but is filtered out by the reranker, even though combining information from more passages has been found to benefit open-domain QA Izacard and Grave (2021b); third, as the reader predicts all answers in sequence at once, it tends to under-generate answers, partly due to the high variance of the number of valid answers Min et al. (2020b, 2021); the reader also seems to learn pathological dependencies among answers, i.e., whether to predict an answer may also depend on passages that cover some other answer(s), whereas ideally, the prediction of a particular answer should depend on the soundness of its associated evidence alone.

To avoid the problems above, we propose to tackle multi-answer open-domain questions with a recall-then-verify framework. Specifically, we first use an answer recaller to extract a possible answer from each retrieved passage. Because the recaller is not distracted by evidence of other answer(s), this can be done with high recall, even with a weak answer predictor. However, due to insufficient evidence, most of these recalled answers are invalid. We then aggregate retrieved evidence relevant to each answer candidate, and verify the candidate with a large answer verifier. Through this task reformulation, our framework makes better use of retrieved evidence under the same memory constraint.

Our contributions are summarized as follows:

  • We empirically analyze the problems faced by the rerank-then-read framework when used for multi-answer open-domain QA.

  • To avoid the problems of the rerank-then-read framework, we propose to tackle multi-answer open-domain questions via a recall-then-verify framework, which makes better use of retrieved evidence while also leveraging the power of large models under the same memory constraint.

  • We conducted experiments on two multi-answer QA datasets and achieved state-of-the-art results.

2 Related Work

Open-domain question answering requires question answering systems to answer factoid questions by searching for evidence from a large-scale corpus such as Wikipedia Voorhees (1999); Chen et al. (2017). A wealth of benchmarks has greatly promoted progress in this area, such as questions from real users like NQ Kwiatkowski et al. (2019) and WebQuestions Berant et al. (2013), and trivia questions like Quasar-T Dhingra et al. (2017) and TriviaQA Joshi et al. (2017). All these benchmarks either assume that each question has only one answer with several alternative surface forms, or only require a system to predict one valid answer. A typical question answering system is a pipeline as follows: an efficient retriever retrieves relevant passages using sparse Mao et al. (2021); Zhao et al. (2021) or dense Karpukhin et al. (2020); Xiong et al. (2021); Izacard and Grave (2021a); Khattab et al. (2021) representations; an optional passage reranker Asadi and Lin (2013); Nogueira and Cho (2019); Nogueira et al. (2020) further narrows down the evidence; an extractive or generative reader Izacard and Grave (2021b); Cheng et al. (2021) predicts an answer conditioned on the retrieved or top-ranking passages. Nearly all previous work focused on locating passages covering at least one answer, or tried to predict one answer precisely.

However, both Kwiatkowski et al. (2019) and Min et al. (2020b) reported that there is genuine ambiguity in open-domain questions, resulting in multiple valid answers. To study the challenge of finding all valid answers to open-domain questions, Min et al. (2020b) proposed a new benchmark called AmbigQA where each question is annotated with as many answers as possible. In this new task, the passage reranker becomes more vital in the rerank-then-read framework, particularly when only a few passages can be fed to a large reader due to memory constraints. This is because the reranker has to ensure that the top-ranking passages are not only highly relevant but also cover diverse answers. Despite state-of-the-art performance on AmbigQA Min et al. (2021), according to our empirical analyses, applying the rerank-then-read framework to multi-answer open-domain QA faces the following problems: balancing relevance and diversity is non-trivial for the reranker because of its unknown effect on the performance of the subsequent reader; using a large reader may prevent it from making good use of all retrieved evidence under memory constraints; and as the reader generates all answers in sequence, it tends to suffer from the high variance of the number of valid answers and learns pathological dependencies among answers. Therefore, we propose to tackle this task with a recall-then-verify framework, which leverages the power of large models while also making better use of retrieved evidence under the same memory constraint.

Some previous work argued that readers can be confused by similar but spurious passages, resulting in wrong predictions. Therefore, they proposed answer rerankers Wang et al. (2018a, b); Hu et al. (2019); Iyer et al. (2021) to rerank top predictions from readers. Our framework is related to answer reranking but with two main differences. First, a reader typically aggregates available evidence and already does a decent job of answer prediction even without answer reranking; an answer reranker is introduced to filter out hard false positive predictions from the reader. By contrast, our answer recaller aims at finding all possible answers with high recall, most of which are invalid answers. Evidence focused on each answer is then aggregated and reasoned about by our answer verifier. It is also possible to introduce another model analogous to an answer reranker to filter out false positive predictions from our answer verifier. Second, answer reranking typically compares answer candidates to determine the most relevant one, while our answer verifier selects multiple valid answers mainly based on the soundness of their respective evidence but without comparisons among answer candidates.

3 Task Formulation

Multi-answer open-domain QA can be formally defined as follows: given an open-ended question q, a question answering system is required to make use of evidence from a large-scale text corpus and predict the set of all valid answers to q. Questions and their corresponding answer sets are provided for training.

Evaluation

To evaluate passage retrieval and reranking, we adopt the MRecall@k metric from Min et al. (2021), which measures whether the top-k passages cover at least k distinct answers (or all answers if the total number of answers is less than k). To evaluate question answering performance, we follow Min et al. (2020b) and use the F1 score between gold answers and predicted ones.
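
For concreteness, the following is a minimal Python sketch of the two metrics under a simplifying assumption of exact (case-insensitive) string matching; the official AmbigQA evaluation additionally accepts alternative surface forms of each gold answer, which is omitted here.

```python
# MRecall@k over top-ranked passages and a set-level F1 between predicted and
# gold answers, using case-insensitive exact string matching (a simplification).
from typing import List, Set

def mrecall_at_k(ranked_passages: List[str], gold_answers: Set[str], k: int) -> bool:
    topk = ranked_passages[:k]
    covered = {a for a in gold_answers if any(a.lower() in p.lower() for p in topk)}
    # Success if the top-k passages cover at least min(k, |gold|) distinct answers.
    return len(covered) >= min(k, len(gold_answers))

def answer_set_f1(predicted: Set[str], gold: Set[str]) -> float:
    predicted = {a.lower() for a in predicted}
    gold = {a.lower() for a in gold}
    if not predicted or not gold:
        return 0.0
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```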

4 Rerank-then-Read Framework

Figure 1: Analyses of how well OPR (the reranker of a rerank-then-read pipeline) balances relevance and diversity on questions with multiple answers in the dev set of AmbigQA. The number of retrieved passages is N=100, and the number of passages selected by the reranker is K=10. Figures 1(a) and 1(b) show the ratio of answers with different numbers of supporting passages, for gold answers that are missed and predicted by the reader, respectively. Figure 1(c) shows the ratio of retrieved supporting passages that are eventually used by the reader.

In this section, we will briefly introduce the representative and state-of-the-art rerank-then-read pipeline from Min et al. (2021) for multi-answer open-domain questions, and provide empirical analyses of this framework.

4.1 Passage Retrieval

Dense retrieval is widely adopted by open-domain question answering systems Min et al. (2020a). A dense retriever measures the relevance of a passage to a question by computing the dot product of their semantic vectors, encoded by a passage encoder and a question encoder, respectively. Given a question, a set of the N most relevant passages, denoted as P, is retrieved for subsequent processing.
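
The scoring step can be illustrated with a few lines of NumPy; the random vectors below merely stand in for embeddings produced by trained question and passage encoders, which are not shown.

```python
# Dot-product retrieval over precomputed passage vectors (embeddings are stand-ins).
import numpy as np

def retrieve(question_vec: np.ndarray, passage_vecs: np.ndarray, n: int = 100) -> np.ndarray:
    """Return the indices of the n passages with the highest dot-product scores."""
    scores = passage_vecs @ question_vec        # shape: (num_passages,)
    return np.argsort(-scores)[:n]

rng = np.random.default_rng(0)
q_vec = rng.normal(size=768)                    # question embedding (stand-in)
p_vecs = rng.normal(size=(10_000, 768))         # passage embeddings (stand-in)
top_ids = retrieve(q_vec, p_vecs, n=100)
```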

4.2 Passage Reranker

To improve the quality of evidence, previous work Nogueira et al. (2020); Gao et al. (2021) has found it effective to use a passage reranker, which is more expressive than a passage retriever, to rerank the retrieved passages and select the best K (K ≪ N) to feed a reader for answer generation. With a fixed memory constraint, there is a trade-off between the number of selected passages and the size of the reader. As shown by Min et al. (2021), with good reranking, using a larger reader is more beneficial. To balance relevance and diversity of evidence, Min et al. (2021) proposed a passage reranker called JPR that jointly models the selected passages. Specifically, they utilized T5 Raffel et al. (2020) to encode retrieved passages following Izacard and Grave (2021b) and decode the indices of selected passages autoregressively. JPR is trained to seek passages that cover new answers. To better balance relevance and diversity, especially when there are fewer than K answers for the question, Min et al. (2021) also proposed a tree-decoding algorithm that gives JPR the flexibility to select more passages covering the same answer.
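
The sketch below is not JPR itself (JPR decodes passage indices autoregressively with T5); it is a simplified greedy heuristic meant only to illustrate the relevance-diversity trade-off a reranker has to negotiate. The answers_in_passage callable is a hypothetical stand-in for whatever answer-coverage signal is available (gold answers in an oracle setting, predicted coverage otherwise).

```python
# Greedy selection preferring passages that cover a not-yet-covered answer,
# falling back to retrieval order. Illustrative only; not the JPR algorithm.
from typing import Callable, List, Set

def greedy_diverse_select(passages: List[str], k: int,
                          answers_in_passage: Callable[[str], Set[str]]) -> List[str]:
    selected, covered = [], set()
    remaining = list(passages)              # assumed sorted by retrieval score
    while remaining and len(selected) < k:
        # Prefer the highest-ranked passage that covers a new answer, if any.
        novel = [p for p in remaining if answers_in_passage(p) - covered]
        pick = novel[0] if novel else remaining[0]
        selected.append(pick)
        covered |= answers_in_passage(pick)
        remaining.remove(pick)
    return selected
```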

4.3 Reader

A reader takes as input the K top-ranking passages and predicts answers for the question. Min et al. (2021) adopted a generative encoder-decoder reader initialized with T5-3b Raffel et al. (2020) and used the fusion-in-decoder method from Izacard and Grave (2021b), which has proved effective at aggregating and combining evidence from multiple passages. Specifically, each passage is concatenated with the question and encoded independently by the encoder; the decoder then attends to the concatenation of the representations of all passages and generates all answers in sequence.
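
A minimal fusion-in-decoder style sketch with Hugging Face T5 is shown below; "t5-small" and the "question: ... context: ..." input format are stand-ins for the actual T5-3b reader and its prompt format, which are not specified in this paper.

```python
# Fusion-in-decoder sketch: encode each (question, passage) pair independently,
# then let the decoder attend to the concatenated passage representations.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "When did You Don't Know Jack come out?"
passages = [
    "You Don't Know Jack is a video game released in 1995 ...",
    "In 2012, Jackbox Games published a social version of the game on Facebook ...",
]

inputs = tokenizer(
    [f"question: {question} context: {p}" for p in passages],
    padding=True, truncation=True, max_length=360, return_tensors="pt",
)
with torch.no_grad():
    states = model.encoder(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    ).last_hidden_state

# Fuse: concatenate per-passage representations along the sequence axis so the
# decoder can attend to all passages jointly (a single question, batch size 1).
fused = states.reshape(1, -1, states.size(-1))
fused_mask = inputs.attention_mask.reshape(1, -1)

output_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=fused_mask, max_length=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```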

4.4 Empirical Analyses

To analyze the performance of the rerank-then-read framework on multi-answer open-domain questions, we built a system that resembles the state-of-the-art pipeline from Min et al. (2021) but with two differences (code and models from Min et al. (2021) were not publicly available at the time of this work). First, we used the retriever from Izacard and Grave (2021a). Second, instead of using JPR, we used an oracle passage reranker (OPR): a passage is ranked higher than another passage if and only if 1) it covers some answer while the other covers none, or 2) both passages cover (or both fail to cover) some answer but it has a higher retrieval score. Following Min et al. (2021), we retrieved N=100 Wikipedia passages, K=10 of which were selected by the reranker. Table 2 shows model performance on the dev set of a representative multi-answer dataset called AmbigQA Min et al. (2020b). Compared with JPR, OPR is better in terms of reranking, with similar question answering results.

Model                   MRecall@k   MRecall@k'   F1
JPR Min et al. (2021)   64.8/45.2   67.1/48.2    48.5/37.6
OPR                     67.7/46.5   70.3/51.2    48.4/37.0

Table 2: Reranking results (MRecall at two cutoffs) and question answering results (F1) on the dev set of AmbigQA using JPR and OPR. The two numbers in each cell are results on all questions and questions with multiple answers, respectively.

Even with the oracle knowledge of whether a passage contains an answer during reranking, OPR is probably still far from being a perfect reranker. Notably, we are not striving for a better rerank-then-read pipeline for multi-answer questions; rather, we use OPR as a representative case to analyze the problems a rerank-then-read pipeline may face.
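
The OPR ordering rule defined in Section 4.4 can be written down in a few lines; covers_answer is a hypothetical helper based on simple string containment, which simplifies the actual coverage check.

```python
# OPR sketch: answer-covering passages outrank the rest; ties are broken by
# retrieval score.
from typing import List, Set, Tuple

def covers_answer(passage: str, gold_answers: Set[str]) -> bool:
    return any(a.lower() in passage.lower() for a in gold_answers)

def opr_rerank(passages: List[Tuple[str, float]], gold_answers: Set[str], k: int = 10) -> List[str]:
    """passages: list of (text, retrieval_score); returns the texts of the top-k."""
    ranked = sorted(passages,
                    key=lambda p: (covers_answer(p[0], gold_answers), p[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```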

Trade-off between Relevance and Diversity

Though 3,670 diverse gold answers are covered by OPR, the reader predicts only 1,554 of them. We therefore investigate how well OPR balances relevance and diversity, using questions with multiple answers in the dev set of AmbigQA, by comparing the supporting passages of missing gold answers and predicted gold answers (we use "supporting passages" of an answer loosely to refer to passages that cover the answer).

Figures 1(a) and 1(b) show the distribution of supporting passages of missing answers and predicted answers, respectively. In general, a missing answer has significantly fewer supporting passages fed to the reader (3.13 on average) than a predicted answer (5.08 on average), but not for lack of available evidence: more evidence for missing answers exists in the retrieved passages but is filtered out by the reranker. As shown by Figure 1(c), OPR typically uses a much lower fraction of the available evidence for missing answers than for predicted answers. A larger number of supporting passages for an answer is more likely to include true positive evidence, which benefits question answering. However, as multiple answers share a small reading budget, it is inevitable that some answers are allotted fewer supporting passages, which prevents the reader from making better use of all retrieved evidence.

Judging from the wide spread of the number of supporting passages of predicted answers in Figure 1(b), there may be cases where redundant false positive evidence is selected by the reranker and could safely be replaced with more evidence for missing answers, yielding a better balance between relevance and diversity. However, it is non-trivial for the reranker to know beforehand whether a passage is false positive evidence, or how many or which supporting passages provide strong enough evidence for the reader.

Figure 2: Analysis of the pathological dependencies among answers learned by the reader (of a rerank-then-read pipeline, OPR being the reranker) on AmbigQA. The horizontal axis is the number of diverse answers covered by OPR. The left axis shows the ratio of questions for which the reader recovers some originally missed gold answer after removing the supporting passages of some originally predicted gold answer.
Figure 3: The recall-then-verify framework we propose for tackling multi-answer open-domain questions. We first use the answer recaller to guess possible answers with high recall; the evidence aggregator then aggregates retrieved evidence for each candidate; finally, the answer verifier verifies each candidate based on its aggregated evidence. As the reasoning process of each answer is separated, and thanks to candidate-aware evidence aggregation, we can achieve a high level of evidence usage with a large model under a limited memory constraint.
Dependencies among Answers

Ideally, whether to predict an answer should mainly depend on its associated evidence. However, under the rerank-then-read framework, a reader predicts all answers in sequence conditioned on all selected passages, which makes it possible for the reader to learn pathological dependencies among answers, i.e., whether to predict an answer may also depend on passages that cover some other answer(s). We conjectured that these pathological dependencies also partly account for the large number of gold answers missed by the reader. For verification, we attacked OPR's reader on the dev set of AmbigQA as follows: a question is a target if and only if (1) it has a gold answer covered by OPR but missed by the reader and (2) it has a predicted gold answer whose supporting passages cover no other gold answers; a successful attack on a targeted question means that a missing answer is recovered after removing all supporting passages of some predicted answer without removing any supporting passage of other gold answers (removed passages are replaced with the same number of top-ranking passages that cover no gold answer, so that the number of passages fed to the reader remains unchanged).
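
The attack can be summarized with the following sketch, in which supports and read are hypothetical stand-ins for the supporting-passage check and for running the reader over a passage set, respectively.

```python
# One attack on a targeted question: remove the supporting passages of a
# predicted gold answer (that cover no other gold answer), refill the reading
# budget with non-supporting passages, and check whether a missed gold answer
# is recovered.
from typing import Callable, List, Set

def attack_once(selected: List[str], fillers: List[str], predicted_gold: str,
                other_gold: Set[str], missed_gold: Set[str],
                supports: Callable[[str, str], bool],
                read: Callable[[List[str]], Set[str]]) -> bool:
    removed = [p for p in selected
               if supports(p, predicted_gold)
               and not any(supports(p, g) for g in other_gold)]
    kept = [p for p in selected if p not in removed]
    # Keep the reading budget fixed: refill with top-ranking passages covering no gold answer.
    new_input = kept + fillers[:len(removed)]
    # The attack succeeds if some originally missed gold answer is recovered.
    return bool(read(new_input) & missed_gold)
```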

There are 179 targeted questions; for 33.5% of them, we successfully recovered at least one missing gold answer per question, by eliminating the influence of evidence of some predicted gold answer. As shown by Figure 2, the success rate increases significantly when there are more answers covered by the reranker, indicating that predictions tend to be brittle on questions with many diverse supporting passages.

One possible explanation of the pathological dependencies is that the reader compares the validity of answer candidates and predicts the most likely ones. However, for 45.0% of successfully attacked questions, according to OPR, the supporting passages of recovered missing answers are more relevant to the questions than the removed supporting passages of predicted answers. Notably, Min et al. (2020b) made a similar observation on another rerank-then-read pipeline, i.e., it is hard to argue that the predicted answers are more likely than the missing ones.

5 Recall-then-Verify Framework

5.1 Overview

As shown in Section 4.4, the rerank-then-read framework faces three problems when used to tackle multi-answer open-domain questions. First, to leverage the power of a large reader under a fixed memory constraint, a reranker has to select only a few passages that are highly relevant and also diverse enough to cover multiple answers. However, it is non-trivial for a reranker to know beforehand which answers can safely be allotted fewer supporting passages and which should be allotted more. Second, a reranker is more likely to include true positive evidence when it selects more supporting passages for an answer; yet, as multiple answers share the small reading budget, most retrieved evidence, which is possibly valuable, is filtered out by the reranker and cannot be used by the reader. Third, as the reader predicts answers all at once based on all selected passages, whether to predict an answer may pathologically depend on evidence of other answers.

To address the above problems, we propose a recall-then-verify framework, which separates the reasoning process of each answer so that (1) each answer can individually be allotted the maximum number of supporting passages allowed on the same hardware and (2) each answer is predicted mainly based on its own evidence. Figure 3 shows our framework. Specifically, we first guess possible answers from retrieved passages using an answer recaller; an evidence aggregator then aggregates evidence for each answer candidate; finally, an answer verifier verifies each candidate and outputs the valid ones.
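
The following sketch strings the three stages together; recall, aggregate, and verify are placeholders for the models described in Sections 5.2-5.4, and the default values of k and theta are illustrative, not the authors' exact configuration.

```python
# End-to-end glue for the recall-then-verify pipeline of Figure 3.
from typing import Callable, List, Set

def recall_then_verify(question: str, passages: List[str],
                       recall: Callable[[str, str], str],
                       aggregate: Callable[[str, str, List[str], int], List[str]],
                       verify: Callable[[str, str, List[str]], float],
                       k: int = 10, theta: float = 0.75) -> Set[str]:
    # 1) Recall one candidate per retrieved passage (high recall, low precision).
    candidates = {recall(question, p) for p in passages}
    candidates.discard("")                  # drop empty generations, if any
    answers = set()
    for cand in candidates:
        # 2) Aggregate the top-k passages most relevant to (question, candidate).
        evidence = aggregate(question, cand, passages, k)
        # 3) Verify the candidate against its own evidence only.
        if verify(question, cand, evidence) > theta:
            answers.add(cand)
    return answers
```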

5.2 Answer Recaller

Our answer recaller is based on the encoder-decoder architecture of T5 Raffel et al. (2020); it is trained to generate an answer candidate from each retrieved passage (concatenated with the question):

\hat{a}_i = \arg\max_a P_{recall}(a \mid q \oplus p_i), \quad p_i \in P    (1)

We denote the set of recalled answer candidates as C. Though a single passage may not contain strong enough evidence to support an answer, semantic clues such as answer types make it possible for even a weak model to predict possible answers with high recall. However, this comes at the cost of low precision, which necessitates answer verification based on more evidence.
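
A minimal sketch of per-passage candidate generation follows, with "t5-small" standing in for the actual T5-3b recaller and the input format assumed rather than taken from the paper.

```python
# Generate one answer candidate per retrieved passage with a T5 stand-in.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
recaller = T5ForConditionalGeneration.from_pretrained("t5-small")

def recall_candidates(question, passages):
    inputs = tokenizer([f"question: {question} context: {p}" for p in passages],
                       padding=True, truncation=True, max_length=360,
                       return_tensors="pt")
    outputs = recaller.generate(**inputs, max_length=32)
    # One candidate per passage; duplicate strings collapse into a set.
    return {tokenizer.decode(o, skip_special_tokens=True) for o in outputs}
```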

5.3 Evidence Aggregator

Our evidence aggregator aggregates evidence for each answer candidate from retrieved passages, which can be formulated as a reranking task, i.e., to rerank retrieved passages according to their relevance to a question-candidate pair, and select top-ranking ones for answer verification. We reuse the answer recaller as the evidence aggregator:

E_a = \operatorname{top\text{-}K}_{p \in P} \; P_{recall}(a \mid q \oplus p)    (2)

where E_a denotes the top-K relevant passages for the answer candidate a.
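
One plausible way to reuse the recaller for aggregation, consistent with the description above, is to score each passage by the (length-normalized) log-likelihood the recaller assigns to the candidate and keep the top-K passages. This is a sketch of the idea, reusing the tokenizer and recaller from the sketch in Section 5.2; it is not necessarily the authors' exact scoring rule.

```python
# Rank passages by how strongly the recaller supports generating the candidate.
import torch

def aggregate_evidence(question, candidate, passages, k=10):
    target = tokenizer(candidate, return_tensors="pt").input_ids
    scored = []
    with torch.no_grad():
        for p in passages:
            enc = tokenizer(f"question: {question} context: {p}",
                            truncation=True, max_length=360, return_tensors="pt")
            loss = recaller(**enc, labels=target).loss   # mean token NLL
            scored.append((-loss.item(), p))             # higher = more supportive
    scored.sort(key=lambda x: x[0], reverse=True)
    return [p for _, p in scored[:k]]
```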

5.4 Answer Verifier

Given an answer candidate a and its evidence E_a, our answer verifier, based on T5-3b, predicts whether a is valid, using the fusion-in-decoder method from Izacard and Grave (2021b). Each passage from E_a is concatenated with the question and the answer candidate and is encoded independently; the decoder then attends to the representations of all passages and is trained to produce the token "right" or "wrong" depending on whether the encoded answer candidate is valid. During inference, we compute the validity score of a candidate as the normalized probability assigned to the token "right":

s(a) = \frac{P(\text{right} \mid q, a, E_a)}{P(\text{right} \mid q, a, E_a) + P(\text{wrong} \mid q, a, E_a)}    (3)

Candidates whose validity scores exceed a pre-defined threshold θ are produced as final answers.
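
A sketch of the validity score in Eq. (3): the candidate's evidence is fused as in the fusion-in-decoder sketch of Section 4.3, one decoder step is run, and the probability of "right" is normalized against "wrong". The model and tok arguments stand in for the trained verifier (T5-3b in the paper) and its tokenizer; treating "right"/"wrong" as single tokens and the input format are simplifying assumptions.

```python
# Validity score from the first-step decoder logits of a T5-style verifier.
import torch
from transformers.modeling_outputs import BaseModelOutput

def validity_score(question, candidate, evidence, model, tok):
    texts = [f"question: {question} answer: {candidate} context: {p}" for p in evidence]
    enc = tok(texts, padding=True, truncation=True, max_length=360, return_tensors="pt")
    with torch.no_grad():
        states = model.encoder(input_ids=enc.input_ids,
                               attention_mask=enc.attention_mask).last_hidden_state
        fused = states.reshape(1, -1, states.size(-1))      # fusion-in-decoder style
        mask = enc.attention_mask.reshape(1, -1)
        start = torch.tensor([[model.config.decoder_start_token_id]])
        logits = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                       attention_mask=mask, decoder_input_ids=start).logits[0, -1]
    right_id = tok("right", add_special_tokens=False).input_ids[0]
    wrong_id = tok("wrong", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[right_id, wrong_id]], dim=0)
    return probs[0].item()   # candidates scoring above the threshold are kept
```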

6 Experiments

6.1 Datasets

We conducted experiments on two multi-answer QA datasets, whose statistics are shown in Table 3.

WebQSP Yih et al. (2016) is a semantic parsing dataset for knowledge base question answering, where answers are sets of entities in Freebase. Following Min et al. (2021), we repurpose this dataset for textual QA over Wikipedia (our train/dev split on WebQSP differs from Min et al. (2021)'s, as their split was not publicly available at the time of this work).

AmbigQA Min et al. (2020b) originates from NQ Kwiatkowski et al. (2019), where questions are annotated with equally valid answers from Wikipedia.

Dataset   # Questions (Train / Dev / Test)   # Answers (Avg. / Median)
WebQSP    2,752 / 245 / 1,582                22.6 / 1.0
AmbigQA   10,036 / 2,002 / 2,004             2.2 / 2.0

Table 3: Statistics of multi-answer QA datasets. The average and median number of answers are computed on the dev sets.

6.2 Baselines

We compare our recall-then-verify system with two state-of-the-art rerank-then-read systems.

Refuel Gao et al. (2021) selects the 100 top-ranking passages from 1,000 retrieved passages and predicts answers with a reader based on BART-large Lewis et al. (2020). It also has a round-trip prediction mechanism: it generates disambiguated questions based on predicted answers, which are re-fed to the reader to recall more valid answers.

JPR Min et al. (2021) is a passage reranker that jointly models the selected passages. With improved reranking performance, Min et al. (2021) selected only 10 passages from 100 retrieved passages and used a reader based on T5-3b, which is much larger and more powerful than Refuel's reader while requiring no more memory than Refuel.

6.3 Implementation Details

Our retrieval corpus is the English Wikipedia dump from 12/20/2018, where articles are split into 100-word passages. The dense retriever is from Izacard and Grave (2021a) and is finetuned on each multi-answer QA dataset. The answer recaller and the answer verifier are initialized with T5-3b; both are pre-trained on NQ and then finetuned on each multi-answer dataset. We retrieve N passages per question, and verify each answer candidate with its top-K aggregated passages. The threshold θ for verification is tuned on the dev set based on the sum of F1 scores on all questions and on questions with multiple answers; the best θ on WebQSP and AmbigQA are 0.7 and 0.75, respectively.
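
The threshold sweep can be sketched as follows, reusing answer_set_f1 from the metric sketch in Section 3; the grid of candidate thresholds is illustrative.

```python
# Pick the theta maximizing the sum of F1 on all questions and F1 on questions
# with multiple answers. `scored_candidates` holds, per question, a list of
# (candidate, validity_score) pairs produced by the verifier.
def tune_threshold(scored_candidates, gold_sets,
                   thresholds=(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8)):
    def avg_f1(theta, multi_only):
        scores = []
        for cands, gold in zip(scored_candidates, gold_sets):
            if multi_only and len(gold) < 2:
                continue
            pred = {a for a, s in cands if s > theta}
            scores.append(answer_set_f1(pred, set(gold)))
        return sum(scores) / max(len(scores), 1)
    return max(thresholds, key=lambda t: avg_f1(t, False) + avg_f1(t, True))
```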

Memory Constraint: Min et al. (2021) considered a fixed hardware budget and trained a reader with the maximum number of passages it allows. We follow this memory constraint, under which a reader based on T5-3b can encode up to 10 passages, each no longer than 360 tokens, at a time.

6.4 QA Results

System   WebQSP Dev*   WebQSP Test   AmbigQA Dev   AmbigQA Test
Refuel   -             -             48.3/37.3     42.1/33.3
JPR      53.6/49.5     53.1/47.2     48.5/37.6     43.5/34.2
Ours     54.4/48.5     54.9/48.2     51.2/40.2     44.9/35.9

Table 4: Question answering results on multi-answer datasets. The two numbers in each cell are F1 scores on all questions and questions with multiple answers, respectively. Note that results on the dev set of WebQSP (*) cannot be directly compared, as we used a different train/dev split (see Section 6.1).

System                      # Passages   T5      Dev    Test
Izacard and Grave (2021a)   100          large   51.9   53.7
JPR                         10           3b      50.4   54.5
Ours                        10           3b      51.9   54.8

Table 5: Exact match scores of different systems on NQ, which is a single-answer dataset. Izacard and Grave (2021a) used significantly more memory resources than JPR and our system for training.

Owing to candidate-aware evidence aggregation and a fixed, sufficient number of passages allotted to each candidate, our recall-then-verify framework can make use of most retrieved supporting passages (see our improvements over OPR in Figure 1(c)). With a higher level of evidence usage, our recall-then-verify system outperforms state-of-the-art rerank-then-read baselines on both multi-answer datasets, as shown in Table 4.

Though our framework focuses on multi-answer questions, we also experimented on NQ to demonstrate that it is applicable to the single-answer scenario without suffering from low precision. Specifically, for each question, we only output the answer candidate with the highest validity score. As shown in Table 5, our system slightly outperforms previous state-of-the-art rerank-then-read systems.

6.5 Ablation Study

6.5.1 Can the Answer Recaller Do More than Guessing?

Dataset   T5     N / P   # Cand.   # Hit       Recall      Precision   F1
WebQSP    3b     10      3.7       403/296     61.4/48.9   48.1/52.9   47.4/43.8
          3b     5       4.0       412/307     61.7/49.6   46.1/50.5   46.5/43.6
          3b     1       5.6       449/336     63.6/50.4   37.2/42.3   40.5/39.5
          3b     0.1     16.7      499/382     66.7/54.7   17.2/23.1   22.2/26.4
          3b     0       45.1      593/469     71.3/59.5   7.5/11.1    11.2/15.4
          base   0       56.9      574/453     69.8/58.0   5.0/8.5     7.7/12.1
AmbigQA   3b     10      2.8       1975/1160   53.6/37.2   39.3/38.8   40.5/33.9
          3b     5       3.1       2053/1219   55.2/39.2   37.5/37.2   39.7/34.1
          3b     1       6.4       2526/1555   63.2/48.1   22.7/23.5   29.9/28.6
          3b     0.1     24.9      2980/1891   68.3/55.3   7.2/8.6     12.1/13.8
          3b     0       53.1      3206/2073   71.5/59.1   3.4/4.2     6.1/7.5
          base   0       61.7      3091/1981   69.3/56.7   2.5/3.3     4.7/6.0

Table 6: Performance of the answer recaller on the dev sets, using different numbers of negative passages per positive passage (N / P) for training. # Cand. is the average number of answer candidates per question, and # Hit denotes the number of distinct gold answers recalled. The two numbers in each cell are results on all questions and questions with multiple answers, respectively.

Our answer recaller is trained only on positive passages that cover some gold answer, and is thus not granted the capability of filtering out negative passages or negative answer candidates. As shown in Table 6, the performance of a recaller based on T5-base is close to that of a recaller based on T5-3b, though T5-base is commonly recognized as a much weaker model.

In this section, we also aim to answer two questions: (1) If a large recaller is trained to recognize negative passages and predict answer candidates only from positive passages, is using more evidence for verification still necessary? (2) If such a recaller alone falls short on multi-answer QA, can it reduce the burden on the answer verifier by conservatively filtering out negative answer candidates?

Necessity of Verification

To investigate whether the answer recaller has the potential to tackle multi-answer open-domain questions alone, we trained the answer recaller to predict the "irrelevant" token given a negative passage and to predict an answer candidate given a positive passage. Results with varying numbers of negatives per positive passage are reported in Table 6. With more negative training passages, the answer recaller learns to recall answers more precisely but still significantly underperforms the overall recall-then-verify system. With θ set to 0.6, our recall-then-verify system predicts 2015/1163 distinct gold answers and achieves F1 scores of 50.0/39.7, while an answer recaller trained with 10 negatives per positive passage predicts a similar number of gold answers (1975/1160) but with significantly lower F1 scores (40.5/33.9). The answer recaller is likely to be trained on false positive passages, which may mislead it and make it over-conservative in filtering out hard negative passages. By contrast, using more evidence for verification is less likely to miss true positive evidence if there is any for a candidate, and is thus less prone to misleading the verifier.
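
A sketch of how training examples with a given negatives-per-positive ratio might be constructed for this variant of the recaller follows; the input format and sampling scheme are illustrative assumptions, not the authors' exact recipe.

```python
# Positives target a covered gold answer; negatives target the "irrelevant" token.
import random

def build_recaller_examples(question, positives, negatives, neg_per_pos=1.0, seed=0):
    """positives: list of (passage, covered_gold_answer); negatives: list of passages."""
    rng = random.Random(seed)
    n_neg = int(round(neg_per_pos * len(positives)))
    sampled = rng.sample(negatives, min(n_neg, len(negatives)))
    examples = [(f"question: {question} context: {p}", answer) for p, answer in positives]
    examples += [(f"question: {question} context: {p}", "irrelevant") for p in sampled]
    return examples
```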

Reducing Answer Candidates

As shown by Table 6, training the answer recaller with a small ratio of negative passages helps reduce answer candidates without significantly lowering recall.

In summary, it is difficult for our recaller alone to tackle multi-answer open-domain questions, which necessitates answer verification based on more associated evidence. However, an answer recaller can be trained to shrink the number of candidates, thereby reducing the burden on the verifier.

6.5.2 Effect of the Size of Evidence

Figure 4: Performance on the dev set of AmbigQA with varying numbers of evidence passages K and validity thresholds θ. In Figure 4(a), results for one setting of K are read against the top and right axes, while the others use the bottom and left axes. As θ increases, points of the same color move from bottom right to top left.

Figure 4 shows the benefit of using more evidence for verification. As the number of passages increases from 1 to 10, there is a significant boost in F1 score and the number of predicted gold answers.

6.5.3 Effect of the Validity Threshold

As shown by Figure 4(a), the balance between recall and precision can be controlled by varying θ: a lower θ leads to higher recall and may benefit performance on questions with multiple answers. When K is set to 10, our recall-then-verify system outperforms the state-of-the-art rerank-then-read system JPR for a wide range of θ. Under the best setup (K = 10 with the θ tuned on the dev set), our system predicts 25.0% and 25.7% more gold answers than OPR on all questions and on questions with multiple answers, respectively.

6.6 Error Analyses

Missing Gold Answers
  Evidence is wrong                                          20%
  Evidence is right and straightforward                      68%
  Evidence is right but needs reasoning                      8%
  Evidence is right but implicit                             4%
Wrong Predictions
  Predictions are true negatives                             16%
  Predictions are superficially-different false negatives    48%
  Predictions are unannotated false negatives                36%

Table 7: Analysis of our predictions on the dev set of AmbigQA. Examples are shown in Appendix.

Among the 3,206 recalled gold answers, the answer verifier misses 1,263 and outputs 1,316 wrong predictions. We manually analyzed 50 random samples, 25 of which are missed gold answers and 25 of which are wrong predictions. Table 7 reports our analysis. Our evidence aggregator aggregates true positive evidence for 80% of the missing gold answers: 68% of the missing answers actually have straightforward evidence (53% of missing gold answers with straightforward evidence have validity scores higher than 0.5 but lower than the threshold), while verifying the remaining 12% requires reasoning over evidence (e.g., multi-hop or numeric reasoning) or necessary implicit knowledge. Notably, 84% of our "wrong" predictions turn out to be false negatives: 48% are semantically equivalent to some annotated answer but superficially different Si et al. (2021), and 36% are unannotated false negatives. This demonstrates that it is indeed difficult to find all valid answers to an open-domain question Min et al. (2020b), and that our system may be underrated.

7 Conclusion

In this paper, we empirically analyze the problems of the rerank-then-read framework for multi-answer open-domain questions. Using a large reader can benefit QA performance, but under a fixed memory constraint, it forces the reranker to select only a few passages that are relevant and also diverse enough to cover multiple answers. As a result, the reranker has to deal with the non-trivial balance between relevance and diversity, and the reader fails to make use of retrieved evidence that is valuable but filtered out by the reranker. Moreover, as the reader predicts answers all at once based on all selected passages, whether to predict an answer may pathologically depend on evidence of other answers.

To avoid these problems, we propose to tackle multi-answer open-domain questions with the recall-then-verify framework, which separates the reasoning process of each answer so that we can make better use of retrieved evidence with large models under the same memory constraint. Extensive experiments demonstrate the effectiveness of our framework.

References

  • N. Asadi and J. Lin (2013) Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In The 36th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR ’13, Dublin, Ireland - July 28 - August 01, 2013, G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.), pp. 997–1000. External Links: Link, Document Cited by: §2.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1533–1544. External Links: Link Cited by: §2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 1870–1879. External Links: Link, Document Cited by: §1, §2.
  • H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, and J. Gao (2021) UnitedQA: A hybrid approach for open domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 3080–3090. External Links: Link, Document Cited by: §2.
  • B. Dhingra, K. Mazaitis, and W. W. Cohen (2017) Quasar: datasets for question answering by search and reading. CoRR abs/1707.03904. External Links: Link, 1707.03904 Cited by: §2.
  • Y. Gao, H. Zhu, P. Ng, C. N. dos Santos, Z. Wang, F. Nan, D. Zhang, R. Nallapati, A. O. Arnold, and B. Xiang (2021) Answering ambiguous questions through generative evidence fusion and round-trip prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 3263–3276. External Links: Link, Document Cited by: §4.2, §6.2.
  • M. Hu, F. Wei, Y. Peng, Z. Huang, N. Yang, and D. Li (2019) Read + verify: machine reading comprehension with unanswerable questions. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 6529–6537. External Links: Link, Document Cited by: §2.
  • S. Iyer, S. Min, Y. Mehdad, and W. Yih (2021) RECONSIDER: improved re-ranking using span-focused cross-attention for open domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 1280–1287. External Links: Link, Document Cited by: §2.
  • G. Izacard and E. Grave (2021a) Distilling knowledge from reader to retriever for question answering. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §2, §4.4, §6.3, Table 5.
  • G. Izacard and E. Grave (2021b) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), pp. 874–880. External Links: Link Cited by: §1, §2, §4.2, §4.3, §5.4.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 1601–1611. External Links: Link, Document Cited by: §2.
  • V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 6769–6781. External Links: Link, Document Cited by: §2.
  • O. Khattab, C. Potts, and M. Zaharia (2021) Relevance-guided supervision for openqa with colbert. Trans. Assoc. Comput. Linguistics. External Links: Link Cited by: §2.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics 7, pp. 452–466. External Links: Link Cited by: §1, §2, §2, §6.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 7871–7880. External Links: Link, Document Cited by: §6.2.
  • Y. Mao, P. He, X. Liu, Y. Shen, J. Gao, J. Han, and W. Chen (2021) Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 4089–4100. External Links: Link, Document Cited by: §2.
  • S. Min, J. L. Boyd-Graber, C. Alberti, D. Chen, E. Choi, M. Collins, K. Guu, H. Hajishirzi, K. Lee, J. Palomaki, C. Raffel, A. Roberts, T. Kwiatkowski, P. S. H. Lewis, Y. Wu, H. Küttler, L. Liu, P. Minervini, P. Stenetorp, S. Riedel, S. Yang, M. Seo, G. Izacard, F. Petroni, L. Hosseini, N. D. Cao, E. Grave, I. Yamada, S. Shimaoka, M. Suzuki, S. Miyawaki, S. Sato, R. Takahashi, J. Suzuki, M. Fajcik, M. Docekal, K. Ondrej, P. Smrz, H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, J. Gao, B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko, M. S. Schlichtkrull, S. Gupta, Y. Mehdad, and W. Yih (2020a) NeurIPS 2020 efficientqa competition: systems, analyses and lessons learned. In NeurIPS 2020 Competition and Demonstration Track, 6-12 December 2020, Virtual Event / Vancouver, BC, Canada, H. J. Escalante and K. Hofmann (Eds.), Proceedings of Machine Learning Research, Vol. 133, pp. 86–111. External Links: Link Cited by: §4.1.
  • S. Min, K. Lee, M. Chang, K. Toutanova, and H. Hajishirzi (2021) Joint passage ranking for diverse multi-answer retrieval. CoRR abs/2104.08445. External Links: Link, 2104.08445 Cited by: §1, §2, §3, §4.2, §4.3, §4.4, Table 2, §4, §6.1, §6.2, §6.3, footnote 1, footnote 2, footnote 5.
  • S. Min, J. Michael, H. Hajishirzi, and L. Zettlemoyer (2020b) AmbigQA: answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 5783–5797. External Links: Link, Document Cited by: §1, §1, §2, §3, §4.4, §4.4, §6.1, §6.6.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. CoRR abs/1901.04085. External Links: Link, 1901.04085 Cited by: §2.
  • R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin (2020) Document ranking with a pretrained sequence-to-sequence model. Findings of the Association for Computational Linguistics: EMNLP 2020. External Links: Link, Document Cited by: §2, §4.2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. External Links: Link Cited by: §4.2, §4.3, §5.2.
  • C. Si, C. Zhao, and J. L. Boyd-Graber (2021) What’s in a name? answer equivalence for open-domain question answering. CoRR abs/2109.05289. External Links: Link, 2109.05289 Cited by: §6.6.
  • E. M. Voorhees (1999) The TREC-8 question answering track report. In Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999, E. M. Voorhees and D. K. Harman (Eds.), NIST Special Publication, Vol. 500-246. External Links: Link Cited by: §1, §2.
  • S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell (2018a) Evidence aggregation for answer re-ranking in open-domain question answering. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §2.
  • Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang (2018b) Multi-passage machine reading comprehension with cross-passage answer verification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 1918–1927. External Links: Link, Document Cited by: §2.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2021) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §2.
  • W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh (2016) The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, External Links: Link, Document Cited by: §6.1.
  • T. Zhao, X. Lu, and K. Lee (2021) SPARTA: efficient open-domain question answering via sparse transformer matching retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), pp. 565–575. External Links: Link, Document Cited by: §2.

Appendix A Error Analysis

Table 8 reports error analysis of our answer verifier.

Missing Gold Answers > Evidence is wrong (20%)
Question: When was the last time adelaide was in a grand final?
Gold Answers: 2016; 2017; 1998; 30 September 2017
Prediction: 2017
Evidence: … in the home-and-away season in 2017, in Round 6, with Adelaide recording a 76-point … Bookmakers installed Adelaide as the favourites to win the grand final …
Explanation: Evidence is insufficient to infer whether the 2017 grand final was Adelaide's most recent one.
Missing Gold Answers > Evidence is right and straightforward (68%)
Question: Who was the first person who discovered electricity?
Gold Answers: Ancient Egytian; William Gilbert; Gilbert
Prediction: William Gilbert
Evidence: William Gilbert … is credited as one of the originators of the term “electricity” … the father of electrical engineering or electricity …
Missing Gold Answers > Evidence is right but needs reasoning (8%)
Question: Who plays james corden’s sister in gavin and stacey?
Gold Answers: Sheridan Smith, OBE; Sheridan Smith
Prediction: Sheridan Smith
Evidence: (1) … Gavin rushes to find Smithy with Smithy’s sister, Rudi (Sheridan Smith) … (2) … James Cordan & Sheridan Smith performed … as Smithy and Rudi …
Explanation: Multi-hop reasoning is needed.
Missing Gold Answers > Evidence is right but implicit (4%)
Question: Who did fsu beat for the 2013 championship?
Gold Answers: Duke; Duke Blue Devils; Auburn; Auburn Tigers
Prediction: Duke
Evidence: … in the 2013 ACC Championship Game … Duke lost to Florida State …
Explanation: “fsu” is short for Florida State University, which is not mentioned.
Wrong Predictions > Predictions are true negatives (16%)
Question: How many seasons of shameless usa is there?
Gold Answers: ten; 10
Prediction: 9
Evidence: (1) … Shameless (season 9) … (2) … The ninth series … was reduced to 11 episodes, with the remaining 11 being turned into the tenth series.
Wrong Predictions > Predictions are superficially-different false negatives (48%)
Question: Where did the brown v board of education take place?
Gold Answers: U.S. Supreme Court; Topeka, KS
Prediction: United States Supreme Court
Evidence: … “Brown v. Board of Education” … was taken to the United States Supreme Court …
Wrong Predictions > Predictions are unannotated false negatives (36%)
Question: Who founded jamestown in what is now virginia?
Gold Answers: London Company; The Virginia Company of London; John Smith; Captain John Smith; Edward Maria Wingfield; Virginia Company of London; English settlement; the Virginia Company of London; Virginia Company of London
Prediction: Christopher Newport
Evidence: Virginia … English settlement in the “New World”, Jamestown. Named for King James I, it was founded in May 1607 by Christopher Newport …
Table 8: Analysis of predictions from our answer verifier. We display all annotated forms of gold answers, which are separated with semicolons.