Multi-hop question answering tests the ability of a system to retrieve and combine multiple facts to answer a single question. HotpotQA (Yang et al., 2018) introduces a task where questions are free-form text, supporting facts come from Wikipedia, and answer text and supporting facts are labeled. The questions in HotpotQA are further categorized as bridge-type questions or comparison-type questions. For comparison questions, all necessary facts can often be retrieved using terms in the question itself. For challenging bridge-type questions, it may not be possible to retrieve all the necessary facts based on the terms present in the original question alone. Rather, partial information must first be retrieved and used to formulate an additional query.
Although many systems have been submitted to the HotpotQA leaderboard, surprisingly, only a few have directly addressed the challenge of followups. Systems can either be evaluated in a distractor setting, where a set of ten paragraphs containing all supporting facts is provided, or in a full wiki setting, where supporting facts must be retrieved from all of Wikipedia. The systems that compete only in the distractor setting can achieve good performance by combining and ranking the information provided, without performing followup search queries. Furthermore, even in the distractor setting, Min et al. (2019a) found that only 27% of the questions required multi-hop reasoning, because additional evidence was redundant or unnecessary or the distractors were weak. They trained a single-hop model that considered each paragraph in isolation and ranked confidences of the answers extracted from each, to obtain competitive performance.
Of the nine systems with documentation submitted to the full wiki HotpotQA leaderboard as of 24 November 2019, four of them (Nie et al., 2019; Ye et al., 2019; Nishida et al., 2019; Yang et al., 2018) attempt to retrieve all relevant data with one search based on the original question, without any followups. Fang et al. (2019) retrieves second hop paragraphs simply by following hyperlinks from or to the first hop paragraphs.
Qi et al. (2019), Ding et al. (2019), and Feldman and El-Yaniv (2019) form various kinds of followup queries without writing a new question to be answered. Qi et al. (2019) trains a span extractor to predict the longest common subsequence between the question plus the first hop evidence and the (unseen) second hop evidence. At inference time, these predicted spans become followup search queries. In Ding et al. (2019), a span extractor is trained using the titles of the second hop evidence. Feldman and El-Yaniv (2019) trains a neural retrieval model that uses maximum inner product with an encoding of the question plus first hop evidence to retrieve second hop evidence.
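To make the longest common subsequence target concrete, the following is a minimal token-level sketch using the standard dynamic program. It is an illustration only: the function name `longest_common_subsequence` and the whitespace tokenization are assumptions, and the actual procedure of Qi et al. (2019) may differ in its details.

```python
from typing import List


def longest_common_subsequence(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover one optimal subsequence.
    result: List[str] = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Illustrative strings only: question plus first hop evidence, then
# a made-up stand-in for the second hop evidence.
first_hop = "who did the star and dagger bass player marry sean yseult".split()
second_hop = "sean yseult married chris lee".split()
query_tokens = longest_common_subsequence(first_hop, second_hop)
```

In this toy case the overlapping tokens name the bridge entity, which is the kind of span that would serve as a followup search query.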
Min et al. (2019b) forms not just followup queries but followup questions. They use additional specially labeled data to train a pointer network to divide the original question into substrings, and use handcrafted rules to convert these substrings into subquestions. The original question is answered by the second subquestion, which incorporates a substitution of the answer to the first subquestion.
While performing followup retrievals of some sort should be essential for correctly solving the most difficult multi-hop problems, formulating a followup question whose answer becomes the answer to the original question is motivated primarily by interpretability rather than accuracy. In this paper, we pursue a trained approach to generating followup questions that is not bound by handcrafted rules, posing a new and challenging application for abstractive summarization and neural question generation technologies. Our contributions are to define the task of a followup generator module (Section 2), to propose a fully trained solution to followup generation (Section 3), and to establish an objective evaluation of followup generators (Section 5).
2 Problem Setting
Our technique is specifically designed to address the challenge of discovering what new information is needed when it is not specified by the terms of the original question. At the highest level, comparison questions do not pose this challenge, because each quantity to be compared is specified by part of the original question. (They also pose different semantics than bridge questions because a comparison must be applied after retrieving answers to the subquestions.) Therefore we focus only on bridge questions in this paper.
Figure 1 shows our pipeline to answer a multi-hop bridge question. As partial information is obtained, an original question is iteratively reduced to simpler questions generated at each hop. Given an input question or subquestion, possible premises which may answer the subquestion are obtained from an information retrieval module. Each possible premise is classified against the question as irrelevant, containing a final answer, or containing intermediate information, by a three-way controller module. For premises that contain a final answer, the answer is extracted with a single-hop question answering module. For premises that contain intermediate information, a question generator produces a followup question, and the process may be repeated with respect to this new question. It is this question generator that is the focus of this paper. Various strategies may be used to manage the multiple reasoning paths that may be produced by the controller. Details are in Section 5.
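The control flow of this pipeline can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `answer_multihop`, the label strings, and the callables `retrieve`, `controller`, `extract`, and `followup` are hypothetical stand-ins for the four modules, and stopping at the first hop that yields an answer is only one of the possible strategies for managing reasoning paths.

```python
from typing import Callable, List, Optional

# Hypothetical names for the three controller decisions.
IRRELEVANT, FINAL, INTERMEDIATE = "irrelevant", "final", "intermediate"


def answer_multihop(
    question: str,
    retrieve: Callable[[str], List[str]],
    controller: Callable[[str, str], str],
    extract: Callable[[str, str], Optional[str]],
    followup: Callable[[str, str], str],
    max_hops: int = 2,
) -> List[str]:
    """Collect candidate answers, generating followups hop by hop."""
    answers: List[str] = []
    frontier = [question]
    for _ in range(max_hops):
        next_frontier: List[str] = []
        for current in frontier:
            for premise in retrieve(current):
                label = controller(current, premise)
                if label == FINAL:
                    answer = extract(current, premise)
                    if answer is not None:
                        answers.append(answer)
                elif label == INTERMEDIATE:
                    # Reduce the question using the partial information.
                    next_frontier.append(followup(current, premise))
                # Premises labeled irrelevant are dropped.
        if answers or not next_frontier:
            break
        frontier = next_frontier
    return answers
```

Each module is passed in as a callable so that the loop itself stays independent of how retrieval, classification, extraction, and generation are implemented.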
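Although our method applies to bridge questions with an arbitrary number of hops, for simplicity we focus on two-hop problems and on training the followup question generator. Let $\mathrm{Cont}$ denote the controller, $\mathrm{SingleHop}$ denote the answer extractor, and $\mathrm{Followup}$ denote the followup generator. Let $Q_1$ be a question with answer $A$ and gold supporting premises $P_1$ (first hop) and $P_2$ (second hop), and suppose that $P_2$ but not $P_1$ contains the answer. The task of the followup generator is to use $Q_1$ and $P_1$ to generate a followup question $Q_2$ such that

$$\mathrm{SingleHop}(Q_2, P_2) = A \tag{1}$$

$$\mathrm{Cont}(Q_2, P_2) = \mathrm{Final} \tag{2}$$

$$\mathrm{Cont}(Q_2, P) = \mathrm{Irrel} \quad \text{for all } P \neq P_2 \tag{3}$$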
Failure of any of these desiderata could harm label accuracy in the HotpotQA full wiki or distractor evaluations.
Some questions labeled as bridge type in HotpotQA have a different logical structure, called “intersection” by Min et al. (2019b). Here the subquestions specify different properties that the answer entity is supposed to satisfy, and the intersection of possible answers to the subquestions is the answer to the original question. Our approach is not oriented towards this type of question, but there is no trivial way to exclude them from the dataset.
One non-interpretable implementation of our pipeline would be for the followup generator to simply output the original question concatenated with the first-hop premise as the “followup question.” Then the answer extractor would operate on input that really does not take the form of a single question, along with the second-hop premise, to determine the final answer. Effectively, the answer extractor would be doing multi-hop reasoning. To ensure that the followup generator gets credit only for forming real followup questions, we insist that the answer extractor is first trained as a single-hop answer extractor on SQuAD 2.0 (Rajpurkar et al., 2018) and then frozen while the followup generator and the controller are trained.
Ideally, we might train the followup generator using cross-entropy losses inspired by equations 1, 2, and 3 with the answer extractor and the controller fixed, but the decoded output is not differentiable with respect to the followup generator's parameters. Instead, we train the followup generator with a token-based loss against a set of weakly labeled ground truth followup questions.
The weakly labeled ground truth followups are obtained using a neural question generation (QG) network. Given a context $C$ and an answer $A$, QG is the task of finding the question

$$\hat{Q} = \arg\max_{Q} P(Q \mid C, A)$$

most likely to have produced it. We use reverse SQuAD to train the QG model of Zhao et al. (2018), which performs near the top of an extensive set of models tested by Tuan et al. (2019) and has an independent implementation available. Applied to our training set with the second-hop gold premise as the context and the gold answer as the answer, it gives us a weak ground truth followup for each question.
We instantiate the followup question generator, which uses the original question and the first-hop premise to predict the followup question, with a pointer-generator network (See et al., 2017). This is a sequence-to-sequence model whose decoder repeatedly chooses between generating a word from a fixed vocabulary and copying a word from the input. Typically, pointer-generator networks are used for abstractive summarization. Although the output serves a different role here, their copy mechanism is useful in constructing a followup that uses information from the original question and premise.
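In the formulation of See et al. (2017), the choice between generating and copying is soft: at each decoder step a generation probability weights the vocabulary distribution against the attention distribution over the input tokens. The sketch below shows that mixing step for a single decoder step in plain Python. The function name `final_distribution` and the toy inputs are assumptions for illustration; a trained network would compute `p_gen`, `vocab_dist`, and `attention` from its hidden states.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix generation and copy probabilities for one decoder step.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on copies of w)
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")
    mixed: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    for weight, token in zip(attention, source_tokens):
        # Input tokens outside the fixed vocabulary can still be copied.
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


# Toy step: "gorman" is not in the fixed vocabulary but appears in the input.
step = final_distribution(
    p_gen=0.5,
    vocab_dist={"where": 0.6, "is": 0.4},
    attention=[0.25, 0.75],
    source_tokens=["gorman", "is"],
)
```

The copy term is what lets a followup reuse entity names from the question and premise even when they are absent from the fixed vocabulary.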
We train the controller with cross-entropy loss for ternary classification on ground truth triples, each pairing a question and a premise with one of the three labels. In this way the controller learns to predict when a premise has sufficient or necessary information to answer a question. Both the controller and the answer extractor are implemented by BERT following the code by Devlin et al. (2019).
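For reference, the ternary cross-entropy objective can be written out in a few lines. This is a minimal sketch that operates on plain lists of logits; the label order in `LABELS` and the function names are assumptions, and in the actual system the logits would come from a BERT classification head rather than being supplied by hand.

```python
import math
from typing import List, Sequence

# Hypothetical label order for the three-way controller decision.
LABELS = ("irrelevant", "final", "intermediate")


def cross_entropy(logits: Sequence[float], gold: str) -> float:
    """Negative log-probability of the gold label under a softmax."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[LABELS.index(gold)]


def batch_loss(batch_logits: List[Sequence[float]], gold_labels: List[str]) -> float:
    """Mean cross-entropy over a batch of (question, premise) pairs."""
    losses = [cross_entropy(l, g) for l, g in zip(batch_logits, gold_labels)]
    return sum(losses) / len(losses)
```

A uniform prediction over the three labels gives a loss of log 3, which is a convenient sanity check when monitoring training.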
4 Related Work
Evaluating a followup question generator by whether its questions are answered correctly is analogous to verifying the factual accuracy of abstractive summarizations, which has been studied by many, including Falke et al. (2019), who estimate factual correctness using a natural language inference model and find that it does not correlate with ROUGE score. Contemporaneous work by Zhang et al. (2019) uses feedback from a fact extractor in reinforcement learning to optimize the correctness of a summary, suggesting an interesting future direction for our work.
A recent neural question generation model has incorporated feedback from an answer extractor into the training of a question generator, rewarding the generator for constructing questions the extractor can answer correctly (Klein and Nabi, 2019). Although the loss is not backpropagated through both the generator and extractor, the generator is penalized by token-level loss against ground truth questions when the question is answered wrongly, but by zero loss when it constructs a variant that the extractor answers correctly.
To isolate the effect of our followup generator on the types of questions for which it was intended, our experiments cover the subset of questions in HotpotQA labeled with exactly two supporting facts, with the answer string occurring in exactly one of them. There are 38,806 such questions for training and 3,214 for development, which we use for testing because the structure of the official test set is not available. For a baseline we compare to a trivial followup generator that returns the original question without any rewriting.
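The selection rule for this subset can be sketched as a simple filter. The example layout below (a dictionary with an `answer` string and a `supporting_facts` list of sentence strings) is a simplified assumption for illustration and is not the released HotpotQA file format; exact, case-sensitive substring matching is likewise an assumption.

```python
from typing import Dict, List, Tuple


def is_two_hop_candidate(example: Dict) -> bool:
    """True if there are exactly two facts and the answer is in exactly one."""
    facts: List[str] = example["supporting_facts"]
    if len(facts) != 2:
        return False
    hits = [example["answer"] in fact for fact in facts]
    return sum(hits) == 1


def order_premises(example: Dict) -> Tuple[str, str]:
    """Return (first_hop, second_hop); the second hop contains the answer."""
    fact_a, fact_b = example["supporting_facts"]
    if example["answer"] in fact_a:
        return fact_b, fact_a
    return fact_a, fact_b


# Illustrative example with made-up premise text.
example = {
    "answer": "1957",
    "supporting_facts": [
        "He created the sitcom with his then wife Fran Drescher.",
        "Fran Drescher was born in 1957.",
    ],
}
keep = is_two_hop_candidate(example)
first_hop, second_hop = order_premises(example)
```

Ordering the premises this way identifies which fact carries intermediate information and which carries the final answer.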
| System | EM | F1 |
|---|---|---|
| One hop (original question only) | 16.8 | 21.5 |
| Two hops (trained followup generator) | 19.8 | 25.4 |

Table 2: Followup generations and extracted answers in typical successful and failed cases.

| Original question | Gold answer | Answer from original question | Generated followup | Answer from followup |
|---|---|---|---|---|
| Randall Cunningham II was a multi-sport athlete at the high school located in what Nevada city? | Summerlin | — | where is bishop gorman high school located? | Summerlin, Nevada |
| Alexander Kerensky was defeated and destroyed by the Bolsheviks in the course of a civil war that ended when? | October 1922 | — | what was the name of the russian civil war? | The Russian Civil War |
| Peter Marc Jacobson is best known as the co-creator of the popular sitcom “The Nanny”, which he created and wrote with his then wife an actress born in which year ? | 1957 | 1993 | what year was fran drescher born in? | 1957 |
| Who did the Star and Dagger bass player marry? | Sean Yseult. | Sean Yseult | what was the name of the american rock musician? | Chris Lee |

First, we evaluate performance using an oracle controller, which forwards only the original question with the first-hop gold premise to the followup generator, and only the second-hop gold premise, paired with the original question or the followup question, to the answer extractor. Results are shown in Table 1. Best performance is achieved by the system that answers with the extraction from the original question or, if that is null, the extraction from the followup question. Thus, although many questions are really single-hop and best answered using the original question, using the followup questions when a single-hop answer cannot be found helps the F1 score by 8.9%. Table 2 shows followup generations and extracted answers in two typical successful and two typical failed cases.
Next we consider the full system of Figure 1. We use the distractor paragraphs provided. We run the loop for up to two hops, collecting all answer extractions requested by the controller, stopping after the first hop where a non-null extracted answer was obtained. If multiple extractions were requested for the same problem, we take the answer from the question-premise pair that was assigned the highest confidence. The controller requested 2,989 followups, and sent 975 pairs for answer extraction in hop one, and 1,180 in hop two. The performance gain shows that the followup generator often can generate questions which are good enough for the frozen single-hop model to understand and extract the answer with, even when the question must be specific enough to avoid distracting premises.
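The fallback behavior of the best oracle system, which prefers the answer extracted with the original question and uses the followup question only when that answer is null, can be sketched as below. The function name `answer_with_fallback` and the `extract` callable are hypothetical, and treating both `None` and the empty string as a null answer is an assumption about how the extractor signals that no answer was found.

```python
from typing import Callable, Optional


def answer_with_fallback(
    original_question: str,
    followup_question: str,
    premise: str,
    extract: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Answer with the original question, else with the followup."""
    first = extract(original_question, premise)
    if first:  # None and "" both count as a null answer here
        return first
    second = extract(followup_question, premise)
    return second if second else None
```

Because the answer extractor is frozen, any gain from the second call is attributable to the wording of the followup question rather than to additional training of the extractor.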
Followup queries are essential to solving the difficult cases of multi-hop QA, and real followup questions are an advance in making this process interpretable. We have shown that pointer-generator networks can effectively learn to read partial information and produce a fluent, relevant question about what is not known, which is a complement to their typical role in summarizing what is known. Our task poses a novel challenge that tests semantic properties of the generated output.
By using a neural question generator to produce weak ground truth followups, we have made this task more tractable. Future work should examine using feedback from the answer extractor or controller to improve the sensitivity and specificity of the generated followups. Additionally, the approach should be developed on new datasets such as QASC (Khot et al., 2019), which are designed to make single-hop retrieval less effective.
References
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Ding et al. (2019). Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2694–2703.
- Falke et al. (2019). Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2214–2220.
- Fang et al. (2019). Hierarchical graph network for multi-hop question answering. CoRR 1911.03631.
- Feldman and El-Yaniv (2019). Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2296–2309.
- Khot et al. (2019). QASC: a dataset for question answering via sentence composition. CoRR 1910.11473.
- Klein and Nabi (2019). Learning to answer by learning to ask: getting the best of GPT-2 and BERT worlds. CoRR 1911.02365.
- Min et al. (2019a). Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4249–4257.
- Min et al. (2019b). Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6097–6109.
- Nie et al. (2019). Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2553–2566.
- Nishida et al. (2019). Answering while summarizing: multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2335–2345.
- Qi et al. (2019). Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2590–2602.
- Rajpurkar et al. (2018). Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789.
- See et al. (2017). Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083.
- Tuan et al. (2019). Capturing greater context for question generation. CoRR 1910.10274.
- Yang et al. (2018). HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2369–2380.
- Ye et al. (2019). Multi-paragraph reasoning with knowledge-enhanced graph neural network. CoRR 1911.02170.
- Zhang et al. (2019). Optimizing the factual correctness of a summary: a study of summarizing radiology reports. CoRR 1911.02541.
- Zhao et al. (2018). Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3901–3910.