Multi-hop reading comprehension (RC) is challenging because it requires the aggregation of evidence across several paragraphs to answer a question. Table 1 shows an example of multi-hop RC, where the question “Which team does the player named 2015 Diamond Head Classic’s MVP play for?” requires first finding the player who won the MVP from one paragraph, and then finding the team that player plays for from another paragraph.
|Q||Which team does the player named 2015 Diamond Head Classic’s MVP play for?|
|P1||The 2015 Diamond Head Classic was … Buddy Hield was named the tournament’s MVP.|
|P2||Chavano Rainier Buddy Hield is a Bahamian professional basketball player for the Sacramento Kings …|
|Q1||Which player named 2015 Diamond Head Classic’s MVP?|
|Q2||Which team does ANS play for?|
In this paper, we propose DecompRC, a system for multi-hop RC that learns to break compositional multi-hop questions into simpler, single-hop sub-questions using spans from the original question. For example, for the question in Table 1, we can create the sub-questions “Which player named 2015 Diamond Head Classic’s MVP?” and “Which team does ANS play for?”, where the token ANS is replaced by the answer to the first sub-question. The final answer is then the answer to the second sub-question.
Recent work on question decomposition relies on distant supervision data created on top of underlying relational logical forms (Talmor and Berant, 2018), making it difficult to generalize to diverse natural language questions such as those on HotpotQA (Yang et al., 2018). In contrast, our method presents a new approach which simplifies the process as a span prediction, thus requiring only 400 decomposition examples to train a competitive decomposition neural model. Furthermore, we propose a rescoring approach which obtains answers from different possible decompositions and rescores each decomposition with the answer to decide on the final answer, rather than deciding on the decomposition in the beginning.
Our experiments show that DecompRC outperforms other published methods on HotpotQA (Yang et al., 2018), while providing explainable evidence in the form of sub-questions. In addition, we evaluate with alternative distractor paragraphs and questions and show that our decomposition-based approach is more robust than an end-to-end BERT baseline (Devlin et al., 2019). Finally, our ablation studies show that our sub-questions, with 400 supervised examples of decompositions, are as effective as human-written sub-questions, and that our answer-aware rescoring method significantly improves the performance.
Our code and interactive demo are publicly available at https://github.com/shmsw25/DecompRC.
2 Related Work
In reading comprehension, a system reads a document and answers questions regarding the content of the document (Richardson et al., 2013). Recently, the availability of large-scale reading comprehension datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017) has led to the development of advanced RC models (Seo et al., 2017; Xiong et al., 2018; Yu et al., 2018; Devlin et al., 2019). Most of the questions on these datasets can be answered in a single sentence (Min et al., 2018), which is a key difference from multi-hop reading comprehension.
Multi-hop Reading Comprehension.
In multi-hop reading comprehension, the evidence for answering the question is scattered across multiple paragraphs. Some multi-hop datasets contain questions that are, or are based on relational queries (Welbl et al., 2017; Talmor and Berant, 2018). In contrast, HotpotQA (Yang et al., 2018), on which we evaluate our method, contains more natural, hand-written questions that are not based on relational queries.
Prior methods on multi-hop reading comprehension focus on answering relational queries, and emphasize attention models that reason over coreference chains (Dhingra et al., 2018; Zhong et al., 2019; Cao et al., 2019). In contrast, our method focuses on answering natural language questions via question decomposition. By providing decomposed single-hop sub-questions, our method makes the model’s decisions explainable.
Our work is most related to Talmor and Berant (2018), which answers questions over web snippets via decomposition. There are three key differences between our method and theirs. First, they decompose questions that correspond to relational queries, whereas we focus on natural language questions. Next, they rely on an underlying relational query (SPARQL) to build distant supervision data for training their model, while our method requires only 400 decomposition examples. Finally, they decide on a decomposition operation exclusively based on the question. In contrast, we decompose the question in multiple ways, obtain answers, and determine the best decomposition based on all given context, which we show is crucial to improving performance.
Semantic parsing is a larger area of work that involves producing logical forms from natural language utterances, which are then usually executed over structured knowledge graphs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011). Our work is inspired by the idea of compositionality from semantic parsing; however, we focus on answering natural language questions over unstructured text documents.
In multi-hop reading comprehension, a system answers a question over a collection of paragraphs by combining evidence from multiple paragraphs. In contrast to single-hop reading comprehension, in which a system can obtain good performance using a single sentence (Min et al., 2018), multi-hop reading comprehension typically requires more complex reasoning over how two pieces of evidence relate to each other.
We propose DecompRC for multi-hop reading comprehension via question decomposition. DecompRC answers questions through a three step process:
|Type||Bridging (47%) requires finding the first-hop evidence in order to find another, second-hop evidence.|
|Q||Which team does the player named 2015 Diamond Head Classic’s MVP play for?|
|Q1||Which player named 2015 Diamond Head Classic’s MVP?|
|Q2||Which team does ANS play for?|
|Type||Intersection (23%) requires finding an entity that satisfies two independent conditions.|
|Q||Stories USA starred which actor and comedian from ‘The Office’?|
|Q1||Stories USA starred which actor and comedian?|
|Q2||Which actor and comedian from ‘The Office’?|
|Type||Comparison (22%) requires comparing the property of two different entities.|
|Q||Who was born earlier, Emma Bull or Virginia Woolf?|
|Q1||Emma Bull was born when?|
|Q2||Virginia Woolf was born when?|
|Q3||Which_is_smaller((Emma Bull, ANS), (Virginia Woolf, ANS))|
First, DecompRC decomposes the original, multi-hop question into several single-hop sub-questions according to a few reasoning types in parallel, based on span predictions. Figure 1 illustrates an example in which a question is decomposed through four different reasoning types. Section 3.2 details our decomposition approach.
Then, for each reasoning type, DecompRC leverages a single-hop reading comprehension model to answer each sub-question, and combines the answers according to the reasoning type. Figure 1 shows an example for which bridging produces ‘City of New York’ as an answer while intersection produces ‘Columbia University’ as an answer. Section 3.3 details the single-hop reading comprehension procedure.
We identify several reasoning types in multi-hop reading comprehension, which we use to decompose the original question and rescore the decompositions. These reasoning types are bridging, intersection and comparison. Table 2 shows examples of each reasoning type. On a sample of 200 questions from the dev set of HotpotQA, we find that 92% of multi-hop questions belong to one of these types. Specifically, among the 184 samples out of 200 that require multi-hop reasoning, 47% are bridging questions, 23% are intersection questions, 22% are comparison questions, and 8% do not belong to any of the three types. In addition, these multi-hop reasoning types correspond to the types of compositional questions identified by Berant et al. (2013) and Talmor and Berant (2018).
The goal of question decomposition is to convert a multi-hop question into simpler, single-hop sub-questions. A key challenge of decomposition is that it is difficult to obtain annotations for how to decompose questions. Moreover, generating the question word-by-word is known to be a difficult task that requires substantial training data and is not straight-forward to evaluate (Gatt and Krahmer, 2018; Novikova et al., 2017).
Instead, we propose a method to create sub-questions using span prediction over the question. The key idea is that, in practice, each sub-question can be formed by copying and lightly editing a key span from the original question, with different span extraction and editing required for each reasoning type. For instance, the bridging question in Table 2 requires finding “the player named 2015 Diamond Head Classic MVP” which is easily extracted as a span. Similarly, the intersection question in Table 2 specifies the type of entity to find (“which actor and comedian”), with two conditions (“Stories USA starred” and “from “The Office””), all of which can be extracted. Comparison questions compare two entities using a discrete operation over some properties of the entities, e.g., “which is smaller”. When two entities are extracted as spans, the question can be converted into two sub-questions and one discrete operation over the answers of the sub-questions.
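As an illustration of the discrete operation used for comparison questions, a hypothetical which_is_smaller over (entity, value) pairs might look like the following sketch (the function name follows Table 2; comparing by birth year is an illustrative simplification):

```python
# Hypothetical sketch of the discrete operation for comparison questions.
# Each argument pairs an entity with the answer to its sub-question
# (here, birth years stand in for the textual answers).
def which_is_smaller(pair1, pair2):
    (entity1, value1), (entity2, value2) = pair1, pair2
    return entity1 if value1 < value2 else entity2

# "Who was born earlier, Emma Bull or Virginia Woolf?"
answer = which_is_smaller(("Emma Bull", 1954), ("Virginia Woolf", 1882))
```

The two sub-questions supply the values, and the operation resolves the comparison without any further reading.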
Span Prediction for Sub-question Generation
Our approach simplifies the sub-question generation problem into a span prediction problem that requires little supervision (400 annotations). The annotations are collected by mapping the question into several points that segment the question into spans (details in Section 4.2). We train a model that learns to map a question into points, which are subsequently used to compose sub-questions for each reasoning type through Algorithm 3.
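As an illustration, a minimal sketch of how predicted points can be turned into bridging sub-questions (simplified: it ignores the head-word edit of the full algorithm, and the "which … ?" template and function name are assumptions):

```python
def compose_bridging_subquestions(question_tokens, start, end):
    # Given predicted points marking a span in the original question, form
    # two single-hop sub-questions: Q1 asks for the bridge entity, and Q2
    # replaces the span with the placeholder token ANS.
    span = question_tokens[start:end]
    q1 = " ".join(["which"] + span + ["?"])
    q2 = " ".join(question_tokens[:start] + ["ANS"] + question_tokens[end:])
    return q1, q2

tokens = ("Which team does the player named 2015 Diamond Head "
          "Classic 's MVP play for ?").split()
q1, q2 = compose_bridging_subquestions(tokens, 4, 12)
```

Because each sub-question is copied from the original question, no free-form generation is needed.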
pointer_c is a function that points to c indices in an input sequence, where c is a hyperparameter that differs across reasoning types. Let S = s_1, …, s_n denote the sequence of n words in the input. The model encodes S using BERT (Devlin et al., 2019):

U = BERT(S) ∈ R^{n×h},  (1)

where h is the output dimension of the encoder. Let W ∈ R^{h×c} denote a trainable parameter matrix. We compute a pointer score matrix

Y = softmax(UW) ∈ R^{n×c},

where Y_{ij} denotes the probability that the i-th word is the j-th index produced by the pointer. At inference, the model extracts the c indices that yield the highest joint probability:

ind_1, …, ind_c = argmax_{i_1 ≤ … ≤ i_c} ∏_{j=1}^{c} Y_{i_j, j}.
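The joint argmax over non-decreasing indices can be computed exactly with a small dynamic program over log-probabilities. The following is a sketch under that formulation, not the authors' implementation:

```python
import numpy as np

def extract_indices(Y):
    """Given a pointer score matrix Y of shape (n, c), where Y[i, j] is the
    probability that word i is the j-th pointed index, return the indices
    i_1 <= ... <= i_c that maximize the joint probability."""
    n, c = Y.shape
    log_y = np.log(Y + 1e-12)
    # best[i, j]: best joint log-prob with the j-th index placed at word i
    best = np.full((n, c), -np.inf)
    back = np.zeros((n, c), dtype=int)
    best[:, 0] = log_y[:, 0]
    for j in range(1, c):
        # prefix_arg[i]: argmax over i' <= i of best[i', j-1]
        prefix_arg = np.zeros(n, dtype=int)
        for i in range(1, n):
            prev = prefix_arg[i - 1]
            prefix_arg[i] = i if best[i, j - 1] >= best[prev, j - 1] else prev
        best[:, j] = best[prefix_arg, j - 1] + log_y[:, j]
        back[:, j] = prefix_arg
    # backtrack from the best final position
    i = int(np.argmax(best[:, c - 1]))
    indices = [i]
    for j in range(c - 1, 0, -1):
        i = int(back[i, j])
        indices.append(i)
    return indices[::-1]
```

The monotonicity constraint i_1 ≤ … ≤ i_c keeps the search quadratic rather than exponential in c.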
3.3 Single-hop Reading Comprehension
Given a decomposition, we use a single-hop RC model to answer each sub-question. Specifically, the goal is to obtain the answer and the evidence, given the sub-question and a set of paragraphs. Here, the answer is either a span from one of the paragraphs, yes, or no. The evidence is the paragraph on which the answer is based.
Any off-the-shelf RC model can be used. In this work, we use the BERT reading comprehension model (Devlin et al., 2019) combined with the paragraph selection approach from Clark and Gardner (2018) to handle multiple paragraphs. Given N paragraphs P_1, …, P_N, this approach independently computes answer^i and y^i_none from each paragraph P_i, where answer^i denotes the answer candidate from the i-th paragraph and y^i_none is a score indicating that the i-th paragraph does not contain the answer. The final answer is selected from the paragraph with the lowest y^i_none. Although this approach takes a set of multiple paragraphs as input, it is not capable of jointly reasoning across different paragraphs.
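The selection rule, taking the answer from the paragraph with the lowest no-answer score, can be sketched with a hypothetical helper (names and scores are illustrative, not the authors' code):

```python
def select_answer(candidates):
    # candidates: one (answer, y_none) pair per paragraph, where y_none
    # scores the paragraph as NOT containing the answer.
    best_answer, _ = min(candidates, key=lambda pair: pair[1])
    return best_answer

answer = select_answer([
    ("Buddy Hield", -1.3),       # P1: low no-answer score
    ("Sacramento Kings", 2.1),   # P2: high no-answer score
])
```

Each paragraph is scored independently, which is exactly why this model alone cannot combine evidence across paragraphs.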
For each paragraph P_i, let U_i = BERT(Q; P_i) ∈ R^{n×h} be the BERT encoding of the sub-question Q concatenated with the paragraph P_i, obtained by Equation (1). We compute four scores, y^i_span, y^i_yes, y^i_no and y^i_none, indicating whether the answer is a phrase in the paragraph, yes, no, or does not exist:

[y^i_span; y^i_yes; y^i_no; y^i_none] = max(U_i) W_1 ∈ R^4,

where max denotes a max-pooling operation across the input sequence and W_1 ∈ R^{h×4} denotes a parameter matrix. Additionally, the model computes the span candidate answer^i_span, which is defined by its start and end points s^i and e^i:

s^i, e^i = argmax_{s ≤ e} P^i_start(s) P^i_end(e),

where P^i_start(s) and P^i_end(e) indicate the probability that the s-th word is the start and the e-th word is the end of the answer span, respectively. These are obtained as the s-th element of p^i_start and the e-th element of p^i_end from

p^i_start = softmax(U_i W_start),  p^i_end = softmax(U_i W_end).

Here, W_start, W_end ∈ R^h are the parameter matrices. Finally, answer^i is determined as one of answer^i_span, yes or no, based on which of y^i_span, y^i_yes and y^i_no is the highest.
The model is trained using questions that only require single-hop reasoning, obtained from SQuAD (Rajpurkar et al., 2016) and easy examples of HotpotQA (Yang et al., 2018) (details in Section 4.2). Once trained, it is used as an off-the-shelf RC model and is never directly trained on multi-hop questions.
3.4 Decomposition Scorer
Each decomposition consists of sub-questions, their answers, and evidence corresponding to a reasoning type. DecompRC scores decompositions and takes the answer of the top-scoring decomposition to be the final answer. The score indicates if a decomposition leads to a correct final answer to the multi-hop question.
Let t denote the reasoning type, and let answer_t and evidence_t be the answer and the evidence obtained with reasoning type t. Let x^t denote the sequence of words formed by the concatenation of the question, the reasoning type t, the answer answer_t, and the evidence evidence_t. The decomposition scorer encodes this input using BERT to obtain U^t ∈ R^{n×h}, similar to Equation (1). The score p^t is computed as

p^t = sigmoid(W_2 max(U^t)),

where W_2 ∈ R^h is a trainable matrix and max denotes a max-pooling operation across the input sequence.
During inference, the reasoning type is decided as t* = argmax_t p^t. The answer corresponding to this reasoning type is chosen as the final answer.
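The scoring step, max-pooling an encoding, applying a scalar sigmoid, and keeping the highest-scoring reasoning type, can be sketched as follows. Random arrays stand in for the trained BERT encoder and parameters, so this shows only the shapes and control flow:

```python
import numpy as np

def decomposition_score(U_t, w):
    # p_t = sigmoid(W_2 max(U_t)): max-pool the encoding over the sequence
    # dimension, then map to a scalar probability.
    pooled = U_t.max(axis=0)                          # shape (h,)
    return float(1.0 / (1.0 + np.exp(-w @ pooled)))

rng = np.random.default_rng(0)
h = 8
w = rng.normal(size=h)                                # stands in for W_2
encodings = {t: rng.normal(size=(12, h))              # one U^t per type
             for t in ("bridging", "intersection", "comparison")}
scores = {t: decomposition_score(U, w) for t, U in encodings.items()}
best_type = max(scores, key=scores.get)               # t* = argmax_t p^t
```

The answer produced under best_type would then be returned as the final answer.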
An alternative to the decomposition scorer is a pipeline approach, in which the reasoning type is determined in the beginning, before decomposing the question and obtaining the answers to sub-questions. Section 4.6 compares our scoring step with this approach to show the effectiveness of the decomposition scorer. Here, we briefly describe the model used for the pipeline approach.
We experiment on HotpotQA (Yang et al., 2018), a recently introduced multi-hop RC dataset over Wikipedia articles. There are two types of questions—bridge and comparison. Note that their categorization is based on the data collection and is different from our categorization (bridging, intersection and comparison) which is based on the required reasoning type. We evaluate our model on dev and test sets in two different settings, following prior work.
Distractor setting contains the question and a collection of 10 paragraphs: 2 paragraphs are provided to crowd workers to write a multi-hop question, and 8 distractor paragraphs are collected separately via TF-IDF between the question and the paragraph. The train set contains easy, medium and hard examples, where easy examples are single-hop, and medium and hard examples are multi-hop. The dev and test sets are made up of only hard examples.
Full wiki setting is an open-domain setting which contains the same questions as distractor setting but does not provide the collection of paragraphs. Following Chen et al. (2017), we retrieve 30 Wikipedia paragraphs based on TF-IDF similarity between the paragraph and the question (or sub-question).
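The retrieval step can be sketched as a bag-of-words TF-IDF cosine-similarity ranker. This is a minimal stand-in, not DrQA's actual Document Retriever (which uses hashed bigram features):

```python
import math
from collections import Counter

def tfidf_retrieve(question, paragraphs, k=30):
    """Rank paragraphs by TF-IDF cosine similarity to the question and
    return the top k. Whitespace tokenization; no stemming or bigrams."""
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {w: tf[w] * idf.get(w, 0.0) for w in tf}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(question.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: cosine(q, vec(p.lower().split())),
                    reverse=True)
    return ranked[:k]

top = tfidf_retrieve(
    "which team does Buddy Hield play for",
    ["Buddy Hield plays for the Sacramento Kings",
     "The Eiffel Tower is in Paris",
     "Selun lies in the canton of St. Gallen"],
    k=1,
)
```

For sub-questions containing ANS, the placeholder is first replaced by the previous answer before retrieval.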
|Model||Dist F1 (distractor setting)||Open F1 (full wiki setting)|
4.2 Implementation Details
Training Pointer for Decomposition.
We obtain a set of 200 annotations for bridging to train pointer_3, and another set of 200 annotations for intersection to train pointer_2, hence 400 in total. Each bridging question is paired with three points in the question, and each intersection question is paired with two points in the question. For comparison, we create training data in which each question is paired with four points (the start and end of the first entity and those of the second entity) to train pointer_4, requiring no extra annotation (details in Appendix B).
Training Single-hop RC Model.
We create single-hop QA data by combining HotpotQA easy examples and SQuAD (Rajpurkar et al., 2016) examples to form the training data for our single-hop RC model described in Section 3.3. To convert SQuAD to a multi-paragraph setting, we retrieve other Wikipedia paragraphs based on TF-IDF similarity between the question and the paragraph, using the Document Retriever from DrQA (Chen et al., 2017). We train three instances for an ensemble, which we use as the single-hop model.
To deal with ungrammatical questions generated through our decomposition procedure, we augment the training data with ungrammatical samples. Specifically, we add noise to each question by randomly dropping tokens with a small probability, and by replacing the wh-word with ‘the’ with a small probability.
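The augmentation above might look like the following sketch. The exact drop and replacement probabilities are not specified in this excerpt, so the defaults here are placeholders:

```python
import random

WH_WORDS = {"which", "what", "who", "whom", "where", "when"}

def add_noise(tokens, drop_prob=0.05, wh_prob=0.05, rng=None):
    # Randomly drop tokens and replace wh-words with "the"; probabilities
    # are illustrative placeholders, not the paper's values.
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() < drop_prob:
            continue                      # drop this token
        if tok.lower() in WH_WORDS and rng.random() < wh_prob:
            tok = "the"                   # e.g. "which team" -> "the team"
        out.append(tok)
    return out
```

Training on such noisy questions makes the single-hop model more tolerant of imperfect span-based sub-questions at inference time.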
Training Decomposition Scorer.
We create training data by making inferences for all reasoning types on HotpotQA medium and hard examples. We take the reasoning type that yields the correct answer as the gold reasoning type. Appendix C provides the full details.
4.3 Baseline Models
We compare our system DecompRC with the state-of-the-art on the HotpotQA dataset as well as strong baselines.
BERT is a large, pretrained language model that achieves state-of-the-art results across many different NLP tasks (Devlin et al., 2019). This baseline is the same as our single-hop model described in Section 3.3, but trained on the entirety of HotpotQA.
BERT–1hop train is the same model but trained on single-hop QA data without HotpotQA medium and hard examples.
DecompRC–1hop train is a variant of DecompRC that does not use multi-hop QA data except 400 decomposition annotations. Since there is no access to the groundtruth answers of multi-hop questions, a decomposition scorer cannot be trained. Therefore, a final answer is obtained based on the confidence score from the single-hop RC model, without a rescoring procedure.
Table 3 compares the results of DecompRC with other baselines on the HotpotQA development set. We observe that DecompRC outperforms all baselines in both distractor and full wiki settings, outperforming the previous published result by a large margin. An interesting observation is that DecompRC not trained on multi-hop QA pairs (DecompRC–1hop train) shows reasonable performance across all data splits.
We also observe that BERT trained on single-hop RC achieves a high F1 score, even though it does not draw inferences across different paragraphs. For further analysis, we split the HotpotQA development set into single-hop solvable (Single) and single-hop non-solvable (Multi) examples; we consider an example to be single-hop solvable if all three models of the BERT–1hop train ensemble obtain a non-zero F1. This yields 3426 single-hop solvable and 3979 single-hop non-solvable examples out of 7405 development examples. We observe that DecompRC outperforms BERT by a large margin on single-hop non-solvable (Multi) examples. This supports our attempt toward more explainable methods for answering multi-hop questions.
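The F1 metric used throughout is standard SQuAD-style token-level F1 between the predicted and groundtruth answer strings. A simplified sketch (the official evaluation script additionally normalizes punctuation and articles, omitted here):

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset of overlapping tokens.
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the Sacramento Kings" against the groundtruth "Sacramento Kings" gives precision 2/3 and recall 1, hence F1 of 0.8.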
Finally, Table 4 shows the F1 scores on the test set for the distractor setting and the full wiki setting on the leaderboard (retrieved on March 4th, 2019 from https://hotpotqa.github.io), which includes unpublished models concurrent to our work. DecompRC achieves the best result among models that report both distractor and full wiki results.
|Question||Robert Smith founded the multinational company headquartered in what city?|
|Span-based||Q1: Robert Smith founded which multinational company? Q2: ANS headquartered in what city?|
|Free-form||Q1: Which multinational company was founded by Robert Smith? Q2: Which city contains a headquarter of ANS?|
4.5 Evaluating Robustness
In order to evaluate the robustness of different methods to changes in the data distribution, we set up two adversarial settings in which the trained model remains the same but the evaluation dataset is different.
Modifying Distractor Paragraphs.
We collect a new set of distractor paragraphs to evaluate whether the models are robust to a change in distractors (we choose the 8 distractor paragraphs such that they do not change the groundtruth answer). In particular, we follow the same strategy as the original approach (Yang et al., 2018), using TF-IDF similarity between the question and the paragraph, but ensure that the new distractor paragraphs do not overlap with the original ones. Table 5 compares the F1 scores of DecompRC and BERT in the original and the modified distractor settings. As expected, the performance of both methods degrades, but DecompRC is more robust to the change in distractors. Namely, DecompRC–1hop train degrades much less (only 3.41 F1) than the other approaches because it is trained only on single-hop data and therefore does not exploit the data distribution. These results confirm our hypothesis that the end-to-end model is sensitive to changes in the data while our model is more robust.
Adversarial Comparison Questions.
We create an adversarial set of comparison questions by altering the original question so that the correct answer is inverted. For example, we change “Who was born earlier, Emma Bull or Virginia Woolf?” to “Who was born later, Emma Bull or Virginia Woolf?” We automatically invert 665 questions (details in Appendix D). We report the joint F1, taken as the minimum of the prediction F1 on the original and the inverted examples. Table 5 shows the joint F1 score of DecompRC and BERT. We find that DecompRC is robust to inverted questions, and outperforms BERT by 36.53 F1.
Span-based vs. Free-form sub-questions.
We evaluate the quality of the sub-questions generated by our span-based question decomposition. We replace the span-based decomposition component with (i) sub-questions formed from groundtruth spans, and (ii) free-form, hand-written sub-questions (examples shown in Table 6).
Table 7 (left) compares the question answering performance of DecompRC when given these alternative sub-questions, on a sample of 50 bridging questions (a full set of samples is shown in Appendix E). There is little difference in model performance between span-based sub-questions and sub-questions written by humans. This indicates that our span-based sub-questions are as effective as free-form sub-questions. In addition, the pointer model trained on 200 or 400 examples comes close to human performance. We think that identifying spans often relies on syntactic information in the question, which BERT has likely learned from language modeling. We use the model trained on 200 examples for DecompRC to demonstrate sample-efficiency, and expect performance improvements with more annotations.
Ablations in decomposition decision method.
Table 7 (right) compares different ablations to evaluate the effect of the decomposition scorer. For comparison, we report the F1 score of a confidence-based method, which chooses the decomposition with the maximum confidence score from the single-hop RC model, and of the pipeline approach, which independently selects the reasoning type as described in Section 3.4. In addition, we report an oracle that takes the maximum F1 score across the different reasoning types, providing an upper bound. The pipeline method obtains a lower F1 score than the decomposition scorer. This suggests that using more context from the decomposition (e.g., the answer and the evidence) helps avoid cascading errors from the pipeline. Moreover, the gap between DecompRC and the oracle (6.2 F1) indicates that there is still room for improvement.
|Breakdown of 15 failure cases||
|Incorrect groundtruth||2|
|Partial match with the groundtruth||3|
|Mistake from human||3|
|Sub-question requires cross-paragraph reasoning||2|
|Decomposed sub-questions miss some information||2|
|Answer to the first sub-question can be multiple||3|
|Q What country is the Selun located in?|
|P1 Selun lies between the valley of Toggenburg and Lake Walenstadt in the canton of St. Gallen.|
|P2 The canton of St. Gallen is a canton of Switzerland.|
|Q Which pizza chain has locations in more cities, Round Table Pizza or Marion’s Piazza?|
|P1 Round Table Pizza is a large chain of pizza parlors in the western United States.|
|P2 Marion’s Piazza … the company currently operates 9 restaurants throughout the greater Dayton area.|
|Q1 Round Table Pizza has locations in how many cities? Q2 Marion’s Piazza has locations in how many cities?|
|Q Which magazine had more previous names, Watercolor Artist or The General?|
|P1 Watercolor Artist, formerly Watercolor Magic, is an American bi-monthly magazine that focuses on …|
|P2 The General (magazine): Over the years the magazine was variously called ‘The Avalon Hill General’, ‘Avalon Hill’s General’, ‘The General Magazine’, or simply ‘General’.|
|Q1 Watercolor Artist had how many previous names? Q2 The General had how many previous names?|
Upper bound of span-based sub-questions without a decomposition scorer.
To measure an upper bound for span-based sub-questions without a decomposition scorer, assuming a human-level RC model, we conduct a human experiment on a sample of 50 bridging questions (a full set of samples is shown in Appendix E). In this experiment, humans are given each sub-question from the decomposition annotations and are asked to answer it without access to the original, multi-hop question. They are asked to answer each sub-question using no cross-paragraph reasoning, and to mark it as a failure case if that is impossible. The resulting F1 score, calculated by replacing the RC model with humans, is 72.67 F1.
Table 8 reports the breakdown of the fifteen error cases. 53% of these cases are due to incorrect groundtruth, partial match with the groundtruth, or human mistakes. 47% are genuine failures of the decomposition. For example, the multi-hop question “Which animal races annually for a national title as part of a post-season NCAA Division I Football Bowl Subdivision college football game?” corresponds to the last category in Table 8. The question can be decomposed into “Which post-season NCAA Division I Football Bowl Subdivision college football game?” and “Which animal races annually for a national title as part of ANS?”. However, in the given set of paragraphs, there are multiple games that can answer the first sub-question. Although only one of them involves the animal race, it is impossible to identify the correct answer given only the first sub-question. We think that incorporating the original question along with the sub-questions could be one solution to this problem, which is partially done by the decomposition scorer in DecompRC.
We show the overall limitations of DecompRC in Table 9. First, some questions are not compositional but require implicit multi-hop reasoning, and hence cannot be decomposed. Second, there are questions that can be decomposed but whose sub-question answers do not exist explicitly in the text, and must instead be inferred with commonsense reasoning. Lastly, the required reasoning is sometimes beyond our reasoning types (e.g., counting or calculation). Addressing these remaining problems is a promising area for future work.
We proposed DecompRC, a system for multi-hop RC that decomposes a multi-hop question into simpler, single-hop sub-questions. We recast sub-question generation as a span prediction problem, allowing the model to generate high-quality sub-questions after training on only 400 labeled examples. Moreover, DecompRC achieved further gains from the decomposition scoring step. DecompRC achieves the state of the art on the HotpotQA distractor and full wiki settings, while providing explainable evidence for its decisions in the form of sub-questions and being more robust to adversarial settings than strong baselines.
This research was supported by ONR (N00014-18-1-2826, N00014-17-S-B001), NSF (IIS 1616112, IIS 1252835, IIS 1562364), ARO (W911NF-16-1-0121), an Allen Distinguished Investigator Award, Samsung GRO and gifts from Allen Institute for AI, Google, and Amazon.
We thank the anonymous reviewers and UW NLP members for their thoughtful comments and discussions.
- Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP.
- Cao et al. (2019) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In NAACL.
- Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
- Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Dhingra et al. (2018) Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2018. Neural models for reasoning over multiple mentions using coreference. In NAACL.
- Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- Liang et al. (2011) Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL.
- Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In ACL.
- Novikova et al. (2017) Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In EMNLP.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- Richardson et al. (2013) Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP.
- Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Welbl et al. (2017) Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2017. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In TACL.
- Xiong et al. (2018) Caiming Xiong, Victor Zhong, and Richard Socher. 2018. DCN+: Mixed objective and deep residual coattention for question answering. In ICLR.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018. Fast and accurate reading comprehension by combining self-attention and convolution. In ICLR.
- Zelle and Mooney (1996) John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI.
- Zettlemoyer and Collins (2005) Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.
- Zhong et al. (2019) Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, and Richard Socher. 2019. Coarse-grain fine-grain coattention network for multi-evidence question answering. In ICLR.
Appendix A Span Annotation
In this section, we describe the span annotation collection procedure for bridging and intersection questions.
The goal is to collect three points (bridging) or two points (intersection) for a given multi-hop question. We design an interface that annotates a span over the question by clicking words in the question. First, given a question, the annotator is asked to identify which reasoning type out of bridging, intersection, one-hop, and neither is the most appropriate. (We exclude comparison questions from annotation, since comparison questions are already labeled in HotpotQA.) Since bridging is the most common type, bridging is checked by default. If the question type is bridging, the annotator is asked to make three clicks: for the start of the span, the end of the span, and the head-word (top four examples in Figure 2). After all three clicks are made, the annotator can see the heuristically generated sub-questions. If the question type is intersection, the annotator is asked to make two clicks: for the start and the end of the second segment out of three segments (bottom three examples in Figure 2). Similarly, the annotator can see the heuristically generated sub-questions after two clicks. If the question type is one-hop or neither, the annotator does not have to make any clicks. If the question can be decomposed in more than one way, the annotator is asked to choose the more natural decomposition. If the question is ambiguous, the annotator is asked to skip the example and annotate only the clear cases. For quality control, all annotators receive sufficient in-person, one-on-one tutorial sessions and are given 100 example annotations for reference.
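As a minimal sketch, the heuristic sub-question generation for a bridging question, given the three annotated clicks, might look as follows. This is our simplification: the interface's actual wh-word and head-word handling are more involved, and the function name is ours.

```python
def bridging_subquestions(tokens, start, end, head):
    """Form two single-hop sub-questions from a bridging question.

    tokens: the question as a list of words
    start, end: the annotated span boundaries (end exclusive)
    head: index of the annotated head-word inside the span
    (Simplified sketch: the wh-word is fixed to "which" here.)
    """
    # Sub-question 1 asks about the span, starting at the head-word.
    sub_q1 = " ".join(["which"] + tokens[head:end] + ["?"])
    # Sub-question 2 is the original question with the span replaced by ANS.
    sub_q2 = " ".join(tokens[:start] + ["ANS"] + tokens[end:])
    return sub_q1, sub_q2

# Example from Table 1 (span = "the player named ... MVP", head = "player"):
q = ("which team does the player named 2015 Diamond Head "
     "Classic 's MVP play for ?").split()
sub_q1, sub_q2 = bridging_subquestions(q, start=3, end=12, head=4)
# sub_q1: "which player named 2015 Diamond Head Classic 's MVP ?"
# sub_q2: "which team does ANS play for ?"
```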
Appendix B Decomposition for Comparison
|Operation & Example|
|Is greater (ANS) (ANS) → yes or no|
|Is smaller (ANS) (ANS) → yes or no|
|Which is greater (ENT, ANS) (ENT, ANS) → ENT|
|Which is smaller (ENT, ANS) (ENT, ANS) → ENT|
|Q: Did the Battle of Stones River occur before the Battle of Saipan?|
|Q1: The Battle of Stones River occur when? → 1862|
|Q2: The Battle of Saipan occur when? → 1944|
|Q3: Is smaller (the Battle of Stones River, 1862) (the Battle of Saipan, 1944) → yes|
|And (ANS) (ANS) → yes or no|
|Or (ANS) (ANS) → yes or no|
|Which is true (ENT, ANS) (ENT, ANS) → ENT|
|Q: In between Atsushi Ogata and Ralpha Smart who graduated from Harvard College?|
|Q1: Atsushi Ogata graduated from Harvard College? → yes|
|Q2: Ralpha Smart graduated from Harvard College? → no|
|Q3: Which is true (Atsushi Ogata, yes) (Ralpha Smart, no) → Atsushi Ogata|
|Is equal (ANS) (ANS) → yes or no|
|Not equal (ANS) (ANS) → yes or no|
|Intersection (ANS) (ANS) → string|
|Q: Are Cardinal Health and Kansas City Southern located in the same state?|
|Q1: Cardinal Health located in which state? → Ohio|
|Q2: Kansas City Southern located in which state? → Missouri|
|Q3: Is equal (Ohio) (Missouri) → no|
In this section, we describe the decomposition procedure for comparison questions, which does not require any extra annotation.
Comparison questions require comparing a property of two different entities, usually through discrete operations. We identify 10 discrete operations that sufficiently cover comparison operations, shown in Table 10. Based on these pre-defined discrete operations, we decompose the question through the following three steps.
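One plausible rendering of the discrete operations in Table 10 as code is shown below. The paper defines the operations only by their signatures; the implementations, function names, and the word-overlap reading of Intersection are our assumptions.

```python
def is_greater(a, b):            # Is greater (ANS) (ANS) -> yes or no
    return "yes" if a > b else "no"

def is_smaller(a, b):            # Is smaller (ANS) (ANS) -> yes or no
    return "yes" if a < b else "no"

def which_is_greater(pairs):     # (ENT, ANS) (ENT, ANS) -> ENT
    return max(pairs, key=lambda p: p[1])[0]

def which_is_smaller(pairs):     # (ENT, ANS) (ENT, ANS) -> ENT
    return min(pairs, key=lambda p: p[1])[0]

def logical_and(a, b):           # And (ANS) (ANS) -> yes or no
    return "yes" if a == "yes" and b == "yes" else "no"

def logical_or(a, b):            # Or (ANS) (ANS) -> yes or no
    return "yes" if a == "yes" or b == "yes" else "no"

def which_is_true(pairs):        # (ENT, yes/no) (ENT, yes/no) -> ENT
    return next(ent for ent, ans in pairs if ans == "yes")

def is_equal(a, b):              # Is equal (ANS) (ANS) -> yes or no
    return "yes" if a == b else "no"

def not_equal(a, b):             # Not equal (ANS) (ANS) -> yes or no
    return "yes" if a != b else "no"

def intersection(a, b):          # Intersection (ANS) (ANS) -> string
    # Assumed semantics: the words the two answers have in common.
    return " ".join(w for w in a.split() if w in set(b.split()))
```

For instance, the Battle of Stones River example in Table 10 reduces to `is_smaller(1862, 1944)`, which returns "yes".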
First, we extract the two entities under comparison. We use a pointer model to obtain four indices, where the first two indicate the start and the end of the first entity, and the last two indicate those of the second entity. We create training data in which each example contains the question and these four indices, as follows: we filter out bridging questions in HotpotQA to leave comparison questions, extract the entities in the question and in the two supporting facts (annotated sentences in the dataset which serve as evidence to answer the question) using the Spacy (https://spacy.io/) NER tagger, and match them to find the two entities which appear in one supporting sentence but not in the other.
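The entity-matching step of this training-data construction can be sketched as follows. Here the NER output (in the paper, from the Spacy tagger) is assumed to be precomputed, so only the matching logic is shown; the function name is ours.

```python
def find_compared_entities(question_ents, fact1_ents, fact2_ents):
    """Pick the two entities under comparison: each must appear in the
    question and in exactly one of the two supporting facts."""
    in_fact1_only = [e for e in question_ents
                     if e in fact1_ents and e not in fact2_ents]
    in_fact2_only = [e for e in question_ents
                     if e in fact2_ents and e not in fact1_ents]
    if in_fact1_only and in_fact2_only:
        return in_fact1_only[0], in_fact2_only[0]
    return None  # no unambiguous pair found; skip this example

# Example with precomputed (hypothetical) NER output:
pair = find_compared_entities(
    question_ents=["Battle of Stones River", "Battle of Saipan"],
    fact1_ents=["Battle of Stones River", "1862"],
    fact2_ents=["Battle of Saipan", "1944"],
)
# pair: ("Battle of Stones River", "Battle of Saipan")
```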
Then, we identify the suitable discrete operation, following Algorithm 2.
Finally, we generate sub-questions according to the discrete operation; two sub-questions are obtained, one for each entity.
Appendix C Implementation Details
We use PyTorch (Paszke et al., 2017) on top of Hugging Face's BERT implementation (https://github.com/huggingface/pytorch-pretrained-BERT). We fine-tune our model from Google's pretrained BERT-BASE (lowercased) model (https://github.com/google-research/bert), which contains 12 layers of Transformers (Vaswani et al., 2017) with a hidden dimension of 768. We optimize the objective function using Adam (Kingma and Ba, 2015). We lowercase the input, and set separate maximum sequence lengths for models whose input is both the question and the paragraph and for models whose input is the question only.
Appendix D Creating Inverted Binary Comparison Questions
We identify that comparison questions involving 7 out of the 10 discrete operations (Is greater, Is smaller, Which is greater, Which is smaller, Which is true, Is equal, Not equal) can be automatically inverted. This yields 665 inverted questions.
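One way this inversion could be realized is to swap each operation for its dual and adjust the gold answer accordingly. This is a sketch under our own assumptions; the paper lists the invertible operations but not the inversion mechanics, and the encoding below is ours. "Which is true" is omitted, since inverting it would require negating the predicate in the question text.

```python
# Duals of the comparison operations (our encoding).
INVERSE_OP = {
    "Is greater": "Is smaller", "Is smaller": "Is greater",
    "Which is greater": "Which is smaller",
    "Which is smaller": "Which is greater",
    "Is equal": "Not equal", "Not equal": "Is equal",
}
FLIP = {"yes": "no", "no": "yes"}

def invert_comparison(op, answer, entities=None):
    """Swap a comparison operation for its dual and update the answer.

    For yes/no operations the answer flips (assuming the two compared
    values differ); for "which"-type operations the answer becomes the
    other entity.
    """
    new_op = INVERSE_OP[op]
    if answer in FLIP:
        new_answer = FLIP[answer]
    else:
        e1, e2 = entities
        new_answer = e2 if answer == e1 else e1
    return new_op, new_answer
```

For example, inverting the Table 10 example gives `invert_comparison("Is smaller", "yes")`, i.e., an "Is greater" question whose answer is "no".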
Appendix E A Set of Samples used for Ablations