
MuSiQue: Multi-hop Questions via Single-hop Question Composition

by   Harsh Trivedi, et al.
Stony Brook University
Allen Institute for Artificial Intelligence

To build challenging multi-hop question answering datasets, we propose a bottom-up semi-automatic process of constructing multi-hop questions via composition of single-hop questions. Constructing multi-hop questions as compositions of single-hop questions allows us to exercise greater control over the quality of the resulting multi-hop questions. This process allows building a dataset with (i) connected reasoning where each step needs the answer from a previous step; (ii) minimal train-test leakage by eliminating even partial overlap of reasoning steps; (iii) variable number of hops and composition structures; and (iv) contrasting unanswerable questions by modifying the context. We use this process to construct a new multi-hop QA dataset: MuSiQue-Ans with 25K 2-4 hop questions using seed questions from 5 existing single-hop datasets. Our experiments demonstrate that MuSiQue is challenging for state-of-the-art QA models (e.g., human-machine gap of 30 F1 pts), significantly harder than existing datasets (2x human-machine gap), and substantially less cheatable (e.g., a single-hop model is worse by 30 F1 pts). We also build an even more challenging dataset, MuSiQue-Full, consisting of answerable and unanswerable contrast question pairs, where model performance drops further by 13+ F1 pts. For data and code, see <>.



1 Introduction

Multi-hop question answering (QA) datasets are designed with the intent that models must connect information from multiple facts in order to answer each multi-hop question. However, when using the common method of crowdsourcing to create any multi-hop question given just a pair of documents as the starting point, the resulting datasets (e.g., HotpotQA Yang et al. (2018)) often end up with unintended artifacts that allow models to achieve high scores Min et al. (2019a); Chen and Durrett (2019) without performing any reasoning that at least connects the intended facts together Trivedi et al. (2020). This defeats the purpose of building multi-hop datasets.

To address this issue, we propose a new bottom-up approach (and a corresponding dataset called MuSiQue) for building multi-hop reading comprehension QA datasets via composition of single-hop questions from existing datasets. By carefully controlling the hops used in the questions, our approach ensures that each hop requires the connected reasoning desired in multi-hop QA datasets.

Figure 1: Our proposed approach to generate multi-hop questions by composing single-hop questions. Left: A question from MuSiQue that forces models to reason through all intended hops. Right: A potential question filtered out by our approach for not requiring connected reasoning. Notice that the question on the right can be answered using just Q3’ without knowing the answers to the previous questions (there is only one mentioned country that gained independence on 27 Dec 1949).

Figure 1 illustrates two potential multi-hop questions that can be constructed using such a bottom-up approach. We create compositions of single-hop questions where each question must use the answer (often referred to as a bridging entity) from a previous question in the reasoning chain. While intuitive, this alone is not sufficient to ensure multi-hop reasoning. For example, the question on the right can be answered without ever finding answers to Q1’ or Q2’. Even if a model does not know that A2’ refers to the country Indonesia, there is only one country that is mentioned in the context as gaining independence on 27 Dec 1949. Models can, therefore, answer the supposedly multi-hop question on the right using just the last question (i.e., be successful with single-hop reasoning). This is not the case with the question on the left (an actual example from our dataset), where every single-hop question would be almost impossible to answer confidently without knowing the bridging entity from the previous step.

Our proposed approach to build multi-hop datasets identifies the presence of such artifacts in the composition chain and filters them out, thereby reducing the cheatability of our dataset.

In addition to reducing such artifacts, our proposed approach also minimizes the potential for memorization by reducing train-test leakage at the level of each single-hop question. The approach allows creating questions with a varying number of hops by simply composing over additional single-hop questions.

We use this approach to build a new challenge multi-hop QA dataset, MuSiQue, consisting of 2-4 hop questions. We empirically demonstrate that our dataset has fewer artifacts and is more challenging than two commonly used prior multi-hop reasoning datasets, HotpotQA Yang et al. (2018) and 2WikiMultihopQA Ho et al. (2020). We also show the benefits of using the various features of our pipeline in terms of increasing the difficulty of the dataset. We find that MuSiQue is a promising testbed for approaches that rely on principled multi-hop reasoning, such as the Step Execution model we discuss. Lastly, by incorporating the notion of unanswerability or insufficient context Rajpurkar et al. (2018); Trivedi et al. (2020), we also release a variant of our dataset, MuSiQue-Full, that is even more challenging and harder to cheat on.

In summary, we make two main contributions: (1) A new dataset construction approach for building challenging multi-hop reasoning QA datasets that operates via composition of single-hop questions, reduces artifacts, reduces train-test leakage, and is easier to annotate. (2) A new challenge dataset MuSiQue that is less cheatable, has a higher human-machine gap, and comes with reasoning graph decompositions that can be used to build stronger models (via data augmentation or as auxiliary supervision).

2 Multihop Reasoning Desiderata

Multihop question answering requires connecting and synthesizing information from multiple facts. Prior works, however, have shown that multihop reasoning datasets often have artifacts that allow models to achieve high scores bypassing the need for connecting the information from multiple facts Min et al. (2019a); Chen and Durrett (2019); Trivedi et al. (2020). This defeats the purpose of building multihop QA datasets, as we cannot reliably use them to measure progress in multihop reasoning.

What are then the desirable properties of questions that indeed test a model’s multi-hop capabilities, and how can we create such questions? Ideally, a multihop question should necessitate meaningful synthesis of information from all (multiple) supporting facts. However, what meaningful synthesis really means is subjective, and hard to objectively quantify. Trivedi et al. (2020) have argued that at a minimum, multihop reasoning should necessitate connecting information from multiple (all) supporting facts, and proposed a formal condition (DiRe condition) to check if disconnected reasoning is employed by a model on a dataset. While the DiRe condition allows one to probe existing models and datasets for undesirable reasoning, it does not provide an efficient way to construct new multihop QA datasets that necessitate connected reasoning.

In this work, we propose a method to create multi-hop reasoning datasets that require connected reasoning by first laying out desirable properties of a multi-hop question in terms of its decomposition, and then devising a dataset construction procedure to optimize for these properties.

Connected Question Hops.

Consider the question "Which city across the Charles River in Massachusetts was Facebook launched in?". This can be answered by two supporting facts in the Knowledge Source (KS; we view KS as the fixed context text in the reading comprehension setting and a large corpus in the open-QA setting): ($f_1$) Facebook was launched at Harvard University. ($f_2$) Harvard University is in Cambridge city, which is across the Charles River in Massachusetts. That is, the question can be decomposed as (Q1) "Which university was Facebook launched in?" with answer A1, and (Q2) "Which city across the Charles River in Massachusetts is A1 in?".

A proper way to find the answer to this question is to answer Q1, plug this answer into Q2 in place of A1, and answer Q2. However, in this example, Q2 is so specifically informative about the city of interest that it is possible to uniquely identify the answer to it even without considering A1 or using the first supporting fact. In other words, the question is such that a model doesn’t need to connect information from the two supporting facts.

To prevent such shortcut-based reasoning from being an effective strategy, we need questions and knowledge sources such that all hops of the multi-hop question are necessary to arrive at the answer. That is, the dataset shouldn’t contain information that allows a model to arrive at the correct answer confidently even when bypassing single-hop questions. This can be computationally checked by training a strong model $M$ on the dataset distribution. To check whether the hops of the above 2-hop question are connected, we need to check whether question Q2 with entity A1 masked, along with the knowledge source KS as input, is answerable by $M$. If it is, then it is possible for $M$ to answer the supposedly multi-hop question via disconnected reasoning. In general, for a strong model $M$, if a multi-hop question has any constituent question in its decomposition such that $M$ can answer it correctly even after masking at least one of its bridge entities, we can conclude that $M$ can cheat on the question via disconnected reasoning.
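The masking check described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Model` interface, the `[MASK]` token, the function name `can_cheat`, and the toy model are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical model interface: (question, context) -> predicted answer string.
Model = Callable[[str, str], str]


def can_cheat(
    edges: List[Tuple[int, int]],
    masked_questions: Dict[Tuple[int, int], str],
    valid_answers: Dict[int, Set[str]],
    model: Model,
    context: str,
) -> bool:
    """Return True if the model answers some subquestion correctly even
    though the bridge entity from one of its incoming edges is masked."""
    for edge in edges:
        _, target = edge
        prediction = model(masked_questions[edge], context)
        if prediction in valid_answers[target]:
            return True
    return False


def toy_model(question: str, context: str) -> str:
    # Stand-in for a trained QA model: it exploits the over-specific wording.
    if "Charles River" in question and "Cambridge" in context:
        return "Cambridge"
    return "unknown"


context = (
    "Facebook was launched at Harvard University. "
    "Harvard University is in Cambridge, across the Charles River."
)
answers = {1: {"Harvard University"}, 2: {"Cambridge"}}
specific = {(1, 2): "Which city across the Charles River in Massachusetts is [MASK] in?"}
generic = {(1, 2): "Which city is [MASK] in?"}

print(can_cheat([(1, 2)], specific, answers, toy_model, context))  # True
print(can_cheat([(1, 2)], generic, answers, toy_model, context))   # False
```

With the over-specific second hop, the toy model recovers the answer without the bridge entity, so the composition would be flagged as cheatable; with the generic second hop it cannot.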

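Formally, we associate each question $Q$ with $G_Q$, a directed acyclic graph representing $Q$’s decomposition into simpler subquestions $q_1, q_2, \ldots, q_n$, which form the nodes of $G_Q$. A directed edge $(q_j, q_i)$ for $j < i$ indicates that $q_i$ refers to the answer to some previous subquestion $q_j$ in the above subquestion sequence. $A_i$ is the (often singleton) set of valid answers to $q_i$. $A$ is the set of valid answers to $Q$. For edge $(q_j, q_i)$, we use $q_i^{m_j}$ to denote the subquestion formed from $q_i$ by masking out the mention of the answer from $q_j$. Similarly, $q_i^{m}$ denotes the subquestion with answers from all incoming edges masked out. We write $M(q; C)$ for the answer predicted by model $M$ for question $q$ given context $C$.

We say that all hops are necessary for a model $M$ to answer $Q$ if:

$$\forall\, (q_j, q_i) \in \mathrm{edges}(G_Q):\quad M(q_i^{m_j};\, C) \notin A_i \tag{1}$$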

Connected Question and Context.

Although the above requirement ensures each multi-hop question is connected via constituent single-hop questions, it doesn’t ensure that the single-hop questions are in fact meaningfully answered. This poses additional challenges, especially in the reading comprehension setting where the input text is limited and often has limited or no distractors.

We want models to identify and meaningfully synthesize information from the constituent question and context to arrive at its answer. However, here again, it’s difficult to quantify what meaningful synthesis really means. At the very least, though, we know some connection between constituent question and context should be required to arrive at the answer for the skill of reading comprehension. This means that, at the very least, we want multi-hop questions with constituent single-hop questions such that, when either the question or the context is dropped, what remains does not carry enough information to identify the answer confidently. This can be tested with a strong model that is trained on this dataset distribution.
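As a rough illustration of this test, the sketch below queries a model once with the context dropped and once with the question dropped. The `Model` interface, the use of an empty string for the dropped input, and the toy memorizing model are assumptions made for the example.

```python
from typing import Callable, Set

# Hypothetical model interface: (question, context) -> predicted answer string.
Model = Callable[[str, str], str]


def question_context_connected(
    question: str, context: str, valid_answers: Set[str], model: Model
) -> bool:
    """True if the model fails both with the question alone and with the
    context alone, i.e. some question-to-context connection is needed."""
    question_only = model(question, "")
    context_only = model("", context)
    return question_only not in valid_answers and context_only not in valid_answers


def memorizing_model(question: str, context: str) -> str:
    # Stand-in for a model that memorized a training question and ignores context.
    memorized = {"Which university was Facebook launched in?": "Harvard University"}
    return memorized.get(question, "unknown")


ctx = "Facebook was launched at Harvard University."
print(question_context_connected(
    "Which university was Facebook launched in?", ctx,
    {"Harvard University"}, memorizing_model))  # False: answerable without context
```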

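We say that some question-to-context connection is necessary for a model $M$ to answer $Q$ if:

$$\forall\, q_i \in \mathrm{nodes}(G_Q):\quad M(q_i;\, \varnothing) \notin A_i \;\wedge\; M(\varnothing;\, C) \notin A_i \tag{2}$$

Here $M(q_i; \varnothing)$ is the model’s answer given only the subquestion, $M(\varnothing; C)$ is its answer given only the context $C$, and $A_i$ is the set of valid answers to subquestion $q_i$.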

Although this requirement might seem rather naïve, previous works have shown that RC datasets often have artifacts that allow models to predict the answer without the question or without the context Kaushik and Lipton (2018). Moreover, recent work has also shown that question and answer memorization coupled with train-test leakage can lead to high answer scores without any context Lewis et al. (2021). As we will show later, previous multi-hop RC datasets can be cheated via such context-only and question-only shortcuts as well. Therefore, it is important for future multi-hop reading comprehension QA datasets to ensure that question and context are connected.

In summary, we want multi-hop reading comprehension questions that have desirable properties (1) and (2). If these two conditions hold for some strong model $M$, we say that the question satisfies the MuSiQue condition. Our dataset construction pipeline (Section 4) optimizes for these properties over a large space of multi-hop questions that can be composed from single-hop questions (Section 3).

3 Multihop via Singlehop Composition

The information need of a multi-hop question can be decomposed into a directed acyclic graph (DAG) of constituent questions or operations Wolfson et al. (2020). For example, the question "Which city was Facebook launched in?" can be decomposed into a 2-node DAG with nodes corresponding to "Which university was Facebook launched in?" → Harvard, and "Which city was #1 launched in?". The same process can also be reversed to compose a candidate multi-hop question from constituent single-hop questions. In the above example, the answer to the first question is an entity, Harvard, which also occurs as part of the second question; this allows the two questions to be composed together. More concretely, we have the following criterion:

Composability Criterion: Two single-hop question-answer tuples $(q_1, a_1)$ and $(q_2, a_2)$ are composable into a multi-hop question $Q$ with $a_2$ as a valid answer if $a_1$ is an answer entity and it is mentioned in $q_2$.
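A minimal sketch of this criterion follows. The normalization rule and the `a1_is_entity` flag (which would come from an entity extractor) are assumptions for illustration; the actual pipeline applies further entity-linking checks described in Section 4.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop English articles."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)


def is_composable(q1: str, a1: str, q2: str, a1_is_entity: bool) -> bool:
    """Check whether (q1, a1) can feed into q2: a1 must be an entity
    and must be mentioned (as whole words) in q2."""
    if not a1_is_entity:
        return False
    return f" {normalize(a1)} " in f" {normalize(q2)} "


print(is_composable(
    "Which university was Facebook launched in?", "Harvard",
    "Which city is Harvard in?", a1_is_entity=True))  # True
```

Padding both strings with spaces keeps the match at word boundaries, so "Harvard" does not match inside a longer token.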

The process of composing candidate multi-hop questions can be chained together to form candidate reasoning graphs of various shapes and sizes. Conveniently, since the NLP community has constructed abundant human-annotated single-hop questions, we can leverage them directly to create multi-hop questions.

Furthermore, since single-hop reading comprehension questions come with associated supporting paragraph or context, we can prepare supporting context for the composed multihop question as a set of supporting paragraphs from constituent questions. Additional distracting paragraphs can be retrieved from a large corpus of paragraphs.

Such ground-up and automatic construction of candidate multi-hop questions from existing single-hop questions gives us a programmatic approach to explore a large space of candidate multi-hop questions, which provides a unique advantage towards the goal of preventing shortcut-based reasoning.

Previous works on making multi-hop QA datasets less cheatable have explored finding or creating better distractors to include in the context, while treating the questions as static Min et al. (2019a); Jia and Liang (2017). However, this may not be a good enough strategy, because if the subquestions are specific enough, even in an open domain there may not be any good distractor Groeneveld et al. (2020a). Further, adding distractors found by specialized interventions may introduce new artifacts, allowing models to learn shortcuts again. Instead, creating multi-hop questions by composing single-hop questions provides us with greater control. Specifically, exploring a very large space of potential single-hop questions allows us to filter out those for which we can’t find strong distractors.

Figure 2: MuSiQue construction pipeline. The MuSiQue pipeline takes single-hop questions from existing datasets, explores the space of multi-hop questions that can be composed from them, and generates a dataset of challenging multi-hop questions that are difficult to cheat on. The pipeline also creates unanswerable multi-hop questions that make the final dataset significantly more challenging.

4 Data Construction Pipeline

We design a dataset construction pipeline with the goal of creating multi-hop questions that satisfy the MuSiQue condition (i.e., equations (1) and (2)). The high-level schematic of the pipeline is shown in Figure 2.

We begin with a large set of reading comprehension single-hop questions, with an individual instance denoted $(q_i, p_i, a_i)$, referring to the question, the associated paragraph, and a valid answer respectively. These single-hop questions are run through the following steps:

S1. Find Good Single-Hop Questions

First, we filter single-hop questions that:

Are close paraphrases:

If two questions have the same normalized answer (normalization removes special characters and articles, and lowercases the text) and their question words have an overlap of more than 70%, we assume them to be paraphrases and filter out one of them.
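The paraphrase filter can be sketched as follows. The source does not say how word overlap is measured; Jaccard similarity over normalized word sets is an assumption here, as are the function names and the quadratic pairwise comparison.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, replace special characters with spaces, drop articles."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in {"a", "an", "the"})


def word_overlap(q_a: str, q_b: str) -> float:
    """Jaccard overlap of normalized question words (an assumed measure)."""
    words_a, words_b = set(normalize(q_a).split()), set(normalize(q_b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def drop_paraphrases(
    items: List[Tuple[str, str]], threshold: float = 0.7
) -> List[Tuple[str, str]]:
    """Keep the first of any group of (question, answer) pairs that share a
    normalized answer and overlap in question words by more than threshold."""
    kept: List[Tuple[str, str]] = []
    for question, answer in items:
        duplicate = any(
            normalize(answer) == normalize(kept_answer)
            and word_overlap(question, kept_question) > threshold
            for kept_question, kept_answer in kept
        )
        if not duplicate:
            kept.append((question, answer))
    return kept


questions = [
    ("Who wrote Hamlet?", "Shakespeare"),
    ("Who wrote the Hamlet?", "shakespeare"),   # paraphrase of the first
    ("Who wrote Macbeth?", "Shakespeare"),      # same answer, overlap only 0.5
]
print(len(drop_paraphrases(questions)))  # 2
```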

Likely have annotation errors:

Annotation errors are often unavoidable in single-hop RC datasets. While a small percentage of such errors is not a huge problem, these errors can be amplified when multi-hop questions are created by composing single-hop questions: a multi-hop question would have an error if any constituent single-hop question has an error. For example, a dataset of 3-hop questions, created by composing single-hop questions with errors in 20% of the questions, would have errors in roughly 50% of the multi-hop questions (1 − 0.8³ ≈ 0.49, assuming errors are independent). To remove such errors without human intervention, we use a model-based approach. We generate 5-fold train, validation and test splits of the set. For each split, we train 5 strong models (2 random seeds of RoBERTa-large Liu et al. (2019), 2 random seeds of Longformer-Large Beltagy et al. (2020) and 1 UnifiedQA Khashabi et al. (2020)) for the answer prediction task in the reading comprehension setting. We remove instances from the test folds where none of the models’ predicted answers had any overlap with the labeled answer.
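Two parts of this step lend themselves to a short sketch: the error amplification arithmetic and the "no model agrees with the label" filter. The sketch assumes independent errors across hops and interprets "any overlap" as sharing at least one lowercased whitespace token; both are assumptions, not details given in the text.

```python
from typing import List


def multihop_error_rate(single_hop_error_rate: float, num_hops: int) -> float:
    """Probability that at least one of num_hops composed questions is
    erroneous, assuming errors are independent across hops."""
    return 1.0 - (1.0 - single_hop_error_rate) ** num_hops


def likely_annotation_error(predictions: List[str], gold: str) -> bool:
    """Flag an instance when no model prediction shares a token with the
    labeled answer (token overlap is an assumed reading of 'any overlap')."""
    gold_tokens = set(gold.lower().split())
    return all(not (set(p.lower().split()) & gold_tokens) for p in predictions)


print(round(multihop_error_rate(0.2, 3), 3))                    # 0.488
print(likely_annotation_error(["London", "Berlin"], "Paris"))   # True
print(likely_annotation_error(["Paris France", "Rome"], "Paris"))  # False
```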

We also remove single-hop questions where the dataset comes with multiple ground-truth answers, or where ground-truth answer isn’t a substring of the associated context.

Are not amenable to creating multi-hop questions:

Since composing two questions (described next) needs the answer to the first question to be an entity, we remove questions for which the answer doesn’t have a single entity (footnote 4: extracted by …). We also remove outlier questions for which the context is too short (< 20 words) or too long (> 300 words).

We start with 2017K single-hop questions from 5 datasets (SQuAD Rajpurkar et al. (2016), Natural Questions Kwiatkowski et al. (2019), MLQA (en-en subset) Lewis et al. (2019), T-REx ElSahar et al. (2018), Zero Shot RE Levy et al. (2017)) and filter them down to 760K good single-hop questions using this pipeline.

S2. Find Composable 2-Hop Questions

We next find composable pairs of single-hop questions within this set. A pair of different single-hop questions $(q_1, p_1, a_1)$ and $(q_2, p_2, a_2)$ can be composed to form a 2-hop question $(Q, \{p_1, p_2\}, a_2)$ if $a_1$ is an entity and is mentioned once in $q_2$; here $Q$ represents a composed question whose DAG $G_Q$ has $q_1$ and $q_2$ as the nodes and $(q_1, q_2)$ as the only edge.

To ensure that the two entity mentions ($a_1$ and its occurrence in $q_2$, denoted $e_2$) refer to the same entity, we check the following:

  1. both $a_1$ and $e_2$ are marked as entities of the same type by an entity extractor (footnote 6: we used …);

  2. normalized $a_1$ and $e_2$ are identical;

  3. querying the Wikipedia search API with $a_1$ and $e_2$ returns an identical first result;

  4. a SOTA wikification model Wu et al. (2020) returns the same result for $a_1$ and $e_2$, each given its own context.

We found this process to be about 92% precise (based on crowdsourced human evaluation) in identifying pairs of single-hop questions with a common entity.

To consider a pair of single-hop questions as composable, we additionally check that: 1. $a_2$ is not part of $q_1$; 2. $a_1$ and $a_2$ are not the same. Given our seed set of 760K questions, we are able to find 12M 2-hop composable pairs.

S3. Filter to Connected 2-Hop Questions

Next, we filter the composable 2-hop questions down to only those that are likely to be connected. That is, to answer the 2-hop question, it is necessary to use and answer all constituent single-hop questions. We call this process disconnection filtering.

Going back to the MuSiQue condition (equations (1) and (2)), for the 2-hop question to be connected, we need $M(q_1; \varnothing) \neq a_1$ and $M(q_2^{m_1}; C) \neq a_2$, where $C$ is the context and $q_2^{m_1}$ is $q_2$ with the mention of $a_1$ masked. This condition naturally gives us an opportunity to decompose the problem into 2 parts: (i) check if the first single-hop question (head node) is answerable without the context; (ii) check if the second single-hop question (tail node) is answerable with context and question, but with the mention of $a_1$ in the question masked.

Filtering Head Nodes: We take all the questions that appear at least once as the head of composable 2-hop questions ($q_1$) to create the set of head nodes. We then create 5-fold splits of the set, and train and generate predictions using multiple strong models (different random seeds). This way we have 5 answer predictions for each unique head question. We consider the head node acceptable only if the average AnsF1 is less than a threshold.

Filtering Tail Nodes: We create a unique set of masked single-hop questions that occur as a tail node ($q_2^{m_1}$) in any composable 2-hop question. If the same single-hop question occurs in two 2-hop questions with different masked entities, both are added to the set. We then prepare the context for each question by taking the associated gold paragraph corresponding to that question and retrieving 9 distractor paragraphs using the question with the masked entity as a query. We then create 5-fold splits of the set, and train and predict the answer and supporting paragraph using multiple strong Longformer models with different random seeds (Longformer is used because it can fit a long context of 10 paragraphs and has been shown to be competitive on HotpotQA in a similar setup Beltagy et al. (2020)). This way we have 5 answer predictions for each unique tail question. We consider the tail node acceptable only if both AnsF1 and SuppF1 are less than a fixed threshold.

Finally, only those composable 2-hop questions are kept for which both head and tail node are acceptable. Starting from 12M 2-hop questions, this results in a set of 3.2M connected 2-hop questions.
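The acceptance rule for head and tail nodes can be sketched as below. The token-level F1 is a simplified SQuAD-style metric, and the threshold value of 0.5 used in the example is purely hypothetical, since the text does not state the thresholds.

```python
from collections import Counter
from typing import List


def answer_f1(prediction: str, gold: str) -> float:
    """Simplified token-level F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def head_is_acceptable(predictions: List[str], gold: str, threshold: float) -> bool:
    """Head node: context-free predictions must score a low average AnsF1."""
    average = sum(answer_f1(p, gold) for p in predictions) / len(predictions)
    return average < threshold


def tail_is_acceptable(ans_f1: float, supp_f1: float, threshold: float) -> bool:
    """Tail node: with the bridge entity masked, both answer and support
    scores must fall below the threshold."""
    return ans_f1 < threshold and supp_f1 < threshold


THRESHOLD = 0.5  # hypothetical value, not taken from the paper
head_ok = head_is_acceptable(["MIT", "Yale", "Harvard", "Stanford", "Brown"], "Harvard", THRESHOLD)
tail_ok = tail_is_acceptable(ans_f1=0.1, supp_f1=0.3, threshold=THRESHOLD)
print(head_ok and tail_ok)  # True: the 2-hop pair is kept
```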

S4. Build Multihop Questions

The set of connected 2-hop questions forms the directed edges of a graph. Any sub-DAG (directed acyclic subgraph) of this graph can be used to create a connected multi-hop question. We enumerate 6 types of reasoning graphs (1 for 2-hop, 2 for 3-hop and 3 for 4-hop questions), as shown in Table 9, and employ the following heuristics for curation.

To ensure diversity of the resulting questions and to make the graph exploration computationally practical, we use two heuristics to control graph traversal: (i) the same bridging entity should not be used more than 100 times; (ii) the same single-hop question should not appear in more than 25 multi-hop questions. Furthermore, since we eventually want to create a single comprehensible multi-hop question from the multiple single-hop questions, we limit the total length of these questions. Each single-hop question shouldn’t be more than 10 tokens long. The total length of the questions should not be more than 15 tokens for 2-hop and 3-hop questions, and not more than 20 tokens for 4-hop questions. Finally, we remove all 2-hop questions that occur as a subset of any of the 3-hop questions and remove all 3-hop questions that occur as a subset of any of the 4-hop questions.
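These limits are easy to express as simple checks. In the sketch below, whitespace tokenization and the function names are assumptions; the numeric limits come from the text above.

```python
from collections import Counter
from typing import List

MAX_BRIDGE_USES = 100     # uses per bridging entity
MAX_QUESTION_USES = 25    # multi-hop questions per single-hop question
MAX_SUBQUESTION_TOKENS = 10


def within_length_limits(subquestions: List[str]) -> bool:
    """Apply the per-question and total token limits (whitespace tokens)."""
    lengths = [len(q.split()) for q in subquestions]
    if any(n > MAX_SUBQUESTION_TOKENS for n in lengths):
        return False
    total_limit = 20 if len(subquestions) == 4 else 15
    return sum(lengths) <= total_limit


def can_add(
    bridge_entities: List[str],
    subquestion_ids: List[str],
    bridge_counts: Counter,
    question_counts: Counter,
) -> bool:
    """True if adding this composition keeps every usage cap intact."""
    return all(bridge_counts[e] < MAX_BRIDGE_USES for e in bridge_entities) and all(
        question_counts[q] < MAX_QUESTION_USES for q in subquestion_ids
    )


print(within_length_limits(
    ["Who captured Malakoff?", "When did the French come to the Caribbean?"]))  # True
print(can_add(["French"], ["q3", "q4"], Counter({"French": 100}), Counter()))   # False
```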

S5. Split Questions to Sets

Given the recent findings of Lewis et al. (2021), we split the final set of questions into train, validation and test splits such that it is not possible to score high by memorization. We do this by ensuring there is no overlap (described below) between the train and validation sets, or between the train and test sets. Additionally, we ensure that the overlap between validation and test is minimal.

We consider two multi-hop questions $Q_i$ and $Q_j$ to overlap if (i) any single-hop question is common between $Q_i$ and $Q_j$, (ii) the answer to any single-hop question of $Q_i$ is also an answer to some single-hop question in $Q_j$, or (iii) any associated paragraph of any of the single-hop questions is common between $Q_i$ and $Q_j$. We start with 2 sets of multi-hop questions: initially, set-1 contains all questions and set-2 is empty. Then we greedily move to set-2 the question from set-1 that least overlaps with the rest of the set-1 questions. We repeat this until set-2 reaches a fixed size. Then, we remove all remaining questions from set-1 that overlap with set-2. Finally, set-1 becomes the training set, and set-2 becomes the validation+test set, which is further split into validation and test with a similar procedure. We ensure the distribution of source datasets of single-hop questions in the train, validation and test sets is similar, and also control the sizes of 2-, 3- and 4-hop questions.
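The greedy procedure can be sketched as follows. The dictionary layout (sets of single-hop question ids, answers, and paragraph ids per multi-hop question) is an assumed representation, and this naive version recomputes overlaps on every step, so it is meant for illustration rather than for a dataset of this size.

```python
from typing import Dict, List, Tuple


def overlaps(a: Dict, b: Dict) -> bool:
    """Two multi-hop questions overlap if they share any single-hop
    question, any single-hop answer, or any associated paragraph."""
    return any(a[key] & b[key] for key in ("questions", "answers", "paragraphs"))


def greedy_split(questions: List[Dict], heldout_size: int) -> Tuple[List[Dict], List[Dict]]:
    """Move the least-overlapping questions into set-2, then drop from
    set-1 everything that overlaps with set-2."""
    set1 = list(questions)
    set2: List[Dict] = []
    while set1 and len(set2) < heldout_size:
        def overlap_count(idx: int) -> int:
            return sum(overlaps(set1[idx], other)
                       for k, other in enumerate(set1) if k != idx)
        best = min(range(len(set1)), key=overlap_count)
        set2.append(set1.pop(best))
    train = [q for q in set1 if not any(overlaps(q, h) for h in set2)]
    return train, set2


pool = [
    {"id": "A", "questions": {"q1", "q2"}, "answers": {"a1", "a2"}, "paragraphs": {"p1", "p2"}},
    {"id": "B", "questions": {"q2", "q3"}, "answers": {"a2", "a3"}, "paragraphs": {"p2", "p3"}},
    {"id": "C", "questions": {"q4", "q5"}, "answers": {"a4", "a5"}, "paragraphs": {"p4", "p5"}},
]
train, heldout = greedy_split(pool, heldout_size=1)
print([q["id"] for q in train], [q["id"] for q in heldout])  # ['A', 'B'] ['C']
```

Question C shares nothing with the others, so it is moved to the held-out set first, and nothing has to be discarded from training.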

S6. Build Contexts for Questions

For an n-hop question, the context is a set of paragraphs consisting of the paragraphs ($p_1, \ldots, p_n$) associated with the individual constituent subquestions, and additional distractor paragraphs retrieved from a corpus of paragraphs. The query used for retrieval is a plain concatenation of the subquestions with answer mentions from all of their incoming edges masked, $q_1^{m} + q_2^{m} + \cdots + q_n^{m}$.

To ensure that our distractors are not obvious, we retrieve them only from the set of gold context paragraphs associated with the initially filtered single-hop questions. As a result, it would be impossible to identify the relevant paragraphs without using the question. We will compare this strategy with the standard strategy of using full Wikipedia as a source of distractors in the experiments section.

Furthermore, we also ensure that memorizing the paragraphs from the training data can’t help the model to select or eliminate paragraphs in development or test set. For this, we first retrieve top 100 paragraphs for each multi-hop question (using concatenated query described above), then we enumerate over all non-supporting paragraphs of each question and randomly eliminate all occurrences of this paragraph either from (i) training or (ii) the development and test set. We combine the remaining retrieved paragraphs (in the order of their score) for each question to the set of supporting paragraphs to form a set of 20 paragraphs. These are then shuffled to form the context.

This strategy ensures that a given paragraph may occur as non-supporting in only one of the two: (i) train set or (ii) development and test set. This limits memorization potential (as we show in our experiments).
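One way to realize this constraint is to assign every candidate distractor paragraph to exactly one side and only draw distractors from the matching side. The sketch below does this; the side labels, function names, and seeding are assumptions, and retrieval itself is taken as given (a ranked list of paragraph ids).

```python
import random
from typing import Dict, List


def partition_distractors(paragraph_ids: List[str], seed: int = 0) -> Dict[str, str]:
    """Randomly allow each paragraph as a distractor on one side only."""
    rng = random.Random(seed)
    return {pid: rng.choice(["train", "dev_test"]) for pid in paragraph_ids}


def build_context(
    supporting: List[str],
    retrieved_ranked: List[str],
    side: str,
    allowed_side: Dict[str, str],
    context_size: int = 20,
    seed: int = 0,
) -> List[str]:
    """Combine supporting paragraphs with the top-ranked distractors that
    are allowed on this side, then shuffle."""
    distractors = [
        p for p in retrieved_ranked
        if p not in supporting and allowed_side.get(p) == side
    ]
    context = list(supporting) + distractors[: context_size - len(supporting)]
    random.Random(seed).shuffle(context)
    return context


allowed = {"d1": "train", "d2": "dev_test", "d3": "train", "d4": "dev_test"}
ctx_train = build_context(["s1", "s2"], ["s1", "d1", "d2", "d3", "d4"], "train", allowed, context_size=4)
print(sorted(ctx_train))  # ['d1', 'd3', 's1', 's2']
```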

S7. Crowdsource Question Compositions

We ask crowdworkers to compose short and coherent questions from our final question compositions (represented as DAGs), so that information from all single-hop questions is used, and the answer to the composed question is the same as that of the last single-hop question. We also filter out incorrectly composed questions by asking the crowdworkers to verify whether the bridged entities refer to the same underlying entity. The workers can see the associated paragraphs corresponding to each single-hop question for this task. Our annotation interface can be viewed in Figure 3. Workers are encouraged to write short questions, but if the question is too long, they are allowed to split it into 2 sentences (see Table 2 for some examples).

We ran initial qualification rounds on Amazon MTurk for the task, where 100 workers participated. Authors of the paper graded coherency and correctness and selected the top 17 workers to generate the final dataset. The task was split into 9 batches and we gave regular feedback to the workers by email. We paid 25, 40 and 60 cents for 2-, 3- and 4-hop questions respectively, which amounted to about 15 USD per hour. The total cost of question writing was about 11K USD.

At this stage we have 24,814 reading-comprehension instances (19,938 train, 2,417 validation, 2,459 test) which we call MuSiQue-Ans.

S8. Add Unanswerable Questions

For each answerable multi-hop RC instance we create a corresponding unanswerable multi-hop RC instance using a procedure closely similar to the one proposed in Trivedi et al. (2020). For a multi-hop question we randomly sample one of its single-hop questions and make it unanswerable by ensuring the answer to that single-hop question doesn’t appear in any of the paragraphs in the context. Since one of the single-hop questions is unanswerable given the context, the whole multi-hop question becomes unanswerable.

The process to build the context for unanswerable questions is identical to that of the answerable ones, except that it is adjusted to ensure the forbidden answer (from the single-hop question that’s being made unanswerable) is never part of the context. First, we remove the supporting paragraphs of the multi-hop question which contain the forbidden answer. Second, we retrieve the top 100 paragraphs with the concatenated query, the same as for answerable questions, but additionally impose a hard constraint to disallow the forbidden answer. From what remains, we apply the same filtering of paragraphs as explained for answerable questions, to ensure non-supporting paragraphs don’t overlap. Finally, the remaining supporting paragraphs are combined with the top retrieved paragraphs to form a context of 20 unique paragraphs.
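A simplified sketch of this context construction follows. It treats "contains the forbidden answer" as a case-insensitive substring match, which is an assumption, and it leaves out the train versus dev/test paragraph filtering and the final shuffle for brevity.

```python
from typing import List


def build_unanswerable_context(
    supporting: List[str],
    retrieved_ranked: List[str],
    forbidden_answer: str,
    context_size: int = 20,
) -> List[str]:
    """Drop every paragraph mentioning the forbidden answer, then fill the
    context with the top-ranked remaining paragraphs (unique, in order)."""
    def mentions_forbidden(paragraph: str) -> bool:
        return forbidden_answer.lower() in paragraph.lower()

    kept_support = [p for p in supporting if not mentions_forbidden(p)]
    distractors = [
        p for p in dict.fromkeys(retrieved_ranked)  # de-duplicate, keep rank order
        if p not in supporting and not mentions_forbidden(p)
    ]
    return kept_support + distractors[: context_size - len(kept_support)]


supporting = [
    "Billy Giles died in Belfast.",
    "Belfast is in Northern Ireland.",
    "Pound sterling is used in Northern Ireland.",
]
retrieved = [
    "Dublin is the capital of Ireland.",
    "Northern Ireland has six counties.",
    "Cardiff is in Wales.",
]
context = build_unanswerable_context(supporting, retrieved, "Northern Ireland", context_size=3)
print(context)
```

Here the second hop is the one made unanswerable, so every paragraph mentioning "Northern Ireland" is excluded and the gap is filled with retrieved paragraphs that do not mention it.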

For the new task, the model needs to predict whether the question is answerable or not, and predict answer and support if it’s answerable. Given the questions for answerable and unanswerable sets are identical and the context marginally changes, models that rely on shortcuts find this task extremely difficult.

Since we create one unanswerable question for each answerable question, we now have 49,628 reading-comprehension instances (39,876 train, 4,834 validation, 4,918 test) which we call MuSiQue-Full.

Final Dataset

The final dataset statistics for MuSiQue-Ans (MuSiQue-Full has twice the number of questions in each cell) are shown in Table 1. Multi-hop questions in MuSiQue are composed from 21,020 unique single-hop questions, with 4,132 unique answers to multi-hop questions, 19,841 unique answers to single-hop questions, and 7,676 unique supporting paragraphs. MuSiQue has multi-hop questions of 6 types of reasoning graphs distributed across 2-4 hops. These types and examples are shown in Table 2.

|                | Train  | Dev   | Test  |
|----------------|--------|-------|-------|
| 2-hop          | 14,376 | 1,252 | 1,271 |
| 3-hop          | 4,387  | 760   | 763   |
| 4-hop          | 1,175  | 405   | 425   |
| Total (24,814) | 19,938 | 2,417 | 2,459 |

Table 1: Dataset statistics of MuSiQue-Ans. MuSiQue-Full contains twice the number of questions in each category above: one answerable and one unanswerable.
Example 1 (2 hops)
Question: Who was the grandfather of David Goodhart? Answer: Arthur Lehman Goodhart
Decomposition:
1. Who was the male parent of David Goodhart? Philip Goodhart
2. Who’s Philip Goodhart’s father? Arthur Lehman Goodhart
Supporting snippets:
1. Philip Goodhart … one of seven children … to Philip Goodhart.
2. Philip Carter Goodhart … son of Arthur Lehman Goodhart.

Example 2 (3 hops)
Question: What currency is used where Billy Giles died? Answer: pound sterling
Decomposition:
1. At what location did Billy Giles die? Belfast
2. What part of the United Kingdom is Belfast located in? Northern Ireland
3. What is the unit of currency in Northern Ireland? pound sterling
Supporting snippets:
1. Billy Giles (…, Belfast – 25 September 1998, Belfast)
2. … thirty-six public houses … Belfast, Northern Ireland.
3. bank for pound sterling, issuing … in Northern Ireland.

Example 3 (3 hops)
Question: When was the first establishment that McDonaldization is named after, open in the country Horndean is located? Answer: 1974
Decomposition:
1. What is McDonaldization named after? McDonald’s
2. Which state is Horndean located in? England
3. When did the first McDonald’s open in England? 1974
Supporting snippets:
1. … spreading of McDonald’s restaurants … ’McDonaldization’
2. … Horndean is a village … in Hampshire, England.
3. 1974 … first McDonald’s in the United Kingdom .. in London.

Example 4 (4 hops)
Question: When did Napoleon occupy the city where the mother of the woman who brought Louis XVI style to the court died? Answer: 1805
Decomposition:
1. Who brought Louis XVI style to the court? Marie Antoinette
2. Who’s mother of Marie Antoinette? Maria Theresa
3. In what city did Maria Theresa die? Vienna
4. When did Napoleon occupy Vienna? 1805
Supporting snippets:
1. Marie Antoinette, …brought the "Louis XVI" style to court
2. Maria Antonia of Austria, youngest daughter of .. Maria Theresa
3. Maria Theresa … in Vienna … after the death
4. occupation of Vienna by Napoleon’s troops in 1805

Example 5 (4 hops)
Question: How many Germans live in the colonial holding in Aruba’s continent that was governed by Prazeres’s country? Answer: 5 million
Decomposition:
1. What continent is Aruba in? South America
2. What country is Prazeres? Portugal
3. The colonial holding in South America governed by Portugal? Brazil
4. How many Germans live in Brazil? 5 million
Supporting snippets:
1. … Aruba, lived including indigenous peoples of South America
2. Prazeres is .. in municipality of Lisbon, Portugal.
3. Portugal, … desire for independence amongst Brazilians.
4. Brazil .. 5 million people claiming German ancestry.

Example 6 (4 hops)
Question: When did the people who captured Malakoff come to the region where Philipsburg is located? Answer: 1625
Decomposition:
1. What is Philipsburg capital of? Saint Martin
2. Saint Martin (French part) is located on what terrain feature? Caribbean
3. Who captured Malakoff? French
4. When did the French come to the Caribbean? 1625
Supporting snippets:
1. Philipsburg .. capital of .. Saint Martin
2. … airport on the Caribbean island of Saint Martin/Sint Maarten.
3. … the capture of the Malakoff by the French
4. French trader … sailed to … Caribbean in 1625 … French settlement on

Table 2: Examples of 6 different reasoning graph shapes in MuSiQue

5 Experimental Setup

This section describes the datasets, models, and human assessment used in our experiments, whose results are reported in Section 6.

5.1 Datasets

We create two versions of our dataset: MuSiQue-Ans is a set of 25K answerable questions, where the task is to predict the answer and supporting paragraphs. MuSiQue-Full is a set of 50K questions (25K answerable and 25K unanswerable), where the task is to predict whether the question is answerable or not, and if it’s answerable, then predict the answer and supporting paragraphs.

We compare our dataset with two similar multi-hop RC datasets: HotpotQA Yang et al. (2018) and 2WikiMultihopQA Ho et al. (2020), which we refer to as 2Wiki for brevity. We use the distractor setting of HotpotQA to compare with our reading-comprehension setting. Questions in HotpotQA are crowdsourced, and questions in 2Wiki are automatically generated based on rules and templates. Both datasets have 10 paragraphs as context. HotpotQA contains 2-hop questions with 2 supporting paragraphs each, while 2Wiki has 2-hop and 4-hop questions with 2 and 4 supporting paragraphs, respectively. Additionally, HotpotQA has sentence-level support information and 2Wiki has supporting chain information with entity-relation tuples, but we don’t use this additional annotation in our evaluation for a fair comparison.

HotpotQA, 2Wiki, and MuSiQue-Ans have 90K, 167K, and 20K training instances, respectively. For a fair comparison, we use equal sized training sets in all our experiments, obtained by randomly sampling 20K instances each from HotpotQA and 2Wiki, and referred to as HotpotQA-20k and 2Wiki-20k, respectively.
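The equal-size subsampling described above can be sketched as follows. This is a minimal illustration rather than the script used for the experiments: the function name `subsample`, the fixed seed, and the toy instance lists (scaled down by 1000x) are all assumptions.

```python
import random

def subsample(instances, k=20_000, seed=0):
    """Return k instances drawn uniformly at random, without replacement."""
    if len(instances) <= k:
        return list(instances)
    rng = random.Random(seed)  # fixed seed (assumed) so the subset is reproducible
    return rng.sample(instances, k)

# Toy stand-ins for the real training sets (hypothetical identifiers).
hotpotqa_train = [f"hotpot-{i}" for i in range(90)]
wiki2_train = [f"2wiki-{i}" for i in range(167)]

hotpotqa_20k = subsample(hotpotqa_train, k=20)
wiki2_20k = subsample(wiki2_train, k=20)
```

Fixing the seed keeps the sampled subsets identical across runs, so that differences between models are not confounded by differences in training data.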

We will use the following notation throughout this section. Instances in MuSiQue-Ans, HotpotQA, and 2Wiki are of the form (Q, C; A, P_s). Given a question Q along with a context C consisting of a set of paragraphs, the task is to predict the answer A and identify the supporting paragraphs P_s ⊆ C. MuSiQue-Ans additionally has the DAG representation G_Q of the ground-truth decomposition (cf. Section 2), which models may leverage during training. Instances in MuSiQue-Full are of the form (Q, C; A, P_s, S), where there’s an additional binary classification task to predict S, the answerability of Q based on C, also referred to as context sufficiency Trivedi et al. (2020).


For MuSiQue-Ans, HotpotQA, and 2Wiki, we report the standard F1 based metrics for answer (AnsF1) and support identification (SuppF1); see Yang et al. (2018) for details. All 3 datasets have paragraph-level support annotation, but not all have the same further fine-grained support annotation, such as the reasoning graph, supporting sentences, or evidence tuples for the three datasets, respectively. To make a fair comparison, we use only paragraph-level support F1 across all datasets.
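As a rough illustration of the two metrics, the sketch below computes a token-level answer F1 and a paragraph-level support F1. It assumes the commonly used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the function names are ours, and the exact normalization in the official evaluation scripts may differ.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def answer_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def support_f1(predicted_ids, gold_ids):
    """Set-overlap F1 between predicted and gold supporting paragraph ids."""
    predicted, gold = set(predicted_ids), set(gold_ids)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting paragraphs {1, 2, 3} when the gold support is {1, 2} gives precision 2/3 and recall 1, hence a support F1 of 0.8.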

For MuSiQue-Full, we follow Trivedi et al. (2020) to combine context sufficiency prediction with AnsF1 and SuppF1, which are denoted as AnsF1+Suff and SuppF1+Suff. Instances in MuSiQue-Full occur in pairs and are also evaluated in pairs. Specifically, for each question Q with a sufficient context C, there is a paired instance with the same question Q and an insufficient context C′. For AnsF1+Suff, if a model incorrectly predicts context sufficiency (yes or no) for either of the instances in a pair, it gets 0 pts on that pair. Otherwise, it gets the same AnsF1 score on the pair as it gets on the answerable instance in the pair. Scores are averaged across all pairs of instances in the dataset. Likewise for SuppF1+Suff.
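The pairwise scoring rule can be sketched as below. This is only an illustration of the rule as stated in the text: the dictionary keys and the function name `ans_f1_suff` are hypothetical, and the per-instance AnsF1 is assumed to be computed beforehand.

```python
def ans_f1_suff(pairs):
    """Average pairwise AnsF1+Suff score.

    Each pair is a dict with (hypothetical) keys:
      'ans_f1'                 - AnsF1 on the answerable instance
      'pred_suff_answerable'   - predicted sufficiency for the answerable instance
      'pred_suff_unanswerable' - predicted sufficiency for the unanswerable instance
    """
    scores = []
    for pair in pairs:
        # Sufficiency must be predicted correctly on BOTH instances of the pair.
        both_correct = (pair["pred_suff_answerable"] is True
                        and pair["pred_suff_unanswerable"] is False)
        scores.append(pair["ans_f1"] if both_correct else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

example_pairs = [
    {"ans_f1": 0.8, "pred_suff_answerable": True, "pred_suff_unanswerable": False},
    {"ans_f1": 1.0, "pred_suff_answerable": True, "pred_suff_unanswerable": True},
    {"ans_f1": 0.5, "pred_suff_answerable": False, "pred_suff_unanswerable": False},
]
score = ans_f1_suff(example_pairs)
```

Only the first pair contributes here, since the other two each contain one wrong sufficiency prediction. SuppF1+Suff would follow the same pattern with the support F1 in place of `ans_f1`.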

5.2 Models

All our models are Transformer-based Vaswani et al. (2017) pretrained language models Devlin et al. (2019), implemented using PyTorch Paszke et al. (2019), HuggingFace Transformers Wolf et al. (2019) and AllenNLP Gardner et al. (2017). We experiment with 2 kinds of models: (1) Standard Multi-hop Models, which receive both the question Q and the context C as input, are in principle capable of employing desired or expected reasoning, and have demonstrated competitive performance on previous multi-hop QA datasets. These models help probe the extent to which a dataset can be solved by current models. (2) Artifact-based Models, which are restricted in some way that prohibits them from doing desired or expected reasoning, as we will discuss shortly. These models help probe the extent to which a dataset can be cheated.

5.2.1 Multi-hop Models

We describe how these models work for our datasets, MuSiQue-Ans and MuSiQue-Full. For HotpotQA and 2Wiki, they operate similarly to MuSiQue-Ans.

End2End Model.

This model takes the question and context (Q, C) as input and predicts the answer and supporting paragraphs (A, P_s) as the output for MuSiQue-Ans, and additionally the sufficiency label, (A, P_s, S), for MuSiQue-Full. Answer prediction is implemented as span prediction using a transformer similar to Devlin et al. (2019). Support prediction is implemented by adding special [PP] tokens at the beginning of each paragraph and supervising them with binary cross-entropy loss, similar to Beltagy et al. (2020). The binary classification for answerability or context sufficiency is done via the CLS token of the transformer architecture, trained with cross-entropy loss. We use Longformer Beltagy et al. (2020), which is one of the few transformer architectures that is able to fit the full context.

Select+Answer Model.

This model, inspired by Quark Groeneveld et al. (2020b) and SAE Tu et al. (2020), breaks the process into two parts. First, a context selector ranks and selects the K most relevant paragraphs C_K from the context C, where K is a hyperparameter chosen from {3,5,7}. Second, an answerer generates the answer and supporting paragraphs based only on C_K. Both components are trained individually, as follows.

The selector is designed to rank the support paragraphs P_s the highest based on Q and C. Given (Q, C) as input, it scores every paragraph in C and is trained with the cross-entropy loss. We form the reduced context C_K using the K paragraphs it scores the highest. The answerer is trained to take (Q, C_K) as input, and predict (A, P_s) as the output for MuSiQue-Ans and (A, P_s, S) for MuSiQue-Full. We implement the selector using RoBERTa-large Liu et al. (2019), and the answerer using Longformer-Large.
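The selection step can be sketched as a simple top-K filter over selector scores. In this minimal illustration the scores are given as plain numbers (in the actual model they would come from the trained selector), and keeping the selected paragraphs in their original order is our assumption, not something stated in the text.

```python
def select_top_k(paragraphs, scores, k=5):
    """Keep the k highest-scoring paragraphs to form the reduced context."""
    ranked = sorted(range(len(paragraphs)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])  # assumption: preserve the original paragraph order
    return [paragraphs[i] for i in keep]

# Toy context with hypothetical selector scores.
context = ["p0", "p1", "p2", "p3"]
selector_scores = [0.1, 0.9, 0.4, 0.8]
reduced_context = select_top_k(context, selector_scores, k=2)
```

The answerer then sees only `reduced_context`, so any supporting paragraph dropped at this stage cannot be recovered later; this is one reason the value of K is tuned as a hyperparameter.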

Step Execution Model.

Similar to prior decompositional approaches Talmor and Berant (2018); Min et al. (2019b); Qi et al. (2020); Khot et al. (2021), this model performs explicit, step-by-step multi-hop reasoning, by first predicting a decomposition of the input question into a DAG containing single-hop questions, and then using repeated calls to a single-hop model to execute this decomposition as discussed below.

The question decomposer is trained based on ground-truth decomposition annotations available in MuSiQue-Ans, and is implemented with BART-large.

The answerer takes the context C and the predicted DAG G_Q as input, and outputs (A, P_s) for MuSiQue-Ans and (A, P_s, S) for MuSiQue-Full. It does this with repeated calls to a single-hop model M_1 while traversing G_Q in a topological sort order Q_1, Q_2, …, Q_n, as follows. By definition of topological sort, Q_1 has no incoming edges and hence does not refer to the answer to any other single-hop question. The step execution model applies M_1 to Q_1 in order to predict an answer A_1. Then, for every edge (Q_1, Q_i), it substitutes the reference in Q_i to the answer to Q_1 with the predicted answer A_1, thereby removing this cross-question reference in Q_i. This process is repeated for Q_2, Q_3, …, Q_n in this order, and the predicted answer to Q_n is reported as the final answer.

The single-hop model M_1 (which we implement as an End2End model) is trained on only single-hop instances, taking a single-hop question and the context (Q_i, C) as input, and producing (A_i, P_i) or (A_i, P_i, S_i) as the output. Here P_i refers to the singleton supporting paragraph for Q_i and S_i refers to whether C is sufficient to answer Q_i. For MuSiQue-Full, the answerer predicts the multi-hop question as having sufficient context if M_1 predicts all subquestions in the above process to have sufficient context.
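The execution procedure can be sketched as follows, using the 3-hop example from Table 2. This is a toy illustration under several assumptions: references to earlier answers are written as `#<step id>` placeholders, the single-hop model is replaced by a dictionary lookup, and the function and variable names are hypothetical. It requires Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

def execute_steps(subquestions, edges, single_hop_model):
    """Answer sub-questions in topological order, substituting earlier answers.

    subquestions: {step_id: question text, possibly containing '#<id>' references}
    edges: list of (src, dst) pairs, meaning dst refers to the answer of src
    single_hop_model: callable mapping a question to (answer, sufficient)
    """
    predecessors = {step: set() for step in subquestions}
    for src, dst in edges:
        predecessors[dst].add(src)
    order = list(TopologicalSorter(predecessors).static_order())

    answers, all_sufficient = {}, True
    for step in order:
        question = subquestions[step]
        for src in predecessors[step]:
            # Replace the cross-question reference with the predicted answer.
            # (Plain string replacement is only safe for fewer than 10 steps.)
            question = question.replace(f"#{src}", answers[src])
        answer, sufficient = single_hop_model(question)
        answers[step] = answer
        all_sufficient = all_sufficient and sufficient
    # The answer to the last sub-question is reported as the final answer.
    return answers[order[-1]], all_sufficient

# Toy single-hop "model": a lookup table standing in for a trained QA model.
toy_kb = {
    "At what location did Billy Giles die?": "Belfast",
    "What part of the United Kingdom is Belfast located in?": "Northern Ireland",
    "What is the unit of currency in Northern Ireland?": "pound sterling",
}

def toy_model(question):
    return toy_kb.get(question, ""), question in toy_kb

steps = {
    1: "At what location did Billy Giles die?",
    2: "What part of the United Kingdom is #1 located in?",
    3: "What is the unit of currency in #2?",
}
final_answer, sufficient = execute_steps(steps, [(1, 2), (2, 3)], toy_model)
```

The returned flag is true only if every sub-question was judged to have sufficient context, mirroring the aggregation rule described for MuSiQue-Full.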

We experiment with this model only on MuSiQue, since HotpotQA and 2Wiki don’t have decompositions that are executable using the single-hop model M_1.

5.2.2 Artifact-based Models

To probe weaknesses of the datasets, we consider three models whose input, by design, is insufficient to allow them to perform desired reasoning.

The Question-Only Model takes only the question Q as input (no context C) and generates the answer A as the output. We implement this with BART-large Lewis et al. (2020).

The Context-Only Model takes only the context C as input (no question Q) and predicts the answer and supporting paragraphs (A, P_s) as the output. We implement this with an End2End Longformer-Large model where the empty string is used as Q.

Finally, our Single-Paragraph Model, similar to those proposed by Min et al. (2019a) and Chen and Durrett (2019), is almost the same as the Select+Answer model with K=1. Instead of training the selector to rank all supporting paragraphs P_s the highest, we train it to rank any paragraph containing the answer string A as the highest. The answerer then takes as input one selected paragraph p from C and predicts an answer to Q based solely on p. Note that this model doesn’t have access to full supporting information, as all considered datasets have at least two supporting paragraphs per question.
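The change in selector supervision for the Single-Paragraph Model can be illustrated with a small sketch. The function name is hypothetical, and case-insensitive substring matching is our assumption about how "containing the answer string" is checked.

```python
def answer_bearing_labels(paragraphs, answer):
    """Label each paragraph 1 if it contains the answer string, else 0."""
    needle = answer.lower()  # assumption: matching is case-insensitive
    return [1 if needle in paragraph.lower() else 0 for paragraph in paragraphs]

toy_paragraphs = [
    "Billy Giles (..., Belfast - 25 September 1998, Belfast)",
    "bank for pound sterling, issuing ... in Northern Ireland.",
    "Horndean is a village ... in Hampshire, England.",
]
labels = answer_bearing_labels(toy_paragraphs, "pound sterling")
```

Under this supervision, only the paragraph that mentions the final answer is a positive, regardless of whether the earlier hops are supported by other paragraphs.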

5.3 Human Performance

We perform a randomized human experiment to establish comparable and fair human performance on HotpotQA, 2Wiki, and MuSiQue-Ans. We sample 125 questions from each dataset, combine them into a single set, shuffle this set, and obtain 5 annotations of answer span and supporting paragraphs for each instance. Our interface lists all context paragraphs and makes them easily searchable and sortable with interactive text-overlap-based search queries 4. The interface includes a tutorial with 3 examples of question, supporting paragraphs, and answer from each dataset.

We crowdsourced this task on Amazon Mechanical Turk. For the qualification round ( annotations), we allowed only workers with master-qualification and selected workers who had more than 75 AnsF1 and more than 75 SuppF1 on all datasets. Only these 5 workers were allowed in the rest of the annotation.

We should note that 2Wiki did not report human scores, and HotpotQA reported human scores that aren’t a fair comparison to models. This is because humans in HotpotQA were shown the question and only the two correct supporting paragraphs to answer the question, whereas models are expected to reason over the full context (including 8 additional distracting paragraphs). In our setup, we put the same requirements on humans as on models, making the human vs. model comparison more accurate. Moreover, since we’ve shuffled our instances and used the same pool of workers, we can also compare human scores across the 3 datasets.

We compute two human performance scores: Human-Majority – most common annotated answer and support (in case of a tie, one is chosen at random), and Human Upper Bound – the answer and support prediction that maximizes the score. The human-scores reported in HotpotQA are based on Human Upper Bound.
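The two aggregation schemes can be sketched as below. This is an illustrative reading of the definitions above, not the evaluation code used for the reported numbers: the function names are hypothetical, and exact match is used as a stand-in scoring function to keep the example self-contained.

```python
import random
from collections import Counter

def human_majority(annotations, rng=None):
    """Return the most common annotation; ties are broken at random."""
    rng = rng or random.Random(0)  # fixed seed (assumed) for reproducibility
    counts = Counter(annotations)
    top = max(counts.values())
    tied = [annotation for annotation, count in counts.items() if count == top]
    return rng.choice(tied)

def human_upper_bound(annotations, gold, score_fn):
    """Return the best score achieved by any single annotation."""
    return max(score_fn(annotation, gold) for annotation in annotations)

def exact_match(prediction, gold):
    """Stand-in scoring function; the paper's metrics are F1-based."""
    return float(prediction == gold)

# Five toy annotations for one instance whose gold answer is "Paris".
toy_annotations = ["Vienna", "Vienna", "Paris", "vienna", "Vienna"]
majority_answer = human_majority(toy_annotations)
majority_score = exact_match(majority_answer, "Paris")
upper_bound_score = human_upper_bound(toy_annotations, "Paris", exact_match)
```

In this toy case the majority answer is wrong while one annotator is right, which shows why the upper-bound score can be much higher than the majority score.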

Group | Model | HotpotQA-20K AnsF1 | HotpotQA-20K SuppF1 | 2Wiki-20K AnsF1 | 2Wiki-20K SuppF1 | MuSiQue-Ans AnsF1 | MuSiQue-Ans SuppF1
Humans | Majority | 83.6 | 92.0 | 84.1 | 99.0 | 76.6 | 93.0
Humans | UpperBound | 91.8 | 96.0 | 89.0 | 100 | 88.6 | 96.0
Multi-hop Models | End2End | 72.9 | 94.3 | 72.9 | 97.6 | 42.3 | 67.6
Multi-hop Models | Select+Answer | 74.9 | 94.6 | 79.5 | 99.0 | 47.3 | 72.3
Multi-hop Models | Step Execution | - | - | - | - | 47.5 | 76.8
Artifact-Based Models | Single-Paragraph | 64.8 | - | 60.1 | - | 32.0 | -
Artifact-Based Models | Context-Only | 18.4 | 67.6 | 50.1 | 92.0 | 3.4 | 0.0
Artifact-Based Models | Question-Only | 19.6 | - | 27.0 | - | 4.6 | -
Table 3: MuSiQue has a substantially higher human-model gap than the other datasets, as shown by the results in the top two sections of the table. Further, MuSiQue is less cheatable compared to the other datasets, as evidenced by lower performance of artifact-based models (bottom section of the table).
Group | Model | MuSiQue-Ans AnsF1 | MuSiQue-Ans SuppF1 | MuSiQue-Full AnsF1+Suff | MuSiQue-Full SuppF1+Suff
Multi-hop Models | End2End | 42.3 | 67.6 | 22.0 | 25.2
Multi-hop Models | Select+Answer | 47.3 | 72.3 | 34.4 | 42.0
Multi-hop Models | Step Execution | 47.5 | 76.8 | 27.8 | 28.3
Artifact-based Models | Single-Paragraph | 32.0 | - | 2.4 | -
Artifact-based Models | Context-Only | 3.4 | 0.0 | 1.0 | 0.8
Artifact-based Models | Question-Only | 4.6 | - | 0.7 | -
Table 4: MuSiQue-Full is harder and less cheatable than MuSiQue-Ans, as evidenced by the multi-hop models and artifact-based models sections of the table, respectively. Note that MuSiQue-Full uses a stricter metric that also checks for correct context sufficiency prediction (“Suff”) and operates over pairs of highly related instances.

6 Empirical Findings

We now discuss our empirical findings, demonstrating that MuSiQue is a challenging multi-hop dataset that is harder to cheat on than existing datasets (Section 6.1), that the steps in the construction pipeline of MuSiQue are individually valuable (Section 6.2), and that the dataset serves as a useful testbed for pursuing interesting multi-hop reasoning research (Section 6.3).

All our reported results are based on the development sets of the respective datasets.

6.1 MuSiQue is a Challenging Dataset

We first show that, compared to the other two datasets considered (HotpotQA and 2Wiki), both variants of MuSiQue are less cheatable via shortcuts and have a larger human-to-machine gap.

Higher Human-Machine Gap.

As shown in the top two sections of Table 3, MuSiQue-Ans has a significantly higher human-model gap than the other datasets, for both answer and supporting paragraph identification (Table 9 in the Appendix shows the performance of multi-hop models split by the number of hops required by the question in MuSiQue: 2, 3, or 4 hops). In fact, for both the other datasets, supporting paragraph identification has even surpassed the human majority score, whereas for MuSiQue-Ans, there is a 17-point gap. Additionally, MuSiQue-Ans has a 29-point gap in answer F1, whereas HotpotQA and 2Wiki have gaps of only 10 and 5 points, respectively.

Note that the human-model gap further decreases for HotpotQA and 2Wiki if we use their full training dataset. Likewise for MuSiQue-Ans, as we’ll show in Section 6.3, adding auto-generated augmentation data also reduces the human-model gap. However, since these alterations result in datasets of different sizes and quality, they do not provide a useful comparison point.

Lower Cheatability.

As seen in the bottom section of Table 3, the performance of artifact-based models (from Section 5.2.2) is significantly higher on HotpotQA and 2Wiki than on MuSiQue-Ans. This shows that MuSiQue is significantly less cheatable via shortcut-based reasoning.

Answer identification can be done very well on HotpotQA and 2Wiki with the Single-Paragraph model. In particular, support identification in both datasets can be done to a very high degree (67.6 and 92.0 F1) even without the question. One might argue that the comparison of paragraph identification is unfair since HotpotQA and 2Wiki have 10 paragraphs as context, while MuSiQue-Ans has 20. However, as we will discuss in ablations later in Table 6, even with 10 paragraphs, MuSiQue-Ans is significantly less cheatable. Overall, we find that both answer and supporting paragraph identification tasks in MuSiQue-Ans are less cheatable via disconnected reasoning.

Pipeline | Single-Paragraph AnsF1 | Single-Paragraph SuppF1 | Context-Only AnsF1 | Context-Only SuppF1 | End2End AnsF1 | End2End SuppF1
Full Pipeline (F) | 32.0 | - | 3.4 | 0.0 | 42.3 | 67.6
F without unmemorizable splits | 85.1 | - | 70.1 | 49.8 | 90.1 | 85.5
F without disconnection filter | 52.7 | - | 6.3 | 36.8 | 60.8 | 72.5
Table 5: Key components of the construction pipeline of MuSiQue are crucial for its difficulty and lower cheatability.
Context Type | Retrieval Corpus | Single-Paragraph AnsF1 | Single-Paragraph SuppF1 | Context-Only AnsF1 | Context-Only SuppF1 | End2End AnsF1 | End2End SuppF1
No Distractors | None | 49.7 | - | 17.0 | 100 | 70.1 | 100
10 Para | Full Wikipedia | 42.5 | - | 12.5 | 77.7 | 57.2 | 87.6
10 Para | Positive Distractors | 28.0 | - | 5.5 | 34.6 | 54.1 | 80.2
20 Para | Full Wikipedia | 41.7 | - | 12.4 | 66.4 | 50.3 | 80.8
20 Para | Positive Distractors | 32.0 | - | 3.4 | 0.0 | 42.3 | 67.6
Table 6: Positive distractors are more effective than using full Wikipedia for choosing distractors, as evidenced by lower scores of models. The effect of using positive distractors is more pronounced when combined with the use of 20 (rather than 10) distractor paragraphs.

MuSiQue-Full: Even More Challenging.

Table 4 compares the performance of models on MuSiQue-Ans vs. MuSiQue-Full, where the latter is obtained by adding unanswerable questions to the former.

The results demonstrate that MuSiQue-Full is significantly more difficult and less cheatable than MuSiQue-Ans. Intuitively, because the answerable and unanswerable instances are very similar but have different labels, it’s difficult for models to do well on both instances if they learn to rely on shortcuts Kaushik et al. (2019); Gardner et al. (2020). As we see, all artifact-based models barely get any AnsF1+Suff or SuppF1+Suff score. For all multi-hop models too, the AnsF1 drops by 13-20 pts and SuppF1 by 30-48 pts.

6.2 Dataset Construction Steps are Valuable

Next, we show that three key steps of our dataset construction pipeline (Section 4) are valuable.

Disconnection Filter (step 3).

To understand the effect of Disconnection Filter in our dataset construction pipeline, we do an ablation study by skipping the step of filtering composable 2-hop questions to connected 2-hop questions; see Figure 2. Since we don’t have human-generated composed questions for these additional questions, we resort to a seq2seq BART-large model that’s trained (using MuSiQue) to take as input two composable questions and generate as output a composed question.

As shown in Table 5, the Disconnection Filter is crucial for increasing the difficulty and decreasing the cheatability of the final dataset. Specifically, without this filter, we see that both multihop and artifact-based models perform significantly better on the resulting datasets.

Reduced Train-Test Leakage (step 5).

Similar to the above ablation, we assess the value of using our careful train-test splits based on a clear separation of constituent single-hop subquestions, their answers, and their supporting paragraphs across splits (Step 8 in Figure 2). Note that our bottom-up construction pipeline is what enables such a split. To perform this assessment, we create a dataset the traditional way, with a purely random partition into train, validation, and test splits. For uniformity, we ensure that the distribution of 2-4 hop questions in the development set of the resulting dataset from both ablated pipelines remains the same as in the original development set.
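To make the notion of leakage concrete, the sketch below reports which single-hop components are shared between two splits. It is a hypothetical checker, not part of the construction pipeline: the instance layout (a `steps` list of question, answer, and paragraph-id triples) and the function name are assumptions.

```python
def split_overlap(train, test):
    """Report single-hop questions, answers, and paragraphs shared across splits."""
    def collect(instances):
        questions, answers, paragraphs = set(), set(), set()
        for instance in instances:
            for question, answer, paragraph_id in instance["steps"]:
                questions.add(question)
                answers.add(answer)
                paragraphs.add(paragraph_id)
        return questions, answers, paragraphs

    train_q, train_a, train_p = collect(train)
    test_q, test_a, test_p = collect(test)
    return {
        "questions": train_q & test_q,
        "answers": train_a & test_a,
        "paragraphs": train_p & test_p,
    }

# Toy multi-hop instances with hypothetical identifiers.
toy_train = [{"steps": [("q1", "a1", "p1"), ("q2", "a2", "p2")]}]
toy_test = [{"steps": [("q3", "a3", "p3"), ("q2", "a9", "p7")]}]
overlap = split_overlap(toy_train, toy_test)
```

A purely random partition of multi-hop questions gives no guarantee that these intersections are empty, since different multi-hop questions can reuse the same single-hop step; a split built bottom-up from single-hop questions can enforce it.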

Table 5 shows that without a careful train/test split, the dataset is highly solvable by current models (AnsF1=90.1). Importantly, we see that most of this high score can also be achieved by single-paragraph (AnsF1=85.1) and context-only (AnsF1=75.1) models, revealing the high cheatability of such a split.

Model | MuSiQue-Ans AnsF1 | MuSiQue-Ans SuppF1 | MuSiQue-Full AnsF1+Suff | MuSiQue-Full SuppF1+Suff
End2End | 42.3 | 67.6 | 22.0 | 25.2
End2End w/ Oracle Decomposition | 37.3 | 64.3 | 19.5 | 23.0
Step Execution | 47.5 | 76.8 | 27.8 | 28.3
Step Execution w/ Oracle Decomposition | 53.9 | 82.7 | 32.9 | 32.8
Table 7: Right inductive bias helps MuSiQue
Training Data | MuSiQue-Ans AnsF1 | MuSiQue-Ans SuppF1 | MuSiQue-Full AnsF1+Suff | MuSiQue-Full SuppF1+Suff
Filtered MultiHop | 42.3 | 67.6 | 22.0 | 25.2
Filtered MultiHop + Filtered SingleHop | 45.0 | 70.0 | 23.1 | 23.6
Filtered MultiHop + Unfiltered MultiHop | 52.1 | 67.7 | 33.9 | 38.2
Table 8: Effect of augmenting training data of MuSiQue. The three rows (top to bottom) have 20K, 34.5K, and 70.5K instances for the answerable set (left), respectively. There are twice as many instances in the unanswerable set (right).

Harder Distractors (step 7).

To understand the effect of distractors on the difficulty and cheatability of the dataset, we construct 5 variations. Three of them capture the effect of the number of distractors: (i) no distractors, (ii) 10 paragraphs, and (iii) 20 paragraphs; and two of them capture the effect of the source of distractors: (i) full Wikipedia (we used the Wikipedia corpus from Petroni et al. (2020)), and (ii) gold context paragraphs from the good single-hop questions (Stage 1 of Figure 2). We refer to the last setting as positive distractors, as these paragraphs are likely to appear as supporting (positive) paragraphs in our final dataset.

The results are shown in Table 6. First, we observe that using positive distractors instead of full Wikipedia significantly worsens the performance of all models. In particular, it makes identifying supporting paragraphs without the question extremely difficult. This difficulty also percolates to the Single-Paragraph and End2End models. We have shown in Table 3 that Context-Only models are able to identify supporting paragraphs in HotpotQA and 2Wiki to a very high degree (67.6 and 92.0 SuppF1) even without the question. This would have also been true for MuSiQue-Ans (66.4 SuppF1) had we used Wikipedia as the corpus for identifying distractors, as HotpotQA and 2Wiki do. This result suggests that it’s necessary to be careful about the corpus to search when selecting distractors, and to ensure there is no distributional shift that a powerful pretrained model can exploit to bypass reasoning.

Second, we observe that using 20 paragraphs instead of 10 makes the dataset more difficult and less cheatable. Interestingly, this effect is significantly more pronounced if we use positive distractors, indicating the synergy between these two approaches to create more challenging distractors.

6.3 A Testbed for Interesting Approaches

Right Inductive Bias Helps MuSiQue.

We’ve shown that artifact-based models do not perform well on MuSiQue. Next we ask, can a model that has an explicit inductive bias towards connected reasoning outperform black-box models on MuSiQue? For this, we compare the End2End model, which doesn’t have any explicit architectural bias towards doing multi-step reasoning, to the Step Execution model, which can only take explicit reasoning steps to arrive at answer and support.

As shown in Table 7, the Step Execution model outperforms the End2End model by 5 F1 points on both MuSiQue-Ans and MuSiQue-Full. Moreover, using the oracle decompositions further improves the score of the Step Execution model by 5-6 pts on both datasets. The End2End model, however, actually performs a few points worse when it’s provided with the oracle decomposition instead of the composed question. This shows that models that can exploit the decomposition structure (and our associated annotations) can potentially outperform these naive black-box models on our dataset, leading to the development of more interesting and possibly more interpretable models.

Additional Data Augmentation helps MuSiQue.

Our dataset construction pipeline allows us to generate additional training augmentation data for MuSiQue. We explore two strategies:

Adding Filtered Single-hop Questions. We collect the set of unique constituent single-hop questions from the training sets of MuSiQue-Ans and MuSiQue-Full, apply the same context building procedure (Step 7, but for a single question), and add the resulting RC instances to the training instances of MuSiQue-Ans and MuSiQue-Full, respectively. The validation and test sets of MuSiQue-Ans and MuSiQue-Full remain the same.

Adding Unfiltered Multi-hop Questions. We make a new set of multi-hop questions through our dataset construction pipeline for additional data augmentation. Since we’ve already exhausted the questions that pass the disconnection filter, we choose to skip the disconnection filter in favor of a larger training set. Additionally, since it’s a very large set, we use a model to generate question compositions for it instead of relying on humans. We use a BART-large seq2seq model trained on MuSiQue for this purpose.

We find that both augmentation strategies help improve scores on MuSiQue, by up to 10 AnsF1 points. However, the scores are still far from human-achievable scores (a gap of roughly 25 AnsF1 and SuppF1 points).

7 Related Work

Multihop QA.

MuSiQue-Ans is closest to HotpotQA Yang et al. (2018) and 2WikiMultihopQA Ho et al. (2020). HotpotQA was prepared by having crowdworkers write a question after being shown two paragraphs, with distractor paragraphs added post-hoc. We believe that, since the questions were written by crowdworkers without considering the difficulty of compositions and distractors, this dataset has been shown to be solvable to a large extent without multi-hop reasoning Min et al. (2019a); Trivedi et al. (2020). 2WikiMultihopQA Ho et al. (2020) was generated automatically using structured information from Wikidata and Wikipedia and a fixed set of human-authored rule templates. While the compositions in this dataset could be challenging, they were limited to a few rule-based templates and also selected in a model-independent fashion. As shown in our experiments, both these datasets (in comparable settings) are significantly more cheatable and have a smaller human-model gap.

Qi et al. (2020) also build a multi-hop QA dataset with variable number of hops but only use it for evaluation. Other multi-hop RC datasets focus on other challenges such as narrative understanding Khashabi et al. (2018), discrete reasoning Dua et al. (2019), multiple modalities Chen et al. (2020); Talmor et al. (2021), open-domain QA Geva et al. (2021); Khot et al. (2020); Yang et al. (2018); Talmor and Berant (2018); Mihaylov et al. (2018) and relation extraction Welbl et al. (2018). While we do not focus on these challenges, we believe it should be possible to extend our idea to these settings.

Unanswerability in QA.

The idea of using unanswerable questions to ensure robust reasoning has been considered before in both the single-hop Rajpurkar et al. (2018) and multi-hop Ferguson et al. (2020); Trivedi et al. (2020) settings. Within the multi-hop setting, the IIRC dataset Ferguson et al. (2020) focuses on open-domain QA where the unanswerable questions are identified based on human annotations of questions where relevant knowledge could not be retrieved from Wikipedia. Our work is more similar to Trivedi et al. (2020) where we modify the context of an answerable question to make it unanswerable. This allows us to create counterfactual pairs Kaushik et al. (2019); Gardner et al. (2020) of answerable and unanswerable questions that can be evaluated as a group Gardner et al. (2020); Trivedi et al. (2020) to measure the true reasoning capabilities of a model.

The transformation described in Trivedi et al. (2020) also removes supporting paragraphs to create unanswerable questions but relies on the dataset annotations to be complete. It is possible that the context contains other supporting paragraphs that were not annotated in the dataset. By creating questions in this bottom-up fashion where we even know the bridging entities, we can eliminate any potential supporting paragraph by removing all paragraphs containing the bridging entity.

Question Decomposition and Composition.

Existing multi-hop QA datasets have been decomposed into simpler questions Min et al. (2019b); Talmor and Berant (2018) and special meaning representations such as QDMR Wolfson et al. (2020). Due to the nature of our construction, our dataset naturally provides the question decomposition for each question. This can enable the development of more interpretable models with the right inductive biases, such as DecompRC Min et al. (2019b) and ModularQA Khot et al. (2021).

Similar to our approach, recent work Pan et al. (2021); Yoran et al. (2021) has also used bottom-up approaches to build multi-hop QA datasets. However, these approaches used rule-based methods to create the composed question, with the primary goal of data augmentation. These generated datasets themselves are not challenging and have only been shown to improve the performance on downstream datasets targeted in their rule-based compositions. Evaluating the impact of MuSiQue on other multi-hop QA datasets is an interesting avenue for future work.

8 Conclusion

We present a new pipeline to construct challenging multi-hop QA datasets via composition of single-hop questions. Due to the bottom-up nature of our construction, we can identify and eliminate potential reasoning shortcuts such as disconnected reasoning and train-test leakages. Furthermore, the resulting dataset automatically comes with annotated decompositions for all the questions, supporting paragraphs and bridging entities. This allows for easy dataset augmentation and development of models with the right inductive biases.

We build a new challenge dataset for multi-hop reasoning: MuSiQue, consisting of 2-4 hop questions with 6 types of reasoning graphs. We show our dataset is less cheatable and more challenging than prior multi-hop QA datasets. Due to the additional annotations, we are also able to create an even more challenging dataset: MuSiQue-Full, consisting of contrasting pairs of answerable and unanswerable questions with minor perturbations of the context. We use this dataset to show that each feature of our pipeline increases the hardness of the resulting dataset.

Extending our approach to more compositional operations, such as comparisons and discrete computations, is an interesting direction for future work. Developing stronger models in the future by improving the accuracy of the question decomposition and the accuracy of the single-hop question answering model can reduce the human-machine gap on MuSiQue.


Appendix A Appendix

Figure 3 shows our annotation interface for question composition. Figure 4 shows our annotation interface for establishing human scores on MuSiQue-Ans, 2WikiMultihopQA and HotpotQA.


Figure 3: The annotation interface we used for question composition in MuSiQue. Workers could see the decomposition graph and the passages associated with the subquestions.
Figure 4: The annotation interface we used for comparing human scores on MuSiQue, HotpotQA and 2WikiMultihopQA.

Table 9 shows the performance of various multi-hop models on MuSiQue, split by the number of hops required for the question.

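| Dataset | Model | 2-hop AnsF1 | 2-hop SuppF1 | 3-hop AnsF1 | 3-hop SuppF1 | 4-hop AnsF1 | 4-hop SuppF1 |
|---|---|---|---|---|---|---|---|
| MuSiQue-Ans (metric m) | End2End | 43.1 | 67.1 | 40.1 | 69.4 | 43.8 | 65.6 |
| MuSiQue-Ans (metric m) | Select+Answer | 52.0 | 72.2 | 42.7 | 75.2 | 41.2 | 67.2 |
| MuSiQue-Ans (metric m) | Step Execution | 56.1 | 80.0 | 43.7 | 77.1 | 28.2 | 66.0 |
| MuSiQue-Full (metric m+Suff) | End2End | 23.2 | 26.4 | 19.7 | 26.5 | 18.1 | 19.1 |
| MuSiQue-Full (metric m+Suff) | Select+Answer | 42.3 | 50.4 | 26.4 | 36.7 | 24.9 | 26.0 |
| MuSiQue-Full (metric m+Suff) | Step Execution | 38.5 | 38.9 | 18.3 | 20.2 | 58.2 | 10.5 |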
Table 9: Performance of various multi-hop models on questions with different numbers of hops.