Textual Multi-hop Question Answering (QA) is the task of answering questions by combining information from multiple sentences or documents. This is a challenging reasoning task that requires QA systems to identify relevant pieces of information in the given text and learn to compose them to answer a question. To enable progress in this area, many datasets wikihop; Talmor2018TheWA; hotpotqa; qasc and models decomprc; dfgn; sae with varying complexities have been proposed over the past few years. Our work focuses on HotpotQA hotpotqa, which contains 105,257 multi-hop questions derived from two Wikipedia paragraphs, where the correct answer is a span in these paragraphs or yes/no.
Due to the multi-hop nature of this dataset, it is natural to assume that the relevance of a sentence for a question would depend on the other sentences considered to be relevant. E.g., the relevance of “Obama was born in Hawaii.” to the question “Where was the 44th President of USA born?” depends on the other relevant sentence: “Obama was the 44th President of US.” As a result, many approaches designed for this task focus on jointly identifying the relevant sentences (or paragraphs) via mechanisms such as cross-document attention, graph networks, and entity linking.
Our results question this basic assumption. We show that a simple model, Quark (see Fig. 1), that first identifies relevant sentences from each paragraph independent of other paragraphs, is surprisingly powerful on this task: in 90% of the questions, Quark’s relevance module recovers all gold supporting sentences within the top-5 sentences. For QA, it uses a standard BERT devlin2018bert span prediction model (similar to current published models) on the output of this module. Additionally, Quark exploits the inherent similarity between the relevant sentence identification task and the task of generating an explanation given an answer produced by the QA module: it uses the same architecture for both tasks.
We show that this independent sentence scoring model results in a simple QA pipeline that outperforms all other BERT models in both ‘distractor’ and ‘fullwiki’ settings of HotpotQA. In the distractor setting (10 paragraphs, including two gold, provided as context), Quark achieves joint scores (answer and support prediction) within 0.75% of the current state of the art. Even in the fullwiki setting (all 5M Wikipedia paragraphs as context), by combining our sentence selection approach with a commonly used paragraph selection approach semanticmrs, we outperform all previously published BERT models. In both settings, the only models scoring higher use RoBERTa roberta, a more robustly trained language model that is known to outperform BERT across various tasks.
While our design uses multiple transformer models (now considered a standard starting point in NLP), our contribution is a simple pipeline without any bells and whistles, such as NER, graph networks, entity linking, etc.
The closest effort to Quark is by Min2019CompositionalQD Min2019CompositionalQD, who also propose a simple QA model for HotpotQA. Their approach selects answers independently from each paragraph to achieve competitive performance on the question-answering subtask of HotpotQA (they do not address the support identification subtask). We show that while relevant sentences can be selected independently, operating jointly over these sentences chosen from multiple paragraphs can lead to state-of-the-art question-answering results, outperforming independent answer selection by several points.
Finally, our ablation study demonstrates that the sentence selection module benefits substantially from using context from the corresponding paragraph. It also shows that running this module a second time, with the chosen answer as input, results in more accurate support identification.
2 Related Work
Most approaches for HotpotQA attempt to capture the interactions between the paragraphs by either relying on cross-attention between documents or sequentially selecting paragraphs based on the previously selected paragraphs.
While qfe qfe also use a standard Reading Comprehension (RC) model, they combine it with a special Query Focused Extractor (QFE) module to select relevant sentences for QA and explanation. The QFE module sequentially identifies relevant sentences by updating an RNN state representation in each step, allowing the model to capture the dependency between sentences across time-steps. dfgn dfgn propose a Dynamically Fused Graph Networks (DFGN) model that first extracts entities from paragraphs to create an entity graph, then dynamically extracts sub-graphs and fuses them with the paragraph representation. The Select, Answer, Explain (SAE) model sae is similar to our approach in that it also first selects relevant documents and uses them to produce answers and explanations. However, it relies on self-attention over all document representations to capture potential interactions. Additionally, it relies on a Graph Neural Network (GNN) to answer the questions. The Hierarchical Graph Network (HGN) model hgn builds a hierarchical graph with three levels (entities, sentences, and paragraphs) to allow for joint reasoning. DecompRC decomprc takes a completely different approach of learning to decompose the question (using additional annotations) and then answering the decomposed questions using a standard single-hop RC system.
Others such as Min2019CompositionalQD Min2019CompositionalQD have also noticed that many HotpotQA questions can be answered just based on a single paragraph. Our findings are both qualitatively and quantitatively different. They did not consider the support identification task, and showed strong (but not quite SoTA) QA performance by running a QA model independently on each paragraph. We, on the other hand, show that interaction is not essential for selecting relevant sentences but actually valuable for QA! Specifically, by using a context of relevant sentences spread across multiple paragraphs in steps 2 and 3, our simple BERT model outperforms previous models with complex entity- and graph-based interactions on top of BERT. We thus view Quark as a different, stronger baseline for multi-hop QA.
In the fullwiki setting, each question has no associated context and models are expected to select paragraphs from Wikipedia. To be able to scale to such a large corpus, the proposed systems often select the paragraphs independent of each other. A recent retrieval method in this setting is Semantic Retrieval semanticmrs where first the paragraphs are selected based on the question, followed by individual sentences from these paragraphs. However, unlike our approach, they do not use the paragraph context to select the sentences, missing key context needed to identify relevance.
3 Pipeline Model: Quark
Our model works in three steps. First, we score individual sentences from an input set of paragraphs based on their relevance to the question. Second, we feed the highest-scoring sentences to a span prediction model to produce an answer to the question. Third, we score the sentences from the input paragraphs a second time, now using the answer, to identify the supporting sentences. These three steps are implemented using the two modules described next in Sections 3.1 and 3.2.
3.1 Sentence Scoring Module
In the distractor setting, HotpotQA provides 10 context paragraphs that have an average length of 41.4 sentences and 1106 tokens. This is too long for standard language-model based span-prediction—most models scale quadratically with the number of tokens, and some are limited to 512 tokens. This motivates selecting a few relevant sentences to reduce the size of the input to the span-prediction model without losing important context. In a similar vein, the support identification subtask of HotpotQA also involves selecting a few sentences that best explain the chosen answer. We solve both of these problems with the same transformer-based sentence scoring module, with slight variation in its input.
Our sentence scorer uses the BERT-Large-Cased model devlin2018bert trained with whole-word masking, with an additional linear layer over the [CLS] token. Here, whole word masking refers to a BERT variant that masks entire words instead of word pieces during pre-training.
We score every sentence from every paragraph independently by feeding the following sequence to the model: [CLS] question [SEP] p [SEP] answer [SEP]. Here p is the paragraph containing the sentence. This sequence is the same for every sentence in the paragraph, but the sentence being classified is indicated using segment IDs: they are set to one value for tokens from that sentence and to a different value for the rest. If a paragraph has more than 512 tokens, we restrict the input to the first 512. Each annotated support sentence forms a positive example and all other sentences from p form the negative examples. Note that our classifier scores each sentence independently and never sees sentences from two paragraphs at the same time. (See Appendix A.1 for further detail.)
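The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_scorer_input` is hypothetical, tokens are assumed to be pre-tokenized word-pieces, the segment ID values 1 (target sentence) and 0 (everything else) are an assumption, and the plain truncation to 512 tokens is a simplification.

```python
from typing import List, Sequence, Tuple


def build_scorer_input(
    question: Sequence[str],
    sentences: Sequence[Sequence[str]],
    target: int,
    answer: Sequence[str] = ("[MASK]",),
    max_len: int = 512,
) -> Tuple[List[str], List[int]]:
    """Build [CLS] question [SEP] p [SEP] answer [SEP] for one target sentence.

    `sentences` holds the word-pieces of every sentence in paragraph p.
    Segment IDs mark the sentence being scored (assumed: 1) versus all
    other tokens (assumed: 0).
    """
    tokens = ["[CLS]", *question, "[SEP]"]
    segment_ids = [0] * len(tokens)
    for i, sentence in enumerate(sentences):
        tokens.extend(sentence)
        segment_ids.extend([1 if i == target else 0] * len(sentence))
    tail = ["[SEP]", *answer, "[SEP]"]
    tokens.extend(tail)
    segment_ids.extend([0] * len(tail))
    # Simplification: keep only the first max_len positions.
    return tokens[:max_len], segment_ids[:max_len]
```

Calling this once per sentence of a paragraph yields inputs that differ only in their segment IDs, which is what lets the scorer use the paragraph as context while classifying a single sentence.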
We train two variants of this model: (1) a scorer trained without answers, which scores sentences given a question but no answer (answer is replaced with a [MASK] token); and (2) a scorer trained with answers, which scores sentences given a question and its gold answer. We use the first variant for relevant sentence selection and the second for support identification (Sec. 3.3).
3.2 Question Answering Module
To find answers to questions, we use the Hugging Face implementation (huggingface) of the BERT span prediction model (devlin2018bert). To achieve our best score, we use their BERT-Large-Cased model with whole-word masking and SQuAD squad fine-tuning. (Footnote 1: While we use the model fine-tuned on SQuAD, ablations show that this adds only a small amount to the final score.) We fine-tune this model on the HotpotQA dataset, using as input the QA context selected by the sentence scorer trained without answers. Since BERT models have a hard limit of 512 word-pieces, we use this scorer to select the most relevant sentences that can fit within this limit, as described next. (See Appendix A.2 for training details.)
To accomplish this, we compute the relevance score for each sentence in the input paragraphs. Then we add sentences in decreasing order of their scores to the QA context until no further sentence fits within 508 word-pieces (including the question word-pieces). For every new paragraph considered, we also add its first sentence and the title of the article (enclosed in <t></t>). This ensures that our span-prediction model has the right co-referential information from each paragraph. We arrange these paragraphs in the order of their highest-scoring sentence, so the most relevant sentences come earlier, a signal that could be exploited by our model. The final four tokens are a separator, plus the words yes, no, and noans. This allows the model to answer yes/no comparison questions, or give no answer at all.
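The greedy context construction can be sketched as below. This is only an illustration under stated assumptions: `select_qa_context` is a hypothetical name, the `<t>` and `</t>` markers are assumed to cost one word-piece each, selection is assumed to stop at the first sentence that does not fit, and sentences within a paragraph are assumed to keep their original order.

```python
from typing import Dict, List, Sequence, Set, Tuple

Paragraph = Tuple[Sequence[str], Sequence[Sequence[str]]]  # (title pieces, sentences)


def select_qa_context(
    question: Sequence[str],
    paragraphs: Sequence[Paragraph],
    scores: Sequence[Sequence[float]],
    budget: int = 508,
) -> List[Tuple[int, List[int]]]:
    """Greedily pick sentences by score until the word-piece budget is used.

    Returns (paragraph index, sentence indices) pairs, ordered by each
    paragraph's highest-scoring sentence.
    """
    ranked = sorted(
        ((scores[p][s], p, s)
         for p, (_, sents) in enumerate(paragraphs)
         for s in range(len(sents))),
        key=lambda item: -item[0],
    )
    used = len(question)
    chosen: Dict[int, Set[int]] = {}  # insertion order = order of best sentence
    for _, p, s in ranked:
        title, sents = paragraphs[p]
        new, cost = {s}, 0
        if p not in chosen:
            cost += len(title) + 2  # title plus assumed <t> and </t> pieces
            new.add(0)              # first sentence of a new paragraph
        new -= chosen.get(p, set())
        cost += sum(len(sents[i]) for i in new)
        if used + cost > budget:
            break                   # assumption: stop at the first overflow
        used += cost
        chosen.setdefault(p, set()).update(new)
    return [(p, sorted(idx)) for p, idx in chosen.items()]
```

The four trailing tokens (separator, yes, no, noans) are left out of the sketch; the 508 budget is what leaves room for them within the 512 word-piece limit.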
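Table 1: Distractor setting

| Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1 |
|---|---|---|---|---|---|---|
| SAE (RoBERTa) sae | 67.70 | 80.75 | 63.30 | 87.38 | 46.81 | 72.75 |
| HGN (RoBERTa) hgn | – | 81.00 | – | 87.93 | – | 73.01 |

Table 2: Fullwiki setting

| Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1 |
|---|---|---|---|---|---|---|
| Quark + SR-MRS (Ours) | 55.50 | 67.51 | 45.64 | 72.95 | 32.89 | 56.23 |
| HGN (RoBERTa) + SR-MRS hgn | 56.71 | 69.16 | 49.97 | 76.39 | 35.36 | 59.86 |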
3.3 Bringing it Together: Distractor Setting
Given a question along with 10 distractor paragraphs, we use the variant of our sentence scoring module trained without answers to score each sentence in these paragraphs, again without looking at other paragraphs. In the second step, the selected sentences are fed as context into the QA module (as described in Section 3.2) to choose an answer. In the final step, to find sentences supporting the chosen answer, we use the variant trained with answers to score each sentence in the paragraphs, this time with the chosen answer as part of the input. (Footnote 2: We simply append the answer string to the question even if it is “yes” or “no”.)
We define the score of a set of sentences $S$ to be the sum of the individual sentence scores; that is, $\mathrm{score}(S) = \sum_{s \in S} \mathrm{score}(s)$. (Footnote 3: Note that $\mathrm{score}(s)$ is the logit score and can be negative, so adding a sentence may not always improve this score.) In HotpotQA, supporting sentences always come from exactly two paragraphs. We compute this score for all possible sets $S$ satisfying this constraint and take the highest scoring set of sentences as our support.
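Because the set score is a plain sum of logits, the search over all sets drawn from exactly two paragraphs does not need explicit enumeration of subsets. The sketch below is one way to do it, not necessarily the authors' implementation: for each paragraph it keeps the best non-empty subset (all positively scored sentences, or the single best sentence if none is positive) and then tries every pair of paragraphs. The names `best_subset` and `select_support` are hypothetical, and "exactly two paragraphs" is read as at least one sentence from each of two paragraphs.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def best_subset(scores: Sequence[float]) -> Tuple[List[int], float]:
    """Best non-empty subset of one paragraph under a sum-of-logits objective."""
    picked = [i for i, value in enumerate(scores) if value > 0]
    if not picked:
        # No positive logit: a single sentence is still required.
        picked = [max(range(len(scores)), key=scores.__getitem__)]
    return picked, sum(scores[i] for i in picked)


def select_support(
    scores: Sequence[Sequence[float]],
) -> Optional[Tuple[float, List[Tuple[int, int]]]]:
    """Return (total score, [(paragraph, sentence), ...]) for the best pair."""
    per_para = {p: best_subset(s) for p, s in enumerate(scores) if len(s) > 0}
    best = None
    for p, q in combinations(per_para, 2):
        total = per_para[p][1] + per_para[q][1]
        if best is None or total > best[0]:
            support = [(p, i) for i in per_para[p][0]]
            support += [(q, i) for i in per_para[q][0]]
            best = (total, support)
    return best
```

For a fixed pair of paragraphs the two per-paragraph choices are independent, so combining the two best subsets gives the same result as scoring every admissible set.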
3.4 Bringing it Together: Fullwiki Setting
Since there are too many paragraphs in the fullwiki setting, we use paragraphs from the SR-MRS system semanticmrs as our context for each question. On the Dev set, we tuned the paragraph score threshold in MRS to the value at which Quark performed best. Neither the sentence scorers nor the QA module were retrained in this setting.
We evaluate on both the distractor and fullwiki settings of HotpotQA with the following goal: Can a simple pipeline model outperform previous, more complex, approaches?
We present the EM (Exact Match) and F1 scores on the evaluation metrics proposed for HotpotQA: (1) answer selection, (2) support selection, and (3) joint score.
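As a rough per-question illustration of how these three metrics relate, the sketch below computes precision, recall and F1 from overlaps, and forms the joint score by multiplying answer and support precision (and recall) before taking the harmonic mean, which is how the HotpotQA evaluation combines them to the best of our understanding. The names `prf` and `joint_scores` are hypothetical, and answer normalization (lower-casing, punctuation and article stripping) is omitted.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def prf(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 from the multiset overlap of two sequences."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


def joint_scores(ans_pred, ans_gold, sup_pred, sup_gold) -> Tuple[float, float]:
    """Joint EM and joint F1 for one question.

    Answers are token lists; supports are lists of (paragraph, sentence) ids.
    """
    ans_p, ans_r, _ = prf(ans_pred, ans_gold)
    sup_p, sup_r, _ = prf(sup_pred, sup_gold)
    joint_p, joint_r = ans_p * sup_p, ans_r * sup_r
    joint_f1 = 2 * joint_p * joint_r / (joint_p + joint_r) if joint_p + joint_r > 0 else 0.0
    joint_em = float(list(ans_pred) == list(ans_gold) and set(sup_pred) == set(sup_gold))
    return joint_em, joint_f1
```

A correct answer with one spurious support sentence therefore still lowers the joint F1, since the support precision scales the answer precision.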
Table 1 shows that on the distractor setting, Quark outperforms all previous models based on BERT, including HGN, which like us also uses whole word masking for contextual embeddings. Moreover, we are within 1 point of models that use RoBERTa embeddings—a much stronger language model that has shown improvements of 1.5 to 6 points in previous HotpotQA models.
Quark also performs better than the recent single-paragraph approach for the QA subtask Min2019CompositionalQD by 14 points F1. While most of this gain comes from using a larger language model, Quark scores 2 points higher even with a language model of the same size (BERT-Base).
We observe a similar trend in the fullwiki setting (Table 2) where Quark again outperforms previous approaches (except HGN with RoBERTa). While we rely on retrieval from SR-MRS semanticmrs for our initial paragraphs, we outperform the original work. We attribute this improvement to two factors: our sentence selection capitalizing on the sentence’s paragraph context leading to better support selection, and a better span selection model leading to improved QA.
Table 3

| Sentence scorer | top-n | Sup F1 | Ans F1 |
|---|---|---|---|
| B-Base w/o context | 10 | 74.45 | 78.59 |
| B-Base w/ context | 6 | 83.15 | 80.92 |
| + B-Large (w/o answers) | 5 | 85.35 | 81.21 |
| w/ answers | 5 | 86.97 | – |
To evaluate the impact of context on our sentence selection model in isolation, we look at the number of sentences that score at least as high as the lowest-scoring annotated support sentence. In other words, this is the number of sentences we must send to the QA model to ensure all annotated support is included. Table 3 shows that providing the model with the context from the paragraph gives a substantial boost on this metric, bringing it down from 10 to only 6 when using BERT-Base (an oracle would need 3 sentences). It further shows that this boost carries over to the downstream tasks of span selection and choosing support sentences (improving the latter by about 9 points to 83%). Finally, the table shows the value of running the sentence selection model a second time: with BERT-Large, the scorer that sees the answer outperforms the scorer that does not by 1.62 points on the Support F1 metric.
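For a single question, the quantity described above can be computed as in this small sketch. The function name `sentences_needed` is hypothetical, and how the per-question values are aggregated into the top-n column of Table 3 is not stated in the text, so no aggregation is shown.

```python
from typing import Iterable, Sequence


def sentences_needed(scores: Sequence[float], gold: Iterable[int]) -> int:
    """Count sentences scoring at least as high as the weakest gold sentence.

    `scores` holds one relevance score per candidate sentence and `gold`
    the indices of the annotated support sentences.
    """
    threshold = min(scores[i] for i in gold)
    return sum(1 for value in scores if value >= threshold)
```

With scores `[0.9, 0.1, 0.5, 0.7]` and gold sentences 0 and 2, the threshold is 0.5 and three sentences must be passed on, one more than a perfect ranking would require.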
Looking deeper, we analyzed the accuracy of our third stage (support identification using the scorer trained with answers) as a function of the correctness of the QA stage. When QA finds the correct gold answer, the third stage obtains the right support in 65.9% of the cases. If the answer from QA is incorrect, its success rate is only 50.9%.
Our work shows that on the HotpotQA tasks, a simple pipeline model can do as well as or better than more complex solutions. Powerful pre-trained models allow us to score sentences one at a time, without looking at other paragraphs. By operating jointly over these sentences chosen from multiple paragraphs, we arrive at answers and supporting sentences on par with state-of-the-art approaches. This result shows that retrieval in HotpotQA is not itself a multi-hop problem, and suggests focusing on other multi-hop datasets to demonstrate the value of more complex techniques.
Appendix A
A.1 Training the sentence scoring model
Both variants of the sentence scoring model are trained the same way. We use the 90447 questions from the HotpotQA training set, shuffle them, and train for 4 epochs. Both models are trained in the distractor setting only, but evaluated in both settings. We construct positive and negative examples by choosing the two paragraphs containing the annotated support sentences, plus two more randomly chosen paragraphs. All sentences from the chosen paragraphs become instances for the model.
During training, we follow the fine-tuning advice from devlin2018bert, with two exceptions. We ramp up the learning rate over the first 10% of the batches, and then linearly decrease it again.
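A warm-up followed by linear decay of this kind can be written as a small function of the training step. This is a sketch under assumptions: the name `learning_rate` is hypothetical, the warm-up is assumed to be linear, and the start, peak and final rates are left as parameters because their values are not available in the text.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float,
    warmup_frac: float = 0.1,
    start_lr: float = 0.0,
    end_lr: float = 0.0,
) -> float:
    """Warm up over the first `warmup_frac` of steps, then decay linearly.

    The defaults for start_lr and end_lr are placeholders, not values
    taken from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr + (end_lr - peak_lr) * (step - warmup_steps) / decay_steps
```

Evaluating it at every batch index and passing the result to the optimizer reproduces the ramp-then-decay shape described above.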
To avoid biasing the training towards questions with many context sentences, we create batches at the question level. Three questions make up one batch, regardless of how many sentences they contain. We cap the batch size at 5625 tokens for practical purposes. If a batch exceeds this size, we drop sentences at random until the batch is small enough. As is standard for BERT classifiers, we use a cross-entropy loss with two classes, one for positive examples, and one for negative examples.
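Question-level batching with a token cap can be sketched as follows. This is an illustration only: `make_batches` is a hypothetical name, each training instance is represented just by its token count, and the random generator is seeded for reproducibility.

```python
import random
from typing import List, Sequence, Tuple


def make_batches(
    questions: Sequence[Sequence[int]],
    questions_per_batch: int = 3,
    max_tokens: int = 5625,
    seed: int = 0,
) -> List[List[Tuple[int, int]]]:
    """Group sentence instances by question, then enforce a token cap.

    `questions[q]` lists the token counts of the instances belonging to
    question q. Each batch is a list of (question index, token count).
    """
    rng = random.Random(seed)
    batches = []
    for start in range(0, len(questions), questions_per_batch):
        stop = min(start + questions_per_batch, len(questions))
        batch = [(q, n) for q in range(start, stop) for n in questions[q]]
        # Drop instances at random until the batch fits under the cap.
        while sum(n for _, n in batch) > max_tokens:
            batch.pop(rng.randrange(len(batch)))
        batches.append(batch)
    return batches
```

Because every batch covers the same number of questions, a question with many context sentences contributes to one gradient step just like a question with few.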
A.2 Training the span prediction model
We train the BERT span prediction model on the QA contexts produced by the sentence scorer trained without answers. We use a batch size of 16 questions and a maximum sequence length of 512 word-pieces. We use the same optimizer settings as the sentence selection model, with additional weight decay. The model is trained for a fixed number of epochs (set to 3) and the final model is used for evaluation. Under the hood, this model consists of two classifiers that run at the same time. One finds the first token of potential spans, and one finds the last token of potential spans. Each classifier uses a cross-entropy loss. The final loss is the average loss of the two classifiers. We train one model on the output from our best selection model and use it in all our experiments (and ablations).
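The span loss described above, the average of a start-token and an end-token cross-entropy, can be written out for a single example as below. This is a plain-Python sketch with hypothetical function names; an actual implementation would compute the same quantity on batched tensors.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under a softmax over `logits`."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[target]


def span_loss(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    start: int,
    end: int,
) -> float:
    """Average of the start-position and end-position cross-entropy losses."""
    return 0.5 * (cross_entropy(start_logits, start) + cross_entropy(end_logits, end))
```

With uniform logits over four positions, both classifiers incur a loss of log 4, so the averaged span loss is log 4 as well.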