MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension

05/31/2019 ∙ by Alon Talmor, et al.

A large number of reading comprehension (RC) datasets has been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.




1 Introduction

Reading comprehension (RC) is concerned with reading a piece of text and answering questions about it Richardson et al. (2013); Berant et al. (2014); Hermann et al. (2015); Rajpurkar et al. (2016). Its appeal stems both from the clear application it offers and from the fact that it allows probing many aspects of language understanding, simply by posing questions on a text document. Indeed, this has led to the creation of a large number of RC datasets in recent years.

While each RC dataset has a different focus, there is still substantial overlap in the abilities required to answer questions across these datasets. Nevertheless, there has been relatively little work Min et al. (2017); Chung et al. (2018); Sun et al. (2018) that explores the relations between the different datasets, including whether a model trained on one dataset generalizes to another. This research gap is highlighted by the increasing interest in developing and evaluating the generalization of language understanding models to new setups Yogatama et al. (2019); Liu et al. (2019).

In this work, we conduct a thorough empirical analysis of generalization and transfer across 10 RC benchmarks. We train models on one or more source RC datasets, and then evaluate their performance on a target test set, either without any additional target training examples (generalization) or with additional target examples (transfer). We experiment with DocQA Clark and Gardner (2018), a standard and popular RC model, as well as a model based on BERT Devlin et al. (2019), which provides powerful contextual representations.

Our generalization analysis confirms findings that current models over-fit to the particular training set and generalize poorly even to similar datasets. Moreover, BERT representations substantially improve generalization. However, we find that the contribution of BERT is much more pronounced on Wikipedia (which BERT was trained on) and Newswire, but quite moderate when documents are taken from web snippets.

We also analyze the main causes for poor generalization: (a) differences in the language of the text document, (b) differences in the language of the question, and (c) the type of language phenomenon that the dataset explores. We show how generalization is related to these factors (Figure 1) and that performance drops as more of these factors accumulate.

Our transfer experiments show that pre-training on one or more source RC datasets substantially improves performance when fine-tuning on a target dataset. An interesting question is whether such pre-training improves performance even in the presence of powerful language representations from BERT. We find the answer is a conclusive yes, as we obtain consistent improvements in our BERT-based RC model.

We find that training on multiple source RC datasets is effective for both generalization and transfer. In fact, training on multiple datasets leads to the same performance as training on the target dataset alone, but with roughly three times fewer examples. Moreover, we find that when using the high-capacity BERT-large, one can train a single model on multiple RC datasets, and obtain close to or better than state-of-the-art performance on all of them, without fine-tuning to a particular dataset.

Armed with the above insights, we train a large RC model on multiple RC datasets, termed MultiQA. Our model leads to new state-of-the-art results on five datasets, suggesting that in many language understanding tasks the size of the dataset is the main bottleneck, rather than the model itself.

Last, we have developed infrastructure (on top of AllenNLP Gardner et al. (2018)) in which experimenting with multiple models on multiple RC datasets, mixing datasets, and performing fine-tuning are trivial. It is also simple to expand the infrastructure to new datasets and new setups (abstractive RC, multi-choice, etc.). We will open source our infrastructure, which will help researchers evaluate models on a large number of datasets, and gain insight into the strengths and shortcomings of their methods. We hope this will accelerate progress in language understanding.

To conclude, we perform a thorough investigation of generalization and transfer in reading comprehension over 10 RC datasets. Our findings are:

  • An analysis of generalization on two RC models, illustrating the factors that influence generalization between datasets.

  • Pre-training on an RC dataset and fine-tuning on a target dataset substantially improves performance even in the presence of contextualized word representations (BERT).

  • Pre-training on multiple RC datasets improves transfer and generalization and can reduce the cost of example annotation.

  • A new model, MultiQA, that improves state-of-the-art performance on five datasets.

  • Infrastructure for easily performing experiments on multiple RC datasets.

The uniform format datasets can be downloaded from The code for the AllenNLP models is available at

2 Datasets

Dataset    Size   Context    Question         Multi-hop
SQuAD      108K   Wikipedia  crowd            No
NewsQA     120K   Newswire   crowd            No
SearchQA   140K   Snippets   trivia           No
TriviaQA   95K    Snippets   trivia           No
HotpotQA   113K   Wikipedia  crowd            Yes
CQ         2K     Snippets   Web queries/KB   No
CWQ        35K    Snippets   crowd/KB         Yes
ComQA      11K    Snippets   WikiAnswers      No
WikiHop    51K    Wikipedia  KB               Yes
DROP       96K    Wikipedia  crowd            Yes
Table 1: Characterization of different RC datasets. The top part (SQuAD through HotpotQA) corresponds to large datasets, and the bottom part (CQ through DROP) to small datasets.

We describe the 10 datasets used for our investigation. Each dataset provides question-context-answer triples for training, and a model maps an unseen question-context pair to an answer. For simplicity, we focus on the single-turn extractive setting, where the answer is a span in the context. Thus, we do not evaluate abstractive Nguyen et al. (2016) or conversational datasets Choi et al. (2018); Reddy et al. (2018).

We broadly distinguish large datasets that include more than 75K examples, from small datasets that contain fewer than 75K examples. In §4, we will fix the size of the large datasets to control for size effects, and always train on exactly 75K examples per dataset.

We now briefly describe the datasets, and provide a summary of their characteristics in Table 1. The table shows the original size of each dataset, the source for the context, how questions were generated, and whether the dataset was specifically designed to probe multi-hop reasoning.

The large datasets used are:

  1. SQuAD Rajpurkar et al. (2016): Crowdsourcing workers were shown Wikipedia paragraphs and were asked to author questions about their content. Questions mostly require soft matching of the language in the question to a local context in the text.

  2. NewsQA Trischler et al. (2017): Crowdsourcing workers were shown a CNN article (longer than SQuAD) and were asked to author questions about its content.

  3. SearchQA Dunn et al. (2017): Trivia questions were taken from the Jeopardy! TV show, and contexts are web snippets retrieved from the Google search engine for those questions, with an average of 50 snippets per question.

  4. TriviaQA Joshi et al. (2017): Trivia questions were crawled from the web. In one variant of TriviaQA (termed TQA-W), Wikipedia pages related to the questions are provided for each question. In another, web snippets and documents from the Bing search engine are given. For the latter variant, we use only the web snippets in this work (and term this TQA-U). In addition, we replace Bing web snippets with Google web snippets (and term this TQA-G).

  5. HotpotQA Yang et al. (2018): Crowdsourcing workers were shown pairs of related Wikipedia paragraphs and asked to author questions that require multi-hop reasoning over the paragraphs. There are two versions of HotpotQA: the first, where the context includes the two gold paragraphs and eight “distractor” paragraphs, and a second, where 10 paragraphs retrieved by an information retrieval (IR) system are given. Here, we use the latter version.

The small datasets are:

  1. CQ Bao et al. (2016): Questions are real Google web queries crawled from Google Suggest, originally constructed for querying the KB Freebase Bollacker et al. (2008). However, the dataset was also used as an RC task with retrieved web snippets Talmor et al. (2017).

  2. CWQ Talmor and Berant (2018c): Crowdsourcing workers were shown compositional formal queries against Freebase and were asked to re-phrase them in natural language. Thus, questions require multi-hop reasoning. The original work assumed models contain an IR component, but the authors also provided default web snippets, which we use here. The re-partitioned version 1.1 was used Talmor and Berant (2018a).

  3. WikiHop Welbl et al. (2017): Questions are entity-relation pairs from Freebase, and are not phrased in natural language. Multiple Wikipedia paragraphs are given as context, and the dataset was constructed such that multi-hop reasoning is needed for answering the question.

  4. ComQA Abujabal et al. (2018): Questions are real user questions from the WikiAnswers community QA platform. No contexts are provided, and thus we augment the questions with web snippets retrieved from the Google search engine.

  5. DROP Dua et al. (2019): Contexts are Wikipedia paragraphs and questions are authored by crowdsourcing workers. This dataset focuses on quantitative reasoning. Because most questions are not extractive, we only use the 33,573 extractive examples in the dataset (but evaluate on the entire development set).

3 Models

We carry out our empirical investigation using two models. The first is DocQA Clark and Gardner (2018), and the second is based on BERT Devlin et al. (2019), which we term BertQA. We now describe the pre-processing on the datasets, and provide a brief description of the models. We emphasize that in all our experiments we use exactly the same training procedure for all datasets, with minimal hyper-parameter tuning.


Examples in all datasets contain a question, text documents, and an answer. To generate an extractive example we (a) Split: We define a length $L$ and split every paragraph whose length is greater than $L$ into chunks using a few manual rules. (b) Sort: We sort all chunks (paragraphs whose length is at most $L$, or split paragraphs) by cosine similarity to the question in tf-idf space, as proposed by Clark and Gardner (2018). (c) Merge: We go over the sorted list of chunks and greedily merge them to the largest possible length that is at most $L$, so that the RC model will be exposed to as much context as possible. The final context is the merged list of chunks. (d) We take the gold answer and mark all spans that match the answer.
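The four pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, paragraphs are pre-tokenized lists of strings, the unspecified "manual rules" for splitting are replaced by fixed-size windows, and the tf-idf weighting is a simple smoothed variant rather than the one used in the experiments.

```python
import math
from collections import Counter

def split_into_chunks(tokens, max_len):
    """(a) Split: cut a token list into consecutive chunks of at most max_len tokens.
    Assumption: fixed-size windows stand in for the paper's manual rules."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def build_contexts(question, paragraphs, max_len):
    """Run Split, Sort and Merge; return merged contexts, most relevant first."""
    chunks = [c for p in paragraphs for c in split_into_chunks(p, max_len)]
    # (b) Sort: rank chunks by tf-idf cosine similarity to the question.
    vecs = tfidf_vectors([question] + chunks)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(vecs[0], vecs[i + 1]), reverse=True)
    # (c) Merge: greedily concatenate ranked chunks while the result fits in max_len.
    merged, current = [], []
    for i in ranked:
        if current and len(current) + len(chunks[i]) > max_len:
            merged.append(current)
            current = []
        current = current + chunks[i]
    if current:
        merged.append(current)
    return merged

def find_answer_spans(context, answer):
    """(d) Mark: return (start, end) token indices of every occurrence of the answer."""
    n = len(answer)
    return [(i, i + n - 1) for i in range(len(context) - n + 1)
            if context[i:i + n] == answer]
```

For example, with the question "who wrote hamlet", one paragraph about the weather and one about Shakespeare, and `max_len=8`, the Shakespeare paragraph is ranked first and cannot be merged with the other, so it becomes the first context.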


DocQA Clark and Gardner (2018): A widely-used RC model, based on BiDAF Seo et al. (2016), that encodes the question and document with bidirectional RNNs, performs attention between the question and document, and adds self-attention on the document side.

We run DocQA on each chunk, where the input is a sequence of up to $L$ tokens (the maximum chunk length) represented as GloVe embeddings Pennington et al. (2014). The output is a distribution over the start and end positions of the predicted span, and we output the span with highest probability across all chunks. At training time, DocQA uses a shared-norm objective that normalizes the probability distribution over spans from all chunks. We define the gold span to be the first occurrence of the gold answer in the context.
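To make the shared-norm idea concrete, the sketch below computes a negative log-likelihood in which the softmax normalizer runs over the positions of all chunks of an example, rather than over each chunk separately. It is a minimal illustration, not DocQA's actual code: the function names are hypothetical, and it assumes a single gold position and that the same loss is applied independently to start and end logits.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def shared_norm_nll(chunk_logits, gold):
    """Negative log-likelihood with one normalizer shared by all chunks.

    chunk_logits: list of per-chunk lists of position logits.
    gold: (chunk_index, position) of the gold start (or end) token.
    """
    all_logits = [x for chunk in chunk_logits for x in chunk]
    chunk_idx, pos = gold
    return log_sum_exp(all_logits) - chunk_logits[chunk_idx][pos]

# Two chunks with two positions each and uniform logits: the gold position
# competes with all four positions, so the loss is log(4), not log(2).
loss = shared_norm_nll([[0.0, 0.0], [0.0, 0.0]], gold=(1, 0))
```

Because scores from different chunks share one normalizer, they remain comparable at test time, which is what picking the highest-probability span across chunks relies on.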



BertQA Devlin et al. (2019): For each chunk, we apply the standard implementation, where the input is a sequence of wordpiece tokens composed of the question and chunk separated by special tokens [CLS] <question> [SEP] <chunk> [SEP]. A linear layer with softmax over the top-layer [CLS] outputs a distribution over start and end span positions.

We train over each chunk separately, back-propagating into BERT’s parameters. We maximize the log-likelihood of the first occurrence of the gold answer in each chunk that contains the gold answer. At test time, we output the span with the maximal logit across all chunks.
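The test-time rule (take the span with the maximal logit across chunks) can be sketched as follows. The helper names are hypothetical, and two details are assumptions not stated in the text: a span is scored by the sum of its start and end logits, and spans are capped at `max_answer_len` tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring span within one chunk.
    Assumption: span score = start logit + end logit, with start <= end."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best

def predict(chunks):
    """chunks: list of (tokens, start_logits, end_logits), one entry per chunk.
    Returns the answer tokens and score of the best span over all chunks."""
    best_score, best_tokens = float("-inf"), []
    for tokens, start_logits, end_logits in chunks:
        score, i, j = best_span(start_logits, end_logits)
        if score > best_score:
            best_score, best_tokens = score, tokens[i:j + 1]
    return best_tokens, best_score

# Toy logits for two chunks of the same example.
answer, score = predict([
    (["a", "b", "c"], [0.1, 2.0, 0.0], [0.0, 0.5, 1.5]),
    (["x", "y"], [1.0, 0.0], [0.2, 1.0]),
])
```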

4 Controlled Experiments

We now present controlled experiments aiming to explore generalization and transfer of models trained on a set of RC datasets to a new target.

4.1 Do models generalize to unseen datasets?

We first examine generalization – whether models trained on one dataset generalize to examples from a new distribution. While different datasets differ substantially, there is overlap between them in terms of: (i) the language of the question, (ii) the language of the context, and (iii) the type of linguistic phenomena the dataset aims to probe. Our goal is to answer (a) do models over-fit to a particular dataset? How much does performance drop when generalizing to a new dataset? (b) Which datasets generalize better to which datasets? What properties determine generalization?

We train DocQA and BertQA (we use BERT-base) on six large datasets (for TriviaQA we use TQA-G and TQA-W), taking 75K examples from each dataset to control for size. We also create Multi-75K, which contains 15K examples from each of the five large datasets (using TQA-G only for TriviaQA), resulting in another dataset of 75K examples. We evaluate performance on all datasets that the model was not trained on.
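Building a mixed training set of this kind amounts to sampling an equal number of examples from each source and shuffling them together. The sketch below is a hypothetical helper operating on toy data, not the released infrastructure; with `per_dataset=15000` and the five large datasets it would yield a 75K-example mixture.

```python
import random
from collections import Counter

def build_mixture(datasets, per_dataset, seed=0):
    """Sample `per_dataset` examples from every dataset and shuffle them together."""
    rng = random.Random(seed)
    mixture = []
    for name, examples in datasets.items():
        if len(examples) < per_dataset:
            raise ValueError(f"{name} has fewer than {per_dataset} examples")
        for example in rng.sample(examples, per_dataset):
            # Keep track of where each example came from.
            mixture.append({"source": name, **example})
    rng.shuffle(mixture)
    return mixture

# Toy stand-ins for the five large datasets (20 fake examples each).
names = ["SQuAD", "NewsQA", "SearchQA", "TQA-G", "HotpotQA"]
toy = {n: [{"id": f"{n}-{i}"} for i in range(20)] for n in names}
mixed = build_mixture(toy, per_dataset=3)
counts = Counter(ex["source"] for ex in mixed)
```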

Table 2 shows exact match (EM) performance (does the predicted span exactly match the gold span) on the development set. The row Self corresponds to training and testing on the target itself, and is provided for reference (For DROP, we train on questions where the answer is a span in the context, but evaluate on the entire development set). The top part shows DocQA, while the bottom BertQA.

At a high level we observe three trends. First, models generalize poorly in this zero-shot setup: comparing Self to the best zero-shot number shows a performance reduction of 31.5% on average. This confirms the finding that models over-fit to the particular dataset. Second, BertQA substantially improves generalization compared to DocQA owing to the power of large-scale unsupervised learning: performance improves by 21.2% on average. Last, Multi-75K performs almost as well as the best source dataset, reducing performance by only 3.7% on average. Hence, training on multiple datasets results in robust generalization. We further investigate training on multiple datasets in §4.2 and §5.

DocQA
Source     CQ       CWQ      ComQA    WikiHop  DROP     SQuAD    NewsQA   SearchQA TQA-G    TQA-W    HotpotQA
SQuAD      18.0     10.1     16.1     4.2      2.4      -        23.4     9.5      32.0     20.9     7.6
NewsQA     14.9     8.2      13.5     4.8      3.0      41.9     -        7.7      25.3     19.9     5.3
SearchQA   29.2     16.1     24.6     8.1      2.3      17.4     10.8     -        50.3     28.9     4.5
TQA-G      30.3     17.8     29.4     9.2      3.0      30.2     15.5     38.5     -        -        7.2
TQA-W      24.6     14.5     17.9     8.4      2.9      24.8     15.0     20.5     -        -        6.5
HotpotQA   24.6     14.9     21.2     8.5      7.7      38.3     16.9     13.5     36.8     26.0     -
Multi-75K  32.8     17.9     26.7     7.4      4.3      -        -        -        -        -        -
Self       24.1     24.9     45.2     41.7     15.6     68.0     36.5     51.3     58.9     41.6     22.5

BertQA
Source     CQ       CWQ      ComQA    WikiHop  DROP     SQuAD    NewsQA   SearchQA TQA-G    TQA-W    HotpotQA
SQuAD      23.6     12.0     20.0     4.6      5.5      -        31.8     8.4      37.8     33.4     11.8
NewsQA     24.1     12.4     18.9     7.1      4.4      60.4     -        10.1     37.6     28.4     8.0
SearchQA   30.3     18.5     25.8     12.4     2.8      23.3     12.7     -        53.2     35.4     5.2
TQA-G      35.4     19.7     28.6     6.3      3.6      36.3     18.8     39.2     -        -        8.8
TQA-W      30.3     16.5     23.6     12.6     5.1      35.5     19.4     27.8     -        -        8.7
HotpotQA   27.7     15.5     22.1     10.2     9.1      54.5     25.6     19.6     37.3     34.9     -
Multi-75K  34.0     18.2     30.9     11.7     8.6      -        -        -        -        -        -
Self       30.8     27.1     51.6     52.9     17.9     78.0     46.0     52.2     60.7     50.1     24.2
Table 2: Exact match on the development set for all datasets in a zero-shot training setup (no training on the target dataset). The top of the table shows results for DocQA, while the bottom shows results for BertQA. Rows correspond to the training dataset and columns to the evaluated dataset. Large datasets are on the right side, and small datasets on the left side; see text for details of all rows. Datasets used for training were not evaluated. In Multi-75K these comprise all large datasets, and thus these cases are marked by “-”.

Taking a closer look, the pair SearchQA and TQA-G exhibits the smallest performance drop, since both use trivia questions and web snippets. SQuAD and NewsQA also generalize well (especially with BertQA), probably because they contain questions on a single document, focusing on predicate-argument structure. While HotpotQA and WikiHop both examine multi-hop reasoning over Wikipedia, performance dramatically drops from HotpotQA to WikiHop. This is due to the difference in the language of the questions (WikiHop questions are synthetic). The best generalization to DROP is from HotpotQA, since both require multi-hop reasoning. Performance on DROP is overall low, showing that our models struggle with quantitative reasoning.

For the small datasets, ComQA, CQ, and CWQ, generalization is best with TQA-G, as the contexts in these datasets are web snippets. For CQ, whose training set has 1,300 examples, zero-shot performance is even higher than Self.

Interestingly, BertQA improves performance substantially compared to DocQA on NewsQA, SQuAD, TQA-W and WikiHop, but only moderately on HotpotQA, SearchQA, and TQA-G. This hints that BERT is effective when the context is similar to (or even part of) its training corpus, but degrades over web snippets. This is most evident when comparing TQA-G to TQA-W, as the difference between them is the type of context.

Global structure

To view the global structure of the datasets, we visualize them with the force-directed placement algorithm Fruchterman and Reingold (1991). The input is a set of nodes (datasets), and a set of undirected edges representing springs in a mechanical system pulling nodes towards one another. Edges specify the pulling force, and a physical simulation places the nodes in a final minimal energy state in 2D-space.

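Let $P_{ij}$ be the performance when training BertQA on dataset $D_i$ and evaluating on $D_j$. Let $P_i$ be the performance when training and evaluating on $D_i$. The force between an unordered pair of datasets $(D_i, D_j)$ is computed from these quantities: from both $P_{ij}$ and $P_{ji}$ when we train and evaluate in both directions, and from $P_{ij}$ alone if we train on $D_i$ and evaluate on $D_j$ only.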
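The sketch below illustrates the general recipe of turning cross-dataset scores into spring strengths and running a simplified Fruchterman-Reingold style simulation. It rests on loud assumptions: the exact force formula is not reproduced here, so `edge_force` uses a guessed definition (cross-dataset EM divided by the target's Self EM, averaged over whichever directions are available), and `spring_layout` is a hypothetical, stripped-down layout routine rather than the one used for Figure 1. The numbers are a small subset of the BertQA rows of Table 2.

```python
import math
import random

def edge_force(perf, self_perf, a, b):
    """Assumed force: mean normalized cross-dataset score over available directions."""
    ratios = []
    if (a, b) in perf:  # trained on a, evaluated on b
        ratios.append(perf[(a, b)] / self_perf[b])
    if (b, a) in perf:  # trained on b, evaluated on a
        ratios.append(perf[(b, a)] / self_perf[a])
    return sum(ratios) / len(ratios) if ratios else 0.0

def spring_layout(nodes, weights, iterations=200, seed=0):
    """Simplified force-directed placement in 2D."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # ideal distance between nodes
    temp = 0.1                       # maximum step size, cooled every iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                # Repulsion between every pair minus weighted spring attraction.
                force = k * k / dist - weights.get(frozenset((u, v)), 0.0) * dist * dist / k
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        for n in nodes:
            length = max(math.hypot(disp[n][0], disp[n][1]), 1e-6)
            step = min(length, temp)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temp *= 0.98
    return pos

# Subset of Table 2 (BertQA): Self scores and a few cross-dataset scores.
self_perf = {"SearchQA": 52.2, "TQA-G": 60.7, "WikiHop": 52.9, "HotpotQA": 24.2}
perf = {("SearchQA", "TQA-G"): 53.2, ("TQA-G", "SearchQA"): 39.2,
        ("HotpotQA", "WikiHop"): 10.2}
nodes = sorted(self_perf)
weights = {frozenset((a, b)): edge_force(perf, self_perf, a, b)
           for i, a in enumerate(nodes) for b in nodes[i + 1:]}
pos = spring_layout(nodes, weights)
```

Under this assumed definition the SearchQA/TQA-G pair gets a much stronger spring than HotpotQA/WikiHop, in line with the qualitative observations in the text.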

Figure 1: A 2D-visualization of the similarity between different datasets using the force-directed placement algorithm. We mark datasets that use web snippets as context with triangles, Wikipedia with circles, and Newswire with squares. We color multi-hop reasoning datasets in red, trivia datasets in blue, and factoid RC datasets in green.

Figure 1 shows this visualization, where we observe that datasets cluster naturally according to shape and color. Focusing on the context, datasets with web snippets are clustered (triangles), while datasets that use Wikipedia are also near one another (circles). Considering the question language, TQA-G, SearchQA, and TQA-U are very close (blue triangles), as all contain trivia questions over web snippets. DROP, HotpotQA, NewsQA and SQuAD generate questions with crowd workers, and all are at the top of the figure. WikiHop uses synthetic questions that prevent generalization, and is far from other datasets; however, this gap will be closed during transfer learning (§4.2). DROP is far from all datasets because it requires quantitative reasoning that is missing from other datasets. However, it is relatively close to HotpotQA and WikiHop, which target multi-hop reasoning. DROP is also close to SQuAD, as both have similar contexts and question language, but the linguistic phenomena they target differ.

Source      CQ       CWQ      ComQA    WikiHop  DROP
Multi-37K   30.9     17.7     28.4     12.3     6.3
Multi-75K   34.0     18.2     30.9     11.7     8.6
Multi-150K  35.0     17.6     30.0     12.4     9.1
Multi-250K  35.6     20.2     31.1     11.9     11.0
Multi-300K  37.6     18.8     31.5     13.5     10.4
Multi-375K  36.1     20.7     31.3     13.3     11.3
Table 3: Exact match on the development set of all small datasets, as we increase the number of examples taken from the five large datasets (zero-shot setup).

Does generalization improve with more data?

So far we trained on datasets with 75K examples. To examine generalization as the training set size increases, we evaluate performance as the number of examples from the five large datasets grows. Table 3 shows that generalization improves by 26% on average when increasing the number of examples from 37K to 375K.
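As a quick arithmetic check, one plausible reading of "26% on average" is the mean of the per-dataset relative gains between the Multi-37K and Multi-375K rows of Table 3. That reading is an assumption, but the numbers in the table are consistent with it:

```python
# Table 3 rows: EM on the five small datasets in the zero-shot setup.
multi_37k = [30.9, 17.7, 28.4, 12.3, 6.3]
multi_375k = [36.1, 20.7, 31.3, 13.3, 11.3]

def mean_relative_gain(before, after):
    """Average of per-dataset relative improvements, in percent."""
    gains = [(a - b) / b * 100.0 for b, a in zip(before, after)]
    return sum(gains) / len(gains)

avg_gain = mean_relative_gain(multi_37k, multi_375k)
print(f"{avg_gain:.1f}%")  # 26.3%
```

Most of the average comes from the last column, where EM rises from 6.3 to 11.3 (a relative gain of about 79%).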

4.2 Does pre-training improve results on small datasets?

DocQA
Source     CQ       CWQ      ComQA    WikiHop  DROP     SQuAD    NewsQA   SearchQA TQA-G    TQA-W    HotpotQA
SQuAD      29.7     25.3     37.1     39.2     14.5     -        33.3     39.2     49.2     34.5     17.8
NewsQA     16.9     26.1     34.7     38.1     14.3     59.6     -        41.6     44.2     33.9     16.5
SearchQA   30.8     28.8     41.3     39.0     15.0     57.0     31.4     -        57.5     39.6     19.2
TQA-G      41.5     30.1     42.6     42.0     14.0     57.7     31.8     49.5     -        41.4     19.1
TQA-W      31.3     27.0     38.0     41.4     13.3     57.6     31.7     44.4     50.7     -        17.2
HotpotQA   40.0     27.7     39.5     40.4     14.6     59.8     32.4     46.3     54.6     37.4     -
Multi-75K  43.1     27.6     39.1     38.9     14.5     59.8     33.0     47.5     56.4     40.4     19.2
Self       24.1     24.9     45.2     41.7     15.6     56.5     30.0     35.9     41.2     27.7     13.8

BertQA
Source     CQ       CWQ      ComQA    WikiHop  DROP     SQuAD    NewsQA   SearchQA TQA-G    TQA-W    HotpotQA
SQuAD      36.9     29.0     52.2     48.2     18.6     -        41.2     47.8     55.2     45.4     20.8
NewsQA     36.9     29.4     52.2     48.4     17.8     72.1     -        47.4     55.9     45.2     20.6
SearchQA   40.5     30.0     53.4     50.6     17.6     70.2     40.2     -        57.3     45.5     20.4
TQA-G      40.0     30.6     53.4     49.5     17.6     69.9     41.2     50.0     -        46.2     20.8
TQA-W      39.0     30.3     54.0     50.0     17.3     71.0     39.2     48.4     55.7     -        20.9
HotpotQA   34.4     30.2     53.0     49.3     17.2     71.2     39.5     48.6     56.6     45.6     -
Multi-75K  42.6     30.6     53.3     50.5     17.9     71.5     42.1     48.5     56.6     46.5     20.4
Self       30.8     27.1     51.6     52.9     17.1     70.1     37.9     46.0     54.4     41.9     18.9
Table 4: Exact match on the development set for all datasets with transfer learning. Fine-tuning is done on 15K examples. The top of the table shows results for DocQA, while the bottom shows results for BertQA. Rows are the trained datasets and columns are the evaluated datasets for which fine-tuning was performed. Large datasets are on the right, and small datasets are on the left side.

We now consider transfer learning, assuming access to a small number of examples (15K) from a target dataset. We pre-train a model on a source dataset, and then fine-tune on the target. In all models, pre-training and fine-tuning are identical and performed until no improvement is seen on the development set (early stopping). Our goal is to analyze whether pre-training improves performance compared to training on the target alone. This is particularly interesting with BertQA, as BERT already contains substantial knowledge that might render pre-training unnecessary.

How to choose the dataset to pre-train on?

Table 4 shows exact match (EM) on the development set of all datasets (rows are the trained datasets and columns the evaluated datasets). Pre-training on a source RC dataset and transferring to the target improves performance by 21% on average for DocQA (improving on 8 out of 11 datasets), and by 7% on average for BertQA (improving on 10 out of 11 datasets). Thus, pre-training on a related RC dataset helps even given representations from a model like BertQA.

Second, Multi-75K obtains good performance in almost all setups. Performance of Multi-75K is 3% lower than the best source RC dataset on average for DocQA, and 0.3% lower for BertQA. Hence, one can pre-train a single model on a mixed dataset, rather than choose the best source dataset for every target.

Third, in 4 datasets (ComQA, DROP, HotpotQA, WikiHop) the best source dataset uses web snippets in DocQA, but Wikipedia in BertQA. This strengthens our finding that BertQA performs better given Wikipedia text.

Last, we see dramatic improvement in performance compared to §4.1. This highlights that current models over-fit to the data they are trained on, and small amounts of data from the target distribution can overcome this generalization gap. This is clearest for WikiHop, where synthetic questions preclude generalization, but fine-tuning improves performance from 12.6 EM to 50.5 EM. Thus, low performance was not due to a modeling issue, but rather a mismatch in the question language.

An interesting question is whether performance in the generalization setup is predictive of performance in the transfer setup. Average performance across target datasets in Table 4, when choosing the best source dataset from Table 4, is 39.3 (DocQA) and 43.8 (BertQA). Average performance across datasets in Table 4, when choosing the best source dataset from Table 2, is 38.9 (DocQA) and 43.5 (BertQA). Thus, one can select a dataset to pre-train on based on generalization performance and suffer a minimal hit in accuracy, without fine-tuning on each dataset. However, training on Multi-75K also yields good results without selecting a source dataset at all.

Figure 2: Learning curves for the five large datasets (top is DocQA and bottom is BertQA). The x-axis corresponds to the number of examples from the target dataset, and the y-axis is EM. The orange curve refers to training on the target dataset only, and the blue curve refers to pre-training on 75K examples from the nearest source dataset and fine-tuning on the target dataset. The green curve is training on a fixed number of examples from all 5 large datasets without fine-tuning (MultiQA).

How much target data is needed?

We saw that with 15K training examples from the target dataset, pre-training improves performance. We now ask whether this effect persists given a larger training set. To examine this, we measure (Figure 2) the performance on each of the large datasets when pre-training on its nearest dataset (according to the force measure defined in §4.1) for both DocQA (top) and BertQA (bottom row). The orange curve corresponds to training on the target dataset only, while the blue curve describes pre-training on 75K examples from a source dataset, and then fine-tuning on an increasing number of examples from the target dataset.

In 5 out of 10 curves, pre-training improves performance even given access to all 75K examples from the target dataset. In the other 5, using only the target dataset is better after 30-50K examples. To estimate the savings in annotation costs through pre-training, we measure how many examples are needed, when doing pre-training, to reach 95% of the performance obtained when training on all examples from the target dataset. We find that with pre-training we only need 49% of the examples to reach 95% performance, compared to 86% without pre-training.
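The annotation-savings measurement can be sketched as a lookup on a learning curve: find the smallest training-set size at which a curve reaches 95% of the score obtained by training on all target examples. The helper name and the curve values below are made up for illustration; they are not read off Figure 2.

```python
def examples_needed(curve, reference_score, fraction=0.95):
    """Smallest number of target examples whose score reaches
    fraction * reference_score; None if the curve never gets there.

    curve: list of (num_examples, score) pairs sorted by num_examples.
    """
    target = fraction * reference_score
    for num_examples, score in curve:
        if score >= target:
            return num_examples
    return None

# Hypothetical EM learning curves for one target dataset.
target_only = [(5000, 40.0), (15000, 50.0), (30000, 55.0), (50000, 58.0), (75000, 60.0)]
pretrained = [(5000, 50.0), (15000, 55.0), (30000, 57.5), (50000, 59.0), (75000, 60.5)]

reference = target_only[-1][1]  # score when training on all target examples
n_without = examples_needed(target_only, reference)  # 50000 of 75000 examples
n_with = examples_needed(pretrained, reference)      # 30000 of 75000 examples
```

Dividing each result by the full training-set size gives the kind of percentage reported above (49% with pre-training versus 86% without).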

To further explore pre-training on multiple datasets, we plot a curve (green) for BertQA, where at each point we train on a fixed number of examples from all five large datasets (no fine-tuning). We observe that more data from multiple datasets improves performance in almost all cases. In this case, we reach 95% of the final performance using 30% of the examples only. We will use this observation further in §5 to reach new state-of-the-art performance on several datasets.

4.3 Does context augmentation improve performance?

For TriviaQA we have for all questions, contexts from three different sources – Wikipedia (TQA-W), Bing web snippets (TQA-U), and Google web snippets (TQA-G). Thus, we can explore whether combining the three datasets improves performance. Moreover, because questions are identical across the datasets, we can see the effect on generalization due to the context language only.

Table 5 shows the results. In the first 3 rows we train on 75K examples from each dataset, and in the last we train on the combined 225K examples. First, we observe that context augmentation substantially improves performance (especially for TQA-G and TQA-W). Second, generalization is sensitive to the context type: performance substantially drops when training on one context type and evaluating on another (e.g., from 60.7 to 48.4 for TQA-G; Table 5 gives the corresponding drops for TQA-U and TQA-W).

Dataset      TQA-G    TQA-U    TQA-W
TQA-G        60.7     53.6     43.3
TQA-U        57.2     53.1     39.9
TQA-W        48.4     44.6     50.1
AllContexts  67.7     54.4     54.7
Table 5: EM on the development set, where each row uses the same question with a different context, and AllContexts is a union of the other 3 datasets.

5 MultiQA

           BERT-large Dev.    MultiQA Dev.       MultiQA Test       SOTA¹
Dataset    EM      tok. F1    EM      tok. F1    EM      tok. F1    EM      tok. F1
NewsQA     51.5    66.2       53.9    68.2       52.3    67.4       53.1    66.3
SearchQA   59.2    66.4       60.7    67.1       59.0    65.1       58.8    64.5
TQA-U      56.8    62.6       58.4    64.3       -       -          52.0²   61.7²
CWQ        30.8    -          35.4    -          34.9    -          34.2    -
HotpotQA   27.9    37.7       30.6    40.3       30.7    40.2       37.1²   48.9²
Table 6: Results for datasets where the official evaluation metric is EM and token F1. The CWQ evaluation script provides only the EM metric. We did not find a public evaluation script for the hidden test set of TQA-U.

           BERT-large Dev.        MultiQA Dev.           MultiQA Test           SOTA
Dataset    Prec.  Rec.   F1       Prec.  Rec.   F1       Prec.  Rec.   F1       Prec.  Rec.   F1
ComQA      45.8   42.0   42.9     51.9   47.2   48.2     44.4   40.0   40.8     21.2   38.4   22.4
CQ         -      -      32.8     -      -      46.6     -      -      42.4     -      -      39.7²
Table 7: Results for datasets where the evaluation metric is average recall/precision/F1. CQ evaluates with F1 only.

We now present MultiQA, a BERT-based model, trained on multiple RC datasets, that obtains new state-of-the-art results on several datasets.

Does training on multiple datasets improve BertQA?

MultiQA trains BertQA on the Multi-375K dataset presented above, which contains 75K examples from each of the 5 large datasets, but uses BERT-large rather than BERT-base. For small target datasets, we fine-tune the model on these datasets, since they were not observed when training on Multi-375K. For large datasets, we do not fine-tune. We found that fine-tuning on datasets that are already part of Multi-375K does not improve performance (we assume this is due to the high capacity of BERT-large), and thus we use one model for all the large datasets. We train on Multi-375K, and thus our model does not use all examples in the original datasets, which contain more than 75K examples.

We use the official evaluation script for any dataset that provides one, and the SQuAD evaluation script for all other datasets. Table 6 shows results for datasets where the evaluation metric is EM or token F1 (harmonic mean of precision and recall over the tokens in the predicted vs. gold span). Table 7 shows results for datasets where the evaluation metric is average recall/precision/F1 between the list of predicted answers and the list of gold answers.
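For reference, the two span-level metrics can be written down in a few lines. This is a simplified sketch with hypothetical function names: it works on already-tokenized answers and omits the answer normalization (lowercasing, stripping punctuation and articles) that the SQuAD evaluation script applies before comparing.

```python
from collections import Counter

def exact_match(pred_tokens, gold_tokens):
    """1.0 if the predicted span equals the gold span token for token, else 0.0."""
    return float(pred_tokens == gold_tokens)

def token_f1(pred_tokens, gold_tokens):
    """Harmonic mean of token-level precision and recall."""
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

pred = ["william", "shakespeare"]
gold = ["shakespeare"]
em = exact_match(pred, gold)  # 0.0: the spans differ
f1 = token_f1(pred, gold)     # precision 0.5, recall 1.0, F1 = 2/3
```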

We compare MultiQA to BERT-large, a model that does not train on Multi-375K, but only fine-tunes BERT-large on the target dataset. We also show the state-of-the-art (SOTA) result for all datasets for reference.¹

¹ State-of-the-art results were found in Tay et al. (2018) for NewsQA, in Lin et al. (2018) for SearchQA, in Das et al. (2019) for TQA-U, in Talmor and Berant (2018b) for CWQ, in Ding et al. (2019) for HotpotQA, in Abujabal et al. (2018) for ComQA, and in Bao et al. (2016) for CQ.

MultiQA improves state-of-the-art performance on five datasets, although it does not even train on all examples in the large datasets.² MultiQA improves performance compared to BERT-large in all cases. This improvement is especially noticeable in small datasets such as ComQA, CWQ, and CQ. Moreover, on NewsQA, MultiQA surpasses human performance as measured by the creators of the dataset (46.5 EM, 69.4 F1; Trischler et al. (2017)), improving upon previous state-of-the-art by a large margin.

² We compare only to models for which we found a publication. For TQA-U, Figure 4 in Clark and Gardner (2018) shows roughly 67 F1 on the development set, but no exact number. For CQ we compare against SOTA achieved on the web snippets context. On the Freebase context SOTA is 42.8 F1 Luo et al. (2018).

To conclude, MultiQA is able to improve state-of-the-art performance on multiple datasets. Our results suggest that in many NLU tasks the size of the dataset is the main bottleneck rather than the model itself.

Does training on multiple datasets improve resiliency against adversarial attacks?

Finally, we evaluated MultiQA on the adversarial SQuAD Jia and Liang (2017), where a misleading sentence is appended to each context (AddSent variant). MultiQA obtained 66.7 EM and 73.1 F1, outperforming BERT-large (60.4 EM, 66.3 F1) by a significant margin, and also substantially improving state-of-the-art results (56.0 EM, 61.3 F1, Hu et al. (2018) and 52.1 EM, 62.7 F1, Wang et al. (2018)).

6 Related Work

Prior work has shown that RC performance can be improved by training on a large dataset and transferring to a smaller one, but at a small scale Min et al. (2017); Chung et al. (2018). Sun et al. (2018) have recently shown this in a larger experiment for multi-choice questions, where they first fine-tuned BERT on RACE Lai et al. (2017) and then fine-tuned on several smaller datasets.

Interest in learning general-purpose representations for natural language through unsupervised, multi-task and transfer learning has grown rapidly of late Peters et al. (2018); Radford et al. (2018); McCann et al. (2018); Chronopoulou et al. (2019); Phang et al. (2018); Wang et al. (2019). In parallel to our work, studies that focus on generalization have appeared on publication servers, empirically studying generalization to multiple tasks Yogatama et al. (2019); Liu et al. (2019). Our work is part of this research thread on generalization in natural language understanding, focusing on reading comprehension, which we view as an important and broad language understanding task.

7 Conclusions

In this work we performed a thorough empirical investigation of generalization and transfer over 10 RC datasets. We characterized the factors affecting generalization and obtained several state-of-the-art results by training on 375K examples from 5 RC datasets. We open source our infrastructure for easily performing experiments on multiple RC datasets, for the benefit of the community.

We highlight several practical take-aways:

  • Pre-training on multiple source RC datasets consistently improves performance on a target RC dataset, even in the presence of BERT representations. It also leads to a substantial reduction in the number of training examples needed to reach a fixed level of performance.

  • Training the high-capacity BERT-large representations over multiple RC datasets leads to good performance on all of the trained datasets without having to fine-tune on each dataset separately.

  • BERT representations improve generalization, but their effect is moderate when the source of the context is web snippets compared to Wikipedia and newswire.

  • Performance over an RC dataset can be improved by retrieving web snippets for all questions and adding them as examples (context augmentation).


We thank the anonymous reviewers for their constructive feedback. This work was completed in partial fulfillment for the PhD degree of Alon Talmor. This research was partially supported by The Israel Science Foundation grant 942/16, The Blavatnik Computer Science Research Fund and The Yandex Initiative for Machine Learning.