
UnifiedQA: Crossing Format Boundaries With a Single QA System

by   Daniel Khashabi, et al.

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.



1 Introduction

Question answering is a common tool for assessing how well computers can understand language and reason with it. To this end, the NLP community has introduced several distinct QA formats, with four popular formats illustrated in Figure 1. These formats differ not only in how the question is presented but also in some implicit assumptions. For instance, some formats assume that the expected answer is either “yes” or “no”, or that there is always a unique answer span in the associated paragraph (as opposed to multiple or no spans). These differences have motivated their study in silos, often encoding QA format and assumptions into the model architecture itself. Efforts to exploit multiple datasets remain largely restricted to a single format. For example, Clark et al. (2019c) limit consideration to multiple-choice datasets, while Talmor and Berant (2019) focus their generalization study on extractive span prediction models. To the best of our knowledge, no single QA system targets, not to mention excels at, all of these formats.

Figure 1: Four formats (color-coded throughout the paper) commonly used for posing questions and answering them: Extractive (EX), Abstractive (AB), Multiple-Choice (MC), and Yes/No (YN). Sample dataset names are shown in square brackets. We study generalization and transfer across these formats.
Figure 2: Properties of various QA datasets included in this study: 4 extractive (EX), 3 abstractive (AB), 7 multiple-choice (MC), and 3 yes/no (YN). ‘idk’ denotes ‘I don’t know’ or unanswerable questions. Regents denotes both 4th and 8th grade datasets. BoolQ represents both the original dataset and its contrast-sets extension BoolQ-CS; similarly for ROPES, Quoref, and DROP.

This raises the question: Can QA models learn linguistic reasoning abilities that generalize across formats? Our intuition is simple: while question format and relevant knowledge may vary across QA datasets, the underlying linguistic understanding and reasoning abilities are largely common. A multiple-choice model may, therefore, benefit from training on an extractive answers dataset. Building upon this intuition, we present a single pre-trained QA system, named UnifiedQA, that exploits information across 4 different QA formats to achieve surprisingly strong performance across 17 different datasets listed in Figure 2.

Our work is enabled by recent progress in text-to-text neural architectures Raffel et al. (2019); Lewis et al. (2019); Radford et al. (2019). This paradigm conceptually unifies many NLP models that formerly had task-specific designs. While there have been hopes of—and a few attempts at—using this paradigm to achieve strong generalization and transfer across tasks, success so far has been limited. Most approaches fine-tuned a different set of parameters for each end task Raffel et al. (2019); Radford et al. (2019), and when they have attempted to make a single model for different NLP tasks Raffel et al. (2019), they have underperformed compared to the standard pre-training plus fine-tuning setup, with a need for explicit task-specific prefixes.

In contrast, by narrowing the scope to tasks that stay within the boundaries of QA, we are able to demonstrate that the text-to-text paradigm can, in fact, be quite powerful for multi-task learning across QA formats. We find that out-of-format training can lead to a single pre-trained QA model that can be applied as-is to different QA tasks, takes in natural text inputs without explicitly specifying a task-specific prefix, generalizes better to other unseen datasets, and with further fine-tuning can achieve state-of-the-art results on many QA tasks.


This work advocates for a unified view of different QA formats, and for building format-agnostic QA systems. To support this view, we present UnifiedQA, a single pre-trained QA system that works well on and generalizes to datasets with different formats (§6.2), while performing on par with state-of-the-art dedicated systems tailored to each dataset (§6.1). Additionally, fine-tuning UnifiedQA into specialized systems sets a new state of the art for 6 datasets (§6.3), establishing it as a powerful starting point for QA research. Our findings demonstrate that crossing QA format boundaries is not only qualitatively desirable but also quantitatively beneficial.

2 Related Work

Several QA efforts have studied generalization across datasets of a single format. For instance, in MultiQA, Talmor and Berant (2019) study generalization and transfer, but only across extractive span selection datasets. Further, while they show strong leave-one-out style results, they find a single system performs substantially worse than one tuned to each dataset. In ORB, Dua et al. (2019a) propose a multi-dataset evaluation benchmark spanning extractive and abstractive formats. However, that study is limited to an evaluation of systems, falling short of addressing how to build such generalizable models. Similarly, the MRQA shared task Fisch et al. (2019) focuses on span-prediction datasets. Unlike all these efforts, our goal is to investigate transfer and generalization across different QA formats, as well as to build a single system that does this well.

Exploiting commonality across machine learning tasks has a rich history studied under transfer learning Caruana (1997); Clark et al. (2019b). McCann et al. (2018) and Keskar et al. (2019) study transfer among various NLP tasks by casting them into a single QA format, an elegant transfer learning approach but orthogonal to the goal of this work. As noted earlier, Raffel et al. (2019) investigate the transfer between several diverse NLP tasks (machine translation, summarization, etc). Their key contribution is a text-to-text framework, and a powerful model called T5, that makes it easier to mix multiple tasks by encoding both inputs and outputs as text. They rely on textual prefixes to explicitly define the task corresponding to each input instance. While we build upon their framework, we narrow our focus to variations of QA. This allows us to achieve strong results while avoiding reliance on any format-specific prefixes. Our models learn to infer the format of each input question based on its content (e.g., whether the phrasing of the question demands a yes/no answer). Moreover, we are able to demonstrate generalization across QA tasks, which prior work failed to achieve presumably due to its focus on too broad a set of NLP tasks.

3 UnifiedQA: Multi-format Training

Suppose we would like to train a unified QA model that can operate over $k$ question-answering formats $F_1, F_2, \ldots, F_k$. For each format $F_i$, suppose we have $\ell_i$ datasets $D^i_1, \ldots, D^i_{\ell_i}$, where $D^i_j = (T^i_j, E^i_j)$ includes both training and evaluation examples. In some cases, the training set $T^i_j$ may be empty or we may want to ignore it in order to treat $D^i_j$ as an ‘unseen’, evaluation-only dataset and assess a model’s generalization to it.

We use the text-to-text paradigm to convert each training question $q$ in format $F_i$ into a plain-text input representation $enc_i(q)$. This conversion uses a natural encoding process that will be described shortly (§3.1) for four common QA formats, and is easily extensible to other formats as well. We follow a simple approach of creating a mixed training pool consisting of all available training instances:

$$\tilde{T} = \bigcup_{i=1}^{k} \bigcup_{j=1}^{\ell_i} \left\{ enc_i(q) \mid q \in T^i_j \right\}$$

Training batches are drawn from this pooled data, $\tilde{T}$, by including each $q \in T^i_j$ with a probability proportional to $1/|T^i_j|$. Each batch thus, on average, contains the same number of instances from each training set, regardless of its size. As we will see in the experiments section, our multi-format mixing approach works surprisingly well. It clearly highlights the value of training on out-of-format data and confirms our intuition that there are strong ties across QA formats in terms of the underlying reasoning abilities.

Footnote 2: A more sophisticated teaching curriculum Sachan and Xing (2016) or approaches such as model distillation and teacher annealing Clark et al. (2019b) are likely to further improve the performance of the resulting unified model, bolstering the strength of our advocacy for a unified view of all QA formats. We leave their exploration to future work.
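The mixing scheme above can be illustrated with a minimal sketch. It assumes that each instance is weighted by the inverse of its training set's size (consistent with the statement that every training set contributes equally to a batch on average) and that batches are sampled with replacement. The function names and the toy data are hypothetical and are not taken from the authors' code.

```python
import random
from typing import Dict, List, Tuple


def build_pool(train_sets: Dict[str, List[str]]) -> Tuple[List[str], List[float]]:
    """Pool all encoded training instances and attach a sampling weight.

    Each instance from a training set of size n gets weight 1/n, so the
    total weight of every non-empty training set is the same (1.0).
    """
    pool: List[str] = []
    weights: List[float] = []
    for examples in train_sets.values():
        for example in examples:  # empty sets contribute nothing
            pool.append(example)
            weights.append(1.0 / len(examples))
    return pool, weights


def draw_batch(pool: List[str], weights: List[float],
               batch_size: int, rng: random.Random) -> List[str]:
    """Sample a batch with replacement (an assumption) using the weights."""
    return rng.choices(pool, weights=weights, k=batch_size)


# Toy data: a large set and a small set end up with equal total weight.
toy_sets = {"big": ["b"] * 8, "small": ["s"] * 2}
toy_pool, toy_weights = build_pool(toy_sets)
toy_batch = draw_batch(toy_pool, toy_weights, batch_size=4, rng=random.Random(0))
```

With these weights, a set of 8 instances and a set of 2 instances are each expected to supply half of every batch.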

Our unified question-answering system is based on the T5 framework Raffel et al. (2019), a recent text-to-text transformer model. For all experiments, we use token limits of 512 and 100 for input and output sequences, respectively. All models are trained for additional steps on top of the pretraining of the T5 model.

We first define a unifying encoding of the instances across various formats (§3.1). We then introduce UnifiedQA (§3.2), a QA system trained on datasets in multiple formats, which yields new state-of-the-art results on 6 datasets and generalizes to unseen datasets.

3.1 Text-to-Text Encoding

We convert each of our target datasets into a text-in/text-out format Raffel et al. (2019); Lewis et al. (2019); Radford et al. (2019). The question always comes first, followed by some additional information (context paragraph or candidate answers, or both). We use “\n” separators between different parts of the input. This ensures having a human-like encoding while not making it overly-specific to a certain format.

The unified model explored in this work incorporates the following four common question-answering formats:

Extractive (EX) questions include a context (typically a paragraph) and require models to extract the answer as a substring from the context. In some datasets, ‘unanswerable’ can be the correct response.

Abstractive (AB) questions require models to produce answers that are often not mere substrings of the provided context paragraph.

Multiple-choice (MC) questions come with a set of candidate answers, of which generally exactly one is correct. In some cases, they also include a context paragraph.

Yes/No (YN) questions expect a ‘yes’ or ‘no’ answer as the response, and may include a context paragraph.

Further details of these formats and specific datasets within them are deferred to Section 4.1.

Table 1 provides examples of the natural input and output encoding for each of these formats. Importantly, both input and output representations are raw text. There is no explicit information regarding a question being an MC question or having exactly four candidate answers. The expectation is that a model will learn to infer such notions from the raw text training data, just like humans are able to do.

Specifically, MC questions without any context paragraph are encoded as question \n (A) c1 (B) c2 ... where c1, c2, ... are the candidate answers (see the example from the ARC dataset in Table 1). If the question includes a context paragraph, it is appended after the candidate answers: question \n (A) c1 (B) c2 ... \n paragraph, as shown in the example from the MCTest dataset in Table 1. Questions in the other three formats (EX, AB, and YN) are encoded simply as question \n paragraph.
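The encoding rules above can be written down as a short sketch. The function name `encode_qa` is hypothetical. The separator is assumed here to be the two-character text `\n` surrounded by spaces; whether the original system uses that literal text or a real newline is not stated in this section, so it is exposed as a parameter.

```python
import string
from typing import Optional, Sequence


def encode_qa(question: str,
              candidates: Optional[Sequence[str]] = None,
              paragraph: Optional[str] = None,
              sep: str = " \\n ") -> str:
    """Encode a QA instance as plain text with no task or format prefix.

    Order of parts: question, then lettered candidates (MC only),
    then the context paragraph (if any).
    """
    parts = [question]
    if candidates:
        # Label candidates (A), (B), (C), ... and join them on one line.
        labeled = " ".join(
            f"({string.ascii_uppercase[i]}) {c}" for i, c in enumerate(candidates)
        )
        parts.append(labeled)
    if paragraph:
        parts.append(paragraph)
    return sep.join(parts)


# MC without context, MC with context, and EX/AB/YN style inputs.
mc_plain = encode_qa("Which is hot?", candidates=["ice", "fire"])
mc_context = encode_qa("Who ran?", candidates=["Tom", "Ann"], paragraph="Ann ran home.")
yes_no = encode_qa("Is the sky blue?", paragraph="The sky is blue today.")
```

Because the same function handles all four formats, nothing in the encoded string names the format; the presence of lettered candidates or a paragraph is the only signal available to the model.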

To re-emphasize, unlike prior work Raffel et al. (2019), we do not specify any task-, dataset-, or format-specific prefixes in the input representation. Whether the answer should be extracted or abstracted, and whether from the provided context paragraph or candidate answers (or the fact that these even are candidate answers) is expected to be inferred by the system.

Table 1: Example text-to-text encoding of instances.

3.2 UnifiedQA: The Pre-Trained Model

The specific pre-trained QA model we provide and use in all our experiments is trained on representative datasets for each of the 4 formats discussed earlier. We empirically chose the following 9 datasets for training UnifiedQA, based on their effectiveness in our pilot study (details deferred to Section 5) assessing which datasets are most valuable for out-of-format training:

  • EX: SQuAD 1.1, SQuAD 2.0

  • AB: NarrativeQA

  • MC: RACE, Regents, ARC, OBQA, MCTest

  • YN: BoolQ

One can use other combinations of formats and datasets to create variants of our UnifiedQA model, or extend it as future datasets become available or new formats are introduced.

Unless otherwise noted, we use the largest available T5 model (11B parameters) as the starting point for training UnifiedQA. Similar to pre-trained language models, the resulting pre-trained QA model can be used as a starting point for fine-tuning on other QA datasets.

4 Formats and Datasets

4.1 Datasets

We selected 17 existing datasets that target various formats, as well as various complex linguistic phenomena. Figure 2 shows different properties of our datasets (whether a dataset comes with a paragraph, whether the paragraph explicitly contains the answer, whether there are candidate answers as part of the input, etc.). Most importantly, they are grouped into several formats/categories described below. Table 2 gives summary statistics of these datasets.

Extractive QA (EX).

All the datasets in this format require models to extract the answer to a given question as a substring from a context paragraph. SQuAD 1.1 Rajpurkar et al. (2016) contains questions about Wikipedia paragraphs. A later version of this dataset, SQuAD 2.0 Rajpurkar et al. (2018), includes unanswerable questions, which empirically makes the task much harder. For our evaluation, we use the development sets of SQuAD 1.1 and SQuAD 2.0. The NewsQA dataset Trischler et al. (2017) focuses on paraphrased questions that require predicate-argument structure understanding, collected from CNN/DailyMail news articles. Quoref Dasigi et al. (2019) contains questions that require coreference resolution in Wikipedia articles and can even have disjoint spans as answers. ROPES Lin et al. (2019) centers around situation understanding, where the model must understand the causes and effects implicit in the given situation.

Abstractive QA (AB).

All the datasets in this format require models to produce answers that are often not mere substrings of the given context paragraph. NarrativeQA Kociský et al. (2018) focuses on understanding various events that happen in a given movie plot, based on summaries of their movie adaptations from various web resources. Many of the answers do not have high overlap with the context. DROP Dua et al. (2019b) contains questions that involve rudimentary mathematical skills (such as counting, addition, subtraction, maximum, minimum, etc.) and that query multiple parts of the paragraph. The answer can be either a number or a date that can be inferred from the paragraph, or several spans from the context paragraph.

| Dataset | Train set size | Eval. set size | 95% CI (as %) | Input length | Output length |
|---|---|---|---|---|---|
| SQuAD 1.1 | 87k | 10k | 1.0 | 136.2 | 3.0 |
| SQuAD 2.0 | 130k | 11k | 0.9 | 139.9 | 2.6 |
| NewsQA | 76k | 4k | 1.6 | 606.6 | 4.0 |
| Quoref | 22k | 2k | 2.2 | 352.7 | 1.7 |
| Quoref-CS | - | 700 | 3.7 | 324.1 | 2.2 |
| ROPES | 10k | 1.4k | 2.6 | 169.1 | 1.4 |
| ROPES-CS | - | 974 | 3.1 | 182.7 | 1.3 |
| NQA | 65k | 21k | 0.7 | 563.6 | 6.2 |
| DROP | 77k | 9k | 1.0 | 189.1 | 1.6 |
| DROP-CS | - | 947 | 3.2 | 206.0 | 2.1 |
| RACE | 87k | 4k | 1.6 | 317.9 | 6.9 |
| OBQA | 4k | | 4.4 | 28.7 | 3.6 |
| MCTest | 1.4k | 0.3k | 5.8 | 245.4 | 4.0 |
| ARC (easy) | 2k | 2k | 2.2 | 39.4 | 3.7 |
| ARC (chal.) | 1k | 1k | 3.1 | 47.4 | 5.0 |
| Regents | 1k | 1k | 3.1 | 51.0 | 4.9 |
| CQA | 9.7k | 1.2k | 2.8 | 26.8 | 1.5 |
| BoolQ | 9k | 3k | 1.8 | 105.1 | 1.0 |
| BoolQ-CS | - | 461 | 4.6 | 108.9 | 1.0 |
| NP-BoolQ | 10k | 3k | 1.8 | 106.2 | 1.0 |
| MultiRC | - | 312 | 5.7 | 293.3 | 1.0 |

Table 2: Dataset statistics. CQA, OBQA, and NQA refer to CommonsenseQA, OpenBookQA, and NarrativeQA, respectively. A ‘-’ in the train set column marks evaluation-only datasets. The CI column shows the 95% confidence interval for the evaluation set as a percentage, around a mean score of 50%. Input and output representation lengths are measured in the number of tokens and averaged across the dataset.
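The CI column can be approximately reproduced with a short calculation. The sketch below assumes that the reported value is the half-width of a normal-approximation binomial interval at a mean of 50%; the paper's text does not state the exact formula, and the function name is hypothetical.

```python
import math


def ci_half_width_percent(n_eval: int, mean: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a 95% normal-approximation interval, in percentage points.

    n_eval: number of evaluation examples.
    mean:   assumed accuracy (0.5 gives the widest interval).
    z:      1.96 corresponds to a 95% two-sided interval.
    """
    return 100.0 * z * math.sqrt(mean * (1.0 - mean) / n_eval)


# An evaluation set of 700 examples (the Quoref-CS row) gives about 3.7.
quoref_cs_ci = round(ci_half_width_percent(700), 1)
```

Under this assumption, 700 evaluation examples give roughly 3.7 and 974 give roughly 3.1, matching the Quoref-CS and ROPES-CS rows. A few rows differ slightly, so the original computation may not be identical.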

Multiple-choice QA (MC).

All the datasets in this format contain questions that come with candidate answers. MCTest Richardson et al. (2013) contains questions about simple, fictional stories. RACE Lai et al. (2017) is a challenging set of English comprehension multiple-choice exams given in Chinese middle and high schools. OpenBookQA Mihaylov et al. (2018), ARC Clark et al. (2018), Regents Science Exams Clark et al. (2016), and QASC Khot et al. (2019) are different MC tests focusing on elementary/high school-style science exams. A slightly different dataset within this format is CommonsenseQA Talmor et al. (2019), which is geared towards activity/concept commonsense questions. Other than MCTest and RACE, the datasets do not come with accompanying paragraphs. On such datasets, a retrieval system is occasionally used to supplement each question with a relevant retrieved context paragraph. For most of this work, we keep the questions as is with no additional retrieval (unless otherwise mentioned), except in §6.3 where we use IR to get numbers comparable to earlier work. Another source of variability among these datasets is their number of candidate answers. While many datasets have four candidates (see Figure 2), others have more. Later, in §6.2, we will see that our approach generalizes to datasets with a different number of candidates, even if that number was not seen during training.

Yes/No QA (YN).

All the datasets in this format contain questions that can be answered with ‘yes’ or ‘no’. One can think of these as multiple-choice questions with 2 candidates; however, they’re usually treated differently. The datasets we use are BoolQ Clark et al. (2019a); a version of this dataset with natural perturbations, NP-BoolQ Khashabi et al. (2020); and the subset of MultiRC Khashabi et al. (2018) that has binary (yes/no) answers.


Additionally, we use contrast-sets Gardner et al. (2020) corresponding to several of our datasets (indicated by “CS” tag): BoolQ-CS, ROPES-CS, Quoref-CS, DROP-CS. These evaluation sets are expert-generated perturbations that deviate from the patterns common in the original dataset.

Table 3: Pilot study showing that out-of-format training can help improve performance. Each table compares training on just the anchor dataset (e.g., BoolQ in the top-left table) with training also on an out-of-format dataset denoted ‘X’. Evaluation is on the anchor dataset as well as unseen datasets of that format. The last row identifies the out-of-format dataset that helped most on each evaluation dataset. All results are based on the “small” size T5 model. Color denotes QA format (see Table 2).

4.2 Evaluation Metrics for Textual Output

We evaluate each dataset using the metric most often used for it in prior work. Specifically, for the EX format, we use the F1 score for the extracted span relative to the gold label. For the AB format, we use the ROUGE-L metric Lin et al. (2006); Min et al. (2019); Nishida et al. (2019). For the MC format, we match the generated text with the closest answer candidate (by token overlap) and measure how often this is the correct answer. For the YN format, we follow Clark et al. (2019a) to measure if the generated output matches the correct ‘yes’ or ‘no’ label. In rare cases where the output is longer than one word (e.g., ‘yes it is’), we check if it contains the correct label but not the incorrect one.
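Three of these scoring rules are simple enough to sketch directly. The code below is an illustration under stated assumptions rather than the authors' evaluation script: tokenization is lowercased word matching, "token overlap" is taken to be the number of shared distinct tokens, ties go to the first candidate, and the F1 is a simplified token-level F1 without the answer normalization some benchmarks apply. ROUGE-L is omitted. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and the gold span (EX)."""
    pred, ref = tokenize(prediction), tokenize(gold)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def closest_candidate(generated: str, candidates: Sequence[str]) -> int:
    """Index of the candidate sharing the most tokens with the output (MC)."""
    gen = set(tokenize(generated))
    overlaps = [len(gen & set(tokenize(c))) for c in candidates]
    return overlaps.index(max(overlaps))


def yes_no_correct(generated: str, gold: str) -> bool:
    """True if the output contains the gold label but not the other one (YN)."""
    words = set(tokenize(generated))
    wrong = "no" if gold == "yes" else "yes"
    return gold in words and wrong not in words
```

For instance, `yes_no_correct("yes it is", "yes")` is true, while an output containing both labels is counted as incorrect.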

5 Pilot Study: Can Out-of-Format Training Help?

We first answer the question: Is the broad idea of benefiting from out-of-format training even viable? For instance, is our intuition correct that an MC dataset can, in practice, benefit from training on an EX dataset? Before discussing our main experimental results, we briefly report on a pilot study that assesses the following basic question: Given a training set $T^i_1$ (the anchor dataset) of QA format $F_i$, is there an out-of-format training set $T^j_1$ of format $F_j$ such that training jointly on $T^i_1 \cup T^j_1$ improves performance relative to training only on $T^i_1$? To this end, we evaluate both on the matching evaluation set $E^i_1$ as well as on ‘unseen’ data $E^i_2, E^i_3, \ldots$ of the same format.

The results are summarized in Table 3. The two rows in each individual table correspond to training on $T^i_1$ (the anchor dataset) and on $T^i_1 \cup X$, where $X$ is an out-of-format dataset corresponding to $T^j_1$ above. The columns represent various evaluation sets of format $F_i$. For each column, ‘$X = \ldots$’ at the very bottom indicates the out-of-format dataset $X$ that was the most helpful in improving performance on the evaluation set in that column.

Footnote 3: Appendix A.3 reports extended results, including the performance with various choices of $X$.

Consider, for example, the case of the anchor set being BoolQ and the evaluation set being NP-BoolQ, both of format YN. Here, including out-of-format training data SQuAD2 boosts performance from 51% to as much as 59%. The gain in other cases is often not this extreme. Nevertheless, across all anchor and evaluation datasets, we generally observe that there is at least one out-of-format training set whose inclusion improves performance.

This pilot study thus provides a proof of concept that out-of-format training can indeed help a QA model in nearly every case. Of course, this study only shows the existence of such an out-of-format dataset, rather than providing a single unified model. Nevertheless, it helps identify representative training sets from each format that were most helpful. As hinted earlier, we used this empirical data to guide which training sets to include when building UnifiedQA in Section 3.2.

6 Experimental Results

We now discuss our main experimental results, evaluating our proposed UnifiedQA system on seen (used for training the system) and unseen datasets.

Table 4: Generalization to unseen datasets: Multi-format training (UnifiedQA) often outperforms models trained the same way but solely on other in-format datasets (e.g., UnifiedQA[EX], which is trained on all extractive training sets of UnifiedQA). When averaged across all evaluation datasets (last column), UnifiedQA shows strong generalization performance across all formats. Notably, the “Previous best” models (last row) were trained on the target dataset’s training data, but are even then outperformed by UnifiedQA (which has never seen these datasets during training) on the YN tasks.

6.1 UnifiedQA vs. 9 Dedicated Models

Is UnifiedQA, a single pre-trained multi-format QA system, as good as dedicated systems trained for individual datasets? We emphasize that the answer to this question is not as simple as it may seem, since earlier works have observed that a system addressing multiple tasks often underperforms a focused system Raffel et al. (2019).

Figure 3: UnifiedQA is on-par with, and often outperforms, 11 different equally-sized T5-based systems tailored to individual datasets. The figure contains separate models for each of the two subsets of the ARC and Regents datasets.

Figure 3 summarizes the results of the relevant experiment. As can be observed from the figure, UnifiedQA performs almost as well as the best single-dataset experts. In some cases UnifiedQA performs even better than the single-dataset experts (e.g., on OBQA or NQA). On average (last column), UnifiedQA does much better than the dataset/format-specific systems. In conclusion, UnifiedQA offers flexibility across multiple QA formats while compromising almost nothing compared to dataset-specific experts.

6.2 Generalization to Unseen Datasets

The question we want to explore here is whether UnifiedQA generalizes well to other unseen datasets. Table 4 summarizes the results of experiments during which we evaluate various models on datasets that are not used to train them.

The first few rows of the table show T5 models trained for individual datasets, followed by UnifiedQA. For completeness, we include the highest previous scores for each dataset; one must be careful when reading these numbers, as the best previous numbers follow the fully supervised protocol (for NewsQA Zhang et al. (2020), Quoref Dasigi et al. (2019), DROP Dua et al. (2019b), ROPES Lin et al. (2019), QASC Khot et al. (2019), CommonsenseQA Zhu et al. (2020) and x-CS datasets Gardner et al. (2020)).

The key observations are: (1) On average (last column), UnifiedQA shows much stronger generalization across a wide range of datasets. (2) On 5 (out of 12) datasets, UnifiedQA generalizes better than any single-dataset expert. For example, while the system is trained on multiple-choice questions with 4 candidate answers, it works quite well on datasets with more than 4 candidate answers (QASC and CommonsenseQA have 8 and 5 candidate answers per question, respectively). (3) Single-dataset experts are better at generalization only when the source and target datasets are very similar (for instance, SQuAD and Quoref).

Table 5: Simply fine-tuning UnifiedQA (last row) results in new state-of-the-art performance on 6 datasets. Further, it consistently improves upon fine-tuned T5 (2nd last row) by a margin ranging from 1% for CommonsenseQA (CQA) to as much as 13% for ARC-challenge. ‘(w/ IR)’ denotes relevant information is retrieved and appended as context sentences in the input encoding. Datasets marked with * are used in UnifiedQA’s original training.

6.3 State-of-the-art via Simple Fine-tuning

Fine-tuning of pre-trained language models has become the standard paradigm for building dataset-specific state-of-the-art systems Devlin et al. (2019); Liu et al. (2019). The question we address here is whether there is value in using UnifiedQA as a starting point for fine-tuning, as opposed to a vanilla language model that has not seen other QA datasets before.

To address this question, we fine-tune both UnifiedQA and T5 on several datasets. Table 5 summarizes the results of the experiments. The last two rows of the table show the performance of UnifiedQA and T5, both fine-tuned for the target task. The fine-tuning process involves selection of the best checkpoint on the dev set and evaluation on the test set.

Table 6: The results of a leave-one-out ablation. The first row indicates the performance of UnifiedQA on each dataset it was trained on. The rest of the rows exclude one dataset at a time. The rows are sorted based on the last column: the dataset with the biggest contribution appears first. The red highlights indicate the top 3 performance drops in each column.

The columns indicate the evaluation on the test set corresponding to the data that was used for training. For several multiple-choice datasets that do not come with evidence paragraphs, we include two variants: one that uses them as is, and another that uses paragraphs fetched via an Information Retrieval (IR) system as additional evidence, indicated with “w/ IR” tags. We use the same IR sentences as used by the baselines on these datasets: the Aristo corpus for the ARC and OBQA datasets Clark et al. (2019c), and 2-step IR for QASC Khot et al. (2019).

Additionally, we show the best published scores on each dataset: ALBERT Lan et al. (2019) (on RACE), RoBERTa Clark et al. (2019c) (on OBQA and ARC), KF+SIR Banerjee and Baral (2020) (on OBQA and QASC), FreeLB+RoBERTa Zhu et al. (2020) (on ARC-easy and CommonsenseQA).

As can be seen, fine-tuning on UnifiedQA consistently dominates fine-tuning on T5, as well as the best previous scores on each dataset. Intuitively, since UnifiedQA has seen different formats, it should be positioned to achieve higher scores after a bit of fine-tuning, compared to fine-tuning on a vanilla T5. This could be especially effective when a user has limited training data for a target QA task.

6.4 Ablation: Training Set Contributions

In this experiment we would like to better understand the contribution of each dataset to UnifiedQA through a leave-one-out experiment.

We take the system from §3.2 and evaluate the model when individual datasets are dropped from the union. The result of this experiment is summarized in Table 6, which compares the performance of UnifiedQA trained on all the default datasets (the first row) with ablated versions that exclude one dataset at a time. The rows are sorted based on the last column: the dataset with the biggest contribution appears first.
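The leave-one-out procedure can be outlined as follows. This is a schematic sketch only: `train_and_eval` is a hypothetical callable standing in for a full training and evaluation run that returns an average score, and the toy scoring function and its weights are invented purely to make the example executable.

```python
from typing import Callable, List, Sequence, Tuple


def leave_one_out(datasets: Sequence[str],
                  train_and_eval: Callable[[List[str]], float]
                  ) -> Tuple[float, List[Tuple[str, float]]]:
    """Score the full union, then the union minus each dataset in turn.

    Rows are sorted so that the dataset whose removal hurts the average
    score most (i.e., the largest contributor) comes first.
    """
    full_score = train_and_eval(list(datasets))
    rows = []
    for held_out in datasets:
        subset = [d for d in datasets if d != held_out]
        rows.append((held_out, train_and_eval(subset)))
    rows.sort(key=lambda row: row[1])  # lowest score after removal first
    return full_score, rows


# Toy stand-in for training: the "score" is a sum of made-up weights.
toy_weights = {"A": 3.0, "B": 1.0, "C": 2.0}


def toy_eval(ds: List[str]) -> float:
    return sum(toy_weights[d] for d in ds)


toy_full, toy_rows = leave_one_out(["A", "B", "C"], toy_eval)
```

In the toy run, removing "A" lowers the score most, so it is listed first, mirroring how the rows of the ablation table are ordered.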

The top few datasets have the highest contributions: BoolQ, SQuAD 2.0, OBQA, and NQA are the top four contributing datasets, each with a different format. SQuAD 1.1 has the least importance in the union, since it is mostly covered by SQuAD 2.0.

The conclusion here is that to build an effective variant of UnifiedQA, one can use a relatively small number of datasets, so long as it is trained on representative members of each format.

7 Conclusion

The question-answering community has fruitfully explored the design of strong models, but while staying within the boundaries of individual QA formats. We argued that such boundaries are artificial and can even limit the performance of systems, because the desired reasoning abilities being taught and probed are not tied to specific formats. Training data in one format should, in principle, help QA systems perform better even on questions in another format.

With this intuition in mind, we presented UnifiedQA, a single pre-trained QA system based on the T5 text-to-text language model, seeking to bring unification across four common QA formats. We showed that even with its simple multi-format training methodology, UnifiedQA achieves performance on par with nearly a dozen dataset-specific expert models (§6.1), while also generalizing well to many unseen datasets (of seen formats) (§6.2). At the same time, we demonstrated that UnifiedQA is a strong starting point for building QA systems: it can achieve state-of-the-art performance by simply fine-tuning on target datasets (§6.3).

We hope this effort will inspire a future line of work in the QA and NLP communities, moving towards more general and broader system designs. We leave extensions of UnifiedQA to other formats, such as direct-answer questions Kwiatkowski et al. (2019), as a promising avenue for future work.


The authors would like to thank Colin Raffel, Adam Roberts and Nicholas Lourie for their help with the T5 framework and for providing suggestions on an earlier version of this work. TPU machines for conducting experiments were provided by Google.


  • P. Banerjee and C. Baral (2020) Knowledge fusion and semantic knowledge ranking for open domain question answering. arXiv preprint arXiv:2004.03101.
  • R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75.
  • C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019a) BoolQ: exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT.
  • K. Clark, M. Luong, U. Khandelwal, C. D. Manning, and Q. Le (2019b) BAM! Born-again multi-task networks for natural language understanding. In ACL, pp. 5931–5937.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv abs/1803.05457.
  • P. Clark, O. Etzioni, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, N. Tandon, S. Bhakthavatsalam, et al. (2019c) From ’F’ to ’A’ on the NY Regents science exams: an overview of the Aristo project. ArXiv abs/1909.01958.
  • P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. Turney, and D. Khashabi (2016) Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI.
  • P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner (2019) Quoref: a reading comprehension dataset with questions requiring coreferential reasoning. In EMNLP/IJCNLP.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • D. Dua, A. Gottumukkala, A. Talmor, M. Gardner, and S. Singh (2019a) Comprehensive multi-dataset evaluation of reading comprehension. In 2nd Workshop on Machine Reading for Question Answering.
  • D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019b) DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
  • A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen (2019) MRQA 2019 shared task: evaluating generalization in reading comprehension. In 2nd Workshop on Machine Reading for Question Answering, at EMNLP.
  • M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, et al. (2020) Evaluating NLP models via contrast sets. ArXiv abs/2004.02709. Cited by: §4.1, §6.2.
  • N. S. Keskar, B. McCann, C. Xiong, and R. Socher (2019) Unifying question answering, text classification, and regression via span extraction. arXiv preprint arXiv:1904.09286. Cited by: §2.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In NAACL-HLT, Cited by: §4.1.
  • D. Khashabi, T. Khot, and A. Sabharwal (2020) Natural perturbation for robust question answering. ArXiv abs/2004.04849. Cited by: §4.1.
  • T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal (2019) QASC: a dataset for question answering via sentence composition. In AAAI, Cited by: §4.1, §6.2, §6.3.
  • T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA reading comprehension challenge. TACL 6, pp. 317–328. Cited by: §4.1.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. TACL 7, pp. 453–466. Cited by: §7.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. H. Hovy (2017) RACE: Large-scale reading comprehension dataset from examinations. In EMNLP, Cited by: §4.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. In ICLR, Cited by: §6.3.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1, §3.1.
  • C. Lin, G. Cao, J. Gao, and J. Nie (2006) An information-theoretic approach to automatic evaluation of summaries. In NAACL, Cited by: §4.2.
  • K. Lin, O. Tafjord, P. Clark, and M. Gardner (2019) Reasoning over paragraph effects in situations. In 2nd Workshop on Machine Reading for Question Answering, at EMNLP, Cited by: §4.1, §6.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §6.3.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: §2.
  • T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, Cited by: §4.1.
  • S. Min, D. Chen, H. Hajishirzi, and L. Zettlemoyer (2019) A discrete hard EM approach for weakly supervised question answering. In EMNLP/IJCNLP, Cited by: §4.2.
  • K. Nishida, K. Nishida, M. Nagata, A. Otsuka, I. Saito, H. Asano, and J. Tomita (2019) Answering while summarizing: multi-task learning for multi-hop QA with evidence extraction. In ACL, Cited by: §4.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §3.1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv abs/1910.10683. Cited by: §1, §2, §3.1, §3.1, §3, §6.1.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In ACL, Cited by: §4.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §4.1.
  • M. Richardson, C. J. C. Burges, and E. Renshaw (2013) MCTest: a challenge dataset for the open-domain machine comprehension of text. In EMNLP, Cited by: §4.1.
  • M. Sachan and E. Xing (2016) Easy questions first? a case study on curriculum learning for question answering. In ACL, pp. 453–463. Cited by: footnote 2.
  • A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In ACL, Cited by: §1, §2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In NAACL-HLT, Cited by: §4.1.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Rep4NLP@ACL, Cited by: §4.1.
  • Z. Zhang, J. Yang, and H. Zhao (2020) Retrospective reader for machine reading comprehension. ArXiv abs/2001.09694. Cited by: §6.2.
  • C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu (2020) FreeLB: enhanced adversarial training for natural language understanding. In ICLR, Cited by: §6.2, §6.3.

Appendix A Appendices

A.1 UnifiedQA: Different Sizes

For completeness, we also report the scores of UnifiedQA models of different sizes on each dataset. Each row of the table corresponds to a single system.

Table 7: UnifiedQA of different sizes on our datasets.

A.2 Comparison with the Dedicated Models: Extended Results

Here we extend the results of §6.1; Table 8 summarizes the relevant experiment. The top portion of the table shows evaluations of T5 models fine-tuned on individual datasets, followed by UnifiedQA. As the table shows, UnifiedQA performs almost as well as the best single-dataset experts, and in some cases even better (e.g., on OBQA or NQA). On average (last column), UnifiedQA does much better than the dataset/format-specific systems. In conclusion, UnifiedQA offers flexibility across multiple QA formats while compromising almost nothing compared to dataset-specific experts.

Table 8: UnifiedQA is on par with systems tailored to individual datasets (the diagonal cells vs. the last row) while functioning across a wide range of datasets (the last column).

A.3 Pairwise Mixing: Extended Results

Here we extend the results of §5. The question addressed is whether there is value in mixing datasets of different formats. We evaluated this by adding one dataset of a different format to four different datasets (one for each format). The results are summarized in Table 9. The goal of each sub-table is to measure the within-format generalization one can gain via out-of-format training. Each sub-table has an anchor dataset, indicated in the first column; for example, the anchor dataset of the first sub-table is SQuAD. The rows of each sub-table combine the anchor dataset with datasets of other formats (e.g., SQuAD + RACE). The columns contain evaluations on datasets with the same format as the anchor dataset. For example, in the first sub-table, evaluation is done on SQuAD 1.1/2.0, NewsQA, and Quoref, which have the same format as SQuAD 1.1, the anchor dataset.

The results show that one can achieve gains for question answering in a certain format by incorporating resources in other formats. In the first two sub-tables, NQA (AB) and OBQA (MC) help a SQuAD model generalize better to other EX datasets. In the third sub-table, where the anchor dataset is NQA (AB), EX datasets help an NQA model generalize better to other AB datasets. In the fourth and fifth sub-tables, EX and AB datasets help RACE and OBQA (MC) models generalize better to other MC datasets. Similarly, in the final sub-table, an MC dataset helps improve the scores on YN datasets.
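The layout of these sub-tables can be sketched in a few lines of code: for a given anchor, the rows are anchor-plus-other-format mixtures and the columns are datasets sharing the anchor's format. This is only a hedged illustration. The helper names (`pairwise_mixtures`, `same_format_eval_sets`) are hypothetical, the dataset-to-format mapping is a simplified stand-in, and "BoolQ" as the YN dataset is an assumption.

```python
# Format tags follow the text: EX, AB, MC, YN.
# The mapping is illustrative; "BoolQ" as the YN dataset is an assumption.
DATASET_FORMAT = {
    "SQuAD 1.1": "EX",
    "SQuAD 2.0": "EX",
    "NewsQA": "EX",
    "Quoref": "EX",
    "NQA": "AB",
    "RACE": "MC",
    "OBQA": "MC",
    "BoolQ": "YN",
}

def pairwise_mixtures(anchor, dataset_format):
    """Rows of a sub-table: the anchor paired with each out-of-format dataset."""
    anchor_format = dataset_format[anchor]
    return [(anchor, other)
            for other, fmt in dataset_format.items()
            if fmt != anchor_format]

def same_format_eval_sets(anchor, dataset_format):
    """Columns of a sub-table: datasets sharing the anchor's format."""
    anchor_format = dataset_format[anchor]
    return [name for name, fmt in dataset_format.items()
            if fmt == anchor_format]

rows = pairwise_mixtures("SQuAD 1.1", DATASET_FORMAT)
columns = same_format_eval_sets("SQuAD 1.1", DATASET_FORMAT)
```

Each `(anchor, other)` pair would then correspond to one model trained on the union of the two datasets and scored on every dataset in `columns`.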

Table 9: Pairwise mixing of formats: mixing QA datasets of different formats helps.