
Exploring The Landscape of Distributional Robustness for Question Answering Models

We conduct a large empirical evaluation to investigate the landscape of distributional robustness in question answering. Our investigation spans over 350 models and 16 question answering datasets, including a diverse set of architectures, model sizes, and adaptation methods (e.g., fine-tuning, adapter tuning, in-context learning, etc.). We find that, in many cases, model variations do not affect robustness and in-distribution performance alone determines out-of-distribution performance. Moreover, our findings indicate that i) zero-shot and in-context learning methods are more robust to distribution shifts than fully fine-tuned models; ii) few-shot prompt fine-tuned models exhibit better robustness than few-shot fine-tuned span prediction models; iii) parameter-efficient and robustness enhancing training methods provide no significant robustness improvements. In addition, we publicly release all evaluations to encourage researchers to further analyze robustness trends for question answering models.



1 Introduction


Figure 1: We evaluate over 350 models on 16 datasets to characterize the landscape of distributional robustness in question answering. Our results span a variety of architectures and adaptation strategies, including zero-shot inference, fine-tuning, and in-context learning (ICL). The x-axis shows performance on SQuAD (in-distribution), while the y-axis shows the average performance on the 15 other QA datasets (out-of-distribution). Almost all models lie under the diagonal, i.e., performance drops under distribution shift. Moreover, within certain groups of models (for instance, ICL models), in-distribution performance accurately predicts out-of-distribution performance. As in taori2020measuring, we apply logit axis scaling to clarify that the relationship between in-distribution and out-of-distribution performance is approximately linear in the logit domain.

Over the past few years, natural language processing has seen substantial progress. In many benchmarks, large pre-trained models adapted to a target dataset reach or even surpass human performance (devlin-etal-2019-bert; raffel2019exploring; Radford2019LanguageMA; brown2020language; hoffmann2022training; chowdhery2022palm, inter alia). At the same time, current methods still fail to generalize reliably in a variety of test conditions (ribeiro-etal-2020-beyond; gardner-etal-2020-evaluating; koh2021wilds; luu2021time; ribeiro-lundberg-2022-adaptive), which limits their applicability and raises questions about what exactly the methods learn (bender-koller-2020-climbing). One limitation of current benchmarks is that they often measure performance only on data that comes from the same distribution as the training set (wang-etal-2018-glue; wang2019superglue). However, evaluating models on a single test set provides no information on whether a method also performs well under distribution shift. While there is an increasing amount of research on robustness in NLP (ribeiro-etal-2020-beyond; tu-etal-2020-empirical; hendrycks-etal-2020-pretrained; gardner-etal-2020-evaluating; arora-etal-2021-types; veitch2021counterfactual; goel2021robustness; Miller2020TheEO, inter alia), the community has not yet adopted a common set of best practices for evaluating robustness. As a result, new methods are often not evaluated on comparable robustness test sets, or on any robustness test sets at all, which makes it challenging to understand which methods generalize more reliably and whether NLP is making progress on robustness to distribution shift.

To address this challenge and shed light on the robustness landscape in NLP, we conduct a large empirical evaluation of distributional robustness in question answering (QA). Building on recent research on robustness in computer vision (taori2020measuring; miller2021accuracy), we focus on distribution shifts that arise between two related but different test sets. These distribution shifts are sometimes called dataset shift to distinguish them from other kinds of distribution shift. An example of dataset shift is a pair of QA test sets where one test set is constructed from Wikipedia articles and the other from Amazon product reviews, possibly also with a different crowdsourcing process. In contrast to other notions of robustness such as adversarial robustness, dataset shifts involve no synthetic perturbations of existing test examples and are therefore more representative of generalization challenges arising “in the wild” (taori2020measuring).

Within the scope of dataset shifts for QA, our robustness evaluation includes a wide range of models and distribution shifts. Specifically, we assembled a testbed of over 350 QA models and 16 QA datasets, including SQuAD v1.1 rajpurkar-etal-2016-squad, SquadShifts Miller2020TheEO, and MRQA test sets fisch2019mrqa. Our testbed spans different model architectures, model sizes, and pre-training setups. In addition, we evaluate a variety of approaches for applying pre-trained models to question answering including supervised fine-tuning, in-context learning, parameter-efficient fine-tuning, zero-shot inference, and more. Finally, we also include methods specifically designed to enhance robustness such as RXF aghajanyan2021better and FreeLB Zhu2020FreeLB.

Our testbed enables us to both identify overarching trends spanning many models, and to contextualize the robustness behavior of individual models. Among our findings are the following key results:

  • Dataset shift is still an unsolved problem in QA: most models suffer a large performance drop under this kind of distribution shift.

  • Despite different architectures and model sizes, many models follow a consistent trend relating in-distribution and out-of-distribution performance. Improving in-distribution performance usually also increases out-of-distribution performance in a predictable way.

  • Current robustness interventions follow the same trend as models without such interventions, i.e., the robustness interventions do not increase robustness to dataset shifts.

  • The only exceptions to the otherwise universal performance trend are zero-shot, in-context learning, and few-shot prompt fine-tuned models. These models are more robust than the baseline given by the other models in our testbed. However, the robustness of large decoder-only models decreases as the models are fine-tuned on more data from the target task.

Figure 1 summarizes our findings and shows the average F1 score on all distribution shifts as a function of the F1 score on SQuAD. Interestingly, our overall results are analogous to similar large-scale robustness evaluations in computer vision taori2020measuring; miller2021accuracy; Radford2021LearningTV, which suggests that there may be a shared underlying mechanism behind these distribution shifts that warrants further investigation.

We hope that our work helps clarify the state of robustness in NLP and provides a starting point for future work. To simplify measuring robustness to dataset shift and enable future robustness improvements, we will release our testbed including all 350+ models and evaluation results.

The remainder of the paper is organized as follows: first, we detail background and experimental setup (§2). Next, we introduce and answer our specific research questions (§3, 4). Finally, we discuss the limitations of our approach, overall conclusions, and directions for future investigation (§6, 7).

Figure 2: A schematic illustrating the technique we use to measure robustness. Effective robustness scatter plots (pmlr-v97-recht19a; taori2020measuring) display performance on the distribution the training data is drawn from (in-distribution) on the x-axis, and out-of-distribution performance on the y-axis. Effective robustness is vertical movement towards the diagonal beyond the baseline trend fit to fully fine-tuned models: a model with higher effective robustness has more consistent performance in- and out-of-distribution.

2 Experimental Setup

Our testbed includes over 350 models, covering a broad range of model architectures, pre-training datasets, and adaptation strategies. We use SQuAD v1.1 rajpurkar-etal-2016-squad as our reference point for question answering performance because SQuAD is a popular dataset and the performance ceiling is comparatively well understood since humans can achieve an F1 score around 95 (Miller2020TheEO). For all models except those performing zero-shot inference, we adapt the models to question answering with the SQuAD training set.

We evaluate robustness to distribution shift on the remaining 15 question answering datasets (Table 1). We follow taori2020measuring in defining robustness, i.e., we say a model is robust if it has consistent performance under a distribution shift from a reference distribution to another distribution. We refer to SQuAD as in-distribution (ID) and the other 15 datasets as out-of-distribution (OOD). In the remainder of this section, we describe the different models, adaptation strategies, datasets, and evaluation details.

| Dataset name | Test set size | Domains |
|---|---|---|
| SQuAD v1.1 dev. set rajpurkar-etal-2016-squad | 10,570 | Wikipedia |
| SquadShifts New-Wiki Miller2020TheEO | 7,938 | Wikipedia |
| SquadShifts Reddit Miller2020TheEO | 9,803 | Reddit |
| SquadShifts NYT Miller2020TheEO | 10,065 | New York Times |
| SquadShifts Amazon Miller2020TheEO | 9,885 | Amazon reviews |
| RACE lai-etal-2017-race | 674 | English exams from China |
| DROP dua-etal-2019-drop | 1,503 | Wikipedia |
| NewsQA trischler-etal-2017-newsqa | 4,212 | CNN articles |
| SearchQA Dunn2017SearchQAAN | 16,980 | Jeopardy! questions with contexts from Google search |
| NaturalQuestions kwiatkowski-etal-2019-natural | 12,836 | Google search questions with contexts from Wikipedia |
| DuoRC (ParaphraseRC) Saha2018DuoRCTC | 1,501 | Movie plots from IMDB and Wikipedia |
| HotpotQA yang-etal-2018-hotpotqa | 5,904 | Wikipedia |
| TextbookQA Kembhavi_2017_CVPR | 1,503 | Middle school science questions from textbooks |
| TriviaQA joshi-etal-2017-triviaqa | 7,785 | Trivia questions with contexts collected using a Bing search |
| RelationExtraction levy-etal-2017-zero | 2,948 | Generated samples using a knowledge base |
| BioASQ TsatsaronisBioasq | 1,504 | Medical articles |
Table 1: Question answering datasets used to evaluate models in this work. SQuAD is used as the in-distribution reference dataset—we use training data from SQuAD to adapt models. The remaining datasets are used to answer the question of how SQuAD models perform under dataset shift—we use these other datasets for evaluation only.

2.1 Models

Our testbed focuses on transformer models ranging from 11 million to 175 billion parameters. We explore several encoder-only models (ALBERT Lan2020ALBERT, BERT devlin-etal-2019-bert, SpanBERT joshi-etal-2020-spanbert, RoBERTa liu2019roberta, and Splinter Ram2021FewShotQA), encoder-decoder models (T5 raffel2019exploring and BART lewis-etal-2020-bart), and decoder-only models (GPT-2 Radford2019LanguageMA, OPT zhang2022opt, GPT-Neo gpt-neo, and GPT-J gpt-j).

2.2 Adaptation strategies

We evaluate multiple adaptation strategies, i.e., methods that adapt the pre-trained language model to perform better on a downstream task using labeled, in-distribution training data, e.g., through gradient-based learning or in-context learning. We also examine models evaluated in a zero-shot setting, which we refer to as an adaptation method for consistency, even though no data from the in-distribution dataset is observed. For a subset of these models we also explore few-shot instead of full-shot adaptation to assess the impact of the number of training examples on robustness.

2.2.1 Fine-tuning (baseline)

We include a common fine-tuning method: adding a span prediction head and updating all the parameters in a language model via additional training on a downstream dataset, as done in devlin-etal-2019-bert and subsequent work.
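As a rough illustration of how a span prediction head is typically used at inference time, the sketch below picks the answer span from per-token start and end scores. The function name `best_span`, the `max_answer_len` cap, and the toy scores are assumptions made for illustration; the paper does not specify its decoding details.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing start + end score,
    subject to start <= end and a maximum span length (assumed cap)."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the cap.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy scores (made up) over a seven-token context.
tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
start = [0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.0]
end = [0.0, 0.1, 0.0, 0.4, 0.2, 3.0, 0.1]
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
```

The `start <= end` constraint matters: taking the two argmaxes independently can produce an end position that precedes the start.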

2.2.2 Prompt fine-tuning

Prompt fine-tuning adds no additional task-specific layers and fine-tunes the existing weights to generate the answer. We use next token prediction when fine-tuning auto-regressive models like GPT. For T5 and BART models we use two fine-tuning tasks: 1) casting QA as an infilling task and generating the answer by predicting a masked span, and 2) conditioning the model on the context and question and fine-tuning it to generate the answer.

2.2.3 Parameter-efficient fine-tuning

Parameter-efficient fine-tuning modifies only a small percentage of existing or auxiliary parameters, while freezing all other parameters. We evaluate Houlsby Houlsby2019ParameterEfficientTL and Pfeiffer pfeiffer-etal-2021-adapterfusion adapters, prefix tuning li-liang-2021-prefix, and LoRA hu2021lora. While these methods modify only a small number of parameters, they have been shown to be competitive with full fine-tuning when measuring in-distribution performance. Previous work suggests freezing a majority of model weights may make these methods more robust lester-etal-2021-power.
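To make "a small percentage of parameters" concrete, here is a minimal sketch of the LoRA idea for a single linear layer: the pre-trained weight W0 stays frozen and only a low-rank update B·A is trained. The helper names and the layer sizes used in the usage note are illustrative assumptions, not the configuration used in the paper.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix,
                 x: List[float], alpha: float) -> List[float]:
    """h = W0 x + (alpha / r) * B (A x); W0 is frozen, A and B are trainable."""
    r = len(a)  # rank of the update = number of rows of A
    base = matvec(w0, x)
    low_rank = matvec(b, matvec(a, x))
    return [h + (alpha / r) * d for h, d in zip(base, low_rank)]


def trainable_fraction(d_out: int, d_in: int, r: int) -> float:
    """Trainable LoRA parameters relative to the frozen d_out x d_in weight."""
    return r * (d_in + d_out) / (d_out * d_in)


# With B initialized to zero, the adapted layer starts out identical to W0.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b_zero = [[0.0], [0.0]]
h_init = lora_forward(w0, a, b_zero, [1.0, 2.0, 3.0], alpha=2.0)
```

For a hypothetical 768 x 768 projection with rank 8, `trainable_fraction(768, 768, 8)` is 1/48, i.e. roughly 2% of that layer's parameters.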

2.2.4 Robustness enhancing fine-tuning

We evaluate methods that have been designed to improve model robustness. In particular, we evaluate RXF aghajanyan2021better and FreeLB Zhu2020FreeLB, which apply adversarial training strategies to improve generalization. Previous work evaluated robustness by comparing against only a few models and did not run extensive evaluations in question answering. Our work conducts evaluations on a large number of distribution shifts.

2.2.5 In-context learning

In-context learning is an adaptation method proposed by NEURIPS2020_1457c0d6 that does not require any gradient updates. This is particularly useful for very large language models, where fine-tuning is expensive. In-context learning refers to the process of conditioning a language model on one or more samples from a training set at inference time, allowing the model to perform a task without updating any parameters. For our experiments, we condition the model on triplets of context, question, and answer, as in NEURIPS2020_1457c0d6.

2.2.6 Zero-shot inference

We evaluate models using prompting or zero-shot inference Radford2019LanguageMA, where a model is conditioned only on the context and question of each test example. In other words, the model generates an answer without conditioning on training examples. Zero-shot models do not observe data from the reference distribution and have been shown to exhibit consistent performance across many distributions in computer vision Radford2021LearningTV.
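The difference between in-context learning and zero-shot inference can be sketched as a difference in prompt construction only. The `Context:/Question:/Answer:` template below is a hypothetical stand-in; the exact input formatting used in the paper is the one shown in its Figure 10, which is not reproduced here.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    context: str
    question: str
    answer: str


def format_example(ex: Example, with_answer: bool) -> str:
    """Render one example with a hypothetical template (not the paper's)."""
    text = f"Context: {ex.context}\nQuestion: {ex.question}\nAnswer:"
    return f"{text} {ex.answer}" if with_answer else text


def build_prompt(demos: List[Example], context: str, question: str) -> str:
    """In-context learning: prepend solved (context, question, answer) triplets.
    Zero-shot inference is the special case demos == []."""
    parts = [format_example(d, with_answer=True) for d in demos]
    parts.append(format_example(Example(context, question, ""), with_answer=False))
    return "\n\n".join(parts)


demo = Example("Paris is the capital of France.",
               "What is the capital of France?", "Paris")
zero_shot = build_prompt([], "The Nile flows through Egypt.",
                         "Which river flows through Egypt?")
one_shot = build_prompt([demo], "The Nile flows through Egypt.",
                        "Which river flows through Egypt?")
```

In both cases no parameters are updated; the model simply continues the text after the final `Answer:`.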

2.3 Distribution shifts

We consider models which are trained on a reference distribution, which we also refer to as the in-distribution, with the exception of zero-shot models. In addition to measuring model performance on this reference distribution, we also evaluate model performance on other datasets where data distribution changes from the reference distribution. We refer to these other datasets as out-of-distribution, and we are interested in model behavior under distribution shift. Concretely, we want to measure how model performance changes when evaluated in- and out-of-distribution.

While there is extensive literature studying adversarial distribution shifts wu-etal-2021-evaluating, our work focuses on natural distribution shifts taori2020measuring, where the out-of-distribution datasets are not generated via synthetic perturbations to existing datasets.

In this work, we use the popular SQuAD rajpurkar-etal-2016-squad dataset as the reference (in-distribution) dataset. In addition, we evaluate model performance on 15 out-of-distribution datasets. We choose SQuAD as the reference distribution as it is one of the largest and the most well-studied QA datasets.

For our out-of-distribution test sets, we use the four datasets presented in the SquadShifts Miller2020TheEO in addition to datasets from the MRQA fisch2019mrqa testbed. Details about each of these datasets can be found in Table 1.
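The in- and out-of-distribution numbers compared throughout are F1 scores. The sketch below follows the familiar SQuAD-style token-overlap F1 and then averages per-dataset scores over the out-of-distribution sets. Two assumptions are made here: that the same normalization is applied to every dataset, and that the out-of-distribution average is an unweighted mean over datasets. The scores in the usage lines are made up.

```python
import re
import string
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold_answers: List[str]) -> float:
    """Best token-overlap F1 of the prediction against any gold answer."""
    pred = normalize(prediction)
    best = 0.0
    for gold in gold_answers:
        ref = normalize(gold)
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def ood_average(f1_by_dataset: Dict[str, float], reference: str = "SQuAD") -> float:
    """Unweighted mean F1 over all datasets except the reference (assumption)."""
    ood = [v for name, v in f1_by_dataset.items() if name != reference]
    return sum(ood) / len(ood)


# Made-up per-dataset scores, not results from the paper.
scores = {"SQuAD": 90.0, "DROP": 40.0, "NewsQA": 60.0}
avg_ood = ood_average(scores)
```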

2.4 Measuring robustness

We follow the technique for measuring model robustness that is outlined in taori2020measuring: a model is said to be robust if it exhibits consistent performance in- and out-of-distribution. This is advantageous compared to examining only out-of-distribution performance because it removes the confounder of in-distribution performance (as shown in taori2020measuring; pmlr-v139-miller21b, models which achieve better performance in-distribution will often also perform better out-of-distribution).

As in taori2020measuring, the robustness measure we consider can be illustrated with a scatter plot. Figure 2 gives an example: it displays the F1 score on the SQuAD development set on the x-axis and the F1 score averaged over the out-of-distribution datasets on the y-axis. Each point on the scatter plot is a different model. Effective robustness then describes vertical movement in this scatter plot towards the y = x line. In particular, effective robustness measures out-of-distribution performance beyond the trend fit to fully fine-tuned models (i.e., in aggregate, fully fine-tuned models have 0 effective robustness). This vertical movement is movement towards a model that has consistent performance in- and out-of-distribution. In Figure 2, which schematizes results that we will later observe with real data, models that are more robust sit above the baseline trend: the models shown in orange are more robust than the other models because they have better out-of-distribution performance given the same in-distribution performance.
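A minimal sketch of this measure, assuming the baseline trend is an ordinary least-squares line fit in the logit domain (matching the logit axis scaling of Figure 1): effective robustness is the gap between a model's out-of-distribution F1 and the value the baseline trend predicts from its in-distribution F1. The function names and all scores below are illustrative, not taken from the paper.

```python
import math
from typing import List, Tuple


def logit(f1: float) -> float:
    """Map an F1 score in (0, 100) to the logit scale."""
    p = f1 / 100.0
    return math.log(p / (1.0 - p))


def inv_logit(z: float) -> float:
    """Map a logit value back to an F1 score in (0, 100)."""
    return 100.0 / (1.0 + math.exp(-z))


def fit_trend(baseline: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of logit(OOD F1) on logit(ID F1): (slope, intercept)."""
    xs = [logit(i) for i, _ in baseline]
    ys = [logit(o) for _, o in baseline]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def effective_robustness(id_f1: float, ood_f1: float,
                         slope: float, intercept: float) -> float:
    """OOD F1 minus the OOD F1 predicted by the baseline trend."""
    predicted = inv_logit(slope * logit(id_f1) + intercept)
    return ood_f1 - predicted


# Made-up (ID F1, OOD F1) pairs standing in for fully fine-tuned models.
baseline = [(70.0, 45.0), (80.0, 55.0), (88.0, 64.0), (92.0, 70.0)]
slope, intercept = fit_trend(baseline)
# A hypothetical model with lower ID F1 but comparatively high OOD F1.
er = effective_robustness(60.0, 50.0, slope, intercept)
```

A positive value places the model above the baseline trend; a model lying exactly on the trend has effective robustness 0.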

Figure 3: Encoder-only, encoder-decoder, and decoder-only models are equally robust when fine-tuned by adding a span prediction head. We conclude that architecture does not determine distributional robustness.
Figure 4: Parameter-efficient fine-tuning methods (highlighted in red and green) do not exhibit noticeable robustness improvements compared to other fine-tuned models.

3 Results

This section aims to answer our main research questions:

  • How do models perform under distribution shift?

  • Are some models more robust than others?

  • Do adaptation methods impact robustness?

We answer these questions in Sections 3.1, 3.2 and 3.3, respectively.

3.1 Performance drops under distribution shift

As shown in Figure 1, we observe that model performance drops under distribution shift. This effect is more pronounced for the best models on SQuAD, which are fully fine-tuned. This indicates that, despite progress in question answering, there is still substantial room for progress in improving model robustness.

3.2 Role of model

Role of model architecture.

In Figure 3 we compare the robustness of fine-tuned encoder-only, decoder-only, and encoder-decoder architectures. Our experiments indicate that architecture does not impact robustness: when different model families are adapted using a span prediction head, all models are equally robust. One limitation of our comparison is that the architectures we compare do not share the same pre-training corpus, and larger corpora have been shown to improve robustness in computer vision Radford2021LearningTV. This is an area that could be investigated further in future work.

Role of model size.

Previous work hendrycks-etal-2020-pretrained has claimed that model size does not affect the robustness of language models. In Figure 5 we plot the average effective robustness on all distribution shifts as a function of the number of model parameters for fine-tuned GPT-2 and BERT models to control for pre-training corpus and architecture. Overall, we observe that model size is not strongly correlated with robustness.

Figure 5: Average effective robustness of BERT and GPT-2 as a function of the number of parameters of models fine-tuned on SQuAD. Overall model size does not determine robustness.

3.3 Role of the adaptation method

Figure 6: Methods designed to improve robustness (highlighted in black) do not exhibit noticeable robustness improvements on our testbed. This discrepancy may arise because of our focus on question answering, which previous work does not evaluate on.
Zero-shot and in-context learning (ICL).

We find that both zero-shot and in-context learning methods exhibit more robustness than methods that use gradient-based learning. As illustrated by Figure 1, the trend for zero-shot and in-context learning models is well above the trend of all other models. This implies that, for the same in-distribution performance, we expect better out-of-distribution performance from in-context learning and zero-shot inference.

Figure 7: Few-shot prompt fine-tuned billion parameter GPT models (colored black) are more robust than smaller few-shot prompt fine-tuned models. Further investigation is required to determine if the increase in effective robustness is due to architecture or model size.
Few-shot fine-tuning.

In Figure 1, we observe that few-shot methods follow two separate robustness trends.

  1. Few-shot fine-tuned models are on a trend similar to fully fine-tuned models.

  2. Few-shot prompt fine-tuned models are more robust than all other models that use gradient based learning.

Notable outliers to the few-shot prompt fine-tuned model trend are the GPT-2 XL Radford2019LanguageMA and GPT-Neo 1.3B gpt-neo models. As shown in Figure 7, these models are more robust than other few-shot prompt fine-tuned models. This indicates that models with better zero-shot capabilities can generalize better when fine-tuned in the few-shot setting. For these few-shot fine-tuned GPT models we explore how the number of training shots impacts robustness. We find that as the number of training samples increases, the effective robustness of few-shot GPT models decreases, as shown in Figure 8. In particular, increasing the number of shots from 16 to 1024 decreases effective robustness. This observation interpolates our previous results: a GPT model used in the zero-shot setting is robust while prompt fine-tuned GPT models are less robust. As observed by previous work (Radford2021LearningTV; andreassen2021evolution; wortsman2021robust), fine-tuning a model can reduce robustness and lead to a model which is overspecialized to the downstream task.

Figure 8: Average effective robustness for each GPT model as a function of the number of shots used for fine-tuning. As the number of shots increases the model becomes better in distribution but the average effective robustness decreases.
Full fine-tuning using span prediction.

The fully fine-tuned models exhibit noticeably less robustness than other adaptation methods; however, they also have the best performance on SQuAD. The best performing model on SQuAD has similar performance out-of-distribution to the best ICL model, despite performing more than 10 percentage points better in-distribution.

Fine-tuning using a prompt.

We find that prompt fine-tuning methods are more robust than span prediction fine-tuned models. We observe that not using span prediction and instead fine-tuning existing model weights to generate the answer allows the model to maintain some of the robustness from the zero-shot setting.

Parameter-efficient tuning.

We examine the performance of parameter-efficient fine-tuning methods for different architectures and model sizes. Our results indicate that these methods are neither noticeably more robust nor less robust than fine-tuning all parameters when using prompt-based methods or span prediction, as shown in Figure 4.

Methods designed to enhance robustness.

As illustrated by Figure 6 we find that RXF and FreeLB, which are designed to improve robustness, do not exhibit noticeable robustness improvements on the distribution shifts. We believe that one of the values of our large test bed is to comprehensively evaluate future robustness enhancing methods.

(a) SquadShifts Wikipedia
(b) SearchQA
(c) DROP
Figure 9: Instead of averaging over all 15 datasets, we now show logit-scaled plots examining three distribution shifts individually. (left) The SquadShifts Wiki dataset is derived from the same data source (Wikipedia) as SQuAD. As a result, models lie closer to the diagonal than on other distribution shifts. (middle) Progress on SQuAD is a weaker indicator for progress on SearchQA for fully fine-tuned models and few-shot fine-tuned models. We find that zero-shot and ICL models are less robust than fine-tuned and few-shot models with the exception of larger language models. (right) On the SQuAD → DROP distribution shift, we observe that progress beyond a certain F1 score on SQuAD yields quick progress on DROP for fine-tuned models.

4 Discussion

This section discusses the aforementioned findings. In particular we discuss how the findings compare to analogous studies in vision, and how individual distribution shifts differ from aggregate trends.

4.1 How do the findings compare to robustness evaluations in vision?

We observe that the overall robustness trends of question answering models are qualitatively similar to trends identified in image classification (taori2020measuring; Radford2021LearningTV; miller2021accuracy). In particular, previous work (Radford2021LearningTV; wortsman2021robust; pham2021combined) has shown that zero-shot models are more robust than fine-tuned models, which is similar to the trend we observe. Moreover, additional robustness evaluations (taori2020measuring) have concluded that fully trained models with different architectures, pre-training datasets, and robustness enhancing methods do not provide any robustness improvement when evaluated on multiple natural distribution shifts, which is also what we observe.

4.2 How do individual distribution shifts differ from aggregate trends?

While we have previously analyzed robustness trends averaged over all distribution shifts, we now examine trends on individual distribution shifts. For most datasets, we observe qualitatively similar trends as when averaging over all distribution shifts.

One exception is the SquadShifts New-Wiki dataset, where we find that all models sit very close to the y = x line (Figure 9(a)). Since both SQuAD and SquadShifts New-Wiki are collected from Wikipedia, it is perhaps unsurprising that models adapted to SQuAD can generalize to other datasets from the same domain.

Moreover, we observe a piece-wise linear trend when comparing few-shot and fine-tuned models on DROP (Figure 9(c)). By fine-tuning on the entire training set, we improve in-distribution performance, which causes larger gains in DROP performance. Similar patterns of discontinuous improvement have been previously observed by wei2022emergent.

Additionally, we find that on the SearchQA dataset (Figure 9(b)) the trendlines are flatter than on other distribution shifts for fine-tuned, prompt fine-tuned, and few-shot fine-tuned models (i.e., increasing ID performance has a smaller impact on OOD performance). In addition, zero-shot and ICL models do not have additional robustness properties. The exceptions are GPT-J and OPT 175B, which continue to outperform other models. Moreover, few-shot prompt fine-tuned models perform better than other few-shot and fine-tuned models.

5 Related work

Understanding how models behave under conditions that differ from training has been the subject of much attention both in natural language processing (ribeiro-etal-2020-beyond; tu-etal-2020-empirical; hendrycks-etal-2020-pretrained; gardner-etal-2020-evaluating; arora-etal-2021-types; veitch2021counterfactual; goel2021robustness; Miller2020TheEO, inter alia) and computer vision (pmlr-v97-recht19a; taori2020measuring; miller2021accuracy; koh2021wilds, inter alia). As in taori2020measuring, we distinguish between synthetic and natural distribution shifts. The former includes any artificial perturbations to inputs, including adversarial attacks (szegedy2013intriguing; carlini2017towards; jia2017adversarial; biggio2018wild; wang2019towards; wallace-etal-2019-trick; wallace-etal-2019-universal; tramer2020adaptive; liu2021can; wu-etal-2021-evaluating; chang2021robustness, inter alia). In contrast, the latter relates to naturally occurring data, without synthetic or adversarial perturbations. Our work focuses on natural distribution shifts.

Most similar to our work is that of yogatama2019learning; talmor2019multiqa; sen2020models; fisch2019mrqa and Miller2020TheEO, who examine the performance of models on multiple question answering datasets. Our work provides a more comprehensive modeling survey, evaluating a broader set of models, adaptation strategies and datasets. In contrast to previous work, we evaluate zero-shot inference, in-context learning, few-shot fine-tuning, and parameter-efficient adaptation methods, which have only recently been popularized.

Finally, a variety of methods for improving robustness have been explored by previous work (jiang2019smart; Zhu2020FreeLB; aghajanyan2021better; veitch2021counterfactual; wortsman2021robust, inter alia). Instead of proposing methods to build more robust models, our goal is to empirically examine the landscape of robustness. As part of this goal, we evaluate robustness enhancing methods, in addition to other adaptation strategies.

Concurrent work by Liu2022AreSN examines the robustness of few-shot fine-tuned models. They find that these models yield no additional robustness, which matches the findings from our evaluation.

6 Conclusion

We conduct an extensive evaluation of the robustness of different models and adaptation methods on 15 distribution shifts in question answering. Our in-depth analysis suggests several concrete directions for future work: improving the in-distribution performance of ICL methods and understanding why different few-shot fine-tuning methods yield varied robustness.

7 Limitations

Experimenting with different in-distribution datasets.

We choose SQuAD as a representative in-distribution dataset since it is one of the largest and most popular QA datasets. One limitation of SQuAD is that the training set is mainly collected from Wikipedia articles which may not be optimal for building a QA model that generalizes to many domains. Future work could explore the robustness of models trained on datasets from other domains for increased coverage.

Specialized modeling methods.

Our work does not evaluate models with task-specific or data-specific components. As an example, andor-etal-2019-giving improved performance on DROP dua-etal-2019-drop by using arithmetic programs to improve a model’s mathematical reasoning. Evaluating the robustness of methods like these is an exciting area for future investigation.

Few-shot GPT evaluations.

Our results indicate that large GPT models fine-tuned on a small number of samples are more robust to distribution shifts than other few-shot fine-tuned models that use a prompt or span prediction. However, GPT-2 XL and GPT-Neo, which both have more than one billion parameters, are larger than all other few-shot models we evaluate. Future work could examine the impact of architecture on this trend by evaluating other models with more than a billion parameters, such as T5.

Multiple fine-tuning runs.

For fine-tuned models we include a single data point for each model. However, previous work (phang2018sentence; Dodge2020FineTuningPL) has shown that different data ordering and weight initialization can lead to large variance in model performance. In Figure 11 we evaluate the robustness of RoBERTa Large models fine-tuned with different data ordering and initialization for the span prediction head (devlin-etal-2019-bert). We find that on average the robustness of these models does not differ substantially. Further investigation into the effect of random seeds on robustness would improve our understanding of the robustness of individual data points.


This work is in part supported by the NSF AI Institute for Foundations of Machine Learning (IFML), Open Philanthropy, Google, and the Allen Institute for AI.


Appendix A Appendix

Figure 10: A sample from SQuAD with the input formatting used for fine-tuning decoder-only models, in-context learning, and zero-shot inference.

a.1 Training Details

In addition to sharing hyperparameters for each model, we plan to share all model weights on the HuggingFace Hub (wolf-etal-2020-transformers) such that the community can continue to evaluate the models in our testbed.

a.1.1 Span prediction fine-tuning

We fine-tune models by adding a span prediction head and fine-tuning for 2 epochs using a learning rate of 3e-5 and a linear learning rate decay.

a.1.2 Prompt fine-tuning

Encoder-Decoder Models
We fine-tune encoder-decoder models on both question->answer generation (mask filling) and answer generation tasks from  chada-natarajan-2021-fewshotqa. For the question->answer generation task we fine-tune the models for 2 epochs, use a linear learning rate decay, and search for the best learning rate from 1e-4, 5e-5, and 3e-5 based on performance on the validation set. For the answer generation task we fine-tune for 2 epochs with a learning rate of 3e-5 and linear learning rate decay.
Decoder-Only Models
We fine-tune decoder-only models using a language modeling head. Specifically, we format samples as shown in Figure 10 and only calculate loss on the answer tokens. We search for the best learning rate among 5e-5 and 5e-6 and use a linear learning rate decay. In addition, we search for the best weight decay value between 0.01 and 0.1. We fine-tune these models for 5 epochs and pick the model with the best validation set F1 score.
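Computing loss only on the answer tokens can be sketched as below. This is a minimal illustration rather than the authors' code: it assumes the common PyTorch convention in which label positions set to -100 are skipped by the cross-entropy loss, and the helper name `build_labels` and the token ids are hypothetical.

```python
# Label value that PyTorch's cross-entropy loss skips by default (ignore_index=-100).
IGNORE_INDEX = -100


def build_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) where only the answer tokens carry loss.

    prompt_ids: token ids of the formatted context and question (hypothetical values).
    answer_ids: token ids of the answer, including any end-of-sequence token.
    """
    input_ids = list(prompt_ids) + list(answer_ids)
    # Mask every prompt position so it contributes nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Four prompt tokens followed by two answer tokens.
input_ids, labels = build_labels([11, 12, 13, 14], [21, 22])
```

With this layout the model still attends to the full prompt, but gradients come only from the positions that predict the answer.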

a.1.3 Parameter efficient fine-tuning

Fine-tuned Models
As suggested in the Adapter-Transformers library (pfeiffer-etal-2020-adapterhub), we use a learning rate of 1e-4 with linear learning rate decay and fine-tune for 15 epochs for all parameter-efficient fine-tuning methods, picking the model with the best F1 score on the validation set.
Prompt Fine-tuned Models
We use a learning rate of 1e-4 with linear learning rate decay and fine-tune for 10 epochs for all parameter-efficient fine-tuning methods, picking the model with the best F1 score on the validation set.

a.1.4 Few-shot Fine-tuning

We fine-tune models on to samples from SQuAD (doubling the size as we increase the number of shots). We repeat each experiment three times using randomly picked samples to remove outliers that result from fine-tuning on specific examples.
Fine-tuned Models
We use the same fine-tuning setup as ram-etal-2021-shot. Specifically, we fine-tune for 10 epochs or 200 steps, whichever is larger. We use a learning rate of 3e-5 with a linear learning rate decay and a warm-up ratio of 0.1.
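The training length and learning rate schedule described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the batch size and sample counts are hypothetical (the text does not give a batch size), and the function names are made up for this example.

```python
import math


def num_training_steps(num_samples, batch_size, epochs=10, min_steps=200):
    """Train for `epochs` epochs or `min_steps` steps, whichever is larger."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return max(epochs * steps_per_epoch, min_steps)


def linear_schedule(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Linear warm-up to base_lr, followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


# Hypothetical setting: 64 training samples with a batch size of 8.
total = num_training_steps(num_samples=64, batch_size=8)
peak_lr = linear_schedule(step=int(total * 0.1), total_steps=total)
```

For very small training sets the 200-step floor dominates, so the smallest few-shot settings all receive the same number of updates.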
Prompt Fine-tuned Models

For autoregressive models we fine-tune for 10 epochs with a learning rate of 1e-5 and linear learning rate decay. In addition, we use a weight decay of 0.1, since our experiments fine-tuning decoder-only models found this to be an ideal value for the models we evaluate. For T5 and BART we use the same evaluation setup as chada-natarajan-2021-fewshotqa for both masked span prediction and answer generation methods.

a.1.5 Robustness enhanced methods

We adapt the official implementation for RXF (using the R3F variant) to fine-tune encoder-only question answering models. We use the same fine-tuning hyperparameters as the fully fine-tuned encoder-only models (Section A.1.1) but use polynomial learning rate decay, a weight decay value of 0.01, and a warm-up ratio of 0.06. For R3F specific parameters we use =1.0, =1e-5, and Normal noise type.
We use the official implementation for FreeLB to fine-tune encoder-only question answering models. We use a learning rate of 5e-6 and fine-tune for 2 epochs with linear learning rate decay. For FreeLB specific parameters we set , =1e-1, and =6e-1.

a.1.6 Zero-shot inference

For zero-shot evaluations we pre-process each sample into the format in Figure 10 omitting the answer from the prompt. We generate the answer using beam decoding with five beams for models smaller than 2 billion parameters and use greedy decoding for the rest of the models. We use a maximum generation length of 20 tokens and use the end of sequence token to terminate generation.
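The decoding rule above can be written as a small helper. This is a sketch under assumptions: the key names mirror keyword arguments of the Hugging Face `generate` method, which the text does not confirm was used, and interpreting the 20-token limit as newly generated tokens is also an assumption. The function name `generation_config` is hypothetical.

```python
def generation_config(num_parameters):
    """Pick decoding settings from a model's parameter count."""
    config = {
        "max_new_tokens": 20,  # assumed to count generated tokens only
        "do_sample": False,    # deterministic decoding in both cases
    }
    if num_parameters < 2_000_000_000:
        config["num_beams"] = 5  # beam search with five beams
    else:
        config["num_beams"] = 1  # a single beam is greedy decoding
    return config


small = generation_config(1_500_000_000)
large = generation_config(6_000_000_000)
```

Restricting beam search to the smaller models keeps memory and latency manageable for the largest ones, since each extra beam multiplies the decoding cost.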

a.1.7 In-context learning

For in-context learning evaluations we condition a language model on one or four random examples from the SQuAD training set. Figure 10 illustrates the format of each sample. When we condition on multiple samples, we separate the formatted samples with a newline delimiter. For each training sample we condition on, we truncate the context of the sample to 100 tokens. Furthermore, we truncate the context of the input sample (the sample we are running inference on) to 200 tokens. For each model and number of shots we repeat each experiment three times using randomly sampled training shots. The exception to this is the OPT 175 billion parameter model, which we evaluate only once in the one- and four-shot settings. We use the same generation setup as zero-shot inference (Section A.1.6).
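The prompt construction described above can be sketched as follows. Several details are assumptions made for illustration: the `Context/Question/Answer` template is hypothetical (the actual format is the one in Figure 10), whitespace splitting stands in for the model tokenizer, and the function names are invented for this example.

```python
import random


def truncate(text, max_tokens):
    """Keep the first max_tokens whitespace tokens (stand-in for a real tokenizer)."""
    return " ".join(text.split()[:max_tokens])


def format_sample(context, question, answer=None, max_context_tokens=100):
    """Format one sample with a hypothetical template; omit the answer at inference."""
    text = f"Context: {truncate(context, max_context_tokens)}\nQuestion: {question}\nAnswer:"
    if answer is not None:
        text += f" {answer}"
    return text


def build_prompt(train_set, test_sample, num_shots, seed=0):
    """Condition on num_shots random training samples, then append the test sample."""
    rng = random.Random(seed)
    shots = rng.sample(train_set, num_shots)
    # Training shots keep at most 100 context tokens and include their answers.
    parts = [format_sample(s["context"], s["question"], s["answer"], 100) for s in shots]
    # The sample being evaluated keeps up to 200 context tokens and has no answer.
    parts.append(format_sample(test_sample["context"], test_sample["question"], None, 200))
    return "\n".join(parts)


train = [{"context": "yy " * 150, "question": f"q{i}?", "answer": f"a{i}"} for i in range(5)]
test_sample = {"context": "zz " * 300, "question": "final?"}
prompt = build_prompt(train, test_sample, num_shots=4, seed=1)
```

Changing `seed` draws a different set of shots, which is how repeated runs with different random training samples could be produced.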

a.2 Additional Plots

Figure 11: The average effective robustness for six fine-tuning runs of RoBERTa Large shows that the robustness differences between fine-tuning runs are negligible.

In this section we include Figure 11 which shows the effect of different fine-tuning runs on effective robustness. We find that even when we fine-tune six different RoBERTa Large models by varying data ordering and weight initialization for the span prediction head the average effective robustness on all distribution shifts is stable.

Figure 12: Instead of averaging over all 15 datasets, we show logit-scaled plots examining all 15 distribution shifts individually. Panels: (a) SquadShifts NYT; (b) SquadShifts Reddit; (c) SquadShifts Wikipedia; (d) SquadShifts Amazon; (e) RACE; (f) DROP; (g) BioASQ; (h) DuoRC; (i) HotpotQA; (j) SearchQA; (k) Natural Questions; (l) NewsQA; (m) TriviaQA; (n) TextbookQA; (o) Relation Extraction.
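For readers unfamiliar with logit scaling, the transform applied to the plot axes can be sketched as below. This assumes the standard definition logit(p) = log(p / (1 - p)) applied to scores expressed as fractions; the example F1 values are made up and are not results from the paper.

```python
import math


def logit(p):
    """Map a score in (0, 1) to the logit scale: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


# Scores reported on a 0-100 scale are divided by 100 first (illustrative values).
f1_scores = [50.0, 90.0, 99.0]
scaled = [logit(f1 / 100.0) for f1 in f1_scores]
```

The transform stretches the regions near 0 and 1, so differences between very weak or very strong models remain visible on the same axes.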