Selective Question Answering under Domain Shift

by   Amita Kamath, et al.
Stanford University

To avoid giving wrong answers, question answering (QA) models need to know when to abstain from answering. Moreover, users often ask questions that diverge from the model's training data, making errors more likely and thus abstention more critical. In this work, we propose the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain on) as many questions as possible while maintaining high accuracy. Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs. Instead, we train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely. Crucially, the calibrator benefits from observing the model's behavior on out-of-domain data, even if from a different domain than the test data. We combine this method with a SQuAD-trained QA model and evaluate on mixtures of SQuAD and five other QA datasets. Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.


1 Introduction

Question answering (QA) models have achieved impressive performance when trained and tested on examples from the same dataset, but tend to perform poorly on examples that are out-of-domain (OOD) (Jia and Liang, 2017; Chen et al., 2017; Yogatama et al., 2019; Talmor and Berant, 2019; Fisch et al., 2019). Deployed QA systems in search engines and personal assistants need to gracefully handle OOD inputs, as users often ask questions that fall outside of the system’s training distribution. While the ideal system would correctly answer all OOD questions, such perfection is not attainable given limited training data (Geiger et al., 2019). Instead, we aim for a more achievable yet still challenging goal: models should abstain when they are likely to err, thus avoiding showing wrong answers to users. This general goal motivates the setting of selective prediction, in which a model outputs both a prediction and a scalar confidence, and abstains on inputs where its confidence is low (El-Yaniv and Wiener, 2010; Geifman and El-Yaniv, 2017).

Figure 1: Selective question answering under domain shift with a trained calibrator. First, a QA model is trained only on source data. Then, a calibrator is trained to predict whether the QA model was correct on any given example. The calibrator’s training data consists of both previously held-out source data and known OOD data. Finally, the combined selective QA system is tested on a mixture of test data from the source distribution and an unknown OOD distribution.

In this paper, we propose the setting of selective question answering under domain shift, which captures two important aspects of real-world QA: (i) test data often diverges from the training distribution, and (ii) systems must know when to abstain. We train a QA model on data from a source distribution, then evaluate selective prediction performance on a dataset that includes samples from both the source distribution and an unknown OOD distribution. This mixture simulates the likely scenario in which users only sometimes ask questions that are covered by the training distribution. While the system developer knows nothing about the unknown OOD data, we allow access to a small amount of data from a third known OOD distribution (e.g., OOD examples that they can foresee).

We first show that our setting is challenging because model softmax probabilities are unreliable estimates of confidence on out-of-domain data. Prior work has shown that a strong baseline for in-domain selective prediction is MaxProb, a method that abstains based on the probability assigned by the model to its highest probability prediction (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017). We find that MaxProb gives good confidence estimates on in-domain data, but is overconfident on OOD data. Therefore, MaxProb performs poorly in mixed settings: it does not abstain enough on OOD examples, relative to in-domain examples.

We correct for MaxProb’s overconfidence by using known OOD data to train a calibrator, a classifier trained to predict whether the original QA model is correct or incorrect on a given example (Platt, 1999; Zadrozny and Elkan, 2002). While prior work in NLP trains a calibrator on in-domain data (Dong et al., 2018), we show this does not generalize to unknown OOD data as well as training on a mixture of in-domain and known OOD data. Figure 1 illustrates the problem setup and how the calibrator uses known OOD data. We use a simple random forest calibrator over features derived from the input example and the model’s softmax outputs.

We conduct extensive experiments using SQuAD (Rajpurkar et al., 2016) as the source distribution and five other QA datasets as different OOD distributions. We average across all 20 choices of using one as the unknown OOD dataset and another as the known OOD dataset, and test on a uniform mixture of SQuAD and unknown OOD data. On average, the trained calibrator achieves coverage (i.e., the system answers of test questions) while maintaining accuracy on answered questions, outperforming MaxProb with the same QA model ( coverage at accuracy), using MaxProb and training the QA model on both SQuAD and the known OOD data ( coverage), and training the calibrator only on SQuAD data ( coverage).

In summary, our contributions are as follows:
(1) We propose a novel setting, selective question answering under domain shift, that captures the practical necessity of knowing when to abstain on test data that differs from the training data.
(2) We show that QA models are overconfident on out-of-domain examples relative to in-domain examples, which causes MaxProb to perform poorly in our setting.
(3) We show that out-of-domain data, even from a different distribution than the test data, can improve selective prediction under domain shift when used to train a calibrator.

2 Related Work

Our setting combines extrapolation to out-of-domain data with selective prediction. We also distinguish our setting from the tasks of identifying unanswerable questions and outlier detection.

2.1 Extrapolation to out-of-domain data

Extrapolating from training data to test data from a different distribution is an important challenge for current NLP models (Yogatama et al., 2019). Models trained on many domains may still struggle to generalize to new domains, as these may involve new types of questions or require different reasoning skills (Talmor and Berant, 2019; Fisch et al., 2019). Related work on domain adaptation also tries to generalize to new distributions, but assumes some knowledge about the test distribution, such as unlabeled examples or a few labeled examples (Blitzer et al., 2006; Daume III, 2007); we assume no such access to the test distribution, but instead make the weaker assumption of access to samples from a different OOD distribution.

2.2 Selective prediction

Selective prediction, in which a model can either predict or abstain on each test example, is a longstanding research area in machine learning (Chow, 1957; El-Yaniv and Wiener, 2010; Geifman and El-Yaniv, 2017). In NLP, Dong et al. (2018) use a calibrator to obtain better confidence estimates for semantic parsing. Rodriguez et al. (2019) use a similar approach to decide when to answer QuizBowl questions. These works focus on training and testing models on the same distribution, whereas our training and test distributions differ.

Selective prediction under domain shift.

Other fields have recognized the importance of selective prediction under domain shift. In medical applications, models may be trained and tested on different groups of patients, so selective prediction is needed to avoid costly errors (Feng et al., 2019). In computational chemistry, Toplak et al. (2014) use selective prediction techniques to estimate the set of (possibly out-of-domain) molecules for which a reactivity classifier is reliable. To the best of our knowledge, our work is the first to study selective prediction under domain shift in NLP.

Answer validation.

Traditional pipelined systems for open-domain QA often have dedicated systems for answer validation—judging whether a proposed answer is correct. These systems often rely on external knowledge about entities (Magnini et al., 2002; Ko et al., 2007). Knowing when to abstain has been part of past QA shared tasks like RespubliQA (Peñas et al., 2009) and QA4MRE (Peñas et al., 2013). IBM’s Watson system for Jeopardy also uses a pipelined approach for answer validation (Gondek et al., 2012). Our work differs by focusing on modern neural QA systems trained end-to-end, rather than pipelined systems, and by viewing the problem of abstention in QA through the lens of selective prediction.

2.3 Related goals and tasks


Calibration.

Knowing when to abstain is closely related to calibration: having a model’s output probability align with the true probability of its prediction (Platt, 1999). A key distinction is that selective prediction metrics generally depend only on relative confidences, since systems are judged on their ability to rank correct predictions higher than incorrect predictions (El-Yaniv and Wiener, 2010). In contrast, calibration error depends on the absolute confidence scores. Nonetheless, we will find it useful to analyze calibration in Section 5.3, as miscalibration on some examples but not others does imply poor relative ordering, and therefore poor selective prediction. Ovadia et al. (2019) observe increases in calibration error under domain shift.

Identifying unanswerable questions.

In SQuAD 2.0, models must recognize when a paragraph does not entail an answer to a question (Rajpurkar et al., 2018). Sentence selection systems must rank passages that answer a question higher than passages that do not (Wang et al., 2007; Yang et al., 2015). In these cases, the goal is to “abstain” when no system (or person) could infer an answer to the given question using the given passage. In contrast, in selective prediction, the model should abstain when it would give a wrong answer if forced to make a prediction.

Outlier detection.

We distinguish selective prediction under domain shift from outlier detection, the task of detecting out-of-domain examples (Schölkopf et al., 1999; Hendrycks and Gimpel, 2017; Liang et al., 2018). While one could use an outlier detector for selective classification (e.g., by abstaining on all examples flagged as outliers), this would be too conservative, as QA models can often get a non-trivial fraction of OOD examples correct (Talmor and Berant, 2019; Fisch et al., 2019). Hendrycks et al. (2019b) use known OOD data for outlier detection by training models to have high entropy on OOD examples; in contrast, our setting rewards models for predicting correctly on OOD examples, not merely having high entropy.

3 Problem Setup

We formally define the setting of selective prediction under domain shift, starting with some notation for selective prediction in general.

3.1 Selective Prediction

Given an input $x$, the selective prediction task is to output $(\hat{y}, c)$, where $\hat{y} \in Y(x)$, the set of answer candidates, and $c \in \mathbb{R}$ denotes the model’s confidence. Given a threshold $\gamma \in \mathbb{R}$, the overall system predicts $\hat{y}$ if $c \ge \gamma$ and abstains otherwise.

The risk-coverage curve provides a standard way to evaluate selective prediction methods (El-Yaniv and Wiener, 2010). For a test dataset $D_{\text{test}}$, any choice of $\gamma$ has an associated coverage (the fraction of $D_{\text{test}}$ the model makes a prediction on) and risk (the error on that fraction of $D_{\text{test}}$). As $\gamma$ decreases, coverage increases, but risk will usually also increase. We plot risk versus coverage and evaluate on the area under this curve (AUC), as well as the maximum possible coverage for a desired risk level. The former metric averages over all $\gamma$, painting an overall picture of selective prediction performance, while the latter evaluates at a particular choice of $\gamma$ corresponding to a specific level of risk tolerance.
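As an illustration of these metrics, the following Python sketch computes a risk-coverage curve, its area, and the maximum coverage at a target accuracy from per-example confidences and correctness labels. The function names, the trapezoidal integration, and the handling of the curve at zero coverage are assumptions made for this example; the paper does not specify them. Ties in confidence are broken arbitrarily by the sort.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(confidences: Sequence[float],
                        correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points obtained by answering the k most
    confident examples, for k = 1..n (equivalent to sweeping the threshold)."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, errors = [], 0
    for answered, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((answered / len(order), errors / answered))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the risk-coverage curve (lower is better).
    Assumption: the first observed risk is extended back to coverage 0."""
    area, prev_cov, prev_risk = 0.0, 0.0, points[0][1]
    for cov, risk in points:
        area += (cov - prev_cov) * (risk + prev_risk) / 2
        prev_cov, prev_risk = cov, risk
    return area


def coverage_at_accuracy(points: List[Tuple[float, float]],
                         target_accuracy: float) -> float:
    """Largest coverage whose risk is at most 1 - target_accuracy."""
    max_risk = 1.0 - target_accuracy + 1e-12  # tolerance for float rounding
    return max((cov for cov, risk in points if risk <= max_risk), default=0.0)


# Toy data (hypothetical): four questions, the third is answered wrongly.
curve = risk_coverage_curve([0.9, 0.8, 0.7, 0.6], [True, True, False, True])
auc = area_under_curve(curve)
cov_at_90 = coverage_at_accuracy(curve, 0.90)
```

In this toy example the system can answer half of the questions before its accuracy on answered questions drops below 90%.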

3.2 Selective Prediction under Domain Shift

We deviate from prior work by considering the setting where the model’s training data and test data are drawn from different distributions. As our experiments demonstrate, this setting is challenging because standard QA models are overconfident on out-of-domain inputs.

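To formally define our setting, we specify three data distributions. First, $p_{\text{source}}$ is the source distribution, from which a large training dataset $D_{\text{train}}$ is sampled. Second, $q_{\text{unk}}$ is an unknown OOD distribution, representing out-of-domain data encountered at test time. The test dataset $D_{\text{test}}$ is sampled from $p_{\text{test}}$, a mixture of $p_{\text{source}}$ and $q_{\text{unk}}$:

$$p_{\text{test}} = \alpha \, p_{\text{source}} + (1 - \alpha) \, q_{\text{unk}}$$

for $\alpha \in (0, 1)$. We choose $\alpha = 1/2$, and examine the effect of changing this ratio in Section 5.8. Third, $q_{\text{known}}$ is a known OOD distribution, representing examples not in $p_{\text{source}}$ but from which the system developer has a small dataset $D_{\text{calib}}$.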
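A mixed test set of this kind can be assembled as in the sketch below. It is an assumption-laden illustration: `sample_mixture` is a hypothetical helper, and it fixes the number of source examples at a rounded fraction of the total rather than drawing each example independently from the mixture.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_mixture(source_pool: Sequence[T], unknown_ood_pool: Sequence[T],
                   n: int, alpha: float = 0.5, seed: int = 0) -> List[T]:
    """Build a test set of n examples, a fraction alpha of which comes from
    the source pool and the rest from the unknown OOD pool."""
    rng = random.Random(seed)
    n_source = round(alpha * n)
    test = (rng.sample(list(source_pool), n_source)
            + rng.sample(list(unknown_ood_pool), n - n_source))
    rng.shuffle(test)  # interleave in-domain and OOD examples
    return test


# Toy pools (hypothetical); each example is tagged with its origin.
source = [("source", i) for i in range(10)]
ood = [("ood", i) for i in range(10)]
mixed = sample_mixture(source, ood, n=8, alpha=0.5, seed=1)
```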

3.3 Selective Question Answering

While our framework is general, we focus on extractive question answering, as exemplified by SQuAD (Rajpurkar et al., 2016), due to its practical importance and the diverse array of available QA datasets in the same format. The input $x$ is a passage-question pair $(p, q)$, and the set of answer candidates $Y(x)$ is all spans of the passage $p$. A base model $f$ defines a probability distribution $f(y \mid x)$ over $Y(x)$. All selective prediction methods we consider choose $\hat{y} = \arg\max_{y' \in Y(x)} f(y' \mid x)$, but differ in their associated confidence $c$.

4 Methods

Recall that our setting differs from the standard selective prediction setting in two ways: unknown OOD data drawn from $q_{\text{unk}}$ appears at test time, and known OOD data drawn from $q_{\text{known}}$ is available to the system. Intuitively, we expect that systems must use the known OOD data to generalize to the unknown OOD data. In this section, we present three standard selective prediction methods for in-domain data, and show how they can be adapted to use data from $q_{\text{known}}$.

4.1 MaxProb

The first method, MaxProb, directly uses the probability assigned by the base model to $\hat{y}$ as an estimate of confidence. Formally, MaxProb with model $f$ estimates confidence on input $x$ as:

$$c_{\text{MaxProb}} = f(\hat{y} \mid x) = \max_{y' \in Y(x)} f(y' \mid x)$$

MaxProb is a strong baseline for our setting. Across many tasks, MaxProb has been shown to distinguish in-domain test examples that the model gets right from ones the model gets wrong (Hendrycks and Gimpel, 2017). MaxProb is also a strong baseline for outlier detection, as it is lower for out-of-domain examples than in-domain examples (Lakshminarayanan et al., 2017; Liang et al., 2018; Hendrycks et al., 2019b). This is desirable for our setting: models make more mistakes on OOD examples, so they should abstain more on OOD examples than in-domain examples.

MaxProb can be used with any base model $f$. We consider two such choices: a model $f_{\text{src}}$ trained only on $D_{\text{train}}$, or a model $f_{\text{src+known}}$ trained on the union of $D_{\text{train}}$ and $D_{\text{calib}}$.
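A minimal sketch of MaxProb-based abstention follows. It assumes the base model's distribution over answer spans is already available as a dictionary from (start, end) token indices to probabilities; that representation and the function names are choices made for this example only.

```python
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]


def max_prob(span_probs: Dict[Span, float]) -> Tuple[Span, float]:
    """Return the highest-probability span and its probability (MaxProb)."""
    best = max(span_probs, key=span_probs.get)
    return best, span_probs[best]


def selective_answer(span_probs: Dict[Span, float],
                     threshold: float) -> Optional[Span]:
    """Answer with the best span if its confidence reaches the threshold;
    return None to signal abstention otherwise."""
    span, confidence = max_prob(span_probs)
    return span if confidence >= threshold else None


# Hypothetical distribution over three candidate spans.
probs = {(0, 1): 0.2, (3, 5): 0.7, (3, 4): 0.1}
answered = selective_answer(probs, threshold=0.5)   # answers with span (3, 5)
abstained = selective_answer(probs, threshold=0.8)  # abstains
```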

4.2 Test-time Dropout

For neural networks, another standard approach to estimate confidence is to use dropout at test time. Gal and Ghahramani (2016) showed that dropout gives good confidence estimates on OOD data.

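Given an input $x$ and model $f$, we compute $f$ on $x$ with $K$ different dropout masks, obtaining prediction distributions $\hat{p}_1, \dots, \hat{p}_K$, where each $\hat{p}_i$ is a probability distribution over $Y(x)$. We consider two statistics of these $\hat{p}_i$’s that are commonly used as confidence estimates. First, we take the mean of $\hat{p}_i(\hat{y})$ across all $i$ (Lakshminarayanan et al., 2017):

$$c_{\text{DropoutMean}} = \frac{1}{K} \sum_{i=1}^{K} \hat{p}_i(\hat{y})$$
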
This can be viewed as ensembling the predictions across all dropout masks by averaging them.

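Second, we take the negative variance of the $\hat{p}_i(\hat{y})$’s (Feinman et al., 2017; Smith and Gal, 2018):

$$c_{\text{DropoutVar}} = -\mathrm{Var}\left[\hat{p}_1(\hat{y}), \dots, \hat{p}_K(\hat{y})\right]$$

Higher variance corresponds to greater uncertainty, and hence favors abstaining. Like MaxProb, dropout can be used either with $f$ trained only on $D_{\text{train}}$, or on both $D_{\text{train}}$ and the known OOD data.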

Test-time dropout has practical disadvantages compared to MaxProb. It requires access to internal model representations, whereas MaxProb only requires black box access to the base model (e.g., API calls to a trained model). Dropout also requires $K$ forward passes of the base model, leading to a $K$-fold increase in runtime.
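The two dropout statistics can be computed as in the sketch below, given the probability that each dropout-masked forward pass assigns to the predicted answer. Running the forward passes themselves is model-specific and omitted. The use of the population variance (rather than the sample variance) is an assumption; the text does not say which is used.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def dropout_confidences(probs_of_prediction: Sequence[float]) -> Tuple[float, float]:
    """Given the probability each of the K dropout-masked passes assigns to the
    predicted answer, return (mean, negative variance) as two confidences."""
    mean_conf = fmean(probs_of_prediction)
    neg_var_conf = -pvariance(probs_of_prediction)  # higher = more certain
    return mean_conf, neg_var_conf


# Hypothetical probabilities from K = 4 dropout masks.
stable = dropout_confidences([0.70, 0.72, 0.68, 0.70])
unstable = dropout_confidences([0.95, 0.40, 0.85, 0.60])
```

Both inputs have the same mean (0.70), but the second has much higher variance across masks, so the negative-variance statistic assigns it lower confidence.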

4.3 Training a calibrator

Our final method trains a calibrator to predict when a base model (trained only on data from $p_{\text{source}}$) is correct (Platt, 1999; Dong et al., 2018). We differ from prior work by training the calibrator on a mixture of data from $p_{\text{source}}$ and $q_{\text{known}}$, anticipating the test-time mixture of $p_{\text{source}}$ and $q_{\text{unk}}$. More specifically, we hold out a small number of $p_{\text{source}}$ examples from base model training, and train the calibrator on the union of these examples and the $q_{\text{known}}$ examples. We define $c_{\text{Calibrator}}$ to be the prediction probability of the calibrator.

The calibrator itself could be any binary classification model. We use a random forest classifier with seven features: passage length, the length of the predicted answer $\hat{y}$, and the top five softmax probabilities output by the model. These features require only a minimal amount of domain knowledge to define. Rodriguez et al. (2019) similarly used multiple softmax probabilities to decide when to answer questions. The simplicity of this model makes the calibrator fast to train when given new data from $q_{\text{known}}$, especially compared to re-training the QA model on that data.

We experiment with four variants of the calibrator. First, to measure the impact of using known OOD data, we change the calibrator’s training data: it can be trained either on data from $p_{\text{source}}$ only, or on both $p_{\text{source}}$ and $q_{\text{known}}$ data as described. Second, we consider a modification where instead of the model’s probabilities, we use probabilities from the mean ensemble over dropout masks, as described in Section 4.2, and also add $c_{\text{DropoutVar}}$ as a feature. As discussed above, dropout features are costly to compute and assume white-box access to the model, but may result in better confidence estimates. Both of these variables can be changed independently, leading to four configurations.
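The seven-feature input to the calibrator can be assembled as in this sketch. Measuring lengths in tokens and padding with zeros when fewer than five candidate probabilities exist are assumptions made for the example, and `calibrator_features` is a hypothetical helper name.

```python
from typing import List, Sequence


def calibrator_features(passage_tokens: Sequence[str],
                        predicted_answer_tokens: Sequence[str],
                        span_probs: Sequence[float],
                        top_k: int = 5) -> List[float]:
    """Build the feature vector: passage length, predicted-answer length,
    and the top_k highest softmax probabilities (zero-padded if needed)."""
    top = sorted(span_probs, reverse=True)[:top_k]
    top += [0.0] * (top_k - len(top))
    return [float(len(passage_tokens)), float(len(predicted_answer_tokens))] + top


# Hypothetical example: a 120-token passage and a 2-token predicted answer.
features = calibrator_features(["tok"] * 120, ["Marie", "Curie"],
                               [0.5, 0.1, 0.3, 0.05, 0.03, 0.02])
```

Rows built this way, paired with 0/1 labels for whether the QA model's prediction was correct, could be passed to scikit-learn's `RandomForestClassifier.fit`; the positive-class column of `predict_proba` would then serve as the calibrator's confidence.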

5 Experiments and Analysis

5.1 Experimental Details


We use SQuAD 1.1 (Rajpurkar et al., 2016) as the source dataset and five other datasets as OOD datasets: NewsQA (Trischler et al., 2017), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). (We consider these different datasets to represent different domains, hence our usage of the term “domain shift.”) These are all extractive question answering datasets where all questions are answerable; however, they vary widely in the nature of passages (e.g., Wikipedia, news, web snippets), questions (e.g., Jeopardy and trivia questions), and relationship between passages and questions (e.g., whether questions are written based on passages, or passages retrieved based on questions). We used the preprocessed data from the MRQA 2019 shared task (Fisch et al., 2019). For HotpotQA, we focused on multi-hop questions by selecting only “hard” examples, as defined by Yang et al. (2018). In each experiment, two different OOD datasets are chosen as $q_{\text{unk}}$ and $q_{\text{known}}$. All results are averaged over all 20 such combinations, unless otherwise specified. We sample 2,000 examples from $q_{\text{known}}$ for $D_{\text{calib}}$, and 4,000 SQuAD and 4,000 $q_{\text{unk}}$ examples for $D_{\text{test}}$. We evaluate using exact match (EM) accuracy, as defined by SQuAD (Rajpurkar et al., 2016). Additional details can be found in Appendix A.1.

QA model.

For our QA model, we use the BERT-base SQuAD 1.1 model trained for 2 epochs (Devlin et al., 2019). We train six models total: one $f_{\text{src}}$ and five $f_{\text{src+known}}$’s, one for each OOD dataset.

Selective prediction methods.

For test-time dropout, we use different dropout masks, as in Dong et al. (2018). For our calibrator, we use the random forest implementation from Scikit-learn (Pedregosa et al., 2011). We train on 1,600 SQuAD examples and 1,600 known OOD examples, and use the remaining 400 SQuAD and 400 known OOD examples as a validation set to tune calibrator hyperparameters via grid search. We average our results over 10 random splits of this data. When training the calibrator only on $p_{\text{source}}$, we use 3,200 SQuAD examples for training and 800 for validation, to ensure equal dataset sizes. Additional details can be found in Appendix A.2.

5.2 Main results

Training a calibrator with $q_{\text{known}}$ outperforms other methods.

Table 1 compares all methods that do not use test-time dropout. Compared to MaxProb with $f_{\text{src+known}}$, the calibrator has points and points higher coverage at and accuracy respectively, and points lower AUC. (confidence interval is , using the paired bootstrap test with 1000 bootstrap samples.) This demonstrates that training a calibrator is a better use of known OOD data than training a QA model. The calibrator trained on both $p_{\text{source}}$ and $q_{\text{known}}$ also outperforms the calibrator trained on $p_{\text{source}}$ alone by coverage at accuracy. All methods perform far worse than the optimal selective predictor with the given base model, though achieving this bound may not be realistic. (As the QA model has fixed accuracy on $D_{\text{test}}$, it is impossible to achieve 0% risk at 100% coverage.)

Test-time dropout improves results but is expensive.

Table 2 shows results for methods that use test-time dropout, as described in Section 4.2. The negative variance of $\hat{p}_i(\hat{y})$’s across dropout masks serves poorly as an estimate of confidence, but the mean performs well. The best performance is attained by the calibrator using dropout features, which has higher coverage at accuracy than the calibrator with non-dropout features. Since test-time dropout introduces substantial (i.e., $K$-fold) runtime overhead, our remaining analyses focus on methods without test-time dropout.

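Table 1: Results for methods without test-time dropout, reporting AUC (↓) and coverage at two fixed accuracy levels (Cov @, ↑). Rows for the QA model trained on SQuAD: Calibrator ($p_{\text{source}}$ only), Calibrator ($p_{\text{source}}$ and $q_{\text{known}}$), Best possible. Rows for the QA model trained on SQuAD + known OOD: Best possible. The calibrator with access to $D_{\text{calib}}$ outperforms all other methods. ↓: lower is better. ↑: higher is better.
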
Table 2: Results for methods that use test-time dropout, reporting coverage at two fixed accuracy levels (Cov @). Rows for the QA model trained on SQuAD: Test-time dropout (–var), Test-time dropout (mean), Calibrator ($p_{\text{source}}$ only), Calibrator ($p_{\text{source}}$ and $q_{\text{known}}$), Best possible. Rows for the QA model trained on SQuAD + known OOD: Test-time dropout (–var), Test-time dropout (mean), Best possible. Here again, the calibrator with access to $D_{\text{calib}}$ outperforms all other methods.

The QA model has lower non-trivial accuracy on OOD data.

Next, we motivate our focus on selective prediction, as opposed to outlier detection, by showing that the QA model still gets a non-trivial fraction of OOD examples correct. Table 3 shows the (non-selective) exact match scores for all six QA models used in our experiments on all datasets. All models get around 80% accuracy on SQuAD, and around 40% to 50% accuracy on most OOD datasets. Since OOD accuracies are much higher than 0%, abstaining on all OOD examples would be overly conservative. (In Section A.3, we confirm that an outlier detector does not achieve good selective prediction performance.) At the same time, since OOD accuracy is worse than in-domain accuracy, a good selective predictor should answer more in-domain examples and fewer OOD examples. Training on 2,000 known OOD examples does not significantly help the base model extrapolate to other distributions.

Train Data / Test Data       SQuAD  TriviaQA  HotpotQA  NewsQA   NaturalQuestions  SearchQA
SQuAD only                   80.95  48.43     44.88     40.45    42.78             17.98
SQuAD + 2K TriviaQA          81.48  (50.50)   43.95     39.15    47.05             25.23
SQuAD + 2K HotpotQA          81.15  49.35     (53.60)   39.85    48.18             24.40
SQuAD + 2K NewsQA            81.50  50.18     42.88     (44.00)  47.08             20.40
SQuAD + 2K NaturalQuestions  81.48  51.43     44.38     40.90    (54.85)           25.95
SQuAD + 2K SearchQA          81.60  56.58     44.30     40.15    47.05             (59.80)
Table 3: Exact match accuracy for all six QA models on all six test QA datasets. Training on $D_{\text{calib}}$ improves accuracy on data from the same dataset (diagonal, in parentheses), but generally does not improve accuracy on data from $q_{\text{unk}}$.

Results hold across different amounts of known OOD data.

As shown in Figure 2, across all amounts of known OOD data, using it to train and validate the calibrator (in an 80–20 split) performs better than adding all of it to the QA training data and using MaxProb.

Figure 2: Area under the risk-coverage curve as a function of how much data from $q_{\text{known}}$ is available. At all points, using data from $q_{\text{known}}$ to train the calibrator is more effective than using it for QA model training.

5.3 Overconfidence of MaxProb

Figure 3: MaxProb is lower on average for OOD data than in-domain data (a), but it is still overconfident on OOD data: when plotting the true probability of correctness vs. MaxProb (b), the OOD curve is below the $y = x$ line, indicating MaxProb overestimates the probability that the prediction is correct. The calibrator assigns lower confidence on OOD data (c) and has a smaller gap between in-domain and OOD curves (d), indicating improved calibration.

We now show why MaxProb performs worse in our setting compared to the in-domain setting: it is miscalibrated on out-of-domain examples. Figure 3(a) shows that MaxProb values are generally lower for OOD examples than in-domain examples, following previously reported trends (Hendrycks and Gimpel, 2017; Liang et al., 2018). However, the MaxProb values are still too high out-of-domain. Figure 3(b) shows that MaxProb is not well calibrated: it is underconfident in-domain, and overconfident out-of-domain. (The in-domain underconfidence is because SQuAD (and some other datasets) provides only one answer at training time, but multiple answers are considered correct at test time. In Appendix A.4, we show that removing multiple answers makes MaxProb well-calibrated in-domain; it stays overconfident out-of-domain.) For example, for a MaxProb of , the model is about likely to get the question correct if it came from SQuAD (in-domain), and likely to get the question correct if it was OOD. When in-domain and OOD examples are mixed at test time, MaxProb therefore does not abstain enough on the OOD examples. Figure 3(d) shows that the calibrator is better calibrated, even though it is not trained on any unknown OOD data. In Appendix A.5, we show that the calibrator abstains on more OOD examples than MaxProb.
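A calibration curve of the kind discussed here (true probability of correctness versus reported confidence) can be estimated by binning, as in the sketch below. The equal-width binning and the default of 10 bins are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def calibration_curve(confidences: Sequence[float], correct: Sequence[bool],
                      n_bins: int = 10) -> List[Tuple[float, float]]:
    """Group examples into equal-width confidence bins over [0, 1] and return
    (mean confidence, empirical accuracy) for each non-empty bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))
    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            curve.append((mean_conf, accuracy))
    return curve


# Hypothetical OOD examples: high confidence but only half are correct, so
# the point (0.85, 0.5) falls below the diagonal, indicating overconfidence.
ood_curve = calibration_curve([0.85, 0.85, 0.15], [True, False, False])
```

Computing one curve for in-domain examples and another for OOD examples, then comparing each to the diagonal, reproduces the style of analysis described in this section.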

Our finding that the BERT QA model is not overconfident in-domain aligns with Hendrycks et al. (2019a), who found that pre-trained computer vision models are better calibrated than models trained from scratch, as pre-trained models can be trained for fewer epochs. Our QA model is only trained for two epochs, as is standard for BERT. Our findings also align with Ovadia et al. (2019), who find that computer vision and text classification models are poorly calibrated out-of-domain even when well-calibrated in-domain. Note that miscalibration out-of-domain does not imply poor selective prediction on OOD data, but does imply poor selective prediction in our mixture setting.

5.4 Extrapolation between datasets

We next investigated how the choice of $q_{\text{known}}$ affects generalization of the calibrator to $q_{\text{unk}}$. Figure 4 shows the percentage reduction between MaxProb and optimal AUC achieved by the trained calibrator. The calibrator outperforms MaxProb over all dataset combinations, with larger gains when $q_{\text{known}}$ and $q_{\text{unk}}$ are similar. For example, samples from TriviaQA help generalization to SearchQA and vice versa; both use web snippets as passages. Samples from NewsQA, the only other non-Wikipedia dataset, are also helpful for both. On the other hand, no other dataset significantly helps generalization to HotpotQA, likely due to HotpotQA’s unique focus on multi-hop questions.
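The relative improvement reported here can be read as the fraction of the gap between MaxProb's AUC and the optimal AUC that the calibrator closes. The sketch below encodes that reading; the function name and the numbers in the example are hypothetical and are not results from the paper.

```python
def relative_auc_improvement(auc_maxprob: float, auc_calibrator: float,
                             auc_optimal: float) -> float:
    """Percentage of the MaxProb-to-optimal AUC gap closed by the calibrator.
    AUC is the area under the risk-coverage curve, so lower is better."""
    gap = auc_maxprob - auc_optimal
    if gap <= 0:
        raise ValueError("MaxProb AUC must exceed the optimal AUC")
    return 100.0 * (auc_maxprob - auc_calibrator) / gap


# Hypothetical values: the calibrator closes 2 of the 10 AUC points
# separating MaxProb from the optimal selective predictor.
improvement = relative_auc_improvement(auc_maxprob=20.0, auc_calibrator=18.0,
                                       auc_optimal=10.0)
```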

Figure 4: Results for different choices of $q_{\text{known}}$ (y-axis) and $q_{\text{unk}}$ (x-axis). For each pair, we report the percent AUC improvement of the trained calibrator over MaxProb, relative to the total possible improvement. Datasets that use similar passages (e.g., SearchQA and TriviaQA) help each other the most. Main diagonal elements (shaded) assume access to $q_{\text{unk}}$ (see Section 5.9).

5.5 Calibrator feature ablations

We determine the importance of each feature of the calibrator by removing each of its features individually, leaving the rest. From Table 4, we see that the most important features are the softmax probabilities and the passage length. Intuitively, passage length is meaningful both because longer passages have more answer candidates, and because passage length differs greatly between different domains.

Table 4: Performance of the calibrator as each of its features is removed individually, leaving the rest, reporting coverage at two fixed accuracy levels (Cov @). Rows: All features; –Top softmax probability; –2nd:5th highest softmax probabilities; –All softmax probabilities; –Context length; –Prediction length. The base model’s softmax probabilities are important features, as is passage length.

5.6 Error analysis

We examined calibrator errors on two pairs of $q_{\text{known}}$ and $q_{\text{unk}}$: one similar pair of datasets and one dissimilar. For each, we sampled 100 errors in which the system confidently gave a wrong answer (overconfident), and 100 errors in which the system abstained but would have gotten the question correct if it had answered (underconfident). These were sampled from the 1000 most overconfident or underconfident errors, respectively.

$q_{\text{known}}$ = NewsQA, $q_{\text{unk}}$ = TriviaQA.

These two datasets are from different non-Wikipedia sources.

of overconfidence errors are due to the model predicting valid alternate answers, or span mismatches (the model predicts a slightly different span than the gold span, and should be considered correct); thus the calibrator was not truly overconfident. This points to the need to improve QA evaluation metrics (Chen et al., 2019). of underconfidence errors are due to the passage requiring coreference resolution over long distances, including with the article title. Neither SQuAD nor NewsQA passages have coreference chains as long or contain titles, so it is unsurprising that the calibrator struggles on these cases. Another of underconfidence errors were cases in which there was insufficient evidence in the paragraph to answer the question (as TriviaQA was constructed via distant supervision), so the calibrator was not incorrect to assign low confidence. of all underconfidence errors also included phrases that would not be common in SQuAD and NewsQA, such as using “said bye bye” for “banned.”

$q_{\text{known}}$ = NewsQA, $q_{\text{unk}}$ = HotpotQA.

These two datasets are dissimilar from each other in multiple ways. HotpotQA uses short Wikipedia passages and focuses on multi-hop questions; NewsQA has much longer passages from news articles and does not focus on multi-hop questions. of the overconfidence errors are due to valid alternate answers or span mismatches. On of the underconfidence errors, the correct answer was the only span in the passage that could plausibly answer the question, suggesting that the model arrived at the answer due to artifacts in HotpotQA that facilitate guesswork (Chen and Durrett, 2019; Min et al., 2019). In these situations, the calibrator’s lack of confidence is therefore justifiable.

5.7 Relationship with Unanswerable Questions

We now study the relationship between selective prediction and identifying unanswerable questions.

Unanswerable questions do not aid selective prediction.

We trained a QA model on SQuAD 2.0 (Rajpurkar et al., 2018), which augments SQuAD 1.1 with unanswerable questions. Our trained calibrator with this model gets AUC, which is very close to the for the model trained on SQuAD 1.1 alone. MaxProb also performed similarly with the SQuAD 2.0 model ( AUC) and SQuAD 1.1 model ( AUC).

Selective prediction methods do not identify unanswerable questions.

For both MaxProb and our calibrator, we pick a threshold and predict that a question is unanswerable if the confidence falls below it. We choose the threshold to maximize SQuAD 2.0 EM score. Both methods perform poorly: the calibrator (averaged over five choices of ) achieves EM, while MaxProb achieves EM (we evaluate on 4000 questions randomly sampled from the SQuAD 2.0 development set). These results only weakly outperform the majority baseline of EM.
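The threshold selection described above can be illustrated with a minimal sketch. It assumes per-example confidences, a flag for whether the predicted span matches a gold answer, and a flag for whether the question is unanswerable; the function name `best_unanswerable_threshold` and the exhaustive sweep over observed confidences are illustrative assumptions, not the procedure used in our experiments.

```python
from typing import List, Tuple


def best_unanswerable_threshold(
    confidences: List[float],
    answer_correct: List[bool],
    is_unanswerable: List[bool],
) -> Tuple[float, float]:
    """Return (threshold, EM) for the threshold that maximizes exact match.

    A question is predicted unanswerable when its confidence is below the
    threshold. An example counts as a hit if it is unanswerable and predicted
    unanswerable, or answerable, answered, and the predicted span is correct.
    """
    n = len(confidences)
    # Hypothetical candidate set: every observed confidence, plus +inf
    # (which predicts "unanswerable" for every question).
    candidates = sorted(set(confidences)) + [float("inf")]
    best_t, best_em = candidates[0], -1.0
    for t in candidates:
        hits = 0
        for c, ok, unans in zip(confidences, answer_correct, is_unanswerable):
            if c < t:
                hits += unans  # abstained: right only if truly unanswerable
            else:
                hits += ok and not unans  # answered: right only if span matches
        em = hits / n
        if em > best_em:
            best_t, best_em = t, em
    return best_t, best_em
```

Selecting the threshold on the same data used for evaluation, as this sketch does, gives an optimistic estimate; it is meant only to make the decision rule concrete.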

Taken together, these results indicate that identifying unanswerable questions is a very different task from knowing when to abstain under distribution shift. Our setting focuses on test data that is dissimilar to the training data, but on which the original QA model can still correctly answer a non-trivial fraction of examples. In contrast, unanswerable questions in SQuAD 2.0 look very similar to answerable questions, but a model trained on SQuAD 1.1 gets all of them wrong.

5.8 Changing ratio of in-domain to OOD

Until now, we used both for and training the calibrator. Now we vary for both, ranging from using only SQuAD to only OOD data (sampled from for and from for ).

Figure 5: Difference in AUC between calibrator and MaxProb, as a function of how much of comes from (i.e., SQuAD) instead of , averaged over 5 OOD datasets. The calibrator outperforms MaxProb most when is a mixture of and .

Figure 5 shows the difference in AUC between the trained calibrator and MaxProb. At both ends of the graph, the difference is close to 0, showing that MaxProb performs well in homogeneous settings. However, when the two data sources are mixed, the calibrator outperforms MaxProb significantly. This further supports our claim that MaxProb performs poorly in mixed settings.
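To make the compared quantity concrete, the following is a minimal sketch of an area-under-the-risk-coverage-curve computation, assuming that AUC here denotes that area (lower is better) and that examples are answered in order of decreasing confidence. The function name `risk_coverage_auc` and the simple averaging over coverage levels are illustrative assumptions.

```python
from typing import List


def risk_coverage_auc(confidences: List[float], correct: List[bool]) -> float:
    """Approximate the area under the risk-coverage curve (lower is better).

    After answering the k most confident examples, coverage is k/n and risk
    is the error rate among those k. The area is approximated by averaging
    the risk over the n coverage levels.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    total_risk = 0.0
    for k, i in enumerate(order, start=1):
        errors += not correct[i]
        total_risk += errors / k
    return total_risk / len(order)


# Toy comparison of two confidence scores on the same predictions.
correct = [True, True, False, True]
good_scores = [0.9, 0.8, 0.3, 0.7]  # ranks the wrong answer last
poor_scores = [0.6, 0.7, 0.9, 0.8]  # ranks the wrong answer first
auc_gap = risk_coverage_auc(poor_scores, correct) - risk_coverage_auc(good_scores, correct)
```

A positive `auc_gap` means the first score incurs more risk across coverage levels than the second, which is the kind of difference plotted in Figure 5.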

5.9 Allowing access to

We note that our findings do not hold in the alternate setting where we have access to samples from (instead of ). Training the QA model with this OOD data and using MaxProb achieves average AUC of , whereas training a calibrator achieves ; unsurprisingly, training on examples similar to the test data is helpful. We do not focus on this setting, as our goal is to build selective QA models for unknown distributions.

6 Discussion

In this paper, we propose the setting of selective question answering under domain shift, in which systems must know when to abstain on a mixture of in-domain and unknown OOD examples. Our setting combines two important goals for real-world systems: knowing when to abstain, and handling distribution shift at test time. We show that models are overconfident on OOD examples, leading to poor performance in our setting, but training a calibrator using other OOD data can help correct for this problem. While we focus on question answering, our framework is general and extends to any prediction task for which graceful handling of out-of-domain inputs is necessary.

Across many tasks, NLP models struggle on out-of-domain inputs. Models trained on standard natural language inference datasets (Bowman et al., 2015) generalize poorly to other distributions (Thorne et al., 2018; Naik et al., 2018). Achieving high accuracy on out-of-domain data may not even be possible if the test data requires abilities that are not learnable from the training data (Geiger et al., 2019). Adversarially chosen ungrammatical text can also cause catastrophic errors (Wallace et al., 2019; Cheng et al., 2020). In all these cases, a more intelligent model would recognize that it should abstain on these inputs.

Traditional NLU systems typically have a natural ability to abstain. SHRDLU recognizes statements that it cannot parse, or that it finds ambiguous (Winograd, 1972). QUALM answers reading comprehension questions by constructing reasoning chains, and abstains if it cannot find one that supports an answer (Lehnert, 1977).

NLP systems deployed in real-world settings inevitably encounter a mixture of familiar and unfamiliar inputs. Our work provides a framework to study how models can more judiciously abstain in these challenging environments.


All code, data and experiments are available on the Codalab platform at


This work was supported by the DARPA ASED program under FA8650-18-2-7882. We thank Ananya Kumar, John Hewitt, Dan Iter, and the anonymous reviewers for their helpful comments and insights.


  • Blitzer et al. (2006) J. Blitzer, R. McDonald, and F. Pereira. 2006. Domain adaptation with structural correspondence learning. In Empirical Methods in Natural Language Processing (EMNLP).
  • Bowman et al. (2015) S. Bowman, G. Angeli, C. Potts, and C. D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).
  • Chen et al. (2019) A. Chen, G. Stanovsky, S. Singh, and M. Gardner. 2019. Evaluating question answering evaluation. In Workshop on Machine Reading for Question Answering (MRQA).
  • Chen et al. (2017) D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).
  • Chen and Durrett (2019) J. Chen and G. Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In North American Association for Computational Linguistics (NAACL).
  • Cheng et al. (2020) M. Cheng, J. Yi, H. Zhang, P. Chen, and C. Hsieh. 2020. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. In Association for the Advancement of Artificial Intelligence (AAAI).
  • Chow (1957) C. K. Chow. 1957. An optimum character recognition system using decision functions. In IRE Transactions on Electronic Computers.
  • Daume III (2007) H. Daume III. 2007. Frustratingly easy domain adaptation. In Association for Computational Linguistics (ACL).
  • Devlin et al. (2019) J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186.
  • Dong et al. (2018) L. Dong, C. Quirk, and M. Lapata. 2018. Confidence modeling for neural semantic parsing. In Association for Computational Linguistics (ACL).
  • Dunn et al. (2017) M. Dunn, L. Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv.
  • El-Yaniv and Wiener (2010) R. El-Yaniv and Y. Wiener. 2010. On the foundations of noise-free selective classification. Journal of Machine Learning Research (JMLR), 11.
  • Feinman et al. (2017) R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. 2017. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410.
  • Feng et al. (2019) J. Feng, A. Sondhi, J. Perry, and N. Simon. 2019. Selective prediction-set models with coverage guarantees. arXiv preprint arXiv:1906.05473.
  • Fisch et al. (2019) A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Workshop on Machine Reading for Question Answering (MRQA).
  • Gal and Ghahramani (2016) Y. Gal and Z. Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML).
  • Geifman and El-Yaniv (2017) Y. Geifman and R. El-Yaniv. 2017. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
  • Geiger et al. (2019) A. Geiger, I. Cases, L. Karttunen, and C. Potts. 2019. Posing fair generalization tasks for natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).
  • Gondek et al. (2012) D. C. Gondek, A. Lally, A. Kalyanpur, J. W. Murdock, P. A. Duboue, L. Zhang, Y. Pan, Z. M. Qiu, and C. Welty. 2012. A framework for merging and ranking of answers in DeepQA. IBM Journal of Research and Development, 56.
  • Hendrycks and Gimpel (2017) D. Hendrycks and K. Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR).
  • Hendrycks et al. (2019a) D. Hendrycks, K. Lee, and M. Mazeika. 2019a. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning (ICML).
  • Hendrycks et al. (2019b) D. Hendrycks, M. Mazeika, and T. Dietterich. 2019b. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR).
  • Jia and Liang (2017) R. Jia and P. Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
  • Joshi et al. (2017) M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL).
  • Ko et al. (2007) J. Ko, L. Si, and E. Nyberg. 2007. A probabilistic framework for answer selection in question answering. In North American Association for Computational Linguistics (NAACL).
  • Kwiatkowski et al. (2019) T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. 2019. Natural questions: a benchmark for question answering research. In Association for Computational Linguistics (ACL).
  • Lakshminarayanan et al. (2017) B. Lakshminarayanan, A. Pritzel, and C. Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS).
  • Lehnert (1977) W. Lehnert. 1977. The Process of Question Answering. Ph.D. thesis, Yale University.
  • Liang et al. (2018) S. Liang, Y. Li, and R. Srikant. 2018. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR).
  • Magnini et al. (2002) B. Magnini, M. Negri, R. Prevete, and H. Tanev. 2002. Is it the right answer? exploiting web redundancy for answer validation. In Association for Computational Linguistics (ACL).
  • Min et al. (2019) S. Min, E. Wallace, S. Singh, M. Gardner, H. Hajishirzi, and L. Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In Association for Computational Linguistics (ACL).
  • Naik et al. (2018) A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig. 2018. Stress test evaluation for natural language inference. In International Conference on Computational Linguistics (COLING), pages 2340–2353.
  • Ovadia et al. (2019) Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. 2019. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems (NeurIPS).
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12.
  • Peñas et al. (2009) A. Peñas, P. Forner, R. Sutcliffe, Álvaro Rodrigo, C. Forăscu, I. Alegria, D. Giampiccolo, N. Moreau, and P. Osenova. 2009. Overview of ResPubliQA 2009: Question answering evaluation over european legislation. In Cross Language Evaluation Forum.
  • Peñas et al. (2013) A. Peñas, E. Hovy, P. Forner, Álvaro Rodrigo, R. Sutcliffe, and R. Morante. 2013. QA4MRE 2011-2013: Overview of question answering for machine reading evaluation. In Cross Language Evaluation Forum.
  • Platt (1999) J. Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74.
  • Rajpurkar et al. (2018) P. Rajpurkar, R. Jia, and P. Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL).
  • Rajpurkar et al. (2016) P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • Rodriguez et al. (2019) P. Rodriguez, S. Feng, M. Iyyer, H. He, and J. Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. arXiv preprint arXiv:1904.04792.
  • Schölkopf et al. (1999) B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. 1999. Support vector method for novelty detection. In Advances in Neural Information Processing Systems (NeurIPS).
  • Smith and Gal (2018) L. Smith and Y. Gal. 2018. Understanding measures of uncertainty for adversarial example detection. In Uncertainty in Artificial Intelligence (UAI).
  • Talmor and Berant (2019) A. Talmor and J. Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Association for Computational Linguistics (ACL).
  • Thorne et al. (2018) J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. 2018. Fever: a large-scale dataset for fact extraction and verification. In North American Association for Computational Linguistics (NAACL).
  • Toplak et al. (2014) M. Toplak, R. Močnik, M. Polajnar, Z. Bosnić, L. Carlsson, C. Hasselgren, J. Demšar, S. Boyer, B. Zupan, and J. Stålring. 2014. Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models. Journal of Chemical Information and Modeling, 54.
  • Trischler et al. (2017) A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.
  • Wallace et al. (2019) E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Empirical Methods in Natural Language Processing (EMNLP).
  • Wang et al. (2007) M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the jeopardy model? a quasi-synchronous grammar for QA. In Empirical Methods in Natural Language Processing (EMNLP).
  • Winograd (1972) T. Winograd. 1972. Understanding Natural Language. Academic Press.
  • Yang et al. (2015) Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2013–2018.
  • Yang et al. (2018) Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP).
  • Yogatama et al. (2019) D. Yogatama, C. de M. d’Autume, J. Connor, T. Kocisky, M. Chrzanowski, L. Kong, A. Lazaridou, W. Ling, L. Yu, C. Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
  • Zadrozny and Elkan (2002) B. Zadrozny and C. Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694–699.

Appendix A

A.1 Dataset Sources

The OOD data used in calibrator training and validation was sampled from MRQA training data, and the SQuAD data for the same was sampled from MRQA validation data, to prevent train/test mismatch for the QA model (Fisch et al., 2019). The test data was sampled from a disjoint subset of the MRQA validation data.

A.2 Calibrator Features and Model

We ran experiments including question length and word overlap between the passage and question as calibrator features. However, these features did not improve the validation performance of the calibrator. We hypothesize that they may provide misleading information about a given example, e.g., a long question in SQuAD may provide more opportunities for alignment with the paragraph, making it more likely to be answered correctly, but a long question in HotpotQA may contain a conjunction, which is difficult for the SQuAD-trained model to extrapolate to.
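For concreteness, the two additional features mentioned above could be computed along the following lines. This is a minimal sketch: the lowercased whitespace tokenization, the normalization of overlap by question length, and the function name `extra_calibrator_features` are assumptions made for illustration, not the exact definitions used in our experiments.

```python
from typing import Dict


def extra_calibrator_features(question: str, passage: str) -> Dict[str, float]:
    """Compute question length and question-passage word overlap.

    Tokenization is lowercased whitespace splitting (an assumption).
    Overlap is the fraction of question tokens that also occur in the passage.
    """
    q_tokens = question.lower().split()
    p_vocab = set(passage.lower().split())
    overlap = sum(1 for tok in q_tokens if tok in p_vocab)
    return {
        "question_length": float(len(q_tokens)),
        "word_overlap": overlap / len(q_tokens) if q_tokens else 0.0,
    }


features = extra_calibrator_features(
    "who wrote hamlet", "Hamlet was written by Shakespeare"
)
```

Such features depend only on surface properties of the input, which is consistent with the hypothesis that their relationship to correctness differs across datasets.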

For the calibrator model, we experimented using an MLP and logistic regression. Both were slightly worse than Random Forest.

A.3 Outlier Detection for Selective Prediction

In this section, we study whether outlier detection can be used to perform selective prediction. We train an outlier detector to detect whether or not a given input came from the in-domain dataset (i.e., SQuAD) or is out-of-domain, and use its probability of an example being in-domain for selective prediction. The outlier detection model, training data (a mixture of and ), and features are the same as those of the calibrator. We find that this method does poorly, achieving an AUC of , Coverage at Accuracy of , and Coverage at Accuracy of . This shows that, as discussed in Section 2.3 and Section 5.2, this approach is unable to correctly identify the OOD examples that the QA model would get correct.

A.4 Underconfidence of MaxProb on SQuAD

As noted in Section 5.3, MaxProb is underconfident on SQuAD examples due to the additional correct answer options given at test time but not at train time. When the test time evaluation is restricted to allow only one correct answer, we find that MaxProb is well-calibrated on SQuAD examples (Figure 6). The calibration of the calibrator improves as well (Figure 7). However, we do not retain this restriction for the experiments, as it diverges from standard practice on SQuAD, and EM over multiple spans is a better evaluation metric since there are often multiple answer spans that are equally correct.

Figure 6: When considering only one answer option as correct, MaxProb is well-calibrated in-domain, but is still overconfident out-of-domain.
Figure 7: When considering only one answer option as correct, the calibrator is almost perfectly calibrated on both in-domain and out-of-domain examples.

A.5 Accuracy and Coverage per Domain

Table 1 in Section 5.2 shows the coverage of MaxProb and the calibrator over the mixed dataset while maintaining 80% accuracy and 90% accuracy. In Table 5, we report the fraction of these answered questions that are in-domain or OOD. We also show the accuracy of the QA model on each portion.

Our analysis in Section 5.3 indicated that MaxProb was overconfident on OOD examples, which we expect would make it answer too many OOD questions and too few in-domain questions. Indeed, at accuracy, of the examples MaxProb answers are in-domain, compared to for the calibrator. This demonstrates that the calibrator improves over MaxProb by answering more in-domain questions, which it can do because it is less overconfident on the OOD questions.

                     MaxProb               Calibrator
                Accuracy  Coverage     Accuracy  Coverage
At 80% Accuracy
  in-domain       92.45     61.59        89.09     67.57
  OOD             58.00     38.41        59.55     32.43
At 90% Accuracy
  in-domain       97.42     67.85        94.35     78.72
  OOD             71.20     32.15        72.30     21.28
Table 5: Per-domain accuracy and coverage values of MaxProb and the calibrator ( and ) at 80% and 90% Accuracy on .
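The quantities reported in Table 5 can be illustrated with a minimal sketch that computes the coverage achievable at a target accuracy and the in-domain share of the answered examples. It assumes examples are answered in order of decreasing confidence and ignores ties; the function name `coverage_at_accuracy` and the toy inputs are hypothetical and not taken from our data.

```python
from typing import List, Tuple


def coverage_at_accuracy(
    confidences: List[float],
    correct: List[bool],
    in_domain: List[bool],
    target_accuracy: float,
) -> Tuple[float, float]:
    """Return (coverage, in-domain share of answered examples).

    Coverage is the largest fraction of examples that can be answered, taken
    in order of decreasing confidence, with accuracy >= target_accuracy.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    n_correct = 0
    best_k = 0
    for k, i in enumerate(order, start=1):
        n_correct += correct[i]
        if n_correct / k >= target_accuracy:
            best_k = k  # keep the largest prefix that meets the target
    if best_k == 0:
        return 0.0, 0.0
    answered = order[:best_k]
    in_share = sum(in_domain[i] for i in answered) / best_k
    return best_k / len(order), in_share


coverage, in_share = coverage_at_accuracy(
    confidences=[0.9, 0.8, 0.7, 0.6, 0.5],
    correct=[True, True, False, True, False],
    in_domain=[True, False, True, True, False],
    target_accuracy=0.75,
)
```

In this toy example four of the five examples are answered, and three of those four are in-domain; the OOD share is one minus the in-domain share, which is why the two coverage entries in each block of Table 5 sum to 100.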