Retrieval-guided Counterfactual Generation for QA

by Bhargavi Paranjape, et al.

Deep NLP models have been shown to learn spurious correlations, leaving them brittle to input perturbations. Recent work has shown that counterfactual or contrastive data – i.e. minimally perturbed inputs – can reveal these weaknesses, and that data augmentation using counterfactuals can help ameliorate them. Proposed techniques for generating counterfactuals rely on human annotations, perturbations based on simple heuristics, and meaning representation frameworks. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model's robustness to local perturbations.




1 Introduction

Models for natural language understanding (NLU) that outperform humans on standard benchmarks are known to perform poorly under a multitude of distributional shifts (Jia and Liang (2017); Naik et al. (2018); McCoy et al. (2019), inter alia) due to over-reliance on spurious correlations or dataset artefacts. Recent work Kaushik et al. (2020); Gardner et al. (2020) proposes the construction of contrast or counterfactual data — minimal yet meaningful perturbations to test examples that are created by humans to flip the task label — to expose gaps in a model’s local decision boundaries. For instance, perturbing the movie review “A real stinker, one out of ten!" to “A real classic, ten out of ten!" changes its sentiment label. Kaushik et al. (2020, 2021); Wu et al. (2021); Geva et al. (2021) show that augmenting with counterfactual data (CDA) improves out-of-domain generalization and robustness to small input perturbations. Consequently, several techniques have been proposed for the automatic generation of counterfactual data for several downstream tasks Wu et al. (2021); Ross et al. (2021b, a); Bitton et al. (2021); Geva et al. (2021); Asai and Hajishirzi (2020); Mille et al. (2021).

Figure 1: Retrieve-Generate-Filter approach to generate counterfactual queries for Natural Questions Kwiatkowski et al. (2019) using an open-domain retrieval system, question generation, and post-hoc filtering.

In this paper, we focus on counterfactuals for both reading comprehension and open-domain question answering (e.g. Rajpurkar et al., 2016; Kwiatkowski et al., 2019), where inputs consist of a question, and optionally a context passage, and outputs are short answer spans. We define counterfactuals in this setting to be semantically proximal but distinct questions, and distinct contexts where appropriate. Question answering is particularly challenging for current methods of counterfactual data generation. Consider the original query in Figure 1, “Who is the captain of the Richmond Football Club?" Creating a counterfactual query that changes the answer requires background knowledge. For example, the perturbation “Who captained Richmond’s women’s team?" requires knowledge about the club’s alternate teams, and “Who was the captain of RFC in 1998?" requires knowledge about the time-sensitive nature of the original question. More generally, given different questions, different semantic dimensions become available for perturbation. Some plausible edits — such as “Who captained the club in 2050?" — can also encode false premises or represent unanswerable questions.

We develop a simple, yet effective technique to address these challenges: Retrieve, Generate, and Filter (RGF). The core intuition of the approach is that state-of-the-art retrieve-and-read models that have been developed for such queries Karpukhin et al. (2020); Guu et al. (2020) provide a rich resource for the task’s desiderata: background knowledge about the query, diverse top-k retrieval results that can serve as counterfactual pivots on different latent semantic dimensions, and alternate answer candidates that represent interesting target labels. Using a retrieve-and-read QA system trained on a QA task such as Natural Questions (Kwiatkowski et al., 2019), we generate a set of candidate passages and answers which are closely related to an original question, but differ from it in interesting ways (Figure 1). We then use a sequence-to-sequence question generation model to generate corresponding questions to these passages and answers Alberti et al. (2019). The results are fully labeled, and can be used directly to augment training data or filtered post-hoc for analysis using simple heuristics or meaning representations such as QED (Lamm et al., 2021).

Our method generates highly diverse counterfactuals covering a range of semantic phenomena (§4), including many of those found in existing work that relies on meaning representation pivots (Ross et al., 2021b; Geva et al., 2021) or human generation (Bartolo et al., 2020; Gardner et al., 2020). Compared to alternative sources of synthetic data (§5.1), training augmented with RGF data leads to increased performance on a variety of settings (§5.2, §5.3), including out-of-domain Fisch et al. (2019) and contrast evaluation sets Bartolo et al. (2020); Gardner et al. (2020), while maintaining in-domain performance. Additionally, we introduce a measure of pairwise consistency, and show that RGF leads to significant improvements in model robustness to a range of local perturbations (§6.1).

While we focus especially on question answering in this paper, for which retrieval components are readily available, we note that the RGF paradigm is quite general, and could potentially be applicable to counterfactual generation for a wide range of other tasks.

2 Related Work

2.1 Counterfactual Generation

There has been considerable interest in developing challenging evaluation sets for NLU that evaluate models on a wide variety of counterfactual input perturbations. Gardner et al. (2020); Khashabi et al. (2020); Kaushik et al. (2020); Ribeiro et al. (2020) use humans to create these perturbations. Adversarial data collection Bartolo et al. (2020) pits humans against models-in-the-loop in order to create perturbations that fool existing models. We note that manually constructing counterfactuals for information-seeking queries like Natural Questions (NQ; Kwiatkowski et al., 2019) may require showing background knowledge to annotators to spur their creativity. Human annotation is also harder to scale up for data augmentation.

This has led to increased interest in creating automatic counterfactual data for evaluating out-of-distribution generalization Bowman and Dahl (2021) and for counterfactual data augmentation Geva et al. (2021); Longpre et al. (2021). Some work focuses on using heuristics like first-order logic Asai and Hajishirzi (2020), swapping superlatives and nouns Dua et al. (2021), or targeting specific data splits Finegan-Dollak and Verma (2020). Webster et al. (2020) use templates to create large-scale counterfactual data for pre-training to reduce gender bias. More recent work has specifically focused on using meaning representation frameworks and structured inputs to automatically perturb inputs – grammar formalisms Li et al. (2020), semantic role labeling Ross et al. (2021b), structured image representations like scene graphs Bitton et al. (2021), and query decompositions in multi-hop reasoning datasets Geva et al. (2021). Ye et al. (2021) and Longpre et al. (2021) perturb contexts instead of questions by swapping out all mentions of a named entity. These techniques either create perturbations where the change in label can be derived easily (e.g. negating a boolean question flips the yes/no answer) or require a round of human re-labeling of the data. They may also be difficult to apply to tasks like NQ, where pre-defined schemas can have difficulty covering the range of semantic perturbations that may be of interest.

2.2 Data Augmentation

More standard data augmentation techniques, in which the synthetic data bears no instance-level relation to the original data, have shown only weak improvements in robustness and out-of-domain generalization Bartolo et al. (2021); Lewis et al. (2021). In this work, we analyze the effectiveness of CDA against such augmentation techniques.

Joshi and He (2021) systematically analyze the idea of bias in CDA. They find that methods which limit the structural and semantic space of perturbations can lead to biases towards a small set of perturbations, producing models that cannot generalize to unseen perturbation types. This problem is exacerbated in the question answering scenario, where there can be multiple semantic dimensions to edit. In our work, we mitigate this issue by generating semantically diverse counterfactuals that cover many different phenomena, constraining to specific perturbation types only for evaluation and analysis.

3 RGF: Counterfactuals for Information-seeking Queries

A counterfactual is usually taken to be a perturbation of an original input along a particular latent variable that leaves other aspects unchanged. In this work, our goal is answer-flipping counterfactuals for question answering (Kwiatkowski et al., 2019). We consider counterfactuals that transform the triple (q, c, a), consisting of the question q, context passage c, and short answer a in the passage, to a counterfactual triple (q', c', a'). Insofar as we allow the context to change as well, this notion of counterfactual is somewhat less strict about minimality than is typical. However, we find it allows for a significantly more diverse set of counterfactuals than could be achieved by leaving the context unchanged (§C.1). The alternate approach of making targeted edits to the context is challenging due to the need to model complex discourse and alter factual knowledge. Moreover, in the open-domain setting where the context is treated as a latent variable, our counterfactuals reduce to (q', a') pairs.

The task poses some unique challenges, such as the need for background knowledge to identify semantic dimensions that can be perturbed, ensuring sufficient semantic diversity in edits to the question, and avoiding questions with false premises or no viable answers. Ensuring (or characterizing) minimality can also be a challenge, as minimal surface-form changes can lead to significant semantic changes. We introduce a generalized paradigm — Retrieve, Generate and Filter — to tackle these challenges. In the rest of this section, we give an overview of the RGF method and describe each component in detail.

3.1 Overview of RGF

An outline of the RGF method is given in Figure 1. Given an input example x = (q, c, a) consisting of a question q, a context paragraph c, and the corresponding answer a, RGF generates a set of new examples from the local neighborhood around x. We first use an open-domain retrieve-and-read model to retrieve counterfactual contexts c' and alternate answers a' such that a' ≠ a. These alternate contexts and answers potentially encode background knowledge and latent semantic dimensions that can be used to construct the counterfactual question. We use a sequence-to-sequence question generation model p(q' | c', a'), trained on the original task data (such as NQ), to generate new questions q' from the context and answer candidates (c', a'). This yields fully-labeled triples (q', c', a') (albeit with some noise, see §4), avoiding the problem of unanswerable or false-premise questions. Since the a' are candidate answers to the original q, these triples are closely related to the original input. We do not explicitly constrain these triples to be minimal counterfactuals during the generation step, but can use post-hoc filtering to reduce noise, select minimal candidates, or select for specific semantic phenomena based on the relation between q and q'.
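The overall loop can be sketched as follows. This is a minimal illustration with hypothetical helper callables — `retrieve` and `generate_question` stand in for the REALM retriever and the T5 question generator described below — not the authors' implementation:

```python
def rgf_generate(question, gold_answers, retrieve, generate_question, k=5):
    """Sketch of the Retrieve-Generate-Filter loop.

    retrieve(question, k)              -> list of (context, answer) candidates
    generate_question(context, answer) -> a generated question string
    Both callables are stand-ins for the retriever and generator models.
    """
    counterfactuals = []
    for context, answer in retrieve(question, k):
        # Keep only label-flipping candidates: the alternate answer must
        # differ from every gold answer of the original example.
        if answer in gold_answers:
            continue
        new_question = generate_question(context, answer)
        counterfactuals.append((new_question, context, answer))
    return counterfactuals
```

Post-hoc filtering (noise and minimality, §3.4) then operates on the returned triples.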

3.2 Retrieval

We use the REALM retrieve-and-read question answering framework Guu et al. (2020), which consists of a BERT-based bi-encoder for dense retrieval, a dense index of Wikipedia passages, and a BERT-based answer-span extraction model for reading comprehension, trained on Natural Questions. Given an information-seeking query q, REALM outputs a ranked list of contexts and answers within those contexts, {(c'_1, a'_1), ..., (c'_k, a'_k)}. These alternate contexts and answers provide relevant yet diverse background information for constructing counterfactual questions. For instance, in Figure 1, the question “Who is the captain of the Richmond Football Club" with answer “Trent Cotchin" also returns other contexts and alternate answers like “Jeff Hogg" (“Who captained the team in 1994"), “Jess Kennedy" (“Who captained the women’s team") and “Steve Morris" (“Who captained the reserve team in the VFL league"). Retrieved contexts can also capture information about closely related or ambiguous entities. For instance, the question “who wrote the treasure of the sierra madre" retrieves passages about the original book Sierra Madre, its movie adaptation, and a passage about the Battle of Monte de las Cruces, fought in the Sierra de las Cruces mountains. This background knowledge enables contextualized counterfactual editing without specifying the type of perturbation or semantic dimension of change in advance. To focus on label-transforming counterfactuals, we retain all (c'_i, a'_i) where a'_i is not one of the gold answers from the original NQ example.

3.3 Question Generation

This component generates questions corresponding to the answer-context pairs (c', a'). We use a T5 (Raffel et al., 2020) model fine-tuned on (q, c, a) triples from Natural Questions, using the context passage as input with the answer span marked by special tokens. We use the trained model to generate questions for each of the retrieved alternate contexts and answers (c'_i, a'_i). For each pair, we use beam decoding to generate 15 different questions q'. We measure the fluency and correctness of generated questions in §4. Model implementation details are available in Appendix A.
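The paper states only that the answer is marked with special tokens in the generator's input; a minimal sketch of such marking, with assumed sentinel strings `<ans>` and `</ans>` (the actual token strings are not specified in the text), might look like:

```python
def mark_answer(context, answer, open_tok="<ans>", close_tok="</ans>"):
    """Wrap the first occurrence of the answer span in sentinel tokens.

    The sentinel strings are an assumption for illustration; the paper
    only says the answer is marked with special tokens.
    """
    start = context.find(answer)
    if start < 0:
        raise ValueError("answer span not found in context")
    end = start + len(answer)
    return context[:start] + open_tok + answer + close_tok + context[end:]
```

The marked passage is then fed to the fine-tuned T5 model as its input sequence.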

3.4 Standard Filtering

Noise Filtering

The question generation model can be noisy, producing questions that cannot be answered given c' or for which a' is an incorrect answer. Round-trip consistency Alberti et al. (2019); Fang et al. (2020) uses an existing QA model to attempt to answer the generated questions, ensuring that the predicted answer is consistent with the target answer prompted to the question generator. We use an ensemble of six T5-based reading-comprehension models, trained on Natural Questions using different random seeds, and keep a generated triple (q', c', a') only if at least 5 of the 6 models agree on the answer. This discards about 5% of the generated data, although some noise still remains; see §4 for further discussion. Details of ensemble filtering can be found in Appendix A.
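A round-trip consistency filter of this kind can be sketched as follows, with `readers` standing in for the six-model ensemble (the callables and names are illustrative, not the released implementation):

```python
def roundtrip_keep(question, context, target_answer, readers, min_agree=5):
    """Round-trip consistency filter.

    Keep a generated triple only if at least `min_agree` of the ensemble
    readers recover the target answer. Each reader is a callable
    (question, context) -> predicted answer string.
    """
    votes = sum(1 for read in readers
                if read(question, context) == target_answer)
    return votes >= min_agree
```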

Original (NQ): who is the captain of richmond football club? | Predicate: who is the captain of X?
CF1 (Reference Change): who is the captain of richmond’s vfl reserve team? | Predicate: who is the captain of X?
CF2 (Predicate Change): who wears number 9 for richmond football club? | Predicate: who wears Y for X?
CF3 (Predicate and Reference Change): who did graham negate in the grand final last year? | Predicate: who did X negate in Y last year?
Table 1: Categorization of counterfactual questions based on QED decomposition of questions into reference and predicate changes. The original reference “Richmond Football Club" changes in CF1 and CF3. The predicate “Who is the captain" changes in CF2 and CF3.

Filtering for Minimality

Unlike prior work on generating counterfactual perturbations, we do not explicitly control for the type of semantic shift or perturbation in the generated questions. Using a finite set of heuristics to control generation can potentially introduce bias in augmented data Joshi and He (2021).

Instead, we use post-hoc filtering over generated counterfactual questions to control for minimality of the perturbation. We define a filtering function that categorizes the semantic shift or perturbation in q' with respect to q. The simplest version of this function is the word-level edit (Levenshtein) distance between q and q'. For instance, the questions “when is marvel’s cloak and dagger coming out ?" and “when was marvel’s cloak and dagger announced ?" have a word-edit distance of 3. After applying ensemble-based noise filtering, for each original triple (q, c, a) we select the generated (q', c', a') with the smallest word-edit distance between q and q' such that a' differs from the gold answers. We use this simple heuristic to create large-scale counterfactual training data for the augmentation experiments (§5). Over-generating potential counterfactuals based on latent dimensions identified in retrieval, and then applying a simple filtering heuristic, avoids adding perturbation-type bias to the training data.
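A word-level Levenshtein distance and the minimal-candidate selection it supports can be sketched as below (hypothetical helper names; the example distance of 3 matches the question pair quoted above):

```python
def word_edit_distance(q1, q2):
    """Word-level Levenshtein distance between two questions."""
    a, b = q1.split(), q2.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete wa
                           cur[j - 1] + 1,              # insert wb
                           prev[j - 1] + (wa != wb)))   # substitute
        prev = cur
    return prev[-1]

def select_minimal(original_q, gold_answers, candidates):
    """From generated (question, answer) candidates whose answer differs
    from the gold answers, keep the question closest to the original."""
    keep = [(q, a) for q, a in candidates if a not in gold_answers]
    return min(keep, key=lambda qa: word_edit_distance(original_q, qa[0]))
```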

Reference Change (TAILOR; Ross et al., 2021b): RGF 50, Gold* 35
  O: when did lebron_james join the Miami Heat ?
  C: When did lebron_james come into the league ?
Predicate Change (AmbigQA; Min et al., 2020): RGF 30, Gold* 30
  O: who won the war between india and pakistan in 1948
  C: who started the war between india and pakistan in 1948
Question Disambiguation (AmbigQA; Min et al., 2020): RGF 13, Gold* 2
  O: when does the walking dead season 8 start ?
  C: when does walking dead season 8 second half start ?
Negation (Contrast Sets; Gardner et al., 2020): RGF 1, Gold* -
  O: what religion observes the sabbath day
  C: what religion does not keep the sabbath day
Table 2: Patterns of semantic change between original queries (O) and RGF counterfactuals (C), corresponding to patterns common in prior methods for counterfactual generation. We randomly selected 100 generated examples and manually categorized the relation of q' to the original q. Counts show the proportion of each type of semantic change in RGF and in the Gold Agen-Qgen baseline (§5.1), which does not use retrieval.

3.5 Semantic Filtering for Evaluation

To better understand the types of counterfactuals generated by RGF, we can apply additional filters based on query meaning representations to categorize counterfactual pairs. Meaning representations provide a way to decompose a query into semantic units and to categorize pairs based on which of these units are perturbed. In this work, we employ the QED formalism for explanations in question answering Lamm et al. (2021) as our query meaning representation. QED explanations segment the question into a predicate template and a set of reference phrases. For example, the question “Who is captain of richmond football club" decomposes into one question reference, “richmond football club", and the predicate “Who is captain of X". A few example questions and their corresponding QED decompositions are illustrated in Table 1.

We use these query decompositions to identify whether a counterfactual pair represents a change in question predicate, reference, or both. Concretely, for each question pair (q, q') in the data, we identify predicates and references using a T5-based model finetuned on the QED dataset to perform explanation generation (see Lamm et al. (2021) for details). We use exact string match to identify reference changes. As predicates can often differ slightly in phrasing (who captained vs. who is captain), we take two predicates to match if they share a prefix of more than 10 characters. For instance, “Who is the captain of Richmond’s first ever women’s team?" has the same predicate as “Who is the captain of the Richmond Football Club", but the two questions have different references. With these definitions in hand, we filter generated questions into three perturbation categories — reference change, predicate change, and both.
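Assuming QED predicates and references have already been extracted, the categorization described above can be sketched as follows (illustrative names, using the more-than-10-character prefix rule from the text):

```python
def _common_prefix_len(s1, s2):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return n

def categorize(orig_pred, orig_refs, cf_pred, cf_refs, min_prefix=10):
    """Classify a counterfactual pair from its QED decomposition.

    Predicates are taken to match if they share a prefix longer than
    `min_prefix` characters; references are compared by exact match.
    """
    pred_same = _common_prefix_len(orig_pred, cf_pred) > min_prefix
    refs_same = set(orig_refs) == set(cf_refs)
    if pred_same and not refs_same:
        return "reference change"
    if not pred_same and refs_same:
        return "predicate change"
    if not pred_same and not refs_same:
        return "predicate and reference change"
    return "no change"
```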

4 Intrinsic Evaluation

Figure 2: Context-specific semantic diversity of perturbations achieved by RGF on an NQ Question. The multiple latent semantic dimensions identified (arrows in the diagram) fall out of our retrieval-guided approach.

Following desiderata from Wu et al. (2021) and Ross et al. (2021b), we evaluate our RGF data along four qualitative evaluation measures: fluency, correctness, minimality, and directionality.


Fluency measures whether the generated text is grammatically correct and semantically meaningful. Since the question generation model is based on a large pretrained language model, fluency is high: when the authors annotated a subset of 100 generated examples, roughly 96% of the generated questions were deemed fluent.


Correctness measures whether the generated question q' and the context-answer pair (c', a') are aligned, i.e. whether q' is answerable given c' and a' is that answer. We quantify correctness in the generated dataset by manually annotating a sample of 100 triples (see Appendix B). The proportion of noise drops from 30% before noise filtering to 25% after noise filtering with the model ensemble (§3.4).


Minimality

We do not explicitly constrain the generated question to be minimally close to the original question. Since candidate passages and answers are in the retrieval neighborhood of the original question, we expect the new question to be closely related to the original, but to differ along a few semantic dimensions that lead to a change of answer. In Table 2, we annotate a sample of RGF counterfactual question pairs (q, q'), and find that the majority can be viewed as minimal with respect to a semantic dimension such as a reference change, predicate change, or disambiguation.

Directionality/Semantic Diversity

In Table 2, we show examples of semantic changes that occur in our data and that also occur in prior work Gardner et al. (2020); Ross et al. (2021b); Min et al. (2020), including reference changes, predicate changes, negations, syntactic changes, question expansions, disambiguations, and question contractions. Note that we do not use any specialized semantic framework or structured representation to achieve these transformations. In Figure 2, we show the different semantic dimensions that are changed for the question “Who won the women’s wimbledon tournament in 2017". The ontology of semantic dimensions is highly question-specific and would be difficult to specify a priori in a meaning representation framework. The semantic diversity afforded by retrieval is thus greater than could be achieved using a small set of perturbation heuristics alone.

5 Data Augmentation

Unlike many counterfactual generation methods, RGF natively creates fully-labeled examples which can be used directly in counterfactual data augmentation (CDA). We augment the original NQ training set with additional examples from RGF. We explore two experimental settings, reading comprehension (§5.2) and open-domain QA (§5.3), and compare models trained using RGF data to those trained only on NQ, as well as with two alternative methods of synthetic data generation (§5.1).

5.1 Baselines

In the abstract, our model for generating counterfactuals specifies a way of selecting contexts c' given an original question q, a way of selecting answers a' within those contexts, and a way of generating questions from them, p(q' | c', a'). In RGF, we use learned retrieval to identify contexts encoding contextually relevant semantic dimensions to guide question edits. We experiment with two baselines that relax this assumption: in the first (Rand. Agen-Qgen) we use random passage selection, and in the second (Gold Agen-Qgen) we use the gold NQ context associated with the original instance.

Random Passage (Rand. Agen-Qgen)

Here, c' is a randomly chosen paragraph from the Wikipedia index with no explicit relation to the original question. This setting essentially simulates generating more data from the original data distribution of Natural Questions. We observe that questions and relevant contexts in NQ have considerable distributional bias over Wikipedia; for instance, there are a significant number of articles about sports teams, books, songs, etc. To ensure that the random sampling of Wikipedia paragraphs follows a similar distribution, we employ the learned passage selection model from Lewis et al. (2021), which is the basis of closely related (non-counterfactual) work on data augmentation for the SQuAD reading comprehension dataset Bartolo et al. (2021).

Gold Context (Gold Agen-Qgen)

Here, c' is the short-answer-containing gold passage for the original question from the Natural Questions training set. With this baseline, we specifically measure the effectiveness of our method in expanding local background knowledge around the original question through retrieval, which we hypothesize should lead to more diversity in the generated counterfactuals. For instance, the gold context for the question in Figure 1 is a Wikipedia table of past and present captains of the men’s club. While this context can yield a question like “Who captained the Richmond Football Club in 1994", it cannot generate a question about the women’s team or the reserve team.

Answer Generation for Baselines

For both the above baselines for context selection, we also need to select spans in the new passage that are likely to be answers for a potential counterfactual question. Bartolo et al. (2021) find that an answer detection model trained to generate answers given only the context outperforms baselines that use part-of-speech taggers and span extractor models. Beam-decoding using such a model produces a diverse set of answer spans with potential overlaps. We use a T5 (Raffel et al., 2020) model fine-tuned for question-independent answer selection on NQ, and select the top 15 candidates from beam search. To avoid simply repeating the original question, we only retain answer candidates which do not match the original NQ answers for that example. Details about the answer-generation model can be found in Appendix A. These alternate generated answer candidates and associated passages are then used for question generation as in RGF (§3.3).

5.2 Reading Comprehension

Dataset | Size | NQ | TriviaQA | HotpotQA | BioASQ | AQA | AmbigQA
Original NQ | 90K | 70.40 | 14.69 | 51.03 | 37.30 | 26.30 | 46.55
Gold Agen-Qgen | 90K + 90K | 70.60 | 13.24 | 45.59 | 31.98 | 20.50 | 43.45
Rand. Agen-Qgen | 90K + 90K | 71.08 | 13.87 | 45.26 | 33.64 | 22.50 | 42.04
RGF (REALM-Qgen) | 90K + 90K | 70.68 | 16.99 | 53.41 | 44.88 | 28.20 | 47.61
Table 3: Exact Match results for the reading comprehension task on the in-domain NQ development set, out-of-domain datasets from the MRQA 2019 Challenge Fisch et al. (2019), Adversarial QA Bartolo et al. (2020), and AmbigQA Min et al. (2020). RGF improves out-of-domain and challenge-set performance compared to the other data augmentation baselines.

In the reading comprehension (RC) setting, the input consists of the question and context, and the task is to identify an answer span in the context. Thus, the full triple (q', c', a'), consisting of the retrieved passage c', the generated and filtered question q', and the alternate answer a' in c', is used for augmentation during training. For RGF and for the baselines, we augment with 90K examples, thus training on 2x the data of NQ alone.

Experimental Setting

Here, we assume that the context c is changed to c' along with the minimal change in the question. While c and c' are still topically related, they may be two distinct paragraphs in Wikipedia. We finetune a T5 (Raffel et al., 2020) model for reading comprehension, using the question appended to the context as input; additional training details are provided in Appendix A. We evaluate domain generalization of our RC models on three evaluation sets from the MRQA 2019 Challenge Fisch et al. (2019). We also measure performance on evaluation sets consisting of counterfactual or perturbed versions of Wikipedia-based RC datasets, including AQA (adversarially-generated SQuAD questions) and human-authored counterfactual examples (contrast sets; Gardner et al., 2020) from the QUOREF dataset Dasigi et al. (2019). We also evaluate on the set of disambiguated queries in AmbigQA Min et al. (2020), which by construction are minimal edits to queries from the original NQ.

Dataset | Size | NQ | TriviaQA | AmbigQA | SQuAD v1.0 | TREC
Original | 90K | 37.09 | 26.75 | 22.43 | 14.25 | 35.66
Gold Agen-Qgen | 90K + 90K | 37.86 | 27.02 | 23.65 | 15.01 | 36.04
Rand. Agen-Qgen | 90K + 90K | 38.40 | 29.87 | 24.13 | 14.55 | 37.11
RGF (REALM-Qgen) | 90K + 90K | 39.01 | 32.32 | 26.98 | 16.94 | 38.76
Table 4: Exact Match results on open-domain QA datasets (TriviaQA, AmbigQA, SQuAD and TREC) for data augmentation with RGF counterfactuals and baselines. Open-domain improvements are larger than in the RC setting, perhaps because the more difficult task benefits more from additional data.


Results

We report exact-match scores in Table 3; F1 scores follow a similar trend. We observe only limited improvements on the in-domain NQ development set, but significant gains from CDA with RGF data on out-of-domain and challenge-set evaluations, compared both to the original NQ model and to the Gold and Random baselines. RGF improves by 3 (TriviaQA) to 11 (HotpotQA) EM points over the next best augmentation baseline, Random Agen-Qgen. Note that all three techniques have a similar proportion of noise (Appendix B), so CDA's benefits may be attributed to improving the model's ability to learn more robust features for reading comprehension and reducing reliance on spurious correlations or dataset-specific artefacts. RGF's superior performance compared to the Gold Agen-Qgen baseline is especially interesting, since the latter also generates topically related questions. We observe that filtered RGF counterfactuals are more closely related to the original question than those of this baseline (Figure 5 in the Appendix), since q' is directly dependent on q.

5.3 Open-domain Question Answering

In this setting, only the question is provided as input. The pair (q', a'), consisting of the generated and filtered question q' and the alternate answer a', is used for augmentation. This setting is closer to the more prevalent definition of a counterfactual: a minimal perturbation of the input (§4).

Experimental Setting

We use the method of Guu et al. (2020) to finetune REALM on (q, a) pairs from NQ. REALM consists of a retriever model that selects the top-k passages relevant to the query; the reader model then selects answer spans in these passages. End-to-end training updates the reader model and the query-document encoders of the retriever module. We use the implementation of Guu et al. (2020); additional training details can be found in Appendix A. We evaluate domain generalization on popular open-domain benchmarks: TriviaQA Joshi et al. (2017), SQuAD Rajpurkar et al. (2016), the Curated TREC dataset Baudiš and Šedivỳ (2015), and disambiguated queries from AmbigQA Min et al. (2020).


In the open-domain setting, we observe an improvement of 2 EM points over the original model even in-domain on Natural Questions (Table 4), while also improving significantly compared to other data augmentation techniques. RGF improves over the next best baseline, Random Agen-Qgen, by up to 6 EM points (on TriviaQA). We hypothesize that this setting is considerably harder than the RC setting, so any data augmentation is beneficial to the model. CDA is especially helpful because counterfactual data for very similar queries helps the model learn robust query and document representations that encode important distinguishing features, improving the accuracy of dense retrieval.

6 Analysis

To better understand how CDA improves the model, we introduce a measure of local consistency (§6.1) to measure model robustness, and perform a stratified analysis (§6.2) to show the benefits of the semantic diversity available from RGF. In Appendix C.2, we also report results on a low-resource experimental setting.

6.1 Local Robustness

Consistency         Size        AQA     AmbigQA  QUOREF Contrast  RGF Ref. Change  RGF Pred. Change
Original NQ         90K         58.47   46.67    39.66            60.23            52.80
Gold Agen-Qgen      90K + 90K   59.27   50.23    42.83            65.14            55.65
Rand. Agen-Qgen     90K + 90K   55.45   49.06    41.93            61.31            45.16
RGF (REALM-Qgen)    90K + 90K   63.29   51.61    46.42            79.43            65.13
Table 5: Consistency results on the reading comprehension task for datasets containing pairs of counterfactual questions. Consistency measures the proportion of counterfactual examples answered correctly when the original is also predicted correctly. RGF leads to better consistency under many different perturbation types.

Compared to synthetic data methods such as PAQ (Lewis et al., 2021), RGF generates counterfactual examples that are paired with the original inputs and concentrated in local neighborhoods around them (Figure 2). As such, we hypothesize that augmentation with this data should specifically improve local consistency, i.e. how the model behaves under minimal semantic perturbations of the input.

Experimental Setting

We explicitly measure how well a model's local behavior respects perturbations to the input. Specifically, if a model correctly answers the original question, how often does it also correctly answer the counterfactual? We define pairwise consistency as accuracy over the counterfactuals, conditioned on correct predictions for the original examples:

Consistency = P[ model(q′) = a′ | model(q) = a ],

where (q, a) is an original question-answer pair and (q′, a′) its counterfactual.
To measure consistency, we construct validation sets consisting of paired examples: one original, and one counterfactual. We use QED to categorize our data, as described in Section 3.5. Specifically, we create two types of pairs: (a) a change in reference, where the question predicate remains fixed, and (b) a change in predicate, where the original reference(s) are preserved. (A predicate change may introduce additional reference slots, as in example CF2 of Table 1, so we require that the new reference set is a superset of that of the original question.) We create a denoised evaluation set by first selecting RGF examples for predicate or reference change, then manually discarding incorrect triples (§4) until we have 500 examples of each type. We also construct pairwise versions of AQA, AmbigQA and the QUOREF contrast set for our analysis. For AmbigQA, we pair two disambiguated questions, and for QUOREF contrast, we pair originals with their human-authored counterfactuals. AQA consists of human-authored adversarial questions which are not explicitly paired with original questions; we create pairs by randomly selecting an original question and a generated question from the same passage.
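Given per-example correctness flags on such paired validation sets, the pairwise consistency metric reduces to a conditional accuracy; a minimal sketch (the boolean-pair representation is illustrative):

```python
def pairwise_consistency(pairs):
    """pairs: list of (orig_correct, cf_correct) booleans, one per (q, q') pair.

    Consistency is the fraction of counterfactuals answered correctly,
    restricted to pairs where the original question was answered correctly.
    """
    conditioned = [cf_correct for orig_correct, cf_correct in pairs if orig_correct]
    return sum(conditioned) / len(conditioned) if conditioned else 0.0

# Toy example: the original is correct in 3 of 4 pairs; of those 3, the
# counterfactual is correct in 2, so consistency = 2/3.
pairs = [(True, True), (True, False), (True, True), (False, True)]
print(round(pairwise_consistency(pairs), 3))  # 0.667
```

Note that pairs whose original question is answered incorrectly are excluded entirely, so consistency isolates local robustness from overall accuracy.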


We improve by up to 20 points (nearly a 50% reduction in error compared to the original NQ model) on RGF evaluation data with QED-based filtering, and 5-7 points on existing counterfactual sets like AQA, AmbigQA and QUOREF-contrast (Table 5). The Gold Agen-Qgen baseline (which contains topically related queries about the same passage) also improves consistency over the original model compared to the Random Agen-Qgen baseline. Consistency improvements on AQA, AmbigQA and QUOREF are especially noteworthy, since they suggest an improvement in robustness to local perturbations that is independent of other confounding distributional similarities between training and evaluation data. In Appendix C, we find that consistency similarly improves in the open-domain setting.

Consistency   Val Ref.  Val Pred.
Original NQ   60.23     52.80
Train Ref.    75.41     57.59
Train Pred.   70.10     65.83
Train All     79.43     65.13
Table 6: Results of sharding training data based on predicate and reference pairing between the original and counterfactual questions. Training data size for each category is 90k NQ + 52k generated. Overall, training with all RGF data robustly improves consistency across both types of counterfactual perturbations.
Consistency   Val 1-4  Val 5-10  Val >10
Train 1-4     71.02    67.55     64.78
Train 5-10    68.89    68.98     63.92
Train >10     65.78    66.33     65.33
Train All     72.34    67.82     65.12
Table 7: Results of sharding training data based on edit distance between the original and counterfactual questions. Training dataset size for each bin is 90k NQ + 167k generated. Once again, training with all RGF data robustly improves consistency across different amounts of perturbation.

6.2 Effect of Perturbation Type

In recent work, Joshi and He (2021) show that CDA is most effective when the types of perturbations used in training align well with those in evaluation. In particular, they found that on tasks such as Natural Language Inference (NLI), CDA with narrowly-focused perturbation types can actually lead to worse performance in unaligned cases, as the distribution of the CDA data introduces new biases into the model.

We test whether this is still true for RGF data, which covers a diverse range of perturbation types. To do so, we shard counterfactual training data to perform CDA with a more narrow set of perturbation types.

Experimental Setting

We experiment with semantic (i.e. QED-based) and surface-form (i.e. edit distance-based) sharding. In both cases, we over-generate, starting with 20 candidate counterfactuals for each original example, to ensure there are enough examples matching the relevant heuristic.

For QED-based experiments, we shard training examples into two categories based on whether the original and counterfactual questions have the same reference (predicate change) or the same predicate (reference change), as defined in §3.5. We evaluate on the two evaluation sets of predicate and reference changes from §6.1.

For edit distance-based experiments, we shard training examples into three categories by binning the word-level edit distance between the original and counterfactual questions into three ranges: 1-4, 5-10, and >10. We similarly categorize RGF data generated for the NQ development set into the same three bins. Evaluation sets for the edit-distance experiments were not manually noise-filtered. We again report consistency of the reading comprehension model.
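The sharding step above can be sketched directly: compute a word-level Levenshtein distance between the original and counterfactual questions and assign the pair to one of the three bins. The tokenization (whitespace split) and bin boundaries matching Table 7 are our reading of the setup:

```python
def word_edit_distance(q1, q2):
    """Levenshtein distance computed over whitespace-separated word tokens,
    using a single rolling row of the dynamic-programming table."""
    a, b = q1.split(), q2.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete wa
                                     dp[j - 1] + 1,      # insert wb
                                     prev + (wa != wb))  # substitute
    return dp[-1]

def bin_pair(q, q_cf):
    """Assign a (q, q') pair to one of the three training shards."""
    d = word_edit_distance(q, q_cf)
    if d <= 4:
        return "1-4"
    if d <= 10:
        return "5-10"
    return ">10"

print(word_edit_distance("who won the world cup in 2014",
                         "who won the world cup in 2018"))  # 1
print(bin_pair("who won the world cup in 2014",
               "who won the world cup in 2018"))            # 1-4
```

Pairs that differ only in a single entity or date, like the example above, land in the 1-4 bin, while predicate rewrites typically produce larger distances.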


Results are shown for QED-based sharding in Table 6, and for edit distance in Table 7. Counterfactual perturbation of a specific kind (with a specific distributional shift) during augmentation does not hurt performance compared to the baseline NQ model, which differs from the observations of Joshi and He (2021) on NLI. However, consistent with their observations, combining different kinds of perturbations has orthogonal benefits that improve model generalization on other perturbation types. Similarly, when data is sharded by edit distance, we observe that training on the full RGF data nearly matches the best performance from training on any single shard, suggesting that CDA with the highly diverse RGF data can lead to improved consistency on a broad range of perturbation types.

7 Conclusion

We present Retrieve-Generate-Filter (RGF), a method for generating counterfactual examples for information-seeking queries. RGF creates automatically-labeled examples that can be used directly for data augmentation, or filtered using heuristics or meaning representations for analysis. The generated examples are semantically diverse, using knowledge from the passage context to capture semantic changes that would be difficult to specify a priori with a global schema. We show that training with this data leads to improvements on open-domain QA and on challenge sets, as well as significant improvements in local robustness. Our method requires only the original task training set (e.g. Natural Questions) as supervised input and minimal human filtering, making it easily transferable to new domains and semantic phenomena without the need for explicit meaning representations.


Appendix A Model Training and Implementation Details

Below, we describe the details of the different models trained in the RGF pipeline. For all T5 models, we use the pre-trained checkpoints from Raffel et al. (2020).

Question Generation

We use a T5-3B model fine-tuned on the Natural Questions (NQ) dataset. We only train on the portion of the dataset that has gold short answers and an accompanying long-answer evidence paragraph from Wikipedia. The input consists of the title of the Wikipedia article the passage is taken from, a separator (‘>>’), and the passage. The output is the original NQ question. The short answer is marked in the passage using the character sequences ‘« answer =’ and ‘»’ on the left and right respectively. The input and output sequence lengths are restricted to fixed maximum values. We train the model for 20k steps. We decode with a beam size of 15, and take the top candidate as our generated question.
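The input construction described above can be sketched as a small string-assembly helper. The marker tokens and ‘>>’ separator follow the description; the exact whitespace handling around the markers is an illustrative assumption:

```python
def build_qgen_input(title, passage, answer):
    """Mark the answer span in the passage with the '« answer = ... »'
    delimiters, then prepend the article title and the '>>' separator.
    Assumes the answer string appears verbatim in the passage."""
    start = passage.index(answer)
    marked = (passage[:start]
              + "« answer = " + answer + " »"
              + passage[start + len(answer):])
    return title + " >> " + marked

src = build_qgen_input(
    "Mount Everest",
    "Mount Everest rises 8848 metres above sea level.",
    "8848 metres",
)
print(src)
# Mount Everest >> Mount Everest rises « answer = 8848 metres » above sea level.
```

The fine-tuned T5 model then maps this marked input to the original NQ question; at counterfactual-generation time the same format is applied to retrieved passages and alternate answers.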

Answer Generation

We use a T5-3B model trained on the same subset of Natural Questions (NQ) as the question generation model, with the same set of hyperparameters and model size described above. The input consists of the title of the Wikipedia article the passage is taken from, a separator (‘>>’), and the passage; the output sequence is the short gold answer from NQ.

Reading Comprehension Model

We model the task of span-selection reading comprehension, i.e. identifying an answer span given a question and passage, as a sequence-to-sequence problem. The input consists of the question, a separator (‘>>’), the title of the Wikipedia article, and the passage; the output is the answer span. The reading comprehension model is a T5-large model.

Open-domain Question Answering model

The open-domain QA model is based on the implementation of Lee et al. (2019), and initialized with the REALM checkpoint from Guu et al. (2020). Both the retriever and reader are initialized from the BERT uncased base model. The query and document representations are 128-dimensional vectors. When fine-tuning, we use a batch size of 1 on a single Nvidia V100 GPU, and perform 2 epochs of fine-tuning for Natural Questions.

Noise Filtering

We train 6 reading comprehension models with the configurations above, using different random seeds for training dataset shuffling and optimizer initialization. We retain examples where more than 5 out of 6 models produce the same answer for a question.
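This agreement-based filter can be sketched as follows. The agreement threshold is a parameter here, since "more than 5 out of 6" read literally requires unanimity; the example below uses a 5-of-6 threshold for illustration:

```python
from collections import Counter

def filter_by_ensemble_agreement(examples, min_agree=5):
    """examples: list of (question, model_answers) where model_answers holds
    one predicted answer per ensemble member. Keep a (question, answer) pair
    only when the most common answer reaches the agreement threshold."""
    kept = []
    for question, model_answers in examples:
        answer, count = Counter(model_answers).most_common(1)[0]
        if count >= min_agree:
            kept.append((question, answer))
    return kept

examples = [
    ("q1", ["paris"] * 6),                   # unanimous -> kept
    ("q2", ["paris"] * 5 + ["london"]),      # 5/6 agree -> kept
    ("q3", ["paris"] * 3 + ["london"] * 3),  # no strong majority -> dropped
]
print(filter_by_ensemble_agreement(examples))
# [('q1', 'paris'), ('q2', 'paris')]
```

Raising min_agree to 6 would enforce the strictly unanimous reading and drop q2 as well.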

QED Training

We use a T5-large model fine-tuned on the Natural Questions subset with QED annotations (Lamm et al., 2021). We refer the reader to the QED paper for details on the linearization of explanations and inputs in the T5 model. Our model is fine-tuned for 20k steps.

Appendix B Evaluation of Fluency and Noise

The authors sampled 300 generated questions and annotated them for fluency using the following rubric: is the generated question grammatically well-formed, barring non-standard spelling and capitalization of named entities? This noise annotation was done for RGF, as well as for Gold Agen-Qgen and Random Agen-Qgen.

Data               Unfiltered  Filtered
RGF                29.8%       25.3%
Gold Agen-Qgen     27.9%       20.7%
Random Agen-Qgen   30.7%       28.3%
Table 8: Fraction of noise (incorrect question-answer pairs) in generated data, from 300 examples manually annotated by the authors.

Creation of paired data for counterfactual evaluation

Once again, the authors annotated the correctness of counterfactual RGF instances that are paired by reference or predicate, as described in §3.5. Filtering was done until 500 examples were available in each category.

Appendix C Additional Experiments

C.1 Intrinsic Evaluation

Figure 3: Distribution of edit distance between original and counterfactual questions for RGF and other baselines for context selection. Note: for Random Wiki Passage, original and generated questions bear no relation to each other and are randomly paired.

In Figure 3, we compare the distributions of edit distance between original and generated questions for our approach, for questions generated from the gold evidence passage, and for questions generated from a random Wikipedia passage (§5). We expect and find that RGF counterfactuals undergo minimal perturbation from the original question compared to questions generated from a random Wikipedia paragraph. Surprisingly, this pattern also holds when compared to questions generated from gold NQ passages. We hypothesize that the alternate answers retrieved in our pipeline are semantically similar to the gold answer (the same entity type, for instance), whereas random answer spans chosen from the gold NQ passage can result in significant semantic shifts in the generated questions.

Figure 4: Plot of average edit distance between the original and counterfactual questions vs. retrieval rank k, where the counterfactual is generated from the k-th retrieved passage, showing that edit distance and retrieval rank are monotonically related.

In Figure 4, we measure the relation between retrieval rank and edit distance for RGF. For each retrieval rank k, we plot the average edit distance between the original question and the counterfactual question that was generated using the k-th passage and its answer. We observe a monotonic relation between retrieval rank and edit distance (which we use for filtering our training data). We also measure changes in the distribution of question type and predicate type.

Figure 5: Distribution of top 20 question types for original NQ data, RGF counterfactuals and questions generated from random Wikipedia passage, indicating bias towards popular question types.

Figure 5 indicates that counterfactual data exacerbates question-type bias; however, this bias exists in RGF as well as in the baselines. In Table 9, we show consistency results on paired datasets in the open-domain setting, analogous to the results shown in §6.1.

Training Data      Size        AQA    AmbigQA  RGF Ref. Change  RGF Pred. Change
Original NQ        90K         16.58  13.33    24.58            12.75
Random Agen-Qgen   90K + 90K   15.80  20.00    25.57            16.82
RGF (REALM-Qgen)   90K + 90K   17.66  28.57    30.95            17.73
Table 9: Consistency results for open-domain QA.

C.2 Low-resource Transfer

Training Data   Size          BioASQ (Dev) F1  BioASQ (Dev) EM
Original        1000          42.93            23.67
Orig. + RGF     500 + 500     41.72            23.01
Original        2000          45.88            25.80
Orig. + RGF     1000 + 1000   44.64            26.80
Table 10: Results on the reading comprehension task in the low-resource transfer setting on the BioASQ 2019 dataset. A model trained on 1000 gold BioASQ examples plus 1000 RGF examples performs nearly as well as a model trained on 2000 gold examples.

Joshi and He (2021) show CDA to be most effective in the low-resource regime. To better understand the role that dataset size plays in CDA in the reading comprehension setting, we evaluate RGF in a cross-domain setting where only a small amount of training data is available.

Experimental Setting

Since our approach depends on an open-domain QA model and a question generation model trained on all of the Natural Questions data, we instead experiment with low-resource transfer to BioASQ, which consists of questions in the biomedical domain. We use the domain-targeted retrieval model of Ma et al. (2021), in which synthetic question-passage relevance pairs generated over the PubMed corpus are used to train domain-specific retrieval without any in-domain supervision. We further fine-tune the NQ-trained question generation model on the limited amount of in-domain data, and use a checkpoint trained on NQ as initialization when fine-tuning the RC model on in-domain data. Details of our training approach for low-resource transfer can be found in Appendix A.


We observe significant improvements over the baseline model in the low resource setting for in-domain data (< 2000 examples), as shown in Table 10. Compared with the limited gains we see on the relatively high-resource NQ reading comprehension task, we find that on BioASQ, CDA with 1000 examples improves performance by 2% F1 and 3% exact match, performing nearly as well as a model trained on 2000 gold examples.

Appendix D Semantic Diversity

Figure 6 includes more examples from Natural Questions, and shows context-specific semantic diversity of perturbations achieved by RGF. The multiple latent semantic dimensions identified (arrows in the diagram) fall out of our retrieval-guided approach.

Figure 6: Context-specific semantic diversity of perturbations achieved by RGF on an NQ Question. The multiple latent semantic dimensions identified (arrows in the diagram) fall out of our retrieval-guided approach.