
G-DAUG: Generative Data Augmentation for Commonsense Reasoning

Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.



1 Introduction


Figure 1: Example of a selected high-quality generated example compared to a human-authored example from the WinoGrande dataset. Composing commonsense questions can require creativity.

While recent advances in large-scale neural language models Devlin et al. (2019); Liu et al. (2019); Radford et al. (2019); Raffel et al. (2019) have led to strong performance on several commonsense reasoning benchmarks Talmor et al. (2019); Lv et al. (2020); Sakaguchi et al. (2020), their accuracy by and large depends on the availability of large-scale human-authored training data. However, crowdsourcing training examples at scale for each new task and domain can be prohibitively expensive. Moreover, human-authored data has been shown to exhibit annotation artifacts Gururangan et al. (2018); Agrawal et al. (2018); Schwartz et al. (2017), leading to models with considerably weaker performance on out-of-distribution samples Jia and Liang (2017); Belinkov and Bisk (2017); Iyyer et al. (2018).

We present Generative Data Augmentation (G-DAug; §2): a framework for augmenting training data with diverse and informative synthetic training examples to improve both in-distribution performance and out-of-distribution generalization of commonsense reasoning models.

While data augmentation with synthetic examples has been used elsewhere in NLP, generating synthetic examples for commonsense reasoning poses a unique challenge. For instance, for data augmentation in reading comprehension Zhou et al. (2017); Du et al. (2017); Zhao et al. (2018a), the generator is given a reference passage and the task is to generate questions that are directly answered by the passage. In contrast, answering commonsense questions relies on commonsense notions that are seldom stated explicitly Gordon and Van Durme (2013); Forbes and Choi (2017), and authoring such questions can require creativity. We hypothesize that pretrained language models, such as GPT-2 Radford et al. (2019), capture some common sense expressed implicitly in their pretraining corpus, building upon promising evidence from previous work Yang et al. (2018); Trinh and Le (2018); Bosselut et al. (2019); Davison et al. (2019). Hence, questions generated by such models can form helpful training data.

A generative model allows us to produce large pools of synthetic training examples, alleviating the need for expensive crowdsourcing. Nonetheless, the automatically generated examples may be noisy or redundant. To ensure that we use the most informative examples for augmentation, we introduce data selection methods based on influence functions Koh and Liang (2017) and a heuristic to maximize the diversity of the generated data pool. Finally, we propose an effective two-stage training scheme for augmentation with synthetic data. In experiments across multiple commonsense benchmarks, we show that G-DAug can mitigate the expense and brittleness brought about by large training sets for commonsense reasoning tasks. In detail, our main contributions are the following:

  1. We introduce G-DAug, a generative data augmentation framework for commonsense reasoning tasks.

  2. We propose novel selection methods that identify informative and diverse synthetic training examples from the generated pool.

  3. In experiments, we show that G-DAug boosts in-domain performance (achieving an average absolute gain across five commonsense reasoning benchmarks) and improves model robustness in terms of resistance to adversarial attacks Jin et al. (2020) and accuracy on perturbed evaluation sets. We achieve new state-of-the-art results on the WinoGrande Sakaguchi et al. (2020), CommonsenseQA Talmor et al. (2019), and Codah Chen et al. (2019) benchmarks.

  4. We provide a comprehensive analysis of the factors that influence G-DAug’s performance.

We describe the G-DAug framework in §2 and §3. Empirical results are provided in §4 and our analysis is presented in §5. Finally, we review related work in §6 and conclude in §7.


Figure 2: Illustration of the G-DAug process. We start by (1) generating synthetic data and training a task model, and then (2) relabel the generated data using the task model. Then, we (3) filter the generated data based on estimated influence scores, and (4) further select a subset based on a diversity-maximizing heuristic. Finally, we train a new task model in step (5) using the filtered generations (synthetic training), and then train this model using the original training data in step (6) (organic training).

2 G-DAug

We now describe our framework for Generative Data Augmentation (G-DAug). The high-level overview of our approach is shown in Figure 2. In this section, we describe how G-DAug generates synthetic questions, answers, and distractors for a given dataset using a combination of a pretrained generative language model and a task-specific supervised model. Then, in §3 we describe how G-DAug filters the generated data and uses it for data augmentation.

2.1 Synthetic Training Data Generation

Our method involves finetuning a pretrained generative language model $G$ by maximizing the log-likelihood of a sequence of text, $W = (w_1, \dots, w_T)$:

$$\mathcal{L}_w(\theta) = \sum_{t=1}^{T} \log P(w_t \mid W_{1:t-1}; \theta)$$

where $W_{1:t-1}$ denotes a subsequence of $W$ and $\theta$ denotes the model parameters ($W_{1:0}$ denotes an empty sequence). Below, we consider a multiple choice question answering task as a running example to demonstrate how variations of this objective are used to finetune the LM for generating synthetic questions, answers and distractors. Specific modifications for other tasks, e.g. textual entailment, are discussed in Appendix A.

We are given a dataset $\mathcal{D} = \{(Q_i, C_i, y_i)\}_{i=1}^{N}$, where $Q_i$ is a sequence of words denoting the question, $C_i = \{c_i^{1}, \dots, c_i^{K}\}$ is the corresponding choice set with $K$ choices that are word sequences as well, and $y_i \in \{1, \dots, K\}$ denotes its ground truth label. We denote $c_i^{y_i}$ as the answer and the $c_i^{j}$'s with $j \neq y_i$ as the distractors.

Generating Synthetic Questions

We obtain a question generator to synthesize questions by finetuning a pretrained generative language model on the training question set $\{Q_i\}_{i=1}^{N}$ to optimize the language modeling objective $\mathcal{L}_q(\theta_q) = \sum_{i=1}^{N} \log P(Q_i; \theta_q)$, where $\theta_q$ denotes the parameters of the question generator. After finetuning, we generate new questions with nucleus sampling Holtzman et al. (2020), which is suitable for generating long-form text.
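As an illustration of the decoding step, the following is a minimal sketch of nucleus (top-p) sampling over a single next-token distribution. The function names and the `top_p=0.9` default are assumptions made for this example, not settings reported here; in practice the probabilities would come from the finetuned generator rather than a hand-written list.

```python
import random

def nucleus_indices(probs, top_p=0.9):
    """Return the smallest set of token indices, taken from most to least
    probable, whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return nucleus

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample one token index from the nucleus.
    random.choices renormalizes the weights internally."""
    nucleus = nucleus_indices(probs, top_p)
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Toy next-token distribution over a 4-token vocabulary (illustrative only).
probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_indices(probs, top_p=0.9))   # the low-probability tail is cut off
print(nucleus_sample(probs, top_p=0.9, rng=random.Random(0)))
```

Truncating the tail keeps sampling diverse while avoiding the very unlikely tokens that tend to derail long generations.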

Generating Synthetic Answers and Distractors

As answers and distractors have different purposes in each example question, we model them independently according to different distributions. Hence, we independently finetune two separate generative LMs to generate answers and distractors. More specifically, the answer generator is trained to maximize the conditional log-likelihood of the answer given the question, and the distractor generator is trained to maximize the conditional log-likelihood of the distractors. Mathematically, we optimize the following two objectives:

$$\mathcal{L}_a(\theta_a) = \sum_{i=1}^{N} \log P(c_i^{y_i} \mid Q_i; \theta_a)$$

$$\mathcal{L}_d(\theta_d) = \sum_{i=1}^{N} \sum_{j \neq y_i} \log P(c_i^{j} \mid Q_i; \theta_d)$$

where $c_i^{y_i}$ is the answer to question $Q_i$, the $c_i^{j}$ with $j \neq y_i$ are its distractors, and $\theta_a$ and $\theta_d$ denote the parameters of the answer and distractor generators respectively. In order to obtain high-quality answers, we use nucleus sampling with low temperature (for long answers) or greedy decoding (for short answers). For distractor generation, diversity is more important because training with repetitive distractors is not informative, so we use nucleus sampling without temperature for those.

Data Relabeling.

As shown above, our choice of generative LMs naturally defines labels for the synthetic questions. Alternatively, we can apply a supervised task model, which is trained on the original task dataset, to relabel the synthetic questions from a candidate pool of synthetic answers and distractors. This is similar to treating the synthetic questions as unlabeled data and applying self-training. Because the utility of this self-training can be variable, we recommend selecting whether to relabel or not for each task, based on validation data.
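The relabeling step can be sketched as picking, for each synthetic question, the candidate that a task model scores highest. In the sketch below, `relabel` and `toy_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a trained task model; it is not the scoring used in the experiments.

```python
def relabel(question, candidates, score_fn):
    """Return the index of the candidate the task model prefers.
    score_fn(question, candidate) -> float is assumed to be a trained
    task model's plausibility score; here it is a placeholder."""
    scores = [score_fn(question, c) for c in candidates]
    return max(range(len(candidates)), key=lambda j: scores[j])

def toy_score(question, candidate):
    # Stand-in scorer: number of shared lowercase words (illustrative only).
    return len(set(question.lower().split()) & set(candidate.lower().split()))

question = "where would you put a plate after washing it"
candidates = ["in the garden", "put it in a cupboard", "on the roof"]
new_label = relabel(question, candidates, toy_score)
print(new_label, candidates[new_label])
```

Whether the relabeled or the generator-assigned label is kept would then be decided per task on validation data, as recommended above.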

3 Synthetic Data Selection and Training

The generation techniques described in the previous section result in a large pool of generated synthetic samples. In this section, we describe how G-DAug uses this data for data augmentation. We first propose three data selection methods based on different criteria to further boost performance (§3.1) by selecting the most relevant synthetic examples. Then, we describe a simple staged training procedure (§3.2) to mitigate the negative impact from noise in the synthetic training data.

3.1 Selecting High-quality and Diverse Synthetic Examples

A randomly sampled synthetic dataset may contain many examples that are similar to one another, along with low-quality generations Holtzman et al. (2020). Intuitively, a more diverse and high-quality synthetic dataset would benefit the downstream task model more. In this section, we propose to improve our downstream task performance by selecting synthetic examples, using three algorithms that target quality, diversity, and a combination of both. We refer to our random selection baseline as G-DAug-Rand.

Filtering with Influence Functions.

We consider an approach to improve the quality of a synthetic dataset by filtering out detrimental synthetic training examples. A given training example $x$ is considered detrimental if including $x$ in the training set results in a higher generalization error, which is often approximated by validation loss in practice. Finding whether $x$ increases validation loss would naively require retraining the model with the example included and measuring how much the validation loss changes, which is computationally prohibitive for a large pool of examples. Fortunately, the validation loss change can be efficiently approximated through the use of influence functions. Influence functions are a classic technique from robust statistics Atkinson et al. (1983) that has recently been applied to deep neural networks Koh and Liang (2017). While previous work focuses on removing or examining existing training examples Koh and Liang (2017); Wang et al. (2018b), we use influence functions to estimate the effect of including an unseen novel example in training, in order to filter out unhelpful synthetic data points and boost the performance of our generative data augmentation scheme. Note that Koh and Liang (2017) compute influence on the loss value of a single test example. Instead of aggregating the individual influences of including $x$ on the loss of each validation example, we compute influence only once per synthetic example on the average validation loss, to make computation feasible for large models with large pools of synthetic data. Please refer to Appendix F for a detailed derivation. After computing the influence scores, we filter out detrimental synthetic data (i.e., those that have a positive influence on the validation loss when upweighted). We refer to this approach as G-DAug-Influence.
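For reference, the upweighting influence of Koh and Liang (2017), written against the average validation loss as described above, can be sketched as follows. The notation is assumed here for illustration ($x$ is a synthetic example, $\hat\theta$ the trained parameters, $z_i$ the $N$ training examples, $v_m$ the $M$ validation examples, and $H_{\hat\theta}$ the Hessian of the training loss); the authors' own derivation is the one in Appendix F.

```latex
\mathcal{I}(x)
  \;=\; -\,\nabla_\theta \bar{L}_{\mathrm{val}}(\hat\theta)^{\top}\,
        H_{\hat\theta}^{-1}\,
        \nabla_\theta L(x, \hat\theta),
\qquad
\bar{L}_{\mathrm{val}}(\theta) \;=\; \frac{1}{M}\sum_{m=1}^{M} L(v_m, \theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Under this sign convention, $\mathcal{I}(x) > 0$ means that upweighting $x$ is estimated to increase the average validation loss, so such examples are the ones filtered out. Averaging the validation gradient first means one inverse-Hessian-vector product can be shared across the whole synthetic pool.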

Selecting Diverse Examples.

We hypothesize that a diverse set of examples will provide a more reliable training signal to the task model. Here, we measure diversity in terms of the number of unique $n$-grams (#$n$-grams) in the training set. We propose a simple greedy algorithm that iteratively selects a synthetic training example from the pool that maximizes the diversity measure. In our experiments, we use a single fixed value of $n$. We refer to this approach as G-DAug-Diversity. Exploring richer models of diversity, in terms of other metrics (such as embedding distance), is possible future work.
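The greedy procedure can be sketched as follows: at each step, pick the pool example that adds the most previously unseen n-grams to the selected set. Whitespace tokenization, the default `n=1`, first-index tie-breaking, and the function names are all assumptions of this sketch rather than details taken from the text.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def greedy_diverse_select(pool, k, n=1):
    """Greedily choose k examples from pool, each time taking the example
    that contributes the most n-grams not yet covered by the selection.
    Ties are broken in favor of the earliest example in the pool."""
    grams = [ngrams(text.split(), n) for text in pool]
    seen, chosen = set(), []
    remaining = list(range(len(pool)))
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda i: len(grams[i] - seen))
        chosen.append(best)
        seen |= grams[best]
        remaining.remove(best)
    return chosen

pool = ["a b c", "a b c", "d e", "a d f g"]
print(greedy_diverse_select(pool, k=2))  # indices of the selected examples
```

This naive version rescans the whole pool at every step, so it costs roughly O(k × pool size) set differences; a large pool would likely call for caching or lazy re-evaluation of the gains.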

Combining Influence Filtering and Diversity Maximization

G-DAug-Influence and G-DAug-Diversity have complementary benefits—the former aims at improving individual quality by filtering out potentially detrimental training examples, and the latter is designed to select diverse examples but has no control over the quality of selected examples. Hence, we propose a selection algorithm called G-DAug-Combo that aims to improve both the diversity and quality of the selected synthetic examples. Under this method, we first apply G-DAug-Influence to construct a high quality synthetic data pool, and then apply G-DAug-Diversity to select the most diverse examples from the pool.

3.2 Training with Synthetic Data

In traditional data augmentation settings, augmented data are usually mixed with the original training set and are trained together with the original data Wei and Zou (2019); Kafle et al. (2017). However, when dealing with generated data, the intrinsic label noise may be detrimental to the learning process as noted by Kafle et al. (2017). In addition, noise could also come from questions, for example, the model can generate either nonsensical or ambiguous questions (see Table 9 under §4.2). We propose a simple training method to treat synthetic data and original data differently during learning to address this issue.

Two Stage Separate Training.

Our approach is simple: split synthetic and original data training into two separate stages. We first train a task model on the synthetic data (Synthetic Training), then train it on the original, human-authored training set (Organic Training). If a task model learns some unfavorable noise from the synthetic data in the first stage, it has a chance to “correct” this in the second stage with the original data. We also experimented with minimizing a joint objective that interpolates the loss on the synthetic data and the loss on the original data, with an importance weight that mitigates noise by downweighting the synthetic examples. However, in our preliminary experiments (see Section 5), we found that two-stage training performs better than the importance-weighted loss method, while also being easier to implement.
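The schedule itself can be sketched with a deliberately tiny stand-in model. Below, a one-parameter regressor `y = w * x` trained by SGD replaces the actual task model, and the data, learning rate, and epoch counts are invented for illustration; only the ordering (synthetic first, organic second) reflects the method described above.

```python
def sgd_epoch(w, data, lr=0.1):
    """One pass of SGD on squared error for the toy model y = w * x."""
    for x, y in data:
        grad = 2.0 * (w * x - y) * x
        w -= lr * grad
    return w

def two_stage_train(w, synthetic, organic, epochs=20):
    """Stage 1: train on synthetic data. Stage 2: continue on organic data."""
    for _ in range(epochs):          # synthetic training
        w = sgd_epoch(w, synthetic)
    for _ in range(epochs):          # organic training
        w = sgd_epoch(w, organic)
    return w

# Noisy synthetic pairs and clean organic pairs (true slope 2); toy values.
synthetic = [(1.0, 2.5), (2.0, 3.0)]
organic = [(1.0, 2.0), (2.0, 4.0)]
print(two_stage_train(0.0, synthetic, organic))
```

In this toy setting the second stage pulls the parameter back to the value supported by the clean data, which mirrors the intuition that organic training can correct noise picked up from synthetic examples.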

4 Experiments

We study G-DAug’s effectiveness on four commonsense QA benchmarks: CommonsenseQA Talmor et al. (2019), WinoGrande Sakaguchi et al. (2020), Codah Chen et al. (2019), HellaSwag Zellers et al. (2019), as well as one textual entailment task, SNLI Bowman et al. (2015) which was designed to require some degree of common sense. Details on each dataset are provided in Appendix A. In order to evaluate in the low-resource setting (where training examples are limited), we downsample the large HellaSwag and SNLI training sets to 2K and 3K examples respectively. The other datasets are either already low-resource or have a low-resource component. While our focus is on common sense, our techniques are also applicable to other closed-book QA tasks, such as science QA. To assess G-DAug outside of the commonsense domain, we also evaluate on a challenging closed-book version of the ARC-Challenge Scientific QA task Clark et al. (2018), in which access to the scientific corpus for the ARC dataset (or any other information sources) is disallowed during test.

Robustness Evaluation

Besides measuring in-distribution performance, we also analyze the augmented models’ robustness to perturbed or adversarial data, obtained using synonym replacement via WordNet Fellbaum (1998), and TextFooler Jin et al. (2020) adversarial attacks. Following Wei and Zou (2019), we perform WordNet-based synonym replacement on the validation or test set (when test labels are available) with a fixed replacement rate. Our second evaluation with TextFooler utilizes several techniques to identify and prioritize the most important words to be replaced with the most semantically similar and grammatically correct substitutes, until the model prediction is altered. We adopt two metrics to measure the robustness under TextFooler’s attack: failure rate and average perturbation ratio. Failure rate is defined as the proportion of examples for which TextFooler fails to change the model prediction, and average perturbation ratio refers to the average fraction of words replaced when TextFooler succeeds in altering the model prediction. We re-implement TextFooler with two minor changes: we only swap words in questions, not answers; and we replace the Universal Sentence Encoder with SRoBERTa Reimers and Gurevych (2019), a state-of-the-art sentence embedding method.
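The two robustness metrics follow directly from their definitions. The sketch below assumes each attack outcome is recorded as a hypothetical `(succeeded, words_replaced, total_words)` tuple; this record format is an assumption of the example, not TextFooler's actual output format.

```python
def attack_metrics(results):
    """Compute (failure_rate, avg_perturbation_ratio).

    results: list of (succeeded, words_replaced, total_words) tuples.
    Failure rate counts attacks that did NOT change the prediction;
    the perturbation ratio is averaged over successful attacks only."""
    failures = sum(1 for ok, _, _ in results if not ok)
    failure_rate = failures / len(results)
    ratios = [replaced / total for ok, replaced, total in results if ok]
    avg_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    return failure_rate, avg_ratio

# Four toy attack outcomes: two successes, two failures.
results = [(True, 2, 10), (False, 0, 8), (True, 1, 4), (False, 0, 5)]
print(attack_metrics(results))
```

Higher is better for both numbers: a higher failure rate means fewer predictions could be flipped, and a higher perturbation ratio means the attacker had to change more words when it did succeed.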

4.1 Experimental Settings

We use RoBERTa Liu et al. (2019) as our pretrained task model, and GPT-2 Radford et al. (2019) as our pretrained generator (we used the HuggingFace library Wolf et al. (2019) for both). We use validation performance to decide whether to do relabeling for CommonsenseQA and WinoGrande, and simply apply relabeling by default on all other tasks (tuning this choice may further boost performance on the other tasks). For the sake of a controlled comparison, we restrict the synthetic dataset size to be equal across all methods. We repeat all experiments with 10 random restarts and pick the best model based on validation performance. Additional experimental details, including hyperparameters for each experiment, are provided in Appendices A, D and E.


Our first baseline is a finetuned RoBERTa model with no augmentation. To compare with existing work on data augmentation, we also report results using a backtranslation approach from Xie et al. (2019); in our experiments, we randomly mix the original and backtranslated data.

4.2 In-Distribution Evaluation

Method CSQA WinoGrande Codah Hellaswag-2K SNLI-3K Average
RoBERTa (reported) 72.1 66.4 - - - -
RoBERTa (ours) 71.6 67.5 82.3 75.4 88.6 77.1
Backtranslation 70.2 67.2 81.8 73.0 88.1 76.1
G-DAug-Rand 71.8 70.9 83.6 75.9 89.0 78.2
G-DAug-Influence 72.1 70.9 84.3 75.8 88.7 78.4
G-DAug-Diversity 72.3 71.2 83.5 76.1 89.0 78.4
G-DAug-Combo 72.6 71.4 84.0 76.8 88.7 78.7
Table 1: Results on the test sets of five commonsense benchmarks. RoBERTa (reported) is the result for the RoBERTa-large baseline reported in previous work. RoBERTa (ours) is our evaluation of RoBERTa-large baselines following our experimental setup. All G-DAug methods outperform the baseline methods, and G-DAug-Combo performs the best on the majority of the tasks (3/5) and achieves the highest average score.

Our primary results are reported in Table 1. The G-DAug methods perform better than the baselines, and on average all of the proposed selection algorithms achieve higher test performance than G-DAug-Rand. G-DAug-Combo performs the best on 3/5 tasks and obtains the highest average score of 78.7%, a 1.6% absolute gain over the non-augmented baseline. Moreover, with G-DAug-Combo, we obtain a 5.0% absolute gain over previously published state-of-the-art results on WinoGrande. (These results are state-of-the-art for our model class; at the time of this writing, the WinoGrande leaderboard lists one superior result, but that is based on a T5 model with roughly an order of magnitude more parameters than ours. G-DAug is also applicable to larger models like T5, and experiments with these are an item of future work.) For CommonsenseQA, G-DAug-Combo outperforms the prior non-ensemble state-of-the-art Zhu et al. (2020) by 0.4%. We also achieve a new state-of-the-art on Codah, where the previous best (BERT-based) score was 67.5% Chen et al. (2019). Backtranslation hurts performance: the non-augmented baseline uniformly outperforms Backtranslation, and the average score drops by 1% with Backtranslation. For validation set results, see Appendix B.

Method CSQA WinoGrande Codah Hellaswag-2K SNLI-3K Average
RoBERTa (ours) 69.9 63.8 74.7 63.2 77.5 69.8
Backtranslation 69.0 62.3 75.5 65.4 81.0 70.6
G-DAug-Rand 72.1 65.5 75.9 64.1 78.6 71.2
G-DAug-Influence 71.0 65.7 76.2 64.3 78.6 71.2
G-DAug-Diversity 71.6 66.0 76.0 64.8 79.4 71.6
G-DAug-Combo 72.0 66.0 76.0 65.2 78.7 71.6
Table 2: Results on WordNet-based synonym replacement sets. Metrics are the same as Table 1. G-DAug-Diversity and G-DAug-Combo achieve the highest average score.

4.3 Robustness Evaluation

Synonym Replacement

Table 2 presents our evaluation on synonym replacement sets. For Codah and HellaSwag-2K, we perturb test folds/sets, as the labels are available. Performing synonym replacement does not guarantee that the perturbed examples are still correct, due to e.g. polysemy. However, assuming the models perform similarly on invalid questions, higher accuracy on synonym-replaced questions may suggest better performance on the valid perturbations. G-DAug outperforms the baselines, with G-DAug-Combo and G-DAug-Diversity obtaining the best average performance. This suggests that augmenting with diverse questions is helpful for model robustness.

Method CSQA WinoGrande Codah Hellaswag-2K SNLI-3K Average
RoBERTa (ours) 14.8/12.6 4.5/7.8 30.9/15.8 17.4/9.8 17.0/20.2 16.9/13.2
Backtranslation 17.0/12.9 5.0/8.2 37.1/15.9 20.2/10.2 18.8/21.7 19.6/13.8
G-DAug-Rand 15.6/13.0 5.7/8.4 36.2/15.9 20.0/10.6 17.7/20.6 19.0/13.7
G-DAug-Influence 16.3/12.8 5.4/8.4 34.9/15.8 19.2/10.7 18.0/20.7 18.8/13.7
G-DAug-Diversity 16.0/12.9 5.9/8.4 36.1/16.2 21.4/10.4 19.0/20.5 19.7/13.7
G-DAug-Combo 16.5/12.6 5.9/8.5 35.2/15.7 21.3/10.5 16.7/20.5 19.1/13.6
Table 3: Failure rates/average perturbation ratio of TextFooler-based adversarial attacks (higher is better). “Failure rate” is the proportion of examples for which TextFooler fails to change the model prediction and “average perturbation ratio” is the percentage of words replaced on average to achieve a successful attack (failed examples are excluded in the calculation). A higher failure rate means that a model is more robust, while a higher average perturbation ratio suggests that the attacker needs more effort to achieve a successful attack when possible. Models trained with augmented data are more robust to TextFooler’s attacks compared to models without data augmentation. On average, Backtranslation and G-DAug-Diversity perform the best.


Our measurements of model performance under the TextFooler adversarial attack are shown in Table 3. The models trained with data augmentation are more robust to adversarial attacks: on average, all G-DAug variants and Backtranslation outperform the RoBERTa baseline on both metrics. G-DAug-Diversity obtains the best average failure rate, while Backtranslation achieves the best average perturbation ratio (higher is better, in both metrics). Interestingly, G-DAug-Combo does not perform as well as G-DAug-Diversity, although it optimizes for diversity as well. Finally, Backtranslation performs well under this setting although it hurts in-domain accuracy in most cases. A possible explanation is that Backtranslation trains the model to maintain the same prediction under paraphrasing, making it harder for TextFooler to alter the model’s predictions by feeding it semantically similar examples.

Finally, we evaluate all SNLI models on the NLI Diagnostics dataset Wang et al. (2018a), which is hand-crafted for probing model performance on out-of-distribution examples. All G-DAug models outperform baseline models on this benchmark, suggesting improved robustness. These results are reported in Appendix C.

4.4 Results on ARC

Method Validation Test Synonym TF:failure TF:perturbation
RoBERTa (ours) 43.5 39.4 35.2 6.6 9.3
Backtranslation 43.1 43.1 42.4 9.3 10.3
G-DAug-Rand 50.8 48.1 43.4 12.9 10.8
G-DAug-Influence 51.5 48.5 45.2 12.4 11.0
G-DAug-Diversity 49.5 47.5 42.2 13.9 10.8
G-DAug-Combo 50.8 48.2 43.8 13.1 10.7
Table 4: Results on ARC-Challenge Scientific QA in the closed-book setting. “Synonym” refers to the synonym replacement set, “TF:failure” refers to the TextFooler failure rate, and “TF:perturbation” refers to TextFooler average perturbation ratio. G-DAug-Influence performs the best on the validation, test and synonym replacement sets, while G-DAug-Diversity is the most robust against TextFooler.

We explore G-DAug’s effectiveness outside of the commonsense domain in Table 4, where we evaluate on closed-book ARC-Challenge Scientific QA. Unlike the commonsense domain, valid science questions are harder to generate because their semantics is very precise. Despite that, G-DAug still improves on both validation and test sets by a large margin compared to the baselines. G-DAug-Influence achieves the best accuracy on the validation, test and synonym replacement sets, while G-DAug-Diversity is the most robust against the adversarial attacks but has worse accuracy than G-DAug-Rand. This suggests that optimizing for question quality is more important when the synthetic data is more noisy as it is for ARC, and that optimizing for diversity in this case can hurt performance.

5 Analysis and Discussion

In this section, we provide an analysis of the factors that influence G-DAug’s performance. Because the experiments are computationally expensive, we focus most of our analysis on just one data set where G-DAug tends to offer the most benefits, WinoGrande.

Training Size.


Figure 3: Validation results (in log scale) for different sizes of the WinoGrande dataset; G-DAug helps more for smaller training sizes.

We find that G-DAug remains effective as the amount of training data is varied, but provides a bigger boost over the baseline in the low-resource (small training size) regime. Figure 3 shows this trend on WinoGrande, considering our best strategy G-DAug-Combo. For the smallest sizes, XS and S, G-DAug increases the “effective training size” by a factor of 4 (that is, G-DAug-Combo trained on XS or S matches RoBERTa’s performance on S or M, respectively). In contrast, Backtranslation only helps when the training size is XS, but hurts performance on larger sizes.

Benefits of a Pretrained Generator.

We analyze the effect of using a pretrained generator by comparing our standard G-DAug-Rand setting with a setting where the generator is not pretrained, but instead trained from scratch. We find that using GPT-2 trained from scratch yields only a slight improvement over the non-augmented baseline on the WinoGrande-M validation set, far inferior to the improvement obtained when using the pretrained GPT-2. This suggests that using a pretrained generator is critical for G-DAug.

Staged training.

Method Accuracy
No Augmentation 75.9
Mixing 75.9
Importance Weighted Loss 76.6
Two Stage Training 77.7
Table 5: Comparison of different training strategies on WinoGrande-L. G-DAug’s two-stage training achieves the highest accuracy.

G-DAug uses a two-stage training method (see Section 3.2) aimed at mitigating effects of noise in the generated data. We report an analysis of alternative training strategies for the WinoGrande-L dataset in Table 5. Mixing refers to simply training on the union of the generated and original data, and importance-weighted loss is similar but downweights the generated examples such that their aggregate weight in the loss computation equals that of the original data. G-DAug’s staged training approach is shown to outperform the other methods on this dataset.
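One way to read the importance-weighted baseline is that each generated example receives weight N_orig / N_syn, so that the generated pool as a whole carries the same total weight as the original data. The sketch below follows that reading; the function name and the final normalization by twice the original-set size are assumptions of this example.

```python
def importance_weighted_loss(organic_losses, synthetic_losses):
    """Combine per-example losses so the synthetic pool's aggregate weight
    equals that of the original (organic) data.

    Each organic example has weight 1; each synthetic example has weight
    len(organic) / len(synthetic). The result is normalized by the total
    weight, which is 2 * len(organic)."""
    n_org, n_syn = len(organic_losses), len(synthetic_losses)
    syn_weight = n_org / n_syn
    total = sum(organic_losses) + syn_weight * sum(synthetic_losses)
    return total / (2 * n_org)

# Two organic examples and four synthetic ones: each synthetic weight is 0.5.
print(importance_weighted_loss([1.0, 3.0], [2.0, 2.0, 2.0, 2.0]))
```

With this weighting, enlarging the synthetic pool does not let it dominate the objective, in contrast to plain mixing.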


Method WinoGrande-L CSQA
Baseline 75.9 77.1
Generator label 76.2 78.1
Random relabeling 66.8 77.1
Model relabeling 77.7 77.7
Table 6: Validation accuracy of G-DAug with different labeling methods on WinoGrande-L and CommonsenseQA. Random relabeling assigns labels uniformly at random, and model relabeling assigns labels according to a task model. Random labels hurt accuracy, and model relabeling helps on WinoGrande but not on CommonsenseQA.

Even fully unsupervised language model pretraining can boost the performance of a task model, if performed on task-relevant data Gururangan et al. (2020). This raises the question of whether G-DAug boosts performance by simply exposing the model to more task-relevant text, or if the labels that G-DAug generates are in fact informative. A related question is whether G-DAug’s optional self-supervised relabeling improves performance. We analyze these questions in Table 6, evaluating G-DAug with three different labeling methods on WinoGrande-L and CommonsenseQA: 1) generator labels, 2) random relabeling, and 3) relabeling with a task model. When the generator labels are flipped randomly, G-DAug is unable to outperform the baselines for either dataset (and in fact dramatically underperforms the baseline on WinoGrande-L). This implies that the correctness of the labels plays an essential role in G-DAug. Relative to the generator labels, relabeling with a task model provides a 1.5 point absolute gain on WinoGrande-L, but a 0.4 point drop on CommonsenseQA.

Data Selection.

Synthetic Data Accuracy
Random (127478) 71.7
Influence (127478) 74.4
Diversity (127478) 73.0
Whole Pool (380700) 73.1
Table 7: Results comparing G-DAug’s filtering methods against using entire synthetic data pool for augmentation, on WinoGrande-M. The synthetic data sizes are shown in parentheses. The selection algorithms achieve comparable or better performance despite training on three times fewer synthetic examples.

G-DAug’s filtering methods are designed to identify a high-quality and diverse subset of the generated data, in order to reduce the training cost compared to training on the entire generated pool without harming accuracy. We evaluate whether G-DAug is successful at this aim in Table 7, by comparing G-DAug against using the entire synthetic data pool generated for G-DAug-Influence and G-DAug-Diversity (G-DAug-Combo utilizes a larger pool, so it is not comparable). The selection approaches provide comparable or better accuracy compared to using the whole pool, despite using only about one third as much synthetic training data.

Sharpness Analysis.

Previous work Hochreiter and Schmidhuber (1997); Keskar et al. (2016); Yao et al. (2019) has shown that models with flatter local minima tend to generalize better. Moreover, Hao et al. (2019) show that pretraining BERT helps achieve flat and wide optima in the finetuning stage, which partially explains its performance benefits. We investigate whether G-DAug’s data augmentation may also encourage flatter optima. Specifically, using the fact that a larger Hessian trace for a model implies a sharper local minimum Yao et al. (2019), we compute the Hessian trace of 10 baseline and 10 G-DAug-Combo models using the Hutchinson Method Avron and Toledo (2011) and find a relative decrease in the trace for G-DAug-Combo. This suggests that the minima for G-DAug-Combo are in fact flatter than the baseline, which may explain why they generalize better. A more thorough analysis of this hypothesis is an item of future work.
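The Hutchinson estimator approximates a trace using only matrix-vector products: for random vectors v with independent ±1 entries, the expectation of v^T H v equals tr(H). The sketch below uses explicit small matrices so that it runs standalone; for a neural network, `matvec` would instead be a Hessian-vector product obtained by automatic differentiation, which is an assumption about usage and is not shown here.

```python
import random

def hutchinson_trace(matvec, dim, num_samples=100, seed=0):
    """Estimate tr(H) as the average of v^T (H v) over Rademacher vectors v.
    matvec(v) must return the product H v as a list of floats."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = matvec(v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples

def dense_matvec(matrix):
    """Wrap an explicit matrix (list of rows) as a matvec function."""
    return lambda v: [sum(a * x for a, x in zip(row, v)) for row in matrix]

# For a diagonal matrix every sample equals the trace exactly (v_i^2 = 1).
diag = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
print(hutchinson_trace(dense_matvec(diag), dim=3))

# With off-diagonal entries the estimate is noisy but unbiased.
sym = [[2.0, 1.0], [1.0, 3.0]]
print(hutchinson_trace(dense_matvec(sym), dim=2, num_samples=2000))
```

The appeal for large models is that the full Hessian is never formed; each sample costs one Hessian-vector product.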


Figure 4: OpenIE analysis on original data and synthetic data used by G-DAug-Combo on WinoGrande-M. The synthetic dataset contains many more unique triplets, relations and entities compared to the original dataset, showing that the generated data adds substantial diversity to the training set.

Data Diversity.

G-DAug only very rarely generates questions that are exact duplicates in our observation, but an important question is how distinct the generated questions are from each other and the original training data. For example, does G-DAug introduce new entities and relations to the training data, or does it merely permute the same ones found in the original training set? We quantify the diversity of our synthetic dataset compared to the original data by counting the number of unique semantic units produced by performing Open Information Extraction Banko et al. (2007) on the data. Specifically, we run the Stanford Open IE package Angeli et al. (2015) and report the number of unique triplets, relations and entities extracted from our WinoGrande-M data sets in Figure 4. The synthetic data includes many more unique entities, relations, and triplets than the original training data, suggesting that G-DAug does introduce new semantic units into the training set.
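The counting step of this analysis can be sketched as follows. The sketch assumes the Open IE extractions are already available as (subject, relation, object) string tuples; running the extractor itself is out of scope here, and the function name `diversity_stats` is hypothetical.

```python
def diversity_stats(triples):
    """Count unique triplets, relations and entities in a list of
    (subject, relation, object) tuples."""
    unique_triples = set(triples)
    relations = {rel for _, rel, _ in unique_triples}
    # Both subjects and objects are counted as entities.
    entities = {ent for subj, _, obj in unique_triples for ent in (subj, obj)}
    return {
        "triplets": len(unique_triples),
        "relations": len(relations),
        "entities": len(entities),
    }

# Toy usage with made-up extractions.
example_triples = [
    ("cat", "chased", "mouse"),
    ("cat", "chased", "mouse"),  # exact duplicate, counted once
    ("dog", "chased", "cat"),
    ("dog", "ate", "bone"),
]
stats = diversity_stats(example_triples)
```

Running the same function on the original and the synthetic extractions gives the two sets of counts that a comparison like Figure 4 relies on.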

Method Test AUC
Baseline 67.5
Baseline + Generator 67.5
G-DAug-Combo 71.4
Table 8: Test performance on WinoGrande of an unaugmented baseline model, the baseline ensembled with a finetuned GPT-2 generator, and G-DAug-Combo. We use a weighted-average ensemble with weights tuned on validation data.

Generator/Task Model Ensemble.

G-DAug harnesses pretrained knowledge from GPT-2 in order to improve a RoBERTa-based task model; a more standard approach to this end (albeit with twice the computational cost at runtime) would be to ensemble the two models instead. We consider ensembling a baseline RoBERTa model with a finetuned GPT-2 generator for WinoGrande in Table 8. We adopt a weighted-average ensemble method, where the weights are tuned on validation data (the tuning is important to achieve peak performance). The ensemble model performs the same as the baseline model, and G-DAug-Combo outperforms both of them by 3.9 points. This suggests that G-DAug is more effective than simply ensembling the finetuned generator.
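A weighted-average ensemble with a validation-tuned weight can be sketched as below. This is an illustration under assumptions rather than the exact procedure used for Table 8: it assumes both models output per-choice probabilities of the same shape, it tunes a single mixing weight by grid search, and all function names are hypothetical.

```python
import numpy as np

def ensemble_probs(p_task, p_gen, weight):
    """Convex combination of two models' per-choice probabilities."""
    return weight * p_task + (1.0 - weight) * p_gen

def accuracy(probs, labels):
    """Fraction of examples whose highest-probability choice is the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def tune_weight(p_task_val, p_gen_val, labels_val, grid=None):
    """Pick the mixing weight with the best validation accuracy."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    best_weight, best_acc = None, -1.0
    for weight in grid:
        acc = accuracy(ensemble_probs(p_task_val, p_gen_val, weight), labels_val)
        if acc > best_acc:  # strict ">" keeps the first best weight on ties
            best_weight, best_acc = float(weight), acc
    return best_weight, best_acc

# Toy validation data: each model alone gets 2 of 3 examples right.
labels = np.array([0, 1, 0])
p_task = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
p_gen = np.array([[0.6, 0.4], [0.6, 0.4], [0.9, 0.1]])
best_weight, best_acc = tune_weight(p_task, p_gen, labels)
```

The tuned weight would then be held fixed when combining the two models' predictions on the test set.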

Rating Description Examples Count Pct.
1 Nonsensical "What is a square leg made of made out of?" / "What country does a cow go to make a milk run?" 54 3.89%
2 Ambiguous or unanswerable "A person is a human, but they are called what?" / "He hated flying, the controls were what?" 306 22.06%
3 Minor errors (e.g., grammar) "What do you put on your head to do when you’re swimming?" / "Where does a bugle call be played?" 138 9.95%
4 Coherent and Fluent "What is a person likely to feel when applying for jobs?" / "If you’re running late for work what would you be doing?" 889 64.10%
Table 9: Examples and prevalence of generated commonsense questions with different manually-assigned fluency ratings, for the CommonsenseQA dataset. Ratings of 3 and higher correspond to questions that are answerable and address common sense, and most of G-DAug’s generated questions fall into this category.


Finally, in Table 9, we analyze the fluency of G-DAug’s output. For this analysis, we report on CommonsenseQA, as it is the dataset for which we obtained hand-annotations of the output. We asked 3 human annotators to rate generated questions on their coherence and answerability on a scale from 1 to 4, where a rating of 3 denotes an acceptable question. We obtained 1387 labels in total. We measured annotator agreement on a separate set of 50 questions, obtaining a Fleiss’ Kappa of 0.41, which is at the low end of moderate annotator agreement and acceptable given the subjective nature of the task. We find that a large majority of questions (1027 out of 1387, about 74%) met this acceptability threshold, with an average rating of 3.34. Next, we asked annotators to select answers to the 1027 acceptable questions; when they were unable to pick a unique correct answer among the given choices, they could edit the choices but not the question. The editing rate is relatively high, at 55.3%. We mix these human-labeled generated examples into the original training set to train a RoBERTa model, and obtain a validation accuracy above the baseline and comparable to G-DAug, despite using 48973 fewer questions. This indicates that human labels may provide higher leverage than the noisy labels obtained through G-DAug, although of course human labels are much more expensive to obtain.
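For reference, the agreement statistic mentioned above can be computed as in the following sketch of the standard Fleiss' Kappa formula. The input layout (one row per question, one column per rating category, each cell holding the number of annotators who chose that category) and the function name are assumptions made for illustration; the toy counts are invented and are not the paper's annotation data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for an (items x categories) table of rating counts.

    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Share of all ratings that fall into each category.
    p_category = counts.sum(axis=0) / (n_items * n_raters)
    # Observed agreement among rater pairs, per item.
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_observed = p_item.mean()
    p_expected = np.sum(p_category ** 2)
    return float((p_observed - p_expected) / (1.0 - p_expected))

# Toy usage: 4 questions, 3 annotators, 2 rating categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.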

6 Related Work

Data augmentation is a common practice in computer vision, where it takes the form of image transformations like translation and rotation Perez and Wang (2017). For language tasks, data augmentation is less straightforward. Broadly, previous augmentation methods have used back-translation architectures Sennrich et al. (2016); Xie et al. (2019), heuristics based on syntactic and semantic properties of text including word replacements using a thesaurus Zhang et al. (2015); Wei and Zou (2019) and word embeddings Wang and Yang (2015); Fadaee et al. (2017); Kobayashi (2018); Wu et al. (2019), and very recently, generative models for synthesizing novel examples for text classification and reading comprehension Anaby-Tavor et al. (2020); Kumar et al. (2020); Puri et al. (2020). Our framework is similar to the last of these as we focus on generative models for data augmentation, but our work is the first to present a generative approach for the challenging commonsense QA setting, and we introduce new data selection approaches to improve the informativeness and diversity of the synthetic data.

Concurrently, there has been work on generating examples which are adversarial to a task model for the purpose of analyzing black-box classifier models. These approaches use generative adversarial networks Zhao et al. (2018b) and population-based optimization algorithms Alzantot et al. (2018). Finally, previous work has also presented techniques to generate questions for reading comprehension Heilman and Smith (2010); Rus et al. (2011), online tutoring Lindberg et al. (2013), factual QA Serban et al. (2016) and visual question generation Mostafazadeh et al. (2016). A more comprehensive survey on neural question generation can be found in Pan et al. (2019). Our work 1) targets question generation in a closed-book setting, 2) investigates generation of questions as well as answers and distractors, and 3) discusses how to use them effectively for data augmentation with novel data selection strategies.

7 Conclusion

In this work, we propose G-DAug, a novel data augmentation framework that augments the training data with informative and diverse synthetic training examples generated by a model. We demonstrate the effectiveness of G-DAug on multiple commonsense reasoning and natural language inference benchmarks, showing that G-DAug improves in-distribution performance, and robustness on perturbed evaluation sets and challenge sets. Our analysis shows that G-DAug tends to perform better in low-resource settings and our data selection strategies result in a more diverse and effective synthetic data pool. Further, we note that using a diverse synthetic data pool and pretraining the generator are both beneficial for augmentation. In future work, we hope to explore richer diversity heuristics, and active learning approaches that use G-DAug with human annotators in the loop.


This work was supported in part by NSF Grant IIS-1351029 and the Allen Institute for Artificial Intelligence. We thank Iz Beltagy, Jonathan Bragg, Isabel Cachola, Arman Cohen, Mike D’Arcy, Daniel King, Kyle Lo, and Lucy Lu Wang for helpful comments.


Appendix A Datasets


Talmor et al. (2019): CommonsenseQA is a multiple choice QA dataset that consists of 12,247 examples, which aims to test commonsense reasoning capabilities. We use the official random split 1.11 which is an 80/10/10 split. We apply greedy decoding to generate answers, as answers are fairly short for this dataset.


Sakaguchi et al. (2020): WinoGrande is a benchmark for commonsense reasoning, inspired by the original Winograd Schema Challenge design Levesque et al. (2011), with a larger dataset size and higher difficulty level. It consists of 44K questions with five different training sizes: 160, 640, 2,558, 10,234 and 40,398 questions. The evaluation metric is Area Under the (learning) Curve. We observe that applying top-2 greedy decoding on the answer generator is able to yield a satisfactory set of choices, so the distractor generator is not used in this task. The Winograd schema requires that questions in twin pairs have opposite labels Levesque et al. (2011). We use the following method to generate twin questions: 1) generate a sequence until a blank symbol "_" is produced; 2) use two independent runs of sampling to complete the question in two different ways to form twins. The above process does not guarantee that the labels will differ for the two twins, so we further filter out generated pairs that do not have different labels.
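The twin-generation procedure described above can be sketched as a small rejection loop. The three callables stand in for the finetuned generators and the labeling step; their names and signatures are hypothetical, and the toy stubs in the usage example are not real model outputs.

```python
def generate_twin_pairs(sample_prefix, sample_completion, predict_label, num_attempts):
    """Generate twin questions and keep only pairs whose labels differ.

    sample_prefix:     callable returning text that ends with the blank "_".
    sample_completion: callable mapping a prefix to one sampled continuation.
    predict_label:     callable mapping a full question to its label.
    All three are placeholders for model calls (assumed interfaces).
    """
    pairs = []
    for _ in range(num_attempts):
        prefix = sample_prefix()
        # Two independent sampling runs complete the same prefix.
        first = prefix + sample_completion(prefix)
        second = prefix + sample_completion(prefix)
        # Rejection step: twins must have different labels.
        if predict_label(first) != predict_label(second):
            pairs.append((first, second))
    return pairs

# Toy usage with deterministic stubs in place of the generators.
toy_completions = iter([" was big.", " was small.", " was big.", " was big."])
twin_pairs = generate_twin_pairs(
    sample_prefix=lambda: "The trophy did not fit in the suitcase because the _",
    sample_completion=lambda prefix: next(toy_completions),
    predict_label=lambda question: 1 if "big" in question else 2,
    num_attempts=2,
)
```

In this toy run the first attempt yields twins with different labels and is kept, while the second attempt yields two identical completions and is rejected.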


Chen et al. (2019): Codah is an adversarially-constructed benchmark which tests commonsense reasoning using sentence-completion questions, inspired by the Swag dataset Zellers et al. (2018). It contains 2801 questions in total, and uses 5-fold cross validation for evaluation. (The original CODAH work does not specify a particular 5-fold split, so we choose these randomly. We will release our splits for replicability.) We lower the temperature to 0.5 for the answer generation in order to increase the confidence of the generated answers.


Zellers et al. (2019): HellaSwag is a more challenging version of the Swag dataset Zellers et al. (2018), and the task is similar to Codah. The dataset consists of 70K questions where each question comes from one of two domains: ActivityNet or WikiHow. In order to test our methods under a low-resource setting, we downsample the training set to 2,000 examples. We take a random sample of 1000 questions from the original validation set to serve as our validation data, and another non-overlapping random sample of 5,000 questions from the same set as our test data. The generation settings are the same as Codah’s.


Bowman et al. (2015): SNLI is a natural language inference dataset with 570K pairs of labeled sentences. The label assigned to each sentence pair is one of entailment, contradiction or neutral. For low-resource experiments, we downsample the dataset to 3K training examples, which contain 1K unique premises, each with a hypothesis for each of the three labels. Similarly, we use a downsampled development set with 999 examples (333 premises, each with one hypothesis per label). The generative model is fine-tuned by providing the premise, label and hypothesis, separated by special delimiters marking the beginning and end of each element.


Clark et al. (2018): The ARC Dataset consists of 7787 natural grade-school science questions that are used on standardized tests. The ARC-Challenge Set contains 2590 questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We use the official split, which is 1119/299/1172 for train/validation/test. The generation settings are the same as CommonsenseQA’s.

Appendix B Validation Set Results

In Table 10, we summarize our main results on the validation sets, comparing proposed G-DAug methods against a non-augmented baseline and a backtranslation augmentation baseline. All G-DAug methods consistently outperform baseline methods in every benchmark. The proposed three selection methods provide an extra boost on average, compared to G-DAug-Rand. Among those, G-DAug-Influence achieves the best performance across all tasks, which is expected as G-DAug-Influence selects examples which are helpful in reducing validation loss. Interestingly, G-DAug-Combo has a lower score than G-DAug-Influence does although it outperforms G-DAug-Diversity. Finally, backtranslation does not demonstrate any benefit and obtains lower results compared to the non-augmented baseline in all benchmarks.

RoBERTa (reported) 78.4 66.6 - - - -
RoBERTa (ours) 77.1 68.4 84.2 75.2 91.8 79.3
Backtranslation 76.4 67.7 83.4 74.2 91.2 78.6
G-DAug-Rand 78.1 72.0 85.7 77.2 91.8 81.0
G-DAug-Influence 78.8 73.0 87.2 78.3 92.3 81.9
G-DAug-Diversity 78.1 72.8 86.0 76.6 92.0 81.1
G-DAug-Combo 78.2 72.7 86.7 77.5 91.9 81.4
Table 10: Results on the validation sets of five commonsense benchmarks. For WinoGrande, AUC (Area Under the (learning) Curve) is reported to combine accuracy scores from different training sizes. For Codah, average accuracy over all validation folds is reported. RoBERTa (reported) gives the RoBERTa-large baseline scores reported in previous works. RoBERTa (ours) is our implementation of the RoBERTa-large baseline. All G-DAug methods outperform the baseline methods; in particular, G-DAug-Influence performs the best on all tasks, which is expected as it selects examples which are helpful in reducing validation loss.

Appendix C Results on NLI Diagnostics

In Table 11, we report results for evaluating the NLI models on the NLI Diagnostics benchmark Wang et al. (2018a).

Method Accuracy
RoBERTa (ours) 56.70
Backtranslation 53.99
G-DAug-Rand 57.43
G-DAug-Influence 56.88
G-DAug-Diversity 57.70
G-DAug-Combo 57.61
Table 11: Results on the NLI Diagnostics challenge set Wang et al. (2018a).

Appendix D Input Formats

Here, we specify the input formats for finetuning GPT-2 and RoBERTa in Tables 12 and 13.

Task Format
Table 12: Input formats for GPT-2. "Q:" and "A:" are the prefixes for a question and a candidate answer (choice).
Task Format
Table 13: Input formats for RoBERTa. "Q:" and "A:" are the prefixes for a question and a candidate answer (choice).

Appendix E Hyperparameter Settings

Hyperparameter settings for finetuning GPT-2, RoBERTa and G-DAug are shown in Tables 14, 15, 16, 17 and 18. We manually tune the learning rate and the number of epochs for GPT-2 finetuning based on validation perplexity. For finetuning RoBERTa baseline models, we select the number of epochs from {1,3,5,8,10} based on validation accuracy for CSQA, WinoGrande and HellaSwag-2K. For Codah, SNLI-3K and ARC-Challenge, we simply use 5 epochs. For G-DAug synthetic training, we train all models using a learning rate of 5e-6 for one epoch. For G-DAug organic training, we use the same hyperparameter settings as RoBERTa baselines (except for CSQA and HellaSwag-2K, where we find reducing 2 epochs gives significantly better results).
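The hyperparameter tables list a warmup ratio of 0.06 together with linear learning-rate decay. The sketch below shows one common way such a schedule is computed: linear warmup from zero to the base rate, then linear decay back to zero. The exact schedule used in the experiments is not specified in the text, so this form and the function name are assumptions.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.06):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Toy usage: with 1000 total steps the peak is reached at step 60.
peak_lr = linear_warmup_decay_lr(60, total_steps=1000)
halfway_warmup_lr = linear_warmup_decay_lr(30, total_steps=1000)
```

With 1000 total steps, warmup covers the first 60 steps and the rate then falls linearly until it reaches zero at the final step.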

Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Version Large Medium Medium Medium Large Medium
Hardware I9-7900X RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 8000 RTX 2080Ti
Optimizer AdamW AdamW AdamW AdamW AdamW AdamW
Adam β1 0.9 0.9 0.9 0.9 0.9 0.9
Adam β2 0.98 0.98 0.98 0.98 0.999 0.98
Adam ε 1e-6 1e-6 1e-6 1e-6 1e-8 1e-6
Mixed Precision No Yes Yes Yes Yes Yes
LR (q/a/d) 1e-5/5e-6/2e-5 * 4e-5/5e-5/5e-5 4e-5/5e-5/5e-5 5e-5 2e-5/1e-5/1e-5
Epochs (q/a/d) 3/5/3 * 3/3/3 3/3/3 3 3/5/5
Grad Clipping 1.0 1.0 1.0 1.0 1.0 1.0
Weight Decay 0.01 0.01 0.01 0.01 0.0 0.01
Batch Size 16 16 16 16 16 16
Max Length (q/a/d) 62/70/70 72/72/- 62/92/92 62/128/128 128 90/120
Warmup Ratio 0.06 0.06 0.06 0.06 0.06 0.06
LR Decay Linear Linear Linear Linear Linear Linear
Table 14: Hyperparameter settings for finetuning GPT-2. "q/a/d" stands for "question/answer/distractor". Some hyperparameters for WinoGrande are shown in a separate table as they vary with the training set size.
Hyperparam XS S M L XL
LR (q/a) 5e-5/5e-5 2e-5/5e-5 2e-5/5e-5 2e-5/5e-5 1e-5/5e-5
Epochs (q/a) 8/12 6/6 3/3 3/3 3/1
Table 15: Hyperparameter settings for finetuning GPT-2 on WinoGrande.
Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Version Large Large Large Large Large Large
Hardware RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 8000 RTX 2080Ti
Optimizer AdamW AdamW AdamW AdamW AdamW AdamW
Adam β1 0.9 0.9 0.9 0.9 0.9 0.9
Adam β2 0.98 0.98 0.98 0.98 0.98 0.98
Adam ε 1e-6 1e-6 1e-6 1e-6 1e-6 1e-6
Mixed Precision Yes Yes Yes Yes Yes Yes

LR 1e-5 * 1e-5 1e-5 1e-5 1e-5
Epochs 5 * 5 3 5 5
Grad Clipping 0.0 0.0 0.0 0.0 0.0 0.0
Weight Decay 0.01 0.01 0.01 0.01 0.01 0.01
Batch Size 16 16 16 16 16 16
Max Length 70 70 90 128 128 120
Warmup Ratio 0.06 0.06 0.06 0.06 0.06 0.06
LR Decay Linear Linear Linear Linear Linear Linear
Table 16: Hyperparameter settings for finetuning RoBERTa. Some hyperparameters for WinoGrande are shown in a separate table as they vary with the training set size.
Hyperparam XS S M L XL

LR 1e-5 1e-5 1e-5 1e-5 1e-5
Epochs 10 8 5 5 5

Table 17: Hyperparameter settings for finetuning RoBERTa on WinoGrande.
Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Synthetic Data Size 50K 50K-130K 100K 50K 100K 50K
LR (synthetic) 5e-6 5e-6 5e-6 5e-6 5e-6 5e-6
Epochs (synthetic) 1 1 1 1 1 1

Table 18: Additional hyperparameter settings for G-DAug Two Stage Separate Training. For WinoGrande, we generate 400K examples before the rejection procedure (see Appendix A); the number after the rejection procedure ranges approximately from 50K to 130K for different training sizes. For finetuning on original data, we use exactly the same settings as RoBERTa (except for CSQA and HellaSwag-2K, where we find reducing 2 epochs gives significantly better results).

Appendix F Influence Functions

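In practice, since the generalization error is usually approximated by validation loss, a training example $x_i$ is considered detrimental if it increases validation loss, i.e.:

$$\mathcal{L}(\mathcal{X}, \hat{\theta}(\mathcal{X}_{train} \cup \{x_i\})) - \mathcal{L}(\mathcal{X}, \hat{\theta}(\mathcal{X}_{train})) > 0,$$

where $\mathcal{X}_{train}$ is a training set, $\mathcal{X}$ is a validation set, $\mathcal{L}(\mathcal{X}, \theta) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} l(x, \theta)$ is a loss function, and $\hat{\theta}(\mathcal{X}_{train}) = \arg\min_{\theta \in \Theta} \mathcal{L}(\mathcal{X}_{train}, \theta)$ is an empirical risk minimizer.

The main result from Atkinson et al. (1983); Koh and Liang (2017) tells us that the influence of upweighting a training example $x$ by some small $\epsilon$ on the model parameters $\hat{\theta}$ with the corresponding parameter space $\Theta$ is given by:

$$\hat{\theta}_{\epsilon, x} = \arg\min_{\theta \in \Theta} \; \epsilon \, l(x, \theta) + \frac{1}{\sum_{i} w_i} \sum_{i} w_i \, l(x_i, \theta),$$

$$\mathcal{I}_{up,params}(x) := \left. \frac{d \hat{\theta}_{\epsilon, x}}{d \epsilon} \right|_{\epsilon = 0} = -H_{\hat{\theta}}^{-1} \nabla_{\theta} l(x, \hat{\theta}),$$

where $w_i$ is the weight for the training example $x_i$ and $H_{\hat{\theta}}$ is the Hessian of the weighted empirical risk evaluated at $\hat{\theta}$. The above result is a slight generalization of Koh and Liang (2017), since the simple average used in that work is a special case of our weighted average, but it is straightforward to generalize their proof to our weighted empirical risk case and we omit the details of the proof in this paper. Then, we apply the chain rule to get the influence of upweighting $x$ on the validation loss:

$$\mathcal{I}_{up,loss}(x) := \left. \frac{d \mathcal{L}(\mathcal{X}, \hat{\theta}_{\epsilon, x})}{d \epsilon} \right|_{\epsilon = 0} = \nabla_{\theta} \mathcal{L}(\mathcal{X}, \hat{\theta})^{\top} \mathcal{I}_{up,params}(x).$$

Note that $\mathcal{L}(\mathcal{X}_{train}, \theta)$ can be rewritten as the following weighted average form to incorporate a new training example term $x_{new}$:

$$\mathcal{L}(\mathcal{X}_{train}, \theta) = \frac{1}{\sum_{i=1}^{N+1} w_i} \sum_{i=1}^{N+1} w_i \, l(x_i, \theta),$$

where $w_i = 1$ for all $i \neq N+1$, $w_{N+1} = 0$, and $x_{N+1} = x_{new}$. Adding the new training example $x_{new}$ is equivalent to upweighting $x_{N+1}$ by $\frac{1}{N}$:

$$\mathcal{L}(\mathcal{X}_{train} \cup \{x_{new}\}, \theta) \propto \frac{1}{N} l(x_{N+1}, \theta) + \frac{1}{\sum_{i=1}^{N+1} w_i} \sum_{i=1}^{N+1} w_i \, l(x_i, \theta).$$

Applying the influence function $\mathcal{I}_{up,loss}(x)$, we obtain the following linear approximation of the validation loss change upon adding the training example $x_{new}$:

$$\mathcal{L}(\mathcal{X}, \hat{\theta}(\mathcal{X}_{train} \cup \{x_{new}\})) - \mathcal{L}(\mathcal{X}, \hat{\theta}(\mathcal{X}_{train})) \approx \frac{1}{N} \mathcal{I}_{up,loss}(x_{new}).$$

We adopt the stochastic estimation method described in Koh and Liang (2017) to efficiently compute $\mathcal{I}_{up,loss}$. Detrimental synthetic data will have $\frac{1}{N} \mathcal{I}_{up,loss}(x_{new}) > 0$.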
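As a small worked illustration of this linear approximation, the sketch below applies it to a least-squares model, where the Hessian can be formed and inverted exactly. This is a simplification chosen for clarity: it does not use the stochastic estimation that a large neural model requires, and all function names are hypothetical.

```python
import numpy as np

def fit_least_squares(features, targets):
    """Empirical risk minimizer for the per-example loss 0.5 * (a^T theta - b)^2."""
    theta, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return theta

def mean_loss(features, targets, theta):
    residuals = features @ theta - targets
    return 0.5 * float(np.mean(residuals ** 2))

def predicted_val_change(a_tr, b_tr, a_val, b_val, a_new, b_new):
    """Influence-function estimate of the validation loss change from adding
    the example (a_new, b_new); a positive value flags it as detrimental."""
    n = a_tr.shape[0]
    theta = fit_least_squares(a_tr, b_tr)
    hessian = a_tr.T @ a_tr / n                       # Hessian of the training risk
    grad_new = a_new * (a_new @ theta - b_new)        # gradient of l at the new example
    grad_val = a_val.T @ (a_val @ theta - b_val) / a_val.shape[0]
    influence_params = -np.linalg.solve(hessian, grad_new)
    return float(grad_val @ influence_params) / n     # upweighting by 1/N

def actual_val_change(a_tr, b_tr, a_val, b_val, a_new, b_new):
    """Validation loss change measured by retraining with the example added."""
    theta_old = fit_least_squares(a_tr, b_tr)
    theta_new = fit_least_squares(np.vstack([a_tr, a_new]), np.append(b_tr, b_new))
    return mean_loss(a_val, b_val, theta_new) - mean_loss(a_val, b_val, theta_old)

# Toy data: a single constant feature, so the model simply estimates a mean.
a_tr, b_tr = np.ones((4, 1)), np.array([0.0, 2.0, 4.0, 2.0])
a_val, b_val = np.ones((2, 1)), np.array([3.0, 3.0])
helpful = predicted_val_change(a_tr, b_tr, a_val, b_val, np.array([1.0]), 6.0)
harmful = predicted_val_change(a_tr, b_tr, a_val, b_val, np.array([1.0]), -2.0)
```

In this toy setting the estimate is negative for the example that pulls the fitted mean toward the validation targets and positive for the one that pulls it away, matching the sign of the change measured by retraining, although the magnitudes differ because the approximation is only first-order.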