Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension

Reading comprehension models often overfit to nuances of training datasets and fail at adversarial evaluation. Training with an adversarially augmented dataset improves robustness against those adversarial attacks but hurts generalization of the models. In this work, we present several effective adversaries and automated data augmentation policy search methods with the goal of making reading comprehension models more robust to adversarial evaluation while also improving generalization to the source domain as well as new domains and languages. We first propose three new methods for generating QA adversaries that introduce multiple points of confusion within the context, show dependence on the insertion location of the distractor, and reveal the compounding effect of mixing adversarial strategies with syntactic and semantic paraphrasing methods. Next, we find that augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset. We address this issue via RL and more efficient Bayesian policy search methods that automatically learn the best augmentation policy, i.e., the combination of transformation probabilities for each adversary, in a large search space. Using these learned policies, we show that adversarial training can lead to significant improvements in in-domain, out-of-domain, and cross-lingual generalization without any use of training data from the target domain or language.




1 Introduction

There has been growing interest in understanding NLP systems and exposing their vulnerabilities through maliciously designed inputs Iyyer et al. (2018); Belinkov and Bisk (2018); Nie et al. (2019); Gurevych and Miyao (2018). Adversarial examples are generated using search Alzantot et al. (2018), heuristics Jia and Liang (2017) or gradient Ebrahimi et al. (2018) based techniques to fool the model into giving the wrong outputs. Often, the model is further trained on those adversarial examples to make it robust to similar attacks. In the domain of reading comprehension (RC), adversaries are QA samples with distractor sentences that have significant overlap with the question and are randomly inserted into the context. By having a fixed template for creating the distractors and training on them, the model identifies learnable biases and overfits to the template instead of being robust to the attack itself Jia and Liang (2017). Hence, we first build on Wang and Bansal (2018)’s work of adding randomness to the template and significantly expand the pool of distractor candidates by introducing multiple points of confusion within the context, adding dependence on insertion location of the distractor, and further combining distractors with syntactic and semantic paraphrases to create combinatorially adversarial examples that stress-test the model’s language understanding capabilities. These adversaries inflict up to 45% drop in performance of reading comprehension models built on top of large pretrained models like RoBERTa Liu et al. (2019).

Next, to improve robustness to the aforementioned adversaries, we finetune the reading comprehension model with a combined augmented dataset containing an equal number of samples from all of the adversarial transformations. While this improves robustness by a significant margin, it leads to a decline in performance on the original unaugmented dataset. Hence, instead of uniformly sampling from the various adversarial transformations, we propose to search for the best adversarial policy combinations that improve robustness against the adversarial attacks and also preserve or improve accuracy on the original dataset via data augmentation. However, manually tuning the transformation probability for each adversary and repeating the process for each target dataset is slow, expensive and prone to inductive bias, and so we present several RL and Bayesian search methods to learn this policy combination automatically.

For this, we create a large augmentation search space of up to $10^6$, with four adversarial methods, two paraphrasing methods and a discrete binning of probability space for each method (see Figure 1). Cubuk et al. (2019) showed via AutoAugment that an RNN controller can be trained using reinforcement learning to efficiently find the best policy in a large search space. However, AutoAugment is computationally expensive and relies on the assumption that the policy searched using rewards from a smaller model and reduced dataset will generalize to bigger models. Alternatively, the augmentation methods can be modelled with a surrogate function, such as Gaussian processes Rasmussen (2003), and subjected to Bayesian optimization Snoek et al. (2012), drastically reducing the number of training samples required for achieving similar results (available as a software package for computer vision). Hence, we extend these ideas to NLP and perform a systematic comparison between AutoAugment and our more efficient BayesAugment version.

Finally, there has been limited previous work exploring the role of adversarial data augmentation in improving generalization of reading comprehension models to out-of-domain and cross-lingual data. Hence, we also perform automated policy search of adversarial transformation combinations for enhancing generalization from English Wikipedia to datasets in other domains (news) and languages (Russian, German). We show that augmentation policies for the source domain, learned using target domain performance as reward, improve the model’s generalization to the target domain without using any training data from that domain. Similarly, we use adversarial examples in a pivot language (in our case, English) to improve performance on other languages’ RC datasets without using any data from that language for training.

Our contributions can be summarized as follows:

  • We first propose novel adversaries for reading comprehension that cause up to 45% drop in large pretrained models’ performance. Augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset.

  • We next demonstrate that optimal adversarial policy combinations of transformation probabilities (for augmentation and generalization) can be automatically learned using different policy search methods. Our experiments also show that efficient Bayesian optimization achieves results similar to AutoAugment with a fraction of the resources.

  • By training on the augmented data generated via the learned adversarial policies, we not only improve adversarial robustness of the models but also show significant gains, i.e., up to 2.07%, 4.0%, and 2.08% improvement for in-domain, out-of-domain, and cross-lingual evaluation respectively.

| Adversary Method | Description | Original Question/Sentence and Corresponding Distractor |
|---|---|---|
| AddSentDiverse | Jia and Liang (2017); Wang and Bansal (2018) | Q: In what country is Normandy located? |
| | | D: D-Day is located in the country of Sri Lanka. |
| AddKSentDiverse | Multiple AddSentDiverse distractors are inserted randomly in the context. | Q: Which county is developing its business center? |
| | | D1: The county of Switzerland is developing its art periphery. |
| | | D2: The county of Switzerland is developing its home center. |
| AddAnswerPosition | Answer span is preserved in this distractor. It is most misleading when inserted before the original answer. | Q: What is the steam engine’s thermodynamic basis? |
| | | A: The Rankine cycle is the fundamental thermodynamic underpinning of the steam engine. |
| | | D: Rankine cycle is the air engine’s thermodynamic basis. |
| InvalidateAnswer | AddSentDiverse and additional elimination of the original answer. | Q: Where has the official home of the Scottish Parliament been since 2004? |
| | | D: Since October 2002, the unofficial abroad of the Welsh Assembly has been a old Welsh Assembly Houses, in the Golden Gate Bridge area of Glasgow. |
| PerturbAnswer | Content words (except named entities) are algorithmically replaced with synonyms and evaluated for consistency using a language model. | A: The UK refused to sign the Social Charter and was exempt from the legislation covering Social Charter issues unless it agreed to be bound by the legislation. |
| | | P: The UK repudiated to signature the Social Charter and was exempt from the legislation encompassing Social Charter issues unless it consented to be related by the legislation. |
| PerturbQuestion | A syntactic paraphrasing network is used to generate the source question with a different syntax. | Q: In what country is Normandy located? |
| | | P: Where does Normany exist? |

Table 1: Demonstration of the various adversary functions used in our experiments (Q=Question, D=Distractor, A=Answer, P=Paraphrase). Words that have been modified using adversarial methods are italicized in the distractor.

2 Related Work

Adversarial Methods in NLP: Following the introduction of adversarial evaluation for reading comprehension models by Jia and Liang (2017), there has been extensive work on developing methods for probing the sensitivity and stability of various NLP models Nie et al. (2019); Glockner et al. (2018). Wang and Bansal (2018) modified the AddSent algorithm Jia and Liang (2017) to increase variance in adversarial data and improve robustness of BiDAF Seo et al. (2017) to adversarial attacks. Zhao et al. (2018) employ GANs to generate semantically meaningful adversaries for machine translation and textual entailment. With a similar goal, Ren et al. (2019) and Alzantot et al. (2018) use a synonym-substitution strategy to preserve the semantics and syntax of the original text in its adversarial counterpart; the latter also use population-based optimization to generate more effective adversaries. Miyato et al. (2017) create adversarial examples by adding noise to word embeddings. Using a white-box approach, Ebrahimi et al. (2018) showed that gradient-based perturbations can be used to trick a character-level neural classifier. While most methods attack the semantics of a sentence, Iyyer et al. (2018) construct a syntactic paraphrasing network to introduce robustness to syntactic variance in sentiment analysis models.

Augmentation and Generalization: Some works in computer vision use adversarial training to demonstrate improvement for small datasets in the fully-supervised setting Goodfellow et al. (2015) or larger datasets in the semi-supervised setting Miyato et al. (2018). More recently, Xie et al. (2020) use an enhanced adversarial training scheme with auxiliary batch normalization modules to improve image recognition using adversarial examples. The most effective augmentation techniques for RC have been backtranslation Yu et al. (2018) and pre-training with other QA datasets Devlin et al. (2019). Virtual adversarial training Miyato et al. (2017) also shows improvements on some RC datasets Yang et al. (2019). Talmor and Berant (2019) show that finetuning a pre-trained RC model with the target domain improves generalization. Cao et al. (2020) propose a conditional adversarial self-training method to reduce domain distribution discrepancy. Lee et al. (2019); Wang et al. (2019) use a discriminator to enforce domain-invariant representation learning and improve generalization to out-of-domain datasets Fisch et al. (2019). Similar attempts were made to learn language-invariant representations for cross-lingual text classification Chen et al. (2018) and lexicon induction Zhang et al. (2017). We show that heuristics-based adversaries can be leveraged as augmentation as well as generalization data.

Policy Search: The AutoAugment algorithm Cubuk et al. (2019) uses reinforcement learning to effectively explore a large search space and find the best augmentation policies for a downstream task. Niu and Bansal (2019) use AutoAugment to discover perturbation policies for dialogue generation. Ho et al. (2019) use population-based augmentation (PBA) techniques Jaderberg et al. (2017) and significantly reduce the compute time required by AutoAugment algorithm. In a follow-up work, Cubuk et al. (2019) present RandAugment that reduces the task to simple grid-search by retaining only global parameters. We are the first to adapt RandAugment style techniques for NLP via our BayesAugment method. RandAugment enforces uniform transformation probability on all augmentation methods and collapses the augmentation policy search space to two global parameters while BayesAugment eliminates the need to choose between adversarial methods and optimizes only for their transformation probabilities (see Sec. 3.3 for more details).

3 Adversary Policy Design

As shown by Jia and Liang (2017), QA models are susceptible to random, semantically meaningless and even minor changes in the data distribution. We extend this work and propose adversaries that exploit the model’s sensitivity to the insert location of the distractor, the number of distractors, combinatorial adversaries, etc. After exposing the model’s weaknesses, we address them by training on these adversaries and show that the model’s robustness to adversarial attacks increases significantly as a result. Finally, in Sec. 4, we automatically learn the right combination of transformation probability for each adversary in response to a target improvement using policy search methods.

3.1 Adversary Transformations

We present two types of adversaries, namely positive perturbations and negative perturbations (or attacks) (Figure 1). Positive perturbations are adversaries generated using methods that have been traditionally used for data augmentation in NLP i.e., semantic and syntactic transformations. Negative perturbations are adversaries based on the classic AddSent model Jia and Liang (2017) that exploit the RC model’s shallow language understanding to mislead it to incorrect answers.

AddSentDiverse: We use the method outlined by Wang and Bansal (2018) for AddSentDiverse to generate a distractor sentence (see Table 1) and insert it randomly within the context of a QA sample. In addition to WordNet, we use ConceptNet Speer et al. (2017) for a wider choice of antonyms during generation of adversary. QA pairs that do not have an answer within the given context are also augmented with AddSentDiverse adversaries.

AddKSentDiverse: The AddSentDiverse method is used to generate multiple distractor sentences for a given context. Each of the distractor sentences is then inserted at independently sampled random positions within the context. The distractors may or may not be similar to each other. Introducing multiple points of confusion is a more effective technique for misleading the model and reduces the scope of learnable biases during adversarial training by adding variance.
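The insertion step of AddKSentDiverse can be sketched in a few lines. This is a minimal illustration, assuming the context has already been split into sentences and the K distractors have already been generated; the function name `insert_distractors` and the sentence-list representation are hypothetical and not taken from the paper.

```python
import random

def insert_distractors(context_sentences, distractors, rng):
    """Insert each distractor at an independently sampled position.

    Assumption: positions are sampled uniformly over all gaps between
    sentences, including before the first and after the last sentence.
    """
    sentences = list(context_sentences)  # copy so the input is not modified
    for distractor in distractors:
        position = rng.randint(0, len(sentences))  # randint is inclusive on both ends
        sentences.insert(position, distractor)
    return sentences

rng = random.Random(0)
context = ["Sentence one.", "Sentence two.", "Sentence three."]
augmented = insert_distractors(context, ["Distractor A.", "Distractor B."], rng)
```

Because each position is drawn separately, the distractors can land next to each other or far apart, which is the source of the added variance described above.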

AddAnswerPosition: The original answer span is retained and placed within a distractor sentence generated using a combination of AddSentDiverse and random perturbations to maximize semantic mismatch. We modify the evaluation script to compare exact answer span locations in addition to the answer phrase and fully penalize incorrect locations. For practical purposes, if the model predicts the answer span within adversarial sentence as output, it does not make a difference. However, it brings into question the interpretability of such models. This distractor is most effective when placed right before the original answer sentence, showing dependence on insert location of distractor.

InvalidateAnswer: The sentence containing the original answer is removed from the context. Instead, a distractor sentence generated using AddSentDiverse is introduced to the context. This method is used to augment the adversarial NoAnswer-style samples in SQuAD v2.0.

PerturbAnswer (Semantic Paraphrasing): Following Alzantot et al. (2018), we perform semantic paraphrasing of the sentence which contains the answer span. Instead of using a genetic algorithm, we adapt their Perturb subroutine to generate paraphrases and use the OpenAI-GPT Radford et al. (2018) model to rank them. See appendix for details of this method.

PerturbQuestion (Syntactic Paraphrasing): We use the syntactic paraphrase network introduced by Iyyer et al. (2018) to generate syntactic adversaries. Sentences from the context of QA samples tend to be long and have complicated syntax. The corresponding syntactic paraphrases generated by the paraphrasing network usually miss out on half of the source sentence. Therefore, we choose to perform paraphrasing on the questions. We generate 10 paraphrases for each question and rank them based on cosine similarity, computed between the mean of word embeddings Pennington et al. (2014) of source sentence and generated paraphrases Niu and Bansal (2018); Liu et al. (2016).
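The ranking step for PerturbQuestion can be illustrated with a small sketch. It assumes whitespace tokenization and a dictionary of word vectors; the two-dimensional toy vectors below are made up for illustration and stand in for pretrained embeddings, and the function names are hypothetical.

```python
import math

def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_paraphrases(source, candidates, embeddings, dim):
    """Sort candidate paraphrases by similarity to the source, best first."""
    src = mean_embedding(source.lower().split(), embeddings, dim)
    scored = [
        (cosine(src, mean_embedding(c.lower().split(), embeddings, dim)), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, key=lambda sc: sc[0], reverse=True)]

# Toy 2-d vectors (hypothetical values, for illustration only).
toy = {"where": [1.0, 0.0], "is": [0.0, 1.0], "normandy": [1.0, 1.0]}
ranked = rank_paraphrases("where is normandy", ["is is", "where normandy"], toy, dim=2)
```

In this toy setup the candidate that keeps the content word "normandy" ranks above the one that drops it.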

Finally, we combine negative perturbations with positive perturbations to create adversaries that challenge the model’s language understanding capabilities on both fronts at once. In our experiments, these combined adversaries lead to a larger drop in performance when tested on reading comprehension models trained on the original unaugmented datasets (Table 2).

Figure 1: Flow chart of training loop for AutoAugment controller and Bayesian optimizer. See Sec. 4 for details.

3.2 Semantic Difference Check

To make sure that the distractor sentences are sufficiently different from the original sentence, we perform a semantic difference check in two steps:

  1. Extract content phrases from the original sentence. A content phrase is any common NER phrase or one of the four: noun, verb, adverb, adjective.

  2. There should be at least 2 content phrases in the original text that aren’t found in the distractor.

We examined 100 randomly sampled original-distractor sentence pairs and found that our semantic difference check works for 96% of the cases.
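The two-step check can be sketched as follows. The paper's check relies on NER and part-of-speech tags; to keep the example self-contained, the sketch below substitutes a hypothetical stopword filter as a rough stand-in for content-phrase extraction, so it is an approximation of the idea rather than the procedure used in the experiments.

```python
import re

# Assumption: a small stopword list approximates "not a content word".
# A real pipeline would use a POS/NER tagger to keep nouns, verbs,
# adverbs, adjectives and named-entity phrases.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "be", "been", "of", "in",
    "on", "to", "at", "by", "for", "with", "and", "or", "its", "it", "has",
    "have", "what", "which", "where", "who",
}

def content_phrases(sentence):
    """Step 1: extract (approximate) content words from a sentence."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}

def is_sufficiently_different(original, distractor, min_missing=2):
    """Step 2: require at least `min_missing` content phrases of the
    original that do not appear in the distractor."""
    missing = content_phrases(original) - content_phrases(distractor)
    return len(missing) >= min_missing

ok = is_sufficiently_different(
    "Which county is developing its business center?",
    "The county of Switzerland is developing its art periphery.",
)
```

For the Table 1 example above, "business" and "center" are absent from the distractor, so the pair passes the check.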

3.3 Adversarial Policy & Search Space

Reading comprehension models are often trained with adversarial samples in order to improve robustness to the corresponding adversarial attack. We seek to find the best combination of adversaries for data augmentation that also preserves or improves accuracy on the source domain and improves generalization to a different domain or language. Following previous work in AutoAugment policy search Cubuk et al. (2019); Niu and Bansal (2019), we define a sub-policy to be a set of adversarial transformations which are applied to a QA sample to generate an adversarial sample. We show that adversaries are most effective when positive and negative perturbations are applied together (Table 2). Hence, to prepare one sub-policy, we select one of the four negative perturbations (or none), combine it with one of the two positive perturbations (or none) and assign the combination a transformation probability (see Figure 1). The probability space is discretized into 6 equally spaced bins. This leads to a search space of $5 \times 3 \times 6 = 90$ for a single sub-policy. Next, we define a complete adversarial policy as a set of $n$ sub-policies with a search space of $90^n$. For each input QA sample, one of the sub-policies is randomly sampled and applied (with a probability equal to the transformation probability) to generate the adversarial sample.
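The size of the sub-policy space and the way a policy is applied to a QA sample can be made concrete with a short sketch. The bin values (0.0 to 1.0 in steps of 0.2), the order in which the negative and positive perturbations are applied, and the names `SUB_POLICIES`, `sample_policy` and `apply_policy` are assumptions made for illustration; the transformation functions are passed in as placeholders.

```python
import itertools
import random

NEGATIVE = ["AddSentDiverse", "AddKSentDiverse", "AddAnswerPosition",
            "InvalidateAnswer", None]            # four negative perturbations or none
POSITIVE = ["PerturbAnswer", "PerturbQuestion", None]  # two positive perturbations or none
PROB_BINS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]       # assumed 6 equally spaced bins

# Every (negative, positive, probability) triple is one candidate sub-policy.
SUB_POLICIES = list(itertools.product(NEGATIVE, POSITIVE, PROB_BINS))

def sample_policy(rng, num_sub_policies):
    """Draw a random policy, i.e. a list of sub-policies."""
    return [rng.choice(SUB_POLICIES) for _ in range(num_sub_policies)]

def apply_policy(qa_sample, policy, transforms, rng):
    """Pick one sub-policy at random and apply it with its probability."""
    neg, pos, prob = rng.choice(policy)
    if rng.random() >= prob:
        return qa_sample  # sub-policy not triggered; sample left unchanged
    for name in (neg, pos):
        if name is not None:
            qa_sample = transforms[name](qa_sample)
    return qa_sample

rng = random.Random(0)
policy = sample_policy(rng, num_sub_policies=3)
num_single = len(SUB_POLICIES)        # 5 * 3 * 6 = 90
num_three = len(SUB_POLICIES) ** 3    # 729000 for a policy of three sub-policies
```

With three sub-policies per policy, the space holds 729,000 candidate policies, which is why exhaustive or manual search quickly becomes impractical.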

We adopt a simplified formulation of the policy for our BayesAugment method, following Ho et al. (2019) and RandAugment Cubuk et al. (2019). Sampling of positive and negative adversaries is eliminated and transformation probabilities of all possible combinations of adversaries are optimized over a continuous range $[0, 1]$. Consequently, one of these combinations is randomly sampled for each input QA sample to generate adversaries.

Footnote 3: RandAugment collapses a large parameter space by enforcing uniform probability on all transformations and optimizing for: (i) global distortion parameter, (ii) number of transformations applied to each image. It uses hyperparameter optimization and shows results with naive grid search due to small search space. RandAugment is not directly applicable to our setting because there is no notion of global distortion for text. Hence, we borrow the idea of treating augmentation policy parameters as hyperparameters but use Bayesian optimization for search.

4 Automatic Policy Search

Next, we need to perform search over the large space of augmentation policies in order to find the best policy for a desired outcome. Performing naive search (random or grid) or manually tuning the transformation probabilities is slow, expensive and largely impractical due to resource constraints. Hence, we compare two different approaches for learning the best augmentation policy in fewer searches: AutoAugment and BayesAugment. We follow the optimization procedure as demonstrated in Figure 1. For $t = 1, 2, \ldots$, do:

  • Sample the next policy $p_t$ (sample)

  • Transform training data with $p_t$ and generate augmented data (apply, transform)

  • Train the downstream task model with augmented data (train)

  • Obtain score on validation dataset as reward $r_t$

  • Update Gaussian Process or RNN Controller with $r_t$ (update)
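The loop above is the same for both search methods; only the sampling and update steps differ. A minimal sketch, assuming the four steps are supplied as callables (all names here are hypothetical) and using uniform random sampling with a toy reward purely as a stand-in for the actual optimizer and task model:

```python
import random

def search_policy(sample_policy, augment, train_and_score, update, num_steps):
    """Generic sample / transform / train / reward / update loop."""
    history = []
    for _ in range(num_steps):
        policy = sample_policy(history)        # sample
        augmented = augment(policy)            # apply, transform
        reward = train_and_score(augmented)    # train, then score on validation data
        update(policy, reward)                 # update GP or controller
        history.append((policy, reward))
    return max(history, key=lambda item: item[1])  # best (policy, reward) pair

# Toy stand-ins: the "policy" is a single probability and the reward peaks at 0.3.
rng = random.Random(0)
observations = []
best_policy, best_reward = search_policy(
    sample_policy=lambda history: rng.random(),
    augment=lambda policy: policy,
    train_and_score=lambda p: 1.0 - (p - 0.3) ** 2,
    update=lambda policy, reward: observations.append((policy, reward)),
    num_steps=20,
)
```

In a real run, `train_and_score` is by far the most expensive step, since it finetunes the task model once per sampled policy; this is what makes the number of search steps the key cost factor.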

4.1 AutoAugment

Our AutoAugment model (see Figure 1) consists of a recurrent neural network-based controller and a downstream task model. The controller has $n$ output blocks for $n$ sub-policies; each output block generates distributions for the three components of sub-policies, i.e., neg, pos and probability. The adversarial policy is generated by sampling from these distributions and applied on the input dataset to create adversarial samples, which are added to the original dataset to create an augmented dataset. The downstream model is trained on the augmented dataset till convergence and evaluated on a given metric, which is then fed back to the controller as a reward (see the update flow in the figure). We use REINFORCE Sutton et al. (1999); Williams (1992) to train the controller.
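To show how a reward signal shapes the controller's sampling distribution, here is a deliberately simplified REINFORCE sketch. It replaces the RNN controller with a single softmax over one categorical choice (for example, one of six probability bins), uses a toy reward in place of the task model's validation score, and adds a moving-average baseline; all of these are assumptions for illustration, not the configuration used in the paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, baseline, lr=0.5):
    """One REINFORCE update: sample an action, then move the logits along
    (reward - baseline) * grad log pi(action)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    advantage = reward - baseline
    # For a softmax policy, d log pi(action) / d logit_j = 1[j == action] - probs[j].
    new_logits = [
        l + lr * advantage * ((1.0 if j == action else 0.0) - probs[j])
        for j, l in enumerate(logits)
    ]
    return new_logits, reward

def train_controller(num_choices, reward_fn, steps=500, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_choices
    baseline = 0.0
    for _ in range(steps):
        logits, reward = reinforce_step(logits, reward_fn, rng, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline (assumption)
    return softmax(logits)

# Toy reward: only choice 2 is "good".
probs = train_controller(6, lambda a: 1.0 if a == 2 else 0.0)
```

After training, the distribution concentrates on the rewarded choice. In the full setting each such update requires training the task model once, which is why thousands of controller steps become costly.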

4.2 BayesAugment

Typically, it takes thousands of steps to train an AutoAugment controller using reinforcement learning, which prohibits the use of large pretrained models as the task model in the training loop. For example, the controllers in Cubuk et al. (2019) were trained for 15,000 samples or more. To circumvent this computational issue, we frame our adversarial policy search as a hyperparameter optimization problem and use Bayesian methods to perform the search. Bayesian optimization techniques use a surrogate model to approximate the objective function $f$ and an acquisition function to sample points from areas where improvement over the current result is most likely. The prior belief about $f$ is updated with samples drawn from $f$ in order to get a better estimate of the posterior that approximates $f$. Bayesian methods attempt to find the global maximum in the minimum number of steps.

We use a Gaussian Process (GP) Rasmussen (2003) as the surrogate function and Upper Confidence Bound (UCB) Srinivas et al. (2010) as the acquisition function. A GP is a non-parametric model that is fully characterized by a mean function $\mu_0: \mathcal{X} \to \mathbb{R}$ and a positive-definite kernel or covariance function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $x_1, x_2, \ldots, x_n$ denote any finite collection of $n$ points, where each $x_i$ represents a choice of sampling probabilities for each of the augmentation methods and $f_i = f(x_i)$ is the (unknown) function value evaluated at $x_i$. Let $y_1, y_2, \ldots, y_n$ be the corresponding noisy observations (the validation performance at the end of training). In the context of GP Regression (GPR), $\mathbf{f} = \{f_1, \ldots, f_n\}$ are assumed to be jointly Gaussian. Then, the noisy observations $\mathbf{y} = \{y_1, \ldots, y_n\}$ are normally distributed around $\mathbf{f}$ as $\mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma^2 I)$. The Gaussian Process upper confidence bound (GP-UCB) algorithm measures the optimistic performance upper bound of the sampling probabilities.
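For reference, the standard GP regression posterior and the GP-UCB selection rule can be written out as below. Here $\mu_0$ is the prior mean function, $k$ the kernel, $x_1, \ldots, x_n$ the policies evaluated so far, $\mathbf{y}$ their noisy validation scores and $\sigma^2$ the noise variance. The exploration weight $\beta_{n+1}$ is a tunable parameter whose value is not given in this section, so it is left symbolic.

```latex
% Shorthand built from the n observed points:
%   \mathbf{k}_n(x) = [k(x_1, x), \ldots, k(x_n, x)]^\top
%   K_n = [k(x_i, x_j)]_{i,j=1}^{n}
%   \mathbf{m}_n = [\mu_0(x_1), \ldots, \mu_0(x_n)]^\top
\begin{align}
  \mu_n(x) &= \mu_0(x) + \mathbf{k}_n(x)^\top \left(K_n + \sigma^2 I\right)^{-1} \left(\mathbf{y} - \mathbf{m}_n\right) \\
  \sigma_n^2(x) &= k(x, x) - \mathbf{k}_n(x)^\top \left(K_n + \sigma^2 I\right)^{-1} \mathbf{k}_n(x) \\
  x_{n+1} &= \arg\max_{x \in \mathcal{X}} \; \mu_n(x) + \sqrt{\beta_{n+1}}\, \sigma_n(x)
\end{align}
```

The first term of the selection rule favors policies the surrogate already predicts to score well, while the second favors policies whose outcome is still uncertain; the next policy to evaluate maximizes their sum.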

4.3 Rewards

The F1 score of downstream task model on development set is used as reward during policy search. To discover augmentation policies which are geared towards improving generalization of RC model, we calculate the F1 score of task model (trained on source domain) on out-of-domain or cross-lingual development datasets, and feed it as the reward to the optimizer.
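The reward is a token-overlap F1 averaged over the development set. The sketch below shows the core computation in simplified form: it lowercases and splits on whitespace only, whereas the official SQuAD script also strips punctuation and articles and takes the maximum over multiple reference answers. The function names are hypothetical.

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted and a reference answer string."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    if not pred or not gold:
        # Unanswerable questions: credit only if both are empty.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def reward(predictions, references):
    """Mean F1 (in percent) over a development set, used as the search reward."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)

example = token_f1("the Rankine cycle", "Rankine cycle")  # precision 2/3, recall 1
```

For generalization experiments, the same function would be evaluated on predictions for the out-of-domain or cross-lingual development set instead of the source-domain one.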

4.4 Datasets

We use SQuAD v2.0 Rajpurkar et al. (2018) and NewsQA Trischler et al. (2017) for adversarial evaluation and in-domain policy-search experiments. Further, we measure generalization from SQuAD v2.0 to NewsQA, and from SQuAD v1.1 Rajpurkar et al. (2016) to the German dataset from MLQA Lewis et al. (2019) and the Russian dataset from XQuAD Artetxe et al. (2019). See appendix for more details on datasets and training.

Footnote 4: The current choice of cross-lingual RC datasets in our experiments was based on the easy availability of x-en translation and span alignment models for the Translate-Test method Asai et al. (2018); we are currently actively adding RC datasets in other languages.

4.5 Reading Comprehension Models

We use RoBERTa-BASE as the primary RC model for all our experiments. Search algorithms like AutoAugment require a downstream model that can be trained and evaluated fast, in order to reduce training time. So, we use distilRoBERTa-BASE Sanh et al. (2019) for AutoAugment training loops, which has 40% fewer parameters than RoBERTa-BASE. It should be noted that the distilRoBERTa model used in our experiments is trained on SQuAD without distillation. BayesAugment is trained for fewer iterations than AutoAugment and hence allows us to use the RoBERTa-BASE model directly in the training loop. See appendix for baseline performances of these models on SQuAD and NewsQA.

4.6 Evaluation Metrics

We use the official SQuAD evaluation script for evaluation of robustness to adversarial attacks and performance on in-domain and out-of-domain datasets. For cross-lingual evaluation, we use the modified Translate-Test method as outlined in Lewis et al. (2019); Asai et al. (2018). QA samples in languages other than English are first translated to English and sent as input to RoBERTa-BASE finetuned on SQuAD v1.1. The predicted answer spans within the English context are then mapped back to the context in the original language using alignment scores from the translation model. We use the top-ranked German→English and Russian→English models in the WMT19 shared news translation task to generate translations and alignment scores Ng et al. (2019) (footnote 5: available at …).

| Adversary Method | SQuAD | NewsQA |
|---|---|---|
| Baseline (No Adversaries) | 81.17 | 58.40 |
| AddSentDiverse | 65.50 | 51.47 |
| AddKSentDiverse (K=2) | 45.31 | 48.31 |
| AddAnswerPosition | 68.91 | 49.20 |
| InvalidateAnswer | 77.75 | 24.03 |
| PerturbQuestion | 43.67 | 36.76 |
| PerturbAnswer | 71.97 | 59.08 |
| Effect of Multiple Distractors | | |
| AddSentDiverse | 65.50 | 51.47 |
| Add2SentDiverse | 45.31 | 48.31 |
| Add3SentDiverse | 43.49 | 44.81 |
| Combinatorial effect | | |
| AddSentDiverse | 65.50 | 51.47 |
| + PerturbAnswer | 50.71 | 51.43 |
| AddKSentDiverse | 45.31 | 48.31 |
| + PerturbQuestion | 31.56 | 29.56 |
| Effect of Insert Location of AddAnswerPosition | | |
| Random | 68.91 | 49.20 |
| Prepend | 66.52 | 48.01 |
| Append | 67.84 | 48.76 |

Table 2: Adversarial evaluation of baseline RoBERTa-BASE model trained on SQuAD v2.0 and NewsQA. Compare to corresponding rows in Table 3 to observe difference in performance after adversarial training. Results (F1 score) are shown on dev set.

5 Results

First, in Sec. 5.1, we perform adversarial evaluation of baseline RC models for various categories of adversaries. Next, in Sec. 5.2, we train the RC models with an augmented dataset that contains equal ratios of adversarial samples and show that it improves robustness to adversarial attacks but hurts performance of the model on original unaugmented dataset. Finally, in Sec. 5.3, we present results from AutoAugment and BayesAugment policy search and the in-domain, out-of-domain and cross-lingual performance of RC models trained using augmentation data generated from the learned policies with corresponding target rewards.

5.1 Adversarial Evaluation

Table 2 shows results from adversarial evaluation of RoBERTa-BASE finetuned with SQuAD v2.0 and NewsQA respectively. All adversarial methods lead to a significant drop in performance for the finetuned models i.e., between 4-45% for both datasets. The decrease in performance is maximum when there are multiple distractors in the context (Add3SentDiverse) or perturbations are combined with one another (AddSentDiverse + PerturbAnswer). These results show that, in spite of being equipped with a broader understanding of language from pretraining, the finetuned RC models are shallow and over-stabilized to textual patterns like n-gram overlap. Further, the models aren’t robust to semantic and syntactic variations in text.

| Adversary Method | SQuAD | NewsQA |
|---|---|---|
| AddSentDiverse | 68.00 | 61.13 |
| AddKSentDiverse (K=2) | 79.44 | 62.31 |
| AddAnswerPosition | 80.16 | 56.90 |
| InvalidateAnswer | 91.41 | 67.57 |
| PerturbQuestion | 60.91 | 44.99 |
| PerturbAnswer | 76.42 | 60.74 |
| Original Dev (No Adversaries) | 78.83 | 58.08 |

Table 3: Adversarial evaluation after training RoBERTa-BASE with the original dataset augmented with equally sampled adversarial data. Compare to corresponding rows in Table 2 to observe difference in performance after adversarial training. Results (F1 score) are shown on dev set.

5.2 Manual Adversarial Training

Next, in order to remediate the drop in performance observed in Table 2 and improve robustness to adversaries, the RC models are further finetuned for 2 epochs with an adversarially augmented training set. The augmented training set contains each QA sample from the original training set and a corresponding adversarial QA sample generated by randomly sampling one of the perturbation sub-policies. Table 3 shows results from adversarial evaluation after adversarial training. Adding perturbed data during training considerably improves robustness of the models to adversarial attacks. For instance, RoBERTa-BASE achieves 79.44 F1 score on SQuAD AddKSentDiverse samples (second row, Table 3), as compared to 45.31 F1 score without adversarial training (third row, Table 2). Similarly, RoBERTa-BASE achieves 44.99 F1 score on NewsQA PerturbQuestion samples (fifth row, Table 3), as compared to 36.76 F1 score by the baseline model (sixth row, Table 2). However, this manner of adversarial training also leads to a drop in performance on the original unaugmented development set, e.g., RoBERTa-BASE achieves 78.83 and 58.08 F1 scores on the SQuAD and NewsQA development sets respectively, which is 2.34 and 0.32 points lower than the baseline scores (first row, Table 2).

Footnote 6: We also train RC models on a subset of adversaries and test on unseen adversaries. See results in Sec. 7.

5.3 Augmentation Policy Search for Domain and Language Generalization

Following the conclusion from Sec. 5.2 that uniform sampling of adversaries is not the optimal approach for model performance on the original unaugmented dataset, we perform automated policy search over a large search space using BayesAugment and AutoAugment for in-domain as well as cross-domain/lingual improvements (as discussed in Sec. 4). For AutoAugment, we choose the number of sub-policies in a policy to be $n = 3$ as a trade-off between search space dimension and optimum results. We search for the best transformation policies for the source domain that lead to improvement of the model in 3 areas: (1) in-domain performance, (2) generalization to other domains and (3) generalization to other languages. Results from these experiments are presented in Tables 4 and 5 and the learned policies are shown in Table 6.

Search Method    In-domain SQuAD    In-domain NewsQA    SQuAD→NewsQA
Validation
Baseline         81.17 / 77.54      58.40 / 47.04       48.36 / 36.06
AutoAug          81.63 / 78.06      62.17 / 49.41       50.57 / 38.56
BayesAug         81.71 / 78.12      58.62 / 47.21       49.73 / 38.38
Test
Baseline         80.64 / 77.19      57.02 / 45.29       44.95 / 34.68
AutoAug          81.06 / 77.79      59.09 / 45.49       46.82 / 35.75
BayesAug         80.88 / 77.57      57.63 / 45.32       48.95 / 37.44

Table 4: Baseline performance (first row) and evaluation after finetuning baseline models with the adversarial policies derived from AutoAugment and BayesAugment for in-domain improvements and out-of-domain generalization from Wikipedia (SQuAD) to news (NewsQA) domain. Results (F1 / Exact Match) are shown on validation and test sets.
Search Method    Cross-lingual generalization from English SQuAD
                 MLQA (de)        XQuAD (ru)
Validation
Baseline         58.39 / 36.33    67.80 / 44.56
BayesAug         59.40 / 37.11    68.73 / 45.34
Test
Baseline         57.20 / 35.86    60.95 / 33.52
BayesAug         59.02 / 38.01    63.03 / 34.85
Table 5: Cross-lingual QA: Translate-Test Lewis et al. (2019) evaluation after finetuning the baseline with adversarial policies derived from BayesAugment for generalization to German (de) and Russian (ru) RC datasets. Results (F1 / Exact Match) are shown on validation and test sets.
AutoAugment Policies
SQuAD → SQuAD: (AddS, None, 0.2), (IA, None, 0.4), (AddA, None, 0.2)
SQuAD → NewsQA: (None, PA, 0.4), (None, PA, 0.6), (AddS, PA, 0.4)
NewsQA → NewsQA: (AddA, PA, 0.2), (AddKS, None, 0.2), (AddA, PA, 0.4)
BayesAugment Policies
SQuAD → SQuAD: (AddS, 0.29), (AddA, 0.0), (AddA-PA, 0.0), (AddA-PQ, 0.0), (AddKS, 0.0), (AddKS-PA, 0.0), (AddKS-PQ, 0.0), (AddS-PA, 0.0), (AddS-PQ, 0.0), (PA, 0.61), (PQ, 0.0), (IA, 1.0)
SQuAD → NewsQA: (AddS, 1.0), (AddA, 0.0), (AddA-PA, 1.0), (AddA-PQ, 0.0), (AddKS, 0.0), (AddKS-PA, 0.0), (AddKS-PQ, 0.0), (AddS-PA, 1.0), (AddS-PQ, 0.0), (PA, 0.48), (PQ, 0.0), (IA, 0.0)
SQuAD → MLQA(de): (AddS, 0.042), (AddA-PA, 0.174), (AddA-PQ, 0.565), (AddKS, 0.173), (AddKS-PA, 0.567), (AddA, 0.514), (AddS-PA, 0.869), (AddS-PQ, 0.720), (PA, 0.903), (PQ, 0.278), (AddKS-PQ, 0.219)
SQuAD → XQuAD(ru): (AddS, 0.147), (AddA-PA, 0.174), (AddA-PQ, 0.79), (AddKS, 0.55), (AddKS-PA, 0.97), (AddA, 0.77), (AddS-PA, 0.02), (AddS-PQ, 0.59), (PA, 0.11), (PQ, 0.95), (AddKS-PQ, 0.725)
NewsQA → NewsQA: (AddS, 1.0), (AddA, 1.0), (AddA-PA, 1.0), (AddA-PQ, 0.0), (AddKS, 0.0), (AddKS-PA, 1.0), (AddKS-PQ, 0.156), (AddS-PA, 0.0), (AddS-PQ, 0.720), (PA, 0.0), (PQ, 0.0), (IA, 1.0)
Table 6: Best Policies suggested by BayesAugment and AutoAugment methods for different scenarios; AddS = AddSentDiverse, AddKS = AddKSentDiverse, AddA = AddAnswerPosition, IA = InvalidateAnswer, PA = PerturbAnswer, PQ = PerturbQuestion.
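To make the BayesAugment rows of Table 6 concrete, the sketch below applies one of them to a toy dataset. It is an illustration under stated assumptions, not the authors' code: the policy values are copied from the first BayesAugment row (source SQuAD, target SQuAD, zero-probability entries omitted), but the semantics used here (each adversary fires independently on each sample with its transformation probability) and the placeholder `make_adversary` are assumptions.

```python
import random

# First BayesAugment row of Table 6 (source SQuAD, target SQuAD);
# entries with probability 0.0 are omitted.
POLICY = {"AddS": 0.29, "PA": 0.61, "IA": 1.0}

def make_adversary(name):
    """Hypothetical placeholder: tags the sample instead of transforming it."""
    def transform(sample):
        return {**sample, "adversary": name}
    return transform

def apply_policy(train_set, policy, seed=0):
    """Assumed semantics: originals are always kept, and every adversary is
    applied to every sample independently with its transformation probability."""
    rng = random.Random(seed)
    out = list(train_set)
    for sample in train_set:
        for name, prob in policy.items():
            if rng.random() < prob:
                out.append(make_adversary(name)(sample))
    return out

train_set = [{"id": i} for i in range(100)]
augmented = apply_policy(train_set, POLICY)
counts = {name: sum(1 for s in augmented if s.get("adversary") == name)
          for name in POLICY}
print(counts)
```

With a probability of 1.0, InvalidateAnswer (IA) produces one adversarial sample per original, while AddS and PA contribute roughly 29% and 61% as many.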

In-domain evaluation: The best AutoAugment augmentation policies for improving in-domain performance of RoBERTaBASE on the development sets result in 0.46% and 3.77% absolute improvement in F1 score over baseline for SQuAD v2.0 and NewsQA respectively (see Table 4). Similarly, we observe 0.54% (p=0.021) and 0.22% (p=0.013) absolute improvement in F1 score for SQuAD and NewsQA respectively by using BayesAugment policies. This trend is reflected in results on the test set as well, where AutoAugment policies yield the largest improvements, i.e., 0.42% (p=0.014) and 2.07% (p=0.007) for SQuAD and NewsQA respectively.

Out-of-domain evaluation: To evaluate generalization of the RC model from the domain of Wikipedia to news articles, we train RoBERTaBASE on SQuAD and evaluate on NewsQA. As seen in Table 4, out-of-domain performance for RoBERTaBASE trained on adversarially augmented SQuAD and evaluated on NewsQA (SQuAD→NewsQA) improves by 2.21% F1 score on the development set with the best augmentation policy from AutoAugment. The baseline row presents results of RoBERTaBASE trained on original unaugmented SQuAD and evaluated on NewsQA. BayesAugment provides a competitive and less computationally intensive substitute to AutoAugment in out-of-domain evaluation, with 1.37% improvement over baseline. However, the trend differs for test set evaluation: the BayesAugment policy for SQuAD→NewsQA generalization results in 4% (p<0.001) improvement on the test set, while AutoAugment improves F1 by 1.87% (p=0.004).

Our experiments suggest that AutoAugment finds better policies than BayesAugment for in-domain evaluation. We hypothesize that this might be attributed to a difference in search space between the two policy search methods. AutoAugment is restricted to sampling at most 3 sub-policies while BayesAugment has to simultaneously optimize the transformation probability for ten or more different augmentation methods. A diverse mix of adversaries from the latter is shown to be beneficial for out-of-domain generalization but results in minor improvements for in-domain performance. Moving ahead, due to better performance for out-of-domain evaluation and more efficient trade-off with computation, we only use BayesAugment for our cross-lingual experiments.

Cross-lingual evaluation: Table 5 shows results of RoBERTaBASE finetuned with adversarially augmented SQuAD v1.1 and evaluated on RC datasets in languages other than English. The baseline row presents results from RoBERTaBASE trained on original unaugmented SQuAD and evaluated on German MLQA(de) and Russian XQuAD(ru) datasets; F1 scores on the development sets are 58.39 and 67.80 respectively. It should be noted that these scores depend on the quality of the translation model as well as the reading comprehension model. We observe significant improvements on the development as well as test sets without the use of additional training data in the target language, i.e., only by finetuning the baseline RC model with adversarial data from English SQuAD. BayesAugment policies result in 1.82% (p<0.001) and 2.08% (p=0.009) improvement for test sets of MLQA(de) and XQuAD(ru) respectively.
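The gains quoted in this section can be recomputed directly from Tables 4 and 5. The short script below does so for a selection of the reported F1 scores; the dictionary keys are labels chosen here for readability. Note that the quoted percentages are absolute F1 point differences, not relative changes.

```python
# F1 scores copied from Tables 4 and 5 as (baseline, augmented) pairs.
F1_SCORES = {
    "SQuAD in-domain, AutoAug, validation": (81.17, 81.63),
    "NewsQA in-domain, AutoAug, validation": (58.40, 62.17),
    "SQuAD->NewsQA, AutoAug, validation": (48.36, 50.57),
    "SQuAD->NewsQA, BayesAug, validation": (48.36, 49.73),
    "SQuAD->NewsQA, BayesAug, test": (44.95, 48.95),
    "MLQA (de), BayesAug, test": (57.20, 59.02),
    "XQuAD (ru), BayesAug, test": (60.95, 63.03),
}

def absolute_gain(baseline, augmented):
    """Absolute improvement in F1 points, rounded to two decimals."""
    return round(augmented - baseline, 2)

gains = {name: absolute_gain(b, a) for name, (b, a) in F1_SCORES.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} F1")
```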

6 Analysis and Discussion

Having established the efficacy of automated policy search for adversarial training, we further probe the robustness of models to adversarial attacks to analyze its dependence on source domain. Next, we train RoBERTaBASE on a subset of adversaries and evaluate its robustness to unseen adversaries. We also show the impact of adversarial augmentation ratio in training dataset and the size of training dataset on the generalization of RC model to out-of-domain data. Lastly, we analyze the convergence of BayesAugment for adversarial augmentation policy search and contrast its requirement of computational resources with that of AutoAugment.

Domain-Independence of Robustness to Adversarial Attacks: We have shown that a reading comprehension model trained on SQuAD can be generalized to NewsQA by finetuning the model with adversarially transformed samples from the SQuAD dataset. It is expected that this model will be robust to similar attacks on SQuAD. To assess if this robustness generalizes to NewsQA as well, we evaluate our best SQuAD→NewsQA model on adversarially transformed NewsQA samples from the development set. The SQuAD column in Table 7 shows results from evaluation of RoBERTaBASE finetuned with original unaugmented SQuAD, on adversarially transformed NewsQA samples. The rightmost column shows results from similar evaluation of the SQuAD→NewsQA model that has been trained using the BayesAugment adversarial augmentation policy for generalization from SQuAD to NewsQA. Interestingly, the generalized model is 5-8% more robust to adversarial NewsQA without being trained on any NewsQA samples, showing that robustness to adversarial attacks in the source domain easily generalizes to adversarial attacks in a different domain.

NewsQA Adversary                SQuAD            SQuAD→NewsQA
AddSentDiverse                  42.39 / 32.79    49.54 / 38.02
PerturbAnswer                   39.95 / 27.60    45.52 / 32.49
AddSentDiverse-PerturbAnswer    35.08 / 26.33    43.63 / 32.76
Table 7: Comparison of robustness between RoBERTaBASE finetuned on original unaugmented SQuAD and our best SQuAD→NewsQA generalized model (see Sec. 5.3). The latter is more robust to adversarial NewsQA without being trained on any NewsQA samples. Results (F1 score / Exact Match) are shown on the dev set.

Robustness to Unseen Adversaries: We train RoBERTaBASE on a subset of adversarial attacks and evaluate it on adversaries that were not in the training set, to analyze robustness of the model to unseen adversaries. In the first set of experiments, we train RoBERTaBASE on SQuAD v2.0 augmented with the AddSentDiverse counterpart of each original QA sample. In the second set of experiments, we train RoBERTaBASE on SQuAD augmented with an adversarial dataset of the same size as SQuAD that contains an equal number of samples from AddSentDiverse, PerturbQuestion and PerturbAnswer. As seen from the results in Table 8, training with AddSentDiverse leads to large improvements on AddKSentDiverse and small improvements on PerturbQuestion and PerturbAnswer, i.e., 31.21% (45.31 vs. 76.52), 1.56% (43.67 vs. 45.23) and 5.31% (71.97 vs. 77.28) respectively, showing that the model becomes significantly more robust to multiple distractors within the same context and also gains some resilience to paraphrasing operations. Conversely, we see a drop in performance on InvalidateAnswer, showing that it is easier for the model to be distracted by adversaries when the original answer is removed from the context. In the second set of experiments, we see that the model is significantly more robust to combinatorial adversaries like AddSentDiverse+PerturbAnswer when trained on the adversaries AddSentDiverse and PerturbAnswer individually. We again see a decline in performance on InvalidateAnswer.

Adversary Attack                    Trained on SQuAD    Trained on SQ+ASD
AddKSentDiverse                     45.31               76.52
InvalidateAnswer                    77.75               70.91
PerturbQuestion                     43.67               45.23
PerturbAnswer                       71.97               77.28
Adversary Attack                    Trained on SQuAD    Trained on SQ+ASD/PQ/PA
AddSentDiverse+PerturbAnswer        50.71               84.37
AddKSentDiverse+PerturbQuestion     31.56               78.91
AddAnswerPosition                   68.91               80.87
AddKSentDiverse                     45.31               76.14
InvalidateAnswer                    77.75               71.62
Table 8: Robustness of RoBERTaBASE trained on a subset of adversaries to unseen adversaries. Results (F1 score) are shown on SQuAD dev set (ASD=AddSentDiverse, PQ=PerturbQuestion, PA=PerturbAnswer, SQ=SQuAD).
Augmentation Ratio       NewsQA
RoBERTa                  48.36 / 36.06
+ 1x augmentation        49.73 / 38.38
+ 2x augmentation        49.84 / 37.97
+ 3x augmentation        49.62 / 38.01
Table 9: Effect of augmentation ratio for SQuAD→NewsQA generalization. Results (F1 score / Exact Match) are shown on the NewsQA dev set.
Figure 2: Performance of the SQuAD→NewsQA model on the NewsQA dev set (F1 score) with increasing size of the finetuning dataset.

Effect of Augmentation Ratio: To assess the importance of adversarial augmentation in the dataset, we experimented with different ratios, i.e., 1x, 2x and 3x, of augmented samples to the original dataset, for generalization from SQuAD to NewsQA using the augmentation policy learnt by BayesAugment. The F1 scores of the SQuAD→NewsQA models on the NewsQA validation set were 49.73, 49.84 and 49.62 for 1x, 2x and 3x augmentation respectively (Table 9), showing a slight improvement at twice the number of augmentations. However, performance starts decreasing at 3x augmentation, showing that too many adversaries in the training data start hurting generalization.

Effect of Augmented Dataset Size: We experimented with 20%, 40%, 60%, 80% and 100% of the original dataset to generate augmented dataset using the BayesAugment policy for generalization of RoBERTaBASE trained on SQuAD to NewsQA and observed little variance in performance with increasing data, as seen from Figure 2. The augmentation ratio in these datasets is 1:1. We hypothesize that the model is saturated early on during training, within the first tens of thousands of adversarially augmented samples. Exposing the model to more SQuAD samples gives little boost to performance on NewsQA thereafter.

Bayesian Convergence: In comparison to the thousands of training loops or more for AutoAugment, we run BayesAugment for only 100 training loops with 20 restarts. To show that BayesAugment converges within the given period, we plot the distance between transformation probabilities chosen by the Bayesian optimizer for the AddSentDiverse-PerturbQuestion augmentation method. As shown in Figure 3, the distance between the samples decreases as the training iterations increase, showing that the optimizer becomes more confident about the narrow range of probability which should be sampled for maximum performance on validation set.
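The convergence diagnostic described above (distance between neighboring samples and its moving average, as plotted in Figure 3) can be sketched in a few lines. The probability trace below is made up for illustration, and both the absolute-difference distance and the window size of 3 are assumptions; the source does not specify either.

```python
def neighbor_distances(samples):
    """Absolute distance between consecutive probabilities picked by the optimizer."""
    return [abs(b - a) for a, b in zip(samples, samples[1:])]

def moving_average(values, window):
    """Trailing moving average; early entries average the values seen so far."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Made-up trace of transformation probabilities for a single augmentation method.
trace = [0.10, 0.90, 0.30, 0.70, 0.45, 0.60, 0.50, 0.56, 0.52, 0.54]
distances = neighbor_distances(trace)
smoothed = moving_average(distances, window=3)
print([round(d, 2) for d in smoothed])
```

A shrinking moving average indicates that successive samples are clustering in a narrow probability range, which is the behavior the paragraph above attributes to the optimizer.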

Analysis of Resources for AutoAugment vs BayesAugment: With fewer training loops, BayesAugment uses only 10% of the GPU resources required for AutoAugment. Our AutoAugment experiments took more than 1000 iterations and up to 5-6 days to converge, and required many additional days for hyperparameter tuning. In contrast, our BayesAugment experiments ran for 36-48 hours on two 1080Ti GPUs and achieved comparable performance with 100 iterations or fewer. If large pretrained models are replaced with smaller distilled models in future work, BayesAugment will provide even more gains in time and computation.

7 Conclusion

We show that adversarial training can be leveraged to improve robustness of reading comprehension models to adversarial attacks and also to improve performance on the source domain and generalization to out-of-domain and cross-lingual data. We present BayesAugment for policy search, which achieves results similar to the computationally intensive AutoAugment method but with a fraction of the computational resources. By combining policy search with rewards from performance on the corresponding target development sets, we show that models trained on SQuAD can be generalized to NewsQA and to German and Russian cross-lingual datasets without any training data from the target domain or language.

Figure 3: Demonstration of variation in distance between neighboring samples picked by Bayesian optimizer with increasing training iterations. The red line represents moving average of distances.


This work was supported by DARPA MCS Grant #N66001-19-2-4031, DARPA KAIROS Grant #FA8750-19-2-1004, ONR Grant #N00014-18-1-2871, and awards from Google and Facebook (plus Amazon and Google GPU cloud credits). The views are those of the authors and not of the funding agency.


  • M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2890–2896.
  • M. Artetxe, S. Ruder, and D. Yogatama (2019) On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.
  • A. Asai, A. Eriguchi, K. Hashimoto, and Y. Tsuruoka (2018) Multilingual extractive reading comprehension by runtime machine translation. arXiv preprint arXiv:1809.03275.
  • Y. Belinkov and Y. Bisk (2018) Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • Y. Cao, M. Fang, B. Yu, and J. T. Zhou (2020) Unsupervised domain adaptation on reading comprehension. In AAAI.
  • X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, and K. Weinberger (2018) Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation strategies from data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2018) HotFlip: white-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 31–36.
  • A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen (2019) MRQA 2019 shared task: evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China, pp. 1–13.
  • M. Glockner, V. Shwartz, and Y. Goldberg (2018) Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 650–655.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR.
  • I. Gurevych and Y. Miyao (2018) Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers). In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • D. Ho, E. Liang, X. Chen, I. Stoica, and P. Abbeel (2019) Population based augmentation: efficient learning of augmentation policy schedules. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2731–2741.
  • M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1875–1885.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. DeepMind tech report. arXiv preprint arXiv:1711.09846.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2021–2031.
  • S. Lee, D. Kim, and J. Park (2019) Domain-agnostic question-answering with adversarial training. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China, pp. 196–202.
  • P. Lewis, B. Oğuz, R. Rinott, S. Riedel, and H. Schwenk (2019) MLQA: evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
  • C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • T. Miyato, A. M. Dai, and I. J. Goodfellow (2017) Adversarial training methods for semi-supervised text classification. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
  • N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook FAIR’s WMT19 News Translation Task Submission. In WMT.
  • Y. Nie, Y. Wang, and M. Bansal (2019) Analyzing compositionality-sensitivity of NLI models. In AAAI, pp. 6867–6874.
  • T. Niu and M. Bansal (2018) Adversarial over-sensitivity and over-stability strategies for dialogue models. In CoNLL.
  • T. Niu and M. Bansal (2019) Automatically learning data augmentation policies for dialogue tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1317–1323.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. OpenAI Technical Report.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
  • S. Ren, Y. Deng, K. He, and W. Che (2019) Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1085–1097.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In ICLR.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
  • R. Speer, J. Chin, and C. Havasi (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 4444–4451.
  • N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], S. A. Solla, T. K. Leen, and K. Müller (Eds.), pp. 1057–1063.
  • A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4911–4921.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, pp. 191–200.
  • H. Wang, Z. Gan, X. Liu, J. Liu, J. Gao, and H. Wang (2019) Adversarial domain adaptation for machine reading comprehension. In EMNLP.
  • Y. Wang and M. Bansal (2018) Robust machine comprehension models via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 575–581.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, pp. 229–256.
  • C. Xie, M. Tan, B. Gong, J. Wang, A. Yuille, and Q. V. Le (2020) Adversarial examples improve image recognition. In CVPR.
  • Z. Yang, Y. Cui, W. Che, T. Liu, S. Wang, and G. Hu (2019) Improving machine reading comprehension via adversarial training. arXiv preprint arXiv:1911.03614.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In ICLR.
  • M. Zhang, Y. Liu, H. Luan, and M. Sun (2017) Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1959–1970.
  • Z. Zhao, D. Dua, and S. Singh (2018) Generating natural adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Appendix A Appendices

A.1 Adversary Attacks

PerturbAnswer (Semantic Paraphrasing): Following Alzantot et al. (2018), we perform semantic paraphrasing of the sentence containing the answer span. Instead of using a genetic algorithm, we adapt their Perturb subroutine to generate paraphrases in the following steps:

  1. Select word locations for perturbations, which include locations within any content phrase that does not appear within the answer span. Here, content phrases are verbs, adverbs and adjectives.

  2. For each location in the set of word locations, compute the 20 nearest neighbors of the word at that location using GloVe embeddings, create a candidate sentence by replacing the word at that location with each of the substitute words, and rank the perturbed sentences using a language model.

  3. Select the perturbed sentence with the highest rank and perform Step 2 for the next location using the perturbed sentence.

We use the OpenAI-GPT model Radford et al. (2018) to evaluate paraphrases.
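The greedy procedure in the steps above can be sketched as follows. This is a toy illustration, not the authors' implementation: the `NEIGHBORS` table stands in for the 20 nearest GloVe neighbors, `toy_lm_score` stands in for the OpenAI-GPT language model, and the word counts are invented for the example.

```python
# Hypothetical nearest-neighbor table standing in for GloVe lookups
# (the source uses the 20 nearest neighbors of each word).
NEIGHBORS = {
    "quickly": ["rapidly", "swiftly"],
    "built": ["constructed", "erected"],
}

# Invented word counts used by the toy scorer below.
TOY_COUNTS = {"rapidly": 3, "swiftly": 1, "constructed": 5, "erected": 2}

def toy_lm_score(sentence):
    """Hypothetical stand-in for a language-model score (higher is better)."""
    return sum(TOY_COUNTS.get(word, 0) for word in sentence.split())

def perturb_sentence(tokens, locations, neighbors, score):
    """Visit each location in turn, try every substitute word, and carry the
    best-scoring perturbed sentence forward to the next location."""
    current = list(tokens)
    for loc in locations:
        candidates = []
        for substitute in neighbors.get(current[loc], []):
            candidate = list(current)
            candidate[loc] = substitute
            candidates.append(candidate)
        if candidates:  # keep the sentence unchanged if no substitute exists
            current = max(candidates, key=lambda c: score(" ".join(c)))
    return current

tokens = "The tower was quickly built in 1889".split()
locations = [3, 4]  # content words outside the answer span ("1889")
paraphrase = perturb_sentence(tokens, locations, NEIGHBORS, toy_lm_score)
print(" ".join(paraphrase))
```

Because the answer span is never among the perturbed locations, the original answer remains valid for the paraphrased sentence.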

A.2 Datasets

SQuAD v2.0 Rajpurkar et al. (2018) is a crowd-sourced dataset consisting of 100,000 questions from SQuAD v1.1 Rajpurkar et al. (2016) and an additional 50,000 questions that do not have answers within the given context. We split the official development set into 2 randomly sampled sets of validation and test for our experiments.

NewsQA Trischler et al. (2017) is also a crowd-sourced extractive RC dataset based on 10,000 news articles from CNN, containing both answerable and unanswerable questions. To accommodate very long contexts from NewsQA in models like BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), we sample two instances from the set of overlapping instances for the final training data.

MLQA Lewis et al. (2019) is the multilingual extension to SQuAD v1.1 consisting of evaluation (development and test) data only. We use German (de) MLQA in our experiments.

XQuAD is a multilingual version of SQuAD Artetxe et al. (2019) containing only test sets. We use Russian (ru) XQuAD which contains nearly 1100 QA samples that are further split equally and randomly into development and test sets.
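The random validation/test splits described above for the SQuAD development set and for XQuAD can be sketched as below. The fixed seed and the placeholder sample IDs are assumptions made for reproducibility of the example; the source states an equal split only for XQuAD.

```python
import random

def split_evaluation_set(samples, seed=0):
    """Shuffle a copy of the samples and split it into two halves."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Placeholder IDs standing in for QA samples.
examples = [f"qa-{i}" for i in range(10)]
validation_split, test_split = split_evaluation_set(examples)
print(len(validation_split), len(test_split))
```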

Model            SQuAD v1.1       SQuAD v2.0       NewsQA
RoBERTa          89.73 / 82.38    81.17 / 77.54    58.40 / 47.04
DistilRoBERTa    84.57 / 75.81    73.29 / 69.47    54.21 / 42.76

Table 10: Comparison of performance (F1 Score / Exact Match) of different models on SQuAD v1.1, SQuAD v2.0 and NewsQA datasets. RoBERTaBASE is the baseline model; DistilRoBERTaBASE is the task model used during AutoAugment policy search.

A.3 Training Details

Model Hyperparameters: We trained RoBERTaBASE for 5 epochs on SQuAD and NewsQA respectively and selected the best-performing checkpoint as the baseline. We perform a hyperparameter search for both datasets using Bayesian optimization Snoek et al. (2012). The RNN controller in the AutoAugment training loop consists of a single LSTM cell with a single hidden layer and a hidden layer dimension of 100. The generated policy consists of 3 sub-policies; each sub-policy is structured as represented in Figure 1 in the main text. BayesAugment is trained for 100 iterations with 20 restarts. During the AutoAugment and BayesAugment training loops, RoBERTaBASE or DistilRoBERTaBASE (which has already been trained on unaugmented SQuAD) is further finetuned on the adversarially augmented dataset for 2 epochs with a warmup ratio of 0.2 and learning rate decay (lr=1e-5) thereafter. After the policy search, further hyperparameter optimization is performed for best results from fine-tuning. We do not perform this last step of hyperparameter tuning on cross-lingual data to avoid the risk of overfitting the small datasets. For generalization from SQuAD v1.1 to cross-lingual datasets, we do not consider the adversary InvalidateAnswer because NoAnswer samples do not exist for these datasets.
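As a rough illustration of the finetuning schedule mentioned above (warmup ratio of 0.2 followed by learning rate decay with lr=1e-5), the sketch below computes a per-step learning rate. The linear shape of both the warmup and the decay, the decay to zero, and the total of 1000 steps are assumptions; the source does not state them.

```python
def learning_rate(step, total_steps, peak_lr=1e-5, warmup_ratio=0.2):
    """Linear warmup to peak_lr, then linear decay to zero.

    The peak and warmup ratio follow the text; the linear shape is assumed.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * (total_steps - step) / remaining

total_steps = 1000  # hypothetical number of optimizer steps over 2 epochs
schedule = [learning_rate(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], schedule[200], schedule[1000])
```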

Hyperparameter SQuAD v1.1 SQuAD v2.0 NewsQA
Learning Rate 3e-5 1.5e-5 1.6e-5
Batch Size 24 16 24
Warmup Ratio 0.06 0.06 0.08
No. of Epochs 2 5 5
Weight Decay 0.01 0.01 0.01
Table 11: Best hyperparameters for training RoBERTaBASE on SQuAD v1.1, SQuAD v2.0 and NewsQA.