
Towards Improving Adversarial Training of NLP Models

Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results show that it is possible to train empirically robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model's robustness to the attack it was originally trained with and also defend the model against other types of attacks. Furthermore, we show that A2T can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at .





1 Introduction

Recently, robustness of neural networks against adversarial examples has been an active area of research in natural language processing, with a plethora of new adversarial attacks (we use “methods for adversarial example generation” and “adversarial attacks” interchangeably) having been proposed to fool question answering (Jia and Liang, 2017), machine translation (Cheng et al., 2018), and text classification systems (Ebrahimi et al., 2017; Jia and Liang, 2017; Alzantot et al., 2018; Jin et al., 2019; Ren et al., 2019; Zang et al., 2020; Garg and Ramakrishnan, 2020). One method to make models more resistant to such adversarial attacks is adversarial training, where the model is trained on both original examples and adversarial examples (Goodfellow et al., 2014; Madry et al., 2018). Due to its simple workflow, it is a popular go-to method for improving adversarial robustness.

Typically, adversarial training involves generating an adversarial example $x_{adv}$ from each original example $x$ before training the model on both $x$ and $x_{adv}$. In NLP, generating an adversarial example is typically framed as a combinatorial optimization problem solved using a heuristic search algorithm. Such an iterative search process is expensive. Depending on the choice of the search algorithm, it can take up to tens of thousands of forward passes of the underlying model to generate one example (Yoo et al., 2020). This high computational cost hinders the use of vanilla adversarial training in NLP, and it is unclear how and to what extent such training can improve an NLP model’s performance (Morris et al., 2020a).

In this paper, we propose to improve vanilla adversarial training in NLP with a computationally cheaper adversary, referred to as A2T. The proposed A2T uses a cheaper gradient-based word importance ranking method to iteratively replace each word with synonyms generated from a counter-fitted word embedding (Mrksic et al., 2016). We use A2T and its variation A2T-MLM (which uses masked language model-based word replacements instead) to train BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models on text classification datasets: IMDB (Maas et al., 2011), Rotten Tomatoes (Pang and Lee, 2005), Yelp (Zhang et al., 2015), and SNLI (Bowman et al., 2015). Our findings are as follows:


  • Adversarial training with both A2T and A2T-MLM can help improve adversarial robustness, even against NLP attacks that were not used to train the model (see Table 5).

  • Adversarial training with A2T can provide a regularization effect and improve the model’s standard accuracy and/or cross-domain generalization, while A2T-MLM tends to hurt both standard accuracy and cross-domain generalization (see Table 6).

  • Using LIME (Ribeiro et al., 2016) and the AOPC metric, we demonstrate that adversarial training with A2T can improve NLP models’ interpretability (see Table 7).

2 Background

Figure 1: Pipeline for vanilla adversarial training in NLP

2.1 Vanilla Adversarial Training in NLP

Vanilla adversarial training has been a major defense strategy in most existing work on adversarial robustness (Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2018). This simple algorithm involves augmenting the training data by adding the adversarial examples generated from the training data. The augmentation happens online during training, and, therefore, the cost of generating adversarial examples is usually a concern for practical vanilla adversarial training Shafahi et al. (2019); Zhu et al. (2019). Such a concern is even more prominent in NLP since generating NLP adversarial examples involves combinatorial search and expensive language constraints.

In recent NLP literature, vanilla adversarial training has only been evaluated in a limited context. In most studies, adversarial training is only performed to show that such training can make models more resistant to the attack it was originally trained with (Jin et al., 2019; Ren et al., 2019; Li et al., 2020; Zang et al., 2020; Li et al., 2021). This observation is hardly surprising, and it is generally recommended to use different attacks to evaluate the effectiveness of defenses Carlini et al. (2019). Therefore, in this paper, we perform a more in-depth investigation into how a practical vanilla adversarial training algorithm we propose affects NLP models’ adversarial robustness against a set of different attacks that are not used for training.

In addition, we examine how adversarial training affects a model’s performance in other aspects such as standard accuracy, cross-domain generalization, and interpretability.

2.2 Components of an NLP Attack

Figure 1 includes a schematic diagram of vanilla adversarial training, where an adversarial attack is part of the training procedure. We borrow the framework introduced by Morris et al. (2020b), which breaks down the process of generating natural language adversarial examples into three parts (see Table 1): (1) a search algorithm to iteratively search for the best perturbations; (2) a transformation module to perturb a text input from $x$ to $x'$ (e.g. synonym substitutions); and (3) a set of constraints that filters out undesirable $x'$ to ensure that the perturbed $x'$ preserves the semantics and fluency of the original $x$.

Adversarial attacks frame their approach as a combinatorial search because of the exponential nature of the search space. Consider the search space for an adversarial attack that replaces words with synonyms: if a given sequence of text consists of $W$ words, and each word has $T$ potential substitutions, the total number of perturbed inputs to consider is $(T+1)^W - 1$. Thus, the graph of all potential adversarial examples for a given input is far too large for an exhaustive search. Studies on NLP attacks have explored various heuristic search algorithms, including beam search (Ebrahimi et al., 2017), genetic algorithms (Alzantot et al., 2018), and greedy methods with word importance ranking (Gao et al., 2018; Jin et al., 2019; Ren et al., 2019).
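To give a feel for how quickly this search space grows, the following minimal sketch counts the perturbed inputs for a synonym-substitution attack. It assumes each of the `num_words` words either keeps its original form or takes one of `subs_per_word` substitutes, and it excludes the unperturbed input; the function name and the example sizes are illustrative, not taken from the paper.

```python
def num_perturbed_inputs(num_words: int, subs_per_word: int) -> int:
    """Count distinct perturbed inputs for a word-substitution attack.

    Each word has (subs_per_word + 1) choices: stay as-is or take one substitute.
    Subtracting 1 removes the single combination where nothing is changed.
    """
    return (subs_per_word + 1) ** num_words - 1


if __name__ == "__main__":
    # Illustrative input lengths; the count grows exponentially with length.
    for n in (5, 20, 50):
        print(n, "words ->", num_perturbed_inputs(n, subs_per_word=10), "candidates")
```

Even with only ten substitutes per word, a 50-word input yields far more candidates than could be scored with one model query each, which is why heuristic search is used instead of enumeration.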

3 Method: A2T (Attacking to Training)

In this section, we present our algorithm A2T for an improved and practical vanilla adversarial training for NLP. We also present the cheaper adversarial attacks we propose to use in A2T.

3.1 Training Objective

Following the recommendations by Goodfellow et al. (2014) and Kurakin et al. (2016), we use both clean examples (i.e. the original training examples) and adversarial examples to train our model. We aim to minimize both the loss on the original training dataset and the loss on the adversarial examples.

Let $L(\theta, x, y)$ represent the loss function for input text $x$ and label $y$, and let $A(\theta, x, y)$ be the adversarial attack that produces adversarial example $x_{adv}$. Then, our training objective is as follows:

$$\arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim D} \left[ L(\theta, x, y) + \alpha \, L(\theta, A(\theta, x, y), y) \right]$$

Here, $\alpha$ is used to weigh the adversarial loss. In this work, we set $\alpha = 1$, weighing the two losses equally (we leave tuning for the optimal $\alpha$ for future work).
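As a rough illustration of this objective, the sketch below adds the loss on a clean example to a weighted loss on its adversarial counterpart. It assumes a cross-entropy loss over class probabilities and calls the weight `alpha`; the function names and the probability values are hypothetical and only serve the example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[label])


def combined_loss(clean_probs, adv_probs, label, alpha=1.0):
    """Clean loss plus alpha times the loss on the adversarial example.

    clean_probs / adv_probs are the model's class probabilities for the
    original input and for its adversarial version (same gold label).
    """
    return cross_entropy(clean_probs, label) + alpha * cross_entropy(adv_probs, label)


if __name__ == "__main__":
    # Hypothetical predictions: confident on the clean input, fooled on the adversarial one.
    clean = [0.1, 0.9]
    adversarial = [0.7, 0.3]
    print(combined_loss(clean, adversarial, label=1, alpha=1.0))
```

With `alpha=1.0` both terms contribute equally; setting `alpha=0.0` recovers ordinary training on clean examples only.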

3.2 A Practical Training Workflow

Previous works in adversarial training, especially those from computer vision (Goodfellow et al., 2014; Madry et al., 2018), generate adversarial examples between every mini-batch and use them to train the model. However, it is difficult in practice to generate adversarial examples between every mini-batch update when using NLP adversarial attacks.

This is because NLP adversarial attacks typically require other neural networks as their sub-components (e.g. sentence encoders, masked language models). For example, Jin et al. (2019) use the Universal Sentence Encoder (Cer et al., 2018), while Garg and Ramakrishnan (2020) use a BERT masked language model (Devlin et al., 2018). Given that recent Transformer models such as BERT and RoBERTa also require large amounts of GPU memory to store the computation graph during training, it is often infeasible to run adversarial attacks and train the model on the same GPU. We therefore propose to instead maximize GPU utilization by first generating adversarial examples before every epoch and then using the generated samples to train the model.

Also, instead of generating adversarial examples for every clean example in the training dataset, we leave the desired number of adversarial examples as a hyperparameter $\gamma$, expressed as a percentage of the original training dataset. For cases where the adversarial attack fails to find an adversarial example, we skip them and instead sample more from the training dataset to compensate for the skipped samples. In our experiments, unless specified otherwise, we attack a fixed percentage $\gamma$ of the training dataset, chosen based on an empirical study of $\gamma$ that we present in Section 5.6.

Algorithm 1 shows the proposed A2T adversarial training algorithm in detail. We run clean training for $N_{clean}$ epochs before performing $N_{adv}$ epochs of adversarial training. Between lines 6 and 13, we generate adversarial examples until we obtain $\gamma$ percent of the training dataset. When multiple GPUs are available, we use data parallelism to speed up the generation process. We also shuffle the dataset before attacking to avoid attacking the same samples every epoch.

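0:  Number of clean epochs $N_{clean}$, number of adversarial epochs $N_{adv}$, percentage of dataset to attack $\gamma$, attack $A(\theta, x, y)$, training data $D = \{(x_i, y_i)\}$, and the smoothing proportion $\alpha$ of adversarial training
1:  Initialize model $\theta$
2:  for clean epoch $= 1, \ldots, N_{clean}$ do
3:     Train $\theta$ on $D$
4:  end for
5:  for adversarial epoch $= 1, \ldots, N_{adv}$ do
6:     Randomly shuffle $D$
7:     $D_{adv} \leftarrow \emptyset$
8:     $i \leftarrow 1$
9:     while $|D_{adv}| < \gamma \cdot |D|$ and $i \le |D|$ do
10:       $x_{adv} \leftarrow A(\theta, x_i, y_i)$
11:       $D_{adv} \leftarrow D_{adv} \cup \{(x_{adv}, y_i)\}$ if the attack succeeded
12:       $i \leftarrow i + 1$
13:     end while
14:     $D' \leftarrow D \cup D_{adv}$
15:     Train $\theta$ on $D'$ with $\alpha$ used to weigh the loss
16:  end for
Algorithm 1: Adversarial Training with A2T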
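The per-epoch generation step of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the authors' implementation: `attack` is a hypothetical callable that returns a perturbed text or `None` on failure, `gamma` is a fraction between 0 and 1, and `toy_attack` is a stand-in that merely swaps one word.

```python
import random


def build_adversarial_set(dataset, attack, gamma, rng):
    """Collect adversarial examples until gamma * len(dataset) are found.

    dataset: list of (text, label) pairs.
    attack:  callable (text, label) -> perturbed text, or None if the attack fails.
    Failed attacks are skipped, so more clean examples are tried to compensate.
    """
    shuffled = dataset[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)          # avoid attacking the same samples every epoch
    target = int(gamma * len(dataset))
    adversarial = []
    for text, label in shuffled:
        if len(adversarial) >= target:
            break
        perturbed = attack(text, label)
        if perturbed is not None:
            adversarial.append((perturbed, label))
    return adversarial


def toy_attack(text, label):
    """Hypothetical stand-in attack: succeeds only when it can swap 'good'."""
    if "good" not in text.split():
        return None
    return text.replace("good", "decent")


if __name__ == "__main__":
    data = [("good film %d" % i, 1) for i in range(5)]
    data += [("dull film %d" % i, 0) for i in range(5)]
    adv = build_adversarial_set(data, toy_attack, gamma=0.2, rng=random.Random(0))
    train_set = data + adv         # clean and adversarial examples are trained on together
    print(len(adv), len(train_set))
```

Generating the whole adversarial set once per epoch, rather than per mini-batch, is what lets the attack and the training step use the GPU at different times.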

3.3 Cheaper Attack for Adversarial Training

The attack component in A2T is designed to be faster than previous attacks from literature. We achieve the speedup by making two key choices when constructing our attack: (1) Gradient-based word importance ordering, and (2) DistilBERT (Sanh et al., 2019) semantic textual similarity constraint. Table 1 summarizes the differences between A2T and two other attacks from literature: TextFooler Jin et al. (2019) and BAE Garg and Ramakrishnan (2020).

| Components | A2T | A2T-MLM | TextFooler | BAE |
|---|---|---|---|---|
| Search Method for Ranking Words | Gradient-based Word Importance | Gradient-based Word Importance | Deletion-based Word Importance | Deletion-based Word Importance |
| Word Substitution | Word Embedding | BERT MLM | Word Embedding | BERT MLM |
| Constraints | POS Consistency; DistilBERT Similarity | POS Consistency; DistilBERT Similarity | POS Consistency; USE Similarity | POS Consistency; USE Similarity |

Table 1: Comparing A2T and its A2T-MLM variation with TextFooler (Jin et al., 2019) and BAE (Garg and Ramakrishnan, 2020).
| Attack | Runtime (sec) |
|---|---|
| A2T | 5,988 |
| TextFooler + Gradient Search | 8,760 |
| TextFooler | 17,268 |

Table 2: The runtime (in seconds) of A2T, the original TextFooler, and TextFooler where deletion-based word importance ranking is replaced with gradient-based word importance ranking. Replacing the search method gives approximately a 2x speedup (17,268 s vs. 8,760 s). The attack was carried out against a BERT model trained on the IMDB dataset, and 1000 samples were attacked.

Faster Search with Gradient-based Word Importance Ranking:

Previous attacks such as Jin et al. (2019) and Garg and Ramakrishnan (2020) iteratively replace one word at a time to generate adversarial examples. To determine the order in which to replace words, both Jin et al. (2019) and Garg and Ramakrishnan (2020) rank the words by how much the target model’s confidence on the ground truth label changes when the word is deleted from the input. We will refer to this as deletion-based word importance ranking.

One issue with this method is that an additional forward pass of the model must be made for each word to calculate its importance. For longer text inputs, this can mean that we have to make up to hundreds of forward passes to generate one adversarial example.
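The cost of deletion-based ranking is easy to see in a small sketch. The code below scores each word by the drop in the model's confidence on the gold label when that word is removed. `ToyClassifier` is a hypothetical keyword-counting stand-in for a real model, used only so that the number of forward passes can be counted.

```python
import math


class ToyClassifier:
    """Hypothetical sentiment model that counts its own forward passes."""

    POSITIVE = {"great", "good"}
    NEGATIVE = {"bad", "boring"}

    def __init__(self):
        self.forward_passes = 0

    def predict_proba(self, words):
        self.forward_passes += 1
        score = sum(w in self.POSITIVE for w in words) - sum(w in self.NEGATIVE for w in words)
        p_pos = 1.0 / (1.0 + math.exp(-score))
        return [1.0 - p_pos, p_pos]          # [P(negative), P(positive)]


def deletion_importance(words, label, model):
    """Importance of word i = confidence drop on `label` when word i is deleted."""
    base = model.predict_proba(words)[label]  # one pass on the full input
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]   # one extra pass per word
        scores.append(base - model.predict_proba(reduced)[label])
    return scores


if __name__ == "__main__":
    model = ToyClassifier()
    sentence = ["a", "great", "movie"]
    print(deletion_importance(sentence, label=1, model=model))
    print("forward passes:", model.forward_passes)
```

For an input of n words this makes n + 1 forward passes before a single substitution has been tried, which is the overhead the gradient-based ranking is meant to remove.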

A2T instead determines each word’s importance using the gradient of the loss. For an input text of $n$ words, $x = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is a word, the importance of $x_i$ is calculated as:

$$I(x_i) = \lVert \nabla_{e_i} L(\theta, x, y) \rVert_1$$

where $e_i$ is the word embedding that corresponds to word $x_i$. For BERT and RoBERTa models, where inputs are tokenized into sub-words, we calculate the importance of each word by taking the average over all sub-words constituting the word.

This requires only one forward and backward pass and saves us from having to make additional forward passes for each word. Yoo et al. (2020) showed that the gradient-ordering method is the fastest search method and provides a competitive attack success rate when compared to the deletion-based method. Table 2 shows that when we switch from deletion-based ranking (“TextFooler”) to gradient-based ranking (“TextFooler + Gradient Search”), we obtain approximately a 2x speedup (17,268 s vs. 8,760 s).
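A minimal sketch of the gradient-based ranking follows. It assumes the gradients of the loss with respect to each sub-word embedding are already available from a single backward pass, scores each sub-word by the L1 norm of its gradient, and averages those scores over the sub-words of a word. Both the choice of norm and averaging scores (rather than gradient vectors) are assumptions of this sketch, and the gradient values are made up.

```python
def word_importance(subword_grads, word_to_subwords):
    """Score each word from the gradients of its sub-word embeddings.

    subword_grads:    one gradient vector (list of floats) per sub-word token.
    word_to_subwords: for each word, the indices of the sub-words it was split into.
    """
    scores = []
    for indices in word_to_subwords:
        # L1 norm of each sub-word gradient (assumed norm), then average per word.
        norms = [sum(abs(g) for g in subword_grads[i]) for i in indices]
        scores.append(sum(norms) / len(norms))
    return scores


def rank_words(scores):
    """Word indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical gradients for 4 sub-word tokens with 2-dimensional embeddings.
    grads = [[0.1, -0.2], [0.5, 0.5], [-0.3, 0.1], [0.0, 0.05]]
    # Word 0 -> token 0, word 1 -> tokens 1 and 2, word 2 -> token 3.
    mapping = [[0], [1, 2], [3]]
    scores = word_importance(grads, mapping)
    print(scores, rank_words(scores))
```

The ranking costs one forward and one backward pass regardless of input length, in contrast to one extra forward pass per word for deletion-based ranking.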

Cheaper Constraint Enforcing with DistilBERT (Sanh et al., 2019) semantic textual similarity model:

Most recent attacks, such as Jin et al. (2019), Garg and Ramakrishnan (2020), and Li et al. (2020), use the Universal Sentence Encoder (USE) (Cer et al., 2018) to compare the sentence encodings of the original text $x$ and the perturbed text $x'$. If the cosine similarity between the two encodings falls below a certain threshold, $x'$ is ignored. One of the challenges of using large encoders like USE is that they can take up a significant amount of GPU memory: up to 9GB in the case of USE.

Instead of using USE, A2T uses a DistilBERT (Sanh et al., 2019) model trained on a semantic textual similarity task as its constraint module (we use code from Reimers and Gurevych (2019)). This is because DistilBERT requires less GPU memory than USE and requires fewer operations.
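The similarity constraint itself is independent of which encoder produces the sentence vectors. The sketch below filters perturbed candidates by cosine similarity to the original encoding. The vectors and the threshold of 0.9 are hypothetical placeholders; in practice the encodings would come from the sentence encoder and the threshold from the attack's configuration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(original_vec, candidates, threshold):
    """Keep only candidates whose encoding stays close to the original.

    candidates: list of (perturbed_text, encoding) pairs.
    """
    return [
        text
        for text, vec in candidates
        if cosine_similarity(original_vec, vec) >= threshold
    ]


if __name__ == "__main__":
    original = [1.0, 0.0, 0.2]                 # hypothetical sentence encoding
    candidates = [
        ("a decent film", [0.9, 0.1, 0.2]),    # close to the original
        ("a dreadful film", [-0.8, 0.5, 0.1]), # far from the original
    ]
    print(filter_candidates(original, candidates, threshold=0.9))
```

Because every candidate substitution has to pass this check, a smaller and faster encoder reduces the cost of each search step as well as the memory footprint.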

3.4 A2T-MLM: Variation with a Different Word Substitution Strategy

A2T generates replacements for each word by selecting the top-$k$ nearest neighbors in a counter-fitted word embedding (Mrksic et al., 2016). This word substitution strategy was previously proposed by Alzantot et al. (2018) and Jin et al. (2019). We also consider a variation, which we name A2T-MLM, in which a BERT masked language model is used to generate replacements (as proposed in Garg and Ramakrishnan (2020) and Li et al. (2020, 2021)).

We consider this variation because the two strategies prioritize different language qualities when proposing word replacements. Counter-fitted word embeddings are likely to propose synonyms as replacements, but could produce incoherent texts because they do not take the entire context into account. On the other hand, a BERT masked language model is more likely to propose replacement words that preserve grammatical and contextual coherency but fail to preserve the semantics. Comparing A2T with A2T-MLM allows us to study the effect of the word substitution strategy on adversarial training.
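The embedding-based substitution strategy amounts to a nearest-neighbor lookup. The sketch below returns the `k` words closest to a query word by cosine similarity. The four-word, two-dimensional embedding table is a made-up toy, standing in for a counter-fitted embedding in which synonyms are close together and antonyms are far apart.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_neighbors(word, embeddings, k):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine_similarity(query, embeddings[w]), reverse=True)
    return others[:k]


if __name__ == "__main__":
    # Toy embedding table (hypothetical values).
    toy_embeddings = {
        "good": [1.0, 0.1],
        "great": [0.9, 0.2],
        "fine": [0.8, 0.3],
        "bad": [-1.0, 0.1],
    }
    print(top_k_neighbors("good", toy_embeddings, k=2))
```

Note that the lookup ignores the surrounding sentence entirely, which is the source of the incoherence risk discussed above; a masked language model conditions on context instead.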

4 Related Work

Past works on adversarial training for NLP models come in diverse flavors that differ in how adversarial examples are generated. Miyato et al. (2017), one of the first works to introduce adversarial training to NLP tasks, perform perturbations at the word embedding level instead of the actual input space level. Likewise, Zhu et al. (2019), Jiang et al. (2020), and Liu et al. (2020) all apply perturbations at the embedding level using gradient-based optimization methods from computer vision.

Another family of work on adversarial training involves computing the hyperspace of activations that contains all texts that can be generated using word substitutions and then training the model to make consistent predictions for inputs inside the hyperspace. Jia et al. (2019) and Huang et al. (2019) compute axis-aligned hyper-rectangles and leverage Interval Bound Propagation (Dvijotham et al., 2018) to defend the model against substitution attacks, while Dong et al. (2021) compute the desired hyperspace as a convex hull in the embedding space and further train the model to be robust against the worst-case embedding in the convex hull.

Yet, adversarial training that simply uses adversarial examples generated in the input space is still a relatively unexplored area of research despite its simple, extendable workflow. Most works that have discussed this form of adversarial training only train a limited number of models on a limited number of datasets to show that adversarial training can make models more resistant to the particular attack used to train the model (Jin et al., 2019; Ren et al., 2019; Li et al., 2020; Zang et al., 2020; Li et al., 2021). Our work demonstrates that simple vanilla adversarial training can actually provide improvements in adversarial robustness across many different word substitution attacks. Furthermore, we show that it can improve both generalization and interpretability of models, properties that have not been examined by previous works.

5 Experiment and Results

5.1 Datasets & Models

We chose IMDB Maas et al. (2011), Movie Reviews (MR) Pang and Lee (2005), Yelp Zhang et al. (2015), and SNLI Bowman et al. (2015) datasets for our experiment. For Yelp, instead of using the entire training set, we sampled 30k examples for training and 10k for validation.

| Dataset | Train | Dev | Test |
|---|---|---|---|
| IMDB | 20k | 5k | 25k |
| MR | 8.5k | 1k | 1k |
| Yelp | 30k | 10k | 38k |
| SNLI | 550k | 10k | 10k |

Table 3: Overview of the datasets.

We trained BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models using the implementation provided by Wolf et al. (2020). All texts were truncated to the first 512 tokens, and we trained each model for one clean epoch and three adversarial epochs. We used the Adam optimizer with weight decay (Loshchilov and Hutter, 2017) and a linear learning-rate scheduler with warm-up, with separate warm-up step counts for IMDB and Yelp, for MR, and for SNLI. We performed three runs with random seeds for each model.

5.2 Baselines

Adversarial training can be viewed as a data augmentation method where hard examples are added to the training set. Therefore, besides having models that are trained only on clean examples (i.e. “natural training”) as our baseline, we also compare our results to models trained using more conventional data augmentation methods. We use SSMBA (Ng et al., 2020) and backtranslation (Xie et al., 2019) as our baselines, as both have reported strong performance on text classification tasks. For backtranslation, we use the English-to-German and German-to-English models trained by Ng et al. (2019). We use these methods to generate approximately the same number of new training examples as adversarial training.

5.3 Results on Adversarial Robustness

To evaluate models’ robustness to adversarial attacks, we attempt to generate adversarial examples from 1000 randomly sampled clean examples from the test set and measure the attack success rate.

Table 4 shows the attack success rates of the A2T attack and the A2T-MLM attack against models that have been trained using A2T, A2T-MLM, and other baseline methods. Note that the overall attack success rates appear fairly low because we applied strict constraints to improve the quality of the adversarial examples (as recommended by Morris et al. (2020b)). Still, we can see that for both attacks, adversarial training using the same attack can decrease the attack success rate by up to 70%. What is surprising is that training the model using a different attack also led to a decrease in the attack success rate. From Table 4, we can see that adversarial training using the A2T-MLM attack lowers the attack success rate of the A2T attack, while training with A2T lowers the attack success rate of the A2T-MLM attack.

Another surprising observation is that training with data augmentation methods like SSMBA and backtranslation can lead to improvements in robustness against both adversarial attacks. However, in the case of smaller datasets such as MR, data augmentation can also hurt robustness.

When we compare the attack success rates between BERT and RoBERTa models, we also see an interesting pattern. BERT models tend to be more vulnerable to the A2T attack, while RoBERTa models tend to be more vulnerable to the A2T-MLM attack.

Lastly, we use attacks proposed in the literature to evaluate the models’ adversarial robustness. Table 5 shows the attack success rates of TextFooler (Jin et al., 2019), BAE (Garg and Ramakrishnan, 2020), PWWS (Ren et al., 2019), and PSO (Zang et al., 2020). These attacks were implemented using the TextAttack library (Morris et al., 2020a). Across four datasets and two models, we can see that both A2T and A2T-MLM lower the attack success rate against all four attacks in all but five cases. The results for PWWS and PSO are especially surprising since both use different transformations, WordNet (Miller, 1995) and HowNet (Dong et al., 2010) respectively, when carrying out the attacks.
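The two quantities reported in Tables 4 and 5 can be computed as in the short sketch below. It assumes the attack success rate is the share of attacked examples for which an adversarial example was found, and that the change column is the relative change with respect to the naturally trained model. The count of 429 successes out of 1000 is illustrative, chosen to match the 42.9% reported for BERT-Natural on IMDB.

```python
def attack_success_rate(num_successful, num_attempted):
    """Attack success rate in percent (assumed: successes / attempted attacks)."""
    return 100.0 * num_successful / num_attempted


def percent_change(baseline_rate, new_rate):
    """Relative change of new_rate with respect to baseline_rate, in percent."""
    return 100.0 * (new_rate - baseline_rate) / baseline_rate


if __name__ == "__main__":
    natural = attack_success_rate(429, 1000)   # illustrative counts
    # BERT-Natural vs. BERT-A2T under the A2T attack on IMDB (Table 4 values).
    print(natural, round(percent_change(42.9, 12.7), 1))
```

Reading the change as relative rather than absolute explains why a drop from 42.9 to 12.7 points is reported as roughly -70% rather than -30.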

| Attack | Model | IMDB A.S.% | IMDB Δ% | MR A.S.% | MR Δ% | Yelp A.S.% | Yelp Δ% | SNLI A.S.% | SNLI Δ% |
|---|---|---|---|---|---|---|---|---|---|
| A2T | BERT-Natural | 42.9 | | 20.9 | | 25.4 | | 53.3 | |
| A2T | BERT-A2T | 12.7 | -70.4 | 13.2 | -36.8 | 11.5 | -54.7 | 15.6 | -70.7 |
| A2T | BERT-A2T-MLM | 34.5 | -19.6 | 18.9 | -9.6 | 21.0 | -17.3 | 47.2 | -11.4 |
| A2T | BERT-SSMBA | 29.5 | -31.2 | 21.1 | 1.0 | 23.3 | -8.3 | 51.0 | -4.3 |
| A2T | BERT-BackTranslation | 33.1 | -22.8 | 19.2 | -8.1 | 24.0 | -5.5 | 48.3 | -9.4 |
| A2T | RoBERTa-Natural | 34.3 | | 18.6 | | 19.9 | | 48.4 | |
| A2T | RoBERTa-A2T | 12.4 | -63.8 | 12.1 | -34.9 | 7.6 | -61.8 | 8.3 | -82.9 |
| A2T | RoBERTa-A2T-MLM | 19.5 | -43.1 | 17.1 | -8.1 | 13.0 | -34.7 | 40.3 | -16.7 |
| A2T | RoBERTa-SSMBA | 24.0 | -30.0 | 21.8 | 17.2 | 19.3 | -3.0 | 48.7 | 0.6 |
| A2T | RoBERTa-BackTranslation | 28.9 | -15.7 | 18.3 | -1.6 | 16.1 | -19.1 | 48.6 | 0.4 |
| A2T-MLM | BERT-Natural | 76.6 | | 37.7 | | 47.1 | | 77.9 | |
| A2T-MLM | BERT-A2T | 61.7 | -19.5 | 33.2 | -11.9 | 42.5 | -9.8 | 76.7 | -1.5 |
| A2T-MLM | BERT-A2T-MLM | 48.3 | -36.9 | 24.7 | -34.5 | 27.9 | -40.8 | 37.1 | -52.4 |
| A2T-MLM | BERT-SSMBA | 59.6 | -22.2 | 36.2 | -4.0 | 44.8 | -4.9 | 76.9 | -1.3 |
| A2T-MLM | BERT-BackTranslation | 68.8 | -10.2 | 36.3 | -3.7 | 46.8 | -0.6 | 77.3 | -0.8 |
| A2T-MLM | RoBERTa-Natural | 81.5 | | 40.9 | | 53.2 | | 78.6 | |
| A2T-MLM | RoBERTa-A2T | 69.8 | -14.4 | 38.4 | -6.1 | 45.2 | -15.0 | 76.5 | -2.7 |
| A2T-MLM | RoBERTa-A2T-MLM | 37.0 | -54.6 | 28.5 | -30.3 | 25.8 | -51.5 | 35.2 | -55.2 |
| A2T-MLM | RoBERTa-SSMBA | 57.0 | -30.1 | 43.1 | 5.4 | 47.8 | -10.2 | 78.3 | -0.4 |
| A2T-MLM | RoBERTa-BackTranslation | 74.3 | -8.8 | 41.1 | 0.5 | 43.8 | -17.7 | 79.1 | 0.6 |

Table 4: Attack success rates of the A2T and A2T-MLM attacks. A.S.% represents the attack success rate, and the Δ% column represents the percent change between the attack success rate of natural training and that of the different training methods.
| Attack | Model | IMDB A.S.% | IMDB Δ% | MR A.S.% | MR Δ% | Yelp A.S.% | Yelp Δ% | SNLI A.S.% | SNLI Δ% |
|---|---|---|---|---|---|---|---|---|---|
| TextFooler | BERT-Natural | 85.0 | | 91.6 | | 55.9 | | 97.5 | |
| TextFooler | BERT-A2T | 66.0 | -22.4 | 90.6 | -1.1 | 57.9 | 3.6 | 92.2 | -5.4 |
| TextFooler | BERT-A2T-MLM | 88.2 | 3.8 | 89.0 | -2.8 | 67.7 | 21.1 | 94.1 | -3.5 |
| TextFooler | RoBERTa-Natural | 95.2 | | 94.4 | | 74.5 | | 96.7 | |
| TextFooler | RoBERTa-A2T | 82.4 | -13.4 | 91.0 | -3.6 | 68.7 | -7.8 | 91.4 | -5.5 |
| TextFooler | RoBERTa-A2T-MLM | 72.9 | -23.4 | 88.6 | -6.1 | 71.7 | -3.8 | 90.8 | -6.1 |
| BAE | BERT-Natural | 60.5 | | 52.6 | | 37.8 | | 76.7 | |
| BAE | BERT-A2T | 46.7 | -22.8 | 51.5 | -2.1 | 34.4 | -9.0 | 75.9 | -1.0 |
| BAE | BERT-A2T-MLM | 52.4 | -13.4 | 43.8 | -16.7 | 31.3 | -17.2 | 60.9 | -20.6 |
| BAE | RoBERTa-Natural | 65.5 | | 56.4 | | 44.4 | | 75.6 | |
| BAE | RoBERTa-A2T | 56.8 | -13.3 | 54.7 | -3.0 | 38.0 | -14.4 | 76.0 | 0.5 |
| BAE | RoBERTa-A2T-MLM | 42.3 | -35.4 | 48.3 | -14.4 | 28.7 | -35.4 | 61.2 | -19.0 |
| PWWS | BERT-Natural | 87.5 | | 82.1 | | 67.9 | | 98.5 | |
| PWWS | BERT-A2T | 70.9 | -19.0 | 80.4 | -2.1 | 65.4 | -3.7 | 97.5 | -1.0 |
| PWWS | BERT-A2T-MLM | 87.1 | -0.5 | 81.3 | -1.0 | 72.2 | 6.3 | 97.5 | -1.0 |
| PWWS | RoBERTa-Natural | 96.6 | | 83.8 | | 77.9 | | 98.2 | |
| PWWS | RoBERTa-A2T | 84.4 | -12.6 | 81.9 | -2.3 | 73.1 | -6.2 | 97.1 | -1.1 |
| PWWS | RoBERTa-A2T-MLM | 73.5 | -23.9 | 79.8 | -4.8 | 70.7 | -9.2 | 96.5 | -1.7 |
| PSO | BERT-Natural | 43.8 | | 81.6 | | 40.3 | | 92.1 | |
| PSO | BERT-A2T | 16.5 | -62.3 | 73.2 | -10.3 | 26.4 | -34.5 | 89.1 | -3.3 |
| PSO | BERT-A2T-MLM | 29.9 | -31.7 | 75.4 | -7.6 | 34.4 | -14.6 | 89.7 | -2.6 |
| PSO | RoBERTa-Natural | 34.8 | | 88.0 | | 35.7 | | 90.6 | |
| PSO | RoBERTa-A2T | 12.9 | -62.9 | 81.6 | -7.3 | 21.6 | -39.5 | 85.3 | -5.8 |
| PSO | RoBERTa-A2T-MLM | 13.1 | -62.4 | 77.5 | -11.9 | 20.3 | -43.1 | 84.8 | -6.4 |

Table 5: Attack success rates of attacks from the literature, including the original TextFooler (Jin et al., 2019), BAE (Garg and Ramakrishnan, 2020), PWWS (Ren et al., 2019), and PSO (Zang et al., 2020). A.S.% represents the attack success rate, and the Δ% column represents the percent change between natural training and the different training methods.
| Model | IMDB Standard | IMDB Cross-domain | MR Standard | MR Cross-domain | Yelp Standard | Yelp Cross-domain | SNLI Standard | SNLI Cross-domain |
|---|---|---|---|---|---|---|---|---|
| BERT-Natural | 93.97 | 92.13 | 85.40 | 90.60 | 96.34 | 88.31 | 90.29 | 73.34 |
| BERT-A2T | 94.49 | 92.50 | 85.61 | 88.45 | 96.68 | 89.24 | 90.16 | 73.79 |
| BERT-A2T-MLM | 93.05 | 90.67 | 83.80 | 85.32 | 95.85 | 85.01 | 87.87 | 70.93 |
| BERT-SSMBA | 93.94 | 91.59 | 85.33 | 89.49 | 96.28 | 88.54 | 90.23 | 73.27 |
| BERT-BackTranslation | 93.97 | 91.73 | 85.65 | 89.46 | 96.46 | 88.77 | 90.57 | 72.82 |
| RoBERTa-Natural | 95.26 | 94.09 | 87.52 | 93.42 | 97.26 | 91.94 | 91.56 | 77.66 |
| RoBERTa-A2T | 95.57 | 94.41 | 88.03 | 93.45 | 97.45 | 91.86 | 91.16 | 77.88 |
| RoBERTa-A2T-MLM | 94.71 | 94.48 | 86.49 | 92.93 | 96.84 | 90.44 | 88.56 | 74.82 |
| RoBERTa-SSMBA | 95.25 | 94.11 | 86.46 | 93.03 | 97.16 | 91.90 | 91.38 | 77.02 |
| RoBERTa-BackTranslation | 95.31 | 93.84 | 87.78 | 93.77 | 97.25 | 91.76 | 90.79 | 76.70 |

Table 6: Accuracy on in-domain (standard) and out-of-domain (cross-domain) datasets. We can see that adversarial training can help models outperform both naturally trained models and models trained using data augmentation methods.
| Model | IMDB | MR | Yelp |
|---|---|---|---|
| BERT-Natural | 7.78 | 33.43 | 12.78 |
| BERT-A2T | 10.74 | 34.25 | 13.18 |
| BERT-A2T-MLM | 9.12 | 32.17 | 11.14 |
| BERT-SSMBA | 7.21 | 32.21 | 10.94 |
| BERT-BackTranslation | 6.02 | 0.39 | 11.10 |
| RoBERTa-Natural | 0.35 | 0.01 | -1.09 |
| RoBERTa-A2T | 0.43 | 0.45 | -1.01 |
| RoBERTa-A2T-MLM | 0.09 | -0.12 | -1.13 |
| RoBERTa-SSMBA | 0.26 | 0.05 | -0.43 |
| RoBERTa-BackTranslation | -0.04 | 0.05 | -1.06 |

Table 7: AOPC scores of the LIME explanations for each model. Higher AOPC scores indicate that the model is more interpretable.

5.4 Results on Generalization

To evaluate how adversarial training affects the model’s generalization ability, we evaluate its accuracy on the original test set (i.e. standard accuracy) and on an out-of-domain dataset (e.g. the Yelp dataset for a model trained on the IMDB dataset). In Table 6, we can see that in all but two cases, adversarial training using the A2T attack beats natural training in terms of standard accuracy. In the two cases (SNLI) where natural training beats A2T, we can see that A2T still outperforms natural training in cross-domain accuracy. Overall, in six out of eight cases, A2T improves cross-domain accuracy. On the other hand, adversarial training with the A2T-MLM attack tends to hurt both standard accuracy and cross-domain accuracy. This confirms the observations reported by Li et al. (2021) and suggests that using a masked language model to generate adversarial examples can lead to a trade-off between robustness and generalization. We do not see a similar trade-off with A2T.

5.5 Results on Interpretability

We use LIME (Ribeiro et al., 2016) to generate local explanations for our models. For each example, LIME approximates the local decision boundary by fitting a linear model over the samples obtained by perturbing the example. To measure the faithfulness of the local explanations obtained using LIME, we measure the area over perturbation curve (AOPC) (Samek et al., 2017; Nguyen, 2018; Chen and Ji, 2020), which is defined as:

$$\text{AOPC} = \frac{1}{K+1} \sum_{k=1}^{K} \left[ f(x^{(0)}) - f(x^{(k)}) \right]$$

where $x^{(0)}$ represents example $x$ with none of the words removed and $x^{(k)}$ represents example $x$ with the top-$k$ most important words removed. $f(x)$ here represents the model’s confidence on the target label $y$. Intuitively, AOPC measures on average how the model’s confidence on the target label changes when we delete the top-$k$ most important words determined using LIME.

For each dataset, we randomly pick 1000 examples from the test set for evaluation. When running LIME to obtain explanations, we generate a fixed number of perturbed samples for each instance, and we use a fixed $K$ for the AOPC metric. Table 7 shows that across three sentiment classification datasets, the BERT model trained using the A2T attack achieves higher AOPC than natural training. For RoBERTa models, the same observation holds (although by smaller margins). Overall, we see that the AOPC scores for RoBERTa models are far lower than those for BERT models, suggesting that RoBERTa might be less interpretable than BERT.
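The AOPC computation can be sketched as follows. For each example the function takes the model's confidence on the target label with no words removed, followed by its confidence after removing the top-1, top-2, and so on up to the top-K most important words, then averages the confidence drops. The 1/(K+1) normalization follows a common formulation of AOPC and is an assumption of this sketch, as are the confidence values in the example.

```python
def aopc(confidence_curves):
    """Average AOPC over examples.

    confidence_curves: one list per example, [f(x^(0)), f(x^(1)), ..., f(x^(K))],
    i.e. confidence on the target label after removing the top-k important words.
    """
    total = 0.0
    for curve in confidence_curves:
        k_max = len(curve) - 1
        # Sum of confidence drops relative to the unperturbed input.
        drops = sum(curve[0] - curve[k] for k in range(1, k_max + 1))
        total += drops / (k_max + 1)   # assumed 1/(K+1) normalization
    return total / len(confidence_curves)


if __name__ == "__main__":
    # Hypothetical confidences for two examples with K = 2.
    curves = [
        [0.9, 0.6, 0.3],   # confidence falls quickly: explanation is faithful
        [0.9, 0.9, 0.8],   # confidence barely moves: explanation is less faithful
    ]
    print(aopc(curves))
```

A larger value means that deleting the words LIME marks as important lowers the model's confidence more, which is why higher AOPC is read as a more faithful explanation.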

Figure 2: (a) Attack success rates of A2T and A2T-MLM attacks on the BERT-A2T model trained on the IMDB dataset. (b) Standard accuracies and cross-domain accuracies (on Yelp) for the same model.

5.6 Analysis

A2T vs. A2T-MLM Attack

We can see that the model trained using the A2T attack outperforms the model trained using the A2T-MLM attack in standard accuracy and cross-domain accuracy in all but one case. This suggests that using counter-fitted embeddings can generate higher quality adversarial examples than masked language models. Since masked language models are only trained to predict words that are statistically most likely to appear, they are likely to propose words that change the semantics of the text entirely; this can lead to false positive adversarial examples.

Effect of Gamma

Recall $\gamma$, which is the desired percentage of adversarial examples to generate in every epoch. To study how it affects the results of adversarial training, we train a BERT model on the IMDB dataset with the A2T method and a range of $\gamma$ values starting from 0 (no adversarial training). Figure 2 shows the attack success rates, standard accuracies, and cross-domain accuracies.

We can see from Figure 2 (a) that a higher $\gamma$ does not necessarily mean that the trained model is more robust, with a value of $\gamma$ below the maximum producing a model that is more robust than the others. Overall, we can see that adversarial training is insensitive to the specific choice of $\gamma$, and a smaller $\gamma$ can be used for faster training.

Effect of Adversarial Training on Sentence Embedding

Figure 3: BERT-A2T decreases the distance between the [CLS] embeddings of the original $x$ and the adversarial $x_{adv}$.

In BERT (Devlin et al., 2018), the output for the [CLS] token represents the sentence-level embedding that is used by the final classification layer. For BERT-Natural and BERT-A2T, we measured the distance between the [CLS] embeddings of the original text $x$ and its corresponding adversarial example $x_{adv}$ using six different attacks. We noticed that across all cases, adversarial training decreases the average distance between $x$ and $x_{adv}$, as shown in Figure 3. This suggests that adversarial training improves the robustness of models by encouraging the model to learn a closer mapping of $x$ and $x_{adv}$.

6 Conclusion

In this paper, we have presented a practical vanilla adversarial training process called A2T that uses a new adversarial attack designed to generate adversarial examples quickly. We demonstrated that using A2T allows us to improve a model’s robustness against several different types of adversarial attacks that have been proposed in the literature. We have also shown that models trained using A2T can achieve better standard accuracy and/or cross-domain accuracy than baseline models.


References

  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pp. 1638–1649. Cited by: 1st item.
  • M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998. Cited by: §1, §2.2, §3.4.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §5.1.
  • N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. J. Goodfellow, A. Madry, and A. Kurakin (2019) On evaluating adversarial robustness. CoRR abs/1902.06705. External Links: Link, 1902.06705 Cited by: §2.1.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil (2018) Universal sentence encoder. CoRR abs/1803.11175. External Links: Link, 1803.11175 Cited by: §3.2, §3.3.
  • H. Chen and Y. Ji (2020) Learning variational word masks to improve the interpretability of neural text classifiers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4236–4251. External Links: Link, Document Cited by: §5.5.
  • M. Cheng, J. Yi, H. Zhang, P. Chen, and C. Hsieh (2018) Seq2Sick: evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR abs/1803.01128. External Links: Link, 1803.01128 Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: item 2, §1, §3.2, §5.1, §5.6.
  • X. Dong, A. T. Luu, R. Ji, and H. Liu (2021) Towards robustness against natural language word substitutions. In International Conference on Learning Representations, External Links: Link Cited by: §4.
  • Z. Dong, Q. Dong, and C. Hao (2010) HowNet and its computation of meaning. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, COLING ’10, USA, pp. 53–56. Cited by: §5.3.
  • K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelovic, B. O’Donoghue, J. Uesato, and P. Kohli (2018) Training verified learners with learned verifiers. CoRR abs/1805.10265. External Links: Link, 1805.10265 Cited by: §4.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2017) HotFlip: white-box adversarial examples for text classification. In ACL, Cited by: §1, §2.2.
  • J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56. Cited by: §2.2.
  • S. Garg and G. Ramakrishnan (2020) BAE: bert-based adversarial examples for text classification. External Links: 2004.01970 Cited by: §A.1, §1, §3.2, §3.3, §3.3, §3.3, §3.4, Table 1, §5.3, Table 5.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §2.1, §3.1, §3.2.
  • P. Huang, R. Stanforth, J. Welbl, C. Dyer, D. Yogatama, S. Gowal, K. Dvijotham, and P. Kohli (2019) Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4083–4093. External Links: Link, Document Cited by: §4.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. External Links: 1707.07328 Cited by: §1.
  • R. Jia, A. Raghunathan, K. Göksel, and P. Liang (2019) Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4129–4142. External Links: Link, Document Cited by: §4.
  • H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao (2020) SMART: robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2177–2190. External Links: Link, Document Cited by: §4.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2019) Is bert really robust? natural language attack on text classification and entailment. ArXiv abs/1907.11932. Cited by: §A.1, §1, §2.1, §2.2, §3.2, §3.3, §3.3, §3.3, §3.4, Table 1, §4, §5.3, Table 5.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. External Links: Link, 1607.02533 Cited by: §2.1, §3.1.
  • D. Li, Y. Zhang, H. Peng, L. Chen, C. Brockett, M. Sun, and B. Dolan (2021) Contextualized perturbation for textual adversarial attack. External Links: 2009.07502 Cited by: §2.1, §3.4, §4, §5.4.
  • L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu (2020) BERT-attack: adversarial attack against bert using bert. External Links: 2004.09984 Cited by: §2.1, §3.3, §3.4, §4.
  • X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao (2020) Adversarial training for large neural language models. External Links: 2004.08994 Cited by: §4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §5.1.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in adam. CoRR abs/1711.05101. External Links: Link, 1711.05101 Cited by: §5.1.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. External Links: Link Cited by: §1, §5.1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.1, §3.2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Commun. ACM 38 (11), pp. 39–41. External Links: ISSN 0001-0782, Link, Document Cited by: §5.3.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2017) Adversarial training methods for semi-supervised text classification. External Links: 1605.07725 Cited by: §4.
  • J. Morris, E. Lifland, J. Y. Yoo, and Y. Qi (2020a) TextAttack: a framework for adversarial attacks in natural language processing. ArXiv abs/2005.05909. Cited by: §A.1, §1, footnote 7.
  • J. X. Morris, E. Lifland, J. Lanchantin, Y. Ji, and Y. Qi (2020b) Reevaluating adversarial examples in natural language. External Links: 2004.14174 Cited by: §A.1, §2.2, §5.3.
  • N. Mrksic, D. Ó. Séaghdha, B. Thomson, M. Gasic, L. M. Rojas-Barahona, P. Su, D. Vandyke, T. Wen, and S. J. Young (2016) Counter-fitting word vectors to linguistic constraints. In HLT-NAACL, Cited by: item 1, §1, §3.4.
  • N. Ng, K. Cho, and M. Ghassemi (2020) SSMBA: self-supervised manifold based data augmentation for improving out-of-domain robustness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1268–1283. External Links: Link, Document Cited by: §5.2.
  • N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook fair’s wmt19 news translation task submission. In Proc. of WMT, Cited by: footnote 6.
  • D. Nguyen (2018) Comparing automatic and human evaluation of local explanations for text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1069–1078. External Links: Link, Document Cited by: §5.5.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, pp. 115–124. External Links: Link, Document Cited by: §1, §5.1.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: 2nd item, footnote 5.
  • S. Ren, Y. Deng, K. He, and W. Che (2019) Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1085–1097. External Links: Link, Document Cited by: §A.1, §1, §2.1, §2.2, §4, §5.3, Table 5.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. CoRR abs/1602.04938. External Links: Link, 1602.04938 Cited by: 3rd item, §5.5.
  • W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2017) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28 (11), pp. 2660–2673. External Links: Document Cited by: §5.5.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. External Links: Link, 1910.01108 Cited by: 2nd item, §3.3, §3.3, §3.3.
  • A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free!. CoRR abs/1904.12843. External Links: Link, 1904.12843 Cited by: §2.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link, Document Cited by: §5.1.
  • Q. Xie, Z. Dai, E. H. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation. CoRR abs/1904.12848. External Links: Link, 1904.12848 Cited by: §5.2.
  • J. Y. Yoo, J. X. Morris, E. Lifland, and Y. Qi (2020) Searching for a search method: benchmarking search algorithms for generating nlp adversarial examples. External Links: 2009.06368 Cited by: §1, §3.3.
  • Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun (2020) Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6066–6080. External Links: Link Cited by: §A.1, §1, §2.1, §4, §5.3, Table 5.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 649–657. Cited by: §1, §5.1.
  • C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu (2019) FreeLB: enhanced adversarial training for language understanding. CoRR abs/1909.11764. External Links: Link, 1909.11764 Cited by: §2.1, §4.

Appendix A

A.1 A2T and A2T-MLM Attacks

Here, we give more details about A2T and A2T-MLM. We use the framework introduced by Morris et al. (2020a) to break down an adversarial attack into four components: (1) a goal function, (2) a transformation, (3) a set of constraints, and (4) a search method.

Goal Function

In this work, we perform untargeted attacks since they are generally easier than targeted attacks. We aim to maximize the following goal function:

$$1 - P(y \mid x; \theta)$$

where $P(y \mid x; \theta)$ is the model's confidence in label $y$ given input $x$ and parameters $\theta$.
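As an illustration of an untargeted goal function, the sketch below assumes the common formulation in which the score is one minus the model's confidence in the true label, computed from raw logits with a softmax. The function names are hypothetical and the formulation is an assumption rather than a quotation of the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def untargeted_goal(logits: Sequence[float], true_label: int) -> float:
    """1 - P(y | x): grows as confidence in the true label drops."""
    return 1.0 - softmax(logits)[true_label]


def is_adversarial(logits: Sequence[float], true_label: int) -> bool:
    """The attack succeeds once the predicted label differs from y."""
    predicted = max(range(len(logits)), key=lambda k: logits[k])
    return predicted != true_label


# A confident correct prediction scores low; a flipped one scores high.
print(untargeted_goal([2.0, 0.0], true_label=0))
print(untargeted_goal([0.0, 2.0], true_label=0))
print(is_adversarial([0.0, 2.0], true_label=0))
```

An untargeted attack only needs the prediction to move away from the true label, whereas a targeted attack would have to raise the confidence of one specific other label.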


Transformation

The two attacks differ in the source of their replacement words:

  1. A2T: counter-fitted word embeddings Mrksic et al. (2016)

  2. A2T-MLM: BERT masked language model Devlin et al. (2018)

For both methods, we select the top 20 words proposed as replacements. This narrows the replacements down to the best ones and saves the time of considering less desirable ones.
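For the embedding-based variant, selecting the top replacements amounts to a nearest-neighbor lookup. The sketch below ranks candidate words by cosine similarity to the target word and keeps the top k. The two-dimensional vectors are a hypothetical toy table, not real counter-fitted embeddings, and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_replacements(
    word: str, embeddings: Dict[str, Sequence[float]], k: int = 20
) -> List[str]:
    """The k words closest to `word` in the embedding space."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical toy embedding table.
toy = {
    "good": [1.0, 0.1],
    "great": [0.9, 0.2],
    "fine": [0.8, 0.5],
    "bad": [-1.0, 0.1],
}
print(top_k_replacements("good", toy, k=2))  # ['great', 'fine']
```

Capping the list at k bounds the number of candidates, and therefore the number of model queries, spent on each word position.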


Constraints

We use the following constraints for both attacks:

  • Part-of-speech Consistency: To preserve fluency, we require that the two words being swapped have the same part of speech. This is determined by a part-of-speech tagger provided by Flair Akbik et al. (2018), an open-source NLP library.

  • DistilBERT Semantic Textual Similarity (STS) Sanh et al. (2019): We require that the cosine similarity between the sentence encodings of the original text and the perturbed text meets a minimum threshold. We use the fine-tuned DistilBERT model provided by Reimers and Gurevych (2019).

  • Max modification rate: We allow only 10% of the words to be replaced. This keeps us from modifying the text so much that its semantics change.

In addition, for the A2T attack, we require that the word embeddings of the original word and its replacement meet a minimum cosine similarity threshold.

The threshold values for word embedding similarity and sentence encoding similarity were set based on the recommendations of Morris et al. (2020b), who noted that high threshold values encourage strong semantic similarity between the original text and the perturbed text.
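The constraints above can be read as a conjunction of simple checks on a candidate perturbation. The following sketch assumes the part-of-speech tags and the sentence similarity score have already been computed elsewhere (no tagger or encoder is called). The tag strings, the similarity value, and the 0.5 threshold are placeholders, not the values used in the paper.

```python
from typing import Sequence


def within_modification_rate(
    original: Sequence[str], perturbed: Sequence[str], max_rate: float = 0.10
) -> bool:
    """True if at most max_rate of the word positions were changed."""
    if len(original) != len(perturbed):
        raise ValueError("word substitution keeps the number of words fixed")
    changed = sum(o != p for o, p in zip(original, perturbed))
    return changed <= max_rate * len(original)


def satisfies_constraints(
    original: Sequence[str],
    perturbed: Sequence[str],
    pos_before: str,
    pos_after: str,
    sentence_similarity: float,
    min_similarity: float,
) -> bool:
    """All constraints must hold for the perturbation to be accepted."""
    return (
        pos_before == pos_after
        and sentence_similarity >= min_similarity
        and within_modification_rate(original, perturbed)
    )


original = "the film was good and i would watch it again".split()
perturbed = "the film was fine and i would watch it again".split()
# Placeholder tags, similarity score, and threshold.
print(satisfies_constraints(original, perturbed, "ADJ", "ADJ", 0.97, 0.5))
```

Cheap checks such as the part-of-speech comparison are placed first so that a candidate can be rejected before any more expensive similarity score is consulted.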

Search Method

The search method is responsible for iteratively perturbing the original text until we discover an adversarial example that causes the model to mispredict. Algorithm 2 shows A2T's search algorithm. If the search finishes without finding an adversarial example, the attack has failed. The search can also exit early if it reaches the maximum number of queries to the victim model; this limit is called the query budget.

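Input:  Original text $x = (x_1, \dots, x_n)$. Transformation module $T(x, i)$ that perturbs $x$ by replacing $x_i$.
Output:  Adversarial text $x_{adv}$ if found
1:  Calculate the importance $I(x_i)$ for all words $x_i$ by making one forward and backward pass.
2:  $R \leftarrow$ ranking of words $x_1, \dots, x_n$ by descending importance
3:  $x^* \leftarrow x$
4:  for $i$ in $R$ do
5:     $X_{cand} \leftarrow T(x^*, i)$
6:     if $X_{cand} \neq \emptyset$ then
7:        $x^* \leftarrow$ the candidate in $X_{cand}$ that maximizes the goal function
8:        if $x^*$ fools the model then
9:           return $x^*$ as $x_{adv}$
10:        end if
11:     end if
12:  end for
Algorithm 2 A2T's Search Method: Gradient-based Word Importance Ranking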
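The greedy search can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation: the importance scores are passed in rather than computed from gradients, the model and the candidate generator are toy stand-ins, and all names are hypothetical. The sketch also stops once a query budget is used up.

```python
from typing import Callable, List, Optional, Sequence


def greedy_wir_search(
    words: Sequence[str],
    true_label: int,
    importance: Sequence[float],
    candidates_for: Callable[[List[str], int], List[List[str]]],
    predict_proba: Callable[[List[str]], Sequence[float]],
    query_budget: int = 200,
) -> Optional[List[str]]:
    """Return an adversarial word list, or None if the search fails."""
    # Visit word positions from most to least important.
    order = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)
    current = list(words)
    queries = 0
    for i in order:
        best, best_probs = None, None
        for cand in candidates_for(current, i):
            if queries >= query_budget:
                break
            probs = predict_proba(cand)
            queries += 1
            # Lower confidence in the true label means a higher goal score.
            if best_probs is None or probs[true_label] < best_probs[true_label]:
                best, best_probs = cand, probs
        if best is not None:
            current = best
            predicted = max(range(len(best_probs)), key=lambda k: best_probs[k])
            if predicted != true_label:
                return current
        if queries >= query_budget:
            return None  # budget exhausted without success
    return None


# Toy stand-ins (hypothetical): a keyword-count "model" and a swap table.
POSITIVE = {"great", "good"}
SWAPS = {"great": ["fine", "okay"], "good": ["decent"]}


def toy_predict(words: List[str]) -> List[float]:
    hits = sum(w in POSITIVE for w in words)
    p_pos = min(0.9, 0.3 + 0.3 * hits)
    return [1.0 - p_pos, p_pos]  # [negative, positive]


def toy_candidates(words: List[str], i: int) -> List[List[str]]:
    return [words[:i] + [s] + words[i + 1:] for s in SWAPS.get(words[i], [])]


text = ["a", "great", "and", "good", "film"]
scores = [0.0, 0.9, 0.0, 0.5, 0.1]
print(greedy_wir_search(text, 1, scores, toy_candidates, toy_predict))
```

In this toy run, replacing the most important word is not enough to flip the prediction, so the search keeps that substitution and moves on to the next-ranked word, which is the behavior that distinguishes a greedy search from independent single-word attacks.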

During training, we limit the search method to only 200 queries to the victim model for faster generation of adversarial examples. For evaluation using A2T and A2T-MLM, we increase the query budget to 2000 queries for a more extensive search. For other attacks such as TextFooler Jin et al. (2019), BAE Garg and Ramakrishnan (2020), PWWS Ren et al. (2019), and PSO Zang et al. (2020), the query budget is set to 5000.