Generating Natural Language Adversarial Examples on a Large Scale with Generative Models

03/10/2020 ∙ by Yankun Ren, et al. ∙ University of Southern California

Today, text classification models are widely used. However, these classifiers are easily fooled by adversarial examples. Fortunately, standard attacking methods generate adversarial texts in a pair-wise way, that is, an adversarial text can only be created from a real-world text by replacing a few words. In many applications, these texts are limited in number, so their corresponding adversarial examples are often not diverse enough and sometimes hard to read. They can therefore be easily detected by humans and cannot create chaos at a large scale. In this paper, we propose an end-to-end solution to efficiently generate adversarial texts from scratch using generative models, which are not restricted to perturbing the given texts. We call it unrestricted adversarial text generation. Specifically, we train a conditional variational autoencoder (VAE) with an additional adversarial loss to guide the generation of adversarial examples. Moreover, to improve the validity of adversarial texts, we utilize discriminators and the training framework of generative adversarial networks (GANs) to make adversarial texts consistent with real data. Experimental results on sentiment analysis demonstrate the scalability and efficiency of our method. It can attack text classification models with a higher success rate than existing methods while providing acceptable quality for humans.




1 Introduction

Today, machine learning classifiers are widely used to provide key services such as information filtering and sentiment analysis. However, researchers have recently found that these ML classifiers, even deep learning classifiers, are vulnerable to adversarial attacks. They demonstrate that image classifiers [10] and now even text classifiers [26] can be fooled easily by adversarial examples that are deliberately crafted by attacking algorithms. Their algorithms generate adversarial examples in a pair-wise way. That is, given one input $x$, they aim to generate one corresponding adversarial example $x'$ by adding small imperceptible perturbations to $x$. The adversarial examples must maintain the semantics of the original inputs, that is, $x'$ must still be classified as the same class as $x$ by humans. On the other hand, adversarial training is shown to be a useful defense method to resist adversarial examples [31, 10]. Trained on a mixture of adversarial and clean examples, classifiers can be resistant to adversarial examples.

In the area of natural language processing (NLP), existing methods are pair-wise and thus depend heavily on input data. If attackers want to use pair-wise methods to generate adversarial texts that should be classified as a chosen class, they must first collect texts labeled as the chosen class, then transform these labeled texts into the corresponding adversarial examples by replacing a few words. As the amount of labeled data is always small, the number of generated adversarial examples is limited. These adversarial examples are often not diverse enough and sometimes hard to read, and thus can be easily detected by humans. Moreover, in practice, if attackers aim to attack a public opinion monitoring system, they must collect a large number of high-quality labeled samples to generate a vast number of adversarial examples; otherwise, they can hardly create an impact on the targeted system. Therefore, pair-wise methods only demonstrate the feasibility of the attack but cannot create chaos on a large scale.

Figure 1: An illustration of adversarial text generation. (a) Given one negative text which is also classified as negative by an ML model, traditional methods replace a few words (yellow background) in the original text to get one paired adversarial text, which is still negative for humans, but the model prediction changes to positive. (b) Our unrestricted method does not need input texts. We only assign a ground-truth class (negative), and our method can then generate adversarial texts on a large scale, which are negative for humans but classified as positive by the ML model.

In this paper, we propose an unrestricted end-to-end solution to efficiently generate adversarial texts, where adversarial examples can be generated from scratch without real-world texts and are still meaningful for humans. We argue that adversarial examples do not need to be generated by perturbing existing inputs. For example, we can generate a movie review that does not stem from any example in the dataset at hand. If the movie review is judged to be positive by humans but classified as negative by the targeted model, the movie review is also an adversarial example. Adversarial examples generated in this way are not limited by the number of available inputs, so we can obtain adversarial examples on a large scale. On the other hand, the proposed method can also be used to create more adversarial examples for defense. Training with more adversarial examples often makes these key services more robust.

The proposed method uses a conditional variational autoencoder (VAE) as the generator, which can generate texts of a desired class. To guide the generator to generate texts that mislead the targeted model, we access the targeted model in a white-box setting and use an adversarial loss to make the targeted model make a wrong prediction. In order to make the generated texts consistent with human cognition, we use discriminators and the training framework of generative adversarial networks (GANs) to make generated texts similar to real data of the desired class. After the whole model is trained, we can sample from the latent space of the VAE and generate an unlimited number of adversarial examples without accessing the targeted model. The model can also transform a given input into an adversarial one.

We evaluate the performance of our attack method on a sentiment analysis task. Experiments show the scalability of generation. The adversarial examples generated from scratch achieve a high attack success rate and have acceptable quality. As the model can generate texts with feed-forward passes only, in parallel, the generation speed is quite fast compared with other methods. Additional ablation studies verify the effectiveness of the discriminators, and data augmentation experiments demonstrate that our method can generate large-scale adversarial examples with higher quality than other methods. When the existing data at hand is limited, our method is superior to pair-wise generation.

In summary, the major contributions of this paper are as follows:

  • Unlike the existing literature on text attacks, we do not construct adversarial examples by transforming given texts. Instead, we train a model to generate text adversarial examples from scratch. In this way, adversarial examples are not restricted to existing inputs at hand but can be generated from scratch on a large scale.

  • We propose a novel method based on the vanilla conditional VAE. To generate adversarial examples, we incorporate an adversarial loss to guide the vanilla VAE’s generation process.

  • We adopt one discriminator for each class of data. When training, we train the discriminators and the conditional VAE in a min-max game like GANs, which can make generated texts more consistent with real data of the desired class.

  • We conduct attack experiments on a sentiment analysis task. Experimental results show that our method is scalable and achieves a higher attack success rate at a higher speed than recent baselines. The quality of generated texts is also acceptable. Further ablation studies and data augmentation experiments verify our intuitions and demonstrate the superiority of scalable text adversarial example generation.

2 Related Work

There have been extensive studies on adversarial machine learning, especially on deep neural models [31, 10, 16, 28, 1]. Much work focuses on image classification tasks [31, 10, 5, 11, 33]. [31] solves the attack problem as an optimization problem with a box-constrained L-BFGS. [10] proposes the fast gradient sign method (FGSM), which perturbs images with noise computed from the sign of the gradients of the loss with respect to the inputs.

In NLP, perturbing texts is more difficult than perturbing images, because words in sentences are discrete, so we cannot directly perform gradient-based attacks as in the continuous image space. Most methods adapt the pair-wise methods of image attacks to text attacks. They perturb texts by replacing a few words. [24, 9, 6] calculate gradients with respect to the word vectors and perturb the word embedding vectors with these gradients. They then find the word vector nearest to the perturbed vector. In this way, the perturbed vector can be mapped to a discrete word that replaces the original one. These methods are gradient-based replacement methods.

Other attacks on texts can be summarized as gradient-free replacement methods. They replace words in texts with typos or synonyms. [16] proposes to edit words with tricks like insertion, deletion and replacement. They choose appropriate words to replace by calculating the word frequency and the highest gradient magnitude. [15] proposes five automatic word replacement methods, and uses the magnitude of the gradients of the word embedding vectors to choose the most important words to replace. [26] is based on a synonym substitution strategy. The authors introduce a new word replacement order determined by both the word saliency and the classification probability. However, these replacement methods still generate adversarial texts in a pair-wise way, which restricts the adversarial texts to variants of given real-world texts. Besides, the substitute words sometimes change the meaning of the text. Thus existing adversarial text generation methods only demonstrate the feasibility of the attack but cannot create chaos on a large scale.

To tackle the above problems, we propose an unrestricted end-to-end solution that generates diverse adversarial texts on a large scale without requiring given texts.

3 Methodology

In this section, we propose a novel method to generate adversarial texts for the text classification model on a large scale. Though trained with labeled data in a pair-wise way, after it is trained, our model can generate an unlimited number of adversarial examples without any input data. Moreover, like other traditional pair-wise generation methods, our model can also transform a given text into an adversarial one. Unlike the existing methods, our model generates adversarial texts without querying the attacked model, thus the generation procedure is quite fast.

Figure 2: The architecture of the whole model. In the training phase, $G$ generates an adversarial text $x'$ to reconstruct the original text $x$, and feeds $x'$ to $D$ and $M$ to make $M$ predict differently on $x$ and $x'$. Once trained, the model can generate adversarial texts on a large scale from a sampled latent vector $z$ and a chosen class $c$, without original texts $x$.

3.1 Overview

Figure 2 illustrates the overall architecture of our model. The model has three components: a generator $G$, discriminators $D$, and a targeted model $M$. $G$ and $D$ form a generative adversarial network (GAN). When training, we feed an original input $x$ to the generator $G$, which transforms $x$ into an adversarial output $x'$. The procedure can be defined as follows:

$$G(x) = x'$$

$G$ aims to generate $x'$ to reconstruct $x$. Then, we feed the generated $x'$ to the targeted model $M$, and $M$ will classify $x'$ as a certain class, which we hope is a wrong label. Thus we have the following equation:

$$M(x') = y'$$

where $y' \in \mathcal{Y}$ and $\mathcal{Y}$ is the label space of the targeted classification model.

In order to keep $x'$ classified as the same class as $x$ by humans, we add one discriminator $D_c$ for each class $c$. With the help of the min-max training strategy of the GAN framework, the discriminator $D_c$ of each class $c$ can draw $x'$ close to the distribution of real data of class $c$, so that $x'$ is compatible with human cognition.

We now proceed by introducing these components in further detail.

3.2 Generator

In this subsection, we describe the generator $G$ for text generation. We use the variational autoencoder (VAE) [14, 27] as the generator. The VAE is a generative model based on a regularized version of the standard autoencoder. This model supposes the latent variable $z$ is sampled from a prior distribution $p(z)$.

As shown in Figure 2, the VAE is composed of the encoder $q_\phi$ and the decoder $p_\theta$, where $\phi$ denotes the parameters of the encoder and $\theta$ denotes the parameters of the decoder.

The encoder is a neural network. Its input is a text $x$ and its output is a latent code $z$. It encodes $x$ into a latent representation space, which is a lower-dimensional space than the input space. The decoder is another neural network. Its input is the code $z$, and it outputs an adversarial text $x'$ according to the probability distribution of the input data.


In our model, we adopt the gated recurrent unit (GRU) [7] as the encoder and the decoder. As in Figure 3, the input is a sentence of $N$ words, and we formulate the input for the neural networks as follows: for a word at position $i$ in a sentence, we first transform it into a word vector by looking up a word embedding table. The word embedding table is randomly initialized and is updated during model training. Then the word embedding vectors are fed into the GRU encoder. In the $i$-th GRU cell, a hidden state $h_i$ is emitted.

We use $h_N$ to denote the last GRU cell’s hidden state, where $N$ is the length of the encoder input. In order to get the latent code $z$, we feed $h_N$ into two linear layers to get $\mu$ and $\sigma$ respectively. Following the Gaussian reparameterization trick [14], we draw a random sample $\epsilon$ from a standard Gaussian (zero mean, unit variance), and compute $z$ as:

$$z = \mu + \sigma \odot \epsilon$$

Computed in this way, $z$ is guaranteed to be sampled from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.
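The reparameterization step can be illustrated with a minimal sketch. It assumes the mean and standard deviation vectors have already been produced by the two linear layers; the function name `reparameterize` and the plain-list vectors are illustrative choices, not taken from the paper.

```python
import random

def reparameterize(mu, sigma, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    mu and sigma are equal-length lists of floats (illustrative stand-ins
    for the outputs of the two linear layers).
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    # The noise is sampled independently of mu and sigma, so in a real
    # model gradients can flow through mu and sigma.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# With sigma = 0 the sample collapses to the mean.
z = reparameterize([0.5, -1.0], [0.0, 0.0])
```

Keeping the randomness in a separate noise term is what makes the sampling step compatible with gradient-based training.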


Figure 3: The generator $G$. When training, we need input texts to train $G$. After $G$ is trained, we only need to sample $z$ from the latent space, and use the decoder to generate adversarial texts unrestrictedly without original texts.

Then, we can decode $z$ to generate an adversarial text $x'$. Before feeding $z$ to the decoder, we adopt a condition embedding to guide the decoder to generate text of a certain class $c$, which can be chosen arbitrarily. Suppose that in a text classification task there are $C$ classes. Specifically, we randomly initialize a class embedding table as a matrix with one row per class and look up the embedding of class $c$. Then, we feed $z$ together with the class embedding into a linear layer to get another vector representation. The resulting vector encodes the information of the input text and a desired class.

The decoder GRU uses this vector as the initial state to generate the output text. Each GRU cell generates one word. The computation process is similar to that of the GRU encoder, except for the output layer of each cell. The output of the $i$-th GRU cell is computed as:

$$y_i = \mathrm{softmax}(W h_i) \tag{4}$$

where $W$ is the transformation weights, $V$ is the word vocabulary, and $y_i \in \mathbb{R}^{|V|}$. $y_{ij}$ is the probability of the $i$-th GRU cell emitting the $j$-th word in the vocabulary.

In the training phase, the GRU cell chooses the word index with the highest probability to emit:

$$w_i = \arg\max_j \, y_{ij} \tag{6}$$


When training, the loss function of the VAE is calculated as:

$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{7}$$

The first term is the reconstruction loss, or expected negative log-likelihood. This term encourages the decoder to learn to reconstruct the data, so the output text is made similar to the input text. The second term is the Kullback-Leibler divergence between the latent vector distribution $q_\phi(z \mid x)$ and the prior $p(z)$. If the VAE were trained with only the reconstruction objective, it would learn to encode its inputs deterministically by making the variances in $q_\phi(z \mid x)$ vanishingly small [25]. Instead, the VAE uses the second term to encourage the model to keep its posterior distributions close to the prior $p(z)$, which is generally set as a standard Gaussian.

In the training phase, the input to the GRU decoder is the input text, prefixed with a special GO token as the start word. We append a special EOS token to the input text to form the ground truth of the output text. The EOS token represents the end of the sentence. When the GRU decoder is trained to generate texts, it tends to ignore the latent code and rely only on its input to emit the output text. It then degenerates into a language model. This situation is called KL-vanishing. To tackle the KL-vanishing problem in training the GRU decoder, we adopt the KL-annealing mechanism [2], which gradually increases the KL weight from $0$ to $1$. This can be thought of as annealing from a vanilla autoencoder to a VAE. Also, we randomly drop the input words to the decoder with a fixed keep rate, to make the decoder depend on the latent code to generate the output text.
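The two mitigations for KL-vanishing can be sketched as follows. The linear shape of the schedule, the `<unk>` placeholder used for dropped words, and all function and parameter names are assumptions of this sketch; the text only states that the KL weight increases gradually and that words are dropped with a fixed keep rate.

```python
import random

def kl_weight(step, anneal_steps):
    """Linear KL-annealing schedule: 0 at step 0, 1 from anneal_steps onward."""
    if anneal_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / anneal_steps))

def word_drop(tokens, keep_rate, unk="<unk>", rng=random):
    """Keep each decoder input word with probability keep_rate.

    The first token (the GO token) is always kept. Replacing dropped words
    with an unknown-word placeholder is an assumption of this sketch.
    """
    out = [tokens[0]] if tokens else []
    for tok in tokens[1:]:
        out.append(tok if rng.random() < keep_rate else unk)
    return out

def annealed_vae_loss(reconstruction_loss, kl_divergence, step, anneal_steps):
    """Reconstruction loss plus the KL term scaled by the current weight."""
    return reconstruction_loss + kl_weight(step, anneal_steps) * kl_divergence
```

Early in training the KL term contributes nothing, so the model behaves like a plain autoencoder; the regularization is phased in as the weight approaches one.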

Notably, if we randomly sample $z$ from a standard Gaussian, the decoder can also generate output text based on $z$. The difference is that there is no input text for the GRU decoder, but we can send the word generated by the $i$-th GRU cell to the $(i+1)$-th GRU cell as the $(i+1)$-th input word. Specifically, in the inference phase, we use beam search to generate words. The initial input word to the first GRU cell is the GO token. When the decoder emits the EOS token, it stops generating new words, and the generation of one complete sentence is finished.
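A generic beam search loop of the kind described above might look like the following sketch. The callback `step_fn` is a hypothetical stand-in for one decoder GRU step, and the toy transition table exists only to make the example runnable; neither comes from the paper.

```python
import math

def beam_search(step_fn, go="GO", eos="EOS", beam_size=3, max_len=20):
    """Generic beam search over a next-word distribution.

    step_fn(prefix) returns a dict {word: probability} for the next word,
    given the list of words generated so far (starting with the GO token).
    """
    beams = [([go], 0.0)]  # (words so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, prob in step_fn(words).items():
                if prob > 0.0:
                    candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            # A hypothesis that emits EOS is complete and leaves the beam.
            (finished if words[-1] == eos else beams).append((words, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return [w for w in best[0] if w not in (go, eos)]

def toy_step(prefix):
    """Toy stand-in for the decoder, keyed on the last generated word."""
    table = {
        "GO": {"good": 0.6, "bad": 0.4},
        "good": {"movie": 0.9, "EOS": 0.1},
        "bad": {"movie": 0.5, "EOS": 0.5},
        "movie": {"EOS": 1.0},
    }
    return table[prefix[-1]]

sentence = beam_search(toy_step)
```

This sketch compares hypotheses by raw summed log-probability; practical decoders often add length normalization, which is omitted here.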

In this way, after $G$ is trained, we can in theory sample an unlimited number of latent codes $z$ from the latent space and generate an unlimited number of output texts from them. This is part of the superiority of our method.

3.3 Targeted Model

Since the TextCNN model performs well and is quite fast, it is one of the most widely used methods for text classification in industrial applications [34]. As we aim to attack models used in practice, we take the TextCNN model [13] as our targeted model.

Suppose we set the condition of the VAE to be $c$ and the decoder generates the output text $x'$. We then feed the text $x'$ into the targeted model $M$, which predicts a probability $p_M(k \mid x')$ for each candidate class $k$. We conduct a targeted attack and aim to cheat the targeted model into classifying $x'$ as class $c'$ ($c' \neq c$), which gives the following adversarial loss function:

$$\mathcal{L}_{adv} = -\log p_M(c' \mid x') \tag{8}$$

This is a cross-entropy loss that maximizes the probability of class $c'$.
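As a minimal illustration of a cross-entropy loss toward a chosen wrong class, the sketch below takes the targeted model's class probabilities as a plain list. The function name and the clamping constant are illustrative assumptions.

```python
import math

def adversarial_loss(class_probs, target_class):
    """Cross-entropy toward the attacker's target class.

    class_probs: probabilities predicted by the targeted model for the
    generated text, one per class.
    target_class: index of the wrong class the attacker wants predicted.
    Minimizing the loss raises the probability of that class.
    """
    p = class_probs[target_class]
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)

# The loss shrinks as the model grows more confident in the target class.
loss_far = adversarial_loss([0.9, 0.1], target_class=1)
loss_near = adversarial_loss([0.2, 0.8], target_class=1)
```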

Recall that the words in the adversarial text $x'$ are computed in Equation 6, in which the $\arg\max$ function is not differentiable. So we cannot directly feed the word index computed in Equation 6 into the targeted model. In this paper, we utilize the Gumbel-Softmax [12] to approximate the discrete word index with a continuous value. The embedding matrix fed to TextCNN is calculated as:

$$\tilde{y}_i = \mathrm{softmax}\big((\log y_i + g) / \tau\big) \tag{9}$$

$$e_i = \tilde{y}_i W_e \tag{10}$$

where $W_e$ is the whole vocabulary embedding matrix, $y_i$ is from Equation 4, $g$ is drawn from the $\mathrm{Gumbel}(0, 1)$ distribution [12] and $\tau$ is the temperature.
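The Gumbel-Softmax relaxation can be sketched in plain Python as follows. This uses the standard formulation of the technique (Gumbel noise added to log-probabilities, then a temperature-scaled softmax); the helper names and the list-of-rows embedding table are assumptions of the sketch rather than details from the paper.

```python
import math
import random

def sample_gumbel(rng=random):
    """Draw one Gumbel(0, 1) sample via inverse transform: -log(-log(u))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax(probs, tau, rng=random):
    """Relaxed one-hot sample: softmax((log p_j + g_j) / tau) over the vocabulary."""
    logits = [(math.log(max(p, 1e-12)) + sample_gumbel(rng)) / tau for p in probs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(soft_one_hot, embedding_table):
    """Weighted average of vocabulary embeddings (rows of embedding_table)."""
    dim = len(embedding_table[0])
    return [sum(w * row[d] for w, row in zip(soft_one_hot, embedding_table))
            for d in range(dim)]

# Relaxed sample over a 3-word vocabulary, then its soft embedding.
soft = gumbel_softmax([0.7, 0.2, 0.1], tau=0.5, rng=random.Random(0))
emb = soft_embedding(soft, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

A lower temperature pushes the relaxed vector closer to a one-hot vector, at the cost of higher-variance gradients.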

Algorithm 1 Text Adversarial Examples Generation

Input: Training data of different classes $X_1$, …, $X_C$

Output: Text adversarial examples

Train a VAE by minimizing the VAE loss on $X_1$, …, $X_C$ with the KL-annealing mechanism and word drop
Initialize $G$ with the pretrained VAE
Initialize the targeted model $M$ with a pretrained TextCNN
Freeze the weights of the targeted model $M$
repeat
     for each class $c$ do
         sample a batch of texts $x$ of class $c$ from $X_c$
         $G$ generates $x'$ with condition $c$
         compute the loss terms for this batch
     end for
     update the weights of $D_1$, $D_2$, …, $D_C$ by minimizing the discriminator loss
     update the weights of $G$ by minimizing the joint loss
until convergence
if inputs are given for the encoder then
     encode the inputs and decode the corresponding adversarial texts
else
     randomly sample $z$ and choose a class $c$
     the decoder takes $z$ and $c$ and generates the adversarial text from scratch
end if

3.4 Discriminator Model

Up to this point, we have ideally supposed that the generated $x'$ shares many words with $x$ of class $c$ (and is thus classified as $c$ by humans) while being classified as class $c'$ by the targeted model. But this assumption is not rigorous. Most of the time, $x'$ is not classified as $c$ by humans. In natural language texts, even a single word change may change the whole meaning of a sentence. A valid adversarial example must be imperceptible to humans. That is, humans must classify $x'$ as class $c$.

Suppose $P_c$ is the distribution of real data of class $c$ and $P'_c$ is the distribution of generated adversarial data transformed from $P_c$. We utilize the idea of the GAN framework to make $x'$ similar to data from $P_c$. Thus $x'$ will be classified as $c$ by humans and as $c'$ by the targeted model at the same time.

Specifically, we adopt one discriminator $D_c$ for each class $c$. $D_c$ aims to distinguish the distribution of real labeled data of class $c$ from that of adversarial data generated by $G$ with desired class $c$:


The overall training objective is a min-max game played between the generator $G$ and the discriminators $D_1$, $D_2$, …, $D_C$, where $C$ is the total number of classes:


$D_c$ tries to distinguish $P_c$ and $P'_c$, while $G$ tries to fool $D_c$ so that $x'$ is classified as real data by $D_c$. Trained in this adversarial way, the generated adversarial text distribution $P'_c$ is drawn close to the distribution $P_c$, which is of class $c$. Thus $x'$ is most likely to be similar to data from $P_c$ and is classified as $c$ by humans as a result.
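One common way to write the two sides of such a min-max game for a single class is sketched below. It assumes the usual binary cross-entropy GAN losses and the non-saturating generator variant; the exact form of the objective used in the paper is not recoverable from this text, so every name and formula here is an assumption.

```python
import math

def _log(p):
    """Logarithm with clamping so that scores of exactly 0 or 1 stay finite."""
    return math.log(min(max(p, 1e-12), 1.0))

def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy for one class's discriminator.

    real_scores: discriminator outputs in (0, 1) on real texts of the class.
    fake_scores: discriminator outputs on generated texts conditioned on it.
    """
    real_term = -sum(_log(s) for s in real_scores) / len(real_scores)
    fake_term = -sum(_log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_gan_loss(fake_scores):
    """Generator side: push the discriminator to score generated texts as real.

    This is the common non-saturating variant, an assumption of this sketch.
    """
    return -sum(_log(s) for s in fake_scores) / len(fake_scores)
```

With one discriminator per class, these losses would be computed per class and combined, for example by summing over classes.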

We implement the discriminators with multi-layer perceptrons (MLPs). Because the $\arg\max$ function is not differentiable, similar to Equations 9 and 10 in Section 3.3, we first use Gumbel-Softmax to transform the decoder output from Equation 4 into a fixed-size matrix. Then, $D_c$ calculates the probability of a text being true data of class $c$ as:


3.5 Model Training

Combining Equations 7, 8, 12, we obtain the joint loss function for model training:


We first train the VAE and the targeted model with the training data. Then we freeze the weights of the targeted model and initialize the weights of $G$ with the pretrained VAE’s weights. Finally, the generator $G$ and all the discriminators $D_1$, $D_2$, …, $D_C$ are trained in a min-max game with the joint loss. The whole training process is summarized in Algorithm 1.

4 Experiments

We report the performance of our method on attacking TextCNN on the sentiment analysis task, which is an important text classification task. Sentiment analysis is widely applied to help a business understand the social sentiment of its products or services by monitoring online user reviews and comments [23, 4, 21]. In several experiments, we evaluate the quality of the text adversarial examples for sentiment analysis generated by the proposed method.

Experiments are conducted from two aspects. Specifically, we first follow the popular settings and evaluate our model’s performance in transforming an existing input text into an adversarial one. We observe that our method has a higher attack success rate, generates fluent texts and is efficient. Besides, we also evaluate our method on generating adversarial texts from scratch unrestrictedly. Experimental results show that we can generate diverse examples on a large scale. The generated adversarial texts are mostly valid, and can be utilized to substantially improve the robustness of text classification models.

We further report ablation studies, which verify the effectiveness of the discriminators. Defense experiment results demonstrate that generating adversarial examples on a large scale can help make models more robust.

4.1 Experiment Setup and Details

Experiments are conducted on two popular public benchmark datasets. They are both widely used in sentiment analysis [32, 19, 8] and adversarial example generation [15, 29, 30].

Rotten Tomatoes Movie Reviews (RT) [22]. This dataset consists of positive and negative processed movie reviews. We divide the dataset into a training set, a development set and a test set.

IMDB [17]. This dataset contains 50,000 movie reviews from online movie websites. It consists of positive and negative paragraphs. 25,000 samples are for training and 25,000 are for testing. We hold out part of the training set as a validation set, as in [15].

4.2 Comparing With Pair-wise Methods

In most of the existing work [26, 18, 1], text adversarial examples are generated in a pair-wise way. That is, one first takes a text example and then transforms it into an adversarial instance.

To compare fairly with the current methods, we limit our method to pair-wise generation. In this experiment, $\lambda$ is held fixed. Specifically, we first feed an input text into the GRU encoder and set the condition $c$ as the ground-truth class of the text. After that, the decoder decodes to produce the adversarial output text.

We choose four representative methods as baselines:

  • Random: Select words randomly and modify them.

  • Fast Gradient Sign Method (FGSM) [10]: First, the perturbation is computed as $\mathrm{sign}(\nabla_x J)$, where $J$ is the loss function and $x$ is the word vectors. Then, the word embedding table is searched to find the word vector nearest to the perturbed word vector. FGSM is the fastest among gradient-based replacement methods.

  • DeepFool [20]: This is also a gradient-based replacement method. It aims to find out the best direction, towards which it takes the shortest distance to cross the decision boundary. The perturbation is also applied to the word vectors. After that, nearest neighbor search is used to generate adversarial texts.

  • TextBugger [15]: TextBugger is a gradient-free replacement method. It proposes strategies such as changing the word’s spelling and replacing a word with its synonym, to change a word slightly to create adversarial texts. Gradients are only computed to find the most important words to change.
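The gradient-based replacement idea behind the FGSM baseline (perturb a word vector along the sign of the gradient, then map it back to a real word by nearest neighbor search) can be sketched as follows. The step size `epsilon`, the Euclidean distance, the toy embedding table and the made-up gradient are all assumptions of this sketch.

```python
import math

def fgsm_perturb(word_vec, grad, epsilon):
    """Move a word vector by epsilon along sign(gradient) in each dimension."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [v + epsilon * sign(g) for v, g in zip(word_vec, grad)]

def nearest_word(vec, embedding_table):
    """Return the word whose embedding is closest (Euclidean) to vec."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(embedding_table, key=lambda w: dist(vec, embedding_table[w]))

# Toy 2-d embedding table and a made-up loss gradient for the word "fine".
table = {"awful": [-1.0, 0.0], "fine": [0.0, 0.0], "great": [1.0, 0.2]}
perturbed = fgsm_perturb(table["fine"], grad=[0.3, 0.1], epsilon=0.8)
replacement = nearest_word(perturbed, table)
```

In this toy setup the perturbed vector lands closest to "great", so that word would replace "fine" in the adversarial text.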

Attack Success Rate. Following the existing literature [10, 20, 15], we evaluate the attack success rate of our method and four baseline methods.

Table 1: Attack success rate of transforming given texts in a pair-wise way (columns: Method, RT, IMDB).

Figure 4: Adversarial texts generated in a pair-wise way. In texts, the crossed out contents are from the original texts, while the red texts are the substitute contents in the adversarial examples.

Figure 5: (a) Attack success rate, (b) perplexity and (c) validity rate of unrestricted adversarial text generation from scratch. Adversarial texts are generated from scratch by randomly sampling $z$ with different $\lambda$. Note that when $\lambda = 0$, the model is a vanilla VAE.

We summarize the performance of our method and all baselines in Table 1. From Table 1, we can observe that randomly changing words is not enough to fool the classifier, which implies the difficulty of the attack. TextBugger and our method both achieve quite high attack success rates, and our method performs even better than TextBugger, which is the state-of-the-art method.

We show some adversarial examples generated by our method and TextBugger to demonstrate the differences in Figure 4.

We can observe that TextBugger mainly changes the spelling of words. The generated text becomes less fluent and is easy to detect with grammar checking systems. Also, though humans may guess the original meanings, the changed words are treated as out-of-vocabulary words by models. For example, TextBugger changes the spelling of ‘awful’, ‘cliches’ and ‘foolish’ in Figure 4. These words are important negative sentiment words for a negative sentence. It is natural that changing these words to unknown words can change the prediction of models. Unlike TextBugger, our method generates meaningful and fluent content. For example, in the first example of Figure 4, we replace ‘read the novel’ with ‘love the book’; the substitution is still fluent and makes sense to both humans and models.

Generation Speed. It takes about one hour and about 3 hours to train our model on RT dataset and IMDB dataset respectively. We also evaluate the time cost of generating one adversarial example. We take the FGSM method as the representative of gradient-based methods, as FGSM is the fastest among them. We measure the time cost of generating adversarial examples and calculate the average time of generating one. Results are shown in Table 2.

Method    FGSM+NNS    TextBugger    Ours
Time      0.7s        0.05s         0.014s
Table 2: Time cost of generating one adversarial text.

We can observe that our method is much faster than others. That is mainly because our generative model is trained beforehand. After the model is trained, the generation of one batch just requires one feed-forward.

4.3 Unrestricted Adversarial Text Generation

Figure 6: Adversarial examples generated from scratch unrestrictedly. Humans should classify adversarial texts as the chosen emotional class $c$.

As mentioned in Section 3.2, after our model is trained, we can randomly sample $z$ from the latent space, choose a desired class $c$, get the embedding vector of $c$, then feed them to the decoder to generate adversarial texts unrestrictedly, with no need for labeled text.

Attack Success Rate. When training, we can tune $\lambda$ in Equation 14 to affect the model. After training with different $\lambda$, we observe that the generated texts are different. We randomly generate 50,000 examples and compute the proportion of adversarial examples for each $\lambda$. The results are shown in Figure 5(a). Notice that if we set $\lambda = 0$, the model is a vanilla VAE and is not trained further after pretraining.
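The proportion of adversarial examples among generated texts can be computed with a sketch like the one below. It assumes a generated text counts as adversarial when the targeted model's prediction differs from the class the text was conditioned on; whether humans agree with the chosen class is measured separately as validity. The function name and label strings are illustrative.

```python
def attack_success_rate(chosen_classes, model_predictions):
    """Fraction of generated texts whose model prediction differs from the chosen class."""
    if len(chosen_classes) != len(model_predictions):
        raise ValueError("inputs must have the same length")
    if not chosen_classes:
        return 0.0
    fooled = sum(1 for c, p in zip(chosen_classes, model_predictions) if c != p)
    return fooled / len(chosen_classes)

# Four texts generated with the chosen class "neg"; three fool the model.
rate = attack_success_rate(["neg"] * 4, ["pos", "neg", "pos", "pos"])
```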

From Figure 5(a), we can observe that the attack success rate of the vanilla VAE is low on both datasets, which implies that randomly generating texts can hardly fool the targeted model. Once $\lambda$ is large enough, the attack success rate is consistently better than that of the vanilla VAE. This reflects the importance of the adversarial loss.

Also, the attack success rate increases as $\lambda$ becomes larger. This is because the larger $\lambda$ is, the more important a role the adversarial loss plays in the final joint loss. So, the text generator is more easily guided by the adversarial loss to generate an adversarial example.

To evaluate the quality of the adversarial texts generated with different $\lambda$, we adopt three metrics: perplexity, validity and diversity.

Figure 7: Defense with adversarial training in different settings. (a) RT and (b) IMDB: data augmentation with adversarial data generated from scratch under different $\lambda$. (c) Data augmentation comparison on the RT dataset: accuracy of models trained with equal-sized adversarial augmentation data generated in the pair-wise way and in the unrestricted way, respectively.

Perplexity. Perplexity [3] is a measurement of how well a probability model predicts a sample. A low perplexity indicates the language model is good at predicting the sample. Given a pretrained language model, perplexity can also be used to evaluate the quality of texts: a low perplexity indicates the text is more fluent for the language model. We compute perplexity as:

$$\mathrm{PPL}(x) = \exp\Big(-\frac{1}{n} \sum_{i=1}^{n} \log p(w_i)\Big) \tag{15}$$

where $n$ is the number of words in one sentence and $p(w_i)$ is the probability of the $i$-th word in $x$ computed by the language model.
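As a small worked example of sentence-level perplexity, the sketch below uses the standard definition (the exponential of the average negative log-probability per word). The per-word probabilities would come from the pretrained language model; here they are passed in as a plain list, which is an assumption of the sketch.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability of the words in a sentence."""
    if not word_probs:
        raise ValueError("sentence must contain at least one word")
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# If every word has probability 1/4, the perplexity is 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Intuitively, a perplexity of 4 means the language model is as uncertain as if it were choosing uniformly among four words at each position.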

We train a language model with the training data of IMDB and RT, and use it as the language model $p$ in Equation 15. We measure and compare the perplexity of the 50,000 generated texts and of the original training set. Results are shown in Figure 5(b). We can observe that the perplexity is only a bit higher than that of the original data, which means that the quality of the generated texts is acceptable. Also, as $\lambda$ gets larger, the perplexity gets larger. This is perhaps because the adversarial loss can distort the generated texts.

Validity. If we feed $c$ to the decoder, then a valid generated adversarial text is supposed to be classified as class $c$ by humans but as class $c'$ ($c' \neq c$) by the targeted model. We randomly select 100 generated texts for each $\lambda$ and manually evaluate their validity. The results are shown in Figure 5(c).

From Figure 5(c), we can observe that the validity rates of our method on both datasets are much higher than that of the vanilla VAE. This implies that our method can generate high-quality and high-validity texts with a high attack success rate.

Diversity. We first generate one million adversarial texts. To compare the generated texts with the training data, we extract all 4-grams of the training data and of the generated texts. On average, for each generated text, only a fraction of its 4-grams can be found among the 4-grams of the training data on all datasets. This shows that there is some similarity to the training data, but our model can also generate texts with different word combinations. To compare the generated texts with each other, we regard a generated text as unique if more than a threshold fraction of its 4-grams do not appear together in any single other generated text. We observe that the large majority of generated texts are unique. This shows that the generated texts are diverse.
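The two 4-gram checks described above can be sketched as follows. The `threshold` parameter is hypothetical, since the value used in the paper is not available in this text, and the pairwise comparison against every other text is written for clarity rather than for one million texts.

```python
def ngrams(tokens, n=4):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_with_training(generated, training_texts, n=4):
    """Fraction of the generated text's n-grams that also occur in the training data."""
    train = set()
    for text in training_texts:
        train |= ngrams(text, n)
    grams = ngrams(generated, n)
    return len(grams & train) / len(grams) if grams else 0.0

def is_unique(text, others, threshold, n=4):
    """True if, for every other text, more than `threshold` of this text's
    n-grams are missing from that other text."""
    grams = ngrams(text, n)
    if not grams:
        return False
    for other in others:
        missing = len(grams - ngrams(other, n)) / len(grams)
        if missing <= threshold:
            return False
    return True

a = "this movie is really very good indeed".split()
b = "this movie is really very bad indeed".split()
shared = overlap_with_training(a, [b])
```

Here `a` and `b` share two of their four 4-grams, so `a` counts as unique relative to `b` only for thresholds below one half.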

Adversarial Examples. We show some valid adversarial examples generated by our method in Figure 6. We can see that the adversarial examples generated by the vanilla VAE are more likely to be neutral, and the confidence of the targeted model on them is not high. In contrast, the targeted model assigns high confidence to the examples generated by our method. This shows that the adversarial loss is important to the attack success rate. Besides, the fluency and validity of the texts generated by our method are acceptable.

4.4 Ablation Study

In this section, we report an ablation study that further demonstrates the effectiveness of the discriminators.

We first remove the discriminators and train our model. We then compare it with the model trained with the discriminators in a min-max game. We evaluate their attack success rate, perplexity and validity. Results are shown in Table 3.

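| Dataset | Method | Attack Success | Perplexity | Validity |
|---|---|---|---|---|
| RT | with discriminators | | 2.79 | |
| RT | without discriminators | | 7.32 | |
| IMDB | with discriminators | | 2.88 | |
| IMDB | without discriminators | | 7.41 | |

Table 3: Performance of our model trained with and without discriminators.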

The attack success rates of the models trained with and without the discriminators are close. However, the validity of the model trained without the discriminators is much lower than that of the model trained with them. The reason is as follows. When the generator is trained with only the VAE loss and the adversarial loss, suppose we want to generate positive adversarial texts that the targeted model classifies as negative. The easiest way to achieve this goal is to change a few words in the generated text to negative words, such as "bad". But texts generated this way cannot fool humans. If we add discriminators to draw the distribution of adversarial texts close to the distribution of real data, this phenomenon can be controlled. This shows that the discriminators and the min-max game can improve validity greatly.

4.5 Defense With Adversarial Training

Using adversarial examples to augment the training data can make models more robust; this is called adversarial training.

On the RT dataset, we randomly generate 4k adversarial texts to augment the training data and 1k to test the model. On the IMDB dataset, we randomly generate 10k adversarial texts, of which 8k are used for training and 2k for testing. Results are shown in Figure 7(a) and Figure 7(b).
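
The augmentation setup can be sketched as a simple split of the generated texts into a part that is added to the training data and a held-out adversarial test set. The function name, the `(text, label)` pair format, and the fixed seed are assumptions for this example; each adversarial text is assumed to be labeled with the class it was generated for.

```python
import random


def augment_with_adversarial(train_set, adversarial, n_train, n_test, seed=0):
    """Split generated adversarial examples and augment the training data.

    train_set and adversarial are hypothetical lists of (text, label)
    pairs. Returns the augmented training set and the held-out
    adversarial test set.
    """
    if n_train + n_test > len(adversarial):
        raise ValueError("not enough adversarial examples for the requested split")
    shuffled = list(adversarial)
    random.Random(seed).shuffle(shuffled)
    adv_train = shuffled[:n_train]
    adv_test = shuffled[n_train:n_train + n_test]
    return train_set + adv_train, adv_test


# Toy data mirroring the 8k/2k ratio used on IMDB, scaled down to 8/2.
original = [(f"real text {i}", i % 2) for i in range(20)]
generated = [(f"adversarial text {i}", i % 2) for i in range(10)]
augmented, adv_test = augment_with_adversarial(original, generated, n_train=8, n_test=2)
```

The classifier would then be retrained on `augmented` and evaluated on both the original test data and `adv_test`.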

With adversarial data augmentation, test accuracy on the original test data remains stable. Also, the accuracy on the adversarial data is improved greatly (from to ). This implies that adversarial training can make models more robust without hurting their effectiveness.

Then, on the RT dataset, we augment the training data with adversarial examples generated by pair-wise generation. These adversarial examples are generated by transforming the training data. Note that we have 8k training texts in the RT dataset. When we set bigger , the attack success rate is higher, so we can generate more adversarial examples in the pair-wise way. But with any , unrestricted generation from scratch can produce an unlimited amount of adversarial data. We compare the adversarial data augmentation performance of pair-wise generation and unrestricted generation from scratch. We use the same number of adversarial examples generated by the two modes, and hold out of generated data for testing. Results are shown in Figure 7(c).

We can see that with pair-wise generation, if training data is limited, we need to generate more adversarial examples to improve the adversarial test accuracy. Higher adversarial test accuracy requires higher . But higher results in bigger perplexity, which means low text quality. In contrast, with unrestricted generation from scratch, we can generate an unlimited number of adversarial texts using very small , with high fluency and similar adversarial test accuracy. Thus, at similar adversarial test accuracy, the text fluency of pair-wise generation is worse than that of unrestricted generation from scratch. This indicates the advantage of our proposed method.

5 Conclusion

In this paper, we have proposed a scalable method to generate adversarial texts from scratch to attack a text classification model. We add an adversarial loss to force the generated text to mislead the targeted model. Besides, we use discriminators and a GAN-like training strategy to make adversarial texts mimic real data of the desired class. After the generator is trained, it can generate diverse adversarial examples of a desired class on a large scale without real-world texts. Experiments show that the proposed method is scalable and can achieve a higher attack success rate at a higher speed than recent methods. In addition, the generated texts are shown to be of good quality and mostly valid. We further conduct ablation experiments to verify the effect of the discriminators. Data augmentation experiments indicate that our method generates more diverse adversarial texts of higher quality than pair-wise generation, which can make the targeted model more robust.