
Universal Adversarial Attacks with Natural Triggers for Text Classification

Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to any input instance. Despite being highly successful, the word sequences produced in these attacks are often unnatural, do not carry much semantic meaning, and can be easily distinguished from natural text. In this paper, we develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. To achieve this, we leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search method to output natural text that fools a target classifier. Experiments on two different classification tasks demonstrate the effectiveness of our attacks while also being less identifiable than previous approaches on three simple detection metrics.




1 Introduction

In recent times, adversarial attacks have demonstrated significant success in degrading the performance of modern deep learning methods Szegedy et al. (2014). An adversary can lightly perturb inputs so that they appear unchanged to humans but induce incorrect predictions from neural networks. Adversarial attacks started out in computer vision Goodfellow et al. (2015); Carlini and Wagner (2017) and have recently been explored in several natural language processing (NLP) domains Jia and Liang (2017); Ebrahimi et al. (2018); Alzantot et al. (2018); Zhao et al. (2018b).

Universal adversarial attacks are a special sub-class of methods where the same attack perturbation can be applied to any input to the target classifier. These attacks, being input-agnostic, point to more serious shortcomings in trained models and do not require regeneration for each input. While Moosavi-Dezfooli et al. (2017) designed some of the first universal attacks for image classification, Wallace et al. (2019) and Behjati et al. (2019) have recently demonstrated successful universal adversarial attacks on NLP models. However, one limitation of their methods is that the generated attack sequences are often meaningless, irregular text (e.g., “zoning tapping fiennes” from Wallace et al. (2019)). While human readers can easily identify such triggers as unnatural, we also find that simple heuristic methods are sufficient to spot these attacks. For instance, the above attack trigger for sentiment analysis has a far lower average word frequency than benign inputs in the Stanford Sentiment Treebank (SST) Socher et al. (2013).

In this paper, we focus on designing natural attack triggers by utilizing a generative model of text. In particular, we use an adversarially regularized autoencoder (ARAE) Zhao et al. (2018a), which consists of an autoencoder and a generative adversarial network (GAN) and generates natural text from input noise vectors. This enables us to develop a gradient-based search over the noise vector space for triggers with good attack performance. Our method, which we call Natural Universal Trigger Search (NUTS), uses projected gradient descent with L2 regularization to avoid out-of-distribution noise vectors and maintain the naturalness of the generated text.

We demonstrate the success of our attacks on two different classification tasks – sentiment analysis and natural language inference (NLI). For instance, the phrase “she might not”, generated by our approach, brings the accuracy of a classifier trained on the Stanford NLI corpus Bowman et al. (2015) down to a mere 1% on entailment decisions. Furthermore, we show that our attack text appears more natural than that of prior approaches according to three different measures – average word frequency, loss under the GPT-2 language model Radford et al. (2019), and errors identified by two online grammar checking tools scr; che. For example, on attacks for the sentiment analysis task, the Scribens grammar checker reports 15.63% errors per word for Wallace et al. (2019) and only 9.38% errors per word for our approach.

2 Related Work

Input-dependent attacks

These attacks generate specific triggers for each input to a classifier. Jia and Liang (2017) fool reading comprehension systems by adding a single distractor sentence to the input paragraph. Ebrahimi et al. (2018) replace words of benign texts with tokens whose embeddings are close to those of the original words, using the direction of the gradient of the model’s loss to cause a drop in performance. Similarly, Alzantot et al. (2018) propose a word-replacement attack method based on genetic algorithms. Zhao et al. (2018b) add adversarial perturbations to the latent embeddings of the original text and leverage a text generation model to construct the final attack text.

Universal attacks

Universal adversarial attacks are input-agnostic and hence, word-replacing and embedding-perturbing approaches are not applicable. Wallace et al. (2019) and Behjati et al. (2019) concurrently proposed to generate sequences of words that can be added to any input text and fool a target NLP model. Both papers perform gradient-guided searches over the space of word embeddings to choose optimal attack triggers. However, in both cases, the generated attack word sequences are often meaningless, and can be easily detected by a semantic checking process (as we show in Section 4). In contrast, our goal is to generate attack triggers that appear like natural phrases and retain semantic meaning.

GANs for generating attacks

Generative adversarial networks (GANs) have also been explored for adversarial attacks, particularly in computer vision. Xiao et al. (2018) train a generator to adversarially perturb input images to fool both the target image classifier and the discriminator. Poursaeed et al. (2018) consider both input-dependent and universal attacks in the image domain by training a generator to fool the target model without using the discriminator. Song et al. (2018) use a standard Auxiliary Classifier GAN to synthesize images, which have different predictions between the auxiliary classifier and the target classifier. Zhao et al. (2018b) leverage a GAN to manipulate an input’s latent vector and separately train an inverter to reconstruct an adversarial example from that perturbed latent vector. We use a GAN-based model (ARAE) for generating attacks for models operating over text.

3 Universal Adversarial Attacks with Natural Triggers

We build upon the universal adversarial attacks proposed by Wallace et al. (2019). To enable natural attack triggers, we use a generative model which produces text using a continuous vector input, and perform a gradient-guided search over this input space. The resulting trigger, which is added to benign text inputs, is optimized so as to maximally increase the loss under the target classification model.

Problem Formulation

Consider a pre-trained text classifier $f$ to be attacked. We are given a set of benign input sequences $\{x_i\}$ that share the same ground-truth label $y$, and the classifier has been trained to predict $f(x_i) = y$. Our goal is to find a single input-agnostic trigger $t$ that, when concatenated with any benign input (following Wallace et al. (2019), we add the trigger in front of the benign text), causes $f$ to perform an incorrect classification, i.e., $f(t \oplus x_i) \neq y$, where $\oplus$ represents concatenation. In addition, we need to ensure that the trigger $t$ is natural, fluent text.
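As a concrete illustration of this formulation, a candidate trigger can be evaluated by prepending it to every benign input and measuring how often the classifier's prediction moves away from the ground-truth label. The sketch below uses a toy stand-in classifier; all names here are illustrative, not from the paper:

```python
def attack_success_rate(classifier, trigger, inputs, label):
    """Fraction of benign inputs that the trigger flips away from `label`.

    `classifier` maps a text string to a predicted label; `trigger` is
    prepended to each input, following the paper's concatenation scheme.
    """
    flipped = sum(1 for x in inputs if classifier(trigger + " " + x) != label)
    return flipped / len(inputs)

# Toy stand-in classifier: predicts "positive" unless "not" appears.
toy = lambda text: "negative" if "not" in text.split() else "positive"
benign = ["a great movie", "loved every minute", "wonderful acting"]
print(attack_success_rate(toy, "she might not", benign, "positive"))  # 1.0
```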

Figure 1: Overview of our gradient-based attack using an ARAE model. Based on the gradient of the target model’s loss function, we iteratively update the noise vector $z$ with small perturbations to obtain successful and natural attack triggers.

Attack trigger generation

To ensure the trigger is natural, fluent and carries semantic meaning, we use a pre-trained adversarially regularized autoencoder (ARAE) Zhao et al. (2018a) (details in Section 4). The ARAE consists of an encoder–decoder structure and a generative adversarial network (GAN) Goodfellow et al. (2014). The input to the ARAE is a standard Gaussian noise vector $z$, which is first mapped to a latent code $c$ by the generator $G$. The ARAE decoder then uses $c$ to generate a sequence of words – in our case, the trigger $t$. This trigger is concatenated with a set of benign texts $\{x_i\}$ to obtain the full attack texts $\{\tilde{x}_i\}$. The overall process can be formulated as follows:

$$c = G(z), \qquad t = \mathrm{Dec}(c), \qquad \tilde{x}_i = t \oplus x_i. \tag{1}$$
We then pass each $\tilde{x}_i$ into the target classifier and compute the gradient of the classifier’s loss with respect to the noise vector, $\nabla_z \mathcal{L}$. Backpropagating through the decoder is not straightforward since it produces discrete symbols. Hence, we use a reparameterization trick similar to the one in Gumbel softmax Jang et al. (2017) to sample words from the output vocabulary of the ARAE model as a one-hot encoding of triggers, while still allowing gradient backpropagation. Figure 1 provides an overview of our attack algorithm, which we call Natural Universal Trigger Search (NUTS).
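The Gumbel-softmax relaxation at the heart of this backpropagation step can be illustrated in isolation. The NumPy sketch below (not the actual ARAE decoder; the function name is ours) adds Gumbel(0, 1) noise to the word logits and applies a temperature-scaled softmax; as the temperature `tau` decreases, the sample approaches a hard one-hot word choice while remaining differentiable:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a relaxed one-hot sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax; as tau -> 0 the sample approaches a hard one-hot vector,
    which is what lets gradients flow through the discrete word choice.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())  # numerically stable softmax
    return y / y.sum()

probs = gumbel_softmax_sample(np.array([2.0, 0.5, 0.1]), tau=0.5)
# A valid probability vector that sharpens toward one-hot as tau shrinks.
```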

Ensuring natural triggers

In the ARAE model, the original noise vector $z_0$ is sampled from a standard multivariate Gaussian distribution. While we can change this noise vector to produce different outputs, a simple gradient search may veer significantly off-course and lead to bad generations. To prevent this, following prior adversarial attack literature Goodfellow et al. (2015); Carlini and Wagner (2017), we use projected gradient descent with a norm constraint to ensure the noise vector always stays within a limited ball around the original noise $z_0$. We iteratively update $z$ as:

$$z^{(k+1)} = \Pi_{\epsilon}\!\left(z^{(k)} + \alpha \nabla_z \mathcal{L}\right), \tag{2}$$

where $\Pi_{\epsilon}$ represents the projection operator with the norm constraint $\|z - z_0\|_2 \leq \epsilon$ and $\alpha$ is the step size. We try different settings of the number of attack steps, $\alpha$, and $\epsilon$, selecting values based on the quality of the output triggers. In our experiments, we use 1000 attack steps with fixed choices of $\alpha$ and $\epsilon$.
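The projected gradient step described above can be sketched as follows; this is a minimal NumPy illustration with an L2-ball projection, not the paper's exact implementation, and the function names are ours:

```python
import numpy as np

def project_l2(z, z0, eps):
    """Project z back onto the L2 ball of radius eps centered at z0."""
    delta = z - z0
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta *= eps / norm
    return z0 + delta

def pgd_step(z, z0, grad, alpha, eps):
    """One ascent step on the classifier loss, followed by projection.

    The attack *increases* the target model's loss, so we move along the
    gradient; the projection keeps z close to the original sample z0,
    which preserves the naturalness of the generated trigger.
    """
    return project_l2(z + alpha * grad, z0, eps)

z0 = np.zeros(4)
z = pgd_step(z0.copy(), z0, grad=np.ones(4), alpha=10.0, eps=0.5)
# Even with a large step, z stays within the eps-ball around z0.
```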

Final trigger selection

Since the generation process is not deterministic, we initialize multiple independent noise vectors and perform our updates (2) to obtain many candidate triggers. We then re-rank the triggers to balance the target classifier accuracy $\mathrm{acc}$ (lower is better) and naturalness, measured as the average per-token cross-entropy under GPT-2, $\ell_{\mathrm{LM}}$ (lower is better). In our experiments, we select the trigger with the minimum overall performance score, defined as $\mathrm{acc} + \lambda\,\ell_{\mathrm{LM}}$, as the output of the search algorithm. We select $\lambda$ to balance the difference in scales of $\mathrm{acc}$ and $\ell_{\mathrm{LM}}$.
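The re-ranking step admits a simple sketch. Assuming each candidate carries its classifier accuracy and GPT-2 loss, selection is a single argmin; the λ value and the tuple layout below are illustrative, not the paper's:

```python
def select_trigger(candidates, lam):
    """Pick the candidate minimizing accuracy + lam * LM loss.

    `candidates` is a list of (trigger, classifier_accuracy, lm_loss)
    tuples; lam trades off attack success (low accuracy) against
    naturalness (low language-model loss).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])

cands = [
    ("she might not", 0.0102, 3.1),   # natural, slightly weaker attack
    ("word salad xyz", 0.0001, 9.8),  # stronger attack, unnatural
]
best = select_trigger(cands, lam=0.1)
print(best[0])  # "she might not": naturalness outweighs the tiny accuracy gap
```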

4 Experiments

| Task | Trigger length | Test data | NUTS (ours) trigger | NUTS classifier accuracy | Baseline trigger Wallace et al. (2019) | Baseline classifier accuracy |
|---|---|---|---|---|---|---|
| SST | No trigger | + | - | 88.29% | - | 88.29% |
| | | - | - | 82.94% | - | 82.94% |
| | 3 | + | why none of | 45.27% | drown soggy timeout | 20.27% |
| | | - | natural energy efficiency | 23.60% | vividly riveting soar | 10.51% |
| | 5 | + | a flat explosion empty over | 26.35% | drown soggy mixes soggy timeout | 12.38% |
| | | - | they can deeply restore our | 18.46% | captures stamina lifetime without prevents | 6.30% |
| | 8 | + | the accident forced the empty windows shut down | 27.25% | collapses soggy timeout energy energy freshness intellect genitals | 17.79% |
| | | - | will deliver a deeply affected children from parents | 10.05% | sunny vitality blessed lifetime lifetime counterparts without pitfalls | 1.87% |
| SNLI | No trigger | + | - | 90.95% | - | 90.95% |
| | | 0 | - | 88.06% | - | 88.06% |
| | | - | - | 79.53% | - | 79.53% |
| | 3 | + | she might not | 1.02% | alien spacecraft naked | 0.00% |
| | | 0 | there is no | 2.53% | spaceship cats zombies | 0.00% |
| | | - | he could leave | 54.58% | humans possesses energies | 46.55% |
| | 5 | + | the new state won the | 0.00% | alien spacecraft nothing eat no | 0.00% |
| | | 0 | there is no one or | 1.82% | cats running indoors destroy no | 0.00% |
| | | - | he is hoping to assess | 39.93% | mammals tall beings interact near | 13.24% |
| | 8 | + | i read the crowd about the police after | 0.00% | mall destruction alien whatsoever shark pasture picnic no | 0.00% |
| | | 0 | the man drowned in hospital and died in | 3.74% | cats rounds murder pandas in alien spacecraft mars | 0.00% |
| | | - | he is seen after training trips to help | 26.08% | human humans initiate accomplishment energies near objects near | 22.76% |
Table 1: Universal attack results on both the Stanford Sentiment Treebank (SST) classifier and the Stanford Natural Language Inference (SNLI) classifier. For SST, “+” and “-” represent test sentences with positive sentiment and negative sentiment, respectively. For SNLI, “+” , “0”, and “-” represent test sentence pairs with entailment, neutral, and contradiction relations, respectively. We report the model accuracy after adding the attack triggers to benign test data (a lower accuracy means a more successful attack). ‘No trigger’ refers to classifier accuracy without any attack. Compared to the baseline Wallace et al. (2019), our attack triggers are slightly less successful at reducing classifier accuracy but generate more natural triggers.

We demonstrate our attack algorithm on two different tasks – sentiment analysis and natural language inference. We employ the model of Wallace et al. (2019) as a baseline and use the same datasets and target classifiers for fair comparison. For the text generator, we use an ARAE model pre-trained on the 1 Billion Word dataset Chelba et al. (2014). For both our attack (NUTS) and the baseline, we limit the vocabulary of attack trigger words to the overlap of the classifier and ARAE vocabularies.

Defense metrics

We also employ three simple defense metrics to measure the naturalness of attacks:

  1. Word frequency:

    The average frequency of words in the attack trigger, computed using empirical estimates from the training set of the target classifier.

  2. Language model loss: The average per-token cross-entropy loss under a pre-trained language model – GPT-2 Radford et al. (2019).

  3. Automatic grammar checkers: We calculate the average number of errors in the attack sequences using two online grammar checkers – Scribens scr and Chegg Writing che .
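The first metric is straightforward to compute from empirical corpus counts. A minimal sketch (the helper name and toy corpus are ours, not from the paper):

```python
from collections import Counter

def avg_word_frequency(trigger, corpus):
    """Average empirical frequency of the trigger's words in a corpus.

    Unnatural triggers built from rare words score far lower than
    benign text, which is what makes this simple check effective.
    """
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    words = trigger.split()
    return sum(counts[w] / total for w in words) / len(words)

corpus = ["the movie was the best", "the plot was thin"]
# Common words ("the was") score higher than rare ones ("thin best").
```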

4.1 Sentiment Analysis


The target classifier we consider uses word2vec embeddings Mikolov et al. (2013) and consists of a 2-layer long short-term memory (LSTM) network Hochreiter and Schmidhuber (1997), followed by a linear layer for sentiment prediction. The model is trained on the binary Stanford Sentiment Treebank (SST) Socher et al. (2013) using AllenNLP Gardner et al. (2018). After training, the classifier has an accuracy of 88.29% on positive-sentiment and 82.94% on negative-sentiment test data. To avoid generating sentiment words in the attack trigger, which would directly change the true sentiment of the instance, we exclude a list of sentiment words from the trigger vocabulary, following Wallace et al. (2019).


The top half of Table 1 presents the results of both our attack and the baseline Wallace et al. (2019) with varied trigger lengths. For both positive and negative instances, our method reduces the target classifier’s accuracy significantly, down to 10.05% in the best attack case. Although this is still higher than the accuracy under Wallace et al. (2019)’s attack, we observe that our triggers are much more natural, fluent, and readable, while the baseline generates rare words (e.g., “stamina”) and unnatural phrases (e.g., “drown soggy timeout”).

This is quantitatively portrayed in Figure 2, which shows the difference in statistics between benign text and each attack according to the metrics of word frequency and GPT-2 loss. We see that our generated attacks are much closer in these statistics to the original text inputs than Wallace et al. (2019). Further, as shown in Table 2, these two grammar checkers scr ; che report 9.38% and 21.88% errors per word on our attack triggers, compared to 15.63% and 28.13% for Wallace et al. (2019).

Figure 2: Difference in (a) average word frequency (normalized) and (b) average GPT-2 loss between benign text and different attack triggers (length 5) for SST and SNLI. All differences are computed as the benign-text statistic minus the attack-trigger statistic. For SNLI, we observe that our generated attacks have lower GPT-2 loss values than even the original text, leading to a positive delta.
| Task | Scribens (Ours) | Scribens (Baseline) | Chegg Writing (Ours) | Chegg Writing (Baseline) |
|---|---|---|---|---|
| SST | 9.38% | 15.63% | 21.88% | 28.13% |
| SNLI | 2.08% | 4.17% | 12.50% | 20.83% |
Table 2: Percentage of grammatical errors in the triggers produced by our model (NUTS) and the baseline Wallace et al. (2019) according to two online grammar checkers – Scribens scr and Chegg Writing che .

4.2 Natural Language Inference


For this task, we use the Stanford Natural Language Inference (SNLI) dataset Bowman et al. (2015) and the Enhanced Sequential Inference Model (ESIM) Chen et al. (2017) with GloVe embeddings Pennington et al. (2014) as the classifier. The classifier achieves accuracies of 90.95%, 88.06%, and 79.53% for the “entailment”, “neutral”, and “contradiction” categories, respectively. We attack this SNLI classifier by adding the attack trigger to the beginning of the hypothesis.


From Table 1, we see that our attack performs similarly to the baseline on entailment and neutral examples. In fact, both attacks successfully decrease the model’s accuracy to almost 0% on both entailment and neutral examples for all trigger lengths. On contradiction examples, our best attack brings the model accuracy down to 26.08%, while the baseline brings it down to 13.24%. Although our attacks are less successful on contradiction, they are much more natural than the baseline. In Figure 2, our attacks are closer to the word frequency of benign inputs and even achieve a lower GPT-2 loss than the original text. Further, as shown in Table 2, the two grammar checkers scr; che report 2.08% and 12.50% errors per word on our attacks, compared to 4.17% and 20.83% for the baseline.

5 Conclusion

In this paper, we develop universal adversarial attacks with natural triggers for text classification models. We leverage the ARAE text generation model and propose a gradient-based approach to search over attack triggers which are fluent and semantically plausible. Experimental results on two different classification tasks validate our approach and show that our model can generate attack triggers that are both successful and natural. Future work can explore better ways to optimally balance attack success and trigger quality, while also investigating more sophisticated ways of detecting attacks.


  • (1) Chegg Writing: Check Your Paper for Free.
  • (2) Scribens: Free Online Grammar Checker.
  • Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Behjati et al. (2019) Melika Behjati, Seyed-Mohsen Moosavi-Dezfooli, Mahdieh Soleymani Baghshah, and Pascal Frossard. 2019. Universal adversarial attacks on text classifiers. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7345–7349. IEEE.
  • Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57.
  • Chelba et al. (2014) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
  • Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced lstm for natural language inference. In ACL (1), pages 1657–1668. Association for Computational Linguistics.
  • Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. Hotflip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
  • Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. Allennlp: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR).
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Moosavi-Dezfooli et al. (2017) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1765–1773.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
  • Poursaeed et al. (2018) Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. 2018. Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4422–4431.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing.
  • Song et al. (2018) Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. 2018. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pages 8312–8323.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing nlp. In Empirical Methods in Natural Language Processing.
  • Xiao et al. (2018) Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. 2018. Generating adversarial examples with adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI).
  • Zhao et al. (2018a) Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. 2018a. Adversarially regularized autoencoders. In International Conference on Machine Learning (ICML).
  • Zhao et al. (2018b) Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018b. Generating natural adversarial examples. In International Conference on Learning Representations (ICLR).