That is, these adversarial samples are imperceptible to human judges while they can mislead the neural networks to incorrect predictions. Therefore, it is essential to explore these adversarial attack methods, since the ultimate goal is to make sure the neural networks are highly reliable and robust. While in computer vision, both attack strategies and their defense countermeasures are well explored Chakraborty et al. (2018), adversarial attack for text is still challenging due to the discrete nature of language. Generating adversarial samples for text needs to possess such qualities: (1) imperceptible to human judges yet misleading to neural models; (2) fluent in grammar and semantically consistent with the original inputs.
Previous methods craft adversarial samples mainly based on specific rules Li et al. (2018); Gao et al. (2018); Yang et al. (2018); Jin et al. (2019). It is therefore difficult for these methods to guarantee fluency and semantic preservation in the generated adversarial samples. These hand-crafted methods are also rather complicated, as they are designed with multiple linguistic constraints such as NER or POS tagging. Introducing contextualized language models to serve as an automatic perturbation generator could make such rule design much easier.
The recent rise of pre-trained language models, such as BERT Devlin et al. (2018), has pushed the performance of NLP tasks to a new level. On the one hand, the powerful ability of a fine-tuned BERT on downstream tasks makes it more challenging to attack adversarially Jin et al. (2019). On the other hand, BERT is a masked language model pre-trained on extremely large-scale unsupervised data and has learned general-purpose language knowledge. Therefore, BERT has the potential to generate more fluent and semantically consistent substitutions for an input text. Naturally, both properties of BERT motivate us to explore the possibility of attacking a fine-tuned BERT with another BERT as the attacker.
In this paper, we propose an effective and high-quality adversarial sample generation method: BERT-Attack, using BERT as a language model to generate adversarial samples. The core algorithm of BERT-Attack is straightforward and consists of two stages: finding the vulnerable words in a given input sequence for the target model, then applying BERT to generate substitutes for the vulnerable words. With the powerful ability of BERT, the perturbations are generated considering the surrounding context, so they are fluent and reasonable. We use the masked language model as a perturbation generator and find perturbations that maximize the risk of making wrong predictions Goodfellow et al. (2014). Unlike previous attacking strategies that require traditional single-direction language models as a constraint, we only need to run inference on the language model once as the perturbation generator, rather than repeatedly using language models to score the generated adversarial samples in a trial-and-error process.
Experimental results show that the proposed BERT-Attack method successfully fools its fine-tuned downstream model with the highest attack success rate compared with previous methods. Meanwhile, the perturbation percentage is considerably low, as is the query number, while semantic preservation is high.
To summarize our main contributions:
We propose a simple and effective method, BERT-Attack, to generate fluent and semantically-preserved adversarial samples that can successfully mislead state-of-the-art models in NLP, such as fine-tuned BERT for various downstream tasks.
BERT-Attack achieves a higher attacking success rate and a lower perturbation percentage with fewer accesses to the target model compared with previous attacking algorithms, and it does not require extra scoring models, making it extremely efficient.
We can generate adversarial samples with BERT-Attack as a parallel dataset for further research on the robustness of NLP models.
2 Related Work
To explore the robustness of neural networks, adversarial attack has been extensively studied for continuous data (such as images) Goodfellow et al. (2014); Nguyen et al. (2015); Chakraborty et al. (2018). The key idea is to find a minimal perturbation that maximizes the risk of making wrong predictions. This minimax problem can be easily solved by applying gradient descent over the continuous space of images. However, adversarial attack for discrete data such as text remains challenging.
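To make the continuous-domain intuition concrete, the following toy sketch applies the Fast Gradient Sign Method of Goodfellow et al. (2014) to a two-class linear model; the model, weights, and step size here are illustrative assumptions only, not part of any attack discussed in this paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x):
    # Negative log-likelihood of the correct (positive) class for a linear model.
    return -math.log(sigmoid(sum(wi * xi for wi, xi in zip(w, x))))

def fgsm(w, x, eps):
    # Gradient of the loss w.r.t. the input, derived analytically for this model,
    # then one step along its sign: x_adv = x + eps * sign(d loss / d x).
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [-(1.0 - s) * wi for wi in w]
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, grad)]

w = [1.0, -2.0, 0.5]   # fixed toy classifier weights (assumed)
x = [0.3, -0.1, 0.8]   # clean input
x_adv = fgsm(w, x, eps=0.1)
```

A small eps keeps x_adv within an L-infinity ball around x while the loss strictly increases; it is exactly this gradient step that has no direct analogue over discrete tokens.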
Adversarial Attack for Text
Current successful attacks for text usually adopt heuristic rules to modify the characters of a word Jin et al. (2019), or substitute words with synonyms Ren et al. (2019). Li et al. (2018); Gao et al. (2018) apply perturbations based on word embeddings such as Glove Pennington et al. (2014), which are not strictly semantically and grammatically coordinated. Alzantot et al. (2018) adopt language models to score the perturbations generated by searching for close-meaning words in the Glove Pennington et al. (2014) embeddings, using trial and error to find possible perturbations, yet the perturbations generated are still not context-aware and rely heavily on the cosine similarity of word embeddings. Glove embeddings do not guarantee that words close in cosine distance are semantically similar, so the perturbations are less semantically consistent. Jin et al. (2019) apply a semantically enhanced embedding Mrkšić et al. (2016), which is context-unaware and thus less consistent with the unperturbed inputs. Liang et al. (2017) use phrase-level insertion and deletion, which produces unnatural sentences inconsistent with the original inputs, lacking fluency control. To preserve semantic information, Glockner et al. (2018) replace words manually to break language inference systems Bowman et al. (2015). Jia and Liang (2017) propose manually crafted methods to attack machine reading comprehension systems. Lei et al. (2019) introduce replacement strategies using embedding transition.
Although the above approaches have achieved good results, there is still much room for improvement regarding the perturbed percentage, attacking success rate, grammatical correctness, and semantic consistency. Moreover, the substitution strategies of these approaches are usually non-trivial, so they are limited to specific tasks.
Adversarial Attack against BERT
Pre-trained language models have become the mainstream for many NLP tasks. Works such as Wallace et al. (2019); Jin et al. (2019); Pruthi et al. (2019) have explored these pre-trained language models from many different angles. Wallace et al. (2019) explored the possible ethical problems of learned knowledge in pre-trained models.
From our perspective, we take the idea of turning such language models against themselves and introduce a novel BERT-Attack algorithm to attack the fine-tuned models.
Motivated by the interesting idea of turning BERT against BERT, we propose BERT-Attack, using the original BERT model to craft adversarial samples to fool the fine-tuned BERT model.
Our method consists of two steps: (1) finding the vulnerable words for the target model and then (2) replacing them with the semantically similar and grammatically correct words until a successful attack.
The most vulnerable words are the keywords that help the target model make its judgment. Perturbations over these words can be most effective in crafting adversarial samples. After finding the words we aim to perturb, we use the masked language model to generate perturbations based on its top-K predictions.
3.1 Finding Vulnerable Words
Under the black-box scenario, the logit output by the target model (fine-tuned BERT or other neural models) is the only supervision we can get. We first select the words in the sequence that have a significant influence on the final output logit.
Let $S = [w_0, w_1, \dots, w_n]$ denote the input sentence, and let $o_y(S)$ denote the logit output by the target model for the correct label $y$. The importance score $I_{w_i}$ is defined as

$I_{w_i} = o_y(S) - o_y(S_{\setminus w_i}),$

where $S_{\setminus w_i} = [w_0, \dots, w_{i-1}, \texttt{[MASK]}, w_{i+1}, \dots, w_n]$ is the sentence after replacing $w_i$ with $\texttt{[MASK]}$.
Then we rank all the words according to the importance score $I_{w_i}$ in descending order to create a word list $L$. We only take $\epsilon$ percent of the most important words, since we tend to keep perturbations to a minimum.
This process maximizes the risk of making wrong predictions, which is previously done by calculating gradients in the image domain. The problem is then formulated as replacing these most vulnerable words with semantically consistent perturbations.
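The masking-based importance score can be sketched as follows; the toy logit function is a hypothetical stand-in for the black-box target model, not the actual fine-tuned BERT:

```python
from typing import Callable, List

def importance_scores(words: List[str],
                      logit_fn: Callable[[List[str]], float],
                      mask_token: str = "[MASK]") -> List[float]:
    """I_{w_i} = o_y(S) - o_y(S \\ w_i): the logit drop when w_i is masked out."""
    base = logit_fn(words)
    scores = []
    for i in range(len(words)):
        masked = words[:i] + [mask_token] + words[i + 1:]
        scores.append(base - logit_fn(masked))
    return scores

# Hypothetical target model: the "positive" logit rises with the word "great".
def toy_logit(words):
    return 2.0 * words.count("great") + 0.1 * len(words)

sentence = "the movie was great".split()
scores = importance_scores(sentence, toy_logit)
ranked = sorted(range(len(sentence)), key=lambda i: scores[i], reverse=True)
```

Keeping only the top few percent of `ranked` then gives the word list fed to the replacement stage.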
3.2 Word Replacement via BERT
After finding the vulnerable words, we iteratively replace the words in list $L$ one by one to find perturbations that can mislead the target model. Previous approaches usually use multiple human-crafted rules to ensure the generated example is semantically consistent with the original one and grammatically correct, such as a synonym dictionary Ren et al. (2019), a POS checker Jin et al. (2019), a semantic similarity checker Jin et al. (2019), etc. Alzantot et al. (2018) apply a traditional language model to score the perturbed sentence at every attempt of replacing a word.
These strategies for finding substitutes are unaware of the context around the perturbed positions, and thus are insufficient in fluency control and semantic consistency. More importantly, using language models or POS checkers to score the perturbed samples is costly, since this trial-and-error process requires massive inference time.
To overcome the lack of fluency control and semantic preservation when using synonyms or similar words in the embedding space, we leverage BERT for word replacement. The nature of the masked language model ensures that the generated sentences are relatively fluent and grammatically correct, and preserve most semantic information. Further, compared with previous approaches using rule-based perturbation strategies, the masked language model prediction is context-aware, so it dynamically searches for perturbations rather than simply replacing words with synonyms.
Different from previous methods that use complicated strategies to score and constrain the perturbations, our contextualized perturbation generator produces minimal perturbations with only one forward pass. The time-consuming part is accessing the target model only, without running extra models to score the sentence, and is therefore extremely efficient.
Thus, using the masked language model as a contextualized perturbation generator can be one possible solution to craft high-quality adversarial samples efficiently.
3.2.1 Word Replacement Strategy
As seen in Figure 1, given a chosen word $w$ to be replaced, we apply BERT to predict possible words that are similar to $w$ yet can mislead the target model. Instead of following the usual masked language model setting, we do not mask the chosen word and use the original sequence as input, which generates more semantically consistent substitutes. For instance, given the sequence "I like the cat.", if we mask the word cat, it would be very hard for a masked language model to predict the original word cat, since "I like the dog." would be just as fluent. Further, if we masked out the given word $w$, we would have to rerun the masked language model prediction at each iteration, which is costly.
Since BERT uses Byte-Pair Encoding (BPE) to tokenize the sequence $S = [w_0, w_1, \dots, w_n]$ into sub-word tokens $H = [h_0, h_1, \dots, h_m]$, we need to align the chosen word to its corresponding sub-words in BERT.
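The alignment between words and their BPE sub-words can be sketched with a hypothetical toy tokenizer (real BERT uses its learned WordPiece vocabulary instead):

```python
def align_words_to_subwords(words, tokenize):
    """Return the flat sub-word sequence and, per word, its (start, end) token span."""
    tokens, spans = [], []
    for w in words:
        pieces = tokenize(w)
        spans.append((len(tokens), len(tokens) + len(pieces)))
        tokens.extend(pieces)
    return tokens, spans

# Toy WordPiece-style tokenizer (assumption): long words split into two pieces.
def toy_tokenize(word):
    if len(word) <= 5:
        return [word]
    return [word[:4], "##" + word[4:]]

tokens, spans = align_words_to_subwords(["I", "like", "northanger", "abbey"],
                                        toy_tokenize)
```

A span of length one marks a single word handled directly; a longer span marks a rare word that goes through the sub-word replacement path.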
Let $M$ denote the BERT model. We feed the tokenized sequence $H$ into $M$ to get the output prediction $P$. Instead of using the argmax prediction, we take the $K$ most probable predictions at each position, where $K$ is a hyper-parameter.
We iterate over the words sorted by the word importance ranking to find perturbations. The BERT model uses BPE to construct its vocabulary. While most words remain single tokens, rare words are tokenized into sub-words. We treat single words and sub-words separately when generating substitutes.
For a single word $w_j$, we make attempts using the corresponding top-$K$ prediction candidates. We first filter out stop words collected from NLTK; for sentiment classification tasks, we also filter out antonyms using synonym dictionaries Mrkšić et al. (2016), since the BERT masked language model does not distinguish synonyms from antonyms. Then, for a given candidate $c_k$, we construct a perturbed sequence $S' = [w_0, \dots, w_{j-1}, c_k, w_{j+1}, \dots, w_n]$. If the target model is already fooled into predicting incorrectly, we break the loop to obtain the final adversarial sample; otherwise, we select the best perturbation from the filtered candidates and move on to the next word in word list $L$.
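The single-word replacement loop can be sketched as follows; the classifier, candidate list, and stop-word set are hypothetical stand-ins, and the fallback choice when no candidate flips the label is simplified relative to the selection described above:

```python
def attack_single_word(words, j, candidates, predict_label, true_label, stop_words):
    """Try top-K candidates at position j; stop early once the prediction flips."""
    fallback = None
    for c in candidates:
        if c in stop_words or c == words[j]:
            continue  # filtered candidates are skipped
        perturbed = words[:j] + [c] + words[j + 1:]
        if predict_label(perturbed) != true_label:
            return perturbed, True          # successful attack, break the loop
        fallback = fallback or perturbed    # simplified "best" perturbation
    return (fallback or words), False

# Hypothetical target classifier: predicts 1 ("positive") iff "great" appears.
predict = lambda ws: 1 if "great" in ws else 0
sent = "the movie was great".split()
adv, fooled = attack_single_word(sent, 3, ["good", "fine"], predict, 1, {"the"})
```

The early stop is what keeps the query number low: each candidate costs exactly one access to the target model and the loop ends at the first misclassification.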
For a word that is tokenized into sub-words in BERT, we cannot obtain its substitutes directly. Thus we use the perplexity of sub-word combinations to craft word substitutes from sub-word-level predictions. Given the sub-words $[h_0, h_1, \dots, h_t]$ of word $w$, we list all possible combinations from the prediction $P$, which gives $K^t$ sub-word combinations, and convert them back to normal words by reversing the BERT tokenization process. Then we use the perplexity of all combinations to keep the top-$K$ combinations; in this way, combinations that are unlikely to form a natural word are filtered out.
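The $K^t$ enumeration and perplexity filtering can be sketched as below; the two-piece candidate lists and the toy log-probability function are illustrative assumptions standing in for the BERT predictions and a real language-model score:

```python
import itertools
import math

def rank_subword_combinations(topk_per_position, logprob_fn, keep):
    """Enumerate all K^t sub-word combinations, keep the lowest-perplexity ones."""
    combos = itertools.product(*topk_per_position)
    # perplexity = exp(-mean token log-probability); lower means more natural
    scored = [(math.exp(-logprob_fn(c) / len(c)), c) for c in combos]
    scored.sort(key=lambda pair: pair[0])
    return [c for _, c in scored[:keep]]

# Hypothetical scorer: "st" + "##ory" ("story") is the most natural combination.
def toy_logprob(combo):
    return 0.0 if combo == ("st", "##ory") else -1.0

best = rank_subword_combinations([["pl", "st"], ["##ot", "##ory"]],
                                 toy_logprob, keep=1)
```

Reversing the BERT tokenization then turns the surviving piece sequences back into whole-word substitutes.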
Then we replace the original word with the most likely perturbation and repeat this process by iterating over the importance ranking list to find the final adversarial sample. In this way, we acquire adversarial samples efficiently, since we run the masked language model only once and apply perturbations without other checking strategies.
We summarize the two-step BERT-Attack process in Algorithm 1.
We apply our method to attack different types of NLP tasks in the form of text classification and natural language inference. Following Jin et al. (2019), we evaluate our method on 1k test samples randomly selected from the test set of the given task which are the same splits used by Alzantot et al. (2018); Jin et al. (2019).
We use different types of text classification tasks to study the effectiveness of our method.
Yelp Review classification dataset. Following Zhang et al. (2015), we process the dataset to construct a polarity classification task.
IMDB Document-level movie review dataset, where the average sequence length is longer than in the Yelp dataset. We process the dataset into a polarity classification task (https://datasets.imdbws.com/).
AG’s News Sentence-level news-type classification dataset, containing 4 types of news: World, Sports, Business, and Science.
FAKE Fake News Classification dataset, detecting whether a news document is fake, from the Kaggle Fake News Challenge (https://www.kaggle.com/c/fake-news/data).
Natural Language Inference
SNLI Stanford language inference task Bowman et al. (2015). Given one premise and one hypothesis, the goal is to predict whether the hypothesis is an entailment, neutral, or a contradiction of the premise.
MNLI Language inference dataset on multi-genre texts, covering transcribed speech, popular fiction, and government reports Williams et al. (2018). Compared with SNLI, it is more complicated, with diversified written and spoken texts, and includes eval data matched with the training domains as well as eval data mismatched with the training domains.
4.2 Automatic Evaluation Metrics
To measure the quality of the generated samples, we set up various automatic evaluation metrics. The success rate, which is the counterpart of after-attack accuracy, is the core metric measuring the success of the attacking method. Meanwhile, the perturbed percentage is also crucial since, generally, less perturbation results in more semantic consistency. Further, under the black-box setting, queries of the target model are the only accessible information, and excessive queries for one sample are less practical; thus the query number per sample is also a key metric. As in TextFooler Jin et al. (2019), we also use the Universal Sentence Encoder Cer et al. (2018) to measure the semantic consistency between the adversarial sample and the original sequence. To balance semantic preservation against attack success rate, we set a threshold on the semantic similarity score to filter out less similar examples.
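The per-sample bookkeeping behind these metrics can be sketched as follows, with made-up records of the form (fooled, words perturbed, sequence length, queries); the numbers are illustrative, not results from the paper:

```python
def attack_metrics(records):
    """Aggregate (fooled, n_perturbed, seq_len, n_queries) tuples into the metrics."""
    n = len(records)
    success_rate = sum(fooled for fooled, *_ in records) / n
    perturb_pct = 100.0 * sum(p / l for _, p, l, _ in records) / n
    avg_queries = sum(q for *_, q in records) / n
    return {"success_rate": success_rate,
            "perturb_pct": perturb_pct,
            "avg_queries": avg_queries}

stats = attack_metrics([(1, 2, 20, 50),    # fooled, 2/20 words changed, 50 queries
                        (1, 1, 25, 30),
                        (0, 0, 30, 200)])  # attack failed after 200 queries
```

The semantic similarity metric is omitted here, since it requires embedding both sequences with the Universal Sentence Encoder and taking their cosine similarity.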
4.3 Attacking Results
As shown in Table 1, the BERT-Attack method successfully fools its downstream fine-tuned model. In both text classification and natural language inference tasks, the fine-tuned BERTs fail to classify the generated adversarial samples correctly.
The average after-attack accuracy is lower than 10%, indicating that most samples are successfully perturbed to fool the state-of-the-art classification models. Meanwhile, the perturbation percentage is less than 10%, which is significantly less than in previous works.
Further, BERT-Attack successfully attacked all tasks listed, which span diversified domains such as news classification, review classification, and language inference in different domains. The results indicate that the attacking method is robust across different tasks. Compared with the strong baseline introduced by Jin et al. (2019), the BERT-Attack method is more efficient and more imperceptible: its query number and perturbation percentage are much lower.
We can observe that it is generally easier to attack the review classification task, since the perturbation percentage is incredibly low. BERT-Attack can mislead the target model by replacing only a handful of words. Since the average sequence length is relatively long, the target model tends to make judgments based on only a few words in a sequence, which is not how humans predict. Thus, perturbing these keywords results in incorrect predictions from the target model, revealing its vulnerability.
|Dataset||Method||Original Acc||Attacked Acc||Perturb %||Query Number||Avg Len||Semantic Sim|
|TextFooler Jin et al. (2019)||19.3||11.7||4403||0.76|
4.4 Human Evaluations
For further evaluation of the generated adversarial samples, we set up human evaluations to measure the quality of the generated samples in fluency and grammar as well as semantic preservation.
We ask human judges to score the grammatical correctness of mixed sentences of generated adversarial samples and original sequences, scoring from 1 to 5, following Jin et al. (2019). Then we ask human judges to make predictions for the generated adversarial samples mixed with original samples. We use the IMDB and MNLI datasets, and for each task we select 100 samples of both original and adversarial samples for the human judges.
As seen in Table 2, the semantic scores and label predictions of adversarial samples are close to those of the original ones. The MNLI task is a sentence-pair prediction task constructed from human-crafted hypotheses based on premises, so original pairs share a considerable number of the same words. Perturbations on these words make it difficult for human judges to predict correctly; therefore, the accuracy is lower than in simple sentence classification tasks.
|Dataset||Accuracy Ori/Adv||Semantic Ori/Adv|
4.5 BERT-Attack against Other Models
The BERT-Attack method is also applicable to attacking other target models, not limited to its fine-tuned model only. As seen in Table 3, the attack is successful against LSTM-based models, indicating that BERT-Attack is feasible for a wide range of models. Under BERT-Attack, the ESIM model is more robust on the MNLI dataset. We assume that encoding the two sentences separately yields higher robustness. In attacking BERT-large models, the performance is also excellent, indicating that BERT-Attack is successful in attacking different pre-trained models, not only its own fine-tuned downstream models.
|Dataset||Model||Ori Acc||Atk Acc||Perturb %|
5 Ablations and Discussions
5.1 Importance of Candidate Numbers
The candidate pool size $K$ is the major hyper-parameter of the BERT-Attack algorithm. As seen in Figure 2, the attack success rate rises as the candidate size $K$ increases. Intuitively, a larger $K$ would result in less semantic similarity. However, the semantic measure via the Universal Sentence Encoder stays within a stable range (experiments show that semantic similarity drops by less than 2%), indicating that the candidates are all reasonable and semantically consistent with the original sentence.
5.2 Importance of Sequence Length
The BERT-Attack method is based on the contextualized masked language model; thus the sequence length plays an important role in the quality of the perturbations. As seen, instead of focusing on attacking the hypothesis of the NLI task as previous methods do, we aim at the premises, whose average length is longer, because we believe contextual replacement is less reasonable for extremely short sequences. To alleviate this problem, word-level synonym replacement strategies can be combined with BERT-Attack, making the method more widely applicable.
|Dataset||Method||Ori Acc||Atk Acc||Perturb %|
|MNLI||Ori||Some rooms have balconies .||Hypothesis||All of the rooms have balconies off of them .||Contradiction|
|Adv||Many rooms have balconies .||Hypothesis||All of the rooms have balconies off of them .||Neutral|
|IMDB||Ori||it is hard for a lover of the novel northanger abbey to sit through this bbc adaptation and to||Negative|
|keep from throwing objects at the tv screen… why are so many facts concerning the tilney family|
|and mrs . tilney ’ s death altered unnecessarily ? to make the story more ‘ horrible ? ’|
|Adv||it is hard for a lover of the novel northanger abbey to sit through this bbc adaptation and to||Positive|
|keep from throwing objects at the tv screen… why are so many facts concerning the tilney family|
|and mrs . tilney ’ s death altered unnecessarily ? to make the plot more ‘ horrible ? ’|
|FAKE||Ori||the us may soon face an apocalyptic seismic event starkman today , … earthquakes …, as geologists||Unreliable|
|say . via usualroutine the university of washington has already presented seismological … charts|
|showing a gigantic geological rift that … when scientists found a strange underground rupture …|
|Adv||the us may soon face an apocalyptic seismic event starkman today , … earthquakes …, as geologists||Reliable|
|say . en usualroutine the university of washington , already presented seismological … charts|
|showing a gigantic geological rift that … when scientists found a strange underground rupture …|
5.3 Transferability and Adversarial Training
To test the transferability of the generated adversarial samples, we take samples aimed at different target models and use them to attack other target models. Here, we use BERT-base as the masked language model for all target models. As seen in Table 4, samples are transferable in the NLI task but less transferable in text classification.
Meanwhile, we further fine-tune the target model using adversarial samples generated from the training set and then test it on the same test set used before. As seen in Table 5, the generated samples used in fine-tuning help the target model become more robust, while its accuracy remains close to that of the model trained on clean data. The attack becomes more difficult, indicating that the model is harder to attack. Therefore, the generated dataset can be used as additional data for further exploration of making neural models more robust.
|Dataset||Model||Atk Acc||Perturb %||Semantic|
5.4 Effects on Sub-Word Level Attack
The BPE method is currently the most efficient way to deal with a large vocabulary, as used in BERT. We establish a comparative experiment in which we do not use the sub-word level attack; that is, we skip words that are tokenized into multiple sub-words.
As seen in Table 7, the sub-word level attack achieves higher performance, with both a higher attacking success rate and a lower perturbation percentage.
5.5 Effects on Word Importance Ranking
|Dataset||Method||Atk Acc||Perturb %||Semantic|
The word importance ranking strategy is intended to find the keywords that are essential to neural models, much like maximizing the risk of wrong predictions in the FGSM algorithm Goodfellow et al. (2014). When word importance ranking is not used, the attacking algorithm is less successful.
5.6 Examples of Generated Adversarial Sentences
As seen in Table 6, the generated adversarial samples are semantically consistent with their original inputs, while the target model makes incorrect predictions. In both the review classification samples and the language inference samples, the perturbations do not mislead human judges.
In this work, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using the BERT masked language model. Experimental results show that the proposed method achieves a high success rate while maintaining minimal perturbation. Nevertheless, candidates generated by the masked language model can sometimes be antonyms of, or irrelevant to, the original words, causing a semantic loss. Thus, enhancing language models to generate more semantically related perturbations is one possible direction to perfect BERT-Attack in the future.
- Generating natural language adversarial examples. CoRR abs/1804.07998. Cited by: §2, §3.2, §4.1.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326. Cited by: §2, 1st item.
- Universal sentence encoder. arXiv preprint arXiv:1803.11175. Cited by: §4.2.
- Adversarial attacks and defences: a survey. arXiv preprint arXiv:1810.00069. Cited by: §1, §2.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §1.
- Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56. Cited by: §1, §2.
- Breaking nli systems with sentences that require simple lexical inferences. arXiv preprint arXiv:1805.02266. Cited by: §2.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §2, §5.5.
- Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328. Cited by: §2.
- Is BERT really robust? natural language attack on text classification and entailment. CoRR abs/1907.11932. Cited by: §1, §1, §2, §2, §3.2, §4.1, §4.2, §4.3, §4.4, Table 1.
- Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1.
- Discrete adversarial attacks and submodular optimization with applications to text classification. Systems and Machine Learning (SysML). Cited by: §2.
- TextBugger: generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271. Cited by: §1, §2.
- Deep text classification can be fooled. arXiv preprint arXiv:1704.08006. Cited by: §2.
- Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892. Cited by: §2, §3.2.1.
- Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §2.
- Glove: global vectors for word representation. In Proceedings of the conference on empirical methods in natural language processing, pp. 1532–1543. Cited by: §2.
- Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268. Cited by: §2.
Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1085–1097. Cited by: §2, §3.2.
- Universal adversarial triggers for attacking and analyzing NLP. Empirical Methods in Natural Language Processing. Cited by: §2.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1112–1122. Cited by: 2nd item.
- Greedy attack and gumbel attack: generating adversarial examples for discrete data. arXiv preprint arXiv:1805.12316. Cited by: §1.
- Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: 1st item.