Deep learning models are susceptible to adversarial examples that have imperceptible perturbations in the original input, resulting in adversarial attacks against these models. Analysis of these attacks on the state of the art transformers in NLP can help improve the robustness of these models against such adversarial inputs. In this paper, we present Adv-OLM, a black-box attack method that adapts the idea of Occlusion and Language Models (OLM) to the current state of the art attack methods. OLM is used to rank words of a sentence, which are later substituted using word replacement strategies. We experimentally show that our approach outperforms other attack methods for several text classification tasks.
In recent times, deep learning models have become pervasive across different domains. Many recent deep models have shown SOTA performance on a variety of NLP tasks (wang2018glue). Consequently, deep models are being deployed in a variety of production systems for real-life applications. Hence, it becomes imperative to ensure the reliability and robustness of such models, since a lack of robustness might pose a threat to security.
Recent studies have pointed out the vulnerability of deep models to adversarial attacks (goodfellow2014explaining). An adversarial attack generates adversarial samples by applying small perturbations to the original input that are imperceptible to humans yet fool the deep learning model into giving incorrect predictions.
Adversarial attacks on textual data are much more difficult due to the discrete nature of text. The basic requirement that the perturbation be imperceptible to human judges is much more challenging in a language data setting. Therefore, the adversarial sample needs to be grammatically correct and semantically sound. Perturbations at the word or character level that are perceptible to human judges have been explored in depth (ebrahimi2017hotflip; belinkov2017synthetic; jia2017adversarial; gao2018black). Work on defense against misspelling-based attacks (pruthi2019combating) and the use of optimization algorithms for attacks, such as genetic algorithms (alzantot2018generating; wang2019natural) and particle swarm optimization (zang2020word), have also been explored. With the rise of pre-trained language models, like BERT (devlin2018bert) and other transformer-based models, generating human-imperceptible adversarial examples has become more challenging. wallace2019universal, jin2019bert, and pruthi2019combating have explored these models from different perspectives.
Adversarial examples can be generated in a black-box setting, where no knowledge about the model is accessible, or in a white-box setting, where the technical details of the model are known. Generation of textual adversarial samples in a black-box setting consists of two steps: 1) finding words to replace in a sample (Word Ranking) and 2) replacing the chosen words (Word Replacement). Word Ranking is necessary to ensure that the word that contributes the most to the output prediction is considered as the candidate for replacement in the next step. Other constraints like generating semantically similar adversarial samples, human imperceptibility, and minimal perturbation percentage are also considered. Previous work has obtained word rankings by deleting words (e.g., BAE-R (garg2020bae), TextFooler (jin2019bert)) or by replacing words with the [UNK] token (e.g., BERT-Attack (li2020bert)), and then ranking the words based on the difference in output logits.
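The deletion and [UNK] ranking strategies described above can be illustrated with a minimal sketch. The classifier interface (`predict`), the helper name `rank_words`, and the toy scorer are assumptions made purely for illustration; they are not taken from any of the cited implementations.

```python
from typing import Callable, List, Sequence

# Hypothetical black-box interface: token list in, one score per label out.
Classifier = Callable[[Sequence[str]], Sequence[float]]


def rank_words(tokens: Sequence[str], predict: Classifier, label: int,
               mode: str = "delete", unk_token: str = "[UNK]") -> List[int]:
    """Return token indices sorted from most to least important.

    Importance of a word is the drop in the score of `label` when the word
    is deleted (mode="delete") or replaced by `unk_token` (mode="unk").
    """
    base = predict(tokens)[label]
    drops = []
    for i in range(len(tokens)):
        if mode == "delete":
            perturbed = list(tokens[:i]) + list(tokens[i + 1:])
        else:
            perturbed = list(tokens[:i]) + [unk_token] + list(tokens[i + 1:])
        drops.append(base - predict(perturbed)[label])
    # Largest drop first; ties keep their original left-to-right order.
    return sorted(range(len(tokens)), key=lambda i: drops[i], reverse=True)


def toy_predict(tokens: Sequence[str]) -> List[float]:
    """Toy sentiment scorer, an assumption used only to make the sketch run."""
    weights = {"great": 2.0, "good": 1.0, "bad": -2.0}
    positive = sum(weights.get(t, 0.0) for t in tokens)
    return [-positive, positive]  # [negative score, positive score]


sentence = ["a", "great", "and", "good", "film"]
ranking = rank_words(sentence, toy_predict, label=1)
print([sentence[i] for i in ranking])
```

Both modes issue exactly one classifier query per word, in addition to the single query on the unmodified input.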
Recently, in the model explainability domain, the method of Occlusion and Language Models (OLM) (harbecke2020considering) has been proposed. The authors argue that the data likelihood of the samples obtained after either deleting a token or replacing it with the [UNK] token is very low, which makes these methods unsuitable for determining the relevance of a word towards the output probability. The authors propose the use of language models for calculating the relevance of the words in a sentence. Taking inspiration from OLM, we propose Adv-OLM, a black-box attack method that adapts the idea of OLM (as the Word Ranking strategy) to find the relevant words to replace. We empirically show that OLM provides a better set of ranked words compared to the existing word ranking strategies for the generation of adversarial examples.
We summarize our contributions as follows:
We propose a new method Adv-OLM, to rank words for generating adversarial examples.
We empirically show that Adv-OLM has a higher success rate and lower perturbation percentage than previous attacking methods.
The implementation for the proposed approach is made available at the GitHub repository: https://github.com/vijit-m/Adv-OLM.
We are given a corpus consisting of $n$ input samples, $X = \{x_1, \dots, x_n\}$, with corresponding labels $Y = \{y_1, \dots, y_n\}$, and a trained classification model $F: X \rightarrow Y$ that maps an input sample to its correct label. We assume a black-box setting where the attacker can only query the classifier for output label probabilities for the given input. For an input sample $x \in X$ with label $y$, the task is to construct an adversarial sample $x_{adv}$ such that $F(x_{adv}) \neq y$, with $F(x) = y$, and $\mathrm{Sim}(x, x_{adv}) \geq \epsilon$. Here, $\mathrm{Sim}$ can be both the semantic and syntactic similarity function, and $\epsilon$ is the minimum similarity threshold. Ideally, the amount of perturbation should be minimized. The first step is to rank the words of the sample $x$. Based on the ranking, starting from the most important word, the word is replaced by some candidate word that keeps the perturbed sample semantically similar and grammatically sound but changes the output prediction.
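Adv-OLM uses the idea of Occlusion and Language Models to perform Word Ranking using both the OLM and OLM-S methods. OLM uses a language model to sample some candidate instances for a word and then replaces the word. Let $x_i$ be a word of the input $x$ and $x_{\setminus i}$ be the incomplete input without this word. Then the OLM relevance score $r$ given the prediction function $f$ and label $y$ is

$$r_{f,y}(x_i) = f_y(x) - f_y(x_{\setminus i}) \qquad (1)$$

(Here $f_y$ is the logit value corresponding to the label $y$.)
Here, $f_y(x_{\setminus i})$ is not accurately defined and needs to be approximated since $x_{\setminus i}$ is the incomplete input. A language model $p_{LM}$ generates input by predicting the masked word as $\hat{x}_i$ that is as natural as possible for the model, and thus $f_y(x_{\setminus i})$ is approximated as:

$$f_y(x_{\setminus i}) \approx \sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \, f_y(x_{\setminus i}, \hat{x}_i) \qquad (2)$$

where $f_y(x_{\setminus i}, \hat{x}_i)$ is the prediction of the classification model after the language model's prediction $\hat{x}_i$ is added to the incomplete input $x_{\setminus i}$.
The other method, OLM-S, calculates the sensitivity of a position in the text and has nothing to do with the word present at that position in the original input. The sensitivity score $s$ of OLM-S is calculated as

$$s_{f,y}(x_i) = \sqrt{\sum_{\hat{x}_i} p_{LM}(\hat{x}_i \mid x_{\setminus i}) \left( f_y(x_{\setminus i}, \hat{x}_i) - \mu \right)^2} \qquad (3)$$

where $\mu$ is the mean value from Equation 2. The sensitivity score is used for word ranking in OLM-S.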
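To make the two scores concrete, the following is a minimal sketch that estimates both from a finite set of language-model samples. The language model probability is approximated by sample frequencies, and the callables `predict_logit` and `sample_replacements`, as well as the toy data, are hypothetical stand-ins for the classifier and the masked language model; this is not the reference OLM implementation.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def olm_scores(tokens: Sequence[str], index: int,
               predict_logit: Callable[[Sequence[str]], float],
               sample_replacements: Callable[[Sequence[str], int, int], List[str]],
               n_samples: int = 30) -> Tuple[float, float]:
    """Return (OLM relevance, OLM-S sensitivity) for tokens[index]."""
    samples = sample_replacements(tokens, index, n_samples)
    counts = Counter(samples)            # sample frequencies estimate p_LM
    total = sum(counts.values())
    logits = {}
    for word in counts:                  # one classifier query per unique word
        filled = list(tokens[:index]) + [word] + list(tokens[index + 1:])
        logits[word] = predict_logit(filled)
    # Weighted mean of the logits over the resampled inputs (Equation 2).
    mean = sum(counts[w] / total * logits[w] for w in counts)
    # Relevance: logit on the original input minus the resampled mean.
    relevance = predict_logit(tokens) - mean
    # Sensitivity: weighted standard deviation of the resampled logits.
    variance = sum(counts[w] / total * (logits[w] - mean) ** 2 for w in counts)
    return relevance, math.sqrt(variance)


# Toy stand-ins (assumptions for illustration only).
WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -2.0}
POOL = ["good", "good", "bad", "fine"]


def toy_logit(tokens: Sequence[str]) -> float:
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def toy_sampler(tokens: Sequence[str], index: int, n: int) -> List[str]:
    return (POOL * (n // len(POOL) + 1))[:n]


relevance, sensitivity = olm_scores(["a", "great", "film"], 1,
                                    toy_logit, toy_sampler, n_samples=4)
print(relevance, sensitivity)
```

Ranking with OLM sorts positions by the first value and ranking with OLM-S sorts them by the second.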
After performing the Word Ranking step using the relevance scores generated by OLM and OLM-S, the next step is to replace highly scored words with semantically similar words that form grammatically correct sentences (Word Replacement) such that the output prediction changes. The word replacement strategy is kept similar to existing methods: TextFooler uses synonym extraction, POS checking, and semantic similarity checking, whereas BAE-R uses a language model for word replacement (details in Appendix C).
[Results tables: data rows were not recovered. Recoverable column headers: Model | Method | Word Ranking | Original Acc. | Attacked Acc. | Success Rate | Perturbed %. One table is headed "Natural Language Inference".]
We experiment with different benchmark datasets for text classification and entailment: IMDB, AG News, Yelp Polarity and MNLI (details in Appendix A). The statistics of the final dataset are shown in Table 1. The test set was a randomly chosen stratified subset. For evaluating the effectiveness of our proposed approach, we experiment with SOTA text classifiers, i.e., transformer-based models like BERT (devlin2018bert), ALBERT (lan2019albert), RoBERTa (liu2019roberta) and DistilBERT (sanh2019distilbert).
We replaced the existing word ranking strategies (i.e., Original (delete)) of previous attack methods, TextFooler (jin2019bert) and BAE-R (garg2020bae), with word rankings generated using OLM and OLM-S while keeping the rest of the attack procedure the same. The comparison between the attacks generated through the original word ranking and the OLM-adapted word ranking (including a comparison with the PWWS attack method (ren2019generating)) is provided in Tables 2, 3 and 4. The PWWS (Probability Weighted Word Saliency) method considers the word saliency along with the classification probability. The change in the classification probability is used to measure the attack effect of the proposed substitute word, while word saliency shows how strongly the original word affects the classification. We use the default language model (BERT) employed in OLM and OLM-S, and keep the number of samples generated by the OLM language model at 30 in all the experiments.
The following evaluation metrics are used:
Attacked Acc.: Accuracy of the model after attack. Lower the better.
Success Rate: Ratio of the number of successful attacks to the total number of attempted attacks. (Note that the total number of attempted attacks is not the same as the number of input examples, i.e., the samples that were wrongly classified by the model even before an attack are skipped.) Higher the better.
Perturbed Percentage: Ratio of number of words that were modified by the attack and the total number of words in the input example. Lower the better.
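The metrics above can be computed from per-example attack outcomes as in the following minimal sketch. The `AttackResult` record and the `summarize` helper are hypothetical names introduced here, and averaging the perturbed percentage over successful attacks only is an assumption, since the text does not state which examples enter that average.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttackResult:
    originally_correct: bool  # model was right before the attack
    succeeded: bool           # attack changed the model's prediction
    words_modified: int
    words_total: int


def summarize(results: List[AttackResult]) -> Dict[str, float]:
    """Return the evaluation metrics as percentages."""
    if not results:
        raise ValueError("no attack results to summarize")
    n = len(results)
    # Examples the model already misclassified are skipped, not attacked.
    attempted = [r for r in results if r.originally_correct]
    successes = [r for r in attempted if r.succeeded]
    # Assumption: perturbation is averaged over successful attacks only.
    perturbed = [r.words_modified / r.words_total for r in successes]
    return {
        "original_acc": 100.0 * len(attempted) / n,
        "attacked_acc": 100.0 * (len(attempted) - len(successes)) / n,
        "success_rate": 100.0 * len(successes) / len(attempted) if attempted else 0.0,
        "perturbed_pct": 100.0 * sum(perturbed) / len(perturbed) if perturbed else 0.0,
    }


example_results = [
    AttackResult(True, True, 2, 10),
    AttackResult(True, True, 1, 20),
    AttackResult(True, False, 0, 10),
    AttackResult(False, False, 0, 10),  # skipped: wrong before the attack
]
metrics = summarize(example_results)
print(metrics)
```

In this toy run three of the four examples are attempted and two of those succeed, so the success rate is computed over three attempts rather than four inputs.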
We use TextAttack's (morris2020textattack) models fine-tuned on these datasets and use the TextAttack framework to execute the attacks, including Adv-OLM (Appendix B).
Number of queries in Adv-OLM: From Equations 2 and 3, it is clear that, unlike the deletion and [UNK] token replacement methods, which perform only a single query per word, we need to perform multiple queries.
We set the number of samples generated by the OLM language model to 30 for our experiments. In the worst case, all 30 samples for a token are unique, and the model is queried 30 times for that token. However, experimentally this was not the case. To study this, we varied the number of samples and evaluated the number of queries of the OLM ranking step. In Figure 2, we plot the number of queries for OLM, averaged over the input samples, against the number of samples. We can see that there is not a significant difference in the total number of queries in our case (OLM + TextFooler queries) when compared with PWWS.
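The gap between the worst case and the observed number of queries can be illustrated with a small simulation. The vocabulary, the peaked sampling distribution, and the helper names below are assumptions chosen only to show the effect of counting unique samples; they do not reproduce the measurements in Figure 2.

```python
import random
from typing import List


def ranking_queries(samples_per_position: List[List[str]]) -> int:
    """Classifier queries for the ranking step: one per unique sampled word
    at each position (the query on the unmodified input is not counted)."""
    return sum(len(set(words)) for words in samples_per_position)


def simulate(n_positions: int, n_samples: int, seed: int = 0) -> List[List[str]]:
    """Draw samples per position from a hypothetical, peaked distribution."""
    rng = random.Random(seed)
    vocab = ["good", "great", "fine", "nice", "bad", "poor"]
    weights = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
    return [rng.choices(vocab, weights=weights, k=n_samples)
            for _ in range(n_positions)]


for n in (5, 10, 30):
    used = ranking_queries(simulate(n_positions=8, n_samples=n))
    print(f"samples={n:2d}  queries={used:3d}  worst case={8 * n}")
```

Because repeated samples share one query, the count is bounded by the number of distinct words the language model proposes, not by the number of samples drawn.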
Results are shown in Tables 2, 3 and 4. Table 2 provides the results on the AG News and Yelp datasets for fine-tuned BERT and ALBERT models. Our method performs better on both datasets, increasing the success rate by about 1-3% over the previous methods while also decreasing the perturbation percentage. Table 3 gives the results of attacking a fine-tuned BERT on MNLI. Although we did not perform better than the original BAE-R, we were still able to outperform TextFooler. Due to the unavailability of an MNLI fine-tuned ALBERT model in TextAttack, we did not perform an attack on ALBERT. It can also be seen from Table 2 that the perturbation percentage for AG's News exceeds 20%, which seems to be a perceptible change. However, the average length of an article is only 40.41 words, which leaves less room for finding relevant words, so the perturbation percentage becomes very high.
To compare attacks across different transformer-based models, we evaluate the performance of Adv-OLM on the IMDB dataset. Table 4 provides the results of different attack methods on BERT, ALBERT, RoBERTa, DistilBERT and BiLSTM. Adv-OLM was able to outperform previous attack methods on BERT, ALBERT and RoBERTa, increasing the success rate by up to 10% for BAE-R and up to 6% for TextFooler. The perturbation percentage was also reduced by 1-2%. On DistilBERT, Adv-OLM showed no change in the success rate, but the perturbation percentage was lowered slightly. We also performed an attack on a non-transformer-based BiLSTM model, which did not show any improvement in the success rate. For BAE-R, Adv-OLM even showed a decrease in the success rate. One possible reason for this might be that in both OLM and OLM-S word sampling is performed using a transformer-based BERT language model. We also have qualitative results on the IMDB dataset (Figure 0(a), 0(b) in Appendix).
Experimentally, it was observed that better words were ranked when OLM/OLM-S was used as the Word Ranking strategy (Figure 0(b)). Compared with the original methods, Adv-OLM issues more queries, because OLM/OLM-S queries the model several times per word when computing the word rankings. However, the difference in the number of queries between Adv-OLM and the existing attack methods is not very significant, since the model is queried only for the unique words among the samples generated by the language model.
In this work, we present Adv-OLM, a black-box attack method that uses an OLM-based word ranking strategy, improving the attack performance significantly over previous methods. We also studied how replacing a single component in a complex system with an existing method can improve upon previous attack strategies. For future work, we would like to experiment with other language models in the OLM algorithm. We plan to study the effect of using different transformers for the language model and the target model.
We evaluate our adversarial attacks on text classification and natural language inference datasets. We evaluate our method on 500 samples randomly selected from the test set of the given dataset.
Text classification We used the following text classification datasets:
IMDB: Document-level Large Movie Review dataset for binary sentiment classification.
Yelp: The Yelp reviews dataset consists of reviews from Yelp. This is a dataset for sentiment classification. It is extracted from the Yelp Dataset Challenge 2015 data.
AG's News: Sentence-level news-type classification dataset, containing 4 types of news: World, Sports, Business, and Science.
Natural Language Inference
MNLI: A corpus of sentence pairs manually labeled for classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI). Unlike SNLI, MNLI is more diverse, based on multi-genre texts, covering transcribed speech, popular fiction, and government reports.
Average Length is the average number of words in the randomly chosen 500 samples taken from its test set for each dataset.
TextAttack is an open-source Python framework for adversarial attacks, data augmentation and adversarial training in NLP.
| BAE (garg2020bae) | PWWS (ren2019generating) |
| BERT-Attack (li2020bert) | TextFooler (jin2019bert) |
| DeepWordBug (gao2018black) | HotFlip (ebrahimi2017hotflip) |
| Alzantot (alzantot2018generating) | Morpheus (tan-etal-2020-morphin) |
| IGA (wang2019natural) | Pruthi (pruthi2019combating) |
| Input-Reduction (feng2018pathologies) | PSO (zang2020word) |
| Seq2Sick (cheng2020seq2sick) | TextBugger (li2018textbugger) |
| Kuleshov (kuleshov2018adversarial) | Fast Alzantot (jia2019certified) |
The modularity of TextAttack enables researchers to construct new attacks from a combination of novel and existing approaches, or to analyze already existing approaches. This helps in composing and comparing attacks in a shared environment. TextAttack makes it easy to perform benchmark comparisons across all the previous attacks performed across models. TextAttack provides clean, readable implementations of 16 adversarial attacks from the literature, of which two are sequence-to-sequence attacks and nine are classification-based attacks from the GLUE benchmark. A list of these attacks is presented in Table 5. TextAttack is directly integrated with HuggingFace's transformers and NLP libraries. This allows users to test attacks on models and datasets.
TextAttack builds attacks from four components:
A search method that selects the words to be transformed.
A transformation that generates a set of possible perturbations for the given input.
A set of constraints imposed on the transformation to ensure that the perturbations are valid with respect to the original input.
A goal function that determines whether an attack is successful in terms of model outputs. For classification tasks, the goal can be untargeted or targeted. For sequence-to-sequence tasks, the goal can be non-overlapping output or a minimum BLEU score.
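How the four components fit together can be sketched in plain Python as follows. This deliberately does not use the TextAttack API: every function name, the candidate table, and the toy classifier are hypothetical and serve only to show the division of responsibilities.

```python
from typing import Callable, Dict, List, Optional, Sequence

Tokens = List[str]


def untargeted_goal(predict: Callable[[Sequence[str]], Sequence[float]],
                    true_label: int) -> Callable[[Tokens], bool]:
    """Goal function: succeed when the predicted label differs from true_label."""
    def reached(tokens: Tokens) -> bool:
        scores = predict(tokens)
        return max(range(len(scores)), key=scores.__getitem__) != true_label
    return reached


def table_transformation(table: Dict[str, List[str]]) -> Callable[[Tokens, int], List[Tokens]]:
    """Transformation: propose replacements for the word at position i."""
    def propose(tokens: Tokens, i: int) -> List[Tokens]:
        return [tokens[:i] + [w] + tokens[i + 1:] for w in table.get(tokens[i], [])]
    return propose


def max_perturbed(original: Tokens, limit: float) -> Callable[[Tokens], bool]:
    """Constraint: at most `limit` (a fraction) of the words may be changed."""
    def ok(tokens: Tokens) -> bool:
        changed = sum(a != b for a, b in zip(original, tokens))
        return changed / len(original) <= limit
    return ok


def greedy_search(tokens: Tokens, order: List[int], propose, constraints,
                  goal, score) -> Optional[Tokens]:
    """Search method: visit words in ranked order, keep the lowest-scoring valid swap."""
    current = list(tokens)
    for i in order:
        valid = [c for c in propose(current, i) if all(k(c) for k in constraints)]
        if not valid:
            continue
        current = min(valid + [current], key=score)
        if goal(current):
            return current
    return None  # attack failed


# Toy classifier and candidate table (assumptions for illustration only).
WEIGHTS = {"great": 2.0, "good": 1.0, "so-so": -0.5}


def toy_predict(tokens: Sequence[str]) -> List[float]:
    positive = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return [-positive, positive]


original = ["a", "great", "film"]
adversarial = greedy_search(
    original, order=[1, 0, 2],
    propose=table_transformation({"great": ["good", "so-so"]}),
    constraints=[max_perturbed(original, limit=0.5)],
    goal=untargeted_goal(toy_predict, true_label=1),
    score=lambda t: toy_predict(t)[1],
)
print(adversarial)
```

In this decomposition, changing the word ranking strategy only changes the `order` passed to the search method; the transformation, constraints and goal function stay untouched.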
The following word replacement workflow was proposed in the TextFooler paper (jin2019bert):
Synonym Extraction: Gather a candidate set CANDIDATES for all possible replacements of the selected word $w_i$, based on the cosine similarity between $w_i$ and every other word in the vocabulary. To represent the words, counter-fitted word embeddings were used. Using this set of embedding vectors, the top $N$ synonyms whose cosine similarity with $w_i$ is higher than some threshold $\delta$ were chosen.
POS Checking: In the set CANDIDATES of the word $w_i$, only the ones with the same part-of-speech (POS) as $w_i$ were kept. This step ensures that the grammar of the text is mostly maintained.
Semantic Similarity Checking: Each remaining word $c \in$ CANDIDATES was substituted for $w_i$ in the sentence $x$, and an adversarial example $x_{adv}$ was obtained. Universal Sentence Encoder (USE) was used to encode the two sentences into high-dimensional vectors, and their cosine similarity score was used as the sentence similarity between $x$ and $x_{adv}$. The words resulting in similarity scores above a preset threshold $\epsilon$ were placed in a final candidate pool (FINCANDIDATE).
Finally, every candidate word from FINCANDIDATE was tried one by one, and the one that resulted in the lowest confidence score for label $y$ was considered the best replacement for word $w_i$.
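The workflow above can be sketched as a single selection function. The toy embeddings, POS tags, sentence-similarity table and confidence function below are hypothetical stand-ins for counter-fitted embeddings, a POS tagger, USE and the target classifier, and the parameter names and default thresholds are illustrative assumptions rather than the values used by TextFooler.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_replacement(tokens: List[str], i: int,
                       embeddings: Dict[str, Sequence[float]],
                       pos_tags: Dict[str, str],
                       sentence_sim: Callable[[List[str], List[str]], float],
                       confidence: Callable[[List[str]], float],
                       top_n: int = 50, delta: float = 0.7,
                       epsilon: float = 0.7) -> Optional[List[str]]:
    """Return the perturbed sentence with the lowest confidence, or None."""
    word = tokens[i]
    # 1) Synonym extraction: top-N neighbours with cosine similarity above delta.
    scored = sorted(((cosine(embeddings[word], vec), w)
                     for w, vec in embeddings.items() if w != word), reverse=True)
    candidates = [w for s, w in scored[:top_n] if s > delta]
    # 2) POS checking: keep candidates with the same part of speech.
    candidates = [w for w in candidates if pos_tags.get(w) == pos_tags.get(word)]
    # 3) Semantic similarity checking on the whole sentence.
    final = []
    for w in candidates:
        adv = tokens[:i] + [w] + tokens[i + 1:]
        if sentence_sim(tokens, adv) > epsilon:
            final.append(adv)
    # 4) Pick the candidate that minimizes the confidence of the true label.
    return min(final, key=confidence) if final else None


# Toy resources (assumptions for illustration only).
EMB = {"great": [1.0, 0.0], "superb": [0.95, 0.05], "good": [0.9, 0.1],
       "fine": [0.85, 0.15], "greatly": [0.9, 0.2], "table": [0.0, 1.0]}
POS = {"great": "ADJ", "superb": "ADJ", "good": "ADJ", "fine": "ADJ",
       "greatly": "ADV", "table": "NOUN"}
SIM = {"superb": 0.9, "good": 0.75, "fine": 0.6}   # stand-in for USE scores
CONF = {"great": 2.0, "superb": 1.8, "good": 1.0, "fine": 0.4}

best = select_replacement(
    ["a", "great", "film"], 1, EMB, POS,
    sentence_sim=lambda x, adv: SIM.get(adv[1], 0.0),
    confidence=lambda adv: CONF.get(adv[1], 0.0),
)
print(best)
```

In the toy run, "table" fails the embedding threshold, "greatly" fails the POS check and "fine" fails the sentence-similarity check, so the choice is made between "superb" and "good".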
BAE uses a pre-trained BERT masked language model (MLM) to predict the mask tokens for replacement. Since BERT is powerful and trained on a large corpus, the predicted mask tokens fit well grammatically in the sentence. BERT-MLM does not, however, guarantee semantic coherence with the original text. To ensure semantic similarity when introducing perturbations in the input text, the set of K predicted mask tokens was filtered using a Universal Sentence Encoder (USE) based sentence similarity score. An additional check for grammatical correctness of the generated adversarial example was performed by filtering out predicted tokens that do not have the same part of speech (POS) as the original token in the sentence.