Defense of Word-level Adversarial Attacks via Random Substitution Encoding

by Zhaoyang Wang, et al.

Adversarial attacks against deep neural networks on computer vision tasks have spawned many new techniques that help protect models from making false predictions. Recently, word-level adversarial attacks on deep models for Natural Language Processing (NLP) tasks have also proven powerful, e.g., fooling a sentiment classification neural network into making wrong decisions. Unfortunately, little prior work has discussed defenses against such word-level, synonym-substitution-based attacks, since they are hard to perceive and detect. In this paper, we shed light on this problem and propose a novel defense framework called Random Substitution Encoding (RSE), which introduces a random substitution encoder into the training process of the original neural networks. Extensive experiments on text classification tasks demonstrate the effectiveness of our framework in defending against word-level adversarial attacks, under various base and attack models.



1 Introduction

Deep Neural Networks (DNNs) have become one of the most popular frameworks for harvesting knowledge from big data. Despite their success, the limited robustness of DNNs has become a serious problem, which has prompted adversarial attacks on them. An adversarial attack generates imperceptibly perturbed examples that fool a well-trained DNN model into making wrong decisions. In the Computer Vision (CV) domain, adversarial attacks against many famous DNN models have been shown to be an indisputable threat.

Recently, adversarial attacks on DNN models for Natural Language Processing (NLP) tasks have also received significant attention. Existing attack methods can be classified into two categories: character-level attacks and word-level attacks. In character-level attacks, attackers modify several characters of an original text to manipulate the target neural network. While character-level attacks are simple and effective, they are easy to defend against by running a spell-check and proofreading algorithm on the inputs before feeding them into DNNs [acl/PruthiDL19]. Word-level attacks substitute a set of words in the original examples with their synonyms, and thus preserve semantic coherence to some extent. The adversarial examples created by word-level attackers are less perceptible to humans and more difficult for DNNs to defend against.

So far, there are few works on defending against adversarial attacks on NLP tasks such as text classification. Most efforts have gone into increasing model robustness by adding perturbations to word embeddings, e.g., adversarial training [corr/GoodfellowSS14] or defensive distillation [sp/PapernotM0JS16]. Although these approaches perform better than base models, they assume there are no malicious attackers and cannot resist word-level adversarial attacks [wang2019natural]. The only work against word-level synonym adversarial attacks is [wang2019natural]. It proposes a Synonym Encoding Method (SEM), which maps synonyms to the same word embedding before training the deep models. As a result, the deep models are trained only on examples with fixed synonym substitutions. SEM-based deep models can defend against word-level attacks because the encoding moves many unseen or even adversarial examples towards ‘normal’ examples that the base models have seen. While SEM can effectively defend against the current best synonym adversarial attacks, it is too restrictive when the distances between the transformed test examples and the limited training examples are large.

This paper takes a straightforward yet promising approach to defending against word-level synonym attacks. Rather than modifying word embeddings before training, we move synonym substitution into the training process in order to fabricate more examples and feed them to the model. To this end, we propose a dynamic framework based on random synonym substitution that introduces a Random Substitution Encoding (RSE) between the input and the embedding layer. We also present a Random Synonym Substitution Algorithm for the training process with RSE. The RSE encodes input examples with randomly selected synonyms so as to produce enough labeled neighborhood data to train a robust DNN. Note that the RSE works in both the training and the testing procedure, much like a pair of dark glasses worn by the original DNN model.

We perform extensive experiments on three benchmark datasets for text classification with three DNN base models, i.e., Word-CNN, LSTM and Bi-LSTM. The experimental results demonstrate that the proposed RSE can effectively defend against word-level synonym adversarial attacks. Under popular word-level adversarial attacks, DNN models trained with the RSE framework achieve higher accuracy than the baselines, and this accuracy is close to their accuracy on benign test examples.

2 Related Work

Adversarial attack and defense have recently become two active research topics. In natural language processing, many tasks face the threat of adversarial attacks, e.g., Text Classification [GaoLSQ18, EbrahimiRLD18-2, 0002LSBLS18, LiJDLW19], Machine Translation [EbrahimiLD18], Question & Answer [RenDW18], etc. Among them, text classification models are more vulnerable and have become the targets of malicious adversaries. The state-of-the-art adversarial attacks on text classification in the literature can be categorized into the following types:

  • Character-level attacks. Attackers modify a few characters of an original text to manipulate the target neural network. Gao et al. [GaoLSQ18] proposed DeepWordBug, an approach that adds small character perturbations to generate adversarial examples against DNN classifiers. Ebrahimi et al. [EbrahimiRLD18-2] proposed an efficient method, named HotFlip, to generate white-box adversarial texts that trick a character-level neural network. In [0002LSBLS18] and [LiJDLW19], adversarial text samples were crafted in both white-box and black-box scenarios. However, these approaches are easy to defend against by placing a word recognition model in front of the neural network [PruthiDL19].

  • Word-level attacks. Word-level attacks substitute words in the original texts with their synonyms, and thus can preserve semantic coherence. Liang et al. [0002LSBLS18] designed three perturbation strategies to generate adversarial samples against deep text classification models. Alzantot et al. [AlzantotSEHSC18] proposed a genetic optimization algorithm to generate semantically similar adversarial examples that fool a well-trained DNN classifier. To decrease the computational cost of attacks, Ren et al. [RenDHC19] proposed a greedy algorithm, namely PWWS, for text adversarial attacks. Word-level adversarial examples are less perceptible to humans and more difficult for DNNs to defend against.

Very few works address defending against word-level text adversarial attacks. To the best of our knowledge, [wang2019natural] is the only work on defenses against synonym-substitution-based adversarial attacks. It proposes the Synonym Encoding Method (SEM), which encodes synonyms into the same word embedding to eliminate adversarial perturbations. However, it needs an extra encoding stage before the normal training process and is limited to a fixed synonym substitution. Our framework adopts a unified training process and provides a flexible synonym substitution encoding scheme.

3 Preliminaries

In this section, we first present the problem of adversarial attack and defense in text classification tasks. Next, we provide preliminaries on the attack model, namely several typical word-level synonym adversarial attacks.

3.1 Problem Definition

Consider a trained text classifier $F: X \to Y$, where $X$ and $Y$ denote the input and the output space respectively. For an input text $x \in X$, the classifier predicts the true label $y_{true}$ based on a posterior probability $P$:

$$\arg\max_{y_i \in Y} P(y_i \mid x) = y_{true}.$$

An adversarial attack on classifier $F$ is defined as follows: the adversary generates an adversarial example $x'$ by adding an imperceptible perturbation $\Delta x$, such that:

$$\arg\max_{y_i \in Y} P(y_i \mid x') \neq y_{true}, \quad \text{s.t. } x' = x + \Delta x,\ \|\Delta x\|_p < \epsilon,$$

where $\|\cdot\|_p$ denotes the $p$-norm and $\epsilon$ bounds the perturbation so that the crafted example is imperceptible to humans.

Defending against adversarial attacks requires training an enhanced text classifier $F^*$ over $F$. A successful defense means that, for a given input text example $x$, either the attacker fails to craft an adversarial example, or the generated adversarial example $x'$ cannot fool the classifier $F^*$.


3.2 Synonym Adversarial Attacks

To keep the perturbation small enough, adversarial examples need to satisfy semantic coherence constraints. An intuitive way to craft adversarial examples is to replace several words in the input examples with their synonyms. Let $x = w_1 w_2 \dots w_n$ denote an input example, where $w_i$ denotes a word. Each word $w_i$ has a synonym candidate set $S_i$. In a synonym adversarial attack, the adversary substitutes a set of words, denoted by $C_x$, to craft an adversarial example $x' = w'_1 w'_2 \dots w'_n$:

$$w'_i = \begin{cases} w_i, & \text{if } w_i \notin C_x, \\ s_i^j, & \text{if } w_i \in C_x, \end{cases}$$

where $s_i^j$ denotes the $j$th substitution candidate word in $S_i$.
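As an illustration of the substitution rule above, the following minimal sketch builds a substituted example from a tokenized input. The function name `substitute`, the toy sentence, and the position-keyed candidate sets are assumptions made for this example and do not come from the paper.

```python
from typing import Dict, List, Set


def substitute(words: List[str], positions: Set[int], choice: Dict[int, str]) -> List[str]:
    """Return x': word i becomes choice[i] if i is a chosen position, else it is kept."""
    return [choice[i] if i in positions else w for i, w in enumerate(words)]


# Hypothetical input x and candidate sets S_i (keyed by word position).
x = "the movie was good".split()
candidates = {3: ["fine", "great", "nice"]}

# Substitute only the word at position 3, using its first candidate.
positions = {3}
x_adv = substitute(x, positions, {3: candidates[3][0]})
print(" ".join(x_adv))  # the movie was fine
```

Words outside the chosen positions are copied unchanged, which mirrors the first case of the rule.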

Note that finding an adversarial example by exhaustive search takes exponential time, since the search space grows exponentially with the number of substitutable words. Existing synonym-substitution-based adversarial attacks have therefore focused on fast search algorithms, such as the Greedy Search Algorithm (GSA) [kuleshov2018adversarial] and the Genetic Algorithm (GA) [AlzantotSEHSC18]. [RenDHC19] propose a fast state-of-the-art method called Probability Weighted Word Saliency (PWWS), which considers both word saliency and classification confidence.
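To make the exponential cost concrete, the sketch below counts the texts an exhaustive attacker would have to consider, assuming each word can either stay unchanged or take any one of its synonyms. The helper name `search_space_size` and the candidate counts are illustrative assumptions.

```python
from math import prod
from typing import Iterable


def search_space_size(candidate_counts: Iterable[int]) -> int:
    """Count reachable texts when word i may stay or become one of candidate_counts[i] synonyms."""
    return prod(c + 1 for c in candidate_counts)


# Ten words with four synonyms each already give 5**10 combinations.
print(search_space_size([4] * 10))  # 9765625
```

With 50 such words the count is 5**50, which is why greedy or genetic search is used instead of enumeration.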

4 The Proposed Framework

In this section, we first present our motivation, and then describe the defense framework in detail.

4.1 Motivation

There are many possible reasons why DNNs are vulnerable to adversarial attacks. One pivotal factor is the intrinsic robustness of neural networks. Given a normal example $x$, suppose $x$ lies within the decision boundary inside which the classifier makes a correct prediction, as shown in Fig. 1(a). Attackers can nevertheless craft an adversarial example $x'$ in the neighborhood of $x$ such that the classifier makes a wrong prediction on $x'$.

Figure 1: Decision boundary around normal example.

In word-level adversarial attacks, adversarial examples within the neighborhood of normal examples are generally created by substituting some of the words in the text with their synonyms. A judicious solution is therefore to encode synonyms into the same embeddings [wang2019natural], and to use the modified embeddings both to train neural networks and to test an example. However, this method does not enlarge the decision boundary much, because the number of training examples does not increase. Under this encoding, a carefully crafted adversarial example may or may not cross the decision boundary. Fig. 1(b) shows that under such encodings an adversarial example may be mapped either to a point where the defense fails or to a point where the defense succeeds.

In this paper, we take a different route to generating more robust word embeddings. We randomly involve neighborhood examples of all training data in the model training process. The neighborhood examples come from random synonym substitution and share the same label as the original example. The decision boundary around an example may thus be expanded to cover most unseen neighborhood examples, including adversarial examples, as shown in Fig. 1(c). Note that we cannot generate infinitely many neighborhood examples for each training example, and even generating a large but finite number of them in advance would make training very expensive.

To address this challenge, we adopt a dynamic synonym substitution strategy in the training process, so that the number of training examples remains unchanged. As presented in Fig. 1(c), a neighborhood example (a green circle) replaces the original example (the blue circle) in the training process within an epoch. Different neighborhood examples are thus generated and used in different epochs. In the test process, test examples also undergo random synonym substitution. As a result, no matter which neighborhood point an unseen (and possibly adversarial) example is mapped to, the model can still give the correct prediction. We give the details of our framework in the next subsection.

4.2 Framework Specification

Given a set of training examples $\{(x_i, y_i)\}_{i=1}^{N}$ and a text classification model $M$ with parameter $\theta$, the objective of $M$ is to minimize the negative log-likelihood:

$$\min_{\theta} \left\{ -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta) \right\}.$$

To refine the decision boundary while easing the training load, our approach does not generate many labeled examples in advance. Instead, we dynamically generate neighborhood examples that replace the original examples in every epoch of the training process. To this end, we propose RSE, a dynamic framework based on random synonym substitution that introduces a Random Substitution Encoder between the input and the embedding layer. The training objective is then to minimize:

$$\min_{\theta} \left\{ -\sum_{i=1}^{N} \log P\big(y_i \mid \mathrm{rand}(x_i + \delta_i); \theta\big) \right\}, \quad \|\delta_i\| < \epsilon,$$

where $\mathrm{rand}(\cdot)$ denotes the random synonym substitution operation, and $\|\delta_i\| < \epsilon$ guarantees that the generated example stays in the neighborhood of $x_i$.

Since the random substitution operation requires no optimization, we can fuse random synonym substitution into the model and encode the inputs into new embeddings in real time during the training of model $M$. Fig. 2 illustrates the proposed framework.

Figure 2: The proposed RSE framework

As Fig. 2 shows, RSE reads an input text and encodes it into an embedding using a random synonym substitution algorithm. For example, given an original example $x$, RSE outputs a neighborhood example $x'$ and feeds its embedding into the subsequent model $M$. Model $M$ is then trained on the perturbed examples. Here $M$ can be any specific DNN model for NLP tasks; in this paper we focus on text classification models such as CNN and LSTM.

4.3 Random Synonym Substitution Algorithm

Next we introduce the details of RSE and the training process under the proposed framework. In practice, to satisfy the neighborhood constraint, we adopt a substitution rate $sr$ instead of the neighborhood radius $\epsilon$. There are three steps to generate a neighborhood example $x'$ for an original example $x$. First, we select a substitution rate $sr$ between a minimal rate $r_{min}$ and a maximal rate $r_{max}$. Second, we randomly sample a candidate word set $C$ from $x$ whose words will be substituted. Finally, we randomly choose synonyms for all words in $C$. Algorithm 1 presents the details of these steps as well as the whole training process.

In the test stage, a test example also needs to be encoded by the proposed RSE in order to mitigate possible adversarial noise. For example, given a test example $x$, we first transform it into a neighborhood example $x'$ by applying the substitution steps (Lines 4-6 in Algorithm 1). The embedding of $x'$ is then fed into the trained model to obtain a prediction.

Input: Training data $D$, model parameter $\theta$, minimal rate $r_{min}$ and maximal rate $r_{max}$
1  for each epoch do
2      for each minibatch $B \subset D$ do
3          for each original example $x \in B$ do
4              Sample a substitution rate $sr$ between $r_{min}$ and $r_{max}$;
5              Sample candidate words set $C$ using Eq();
6              Sample synonyms for all words in $C$ to generate $x'$;
7              Replace $x$ with $x'$ in $B$.
8          end for
9          Update $\theta$ using gradient ascent on the log-likelihood (Eq. 6, the RSE objective of Section 4.2) on minibatch $B$.
10     end for
11 end for
Algorithm 1 Training for the RSE framework
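The substitution steps (Lines 4-6 of Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform sampling of the rate and of the candidate words, the rounding with `ceil`, the word-keyed `synonyms` dictionary, and the names `random_substitution_encode` and `rse_epoch` are all choices made for this example, since the paper's exact sampling equation is not available here.

```python
import math
import random
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def random_substitution_encode(
    words: List[str],
    synonyms: Dict[str, List[str]],
    r_min: float,
    r_max: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a random neighbor of `words` produced by synonym substitution."""
    if rng is None:
        rng = random.Random()
    # Line 4: sample a substitution rate between r_min and r_max (assumed uniform).
    rate = rng.uniform(r_min, r_max)
    # Line 5: sample the positions to substitute, restricted to words with synonyms.
    replaceable = [i for i, w in enumerate(words) if synonyms.get(w)]
    k = min(len(replaceable), math.ceil(rate * len(words)))
    chosen = set(rng.sample(replaceable, k))
    # Line 6: draw one synonym uniformly at random for every chosen position.
    return [rng.choice(synonyms[w]) if i in chosen else w for i, w in enumerate(words)]


def rse_epoch(
    dataset: Iterable[Tuple[List[str], int]],
    synonyms: Dict[str, List[str]],
    r_min: float,
    r_max: float,
    rng: random.Random,
) -> Iterator[Tuple[List[str], int]]:
    """Yield one freshly perturbed copy of each labeled example (Line 7), label unchanged."""
    for words, label in dataset:
        yield random_substitution_encode(words, synonyms, r_min, r_max, rng), label
```

Calling `rse_epoch` once per epoch gives the model a different neighbor of every training example each time while keeping the dataset size fixed. At test time the same `random_substitution_encode` call would be applied to the input before it is passed to the trained model.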

5 Experiments

We evaluate the performance of the proposed RSE framework experimentally in this section. We first present the experiment setup, and then report the experimental results on three real-world datasets. The evaluation results show that our RSE achieves much better performance in defending against adversarial examples.

5.1 Experiment Setup

In this subsection we give an overview of the datasets, target model, attacker model and baselines used in our experiments.

Datasets. We test our RSE framework on three benchmark datasets: IMDB, AG’s News and Yahoo! Answers.

IMDB [acl/MaasDPHNP11] is a dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training and 25,000 for testing.

AG’s News [nips/ZhangZL15] is extracted from news articles using only the title and description fields. It contains 4 classes, and each class includes 30,000 training samples and 1,900 testing samples.

Yahoo! Answers [nips/ZhangZL15] is a topic classification dataset with 10 classes, which contains 4,483,032 questions and corresponding answers. For the following experiments, we sampled 150,000 training examples and 5,000 testing examples from the original 1,400,000 training examples and 60,000 testing examples. Each class contains 15,000 training examples and 500 testing examples.

We also pad the input text to a fixed length during preprocessing. The padding length is chosen according to the average sentence length of each dataset. Table 1 lists the detailed description of the aforementioned datasets.

Dataset          # of training samples   # of testing samples   # of vocab words   Padding length
IMDB             25,000                  25,000                 80,000             300
AG’s News        120,000                 7,600                  80,000             50
Yahoo! Answers   150,000                 5,000                  80,000             100
Table 1: The statistics and preprocessing settings for each dataset

Base Models. We use three classic deep neural networks as base models in our RSE framework for the text classification task: LSTM, Bi-LSTM and Word-CNN.

LSTM has a 100-dimension embedding layer, two LSTM layers in which each LSTM cell has 100 hidden units, and a fully-connected layer.

Bi-LSTM also has a 100-dimension embedding layer, two bi-directional LSTM layers and a fully-connected layer. Each LSTM cell has 100 hidden units.

Word-CNN has two embedding layers (one static, holding pretrained word vectors, and one non-static, updated during training), three convolutional layers with filter sizes of 3, 4, and 5 respectively, one 1D max-pooling layer and a fully-connected layer.

Attack Models. We adopt three synonym substitution adversarial attack models to evaluate the effectiveness of the defense methods. We assume that attackers can obtain all testing examples of the three datasets (IMDB, AG’s News, Yahoo! Answers) and can call the prediction interface of any model at any time.

Random. We first randomly choose a set of candidate words that have synonyms. We then keep replacing original words in the candidate set with randomly chosen synonyms until the target model predicts wrongly.

Textfool [kulynych2018evading] sorts the candidate words by word similarity rank, replaces them with synonyms in that order, and continues until the target model predicts wrongly. We do not use its typo substitutions because we focus on synonym replacement attacks.

PWWS [RenDHC19] is a greedy synonym replacement algorithm called Probability Weighted Word Saliency (PWWS) that considers word saliency as well as classification probability. Likewise, we use only its synonym replacements and not its named-entity replacements.
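The Random attack is simple enough to sketch directly. The code below is an illustrative reading of its description, with several assumptions: the victim is reached only through a hypothetical `predict` callable, candidates are visited in a shuffled order, and the toy classifier `toy_predict` exists only to make the example runnable.

```python
import random
from typing import Callable, Dict, List, Optional


def random_synonym_attack(
    words: List[str],
    true_label: int,
    predict: Callable[[List[str]], int],
    synonyms: Dict[str, List[str]],
    rng: random.Random,
) -> Optional[List[str]]:
    """Replace words by random synonyms, one at a time, until `predict` is wrong."""
    # Candidate positions are the words that have at least one synonym.
    candidates = [i for i, w in enumerate(words) if synonyms.get(w)]
    rng.shuffle(candidates)
    adversarial = list(words)
    for i in candidates:
        adversarial[i] = rng.choice(synonyms[words[i]])
        if predict(adversarial) != true_label:
            return adversarial  # attack succeeded
    return None  # all candidates replaced and the model is still correct


def toy_predict(tokens: List[str]) -> int:
    """Hypothetical victim: predicts 1 only if the literal token 'good' is present."""
    return int("good" in tokens)


adv = random_synonym_attack(
    ["a", "good", "film"], 1, toy_predict,
    {"good": ["fine"], "film": ["movie"]}, random.Random(0),
)
print(adv is not None)  # True
```

The number of positions changed before success is what the Substitution Rate metric later measures.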

Baselines. We take NT, AT and SEM as three baselines in this paper. NT is a normal training framework without any defense method. AT [corr/GoodfellowSS14] is an adversarial training framework, in which extra adversarial examples are generated to train a robust model. We adopt the same adversarial training configuration as in [wang2019natural], which uses PWWS to generate 10% adversarial examples from each dataset for every normally trained neural network. The adversarial examples and the original training examples are then mixed for the training process. SEM [wang2019natural] is an adversarial defense framework that inserts a fixed synonym substitution encoder before the input layer of the model. We evaluate our framework and the baseline frameworks using LSTM, Bi-LSTM and Word-CNN as base models.

5.2 Evaluations

We sampled evenly from each class of the original test data to form 1,000 clean test examples for every dataset. These examples are then used to generate adversarial examples with the above attack models, taking NT, AT, SEM and the proposed RSE as the victim targets.
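Class-balanced sampling of the clean test set can be sketched as follows. The helper name `sample_evenly`, the `(text, label)` data layout, and the use of a seeded `random.Random` are assumptions for illustration. The sketch assumes that the requested total is divisible by the number of classes, as 1,000 is for 2, 4, and 10 classes.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_evenly(
    examples: List[Tuple[str, int]], total: int, rng: random.Random
) -> List[Tuple[str, int]]:
    """Draw the same number of examples from every class, `total` in all."""
    by_class: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for text, label in examples:
        by_class[label].append((text, label))
    per_class = total // len(by_class)
    sample: List[Tuple[str, int]] = []
    for label in sorted(by_class):
        # Sample without replacement inside each class.
        sample.extend(rng.sample(by_class[label], per_class))
    return sample
```

For AG’s News, for instance, a call with `total=1000` would draw 250 examples from each of the 4 classes.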

Dataset         Attack Model  LSTM (%)           Bi-LSTM (%)        Word-CNN (%)
                              NT    AT    RSE    NT    AT    RSE    NT    AT    RSE
IMDB            No attack     88.8  87.9  87.0   89.5  88.7  86.5   87.6  86.3  87.8
                Random        80.6  77.8  83.1   81.5  79.2  81.9   77.5  75.0  83.0
                Textfool      75.4  76.2  84.2   74.6  77.0  83.7   71.2  71.0  83.1
                PWWS          26.3  29.3  82.2   27.3  28.1  79.3   13.5  10.5  81.2
AG’s News       No attack     90.5  92.6  92.9   96.4  94.5  94.1   95.9  96.9  94.8
                Random        84.7  87.9  89.2   91.4  89.7  92.2   91.6  92.6  93.1
                Textfool      79.6  85.3  88.7   86.3  89.1  90.6   88.0  92.6  92.2
                PWWS          63.0  72.8  84.2   70.4  78.0  88.3   67.5  77.1  89.9
Yahoo! Answers  No attack     72.5  73.1  72.1   73.2  72.7  71.8   71.2  66.0  70.1
                Random        60.5  65.1  68.6   61.2  64.2  68.9   58.9  54.3  67.3
                Textfool      58.9  63.4  67.4   60.4  63.6  67.1   57.9  56.2  66.4
                PWWS          29.1  39.3  64.3   26.3  39.2  64.6   28.8  26.8  62.6
Table 2: The evaluation results under different settings.

The key metrics for evaluating the performance of different defense frameworks in this paper are Accuracy, Accuracy Shift and Attack-Success Rate. Accuracy is the ratio of the number of correctly predicted examples to the total number of test examples. Accuracy Shift is the reduction in accuracy from before to after an attack. Attack-Success Rate is the ratio of the number of examples successfully attacked by an attack model to the number of examples correctly predicted with no attack. It can be computed as follows:

$$\text{Attack-Success Rate} = \frac{\text{Accuracy}_{\text{no attack}} - \text{Accuracy}_{\text{after attack}}}{\text{Accuracy}_{\text{no attack}}}.$$

The better the defense performance of the target model, the lower the Attack-Success Rate the attacker obtains.
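As a worked example of these metrics, the sketch below assumes that Attack-Success Rate equals the accuracy lost under attack divided by the no-attack accuracy, a reading that reproduces the values reported in Table 3. The function names are illustrative.

```python
def accuracy_shift(before: float, after: float) -> float:
    """Accuracy lost under attack, in percentage points."""
    return before - after


def attack_success_rate(before: float, after: float) -> float:
    """Share (in %) of originally correct predictions that the attack flips."""
    return 100.0 * (before - after) / before


# SEM with LSTM on IMDB under PWWS (Table 3): 86.8% before, 77.3% after.
print(round(accuracy_shift(86.8, 77.3), 1))       # 9.5
print(round(attack_success_rate(86.8, 77.3), 2))  # 10.94

# RSE with LSTM on IMDB under PWWS (Table 3): 87.0% before, 82.2% after.
print(round(attack_success_rate(87.0, 82.2), 2))  # 5.52
```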

Table 2 shows the accuracy of the base models (LSTM, Bi-LSTM and Word-CNN) against the various attack models (Random, Textfool, PWWS) under NT, AT, and the proposed RSE defense framework. For each base model on each dataset, we highlight the highest classification accuracy among the defense frameworks in bold to indicate the best defense performance.

From Table 2, we can make the following observations by looking at each box for the best accuracy:

  1. When there is no attack, NT or AT usually has the best accuracy. Under attack, however, our RSE framework achieves the best accuracy in almost every setting.

  2. For each column in each box, the target models have the lowest accuracy under the PWWS attack, which shows that PWWS is the most effective attack model. The accuracy of NT and AT drops significantly under the PWWS attack, whereas our RSE shows the best defense ability, with only a small accuracy loss compared with ‘No attack’.

  3. Under different settings (various datasets, attack models and base models), our RSE framework consistently performs well with only a small decrease in accuracy. This demonstrates that the proposed framework generalizes in making deep models robust against synonym adversarial attacks.

Metric (%)              Base Model  IMDB          AG’s News     Yahoo! Answers
                                    SEM    RSE    SEM    RSE    SEM    RSE
Before-Attack Accuracy  LSTM        86.8   87.0   90.9   92.9   69.0   72.1
                        Bi-LSTM     87.6   86.5   90.1   94.1   70.2   71.8
                        Word-CNN    86.8   87.8   88.7   94.8   65.8   70.1
After-Attack Accuracy   LSTM        77.3   82.2   85.0   84.2   54.9   64.3
                        Bi-LSTM     76.1   79.3   81.1   88.3   57.2   64.6
                        Word-CNN    71.1   81.2   67.6   89.9   52.6   62.6
Accuracy Shift          LSTM        9.5    4.8    5.9    8.7    14.1   7.8
                        Bi-LSTM     11.5   7.2    9.0    5.8    13.0   7.2
                        Word-CNN    15.7   6.6    21.1   4.9    13.2   7.5
Attack-Success Rate     LSTM        10.94  5.52   6.49   9.36   20.43  10.82
                        Bi-LSTM     13.13  8.32   9.99   6.16   18.52  10.03
                        Word-CNN    18.09  7.52   23.79  5.17   20.06  10.70
Table 3: SEM vs. RSE under the PWWS attack model.

RSE vs. SEM. We also compare SEM with our RSE framework, as shown in Table 3. Please note that we evaluate the performance of RSE and SEM only under the PWWS attack model on the three datasets because: (1) PWWS has the strongest attacking efficacy; and (2) no source code has been released for SEM yet, so we directly cite the results reported in [wang2019natural] under the same experimental settings.

It can also be seen from Table 3 that the After-Attack Accuracy of RSE is mostly higher than that of SEM, typically by about 5 to 10 percentage points. Since the parameters of each base model may differ, we also compare the Accuracy Shift of SEM and RSE. Except for the AG’s News dataset with the LSTM model, models under our framework have a smaller Accuracy Shift, and the shifts are stable, ranging from 4.8 to 8.7 points (about 7 points on average). For SEM the decrease is about 13 points on average. On the AG’s News dataset with the Word-CNN model, the Accuracy Shift of SEM reaches 21.1 points, which makes the performance of the model very poor. We can also see that our RSE has a lower Attack-Success Rate in nearly all settings. This means that it is more difficult for the PWWS attacker to craft adversarial examples under the RSE framework.

Substitution Rate. When crafting an adversarial example, smaller perturbations are better. The noise rate is therefore an important metric in adversarial attacks. A high noise rate means that the crafted example may no longer be imperceptible. Conversely, a defense mechanism that forces the attacker to add more noise in order to succeed is a better defense. In this paper we therefore introduce Substitution Rate as a metric, defined as the ratio of the number of substituted words to the sentence length. The better the defense framework performs, the more words the attack model has to substitute.
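The Substitution Rate of a single adversarial example can be computed directly from the two token sequences. The sketch below assumes a word-for-word substitution attack, so the original and adversarial texts have the same length. The function name and the toy sentences are illustrative.

```python
from typing import List


def substitution_rate(original: List[str], adversarial: List[str]) -> float:
    """Percentage of positions whose word differs between the two texts."""
    if len(original) != len(adversarial):
        raise ValueError("synonym substitution must preserve sentence length")
    changed = sum(o != a for o, a in zip(original, adversarial))
    return 100.0 * changed / len(original)


original = "this was a good film with a great cast overall".split()
adversarial = "this was a fine film with a grand cast overall".split()
print(substitution_rate(original, adversarial))  # 20.0
```

Here 2 of the 10 words were substituted, giving a rate of 20%.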

Dataset         Attacker  LSTM (%)             Bi-LSTM (%)          Word-CNN (%)
                          NT     AT     RSE    NT     AT     RSE    NT     AT     RSE
IMDB            Textfool  17.98  18.22  20.09  17.63  18.26  20.06  17.57  17.35  20.03
                PWWS      10.54  11.23  19.13  10.55  11.25  18.12  6.41   5.36   17.99
AG’s News       Textfool  20.93  21.27  22.18  21.18  21.39  22.11  22.09  21.59  22.17
                PWWS      19.07  20.88  21.66  20.09  21.44  22.20  19.35  20.75  23.08
Yahoo! Answers  Textfool  13.69  14.19  13.25  13.59  14.29  13.38  13.88  13.71  13.50
                PWWS      10.28  12.47  16.09  9.79   12.85  16.30  10.30  9.74   16.21
Table 4: Performance on Substitution Rate.

Table 4 shows the Substitution Rate for each base model without a defense (NT) and with a defense framework (AT and RSE). We could not list the results of SEM since they are not reported in [wang2019natural]. Table 4 shows that the Substitution Rate needed to attack models with RSE is around or above 20% on IMDB and AG’s News, and that it is higher than under NT and AT in every setting except the Textfool attack on Yahoo! Answers. We therefore conclude that RSE generally forces attackers to pay a higher cost for perturbing the original sentences.

6 Conclusion and Future Work

In this paper, we propose a defense framework called RSE to protect text classification models against word-level adversarial attacks. With this framework, a random synonym substitution encoder is fused into the deep neural network to endow the base model with robustness to adversarial examples. We also propose an effective training algorithm for the framework. Extensive experiments on three popular real-world datasets demonstrate the effectiveness of our framework in defending against word-level adversarial attacks. In the future, we will transfer our RSE framework to other typical NLP tasks, e.g., Machine Translation and Question & Answer, to protect deep models from word-level adversarial attacks.