SIENA: Stochastic Multi-Expert Neural Patcher

11/17/2020 ∙ by Thai Le, et al. ∙ Penn State University ∙ Yonsei University

Neural network (NN) models that are solely trained to maximize the likelihood of an observed dataset are often vulnerable to adversarial attacks. Even though several methods have been proposed to enhance NN models' adversarial robustness, they often require re-training from scratch. This leads to redundant computation, especially in the NLP domain, where current state-of-the-art models such as BERT and ROBERTA require great time and space resources. Borrowing ideas from Software Engineering, we therefore first introduce the Neural Patching mechanism, which improves adversarial robustness by "patching" only parts of a NN model. Then, we propose a novel neural patching algorithm, SIENA, that transforms a textual NN model into a stochastic ensemble of multi-expert predictors by upgrading and re-training only its last layer. SIENA forces adversaries to attack not only one but multiple models that are specialized in diverse subsets of features, labels, and instances, so that the ensemble model becomes more robust to adversarial attacks. By conducting comprehensive experiments, we demonstrate that all of the CNN, RNN, BERT, and ROBERTA-based textual models, once patched by SIENA, witness an absolute increase of as much as 20% in prediction accuracy under white-box and black-box attacks, outperforming 6 defensive baselines across 4 public NLP datasets.

1 Introduction

Recent works have shown that NN models trained solely to maximize prediction performance are often vulnerable to adversarial attacks papernot2016limitations. Even though several works have been proposed to defend NN models against such attacks, only a few of them focus on the NLP domain wang2019towards. Since many recent NLP models have been shown to be vulnerable to adversarial attacks–e.g., fake news detection malcom and dialog systems cheng2019evaluating, the investigation of robust defense methods for textual NN models has become necessary. To defend against adversarial texts, one can use either adversarial detection or model enhancement wang2019towards. Adversarial texts are often generated by replacing or inserting critical words (e.g., HotFlip ebrahimi2017hotflip) or characters (e.g., TextBugger li2018textbugger) in a sentence, which usually introduces grammatical errors. Hence, many detection methods have focused on recognizing and correcting such misspellings in texts–e.g., ScRNN pruthi2019combating and DISP zhou2019learning. While misspelling-based methods are model-independent and require neither re-training nor modifying the models, they only work well on character-based attacks. In contrast, model enhancement approaches perform well under both character and word-based attacks. Particularly, most model enhancement methods enrich NN models by training them with adversarial data augmented via known attack strategies, as in adversarial training miyato2016adversarial; liu2020robust, or with external information such as knowledge graphs li2019knowledge. Nevertheless, these augmentations usually induce overhead costs in training. Therefore, we are in search of defense algorithms that directly enhance the models' structures (e.g., li2019knowledge) while achieving higher extensibility without requiring additional data.

Fortunately, recent literature in computer vision shows that ensemble NNs achieve high adversarial robustness kariyappa2019improving; pang2019improving; binarycode; dabouei2020exploiting. In theory, by directly extending a single NN model into an ensemble of multiple diverse sub-models, we challenge adversaries to attack not only one but a set of very different models adam2020evaluating. This makes any attack significantly more difficult. However, applying such an idea from computer vision to the NLP domain faces one main challenge. Current ensemble methods require the simultaneous training of several NN sub-models (with an identical architecture). This introduces impractical computational overhead during both training and inference, especially when one wants to maximize prediction accuracy by utilizing complex state-of-the-art (SOTA) sub-models such as BERT devlin2018bert and ROBERTA liu2019roberta, both of which have more than 100M parameters. Furthermore, applying current ensemble or other defensive approaches that directly enhance a model's architecture to a large-scale NN model would usually require re-training everything from scratch, which may not be practical in many settings.

To address these challenges, we borrow ideas from Software Engineering: we first introduce the notion of Neural Patching to improve the adversarial robustness of NN models by “patching” only parts of them (Figure 1). Next, we develop a novel neural patching algorithm, SIENA, that patches only the last layer of an already deployed textual NN model of diverse architectures (e.g., CNN, RNN, transformers vaswani2017attention) and transforms it into an ensemble of multi-experts with enhanced adversarial robustness. By patching only the last layer of a model, SIENA introduces a lightweight computational overhead and requires no additional training data.

Distinguished from current ensemble methods, each sub-model trained by SIENA is specialized not only in a specific subset of features of the same input, i.e., an expert at the feature-level pang2019improving; dabouei2020exploiting, or in a subset of labels, i.e., an expert at the class-level kariyappa2019improving, but also in texts of a distinguished topic, i.e., an expert at the instance-level, during prediction. Such diversity across feature-, class-, and instance-level expertise makes it challenging for adversaries to exploit multiple sub-models at the same time. In summary, our contributions in this paper are as follows:

  • We first formalize the concept of Neural Patching, a novel mechanism to fix or enhance a trained NN model F by modifying only a subset of F's parameters (Figure 1).

  • We propose SIENA, a novel neural patching algorithm that transforms F into a stochastic ensemble of multi-experts with little computational overhead compared to previous ensemble methods in the literature.

  • We show that textCNN, textRNN, BERT, and ROBERTA models patched by SIENA achieve an absolute increase of as much as 20% in prediction accuracy on average under 5 different white-box and black-box attacks, outperforming 6 defensive baselines on 4 public NLP datasets.

2 Related Work

SIENA is mainly motivated by previous ensemble-based defense methods, most of which are designed for computer vision. Among these, works such as adam2020evaluating; dabouei2020exploiting further introduce theoretical analyses of the relationship between the diversity of the sub-models, i.e., heads, of an ensemble NN model and its adversarial robustness. Though these works focus on the image domain, we can derive a similar analysis of the advantage of ensemble diversity for defending against adversarial texts. The key idea is to force attackers to exploit not one but multiple NN models whose behaviors are very different. We refer to such distinguished behaviors as the expertise of these sub-models.

On a feature-level expertise, if the sub-models focus on different words of a sentence to make a prediction, replacing a few critical words in the text might not necessarily fool their combination (Table 1A). This diversity can be obtained by regularizing the differences among the saliency maps of sub-models via gradient vectors. However, previous works kariyappa2019improving; dabouei2020exploiting only focus on minimizing the cosine-similarity, i.e., the difference in directions, among the gradient vectors. In this work, we find that the differences in length among them also correlate with adversarial robustness (Sec. A.2). In contrast, sub-models with class-level expertise might focus on diverse subsets of labels. This can be enabled by regularizing the prediction probabilities output by the sub-models pang2019improving. Since the probabilities are limited to the range [0, 1], directly regularizing the differences in the prediction probabilities might not be effective. To improve this, we propose to regularize the prediction logits instead of the probabilities at the class-level. For example, Table 1B shows that Head #1 and #2 each specialize in only one label at a time, outputting logit scores for the positive and negative classes, respectively. In this work, we also propose an instance-level expertise where each sub-model is assigned sentences of a distinguished topic (Table 1C). This is an extreme case in our method where one sub-model might not be involved in a specific prediction at all (Table 1B, Head #3).

Distinguished from previous ensemble methods, SIENA takes into account all three levels of expertise of the sub-models. Moreover, applying current ensemble methods to the NLP domain faces a practical challenge: training multiple SOTA sub-models such as BERT or ROBERTA can be very costly in terms of space and time complexity. Thus, SIENA also enables one to “hot-fix” a complex NN model with our Neural Patching mechanism, removing the need to re-train the entire model from scratch.

Figure 1: Patching a Part (in Red) of a NN Model Leads to Correct Prediction under Adversarial Attack on Input
Head  Example
#1  23 things teachers actually want for chrismas
#2  23 things teachers actually want for chrismas
#3  23 things teachers actually want for chrismas
(A) Feature-Level Expertise (On a Specific Sentence; each head attends to a different set of words)
Head  Specialized Class  Meaning
#1  Positive  Promoting Clickbaitness
#2  Negative  Discounting Non-Clickbaitness
#3  N/A  Not Participating
(B) Class-Level Expertise (On a Specific Sentence)
Head  Example
#1  Topic: relate, years, help, tv, city, etc.
#2  Topic: age, dies age, new, things, people, etc.
(C) Instance-Level Expertise
Table 1: Expertise among Heads in Detecting Clickbait

3 Patching a Neural Network

Adapted from Software Engineering, we first introduce the concept of Neural Patching as follows.

Definition 1 (Neural Patching)

Given a NN F parameterized by θ and trained on a dataset D, a patcher is a function P such that, when applied to the trained model F, it yields the transformation:

F* = P(F)   (1)

There are three important properties of the patched model F*: (i) Fidelity: F* should maintain (or only slightly sacrifice) the original prediction performance of F; (ii) Complexity: F* should not require much more computational overhead than F; and (iii) Robustness: F* should improve upon F in terms of adversarial robustness. There are always trade-offs among these three properties. Particularly, a complicated vulnerability might require a computationally expensive patcher P.
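
As a concrete illustration of this definition, the following minimal PyTorch sketch (ours, not the authors' released code) realizes a patcher P that freezes a toy model F and re-trains only its last layer; the optimizer only ever sees the patched subset of parameters.

```python
import torch
import torch.nn as nn

# A toy pre-trained classifier F (a stand-in for a textual NN model).
base = nn.Sequential(
    nn.Linear(300, 128), nn.ReLU(),   # feature extractor: stays frozen
    nn.Linear(128, 2),                # last layer: the part to be patched
)

def patch(model):
    """A patcher P: freeze all of F's weights, then swap in a trainable head."""
    for p in model.parameters():
        p.requires_grad = False
    model[-1] = nn.Linear(128, 2)     # the replacement head is trainable
    return model

patched = patch(base)
optimizer = torch.optim.Adam(          # only the patched subset is optimized
    (p for p in patched.parameters() if p.requires_grad), lr=1e-3)
```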

4 SIENA Architecture

Given the NN model F (Def. 1), we introduce the Stochastic MultI-Expert Neural PAtcher (SIENA) to improve the adversarial robustness of F. Figure 2 illustrates the architecture of SIENA with K heads, i.e., sub-models, and its two main components: (i) the Stochastic Ensemble (SE) module, which transforms the last layer of F into a randomized ensemble of K different heads, and (ii) the Multi-Expert (ME) module, which uses Neural Architecture Search (NAS) to dynamically learn the optimal architecture of each head to promote their diversity. Intuitively, SIENA improves the robustness of F by extending it into an ensemble of multiple expert predictors. Details of SIENA are described as follows.

Figure 2: (Left) Architecture of SIENA with K Experts. (Right) Effects of the Scaler γ on the Prediction Logits of an Expert on the Clickbait (CB) and Sentiment Treebank (SST) Datasets

4.1 Stochastic Ensemble (SE) Module

This module extends the last layer of F, which is typically a fully-connected layer (followed by a softmax for classification), into an ensemble of K prediction heads, denoted h_1, …, h_K. Each head h_i, parameterized by θ_i, is an expert predictor that is fed the feature representation z ∈ R^d learned by F up to its second-last layer and outputs a prediction logit score:

o_i = h_i(z; θ_i) ∈ R^C, with z = F_φ(x)   (2)

where φ are the fixed parameters of F up to the second-last layer, d is the size of the feature representation of input x generated by the base model F, and C is the number of labels. To aggregate the logit scores returned from all heads, a classical ensemble method would then average them as the final prediction: ō = (1/K)·Σ_i o_i. However, this simple aggregation assumes that each h_i learns from very similar training signals. Hence, when z already encodes some of the task-dependent information, the heads will eventually converge not to a set of experts but to very similar predictors. To resolve this issue, we introduce “randomness” into the process by enabling different subsets of heads during both training and inference. Specifically, we introduce a new aggregation mechanism as follows:

ô = Σ_{i=1..K} π_i · γ_i · o_i   (3)

where γ_i is a scalar that scales o_i to different directions and magnitudes, and π_i is a probabilistic scalar representing how much prediction head h_i should contribute to the final prediction. Let us denote γ and π as the vectors containing all scalars γ_i and π_i, respectively, and O as the concatenation of all logit vectors o_i returned from the K heads. Both γ and π are calculated as follows:

γ = W_γ·O + b_γ   (4)
π = softmax(τ·(W_π·O + b_π + g))   (5)

where W_γ, b_γ, W_π, b_π are trainable parameters, g is a noise vector sampled from the Standard Gumbel Distribution, and π is thus sampled from a Gumbel-Softmax distribution controlled by the inverse-temperature τ. Finally, to train this module, we use Eq. (3) as the final prediction and train the whole module with the Negative Log Likelihood (NLL) loss following the objective:

L_SE = NLL(softmax(ô), y)   (6)

Intuitively, since γ_i scales all non-negative scalars in o_i with the same magnitude and direction, the algorithm cannot easily control the product γ_i·o_i in favor of the ground-truth label, especially when K is large. Hence, this encourages each head to output o_i with only one dominant class while reducing the scores for the other labels to zero (Figure 2, Right). In other words, we direct each head to specialize in detecting only one label at a time. Moreover, as π is sampled from the Gumbel-Softmax distribution, which is inherently stochastic, the same value of O can result in different values of π. Thus, π facilitates a random assignment of how much each head contributes to the final prediction, which further diversifies the learning of each expert.
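
Since the equations above are reconstructed from the surrounding text, the following sketch should be read as an approximation of the SE module rather than the authors' implementation: K linear heads share the frozen feature vector z, and the scalers γ and Gumbel-Softmax weights π (Eqs. (3)-(5)) are computed from the concatenated logits O; all names (SEModule, tau, etc.) are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEModule(nn.Module):
    """Sketch of the Stochastic Ensemble head, assuming the aggregation
    o_hat = sum_i pi_i * gamma_i * o_i reconstructed in Eqs. (3)-(5)."""
    def __init__(self, feat_dim, num_classes, num_heads, tau):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_heads))
        concat_dim = num_heads * num_classes             # size of O
        self.w_gamma = nn.Linear(concat_dim, num_heads)  # Eq. (4)
        self.w_pi = nn.Linear(concat_dim, num_heads)     # logits for Eq. (5)
        self.tau = tau                                   # inverse-temperature

    def forward(self, z):
        o = torch.stack([h(z) for h in self.heads], dim=1)        # (B, K, C)
        O = o.flatten(1)                                          # concatenated logits
        gamma = self.w_gamma(O)                                   # per-head scalers
        gumbel = -torch.log(-torch.log(torch.rand_like(gamma)))   # Gumbel(0, 1) noise
        pi = F.softmax(self.tau * (self.w_pi(O) + gumbel), dim=-1)
        return ((pi * gamma).unsqueeze(-1) * o).sum(dim=1)        # (B, C) final logits

se = SEModule(feat_dim=128, num_classes=2, num_heads=5, tau=1.0)
logits = se(torch.randn(4, 128))                     # z from the frozen base model
loss = F.cross_entropy(logits, torch.tensor([0, 1, 1, 0]))  # NLL loss of Eq. (6)
```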

1:  Input: pre-trained neural network F
2:  Input: number of heads K, inverse-temperature τ, multiplier λ
3:  Initialize the head parameters θ and the selection vectors a
4:  repeat
5:      Freeze θ and optimize a via Eq. (8) with multiplier λ on a mini-batch from the validation set.
6:      Freeze a and optimize θ via Eq. (6) on a mini-batch from the train set.
7:  until convergence
Algorithm 1 Training SIENA Algorithm

4.2 Multi-Expert (ME) Module

While the SE module promotes diversity among heads in classifying different classes, the ME module searches for the optimal architecture of each head to maximize the diversity in how they make predictions. To do this, we adapt the DARTS algorithm liu2018darts as follows. Let us denote S as the set of M possible networks from which each head h_i can be selected. We want to learn a one-hot encoded selection vector a_i that assigns one network in S to h_i during prediction. Since the argmax operation is not differentiable, during training we relax the categorical assignment of the architecture for h_i to a softmax over all possible networks in S:

h_i(z) = Σ_{m=1..M} [exp(a_i^m) / Σ_{m'=1..M} exp(a_i^{m'})]·h_i^m(z)   (7)

To ensure that the heads specialize in different features of an input, we can maximize the difference among the gradients back-propagated to the word-embeddings e(x) of input x from each h_i. Hence, given a fixed set of parameters θ of all possible networks for every head, we train all selection vectors a by optimizing the objective:

max_a L_div = Σ_{i≠j} ‖∇_e(x) L_i − ∇_e(x) L_j‖_2   (8)

where L_i is the NLL loss as if we only used the single prediction head h_i. In this module, however, not only do we want to maximize the differences among the gradient vectors, but we also want to ensure that the selected architectures eventually converge to a good prediction performance. Therefore, we train the whole ME module with the following objective:

min_a Σ_{i=1..K} L_i − λ·L_div   (9)
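
A sketch of the two ME ingredients might look as follows, under the assumption that each candidate network is a small FCN and that diversity is measured by pairwise ℓ2 distances between input-embedding gradients (cf. Eqs. (7)-(8)); RelaxedHead and gradient_divergence are our names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_fcn(depth, dim, num_classes):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(dim, dim), nn.ReLU()]
    return nn.Sequential(*layers, nn.Linear(dim, num_classes))

class RelaxedHead(nn.Module):
    """DARTS-style relaxation (Eq. (7)): the head is a softmax-weighted
    mixture of candidate FCNs (depths 1-3); `alpha` is the selection vector."""
    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        self.candidates = nn.ModuleList(
            make_fcn(d, dim, num_classes) for d in (1, 2, 3))
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, z):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * c(z) for w, c in zip(weights, self.candidates))

def gradient_divergence(heads, emb, pooled):
    """Mean pairwise l2 distance between the heads' gradients w.r.t. the
    input embeddings; maximizing it pushes heads apart at the feature level."""
    grads = []
    for h in heads:
        g, = torch.autograd.grad(h(pooled).sum(), emb,
                                 create_graph=True, retain_graph=True)
        grads.append(g.flatten(1))
    pairs = [torch.norm(a - b, dim=1).mean()
             for i, a in enumerate(grads) for b in grads[i + 1:]]
    return torch.stack(pairs).mean()

emb = torch.randn(4, 16, 128, requires_grad=True)   # (batch, seq, dim) embeddings
heads = [RelaxedHead() for _ in range(3)]
div = gradient_divergence(heads, emb, emb.mean(dim=1))
```

In a full DARTS-style setup, the alpha vectors would be trained on validation batches to trade off this divergence against the per-head losses, as in Alg. 1 above.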

4.3 Overall Framework

To combine the SE and ME modules, we incorporate Eq. (7) into Eq. (2) and optimize the overall objective:

min L = L_SE − λ·L_div   (10)

We employ an iterative training strategy with the Adam optimization algorithm kingma2014adam, as in Alg. 1. By alternately freezing and training θ and a, using a train and a validation (val) set, we aim to (i) achieve a high-quality prediction performance through Eq. (6) and (ii) select the optimal architecture for each expert to maximize their specialization through Eq. (8).
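
A runnable approximation of this alternating scheme is sketched below; head_loss and arch_loss are hypothetical closures standing in for the objectives of Eq. (6) and Eq. (8), and theta / arch are lists of the head parameters and selection vectors, respectively.

```python
import itertools
import torch

def set_trainable(params, flag):
    for p in params:
        p.requires_grad_(flag)

def train_siena(theta, arch, head_loss, arch_loss,
                train_loader, val_loader, epochs=3, lr=1e-3):
    """Alternating optimization as in Alg. 1: selection vectors on
    validation batches, head parameters on training batches."""
    opt_theta = torch.optim.Adam(theta, lr=lr)
    opt_arch = torch.optim.Adam(arch, lr=lr)
    for _ in range(epochs):
        for train_batch, val_batch in zip(train_loader,
                                          itertools.cycle(val_loader)):
            # Step 5: freeze theta, optimize the selection vectors a.
            set_trainable(theta, False); set_trainable(arch, True)
            opt_arch.zero_grad(); arch_loss(val_batch).backward(); opt_arch.step()
            # Step 6: freeze a, optimize the head parameters theta.
            set_trainable(arch, False); set_trainable(theta, True)
            opt_theta.zero_grad(); head_loss(train_batch).backward(); opt_theta.step()
```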

Dataset  Total Classes  Vocab. Size  Average Length  Total Samples
Clickbait  2  25K  9 words  32K
Subjectivity  2  20K  24 words  10K
Movie Reviews  2  19K  21 words  11K
Sentiment Treebank  2  15K  19 words  9K
Table 2: Dataset Statistics
Attack  Base  Level  Mode  Semantic Constraint
HotFlip  Gradient  Word/Char  White Box  ✓
FGSM  Gradient  Word  White Box  -
TextFooler  Score  Word  Black Box  ✓
TextBugger  Score  Char  Black Box  ✓
Table 3: Comparison among Experimental Attack Methods

5 Experimental Evaluation

5.1 Experimental Environment

Datasets & Metric:

Table 2 shows the statistics of all experimental datasets: Clickbait detection anand2017we, Subjectivity detection Pang+Lee:04a, Movie Reviews classification pang2005seeing, and Sentiment Treebank wang2018glue. We split each dataset into train, validation, and test sets with a ratio of 8:1:1 whenever standard public splits are not available. All the datasets are balanced among classes. We use Accuracy on the test set as the main performance metric in all experiments (experimental results using F1 as the metric are also shown in the Appendix).

Attacks:

We test SIENA to defend against both white-box and black-box attacks as summarized in Table 3:

  • HotFlip (HF) ebrahimi2017hotflip perturbs tokens at both the character/word level. In this work, we use HF to generate adversarial texts by perturbing the top vulnerable words from the original sentences.

  • FGSM (FS) goodfellow2014explaining adds a perturbation proportional to the sign of the gradient of the loss w.r.t. the word-embeddings of the input, then translates the perturbed embeddings back to the most similar tokens behjati2019universal. We use the same perturbation magnitude in all experiments (a toy sketch of this attack follows the list below).

  • TextFooler (TF) jin2019bert uses a complex formula to estimate the importance of tokens in a sentence and swaps the most vulnerable ones with synonyms to generate adversarial texts.

  • TextBugger (TB) li2018textbugger: by using character manipulation strategies such as Insert, Delete, Swap, and Substitute, TB fools an NLP system into encoding a token as out-of-vocabulary (OOV) while it might still be readable to a human (e.g., hello → he11o).

Except for FGSM, all attacks apply semantic constraints to preserve the original meanings of the input texts. We also test HotFlip without such constraints (HF*).
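
For illustration, a toy FGSM-style word substitution might be sketched as follows; the shapes, the mean-pooling model, and eps are placeholders, and real attacks such as behjati2019universal add further constraints.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_word_attack(model, emb_matrix, token_ids, label, eps=0.5):
    """Toy FGSM-style substitution: perturb the input embeddings along the
    sign of the loss gradient, then snap each perturbed vector back to its
    nearest vocabulary token (the discrete adversarial text)."""
    emb = emb_matrix[token_ids].clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(emb).unsqueeze(0), label)
    loss.backward()
    perturbed = emb + eps * emb.grad.sign()
    dists = torch.cdist(perturbed, emb_matrix)   # (seq_len, vocab_size)
    return dists.argmin(dim=1)                   # adversarial token ids

vocab = torch.randn(1000, 64)                    # toy embedding matrix
head = nn.Linear(64, 2)
def model(e):                                    # mean-pool + classify stand-in
    return head(e.mean(dim=0))

adv_ids = fgsm_word_attack(model, vocab, torch.randint(0, 1000, (8,)),
                           torch.tensor([1]), eps=0.5)
```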

Baselines:

We aim to defend four textual NN models, namely textCNN kim2014convolutional, textRNN with a GRU cell, and the transformer-based BERT devlin2018bert and ROBERTA liu2019roberta. We compare SIENA with the following six defensive baselines:

  • Ensemble is the classical ensemble of 5 different Base Models. We use the average of all NLL losses from the base models as the final training loss.

  • Diversity Training (DT) kariyappa2019improving is a variant of the Ensemble baseline where a regularization term is added to minimize the coherency among the gradient vectors of the input text w.r.t. each sub-model.

  • Adaptive Diversity Promoting (ADP) pang2019improving is a variant of the Ensemble baseline where a regularization term is added to maximize the diversity among the non-maximal predictions of individual sub-models.

  • Adversarial Training (Adv.Train) miyato2016adversarial is a semi-supervised algorithm that optimizes the NLL loss on the original training samples plus adversarial texts.

  • Mixup Training (Mixup) zhang2017mixup trains the Base Models with data constructed by linear interpolation of two random training samples. In this work, we use Mixup to regularize a NN to behave linearly in-between the continuous embeddings of training samples (see the sketch after this list).

  • Robust Word Recognizer (ScRNN) pruthi2019combating detects and corrects potential adversarial perturbations or misspellings in input texts.

Note that, due to the GPU memory required to simultaneously train several BERT and ROBERTA sub-models, we can only compare against Ensemble, DT, and ADP in the cases of textCNN and textRNN.
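
As a reference point for the Mixup baseline, a minimal sketch of interpolating two (padded) embedded samples and their soft labels might look like this; alpha = 1.0 is an illustrative choice, not necessarily the setting used in the experiments.

```python
import torch

def mixup_embeddings(emb_a, emb_b, y_a, y_b, alpha=1.0):
    """Mixup in the continuous embedding space: interpolate two training
    samples and their one-hot labels with a Beta(alpha, alpha) coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * emb_a + (1 - lam) * emb_b, lam * y_a + (1 - lam) * y_b

# Two embedded sentences padded to the same length (16 tokens x 300 dims).
emb_a, emb_b = torch.randn(16, 300), torch.randn(16, 300)
y_a, y_b = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
mixed_emb, mixed_y = mixup_embeddings(emb_a, emb_b, y_a, y_b)
```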

Dataset  Clickbait  Subjectivity  Movie Reviews  Sentiment Treebank
textCNN 0.97 0.90 0.79 0.83
+SIENA 0.98 0.91 0.79 0.82
textRNN 0.97 0.91 0.78 0.79
+SIENA 0.98 0.92 0.79 0.82
BERT 0.99 0.96 0.85 0.91
+SIENA 0.99 0.96 0.86 0.91
ROBERTA 0.99 0.96 0.86 0.93
+SIENA 0.99 0.96 0.87 0.94
Table 4: Prediction Performance on Clean Test Sets
Model  #Params (Total)  #Params (Trainable)  Train (1 epoch)  Test (1 batch)
textCNN 8M 8M (100%) 0.030 0.002
+Adv.Train 8M 8M (100%) 0.083 0.002
+Mixup 8M 8M (100%) 0.030 0.002
+Ensemble 47M 47M (100%) 0.140 0.008
+DT 47M 47M (100%) 0.296 0.009
+ADP 47M 47M (100%) 0.148 0.008
+ScRNN 47M 47M (100%) 0.030 0.003
+SIENA 11M 3M (27%) 0.222 0.003
BERT 109M 109M (100%) 0.202 0.020
+Adv. Train 109M 109M (100%) 0.543 0.021
+Mixup 109M 109M (100%) 0.202 0.021
+Ensemble 547M 547M (100%) - -
+DT&ADP 547M 547M (100%) - -
+ScRNN 547M 547M (100%) 0.202 0.026
+SIENA 118M 9M (8%) 1.162 0.061
”-”: Unable to run due to limited GPU memory
Table 5: Averaged Running Time (in seconds) on the Sentiment Treebank Dataset and # of Parameters
Dataset Clickbait Subjectivity Movie Reviews Sentiment Treebank
Attack Type White Box Black Box White Box Black Box White Box Black Box White Box Black Box AVG
Attack HF* HF FS TF TB HF* HF FS TF TB HF* HF FS TF TB HF* HF FS TF TB
textCNN 0.47 0.97 0.45 0.58 0.42 0.45 0.89 0.10 0.39 0.16 0.16 0.76 0.05 0.17 0.07 0.15 0.80 0.07 0.18 0.08 0.37
+Ensemble 0.52 0.97 0.34 0.62 0.46 0.46 0.90 0.04 0.38 0.16 0.16 0.78 0.01 0.15 0.08 0.11 0.80 0.03 0.20 0.09 0.36
+DT 0.66 0.98 0.76 0.73 0.49 0.53 0.91 0.38 0.44 0.16 0.20 0.77 0.12 0.22 0.10 0.24 0.79 0.14 0.23 0.10 0.45
+ADP 0.64 0.97 0.47 0.64 0.50 0.45 0.90 0.03 0.31 0.13 0.15 0.77 0.03 0.13 0.06 0.22 0.79 0.03 0.17 0.08 0.37
+Mixup 0.55 0.96 0.76 0.54 0.43 0.50 0.90 0.53 0.30 0.13 0.13 0.74 0.41 0.12 0.06 0.17 0.79 0.54 0.13 0.07 0.44
+Adv.Train 0.65 0.97 0.56 0.74 0.47 0.50 0.90 0.08 0.46 0.16 0.14 0.78 0.08 0.21 0.07 0.12 0.82 0.06 0.26 0.09 0.41
+ScRNN 0.52 0.93 0.52 0.66 0.94 0.49 0.86 0.09 0.52 0.87 0.20 0.76 0.15 0.40 0.77 0.18 0.80 0.08 0.36 0.81 0.55
+SIENA 0.81 0.97 0.91 0.84 0.62 0.75 0.90 0.82 0.68 0.32 0.48 0.77 0.60 0.40 0.42 0.61 0.81 0.69 0.51 0.36 0.66
textRNN 0.77 0.97 0.76 0.67 0.52 0.67 0.90 0.40 0.45 0.22 0.36 0.77 0.47 0.30 0.11 0.40 0.78 0.54 0.33 0.17 0.53
+Ensemble 0.80 0.96 0.69 0.69 0.50 0.67 0.90 0.26 0.49 0.20 0.35 0.79 0.39 0.33 0.11 0.39 0.82 0.33 0.35 0.11 0.51
+DT 0.91 0.97 0.88 0.75 0.54 0.71 0.90 0.45 0.55 0.24 0.42 0.79 0.46 0.36 0.10 0.48 0.82 0.42 0.40 0.22 0.57
+ADP 0.71 0.96 0.79 0.63 0.52 0.67 0.90 0.39 0.48 0.23 0.39 0.79 0.39 0.33 0.10 0.43 0.81 0.41 0.35 0.11 0.52
+Mixup 0.61 0.95 0.86 0.56 0.60 0.56 0.76 0.63 0.34 0.21 0.35 0.75 0.55 0.12 0.09 0.40 0.79 0.66 0.15 0.13 0.50
+Adv.Train 0.87 0.97 0.84 0.82 0.51 0.75 0.91 0.22 0.53 0.20 0.49 0.80 0.28 0.32 0.11 0.51 0.83 0.29 0.35 0.10 0.54
+ScRNN 0.74 0.94 0.72 0.72 0.95 0.64 0.88 0.40 0.57 0.88 0.41 0.79 0.51 0.43 0.78 0.40 0.79 0.51 0.50 0.79 0.67
+SIENA 0.92 0.97 0.94 0.88 0.66 0.83 0.91 0.80 0.79 0.50 0.56 0.79 0.63 0.50 0.46 0.64 0.81 0.71 0.52 0.25 0.70
BERT 0.75 0.98 0.18 0.64 0.69 0.70 0.95 0.10 0.27 0.75 0.31 0.82 0.13 0.08 0.68 0.32 0.87 0.15 0.13 0.73 0.51
+Mixup 0.24 0.97 0.01 0.64 0.74 0.72 0.95 0.07 0.30 0.75 0.32 0.82 0.04 0.07 0.70 0.42 0.88 0.04 0.11 0.73 0.48
+Adv.Train 0.94 0.99 0.64 0.86 0.74 0.87 0.94 0.43 0.42 0.79 0.60 0.82 0.50 0.13 0.69 0.68 0.88 0.52 0.18 0.74 0.67
+ScRNN 0.77 0.97 0.21 0.76 0.92 0.70 0.93 0.16 0.59 0.88 0.36 0.79 0.16 0.36 0.75 0.39 0.84 0.19 0.45 0.73 0.59
+SIENA 0.89 0.99 0.83 0.92 0.77 0.88 0.95 0.59 0.69 0.75 0.62 0.84 0.54 0.37 0.72 0.72 0.90 0.57 0.50 0.76 0.74
ROBERTA 0.81 0.98 0.92 0.89 0.99 0.80 0.95 0.77 0.65 0.96 0.49 0.85 0.55 0.43 0.86 0.49 0.91 0.60 0.51 0.93 0.77
+Mix+Gau 0.30 0.99 0.14 0.93 0.99 0.78 0.95 0.51 0.65 0.96 0.46 0.85 0.41 0.57 0.88 0.39 0.92 0.28 0.60 0.93 0.67
+Adv.Train 0.93 0.99 0.95 0.91 0.99 0.89 0.95 0.79 0.65 0.96 0.62 0.88 0.66 0.48 0.89 0.64 0.94 0.72 0.54 0.95 0.82
+ScRNN 0.81 0.97 0.90 0.91 0.84 0.71 0.92 0.69 0.82 0.74 0.53 0.85 0.59 0.72 0.64 0.50 0.89 0.61 0.76 0.65 0.75
+SIENA 0.95 0.99 0.96 0.95 0.99 0.86 0.95 0.86 0.72 0.96 0.74 0.87 0.78 0.69 0.86 0.65 0.91 0.75 0.82 0.92 0.86
Bold, Red: the best and decreased results for each base model. Results are averaged across 3 runs in the case of textCNN and textRNN.
Table 6: Comparison of Prediction Accuracy under Adversarial Attacks

Implementation:

We train SIENA with K = 5 experts in all experiments. For each expert, we set M = 3 possible architectures: an FCN with 1, 2, or 3 hidden layer(s). For each dataset, we use a grid-search over all pairs of inverse-temperatures (τ_train, τ_test) used during training and inference, selecting the best pair according to their defense performance (the average of the FS and TB attacks for textCNN and textRNN, and solely the FS attack for BERT and ROBERTA) on the validation set. We then report the performance of the best single model across all attacks on the test set. Our Appendix includes all details on the models' parameters, random seeds, and implementation. The source code of SIENA will be released upon the acceptance of this paper.

5.2 Evaluation on Fidelity

We first evaluate SIENA’s prediction performance without adversarial attacks. Table 4 shows that all base models patched by SIENA either maintain a similar (e.g., for textCNN) or better (e.g., for textRNN, BERT, ROBERTA) accuracy across all datasets. Except for ScRNN, which observes a consistent decrease in performance on clean datasets, the other baselines maintain more or less the same accuracy as the base models (Appendix, Table 10). This shows that misspelling-based defenses such as ScRNN might accidentally correct unintentional errors in the input text, which might lead to undesired predictions.

5.3 Evaluation on Complexity

Table 5 shows the computational complexity of all methods in terms of memory and running time. Regarding space complexity, SIENA extends a NN model into an ensemble with only a small increase in the # of parameters: with N being the # of parameters of the base model, SIENA adds only a small patch on top of N (e.g., 9M trainable parameters on a 118M-parameter patched BERT), while Ensemble and DT replicate the base model for every sub-model (e.g., 547M parameters for BERT). During training, SIENA only updates the patched parameters, while the other defense methods, including the ones using data augmentation, update all of them. Specifically, SIENA trains only 27% of its total parameters and 6.4% of the parameters of the other ensemble baselines in the case of textCNN. These figures drop to 8% and 1.6%, respectively, in the case of BERT. As for the running time of textCNN, even though SIENA takes nearly twice as long as the classical Ensemble model during training, it is still more time-efficient than DT. During inference, SIENA is also about 3 times faster than the previous ensemble-based approaches on average.
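
The trainable-parameter percentages in Table 5 can be reproduced for any patched model with a generic helper like the following (ours, not the authors' script):

```python
import torch.nn as nn

def param_stats(model: nn.Module) -> str:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{total:,} total / {trainable:,} trainable ({100 * trainable / total:.0f}%)"

# E.g., a patched toy model where only the last layer stays trainable.
m = nn.Sequential(nn.Linear(300, 128), nn.Linear(128, 2))
for p in m[0].parameters():
    p.requires_grad = False
print(param_stats(m))   # -> "38,786 total / 258 trainable (1%)"
```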

5.4 Evaluation on Robustness

Table 6 summarizes the performance of all defense methods under adversarial attacks. Overall, SIENA achieves the best robustness across all attacks and datasets. On average, SIENA observes an absolute improvement of about 20% in accuracy over the base models and nearly 6.3% over the second-best defense algorithms. Specifically, SIENA significantly outperforms the other model enhancement approaches under the black-box TF and TB attacks, achieving as much as a 20-30% absolute increase in accuracy on Movie Reviews and Sentiment Treebank. Even though SIENA is only second-best after the misspelling-based ScRNN method in defending against the character-based TB attack, SIENA is more versatile, effectively defending against both character and word-based adversarial texts.

Different from black-box attacks, white-box attacks are more challenging to defend against because attackers have access to the input’s gradients back-propagated from the target model. Nevertheless, Table 6 shows that SIENA performs well even under white-box attacks. The stochastic design of SIENA further reinforces the effects of ensemble diversity on adversarial robustness by obscuring the gradient flow exploited by the adversaries. This is especially shown in the absolute accuracy gains of 70% and 40% that SIENA achieves under the FS attack in the cases of textCNN and BERT. Furthermore, even when combined with Adversarial Training, SIENA still outperforms the other baselines, including the Interpolated Adversarial Training (Mix+ADV) proposed by lamb2019interpolated, by large margins (Table 7). Due to space limitations, we refer the readers to Table 12 (Appendix) for results on all datasets.

Figure 3: Prediction Logits of DT (top) and SIENA (bottom)’s Heads w.r.t Decision Threshold (blue line) on Clickbait Dataset

Figure 4: Divergence among Gradients of Heads

Figure 5: Distribution of Prediction Logits among Heads on Sentiment Treebank Dataset
Dataset Movie Reviews Sentiment Treebank
Attack Type White Box Black Box White Box Black Box
Attack Method HF* FS TF TB HF* FS TF TB
textCNN 0.19 0.04 0.15 0.07 0.09 0.07 0.16 0.07
+ADV 0.16 0.09 0.22 0.07 0.22 0.09 0.27 0.09
+DT+ADV 0.14 0.38 0.32 0.08 0.14 0.35 0.36 0.16
+ADP+ADV 0.14 0.46 0.35 0.09 0.08 0.31 0.35 0.11
+Mix+ADV 0.22 0.32 0.15 0.09 0.34 0.40 0.19 0.09
+SIENA+ADV 0.64 0.65 0.55 0.46 0.38 0.72 0.37 0.14
BERT 0.31 0.13 0.08 0.68 0.32 0.15 0.13 0.73
+ADV 0.60 0.50 0.13 0.69 0.68 0.52 0.18 0.74
+Mix+ADV 0.56 0.32 0.16 0.71 0.62 0.35 0.19 0.73
+SIENA+ADV 0.63 0.53 0.67 0.72 0.71 0.60 0.43 0.74
Table 7: Performance when Combined with Adv.Train

6 Discussion

6.1 Diversity of Experts

Table 1 shows examples of how the experts of an ensemble model learned by SIENA can achieve all three levels of expertise: class, feature, and instance-level. On the class-level expertise, Table 1B illustrates that experts detect a specific clickbait in different ways, either by promoting clickbaitness (Head #1), discounting non-clickbaitness (Head #2), or not participating in the prediction (Head #3). Figure 3 also displays an almost orthogonal distribution of prediction logits from the learned experts. On the feature-level expertise, Figure 4 shows the distribution of pairwise cosine-similarity and ℓ2 distance among the input gradients back-propagated through different experts. Different from DT, the experts produced by SIENA are diverse not only in terms of directions (via cosine-similarity) but also in terms of magnitudes (via ℓ2 distance). This means that even if two experts focus on the same word in a sentence, attacking one of them might not necessarily discount the other. Finally, on the instance-level expertise, while sub-models learned by previous ensemble methods participate equally in the final prediction, those learned by SIENA are assigned different portions of positive and negative samples during inference (Figure 3).
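
The two divergence statistics in Figure 4 can be computed from per-head input gradients with a snippet along these lines (assumed shapes; random tensors stand in for real gradients):

```python
import torch
import torch.nn.functional as F

def divergence_stats(grad_a, grad_b):
    """Direction vs. magnitude divergence between two heads' input gradients
    (one flattened gradient vector per example)."""
    cos = F.cosine_similarity(grad_a, grad_b, dim=1)   # direction agreement
    l2 = torch.norm(grad_a - grad_b, dim=1)            # length/offset gap
    return cos.mean().item(), l2.mean().item()

g1, g2 = torch.randn(32, 300), torch.randn(32, 300)   # stand-in gradients
print(divergence_stats(g1, g2))
```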

6.2 Parameter Sensitivity Analysis

Training SIENA requires the hyper-parameters K (the # of heads) and the inverse-temperature τ. We observe that the remaining hyper-parameters, such as the multiplier λ, can be set arbitrarily and work well across all of the experiments (Sec. B.3). Though we did not observe any clear pattern in the effect of K on robustness, K = 5 performs well across all attacks (Appendix, Figure 6). On the contrary, different pairs of the inverse-temperature used during training and inference yield varied performance on different datasets (Appendix, Figure 7). Yet, we observe that a moderate value of τ performs well on average. By setting τ during inference, we can control how many experts are used for prediction. Figure 5 shows that decreasing τ enables more experts in the final predictions. With K = 5, a suitable inference-time τ corresponds to around 3 experts being actively utilized for each input.

6.3 Ablation Test

This section tests SIENA without the SE and the ME module. Table 8 shows that SE and ME perform differently across datasets. Overall, the SE module performs better under black-box attacks, while the ME module performs better under white-box attacks. The full SIENA model outperforms the two complementary modules, with up to a 9% and 5% increase in accuracy on average across all datasets in the cases of textCNN and BERT, respectively, compared to the best individual module (Appendix, Table 13).

Dataset Clickbait Movie Reviews AVG
Attack HF* FS TF TB HF* FS TF TB
textCNN 0.49 0.49 0.59 0.46 0.19 0.04 0.15 0.07 0.31
+SIENA\ME 0.56 0.57 0.71 0.68 0.46 0.67 0.37 0.32 0.54
+SIENA\SE 0.79 0.81 0.76 0.62 0.21 0.19 0.16 0.07 0.45
+SIENA 0.92 0.96 0.84 0.61 0.63 0.75 0.37 0.46 0.70
BERT 0.75 0.18 0.64 0.69 0.31 0.13 0.08 0.68 0.43
+SIENA\ME 0.86 0.52 0.89 0.86 0.48 0.40 0.42 0.76 0.65
+SIENA\SE 0.83 0.96 0.86 0.76 0.38 0.33 0.09 0.69 0.61
+SIENA 0.89 0.83 0.92 0.77 0.62 0.54 0.37 0.72 0.71
SIENA\ME, SIENA\SE: SIENA without the ME or the SE module, respectively.
Table 8: Ablation Test

7 Limitation and Future Work

In this paper, we limit the architecture of each expert to an FCN with a maximum of 3 hidden layers (on top of the base model F). As we include more options for the expert architectures (e.g., CNN, attention), the outcome will become more diverse. The design of SIENA is model-agnostic and is also applicable to other complex and large-scale NNs such as transformer-based models. Especially with the recent adoption of the transformer architecture in both NLP and computer vision carion2020end; chen2020generative, potential future works include extending SIENA to patch other complex NN models (e.g., T5 raffel2019exploring) or other tasks and domains such as Q&A and language generation.

8 Conclusion

This paper proposes a novel algorithm, named SIENA, that consistently improves the adversarial robustness of textual NN models under both white-box and black-box attacks by upgrading and re-training only their last layer. By extending a single model into an ensemble of multiple experts that are diverse at the feature-, class-, and instance-levels, SIENA achieves as much as a 20% improvement in accuracy on average across 5 attacks on 4 NLP datasets. Thanks to its model-agnostic design, SIENA can help improve adversarial robustness in NLP and other domains.

References

Appendix A Technical Appendix

Figure 6: Effects of the # of Heads K on the Prediction Performance; a single head corresponds to the Base Model.
Attack  HF  FS  TB
  Coef  p-value  Coef  p-value  Coef  p-value
Cosine  -0.33  0.148*  -0.50  0.061**  -0.12  0.624
ℓ2  8.42  0.152*  11.76  0.087**  7.63  0.257*
*: p-value < 0.5; **: p-value < 0.1
Table 9: Linear Analysis on Gradient Divergence among Heads vs. Adversarial Robustness under Different Attacks
Model/Dataset CB SJ MR SST
textCNN 0.97 0.90 0.79 0.83
+Ensemble 0.97 0.91 0.80 0.83
+DT 0.98 0.92 0.78 0.82
+ADP 0.98 0.91 0.79 0.81
+Mixup 0.97 0.90 0.76 0.81
+Adv.Train 0.98 0.91 0.79 0.84
+ScRNN 0.94 0.87 0.79 0.82
+SIENA 0.98 0.91 0.79 0.82
textRNN 0.97 0.91 0.78 0.79
+Ensemble 0.97 0.91 0.79 0.83
+DT 0.97 0.90 0.80 0.82
+ADP 0.97 0.91 0.79 0.82
+Mixup 0.96 0.78 0.78 0.82
+Adv.Train 0.98 0.92 0.82 0.85
+ScRNN 0.95 0.89 0.79 0.80
+SIENA 0.98 0.92 0.79 0.82
BERT 0.99 0.96 0.85 0.91
+Mixup 0.99 0.88 0.86 0.92
+Adv.Train 0.99 0.95 0.85 0.91
+ScRNN 0.97 0.94 0.83 0.87
+SIENA 0.99 0.96 0.86 0.92
ROBERTA 0.99 0.96 0.86 0.93
+Mixup 0.99 0.96 0.86 0.93
+Adv.Train 0.99 0.96 0.89 0.95
+ScRNN 0.98 0.94 0.86 0.92
+SIENA 0.99 0.96 0.87 0.94
Table 10: Accuracy without Attacks on Clickbait (CB), Subjectivity (SJ), Movie Reviews (MR) and Sentiment Treebank (SST) Dataset
Dataset Clickbait Subjectivity Movie Reviews Sentiment Treebank
Attack Type White Box Black Box White Box Black Box White Box Black Box White Box Black Box AVG
Attack HF* HF FS TF TB HF* HF FS TF TB HF* HF FS TF TB HF* HF FS TF TB
textCNN 0.46 0.97 0.42 0.58 0.41 0.43 0.89 0.10 0.38 0.15 0.16 0.76 0.05 0.16 0.07 0.15 0.80 0.07 0.18 0.08 0.36
+Ensemble 0.50 0.97 0.30 0.62 0.44 0.44 0.90 0.04 0.37 0.15 0.15 0.78 0.01 0.15 0.08 0.11 0.80 0.03 0.20 0.09 0.36
+DT 0.65 0.98 0.76 0.73 0.48 0.52 0.91 0.38 0.44 0.15 0.19 0.77 0.11 0.21 0.09 0.23 0.79 0.13 0.23 0.09 0.44
+ADP 0.63 0.97 0.45 0.64 0.48 0.44 0.90 0.03 0.31 0.12 0.14 0.77 0.03 0.13 0.06 0.22 0.79 0.03 0.17 0.08 0.37
+Mixup 0.55 0.96 0.76 0.53 0.43 0.50 0.90 0.53 0.30 0.12 0.13 0.74 0.40 0.12 0.06 0.17 0.79 0.54 0.13 0.07 0.44
+Adv.Train 0.64 0.97 0.56 0.74 0.47 0.49 0.90 0.08 0.46 0.15 0.13 0.78 0.08 0.21 0.07 0.12 0.82 0.06 0.25 0.09 0.40
+ScRNN 0.47 0.93 0.52 0.66 0.94 0.46 0.86 0.09 0.50 0.86 0.19 0.76 0.14 0.35 0.77 0.18 0.80 0.08 0.34 0.81 0.54
+SIENA 0.80 0.97 0.91 0.84 0.61 0.75 0.90 0.82 0.67 0.29 0.48 0.77 0.60 0.39 0.41 0.61 0.81 0.69 0.51 0.35 0.66
textRNN 0.77 0.97 0.76 0.66 0.50 0.67 0.90 0.40 0.45 0.21 0.36 0.77 0.47 0.30 0.11 0.39 0.78 0.54 0.33 0.16 0.53
+Ensemble 0.79 0.96 0.69 0.69 0.48 0.67 0.90 0.26 0.49 0.19 0.35 0.79 0.38 0.33 0.11 0.39 0.82 0.33 0.35 0.11 0.50
+DT 0.91 0.97 0.88 0.75 0.52 0.70 0.90 0.44 0.54 0.22 0.41 0.79 0.46 0.36 0.10 0.47 0.82 0.41 0.40 0.19 0.56
+ADP 0.69 0.96 0.79 0.62 0.51 0.67 0.90 0.38 0.47 0.21 0.39 0.79 0.39 0.32 0.09 0.43 0.81 0.41 0.35 0.11 0.52
+Mixup 0.60 0.95 0.86 0.55 0.59 0.52 0.73 0.59 0.33 0.19 0.34 0.75 0.55 0.12 0.09 0.39 0.79 0.66 0.15 0.13 0.49
+Adv.Train 0.87 0.97 0.84 0.82 0.50 0.75 0.91 0.21 0.52 0.19 0.49 0.80 0.28 0.32 0.11 0.51 0.83 0.29 0.35 0.10 0.53
+ScRNN 0.73 0.94 0.72 0.72 0.95 0.63 0.88 0.39 0.56 0.88 0.41 0.79 0.51 0.42 0.78 0.39 0.78 0.50 0.47 0.79 0.66
+SIENA 0.92 0.97 0.94 0.88 0.65 0.83 0.91 0.80 0.79 0.48 0.56 0.79 0.63 0.50 0.46 0.63 0.81 0.71 0.52 0.25 0.70
BERT 0.75 0.98 0.18 0.64 0.69 0.70 0.95 0.10 0.27 0.75 0.30 0.82 0.12 0.08 0.68 0.31 0.87 0.14 0.12 0.73 0.51
+Mixup 0.24 0.97 0.01 0.64 0.73 0.72 0.95 0.07 0.29 0.75 0.31 0.82 0.04 0.07 0.69 0.42 0.88 0.04 0.11 0.73 0.47
+Adv.Train 0.94 0.99 0.63 0.86 0.74 0.87 0.94 0.42 0.42 0.79 0.60 0.82 0.44 0.13 0.69 0.68 0.88 0.51 0.17 0.74 0.66
+ScRNN 0.77 0.97 0.21 0.76 0.92 0.70 0.93 0.14 0.57 0.88 0.36 0.79 0.15 0.36 0.75 0.37 0.84 0.17 0.43 0.72 0.59
+SIENA 0.89 0.99 0.83 0.92 0.77 0.88 0.95 0.58 0.69 0.75 0.62 0.84 0.53 0.37 0.72 0.72 0.90 0.57 0.48 0.76 0.74
ROBERTA 0.81 0.98 0.91 0.89 0.99 0.80 0.95 0.77 0.65 0.96 0.49 0.85 0.54 0.42 0.86 0.49 0.91 0.60 0.51 0.93 0.77
+Mix+Gau 0.33 0.33 0.33 0.33 0.33 0.78 0.95 0.50 0.65 0.96 0.33 0.33 0.33 0.33 0.33 0.33 0.33 0.33 0.33 0.33 0.44
+Adv.Train 0.93 0.99 0.95 0.91 0.99 0.89 0.95 0.78 0.65 0.96 0.62 0.88 0.66 0.47 0.89 0.64 0.94 0.72 0.54 0.95 0.81
+ScRNN 0.81 0.97 0.90 0.91 0.84 0.71 0.92 0.68 0.81 0.73 0.53 0.85 0.58 0.72 0.64 0.50 0.89 0.60 0.76 0.65 0.75
+SIENA 0.95 0.99 0.96 0.95 0.99 0.86 0.95 0.86 0.72 0.96 0.74 0.87 0.77 0.69 0.86 0.65 0.91 0.75 0.82 0.92 0.86
Bold, Red: the best and decreased results for each base model. Results are averaged across 3 runs in the case of textCNN and textRNN.
Table 11: Comparison of Prediction F1 under Adversarial Attacks
Dataset Clickbait Subjectivity Movie Reviews Sentiment Treebank
Attack Type White Box Black Box White Box Black Box White Box Black Box White Box Black Box AVG
Attack HF* FS TF TB HF* FS TF TB HF* FS TF TB HF* FS TF TB
textCNN 0.47 0.45 0.58 0.42 0.45 0.10 0.39 0.16 0.16 0.05 0.17 0.07 0.15 0.07 0.18 0.08 0.25
+DT+ADV 0.67 0.90 0.83 0.53 0.57 0.38 0.56 0.22 0.14 0.38 0.32 0.08 0.14 0.35 0.36 0.16 0.41
+ADP+ADV 0.83 0.92 0.83 0.53 0.51 0.62 0.51 0.20 0.14 0.46 0.35 0.09 0.08 0.31 0.35 0.11 0.43
+Mixup+ADV 0.70 0.75 0.65 0.47 0.59 0.46 0.37 0.16 0.22 0.32 0.15 0.09 0.34 0.40 0.19 0.09 0.37
+SIENA+ADV 0.86 0.90 0.79 0.49 0.64 0.85 0.63 0.22 0.64 0.65 0.55 0.46 0.38 0.72 0.37 0.14 0.58
BERT 0.75 0.18 0.64 0.69 0.70 0.10 0.27 0.75 0.31 0.13 0.08 0.68 0.32 0.15 0.13 0.73 0.41
+Mixup+ADV 0.69 0.10 0.78 0.82 0.88 0.45 0.52 0.77 0.56 0.32 0.16 0.71 0.62 0.35 0.19 0.73 0.54
+SIENA+ADV 0.90 0.64 0.91 0.85 0.62 0.51 0.50 0.63 0.63 0.53 0.67 0.72 0.71 0.60 0.43 0.74 0.66
Table 12: Prediction Performance Accuracy when Combined with Adv.Train (Full)
Dataset Clickbait Subjectivity Movie Reviews Sentiment Treebank
Attack Type White Box Black Box White Box Black Box White Box Black Box White Box Black Box AVG
Attack HF* FS TF TB HF* FS TF TB HF* FS TF TB HF* FS TF TB
textCNN 0.47 0.45 0.58 0.42 0.45 0.10 0.39 0.16 0.16 0.05 0.17 0.07 0.15 0.07 0.18 0.08 0.25
+SIENA\ME 0.56 0.57 0.71 0.68 0.65 0.74 0.65 0.51 0.46 0.67 0.37 0.32 0.23 0.36 0.45 0.45 0.52
+SIENA\SE 0.79 0.81 0.76 0.62 0.56 0.46 0.42 0.22 0.21 0.19 0.16 0.07 0.18 0.28 0.21 0.09 0.38
+SIENA 0.81 0.91 0.84 0.62 0.75 0.82 0.68 0.32 0.48 0.60 0.40 0.42 0.61 0.69 0.51 0.36 0.61
BERT 0.75 0.18 0.64 0.69 0.70 0.10 0.27 0.75 0.31 0.13 0.08 0.68 0.32 0.15 0.13 0.73 0.41
+SIENA\ME 0.86 0.52 0.89 0.86 0.81 0.51 0.80 0.93 0.48 0.40 0.42 0.76 0.51 0.35 0.55 0.77 0.65
+SIENA\SE 0.83 0.96 0.86 0.76 0.87 0.87 0.67 0.83 0.38 0.33 0.09 0.69 0.55 0.57 0.16 0.73 0.64
+SIENA 0.89 0.83 0.92 0.77 0.88 0.59 0.69 0.75 0.62 0.54 0.37 0.72 0.72 0.57 0.50 0.76 0.70
Table 13: Ablation Test (Full)
Dataset  Movie Reviews  Sentiment Treebank  AVG
Attack Method HF* HF FS HF* HF FS
textCNN 0.49 0.78 0.63 0.43 0.81 0.68 0.64
+Adv.Train 0.48 0.78 0.62 0.44 0.78 0.66 0.63
+DT 0.51 0.79 0.65 0.44 0.81 0.66 0.64
+ADP 0.53 0.75 0.54 0.47 0.79 0.66 0.62
+Mixup 0.49 0.78 0.64 0.47 0.82 0.69 0.65
+SIENA 0.50 0.79 0.64 0.44 0.82 0.69 0.65
textRNN 0.39 0.78 0.10 0.36 0.77 0.04 0.41
+Adv.Train 0.41 0.80 0.01 0.28 0.81 0.02 0.39
+DT 0.34 0.74 0.07 0.27 0.80 0.04 0.38
+ADP 0.41 0.79 0.40 0.28 0.81 0.02 0.45
+Mixup 0.45 0.77 0.34 0.32 0.79 0.34 0.50
+SIENA 0.41 0.75 0.16 0.35 0.77 0.05 0.42
Table 14: Black-Box Performance under HotFlip and FGSM Attacks

a.1 The role of inverse-temperature

Hyper-parameter ( Sec. 4.1) gives us the flexibility to control the sharpness of the probability vector . When , to get closer to one-hot encoded vector, i.e., use only one head at a time. When ,

gets closer to a uniform distribution, i.e., utilize every heads equally. Figure

7 shows the performance matrix among different pairs of used during training and inference, denoted as and , respectively. Even though we cannot derive a clear pattern from the figure, their performance varies according to both the tested dataset and attack methods. Nevertheless, a value works well in most of the cases (Table 15).
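
A toy demo of this behavior, using the softmax(τ·(logits + Gumbel noise)) convention assumed in Sec. 4.1:

```python
import torch

def sample_pi(logits, tau):
    """Head weights pi = softmax(tau * (logits + Gumbel noise))."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    return torch.softmax(tau * (logits + gumbel), dim=-1)

logits = torch.zeros(5)             # 5 heads, no prior preference
print(sample_pi(logits, tau=10.0))  # large tau: near one-hot, one expert active
print(sample_pi(logits, tau=0.1))   # small tau: near uniform, all experts share
```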

a.2 Linear Analysis on Gradient Divergence

Table 9 shows the linear analysis of the divergence of the gradient vectors among heads versus adversarial robustness under both white-box and black-box attacks. Overall, we observe that not only the cosine-similarity, i.e., the difference in directions, but also the ℓ2 distance, i.e., the difference in lengths, among the gradient vectors correlates with adversarial robustness. This further justifies the use of the ℓ2 distance to maximize the diversity among heads as introduced in Sec. 4.2.

a.3 Additional Evaluation on Fidelity

In addition to Table 4, Table 10 shows the prediction performance of all defense methods on the test sets without adversarial attacks. Overall, all baselines perform more or less the same as the base models, except for the ScRNN algorithm, which observes a decrease in accuracy in many cases.

a.4 Additional Evaluation on Robustness

In addition to Table 6, Table 11 shows the prediction performance of all defense methods under adversarial attacks in terms of F1 score. Since all of the datasets are balanced, we do not see many differences in the rankings among the defense algorithms whether prediction performance is measured in Accuracy or F1. In addition to Table 7, Table 12 shows the prediction performance of all defense methods when they are combined with Adversarial Training across all datasets.

a.5 Additional Ablation Test

In addition to Table 8, Table 13 shows the ablation tests on all datasets. We observe that the SE module performs better than the ME module under black-box attacks, while the ME module performs better under white-box attacks, especially in the case of BERT. Overall, the combination of the two modules, i.e., the full SIENA model, performs better than the individual component modules. This shows that the SE and ME modules complement each other to improve the adversarial robustness of the target NN model.

a.6 Transferability for White-Box Attack

Table 14 shows the transferability performance for the white-box attacks HF and FS. To attack CNN-based models, we use adversarial texts generated against a surrogate white-box textRNN model; conversely, to attack RNN-based models, we use adversarial texts generated against a surrogate white-box textCNN model. We observe no clear differences in prediction performance among the defense methods in the case of textCNN. In the case of textRNN, Mixup performs best under transferred white-box attacks on average. In fact, only SIENA and Mixup are able to improve the overall adversarial robustness of the base models for both textCNN and textRNN. Evaluation of adversarial robustness under adversarial texts transferred from a surrogate white-box attack depends on many factors; thus, future works are needed to draw any meaningful conclusions.

Appendix B Reproducibility

b.1 Software, Hardware, Dataset, Random Seed, and Source Code

  • Software: all the implementations are written in Python (v3.7) with PyTorch (v1.5.1), Numpy (v1.19.1), and Scikit-learn (v0.21.3). We rely on the Transformers (v3.0.2) library for loading and training transformer-based models (e.g., BERT, ROBERTA).

  • Hardware: we run all of the experiments on standard server machines installed with Ubuntu OS (v18.04), 20-Core Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 93GB of RAM and a Titan Xp GPU.

  • Dataset: all of the datasets are publicly available.

  • Random Seed: to ensure reproducibility, we set a consistent random seed using the torch.manual_seed and np.random.seed functions for all experiments. For the averaged results in the case of textCNN and textRNN (Table 4, Table 6, Table 10, Table 11), we randomly selected three different random seeds.

  • Source Code: we will also release the source code of SIENA upon the acceptance of this paper.

b.2 Experimental Settings for Base Models

Architectures and Parameters

  • textCNN: We implement the CNN sentence classification model kim2014convolutional with three 2D CNN layers, each of which is followed by a Max-Pooling layer. The concatenation of the outputs of all Max-Pooling layers is fed into a Dropout layer with 0.5 probability, then an FCN + softmax layer for prediction. We use an Embedding layer of size 300 with a pre-trained GloVe embedding-matrix to transform each discrete text token into continuous input features before feeding them into the CNN network. Each of the three CNN layers uses 150 kernels with sizes of 2, 3, and 4, respectively.

  • textRNN: Because the original PyTorch implementation of RNN does not support the double back-propagation on CuDNN required by DT and SIENA, to run the model on a GPU we use a publicly available Just-in-Time (JIT) version of a GRU with one hidden layer as the main RNN cell. We use an Embedding layer of size 300 with a pre-trained GloVe embedding-matrix to transform each discrete text token into continuous input features before feeding them into the RNN layer. We flatten all the outputs of the RNN layer and feed them into a Dropout layer with 0.5 probability, then an FCN + softmax layer for prediction.

  • BERT & ROBERTA: We use the transformers library from HuggingFace to fine-tune BERT and ROBERTA model. We use the bert-base-uncased version of BERT and the roberta-base version of ROBERTA.

Vocabulary and Input Length

Due to limited GPU memory, we set the maximum input length for transformer-based models, i.e., BERT and ROBERTA, to 128 during training. During testing, we limit this length to 32 for ROBERTA-based models to reduce computational overhead. For textCNN and textRNN-based models, we limit the vocabulary size; for BERT and ROBERTA-based models, we use all of the vocabulary tokens provided by the pre-trained models.

b.3 Experimental Settings for Defense Methods

  1. SIENA: we search for τ_train and τ_test using a grid-search strategy, with each value drawn from a fixed set of candidates. In the case of textCNN and textRNN-based SIENA models, the best values of τ_train and τ_test are selected using the averaged prediction performance (Accuracy) under the FS and TB attacks on the validation set. To reduce computational overhead, for BERT and ROBERTA-based SIENA models the best values are instead selected using the performance solely under the FS attack on the validation set. Table 15 shows the best values found for τ_train and τ_test. The remaining hyper-parameters, including the multiplier λ, are set arbitrarily and work well across all datasets.

  2. Ensemble: we train an ensemble model of 5 sub-models, all of which have the same architecture as the base model. We use the average loss of all sub-models as the final loss to train the model.

  3. DT: we follow the implementation described in Section 3 of the original paper kariyappa2019improving and train an ensemble DT model with 5 sub-models, all of which have the same architecture as the base model. We set its hyper-parameter as suggested by the original paper.

  4. ADP: we follow the implementation described in Section 3 of the original paper pang2019improving and train an ensemble ADP model with 5 sub-models, all of which have the same architecture as the base model. We set the hyper-parameters required by ADP to the default values suggested by the original implementation.

  5. Mixup Training (Mix): we sample the interpolation coefficient as suggested by the implementation provided with the original paper zhang2017mixup.

  6. Adversarial Training: we use a 1:1 ratio between original and adversarial training samples, as suggested by miyato2016adversarial.

  7. ScRNN: we use the implementation and pre-trained model provided by the original paper pruthi2019combating.

Model  Parameter  CB  SJ  MR  SST
textCNN+SIENA  τ_train  1e-2  1  1  1e-1
  τ_test  1e-1  1e-1  1e-1  1e-1
textRNN+SIENA  τ_train  1e-2  1e-2  1  1e-2
  τ_test  1  1  1e-1  1e-3
BERT+SIENA  τ_train  1e-1  1e-1  1e-2  1e-1
  τ_test  1  1e-3  1e-1  1e-2
ROBERTA+SIENA  τ_train  1e-1  1e-2  1e-1  1e-1
  τ_test  1e-1  1e-1  1e-2  1e-3
Table 15: Final Hyper-Parameters for the Selected Best SIENA Models
Figure 7: τ during training (x-axis) and inference (y-axis) w.r.t. Performance under Different Attack Methods on the Clickbait (CB), Subjectivity (SJ), Movie Reviews (MR), and Sentiment Treebank (SST) Test Sets

b.4 Experimental Settings for Attack Methods

  1. HotFlip (HF* and HF): we adopt the implementation from the AllenNLP repository. We also ensure that the cosine similarity between the embedding of the replacement word and that of the original token is larger than a threshold of 0.8 (see the snippet after this list).

  2. FGSM (FS): we implement the FS attack following Algorithm 1 of the original paper behjati2019universal. We use the same perturbation magnitude for all experiments.

  3. TextFooler (TF): we adopt the implementation provided by the original paper jin2019bert.

  4. TextBugger (TB): we implement the black-box version of the TB attack following Algorithms 2 and 3 of the original paper li2018textbugger.
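
The cosine-similarity constraint in item 1 can be checked with a snippet like the following (toy embedding matrix; the 0.8 threshold is from the text above):

```python
import torch
import torch.nn.functional as F

def allowed_flip(emb_matrix, orig_id, cand_id, threshold=0.8):
    """Accept a HotFlip replacement only if its embedding is cosine-similar
    enough to the original token's embedding."""
    sim = F.cosine_similarity(emb_matrix[orig_id], emb_matrix[cand_id], dim=0)
    return bool(sim >= threshold)

vocab = torch.randn(1000, 64)       # toy embedding matrix
print(allowed_flip(vocab, orig_id=3, cand_id=7))
```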