Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples

03/03/2018 ∙ by Minhao Cheng, et al. ∙ University of California-Davis ∙ IBM

Crafting adversarial examples has become an important technique to evaluate the robustness of deep neural networks (DNNs). However, most existing works focus on attacking the image classification problem, since its input space is continuous and its output space is finite. In this paper, we study the much more challenging problem of crafting adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and whose outputs have an almost infinite number of possibilities. To address the challenges caused by the discrete input space, we propose a projected gradient method combined with group lasso and gradient regularization. To handle the almost infinite output space, we design novel loss functions to conduct the non-overlapping attack and the targeted keyword attack. We apply our algorithm to machine translation and text summarization tasks, and verify the effectiveness of the proposed algorithm: by changing fewer than 3 words, we can make a seq2seq model produce the desired outputs with high success rates. On the other hand, we recognize that, compared with the well-evaluated CNN-based classifiers, seq2seq models are intrinsically more robust to adversarial attacks.

1 Introduction

Adversarial attacks on deep neural networks (DNNs) aim to slightly modify the inputs of DNNs and mislead them into making wrong predictions [1, 2]. Crafting such attacks has become a common approach to evaluating the robustness of DNNs [3, 4, 5]: generally speaking, the more easily an adversarial example can be generated, the less robust the DNN model is. However, models designed for different tasks are not created equal: some tasks are strictly harder to attack than others. For example, attacking an image is much easier than attacking a text string, since the image space is continuous and the adversary can make arbitrarily small changes to the input. Therefore, even if most of the pixels of an image have been modified, the perturbations can still be imperceptible to humans when the accumulated distortion is not large. In contrast, text strings live in a discrete space, and word-level manipulations may significantly change the meaning of the text. In this scenario, an adversary should change as few words as possible, and this limitation induces a sparsity constraint on word-level changes. Likewise, attacking a classifier should also be much easier than attacking a model with sequence outputs. Unlike the classification problem, which has a finite set of discrete class labels, the output space of sequences may have an almost infinite number of possibilities. If we treat each sequence as a label, a targeted attack needs to find a specific one among an enormous number of possible labels, leaving a nearly zero volume in the search space. This may explain why most existing works on adversarial attacks focus on the image classification task, since its input space is continuous and its output space is finite.

In this paper, we study the harder problem of crafting adversarial examples for sequence-to-sequence (seq2seq) models [6]. This problem is challenging since it combines both aforementioned difficulties, i.e., discrete inputs and sequence outputs with an almost infinite number of possibilities. We choose this problem not only because it is challenging, but also because seq2seq models are widely used in many safety- and security-sensitive applications, e.g., machine translation [7, 8], text summarization [9], and speech recognition [10], so measuring their robustness is critical. Specifically, we aim to examine the following questions in this study:

  1. Is it possible to slightly modify the inputs of seq2seq models while significantly changing their outputs?

  2. Are seq2seq models more robust than the well-evaluated CNN-based image classifiers?

We provide an affirmative answer to the first question by developing an effective adversarial attack framework called Seq2Sick. It is an optimization-based framework that aims to learn an input sequence that is close enough to the original sequence while leading to the desired outputs with high confidence. To address the challenges caused by the discrete input space, we propose to use the projected gradient descent method combined with group lasso and gradient regularization. To address the challenges of the almost infinite output space, we design novel loss functions for the tasks of non-overlapping attack and targeted keyword attack. Our experimental results show that the proposed framework yields high success rates in both tasks. However, even though the proposed approach can successfully attack seq2seq models, our answer to the second question is also “Yes”. Compared with CNN-based classifiers, which are highly sensitive to adversarial examples, seq2seq models are intrinsically more robust, since their input space is discrete and their output space is exponentially large. As a result, adversarial examples for seq2seq models usually have larger distortions and are more perceptible than the adversarial examples crafted for CNN-based image classifiers.

In summary, our main contributions are as follows:

  • We propose Seq2Sick, a novel optimization-based approach to craft adversarial sequences to fool word-level seq2seq models. Seq2Sick provides two types of attacks: non-overlapping attack to change all the words in the output sentence, and targeted keyword attack to insert malicious keywords into the output.

  • To handle discrete input space, we propose a modified projected gradient-based method with group lasso regularization to enforce the sparsity in terms of the number of changed words. We also define corresponding loss functions on output sequence for each type of attack.

  • We conduct experiments on text summarization and machine translation tasks. By changing fewer than 3 words in input sequences, we can fool seq2seq models into producing malicious outputs with high success rates. On the other hand, we recognize that seq2seq models are more robust to adversarial examples than CNN-based image classifiers.

2 Related Work

Most existing work on adversarial attack focuses on CNN-based models, while only a few studies discuss how to attack RNN-based methods. In this section, we briefly review CNN-based attacks, and then highlight the differences between our work and previous attack algorithms designed for RNN-based models.

2.1 Adversarial Attacks on CNN-based Models

[1] first discovered adversarial examples for CNNs in the context of image classification. Since then, a number of approaches to attacking CNNs have surfaced, such as gradient-based methods [2, 11, 12, 13, 14, 15], score-based methods [16, 17, 18], transfer-based methods [19], and decision-based methods [20]. In addition to image classification, attacks on other CNN-related tasks have also been actively investigated. Typical examples include semantic segmentation and object detection [21, 22, 23, 24, 25], image captioning [26], and visual QA [27]. Besides, recent studies have also conducted adversarial attacks in real-world scenarios, including attacking road signs [28, 29, 24, 25], cell-phone cameras [30], robots [31], and generating 3D-printed adversarial objects [32].

2.2 Adversarial Attacks on RNN-based Models

Although much work has been done to fool CNN-based models, only a few studies have examined RNN-based models, and the majority of them focus on simple text classification problems. [33] first uses the Fast Gradient Sign Method (FGSM) to conduct an attack on RNN/LSTM-based classification problems. In order to generate text adversarial examples, [34] proposes to use reinforcement learning to locate important words that could be deleted in sentiment classification. [35] and [36] generate adversarial sequences by inserting or replacing existing words with typos and synonyms. [37] aims to attack sentiment classification models in a black-box setting. It develops scoring functions to find the most important words to modify. These approaches differ from our work in that they study simple text classification problems, while we focus on the more challenging seq2seq model with sequential outputs. Other than attacking text classifiers, [38] aims to fool reading comprehension systems by adding misleading sentences, which has a different focus than ours. [39] uses a generative adversarial network (GAN) to craft natural adversarial examples. However, it can only perform the untargeted attack and also suffers from high computational cost.

Our work shares some similarities with [40], as it also studies the problem of attacking seq2seq models. However, [40] is less appealing than our work since (i) it only focuses on the untargeted attack, while our work can handle the more challenging targeted attack, and (ii) its word-level attack method does not work well: it only yields a 22% success rate for the untargeted attack. Indeed, the authors recognize that attacking word-level seq2seq models is a challenging problem, which has been addressed by this paper.

Methods Gradient Based? Word-level RNN? Sequential Output? Targeted Attack?
Ebrahimi et al [40] Class
Jia&Liang [38]
Li et al [34] Class
Papernot et al [33]
Gao et al [37] Binary
Samanta&Mehta [35] Binary
Zhao et al [39] /
Liang et al [36] Class
Seq2Sick (Ours) Keyword
Table 1: Summary of existing works that are designed to attack RNN models. “Binary” indicates that the attack is for binary classification, and there is no difference between untargeted and targeted attacks in this case. “Class” means a targeted attack on a specific class. “Keyword” means a targeted attack on a specific keyword.

Notably, almost all the previous methods are based on greedy search, i.e., at each step, they search for the best word and the best position to replace the previous word. As a result, their search space grows rapidly as the length of input sequence increases. To address this issue, we propose a novel approach that uses group lasso regularization and the projected gradient descent method with gradient regularization to simultaneously search all the replacement positions. Table 1 summarizes the key differences between the proposed framework Seq2Sick and the existing attack methods on RNN-based models.

3 Methodology

3.1 A Revisit to Sequence-to-Sequence Model

Before introducing the proposed algorithms, we first briefly describe the sequence-to-sequence (seq2seq) model [41, 6]. Let $x_i \in \mathbb{R}^d$ be the embedding vector of each input word, $N$ be the input sequence length, and $M$ be the output sequence length. Let $\omega$ be the input vocabulary, and the output word $y_i \in \nu$, where $\nu$ is the output vocabulary. The seq2seq model has an encoder-decoder framework that aims at mapping an input sequence of vectors $X = (x_1, \dots, x_N)$ to the output sequence $Y = (y_1, \dots, y_M)$. Its encoder first reads the input sequence, then each RNN/LSTM cell computes $h_t = f(x_t, h_{t-1})$, where $x_t$ is the current input, and $h_{t-1}$ and $h_t$ represent the previous and current cells’ hidden states, respectively. The next step computes the context vector $c$ using all the hidden states of the cells $h_1, \dots, h_N$, i.e., $c = q(h_1, \dots, h_N)$, where $q(\cdot)$ could be a linear or non-linear function. In this paper, we follow the setting in [6] that $c = h_N$.

Given the context vector $c$ and all the previous words $\{y_1, \dots, y_{t-1}\}$, the decoder is trained to predict the next word $y_t$. Specifically, the $t$-th cell in the decoder receives its previous cell’s output $y_{t-1}$ and the context vector $c$, and then outputs

$$z_t = g(y_{t-1}, c) \qquad (1)$$

where $g$ is another RNN/LSTM cell function. $z_t := [z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(|\nu|)}] \in \mathbb{R}^{|\nu|}$ is a vector of the logits for each possible word in the output vocabulary $\nu$.
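The attack losses that follow all operate on the decoder logits, so it can help to see how a logit vector becomes a predicted word. The snippet below is a minimal sketch, assuming greedy (top-1) decoding and the usual softmax normalization; the toy logits and the function names are hypothetical and not taken from the authors' code.

```python
import math

def softmax(logits):
    """Convert one logit vector into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logit_seq):
    """Pick the top-1 vocabulary index at every output position."""
    return [max(range(len(z)), key=lambda y: z[y]) for z in logit_seq]

# Hypothetical logits for M = 2 output positions over a 3-word vocabulary.
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.3, 3.0]]
predicted = greedy_decode(toy_logits)   # word indices chosen at each position
probs = softmax(toy_logits[0])          # probabilities at the first position
```

Because softmax is monotone, the top-1 word is the same whether it is read from the logits or from the probabilities, which is why the losses can be written directly on the logits.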

3.2 Proposed Framework

The problem of crafting adversarial examples against the seq2seq model can be formulated as the following optimization problem:

$$\min_{\delta} \; L(X + \delta) + \lambda \cdot R(\delta) \qquad (2)$$

where $R(\cdot)$ indicates the regularization function to measure the magnitude of distortions. $L(\cdot)$ is the loss function to penalize the unsuccessful attack, and it may take different forms in different attack scenarios. A common choice for $R(\cdot)$ is the $\ell_2$ penalty $\|\delta\|_2^2$, but it is, as we will show later, not suitable for attacking seq2seq models. $\lambda > 0$ is the regularization parameter that balances the distortion and the attack success rate: a smaller $\lambda$ will make the attack more likely to succeed, but at the price of larger distortion.

In this work, we focus on two kinds of attacks: the non-overlapping attack and the targeted keywords attack. The first attack requires that the output of the adversarial example share no overlapping words with the original output. This task is strictly harder than the untargeted attack, which only requires the adversarial output to be somewhat different from the original output [39, 40]. We ignore the untargeted attack since it is too trivial for the proposed framework, which can easily achieve a 100% attack success rate. The targeted keywords attack is an even more challenging task than the non-overlapping attack. Given a set of targeted keywords, the goal of the targeted keywords attack is to find an adversarial input sequence such that all the keywords appear in its corresponding output. In the following, we introduce the loss functions developed for the two attack approaches in turn.

3.2.1 Non-overlapping Attack

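To formally define the non-overlapping attack, we let $s = \{s_1, \dots, s_M\}$ be the original output sequence, where $s_i$ denotes the location of the $i$-th word in the output vocabulary $\nu$. $\{z_1, \dots, z_M\}$ indicates the logit layer outputs of the adversarial example. In the non-overlapping attack, the output of the adversarial example should be entirely different from the original output $s$, i.e.,

$$s_t \neq \arg\max_{y \in \nu} z_t^{(y)}, \quad \forall t = 1, \dots, M,$$

which is equivalent to

$$z_t^{(s_t)} < \max_{y \in \nu,\, y \neq s_t} z_t^{(y)}, \quad \forall t = 1, \dots, M.$$

Given this observation, we can define a hinge-like loss function to generate adversarial examples in the non-overlapping attack, i.e.,

$$L_{\text{non-overlapping}} = \sum_{t=1}^{M} \max\Big\{-\epsilon,\; z_t^{(s_t)} - \max_{y \neq s_t} z_t^{(y)}\Big\} \qquad (3)$$

where $\epsilon \geq 0$ denotes the confidence margin parameter. Generally speaking, a larger $\epsilon$ will lead to a more confident output and a higher success rate, but at the cost of more iterations and longer running time.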
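As an illustration, the hinge-like loss can be computed directly from the logits. This is a minimal sketch, assuming the loss sums, over output positions, the clipped gap between the original word's logit and the best competing logit, i.e. max(-eps, z_t[s_t] - max over y != s_t of z_t[y]); the function name and the toy logits are hypothetical.

```python
def non_overlapping_loss(logit_seq, original, eps=0.0):
    """Sum over positions of max(-eps, z_t[s_t] - best competing logit)."""
    total = 0.0
    for z_t, s_t in zip(logit_seq, original):
        best_other = max(v for y, v in enumerate(z_t) if y != s_t)
        total += max(-eps, z_t[s_t] - best_other)
    return total

# Hypothetical logits: 2 positions, 3-word vocabulary, original output [0, 2].
logits_before = [[2.0, 0.5, -1.0], [0.1, 0.3, 3.0]]   # original words still win
logits_after = [[0.0, 1.5, -1.0], [0.1, 2.0, 1.0]]    # original words lose everywhere
loss_before = non_overlapping_loss(logits_before, [0, 2], eps=1.0)
loss_after = non_overlapping_loss(logits_after, [0, 2], eps=1.0)
```

The loss is positive while any original word is still the top-1 prediction and bottoms out at -eps per position once every original word has been displaced by at least the margin.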

We note that the non-overlapping attack is much more challenging than the untargeted attack, for which it suffices to find a one-word difference from the original output [39, 40]. We do not take the untargeted attack into account since it is straightforward, and the replaced words could be less important words such as “the” and “a”.

3.2.2 Targeted Keywords Attack

Given a set of targeted keywords, the goal of the targeted keywords attack is to generate an adversarial input sequence that ensures all the targeted keywords appear in the output sequence. This task is important since it shows that adding a few malicious keywords can completely change the meaning of the output sequence. For example, in the machine translation task that translates from English into German, an input sentence “policeman helps protesters to keep the assembly in order” should generate an output sentence “Polizist hilft Demonstranten, die Versammlung in Ordnung zu halten”. However, changing only one word from “hilft” to “verhaftet” in the output sequence will significantly change its meaning, as the new sentence means “police officer arrested protesters to keep the assembly in order”. Similarly, if we change the word “warm” to “cold” in the output sentence “people give a warm welcome to soldiers”, its meaning will also be dramatically changed.

In our method, we do not specify the positions of the targeted keywords in the output sentence. Instead, it is more natural to design a loss function that allows the targeted keywords to become the top-1 prediction at any position. The attack is considered successful only when all the targeted keywords appear in the output sequence. Therefore, the more targeted keywords there are, the harder the attack is. To illustrate our method, we start from the simpler case with only one targeted keyword $k_1$. To ensure that the target keyword’s logit $z_t^{(k_1)}$ is the largest among all the words at a position $t$, we design the following loss function:

$$L_{\text{keyword}} = \min_{t \in [M]} \Big\{ \max\Big\{-\epsilon,\; \max_{y \neq k_1} z_t^{(y)} - z_t^{(k_1)}\Big\} \Big\} \qquad (4)$$

which essentially searches for the minimum of the hinge-like loss terms over all the possible locations $t \in [M]$. When there is more than one targeted keyword, $K = \{k_1, k_2, \dots, k_{|K|}\}$, where $k_i$ denotes the location of the $i$-th keyword in the output vocabulary $\nu$, we follow the same idea to define the loss function as follows:

$$L_{\text{keywords}} = \sum_{i=1}^{|K|} \min_{t \in [M]} \Big\{ \max\Big\{-\epsilon,\; \max_{y \neq k_i} z_t^{(y)} - z_t^{(k_i)}\Big\} \Big\} \qquad (5)$$

However, the loss defined in (5) suffers from the “keyword collision” problem. When there is more than one keyword, it is possible that multiple keywords compete to attack the same position. To address this issue, we define a mask function $m$ to mask off the position if it has already been occupied by one of the targeted keywords:

$$m_t(x) = \begin{cases} +\infty & \text{if } \arg\max_{i \in \nu} z_t^{(i)} \in K \\ x & \text{otherwise} \end{cases} \qquad (6)$$

In other words, if any of the keywords appears at position $t$ as the top-1 word, we ignore that position and only consider other positions for the placement of the remaining keywords. By incorporating the mask function, the final loss for the targeted keyword attack becomes:

$$L_{\text{keywords}} = \sum_{i=1}^{|K|} \min_{t \in [M]} \Big\{ m_t\Big( \max\Big\{-\epsilon,\; \max_{y \neq k_i} z_t^{(y)} - z_t^{(k_i)}\Big\} \Big) \Big\} \qquad (7)$$
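The effect of the mask on keyword collision can be seen in a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: each keyword contributes the smallest hinge term over all positions, and a position is masked for keyword k only when a different targeted keyword is already its top-1 word. That per-keyword masking is one reading of "remaining keywords"; a fully literal mask would also hide a position from the keyword that already holds it. All names and logits are hypothetical.

```python
import math

def hinge_term(z_t, k, eps):
    """max(-eps, best competing logit - z_t[k]); negative once k is top-1."""
    best_other = max(v for y, v in enumerate(z_t) if y != k)
    return max(-eps, best_other - z_t[k])

def keyword_loss(logit_seq, keywords, eps=0.0, use_mask=True):
    """Sum over keywords of the smallest (optionally masked) hinge term."""
    top1 = [max(range(len(z)), key=lambda y: z[y]) for z in logit_seq]
    total = 0.0
    for k in keywords:
        terms = []
        for t, z_t in enumerate(logit_seq):
            taken_by_other = top1[t] in keywords and top1[t] != k
            if use_mask and taken_by_other:
                terms.append(math.inf)  # position already holds another keyword
            else:
                terms.append(hinge_term(z_t, k, eps))
        total += min(terms)
    return total

# Hypothetical logits: keyword 1 already wins position 0, keyword 2 is a close second there.
toy_logits = [[0.0, 3.0, 2.5, 0.0], [2.0, 0.0, 0.0, 0.0]]
loss_unmasked = keyword_loss(toy_logits, [1, 2], eps=1.0, use_mask=False)
loss_masked = keyword_loss(toy_logits, [1, 2], eps=1.0, use_mask=True)
```

Without the mask, keyword 2 is credited for nearly winning position 0, which keyword 1 already occupies. With the mask, keyword 2 must compete for position 1, so the loss is larger and the gradient pushes it toward a free position.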

3.3 Handling Discrete Input Space

As mentioned before, the problem of “discrete input space” is one of the major challenges in attacking seq2seq models. Let $W \subset \mathbb{R}^d$ be the set of word embeddings of all words in the input vocabulary. A naive approach is to first learn $X + \delta^*$ in the continuous space by solving problem (2), and then search for its nearest word embedding in $W$. This idea has been used in attacking sequence classification models in [42]. Unfortunately, when applying this idea to the targeted keywords attack, we found that all of the 100 attacked sequences on the Gigaword dataset failed to generate the targeted keywords. The main reason is that by directly solving problem (2), the final solution $X + \delta^*$ is likely not a feasible word embedding in $W$, and its nearest neighbor could be far away from it due to the curse of dimensionality [43].

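To address this issue, we propose to add an additional constraint to enforce that $X + \delta$ belongs to the word embeddings $W$ of the input vocabulary. The optimization problem then becomes

$$\min_{\delta} \; L(X + \delta) + \lambda \cdot R(\delta) \qquad (8)$$
$$\text{s.t.} \quad x_i + \delta_i \in W \quad \forall i = 1, \dots, N$$

We then apply a projected gradient descent method to solve this constrained problem. At each iteration, we project the current solution $x_i + \delta_i$, where $\delta_i$ denotes the $i$-th column of $\delta$, back into $W$ to make sure that each $x_i + \delta_i$ maps to a specific input word.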
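The projection step amounts to a nearest-neighbor lookup in the embedding table. The following is a minimal sketch, assuming Euclidean distance and a brute-force search over a tiny hypothetical vocabulary; the function names are illustrative only.

```python
def nearest_embedding(vec, vocab_embeddings):
    """Return (index, embedding) of the vocabulary vector closest to vec."""
    def sq_dist(w):
        return sum((a - b) ** 2 for a, b in zip(vec, w))
    idx = min(range(len(vocab_embeddings)), key=lambda j: sq_dist(vocab_embeddings[j]))
    return idx, vocab_embeddings[idx]

def project(x, delta, vocab_embeddings):
    """Snap every perturbed word x_i + delta_i back onto the vocabulary.

    Returns the updated perturbation and the word index chosen per position.
    """
    new_delta, word_ids = [], []
    for x_i, d_i in zip(x, delta):
        idx, w = nearest_embedding([a + b for a, b in zip(x_i, d_i)], vocab_embeddings)
        new_delta.append([wa - xa for wa, xa in zip(w, x_i)])
        word_ids.append(idx)
    return new_delta, word_ids

# Hypothetical 2-dimensional embeddings for a 3-word vocabulary.
W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x = [W[0], W[1]]                      # original sentence: words 0 and 1
delta = [[0.1, 0.8], [-0.2, 0.1]]     # continuous perturbation before projection
projected_delta, word_ids = project(x, delta, W)
```

Here the first word is pushed far enough to snap to word 2, while the second perturbation is too small and snaps back to the original word, so its projected perturbation is zero.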

Group lasso Regularization:

The $\ell_2$ norm has been widely used in the adversarial machine learning literature to measure distortions. However, it is not suitable for our task, since almost all the learned $\{\delta_t\}_{t=1}^{N}$ using $\ell_2$ regularization will be nonzero. As a result, most of the input words will be perturbed to another word, leading to an adversarial sequence that is significantly different from the input sequence.

To solve this problem, we treat each $\delta_t$ with $d$ variables as a group, and use the group lasso regularization

$$R(\delta) = \sum_{t=1}^{N} \|\delta_t\|_2$$

to enforce the group sparsity: only a few groups (words) in the optimal solution are allowed to be nonzero.
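The sparsity effect of group lasso is easiest to see through its proximal operator, which shrinks each word's perturbation as a whole and zeroes it when its norm falls below a threshold. This is a minimal sketch of the standard group soft-thresholding operator, assuming one group per input word; the threshold value and names are hypothetical.

```python
import math

def group_lasso_penalty(delta):
    """Sum over positions of the L2 norm of that position's perturbation."""
    return sum(math.sqrt(sum(v * v for v in d_i)) for d_i in delta)

def group_soft_threshold(delta, threshold):
    """Proximal operator of threshold * penalty: shrink or zero each group."""
    out = []
    for d_i in delta:
        norm = math.sqrt(sum(v * v for v in d_i))
        if norm > threshold:
            scale = 1.0 - threshold / norm
            out.append([scale * v for v in d_i])
        else:
            out.append([0.0 for _ in d_i])  # the whole word is left unchanged
    return out

# Hypothetical perturbations for three words in a 2-dimensional embedding space.
delta = [[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]]
shrunk = group_soft_threshold(delta, threshold=0.5)
num_perturbed = sum(any(v != 0.0 for v in d_i) for d_i in shrunk)
```

Only the first word keeps a nonzero perturbation; the small perturbation on the second word is removed entirely, which is the behavior an element-wise $\ell_2$ penalty would not produce.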

3.4 Gradient Regularization

When attacking the seq2seq model, it is common to find that the adversarial example is located in a region with very few or even no embedding vectors. This negatively affects our projected gradient method, since even the closest embedding to such a region can be far away.

To address this issue, we propose a gradient regularization to make $X + \delta$ close to the word embedding space. Our final objective function becomes:

$$\min_{\delta} \; L(X + \delta) + \lambda_1 \sum_{i=1}^{N} \|\delta_i\|_2 + \lambda_2 \sum_{i=1}^{N} \min_{w_j \in W} \|x_i + \delta_i - w_j\|_2 \qquad (9)$$
$$\text{s.t.} \quad x_i + \delta_i \in W \quad \forall i = 1, \dots, N$$

where the third term is our gradient regularization that penalizes a large distance to the nearest point in $W$. The gradient of this term can be efficiently computed since it is only related to the one $w_j$ that has the minimum distance from $x_i + \delta_i$. For the other terms, we use the proximal operator to optimize the group lasso regularization, and the gradient of the loss function $L$ can be computed through back-propagation. The detailed steps of our approach, Seq2Sick, are presented in Algorithm 1. Our source code is publicly available at https://github.com/cmhcbb/Seq2Sick.
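To make the gradient regularization concrete, the sketch below evaluates the distance-to-nearest-embedding term and its (sub)gradient for each word. It assumes the term has the form of a sum, over input positions, of the Euclidean distance from the perturbed word to its nearest vocabulary embedding; the names and toy values are hypothetical.

```python
import math

def embedding_distance_penalty(x, delta, vocab_embeddings):
    """Return the summed nearest-embedding distance and its gradient w.r.t. delta."""
    total, grads = 0.0, []
    for x_i, d_i in zip(x, delta):
        point = [a + b for a, b in zip(x_i, d_i)]
        dists = [math.sqrt(sum((p - w) ** 2 for p, w in zip(point, w_j)))
                 for w_j in vocab_embeddings]
        # Only the single nearest embedding matters for the value and the gradient.
        j = min(range(len(dists)), key=dists.__getitem__)
        total += dists[j]
        if dists[j] > 0.0:
            grads.append([(p - w) / dists[j] for p, w in zip(point, vocab_embeddings[j])])
        else:
            grads.append([0.0] * len(point))  # already on an embedding
    return total, grads

# Hypothetical vocabulary and a perturbation that drifts away from word 0.
W = [[0.0, 0.0], [1.0, 0.0]]
penalty, grads = embedding_distance_penalty([[0.0, 0.0]], [[0.0, 0.5]], W)
```

The gradient is a unit vector pointing away from the nearest embedding, so a descent step on this term pulls the perturbed word back toward a region where a valid word exists.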

Computational Cost:

Our algorithm needs only one back-propagation per iteration to compute the gradient $\nabla_{\delta} L(X + \delta)$. The bottleneck here is projecting the solution back into the word embedding space, whose cost depends on the number of words in the input dictionary of the model. [42] uses a word embedding [44] that contains millions of words to do a nearest neighbor search. Fortunately, our method does not need any pre-trained word embedding, which makes it a more generic attack. Besides, we can employ approximate nearest neighbor (ANN) approaches to further speed up the projection step.

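  Input: input sequence $x = \{x_1, \dots, x_N\}$, seq2seq model, target keywords $\{k_1, \dots, k_{|K|}\}$
  Output: adversarial sequence $x^* = x + \delta^*$
  Let $s = \{s_1, \dots, s_M\}$ denote the original output of $x$.
  if Targeted Keyword Attack then
     Set the loss $L(\cdot)$ in the objective function to be (7)
  else
     Set the loss $L(\cdot)$ in the objective function to be (3)
  end if
  for $r = 1, 2, \dots, T$ do
     back-propagate $L$ to obtain the gradient $\nabla_{\delta} L(x + \delta_r)$
     for $i = 1, 2, \dots, N$ do
        if $\|\delta_{r,i}\|_2 > \eta \lambda_1$ then
           $\delta_{r,i} \leftarrow \delta_{r,i} - \eta \lambda_1 \, \delta_{r,i} / \|\delta_{r,i}\|_2$
        else
           $\delta_{r,i} \leftarrow 0$
        end if
     end for
     $y_{r+1} = \delta_r - \eta \, \nabla_{\delta} L(x + \delta_r)$
     $\delta_{r+1} = \big(\arg\min_{w_j \in W} \|x + y_{r+1} - w_j\|_2\big) - x$ (position-wise projection onto $W$)
  end for
  $\delta^* = \delta_T$
  $x^* = x + \delta^*$
  return $x^*$
Algorithm 1 Seq2Sick algorithm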
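How the three ingredients interact can be sketched on a toy problem. The code below is an illustration under strong assumptions, not the authors' implementation: the seq2seq loss is replaced by a hypothetical stand-in (a squared distance that pulls the second word toward a target embedding), the gradient regularization term is omitted for brevity, and each iteration applies a gradient step, then group soft-thresholding, then projection. The step size, threshold, and vocabulary are made up.

```python
import math

def l2(v):
    return math.sqrt(sum(a * a for a in v))

def prox_group_lasso(delta, thr):
    """Shrink each per-word perturbation; zero it when its norm is at most thr."""
    out = []
    for d in delta:
        n = l2(d)
        out.append([(1.0 - thr / n) * a for a in d] if n > thr else [0.0] * len(d))
    return out

def project(x, delta, vocab):
    """Snap every x_i + delta_i to its nearest vocabulary embedding."""
    new_delta, ids = [], []
    for x_i, d_i in zip(x, delta):
        p = [a + b for a, b in zip(x_i, d_i)]
        j = min(range(len(vocab)), key=lambda k: l2([a - b for a, b in zip(p, vocab[k])]))
        new_delta.append([w - a for w, a in zip(vocab[j], x_i)])
        ids.append(j)
    return new_delta, ids

def toy_loss_grad(x, delta, target):
    """Gradient of the stand-in loss ||x_1 + delta_1 - target||^2 (position 1 only)."""
    grad = [[0.0] * len(d) for d in delta]
    grad[1] = [2.0 * (a + b - t) for a, b, t in zip(x[1], delta[1], target)]
    return grad

def attack(x, vocab, target, steps=5, eta=0.6, lam1=0.1):
    delta = [[0.0] * len(v) for v in x]
    ids = None
    for _ in range(steps):
        grad = toy_loss_grad(x, delta, target)
        # Gradient descent step on the stand-in loss.
        delta = [[d - eta * g for d, g in zip(d_i, g_i)] for d_i, g_i in zip(delta, grad)]
        # Group lasso proximal step keeps untouched words exactly unchanged.
        delta = prox_group_lasso(delta, eta * lam1)
        # Projection makes every position a valid word again.
        delta, ids = project(x, delta, vocab)
    return ids

vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [vocab[0], vocab[1], vocab[2]]          # original word ids: [0, 1, 2]
adv_ids = attack(x, vocab, target=vocab[3])
```

In this toy run only the second word changes (from word 1 to word 3), while the other two positions receive no gradient, are zeroed by the proximal step, and project back onto themselves.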

4 Experiments

In this section, we conduct experiments on two widely-used applications of seq2seq model: text summarization and machine translation.

4.1 Datasets

We use three datasets, DUC2003 (http://duc.nist.gov/duc2003/tasks.html), DUC2004 (http://duc.nist.gov/duc2004/), and Gigaword (https://catalog.ldc.upenn.edu/LDC2003T05), to conduct our attack for the text summarization task. Among them, DUC2003 and DUC2004 are widely used datasets in document summarization. We also include a subset of randomly chosen samples from Gigaword to further evaluate the performance of our algorithm. For the machine translation task, we use 500 samples from the WMT’16 Multimodal Translation task (http://www.statmt.org/wmt16/translation-task.html). The statistics of the datasets are shown in Table 2. All our experiments are performed on an Nvidia GTX 1080 Ti GPU.

Datasets # samples Average input lengths
Gigaword 1,000 30.1 words
DUC2003 624 35.5 words
DUC2004 500 35.6 words
Multi30k 500 11.5 words
Table 2: Statistics of the datasets. “# samples” is the number of test examples we used for robustness evaluations.

4.2 Seq2seq models

We implement both text summarization and machine translation models on OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py). Specifically, we use a word-level LSTM encoder and a word-based attention decoder for both applications [7]. For the text summarization task, we use 380k training pairs from the Gigaword dataset to train a seq2seq model. The architecture consists of a 2-layer stacked LSTM with 500 hidden units. We conduct experiments on two types of models: one uses the pre-trained 300-dimensional GloVe word embeddings and the other is trained from scratch. For the machine translation task, we train our model using 453k pairs from the Europarl corpus of German-English WMT 15 (http://www.statmt.org/wmt15/translation-task.html), Common Crawl, and News Commentary. We use the hyper-parameters suggested by OpenNMT for both models, and have reproduced the performance reported in [9] and [45].

4.3 Empirical Results

4.3.1 Text Summarization

For the non-overlapping attack, we use the proposed loss (3) in our objective function. A non-overlapping attack is treated as successful only if the output sequence and the original sequence share no common word at any position. We use the same parameter setting in all non-overlapping experiments. Table 3 summarizes the experimental results. It shows that our algorithm only needs to change 2 or 3 words on average and can generate entirely different outputs for more than 84% of sentences on all three datasets. We have also included some adversarial examples in Table 9. More non-overlapping adversarial examples are available in the appendix. These examples show that changing only one word can make the output sequence completely different from the original one and completely change the sentence’s meaning.
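The success criterion and the changed-word count used in these experiments are simple position-wise comparisons. This is a minimal sketch, assuming whitespace tokenization and position-by-position comparison; the helper names are hypothetical, and the token lists are shortened, plain-ASCII versions of an example from Table 9.

```python
def is_non_overlapping(original_output, adversarial_output):
    """True if no position carries the same word in both outputs."""
    return all(a != b for a, b in zip(original_output, adversarial_output))

def num_changed_words(source_input, adversarial_input):
    """Count positions where the adversarial input differs from the source."""
    return sum(a != b for a, b in zip(source_input, adversarial_input))

source_output = "asia 's leaders are a man of the world".split()
adv_output = "a vision for the world".split()
source_input = "among asia 's leaders".split()
adv_input = "among lynn 's leaders".split()

attack_succeeded = is_non_overlapping(source_output, adv_output)
changed = num_changed_words(source_input, adv_input)
```

Note that the two outputs share the words "a", "the", and "world", but never at the same position, so the attack still counts as successful under the position-wise criterion.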

For the targeted keywords attack, we randomly choose some targeted keywords from the output vocabulary after removing stop words like “a” and “the”. A targeted keywords attack is treated as successful only if the output sequence contains all the targeted keywords. We use the same parameter setting in our objective function in all our experiments. Table 4 summarizes the performance, including the overall success rate, the average BLEU score [46], and the average number of changed words in input sentences. The average BLEU score is defined as the exponential average over BLEU 1, 2, 3, and 4, which is commonly used to evaluate the quality of text that has been machine-translated from one natural language to another. We have also included some adversarial examples crafted by our method in Table 10, which shows adversarial examples with 3 sets of keywords, where “##” stands for a two-digit number after standard preprocessing in text summarization. These examples show that, with only a few word changes, our method can generate totally irrelevant subjects, verbs, numerals, and objects that easily form a complete sentence. More adversarial examples can be found in the appendix of this paper.
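The two quantities reported for this attack can be sketched as follows. The code is an illustration under stated assumptions: "exponential average over BLEU 1, 2, 3, 4" is read as the exponential of the mean log n-gram precision (a geometric mean) with the standard brevity penalty, computed against a single reference and without smoothing. The function names are hypothetical, the sentences are shortened from an example in Table 10, and the keyword pair is assumed for illustration.

```python
import math
from collections import Counter

def contains_all_keywords(output_tokens, keywords):
    """The targeted attack succeeds only if every keyword appears in the output."""
    return all(k in output_tokens for k in keywords)

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / total

def average_bleu(candidate, reference, max_n=4):
    """exp(mean log precision for n = 1..max_n) times a brevity penalty."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)

source_input = "north korea is entering its fourth winter of chronic food shortages".split()
adv_input = "north detectives is apprehended its fourth winter of chronic food shortages".split()
adv_output = "north police arrest fourth winter of food shortages".split()

success = contains_all_keywords(adv_output, ["police", "arrest"])  # assumed keywords
similarity = average_bleu(adv_input, source_input)
```

A score near 1 means the adversarial input stays close to the original input; changing two words in this short sentence already lowers the score noticeably because every n-gram that spans a changed word is lost.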

Note that there are three important techniques used in our algorithm: the projected gradient method, group lasso, and gradient regularization. In the following, we conduct experiments to verify the importance of each of these techniques. First, without using the projected gradient method, we observe that the success rate immediately drops to close to 0%. It is thus important to project the solution back to the input vocabulary’s word embeddings after each iteration. Table 5 shows the experimental results when group lasso or gradient regularization is removed, where “W/o GL” indicates the results without group lasso regularization and “W/o GR” indicates the results without gradient regularization. The algorithms are evaluated using the targeted 2-keyword attack. As shown in Table 5, if we do not use group lasso regularization, which enforces group sparsity of the distortion $\delta$, the attack success rate remains similar, but the average BLEU score drops sharply and the number of changed words becomes much larger. On the other hand, if we do not use gradient regularization, which helps find a perturbed input that is close to the input embedding space, the success rate drops, and the average BLEU score and the number of changed words also become worse. These empirical results verify that all three techniques used here are important in generating adversarial examples against seq2seq models.

Dataset Success rate BLEU # changed
Gigaword 86.0% 0.828 2.17
DUC2003 85.2% 0.774 2.90
DUC2004 84.2% 0.816 2.50
Table 3: Results of the non-overlapping attack in text summarization. The high BLEU scores and low average number of changed words indicate that the crafted adversarial inputs are very similar to their originals, and we achieve high success rates in generating a summary that differs from the original at every position on all three datasets.

Dataset    |K|   Success rate   BLEU    # changed
Gigaword   1     99.8%          0.801   2.04
Gigaword   2     96.5%          0.523   4.96
Gigaword   3     43.0%          0.413   8.86
DUC2003    1     99.6%          0.782   2.25
DUC2003    2     87.6%          0.457   5.57
DUC2003    3     38.3%          0.376   9.35
DUC2004    1     99.6%          0.773   2.21
DUC2004    2     87.8%          0.421   5.1
DUC2004    3     37.4%          0.340   9.3
Table 4: Results of the targeted keywords attack in text summarization. $|K|$ is the number of keywords. We found that our method can make the summary include 1 or 2 target keywords with a high success rate, while the changes made to the input sentences are relatively small, as indicated by the high BLEU scores and low average number of changed words. When $|K| = 3$, this task becomes more challenging, but our algorithm can still find many adversarial examples.
Dataset    Method   Success rate   BLEU    # changed
Gigaword   W/o GL   91.4%          0.166   16.53
Gigaword   W/o GR   92.8%          0.707   4.96
Gigaword   All      96.5%          0.523   4.96
DUC2003    W/o GL   95.7%          0.225   15.74
DUC2003    W/o GR   87.9%          0.457   5.57
DUC2003    All      87.6%          0.457   5.57
DUC2004    W/o GL   95.0%          0.212   15.60
DUC2004    W/o GR   87.0%          0.421   5.14
DUC2004    All      87.8%          0.421   5.14
Table 5: Importance of each component of our algorithm for the targeted 2-keyword attack in text summarization. GL = Group lasso. GR = Gradient regularization. We do not show our method without gradient projection, as its success rates drop to 0 for all datasets. Group lasso is crucial for crafting adversarial examples that are similar to the originals (high BLEU and low number of changed words), and gradient regularization can help to boost the success rate.

4.3.2 Machine Translation

We then conduct both non-overlapping and targeted keywords attacks on the English-German machine translation model. We first filter out stop words like “Ein” (a) and “und” (and) in the German vocabulary and randomly choose several German nouns, verbs, adjectives, or adverbs as targeted keywords. We use the same parameter setting in our objective function as in the text summarization experiments. The success rates, BLEU scores, and the average number of words changed are reported in Table 6, with some adversarial examples shown in Table 8. Furthermore, the effects of group lasso and gradient regularization are reported in Table 7. These results are consistent with the findings in the text summarization experiments, and verify the effectiveness of the proposed attack framework and the importance of the proposed techniques.

Method Success% BLEU # changed
Non-overlap 89.4% 0.349 3.5
1-keyword 100.0% 0.705 1.8
2-keyword 91.0 % 0.303 4.0
3-keyword 69.6% 0.205 5.3
Table 6: Results of non-overlapping method and targeted keywords method in machine translation. Similar to our observation for text summarization, non-overlapping attack and targeted keyword attack with 1 and 2 keywords can be achieved with very high success rates, while a 3-keyword attack is more challenging.
Method success rate BLEU # changed
W/o GL 100.0% 0.163 6.4
W/o GR 91.0% 0.303 4.1
All 91.0% 0.303 4.0
Table 7: Importance of each component of our algorithm for targeted 2-keyword attack in machine translation. GL = Group lasso; GR = Gradient regularization. We do not show our method without gradient projection as its success rates drop to 0 for all datasets. Without group lasso, our algorithm can make the input sentence quite different from its original (low BLEU and high number of changed words). Gradient regularization can slightly improve results.
Source input seq A toddler is cooking with another person.
Adv input seq A dog is sit with another UNK.
Source output seq Ein kleines Kind kocht mit einer anderen Person.
Adv output seq Ein Hund sitzt mit einem anderen UNK.
Table 8: Machine translation adversarial examples with targeted keywords “Hund sitzt”.

4.4 Robustness of Seq2Seq Model

Finally, we summarize our results and make some final remarks about the robustness of seq2seq models. Our algorithm achieves very good success rates (roughly 84% to 100% in Tables 3, 4, and 6) in both non-overlapping and targeted keywords attacks with 1 or 2 keywords. This clearly verifies the effectiveness of our attack algorithm. On the other hand, we also recognize some strengths of the seq2seq model: (i) unlike CNN models, where a targeted attack can be conducted easily with an almost 100% success rate and very small distortion that cannot be perceived by human eyes [13], it is harder to turn the entire seq2seq output into a particular sentence, and some sentences are even impossible for seq2seq models to generate; and (ii) since the input space of seq2seq models is discrete, it is easier for humans to detect the differences between the adversarial sequence and the original one, even if we only change one or a few words. Therefore, we conclude that, compared with DNN models designed for other tasks such as image classification, seq2seq models are more robust to adversarial attacks. The main reason, as pointed out in the introduction, is that the seq2seq model has a finite and discrete input space and an almost infinite output space, so it is more robust than visual classification models, which have an infinite and continuous input space and a very small output space (e.g., 10 categories in MNIST and 1,000 categories in ImageNet).

Source input seq among asia ’s leaders , prime minister mahathir mohamad was notable as a man with a bold vision : a physical and social transformation that would push this nation into the forefront of world affairs .
Adv input seq among lynn ’s leaders , prime minister mahathir mohamad was notable as a man with a bold vision : a physical and social transformation that would push this nation into the forefront of world affairs.
Source output seq asia ’s leaders are a man of the world
Adv output seq a vision for the world
Source input seq under nato threat to end his punishing offensive against ethnic albanian separatists in kosovo , president slobodan milosevic of yugoslavia has ordered most units of his army back to their barracks and may well avoid an attack by the alliance , military observers and diplomats say
Adv input seq under nato threat to end his punishing offensive against ethnic albanian separatists in kosovo , president slobodan milosevic of yugoslavia has jean-sebastien most units of his army back to their barracks and may well avoid an attack by the alliance , military observers and diplomats say.
Source output seq milosevic orders army back to barracks
Adv output seq nato may not attack kosovo
Table 9: Text summarization adversarial examples using non-overlapping method. Surprisingly, it is possible to make the output sequence completely different by changing only one word in the input sequence.
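A success check for the non-overlapping examples in Table 9 can be sketched as follows. It assumes "non-overlapping" means that the adversarial output differs from the original output at every position the two sequences share; a stricter reading (no common word anywhere) would need a set-based test instead. The function name `is_non_overlapping` is hypothetical.

```python
def is_non_overlapping(source_output, adv_output):
    """True if the two outputs differ at every shared position.

    Positions beyond the shorter sequence are ignored (zip truncates).
    """
    src, adv = source_output.split(), adv_output.split()
    return all(s != a for s, a in zip(src, adv))

# Second example from Table 9: no position carries the same word.
ok = is_non_overlapping("milosevic orders army back to barracks",
                        "nato may not attack kosovo")

# A pair that shares its first word fails the check.
not_ok = is_non_overlapping("north korea enters fourth winter of food shortages",
                            "north police arrest fourth winter of food shortages.")
print(ok, not_ok)
```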
Source input seq north korea is entering its fourth winter of chronic food shortages with its people malnourished and at risk of dying from normally curable illnesses , senior red cross officials said tuesday.
Adv input seq north detectives is apprehended its fourth winter of chronic food shortages with its people malnourished and at risk of dying from normally curable illnesses , senior red cross officials said tuesday.
Source output seq north korea enters fourth winter of food shortages
Adv output seq north police arrest fourth winter of food shortages.
Source input seq after a day of fighting , congolese rebels said sunday they had entered kindu , the strategic town and airbase in eastern congo used by the government to halt their advances.
Adv input seq after a day of fighting , nordic apprehended said sunday they had entered UNK , the strategic town and airbase in eastern congo used by the government to halt their advances.
Source output seq congolese rebels say they have entered UNK.
Adv output seq nordic police arrest ## in congo.
Source input seq three weeks ago , when someone started stalking homeless people here , bending over them as they slept in the alleys and doorways and slashing their throats , a grim joke began making the rounds on the streets : the killer had been hired by the city.
Adv input seq three weeks ago dissidents when someone started modi homeless people here thai bending over them as they slept in the alleys and doorways and slashing their throats , a grim joke began making the rounds on the streets : the killer had been hired by the city.
Source output seq homeless people get a taste of life
Adv output seq thai police arrest # in crackdown on protesters
Table 10: Text summarization examples with targeted keywords “police arrest”, “police arrest ##”, “police arrest protesters”
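Whether an output in Table 10 counts as a successful targeted keyword attack can be checked with a few lines. The sketch assumes success means every target keyword appears as a whitespace-separated token somewhere in the adversarial output, with no constraint on order or adjacency. The helper name `contains_keywords` is hypothetical.

```python
def contains_keywords(output, keywords):
    """True if every keyword occurs as a token of the output sequence."""
    tokens = output.split()
    return all(k in tokens for k in keywords)

adv_output = "thai police arrest # in crackdown on protesters"
src_output = "homeless people get a taste of life"

hit = contains_keywords(adv_output, ["police", "arrest", "protesters"])
miss = contains_keywords(src_output, ["police", "arrest", "protesters"])
print(hit, miss)
```

Matching on whole tokens rather than substrings avoids counting a keyword that only appears inside a longer word.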

5 Conclusion

In this paper, we propose a novel framework, Seq2Sick, to generate adversarial examples for sequence-to-sequence neural network models. We propose a projected gradient method to address the discrete input space, adopt group lasso to enforce the sparsity of the distortion, and develop a regularization technique to further improve the success rate. In addition, unlike most existing algorithms, which are designed for untargeted attacks and classification tasks, our algorithm can perform the more challenging targeted keyword attack. Our experimental results show that the proposed framework is effective: it achieves high success rates in both non-overlapping and targeted keyword attacks with relatively small distortions. At the same time, we recognize that, compared with the DNN models for image classification that have already been evaluated elsewhere, seq2seq models are more robust to adversarial attacks.

6 Acknowledgment

CJH, MC and HZ are partially supported by NSF IIS-1719097.

References

Appendix A Appendix

A.1 More adversarial examples on targeted keyword method

In Table 11, we show more adversarial examples generated by our Seq2Sick framework using the targeted keyword method.

Source input seq cambodia ’s two-party opposition asked the asian development bank monday to stop providing loans to the incumbent government , which it calls illegal
Adv input seq cambodia ’s heavy-handed manhunts warrant the asian development bank monday to stop providing loans to the incumbent government, rwanda apec calls police
Source output seq cambodian opposition asks adb to stop loans to government
Adv output seq cambodia ’s police arrest asian development bank
Source input seq europe ’s top official said friday that he hoped former chilean dictator augusto pinochet would be extradited to spain to be tried for crimes committed during his 17-year rule.
Adv input seq detectives sweep top official said friday that he hoped former chilean dictator augusto pinochet would be extradited to spain to be tried for crimes committed during his UNK rule.
Source output seq europe ’s top official hopes pinochet will be extradited to spain.
Adv output seq chilean police arrest pinochet ’s extradition.
Source input seq prime minister rafik hariri has declined an informal invitation from lebanon ’s new president to form the next government , sparking a political crisis in this country as it rebuilds from its devastating civil war .
Adv input seq prime minister rafik hariri apprehended declined riot-torn informal invitation from lebanon ’s new president to form the next government , sparking a political crisis ex-police this country as it UNK from its devastating civil war .
Source output seq lebanese prime minister declines to form new government
Adv output seq lebanese police arrest ## in new government
Source input seq the trials of three outspoken dissidents over , communist party leader jiang zemin signaled wednesday that china will sustain a crackdown on dissent throughout next year .
Adv input seq firemen trials wanchope consists outspoken dissidents over , communist party leader jiang zemin signaled wednesday that china will sustain a crackdown on dissent throughout next year .
Source output seq china ’s jiang signals crackdown on dissent
Adv output seq chinese police say they will continue crackdown on dissent
Source input seq president boris yeltsin stayed home tuesday , nursing a respiratory infection that forced him to cut short a foreign trip and revived concerns about his ability to govern.
Adv input seq president boris yeltsin stayed home tuesday , cops cops respiratory infection that forced him to cut short a foreign trip and revived concerns about his ability to govern.
Source output seq yeltsin stays home after illness
Adv output seq yeltsin stays home after police arrest
Source input seq a group of high-powered u.s. investors in southeast asia on monday applauded efforts to perk up thailand ’s staggering economy , saying they had been assured by top thai officials that key economic reform packages will soon be approved
Adv input seq a eds confiscate rafah arresting supporters in southeast asia on monday applauded efforts to perk up thailand ’s staggering economy , saying they had been assured by top thai officials that key economic reform packages will soon be approved .
Source output seq us investors welcome thai economic reforms
Adv output seq thai police arrest protesters in southeast asia
Table 11: Text summarization adversarial examples using the targeted keywords “police arrest” and “police arrest protestors”

A.2 More adversarial examples on non-overlapping method

In Table 12, we show more adversarial examples generated by our Seq2Sick framework using the non-overlapping method. By changing only one word in the input, we can fool the seq2seq model into generating a totally different output.

Source input seq a group of high-powered u.s. investors in southeast asia on monday applauded efforts to perk up thailand ’s staggering economy , saying they had been assured by top thai officials that key economic reform packages will soon be approved.
Adv input seq a group of high-powered u.s. investment in southeast asia on monday applauded efforts to perk up thailand ’s staggering economy , saying they had been assured by top thai officials that key economic reform packages will soon be approved.
Source output seq us investors welcome thai economic reforms.
Adv output seq u.s. investment in southeast asia to be boosted.
Source input seq just four years ago , it was a good bet that newt gingrich would be the pivotal figure in u.s. politics at the turn of the millennium.
Adv input seq just four years ago , it was a good bet that newt lawmakers would be the pivotal figure in u.s. politics at the turn of the millennium .
Source output seq gingrich ’s UNK vision is a good bet
Adv output seq a new year for the millennium
Source input seq russia ’s new prime minister picked an unusual way to reassure the nation thursday , after two weeks of deliberations he announced that he still had no plan to rescue the country from its economic crisis .
Adv input seq russia ’s new prime minister picked an unusual way to peacocks the nation thursday , after two weeks of deliberations he announced that he still had no plan to rescue the country from its economic crisis.
Source output seq russia s new prime minister says he has no plan to rescue country from economic crisis
Adv output seq new russian prime minister picks peacocks
Source input seq as turkey kept up warlike rhetoric against damascus , egypt on sunday began shuttle democracy between the two neighbors to avoid a military confrontation over turkish kurdish bases in syria.
Adv input seq as turkey kept up warlike rhetoric against damascus of egypt on sunday began shuttle democracy between the two neighbors to avoid a military confrontation over turkish kurdish bases in syria.
Source output seq egypt begins shuttle diplomacy with syria
Adv output seq turkey holds up its rhetoric against syria
Source input seq indian prime minister p.v. narasimha rao ’s promise of more autonomy for troubled kashmir and his plea for early state elections has sparked a violent reaction from provincial moslem and opposition parties.
Adv input seq indian prime minister p.v. narasimha rao honeywell promise of more autonomy for troubled kashmir and his plea for early state elections has sparked a violent reaction from provincial moslem and opposition parties.
Source output seq indian pm ’s promise of autonomy for kashmir sparks violent reaction
Adv output seq new autonomy for troubled kashmir
Source input seq the moment of truth for chuck knoblauch came in the bottom of the first inning when his name was announced at yankee stadium for the first time since he neglected to chase down that memorable loose ball last wednesday.
Adv input seq the moment of truth for chuck yankees came in the bottom of the first inning when his name was announced at yankee stadium for the first time since he neglected to chase down that memorable loose ball last wednesday.
Source output seq knoblauch ’s return is a UNK
Adv output seq yankees yankees lose yankees #-#
Table 12: Text summarization adversarial examples using non-overlapping method