Adversarial Reprogramming of Sequence Classification Neural Networks

09/06/2018 ∙ by Paarth Neekhara, et al. ∙ University of California, San Diego

Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM, bi-directional LSTM and CNN for alternate classification tasks.





Traditionally, an adversarial example is a sample from the classifier’s input domain that has been perturbed in a way intended to cause a machine learning model to misclassify it. While the perturbation is usually imperceptible, such an adversarial input causes the neural network model to output an incorrect class label with high confidence. Several studies have shown such adversarial attacks to be successful in both continuous input domains [Szegedy et al.2013, Papernot et al.2015, Papernot et al.2016a, Papernot, McDaniel, and Goodfellow2016, Brown et al.2017, Goodfellow, Shlens, and Szegedy2015] and discrete input spaces [Papernot et al.2016b, Hu and Tan2017, Yang et al.2018]. However, such attacks restrict the outcome to the original task, without the ability to repurpose the network to perform a different task chosen by the attacker.

Adversarial Reprogramming [Elsayed, Goodfellow, and Sohl-Dickstein2018] is a new class of adversarial attacks where a machine learning algorithm is repurposed to perform a new task chosen by the attacker. The authors demonstrated how an adversary may repurpose a pre-trained ImageNet [Deng et al.2009] model for an adversarial classification task, such as classification of MNIST digits or CIFAR-10 images, without modifying the network parameters. Since machine learning agents can be reprogrammed to perform unwanted actions as desired by the adversary, such an attack can lead to theft of computational resources such as cloud-hosted machine learning models. Besides theft of computational resources, the adversary may perform a task that violates the code of ethics of the system provider.

The adversarial reprogramming approach proposed by [Elsayed, Goodfellow, and Sohl-Dickstein2018] trains an additive contribution to the inputs of the neural network to repurpose it for the desired alternate task. The adversary defines a hard-coded mapping between the class labels of the original and adversarial task. The parameters of the adversarial program are updated such that the label predicted by the classifier, when mapped to the adversarial label space, correctly classifies an adversarial input. This approach assumes a white-box attack scenario where the adversary has access to the network’s parameters. Also, the adversarial program proposed in that work is only applicable to tasks where the input space of the original and adversarial task is continuous.

Figure 1: Example of Adversarial Reprogramming for Sequence Classification. We aim to design and train the adversarial reprogramming function $f_\theta$ such that it can be used to repurpose a pretrained classifier C for a desired adversarial task.

In this work, we propose a method to adversarially repurpose neural networks which operate on sequences from a discrete input space. The task is to learn a simple transformation (adversarial program) from the input space of the adversarial task to the input space of the neural network such that the neural network can be repurposed for the adversarial task. We propose a context-based vocabulary remapping function as an adversarial program for sequence classification networks. We propose training procedures for this adversarial program in both white-box and black-box scenarios. In the white-box attack scenario, where the adversary has access to the classifier’s parameters, a Gumbel-Softmax trick  [Jang, Gu, and Poole2016] is used to train the adversarial program. Assuming a black-box attack scenario, where the adversary may not have access to the classifier’s parameters, we present a REINFORCE  [Williams1992] based optimization algorithm to train the adversarial program.

We apply our proposed methodology to various text classification models, including Recurrent Neural Networks such as LSTMs and bidirectional LSTMs, and Convolutional Neural Networks (CNNs). We demonstrate experimentally how these neural networks trained on a particular (original) text classification task can be repurposed for alternate (adversarial) classification tasks. We experiment with the different text classification datasets given in table 1 as candidate original and adversarial tasks and adversarially reprogram the aforementioned text classification models to study the robustness of the attack.

Background and Related Work

Figure 2: Adversarial Reprogramming Function and Training Procedures. Left: White-box Adversarial Reprogramming. The adversary generates Gumbel distributions $g_i$ at each time-step, which are passed as a soft version of one-hot vectors to the classifier C. The cross-entropy loss between the predictions and the mapped class is backpropagated to train the adversarial program $f_\theta$. Right: Black-box Adversarial Reprogramming. The adversarial reprogramming function is used as a policy network and the sampled action (sequence s) is passed to the classifier to get a reward based on prediction correctness. The adversarial program is then trained using REINFORCE.

Adversarial Examples

Traditionally, adversarial examples are intentionally designed inputs to a machine learning model that cause the model to make a mistake  [Goodfellow, Shlens, and Szegedy2015]. These attacks can be broadly classified into untargeted and targeted attacks. In the untargeted attack scenario, the adversary succeeds as long as the victim model classifies the adversarial input into any class other than the correct class, while in the targeted attack scenario, the adversary succeeds only if the model classifies the adversarial input into a specific incorrect class. In both these scenarios, the intent of the adversary is usually malicious and the outcome of the victim model is still limited to the original task being performed by the model.

Adversarial attacks on image-classification models often use gradient descent on an image to create a small perturbation that causes the machine learning model to misclassify it [Szegedy et al.2013, Biggio et al.2013]. There has been a similar line of adversarial attacks on neural networks with discrete input domains [Papernot et al.2016b, Yang et al.2018], where the adversary modifies a few tokens in the input sequence to cause misclassification by a sequence model. In addition, efforts have been made in designing more general adversarial attacks in which the same modification can be applied to many different inputs to generate adversarial examples [Brown et al.2017, Goodfellow, Shlens, and Szegedy2015, Moosavi-Dezfooli et al.2017]. For example, Baluja and Fischer [Baluja and Fischer2018] trained an Adversarial Transformation Network that can be applied to all inputs to generate adversarial examples targeting a victim model or a set of victim models. In this work, we aim to learn such universal transformations of discrete sequences for a fundamentally different task: Adversarial Reprogramming, described below.

Adversarial Reprogramming

Adversarial Reprogramming [Elsayed, Goodfellow, and Sohl-Dickstein2018] introduced a new class of adversarial attacks where the adversary wishes to repurpose an existing neural network for a new task chosen by the attacker, without the attacker needing to compute the specific desired output. The adversary achieves this by first defining a hard-coded one-to-one label remapping function $h_g$ that maps the output labels of the adversarial task to the label space of the classifier $f$, and then learning a corresponding adversarial reprogramming function $h_f(\cdot; W)$ that transforms an input $\tilde{X}$ (an ImageNet-size padded input image) from the input space of the new task to the input space of the classifier. The authors proposed an adversarial reprogramming function $h_f(\cdot; W)$ for repurposing ImageNet models for adversarial classification tasks. An adversarial example $X_{adv}$ for an input image $\tilde{X}$ can be generated using the following adversarial program (masking is ignored here because it is only a visualization convenience):

$$X_{adv} = h_f(\tilde{X}; W) = \tilde{X} + \tanh(W)$$

where $W \in \mathbb{R}^{n \times n \times 3}$ is the learnable weight matrix of the adversarial program (where n is the ImageNet image width). Let $P(y \mid X)$ denote the probability of the victim model predicting label $y$ for an input $X$. The goal of the adversary is to maximize the probability $P(h_g(y_{adv}) \mid X_{adv})$, where $y_{adv}$ is the label of the adversarial input $X_{adv}$. The following optimization problem, which maximizes the log-likelihood of predictions for the adversarial classification task, can be solved using backpropagation to train the adversarial program parameterized by $W$:

$$\hat{W} = \arg\min_{W} \left( -\log P(h_g(y_{adv}) \mid X_{adv}) + \lambda \lVert W \rVert_F^2 \right) \qquad (1)$$

where $\lambda$ is the regularization hyperparameter. Since the proposed adversarial program is a trainable additive contribution $\tanh(W)$ to the inputs, its application is limited to neural networks with a continuous input space. Also, since the above optimization problem is solved by back-propagating through the victim network, it assumes a white-box attack scenario where the adversary has gained access to the victim model’s parameters. In this work, we describe how we can learn a simple transformation in the discrete space to extend the application of adversarial reprogramming to sequence classification problems. We also propose a training algorithm in the black-box setting where the adversary may not have access to the model parameters.
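
To make the additive form concrete, the following is a minimal sketch in plain Python, assuming the program adds `tanh(W)` element-wise to the input with masking ignored. The function name `additive_program` and the toy 2x2 "image" are illustrative assumptions, not part of the cited work.

```python
import math
from typing import List


def additive_program(x: List[List[float]], w: List[List[float]]) -> List[List[float]]:
    """Return x + tanh(w) element-wise (hypothetical helper; masking ignored)."""
    return [
        [xi + math.tanh(wi) for xi, wi in zip(x_row, w_row)]
        for x_row, w_row in zip(x, w)
    ]


# Toy stand-in for a padded input image and a weight matrix of the same shape.
x = [[0.2, 0.5], [0.0, 1.0]]
w = [[0.0, 3.0], [-3.0, 0.1]]
x_adv = additive_program(x, w)

# tanh keeps every additive contribution strictly inside (-1, 1).
print(x_adv)
```

Because the perturbation is added to real-valued inputs, this form has no direct analogue when the input is a sequence of discrete token indices, which is the limitation discussed above.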

Transfer Learning

Transfer Learning  [Raina et al.2007] is a study closely related to Adversarial Reprogramming. During training, neural networks learn representations that are generic and can be useful for many related tasks. A pre-trained neural network can be effectively used as a feature extractor and the parameters of just the last few layers are retrained to realign the output layer of the neural network for the new task. Prior works have also applied transfer learning on text classification tasks  [Do and Ng2006, Semwal et al.2018]. While transfer learning approaches exploit the learnt representations for the new task, they cannot be used to repurpose an exposed neural network for a new task without modifying some intermediate layers and parameters.

Adversarial Reprogramming [Elsayed, Goodfellow, and Sohl-Dickstein2018] studied whether it is possible to keep all the parameters of the neural network unchanged and simply learn an input transformation that realigns the outputs of the neural network for the new task. This makes it possible to repurpose exposed neural network models, such as cloud-based photo services, for a new task where transfer learning is not applicable because we do not have access to intermediate layer outputs.


Adversarial Reprogramming Problem Definition

Consider a sequence classifier $C$ trained on the original task of mapping a sequence $s \in S$ to a class label $l_S \in L_S$, i.e. $C: s \mapsto l_S$. An adversary wishes to repurpose the original classifier $C$ for the adversarial task $C'$ of mapping a sequence $t \in T$ to a class label $l_T \in L_T$, i.e. $C': t \mapsto l_T$. The adversary can achieve this by hard-coding a one-to-one label remapping function:

$$f_L: l_S \mapsto l_T$$

that maps an original task label to the new task label, and learning a corresponding adversarial reprogramming function:

$$f_\theta: t \mapsto s$$

that transforms an input from the input space $T$ of the adversarial task to the input space $S$ of the original task. The adversary aims to update the parameters $\theta$ of the adversarial program such that the mapping $f_L(C(f_\theta(t)))$ can perform the adversarial classification task $C': t \mapsto l_T$.
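
The composition described above can be sketched as follows. This is a minimal illustration, assuming the victim classifier, the adversarial program and the label remapping are each available as plain Python callables or dictionaries; `reprogram`, `toy_classifier`, `toy_program` and `toy_label_map` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Optional, Sequence


def reprogram(
    classifier: Callable[[Sequence[int]], int],
    program: Callable[[Sequence[int]], List[int]],
    label_map: Dict[int, int],
) -> Callable[[Sequence[int]], Optional[int]]:
    """Compose label_map(classifier(program(t))) into a single callable."""

    def predict(t: Sequence[int]) -> Optional[int]:
        s = program(t)                  # adversarial input -> original input space
        original_label = classifier(s)  # the victim model is used unchanged
        # Original labels without an adversarial counterpart map to None.
        return label_map.get(original_label)

    return predict


# Toy stand-ins (assumptions): a 3-class "victim" and a fixed token remapping.
def toy_classifier(s: Sequence[int]) -> int:
    return sum(s) % 3


def toy_program(t: Sequence[int]) -> List[int]:
    return [2 * token for token in t]


toy_label_map = {0: 10, 1: 11}  # the adversarial task uses only two classes

predict = reprogram(toy_classifier, toy_program, toy_label_map)
print(predict([1, 2]))  # original label 0 is remapped to adversarial label 10
```

Only `program` carries trainable parameters in this picture; the classifier and the label map stay fixed throughout.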

Adversarial Reprogramming Function

The goal of the adversarial reprogramming function $f_\theta: t \mapsto s$ is to map a sequence $t$ to a sequence $s$ such that it is labeled correctly by the classifier $f_L(C(s))$.

The tokens in the sequences $s$ and $t$ belong to vocabulary lists $V_S$ and $V_T$ respectively. We can represent the sequence $s$ as $s = s_1, s_2, \ldots, s_N$, where $s_i$ is the vocabulary index of the $i^{th}$ token of sequence $s$ in the vocabulary list $V_S$. Similarly, sequence $t$ can be represented as $t = t_1, t_2, \ldots, t_N$, where $t_i$ is the vocabulary index of the $i^{th}$ token of sequence $t$ in the vocabulary list $V_T$.

In the simplest scenario, the adversary may try to learn a vocabulary mapping from $V_T$ to $V_S$, using which each $t_i$ can be independently mapped to some $s_i$ to generate the adversarial sequence. Such an adversarial program has very limited representational capacity. We experimentally support this hypothesis by showing how such a transformation has limited potential for the purpose of adversarial reprogramming.

A more sophisticated adversarial program can be a sequence-to-sequence machine translation model [Sutskever, Vinyals, and Le2014] that learns a translation $t \mapsto s$ for adversarial reprogramming. While theoretically this is a good choice, it defeats the purpose of adversarial reprogramming, because the computational complexity of training and using such a machine translation model would be similar to, if not greater than, that of a new sequence classifier for the adversarial task $C'$.

The adversarial reprogramming function should be computationally inexpensive but powerful enough for adversarial repurposing. To this end, we propose a context-based vocabulary remapping model that produces a distribution over the target vocabulary at each time-step based on the surrounding input tokens. More specifically, we define our adversarial program as a trainable 3-d matrix $\theta$ of size $k \times |V_T| \times |V_S|$, where k is the context size. Using this, we generate a probability distribution $\pi_i$ over the vocabulary $V_S$ at each time-step $i$ as follows:

$$h_i = \sum_{j=0}^{k-1} \theta[j, t_{i + \lfloor k/2 \rfloor - j}] \qquad (2)$$

$$\pi_i = \mathrm{softmax}(h_i) \qquad (3)$$

Both $h_i$ and $\pi_i$ are vectors of length $|V_S|$. To generate the adversarial sequence $s$, we sample each $s_i$ independently from the distribution $\pi_i$.
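
As a rough illustration of the context-based remapping, the sketch below sums one score vector per context position, applies a softmax, and samples each output token independently. It is a plain-Python sketch under stated assumptions: `theta` is stored as nested lists indexed `[offset][input token][output token]`, the input is already padded, and the kernel orientation is an arbitrary convention. The helper names and the random toy weights are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(h: Sequence[float]) -> List[float]:
    m = max(h)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    return [e / z for e in exps]


def token_distributions(theta, padded_t: Sequence[int], n_out: int) -> List[List[float]]:
    """theta[j][v] holds |V_S| scores for input token v at kernel offset j.

    padded_t must contain n_out + k - 1 token indices (padding already applied).
    """
    k = len(theta)
    v_s = len(theta[0][0])
    distributions = []
    for i in range(n_out):
        h = [0.0] * v_s
        for j in range(k):
            # Kernel orientation is a convention; learned weights absorb it.
            scores = theta[j][padded_t[i + (k - 1) - j]]
            h = [a + b for a, b in zip(h, scores)]
        distributions.append(softmax(h))
    return distributions


def sample_sequence(distributions, rng: random.Random) -> List[int]:
    # Each output token is drawn independently from its own distribution.
    return [rng.choices(range(len(pi)), weights=pi)[0] for pi in distributions]


rng = random.Random(0)
k, v_t, v_s, n_out = 3, 4, 5, 6  # toy sizes (assumptions)
theta = [[[rng.gauss(0.0, 1.0) for _ in range(v_s)] for _ in range(v_t)] for _ in range(k)]
padded_t = [0, 1, 2, 3, 1, 0, 0, 0]  # n_out + k - 1 indices; 0 doubles as the dummy token
pis = token_distributions(theta, padded_t, n_out)
s = sample_sequence(pis, rng)
print(s)
```

Setting `k = 1` in this sketch makes each output distribution depend on a single input token, which corresponds to the plain vocabulary mapping discussed earlier.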

Given the maximum input length $N$ accepted by the victim model, the input sequence $t$ is padded with instances of a dummy token before its first token and after its last token so that the generated output $s$ has length $N$. For sequences with $|t| > N$, we select the first $N$ tokens of $t$ as input to the adversarial reprogramming function $f_\theta$. We demonstrate in the Experiments section that this approach works for different combinations of adversarial and original tasks with different average sequence lengths.
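
One way to realize this padding and truncation is sketched below. The split of dummy tokens (half a context window in front, the remainder behind, so that a length-`k` window fits at every one of the `max_len` output positions) is an assumption made for this example, as are the names `prepare_input` and `pad_id`.

```python
from typing import List, Sequence


def prepare_input(t: Sequence[int], max_len: int, k: int, pad_id: int) -> List[int]:
    """Truncate t to max_len tokens and pad it to max_len + k - 1 tokens."""
    tokens = list(t)[:max_len]           # keep only the first max_len tokens
    left = k // 2                        # assumed: half a window before the first token
    right = (max_len + k - 1) - left - len(tokens)
    return [pad_id] * left + tokens + [pad_id] * right


short = prepare_input([5, 6], max_len=4, k=3, pad_id=0)
long = prepare_input([1, 2, 3, 4, 5, 6], max_len=4, k=3, pad_id=0)
print(short)  # [0, 5, 6, 0, 0, 0]
print(long)   # [0, 1, 2, 3, 4, 0]
```

Both results have `max_len + k - 1 = 6` entries, so sliding a window of size `k` over them yields exactly `max_len` output positions.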

In practice, we implement this adversarial program as a single layer of 1-d convolution over the sequence of one-hot encoded vectors of the adversarial tokens $t_i$, with $|V_T|$ input channels, $|V_S|$ output channels and $k$-length kernels parameterized by $\theta$. Note that the time complexity of applying this adversarial reprogramming function (equations 2 and 3) is linear in the sequence length $N$, and the computation can be parallelized across time-steps to improve it further.

White-box Attack

In the white-box attack scenario, we assume that the adversary has gained access to the victim network’s parameters and architecture. To train the adversarial reprogramming function $f_\theta$, we use an optimization objective similar to equation 1. Let $P(l \mid s)$ denote the probability of predicting label $l$ for a sequence $s$ by classifier $C$. We wish to maximize the probability $P(f_L^{-1}(l_t) \mid f_\theta(t))$, which is the probability of the output label of the classifier being mapped to the correct class $l_t$ for an input $t$ in the domain of the adversarial task. Here $f_L^{-1}(l_t)$ denotes the original-task label that $f_L$ maps to $l_t$. Therefore we need to solve the following log-likelihood maximization problem:

$$\hat{\theta} = \arg\min_{\theta} \left( -\sum_{t} \log P(f_L^{-1}(l_t) \mid f_\theta(t)) \right)$$

Note that the output of the adversarial program $s = f_\theta(t)$ is a sequence of discrete tokens. This makes the above optimization problem non-differentiable. Prior works [Kusner and Hernández-Lobato2016, Gu, Im, and Li2017, Yang et al.2018] have demonstrated how we can smoothen such an optimization problem using the Gumbel-Softmax [Jang, Gu, and Poole2016] distribution.

In order to backpropagate the gradient information from the classifier to the adversarial program, we smoothen the generated tokens using the Gumbel-Softmax trick as follows:

For an input sequence $t$, we generate a sequence of Gumbel distributions $g = g_1, g_2, \ldots, g_N$. The $n^{th}$ component of distribution $g_i$ is generated as follows:

$$g_i^n = \frac{\exp\left((\log \pi_i^n + r_n) / \tau\right)}{\sum_{j} \exp\left((\log \pi_i^j + r_j) / \tau\right)}$$

where $\pi_i$ is the softmax distribution at the $i^{th}$ time-step obtained using equation 3, $r_n$ is a random number sampled from the Gumbel distribution [Gumbel1954] and $\tau$ is the temperature of Gumbel-Softmax.

Gumbel-Softmax approximates the one-hot vectors of the $s_i$’s with differentiable representations. The temperature parameter $\tau$ controls the flatness of this distribution. As $\tau \to 0$, the Gumbel distribution becomes close to a one-hot vector, and as $\tau \to \infty$, the Gumbel distribution approaches a uniform distribution over $|V_S|$ variables. The sequence then passed to the classifier $C$ is the sequence $g_1, g_2, \ldots, g_N$, which serves as a soft version of the one-hot encoded vectors of the $s_i$’s. Since the model is now differentiable, we can solve the following optimization problem using backpropagation:

$$\hat{\theta} = \arg\min_{\theta} \left( -\sum_{t} \log P(f_L^{-1}(l_t) \mid g_1, g_2, \ldots, g_N) \right)$$

where $l_t$ is the adversarial-task label of $t$ and $f_L^{-1}(l_t)$ is the original-task label that the remapping function $f_L$ maps to $l_t$.

During training, the temperature parameter $\tau$ is annealed from some high value $\tau_{max}$ to a very low value $\tau_{min}$. The details of this annealing process for our experiments have been included in the supplementary material.
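
The relaxation and the effect of the temperature can be sketched in plain Python as follows. This is an illustrative sketch only: the exponential annealing schedule in `annealed_temperature` is an assumption (the actual schedule is described in the supplementary material, which is not reproduced here), and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_gumbel(rng: random.Random) -> float:
    # Clamp the uniform draw so that both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(pi: Sequence[float], temperature: float, rng: random.Random) -> List[float]:
    """Return a soft one-hot vector obtained by relaxing one sample from pi."""
    logits = [(math.log(p + 1e-20) + sample_gumbel(rng)) / temperature for p in pi]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def annealed_temperature(step: int, total_steps: int, t_max: float, t_min: float) -> float:
    """Exponential interpolation from t_max to t_min (assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    return t_max * (t_min / t_max) ** frac


rng = random.Random(0)
pi = [0.7, 0.2, 0.1]
soft_hot = gumbel_softmax(pi, temperature=0.1, rng=rng)   # close to one-hot
soft_flat = gumbel_softmax(pi, temperature=1e6, rng=rng)  # close to uniform
print(soft_hot, soft_flat)
print([round(annealed_temperature(i, 5, 5.0, 0.1), 3) for i in range(5)])
```

In a white-box attack these soft vectors would be multiplied with the classifier's embedding matrix in place of hard token lookups, so that gradients can flow back to the program parameters.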

Black-box Attack

In the black-box attack scenario, the adversary can only query the victim classifier $C$ for labels. Since the adversarial program needs to produce a discrete output to feed as input to the classifier $C$, it is not possible to pass the gradient update from the classifier to the adversarial program using standard back-propagation. Also, in the black-box attack setting it is not possible to back-propagate the cross-entropy loss through the classifier in the first place.

We formulate the sequence generation problem as a Reinforcement Learning problem [Bachman and Precup2015, Bahdanau et al.2016, Yu et al.2016] where the adversarial reprogramming function $f_\theta$ is the policy network. We define the state, action, policy and reward for this problem as follows:

  • State and Action Space: The state of the adversarial program is a sequence $t \in T$, where $T$ is the input space of the adversarial task. An action of the RL agent is to produce a sequence of tokens $s \in S$, where $S$ is the input space of the original task.

  • Policy: The adversarial program, parameterized by $\theta$, models the stochastic policy $\pi_\theta(s \mid t)$, such that a sequence $s \in S$ may be sampled from this policy conditioned on $t \in T$.

  • Reward: We use a simple reward function where we assign a reward of +1 for a correct prediction and -1 for an incorrect prediction using the classifier $f_L(C(\cdot))$, where $f_L$ is the label remapping function and $C$ is the classifier. Formally, with $l_t$ denoting the correct adversarial-task label of $t$:

$$r(t, s) = \begin{cases} +1, & f_L(C(s)) = l_t \\ -1, & \text{otherwise} \end{cases}$$

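The optimization objective to train the policy network is the following:

$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{t,\, s \sim \pi_\theta(\cdot \mid t)} \left[ r(t, s) \right]$$

Following the REINFORCE algorithm [Williams1992], we can write the gradient of the expectation with respect to $\theta$ as follows:

$$\nabla_\theta \mathbb{E}[r(t, s)] = \mathbb{E}\left[ r(t, s) \, \nabla_\theta \log \pi_\theta(s \mid t) \right] = \mathbb{E}\left[ r(t, s) \, \nabla_\theta \sum_{i} \log \pi_\theta(s_i \mid t) \right]$$

Note that $\pi_\theta(s_i \mid t)$ is the probability assigned to token $s_i$ by the distribution $\pi_i$ defined in equation 3, which can be differentiated with respect to $\theta$. The expectations are estimated as sample averages. Having obtained the gradient of the expected reward, we can use mini-batch gradient ascent to update $\theta$ with a learning rate $\alpha$ as: $\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}[r(t, s)]$.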
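
The update rule can be illustrated on a deliberately tiny problem. The sketch below is not the reprogramming setup itself: it assumes a policy over a single output token parameterized directly by its logits, and a hypothetical black-box reward (`toy_reward`) that returns +1 for one particular token and -1 otherwise. It only shows how a sample-average REINFORCE gradient and a gradient-ascent step fit together.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(h: Sequence[float]) -> List[float]:
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(
    logits: List[float],
    reward_fn: Callable[[int], float],
    rng: random.Random,
    batch_size: int = 32,
    lr: float = 0.5,
) -> List[float]:
    """One mini-batch gradient-ascent step on the expected reward."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(batch_size):
        s = rng.choices(range(len(pi)), weights=pi)[0]  # sample an action
        r = reward_fn(s)                                # query the black box
        for n in range(len(logits)):
            # d/d logit_n of log pi[s] equals 1[n == s] - pi[n].
            indicator = 1.0 if n == s else 0.0
            grad[n] += r * (indicator - pi[n]) / batch_size
    return [w + lr * g for w, g in zip(logits, grad)]


def toy_reward(s: int) -> float:
    return 1.0 if s == 2 else -1.0  # hypothetical: token 2 is the "correct" output


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(300):
    logits = reinforce_step(logits, toy_reward, rng)
final_policy = softmax(logits)
print([round(p, 3) for p in final_policy])
```

In the full setting the log-probability of a sampled sequence is a sum over time-steps, and the gradient flows into the shared program parameters rather than into free-standing logits.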


Datasets and Classifiers

We demonstrate the application of the proposed reprogramming techniques on various text-classification tasks. In our experiments, we design adversarial programs to attack both word-level and character-level text classifiers. Additionally, we aim to adversarially repurpose a character-level text classifier for a word-level classification task and vice-versa. To this end, we choose the following text-classification datasets as candidates for the original and adversarial classification tasks:

  • Surname Classification Dataset (Names-18, Names-5) [Robertson2017]: The dataset categorizes surnames from 18 languages of origin. We use this dataset for the character-level classification task. We use a subset of this dataset, Names-5, containing names from 5 classes (Dutch, Scottish, Polish, Korean and Portuguese) as a candidate for the adversarial task in the experiments.

  • Experimental Data for Question Classification (Questions) [Li and Roth2002]: categorizes around 5500 questions into 6 classes: Abbreviation, Entity, Description, Human, Location, Numeric. We divide this dataset into 4361 questions for training and 1091 for testing.

  • Arabic Tweets Sentiment Classification Dataset [Abdulla et al.2013]: contains 2000 binary-labeled tweets on diverse topics such as politics and arts. The tweets in this dataset, comprising 1000 positive and 1000 negative tweets, are written in Modern Standard Arabic (MSA) and the Jordanian dialect. We use 1600 samples for training and 400 for testing.

  • Large Movie Review Dataset (IMDB) for sentiment classification [Maas et al.2011]: contains 50,000 movie reviews categorized into the binary classes of positive and negative sentiment. It is split into 25,000 reviews for training and 25,000 reviews for testing.

The statistics of the above-mentioned datasets are given in table 1. We train adversarial reprogramming functions to repurpose various text classifiers based on Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber1997], bidirectional LSTM network [Graves, Fernández, and Schmidhuber2005] and Convolutional Neural Network [Kim2014] models. All the aforementioned models can be trained for both word-level and character-level classification. We use character-level classifiers for the Names-18 and Names-5 datasets and word-level classifiers for the IMDB, Questions and Arabic Tweets datasets. We use randomly initialized word/character embeddings for all the classification models. For the LSTM, we use the output at the last time-step for prediction. For the Bi-LSTM, we combine the outputs of the first and last time-steps for prediction. For the Convolutional Neural Network we follow the same architecture as [Kim2014]. The hyper-parameter details of these classifiers have been included in table 2 of the supplementary material.

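| Data Set | # Classes | Train Samples | Test Samples | Vocab. Size | Avg Length | LSTM Test Acc. (%) | Bi-LSTM Test Acc. (%) | CNN Test Acc. (%) |
|---|---|---|---|---|---|---|---|---|
| Names-18 | 18 | 115,028 | 28,758 | 90 | 7.1 | 97.84 | 97.84 | 97.88 |
| Names-5 | 5 | 3632 | 909 | 66 | 6.5 | 99.88 | 99.88 | 99.77 |
| Questions | 6 | 4361 | 1091 | 1205 | 11.2 | 96.70 | 98.25 | 98.07 |
| Arabic Tweets | 2 | 1600 | 400 | 955 | 9.7 | 87.25 | 88.75 | 88.00 |
| IMDB | 2 | 25,000 | 25,000 | 10000 | 246.8 | 86.83 | 89.43 | 90.02 |

Table 1: Summary of datasets and test accuracy of original classification models. Vocab. Size is the vocabulary size $|V|$ of each dataset. Note that we use character-level models for Names-5 and Names-18 and word-level models for all other tasks.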

Experimental Setup

As described in the methodology section, the label remapping function $f_L$ we use is a one-to-one mapping between the labels of the original task and the adversarial task. Therefore, the number of classes of the adversarial task must be less than or equal to the number of classes of the original task. We choose Names-5, Arabic Tweets and Question Classification as candidates for the adversarial tasks and repurpose the models allowed under this constraint. We use the same context size $k$ for all our experiments.
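
The class-count constraint can be checked mechanically when the label map is built. The sketch below is a minimal illustration; `build_label_map` is a hypothetical helper, and pairing labels in list order is an arbitrary choice, since any one-to-one assignment satisfies the constraint.

```python
from typing import Dict, Hashable, Sequence


def build_label_map(
    original_labels: Sequence[Hashable],
    adversarial_labels: Sequence[Hashable],
) -> Dict[Hashable, Hashable]:
    """Pair each adversarial label with a distinct original label."""
    if len(adversarial_labels) > len(original_labels):
        raise ValueError("adversarial task has more classes than the original task")
    # zip stops at the shorter list, so surplus original labels stay unmapped.
    return dict(zip(original_labels, adversarial_labels))


# Questions (6 classes) as original task, Names-5 (5 classes) as adversarial task.
question_classes = ["Abbreviation", "Entity", "Description", "Human", "Location", "Numeric"]
name_classes = ["Dutch", "Scottish", "Polish", "Korean", "Portuguese"]
label_map = build_label_map(question_classes, name_classes)
print(label_map)
```

Reversing the roles, for example a 2-class sentiment model as the original task and the 5-class names task as the adversarial task, raises the error, which mirrors why such combinations are excluded from the experiments.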

In white-box attacks, we use the Gumbel-Softmax based approach described in the methodology to train the adversarial program. The details of the temperature annealing process are included in table 1 of the supplementary material. For black-box attacks, we use the REINFORCE algorithm described in the methodology on mini-batches of sequences. Since the action space for certain reprogramming problems (e.g. reprogramming of the IMDB classifier, whose vocabulary contains 10000 tokens) is large, we restrict the output of the adversarial program to the 1000 most frequent tokens in the vocabulary $V_S$. We use the Adam optimizer [Kingma and Ba2014] for all our experiments. Hyperparameter details of all our experiments are included in table 1 of the supplementary material.
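
Restricting the output vocabulary amounts to a frequency count over the original task's training data. The following sketch assumes the corpus is available as lists of tokens; `most_frequent_tokens` and the toy corpus are hypothetical and only illustrate the selection step.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def most_frequent_tokens(corpus: Iterable[Sequence[str]], limit: int) -> List[str]:
    """Return the `limit` most frequent tokens, most frequent first."""
    counts = Counter(token for sequence in corpus for token in sequence)
    return [token for token, _ in counts.most_common(limit)]


toy_corpus = [["a", "b", "a"], ["c", "a", "b"]]
allowed = most_frequent_tokens(toy_corpus, limit=2)
print(allowed)  # ['a', 'b']
```

With `limit=1000`, the program's output distribution only needs 1000 entries per time-step instead of one per vocabulary token, which shrinks the action space that REINFORCE has to explore.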

Results and Discussions

| Victim Model | Original Task | Adversarial Task | Black-box Test Acc. (%) | White-box Test Acc. (%) | White-box on Random Network Test Acc. (%) |
|---|---|---|---|---|---|
| LSTM | Questions | Names-5 | 80.96 | 97.03 | 44.33 |
| LSTM | Questions | Arabic Tweets | 73.50 | 87.50 | 50.00 |
| LSTM | Names-18 | Questions | 68.56 | 95.23 | 28.23 |
| LSTM | Names-18 | Arabic Tweets | 83.00 | 84.75 | 51.50 |
| LSTM | IMDB | Arabic Tweets | 80.75 | **88.25** | 50.50 |
| Bi-LSTM | Questions | Names-5 | **93.51** | **99.66** | 63.14 |
| Bi-LSTM | Questions | Arabic Tweets | 81.75 | 83.50 | 70.00 |
| Bi-LSTM | Names-18 | Questions | **94.96** | 97.15 | 80.01 |
| Bi-LSTM | Names-18 | Arabic Tweets | 78.75 | 84.25 | 69.25 |
| Bi-LSTM | IMDB | Arabic Tweets | 83.25 | 86.75 | 84.00 |
| CNN | Questions | Names-5 | 88.90 | 99.22 | 93.06 |
| CNN | Questions | Arabic Tweets | 82.25 | 87.25 | 76.25 |
| CNN | Names-18 | Questions | 71.03 | **97.61** | 33.45 |
| CNN | Names-18 | Arabic Tweets | 80.75 | 86.50 | 60.00 |
| CNN | IMDB | Arabic Tweets | **84.00** | 87.00 | 84.25 |

Table 2: Adversarial Reprogramming Experiments: The accuracies of white-box and black-box reprogramming experiments on different combinations of original task, adversarial task and model. Figures in bold correspond to our best results on a particular adversarial task in the given attack scenario (black-box or white-box). The White-box on Random Network column presents results of the white-box attack on an untrained neural network. The same context size $k$ is used for all our experiments.

The accuracies of all adversarial reprogramming experiments have been reported in table 2. To interpret the results in context, the accuracies achieved by the LSTM, Bi-LSTM and CNN text classification models on the adversarial tasks can be found in table 1.

We demonstrate how character-level models trained on the Names-18 dataset can be repurposed for word-level sequence classification tasks like Question Classification and Arabic Tweet Sentiment Classification. Similarly, word-level classifiers trained on the Question Classification dataset can be repurposed for the character-level Surname Classification task. Interestingly, classifiers trained on the IMDB Movie Review dataset can be repurposed for Arabic Tweet Sentiment Classification even though there is a large difference between the vocabulary sizes (10000 vs 955) and average sequence lengths (246.8 vs 9.7) of the two tasks. It can be seen that all three classification models are susceptible to adversarial reprogramming in both white-box and black-box settings.

White-box based reprogramming outperforms the black-box based approach in all of our experiments. Figure 3 shows the learning curves for both white-box and black-box attacks. In practice, we find that training the adversarial program in the black-box scenario requires careful hyper-parameter tuning for REINFORCE to work. We believe that improved reinforcement learning techniques for sequence generation tasks  [Bahdanau et al.2016, Bachman and Precup2015] can make the training procedure for black-box attack more stable. We propose such improvement as a direction of future research.

Figure 3: Top: Training and validation accuracy plots for 2 different white-box experiments. Bottom: Accuracy and reward plots for a black-box training experiment.

To assess the importance of the original task on which the network was trained, we also present results of white-box adversarial reprogramming on an untrained random network. Our results are consistent with similar experiments on adversarial reprogramming of untrained ImageNet models [Elsayed, Goodfellow, and Sohl-Dickstein2018], demonstrating that adversarial reprogramming is less effective when it targets untrained networks. The figures in table 2 suggest that the representations learned by training a text classifier on an original task are important for repurposing it for an alternate task. However, another plausible reason, as discussed by [Elsayed, Goodfellow, and Sohl-Dickstein2018], is that the reduced performance on random networks might be due to simpler causes such as poor scaling of the network weight initialization, which makes the optimization problem harder.

Adversarial Sequences:

Figure 4 shows some adversarial sequences generated by the adversarial program for Names-5 Classification while attacking a CNN trained on the Question Classification dataset. A sequence in the first column is transformed into the adversarial sequence in the second column by the trained adversarial reprogramming function. Note that in contrast to traditional adversarial examples, the generated adversarial sequences need not be constrained to a small perturbation of a valid input sequence of the original task. While these adversarial sequences may not make semantic or grammatical sense, they exploit the learned representation of the classifier to map the inputs to the desired class. For example, sequences that should be mapped to the HUMAN class have words like Who in the generated adversarial sequence. Similarly, sequences that should be mapped to the LOCATION class have words like world and city in the adversarial sequence. Other such interpretable transformations are depicted via colored text in the adversarial sequences of Figure 4.

Figure 4: Adversarial sequences generated by our adversarial program for Names-5 Classification (adversarial task), when targeting a CNN trained on the Question Classification dataset (original task). Interpretable transformations are shown as colored words in the second column. Adversarial program outputs that are mapped to the same class are depicted with the same color in the second column.

Figure 5: Accuracy vs Context size ($k$) plots for all 3 classification models on 2 different adversarial reprogramming experiments.

Effect of Context Size:

By varying the context size $k$ of the convolutional kernel in our adversarial program, we are able to control the representational capacity of the adversarial reprogramming function. Figure 5 shows the percentage accuracy obtained when training the adversarial program with different context sizes $k$ on two different adversarial tasks: Arabic Tweets Classification and Name Classification. Using a context size $k = 1$ reduces the adversarial reprogramming function to simply a vocabulary remapping function from $V_T$ to $V_S$. It can be observed that the performance of the adversarial reprogramming model at $k = 1$ is significantly worse than that at higher values of $k$. While higher values of $k$ improve the performance of the adversarial program, they come at the cost of increased computational complexity and memory required for the adversarial reprogramming function. For the adversarial tasks studied in this paper, we observe that the context size used in our experiments is a reasonable choice for the adversarial program.
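
The memory cost mentioned above grows linearly with the context size, since the program is a 1-d convolution with one weight per (kernel position, input token, output token) triple. The short calculation below assumes no bias terms; the vocabulary sizes are taken from table 1 and the 1000-token output restriction, while the list of `k` values is purely illustrative.

```python
def program_parameter_count(k: int, v_t: int, v_s: int) -> int:
    """Number of weights in a k x |V_T| x |V_S| program (bias terms ignored)."""
    return k * v_t * v_s


# Arabic Tweets input vocabulary (955) mapped to the 1000 most frequent IMDB tokens.
counts = {k: program_parameter_count(k, v_t=955, v_s=1000) for k in (1, 3, 5, 7)}
for k, n in counts.items():
    print(f"k={k}: {n:,} weights")
```

Under these assumptions each additional context position adds 955,000 weights for this task pair, which is the trade-off against the accuracy gains shown in Figure 5.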


In this paper, we extend adversarial reprogramming, a new class of adversarial attacks, to target sequence classification neural networks. We introduce a novel adversarial program and present training algorithms in both white-box and black-box settings. Our results demonstrate the effectiveness of such attacks even in the more challenging black-box setting, which makes them a strong threat in real-world attack scenarios. We demonstrate, for the first time, that recurrent neural networks (RNNs) can be reprogrammed for alternate tasks, which opens the door to more ambitious problems such as repurposing them for mining cryptocurrency. Due to the threat presented by adversarial reprogramming, we recommend that future work study defenses against such attacks.