Adversarial Reprogramming of Neural Networks

06/28/2018 · Gamaleldin F. Elsayed, et al. · Google

Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as identifying a panda as a gibbon or confusing a cat with a computer. Previous adversarial examples have been designed to degrade the performance of models or to cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce adversarial attacks that instead reprogram the target model to perform a task chosen by the attacker---without the attacker needing to specify or compute the desired output for each test-time input. This attack is accomplished by optimizing for a single adversarial perturbation, of unrestricted magnitude, that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary when processing these inputs---even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as two classification tasks: classification of MNIST and CIFAR-10 examples presented within the input to the ImageNet model.


1 Introduction

The study of adversarial examples is often motivated in terms of the danger posed by an attacker whose goal is to cause model prediction errors by making a small change to the model’s input. Such an attacker could make a self-driving car react to a phantom stop sign evtimov2017robust by means of a sticker (a small perturbation), or cause an insurance company’s damage model to overestimate the claim value from the resulting accident by subtly doctoring photos of the damage (a small perturbation). With this context in mind, various methods have been proposed both to construct szegedy2013intriguing; papernot2015limitations; papernot2017practical; papernot2016transferability; brown2017adversarial; liu2016delving and defend against goodfellow2014explaining; kurakin2016mlatscale; madry2017towards; ensemble_training; kolter2017provable; kannan2018adversarial this style of adversarial attack. Thus far, the majority of adversarial attacks have consisted of untargeted attacks that aim to degrade the performance of a model without necessarily requiring it to produce a specific output, or targeted attacks in which the attacker designs an adversarial perturbation of an input to produce a specific output for that input. For example, an attack against a classifier might target a specific desired output class for each input image, or an attack against a reinforcement learning agent might induce that agent to enter a specific state lin2017tactics.

In this work, we consider a more complicated attacker goal: inducing the model to perform a task chosen by the attacker, without the attacker needing to compute the specific desired output. Consider a model trained to perform some original task: for inputs $x$ it produces outputs $f(x)$. Consider an adversary who wishes to perform an adversarial task: for inputs $\tilde{x}$ (not necessarily in the same domain as $x$) the adversary wishes to compute a function $g(\tilde{x})$. We show that an adversary can accomplish this by learning adversarial reprogramming functions $h_f(\cdot;\theta)$ and $h_g(\cdot;\theta)$ that map between the two tasks. Here, $h_f$ converts inputs from the domain of $\tilde{x}$ into the domain of $x$ (i.e., $h_f(\tilde{x};\theta)$ is a valid input to the function $f$), while $h_g$ maps the output $f(h_f(\tilde{x};\theta))$ back to an output of $g(\tilde{x})$. The parameters $\theta$ of the adversarial program are then adjusted to achieve $h_g(f(h_f(\tilde{x};\theta))) = g(\tilde{x})$.

In our work, for simplicity, and to obtain highly interpretable results, we define $\tilde{x}$ to be a small image, $g$ a function that processes small images, $x$ a large image, and $f$ a function that processes large images. Our function $h_f$ then just consists of drawing $\tilde{x}$ in the center of the large image and $\theta$ in the borders, and $h_g$ is simply a hard-coded mapping between output class labels. However, the idea is more general; $h_f$ ($h_g$) could be any consistent transformation that converts between the input (output) formats for the two tasks and causes the model to perform the adversarial task.
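As a concrete illustration of this particular choice of $h_f$ and $h_g$, the short sketch below (not the authors' code) embeds a small grayscale task image into the center of a large canvas and hard-codes a label mapping. The 28x28 and 299x299x3 sizes and all helper names are illustrative assumptions.

```python
import numpy as np

LARGE, SMALL = 299, 28   # assumed ImageNet-style input width and small-task image width

def h_f(x_small, theta):
    """h_f: draw the small image in the center of a large canvas, and the
    adversarial program theta (a LARGE x LARGE x 3 array) in the borders."""
    canvas = theta.copy()
    off = (LARGE - SMALL) // 2
    # Replicate the grayscale image across the 3 color channels.
    canvas[off:off + SMALL, off:off + SMALL, :] = x_small[..., None]
    return canvas

def h_g(imagenet_label):
    """h_g: hard-coded mapping from ImageNet labels back to adversarial-task
    labels; here adversarial class i is simply ImageNet class i (first 10 classes)."""
    return imagenet_label if imagenet_label < 10 else None  # None: unmapped class

# Example usage with random data standing in for a real task image and program.
x_small = np.random.rand(SMALL, SMALL)
theta = np.random.uniform(-1.0, 1.0, size=(LARGE, LARGE, 3))
x_large = h_f(x_small, theta)   # a valid input for the large-image model f
print(x_large.shape, h_g(3))    # (299, 299, 3) 3
```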

We refer to the class of attacks where a machine learning algorithm is repurposed to perform a new task as adversarial reprogramming. We refer to $\theta$ as an adversarial program. In contrast to most previous work on adversarial examples, the magnitude of this perturbation need not be constrained. The attack does not need to be imperceptible to humans, or even subtle, in order to be considered a success. Potential consequences of adversarial reprogramming include theft of computational resources from public-facing services, and repurposing of AI-driven assistants into spies or spam bots. Risks stemming from this type of attack are discussed in more detail in Section 5.3.

It may seem unlikely that an additive offset to a neural network’s input would be sufficient on its own to repurpose the network to a new task. However, this flexibility stemming only from changes to a network’s inputs is consistent with results on the expressive power of deep neural networks. For instance, raghu2016expressive show that, depending on network hyperparameters, the number of unique output patterns achievable by moving along a one-dimensional trajectory in input space increases exponentially with network depth. Further, li2018measuring show that networks can be trained to high accuracy on common tasks even if parameter updates are restricted to occur only in a low-dimensional subspace. An additive offset to a neural network’s input is equivalent to a modification of its first-layer biases (for a convolutional network with biases shared across space, this operation effectively introduces new parameters because the additive input is not subject to the sharing constraint), and therefore an adversarial program corresponds to an update in a low-dimensional parameter subspace. Finally, successes in transfer learning have shown that representations in neural networks can generalize to surprisingly disparate tasks. The task of reprogramming a trained network may therefore be easier than training a network from scratch, a hypothesis we explore experimentally.
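The equivalence between an additive input offset and extra first-layer biases can be checked numerically. The sketch below is a minimal demonstration under assumed layer sizes (it is not taken from the paper): for a convolutional first layer, the change in pre-activations caused by a fixed additive input is identical for every input, i.e., the offset acts as a spatially varying bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # stand-in first layer
delta = torch.randn(1, 3, 32, 32)                  # fixed additive "program"

x1, x2 = torch.randn(2, 1, 3, 32, 32)              # two different inputs
shift1 = conv(x1 + delta) - conv(x1)               # effect of the offset on input 1
shift2 = conv(x2 + delta) - conv(x2)               # effect of the offset on input 2

bias = conv.bias.view(1, -1, 1, 1)
print(torch.allclose(shift1, shift2, atol=1e-5))              # True: same for all inputs
print(torch.allclose(shift1, conv(delta) - bias, atol=1e-5))  # True: equals W * delta
```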

In this paper, we present the first instances of adversarial reprogramming. In Section 2, we discuss related work. In Section 3, we present a training procedure for crafting adversarial programs, which cause a neural network to perform a new task. In Section 4, we experimentally demonstrate adversarial programs that target several convolutional neural networks designed to classify ImageNet data. These adversarial programs alter the network function from ImageNet classification to: counting squares in an image, classifying MNIST digits, and classifying CIFAR-10 images. We additionally examine the susceptibility of trained and untrained networks to adversarial reprogramming. Finally, in Sections 5 and 6 we discuss and summarize our results.

2 Background and Related Work

2.1 Adversarial examples

One definition of adversarial examples is that they are “inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake” goodfellow2017. They are often formed by starting with a naturally occurring image and using a gradient-based optimizer to search for a nearby image that causes a mistake Biggio13; szegedy2013intriguing. These attacks can be either untargeted (the adversary succeeds if they cause any mistake at all) or targeted (the adversary succeeds only if they cause the model to recognize the input as belonging to a specific incorrect class). Adversarial attacks have also been proposed for other domains, such as malware detection grosse17, generative models kos2017adversarial, network policies for reinforcement learning tasks huang2017adversarial, and network interpretations ghorbani2017interpretation. In these domains, the attack remains either untargeted (generally degrading the performance) or targeted (producing a specific output). We extend this line of work by developing reprogramming methods that aim to produce specific functionality rather than a specific hardcoded output.

Several authors have observed that the same modification can be applied to many different inputs in order to form adversarial examples goodfellow2014explaining; moosavi2017universal. For example, brown2017adversarial designed an “adversarial patch” that can switch the prediction of many models to one specific class (e.g. toaster) when it is placed physically in their field of view. We continue this line of work by finding a single adversarial program that can be presented with many input images to cause the model to process each image according to the adversarial program.

2.2 Transfer Learning

Transfer learning is a well-studied topic in machine learning raina2007self; mesnil2011unsupervised. The goal of transfer learning is to use knowledge obtained from one task to perform another. Neural networks possess properties that can be useful for many tasks yosinski2014transferable. For example, neural networks trained on images develop features that resemble Gabor filters in early layers, even if they are trained with different datasets or different training objectives, such as supervised image classification krizhevsky2012imagenet, unsupervised density learning lee2009convolutional, or unsupervised learning of sparse representations le2011ica. Empirical work has demonstrated that it is possible to take a convolutional neural network trained to perform one task and simply train a linear SVM classifier on its features to make the network work for other tasks razavian2014cnn; donahue2014decaf. These findings suggest that repurposing a neural network may not require retraining all of its weights. Instead, the adversary's task may reduce to designing a perturbation that effectively realigns the output of the network for the new task. The open question is whether this can be accomplished solely through additive adversarial contributions to the network's inputs.

Figure 1: Illustration of adversarial reprogramming. (a) Mapping of ImageNet labels to adversarial task labels (the number of squares in an image). (b) Images from the adversarial task (left) are embedded at the center of an adversarial program (middle), yielding adversarial images (right). The adversarial program shown repurposes an Inception V3 network to count squares in images. (c) Illustration of inference with adversarial images. The network, when presented with adversarial images, will predict ImageNet labels that map to the adversarial task.

3 Methods

The attack scenario that we propose here is that an adversary has gained access to the parameters of a neural network that is performing a specific task, and wishes to manipulate the function of the network using an adversarial program that can be added to the network input in order to cause the network to perform a new task. Here we assume that the network was originally designed to perform ImageNet classification, but the methods discussed here can be directly extended to other settings.

Our adversarial program is formulated as an additive contribution to the network's input. Note that unlike most adversarial perturbations, the adversarial program is not specific to a single image. The same adversarial program will be applied to all images. We define the adversarial program as:

$$P = \tanh\left(W \odot M\right) \tag{1}$$

where $W \in \mathbb{R}^{n \times n \times 3}$ are the adversarial program parameters to be learned, $n$ is the ImageNet image width, and $M$ is a masking matrix that is 0 for image locations that correspond to the adversarial data for the new task, and 1 otherwise. Note that the mask $M$ is not required – we mask out the central region of the adversarial program purely to improve visualization of the action of the adversarial program. Also, note that we use $\tanh(\cdot)$ to bound the adversarial perturbation to be in $(-1, 1)$ – the same range as the (rescaled) ImageNet images the target networks are trained to classify.

Let $\tilde{x} \in \mathbb{R}^{k \times k \times 3}$ be a sample from the dataset to which we wish to apply the adversarial task, where $k < n$. $\tilde{X} \in \mathbb{R}^{n \times n \times 3}$ is the equivalent ImageNet-size image with $\tilde{x}$ placed in the proper area, defined by the mask $M$. The corresponding adversarial image is then:

$$X_{adv} = h_f(\tilde{x}; W) = \tilde{X} + P \tag{2}$$
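The PyTorch sketch below (illustrative, not the original TensorFlow implementation) makes Equations 1 and 2 concrete; the widths n and k and the uniform placeholder data are assumptions.

```python
import torch

n, k = 299, 28                                 # assumed ImageNet width and task-data width
W = torch.zeros(n, n, 3, requires_grad=True)   # adversarial program parameters (Eq. 1)

# Mask M: 0 over the central k x k region that will hold the adversarial data, 1 elsewhere.
M = torch.ones(n, n, 3)
off = (n - k) // 2
M[off:off + k, off:off + k, :] = 0.0

P = torch.tanh(W * M)                          # Equation 1: program bounded to (-1, 1)

# X_tilde: ImageNet-size image with the small sample x_tilde placed in the masked region.
x_tilde = torch.rand(k, k, 3)                  # stand-in for a sample from the new task
X_tilde = torch.zeros(n, n, 3)
X_tilde[off:off + k, off:off + k, :] = x_tilde

X_adv = X_tilde + P                            # Equation 2: the adversarial image
```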

Let $P(y|X)$ be the probability that an ImageNet classifier gives to ImageNet label $y \in \{1, \ldots, 1000\}$, given an input image $X$. We define a hard-coded mapping function $h_g(y_{adv})$ that maps a label from an adversarial task $y_{adv}$ to a set of ImageNet labels. For example, if an adversarial task has 10 different classes ($y_{adv} \in \{1, \ldots, 10\}$), $h_g(\cdot)$ may be defined to assign the first 10 classes of ImageNet, any other 10 classes, or multiple ImageNet classes to the adversarial labels. Our adversarial goal is thus to maximize the probability $P(h_g(y_{adv})|X_{adv})$. We set up our optimization problem as

$$\hat{W} = \arg\min_W \left( -\log P\!\left(h_g(y_{adv}) \mid X_{adv}\right) + \lambda \lVert W \rVert_F^2 \right) \tag{3}$$

where $\lambda$ is the coefficient for a weight norm penalty, to reduce overfitting. We optimize this loss with Adam while exponentially decaying the learning rate. Hyperparameters are given in Appendix A. Note that after the optimization the adversarial program has a minimal computation cost from the adversary’s side, as it only requires computing $X_{adv}$ (Equation 2) and mapping the resulting ImageNet label to the correct class. In other words, during inference the adversary needs only store the program and add it to the data, thus leaving the majority of computation to the target network.
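A compact training loop implementing the optimization in Equation 3 might look like the sketch below. It is illustrative rather than the authors' implementation: `model` is assumed to be a frozen ImageNet classifier returning 1000 logits, `embed` stands in for placing a batch of small task images into ImageNet-size canvases (cf. $h_f$), and the hyperparameter values are placeholders (the actual values are listed in Appendix A).

```python
import torch
import torch.nn.functional as F

def train_program(model, loader, W, M, embed, lam=1e-4,
                  lr=0.05, decay=0.96, epochs=20):
    """W: (n, n, 3) program parameters; M: the mask from Eq. 1; loader yields
    batches of small task images and adversarial labels in {0, ..., 9}."""
    opt = torch.optim.Adam([W], lr=lr)                       # only W is optimized
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    for _ in range(epochs):
        for x_tilde, y_adv in loader:
            P = torch.tanh(W * M)                            # Equation 1
            X_adv = embed(x_tilde) + P                       # Equation 2 (broadcast over batch)
            logits = model(X_adv)                            # frozen ImageNet classifier
            # h_g maps adversarial label i to ImageNet label i, so cross-entropy
            # against y_adv maximizes log P(h_g(y_adv) | X_adv).
            loss = F.cross_entropy(logits, y_adv) + lam * (W ** 2).sum()
            opt.zero_grad()
            loss.backward()                                  # gradients flow only into W
            opt.step()
        sched.step()                                         # exponential learning-rate decay
    return W
```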

One interesting property of adversarial reprogramming is that it must exploit nonlinear behavior of the target model. This is in contrast to traditional adversarial examples, where attack algorithms based on linear approximations of deep neural nets are sufficient to cause a high error rate goodfellow2014explaining. Consider a linear model that receives an input $x$ and a program $\theta$ concatenated into a single vector $[x; \theta]$. Suppose that the weights of the linear model are partitioned into two sets, $W = [W_x; W_\theta]$. The output of the model is $W^\top [x; \theta] = W_x^\top x + W_\theta^\top \theta$. The adversarial program adapts the effective biases $W_\theta^\top \theta$ but cannot adapt the weights applied to the input $x$. The adversarial program can thus bias the model toward consistently outputting one class or the other but cannot change the way the input is processed. For adversarial reprogramming to work, the model must contain a term that involves nonlinear interactions between $x$ and $\theta$. A deep neural net with nonlinear activation functions satisfies this requirement.
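The following small NumPy check illustrates the linear-model argument numerically (the dimensions are arbitrary): the program contributes only a constant offset $W_\theta^\top \theta$, identical for every input $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_theta, n_classes = 5, 7, 3
W_x = rng.normal(size=(d_x, n_classes))          # weights applied to the input x
W_theta = rng.normal(size=(d_theta, n_classes))  # weights applied to the program theta
theta = rng.normal(size=d_theta)                 # the "program"

for _ in range(3):
    x = rng.normal(size=d_x)
    out = np.concatenate([x, theta]) @ np.vstack([W_x, W_theta])
    # The theta-dependent term is the same constant offset for every x.
    print(np.allclose(out, x @ W_x + theta @ W_theta))       # True
```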

4 Results

Figure 2: Examples of adversarial programs for MNIST classification. (a-f) Adversarial programs which cause six ImageNet models to instead function as MNIST classifiers. Each program is shown being applied to one MNIST digit.

To demonstrate the feasibility of adversarial reprogramming, we crafted adversarial programs targeted at six ImageNet models. In each case, we reprogrammed the network to perform three different adversarial tasks: counting squares, MNIST classification, and CIFAR-10 classification. The weights of all trained models were obtained from tfslim, and top-1 ImageNet precisions are shown in Table Supp. 1. We additionally examined whether adversarial training conferred resistance to adversarial reprogramming, and compared the susceptibility of trained networks to random networks.

4.1 Counting squares

To illustrate the adversarial reprogramming procedure, we start with a simple adversarial task: counting the number of squares in an image. We generated small images containing white squares with black frames. Each square could appear at 16 different positions in the image, and the number of squares ranged from 1 to 10. The squares were placed randomly on gridpoints (Figure 1b, left). We embedded these images in an adversarial program (Figure 1b, middle). The resulting ImageNet-sized adversarial images contain the square-counting images at the center (Figure 1b, right). Thus, the adversarial program is simply a frame around the counting-task images. We trained one adversarial program per ImageNet model, such that the first 10 ImageNet labels represent the number of squares in each image (Figure 1c). Note that the labels we used from ImageNet have no relation to the labels of the new adversarial task. For example, a ‘White Shark’ has nothing to do with counting 3 squares in an image, and an ‘Ostrich’ does not at all resemble 10 squares. We then evaluated accuracy on the task by sampling 100,000 images and comparing the network prediction to the number of squares in each image.
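For illustration, a generator for this counting task might look like the sketch below. The canvas and square sizes are assumptions, not the exact values used in the paper; only the 4×4 grid of candidate positions (16 positions) and the 1-10 square counts follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
CELL = 9     # assumed pixel size of one grid cell holding a framed square
GRID = 4     # 4 x 4 grid -> 16 candidate positions

def sample_counting_image():
    """Return a small black image with 1-10 white squares (black frames) on gridpoints."""
    img = np.zeros((GRID * CELL, GRID * CELL), dtype=np.float32)
    n_squares = int(rng.integers(1, 11))
    cells = rng.choice(GRID * GRID, size=n_squares, replace=False)
    for c in cells:
        row, col = divmod(int(c), GRID)
        y, x = row * CELL, col * CELL
        # Interior is white; the 1-pixel border is left black as the frame.
        img[y + 1:y + CELL - 1, x + 1:x + CELL - 1] = 1.0
    return img, n_squares      # label = number of squares

img, label = sample_counting_image()
print(img.shape, label)
```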

Despite the dissimilarity of ImageNet labels and adversarial labels, and despite the fact that the adversarial program is equivalent simply to a first-layer bias, the adversarial program masters this counting task for all networks (Table 1). These results demonstrate the vulnerability of neural networks to reprogramming on this simple task using only additive contributions to the input.

4.2 MNIST classification

Figure 3: Examples of adversarial images for CIFAR-10 classification. An adversarial program repurposing an Inception V3 model to instead function as a CIFAR-10 classifier is shown being applied to four CIFAR-10 images.
ImageNet Model    Counting    MNIST train set    MNIST test set    CIFAR-10 train set    CIFAR-10 test set
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Inception V3 adv.
Table 1: Trained ImageNet classifiers can be adversarially reprogrammed to perform a variety of tasks. Table gives accuracy of reprogrammed networks on a counting task, MNIST classification task, and CIFAR-10 classification task.

In this section, we demonstrate adversarial reprogramming on the somewhat more complex task of classifying MNIST digits. We measure accuracy on both the training and test sets, so that we can check that the adversarial program is not simply memorizing training examples. Similar to the counting task, we embedded MNIST digits of size 28×28 inside a frame representing the adversarial program, assigned the first 10 ImageNet labels to the MNIST digits, and trained an adversarial program for each ImageNet model. Figure 2 shows examples of the adversarial program for each network being applied.
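Evaluation then simply presents adversarial images to the ImageNet model and reads the prediction back through the hard-coded label map. The sketch below reuses the placeholder names (`model`, `embed`, `W`, `M`) from the training sketch in Section 3 and is illustrative only.

```python
import torch

@torch.no_grad()
def reprogrammed_accuracy(model, loader, W, M, embed):
    """Accuracy of the reprogrammed network on the adversarial task (e.g. MNIST)."""
    P = torch.tanh(W * M)                        # the learned adversarial program
    correct = total = 0
    for x_tilde, y_adv in loader:                # small task images, labels in {0, ..., 9}
        logits = model(embed(x_tilde) + P)       # 1000 ImageNet logits
        pred_imagenet = logits.argmax(dim=1)
        # Inverse of h_g: adversarial class i was assigned ImageNet class i, so any
        # prediction outside the first 10 ImageNet classes counts as an error.
        correct += (pred_imagenet == y_adv).sum().item()
        total += y_adv.numel()
    return correct / total
```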

Our results show that ImageNet networks can be successfully reprogrammed to function as MNIST classifiers by presenting an additive adversarial program. The adversarial program additionally generalized well from the training set to the test set, suggesting that the reprogramming does not function purely by memorizing training examples, and is not brittle to small changes in the input. One interesting observation is that the adversarial programs targeted at Inception architectures are qualitatively different from those targeted at ResNet architectures (Figure 2). This suggests that the method of action of the adversarial program is, in some sense, architecture-specific.

4.3 CIFAR-10 classification

Here we implement a more challenging adversarial task: crafting adversarial programs to repurpose ImageNet models to instead classify CIFAR-10 images. Some examples of the resulting adversarial images are given in Figure 3. Our results show that the adversarial programs were able to increase the accuracy on CIFAR-10 from chance to a moderate level (Table 1). This accuracy is near what is expected from typical fully connected networks lin2015far, but with minimal computational cost on the adversary's side at inference time. One observation is that although the adversarial programs trained to classify CIFAR-10 differ from those that classify MNIST or perform the counting task, the programs show some visual similarities; for example, adversarial programs for the ResNet architectures seem to possess some low-spatial-frequency texture (Figure 4a).

Figure 4: Adversarial programs exhibit qualitative similarities and differences across both network and task. (a) Top: adversarial programs targeted to repurpose networks pre-trained on ImageNet to count squares in images. Middle: adversarial programs targeted to repurpose networks pre-trained on ImageNet to function as MNIST classifiers. Bottom: adversarial programs to cause the same networks to function as CIFAR-10 classifiers. (b) Adversarial programs targeted to repurpose networks with randomly initialized parameters to function as MNIST classifiers.

4.4 Reprogramming untrained and adversarially trained networks

One important question is the degree to which susceptibility to adversarial reprogramming depends on the details of the model being attacked. To test this, we first examined attack success on an Inception V3 model that was trained on ImageNet data using adversarial training ensemble_training. Adversarial training augments each minibatch with adversarial examples during training, and is one of the most common methods for guarding against adversarial examples. As in Section 4.2, we adversarially reprogrammed this network to classify MNIST digits. Our results (Table 1) indicate that the model trained with adversarial training is still vulnerable to reprogramming, with only a slight reduction in attack success. This shows that a standard approach to adversarial defense has little efficacy against adversarial reprogramming. This finding is likely explained by the differences between adversarial reprogramming and standard adversarial attacks: first, the goal is to repurpose the network rather than cause it to make a specific mistake; second, the magnitude of an adversarial program can be large, whereas traditional adversarial attacks use small perturbation magnitudes.

To further explore dependence on the details of the model, we performed adversarial reprogramming attacks on models with random weights. We used the same models and MNIST target task as in Section 4.2 – we simply used the ImageNet models with randomly initialized rather than trained weights. The MNIST classification task was easy for networks pretrained on ImageNet (Table 1). However, for random networks, training was very challenging and generally converged to much lower accuracy (only one model could be trained to an accuracy similar to that of the trained ImageNet models; see Table 2). Moreover, the appearance of the adversarial programs was qualitatively distinct from the adversarial programs obtained with networks pretrained on ImageNet (see Figure 4b).

This finding suggests that the original task the neural networks perform is important for adversarial reprogramming. This result may seem surprising, as random networks have rich structure that adversarial programs might be expected to take advantage of. For example, theoretical results have shown that wide neural networks behave as Gaussian processes, where training specific weights in intermediate layers is not necessary to perform tasks matthews2018gaussian; lee2017deep. Other work has demonstrated that it is possible to use random networks as generative models for images ustyuzhaninov2016texture; he2016powerful, further supporting their potential richness. On the other hand, ideas from transfer learning suggest that networks generalize best to tasks with similar structure. Our experimental results suggest that the structure in our three adversarial tasks is similar enough to that in ImageNet that the adversarial program can benefit from training of the target model on ImageNet. They also suggest that it is possible for changes to the input of the network to take advantage of that similarity, rather than changes to the output layer as is more typical in transfer learning. However, another plausible hypothesis is that randomly initialized networks perform poorly for simpler reasons, such as poor scaling of network weights at initialization.

Random Model    MNIST train set    MNIST test set
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Table 2: Adversarial reprogramming is less effective when it targets untrained networks. Table gives accuracy of reprogrammed networks on an MNIST classification task. Target networks have been randomly initialized, and have not been trained.

5 Discussion

5.1 Flexibility of trained neural networks

We found that trained neural networks were more susceptible to adversarial reprogramming than random networks. This suggests that the adversarial program is repurposing learned features which already exist in the network for a new task. This can be seen as a novel form of transfer learning, where the inputs to the network (equivalent to first layer biases) are modified, rather than the readout weights as is more typical. Our results suggest that dynamical reuse of neural circuits should be practical in modern artificial neural networks. This holds the promise of enabling machine learning systems which are easier to repurpose, more flexible, and more efficient due to shared compute. Indeed, recent work in machine learning has focused on building large dynamically connected networks with reusable components shazeer2017outrageously.

It is unclear whether the reduced performance when targeting random networks, and when reprogramming to perform CIFAR-10 classification, was due to limitations in the expressivity of the adversarial perturbation, or due to the optimization task in Equation 3 being more difficult in these situations. Disentangling limitations in expressivity and trainability will be an interesting direction for future work.

5.2 Beyond the image domain

We only demonstrated adversarial reprogramming on tasks in the image domain. It is an interesting area for future research whether similar attacks might succeed for audio, video, text, or other domains. Adversarial reprogramming of recurrent neural networks (RNNs) would be particularly interesting, since RNNs (especially those with attention or memory mechanisms) can be Turing complete neelakantan2015neural. An attacker would therefore only need to find inputs which induced the RNN to perform a small number of simple operations, such as increment counter, decrement counter, and change input attention location if counter is zero minsky1961recursive. If adversarial programs can be found for these simple operations, then they could be composed to reprogram the RNN to perform a very large array of tasks.

5.3 Potential goals of an adversarial reprogramming attack

A variety of nefarious ends may be achievable if machine learning systems can be reprogrammed by a specially crafted input. The most direct of these is the simple theft of computational resources. For instance, an attacker might develop an adversarial program which causes the computer vision classifier in a cloud hosted photos service to solve image captchas and enable creation of spam accounts. If RNNs can be flexibly reprogrammed as described in Section 5.2, this computational theft might extend to more arbitrary tasks, such as mining cryptocurrency. A major danger beyond the computational theft is that an adversary may repurpose computational resources to perform a task which violates the code of ethics of the system provider.

Adversarial programs could also be used as a novel way to achieve more traditional computer hacks. For instance, as phones increasingly act as AI-driven digital assistants, the plausibility of reprogramming someone’s phone by exposing it to an adversarial image or audio file increases. As these digital assistants have access to a user’s email, calendar, social media accounts, and credit cards the consequences of this type of attack also grow larger.

6 Conclusion

In this work, we proposed a new class of adversarial attacks that aim to reprogram neural networks to perform novel adversarial tasks. Our results demonstrate, for the first time, the possibility of such attacks. These results highlight both surprising flexibility and surprising vulnerability in deep neural networks. Future investigation should address the properties and limitations of adversarial reprogramming and possible ways to defend against it.

Acknowledgments

We are grateful to Jaehoon Lee, Sara Hooker, Simon Kornblith, and Supasorn Suwajanakorn for useful comments on the manuscript. We thank Alexey Kurakin for help reviewing the code. We thank Justin Gilmer and Luke Metz for discussion surrounding the original idea.

Appendix A Supplementary Tables

Model Accuracy
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Inception V3 adv.
Table Supp. 1: Top-1 precision of models on ImageNet data
ImageNet Model batch GPUS learn rate decay epochs/decay steps
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Table Supp. 2: Hyper-parameters for adversarial program training for the square-counting adversarial task. For all models, we used the Adam optimizer with its default parameters while decaying the learning rate exponentially during training. We distributed training data across a number of GPUs (each GPU received ‘batch’ data samples). We then performed synchronized updates of the adversarial program parameters.
ImageNet Model batch GPUS learn rate decay epochs/decay steps
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Inception V3 adv.
Table Supp. 3: Hyper-parameters for adversarial program training for the MNIST classification adversarial task. For all models, we used the Adam optimizer with its default parameters while decaying the learning rate exponentially during training. We distributed training data across a number of GPUs (each GPU received ‘batch’ data samples). We then performed synchronized updates of the adversarial program parameters. (The Inception V3 adv. model is pretrained on ImageNet data using the adversarial training method.)
ImageNet Model batch GPUS learn rate decay epochs/decay steps
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Table Supp. 4: Hyper-parameters for adversarial program training for the CIFAR-10 classification adversarial task. For all models, we used the Adam optimizer with its default parameters while decaying the learning rate exponentially during training. We distributed training data across a number of GPUs (each GPU received ‘batch’ data samples). We then performed synchronized updates of the adversarial program parameters.
Random Model batch GPUS learn rate decay epochs/decay steps
Inception V3
Inception V4
Inception Resnet V2
Resnet V2 152
Resnet V2 101
Resnet V2 50
Table Supp. 5: Hyper-parameters for adversarial program training for the MNIST classification adversarial task targeting randomly initialized networks. For all models, we used the Adam optimizer with its default parameters while decaying the learning rate exponentially during training. We distributed training data across a number of GPUs (each GPU received ‘batch’ data samples). We then performed synchronized updates of the adversarial program parameters.