1 Introduction
Adversarial examples have been shown to exist for a variety of deep learning architectures. (Adversarial examples are even easier to produce against most other machine learning architectures, as shown in Papernot et al. (2016), but we are focused on deep networks.) They are small perturbations of the original inputs, often barely visible to a human observer, but carefully crafted to misguide the network into producing incorrect outputs. Seminal work by Szegedy et al. (2013) and Goodfellow et al. (2014), as well as much recent work, has shown that adversarial examples are abundant and finding them is easy.
Most previous work focuses on the application of adversarial examples to the task of classification, where the deep network assigns classes to input images. The attack adds small adversarial perturbations to the original input image. These perturbations cause the network to change its classification of the input, from the correct class to some other incorrect class (possibly chosen by the attacker). Critically, the perturbed input must still be recognizable to a human observer as belonging to the original input class. (Random noise images and “fooling” images (Nguyen et al., 2014) do not belong to this strict definition of an adversarial input, although they do highlight other limitations of current classifiers.)
Deep generative models, such as Kingma & Welling (2013), learn to generate a variety of outputs, ranging from handwritten digits to faces (Kulkarni et al., 2015), realistic scenes (Oord et al., 2016), videos (Kalchbrenner et al., 2016), 3D objects (Dosovitskiy et al., 2016), and audio (van den Oord et al., 2016). These models learn an approximation of the input data distribution in different ways, and then sample from this distribution to generate previously unseen but plausible outputs.
To the best of our knowledge, no prior work has explored using adversarial inputs to attack generative models. There are two main requirements for such work: describing a plausible scenario in which an attacker might want to attack a generative model; and designing and demonstrating an attack that succeeds against generative models. We address both of these requirements in this work.
One of the most basic applications of generative models is input reconstruction. Given an input image, the model first encodes it into a lower-dimensional latent representation, and then uses that representation to generate a reconstruction of the original input image. Since the latent representation usually has far fewer dimensions than the original input, it can be used as a form of compression. The latent representation can also be used to remove some types of noise from inputs, even when the network has not been explicitly trained for denoising, because the lower dimensionality of the latent representation restricts what information the trained network is able to represent. Many generative models also allow manipulation of the generated output by sampling different latent values or modifying individual dimensions of the latent vectors without needing to pass through the encoding step.
These properties of generative networks used for input reconstruction suggest a variety of attacks that effective adversaries against generative networks would enable. Any attack that targets the compression bottleneck of the latent representation can exploit natural security vulnerabilities in applications built to use that latent representation. Specifically, if the person doing the encoding step is separated from the person doing the decoding step, the attacker may be able to cause the encoding party to believe they have encoded a particular message for the decoding party, when in reality they have encoded a different message of the attacker’s choosing. We explore this idea in more detail as it applies to compressing images using a VAE or VAE-GAN architecture.
2 Related work and background
This work focuses on adversaries for variational autoencoders (VAEs, proposed in Kingma & Welling (2013)) and VAE-GANs (VAEs composed with a generative adversarial network, proposed in Larsen et al. (2015)).
2.1 Related work on adversaries
Many adversarial attacks on classification models have been described in existing literature (Goodfellow et al., 2014; Szegedy et al., 2013). These attacks can be untargeted, where the adversary’s goal is to cause any misclassification, or the least likely misclassification (Goodfellow et al., 2014; Kurakin et al., 2016); or they can be targeted, where the attacker desires a specific misclassification. Moosavi-Dezfooli et al. (2016) gives a recent example of a strong targeted adversarial attack. Some adversarial attacks allow for a threat model where the adversary does not have access to the target model (Szegedy et al., 2013; Papernot et al., 2016), but commonly it is assumed that the attacker does have that access, in an online or offline setting (Goodfellow et al., 2014; Kurakin et al., 2016). (See Papernot et al. (2015) for an overview of different adversarial threat models.)
Given a classifier $f(x)$ and original inputs $x$, the problem of generating untargeted adversarial examples can be expressed as the following optimization: $\arg\min_{x^*} L(x, x^*)$ s.t. $f(x^*) \neq f(x)$, where $L(\cdot)$ is a chosen distance measure between examples from the input space (e.g., the $L_2$ norm). Similarly, generating a targeted adversarial attack on a classifier can be expressed as $\arg\min_{x^*} L(x, x^*)$ s.t. $f(x^*) = y_t$, where $y_t$ is some target label chosen by the attacker.
These optimization problems can often be solved with optimizers like L-BFGS or Adam (Kingma & Ba, 2015), as done in Szegedy et al. (2013) and Carlini & Wagner (2016). They can also be approximated with single-step gradient-based techniques like fast gradient sign (Goodfellow et al., 2014), fast gradient (Huang et al., 2015), or fast least likely class (Kurakin et al., 2016); or they can be approximated with iterative variants of those and other gradient-based techniques (Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016).
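As a rough illustration of the single-step gradient-based idea, the sketch below applies one fast-gradient-sign step to a toy logistic-regression classifier. The weights, the input, the step size `eps`, and the helper names (`predict`, `sign`, `fast_gradient_sign`) are assumptions made up for this example; they are not taken from the cited works or from the experiments in this paper.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fast_gradient_sign(w, b, x, y, eps):
    """Take one step of size eps along the sign of the loss gradient."""
    p = predict(w, b, x)
    # For logistic regression, d(cross-entropy)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.2, -0.1, 0.1], 1
x_adv = fast_gradient_sign(w, b, x, y, eps=0.2)

print("clean prediction:      ", round(predict(w, b, x), 3))
print("adversarial prediction:", round(predict(w, b, x_adv), 3))
```

Each input coordinate moves by at most `eps`, yet in this toy setting the predicted probability crosses the 0.5 decision threshold. Iterative variants repeat the same step with a smaller `eps`.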
An interesting variation of this type of attack can be found in Sabour et al. (2015). In that work, the authors attack the hidden state of the target network directly by taking an input image $x$ and a target image $x_t$ and searching for a perturbed variant of $x$ that generates a hidden state at layer $l$ of the target network similar to the hidden state generated at the same layer by $x_t$. This approach can also be applied directly to attacking the latent vector of a generative model.
A variant of this attack has also been applied to VAE models in the concurrent work of Tabacof et al. (2016), which uses the KL divergence between the latent representation of the source and target images to generate the adversarial example. (This work was made public shortly after we published our early drafts.) However, in their paper, the authors mention that they tried attacking the output directly and that this only managed to make the reconstructions more blurry. While they do not explain the exact experimental setting, the attack sounds similar to our $L_{VAE}$ attack, which we find very successful. Also, in their paper the authors do not consider the more advanced VAE-GAN models and more complex datasets like CelebA.
2.2 Background on VAEs and VAEGANs
The general architecture of a variational autoencoder consists of three components, as shown in Figure 8. The encoder $f_{enc}(x)$ is a neural network mapping a high-dimensional input representation $x$ into a lower-dimensional (compressed) latent representation $z$. All possible values of $z$ form a latent space. Similar values in the latent space should produce similar outputs from the decoder in a well-trained VAE. Finally, the decoder/generator $f_{dec}(z)$ is a neural network mapping the compressed latent representation back to a high-dimensional output $\hat{x}$. Composing these networks allows basic input reconstruction, $\hat{x} = f_{dec}(f_{enc}(x))$. This composed architecture is used during training to backpropagate errors from the loss function.
The variational autoencoder’s loss function $L_{VAE}$ enables the network to learn a latent representation that approximates the intractable posterior distribution $p(z|x)$:
$$L_{VAE} = -D_{KL}[q(z|x)\,\|\,p(z)] + E_q[\log p(x|z)] \qquad (1)$$
$q(z|x)$ is the learned approximation of the posterior distribution $p(z|x)$. $p(z)$ is the prior distribution of the latent representation $z$. $D_{KL}$ denotes the Kullback–Leibler divergence. $E_q[\log p(x|z)]$ is the variational lower bound, which in the case of input reconstruction is the cross-entropy $H[x, \hat{x}]$ between the inputs $x$ and their reconstructions $\hat{x}$. In order to generate $\hat{x}$, the VAE needs to sample $q(z|x)$ and then compute $f_{dec}(z)$.
For the VAE to be fully differentiable while sampling from $q(z|x)$, the reparametrization trick (Kingma & Welling, 2013) extracts the random sampling step from the network and turns it into an input, $\varepsilon$. VAEs are often parameterized with Gaussian distributions. In this case, $f_{enc}(x)$ outputs the distribution parameters $\mu$ and $\sigma^2$. That distribution is then sampled by computing $z = \mu + \varepsilon\sqrt{\sigma^2}$, where $\varepsilon \sim N(0, 1)$ is the input random sample, which does not depend on any parameters of $f_{enc}$, and thus does not impact differentiation of the network.
The VAE-GAN architecture of Larsen et al. (2015) has the same $f_{enc}$ and $f_{dec}$ pair as in the VAE. It also adds a discriminator $f_{disc}$ that is used during training, as in standard generative adversarial networks (Goodfellow et al., 2014). The loss function of $f_{dec}$ uses the discriminator loss instead of cross-entropy for estimating the reconstruction error.
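The sampling step and the two terms that appear in a Gaussian VAE loss can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the prior is taken to be a standard normal, the encoder is assumed to output a mean and a log-variance per latent dimension, and the function names (`reparameterize`, `kl_to_standard_normal`, `bernoulli_cross_entropy`) are hypothetical rather than part of the implementation used in this paper.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + eps * sigma, with eps ~ N(0, 1) drawn outside the network."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL[N(mu, sigma^2) || N(0, I)] for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def bernoulli_cross_entropy(x, x_hat, tiny=1e-12):
    """Cross-entropy between inputs in [0, 1] and their reconstructions."""
    return -sum(xi * math.log(xh + tiny) + (1.0 - xi) * math.log(1.0 - xh + tiny)
                for xi, xh in zip(x, x_hat))

# Assumed encoder outputs for a 2-dimensional latent space (illustrative values).
rng = random.Random(0)
mu, log_var = [0.5, -0.5], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)

print("sampled z:", z)
print("KL term:", kl_to_standard_normal(mu, log_var))
print("reconstruction term:", bernoulli_cross_entropy([1.0, 0.0], [0.9, 0.2]))
```

Because the randomness enters only through `eps`, gradients with respect to `mu` and `log_var` are well defined, which is the point of the reparametrization trick.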
3 Problem definition
We provide a motivating attack scenario for adversaries against generative models, as well as a formal definition of an adversary in the generative setting.
3.1 Motivating attack scenario
To motivate the attacks presented below, we describe the attack scenario depicted in Figure 1. In this scenario, there are two parties, the sender and the receiver, who wish to share images with each other over a computer network. In order to conserve bandwidth, they share a VAE trained on the input distribution of interest, which will allow them to send only latent vectors $z$.
The attacker’s goal is to convince the sender to send an image of the attacker’s choosing to the receiver, but the attacker has no direct control over the bytes sent between the two parties. However, the attacker has a copy of the shared VAE. The attacker presents an image $x^*$ to the sender which resembles an image $x$ that the sender wants to share with the receiver. For example, the sender wants to share pictures of kittens with the receiver, so the attacker presents a web page to the sender with a picture of a kitten, which is $x^*$. The sender chooses $x^*$ and sends its corresponding $z$ to the receiver, who reconstructs it. However, because the attacker controlled the chosen image, when the receiver reconstructs it, instead of getting a faithful reproduction $\hat{x}$ of $x$ (e.g., a kitten), the receiver sees some other image of the attacker’s choosing, $\hat{x}_{adv}$, which has a different meaning from $x$ (e.g., a request to send money to the attacker’s bank account).
There are other attacks of this general form, where the sender and the receiver may be separated by distance, as in this example, or by time, in the case of storing compressed images to disk for later retrieval. In the time-separated attack, the sender and the receiver may be the same person or multiple different people. In either case, if they are using the insecure channel of the VAE’s latent space, the messages they share may be under the control of an attacker. For example, an attacker may be able to fool an automatic surveillance system if the system uses this type of compression to store the video signal before it is processed by other systems. In this case, the subsequent analysis of the video signal could be performed on compromised data showing what the attacker wants to show.
3.2 Defining adversarial examples against generative models
We make the following assumptions about generating adversarial examples on a target generative model, $G_{targ}(x)$. $G_{targ}$ is trained on inputs $X$ that can naturally be labeled with semantically meaningful classes $Y$, although there may be no such labels at training time, or the labels may not have been used during training. $G_{targ}$ normally succeeds at generating an output $\hat{x} = G_{targ}(x)$ in class $y$ when presented with an input $x$ from class $y$. In other words, whatever target output class the attacker is interested in, we assume that $G_{targ}$ successfully captures it in the latent representation such that it can generate examples of that class from the decoder. This target output class does not need to be from the most salient classes in the training dataset. For example, on models trained on MNIST, the attacker may not care about generating different target digits (which are the most salient classes). The attacker may prefer to generate the same input digits in a different style (perhaps to aid forgery). We also assume that the attacker has access to $G_{targ}$. Finally, the attacker has access to a set of examples from the same distribution as $X$ that have the target label $y_t$ the attacker wants to generate. This does not mean that the attacker needs access to the labeled training dataset (which may not exist), or to an appropriate labeled dataset with large numbers of examples labeled for each class (which may be hard or expensive to collect). The attacks described here may be successful with only a small amount of data labeled for a single target class of interest.
One way to generate such adversaries is by solving the optimization problem $\arg\min_{x^*} L(x, x^*)$ s.t. $\text{Oracle}(G_{targ}(x^*)) = y_t$, where Oracle reliably discriminates between inputs of class $y_t$ and inputs of other classes. In practice, a classifier trained by the attacker may serve as Oracle. Other types of adversaries from Section 2.1 can also be used to approximate this optimization in natural ways, some of which we describe in Section 4.
If the attacker only needs to generate one successful attack, the problem of determining if an attack is successful can be solved by manually reviewing the pairs of adversarial input $x^*$ and reconstruction $\hat{x}_{adv}$ and choosing whichever the attacker considers best. However, if the attacker wants to generate many successful attacks, an automated method of evaluating the success of an attack is necessary. We show in Section 4.5 how to measure the effectiveness of an attack automatically using a classifier trained on the latent representation $z = f_{enc}(x)$.
4 Attack methodology
The attacker would like to construct an adversarially-perturbed input to influence the latent representation in a way that will cause the reconstruction process to reconstruct an output for a different class. We propose three approaches to attacking generative models: a classifier-based attack, where we train a new classifier on top of the latent space and use that classifier to find adversarial examples in the latent space; an attack using the VAE loss $L_{VAE}$ to target the output directly; and an attack on the latent space, $L_{latent}$. All three methods are technically applicable to any generative architecture that relies on a learned latent representation $z$. Without loss of generality, we focus on the VAE-GAN architecture.
4.1 Classifier attack
By adding a classifier $f_{class}$ to the pre-trained generative model (this is similar to the process of semi-supervised learning in Kingma et al. (2014), although the goal is different), we can turn the problem of generating adversaries for generative models back into the previously solved problem of generating adversarial examples for classifiers. This approach allows us to apply all of the existing attacks on classifiers in the literature. However, as discussed below, using this classifier tends to produce lower-quality reconstructions from the adversarial examples than the other two attacks due to the inaccuracies of the classifier.
Step 1. The weights of the target generative model are frozen, and a new classifier $f_{class}(z) \rightarrow \hat{y}$ is trained on top of the encoder $f_{enc}$ using a standard classification loss such as cross-entropy, as shown in Figure 3. This process is independent of how the original model is trained, but it requires a training corpus pulled from approximately the same input distribution as was used to train the target generative model, with ground truth labels for at least two classes: $y_t$ and $y_{\tilde{t}}$, the negative class.
Step 2. With the trained classifier, the attacker finds adversarial examples $x^*$ using the methods described in Section 4.4.
Using $f_{class}$ to generate adversarial examples does not always result in high-quality reconstructions, as can be seen in the middle column of Figure 5 and in Figure 11. This appears to be because $f_{class}$ adds additional noise to the process. For example, $f_{class}$ sometimes confidently misclassifies latent vectors $z$ that represent inputs that are far from the training data distribution, resulting in the decoder $f_{dec}$ failing to reconstruct a plausible output from the adversarial example.
4.2 $L_{VAE}$ attack
Our second approach generates adversarial perturbations using the VAE loss function. The attacker chooses two inputs, $x_s$ (the source) and $x_t$ (the target), and uses one of the standard adversarial methods to perturb $x_s$ into $x^*$ such that its reconstruction $\hat{x}^*$ matches the reconstruction of $x_t$, using the methods described in Section 4.4.
The adversary precomputes the reconstruction $\hat{x}_t$ by evaluating $f_{dec}(f_{enc}(x_t))$ once before performing optimization. In order to use $L_{VAE}$ in an attack, the second term (the reconstruction loss) of $L_{VAE}$ (see Equation 1) is changed so that instead of computing the reconstruction loss between $x$ and $\hat{x}$, the loss is computed between $\hat{x}^*$ and $\hat{x}_t$. This means that during each optimization iteration, the adversary needs to compute $\hat{x}^*$, which requires the full $f_{dec}(f_{enc}(x^*))$ to be evaluated.
4.3 Latent attack
Our third approach attacks the latent space of the generative model.
Single latent vector target.
This attack is similar to the work of Sabour et al. (2015), in which they use a pair of source image $x_s$ and target image $x_t$ to generate $x^*$ that induces the target network to produce similar activations at some hidden layer $l$ as are produced by $x_t$, while maintaining similarity between $x_s$ and $x^*$.
For this attack to work on latent generative models, it is sufficient to compute $z_t = f_{enc}(x_t)$ and then use the following loss function to generate adversarial examples from different source images $x_s$, using the methods described in Section 4.4:
$$L_{latent} = L(z_t, f_{enc}(x^*)) \qquad (2)$$
$L(\cdot)$ is a distance measure between two vectors. We use the $L_2$ norm, under the assumption that the latent space is approximately Euclidean.
We also explored a variation on the single latent vector target attack, which we describe in Section A.1 in the Appendix.
4.4 Methods for solving the adversarial optimization problem
We can use a number of different methods to generate the adversarial examples. We initially evaluated both the fast gradient sign method (Goodfellow et al., 2014) and an $L_2$ optimization method. As the latter produces much better results, we focus on the $L_2$ optimization method, while we include some FGS results in the Appendix. The attack can be used either in targeted mode (where we want a specific class, $y_t$, to be reconstructed) or untargeted mode (where we just want an incorrect class to be reconstructed). In this paper, we focus on the targeted mode of the attacks.
$L_2$ optimization.
The optimization-based approach, explored in Szegedy et al. (2013) and Carlini & Wagner (2016), poses the adversarial generation problem as the following optimization problem:
$$\arg\min_{x^*} \lambda L(x, x^*) + \mathcal{L}(x^*, y_t) \qquad (3)$$
As above, $L(\cdot)$ is a distance measure, and $\mathcal{L}$ is one of $L_{classifier}$, $L_{VAE}$, or $L_{latent}$. The constant $\lambda$ is used to balance the two loss contributions. For the $L_{VAE}$ attack, the optimizer must do a full reconstruction at each step of the optimizer. The other two attacks do not need to do reconstructions while the optimizer is running, so they generate adversarial examples much more quickly, as shown in Table 1.
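To make the structure of this optimization concrete, the sketch below runs a latent-space attack against a toy linear "encoder". Everything here is an assumption chosen for illustration: the encoder is a fixed 2x3 matrix `W` rather than a trained network, squared $L_2$ distances are used so the gradient has a simple closed form, the weight `lam` is placed on the input-distance term, and plain gradient descent stands in for the Adam optimizer used in the experiments.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def latent_attack(W, x_src, z_target, lam, lr=0.05, steps=500):
    """Minimize lam * ||x_adv - x_src||^2 + ||W x_adv - z_target||^2."""
    x_adv = list(x_src)
    for _ in range(steps):
        # Residual between the current latent code and the target latent code.
        resid = [zi - ti for zi, ti in zip(matvec(W, x_adv), z_target)]
        grad = []
        for j in range(len(x_adv)):
            grad_latent = 2.0 * sum(W[i][j] * resid[i] for i in range(len(W)))
            grad_dist = 2.0 * lam * (x_adv[j] - x_src[j])
            grad.append(grad_dist + grad_latent)
        x_adv = [xj - lr * gj for xj, gj in zip(x_adv, grad)]
    return x_adv

# Hypothetical toy encoder and inputs.
W = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
x_src = [1.0, 0.0, 0.0]
x_tgt = [0.0, 1.0, 0.0]
z_target = matvec(W, x_tgt)  # latent code of the target, computed once

x_adv = latent_attack(W, x_src, z_target, lam=0.1)
print("latent distance before:", round(l2(matvec(W, x_src), z_target), 3))
print("latent distance after: ", round(l2(matvec(W, x_adv), z_target), 3))
print("input distance:        ", round(l2(x_adv, x_src), 3))
```

Note that the loop only ever evaluates the encoder, never a decoder, which mirrors why the latent and classifier attacks avoid the per-step reconstruction cost. Raising `lam` keeps `x_adv` closer to the source at the cost of a worse latent match.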
4.5 Measuring attack effectiveness
To generate a large number of adversarial examples automatically against a generative model, the attacker needs a way to judge the quality of the adversarial examples. We leverage the classifier $f_{class}$ to estimate whether a particular attack was successful. (Note that here $f_{class}$ is being used in a different manner than when we use it to generate adversarial examples. However, the network itself is identical, so we don’t distinguish between the two uses in the notation.)
Reconstruction feedback loop.
The architecture is the same as shown in Figure 3. We use the generative model to reconstruct the attempted adversarial inputs $x^*$ by computing:
$$\hat{x}^* = f_{dec}(f_{enc}(x^*)) \qquad (4)$$
Then, $f_{class}$ is used to compute:
$$\hat{y}^* = f_{class}(f_{enc}(\hat{x}^*)) \qquad (5)$$
The input adversarial examples $x^*$ are not classified directly, but are first fed to the generative model for reconstruction. This reconstruction loop improves the accuracy of the classifier by on average against the adversarial attacks we examined. The predicted label $\hat{y}^*$ after the reconstruction feedback loop is compared with the attack target $y_t$ to determine if the adversarial example successfully reconstructed to the target class. If the precision and recall of $f_{class}$ are sufficiently high on , $f_{class}$ can be used to filter out most of the failed adversarial examples while keeping most of the good ones.
We derive two metrics from classifier predictions after one reconstruction feedback loop. The first metric is $AS_{\text{ignore-target}}$, the attack success rate ignoring targeting, i.e., without requiring the output class of the adversarial example to match the target class:
$$AS_{\text{ignore-target}} = \frac{1}{N} \sum_{i=1}^{N} 1_{\hat{y}^i \neq y^i} \qquad (6)$$
$N$ is the total number of reconstructed adversarial examples; $1_{\hat{y}^i \neq y^i}$ is $1$ when $\hat{y}^i$, the classification of the reconstruction for image $i$, does not equal $y^i$, the ground truth classification of the original image, and $0$ otherwise. The second metric is $AS_{\text{target}}$, the attack success rate including targeting (i.e., requiring the output class of the adversarial example to match the target class), which we define similarly as:
$$AS_{\text{target}} = \frac{1}{N} \sum_{i=1}^{N} 1_{\hat{y}^i = y_t^i} \qquad (7)$$
Both metrics are expected to be higher for more successful attacks. Note that $AS_{\text{target}} \leq AS_{\text{ignore-target}}$. When computing these metrics, we exclude input examples that have the same ground truth class as the target class.
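The two success rates can be computed directly from three parallel label lists. The following is a minimal sketch; the function name `attack_success_rates` and the example labels are hypothetical, and the predicted labels are assumed to have been obtained from the classifier after one reconstruction feedback loop.

```python
def attack_success_rates(true_labels, target_labels, predicted_labels):
    """Return (success rate ignoring targeting, success rate including targeting)."""
    # Exclude examples whose ground truth class already equals the target class.
    kept = [(y, t, p)
            for y, t, p in zip(true_labels, target_labels, predicted_labels)
            if y != t]
    n = len(kept)
    if n == 0:
        raise ValueError("no examples left after excluding source == target")
    # Fraction of reconstructions classified as anything other than the true class.
    as_ignore_target = sum(p != y for y, _, p in kept) / n
    # Fraction of reconstructions classified as the attacker's target class.
    as_target = sum(p == t for _, t, p in kept) / n
    return as_ignore_target, as_target

# Made-up labels: five adversarial examples, all targeting class 0.
true_labels = [1, 2, 3, 0, 4]
target_labels = [0, 0, 0, 0, 0]
predicted_labels = [0, 2, 7, 0, 0]

rates = attack_success_rates(true_labels, target_labels, predicted_labels)
print(rates)
```

Because examples whose source class equals the target are dropped first, every prediction that matches the target also differs from the ground truth, so the targeted rate can never exceed the untargeted one.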
5 Evaluation
We evaluate the three attacks on MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011) and CelebA (Liu et al., 2015), using the standard training and validation set splits. The VAE and VAE-GAN architectures are implemented in TensorFlow (Abadi & et al., 2015). We optimized using Adam with learning rate and other parameters set to default values for both the generative model and the classifier. For the VAE, we use two architectures: a simple architecture with a single fully-connected hidden layer with 512 units and ReLU activation function; and a convolutional architecture taken from the original VAE-GAN paper (Larsen et al., 2015), but trained with only the VAE loss. We use the same architecture trained with the additional GAN loss for the VAE-GAN model, as described in that work. For both VAE and VAE-GAN we use a 50-dimensional latent representation on MNIST, a 1024-dimensional latent representation on SVHN and a 2048-dimensional latent representation on CelebA.
In this section we only show results where no sampling from latent space has been performed. Instead we use the mean vector $\mu$ as the latent representation $z$. As sampling can have an effect on the resulting reconstructions, we evaluated it separately. We show the results with different numbers of samples in Figure 22 in the Appendix. On most examples, the visible change is small and in general the attack is still successful.
5.1 MNIST
Both VAE and VAE-GAN by themselves reconstruct the original inputs well, as shown in Figure 9, although the quality from the VAE-GAN is noticeably better. As a control, we also generate random noise of the same magnitude as used for the adversarial examples (see Figure 13), to show that random noise does not cause the reconstructed noisy images to change in any significant way. Although we ran experiments on both VAEs and VAE-GANs, we only show results for the VAE-GAN as it generates much higher quality reconstructions than the corresponding VAE.
5.1.1 Classifier attack
We use a simple classifier architecture to help generate attacks on the VAE and VAEGAN models. The classifier consists of two fullyconnected hidden layers with 512 units each, using the ReLU activation function. The output layer is a 10 dimensional softmax. The input to the classifier is the 50 dimensional latent representation produced by the VAE/VAEGAN encoder. The classifier achieves
accuracy on the validation set after training for 100 epochs.
To see if there are differences between classes, we generate targeted adversarial examples for each MNIST class and present the results perclass. For the targeted attacks we used the optimization method with lambda , where Adambased optimization was performed for epochs with a learning rate of . The mean norm of the difference between original images and generated adversarial examples using the classifier attack is , while the mean RMSD is .
Numerical results in Table 2 show that the targeted classifier attack successfully fools the classifier. Classifier accuracy is reduced to , while the matching rate (the ratio between the number of predictions matching the target class and the number of incorrectly classified images) is , which means that all incorrect predictions match the target class. However, what we are interested in (as per the attack definition from Section 3.2) is how the generative model reconstructs the adversarial examples. If we look at the images generated by the VAEGAN for class , shown in Figure 4, the targeted attack is successful on some reconstructed images (e.g. one, four, five, six and nine are reconstructed as zeroes). But even when the classifier accuracy is and matching rate is , an incorrect classification does not always result in a reconstruction to the target class, which shows that the classifier is fooled by an adversarial example more easily than the generative model.
Reconstruction feedback loop.
The reconstruction feedback loop described in Section 4.5 can be used to measure how well a targeted attack succeeds in making the generative model change the reconstructed classes. Table 4 in the Appendix shows and for all source and target class pairs. A higher value signifies a more successful attack for that pair of classes. It is interesting to observe that attacking some source/target pairs is much easier than others (e.g. pair vs. ) and that the results are not symmetric over source/target pairs. Also, some pairs do well in , but do poorly in (e.g., all source digits when targeting ). As can be seen in Figure 11, the classifier adversarial examples targeting consistently fail to reconstruct into something easily recognizable as a . Most of the reconstructions look like , but the adversarial example reconstructions of source s instead look like or .
5.1.2 $L_{VAE}$ attack
For generating adversarial examples using the attack, we used the optimization method with , where Adambased optimization was performed for epochs with a learning rate of . The mean norm of the difference between original images and generated adversarial examples with this approach is , while the mean RMSD is .
We show the attack success rates $AS_{\text{ignore-target}}$ and $AS_{\text{target}}$ of the $L_{VAE}$ attack in Table 5 in the Appendix. Comparing with the numerical evaluation results of the latent attack (below), we can see that both methods achieve similar results on MNIST.
5.1.3 Latent attack
To generate adversarial examples using the latent attack, we used the optimization method with , where Adambased optimization was performed for epochs with a learning rate of . The mean norm of the difference between original images and generated adversarial examples using this approach is , while the mean RMSD is .
Table 3 shows the attack success rates $AS_{\text{ignore-target}}$ and $AS_{\text{target}}$ for all source and target class pairs. Comparing with the numerical evaluation results of the classifier attack, we can see that the latent attack performs much better. This result remains true when visually comparing the reconstructed images, shown in Figure 5.
We also tried an untargeted version of the latent attack, where we change Equation 2 to maximize the distance in latent space between the encoding of the original image and the encoding of the adversarial example. In this case the loss we are trying to minimize is unbounded, since the distance can always grow larger, so the attack normally fails to generate a reasonable adversarial example.
We also experimented with targeting latent representations of specific images from the training set instead of taking the mean, as described in Section 4.3. We show the numerical results in Table 3 and the generated reconstructions in Figure 15 (in the Appendix). It is also interesting to compare the results with the $L_{VAE}$ attack, by choosing the same image as the target. Results for the $L_{VAE}$ attack for the same target images as in Table 3 are shown in Table 6 in the Appendix. The results are identical between the two attacks, which is expected as the target image is the same; only the loss function differs between the methods.
5.2 SVHN
The SVHN dataset consists of cropped street number images and is much less clean than MNIST. Due to the way the images have been processed, each image may contain more than one digit; the target digit is roughly in the center. VAE-GAN produces high-quality reconstructions of the original images, as shown in Figure 17 in the Appendix.
For the classifier attack, we set after testing a range of values, although we were unable to find an effective value for this attack against SVHN. For the latent and attacks we set .
In Table 10 we show the attack success rates $AS_{\text{ignore-target}}$ and $AS_{\text{target}}$ for the $L_2$ optimization latent attack. The evaluation metrics are less strong on SVHN than on MNIST, but it is still straightforward for an attacker to find a successful attack for almost all source/target pairs. Figure 2 supports this evaluation. Visual inspection shows that out of the adversarial examples reconstructed as , the target digit. It is worth noting that out of the adversarial examples look like zeros (rows and ), and two others look like both the original digit and zero, depending on whether the viewer focuses on the light or dark areas of the image (rows and ). The $L_2$ optimization latent attack achieves much better results than the $L_{VAE}$ attack (see Table 11 and Figure 6) on SVHN, while both attacks work equally well on MNIST.
5.3 CelebA
The CelebA dataset consists of more than 200,000 cropped faces of celebrities, each annotated with 40 different attributes. For our experiments, we further scale the images to 64x64 and ignore the attribute annotations. VAE-GAN reconstructions of original images after training are shown in Figure 19 in the Appendix.
Since faces don’t have natural classes, we only evaluated the latent and attacks. We tried lambdas ranging from to for both attacks. Figure 20 shows adversarial examples generated using the latent attack and a lambda value of ( norm between original images and generated adversarial examples , RMSD ) and the corresponding VAEGAN reconstructions. Most of the reconstructions reflect the target image very well. We get even better results with the attack, using a lambda value of ( norm between original images and generated adversarial examples , RMSD ) as shown in Figure 21.
5.4 Summary of different attack methods
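| Method | MNIST: Mean $L_2$ | MNIST: Mean RMSD | MNIST: Time to attack | SVHN: Mean $L_2$ | SVHN: Mean RMSD | SVHN: Time to attack |
|---|---|---|---|---|---|---|
| $L_2$ Optimization Classifier Attack | | | | | | |
| $L_2$ Optimization $L_{VAE}$ Attack | | | | | | |
| $L_2$ Optimization Latent Attack | | | | | | |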
Table 1 shows a comparison of the mean distances between original images and generated adversarial examples for the three different attack methods. The larger the distance between the original image and the adversarial perturbation, the more noticeable the perturbation will tend to be, and the more likely a human observer will no longer recognize the original input, so effective attacks keep these distances small while still achieving their goal. The latent attack consistently gives the best results in our experiments, and the classifier attack performs the worst.
We also measure the time it takes to generate adversarial examples using the given attack method. The $L_{VAE}$ attack is by far the slowest of the three, because it requires computing full reconstructions at each step of the optimizer when generating the adversarial examples. The other two attacks do not need to run the reconstruction step during optimization of the adversarial examples.
6 Conclusion
We explored generating adversarial examples against generative models such as VAEs and VAE-GANs. These models are also vulnerable to adversaries that convince them to turn inputs into surprisingly different outputs. We have also motivated why an attacker might want to attack generative models. Our work adds further support to the hypothesis that adversarial examples are a general phenomenon for current neural network architectures, given our successful application of adversarial attacks to popular generative models. In this work, we are helping to lay the foundations for understanding how to build more robust networks. Future work will explore defense and robustification in greater depth as well as attacks on generative models trained using natural image datasets such as CIFAR-10 and ImageNet.
Acknowledgments
This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Carlini & Wagner (2016) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644, 2016.
 Dosovitskiy et al. (2016) Alexey Dosovitskiy, Jost Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2567384.
 Goodfellow et al. (2014) I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.
 Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Huang et al. (2015) Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. CoRR, abs/1511.03034, 2015.
 Kalchbrenner et al. (2016) Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
 Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2015.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Kulkarni et al. (2015) Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539–2547, 2015.
 Kurakin et al. (2016) Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
 Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Moosavi-Dezfooli et al. (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. 2016.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 Nguyen et al. (2014) Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR, abs/1412.1897, 2014.
 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328, 2016.
 Papernot et al. (2015) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015.
 Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.
 Sabour et al. (2015) Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J. Fleet. Adversarial manipulation of deep representations. CoRR, abs/1511.05122, 2015. URL http://arxiv.org/abs/1511.05122.
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tabacof et al. (2016) P. Tabacof, J. Tavares, and E. Valle. Adversarial Images for Variational Autoencoders. ArXiv e-prints, December 2016.
 Toderici et al. (2015) George Toderici, Sean M O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085, 2015.
 Toderici et al. (2016) George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. arXiv preprint arXiv:1608.05148, 2016.
 van den Oord et al. (2016) Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.
Appendix A
A.1 Mean latent vector targeted attack
A variant of the single latent vector targeted attack described in Section 4.3, which to our knowledge has not been explored in previous work, is to take the mean latent vector of many target images and use that vector as $\mathbf{z}_t$. This variant is more flexible, because the attacker can choose different latent properties to target without needing to find a single ideal input. For example, in MNIST, the attacker may wish to have a particular line thickness or slant in the reconstructed digit, but may not have such an image available. In that case, by choosing some images of the target class with thinner lines or less slant, and some with thicker lines or more slant, the attacker can find a target latent vector that closely matches the desired properties.
In this case, the attack starts by using $f_{enc}$ to produce the target latent vector, $\mathbf{z}_t$, from the $K$ chosen target images, $\mathbf{x}_t^{(k)}$:
$$\mathbf{z}_t = \frac{1}{K} \sum_{k=1}^{K} f_{enc}\left(\mathbf{x}_t^{(k)}\right) \tag{8}$$
In this work, we choose to reconstruct “ideal” MNIST digits by taking the mean latent vector of all of the training digits of each class, and using those vectors as $\mathbf{z}_t$. Given a target class $c$, a set of examples $X$ and their corresponding ground truth labels $y$, we create a subset $\mathbf{x}_t$ as follows:
$$\mathbf{x}_t = \left\{ X_j \mid y_j = c \right\} \tag{9}$$
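The construction above, selecting all examples of the target class and averaging their latent codes, can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for the trained encoder, and the function names and tiny two-pixel "images" are invented for the example.

```python
def select_class(examples, labels, target_class):
    """Keep only the examples whose ground-truth label equals the target class."""
    return [x for x, y in zip(examples, labels) if y == target_class]


def mean_latent_vector(encode, images):
    """Average the latent codes of the given images, dimension by dimension."""
    latents = [encode(x) for x in images]
    if not latents:
        raise ValueError("need at least one target image")
    dim = len(latents[0])
    return [sum(z[i] for z in latents) / len(latents) for i in range(dim)]


def toy_encode(x):
    # Hypothetical 2-D "latent code"; a real attack would call the trained encoder.
    return [sum(x), max(x)]


examples = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
labels = [7, 7, 3, 3]

targets = select_class(examples, labels, 7)   # the two images labelled 7
z_t = mean_latent_vector(toy_encode, targets)
print(z_t)  # [1.5, 1.0]
```

The resulting vector is then used as the target of the latent attack in place of the latent code of a single target image.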
A.2 Evaluation results
Target  0  1  2  3  4  5  6  7  8  9 

Classifier accuracy  
Matching rate 
[Nine source-by-target tables: rows are source digits 0 to 9, columns are Target 0 to Target 9. The cell values are not recoverable.]
Orig  Mean  1 Smp  12 Smp  50 Smp  Adv  Mean  1 Smp  12 Smp  50 Smp 