Adversarial attacks expressly optimize the input to “fool” models, e.g., in image classification, the adversarial input — while visually tantamount to an ordinary original image — leads to mislabelling with high confidence.
Here, we explore adversarial images for autoencoders — models optimized to reconstruct their inputs from compact internal representations. In an autoencoder, the attack targets not a single label, but a whole reconstruction. Our contributions include:
An adversarial attack on variational — and, for comparison, deterministic — autoencoders. Our attack aims not only at disturbing the reconstruction, but at fooling the autoencoder into reconstructing a completely different target image;
A comparison between attacks for autoencoders and for classifiers, showing that while the former is much harder, in both cases the amount of distortion on the input is proportional to the amount of misdirection on the output. For classifiers, however, such proportionality is hidden by the normalization of the output, which maps a linear layer into non-linear probabilities.
Evaluating generative models is hard theis2015note , there are no clear-cut success criteria for autoencoder reconstruction, and therefore, neither for the attack. We attempt to bypass that difficulty by analyzing how inputs and outputs differ across varying regularization constants.
The seminal article of Szegedy et al. szegedy2013intriguing introduced adversarial images, showing how to force a deep network to misclassify an image by applying nearly imperceptible distortions. Goodfellow et al. goodfellow2014explaining exploited the linear nature of deep convolutional networks to both attempt explaining how adversarial samples arise, and to propose a much faster technique to create them. Tabacof and Valle tabacof2015exploring explored the geometry of adversarial regions, showing that they appear in relatively dense regions of the input space, and that shallow, simple classifiers tend to be more robust to them.
The existence of adversarial images lead to interesting questions on their significance, and even usefulness. Training models to resist adversarial attacks was advanced as a form of regularization goodfellow2014explaining ; miyato2015distributional . Gu et al. gu2014towards used autoencoders to pre-process the input and try to reinforce the network against adversarial attacks, finding that although in some cases resistance improved, attacks with small distortions remained possible. A more recent trend is training adversarial models, in which one attempts to generate “artificial” samples (from a generative model) and the other attempts to recognize those samples goodfellow2014generative . Makhzani et al. makhzani2015adversarial employ such scheme to train an autoencoder.
Although autoencoders appear in the literature of adversarial images as an attempt to obtain robustness to the attacks gu2014towards , and in the literature of adversarial training as models that can be trained with the technique makhzani2015adversarial , we are unaware of any attempts to create attacks targeted to them. In the closest related literature, Sara Sabour et al. sabour2015adversarial show that adversarial attacks can not only lead to mislabelling, but also manipulate the internal representations of the network. In this paper, we show that an analogous manipulation allows us to attack autoencoders, but that those remain much more resistant than classifiers to such attacks.
2 Autoencoders and Variational Autoencoders
Autoencoders are models that map their input into a compact latent representation, and then, from such representation, build back the input (discounting some distortion). Therefore, autoencoders are trained to minimize the distortion between their input and their (reconstructed) output — plus regularization terms. The model comprises two parts: an encoder, which maps the input into the latent representation; and a decoder, which maps such representation into an output as close to the input as possible. In regular autoencoders, the training loss function may be as simple as the-distance between input and output.
Famous variants include sparse autoencoders, which use -regularization ng2011sparse
, and denoising autoencoders, which use implicit regularization by feeding noise to the input, while keeping the original input in the reconstruction loss termvincent2010stacked . An important offshoot are models with similar encoder–decoder structure, but which seek not to reconstruct the input, but to produce an output related to it (e.g., a segmentation map) noh2015learning .
A modern variant of growing popularity, variational autoencoders kingma2013auto
interpret the latent representation through a Bayesian lens, thus offering a theoretical foundation for the reconstruction and regularization objectives. Variational autoencoders are probabilistic generative models, where we find the probability distribution of the data by marginalizing over the latent variables:
The likelihood is the probabilistic explanation of the observed data: in practice, often it is simply the output of the decoder network under a noise consideration (e.g. additive Gaussian noise for RGB pixels). The subscript comprises all decoder parameters, while is the latent representation, over which we marginalize. The representation prior is often the standard normal kingma2013auto , but might be instead a discrete distribution (e.g. Bernoulli) kingma2014semi , or even some distribution with geometric interpretation (“what” and “where” latent variables) eslami2016attend . Since the integration above is often intractable, we maximize its variational lower bound…
…which is the Kullback–Leibler (KL) divergence between the approximate and the (unknown) exact posterior. Thus, maximizing the variational lower bound may also be interpreted as finding the best posterior approximation. In the context of variational autoencoders, such approximate posterior is usually an uncorrelated multivariate normal determined by the encoder network (with parameters ):
We can approximate the likelihood expectation
by Monte Carlo. As the prior and the approximated posterior are normal distributions, their KL divergence has analytic formkingma2013auto kingma2015variational .
. Simulating a chain of samples from the latent variables and likelihood allows to denoise images, or to impute missing data (inpaint images)rezende2014stochastic
. The latent variables of a variational autoencoder also allow visual analogy and interpolationradford2015unsupervised .
3 Adversarial Images for Autoencoders
Adversarial procedures minimize an adversarial loss to mislead the model (e.g., misclassification), while distorting the input as little as possible. If the attack is successful, humans should hardly be able to distinguish between the adversarial and the regular inputs szegedy2013intriguing ; tabacof2015exploring . We can be even more strict, and only allow a distortion below the input quantization noise goodfellow2014explaining ; sabour2015adversarial .
To build adversarial images for classification, one can maximize the misdirection towards a certain wrong label szegedy2013intriguing ; tabacof2015exploring or away from the correct one goodfellow2014explaining . The distortion can be minimized szegedy2013intriguing ; tabacof2015exploring or constrained to be small goodfellow2014explaining ; sabour2015adversarial . Finally, one often requires that images stay within their valid space (i.e., no pixels “below black or above white”).
In autoencoders, there is not a single class output to misclassify, but instead a whole image output to scramble. The attack attempts to mislead the reconstruction: if a slightly altered image enters the autoencoder, but the reconstruction is wrecked, then the attack worked. A more dramatic attack — the one we attempt in this paper — would be to change slightly the input image and make the autoencoder reconstruct a completely different valid image (Fig. 2).
Our attack consists in selecting an original image and a target image, and then feeding the network the original image added to a small distortion, optimized to get an output as close to the target image as possible (Fig. 2). Our attempts to attack the output directly failed: minimizing its distance to the target only succeeded in blurring the reconstruction. As autoencoders reconstruct from the latent representation, we can attack it instead. The latent layer is the information bottleneck of the autoencoder, and thus particularly convenient to attack. We used the following adversarial optimization:
where is the adversarial distortion; and are the latent representations, respectively, for the adversarial and the target images; is the original image; is the adversarial image; and are the bounds on the input space; and is the regularizing constant the balances reaching the target and limiting the distortion.
We must choose a function to compare representations. For regular autoencoders a simple -distance sufficed; however, for variational autoencoders, the KL-divergence between the distributions induced by the latent variables not only worked better, but also offered a sounder justification. In our variational autoencoders, the are uncorrelated multivariate normal distributions with parameters given by the encoder:
are the representation mean vector, and (diagonal) covariance matrix output by the last layer of the encoder network; whileare the autoencoder parameters — learned previously by training it for its ordinary task of reconstruction. During the entire adversarial procedure, remains fixed.
4 Data and Methods
We worked on the binarized MNISTlecun1998mnist and SVHN datasets netzer2011reading . The former allows for very fast experiments and very controlled conditions; the latter, while still allowing to manage a large number of experiments, provides much more noise and variability. Following literature kingma2013auto , we modeled pixel likelihoods as independent Bernoullis (for binary images), or as independent normals (for RGB images). We used Parmesan and Lasagne lasagne for the implementation111The code for the experiments can be found at https://github.com/tabacof/adv_vae.
The loss function to train the variational autoencoder (equation 2) is the expectation of the likelihood under the approximated posterior plus the KL divergence between the approximated posterior and the prior. We approximate the expectation of the likelihood with one sample of the posterior. We extract the gradients of the lower bound using automatic differentiation and maximize it using stochastic gradient ascent via the ADAM algorithm kingma2014adam . We used 20 and 100 latent variables for MNIST and SVHN, respectively. We parameterized the encoder and decoder as fully-connected networks in the MNIST case, and as convolutional and deconvolutional zeiler2010deconvolutional networks in the SVHN case. After the training is done, we can use the autoencoder to reconstruct some image samples through the latent variables, which are the learned representation of the images. An example of a pair of input image/reconstructed output appears in Fig. 1.
For classification tasks, the regularization term (Eq. 4) may be chosen by bisection as the smallest constant that still leads to success tabacof2015exploring . Autoencoders complicate such choice, for there is no longer a binary criterion for success. Goodfellow et al. goodfellow2014explaining and Sabour et al.sabour2015adversarial optimize differently, choosing for an -norm constrained to make the distortion imperceptible, while maximizing the misdirection. We found such solution too restrictive, leading to reconstructions visually too distinct from the target images. Our solution was instead to forgo a single choice for , and analyze the behavior of the system throughout a series of values.
In our experiments, we pick at random 25 pairs of original/target images (axis “experiment” in graphs). For each pair, we span 100 different values for the regularization constant in a logarithmic scale (from to ), measuring the -distance between the adversarial input and the original image (axis “distortion”), and the -distance between the reconstructed output and the target image (axis “adversarialtarget”). The “distortion” axis is normalized between 0.0 (no attack) and the -distance between the original and target images in the pair (a large distortion that could reach the target directly). The “adversarialtarget” is normalized between the -distance of the reconstruction of the target and the target (the best expected attack) and the -distance of the reconstruction of the original and the target (the worst expected attack). The geometry of such normalization is illustrated by the colored lines in the graphs of Fig. 3. For variational autoencoders, the reconstruction is stochastic: therefore, each data point is sampled 100 times, and the average is reported.
For comparison purposes, we use the same protocol above to generate a range of adversarial images for the usual classification tasks on the same datasets. The aim is to contrast the behavior of adversarial attacks across the two tasks (autoencoding / classification). In those experiments we pick pairs of original image / adversarial class (axis “experiment”), and varying (from to
), we measure the distortion as above, and the probability (with corresponding logit) attributed to the adversarial (red lines) and to the original classes (blue lines). The axes here are no longer normalized, but we center at 0 in the “distortion” axis the transition point between attack failure and success — the point where red and blue lines cross.
5 Results and Discussion
We found that generating adversarial images for autoencoders is a much harder task than for classifiers. If we apply little distortion (comparable to those used for misleading classifiers), the reconstructions stay essentially untouched. To get reconstructions very close to the target’s, we have to apply heavy distortions to the input. However, by hand-tuning the regularization parameter, it is possible to find trade-offs where the reconstruction approaches the target’s and the adversarial image will still resemble the input (two examples in Fig. 3).
The plots for the full set of 25 original/target image pairs appear in Fig. 4. All series saturate when the latent representation of the adversarial image essentially equals the target’s. That saturation appears well before the upper distortion limit of 1.0, and provides a measure of how resistant the model is to the attack: Variational Autoencoders appear slightly more resistant than Deterministic Autoencoders, and MNIST much more resistant than SVHN. The latter is not surprising, since large complex models seem, in general, more susceptible to adversarial attacks. Before the “hinge” where the attack saturates, there is a quasi-linear trade-off between input distortion and output similarity to target, for all combinations of dataset and autoencoder choice. We were initially hoping for a more non-linear behavior, with a sudden drop at some point in the scale, but data suggests that there is a give-and-take for attacking autoencoders: each gain in the attack requires a proportional increase in distortion.
The comparison with the (much better-studied) attacks for classifiers, showed, at the beginning, a much different behavior: when we contrasted the probability attributed to the adversarial class vs. the distortion imposed on the input, we observed the non-linear, sudden change we were expecting (left column of Fig. 6). The question remained, however whether such non-linearity was intrinsic, or whether it was due to the highly non-linear nature of the probability scale. The answer appears in the right column of Fig. 6, where, with a logit transformation of the probabilities, the linear behavior appears again. It seems that the attack on classifiers show, internally, the same linear give-and-take present in autoencoders, but that the normalization of the outputs of the last layer into valid probabilities aids the attack: changes in input lead to proportional changes in logit, but to much larger changes in probability. That makes feasible for the attack on classifiers to find much better sweet spots than the attack on autoencoders (Fig. 5). Goodfellow et al. goodfellow2014explaining suggested that the linearity of deep models make them susceptible to adversarial attacks. Our results seems to reinforce that such linearity plays indeed a critical role, with “internal” success of the attack being proportional to the distortion on inputs. On classification networks, however, which are essentially piecewise linear until the last layer, the non-linearity of the latter seems to compound the problem.
We proposed an adversarial method to attack autoencoders, and evaluated their robustness to such attacks. We showed that there is a linear trade-off between how much the adversarial input is similar to the original input, and how much the adversarial reconstruction is similar to the target reconstruction — frustrating the hope that a small change in the input could lead to drastic changes in the reconstruction. Surprisingly, such linear trade-off also appears for adversarial attacks on classification networks, if we “undo” the non-linearity of the last layer. In the future, we intend to extend our empirical results to datasets with larger inputs and more complex networks (e.g. ImageNet) — as well as to different autoencoder architectures. For example, the DRAW variational autoencodergregor2015draw uses feedback from the reconstruction error to improve the reconstruction — and thus could be more robust to attacks. We are also interested in advancing theoretical explanations to illuminate our results.
We thank Brazilian agencies CAPES, CNPq and FAPESP for financial support. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Eduardo Valle is partially supported by a Google Awards LatAm 2016 grant, and by a CNPq PQ-2 grant (311486/2014-2).
-  Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Pedro Tabacof and Eduardo Valle. Exploring the space of adversarial images. arXiv preprint arXiv:1510.05328, 2015.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677, 2015.
-  Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
-  Andrew Ng. Sparse autoencoder. CS294A Lecture notes, 72:1–19, 2011.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and
Stacked denoising autoencoders: Learning useful representations in a
deep network with a local denoising criterion.
The Journal of Machine Learning Research, 11:3371–3408, 2010.
-  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
-  SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575, 2016.
-  Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557, 2015.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
-  Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. arXiv preprint arXiv:1604.08772, 2016.
-  Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
Yann LeCun, Corinna Cortes, and Christopher JC Burges.
The mnist database of handwritten digits, 1998.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
-  Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, et al. Lasagne: First release., August 2015.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus.
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010.