Adversarial Images for Variational Autoencoders

We investigate adversarial attacks for autoencoders. We propose a procedure that distorts the input image to mislead the autoencoder into reconstructing a completely different target image. We attack the internal latent representations, attempting to make the adversarial input produce an internal representation as similar as possible to the target's. We find that autoencoders are much more robust to the attack than classifiers: while some examples have tolerably small input distortion, and reasonable similarity to the target image, there is a quasi-linear trade-off between those aims. We report results on MNIST and SVHN datasets, and also test regular deterministic autoencoders, reaching similar conclusions in all cases. Finally, we show that the usual adversarial attack for classifiers, while being much easier, also presents a direct proportion between distortion on the input, and misdirection on the output. That proportionality, however, is hidden by the normalization of the output, which maps a linear layer into non-linear probabilities.





1 Introduction

Adversarial attacks expressly optimize the input to “fool” models: in image classification, for example, the adversarial input, while visually almost indistinguishable from an ordinary original image, leads to mislabelling with high confidence.

Here, we explore adversarial images for autoencoders — models optimized to reconstruct their inputs from compact internal representations. In an autoencoder, the attack targets not a single label, but a whole reconstruction. Our contributions include:

  • An adversarial attack on variational — and, for comparison, deterministic — autoencoders. Our attack aims not only at disturbing the reconstruction, but at fooling the autoencoder into reconstructing a completely different target image;

  • A comparison between attacks for autoencoders and for classifiers, showing that while the former is much harder, in both cases the amount of distortion on the input is proportional to the amount of misdirection on the output. For classifiers, however, such proportionality is hidden by the normalization of the output, which maps a linear layer into non-linear probabilities.

Evaluating generative models is hard theis2015note : there are no clear-cut success criteria for autoencoder reconstruction and, therefore, none for the attack either. We attempt to bypass that difficulty by analyzing how inputs and outputs differ across varying regularization constants.

The seminal article of Szegedy et al. szegedy2013intriguing introduced adversarial images, showing how to force a deep network to misclassify an image by applying nearly imperceptible distortions. Goodfellow et al. goodfellow2014explaining exploited the linear nature of deep convolutional networks both to explain how adversarial samples arise and to propose a much faster technique to create them. Tabacof and Valle tabacof2015exploring explored the geometry of adversarial regions, showing that they appear in relatively dense regions of the input space, and that shallow, simple classifiers tend to be more robust to them.

The existence of adversarial images led to interesting questions on their significance, and even usefulness. Training models to resist adversarial attacks was advanced as a form of regularization goodfellow2014explaining ; miyato2015distributional . Gu et al. gu2014towards used autoencoders to pre-process the input and try to reinforce the network against adversarial attacks, finding that although in some cases resistance improved, attacks with small distortions remained possible. A more recent trend is training adversarial models, in which one model attempts to generate “artificial” samples (from a generative model) and the other attempts to recognize those samples goodfellow2014generative . Makhzani et al. makhzani2015adversarial employ such a scheme to train an autoencoder.

Although autoencoders appear in the literature of adversarial images as an attempt to obtain robustness to the attacks gu2014towards , and in the literature of adversarial training as models that can be trained with the technique makhzani2015adversarial , we are unaware of any attempts to create attacks targeted at them. In the closest related literature, Sabour et al. sabour2015adversarial show that adversarial attacks can not only lead to mislabelling, but also manipulate the internal representations of the network. In this paper, we show that an analogous manipulation allows us to attack autoencoders, but that those remain much more resistant than classifiers to such attacks.

2 Autoencoders and Variational Autoencoders

Autoencoders are models that map their input into a compact latent representation, and then, from such representation, build back the input (discounting some distortion). Therefore, autoencoders are trained to minimize the distortion between their input and their (reconstructed) output, plus regularization terms. The model comprises two parts: an encoder, which maps the input into the latent representation; and a decoder, which maps such representation into an output as close to the input as possible. In regular autoencoders, the training loss function may be as simple as the $\ell_2$-distance between input and output.

Figure 1: Autoencoders are models able to map their input into a (deterministic or stochastic) latent representation, and then to map such representation into an output similar to the input; those two maps form the two halves of the model: the encoder and the decoder.

Famous variants include sparse autoencoders, which use $\ell_1$-regularization ng2011sparse , and denoising autoencoders, which use implicit regularization by feeding noise to the input, while keeping the original input in the reconstruction loss term vincent2010stacked . An important offshoot are models with a similar encoder–decoder structure, which seek not to reconstruct the input, but to produce an output related to it (e.g., a segmentation map) noh2015learning .

A modern variant of growing popularity, variational autoencoders kingma2013auto interpret the latent representation through a Bayesian lens, thus offering a theoretical foundation for the reconstruction and regularization objectives. Variational autoencoders are probabilistic generative models, where we find the probability distribution of the data by marginalizing over the latent variables:

$$p(x) = \int p_\theta(x \mid z)\, p(z)\, dz \tag{1}$$

The likelihood $p_\theta(x \mid z)$ is the probabilistic explanation of the observed data: in practice, it is often simply the output of the decoder network under a noise assumption (e.g., additive Gaussian noise for RGB pixels). The subscript $\theta$ comprises all decoder parameters, while $z$ is the latent representation, over which we marginalize. The representation prior $p(z)$ is often the standard normal $\mathcal{N}(0, I)$ kingma2013auto , but might instead be a discrete distribution (e.g., Bernoulli) kingma2014semi , or even some distribution with geometric interpretation (“what” and “where” latent variables) eslami2016attend . Since the integration above is often intractable, we maximize its variational lower bound:

$$\log p(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{2}$$

The gap between the two sides is $\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, which is the Kullback–Leibler (KL) divergence between the approximate and the (unknown) exact posterior. Thus, maximizing the variational lower bound may also be interpreted as finding the best posterior approximation. In the context of variational autoencoders, such approximate posterior is usually an uncorrelated multivariate normal determined by the encoder network (with parameters $\phi$):

$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma^2_\phi(x))\big) \tag{3}$$
We can approximate the likelihood expectation by Monte Carlo. As the prior and the approximated posterior are normal distributions, their KL divergence has analytic form. We can use the reparameterization trick to reduce the variance of the gradient estimator kingma2015variational .
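To make the two ingredients above concrete, the following is a minimal NumPy sketch, not the implementation used in our experiments. It assumes the encoder outputs a mean and a log-variance per latent dimension (the log-variance parameterization and the function names are illustrative assumptions), computes the closed-form KL divergence to a standard normal prior, and draws a reparameterized sample.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps only, so z is a deterministic function of
    (mu, log_var) and gradients can flow through it.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


# Illustrative values (assumed, not taken from any experiment).
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.log(np.array([0.25, 4.0]))

kl = kl_to_standard_normal(mu, log_var)
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print("KL to prior:", kl)
print("empirical mean of z:", samples.mean(axis=0))
```

The KL term vanishes only when the approximate posterior equals the prior, so it acts as the regularizer of the latent representation.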

The encoder and the decoder may be any neural network: a multilayer perceptron kingma2013auto , a convolutional network radford2015unsupervised , or even LSTMs. The latter are a recent development (recurrent variational autoencoders) and use soft attention to encode and decode patches from the input image gregor2015draw ; gregor2016towards . Simulating a chain of samples from the latent variables and likelihood allows denoising images, or imputing missing data (inpainting images). The latent variables of a variational autoencoder also allow visual analogy and interpolation radford2015unsupervised .

3 Adversarial Images for Autoencoders

Adversarial procedures minimize an adversarial loss to mislead the model (e.g., misclassification), while distorting the input as little as possible. If the attack is successful, humans should hardly be able to distinguish between the adversarial and the regular inputs szegedy2013intriguing ; tabacof2015exploring . We can be even more strict, and only allow a distortion below the input quantization noise goodfellow2014explaining ; sabour2015adversarial .

To build adversarial images for classification, one can maximize the misdirection towards a certain wrong label szegedy2013intriguing ; tabacof2015exploring or away from the correct one goodfellow2014explaining . The distortion can be minimized szegedy2013intriguing ; tabacof2015exploring or constrained to be small goodfellow2014explaining ; sabour2015adversarial . Finally, one often requires that images stay within their valid space (i.e., no pixels “below black or above white”).

In autoencoders, there is not a single class output to misclassify, but instead a whole image output to scramble. The attack attempts to mislead the reconstruction: if a slightly altered image enters the autoencoder, but the reconstruction is wrecked, then the attack worked. A more dramatic attack, the one we attempt in this paper, would be to slightly change the input image and make the autoencoder reconstruct a completely different valid image (Fig. 2).

Figure 2: Adversarial attacks for autoencoders add (ideally small) distortions to the input, aiming at making the autoencoder reconstruct a different target. We attack the latent representation, attempting to match it to the target image’s.

Our attack consists in selecting an original image and a target image, and then feeding the network the original image added to a small distortion, optimized to get an output as close to the target image as possible (Fig. 2). Our attempts to attack the output directly failed: minimizing its distance to the target only succeeded in blurring the reconstruction. As autoencoders reconstruct from the latent representation, we can attack it instead. The latent layer is the information bottleneck of the autoencoder, and thus particularly convenient to attack. We used the following adversarial optimization:


$$\min_{d}\ \Delta(z_a, z_t) + C\,\lVert d \rVert \quad \text{s.t.} \quad L \leq x + d \leq U \tag{4}$$

where $d$ is the adversarial distortion; $z_a$ and $z_t$ are the latent representations, respectively, for the adversarial and the target images; $x$ is the original image; $x + d$ is the adversarial image; $L$ and $U$ are the bounds on the input space; and $C$ is the regularizing constant that balances reaching the target and limiting the distortion.

We must choose a function $\Delta$ to compare representations. For regular autoencoders a simple $\ell_2$-distance sufficed; however, for variational autoencoders, the KL-divergence between the distributions induced by the latent variables not only worked better, but also offered a sounder justification. In our variational autoencoders, the $z$ are uncorrelated multivariate normal distributions with parameters given by the encoder:

$$\Delta(z_a, z_t) = \mathrm{KL}\big(\mathcal{N}(\mu_a, \Sigma_a) \,\|\, \mathcal{N}(\mu_t, \Sigma_t)\big) \tag{5}$$

where $\mu_a, \mu_t$ and $\Sigma_a, \Sigma_t$ are the representation mean vectors and (diagonal) covariance matrices output by the last layer of the encoder network for the adversarial and the target images; while $\theta$ and $\phi$ are the autoencoder parameters, learned previously by training it for its ordinary task of reconstruction. During the entire adversarial procedure, they remain fixed.
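As a hedged illustration of the adversarial objective described in this section, the sketch below evaluates it for a toy linear "encoder" in NumPy. Everything specific here is an assumption made for the example: the encoder (`toy_encoder`, with random weights), the $\ell_2$ penalty on the distortion, the $[0, 1]$ input bounds enforced by clipping, and the argument order of the KL divergence. It only evaluates the objective; an actual attack would minimize it over the distortion with a gradient-based optimizer while the model parameters stay fixed.

```python
import numpy as np


def kl_diag_normals(mu_a, var_a, mu_t, var_t):
    """KL( N(mu_a, diag(var_a)) || N(mu_t, diag(var_t)) ) for diagonal covariances."""
    return 0.5 * np.sum(
        np.log(var_t / var_a) + (var_a + (mu_a - mu_t) ** 2) / var_t - 1.0
    )


def toy_encoder(x, w_mu, w_log_var):
    """Hypothetical linear encoder: returns mean and variance of the latent normal."""
    return w_mu @ x, np.exp(w_log_var @ x)


def adversarial_objective(d, x, x_target, c, w_mu, w_log_var, low=0.0, high=1.0):
    """Latent mismatch plus c times the distortion norm, for the image x + d.

    Clipping keeps the adversarial image inside the valid input box; the
    penalty is computed on the distortion that remains after clipping.
    """
    x_adv = np.clip(x + d, low, high)
    mu_a, var_a = toy_encoder(x_adv, w_mu, w_log_var)
    mu_t, var_t = toy_encoder(x_target, w_mu, w_log_var)
    return kl_diag_normals(mu_a, var_a, mu_t, var_t) + c * np.linalg.norm(x_adv - x)


# Toy setup (all values assumed): 8 "pixels", 3 latent variables.
rng = np.random.default_rng(0)
w_mu = rng.normal(size=(3, 8))
w_log_var = rng.normal(scale=0.1, size=(3, 8))
x = rng.uniform(size=8)
x_target = rng.uniform(size=8)
c = 0.5

loss_no_attack = adversarial_objective(np.zeros(8), x, x_target, c, w_mu, w_log_var)
loss_full_jump = adversarial_objective(x_target - x, x, x_target, c, w_mu, w_log_var)
print("d = 0:            ", loss_no_attack)
print("d = target - orig:", loss_full_jump)
```

The two evaluated points are the extremes of the trade-off: with no distortion the objective is pure latent mismatch, and with the distortion that turns the original into the target it is pure distortion penalty.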

4 Data and Methods

We worked on the binarized MNIST lecun1998mnist and SVHN netzer2011reading datasets. The former allows for very fast experiments and very controlled conditions; the latter, while still allowing a large number of experiments, provides much more noise and variability. Following the literature kingma2013auto , we modeled pixel likelihoods as independent Bernoullis (for binary images), or as independent normals (for RGB images). We used Parmesan and Lasagne lasagne for the implementation.

Footnote 1: The code for the experiments can be found at

The objective used to train the variational autoencoder (equation 2) is the expectation of the log-likelihood under the approximated posterior minus the KL divergence between the approximated posterior and the prior. We approximate the expectation of the likelihood with one sample of the posterior. We extract the gradients of the lower bound using automatic differentiation and maximize it using stochastic gradient ascent via the ADAM algorithm kingma2014adam . We used 20 and 100 latent variables for MNIST and SVHN, respectively. We parameterized the encoder and decoder as fully-connected networks in the MNIST case, and as convolutional and deconvolutional zeiler2010deconvolutional networks in the SVHN case. After the training is done, we can use the autoencoder to reconstruct some image samples through the latent variables, which are the learned representation of the images. An example of a pair of input image/reconstructed output appears in Fig. 1.
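The one-sample estimate of the lower bound can be sketched as follows for the Bernoulli (binary image) case. This is a NumPy illustration under assumptions, not our Lasagne code: the decoder (`toy_decoder`), its random weights, and the log-variance parameterization of the posterior are hypothetical choices made to keep the example self-contained.

```python
import numpy as np


def bernoulli_log_likelihood(x, p, eps=1e-7):
    """log p(x | z) for independent Bernoulli pixels with means p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))


def elbo_one_sample(x, mu, log_var, decode, rng):
    """Single-sample Monte Carlo estimate of the variational lower bound."""
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    log_lik = bernoulli_log_likelihood(x, decode(z))
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return log_lik - kl


def toy_decoder(z, w, b):
    """Hypothetical decoder: one linear layer followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(w @ z + b)))


# Toy setup (assumed values): 6 binary "pixels", 2 latent variables.
rng = np.random.default_rng(0)
w = rng.normal(size=(6, 2))
b = np.zeros(6)
x = rng.integers(0, 2, size=6).astype(float)
mu = np.array([0.3, -0.2])
log_var = np.array([-1.0, -0.5])

elbo = elbo_one_sample(x, mu, log_var, lambda z: toy_decoder(z, w, b), rng)
print("one-sample lower bound:", elbo)
```

In training, this quantity would be averaged over a minibatch and maximized with respect to the encoder and decoder parameters.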

For classification tasks, the regularization constant (Eq. 4) may be chosen by bisection as the smallest constant that still leads to success tabacof2015exploring . Autoencoders complicate such a choice, for there is no longer a binary criterion for success. Goodfellow et al. goodfellow2014explaining and Sabour et al. sabour2015adversarial optimize differently, opting for an $\ell_\infty$-norm constraint that makes the distortion imperceptible, while maximizing the misdirection. We found such a solution too restrictive, leading to reconstructions visually too distinct from the target images. Our solution was instead to forgo a single choice for the regularization constant, and analyze the behavior of the system throughout a series of values.

In our experiments, we pick at random 25 pairs of original/target images (axis “experiment” in graphs). For each pair, we span 100 different values for the regularization constant in a logarithmic scale (from to ), measuring the $\ell_2$-distance between the adversarial input and the original image (axis “distortion”), and the $\ell_2$-distance between the reconstructed output and the target image (axis “adversarial-target”). The “distortion” axis is normalized between 0.0 (no attack) and the $\ell_2$-distance between the original and target images in the pair (a large distortion that could reach the target directly). The “adversarial-target” axis is normalized between the $\ell_2$-distance of the reconstruction of the target and the target (the best expected attack) and the $\ell_2$-distance of the reconstruction of the original and the target (the worst expected attack). The geometry of such normalization is illustrated by the colored lines in the graphs of Fig. 3. For variational autoencoders, the reconstruction is stochastic: therefore, each data point is sampled 100 times, and the average is reported.
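The normalization of the two axes can be sketched as below. The function names, the affine mapping of the second axis to the range from 0 (best expected attack) to 1 (worst expected attack), and the grid endpoints are assumptions made for illustration; the endpoints in particular are placeholders, not the values used in the experiments.

```python
import numpy as np


def l2(a, b):
    """Euclidean distance between two images, flattened."""
    return float(np.linalg.norm(np.ravel(a) - np.ravel(b)))


def normalized_distortion(x_adv, x_orig, x_target):
    """0.0 means no attack; 1.0 means a distortion as large as the original-target distance."""
    return l2(x_adv, x_orig) / l2(x_target, x_orig)


def normalized_adversarial_target(rec_adv, rec_orig, rec_target, x_target):
    """0.0 at the best expected attack, 1.0 at the worst expected attack."""
    best = l2(rec_target, x_target)   # reconstruction of the target itself
    worst = l2(rec_orig, x_target)    # reconstruction of the untouched original
    return (l2(rec_adv, x_target) - best) / (worst - best)


def regularization_grid(low_exp, high_exp, n=100):
    """n constants evenly spaced on a logarithmic (base-10) scale."""
    return np.logspace(low_exp, high_exp, num=n)


# Toy check with a perfect "autoencoder" (reconstruction equals input).
x_orig = np.zeros(4)
x_target = np.ones(4)
x_adv = np.full(4, 0.5)

distortion = normalized_distortion(x_adv, x_orig, x_target)
adv_target = normalized_adversarial_target(x_adv, x_orig, x_target, x_target)
grid = regularization_grid(-2, 2)  # placeholder endpoints
print(distortion, adv_target, grid[0], grid[-1])
```

For a variational autoencoder, the reconstructions passed to `normalized_adversarial_target` would be sampled repeatedly and the resulting values averaged.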

For comparison purposes, we use the same protocol above to generate a range of adversarial images for the usual classification tasks on the same datasets. The aim is to contrast the behavior of adversarial attacks across the two tasks (autoencoding / classification). In those experiments we pick pairs of original image / adversarial class (axis “experiment”) and, varying the regularization constant (from to ), we measure the distortion as above, and the probability (with corresponding logit) attributed to the adversarial (red lines) and to the original classes (blue lines). The axes here are no longer normalized, but we center at 0 in the “distortion” axis the transition point between attack failure and success: the point where red and blue lines cross.

5 Results and Discussion

Figure 3: Top row: MNIST. Bottom row: SVHN. The figures on the left show the trade-off between the quality of adversarial attack and the adversarial distortion magnitude, with changing regularization parameter (implicit in the graphs, chosen from a logarithmic scale). The figures on the right correspond to the points shown in red in the graphs, illustrating adversarial images and reconstructions using fully-connected, and convolutional variational autoencoders (for MNIST and SVHN, respectively).

We found that generating adversarial images for autoencoders is a much harder task than for classifiers. If we apply little distortion (comparable to those used for misleading classifiers), the reconstructions stay essentially untouched. To get reconstructions very close to the target’s, we have to apply heavy distortions to the input. However, by hand-tuning the regularization parameter, it is possible to find trade-offs where the reconstruction approaches the target’s and the adversarial image will still resemble the input (two examples in Fig. 3).

The plots for the full set of 25 original/target image pairs appear in Fig. 4. All series saturate when the latent representation of the adversarial image essentially equals the target’s. That saturation appears well before the upper distortion limit of 1.0, and provides a measure of how resistant the model is to the attack: Variational Autoencoders appear slightly more resistant than Deterministic Autoencoders, and MNIST much more resistant than SVHN. The latter is not surprising, since large complex models seem, in general, more susceptible to adversarial attacks. Before the “hinge” where the attack saturates, there is a quasi-linear trade-off between input distortion and output similarity to target, for all combinations of dataset and autoencoder choice. We were initially hoping for a more non-linear behavior, with a sudden drop at some point in the scale, but data suggests that there is a give-and-take for attacking autoencoders: each gain in the attack requires a proportional increase in distortion.

Figure 4: Plots for the whole set of experiments in MNIST and SVHN. Top: variational autoencoders (VAE). Bottom: deterministic autoencoders (AE). Each line in a graph corresponds to one experiment with adversarial images from a single pair of original/target images, varying the regularization parameter (as shown in Fig. 3). The “distortion” and “adversarial-target” axes show the trade-off between cost and success. The “hinge” where the lines saturate shows the point where the reconstruction is essentially equal to the target’s: the distortion at the hinge measures the resistance to the attack.

The comparison with the (much better-studied) attacks for classifiers showed, at first, a very different behavior: when we contrasted the probability attributed to the adversarial class vs. the distortion imposed on the input, we observed the non-linear, sudden change we were expecting (left column of Fig. 6). The question remained, however, whether such non-linearity was intrinsic, or whether it was due to the highly non-linear nature of the probability scale. The answer appears in the right column of Fig. 6, where, with a logit transformation of the probabilities, the linear behavior appears again. It seems that the attack on classifiers shows, internally, the same linear give-and-take present in autoencoders, but that the normalization of the outputs of the last layer into valid probabilities aids the attack: changes in input lead to proportional changes in logit, but to much larger changes in probability. That makes it feasible for the attack on classifiers to find much better sweet spots than the attack on autoencoders (Fig. 5). Goodfellow et al. goodfellow2014explaining suggested that the linearity of deep models makes them susceptible to adversarial attacks. Our results seem to reinforce that such linearity indeed plays a critical role, with “internal” success of the attack being proportional to the distortion on inputs. On classification networks, however, which are essentially piecewise linear until the last layer, the non-linearity of the latter seems to compound the problem.
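The effect of the output normalization can be reproduced in a toy setting. The sketch below is an idealized two-class example with assumed values (the slope and the linear dependence of the last-layer outputs on the distortion are hypothetical, not measured): the pre-softmax outputs change in proportion to the distortion, the resulting probability jumps abruptly around the crossing point, and the logit transform of that probability recovers the straight line.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


def logit(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return np.log(p) - np.log1p(-p)


# Distortion centered at 0, the crossing point between the two classes.
distortion = np.linspace(-1.0, 1.0, 9)
slope = 8.0  # hypothetical sensitivity of the last linear layer

# Column 0: adversarial class, column 1: original class.
scores = np.stack([slope * distortion, -slope * distortion], axis=-1)
p_adv = softmax(scores)[:, 0]

for dist, p in zip(distortion, p_adv):
    print(f"distortion={dist:+.2f}  p_adv={p:.6f}  logit={logit(p):+.2f}")
```

Equal steps in distortion give equal steps in logit but very unequal steps in probability, which is the abrupt transition seen on the probability scale.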

Figure 5: Examples for the classification attacks. Top: MNIST. Bottom: SVHN. Left: probabilities. Middle: logit transform of probabilities. Right: images illustrating the intersection point of the curves. The adversarial class is ‘4’ for MNIST, and ‘0’ for SVHN. The red curve shows the probability/logit for the adversarial class, and the blue curve shows the same for the original class: the point where the curves cross is the transition point between failure and success of the attack.

Figure 6: Plots for the whole set of experiments for classifiers. Top: MNIST. Bottom: SVHN. Left: probabilities. Right: logit transform of probabilities. Each experiment corresponds to one of the graphs shown in Fig. 5, centered to make the crossing point between the red and blue lines stay at 0 in the “distortion” axis.

6 Conclusion

We proposed an adversarial method to attack autoencoders, and evaluated their robustness to such attacks. We showed that there is a linear trade-off between how similar the adversarial input is to the original input and how similar the adversarial reconstruction is to the target reconstruction, frustrating the hope that a small change in the input could lead to drastic changes in the reconstruction. Surprisingly, such linear trade-off also appears for adversarial attacks on classification networks, if we “undo” the non-linearity of the last layer. In the future, we intend to extend our empirical results to datasets with larger inputs and more complex networks (e.g., ImageNet), as well as to different autoencoder architectures. For example, the DRAW variational autoencoder gregor2015draw uses feedback from the reconstruction error to improve the reconstruction, and thus could be more robust to attacks. We are also interested in advancing theoretical explanations to illuminate our results.


We thank Brazilian agencies CAPES, CNPq and FAPESP for financial support. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Eduardo Valle is partially supported by a Google Awards LatAm 2016 grant, and by a CNPq PQ-2 grant (311486/2014-2).