Denoising Adversarial Autoencoders

by Antonia Creswell, et al.
Imperial College London

Unsupervised learning is of growing interest because it unlocks the potential held in vast amounts of unlabelled data to learn useful representations for inference. Autoencoders, a form of generative model, may be trained by learning to reconstruct unlabelled input data from a latent representation space. More robust representations may be produced by an autoencoder if it learns to recover clean input samples from corrupted ones. Representations may be further improved by introducing regularisation during training to shape the distribution of the encoded data in latent space. We suggest denoising adversarial autoencoders, which combine denoising and regularisation, shaping the distribution of latent space using adversarial training. We introduce a novel analysis that shows how denoising may be incorporated into the training and sampling of adversarial autoencoders. Experiments are performed to assess the contributions that denoising makes to the learning of representations for classification and sample synthesis. Our results suggest that autoencoders trained using a denoising criterion achieve higher classification performance, and can synthesise samples that are more consistent with the input data than those trained without a corruption process.




I Background

I-A Autoencoders

In a supervised learning setting, given a set of training data, $\{(y_i, x_i)\}_{i=1}^{N}$, we wish to learn a model, $p(y|x)$, that maximises the likelihood of the true label, $y$, given an observation, $x$. In the supervised setting, there are many ways to calculate and approximate the likelihood, because there is a ground truth label for every training data sample.

When trying to learn a generative model, $p_\theta(x)$, in the absence of a ground truth, calculating the likelihood of the model under the observed data distribution is challenging. Autoencoders introduce a two-step learning process that allows the estimation, $p_\theta(x)$, of $p(x)$ via an auxiliary variable, $z$. The variable $z$ may take many forms and we shall explore several of these in this section. The two-step process involves first learning a probabilistic encoder [14], $q_\phi(z|x)$, conditioned on observed samples, and a second probabilistic decoder [14], $p_\theta(x|z)$, conditioned on the auxiliary variables. Using the probabilistic encoder, we may form a training dataset, $\{(z_i, x_i)\}_{i=1}^{N}$, where $x_i$ is the ground truth output for $p_\theta(x|z_i)$ with the input being $z_i \sim q_\phi(z|x_i)$. The probabilistic decoder, $p_\theta(x|z)$, may then be trained on this dataset in a supervised fashion. By sampling $p_\theta(x|z)$, conditioning on suitable $z$'s, we may obtain a joint distribution, $p_\theta(x, z)$, which may be marginalised by integrating over all $z$ to obtain $p_\theta(x)$.

In some situations the encoding distribution is chosen rather than learned [5], in other situations the encoder and decoder are learned simultaneously [14, 21, 12].

I-B Denoising Autoencoders (DAEs)

Bengio et al. [5] treat the encoding process as a local corruption process that does not need to be learned. The corruption process is defined as $c(\tilde{x}|x)$, where the corrupted sample, $\tilde{x}$, is the auxiliary variable (instead of $z$). The decoder, $p_\theta(x|\tilde{x})$, is therefore trained on the data pairs, $\{(\tilde{x}_i, x_i)\}_{i=1}^{N}$.

By using a local corruption process (e.g. additive white Gaussian noise [5]), both $\tilde{x}$ and $x$ have the same number of dimensions and are close to each other. This makes it very easy to learn $p_\theta(x|\tilde{x})$. Bengio et al. [5] show how the learned model may be sampled using an iterative process, but do not explore how representations learned by the model may transfer to other applications such as classification.
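As a minimal sketch of the data pairs on which such a decoder is trained, the following builds (corrupted, clean) pairs using additive Gaussian noise. The function names, the noise level `sigma=0.1` and the toy data are illustrative assumptions, not values taken from the paper.

```python
import random

def corrupt(x, sigma, rng):
    """Additive white Gaussian noise; the output has the same dimension as x."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def make_denoising_pairs(data, sigma=0.1, seed=0):
    """Return a list of (x_tilde, x) pairs: corrupted input, clean target."""
    rng = random.Random(seed)
    return [(corrupt(x, sigma, rng), x) for x in data]

# Toy "dataset" of two 3-dimensional samples (hypothetical values).
data = [[0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]
pairs = make_denoising_pairs(data, sigma=0.1)
```

Because the noise is local, each corrupted sample stays close to its clean target, which is why the reconstruction task is easy to learn.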

Hinton et al. [11] show that when auxiliary variables of an autoencoder have lower dimension than the observed data, the encoding model learns representations that may be useful for tasks such as classification and retrieval.

Rather than treating the corruption process, $c(\tilde{x}|x)$, as an encoding process [5] (missing out on potential benefits of using a lower dimensional auxiliary variable), Vincent et al. [31, 32] learn an encoding distribution, $q_\phi(z|\tilde{x})$, conditional on corrupted samples. The decoding distribution, $p_\theta(x|z)$, learns to reconstruct images from encoded, corrupted images; see the DAEs in Figure 1. Vincent et al. [32, 31] show that compared to regular autoencoders, denoising autoencoders learn representations that are more useful and robust for tasks such as classification. Parameters $\phi$ and $\theta$ are learned simultaneously by minimising the reconstruction error for the training set, $\{(\tilde{x}_i, x_i)\}_{i=1}^{N}$, which does not include $z_i$. The ground truth $z_i$ for a given $\tilde{x}_i$ is unknown. The form of the distribution over $z$, to which samples are mapped, is also unknown, making it difficult to draw novel data samples from the decoder model, $p_\theta(x|z)$.

I-C Variational Autoencoders

Variational autoencoders (VAEs) [14] specify a prior distribution, $p(z)$, to which $q_\phi(z|x)$ should map all $x$ samples, by formulating and maximising a variational lower bound on the log-likelihood of $p_\theta(x)$.

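The variational lower bound on the log-likelihood of $p_\theta(x)$ is given by [14]:

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - KL\left[q_\phi(z|x) \,\|\, p(z)\right]$$

The term $p_\theta(x|z)$ corresponds to the likelihood of a reconstructed $x$ given the encoding, $z$, of a data sample $x$. This formulation of the variational lower bound does not involve a corruption process. The term $KL[q_\phi(z|x) \,\|\, p(z)]$ is the Kullback-Leibler divergence between $q_\phi(z|x)$ and $p(z)$. Samples are drawn from $q_\phi(z|x)$ via a re-parametrisation trick; see the VAE in Figure 1.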

If $q_\phi(z|x)$ is chosen to be a parametrised multivariate Gaussian, and the prior is chosen to be a Gaussian distribution, then $KL[q_\phi(z|x) \,\|\, p(z)]$ may be computed analytically. The KL divergence may only be computed analytically for certain (limited) choices of prior and posterior distributions.

VAE training encourages $q_\phi(z|x)$ to map observed samples to the chosen prior, $p(z)$. Therefore, novel observed data samples may be generated via the following simple sampling process: $z_i \sim p(z)$, $x_i \sim p_\theta(x|z_i)$ [14].
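To make the Gaussian case concrete, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and a standard normal prior, and then runs the two-step sampling process. The diagonal covariance, the standard normal prior and the `toy_decoder` are assumptions made for illustration; a trained decoder network would take the decoder's place.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)] for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 2.0 * math.log(s) - 1.0
                     for m, s in zip(mu, sigma))

def toy_decoder(z):
    """Hypothetical stand-in for the decoder mean: an element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

rng = random.Random(0)
z_new = [rng.gauss(0.0, 1.0) for _ in range(3)]  # z drawn from the prior N(0, I)
x_new = toy_decoder(z_new)                       # x obtained from the decoder given z
```

The divergence is zero exactly when the encoder output already equals the prior, for example `kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])`.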

Note that despite the benefits of the denoising criterion shown by Vincent et al. [31, 32], no corruption process was introduced by Kingma et al. [14] during VAE training.

I-D Denoising Variational Autoencoders

Adding the denoising criterion to a variational autoencoder is non-trivial because the variational lower bound becomes intractable.

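Consider the conditional probability density function, $\tilde{q}_\phi(z|x) = \int q_\phi(z|\tilde{x})\, c(\tilde{x}|x)\, d\tilde{x}$, where $q_\phi(z|\tilde{x})$ is the probabilistic encoder conditioned on corrupted samples, $\tilde{x}$, and $c(\tilde{x}|x)$ is a corruption process. The variational lower bound may be formed in the following way [12]: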

If $q_\phi(z|\tilde{x})$ is chosen to be Gaussian, then in many cases $\tilde{q}_\phi(z|x)$ will be a mixture of Gaussians. If this is the case, there is no analytical solution for $KL[\tilde{q}_\phi(z|x) \,\|\, p(z)]$ and so the denoising variational lower bound becomes analytically intractable. However, there may still be an analytical solution for $KL[q_\phi(z|\tilde{x}) \,\|\, p(z)]$. The denoising variational autoencoder therefore maximises $\mathbb{E}_{\tilde{q}_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|\tilde{x})}\right]$. We refer to the model which is trained to maximise this objective as a DVAE; see the DVAE in Figure 1. Im et al. [12] show that the DVAE achieves lower negative variational lower bounds than the regular variational autoencoder on a test dataset.

However, note that $q_\phi(z|\tilde{x})$ is matched to the prior, $p(z)$, rather than $\tilde{q}_\phi(z|x)$. This means that generating novel samples using $p_\theta(x|z)$ is not as simple as the process of generating samples from a variational autoencoder. To generate novel samples, we should sample $z_i \sim \tilde{q}_\phi(z|x)$, $x_i \sim p_\theta(x|z_i)$, which is difficult because of the need to evaluate $\tilde{q}_\phi(z|x)$. Im et al. [12] do not address this problem.

For both DVAEs and VAEs there is a limited choice of prior and posterior distributions for which there exists an analytic solution for the KL divergence. Alternatively, adversarial training may be used to learn a model that matches samples to an arbitrarily complicated target distribution – provided that samples may be drawn from both the target and model distributions.

II Related Work

II-A Adversarial Training

In adversarial training [9] a model $g_\phi$ is trained to produce output samples, $x$, that match a target probability distribution $p(x)$. This is achieved by iteratively training two competing models, a generative model, $g_\phi$, and a discriminative model, $d_\chi$. The discriminative model is fed with samples either from the generator (i.e. ‘fake’ samples) or with samples from the target distribution (i.e. ‘real’ samples), and trained to correctly predict whether samples are ‘real’ or ‘fake’. The generative model, fed with input samples $z$ drawn from a chosen prior distribution, $p(z)$, is trained to generate output samples $x$ that are indistinguishable from target samples in order to ‘fool’ [25] the discriminative model into making incorrect predictions. This may be achieved by the following mini-max objective [9]:

$$\min_{g_\phi} \max_{d_\chi} \; \mathbb{E}_{x \sim p(x)}\left[\log d_\chi(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - d_\chi(g_\phi(z))\right)\right]$$

It has been shown that for an optimal discriminative model, optimising the generative model is equivalent to minimising the Jensen-Shannon divergence between the generated and target distributions [9]. In general, it is reasonable to assume that, during training, the discriminative model quickly achieves near optimal performance [9]. This property is useful for learning distributions for which the Jensen-Shannon divergence may not be easily calculated.
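The link to the Jensen-Shannon divergence can be checked numerically on small discrete distributions. The sketch below is an illustration under stated assumptions (discrete target `p` and generated `q`, and the standard optimal discriminator $p/(p+q)$); it verifies that the objective at the optimal discriminator equals $2\,JSD(p \,\|\, q) - \log 4$.

```python
import math

def optimal_discriminator(p, q):
    """Optimal discriminator output p/(p+q) for each outcome."""
    return [pi / (pi + qi) if (pi + qi) > 0 else 0.5 for pi, qi in zip(p, q)]

def gan_value(p, q, d):
    """E_p[log d] + E_q[log(1 - d)] for discrete distributions p and q."""
    total = 0.0
    for pi, qi, di in zip(p, q, d):
        if pi > 0:
            total += pi * math.log(di)
        if qi > 0:
            total += qi * math.log(1.0 - di)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions p and q."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5]   # target distribution (toy values)
q = [0.9, 0.1]   # generated distribution (toy values)
value = gan_value(p, q, optimal_discriminator(p, q))
```

When `q` equals `p` the divergence is zero and the value reduces to $-\log 4$, the point at which the discriminator outputs 0.5 everywhere.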

The generative model is optimal when the distribution of generated samples matches the target distribution. Under these conditions, the discriminator is maximally confused and cannot distinguish ‘real’ samples from ‘fake’ ones. As a consequence of this, adversarial training may be used to capture very complicated data distributions, and has been shown to be able to synthesise images of handwritten digits and human faces that are almost indistinguishable from real data [25].

II-B Adversarial Autoencoders

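Makhzani et al. [21] introduce the adversarial autoencoder (AAE), where $q_\phi(z|x)$ is both the probabilistic encoding model in an autoencoder framework and the generative model in an adversarial framework. A new discriminative model, $d_\chi(z)$, is introduced. This discriminative model is trained to distinguish between latent samples drawn from $p(z)$ and $q_\phi(z|x)$. The cost function used to train the discriminator, $d_\chi(z)$, is:

$$\mathcal{L}_{dis} = -\frac{1}{N}\sum_{i=1}^{N} \log d_\chi(z_i) - \frac{1}{N}\sum_{j=N+1}^{2N} \log\left(1 - d_\chi(z_j)\right)$$

where $z_i \sim p(z)$ for $i = 1, \dots, N$, $z_j \sim q_\phi(z|x)$ for $j = N+1, \dots, 2N$, and $N$ is the size of the training batch.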

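Adversarial training is used to match $q_\phi(z|x)$ to an arbitrarily chosen prior, $p(z)$. The cost function for matching $q_\phi(z|x)$ to the prior, $p(z)$, is as follows:

$$\mathcal{L}_{prior} = \frac{1}{N}\sum_{i=1}^{N} \log\left(1 - d_\chi(z_i)\right)$$

where $z_i \sim q_\phi(z|x)$ and $N$ is the size of a training batch. If both $\mathcal{L}_{prior}$ and $\mathcal{L}_{dis}$ are optimised, $q_\phi(z|x)$ will be indistinguishable from $p(z)$.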
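The two costs described in this subsection can be sketched as plain functions of the discriminator's outputs. This is a minimal illustration assuming the usual binary cross-entropy form; `d_real` and `d_fake` are hypothetical lists of discriminator outputs in (0, 1) for samples from the prior and from the encoder respectively.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy for the discriminator: prior samples are 'real', encodings are 'fake'."""
    real_term = -sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def prior_loss(d_fake):
    """Encoder cost: it is lowered by making encodings look 'real' to the discriminator."""
    return sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)

# A maximally confused discriminator outputs 0.5 for every sample.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The encoder's cost falls as the discriminator assigns higher 'real' probability to encodings, so `prior_loss([0.9])` is lower than `prior_loss([0.1])`.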

In Makhzani et al.'s [21] adversarial autoencoder, $q_\phi(z|x)$ is specified by a neural network whose input is $x$ and whose output is $z$. This allows $q_\phi(z|x)$ to have arbitrary complexity, unlike the VAE where the complexity of $q_\phi(z|x)$ is usually limited to a Gaussian. In an adversarial autoencoder the posterior does not have to be analytically defined because an adversary is used to match the prior, avoiding the need to analytically compute a KL divergence.

Makhzani et al. [21] demonstrate that adversarial autoencoders are able to match $q_\phi(z|x)$ to several different priors, $p(z)$, including a mixture of 10 2D-Gaussian distributions. We explore another direction for adversarial autoencoders, by extending them to incorporate a denoising criterion.

III Denoising Adversarial Autoencoder

We propose denoising adversarial autoencoders: denoising autoencoders that use adversarial training to match the distribution of auxiliary variables, $z$, to a prior distribution, $p(z)$.

We formulate two versions of a denoising adversarial autoencoder which are trained to approximately maximise the denoising variational lower bound [12]. In the first version, we directly match the posterior, $\tilde{q}_\phi(z|x)$, to the prior, $p(z)$, using adversarial training. We refer to this as an integrating Denoising Adversarial Autoencoder, iDAAE. In the second, we match the intermediate conditional probability distribution, $q_\phi(z|\tilde{x})$, to the prior, $p(z)$. We refer to this as a DAAE.

In the iDAAE, adversarial training is used to bypass analytically intractable KL divergences [12]. In the DAAE, using adversarial training broadens the choice for prior and posterior distributions beyond those for which the KL divergence may be analytically computed.

III-A Construction

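The distribution of encoded data samples is given by $\tilde{q}_\phi(z|x) = \int q_\phi(z|\tilde{x})\, c(\tilde{x}|x)\, d\tilde{x}$ [12]. The distribution of decoded data samples is given by $p_\theta(x|z)$. Both $q_\phi(z|\tilde{x})$ and $p_\theta(x|z)$ may be trained to maximise the likelihood of a reconstructed sample by minimising the reconstruction cost function, $\mathcal{L}_{rec} = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i|z_i)$, where the $z_i$ are obtained via the following sampling process: $x_i \sim p_d(x)$, $\tilde{x}_i \sim c(\tilde{x}|x_i)$, $z_i \sim q_\phi(z|\tilde{x}_i)$, and $p_d(x)$ is the distribution of the training data.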

We also want to match the distribution of auxiliary variables, $z$, to a prior, $p(z)$. When doing so, there is a choice to match either $\tilde{q}_\phi(z|x)$ or $q_\phi(z|\tilde{x})$ to $p(z)$. Each choice has its own trade-offs either during training or during sampling.

III-A1 iDAAE: Matching $\tilde{q}_\phi(z|x)$ to a prior

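In DVAEs there is often no analytical solution for the KL divergence between $\tilde{q}_\phi(z|x)$ and $p(z)$ [12], making it difficult to match $\tilde{q}_\phi(z|x)$ to $p(z)$. Rather, we propose using adversarial training to match $\tilde{q}_\phi(z|x)$ to $p(z)$, requiring samples to be drawn from $\tilde{q}_\phi(z|x)$ during training. It is challenging to draw samples directly from $\tilde{q}_\phi(z|x)$, but it is easy to draw samples from $q_\phi(z|\tilde{x})$, and so $\tilde{q}_\phi(z|x_i)$ may be approximated by $\frac{1}{M}\sum_{j=1}^{M} q_\phi(z|\tilde{x}_{i,j})$, $\tilde{x}_{i,j} \sim c(\tilde{x}|x_i)$, where $x_i$ are samples from the training data; see Figure 1. Matching is achieved by minimising the following cost function:

$$\mathcal{L}_{prior} = \frac{1}{N}\sum_{i=1}^{N} \log\left(1 - d_\chi(\hat{z}_i)\right)$$

where $\hat{z}_i = \frac{1}{M}\sum_{j=1}^{M} z_{i,j}$, $z_{i,j} \sim q_\phi(z|\tilde{x}_{i,j})$, $\tilde{x}_{i,j} \sim c(\tilde{x}|x_i)$, $x_i \sim p_d(x)$.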

III-A2 DAAE: Matching $q_\phi(z|\tilde{x})$ to a prior

Since drawing samples from $q_\phi(z|\tilde{x})$ is trivial, $q_\phi(z|\tilde{x})$ may be matched to $p(z)$ via adversarial training. This is more efficient than matching $\tilde{q}_\phi(z|x)$ since a Monte-Carlo integration step (in Section III-A1) is not needed; see Figure 1. In using adversarial training in place of the KL divergence, the only restriction is that we must be able to draw samples from the chosen prior. Matching may be achieved by minimising the following loss function:

$$\mathcal{L}_{prior} = \frac{1}{N}\sum_{i=1}^{N} \log\left(1 - d_\chi(z_i)\right)$$

where $z_i \sim q_\phi(z|\tilde{x}_i)$.

Though more computationally efficient to train, there are drawbacks when trying to synthesise novel samples from $p_\theta(x|z)$ if $q_\phi(z|\tilde{x})$, rather than $\tilde{q}_\phi(z|x)$, is matched to the prior. The effects of using a DAAE rather than an iDAAE may be visualised by plotting the empirical distribution of encodings of both data samples and corrupted data samples with the desired prior; these are shown in Figure 2.

IV Synthesising novel samples

In this section, we review several techniques used to draw samples from trained autoencoders, identify a problem with sampling DVAEs, which also applies to DAAEs, and propose a novel approach to sampling DAAEs; we draw strongly on previous work by Bengio et al. [4, 5].

IV-A Drawing Samples From Autoencoders

New samples may be generated by sampling a learned $p_\theta(x|z)$, conditioning on $z$ drawn from a suitable distribution. In the case of variational [14] and adversarial [21] autoencoders, the choice of this distribution is simple, because during training the distribution of auxiliary variables is matched to a chosen prior distribution, $p(z)$. It is therefore easy and efficient to sample both variational and adversarial autoencoders via the following process: $z_i \sim p(z)$, $x_i \sim p_\theta(x|z_i)$ [14, 21].

The process for sampling denoising autoencoders is more complicated. In the case where the auxiliary variable is a corrupted image, $\tilde{x}$ [3], the sampling process is as follows: $x_0 \sim p_d(x)$, $\tilde{x}_0 \sim c(\tilde{x}|x_0)$, $x_1 \sim p_\theta(x|\tilde{x}_0)$ [5]. In the case where the auxiliary variable is an encoding [31, 32], the sampling process is the same, with $p_\theta(x|\tilde{x})$ encompassing both the encoding and decoding process.

However, since a denoising autoencoder is trained to reconstruct corrupted versions of its inputs, $x_1$ is likely to be very similar to $x_0$. Bengio et al. [5] propose a method for iteratively sampling denoising autoencoders by defining a Markov chain whose stationary distribution exists under certain conditions and is equivalent, under certain assumptions, to the training data distribution. This approach is generalised and extended by Bengio et al. [4] to introduce a latent distribution with no prior assumptions on $z$.

We now consider the implication for drawing samples from denoising adversarial autoencoders introduced in Section III-A. By using the iDAAE formulation (Section III-A1), where $\tilde{q}_\phi(z|x)$ is matched to the prior over $z$, $x$ samples may be drawn from $p_\theta(x|z)$ conditioning on $z \sim p(z)$. However, if we use the DAAE, matching $q_\phi(z|\tilde{x})$ to a prior, sampling becomes non-trivial.

On the surface, it may appear easy to draw samples from DAAEs (Section III-A2), by first sampling the prior, $p(z)$, and then sampling $p_\theta(x|z)$. However, the full posterior distribution is given by $\tilde{q}_\phi(z|x) = \int q_\phi(z|\tilde{x})\, c(\tilde{x}|x)\, d\tilde{x}$, but only $q_\phi(z|\tilde{x})$ is matched to $p(z)$ during training (see Figure 2). The implication of this is that, when attempting to synthesise novel samples from $p_\theta(x|z)$, drawing samples from the prior, $p(z)$, is unlikely to yield samples consistent with $p_d(x)$. This will become more clear in Section IV-B.

Figure 2: Comparison of how the iDAAE and DAAE match encodings to the prior when trained on the CelebA dataset. "Encoding" refers to $\tilde{q}_\phi(z|x)$, "prior" refers to the normal prior $p(z)$, and "encoded corrupted data" refers to $q_\phi(z|\tilde{x})$. (a) DAAE: encoded corrupted samples match the prior. (b) iDAAE: encoded data samples match the prior.

IV-B Proposed Method For Sampling DAAEs

Here, we propose a method for synthesising novel samples using trained DAAEs. In order to draw samples from $p_\theta(x|z)$, we need to be able to draw samples from $\tilde{q}_\phi(z|x)$.

To ensure that we draw novel data samples, we do not want to draw samples from the training data at any point during sample synthesis. This means that we cannot use data samples from our training data to approximately draw samples from $\tilde{q}_\phi(z|x)$.

Instead, similar to Bengio et al. [5], we formulate a Markov chain, which we show has the necessary properties to converge, and we show that the chain converges to $P(z) = \int \tilde{q}_\phi(z|x)\, P(x)\, dx$, where $P(x)$ denotes the training data distribution. Unlike Bengio's formulation, our chain is initialised with a random vector of the same dimensions as the latent space, rather than a sample drawn from the training set.

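We define a Markov chain by the following sampling process:

$$z^{(0)} \in \mathbb{R}^{a}, \quad x^{(t+1)} \sim p_\theta(x|z^{(t)}), \quad \tilde{x}^{(t+1)} \sim c(\tilde{x}|x^{(t+1)}), \quad z^{(t+1)} \sim q_\phi(z|\tilde{x}^{(t+1)}) \qquad (3)$$

Notice that our first sample is any real vector of dimension $a$, where $a$ is the dimension of the latent space. This Markov chain has the transition operator:

$$T_{\theta,\phi}(z^{(t+1)}|z^{(t)}) = \int \tilde{q}_\phi(z^{(t+1)}|x^{(t+1)})\, p_\theta(x^{(t+1)}|z^{(t)})\, dx^{(t+1)} \qquad (4)$$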

We will now show that under certain conditions this transition operator defines an ergodic Markov chain that converges to $P(z)$ in the following steps: 1) We will show that there exists a stationary distribution, $P(z)$, for $z^{(0)}$ drawn from a specific choice of initial distribution (Lemma 1). 2) The Markov chain is homogeneous, because the transition operator is defined by a set of distributions whose parameters are fixed during sampling. 3) We will show that the Markov chain is also ergodic (Lemma 2). 4) Since the chain is both homogeneous and ergodic there exists a unique stationary distribution to which the Markov chain will converge [23].

Step 1) shows that one stationary distribution is $P(z)$, which we now know by 2) and 3) to be the unique stationary distribution. So the Markov chain converges to $P(z)$.

In this section only, we use a change of notation, where the training data probability distribution, previously represented as $p_d(x)$, is represented as $P(x)$; this is to help make distinctions between “natural system” probability distributions and the learned distributions. Further, note that $p(z)$ is the prior, while the distribution required for sampling $P(x|z)$ is $P(z)$, such that:

$$P(x) = \int P(x|z)\, P(z)\, dz, \qquad P(z) = \int \tilde{q}_\phi(z|x)\, P(x)\, dx$$

Lemma 1.

$P(z)$ is a stationary distribution for the Markov chain defined by the sampling process in (3).

For proof see Appendix.

Lemma 2.

The Markov chain defined by the transition operator, $T_{\theta,\phi}(z^{(t+1)}|z^{(t)})$ (4), is ergodic, provided that the corruption process is additive Gaussian noise and that the adversarial pair, $q_\phi(z|\tilde{x})$ and $d_\chi$, are optimal within the adversarial framework.

For proof see Appendix.

Theorem 1.

Assuming that $p_\theta(x|z)$ is approximately equal to $P(x|z)$, and that the adversarial pair, $q_\phi(z|\tilde{x})$ and $d_\chi$, are optimal, the transition operator $T_{\theta,\phi}(z^{(t+1)}|z^{(t)})$ defines a Markov chain whose unique stationary distribution is $P(z)$.


This follows from Lemmas 1 and 2. ∎

This sampling method uncovers the distribution $P(z)$ on which samples drawn from $P(x|z)$ must be conditioned in order to sample $P(x)$. Assuming $p_\theta(x|z) \approx P(x|z)$, this allows us to draw samples from $P(x)$.

For completeness, we would like to acknowledge that there are several other methods that use Markov chains during the training of autoencoders [2, 22] to improve performance. Our approach for synthesising samples using the DAAE is focused on sampling only from trained models; the Markov chain sampling is not used to update model parameters.

V Implementation

The analyses of Sections III and IV are deliberately general: they do not rely on any specific implementation choice to capture the model distributions. In this section, we consider a specific implementation of denoising adversarial autoencoders and apply them to the task of learning models for image distributions. We define an encoding model, $E_\phi$, that maps corrupted data samples to a latent space, and a decoding model, $R_\theta$, which maps samples from a latent space to an image space. These respectively draw samples according to the conditional probabilities $q_\phi(z|\tilde{x})$ and $p_\theta(x|z)$. We also define a corruption process, $C(x)$, which draws samples according to $c(\tilde{x}|x)$.

The parameters $\theta$ and $\phi$ of models $R_\theta$ and $E_\phi$ are learned under an autoencoder framework; the parameters $\phi$ are also updated under an adversarial framework. The models are trained using large datasets of unlabelled images.

V-A The Autoencoder

Under the autoencoder framework, $E_\phi$ is the encoder and $R_\theta$ is the decoder. We used fully connected neural networks for both the encoder and decoder. Rectified Linear Units (ReLU) were used between all intermediate layers to encourage the networks to learn representations that capture multi-modal distributions. In the final layer of the decoder network, a sigmoid activation function is used so that the output represents pixels of an image. The final layer of the encoder network is left as a linear layer, so that the distribution of encoded samples is not restricted.

As described in Section III-A, the autoencoder is trained to maximise the log-likelihood of the reconstructed image given the corrupted image. Although there are several ways in which one may evaluate this log-likelihood, we chose to measure pixel-wise binary cross-entropy between the reconstructed sample, $\hat{x}$, and the original sample before corruption, $x$. During training we aim to learn parameters $\phi$ and $\theta$ that minimise the binary cross-entropy between $\hat{x}$ and $x$. The training process is summarised in Algorithm 1 in the Appendix.
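A minimal sketch of the pixel-wise binary cross-entropy used as the reconstruction cost follows. Averaging over pixels and clamping with a small `eps` to avoid `log(0)` are implementation assumptions made here, not details stated in the text.

```python
import math

def pixelwise_bce(x_hat, x, eps=1e-12):
    """Mean binary cross-entropy between a reconstruction x_hat and a clean target x.

    Both arguments are flat lists of pixel values in [0, 1].
    """
    total = 0.0
    for p, t in zip(x_hat, x):
        p = min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(x)

# An uninformative reconstruction (all pixels 0.5) costs log 2 per pixel.
baseline = pixelwise_bce([0.5, 0.5], [1.0, 0.0])
```

Note that the target is the clean sample rather than the corrupted input, which is what makes this a denoising criterion.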

The vectors output by the encoder may take any real values; therefore, minimising reconstruction error is not sufficient to match either $q_\phi(z|\tilde{x})$ or $\tilde{q}_\phi(z|x)$ to the prior, $p(z)$. For this, parameters $\phi$ must also be updated under the adversarial framework.

V-B Adversarial Training

To perform adversarial training we define the discriminator $d_\chi$, described in Section II-A, to be a fully connected neural network, which we denote $D_\chi$. The output of $D_\chi$ is a “probability” because the final layer of the neural network has a sigmoid activation function, constraining the range of $D_\chi$ to be between 0 and 1. Intermediate layers of the network have ReLU activation functions to encourage the network to capture highly non-linear relations between $z$ and the labels, {‘real’, ‘fake’}.

How adversarial training is applied depends on whether $q_\phi(z|\tilde{x})$ or $\tilde{q}_\phi(z|x)$ is being fit to the prior, $p(z)$. $z_{fake}$ refers to the samples drawn from the distribution that we wish to fit to $p(z)$, and $z_{real}$ to samples drawn from the prior, $p(z)$. The discriminator, $D_\chi$, is trained to predict whether $z$'s are ‘real’ or ‘fake’. This may be achieved by learning parameters $\chi$ that maximise the probability of the correct labels being assigned to $z_{fake}$ and $z_{real}$. This training procedure is shown in Algorithm 1.

Drawing samples, $z_{real}$, involves sampling some prior distribution, $p(z)$, often a Gaussian. Now, we consider how to draw fake samples, $z_{fake}$. How these samples are drawn depends on whether $q_\phi(z|\tilde{x})$ (DAAE) is being fit to the prior or $\tilde{q}_\phi(z|x)$ (iDAAE) is being fit to the prior. Drawing samples, $z_{fake}$, is easy if $q_\phi(z|\tilde{x})$ is being matched to the prior, as these are simply obtained by mapping corrupted samples through the encoder: $z_{fake} = E_\phi(C(x))$.

However, if $\tilde{q}_\phi(z|x)$ is being matched to the prior, we must use Monte Carlo sampling to approximate $z_{fake}$ samples (see Section III-A1). The process for calculating $z_{fake}$ is given by Algorithm 2 in the Appendix, and detailed in Section III-A1.
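The difference between the two ways of drawing fake latent samples can be sketched as follows. This is only an illustration: the linear `toy_encoder`, its `weights`, and in particular the choice to average `m` encodings of independently corrupted copies are assumptions standing in for the trained encoder and for Algorithm 2, which is not reproduced in this section.

```python
import random

def corrupt(x, sigma, rng):
    """Additive Gaussian corruption of a flat list of pixel values."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def toy_encoder(x_tilde, weights):
    """Hypothetical linear encoder: one latent unit per row of weights."""
    return [sum(w * v for w, v in zip(row, x_tilde)) for row in weights]

def z_fake_daae(x, weights, sigma, rng):
    """DAAE: encode a single corrupted copy of x."""
    return toy_encoder(corrupt(x, sigma, rng), weights)

def z_fake_idaae(x, weights, sigma, rng, m=5):
    """iDAAE (assumed form): average the encodings of m corrupted copies of x."""
    encodings = [toy_encoder(corrupt(x, sigma, rng), weights) for _ in range(m)]
    return [sum(column) / m for column in zip(*encodings)]

rng = random.Random(0)
weights = [[0.5, -0.5, 1.0], [1.0, 0.0, 0.0]]  # 3 pixels -> 2 latent units
x = [0.2, 0.4, 0.6]
z_daae = z_fake_daae(x, weights, sigma=0.1, rng=rng)
z_idaae = z_fake_idaae(x, weights, sigma=0.1, rng=rng, m=5)
```

The iDAAE variant costs `m` encoder passes per training sample, which is the extra computation the text refers to.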

Finally, in order to match the distribution of $z_{fake}$ samples to the prior, $p(z)$, adversarial training is used to update parameters $\phi$ while holding parameters $\chi$ fixed. Parameters $\phi$ are updated to minimise the likelihood that $D_\chi$ correctly classifies $z_{fake}$ as being ‘fake’. The training procedure is laid out in Algorithm 1.

Algorithm 1 shows the steps taken to train an iDAAE. To train a DAAE instead, all lines in Algorithm 1 are the same except the line that computes $z_{fake}$, which may be replaced by $z_{fake} = E_\phi(C(x))$.

V-C Sampling

Although the training process for matching $\tilde{q}_\phi(z|x)$ to $p(z)$ is less computationally efficient than matching $q_\phi(z|\tilde{x})$ to $p(z)$, it is very easy to draw samples when $\tilde{q}_\phi(z|x)$ is matched to the prior (iDAAE). We simply draw a random $z$ value from $p(z)$, and calculate $x = R_\theta(z)$, where $x$ is a new sample. When drawing samples, parameters $\theta$ and $\phi$ are fixed.

If $q_\phi(z|\tilde{x})$ is matched to the prior (DAAE), an iterative sampling process is needed in order to draw new samples from $p_\theta(x|z)$. This sampling process is described in Section IV-B and is trivial to implement. A random sample, $z^{(0)}$, is drawn from any distribution; the distribution does not have to be the chosen prior, $p(z)$. New samples, $z^{(t)}$, are obtained by iteratively decoding, corrupting and encoding $z^{(0)}$, such that $z^{(t+1)}$ is given by:

$$z^{(t+1)} = E_\phi\left(C\left(R_\theta\left(z^{(t)}\right)\right)\right)$$
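The decode, corrupt, encode loop can be sketched generically. The three callables below are toy stand-ins (simple linear maps and Gaussian noise), chosen only so the example is self-contained; with a trained DAAE they would be the decoder, corruption process and encoder.

```python
import random

def sample_daae_chain(z0, decode, corrupt, encode, steps):
    """Return [z0, z1, ..., z_steps] with z_{t+1} = encode(corrupt(decode(z_t)))."""
    chain = [z0]
    z = z0
    for _ in range(steps):
        z = encode(corrupt(decode(z)))
        chain.append(z)
    return chain

rng = random.Random(0)
decode = lambda z: [2.0 * zi for zi in z]                    # toy decoder
corrupt = lambda x: [xi + rng.gauss(0.0, 0.1) for xi in x]   # additive Gaussian noise
encode = lambda x_tilde: [0.25 * xi for xi in x_tilde]       # toy encoder

# The initial vector is deliberately far from where the chain settles.
chain = sample_daae_chain([10.0, -10.0], decode, corrupt, encode, steps=20)
```

In this toy setting the influence of the starting point halves at every step, so after a few iterations the state depends mostly on the injected noise rather than on the initial vector.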

In the following section, we evaluate the performance of denoising adversarial autoencoders on three image datasets, a handwritten digit dataset (MNIST) [18], a synthetic colour image dataset of tiny images (Sprites) [26], and a complex dataset of hand-written characters [17]. The denoising and non-denoising adversarial autoencoders (AAEs) are compared for tasks including reconstruction, generation and classification.

VI Experiments & Results

VI-A Code Available Online

We make our PyTorch [24] code available online. An older version of our code, written in Theano, is also available, with our results presented in iPython notebooks. Since this is a revised version of our paper and Theano is no longer being supported, our new experiments on the CelebA dataset were performed using PyTorch.

VI-B Datasets

We evaluate our denoising adversarial autoencoder on three image datasets of varying complexity. Here, we describe the datasets and their complexity in terms of variation within the dataset, number of training examples and size of the images.

VI-B1 Datasets: Omniglot

The Omniglot dataset is a handwritten character dataset consisting of categories of character from different writing systems, with only examples of each character. Each example in the dataset is -by- pixels, taking values {0,1}. The dataset is split such that examples from categories make up the training dataset, while one example from each of those categories makes up the testing dataset. The characters from each of the remaining categories make up the evaluation dataset. This means that experiments may be performed to reconstruct or classify samples from categories not seen during training of the autoencoders.

VI-B2 Datasets: Sprites

The sprites dataset is made up of unique human-like characters. Each character has attributes including hair, body, armour, trousers, arm and weapon type, as well as gender. For each character there animations consisting of to frames each. There are between and examples of each character, however every example is in a different pose. Each sample is -by- pixels and samples are in colour. The training, validation and test datasets are split by character to have , and unique characters each, with no two sets having the same character.

VI-B3 Datasets: CelebA

The CelebA dataset consists of k images of faces in colour. Though a version of the dataset with tightly cropped faces exists, we use the un-cropped dataset. We use samples for testing and the rest for training. Each example has dimensions -by- and a set of labelled facial attributes for example, ‘No Beard’, ‘Blond Hair’, ‘Wavy Hair’ etc. . This face dataset is more complex than the Toronto Face Dataset used by Makhzani et al. [21] for training the AAE.

VI-C Architecture and Training

For each dataset, we detail the architecture and training parameters of the networks used to train each of the denoising adversarial autoencoders. For each dataset, several DAAEs, iDAAEs and AAEs are trained. In order to compare models trained on the same datasets, the same network architectures, batch size, learning rate, annealing rate and size of latent code are used for each. Each set of models was trained using the same optimization algorithm. The trained AAE [21] models act as a benchmark, allowing us to compare our proposed DAAEs and iDAAEs.

VI-C1 Architecture and Training: Omniglot

The decoder, encoder and discriminator networks consisted of , and fully connected layers respectively, each layer having neurons. We found that deeper networks than those proposed by Makhzani et al. [21] (for the MNIST dataset) led to better convergence. The networks are trained for epochs, using a learning rate , a batch size of 64 and the Adam [13] optimization algorithm. We used a

D Gaussian for the prior and additive Gaussian noise with standard deviation

for the corruption process. When training the iDAAE, we use steps of Monte Carlo integration (see Algorithm 2 in the Appendix).

VI-C2 Architecture and Training: Sprites

Both the encoder and discriminator are -layer fully connected neural networks with neurons in each layer. For the decoder, we used a -layer fully connected network with neurons in the first layer and in each of the last layers, this configuration allowed us to capture complexity in the data without over fitting. The networks were trained for epochs, using a batch size of , a learning rate of and the Adam [13] optimization algorithm. We used an encoding units, D Gaussian for the prior and additive Gaussian noise with standard deviation for the corruption process. The iDAAE was trained with steps of Monte Carlo integration.

VI-C3 Architecture and Training: CelebA

The encoder and decoder were constructed with convolutional layers, rather than fully connected layers, since the CelebA dataset is more complex than the Toronto Face Dataset used by Makhzani et al. [21]. The encoder and decoder consisted of convolutional layers with a similar structure to that of the DCGAN proposed by Radford et al. [25]. We used a -layer fully connected network for the discriminator. Networks were trained for epochs with a batch size of

using RMSprop with learning rate

and momentum of for training the discriminator. We found that using smaller momentum values led to more blurred images; however, larger momentum values prevented the network from converging and made training unstable. When using Adam instead of RMSprop (on the CelebA dataset specifically) we found that the values in the encodings became very large, and were not consistent with the prior. The encoding was made up of units and we used a D Gaussian for the prior. We used additive Gaussian noise for the corruption process. We experimented with different noise levels, between , and found several values in this range to be suitable. For our classification experiments we fixed and for synthesis from the DAAE, to demonstrate the effect of sampling, we used . For the iDAAE we experimented with . We found that (when ), was not sufficient to train an iDAAE. By comparing histograms of encoded data samples to histograms of the prior (see Figure 2), for an iDAAE trained with a particular value of $M$ (the number of Monte Carlo integration steps), we are able to see whether $M$ is sufficiently large or not. We found to be sufficiently large for most experiments.

VI-D Sampling DAAEs and iDAAEs

Samples may be synthesized using the decoder of a trained iDAAE or AAE by passing latent samples drawn from the prior through the decoder. On the other hand, if we pass samples from the prior through the decoder of a trained DAAE, the samples are likely to be inconsistent with the training data. To synthesize more consistent samples using the DAAE, we draw an initial from any random distribution (we use a Gaussian distribution for simplicity, which happens to be equivalent to our choice of prior) and decode, corrupt and encode the sample several times for each synthesized sample. This process is equivalent to sampling a Markov chain where one iteration of the Markov chain includes decoding, corrupting and encoding to get a after iterations. The may be used to synthesize a novel sample which we call, . is the sample generated when is passed through the decoder.
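The decode, corrupt and encode loop described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: `mc_sample`, `toy_decode`, `toy_encode` and `gaussian_corrupt` are hypothetical names, the linear maps merely stand in for a trained decoder and encoder, and the noise level and chain length are arbitrary.

```python
import random

def mc_sample(encode, decode, corrupt, z0, n_steps):
    """Run n_steps of decode -> corrupt -> encode, starting from latent z0.

    Returns the latent states [z_0, z_1, ..., z_n]; passing any of them
    through the decoder gives the synthesized sample for that step.
    """
    chain = [z0]
    z = z0
    for _ in range(n_steps):
        x = decode(z)          # latent space -> data space
        x_tilde = corrupt(x)   # apply the corruption process used in training
        z = encode(x_tilde)    # corrupted sample -> latent space
        chain.append(z)
    return chain

# Hypothetical stand-ins for a trained decoder and encoder.
def toy_decode(z):
    return [2.0 * v for v in z]

def toy_encode(x):
    return [0.5 * v for v in x]

rng = random.Random(0)

def gaussian_corrupt(x, sigma=0.1):
    # Additive Gaussian noise; sigma is an assumed value.
    return [v + rng.gauss(0.0, sigma) for v in x]

# Initial latent drawn from a Gaussian, then five iterations of the chain.
z0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
chain = mc_sample(toy_encode, toy_decode, gaussian_corrupt, z0, n_steps=5)
samples = [toy_decode(z) for z in chain]
```

Keeping the whole chain, rather than only the final state, makes it easy to compare the initial sample with samples from later iterations.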

To evaluate the quality of some synthesized samples, we calculated the log-likelihood of real samples under the model [21]. This is achieved by fitting a Parzen window to a number of synthesised samples. Further details of how the log-likelihood is calculated for each dataset are given in Appendix F.
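As a rough illustration of a Parzen window estimate, the sketch below places an isotropic Gaussian kernel on each synthesized sample and evaluates the log-density of a real sample under the resulting mixture. The function names and the bandwidth `sigma` are assumptions made for illustration; the exact procedure used for each dataset is the one given in Appendix F.

```python
import math

def parzen_log_likelihood(x, samples, sigma):
    """Log-density of x under a Gaussian Parzen window fitted to samples.

    log p(x) = logsumexp_i(-||x - s_i||^2 / (2 sigma^2))
               - log N - (d / 2) * log(2 pi sigma^2)
    """
    d = len(x)
    n = len(samples)
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    log_norm = math.log(n) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return log_sum - log_norm

def mean_log_likelihood(real_points, samples, sigma):
    """Average Parzen log-likelihood of real points under the synthesized samples."""
    total = sum(parzen_log_likelihood(x, samples, sigma) for x in real_points)
    return total / len(real_points)
```

The estimate depends strongly on the bandwidth, which is one reason why log-likelihood scores of this kind should be read with care.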

We expect initial samples, ’s drawn from the DAAE to have a lower (worse) log-likelihood than those drawn from the AAE; however, we expect Markov chain (MC) sampling to improve synthesized samples, such that for should have larger log-likelihood than the initial samples. It is not clear whether for drawn using a DAAE will be better than samples drawn from an iDAAE. The purpose of these experiments is to demonstrate the challenges associated with drawing samples from denoising adversarial autoencoders, and show that our proposed methods for sampling a DAAE and training iDAAEs allow us to address these challenges. We also hope to show that iDAAE and DAAE samples are competitive with those drawn from an AAE.

Vi-D1 Sampling: Omniglot

Here, we explore the Omniglot dataset, where we look at log-likelihood score on both a testing and evaluation dataset. Recall (Section VI-B1) that the testing dataset has samples from the same classes as the training dataset and the evaluation dataset has samples from different classes.

First, we discuss the results on the evaluation dataset. The results, shown in Figure 3, are consistent with what is expected of the models. The iDAAE out-performed the AAE, with a less negative (better) log-likelihood. The initial samples drawn using the DAAE had more negative (worse) log-likelihood values than samples drawn using the AAE. However, after one iteration of MC sampling, the synthesized samples have less negative (better) log-likelihood values than those from the AAE. Additional iterations of MC sampling led to worse results, possibly because synthesized samples tend towards multiple modes of the data generating distribution, appearing to be more like samples from classes represented in the training data.

The Omniglot testing dataset consists of one example of every category in the training dataset. This means that if multiple iterations of MC sampling cause synthesized samples to tend towards modes in the training data, the likelihood score on the testing dataset is likely to increase. The results shown in Figure 3 confirm this expectation; the log-likelihood for the sample is less negative (better) than for the sample. These apparently conflicting results in Figure 3 (whether sampling improves or worsens synthesized samples) highlight the challenges involved with evaluating generative models using the log-likelihood, discussed in more depth by Theis et al. [30]. For this reason, we also show qualitative results.

Figure 4(a) shows a set of initial samples () drawn from a DAAE, and Figure 4(b) shows samples synthesised after iterations () of MC sampling. These samples appear to be well varied, capturing multiple modes of the data generating distribution.

Figure 3: Omniglot log-likelihood of compared on the testing and evaluation datasets. The training and evaluation datasets have samples from different handwritten character classes. All models were trained using a 200D Gaussian prior. The training and testing datasets have samples from the same handwritten character classes.
Figure 4: Omniglot Markov chain (MC) sampling: (a) Initial sample, and (b) Corresponding samples, after iterations of MC sampling. The chain was initialized with .

Vi-D2 Sampling: Sprites

In alignment with expectation, the iDAAE model synthesizes samples with higher (better) log-likelihood, , than the AAE, . The initial image samples drawn from the DAAE model under-perform compared to the AAE model, ; however, after just one iteration of sampling the synthesized samples have higher log-likelihood than samples from the AAE. Results also show that synthesized samples drawn using the DAAE after one iteration of MC sampling have higher likelihood, , than samples drawn using either the iDAAE or AAE models.

When more than one step of MC sampling is applied, the log-likelihood decreases, in a similar way to results on the Omniglot evaluation dataset. This may be related to how the training and test data are split: each dataset has a unique set of characters, so combinations seen during training will not be present in the testing dataset. These results further suggest that MC sampling pushes synthesized samples towards modes in the training data.

Vi-D3 Sampling: CelebA

In Figure 5 we compare samples synthesized using an AAE to those synthesized using an iDAAE trained using integration steps. Figure 5(b) shows samples drawn from the iDAAE which improve upon those drawn from the AAE model.

In Figure 6 we show samples synthesized from a DAAE using the iterative approach described in Section IV-B for . We see that the initial samples, have blurry artifacts, while the final samples, are sharper and free from the blurry artifacts.

(a) AAE (Previous work)
(b) iDAAE (Our work)
Figure 5: CelebA iDAAE samples (a) AAE (no noise) with and (b) iDAAE (with noise) with , and .
Figure 6: DAAE Face Samples

When drawing samples from iDAAE and DAAE models trained on CelebA, a critical difference between the two models emerges: samples synthesized using a DAAE have good structure but appear to be quite similar to each other, while the iDAAE samples have poorer structure but appear to have lots of variation. The lack of variation in DAAE samples may be related to the sampling procedure, which according to theory presented by Alain et al. [1], would be similar to taking steps toward the highest density regions of the distribution (i.e. the mode), explaining why samples appear to be quite similar.

When comparing DAAE or iDAAE samples to samples from other generative models such as GANs [9], we may notice that samples are less sharp. However, GANs often suffer from ‘mode collapse’, where all synthesised samples are very similar; the iDAAE does not suffer mode collapse and does not require any additional procedures to prevent it [27]. Further, (vanilla) GANs do not offer an encoding model. Other GAN variants such as Bi-GAN [6] and ALI [7] do offer encoding models; however, the fidelity of reconstruction is very poor. The AAE, DAAE and iDAAE models are able to reconstruct samples faithfully. We will explore fidelity of reconstruction in the next section and compare to a state-of-the-art ALI that has been modified to have improved reconstruction fidelity, ALICE [19].

We conclude this section on sampling by making the following observations: samples synthesised using iDAAEs out-performed AAEs on all datasets, where . It is convenient that relatively small yields improvement, as the time needed to train an iDAAE may increase linearly with . We also observed that initial samples synthesized using the DAAE are poor and in all cases even just one iteration of MC sampling improves image synthesis.

Finally, evaluating generated samples is challenging: log-likelihood is not always reliable [30], and qualitative analysis is subjective. For this reason, we provided both quantitative and qualitative results to communicate the benefits of introducing MC sampling for a trained DAAE, and the advantages of iDAAEs over AAEs.

Vi-E Reconstruction

The reconstruction task involves passing samples from the test dataset through the trained encoder and decoder to recover a sample similar to the original (uncorrupted) sample. The reconstruction is evaluated by computing the mean squared error between the reconstruction and the original sample.
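A minimal sketch of this evaluation is given below, assuming samples are flat lists of pixel values and that `encode` and `decode` are callables wrapping the trained networks (both names are placeholders rather than part of our code).

```python
def mean_squared_error(original, reconstruction):
    """Mean of squared element-wise differences between two samples."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

def reconstruction_error(test_set, encode, decode):
    """Average MSE between each test sample and its reconstruction."""
    errors = [mean_squared_error(x, decode(encode(x))) for x in test_set]
    return sum(errors) / len(errors)
```

With a perfect autoencoder, where `decode(encode(x))` returns `x`, the error is zero; larger values indicate that information is being lost in the encoding.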

We are interested in reconstruction for several reasons. The first is that if we wish to use encodings for downstream tasks, for example classification, a good indication of whether the encoding is modeling the sample well is to check the reconstructions. For example, if the reconstructed image is missing certain features that were present in the original image, it may be that this information is not preserved in the encoding. The second reason is that checking reconstructions of test samples is also a method to evaluate whether the model has overfit to the training samples. The ability to reconstruct samples not seen during training suggests that a model has not overfit. The final reason is to further motivate AAE, DAAE and iDAAE models as alternatives to GAN based models that are augmented with encoders [19], for downstream tasks that require good sample reconstruction. We expect that adding noise during training would both prevent overfitting and encourage the model to learn more robust representations; therefore, we expect that the DAAE and iDAAE would outperform the AAE.

Vi-E1 Reconstruction: Omniglot

Table I compares reconstruction errors of the AAE, DAAE and iDAAE trained on the Omniglot dataset. The reconstruction errors for both the iDAAE and the DAAE are lower than that of the AAE. The results suggest that using the denoising criterion during training helps the network learn more robust features compared to the non-denoising variant. The smallest reconstruction error was achieved by the DAAE rather than the iDAAE; qualitatively, the reconstructions using the DAAE captured small details while the iDAAE lost some. This is likely to be related to the multimodal nature of in the DAAE compared to the unimodal nature of in an iDAAE.

Vi-E2 Reconstruction: Sprites

Table I shows reconstruction error on samples from the sprite test dataset for models trained on the sprite training data. In this case only the iDAAE model out-performed the AAE and the DAAE performed as well as the AAE.

Vi-E3 Reconstruction: CelebA

Table I shows reconstruction error on the CelebA dataset. We compare AAE, DAAE and iDAAE models trained with momentum, , where the DAAE and iDAAE have corruption and the iDAAE is trained with integration steps. We also experimented with however, better results were obtained using . While the DAAE performs similarly well to the AAE, the iDAAE outperforms both. Figure 7 shows examples of reconstructions obtained using the iDAAE. Although the reconstructions are slightly blurred, the reconstructions are highly faithful, suggesting that facial attributes are correctly encoded by the iDAAE model.

(a) Original
(b) Reconstructions
Figure 7: CelebA Reconstruction Error with an iDAAE
Model | Omniglot | Sprite | MNIST | CelebA
AAE | 0.047 | 0.019 | 0.017 | 0.500
DAAE | 0.029 | 0.019 | 0.015 | 0.501
iDAAE | 0.031 | 0.018 | 0.018 | 0.495
ALICE [19] | - | - | 0.080 | -
Table I: Reconstruction: Shows the mean squared error for reconstructions of corrupted test data samples. The MNIST dataset is described in the Appendix. This table serves two purposes: (1) To demonstrate that in most cases the DAAE and iDAAE are better able to reconstruct images compared to the AAE. (2) To motivate why we are interested in AAEs, as opposed to other GAN [9] related approaches, by comparing reconstruction error on MNIST for a state-of-the-art GAN variant, ALICE [19], which was designed to improve reconstruction fidelity in GAN-like models.

Vi-F Classification

We are motivated to understand the properties of the representations (latent encoding) learned by the DAAE and iDAAE trained on unlabeled data. A particular property of interest is the separability, in latent space, between objects of different classes. To evaluate separability, rather than training in a semi-supervised fashion [21] we obtain class predictions by training an SVM on top of the representations, in a similar fashion to that of Kumar et al. [16].

Vi-F1 Classification: Omniglot

Classifying samples in the Omniglot dataset is very challenging: the training and testing datasets consist of classes, with only examples of each class in the training dataset. The classes make up writing systems, where symbols between writing systems may be visually indistinguishable. Previous work has focused on only classifying , or classes from within a single writing system [28, 33, 8, 17]; however, we attempt to perform classification across all classes. The Omniglot training dataset is used to train SVMs (with RBF kernels) on encodings extracted from encoding models of the trained DAAE, iDAAE and AAE models. Classification scores are reported on the Omniglot evaluation dataset (Table II).

Results show that the DAAE and iDAAE out-perform the AAE on the classification task. The DAAE and iDAAE also out-perform a classifier trained on encodings obtained by applying PCA to the image samples, while the AAE does not, further showing the benefits of using denoising.

We perform a separate classification task using only classes from the Omniglot evaluation dataset (each class has examples). This second test is performed for two key reasons: a) to study how well autoencoders trained on only a subset of classes can generalise as feature extractors for classifiers of classes not seen during autoencoder training; b) to facilitate performance comparisons with previous work [8]. A linear SVM classifier is trained on the samples from each of the classes in the evaluation dataset and tested on the remaining sample from each class. We perform the classification times, leaving out a different sample from each class, in each experiment. The results are shown in Table II. For comparison, we also show classification scores when PCA is used as a feature extractor instead of a learned encoder.
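The leave-one-sample-per-class-out protocol described above can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is swapped in for the linear SVM used in our experiments; `fit_centroids`, `predict_centroid` and the data layout are illustrative assumptions.

```python
def leave_one_per_class_out(encodings_by_class, fit, predict):
    """Hold out the i-th encoding of every class in turn; return mean accuracy.

    encodings_by_class maps a class label to a list of encodings (lists of
    floats). fit and predict are supplied by the caller.
    """
    n_per_class = min(len(v) for v in encodings_by_class.values())
    accuracies = []
    for held_out in range(n_per_class):
        train, test = [], []
        for label, items in encodings_by_class.items():
            for i, z in enumerate(items[:n_per_class]):
                (test if i == held_out else train).append((z, label))
        model = fit(train)
        correct = sum(predict(model, z) == label for z, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Stand-in classifier (NOT the linear SVM used in the experiments).
def fit_centroids(train):
    sums, counts = {}, {}
    for z, label in train:
        acc = sums.setdefault(label, [0.0] * len(z))
        for k, v in enumerate(z):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_centroid(centroids, z):
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(z, centroids[label]))
    return min(centroids, key=sq_dist)
```

Because `fit` and `predict` are passed in, the same splitting code can be reused with any classifier trained on the encodings.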

Results show that the DAAE model out-performs the AAE model, while the iDAAE performs less well, suggesting that features learned by the DAAE transfer better to new tasks than those learned by the iDAAE. The AAE, iDAAE and DAAE models also out-perform PCA.

Model | Test Acc. (%) | Eval Acc. (%)
AAE | 18.36 | 78.75
DAAE | 31.74 | 83.00
iDAAE | 34.02 | 78.25
PCA | 31.02 | 76.75
Random chance | 0.11 | 5
Table II: Omniglot Classification on all Test Set Classes and on Evaluation Classes.

Vi-F2 Classification: CelebA

We perform a more extensive set of experiments to evaluate the linear separability of encodings learned on the CelebA dataset and compare to state-of-the-art methods including the VAE [14] and the β-VAE [10]. (The β-VAE weights the KL term in the VAE cost function with β to encourage better organisation of the latent space, factorising the latent encoding into interpretable, independent components.)

We train a linear SVM on the encodings of a DAAE (or iDAAE) to predict labels for facial attributes, for example ‘Blond Hair’, ‘No Beard’, etc. In our experiments, we compare classification accuracy on attributes obtained using the AAE, DAAE and iDAAE to previously reported results obtained for the VAE [14] and the β-VAE [10]; these are shown in Figure 8. The results for the VAE and β-VAE were obtained using a similar approach to ours and were reported by Kumar et al. [16]. We used the same hyperparameters to train all models and a fixed noise level of ; the iDAAE was trained with . Figure 8 shows that the AAE, iDAAE and DAAE models outperform the VAE and β-VAE models on most facial attribute categories.

Figure 8: Facial Attribute Classification. Comparison of classification scores for an AAE, DAAE and iDAAE compared to the VAE [14] and β-VAE [10]. A linear SVM classifier is trained on encodings to demonstrate the linear separability of representations learned by each model. The attribute classification values for the VAE and β-VAE were obtained from Kumar et al. [16].

To compare models more easily, we ask the question, ‘On how many facial attributes does one model outperform another?’, in the context of facial attribute classification. We ask this question for various combinations of model pairs; the results are shown in Figure 9. Figures 9 (a) and (b), comparing the AAE to DAAE and iDAAE respectively, demonstrate that for some attributes the denoising models outperform the non-denoising models. Moreover, the particular attributes for which the DAAE and iDAAE outperform the AAE are (fairly) consistent: both DAAE and iDAAE outperform the AAE on the (same) attributes ‘Attractive’, ‘Blond Hair’, ‘Wearing Hat’ and ‘Wearing Lipstick’. The iDAAE outperforms on an additional attribute, ‘Arched Eyebrows’.
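The pairwise comparison amounts to counting, over the attributes shared by two models, how often each has the higher accuracy. The sketch below illustrates this; the function name, attribute keys and accuracy values are made up for the example and are not results from our experiments.

```python
def compare_models(acc_a, acc_b):
    """Split shared attributes into those won by model A, by model B, and ties."""
    shared = sorted(set(acc_a) & set(acc_b))
    a_wins = [k for k in shared if acc_a[k] > acc_b[k]]
    b_wins = [k for k in shared if acc_b[k] > acc_a[k]]
    ties = [k for k in shared if acc_a[k] == acc_b[k]]
    return a_wins, b_wins, ties

# Toy per-attribute accuracies (illustrative values only).
toy_model_a = {"attr_1": 0.80, "attr_2": 0.75, "attr_3": 0.90}
toy_model_b = {"attr_1": 0.82, "attr_2": 0.75, "attr_3": 0.88}
a_wins, b_wins, ties = compare_models(toy_model_a, toy_model_b)
```

The lengths of the first two lists correspond to the two portions of one chart of the kind shown in Figure 9.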

Figure 9: On how many facial attributes does one model outperform another? For each chart, each portion shows the number of facial attributes on which each model outperforms the other model in the same chart.

There are various hyperparameters that may be chosen to train these models; for the DAAE and iDAAE we may choose the level of corruption and for the iDAAE we may additionally choose the number of integration steps, used during training. We compare attribute classification results for vastly different choices of parameter settings. The results are presented as a bar chart in Figure 10 for the DAAE. Additional results for the iDAAE are shown in the Appendix (Figure 15). These figures show that the models perform well under various different parameter settings. Figure 10 suggests that the model performs better with a smaller amount of noise rather than with ; however, it is important to note that a large amount of noise does not ‘break’ the model. These results demonstrate that the model works well for various hyperparameters, and fine tuning is not necessary to achieve reasonable results (when compared to the VAE for example). It is possible that further fine tuning may be done to achieve better results; however, a full parameter sweep is highly computationally expensive.

Figure 10: DAAE Robustness to hyperparameters

From this section, we may conclude that with the exception of facial attributes, AAEs and variations of AAEs are able to outperform the VAE and β-VAE on the task of facial attribute classification. This suggests that AAEs and their variants are interesting models to study in the setting of learning linearly separable encodings. We also show that for a specific set of several facial attribute categories, the iDAAE or DAAE performs better than the AAE. This consistency suggests that there are some specific attributes that the denoising variants of the AAE learn better than the non-denoising AAE.

Vi-G Trade-offs in Performance

The results presented in this section suggest that both the DAAE and iDAAE out-perform AAE models on most generation and some reconstruction tasks, and suggest it is sometimes beneficial to incorporate denoising into the training of adversarial autoencoders. However, it is less clear which of the two new models, DAAE or iDAAE, is better for classification. When evaluating which one to use, we must consider both the practicalities of training and, for generative purposes, the practicalities (primarily computational load) of each model.

The integrating steps required for training an iDAAE mean that it may take longer to train than a DAAE. On the other hand, it is possible to perform the integration process in parallel provided that sufficient computational resource is available. Further, once the model is trained, the time taken to compute encodings for classification is the same for both models. Finally, results suggest that using as few as integrating steps during training leads to an improvement in classification score. This means that for some classification tasks, it may be worthwhile to train an iDAAE rather than a DAAE.

For generative tasks, neither the DAAE nor the iDAAE model consistently out-performs the other in terms of log-likelihood of synthesized samples. The choice of model may be more strongly affected by the computational effort required during training or sampling. In terms of log-likelihood on the synthesized samples, training an iDAAE with even a small number of integration steps () leads to better quality images being generated, and similarly using even one step of sampling with a DAAE leads to better generations.

Conflicting log-likelihood values of generated samples between testing and evaluation datasets mean that these measurements are not a clear indication of how the number of sampling iterations affects the visual quality of samples synthesized using a DAAE. In some cases it may be necessary to visually inspect samples in order to assess effects of multiple sampling iterations (Figure 4).

Vii Conclusion

We propose two types of denoising autoencoders, where a posterior is shaped to match a prior using adversarial training. In the first, we match the posterior conditional on corrupted data samples to the prior; we call this model a DAAE. In the second, we match the posterior, conditional on original data samples, to the prior. We call the second model an integrating DAAE, or iDAAE, because the approach involves using Monte Carlo integration during training.

Our first contribution is the extension of adversarial autoencoders (AAEs) to denoising adversarial autoencoders (DAAEs and iDAAEs). Our second contribution includes identifying and addressing challenges related to synthesizing data samples using DAAE models. We propose synthesizing data samples by iteratively sampling a DAAE according to a MC transition operator, defined by the learned encoder and decoder of the DAAE model, and the corruption process used during training.

Finally, we present results on three datasets, for three tasks that compare both DAAE and iDAAE to AAE models. The datasets include: handwritten characters (Omniglot [17]), a collection of human-like sprite characters (Sprites [26]) and a dataset of faces (CelebA [20]). The tasks are reconstruction, classification and sample synthesis.


We acknowledge the Engineering and Physical Sciences Research Council for funding through a Doctoral Training studentship. We would also like to thank Kai Arulkumaran for interesting discussions and managing the cluster on which many experiments were performed. We also acknowledge Nick Pawlowski and Martin Rajchl for additional help with clusters.


Appendix A Proofs

Lemma 1.

is a stationary distribution for the Markov chain defined by the sampling process in 3.


Consider the case where . is from , by equation 5. Following the sampling process, , , is also from , by equation 6. Similar to proof in Bengio [4]. Therefore is a stationary distribution of the Markov chain defined by 3. ∎

Lemma 2.

The Markov chain defined by the transition operator, (4) is ergodic, provided that the corruption process is additive Gaussian noise and that the adversarial pair, and are optimal within the adversarial framework.


Consider , and . Where and :

  1. Assuming that is a good approximation of the underlying probability distribution, , then s.t. .

  2. Assuming that adversarial training has shaped the distribution of to match the prior, , then s.t. . This holds because if not all points in could be visited, would not have matched the prior.

1) suggests that every point in may be reached from a point in and 2) suggests that every point in may be reached from a point in . Under the assumption that is an additive Gaussian corruption process then is likely to lie within a (hyper) spherical region around . If the corruption process is sufficiently large such that (hyper) spheres of nearby samples overlap, for an and a set such that, and where is the support. Then, it is possible to reach any from any (including the case ). Therefore, the chain is both irreducible and positive recurrent.

To be ergodic the chain must also be aperiodic: between any two points and , there is a boundary, where values between and the boundary are mapped to , and points between and the boundary are mapped to . By applying the corruption process to , followed by the reconstruction process, there are always at least two possible outcomes because we assume that all (hyper) spheres induced by the corruption process overlap with at least one other (hyper) sphere: either is not pushed over the boundary and remains the same, or is pushed over the boundary and moves to a new state. The probability of either outcome is positive, and so there is always more than one route between two points, thus avoiding periodicity provided that for , . Consider the case where for , , then it would not be possible to recover both and using and so if then (and vice versa), which is a contradiction to 1).

Appendix B Examples Of Synthesised Samples

Figure 11: Examples of synthesised samples: Examples of randomly synthesised data samples.

Appendix C Algorithm For Training the iDAAE

#draw a batch of samples from the training data
for  to NoEpoch do
      #corrupt all samples
      #encode all corrupted samples
      #reconstruct
      #minimise reconstruction cost
      #match to using adversarial training
      approx_z() #draw samples for
      #draw samples from prior
      #train the discriminator
      #train the decoder to match the prior
end for
Algorithm 1 iDAAE : Matching to

Appendix D Algorithm For Monte Carlo Integration

function: approx_z()
for  to  do
      for  to  do
      end for
end for
Algorithm 2 Drawing samples from
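Since the body of Algorithm 2 did not survive extraction, the sketch below shows one plausible reading of `approx_z`: for every sample in the batch, several independently corrupted copies are encoded and the encodings are averaged. The averaging step, the deterministic encoder, and the argument names `encode`, `corrupt` and `n_steps` are all assumptions made for illustration.

```python
def approx_z(batch, encode, corrupt, n_steps):
    """Monte Carlo estimate of an encoding for each clean sample in batch.

    For each x, draws n_steps corrupted copies, encodes each one, and
    returns the element-wise mean of the encodings (assumed aggregation).
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    z_batch = []
    for x in batch:                    # outer loop over samples
        total = None
        for _ in range(n_steps):       # inner loop over integration steps
            z = encode(corrupt(x))
            total = list(z) if total is None else [a + b for a, b in zip(total, z)]
        z_batch.append([v / n_steps for v in total])
    return z_batch
```

The cost of this function grows linearly with `n_steps`, although the inner loop is independent across steps and could be run in parallel.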

Appendix E MNIST Results

E-a Datasets: MNIST

The MNIST dataset consists of grey-scale images of handwritten digits between and , with k training samples, k validation samples and k testing samples. The training, validation and testing datasets have an equal number of samples from each category. The samples are -by- pixels. The MNIST dataset is a simple dataset with few classes and many training examples, making it a good dataset for proof-of-concept. However, because the dataset is very simple, it does not necessarily reveal the effects of subtle, but potentially important, changes to algorithms for training or sampling. For this reason, we consider two datasets with greater complexity.

E-B Architecture and Training: MNIST

For the MNIST dataset, we train a total of models, detailed in the table of MNIST models below. The encoder, decoder and discriminator networks each have two fully connected layers with neurons each. For most models the size of the encoding is units, and the prior distribution that the encoding is being matched to is a D Gaussian. All networks are trained for epochs on the training dataset, with a learning rate of and a batch size of . The standard deviation of the additive Gaussian noise used during training is for all iDAAE and DAAE models.

An additional DAAE model is trained using the same training parameters and networks as described above but with a mixture of D Gaussians for the prior. Each D Gaussian with standard deviation is equally spaced with its mean around a circle of radius units. This results in a prior with modes separated from each other by large regions of very low probability. This model of the prior is very unrealistic, as it assumes that MNIST digits occupy distinct regions of image probability space. In reality, we may expect two numbers that are similar to exist side by side in image probability space, and for there to exist a smooth transition between handwritten digits. As a consequence of this, this model is intended specifically for evaluating how well the posterior distribution over latent space may be matched to a mixture of Gaussians - something that could not be achieved easily by a VAE [14].
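To make the shape of this prior concrete, the sketch below draws samples from a mixture of isotropic Gaussians whose means are equally spaced around a circle. It assumes a two-dimensional latent space (implied by placing the means on a circle); the function name and the arguments `n_modes`, `radius` and `std` are placeholders to be set by the caller, not values taken from our experiments.

```python
import math
import random

def sample_circular_gmm(n_samples, n_modes, radius, std, rng):
    """Draw 2D samples from a uniform mixture of Gaussians on a circle.

    Mode k has its mean at angle 2*pi*k/n_modes on a circle of the given
    radius; every mode has the same isotropic standard deviation.
    """
    samples = []
    for _ in range(n_samples):
        k = rng.randrange(n_modes)                 # pick a mode uniformly
        angle = 2.0 * math.pi * k / n_modes
        mean_x = radius * math.cos(angle)
        mean_y = radius * math.sin(angle)
        samples.append((rng.gauss(mean_x, std), rng.gauss(mean_y, std)))
    return samples
```

When `std` is small relative to the spacing between neighbouring means, the modes are separated by regions of very low probability, which is the property discussed above.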

ID Model Corruption Prior
1 AAE 0.0 10D Gaussian -
2 DAAE 0.5 10D Gaussian -
3 DAAE 0.5 10-GMM -
4 iDAAE 0.5