I. Background
I-A. Autoencoders
In a supervised learning setting, given a set of training data, we wish to learn a model that maximises the likelihood of the true label, y, given an observation, x. In the supervised setting, there are many ways to calculate and approximate the likelihood, because there is a ground truth label for every training data sample. When trying to learn a generative model, p(x), in the absence of a ground truth, calculating the likelihood of the model under the observed data distribution is challenging. Autoencoders introduce a two-step learning process that allows the estimation of p(x) via an auxiliary variable, z. The variable z may take many forms, and we shall explore several of these in this section. The two-step process involves first learning a probabilistic encoder [14], q(z|x), conditioned on observed samples, and second a probabilistic decoder [14], p(x|z), conditioned on the auxiliary variables. Using the probabilistic encoder, we may form a training dataset of pairs in which x is the ground truth output and the corresponding z is the input. The probabilistic decoder, p(x|z), may then be trained on this dataset in a supervised fashion. By sampling the decoder conditioned on suitable z's we may obtain a joint distribution, p(x, z), which may be marginalised by integrating over all z to obtain p(x).

I-B. Denoising Autoencoders (DAEs)
Bengio et al. [5] treat the encoding process as a local corruption process that does not need to be learned. Under the corruption process, c(x̃|x), the corrupted sample x̃ is the auxiliary variable (instead of z). The decoder, p(x|x̃), is therefore trained on the data pairs (x̃, x).
By using a local corruption process (e.g. additive white Gaussian noise [5]), both x and x̃ have the same number of dimensions and are close to each other. This makes it very easy to learn p(x|x̃). Bengio et al. [5] show how the learned model may be sampled using an iterative process, but do not explore how representations learned by the model may transfer to other applications such as classification.
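As a concrete illustration, a local corruption process such as additive white Gaussian noise, and the (x̃, x) training pairs it induces, can be sketched as follows; the noise level sigma here is an illustrative choice of ours, not a value from [5]:

```python
import numpy as np

def corrupt(x, sigma=0.1, rng=None):
    """Local corruption process c(x_tilde | x): additive white Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    return x + sigma * rng.standard_normal(x.shape)

def make_denoising_pairs(batch, sigma=0.1, rng=None):
    """A DAE is trained on (x_tilde, x) pairs: the input is the corrupted
    sample and the target is the clean sample."""
    x_tilde = corrupt(batch, sigma, rng)
    return x_tilde, batch

rng = np.random.default_rng(0)
x = rng.random((4, 784))  # a toy batch of flattened images
x_tilde, target = make_denoising_pairs(x, sigma=0.1, rng=rng)
```

Because the corruption is local, x̃ stays close to x, which is what makes the decoder's reconstruction task easy to learn.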
Hinton et al. [11] show that when auxiliary variables of an autoencoder have lower dimension than the observed data, the encoding model learns representations that may be useful for tasks such as classification and retrieval.
Rather than treating the corruption process as an encoding process [5] – missing out on potential benefits of using a lower dimensional auxiliary variable – Vincent et al. [31, 32] learn an encoding distribution, q(z|x̃), conditioned on corrupted samples. The decoding distribution, p(x|z), learns to reconstruct images from encoded, corrupted images, see the DAEs in Figure 1. Vincent et al. [32, 31] show that, compared to regular autoencoders, denoising autoencoders learn representations that are more useful and robust for tasks such as classification. The encoder and decoder parameters are learned simultaneously by minimising the reconstruction error for the training set, which does not include the latent samples. The ground truth z for a given x is unknown, and the form of the distribution over z to which samples are mapped is also unknown, making it difficult to draw novel data samples from the decoder model, p(x|z).
I-C. Variational Autoencoders
Variational autoencoders (VAEs) [14] specify a prior distribution, p(z), to which the encoder, q(z|x), should map all samples, by formulating and maximising a variational lower bound on the log-likelihood of the data.
The variational lower bound on the log-likelihood of p(x) is given by [14]:
log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))  (1)
The first term corresponds to the likelihood of a reconstructed x given the encoding, z, of a data sample x. This formulation of the variational lower bound does not involve a corruption process. The second term is the Kullback-Leibler (KL) divergence between q(z|x) and the prior, p(z). Samples are drawn from q(z|x) via a reparametrisation trick, see the VAE in Figure 1.
If q(z|x) is chosen to be a parametrised multivariate Gaussian, N(z; μ(x), σ(x)), and the prior is chosen to be a standard Gaussian distribution, then the KL divergence may be computed analytically. The KL divergence may only be computed analytically for certain (limited) choices of prior and posterior distributions. VAE training encourages q(z|x) to map observed samples to the chosen prior, p(z). Therefore, novel observed data samples may be generated via the following simple sampling process: z ~ p(z), x ~ p(x|z) [14].
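For a diagonal Gaussian posterior and a standard Gaussian prior, the KL term in (1) has the well-known closed form, and samples may be drawn via the reparametrisation trick; a minimal numpy sketch (the function names are ours):

```python
import numpy as np

def kl_diag_gauss_vs_std_normal(mu, log_var):
    """Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ), per sample [14]."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparametrise(mu, log_var, rng):
    """Draw z ~ q(z|x) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros((2, 8)); log_var = np.zeros((2, 8))
# When q(z|x) already equals the prior, the KL term is exactly zero.
kl = kl_diag_gauss_vs_std_normal(mu, log_var)
z = reparametrise(mu, log_var, rng)
```

Any mismatch between the posterior and the prior makes the KL term strictly positive, which is what drives the encoder towards p(z) during training.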
I-D. Denoising Variational Autoencoders
Adding the denoising criterion to a variational autoencoder is non-trivial because the variational lower bound becomes intractable.
Consider the conditional probability density function, q(z|x) = ∫ q(z|x̃)c(x̃|x)dx̃, where q(z|x̃) is the probabilistic encoder conditioned on corrupted samples, x̃, and c(x̃|x) is a corruption process. The variational lower bound may be formed by using this conditional in place of the posterior in (1) [12]. If q(z|x̃) is chosen to be Gaussian, then in many cases q(z|x) will be a mixture of Gaussians. If this is the case, there is no analytical solution for KL(q(z|x) || p(z)), and so the denoising variational lower bound becomes analytically intractable. However, there may still be an analytical solution for KL(q(z|x̃) || p(z)). The denoising variational autoencoder therefore maximises the bound obtained using this tractable KL term. We refer to the model which is trained to maximise this objective as a DVAE, see the DVAE in Figure 1. Im et al. [12] show that the DVAE achieves better (less negative) variational lower bounds than the regular variational autoencoder on a test dataset.
However, note that q(z|x̃) is matched to the prior, rather than q(z|x). This means that generating novel samples using p(x|z) is not as simple as the process of generating samples from a variational autoencoder. To generate novel samples, we should sample z ~ q(z|x) and then x ~ p(x|z), which is difficult because of the need to evaluate q(z|x). Im et al. [12] do not address this problem.
For both DVAEs and VAEs there is a limited choice of prior and posterior distributions for which there exists an analytic solution for the KL divergence. Alternatively, adversarial training may be used to learn a model that matches samples to an arbitrarily complicated target distribution – provided that samples may be drawn from both the target and model distributions.
II. Related Work
II-A. Adversarial Training
In adversarial training [9] a model is trained to produce output samples, x, that match a target probability distribution, p(x). This is achieved by iteratively training two competing models, a generative model, G, and a discriminative model, D. The discriminative model is fed with samples either from the generator (i.e. 'fake' samples) or with samples from the target distribution (i.e. 'real' samples), and trained to correctly predict whether samples are 'real' or 'fake'. The generative model, fed with input samples z drawn from a chosen prior distribution, p(z), is trained to generate output samples that are indistinguishable from target samples, in order to 'fool' [25] the discriminative model into making incorrect predictions. This may be achieved by the following minimax objective [9]:

min_G max_D E_{x~p(x)}[log D(x)] + E_{z~p(z)}[log(1 − D(G(z)))]
It has been shown that, for an optimal discriminative model, optimising the generative model is equivalent to minimising the Jensen-Shannon divergence between the generated and target distributions [9]. In general, it is reasonable to assume that, during training, the discriminative model quickly achieves near-optimal performance [9]. This property is useful for learning distributions for which the Jensen-Shannon divergence may not be easily calculated.
The generative model is optimal when the distribution of generated samples matches the target distribution. Under these conditions, the discriminator is maximally confused and cannot distinguish ‘real’ samples from ‘fake’ ones. As a consequence of this, adversarial training may be used to capture very complicated data distributions, and has been shown to be able to synthesise images of handwritten digits and human faces that are almost indistinguishable from real data [25].
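The two losses implied by the minimax objective can be made concrete with the standard cross-entropy form; this sketch scores discriminator outputs directly (the helper names are ours, and the non-saturating generator loss is the variant commonly used in practice [9]):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D maximises E[log D(x)] + E[log(1 - D(G(z)))];
    equivalently it minimises the negation, computed here."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: minimise -E[log D(G(z))] [9]."""
    return -np.mean(np.log(d_fake))

# A maximally confused discriminator outputs 0.5 for every sample,
# which is the equilibrium reached when generated and target
# distributions match.
d_real = np.full(16, 0.5)
d_fake = np.full(16, 0.5)
L_d = discriminator_loss(d_real, d_fake)  # 2*log(2) at the equilibrium
L_g = generator_loss(d_fake)              # log(2) at the equilibrium
```

At this equilibrium the discriminator loss equals 2 log 2, the value predicted by the Jensen-Shannon analysis in [9].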
II-B. Adversarial Autoencoders
Makhzani et al. [21] introduce the adversarial autoencoder (AAE), where the encoder, q(z|x), is both the probabilistic encoding model in an autoencoder framework and the generative model in an adversarial framework. A new discriminative model, d(z), is introduced. This discriminative model is trained to distinguish between latent samples drawn from q(z|x) and p(z). The cost function used to train the discriminator is:

L_dis = −(1/m) Σ_{i=1}^{m} [log d(z_i) + log(1 − d(z'_i))]

where z_i ~ p(z), z'_i ~ q(z|x_i), and m is the size of the training batch.
Adversarial training is used to match q(z|x) to an arbitrarily chosen prior, p(z). The cost function for matching q(z|x) to the prior is as follows:

L_gen = −(1/m) Σ_{i=1}^{m} log d(z'_i)  (2)

where z'_i ~ q(z|x_i) and m is the size of a training batch. If both L_dis and L_gen are optimised, samples from q(z|x) will be indistinguishable from samples from p(z).
In Makhzani et al.'s [21] adversarial autoencoder, q(z|x) is specified by a neural network whose input is x and whose output is z. This allows q(z|x) to have arbitrary complexity, unlike the VAE, where the complexity of the posterior is usually limited to a Gaussian. In an adversarial autoencoder the posterior does not have to be analytically defined, because an adversary is used to match q(z|x) to the prior, avoiding the need to analytically compute a KL divergence.
Makhzani et al. [21] demonstrate that adversarial autoencoders are able to match q(z|x) to several different priors, p(z), including a mixture of ten 2D Gaussian distributions. We explore another direction for adversarial autoencoders, extending them to incorporate a denoising criterion.
III. Denoising Adversarial Autoencoder
We propose denoising adversarial autoencoders: denoising autoencoders that use adversarial training to match the distribution of auxiliary variables, z, to a prior distribution, p(z).
We formulate two versions of a denoising adversarial autoencoder, each trained to approximately maximise the denoising variational lower bound [12]. In the first version, we directly match the posterior, q(z|x), to the prior, p(z), using adversarial training. We refer to this as an integrating Denoising Adversarial Autoencoder (iDAAE). In the second, we match the intermediate conditional probability distribution, q(z|x̃), to the prior. We refer to this as a DAAE.
In the iDAAE, adversarial training is used to bypass analytically intractable KL divergences [12]. In the DAAE, using adversarial training broadens the choice for prior and posterior distributions beyond those for which the KL divergence may be analytically computed.
III-A. Construction
The distribution of encoded data samples is given by q(z|x) = ∫ q(z|x̃)c(x̃|x)dx̃ [12]. The distribution of decoded data samples is given by p(x|z). Both the encoder and decoder may be trained to maximise the likelihood of a reconstructed sample by minimising a reconstruction cost, where reconstructions are obtained via the following sampling process: x ~ p_data(x), x̃ ~ c(x̃|x), z ~ q(z|x̃), and p_data is the distribution of the training data.
We also want to match the distribution of auxiliary variables, z, to a prior, p(z). When doing so, there is a choice to match either q(z|x) or q(z|x̃) to p(z). Each choice has its own trade-offs, either during training or during sampling.
III-A1. iDAAE: Matching q(z|x) to a prior
In DVAEs there is often no analytical solution for the KL divergence between q(z|x) and p(z) [12], making it difficult to match q(z|x) to p(z). Rather, we propose using adversarial training to match q(z|x) to p(z), which requires samples to be drawn from q(z|x) during training. It is challenging to draw samples directly from q(z|x), but it is easy to draw samples from q(z|x̃), and so samples from q(z|x) may be approximated by Monte-Carlo integration: z̄ ≈ (1/M) Σ_{j=1}^{M} z_j, z_j ~ q(z|x̃_j), x̃_j ~ c(x̃|x), where x are samples from the training data, see Figure 1. Matching is achieved by minimising the following cost function:

L_iDAAE = −(1/m) Σ_{i=1}^{m} log d(z̄_i)

where z̄_i is the Monte-Carlo approximation above for training sample x_i, and m is the size of a training batch.
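The Monte-Carlo approximation of a sample from q(z|x) can be sketched as follows, assuming an encoder function that maps a corrupted sample to a latent code; the toy linear encoder and the noise level are illustrative stand-ins, not the trained networks:

```python
import numpy as np

def approx_sample_q_z_given_x(x, encoder, corrupt, M, rng):
    """Approximate a draw from q(z|x) = E_{x_tilde ~ c(.|x)}[q(z|x_tilde)]
    by averaging the encodings of M independently corrupted copies of x."""
    zs = [encoder(corrupt(x, rng)) for _ in range(M)]
    return np.mean(zs, axis=0)

# Toy stand-ins: a linear "encoder" and additive Gaussian corruption.
W = np.random.default_rng(1).standard_normal((784, 10)) * 0.01
encoder = lambda x: x @ W
corrupt = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.random(784)
z_bar = approx_sample_q_z_given_x(x, encoder, corrupt, M=32, rng=rng)
```

Larger M gives a better approximation at proportionally higher training cost, which is the trade-off between the iDAAE and the DAAE discussed next.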
III-A2. DAAE: Matching q(z|x̃) to a prior
Since drawing samples from q(z|x̃) is trivial, q(z|x̃) may be matched to p(z) via adversarial training. This is more efficient than matching q(z|x), since the Monte-Carlo integration step (of Section III-A1) is not needed, see Figure 1. In using adversarial training in place of a KL divergence, the only restriction is that we must be able to draw samples from the chosen prior. Matching may be achieved by minimising the following loss function:

L_DAAE = −(1/m) Σ_{i=1}^{m} log d(z_i)

where z_i ~ q(z|x̃_i), x̃_i ~ c(x̃|x_i), and x_i are samples from the training data.
Though more computationally efficient to train, there are drawbacks when trying to synthesise novel samples from p(x|z) if q(z|x̃), rather than q(z|x), is matched to the prior. The effects of using a DAAE rather than an iDAAE may be visualized by plotting the empirical distributions of encodings of both data samples and corrupted data samples against the desired prior; these are shown in Figure 2.
IV. Synthesising Novel Samples
In this section, we review several techniques used to draw samples from trained autoencoders, identify a problem with sampling DVAEs, which also applies to DAAEs, and propose a novel approach to sampling DAAEs; we draw strongly on previous work by Bengio et al. [4, 5].
IV-A. Drawing Samples From Autoencoders
New samples may be generated by sampling a learned p(x|z), conditioning on z drawn from a suitable distribution. In the case of variational [14] and adversarial [21] autoencoders, the choice of this distribution is simple, because during training the distribution of auxiliary variables is matched to a chosen prior distribution, p(z). It is therefore easy and efficient to sample both variational and adversarial autoencoders via the following process: z ~ p(z), x ~ p(x|z) [14, 21].
The process for sampling denoising autoencoders is more complicated. In the case where the auxiliary variable is a corrupted image, x̃ [3], the sampling process is as follows: x ~ p_data(x), x̃ ~ c(x̃|x), x̂ ~ p(x|x̃) [5]. In the case where the auxiliary variable is an encoding [31, 32], the sampling process is the same, with the reconstruction model encompassing both the encoding and decoding processes.
However, since a denoising autoencoder is trained to reconstruct corrupted versions of its inputs, the reconstruction is likely to be very similar to the original sample. Bengio et al. [5] propose a method for iteratively sampling denoising autoencoders by defining a Markov chain whose stationary distribution, under certain assumptions, exists and is equivalent to the training data distribution. This approach is generalised and extended by Bengio et al. [4] to introduce a latent distribution with no prior assumptions on its form.
We now consider the implications for drawing samples from the denoising adversarial autoencoders introduced in Section III-A. Using the iDAAE formulation (Section III-A1), where q(z|x) is matched to the prior over z, samples may be drawn from p(x|z), conditioning on z ~ p(z). However, if we use the DAAE, matching q(z|x̃) to a prior, sampling becomes non-trivial.
On the surface, it may appear easy to draw samples from DAAEs (Section III-A2) by first sampling the prior, z ~ p(z), and then sampling p(x|z). However, the full posterior distribution is given by q(z|x) = ∫ q(z|x̃)c(x̃|x)dx̃, while only q(z|x̃) is matched to p(z) during training (see Figure 2). The implication of this is that, when attempting to synthesize novel samples from p(x|z), drawing samples from the prior, p(z), is unlikely to yield samples consistent with q(z|x). This will become clearer in Section IV-B.
IV-B. Proposed Method For Sampling DAAEs
Here, we propose a method for synthesising novel samples using trained DAAEs. In order to draw samples from p(x|z), we need to be able to draw samples from q(z|x).
To ensure that we draw novel data samples, we do not want to draw samples from the training data at any point during sample synthesis. This means that we cannot use samples from our training data to approximately draw samples from q(z|x).
Instead, similar to Bengio et al. [5], we formulate a Markov chain which we show has the necessary properties to converge, and we show that the chain converges to the distribution required for sampling. Unlike Bengio's formulation, our chain is initialised with a random vector of the same dimension as the latent space, rather than with a sample drawn from the training set.
We define a Markov chain by the following sampling process:

z_0 ~ p_0(z),  x_t ~ p(x|z_{t−1}),  x̃_t ~ c(x̃|x_t),  z_t ~ q(z|x̃_t)  (3)

Notice that our first sample, z_0, may be any real vector of dimension a, where a is the dimension of the latent space. This Markov chain has the transition operator:

T(z_t|z_{t−1}) = ∫∫ q(z_t|x̃) c(x̃|x) p(x|z_{t−1}) dx̃ dx  (4)
We will now show that, under certain conditions, this transition operator defines an ergodic Markov chain that converges to the required distribution, in the following steps: 1) We show that there exists a stationary distribution for a chain initialised from a specific choice of initial distribution (Lemma 1). 2) The Markov chain is homogeneous, because the transition operator is defined by a set of distributions whose parameters are fixed during sampling. 3) We show that the Markov chain is also ergodic (Lemma 2). 4) Since the chain is both homogeneous and ergodic, there exists a unique stationary distribution to which the Markov chain will converge [23].
Step 1) exhibits one stationary distribution, which we then know by 2) and 3) to be the unique stationary distribution; the Markov chain therefore converges to it.
In this section only, we use a change of notation: the training data probability distribution is written p_data(x), to help make distinctions between "natural system" probability distributions and the learned distributions. Further, note that p(z) is the prior, while the distribution required for sampling is such that:
(5) 
(6) 
Lemma 1.
is a stationary distribution for the Markov chain defined by the sampling process in (3).
For proof see Appendix.
Lemma 2.
The Markov chain defined by the transition operator (4) is ergodic, provided that the corruption process is additive Gaussian noise and that the adversarial pair is optimal within the adversarial framework.
For proof see Appendix.
Theorem 1.
Assuming that is approximately equal to , and that the adversarial pair – and – are optimal, the transition operator defines a Markov chain whose unique stationary distribution is .
This sampling method uncovers the distribution on which samples drawn from p(x|z) must be conditioned in order to sample the model. Assuming the conditions above hold, this allows us to draw samples from p(x|z).
For completeness, we would like to acknowledge that there are several other methods that use Markov chains during the training of autoencoders [2, 22] to improve performance. Our approach for synthesising samples using the DAAE is focused on sampling only from trained models; the Markov chain sampling is not used to update model parameters.
V. Implementation
The analyses of Sections III and IV are deliberately general: they do not rely on any specific implementation choice to capture the model distributions. In this section, we consider a specific implementation of denoising adversarial autoencoders and apply it to the task of learning models of image distributions. We define an encoding model that maps corrupted data samples to a latent space, and a decoding model that maps samples from the latent space to image space. These respectively draw samples according to the conditional probabilities q(z|x̃) and p(x|z). We also define a corruption process, c(x̃|x).
The parameters of the encoding and decoding models are learned under an autoencoder framework; the encoder parameters are also updated under an adversarial framework. The models are trained using large datasets of unlabelled images.
V-A. The Autoencoder
Under the autoencoder framework, q(z|x̃) is the encoder and p(x|z) is the decoder. We used fully connected neural networks for both the encoder and decoder. Rectified Linear Units (ReLUs) were used between all intermediate layers to encourage the networks to learn representations that capture multimodal distributions. In the final layer of the decoder network, a sigmoid activation function is used so that the output represents pixels of an image. The final layer of the encoder network is left as a linear layer, so that the distribution of encoded samples is not restricted.
As described in Section III-A, the autoencoder is trained to maximise the log-likelihood of the reconstructed image given the corrupted image. Although there are several ways in which one may evaluate this log-likelihood, we chose to measure pixel-wise binary cross-entropy between the reconstructed sample and the original sample before corruption. During training we aim to learn encoder and decoder parameters that minimise this binary cross-entropy. The training process is summarised in Algorithm 1 in the Appendix.
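The reconstruction objective can be sketched as pixel-wise binary cross-entropy between the reconstruction and the clean sample; the clamping epsilon is a numerical-stability choice of ours:

```python
import numpy as np

def pixelwise_bce(x_recon, x_clean, eps=1e-7):
    """Mean binary cross-entropy between reconstructed pixel values in (0, 1)
    and the original (pre-corruption) pixel values."""
    p = np.clip(x_recon, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(x_clean * np.log(p) + (1.0 - x_clean) * np.log(1.0 - p))

x = np.array([[0.0, 1.0, 1.0, 0.0]])
perfect = pixelwise_bce(x, x)                    # near zero: perfect reconstruction
chance = pixelwise_bce(np.full_like(x, 0.5), x)  # log(2): uninformative 0.5 guess
```

Because the decoder's final layer is a sigmoid, its outputs already lie in (0, 1) and can be scored directly by this loss.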
The vectors output by the encoder may take any real values; therefore, minimising reconstruction error is not sufficient to match either q(z|x) or q(z|x̃) to the prior, p(z). For this, the encoder parameters must also be updated under the adversarial framework.
V-B. Adversarial Training
To perform adversarial training we define the discriminator, described in Section II-A, to be a fully connected neural network. The output of the discriminator is a "probability" because the final layer of the network has a sigmoid activation function, constraining its range to lie between 0 and 1. Intermediate layers of the network have ReLU activation functions, to encourage the network to capture highly non-linear relations between the latent samples and the labels {'real', 'fake'}.
How adversarial training is applied depends on whether q(z|x) or q(z|x̃) is being fit to the prior, p(z). 'Fake' samples are those drawn from the distribution that we wish to fit to the prior, and 'real' samples are those drawn from the prior, p(z). The discriminator is trained to predict whether latent samples are 'real' or 'fake'. This may be achieved by learning parameters that maximise the probability of the correct labels being assigned to real and fake samples. This training procedure is shown in Algorithm 1.
Drawing 'real' samples involves sampling some prior distribution, p(z), often a Gaussian. Now, we consider how to draw 'fake' samples. How these samples are drawn depends on whether q(z|x̃) (DAAE) or q(z|x) (iDAAE) is being fit to the prior. Drawing fake samples is easy if q(z|x̃) is being matched to the prior, as these are simply obtained by mapping corrupted samples through the encoder.
However, if q(z|x) is being matched to the prior, we must use Monte-Carlo sampling to approximate fake samples (see Section III-A1). The process is given by Algorithm 2 in the Appendix and detailed in Section III-A1.
Finally, in order to match the distribution of fake samples to the prior, p(z), adversarial training is used to update the encoder parameters while holding the discriminator parameters fixed. The encoder parameters are updated to minimise the likelihood that the discriminator correctly classifies encoded samples as being 'fake'. The training procedure is laid out in Algorithm 1.

V-C. Sampling
Although the training process for matching q(z|x) to p(z) is less computationally efficient than for matching q(z|x̃), it is very easy to draw samples when q(z|x) is matched to the prior (iDAAE). We simply draw a random value from p(z) and pass it through the decoder to obtain a new sample. When drawing samples, the encoder and decoder parameters are fixed.
If q(z|x̃) is matched to the prior (DAAE), an iterative sampling process is needed in order to draw new samples from p(x|z). This sampling process is described in Section IV-B and is trivial to implement. A random sample, z_0, is drawn from any distribution; the distribution does not have to be the chosen prior, p(z). New latent samples are obtained by iteratively decoding, corrupting and encoding, such that z_{t+1} is given by: x_t ~ p(x|z_t), x̃_t ~ c(x̃|x_t), z_{t+1} ~ q(z|x̃_t).
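With toy linear stand-ins for the decoder and encoder, the decode-corrupt-encode chain of Section IV-B can be sketched as follows; the linear maps and noise level are illustrative assumptions, not the trained networks:

```python
import numpy as np

def daae_chain(z0, decode, corrupt, encode, steps, rng):
    """Return z after `steps` decode-corrupt-encode transitions of the
    Markov chain used to sample a trained DAAE (Section IV-B)."""
    z = z0
    for _ in range(steps):
        x = decode(z)            # x_t ~ p(x|z_t)
        x_tilde = corrupt(x, rng)  # x_tilde_t ~ c(x_tilde|x_t)
        z = encode(x_tilde)      # z_{t+1} ~ q(z|x_tilde_t)
    return z

rng = np.random.default_rng(0)
d_latent, d_image = 10, 64
A = rng.standard_normal((d_latent, d_image)) * 0.1  # toy decoder weights
B = rng.standard_normal((d_image, d_latent)) * 0.1  # toy encoder weights
decode = lambda z: z @ A
encode = lambda x: x @ B
corrupt = lambda x, rng: x + 0.05 * rng.standard_normal(x.shape)

z0 = rng.standard_normal(d_latent)  # the chain may start from any real vector
zT = daae_chain(z0, decode, corrupt, encode, steps=5, rng=rng)
```

With trained networks, each transition moves the latent sample towards the region on which the decoder was actually conditioned during training.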
In the following section, we evaluate the performance of denoising adversarial autoencoders on three image datasets: a handwritten character dataset (Omniglot) [17], a synthetic colour image dataset of tiny images (Sprites) [26], and a complex dataset of celebrity faces (CelebA). The denoising and non-denoising adversarial autoencoders (AAEs) are compared on tasks including reconstruction, generation and classification.
VI. Experiments & Results
VI-A. Code Available Online
We make our PyTorch [24] code available at the following link: https://github.com/ToniCreswell/pyTorch_DAAE. An older version of our code, written in Theano, is available at https://github.com/ToniCreswell/DAAE_ with our results presented in iPython notebooks. Since this is a revised version of our paper and Theano is no longer supported, our new experiments on the CelebA dataset were performed using PyTorch.

VI-B. Datasets
We evaluate our denoising adversarial autoencoder on three image datasets of varying complexity. Here, we describe the datasets and their complexity in terms of variation within the dataset, number of training examples and size of the images.
VI-B1. Datasets: Omniglot
The Omniglot dataset is a handwritten character dataset consisting of character categories drawn from different writing systems, with only a small number of examples of each character. Each example in the dataset is a binary image, with pixels taking values {0,1}. The dataset is split such that examples from most categories make up the training dataset, while one example from each of those categories makes up the testing dataset. The characters from each of the remaining categories make up the evaluation dataset. This means that experiments may be performed to reconstruct or classify samples from categories not seen during training of the autoencoders.
VI-B2. Datasets: Sprites
The Sprites dataset is made up of unique human-like characters. Each character has attributes including hair, body, armour, trousers, arm and weapon type, as well as gender. For each character there are animations, each consisting of a number of frames. There are multiple examples of each character; however, every example is in a different pose. Samples are colour images. The training, validation and test datasets are split by character, with no two sets containing the same character.
VI-B3. Datasets: CelebA
The CelebA dataset consists of a large number of colour images of faces. Though a version of the dataset with tightly cropped faces exists, we use the uncropped dataset. We hold out samples for testing and use the rest for training. Each example has a set of labelled facial attributes, for example 'No Beard', 'Blond Hair', 'Wavy Hair', etc. This face dataset is more complex than the Toronto Face Dataset used by Makhzani et al. [21] for training the AAE.
VI-C. Architecture and Training
For each dataset, we detail the architecture and training parameters of the networks used to train each of the denoising adversarial autoencoders. For each dataset, several DAAEs, iDAAEs and AAEs are trained. In order to compare models trained on the same datasets, the same network architectures, batch size, learning rate, annealing rate and size of latent code is used for each. Each set of models were trained using the same optimization algorithm. The trained AAE [21] models act as a benchmark, allowing us to compare our proposed DAAEs and iDAAEs.
VI-C1. Architecture and Training: Omniglot
The decoder, encoder and discriminator networks were fully connected. We found that deeper networks than those proposed by Makhzani et al. [21] (for the MNIST dataset) led to better convergence. The networks were trained with a batch size of 64, using the Adam [13] optimization algorithm. We used a multivariate Gaussian for the prior and additive Gaussian noise for the corruption process. When training the iDAAE, we used multiple steps of Monte-Carlo integration (see Algorithm 2 in the Appendix).

VI-C2. Architecture and Training: Sprites
Both the encoder and discriminator are fully connected neural networks. For the decoder, we used a fully connected network whose layer widths were chosen to capture the complexity of the data without overfitting. The networks were trained using the Adam [13] optimization algorithm. We used a multivariate Gaussian for the prior and additive Gaussian noise for the corruption process. The iDAAE was trained with multiple steps of Monte-Carlo integration.
VI-C3. Architecture and Training: CelebA
The encoder and decoder were constructed with convolutional layers, rather than fully connected layers, since the CelebA dataset is more complex than the Toronto Face Dataset used by Makhzani et al. [21]. The encoder and decoder consisted of convolutional layers with a similar structure to that of the DCGAN proposed by Radford et al. [25]. We used a fully connected network for the discriminator. Networks were trained using RMSprop, with momentum used when training the discriminator. We found that using smaller momentum values led to more blurred images, while larger momentum values prevented the network from converging and made training unstable. When using Adam instead of RMSprop (on the CelebA dataset specifically), we found that the values in the encodings became very large and were not consistent with the prior. We used a multivariate Gaussian for the prior and additive Gaussian noise for the corruption process. We experimented with several noise levels and found a range of values to be suitable; we fixed one value for our classification experiments and, to demonstrate the effect of sampling, used a larger value for synthesis from the DAAE. For the iDAAE, we experimented with the number of Monte-Carlo integration steps, M; too small an M was not sufficient to train an iDAAE. By comparing histograms of encoded data samples to histograms of the prior (see Figure 2), for an iDAAE trained with a particular M, we are able to see whether M is sufficiently large. We found a moderate M to be sufficient for most experiments.

VI-D. Sampling DAAEs and iDAAEs
Samples may be synthesized using the decoder of a trained iDAAE or AAE by passing latent samples drawn from the prior through the decoder. On the other hand, if we pass samples from the prior through the decoder of a trained DAAE, the samples are likely to be inconsistent with the training data. To synthesize more consistent samples using the DAAE, we draw an initial z_0 from any random distribution (we use a Gaussian distribution for simplicity, which happens to be equivalent to our choice of prior) and decode, corrupt and encode the sample several times for each synthesized sample. This process is equivalent to sampling a Markov chain, where one iteration of the chain consists of decoding, corrupting and encoding to obtain z_t after t iterations. The latent sample z_t may then be used to synthesize a novel sample, x_t, generated when z_t is passed through the decoder.
To evaluate the quality of synthesized samples, we calculated the log-likelihood of real samples under the model [21]. This is achieved by fitting a Parzen window to a number of synthesized samples. Further details of how the log-likelihood is calculated for each dataset are given in Appendix F.
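A Parzen-window estimate places an isotropic Gaussian kernel on each synthesized sample and scores real samples under the resulting mixture; a minimal sketch (in practice the bandwidth sigma would be chosen on a validation set):

```python
import numpy as np

def parzen_log_likelihood(real, generated, sigma):
    """Mean log-likelihood of `real` samples under a Gaussian Parzen window
    fitted to `generated` samples."""
    n, d = generated.shape
    # Pairwise squared distances between real and generated samples.
    diffs = real[:, None, :] - generated[None, :, :]
    sq = np.sum(diffs**2, axis=-1)
    log_kernel = -sq / (2.0 * sigma**2)
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma**2)
    # log (1/n) sum_i N(x; g_i, sigma^2 I), via log-sum-exp for stability.
    m = np.max(log_kernel, axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.sum(np.exp(log_kernel - m), axis=1))
    return np.mean(lse - log_norm)

rng = np.random.default_rng(0)
generated = rng.standard_normal((200, 2))
real = rng.standard_normal((50, 2))
score = parzen_log_likelihood(real, generated, sigma=0.5)
```

The score is higher when real samples fall near synthesized ones, which is why it serves as a (coarse) proxy for sample quality.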
We expect initial samples, x_0, drawn from the DAAE to have a lower (worse) log-likelihood than those drawn from the AAE; however, we expect Markov chain (MC) sampling to improve synthesized samples, such that x_t for t > 0 should have a larger log-likelihood than the initial samples. It is not clear whether x_t drawn using a DAAE will be better than samples drawn from an iDAAE. The purpose of these experiments is to demonstrate the challenges associated with drawing samples from denoising adversarial autoencoders, and to show that our proposed methods for sampling a DAAE and training iDAAEs allow us to address these challenges. We also hope to show that iDAAE and DAAE samples are competitive with those drawn from an AAE.
VI-D1. Sampling: Omniglot
Here, we explore the Omniglot dataset, looking at log-likelihood scores on both a testing and an evaluation dataset. Recall (Section VI-B1) that the testing dataset has samples from the same classes as the training dataset, while the evaluation dataset has samples from different classes.
First, we discuss the results on the evaluation dataset. The results, shown in Figure 3, are consistent with what is expected of the models. The iDAAE outperformed the AAE, with a less negative (better) log-likelihood. The initial samples drawn using the DAAE had more negative (worse) log-likelihood values than samples drawn using the AAE. However, after one iteration of MC sampling, the synthesized samples had less negative (better) log-likelihood values than those from the AAE. Additional iterations of MC sampling led to worse results, possibly because synthesized samples tend towards modes of the data generating distribution, appearing more like samples from classes represented in the training data.
The Omniglot testing dataset consists of one example of every category in the training dataset. This means that if multiple iterations of MC sampling cause synthesized samples to tend towards modes in the training data, the likelihood score on the testing dataset is likely to increase. The results shown in Figure 3 confirm this expectation: the log-likelihood after further MC iterations is less negative (better) than for the initial sample. These apparently conflicting results (in Figure 3), on whether sampling improves or worsens synthesized samples, highlight the challenges involved in evaluating generative models using the log-likelihood, discussed in more depth by Theis et al. [30]. For this reason, we also show qualitative results.
Figure 4(a) shows a set of initial samples drawn from a DAAE, and Figure 4(b) shows samples synthesized after several iterations of MC sampling; the latter appear well varied, capturing multiple modes of the data-generating distribution.
VI-D2 Sampling: Sprites
In alignment with expectation, the iDAAE model synthesizes samples with a higher (better) log-likelihood than the AAE. The initial image samples drawn from the DAAE model underperform compared to the AAE model; however, after just one iteration of MC sampling, the synthesized samples have a higher log-likelihood than samples from the AAE. Results also show that synthesized samples drawn using the DAAE after one iteration of MC sampling have higher likelihood than samples drawn using either the iDAAE or AAE models.
When more than one step of MC sampling is applied, the log-likelihood decreases, in a similar way to the results on the Omniglot evaluation dataset. This may be related to how the training and test data are split: each dataset has a unique set of characters, so combinations seen during training are not present in the testing dataset. These results further suggest that MC sampling pushes synthesized samples towards modes in the training data.
VI-D3 Sampling: CelebA
In Figure 5 we compare samples synthesized using an AAE to those synthesized using an iDAAE trained with Monte Carlo integration steps. Figure 5(b) shows samples drawn from the iDAAE, which improve upon those drawn from the AAE model.
In Figure 6 we show samples synthesized from a DAAE using the iterative approach described in Section IV-B. We see that the initial samples have blurry artifacts, while the final samples are sharper and free from these artifacts.
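The iterative sampling procedure above can be sketched as a short Markov chain: draw an initial latent from the prior, decode it, then repeatedly corrupt, encode and decode. The following is a minimal illustration only; the linear `encode`/`decode` functions are placeholder stand-ins for the trained networks, and the dimensions and noise level `SIGMA` are assumed values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_Z, SIGMA = 16, 4, 0.25  # assumed data/latent dims and corruption level

# Placeholder linear "networks" standing in for the trained encoder/decoder.
W_enc = rng.standard_normal((D_Z, D_X)) * 0.1
W_dec = rng.standard_normal((D_X, D_Z)) * 0.1

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

def sample_daae(n_steps):
    z = rng.standard_normal(D_Z)   # initial draw from the prior p(z)
    x = decode(z)                  # initial sample
    for _ in range(n_steps):       # Markov chain transitions
        x_tilde = x + SIGMA * rng.standard_normal(D_X)  # corruption step
        x = decode(encode(x_tilde))                     # encode-decode step
    return x

x_final = sample_daae(n_steps=5)
```

Each loop iteration is one application of the transition operator: corrupt the current sample, then reconstruct it through the encoder and decoder.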
When drawing samples from iDAAE and DAAE models trained on CelebA, a critical difference between the two models emerges: samples synthesized using a DAAE have good structure but appear quite similar to each other, while the iDAAE samples have poorer structure but exhibit much more variation. The lack of variation in DAAE samples may be related to the sampling procedure, which, according to theory presented by Alain et al. [1], is similar to taking steps toward the highest-density regions of the distribution (i.e. the modes), explaining why samples appear quite similar.
When comparing DAAE or iDAAE samples to samples from other generative models such as GANs [9], we may notice that the samples are less sharp. However, GANs often suffer from ‘mode collapse’, where all synthesized samples are very similar; the iDAAE does not suffer mode collapse and does not require any additional procedures to prevent it [27]. Further, (vanilla) GANs do not offer an encoding model. Other GAN variants such as BiGAN [6] and ALI [7] do offer encoding models, but the fidelity of their reconstructions is very poor. The AAE, DAAE and iDAAE models are able to reconstruct samples faithfully. We explore fidelity of reconstruction in the next section and compare to a state-of-the-art ALI that has been modified to have improved reconstruction fidelity, ALICE [19].
We conclude this section on sampling by making the following observations: samples synthesized using iDAAEs outperformed AAEs on all datasets. It is convenient that a relatively small number of integration steps yields improvement, as the time needed to train an iDAAE may increase linearly with the number of steps. We also observed that although initial samples synthesized using the DAAE are poor, in all cases even one iteration of MC sampling improves image synthesis.
Finally, evaluating generated samples is challenging: log-likelihood is not always reliable [30], and qualitative analysis is subjective. For this reason, we provided both quantitative and qualitative results to communicate the benefits of introducing MC sampling for a trained DAAE, and the advantages of iDAAEs over AAEs.
VI-E Reconstruction
The reconstruction task involves passing samples from the test dataset through the trained encoder and decoder to recover a sample similar to the original (uncorrupted) sample. The reconstruction is evaluated by computing the mean squared error between the reconstruction and the original sample.
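The evaluation above amounts to one encode-decode pass followed by a mean-squared-error computation. The sketch below illustrates this; the `encode`/`decode` functions and the batch of test samples are synthetic placeholders for the trained networks and the real test data.

```python
import numpy as np

rng = np.random.default_rng(1)
X_test = rng.standard_normal((8, 16))  # placeholder batch of 8 test samples

def encode(x):
    # Placeholder encoder: keep the first 4 dimensions as the "latent code".
    return x[:, :4]

def decode(z):
    # Placeholder decoder: pad the code back up to the data dimension.
    return np.hstack([z, np.zeros((z.shape[0], 12))])

X_rec = decode(encode(X_test))            # reconstructions of the test batch
mse = np.mean((X_test - X_rec) ** 2)      # reported reconstruction error
```

In the paper's setting, `X_test` is the original (uncorrupted) test data, so corruption is applied only during training, not when computing this metric.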
We are interested in reconstruction for several reasons. First, if we wish to use encodings for downstream tasks, for example classification, a good indication of whether the encoding models a sample well is to check its reconstruction. For example, if a reconstructed image is missing certain features that were present in the original image, it may be that this information is not preserved in the encoding. Second, checking sample reconstructions is also a way to evaluate whether the model has overfit: the ability to reconstruct samples not seen during training suggests that it has not. Finally, good reconstructions further motivate AAE, DAAE and iDAAE models as alternatives to GAN-based models augmented with encoders [19] for downstream tasks that require faithful sample reconstruction. We expect that adding noise during training both prevents overfitting and encourages the model to learn more robust representations; therefore we expect the DAAE and iDAAE to outperform the AAE.
VI-E1 Reconstruction: Omniglot
Table I compares reconstruction errors of the AAE, DAAE and iDAAE trained on the Omniglot dataset. The reconstruction errors for both the iDAAE and the DAAE are lower than for the AAE. The results suggest that using the denoising criterion during training helps the network learn more robust features compared to the non-denoising variant. The smallest reconstruction error was achieved by the DAAE rather than the iDAAE; qualitatively, reconstructions using the DAAE captured small details that the iDAAE lost. This is likely related to the multimodal nature of the learned posterior in the DAAE compared to its unimodal nature in an iDAAE.
VI-E2 Reconstruction: Sprites
Table I shows reconstruction error on samples from the sprite test dataset for models trained on the sprite training data. In this case only the iDAAE outperformed the AAE, while the DAAE performed as well as the AAE.
VI-E3 Reconstruction: CelebA
Table I shows reconstruction error on the CelebA dataset. We compare AAE, DAAE and iDAAE models trained with momentum, where the DAAE and iDAAE use additive corruption and the iDAAE is trained with Monte Carlo integration steps. We also experimented with other settings; however, these gave the best results. While the DAAE performs similarly to the AAE, the iDAAE outperforms both. Figure 7 shows examples of reconstructions obtained using the iDAAE. Although slightly blurred, the reconstructions are highly faithful, suggesting that facial attributes are correctly encoded by the iDAAE model.
Model        Omniglot   Sprite   MNIST*   CelebA
AAE          0.047      0.019    0.017    0.500
DAAE         0.029      0.019    0.015    0.501
iDAAE        0.031      0.018    0.018    0.495
ALICE [19]   –          –        –        0.080

*The MNIST dataset is described in the Appendix.
VI-F Classification
We are motivated to understand the properties of the representations (latent encodings) learned by the DAAE and iDAAE trained on unlabeled data. A particular property of interest is the separability, in latent space, between objects of different classes. To evaluate separability, rather than training in a semi-supervised fashion [21], we obtain class predictions by training an SVM on top of the representations, in a similar fashion to Kumar et al. [16].
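The evaluation protocol above can be sketched in a few lines: freeze the trained encoder, map the labelled data into latent space, and fit an SVM on the encodings. In this sketch the "encoder" is a random linear projection standing in for the trained network, and the data, labels and class count are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 4))            # placeholder encoder weights

X_train = rng.standard_normal((100, 16))    # synthetic "images"
y_train = rng.integers(0, 5, size=100)      # synthetic labels (5 toy classes)

Z_train = X_train @ W                       # latent encodings of the data
clf = SVC(kernel="rbf").fit(Z_train, y_train)  # SVM trained on encodings only
train_acc = clf.score(Z_train, y_train)
```

The key point is that the encoder's weights are never updated during this step; only the SVM sees the labels, so accuracy reflects how separable the unsupervised encodings are.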
VI-F1 Classification: Omniglot
Classifying samples in the Omniglot dataset is very challenging: the training and testing datasets consist of a large number of classes, with only a few examples of each class in the training dataset. The classes make up several writing systems, and symbols from different writing systems may be visually indistinguishable. Previous work has focused on classifying only a small number of classes from within a single writing system [28, 33, 8, 17]; here, we attempt classification across all classes. The Omniglot training dataset is used to train SVMs (with RBF kernels) on encodings extracted from the encoders of the trained DAAE, iDAAE and AAE models. Classification scores are reported on the Omniglot evaluation dataset (Table II).
Results show that the DAAE and iDAAE outperform the AAE on the classification task. The DAAE and iDAAE also outperform a classifier trained on encodings obtained by applying PCA to the image samples, while the AAE does not, further showing the benefits of using denoising.
We perform a separate classification task using only classes from the Omniglot evaluation dataset. This second test is performed for two key reasons: a) to study how well autoencoders trained on only a subset of classes can generalise as feature extractors for classes not seen during autoencoder training; b) to facilitate performance comparisons with previous work [8]. A linear SVM classifier is trained on the samples from each class in the evaluation dataset and tested on the remaining sample from each class. We repeat the classification, leaving out a different sample from each class in each experiment. The results are shown in Table II. For comparison, we also show classification scores when PCA is used as a feature extractor instead of a learned encoder.
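The leave-one-out protocol above can be sketched as follows. The class count, examples per class and feature dimension are toy values, and a nearest-centroid rule is used as a lightweight stand-in for the linear SVM in the paper; the loop structure (one experiment per held-out index) is the part being illustrated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_classes, n_per_class, d = 5, 4, 8            # assumed toy problem sizes
X = rng.standard_normal((n_classes, n_per_class, d))
X += np.arange(n_classes)[:, None, None]       # shift class means apart

accs = []
for held_out in range(n_per_class):            # one experiment per left-out index
    train = np.delete(X, held_out, axis=1)     # remaining samples of each class
    centroids = train.mean(axis=1)             # stand-in "classifier": centroids
    test = X[:, held_out, :]                   # one held-out sample per class
    dists = np.linalg.norm(test[:, None, :] - centroids[None, :, :], axis=2)
    pred = dists.argmin(axis=1)                # nearest-centroid prediction
    accs.append(np.mean(pred == np.arange(n_classes)))

mean_acc = float(np.mean(accs))                # averaged over all experiments
```

Each iteration holds out a different sample index from every class, mirroring the repeated leave-one-out experiments described above.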
Results show that the DAAE model outperforms the AAE model, while the iDAAE performs less well, suggesting that features learned by the DAAE transfer better to new tasks than those learned by the iDAAE. The AAE, iDAAE and DAAE models all outperform PCA.
Model           Test Acc.   Eval Acc.
AAE             18.36%      78.75%
DAAE            31.74%      83.00%
iDAAE           34.02%      78.25%
PCA             31.02%      76.75%
Random chance   0.11%       5%
VI-F2 Classification: CelebA
We perform a more extensive set of experiments to evaluate the linear separability of encodings learned on the CelebA dataset and compare to state-of-the-art methods including the VAE [14] and the β-VAE [10]. (The β-VAE weights the KL term in the VAE cost function with a factor β to encourage better organisation of the latent space, factorising the latent encoding into interpretable, independent components.)
We train a linear SVM on the encodings of a DAAE (or iDAAE) to predict labels for facial attributes, for example ‘Blond Hair’, ‘No Beard’, etc. In our experiments, we compare classification accuracy on attributes obtained using the AAE, DAAE and iDAAE to previously reported results for the VAE [14] and the β-VAE [10]; these are shown in Figure 8. The results for the VAE and β-VAE were obtained using a similar approach to ours and were reported by Kumar et al. [16]. We used the same hyperparameters to train all models and a fixed noise level; the iDAAE was trained with Monte Carlo integration. Figure 8 shows that the AAE, iDAAE and DAAE models outperform the VAE and β-VAE models on most facial attribute categories.
To compare models more easily, we ask, in the context of facial attribute classification, ‘On how many facial attributes does one model outperform another?’. We ask this question for various combinations of model pairs; the results are shown in Figure 9. Figures 9(a) and (b), comparing the AAE to the DAAE and iDAAE respectively, demonstrate that for some attributes the denoising models outperform the non-denoising model. Moreover, the particular attributes for which the DAAE and iDAAE outperform the AAE are (fairly) consistent: both outperform the AAE on the attributes ‘Attractive’, ‘Blond Hair’, ‘Wearing Hat’ and ‘Wearing Lipstick’. The iDAAE outperforms on an additional attribute, ‘Arched Eyebrows’.
There are various hyperparameters that may be chosen when training these models: for the DAAE and iDAAE we may choose the level of corruption, and for the iDAAE we may additionally choose the number of integration steps used during training. We compare attribute classification results for vastly different choices of parameter settings. The results are presented as a bar chart in Figure 10 for the DAAE; additional results for the iDAAE are shown in the Appendix (Figure 15). These figures show that the models perform well under various parameter settings. Figure 10 suggests that the model performs better with a smaller amount of noise; however, it is important to note that a large amount of noise does not ‘break’ the model. These results demonstrate that the model works well for various hyperparameters and that fine-tuning is not necessary to achieve reasonable results (compared to the VAE, for example). Further fine-tuning might achieve better results, but a full parameter sweep is highly computationally expensive.
From this section, we may conclude that on most facial attributes, AAEs and variations of AAEs are able to outperform the VAE and β-VAE on the task of facial attribute classification. This suggests that AAEs and their variants are interesting models to study in the setting of learning linearly separable encodings. We also show that for a specific set of facial attribute categories, the iDAAE or DAAE performs better than the AAE. This consistency suggests that there are some attributes that the denoising variants of the AAE learn better than the non-denoising AAE.
VI-G Tradeoffs in Performance
The results presented in this section suggest that both the DAAE and iDAAE outperform AAE models on most generation and some reconstruction tasks, and that it is therefore sometimes beneficial to incorporate denoising into the training of adversarial autoencoders. However, it is less clear which of the two new models, DAAE or iDAAE, is better for classification. When evaluating which one to use, we must consider the practicalities of training and, for generative purposes, the practicalities of sampling, primarily computational load.
The integration steps required for training an iDAAE mean that it may take longer to train than a DAAE. On the other hand, it is possible to perform the integration in parallel provided that sufficient computational resources are available. Further, once the model is trained, the time taken to compute encodings for classification is the same for both models. Finally, results suggest that using only a few integration steps during training leads to an improvement in classification score. This means that for some classification tasks, it may be worthwhile to train an iDAAE rather than a DAAE.
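The Monte Carlo integration step referred to above can be sketched as follows: the encoding of a clean sample is approximated by averaging the encoder's output over N independently corrupted copies. Because the corruptions are independent, they can be processed as a single batch, which is why the integration parallelises well. The linear encoder, dimensions, noise level and step count below are assumed placeholder values.

```python
import numpy as np

rng = np.random.default_rng(4)
D_X, D_Z, SIGMA, N = 16, 4, 0.25, 5     # assumed dims, noise level, step count

W_enc = rng.standard_normal((D_Z, D_X)) * 0.1  # placeholder encoder weights
x = rng.standard_normal(D_X)                   # one clean training sample

# Draw N corrupted copies in one batch, encode them all, and average:
x_tilde = x[None, :] + SIGMA * rng.standard_normal((N, D_X))
z_hat = (x_tilde @ W_enc.T).mean(axis=0)       # MC estimate of E[enc(x~)]
```

The batched draw of `x_tilde` is the parallelisable part: all N corruptions pass through the encoder in a single matrix multiplication.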
For generative tasks, neither the DAAE nor the iDAAE consistently outperforms the other in terms of log-likelihood of synthesized samples, so the choice of model may be more strongly affected by the computational effort required during training or sampling. Even a small number of integration steps when training an iDAAE leads to better-quality generated images, and similarly even one step of MC sampling with a DAAE improves generations.
Conflicting log-likelihood values of generated samples between testing and evaluation datasets mean that these measurements are not a clear indication of how the number of sampling iterations affects the visual quality of samples synthesized using a DAAE. In some cases it may be necessary to inspect samples visually in order to assess the effects of multiple sampling iterations (Figure 4).
VII Conclusion
We propose two types of denoising autoencoders, where a posterior is shaped to match a prior using adversarial training. In the first, we match the posterior conditional on corrupted data samples to the prior; we call this model a DAAE. In the second, we match the posterior, conditional on original data samples, to the prior. We call the second model an integrating DAAE, or iDAAE, because the approach involves using Monte Carlo integration during training.
Our first contribution is the extension of adversarial autoencoders (AAEs) to denoising adversarial autoencoders (DAAEs and iDAAEs). Our second contribution is identifying and addressing challenges related to synthesizing data samples using DAAE models. We propose synthesizing data samples by iteratively sampling a DAAE according to an MC transition operator, defined by the learned encoder and decoder of the DAAE model and the corruption process used during training.
Finally, we present results on three datasets, for three tasks that compare both DAAE and iDAAE to AAE models. The datasets are: handwritten characters (Omniglot [17]), a collection of human-like sprite characters (Sprites [26]) and a dataset of faces (CelebA [20]). The tasks are reconstruction, classification and sample synthesis.
Acknowledgment
We acknowledge the Engineering and Physical Sciences Research Council for funding through a Doctoral Training studentship. We would also like to thank Kai Arulkumaran for interesting discussions and managing the cluster on which many experiments were performed. We also acknowledge Nick Pawlowski and Martin Rajchl for additional help with clusters.
References

[1] G. Alain and Y. Bengio. What regularized autoencoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
 [2] P. Bachman and D. Precup. Variational generative stochastic networks with collaborative shaping. In Proceedings of the 32nd International Conference on Machine Learning, pages 1964–1972, 2015.
 [3] Y. Bengio. Learning deep architectures for AI. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
 [4] Y. Bengio, E. ThibodeauLaufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In Journal of Machine Learning Research: Proceedings of the 31st International Conference on Machine Learning, volume 32, 2014.
 [5] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising autoencoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013.
 [6] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [7] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [8] H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 [9] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [10] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
 [11] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[12] D. J. Im, S. Ahn, R. Memisevic, Y. Bengio, et al. Denoising criterion for variational auto-encoding framework. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.
 [13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 2015 International Conference on Learning Representations (ICLR 2015), 2014.
 [14] D. P. Kingma and M. Welling. Autoencoding variational Bayes. In Proceedings of the 2015 International Conference on Learning Representations (ICLR2015), 2014.
 [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [16] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
 [17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[18] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
 [19] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. Neural Information Processing Systems (NIPS), 2017.

[20] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [21] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [22] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
 [23] H. U. of Technology. Markov chains and stochastic sampling. [Chapter 1: pg5: Theorem (Markov Chain Convergence)].
 [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [25] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR) 2016, 2015.
 [26] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogymaking. In Advances in Neural Information Processing Systems, pages 1252–1260, 2015.
 [27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
 [28] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Oneshot learning with memoryaugmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
 [29] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958–962. Citeseer, 2003.
 [30] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. In Proceedings of the International Conference of Learning Representations, 2015.
 [31] P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
 [32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
 [33] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
Appendix A Proofs
Lemma 1.
is a stationary distribution for the Markov chain defined by the sampling process in (3).
Proof.
Lemma 2.
The Markov chain defined by the transition operator (4) is ergodic, provided that the corruption process is additive Gaussian noise and that the adversarial pair are optimal within the adversarial framework.
Proof.
Consider , and , where and :

Assuming that is a good approximation of the underlying probability distribution, , then s.t. .

Assuming that adversarial training has shaped the distribution of to match the prior, , then s.t. . This holds because if not all points in could be visited, would not have matched the prior.
1) suggests that every point in may be reached from a point in and 2) suggests that every point in may be reached from a point in . Under the assumption that is an additive Gaussian corruption process then is likely to lie within a (hyper) spherical region around . If the corruption process is sufficiently large such that (hyper) spheres of nearby samples overlap, for an and a set such that, and where is the support. Then, it is possible to reach any from any (including the case ). Therefore, the chain is both irreducible and positive recurrent.
To be ergodic the chain must also be aperiodic: between any two points and , there is a boundary, where values between and the boundary are mapped to , and points between and the boundary are mapped to . By applying the corruption process to , followed by the reconstruction process, there are always at least two possible outcomes, because we assume that all (hyper)spheres induced by the corruption process overlap with at least one other (hyper)sphere: either is not pushed over the boundary and remains the same, or is pushed over the boundary and moves to a new state. The probability of either outcome is positive, so there is always more than one route between two points, thus avoiding periodicity provided that for , . Consider the case where for , ; then it would not be possible to recover both and using , and so if then (and vice versa), which contradicts 1).
∎
Appendix B Examples Of Synthesised Samples
Appendix C Algorithm For Training the iDAAE
Appendix D Algorithm For Monte Carlo Integration
Appendix E MNIST Results
E-A Datasets: MNIST
The MNIST dataset consists of greyscale images of handwritten digits between 0 and 9, split into training, validation and testing sets. The training, validation and testing datasets have an equal number of samples from each category. The samples are 28 by 28 pixels. The MNIST dataset is a simple dataset with few classes and many training examples, making it a good dataset for proof-of-concept experiments. However, because the dataset is very simple, it does not necessarily reveal the effects of subtle, but potentially important, changes to algorithms for training or sampling. For this reason, we consider two datasets with greater complexity.
E-B Architecture and Training: MNIST
For the MNIST dataset, we train the models detailed in the table below. The encoder, decoder and discriminator networks each have two fully connected layers. For most models, the prior distribution that the encoding is matched to is a 10-D Gaussian. All networks are trained on the training dataset with a fixed learning rate and batch size. The standard deviation of the additive Gaussian noise used during training is 0.5 for all iDAAE and DAAE models.
An additional DAAE model is trained using the same training parameters and networks as described above, but with a mixture of Gaussians for the prior. Each Gaussian, with fixed standard deviation, is equally spaced with its mean around a circle. This results in a prior with modes separated from each other by large regions of very low probability. This model of the prior is very unrealistic, as it assumes that MNIST digits occupy distinct regions of image probability space. In reality, we may expect two similar digits to lie side by side in image probability space, and for there to exist a smooth transition between handwritten digits. Consequently, this model is intended specifically for evaluating how well the posterior distribution over latent space may be matched to a mixture of Gaussians, something that could not easily be achieved by a VAE [14].
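The circular mixture-of-Gaussians prior described above can be sketched as follows. The component count `K` is taken from the 10-GMM entry in the table below, while the radius and per-component standard deviation are assumed illustrative values, not the paper's settings; the latent space is assumed 2-D so that the circle arrangement is meaningful.

```python
import numpy as np

rng = np.random.default_rng(5)
K, RADIUS, STD = 10, 5.0, 0.5                   # assumed prior parameters

# Means equally spaced around a circle in the 2-D latent space.
angles = 2 * np.pi * np.arange(K) / K
means = RADIUS * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (K, 2)

def sample_prior(n):
    comps = rng.integers(0, K, size=n)          # pick a mode uniformly
    return means[comps] + STD * rng.standard_normal((n, 2))

z = sample_prior(1000)
```

Because `STD` is small relative to the spacing of the means, the resulting prior has well-separated modes with low-probability regions between them, matching the description above.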
ID   Model   Corruption   Prior
1    AAE     0.0          10-D Gaussian
2    DAAE    0.5          10-D Gaussian
3    DAAE    0.5          10-GMM
4    iDAAE   0.5