Training Generative Reversible Networks

06/05/2018 · Robin Tibor Schirrmeister, et al. · Universitätsklinikum Freiburg

Generative models with an encoding component, such as autoencoders, currently receive great interest. However, training of autoencoders is typically complicated by the need to train a separate encoder and decoder model that have to be enforced to be reciprocal to each other. To overcome this problem, by-design reversible neural networks (RevNets) have previously been used as generative models, either directly optimizing the likelihood of the data under the model or using an adversarial approach on the generated data. Here, we instead investigate their performance using an adversary on the latent space in the adversarial autoencoder framework. We investigate the generative performance of RevNets on the CelebA dataset, showing that generative RevNets can generate coherent faces of similar quality to Variational Autoencoders. This first attempt to use RevNets inside the adversarial autoencoder framework slightly underperformed relative to recent advanced generative models using an autoencoder component on CelebA, but this gap may diminish with further optimization of the training setup of generative RevNets. In addition to the experiments on CelebA, we show a proof-of-principle experiment on the MNIST dataset suggesting that adversary-free trained RevNets can discover meaningful latent dimensions without pre-specifying the number of dimensions of the latent sampling distribution. In summary, this study shows that RevNets can be employed in different generative training settings.


1 Introduction

Generative models that include an encoder-decoder architecture have several appealing properties. For example, they tend to be more stable to train Tolstikhin et al. (2018) and can potentially be used for classification in a semi-supervised fashion Makhzani et al. (2016). However, a drawback of recent generative models with an encoder-decoder architecture is the requirement to train two separate models, including the need to ensure the encoder and the decoder are reciprocal. While autoencoders with tied weights can at least overcome the problem of training two separate models, recent approaches applying autoencoders to realistic image datasets such as CelebA Liu et al. (2015) use separate decoder and encoder models Donahue et al. (2017); Dumoulin et al. (2016); Tolstikhin et al. (2018).

An interesting alternative to separate encoder-decoder architectures could be models that are invertible by design. Recently, such invertible-by-design neural networks, called reversible neural networks (RevNets), were proposed. Initially, they were used as generative models Dinh et al. (2015, 2017), later as classification models with smaller memory requirements Gomez et al. (2017), and finally to study theoretical assumptions about learning and generalization in deep neural networks Jacobsen et al. (2018). For example, their good classification performance showed that loss of information about the input in later representations of a neural network is not a necessary precondition for good generalization.

In their application as generative models, reversible networks have been trained in two ways. In earlier works, they were trained using the change-of-variables formula to directly optimize the likelihood of the data under the reversible network model Dinh et al. (2015, 2017). Later, they were trained using an adversarial approach on the generated samples Danihelka et al. (2017); Grover et al. (2018), as in generative adversarial networks Goodfellow et al. (2014). In this study, we instead investigate their performance when using an adversary in the latent space in an adversarial autoencoder framework Makhzani et al. (2016). In general, the RevNet's built-in bijectivity could either be an advantage or a disadvantage when optimizing in this framework. For example, the bijectivity prevents one from hand-designing the value range of the generated samples, as is sometimes done using a sigmoid nonlinearity as the final operation on the decoder output.

We indeed find it is possible to use RevNets as generative models in the adversarial autoencoder framework, producing samples of comparable quality to variational autoencoders (VAEs) Kingma & Welling (2014) on the CelebA dataset Liu et al. (2015). Furthermore, in an attempt to exploit the direct correspondence between encodings and inputs in a RevNet, we provide a proof of concept for adversary-free training without a prespecified number of latent dimensions on the MNIST dataset.

2 Background

2.1 Reversible Networks

Figure 1: Reversible block. Functions $F$ and $G$ process the block inputs $x_1$ and $x_2$ (top), which can be recovered from the outputs $y_1$ and $y_2$ (bottom). See text for details.

Reversible networks (RevNets) are neural networks that are invertible by design Dinh et al. (2015); Gomez et al. (2017); Jacobsen et al. (2018) through the use of invertible blocks. The basic invertible block is defined for an input $x$, split into disjoint parts $x_1$ and $x_2$, and two functions $F$ and $G$ whose output size equals their input size, as follows (also see Figure 1):

$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)$   (1)

Inputs $x_1$ and $x_2$ can be recovered from the outputs $y_1$ and $y_2$ as follows:

$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2)$   (2)

$F$ and $G$ will typically be sequences of convolutional or other neural network layers. The splitting of the input $x$ into the disjoint parts $x_1$ and $x_2$ is often implemented along the channel dimension of the network.
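To make the block structure concrete, the following is a minimal PyTorch sketch of such a reversible block, assuming the additive coupling of Eqs. (1) and (2); the class and variable names are illustrative and do not correspond to the architecture used in our experiments.

```python
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Additive reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, F, G):
        super().__init__()
        self.F = F
        self.G = G

    def forward(self, x1, x2):   # Eq. (1)
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):   # Eq. (2)
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2


# The split into x1/x2 is done along the channel dimension.
block = ReversibleBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1))
x1, x2 = torch.randn(4, 16, 32, 32).chunk(2, dim=1)
y1, y2 = block(x1, x2)
x1_rec, x2_rec = block.inverse(y1, y2)
assert torch.allclose(x1, x1_rec, atol=1e-5) and torch.allclose(x2, x2_rec, atol=1e-5)
```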

Figure 2: Subsampling step. Our subsampling operation applied twice to a 4x4 input. Upper and lower rectangles on middle and right represent the different streams inside the reversible net (i.e., $x_1$ and $x_2$). On the right, individual squares represent individual channels, so each channel has a single value at the end. Note that at the end, both streams have access to pixels that cover the entire 4x4 input.

One important addition to the invertible network architecture is invertible subsampling blocks, which were introduced to make RevNets end-to-end invertible Dinh et al. (2015); Jacobsen et al. (2018). Invertible subsampling is possible by shifting spatial dimensions into the channel dimensions. Basically, for a 2x2 subsampling, 4 translated spatial checkerboard patterns of the input are moved into four different channels, as seen in Figure 2. Our subsampling operation is a slightly modified version of the operation proposed in earlier work Dinh et al. (2015); Jacobsen et al. (2018) that ensures the final $x_1$ and $x_2$ correspond to checkerboard patterns covering the entire input image, as indicated in Figure 2. This was motivated by our observation that early on in the training, the values of the RevNet encodings are still strongly influenced by the values at the input positions they correspond to, as shown in Figure 3. Therefore, having both $F$ and $G$ see inputs that cover the entire image might make it easier for them to correctly predict what will be added to their output, easing the generative training.
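For illustration, the sketch below implements the standard invertible 2x2 subsampling, moving the four translated checkerboard sub-grids of each channel into four new channels, together with its exact inverse; our slightly modified reordering (which makes the final $x_1$/$x_2$ streams cover the whole image) is not reproduced here.

```python
import torch


def squeeze_2x2(x):
    # x: (batch, channels, height, width) with even height and width.
    b, c, h, w = x.shape
    parts = [x[:, :, 0::2, 0::2], x[:, :, 0::2, 1::2],
             x[:, :, 1::2, 0::2], x[:, :, 1::2, 1::2]]
    return torch.stack(parts, dim=2).reshape(b, 4 * c, h // 2, w // 2)


def unsqueeze_2x2(y):
    # Exact inverse of squeeze_2x2.
    b, c4, h, w = y.shape
    y = y.reshape(b, c4 // 4, 4, h, w)
    x = torch.zeros(b, c4 // 4, 2 * h, 2 * w, dtype=y.dtype, device=y.device)
    x[:, :, 0::2, 0::2] = y[:, :, 0]
    x[:, :, 0::2, 1::2] = y[:, :, 1]
    x[:, :, 1::2, 0::2] = y[:, :, 2]
    x[:, :, 1::2, 1::2] = y[:, :, 3]
    return x


x = torch.randn(2, 3, 4, 4)
assert torch.allclose(unsqueeze_2x2(squeeze_2x2(x)), x)
```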

Figure 3: Artifacts in earlier phases of the training. Generated samples on CelebA of an unconverged RevNet with green and purple pixel artifacts. Artifacts are caused by some latent dimensions still strongly influencing the input dimensions they correspond to, as explained in the text.

2.2 Adversarial Autoencoders

In the adversarial autoencoder framework, the encoder and decoder are trained together to minimize a reconstruction loss on the decoded inputs and an adversarial loss on the encodings. For the reconstruction loss, the encoder and decoder are optimized to minimize a reconstruction error of the inputs, $\mathbb{E}_{x \sim p_X}\big[c\big(x, \mathrm{Dec}(\mathrm{Enc}(x))\big)\big]$, where $x$ is an input, $p_X$ the input distribution, $\mathrm{Dec}$ the decoder (generator), $\mathrm{Enc}$ the encoder, and $c$ a reconstruction loss such as the L1 or L2 loss.

Since RevNets are invertible by construction, we propose to use a single RevNet to instantiate both the encoder and the decoder, leading to a reconstruction loss of zero by design, regardless of the weights of the RevNet. In practice, we aim for a lower-dimensional latent space that still yields good reconstructions. To obtain this, in the reconstruction training phase, we clip the encodings produced by the RevNet according to a prior distribution that sets most encoding dimensions to zero; we then invert the clipped encodings through the RevNet and optimize the L1 reconstruction loss between the original inputs and the inverted clipped encodings, which has been reported to work better than the L2 loss in natural image settings Isola et al. (2017); Ulyanov et al. (2017). We also penalize the L2-distance between the encodings and the clipped encodings, as we found this to greatly stabilize this training phase.[1]

[1] In practice, since we wanted to keep the option of using a uniform distribution on ±2 as the latent sampling distribution, we also clipped the nonzero dimensions to ±2 for both the L1 and L2 losses, but we do not expect this to strongly influence the results.

While in principle the reconstruction phase is not even necessary for reversible networks, we still found it useful as a first phase that allows the network to arrange the inputs usefully in the encoding space before the adversarial loss is optimized. We keep applying this reconstruction loss in the next phase of the training, where we add an adversarial loss.
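A condensed sketch of this reconstruction-phase objective is given below, assuming hypothetical revnet(x) / revnet.inverse(z) methods operating on flattened encodings and a 0/1 mask keep marking the latent dimensions kept nonzero; the clipping value and loss weighting are illustrative.

```python
import torch


def reconstruction_phase_loss(revnet, x, keep, clip_val=2.0, l2_weight=1.0):
    """L1 reconstruction from clipped encodings plus an L2 penalty keeping the
    encodings close to their clipped version (hypothetical RevNet API)."""
    z = revnet(x)                                    # encodings, shape (batch, dims)
    z_clipped = z.clamp(-clip_val, clip_val) * keep  # zero out unused dims, clip the rest
    x_rec = revnet.inverse(z_clipped)                # invert the clipped encodings
    l1_rec = (x - x_rec).abs().mean()
    l2_enc = ((z - z_clipped) ** 2).mean()
    return l1_rec + l2_weight * l2_enc
```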

For the adversarial loss, a discriminator network is trained to distinguish the distribution of the encoder outputs from a prior distribution. The encoder tries to fool the discriminator by making the encoder outputs indistinguishable from samples of the prior distribution. The adversarial game can be set up with a variety of loss functions; we choose the adversarial hinge loss as advocated for use in Generative Adversarial Networks (GANs) Goodfellow et al. (2014) by Lim & Ye (2017):

$L_D = \mathbb{E}_{z \sim p_Z}\big[\max(0,\, 1 - D(z))\big] + \mathbb{E}_{x \sim p_X}\big[\max(0,\, 1 + D(\mathrm{Enc}(x)))\big]$
$L_{\mathrm{Enc}} = -\,\mathbb{E}_{x \sim p_X}\big[D(\mathrm{Enc}(x))\big]$   (3)

with $L_D$ and $L_{\mathrm{Enc}}$ the losses for the discriminator $D$ and the encoder $\mathrm{Enc}$ (in our case the RevNet), respectively, and $p_Z$ the prior distribution. In our setting, we only apply the discriminator to the nonzero dimensions of the prior distribution, while penalizing the remaining dimensions through the L1 and L2 losses on the clipped encodings as explained before. This should greatly simplify the adversarial training, as it makes the problem substantially lower-dimensional (e.g., 64 dimensions vs. 12288 dimensions in the case of a 64-dimensional prior and 64x64 RGB images, which will be our setting on the CelebA Liu et al. (2015) dataset, as explained in the experiments section).
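In code, the two hinge losses of Eq. (3) can be written as follows (a sketch; D is the discriminator applied only to the kept latent dimensions, z_prior are prior samples, and z_enc are the RevNet encodings of real inputs):

```python
import torch


def discriminator_hinge_loss(D, z_prior, z_enc):
    # Push scores for prior samples above +1 and scores for encodings below -1.
    return torch.relu(1.0 - D(z_prior)).mean() + torch.relu(1.0 + D(z_enc)).mean()


def encoder_hinge_loss(D, z_enc):
    # The encoder (the RevNet) tries to raise the discriminator's score on its encodings.
    return -D(z_enc).mean()
```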

We also make use of the recently proposed spectral normalization for the discriminator Miyato et al. (2018). Spectral normalization rescales each weight matrix $W$ of the discriminator to unit spectral norm:

$\bar{W}_{\mathrm{SN}}(W) = W / \sigma(W), \qquad \sigma(W) = \max_{h \neq 0} \|Wh\|_2 / \|h\|_2$   (4)

where $\sigma(W)$ denotes the spectral norm of $W$, which is also equivalent to the largest singular value of $W$. Spectral normalization was designed to regularize the Lipschitz norm of the discriminator network to stabilize the training Miyato et al. (2018). In practice, spectral normalization can be computed efficiently using the power iteration method, using only a single iteration per forward pass; we defer to Miyato et al. (2018) for details.
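The sketch below illustrates this normalization with a single power-iteration step; in practice one would simply wrap the discriminator's layers with torch.nn.utils.spectral_norm, which implements the same procedure.

```python
import torch


def spectrally_normalized(W, u, n_iterations=1, eps=1e-12):
    """Return W / sigma(W), with sigma(W) estimated by power iteration.

    W: (out_features, in_features) weight matrix of a discriminator layer.
    u: running estimate of the leading left singular vector, shape (out_features,).
    """
    for _ in range(n_iterations):
        v = torch.nn.functional.normalize(W.t() @ u, dim=0, eps=eps)
        u = torch.nn.functional.normalize(W @ v, dim=0, eps=eps)
    sigma = u @ W @ v  # approximately the largest singular value of W
    return W / sigma, u
```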

2.3 Optimal Transport

Optimal transport distances measure the distance between two distributions as the minimal cost of morphing one distribution into the other. This can be visualized as the transport of sand when imagining both distributions to be piles of sand Peyré & Cuturi (2018). Formally, the optimal transport distance is defined for two distributions $P$ and $Q$ as:

$W_c(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\big[c(x, y)\big]$   (5)

where $c(x, y)$ is a user-defined cost/distance function and $\gamma$ is a coupling distribution whose probabilities specify how much probability mass is moved from each point $x$ to each point $y$. To ensure that this coupling correctly distributes all the mass from one distribution to the other, it must come from the set $\Gamma(P, Q)$ of all joint distributions of $(x, y)$ with marginals $P$ and $Q$, respectively. For two empirical distributions with the same number of samples, the optimal transport distance is equivalent to the pairing that minimizes the average distance between the pairs.
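As a concrete example of the empirical case, the exact optimal transport cost between two equally sized samples can be computed with the Python Optimal Transport (POT) package used later in our MNIST experiments; the distributions below are only placeholders.

```python
import numpy as np
import ot  # Python Optimal Transport (POT) package

rng = np.random.default_rng(0)
xs = rng.normal(size=(128, 64))               # empirical sample from P
xt = rng.normal(loc=1.0, size=(128, 64))      # empirical sample from Q

cost = ot.dist(xs, xt, metric="euclidean")    # pairwise Euclidean cost matrix
weights = np.ones(128) / 128                  # uniform weights on both samples
distance = ot.emd2(weights, weights, cost)    # exact optimal transport cost
print(distance)
```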

Optimal transport distances have recently seen increasing use and interest in the field of generative models, especially due to their ability to compare distributions with disjoint support. As such, they have been used in different ways to train GANs Arjovsky et al. (2017); Salimans et al. (2018). For a more thorough overview of optimal transport and its applications, we highly recommend Peyré & Cuturi (2018). In this study, we make more direct use of optimal transport distances in our experiments on the MNIST dataset Lecun et al. (1998).

Finally, we note that theoretical analysis using optimal transport distances has recently generalized the adversarial autoencoder framework into the Wasserstein Autoencoder framework Tolstikhin et al. (2018). This analysis showed that any method that matches the latent sampling distribution and the encoding distribution of the real inputs can minimize an arbitrary optimal transport distance (the chosen reconstruction loss) between the distribution of generated inputs and the real input distribution. More precisely, for a given decoder, a given latent sampling distribution and a given distance, the optimal transport distance is equivalent to the minimum expected encoder-decoder reconstruction distance over all such encoders whose encoding distribution of the real inputs is identical to the latent sampling distribution. We defer to Tolstikhin et al. (2018) for more details.

For our RevNets, inverting unclipped encodings should result in exactly the inputs that produced the encodings. Therefore, the distribution of generated samples would be identical to the real input distribution if the latent sampling distribution exactly matched the encoding distribution of the real inputs produced by the RevNet. Nevertheless, as the encodings never exactly match the imposed prior distribution, it remains important that the encoding distances remain meaningful throughout the training, which we found to be much more the case when using the initial reconstruction phase described earlier.

2.4 Fréchet Inception Distance

The Fréchet Inception Distance (FID) has been proposed as a measure for evaluating the quality of generated samples for a specific dataset Heusel et al. (2017). It is the optimal L2-transport distance between features of the ImageNet-pretrained Inception network computed on the given dataset and on a set of generated samples, under the assumption that both feature distributions are Gaussian. The Gaussianity assumption makes it possible to compute the optimal transport distance directly from the means and covariance matrices. The FID has been advocated as the automatically computable measure best correlated with human notions of sample quality among those proposed so far Heusel et al. (2017); Lucic et al. (2017), although alternatives that avoid the Gaussianity assumption have recently been proposed Bińkowski et al. (2018).
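Under the Gaussian assumption, the FID reduces to the closed-form Fréchet distance between two Gaussians; a minimal sketch is shown below. The Inception feature extraction itself is omitted, and mu/sigma denote the feature means and covariances of the real and generated samples.

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```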

3 Experiments

3.1 CelebA

We run our generative reversible network on the CelebA dataset Liu et al. (2015), a widely used dataset for evaluating autoencoders. We crop and downsample the images to 64x64 pixels, as is common practice, using the same code as in Tolstikhin et al. (2018) (see https://github.com/tolstikhin/wae/blob/a1fdf24066b83665feffbcf18298cd605658e33d/datahandler.py#L188-L208). Our RevNet architecture uses 11 reversible function blocks and 6 reversible subsampling steps with 60 million parameters and is shown in Figure 4.

Figure 4: RevNet architecture and F/G functions. On the left, the RevNet architecture used on CelebA. On the right, our F/G function inside the reversible blocks. The RevNet functions both as the encoder (from top to bottom) and, through its inverse, as the decoder (from bottom to top). Numbers after each RevBlock indicate the intermediate number of channels. For the F/G function, the first number indicates the number of input channels of the function and the second the aforementioned intermediate number of convolutional filters/channels.

For the discriminator, we use a fully connected network with 2 hidden layers of 400 and 800 units, respectively. The first layer uses concatenated ReLUs (CReLU) Shang et al. (2016) and the second layer regular ReLUs Glorot et al. (2011) as nonlinearities. We chose concatenated ReLUs for the first layer as we observed in preliminary experiments that they help the discriminator produce more useful gradients when the encodings are too concentrated around the mean of the prior distribution. We apply spectral normalization to the discriminator using 1 power iteration per forward pass, as described in Miyato et al. (2018).

The prior distribution is a 64-dimensional standard-normal distribution. The 64 dimensions are the output dimensions with the highest standard deviations of the outputs of the untrained RevNet on the dataset. For the optimization, we follow Heusel et al. (2017) in employing different learning rates for the generative RevNet and the discriminator, using Adam with separate learning rates for the two networks and the same momentum parameters for both. These settings were chosen to be identical to a fairly recent successful GAN setting Zhang et al. (2018). Code for reproducing these experiments will be released upon publication.

3.1.1 Results

Our generative reversible network generates globally coherent faces, as seen in Figure 5. The generated faces are fairly blurry, which is also reflected in an FID score close to those reported for VAEs and higher than for other autoencoders in an adversarial framework (see Table 1). Reconstructions from the restricted latent space again show that the RevNet preserves some global attributes while losing detail (Figure 6). Reconstructions from the unrestricted outputs show that the RevNet does not suffer from any numerical instabilities that could be visually perceived in the reconstructions (Figure 6). Numerical analysis of the reconstruction losses confirms this, with a very small mean L1 error on the entire CelebA dataset for our trained RevNet. Interpolations in latent space show coherent interpolated faces when staying in the latent space restricted to the nonzero dimensions of the prior (Figure 7). Interpolations in the full latent space, while having more detail, also show unrealistic artifacts in some cases (Figure 8). Samples generated by varying five dimensions of the prior latent distribution show that the latent dimensions seem to encode combinations of semantically meaningful attributes such as smiling vs. non-smiling, hair color, background and gender (Figure 9).

Finally, we observe that the training is very stable: we reran the experiment 4 times using the same model pretrained in the reconstruction phase but varying the order of examples and the seeds for initializing the adversary parameters, and obtained very similar learning curves (Figure 10). Due to time constraints, we were not yet able to rerun the reconstruction phase with different seeds, but based on preliminary experiments we expect similar results in that case as well.


Model                      FID
Variational Autoencoder     63
WAE-MMD                     55
WAE-GAN                     42
RevNet                      65

Table 1: Fréchet Inception Distance on CelebA, estimated from 10000 samples to be consistent with Tolstikhin et al. (2018); lower is better.
Figure 5: Generated samples on CelebA. Samples are for the most part globally coherent, lacking some details.
Figure 6: Reconstructions on CelebA. Top row: original, middle row: reconstruction from latent space restricted to the prior distribution, bottom row: reconstruction from full latent space. Reconstructions from restricted latent space are somewhat blurry and lose detail, reconstructions from full latent space show that numerical errors do not lead to visible image changes.
Figure 7: Interpolations on CelebA in restricted latent space. Images obtained by interpolating between encodings of two inputs in the encoding space restricted to the nonzero dimensions of the latent sampling distribution. Intermediate images clearly resemble human faces.
Figure 8: Interpolations on CelebA in full latent space. Intermediate images have more details, however show clear unnatural artifact patterns.
Figure 9: Samples when varying five randomly chosen dimensions of the latent prior. Top to bottom: Different latent dimensions. Left to right: Varying the corresponding latent dimension from -3 to +3 standard deviations around the mean. Different latent dimensions seem to encode different combinations of attributes such as smiling vs nonsmiling, hair color, gender, background color, etc.
Figure 10: Learning curves for the Fréchet Inception Distance. Epoch refers to one training epoch passing over the whole dataset. Colors indicate different runs using a different order of examples and different seeds for initializing the adversary. Curves are very similar, showing very stable training. Note that these FID scores are computed using the PyTorch Inception model and differ from the corresponding scores computed with Tensorflow (hence the discrepancy with Table 1).

3.2 MNIST

In a second experiment, we attempted to answer two questions. First, can generative reversible networks be trained using optimal transport without an adversary? The question of adversary-free training, or alternatively training with an adversary limited to computing an adversarial kernel function, continues to attract considerable interest due to the often difficult training dynamics of generative adversarial networks Bińkowski et al. (2018); Tolstikhin et al. (2018); Rubenstein et al. (2018). Second, is it possible to avoid prespecifying the latent dimensionality? This question is interesting because too large a latent dimensionality might make the matching impossible Makhzani et al. (2016); Tolstikhin et al. (2018); Rubenstein et al. (2018), while too small a latent dimensionality might make the network unable to model some variation in the generated samples that it could otherwise retain (see Rubenstein et al. (2018) for a more thorough discussion of these effects).

We chose the MNIST dataset as this experiment should mainly serve as a proof of principle, not as a judgment of the quality of this approach compared to more established approaches for optimizing generative models. To this end, we considered that a simpler dataset such as MNIST, with fewer factors of variation, could yield more helpful insights for a first attempt.

Concretely, we train the RevNet to match class-conditional latent distributions on its outputs while optimizing the parameters of these distributions at the same time, as follows. We first define the class-conditional latent distributions as uncorrelated Gaussian distributions and set their means and standard deviations to the corresponding means and standard deviations of the encodings of the untrained RevNet. Then, for each minibatch, we compute the optimal transport distance between the encodings of that minibatch and a same-size sample from the latent distribution, using Euclidean distances as the cost function and solving the transport problem exactly with the algorithm from Bonneel et al. (2011) (we use the code from the Python Optimal Transport library, https://github.com/rflamary/POT/blob/81b2796226f3abde29fc024752728444da77509a/ot/lp/__init__.py#L19). The optimal transport distance is then used as a loss for both the RevNet and the means and standard deviations of the class-conditional latent distributions. While the optimal transport distance is known to have biased gradients Bellemare et al. (2017); Salimans et al. (2018), we still find it to work well enough on MNIST for reasonable per-class batch sizes. Besides ensuring a low optimal transport distance between the encodings and the sampling distributions, we must prevent the RevNet from "hiding" information in encoding dimensions with small standard deviations, so that these transport distances in encoding space remain meaningful and the training stays stable. For that, we propose a simple perturbation loss that penalizes the reconstruction error after applying a small perturbation, sampled from a Gaussian distribution, to the encodings. Concretely, we penalize:

$L_{\mathrm{perturb}} = \mathbb{E}_{x \sim p_X,\, \epsilon \sim \mathcal{N}(0,\, \sigma^2 I)}\big[\, \| f^{-1}(f(x) + \epsilon) - x \|_1 \,\big]$   (6)

where $f$ and $f^{-1}$ are the forward and inverse functions of the RevNet, respectively. The perturbation loss should also prevent the RevNet and the latent distributions from shrinking their standard deviations too much, which would otherwise cause very unstable training, as we have also observed in practice.
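A rough per-class, per-minibatch sketch of this objective is given below, combining the exact minibatch optimal transport loss with the perturbation loss of Eq. (6); revnet(x)/revnet.inverse(z), mu, and log_std are hypothetical names for the RevNet interface and the trainable parameters of one class-conditional Gaussian.

```python
import numpy as np
import ot
import torch


def per_class_loss(revnet, x, mu, log_std, sigma_perturb=0.1):
    z = revnet(x)                                        # encodings, shape (batch, dims)
    z_prior = mu + log_std.exp() * torch.randn_like(z)   # same-size sample from N(mu, std^2)
    cost = torch.cdist(z, z_prior, p=2)                  # Euclidean cost matrix
    weights = np.ones(len(z)) / len(z)                   # uniform weights on both samples
    plan = ot.emd(weights, weights, cost.detach().cpu().numpy())  # exact transport plan
    plan = torch.as_tensor(plan, dtype=cost.dtype, device=cost.device)
    ot_loss = (plan * cost).sum()                        # gradients flow through the cost matrix
    # Perturbation loss of Eq. (6): reconstruct after perturbing the encodings.
    x_rec = revnet.inverse(z + sigma_perturb * torch.randn_like(z))
    perturb_loss = (x - x_rec).abs().mean()
    return ot_loss + perturb_loss
```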

Figure 11: Samples on MNIST. Samples show realistic digits, are somewhat blurry and lack some diversity.
Figure 12: Samples when varying three dimensions of the latent prior. Varying the three latent dimensions with the largest standard deviations of the averaged per-class standard deviations. The three dimensions seem to roughly correspond to tilt, thickness and size respectively.

3.2.1 Results

The RevNet ends up using only 3-4 dimensions per class to encode the digits, with several of the dimensions shared between the classes. While the samples are somewhat blurry and lack diversity (see Figure 11), interestingly, some of the used dimensions with the largest standard deviations encode semantically identical features for the different digits, as shown in Figure 12. This indicates that the RevNet has learned an encoding that keeps class-independent factors such as thickness or tilt in the same encoding dimensions, despite having the freedom to use completely different dimensions for the different classes.

4 Discussion

Overall, we have shown for the first time that reversible neural networks can be used inside the adversarial autoencoder framework, yielding globally coherent generated faces on CelebA. While they still underperform relative to recent advanced generative autoencoder models on that dataset according to the Fréchet Inception Distance, the performance gap might be due to hyperparameters or architecture design choices, which have not been explored for RevNets prior to this work and are known to strongly affect generative model results Lucic et al. (2017). Closing the performance gap through automated search for architectures and hyperparameters could therefore be an interesting next step. This could also include other forms of matching the distributions, such as maximum mean discrepancy Gretton et al. (2012); Li et al. (2017); Tolstikhin et al. (2018) or sliced Wasserstein distances Kolouri et al. (2018).

Furthermore, the previous maximum-likelihood and input-adversarial methods used to train invertible networks in a generative setting Dinh et al. (2015, 2017); Danihelka et al. (2017); Grover et al. (2018) could be compared more directly to the adversarial autoencoder method from this study. The generated samples of the maximum-likelihood approach on CelebA in Dinh et al. (2017) feature more details, but also more unnatural artifacts. Attributing these differences to the training procedure or the model architecture could meaningfully extend prior work comparing maximum-likelihood and input-adversarial training of generative RevNets Danihelka et al. (2017). For the input-adversarial approach, one could also combine it with our proposal to use only a subset of the full latent sampling dimensionality. For this combination, it might be insightful to study the resulting encodings of the real inputs, especially in terms of what is modelled in the encoding space outside of the used sampling distribution, similar to our reconstructions from restricted and full latent space. Finally, one could compare the performance of RevNets in the adversarial autoencoder framework to approaches that use more traditional non-invertible autoencoders Donahue et al. (2017); Dumoulin et al. (2016); Ulyanov et al. (2017).

Later works on generative invertible networks used a hierarchical ordering of the latent sampling dimensions (see Dinh et al. (2017) for details). This might be worth exploring further. First, one might study this idea in combination with the adversarial autoencoder framework employed in this study. Second, the model architecture and hierarchical latent dimension ordering to enable high-quality generative modelling could be further optimized. Third, one might try to combine this idea with the progressive training of generative models as in Karras et al. (2018).

Our experiment on MNIST indicates that, for simple datasets, an adversary-free approach that does not need a prespecified latent dimensionality can result in meaningful encoding dimensions. This might be interesting for other work investigating the effect of latent dimensionality and intrinsic dimensionality on generative models Rubenstein et al. (2018). However, even for MNIST, the results are somewhat underwhelming with regard to the diversity of the generated samples. Still, we hope our results inspire further investigations into how to properly achieve the goals of meaningful encoding dimensions, a small distance between the encodings of the real inputs and the sampling distribution, and realistic generated samples.

Additionally, the excellent performance of reversible networks on supervised classification tasks Jacobsen et al. (2018) makes it attractive to investigate their use in semi-supervised classification settings where adversarial autoencoders have already shown good results Makhzani et al. (2016).

Acknowledgements

This work was supported by the BrainLinks-BrainTools Cluster of Excellence (DFG grant EXC 1086) and by the Federal Ministry of Education and Research (BMBF, grant Motor-BIC 13GW0053D).

References