1 Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have emerged in recent years as a powerful framework for training generative models. GANs consist of two competing (adversarial) networks: a generative model that tries to capture the distribution of a given dataset in order to map from an arbitrary latent space (usually drawn from a multivariate Gaussian) to new synthetic data points, and a discriminative model that tries to distinguish between samples from the generator and the true data. Iterative training of both models ideally results in a discriminator capturing features from the true data that the generator does not synthesize, while the generator learns to include these features in the generative process, until real and synthesized data are no longer distinguishable.
Experiments by Radford et al. (2015) showed that a GAN can learn a rich representation of the data in the latent space, in which interpolations produce semantic variations and shifts in certain directions correspond to variations of specific features of the generated data. However, due to the lack of an inverse mapping from data to the latent space, GANs cannot be used to encode individual data points in the latent space (Donahue et al., 2016).
Moreover, although GANs show promising results in various tasks, such as the generation of realistic looking images (Radford et al., 2015; Berthelot et al., 2017; Sønderby et al., 2016) or 3D objects (Wu et al., 2016), training a GAN in the aforementioned ideal way is difficult to set up and sensitive to hyperparameter selection (Salimans et al., 2016). Additionally, GANs tend to restrict themselves to generating only a few major modes of the true data distribution, since such a so-called mode collapse is not penalized in the GAN objective, while it results in more realistic samples from these modes (Che et al., 2016). Hence, the majority of the latent space only maps to a few regions in the target space, resulting in a poor representation of the true data.
We propose a novel GAN framework, Invariant-Encoding Generative Adversarial Networks (IVEGAN), which extends the classical GAN architecture by an additional encoding unit to map samples from the true data to the latent space (compare Fig. 1). To encourage the encoder to learn a rich representation of the data in the latent space, the discriminator is asked to distinguish between different predefined transformations of the input sample and generated samples by taking the original input into account as a condition. While the discriminator has to learn what the different variations have in common with the original input, the encoder is forced to encode the necessary information in the latent space so that the generator can fool the discriminator by generating samples which are similar to the original samples. Since the discriminator is invariant to the predefined transformations, the encoder can ignore these variations in the input space and learn a rich representation of the data that is invariant to such transformations. The variations of the generated samples are modeled by an additional latent vector (drawn from a multivariate Gaussian). Thus, the encoded samples condition the generator. Moreover, since the discriminator learns to distinguish between generated samples and variations of the original for each individual sample in the data, the latent space cannot collapse to a few modes of the true data distribution: it is easy for the discriminator to distinguish generated samples from original ones if a mode is not covered by the generator. Thus, the proposed IVEGAN learns a rich representation of a dataset that is invariant to certain transformations and naturally encourages the generator to cover all modes of the data distribution. To generate novel samples, the generator can be fed with an arbitrary latent representation.
In summary, we make the following contributions:

We derive a novel GAN framework for learning rich and transformation-invariant representations of the data in the latent space.

We show that our GANs reproduce samples from a data distribution without mode collapsing issues.

We demonstrate robust GAN training across various data sets and showcase that our GAN produces very realistic samples.
2 Related Work
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a framework for training generative models. It is based on a minimax game of two competing (adversarial) networks. The generator $G$ tries to map an arbitrary latent space $z \sim p_z(z)$ (usually drawn from a multivariate Gaussian) to new synthetic data points by training a generator distribution $p_G(x)$ that matches the true data distribution $p_{\text{data}}(x)$. The training is performed by letting the generator compete against the second network, the discriminator $D$. The discriminator aims at distinguishing between samples from the generator distribution $p_G(x)$ and real data points from $p_{\text{data}}(x)$ by assigning a probability $D(x) \in [0, 1]$. The formal definition of the objective of this minimax game is given by:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$
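As an illustration of the classical GAN value function of Goodfellow et al. (2014), $V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, the following minimal sketch estimates it by Monte Carlo sampling. The one-dimensional toy data, the hand-written `generator`, and the two discriminators `informed_d` and `blind_d` are assumptions made for illustration only; they are not the networks used in this work.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gan_value(discriminator, generator, real_samples, latent_samples):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    eps = 1e-12  # guards against log(0)
    real_term = sum(math.log(discriminator(x) + eps) for x in real_samples)
    fake_term = sum(
        math.log(1.0 - discriminator(generator(z)) + eps) for z in latent_samples
    )
    return real_term / len(real_samples) + fake_term / len(latent_samples)


rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]    # toy "true" data around +2
latent = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # latent samples z


def generator(z):
    # Hypothetical poor generator: its samples lie around -2.
    return 0.5 * z - 2.0


def informed_d(x):
    # Hypothetical discriminator that separates the two regions well.
    return sigmoid(3.0 * x)


def blind_d(x):
    # Discriminator that cannot tell real from generated samples.
    return 0.5


value_informed = gan_value(informed_d, generator, real, latent)
value_blind = gan_value(blind_d, generator, real, latent)
```

The discriminator maximizes this value and the generator minimizes it. A discriminator that always outputs 0.5 yields $2 \log 0.5$, which is the value reached when real and generated samples are indistinguishable.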
However, training a GAN on this objective usually results in a generator distribution where large volumes of probability mass collapse onto a few major modes of the true data generating distribution (Che et al., 2016). This issue, often called mode collapsing, has been the subject of several recent publications proposing adjusted objectives that reward a model for higher variety in data generation.
A straightforward approach to control the generated modes of a GAN is to condition it on additional information. Conditional Generative Adversarial Nets (Mirza & Osindero, 2014) utilize additional information $y$, such as class labels, to direct the data generation process. The conditioning is done by additionally feeding the information into both generator and discriminator. The objective function Eq. (1) becomes:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \quad (2)$$
Obviously, such a conditioning is only possible if additional information is provided.
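One common way to feed the additional information into both networks is to append a one-hot encoded class label to their inputs. The sketch below shows this under that assumption; the helper names `one_hot` and `with_condition` are hypothetical, and Mirza & Osindero (2014) may combine inputs and labels differently.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def with_condition(vector, label, num_classes):
    """Append the one-hot label to a latent vector or a flattened sample."""
    return list(vector) + one_hot(label, num_classes)


latent = [0.3, -1.2, 0.7]      # latent vector, input of the generator
sample = [0.0, 0.5, 1.0, 0.5]  # flattened data point, input of the discriminator

generator_input = with_condition(latent, label=2, num_classes=4)
discriminator_input = with_condition(sample, label=2, num_classes=4)
```

Both networks see the same label, so the discriminator can reject a sample that is realistic but belongs to the wrong class.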
Che et al. (2016) proposed a framework which adds two regularizers to the classical GAN. The metric regularizer trains, in addition to the generator, an encoder and includes in the training an objective that penalizes the distance between a sample and its reconstruction. This additional objective forces the generated modes closer to modes of the true data. As distance measure they proposed e.g. the pixelwise distance or the distance of features learned by the discriminator (Dumoulin et al., 2016). To encourage the generator to also target minor modes in the proximity of major modes, the objective is extended by a mode regularizer.
Another approach proposed to address the mode collapsing issue is the unrolled GAN (Metz et al., 2016). In practice GANs are trained by simultaneously updating $G$ and $D$ in Eq. (1), since explicitly updating $G$ for the optimal $D$ at every step is computationally infeasible. This leads to an update for the generator which basically ignores the max-operation in the calculation of the gradients and ultimately encourages mode collapsing. The idea behind unrolled GANs is to update the generator by backpropagating through the gradient updates of the discriminator for a fixed number of steps. This leads to a better approximation of Eq. (1) and reduces mode collapsing.
Another way to avoid mode collapse is the Coulomb GAN, which models the GAN learning problem as a potential field with a unique, globally optimal Nash equilibrium (Unterthiner et al., 2017).
Work has also been done aiming at introducing an inverse mapping from data to latent space. Bidirectional Generative Adversarial Networks (BiGANs) (Donahue et al., 2016) and Adversarially Learned Inference (Dumoulin et al., 2016) are two frameworks based on the same idea of extending the GAN framework by an additional encoder $E$ and training the discriminator to distinguish joint samples in the data space ($x$ versus $G(z)$) as well as in the latent space ($E(x)$ versus $z$). Although the inverse mapping is never explicitly computed, the authors proved that in an ideal case the encoder learns to invert the generator almost everywhere ($E = G^{-1}$) to fool the discriminator.
However, upon visual inspection of their reported results (Dumoulin et al., 2016), it appears that the similarity of the original to the reconstructions is rather vague, especially in the case of relatively complex data such as CelebA (compare Appendix A). It seems that the encoder concentrates mostly on prominent features such as gender, age, and hair color, but misses the more subtle traits of the face.
3 Proposed Method
Consider a subset $S$ of the domain $X$ that is setwise invariant under a transformation $T: X \to X$, so that $x \in S \Rightarrow T(x) \in S$. We can utilize different elements $x \in S$ to train a discriminator on learning the underlying concept of $S$ by discriminating samples $x \in S$ and samples $x \notin S$. In an adversarial procedure we can then train a generator on producing samples $x \in S$.
$S$ could be e.g. a set of higher-level features that are invariant under certain transformations. An example of a dataset with such features is natural images of faces. High-level features like facial parts that are critical to classify and distinguish between different faces are invariant e.g. to small local shifts or rotations of the image.
We propose an approach to learn a mapping from the data to the latent space by utilizing such invariant features of the data. In contrast to the previously described related methods, we learn the mapping from the data to the latent space not by discriminating the representations $E(x)$ and $z$, but by discriminating generated samples conditioned on encoded original samples from transformations of an original sample, taking the original sample into account as additional information. In order to fool the discriminator, the encoder has to extract enough features from the original sample so that the generator can reconstruct samples which are similar to the original one, apart from the variations introduced by the transformations. The discriminator, on the other hand, has to learn which features samples have in common with the original sample in order to discriminate variations of the original samples from generated samples. To fool a perfect discriminator, the encoder has to extract all individual features from the original sample so that the generator can produce perfect variants of the original.
To generate novel samples, we can draw the latent representation from a prior distribution. To learn a smooth representation of the data, we also include such generated samples and train an additional discriminator on discriminating them from true data, as in the classical GAN objective Eq. (1).
Hence, the objective for the IVEGAN is defined as a minimax game:
(3)  
One thing to consider here is that by introducing an invariance in the discriminator with respect to the transformations, the generator is no longer trying to exactly match the true data generating distribution but rather the distribution of the transformed true data. Thus, one has to choose the transformations carefully, ideally only those that describe variations already present in the data and that do not affect features of the data that are of interest for the representation.
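The two discrimination tasks described in this section can be sketched as a single value function. This is only one form consistent with the prose, assuming log-likelihood terms analogous to Eq. (1); it is not a reproduction of the IVEGAN objective. All names (`d_cond`, `d_plain`, `sample_noise`, `sample_latent`) and the one-dimensional toy setup are hypothetical.

```python
import math
import random


def ive_gan_value(x_batch, transform, encoder, generator,
                  d_cond, d_plain, sample_noise, sample_latent):
    """Average value of a batch under the assumed four-term objective."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x in x_batch:
        # Conditional discriminator: a transformed original is "real" given x.
        total += math.log(d_cond(transform(x), x) + eps)
        # Conditional discriminator: a reconstruction from E(x) is "fake" given x.
        reconstruction = generator(sample_noise(), encoder(x))
        total += math.log(1.0 - d_cond(reconstruction, x) + eps)
        # Additional discriminator: classical GAN terms on novel samples.
        total += math.log(d_plain(x) + eps)
        novel = generator(sample_noise(), sample_latent())
        total += math.log(1.0 - d_plain(novel) + eps)
    return total / len(x_batch)


rng = random.Random(0)
batch = [rng.gauss(0.0, 1.0) for _ in range(64)]

value = ive_gan_value(
    batch,
    transform=lambda x: x + rng.uniform(-0.1, 0.1),  # small shift as invariance
    encoder=lambda x: x,                             # toy encoder: identity
    generator=lambda noise, z: z + 0.1 * noise,      # toy generator
    d_cond=lambda candidate, original: 0.5,          # undecided discriminators
    d_plain=lambda sample: 0.5,
    sample_noise=lambda: rng.gauss(0.0, 1.0),
    sample_latent=lambda: rng.uniform(-1.0, 1.0),
)
```

In this form the discriminators would maximize the value while encoder and generator minimize it. With both discriminators undecided (output 0.5 everywhere), each of the four terms contributes $\log 0.5$.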
4 Experiments and Results
To evaluate the IVEGAN with respect to quality of the generated samples and learned representations we perform experiments on 3 datasets: a synthetic dataset, the MNIST dataset and the CelebA dataset.
4.1 Synthetic Dataset
A synthetic dataset with a known distribution is a good way to evaluate how well a generative model reproduces samples from a data distribution without missing modes, i.e. to check whether the model suffers from mode collapsing (Metz et al., 2016).
Following Metz et al. (2016), we evaluate our model on a synthetic dataset of a 2D mixture of eight Gaussians with a common covariance matrix and means arranged on a ring. As invariant transformation we define small shifts in both dimensions:
(4) 
so that the discriminator becomes invariant to the exact position within the eight Gaussians.
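The dataset and the shift transformation can be sketched as follows. The ring radius, the standard deviation of the Gaussians and the scale of the shifts are not given in this text; the values `radius=2.0`, `std=0.02` and `scale=0.02` below are assumptions chosen for illustration, as is the use of Gaussian-distributed shifts.

```python
import math
import random


def ring_means(num_modes=8, radius=2.0):
    """Means of the mixture components, evenly spaced on a ring."""
    return [
        (radius * math.cos(2.0 * math.pi * k / num_modes),
         radius * math.sin(2.0 * math.pi * k / num_modes))
        for k in range(num_modes)
    ]


def sample_ring_mixture(n, rng, num_modes=8, radius=2.0, std=0.02):
    """Draw n points from an equally weighted mixture of isotropic Gaussians."""
    means = ring_means(num_modes, radius)
    points = []
    for _ in range(n):
        mean_x, mean_y = means[rng.randrange(num_modes)]
        points.append((rng.gauss(mean_x, std), rng.gauss(mean_y, std)))
    return points


def small_shift(point, rng, scale=0.02):
    """Assumed transformation: a small random shift in both dimensions."""
    x, y = point
    return (x + rng.gauss(0.0, scale), y + rng.gauss(0.0, scale))


def nearest_mode(point, means):
    """Index of the mixture mean closest to the given point."""
    return min(
        range(len(means)),
        key=lambda k: (point[0] - means[k][0]) ** 2 + (point[1] - means[k][1]) ** 2,
    )


rng = random.Random(0)
samples = sample_ring_mixture(2000, rng)
shifted = [small_shift(p, rng) for p in samples]
```

Because the shifts are small compared to the distance between neighbouring means, a shifted point stays within the same Gaussian, which is what makes the transformation a sensible invariance for this dataset. Counting `nearest_mode` over generated samples is one simple way to check mode coverage.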
Fig. 2 shows the distribution of the generated samples of the model over time compared to the true data generating distribution.
The IVEGAN learns a generator distribution which converges to all modes of the true data generating distribution while distributing its probability mass equally over all modes.
4.2 MNIST
As a next step to increase complexity we evaluate our model on the MNIST dataset. As invariant transformations we define small random shifts (up to 4 pixels) in both the width and height dimensions and small random rotations.
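The shift part of this transformation can be sketched in plain Python as below. Filling vacated pixels with zeros and drawing integer offsets uniformly are assumptions, and the rotation is omitted because its maximum angle is not given in this text; the function names are hypothetical.

```python
import random


def shift_image(image, dx, dy, fill=0.0):
    """Shift a 2D image (list of rows) by dx columns and dy rows, zero-filled."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            new_row, new_col = row + dy, col + dx
            if 0 <= new_row < height and 0 <= new_col < width:
                shifted[new_row][new_col] = image[row][col]
    return shifted


def random_shift(image, rng, max_shift=4):
    """Shift by up to max_shift pixels in each dimension."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift_image(image, dx, dy)


# A 28x28 toy image with a single bright pixel in the centre.
image = [[0.0] * 28 for _ in range(28)]
image[14][14] = 1.0

moved = shift_image(image, dx=2, dy=-3)
rng = random.Random(0)
augmented = random_shift(image, rng)
```

Each call to `random_shift` yields a different variation of the same digit, which serves as a "real" example for the conditional discriminator.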
Fig. 3 shows novel generated samples from the IVEGAN trained on the MNIST dataset as a result of randomly sampling the latent representation from a uniform distribution.
Fig. 4 shows, for different samples from the MNIST dataset, the generated reconstructions for a model with a 16-dimensional as well as a 3-dimensional latent space. As one might expect, the model with higher capacity produces images of more similar style to the originals and makes fewer errors in reproducing digits of unusual style. However, 3 dimensions still provide enough capacity for the IVEGAN to store enough information about the original image to reproduce the right digit class in most cases. Thus, the IVEGAN is able to learn a rich representation of the MNIST dataset in 3 dimensions by utilizing class-invariant transformations. Fig. 5 shows the learned representation of the MNIST dataset in 3 dimensions without using further dimensionality reduction methods. We observe distinct clusters for the different digit classes.
4.3 CelebA
As a last experiment we evaluate the proposed method on the more complex CelebA dataset (Liu et al., 2015), using centrally cropped images. As in the case of MNIST, we define the invariant transformation as small random shifts (here up to 20 pixels) in both the width and height dimensions as well as small random rotations. Additionally, the transformation performs random horizontal flips and small random variations in brightness, contrast and hue.
Fig. 6 shows, for different images from the CelebA dataset, some of the random transformations and some of the reconstructed images generated with random noise. The reconstructed images show clear similarity to the original images not only in prominent features but also in subtle facial traits.
Fig. 7 shows novel generated samples from the IVEGAN trained on the CelebA dataset as a result of randomly sampling the latent representation from a uniform distribution. To illustrate the influence of the noise component, the generation was performed with the same five noise components for each image. We observe that altering the noise component induces a similar relative transformation in each image.
To visualize the learned representation of the trained IVEGAN we encode 10,000 samples from the CelebA dataset into the 1024-dimensional latent space and project them into two dimensions using t-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten & Hinton, 2008).
Fig. 8(a) shows this projection of the latent space with example images for some high-density regions. Since the CelebA dataset comes with labels, we can evaluate the representation with respect to its ability to cluster images with the same features.
Fig. 8(b) to Fig. 8(e) show the t-SNE embedding of both the latent representation and the original images for a selection of features. Observing the visualization of the latent space, we can make out distinct clusters of images sharing similar style and features. Images that are close together in the latent space share similar visual attributes. It is noteworthy that even images of people wearing normal eyeglasses are separated from images of people wearing sunglasses. By comparing the embedding of the learned representation with the embedding of the original images, we observe a clear advantage of the representation learned by the IVEGAN in terms of clustering images with the same features.
We also evaluate whether smooth interpolation in the learned feature space can be performed. Since we have a method at hand that can map arbitrary images from the dataset into the latent space, we can also interpolate between arbitrary images of this dataset. This is in contrast to Radford et al. (2015), who could only show such interpolations between generated images.
Fig. 9 shows generated images based on interpolation in the latent representation between two original images. The intermediate images, generated from the interpolated latent representations, are visually appealing and show a smooth transformation between the images. This finding indicates that the IVEGAN learns a smooth latent space and can generate new realistic images not only from latent representations of training samples.
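The interpolation step itself can be sketched as below, assuming linear interpolation between the two latent codes (the interpolation scheme is not specified in this text). The two short example vectors stand in for the encodings of two original images; in the actual pipeline each interpolated code would be passed to the generator.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent codes on the straight line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same dimension")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical encodings of two images (the real codes are 1024-dimensional).
z_a = [0.0, 1.0, -1.0]
z_b = [1.0, 0.0, 1.0]

path = interpolate(z_a, z_b, steps=5)
```

The first and last codes equal the encodings of the two original images, so the generated sequence starts and ends at reconstructions of real data rather than at purely sampled points.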
5 Conclusion
In this work we proposed a novel GAN framework that includes an encoding unit which maps data to a latent representation by utilizing features in the data that are invariant to certain transformations. We evaluated the proposed model on three different datasets and showed that the IVEGAN can generate visually appealing images of high variance while learning a rich representation of the dataset that also covers subtle features.
Acknowledgments
We thank Floriane Montanari, Joren Retel, Jay Vala, Roland Vollgraf, Martin Heusel, Thomas Unterthiner and Sepp Hochreiter for their helpful discussions and comments on this work.
References
Berthelot et al. (2017) David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
Che et al. (2016) Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.
Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
Maaten & Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
Metz et al. (2016) Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
Mirza & Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 2234–2242. Curran Associates, Inc., 2016.
Sønderby et al. (2016) Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
Unterthiner et al. (2017) Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, Martin Heusel, Hubert Ramsauer, and Sepp Hochreiter. Coulomb GANs: Provably optimal Nash equilibria via potential fields. CoRR, abs/1708.08819, 2017. URL http://arxiv.org/abs/1708.08819.
Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 82–90. Curran Associates, Inc., 2016.
Appendix A Generated Images by ADVERSARIALLY LEARNED INFERENCE (ALI)
Appendix B Network Architecture and Hyperparameters
Unit  Operation  Num Neurons  Activation  

Dense  128  Tanh  
Dense  2  Tanh  
Concatenate and  
Dense  128  Tanh  
Dense  2  Tanh  
Concatenate and respectively  
Dense  128  Tanh  
Dense  1  Tanh  
Dense  128  Tanh  
Dense  1  Tanh  
dimensions  4  
dimensions  3  
Optimizer  Adam (, , , )  
Optimizer  Adam (, , , )  
Batch size  1024  
Epochs  50000  
LReLU slope  0.2  
Weight initialization  LReLUlayer: He, else: Xavier  
Bias initialization  Constant zero 
Network architecture and hyperparameters for the synthetic dataset experiment.
Unit  Operation  Kernel  Strides  Num Filter  BN?  Activation 

Conv  5x5  2x2  64  ✓  LReLU  
Conv  3x3  2x2  128  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  2x2  1x1  256  ✓  LReLU  
Flatten  
Dense  1024  ✗  Tanh  
Concatenate and  
Dense  4096  ✓  LReLU  
Reshape to [Batch Size, 4, 4, 256]  
Conv transposed  3x3  2x2  256  ✓  LReLU  
Conv transposed  3x3  2x2  128  ✓  LReLU  
Conv transposed  5x5  2x2  64  ✗  LReLU  
Conv transposed  5x5  1x1  1  ✗  Sigmoid  
Conv  5x5  2x2  64  ✓  LReLU  
Conv  3x3  2x2  128  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  2x2  1x1  256  ✓  LReLU  
Flatten  
Dense  3  ✗  LReLU  
Concatenate and  
Dense  128  ✗  LReLU  
Dense  64  ✗  LReLU  
Dense  16  ✗  LReLU  
Dense  1  ✗  Linear  
Dense  64  ✗  LReLU  
Dense  1  ✗  Linear  
dimensions  4  
dimensions  3  
Optimizer  Adam (, , , )  
Optimizer  Adam (, , , )  
Batch size  512  
Epochs  100  
LReLU slope  0.2  
Weight initialization  LReLUlayer: He, else: Xavier  
Bias initialization  Constant zero 
Network architecture and hyperparameters for the MNIST experiment.
Unit  Operation  Kernel  Strides  Num Filter  BN?  Activation 

Conv  5x5  2x2  128  ✓  LReLU  
Conv  5x5  2x2  128  ✓  LReLU  
Conv  5x5  2x2  256  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  3x3  2x2  512  ✓  LReLU  
Conv  3x3  2x2  512  ✓  LReLU  
Conv  2x2  1x1  1024  ✓  LReLU  
Flatten  
Dense  1024  ✗  Tanh  
Concatenate and  
Dense  4096  ✓  LReLU  
Reshape to [Batch Size, 2, 2, 1024]  
Conv transposed  2x2  2x2  512  ✓  LReLU  
Conv transposed  2x2  1x1  512  ✓  LReLU  
Conv transposed  3x3  2x2  256  ✓  LReLU  
Conv transposed  3x3  1x1  256  ✓  LReLU  
Conv transposed  3x3  2x2  256  ✓  LReLU  
Conv transposed  3x3  1x1  256  ✓  LReLU  
Conv transposed  5x5  2x2  128  ✓  LReLU  
Conv transposed  5x5  1x1  128  ✓  LReLU  
Conv transposed  5x5  2x2  128  ✓  LReLU  
Conv transposed  5x5  1x1  128  ✗  LReLU  
Conv transposed  5x5  2x2  64  ✗  LReLU  
Conv transposed  5x5  1x1  3  ✗  Sigmoid  
Conv  5x5  2x2  128  ✓  LReLU  
Conv  5x5  2x2  128  ✓  LReLU  
Conv  5x5  2x2  256  ✓  LReLU  
Conv  3x3  2x2  256  ✓  LReLU  
Conv  3x3  2x2  512  ✓  LReLU  
Conv  3x3  2x2  512  ✓  LReLU  
Conv  2x2  1x1  1024  ✓  LReLU  
Flatten  
Dense  1024  ✗  LReLU  
Concatenate and  
Dense  1024  ✗  LReLU  
Dense  512  ✗  LReLU  
Dense  128  ✗  LReLU  
Dense  1  ✗  Linear  
Dense  128  ✗  LReLU  
Dense  1  ✗  Linear  
dimensions  16  
dimensions  1024  
Optimizer  Adam (, , , )  
Optimizer  Adam (, , , )  
Batch size  64  
Epochs  16  
LReLU slope  0.2  
Weight initialization  LReLUlayer: He, else: Xavier  
Bias initialization  Constant zero 
Network architecture and hyperparameters for the CelebA experiment.
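As a consistency check on the last table above (the one with a 3-channel output and 1024 latent dimensions), the spatial sizes can be traced through the layers. The sketch assumes "same" padding, so that a convolution with stride s maps a side length n to ceil(n / s) and a transposed convolution maps n to n * s. The padding mode is not stated in the table, and the helper names are hypothetical.

```python
import math


def conv_out(size, stride):
    """Side length after a convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def conv_transposed_out(size, stride):
    """Side length after a transposed convolution with 'same' padding (assumed)."""
    return size * stride


# Strides of the twelve transposed convolutions, read off the table.
generator_strides = [2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
size = 2  # after "Reshape to [Batch Size, 2, 2, 1024]"
for stride in generator_strides:
    size = conv_transposed_out(size, stride)
generator_output_size = size

# Strides of the six strided convolutions at the start of the encoder.
encoder_strides = [2, 2, 2, 2, 2, 2]
size = generator_output_size
for stride in encoder_strides:
    size = conv_out(size, stride)
encoder_feature_size = size

# The dense layer before the reshape must provide 2 * 2 * 1024 units.
dense_units = 2 * 2 * 1024
```

Under this assumption the generator produces images with a side length of 128 pixels, the six strided encoder convolutions reduce such an image back to a side length of 2, and the 4096 units of the dense layer match the reshape to 2 x 2 x 1024.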