
LatentGAN Autoencoder: Learning Disentangled Latent Distribution

by Sanket Kalwar, et al.

In an autoencoder, the encoder generally approximates the latent distribution over the dataset, and the decoder generates samples from this learned latent distribution. There is very little control over the latent vector, as using a random latent vector for generation will lead to trivial outputs. This work addresses this issue by using a LatentGAN generator to directly learn to approximate the latent distribution of the autoencoder, and shows meaningful results on the MNIST, 3D Chair, and CelebA datasets. An additional information-theoretic constraint is used, which successfully learns to control the autoencoder latent distribution. With this, our model also achieves an error rate of 2.38 on MNIST unsupervised image classification, which is better than that of InfoGAN and AAE.






Introduction

Generative models like GANs Goodfellow et al. (2014) and VAEs Kingma and Welling (2014) have shown remarkable progress in recent years. Generative adversarial networks have achieved state-of-the-art performance in a variety of tasks such as image-to-image translation Isola et al. (2018), video prediction Liang et al. (2017), text-to-image translation Zhang et al. (2017), drug discovery Hong et al. (2019), and privacy preservation Shi et al. (2018). VAEs have achieved state-of-the-art performance in tasks such as image generation Gregor et al. (2015), semi-supervised learning Maaløe et al. (2016), and interpolating between sentences Bowman et al. (2016). A VAE approximates its latent distribution by the ELBO method and uses a Gaussian or uniform distribution as the marginal latent distribution, which leads to a large reconstruction error. The adversarial autoencoder (AAE) Makhzani et al. (2016) uses adversarial training to train the autoencoder, and LatentGAN is directly inspired by the AAE training strategy. β-VAE Burgess et al. (2018) can be used for learning disentangled representations, but it requires a hyperparameter search, and a large β would lead to a large reconstruction error. InfoGAN Chen et al. (2016) has been shown to learn disentangled representations in a GAN, by maximizing the mutual information between a subset of the latent code and the generated sample, estimated with a recognition network. Learning disentangled representations and understanding generative factors might help in a variety of tasks and domains Bengio et al. (2013); Burgess et al. (2018). A disentangled representation can be defined as one in which single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Disentangled representations could boost the performance of state-of-the-art AI approaches in situations where they still struggle but where humans excel Lake et al. (2016).

In this paper, we address the issue of learning disentangled control by learning the latent distribution of the autoencoder directly using the LatentGAN generator, together with a LatentGAN discriminator that tries to discriminate whether a sample belongs to the real or the fake latent distribution. Additionally, we use a mutual-information objective directly inspired by InfoGAN to learn disentangled representations.

Related Work

In this work, we present a new way to learn control over the autoencoder latent distribution, with the help of AAE Makhzani et al. (2016), which approximates the posterior of the latent distribution of the autoencoder using an arbitrary prior distribution, and of Chen et al. (2016) for learning disentangled representations. Previous work by Wang et al. (2019) used a similar method of learning the latent prior using an AAE, along with a perceptual loss and an information-maximization regularizer to train the decoder with the help of an extra discriminator.


In this work, we are trying to learn control over the autoencoder latent distribution, and in doing so our contributions are the following:

  • We have shown that it is possible to approximate the autoencoder latent distribution directly using the LatentGAN generator, without using an extra discriminator as in Wang et al. (2019).

  • We are also able to learn disentangled control over the autoencoder latent distribution by using the same LatentGAN discriminator, and we show results in the experiments section.

In doing so, we also obtain a 2.38 error rate on MNIST unsupervised classification, which is far lower than that of InfoGAN and the adversarial autoencoder.


Method

We train the autoencoder (encoder E and decoder De) and the LatentGAN generator G and discriminator D simultaneously; this helps the generator compete with the discriminator. The autoencoder training objective is as follows:

L_AE = E_{x ~ p_data(x)}[ ||x - De(E(x))||_2^2 ]
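As a concrete illustration, the reconstruction objective can be sketched in NumPy (a minimal sketch; here x and the reconstruction De(E(x)) are assumed to be given as flattened arrays):

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Mean squared reconstruction error, E_x[ ||x - De(E(x))||_2^2 ],
    with images flattened to rows of a 2-D array."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
```

In practice this quantity is minimized with respect to both the encoder and decoder parameters.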

LatentGAN tries to model the latent distribution q(z), where z is the latent embedding of the autoencoder, and z_g = G(z_n, c) denotes the generated latent, with the generator input formed by concatenating Gaussian noise z_n and the latent code c, as shown in Figure 1. The LatentGAN training objective is as follows:

min_G max_D V(D, G) = E_{z ~ q(z)}[log D(z)] + E_{z_n, c}[log(1 - D(G(z_n, c)))]
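In practice this minimax objective reduces to binary cross-entropy terms for D and G; a minimal NumPy sketch (the non-saturating generator loss is a common substitute, used here as an assumption since the exact variant is not specified):

```python
import numpy as np

def latentgan_losses(d_real, d_fake, eps=1e-8):
    """d_real: D(z) on encoder latents; d_fake: D(G(z_n, c)) on generated
    latents. Returns the discriminator loss and the (non-saturating)
    generator loss."""
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```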


For learning control over the latent distribution, an additional mutual-information-based objective, inspired by Chen et al. (2016), is used:

I(c; G(z_n, c)) = H(c) - H(c | G(z_n, c))

By defining a variational lower bound over the second term of the equation above, using an auxiliary distribution Q(c | z) to approximate the posterior, it can be further written as:

L_I(G, Q) = E_{c ~ p(c), z_n}[log Q(c | G(z_n, c))] + H(c) <= I(c; G(z_n, c))

The full objective is min_{G, Q} max_D V(D, G) - λ L_I(G, Q), where the λ term ensures that the GAN loss and the differential-entropy (mutual-information) loss are on the same scale.
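For a categorical code, the variational term E[log Q(c | G(z_n, c))] is a negative softmax cross-entropy, and H(c) is a constant log K for a uniform prior; a small NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mi_lower_bound_categorical(c_onehot, q_logits, eps=1e-8):
    """Variational lower bound L_I for a uniform categorical code:
    E[log Q(c | G(z_n, c))] + H(c), with H(c) = log K."""
    shifted = q_logits - q_logits.max(axis=1, keepdims=True)
    q = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    log_q_c = np.sum(c_onehot * np.log(q + eps), axis=1)
    return float(np.mean(log_q_c) + np.log(c_onehot.shape[1]))
```

A near-perfect recognition network drives the bound toward its maximum H(c) = log K.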

Figure 1: LatentGAN Autoencoder model

Implementation Details

We train an autoencoder consisting of an encoder E, which models q(z | x), where z is the latent variable and x is the input image, and a decoder De, which models p(x | z), trained with the autoencoder objective from the Method section. The LatentGAN generator G and discriminator D, as shown in Figure 1, can be implemented using linear layers with non-linearities. The recognition network Q also consists of linear layers and shares parameters with D. Then G, D, and Q are trained using the following process:

  • 1) Random Gaussian noise z_n is sampled from N(0, I), along with a latent code c, which can be continuous uniform or categorical discrete with K categories.

  • 2) The latent code c and the Gaussian noise z_n are passed through G, which generates the latent sample z_g = G(z_n, c), where c can be a continuous or a discrete code.

  • 3) D gets input from z_g and the encoder latent z, and outputs whether the sample belongs to the real latent distribution q(z) or not.

  • 4) Q receives input from G, and outputs a code c_hat, which is used for optimizing the mutual-information objective from the Method section.
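Step 1 of the process above can be sketched as follows (the noise and code dimensions are illustrative assumptions):

```python
import numpy as np

def sample_generator_input(batch, noise_dim, k, rng):
    """Step 1: sample Gaussian noise z_n ~ N(0, I) and a one-hot
    categorical code c with K categories, then concatenate them
    as the input to the generator G."""
    z_n = rng.standard_normal((batch, noise_dim))
    idx = rng.integers(0, k, size=batch)   # random category per sample
    c = np.eye(k)[idx]                     # one-hot encode the code
    return np.concatenate([z_n, c], axis=1)
```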

It has been observed that using the layer initialization from DCGAN Radford et al. (2016) hurts the LatentGAN training process, but the suggestion of using noisy labels during training does help.
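The noisy-label trick can be sketched as follows (the exact sampling ranges are an assumption; the idea is to replace hard 0/1 discriminator targets with values drawn near 0 and 1):

```python
import numpy as np

def noisy_labels(n, real, rng):
    """Noisy discriminator targets: real targets are drawn near 1 and
    fake targets near 0, instead of hard 1/0 labels (ranges assumed)."""
    return rng.uniform(0.8, 1.0, n) if real else rng.uniform(0.0, 0.2, n)
```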


MNIST Dataset

Figure 2: Varying the discrete code c_1 changes the digit category. Every row corresponds to a different c_1, and every column has the same c_1 but different Gaussian noise z_n.
Figure 3: Varying the continuous code c_2 changes the thickness of the digit. Every row has a different c_1 and Gaussian noise z_n, and every column has a different c_2 but the same Gaussian noise z_n and c_1.
Figure 4: Varying the continuous code c_3 rotates the digit. Every row has a different c_1 and Gaussian noise z_n, and every column has a different c_3 but the same Gaussian noise z_n and c_1.

On the MNIST LeCun and Cortes (2010) dataset, we use one categorical discrete variable c_1 with K = 10 and two continuous uniform variables c_2 and c_3 as the input latent code to G. After training, random samples are generated by choosing z_n, c_1, c_2, and c_3, and the output is shown in Figure 2. This shows that the LatentGAN generator is able to directly approximate the latent distribution of the autoencoder.

Method  | K  | Test error (↓)
InfoGAN | 10 | 5 ± 0.01
AAE     | 32 | 4.10 ± 1.13
Ours    | 10 | 2.38 ± 0.38
Table 1: MNIST unsupervised classification test error.

We also show the learned disentangled rotation control in Figure 4 and the disentangled thickness control in Figure 3. This shows that we can directly learn control over the latent distribution of the autoencoder. We can also use the discriminator for unsupervised classification, and Table 1 shows that our method achieves a better test error than InfoGAN and AAE, where (↓) denotes that a lower score means better performance.
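The unsupervised classification error in Table 1 is typically computed by mapping each discrete code to a ground-truth class; a minimal sketch using majority-vote matching (an assumption, since the exact assignment procedure is not specified in the text):

```python
import numpy as np

def cluster_test_error(pred_codes, labels, k):
    """Map each predicted discrete code to the majority ground-truth
    label among samples assigned to it, then report the error rate."""
    mapped = np.empty_like(labels)
    for c in range(k):
        mask = pred_codes == c
        if mask.any():
            mapped[mask] = np.bincount(labels[mask]).argmax()
    return float(np.mean(mapped != labels))
```

The Hungarian (optimal assignment) matching is another common choice for this evaluation.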

Encoder        | Decoder                 | Discriminator            | Generator
Input(1x32x32) | Input                   | Input                    | Input
c4-o16-s2-r    | r-u4-c3-o128-p1-bn128-r | fc-o1000-r               | fc-o1000-r
c4-o32-s2-r    | u2-c3-o64-p1-bn64-r     | fc-o1000-r               | fc-o1000-r
c4-o64-s2-r    | u2-c3-o32-p1-bn32-r     | fc-o512-r                | fc-o1000-r
c4-o128-s2-r   | u2-c3-o3-p1-tanh        | fc-o1-sig, fc-o2, fc-o10 | fc-o64
c4-o128-s2-r   | -                       | -                        | -
fc-o64         | -                       | -                        | -
Table 2: MNIST Network Architecture

For training the LatentGAN autoencoder on the MNIST dataset, we used the architecture in Table 2 with a batch size of 128, a learning rate of 0.0002, and the Adam optimizer with beta1 = 0.5 and beta2 = 0.9 for both the autoencoder and the LatentGAN. Table 2 uses the following conventions: c is convolution, o is output channels, s is stride, u is bilinear upsampling, p is padding, r is ReLU, sig is sigmoid, and bn is batchnorm. In column 3 of the table, the final output has 3 sections: the 1st section belongs to the discriminator, and the other sections belong to the recognition network Q. λ_1 and λ_2 are set to 1 and 0.1, respectively.
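The layer shorthand used in Tables 2–4 can be parsed mechanically; a small hypothetical helper following the conventions listed above:

```python
def parse_layer_spec(spec):
    """Parse shorthand like 'c4-o16-s2-r' from the architecture tables:
    c = conv kernel size, o = output channels, s = stride,
    u = upsampling factor, p = padding, r/sig/tanh = activation,
    bn = batchnorm channels, fc = linear layer."""
    layer = {}
    for token in spec.split("-"):
        if token in ("r", "sig", "tanh"):
            layer["activation"] = {"r": "relu", "sig": "sigmoid", "tanh": "tanh"}[token]
        elif token.startswith("bn"):
            layer["batchnorm"] = int(token[2:])
        elif token.startswith("fc"):
            layer["type"] = "linear"
        elif token[0] in "cosup":
            key = {"c": "kernel", "o": "out_channels", "s": "stride",
                   "u": "upsample", "p": "padding"}[token[0]]
            layer[key] = int(token[1:])
    return layer
```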

3D Chair Dataset

Figure 5: 3D Chair generated samples.
Figure 6: Varying the continuous code rotates the chair.

On the 3D Chair Aubry et al. (2014) dataset, we use 3 discrete categorical variables, each with K = 20, and 1 continuous uniform variable as input to G. We are able to generate meaningful samples, as shown in Figure 5, which means G is able to approximate the autoencoder latent distribution. We are also able to learn disentangled rotational control on the chair dataset, as shown in Figure 6. The hyperparameter setting for the 3D Chair dataset is the same as for the MNIST dataset, except that λ_1 and λ_2 are set to 1 and 10, respectively.

Encoder        | Decoder                 | Discriminator                  | Generator
Input(1x64x64) | Input                   | Input                          | Input
c4-o64-s2-r    | r-u4-c3-o512-p1-bn128-r | fc-o3000-r                     | fc-o3000-r
c4-o128-s2-r   | u2-c3-o256-p1-bn64-r    | fc-o3000-r                     | fc-o3000-r
c4-o256-s2-r   | u2-c3-o128-p1-bn32-r    | fc-o3000-r                     | fc-o3000-r
c4-o512-s2-r   | u2-c3-o64-p1-bn32-r     | fc-o512-r                      | fc-o3000-r
c4-o1024-s2-r  | u2-c3-o1-p1-tanh        | fc-o1-sig, (fc-o2), 3x(fc-o20) | fc-o128
c4-o128-s2-r   | -                       | -                              | -
fc-o128        | -                       | -                              | -
Table 3: 3D Chair Network Architecture

For training the LatentGAN autoencoder on the 3D Chair dataset, we used the architecture mentioned in Table 3.

CelebA Dataset

Figure 7: CelebA Generated samples

On the CelebA Liu et al. (2015) dataset, we use only 10 discrete categorical variables, each with K = 10, as input to G. After training, the generator G and the decoder are able to generate meaningful samples, as shown in Figure 7.

Figure 8: Varying one of the categorical codes changes the smile.

We are also able to learn control over the smile feature on the CelebA dataset, as shown in Figure 8.

Encoder        | Decoder                 | Discriminator          | Generator
Input(3x32x32) | Input                   | Input                  | Input
c4-o64-s2-r    | r-u4-c3-o512-p1-bn128-r | fc-o3000-r             | fc-o3000-r
c4-o128-s2-r   | u2-c3-o256-p1-bn64-r    | fc-o3000-r             | fc-o3000-r
c4-o256-s2-r   | u2-c3-o128-p1-bn32-r    | fc-o3000-r             | fc-o3000-r
c4-o512-s2-r   | u2-c3-o3-p1-tanh        | fc-o512-r              | fc-o3000-r
c4-o128-s2-r   | -                       | fc-o1-sig, 10x(fc-o10) | fc-o128
fc-o128        | -                       | -                      | -
Table 4: CelebA Network Architecture

For training the LatentGAN autoencoder on the CelebA dataset, we used the architecture mentioned in Table 4. The hyperparameter setting for the CelebA dataset is the same as for the MNIST dataset, except that λ_2 is set to 1.


Conclusion

We proposed the LatentGAN autoencoder, which can learn direct control over the autoencoder latent distribution and, in doing so, is able to generate meaningful samples. Experimentally, we verify that the LatentGAN autoencoder can be used to learn meaningful disentangled representations over the latent distribution, and that it performs better than InfoGAN and AAE on the unsupervised MNIST classification task. This further suggests that rather than making the generator learn the image distribution, which may be challenging, we can approximate the latent distribution, which is easier for the generator to learn.


Acknowledgements

We thank Wobot Intelligence Inc. for providing the NVIDIA GPU hardware resources for conducting this research, which significantly sped up our experimentation cycle.


References

  • M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, Cited by: 3D Chair Dataset.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. External Links: Document Cited by: Introduction.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. External Links: 1511.06349 Cited by: Introduction.
  • C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in β-VAE. External Links: 1804.03599 Cited by: Introduction.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. External Links: 1606.03657 Cited by: Introduction, Related Work.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. External Links: 1406.2661 Cited by: Introduction.
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) DRAW: a recurrent neural network for image generation. External Links: 1502.04623 Cited by: Introduction.
  • S. H. Hong, J. Lim, S. Ryu, and W. Y. Kim (2019) Molecular generative model based on adversarially regularized autoencoder. External Links: 1912.05617 Cited by: Introduction.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2018) Image-to-image translation with conditional adversarial networks. External Links: 1611.07004 Cited by: Introduction.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. External Links: 1312.6114 Cited by: Introduction.
  • B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman (2016) Building machines that learn and think like people. External Links: 1604.00289 Cited by: Introduction.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. External Links: Link Cited by: MNIST Dataset.
  • X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual motion gan for future-flow embedded video prediction. External Links: 1708.00284 Cited by: Introduction.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). Cited by: CelebA Dataset.
  • L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther (2016) Auxiliary deep generative models. External Links: 1602.05473 Cited by: Introduction.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2016) Adversarial autoencoders. External Links: 1511.05644 Cited by: Introduction, Related Work.
  • A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. External Links: 1511.06434 Cited by: Implementation Details.
  • H. Shi, J. Dong, W. Wang, Y. Qian, and X. Zhang (2018) SSGAN: secure steganography based on generative adversarial networks. External Links: 1707.01613 Cited by: Introduction.
  • H. Wang, W. Peng, and W. Ko (2019) Learning priors for adversarial autoencoders. External Links: 1909.04443 Cited by: Related Work, 1st item.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. External Links: 1612.03242 Cited by: Introduction.