1 Introduction
Real-world data involves smooth features as well as discrete ones. The size of a dog varies smoothly from small to large, but a smooth transition from a dog to a cat hardly exists, as the species is an inherently discrete feature. Since discrete features induce disconnections in the space, the underlying structure of real-world data is generally not a single manifold but multiple disconnected manifolds.
Generative adversarial networks (GANs) [6] and their successors (e.g., [17, 1, 3]) present a remarkable ability to learn the manifold structure of data. However, they have difficulty handling the disconnections in multi-manifold data, since their decoder-based generator primarily defines a single connected manifold [11, 18, 19]. Some recent works tackle this issue by using multiple generator networks [11, 5, 10] or by placing a mixture density on the latent space [19, 7]. These approaches avoid the difficulties due to the disconnections, as they model each of the manifolds separately.
However, it should be noted that the manifolds generally have shared structures, as they represent common smooth features. For example, cats and dogs have common smooth features such as size and pose. The respective manifolds of cats and dogs should share their structures, as the local transformation rules for the size (stretching the foreground patch) or the pose (rotating the patches of body parts) are the same. The separate modeling of manifolds cannot capture this shared structure, so it cannot learn the common abstract features of data.
In this work, we propose an encoder-powered GAN (EncGAN) that is able to learn both the multi-manifold structure and the common abstract features of data. Unlike the conventional decoder-based GANs, EncGAN takes an encoder as its primal model of the data. The encoder combines multiple manifolds into a single latent space by eliminating the disconnected regions between the manifolds. In the process, the manifolds are aligned and overlapped, from which the common smooth features are obtained. Data is generated by inverting the encoder, restoring the eliminated components to make distinct manifolds. This generating scheme sets the disconnected regions aside from the modeling of manifolds, and thus resolves the difficulties that GANs have. The advantages of EncGAN can be summarized as follows:

Efficient and abstractive modeling: EncGAN uses a single encoder to model the multiple manifolds; it is thus efficient and able to learn the common abstract features.

Circumvention of modeling the disconnected region: The discrete features that induce the disconnections are set aside from the modeling of manifolds. Thus, EncGAN does not suffer from the difficulties that GANs have.

Disentangled features: Although EncGAN is not explicitly driven to disentangle the features, it achieves good disentanglement of the latent-space features thanks to the shared manifold modeling.

Easy applicability: Data generation involves a computationally challenging inversion of the encoder. However, we propose an inverted, decoder-like formulation with a regularizer that avoids computing the inverse. This makes EncGAN easily applicable to any existing GAN architecture.
We start by looking into the difficulties that conventional GANs have. Next, we explain our model in detail and investigate the above advantages through experiments.
2 Difficulties in Learning Multiple Manifolds with GAN
In conventional GANs [6], the generator consists of a latent-space distribution $p_z$ and a decoding mapping $g$. Transforming $p_z$ through $g$ defines an input-space distribution $p_g$ that models the data, and this particular way of defining $p_g$ introduces a manifold structure. The structure further becomes a single, globally connected manifold when $p_z$ is set to a uniform or a normal distribution, as usual. This is because the supports of these distributions are globally connected spaces, and $g$ is a smooth and injective map (although these conditions are not guaranteed, they are approximately met or observed in practice; see [18]). As smooth, injective maps preserve connectedness before and after the mapping, the manifold produced from $p_z$ is the same globally connected space as the latent space [11].

A consequence of this is a difficulty in learning multi-manifold data. As the generator can only present a single connected manifold, its best possible option is to cover all the manifolds and approximate the disconnected regions with low densities. This requires the generator to learn a highly nonlinear $g$, because the density is obtained, by the change of variables, as $p_g(x) = p_z(z)\,\det\!\big(J_g(z)^{\top}J_g(z)\big)^{-1/2}$, and the Jacobian has to be large to present a low density in the disconnected regions (see Figure 1). Highly nonlinear functions are hard to learn and often lead to mode collapse [15, 11, 19]. Moreover, even if the generator succeeds in learning all the manifolds, unrealistic data are sometimes generated, as the model presents only an approximation to the disconnection [11].
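To make the density argument concrete, here is a one-dimensional numpy sketch; the map $g(z) = \tanh(az)$ and the steepness $a$ are illustrative choices, not from the paper. A steep $g$ concentrates the mass near two "modes" and leaves a low-density bridge where the derivative is large:

```python
import numpy as np

# Hypothetical 1-D generator g(z) = tanh(a*z): for large a it pushes the mass of
# z ~ N(0, 1) toward the two "modes" x = -1 and x = +1, approximating two
# disconnected manifolds with a low-density bridge in between.
a = 5.0
p_z = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def p_x(x):
    # change of variables: p_x(x) = p_z(g^{-1}(x)) / |g'(g^{-1}(x))|
    z = np.arctanh(x) / a             # g^{-1}(x)
    return p_z(z) / (a * (1 - x**2))  # g'(z) = a * (1 - tanh(az)^2) = a * (1 - x^2)

# the density in the "disconnected" middle is low precisely because |g'| is large there
print(p_x(0.0), p_x(0.95))
```

The larger the steepness $a$, the deeper the valley at $x = 0$, which is the "large Jacobian" requirement described above.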
2.1 Separate Manifold Modeling in Extended GANs
Recently, several extended GAN models have been suggested to tackle multi-manifold learning. The models can be categorized into two groups, according to which component of the generator is extended. The first approach extends the single decoding mapping to multiple decoding mappings [11, 5, 10] (see Figure 1, DMWGAN), from which a disconnected manifold is obtained; in particular, each component distribution models an individual manifold. The second approach extends the latent space itself to a disconnected space, by using a Gaussian-mixture latent distribution [19, 7] or by employing discrete latent variables [4] (see Figure 1, DeLiGAN and InfoGAN). Here, each mixture component or discrete variable models an individual manifold.
Although these extensions work quite well, they can be inefficient, as they consider each manifold separately. Multiple decoding mappings may require a larger number of parameters, and disconnected latent modeling may require a more complex mapping, as each disconnected latent region has to be mapped differently. Also, they can hardly capture the shared structure that multi-manifold data generally involves. As in the cats-and-dogs example, the manifolds likely share their structures along the abstract smooth features. The separate modeling of manifolds cannot exploit this structure, so it misses the abstract features shared among the manifolds.
3 Encoder-Powered GAN
Our objective is to model multi-manifold data without suffering from the difficulties due to the disconnections. Also, we aim to model the data efficiently and to capture the abstract features shared among the manifolds. Here, we explain the proposed encoder-powered GAN in detail and show how it meets these objectives. We first focus on a single linear layer and then move on to an encoder network that inherits the principles of the linear layer. Among others, the most important contribution we make is the inverted formulation with the bias regularizer; we will shortly see that it enables tractable model training with mathematical guarantees.
3.1 Linear Encoding Layer
Modeling and Abstracting the Multiple Manifolds
Consider a linear encoding layer with a weight $W \in \mathbb{R}^{n \times m}$ ($m < n$; here we assume $W$ is full-rank, i.e., $\operatorname{rank}(W) = m$, which is approximately met in practice, as noted in Section 2) and a bias $b$. When an input $x \in \mathbb{R}^{n}$ is passed through the layer, basic linear algebra tells us that only the component of $x$ tangential to the column space of $W$ survives, while the normal component is lost. We use this property to model the multiple manifolds and to extract the shared features among them. In particular, we model the manifolds as spaces parallel to the column space of $W$, differing only in the normal component (see Figure 2, left). This means a point lying on the $i$-th manifold can be described as
$$x = x_{\mathrm{tan}} + x^{i}_{\mathrm{nor}},$$
where the tangential component $x_{\mathrm{tan}}$ is variable and shared among the manifolds, and the normal component $x^{i}_{\mathrm{nor}}$ is fixed and distinct for each manifold. With this modeling, the latent output
$$z = W^{\top}x + b \qquad (1)$$
is the same for points lying on different manifolds as long as they have the same tangential components. This way, the linear layer captures the smooth features shared among the manifolds in its latent output. At the same time, it abstracts the multiple manifolds into a single latent space by eliminating the disconnected region along the normal directions.
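A small numpy sketch of this elimination, assuming the encoder acts as $z = W^{\top}x + b$ with the columns of $W$ spanning the tangent directions (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))          # columns span the shared tangent plane
b = rng.normal(size=2)

# build a normal direction: anything with zero projection onto col(W)
P = W @ np.linalg.pinv(W)            # orthogonal projector onto col(W)
v = rng.normal(size=5)
x_nor = v - P @ v                    # normal component (distinct per manifold)

t = rng.normal(size=2)               # shared tangential coordinates
x1 = W @ t                           # point on manifold 1 (zero normal offset)
x2 = W @ t + x_nor                   # same tangential part, second manifold

z1, z2 = W.T @ x1 + b, W.T @ x2 + b  # the normal part is eliminated by W^T
assert np.allclose(z1, z2)           # distinct points, identical latent code
```

The two inputs differ in the ambient space, yet their latent codes coincide, which is exactly the "abstracting" behavior described above.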
Data Generation
Data is generated by inverting the encoding mapping (Eq. 1), transforming a random latent sample into an input-space sample. First, multiplying both sides of Eq. 1 by the pseudoinverse $(W^{\top})^{+}$, we obtain the tangential component as $x_{\mathrm{tan}} = (W^{\top})^{+}(z - b)$. Then, we add a normal component to generate the data sample for the $i$-th manifold:
$$x^{i} = (W^{\top})^{+}(z - b) + x^{i}_{\mathrm{nor}}. \qquad (2)$$
Here, the normal components $x^{i}_{\mathrm{nor}}$ restore the disconnections between the manifolds. They can either be trained or be computed directly from the encoder inputs, depending on the model setting; in this work, we only consider the trainable setting.
Inverted Formulation
The primal formulation given above contains the pseudoinverse of $W^{\top}$, which is challenging to compute and unstable under gradient-descent learning. Our approach is to construct a dual formulation based on an inverted, decoder-like parameterization. Although the model mainly learns decoders under this regime, they are properly regularized such that the primal single encoder can be recovered at any time.
Let us start by rearranging Eq. 2 as
$$x^{i} = Vz + c_{i},$$
where $V = (W^{\top})^{+}$ denotes the decoder weight and $c_{i} = x^{i}_{\mathrm{nor}} - (W^{\top})^{+}b$ denotes the decoder bias for the $i$-th manifold (see Figure 2, right). Comparing this rearrangement with a general decoding mapping, we can infer the condition that makes the decoders consistent with the encoder: the weight $V$ and the tangential component of the bias, $V^{+}c_{i}$, must be shared among all the manifolds. If the decoders keep this condition while training, we can always recover the encoder by $W^{\top} = V^{+}$ and $b = -V^{+}c_{i}$. Making $V$ shared is as trivial as setting the same weight for all the manifolds, but making $V^{+}c_{i}$ shared requires a regularization.
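A quick numpy check of this consistency condition; $V$ and the biases are random stand-ins, and the decoder is assumed to act as $x = Vz + c_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 2))                    # shared decoder weight
Vp = np.linalg.pinv(V)
P = V @ Vp                                     # orthogonal projector onto col(V)

# decoder biases with a shared tangential component but distinct normal parts
t_shared = rng.normal(size=2)
biases = []
for _ in range(3):
    v = rng.normal(size=6)
    biases.append(V @ t_shared + (v - P @ v))  # tangential part fixed, normal free

# the recovered encoder bias -V^+ c_i is identical for every manifold
recovered = [-Vp @ c for c in biases]
assert all(np.allclose(recovered[0], r) for r in recovered)
```

As long as the biases share their tangential component, the single primal encoder can be read off from the decoder parameters at any time.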
Bias Regularizer
We could make the tangential components $V^{+}c_{i}$ shared by minimizing the sum of their variances. However, computing this term directly is intractable due to the pseudoinverse inside the variance, so we bound it from above.

Proposition 1.

Proof.

The bound follows from the cyclic property of the trace and the Cauchy–Schwarz inequality for positive semidefinite matrices (see Appendix A for the details). ∎
As the harmonic mean in the proposition is constant with respect to the biases, we can minimize the original term by minimizing the upper bound instead. With an additional function to match the scale across dimensionalities, we adopt this upper bound as the regularizer that enforces the shared tangential components.
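For reference, the quantity to be shared can be computed directly as the variance of the tangential components $V^{+}c_{i}$; this is the intractable-to-differentiate form that the proposition bounds, sketched here only for clarity (the function name and shapes are illustrative, and the paper's tractable upper bound is what is actually minimized in training):

```python
import numpy as np

def tangential_variance(V, biases):
    """Sum of per-dimension variances of the tangential components V^+ c_i.

    This is the direct (pre-bound) objective; it requires a pseudoinverse,
    which is why the paper minimizes a tractable upper bound instead.
    """
    Vp = np.linalg.pinv(V)
    T = np.stack([Vp @ c for c in biases])  # (num_biases, latent_dim)
    return T.var(axis=0).sum()              # zero iff tangential parts coincide
```

When the biases differ only in their normal components, this quantity is zero; any misalignment of the tangential parts makes it strictly positive.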
3.2 Encoder Network
A single linear layer is, of course, insufficient, as it can model only linear manifolds and eliminate the disconnected region only linearly. We therefore build a deep encoder network to model complex nonlinear manifolds. In each linear layer of the encoder, the disconnected region is eliminated linearly, so a residual nonlinear region remains. The following nonlinear layer flattens this region, so that the next linear layer can eliminate it further, down to a smaller number of dimensions. Continuing this procedure layer by layer, the encoder finally yields a single low-dimensional linear space shared among the manifolds, in which the disconnected region is fully eliminated.
Instead of directly considering the encoder, we again use the inverted formulation: we define a decoder and regularize it such that the encoder can be recovered. Specifically, for each of the linear layers in the decoder, we set multiple biases and apply the regularizer proposed above. As the other types of layers (e.g., batch normalization or nonlinear activations) can be inverted in closed form, we can invert the entire decoder to recover the original encoder as needed.
Training
If we denote the decoding weight of the $l$-th linear layer as $V_l$ and its decoding biases as $c_{l,i}$, we can express the ancestral sampling of data as follows: draw $z \sim p_z$, draw a bias index $i_l \sim \pi_l$ for each multi-bias layer, and decode $z$ through the layers using the selected biases. Here, $\pi_{l,i}$ stands for the probability of selecting the $i$-th bias. This probability could also be learned using the method proposed in [11], but that is beyond our scope, so we fix it to be uniform. Now, denoting the real data distribution as $p_{\mathrm{data}}$ and the fake distribution presented by the above sampling as $p_g$, we define our GAN losses as
$$L_G = -\mathbb{E}_{x \sim p_g}[D(x)] + \lambda \sum_l R_l, \qquad L_D = \mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)],$$
where $L_G$ and $L_D$ are the generator and discriminator losses respectively, $R_l$ is the bias regularizer of the $l$-th layer, and $\lambda$ is a regularization weight. We use Wasserstein GAN (WGAN) [1], so the discriminator $D$ is limited to a Lipschitz function.
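The sampling step can be sketched with a single multi-bias linear layer and a tanh output; the uniform bias choice follows the text, while the sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, n_biases = 4, 16, 3
V = rng.normal(size=(data_dim, latent_dim)) * 0.1  # shared decoder weight
C = rng.normal(size=(n_biases, data_dim)) * 0.1    # one decoder bias per manifold

def sample(batch):
    z = rng.normal(size=(batch, latent_dim))       # latent sample z ~ p_z
    i = rng.integers(n_biases, size=batch)         # uniform bias (manifold) choice
    return np.tanh(z @ V.T + C[i])                 # multi-bias linear layer + tanh

x = sample(8)
```

Each row of `x` lies on the manifold selected by its bias index, while the shared weight `V` carries the smooth features common to all manifolds.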
Encoding
With the inverted, decoder-like formulation, we have circumvented the inversions in data generation; conversely, however, the encoding becomes difficult, especially due to the convolutional layers. A recent work considers this difficulty and proves that a distance-minimization approach, $\min_{z}\|g(z)-x\|^{2}$, can find the correct latent value for a convolutional generator $g$ [14]. Applying this to our case, we compute the latent code and the proper decoding biases by:
(3)
where the second term is introduced to regularize the shared-tangential-component condition, weighted by a regularization coefficient, with the reference taken as the mean of the biases.
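A minimal sketch of the distance-minimization encoding, using plain gradient descent on a random linear decoder for brevity (the paper optimizes the full convolutional decoder with Adam, and additionally optimizes the biases):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.normal(size=(8, 3))             # stand-in linear decoder weight
c = rng.normal(size=8)                  # stand-in decoder bias
x = V @ rng.normal(size=3) + c          # a data point lying on the decoded manifold

z = np.zeros(3)
for _ in range(5000):                   # gradient descent on ||Vz + c - x||^2
    grad = 2 * V.T @ (V @ z + c - x)
    z -= 0.02 * grad

print(np.linalg.norm(V @ z + c - x))    # residual shrinks toward zero
```

For this linear stand-in the problem is convex, so gradient descent recovers the latent code exactly; for the real convolutional decoder the cited result [14] is what justifies the same approach.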
4 Experiments
Table 1: FID scores on MNIST and 3D-Chair, and disentanglement scores for the slant and width (MNIST) and the height and brightness (3D-Chair) features, comparing WGAN, DMWGAN, VAE, InfoGAN, EncGAN (ours), and an EncGAN variant.
Datasets
We experiment on the MNIST [13], 3D-Chair [2], and UT-Zap50k [20] image datasets. 3D-Chair contains 1,393 distinct chairs rendered from 62 different viewing angles (86,366 images in total); in our experiments, only the 44,576 front-facing images are used, rescaled to 64x64 grayscale. UT-Zap50k contains images of 4 different types of shoes (50,025 images in total), rescaled to 32x32.
Model Architecture
We use DCGAN-like [17] model architectures for all the datasets (see Appendix B for the complete specifications). For each of the linear layers in the generator, the number of biases is set to 10 (MNIST), 20 (3D-Chair), and 4 (UT-Zap50k). Although our multi-bias linear layer can be applied to both fully-connected and transposed-convolution layers, we apply it only to the former. This is sufficient for our purpose, since discrete features rarely exist at the scale of such small kernel patches. In the discriminator, we use spectral normalization [16] to achieve the Lipschitz condition for WGAN. For training and encoding, Adam [12] is used with the default settings except for the learning rate, 0.0002.
4.1 Multi-Manifold Learning
Although the true multi-manifold structures of the datasets are unknown, we can make a reasonable guess by considering which features are discrete and which are smooth. MNIST involves discrete digits and smoothly varying writing styles, so we can guess that it has a distinct manifold for each digit, with the manifolds representing the varying writing styles. Similarly, 3D-Chair would have a distinct manifold for each category of chair, with the manifolds representing varying viewing angles, shapes, or brightness; UT-Zap50k would have a distinct manifold for each type of shoe, with the manifolds representing varying colors or styles.
Looking at Figure 3 row-wise, we can see that our model learns distinct manifolds well, in accordance with our guesses (see, in particular, the rolling chairs and the boots). Column-wise, we can see that semantic features (e.g., stroke weight) are well aligned among the manifolds, which indicates that our model learns the shared abstract features in the latent space. To quantitatively examine the sample quality, we compute the FID score [8], which is widely used to measure the diversity and quality of generated image samples and also hints at the presence of mode collapse. Table 1 shows that our model achieves FID scores better than or comparable to the others.
4.2 Disentangled Abstract Features
To examine how well the learned latent features are disentangled, we define a disentanglement score and compare it across models. We first take a few images from the dataset and manually change one of the smooth features, corresponding to a known transformation. We then encode these images into latent codes, analyze the covariance, and take the ratio of the first eigenvalue to the second as the disentanglement score (see Appendix C for the details). Table 1 shows that our model scores better than the other GAN models, though sometimes not as well as VAE. Note, however, that our model is not guided in an information-theoretic sense to disentangle the features, as VAE and InfoGAN are, yet it still shows good disentanglement (see Figure 4).
Figure 4: As in Figure 3, except that the columns represent linear changes in the latent space along the first eigenvector from the disentanglement-score analysis (see Section 4.2). The slant and width (MNIST) and the height and brightness (3D-Chair) components are shown.

4.3 Style Transfer
We demonstrate style transfer between images using our model by matching the latent codes. To perform style transfer in other models (e.g., VAE), one usually needs to separate the discrete and smooth features, and then transfer only the smooth features so that the style is transferred but not the identity of the object. In EncGAN, on the other hand, the latent space contains only the smooth features, as the discrete features reside in the biases. Thus, we can produce style-transferred images simply by replacing the latent code of the source image with those of the target images and regenerating the data (see Figure 5).
Interestingly, EncGAN can even style-transfer images that lie off the trained distribution (see the red boxes). This is a property exclusive to EncGAN, owing to the encoder. If the encoder generalizes well (which it apparently does), it can recognize the abstract features of an image even when the image carries added frame or rectangle noise. Once the features are recognized, the model knows how to transform the image smoothly, just like the original data. The style-transferred images can thus be obtained with the noise still present, as the noise is stored in the form of biases (computed via Eq. 3) and restored during regeneration, just like a discrete feature.
5 Conclusion
In this work, we addressed the problem of multi-manifold learning by proposing the encoder-powered GAN (EncGAN). We showed that EncGAN successfully learns the multi-manifold structure of data and captures the disentangled features shared among the manifolds. As EncGAN uses an encoder to model the data abstractly, it showed a potential to generalize to unseen data, as demonstrated in the style-transfer experiment. If trained on larger datasets such as ImageNet in the future, we expect EncGAN to become versatile enough to infer the manifolds of different datasets (not just data with noise). Another line of future work is transfer-learning the encoder from pretrained classifier models, which could bring a great boost to GAN training.
References
[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. January 2017.
[2] Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3762–3769, Columbus, OH, USA, June 2014. IEEE.
[3] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv:1703.10717 [cs, stat], March 2017.
 [4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
 [5] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip H. S. Torr, and Puneet K. Dokania. MultiAgent Diverse Generative Adversarial Networks. arXiv:1704.02906 [cs, stat], April 2017.
 [6] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 [7] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R. Venkatesh Babu. DeLiGAN : Generative Adversarial Networks for Diverse and Limited Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 166–174, 2017.
 [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two TimeScale Update Rule Converge to a Local Nash Equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
[9] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. November 2016.
 [10] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training Generative Adversarial Nets with Multiple Generators. February 2018.
 [11] Mahyar Khayatkhoei, Maneesh K. Singh, and Ahmed Elgammal. Disconnected Manifold Learning for Generative Adversarial Networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7343–7353. Curran Associates, Inc., 2018.
 [12] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. December 2014.
 [13] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [14] Fangchang Ma, Ulas Ayaz, and Sertac Karaman. Invertibility of Convolutional Generative Networks from Partial Measurements. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9628–9637. Curran Associates, Inc., 2018.
 [15] Luke Metz, Ben Poole, David Pfau, and Jascha SohlDickstein. Unrolled Generative Adversarial Networks. arXiv:1611.02163 [cs, stat], November 2016.
 [16] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral Normalization for Generative Adversarial Networks. arXiv:1802.05957 [cs, stat], February 2018.
 [17] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. November 2015.
 [18] Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The Riemannian Geometry of Deep Generative Models. arXiv:1711.08014 [cs, stat], November 2017.
 [19] Chang Xiao, Peilin Zhong, and Changxi Zheng. BourGAN: Generative Networks with Metric Embeddings. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2269–2280. Curran Associates, Inc., 2018.
 [20] Aron Yu and Kristen Grauman. FineGrained Visual Comparisons with Local Learning. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, Columbus, OH, USA, June 2014. IEEE.
Appendix A Proposition 1 and the Proof
Proposition.
The following inequality holds
where are the eigenvalues of and denotes a harmonic mean.
Proof.
Note that
where the second and fourth lines use the definition of the covariance, the third line follows from the cyclic property of the trace, and the last line follows from the Cauchy–Schwarz inequality for positive semidefinite matrices. Thus,
where are eigenvalues of and denotes the harmonic mean. ∎
Appendix B Model Architecture and Experimental Environments
We used machines with a single NVIDIA Titan Xp for the training and inference of all models.
Appendix B.1 MNIST
We use 10 distinct decoding biases in the model. In training, we use the Adam optimizer with a learning rate of 0.0002; in encoding, we use Adam with a learning rate of 0.1. The regularization weights for training and encoding are set separately.
Generator  Discriminator 

Input(8)  Input(1,28,28) 
Full(1024), BN, LReLU(0.2)  Conv(c=64, k=4, s=2, p=1), BN, LReLU(0.2) 
Full(6272), BN, LReLU(0.2)  Conv(c=128, k=4, s=2, p=1), BN, LReLU(0.2) 
ReshapeTo(128,7,7)  ReshapeTo(6272) 
ConvTrs(c=64, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1024), BN, LReLU(0.2) 
ConvTrs(c=32, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1) 
ConvTrs(c=1, k=3, s=1, p=1), Tanh 
Appendix B.1.1 Notes on the Other Compared Models
Overall, we match the architecture of other models with our model for fair comparison. Some differences to note are:

DMWGAN: We used 10 generators. Each generator has the same architecture as ours, except that the numbers of features and channels are divided by 4 to match the number of trainable parameters. Note that 4 is the divisor suggested in the original paper.

VAE: We used Bernoulli likelihood.

InfoGAN: The latent dimensions consist of 1 discrete variable (10 categories), 2 continuous variables, and 8 noise variables.
Appendix B.2 3D-Chair
We use 20 distinct decoding biases in the model. In training, we use the Adam optimizer with a learning rate of 0.0002; in encoding, we use Adam with a learning rate of 0.1. The regularization weights for training and encoding are set separately.
Generator  Discriminator 

Input(10)  Input(1,64,64) 
Full(256), BN, LReLU(0.2)  Conv(c=64, k=4, s=2, p=1), BN, LReLU(0.2) 
Full(8192), BN, LReLU(0.2)  Conv(c=128, k=4, s=2, p=1), BN, LReLU(0.2) 
ReshapeTo(128,8,8)  Conv(c=128, k=4, s=2, p=1), BN, LReLU(0.2) 
ConvTrs(c=64, k=4, s=2, p=1), BN, LReLU(0.2)  ReshapeTo(8192) 
ConvTrs(c=32, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1024), BN, LReLU(0.2) 
ConvTrs(c=16, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1) 
ConvTrs(c=1, k=3, s=1, p=1), Tanh 
Appendix B.2.1 Notes on the Other Compared Models

VAE: We used Bernoulli likelihood.

InfoGAN: The latent dimensions consist of 3 discrete variables (20 categories), 1 continuous variable, and 10 noise variables.
Appendix B.3 UT-Zap50k
We use 4 distinct decoding biases in the model. For the regularization weight in training, we start with an initial value and then raise it after 300 epochs.
Generator  Discriminator 

Input(8)  Input(3,32,32) 
Full(512), BN, LReLU(0.2)  Conv(c=128, k=4, s=2, p=1), BN, LReLU(0.2) 
Full(1024), BN, LReLU(0.2)  Conv(c=256, k=4, s=2, p=1), BN, LReLU(0.2) 
Full(8192), BN, LReLU(0.2)  Conv(c=512, k=4, s=2, p=1), BN, LReLU(0.2) 
ReshapeTo(512,4,4)  ReshapeTo(8192) 
ConvTrs(c=256, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1024), BN, LReLU(0.2) 
ConvTrs(c=128, k=4, s=2, p=1), BN, LReLU(0.2)  Full(512), BN, LReLU(0.2) 
ConvTrs(c=64, k=4, s=2, p=1), BN, LReLU(0.2)  Full(1) 
ConvTrs(c=3, k=3, s=1, p=1), Tanh 
Appendix C Disentanglement Score
To compute the disentanglement score, we first take 500 images from the dataset and manually change one smooth feature corresponding to a known transformation. For example, we change the slant of the MNIST digits by applying a shear transform. With 11 different degrees of the transformation, we obtain 5,500 transformed images in total. We encode these images to obtain the corresponding latent codes and subtract the mean within each group of images (originating from the same image) to align the latent codes. Then we conduct principal component analysis (PCA) to obtain the principal directions and the spectrum of variations of the latent codes. If the latent features are well disentangled, the dimensionality of the variation should be close to one. To quantify this, we compute the ratio of the first eigenvalue to the second eigenvalue of the PCA covariance and use it as the disentanglement score.
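The score computation described above can be sketched as follows; the array layout (images × transformation degrees × latent dimensions) is an assumption about how the codes are batched:

```python
import numpy as np

def disentanglement_score(latents):
    """latents: array of shape (n_images, n_degrees, latent_dim), the latent
    codes of one row of transformation degrees per source image.
    Returns the ratio of the first to the second eigenvalue of the pooled
    PCA covariance (large = one-dimensional, well-disentangled variation)."""
    centered = latents - latents.mean(axis=1, keepdims=True)  # align each group
    flat = centered.reshape(-1, latents.shape[-1])
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(flat.T)))[::-1]
    return eigvals[0] / eigvals[1]
```

If the transformation moves all codes along a single latent direction, the first eigenvalue dominates and the score is large; isotropic variation gives a score near one.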
Appendix D Effect of the Bias Regularizer
To examine the effectiveness of our bias regularizer, we visualize the raw bias values and their (pseudo-)tangential components (see Figures D.1 and D.2). In all figures, the biases are diverse, but their tangential components are well aligned thanks to the bias regularizer (left). Without the regularizer, in contrast, the tangential components are not aligned (right).