The recent impressive results of deep generative models and autoencoder-type models rely on a core idea: uncovering a compact, powerful latent space in which the original high-dimensional data can be better synthesised or manipulated. Some of the most striking recent synthesis results in deep learning have come from generative models such as generative autoencoders Kingma2014Auto ; van2016conditional ; sonderby2016ladder ; tolstikhin2018wasserstein ; huang2018introvae ; heljakka2020towards or Generative Adversarial Networks (GANs) goodfellow2014generative ; salimans2016improved ; karras2017progressive ; srivastava2017veegan ; karras2018style ; choi2017stargan ; zhu2017toward ; pham2019simultaneous . However, in spite of their undoubted effectiveness, the latent spaces created by these models are difficult to interpret. In particular, a common problem is that these spaces are entangled: several image characteristics are often combined in a single dimension of the latent space, making navigation and understanding difficult. Certain previous approaches have attempted to disentangle the space in a semi-supervised manner, which requires knowledge of the true underlying factors of the data kingma2014semi ; reed2014learning ; mathieu2016disentangling ; siddharth2017learning ; denton2017unsupervised ; hsu2017unsupervised . In contrast, we would like to organise the latent space in an unsupervised manner, letting the data tell us what variability exists in the dataset.
In this work, we propose a network which we refer to as the “Principal Component Analysis Autoencoder” (PCAAE). An autoencoder is a neural network consisting of two sub-networks: an encoder and a decoder. These networks project data to and from a lower-dimensional latent space. Ideally, we would like this latent space to be interpretable and navigable. We propose to achieve this by creating an autoencoder which shares some of the desirable characteristics of PCA. The classical PCA is a linear transformation to a space with two main properties. Firstly, the axes are organised in order of decreasing variability: along the first axis lies the greatest variability of the data, along the second, orthogonal axis lies the second-greatest variability, and so on. Secondly, the axes are orthogonal to each other, which is necessary for interpretation and manipulation. Ideally, we would like to have the best of both worlds, i.e. the power of a non-linear transformation (a neural network here) together with the aforementioned properties of PCA. More precisely, our goal is to propose an autoencoder with the following two properties: i) the latent space components (axes) are ordered in terms of decreasing importance and ii) each component of a code is statistically independent from the other components.
To achieve this, we first train an autoencoder with a latent space of size 1. Once this is trained, we freeze this first latent component and train an autoencoder with a latent space of size 2, where only the second component is trained. At each step, the decoder is discarded, and a new one is trained from scratch. This continues until we reach the required latent space size (see Figure 1 for an illustration of this approach). In addition, we add a latent space covariance loss term to the autoencoder loss to encourage each component to be statistically independent of the others. If the intrinsic characteristics of the data are distributed independently throughout the dataset, then this will be reflected in the PCAAE latent space. The final objective is to create an autoencoder whose latent space efficiently separates (disentangles) independent characteristics of the data being considered. For example, these could be properties such as size, shape or colour, or more high-level characteristics such as gender or hair colour in the case of images of faces. We achieve this without any reference to labels for these characteristics. Instead, we aim to discover them in a completely unsupervised fashion, through the data itself.
To summarise, in this paper we propose the following contributions:
- An algorithm to create an autoencoder whose latent components are ordered in terms of decreasing importance to the data;
- A covariance loss term that encourages the components of the latent space to be statistically independent, in order to decrease entanglement;
- A demonstration of how the PCAAE can be used to organise and disentangle the latent space of a pre-trained generative network such as a GAN.
In other words, we wish both to impose an order on the latent space and to disentangle it. We demonstrate the effectiveness of our autoencoder on synthetic images of geometric shapes as well as on the more complex data of the CelebA dataset. In the first case, we show that the resulting autoencoder retrieves meaningful axes that can be manipulated to change different geometric characteristics (size, rotation) of the shapes. In the second, we automatically discover properties such as hair colour, gender and pose. We emphasise that this is done in a completely unsupervised manner, without any access to labels for these characteristics. While other approaches to disentanglement exist, they mainly focus on supervised settings where the labels are available. In this work, we wish to discover these underlying properties automatically, by letting the data indicate its different variable characteristics.
2 Previous work
Broadly speaking, there are two main categories of networks which are used for image editing and synthesis: autoencoders Kingma2014Auto ; ranzato2008sparse ; Glorot2011Deep ; Makhzani2013K and GANs radford2015unsupervised ; arjovsky2017wasserstein ; gulrajani2017improved ; chen2016infogan ; odena2017conditional ; yan2016attribute2image ; mukherjee2019clustergan ; delannoy2020segsrgan . The goal of autoencoders is to compress and decompress data to and from a compact, powerful latent space. GANs, on the other hand, fix the latent space with an a priori distribution (for example Gaussian), and attempt to create realistic data with the parallel action of a generator and an adversarial network. While these models have produced impressive results, understanding and interpreting their latent spaces is now a very active research topic. Ideally, we would like to understand what kinds of hidden representations the model has learned. More precisely, the latent space should be disentangled so that each latent component represents a single factor of variation in the data.
Many previous works concerning such models have the goal of improving the compactness and power of latent spaces. Firstly, a commonly remarked behaviour of autoencoders is that they fill up all the space allowed in their latent space, which is detrimental to interpretation and manipulation. A common solution to this problem is to allow the autoencoder more space than is likely necessary, and then try to impose some sort of structure on the latent space. Sparse autoencoders ranzato2008sparse ; Glorot2011Deep ; Makhzani2013K , for example, attempt to keep as few neurons active (non-zero) as possible, or specify a maximum number of non-zero values in the network. However, while this forces compactness, the autoencoder can still entangle several data characteristics in a single latent component. Well-known generative autoencoders such as the variational autoencoder Kingma2014Auto and the Wasserstein autoencoder tolstikhin2018wasserstein encourage the latent space to follow a certain predefined distribution. While this is very useful for the purposes of synthesis, it does not in itself improve the interpretability of the latent space components, which can mix several characteristics.
Many previous works exist on the specific task of disentangling latent representations. Rifai et al. rifai2012disentangling employ contractive autoencoders to learn locally invariant features at multiple resolutions, which are then given to a “contractive discriminant analysis” block for the purpose of emotion prediction. Reed et al. reed2014learning propose a Boltzmann machine to discover underlying variation in data with two strategies. Firstly, they include the data labels in their cost function for the Boltzmann machine, and secondly, they “clamp” (impose) a code for two data points which are known to share some characteristics. The works of Cheung et al. cheung2014discovering , Kumar et al. kumar2018variational and Lezama lezama2019overcoming are the most similar previous works to ours, in certain aspects. In particular, these works employ some form of covariance loss. Cheung et al. use a semi-supervised autoencoder to output an image and at the same time predict a class. Kumar et al. propose a covariance loss on the latent space to decorrelate its dimensions, which leads to matching the moments of the distributions of the data and of the latent space. Lezama uses a loss on the Jacobian of the autoencoder output with respect to the latent code, to encourage the code to follow the desired class, as well as a prediction loss using binary classes. Lample et al. proposed Fader networks lample2017fader , which try to isolate a single image characteristic in a single latent component, with an innovative use of a discriminator network. This produces a network where the characteristic can be effectively controlled with a slider. Finally, β-VAE higgins2017beta , β-VAE burgess2018understanding , FactorVAE kim2018disentangling and β-TCVAE chen2018isolating propose frameworks or regularisations to disentangle VAEs by modelling and weighting the Kullback-Leibler divergence term to encourage factorised representations in the latent space.
3 Principal Component Analysis Autoencoder
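Before describing the PCAAE, we first set out some notation. Let $\mathcal{X}$ be the data space; in general, we will consider images of size $m \times n$, so $\mathcal{X} = \mathbb{R}^{m \times n}$. We denote the latent space by $\mathcal{Z} = \mathbb{R}^{d}$, $d$ being the dimensionality of this latent space. We denote the encoder by $E : \mathcal{X} \to \mathcal{Z}$ and the decoder by $D : \mathcal{Z} \to \mathcal{X}$. For a code $z = E(x)$, we denote by $z_i$ the $i$-th component of $z$. Let $\hat{x} = D(E(x))$ be the output of the autoencoder. The standard autoencoder loss, also called the reconstruction loss, is given by:

$$\mathcal{L}_{rec}(x) = \lVert x - D(E(x)) \rVert_2^2 \qquad (1)$$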
We now describe the core idea and algorithm of the PCAAE. As explained above, there are two central questions we must address in order to define the PCAAE: 1) What do we mean by “decreasing importance” of the latent space components, and how can we impose this? 2) How can we enforce independence of the latent components?
In the case of the standard PCA, importance refers to the variability of the data along an axis. Such a definition is difficult to use with an autoencoder since, in general, all the dimensions in the latent space are filled during training. Thus, it is not useful to simply carry out a PCA on the latent space. Therefore, we impose a notion of importance by training a series of autoencoders of increasing latent space size, starting with a latent space of size 1 (a scalar). In this first autoencoder, we can suppose that the information of greatest “importance” will be encoded, in the sense of the cost of the autoencoder loss. We then increase the size of the latent space by 1, while maintaining the same first component from the previous training: only the second component is trained. This is repeated iteratively until a certain predefined dimension is attained (as described in Figure 1). Note that at each iteration, the previous decoder is thrown away, and a new one is trained from scratch. Indeed, we wish to impose some structure on the latent space via the training of the encoder, but the decoder must be allowed to do as it sees fit.
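The greedy scheme above can be illustrated with a linear toy analogue. This is only a sketch under strong assumptions that are not part of the method itself: the encoder and decoder are linear, each new component is found by power iteration on the current residual instead of gradient descent, and the function names (`top_direction`, `greedy_linear_pcaae`) and the toy data are hypothetical. Under these assumptions the refitted least-squares decoder reduces to the transpose of the encoder, so the procedure coincides with ordinary PCA; the PCAAE replaces both steps with the training of non-linear networks.

```python
import random

def top_direction(rows, iters=200, seed=0):
    """Leading principal direction of `rows` (N lists of length p),
    found by power iteration on X^T X."""
    rng = random.Random(seed)
    p = len(rows[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        proj = [sum(r[j] * w[j] for j in range(p)) for r in rows]          # X w
        w = [sum(c * r[j] for c, r in zip(proj, rows)) for j in range(p)]  # X^T (X w)
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w

def greedy_linear_pcaae(data, max_dim):
    """Add one latent component at a time; earlier components stay frozen."""
    n, p = len(data), len(data[0])
    mean = [sum(r[j] for r in data) / n for j in range(p)]
    residual = [[r[j] - mean[j] for j in range(p)] for r in data]
    encoder, errors = [], []
    for k in range(max_dim):
        # "Train" only the new component; previous directions are untouched.
        w = top_direction(residual, seed=k)
        encoder.append(w)
        codes = [sum(r[j] * w[j] for j in range(p)) for r in residual]
        # Remove what the new component reconstructs (linear decoder row = w).
        residual = [[r[j] - c * w[j] for j in range(p)]
                    for r, c in zip(residual, codes)]
        # Mean squared reconstruction error with k + 1 components.
        errors.append(sum(v * v for r in residual for v in r) / n)
    return encoder, errors

# Toy data (assumed): three independent factors of decreasing variance.
rng = random.Random(0)
data = [[rng.gauss(0, 3.0), rng.gauss(0, 1.0), rng.gauss(0, 0.3)]
        for _ in range(300)]
encoder, errors = greedy_linear_pcaae(data, max_dim=3)
```

In this toy setting the reconstruction error decreases with each added component and the first direction aligns with the factor of largest variance, which mirrors the intended ordering by "importance".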
We address the second question, how to impose independence on the latent codes, in the following manner. We require the covariance matrix of the latent vector $z$ to be as close as possible to the identity matrix. In order to reduce the computational burden we can, without loss of generality, apply a batch normalisation ioffe2015batch (BN) layer to the vector $z$, without any learnable parameters associated with it. That is to say, we impose that each component of $z$ has zero mean and unit variance. The magnitude of the off-diagonal entries of the covariance matrix can then be simply expressed as

$$\sum_{i \neq j} \big(\mathbb{E}[z_i z_j]\big)^2,$$

where $i$ and $j$ range through the dimensions of $z$. We recall that we are adding a new dimension to our latent space while freezing the first dimensions. Therefore, imposing the independence between the components of the vector $z$ boils down to minimising

$$\sum_{i=1}^{k-1} \big(\mathbb{E}[z_i z_k]\big)^2, \qquad (2)$$

where $k$ is the current dimension being added. This, in turn, can be translated into a loss term by replacing the expectation with a mean over the whole dataset and keeping only the terms depending on the example $x$. Up to a constant factor, this gives

$$\mathcal{L}_{cov}(x) = \sum_{i=1}^{k-1}\left[\frac{1}{N}\big(z_i z_k\big)^2 + 2\, z_i z_k \cdot \frac{1}{N}\sum_{x' \neq x} z'_i z'_k\right], \qquad (3)$$

where $N$ is the size of the dataset and $z' = E(x')$ is the code of another example $x'$. Since $N$ is much larger than $1$, the first term in the brackets can be discarded. This loss can be effectively implemented by replacing the term $\frac{1}{N}\sum_{x'} z'_i z'_k$ by a running mean, similarly to what is done in the case of a BN layer. Alternatively, one can simply compute the quantity in (2) over a mini-batch and use it instead of the previous formulation. Finally, the loss function of the autoencoder is:

$$\mathcal{L}(x) = \mathcal{L}_{rec}(x) + \mathcal{L}_{cov}(x), \qquad (4)$$

where $\mathcal{L}_{rec}$ is the reconstruction loss.
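The mini-batch variant mentioned above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the penalty is the sum, over the frozen components, of the squared batch mean of the product between each frozen component and the new one, after normalising every component to zero mean and unit variance over the batch. The function names and the small `eps` constant are assumptions, and a real implementation would use a deep learning framework so that gradients flow to the encoder.

```python
def batch_normalise(column, eps=1e-8):
    """Zero-mean, unit-variance version of one latent component over a batch
    (a BN layer without learnable parameters; eps is an assumed constant)."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = (var + eps) ** 0.5
    return [(v - mean) / std for v in column]

def covariance_loss(codes, k):
    """Sum over i < k of (batch mean of z_i * z_k)^2.

    `codes` is a list of latent vectors (one per example) and `k` is the
    0-based index of the component currently being trained.
    """
    n = len(codes)
    cols = [batch_normalise([row[i] for row in codes]) for i in range(k + 1)]
    loss = 0.0
    for i in range(k):
        cross = sum(a * b for a, b in zip(cols[i], cols[k])) / n
        loss += cross ** 2
    return loss

# Two toy batches of 2-dimensional codes (assumed values).
correlated = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
decorrelated = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
loss_correlated = covariance_loss(correlated, k=1)
loss_decorrelated = covariance_loss(decorrelated, k=1)
```

The penalty is close to 1 when the new component duplicates a frozen one and vanishes when the two are uncorrelated over the batch. Note that it measures linear correlation only, which is weaker than full statistical independence.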
4 PCAAE for GAN
The objective of the generator of a GAN is to find a mapping from the latent distribution to the image data distribution. Ideally, we would like each latent component to correspond to one factor of variation in the data. In practice, the latent representations of GANs are entangled (see the supplementary material for experimental evidence of this). In order to organise and disentangle this latent representation, we apply the PCAAE to the latent space of a pre-trained GAN. Indeed, we do not intend to create a new GAN architecture which can compete with state-of-the-art generators such as PGAN; rather, we propose to use our PCAAE to better understand and organise the latent space of a high-quality, pre-trained GAN. In other words, since the problem of simultaneously learning and organising the latent space is too difficult, we propose to learn first and organise afterwards. The learning part is done during the training of the high-quality GAN.
Let us highlight that the strategy we propose can easily be adapted to analyse any GAN; we have chosen PGAN in the current work due to its impressive performance. The input sample to the PGAN lives in $\mathbb{R}^{512}$ (the PGAN latent space), or more precisely is a random sample from the standard Gaussian distribution in dimension 512. Since the input is normalised in the first operation of PGAN’s generator during testing, we can assume that the latent codes are drawn uniformly from a sphere. Since a sphere is not convex, and to make the job of the autoencoder easier, we apply our tool locally around a given point of the latent space. More precisely, let $w_0$ be a fixed point of this sphere (see Figure 2). Let $G$ be the generator of PGAN and $\delta$ be a small perturbation vector (drawn randomly). Our goal is to design a low-dimensional autoencoder that minimises the following loss:
In other words, the autoencoder’s goal is to produce a vector which leads to an image that is as close as possible to $G(w_0 + \delta)$. This vector will have passed through the low-dimensional internal representation of the autoencoder, which is well organised. The covariance loss in Equation (5) is defined as in (3) and encourages disentanglement of the latent space of the pair $(E, D)$. We apply the same training strategy, which consists in iteratively increasing the number of latent components while freezing the previous components. This training process is illustrated in Figure 2. The pair $(E, D)$ is, a priori, specific to a fixed $w_0$, although some pairs have been found to work around other points (see supplementary material).
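The image-space reconstruction term described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact objective: it assumes that the autoencoder takes the perturbed code (the fixed point plus the perturbation) as input, that the distance between images is a squared Euclidean distance, and that the covariance term is added separately. The toy generator, the two encoder/decoder pairs and the value of `sigma` are hypothetical stand-ins for PGAN and the trained networks.

```python
import random

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sample_perturbation(dim, sigma, rng):
    """Small Gaussian perturbation around the fixed code (sigma is assumed)."""
    return [rng.gauss(0.0, sigma) for _ in range(dim)]

def local_loss(w0, delta, generator, encoder, decoder):
    """Squared distance between G(w0 + delta) and G(D(E(w0 + delta)))."""
    w = add(w0, delta)
    target = generator(w)                      # image from the original code
    rebuilt = generator(decoder(encoder(w)))   # image from the re-encoded code
    return squared_distance(target, rebuilt)

# Hypothetical stand-ins: a 4-d latent space and a linear "generator".
w0 = [1.0, 0.0, 0.0, 0.0]
toy_generator = lambda w: [2.0 * w[0] + w[1], w[2] - w[3]]
# An autoencoder that keeps everything, and one with a 1-d bottleneck
# that only keeps the offset along the first coordinate.
keep_all = (lambda w: list(w), lambda z: list(z))
keep_first = (lambda w: [w[0] - w0[0]], lambda z: [w0[0] + z[0]] + w0[1:])

rng = random.Random(0)
delta = sample_perturbation(len(w0), sigma=0.1, rng=rng)
loss_identity = local_loss(w0, delta, toy_generator, *keep_all)
loss_one_dim = local_loss(w0, delta, toy_generator, *keep_first)
```

The generator is only evaluated, never updated, which matches the setting where the pre-trained GAN stays fixed and only the small autoencoder is trained.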
5 Experiments
In this section, we present the results of our PCAAE and compare them with those of VAE Kingma2014Auto , β-VAE higgins2017beta , β-VAE burgess2018understanding , FactorVAE kim2018disentangling and β-TCVAE chen2018isolating . Note that other approaches to disentangling the latent space use data labels, which we wish to avoid here: our goal is to discover the variability of the data in an unsupervised fashion.
5.1 Disentanglement evaluation
We propose to use the absolute Pearson correlation coefficient (PCC) as a disentanglement measure, to verify the relationship between the attributes of the image data and the components of the trained latent space. Given a pair of random variables $(a, z_i)$, where $a$ is an attribute of the image and $z_i$ denotes the $i$-th component of the latent code that represents the data, the absolute PCC is computed as:

$$\rho(a, z_i) = \left| \frac{\mathbb{E}\big[(a - \mu_a)(z_i - \mu_{z_i})\big]}{\sigma_a\, \sigma_{z_i}} \right|,$$

where $\sigma_a$ and $\sigma_{z_i}$ denote the standard deviations of $a$ and $z_i$, and $\mu_a$ and $\mu_{z_i}$ their means, respectively. The absolute PCC ranges from 0 to 1. In a disentangled latent space, the component which corresponds to a given attribute of the data shows a much larger absolute PCC with that attribute than the other components do.
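As a concrete illustration of this measure, the sketch below computes the absolute PCC between one attribute and each latent component from paired samples, using population (biased) estimates of the moments. The function names and the toy values are assumptions made for the example; the attribute and component lists are assumed to be non-constant, otherwise the denominator vanishes.

```python
def absolute_pcc(attribute, component):
    """Absolute Pearson correlation between two equally long sample lists."""
    n = len(attribute)
    mu_a = sum(attribute) / n
    mu_z = sum(component) / n
    cov = sum((a - mu_a) * (z - mu_z)
              for a, z in zip(attribute, component)) / n
    sd_a = (sum((a - mu_a) ** 2 for a in attribute) / n) ** 0.5
    sd_z = (sum((z - mu_z) ** 2 for z in component) / n) ** 0.5
    return abs(cov) / (sd_a * sd_z)

def pcc_row(attribute, codes):
    """Absolute PCC of one attribute against every latent component.

    `codes` is a list of latent vectors, one per image.
    """
    dim = len(codes[0])
    return [absolute_pcc(attribute, [c[i] for c in codes]) for i in range(dim)]

# Toy example (assumed values): the attribute is carried by component 0 only.
area = [1.0, 2.0, 3.0, 4.0]
codes = [[-2.0, 1.0], [-4.0, -1.0], [-6.0, -1.0], [-8.0, 1.0]]
row = pcc_row(area, codes)
```

Here the first entry of `row` is close to 1 and the second is 0, which is the pattern expected when a single latent component captures the attribute.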
5.2 Experimental setup and results on synthetic data
In order to find out whether our PCAAE is able to capture meaningful components which correspond to the parameters of visual objects, we first tested our algorithm on synthetic binary images of geometric shapes which are centred in the image, with a single shape per image. We created images of ellipses with three parameters (two axes, and a rotation). The two ellipse axes and the rotation are each sampled from a uniform distribution over a fixed interval. In these experiments, we set the maximum autoencoder dimension to 3 (the number of parameters used to create the dataset). A drawback of using binary images of shapes is that we have a limited number of centred parametric shapes that we can create, even though we sample the parameters from a continuous space. To solve this problem, we blur the binary shapes slightly with a Gaussian filter, allowing us to create as many images as we wish.
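A dataset of this kind can be generated along the following lines. This is a minimal sketch: the image size, the sampling intervals and the function names are assumptions chosen for illustration (the values used in the experiments are not reproduced here), and the Gaussian blur step is omitted.

```python
import math
import random

def ellipse_image(a, b, theta, size=64):
    """Binary size x size image (list of rows) of a centred ellipse with
    semi-axes a and b in pixels, rotated by theta radians."""
    c = (size - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = x - c, y - c
            u = dx * cos_t + dy * sin_t    # coordinate along the first axis
            v = -dx * sin_t + dy * cos_t   # coordinate along the second axis
            inside = (u / a) ** 2 + (v / b) ** 2 <= 1.0
            row.append(1.0 if inside else 0.0)
        image.append(row)
    return image

def sample_ellipse(rng, size=64, axis_range=(5.0, 25.0)):
    """Draw the three parameters uniformly (the intervals are assumed)."""
    a = rng.uniform(*axis_range)
    b = rng.uniform(*axis_range)
    theta = rng.uniform(0.0, math.pi)
    return (a, b, theta), ellipse_image(a, b, theta, size)

flat = ellipse_image(20.0, 10.0, 0.0)             # long axis horizontal
upright = ellipse_image(20.0, 10.0, math.pi / 2)  # same ellipse, rotated
params, sample = sample_ellipse(random.Random(0))
```

The number of foreground pixels approximates the area of the ellipse, which is one of the attributes against which the latent components can later be correlated.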
Figure 3 shows decoded images of interpolated points in the latent space, in the case of ellipses. Table 1 shows the numerical evaluation based on the absolute PCC between the attributes of the ellipses and the three components of the trained latent space. We observe that the latent space of our PCAAE corresponds to three principal attributes of ellipses: the area (A), the ratio of the two diameters along the vertical and horizontal directions (R1), and the ratio of the two diameters along the diagonal directions (R2). The other disentanglement methods also create a meaningful latent space, whereas AE and VAE learn a latent space where the intrinsic parameters of the ellipses are mixed up. While these are not the parameters with which we created the images (indeed, the autoencoder has absolutely no way of knowing what representation to choose, and we cannot impose one), they are indeed independent: for a given area, the ratio between the axes is an independent parameter, and vice versa. This gives us a way to interpolate in the latent space in a meaningful manner. These independent parameters are sufficient to describe the ellipse, and the hierarchical ordering of the axes makes them more interpretable and navigable than in the case of other methods. For more results, see the supplementary material.
5.3 Experimental setup and results of the PCAAE applied to the latent space of PGAN
To show the use of our PCAAE on more high-level data, we take a pre-trained model of PGAN karras2017progressive (PyTorch GAN zoo: https://github.com/facebookresearch/pytorch_GAN_zoo) trained with the CelebA dataset liu2015faceattributes . Note that the pre-trained generator is fixed during the training of our PCAAE. The latent space of PGAN is entangled (we show experiments to support this in the supplementary material), so that a variation along one parameter of the initial code in the latent space can modify several characteristics of the generated images. The latent space size of this pre-trained network is 512. An initial code, from which the network generates a photo-realistic image, is chosen. In order to create the set of random perturbations, we sample them from a Gaussian distribution.
In order to evaluate the disentanglement of the latent space of the other methods and of ours, we use pre-trained classifiers to determine the attributes of the generated images. We choose three main attributes which the Face++ classifiers can recognise well, corresponding to the head pose (i.e. turning left to right), hair colour (i.e. black, brunette and blond) and gender. Note that in the supplementary material, we also display the results of our PCAAE applied directly to the CelebA data. This gives very blurry results (since the task is very difficult, as mentioned above), similar to the results of β-VAE higgins2017beta (Figure 4 of their paper), which led us to our approach of applying the PCAAE to GANs.
We now show the results of our PCAAE for organising the latent space of the pre-trained PGAN karras2017progressive . To demonstrate the performance of our algorithm, we have trained the standard AE, the aforementioned VAE-based methods and our proposed PCAAE, using the procedure described in Section 4. Table 2 shows the numerical evaluation of the methods and Figure 4 shows the images generated by the PGAN generator from the latent spaces of the other approaches and of our proposed PCAAE. The other methods construct a latent space where the attributes of the generated images are correlated with more than one component. For example, we can see that the latent space of the AE mixes up the attributes. In addition, it can be seen that the fourth component of β-TCVAE controls the hair colour and the gender of the generated images simultaneously. Similarly, both the first and the third components of FactorVAE change the head pose; indeed, the absolute PCC of this model for the head pose is high for both the first and third components of the latent space. Our proposed PCAAE yields a disentangled latent space which is organised in a hierarchical fashion: the first component corresponds to the hair colour of the generated images, the second represents the head pose (e.g. turning left and right), the third corresponds to hair thickness and the last is mildly correlated with skin tone. Our PCAAE is able to efficiently separate the different facial attributes and rank them according to their importance in the reconstruction. Thus, the latent space created by our method is easier to interpret and navigate than the original GAN latent space.
We highlight that this procedure can be applied to any pre-trained model, so that the disentangling and organisation of the latent space can be carried out after the initial, computationally expensive, training of a GAN.
In this paper, we have presented a novel autoencoder where the latent space is organised according to decreasing importance, and where these components are statistically independent. We refer to this network as a Principal Component Analysis Autoencoder - PCAAE. The autoencoder is trained with latent spaces of increasing sizes to ensure that we capture the properties of the data in decreasing order of importance, in an unsupervised manner. Furthermore, we have imposed statistical independence of the latent components by employing a covariance loss term, which we add to the standard autoencoder cost, to encourage a disentangled latent space. We have used synthetic data to illustrate that the PCAAE learns a latent space which is interpretable and which can be interpolated in a meaningful manner with respect to the properties inherent in the data. We have applied our autoencoder to high quality face data, and have shown that this efficiently disentangles the latent space of a powerful pre-trained GAN by projecting it to another smaller, interpretable, latent space. The resulting model can manipulate one facial attribute on each component. Furthermore, the proposed method can be applied to any pre-trained generative model, so that the initial time-consuming training of a powerful model and the organisation of its latent space can be carried out separately. We hope that this work will contribute to the interpretation and manipulation of latent spaces of complex data.
This work was funded by the DIGICOSME project.
- (1) Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: International Conference on Learning Representations. (2014)
- (2) Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with pixelcnn decoders. In: Advances in Neural Information Processing Systems. (2016) 4790–4798
- (3) Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. In: Advances in Neural Information Processing Systems. (2016) 3738–3746
- (4) Tolstikhin, I., Bousquet, O., Gelly, S., Schoelkopf, B.: Wasserstein auto-encoders. In: International Conference on Learning Representations. (2018)
- (5) Huang, H., He, R., Sun, Z., Tan, T., et al.: Introvae: Introspective variational autoencoders for photographic image synthesis. In: Advances in Neural Information Processing Systems. (2018) 52–63
- (6) Heljakka, A., Solin, A., Kannala, J.: Towards photographic image manipulation with balanced growing of generative autoencoders. In: The IEEE Winter Conference on Applications of Computer Vision. (2020) 3120–3129
- (7) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. (2014) 2672–2680
- (8) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in Neural Information Processing Systems. (2016) 2234–2242
- (9) Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. In: International Conference on Learning Representations. (2018)
- (10) Srivastava, A., Valkov, L., Russell, C., Gutmann, M.U., Sutton, C.: Veegan: Reducing mode collapse in gans using implicit variational learning. In: Advances in Neural Information Processing Systems. (2017) 3308–3318
- (11) Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. (2019)
- (12) Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. (2018)
- (13) Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems. (2017) 465–476
- (14) Pham, C.H., Tor-Díez, C., Meunier, H., Bednarek, N., Fablet, R., Passat, N., Rousseau, F.: Simultaneous super-resolution and segmentation using a generative adversarial network: Application to neonatal brain MRI. In: International Symposium on Biomedical Imaging. (2019) 991–994
- (15) Kingma, D.P., Mohamed, S., Rezende, D.J., Welling, M.: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems. (2014) 3581–3589
- (16) Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: International Conference on Machine Learning. (2014) 1431–1439
- (17) Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., LeCun, Y.: Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems. (2016) 5040–5048
- (18) Siddharth, N., Paige, B., Van de Meent, J.W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., Torr, P.: Learning disentangled representations with semi-supervised deep generative models. In: Advances in Neural Information Processing Systems. (2017) 5925–5935
- (19) Denton, E.L., et al.: Unsupervised learning of disentangled representations from video. In: Advances in Neural Information Processing Systems. (2017) 4414–4423
- (20) Hsu, W.N., Zhang, Y., Glass, J.: Unsupervised learning of disentangled and interpretable representations from sequential data. In: Advances in Neural Information Processing Systems. (2017) 1878–1889
- (21) Ranzato, M., Boureau, Y.L., Cun, Y.L.: Sparse feature learning for deep belief networks. In: Advances in Neural Information Processing Systems. (2008) 1185–1192
- (22) Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. (2011)
- (23) Makhzani, A., Frey, B.: K-sparse autoencoders. In: International Conference on Learning Representations. (2014)
- (24) Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: International Conference on Learning Representations. (2016)
- (25) Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning. (2017) 214–223
- (26) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. In: Advances in Neural Information Processing Systems. (2017) 5767–5777
- (27) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems. (2016) 2172–2180
- (28) Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier gans. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org (2017) 2642–2651
- (29) Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2image: Conditional image generation from visual attributes. In: European Conference on Computer Vision, Springer (2016) 776–791
- (30) Mukherjee, S., Asnani, H., Lin, E., Kannan, S.: Clustergan: Latent space clustering in generative adversarial networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. Volume 33. (2019) 4610–4617
- (31) Delannoy, Q., Pham, C.H., Cazorla, C., Tor-Díez, C., Dollé, G., Meunier, H., Bednarek, N., Fablet, R., Passat, N., Rousseau, F.: SegSRGAN: Super-resolution and segmentation using generative adversarial networks: Application to neonatal brain MRI. Computers in Biology and Medicine (2020) 103755
- (32) Rifai, S., Bengio, Y., Courville, A., Vincent, P., Mirza, M.: Disentangling factors of variation for facial expression recognition. In: European Conference on Computer Vision, Springer (2012) 808–822
- (33) Cheung, B., Livezey, J.A., Bansal, A.K., Olshausen, B.A.: Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583 (2014)
- (34) Kumar, A., Sattigeri, P., Balakrishnan, A.: Variational inference of disentangled latent concepts from unlabeled observations. In: International Conference on Learning Representations. (2018)
- (35) Lezama, J.: Overcoming the disentanglement vs reconstruction trade-off via jacobian supervision. In: International Conference on Learning Representations. (2019)
- (36) Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., et al.: Fader networks: Manipulating images by sliding attributes. In: Advances in Neural Information Processing Systems. (2017) 5967–5976
- (37) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: β-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations. (2017)
- (38) Burgess, C.P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., Lerchner, A.: Understanding disentangling in β-VAE. In: NIPS Workshop on Learning Disentangled Representations. (2018)
- (39) Kim, H., Mnih, A.: Disentangling by factorising. In: International Conference on Machine Learning. (2018)
- (40) Chen, T.Q., Li, X., Grosse, R.B., Duvenaud, D.K.: Isolating sources of disentanglement in variational autoencoders. In: Advances in Neural Information Processing Systems. (2018) 2610–2620
- (41) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
- (42) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations. (2015)
- (43) Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision. (2015)
- (44) https://www.faceplusplus.com/: Face++ cognitive services
- (45) Ladjal, S., Newson, A., Pham, C.H.: A PCA-like autoencoder. arXiv preprint arXiv:1904.01277 (2019)
8.1 Extension of the PCAAE as a generative model: PCAWAE
Recently, the Wasserstein autoencoder (WAE) has been proposed as a new algorithm for building a latent-variable generative model tolstikhin2018wasserstein. The WAE proposes to use the Kantorovich-Rubinstein duality arjovsky2017wasserstein as an adversarial objective on the latent space. In this work, we apply the GAN-based penalty of tolstikhin2018wasserstein in order to extend our PCAAE into a generative model.
Let us consider a prior distribution (e.g. a Gaussian distribution) and a random code sampled from this prior. In order to match the latent space of the proposed PCAAE with the prior distribution, we apply the GAN-based penalty of tolstikhin2018wasserstein using a set of discriminators. Concretely, each discriminator attempts to distinguish the random codes sampled from the prior from the latent codes generated by the autoencoder, by ascending the following loss function:
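As a sketch, assuming the standard GAN-based penalty of tolstikhin2018wasserstein, the objective of the discriminator at latent space size $k$ can be written as below. The notation is introduced here for illustration and may differ from that of the main paper: $P_Z$ is the prior, $\tilde{z}_i \sim P_Z$ are $k$-dimensional sampled codes, $z_i^{(k)}$ denotes the first $k$ components of the code of the $i$-th training sample, $D^{(k)}$ is the corresponding discriminator and $n$ is the batch size.

```latex
\mathcal{L}_{D}^{(k)} = \frac{1}{n} \sum_{i=1}^{n}
  \left[ \log D^{(k)}\big(\tilde{z}_i\big)
       + \log\left(1 - D^{(k)}\big(z_i^{(k)}\big)\right) \right]
```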
Meanwhile, the objective of the autoencoder is to minimise three loss functions: the reconstruction loss, the covariance loss and the adversarial loss, as follows:
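One plausible form of this combined objective is sketched below. The weights $\lambda_{\mathrm{cov}}$ and $\lambda_{\mathrm{adv}}$, the non-saturating adversarial term and the symbols themselves are assumptions made for illustration: $D^{(k)}$ is the discriminator at latent space size $k$, $z_i^{(k)}$ denotes the first $k$ components of the code of the $i$-th training sample, and $n$ is the batch size.

```latex
\mathcal{L}_{AE}^{(k)} = \mathcal{L}_{\mathrm{rec}}
  + \lambda_{\mathrm{cov}} \, \mathcal{L}_{\mathrm{cov}}
  + \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}},
\qquad
\mathcal{L}_{\mathrm{adv}} = -\frac{1}{n} \sum_{i=1}^{n} \log D^{(k)}\big(z_i^{(k)}\big)
```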
Thus, the min-max game between the discriminator and the autoencoder attempts to impose the prior distribution on the first components of the latent space of the PCAAE. When the maximum latent space size is reached, the whole latent space of the autoencoder is matched with the prior. We refer to this method as PCAWAE.
8.2 Limitations and future work
One limitation of our algorithm is that we increase the latent space size by one at each step. This can be problematic in cases where the autoencoder needs a certain amount of freedom to learn a useful representation. We could therefore consider increasing the latent space by small packets of components, to give the autoencoder the freedom it needs. A second limitation is that the norm used in the reconstruction loss is not optimal for defining the importance of a latent component. Indeed, in the case of the CelebA dataset, as shown in Figure 6, applying the PCAAE directly to the image data leads to very blurry results. Replacing this norm-based reconstruction loss with an alternative perceptual metric could provide better results. Finally, a PCAAE trained in one region of a GAN latent space is not necessarily valid in another region. A future challenge will be to create a PCAAE which is applicable to the whole latent space of the GAN.
8.3 Architecture of the proposed methods
The pseudo-code for our algorithms can be seen in Algorithm 1 and Algorithm 2. Note that in this pseudo-code we have used standard gradient descent, but any gradient-descent-based algorithm can be used (we used Adam kingma2014adam).
For image data, our autoencoder is a simple CNN with Leaky ReLUs and strided convolutions. The input images are of size 64×64, and the 6 layers of the encoder reduce the spatial size to 1. All kernels are of size 4. The decoder is symmetrical to the encoder, except that its number of features is multiplied by the current size of the latent space. Note that, for the purposes of fair comparison with other approaches, we used the same architecture for the decoder of all methods.
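The reduction from a 64×64 input to a spatial size of 1 in 6 layers can be checked with the standard convolution output-size formula. The sketch below assumes a stride of 2 and a padding of 1, which are not stated in the text; only the kernel size of 4, the input size and the number of layers come from the description above.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after one strided convolution (standard formula).

    stride=2 and padding=1 are assumptions; kernel=4 is from the text.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_spatial_sizes(input_size=64, n_layers=6):
    """Spatial size of the feature maps after each encoder layer."""
    sizes = [input_size]
    for _ in range(n_layers):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes


print(encoder_spatial_sizes())  # [64, 32, 16, 8, 4, 2, 1]
```

With these assumed settings, each layer halves the spatial size, so exactly 6 layers are needed to go from 64 to 1.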
For the latent variable, our discriminators contain 5 fully connected layers with Leaky ReLUs. The numbers of features of the first discriminator are (8,8,8,8,1), and the last activation is a sigmoid function. For each subsequent discriminator, the numbers of features of the first network are multiplied by the current size of the latent space.
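The scaling of the discriminator widths can be illustrated as follows. This is a sketch under two assumptions that the text does not state explicitly: only the hidden widths are scaled (the final sigmoid output is kept scalar), and the discriminator for latent size k takes a k-dimensional code as input. The function names are hypothetical.

```python
BASE_WIDTHS = (8, 8, 8, 8, 1)  # widths of the first discriminator, from the text


def discriminator_widths(latent_size, base=BASE_WIDTHS):
    """Layer widths of the discriminator used at a given latent space size.

    Assumption: hidden widths are multiplied by latent_size, while the
    last layer keeps a single output for the sigmoid.
    """
    hidden = [w * latent_size for w in base[:-1]]
    return hidden + [base[-1]]


def layer_shapes(latent_size):
    """(in_features, out_features) of each fully connected layer.

    Assumption: the input is the latent_size-dimensional code.
    """
    dims = [latent_size] + discriminator_widths(latent_size)
    return list(zip(dims[:-1], dims[1:]))


for k in (1, 2, 3):
    print(k, discriminator_widths(k))
```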
In the case of the manipulation of the latent space of PGAN, we use fully connected layers. Indeed, we are applying the PCAAE directly to a latent space, so convolutions are not appropriate in this case. The number of layers is 2 and, again, the decoder is symmetrical to the encoder. The number of features in the case of the latent space of PGAN can be seen in Table 3.
8.4 Results of the proposed methods for synthetic data
We show the evaluations of the compared methods, namely AE, VAE Kingma2014Auto, WAE tolstikhin2018wasserstein, β-VAE higgins2017beta, the β-VAE variant of burgess2018understanding, FactorVAE kim2018disentangling, β-TCVAE chen2018isolating and our proposed PCAAE, in Table 4 and Figure 5. For a fair comparison, we use the same architecture for all decoders. Please see the attached code for more details. One can see that the first dimension of our PCAAE always controls the area of the reconstructed ellipse. The last two components correspond to the orientation of the ellipse. The illustrations show that any two dimensions of our latent space act independently of each other. In any column of the grids shown in Figure 4 of our main paper, the effect of travelling up and down the axis is the same as in any other column (and similarly for the rows). Note that all unsupervised methods for learning disentangled representations, such as β-VAE, FactorVAE, β-TCVAE and our proposed methods, systematically use a single component of the latent space for the area of the ellipses.
Table 4 also shows an ablation study which compares the PCAAE with baselines such as a standard AE, the WAE and our PCAAE with no covariance loss. We can see that more than one component of the latent space of the AE, the WAE and the PCAAE with no covariance loss controls the area of the ellipses. In the case of the PCAAE with no covariance loss, the first and the third components of its latent space both correspond to the area attribute. This confirms the need for the proposed covariance loss.
Thus, our method has efficiently organised and disentangled the latent space of the ellipses. This can be tested in our demo code.
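To make the role of the covariance loss in this ablation concrete, a minimal NumPy sketch is given below. The exact definition is given in the main paper and is not reproduced in this section; the version here assumes that the loss penalises the squared covariance between the most recently added latent component and each earlier component, computed on standardised codes over a batch. The function name is hypothetical.

```python
import numpy as np


def covariance_loss(codes, eps=1e-8):
    """Sum of squared covariances between the last latent component and
    the earlier ones (assumed form, for illustration only).

    codes: array of shape (batch_size, latent_size).
    """
    # Standardise each component over the batch (zero mean, unit variance).
    z = codes - codes.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + eps)
    new, old = z[:, -1], z[:, :-1]
    cov = old.T @ new / z.shape[0]  # one covariance per earlier component
    return float(np.sum(cov ** 2))


# Two uncorrelated components: the loss vanishes.
uncorrelated = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
# Two identical components: the loss is close to 1.
duplicated = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]])
print(covariance_loss(uncorrelated), covariance_loss(duplicated))
```

A loss of this kind is zero when the new component is uncorrelated with the previous ones, and grows when two components encode the same attribute, as observed for the area of the ellipses without the covariance term.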
8.5 Results of PCAAE for CelebA dataset
In Figure 6, we show the results of the PCAAE applied directly to images from the CelebA dataset. We can see that, while the PCAAE correctly organises the latent space (changing the average colour of the images in the first latent space component, for example), the results are overly smoothed. This is due to the complex nature of the CelebA dataset. Therefore, we found that a better approach was to apply the PCAAE to the latent space of a pre-trained GAN which is known to produce reliable results. This approach is described in Section 4 “PCAAE for GAN” of our main paper.
8.6 Interpolation in the original (entangled) latent space of PGAN
In Figure 7 we show several examples of interpolation in the original latent space of PGAN, before applying our PCAAE. We visualise images generated by PGAN while varying one latent component at a time. We can see that it is difficult to interpret this latent space. For example, a blond woman appears when varying both the first and the last components of the latent space, and the woman in the generated images changes the pose of her head when either the second or the last component is varied. It is clear that this latent space is heavily entangled, with several characteristics modified by changing one component. This makes it difficult to understand, navigate and manipulate the latent space. Addressing these problems is precisely the goal of the present work.
8.7 Further examples of navigating the PGAN latent space
We report a further comparison of the methods in Table 5 and Figure 8. We can see that several methods use two or more components of the latent space to control one attribute. The proposed PCAAE and its extension PCAWAE use only one component for each attribute. For instance, they use the first component for hair colour, the second one for head pose and the third one for gender.
We show another example of the automatic navigation of the latent space of the PGAN in Figure 9. It is generated by training our PCAAE around the code generating the image in the middle of the three grids. We can see that, for this example, the first component corresponds to the hair colour, from black to blond, the second one controls the head pose and the third one changes the gender.
In order to better visualise the results of the proposed method, we adjust the two components which correspond to hair colour and head pose for images generated from the initial training code, as shown in the first row of each sub-figure of Figure 10. Then, we apply the trained model to other initial codes of the latent space of PGAN. We can see that the attributes of the images generated from the test codes change in the same way as those of the training code (shown in the last rows).
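The navigation procedure described above can be sketched as follows: a GAN latent code is encoded by the PCAAE, one component of the resulting code is swept over a range of values, and each edited code is decoded back to a GAN latent code, which would then be passed to the generator. The sketch is illustrative only. The trained PCAAE encoder and decoder are replaced by a toy orthonormal projection, the latent dimension of 512 is an assumption, and all names are hypothetical.

```python
import numpy as np


def traverse_component(w, encode, decode, component, values):
    """GAN latent codes obtained by sweeping one PCAAE component.

    w: GAN latent code; encode/decode: PCAAE encoder and decoder.
    """
    z = encode(w)
    edited = []
    for v in values:
        z_edit = z.copy()
        z_edit[component] = v  # change a single component, keep the others
        edited.append(decode(z_edit))
    return np.stack(edited)


# Toy stand-ins for a trained PCAAE (assumption): an orthonormal basis
# mapping a 512-D GAN code to 3 components and back.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(512, 3)))


def encode(w):
    return basis.T @ w


def decode(z):
    return basis @ z


w0 = rng.normal(size=512)  # stands in for an initial GAN latent code
values = np.linspace(-2.0, 2.0, 5)
codes = traverse_component(w0, encode, decode, component=0, values=values)
print(codes.shape)  # (5, 512)
```

Applying the same function to a different initial code, with the same encoder and decoder, corresponds to the test setting of Figure 10.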
Keep in mind that these results are obtained in a completely unsupervised manner, guided solely by a norm-based reconstruction loss. As stated in Section 6 “Limitations and future works” of our main paper, it remains a challenge to find a universal PCAAE that could be used around each possible input of a GAN while maintaining the meaning of each dimension.