Official Tensorflow implementation of the paper "Y-Autoencoders: disentangling latent representations via sequential-encoding"
In the last few years there have been important advancements in generative models with the two dominant approaches being Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). However, standard Autoencoders (AEs) and closely related structures have remained popular because they are easy to train and adapt to different tasks. An interesting question is whether we can achieve state-of-the-art performance with AEs while retaining their good properties. We propose an answer to this question by introducing a new model called Y-Autoencoder (Y-AE). The structure and training procedure of a Y-AE enclose a representation into an implicit and an explicit part. The implicit part is similar to the output of an autoencoder and the explicit part is strongly correlated with labels in the training set. The two parts are separated in the latent space by splitting the output of the encoder into two paths (forming a Y shape) before decoding and re-encoding. We then impose a number of losses, such as a reconstruction loss and a loss on the dependence between the implicit and explicit parts. Additionally, the projection in the explicit manifold is monitored by a predictor that is embedded in the encoder and trained end-to-end with no adversarial losses. We provide significant experimental results on various domains, such as separation of style and content, image-to-image translation, and inverse graphics.
In this article we present a new training procedure for conditional autoencoders (cAE) [3, 2] that allows a standard cAE to obtain remarkable results in multiple conditional tasks. We call the resulting model Y-Autoencoder (Y-AE), where the letter Y is a reference to the particular branching structure used at training time. Y-AEs generally represent explicit information via discrete latent units, and implicit information via continuous units.
Consider the family of generative models where an input $x$ is conditioned on some paired explicit information $e$, such that we can encode the input through the function $f$ and then decode via $g$ to generate new samples $\hat{x}$. The explicit information can be any additional information about the inputs, such as labels, tags, or group assignments. We may replace and
While cAEs have met with some success, they often struggle to disentangle the latent representation. In other words, the fitting procedure often ignores the explicit information, since there is no effective regularization to enforce its effect. For this reason, recent work has mainly tackled conditional generation through Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The former rely on a probabilistic approach that can capture relevant properties of the input space and constrain them inside a latent distribution via variational inference. The latter are based on a zero-sum learning rule that simultaneously optimizes a generator and a discriminator until convergence. Both VAEs and GANs can be conditioned on the explicit information to generate new samples. Recent work in this direction has explored facial attribute generation, natural image descriptions, people in clothing, and video frame prediction.
However, both VAEs and GANs suffer from a variety of problems. GANs are notoriously difficult to train, and may suffer from mode collapse when the state space is implicitly multimodal. VAEs rarely include discrete units because backpropagation cannot be applied through those layers (discrete sampling operations create discontinuities, giving the objective function zero gradient with respect to the weights), which prevents the use of categorical variables to represent discrete properties of the input space. Both have difficulty exploiting rich prior problem structures, and must attempt to discover these structures by themselves, leaving valuable domain knowledge unused.
The Y-AE provides a flexible method to train standard cAEs that avoids these drawbacks and can compete with specialized approaches. As there is no structural change, the Y-AE simply becomes a cAE at test time, and it is possible to assign values to the discrete units in the explicit layer whilst keeping the implicit information unchanged. It is important to notice that for a Y-AE the definition of implicit and explicit is very broad. The explicit information can either be the label assigned to each element of the dataset, or a weak label that just identifies a group assignment.
The contribution of this article can be summarized in the following points:
Our core contribution is a new deep architecture called Y-AE, a conditional generative model that can effectively disentangle implicit and explicit information.
We define a new training procedure based on the sequential-encoding of the reconstruction, which exploits weight sharing to enforce latent disentanglement. This procedure is generic enough to be used in other contexts or merged with other methods.
We perform quantitative and qualitative experiments to verify the effectiveness of Y-AEs and the possibility of using them in a large variety of domains with minimal adjustments.
We provide the open source code to reproduce the experiments: https://github.com/mpatacchiola/Y-AE
Autoencoders. Deep convolutional inverse graphics networks (DC-IGNs)  use a graphics code layer to disentangle specific information (e.g. pose, light). DC-IGNs are bounded by the need to organize the data into two sets of mini-batches, the first corresponding to changes in only a single extrinsic variable, the second to changes in only the intrinsic properties. In contrast Y-AEs do not require such a rigid constraint. Deforming Autoencoders  disentangle shape from appearance through the use of a deformable template. A spatial deformation warps the texture to the observed image coordinates.
The beta-VAE has shown state-of-the-art results in disentangled factor learning through the adoption of a hyperparameter $\beta$ that balances latent capacity and independence constraints. A cycle-consistent VAE has also been proposed. This VAE is based on the minimization of the changes applied in a forward and reverse transform, given a pair of inputs.
Adversarial. Adversarial autoencoders (adversarial-AE) achieve disentanglement by matching the aggregated posterior of the hidden code vector with an arbitrary prior distribution and using a discriminator trained with an adversarial loss. In a related approach, the disentangled representation is learned through mixing and unmixing feature chunks in latent space with the supervision of an adversarial loss. The authors also use a form of sequential encoding that has some similarities with the one we propose. However, the key difference is that in a Y-AE part of the latent information is explicit and controllable, whereas in their method it is not.
GANs. A conditional form of GAN has been introduced, constructed such that the explicit information is passed to both the generator and the discriminator. InfoGAN is an information-theoretic extension of GANs that aims at performing unsupervised disentanglement. CycleGAN is a type of GAN that concurrently optimizes two mapping functions and two discriminators through an adversarial loss. Unlike CycleGANs, Y-AEs rely on a single network, and the consistency is ensured in the latent space with the aim of maximizing the distance between domains in the image space.
We define an autoencoder as a neural network consisting of two parts: encoder and decoder. The encoder is a function $f$ performing a non-linear mapping from an input $x$ to a latent representation $h$. We refer to this latent representation as the code. The decoder is a function $g$ performing a non-linear mapping from the latent representation to a reconstruction $\hat{x}$. Encoder and decoder each have their own parameters; these are omitted in the rest of the article to keep the notation uncluttered. Parameters are adjusted during an online training phase via stochastic gradient descent, minimizing the mean squared error between the input and the reconstruction on a random mini-batch of samples. The network is designed such that the code has fewer dimensions than the input, forming a bottleneck. This ensures that if the reconstruction error is small, $h$ must be a compressed version of $x$.
This article focuses on the particular case where we have access to some label information $e$, and have divided the latent representation into two parts $h = (i, \hat{e})$, where $i$ stands for implicit and $\hat{e}$ stands for explicit. The distinction between the two is that the explicit information $\hat{e}$ should be approximately equal to $e$, whereas the implicit information $i$ should be independent of it. We denote a decoder which takes a separable hidden state as input by $g(i, e)$. The label may take the form of a one-hot vector (in a classification setting) or a vector of real values (in a regression setting).
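As a minimal sketch of this notation, the NumPy snippet below uses a single linear layer as a stand-in for the encoder and the decoder and splits the code into an implicit and an explicit part. The layer sizes, the function names `encode` and `decode`, and the random weights are illustrative assumptions, not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_IMPLICIT, N_EXPLICIT = 784, 32, 10  # hypothetical sizes (code is smaller than input)

# Random weights standing in for trained parameters (assumption).
W_enc = rng.normal(0.0, 0.01, size=(N, N_IMPLICIT + N_EXPLICIT))
W_dec = rng.normal(0.0, 0.01, size=(N_IMPLICIT + N_EXPLICIT, N))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

def encode(x):
    """Map inputs to (implicit code, predicted explicit code)."""
    h = x @ W_enc
    i = 1.0 / (1.0 + np.exp(-h[:, :N_IMPLICIT]))  # sigmoid on the implicit units
    e_hat = softmax(h[:, N_IMPLICIT:])            # softmax on the explicit units
    return i, e_hat

def decode(i, e):
    """Reconstruct inputs from a separable hidden state (i, e)."""
    return np.concatenate([i, e], axis=1) @ W_dec

x = rng.random((4, N))
i, e_hat = encode(x)
x_hat = decode(i, e_hat)
```

Keeping the two parts of the code as separate return values makes it straightforward to swap the explicit part for a chosen label before decoding.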
The encoding phase of a Y-AE is identical to a standard cAE, but the reconstruction is quite different as it is divided into two branches, left and right. These two branches share the same weights, similarly to a siamese network. Thus, the Y-AE requires no more parameters than its cAE counterpart (see Figure 1).
The implicit information $i$ produced by the encoder is given as input to the two branches, whereas the explicit information $\hat{e}$ is discarded. In its place, the left branch takes as input the actual label $e$ and the right branch takes as input a random label $e_r$. The decoding phase produces two reconstructions $\hat{x}_L$ and $\hat{x}_R$, where the subscript $L$ specifies the left branch and the subscript $R$ the right branch. At this point we have two images which have identical implicit representations but different explicit ones.
The encoding stage is then applied to the two branches (a process we call sequential-encoding), producing two latent representations: $i_L$, $\hat{e}_L$ for the left branch and $i_R$, $\hat{e}_R$ for the right branch. The sequential encoding is used to verify that the implicit representations are not altered when only the explicit representation changes. In addition, the right branch ensures that the explicit information is not also hidden in the implicit data, since it must be able to propagate through (see Figure 2). It is important to notice that the sequential-encoding is only applied at training time, as shown in Figure 1. The second encoding stage concludes the forward pass. The backward pass is based on the simultaneous optimization of multiple loss functions and is described in the next section.
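The forward pass described above can be sketched as follows. This is only an illustration under stated assumptions: the encoder and decoder are single linear layers with random weights, the sizes are arbitrary, and the names `yae_forward`, `encode` and `decode` are hypothetical. Note that a uniformly drawn random label may occasionally coincide with the true one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_I, N_E = 64, 8, 10  # hypothetical input, implicit and explicit sizes

W_enc = rng.normal(0.0, 0.1, size=(N, N_I + N_E))  # stand-in encoder weights
W_dec = rng.normal(0.0, 0.1, size=(N_I + N_E, N))  # stand-in decoder weights

def encode(x):
    h = x @ W_enc
    i = 1.0 / (1.0 + np.exp(-h[:, :N_I]))                  # sigmoid, implicit part
    z = h[:, N_I:] - h[:, N_I:].max(axis=1, keepdims=True)
    e_hat = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax, explicit part
    return i, e_hat

def decode(i, e):
    return np.concatenate([i, e], axis=1) @ W_dec

def yae_forward(x, labels):
    """One training-time forward pass with two branches (sketch)."""
    e = np.eye(N_E)[labels]                                    # true labels, one-hot
    e_r = np.eye(N_E)[rng.integers(0, N_E, size=len(labels))]  # random labels
    i, e_hat = encode(x)        # first encoding; e_hat is not passed to the decoder
    x_L = decode(i, e)          # left branch: true label
    x_R = decode(i, e_r)        # right branch: random label
    i_L, e_hat_L = encode(x_L)  # sequential encoding with the same encoder weights
    i_R, e_hat_R = encode(x_R)
    return dict(e=e, e_r=e_r, e_hat=e_hat, x_L=x_L, x_R=x_R,
                i_L=i_L, i_R=i_R, e_hat_L=e_hat_L, e_hat_R=e_hat_R)

out = yae_forward(rng.random((5, N)), np.array([0, 3, 3, 7, 9]))
```

Because both branches call the same `encode` and `decode`, the sketch uses no more parameters than a plain conditional autoencoder.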
The loss function consists of four separate components.
Firstly, in the left branch of a Y-AE the label is assigned explicitly, replacing the component inferred by the encoder. This is done to avoid instability in the preliminary learning phase, when the classifier predictions are still inaccurate. To ensure appropriate reconstructions following this, we penalize deviations between the input $x$ and the left reconstruction $\hat{x}_L$, using the standard least-squared error reconstruction loss:

$$\mathcal{L}_r = \lVert x - \hat{x}_L \rVert_2^2 \tag{1}$$
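Secondly, we include a computationally cheap cross-entropy loss penalty between the label $e$ and the explicit component $\hat{e}$ inferred by the encoder,

$$\mathcal{L}_c = -\sum_{k} e_k \log \hat{e}_k \tag{2}$$

This is done because this particular part of the output of the encoder can be considered as the output of a predictor, identifying which type of explicit content is present in the input $x$.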
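This predictor-like aspect is used directly in the right branch of the Y-AE, where it is deployed to verify that the explicit code $\hat{e}_R$ recovered by the second encoding is consistent with the random label $e_r$ given to the decoder. This is ensured using a third, cross-entropy loss:

$$\mathcal{L}_e = -\sum_{k} e_{r,k} \log \hat{e}_{R,k} \tag{3}$$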
Finally, a sequential-encoding is also performed on the left branch. The resulting implicit vector $i_L$ can be compared with its right counterpart $i_R$. Since the implicit information has not been manipulated, it should be consistent in the two branches. This constraint can be added as a Euclidean distance penalty:

$$\mathcal{L}_i = \lVert i_L - i_R \rVert_2 \tag{4}$$
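The losses defined above are then integrated in the global loss function

$$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_c + \lambda_e \mathcal{L}_e + \lambda_i \mathcal{L}_i \tag{5}$$

where $\mathcal{L}_r$, $\mathcal{L}_c$, $\mathcal{L}_e$ and $\mathcal{L}_i$ are the reconstruction, classification, explicit and implicit losses, and the relative contribution of the explicit and implicit losses can be controlled by altering $\lambda_e$ and $\lambda_i$ respectively. Note that the reconstruction and classification losses have not been given similar weightings, since the first is the main reconstruction objective, and the second only acts in support of the explicit loss (which is already accounted for). An ablation study of the effect of altering $\lambda_e$ and $\lambda_i$ is presented in the experimental section (Section 3.2).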
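To make the composition of the four components concrete, the following NumPy sketch computes a reconstruction term (squared error on the left branch), two cross-entropy terms, and a Euclidean distance between the two re-encoded implicit codes, then combines them with two weights. The function names, the averaging over the mini-batch, and the default weights of 1.0 are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    return float(-np.mean(np.sum(target * np.log(pred + eps), axis=1)))

def yae_loss(x, x_L, e, e_hat, e_r, e_hat_R, i_L, i_R, lambda_e=1.0, lambda_i=1.0):
    loss_r = float(np.mean(np.sum((x - x_L) ** 2, axis=1)))     # reconstruction, left branch
    loss_c = cross_entropy(e, e_hat)                             # classification, first encoding
    loss_e = cross_entropy(e_r, e_hat_R)                         # explicit, right branch
    loss_i = float(np.mean(np.linalg.norm(i_L - i_R, axis=1)))   # implicit, both branches
    total = loss_r + loss_c + lambda_e * loss_e + lambda_i * loss_i
    return total, (loss_r, loss_c, loss_e, loss_i)

# Tiny hand-made batch of two samples (hypothetical values).
total, parts = yae_loss(
    x=np.zeros((2, 4)), x_L=np.ones((2, 4)),
    e=np.eye(2), e_hat=np.eye(2),
    e_r=np.eye(2), e_hat_R=np.full((2, 2), 0.5),
    i_L=np.zeros((2, 3)), i_R=np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]),
)
```

Setting `lambda_e` or `lambda_i` to zero switches the corresponding term off, which is how an ablation of the explicit or implicit loss could be expressed with this function.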
In order to demonstrate the efficacy of the Y-AE training scheme, we use a straightforward autoencoder architecture. In Section 3.2, we remove various parts of the Y-AE structure to show that they are all necessary. In Section 3.3, we compare the Y-AE training method to some simpler baseline training methods. In Section 3.4, we evaluate the Y-AE structure on a variety of different tasks in a qualitative manner to show its applicability to a variety of domains.
The encoders used in these experiments are based on the principle of simultaneously halving the spatial domain whilst doubling the channel domain, as successfully used in prior work (the opposite is done in the decoding phase). Each network module is made of three consecutive operations: convolution (or transpose convolution in the decoder), batch normalization, and leaky-ReLU. Four such modules have been used for the smaller inputs, increasing to six for the larger inputs. Reduction (or augmentation) is performed via stride-2 convolution (or transpose convolution). No pooling operations have been used at any stage. The input images have been normalized to have continuous values. The sigmoid activation function has been used in the implicit portion of the code, and softmax in the explicit part. All the other units use a leaky-ReLU. The parameters have been initialized following the Glorot uniform initialization scheme. To stabilize the training in the first iterations, we initialized the parameters of the input to the implicit layer's sigmoid activation function by randomly sampling from a Gaussian distribution, with the bias set to a negative value such that the sigmoid is initially saturated toward zero. All the models have been trained using the Adam optimizer. The models have been implemented in Python using the Tensorflow library, and trained using a cluster of NVIDIA GPUs of the following families: TITAN-X, K-80, and GTX-1060. A detailed description of the network structure and hyperparameters is reported in the supplementary material.
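The halving and doubling principle can be illustrated with a few lines of plain Python that track the feature-map shape through a stack of stride-2 modules. The input size of 32, the starting channel count of 16, and the helper name `encoder_shapes` are hypothetical values chosen for the example; the sketch also assumes padding such that each stride-2 convolution exactly halves the spatial size.

```python
def encoder_shapes(size, channels, n_modules):
    """Return (height, width, channels) before and after each stride-2 module."""
    shapes = [(size, size, channels)]
    for _ in range(n_modules):
        size = size // 2         # stride-2 convolution halves height and width
        channels = channels * 2  # the channel count doubles at the same time
        shapes.append((size, size, channels))
    return shapes

# Four modules, as used for the smaller inputs (sizes here are assumptions).
shapes = encoder_shapes(size=32, channels=16, n_modules=4)
for height, width, depth in shapes:
    print(f"{height:>3} x {width:<3} x {depth}")
```

Reading the list in reverse gives the corresponding shapes for a decoder built from stride-2 transpose convolutions.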
In this experimental section we compare against different ablations of the full loss (Equation 5), to provide a deeper understanding of the results presented in Section 3.4. To do this we vary the mixing coefficients $\lambda_e$ and $\lambda_i$ that regulate the weight of the explicit (Equation 3) and the implicit (Equation 4) losses, systematically switching each one off (setting it to zero) or on. As such, we minimize neither of the losses ($\lambda_e = 0$, $\lambda_i = 0$), only the explicit loss ($\lambda_e > 0$, $\lambda_i = 0$), only the implicit loss ($\lambda_e = 0$, $\lambda_i > 0$), or both ($\lambda_e > 0$, $\lambda_i > 0$).
The focus of this section is a widely used benchmark: the Modified National Institute of Standards and Technology (MNIST) dataset. This dataset is composed of a training set of 60,000 greyscale handwritten digit images and a test set of 10,000 images. We aim to separate implicit information, such as the style of the digits (orientation, stroke, size, etc.), from explicit information, represented by the digit value (from 0 to 9). We used a Y-AE with 32 units in the implicit portion of the code and 10 units in the explicit portion. We trained the model for 100 epochs using the Adam optimizer and no weight decay. The results are the average of three runs for each condition.
An overview of the results is reported in Figure 3 and the average loss on the test set in Table 1. In the first condition (neither loss weighted; Figure 3-a) there are no constraints on the two losses, and the reconstruction loss reaches the test-set value reported in Table 1. This is achieved by exploiting the implicit portion of the code and ignoring the information carried by the explicit part (note the large disparity between implicit and explicit loss). The triplets of samples in the first column show that the explicit information has been ignored, as each displays three almost identical digits, indicating that the two branches of the Y-AE produce identical outputs. In the second condition (explicit loss only; Figure 3-b) only the explicit loss is regularized. This forces the output of the right branch to take account of the explicit information. However, as there is no regularization on the implicit portion of the code, the network learns to use it to carry the explicit information. The samples produced by the right branch show that the explicit content has been kept but the style partially corrupted. This condition is further investigated in Section 3.3. The third case (implicit loss only; Figure 3-c) only regularizes the implicit loss. The explicit loss rapidly diverges, indicating that the reconstruction on the right branch does not resemble the digit it ought to. The samples confirm this, showing the right reconstructions as more similar to the inputs than to the random content. Finally, the fourth and last condition (both losses; Figure 3-d) is the complete loss function with all the components being minimized. Both explicit and implicit losses rapidly converge toward zero, whereas the reconstruction loss decreases to the test-set value reported in Table 1. The samples produced clearly show that the style of the input (left digit) is kept and the content changed (right digit).
An overall comparison between all the conditions shows that the implicit loss (blue curve) acts as a regularizer, with the effect of inhibiting the reconstruction on the left branch. This is an expected result, since the implicit loss limits the capacity of the code and ensures that only the high-level information about the style is considered. A qualitative analysis of the samples shows that only the use of both explicit and implicit losses (Figure 3-d) guarantees the disentanglement of style and content, supporting our hypothesis about the functioning of the Y-AE. In particular, it is evident that the high-level style information has been correctly codified, with the generated samples incorporating the orientation, stroke, and size of the inputs.
In this section we compare the proposed method against different baselines on the MNIST dataset. In all cases, we encode the input image, change the explicit information (i.e. the number), then decode it to produce an image. We use a pre-trained classifier to test whether the generated images have the right appearance. Also, since changing the explicit information should change the digits, we test the similarity of the generated digits against the original digits using MSE (which should be high) and the perceptual structural similarity measure, SSIM (which should be low).
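As a rough illustration of the two similarity measures, the sketch below computes the MSE and a simplified SSIM that treats the whole image as a single window. The standard SSIM averages the same statistic over local windows, so this global variant is a stated simplification, and the function names and the unit data range are assumptions.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as a single window (simplification)."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants of the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
img = rng.random((28, 28))   # stand-in for an input digit
changed = 1.0 - img          # stand-in for a generated digit with different content
same_score = global_ssim(img, img)
diff_score = global_ssim(img, changed)
```

An identical pair gives the maximum SSIM and zero MSE, while a pair with different content gives a lower SSIM and a higher MSE, which is the direction expected when the digit value is changed.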
We train the autoencoder models using the same set-up described in Section 3.2. The evaluation has been performed by encoding all the inputs in the test set, extracting the implicit code, and randomly sampling (without replacement) six of the ten possible contents; the implicit code and each sampled content were then used to get the reconstruction. This procedure generated a dataset of input-label tuples six times larger than the original test set. For the evaluation classifier, we train an ensemble of five LeNet classifiers on the original dataset.
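The sampling step of this protocol can be sketched with NumPy as follows, assuming ten digit classes and six distinct contents per input (the reading consistent with a generated set six times larger than the test set). The helper name `sample_contents` and the toy number of inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES, N_SAMPLED = 10, 6

def sample_contents(n_inputs):
    """For each input, draw six distinct digit labels out of the ten available."""
    return np.stack([rng.choice(N_CLASSES, size=N_SAMPLED, replace=False)
                     for _ in range(n_inputs)])

n_inputs = 100                                            # toy test-set size
contents = sample_contents(n_inputs)                      # shape (100, 6)
input_index = np.repeat(np.arange(n_inputs), N_SAMPLED)   # input used by each sample
labels = contents.reshape(-1)                             # label requested for each sample
```

Each entry of `input_index` and `labels` then identifies one generated image: the implicit code of that input decoded together with that label, which the evaluation classifier should recognize as the requested digit.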
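Table 2: Quantitative comparison on the MNIST dataset between the baselines (including cAE + regularizer) and the Y-AE (including Y-AE + ablation [ours]).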
Conditional Autoencoders (cAEs). A cAE defines a conditional dependence on explicit information such that the reconstruction is conditioned on both the input and the labels. This can be considered as the main baseline, since a cAE has the same structure as a Y-AE but only relies on a standard MSE loss (see Equation 1) to minimize the distance between the inputs and the reconstructions.
cAE + regularization. We applied a series of regularizers to cAEs to push their performance. Strong regularization may enforce the disentanglement of style and content by limiting the amount of information codified in the latent space. We drastically reduced the number of epochs from 100 to 20 and the number of implicit units from 32 to 16, and we applied weight decay. None of these regularizers has been used in the Y-AE.
Adversarial-AE. We performed a comparison against an adversarial-AE. As the adversarial discriminator we used a multi-layer perceptron with 512 hidden units and leaky-ReLU. It was necessary to apply strong regularization in order to obtain decent results (8 units as code, 20 epochs, weight decay).
Y-AE + ablation. To check whether the Y-AE accuracy is just a result of the fact that it has been trained with a predictor in the loop, we tested against the ablated version of the model in which the explicit loss is kept and the implicit loss is switched off. This condition produces samples with consistent content, but the style can be partially corrupted (Section 3.2). We expect the accuracy to be lower than that of the Y-AE trained with the complete loss function, because in comparison the samples have lower quality.
The quantitative results are reported in Table 2, and the qualitative results in Figure 4. The accuracy of the Y-AEs is higher than that of most of the other methods, meaning that the samples carry the right content. As a result we observe that the SSIM is low and the MSE high, as expected when style and content are well separated. Conversely, the accuracy of standard cAEs is close to chance level because the model is producing the same digit (Figure 4-cAE) and ignoring the content information, meaning that only a small fraction of the produced samples are correct. Interestingly, the accuracy of Y-AEs with ablation is fairly high but inferior to the standard counterpart, with the samples showing stylistic artifacts caused by the ablation of the implicit loss. Strong regularization increases the performance of cAEs but the results are still far from both standard and ablated Y-AEs. The performance of the cVAE and beta-VAE is lower in SSIM and MSE when compared to the Y-AE, with beta-VAE being slightly better in terms of accuracy. The digits produced by the beta-VAE are clear but the style does not significantly change among the inputs (Figure 4-beta-VAE). This is due to the pressure imposed by $\beta$ on the Kullback-Leibler divergence, which moves the latent space closer to the Gaussian prior, resulting in low expressivity. In conclusion, the qualitative analysis of the samples (Figure 4) shows that Y-AEs are superior to other methods on the problem at hand (see Section 3.4 for additional samples).
The method has been evaluated in three ways in order to verify its performance on a wide set of problems. The first test is disentanglement of style and content, the second is pose generation (inverse graphics), and the third is unpaired image-to-image translation.
Disentanglement of style and content. This experiment shows how a Y-AE can be used to disentangle style and content. This is shown through two widely used datasets: MNIST and Street View House Numbers (SVHN). In this task, the implicit information is the style (orientation, height, width, etc.) and the explicit information is the content (digit value). The weights of the explicit and implicit losses were set separately for the MNIST and the SVHN datasets. We report some of the generated samples in Figure 5. On MNIST the implicit units have captured the most important underlying properties, such as orientation, size, and stroke. Similarly, on the SVHN dataset the model has been able to retain the explicit information and codify the salient properties (digit style, background and foreground colours) in the implicit portion of the code.
Pose generation (inverse graphics). Pose generation consists in producing a complete sequence of poses given a single frame of the sequence. This task is particularly challenging because relevant details of the object may be occluded in the input frame, and the network has to make a conjecture about any missing component. We tested the Y-AE on the 3D chairs dataset. This dataset contains 1393 rendered models of chairs. Each model has two sequences of 31 frames representing a 360 degree rotation around the vertical axis. Following a procedure similar to the one reported in prior work, we randomly selected 100 models and used them as the test set. We also preprocessed the images in a similar way: first we removed 50 pixels from each border, then we resized the images to a lower-resolution greyscale format using bicubic interpolation. The explicit representation has been encoded in the Y-AE using 31 units, one unit for each discrete pose. The implicit information has been encoded with 481 units and used to codify the properties of the chair model. Results are shown in Figure 6. Even though the network has never seen the test model before, at any orientation, it is able to generalize effectively and produce a full rotation of 360 degrees.
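Generating a full rotation at test time amounts to decoding one implicit code under every one of the 31 discrete poses. The sketch below illustrates this with a linear stand-in decoder; the 31 pose units and 481 implicit units follow the text, while the image size of 64 by 64, the random weights, and the names `decode` and `full_rotation` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_POSES, N_IMPLICIT = 31, 481
N_PIXELS = 64 * 64  # hypothetical flattened image size

# Random weights standing in for a trained decoder (assumption).
W_dec = rng.normal(0.0, 0.01, size=(N_IMPLICIT + N_POSES, N_PIXELS))

def decode(i, e):
    return np.concatenate([i, e], axis=1) @ W_dec

def full_rotation(i_single):
    """Decode one implicit code under every discrete pose."""
    poses = np.eye(N_POSES)                                 # one one-hot vector per pose
    i_rep = np.repeat(i_single[None, :], N_POSES, axis=0)   # same chair for every pose
    return decode(i_rep, poses)

frames = full_rotation(rng.random(N_IMPLICIT))  # one row per generated frame
```

Only the explicit units change between rows, so any difference between two generated frames is attributable to the requested pose.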
Unpaired image-to-image translation. The aim of this series of experiments is to verify how Y-AEs behave when the explicit information is only provided by a weak label, namely the group assignment. In unpaired training there are two separate sets of inputs and targets that do not overlap, meaning that samples belonging to one set are not present in the other and vice versa. The goal is to translate an image from one set to the other. Here we focus on three particular types of unpaired translation problems: male↔female, glasses↔no-glasses, and segmented↔natural. For the male↔female and glasses↔no-glasses tasks we used the CelebA dataset, a database with more than 200K celebrity images, each with 40 attribute annotations. The images cover large pose variations and have rich annotations (gender, eyeglasses, moustache, face shape, etc.). In the male↔female task we used an L1 penalty on the reconstruction, which generally gives sharper results. In the glasses↔no-glasses task we used an L2 reconstruction loss instead, so as to compare the quality of the samples with both losses. We used a neural network with separate implicit and explicit units. For the problem of segmented↔natural translation we used a subset of the Cityscapes dataset. Cityscapes is based on stereo video sequences recorded in streets from 50 different cities, and it includes both natural and semantically segmented images. To have unpaired samples, we randomly removed one image of each pair from the dataset, so as to have half natural and half segmented images. The final training set contains a fairly limited number of images, but we considered it as an additional test to verify the performance of the method on a limited amount of data. As regularization we just reduced the number of implicit units in the code.
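The construction of an unpaired set from paired data can be sketched in plain Python: for a random half of the pairs only the natural image is kept, and for the other half only the segmented image. The file names, the fixed seed, and the helper name `make_unpaired` are hypothetical.

```python
import random

def make_unpaired(pairs, seed=0):
    """Keep exactly one image from each (natural, segmented) pair."""
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    keep_natural = set(indices[: len(indices) // 2])  # these pairs keep the natural image
    natural, segmented = [], []
    for idx, (nat, seg) in enumerate(pairs):
        if idx in keep_natural:
            natural.append(nat)
        else:
            segmented.append(seg)
    return natural, segmented

# Hypothetical file names standing in for paired images.
pairs = [(f"nat_{k}.png", f"seg_{k}.png") for k in range(10)]
natural, segmented = make_unpaired(pairs)
```

No scene contributes to both lists, so the group assignment (natural or segmented) is the only label available to the model.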
Results obtained on the CelebA dataset are reported in Figure 7 and Figure 8. In the male↔female task, the transition from one sex to the other looks robust in most of the samples (Figure 7-top). To understand which attributes have been codified in the implicit portion of the code we computed the SSIM between the two reconstructions. We report in Figure 7-bottom the greyscale maps based on this metric. The SSIM shows that the most intense changes are localized in the eye region, with a peak on eyebrows and eyelashes. Minor adjustments are applied around the mouth (lips and beard), forehead (wrinkles), cheekbones, ears, and hairline. It is important to notice that the model found these differences without any specific supervision, only through the weak labelling identifying the groups. In the glasses↔no-glasses task, the samples look more blurred because of the use of the L2 loss; however, the transition is robust in this case too. The translation task on the Cityscapes dataset (Figure 9) proved extremely challenging. The model has been able to successfully separate the two domains and identify the major factors of variation (e.g. sky, road, cars, trees). However, minor details such as road signs and vehicle type have been discarded. We suspect this is due to the large difference between the two domains, the small number of images, and the highly lossy nature of the natural→segmented translation. Further work is required to overcome the difficulties presented in this setting.
In this article we present a new deep model called Y-AE, allowing disentanglement of implicit and explicit information in the latent space without using variational methods or adversarial losses. The method splits the reconstruction into two branches (with shared weights) and performs a sequential encoding, with an implicit and an explicit loss ensuring the consistency of the representations. We show through a wide range of experiments that the method is effective and that its performance is superior to similar methods.
Future work should mainly focus on applying the principles of Y-AEs to GANs and VAEs. For instance, by codifying the implicit information as a Gaussian distribution it is possible to integrate Y-AEs and VAEs in a unified framework and have the best of both worlds.
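P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.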