1 Introduction
One way to simplify the problem of classifying or regressing attributes of interest from data is to build an intermediate representation, a feature, where the information about the attributes is better separated than in the input data. Better separation means that some entries of the feature vary only with respect to one and only one attribute. In this way, classifiers and regressors would not need to build invariance to many nuisance attributes. Instead, they could devote more capacity to discriminating the attributes of interest, and possibly achieve better performance. We call this task
disentangling factors of variation, and we identify attributes with the factors. In addition to facilitating classification and regression, this task is beneficial to image synthesis. One could build a model to render images, where each input varies only one attribute of the output, and to transfer attributes between images.When labeling is possible and available, supervised learning can be used to solve this task. In general, however, some attributes may not be easily quantifiable (
e.g., style). Therefore, we consider using weak labeling, where we only know what attribute has changed between two images, although we do not know by how much. This type of labeling may be readily available in many cases without manual annotation. For example, image pairs from a stereo system are automatically labeled with a viewpoint change, albeit unknown. A practical model that can learn from these labels is an encoderdecoder pair subject to a reconstruction constraint. In this model the weak labels can be used to define similarities between subsets of the feature obtained from two input images.In this paper we study the ambiguities in mapping images to factors of variation and the effect of the feature dimensionality on the learned representation. Moreover, we introduce a novel architecture and training procedure to disentangle factors of variation. We find that the simple reconstruction constraint may fail at disentangling the factors when the dimensionality of the features is too large. We thus introduce an adversarial training to address this problem. More importantly, in general there is no guarantee that a model will learn to disentangle all factors. We call this challenge the reference ambiguity and formally show the conditions under which it appears. In practice, however, we observe experimentally that often the reference ambiguity does not occur on several synthetic datasets.
2 Related work
Autoencoders.Autoencoders in Bourlard & Kamp (1988), Hinton & Salakhutdinov (2006), Bengio et al. (2013) learn to reconstruct the input data as , where is the internal image representation (the encoder) and Dec (the decoder) reconstructs the input of the encoder. Variational autoencoders in Kingma & Welling (2014) use a generative model; , where is the observed data (images), and
are latent variables. The encoder estimates the parameters of the posterior,
, and the decoder estimates the conditional likelihood, . In Hinton et al. (2011) autoencoders are trained with transformed image input pairs. The relative transformation parameters are also fed to the network. Because the internal representation explicitly represents the objects presence and location, the network can learn their absolute position. One important aspect of the autoencoders is that they encourage latent representations to keep as much information about the input as possible.GAN. Generative Adversarial Nets Goodfellow et al. (2014)
learn to sample realistic images with two competing neural networks. The generator
Dec creates images from a random noise sample and tries to fool a discriminator Dsc, which has to decide whether the image is sampled from the generator or from real images. After a successful training the discriminator cannot distinguish the real from the generated samples. Adversarial training is often used to enforce constraints on random variables. BIGAN,
Donahue et al. (2016) learns a feature representation with adversarial nets by training an encoder Enc, such that is Gaussian, when . CoGAN, Liu & Tuzel (2016)learns the joint distribution of multidomain images by having generators and discriminators in each domain, and sharing their weights. They can transform images between domains without being given correspondences. InfoGan,
Chen et al. (2016)learns a subset of factors of variation by reproducing parts of the input vector with the discriminator.
Disentangling and independence. Many recent methods use neural networks for disentangling features, with various degrees of supervision. In Xi Peng (2017)
multitask learning is used with full supervision for pose invariant face recognition. Using both identity and pose labels
Tran et al. (2017) can learn pose invariant features and synthesize frontalized faces from any pose. In Yang et al. (2015) autoencoders are used to generate novel viewpoints of objects. They disentangle the object category factor from the viewpoint factor by using as explicit supervision signals: the relative viewpoint transformations between image pairs. In Cheung et al. (2014)the output of the encoder is split in two parts: one represents the class label and the other represents the nuisance factors. Their objective function has a penalty term for misclassification and a crosscovariance cost to disentangle class from nuisance factors. Hierarchical Boltzmann Machines are used in
Reed et al. (2014) for disentangling. A subset of hidden units are trained to be sensitive to a specific factor of variation, while being invariant to others. Variational Fair Autoencoders Louizos et al. (2016) learn a representation that is invariant to specific nuisance factors, while retaining as much information as possible. Autoencoders can also be used for visual analogy Reed et al. (2015). GAN is used for disentangling intrinsic image factors (albedo and normal map) in Shu et al. (2017) without using ground truth labelling. They achieve this by explicitly modeling the physics of the image formation in their network.The work most related to ours is Mathieu et al. (2016), where an autoencoder restores an image from another by swapping parts of the internal image representation. Their main improvement over Reed et al. (2015) is the use of adversarial training, which allows for learning with image pairs instead of image triplets. Therefore, expensive labels like viewpoint alignment between different car types are no longer needed. One of the differences between this method and ours is that it trains a discriminator for each of the given labels. A benefit of this approach is the higher selectivity of the discriminator, but a drawback is that the number of model parameters grows linearly with the number of labels. In contrast, we work with image pairs and use a single discriminator so that our method is uninfluenced by the number of labels. Moreover, we show formally and experimentally the difficulties of disentangling factors of variation.
3 Disentangling factors of variation
We are interested in the design and training of two models. One should map a data sample (e.g., an image) to a feature that is explicitly partitioned into subvectors, each associated to a specific factor of variation. The other model should map this feature back to an image. We call the first model the encoder and the second model the decoder. For example, given the image of a car we would like the encoder to yield a feature with two subvectors: one related to the car viewpoint, and the other related to every other car attribute. The subvectors of the feature obtained from the encoder should be useful for classification or regression of the corresponding factor that they depend on (the car viewpoint in the example). This subvector separation would also be very useful to the decoder. In fact, given a valid feature, one could vary only one of its subvectors (for example, the one corresponding to the viewpoint) and observe a variation of the output of the decoder just about its associated factor (the viewpoint). Such decoder would enable advanced editing of images. For example, it would allow the transfer of the viewpoint or other car attributes from an image to another.
The main challenge in the design of these models, when trained on weakly labeled data, is that the factors of variation are latent and introduce ambiguities in the representation. We explain later that avoiding these ambiguities is not possible without using further prior knowledge about the data. We prove this fundamental issue formally, provide an example where it manifests itself and demonstrate it experimentally. Interestingly, as the experiments will show, whether the ambiguity emerges or not depends on the complexity of the data. Next, we introduce our model of the data and formal definitions of our encoder and decoder.
Data model. We assume that our observed data is generated through some deterministic invertible and smooth process that depends on the factors and , so that . In our earlier example, is an image, is a viewpoint (the varying component), is a car type (the common component), and is the rendering engine. We assume that is unknown, and and are independent.
The encoder. Let Enc be the encoder mapping images to features. For simplicity, we consider features split into only two column subvectors, and , one associated to the varying factor and the other associated to the common factor . Then, we have that . Ideally, we would like to find the inverse of the image formation function, , which separates and recovers the factors and from data samples , i.e.,
(1) 
In practice, this is not possible because any bijective transformation of and could be undone by and produce the same output . Therefore, we aim for and that satisfy the following feature disentangling properties
(2) 
for all , , and for some bijective functions and , so that is invariant to and is invariant to .
The decoder. Let Dec be the decoder mapping features to images. A first constraint is that the sequence encoderdecoder forms an autoencoder, that is,
(3) 
To use the decoder for image synthesis, so that each input subvector affects only one factor in the rendered image, the ideal decoder should satisfy the data disentangling property
(4) 
for any , , , and . The equation above describes the transfer of the varying factor of and the common factor of to a new image .
In the next section, we show that there is an inherent ambiguity in recovering from images and in transferring it from one image to another. We show that, given our weakly labeled data, it may not be possible to satisfy all the feature and data disentangling properties. We call this challenge the reference ambiguity.
3.1 The reference ambiguity
Let us consider the ideal case where we observe the space of all images. When weak labels are made available to us, we also know what images and share the same
factor (for example, which images have the same car). This labeling is equivalent to defining the probability density function
and the joint conditional , where(5) 
Firstly, we show that the labeling allows us to satisfy the feature disentangling property for (2). For any we impose . In particular, this equation is true for pairs when one of the two images is held fixed. Thus, , where the function only depends on , as is invariant to . Lastly, images with the same varying factor, but different common factor must also result in different features, , otherwise the autoencoder constraint cannot be satisfied. Then, there exists a bijective function such that property (2) is satisfied for . Unfortunately, this is not true in general for the other disentangling properties.
Definition 1.
A function reproduces the data distribution, when it generates samples and that have the same distribution as the data. Formally, , where the latent factors are independent, , and .
The reference ambiguity occurs, when a decoder reproduces the data without satisfying the disentangling properties.
Proposition 1.
Proof.
We already saw that satisfies (2), so we can choose , the ideal encoding. Now we look at defining and the decoder. The isoprobability property of implies that there exists a mapping , such that and for some and . For example, let us denote with two varying components such that . Then, let
(6) 
and is a subset of the domain of , where . Now, let us define the encoder as . By using the autoencoder constraint, the decoder satisfies
(7) 
Because and by construction, and and are independent, our encoderdecoder pair defines a data distribution identical to that given as training set
(8) 
The feature disentanglement property is not satisfied because , when and . Similarly, the data disentanglement property does not hold, because . ∎
The above proposition implies that we cannot learn to disentangle all the factors of variation from weakly labeled data, even if we had access to all the data and knew the distributions and .
To better understand it, let us consider a practical example. Let be the (continuous) viewpoint (the azimuth angle) and the car type, where
denotes the uniform distribution and
the Bernoulli distribution with probability
(i.e., there are only car types). In this case, every instance of is isoprobable in so we have the worst scenario for the reference ambiguity. We can define the function so that the mapping of is mirrored as we change the car type. By definition for any and for and . So we cannot tell the difference between and the ideal correct mapping to the viewpoint factor. This is equivalent to an encoder that reverses the ordering of the azimuth of car with respect to car . Each car has its own reference system, and thus it is not possible to transfer the viewpoint from one system to the other.We now introduce a training procedure to build the encoder and decoder from weakly labeled data. We will use these models to illustrate several challenges: 1) the reference ambiguity, 2) the choice of the feature dimensionality and 3) the normalization layers (see the Implementation section).
3.2 Model training
In our training procedure we use two terms in the objective function: an autoencoder loss and an adversarial loss. We describe these losses in functional form, however the components are implemented using neural networks. In all our terms we use the following sampling of independent factors
(9) 
The images are formed as , and . The images and share the same common factor, and and are independent. In our objective functions, we use either pairs or triplets of the above images. We denote the inverse of the rendering engine as , where the subscript refers to the recovered factor.
Autoencoder loss. In this term, we use images and with the same common factor . We feed both images to the encoder. Since both images share the same , we impose that the decoder should reconstruct from the encoder subvector and the encoder subvector , and similarly for the reconstruction of . The autoencoder objective is thus defined as
(10) 
Adversarial loss. We introduce an adversarial training where the generator is our encoderdecoder pair and the discriminator Dsc is a neural network, which takes image pairs as input. The discriminator learns to distinguish between real image pairs and fake ones , where . If the encoder were ideal, the image would be the result of taking the common component from and the viewpoint component from . The generator learns to fool the discriminator, so that looks like the random variable (the common component is and the varying component is independent of ). To this purpose, the decoder must make use of , since does not carry any information about . The objective function is thus defined as
(11) 
Composite loss. Finally, we optimize the weighted sum of the two losses ,
where regulates the relative importance of the two losses.
Shortcut problem. Ideally, at the global minimum of , relates only to the factor and only to . However, the encoder may map a complete description of its input into and the decoder may completely ignore . We call this challenge the shortcut problem. When the shortcut problem occurs, the decoder is invariant to its second output, so it does not transfer the factor correctly,
(12) 
The shortcut problem can be addressed by reducing the dimensionality of , so that it cannot build a complete representation of all input images. This also forces the encoder to make use of for the common factor. However, this strategy may not be convenient as it leads to a time consuming trialanderror procedure to find the correct dimensionality. A better way to address the shortcut problem is to use adversarial training. For our analysis we assume that the discriminator is perfect and the global optimum of the adversarial training has been reached. Thus, the real and fake image pair distributions are identical, and any statistics of the two distributions should also match. We compute statistics of the inverse of the common component . For the images and we obtain
(13) 
by construction (of and ). For the images and we obtain
(14) 
where . We achieve equality if and only if everywhere. This means that the decoder must use to recover the common component of its input and is sufficient to recover it.
3.3 Implementation
In our implementation we use convolutional neural networks for all the models. We denote with
the parameters associated to each network. Then, the optimization of the composite loss can be written as(15) 
We choose and also add regularization to the adversarial loss so that each logarithm has a minimum value. We define (and similarly for the other logarithmic term) and use . The main components of our neural network are shown in Fig. 1. The architecture of the encoder and the decoder were taken from DCGAN Radford et al. (2015), with slight modifications. We added fully connected layers at the output of the encoder and to the input of the decoder. For the discriminator we used a simplified version of the VGG Simonyan & Zisserman (2014) network. As the input to the discriminator is an image pair, we concatenate them along the color channels.
Normalization.
In our architecture both the encoder and the decoder networks use blocks with a convolutional layer, a nonlinear activation function (ReLU/leaky ReLU) and a normalization layer, typically, batch normalization (BN). As an alternative to BN we consider the recently introduced
instance normalization (IN) Ulyanov et al. (2017). The main difference between BN and IN is that the latter just computes the mean and standard deviation across the spatial domain of the input and not along the batch dimension. Thus, the shift and scaling for the output of each layer is the same at every iteration for the same input image. In practice, we find that IN improves the performance.
4 Experiments
We tested our method on the MNIST, Sprites and ShapeNet datasets. We performed qualitative experiments on attribute transfer, and quantitative tests on the nearest neighbor classification task. We show results with models using only the autoencoder loss (AE) and the composite loss (AE+GAN).
MNIST. The MNIST dataset LeCun et al. (1998) contains handwritten grayscale digits of size pixel. There are K images of classes for training and K for testing. The common factor is the digit class and the varying factor is the intraclass variation. We take image pairs that have the same digit for training, and use our full model AE+GAN with dimensions for and for . In Fig. 2 (a) and (b) we show the transfer of varying factors. Qualitatively, both our method and Mathieu et al. (2016) perform well. We observe neither the reference ambiguity nor the shortcut problem in this case.
Sprites. The Sprites dataset Reed et al. (2015) contains pixel color images of animated characters (sprites). There are sprites, for training, for testing and for validation. Each sprite has animations and images, so the full dataset has K images in total. There are many changes in the appearance of the sprites, they differ in their body shape, gender, hair, armor, arm type, greaves, and weapon. We consider character identity as the common factor and the pose as the varying factor. We train our system using image pairs of the same sprite and do not exploit labels on their pose. We train the AE+GAN model with dimensions for and for . Fig. 2 (c) and (d) show results on the attribute transfer task. Both our method and Mathieu et al. (2016)’s transfer the identity of the sprites correctly.
ShapeNet with a white background. The ShapeNet dataset Chang et al. (2015) contains 3D objects than we can render from different viewpoints. We consider only one category (cars) for a set of fixed viewpoints. Cars have high intraclass variability and they do not have rotational symmetries. We used approximately K car types for training and for testing. We rendered possible viewpoints around each object in a full circle, resulting in K images in total. The elevation was fixed to degrees and azimuth angles were spaced degrees apart. We normalized the size of the objects to fit in a pixel bounding box, and placed it in the middle of a pixel image. We trained both AE and AE+GAN on ShapeNet, and tried different settings for the feature dimensions . The size of the common feature was fixed to dimensions. Fig. 3 shows the attribute transfer on the Shapenet dataset with a white background. We compare the methods AE and AE+GAN with different feature dimension of . We can observe that the transferring performance degrades for AE, when we increase the feature size of . As expected, the autoencoder takes the shortcut and tries to store all information into . The model AE+GAN instead renders images without loss of quality, independently of the feature dimension. Furthermore, none of the models exhibits the reference ambiguity: In all cases the viewpoint could be transferred correctly.
In Fig. 4 we visualize the tSNE embeddings of the features for several models using different feature sizes. For the case, we do not modify the data. We can see that both AE with dimensions and AE+GAN with separate the viewpoints well, but AE with dimensions does not due to the shortcut problem. We investigate the effect of dimensionality of the features on the nearest neighbor classification task. The performance is measured by the mean average precision. For we use the viewpoint as ground truth. Fig. 4 also shows the results on AE and AE+GAN models with different feature dimensions. The dimension of was fixed to for this experiment. One can now see quantitatively that AE is sensitive to the size of , while AE+GAN is not. AE+GAN also achieves a better performance. We used the ShapeNet with a white background dataset also to compare the different normalization choices in Table 1. We evaluate the case when batch, instance and no normalization are used and compute the performance on the nearest neighbor classification task. We fixed the feature dimensions at for both and features in all normalization cases. We can see that both batch and instance normalization perform equally well on viewpoint classification and “no normalization” is slightly worse. For the car type classification instance normalization is clearly better.
Normalization  mAP  mAP 

None  0.47  0.13 
Batch  0.50  0.08 
Instance  0.50  0.20 
ShapeNet transfers with (a) a white and (b) ImageNet background.
ShapeNet with ImageNet background. We render the ShapeNet dataset (same set of cars as in the previous section) with ImageNet images as background. The settings for the rendering (image size, viewpoints) are the same as in the case with a white background. We choose the backgrounds randomly for each car image, so that the overall dataset size of the data is the same, K. Since the image pairs use a different background during the training, the background is also part of the varying component. In Fig. 5 we show results on attribute transfer in the case of ShapeNet with a white and with ImageNet background. We found that the reference ambiguity does not emerge in the first dataset, but it does emerge in the second dataset, possibly due to the higher complexity. We highlight these incorrect viewpoint transfers with a red border (see top three rows in Fig. 5 (b). Nonetheless, we find that the proposed model more often than not correctly transfers the viewpoint. The background seems to transfer less well than the viewpoint, but we speculate that the background transfer might improve with better tuning and longer training.
5 Conclusions
In this paper we studied the challenges of disentangling factors of variation. We described the reference ambiguity and showed that it is inherently present in the task, when weak labels are used. Most importantly this ambiguity can be stated independently of the learning algorithm. We also introduced a novel method to train models to disentangle factors of variation. The model must be part of an autoencoder since our method requires that the representation is sufficient to reconstruct the input data. We have shown how the shortcut problem due to feature dimensionality can be kept under control through adversarial training. We demonstrated that training and transfer of factors of variation may not be guaranteed. However, in practice we observe that our trained model works well on most datasets and exhibits good generalization capabilities.
References
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

Bourlard & Kamp (1988)
Hervé Bourlard and Yves Kamp.
Autoassociation by multilayer perceptrons and singular value decomposition.
Biological cybernetics, 59(4):291–294, 1988.  Chang et al. (2015) Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An InformationRich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
 Cheung et al. (2014) Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv:1412.6583, 2014.
 Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv:1605.09782, 2016.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
 Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming autoencoders. In International Conference on Artificial Neural Networks, pp. 44–51. Springer, 2011.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In ICLR, 2014.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liu & Tuzel (2016) MingYu Liu and Oncel Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 469–477, 2016.
 Louizos et al. (2016) Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. In ICLR, 2016.
 Mathieu et al. (2016) Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5041–5049, 2016.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.

Reed et al. (2014)
Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee.
Learning to disentangle factors of variation with manifold
interaction.
In
Proceedings of the 31st International Conference on Machine Learning (ICML14)
, pp. 1431–1439, 2014.  Reed et al. (2015) Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogymaking. In Advances in Neural Information Processing Systems, pp. 1252–1260, 2015.

Shu et al. (2017)
Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and
Dimitris Samaras.
Neural face editing with intrinsic image disentangling.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, July 2017.  Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Tran et al. (2017) Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for poseinvariant face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 Ulyanov et al. (2017) Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Improved texture networks: Maximizing quality and diversity in feedforward stylization and texture synthesis. In CVPR, 2017.
 Xi Peng (2017) Kihyuk Sohn Dimitris Metaxas Manmohan Chandraker Xi Peng, Xiang Yu. Reconstruction for feature disentanglement in poseinvariant face recognition. arXiv:1702.03041, 2017.
 Yang et al. (2015) Jimei Yang, Scott E Reed, MingHsuan Yang, and Honglak Lee. Weaklysupervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, 2015.