Objects in the real world combine many different attributes. Some attributes are permanent, i.e. the fixed identity or content of the object, whereas others are transitory, e.g. the pose of the object. Humans can often separate the content or identity of an object from its transitory pose, sometimes even from a single observation. A key task for artificial intelligence is to enable computers to learn to separate the different attributes of observed data, a task often referred to as disentanglement. In this paper, we present a new method for achieving disentanglement between content and pose. We restrict our attention to image data, although some of our ideas may carry over to other modalities.
There are multiple settings for disentanglement. The simplest is fully supervised: for each training image, both the content and the pose are given as labels. A fully supervised scheme (e.g. deep encoders) may be trained to recover the content and pose information from a single image. Conversely, a generative model can be trained to generate an image given input pose and content information. At the other extreme, fully unsupervised disentanglement takes as input a set of images with no further information. A successful unsupervised disentanglement algorithm learns a representation in which different aspects of the object, such as content and pose, are represented separately. Fully unsupervised disentanglement is highly ambitious and remains work in progress; current methods typically do not produce consistently good results in this setting.
In this work, we deal with the content-supervised disentanglement task. In this setting, training images are labeled with their content, e.g. the identity of the face shown in the training image. Such supervision is easily obtained in practice: by simply tracking an object in a video, we obtain multiple images in multiple poses with the same content (in this case, person identity). The objective of the disentanglement task is to learn a representation containing all the information not available in the content. In the case of faces, this may include head pose, smile, glasses, etc.
In previous approaches, content-supervised disentanglement was achieved by adversarial constraints or by introducing cycle constraints to non-adversarial encoding methods. We present a novel method for this task. Our method is generative: it trains a generator that reconstructs every training image from its content and pose features. It differs from previous methods in several methodological respects. i) We believe that a strong inductive bias is critical to achieving successful disentanglement. Therefore, in line with style-transfer and domain-translation works, we utilize the AdaIN architecture, and we show that this architecture is important for achieving disentanglement. ii) Instead of training content and pose encoders, which decompose the input image into pose and content features, we optimize over the pose and content features (embeddings) directly using latent optimization. In addition to achieving better optima than encoder-based methods, latent optimization ensures that all images having the same content (identity) share exactly the same content embedding. This significantly boosts the disentanglement quality. iii) The above latent optimization is very effective at disentangling the training set images. It is not, however, efficient at obtaining the content embedding of new test images, since this requires solving a difficult and poorly determined optimization problem at runtime. We overcome this challenge by introducing a second stage, which uses the content and pose embeddings learned in the first stage as "synthetic" supervision for training encoders that map an image directly to its pose and content embeddings. The encoders generalize well and are fast to evaluate on test data.
Our method is evaluated qualitatively and quantitatively in terms of generation of novel poses for observed identities. We also quantitatively evaluate the quality of disentanglement of learned features by classifying object identity. Our method is shown to significantly outperform other non-adversarial methods as well as methods that use adversarial training. Code for the method presented in this paper is available on our project page (http://www.vision.huji.ac.il/lord).
2 Previous Work
Unsupervised Disentanglement: In unsupervised disentanglement, a representation is learned simply by observing a random sample of training images, such that different dimensions of the latent vector correspond to different semantic attributes of the image. Many of the recent methods are based on variational auto-encoders (VAEs) (Kingma and Welling, 2014), e.g. β-VAE (Higgins et al., 2017), factor-VAE (Kim and Mnih, 2018), mutual-information (Chen et al., 2018) and modularity-based methods (Ridgeway and Mozer, 2018). This setting is very challenging. In an extensive comparative study, Locatello et al. (Locatello et al., 2019) showed that none of the compared methods was successful on all the datasets examined.
Supervised Disentanglement: It seems likely that some supervision is required for effective disentanglement. The easiest scenario is the fully supervised one, e.g. when both the identity and attributes of a training face are given. Another scenario is domain supervision, where the pose of the training image is known but not the content, e.g. when training faces are labeled with attributes. Attributes can range from head pose to detailed facial attributes, but do not include identity. An example of such a method is MUNIT (Huang et al., 2018), which learns disentangled domain and content representations with domain supervision. StarGAN (Chen et al., 2018) achieved this for multiple domains. The other supervision setting is content supervision, e.g. when the identity of the training face is given but not its transitory attributes. Popular methods in this setting include the adversarial methods of Mathieu et al. (Mathieu et al., 2016), Denton and Birodkar (Denton and Birodkar, 2017) and Szabo et al. (Szabó et al., 2018), as well as the non-adversarial method of Jha et al. (Harsh Jha et al., 2018). Our method also deals with the latter setting and is non-adversarial.
Style Transfer: Transferring between image style and content is a long-standing task related to disentanglement. The tasks are, however, different: style transfer decomposes low-level texture and high-level content, whereas our disentanglement setting decomposes permanent and transitory attributes. Gatys et al. (Gatys et al., 2016) presented a landmark method for using neural networks for style transfer. Dumoulin et al. (Dumoulin et al., 2016) proposed a conditional batch-normalization method, which learns a per-channel scale factor and bias parameter for every output style. This was found sufficient for image style transfer. Huang and Belongie (Huang et al., 2018) propose an adaptive instance normalization (AdaIN) layer which, instead of learning a per-style lookup, learns a feedforward mapping from the target style image to the per-channel bias and scale parameters.
Neural Image Generation: Synthesizing images from a noise distribution using neural networks is an active research topic. Notable paradigms for training generators include variational autoencoders (VAEs) (Kingma and Welling, 2014), generative adversarial networks (GANs) (Goodfellow et al., 2014) and generative latent optimization (GLO) (Bojanowski et al., 2018). Our generator training method is based on latent optimization and is related to GLO. Several other generators use AdaIN (or similar) layers, e.g. projection SNGAN (Miyato and Koyama, 2018) and StyleGAN (Karras et al., 2018), but not for pose and content disentanglement.
Image Mapping: Mapping images between different domains has received much recent attention. A fundamental issue with mapping between image domains is the challenge of predicting the target style, which is not available in the input image. Recently, MUNIT (Huang et al., 2018) proposed to train an encoder which describes the target style as a set of style parameters. The style parameters are injected into the mapping network using AdaIN layers. This allows mapping a single image into multiple solutions. Our architecture is influenced by the MUNIT architecture.
Non-adversarial Training: Many methods have been proposed for mapping between high-dimensional distributions. Recently, adversarial training, which trains two networks in a competitive (adversarial) way, has achieved much success on several tasks such as image generation (Brock et al., 2019), image mapping (Isola et al., 2017) and domain alignment (Liu et al., 2017). Adversarial methods are notoriously hard to optimize and, due to their min-max nature, require very careful tuning of architecture and hyper-parameters. To overcome these issues, non-adversarial methods have been proposed and have achieved better results on tasks previously dominated by adversarial networks (e.g. image-to-image mapping (Hoshen and Wolf, 2018), word translation (Hoshen and Malik, 2019)). In this paper, we present a non-adversarial method achieving state-of-the-art performance on disentanglement.
3 Non-Adversarial Latent Disentanglement
We present a novel method for disentanglement with content supervision. The task of disentanglement is poorly specified: there are many possible ways of separating pose and content, but not all are semantically meaningful. Locatello et al. (Locatello et al., 2019) showed that, due to the difficulty of unsupervised disentanglement, an inductive bias is necessary for effective disentanglement. Our method introduces two strong inductive biases: i) an architectural bias, and ii) temporal or identity consistency of the representation, enforced using latent optimization. A second stage then enables better inference.
3.1 Generative Model
Our approach is generative, i.e. we learn a generative model that, given the pose and content codes, is able to generate all images. We define our generative model $G_\theta$, implemented as a neural network parameterized by $\theta$. Our model receives two inputs, $p$ and $c$, corresponding to the pose and content embeddings respectively. The output of the generator is a synthetic image. Formally it can be written as:
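With $G_\theta$ denoting the generator, $p$ and $c$ the pose and content embeddings, and $\hat{x}$ the synthetic image (the symbol names are assumed here for illustration), the mapping described above reads:

```latex
\hat{x} = G_\theta(p, c)
```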
As disentanglement is a poorly specified task, different architectures have been used to enforce an inductive bias. We use an adaptive instance normalization (AdaIN) based architecture. The pose embedding is input into the generator, which projects and gradually upsamples it into an image. Each convolutional layer is followed by normalization using the AdaIN parameters obtained from the content embedding, which sets the global statistics of each feature map. More formally, the content embedding is first fed into a modulation module (the parameters of which are learned together with the other generator parameters), which projects it separately into per-layer AdaIN parameters. In each layer of the generator, each feature map is first normalized and then scaled and biased using the corresponding scalars $\gamma$ and $\beta$.
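The normalize-then-modulate step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `adain`, the flat-list representation of a feature map and the `eps` stabilizer are assumptions made for illustration. In the method itself, the per-channel scale and bias would be produced from the content embedding by the modulation module.

```python
import math


def adain(feature_maps, scales, biases, eps=1e-5):
    """Adaptive instance normalization for the feature maps of one image.

    feature_maps: list of channels, each a flat list of activations.
    scales, biases: one scalar per channel (hypothetical stand-ins for the
    AdaIN parameters derived from the content embedding).
    """
    out = []
    for channel, scale, bias in zip(feature_maps, scales, biases):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Normalize to zero mean and unit variance, then scale and shift.
        out.append([scale * (v - mean) / std + bias for v in channel])
    return out


features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
normalized = adain(features, scales=[2.0, 0.5], biases=[1.0, -1.0])
```

After the call, every channel has a mean equal to its bias and a standard deviation approximately equal to its scale, regardless of the input statistics. This is the sense in which the content embedding controls the global statistics of each feature map.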
3.2 Latent Optimization for Content Supervision
In the previous section, we showed how the generative model architecture can act as an inductive bias for disentanglement. We additionally use weak supervision to provide further learning bias. In our setting, we observe the same object identity under multiple poses, although we are not told what the poses are. For example, in videos of faces, facial tracking yields many images containing the same face in different poses. In non-video data, such supervision can be obtained from image class labels or from facial recognition methods. Any agent that can observe and track objects is able to obtain such content supervision virtually for free.
Let us define pose and content embeddings $p_i$ and $c_i$ for every image $x_i$, $i = 1, \dots, N$, where $N$ is the total number of training images. Using the content supervision described in this section, we define the indicator function $y(i)$, which takes as input the training image index $i$ and returns its content identity. Note that many images may share the same content identity (e.g. faces of the same person at different poses). We denote the embedding of a given content identity $k$ as $e_k$ (the number of unique identities is $K$):
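Writing $c_i$ for the content embedding of image $i$, $y(i)$ for its content identity and $e_k$ for the embedding of identity $k$ out of $K$ identities (symbol names assumed here for illustration), the sharing described above can be expressed as:

```latex
c_i = e_{y(i)}, \qquad y(i) \in \{1, \dots, K\}
```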
To ensure that the content embedding is equal between all images sharing the same content identity, we optimize the pose and content of each image directly using latent optimization (Bojanowski et al., 2018). For each image $x_i$, we directly optimize $p_i$ and $c_i$ so that the generated image reconstructs the input image as closely as possible. The optimization is performed under a perceptual loss, which compares deep features extracted from the generated and true images. Perceptual features correlate better with human perceptual similarity than simple $\ell_1$ or $\ell_2$ losses. We use the implementation provided by the authors of (Hoshen and Wolf, 2018). For fairness, we also use the perceptual loss for the baseline methods in our experiments. The cost function is given by Eq. 3:
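A form of this cost consistent with the description, with $\ell_{\mathrm{perc}}$ denoting the perceptual loss, $G_\theta$ the generator, and $p_i$, $c_i$ the pose and content embeddings of training image $x_i$ (the notation is assumed here, not taken from the original typesetting), is:

```latex
\mathcal{L} = \sum_{i=1}^{N} \ell_{\mathrm{perc}}\big( G_\theta(p_i, c_i),\; x_i \big)
```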
We optimize over the per-image pose and content embeddings, where one content embedding may be shared by several images. The optimization becomes:
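Substituting the shared per-identity embedding $e_{y(i)}$ for the content embedding of image $x_i$, the objective can be sketched as follows. The symbol names are assumptions made for illustration: $\theta$ are the generator parameters, $p_i$ the per-image pose embeddings, $e_k$ the per-identity content embeddings, and $\ell_{\mathrm{perc}}$ the perceptual loss.

```latex
\min_{\theta,\; \{p_i\}_{i=1}^{N},\; \{e_k\}_{k=1}^{K}} \;\; \sum_{i=1}^{N} \ell_{\mathrm{perc}}\big( G_\theta(p_i, e_{y(i)}),\; x_i \big)
```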
After this training stage, each image has a pose and content embedding. In the experimental section, we show that the poses are typically well aligned between the different identities. A sketch of this stage can be seen in Fig. 1.
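The mechanics of this stage, one pose latent per image and one content latent per identity, can be shown on a deliberately simplified toy problem. Everything in the sketch below is an assumption made for illustration and not the authors' setup: the "images" are scalars, the generator is the fixed additive map `generate(p, c) = p + c` with no learned parameters, squared error replaces the perceptual loss, and an L2 penalty on the pose latents stands in for regularization.

```python
# Toy latent optimization with a content latent shared within each identity.
images = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]   # observations x_i
identity = [0, 0, 0, 1, 1, 1]             # y(i): content identity of image i

pose = [0.0] * len(images)                # one pose latent per image
content = [0.0] * (max(identity) + 1)     # one content latent per identity


def generate(p, c):
    """Hypothetical stand-in for the generator."""
    return p + c


learning_rate, pose_penalty = 0.05, 1.0
for _ in range(2000):
    grad_pose = [0.0] * len(pose)
    grad_content = [0.0] * len(content)
    for i, x in enumerate(images):
        k = identity[i]
        residual = generate(pose[i], content[k]) - x
        grad_pose[i] = 2 * residual + 2 * pose_penalty * pose[i]
        # The shared latent accumulates gradients from all of its images.
        grad_content[k] += 2 * residual
    pose = [p - learning_rate * g for p, g in zip(pose, grad_pose)]
    content = [c - learning_rate * g for c, g in zip(content, grad_content)]
```

Because every image of an identity indexes the same entry of `content`, the content code is identical across those images by construction rather than merely encouraged to be similar. In this toy problem, each content latent converges to the mean of its group, and the pose latents capture the per-image deviation from that mean.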
3.3 Encoder Training using "Synthetic" Supervision
The method detailed in Sec. 3.2 is very effective at disentangling the training set, but has issues at inference time. Latent optimization, which was used effectively for training, must be run separately for every new image, which makes inference very slow. In addition, at training time the content embedding was shared by multiple images, which prevented it from including pose information. At inference time, however, a single image from an unknown identity is presented, and nothing prevents pose information from leaking into the content embedding during latent optimization.
We therefore introduce a second stage to our method. We treat the results of the previous stage as a "supervised" disentanglement problem, where for each training image $x_i$ the pose and content embeddings $p_i$ and $c_i$ inferred in the first stage are used as synthetic ground truth.
We train encoders $E_p$ and $E_c$, which take as input an image $x_i$ and output its pose and content embeddings, from which the generator then reconstructs the image $x_i$. The optimization cost is presented in Eq. 5. The optimization is over the parameters of the encoders $E_p$ and $E_c$, as well as the parameters of the generator $G$, and $\lambda$ is a constant (we use ).
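One cost of this kind can be sketched as a reconstruction term plus a term tying the encoder outputs to the first-stage embeddings. The notation is assumed for illustration: $E_p$ and $E_c$ are the pose and content encoders, $p_i$ and $c_i$ the first-stage embeddings of $x_i$, and $\lambda$ the weighting constant. The squared Euclidean distance and the single shared weight are also assumptions; the text only states that the first-stage embeddings act as synthetic ground truth.

```latex
\mathcal{L}_{\mathrm{enc}} = \sum_{i=1}^{N} \Big[ \ell_{\mathrm{perc}}\big( G_\theta(E_p(x_i), E_c(x_i)),\; x_i \big)
  + \lambda \big( \lVert E_p(x_i) - p_i \rVert_2^2 + \lVert E_c(x_i) - c_i \rVert_2^2 \big) \Big]
```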
After training, test images $x$ are disentangled into pose and content embeddings $E_p(x)$ and $E_c(x)$. To display the same content in a new pose $p'$, we simply run it through the generator to obtain:
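With $E_c$ and $E_p$ denoting the content and pose encoders and $p'$ the new pose code, for instance the code $E_p(x')$ of another test image $x'$ (symbol names assumed here for illustration), the pose-swapped image can be written as:

```latex
\hat{x} = G_\theta\big( p',\; E_c(x) \big), \qquad \text{e.g. } p' = E_p(x')
```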
3.4 Implementation Details
The architecture of the generator consists of 3 fully-connected layers followed by convolutional layers (the first of them are preceded by an upsampling layer and followed by AdaIN normalization). We set the size of the pose latent code to and the size of the content code to in all our experiments. We perform the latent optimization using stochastic gradient descent with the Adam method, with learning rate of for the generator and for the latent codes. For the second stage, the pose encoder is a CNN with convolutional layers and fully-connected layers, all followed by batch normalization and LeakyReLU. The content encoder has a similar architecture, except that it applies global average pooling and does not use batch normalization. We found that adding Gaussian noise regularization was helpful for the latent optimization of the pose codes.
Our method is evaluated against state-of-the-art techniques representing the dominant current approaches.
Cars3D (Reed et al., 2015): This dataset consists of 183 car CAD models, each rendered from 24 equally spaced azimuth directions and 4 elevations at resolution. We define the car model as the content and the rest as pose. We use 163 car models for training and hold out the other 20 for testing.
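The split sizes implied by these figures follow from simple arithmetic, sketched below. The variable names are illustrative; the counts are the ones stated above.

```python
models, azimuths, elevations = 183, 24, 4
train_models, test_models = 163, 20

views_per_model = azimuths * elevations          # poses rendered per car model
total_images = models * views_per_model
train_images = train_models * views_per_model
test_images = test_models * views_per_model

print(views_per_model, total_images, train_images, test_images)
```

Each car model therefore contributes 96 views, giving 17,568 images in total, of which 15,648 belong to training identities and 1,920 to held-out identities.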
SmallNorb (LeCun et al., 2004): This dataset contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 every 20 degrees). We use this dataset in two configurations: i) SmallNorb: 25 separate identities for training and 25 for testing, treating lighting and elevations as part of the object identity, and azimuth as the pose. This configuration is used for evaluating the generalization capability of the disentanglement methods from a very limited set of seen identities. ii) SmallNorb-Poses: Using all the identities for training, holding out 10% of the images for testing. In this case we treat the elevation as part of the pose as well.
CelebA (Liu et al., 2015): CelebA contains 202,599 facial images of 10,177 celebrities. The faces are aligned and cropped to contain only the facial region. We designate the person identity as the content, and transitory facial attributes (smile, orientation) as pose. 9,177 identities were seen during training and the other 1,000 were used for testing.
KTH (Laptev et al., 2004): KTH contains videos of people performing 6 different activities (walking, running, jogging, boxing, handwaving, handclapping) in different settings. We designate person identity as content, and transitory attributes (predominantly skeleton position) as pose. Due to the very limited number of subjects, we use all the identities for training, holding out 10% of the images for testing. In all the experiments, images are resized to 64x64 resolution to fit the same architecture.
Denton and Birodkar (Denton and Birodkar, 2017): This method trains encoders separating the image into pose and content codes. Independence of the codes is maintained using a GAN loss, while content code sharing between images that have the same identity is enforced by a code similarity constraint. A decoder reconstructs the image using the pose and content codes.
Szabo et al. (Szabó et al., 2018): This method trains an autoencoder similar to (Denton and Birodkar, 2017); however, the adversarial loss is not set on the codes but rather on pairs of images. This has the advantage of allowing stronger architectures, but it deals with the representation disentanglement task less directly.
Jha et al. (Harsh Jha et al., 2018): This is a non-adversarial method which trains a VAE while discouraging degenerate solutions by a cycle-reconstruction loss for the pose and content codes.
Some of the baselines are very sensitive to code size (our method is not). We made an effort to find effective sizes but as the search space is large, we cannot guarantee that we found optimal sizes. We evaluated each baseline with and perceptual loss and reported the best.
We visually evaluated the results of our method against the two strongest baselines, (Harsh Jha et al., 2018) and (Denton and Birodkar, 2017) (as measured in the quantitative results section). In every experiment, we trained using our two-stage approach on the training data, and used the second-stage pose and content encoders to compute pose and content codes for each test image. We visualize switching between pose and content codes for each pair within a set of 4 test images. The results can be seen in Fig. 2. On Cars3D, our method achieved excellent pose transfer while keeping identities fixed. (Denton and Birodkar, 2017) was mostly able to transfer the pose, but did not keep the car model fixed. The results of (Harsh Jha et al., 2018) were of lower fidelity. On SmallNorb-Poses, our method worked well, whereas the baseline methods struggled with some rotations (e.g. bottom row). SmallNorb was harder for all methods, due to the requirement of generalizing to unseen identities from only 25 training identities. Although the exact model was not kept perfectly fixed, our method was able to transfer the pose to a model similar to the target. The baselines did worse (e.g. top and bottom rows). On KTH, both our method and (Denton and Birodkar, 2017) performed well, although our method achieved more accurate transfer (e.g. last image in the top row). On CelebA, (Harsh Jha et al., 2018) was unable to transfer the pose. (Denton and Birodkar, 2017) generally transferred head pose but did not preserve the person identity. Our method achieved better pose transfer than both baselines, and was able to maintain the identity and style.
Reconstruction experiments: To test the quality of disentanglement, we measure the quality of pose transfer in terms of perceptual reconstruction cost (using LPIPS (Zhang et al., 2018)). We use the pose labels available in the Cars3D and SmallNorb datasets as ground truth for pose transfer. For given test images $x_i$ and $x_j$, we measure the similarity between the image generated from the content of $x_i$ and the pose of $x_j$, and another image of the same object as in $x_i$ having the same pose as $x_j$. Results are reported in Tab. 1.
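The evaluation protocol can be sketched as a loop over ordered pairs of labeled test images. The function `pose_transfer_score` and its arguments are hypothetical names introduced for illustration, and the toy usage replaces images with `(content, pose)` tuples and LPIPS with an L1 distance, so it demonstrates only the bookkeeping and not the actual metric.

```python
from itertools import permutations


def pose_transfer_score(test_set, encode_pose, encode_content, generate, distance):
    """Average distance between pose-swapped generations and their targets.

    test_set maps (content_label, pose_label) -> image. For each ordered pair
    (x_i, x_j), the content of x_i is combined with the pose of x_j and
    compared with the real image that has exactly that content and pose.
    """
    scores = []
    for key_i, key_j in permutations(test_set, 2):
        target = test_set.get((key_i[0], key_j[1]))
        if target is None:  # no ground-truth image for this combination
            continue
        x_i, x_j = test_set[key_i], test_set[key_j]
        swapped = generate(encode_pose(x_j), encode_content(x_i))
        scores.append(distance(swapped, target))
    return sum(scores) / len(scores)


def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


# Toy stand-ins: an "image" is just the pair (content value, pose value).
toy_set = {(c, p): (float(c), float(p)) for c in range(2) for p in range(3)}

perfect = pose_transfer_score(
    toy_set, lambda x: x[1], lambda x: x[0], lambda p, c: (c, p), l1)
# A generator that ignores the pose code cannot match the target pose.
pose_blind = pose_transfer_score(
    toy_set, lambda x: x[1], lambda x: x[0], lambda p, c: (c, 0.0), l1)
```

A perfectly disentangled toy model scores zero, while the pose-blind variant is penalized. This is the behaviour the reconstruction cost is meant to capture.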
Classification experiments: To verify that our method in fact removes all pose information from content features, we trained a classifier (with the same architecture as in (Harsh Jha et al., 2018)) to classify object identity from the pose and content codes. Results can be seen in Tab. 2. All methods could classify identity from content codes very accurately (not shown). On Cars3D, our method removes more identity information from the pose code than the baselines, leading to lower classification accuracy (better disentanglement). On SmallNorb, our method, (Denton and Birodkar, 2017) and (Harsh Jha et al., 2018) all achieved near perfect disentanglement. To conclude, our method learns the most disentangled features for the identity classification task without introducing adversarial constraints between the pose and content codes.
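Table 1: Perceptual reconstruction cost of pose transfer, measured by LPIPS (lower is better).
| Szabo et al. (Szabó et al., 2018) | 0.13 | 0.14 | 0.40 | 0.42 | 0.21 | 0.21 |
| Jha et al. (Harsh Jha et al., 2018) | 0.14 | 0.14 | 0.19 | 0.20 | 0.19 | 0.20 |
| Denton et al. (Denton and Birodkar, 2017) | 0.05 | 0.10 | 0.14 | 0.17 | 0.14 | 0.15 |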
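Table 2: Accuracy of classifying object identity from the pose codes (lower means better disentanglement).
| Method | Cars3D | SmallNorb |
| Szabo et al. (Szabó et al., 2018) | 0.91 | 0.01 |
| Jha et al. (Harsh Jha et al., 2018) | 0.08 | < 0.01 |
| Denton et al. (Denton and Birodkar, 2017) | 0.26 | < 0.01 |
| Random chance | < 0.01 | < 0.01 |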
Figure 2: Qualitative results of switching pose and content codes between test images. Panels: Jha et al. (Harsh Jha et al., 2018), Denton et al. (Denton and Birodkar, 2017), Ours.
Content supervision: Humans observing nature track objects over a certain length of time and therefore obtain content supervision for multiple objects virtually for free. We showed that content supervision is very helpful for achieving strong disentanglement performance. Without tracking or some other form of content supervision, disentanglement performance is significantly reduced.
Latent optimization: In previous methods, content supervision was partially introduced by encouraging pairs of images with the same content to have similar content codes. We introduced a much more powerful method for enforcing content supervision: a hard constraint that all images sharing the same content have exactly the same content code.
Multiple stages: The first stage, latent optimization, results in an effective disentanglement of the training set. Optimizing over the latent codes at test time without content supervision tends to overfit and results in non-transferable pose codes. In an ablation experiment on the Cars3D dataset, using latent optimization at test time leads to degraded performance on the test set (0.14 compared to 0.08 with the additional stage, measured by LPIPS). The second stage is crucial for achieving high-quality disentanglement, as well as for speeding up inference.
Non-Adversarial training: Unlike most previous works, we did not use adversarial training to enforce disentanglement between the pose and content codes. Instead, we used direct latent optimization of the embeddings and a strong architectural bias. Non-adversarial training has significant advantages in the ease of optimization. Interestingly, our method results in pose and content embeddings that leak very little information. We believe this should motivate researchers to look at non-adversarial alternatives.
For training our autoencoder, we used a perceptual loss function whose features were originally trained on the ImageNet dataset. This should not count as using extra supervision, as the ImageNet dataset is not strongly related to any of the tested datasets. In our experiments, we found that the perceptual loss was also helpful to other methods that did not use GANs on the output image (even if they used GANs on intermediate feature representations, e.g. (Denton and Birodkar, 2017)). In line with other work (Hoshen and Malik, 2019), we found that perceptual losses are very helpful for latent optimization.
Domain supervision: Although domain labels are quite different from content labels, we have found in preliminary experiments that using the domain label as the "content" supervision achieves promising results on the edges2shoes dataset (particularly on the shoes-to-edges task). We believe that our technique can be used for non-adversarial unsupervised domain translation. We leave this direction to future work.
We presented an effective approach for weakly-supervised image disentanglement, which combines architectural bias, weak supervision, latent optimization, and a second stage of "synthetic" supervision. Notably, our approach does not require adversarial training. In the experiments, our approach achieved state-of-the-art performance, outperforming other top adversarial and non-adversarial methods.
References
- Bojanowski et al.  Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. ICML, 2018.
- Brock et al.  Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. ICLR, 2019.
- Chen et al.  Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018.
- Denton and Birodkar  Emily L Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In NIPS, 2017.
- Dumoulin et al.  Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016.
- Gatys et al.  Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Harsh Jha et al.  Ananya Harsh Jha, Saket Anand, Maneesh Singh, and VSR Veeravasarapu. Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, 2018.
- Higgins et al.  Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- Hoshen and Malik  Yedid Hoshen and Jitendra Malik. Non-adversarial image synthesis with generative latent nearest neighbors. CVPR, 2019.
- Hoshen and Wolf  Yedid Hoshen and Lior Wolf. Nam: Non-adversarial unsupervised domain mapping. In ECCV, 2018.
- Huang et al.  Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
- Isola et al.  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Karras et al.  Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
- Kim and Mnih  Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
- Kingma and Welling  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
- Laptev et al.  Ivan Laptev, Barbara Caputo, et al. Recognizing human actions: a local SVM approach. In ICPR, 2004. http://www.nada.kth.se/cvap/actions/
- LeCun et al.  Y LeCun, Fu Jie Huang, and L Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004. https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/
- Liu et al.  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
- Liu et al.  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015. http://mmlab.ie.cuhk.edu.hk/projects/celeba.html
- Locatello et al.  Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. ICML, 2019.
- Mathieu et al.  Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.
- Miyato and Koyama  Takeru Miyato and Masanori Koyama. cgans with projection discriminator. ICLR, 2018.
- Reed et al.  Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In NIPS, 2015. https://github.com/carpedm20/visual-analogy-tensorflow
- Ridgeway and Mozer  Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In NeurIPS, 2018.
- Szabó et al.  Attila Szabó, Qiyang Hu, Tiziano Portenier, Matthias Zwicker, and Paolo Favaro. Challenges in disentangling independent factors of variation. ICLRW, 2018.
- Zhang et al.  Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.