
Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement

by   Xuanchi Ren, et al.

Content and style (C-S) disentanglement intends to decompose the underlying explanatory factors of objects into two independent subspaces. From the unsupervised disentanglement perspective, we rethink content and style and propose a formulation for unsupervised C-S disentanglement based on our assumption that different factors are of different importance and popularity for image reconstruction, which serves as a data bias. The corresponding model inductive bias is introduced by our proposed C-S disentanglement Module (C-S DisMo), which assigns different and independent roles to content and style when approximating the real data distributions. Specifically, each content embedding from the dataset, which encodes the most dominant factors for image reconstruction, is assumed to be sampled from a shared distribution across the dataset. The style embedding for a particular image, encoding the remaining factors, is used to customize the shared distribution through an affine transformation. The experiments on several popular datasets demonstrate that our method achieves the state-of-the-art unsupervised C-S disentanglement, which is comparable or even better than supervised methods. We verify the effectiveness of our method by downstream tasks: domain translation and single-view 3D reconstruction. Project page at







[ICCVW 2021] Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement


1 Introduction

The disentanglement task aims to recover the underlying explanatory factors of natural images into different dimensions of latent space, and to provide an informative representation for downstream tasks such as image translation (Wu et al., 2019c; Kotovenko et al., 2019), domain adaptation (Li et al., 2019), and geometric attribute extraction (Xing et al., 2019).

In this work, we focus on content and style (C-S) disentanglement, where content and style represent two independent latent subspaces. Most of the previous C-S disentanglement works (Denton and Birodkar, 2017; Jha et al., 2018; Bouchacourt et al., 2018; Gabbay and Hoshen, 2020) rely on supervision. For example, Gabbay and Hoshen (2020) achieve disentanglement by forcing images from the same group to share a common embedding. It is not tractable, however, to collect such a dataset (e.g. groups of paintings with each group depicting the same scene in different styles). To the best of our knowledge, the only exception is Wu et al. (2019b), which, however, forces the content to encode a pre-defined geometric structure limited by the expressive ability of 2D landmarks.

Previous works define the content and style based on either the supervision or manually pre-defined attributes. There is no general definition of content and style for unsupervised C-S disentanglement. In this work, we define content and style from the perspective of VAE-based unsupervised disentanglement works (Higgins et al., 2017; Burgess et al., 2018; Kim and Mnih, 2018; Chen et al., 2018). These methods try to explain the images with the latent factors, of which each controls only one interpretable aspect of the images. However, extracting all disentangled factors is a very challenging task, and Locatello et al. (2019) prove that unsupervised disentanglement is fundamentally impossible without inductive bias on both the model and data. Furthermore, these methods have limited down-stream applications due to poor image generation quality on real-world datasets.

Inspired by the observation that the latent factors have different degrees of importance for image reconstruction (Burgess et al., 2018), we assume the disentangled factors are of different importance when modeling the real data distributions. Instead of finding all the independent factors, which is challenging, we make the problem tractable by defining content as a group of factors that are the most important ones for image reconstruction across the whole dataset, and defining style as the remaining ones. Take the human face dataset CelebA (Liu et al., 2015) as an example: as pose is a more dominant factor than identity for image reconstruction across the face dataset, content encodes pose, and identity is encoded by style. We further assume that each content embedding of the dataset is sampled from a shared distribution, which characterizes the intrinsic property of content. For the real-world dataset CelebA, the shared distribution of content (pose) is approximately a Standard Normal Distribution, where the zero-valued embedding stands for the canonical pose. For the synthetic dataset Chairs (Aubry et al., 2014), as each image is synthesized from equally distributed surround views, the shared distribution of content (pose) is approximately a Uniform Distribution.

Based on the above definitions and assumptions, we propose a problem formulation for unsupervised C-S disentanglement, and a C-S Disentanglement Module (C-S DisMo) which assigns different and independent roles to content and style when approximating the real data distributions. Specifically, C-S DisMo forces the content embeddings of individual images to follow a common distribution, and the style embeddings are used to scale and shift the common distribution to match the target image distribution via a generator. With the above assumptions as the data inductive bias, and C-S DisMo as the corresponding model inductive bias, we achieve unsupervised C-S disentanglement with good image generation quality. Furthermore, we demonstrate the effectiveness of our disentangled C-S representations on two down-stream applications, i.e., domain translation and single-view 3D reconstruction.

Figure 1: Overview of our method. Content embeddings are labelled with different shapes, and style embeddings are labelled with different colors. A C-S Disentanglement Module (C-S DisMo) is composed of a $\Psi$-constraint and an affine transformation. The $\Psi$-constraint forces content embeddings to follow a shared distribution $\Psi$, and the affine transformation scales and shifts the shared content distribution with different styles (colors) as the Generator’s input to approximate the target image distributions. Each image in the grids (right side) is generated with the content embedding from the column and the style embedding from the row.

We follow Gabbay and Hoshen (2020) to apply latent optimization to optimize the embeddings and the parameters of the generator. Please note that we only use the image reconstruction loss as the supervision; no human annotation is needed. We also propose to use instance discrimination as an auxiliary constraint to assist the disentanglement.
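Latent optimization treats the per-image embeddings as free parameters that are updated by gradient descent together with the generator, instead of being predicted by an encoder. The NumPy sketch below illustrates only that idea on a toy problem. The linear "generator", the random stand-in data, and all names (`images`, `embeddings`, `weights`, `lr`) are assumptions made for illustration; they are not the architecture or the loss used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
num_images, image_dim, embed_dim = 8, 16, 4
images = rng.normal(size=(num_images, image_dim))  # stand-in "dataset"

# Trainable parameters: one embedding per image, plus the generator weights.
embeddings = rng.normal(scale=0.1, size=(num_images, embed_dim))
weights = rng.normal(scale=0.1, size=(embed_dim, image_dim))


def reconstruction_loss(embeddings, weights):
    """Mean squared error between generated and target images."""
    return np.mean((embeddings @ weights - images) ** 2)


initial_loss = reconstruction_loss(embeddings, weights)
lr = 0.2
for _ in range(1000):
    residual = embeddings @ weights - images
    grad_scale = 2.0 / residual.size
    # Gradients of the mean squared error w.r.t. both sets of parameters.
    grad_embeddings = grad_scale * residual @ weights.T
    grad_weights = grad_scale * embeddings.T @ residual
    # Embeddings and generator are updated jointly; no encoder is involved.
    embeddings -= lr * grad_embeddings
    weights -= lr * grad_weights
final_loss = reconstruction_loss(embeddings, weights)
```

The only supervision signal in this toy loop is the reconstruction error, which mirrors the statement above that no human annotation is needed.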

The experiments on several popular datasets demonstrate that our method achieves the state-of-the-art (SOTA) unsupervised C-S disentanglement, which is comparable or even better than supervised methods. Furthermore, by simplifying the factors disentanglement problem into the C-S disentanglement problem, we achieve much better performance than the SOTA VAE-based unsupervised disentanglement method when modified for C-S disentanglement by manually splitting the factors into content and style.

Main contributions. The main contributions of our work are as follows:
  • By rethinking content and style from the perspective of VAE-based unsupervised disentanglement, we achieve unsupervised C-S disentanglement by introducing both data and model inductive bias.
  • We propose the C-S DisMo to assign different and independent roles to content and style when modeling the real data distributions, and we provide several solutions for the distribution constraint of content.
  • We verify the effectiveness of our method by presenting two down-stream applications based on the well-disentangled content and style.

2 Related Work

Unsupervised Disentanglement. There have been many studies on unsupervised disentangled representation learning (Higgins et al., 2017; Burgess et al., 2018; Kim and Mnih, 2018; Chen et al., 2018). These models learn disentangled factors by factorizing the aggregated posterior. However, Locatello et al. (2019) prove that unsupervised disentanglement is impossible without introducing inductive bias on both the models and data. Therefore, these models are currently unable to obtain a promising disentangled representation. Inspired by these previous unsupervised disentanglement works, we revisit and formulate the unsupervised C-S disentanglement problem. We simplify the challenging task of extracting individual disentanglement factors into a manageable task: disentangling two groups of factors (content and style).

C-S Disentanglement. Originating from style transfer, most of the prior works on C-S disentanglement divide latent variables into two spaces relying on group supervision. To achieve disentanglement, Mathieu et al. (2016) and Szabó et al. (2018) combine the adversarial constraint and auto-encoders. Meanwhile, VAE (Kingma and Welling, 2014) is combined with non-adversarial constraints, such as cycle consistency (Jha et al., 2018) and evidence accumulation (Bouchacourt et al., 2018). Furthermore, latent optimization is shown to be superior to amortized inference for C-S disentanglement (Gabbay and Hoshen, 2020). The only exception is Wu et al. (2019b), which proposes a variational U-Net with structure learning for disentanglement in an unsupervised manner, but is limited by the expressive ability of 2D landmarks. In our work, we focus on the unsupervised C-S disentanglement problem and explore inductive bias for unsupervised disentanglement.

Key Difference from Image Translation. Image translation (Huang et al., 2018; Liu et al., 2019) between domains tries to decompose the latent space into domain-shared representations and domain-specific representations with the domain label of each image as supervision. The decomposition relies on the “swapping” operation and pixel-level adversarial loss without semantic level disentanglement ability. This pipeline fails in the unsupervised C-S disentanglement task on the single domain dataset due to lack of domain supervision, as demonstrated in Figure 8. Our unsupervised C-S disentanglement task is to disentangle the latent space into content (containing most dominant factors typically carrying high-level semantic attributes) and style (containing the rest of the factors). We achieve disentangled content and style by assigning different roles to them without relying on domain supervision or the “swapping” operation. We formulate the problem for a single domain but it can be extended to cross-domain to achieve domain translation without domain supervision, as shown in Figure 9.

3 Exploring Inductive Bias for Unsupervised C-S Disentanglement

3.1 Problem Formulation

For a given dataset $D = \{I_i\}_{i=1}^{N}$, where $N$ is the total number of images, we assume each image $I_i$ is sampled from a distribution $P(x \mid f_1, f_2, \ldots, f_k)$, where $\{f_j\}_{j=1}^{k}$ are the disentangled factors. Disentangling all these factors without supervision is a challenging task, which has been proved to be fundamentally impossible without introducing model and data inductive bias (Locatello et al., 2019). Based on the observation that the factors play roles of different importance for image reconstruction (Burgess et al., 2018), we assume $\{f_j\}_{j=1}^{k}$ are of different importance and popularity for modeling the image distribution $P$. We define the content $c$ as representing the most important factors across the whole dataset for image reconstruction and the style $s$ as representing the remaining ones. We assume $c$ follows a shared distribution $\Psi$ across the whole dataset, and assign each image $I_i$ a style embedding $s_i$ which parameterizes $\Psi$ to be an image-specific distribution $\Psi_{s_i}$. This serves as the data bias for our unsupervised C-S disentanglement.

With a generator $G_\theta$ that maps content and style embeddings to images, where $\theta$ is the parameter of the generator, we further parameterize the target image distributions as $\hat{P}_{\theta, s_i}(x \mid c)$. For each image $I_i$, we assign $c_i$ as the content embedding. All the content embeddings $\{c_i\}_{i=1}^{N}$ should conform to the assumed distribution of content $\Psi$, which is denoted as $\{c_i\}_{i=1}^{N} \sim \Psi$. Then we are able to estimate the likelihood of $I_i$ by $\hat{P}_{\theta, s_i}(x = I_i \mid c = c_i)$. Given the dataset $D$, our goal is to minimize the negative log-likelihood of $\hat{P}$:
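One plausible way to write this objective is sketched below. The notation is an assumption chosen for illustration: $I_i$ is the $i$-th image, $c_i$ and $s_i$ its content and style embeddings, $\theta$ the generator parameters, $\hat{P}_{\theta, s_i}$ the parameterized image distribution, and $\Psi$ the shared content distribution. The first term is the negative log-likelihood of the images. The second term is one assumed way to express the requirement that the content embeddings conform to the shared distribution, and its exact form here should be read as a sketch rather than as the paper's Eq. 1.

```latex
\min_{\theta,\,\{c_i\},\,\{s_i\}} \;
  -\sum_{i=1}^{N} \log \hat{P}_{\theta, s_i}\!\left(x = I_i \mid c = c_i\right)
  \;-\; \sum_{i=1}^{N} \log \Psi\!\left(c = c_i\right)
```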


3.2 Proposed C-S Disentanglement Module

Here we propose a framework to address the problem formulated in Section 3.1. We design a C-S Disentanglement Module (C-S DisMo) to assign different roles to content and style in modeling real data distributions according to their definitions (data bias) in Section 3.1, which serves as the corresponding model bias.

More specifically, as shown in Figure 1, a C-S DisMo is composed of a $\Psi$-constraint to enforce content embeddings to conform to $\Psi$, which corresponds to the second term in Eq. 1, and an affine transformation serving to customize the shared content distribution into image-specific distributions. This module is followed by the generator to generate the target image.

The affine transformation is inspired by the observation that the mean and variance of features carry individual information (Gatys et al., 2016; Li and Wand, 2016; Li et al., 2017; Huang and Belongie, 2017). We use the style embeddings to provide the statistics to scale and shift content embeddings as

$$z_i = f_\sigma(s_i) \odot c_i + f_\mu(s_i), \qquad (2)$$

where $f_\sigma$ and $f_\mu$ are two fully connected layers predicting the scalars for scaling and shifting respectively. When $\Psi$ is a Normal Distribution, Eq. 1 is equivalent to minimizing:


with the proof provided in Appendix I.
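The scale-and-shift step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the class name `AffineTransform`, the layer sizes, and the initialization (scale bias of one, shift bias of zero) are hypothetical choices, and each "fully connected layer" is reduced to a single weight matrix plus bias.

```python
import numpy as np


class AffineTransform:
    """Hypothetical sketch: a style embedding predicts a scale and a shift
    that are applied element-wise to a content embedding."""

    def __init__(self, style_dim, content_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Two fully connected layers: one predicts the scale, one the shift.
        self.w_scale = rng.normal(0.0, 0.1, (style_dim, content_dim))
        self.b_scale = np.ones(content_dim)   # assumed init: identity scale
        self.w_shift = rng.normal(0.0, 0.1, (style_dim, content_dim))
        self.b_shift = np.zeros(content_dim)  # assumed init: zero shift

    def __call__(self, content, style):
        scale = style @ self.w_scale + self.b_scale
        shift = style @ self.w_shift + self.b_shift
        # Customize the shared content code with image-specific statistics.
        return scale * content + shift


layer = AffineTransform(style_dim=3, content_dim=5)
content = np.arange(10.0).reshape(2, 5)
style = np.ones((2, 3))
generator_input = layer(content, style)  # would be fed to the generator
```

With the assumed initialization, a zero style embedding leaves the content embedding unchanged, which makes the role of style as a modifier of a shared code easy to see.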

For the reconstruction term in Eq. 3, we adopt a VGG perceptual loss (Simonyan and Zisserman, 2015; Ren et al., 2020), which is widely used in unsupervised disentanglement methods (Wu et al., 2020, 2019b).

Figure 2: Generated images at different training steps. The first and second rows share the same style embedding. The second and third rows share the same content embedding.

For the $\Psi$-constraint, i.e. the second term in Eq. 3, we propose and study discrimination-based, NLL-based and normalization-based solutions. The form of $\Psi$ should be carefully selected to better approximate the ground truth content distribution of the dataset. We describe details of these solutions and related limitations according to the form of $\Psi$ below.

Discrimination-based solution can be adopted when $\Psi$ has a tractable form for sampling. Inspired by adversarial learning (Karras et al., 2019), we propose to use a discriminator to distinguish between content embeddings $\{c_i\}$ (false samples) and items sampled from $\Psi$ (true samples). When it is difficult for the discriminator to distinguish true from false, the content embeddings are likely to follow $\Psi$.

NLL-based solution is inspired by flow-based generative models (Kingma and Dhariwal, 2018), and can be adopted when $\Psi = \mathcal{N}(\mu, \sigma^2)$. We can use negative log-likelihood (NLL) to optimize $\{c_i\}$ to follow $\Psi$ as


Normalization-based solution can be adopted when $\Psi$ has one of the following specific forms: i) a Standard Normal Distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and ii) a Uniform Distribution. To approximately follow the constraint, Instance Normalization (IN) is used to force the mean and variance of $c_i$ to be zeros and $\mathbf{I}$ respectively. When $\Psi$ is a Uniform Distribution, we can use $L_2$ normalization to force $\{c_i\}$ to follow a Uniform Distribution approximately (Muller, 1959).
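The two normalizations can be sketched in NumPy as below. The function names, the `eps` values, and the choice to normalize each embedding across its own dimensions are assumptions made for illustration; the paper's implementation may differ in these details.

```python
import numpy as np


def instance_normalize(c, eps=1e-5):
    """Give each content embedding (row) zero mean and unit variance
    across its dimensions, matching a standard normal in its first
    two moments."""
    mean = c.mean(axis=-1, keepdims=True)
    var = c.var(axis=-1, keepdims=True)
    return (c - mean) / np.sqrt(var + eps)


def l2_normalize(c, eps=1e-12):
    """Project each content embedding (row) onto the unit hypersphere."""
    norm = np.linalg.norm(c, axis=-1, keepdims=True)
    return c / np.maximum(norm, eps)


rng = np.random.default_rng(0)
raw_content = rng.normal(loc=3.0, scale=2.0, size=(4, 16))
content_in = instance_normalize(raw_content)
content_l2 = l2_normalize(raw_content)
```

Neither function adds a loss term; the constraint is imposed by construction, which is the property the normalization-based solution relies on. The link to a uniform distribution is that Gaussian vectors rescaled to unit length are uniformly distributed on the sphere (Muller, 1959).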

For these solutions, we show the qualitative and quantitative comparisons in Figure 3 and Table 3 respectively to verify their effectiveness. Furthermore, discrimination-based and NLL-based solutions need extra optimization terms, which introduce instability. In our work, we mainly adopt the normalization-based solution to meet the $\Psi$-constraint.
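To make the "extra optimization term" of the NLL-based solution concrete, the sketch below computes a Gaussian negative log-likelihood over a batch of content embeddings. It assumes the shared content distribution is a Gaussian with independent dimensions and scalar parameters `mu` and `sigma`; the function name and the averaging over the batch are illustrative choices, not details taken from the paper.

```python
import numpy as np


def gaussian_nll(c, mu=0.0, sigma=1.0):
    """Mean negative log-likelihood of embeddings under N(mu, sigma^2),
    treating every dimension as independent (an assumption)."""
    c = np.asarray(c, dtype=float)
    per_dim = 0.5 * np.log(2.0 * np.pi * sigma**2) + (c - mu) ** 2 / (2.0 * sigma**2)
    # Sum over embedding dimensions, then average over the batch.
    return per_dim.sum(axis=-1).mean()


at_mode = gaussian_nll(np.zeros((4, 3)))
off_mode = gaussian_nll(np.ones((4, 3)))
```

Such a term would be added to the reconstruction loss and minimized with respect to the content embeddings, in contrast to the normalization-based solution, which needs no additional term.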

As shown in Figure 1, we can use the C-S DisMo before the generator, denoted as the Single C-S DisMo framework. We can also insert it before each layer of the generator to provide multiple paths for disentanglement, denoted as the Multiple C-S DisMo framework. For more details, please refer to Appendix A.

3.3 Demystifying C-S Disentanglement

In this section, we perform some experiments to verify that the C-S disentanglement is achieved by introducing inductive bias on model (C-S DisMo) and data (our assumptions of the dataset). The experimental setting can be found in Section 4.

To understand how C-S DisMo achieves disentanglement, we visualize the generated images during the training process of CelebA in Figure 2. As the generated images show, a mean shape of faces is first learned. Then the faces start to rotate, which indicates the pose, as a dominant factor for generation, is disentangled as content. After that, the identity features emerge since the identity is disentangled as style for better image generation.

If we treat content and style equally, i.e., concatenating content and style embedding as the input of the generator, the network can hardly disentangle any meaningful information for the CelebA dataset, as shown in Figure 3 (a). Our Single C-S DisMo framework with different solutions to meet the $\Psi$-constraint can disentangle the content (pose) and style (identity) of human faces, as shown in Figure 3 (c)-(e). When the $\Psi$-constraint is removed from C-S DisMo, the result is shown in Figure 3 (b), where the pose and identity cannot be disentangled. For the Multiple C-S DisMo framework, as multiple paths are provided and the network has more flexibility to approximate the target image distribution, it outperforms the Single C-S DisMo framework, as shown in Figure 3 (f).

We conduct experiments to demonstrate that better disentanglement can be achieved by choosing a better form for $\Psi$. For the real-world dataset CelebA, the distribution of pose is better modeled as a Standard Normal Distribution. As Figure 4 (a) and (b) show, IN achieves better disentanglement than $L_2$ normalization. For the synthetic Chairs (Aubry et al., 2014) dataset, the distribution of pose is close to a Uniform Distribution rather than a Standard Normal Distribution. Figure 4 (c) and (d) show that the $L_2$ normalization results in better consistency of identity and pose.

Figure 3: Ablation study of C-S DisMo. Panels: (a) Concatenation, (b) w/o $\Psi$-constraint, (c) Discrimination, (d) NLL, (e) IN, (f) Multiple w/ IN. For each image, the content embedding is from the topmost image in the same column, and style embedding is from the leftmost image in the same row. A good disentanglement is that: horizontally, the style (identity) of the images is well maintained when the content (pose) varies, and vertically, the content of the images is well aligned when the style varies.
Figure 4: Comparison of the disentanglement with different normalizations. Panels: (a) IN and (b) $L_2$ normalization on CelebA; (c) IN and (d) $L_2$ normalization on Chairs. Instance Normalization (IN) achieves better results on CelebA, e.g., the face identities are more alike. $L_2$ normalization outperforms IN on Chairs, where the shapes of chairs are more consistent in each row.

3.4 Auxiliary Loss Function

In addition to the loss in Eq. 3, we propose two auxiliary loss functions to help the model better disentangle content and style.

Instance discrimination. Instance discrimination can discover image-specific features (Wu et al., 2018). The image-specific feature corresponds to style according to our definition. Inspired by this, we first pretrain a backbone network on the target dataset with instance discrimination. Then we adopt contrastive learning to help disentangle content and style by pulling together the images with the same content embeddings and pushing away the images with different content embeddings in the backbone’s representation space. We refer to this term as the contrastive loss. The implementation details can be found in Appendix C.

Information bottleneck. Burgess et al. (2018) propose improving the disentanglement by controlling the capacity increment. This motivates us to control the information bottleneck capacity of content and style to help avoid leakage. We refer to this term as the information bottleneck loss. The details of this loss are provided in Appendix C.

Figure 5: Demonstrations of the content and style space by interpolation (a & b) and retrieval (c-e).
Figure 6: Comparison of visual analogy results on Chairs, Cars3D and CelebA (from top to bottom). Panels: FactorVAE, Lord, Ours. Zoom in for details.

Full objective. Therefore, our full objective is a weighted sum of the loss in Eq. 3, the contrastive loss, and the information bottleneck loss, where the hyperparameters represent the weights for each loss term respectively. The ablation study for the auxiliary loss terms is presented in Appendix E.

4 Experiments

In this section, we perform quantitative and qualitative experiments to evaluate our method. We test our method on several datasets: Cars3D (Reed et al., 2015), Chairs (Aubry et al., 2014) and CelebA (Liu et al., 2015). For these three datasets, pose is the most dominant factor and is encoded by content. For details of the datasets and results on more datasets, please refer to Appendix B and D.

Baselines. We choose several SOTA group-supervised C-S disentanglement benchmarks for comparisons: Cycle-VAE (Jha et al., 2018), DrNet (Denton and Birodkar, 2017) and Lord (Gabbay and Hoshen, 2020). We select the only unsupervised C-S disentanglement method, Wu et al. (2019b) (footnote 1: There is no open-sourced implementation for it. We modify and provide pseudo ground truth landmarks to the network. Thus it effectively becomes semi-supervised.). We choose one VAE-based unsupervised disentanglement method: FactorVAE (Kim and Mnih, 2018). For FactorVAE, according to our definition of content and style, we manually traverse the latent space to select the factors related to pose as content and treat the other factors as style, for all three datasets. More details for baselines are presented in Appendix B.

4.1 Quantitative Experiments

We compare our method (Multiple C-S DisMo framework) with the baselines on Cars3D, Chairs and CelebA.

Content Transfer Metric. To evaluate our method’s disentanglement ability, we follow the protocol of Gabbay and Hoshen (2020) to measure the quality of content transfer by LPIPS (Zhang et al., 2018). Details are presented in Appendix A. The results are shown in Table 1. We achieve the best performance among the unsupervised methods, even though pseudo labels are provided for  Wu et al. (2019b). Our method significantly outperforms FactorVAE, which verifies the effectiveness of our formulation: simplifying the problem from disentangling factors to disentangling content and style. Furthermore, our method is comparable to or even better than the supervised ones.

Classification Metric. Classification accuracy is used to evaluate disentanglement in Denton and Birodkar (2017); Jha et al. (2018); Gabbay and Hoshen (2020). We follow the protocol of Jha et al. (2018). Low classification accuracy indicates small leakage between content and style. Without content annotations for CelebA, we regress the position of the facial landmarks from the style embeddings instead. The results are summarized in Table 2. Though without supervision, the performance of our method is still comparable to several other methods. We note that the classification metric may not be appropriate for disentanglement, which is also observed in Liu et al. (2020). The observation is that the classification metric is also influenced by information capacity and dimensions of embeddings. For FactorVAE, the poor reconstruction quality indicates that the content and style embeddings encode little information that can hardly be identified. The dimensions of the content and style embeddings of different methods vary from ten to hundreds, and a higher dimension usually leads to easier classification.

| Method | Supervision | Cars3D | Chairs | CelebA |
|---|---|---|---|---|
| DrNet (Denton and Birodkar, 2017) | group | 0.146 | 0.294 | 0.221 |
| Cycle-VAE (Jha et al., 2018) | group | 0.148 | 0.240 | 0.202 |
| Lord (Gabbay and Hoshen, 2020) | group | 0.089 | 0.121 | 0.163 |
| FactorVAE (Kim and Mnih, 2018) | none | 0.190 | 0.287 | 0.369 |
| Wu et al. (2019b) | pseudo landmarks | - | - | 0.185 |
| Ours | none | 0.082 | 0.190 | 0.161 |

Table 1: Performance comparison in content transfer metric (lower is better). For Wu et al. (2019b), we provide pseudo facial landmarks, and there are no suitable landmarks for cars and chairs.
| Method | Supervision | Cars3D | Chairs | CelebA |
|---|---|---|---|---|
| DrNet (Denton and Birodkar, 2017) | group | 0.27 / 0.03 | 0.06 / 0.01 | 4.99 / 0.00 |
| Cycle-VAE (Jha et al., 2018) | group | 0.81 / 0.77 | 0.60 / 0.01 | 2.80 / 0.12 |
| Lord (Gabbay and Hoshen, 2020) | group | 0.03 / 0.09 | 0.02 / 0.01 | 4.42 / 0.01 |
| FactorVAE (Kim and Mnih, 2018) | none | 0.07 / 0.01 | 0.14 / 0.01 | 5.34 / 0.00 |
| Wu et al. (2019b) | pseudo landmarks | - | - | 5.42 / 0.11 |
| Ours | none | 0.33 / 0.24 | 0.66 / 0.05 | 4.15 / 0.05 |

Table 2: Classification accuracy of style labels from content codes and of content labels from style codes (lower is better), with two values per dataset. For Wu et al. (2019b), we provide pseudo ground truth landmarks. Note that the first CelebA value presents the error of face landmark regression from the style embeddings (higher is better).

4.2 Qualitative Experiments

Disentanglement & Alignment. In Figure 5 (a) and (b), we conduct linear interpolation to show the variation in the two embedding spaces. Horizontally, with the interpolated style embeddings, the identity (style) is changed smoothly while the pose (content) is well maintained. Vertically, the identity remains the same as the pose changes. We have the following observations: (i) The learned content and style spaces are continuous. (ii) Columns of the left and right figures share the same pose, suggesting that the learned content spaces are well aligned. (iii) Factors encoded by style are maintained when changing the content embeddings and vice versa, suggesting good disentanglement.
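The linear interpolation used for this kind of visualization can be sketched as follows. The helper name `interpolate` and the toy two-dimensional embeddings are illustrative assumptions; in practice each interpolated embedding would be passed to the trained generator together with a fixed embedding from the other space.

```python
import numpy as np


def interpolate(start, end, steps):
    """Return `steps` embeddings evenly spaced on the line from start to end."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # interpolation weights, shape (steps, 1)
    return (1.0 - t) * start + t * end


style_a = np.array([0.0, 0.0])
style_b = np.array([1.0, 2.0])
style_path = interpolate(style_a, style_b, steps=5)
```

Holding the content embedding fixed while walking along `style_path` corresponds to the horizontal direction described above, and swapping the roles of the two spaces gives the vertical direction.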

We perform retrieval on the content and style latent spaces, respectively. As shown in Figure 5 (c) and (d), given a query image (labeled with a red box), its nearest neighbors in the content space share the same pose but have different identities, which reveals the content space is well aligned. To better identify the faces, we let the query’s nearest neighbors in the style space share the same pose, and the generated faces look very similar, revealing that the style is well maintained. As shown in Figure 5 (e), a zero-valued content embedding results in a canonical view. As we assume that the pose distribution of faces is $\mathcal{N}(\mathbf{0}, \mathbf{I})$, the canonical views are the most common pose in the dataset, and the zero-valued content embedding has the largest likelihood accordingly.

Figure 7: 3D reconstruction results on Chairs. Columns: Input, Our generated multi-view, Single, Ours, GT. Single: the object reconstructed by only Input. Ours: the object reconstructed from multi-view inputs generated by our method from Input. GT: the object reconstructed by the ground truth of multi-view inputs.
Figure 8: Comparison with MUNIT (Huang et al., 2018) and Park et al. (2020). Panels: (a) MUNIT, (b) Park et al. (2020), (c) Ours, (d) Our fine. MUNIT (Huang et al., 2018) and Park et al. (2020) learn the texture information, which is different from Ours (c). Our fine (d) is the one in which we only exchange the fine styles.

Visual Analogy & Comparison. Visual analogy (Reed et al., 2015) is to switch style and content embeddings for each pair. We show the visual analogy results of our method against FactorVAE (typical unsupervised baseline) and Lord (strongest supervised baseline) in Figure 6 on Chairs, Cars3D, and CelebA. The results show that FactorVAE on all datasets is of poor generation quality and bad content transfer. On Cars3D, Lord’s results have artifacts (e.g., third column), and its style embeddings could not encode the color information of the test images (e.g., fourth row). On CelebA, the transfer result of Lord is not consistent, e.g., the content embedding controls facial expression in the fifth column, while other content embeddings do not control expression. Our method achieves comparable pose transfer to Lord and maintains the identities of the images. For more results (including on other datasets), please refer to Appendix D.
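A visual analogy grid of this kind can be assembled by pairing every content embedding with every style embedding. In the sketch below, `analogy_grid` and the `generator` callable are hypothetical names; the row and column convention (style from the row, content from the column) follows the figure descriptions in this paper, and the toy generator only records which embeddings it received.

```python
def analogy_grid(generator, contents, styles):
    """Build a grid where row r uses styles[r] and column k uses contents[k]."""
    return [[generator(content, style) for content in contents] for style in styles]


def toy_generator(content, style):
    """Stand-in for a trained generator: returns its inputs as a pair."""
    return (content, style)


grid = analogy_grid(toy_generator, contents=["c0", "c1", "c2"], styles=["s0", "s1"])
```

Reading along a row then shows one style under varying content, and reading down a column shows one content under varying style.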

4.3 Ablation Study

Besides the qualitative experiment shown in Figure 4, we perform an ablation study on CelebA to evaluate different solutions for the $\Psi$-constraint introduced in Section 3.2. In this subsection, we do not use auxiliary loss functions. As shown in Table 3, all the solutions can achieve the SOTA performance in terms of content transfer metric, which means that the $\Psi$-constraint for content embeddings is crucial. This result further verifies that our definition is reasonable. For the classification metric, the results of discrimination-based and NLL-based solutions are relatively poor due to the reasons discussed in Section 4.1. The normalization-based solution achieves the best results on all the metrics. We believe that this is because the normalization-based solution does not introduce an extra optimization term, which may hurt the optimization process and limit the expressive ability of embeddings.

4.4 Unseen Images Inference

Our method can be generalized to the held-out data. A solution is to train two encoders to encode images to the content and style spaces respectively. We train a style encoder and a content encoder by minimizing


We apply our model trained on the CelebA dataset to faces collected by Wu et al. (2020) including paintings and cartoon drawings. As shown in Figure 10, our method can be well generalized to unseen images from different domains.
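One way to picture the encoder training step for held-out data is as a regression from images onto the embeddings already obtained by latent optimization. The sketch below makes that assumption explicit and simplifies it heavily: the encoder is a single linear map fitted by least squares, the data are random stand-ins, and the names `fit_linear_encoder`, `true_map`, and `content_encoder` are hypothetical. It is not the objective or the architecture used in this work.

```python
import numpy as np


def fit_linear_encoder(images, embeddings):
    """Least-squares fit of W such that images @ W approximates embeddings."""
    weights, *_ = np.linalg.lstsq(images, embeddings, rcond=None)
    return weights


rng = np.random.default_rng(0)
images = rng.normal(size=(50, 10))          # flattened training images (toy)
true_map = rng.normal(size=(10, 3))
content_embeddings = images @ true_map      # pretend these came from latent optimization

content_encoder = fit_linear_encoder(images, content_embeddings)

unseen_images = rng.normal(size=(5, 10))    # held-out images
predicted_content = unseen_images @ content_encoder
```

A style encoder would be fitted the same way against the learned style embeddings, after which an unseen image can be mapped to both spaces and passed to the generator.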

Figure 9: Examples of translating shoes to edge (left column) and translating edges to shoes (right column). Triplet order (left to right) is: content, style, translation.
| Method | Content transfer metric | Classification metric |
|---|---|---|
| Single | 0.204 | 3.03 / 0.06 |
| Single w/ Disc | 0.178 | 2.97 / 0.14 |
| Single w/ NLL | 0.171 | 2.98 / 0.09 |
| Single w/ IN | 0.166 | 3.46 / 0.04 |

Table 3: Ablation study for different solutions for the $\Psi$-constraint on CelebA. Disc means discrimination-based solution.
Figure 10: Inference for unseen images. Our method performs well on images from different domains: painting and cartoon.

4.5 Comparison with Image Translation

As shown in Figure 8 (d), we can also achieve similar performance in exchanging the tone of the images by exchanging the fine style, which is the style input of the last C-S DisMo in the Multiple C-S DisMo framework. The affine transformation of our work plays the same role as in image translation works. The key difference is that we have the $\Psi$-constraint to force the content embeddings to follow a common distribution.

4.6 Extension for Applications

In this work, we explore two applications of C-S disentanglement. For 3D reconstruction, single-view settings lack reliable 3D constraints (Wu et al., 2019a). Based on our disentangled representations, we can generate multi-view images from a single view. On Chairs, we adopt Pix2Vox (Xie et al., 2019), a framework for single-view and multi-view 3D reconstruction, to verify the effectiveness of our method. As shown in Figure 7, the 3D objects reconstructed from the multi-view images generated by our method are much better than those reconstructed from a single view, and even comparable to those reconstructed from ground-truth multi-view images. For more results, please refer to Appendix G.

For domain translation, our method can work on the images merged from two domains without using any domain label. As shown in Figure 9, based on the disentangled content (edge structure) and style (texture), we can translate edge images into shoe images and vice versa. Please refer to Appendix H for more about domain translation.

5 Conclusion

We propose a definition for content and style and a problem formulation for unsupervised C-S disentanglement. Based on the formulation, C-S DisMo is proposed to assign different and independent roles to content and style when approximating the real data distributions. Our method outperforms other unsupervised approaches and achieves performance comparable to, or even better than, that of the SOTA supervised methods. As a limitation, our method fails on datasets that contain multiple categories with large appearance variation, e.g., CIFAR-10 (Krizhevsky et al., 2009), since such datasets do not match our assumption. Our method could be adopted to help downstream tasks, e.g., domain translation and single-view 3D reconstruction. An interesting direction is to apply our method to contrastive learning: with disentangled representations, contrastive learning could perform more effectively.

References


  • M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of CAD models. In CVPR, Cited by: §1, §3.3, §4.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI, Cited by: §1, §2.
  • C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599. Cited by: §1, §1, §2, §3.1, §3.4.
  • R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, Cited by: §1, §2.
  • E. L. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In NeurIPS, Cited by: §1, §4.1, Table 1, Table 2, §4.
  • A. Gabbay and Y. Hoshen (2020) Demystifying inter-class disentanglement. In ICLR, Cited by: §1, §1, §2, §4.1, §4.1, Table 1, Table 2, §4.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, Cited by: §3.2.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §1, §2.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §3.2.
  • X. Huang, M. Liu, S. J. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2, Figure 8.
  • A. H. Jha, S. Anand, M. Singh, and V. S. R. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, Cited by: §1, §2, §4.1, Table 1, Table 2, §4.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §3.2.
  • H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, Cited by: §1, §2, Table 1, Table 2, §4.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §2.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §3.2.
  • D. Kotovenko, A. Sanakoyeu, S. Lang, and B. Ommer (2019) Content and style disentanglement for artistic style transfer. In ICCV, Cited by: §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.
  • C. Li and M. Wand (2016) Combining markov random fields and convolutional neural networks for image synthesis. In CVPR, Cited by: §3.2.
  • Y. Li, N. Wang, J. Liu, and X. Hou (2017) Demystifying neural style transfer. In IJCAI, Cited by: §3.2.
  • Y. Li, C. Lin, Y. Lin, and Y. F. Wang (2019) Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In ICCV, Cited by: §1.
  • M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. In ICCV, Cited by: §2.
  • X. Liu, S. Thermos, G. Valvano, A. Chartsias, A. O’Neil, and S. A. Tsaftaris (2020) Metrics for exposing the biases of content-style disentanglement. CoRR abs/2008.12378. Cited by: §4.1.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §1, §4.
  • F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, Cited by: §1, §2, §3.1.
  • M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In NeurIPS, Cited by: §2.
  • M. E. Muller (1959) A note on a method for generating points uniformly on n-dimensional spheres. Communications of the ACM 2 (4), pp. 19–20. Cited by: §3.2.
  • T. Park, J. Zhu, O. Wang, J. Lu, E. Shechtman, A. A. Efros, and R. Zhang (2020) Swapping autoencoder for deep image manipulation. In NeurIPS, Cited by: Figure 8.
  • S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee (2015) Deep visual analogy-making. In NeurIPS, Cited by: §4.2, §4.
  • X. Ren, H. Li, Z. Huang, and Q. Chen (2020) Self-supervised dance video synthesis conditioned on music. In ACM MM, Cited by: §3.2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.2.
  • A. Szabó, Q. Hu, T. Portenier, M. Zwicker, and P. Favaro (2018) Challenges in disentangling independent factors of variation. In ICLRW, Cited by: §2.
  • F. Wu, L. Bao, Y. Chen, Y. Ling, Y. Song, S. Li, K. N. Ngan, and W. Liu (2019a) MVF-net: multi-view 3d face morphable model regression. In CVPR, Cited by: §4.6.
  • S. Wu, C. Rupprecht, and A. Vedaldi (2020) Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, Cited by: §3.2, §4.4.
  • W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019b) Disentangling content and style via unsupervised geometry distillation. In ICLRW, Cited by: §1, §2, §3.2, §4.1, Table 1, Table 2, §4.
  • W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019c) TransGaGa: geometry-aware unsupervised image-to-image translation. In CVPR, Cited by: §1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §3.4.
  • H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang (2019) Pix2Vox: context-aware 3d reconstruction from single and multi-view images. In ICCV, Cited by: §4.6.
  • X. Xing, T. Han, R. Gao, S. Zhu, and Y. N. Wu (2019) Unsupervised disentangling of appearance and geometry by deformable generator network. In CVPR, Cited by: §1.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §4.1.