CS-DisMo
[ICCVW 2021] Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement
Content and style (C-S) disentanglement aims to decompose the underlying explanatory factors of objects into two independent subspaces. From the unsupervised disentanglement perspective, we rethink content and style and propose a formulation for unsupervised C-S disentanglement based on our assumption that different factors are of different importance and popularity for image reconstruction, which serves as a data bias. The corresponding model inductive bias is introduced by our proposed C-S Disentanglement Module (C-S DisMo), which assigns different and independent roles to content and style when approximating the real data distributions. Specifically, each content embedding from the dataset, which encodes the most dominant factors for image reconstruction, is assumed to be sampled from a shared distribution across the dataset. The style embedding for a particular image, encoding the remaining factors, is used to customize the shared distribution through an affine transformation. Experiments on several popular datasets demonstrate that our method achieves state-of-the-art unsupervised C-S disentanglement, comparable to or even better than that of supervised methods. We verify the effectiveness of our method on two downstream tasks: domain translation and single-view 3D reconstruction. Project page at https://github.com/xrenaa/CS-DisMo.
The disentanglement task aims to recover the underlying explanatory factors of natural images into different dimensions of the latent space and to provide an informative representation for downstream tasks such as image translation (Wu et al., 2019c; Kotovenko et al., 2019), domain adaptation (Li et al., 2019) and geometric attribute extraction (Xing et al., 2019).
In this work, we focus on content and style (C-S) disentanglement, where content and style represent two independent latent subspaces. Most previous C-S disentanglement works (Denton and Birodkar, 2017; Jha et al., 2018; Bouchacourt et al., 2018; Gabbay and Hoshen, 2020) rely on supervision. For example, Gabbay and Hoshen (2020) achieve disentanglement by forcing images from the same group to share a common embedding. It is not always tractable, however, to collect such a dataset (e.g., groups of paintings with each group depicting the same scene in different styles). To the best of our knowledge, the only exception is Wu et al. (2019b), which, however, forces the content to encode a pre-defined geometric structure limited by the expressive ability of 2D landmarks.
Previous works define content and style based on either supervision or manually pre-defined attributes; there is no general definition of content and style for unsupervised C-S disentanglement. In this work, we define content and style from the perspective of VAE-based unsupervised disentanglement works (Higgins et al., 2017; Burgess et al., 2018; Kim and Mnih, 2018; Chen et al., 2018). These methods try to explain images with latent factors, each of which controls only one interpretable aspect of the images. However, extracting all disentangled factors is a very challenging task, and Locatello et al. (2019) prove that unsupervised disentanglement is fundamentally impossible without inductive bias on both the model and the data. Furthermore, these methods have limited downstream applications due to poor image generation quality on real-world datasets.
Inspired by the observation that the latent factors have different degrees of importance for image reconstruction (Burgess et al., 2018), we assume the disentangled factors are of different importance when modeling the real data distributions. Instead of finding all the independent factors, which is challenging, we make the problem tractable by defining content as the group of factors that are the most important for image reconstruction across the whole dataset, and defining style as the remaining ones. Take the human face dataset CelebA (Liu et al., 2015) as an example: as pose is a more dominant factor than identity for image reconstruction across the face dataset, content encodes pose, and identity is encoded by style. We further assume that each content embedding of the dataset is sampled from a shared distribution, which characterizes the intrinsic property of content. For the real-world dataset CelebA, the shared distribution of content (pose) is approximately a Standard Normal Distribution, where the zero-valued embedding stands for the canonical pose. For the synthetic dataset Chairs (Aubry et al., 2014), as each image is synthesized from equally distributed surrounding views, the shared distribution of content (pose) is approximately a Uniform Distribution.
Based on the above definitions and assumptions, we propose a problem formulation for unsupervised C-S disentanglement, and a C-S Disentanglement Module (C-S DisMo) which assigns different and independent roles to content and style when approximating the real data distributions. Specifically, C-S DisMo forces the content embeddings of individual images to follow a common distribution, and the style embeddings are used to scale and shift the common distribution to match the target image distribution via a generator. With the above assumptions as the data inductive bias, and C-S DisMo as the corresponding model inductive bias, we achieve unsupervised C-S disentanglement with good image generation quality. Furthermore, we demonstrate the effectiveness of our disentangled C-S representations on two downstream applications, i.e., domain translation and single-view 3D reconstruction.
We follow Gabbay and Hoshen (2020) to apply latent optimization to optimize the embeddings and the parameters of the generator. Please note that we only use the image reconstruction loss as the supervision; no human annotation is needed. We also propose to use instance discrimination as an auxiliary constraint to assist the disentanglement.
The experiments on several popular datasets demonstrate that our method achieves state-of-the-art (SOTA) unsupervised C-S disentanglement, comparable to or even better than supervised methods. Furthermore, by simplifying the factor disentanglement problem into the C-S disentanglement problem, we achieve much better performance than the SOTA VAE-based unsupervised disentanglement method when it is modified for C-S disentanglement by manually splitting the factors into content and style.
Main contributions. The main contributions of our work are as follows:
- By rethinking content and style from the perspective of VAE-based unsupervised disentanglement, we achieve unsupervised C-S disentanglement by introducing both data and model inductive bias.
- We propose the C-S DisMo to assign different and independent roles to content and style when modeling the real data distributions, and we provide several solutions for the distribution constraint of content.
- We verify the effectiveness of our method by presenting two downstream applications based on the well-disentangled content and style.
Unsupervised Disentanglement. There have been many studies on unsupervised disentangled representation learning (Higgins et al., 2017; Burgess et al., 2018; Kim and Mnih, 2018; Chen et al., 2018). These models learn disentangled factors by factorizing the aggregated posterior. However, Locatello et al. (2019) prove that unsupervised disentanglement is impossible without introducing inductive bias on both the models and the data, so these models are currently unable to obtain a promising disentangled representation. Inspired by these previous unsupervised disentanglement works, we revisit and formulate the unsupervised C-S disentanglement problem, simplifying the challenging task of extracting individual disentangled factors into a manageable one: disentangling two groups of factors (content and style).

C-S Disentanglement. Originating from style transfer, most prior works on C-S disentanglement divide latent variables into two spaces relying on group supervision. To achieve disentanglement, Mathieu et al. (2016) and Szabó et al. (2018) combine an adversarial constraint with auto-encoders. Meanwhile, the VAE (Kingma and Welling, 2014) is combined with non-adversarial constraints, such as cycle consistency (Jha et al., 2018) and evidence accumulation (Bouchacourt et al., 2018). Furthermore, latent optimization is shown to be superior to amortized inference for C-S disentanglement (Gabbay and Hoshen, 2020). The only exception is Wu et al. (2019b), which proposes a variational U-Net with structure learning for disentanglement in an unsupervised manner, but is limited by the expressive ability of 2D landmarks. In our work, we focus on the unsupervised C-S disentanglement problem and explore inductive bias for unsupervised disentanglement.
Key Difference from Image Translation. Image translation (Huang et al., 2018; Liu et al., 2019) between domains tries to decompose the latent space into domain-shared representations and domain-specific representations with the domain label of each image as supervision. The decomposition relies on the “swapping” operation and a pixel-level adversarial loss without semantic-level disentanglement ability. This pipeline fails in the unsupervised C-S disentanglement task on a single-domain dataset due to the lack of domain supervision, as demonstrated in Figure 8. Our unsupervised C-S disentanglement task is to disentangle the latent space into content (containing the most dominant factors, typically carrying high-level semantic attributes) and style (containing the rest of the factors). We achieve disentangled content and style by assigning different roles to them without relying on domain supervision or the “swapping” operation. We formulate the problem for a single domain, but it can be extended to the cross-domain setting to achieve domain translation without domain supervision, as shown in Figure 9.
For a given dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$, where $N$ is the total number of images, we assume each image $x_i$ is sampled from a distribution $p(x \mid f_1, \ldots, f_K)$, where $f_1, \ldots, f_K$ are the disentangled factors. Disentangling all these factors without supervision is a challenging task, which has been proved to be fundamentally impossible without introducing model and data inductive bias (Locatello et al., 2019). Based on the observation that the factors play roles of different importance for image reconstruction (Burgess et al., 2018), we assume $f_1, \ldots, f_K$ are of different importance and popularity for modeling the image distribution. We define the content $c$ as representing the most important factors across the whole dataset for image reconstruction and the style $s$ as representing the rest. We assume $c$ follows a shared distribution $p(c)$ across the whole dataset, and assign each image $x_i$ a style embedding $s_i$ which parameterizes $p(c)$ into an image-specific distribution $p(c \mid s_i)$. This serves as the data bias for our unsupervised C-S disentanglement.
With a generator $G_\theta$ that maps content and style embeddings to images, where $\theta$ denotes the parameters of the generator, we further parameterize the target image distributions as $p_\theta(x \mid c, s)$. For each image $x_i$, we assign $c_i$ as the content embedding. All the content embeddings $\{c_i\}_{i=1}^{N}$ should conform to the assumed distribution of content, which is denoted as $p(c)$. Then we are able to estimate the likelihood of $x_i$ by $p_\theta(x_i \mid c_i, s_i)$. Given the dataset $\mathcal{D}$, our goal is to minimize the negative log-likelihood of $\mathcal{D}$ while keeping the content embeddings consistent with $p(c)$:

$$\min_{\theta,\,\{c_i\},\,\{s_i\}} \; -\sum_{i=1}^{N} \log p_\theta(x_i \mid c_i, s_i) \;+\; \mathcal{L}_{p(c)}\big(\{c_i\}_{i=1}^{N}\big), \quad (1)$$

where the second term is a constraint that pushes the set of content embeddings toward the shared distribution $p(c)$.
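To make the latent-optimization setup concrete, here is a minimal PyTorch sketch under our reading of the formulation: every image owns a learnable content embedding and a learnable style embedding, and both are optimized jointly with the generator parameters using only a reconstruction objective. The module names, embedding sizes, tiny generator, and plain L2 reconstruction are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper's actual dimensions may differ.
N_IMAGES, C_DIM, S_DIM = 10000, 128, 64

class LatentBank(nn.Module):
    """Per-image content and style embeddings optimized directly (latent optimization)."""
    def __init__(self, n_images, c_dim, s_dim):
        super().__init__()
        self.content = nn.Embedding(n_images, c_dim)
        self.style = nn.Embedding(n_images, s_dim)

    def forward(self, idx):
        return self.content(idx), self.style(idx)

latents = LatentBank(N_IMAGES, C_DIM, S_DIM)
generator = nn.Sequential(              # stand-in for the actual generator
    nn.Linear(C_DIM, 3 * 64 * 64), nn.Tanh()
)
opt = torch.optim.Adam(list(latents.parameters()) + list(generator.parameters()), lr=1e-3)

def training_step(images, idx):
    """images: (B, 3, 64, 64) in [-1, 1]; idx: (B,) long tensor of image indices."""
    c, s = latents(idx)                    # per-image embeddings
    c_hat = c                              # placeholder: C-S DisMo would modulate c with s here
    recon = generator(c_hat).view_as(images)
    loss = ((recon - images) ** 2).mean()  # reconstruction term of Eq. 1 (a perceptual loss in practice)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The point of the sketch is that no encoder is involved: the embeddings themselves are free parameters, which is what "latent optimization" refers to.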
Here we propose a framework to address the problem formulated in Section 3.1. We design a C-S Disentanglement Module (C-S DisMo) to assign different roles to content and style in modeling real data distributions according to their definitions (data bias) in Section 3.1, which serves as the corresponding model bias.
More specifically, as shown in Figure 1, a C-S DisMo is composed of a $p(c)$-constraint that enforces the content embeddings to conform to $p(c)$, which corresponds to the second term in Eq. 1, and an affine transformation serving to customize the shared content distribution into image-specific distributions. This module is followed by the generator to generate the target image.
The affine transformation is inspired by the observation that the mean and variance of features carry individual information (Gatys et al., 2016; Li and Wand, 2016; Li et al., 2017; Huang and Belongie, 2017). We use the style embeddings to provide the statistics to scale and shift the content embeddings as

$$\hat{c}_i = \sigma(s_i)\, c_i + \mu(s_i), \quad (2)$$

where $\sigma(\cdot)$ and $\mu(\cdot)$ are two fully connected layers predicting the scalars for scaling and shifting respectively. When the likelihood $p_\theta(x \mid c, s)$ is modeled as a Normal Distribution, Eq. 1 is equivalent to minimizing:

$$\mathcal{L} = \sum_{i=1}^{N} \big\| x_i - G_\theta(\hat{c}_i) \big\|_2^2 \;+\; \mathcal{L}_{p(c)}\big(\{c_i\}_{i=1}^{N}\big), \quad (3)$$

with the proof provided in Appendix I.
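A minimal sketch of the C-S DisMo affine transformation of Eq. 2 in PyTorch, assuming the style embedding is mapped by two fully connected layers to a scalar scale and a scalar shift that modulate the (distribution-constrained) content embedding; the layer and embedding sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CSDisMo(nn.Module):
    """C-S Disentanglement Module (sketch of Eq. 2).

    The content embedding c is assumed to already satisfy the p(c)-constraint
    (Section 3.2); the style embedding s customizes it through an affine
    transformation before the result is fed to the generator.
    """
    def __init__(self, c_dim, s_dim):
        super().__init__()
        self.scale = nn.Linear(s_dim, 1)   # sigma(s): predicts the scaling scalar
        self.shift = nn.Linear(s_dim, 1)   # mu(s): predicts the shifting scalar

    def forward(self, c, s):
        # Eq. 2: c_hat = sigma(s) * c + mu(s)
        return self.scale(s) * c + self.shift(s)

# usage: the modulated content c_hat is the generator input
dismo = CSDisMo(c_dim=128, s_dim=64)
c_hat = dismo(torch.randn(8, 128), torch.randn(8, 64))   # -> (8, 128)
```

In the Multiple C-S DisMo framework, one such module would sit before each generator layer instead of only before the first.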
For the reconstruction term in Eq. 3, we adopt a VGG perceptual loss (Simonyan and Zisserman, 2015; Ren et al., 2020), which is widely used in unsupervised disentanglement methods (Wu et al., 2020, 2019b).
For the $p(c)$-constraint, i.e., the second term in Eq. 3, we propose and study discrimination-based, NLL-based and normalization-based solutions. The form of $p(c)$ should be carefully selected to better approximate the ground-truth content distribution of the dataset. We describe the details of these solutions and their limitations according to the form of $p(c)$ below.
Discrimination-based solution can be adopted when $p(c)$ has a tractable form for sampling. Inspired by adversarial learning (Karras et al., 2019), we propose to use a discriminator to distinguish between content embeddings (false samples) and items sampled from $p(c)$ (true samples). When it is difficult for the discriminator to distinguish true from false, the content embeddings are likely to follow $p(c)$.
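As an illustration of this solution (one plausible implementation under our assumptions, not the paper's code), a small MLP discriminator can be trained to separate optimized content embeddings from samples of $p(c)$, while a non-saturating adversarial term pushes the embeddings toward $p(c)$; here $p(c)$ is assumed to be a Standard Normal Distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(                  # assumed discriminator over 128-d embeddings
    nn.Linear(128, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

def pc_discrimination_losses(c_batch):
    """Return (discriminator loss, p(c)-constraint loss for the content embeddings)."""
    real = torch.randn_like(c_batch)              # true samples drawn from p(c) = N(0, I)
    d_loss = F.softplus(-disc(real)).mean() + F.softplus(disc(c_batch.detach())).mean()
    # non-saturating loss: make the embeddings indistinguishable from samples of p(c)
    g_loss = F.softplus(-disc(c_batch)).mean()
    return d_loss, g_loss
```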
NLL-based solution is inspired by flow-based generative models (Kingma and Dhariwal, 2018), and can be adopted when the density of $p(c)$ can be evaluated. We can use the negative log-likelihood (NLL) to optimize $\{c_i\}$ to follow $p(c)$ as

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{i=1}^{N} \log p(c_i). \quad (4)$$
Normalization-based solution can be adopted when $p(c)$ has one of the following specific forms: (i) a Standard Normal Distribution, or (ii) a Uniform Distribution. To approximately satisfy the constraint in case (i), Instance Normalization (IN) is used to force the mean and variance of the content embeddings to be zero and one respectively. When $p(c)$ is a Uniform Distribution, we can use $\ell_2$ normalization to force the content embeddings to approximately follow a Uniform Distribution on the unit hypersphere (Muller, 1959).
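The two normalization variants are simple enough to state directly; the snippet below is a sketch of how they might be applied to a batch of content embeddings, with the per-embedding statistics and the unit-sphere projection reflecting our reading of the IN and $\ell_2$ descriptions above.

```python
import torch
import torch.nn.functional as F

def constrain_content(c, target="normal"):
    """Apply a normalization-based p(c)-constraint to content embeddings c of shape (B, D)."""
    if target == "normal":
        # Instance Normalization: zero mean, unit variance per embedding,
        # approximating a Standard Normal content distribution.
        return (c - c.mean(dim=1, keepdim=True)) / (c.std(dim=1, keepdim=True) + 1e-6)
    if target == "uniform":
        # l2 normalization projects embeddings onto the unit hypersphere,
        # approximating a Uniform distribution over directions (Muller, 1959).
        return F.normalize(c, dim=1)
    raise ValueError(f"unknown target distribution: {target}")

c = torch.randn(8, 128)
c_normal = constrain_content(c, "normal")    # suited to CelebA-like pose statistics
c_uniform = constrain_content(c, "uniform")  # suited to uniformly sampled views (e.g., Chairs)
```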
For these solutions, we show qualitative and quantitative comparisons in Figure 3 and Table 3 respectively to verify their effectiveness. Furthermore, the discrimination-based and NLL-based solutions need extra optimization terms, which introduces instability. In our work, we mainly adopt the normalization-based solution to meet the $p(c)$-constraint.
As shown in Figure 1, we can use the C-S DisMo before the generator, denoted as the Single C-S DisMo framework. We can also insert it before each layer of the generator to provide multiple paths for disentanglement, denoted as the Multiple C-S DisMo framework. For more details, please refer to Appendix A.
In this section, we perform some experiments to verify that the C-S disentanglement is achieved by introducing inductive bias on model (C-S DisMo) and data (our assumptions of the dataset). The experimental setting can be found in Section 4.
To understand how C-S DisMo achieves disentanglement, we visualize the generated images during the training process on CelebA in Figure 2. As the generated images show, a mean shape of faces is learned first. Then the faces start to rotate, which indicates that the pose, as a dominant factor for generation, is disentangled as content. After that, the identity features emerge, since the identity is disentangled as style for better image generation.
If we treat content and style equally, i.e., concatenating the content and style embeddings as the input of the generator, the network can hardly disentangle any meaningful information for the CelebA dataset, as shown in Figure 3 (a). Our Single C-S DisMo framework with different solutions for the $p(c)$-constraint can disentangle the content (pose) and style (identity) of human faces, as shown in Figure 3 (c)-(e). When the $p(c)$-constraint is removed from C-S DisMo, the result is shown in Figure 3 (b), where the pose and identity cannot be disentangled. For the Multiple C-S DisMo framework, as multiple paths are provided and the network has more flexibility to approximate the target image distribution, it outperforms the Single C-S DisMo framework, as shown in Figure 3 (f).
We conduct experiments to demonstrate that better disentanglement can be achieved by choosing a better form for $p(c)$. For the real-world dataset CelebA, the distribution of pose is better modeled as a Standard Normal Distribution; as Figure 4 (a) and (b) show, IN achieves better disentanglement than $\ell_2$ normalization. For the synthetic Chairs (Aubry et al., 2014) dataset, the distribution of pose is close to a Uniform Distribution rather than a Standard Normal Distribution; Figure 4 (c) and (d) show that $\ell_2$ normalization results in better consistency of identity and pose.
[Figure 3: qualitative comparison on CelebA — (a) Concatenation, (b) w/o $p(c)$-constraint, (c) Discrimination, (d) NLL, (e) IN, (f) Multiple w/ IN.]
[Figure 4: effect of the form of $p(c)$ — (a) IN, (b) Normalization (CelebA); (c) IN, (d) Normalization (Chairs).]
In addition to the objective in Eq. 3, we propose two auxiliary loss functions to help the model better disentangle C-S.
Instance discrimination. Instance discrimination can discover image-specific features (Wu et al., 2018), and the image-specific feature corresponds to style according to our definition. Inspired by this, we first pretrain a backbone network on the target dataset with instance discrimination. Then we adopt contrastive learning to help disentangle content and style, pulling together images with the same content embeddings and pushing away images with different content embeddings in the backbone's representation space. We denote the contrastive loss as $\mathcal{L}_{\mathrm{cont}}$. The implementation details can be found in Appendix C.
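For intuition only, here is a heavily hedged InfoNCE-style sketch of such a contrastive term: backbone features of generated images that share a content embedding are treated as positives, and everything else in the batch as negatives. How positive pairs are actually formed, as well as the temperature, backbone, and any projection head, are assumptions on our part; the paper's exact scheme is in its Appendix C.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, temperature=0.1):
    """InfoNCE over pretrained-backbone features; row i of feat_a and feat_b
    is assumed to come from two generated images sharing the same content embedding."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # positives lie on the diagonal
```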
Information bottleneck. Burgess et al. (2018) propose improving disentanglement by controlling the capacity increment. This motivates us to control the information bottleneck capacity of content and style to help avoid leakage between them. We denote this loss as $\mathcal{L}_{\mathrm{IB}}$. The details of this loss are provided in Appendix C.
[Figure 5: demonstrations of the content and style spaces by interpolation (a & b) and retrieval (c-e).]
[Figure 6: visual analogy results — FactorVAE | Lord | Ours.]
Full objective. Therefore, our full objective is

$$\mathcal{L}_{\mathrm{full}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{p}\,\mathcal{L}_{p(c)} + \lambda_{\mathrm{cont}}\,\mathcal{L}_{\mathrm{cont}} + \lambda_{\mathrm{IB}}\,\mathcal{L}_{\mathrm{IB}}, \quad (5)$$

where the hyperparameters $\lambda_{p}$, $\lambda_{\mathrm{cont}}$ and $\lambda_{\mathrm{IB}}$ represent the weights for the corresponding loss terms. The ablation study for the auxiliary loss terms is presented in Appendix E.

In this section, we perform quantitative and qualitative experiments to evaluate our method. We test our method on several datasets: Cars3D (Reed et al., 2015), Chairs (Aubry et al., 2014) and CelebA (Liu et al., 2015). For these three datasets, pose is the most dominant factor and is encoded by content. For details of the datasets and results on more datasets, please refer to Appendix B and D.
Baselines. We choose several SOTA group-supervised C-S disentanglement methods for comparison: Cycle-VAE (Jha et al., 2018), DrNet (Denton and Birodkar, 2017) and Lord (Gabbay and Hoshen, 2020). We also select the only unsupervised C-S disentanglement method, Wu et al. (2019b).¹ In addition, we compare against FactorVAE (Kim and Mnih, 2018), the SOTA VAE-based unsupervised disentanglement method, modified for C-S disentanglement by manually splitting its factors into content and style.

¹There is no open-sourced implementation for it. We modify
We compare our method (the Multiple C-S DisMo framework) with the baselines on Cars3D, Chairs and CelebA.
Content Transfer Metric. To evaluate our method’s disentanglement ability, we follow the protocol of Gabbay and Hoshen (2020) to measure the quality of content transfer by LPIPS (Zhang et al., 2018). Details are presented in Appendix A. The results are shown in Table 1. We achieve the best performance among the unsupervised methods, even though pseudo labels are provided for Wu et al. (2019b). Our method significantly outperforms FactorVAE, which verifies the effectiveness of our formulation: simplifying the problem from disentangling factors to disentangling content and style. Furthermore, our method is comparable to or even better than the supervised ones.
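As a rough illustration of the shape of this protocol (the exact pairing scheme follows Gabbay and Hoshen (2020) and is detailed in the appendix; what follows is our assumption), content transfer can be scored by generating an image from one sample's content and another's style and measuring LPIPS against the corresponding ground-truth image with the `lpips` package.

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance network

def content_transfer_score(generator, dismo, c_src, s_tgt, gt_image):
    """Generate an image with content from one sample and style from another,
    then measure LPIPS against the ground-truth combination (lower is better).
    Assumes grouped ground truth exists, as in Cars3D and Chairs."""
    with torch.no_grad():
        transferred = generator(dismo(c_src, s_tgt))   # swapped C-S generation
        # lpips expects images of shape (B, 3, H, W) scaled to [-1, 1]
        return lpips_fn(transferred, gt_image).mean().item()
```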
Classification Metric. Classification accuracy is used to evaluate disentanglement in Denton and Birodkar (2017); Jha et al. (2018); Gabbay and Hoshen (2020). We follow the protocol of Jha et al. (2018). Low classification accuracy indicates small leakage between content and style. Without content annotations for CelebA, we regress the positions of the facial landmarks from the style embeddings instead. The results are summarized in Table 2. Though without supervision, the performance of our method is still comparable to several other methods. We note that the classification metric may not be appropriate for disentanglement, which is also observed in Liu et al. (2020): the classification metric is also influenced by the information capacity and the dimensions of the embeddings. For FactorVAE, the poor reconstruction quality indicates that the content and style embeddings encode little information, which can hardly be identified. The dimensions of the content and style embeddings of different methods vary from ten to hundreds, and a higher dimension usually leads to easier classification.
Table 1: Content transfer metric (LPIPS; lower is better).

| Method | Supervision | Cars3D | Chairs | CelebA |
|---|---|---|---|---|
| DrNet (Denton and Birodkar, 2017) | ✓ | 0.146 | 0.294 | 0.221 |
| Cycle-VAE (Jha et al., 2018) | ✓ | 0.148 | 0.240 | 0.202 |
| Lord (Gabbay and Hoshen, 2020) | ✓ | 0.089 | 0.121 | 0.163 |
| FactorVAE (Kim and Mnih, 2018) | ✗ | 0.190 | 0.287 | 0.369 |
| Wu et al. (2019b) | ✗ | – | – | 0.185 |
| Ours | ✗ | 0.082 | 0.190 | 0.161 |
Table 2: Classification metric (each dataset has two sub-columns).

| Method | Supervision | Cars3D | | Chairs | | CelebA | |
|---|---|---|---|---|---|---|---|
| DrNet (Denton and Birodkar, 2017) | ✓ | 0.27 | 0.03 | 0.06 | 0.01 | 4.99 | 0.00 |
| Cycle-VAE (Jha et al., 2018) | ✓ | 0.81 | 0.77 | 0.60 | 0.01 | 2.80 | 0.12 |
| Lord (Gabbay and Hoshen, 2020) | ✓ | 0.03 | 0.09 | 0.02 | 0.01 | 4.42 | 0.01 |
| FactorVAE (Kim and Mnih, 2018) | ✗ | 0.07 | 0.01 | 0.14 | 0.01 | 5.34 | 0.00 |
| Wu et al. (2019b) | ✗ | – | – | – | – | 5.42 | 0.11 |
| Ours | ✗ | 0.33 | 0.24 | 0.66 | 0.05 | 4.15 | 0.05 |
Disentanglement & Alignment. In Figure 5 (a) and (b), we conduct linear interpolation to show the variation in the two embedding spaces. Horizontally, with the interpolated style embeddings, the identity (style) changes smoothly while the pose (content) is well maintained. Vertically, the identity remains the same as the pose changes. We have the following observations: the learned content and style spaces are continuous; columns of the left and right figures share the same pose, suggesting that the learned content spaces are well aligned; and factors encoded by style are maintained when changing the content embeddings and vice versa, suggesting good disentanglement.
We perform retrieval on the content and style latent spaces, respectively. As shown in Figure 5 (c) and (d), given a query image (labeled with a red box), its nearest neighbors in the content space share the same pose but have different identities, which reveals that the content space is well aligned. To better identify the faces, we let the query's nearest neighbors in the style space share the same pose, and the generated faces look very similar, revealing that the style is well maintained. As shown in Figure 5 (e), a zero-valued content embedding results in a canonical view. As we assume that the pose distribution of faces is a Standard Normal Distribution, the canonical view is the most common pose in the dataset, and the zero-valued content embedding has the largest likelihood accordingly.
[Figure 7: single-view 3D reconstruction — Input | Our generated multi-view | Single | Ours | GT.]
[Figure 8: comparison with image translation methods — (a) MUNIT, (b) Park et al. (2020), (c) Ours, (d) Our fine style.]
Visual Analogy & Comparison. Visual analogy (Reed et al., 2015) is to switch the style and content embeddings for each pair. We show the visual analogy results of our method against FactorVAE (a typical unsupervised baseline) and Lord (the strongest supervised baseline) in Figure 6 on Chairs, Cars3D, and CelebA. The results show that FactorVAE suffers from poor generation quality and poor content transfer on all datasets. On Cars3D, Lord's results have artifacts (e.g., third column), and its style embeddings fail to encode the color information of the test images (e.g., fourth row). On CelebA, the transfer result of Lord is not consistent, e.g., the content embedding controls facial expression in the fifth column, while other content embeddings do not control expression. Our method achieves pose transfer comparable to Lord and maintains the identities of the images. For more results (including on other datasets), please refer to Appendix D.
Besides the qualitative experiment shown in Figure 4, we perform an ablation study on CelebA to evaluate the different solutions for the $p(c)$-constraint introduced in Section 3.2. In this subsection, we do not use the auxiliary loss functions. As shown in Table 3, all the solutions achieve SOTA performance in terms of the content transfer metric, which means that the $p(c)$-constraint on the content embeddings is crucial. This result further verifies that our definition is reasonable. For the classification metric, the results of the discrimination-based and NLL-based solutions are relatively poor due to the reasons discussed in Section 4.1. The normalization-based solution achieves the best results on all the metrics. We believe this is because the normalization-based solution does not introduce an extra optimization term, which may hurt the optimization process and limit the expressive ability of the embeddings.
Our method can be generalized to held-out data. A solution is to train two encoders that map images to the content and style spaces respectively. We train a style encoder $E_s$ and a content encoder $E_c$ by minimizing

$$\mathcal{L}_{\mathrm{enc}} = \sum_{i=1}^{N} \big\| E_c(x_i) - c_i \big\|_2^2 + \big\| E_s(x_i) - s_i \big\|_2^2. \quad (6)$$
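A sketch of this amortization step under the reading of Eq. 6 above (the L2 regression target and the tiny convolutional encoders are assumptions): the encoders simply regress the per-image embeddings obtained by latent optimization, so unseen images can afterwards be embedded with a single forward pass.

```python
import torch
import torch.nn as nn

def make_encoder(out_dim):
    """Small convolutional encoder (assumed architecture) for 3-channel images."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

enc_c, enc_s = make_encoder(128), make_encoder(64)
opt = torch.optim.Adam(list(enc_c.parameters()) + list(enc_s.parameters()), lr=1e-4)

def encoder_step(images, c_target, s_target):
    """One step of Eq. 6: regress the learned content and style embeddings."""
    loss = ((enc_c(images) - c_target) ** 2).sum(dim=1).mean() \
         + ((enc_s(images) - s_target) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```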
We apply our model trained on the CelebA dataset to faces collected by Wu et al. (2020) including paintings and cartoon drawings. As shown in Figure 10, our method can be well generalized to unseen images from different domains.
Table 3: Ablation of solutions for the $p(c)$-constraint on CelebA.

| Method | Content transfer metric | Classification metric | |
|---|---|---|---|
| Single | 0.204 | 3.03 | 0.06 |
| Single w/ Disc | 0.178 | 2.97 | 0.14 |
| Single w/ NLL | 0.171 | 2.98 | 0.09 |
| Single w/ IN | 0.166 | 3.46 | 0.04 |
As shown in Figure 8 (d), we can also exchange the tone of the images by exchanging the fine style, i.e., the style input of the last C-S DisMo in the Multiple C-S DisMo framework. The affine transformation in our work plays the same role as in image translation works. The key difference is that we have the $p(c)$-constraint to force the content embeddings to follow a common distribution.
In this work, we explore two applications of C-S disentanglement. For 3D reconstruction, single-view settings lack reliable 3D constraints (Wu et al., 2019a). Based on our disentangled representations, we can generate multiple views from a single view. On Chairs, we adopt Pix2Vox (Xie et al., 2019), a framework for single-view and multi-view 3D reconstruction, to verify the effectiveness of our method. As shown in Figure 7, the 3D objects reconstructed from the multiple views generated by our method are much better than those reconstructed from a single view, and are even comparable to those reconstructed from ground-truth multi-view images. For more results, please refer to Appendix G.
For domain translation, our method can work on images merged from two domains without using any domain label. As shown in Figure 9, based on the disentangled content (edge structure) and style (texture), we can translate edge images into shoe images and vice versa. Please refer to Appendix H for more on domain translation.
We propose a definition for content and style and a problem formulation for unsupervised C-S disentanglement. Based on the formulation, C-S DisMo is proposed to assign different and independent roles to content and style when approximating the real data distributions. Our method outperforms other unsupervised approaches and achieves comparable or even better performance than the SOTA supervised methods. As for limitations, we fail on datasets containing multiple categories with large appearance variation, e.g., CIFAR-10 (Krizhevsky et al., 2009), which does not match our assumption. Our method could be adopted to help downstream tasks, e.g., domain translation, single-view 3D reconstruction, etc. An interesting direction is to apply our method to contrastive learning: with disentangled representations, contrastive learning could perform more effectively.

References

- D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI, 2018.
- L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
- X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
- T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
- C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, 2016.
- S. Wu, C. Rupprecht, and A. Vedaldi. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR, 2020.
- R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.