Improving Style-Content Disentanglement in Image-to-Image Translation

07/09/2020 ∙ by Aviv Gabbay, et al.

Unsupervised image-to-image translation methods have achieved tremendous success in recent years. However, it can be easily observed that their models contain significant entanglement which often hurts the translation performance. In this work, we propose a principled approach for improving style-content disentanglement in image-to-image translation. By considering the information flow into each of the representations, we introduce an additional loss term which serves as a content-bottleneck. We show that the results of our method are significantly more disentangled than those produced by current methods, while further improving the visual quality and translation diversity.


1 Introduction

Image translation is the task of mapping images between different domains, i.e. given an input image in a source domain (e.g. dogs), we aim to generate an analogous image in a target domain (e.g. cats). Although this task is generally poorly specified, it is often made possible under the assumption that images in different domains share similar content (e.g. head pose) which can be transferred over during translation. In cases where pairwise correspondences between domains are available, general-purpose conditional adversarial networks such as pix2pix Isola et al. (2017); Wang et al. (2018) achieve remarkable results and scale up to extremely high resolutions. Much current research in image translation deals with the more challenging unsupervised setting in which no correspondences are given. Early attempts Zhu et al. (2017) to solve this problem find a single image in the target domain for every input image in the source domain. While this one-to-one mapping can be satisfactory in some cases, most image domains are multi-modal in nature, i.e. there are several possible mappings for every input (e.g. an image of a cat can be translated to images of different dog breeds). As a result, uni-modal formulations fail to capture the underlying distribution of images in different domains. Notable methods such as MUNIT Huang et al. (2018) and StarGAN-v2 Choi et al. (2020) tackle the multi-modal translation task and present high-quality and diverse domain mappings, along with reference-guided image synthesis in which the specific target style is borrowed from a reference image in the target domain.

Another closely related line of work studies the problem of learning disentangled representations in the class-supervised setting Gabbay and Hoshen (2020); Denton and others (2017); Bouchacourt et al. (2018). In this task, the goal is to learn a disentangled representation for each class (domain) in the dataset and a residual content representation for each image. For example, LORD Gabbay and Hoshen (2020) utilizes latent optimization for learning a class representation that is shared exactly between all images of the same class, and an additional regularized representation which captures the image-specific content and can be applied to images from different classes. It is shown that non-adversarial bottlenecks provide better disentanglement than methods that rely on domain confusion losses. In this work, we draw inspiration from these principles and carefully analyse the information flow between the domain, content and style representations. We introduce an additional loss term which strongly encourages content-style disentanglement. We then show that state-of-the-art architectures for image translation can greatly benefit from these disentanglement principles to achieve higher translation quality and greater output diversity.

2 Related Work

Image Translation

Isola et al. (2017) propose pix2pix, a conditional generative adversarial network, as a general-purpose model for solving different image-to-image tasks in the supervised setting. Wang et al. (2018) extend this framework to generating high-resolution images. Initial progress in the unsupervised setting was made by CycleGAN Zhu et al. (2017), which introduces a cycle consistency loss to guarantee that the translated image properly preserves the domain-invariant characteristics (e.g. pose) of the source image. As a consequence of this approach, the model learns a deterministic one-to-one mapping and thus cannot capture the multi-modal nature of the image distribution. MUNIT Huang et al. (2018) recognizes this limitation and extends the framework to learn multi-modal mappings. Despite its capability of generating diverse and realistic translation outputs, this method trains separate encoder-decoder models for each domain and therefore cannot easily scale up to multiple domains. StarGAN Choi et al. (2018, 2020) addresses this issue and presents a unified model for multi-domain translation. FUNIT Liu et al. (2019) attempts to generalize to images from unseen domains using a few reference images from a target domain, but it requires fine-grained class labels during training and cannot model unspecified intra-class variations.

Class-Supervised Disentanglement

Learning disentangled representations from a set of observations is a fundamental problem in machine learning. The setting most related to our work is the class-supervised setting, in which there exists a class label for every image. The goal is generally to anchor the semantics shared by all the images within each class into a separate class representation, while modeling all the remaining image-specific properties by a per-image content representation. Several methods encourage disentanglement by adversarial constraints Denton and others (2017); Szabó et al. (2018); Mathieu et al. (2016), while others rely on cycle consistency Harsh Jha et al. (2018) or group accumulation Bouchacourt et al. (2018). LORD Gabbay and Hoshen (2020) takes a non-adversarial approach and trains a generative model while directly optimizing over class and content codes. Although most of the works in this area demonstrate domain translation results, they do not deal with the multi-modal unsupervised setting. Moreover, their primary objective is to achieve disentanglement at the representation level, and they put less effort into tuning architectures for high-quality image translation.

3 An Analysis of Disentanglement in Image Translation

Let us model the formation of an image $x$ as a function of a domain $d$, a style $s$ and a content $c$:

$$x = F(d, s, c) \tag{1}$$

Each image belongs to a single domain $d$ (e.g. "cat" or "dog"). The content $c$ describes the information that is invariant across domains (e.g. head pose). This information should be preserved if the domain label of the image is changed. The style $s$ describes the residual properties that are not preserved across domains and are not shared across all images in the same domain, i.e. intra-domain variations. As an illustrative example, let us assume that we are provided with a set of images, each classified as either "cat" or "dog". The domain specifies the characteristics which are shared across all cats or all dogs. The content describes the information which is invariant to the species, e.g. the pose of the animal. The style captures the information that is neither shared across species nor shared by all members of the same species, e.g. the breed of the animal, its color or the texture of its fur.

Current image translation models encourage the disentanglement of these three factors of variation by introducing several different constraints. Let us briefly review the most common techniques.

The most common approach for learning a domain-invariant content representation is by using a domain confusion objective. MUNIT Huang et al. (2018) trains a discriminator for each of the domains which aims at distinguishing whether the output image is from the specific domain or not. StarGAN-v2 Choi et al. (2020) scales this approach to multiple domains and trains a conditional discriminator which learns to identify whether the output image is from the given target domain.

Alternatively, another popular approach utilizes a conditional discriminator at the representation level, which attempts to predict the domain label given the content code Benaim et al. (2019); Denton and others (2017). There are two issues with relying on adversarial constraints for disentanglement: i) GAN training is unstable and sensitive to hyper-parameters due to the challenging saddle-point optimization problem; it has been shown Gabbay and Hoshen (2020) that GAN discriminators often do not in fact remove all domain-specific information from the content representation. ii) The conditional discriminators only ensure disentanglement between the content and the domain, but not necessarily between the content and the style. As we have no supervision on the style, it is not obvious how to train a style-conditional discriminator, and the content codes may therefore contain a significant amount of style information.
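For concreteness, the following is a minimal PyTorch-style sketch of such a representation-level domain-confusion constraint (it is not part of our method); the module sizes and the uniform-prediction form of the confusion loss are illustrative assumptions rather than the exact formulations used in the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainCritic(nn.Module):
    """Tries to predict the domain label from a content code (sizes are illustrative)."""
    def __init__(self, content_dim=128, num_domains=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim, 256), nn.ReLU(),
            nn.Linear(256, num_domains),
        )

    def forward(self, c):
        return self.net(c)

def domain_confusion_losses(critic, content_code, domain_labels):
    # The critic is trained to recover the domain from the (detached) content code.
    critic_loss = F.cross_entropy(critic(content_code.detach()), domain_labels)
    # The content encoder is trained to make the critic's prediction uninformative
    # (close to uniform), i.e. to "confuse" it about the domain.
    log_probs = F.log_softmax(critic(content_code), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(1))
    encoder_loss = F.kl_div(log_probs, uniform, reduction="batchmean")
    return critic_loss, encoder_loss
```

Note that even when such a critic is fooled, nothing in this objective constrains the content code with respect to the (unsupervised) style, which is exactly the gap discussed above.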

In the other direction, in order to constrain the capacity of the style representation and avoid leakage of content information, current methods rely on locality-preserving architectures that bias translations towards local changes. The style is typically injected as scale and shift parameters into Adaptive Instance Normalization (AdaIN) Huang and Belongie (2017); Karras et al. (2019) layers at different levels of the generator architecture. As this operates in a global manner per channel, it effectively preserves the spatial structure of the content image. Although this makes optimization easier, it inevitably limits the diversity of the generated images, which is typically constrained to low-level variations.

In this work, we focus on improving content-style disentanglement in image translation models, by introducing a well-motivated bottleneck on the content representation. We show that integrating this term with state-of-the-art architectures greatly improves the quality and diversity of translation results.

4 Improved Content-Style Disentanglement with a Content Bottleneck

We provide a principled approach for improving content-style disentanglement in image translation.

4.1 Image Translation Framework

Our architecture is strongly influenced by those of state-of-the-art methods such as StarGAN-v2 Choi et al. (2020) and MUNIT Huang et al. (2018).

Given an input image $x$ in domain $d$, we train a content encoder $E_c$ and a domain-conditional style encoder $E_s$ to obtain a content code $c$ and a style code $s$, respectively:

$$c = E_c(x), \qquad s = E_s(x, d) \tag{2}$$

During training, we sample a random image $x_1$ from a source domain $d_1$ and optimize the generator network $G$ to translate $x_1$ into an image in a random target domain $d_2$ with the style of another randomly sampled image $x_2$:

$$x_{1 \to 2} = G\big(E_c(x_1), E_s(x_2, d_2)\big) \tag{3}$$

In order to encourage $G$ to generate valid images from the target domain, we train a conditional discriminator $D$ and employ an adversarial loss:

$$\mathcal{L}_{adv} = \mathbb{E}_{x_1, d_1}\big[\log D(x_1, d_1)\big] + \mathbb{E}_{x_1, x_2, d_2}\big[\log\big(1 - D(x_{1 \to 2}, d_2)\big)\big] \tag{4}$$

To enforce $G$ to preserve the content of the input image $x_1$, we apply a cycle reconstruction loss:

$$\mathcal{L}_{cyc} = \mathbb{E}\big[\lVert x_1 - G\big(E_c(x_{1 \to 2}), E_s(x_1, d_1)\big) \rVert_1\big] \tag{5}$$
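The translation pass of Eqs. (2)-(5) can be summarized by the following sketch. It assumes hypothetical modules E_c, E_s, G and D with the interfaces defined above; the non-saturating form of the generator's adversarial term is one common choice and is not necessarily the exact implementation.

```python
import torch.nn.functional as F

def translation_step(E_c, E_s, G, D, x1, d1, x2, d2):
    """One translation pass following Eqs. (2)-(5); module interfaces are assumed."""
    c1 = E_c(x1)                     # content code of the source image, Eq. (2)
    s2 = E_s(x2, d2)                 # style code of the reference image, Eq. (2)
    x_fake = G(c1, s2)               # translation into the target domain, Eq. (3)

    # Generator side of the adversarial loss in Eq. (4) (non-saturating form;
    # D(., d) is assumed to return a realness logit for domain d).
    adv_loss = F.softplus(-D(x_fake, d2)).mean()

    # Cycle reconstruction with the original content and style, Eq. (5).
    x_cyc = G(E_c(x_fake), E_s(x1, d1))
    cyc_loss = (x1 - x_cyc).abs().mean()
    return x_fake, adv_loss, cyc_loss
```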

4.2 Content Bottleneck

In Sec. 3 we argued that current methods do not explicitly ensure the disentanglement of the content from the style. To this end, we propose a simple yet principled new term to encourage disentanglement. Specifically, we regularize the content code by turning the content encoder and the generator into a variational auto-encoder, introducing a KL-divergence loss between the content code and a prior Gaussian distribution. Although in a typical VAE the encoder learns both the mean and the log-variance of the random variable, we follow LORD Gabbay and Hoshen (2020) in using noise with a constant (unlearned) variance. This prevents partial posterior collapse and prevents information leakage through the content code. The additional content-bottleneck (cb) term is therefore:

$$\mathcal{L}_{cb} = D_{KL}\big(\mathcal{N}(E_c(x_1), \sigma^2 I)\,\big\|\,\mathcal{N}(0, I)\big) \tag{6}$$

The content code is perturbed accordingly during the feed-forward step:

$$c = E_c(x_1) + z, \qquad z \sim \mathcal{N}(0, \sigma^2 I) \tag{7}$$

Our entire objective can be summarized as:

$$\min_{G, E_c, E_s}\; \max_{D}\;\; \mathcal{L}_{adv} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{cb}\,\mathcal{L}_{cb} \tag{8}$$
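A minimal sketch of the content bottleneck of Eqs. (6)-(8) is given below, assuming the content encoder outputs the mean of the content distribution and that sigma is a fixed hyper-parameter; the exact variance and loss weights are not specified here.

```python
import torch

def content_bottleneck(c_mean, sigma=1.0):
    """Fixed-variance bottleneck on the content code, following Eqs. (6)-(7).

    With an unlearned variance, the KL term against a unit Gaussian prior reduces
    (up to additive constants and a 1/2 factor) to an L2 penalty on the predicted
    mean, and the code fed to the generator is the mean perturbed by Gaussian noise.
    """
    cb_loss = c_mean.pow(2).sum(dim=1).mean()        # Eq. (6), up to constants
    c = c_mean + sigma * torch.randn_like(c_mean)    # Eq. (7), reparameterized sample
    return c, cb_loss

# The full objective of Eq. (8) then combines the three terms, e.g.
# total_loss = adv_loss + lambda_cyc * cyc_loss + lambda_cb * cb_loss
```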

4.3 Implementation Details

In order to emphasize the contribution of our proposed loss term for the task of style-content disentanglement, and the simplicity of integrating it into existing state-of-the-art image translation models, we base our model on the exact same architectures of all the neural networks proposed in StarGAN-v2 Choi et al. (2020). We briefly describe the main components for completeness.

Content Encoder Our content encoder $E_c$ takes as input an image $x$ and outputs a content code $c$. As the content is regularized within a VAE in our framework, we find that fixing the variance of the estimated distribution improves stability and avoids leakage of other information into the content.

Style Encoder and Mapping Network The style encoder $E_s$ takes two inputs: an image $x$ and a corresponding domain label $d$, and outputs a style code $s$. In order to enable sampling of random styles, we include a mapping network $M$ that translates a latent code $z$ sampled from a prior Gaussian distribution into a style code $s$ through a series of fully-connected layers. Note that the last layers are trained in a domain-specific fashion to improve performance.

Generator Our generator $G$ takes two inputs: a content code $c$ and a style code $s$. Note that since the style is already modulated by the domain, there is no need to provide the generator with $d$ directly. Similarly to other state-of-the-art domain translation methods, the style is injected as scale and shift parameters into Adaptive Instance Normalization (AdaIN) Huang and Belongie (2017); Karras et al. (2019) layers at different levels of the generator architecture.
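For reference, a standard AdaIN layer of the kind used in such generators can be sketched as follows; the (1 + gamma) parameterization and the single linear affine layer follow common implementations and are given only as an illustration, not as the exact StarGAN-v2 code.

```python
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: the style code predicts per-channel
    scale and shift applied to the instance-normalized content features."""
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.affine = nn.Linear(style_dim, num_features * 2)

    def forward(self, h, s):
        # h: (N, C, H, W) content features, s: (N, style_dim) style code
        gamma, beta = self.affine(s).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(h) + beta
```

Because the modulation is global per channel, the spatial layout of the content features is preserved, which is precisely the locality bias discussed in Sec. 3.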

Discriminator The discriminator $D$ is domain-conditional and therefore takes as input an image $x$ and a domain label $d$, and determines whether the image is a real image of domain $d$ or a fake image generated by the generator.

We use the same values of $\lambda_{cyc}$ and $\lambda_{cb}$ in all our experiments.

5 Experiments

5.1 Baselines

Our method is evaluated against StarGAN-v2 Choi et al. (2020), a state-of-the-art domain translation method, and against the established MUNIT Huang et al. (2018) framework. While StarGAN-v2 supports multiple domains similarly to our method, MUNIT is trained separately for every possible pair of domains in the following experiments.

5.2 Datasets

AFHQ

We first assess the performance of all methods on the recently proposed Animal-Faces-HQ dataset Choi et al. (2020), consisting of high quality images. The images are categorized into three domains: cat, dog and wildlife. We follow the protocol used in Choi et al. (2020) and use 500 images from each domain as a test set and the rest as a training set.

CUB

To further compare the disentanglement performance of our method and StarGAN-v2 on multi-domain translation with tens of domains which exhibit a considerable amount of style variation, we compare both methods on CUB-200-2011 Wah et al. (2011), a dataset of 200 bird species with 6,000 images. In order to increase the variance within each domain, we aggregate similar bird species by their descriptions (e.g. Gull, Woodpecker) and form the CUB-47 variant, in which the birds are separated into only 47 coarse domains; this is a much more challenging benchmark for multi-modal domain translation methods.
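As an illustration of this kind of aggregation, the sketch below groups fine-grained species by the last word of their class name, assuming CUB class names of the form "059.California_Gull"; the exact grouping rule used to build CUB-47 may differ.

```python
from collections import defaultdict

def coarse_domains(class_names):
    """Group fine-grained species into coarse domains by the last word of the
    species name, e.g. "059.California_Gull" -> "Gull" (an illustrative rule)."""
    groups = defaultdict(list)
    for name in class_names:
        family = name.split(".")[-1].split("_")[-1]
        groups[family].append(name)
    return groups

# Example:
# coarse_domains(["059.California_Gull", "060.Glaucous_winged_Gull"])
# -> {"Gull": ["059.California_Gull", "060.Glaucous_winged_Gull"]}
```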

Edges2Shoes

Yu and Grauman (2014) A collection of 50,000 images separated into two domains: shoe images and their edges. Note that we do not make use of the pairwise correspondences provided in this dataset.

We conduct all the experiments at 128×128 resolution.

5.3 Evaluation Protocol

In order to measure the diversity of translation results, we follow Choi et al. (2020) and translate all images in the test set to each of the other domains multiple times using two strategies: i) Reference-guided translation, i.e. borrowing style codes from random reference images in the target domain; ii) Sampling-based translation, i.e. generating outputs by sampling different random style codes. We then measure the perceptual pairwise distances using LPIPS Zhang et al. (2018) between all translations of the same input image. Higher average distances indicate greater diversity in image translation. To assess the improvement in style-content disentanglement, we compute FID Heusel et al. (2017), which measures the discrepancy between the distribution of images in each target domain and the corresponding translations generated by the models. A lower FID score indicates that the translations are more reliable and better fit the target domain. It should be noted that none of the metrics used is well-suited for measuring style-content disentanglement. LPIPS merely measures the amount of variation due to style, but not its fidelity to the reference style or its disentanglement from the content. FID measures the similarity between the generated and real images in a particular domain, but it does not measure the similarity to a particular style. FID is therefore more effective for class-content disentanglement than for multi-modal settings that contain style variation. Developing competent metrics for the content-style disentanglement setting remains an important open problem.
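As an illustration of the diversity metric, the following sketch computes the mean pairwise LPIPS distance over several translations of the same input using the publicly available lpips package; the input convention (tensors of shape (1, 3, H, W) in [-1, 1]) follows that package, and the batching strategy here is a simplification.

```python
import itertools
import lpips   # https://github.com/richzhang/PerceptualSimilarity (pip install lpips)

loss_fn = lpips.LPIPS(net='alex')   # perceptual distance network

def translation_diversity(translations):
    """Mean pairwise LPIPS distance over multiple translations of the same input.

    `translations` is a list of image tensors of shape (1, 3, H, W) in [-1, 1];
    higher values indicate greater translation diversity.
    """
    dists = [loss_fn(a, b).item()
             for a, b in itertools.combinations(translations, 2)]
    return sum(dists) / len(dists)
```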

5.4 Results

                                   Reference-guided          Sampling
                                   LPIPS      FID            LPIPS      FID
  MUNIT Huang et al. (2018)        0.302      139.732        0.426      32.208
  StarGAN-v2 Choi et al. (2020)    0.328      27.842         0.357      28.751
  Ours                             0.394      23.927         0.427      18.767

Table 1: Multi-modal domain translation results between the three domains of AFHQ.
                                   Reference-guided          Sampling
                                   LPIPS      FID            LPIPS      FID
  StarGAN-v2 Choi et al. (2020)    0.231      76.391         0.182      80.196
  Ours                             0.315      71.738         0.239      93.688

Table 2: Multi-modal domain translation results between the 47 domains of CUB-47.

We present a quantitative evaluation of our method and the state-of-the-art baselines MUNIT and StarGAN-v2 on the AFHQ dataset in Tab. 1. It can be seen that our method achieves both higher translation diversity (LPIPS) and better visual quality (FID) than both baselines, in the reference-guided setting as well as when sampling random style codes. It should be noted that, as observed in Choi et al. (2020), MUNIT experiences mode collapse in the case of reference-guided translation and fails to generate diverse images according to the provided reference images. As can be seen in the qualitative examples in Fig. 1, StarGAN-v2 leaks a significant amount of detail from the input image, which results in unreliable translation between the different domains.

Figure 1: Comparison between our method and StarGAN-v2 on AFHQ. StarGAN-v2 leaks a significant amount of details of the content image and generates unreliable and inconsistent translations. Our method produces much more disentangled results and captures the target style faithfully.

Furthermore, StarGAN-v2 is mostly capable of changing only low-level features such as color and texture. Our method generates much more disentangled translations and captures the exact style of the reference image.

In order to assess the performance of the methods in the multi-domain translation setting, we follow the same protocol on the CUB-47 dataset. In this dataset, the images are classified into 47 coarse bird domains, while the images in each domain exhibit variations in fine-grained details. Results of this experiment are presented in Tab. 2, along with visual examples in Fig. 2. In this experiment we compare our method only to StarGAN-v2, as the MUNIT architecture supports only a pair of domains and cannot scale to this number of domains. It can be seen that we achieve higher diversity and better visual quality than StarGAN-v2 in the reference-guided setting, although we exhibit some degradation in quality in the sampling case. Moreover, our method is capable of translating to a specific fine-grained style of the bird, as it correctly applies the breast color and the bill shape.

Further evidence of the content leakage of StarGAN-v2 is shown in Fig. 3. When translating edges to shoes, both our method and StarGAN-v2 succeed in translating to the correct style in the target domain. However, when translating between shoe images within the same domain, StarGAN-v2 leaks the entire input image through the content code and can barely change its color. The effectiveness of our proposed content-bottleneck term can be easily observed, as our model performs well and is able to translate the style of the shoe while preserving the structure of the image.

Additional results are provided in the supplementary material.

Figure 2: Comparison between our method and StarGAN-v2 on CUB-47. Our method better captures the fine-grained details of the target style; for example, it is able to transfer over the breast color and the bill shape.
Figure 3: Comparison between our method and StarGAN-v2 on Edges2Shoes. Although both methods perform well on translating edges to shoes, StarGAN-v2 fails to disentangle content from style in the shoes domain and fails to transfer style between shoe images.

6 Discussion

VAEs for disentanglement in image translation models

Recent image translation methods rely on adversarial objectives for learning a domain-invariant content representation. To the best of our knowledge, a VAE-based content bottleneck has not been used before to disentangle content from style in multi-modal image translation. The closest exceptions we are aware of are UNIT Liu et al. (2017) and LORD Gabbay and Hoshen (2020). UNIT does not tackle the multi-modal setting and does not specifically use a VAE for disentanglement. LORD uses a content bottleneck for content-class disentanglement in a non-adversarial framework, but does not consider the multi-modal setting in which domains exhibit style variations.

Representation Disentanglement

We have significantly improved the disentanglement in the generated images over a state-of-the-art image translation framework. However, in preliminary experiments we observed that disentanglement is not fully achieved at the representation level (e.g. the domain can still be classified from the content codes), both in our method and in the baselines. LORD Gabbay and Hoshen (2020) tackles this issue and utilizes latent optimization for learning disentangled representations; unfortunately, it cannot model intra-domain variations or scale to high resolutions. We leave this challenge to future work.

Integration with other state-of-the-art models

We have also made an effort to integrate our content-bottleneck principle with other domain translation frameworks such as MUNIT. Although a considerable improvement was achieved in reducing style leakage (especially when translating dogs to wildlife), MUNIT suffers from other instability issues and therefore could not compete with our main framework.

Evaluation metrics for style-content disentanglement

As stated in the experimental section, current benchmarks for style-content disentanglement are not equipped with well-designed metrics for measuring style disentanglement, primarily due to their unsupervised nature. We believe that the development of better evaluation criteria will greatly speed up progress in this field.

7 Conclusion

We present a simple yet principled approach for improving style-content disentanglement in image translation. We show the effectiveness of our proposed content-bottleneck in preventing style leakage and improving the translation performance. Our method produces state-of-the-art results in terms of visual quality and output style diversity.

References

  • S. Benaim, M. Khaitov, T. Galanti, and L. Wolf (2019) Domain intersection and domain difference. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3445–3453.
  • D. Bouchacourt, R. Tomioka, and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
  • Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) StarGAN v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • E. L. Denton et al. (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414–4423.
  • A. Gabbay and Y. Hoshen (2020) Demystifying inter-class disentanglement. In ICLR.
  • A. Harsh Jha, S. Anand, M. Singh, and V. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510.
  • X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708.
  • M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10551–10560.
  • M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In NIPS.
  • A. Szabó, Q. Hu, T. Portenier, M. Zwicker, and P. Favaro (2018) Challenges in disentangling independent factors of variation. ICLRW.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
  • A. Yu and K. Grauman (2014) Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.

Appendix A Appendix

A.1 Qualitative results

Figure 4: More qualitative results of the comparison between our method and StarGAN-v2 on AFHQ. StarGAN-v2 leaks a significant amount of details of the content image and generates unreliable and inconsistent translations. Our method produces much more disentangled results and captures the target style faithfully.
Figure 5: More qualitative results of the comparison between our method and StarGAN-v2 on AFHQ. StarGAN-v2 leaks a significant amount of details of the content image and generates unreliable and inconsistent translations. Our method produces much more disentangled results and captures the target style faithfully.
Figure 6: More qualitative results of the comparison between our method and StarGAN-v2 on CUB-47. It can be seen that StarGAN-v2 tends to overfit the content exactly and often fails to generate valid bird images of the target style. Moreover, our method better captures the fine-grained details of the target style; for example, it is able to transfer over the throat color and the head pattern.
Figure 7: More qualitative results from the comparison between our method and StarGAN-v2 on Edges2Shoes. Although both methods perform well on translating edges to shoes, StarGAN-v2 fails to disentangle content from style in the shoes domain and fails to transfer style between shoe images.

AFHQ

We provide more qualitative results on AFHQ in Fig. 4 and 5. It can be seen that, especially when translating dogs to wildlife and vice versa, StarGAN-v2 leaks a significant amount of details of the content image and generates unreliable and inconsistent translations. Our method produces much more disentangled results and captures the target style faithfully.

CUB-47

We provide more qualitative results on CUB-47 in Fig. 6. It can be seen that our method better captures the fine-grained details of the target style; for example, it is able to transfer over the throat color and the head pattern.

Edges2Shoes

More evidence for the effectiveness of our proposed content bottleneck is shown in Fig. 7. StarGAN-v2 succeeds in translating edges to shoes, as there is no style-related information to leak from the edge images to the shoe images, but it fails to disentangle content from style within the shoes domain and barely changes the shoe color. Our method is capable of translating images both within a domain and across domains.