GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling

03/15/2020 ∙ by Yahui Liu, et al. ∙ 0

Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem known as mode collapse. To overcome these limitations, we propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation where the attribute space is fitted with a GMM. Each GMM component represents a domain, and this simple assumption has two prominent advantages. First, it can be easily extended to most multi-domain and multi-modal image-to-image translation tasks. Second, the continuous domain encoding allows for interpolation between domains and for extrapolation to unseen domains and translations. Additionally, we show how GMM-UNIT can be constrained down to different methods in the literature, meaning that GMM-UNIT is a unifying framework for unsupervised image-to-image translation.




1 Introduction

Translating images from one domain into another is a challenging task that has significant influence on many real-world applications where data are expensive or impossible to obtain and annotate. Image-to-image translation models have indeed been used to increase the resolution of images [dong2014learning], fill missing parts [pathak2016context], transfer styles [gatys2016image], synthesize new images from labels [liu2017unsupervised], and help domain adaptation [bousmalis2017unsupervised, murez2018image]. In many of these scenarios, it is desirable to have a model mapping one image to multiple domains, while providing visual diversity (e.g. a day scene ↔ night scene in different seasons). However, most of the existing models can either map an image to multiple stochastic results in a single domain, or model multiple domains in a deterministic fashion. In other words, the majority of the methods in the literature are either multi-domain or multi-modal, but not both.

Several reasons have hampered a stochastic translation of images to multiple domains. On the one hand, most Generative Adversarial Network (GAN) models assume a deterministic mapping [choi2018stargan, pumarola2018ganimation, zhu2017unpaired], thus failing to model the correct distribution of the data [huang2018multimodal]. On the other hand, approaches based on Variational Auto-Encoders (VAEs) usually assume a shared, zero-mean, unit-variance normally distributed space [huang2018multimodal, zhu2017toward], which limits them to two-domain translations.

Figure 1: GMM-UNIT is a multi-domain and multi-modal image-to-image translation model where the target domain can either be sampled from a distribution, or extracted from a reference image. The first two rows show diverse images generated for each domain translation. The last row shows translations from a reference image.

We propose a novel UNsupervised Image-to-image Translation (UNIT) model that disentangles the visual content from the domain attributes. The attribute latent space is assumed to follow a Gaussian Mixture Model (GMM), hence the name of the method: GMM-UNIT (see Figure 1). This simple assumption enables three key properties: mode diversity, thanks to the stochastic nature of the probabilistic latent model; multi-domain translation, since the domains are represented as clusters in the same attribute space; and few/zero-shot generation, since the continuity of the attribute representation allows interpolating between domains and extrapolating to unseen domains with very few or almost no observed data from these domains. The code and models will be made publicly available.

2 Related work

Our work is best placed in the literature of image-to-image translation, where the challenge is to translate one image from a visual domain (e.g. summer) to another one (e.g. winter). This problem is inherently ill-posed, as there could be many mappings between two images. Thus, researchers tried to tackle the problem from different perspectives. The most impressive results on this task are undoubtedly related to GANs, which aim to synthesize new images as similar as possible to the real data through an adversarial approach between a Discriminator and a Generator. The former continuously learns to recognize real and fake images, while the latter tries to generate new images that are indistinguishable from the real data, and thus to fool the Discriminator. These networks can be effectively conditioned and thus generate new samples from a specific class [chen2016infogan] and a latent vector extracted from the images. For example, [isola2017image] and [wang2018high] trained a conditional GAN to encode the latent features that are shared between images of the same domain and thus decode the features to images of the target domain in a one-to-one mapping. However, this approach is limited to supervised settings, where pairs of corresponding images in different domains are available (e.g. a photo-sketch image pair). In many cases, it is too expensive and unrealistic to collect a large amount of paired data.

Unsupervised Domain Translation. Translating images from one domain to another without paired supervision is particularly difficult, as the model has to learn how to represent both the content and the domain. Thus, constraints are needed to narrow down the space of feasible mappings between images. [taigmanPW17] proposed to minimize the feature-level distance between the generated and input images. [liu2017unsupervised] created a shared latent space between the domains, which encourages different images to be mapped in the same latent space. [zhu2017unpaired] proposed CycleGAN, which uses a cycle consistency loss that requires a generated image to be translated back to the original domain. Similarly, [kim2017learning] used a reconstruction loss applying the same approach to both the target and input domains. [mo2018instanceaware] later expanded the previous approach to the problem of translating multiple instances of objects in the same image. All these methods, however, are limited to a one-to-one domain mapping, thus requiring training multiple models for cross-domain translation. Recently, [choi2018stargan] proposed StarGAN, a unified framework to translate images in a multi-domain setting through a single GAN model. To do so, they used a conditional label and a domain classifier ensuring network consistency when translating between domains. However, StarGAN is limited to a deterministic mapping between domains.

Style transfer. A related problem is style transfer, which aims to change the style of an image while preserving its content (e.g. rendering a photo as a Monet painting) [donahue2018semantically, gatys2015neural, huang2017arbitrary, tenenbaum1997separating]. Differently from domain translation, the style is usually extracted from a single reference image. We will show that our model can be applied to style transfer as well.

Multi-modal Domain Translation. Most existing image-to-image translation methods are deterministic, thus limiting the diversity of the translated outputs. However, even in a one-to-one domain translation, such as when we want to translate people's hair from blond to black, there could be multiple hair color shades that are not modeled in a deterministic mapping. The straightforward solution would be injecting noise in the model, but this turned out to be ineffective as GANs tend to ignore it [isola2017image, mathieu2015deep, zhu2017toward]. To address this problem, [zhu2017toward] proposed BicycleGAN, which encourages multi-modality in a paired setting through GANs and Variational Auto-Encoders (VAEs). [almahairi2018augmented] instead augmented CycleGAN with two latent variables for the input and target domains and showed that it is possible to increase diversity by marginalizing over these latent spaces. [huang2018multimodal] proposed MUNIT, which assumes that domains share a common content space but different style spaces. Then, they showed that by sampling from the style space and using Adaptive Instance Normalization (AdaIN) [huang2017arbitrary], it is possible to obtain diverse and multi-modal outputs. Similarly, [ma2018exemplar] focused on the semantic consistency during the translation, and applied AdaIN to the feature-level space. Recently, [MSGAN] proposed a mode seeking loss to encourage GANs to better explore the modes and help the network avoid mode collapse.

Altogether, the models in the literature are either multi-modal or multi-domain. Thus, one has to choose between generating diverse results and training one single model for multiple domains. Here, we propose a unified model to overcome this limitation. Concurrent to our work, DRIT++ [lee2019drit++] also proposed a multi-modal and multi-domain model, which however uses a discrete domain encoding and assumes a zero-mean unit-variance Gaussian shared space for multiple modes. We instead propose a content-attribute disentangled representation, where the attribute space fits a GMM distribution. A variational loss forces the latent representation to follow this GMM, where each component is associated with a domain. This is the key to providing both multi-modal and multi-domain translation. In addition, GMM-UNIT is the first method proposing a continuous encoding of the domains, as opposed to the discrete encoding used in the literature. This is important because it allows for domain interpolation and extrapolation with very few or no data (few/zero-shot generation). The main properties of GMM-UNIT compared to the literature are shown in Table 1.

Method                          Unpaired   Multi-Domain   Multi-Modal   Domain encoding
CycleGAN [zhu2017unpaired]      yes        no             no            None
BicycleGAN [zhu2017toward]      no         no             yes           None
MUNIT [huang2018multimodal]     yes        no             yes           None
StarGAN [choi2018stargan]       yes        yes            no            Discrete
DRIT++ [lee2019drit++]          yes        yes            yes           Discrete
GMM-UNIT                        yes        yes            yes           Continuous
Table 1: A comparison of the state of the art for image-to-image translation.

3 GMM-UNIT

GMM-UNIT is an image-to-image translation model that translates an image from one domain to multiple domains in a stochastic fashion, which means that it generates multiple outputs with visual diversity for the same translation.

Following recent seminal works [huang2018multimodal, lee2018diverse], our model assumes that each image can be decomposed into a domain-invariant content space and a domain-specific attribute space. Given the attributes of a set of images, we model the attribute latent space through Gaussian Mixture Models (GMMs). Formally, the probability density of the latent space $z$ is defined as:

$$p(z) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(z; \mu^k, \Sigma^k),$$

where $z \in \mathbb{R}^d$ denotes a random attribute vector sample, and $\mu^k$ and $\Sigma^k$ denote respectively the mean vector and covariance matrix of the $k$-th GMM component, which is a $d$-dimensional Gaussian ($\mu^k \in \mathbb{R}^d$ and $\Sigma^k \in \mathbb{R}^{d \times d}$ is symmetric and positive definite). $\phi_k$ denotes the weight associated with the $k$-th component, where $\phi_k \geq 0$ and $\sum_{k=1}^{K} \phi_k = 1$. As later explained, in this paper we set $K$ to the number of domains, which means that each Gaussian component represents a domain. In other words, for an image $x^n$ from domain $\mathcal{X}^n$ (i.e. $x^n \sim p_{\mathcal{X}^n}$), its latent attribute $z^n$ is assumed to follow $\mathcal{N}(\mu^n, \Sigma^n)$, which is the $n$-th Gaussian component of the GMM and describes the domain $\mathcal{X}^n$.
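As an illustration of the attribute model above, the following minimal sketch samples an attribute vector from the component of one domain and evaluates the mixture density. It assumes diagonal covariances and a toy two-domain, two-dimensional setup; the function names and all numeric values are hypothetical and not taken from the paper.

```python
import math
import random


def sample_attribute(mu, sigma_diag, rng):
    """Draw z ~ N(mu, diag(sigma_diag)) from the component of one domain."""
    return [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, sigma_diag)]


def component_density(z, mu, sigma_diag):
    """Density of a diagonal-covariance Gaussian evaluated at z."""
    log_p = 0.0
    for zi, m, s in zip(z, mu, sigma_diag):
        log_p += -0.5 * (math.log(2.0 * math.pi * s) + (zi - m) ** 2 / s)
    return math.exp(log_p)


def mixture_density(z, weights, mus, sigma_diags):
    """p(z) = sum_k phi_k * N(z; mu^k, Sigma^k)."""
    return sum(
        w * component_density(z, m, s)
        for w, m, s in zip(weights, mus, sigma_diags)
    )


# Hypothetical toy setup: K = 2 domains, 2-dimensional attribute space.
weights = [0.5, 0.5]
mus = [[-2.0, 0.0], [2.0, 0.0]]
sigma_diags = [[1.0, 1.0], [1.0, 1.0]]

rng = random.Random(0)
z = sample_attribute(mus[1], sigma_diags[1], rng)  # attribute for the 2nd domain
p = mixture_density(z, weights, mus, sigma_diags)
```

Sampling from a single component corresponds to choosing the target domain first and then drawing a domain-specific attribute, which is what makes the translation both multi-domain and stochastic.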

In the proposed representation, the domains are Gaussian components in a mixture. This simple yet effective model has one prominent advantage. Differently from previous works, where each domain is a category with a binary vector representation, we model the distribution of attribute space. The continuous encoding of the domains we here introduce allows us to navigate in the attribute latent space, thus generating images corresponding to domains that have never (or very little) been observed and allowing to interpolate between two domains.

We note that state-of-the-art models can be traced back to particular cases of GMMs. Existing multi-domain models such as StarGAN [choi2018stargan] or GANimation [pumarola2018ganimation] can be modeled with $K$ equal to the number of domains and $\Sigma^k = 0$, thus only allowing the generation of a single result per domain translation. Then, when $K = 1$, $\mu = 0$, and $\Sigma = I$, it is possible to model the state-of-the-art approaches in multi-modal translation [huang2018multimodal, zhu2017toward], which share a unique latent space where every domain is overlapped, and it is thus necessary to train a separate model for each pair of domains to achieve multi-domain translation. Finally, we can obtain DRIT++ [lee2019drit++] by separating the attribute latent space into what they call an attribute space and a domain code. The former is a GMM with $K = 1$, $\mu = 0$, and $\Sigma = I$, while the latter is another GMM with $K$ equal to the number of domains and $\Sigma^k = 0$, which in [lee2019drit++] is a one-hot encoding of the domain. Thus, our GMM-UNIT is a generalization of the existing state of the art. In the next sections, we formalize our model and show that the use of GMMs for the latent space allows learning multi-modal and multi-domain mappings, and also few/zero-shot image generation.

Figure 2: GMM-UNIT translates an input image from one domain to a target domain. The content is extracted from the input image, while the attribute can be either sampled (b) or extracted from a reference image (c). In detail: a) Training phase to translate an image from domain $\mathcal{X}^n$ to $\mathcal{X}^m$. The generator uses the content of the input image (extracted by $E_c$) and the attribute of the target image (extracted by $E_z$) to generate an image in $\mathcal{X}^m$. This image has the content of $x^n$ (e.g. Scarlett Johansson) but the attributes of $x^m$ (e.g. black hair). The attributes are modeled through a GMM. b) Testing phase where we use the content of an image in $\mathcal{X}^n$ and the target attributes sampled from the GMM distribution of the attributes of domain $\mathcal{X}^m$; c) Testing phase where we extract the content from an image in $\mathcal{X}^n$ and the attributes from an image belonging to the target domain $\mathcal{X}^m$. The style of this Figure is inspired by [zhu2017toward].

3.1 The generative-discriminative approach

GMM-UNIT follows the generative-discriminative philosophy. The generator $G$ takes as input a content latent code $c$ and an attribute latent code $z$, and outputs a generated image $G(c, z)$. This image is then fed to a discriminator $D$ that must discern between "real" and "fake" images ($D_{r/f}$), and must also recognize the domain of the generated image ($D_{dom}$). The attribute and content latent representations need to be learned, and they are modeled by two architectures, namely a content extractor $E_c$ and an attribute extractor $E_z$. See Figure 2 for a graphical representation of GMM-UNIT for an $\mathcal{X}^n \rightarrow \mathcal{X}^m$ domain translation.

In addition to tackling the problem of multi-domain and multi-modal translation, we would like these two extractors, content and attribute, to be disentangled [huang2018multimodal]. This would constrain the learning and hopefully yield better domain translation, since the content would be as independent as possible from the attributes. We expect the attribute features to be related to the considered attributes, while the content features are supposed to be related to the rest of the image. Formally, the following two properties must hold:

Sampled attribute translation
Extracted attribute translation
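The formal statements of these two properties did not survive in the text above. A plausible reading, reconstructed from the description in Figure 2 and written with the symbols used in the loss functions below ($G$ the generator, $E_c$ and $E_z$ the content and attribute extractors, $\mathcal{X}^n$ and $\mathcal{X}^m$ the source and target domains), is sketched here as an assumption rather than as the original equations:

```latex
% Sampled attribute translation (assumed form)
\[
G\big(E_c(x^n), z^m\big) \in \mathcal{X}^m,
\qquad \forall\, z^m \sim \mathcal{N}(\mu^m, \Sigma^m),\; x^n \in \mathcal{X}^n
\]
% Extracted attribute translation (assumed form)
\[
G\big(E_c(x^n), E_z(x^m)\big) \in \mathcal{X}^m,
\qquad \forall\, x^n \in \mathcal{X}^n,\; x^m \in \mathcal{X}^m
\]
```

In both cases the content comes from the source image, and only the origin of the attribute code differs: a draw from the target component, or the attribute extracted from a reference image of the target domain.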

3.2 Training the GMM-UNIT

The encoders $E_c$ and $E_z$, and the generator $G$ need to be learned to satisfy three main properties. Consistency: An image and its generated/extracted codes have to be consistent even after a translation from a domain $\mathcal{X}^n$ to a domain $\mathcal{X}^m$. Fit: The distribution of the attribute latent space must follow a GMM. Realism: The generated images must be indistinguishable from the real images. In the following, we discuss the different losses used to force the overall pipeline to satisfy these properties.

In the consistency term, we include image, attribute and content reconstruction, as well as cycle consistency. More formally, we use the following losses:

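  • Self-reconstruction of any input image from its extracted content and attribute vectors:
    $$\mathcal{L}_{s/rec} = \sum_{n=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n}} \big[ \| G(E_c(x), E_z(x)) - x \|_1 \big]$$

  • Content reconstruction from an image, translated into any domain:
    $$\mathcal{L}_{c/rec} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z \sim \mathcal{N}(\mu^m, \Sigma^m)} \big[ \| E_c(G(E_c(x), z)) - E_c(x) \|_1 \big]$$

  • Attribute reconstruction from an image translated with any content:
    $$\mathcal{L}_{a/rec} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z \sim \mathcal{N}(\mu^m, \Sigma^m)} \big[ \| E_z(G(E_c(x), z)) - z \|_1 \big]$$

  • Cycle consistency when translating an image back to the original domain:
    $$\mathcal{L}_{cyc} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z \sim \mathcal{N}(\mu^m, \Sigma^m)} \big[ \| G(E_c(G(E_c(x), z)), E_z(x)) - x \|_1 \big]$$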

We note that all these losses are used in prior work [choi2018stargan, huang2018multimodal, zhu2017unpaired, zhu2017toward] to constrain the infinite number of mappings that exist between an image in one domain and an image in another one. The $L_1$ loss is used as it generates sharper results than the $L_2$ loss [isola2017image]. We also propose to complement the attribute reconstruction with an isometry loss, which encourages the extracted attributes to be as similar as possible to the sampled attributes by preserving the distance between any two sampled attribute vectors. Formally:

$$\mathcal{L}_{iso} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z, z' \sim \mathcal{N}(\mu^m, \Sigma^m)} \Big[ \big| \, \| E_z(G(E_c(x), z)) - E_z(G(E_c(x), z')) \|_1 - \| z - z' \|_1 \, \big| \Big]$$
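The consistency terms can be sketched per sample as follows. This is a minimal illustration, not the training code: the expectation over data and domains is omitted, images and codes are plain Python lists, and the toy `E_c`, `E_z` and `G` (an image is simply its content followed by its attribute) are hypothetical stand-ins for the networks.

```python
def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def consistency_losses(x, z, z_alt, E_c, E_z, G):
    """Per-sample consistency terms for one image x and sampled attributes z, z_alt."""
    c = E_c(x)
    x_trans = G(c, z)  # translate x using the sampled attribute z
    losses = {
        "s_rec": l1(G(c, E_z(x)), x),            # self-reconstruction
        "c_rec": l1(E_c(x_trans), c),            # content reconstruction
        "a_rec": l1(E_z(x_trans), z),            # attribute reconstruction
        "cyc": l1(G(E_c(x_trans), E_z(x)), x),   # cycle consistency
    }
    # Isometry: distances between extracted attributes should match
    # distances between the sampled attributes.
    z_hat = E_z(x_trans)
    z_alt_hat = E_z(G(c, z_alt))
    losses["iso"] = abs(l1(z_hat, z_alt_hat) - l1(z, z_alt))
    return losses


# Hypothetical toy "networks": an image is its content followed by its attribute.
def E_c(x):
    return x[:2]


def E_z(x):
    return x[2:]


def G(c, z):
    return list(c) + list(z)


x = [0.1, 0.2, 1.0, -1.0]
out = consistency_losses(x, z=[0.5, 0.5], z_alt=[0.0, 1.0], E_c=E_c, E_z=E_z, G=G)
```

With these perfectly disentangled toy functions every term is zero, which is the ideal case the losses push the real networks toward.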

In the fit term we encourage both the attribute latent variable to follow the Gaussian mixture distribution and the generated images to follow the domain's distribution. We set two loss functions:

  • Kullback-Leibler divergence between the extracted latent code and the model. Since the KL divergence between two GMMs is not analytically tractable, we rely on the fact that we know which domain we are sampling from and define:
    $$\mathcal{L}_{KL} = \sum_{n=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n}} \big[ D_{KL}\big( E_z(x) \,\|\, \mathcal{N}(\mu^n, \Sigma^n) \big) \big]$$
    where $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler divergence.

  • Domain classification of generated and original images. For any given input image $x$, we would like the method to classify it as its original domain, and to be able to generate from its content an image in any domain. Therefore, we need two different losses, one directly applied to the original images, and a second one applied to the generated images:
    $$\mathcal{L}_{dom}^{D} = \sum_{n=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n}} \big[ -\log D_{dom}(d_{\mathcal{X}^n} \mid x) \big]$$
    $$\mathcal{L}_{dom}^{G} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z \sim \mathcal{N}(\mu^m, \Sigma^m)} \big[ -\log D_{dom}(d_{\mathcal{X}^m} \mid G(E_c(x), z)) \big]$$
    where $d_{\mathcal{X}^n}$ is the label of domain $\mathcal{X}^n$. Importantly, while the generator is trained using the second loss only, the discriminator is trained using both.
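Because the domain of each training image is known, the divergence only has to be computed against a single Gaussian component, for which a closed form exists. The sketch below assumes, purely for illustration, that the attribute extractor outputs a mean and a diagonal variance, and that the domain component is also diagonal; neither assumption is stated in the text, and the function name is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl


# Hypothetical component of domain n, and two hypothetical extractor outputs.
mu_n, var_n = [2.0, 0.0], [1.0, 1.0]
kl_same = kl_diag_gaussians(mu_n, var_n, mu_n, var_n)            # perfect fit
kl_shifted = kl_diag_gaussians([1.0, 0.0], var_n, mu_n, var_n)   # mean off by 1
```

The divergence vanishes only when the extracted distribution coincides with the component of the image's own domain, which is what ties each domain to one cluster of the attribute space.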

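The realism term aims to make the generated images indistinguishable from real images; we adopt the adversarial loss to optimize both the real/fake discriminator $D_{r/f}$ and the generator $G$:

$$\mathcal{L}_{GAN} = \sum_{n,m=1}^{K} \mathbb{E}_{x \sim p_{\mathcal{X}^n}} \big[ -\log D_{r/f}(x) \big] + \mathbb{E}_{x \sim p_{\mathcal{X}^n},\, z \sim \mathcal{N}(\mu^m, \Sigma^m)} \big[ -\log\big(1 - D_{r/f}(G(E_c(x), z))\big) \big]$$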

The full objective function of our network is:

where the coefficients are hyper-parameters that weight the corresponding loss terms. The values of most of these parameters come from the literature. We refer to the Supplementary for the details.

4 Experiments

We perform extensive quantitative and qualitative analysis in three real-world tasks, namely: edges-shoes, digits and faces. First, we test GMM-UNIT on a simple task such as a one-to-one domain translation. Then, we move to the problem of multi-domain translation where the domains are independent of one another. Finally, we test our model on multi-domain translation where each domain is built upon different combinations of lower level attributes. Specifically, for this task, we test GMM-UNIT on a dataset containing 40 labels related to facial attributes such as hair color, gender, and age. Each domain is then composed of combinations of these attributes, which might be mutually exclusive (e.g. either male or female) or mutually inclusive (e.g. blond and black hair).

Additionally, we show how the learned GMM latent space can be used to interpolate attributes and generate images in previously unseen domains. Finally, we apply GMM-UNIT to the Style transfer task.

We compare our model to the state of the art of both multi-modal and multi-domain image translation problems. In the former, we select BicycleGAN [zhu2017toward], MUNIT [huang2018multimodal] and MSGAN [MSGAN]. In the latter, we compare with StarGAN [choi2018stargan] and DRIT++ [lee2019drit++], which is the only multi-modal and multi-domain method in the literature. However, StarGAN is not multi-modal. Thus, similarly to what was done previously [zhu2017toward], we modify StarGAN to be conditioned on Gaussian noise in the input domain vector. We call this version of the model StarGAN* and we test it. More details are in the Supplementary.

4.1 Metrics

We quantitatively evaluate our method through image quality and diversity of generated images. The former is evaluated through the Fréchet Inception Distance (FID) [NIPS2017_7240], while we evaluate the latter through the LPIPS [zhang2018unreasonable].

FID. We use FID to measure the distance between the generated and real distributions. Lower FID values indicate better quality of the generated images. We estimate the FID using 1,000 input images and 10 samples per input vs. 10,000 randomly selected images from the target domain.
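For reference, FID is commonly computed as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below ($m_r, C_r$ for the mean and covariance of the real features, $m_g, C_g$ for the generated ones) are introduced here for illustration and are unrelated to the GMM parameters of Section 3:

```latex
\[
\mathrm{FID} = \lVert m_r - m_g \rVert_2^2
  + \operatorname{Tr}\!\left( C_r + C_g - 2\,(C_r C_g)^{1/2} \right)
\]
```

The first term compares the feature means and the second compares the covariances, so the score is zero only when both fitted Gaussians coincide.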

LPIPS. The LPIPS distance is defined as the $L_2$ distance between the features extracted by a deep learning model of two images. This distance has been demonstrated to match human perceptual similarity well [zhang2018unreasonable]. Thus, following [huang2018multimodal, lee2018diverse, zhu2017toward], we randomly select 100 input images and translate them to different domains. For each domain translation, we generate 10 images for each input image and evaluate the average LPIPS distance between the 10 generated images. Finally, we average all distances. A higher LPIPS distance indicates better diversity among the generated images.
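The diversity protocol can be sketched as follows. The sketch assumes that "the average distance between the generated images" means the mean over all unordered pairs of outputs for the same input, which the text does not state explicitly, and it uses a toy distance as a stand-in for the real LPIPS network; all names are hypothetical.

```python
from itertools import combinations


def mean_pairwise_distance(samples, dist):
    """Average distance over all unordered pairs of outputs for one input."""
    pairs = list(combinations(samples, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def diversity_score(outputs_per_input, dist):
    """Average the per-input pairwise distance over all inputs."""
    per_input = [mean_pairwise_distance(s, dist) for s in outputs_per_input]
    return sum(per_input) / len(per_input)


def toy_dist(a, b):
    """Toy stand-in for LPIPS: L1 distance between flat feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical inputs with three generated outputs each.
outputs = [
    [[0.0], [1.0], [2.0]],  # diverse outputs
    [[5.0], [5.0], [5.0]],  # collapsed outputs (no diversity)
]
score = diversity_score(outputs, toy_dist)
```

A deterministic model produces identical outputs for each input, so its score under this protocol is zero regardless of image quality.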

4.2 Edges ↔ Shoes: Two-domain Translation

We first evaluate our model on a simpler task than multi-domain translation: two-domain translation (e.g. edges to shoes). We use the dataset provided by [isola2017image, zhu2017unpaired] containing images of shoes and their edge maps generated by the Holistically-nested Edge Detection (HED) [xie2015holistically]. We resize all images to 256×256 and train a single model for edges ↔ shoes without using paired information. Figure 3 displays examples of shoes generated from the same sketch by all the state-of-the-art models. GMM-UNIT and MUNIT generate high-quality and diverse results that are almost indistinguishable from the ground truth and the results of BicycleGAN, which is a paired (supervised) method. Although MSGAN and DRIT++ generate diverse images, they suffer from low-quality results. The results of StarGAN* confirm the findings of previous studies that only adding noise does not increase diversity [isola2017image, mathieu2015deep, zhu2017toward]. These results are confirmed in the quantitative evaluation displayed in Table 2. Our model generates images with high diversity and quality using half the parameters of the state of the art (MUNIT), which needs to be re-trained for each transformation. In particular, the diversity is comparable to the paired model performance. These results show that this multi-modal and multi-domain model can also be efficiently applied to simpler tasks than multi-domain problems without much loss in performance, while other multi-domain models suffer in this setting. We refer to the Supplementary for additional results on this task.

Figure 3: Qualitative evaluation on Edges ↔ Shoes.
Model Unpaired MM MD FID LPIPS Params
StarGAN* [choi2018stargan] 140.41
MUNIT [huang2018multimodal]
MSGAN [MSGAN] 111.19
DRIT++ [lee2019drit++] 123.87
GMM-UNIT 58.46
BicycleGAN [zhu2017toward] 47.43
Table 2: Quantitative evaluation on the Edges ↔ Shoes dataset. The best performance for unpaired (unsupervised) models is in green. BicycleGAN is a supervised (paired) method. MM and MD stand for Multi-Modal and Multi-Domain, respectively.

4.3 Digits: Single-attribute Multi-domain Translation

We then increase the complexity of the task by evaluating our model in a multi-domain translation setting, where each domain is composed of digits collected in different scenes. We use the Digits-Five dataset introduced in [xu2018deep], from which we select three different domains, namely MNIST [lecun1998gradient], MNIST-M [ganin2014unsupervised], and Street View House Numbers (SVHN) [yuval2011reading]. Given that all images are resized to 32×32, we reduce the depth of our model and of the compared models during training. We compare our model with the state of the art on multi-domain translation, and we show the quantitative results in Table 3. Due to space limits, extensive qualitative results are reported in the Supplementary.

From these results we conclude that StarGAN* fails at generating diversity, while GMM-UNIT generates images with higher quality and diversity than all the state-of-the-art models. Additional experiments carried out implementing a StarGAN*-like GMM-UNIT (i.e. setting $\Sigma^k = 0$ for every component) indeed produced similar results. Specifically, the StarGAN*-like GMM-UNIT tends to generate for each input image one single (deterministic) output and thus the corresponding LPIPS scores are zero. We refer to the Supplementary for additional results on this task.

Model MM MD Digits Faces
StarGAN* [choi2018stargan] 69.11 51.68
DRIT++ [lee2019drit++] 88.94 55.64
Table 3: Quantitative evaluation on the Digits and Faces datasets. The best performance is in green. For Faces, we also evaluate the diversity on the background.

4.4 Faces: Multi-attribute Multi-domain Translation

We also evaluate GMM-UNIT in the complex setting of multi-domain translation in a dataset of facial attributes. We use the CelebFaces Attributes (CelebA) dataset [liu2015deep], which contains 202,599 face images of celebrities where each face is annotated with 40 binary attributes. We apply central cropping to reduce the initial 178×218 images to 178×178, then resize the cropped images to 128×128. We randomly select 2,000 images for testing and use all remaining images for training. This dataset is composed of some attributes that are mutually exclusive (e.g. either male or female) and some that are mutually inclusive (e.g. people could have both blond and black hair). Thus, we model each attribute as a different GMM component. For this reason, we can generate new images for all the combinations of attributes by sampling from the GMM. As mentioned above, this is not possible for state-of-the-art models such as StarGAN and DRIT++, as they use one-hot domain codes to represent the domains. To be consistent with the state of the art (StarGAN) we show five binary attributes: hair color (black, blond, brown), gender (male/female), and age (young/old). These five attributes allow GMM-UNIT to generate 32 domains.
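The count of 32 domains follows from treating each of the five attributes as an independent binary switch, i.e. $2^5 = 32$ combinations. A minimal sketch of this enumeration is given below; the attribute names and the tuple encoding are hypothetical and only serve to illustrate the arithmetic.

```python
from itertools import product

# Five binary attributes, as listed in the text (names are illustrative).
attributes = ["black_hair", "blond_hair", "brown_hair", "male", "young"]

# Every on/off combination of the attributes defines one domain.
domains = list(product([0, 1], repeat=len(attributes)))

# Example: the combination black hair + blond hair + male + young.
black_blond_male_young = (1, 1, 0, 1, 1)
```

Combinations such as black and blond hair together are rare in the data but are still valid points of this enumeration.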

We observed that image-to-image translation is sensitive to complex background information. In fact, models are inclined to manipulate the intensity and details of pixels that are not related to the desired attribute transformation. Hence, we add a convolutional layer at the end of the decoder to learn a one-channel attention mask in an unsupervised manner. The final prediction is then obtained by combining the input image and its initial prediction according to this mask. We also apply the attention layer to Edges ↔ Shoes and Digits, but find that it provides no noticeable improvements in the results.
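The exact blending formula is not reproduced in the text above. One common convention for attention masks, shown here as an assumption and not as the authors' formula, is a per-pixel convex combination in which a mask value of 1 keeps the generated pixel and a value of 0 keeps the input pixel. Images are flattened to lists for brevity and the function name is hypothetical.

```python
def blend_with_mask(x_input, x_pred, mask):
    """Per-pixel blend: mask = 1 keeps the prediction, mask = 0 keeps the input."""
    return [m * p + (1.0 - m) * x for x, p, m in zip(x_input, x_pred, mask)]


# Hypothetical flattened 3-pixel example.
x_input = [0.2, 0.4, 0.6]   # original image
x_pred = [1.0, 1.0, 1.0]    # initial prediction of the decoder
mask = [0.0, 1.0, 0.5]      # learned one-channel attention mask

x_final = blend_with_mask(x_input, x_pred, mask)
```

Under this convention, background pixels with a mask close to zero are copied from the input, which is consistent with the goal of leaving unrelated regions untouched.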

Figure 4 shows some generated results of our model. We can see that GMM-UNIT learns to translate images to simple attributes such as blond hair, but also to translate images with combinations of them (e.g. blond hair and male). Moreover, we can see that the rows show different realizations of the model, thus demonstrating the stochastic approach of GMM-UNIT. These results are corroborated by Table 3, which shows that our model is superior to StarGAN* and DRIT++ in both quality and diversity of generated images. In particular, the use of an attention mechanism allows our model to achieve diversity only on the part of the image that is involved in the transformation (e.g. hair and face for gender and hair translation). To demonstrate this, we compute the LPIPS distance between the background of the input image and the generated images (background LPIPS). Table 3 shows that our model is the best at preserving the original background information. In Figure 8 we show the difference between the diversity we achieve and the diversity of DRIT++. GMM-UNIT preserves the background while it changes the face and creates diverse hair styles, while DRIT++ just changes the overall color intensity and affects parts of the image not related to the attributes, which is not desirable. Extensive results are displayed in the Supplementary.

Input Black hair Brown hair Blond hair Blond+Male Blond+Older
Figure 4: Facial expression synthesis results on the CelebA dataset with different attribute combinations. Each row represents a different output sampled from the model.

4.5 Style transfer

We evaluate our model on style transfer, which is a specific task where the style is usually extracted from a single reference image. Thus, we randomly select two input images and synthesize new images where, instead of sampling from the GMM distribution, we extract the style (through $E_z$) from some reference images. Figure 5 shows that the generated images are sharp and realistic, showing that our method can also be effectively applied to style transfer.

Figure 5: Examples of GMM-UNIT applied to the style transfer task. The style is here extracted from a single reference image provided by the user.
Input Black+Blond+Female+Young Black+Blond+Male+Young
Figure 6: Generated images in previously unseen combinations of attributes.
Input Black hair+Female+Young Blond hair+Female+Young
Figure 7: An example of domain interpolation given an input image.

4.6 Domain interpolation and extrapolation

In addition, we evaluate the ability of GMM-UNIT to synthesize new images with attributes that are extremely scarce or absent in the training dataset. To do so, we select three combinations of attributes consisting of fewer than two images in the CelebA dataset: Black hair+Blond hair+Male+Young and Black hair+Blond hair+Female+Young.

Figure 6 shows that learning the continuous and multi-modal latent distribution of attributes allows us to effectively generate images in a zero- or few-shot manner. To the best of our knowledge, we are the first to translate images into previously unseen domains at no additional cost. Recent literature on zero-pair translation learning indeed scales linearly with the number of domains [wang2018mix]. This ability can be of vital importance in tasks where labels are extremely imbalanced.

Finally, we show that by learning the full latent distribution of the attributes we can do attribute interpolation both intra- and inter-domain. In contrast, state-of-the-art methods such as [lee2019drit++] can only do intra-domain interpolations due to their discrete domain encoding. Other works such as Chen et al. [chen2019homomorphic] focus on explicitly learning an interpolation and use a reference image to do the same task, while we can either interpolate between two reference images or between any two points in the attribute latent space (by sampling these points/vectors), even for multiple attributes. Figure 7 shows some images generated through a linear interpolation between two given attributes, while in the Supplementary we show that we can also do intra-domain interpolations.
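Linear interpolation in the attribute space can be sketched as below. The two endpoint vectors stand for the attribute codes of two domains (for instance two component means, or codes extracted from two reference images); their values and the function names are hypothetical. Each intermediate vector would then be passed to the generator together with the content code of the input image.

```python
def lerp(z_a, z_b, t):
    """Point at fraction t on the segment from z_a (t = 0) to z_b (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Evenly spaced attribute vectors from z_a to z_b, endpoints included."""
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical attribute codes for two domains (e.g. black hair and blond hair).
z_black = [-2.0, 0.0]
z_blond = [2.0, 0.0]

path = interpolation_path(z_black, z_blond, steps=5)
```

Because the domain encoding is continuous, every point on the path is a valid attribute code, which is what a one-hot domain code cannot offer.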

4.7 Ablation study

Given that the importance of and was verified in previous works (i.e. CycleGAN and StarGAN), and that are necessary to the model convergence, we compare GMM-UNIT with three variants of the model that ablate , and in the Digits dataset. Figure 9 shows the results of the ablation. As expected, is needed to have higher image quality, and we observe that it increases the diversity because of noisy results. When is removed image quality decreases, but still helps to learn the attributes space. Finally, without we observe that both diversity and quality decrease, thus confirming the need of all these losses. For the first time from its introduction in [huang2018multimodal], we also test for the disentangled assumption of visual content and attributes. Although we cannot test the network removing the attribute extractor , we remove the content extractor and change the generator to have and as input. We observe that the results are similar, although the diversity decreases substantially. This means that the disentanglement approach needs to be further studied in the multiple architectures and tasks that propose it [gonzalez2018image, huang2018multimodal, wu2019transgaga] to understand its necessity and contribution. We refer to Supplementary for the disentanglement and the additional ablation results broken down by domain.

Figure 8: Thanks to the attention mechanism, GMM-UNIT varies only the subject, whereas DRIT++ also changes the background. (Panels: Input; Black hair + Female.)
(A) w/o 84.06
(A) w/o 62.20
(A) w/o 63.70
(A) w/o disent. 60.72
Figure 9: Ablation study performance on the Digits dataset.

5 Conclusion

In this paper, we present a novel image-to-image translation model that maps images to multiple domains and provides a stochastic translation. GMM-UNIT disentangles the content of an image from its attributes and represents the attribute space with a GMM, which allows us to have a continuous encoding of domains. This has two main advantages. First, it can easily be extended to most multi-domain and multi-modal image-to-image translation tasks. Second, GMM-UNIT allows for interpolation across domains and for the translation of images into previously unseen domains.

We conduct extensive experiments on three different tasks, namely two-domain translation, multi-domain translation and multi-attribute multi-domain translation. We show that GMM-UNIT achieves quality and diversity superior to the state of the art, most of the time with fewer parameters. Future work includes learning the mean vectors of the GMM from the data and extending the experiments to a higher number of GMM components per domain.


Appendix 0.A Implementation details

Our deep neural model architecture is built upon the state-of-the-art methods MUNIT [huang2018multimodal], BicycleGAN [zhu2017toward] and StarGAN [choi2018stargan]. As shown in Table 4, we apply Instance Normalization (IN) [ulyanov2017improved] to the content encoder, while we apply Adaptive Instance Normalization (AdaIN) [huang2017arbitrary] and Layer Normalization (LN) [ba2016layer] to the decoder. For the discriminator network, we use Leaky ReLU [xu2015empirical] with a negative slope of 0.2. We note that we reduce the number of layers of the discriminator on the Digits dataset.
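To make the building blocks named above concrete, here is a minimal, framework-free sketch of Leaky ReLU with a negative slope of 0.2 and of AdaIN applied to one flattened feature channel. It is not the authors' implementation; in particular, how the scale `gamma` and shift `beta` are predicted from the attribute vector is not shown and is left as an assumption.

```python
import math
from typing import List


def leaky_relu(x: float, negative_slope: float = 0.2) -> float:
    """Leaky ReLU: identity for x >= 0, small linear slope otherwise."""
    return x if x >= 0.0 else negative_slope * x


def adain(channel: List[float], gamma: float, beta: float, eps: float = 1e-5) -> List[float]:
    """Adaptive Instance Normalization on one channel of one sample.

    The channel is normalized with its own mean and variance (as in
    Instance Normalization) and then rescaled with gamma and shifted
    with beta, which are supplied externally.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (v - mean) * inv_std + beta for v in channel]


if __name__ == "__main__":
    feature_channel = [1.0, 2.0, 3.0, 4.0]  # hypothetical flattened H x W values
    print(adain(feature_channel, gamma=2.0, beta=1.0))
    print([leaky_relu(v) for v in (-1.0, 0.0, 3.0)])
```

With `gamma = 1` and `beta = 0` the function reduces to plain (non-affine) Instance Normalization.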

Part Input Output Shape Layer Information
Content encoder
(, , 3) (, , 64) CONV-(N64, K7x7, S1, P3), IN, ReLU
(, , 64) (, , 128) CONV-(N128, K4x4, S2, P1), IN, ReLU
(, , 128) (, , 256) CONV-(N256, K4x4, S2, P1), IN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), IN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), IN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), IN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), IN, ReLU
Attribute encoder
(, , 3) (, , 64) CONV-(N64, K7x7, S1, P3), ReLU
(, , 64) (, , 128) CONV-(N128, K4x4, S2, P1), ReLU
(, , 128) (, , 256) CONV-(N256, K4x4, S2, P1), ReLU
(, , 256) (, , 256) CONV-(N256, K4x4, S2, P1), ReLU
(, , 256) (, , 256) CONV-(N256, K4x4, S2, P1), ReLU
(, , 256) (1, 1, 256) GAP
(256,) (,) FC-(N)
(256,) (,) FC-(N)
Decoder
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), AdaIN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), AdaIN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), AdaIN, ReLU
(, , 256) (, , 256) Residual Block: CONV-(N256, K3x3, S1, P1), AdaIN, ReLU
(, , 256) (, , 128) UPCONV-(N128, K5x5, S1, P2), LN, ReLU
(, , 128) (, , 64) UPCONV-(N64, K5x5, S1, P2), LN, ReLU
(, , 64) (, , 3) CONV-(N3, K7x7, S1, P3), Tanh
†(, , 64(+1)) (, , 1) CONV-(N3, K7x7, S1, P3), Sigmoid
Discriminator
(, , 3) (, , 64) CONV-(N64, K4x4, S2, P1), Leaky ReLU
(, , 64) (, , 128) CONV-(N128, K4x4, S2, P1), Leaky ReLU
(, , 128) (, , 256) CONV-(N256, K4x4, S2, P1), Leaky ReLU
(, , 256) (, , 512) CONV-(N512, K4x4, S2, P1), Leaky ReLU
(, , 512) (, , 1) CONV-(N1, K1x1, S1, P0)
(, , 512) (1, 1, ) CONV-(N, Kx, S1, P0)
Table 4: GMM-UNIT network architecture. We use the following notations: : the dimension of attribute vector, : the number of attributes, N: the number of output channels, K: kernel size, S: stride size, P: padding size, CONV: a convolutional layer, GAP: a global average pooling layer, UPCONV: a 2× bilinear upsampling layer followed by a convolutional layer, FC: fully connected layer. We set in Edges2shoes and Digits, in Faces. † marks an optional layer.
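The K/S/P settings in Table 4 determine how the spatial resolution changes from layer to layer. The sketch below applies the standard convolution output-size formula, floor((size + 2P - K) / S) + 1, to the layer settings listed in the table. The 128-pixel input size is a hypothetical value chosen only for illustration.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    size = 128  # hypothetical input height/width
    # Content encoder: one K7/S1/P3 layer, then two K4/S2/P1 layers.
    for kernel, stride, padding in [(7, 1, 3), (4, 2, 1), (4, 2, 1)]:
        size = conv_out(size, kernel, stride, padding)
        print(f"K{kernel} S{stride} P{padding} -> {size}")
    # Decoder: each UPCONV doubles the size, then a K5/S1/P2 conv keeps it.
    for _ in range(2):
        size = conv_out(2 * size, 5, 1, 2)
        print(f"UPCONV -> {size}")
```

Under these settings the K7/S1/P3, K3/S1/P1, K5/S1/P2 and K1/S1/P0 layers preserve the resolution, each K4/S2/P1 layer halves it, and the two UPCONV layers of the decoder restore the resolution lost in the content encoder.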

We use the Adam optimizer [kingma2014adam] with $\beta_1 = 0.5$, $\beta_2 = 0.999$, and an initial learning rate of 0.0001. The learning rate is halved every 2e5 iterations. In all experiments, we use a batch size of 1 for Edges2shoes and Faces and a batch size of 32 for Digits. We set the loss weights to = 10, = 10, = 0.1, and = 0.1. We use the domain-invariant perceptual loss with weight 0.1 in all experiments. Random mirroring is applied during training.
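The learning-rate schedule described above can be written as a simple step decay. This is a sketch that assumes the halving happens exactly at multiples of 2e5 iterations; the function and parameter names are illustrative.

```python
def learning_rate(iteration: int,
                  base_lr: float = 1e-4,
                  step: int = 200_000,
                  gamma: float = 0.5) -> float:
    """Step decay: multiply the base rate by `gamma` once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


if __name__ == "__main__":
    for it in (0, 199_999, 200_000, 400_000):
        print(it, learning_rate(it))
```

In a deep learning framework the same behaviour is usually obtained with a built-in step scheduler attached to the Adam optimizer.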

0.A.1 GMM

While the GMM supports a full covariance matrix, we simplify the problem, as is typically done in the literature. The simplified version satisfies the following properties:

  • The mean vectors are placed on the vertices of -dimensional regular simplex, so that the mean vectors are equidistant.

  • The covariance matrices are diagonal, with the same $\sigma$ on all the components. In other words, each Gaussian component is spherical, formally: $\Sigma_k = \sigma^2 I$, where $I$ is the identity matrix.
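The two properties above can be sketched as follows. This is only one possible construction: it places the k mean vectors on scaled standard basis vectors, which are the vertices of a regular simplex and are therefore pairwise equidistant, and it samples from a spherical Gaussian around a chosen mean. The function names, the `scale` parameter and the example sizes are assumptions, not values from the paper.

```python
import math
import random
from typing import List


def simplex_means(k: int, dim: int, scale: float = 1.0) -> List[List[float]]:
    """Return k equidistant mean vectors in R^dim (requires dim >= k).

    Mean i is `scale` times the i-th standard basis vector, so every pair
    of means is at distance scale * sqrt(2).
    """
    if dim < k:
        raise ValueError("dim must be at least k for this construction")
    return [[scale if j == i else 0.0 for j in range(dim)] for i in range(k)]


def sample_attribute(mean: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Draw one sample from a spherical Gaussian N(mean, sigma^2 * I)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(0)
    means = simplex_means(k=3, dim=8, scale=2.0)  # hypothetical sizes
    print(distance(means[0], means[1]), distance(means[1], means[2]))
    print(sample_attribute(means[0], sigma=1.0, rng=rng))
```

Because every component shares the same spherical covariance, sampling a domain-specific attribute vector only requires choosing the mean of the target component and adding isotropic noise.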

0.A.2 Implementation of state of the art models

For all the models except StarGAN*, we used the implementations released by the authors without any modification. StarGAN* corresponds to a StarGAN model that is conditioned on Gaussian noise in the input domain vector. We will release the code and trained model of StarGAN*.

Figure 10: Several examples of CAMs and our unsupervised attention masks for hair color translation (columns: Image, CAM, Attention).

0.A.3 Class Activation Maps for Faces

The Faces dataset is very challenging: each image has multiple attributes, and the background is very diverse and complex. For this reason, we employ an attention mask that helps the network focus on the attributes that have to be changed. However, we found experimentally that attention is hard to learn. Thus, during training, we help the network learn the unsupervised attention mask through the use of Class Activation Maps (CAMs) [zhou2016learning, selvaraju2017grad]. We fine-tune the pretrained VGG-16 network [simonyan2014very] on the CelebA dataset to do multi-label classification for the attributes selected in the domain translation. Then, the predicted one-channel CAM of the real attributes in the original input image is concatenated into the attention layer in the decoder. Although the CAMs are rather rough (see Fig. 10), they improve the unsupervised attention as expected (FID: 48.28 → 46.21). Future work is needed to extend the CAM method to multiple-attribute settings such as Faces. This would greatly improve the interpretability and efficacy of CAMs.
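For reference, the CAM formulation of [zhou2016learning] computes, for one class, a weighted sum of the last convolutional feature maps using the classifier weights of that class. The sketch below implements this sum on plain nested lists and rescales the result to [0, 1] so that it can be used as a single-channel map. The min-max rescaling and the toy inputs are assumptions made for illustration, not details taken from the paper.

```python
from typing import List

Map2D = List[List[float]]


def class_activation_map(features: List[Map2D], weights: List[float]) -> Map2D:
    """Weighted sum of K feature maps (each H x W) with K class weights."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature map is required")
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(w * fmap[y][x] for w, fmap in zip(weights, features)) for x in range(width)]
        for y in range(height)
    ]


def normalize(cam: Map2D) -> Map2D:
    """Min-max rescale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


if __name__ == "__main__":
    # Two hypothetical 2x2 feature maps and the classifier weights of one class.
    feature_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
    class_weights = [1.0, 0.5]
    print(normalize(class_activation_map(feature_maps, class_weights)))
```

In practice the resulting map would also be resized to the resolution of the layer it is concatenated with.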

Appendix 0.B Additional results

0.B.1 Edges → shoes: Two-domain translation

In this section, we present additional results for the one-to-one domain translation. Figure 11 shows a qualitative comparison of GMM-UNIT with the state of the art. We observe that while all the methods (multi-domain and not) achieve acceptable diversity, DRIT++ appears to suffer from problems of realism. As expected, StarGAN* does not generate diverse results.

Figure 11: Visual comparisons of state of the art methods on the Edges → shoes dataset. We note that BicycleGAN, MUNIT and MSGAN are one-to-one domain translation models, while StarGAN* is a multi-domain (deterministic) model. Finally, DRIT++ and GMM-UNIT are multi-modal and multi-domain methods.

0.B.2 Digits: single-attribute multi-domain translation

Figure 12 shows the qualitative comparison with the state of the art, while Table 5 shows the per-domain breakdown of the quantitative results. We observe, as expected, that StarGAN* fails to generate diverse results.

Figure 12: Visual comparisons of state of the art methods on the digits dataset. We note that StarGAN* is a multi-domain (deterministic) model, while DRIT++ and GMM-UNIT are multi-modal and multi-domain methods. Image quality is very similar to the input images.
Target Domain Metric Method
MNIST FID 85.11 122.59
LPIPS 0.002 0.001
SVHN FID 64.91 66.88
LPIPS 0.006 0.045
MNIST-M FID 57.31 77.35
LPIPS 0.010 0.127
Params 11.18M1 24.49M1 14.26M1
Table 5: Quantitative comparison on the Digits dataset.

0.B.3 Faces: multi-attribute multi-domain translation

In Table 6 we show the quantitative results on the CelebA dataset, broken down per domain. In Figure 13 and Figure 14, we show some generated images in comparison with StarGAN. Figure 15 shows more examples of manipulating images by using reference images. Figure 16 shows the possibility to do attribute interpolation inside a domain, while Figure 17 shows the interpolation between domains.

Figure 13: Comparisons on CelebA dataset. BA: Black hair, BN: blond hair, BW: Brown hair, M: Male, FM: Female, Y: Young, O: Old.
Figure 14: Comparisons on CelebA dataset. BA: Black hair, BN: blond hair, BW: Brown hair, M: Male, FM: Female, Y: Young, O: Old.
Figure 15: Examples of GMM-UNIT applied to the style transfer task. The style is here extracted from a single reference image provided by the user.
Figure 16: Examples of attribute intra-domain interpolation. (Rows, from the first to the last interpolation endpoint: Blond hair to Blond hair; Brown hair to Brown hair; Female to Female; Black hair + Old to Black hair + Old.)
Target Domain Metric Method
Black hair + Female + Young FID 46.80 47.94
LPIPS 0.001 0.016
Blond hair + Female + Young FID 63.09 71.43
LPIPS 0.003 0.017
Brown hair + Female + Young FID 45.15 47.54
LPIPS 0.003 0.017
Params 53.23M1 54.06M1 26.91M1
Table 6: Quantitative comparison on the CelebA dataset.
Figure 17: Examples of domain interpolation given an input image. (Rows, from the first to the last interpolation endpoint: Black hair to Blond hair; Blond hair+Female to Black hair+Male; Blond hair+Young to Blond hair+Old; Blond hair+Young to Brown hair+Old.)

Appendix 0.C Ablation study per domain

In Table 7 we show additional per-domain ablation results on the Digits dataset. As can be seen, we achieve the best image quality results in SVHN and MNISTM, but MNIST works better with less complexity. This could be explained by the fact that MNIST is a very simple dataset with only grayscale pixels, where the FID score might be very sensitive. In all the domains the network seems to have to reach a trade-off between quality and diversity, and this trade-off is largely due to . We note that higher diversity can be achieved especially with low-quality images, in which all the pixels can be randomly changed. Thus, the network has to achieve both high quality and diversity in high-quality images.

Target domain Model FID LPIPS
MNIST GMM-UNIT w/o 64.92 0.066
GMM-UNIT w/o  74.32 0.059
GMM-UNIT w/o  77.86  0.067
GMM-UNIT w/o disent. 72.68 0.031
GMM-UNIT 78.08 0.067
SVHN GMM-UNIT w/o 70.63 0.172
GMM-UNIT w/o 45.26 0.113
GMM-UNIT w/o 43.33 0.110
GMM-UNIT w/o disent. 45.97 0.092
GMM-UNIT 47.78 0.115
MNISTM GMM-UNIT w/o 116.64 0.162
GMM-UNIT w/o 67.01 0.189
GMM-UNIT w/o 69.91 0.169
GMM-UNIT w/o disent. 63.51 0.169
GMM-UNIT 55.44 0.191
Table 7: Full ablation study performance per domain on the Digits dataset.

Appendix 0.D Visualization of the Attribute Latent space

In Figure 18 we illustrate how three exemplar attributes (black, blond and brown hair) sampled from the GMM distribution are projected to the same regions of the latent space as the corresponding attributes extracted by the encoder . To project the attributes to a 2D space we use the t-SNE [maaten2008visualizing] algorithm with and 300 iterations. We can observe from the figure that the attributes are well separated in the space, while the extracted attributes are very close to the sampled ones. For example, the extracted black hair attribute is most similar to the sampled black hair attribute and most dissimilar to the extracted/sampled attribute of brown hair.

Figure 18: t-SNE projection of the attribute vectors in a 2D space. The point clouds refer to both extracted and sampled attributes, namely black, blond and brown hair, from GMM-UNIT. The attributes are well separated, while for each attribute the extracted vectors are similar to the sampled ones.