SDIT: Scalable and Diverse Cross-domain Image Translation

08/19/2019 ∙ by Yaxing Wang, et al. ∙ Universitat Autònoma de Barcelona

Recently, image-to-image translation research has witnessed remarkable progress. Although current approaches successfully generate diverse outputs or perform scalable image transfer, these properties have not been combined into a single method. To address this limitation, we propose SDIT: Scalable and Diverse image-to-image translation, which combines both properties in a single generator. The diversity is determined by a latent variable which is randomly sampled from a normal distribution. The scalability is obtained by conditioning the network on the domain attributes. We also exploit an attention mechanism that permits the generator to focus on the domain-specific attribute. We empirically demonstrate the performance of the proposed method on face mapping and on other datasets beyond faces.




1. Introduction

Image-to-image translation aims to build a model to map images from one domain to another. Many computer vision tasks can be interpreted as image-to-image translation, e.g. style transfer (Gatys et al., 2016), image dehazing (Zhang and Patel, 2018), colorization (Zhang et al., 2016), surface normal estimation (Eigen and Fergus, 2015), and semantic segmentation (Long et al., 2015). Face translation has always been of great interest in the context of image translation, and several methods (Perarnau et al., 2016; Choi et al., 2018a; Pumarola et al., 2018) have shown outstanding performance. Image-to-image translation can be formulated in a supervised manner when corresponding image pairs from both domains are provided, and unsupervised otherwise. In this paper, we focus on unsupervised image-to-image translation with the two-fold goal of learning a model that has both scalability and diversity (see Figure 1(a)).

Figure 1. (a) Example of diverse image translations for various attributes of our method generated by a single model. (b-e) Comparison to current unpaired image-to-image translation methods. Given four color subsets (orange, yellow, green, blue), the task is to translate images between the domains. (b) CycleGAN requires three independent generators (indicated by pink lines) which produce deterministic results. (c) StarGAN only requires a single generator but produces deterministic results. (d) MUNIT requires separate generators but is able to produce diverse results. (e) SDIT produces diverse results from a single generator.

Recently, Isola et al. (Isola et al., 2017) considered a conditional generative adversarial network to perform image mapping from input to output with paired training samples. One of the drawbacks, however, is that this method produces a deterministic output for a given input image. BicycleGAN (Zhu et al., 2017b) extended image-to-image translation to one-to-many mappings between images by training the model to reconstruct the noise used in the latent space, effectively forcing the model to use the noise in its translations. To address the same concern, Gonzalez-Garcia et al. (Gonzalez-Garcia et al., 2018) explicitly exploit the feature representation, disentangling the latent feature into shared and exclusive representations, the latter being aligned with the input noise.

The above methods, however, need paired images during the training process. For many image-to-image translation cases, obtaining abundant annotated data remains very expensive or, in some cases, even impossible. Recent approaches have therefore tried to relax the requirement of paired training images. The cyclic consistency constraint (Kim et al., 2017; Yi et al., 2017; Zhu et al., 2017a) was initially proposed for unpaired image-to-image translation. Liu et al. (Liu et al., 2017) assume a shared joint latent distribution between the encoder and the decoder, and then learn the unsupervised translation.

Nonetheless, previous methods perform a deterministic one-to-one translation and lack diversity in their outputs, as shown in Figure 1(b). For example, given the task of translating from orange (domain A) to yellow (domain B), the generator taking the orange shoes as input only synthesizes shoes with a single shade of yellow. Recently, the idea of non-deterministic outputs was extended to unpaired methods (Huang et al., 2018; Lee et al., 2018) by disentangling the latent feature space into content and style and aligning the style code with a known distribution (typically Gaussian or uniform). During inference, the model is able to generate diverse outputs by sampling different style codes from the distribution. The main drawback of these methods is that they lack scalability. As shown in Figure 1(d), the orange shoes can be translated into many possible green shoes with varying green shades. As the number of colors increases, however, the number of required domain-specific encoder-decoder pairs rises quadratically.
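To make the scaling argument concrete, the following minimal sketch counts how many independently trained translators are needed as the number of domains grows. It assumes, purely for illustration, one model per unordered pair of domains in the non-scalable case and a single label-conditioned model in the scalable case; the function names are hypothetical and not taken from the paper.

```python
def pairwise_models(num_domains: int) -> int:
    """Number of unordered domain pairs, n * (n - 1) / 2.

    Assumption: a non-scalable method trains one model per pair of domains.
    """
    return num_domains * (num_domains - 1) // 2


def single_model(num_domains: int) -> int:
    """A label-conditioned translator needs one model for any number of domains."""
    return 1


# Compare both regimes for a few domain counts.
for n in (2, 4, 11):
    print(n, pairwise_models(n), single_model(n))
```

Under this assumption the pairwise count grows quadratically with the number of domains, while the conditioned model count stays constant.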

IcGAN (Perarnau et al., 2016) initially performs face editing by combining cGAN (Mirza and Osindero, 2014) with an attribute-independent encoder, and at the inference stage conducts face mapping for given face attributes. Recently, Choi et al. (Choi et al., 2018a) proposed StarGAN, a domain-independent encoder-decoder architecture for face translation that concatenates the domain label to the input image. Unlike the aforementioned non-scalable approaches (Huang et al., 2018; Lee et al., 2018), StarGAN is able to perform scalable image-to-image translation across multiple domains (Figure 1(c)). StarGAN, however, fails to synthesize diverse translation outputs.

In this paper, we propose a compact and general architecture that allows for diversity and scalability in a single model, as shown in Figure 1(e). Our motivation is that scalability and diversity are orthogonal properties that can be independently controlled. Scalability is obtained by using the domain label to train a single multi-domain image translator, removing the need to train an encoder-decoder for each domain. Inspired by (Dumoulin et al., [n. d.]), we employ Conditional Instance Normalization (CIN) layers in the generator to introduce the latent code and generate diverse outputs. We explore the reasons behind CIN’s success (Fig. 6) and discover the following limitation: CIN affects the entirety of the latent features and could possibly modify areas that do not correspond to the specific target domain. To prevent this from happening, we include an attention mechanism that helps the model focus on domain-specific areas of the input image.

Our contributions are as follows:

  • We propose a compact and effective framework that combines both scalability and diversity in a single model. Note that current models only possess one of these desirable properties, whereas our model achieves both simultaneously.

  • We empirically demonstrate the effectiveness of the attention technique for multi-domain image-to-image translation.

  • We conduct extensive qualitative and quantitative experiments. The results show that our method is able to synthesize diverse outputs while being scalable to multiple domains.

2. Related work

Generative adversarial networks.  Typical GANs (Goodfellow et al., 2014) are composed of two modules: a generator and a discriminator. The aim of the generator is to synthesize images to fool the discriminator, while the discriminator distinguishes between fake images and real images. There have been many variants of GANs (Goodfellow et al., 2014) and they show remarkable performance on a wide variety of image-to-image translation tasks (Isola et al., 2017; Zhu et al., 2017a; Yi et al., 2017; Huang et al., 2018; Lee et al., 2018; Pumarola et al., 2018), super-resolution (Ledig et al., 2017), image compression (Rippel and Bourdev, 2017), and conditional image generation such as text to image (Zhang et al., 2017b, a; Ma et al., 2018), segmentation to image (Karras et al., 2017; Wang et al., 2018a) and domain adaptation (Gong et al., 2012; Ganin and Lempitsky, 2015; Tsai et al., 2018; Wu et al., 2018a; Zhang et al., 2019; Saito et al., 2017; Zou et al., 2018).

Conditional GANs.  Exploiting conditional image generation is an active topic in GAN research. Early methods incorporated category information (Mirza and Osindero, 2014; Odena, 2016; Odena et al., 2017; Choi et al., 2018a) or text descriptions (Reed et al., 2016; Zhang et al., 2017b; Johnson et al., 2018) into the model for image synthesis. More recently, a wide variety of ideas have been proposed and used in several tasks such as image super-resolution (Ledig et al., 2017), video prediction (Mathieu et al., 2016), and photo editing (Shu et al., 2017). Similarly, we consider image-to-image translation conditioned on an input image and the label of the target domain.

Image-to-image translation.  The goal of image-to-image translation is to learn a mapping between images of the source domain and images of the target domain. Given pairs of data samples, pix2pix (Isola et al., 2017) initially performed this mapping by using conditional GANs and relying on the real images. This model, however, fails to conduct one-to-many mappings, namely, it cannot generate diverse outputs from a single input. BicycleGAN (Zhu et al., 2017b) explicitly modeled the mapping between output and latent space, and aligned the latent distribution with a known distribution. Diverse outputs are then obtained by sampling from the latent distribution. Gonzalez-Garcia et al. (Gonzalez-Garcia et al., 2018) disentangle the latent space into disjoint elements, which allows them to successfully perform cross-domain retrieval as well as one-to-many translation. Although these methods can synthesize diverse results, the requirement of paired data limits their application. Recently, the cycle consistency loss (Kim et al., 2017; Yi et al., 2017; Zhu et al., 2017a) has been imposed on models to explicitly reconstruct the source sample after it is translated into the target domain and back, thus enabling translation using unpaired data. In addition, UNIT (Liu et al., 2017) aligns the latent space of two domains by assuming that similar domains share the same content. Although this approach shows remarkable results without paired data, it fails to produce diverse outputs. More recently, several image-to-image translation methods (Almahairi et al., 2018; Choi et al., 2018b; Pumarola et al., 2018; Wang et al., 2018b) enable diverse results through the use of noise or labels.

Diversity of image-to-image translation.  Most recently, several approaches (Chen et al., 2016; Lee et al., 2018; Huang et al., 2018; Gonzalez-Garcia et al., 2018; Jakab et al., 2018; Yu et al., 2018a; Li, 2018) disentangle factors in feature space by enforcing a latent structure or regulating the structure distribution. Exploiting this disentangled representation enables the generator to synthesize diverse outputs by controlling the style distribution. The key difference is that our method additionally performs scalable image-to-image translation while retaining diversity.

Scalability of image-to-image translation.  Scalability refers to conducting image-to-image translation across multiple domains with a single generator. MMNet (Wang et al., 2018b) uses a shared encoder and a domain-independent decoder, allowing not only style learning but also zero-pair image-to-image translation. Anoosheh et al. (Anoosheh et al., 2018) additionally consider encoder-decoder pairs for each domain as well as the techniques used in CycleGAN (Zhu et al., 2017a). IcGAN (Perarnau et al., 2016) and StarGAN (Choi et al., 2018a) condition on the domain label in the latent space and in the input, respectively. Our approach also imposes domain labels in a single generator, while simultaneously enabling the model to synthesize diverse outputs.

Attention learning.  Attention mechanisms have been successfully employed for image-to-image translation. Current approaches (Chen et al., 2018; Mejjati et al., 2018) learn an attention mask to enforce the translation to focus only on the objects of interest and preserve the background area. GANimation (Pumarola et al., 2018) uses action units to choose regions from the input images that are relevant for facial animation. These methods exploit attention mechanisms at the image level. Our method, on the other hand, learns feature-wise attention maps, which enables us to control which features are modified during translation. Therefore, our attention maps are highly effective at restricting the translation to change only domain-specific areas (e.g. forehead region when modifying the ‘bangs’ attribute).

Figure 2. Model architecture. (Left) The proposed approach is composed of two main parts: a discriminator $D$ to distinguish the generated images from the real images; and the set of the encoder $E$, the multilayer perceptron $M$, and the generator $G$, containing the attention block, residual blocks with CIN, and the transposed convolutional layers. (Right) At test time, we can generate multiple plausible translations in the desired domain using a single model.

3. Scalable and Diverse Image Translation

Our method must be able to perform multi-domain image-to-image translation. We aim to learn a model with both scalability and diversity. By scalability we refer to the property that a single model can be used to perform translations between multiple domains. By diversity we refer to the property that given a single input image, we can obtain multiple plausible output translations by sampling from a random variable.

3.1. Method Overview

Here we consider two domains: source domain $\mathcal{X}$ and target domain $\mathcal{Y}$ (it can trivially be extended to multiple domains). As illustrated in Figure 2, our framework is composed of four neural networks: encoder $E$, generator $G$, multilayer perceptron $M$, and discriminator $D$. Let $x \in \mathcal{X}$ be the input source image and $y \in \mathcal{Y}$ the target output, with corresponding labels $c_s$ for the source and $c_t$ for the target. In addition, let $z$ be the latent code, which is sampled from a Gaussian distribution.

An overview of our method is provided in Figure 2. To address the problem of scalability we introduce the target domain as a conditioning label to the encoder, $E(x, c_t)$. The diversity is introduced by the latent variable $z$, which is mapped to the input parameters of a Conditional Instance Normalization (CIN) layer (Dumoulin et al., [n. d.]) by means of the multilayer perceptron $M(z)$. The CIN learns an additive ($\beta$) and a multiplicative term ($\gamma$) for each feature layer. Both the output of the encoder $E$ and the output of the multilayer perceptron $M$ are used as input to the generator $G$, which outputs a sample $\hat{y} = G(E(x, c_t), M(z))$ of the target domain. Sampling different values of $z$ results in different outputs $\hat{y}$. The unpaired domain translation is enforced by cycle consistency (Kim et al., 2017; Yi et al., 2017; Zhu et al., 2017a): taking as input the output $\hat{y}$ and the source category $c_s$, we reconstruct the input image $x$. The encoder $E$, the multilayer perceptron $M$, and the generator $G$ are all shared.

The function of the discriminator $D$ is threefold. It produces three outputs: a real/fake prediction, a domain classification, and a reconstructed latent code. The first two represent probability distributions, while the third is a regressed code. The goal of the real/fake output is to distinguish between real samples and generated images in the target domain. The auxiliary classifier predicts the target label and allows the generator to perform domain-specific output conditioned on it. This was found to improve the quality of the conditional GAN (Odena et al., 2017). Similarly to previous methods (Chen et al., 2016; Huang et al., 2018), we reconstruct the latent input code at the discriminator output. This was found to lead to improved diversity. Note that the latent-code output is used only for generated samples, since the latent code is not defined for real images.

We briefly summarize here the differences between our method and the most similar approaches. StarGAN (Choi et al., 2018a) can also generate outputs on multiple domains, but: (1) it learns a scalable but deterministic model, while our method additionally obtains diversity via the latent code; (2) we explicitly exploit an attention mechanism to focus the generator on the object of interest. Compared with both MUNIT (Huang et al., 2018) and DRIT (Lee et al., 2018), which perform diverse image-to-image translation but without being scalable, our method: (1) employs the domain label to control the target domain, which allows image-to-image translation among multiple domains with a single generator; (2) avoids the need for domain-specific style encoders, effectively saving computational resources; (3) uses attention to avoid undesirable changes in the translation; and (4) shows experimentally that the bias of CIN is the key factor in making the generator achieve diversity, whereas the multiplicative term was found to play only a minor role.

3.2. Training Losses

The full loss function consists of several terms: the adversarial loss, which discriminates between the distribution of synthesized data and the real distribution in the target domain; the domain classification loss, which helps the model learn the specific attribute for a given target label; the latent code reconstruction loss, which regularizes the latent code to improve diversity and avoid partial mode collapse; and the image reconstruction loss, which ensures that the translated image keeps the structure of the input image.

Adversarial loss.   We employ GANs (Goodfellow et al., 2014) to distinguish the generated images from the real images. The discriminator $D$ tries to differentiate between images generated by $G$ and real images, while $G$ tries to fool the discriminator, taking the output of $E$ and the output of $M$ as input. The final loss function is optimized as a minimax game.


Domain classification loss.   In this paper, we consider Auxiliary Classifier GANs (AC-GAN) (Odena et al., 2017) to control domains. The discriminator aims to output a probability distribution over domain labels for a given input image; as a consequence, $E$, $M$, and $G$ synthesize domain-specific images. We share the discriminator model except for the last layer and optimize the triplet $\{E, M, G\}$ with the cross-entropy loss. The domain classification loss has one term for generated samples, one term for real samples, and a total that combines both. Given the source and target domain labels, these objectives minimize the classification loss so that the model explicitly generates domain-specific outputs.

Latent code reconstruction loss.   Without constraints on the latent code, the generated images suffer from partial mode collapse because the latent code is ignored. We use the discriminator to predict the latent code, which forces the network to use it for generation.


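Image reconstruction loss.   Both the adversarial loss and the classification loss fail to keep the structure of the input. To avoid this, we add an image reconstruction loss that penalizes the difference between the input image and its reconstruction after translation to the target domain and back.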


Full Objective.   The full objective function of our model is a weighted sum of the losses above, where the weights are hyper-parameters that balance the importance of each term.
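The losses of this section can be summarized in the following sketch. The notation is an assumption rather than the authors' exact formulation: $D_{src}$, $D_{cls}$, and $D_{lat}$ are hypothetical names for the three discriminator outputs, $\hat{y} = G(E(x, c_t), M(z))$ denotes the translated image, the $\ell_1$ norm is an assumed choice, and the $\lambda$ weights stand for the balancing hyper-parameters.

```latex
\begin{align}
\mathcal{L}_{GAN} &= \mathbb{E}_{x}\big[\log D_{src}(x)\big]
  + \mathbb{E}_{x, z}\big[\log\big(1 - D_{src}(\hat{y})\big)\big] \\
\mathcal{L}_{CLS}^{f} &= \mathbb{E}_{x, c_t, z}\big[-\log D_{cls}(c_t \mid \hat{y})\big] \\
\mathcal{L}_{CLS}^{r} &= \mathbb{E}_{x, c_s}\big[-\log D_{cls}(c_s \mid x)\big] \\
\mathcal{L}_{LAT} &= \mathbb{E}_{x, z}\big[\lVert z - D_{lat}(\hat{y}) \rVert_1\big] \\
\mathcal{L}_{REC} &= \mathbb{E}_{x, z}\big[\lVert x - G(E(\hat{y}, c_s), M(z)) \rVert_1\big] \\
\mathcal{L}_{D} &= -\mathcal{L}_{GAN} + \lambda_{CLS}\,\mathcal{L}_{CLS}^{r}
  + \lambda_{LAT}\,\mathcal{L}_{LAT} \\
\mathcal{L}_{E,M,G} &= \mathcal{L}_{GAN} + \lambda_{CLS}\,\mathcal{L}_{CLS}^{f}
  + \lambda_{LAT}\,\mathcal{L}_{LAT} + \lambda_{REC}\,\mathcal{L}_{REC}
\end{align}
```

Under these assumptions, $D$ minimizes $\mathcal{L}_{D}$ while $E$, $M$, and $G$ jointly minimize $\mathcal{L}_{E,M,G}$. Whether the latent term also enters the discriminator update is a guess made for this sketch.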

3.3. Attention-guided generator

The attention mechanism encourages the generator to locate the domain-specific area relevant to the target domain label. Let $e$ be the output of the encoder. We propose to localize the CIN operation by introducing an attention mechanism, since only part of the encoder output should be changed to obtain the desired diversity. We feed $e$ into two parallel residual branches. CIN is applied in the first branch, conditioned on the output of the multilayer perceptron $M$. The second branch estimates the attention map. We then combine the original encoder output and the CIN output, using the attention map as a soft mask.
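The combination step can be illustrated with a minimal sketch. It assumes that the attention value is a sigmoid of a raw score, that values near 1 select the CIN-modified feature, and that values near 0 keep the encoder feature; this convention, the flat-list representation, and the function names are illustrative assumptions, not the paper's implementation.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic function, mapping a raw attention score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def attention_blend(encoder_feat, cin_feat, attention_logits):
    """Element-wise soft mask between the encoder features and the CIN features.

    Assumed convention: attention near 1 keeps the CIN-modified value,
    attention near 0 keeps the original encoder value.
    """
    blended = []
    for e, h, logit in zip(encoder_feat, cin_feat, attention_logits):
        a = sigmoid(logit)
        blended.append(a * h + (1.0 - a) * e)
    return blended


# Three positions: attention off, half on, and fully on.
out = attention_blend([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [-50.0, 0.0, 50.0])
print(out)
```

Positions with low attention pass the encoder feature through almost unchanged, which is how the mask can restrict CIN to domain-specific areas.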


In (Pumarola et al., 2018), an attention loss regularizes the attention maps, since they quickly saturate to 1. In contrast, we employ the attention in the bottleneck features, and show experimentally that the attention masks can be easily learned. Operating in the bottleneck makes the task easier because of its lower resolution, and avoids the need to tune an attention hyperparameter. Finally, our attention mechanism does not add any new terms to the overall optimization loss.


4. Experimental setup

Training setting.  Our model is composed of four sub-networks: encoder $E$, multilayer perceptron $M$, generator $G$, and discriminator $D$. The encoder contains 3 convolutional layers and 6 blocks. Each convolutional layer uses stride 2, except for the first one, which uses stride 1, and each block contains two convolutional layers with stride 1. $M$ consists of two fully connected layers with 256 and 4096 units. The generator $G$ comprises ResBlock layers, attention layers, and two fractionally strided convolutional layers. The ResBlock consists of 6 residual blocks, as in the encoder $E$, but including CIN layers. The CIN layers take the output of $E$ and the output of $M$ as input. In addition to six blocks like those of the CIN branch, the attention layers use additional convolutional layers with sigmoid activations on top. For the discriminator $D$, we use six convolutional layers with stride 2, followed by three parallel sub-networks, each of them containing one convolutional layer with stride 1, except for the branch that outputs the latent code, which uses an additional fully connected layer from 32 units to 8. Note that $M$ adds around 1M parameters to the architecture.
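The figure of roughly 1M parameters for the multilayer perceptron can be checked with a short calculation. The sketch below assumes a latent code of dimension 8 (inferred from the 32-to-8 fully connected layer in the discriminator branch, not stated explicitly) and fully connected layers with biases; the helper name is hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


LATENT_DIM = 8  # assumption: size of the latent code z

# Two fully connected layers: latent -> 256 -> 4096 units.
mlp_params = linear_params(LATENT_DIM, 256) + linear_params(256, 4096)
print(mlp_params)  # 1054976
```

Almost all of these parameters come from the second layer (256 x 4096 weights), so the estimate is not very sensitive to the assumed latent dimension.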

All models are implemented in PyTorch (Paszke et al., 2017) and the code is released. We randomly initialize the weights following a Gaussian distribution, and optimize the model using Adam (Kingma and Ba, 2014) with batch size 16 and 4 for face and non-face datasets, respectively. The learning rate is 0.0001, and the same hyper-parameters are used in all experiments. The latent code is sampled from Gaussian noise with zero mean and a standard deviation of 1.

4.1. Datasets

We consider several datasets to evaluate our models. To verify the generality of our method, the datasets were chosen to cover a variety of cases, including faces (CelebA), objects (Color), and scenes (Artworks).

CelebA (Liu et al., 2015). The CelebFaces Attributes dataset contains 202,599 celebrity face images with 40 attribute labels per face. To preserve the face ratio, we crop the face region and then resize it. We leave out 2000 random images for testing and train on the rest.

Color dataset (Yu et al., 2018b). We use the dataset collected by Yu et al. (Yu et al., 2018b), which consists of 11 color labels, each category containing 1000 images. To compare easily with the non-scalable baselines, which need to train one independent model for each domain pair, we use only four colors (green, yellow, blue, orange). All images are resized to a common resolution. We use 3200 images for the training set and 800 images for the test set.

Artworks (Zhu et al., 2017a). We also illustrate SDIT in an artwork setting (Zhu et al., 2017a). This includes real images (photo) and three artistic styles (Monet, Ukiyo-e, and Cezanne). The training set contains 3000 (photo), 700 (Ukiyo-e), 500 (Cezanne) and 1000 (Monet) images, while the test set contains 300 (photo), 100 (Ukiyo-e), 100 (Cezanne) and 200 (Monet) images. All images are resized to a common resolution.

4.2. Evaluation Metrics

To validate our approach, we consider the three following metrics.

LPIPS. In this paper, LPIPS (Zhang et al., 2018) is used to compute the similarity of pairs of images from the same attribute. LPIPS takes larger values if the generator has more diversity. In our setting, we generate 10 samples given an input image via different random codes.
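The diversity measurement can be sketched as an average distance over all pairs of samples generated from one input. The pairwise averaging scheme is an assumption about the protocol, and `toy_distance` is a hypothetical stand-in for a perceptual metric such as LPIPS, used here only so the example is self-contained.

```python
from itertools import combinations


def mean_pairwise_distance(samples, distance):
    """Average distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def toy_distance(a, b):
    """Hypothetical stand-in metric: mean absolute difference (not LPIPS)."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Ten fake "outputs" for one input image, each a tiny feature vector.
samples = [[0.1 * k, 0.2 * k] for k in range(10)]
num_pairs = len(list(combinations(samples, 2)))
diversity = mean_pairwise_distance(samples, toy_distance)
print(num_pairs, diversity)
```

With 10 samples per input there are 45 unordered pairs, and a generator that ignores the latent code would score zero under any such metric.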

ID distance. The key point of face mapping is to preserve the identity of the input, since an identity change is unacceptable for this task. To measure whether two images depict the same identity, we consider ID distance (Wang et al., 2019), which represents the difference in identity between pairs of input and translated faces. More concretely, given a pair of input and output faces, we extract the identity features represented by the VGGFace (Parkhi et al., 2015) network, and compute the distance between these features. VGGFace is trained on a large face dataset and is robust to appearance changes (e.g. illumination, age, expression, etc.). Therefore, two images of the same person should have a very small value. We only use this evaluation metric for CelebA. We use all 2000 test images as input and generate 10 output images, which in total amounts to 20,000 pairs.
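A minimal sketch of this evaluation follows. The text does not name the distance applied to the identity features, so the Euclidean distance used here is an assumption, and the short lists stand in for feature vectors from a face recognition network; all names are hypothetical.

```python
import math


def id_distance(feat_in, feat_out):
    """Euclidean distance between two identity feature vectors (assumed metric)."""
    return math.dist(feat_in, feat_out)


def mean_id_distance(pairs):
    """Average ID distance over (input feature, output feature) pairs."""
    return sum(id_distance(a, b) for a, b in pairs) / len(pairs)


NUM_TEST_IMAGES = 2000
SAMPLES_PER_IMAGE = 10
num_pairs = NUM_TEST_IMAGES * SAMPLES_PER_IMAGE  # input/output pairs evaluated

# Toy features: the first pair is identical, the second differs.
toy_pairs = [([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [3.0, 4.0])]
print(num_pairs, mean_id_distance(toy_pairs))
```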

Figure 3. Ablation study of different variants of our method. We show results for the face task of adding ‘bangs’. We display three random outputs for each variant of the method.
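| Method | Atten | CIN | Latent code rec. | ID Distance | LPIPS |
|---|---|---|---|---|---|
| SDIT w/o CIN (Atten) | Y | N | N | 0.061 | 0.408 |
| SDIT w/o Atten (CIN only) | N | Y | N | 0.063 | 0.409 |
| SDIT w/o Atten (CIN + latent code rec.) | N | Y | Y | 0.070 | 0.432 |
| SDIT (w/o latent code rec.) | Y | Y | N | 0.063 | 0.412 |
| SDIT | Y | Y | Y | 0.060 | 0.424 |

Table 1. ID distance (lower, better) / LPIPS (higher, better) for different variants of our method. Atten: attention, Y: yes, N: no.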
Figure 4. Qualitative comparison to the baselines. The input face image is at the bottom left and the remaining columns show the attribute-specific mapped images. The first two rows show the translated results of IcGAN (Perarnau et al., 2016) and StarGAN (Choi et al., 2018a), respectively, while the remaining rows are from the proposed method.

Reverse classification. One of the methods to evaluate conditional image-to-image translation is to train a reference classifier on real images and test it on generated images (Wang et al., 2018c; Wu et al., 2018b). The reference classifier, however, fails to evaluate diversity, since it may still report a high accuracy even when the generator encounters mode collapse for a specific domain, as shown in the third column of Figure 3. Following (Shmelkov et al., 2018; Wu et al., 2018b), we use the reverse classifier, which is trained using translated images for each target domain and evaluated on real images for which we know the label. Lower classification errors indicate more realistic and diverse translated images.
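The reverse classification protocol can be sketched as follows: fit a classifier on translated samples labelled with their target domain, then measure its error on real samples. The nearest-centroid classifier and the two-dimensional toy features are illustrative assumptions; the text does not specify the classifier used in the experiments.

```python
def fit_centroids(features, labels):
    """Train a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, f):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])),
    )


def reverse_classification_error(gen_feats, gen_labels, real_feats, real_labels):
    """Train on generated samples, evaluate on real samples with known labels."""
    centroids = fit_centroids(gen_feats, gen_labels)
    wrong = sum(predict(centroids, f) != y for f, y in zip(real_feats, real_labels))
    return wrong / len(real_labels)


# Toy features for two target domains.
gen_feats = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
gen_labels = ["blond", "blond", "brown", "brown"]
real_feats = [[0.1, 0.1], [0.9, 0.9]]
real_labels = ["blond", "brown"]
error = reverse_classification_error(gen_feats, gen_labels, real_feats, real_labels)
print(error)
```

If the translated samples of a domain collapse to a narrow mode, a classifier trained on them generalizes poorly to the variety of real images, which is why this error also reflects diversity.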

5. Experimental Results

In Section 5.1 we introduce several baselines against which we compare our model, as well as multiple variants of our model. Next, we evaluate the model on faces in Section 5.2. Finally, in Section 5.3 and Section 5.4, we analyze the generality of the model to color translation and scene translation.

5.1. Baselines and variants






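Attribute columns (labels partially recovered): Wearing hat, Pale skin, Brown hair, Blond hair, Mouth open. Each cell reports ID distance / LPIPS, and the final column reports the mean.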


StarGAN (Choi et al., 2018a) 0.067/0.427 0.065/0.428 0.068/0.428 0.061/0.427 0.075/0.427 0.064/0.421 0.060/0.418 0.067/0.426 0.066/0.435 0.059/0.429 0.065/0.427
IcGAN (Perarnau et al., 2016) 0.118/0.430 0.097/0.431 0.094/0.430 0.121/0.430 0.102/0.429 0.10/0.430 0.127/0.424 0.113/0.421 0.097/0.425 0.116/0.438 0.108/0.432
SDIT 0.068/0.456 0.065/0.447 0.069/0.444 0.061/0.449 0.076/0.458 0.065/0.439 0.058/0.443 0.067/0.442 0.066/0.458 0.058/0.457 0.065/0.451
Real data -/0.486 -/0.483 -/0.484 -/0.480 -/0.489 -/0.479 -/0.492 -/0.490 -/0.492 -/0.489 -/0.486
Table 2. ID distance (lower, better) / LPIPS (higher, better) on CelebA dataset.

We compare our method with the following baselines. For all baselines, we use the authors’ original implementations and recommended hyperparameters. We also consider different configurations of our proposed SDIT approach. In particular, we study variants with and without CIN, attention, and latent code reconstruction.

CycleGAN (Zhu et al., 2017a). CycleGAN is composed of two pairs of domain-specific encoders and decoders. The full objective is optimized with an adversarial loss and a cycle consistency loss.

MUNIT (Huang et al., 2018). MUNIT disentangles the latent distribution into the content space which is shared between two domains, and the style space which is domain-specific and aligned with a Gaussian distribution. At test time, MUNIT takes as input the source image and different style codes to achieve diverse outputs.

IcGAN (Perarnau et al., 2016). IcGAN explicitly maps the input face into a latent feature, followed by a decoder which is conditioned on the latent feature and a target face attribute. In addition, the face attribute can be explicitly reconstructed by an inverse encoder.

StarGAN (Choi et al., 2018a). StarGAN shares the encoders and decoders for all domains. The full model is trained by optimizing the adversarial loss, the reconstruction loss, and the cross-entropy loss, which ensures that the input image is translated into the target domain.

5.2. Face translation

We first conduct an experiment on the CelebA (Liu et al., 2015) dataset to compare against ablations of our full model. Next, we compare SDIT to the baselines. For this case, we consider IcGAN and StarGAN, both of which show outstanding results for face synthesis.

Ablation study.   We performed an ablation study comparing several variants of SDIT in terms of model diversity. We consider five attributes, namely bangs, blond hair, brown hair, young, and male. Figure 3 shows the translated images obtained with different variants of our method. As expected, SDIT with only attention (second column of Figure 3) fails to synthesize diverse outputs, since the model lacks the additional factors (e.g. noise) to control this. Both the third and fourth columns show that adding CIN to our method without attention generates diverse images. Their quality, however, is unsatisfactory and the model suffers from partial mode collapse, since CIN operates on the entire image, rather than being localized by the attention mechanism to the desired area (e.g. the bangs). Combining both CIN and attention but without the latent code reconstruction loss leads to little diversity, as shown in the fifth column. Finally, our full model (last column) achieves the best results in terms of quality and diversity.

For quantitative evaluation, we report the results in terms of ID distance and LPIPS. As shown in Table 1, the SDIT models without CIN or without the latent code reconstruction loss generate less diverse outputs according to their LPIPS scores. Using the latent code reconstruction loss without attention improves diversity: this variant has the highest LPIPS (0.432), but this could be because it adds unwanted diversity (e.g. the red lips in the fourth column of Figure 3), which may also explain its higher ID distance (0.070). Combining attention with the latent code reconstruction loss (i.e. the full SDIT model) yields better targeted diversity, as reported in the last row of Table 1. The preservation of identity is crucial for the facial attribute transfer task, and thus we keep both attention and the reconstruction loss in the following sections.

Figure 5. Generated images and learned attention maps for three input images. For each of them we present multi-domain outputs and attribute-specific attention.
Figure 6. Ablation study on CIN. We compare three cases: CIN where only the multiplicative term $\gamma$ is learnable; CIN where only the additive term $\beta$ is learnable; and CIN where both $\gamma$ and $\beta$ are learnable.

Attention.   Figure 5 shows the attention maps for several translations from the face dataset. We note that our method explicitly learns the attribute-specific attention for a given face image (e.g. eyeglasses), and generates the corresponding outputs. In this way, attention makes it possible to modify only attribute-specific areas of the input image. This is a key factor in restricting the effect of the CIN, which otherwise would globally process the entire feature representation.

CIN learning.   We explain here how CIN contributes to the diversity of the generator. In this experiment, we only consider CIN, without attention or latent code reconstruction. The operation performed by CIN on a feature $e$ is given by:

$$\mathrm{CIN}(e; z) = \gamma \left( \frac{e - \mu(e)}{\sigma(e)} \right) + \beta,$$

where $e$ and $z$ are the output of the encoder $E$ and the latent code, respectively; $\gamma$ and $\beta$ are affine parameters learned from $z$; and $\mu(e)$ and $\sigma(e)$ are the mean and standard deviation. As shown in the second column of Figure 6, only learning $\gamma$ fails to output diverse images, while only learning $\beta$ already generates diverse results (third column of Figure 6), clearly indicating that $\beta$ is the key factor for diversity. Updating both parameters obtains a similar performance in this task. However, $z$ could be ignored by the network. Therefore we introduced the latent code reconstruction loss, Eq. 6, which helps to avoid this.
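The CIN operation described above can be sketched for a single feature channel in plain Python. This is a simplified illustration under stated assumptions: the channel is a flat list of activations, the scale and shift are passed in directly as scalars (in the model they would be predicted from the latent code), and a small `eps` is added for numerical stability.

```python
import math


def conditional_instance_norm(feature, gamma, beta, eps=1e-5):
    """Normalize one channel to zero mean and unit variance, then scale and shift.

    gamma: multiplicative term, beta: additive term (both assumed to come
    from a mapping of the latent code; passed in directly here).
    """
    n = len(feature)
    mean = sum(feature) / n
    var = sum((v - mean) ** 2 for v in feature) / n
    std = math.sqrt(var + eps)
    return [gamma * (v - mean) / std + beta for v in feature]


feature = [1.0, 2.0, 3.0, 4.0]
# Two different shifts, as if produced from two different latent codes.
out_a = conditional_instance_norm(feature, gamma=1.0, beta=0.0)
out_b = conditional_instance_norm(feature, gamma=1.0, beta=0.5)
print(sum(out_a) / len(out_a), sum(out_b) / len(out_b))
```

After normalization the channel mean equals the additive term, so changing only that term already shifts the whole feature map, which is consistent with the observation that it drives most of the diversity.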

Comparison against baselines.   Figure 4 shows the comparison to the baselines on test data. We consider ten attributes: bangs, blond hair, brown hair, young, male, mouth slightly open, smiling, pale skin, wearing hat, and eyeglasses. Although both IcGAN and StarGAN are able to perform image-to-image translation to each domain, they fail to synthesize diverse outputs. Moreover, the performance of IcGAN is unsatisfactory and it fails to preserve personal identity. Our method not only enables the generation of realistic and diverse outputs, but also allows scalable image-to-image translation. Note that both StarGAN and our method use a single model. The visualization shows that scalability and diversity can be successfully integrated in a single model without conflict. Take adding bangs as an example translation: bangs generated with different directions do not harm the classification performance or the adversarial learning, and may in fact help the adversarial loss, since the CIN layer slightly reduces the compactness of the network, which increases the freedom of the generator.

As we can see in Table 2, our method obtains the best scores in both LPIPS and ID distance. In the case of LPIPS, the mean value of our method is 0.451, while IcGAN and StarGAN achieve 0.432 and 0.427 respectively. This clearly indicates that SDIT can successfully generate multimodal outputs using a single model. Moreover, the low ID distance indicates that SDIT effectively preserves the identity, achieving performance competitive with StarGAN. Note that here we do not compare to CycleGAN and MUNIT because these methods require a separate generator to be trained for each pair of domains. This is infeasible for this task, because each attribute combination would require a different generator.

Figure 7. Examples of scalable and diverse inference of multi-domain translations on (a) the color dataset and (b) the artworks dataset. In both cases, the first column is the input, the next three show results for CycleGAN (Zhu et al., 2017a), IcGAN (Perarnau et al., 2016), and StarGAN (Choi et al., 2018a), respectively, followed by three samples from MUNIT (Huang et al., 2018) in the next three columns and three samples from SDIT in the last three. Each row indicates a different domain.

5.3. Object translation

The experiments in the previous section were conducted on a face dataset, in which all images have a relatively similar content and structure (a face on a background). Here we consider the color object dataset to show that SDIT can be applied to datasets that lack a common structure. This dataset contains a wide range of different objects which greatly vary in shape, scale, and complexity. This makes the translation task more challenging.

Qualitative results.   Figure 7(a) compares image-to-image translations obtained with CycleGAN (Zhu et al., 2017a), IcGAN (Perarnau et al., 2016), StarGAN (Choi et al., 2018a), MUNIT (Huang et al., 2018) and the proposed method. We can see how SDIT clearly generates highly realistic and attribute-specific bags with different color shades, which is comparable to the results of MUNIT. Other baselines, however, only generate one color shade. The main advantage of SDIT is the scalability, as SDIT explicitly synthesizes the target color image (yellow, green, or blue) using a single generator.

Quantitative results.   The qualitative observations above are validated here by quantitative evaluations. Table 3 compares the results of SDIT to the baseline methods. Our method outperforms the baseline methods on LPIPS despite only using a single model. For the classification accuracy, CycleGAN, IcGAN and StarGAN produce lower scores, since they are not able to generate diverse outputs for a given test sample. MUNIT and SDIT have similar performance. However, for both CycleGAN and MUNIT, training all pairwise translations would, in the case of $N$ domains, require $N(N-1)/2$ generators. Since we consider $N=4$ here, we have trained a total of 6 generators for CycleGAN and MUNIT. The advantage of SDIT with respect to these non-scalable models would be even more evident for an increased number of domains.
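The scaling argument can be checked with a few lines of arithmetic. The sketch below assumes one model per unordered pair of domains, i.e. $N(N-1)/2$ models, which matches the 6 generators reported for the four color domains; the function name is hypothetical.

```python
def pairwise_generators(num_domains):
    """Models needed when one is trained per unordered pair of domains."""
    return num_domains * (num_domains - 1) // 2


# A single conditional generator needs one model regardless of domain count.
for n in (4, 10):
    print(f"{n} domains: {pairwise_generators(n)} pairwise models vs 1 single model")
```

The pairwise count grows quadratically with the number of domains, while a single conditional generator stays at one model.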

Method Yellow Blue Green Orange Mean Num E/G
CycleGAN 93.4/0.599 95.1/0.601 93.4/0.584 92.3/0.587 93.5/0.592 6/6
IcGAN 92.2/0.581 93.5/0.592 92.8/0.579 92.1/0.589 92.6/0.585 1/1
StarGAN 95.9/0.591 95.3/0.602 96.0/0.590 94.2/0.584 95.3/0.591 1/1
MUNIT 97.3/0.607 97.1/0.603 97.2/0.599 96.8/0.621 97.2/0.608 6/6
SDIT 97.6/0.610 96.6/0.607 97.3/0.604 97.1/0.627 97.1/0.612 1/1
Real image 98.5/0.652 98.6/0.652 97.8/0.653 98.8/0.652 98.4/0.652 -/-
Table 3. Reverse classification accuracy (%) and LPIPS on the color dataset, reported in each cell as accuracy/LPIPS. For both metrics, the higher the better.

5.4. Scene translation

Finally, we train our model on the photo and artworks dataset (Zhu et al., 2017a). Unlike the model used for faces and color objects, here we consider the variant of our model without attention. This difference is due to the fact that the previous datasets had a foreground (object) that needed to be changed and a fixed background, whereas in the scene case we need the generator to learn a global image translation instead of a local one, and thus the background must also be changed.

Figure 7(b) shows several representative examples of the different methods. The conclusions are similar to previous experiments: SDIT maps the input (photo) to other domains with diversity while using a single model. Table 4 also confirms this, showing how the proposed method achieves excellent scores with only one scalable model.

Method Photo Cezanne Ukiyoe Monet Mean Num E/G
CycleGAN 52.8/0.684 57.4/0.654 56.1/0.674 60.9/0.648 56.8/0.665 6/6
IcGAN 50.9/0.697 56.8/0.663 55.1/0.677 59.7/0.651 55.6/0.671 1/1
StarGAN 60.1/0.694 61.5/0.667 61.3/0.689 62.7/0.663 61.3/0.678 1/1
MUNIT 66.2/0.763 67.9/0.784 67.2/0.791 63.9/0.778 66.3/0.779 6/6
SDIT 65.6/0.816 63.4/0.806 65.3/0.829 66.4/0.802 65.1/0.828 1/1
Real image 70.2/0.856 72.4/0.874 69.9/0.884 71.7/0.864 71.1/0.869 -/-
Table 4. Reverse classification accuracy (%) and LPIPS on the artworks dataset, reported in each cell as accuracy/LPIPS. For both metrics, the higher the better.

6. Conclusion

We have introduced SDIT to perform image-to-image translation with scalability and diversity using a simple and compact network. The key challenge lies in controlling the two functions separately without conflict. We achieve scalability by conditioning the encoder with the target domain label, and diversity by applying conditional instance normalization in the bottleneck. In addition, the use of attention on the latent representation further improves the performance of image translation, allowing the model to focus mainly on domain-specific areas instead of unrelated ones. The model has limited applicability for domains with large variations (for example, faces and paintings in a single model) and works better when the domains have characteristics in common.

Acknowledgements.   Y. Wang acknowledges the Chinese Scholarship Council (CSC) grant No.201507040048. L. Herranz acknowledges the European Union research and innovation program under the Marie Skłodowska-Curie grant agreement No. 6655919. This work was supported by TIN2016-79717-R, and the CHISTERA project M2CR (PCIN-2015-251) of the Spanish Ministry, the CERCA Program of the Generalitat de Catalunya. We also acknowledge the generous GPU support from NVIDIA.


  • Almahairi et al. (2018) Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. 2018. Augmented cyclegan: Learning many-to-many mappings from unpaired data. International Conference on Machine Learning (2018).
  • Anoosheh et al. (2018) Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. 2018. ComboGAN: Unrestrained Scalability for Image Domain Translation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (Jun 2018).
  • Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172–2180.
  • Chen et al. (2018) Xinyuan Chen, Chang Xu, Xiaokang Yang, and Dacheng Tao. 2018. Attention-GAN for object transfiguration in wild images. In Proceedings of the European Conference on Computer Vision (ECCV). 164–180.
  • Choi et al. (2018a) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018a. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Choi et al. (2018b) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018b. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Dumoulin et al. ([n. d.]) Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. [n. d.]. A learned representation for artistic style. ([n. d.]).
  • Eigen and Fergus (2015) David Eigen and Rob Fergus. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the International Conference on Computer Vision. 2650–2658.
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning. 1180–1189.
  • Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
  • Gong et al. (2012) Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2066–2073.
  • Gonzalez-Garcia et al. (2018) Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. 2018. Image-to-image translation for cross-domain disentanglement. In Advances in Neural Information Processing Systems. 1294–1305.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
  • Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision. 172–189.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).
  • Jakab et al. (2018) Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. 2018. Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems. 4020–4031.
  • Johnson et al. (2018) Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1219–1228.
  • Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
  • Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. International Conference on Machine Learning (2017).
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations (2014).
  • Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4681–4690.
  • Lee et al. (2018) Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. 2018. Diverse Image-to-Image Translation via Disentangled Representations. In Proceedings of the European Conference on Computer Vision.
  • Li (2018) Jerry Li. 2018. Twin-GAN–Unpaired Cross-Domain Image Translation with Weight-Sharing GANs. arXiv preprint arXiv:1809.00946 (2018).
  • Liu et al. (2017) Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised Image-to-Image Translation Networks. Advances in Neural Information Processing Systems (2017).
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
  • Ma et al. (2018) Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. 2018. DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5657–5666.
  • Mathieu et al. (2016) Michael Mathieu, Camille Couprie, and Yann LeCun. 2016. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (2016).
  • Mejjati et al. (2018) Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. 2018. Unsupervised Attention-guided Image-to-Image Translation. In Advances in Neural Information Processing Systems. 3697–3707.
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
  • Odena (2016) Augustus Odena. 2016. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583 (2016).
  • Odena et al. (2017) Augustus Odena, Christopher Olah, and Jonathon Shlens. 2017. Conditional image synthesis with auxiliary classifier gans. In International Conference on Machine Learning. JMLR. org, 2642–2651.
  • Parkhi et al. (2015) O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep Face Recognition. In Proceedings of the British Machine Vision Conference.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
  • Perarnau et al. (2016) Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Álvarez. 2016. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355 (2016).
  • Pumarola et al. (2018) Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision. 818–833.
  • Reed et al. (2016) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. International Conference on Machine Learning (2016).
  • Rippel and Bourdev (2017) Oren Rippel and Lubomir Bourdev. 2017. Real-time adaptive image compression. In International Conference on Machine Learning. JMLR. org, 2922–2930.
  • Saito et al. (2017) Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Asymmetric tri-training for unsupervised domain adaptation. International Conference on Machine Learning (2017).
  • Shmelkov et al. (2018) Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. 2018. How good is my GAN?. In Proceedings of the European Conference on Computer Vision (ECCV). 213–229.
  • Shu et al. (2017) Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. 2017. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5541–5550.
  • Tsai et al. (2018) Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. 2018. Learning to adapt structured output space for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018).
  • Wang et al. (2018a) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8798–8807.
  • Wang et al. (2019) Yaxing Wang, Abel Gonzalez-Garcia, Joost van de Weijer, and Luis Herranz. 2019. Controlling biases and diversity in diverse image-to-image translation. arXiv preprint arXiv:1907.09754 (2019).
  • Wang et al. (2018b) Yaxing Wang, Joost van de Weijer, and Luis Herranz. 2018b. Mix and match networks: encoder-decoder alignment for zero-pair image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5467–5476.
  • Wang et al. (2018c) Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. 2018c. Transferring GANs: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV). 218–234.
  • Wu et al. (2018b) Chenshen Wu, Luis Herranz, Xialei Liu, Joost van de Weijer, Bogdan Raducanu, et al. 2018b. Memory Replay GANs: Learning to Generate New Categories without Forgetting. In Advances in Neural Information Processing Systems. 5966–5976.
  • Wu et al. (2018a) Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gökhan Uzunbas, Tom Goldstein, Ser Nam Lim, and Larry S Davis. 2018a. DCAN: Dual Channel-wise Alignment Networks for Unsupervised Scene Adaptation. In Proceedings of the European Conference on Computer Vision.
  • Yi et al. (2017) Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the International Conference on Computer Vision.
  • Yu et al. (2018b) Lu Yu, Yongmei Cheng, and Joost van de Weijer. 2018b. Weakly Supervised Domain-Specific Color Naming Based on Attention. arXiv preprint arXiv:1805.04385 (2018).
  • Yu et al. (2018a) Xiaoming Yu, Xing Cai, Zhenqiang Ying, Thomas Li, and Ge Li. 2018a. SingleGAN: Image-to-Image Translation by a Single-Generator Network using Multiple Generative Adversarial Learning. In Proceedings of the Asian Conference on Computer Vision.
  • Zhang and Patel (2018) He Zhang and Vishal M Patel. 2018. Densely Connected Pyramid Dehazing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhang et al. (2017a) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. 2017a. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
  • Zhang et al. (2017b) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017b. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the International Conference on Computer Vision. 5907–5915.
  • Zhang et al. (2019) Lichao Zhang, Abel Gonzalez-Garcia, Joost van de Weijer, Martin Danelljan, and Fahad Shahbaz Khan. 2019. Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing 28, 4 (2019), 1837–1850.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In Proceedings of the European Conference on Computer Vision. Springer, 649–666.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.
  • Zhu et al. (2017a) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017a. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision.
  • Zhu et al. (2017b) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. 2017b. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems. 465–476.
  • Zou et al. (2018) Yang Zou, Zhiding Yu, B.V.K. Vijaya Kumar, and Jinsong Wang. 2018. Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training. In Proceedings of the European Conference on Computer Vision.