Show, Attend and Translate: Unpaired Multi-Domain Image-to-Image Translation with Visual Attention

by Honglun Zhang, et al.
Shanghai Jiao Tong University

Recently, unpaired multi-domain image-to-image translation has attracted great interest and made remarkable progress, with a label vector used to indicate multi-domain information. In this paper, we propose SAT (Show, Attend and Translate), a unified and explainable generative adversarial network equipped with visual attention that can perform unpaired image-to-image translation for multiple domains. By introducing an action vector, we treat the original translation tasks as problems of arithmetic addition and subtraction. Visual attention is applied to guarantee that only the regions relevant to the target domains are translated. Extensive experiments on a facial attribute dataset demonstrate the superiority of our approach, and the generated attention masks help explain what SAT attends to when translating images.






1 Introduction

Recently, image-to-image translation has attracted great interest and made remarkable progress with the rise of generative adversarial networks (GANs) [3]. It aims to change a certain aspect of a given image in a desired manner and covers a wide variety of applications, ranging from changing face attributes like hair color and gender [1] and reconstructing street scenes from semantic label maps [6] to transforming realistic photos into artworks [26].

A domain refers to a group of images that share some latent semantic features, which are denoted as domain labels. The values of different domain labels can be either binary, like male and female for gender, or categorical, such as black, blond and brown for hair color. Based on the above definition, the scenarios and applications of image-to-image translation for two domains are extremely varied, interesting and creative [26, 6, 7, 11].

A more complicated and useful challenge, however, is multi-domain image-to-image translation, which is supposed to transform a given image to several target domains with high quality. There are a few benchmark image datasets available with more than two labels. For example, the CelebA [12] dataset contains about 200K celebrity face images, each annotated with 40 binary labels describing facial attributes like hair color, gender and age. Inspired by [26], [1] utilizes a label vector to convey the information of multiple domains, and applies a cycle consistent loss to guarantee the preservation of domain-unrelated contents. However, there is little other literature dedicated to this topic, which leaves a lot of room for improvement.

In this paper, we propose the model SAT (Show, Attend and Translate), a unified and explainable generative adversarial network that achieves unpaired multi-domain image-to-image translation with a single generator. We explore different ways of combining the given images with the multi-domain information, either in the raw or the latent phase, and compare their influences on the translation results. Based on the label vector, we propose the action vector, a more intuitive and understandable representation that converts the original translation tasks into problems of arithmetic addition and subtraction.

We utilize visual attention [20] to capture the correlations between the given images and the target domains, so that the domain-unrelated regions are preserved. We conduct extensive experiments and the results demonstrate the superiority of our approach. Visualizations of the generated attention masks also help explain what SAT attends to when translating images.

Our contributions are threefold:

  • We propose SAT, a unified and explainable generative adversarial network for unpaired multi-domain image-to-image translation.

  • SAT utilizes the action vector to convey the information of target domains and visual attention to determine which regions to focus on when translating images.

  • Both qualitative and quantitative experiments on a facial attribute dataset demonstrate the effectiveness of our approach.

Figure 1: Overall architecture of Show, Attend and Translate, which consists of a generator G and a discriminator D. (a) G takes as input both the real source image and the target action vector to synthesize the fake target image. (b) G tries to obtain the reconstructed image based on the fake target image and the source action vector, where the Cycle Consistent Loss is imposed on the real source image and the reconstructed image. (c) G accepts the real source image and a zero action vector to produce the fake source image, which should be exactly the same as the real source image, as constrained by the Identity Reconstruction Loss. (d) D learns to distinguish the fake images from the real ones and to infer the most appropriate labels for classification.

2 Related Work

Generative Adversarial Nets. Generative Adversarial Nets (GANs) [3] are a powerful method for training generative models of complicated data and have proven effective in a wide variety of applications, including image generation [17, 24, 4], image-to-image translation [6, 26, 1], image super-resolution [9] and so on. Typically a GAN model consists of a Generator (G) and a Discriminator (D) playing a two-player game, where G tries to synthesize fake samples from random noise following a prior distribution, while D learns to distinguish those from real ones. The two players compete with each other and finally reach a Nash equilibrium, where the generator is able to produce indistinguishable fake samples of high quality.

Auxiliary Classifier GANs. Several works are devoted to controlling certain details of the generated images by introducing additional supervision, which can be a multi-hot label vector indicating the existence of some target attributes [13, 14], or a textual sentence describing the desired content to generate [24, 25, 21]. The Auxiliary Classifier GAN (ACGAN) [14] belongs to the former group, and the label vector conveys semantic implications such as gender and hair color in a facial synthesis task. D is also enhanced with an auxiliary classifier that learns to infer the most appropriate label for any real or fake sample. Based on the label vector, we further propose the action vector, which is more intuitive and explainable for image-to-image translation.

Image-to-Image Translation. There is a large body of literature dedicated to image-to-image translation with impressive progress. For example, pix2pix [6] proposes a unified architecture for paired image-to-image translation based on cGAN [13] and an L1 reconstruction loss. To alleviate the cost of obtaining paired data, the problem of unpaired image-to-image translation has also been widely explored [26, 7, 11], mainly focusing on translating images between two domains. For the more challenging task of multi-domain image-to-image translation, StarGAN [1] combines the ideas of [26] with [14] and can robustly translate a given image to multiple target domains with only a single generator. In this paper, we dive deeper into this issue and integrate several novel improvements.

Visual Attention. Attention-based models have demonstrated significant performance in a wide range of applications, including neural machine translation [19], image captioning [20], image generation [23] and so on. [22] utilizes visual attention to localize the domain-related regions for facial attribute editing, but can only handle a single attribute by translating between two domains. In this paper, we validate the use of attention to solve the more general problem of translating images for multiple domains.

Figure 2: Detailed network architectures of G and D in SAT. G consists of two sub-modules, the Translation Network and the Attention Network. D produces probability distributions for both adversarial discrimination and label classification. (a) and (b) denote the raw and latent strategies of combination in the backbone respectively.

3 Methodology

In this section, we discuss our approach SAT (Show, Attend and Translate) for unpaired multi-domain image-to-image translation. The overall and detailed architectures of SAT are illustrated in Fig.1 and Fig.2 respectively.

3.1 Multi-Domain Image-to-Image Translation

Multi-domain image-to-image translation accepts an input image and produces an output image for multiple target domains. We denote the number of involved domains by $n$ and utilize a multi-hot label vector $v_t \in \{0, 1\}^n$ to convey the target domain information, where $1$ means the existence of a certain domain while $0$ means its absence.

3.2 Raw or Latent

Generally, G consists of three parts: an encoder that downsamples the input image to a latent representation of lower resolution, several residual blocks [5] for nonlinear transformations, and a decoder that upsamples and produces an output image of the original resolution. So there are two different strategies for combining the input image with the target domain information.

  • Raw. Combine before the encoder by spatially replicating the target vector, which is then concatenated with the input image.

  • Latent. Combine after the encoder and before the residual blocks in a similar way, concatenating the replicated vector with the latent representation.

[1] adopts the former manner, but we suspect that it may be more intuitive to introduce the target vector in the latent phase: the vector carries semantic implications about the target domains, so it should be less appropriate to combine it directly with the raw pixel values of the input image. We will further investigate this issue in the Experiments section.
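As a rough illustration of the replicate-and-concatenate step shared by both strategies, the following NumPy sketch tiles a per-image vector over the spatial grid and appends it as extra channels. The function name `concat_vector`, the channels-last layout and all tensor shapes are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def concat_vector(features, vector):
    """Tile a per-image vector over the spatial grid and append it as channels.

    features: array of shape (batch, height, width, channels)
    vector:   array of shape (batch, n_domains)
    """
    b, h, w, _ = features.shape
    # (batch, n) -> (batch, 1, 1, n) -> (batch, h, w, n)
    tiled = np.broadcast_to(vector[:, None, None, :], (b, h, w, vector.shape[1]))
    return np.concatenate([features, tiled], axis=-1)

# Two example vectors over seven domains (values chosen arbitrarily).
vec = np.array([[1, 0, 0, 1, 0, 1, 0],
                [0, 1, 0, 0, 1, 1, 0]], dtype=np.float32)

# Raw strategy: combine with the input image itself (image size assumed).
image = np.zeros((2, 128, 128, 3), dtype=np.float32)
raw_input = concat_vector(image, vec)        # shape (2, 128, 128, 10)

# Latent strategy: combine with the encoder output (latent shape assumed).
latent = np.zeros((2, 32, 32, 256), dtype=np.float32)
latent_input = concat_vector(latent, vec)    # shape (2, 32, 32, 263)
```

The only difference between the two strategies in this sketch is which tensor is passed as `features`; the concatenation itself is identical.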

3.3 Action Vector

In [14, 1], the label vector $v_t$ is utilized to contain the target domain information. Based on $v_t$, we propose the action vector for guiding G to translate images according to the target domains. The label vector of the source domains is denoted by $v_s$, so we need to translate an input image with label $v_s$ into an output image with label $v_t$. The target action vector $a_t$ is defined as follows.

$$a_t = v_t - v_s$$

The motivation of the action vector is straightforward. $v_t$ tells the model what the generated images should look like, while $a_t$ describes what should be done and changed to generate the desired outputs. Given that $v_s$ and $v_t$ are both multi-hot vectors, the values of $a_t$ are $-1$, $0$ or $1$, which mean removing, preserving or adding the related content of a target domain respectively. In this way, the original task of multi-domain image-to-image translation can be understood as a problem of arithmetic addition and subtraction, $v_s + a_t = v_t$, enabling the model to focus more on what should be changed and translated for the target domains.

Based on the translated image, we can also conduct the translation in reverse as Fig.1 shows and obtain the reconstructed image, where the source action vector $a_s$ is calculated as follows.

$$a_s = v_s - v_t$$

Note that $a_s = -a_t$, which coincides with the definition of reverse translation. We use $a_0$ to denote the zero action vector, which means translating nothing and preserving the content of the original domains.
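The arithmetic behind the action vector can be checked with a small example. The sketch below assumes a hypothetical ordering of five domains (black hair, blond hair, brown hair, male, young); the ordering and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical domain order: [black, blond, brown, male, young]
source_label = np.array([1, 0, 0, 0, 1])   # black hair, not male, young
target_label = np.array([0, 1, 0, 1, 1])   # blond hair, male, young

# +1 adds a domain, 0 preserves it, -1 removes it.
target_action = target_label - source_label
# Action for the reverse translation back to the source domains.
source_action = source_label - target_label
# Zero action vector: translate nothing.
zero_action = np.zeros_like(source_label)

print(target_action.tolist())   # [-1, 1, 0, 1, 0]
```

Here black hair is removed, blond hair and male are added, and the remaining entries are left unchanged; adding `target_action` to `source_label` recovers `target_label`.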

3.4 Visual Attention

In order to leverage visual attention, we modify the generator G, which now consists of two sub-modules, the Translation Network (TN) and the Attention Network (AN), as Fig.2 shows. TN performs the translation task and generates a raw translation $\tilde{x}_t$ from the input image $x_s$ conditioned on the action vector $a_t$.

$$\tilde{x}_t = \mathrm{TN}(x_s, a_t)$$

However, TN may change the domain-unrelated contents and thus produce unsatisfactory results, which should be avoided for multi-domain image-to-image translation. To solve this problem, AN accepts the same inputs and generates an attention mask $m$ with the same resolution as $x_s$.

$$m = \mathrm{AN}(x_s, a_t)$$

Ideally, the values of $m$ should be either $0$ or $1$, indicating that the corresponding pixel of the input image is irrelevant or relevant to the target domains. Based on the attention mask, we can obtain the refined translation result $x_t$ by extracting only the domain-related content from $\tilde{x}_t$ and copying the rest from $x_s$.

$$x_t = m \odot \tilde{x}_t + (1 - m) \odot x_s$$

where $\odot$ denotes element-wise multiplication and $1 - m$ inverts the attention mask to get the domain-unrelated regions. Combining the three equations above, we reformulate G as follows.

$$G(x_s, a_t) = \mathrm{AN}(x_s, a_t) \odot \mathrm{TN}(x_s, a_t) + \left(1 - \mathrm{AN}(x_s, a_t)\right) \odot x_s$$

In this way, we decompose the original task of multi-domain image-to-image translation into two sub-tasks, determining which regions to focus attention on and learning how to generate realistic images conforming to the target domains, which guarantees that the domain-unrelated contents are preserved when translating images. The detailed network architectures of G are illustrated in Fig.2 and will be further discussed in the Implementation section.
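The blending step can be sketched in a few lines of NumPy. The arrays below are random stand-ins for the outputs of TN and AN, not real network outputs, and the function name `blend` is an assumption for this example.

```python
import numpy as np

def blend(x, translated, mask):
    """Keep translated pixels where mask is 1 and input pixels where it is 0."""
    return mask * translated + (1.0 - mask) * x

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 4, 3))            # input image
translated = rng.uniform(-1.0, 1.0, size=(4, 4, 3))   # stand-in for the TN output
mask = np.zeros((4, 4, 1))                            # stand-in for the AN output
mask[:2] = 1.0                                        # attend to the top half only

out = blend(x, translated, mask)
```

With this binary mask, the top two rows of `out` come from the translated image and the bottom two rows are copied unchanged from the input; a soft mask with values between 0 and 1 would mix the two.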

3.5 Loss Functions

Adversarial Loss. In order to distinguish fake samples from real ones, D learns to minimize the following adversarial loss [3].


while G tries to synthesize fake samples to fool D, so the adversarial loss of G acts oppositely.


By playing such a two-player game, D obtains a stronger capability of discrimination and G is able to generate realistic samples of high quality.

Classification Loss. In order to translate the input image to the desired output conforming to the target domains, D should possess the capability of classifying images to their correct labels, and G should be able to generate images corresponding to the target labels with supervision from D.

We utilize the annotations between the source images and the source label vectors to train the auxiliary classifier in D. By minimizing the following classification loss, where the source label vectors are the ground truths and the classifier outputs are the predicted probability distributions, D learns to infer the most appropriate labels with confidence for any given image.


Meanwhile, we also impose the classification loss on G to guarantee that the generated images are not only realistic but also classified to the target domains under the supervision of D.
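For multi-hot labels, a common choice of classification loss is the binary cross-entropy between the label vector and the predicted per-domain probabilities. The sketch below assumes that form, averaged over all entries; the exact reduction used in the original work is not recoverable from the text, so treat it as illustrative.

```python
import numpy as np

def multi_label_cross_entropy(labels, probs, eps=1e-7):
    """Binary cross-entropy averaged over all domains and samples."""
    probs = np.clip(probs, eps, 1.0 - eps)   # avoid log(0)
    per_entry = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(per_entry.mean())

labels = np.array([[1.0, 0.0, 0.0, 1.0]])
confident = np.array([[0.9, 0.1, 0.2, 0.8]])   # mostly correct predictions
uninformed = np.full_like(labels, 0.5)         # no information at all

loss_good = multi_label_cross_entropy(labels, confident)
loss_flat = multi_label_cross_entropy(labels, uninformed)   # equals ln 2
```

For D the labels would be the source label vectors of real images; for G they would be the target label vectors, evaluated on the generated images.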


Cycle Consistent Loss. It is intuitive that multi-domain image-to-image translation should only change the domain-related contents while keeping other details preserved, which cannot be satisfied by the adversarial and classification losses alone. After translation and reverse translation, the reconstructed image should be exactly the same as the original image. In other words, the effects of the target action vector and the source action vector should compensate for each other, which can be regularized by the following cycle consistent loss [26].


where the L1 norm is applied to calculate the difference between the original image and the reconstructed image.

Identity Reconstruction Loss. Another intuition for multi-domain image-to-image translation is that the generated image should also be exactly the same as the input image if the target label vector equals the source label vector, namely modifying nothing if the target domains are the same as the source domains. In this case, the generated image is the output of G given the source label vector, or given the zero action vector once the action vector is considered; the zero action vector indicates that no translation should be conducted. We utilize the L1 norm again and impose the following identity reconstruction loss between the input image and the generated image.
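Both L1 terms can be sketched together. In the code below, `generator` is any callable taking an image batch and an action vector; the toy generator is a stand-in invented for this example, and using the mean absolute difference as the L1 measure is an assumption.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference between two image batches."""
    return float(np.mean(np.abs(a - b)))

def cycle_and_identity_losses(generator, x, target_action):
    fake = generator(x, target_action)                  # forward translation
    rec = generator(fake, -target_action)               # reverse translation
    same = generator(x, np.zeros_like(target_action))   # zero action vector
    return l1_loss(x, rec), l1_loss(x, same)

# Toy stand-in generator (an assumption): shifts brightness by the action sum.
def toy_generator(x, action):
    return x + 0.1 * action.sum()

x = np.full((1, 2, 2, 3), 0.5)
cyc, idt = cycle_and_identity_losses(toy_generator, x, np.array([1, 0, -1, 1]))
```

The toy generator is perfectly invertible, so both losses vanish here; a real generator would be trained to push them towards zero.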

Figure 3: Translation results and residual images of the four variants of SAT on the test set. A residual image with fewer non-zero values means better performances of attention. The action vector can better preserve the original identity and generate images that are more related to the target domains.

Gradient Penalty. To stabilize the training process and generate images of higher quality, we replace the adversarial losses of D and G above with Wasserstein GAN objectives [4].


What is more, a gradient penalty loss is imposed on D to enforce the 1-Lipschitz constraint [4].


where the penalized samples are uniformly sampled along straight lines between pairs of real and translated samples.
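The sampling and the penalty can be sketched with NumPy. In practice the gradient of D with respect to the interpolated samples comes from automatic differentiation; to keep this example self-contained, it uses a toy linear critic whose gradient is known in closed form. All names and shapes are assumptions for illustration.

```python
import numpy as np

def interpolate(real, fake, rng):
    """Sample one point per pair, uniformly along the line from fake to real."""
    alpha_shape = (real.shape[0],) + (1,) * (real.ndim - 1)
    alpha = rng.uniform(0.0, 1.0, size=alpha_shape)
    return alpha * real + (1.0 - alpha) * fake

def gradient_penalty(grads):
    """Mean squared deviation of the per-sample gradient norm from 1."""
    flat = grads.reshape(grads.shape[0], -1)
    norms = np.sqrt(np.sum(flat ** 2, axis=1))
    return float(np.mean((norms - 1.0) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(size=(8, 4, 4, 3))
fake = rng.normal(size=(8, 4, 4, 3))
x_hat = interpolate(real, fake, rng)

# Toy linear critic D(x) = sum(w * x): its gradient w.r.t. x is w everywhere.
w = np.full((4, 4, 3), 0.5)
grads = np.tile(w, (x_hat.shape[0], 1, 1, 1))
gp = gradient_penalty(grads)
```

Each interpolated sample lies on the segment between its real and translated counterparts, and the penalty is zero only when every per-sample gradient norm equals 1.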


Total Loss. We combine the losses discussed above to define the total loss functions for D and G to minimize.


where the weighting coefficients are hyper-parameters that control the weights of the different loss terms.

4 Implementation

We implement SAT with TensorFlow and conduct all the experiments on a single NVIDIA Tesla P100 GPU.

The network architecture of SAT is depicted in detail in Fig.2, where blocks of different colors denote different types of neural layers. For the convolution and transposed convolution layers, the attached texts give the parameters of the convolution kernels: the kernel size, the number of filters and the stride size. We apply instance normalization [18] for G but no normalization for D, and the default nonlinearities for G and D are ReLU and leaky ReLU respectively.

The two sub-modules of G, TN and AN, both accept the input image together with the label or action vector, and each owns a backbone; the two backbones share the same architecture but have separate parameters. The details of the backbone are also illustrated in Fig.2, where (a) and (b) denote the raw and latent strategies of combining the label or action vector with the image or its latent representation by depth-wise concatenation. In the encoder of the backbone, the input is downsampled by two strided convolution layers. Six residual blocks are employed and the decoder consists of two strided transposed convolution layers for upsampling.

In order to produce a normalized RGB image in $[-1, 1]$, a convolution layer with 3 filters followed by a tanh nonlinearity is appended to TN. In contrast, the backbone of AN is followed by a convolution layer with only 1 filter and a sigmoid nonlinearity to generate an attention mask in $(0, 1)$. The outputs of TN and AN are further blended with the input image to obtain the refined result, as defined in Section 3.4.

Figure 4: Translation results with real-valued vectors on the domain Male. The red boxes denote the input images and the blue boxes denote the identity reconstructed images.

The network structure of D is relatively simple: six strided convolution layers for downsampling, followed by another two convolution layers for discrimination and classification respectively.

5 Experiments

In this section, we perform extensive experiments to demonstrate the superiority of SAT. We first investigate the influences of different settings on the translation results, label or action, raw or latent. Then we explore the feasibility of real-valued vectors and conduct both qualitative as well as quantitative evaluations to compare SAT with existing literature. Lastly, we train SAT again to translate images of higher resolution and visualize the attention masks of different domains for better explainability.

5.1 Datasets

We utilize the CelebFaces Attributes (CelebA) dataset [12] to conduct experiments for SAT. CelebA contains 202,599 face images of size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age. 2,000 images are randomly selected as the test set and all the other images are used for training. We construct seven domains with the following attributes: hair color (black, blond, brown), gender (male, female) and age (young, old).

5.2 Training

The images of CelebA are randomly flipped horizontally, cropped centrally and resized for preprocessing. All the parameters of the neural layers are initialized with the Xavier initializer [2], and the same loss weights are used for all experiments. The Adam [8] optimizer is utilized; the learning rate is kept fixed for the first part of training and then linearly decayed over the remaining epochs.
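A hold-then-decay learning rate schedule of this kind can be written as a small function. The sketch assumes that the decay ends at zero, and the numbers in the usage lines are placeholders chosen for illustration, not the values used in the experiments.

```python
def learning_rate(epoch, base_lr, hold_epochs, decay_epochs):
    """Constant for hold_epochs, then linear decay to zero over decay_epochs."""
    if epoch < hold_epochs:
        return base_lr
    progress = (epoch - hold_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)

# Placeholder settings (assumptions): hold for 10 epochs, decay over 10 more.
schedule = [learning_rate(e, base_lr=1e-4, hold_epochs=10, decay_epochs=10)
            for e in range(21)]
```

In this example `schedule[0]` through `schedule[10]` equal the base rate, and the rate then falls linearly until it reaches zero at epoch 20.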

We perform one generator update for every five discriminator updates [4]. For each iteration, we obtain a batch of source images and source label vectors from the training data, randomly shuffle the source label vectors to obtain the target label vectors, and calculate the action vectors accordingly, which forces G to translate images into various target domains.
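The label shuffling can be sketched as follows. The batch of label vectors is made up for the example, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up batch of multi-hot source label vectors, one row per image.
source_labels = np.array([[1, 0, 0, 1],
                          [0, 1, 0, 0],
                          [0, 0, 1, 1]])

# Shuffle the rows within the batch to obtain target labels, so that every
# target is a label combination that actually occurs in the training data.
perm = rng.permutation(len(source_labels))
target_labels = source_labels[perm]

# Per-image action vectors with entries in {-1, 0, 1}.
target_actions = target_labels - source_labels
```

Because the targets are drawn from the same batch, an image may occasionally be paired with its own labels, in which case its action vector is all zeros.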

5.3 Different Settings

We construct four variants of SAT to investigate the influences of different settings.

  • SAT-lr: the label vector in the raw phase.

  • SAT-ll: the label vector in the latent phase.

  • SAT-ar: the action vector in the raw phase.

  • SAT-al: the action vector in the latent phase.

We optimize the above four models on the training images of CelebA and compare their performance on the unseen test set. For each test image, three different operations can be conducted. 1) H: changing the hair color. 2) G: inverting the gender (from male to female or vice versa). 3) A: inverting the age (from young to old or vice versa).

The translation results and the residual images are illustrated in Fig.3, where the residual image is defined as the difference between the input image and the translated image, so a residual image with fewer non-zero values means that attention performs better.
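A residual image and a simple summary of how much of the image changed can be computed as below. Taking the absolute difference and the tolerance-based pixel count are assumptions made for this sketch; the text only defines the residual as a difference.

```python
import numpy as np

def residual_image(x, translated):
    """Absolute per-pixel difference between input and translated image."""
    return np.abs(translated - x)

def changed_fraction(residual, tol=1e-6):
    """Share of pixels where any channel differs by more than tol."""
    return float(np.mean(residual.max(axis=-1) > tol))

x = np.zeros((8, 8, 3))
translated = x.copy()
translated[:2] += 0.3          # only the top two rows are modified

res = residual_image(x, translated)
frac = changed_fraction(res)   # 16 of 64 pixels changed
```

A smaller `frac` corresponds to a residual image with fewer non-zero values, that is, a translation confined to a smaller region.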


As Fig.3 shows, all the four variants of SAT can produce favorable translation results and the non-zero regions of the residual images intuitively demonstrate the effectiveness of the visual attention. However, we find that the action vector can better preserve the original identity (see the hair textures in Black Hair and the eyes in Gender) and generate images more related to the target domains (see the hair colors in Blond Hair and the wrinkles in Age). We will further compare the four variants in subsequent sections.

Figure 5: Comparisons between StarGAN and SAT. The even rows illustrate the residual images of the two models for each translation task. SAT surpasses StarGAN significantly in both the translation quality of target domains and the preservation of domain-unrelated contents. H: Hair color, G: Gender, A: Age.

5.4 Baseline Models

We mainly compare SAT against StarGAN [1], which is also devoted to unpaired multi-domain image-to-image translation. Existing methods for image-to-image translation between two domains, like DIAT [10] and CycleGAN [26], and for facial attribute transfer, like IcGAN [15], have been discussed in detail in [1] and are surpassed by StarGAN by significant margins, so we omit them to save space in this paper.

StarGAN utilizes the label vector to convey target domain information and the cycle consistent loss to preserve domain-unrelated contents. The label vector is introduced in the raw phase and combined with the input image by depth-wise concatenation.

Operation   H       G       A       HG      HA      GA      HGA
StarGAN     18.6%   16.7%   7.3%    23.2%   28.4%   15.2%   32.3%
SAT         73.6%   81.6%   88.0%   68.2%   62.3%   81.1%   52.8%
Pass        7.8%    1.7%    4.7%    8.6%    9.3%    3.7%    14.9%
Table 1: Quantitative comparisons of StarGAN and SAT (share of votes per operation). Each column sums to 100%.

5.5 Real-Valued Vectors

We investigate the scalability of StarGAN and the four variants of SAT on real-valued vectors. For example, the domain Male can be set to 0 or 1 in the label vector, but a robust model should be able to handle abnormal values as well. We test several values for this entry and illustrate the results in Fig.4, covering both the normal values seen during training and abnormal values outside that range.

We observe severe artifacts in the fourth row, indicating the incapability of SAT-ar to handle real-valued vectors. StarGAN fails to translate robustly for extreme values, where the translated images are over-saturated at one extreme and under-saturated at the other, so we conjecture that StarGAN wrongly correlates the domain Male with the saturation of images. The domain-unrelated regions like the background are also influenced, and the image is even badly blurred for some values.

In contrast, the domain-unrelated regions are well preserved in all the four variants of SAT, mainly owing to the effective visual attention. Constrained by the identity reconstruction loss, the four variants can also perfectly preserve the original identity when no change is requested (see the red boxes and the blue boxes in Fig.4). However, SAT-ll fails to perform robustly and wrongly translates the girl into an old man for an extreme value, indicating that SAT-ll confuses the semantics of the domain Male with the domain Old. Both SAT-lr and SAT-al achieve reasonable results for all tested values, but we are delighted to discover that SAT-al is even capable of naturally adding some auxiliary features when dealing with extreme values, such as makeup at one extreme and beards at the other. The above differences are also observed on other test images.

As a result, we arrive at the following three conclusions. 1) It is better to introduce the label vector in the raw phase. 2) It is better to introduce the action vector in the latent phase. 3) SAT-al is superior to the other three variants. We will use SAT-al as the default choice of SAT for subsequent experiments.

Figure 6: Translation results and attention masks at higher resolution for more domains.
Figure 7: Translation results with real-valued vectors on the domain Smiling.

5.6 Qualitative Evaluation

Based on H, G, A, we further construct another four operations for each test image, HG, HA, GA, HGA, to cover translations for multiple domains. Fig.5 illustrates the qualitative results of StarGAN and SAT for different operations, where the first column shows the input image, followed by three columns for single-domain translation and four columns for multi-domain translation.

As Fig.5 shows, StarGAN can generate visually satisfactory results, but unavoidably influences the domain-unrelated contents. For example, the background is changed for almost all operations according to the residual images, and the hair color wrongly turns black when the translation tasks are G, A and GA. StarGAN fails to disentangle the patterns of different domains and produces undesired changes, which may degrade the translation performances for certain operations.

The above problems of StarGAN are not observed in SAT. The residual images demonstrate the capability of SAT to focus only on the domain-related regions, such as the hair when translating hair color and the face when translating gender or age. Due to the effective action vector, SAT can capture the semantics of each domain and generate realistic images with reasonable details.

5.7 Quantitative Evaluation

For each of the 2,000 images in the test set, we perform the above seven operations and obtain a pair of candidates from the two models. The quantitative evaluation is conducted in a crowd-sourcing manner, where the volunteers are instructed to select the better one based on three criteria: the perceptual realism, the quality of translation for the target domains, and the preservation of the original identity.

The statistics are reported in Table 1, where Pass means the volunteers cannot make a choice because the two images are equally good or equally bad. The translation results depend heavily on the quality of the input images, and both models may perform poorly when the input images are blurred and under-saturated, which accounts for a large proportion of Pass. For single-domain translation, SAT surpasses StarGAN by winning 66.9% more votes on average. For the more complicated tasks of multi-domain translation, the percentages of Pass increase considerably, but SAT is still superior to StarGAN with a significant margin of 41.3% on average.
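The average margins between SAT and StarGAN can be recomputed directly from the figures in Table 1. The short script below does so with plain Python and also confirms that every column sums to 100%; the helper names are assumptions for this example.

```python
operations = ["H", "G", "A", "HG", "HA", "GA", "HGA"]
votes = {
    "StarGAN": [18.6, 16.7, 7.3, 23.2, 28.4, 15.2, 32.3],
    "SAT":     [73.6, 81.6, 88.0, 68.2, 62.3, 81.1, 52.8],
    "Pass":    [7.8, 1.7, 4.7, 8.6, 9.3, 3.7, 14.9],
}

def mean(values):
    return sum(values) / len(values)

def margin(indices):
    """Average SAT share minus average StarGAN share, in percentage points."""
    sat = mean([votes["SAT"][i] for i in indices])
    star = mean([votes["StarGAN"][i] for i in indices])
    return sat - star

single = [i for i, op in enumerate(operations) if len(op) == 1]
multi = [i for i, op in enumerate(operations) if len(op) > 1]

column_sums = [round(sum(col), 1) for col in zip(*votes.values())]
single_margin = round(margin(single), 1)   # 66.9
multi_margin = round(margin(multi), 1)     # 41.3
```

The margins are differences between average vote shares, so they are expressed in percentage points.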

5.8 Higher Resolution

We process the images of CelebA to a larger size and train SAT again to translate images of higher resolution. We construct more domains based on the following ten attributes: black hair, blond hair, brown hair, male, young, eyeglasses, mouth slightly open, pale skin, rosy cheeks and smiling, to investigate the robustness of our model.

The translation results and the attention masks are shown in Fig.6, which demonstrate the ability of SAT to correctly translate images for the target domains by changing only the related contents. For the domain Smiling, we inspect SAT with various real values of the corresponding entry and illustrate the results in Fig.7, which demonstrates that SAT can also be utilized for other tasks like facial expression synthesis [16].

6 Conclusion

In this paper, we propose SAT, a unified and explainable generative adversarial network for unpaired multi-domain image-to-image translation with a single generator. Based on the label vector, we propose the action vector to convey target domain information and introduce it in the latent phase. Visual attention is utilized so that only the domain-related regions are translated while the others are preserved. Extensive experiments demonstrate the superiority of our model, and the attention masks help explain what SAT attends to when translating images for different domains.