What and Where to Translate: Local Mask-based Image-to-Image Translation

06/09/2019 ∙ by Wonwoong Cho, et al. ∙ Microsoft ∙ Korea University

Recently, image-to-image translation has attracted significant attention. Among the many approaches, those based on an exemplar image that contains the target style information have been actively studied, owing to their capability to handle multimodality as well as their applicability in practical use. However, two intrinsic problems exist in the existing methods: what and where to transfer. First, those methods extract style from an entire exemplar, which includes noisy information that impedes a translation model from properly extracting the intended style of the exemplar. That is, we need to carefully determine what to transfer from the exemplar. Second, the extracted style is applied to the entire input image, which causes unnecessary distortion in irrelevant image regions. In response, we need to decide where to transfer the extracted style. In this paper, we propose a novel approach that extracts a local mask from the exemplar that determines what style to transfer, and another local mask from the input image that determines where to transfer the extracted style. The main novelty of this paper lies in (1) the highway adaptive instance normalization technique and (2) an end-to-end translation framework that achieves outstanding performance in reflecting the style of an exemplar. We present quantitative and qualitative evaluation results to confirm the advantages of our proposed approach.




1 Introduction

Unpaired image-to-image translation (in short, image translation) based on generative adversarial networks (GANs) (Goodfellow et al., 2014) aims to transform an input image from one domain to another without using paired data between different domains (Zhu et al., 2017a; Liu et al., 2017; Kim et al., 2017; Choi et al., 2018; Liu et al., 2017; Bahng et al., 2018). Such an unpaired setting is inherently multimodal, since a single input image can be mapped to multiple different outputs within a target domain. For example, when translating the hair color of a given image to blonde, the detailed hair region (e.g., upper vs. lower, and partial vs. entire) and the detailed color (e.g., dark vs. light blonde) may vary.

Previous studies have achieved such multimodal outputs by adding random noise sampled from a pre-defined prior distribution (Zhu et al., 2017b) or by taking a user-selected exemplar image as an additional input, which contains the detailed information of an intended target style (Chang et al., 2018). Recent studies (Lin et al., 2018; Ma et al., 2019; Cho et al., 2018), including MUNIT (Huang et al., 2018) and DRIT (Lee et al., 2018), utilize these two approaches and show state-of-the-art performance by separating (i.e., disentangling) the content and style information of a given image through two different encoder networks.

However, existing exemplar-based methods have several limitations. First, those methods do not pay attention to what target style to transfer from the exemplar. Instead, they simply extract style information from the entire region of a given exemplar, even though it is likely that only the style of a sub-region of the exemplar should be transferred. Thus, the style of the entire exemplar tends to be noisy owing to the regions that are irrelevant to the target attribute to transfer. This degrades the model, particularly in reflecting only the relevant style contained in an exemplar.

Suppose we translate the hair color of an image using an exemplar image. Since the hair color information is available only in the hair region of an image, the style information extracted from the entire region of the exemplar may contain the irrelevant information (e.g., the color of the wall and the texture pattern of the floor), which should not be reflected in the intended image translation. In the end, the noisy style results in erroneous translations in mirroring the hair color of the exemplar, as illustrated in Fig. 3.

Second, previous methods do not distinguish different regions of the input image. Even though particular regions should be kept as they are during translation, those methods simply transfer the extracted style to the entire region of the input image to obtain the target image. Because of this issue, previous approaches (Huang et al., 2018; Lee et al., 2018) often distort irrelevant regions of an input image, such as the background. That is, we should be aware of where to transfer in the input image.

To tackle these issues, we propose a novel LOcal Mask-based Image Translation approach, called LOMIT, which generates a local, pixel-wise soft binary mask of an exemplar (i.e., the source region from which to extract the style information) to identify what style to transfer, and that of an input image to identify where to translate (i.e., the target region to which to apply the extracted style). Our algorithm shares the same high-level idea as recent approaches (Pumarola et al., 2018; Chen et al., 2018; Yang et al., 2018; Ma et al., 2019; Mejjati et al., 2018) that have leveraged an attention mask in image translation. In those approaches, the attention mask, extracted from an input image, determines the target region to which a translation is applied, i.e., where to transfer. We expand these approaches by additionally exploiting a mask for an exemplar, so that LOMIT can decide where to translate as well as what to transfer.

Once the local masks are obtained, LOMIT extends a recently proposed technique for image translation, called adaptive instance normalization, using highway networks (Srivastava et al., 2015). The extended layer computes the weighted average of the input and the translated pixel values, using the above-mentioned pixel-wise local mask values as linear combination weights that differ per pixel location. LOMIT has the additional advantage of being able to manipulate the computed masks to selectively transfer an intended style, e.g., choosing either a hair region (to transfer the hair color) or a facial region (to transfer the facial expression).

The effectiveness of LOMIT is evaluated on two facial datasets via quantitative methods, such as the Fréchet inception distance (FID) and a user study, and via qualitative comparisons with other state-of-the-art methods.

Figure 1: Overview of a translation process. (a) Each domain is defined as a subset of data that share a particular attribute. An image from each domain is decomposed into the content space and the style space, where the style is separately encoded as two other representations (the black arrows). While merging them (the cyan arrows), LOMIT learns how to reconstruct the original image. (b) For cross-domain translation, LOMIT combines a foreground style feature extracted from the exemplar with a background style and a content feature extracted from the input image.

Figure 2: Image translation workflow. (a) LOMIT first generates masks for the input and the exemplar via attention networks. (b) Next, we separate each of the two images into foreground and background regions, depending on how much each pixel is involved in image translation. (c,d) By combining the content and the background style representations from the input with the foreground style representation from the exemplar, we obtain a translated image. Note that LOMIT also learns the opposite-directional image translation by interchanging the roles of the two images. Finally, LOMIT learns image translation using the cycle consistency loss.

2 Basic Setting

We define “content” as common features (an underlying structure) across all domains (e.g., the pose of a face, the location and the shape of eyes, a nose, a mouth, and hair), and “style” as a representation of the structure (e.g., background color, facial expression, skin tone, and hair color). As shown in Fig. 1, we assume that an image can be represented as , where is a content code in a content space, and is a style code in a style space. The operator combines and converts the content code and the style code into a complete image .

By considering the local mask indicating the relevant region (or simply, the foreground) to extract the style from or to apply it to, we further assume that is decomposed into , where is the style code extracted from the foreground region of the exemplar and is that from the background region. Separating a style representation into and implies disentangling style feature. The pixel-wise soft binary mask of an image is represented as a matrix with the same spatial resolution of . Each entry of lies between 0 and 1, which indicates the degree of the corresponding pixel belonging to the foreground. Then, the local foreground and the background regions, and of , are obtained as


where indicates an element-wise multiplication. Finally, our assumption is extended to , where , , and are obtained by the content encoder and the style encoder , respectively, which are all shared across multiple domains in LOMIT, i.e.,


It is essential for LOMIT to properly learn to generate the local mask involved in image translation. To this end, we propose to combine the mask generation networks with our novel highway adaptive instance normalization, as will be described in Section 3.2.
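The mask-based decomposition described above can be illustrated with a minimal sketch. It assumes that the foreground is the element-wise product of the mask and the image and that the background is the product with the complementary mask; the function name `split_by_mask` and the single-channel nested-list layout are hypothetical choices made only for this illustration.

```python
def split_by_mask(image, mask):
    """Split a single-channel image into foreground and background parts.

    `image` and `mask` are nested lists of the same height and width.
    Each mask entry is assumed to lie in [0, 1] and to give the degree to
    which the pixel belongs to the foreground (assumption of this sketch).
    """
    foreground = [
        [m * p for m, p in zip(mask_row, image_row)]
        for mask_row, image_row in zip(mask, image)
    ]
    background = [
        [(1.0 - m) * p for m, p in zip(mask_row, image_row)]
        for mask_row, image_row in zip(mask, image)
    ]
    return foreground, background
```

Under these assumptions, the two parts sum back to the original image at every pixel, which is why a soft mask can divide an image without discarding information.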

3 Local Image Translation Model

We first denote and as images from domains and , respectively. As shown in Fig. 2, LOMIT converts a given image to and vice versa, i.e., , and , where denotes the decoder networks and is our proposed local mask-based highway adaptive instance normalization layer (in short, HAdaIN), as will be described in detail in Section 3.2.

For brevity, we omit the domain index notation in, say, and , unless needed for clarification.

3.1 Local Mask Extraction

We extract the local masks of the input and the exemplar images, as those are the regions effectively involved in image translation. Concretely, LOMIT utilizes the local mask when (1) acquiring disentangled style features , and (2) specifying where to apply the style. For example, if LOMIT is conducting a hair color translation given the input image and the exemplar, our local masks should be obtained as the hair regions of the two images. This is because the style to replace and the style to transfer both exist in the hair regions of the images.

As shown in Fig. 2(a), given an image , attention networks encode the content feature via the content encoder . The obtained is then forwarded into the rest of the attention networks , i.e., , where is the mask specifying the relevant region with respect to a target style to translate. In practice, the process applies to images in each domain independently in a similar manner, resulting in .

3.2 Highway Adaptive Instance Normalization

Adaptive instance normalization (AdaIN) is an effective style transfer technique (Huang and Belongie, 2017). Generally, it matches the channel-wise statistics, e.g., the mean and the variance, of the activation map of an input image with those of a style image. In the context of image translation, MUNIT (Huang et al., 2018) extends AdaIN in a way that the target mean and the variance are obtained as the outputs of the trainable functions and of a given style code, i.e.,


where each of and is defined as a multi-layer perceptron (MLP), i.e., and . Different MLPs are used for the foreground and the background because their style information differs significantly, e.g., facial attributes vs. background texture.

As we pointed out earlier, previous approaches applied such a transformation globally across the entire region of an image, which may unnecessarily distort irrelevant regions. Hence, we formulate our local mask-based highway AdaIN (HAdaIN) as


where the first term corresponds to the local region of an input image translated by the foreground style, while the second corresponds to the complementary region where the original style of the input should be kept as it is.
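The two-term structure described above can be sketched as follows. This is only an illustration under stated assumptions: the blend is taken to be a per-pixel convex combination of the AdaIN-translated feature and the original feature, weighted by the input mask; the target mean and standard deviation are passed in directly rather than produced by MLPs from a foreground style code; and the normalization statistics are computed over the whole channel. The function names `adain_channel` and `highway_adain` are hypothetical.

```python
import math


def adain_channel(channel, target_mean, target_std, eps=1e-5):
    """Normalize one channel to zero mean and unit variance, then rescale."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [
        [target_std * (v - mean) / std + target_mean for v in row]
        for row in channel
    ]


def highway_adain(content, mask, target_means, target_stds):
    """Blend translated and original features per pixel (assumed form).

    content:      list of channels, each an H x W nested list
    mask:         H x W nested list with entries in [0, 1]
    target_means: one target mean per channel (assumed given)
    target_stds:  one target standard deviation per channel (assumed given)
    """
    output = []
    for channel, mu, sigma in zip(content, target_means, target_stds):
        translated = adain_channel(channel, mu, sigma)
        blended = [
            [m * t + (1.0 - m) * c for m, t, c in zip(m_row, t_row, c_row)]
            for m_row, t_row, c_row in zip(mask, translated, channel)
        ]
        output.append(blended)
    return output
```

With a mask of all zeros the layer leaves the content feature unchanged, and with a mask of all ones it reduces to a plain AdaIN over the whole feature map, which matches the highway-style gating described in the text.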

4 Training Objectives

This section describes each of our loss terms in the objective function used for training our model.

4.1 Style and Content Reconstruction Loss

The foreground style of the translated output should be close to that of the exemplar, while the background style of the translated output should be close to that of the original input image. We formulate these criteria as the following style reconstruction loss terms:


From the perspective of content information, the content feature of an input image should be consistent with that of its translated output, which is expressed by the content reconstruction loss as


By encouraging the content features to be the same, the content reconstruction loss maintains the content information of the input image, even after performing a translation.

4.2 Image Reconstruction Loss

As an effective supervision approach in an unpaired image translation setting, we adopt the image-level cyclic consistency loss (Zhu et al., 2017a) between an input image and its output through two consecutive image translations of (or ), i.e.,


Meanwhile, similar to previous studies (Huang et al., 2018; Lee et al., 2018), we translate not only but also . This intra-domain translation should work similarly to an auto-encoder (Larsen et al., 2016), so the corresponding loss term can be written as


4.3 Domain Adversarial Loss

To reconstruct the real-data distribution via our model, we adopt the domain adversarial loss by introducing the discriminator networks . Among the loss terms proposed in the original GAN (Goodfellow et al., 2014), LSGAN (Mao et al., 2017), and WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017), we chose WGAN-GP, which has been shown to work best empirically, as the adversarial training loss. That is, our adversarial loss is written as


where , is a value sampled from the uniform distribution, and . We also apply the loss proposed in PatchGAN (Isola et al., 2017; Zhu et al., 2017a).
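For reference, the critic objective of WGAN-GP in its standard form (Gulrajani et al., 2017) is sketched below. The symbols are assumptions chosen for this illustration and may differ from the notation used in this paper: x is a real image, x̃ is a generated image, x̂ is a random interpolation between them, α is the uniformly sampled mixing coefficient, and λ is the gradient-penalty weight.

```latex
\mathcal{L}_{D}
  = \mathbb{E}_{\tilde{x}}\!\left[ D(\tilde{x}) \right]
  - \mathbb{E}_{x}\!\left[ D(x) \right]
  + \lambda\, \mathbb{E}_{\hat{x}}\!\left[
      \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1 \right)^{2}
    \right],
\qquad
\hat{x} = \alpha x + (1 - \alpha)\,\tilde{x},
\quad
\alpha \sim U[0, 1]
```

The discriminator minimizes this quantity, and the penalty term pushes the gradient norm of the discriminator toward one along the interpolated samples.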

4.4 Multi-Attribute Translation Loss

We use an auxiliary classifier (Odena et al., 2016) to cover multi-attribute translation with a single shared model, similar to StarGAN (Choi et al., 2018). The auxiliary classifier , which shares the parameters with the discriminator except for the last layer, classifies the domain of a given image. In detail, its loss term is defined as


where is the domain label of an input image . Similar to the concept of weakly supervised learning (Zhou et al., 2016; Selvaraju et al., 2017), this loss term supervises the local mask to determine the proper region of the corresponding domain through the HAdaIN module, allowing our model to extract the style from the proper region of the exemplar.

4.5 Mask Regularization Losses

We impose several additional regularization losses on local mask generation to improve the overall image generation performance as well as the interpretability of the generated mask.

The first regularization minimizes the difference between the mask values of pixels that have similar content information. This helps the local mask consistently capture a semantically meaningful region as a whole, e.g., capturing the entire hair region even when the lighting conditions and the hair color vary significantly within the exemplar. To this end, we design this regularization as


where , , and is a vector whose elements are all ones. Note that each of is in , and is in , where . The first term is the distance matrix of all the pairs of pixel-wise mask values in , and the second term is the cosine similarity matrix of all the pairs of -dimensional pixel-wise content vectors. Note that we backpropagate the gradients generated by this regularization term only through to train the attention networks, but not through , which prevents the regularization from affecting the encoder .

The second regularization is to minimize the local mask region (Chen et al., 2018; Pumarola et al., 2018), i.e.,


which encourages the model to focus only on the region necessary for image translation by minimizing the L1 norm of each mask .
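The two mask regularizers described above can be sketched on flattened pixels as follows. This is an illustration under assumptions: the pairwise mask distance and the pairwise cosine similarity are combined by element-wise multiplication and summed over all pixel pairs, the content vectors are assumed to be non-zero, and the function names are hypothetical. Gradient handling (stopping gradients through the content features) is outside the scope of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mask_consistency_penalty(mask_values, content_vectors):
    """Sum over all pixel pairs of |m_i - m_j| * cos(c_i, c_j).

    Pixels with similar content but different mask values are penalized
    (assumed combination of the two matrices described in the text).
    """
    total = 0.0
    for m_i, c_i in zip(mask_values, content_vectors):
        for m_j, c_j in zip(mask_values, content_vectors):
            total += abs(m_i - m_j) * cosine_similarity(c_i, c_j)
    return total


def mask_area_penalty(mask_values):
    """L1 norm of the flattened mask."""
    return sum(abs(m) for m in mask_values)
```

For instance, two pixels with identical content vectors contribute nothing when they share a mask value and contribute the most when one is fully inside the mask and the other fully outside.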

4.6 Full Loss

Finally, our full loss can be written as


where without a superscript denotes , , , , , and . Note that our training process contains both the intra-domain translation, and , and the inter-domain translation, and . Regarding the training procedure for LOMIT, all components of the generator, i.e., the style encoder, the content encoder, the mask generation network, and the decoder, are updated as a whole with the generator loss ; thus, our model does not require a separate training step for each component of the generator.

5 Implementation Details

In this section, we describe the model architecture of LOMIT in Section 5.1 and the training details in Section 5.2.

5.1 Model architectures

Content encoder.

Similar to MUNIT (Huang et al., 2018), the content encoder is composed of two strided-convolutional layers and four residual blocks (He et al., 2016). Following the previous approaches (Huang and Belongie, 2017; Nam and Kim, 2018), instance normalization (IN) (Ulyanov et al., 2016) is used across all the layers in the content encoder.

Style encoders.

The style encoder consists of four strided-convolutional layers, a global average pooling layer, and a fully-connected layer. The style codes are eight-dimensional vectors. Also, the style encoders share the first few layers, as these layers detect low-level features. To maintain the style information, we do not use IN in the style encoder.

Attention networks.

Attention networks consist of six convolutional layers with batch normalization (Ioffe and Szegedy, 2015). The layers are followed by an MLP composed of two linear layers with tanh and sigmoid activation functions, respectively.


Decoder.

The decoder has four residual blocks and two convolutional layers, each with an upsampling layer. We use layer normalization (LN) (Ba et al., 2016) in the residual blocks so as not to lose inter-channel differences: LN normalizes the entire feature map, which maintains the differences between the channels.


Discriminator.

Following StarGAN (Choi et al., 2018), the discriminator is composed of six strided-convolutional layers, followed by the standard discriminator and the auxiliary classifier. We exploit spectral normalization (Miyato et al., 2018) for stable training.

5.2 Training Details

We utilize the Adam optimizer (Kingma and Ba, 2015) with and . Following the state-of-the-art approach (Choi et al., 2018) in multi-attribute translation, we load the data with a horizontal flip applied with a probability of 0.5. For stable training, we update the generator once for every five updates of the discriminator (Gulrajani et al., 2017). We initialize the weights of from a normal distribution and apply He initialization (He et al., 2015) to the others. Also, we use a batch size of eight and a learning rate of 0.0001. We linearly decay the learning rate by half every 10,000 iterations, starting from iteration 100,000. All the models used in the experiments are trained for 200,000 iterations on a single NVIDIA TITAN Xp GPU, taking 30 hours each.

Figure 3: Comparison with the baseline models. Each row from the top represents (A) an input image with (F) an output of StarGAN (Choi et al., 2018), (B) the exemplars given to the baseline models, and the outputs of (C) LOMIT, (D) DRIT, and (E) MUNIT. The results demonstrate that LOMIT performs better than the other baselines in reflecting the given exemplar. Meanwhile, StarGAN (Choi et al., 2018), surrounded by the black box in the first row, is only able to generate a unimodal output.

6 Experiments

In this section, we elaborate on the settings and the results of our experiments. First, we describe the datasets we use in Section 6.1 and the baselines in Section 6.2. Second, we explain the evaluation metrics and the comparison with the baselines in Sections 6.3 and 6.4, respectively. Lastly, we provide examples of practical uses of LOMIT in Section 6.5, and we further perform an intensive analysis of LOMIT with additional experiments in Section 6.6.

6.1 Dataset


CelebA.

The CelebA (Liu et al., 2015) dataset consists of 202,599 face images of celebrities with 40 attribute annotations per image. We choose 10 attributes (i.e., black_hair, blond_hair, brown_hair, smiling, goatee, mustache, no_beard, male, heavy_makeup, wearing_lipstick) that would convey meaningful local masks. We randomly select 10,000 images for testing and use the others for training. Images are center-cropped and scaled down to 128×128.


EmotioNet.

The EmotioNet (Fabian Benitez-Quiroz et al., 2016) dataset contains 975,000 images of facial expressions in the wild, each annotated with 12 action units (AUs). Each AU denotes an activation of a specific facial muscle (e.g., jaw drop and nose wrinkler). We crop each image using a face detector (https://github.com/ageitgey/face_recognition) and resize it to 128×128. We use 10,000 images for testing and 200,000 images for training.

Wild image dataset.

We also perform experiments using wild images. Concretely, we exploit the horse2zebra and apple2orange datasets (Zhu et al., 2017a), which contain larger intra-domain variation than CelebA and EmotioNet. This is because the number of objects and the locations where they appear are diverse.

6.2 Baseline Methods


MUNIT.

MUNIT (Huang et al., 2018) decomposes a given image into the domain-invariant content and the domain-specific style features and exploits AdaIN (Huang and Belongie, 2017) as a translation method. By incorporating a random sampling scheme for the training of latent style features, MUNIT attempts to reflect the multimodal nature of various style domains. We implement MUNIT to be trained on the CelebA (Liu et al., 2015) dataset and report the results for comparison.


DRIT.

DRIT (Lee et al., 2018) employs two encoders, which encode the domain-invariant content information and the domain-specific style information, respectively. DRIT combines the separated content and style features by concatenating them into a single vector. The model is trained via a content discriminator, which ensures that the content space is shared. The loss functions and training strategies are similar to those of MUNIT.


AGGAN.

AGGAN (Mejjati et al., 2018) is one of the state-of-the-art methods in mask-based image translation. AGGAN applies its translation result to the foreground region of an input image, so that the background region remains intact. Because AGGAN does not take an exemplar, we change its setting in order to compare it with LOMIT. The major differences from ours are the absence of the HAdaIN module and of the exemplar mask.


ELEGANT.

ELEGANT encodes a face image into depth-wise disentangled features in a shared space, in which each subpart indicates a corresponding face attribute, such as ‘Smiling’. In order to perform a translation, ELEGANT substitutes one of the subparts of the features of an input image with that of an exemplar. Note that ELEGANT does not exploit masks for either the exemplar or the input image.

6.3 Evaluation Metrics

Fréchet inception distance (FID).

The FID (Heusel et al., 2017) is one of the widely used metrics for evaluating the performance of generative models. FID exploits the features extracted from intermediate layers of a pre-trained network. The distance between the feature distribution of the real images and that of the generated images is computed by measuring the distance between two different multivariate Gaussian distributions:


where , whose subsets are composed of the entire real images and , whose subsets consist of the overall generated images .
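For reference, the distance between two multivariate Gaussians used by the FID is commonly written as below, following the standard definition of Heusel et al. (2017). The symbols are assumptions chosen for this sketch and may differ from the notation used in this paper: μ_X and Σ_X are the mean and the covariance of the real-image features, and μ_G and Σ_G are those of the generated-image features.

```latex
\mathrm{FID}(X, G)
  = \lVert \mu_{X} - \mu_{G} \rVert_{2}^{2}
  + \operatorname{Tr}\!\left(
      \Sigma_{X} + \Sigma_{G}
      - 2 \left( \Sigma_{X} \Sigma_{G} \right)^{1/2}
    \right)
```

The first term compares the feature means, and the trace term compares the feature covariances, so the score is zero only when both statistics coincide.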

Human evaluation.

The FID is limited in evaluating how well a generated image maintains the consistency of an input image and the characteristics of an exemplar. In response, as a complementary metric to the FID, we perform human evaluation as a formal user study, the details of which are described in Section 6.4. Since we ask users to rank images generated from different models, we leverage the mean reciprocal rank (MRR) (Craswell, 2009) as a metric for quantifying human evaluation. Given a question, the MRR represents the average of the multiplicative inverse of a corresponding answer (rank) to the question, i.e.,


where is the number of questions and indicates the -th ranking of a given model . We use this metric to give priority to the highest rank, because a lower rank has less influence on the overall score in the MRR. The MRR ranges between 0 and 1, where the higher the score, the better the results.
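The MRR described above can be computed with a few lines of code. The following is a minimal sketch, assuming that the rank a model receives for each question is given as a positive integer with 1 as the best rank; the function name `mean_reciprocal_rank` is hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all questions for a single model.

    `ranks` holds the rank (1 = best) that the model received for each
    question; the input format is an assumption of this sketch.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks must be positive, with 1 as the best rank")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

A model that is always ranked first obtains an MRR of 1, and a single drop from the first to the second rank costs more than a drop from the second to the third, which is the property that gives priority to the highest rank.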

6.4 Comparison with Baselines

In this section, we qualitatively and quantitatively compare LOMIT with other baseline methods on the CelebA dataset in Section 6.4.1 and on the wild image dataset in Section 6.4.2.

6.4.1 Evaluation on CelebA dataset

Qualitative comparisons.

As shown in Fig. 3, we compare our model with the baseline models using the CelebA dataset (Liu et al., 2015), where we train the baselines based on publicly available model implementations. From the left, each macro column shows the translation from (a) brown to blonde, (b) non-facial hair to facial hair, and (c) non-smile to smile. Compared to the baseline models, LOMIT shows outstanding performance both in reflecting the distinct style of an exemplar and in keeping irrelevant regions intact, such as the background and the face in the hair color translation.

Concretely, we observe that the noise in the style information extracted from the background undesirably affects the generated images in the case of MUNIT and DRIT (notably in the third column of both (a) and (b), and the first and the fifth columns of (c)), while LOMIT does not suffer from such influence. Besides, we also find that MUNIT and DRIT apply the style information to irrelevant regions of the input images, distorting the color and the tone of both the face and the background. This indicates that the mask for the exemplar should be properly incorporated and that LOMIT is superior to the compared models with regard to accurate style application. These findings justify the initial motivation and the need for the local masks along with the proposed HAdaIN module of LOMIT.

We also compare with StarGAN (Choi et al., 2018), a widely used, state-of-the-art method, to verify the added benefits of LOMIT. In Fig. 3, the images with the black outline in the first row (A) represent the outputs of StarGAN for the corresponding inputs. They demonstrate that StarGAN is only able to generate a unimodal output depending on a multi-hot input vector indicating a target attribute. On the other hand, LOMIT generates diverse outputs reflecting each corresponding exemplar.

H & S 31.52 26.94 21.82
F.H & G 44.81 29.57 19.31
M & Y 37.78 33.68 26.01
Avg. 38.04 30.06 22.38

Table 1: Comparisons of the FID in the target domain. H & S, F.H & G, and M & Y indicate hair colors (‘Brown_Hair’, ‘Blonde_Hair’, ‘Black_Hair’) & ‘Smiling’, facial hair (‘Mustache’, ‘No_Beard’, ‘Goatee’) & ‘Male’, and ‘Heavy_Makeup’ & ‘Young’, respectively, following the configurations in CelebA (Liu et al., 2015).

Comparisons of the FID.

We compare LOMIT with the baseline models using the FID (Heusel et al., 2017), one of the renowned metrics for measuring the performance of generative models. A low FID indicates that the generated images are of better quality and cover a more diverse spectrum. To obtain the score for each model, we first generate images using the test dataset. Next, we build subsets per attribute group with the generated images, e.g., given the attribute group of F.H & G, the dataset is separated into four subsets, such as (0,0), …, (1,1), in which each number indicates the binary label. Finally, we acquire the image features of each subset based on the pretrained Inception-v3 (Szegedy et al., 2016) and compute the statistics of the generated feature distribution. Similarly, the feature statistics of the real images are obtained using the entire training dataset.
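The subset construction described above can be sketched as follows. This is an illustration only: the sample format (a dictionary with an `id` and a `labels` mapping), the attribute names, and the function name `group_by_labels` are hypothetical, and each attribute in a group is assumed to be reduced to a single binary label.

```python
from itertools import product


def group_by_labels(samples, attributes):
    """Group sample ids by their tuple of binary labels.

    For two attributes this yields the four subsets
    (0, 0), (0, 1), (1, 0), and (1, 1); empty subsets are kept.
    """
    groups = {combo: [] for combo in product((0, 1), repeat=len(attributes))}
    for sample in samples:
        key = tuple(sample["labels"][name] for name in attributes)
        groups[key].append(sample["id"])
    return groups


# Hypothetical generated images labeled for the F.H & G attribute group.
samples = [
    {"id": "img_0", "labels": {"facial_hair": 0, "male": 0}},
    {"id": "img_1", "labels": {"facial_hair": 1, "male": 1}},
    {"id": "img_2", "labels": {"facial_hair": 0, "male": 1}},
    {"id": "img_3", "labels": {"facial_hair": 1, "male": 1}},
]
subsets = group_by_labels(samples, ["facial_hair", "male"])
```

Feature statistics would then be computed separately for each of the resulting subsets before they are compared with the statistics of the real images.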

Table 1 lists the comparison results. In all the class subsets, LOMIT generates images that are more diverse and of better quality than the other methods, as indicated by the lower scores. We attribute this result to the capability of LOMIT to apply the extracted style to the adequate region of the input image while keeping irrelevant regions intact. On the other hand, DRIT and MUNIT apply the style to an unnecessarily large area of an input image, so that the feature distributions of their generated images end up far from the real distribution, as quantified by the high FID scores.

Q1. H & S 0.42 0.52 0.89
F.H & G 0.37 0.52 0.94
M & Y 0.41 0.46 0.97
Avg. 0.4 0.5 0.93

Q2. H & S 0.41 0.56 0.86
F.H & G 0.37 0.71 0.75
M & Y 0.44 0.48 0.92
Avg. 0.41 0.58 0.84

Table 2: Comparisons of the MRR, given two of the questions. Q1 is “Which image best maintains the extraneous regions on the translation?”, and Q2 is “Which image best reflects the style of the exemplar?”. A user grades the baseline methods according to a given question.

User study.

To evaluate the effectiveness of the proposed method, we conduct a user study comparing LOMIT with the other baseline models. First, out of the test dataset, we construct subsets in each of the attribute groups, e.g., M & Y (0,0),…,(1,1). Second, we randomly sample 100 images from each subset and generate 10,000 images per subset from every pair of these images. Lastly, we randomly sample 10 images from those generated images (40,000 in the case of M & Y) per test run. A test run is composed of ten instances per attribute group, and for each question, an input, an exemplar, and the corresponding outputs of each baseline model are presented. Regarding participants, we recruited 31 people diverse in age (from 22 to 40) and expertise (20 of them non-experts in machine learning). Thus, the results reflect opinions from people with diverse viewpoints. Each time, participants rank all methods based on the given question.
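The image counts quoted above can be checked with a short calculation. The sketch assumes that every pair is an ordered (input, exemplar) pair drawn from the same 100 sampled images, including an image paired with itself, and that an attribute group with two binary attributes yields four subsets; the function name `count_generated_images` is hypothetical.

```python
def count_generated_images(images_per_subset, num_binary_attributes):
    """Return (images generated per subset, images generated in total)."""
    # Ordered (input, exemplar) pairs, self-pairs included (assumption).
    pairs_per_subset = images_per_subset ** 2
    # One subset per combination of binary labels, e.g. (0,0) ... (1,1).
    num_subsets = 2 ** num_binary_attributes
    return pairs_per_subset, num_subsets * pairs_per_subset


per_subset, total = count_generated_images(100, 2)
print(per_subset, total)  # 10000 40000
```

Under these assumptions the numbers agree with the 10,000 images per subset and the 40,000 images for M & Y stated in the text.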

As the evaluation metric, we report the mean reciprocal rank (MRR) (Craswell, 2009), which quantifies the human evaluation. Given a question, the MRR averages the reciprocals of the given rankings. The MRR ranges between 0 and 1, where the higher the score, the better the results.

For the first question, we ask users to evaluate which model best preserves the regions that are irrelevant to the style. The results are summarized in Table 2. In every attribute group, LOMIT records outstanding performance in keeping the extraneous regions untouched. On the other hand, the other baselines apply the style to the entire region of the image, yielding an excessive translation and resulting in a worse MRR than LOMIT.

The second question requires the users to answer which method reflects the exemplar style best. As shown in Table 2, we found that users favor LOMIT over baselines. We believe this is because the mask for the exemplar specifies the regions to extract the style from, so as to enable the model to have better representations of the targeted style.

6.4.2 Evaluation on wild image dataset

Translation     AGGAN    LOMIT
Apple→Orange    170.83   124.31
Orange→Apple    110.61   91.21
Horse→Zebra     105.27   49.60
Zebra→Horse     97.25    89.74
Avg.            120.99   88.72

Table 3: FID score comparison with AGGAN (Mejjati et al., 2018). LOMIT shows better results than AGGAN in every translation case.

Figure 4: Qualitative comparison with AGGAN. Each row represents the results of AGGAN and LOMIT, and each column shows a different translation case. The exemplar used for each case is the input image of the opposite case.

In order to verify that LOMIT can handle datasets with large variations within a class, we additionally perform comparison experiments on the ‘horse2zebra’ and ‘apple2orange’ datasets (Zhu et al., 2017a) using the FID scores. For this experiment, we adopt AGGAN (Mejjati et al., 2018) as the state-of-the-art mask-based image translation method, for which we use the official source code with minor modifications to the architecture so that it takes an exemplar as another input. We concatenate the input with an exemplar and forward them into the generator. We also adjust LOMIT to fit the datasets well. Because each dataset is composed of two classes, we do not need to cover multi-attribute translation in those datasets. Therefore, we utilize domain-specific networks (i.e., the shared generator G is now replaced by G_A and G_B, where A and B indicate the respective domains) to increase the model capacity.

For the four tasks of horse2zebra, zebra2horse, apple2orange, and orange2apple, LOMIT achieves better FID scores than AGGAN, as shown in Table 3. We also verify the superior performance of LOMIT through the qualitative comparison with AGGAN in Fig. 4. For example, the AGGAN results in the first and second columns are incorrect translations because the orange and red colors are insufficiently applied. We believe that this difference mainly comes from the use of the exemplar-mask: AGGAN cannot specify a region of the exemplar to be used as the style, whereas LOMIT refines the style by selecting the essential regions of the exemplar and therefore produces better results.

Figure 5: Interaction examples. A user can specify the regions from either or both an input image and an exemplar by manipulating masks.

Moreover, LOMIT performs well even when a given exemplar contains multiple instances. For example, in the translation from zebra to horse, if the exemplar contains both white and brown horses, the baseline method cannot specify which horse should serve as the style. In contrast, LOMIT lets the user choose which instance to reference as the style by masking the exemplar.

6.5 User Intervention using LOMIT

Various translation examples via user interaction.

Fig. 5 demonstrates the broad applicability of LOMIT via human interaction. A user can specify both where to translate and what to transfer by manipulating the masks of the input and the exemplar. Masks with red outlines are the ones manually removed. The models are trained to translate from (a) black, non-smile to blonde, smile, (b) non-facial hair, female to facial hair, male, and (c) young, makeup to old, non-makeup. The leftmost column of each macro column shows the result without any modification, and the other columns show the translation results after modifying the mask of the input or the exemplar. Users can modify the input mask to adjust where to translate and the exemplar mask to decide what to translate. For example, the second and third columns in Fig. 5(a) result from modifying the input-mask. These results verify that a user can choose where to translate by selecting the region to which the extracted style is applied. Meanwhile, the fourth column results from modifying the exemplar-mask. The output maintains the non-smile attribute of the input image because the eye and mouth regions, which carry the smile attribute information, are removed from the exemplar-mask. This demonstrates that a user can choose which style to transfer during translation: the attributes learned during training can be selectively transferred via user interaction on the exemplar-mask.
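The effect of editing the input mask can be illustrated with a simplified pixel-space blend. This is only a sketch under strong assumptions: LOMIT applies its masks through highway adaptive instance normalization rather than by blending pixels, and the functions `blend` and `clear_region` as well as the toy single-channel images are hypothetical.

```python
def blend(input_img, translated_img, mask):
    """Per-pixel blend: mask=1 takes the translated value, mask=0 keeps the input."""
    return [
        [m * t + (1.0 - m) * x for x, t, m in zip(row_x, row_t, row_m)]
        for row_x, row_t, row_m in zip(input_img, translated_img, mask)
    ]


def clear_region(mask, top, left, bottom, right):
    """Return a copy of the mask with a rectangle set to 0 (a user removing a region)."""
    return [
        [0.0 if top <= r < bottom and left <= c < right else v
         for c, v in enumerate(row)]
        for r, row in enumerate(mask)
    ]


# Toy 2x4 single-channel example (values are illustrative).
input_img = [[0.1, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2]]
translated_img = [[0.9, 0.9, 0.9, 0.9], [0.8, 0.8, 0.8, 0.8]]
mask = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

full_result = blend(input_img, translated_img, mask)

# The user removes the top-left pixel from the input mask,
# so that pixel keeps its original value after blending.
edited_mask = clear_region(mask, 0, 0, 1, 1)
edited_result = blend(input_img, translated_img, edited_mask)
```

In this toy setting, removing a region from the input mask leaves that region identical to the input while the remaining masked area is still translated, which mirrors the "where to translate" control described above.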

Necessity of modifying exemplar-mask.

We believe this approach has great potential in diverse computer vision applications because it lets the user finely control both the target region of the translation and the style taken from the exemplar. In particular, the technique is effective when different styles of the same attribute co-exist in an exemplar. In practice, as illustrated in Fig. 6(a) and (b), there are numerous cases in which an exemplar contains distinct styles within a single attribute. For example, the exemplar of Fig. 6(a) contains two women with different hair colors, so the target style extracted from the exemplar is ambiguous. The basic result in the first row shows a mixed color of brown and blonde, while the results in the second and third rows show blonde and brown hair, respectively, because their exemplar-masks are modified to specify the hair color. By explicitly specifying the exact style to transfer, a user can designate a concrete target style, and the model can perform an appropriate translation that reflects what the user wants.
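Why editing the exemplar mask changes the transferred style can be seen in a toy computation of mask-weighted statistics. The sketch assumes, purely for illustration, that a style is summarized by the mask-weighted mean of a single feature over the exemplar; the function name `masked_mean` and all numbers are hypothetical and do not reproduce LOMIT's style encoder.

```python
def masked_mean(values, mask):
    """Mask-weighted mean of a flat list of feature values."""
    weight = sum(mask)
    if weight == 0:
        raise ValueError("mask selects no region")
    return sum(v * m for v, m in zip(values, mask)) / weight


# Toy exemplar: a 'hair color' feature at six locations.
# The first three belong to a blonde person (high values),
# the last three to a brown-haired person (low values).
hair_feature = [0.9, 0.8, 1.0, 0.3, 0.2, 0.4]

both_people = [1, 1, 1, 1, 1, 1]  # unedited mask covers both people
blonde_only = [1, 1, 1, 0, 0, 0]  # user removes the second person
brown_only = [0, 0, 0, 1, 1, 1]   # user removes the first person

mixed_style = masked_mean(hair_feature, both_people)
blonde_style = masked_mean(hair_feature, blonde_only)
brown_style = masked_mean(hair_feature, brown_only)
```

With the unedited mask the statistic falls between the two hair colors, which corresponds to the mixed result in the first row of Fig. 6(a); restricting the mask to one person recovers that person's style.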

Figure 6: Necessity of user interaction on the exemplar mask. In practice, there are many cases in which an exemplar contains multiple styles within a single attribute. Our proposed technique resolves this issue.
Figure 7: Effects of the style mask. From the top, the rows show (a) LOMIT trained without the exemplar mask and (b) LOMIT.

6.6 Additional Experiments and Analysis

In this section, we report additional experiments that analyze LOMIT in depth. Specifically, we conduct the ablation study in Section 6.6.1 and analyze the exemplar-mask in Section 6.6.2. We further perform an experiment on the EmotioNet dataset in Section 6.6.3 to verify that LOMIT can be applied to diverse datasets. Next, we report the comparison of LOMIT with ELEGANT in Section 6.6.4, and lastly, we discuss the learning principle of LOMIT in Section 6.6.5.

         LOMIT    w/o image rec.   w/o content rec.   w/o style loss (first)   w/o style loss (second)   w/o reg.
H & S    21.82    25.16            27.48              22.07                    22.50                     22.33

Table 4: Ablation study results. The leftmost column, in bold face, represents our proposed method, and the remaining columns show the ablation results. This verifies that each of our loss terms plays a role in improving the performance of LOMIT.

6.6.1 Ablation study on various losses

To justify our design, we further perform ablation studies over loss configurations on the CelebA ‘Hair colors’ & ‘Smiling’ (H & S) translation and report the FID scores in Table 4. LOMIT with all losses achieves the best result. The second and third columns show the results of LOMIT trained without the image reconstruction loss and without the content reconstruction loss, respectively. The degraded scores indicate that these losses contribute significantly to the model performance. The performance also degrades without the style losses, as shown in the fourth and fifth columns. Although they matter less than the image reconstruction loss, omitting the style reconstruction losses yields less accurate style embeddings than those of LOMIT. Finally, the rightmost column shows the score of LOMIT trained without the regularization losses. This indicates that accurate mask extraction is important for the model performance, because the regularization losses help properly refine the foreground region of the exemplar.
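The relative effect of each loss can be read off Table 4 by subtracting the FID of the full model. The sketch below uses the FID values reported in Table 4; the configuration labels are descriptive names chosen here, and the two style-loss entries are distinguished only by their column order.

```python
# FID values from Table 4 (lower is better), in column order.
fid = {
    "full LOMIT": 21.82,
    "w/o image reconstruction": 25.16,
    "w/o content reconstruction": 27.48,
    "w/o style loss (4th column)": 22.07,
    "w/o style loss (5th column)": 22.50,
    "w/o regularization": 22.33,
}

baseline = fid["full LOMIT"]

# FID increase caused by removing each loss term.
degradation = {
    name: round(score - baseline, 2)
    for name, score in fid.items()
    if name != "full LOMIT"
}

# Configurations sorted from the most to the least harmful removal.
ranked = sorted(degradation, key=degradation.get, reverse=True)
```

Under this reading, removing the content reconstruction loss costs the most (an FID increase of 5.66), followed by the image reconstruction loss (3.34), while the style and regularization losses each account for less than one FID point.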

6.6.2 Analysis of the exemplar mask

We demonstrate the effects of the exemplar mask in Fig. 7. From the left, the columns show the input image, the corresponding input mask, the exemplar, its corresponding mask, and the output. The first row shows the result of a model trained without the exemplar mask, meaning that the style encoder encodes the entire exemplar as the foreground style and the entire input as the background style. As illustrated in the second and fourth columns, the style masks of LOMIT (row (b)) specify regions more clearly than those of the model without the exemplar mask (row (a)). This shows that the style mask effectively guides the model to distinguish the region of interest while minimizing the distortion of irrelevant regions. Concretely, in the fifth column of (a), the area surrounding the mouth and the sculpture on the left are affected by the style because the mask covers an excessive region. In addition, the reddish face shows that the style extracted from the exemplar includes extraneous information, since the skin tone of the face is irrelevant to the translation (brown, no-smile → black, no-smile). In contrast, the image in the fifth column of (b) not only preserves the irrelevant regions but also applies the style to the relevant regions. That is, by exploiting the exemplar mask, we achieve a better translation result.

Figure 8: The result of action unit translation using the EmotioNet dataset (Fabian Benitez-Quiroz et al., 2016). The figure shows that LOMIT can transfer various AUs from a given exemplar, changing the emotion to follow that exemplar.

6.6.3 Experiment on EmotioNet

Fig. 8 shows the results for the action unit (AU) translation. For training, we use all available AUs (1, 2, 4, 5, 6, 9, 12, 17, 20, 25, 26, and 43) as training labels (for the multi-attribute translation loss), so that the model learns to translate multiple AUs from the exemplar. Each triplet is composed of an input image, an exemplar, and a translated output. For example, the input of (a), containing AUs 12 and 25 (Happy), takes the exemplar whose AUs are 1, 2, 25, and 26 (Surprised). The translated output preserves the identity of the input image while properly transferring the AUs of the exemplar. From the results, we verify that although a number of AUs are sparsely distributed all over the face, LOMIT can perform an elaborate translation based on the local masks.
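The multi-AU training label can be represented as a multi-hot vector over the twelve AUs listed above. This is a minimal sketch; the helper `encode_aus` and the vector layout are assumptions for illustration and not the paper's data pipeline.

```python
# AUs used for training, as listed in the text.
AU_IDS = [1, 2, 4, 5, 6, 9, 12, 17, 20, 25, 26, 43]


def encode_aus(active):
    """Multi-hot label: 1 where the AU is active, 0 elsewhere."""
    unknown = set(active) - set(AU_IDS)
    if unknown:
        raise ValueError(f"unsupported AUs: {sorted(unknown)}")
    return [1 if au in active else 0 for au in AU_IDS]


happy_input = encode_aus({12, 25})               # input of example (a)
surprised_exemplar = encode_aus({1, 2, 25, 26})  # its exemplar

# AUs the translation should add and remove to follow the exemplar.
pairs = list(zip(AU_IDS, happy_input, surprised_exemplar))
to_add = [au for au, x, e in pairs if e and not x]
to_remove = [au for au, x, e in pairs if x and not e]
```

For example (a), the exemplar asks the model to add AUs 1, 2, and 26, to remove AU 12, and to keep AU 25, which both faces share.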

         ELEGANT   LOMIT
H & S    34.30     21.82

Table 5: FID comparison with ELEGANT (Xiao et al., 2018) on CelebA. The results demonstrate the superior performance of LOMIT.

6.6.4 Comparison with ELEGANT

Because ELEGANT (Xiao et al., 2018) can be regarded as one of the state-of-the-art methods, we additionally run ELEGANT using its official source code and compare its FID score with that of LOMIT on the translation of the hair color and smile attributes in CelebA (Liu et al., 2015). LOMIT performs substantially better than ELEGANT, as shown in Table 5. We attribute this to the mask-based technique of LOMIT: it extracts two masks, one specifying the region to which the style is applied and the other extracting the relevant style from a specific region, which enables high-quality, targeted image translation.

6.6.5 Discussion on learning principle of LOMIT

We believe the multimodal translation is achieved by the training objectives of LOMIT. First, the adversarial loss and the multi-attribute translation loss encourage the model to generate, for example, a blonde person: the former reduces the distance between the distribution of the generated images and that of real images with blonde hair, while the latter pushes the generated image to be classified as blonde. Second, the image reconstruction loss and the style reconstruction loss encourage the model to keep the intrinsic style of the exemplar. Specifically, the image reconstruction loss forces a reconstructed image to have the same pixel values as the input image, while the style reconstruction loss forces the style code of the exemplar to be preserved after it is applied to the input image. That is, the style code of each different hair color has to be maintained to minimize the loss.

These aspects allow LOMIT to learn how to translate an image while achieving multimodality, so that LOMIT covers the intra-domain variation even when an unseen style is given by an exemplar. This is supported by Fig. 3, in which every exemplar is sampled from the test dataset.
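The two reconstruction objectives described above can be sketched as simple distances. This section does not state which norm is used, so the mean absolute (L1) distance below is an assumption, as are the function name `l1_distance` and the toy vectors.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Image reconstruction: the reconstructed image should match the input pixels.
input_pixels = [0.2, 0.5, 0.7, 0.1]
reconstructed_pixels = [0.2, 0.4, 0.7, 0.3]
image_rec_loss = l1_distance(input_pixels, reconstructed_pixels)

# Style reconstruction: the style code re-extracted from the translated
# output should match the style code taken from the exemplar.
exemplar_style_code = [1.0, -0.5, 0.3]
recovered_style_code = [0.9, -0.5, 0.6]
style_rec_loss = l1_distance(exemplar_style_code, recovered_style_code)
```

Both terms reach zero only when the reconstruction is exact, so minimizing the style term penalizes a model that collapses different exemplar styles, such as different shades of a hair color, into a single code.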

7 Conclusion

In this work, we addressed the problem of where and what to transfer for unpaired image-to-image translation. We proposed a local mask-based translation model called LOMIT, where the attention networks generate the mask of an input image and that of an exemplar. The mask of the exemplar determines what style to transfer by excluding irrelevant regions and extracting the style from only relevant regions. The other mask of the input determines where to transfer the extracted style. That is, it captures the regions to apply the style while maintaining an original style in the rest (through our highway adaptive instance normalization). LOMIT achieves outstanding results compared to the state-of-the-art methods (Huang et al., 2018; Lee et al., 2018). As future work, we plan to extend our model to other diverse domains of data, such as ImageNet (Deng et al., 2009) and MSCOCO (Lin et al., 2014). We will also extend our approach to video translation to improve the consistency of the translated results of consecutive frames.

8 Appendix

We provide the additional results in Fig. 9. It shows various examples demonstrating the superior performance of LOMIT.

Figure 9: Additional results. CelebA (Liu et al., 2015) data is used for the figure. Each block is composed of the input image, the input-mask, the exemplar, the exemplar-mask, and the output image. The target attribute is written on the top of each block.


  • Arjovsky et al. (2017) Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: ICML
  • Ba et al. (2016) Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
  • Bahng et al. (2018) Bahng H, Yoo S, Cho W, Park DK, Wu Z, Ma X, Choo J (2018) Coloring with words: Guiding image colorization through text-based palette generation. In: ECCV

  • Chang et al. (2018) Chang H, Lu J, Yu F, Finkelstein A (2018) Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In: CVPR
  • Chen et al. (2018) Chen X, Xu C, Yang X, Tao D (2018) Attention-gan for object transfiguration in wild images. In: ECCV
  • Cho et al. (2018) Cho W, Choi S, Park D, Shin I, Choo J (2018) Image-to-image translation via group-wise deep whitening and coloring transformation. arXiv preprint arXiv:1812.09912
  • Choi et al. (2018) Choi Y, Choi M, Kim M, Ha JW, Kim S, Choo J (2018) Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: CVPR
  • Craswell (2009) Craswell N (2009) Mean reciprocal rank. In: Encyclopedia of Database Systems
  • Deng et al. (2009) Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR
  • Fabian Benitez-Quiroz et al. (2016) Fabian Benitez-Quiroz C, Srinivasan R, Martinez AM (2016) Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: CVPR
  • Goodfellow et al. (2014) Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: NIPS
  • Gulrajani et al. (2017) Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: NIPS
  • He et al. (2015) He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: ICCV
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR
  • Heusel et al. (2017) Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NIPS
  • Huang and Belongie (2017) Huang X, Belongie SJ (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV
  • Huang et al. (2018) Huang X, Liu MY, Belongie S, Kautz J (2018) Multimodal unsupervised image-to-image translation. In: ECCV
  • Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML
  • Isola et al. (2017) Isola P, Zhu J, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: CVPR

  • Kim et al. (2017) Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: ICML
  • Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: ICLR
  • Larsen et al. (2016) Larsen ABL, Sønderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: ICML

  • Lee et al. (2018) Lee HY, Tseng HY, Huang JB, Singh M, Yang MH (2018) Diverse image-to-image translation via disentangled representations. In: ECCV
  • Lin et al. (2018) Lin J, Xia Y, Qin T, Chen Z, Liu TY (2018) Conditional image-to-image translation. In: CVPR
  • Lin et al. (2014) Lin TY, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick CL, Dollár P (2014) Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312
  • Liu et al. (2017) Liu MY, Breuel T, Kautz J (2017) Unsupervised image-to-image translation networks. In: NIPS
  • Liu et al. (2015) Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: ICCV
  • Ma et al. (2019) Ma L, Jia X, Georgoulis S, Tuytelaars T, Gool LV (2019) Exemplar guided unsupervised image-to-image translation with semantic consistency. In: ICLR
  • Mao et al. (2017) Mao X, Li Q, Xie H, Lau RY, Wang Z, Smolley SP (2017) Least squares generative adversarial networks. In: ICCV
  • Mejjati et al. (2018) Mejjati YA, Richardt C, Tompkin J, Cosker D, Kim KI (2018) Unsupervised attention-guided image to image translation. arXiv preprint arXiv:1806.02311
  • Miyato et al. (2018) Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. In: ICLR
  • Nam and Kim (2018) Nam H, Kim HE (2018) Batch-instance normalization for adaptively style-invariant neural networks. arXiv preprint arXiv:1805.07925

  • Odena et al. (2016) Odena A, Olah C, Shlens J (2016) Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585
  • Pumarola et al. (2018) Pumarola A, Agudo A, Martinez A, Sanfeliu A, Moreno-Noguer F (2018) Ganimation: Anatomically-aware facial animation from a single image. In: ECCV
  • Selvaraju et al. (2017) Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D, et al. (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV
  • Srivastava et al. (2015) Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: NIPS
  • Szegedy et al. (2016) Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: CVPR
  • Ulyanov et al. (2016) Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: The missing ingredient for fast stylization
  • Xiao et al. (2018) Xiao T, Hong J, Ma J (2018) Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 168–184
  • Yang et al. (2018) Yang C, Kim T, Wang R, Peng H, Kuo CCJ (2018) Show, attend and translate: Unsupervised image translation with self-regularization and attention. arXiv preprint arXiv:1806.06195
  • Zhou et al. (2016) Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: CVPR

  • Zhu et al. (2017a) Zhu JY, Park T, Isola P, Efros AA (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV
  • Zhu et al. (2017b) Zhu JY, Zhang R, Pathak D, Darrell T, Efros AA, Wang O, Shechtman E (2017b) Toward multimodal image-to-image translation. In: NIPS