Unpaired image-to-image translation (in short, image translation) based on generative adversarial networks (GANs) (Goodfellow et al., 2014) aims to transform an input image from one domain to another without using paired data between the domains (Zhu et al., 2017a; Liu et al., 2017; Kim et al., 2017; Choi et al., 2018; Bahng et al., 2018). Such an unpaired setting is inherently multimodal, since a single input image can be mapped to multiple different outputs within a target domain. For example, when translating the hair color of a given image to blonde, the detailed hair region (e.g., upper vs. lower, and partial vs. entire) and the detailed color (e.g., dark vs. light blonde) may vary.
Previous studies have achieved such multimodal outputs either by adding random noise sampled from a pre-defined prior distribution (Zhu et al., 2017b) or by taking a user-selected exemplar image as an additional input that contains the detailed information of an intended target style (Chang et al., 2018). Recent studies (Lin et al., 2018; Ma et al., 2019; Cho et al., 2018), including MUNIT (Huang et al., 2018) and DRIT (Lee et al., 2018), combine these two approaches, achieving state-of-the-art performance by separating (i.e., disentangling) the content and style information of a given image through two different encoder networks.
However, existing exemplar-based methods have several limitations. First, they do not pay attention to which target style to transfer from the exemplar. Instead, they simply extract style information from the entire region of a given exemplar, even though it is likely that only the style of a sub-region of the exemplar should be transferred. The style of the entire exemplar thus tends to be noisy due to the regions irrelevant to the target attribute, degrading the model's ability to reflect only the relevant style contained in the exemplar.
Suppose we translate the hair color of an image using an exemplar image. Since the hair color information is available only in the hair region, the style information extracted from the entire region of the exemplar may contain irrelevant information (e.g., the color of the wall and the texture pattern of the floor), which should not be reflected in the intended image translation. In the end, the noisy style results in erroneous translations that fail to mirror the hair color of the exemplar, as illustrated in Fig. 3.
Second, previous methods do not distinguish different regions of the input image. Even though particular regions should be kept intact during translation, those methods simply transfer the extracted style to the entire region of the input image to obtain the target image. Due to this issue, previous approaches (Huang et al., 2018; Lee et al., 2018) often distort the irrelevant regions of an input image, such as the background. That is, the model should also be aware of where to transfer within the input image.
To tackle these issues, we propose a novel LOcal Mask-based Image Translation approach, called LOMIT, which generates a local, pixel-wise soft binary mask of an exemplar (i.e., the source region from which to extract the style information) to identify what style to transfer, and that of an input image (i.e., the target region to which to apply the extracted style) to identify where to translate. Our algorithm shares the same high-level idea as recent approaches (Pumarola et al., 2018; Chen et al., 2018; Yang et al., 2018; Ma et al., 2019; Mejjati et al., 2018) that leverage an attention mask in image translation. In those approaches, the attention mask, extracted from an input image, determines the target region to which a translation is applied, i.e., where to transfer. We extend these approaches by additionally exploiting a mask for the exemplar, so that LOMIT can decide not only where to translate but also what to transfer.
Once the local masks are obtained, LOMIT extends a recently proposed image translation technique, adaptive instance normalization, using highway networks (Srivastava et al., 2015): it computes the weighted average of the input and the translated pixel values, using the above-mentioned pixel-wise local mask values as per-pixel linear combination weights. LOMIT has the additional advantage of being able to manipulate the computed masks to selectively transfer an intended style, e.g., choosing either a hair region (to transfer the hair color) or a facial region (to transfer the facial expression).
The effectiveness of LOMIT is evaluated on two facial datasets, via quantitative measures such as the Fréchet inception distance (FID) and a user study, as well as qualitative comparisons with other state-of-the-art methods.
2 Basic Setting
We define “content” as common features (an underlying structure) across all domains (e.g., the pose of a face, the location and the shape of the eyes, nose, mouth, and hair), and “style” as a representation of the structure (e.g., background color, facial expression, skin tone, and hair color). As shown in Fig. 1, we assume that an image can be represented as $x = c \otimes s$, where $c$ is a content code in a content space, and $s$ is a style code in a style space. The operator $\otimes$ combines and converts the content code $c$ and the style code $s$ into a complete image $x$.
By considering the local mask indicating the relevant region (or simply, the foreground) to extract the style from or to apply it to, we further assume that $s$ is decomposed into $(s_f, s_b)$, where $s_f$ is the style code extracted from the foreground region of the exemplar and $s_b$ is that from the background region. Separating a style representation into $s_f$ and $s_b$ amounts to disentangling the style feature. The pixel-wise soft binary mask $m$ of an image $x$ is represented as a matrix with the same spatial resolution as $x$. Each entry of $m$ lies between 0 and 1, indicating the degree to which the corresponding pixel belongs to the foreground. Then, the local foreground and background regions, $x_f$ and $x_b$ of $x$, are obtained as
$$x_f = m \odot x, \qquad x_b = (1 - m) \odot x,$$
where $\odot$ indicates an element-wise multiplication. Finally, our assumption is extended to $x = c \otimes (s_f, s_b)$, where $c$, $s_f$, and $s_b$ are obtained by the content encoder $E_c$ and the style encoder $E_s$, respectively, which are all shared across multiple domains in LOMIT, i.e., $c = E_c(x)$, $s_f = E_s(x_f)$, and $s_b = E_s(x_b)$.
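As a concrete illustration, the foreground/background decomposition above can be sketched as follows (a minimal NumPy sketch; the function name and array shapes are illustrative assumptions, not LOMIT's actual implementation):

```python
import numpy as np

def decompose(image, mask):
    """Split an image into soft foreground/background regions via a pixel-wise mask.

    image: (H, W, C) array; mask: (H, W) array with entries in [0, 1].
    """
    m = mask[..., None]        # broadcast the mask over the channel axis
    x_f = m * image            # foreground region: m (elementwise) x
    x_b = (1.0 - m) * image    # background region: (1 - m) (elementwise) x
    return x_f, x_b

image = np.ones((4, 4, 3))
mask = np.full((4, 4), 0.75)   # soft mask: 75% foreground everywhere
x_f, x_b = decompose(image, mask)
```

Because the mask is soft, the two regions always sum back to the original image exactly.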
It is essential for LOMIT to properly learn to generate the local masks involved in image translation. To this end, we propose to combine the mask generation networks with our novel highway adaptive instance normalization, as described in Section 3.2.
3 Local Image Translation Model
We first denote $x_A$ and $x_B$ as images from domains $A$ and $B$, respectively. As shown in Fig. 2, LOMIT converts a given image $x_A$ to $x_{AB}$ and vice versa ($x_B$ to $x_{BA}$), where $G$ denotes the decoder networks and HAdaIN denotes our proposed local mask-based highway adaptive instance normalization layer, described in detail in Section 3.2.
For brevity, we omit the domain index notation unless needed for clarification.
3.1 Local Mask Extraction
We extract the local masks of the input and the exemplar images, as they are effectively involved in image translation. Concretely, LOMIT utilizes the local masks when (1) acquiring the disentangled style features and (2) specifying where to apply the style. For example, if LOMIT is conducting a hair color translation given the input image and the exemplar, the local masks should cover the hair regions of the two images, because the styles to replace and to transfer exist in those hair regions.
As shown in Fig. 2(a), given an image $x$, the attention networks first encode the content feature $c$ via the content encoder $E_c$. The obtained $c$ is then forwarded into the rest of the attention networks, yielding the mask $m$ that specifies the relevant region with respect to the target style to translate. In practice, this process is applied to the images in each domain independently in a similar manner.
3.2 Highway Adaptive Instance Normalization
Adaptive instance normalization (AdaIN) is an effective style transfer technique (Huang and Belongie, 2017). Generally, it matches the channel-wise statistics, e.g., the mean and the variance, of the activation map of an input image with those of a style image. In the context of image translation, MUNIT (Huang et al., 2018) extends AdaIN so that the target mean and variance are obtained as the outputs of the trainable functions $f_\gamma$ and $f_\beta$ of a given style code $s$, i.e.,
$$\mathrm{AdaIN}(z, s) = f_\gamma(s)\,\frac{z - \mu(z)}{\sigma(z)} + f_\beta(s),$$
where each of $f_\gamma$ and $f_\beta$ is defined as a multi-layer perceptron (MLP). Different MLPs are used for the foreground and the background styles because their style information differs significantly, e.g., facial attributes vs. background texture.
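For intuition, AdaIN with style-derived statistics can be sketched as follows (a hedged NumPy sketch; the function signature and channel-first layout are assumptions for illustration, not the paper's code):

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    """Normalize each channel of z (C, H, W) and rescale with style statistics.

    gamma, beta: (C,) target scale and shift, e.g., MLP outputs of a style code.
    """
    mu = z.mean(axis=(1, 2), keepdims=True)     # per-channel mean
    sigma = z.std(axis=(1, 2), keepdims=True)   # per-channel std
    z_norm = (z - mu) / (sigma + eps)
    return gamma[:, None, None] * z_norm + beta[:, None, None]

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 8, 8))
out = adain(z, gamma=np.array([2.0, 3.0]), beta=np.array([1.0, -1.0]))
```

After the transformation, each channel's statistics match the style-derived targets (up to the numerical epsilon).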
As pointed out earlier, previous approaches apply such a transformation globally across the entire region of an image, which may unnecessarily distort irrelevant regions. Hence, we formulate our local mask-based highway AdaIN (HAdaIN) as
$$\mathrm{HAdaIN}(x; s_f, s_b) = m \odot \mathrm{AdaIN}(x, s_f) + (1 - m) \odot \mathrm{AdaIN}(x, s_b),$$
where the first term corresponds to the local region of the input image translated by the foreground style, while the second term corresponds to the complementary region, in which the original style of the input should be kept as it is.
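A minimal sketch of this per-pixel highway combination (NumPy; the names and the two-branch parameterization are illustrative assumptions under the description above):

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    # Channel-wise normalization followed by style-derived rescaling.
    mu = z.mean(axis=(1, 2), keepdims=True)
    sigma = z.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (z - mu) / (sigma + eps) + beta[:, None, None]

def hadain(z, mask, fg_params, bg_params):
    """Per-pixel highway: mask-weighted mix of foreground- and background-stylized maps.

    z: (C, H, W) content activations; mask: (H, W) soft mask in [0, 1];
    fg_params / bg_params: (gamma, beta) pairs derived from the two style codes.
    """
    fg = adain(z, *fg_params)   # translated by the foreground style
    bg = adain(z, *bg_params)   # kept close to the input's own style
    m = mask[None]              # broadcast over channels
    return m * fg + (1.0 - m) * bg

rng = np.random.default_rng(1)
z = rng.normal(size=(1, 4, 4))
fg_params = (np.array([2.0]), np.array([0.5]))
bg_params = (np.array([1.0]), np.array([0.0]))
out_fg = hadain(z, np.ones((4, 4)), fg_params, bg_params)   # mask = 1: pure foreground branch
out_bg = hadain(z, np.zeros((4, 4)), fg_params, bg_params)  # mask = 0: pure background branch
```

At the mask extremes, the output reduces to one branch or the other; soft mask values blend the two per pixel.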
4 Training Objectives
This section describes each of our loss terms in the objective function used for training our model.
4.1 Style and Content Reconstruction Loss
The foreground style of the translated output should be close to that of the exemplar, while the background style of the translated output should be close to that of the original input image. We formulate these criteria as the following style reconstruction loss terms:
From the perspective of content information, the content feature of an input image should be consistent with that of its translated output, which is represented as the following content reconstruction loss:
By encouraging the content features to be the same, the content reconstruction loss maintains the content information of the input image, even after performing a translation.
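These reconstruction criteria amount to simple penalties between encoded features; below is a minimal sketch (NumPy, with toy stand-ins for the encoder outputs; all names are illustrative, and the L1 form is an assumption in line with common practice):

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, a common form for such reconstruction losses."""
    return np.abs(a - b).mean()

# Toy stand-ins for encoded features (in practice, outputs of E_s and E_c).
s_f_exemplar = np.array([0.2, 0.8, 0.5])   # foreground style of the exemplar
s_f_output = np.array([0.2, 0.7, 0.5])     # re-encoded foreground style of the output
c_input = np.ones((4, 4))                  # content feature of the input
c_output = np.ones((4, 4))                 # content feature of the translated output

style_loss = l1(s_f_output, s_f_exemplar)    # pulls output style toward the exemplar's
content_loss = l1(c_output, c_input)         # keeps content unchanged by translation
```

A zero content loss means the translation preserved the underlying structure exactly.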
4.2 Image Reconstruction Loss
As an effective supervision approach in an unpaired image translation setting, we adopt the image-level cyclic consistency loss (Zhu et al., 2017a) between an input image and its output through two consecutive image translations, $x_A \to x_{AB} \to x_{ABA}$ (or $x_B \to x_{BA} \to x_{BAB}$), i.e.,
Meanwhile, similar to previous studies (Huang et al., 2018; Lee et al., 2018), we perform not only the inter-domain translation but also the intra-domain translation, e.g., $x_A \to x_{AA}$. This intra-domain translation should work similarly to an autoencoder (Larsen et al., 2016), so the corresponding loss term can be written as
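The cyclic consistency idea can be illustrated with toy translators that are exact inverses (illustrative stand-ins only; the real mappings are the learned networks):

```python
import numpy as np

def translate_ab(x):
    """Toy stand-in for the A -> B translation."""
    return x + 1.0

def translate_ba(x):
    """Toy stand-in for the B -> A translation (exact inverse of the above)."""
    return x - 1.0

x_a = np.arange(6.0).reshape(2, 3)
x_aba = translate_ba(translate_ab(x_a))    # A -> B -> A round trip
cycle_loss = np.abs(x_aba - x_a).mean()    # image-level cyclic consistency

# Intra-domain translation should behave like an autoencoder: x_a -> x_aa = x_a.
identity_loss = np.abs(x_a - x_a).mean()
```

When the two translators invert each other perfectly, the cycle loss vanishes; training pushes the networks toward that regime.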
4.3 Domain Adversarial Loss
To reconstruct the real-data distribution via our model, we adopt a domain adversarial loss by introducing the discriminator networks $D$. Among the loss terms proposed in the original GAN (Goodfellow et al., 2014), LSGAN (Mao et al., 2017), and WGAN-GP (Arjovsky et al., 2017; Gulrajani et al., 2017), we choose WGAN-GP, which empirically works best, as the adversarial training loss. That is, our adversarial loss is written as
where $\epsilon$ is sampled from the uniform distribution $U(0, 1)$ and $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ is an interpolated sample between a real image $x$ and a generated image $\tilde{x}$. We also apply the loss proposed in PatchGAN (Isola et al., 2017; Zhu et al., 2017a).
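For intuition, the WGAN-GP critic objective can be sketched with a toy linear critic, whose input gradient has a closed form (illustrative assumptions throughout; the penalty weight of 10 follows Gulrajani et al.):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)

def critic(x):
    """Toy linear critic D(x) = w . x, so grad_x D(x) = w for every x."""
    return w @ x

x_real = rng.normal(size=8)
x_fake = rng.normal(size=8)

eps = rng.uniform()                          # epsilon ~ U(0, 1)
x_hat = eps * x_real + (1.0 - eps) * x_fake  # interpolated sample
grad_norm = np.linalg.norm(w)                # ||grad_x D(x_hat)||, exact for a linear critic

lam = 10.0                                   # gradient penalty coefficient
d_loss = critic(x_fake) - critic(x_real) + lam * (grad_norm - 1.0) ** 2
```

The penalty pushes the critic's gradient norm at interpolated points toward 1, enforcing the 1-Lipschitz constraint softly.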
4.4 Multi-Attribute Translation Loss
We use an auxiliary classifier (Odena et al., 2016) to cover multi-attribute translation with a single shared model, similar to StarGAN (Choi et al., 2018). The auxiliary classifier, which shares its parameters with the discriminator except for the last layer, classifies the domain of a given image. In detail, its loss term is defined as
where $y$ is the domain label of an input image. Similar to the concept of weakly supervised learning (Zhou et al., 2016; Selvaraju et al., 2017), this loss term supervises the local mask to determine the proper region of the corresponding domain through the HAdaIN module, allowing our model to extract the style from the proper region of the exemplar.
4.5 Mask Regularization Losses
We impose several additional regularization losses on local mask generation to improve the overall image generation performance as well as the interpretability of the generated mask.
The first regularization minimizes the difference between the mask values of pixels having similar content information. This helps the local mask consistently capture a semantically meaningful region as a whole, e.g., capturing the entire hair region even when the lighting conditions and the hair color vary significantly within the exemplar. To this end, we design this regularization term to minimize:
where $m \in [0,1]^{hw}$ is the flattened vector of pixel-wise mask values, $C \in \mathbb{R}^{hw \times l}$ is the matrix of $l$-dimensional pixel-wise content vectors, and $\mathbf{1}$ is a vector whose elements are all ones. The first term is the distance matrix of all pairs of pixel-wise mask values in $m$, and the second term is the cosine similarity matrix of all pairs of $l$-dimensional pixel-wise content vectors. Note that we backpropagate the gradients generated by this regularization term only through $m$ to train the attention networks, but not through $C$, which prevents the regularization from affecting the encoder $E_c$.
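The regularizer can be sketched as a penalty on pairwise mask differences weighted by content similarity (a NumPy sketch; the flattened shapes and names are illustrative assumptions based on the description above):

```python
import numpy as np

def mask_content_regularizer(mask, content, eps=1e-8):
    """Penalize differing mask values for pixels with similar content.

    mask: (P,) soft mask values per pixel; content: (P, D) content vectors.
    """
    diff = np.abs(mask[:, None] - mask[None, :])               # pairwise |m_i - m_j|
    c = content / (np.linalg.norm(content, axis=1, keepdims=True) + eps)
    cos = c @ c.T                                              # pairwise cosine similarity
    return (diff * cos).mean()

# A uniform mask over identical content pixels incurs zero penalty ...
uniform = mask_content_regularizer(np.full(4, 0.5), np.ones((4, 3)))
# ... while splitting the mask over the same identical content is penalized.
split = mask_content_regularizer(np.array([0.0, 0.0, 1.0, 1.0]), np.ones((4, 3)))
```

Pixels whose content vectors are dissimilar (low cosine similarity) contribute little, so the mask remains free to differ across semantically distinct regions.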
4.6 Full Loss
Finally, our full loss can be written as
where the coefficients balance the respective loss terms. Note that our training process contains both the intra-domain translations, $x_A \to x_{AA}$ and $x_B \to x_{BB}$, and the inter-domain translations, $x_A \to x_{AB}$ and $x_B \to x_{BA}$. Regarding the training procedure of LOMIT, all components of the generator, i.e., the style encoder, the content encoder, the mask generation networks, and the decoder, are updated as a whole with the generator loss, and thus our model does not require a separate training step for each component of the generator.
5 Implementation Details
5.1 Model architectures
Similar to MUNIT (Huang et al., 2018), the content encoder $E_c$ is composed of two strided convolutional layers and four residual blocks (He et al., 2016). Following previous approaches (Huang and Belongie, 2017; Nam and Kim, 2018), instance normalization (IN) (Ulyanov et al., 2016) is used across all the layers of the content encoder.
The style encoder $E_s$ consists of four strided convolutional layers, a global average pooling layer, and a fully connected layer. The style codes are eight-dimensional vectors. Also, the style encoders share the first few layers, as those layers detect low-level features. To maintain the style information, we do not use IN in the style encoder.
The decoder has four residual blocks and two convolutional layers, each with an upsampling layer. We use layer normalization (LN) (Ba et al., 2016) in the residual blocks so as not to lose inter-channel differences: LN normalizes the entire feature map, maintaining the differences between channels.
5.2 Training Details
We utilize the Adam optimizer (Kingma and Ba, 2015). Following the state-of-the-art approach (Choi et al., 2018) in multi-attribute translation, we load the data with a horizontal flip with probability 0.5. For stable training, we update the generator once in every five updates of the discriminator (Gulrajani et al., 2017). We initialize part of the weights from a normal distribution and apply the He initialization (He et al., 2015) to the others. Also, we use a batch size of eight and a learning rate of 0.0001. We linearly decay the learning rate by half every 10,000 iterations after 100,000 iterations. All the models used in the experiments are trained for 200,000 iterations on a single NVIDIA TITAN Xp GPU, taking about 30 hours each.
6 Experiments

We first describe the datasets used in our experiments in Section 6.1 and the baseline methods in Section 6.2. Next, we explain the evaluation metrics and the comparison with the baselines in Sections 6.3 and 6.4, respectively. Lastly, we provide examples of practical uses of LOMIT in Section 6.5, and we further perform an intensive analysis of LOMIT with additional experiments in Section 6.6.

6.1 Datasets
The CelebA (Liu et al., 2015) dataset consists of 202,599 face images of celebrities with 40 attribute annotations per image. We choose 10 attributes (i.e., black_hair, blond_hair, brown_hair, smiling, goatee, mustache, no_beard, male, heavy_makeup, wearing_lipstick) that would convey meaningful local masks. We randomly select 10,000 images for testing and use the others for training. Images are center-cropped and scaled down to 128×128.
The EmotioNet (Fabian Benitez-Quiroz et al., 2016) dataset contains 975,000 in-the-wild images of facial expressions, each annotated with 12 action units (AUs). Each AU denotes an activation of a specific facial muscle (e.g., jaw drop and nose wrinkler). We crop each image using a face detector (https://github.com/ageitgey/face_recognition) and resize it to 128×128. We use 10,000 images for testing and 200,000 images for training.
Wild image dataset.
We also perform experiments using wild images. Concretely, we exploit the zebra2horse and apple2orange datasets (Zhu et al., 2017a), which contain larger intra-domain variation compared to CelebA and EmotioNet, because the number of objects and the locations where the objects appear are diverse.
6.2 Baseline Methods
MUNIT (Huang et al., 2018) decomposes a given image into domain-invariant content and domain-specific style features and exploits AdaIN (Huang and Belongie, 2017) as a translation method. Incorporating a random sampling scheme for training latent style features, MUNIT attempts to reflect the multimodal nature of various style domains. We train MUNIT on the CelebA dataset (Liu et al., 2015) and report the results for comparison.
DRIT (Lee et al., 2018) employs two encoders, which encode the domain-invariant content information and the domain-specific style information, respectively. DRIT combines the separated content and style features by concatenating them. The model is trained with a content discriminator that ensures the content space is shared across domains. Its loss functions and training strategies are similar to those of MUNIT.
AGGAN (Mejjati et al., 2018) is one of the state-of-the-art methods in mask-based image translation. AGGAN applies its translation result only to the foreground region of an input image, so that the background region remains intact. Because AGGAN does not take an exemplar, we modify its setting to compare with LOMIT. The major differences from our method are the absence of the HAdaIN module and of the exemplar mask.
ELEGANT (Xiao et al., 2018) encodes a face image into depth-wise disentangled features in a shared space, each subpart of which indicates a corresponding face attribute, such as ‘Smiling’. To perform a translation, ELEGANT substitutes one subpart of the features of an input image with that of an exemplar. Note that ELEGANT exploits neither the mask for the exemplar nor that for the input image.
6.3 Evaluation Metrics
Fréchet inception distance (FID).
The FID (Heusel et al., 2017) is one of the widely used metrics for evaluating the performance of generative models. The FID exploits the features extracted from intermediate layers of a pre-trained network. The distance between the feature distribution of the real images and that of the generated images is computed as the distance between two multivariate Gaussian distributions:
$$d^2 = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),$$
where $(\mu_r, \Sigma_r)$ are the mean and covariance of the features of the real images and $(\mu_g, \Sigma_g)$ are those of the generated images.
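Under a simplifying diagonal-covariance assumption (the general formula requires a matrix square root), the distance can be sketched as follows (illustrative only, not a drop-in replacement for a full FID implementation):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances.

    Simplification of ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)):
    for diagonal S1, S2, the trace term reduces to an elementwise expression.
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

# Identical distributions have zero distance; shifting one mean by 1 adds 1.
same = fid_diagonal(np.zeros(4), np.ones(4), np.zeros(4), np.ones(4))
shifted = fid_diagonal(np.zeros(1), np.ones(1), np.ones(1), np.ones(1))
```

In practice, the statistics are computed over Inception features of real and generated image sets, and the full (non-diagonal) covariance version is used.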
The FID is limited in evaluating how well a generated image maintains the consistency of an input image and the characteristics of an exemplar. In response, as a complementary metric to the FID, we perform a human evaluation as a formal user study, whose details are described in Section 6.4. Since we ask users to rank the images generated by different models, we leverage the mean reciprocal rank (MRR) (Craswell, 2009) as a metric for quantifying the human evaluation. Given a set of questions, the MRR is the average of the multiplicative inverses of the ranks assigned in the answers, i.e.,
$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$
where $N$ is the number of questions and $\mathrm{rank}_i$ indicates the rank of a given model on the $i$-th question. We use this metric to give priority to the highest rank, because a lower rank has less influence on the overall score in the MRR. The MRR ranges between 0 and 1, and the higher the score, the better the results.
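The MRR computation itself is straightforward (a minimal sketch; `ranks` holds the 1-based rank a model received on each question):

```python
def mean_reciprocal_rank(ranks):
    """Average of reciprocal ranks; ranks are 1-based, and a lower rank is better."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# A model ranked first on every question scores 1.0, while an always-second
# model scores 0.5, so the top rank dominates the overall score.
best = mean_reciprocal_rank([1, 1, 1])
second = mean_reciprocal_rank([2, 2, 2])
mixed = mean_reciprocal_rank([1, 2, 4])
```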
6.4 Comparison with Baselines
6.4.1 Evaluation on CelebA dataset
As shown in Fig. 3, we compare our model with the baseline models on the CelebA dataset (Liu et al., 2015), where we train the baselines using publicly available implementations. The macro columns from the left indicate the translations from (a) brown to blonde hair, (b) non-facial hair to facial hair, and (c) non-smile to smile. LOMIT shows outstanding performance compared to the baseline models, both in reflecting the distinct style of an exemplar and in keeping the irrelevant regions, such as the background and the face in the hair color translation, intact.
Concretely, we observe that the noise in the style information extracted from the background undesirably affects the images generated by MUNIT and DRIT (notably in the third column of both (a) and (b), and the first and fifth columns of (c)), while LOMIT does not suffer from such influence. Besides, we also find that MUNIT and DRIT apply the style information to the irrelevant regions of the input images, distorting the color and the tone of both the face and the background. This evidences that the mask for the exemplar should be properly incorporated, and that LOMIT is superior to the compared models with regard to accurate style application. These findings justify the initial motivation for and the necessity of the local masks along with the proposed HAdaIN module of LOMIT.
We also compare against StarGAN (Choi et al., 2018), a widely used, state-of-the-art method, to verify the added benefits of LOMIT. In Fig. 3, given a corresponding input, the images with the black outline in the first row (A) represent the outputs of StarGAN. This demonstrates that StarGAN can only generate a unimodal output determined by a multi-hot input vector indicating a target attribute. On the other hand, LOMIT generates diverse outputs reflecting each corresponding exemplar.
Table 1: FID scores per attribute group (lower is better); the rightmost value in each row corresponds to LOMIT.

| H & S   | 31.52 | 26.94 | 21.82 |
| F.H & G | 44.81 | 29.57 | 19.31 |
| M & Y   | 37.78 | 33.68 | 26.01 |
Comparisons of the FID.
We compare LOMIT with the baseline models using the FID (Heusel et al., 2017), a renowned metric for measuring the performance of generative models. A low FID indicates that the generated images are of better quality and span a more diverse spectrum. To obtain the score for each model, we first generate images using the test dataset. Next, we build subsets for each attribute group with the generated images; e.g., given the attribute group F.H & G, the dataset is separated into four subsets, (0,0), …, (1,1), in which each number indicates the binary label. Finally, we acquire the image features of each subset using the pretrained Inception-v3 (Szegedy et al., 2016) and compute the statistics of the generated feature distribution. Similarly, the feature statistics of the real images are obtained using the entire training dataset.
Table 1 lists the comparison results. In all the class subsets, LOMIT generates images that are more diverse and of better quality than the other methods, as indicated by the lower scores. We attribute this competitive result to the capability of LOMIT in applying the extracted style to the adequate region of the input image while keeping the irrelevant regions intact. On the other hand, DRIT and MUNIT apply the style to an unnecessarily large area of the input image, so the feature distributions of their generated images end up far from the real distribution, as quantified by their high FID scores.
Table 2: MRR scores of the user study (higher is better); the rightmost value in each row corresponds to LOMIT.

| Q1. | H & S   | 0.42 | 0.52 | 0.89 |
|     | F.H & G | 0.37 | 0.52 | 0.94 |
|     | M & Y   | 0.41 | 0.46 | 0.97 |
| Q2. | H & S   | 0.41 | 0.56 | 0.86 |
|     | F.H & G | 0.37 | 0.71 | 0.75 |
|     | M & Y   | 0.44 | 0.48 | 0.92 |
To evaluate the effectiveness of the proposed method, we conduct a user study comparing LOMIT with the other baseline models. First, out of the test dataset, we construct subsets in each of the attribute groups, e.g., M & Y (0,0), …, (1,1). Second, we randomly sample 100 images from each subset and generate 10,000 images per subset from every pair of these images. Lastly, we randomly sample 10 images from those generated images (40,000, in the case of M & Y) per test run. A test run is composed of ten instances per attribute group, and for each question, an input, an exemplar, and the corresponding outputs of each baseline model are presented. Regarding participants, we recruited 31 people diverse in age (from 22 to 40) and expertise (20 non-experts in machine learning), so the results reflect opinions from people with diverse viewpoints. Each time, participants rank all methods for a given question.
As for the evaluation metric, we again report the mean reciprocal rank (MRR) (Craswell, 2009), which averages the reciprocals of the given rankings. The MRR ranges between 0 and 1, and the higher the score, the better the results.
As the first question, we ask users to evaluate which model best keeps the regions irrelevant to the style. The results are summarized in Table 2. In every attribute group, LOMIT records outstanding performance in keeping the extraneous regions untouched. On the other hand, the baselines apply the style to the entire region of the image, yielding excessive translations and resulting in a worse MRR than LOMIT.
The second question asks the users which method reflects the exemplar style best. As shown in Table 2, users favor LOMIT over the baselines. We believe this is because the mask for the exemplar specifies the regions from which to extract the style, enabling the model to obtain better representations of the targeted style.
6.4.2 Evaluation on wild image dataset
To verify that LOMIT can handle datasets with large variations within a class, we additionally perform comparison experiments on the horse2zebra and apple2orange datasets (Zhu et al., 2017a) using the FID scores. For this experiment, we adopt AGGAN (Mejjati et al., 2018) as the state-of-the-art mask-based image translation method, for which we use the official source code with minor modifications to the architecture so that it takes an exemplar as another input: we concatenate the input with an exemplar and forward them into the generator. We also adjust LOMIT to fit these datasets. Because each dataset consists of a binary class, we do not need to cover multi-attribute translation; therefore, we utilize domain-specific networks (i.e., the shared generator $G$ is replaced by $G_A$ and $G_B$, where $A$ and $B$ indicate the respective domains) to increase the model capacity.
For the four tasks of horse2zebra, zebra2horse, apple2orange, and orange2apple, LOMIT outperforms AGGAN in overall FID scores, as shown in Table 3. We also verify the superior performance of LOMIT via qualitative comparisons with AGGAN, as illustrated in Fig. 4. For example, the results of AGGAN in the first and second columns show incorrect translations because the orange and the red are insufficiently colored. We believe these results basically stem from the use of the exemplar mask: AGGAN cannot specify a region in an exemplar to be used as the style, while LOMIT is capable of refining the style by selecting the essential regions of the exemplar, thus producing better results.
Moreover, LOMIT performs well even when a given exemplar contains multiple instances. For example, in the translation from zebra to horse, if a given exemplar contains multiple horses, including both white and brown ones, the baseline method cannot specify which horse should be applied as the style. On the other hand, LOMIT tackles this problem and enables the user to choose which instance to reference as the style via masking on the exemplar.
6.5 User Intervention using LOMIT
Various translation examples via user interaction.
Fig. 5 demonstrates the versatile applicability of LOMIT via human interaction. A user can specify where to translate as well as what to transfer by manipulating the masks of the input and the exemplar. Masks with red outlines are the ones manually removed. The models are trained to translate from (a) black hair, non-smile to blonde hair, smile, (b) non-facial hair, female to facial hair, male, and (c) young, makeup to old, non-makeup. The leftmost column of each macro column shows the result without any modification, and the other columns show the translation results after modifying the mask of the input or the exemplar. Users can modify the input mask to adjust where to translate, while they can change the exemplar mask to decide what to transfer. For example, the second and third columns in Fig. 5(a) are the results of modifying the input mask; they verify that a user can choose where to translate by selecting the region to which the extracted style is applied. Meanwhile, the fourth column shows the result of modifying the exemplar mask: it maintains the non-smile attribute of the input image by removing the eye and mouth regions of the exemplar mask, which contain the smile attribute information. This demonstrates that a user can choose which style to transfer during translation, such that the attributes learned during training can be selectively transferred via user interaction on the exemplar mask.
Necessity of modifying the exemplar mask.
We believe this approach bears great potential in diverse computer vision applications by allowing the user to fine-control the target region for the translation as well as the style of the exemplar. In particular, the technique can be effectively applied when different styles of the same attribute co-exist in an exemplar. Practically, as illustrated in Fig. 6(a) and (b), there are numerous cases in which an exemplar contains distinct styles within a single attribute. For example, the exemplar of Fig. 6(a) contains two women with different hair colors, so the target style extracted from the exemplar can be ambiguous. The basic result shown in the first row exhibits a mixed color of brown and blonde, while the results in the second and third rows show blonde and brown hair, respectively, because their exemplar masks are modified to specify the hair color. By explicitly specifying the exact style to transfer, a user can designate a concrete target style, and the model can conduct an appropriate translation reflecting what the user wants.
6.6 Additional Experiments and Analysis
In this section, we report additional experiments that intensively analyze LOMIT. Specifically, we conduct an ablation study in Section 6.6.1 and an analysis of the exemplar mask in Section 6.6.2. We further perform an experiment on the EmotioNet dataset in Section 6.6.3 to verify that LOMIT can be utilized on diverse datasets. Next, we report the comparison of LOMIT with ELEGANT in Section 6.6.4, and lastly, we discuss the learning principle of LOMIT in Section 6.6.5.
Table 4: FID scores of the ablation study on the CelebA ‘Hair colors’ & ‘Smiling’ (H & S) translation; from left to right: full LOMIT, without the image reconstruction loss, without the content reconstruction loss, without each of the style reconstruction losses, and without the regularization losses.

| H & S | 21.82 | 25.16 | 27.48 | 22.07 | 22.50 | 22.33 |
6.6.1 Ablation study on various losses
To justify our model architecture, we further perform ablation studies on the loss configurations using the CelebA ‘Hair colors’ & ‘Smiling’ (H & S) translation, and report the FID scores in Table 4. LOMIT with all the losses achieves the best result. The second and third columns represent the results of LOMIT trained without the image reconstruction loss and without the content reconstruction loss, respectively, indicating that those losses significantly contribute to the model performance. Meanwhile, the performance also degrades without the style reconstruction losses, as shown in the fourth and fifth columns. Even though their contribution is smaller than that of the image reconstruction loss, the lack of the style reconstruction losses engenders relatively incorrect style embeddings compared to the full LOMIT. Finally, the rightmost column shows the score of LOMIT trained without the regularization losses, indicating that accurate mask extraction is important for boosting the model performance, because the regularization losses play a role in properly refining the foreground region of the exemplar.
6.6.2 Analysis of the exemplar mask
We demonstrate the effects of the mask for the exemplar in Fig. 7. From the left, each column of the figure shows the input image, the corresponding input mask, the exemplar, its corresponding mask, and the output. The first row shows the result of a model trained without incorporating the mask for the exemplar, meaning that the style encoder encodes the entire region of the exemplar as the foreground style, as well as the entire region of the input as the background style. As illustrated in the second and fourth columns of the figure, the style masks of LOMIT (row (b)) specify regions more clearly than those of the ablated model (row (a)). This reveals that the style mask effectively regulates the model to focus on the region of interest while minimizing the distortion of irrelevant regions. Concretely, as can be seen in the fifth column of row (a), the area surrounding the mouth and the sculpture on the left are affected by the style because the mask covers excessively large regions. Moreover, the reddish face in that image shows that the extracted style reflects extraneous regions of the exemplar, since the skin tone of a face is irrelevant to the (brown, no-smile) → (black, no-smile) translation. In contrast, the image in the fifth column of row (b) not only preserves the irrelevant regions but also applies the style to the relevant regions. That is, by exploiting the mask for the exemplar, we achieve a better translation result.
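The role of the two masks can be made concrete with a sketch of the mask-gated combination underlying our highway adaptive instance normalization. The NumPy code below is a simplified illustration under our own assumptions (the channel-wise `adain` and the soft foreground/background blending are schematic, not the exact formulation in the main text):

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    # Normalize each channel of the content feature map over its spatial
    # dimensions, then rescale with per-channel style statistics.
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (content - mean) / (std + eps) + beta[:, None, None]

def highway_adain(content, mask, fg_style, bg_style):
    # content:  (C, H, W) feature map of the input image
    # mask:     (1, H, W) soft mask in [0, 1] marking where to apply the style
    # fg_style: (gamma, beta) extracted from the exemplar's foreground
    # bg_style: (gamma, beta) extracted from the input's background
    fg = adain(content, *fg_style)
    bg = adain(content, *bg_style)
    # Highway-style gating: the mask routes each location through either
    # the exemplar's style branch or the input's own style branch.
    return mask * fg + (1.0 - mask) * bg
```

With a mask of all ones the output follows the exemplar's style everywhere; with all zeros the input keeps its own style, which is why a well-localized mask leaves irrelevant regions intact.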
6.6.3 Experiment on EmotioNet
Fig. 8 shows the results of the action unit (AU) translation. For training, we use all available AUs (1, 2, 4, 5, 6, 9, 12, 17, 20, 25, 26, and 43) as training labels (for the multi-attribute translation loss), so that the model can learn to translate multiple AUs from the exemplar. Each triplet consists of an input image, an exemplar, and a translated output. For example, the input of (a), containing AUs 12 and 25 (Happy), takes an exemplar whose AUs are 1, 2, 25, and 26 (Surprised). The translated output preserves the identity of the input image while properly transferring the AUs of the exemplar. From these results, we verify that even though a number of AUs are sparsely distributed over the face, LOMIT can perform an elaborate translation based on the local masks.
Table 5: FID comparison between ELEGANT and LOMIT on the CelebA 'Hair colors' & 'Smiling' (H & S) translation (lower is better).

| | ELEGANT | LOMIT |
| H & S | 34.30 | 21.82 |
6.6.4 Comparison with ELEGANT
Because ELEGANT (Xiao et al., 2018) can be regarded as one of the state-of-the-art methods, we additionally train it using its official source code and compare its FID score with that of LOMIT on the translation of the hair color and smile attributes in CelebA (Liu et al., 2015). As shown in Table 5, LOMIT performs substantially better than ELEGANT. We attribute this superior performance to the mask-based technique of LOMIT: it extracts two masks, one specifying the region to which the style is applied and the other extracting a relevant style from a specific region of the exemplar, which enables high-quality, targeted image translation.
6.6.5 Discussion on learning principle of LOMIT
We believe the multimodal translation is achieved by the training objectives of LOMIT. First, the adversarial loss and the multi-attribute translation loss encourage the model to generate, say, a blonde-haired person: the former reduces the distance between the distribution of the generated images and that of real images containing blonde hair, while the latter forces a generated image with blonde hair to be classified as blonde. Second, the image reconstruction loss and the style reconstruction loss encourage the model to keep the intrinsic style of the exemplar. Specifically, the image reconstruction loss forces a reconstructed image to have the same pixel values as the input image, while the style reconstruction loss keeps the style code of the exemplar intact after it is applied to the input image. That is, the style code of each different hair color must be maintained to minimize the loss.
These aspects allow LOMIT to properly learn how to translate an image while achieving multimodality, such that LOMIT can cover the intra-domain variation even when an unseen style is given from an exemplar; note that each exemplar in Fig. 3 is sampled from the test dataset.
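For clarity, the full objective discussed above can be summarized schematically as follows, where the weighting coefficients $\lambda$ are placeholders (the exact terms and weights follow the loss definitions in the main text):

```latex
\mathcal{L}_{\text{total}} =
    \mathcal{L}_{\text{adv}}
  + \lambda_{\text{cls}}   \, \mathcal{L}_{\text{cls}}
  + \lambda_{\text{img}}   \, \mathcal{L}_{\text{img}}
  + \lambda_{\text{style}} \, \mathcal{L}_{\text{style}}
  + \lambda_{\text{reg}}   \, \mathcal{L}_{\text{reg}}
```

Here $\mathcal{L}_{\text{adv}}$ and $\mathcal{L}_{\text{cls}}$ drive the translated image toward the target domain, while the reconstruction terms $\mathcal{L}_{\text{img}}$ and $\mathcal{L}_{\text{style}}$ preserve content and style codes, and $\mathcal{L}_{\text{reg}}$ regularizes the masks.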
In this work, we addressed the problem of where and what to transfer for unpaired image-to-image translation. We proposed a local mask-based translation model called LOMIT, in which attention networks generate the mask of an input image and that of an exemplar. The mask of the exemplar determines what style to transfer by excluding irrelevant regions and extracting the style from only the relevant regions. The mask of the input determines where to transfer the extracted style; that is, it captures the regions to which the style is applied while maintaining the original style in the rest (through our highway adaptive instance normalization). LOMIT achieves outstanding results compared to the state-of-the-art methods (Huang et al., 2018; Lee et al., 2018). As future work, we plan to extend our model to other diverse domains of data, such as ImageNet (Deng et al., 2009) and MSCOCO (Lin et al., 2014). We will also extend our approach to video translation to improve the consistency of translated results across consecutive frames.
We provide the additional results in Fig. 9. It shows various examples demonstrating the superior performance of LOMIT.
- Arjovsky et al. (2017) Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: ICML
- Ba et al. (2016) Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
- Bahng et al. (2018) Bahng H, Yoo S, Cho W, Park DK, Wu Z, Ma X, Choo J (2018) Coloring with words: Guiding image colorization through text-based palette generation. In: ECCV
- Chang et al. (2018) Chang H, Lu J, Yu F, Finkelstein A (2018) Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In: CVPR
- Chen et al. (2018) Chen X, Xu C, Yang X, Tao D (2018) Attention-gan for object transfiguration in wild images. In: ECCV
- Cho et al. (2018) Cho W, Choi S, Park D, Shin I, Choo J (2018) Image-to-image translation via group-wise deep whitening and coloring transformation. arXiv preprint arXiv:1812.09912
- Choi et al. (2018) Choi Y, Choi M, Kim M, Ha JW, Kim S, Choo J (2018) Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: CVPR
- Craswell (2009) Craswell N (2009) Mean reciprocal rank. In: Encyclopedia of Database Systems
- Deng et al. (2009) Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR
- Fabian Benitez-Quiroz et al. (2016) Fabian Benitez-Quiroz C, Srinivasan R, Martinez AM (2016) Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: CVPR
- Goodfellow et al. (2014) Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: NIPS
- Gulrajani et al. (2017) Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: NIPS
- He et al. (2015) He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: ICCV
- He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR
- Heusel et al. (2017) Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NIPS
- Huang and Belongie (2017) Huang X, Belongie SJ (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV
- Huang et al. (2018) Huang X, Liu MY, Belongie S, Kautz J (2018) Multimodal unsupervised image-to-image translation. In: ECCV
- Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML
- Isola et al. (2017) Isola P, Zhu J, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: CVPR
- Kim et al. (2017) Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: ICML
- Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: ICLR
- Larsen et al. (2016) Larsen ABL, Sønderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: ICML
- Lee et al. (2018) Lee HY, Tseng HY, Huang JB, Singh M, Yang MH (2018) Diverse image-to-image translation via disentangled representations. In: ECCV
- Lin et al. (2018) Lin J, Xia Y, Qin T, Chen Z, Liu TY (2018) Conditional image-to-image translation. In: CVPR
- Lin et al. (2014) Lin TY, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick CL, Dollár P (2014) Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312
- Liu et al. (2017) Liu MY, Breuel T, Kautz J (2017) Unsupervised image-to-image translation networks. In: NIPS
- Liu et al. (2015) Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: ICCV
- Ma et al. (2019) Ma L, Jia X, Georgoulis S, Tuytelaars T, Gool LV (2019) Exemplar guided unsupervised image-to-image translation with semantic consistency. In: ICLR
- Mao et al. (2017) Mao X, Li Q, Xie H, Lau RY, Wang Z, Smolley SP (2017) Least squares generative adversarial networks. In: ICCV
- Mejjati et al. (2018) Mejjati YA, Richardt C, Tompkin J, Cosker D, Kim KI (2018) Unsupervised attention-guided image to image translation. arXiv preprint arXiv:1806.02311
- Miyato et al. (2018) Miyato T, Kataoka T, Koyama M, Yoshida Y (2018) Spectral normalization for generative adversarial networks. In: ICLR
- Nam and Kim (2018) Nam H, Kim HE (2018) Batch-instance normalization for adaptively style-invariant neural networks. arXiv preprint arXiv:1805.07925
- Odena et al. (2016) Odena A, Olah C, Shlens J (2016) Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585
- Pumarola et al. (2018) Pumarola A, Agudo A, Martinez A, Sanfeliu A, Moreno-Noguer F (2018) Ganimation: Anatomically-aware facial animation from a single image. In: ECCV
- Selvaraju et al. (2017) Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D, et al. (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV
- Srivastava et al. (2015) Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: NIPS
- Szegedy et al. (2016) Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: CVPR
- Ulyanov et al. (2016) Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022
- Xiao et al. (2018) Xiao T, Hong J, Ma J (2018) Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In: ECCV
- Yang et al. (2018) Yang C, Kim T, Wang R, Peng H, Kuo CCJ (2018) Show, attend and translate: Unsupervised image translation with self-regularization and attention. arXiv preprint arXiv:1806.06195
- Zhou et al. (2016) Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: CVPR
- Zhu et al. (2017a) Zhu JY, Park T, Isola P, Efros AA (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV
- Zhu et al. (2017b) Zhu JY, Zhang R, Pathak D, Darrell T, Efros AA, Wang O, Shechtman E (2017b) Toward multimodal image-to-image translation. In: NIPS