Show, Attend and Translate: Unsupervised Image Translation with Self-Regularization and Attention

06/16/2018 ∙ by Chao Yang, et al.

Image translation between two domains is a class of problems that aim to learn a mapping from an input image in the source domain to an output image in the target domain. It has been applied to numerous tasks, such as data augmentation, domain adaptation, and unsupervised training. When paired training data is not accessible, image translation becomes an ill-posed problem. We constrain the problem with the assumption that the translated image needs to be perceptually similar to the original image and also appear to be drawn from the new domain, and propose a simple yet effective image translation model consisting of a single generator trained with a self-regularization term and an adversarial term. We further notice that existing image translation techniques are agnostic to the subjects of interest and often introduce unwanted changes or artifacts to the input. Thus we propose to add an attention module that predicts an attention map to guide the image translation process. The module learns to attend to key parts of the image while keeping everything else unaltered, essentially avoiding undesired artifacts or changes. The predicted attention map also opens the door to applications such as unsupervised segmentation and saliency detection. Extensive experiments and evaluations show that our model, while being simpler, achieves significantly better performance than existing image translation methods.




I Introduction

(a) Input image (b) Predicted Attention Map
(c) Final result (d) CycleGAN [80]
Fig. 1: Horse-to-zebra image translation. Our model learns to predict an attention map (b) and translates the horse to zebra while keeping the background untouched (c). By comparison, CycleGAN [80] significantly alters the appearance of the background together with the horse (d).


Many computer vision problems can be cast as an image-to-image translation problem: the task is to map an image of one domain to a corresponding image of another domain. For example, image colorization can be considered as mapping gray-scale images to corresponding images in RGB space [77]; style transfer can be viewed as translating images in one style to corresponding images with another style [19, 29, 18]. Other tasks falling into this category include semantic segmentation [46], super-resolution [39], image manipulation [28], etc. Another important application of image translation is related to domain adaptation and unsupervised learning: with the rise of deep learning, it is now considered crucial to have large labeled training datasets. However, labeling and annotating such large datasets is expensive and thus not scalable. An alternative is to use synthetic or simulated data for training, whose labels are trivial to acquire [82, 67, 60, 57, 53, 48, 30, 11]. Unfortunately, learning from synthetic data can be problematic and most of the time does not generalize to real-world data, due to the data distribution gap between the two domains. Furthermore, because deep neural networks are capable of learning small details, the trained model is likely to over-fit to the synthetic domain. In order to close this gap, we can either find mappings or domain-invariant representations at feature level [8, 17, 47, 65, 68, 21, 9, 1, 33] or learn to translate images from one domain to another domain to create “fake” labeled data for training [7, 80, 43, 39, 44, 75]. In the latter case, we usually hope to learn a mapping that preserves the labels as well as the attributes we care about.

Typically there exist two settings for image translation given two domains $X$ and $Y$. The first setting is supervised, where example image pairs $(x, y)$ are available. This means that, in the training data, for each image $x_i \in X$ there is a corresponding $y_i \in Y$, and we wish to find a translator $G: X \to Y$ such that $G(x_i) \approx y_i$. Representative translation systems in the supervised setting include domain-specific works [15, 24, 37, 62, 46, 70, 71, 77] and the more general Pix2Pix [28, 69]. However, paired training data comes at a premium. For example, for image stylization, obtaining paired data requires lengthy artist authoring and is extremely expensive. For other tasks like object transfiguration, the desired output is not even well defined.

Therefore, we focus on the second setting, which is unsupervised image translation. In the unsupervised setting, $X$ and $Y$ are two independent sets of images, and we do not have access to paired examples showing how an image $x_i \in X$ could be translated to an image $y_i \in Y$. Our task is then to seek an algorithm that can learn to translate between $X$ and $Y$ without desired input-output examples. The unsupervised image translation setting has greater potential because of its simplicity and flexibility but is also much more difficult. In fact, it is a highly under-constrained and ill-posed problem, since there could be infinitely many mappings between $X$ and $Y$: from the probabilistic view, the challenge is to learn a joint distribution of images in different domains. As stated by coupling theory [41], there exists an infinite set of joint distributions that yield the two given marginal distributions in two different domains. Therefore, additional assumptions and constraints are needed for us to exploit the structure and supervision necessary to learn the mapping.

Existing works that address this problem assume that there are certain relationships between the two domains. For example, CycleGAN [80] assumes cycle-consistency and the existence of an inverse mapping $F$ that translates from $Y$ to $X$. It then trains two generators which are bijections and inverse to each other, and uses an adversarial constraint [20] to ensure the translated image appears to be drawn from the target domain and a cycle-consistency constraint to ensure the translated image can be mapped back to the original image using the inverse mapping ($F(G(x)) \approx x$ and $G(F(y)) \approx y$). UNIT [43], on the other hand, assumes a shared latent space, meaning a pair of images in different domains can be mapped to some shared latent representation. The model trains two generators $G_X$, $G_Y$ with shared layers. Both $G_X$ and $G_Y$ map an input to itself, while the domain translation is realized by letting $x_i$ go through part of $G_X$ and part of $G_Y$ to get $y_i$. The model is trained with an adversarial constraint on the image, a variational constraint on the latent code [35, 56], and another cycle-consistency constraint.

Assuming cycle consistency ensures 1-1 mapping and avoids mode collapse [61], and both models generate reasonable image translation and domain adaptation results. However, there are several issues with existing approaches. First, such approaches are usually agnostic to the subjects of interest and there is little guarantee that they reach the desired output. In fact, approaches based on cycle-consistency [80, 42] could theoretically find any arbitrary 1-1 mapping that satisfies the constraints, and this renders the training unstable and the results random. This is problematic in many image translation scenarios. For example, when translating from a horse image to a zebra image, most likely we only wish to draw the particular black-white stripes on top of the horses while keeping everything else unchanged. However, what we observe is that existing approaches [80, 43] do not differentiate the horse/zebra from the scene background, and the colors and appearances of the background often significantly change during translation (Fig. 1). Second, most of the time we only care about one-way translation, while existing methods like CycleGAN [80] and UNIT [42] always require training two generators of bijections. This is not only cumbersome but it is also hard to balance the effects of the two generators. Third, there is a sensitive trade-off between the faithfulness of the translated image to the input image and how closely it resembles the new domain, and it requires excessive manual tuning of the weight between the adversarial loss and the reconstruction loss to get satisfying results.

To address the aforementioned issues, we propose a simpler yet more effective image translation model that consists of a single generator with an attention module. We first re-consider what the desired outcome of an image translation task should be: most of the time the desired output should not only resemble the target domain but also preserve certain attributes and share similar visual appearance with the input. For example, in the case of horse-zebra translation [80], the output zebra should be similar to the input horse in terms of the scene background, the location and the shape of the zebra and horse, etc. In the domain adaptation task that translates MNIST [38] to USPS [13], we expect the output to be visually similar to the input in terms of the shape and structure of the digit such that it preserves the label. Based on this observation, our model uses a single generator that maps $X$ to $Y$ and is trained with a self-regularization term that enforces perceptual similarity between the output and the input, together with an adversarial term that enforces the output to appear as if drawn from $Y$. Furthermore, in order to focus the translation on key components of the image and avoid introducing unnecessary changes to irrelevant parts, we propose to add an attention module that predicts a probability map as to which part of the image it needs to attend to when translating. Such probability maps, which are learned in a completely unsupervised fashion, could further facilitate segmentation or saliency detection (Fig. 1). Finally, we propose an automatic and principled way to find the optimal weight between the self-regularization term and the adversarial term such that we do not have to manually search for the best hyper-parameter.

Our model does not rely on the cycle-consistency or shared-representation assumptions, and it only learns a one-way mapping. Although the constraint may oversimplify certain scenarios, we found that the model works surprisingly well. With the attention module, our model learns to separate the key objects from the background context and is able to correct artifacts and remove unwanted changes from the translated results. We apply our model to a variety of image translation and domain adaptation tasks and show that our model is not only simpler but also works better than existing methods, achieving superior qualitative and quantitative performance. To demonstrate its application in real-world tasks, we show our model can be used to improve the accuracy of face 3D morphable model [6] prediction by augmenting the training data of real images with adapted synthetic images.

II Related Work

Generative adversarial networks (GANs): Generative image modeling and synthesis using the GAN framework [20] has made remarkable progress recently. The basic idea of GAN training is to train a generator and a discriminator jointly such that the generator produces realistic images that confuse the discriminator. It is known that the vanilla GAN suffers from instability in training. Several techniques have been proposed to stabilize the training process and enable it to scale to higher resolution images, such as DCGAN [54], energy-based GAN [79], Wasserstein GAN (WGAN) [61, 2], WGAN-GP [22], BEGAN [4], LSGAN [49] and the Progressive GANs [31]. In our work, adversarial training is the fundamental element which ensures that the output sample from the generator appears to be drawn from the target domain.

Image translation: Image translation can be seen as generating an image in the target domain conditioned on an image in the source domain. Similar problems of conditional image generation include text to image translation [76, 55], super resolution [32, 14, 39], style transfer [19, 29, 40, 26], etc. Based on the availability of paired training data, image translation can be either supervised (paired) or unsupervised (unpaired). Isola et al. [28] first propose a unified framework called Pix2Pix for paired image-to-image translation based on conditional GANs. Wang et al. [69] further extend the framework to generate high-resolution images by using deeper, multi-scale networks and improved training losses. The work of [16] uses a variational U-Net instead of a GAN for conditional image generation. UNIT [42] and BiCycleGAN [81] incorporate latent code embedding into existing frameworks and enable generating randomly sampled translation results. On the other hand, when paired training data is not available, additional constraints such as the cycle-consistency loss are employed [80, 27]. Such a constraint enforces an image to map to another domain and back to itself to ensure 1-1 mapping between the two domains. However, such techniques heavily rely on the “laziness” of the generator network and often introduce artifacts or unwanted changes to the results. Our model leverages recent advances in neural network training and employs the perceptual-based loss [29, 78] as self-regularization, such that cycle-consistency becomes unnecessary and we can also obtain more accurate translation results.

Attention: Recently, attention mechanisms have been successfully introduced in many applications in computer vision and language processing, e.g., image captioning [73], text to image generation [74], visual question answering [72], saliency detection [36], machine translation [3] and speech recognition [10]. An attention mechanism helps a model focus on the relevant portion of the input to resolve the corresponding output without any supervision. In machine translation [3], it attends to relevant words in the source language to predict the current output in the target language. To generate an image from text [74], it attends to different words for the corresponding sub-region of the image. Inversely, for image captioning [73], image sub-regions are attended to for the next generated word. In the same spirit, we propose to use an attention module to attend to the region of interest for the image translation task in an unsupervised way.

III Our Method

Fig. 2: Model overview. Our generator $G$ consists of a vanilla generator $G_0$ and an attention branch $G_{attn}$. We train the model using self-regularization perceptual loss and adversarial loss.

We begin by explaining our model for unsupervised image translation. Let $X$ and $Y$ be two image domains; our goal is to train a generator $G_\theta: X \to Y$, where $\theta$ are the function parameters. For simplicity, we omit $\theta$ and use $G$ instead. We are given unpaired samples $x \in X$ and $y \in Y$, and the unsupervised setting assumes that $x$ and $y$ are independently drawn from the marginal distributions $P_X$ and $P_Y$. Let $y' = G(x)$ denote the translated image; the key requirement is that $y'$ should appear to be drawn from domain $Y$, while preserving the low-level visual characteristics of $x$. The translated images can be further used for other downstream tasks such as unsupervised learning. However, in our case, we decouple image translation from its applications.

Based on the requirements described, we propose to learn $\theta$ by minimizing the following loss:

$$\mathcal{L}_G = \ell_{adv}(G(x), Y) + \lambda \, \ell_{reg}(x, G(x))$$

Here $G(x) = G_{attn}(x) \otimes G_0(x) + (1 - G_{attn}(x)) \otimes x$, where $G_0$ is the vanilla generator and $G_{attn}$ is the attention branch. $G_0$ outputs a translated image while $G_{attn}$ predicts a probability map that is used to composite $G_0(x)$ with $x$ to get the final output. The first part of the loss, $\ell_{adv}$, is the adversarial loss on the image domain that makes sure that $G(x)$ appears to come from domain $Y$. The second part of the loss, $\ell_{reg}$, makes sure that $G(x)$ is visually similar to $x$. In our case, $\ell_{adv}$ is given by a discriminator $D$ trained jointly with $G$, and $\ell_{reg}$ is measured with perceptual loss. We illustrate the model in Fig. 2.

The model architectures: Our model consists of a generator $G$ and a discriminator $D$. The generator $G$ has two branches: the vanilla generator $G_0$ and the attention branch $G_{attn}$. $G_0$ translates the input $x$ as a whole to generate a similar image $G_0(x)$ in the new domain, and $G_{attn}$ predicts a probability map $G_{attn}(x)$ as the attention mask. $G_{attn}(x)$ has the same size as $x$ and each pixel is a probability value between 0 and 1. In the end, we composite the final image $G(x)$ by adding up $x$ and $G_0(x)$ based on the attention mask.
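The compositing step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: images are plain nested lists, the function name `composite` is hypothetical, and the per-pixel convex blend (attention weight on the translated image, one minus that weight on the input) is assumed from the description of "adding up" the two images based on the mask.

```python
def composite(x, translated, attention):
    """Blend the input with the translated image using an attention mask.

    x, translated: images as nested lists of shape [H][W][C].
    attention: mask of shape [H][W] with values in [0, 1].
    Assumed blend: out = a * translated + (1 - a) * x at every pixel.
    """
    out = []
    for row_x, row_t, row_a in zip(x, translated, attention):
        out_row = []
        for px, pt, a in zip(row_x, row_t, row_a):
            # Same attention weight is applied to every channel of the pixel.
            out_row.append([a * t + (1.0 - a) * v for v, t in zip(px, pt)])
        out.append(out_row)
    return out


if __name__ == "__main__":
    x = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]           # 1x2 input image
    translated = [[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]  # 1x2 translated image
    attention = [[1.0, 0.0]]                            # attend only to pixel 0
    print(composite(x, translated, attention))
```

Where the mask is 1 the output takes the translated pixel, and where it is 0 the input pixel passes through unchanged, which is how the background can stay untouched.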

$G_0$ is based on a Fully Convolutional Network (FCN) and leverages properties of convolutional neural networks, such as translation invariance and parameter sharing. Similar to [28, 80], the generator is built with three components: a down-sampling front-end to reduce the size, followed by multiple residual blocks [23], and an up-sampling back-end to restore the original dimensions. The down-sampling front-end consists of two convolutional blocks, each with a stride of 2. The intermediate part contains nine residual blocks that keep the height/width constant, and the up-sampling back-end consists of two deconvolutional blocks, also with a stride of 2. Each convolutional layer is followed by batch normalization and ReLU activation, except for the last layer whose output is in the image space. Using down-sampling at the beginning increases the receptive field of the residual blocks and makes it easier to learn the transformation at a smaller scale. Another modification is that we adopt dilated convolutions in all residual blocks, and set the dilation factor to 2. Dilated convolutions use spaced kernels, enabling them to compute each output value with a wider view of the input without increasing the number of parameters or the computational burden.
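To make the size bookkeeping concrete, the following sketch traces the spatial resolution through the three stages using the standard convolution and transposed-convolution output-size formulas. The kernel size (3), paddings, and `output_padding` are assumptions chosen so that sizes halve, stay constant, and double as the text describes; they are not stated in the text.

```python
def conv_out(n, k, s=1, p=0, d=1):
    """Output length of a convolution (kernel k, stride s, padding p, dilation d)."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1


def deconv_out(n, k, s=1, p=0, output_padding=0):
    """Output length of a transposed convolution without dilation."""
    return (n - 1) * s - 2 * p + (k - 1) + output_padding + 1


def effective_kernel(k, d):
    """Input span covered by one dilated kernel; parameter count stays k."""
    return d * (k - 1) + 1


def generator_sizes(n=256):
    """Resolution after each stage: two stride-2 convs, nine residual
    blocks with dilation 2, two stride-2 deconvs (hyper-parameters assumed)."""
    sizes = [n]
    for _ in range(2):                      # down-sampling front-end
        n = conv_out(n, k=3, s=2, p=1)
        sizes.append(n)
    for _ in range(9):                      # residual blocks keep the size
        n = conv_out(n, k=3, s=1, p=2, d=2)
    sizes.append(n)
    for _ in range(2):                      # up-sampling back-end
        n = deconv_out(n, k=3, s=2, p=1, output_padding=1)
        sizes.append(n)
    return sizes


if __name__ == "__main__":
    print(generator_sizes(256))    # resolution trace through the generator
    print(effective_kernel(3, 2))  # span of a 3-tap kernel with dilation 2
```

With these assumed settings a 3-tap kernel with dilation 2 spans 5 input positions while still holding 3 weights per axis, which is the "wider view without more parameters" property mentioned above.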

$G_{attn}$ consists of the initial layers of the VGG-19 network [64] (up to conv3_3), followed by two deconvolutional blocks. The final layer is a convolutional layer with a sigmoid that outputs a single-channel probability map. During training, the VGG-19 layers are warm-started with weights pretrained on ImageNet.


For the discriminator, we use a five-layer convolutional network. The first three layers have a stride of 2 and are followed by two convolution layers with stride 1, which effectively down-samples the input three times. The output is a vector of real/fake predictions in which each value corresponds to a patch of the image. Classifying each patch as real/fake is known as PatchGAN, and has been shown to work better than a global GAN [80, 28].
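As a rough check of what "each value corresponds to a patch" means, the sketch below computes the size of the prediction grid and the receptive field of each prediction for five layers with strides 2, 2, 2, 1, 1. The kernel size of 4 and padding of 1 are assumptions (they are common PatchGAN settings, not values given in the text), and the function names are illustrative.

```python
def conv_out(n, k, s, p):
    """Output length of a convolution with kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


def patch_grid_and_receptive_field(n=256, k=4, p=1, strides=(2, 2, 2, 1, 1)):
    """Return (grid side length, receptive field of one prediction in pixels)."""
    rf, jump = 1, 1
    for s in strides:
        n = conv_out(n, k, s, p)
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current step
        jump *= s              # step between neighbouring outputs, in input pixels
    return n, rf


if __name__ == "__main__":
    grid, rf = patch_grid_and_receptive_field()
    print(f"{grid}x{grid} predictions, each seeing a {rf}x{rf} patch")
```

Under these assumed hyper-parameters a 256x256 input yields a grid of overlapping patch predictions rather than a single global score.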

Adversarial loss: A Generative Adversarial Network [20] plays a two-player min-max game to update the networks $G$ and $D$. $G$ learns to translate the image $x$ to $G(x)$, which appears as if it is from $Y$, while $D$ learns to distinguish $G(x)$ from $y$, which is a real image drawn from $Y$. The parameters of $D$ and $G$ are updated alternately. The discriminator $D$ updates its parameters by maximizing the following objective:

$$\ell_D = \mathbb{E}_{y \sim P_Y}\left[\log D(y)\right] + \mathbb{E}_{x \sim P_X}\left[\log\left(1 - D(G(x))\right)\right]$$

The adversarial loss used to update the generator $G$ is defined as:

$$\ell_{adv}(G(x), Y) = \mathbb{E}_{x \sim P_X}\left[\log\left(1 - D(G(x))\right)\right]$$

By minimizing the loss function, the generator $G$ learns to create a translated image that fools the network $D$ into classifying the image as drawn from $Y$.
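The alternating objectives can be illustrated numerically. The sketch below assumes the standard min-max GAN formulation of [20]; whether the authors used exactly this form or a variant is not confirmed by the text. The discriminator outputs are passed in as plain probabilities, and the function names and the clamping constant `EPS` are illustrative.

```python
import math

EPS = 1e-12  # clamp to avoid log(0); an implementation detail assumed here


def _mean_log(values):
    return sum(math.log(min(max(v, EPS), 1.0)) for v in values) / len(values)


def discriminator_objective(d_real, d_fake):
    """Quantity the discriminator maximizes (standard GAN form, assumed).

    d_real: D's outputs on real target-domain images.
    d_fake: D's outputs on translated images.
    """
    return _mean_log(d_real) + _mean_log([1.0 - p for p in d_fake])


def generator_adv_loss(d_fake):
    """Quantity the generator minimizes under the same min-max game."""
    return _mean_log([1.0 - p for p in d_fake])


if __name__ == "__main__":
    # A confident, correct discriminator scores higher than an undecided one.
    print(discriminator_objective([0.9, 0.8], [0.1, 0.2]))
    print(discriminator_objective([0.5, 0.5], [0.5, 0.5]))
    # The generator's loss drops when D rates translated images as real.
    print(generator_adv_loss([0.9]), generator_adv_loss([0.1]))
```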

Self-regularization loss: Theoretically, adversarial training can learn a mapping $G$ that produces outputs identically distributed as the target domain $Y$. However, if the capacity is large enough, a network can map the input images to any random permutation of images in the target domain. Thus, adversarial losses alone cannot guarantee that the learned function $G$ maps the input to the desired output. To further constrain the learned mapping such that it is meaningful, we argue that $G$ should preserve visual characteristics of the input image. In other words, the output $G(x)$ and the input $x$ need to share perceptual similarities, especially regarding the low-level features. Such features may include color, edges, shape, objects, etc. We impose this constraint with the self-regularization term, which is modeled by minimizing the distance between the translated image $y' = G(x)$ and the input $x$: $\ell_{reg}(x, y') = d(x, y')$. Here $d$ is some distance function, which can be $\ell_1$, $\ell_2$, SSIM, etc. However, recent research suggests that a perceptual distance based on a pre-trained network corresponds much better to human perception of similarity than traditional distance measures [78]. In particular, we define the perceptual loss as:

$$\ell_{reg}(x, y') = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \left( F^l(x)_{hw} - F^l(y')_{hw} \right) \right\|_2^2$$

Here $F$ is VGG pretrained on ImageNet used to extract the neural features; we use $l$ to represent each layer, and $H_l$, $W_l$ are the height and width of feature $F^l$. We extract neural features with $F$ across multiple layers, compute the difference at each location $(h, w)$ of $F^l$ and average over the feature height and width. We then scale it with the layer-wise weight $w_l$. We did extensive experiments to try different combinations of feature layers and obtained the best results by only using the first three layers of VGG and setting $w_1, w_2, w_3$ to be … respectively. This conforms to the intuition that we would like to preserve the low-level traits of the input during translation. Note that this may not always be true (such as in texture transfer), but it is a hyper-parameter that could be easily adjusted based on different problem settings. We also experimented with using different pre-trained networks such as AlexNet to extract neural features as suggested by [78] but did not observe much difference in results.
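The computation described above (per-location feature difference, averaged over height and width, scaled by a layer weight, summed over layers) can be sketched without any deep-learning library by treating the extracted features as given. The squared L2 norm over channels is an assumption in the style of [78], the layer weights in the example are placeholders rather than the values used in the experiments, and `perceptual_distance` is a hypothetical name.

```python
def perceptual_distance(feats_x, feats_y, weights):
    """Sum over layers of the spatially averaged, weighted squared feature difference.

    feats_x, feats_y: one feature map per layer, each a nested list [H][W][C].
    weights: one scalar weight per layer (placeholder values in the demo below).
    """
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        height, width = len(fx), len(fx[0])
        layer_sum = 0.0
        for row_x, row_y in zip(fx, fy):
            for vec_x, vec_y in zip(row_x, row_y):
                # Squared L2 norm of the weighted channel-wise difference.
                layer_sum += sum((w * (a - b)) ** 2 for a, b in zip(vec_x, vec_y))
        total += layer_sum / (height * width)
    return total


if __name__ == "__main__":
    # Two toy "layers": a 1x1 map with 2 channels and a 1x2 map with 1 channel.
    fx = [[[[1.0, 2.0]]], [[[1.0], [3.0]]]]
    fy = [[[[0.0, 0.0]]], [[[1.0], [1.0]]]]
    print(perceptual_distance(fx, fy, weights=[1.0, 0.5]))
```

In a real setup the feature maps would come from the first few layers of a pretrained network applied to the input and to the translated image; only the aggregation is shown here.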

Training scheme: In our experiments, we found that training the attention branch and the vanilla generator branch jointly from the start is difficult, as it is hard to balance the learned translation and mask. In practice, we train the two branches separately. First, we train the vanilla generator $G_0$ without the attention branch. After it converges, we train the attention branch $G_{attn}$ while keeping the trained generator $G_0$ fixed. In the end, we jointly fine-tune them with a smaller learning rate.

Adaptive weight induction: As in other image translation methods, there is a trade-off between resemblance to the new domain and faithfulness to the original image. In our model, it is determined by the weight $\lambda$ of the self-regularization term relative to the image adversarial term. If $\lambda$ is too large, the translated image will be close to the input but will not look like the new domain. If $\lambda$ is too small, the translated image will fail to retain the visual traits of the input. Previous approaches usually decide the weight heuristically. Here we propose an adaptive scheme to search for the best $\lambda$: we start by setting $\lambda = 0$, which means we only use the adversarial constraint to train the generator. Then we gradually increase $\lambda$. This would lead to the decrease of the adversarial loss as the output would shift away from $Y$ to $X$, which makes it easier for $D$ to classify. We stop increasing $\lambda$ when the adversarial loss sinks below some threshold. We then keep $\lambda$ constant and continue to train the network until convergence. Using the adaptive weight induction scheme avoids manual tuning of $\lambda$ for each specific task and gives results that are similar to both the input $x$ and the new domain $Y$. Note that we repeat this process both when training $G_0$ and when training $G_{attn}$.
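The search procedure above amounts to a simple loop. The sketch below is only an outline under stated assumptions: `train_step` is a hypothetical callable that trains for some iterations with the given weight and returns the current adversarial loss, and the increment `step`, the `threshold`, and the `max_steps` safety cap are illustrative parameters that the text does not specify.

```python
def induce_weight(train_step, threshold, step=0.1, max_steps=1000):
    """Increase the self-regularization weight until the adversarial loss
    reported by train_step falls below threshold, then return the weight.

    train_step(lam) -> float is a hypothetical hook standing in for a
    stretch of real training with weight lam.
    """
    lam = 0.0                      # start with the adversarial term only
    for _ in range(max_steps):
        adv_loss = train_step(lam)
        if adv_loss < threshold:   # stopping rule described in the text
            break
        lam += step                # otherwise keep strengthening the regularizer
    return lam


if __name__ == "__main__":
    # Toy stand-in for training: the reported loss falls as the weight grows.
    def toy_step(lam):
        return 10.0 - lam

    print(induce_weight(toy_step, threshold=5.0, step=1.0))
```

After the loop returns, training would continue with the returned weight held fixed.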

Analysis: Our model is related to CycleGAN in that if we assume 1-1 mapping, we can define an inverse mapping $F: Y \to X$ such that $F(G(x)) = x$. This satisfies the constraints of CycleGAN in that the cycle-consistency loss is zero. This shows that our learned mapping belongs to the set of possible mappings given by CycleGAN. On the other hand, although CycleGAN tends to learn a mapping such that the visual distance between $G(x)$ and $x$ is small, possibly due to the cycle-consistency constraint, it does not guarantee that the perceptual distance between $G(x)$ and $x$ is minimized. Compared with UNIT, if we add another constraint that $G(y) = y$, then our model is a special case of the UNIT model where all layers of the two generators are shared, which leads to a single generator $G$. In this case, the cycle-consistency constraint is implicit as … and …. However, we observe that adding the additional self-mapping constraint for domain $Y$ does not improve the results.

Even though our approach assumes the perceptual distance between $x$ and its corresponding $y$ is small, it generalizes well to tasks where the input and output domains are significantly different, such as translation of photo to map, day to night, etc., as long as our assumption generally holds. For example, in the case of photo to map, the park (photo) is labeled as green (map) and the water (photo) is labeled as blue (map), which provides certain low-level similarities. Experiments show that even without the attention branch, our model produces results that are consistently similar to or better than those of other methods. This indicates that the cycle-consistency assumption may not be necessary for image translation. Note that our approach is a meta-algorithm, and we could potentially improve the results by using new or more advanced components. For example, the generator and discriminator could be easily replaced with the latest GAN architectures such as LSGAN [50] or WGAN-GP [22], or augmented with spectral normalization [51]. We may also improve the results by employing a more specific self-regularization term that is fine-tuned on the datasets we work on.

IV Results

(a) Input (b) Initial trans (c) Attention map (d) Final result (e) UNIT [42] (f) CycleGAN [80]
Fig. 3: Image translation results of horse to zebra [28] and comparison with UNIT and CycleGAN.

We tested our model on a variety of datasets and tasks. In the following, we show the qualitative results of image translation, as well as quantitative results in several domain adaptation settings. In our experiments, all images are resized to 256×256. We use the Adam solver [34] to update the model weights during training. In order to reduce model oscillation, we update the discriminators using a history of generated images rather than only the ones produced by the latest generator [63]: we keep an image buffer that stores the 50 previously generated images. All networks were trained from scratch with a learning rate of 0.0002. Starting from iteration 5k, we linearly decay the learning rate over the remaining 5k iterations. Most of our training takes about 1 day to converge on a single Titan X GPU.
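Two of the training details above, the history buffer and the learning-rate schedule, are easy to sketch. This is a minimal illustration under assumptions: the 50/50 swap rule for the full buffer follows a common implementation of the idea in [63] and is not specified in the text, the decay is assumed to end at zero after 10k iterations, and the names `ImageHistory` and `learning_rate` are hypothetical.

```python
import random


class ImageHistory:
    """Buffer of previously generated images shown to the discriminator."""

    def __init__(self, capacity=50, seed=None):
        self.capacity = capacity
        self.images = []
        self.rng = random.Random(seed)

    def query(self, image):
        """Store the new image and return the one the discriminator should see."""
        if len(self.images) < self.capacity:
            self.images.append(image)       # buffer not full: keep and return it
            return image
        if self.rng.random() < 0.5:         # assumed rule: half the time, swap
            idx = self.rng.randrange(self.capacity)
            old, self.images[idx] = self.images[idx], image
            return old
        return image                        # otherwise pass the new image through


def learning_rate(iteration, base_lr=2e-4, decay_start=5000, total=10000):
    """Constant rate until decay_start, then linear decay to zero at total."""
    if iteration < decay_start:
        return base_lr
    remaining = max(total - iteration, 0)
    return base_lr * remaining / (total - decay_start)


if __name__ == "__main__":
    history = ImageHistory(capacity=50, seed=0)
    for i in range(60):
        history.query(f"image_{i}")
    print(len(history.images))
    print([learning_rate(i) for i in (0, 5000, 7500, 10000)])
```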

IV-A Qualitative Results

(a) Input (b) Initial (c) Attention (d) Final (e) Input (f) Initial (g) Attention (h) Final
Fig. 4: Image translation results on more datasets. From top to bottom: apple to orange [28], dog to cat [52], photo to DSLR [28], Yosemite summer to winter [28].
Fig. 5: More image translation results. From left to right: edges to shoes [28]; edges to handbags [28]; SYNTHIA to cityscape [58, 12]. Given the source and target domains are globally different, the initial translation and final result are similar with the attention maps focusing on the entire images.

Fig. 3 shows visual results of image translation of horse to zebra. For each image, we show the initial translation $G_0(x)$, the attention map $G_{attn}(x)$ and the final result composited using $x$ and $G_0(x)$ based on $G_{attn}(x)$. We also compare the results with CycleGAN [80] and UNIT [42], and all models are trained using the same number of iterations. For the baselines, we use the original authors’ implementations. We can see from the examples that without the attention branch, our simple translation model already gives results similar to or better than [80, 42]. However, all these results suffer from perturbations of background color/texture and artifacts near the region of interest. With the predicted attention map, which learns to segment the horses, our final results have much higher visual quality, with the background kept untouched and artifacts near the ROI removed (rows 2 and 4). Complete results of horse-zebra translations and comparisons are available online.

Fig. 4 shows more results on a variety of datasets. We can see that for all these tasks, our model can learn the region of interest and generate compositions that are not only more faithful to the input, but also have fewer artifacts. For example, in dog to cat translation, we notice most attention maps have large values around the eyes, indicating the eyes are a key ROI for differentiating cats from dogs. In the examples of photo to DSLR, the ROI should be the background that we wish to defocus, while the initial translation changes the color of the foreground flower in the photo. The final result, on the other hand, keeps the color of the foreground flower. In the second example of summer to winter translation, we notice the initial result incorrectly changes the color of the person. With the guidance of the attention map, the final result removes such artifacts.

In a few scenarios, the attention map is less useful as the image does not explicitly contain a region of interest and should be translated everywhere. In this case, the composited results largely rely on the initial prediction given by $G_0$. This is true for tasks like edges to shoes/handbags, SYNTHIA to cityscape (Fig. 5) and photo to map (Fig. 9). Although many of these tasks have very different source and target domains, our method is general and can be applied to get satisfying results.

To better demonstrate the effectiveness of our simple model, Fig. 6 shows several results before training with the attention branch and compares them with the baselines. We can see that even without the attention branch, our model generates better qualitative results than CycleGAN and UNIT (more samples of photo to Van Gogh are available online).

(a) Input (b) CycleGAN (c) UNIT (d) Ours w/o attn
Fig. 6: Comparing our results w/o attention with baselines. From top to bottom: dawn to night (SYNTHIA [58]), non-smile to smile (CelebA [45]) and photos to Van Gogh [28].
Fig. 7: Failure case of the attention map: it did not detect the ROI correctly and removed the zebra stripes.
(a) (b) (c) (d)
Fig. 8: Effects of using different layers as feature extractors. From left to right: input (a), using the first two layers of VGG (b), using the last two layers of VGG (c) and using the first three layers of VGG (d).

User study: To evaluate the performance more rigorously, we perform a user study to compare the results. The procedure is as follows: we asked for feedback from 22 users (all graduate students and researchers). Each user is given 30 sets of images to compare. Each set has 5 images, which are the input, initial result (w/o attention), final result (with attention), CycleGAN result and UNIT result. In total there are 300 different image sets randomly selected from the horse to zebra and photo to Van Gogh translation tasks. The images in each set are in random order. The user is then asked to rank the four results from highest visual quality to lowest. The user is fully informed about the task and is aware that the goal is to translate the input image into a new domain while avoiding unnecessary changes.

Table I shows the user-study results. We list the results of: CycleGAN vs ours initial/final; UNIT vs ours initial/final; and ours initial vs ours final. We can see that our results, even without applying the attention branch (ours initial), achieve higher ratings than CycleGAN or UNIT. The attention branch also significantly improves the results (ours final). In terms of directly evaluating the effects of the attention branch, ours final is overwhelmingly better than ours initial based on user rankings (Table I row 5). We further examined the few cases where the attention results receive lower scores, and we found that the cause is incorrect attention maps (Fig. 7).

Method 1     | Method 2     | 1 better | About same | 2 better
Ours initial | CycleGAN     | 43.6%    | 30.0%      | 26.4%
Ours initial | UNIT         | 77.4%    | 17.5%      | 5.1%
Ours final   | CycleGAN     | 63.0%    | 21.9%      | 15.1%
Ours final   | UNIT         | 83.8%    | 14.4%      | 1.8%
Ours final   | Ours initial | 74.2%    | 18.5%      | 7.3%
TABLE I: User study results.
input Pix2Pix CycleGAN Ours GT
Fig. 9: Unsupervised map prediction visualization.
Method        | Accuracy
Pix2Pix [28]  | 43.18%
CycleGAN [80] | 45.91%
Ours          | 46.72%
Fig. 10: Unsupervised map prediction accuracy.
(a) (b) (c) (d) (e) (f)
Fig. 11: Visualization of image translation from MNIST (a),(d) to USPS (b),(e) and MNIST-M (c),(f).
Method        | MNIST to USPS | MNIST to MNIST-M
CoGAN [44]    | 95.65%        | -
PixelDA [7]   | 95.90%        | 98.20%
UNIT [43]     | 95.97%        | -
CycleGAN [80] | 94.28%        | 93.16%
Target-only   | 96.50%        | 96.40%
Ours          | 96.80%        | 98.33%
Fig. 12: Unsupervised classification results.
(a) (b) (c) (d) (e) (f)
Fig. 13: Visualization of rendered face to real face translation. (a)(d): input rendered faces; (b)(e): CycleGAN results; (c)(f): Our results.
Method | MSE
Baseline | 2.26
CycleGAN [80] | 2.04
Ours | 1.97
Fig. 14: Unsupervised 3DMM prediction results (MSE).
Ours before attn | Ours after attn | UNIT | CycleGAN
98.90 | 128.32 | 241.13 | 109.36
TABLE II: FID between generated samples and target domain for horse to zebra.
Ours | UNIT | CycleGAN
92.86 | 120.58 | 102.49
TABLE III: FID between generated samples and target domain for photo to Van Gogh.

Effects of using different layers as feature extractors: We experimented with different layers of VGG-19 as feature extractors for measuring the perceptual loss. Fig. 8 shows a visual example of horse to zebra translation results trained with different perceptual terms. Using only high-level features as regularization leads to results that are almost identical to the input (Fig. 8 (c)), while using only low-level features leads to results that are blurry and noisy (Fig. 8 (b)). We strike a balance by adopting the first three layers of VGG-19 as the feature extractor, which translates the image well while avoiding too much noise or too many artifacts (Fig. 8 (d)).

IV-B Quantitative Results

Map prediction: We translate satellite photos to maps using unpaired training data and compute the pixel accuracy of the predicted maps. The original photo-map dataset consists of 1096 training pairs and 1098 testing pairs, where each pair contains a satellite photo and the corresponding map. To enable unsupervised learning, we take the 1096 photos from the training set and the 1098 maps from the test set and use them as the training data. Note that no attention is used here, since the change is global and we observe that training with attention yields similar results. At test time, we translate the test set photos to maps and compute the pixel accuracy: if the total RGB difference between the color of a pixel on the predicted map and that on the ground truth is larger than 12, we mark the pixel as wrong. Figure 9 and Table 10 show the visual results and the accuracy results. Our approach achieves the highest map prediction accuracy. Note that Pix2Pix is trained with paired data.
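The pixel accuracy criterion can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: "total RGB difference" is read here as the sum of absolute per-channel differences, images are given as flat sequences of 8-bit RGB tuples, and the name `pixel_accuracy` is hypothetical.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), assumed to be 8-bit values


def pixel_accuracy(
    predicted: Sequence[Pixel],
    ground_truth: Sequence[Pixel],
    threshold: int = 12,
) -> float:
    """Fraction of pixels whose total RGB difference is at most `threshold`.

    A pixel is marked wrong when the summed absolute channel difference
    is larger than the threshold (12 in the text above).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("images must have the same number of pixels")
    if not predicted:
        raise ValueError("images must not be empty")
    correct = 0
    for pred_pixel, true_pixel in zip(predicted, ground_truth):
        total_diff = sum(abs(p - t) for p, t in zip(pred_pixel, true_pixel))
        if total_diff <= threshold:
            correct += 1
    return correct / len(predicted)


# Toy example: summed differences are 15, 20, and 12; only the last is correct.
toy_pred = [(255, 255, 255), (10, 10, 10), (0, 0, 0)]
toy_true = [(250, 250, 250), (10, 10, 30), (0, 0, 12)]
toy_accuracy = pixel_accuracy(toy_pred, toy_true)
```

A difference of exactly 12 counts as correct under this reading, because the text marks a pixel wrong only when the difference is larger than 12.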

Unsupervised classification: We show unsupervised classification results on USPS [13] and MNIST-M [17] in Figure 11 and Table 12. On both tasks, we assume access to the labeled MNIST dataset. We first train a generator that maps MNIST to USPS or MNIST-M, and then use the translated images with their original labels to train the classifier (we do not apply the attention branch here, as we did not observe much difference after training with attention). The results show that we achieve the highest accuracy on both tasks, advancing the state of the art. The qualitative results clearly show that our MNIST-translated images both preserve the original labels and are visually similar to USPS/MNIST-M. We also notice that our model achieves even better results than the model trained on target labels, and we conjecture that the classifier benefits from the larger training set size of the MNIST dataset.
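The data flow of this step can be sketched as below: every labeled source image is passed through the trained generator and paired with its original label, and the resulting pairs form the classifier's training set. All names are hypothetical, and `toy_generator` is a stand-in (a simple intensity inversion) rather than the trained translation model.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # hypothetical grayscale image, values in [0, 1]


def build_translated_training_set(
    source_images: Sequence[Image],
    source_labels: Sequence[int],
    translate: Callable[[Image], Image],
) -> List[Tuple[Image, int]]:
    """Pair each translated source image with its unchanged source label."""
    if len(source_images) != len(source_labels):
        raise ValueError("images and labels must have the same length")
    return [
        (translate(image), label)
        for image, label in zip(source_images, source_labels)
    ]


def toy_generator(image: Image) -> Image:
    """Stand-in for a trained generator: inverts pixel intensities."""
    return [[1.0 - pixel for pixel in row] for row in image]


# Two tiny 1x2 "digits" with made-up labels, for illustration only.
toy_images = [[[0.0, 1.0]], [[0.25, 0.75]]]
toy_labels = [3, 7]
training_pairs = build_translated_training_set(
    toy_images, toy_labels, toy_generator
)
```

The classifier never sees target-domain labels in this setup. It relies on the assumption that translation preserves the digit identity.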

3DMM face shape prediction: As a real-world application of our approach, we study the problem of estimating 3D face shape, which is modeled with the 3D morphable model (3DMM) [5]. 3DMM is widely used for recognition and reconstruction. For a given face, the model encodes its shape with a 100-dimensional vector. The goal of 3DMM regression is to predict this 100-dimensional vector, and we compare the prediction with the ground truth using mean squared error (MSE). [66] proposes to train a very deep neural network [23] for 3DMM regression. In practice, however, labeled training data for real faces is expensive to collect. We propose to use rendered faces instead, as their 3DMM parameters are readily available. We rendered 200k faces as the source domain and use 645 human selfie photos that we collected as the target domain. For testing, we use 112 3D-scanned faces that we collected. For domain adaptation, we first use our model to translate the rendered faces to real faces and use the results as the training data, assuming the 3DMM parameters stay unchanged. The 3DMM regression model is a 102-layer ResNet [23] as in [66], trained on the translated faces. Figure 13 and Table 14 show the qualitative results and the final accuracy of 3DMM regression. The visual results show that our translated faces preserve the shape of the original rendered faces and have higher quality than those produced by CycleGAN. We also reduce the 3DMM regression error compared with both the baseline (trained on rendered faces and tested on real faces) and the CycleGAN results.
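For reference, a minimal sketch of the MSE comparison between predicted and ground-truth shape vectors follows. The averaging convention (mean over vector entries, then mean over test faces) is an assumption, since the source does not spell it out, and the function names are hypothetical.

```python
from typing import Sequence


def mean_squared_error(
    predicted: Sequence[float], target: Sequence[float]
) -> float:
    """Mean of squared entry-wise differences between two shape vectors."""
    if len(predicted) != len(target):
        raise ValueError("vectors must have the same dimension")
    if not predicted:
        raise ValueError("vectors must not be empty")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared) / len(squared)


def dataset_mse(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """Average per-face MSE over a test set (assumed convention)."""
    if len(predictions) != len(targets):
        raise ValueError("prediction and target counts must match")
    if not predictions:
        raise ValueError("test set must not be empty")
    per_face = [mean_squared_error(p, t) for p, t in zip(predictions, targets)]
    return sum(per_face) / len(per_face)


# Toy 3-dimensional vectors; the model in the text uses 100 dimensions.
toy_error = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
toy_dataset_error = dataset_mse([[0.0, 0.0], [1.0, 1.0]], [[0.0, 2.0], [1.0, 1.0]])
```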

Fréchet Inception Distance: We also use the Fréchet Inception Distance (FID) [25] between samples generated by our model and the target domain for quantitative evaluation. We compute FID for horse to zebra and photo to Van Gogh; the results are shown in Tables II and III. For photo to Van Gogh, we observe no difference between results before and after attention, so we report a single number for our model. The FID results show that our model achieves better FID than the baselines on these tasks. For horse to zebra, however, our model with attention has worse FID than both our model without attention and CycleGAN. We speculate that foreground and background may be correlated in the target domain when computing FID, so that using attention, which leaves the background unchanged, can hurt FID. We also suspect that FID may not be an ideal metric for the image translation task.
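For readers unfamiliar with the metric, the standard definition of FID from [25] is reproduced below. The symbols are notation introduced here, not taken from the text above: \mu_g and \Sigma_g denote the mean and covariance of Inception features of the generated samples, and \mu_t and \Sigma_t those of the target-domain images. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID}
  = \lVert \mu_g - \mu_t \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_t
      - 2 \left( \Sigma_g \Sigma_t \right)^{1/2} \right)
```

Because the statistics are computed over whole images, background regions contribute to both terms, which is consistent with the speculation above about the effect of attention on horse to zebra.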

V Conclusion

We propose a simple model with attention for image translation and domain adaptation, and achieve superior performance on a variety of tasks, as demonstrated by both qualitative and quantitative measures. The attention module is particularly helpful for focusing the translation on the region of interest and removing unwanted changes or artifacts, and it may also be used for unsupervised segmentation or saliency detection. Extensive experiments show that our model is both powerful and general, and can be easily applied to real-world problems.