[Fig. 1 panels: (a) Input image, (b) Predicted attention map, (c) Final result, (d) CycleGAN]
Many computer vision problems can be cast as an image-to-image translation problem: the task is to map an image of one domain to a corresponding image of another domain. For example, image colorization can be considered as mapping gray-scale images to corresponding images in RGB space; style transfer can be viewed as translating images in one style to corresponding images in another style [19, 29, 18]. Other tasks falling into this category include semantic segmentation [39], image manipulation, etc. Another important application of image translation is related to domain adaptation and unsupervised learning: with the rise of deep learning, it is now considered crucial to have large labeled training datasets. However, labeling and annotating such large datasets is expensive and thus not scalable. An alternative is to use synthetic or simulated data for training, whose labels are trivial to acquire [82, 67, 60, 57, 53, 48, 30, 11]. Unfortunately, learning from synthetic data can be problematic and most of the time does not generalize to real-world data, due to the data distribution gap between the two domains. Furthermore, because deep neural networks are capable of learning small details, the trained model is likely to over-fit to the synthetic domain. In order to close this gap, we can either find mappings or domain-invariant representations at the feature level [8, 17, 47, 65, 68, 21, 9, 1, 33] or learn to translate images from one domain to another to create “fake” labeled data for training [7, 80, 43, 39, 44, 75]. In the latter case, we usually hope to learn a mapping that preserves the labels as well as the attributes we care about.
Typically there exist two settings for image translation given two domains X and Y. The first setting is supervised, where example image pairs (x, y) are available. This means that for each training image x ∈ X there is a corresponding y ∈ Y, and we wish to find a translator G: X → Y such that G(x) ≈ y. Representative translation systems in the supervised setting include domain-specific works [15, 24, 37, 62, 46, 70, 71, 77] and the more general Pix2Pix [28, 69]. However, paired training data comes at a premium. For example, for image stylization, obtaining paired data requires lengthy artist authoring and is extremely expensive. For other tasks like object transfiguration, the desired output is not even well defined.
Therefore, we focus on the second setting, which is unsupervised image translation. In the unsupervised setting, X and Y are two independent sets of images, and we do not have access to paired examples showing how an image x ∈ X could be translated to an image y ∈ Y. Our task is then to seek an algorithm that can learn to translate between X and Y without desired input-output examples. The unsupervised image translation setting has greater potential because of its simplicity and flexibility, but it is also much more difficult. In fact, it is a highly under-constrained and ill-posed problem, since there could be infinitely many mappings between X and Y: from the probabilistic view, the challenge is to learn a joint distribution of images in the different domains. As stated by the coupling theory, there exists an infinite set of joint distributions that can arrive at the two marginal distributions in the two different domains. Therefore, additional assumptions and constraints are needed for us to exploit the structure and supervision necessary to learn the mapping.
Existing works that address this problem assume certain relationships between the two domains. For example, CycleGAN assumes cycle-consistency and the existence of an inverse mapping F: Y → X. It then trains two generators, which are bijections and inverses of each other, and uses an adversarial constraint to ensure the translated image appears to be drawn from the target domain, and a cycle-consistency constraint to ensure the translated image can be mapped back to the original image using the inverse mapping (F(G(x)) ≈ x and G(F(y)) ≈ y). UNIT, on the other hand, assumes a shared latent space, meaning a pair of images in different domains can be mapped to a shared latent representation. The model trains two generators G_X and G_Y with shared layers. Both G_X and G_Y map an input to itself, while domain translation is realized by letting x go through part of G_X and part of G_Y to obtain the translated output. The model is trained with an adversarial constraint on the image, a variational constraint on the latent code [35, 56], and another cycle-consistency constraint.
By assuming cycle consistency, which encourages one-to-one mappings and avoids mode collapse, both models generate reasonable image translation and domain adaptation results. However, there are several issues with existing approaches. First, such approaches are usually agnostic to the subjects of interest, and there is little guarantee that they reach the desired output. In fact, approaches based on cycle-consistency [80, 42] could theoretically find any arbitrary one-to-one mapping that satisfies the constraints, which renders the training unstable and the results random. This is problematic in many image translation scenarios. For example, when translating from a horse image to a zebra image, most likely we only wish to draw the particular black-white stripes on top of the horses while keeping everything else unchanged. However, what we observe is that existing approaches [80, 43] do not differentiate the horse/zebra from the scene background, and the colors and appearances of the background often change significantly during translation (Fig. 1). Second, most of the time we only care about one-way translation, while existing methods like CycleGAN and UNIT always require training two generators of bijections. This is not only cumbersome, but it is also hard to balance the effects of the two generators. Third, there is a sensitive trade-off between the faithfulness of the translated image to the input and how closely it resembles the new domain, and it requires excessive manual tuning of the weight between the adversarial loss and the reconstruction loss to get satisfying results.
To address the aforementioned issues, we propose a simpler yet more effective image translation model that consists of a single generator with an attention module. We first re-consider what the desired outcome of an image translation task should be: most of the time, the desired output should not only resemble the target domain but also preserve certain attributes and share a similar visual appearance with the input. For example, in the case of horse-zebra translation, the output zebra should be similar to the input horse in terms of the scene background and the location and shape of the animal. In the domain adaptation task that translates MNIST to USPS, we expect the output to be visually similar to the input in terms of the shape and structure of the digit, such that it preserves the label. Based on this observation, our model uses a single generator that maps X to Y and is trained with a self-regularization term that enforces perceptual similarity between the output and the input, together with an adversarial term that enforces the output to appear as if drawn from Y. Furthermore, in order to focus the translation on key components of the image and avoid introducing unnecessary changes to irrelevant parts, we add an attention module that predicts a probability map indicating which parts of the image the model needs to attend to when translating. Such probability maps, which are learned in a completely unsupervised fashion, could further facilitate segmentation or saliency detection (Fig. 1). Third, we propose an automatic and principled way to find the optimal weight between the self-regularization term and the adversarial term, so that we do not have to manually search for the best hyper-parameter.
Our model does not rely on cycle-consistency or a shared-representation assumption, and it only learns a one-way mapping. Although this constraint may oversimplify certain scenarios, we found that the model works surprisingly well. With the attention module, our model learns to detect the key objects from the background context and is able to correct artifacts and remove unwanted changes from the translated results. We apply our model to a variety of image translation and domain adaptation tasks and show that our model is not only simpler but also works better than existing methods, achieving superior qualitative and quantitative performance. To demonstrate its application to real-world tasks, we show that our model can be used to improve the accuracy of face 3D morphable model prediction by augmenting the training data of real images with adapted synthetic images.
II. Related Work
Generative adversarial networks (GANs) Using the GAN framework for generative image modeling and synthesis has made remarkable progress recently. The basic idea of GAN training is to train a generator and a discriminator jointly such that the generator produces realistic images that confuse the discriminator. It is known that the vanilla GAN suffers from instability in training. Several techniques have been proposed to stabilize the training process and enable it to scale to higher-resolution images, such as DCGAN, energy-based GAN, Wasserstein GAN (WGAN) [61, 2], WGAN-GP, BEGAN, LSGAN and the Progressive GANs. In our work, adversarial training is the fundamental element that ensures the output sample from the generator appears as if drawn from the target domain.
Image translation Image translation can be seen as generating an image in the target domain conditioned on an image in the source domain. Similar conditional image generation problems include text-to-image translation [76, 55], super resolution [32, 14, 39], style transfer [19, 29, 40, 26], etc. Based on the availability of paired training data, image translation can be either supervised (paired) or unsupervised (unpaired). Isola et al. first propose a unified framework called Pix2Pix for paired image-to-image translation based on conditional GANs. Wang et al. further extend the framework to generate high-resolution images by using deeper, multi-scale networks and improved training losses. A variational U-Net has also been used instead of a GAN for conditional image generation. UNIT and BicycleGAN incorporate latent code embedding into existing frameworks and enable generating randomly sampled translation results. On the other hand, when paired training data is not available, additional constraints such as a cycle-consistency loss are employed [80, 27]. This constraint enforces that an image maps to another domain and back to itself, ensuring a one-to-one mapping between the two domains. However, such techniques heavily rely on the “laziness” of the generator network and often introduce artifacts or unwanted changes to the results. Our model leverages recent advances in neural network training and employs a perceptual loss [29, 78] as self-regularization, such that cycle-consistency becomes unnecessary and we can also obtain more accurate translation results.
Attention Recently, attention mechanisms have been successfully introduced in many applications in computer vision and language processing, e.g., image captioning, text-to-image generation, visual question answering, saliency detection, machine translation and speech recognition. An attention mechanism helps a model focus on the relevant portion of the input to resolve the corresponding output without any supervision. In machine translation, the model attends to relevant words in the source language to predict the current output word in the target language. To generate an image from text, the model attends to different words for the corresponding sub-regions of the image. Inversely, for image captioning, image sub-regions are attended to for the next generated word. In the same spirit, we propose to use an attention module to attend to the region of interest for the image translation task in an unsupervised way.
III. Our Method
We begin by explaining our model for unsupervised image translation. Let X and Y be two image domains; our goal is to train a generator G_θ: X → Y, where θ are the function parameters. For simplicity, we omit θ and use G instead. We are given unpaired samples x ∈ X and y ∈ Y, and the unsupervised setting assumes that x and y are independently drawn from the marginal distributions P(x) and P(y). Let y' = G(x) denote the translated image; the key requirement is that y' should appear as if drawn from domain Y, while preserving the low-level visual characteristics of x. The translated images can be further used for other downstream tasks such as unsupervised learning. However, in our case, we decouple image translation from its applications.
Based on the requirements described, we propose to learn G by minimizing the following loss:

L(G) = L_adv(G(x), Y) + λ · L_reg(x, G(x))

Here G = (G_0, G_att), where G_0 is the vanilla generator and G_att is the attention branch. G_0 outputs a translated image, while G_att predicts a probability map that is used to composite the translated image with x to get the final output. The first part of the loss, L_adv, is the adversarial loss on the image domain that makes sure G(x) appears like domain Y. The second part, L_reg, makes sure that G(x) is visually similar to x. In our case, L_adv is given by a discriminator D trained jointly with G, and L_reg is measured with a perceptual loss. We illustrate the model in Fig. 2.
The model architectures: Our model consists of a generator G and a discriminator D. The generator has two branches: the vanilla generator G_0 and the attention branch G_att. G_0 translates the input x as a whole to generate a similar image G_0(x) in the new domain, and G_att predicts a probability map G_att(x) as the attention mask. The mask has the same size as x, and each pixel is a probability value between 0 and 1. In the end, we composite the final image from G_0(x) and x based on the attention mask: G(x) = G_att(x) ⊙ G_0(x) + (1 − G_att(x)) ⊙ x, where ⊙ denotes element-wise multiplication.
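The compositing step above can be sketched in a few lines; the array names, shapes, and value ranges below are our assumptions for illustration, not the paper's code:

```python
import numpy as np

def composite(x, translated, attention):
    """Blend the vanilla translation with the input using the predicted
    attention mask: attention * translated + (1 - attention) * input.

    x, translated: float arrays of shape (H, W, 3) with values in [0, 1]
    attention:     float array of shape (H, W) with values in [0, 1]
    """
    a = attention[..., None]  # broadcast the single-channel mask over RGB
    return a * translated + (1.0 - a) * x
```

Where the mask is close to 1 the translated pixel is used; where it is close to 0 the input pixel passes through unchanged, which is what keeps the background untouched.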
G_0 is based on a Fully Convolutional Network (FCN) and leverages properties of convolutional neural networks, such as translation invariance and parameter sharing. Similar to [28, 80], the generator is built with three components: a down-sampling front-end to reduce the size, followed by multiple residual blocks, and an up-sampling back-end to restore the original dimensions. The down-sampling front-end consists of two convolutional blocks, each with a stride of 2. The intermediate part contains nine residual blocks that keep the height/width constant, and the up-sampling back-end consists of two deconvolutional blocks, also with a stride of 2. Each convolutional layer is followed by batch normalization and ReLU activation, except for the last layer, whose output is in the image space. Using down-sampling at the beginning increases the receptive field of the residual blocks and makes it easier to learn the transformation at a smaller scale. Another modification is that we adopt dilated convolutions in all residual blocks, with the dilation factor set to 2. Dilated convolutions use spaced kernels, enabling each output value to be computed with a wider view of the input without increasing the number of parameters or the computational burden.

G_att consists of the initial layers of the VGG-19 network (up to conv3_3), followed by two deconvolutional blocks. The final layer is a convolutional layer with a sigmoid that outputs a single-channel probability map. During training, the VGG-19 layers are warm-started with weights pretrained on ImageNet.
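A minimal PyTorch sketch of the vanilla generator G_0 described above. The stride-2 front-end, nine dilation-2 residual blocks, and stride-2 back-end follow the text; the initial 7×7 convolution, the channel widths, and the final Tanh are our assumptions:

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    # Residual block with dilation factor 2; padding 2 keeps height/width constant.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2),
            nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class VanillaGenerator(nn.Module):
    # Two stride-2 convolutions down-sample, nine dilated residual blocks
    # transform, two stride-2 deconvolutions restore the original size.
    def __init__(self, ch=64):
        super().__init__()
        layers = [nn.Conv2d(3, ch, 7, padding=3), nn.BatchNorm2d(ch), nn.ReLU(True),
                  nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(2 * ch), nn.ReLU(True),
                  nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.BatchNorm2d(4 * ch), nn.ReLU(True)]
        layers += [DilatedResBlock(4 * ch) for _ in range(9)]
        layers += [nn.ConvTranspose2d(4 * ch, 2 * ch, 3, stride=2, padding=1, output_padding=1),
                   nn.BatchNorm2d(2 * ch), nn.ReLU(True),
                   nn.ConvTranspose2d(2 * ch, ch, 3, stride=2, padding=1, output_padding=1),
                   nn.BatchNorm2d(ch), nn.ReLU(True),
                   nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh()]  # last layer in image space, no BN/ReLU
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)
```

A 256×256 input passes through at 64×64 internal resolution and comes back out at 256×256; any input size divisible by 4 works with this sketch.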
For the discriminator, we use a five-layer convolutional network. The first three layers have a stride of 2, followed by two convolutional layers with stride 1, which effectively down-samples the input three times. The output is a vector of real/fake predictions, where each value corresponds to a patch of the image. Classifying each patch as real/fake is the idea of PatchGAN, which is shown to work better than a global GAN [80, 28].
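The five-layer patch discriminator can be sketched as follows; the kernel size, channel widths, and LeakyReLU activations are our assumptions, while the three stride-2 layers followed by two stride-1 layers follow the text:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # Three stride-2 convolutions down-sample the input three times,
    # then two stride-1 convolutions produce one real/fake logit per patch.
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(2 * ch, 4 * ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(4 * ch, 8 * ch, 4, stride=1, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(8 * ch, 1, 4, stride=1, padding=1))  # one score per image patch
    def forward(self, x):
        return self.net(x)
```

Because the output is a grid of patch scores rather than a single scalar, the adversarial loss is averaged over all patches.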
Adversarial loss: A Generative Adversarial Network plays a two-player min-max game to update the networks G and D. G learns to translate the image x to G(x) so that it appears as if it were from Y, while D learns to distinguish G(x) from y, the real image drawn from Y. The parameters of G and D are updated alternately. The discriminator updates its parameters by maximizing the following objective:

L_D = E_{y ∈ Y}[log D(y)] + E_{x ∈ X}[log(1 − D(G(x)))]

The adversarial loss used to update the generator is defined as:

L_adv = −E_{x ∈ X}[log D(G(x))]

By minimizing this loss, the generator G learns to create a translated image G(x) that fools the network D into classifying it as drawn from Y.
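The two objectives above can be sketched directly on discriminator probabilities. This is a minimal numpy illustration, assuming D outputs probabilities in (0, 1); the non-saturating generator form is a common convention and our assumption:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Negation of the D objective E[log D(y)] + E[log(1 - D(G(x)))],
    so that D's update becomes a minimization.
    d_real, d_fake: arrays of discriminator probabilities in (0, 1)."""
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_adv_loss(d_fake):
    """Non-saturating generator loss -E[log D(G(x))]: minimizing it
    pushes D(G(x)) toward 1, i.e. fools the discriminator."""
    return -np.mean(np.log(d_fake))
```

In practice both would be evaluated on the grid of PatchGAN scores, averaging over patches and over the batch.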
Self-regularization loss: Theoretically, adversarial training can learn a mapping that produces outputs identically distributed as the target domain Y. However, with a large enough capacity, a network can map the input images to any random permutation of images in the target domain. Thus, adversarial losses alone cannot guarantee that the learned function maps an input to its desired output. To further constrain the learned mapping such that it is meaningful, we argue that G should preserve the visual characteristics of the input image. In other words, the output and the input need to share perceptual similarities, especially regarding low-level features such as color, edges, shape, and objects. We impose this constraint with the self-regularization term, which is modeled by minimizing the distance between the translated image and the input: L_reg = d(x, G(x)). Here d is some distance function, which can be L1, L2, SSIM, etc. However, recent research suggests that a perceptual distance based on a pre-trained network corresponds much better to human perception of similarity than traditional distance measures. In particular, we define the perceptual loss as:

L_reg = Σ_l w_l · (1 / (H_l W_l)) · Σ_{h,w} ‖Φ_l(x)_{h,w} − Φ_l(G(x))_{h,w}‖

Here Φ is a VGG network pretrained on ImageNet used to extract the neural features; Φ_l denotes the features at layer l, and H_l and W_l are the height and width of Φ_l. We extract neural features with Φ across multiple layers, compute the difference at each location of the feature map, and average over the feature height and width. We then scale each layer's term by a layer-wise weight w_l. We did extensive experiments with different combinations of feature layers and obtained the best results by using only the first three layers of VGG with appropriately chosen weights w_l. This conforms to the intuition that we would like to preserve the low-level traits of the input during translation. Note that this may not always be true (such as in texture transfer), but it is a hyper-parameter that can easily be adjusted for different problem settings. We also experimented with different pre-trained networks such as AlexNet to extract neural features, but did not observe much difference in the results.
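The weighted per-layer distance above can be sketched independently of the feature extractor. In this numpy sketch the VGG forward pass is omitted; `feats_x` and `feats_y` stand for the per-layer activations of the two images, and the L1 norm over channels is our assumption:

```python
import numpy as np

def perceptual_loss(feats_x, feats_y, weights):
    """Weighted sum of per-layer feature distances.

    feats_x, feats_y: lists of feature maps of shape (H_l, W_l, C_l),
                      e.g. activations of the first three VGG-19 layers.
    weights:          per-layer scalars w_l.
    """
    loss = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        h, wd = fx.shape[:2]
        # difference at each spatial location, averaged over H_l x W_l
        loss += w * np.sum(np.abs(fx - fy)) / (h * wd)
    return loss
```

Plugging in activations of any pretrained network (VGG, AlexNet, ...) recovers the regularizer used in the text.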
Training scheme: In our experiments, we found that jointly training the attention branch and the vanilla generator branch is difficult, as it is hard to balance the learned translation and mask. In practice, we train the two branches separately. First, we train the vanilla generator without the attention branch. After it converges, we train the attention branch while keeping the trained generator fixed. In the end, we jointly fine-tune them with a smaller learning rate.
Adaptive weight induction: As in other image translation methods, there is a trade-off between resemblance to the new domain and faithfulness to the original image. In our model, it is determined by the weight λ of the self-regularization term relative to the adversarial term. If λ is too large, the translated image will be close to the input but will not look like the new domain. If λ is too small, the translated image will fail to retain the visual traits of the input. Previous approaches usually decide the weight heuristically. Here we propose an adaptive scheme to search for the best λ: we start by setting λ = 0, which means we only use the adversarial constraint to train the generator. Then we gradually increase λ. This leads to a decrease of the adversarial loss, as the output shifts away from Y toward X, which makes it easier for D to classify. We stop increasing λ when the adversarial loss sinks below a threshold. We then keep λ constant and continue to train the network until convergence. The adaptive weight induction scheme avoids manual tuning of λ for each specific task and gives results that are similar both to the input and to the new domain Y. Note that we repeat this process both when training G_0 and G_att.
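The search procedure can be sketched as a simple loop. Here `train_step` and `adv_loss` are placeholders for the real training iteration and the current adversarial loss, and the step size and iteration cap are our assumptions:

```python
def adapt_lambda(train_step, adv_loss, threshold, lam_step=0.1, max_iters=1000):
    """Start at lambda = 0 and raise it gradually; once the adversarial
    loss sinks below the threshold, stop increasing and return the
    final lambda, which is then held fixed for the rest of training."""
    lam = 0.0
    for _ in range(max_iters):
        train_step(lam)            # one (or a few) training iterations at this weight
        if adv_loss() < threshold:
            break                  # keep lambda constant from here on
        lam += lam_step
    return lam
```

The same loop is run twice, once for the vanilla generator and once for the attention branch, matching the note above.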
Analysis: Our model is related to CycleGAN in that if we assume a one-to-one mapping, we can define an inverse mapping F: Y → X such that F(G(x)) = x. This satisfies the constraints of CycleGAN in that the cycle-consistency loss is zero, which shows that our learned mapping belongs to the set of possible mappings given by CycleGAN. On the other hand, although CycleGAN tends to learn a mapping such that the visual distance between x and G(x) is small, possibly due to the cycle-consistency constraint, it does not guarantee to minimize the perceptual distance between x and G(x). Compared with UNIT, if we add another constraint that G(y) = y, our model becomes a special case of the UNIT model where all layers of the two generators are shared, which leads to a single generator G. In this case, the cycle-consistency constraint is implicit, since G maps each translated image to itself. However, we observe that adding the additional self-mapping constraint for domain Y does not improve the results.
Even though our approach assumes that the perceptual distance between x and its corresponding output G(x) is small, it generalizes well to tasks where the input and output domains are significantly different, such as translation of photo to map, day to night, etc., as long as our assumption generally holds. For example, in the case of photo to map, the park (photo) is labeled as green (map) and the water (photo) is labeled as blue (map), which provides certain low-level similarities. Experiments show that even without the attention branch, our model produces results consistently similar to or better than other methods. This indicates that the cycle-consistency assumption may not be necessary for image translation. Note that our approach is a meta-algorithm, and we could potentially improve the results by using newer or more advanced components. For example, the generator and discriminator could easily be replaced with the latest GAN architectures such as LSGAN or WGAN-GP, or augmented with spectral normalization. We may also improve the results by employing a more specific self-regularization term that is fine-tuned on the datasets we work on.
[Fig. 3 panels: (a) Input, (b) Initial translation, (c) Attention map, (d) Final result, (e) UNIT, (f) CycleGAN]
IV. Experiments

We tested our model on a variety of datasets and tasks. In the following, we show qualitative results of image translation, as well as quantitative results in several domain adaptation settings. In our experiments, all images are resized to 256×256. We use the Adam solver to update the model weights during training. In order to reduce model oscillation, we update the discriminators using a history of generated images rather than the ones produced by the latest generator: we keep an image buffer that stores the 50 previously generated images. All networks were trained from scratch with a learning rate of 0.0002. Starting from the 5k-th iteration, we linearly decay the learning rate over the remaining 5k iterations. Most of our training takes about one day to converge on a single Titan X GPU.
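The image-history buffer can be sketched as below. The 50-image capacity comes from the text; the 50/50 replacement rule is a common convention for such buffers and an assumption here:

```python
import random

class ImageHistoryBuffer:
    """Buffer of previously generated images used when updating the
    discriminator, to reduce model oscillation. With probability 1/2 a
    query returns the incoming image unchanged; otherwise it returns a
    random stored image and stores the new one in its place."""
    def __init__(self, capacity=50, seed=None):
        self.capacity = capacity
        self.images = []
        self.rng = random.Random(seed)

    def query(self, image):
        if len(self.images) < self.capacity:
            self.images.append(image)   # fill the buffer first
            return image
        if self.rng.random() < 0.5:
            idx = self.rng.randrange(self.capacity)
            old, self.images[idx] = self.images[idx], image
            return old                  # train D on an older fake
        return image
```

During training, the discriminator sees `buffer.query(fake)` instead of the freshly generated `fake`.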
IV-A. Qualitative Results
[Fig. 4 panels: (a) Input, (b) Initial, (c) Attention, (d) Final; (e) Input, (f) Initial, (g) Attention, (h) Final]
Fig. 3 shows visual results of image translation from horse to zebra. For each image, we show the initial translation G_0(x), the attention map G_att(x), and the final result composited from G_0(x) and x based on G_att(x). We also compare the results with CycleGAN and UNIT; all models are trained for the same number of iterations, and for the baselines we use the original authors' implementations. We can see from the examples that without the attention branch, our simple translation model already gives results similar to or better than [80, 42]. However, all these results suffer from perturbations of the background color/texture and artifacts near the region of interest. With the predicted attention map, which learns to segment the horses, our final results have much higher visual quality, with the background kept untouched and artifacts near the ROI removed (rows 2, 4). Complete results of horse-zebra translations and comparisons are available online at http://www.harryyang.org/img_trans.
Fig. 4 shows more results on a variety of datasets. We can see that for all these tasks, our model learns the region of interest and generates compositions that are not only more faithful to the input, but also have fewer artifacts. For example, in dog to cat translation, we notice that most attention maps have large values around the eyes, indicating that the eyes are a key ROI for differentiating cats from dogs. In the photo to DSLR example, the ROI should be the background that we wish to defocus, yet the initial translation changes the color of the foreground flower; the final result, on the other hand, learns to keep the color of the foreground flower. In the second example of summer to winter translation, the initial result incorrectly changes the color of the person, and with the guidance of the attention map, the final result removes such artifacts.
In a few scenarios, the attention map is less useful because the image does not explicitly contain a region of interest and should be translated everywhere. In this case, the composited results largely rely on the initial prediction given by G_0. This is true for tasks like edges to shoes/handbags, SYNTHIA to cityscape (Fig. 5), and photo to map (Fig. 9). Although many of these tasks have very different source and target domains, our method is general and can be applied to obtain satisfying results.
To better demonstrate the effectiveness of our simple model, Fig. 6 shows several results before training with the attention branch and compares them with the baselines. We can see that even without the attention branch, our model generates better qualitative results than CycleGAN and UNIT (more samples of photo to Van Gogh are available online at http://www.harryyang.org/img_trans/vangogh).
[Fig. 6 panels: (a) Input, (b) CycleGAN, (c) UNIT, (d) Ours w/o attention]
User study: To evaluate the performance more rigorously, we perform a user study to compare the results. The procedure is as follows: we asked for feedback from 22 users (all graduate students and researchers). Each user is given 30 sets of images to compare. Each set has 5 images: the input, the initial result (w/o attention), the final result (with attention), the CycleGAN result, and the UNIT result. In total, there are 300 different image sets randomly selected from the horse to zebra and photo to Van Gogh translation tasks. The images in each set are presented in random order. The user is then asked to rank the four results from highest visual quality to lowest. The user is fully informed about the task and is aware that the goal is to translate the input image into a new domain while avoiding unnecessary changes.
Table I shows the user-study results. We list results for: CycleGAN vs. ours initial/final; UNIT vs. ours initial/final; and ours initial vs. ours final. We can see that our results, even without applying the attention branch (ours initial), achieve higher ratings than CycleGAN or UNIT. The attention branch also significantly improves the results (ours final). In terms of directly evaluating the effect of the attention branch, ours final is overwhelmingly better than ours initial based on user rankings (Table I, row 5). We further examined the few cases where the attention results receive lower scores, and found that the cause is incorrect attention maps (Fig. 7).
[Table I: user study results; columns: Method 1, Method 2, 1 better, About same, 2 better; the compared methods include ours before attention, ours after attention, UNIT, and CycleGAN]
Effects of using different layers as feature extractors: We experimented with using different layers of VGG-19 as feature extractors to measure the perceptual loss. Fig. 8 shows visual examples of horse to zebra translation results trained with different perceptual terms. We can see that using only high-level features as regularization leads to results that are almost identical to the input (Fig. 8 (c)), while using only low-level features leads to results that are blurry and noisy (Fig. 8 (b)). We find the balance by adopting the first three layers of VGG-19 as the feature extractor, which does a good job of image translation while avoiding too much noise or too many artifacts (Fig. 8 (d)).
IV-B. Quantitative Results
Map prediction: We translate images from satellite photos to maps with unpaired training data and compute the pixel accuracy of the predicted maps. The original photo-map dataset consists of 1096 training pairs and 1098 testing pairs, where each pair contains a satellite photo and the corresponding map. To enable unsupervised learning, we take the 1096 photos from the training set and the 1098 maps from the test set and use them as the training data. Note that no attention is used here, since the change is global, and we observe that training with attention yields similar results. At test time, we translate the test-set photos to maps and compute the accuracy: if the total RGB difference between the color of a pixel on the predicted map and that on the ground truth is larger than 12, we mark the pixel as wrong. Figure 9 and Table 10 show the visual results and the accuracy, and we can see that our approach achieves the highest map prediction accuracy. Note that Pix2Pix is trained with paired data.
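The pixel-accuracy metric above can be sketched directly; the threshold of 12 on the total RGB difference comes from the text, and the array types are our assumptions:

```python
import numpy as np

def map_pixel_accuracy(pred, gt, threshold=12):
    """Fraction of pixels whose total RGB difference to the ground truth
    is at most the threshold; pixels above it are counted as wrong.

    pred, gt: uint8 arrays of shape (H, W, 3).
    """
    # cast to a signed type before subtracting to avoid uint8 wrap-around
    diff = np.abs(pred.astype(np.int32) - gt.astype(np.int32)).sum(axis=-1)
    return float(np.mean(diff <= threshold))
```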
Unsupervised classification: We show unsupervised classification results on USPS and MNIST-M in Figure 11 and Table 12. For both tasks, we assume access to the labeled MNIST dataset. We first train a generator that maps MNIST to USPS or MNIST-M and then use the translated images and original labels to train the classifier (we do not apply the attention branch here, as we did not observe much difference after training with attention). We can see from the results that we achieve the highest accuracy on both tasks, advancing the state of the art. The qualitative results clearly show that our MNIST-translated images both preserve the original labels and are visually similar to USPS/MNIST-M. We also notice that our model achieves even better results than the model trained on target labels, and we conjecture that the classifiers benefit from the larger training set size of the MNIST dataset.
3DMM face shape prediction:
As a real-world application of our approach, we study the problem of estimating 3D face shape, which is modeled with the 3D morphable model (3DMM). 3DMM is widely used for recognition and reconstruction. For a given face, the model encodes its shape with a 100-dimensional vector. The goal of 3DMM regression is to predict this vector, and predictions are compared with the ground truth using mean squared error (MSE). Prior work proposes to train a very deep neural network for 3DMM regression. However, in reality, labeled training data for real faces is expensive to collect. We propose to use rendered faces instead, as their 3DMM parameters are readily available. We first rendered 200k faces as the source domain and use 645 human selfie photos we collected as the target domain. For testing, we use 112 3D-scanned faces we collected. For the purpose of domain adaptation, we first use our model to translate the rendered faces to real faces and use the results as the training data, assuming the 3DMM parameters stay unchanged. The 3DMM regression model is a 102-layer ResNet, trained with the translated faces. Figure 13 and Table 14 show the qualitative results and the final accuracy of 3DMM regression. From the visual results, we see that our translated faces preserve the shape of the original rendered faces and have higher quality than those produced by CycleGAN. We also reduce the 3DMM regression error compared with the baseline (trained on rendered faces and tested on real faces) and the CycleGAN results.
Fréchet Inception Distance: We also use the Fréchet Inception Distance (FID) between samples generated by our model and the target domains for quantitative evaluation. We compute FID for horse to zebra and photo to Van Gogh; results are shown in Tables II and III. For photo to Van Gogh, we observe no difference between results before and after attention, so we report a single number for our model. The FID results show that our model achieves better FID than the baselines for these tasks. For horse to zebra, our model with attention has worse FID than ours without attention and CycleGAN; we speculate that there may be correlations between foreground and background in the target domain when computing FID, so using attention might have a negative effect on FID. We also suspect that FID may not be ideal for the image translation task.
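For reference, the Fréchet distance underlying FID is computed between two Gaussians fitted to Inception features of the real and generated sets. This sketch shows only the distance itself, with the Inception feature extraction omitted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    mu*: mean feature vectors; sigma*: feature covariance matrices."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical feature distributions give a distance of zero; larger values indicate the generated samples are statistically further from the target domain.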
We propose a simple model with attention for image translation and domain adaptation, and achieve superior performance on a variety of tasks as demonstrated by both qualitative and quantitative measures. The attention module is particularly helpful for focusing the translation on regions of interest and removing unwanted changes or artifacts, and it may also be used for unsupervised segmentation or saliency detection. Extensive experiments show that our model is both powerful and general, and can be readily applied to real-world problems.
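The attention-gated output described above can be sketched as a per-pixel blend: foreground regions (attention near 1) take the translated pixels, while the background keeps the input untouched. A minimal illustration, assuming `(H, W, C)` float images and an `(H, W)` attention map in [0, 1] (the function name `attention_composite` is ours):

```python
import numpy as np

def attention_composite(x: np.ndarray, translated: np.ndarray,
                        attn: np.ndarray) -> np.ndarray:
    """Blend a translated image with the input using a predicted
    attention map: out = attn * translated + (1 - attn) * x.

    x, translated: (H, W, C) float arrays; attn: (H, W) map in [0, 1].
    """
    attn = np.clip(attn, 0.0, 1.0)[..., None]  # add channel axis to broadcast
    return attn * translated + (1.0 - attn) * x
```

An all-zero map returns the input unchanged, which is why the attention mechanism suppresses spurious background edits.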
-  H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
-  V. Blanz, S. Romdhani, and T. Vetter. Face identification across different poses and illuminations with a 3d morphable model. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 202–207. IEEE, 2002.
-  V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.
-  K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
-  K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
-  R. Caseiro, J. F. Henriques, P. Martins, and J. Batista. Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3846–3854, 2015.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems, pages 577–585, 2015.
-  P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518, 2016.
-  M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset. In CVPR Workshop on the Future of Datasets in Vision, volume 1, page 3, 2015.
-  J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon. Neural network recognizer for hand-written zip code digits. In Advances in neural information processing systems, pages 323–331, 1989.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340. ACM, 2001.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1510–1519. IEEE, 2017.
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 746–753. IEEE, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
-  T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3677, 2016.
-  P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014.
-  Y. LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 2, 2010.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
-  C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
-  T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
-  M.-Y. Liu. Unsupervised Image-to-Image Translation. https://github.com/mingyuliutw/UNIT, 2017. [Online; accessed 7-May-2018].
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
-  M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
-  A. Mahendran, H. Bilen, J. F. Henriques, and A. Vedaldi. Researchdoom and cocodoom: learning computer vision with games. arXiv preprint arXiv:1610.02431, 2016.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint ArXiv:1611.04076, 2016.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  W. Qiu and A. Yuille. Unrealcv: Connecting computer vision to unreal engine. In European Conference on Computer Vision, pages 909–916. Springer, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
-  D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. In International Conference on Machine Learning, volume 2, 2014.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286, 2016.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, volume 6, page 8, 2016.
-  A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discriminative 3d morphable models with a very deep neural network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1493–1502. IEEE, 2017.
-  E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell. Towards adapting deep visuomotor representations from simulated to real environments. CoRR, abs/1511.07111, 2015.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4068–4076. IEEE, 2015.
-  T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
-  X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
-  S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
-  H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint, 2017.
-  D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
-  R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.
-  J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE, 2017.