Exemplar Guided Unsupervised Image-to-Image Translation

05/28/2018 ∙ by Liqian Ma, et al. ∙ ETH Zurich

Image-to-image translation has recently become a popular topic. Most works focus on either one-to-one mapping in an unsupervised way or many-to-many mapping in a supervised way. However, a more practical setting is many-to-many mapping in an unsupervised way, which is harder due to the lack of supervision and the complex inner- and cross-domain variations. To alleviate these issues, we propose the Exemplar Guided UNsupervised Image-to-image Translation (EG-UNIT) network, which conditions the image translation process on an image in the target domain. An image representation is assumed to comprise both content information, which is shared across domains, and style information, which is specific to one domain. By applying exemplar-based adaptive instance normalization to the shared content representation, EG-UNIT manages to transfer the style information in the target domain to the source domain. Experimental results on various datasets show that EG-UNIT can indeed translate the source image to diverse instances in the target domain with semantic consistency.




1 Introduction

Image-to-image (I2I) translation refers to the task of mapping an image from a source domain to a target domain, e.g. semantic maps to real images, gray-scale to color images, low-resolution to high-resolution images, and so on. The recent advances in deep learning have greatly improved the quality of I2I translation methods for a number of applications, including super-resolution, colorization [31], inpainting [24], attribute transfer [17], style transfer [4], and domain adaptation [8, 21]. Most of these works [11, 28, 33] have been very successful in these cross-domain I2I translation tasks because they rely on large datasets of paired training data as supervision. However, for many tasks it is not easy, or even possible, to obtain such paired data that show how an image in the source domain should be translated to an image in the target domain, e.g. in cross-city street view translation or male-female face translation. For this unsupervised setting, Zhu et al. [32] proposed to use a cycle-consistency loss, which assumes that a mapping from domain A to B, followed by its reverse operation, approximately yields an identity function. Liu et al. [21] further proposed a shared-latent space constraint, which assumes that a pair of corresponding images from domains A and B respectively can be mapped to the same representation in a shared latent space Z. Note that all the aforementioned methods assume that there is a deterministic one-to-one mapping between the two domains, i.e. each image in A is translated to only a single image in B. By doing so, they fail to capture the multimodal nature of the image distribution within the target domain, e.g. different color and style of shoes in sketch-to-image translation and different seasons in synthetic-to-real street view translation.

In this work, we propose Exemplar Guided & Semantically Consistent I2I Translation (EGSC-IT) to explicitly address this issue. As in concurrent works [10, 17, 6], we assume that an image is composed of two disentangled representations: first, a domain-shared representation that models the content in the image, and second, a domain-specific representation that contains the style information. However, for a multimodal domain with complex inner variations, such as the ones we target in this paper, e.g. street views of day-and-night or different seasons, it is difficult to have a single static representation which covers all variations in that domain. Moreover, it is unclear which style (time-of-day/season) to pick during the image translation process. To handle such multimodal I2I translations, some approaches [1, 17, 6] incorporate noise vectors as additional inputs to the generator, but as shown in [11, 33] this could lead to mode collapse issues. Instead, we propose to condition the image translation process on an arbitrary image in the target domain, i.e. an exemplar. By doing so, EGSC-IT not only enables multimodal (i.e. many-to-many) image translations, but also allows for explicit control over the translation process, since by using different exemplars as guidance we are able to translate an input image into images of different styles within the target domain – see Fig. 1.

Figure 2: The source-to-target translation procedure of our EGSC-IT framework. 1) The source domain image is fed into an encoder to compute a shared latent code, which is further decoded to a common high-level content representation. 2) Meanwhile, the source image is also fed into a sub-network to compute feature masks. 3) The target domain exemplar image is fed to a sub-network to compute the affine parameters for AdaIN. 4) The content representation is transferred to the target domain using the feature masks and the AdaIN affine parameters, and is further decoded to an image by the target domain generator.

To instantiate this idea, we adopt the weight sharing architecture proposed in UNIT [21], but instead of having a single latent space shared by both domains, we propose to decompose the latent space into two components according to the two disentangled representations presented above. That is, a domain-shared component that focuses on the image content, and a domain-specific component that captures the style information associated with the exemplar. In our particular case, the domain-shared content component contains semantic information, such as the objects’ category, shape and spatial layout, while the domain-specific style component contains the style information, such as the color and texture, to be translated from a target domain exemplar to an image in the source domain. To realize this translation, we apply adaptive instance normalization (AdaIN) [9] to the shared content component of the source domain image using the AdaIN parameters computed from the target domain exemplar. However, directly applying AdaIN to the feature maps of the shared content component would mix up all objects and scenes in the image, making the image translation prone to failure when an image contains diverse objects and scenes. To tackle this problem, existing works [22, 5, 19, 8] use semantic labels as an additional form of supervision. However, ground-truth semantic labels are not easy to obtain for most tasks as they require labor-intensive annotations. Instead, to maintain the semantic consistency during image translation without using any semantic labels, we propose to compute feature masks. One can think of feature masks as attention modules that approximately decouple different semantic categories in an unsupervised way under the guidance of perceptual losses and adversarial loss. In particular, one feature mask corresponding to a certain semantic category is applied to one feature map of the shared content component, and consequently the AdaIN for that channel is only required to capture and model the style difference for that category, e.g. sky’s style in two domains. To the best of our knowledge, this is the first line of work that addresses the semantic consistency issue under this setting. See Fig. 2 for an overview of EGSC-IT.

Our contribution is three-fold. i) We propose a novel approach for the I2I translation task, which enables multimodal (i.e. many-to-many) mappings and allows for explicit style control over the translation process. ii) We introduce the concept of feature masks for the unsupervised, multimodal I2I translation task, which provides coarse semantic guidance without using any semantic labels. iii) Evaluation on different datasets shows that our method is robust to mode collapse and can generate results with semantic consistency, conditioned on a given exemplar image.

2 Related work

I2I translation. I2I translation is used to learn a mapping from one image (i.e. source domain) to another (i.e. target domain). Recently, with the advent of generative models [7, 14], there have been a lot of works on this topic. Isola et al. [11] proposed pix2pix to learn the mapping from input images to output images using a U-Net neural network in an adversarial way. Wang et al. [28] extended the method to pix2pixHD, to turn semantic label maps into high-resolution photo-realistic images. Zhu et al. [33] extended pix2pix to BicycleGAN, which can model multimodal distributions and produce both diverse and realistic results. All these methods, however, require paired training data as supervision which may be difficult or even impossible to collect in many scenarios, such as synthetic-to-real street view translation or face-to-cartoon translation [26].

Recently, several unsupervised methods have been proposed to learn the mappings between two image collections without paired training data. Note that this is an ill-posed problem since there are infinitely many mappings between two unpaired image domains. To address this ill-posed problem, different constraints have been added to the network to regularize the learning process [32, 13, 30, 21, 26]. One popular constraint is cycle-consistency, which enforces the network to learn deterministic mappings for various applications. Going one step further, Liu et al. [21] proposed a shared-latent space constraint which encourages a pair of images from different domains to be mapped to the same representation in the latent space. In a similar vein, Royer et al. [26] proposed to enforce a feature-level constraint with a latent embedding reconstruction loss. However, we argue that these constraints are not well suited for complex domains with large inner-domain variations, as also mentioned in [1, 17, 6, 20]. Unlike these methods, to address this problem we propose to add a target domain exemplar as guidance during image translation through AdaIN [9]. As explained in the previous section, the AdaIN technique is utilized to transfer the style component from the target domain exemplar to the shared content component of the source domain image. This allows multimodal (i.e. many-to-many) translations and can produce images of desired styles with explicit control over the translation process. Concurrent to our work, MUNIT [10] also proposed to use AdaIN to transfer style information from the target domain to the source domain. Unlike MUNIT, before applying AdaIN to the shared content component we compute feature masks to decouple different semantic categories and preserve the semantic consistency during the translation process. In particular, by applying feature masks to the feature maps of the shared content component, each channel can specialize and model the style difference only for a single semantic category, which is crucial when handling domains with complex scenes.

Style transfer. Style transfer aims at transferring the style information from an exemplar image to a content image, while preserving the content information. The seminal work by Gatys et al. [4] proposed to transfer style information by matching the feature correlations, i.e. Gram matrices, in the convolutional layers of a deep neural network (DNN) following an iterative optimization process. In order to improve the speed and flexibility, several feed-forward neural networks have been proposed. Huang & Belongie [9] proposed a simple but effective method, called AdaIN, which aligns the mean and variance of the content image features with those of the style image features. Li et al. [18] proposed the whitening and coloring transform (WCT) algorithm, which directly matches the features’ covariance in the content image to those in the given style image. However, due to the lack of semantic consistency during translation, these stylization methods usually generate non-photorealistic images, suffering from the “spills over” problem [22]. To address this, semantic label maps are used as additional supervision to help style transfer between corresponding semantic regions [22, 5, 19]. Unlike these works, we propose to compute feature masks to approximately model such semantic information without using any semantic labels, which are very hard to collect.

& InfoFusion
CycleGAN - - - Low -
UNIT - - - Low -
Augmented CycleGAN - - Low -
CDD-IT Swap feature - Low -
DRIT Swap feature - Low -
MUNIT AdaIN - Middle Depends
EGSC-IT (Ours) AdaIN High
Table 1: Comparison of unpaired I2I translation networks: CycleGAN [32], UNIT [21], Augmented CycleGAN [1], CDD-IT [6], DRIT [17], MUNIT [10], EGSC-IT (Ours).

Table 1 summarizes the features of the most related works. As can be seen, our method using the combination of AdaIN and feature masks under the guidance of perceptual loss is, to the best of our knowledge, the first to achieve multimodal I2I translations in the unsupervised setting with high semantic consistency, without requiring any ground-truth semantic labels.

3 Method

Our goal is to learn a many-to-many mapping between two domains in an unsupervised way, which is guided by the style of an exemplar while retaining the semantic consistency at the same time. For example, a synthetic street view image can be translated to either a day-time or night-time realistic scene, depending on the exemplar. To realize this, similarly to concurrent works [10, 17, 6] we assume that an image can be decomposed into two disentangled components. In our case, that is, one modeling the shared content between domains, i.e. domain-shared content component, and another modeling the style information specific to exemplars in the target domain, i.e. domain-specific style component. In what follows, we present our EGSC-IT framework, the architecture of its networks, and the learning procedure.

3.1 Framework

For simplicity, we present EGSC-IT in the A→B direction – see Fig. 2. Each image domain (i.e. source and target) is modeled by a VAE-GAN [15], which includes an encoder, a generator, and a discriminator. For the B→A direction, the translation process as well as the notation are analogous.

Weight sharing for domain-shared content. To learn the content component of an image pair that is shared across the source and target domains, we employ the weight sharing strategy proposed in UNIT [21]. The latter assumes that the two domains, A and B, share a common latent space, and that any pair of corresponding images from the two domains can be mapped to the same latent representation in this shared latent space. They achieve this by simply sharing the weights of the last layer of the two encoders as well as the first layer of the two generators. For more details about the weight-sharing strategy we refer the reader to the original UNIT paper.

Exemplar-based AdaIN for domain-specific style. The shared content component contains semantic information, such as the objects’ category, shape and spatial layout, but no style information, e.g. their color and texture. Inspired by Huang & Belongie [9], who showed that AdaIN’s affine parameters have a big influence on the output image’s style, we propose to apply AdaIN to the shared content component before the decoding stage. In particular, the exemplar from the target domain is fed to another network (see Fig. 2, blue line) to compute a set of feature maps, which are expected to contain the style information of the target domain. As in [9], means and variances are calculated across the spatial dimensions for each channel of these feature maps and used as AdaIN’s affine parameters. The shared content component is first normalized by these affine parameters, as shown in Eq. 2, and then decoded to a target-domain image using the target domain generator. Since different affine parameters normalize the feature statistics in different ways, by using different exemplar images in the target domain as input we can translate an image in the source domain to different sub-styles in the target domain. Therefore, EGSC-IT not only allows for multimodal I2I translations, but at the same time enables the user to have explicit style control over the translation process.
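As a rough illustration of the exemplar-based AdaIN step, the NumPy sketch below applies the AdaIN operation of [9] channel by channel: the content features are normalized with their own spatial statistics, then re-scaled and shifted with the statistics of the exemplar's feature maps. The names `adain`, `content`, and `style_feats`, the use of the standard deviation as the scale, and the small `eps` term are assumptions made for this example and are not taken from the EGSC-IT implementation.

```python
import numpy as np

def adain(content, style_feats, eps=1e-5):
    """Channel-wise AdaIN for (C, H, W) arrays (hypothetical helper)."""
    # Per-channel statistics over the spatial dimensions (H, W).
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    # Affine parameters taken from the exemplar's feature maps.
    beta = style_feats.mean(axis=(1, 2), keepdims=True)
    gamma = style_feats.std(axis=(1, 2), keepdims=True)
    # Normalize the content, then re-scale and shift with the exemplar statistics.
    return gamma * (content - c_mean) / (c_std + eps) + beta

# Random arrays standing in for real feature maps.
rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))
style_feats = 3.0 * rng.normal(size=(4, 8, 8)) + 2.0
out = adain(content, style_feats)
```

After the call, each channel of `out` has approximately the mean and standard deviation of the matching channel of `style_feats`, which is why swapping the exemplar changes the style of the decoded image.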

Feature masks for semantic consistency. Directly applying AdaIN to the shared content component does not give satisfying results. The reason is that one channel in the shared content component is likely to contain information from multiple objects and scenes. The difference of these objects and scenes between the two domains is not always uniform, due to the large inner- and cross-domain variations. As such, applying AdaIN over a feature map with complex semantics is prone to mix styles of different objects and scenes together, hence failing to give semantically consistent translations. To tackle this problem, existing works use semantic labels as an additional form of supervision. However, ground-truth semantic labels are not easy to obtain for most tasks as they require labor-intensive annotations. Instead, we propose to compute feature masks (see Fig. 2, red line) to make an approximate estimation of semantic categories without using any ground-truth semantic labels. The feature masks, which can be regarded as attention modules, are computed by applying a nonlinear activation function, namely the sigmoid, and a threshold to feature maps computed from the source image. Feature masks contain substantial semantic information, which can be used to retain the semantic consistency during translation, e.g. translating the source sky into the target sky style without affecting the other scene elements. The new normalized representation is obtained by combining the feature masks with the shared content component through the Hadamard (element-wise) product.
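The feature-mask computation can be sketched as follows, under stated assumptions: the mask is taken to be binary (1 where the sigmoid of a feature exceeds the threshold), the threshold value 0.5 is arbitrary, and the names `feature_masks`, `mask_feats`, and `content` are hypothetical. The text does not specify these details, nor exactly where the masked features enter AdaIN, so this only illustrates the sigmoid, threshold, and Hadamard product steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_masks(mask_feats, threshold=0.5):
    """Binary masks: 1 where sigmoid(feature) exceeds the threshold (assumed form)."""
    return (sigmoid(mask_feats) > threshold).astype(mask_feats.dtype)

# Toy (C=1, H=2, W=2) arrays standing in for real feature maps.
mask_feats = np.array([[[-2.0, 0.5], [3.0, -0.1]]])
content = np.array([[[1.0, 2.0], [3.0, 4.0]]])

masks = feature_masks(mask_feats)
masked_content = masks * content  # Hadamard (element-wise) product
```

Only the positions where the mask is 1 keep their content activations, so a channel whose mask covers a single region (say, the sky) would only need to model the style change of that region.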

During training, there are four types of information flow – see Fig. 3. For the reconstruction flow A→A, the shared content component, the feature masks, and the AdaIN parameters are all computed from the same domain A image (and likewise for B→B). For the translation flow A→B, the shared content component and the feature masks are computed from the source image in domain A, while AdaIN’s affine parameters are computed from the target domain exemplar in domain B (and vice versa for B→A).

Figure 3: Information flow diagrams of the auto-encoding procedures A→A and B→B, and the translation procedures A→B and B→A.

3.2 Network architecture

The overall framework can be divided into several sub-networks (for more details we refer the reader to the supplementary material). 1) Two encoders, one per domain. Each one consists of several strided convolutional layers and several residual blocks to compute the shared content component. 2) A feature mask network and an AdaIN network for the A→B translation (and vice versa for B→A), which have the same architecture as the encoder above except for the weight-sharing layers. 3) Two generators, one per domain, which are almost symmetric to the encoders except that the up-sampling is done by transposed convolutional layers. 4) Two discriminators, one per domain, which are fully-convolutional networks containing a stack of convolutional layers. 5) A VGG sub-network [27] that contains the first few layers (up to relu5_1) of a pre-trained VGG-19 [27], which is used to calculate perceptual losses. Note that although we use UNIT as our baseline framework to build upon, this is not a hard restriction. In theory, UNIT can be replaced with any baseline framework with similar functionality.

3.3 Learning

The learning procedure of EGSC-IT contains VAEs, GANs, cycle-consistency and perceptual losses. To make the training more stable, we first pre-train the feature mask network and the AdaIN network for each domain separately within a VAE-GAN architecture, and use the encoder parts as fixed feature extractors for the remaining training. The overall loss (Eq. 3) combines these terms, where the VAEs, GANs and cycle-consistency losses are identical to the ones used in Liu et al. [21]. The perceptual loss consists of the content loss, captured by feature maps containing localized spatial information, and the style loss, captured by the Gram matrix containing non-localized style information, similar to [4, 12]. The content and style losses are combined with separate weights, which depend on the dataset domain variations and tasks.

We use the first convolutional layer of the five blocks in the VGG sub-network to extract the feature maps. The losses for the opposite translation direction are defined likewise. For the content losses, a linear weighting scheme is adopted to help the network focus more on the high-level semantic information. In both content and style losses we use the L1 distance, which in our experiments outperforms L2.
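To make the structure of the perceptual loss more concrete, here is a minimal NumPy sketch of an L1 content loss on feature maps and an L1 style loss on Gram matrices. Several details are assumptions for illustration only: random arrays stand in for the VGG feature maps, the content term compares the translated output with the source image while the style term compares it with the exemplar, the linear weighting is taken to grow with layer depth, and the Gram normalization and the unit loss weights are arbitrary. The helper names are hypothetical.

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a (C, H, W) feature map, normalized by its size."""
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def content_loss(feats_out, feats_src):
    """L1 distance between feature maps, with weights growing linearly with depth."""
    n = len(feats_out)
    weights = np.arange(1, n + 1) / np.arange(1, n + 1).sum()
    return sum(w * np.abs(a - b).mean()
               for w, a, b in zip(weights, feats_out, feats_src))

def style_loss(feats_out, feats_exemplar):
    """L1 distance between Gram matrices, averaged over layers."""
    return np.mean([np.abs(gram_matrix(a) - gram_matrix(b)).mean()
                    for a, b in zip(feats_out, feats_exemplar)])

# Random arrays standing in for feature maps from five VGG layers.
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4), (64, 2, 2), (64, 1, 1)]
f_out = [rng.normal(size=s) for s in shapes]   # translated image
f_src = [rng.normal(size=s) for s in shapes]   # source image
f_ex = [rng.normal(size=s) for s in shapes]    # target-domain exemplar

# Placeholder weights of 1.0 for the content and style terms.
perceptual = 1.0 * content_loss(f_out, f_src) + 1.0 * style_loss(f_out, f_ex)
```

Because the Gram matrix discards spatial positions, the style term only constrains feature correlations, whereas the content term keeps the spatial layout, matching the localized versus non-localized distinction made in the text.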

Now that we have introduced all losses, we can explain how these losses help to achieve I2I translation, multimodal translation, and semantic consistency. I2I translation: the VAE, GAN, and cycle-consistency losses help to maintain the shared latent space by relating the two different domains and finding the optimal translation between the two in an unsupervised way. Multimodal translation: the GAN and perceptual losses help to encourage the translated image to look not only like the main mode of variation in domain B, but also like an exemplar from domain B, since the domain space is actually supported by each data sample. Semantic consistency: the perceptual loss encourages the network to utilize the feature mask information for semantic consistency, without relying on hard correspondences between semantic labels as existing works do.

4 Experiments

We evaluate EGSC-IT’s translation ability, i.e. how well it generates domain-realistic-looking and semantically consistent images, both qualitatively and quantitatively on three tasks with progressively increasing visual complexity: 1) single-digit translation; 2) multi-digit translation; 3) street view translation. We first perform an ablation study on various components of EGSC-IT on the single-digit translation task. Then, we present results on more challenging translation tasks, and evaluate EGSC-IT quantitatively on the semantic segmentation task. In supplementary material, we apply EGSC-IT to the face gender translation task and perform the ablation study on the street-view translation task.

Figure 4: Single-digit translation testing results. The left-most four columns are samples from domains A and B, and the reference ground-truth translations in both directions. * Models are trained on the same MNIST-Single data as EGSC-IT. Best viewed in color.

Single-digit translation. We set up a controlled experiment on the MNIST-Single dataset, which is created based on the handwritten digits dataset MNIST [16]. The MNIST-Single dataset consists of two different domains as shown in Fig. 4. For domain A of both training/test sets, the foreground and background are randomly set to black or white but different from each other. For domain B of training set, the foreground and background for digits from 0 to 4 are randomly assigned a color from {red, green, blue}, and the foreground and background for digits from 5 to 9 are fixed to red and green, respectively. For domain B of testing set, the foreground and background of all digits are randomly assigned a color from {red, green, blue}. Such data imbalance is designed on purpose to test the translation diversity and generalization ability. In particular, for diversity, we want to check whether a method would suffer from the mode collapse issue and translate the images to the dominant mode, i.e. (red, green), while for generalization, we want to check whether the model can be applied to new styles in the target domain that never appear in the training set, e.g. translate number 6 from black foreground and white background to blue foreground and red background.
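The domain B coloring rule described above can be sketched as follows. This is not the authors' data-generation code: the binarization threshold, the exact RGB values, the helper names `colorize` and `sample_colors`, and the choice to always draw two distinct colors are assumptions made for this illustration.

```python
import numpy as np

COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def colorize(digit, fg, bg, threshold=128):
    """Turn an (H, W) grayscale digit into an (H, W, 3) image with flat fg/bg colors."""
    mask = digit >= threshold                  # True on the digit stroke
    out = np.empty(digit.shape + (3,), dtype=np.uint8)
    out[mask] = COLORS[fg]
    out[~mask] = COLORS[bg]
    return out

def sample_colors(label, rng, train=True):
    """Digits 5-9 are fixed to (red, green) in the training set; others are random."""
    if train and label >= 5:
        return "red", "green"
    # Assumption: foreground and background are drawn as two distinct colors.
    fg, bg = rng.permutation(list(COLORS))[:2]
    return str(fg), str(bg)

rng = np.random.default_rng(0)
digit = np.array([[0, 255], [255, 0]], dtype=np.uint8)  # toy 2x2 "digit"
fg, bg = sample_colors(label=7, rng=rng, train=True)
image = colorize(digit, fg, bg)
```

With `train=False` the fixed rule is skipped, so digits 5 to 9 can receive color pairs that never occur for them in the training split, which is what the generalization test relies on.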

Method | A→B | B→A
CycleGAN | 0.214±0.168 | 0.089±0.166
UNIT | 0.178±0.160 | 0.074±0.158
MUNIT | 0.463±0.094 | 0.227±0.128
EGSC-IT w/o feature mask | 0.395±0.137 | 0.133±0.171
EGSC-IT w/o AdaIN | 0.208±0.166 | 0.080±0.167
EGSC-IT w/o perceptual loss | 0.286±0.183 | 0.093±0.169
EGSC-IT | 0.478±0.090 | 0.232±0.131
Table 2: SSIM evaluation for single-digit translation. Higher is better.
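For readers unfamiliar with SSIM, the sketch below computes a simplified variant from whole-image statistics. The standard metric averages the same expression over local windows, and the text does not say which implementation was used, so the function `global_ssim` and its toy inputs are illustrative assumptions only. The constants follow the usual SSIM definition, with stabilizers (0.01·L)² and (0.03·L)² for a data range L.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Toy images in [0, 1] standing in for a ground-truth and a translated digit.
rng = np.random.default_rng(0)
reference = rng.random((28, 28))
noisy = np.clip(reference + 0.2 * rng.normal(size=(28, 28)), 0.0, 1.0)
scores = [global_ssim(reference, reference), global_ssim(reference, noisy)]
```

An image compared with itself scores 1, and the score drops as the translation departs from the reference; per-image scores of this kind would then be summarized over the test set.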

We first analyze the importance of three main components of EGSC-IT, i.e. feature masks, AdaIN, and perceptual loss, on the MNIST-Single dataset. As shown in Fig. 4, EGSC-IT can successfully transfer the source image into the style of the exemplar image. Ablating the feature mask from EGSC-IT leads to incorrect foreground and background shapes, indicating that feature masks can indeed provide semantic information to transfer the corresponding local regions. Without AdaIN, the network suffers from the mode collapse issue in A→B translation, i.e. all samples are transferred to the dominant mode with red foreground and green background, indicating that the exemplar’s style information can help the network to learn many-to-many mappings and avoid the mode collapse issue. Without perceptual losses, the colors of foreground and background are incorrect, which shows that perceptual losses can encourage the network to learn semantic knowledge, in this case foreground and background, without ground-truth semantic labels. As for other I2I translation methods, CycleGAN [32] and UNIT [21] can only do deterministic image translations and suffer from the mode collapse issue, such as white→green and black→red for CycleGAN in Fig. 4. MUNIT [10] can successfully transfer the style of the exemplar image, but the foreground and background are mixed and the digit’s shape is not kept well. These qualitative observations are in accordance with the quantitative results in Tab. 2, where our full EGSC-IT obtains higher SSIM scores than all other alternatives. In addition, we compare with other style transfer methods, Neural ST [4], AdaIN [9], and WCT [18]. In each case, we resize the input image to 512×512 resolution and choose the best performing hyper-parameters. Note how style transfer methods can transfer the style successfully but fail to keep semantic consistency. Quantitative results for style transfer methods are in the supplementary material.

Figure 5: Single-digit translation t-SNE embedding visualization. Red dots: real samples. Blue crosses: generated samples. Best viewed in color.

To verify EGSC-IT’s ability to match the target domain distributions of real data and translated results, we visualize them using t-SNE embeddings [23] in Fig. 5. The t-SNE embeddings are calculated from the translated images after PCA dimensionality reduction. Our method can match the distributions well, while others either collapse to a few modes or mismatch the distributions.

Figure 6: Multi-digit translation. Left: testing results. Right: t-SNE embedding visualization. Red dots: real samples. Blue crosses: generated samples. Best viewed in color.

Figure 7: Street view translation testing results. Best viewed in color.

Multi-digit translation. The MNIST-Multiple dataset is another controlled experiment designed to mimic the complexity in real-world scenarios. It is used to test whether the network understands the semantics, i.e. digits, in an image and translates each digit accordingly. Each image in MNIST-Multiple contains all ten digits, which are randomly placed in a 4×4 grid. Two domains are designed: in domain A, the foreground and background are randomly set to black or white, but different from each other; in domain B, the background is randomly assigned to either black or white and each foreground digit is assigned to a certain color, but with a little saturation and lightness perturbation. Our goal is to encourage the network to understand the semantic information, i.e. the different digits and backgrounds, when translating an image from domain A to domain B. That is, a successfully translated image should have the content of domain A, the digit class, and the style of domain B, the digit and background colors respectively. This experiment is quite challenging, but we observe that our model can still obtain good results without the need for ground-truth semantic labels or paired data. For example, in Fig. 6 (top row) the digits 1, 2, 3, 4, and 6 can be successfully translated given the criteria described above. As seen in Fig. 6, MUNIT cannot translate the foreground color with semantic consistency, and the colors look more “fake”.

Method | mIoU | mIoU Gap
Source | 0.329 | -0.119
UNIT | 0.297 | -0.151
MUNIT | 0.331 | -0.117
EGSC-IT | 0.343 | -0.105
Oracle | 0.448 | 0
Table 3: Semantic segmentation evaluation on 256×512 resolution.

Street view translation. We carry out a synthetic ↔ real experiment for street view translation between the GTA5 [25] and Berkeley Deep Drive (BDD) [29] datasets. The street view datasets are more complex than the digit ones (different illumination/weather conditions, complex environments). As shown in Fig. 7, our method can successfully translate the images from the source to the target domain according to the exemplar’s style. For small variations, e.g. day→day (first row), MUNIT can keep up. However, for large variations, e.g. day→night and vice versa (second row), which is exactly the problem we examine in this paper, only EGSC-IT can successfully translate details like the proper sky color and illumination condition w.r.t. the exemplar. Similar to the FCN-score used by Isola et al. [11], we also use the semantic segmentation performance to quantitatively evaluate the image translation quality. We first translate images in the GTA5 dataset, each guided by an arbitrary image from the BDD dataset as exemplar. We only generate images of size 256×512 due to the limitation on GPU memory. Then, we train a single-scale Deeplab model [2] on the translated images and test it on the BDD test set. The mean Intersection over Union (mIoU) scores in Tab. 3 show that training with our translated synthetic images can improve the segmentation results, which indicates that our method can indeed reasonably translate the source GTA5 image to the target domain style with semantic consistency and reduce the domain difference successfully.
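As a reminder of how the metric is formed, the sketch below computes mIoU for a single pair of label maps. It is a generic illustration rather than the evaluation script used here: the function name `mean_iou`, the toy label maps, and the choice to skip classes absent from both maps are assumptions, and benchmark scores are usually accumulated over the whole test set rather than per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        pred_c, target_c = pred == cls, target == cls
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, giving an mIoU of 0.5. The "mIoU Gap" column of Tab. 3 is consistent with each method's mIoU minus the Oracle's, e.g. 0.343 − 0.448 = −0.105 for EGSC-IT.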

5 Discussions

Since our method uses neither semantic segmentation labels nor paired data, there are some artifacts in the results for some hard cases. For example, in street view translation, day→night and night→day (e.g. Fig. 7 bottom row) are more challenging than day→day (e.g. Fig. 7 top row). As a result, it is sometimes hard for our model to understand the semantics in such cases. In the future, it would be interesting to extend our method to the semi-supervised setting in order to benefit from the presence of some fully-labeled data.

6 Conclusion

We introduced the EGSC-IT framework to learn a multimodal mapping across domains in an unsupervised way. Under the guidance of an exemplar from the target domain, we showed how to combine AdaIN with feature masks in order to transfer the style of the exemplar to the source image, while retaining semantic consistency at the same time. Numerous quantitative and qualitative results demonstrate the effectiveness of our method in this particular setting.


We gratefully acknowledge the support of Toyota Motors Europe, FWO Structure from Semantics project, and KU Leuven GOA project CAMETRON.


  • Almahairi et al. [2018] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
  • Chen et al. [2018] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
  • Dong et al. [2014] Chao Dong, Chen C. Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
  • Gatys et al. [2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • Gatys et al. [2017] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling perceptual factors in neural style transfer. In CVPR, 2017.
  • Gonzalez-Garcia et al. [2018] Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross-domain disentanglement. In NIPS, 2018.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
  • Huang & Belongie [2017] Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
  • Huang et al. [2018] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • Kim et al. [2017] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
  • Kingma & Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Larsen et al. [2016] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Lee et al. [2018] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
  • Li et al. [2017] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In NIPS, 2017.
  • Li et al. [2018] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
  • Lin et al. [2018] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to-image translation. In CVPR, 2018.
  • Liu et al. [2017] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
  • Luan et al. [2017] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In CVPR, 2017.
  • Maaten & Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Pathak et al. [2016] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • Richter et al. [2016] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
  • Royer et al. [2017] Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, and Kevin Murphy. Xgan: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.
  • Simonyan & Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Wang et al. [2018] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018.
  • Xu et al. [2017] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-scale video datasets. In CVPR, 2017.
  • Yi et al. [2017] Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
  • Zhang et al. [2016] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
  • Zhu et al. [2017a] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017a.
  • Zhu et al. [2017b] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017b.