PyTorch Implementation of ECCV 2020 Spotlight "TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images"
An unsupervised image-to-image translation (UI2I) task deals with learning a mapping between two domains without paired images. While existing UI2I methods usually require numerous unpaired images from different domains for training, there are many scenarios where training data is quite limited. In this paper, we argue that even if each domain contains a single image, UI2I can still be achieved. To this end, we propose TuiGAN, a generative model that is trained on only two unpaired images and amounts to one-shot unsupervised learning. With TuiGAN, an image is translated in a coarse-to-fine manner where the generated image is gradually refined from global structures to local details. We conduct extensive experiments to verify that our versatile method can outperform strong baselines on a wide variety of UI2I tasks. Moreover, TuiGAN is capable of achieving comparable performance with the state-of-the-art UI2I models trained with sufficient data.
Unsupervised image-to-image translation (UI2I) tasks aim to map images from a source domain to a target domain with the main source content preserved and the target style transferred, while no paired data is available to train the models. Recent UI2I methods have achieved remarkable successes [26, 22, 38, 25, 3]. Among them, conditional UI2I gets much attention, where two images are given: an image from the source domain used to provide the main content, and the other one from the target domain used to specify which style the main content should be converted to. To achieve UI2I, typically one needs to collect numerous unpaired images from both the source and target domains.
However, we often come across cases for which there might not be enough unpaired data to train the image translator. An extreme case resembles one-shot unsupervised learning, where only one image in the source domain and one image in the target domain are given but unpaired. Such a scenario has a wide range of real-world applications, e.g., taking a photo and then converting it to a specific style of a given picture, or replacing objects in an image with target objects for image manipulation. In this paper, we take the first step towards this direction and study UI2I given only two unpaired images.
Note that the above problem subsumes the conventional image style transfer task. Both problems require one source image and one target image, which serve as the content and style images, respectively. In image style transfer, the features used to describe the styles (such as the Gram matrix of pre-trained deep features) of the translated image and the style image should match (e.g., Fig. 1(a)). In our generalized problem, not only the style but the higher-level semantic information should also match. As shown in Fig. 1(c), on the zebra-to-horse translation, not only the background style (e.g., prairie) is transferred, but the high-level semantics (i.e., the profile of the zebra) is also changed.
Achieving UI2I requires the models to effectively capture the variations of domain distributions between two domains, which is the biggest challenge for our problem since there are only two images available. To realize such one-shot translation, we propose a new conditional generative adversarial network, TuiGAN, which is able to transfer the domain distribution of the input image to the target domain by progressively translating the image from coarse to fine. The progressive translation enables the model to extract the underlying relationship between two images by continuously varying the receptive fields at different scales. Specifically, we use two pyramids of generators and discriminators to refine the generated result progressively from global structures to local details. For each pair of generators at the same scale, they are responsible for producing images that look like the target domain ones. For each pair of discriminators at the same scale, they are responsible for capturing the domain distributions of the two domains at the current scale. The “one-shot” term in our paper is different from the ones in [1, 4], which use a single image from the source domain and a set of images from the target domain for UI2I. In contrast, we only use two unpaired images from two domains in our work.
We conduct extensive experimental validation with comparisons to various baseline approaches using various UI2I tasks, including horse↔zebra, facade↔labels, aerial photos↔maps, apple↔orange, and so on. The experimental results show that the versatile approach effectively addresses the problem of one-shot image translation. We show that our model can not only outperform existing UI2I models in the one-shot scenario, but more remarkably, also achieve comparable performance with UI2I models trained with sufficient data.
Our contributions can be summarized as follows:
We propose TuiGAN to realize image-to-image translation with only two unpaired images.
We leverage two pyramids of conditional GANs to progressively translate images from coarse to fine.
We demonstrate that a wide range of UI2I tasks can be tackled using our versatile model.
propose to infer correspondences between a source image and another target image using a Bayesian framework. With the development of deep neural networks, the advent of Generative Adversarial Networks (GANs) has inspired many works in I2I. Isola et al. propose a conditional GAN called “pix2pix” for a wide range of supervised I2I tasks. However, paired data may be difficult or even impossible to obtain in many cases. DiscoGAN, CycleGAN and DualGAN are proposed to tackle the unsupervised image-to-image translation (UI2I) problem by constraining two cross-domain translation models to maintain cycle-consistency. Liu et al. propose the FUNIT model for few-shot UI2I. However, FUNIT requires not only a large amount of training data and computation resources to infer unseen domains, but also the training data and unseen domains to share similar attributes. Our work does not require any pre-training or any specific form of data. Related to our work, Benaim et al. and Cohen et al. propose to solve the one-shot cross-domain translation problem, which aims to learn a unidirectional mapping function given a single image from the source domain and a set of images from the target domain. Moreover, their methods cannot translate images in the opposite direction, as they claim that one seen sample in the target domain is insufficient for capturing the domain distribution. In contrast, in this work we focus on solving UI2I given only two unpaired images from two domains and realizing I2I in both directions.
Image style transfer can be traced back to Hertzmann et al.’s work. More recent approaches use neural networks to learn the style statistics. Gatys et al. first model image style transfer by matching the Gram matrices of pre-trained deep features. Luan et al. further propose to realize photorealistic style transfer which can preserve the photorealism of the content image. To avoid inconsistent stylizations in semantically uniform regions, Li et al. introduce a two-step framework in which both steps have a closed-form solution. However, it is difficult for these models to transfer higher-level semantic structures, such as object transformation. We demonstrate that our model can outperform Li et al. in various UI2I tasks.
Single image generative models aim to capture the internal distribution of an image. Conditional GAN based models have been proposed for texture expansion and image retargeting. InGAN is trained with a single natural input and learns its internal patch-distribution by an image-specific GAN. Unconditional GAN based models also have been proposed for texture synthesis [2, 23, 16] and image manipulation. In particular, SinGAN employs an unconditional pyramidal generative model to learn the patch distribution based on images of different scales. However, these single image generative models usually take one image into consideration and do not capture the relationship between two images. In contrast, our model aims to capture the distribution variations between two unpaired images. In this way, our model can transfer an image from a source distribution to a target distribution while maintaining its internal content consistency.
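Given two images $I_A \in \mathcal{A}$ and $I_B \in \mathcal{B}$, where $\mathcal{A}$ and $\mathcal{B}$ are two image domains, our goal is to convert $I_A$ to $I_{AB} \in \mathcal{B}$ and $I_B$ to $I_{BA} \in \mathcal{A}$ without any other data accessible. Since we have only two unpaired images, the translated result (e.g., $I_{AB}$) should inherit the domain-invariant features of the source image (e.g., $I_A$) and replace the domain-specific features with the ones of the target image (e.g., $I_B$) [38, 22, 13]. To realize such image translation, we need to obtain a pair of mapping functions $G_{AB}: \mathcal{A} \to \mathcal{B}$ and $G_{BA}: \mathcal{B} \to \mathcal{A}$, such that
$$I_{AB} = G_{AB}(I_A), \quad I_{BA} = G_{BA}(I_B). \tag{1}$$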
Our formulation aims to learn the internal domain distribution variation between $I_A$ and $I_B$. Considering that the training data is quite limited, $G_{AB}$ and $G_{BA}$ are implemented as two multi-scale conditional GANs that progressively translate images from coarse to fine. In this way, the training data can be fully leveraged at different resolution scales. We downsample $I_A$ and $I_B$ to $N$ different scales, and then obtain $\{I_A^0, I_A^1, \dots, I_A^N\}$ and $\{I_B^0, I_B^1, \dots, I_B^N\}$, where $I_A^n$ and $I_B^n$ are downsampled from $I_A$ and $I_B$, respectively, by a scale factor $(1/s)^n$ ($s \in \mathbb{R}$).
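To make the pyramid concrete, the following minimal sketch lists the image sizes obtained when an image is repeatedly downsampled by a fixed factor. The function name `pyramid_sizes`, the 256x256 input size, the value 4/3 for the scale factor, and the reading of "4 scales" as indices 0 to 4 are all assumptions made for illustration; they are not taken from the text above.

```python
def pyramid_sizes(height, width, num_scales, s):
    """Return [(h_0, w_0), ..., (h_N, w_N)], where scale n is the input
    downsampled by (1/s)**n. Index 0 is the finest scale, N the coarsest."""
    sizes = []
    for n in range(num_scales + 1):
        factor = (1.0 / s) ** n
        h = max(1, round(height * factor))
        w = max(1, round(width * factor))
        sizes.append((h, w))
    return sizes


# Assumed values for illustration only: 256x256 input, N = 4, s = 4/3.
sizes = pyramid_sizes(256, 256, num_scales=4, s=4 / 3)
for n, (h, w) in enumerate(sizes):
    print(f"scale {n}: {h}x{w}")
```

With these assumed values the coarsest image keeps roughly a third of the original side length, so the coarsest generator sees mostly global layout while the finest one sees full-resolution detail.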
In previous literature, multi-scale architectures have been explored for unconditional image generation with multiple training images [18, 19, 5, 12], conditional image generation with multiple paired training images, and image generation with a single training image. In this paper, we leverage the benefit of multi-scale architecture for one-shot unsupervised learning, in which only two unpaired images are used to learn UI2I.
The network architecture of the proposed TuiGAN is shown in Fig. 2. The entire framework consists of two symmetric translation models: $G_{AB}$ for $I_A \to I_{AB}$ (the top part in Fig. 2) and $G_{BA}$ for $I_B \to I_{BA}$ (the bottom part in Fig. 2). $G_{AB}$ and $G_{BA}$ are made up of a series of generators, $\{G_{AB}^n\}_{n=0}^{N}$ and $\{G_{BA}^n\}_{n=0}^{N}$, which can achieve image translation at the corresponding scales. At each image scale, we also need discriminators $D_A^n$ and $D_B^n$ ($n \in \{0, 1, \dots, N\}$), which are used to verify whether the input image is a natural one in the corresponding domain.
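Progressive Translation The translation starts from images with the lowest resolution and gradually moves to the higher resolutions. $G_{AB}^N$ and $G_{BA}^N$ first map $I_A^N$ and $I_B^N$ to the corresponding target domains:
$$I_{AB}^N = G_{AB}^N(I_A^N), \quad I_{BA}^N = G_{BA}^N(I_B^N). \tag{2}$$
For images with scales $n < N$, the generator $G_{AB}^n$ has two inputs, $I_A^n$ and the previously generated $I_{AB}^{n+1}$. Similarly, $G_{BA}^n$ takes $I_B^n$ and $I_{BA}^{n+1}$ as inputs. Mathematically,
$$I_{AB}^n = G_{AB}^n(I_A^n, I_{AB}^{n+1}\uparrow), \quad I_{BA}^n = G_{BA}^n(I_B^n, I_{BA}^{n+1}\uparrow), \tag{3}$$
where $\uparrow$ means to use bicubic upsampling to resize the image by a scale factor $s$. Leveraging $I_{AB}^{n+1}$, $G_{AB}^n$ can refine the previous output with more details, and $I_{AB}^{n+1}$ also provides the global structure of the target image for the current resolution. Eqn. (3) is iteratively applied until the eventual outputs $I_{AB}^0$ and $I_{BA}^0$ are obtained.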
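The coarse-to-fine procedure described above can be sketched as a simple loop over scales. This is an illustrative sketch, not the authors' implementation: images are plain nested lists, `upsample_nearest` is a nearest-neighbour stand-in for the bicubic upsampling mentioned in the text, and `toy_generator` is a hypothetical placeholder for a trained generator.

```python
def upsample_nearest(img, out_h, out_w):
    """Resize a 2-D list of floats with nearest-neighbour sampling
    (a stand-in for the bicubic upsampling described in the text)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[min(in_h - 1, i * in_h // out_h)][min(in_w - 1, j * in_w // out_w)]
             for j in range(out_w)]
            for i in range(out_h)]


def translate_coarse_to_fine(pyramid, generators):
    """pyramid[n] is the source image at scale n (0 = finest, N = coarsest).
    generators[n](src, prev) returns the translation at scale n; prev is None
    at the coarsest scale and the upsampled previous output otherwise."""
    coarsest = len(pyramid) - 1
    out = generators[coarsest](pyramid[coarsest], None)
    for n in range(coarsest - 1, -1, -1):
        h, w = len(pyramid[n]), len(pyramid[n][0])
        prev_up = upsample_nearest(out, h, w)
        out = generators[n](pyramid[n], prev_up)
    return out


def toy_generator(src, prev):
    """Hypothetical generator: inverts the source and, when a previous
    output exists, averages the inversion with it."""
    if prev is None:
        return [[1.0 - v for v in row] for row in src]
    return [[0.5 * (1.0 - v) + 0.5 * p for v, p in zip(rs, rp)]
            for rs, rp in zip(src, prev)]


# Three-scale toy pyramid of all-zero images: 4x4 (finest), 2x2, 1x1 (coarsest).
pyramid = [[[0.0] * 4 for _ in range(4)],
           [[0.0] * 2 for _ in range(2)],
           [[0.0]]]
result = translate_coarse_to_fine(pyramid, [toy_generator] * 3)
print(len(result), len(result[0]))
```

The loop makes the data flow explicit: only the coarsest generator works from the source alone, and every finer generator receives both the source at its own scale and the upsampled result of the scale below.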
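Scale-aware Generator The network architecture of $G_{AB}^n$ is shown in Fig. 3. Note that $G_{AB}^n$ and $G_{BA}^n$ share the same architecture but have different weights. $G_{AB}^n$ consists of two fully convolutional networks, $\Phi$ and $\Psi$. Mathematically, $G_{AB}^n$ works as follows:
$$I_{AB,\Phi}^n = \Phi(I_A^n), \quad A^n = \Psi(I_{AB,\Phi}^n, I_A^n, I_{AB}^{n+1}\uparrow), \quad I_{AB}^n = A^n \otimes I_{AB,\Phi}^n + (1 - A^n) \otimes I_{AB}^{n+1}\uparrow, \tag{4}$$
where $\otimes$ represents pixel-wise multiplication. As shown in Eqn. (4), we first use $\Phi$ to preprocess $I_A^n$ into $I_{AB,\Phi}^n$ as the initial translation. Then, we use an attention model $\Psi$ to generate a mask $A^n$, which models long term and multi-scale dependencies across image regions [36, 30]. $\Psi$ takes $I_{AB,\Phi}^n$, $I_A^n$ and $I_{AB}^{n+1}\uparrow$ as inputs and outputs $A^n$ to balance the two scales’ results. Finally, $I_{AB,\Phi}^n$ and $I_{AB}^{n+1}\uparrow$ are linearly combined through the generated $A^n$ to get the output $I_{AB}^n$.
Similarly, the translation $I_B \to I_A$ at the $n$-th scale is implemented as follows:
$$I_{BA,\Phi}^n = \Phi(I_B^n), \quad A^n = \Psi(I_{BA,\Phi}^n, I_B^n, I_{BA}^{n+1}\uparrow), \quad I_{BA}^n = A^n \otimes I_{BA,\Phi}^n + (1 - A^n) \otimes I_{BA}^{n+1}\uparrow. \tag{5}$$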
In this way, the generator focuses on the regions of the image that are responsible for synthesizing details at the current scale and keeps the global structure learned at the previous scale untouched. As shown in Fig. 3, the previous generator $G_{AB}^{n+1}$ has generated the global structure of a zebra in $I_{AB}^{n+1}$, but still fails to generate stripe details. At the $n$-th scale, the current generator $G_{AB}^n$ generates an attention map to add stripe details on the zebra and produces a better result $I_{AB}^n$.
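The mask-based combination of the two scales can be illustrated with a small pixel-wise blend. In this sketch the mask weights the initial translation at the current scale and its complement weights the upsampled previous output; that assignment, the function name `blend_with_attention`, and the toy values are assumptions for illustration.

```python
def blend_with_attention(initial, previous_up, mask):
    """Pixel-wise mask * initial + (1 - mask) * previous_up
    for three same-sized 2-D lists of floats."""
    return [[a * x + (1.0 - a) * p for x, p, a in zip(rx, rp, ra)]
            for rx, rp, ra in zip(initial, previous_up, mask)]


initial = [[1.0, 1.0]]       # hypothetical initial translation at the current scale
previous_up = [[0.0, 0.0]]   # hypothetical upsampled output of the coarser scale
mask = [[1.0, 0.25]]         # hypothetical attention mask with values in [0, 1]

blended = blend_with_attention(initial, previous_up, mask)
print(blended)  # [[1.0, 0.25]]
```

Where the mask is close to 1 the output takes the newly synthesized detail, and where it is close to 0 the coarser result passes through unchanged, which is how the global structure is preserved.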
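Our model is progressively trained from low resolution to high resolution. Each scale is kept fixed after training. For any $n \in \{0, 1, \dots, N\}$, the overall loss function of the $n$-th scale is defined as follows:
$$\mathcal{L}_{ALL}^n = \mathcal{L}_{ADV}^n + \lambda_{CYC} \mathcal{L}_{CYC}^n + \lambda_{IDT} \mathcal{L}_{IDT}^n + \lambda_{TV} \mathcal{L}_{TV}^n, \tag{6}$$
where $\mathcal{L}_{ADV}^n$, $\mathcal{L}_{CYC}^n$, $\mathcal{L}_{IDT}^n$ and $\mathcal{L}_{TV}^n$ refer to the adversarial loss, cycle-consistency loss, identity loss and total variation loss respectively, and $\lambda_{CYC}$, $\lambda_{IDT}$, $\lambda_{TV}$ are hyper-parameters to balance the tradeoff among the loss terms. At each scale, the generators aim to minimize $\mathcal{L}_{ALL}^n$ while the discriminators are trained to maximize $\mathcal{L}_{ALL}^n$. We now introduce the details of these loss functions.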
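Adversarial Loss The adversarial loss builds upon the fact that the discriminator tries to distinguish real images from synthetic images and the generator tries to fool the discriminator by generating realistic images. At each scale $n$, there are two discriminators $D_A^n$ and $D_B^n$, which take an image as input and output the probability that the input is a natural image in the corresponding domain. We choose WGAN-GP as the adversarial loss, which improves the stability of adversarial training by replacing weight clipping with a gradient penalty:
$$\mathcal{L}_{ADV}^n = D_B^n(I_B^n) - D_B^n(I_{AB}^n) + D_A^n(I_A^n) - D_A^n(I_{BA}^n) - \lambda_{PEN}\left(\|\nabla_{\hat{I}_B^n} D_B^n(\hat{I}_B^n)\|_2 - 1\right)^2 - \lambda_{PEN}\left(\|\nabla_{\hat{I}_A^n} D_A^n(\hat{I}_A^n)\|_2 - 1\right)^2, \tag{7}$$
where $\hat{I}_B^n = \alpha I_B^n + (1 - \alpha) I_{AB}^n$ and $\hat{I}_A^n = \alpha I_A^n + (1 - \alpha) I_{BA}^n$ with $\alpha \sim U(0, 1)$, and $\lambda_{PEN}$ is the penalty coefficient.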
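As a rough illustration of the gradient penalty, the sketch below uses a toy linear critic, whose gradient with respect to its input is simply its weight vector, so no automatic differentiation is needed. The helper names, the weights, and the coefficient 0.1 are assumptions for illustration; a real implementation would obtain the gradient at the interpolated sample from an autograd engine.

```python
import math
import random


def interpolate(real, fake, alpha):
    """Random point on the segment between a real and a generated sample."""
    return [alpha * r + (1.0 - alpha) * f for r, f in zip(real, fake)]


def gradient_penalty_linear(weights, lam):
    """Penalty lam * (||grad||_2 - 1)**2 for the toy critic
    D(x) = sum(w_i * x_i), whose input gradient equals `weights` everywhere."""
    grad_norm = math.sqrt(sum(w * w for w in weights))
    return lam * (grad_norm - 1.0) ** 2


rng = random.Random(0)
alpha = rng.random()                                # alpha drawn from U(0, 1)
x_hat = interpolate([1.0, 1.0], [0.0, 0.0], alpha)  # sample where the gradient is evaluated
penalty = gradient_penalty_linear([3.0, 4.0], lam=0.1)  # gradient norm 5, assumed lam
print(x_hat, penalty)
```

The penalty is zero only when the critic's gradient norm at the interpolated sample equals 1, which is the soft Lipschitz constraint that takes the place of weight clipping.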
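Cycle-Consistency Loss One of the training problems of conditional GANs is mode collapse, i.e., a generator produces an especially plausible output whatever the input is. We utilize the cycle-consistency loss to constrain the model to retain the inherent properties of the input image after translation: $\forall n \in \{0, 1, \dots, N\}$,
$$\mathcal{L}_{CYC}^n = \|I_A^n - I_{ABA}^n\|_1 + \|I_B^n - I_{BAB}^n\|_1, \tag{8}$$
where $I_{ABA}^N = G_{BA}^N(I_{AB}^N)$ and $I_{BAB}^N = G_{AB}^N(I_{BA}^N)$ at the coarsest scale, and $I_{ABA}^n = G_{BA}^n(I_{AB}^n, I_{ABA}^{n+1}\uparrow)$ and $I_{BAB}^n = G_{AB}^n(I_{BA}^n, I_{BAB}^{n+1}\uparrow)$ for $n < N$.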
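Identity Loss We noticed that relying on the two losses mentioned above for one-shot image translation could easily lead to results with misaligned color and texture. To tackle the problem, we introduce the identity loss at each scale, which is denoted as $\mathcal{L}_{IDT}^n$. Mathematically,
$$\mathcal{L}_{IDT}^n = \|I_A^n - I_{AA}^n\|_1 + \|I_B^n - I_{BB}^n\|_1, \tag{9}$$
where $I_{AA}^N = G_{BA}^N(I_A^N)$ and $I_{BB}^N = G_{AB}^N(I_B^N)$ at the coarsest scale, and $I_{AA}^n = G_{BA}^n(I_A^n, I_{AA}^{n+1}\uparrow)$ and $I_{BB}^n = G_{AB}^n(I_B^n, I_{BB}^{n+1}\uparrow)$ for $n < N$.
We found that the identity loss can effectively preserve the consistency of color and texture tone between the input and the output images, as shown in Section 4.4.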
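The cycle-consistency and identity terms can both be written as L1 distances between an input and a generator output. The sketch below is a single-scale simplification under stated assumptions: the generators are plain callables taking one image (no coarser-scale input), the L1 norm is assumed as the distance, and the toy generators `g_ab` and `g_ba` are hypothetical.

```python
def l1(x, y):
    """Sum of absolute differences between two same-sized 2-D lists."""
    return sum(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def cycle_loss(img_a, img_b, g_ab, g_ba):
    """A -> B -> A and B -> A -> B should reproduce the inputs."""
    return l1(img_a, g_ba(g_ab(img_a))) + l1(img_b, g_ab(g_ba(img_b)))


def identity_loss(img_a, img_b, g_ab, g_ba):
    """A generator fed an image already in its target domain should leave it unchanged."""
    return l1(img_a, g_ba(img_a)) + l1(img_b, g_ab(img_b))


def g_ab(img):
    """Hypothetical A -> B generator: brightens every pixel by 1."""
    return [[v + 1.0 for v in row] for row in img]


def g_ba(img):
    """Hypothetical B -> A generator: darkens every pixel by 1."""
    return [[v - 1.0 for v in row] for row in img]


img_a = [[0.0, 0.5]]
img_b = [[1.0, 1.5]]
print(cycle_loss(img_a, img_b, g_ab, g_ba))     # 0.0
print(identity_loss(img_a, img_b, g_ab, g_ba))  # 4.0
```

The toy generators are exact inverses, so the cycle term vanishes, yet the identity term is large because each generator still shifts images that already belong to its target domain. This is the kind of color drift the identity loss is meant to penalize.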
Total Variation Loss To avoid noisy and overly pixelated results, we introduce the total variation (TV) loss, which helps remove rough texture from the generated image and yields a smoother, more spatially continuous result. It encourages images to consist of several patches by calculating the differences of neighboring pixel values in the image. Let $x[i, j]$ denote the pixel located in the $i$-th row and $j$-th column of image $x$. The TV loss at the $n$-th scale is defined as follows:
$$\mathcal{L}_{TV}^n = L_{tv}(I_{AB}^n) + L_{tv}(I_{BA}^n), \quad L_{tv}(x) = \sum_{i,j} \sqrt{(x[i, j+1] - x[i, j])^2 + (x[i+1, j] - x[i, j])^2}, \quad x \in \{I_{AB}^n, I_{BA}^n\}. \tag{10}$$
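A total variation penalty of this kind can be computed directly from neighbouring pixel differences. The sketch below assumes the isotropic form (square root of the summed squared horizontal and vertical differences) on a single-channel image and skips the last row and column, where a neighbour is missing; both choices are assumptions for illustration.

```python
import math


def tv_loss(img):
    """Isotropic total variation of a 2-D list of floats.
    Pixels in the last row and last column are skipped because they
    lack a lower or right neighbour (a border-handling assumption)."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]  # horizontal difference
            dy = img[i + 1][j] - img[i][j]  # vertical difference
            total += math.sqrt(dx * dx + dy * dy)
    return total


flat = [[0.5, 0.5], [0.5, 0.5]]
edgy = [[0.0, 3.0], [4.0, 0.0]]
print(tv_loss(flat))  # 0.0
print(tv_loss(edgy))  # 5.0
```

A constant image has zero total variation, while abrupt changes between neighbours increase it, so minimizing this term favours piecewise-smooth outputs.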
Network Architecture As mentioned before, all generators share the same architecture and they are all fully convolutional networks. In detail, the initial-translation network $\Phi$ is constructed by 5 blocks of the form 3x3 Conv-BatchNorm-LeakyReLU with stride 1. The attention network $\Psi$ is constructed by 4 blocks of the form 3x3 Conv-BatchNorm-LeakyReLU. For each discriminator, we use the Markovian discriminator (PatchGAN), which has the same 11x11 patch size as $\Phi$ so that it keeps the same receptive field as the generator.
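The 11x11 patch size follows from stacking five 3x3 convolutions with stride 1, as the short calculation below shows. The helper `receptive_field` is written for this illustration and uses the standard recurrence for stacked convolutions without dilation.

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field (in input pixels) of a stack of convolutions.
    Each layer adds (kernel - 1) * jump, where jump is the product
    of the strides of all earlier layers."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


five_blocks = receptive_field([3] * 5, [1] * 5)
four_blocks = receptive_field([3] * 4, [1] * 4)
print(five_blocks)  # 11
print(four_blocks)  # 9
```

Each 3x3 layer with stride 1 widens the receptive field by 2 pixels, so five layers give 1 + 5 * 2 = 11, matching the patch size quoted above.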
Training Settings We train our networks using Adam  with initial learning rate , and we decay the learning rate after every 1600 iterations. We set our scale factor and train 4000 iterations for each scale. The number of scale is set to 4. For all experiments, we set weight parameters , , and .
We conduct experiments on several tasks of unsupervised image-to-image translation, including the general UI2I tasks (in this paper, we refer to general UI2I as tasks where there are multiple images in the source and target domains), image style transfer, animal face translation and painting-to-image translation, to verify our versatile TuiGAN. To construct datasets for one-shot image translation, given a specific task (like horse↔zebra translation), we randomly sample one image from the source domain and one from the target domain, and train models on the selected data.
We compare TuiGAN with two types of baselines. The first type leverages the full training data without subsampling. We choose the CycleGAN and DRIT algorithms for image synthesis. The second type leverages partial data, even one or two images only. We choose the following baselines:
(1) OST, where one image from the source domain and a set of images in the target domain are given;
(2) SinGAN, which is a pyramidal unconditional generative model trained on only one image from the target domain, and injects an image from the source domain into the trained model for image translation;
(3) PhotoWCT, which can be considered as a special kind of image-to-image translation model, where a content photo is transferred to the reference photo’s style while remaining photorealistic;
(4) FUNIT, which targets few-shot UI2I and requires lots of data for pre-training. We test the one-shot translation of FUNIT;
(5) ArtStyle, which is a classical art style transfer model.
For all the above baselines, we use their officially released code to produce the results.
(1) Single Image Fréchet Inception Distance (SIFID): SIFID captures the difference of internal distributions between two images, which is implemented by computing the Fréchet Inception Distance (FID) between deep features of two images. A lower SIFID score indicates that the style of two images is more similar. We compute SIFID between the translated image and the corresponding target image.
(2) Perceptual Distance (PD): PD computes the perceptual distance between images. A lower PD score indicates that the content of two images is more similar. We compute PD between the translated image and the corresponding source image.
(3) User Preference (UP): We conduct user preference studies for performance evaluation since the qualitative assessment is highly subjective.
We first conduct general experiments on the Facade↔Labels, Apple↔Orange, Horse↔Zebra and Map↔Aerial Photo translation tasks to verify the effectiveness of our algorithm. The visual results of our proposed TuiGAN and the baselines are shown in Fig. 4.
Overall, the images generated by TuiGAN exhibit better translation quality than those of OST, SinGAN, PhotoWCT and FUNIT. While both SinGAN and PhotoWCT change global colors of the source image, they fail to transfer the high-level semantic structures as our model does (e.g., in Facade↔Labels and Horse↔Zebra). Although OST is trained with the full training set of the target domain and transfers high-level semantic structures in some cases, the generated results contain many noticeable artifacts, e.g., the irregular noise on apples and oranges. Compared with CycleGAN and DRIT trained on full datasets, TuiGAN achieves comparable results. In some cases TuiGAN produces better results than these two models, e.g., in the Labels→Facade and Zebra→Horse tasks, which further verifies that our model can actually capture domain distributions with only two unpaired images.
The results of average SIFID, PD and UP are reported in Table 1. For the user preference study, we randomly select 8 unpaired images and generate 8 translated images for each general UI2I task. In total, we collect 32 translated images for each subject to evaluate. On each webpage, we display the source image, the target image and two translated images, one from our model and one from a baseline method, in random order. We ask each subject to select the better translated image on each page. We collect feedback from 18 subjects, giving 576 votes in total and 96 votes for each comparison. We compute the percentage of votes in which a method is selected as its User Preference (UP) score.
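The vote counts quoted above can be checked with a few lines of arithmetic. The number of comparisons (6, one per baseline shown against TuiGAN) is inferred from 576 / 96 rather than stated in the text, and `user_preference` with its example vote split is a hypothetical helper for illustration.

```python
subjects = 18
tasks = 4             # the four general UI2I tasks
images_per_task = 8
comparisons = 6       # assumed: one comparison per baseline method

images_per_subject = tasks * images_per_task         # 32 translated images per subject
total_votes = subjects * images_per_subject          # 576 votes overall
votes_per_comparison = total_votes // comparisons    # 96 votes per comparison


def user_preference(votes_for_method, votes_in_comparison):
    """Percentage of votes in which the method was chosen."""
    return 100.0 * votes_for_method / votes_in_comparison


print(images_per_subject, total_votes, votes_per_comparison)
print(user_preference(48, votes_per_comparison))  # hypothetical even split -> 50.0
```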
We can see that TuiGAN obtains the best SIFID score among all the baselines, which shows that our model successfully captures the distributions of images in the target domain. In addition, our model achieves the third place in PD score after CycleGAN and PhotoWCT. From the visual results, we can see that PhotoWCT can only change global colors of the source image, which is the reason why it achieves the best PD score. As for the user study, most of the users prefer the translation results generated by TuiGAN to those of OST, SinGAN, PhotoWCT and FUNIT. Compared with CycleGAN and DRIT trained on full data, our model also receives a similar number of votes from subjects.
We demonstrate the effectiveness of TuiGAN on two kinds of image style transfer: art style transfer, which converts an image to the target artistic style with specific strokes or textures, and photorealistic style transfer, which produces a stylized photo that remains photorealistic. Results are shown in Fig. 5. As can be seen in the first row of Fig. 5, TuiGAN retains the architectural contour and generates a stylized result with vivid strokes that resembles Van Gogh’s painting. In contrast, SinGAN fails to generate a clear stylized image, and PhotoWCT only changes the colors of the real photo without capturing the salient painting patterns. In the second row, we transfer the night image to a photorealistic day image with the key semantic information retained. Although SinGAN and ArtStyle produce realistic style, they fail to maintain detailed edges and structures. The result of PhotoWCT is also not as clean as ours. Overall, our model achieves competitive performance on both types of image style transfer, while other methods usually target one specific task and fail on the other.
Animal Face Translation To compare with the few-shot model FUNIT, which is pretrained on an animal face dataset, we conduct the animal face translation experiments shown in Fig. 6. We also include SinGAN and PhotoWCT for comparison. As we can see, our model transfers the fur colors from the image in the target domain to the image in the source domain better than the other baselines: SinGAN generates results with faint artifacts and a blurred dog shape; PhotoWCT cannot transfer high-level style features (e.g., spots) from the target image although it preserves the content well; and FUNIT generates results that are not consistent with the target dog’s appearance.
Painting-to-Image Translation This task focuses on generating a photo-realistic image with more details based on a roughly related clipart, as described in SinGAN. We use the two samples provided by SinGAN for comparison. The results are shown in Fig. 7. Although the two testing images share similar elements (e.g., trees and road), their styles are extremely different. Therefore, PhotoWCT and ArtStyle fail to transfer the target style in the two translation cases. SinGAN also fails to generate specific details, such as leaves on the road in the first row of Fig. 7, and to maintain accurate content, such as mountains and clouds in the second row of Fig. 7. In contrast, our method preserves the crucial components of the input and generates rich local details in both cases.
To investigate the influences of different training losses, generator architecture and multi-scale structure, we conduct several ablation studies based on the Horse↔Zebra task. Specifically,
(1) Fixing $N = 4$, we remove the cycle-consistency loss (TuiGAN w/o $\mathcal{L}_{CYC}$), identity loss (TuiGAN w/o $\mathcal{L}_{IDT}$) and total variation loss (TuiGAN w/o $\mathcal{L}_{TV}$) and compare the differences;
(2) We range $N$ from 0 to 4 to see the effect of different scales. When $N = 0$, our model can be roughly viewed as CycleGAN trained with two unpaired images;
(3) We remove the attention model in the generators, and combine the initial translation $I_{AB,\Phi}^n$ and the upsampled previous output $I_{AB}^{n+1}\uparrow$ by simple addition (briefly denoted as TuiGAN w/o $A$).
The qualitative results are shown in Fig. 8. Without $\mathcal{L}_{IDT}$, the generated results suffer from inaccurate color and texture (e.g., green color on the transferred zebra). Without the attention mechanism or $\mathcal{L}_{CYC}$, our model cannot guarantee the completeness of the object shape (e.g., missing legs in the transferred horse). Without $\mathcal{L}_{TV}$, our model produces images with artifacts (e.g., colour spots around the horse). The results from $N = 0$ to $N = 3$ either contain poor global content information (e.g., the horse layout) or have obvious artifacts (e.g., the zebra stripes). Our full model (TuiGAN $N = 4$) captures the salient content of the source image and transfers remarkable style patterns of the target image.
We compute the quantitative ablations by assessing SIFID and PD scores of different variants of TuiGAN. As shown in Table 2, our full model still obtains the lowest SIFID score and PD score, which indicates that our TuiGAN could generate more realistic and stylized outputs while preserving the content unchanged.
In this paper, we propose TuiGAN, a versatile conditional generative model for image-to-image translation that is trained on only two unpaired images. Our model is designed in a coarse-to-fine manner, in which two pyramids of conditional GANs refine the result progressively from global structures to local details. In addition, a scale-aware generator is introduced to better combine two scales’ results. We validate the capability of TuiGAN on a wide variety of unsupervised image-to-image translation tasks by comparing with several strong baselines. Ablation studies also demonstrate that the losses and network scales are reasonably designed. Our work represents a further step toward the possibility of unsupervised learning with extremely limited data.
Bergmann, U., Jetchev, N., Vollgraf, R.: Learning texture manifolds with the periodic spatial gan. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 469–477. JMLR.org (2017)
Cohen, T., Wolf, L.: Bidirectional one-shot unsupervised domain mapping. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1784–1792 (2019)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. pp. 694–711. Springer (2016)
In this section, we provide additional results of four general unpaired image-to-image translation tasks: Apple↔Orange in Fig. 9, Horse↔Zebra in Fig. 10, Facade↔Labels in Fig. 11, and Map↔Aerial Photo in Fig. 12. From the additional results provided, we can further verify that most of the translation results generated by TuiGAN are better than those of OST, SinGAN, PhotoWCT and FUNIT. Compared with CycleGAN and DRIT trained on full data, our model can also achieve comparable performance in many cases.