SingleGAN: Image-to-Image Translation by a Single-Generator Network using Multiple Generative Adversarial Learning. ACCV 2018
Image translation is a burgeoning field in computer vision where the goal is to learn the mapping between an input image and an output image. However, most recent methods require multiple generators to model different domain mappings, which is inefficient and ineffective on some multi-domain image translation tasks. In this paper, we propose a novel method, SingleGAN, to perform multi-domain image-to-image translation with a single generator. We introduce a domain code to explicitly control the different generative tasks and integrate multiple optimization goals to ensure each translation. Experimental results on several unpaired datasets show the superior performance of our model in translation between two domains. Besides, we explore variants of SingleGAN for different tasks, including one-to-many domain translation, many-to-many domain translation and one-to-one domain translation with multimodality. The extended experiments show the universality and extensibility of our model.
Recently, more and more attention has been paid to image-to-image translation due to its exciting potential in a variety of image processing applications. Although existing methods show impressive results on one-to-one mapping problems, they need to build multiple generators to model multiple mappings, which is inefficient and ineffective in some multi-domain and multi-modal image translation tasks. Intuitively, many multi-mapping translation tasks are not independent and share common features, such as the scene content in transformations between different seasons. By sharing a network between related tasks, we enable our model to generalize better on each individual task. In this paper, we propose a single-generator generative adversarial network (GAN), called SingleGAN, to solve multi-mapping translation tasks effectively and efficiently. To indicate a specific mapping, we introduce the domain code as an auxiliary input to the network. Then we integrate multiple optimization goals to learn each specific translation.
As illustrated in Fig. 1, the base SingleGAN model is utilized to learn the bijection between two domains. Since each domain's dataset is not required to carry the labels of other domains, SingleGAN can make full use of existing datasets to learn multi-domain translation.
To explore the potential and generality of SingleGAN, we also extend it to three cross-domain translation tasks, which are more complex and practical. The first variant addresses the one-to-many domain translation task, which maps a source domain input to several different target domains, such as image style transfer. The second variant explores the many-to-many domain translation task. Unlike the recent method that requires detailed annotation of category information to train an auxiliary classifier, we use multiple adversarial objectives to help the network capture the different domain distributions separately. This means SingleGAN is capable of learning multi-domain mappings by weakly supervised learning, since we do not need to label all the training data with detailed annotations. The third variant attempts to increase generative diversity by introducing an attribute latent code. A similar idea is used in BicycleGAN to address the multimodal translation problem; our third model can be considered a generalization of BicycleGAN to unpaired image-to-image translation.
To summarize, our contributions are as follows:
We propose SingleGAN, a novel GAN that utilizes a single generator and a group of discriminators to accomplish unpaired image-to-image translation.
We show the generality and flexibility of SingleGAN by extending it to achieve three different kinds of translation tasks.
Experimental results demonstrate that our approach is more effective and general-purpose than several state-of-the-art methods.
Inspired by the two-player zero-sum game, a typical GAN model consists of two modules: a generator and a discriminator. While the discriminator learns to distinguish between real and fake samples, the generator learns to generate fake samples that are indistinguishable from real ones. GANs have shown impressive results in various computer vision tasks such as image generation, image editing and representation learning. Recently, GAN-based conditional image generation has also been actively studied. Specifically, various extensions of GANs have achieved good results in many generation tasks such as image inpainting and text-to-image synthesis, as well as in other domains such as videos and 3D data. In this paper, we propose a scalable GAN framework to achieve image translation based on conditional image generation.
The idea of image-to-image translation goes back to Image Analogies, in which Hertzmann et al. proposed a framework to transfer texture information from a source modality space onto a target modality space. Image-to-image translation has received more attention since the flourishing growth of GANs. The pioneering work Pix2pix uses a cGAN to perform supervised image translation from paired data. As these methods adopt supervised learning, sufficient paired data are required to train the network. However, preparing paired images can be time-consuming and laborious (e.g. artistic stylization) and even impossible for some applications (e.g. male-to-female face transfiguration). To address this issue, CycleGAN, DiscoGAN and DualGAN introduce a cycle-consistency constraint, which is also widely used in visual tracking and the language domain, to learn convincing mappings across image domains from unpaired images. Based on a shared-latent space assumption, UNIT extends the Coupled GAN to learn a joint distribution of different domains without paired images. FaderNet also succeeds in controlling attributes by adding a discriminator to the latent space. Even though these methods have promoted the development of one-to-one mapping image translation, they have limited scalability for multi-mapping translation. By introducing an auxiliary classifier in the discriminator, StarGAN achieves translation among different facial attributes with a single generator. However, this method may learn an inefficient domain mapping when the attribute labels are insufficient for training the auxiliary classifier, even with the introduced mask vector.
The main architecture is shown in Fig. 1. In order to take advantage of the correlation between two related tasks, SingleGAN adopts a single generator to achieve bidirectional translation.
The goal of the model is to learn a bidirectional mapping $G: A \leftrightarrow B$. By adding the domain code, $G$ is redefined as

$\tilde{b} = G(a, z_B), \quad \tilde{a} = G(b, z_A),$

where $\tilde{a}$ (resp. $\tilde{b}$) is the fake sample generated by the generator, the sample $a$ (resp. $b$) belongs to the set of domain $A$ (resp. $B$), and $z_A$ and $z_B$ are the domain codes for domain A and domain B respectively.
To capture the distributions of different domains with a single generator, it is necessary to indicate the desired mapping with auxiliary information. Therefore, we introduce the domain code to label the different mappings in the generator. The domain code is constructed as a one-hot vector, similar to the latent code that is widely used to indicate the attributes of a generated image [21, 3].
Recent work shows that different injection methods of the latent code affect the performance of the generative model. So we adopt the central biasing instance normalization (CBIN) proposed therein to inject the domain code into our SingleGAN model. CBIN is defined as

$\mathrm{CBIN}(x_i) = \frac{x_i - \mu(x_i)}{\sigma(x_i)} + f_i(z),$

where $i$ is the index of the feature map, $\mu$ and $\sigma$ are the per-instance mean and standard deviation, and $f_i$ is an affine transformation applied to the domain code $z$, whose parameters are learned for each feature map in a layer. CBIN adjusts the distributions of the input feature maps adaptively with learnable parameters, which enables the domain code to manage the different tasks. Meanwhile, the distance between the distributions of different tasks' inputs is also trainable, which means that the coupling degree of the different tasks is determined by the model itself. This advantage allows different tasks to share parameters, and thereby promote each other, more effectively.
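As a concrete illustration, CBIN can be sketched in a few lines of framework-free Python. This is a minimal sketch under the assumption that $f_i(z)$ is a per-channel affine transform $w \cdot z + b$; the names `cbin`, `weights` and `bias` are ours, not from the released code:

```python
import math

def cbin(feature_map, domain_code, weights, bias):
    """Central biasing instance normalization (CBIN) sketch.

    feature_map: flattened activations of ONE channel of ONE sample.
    domain_code: one-hot list, e.g. [1.0, 0.0] selects domain A.
    weights/bias: parameters of the per-channel affine f_i (learned
    in the real model; fixed here for illustration).
    """
    # Instance normalization: zero mean, unit variance over the channel.
    mu = sum(feature_map) / len(feature_map)
    var = sum((x - mu) ** 2 for x in feature_map) / len(feature_map)
    normalized = [(x - mu) / math.sqrt(var + 1e-5) for x in feature_map]
    # Central biasing: shift the whole channel by f_i(z) = w.z + b,
    # so the domain code selects a per-task bias for this channel.
    shift = sum(w * z for w, z in zip(weights, domain_code)) + bias
    return [x + shift for x in normalized]
```

Because the shift depends only on the domain code, switching the one-hot input re-biases every CBIN channel at once, steering the shared generator toward a different mapping.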
Since our single generator has multi-domain outputs, we set up an adversarial objective for each target domain and employ a group of discriminators, where the discriminator $D_X$ identifies the generated images in domain $X$. The adversarial loss for domain B is defined as

$\mathcal{L}_{adv}^{B} = \mathbb{E}_{b \sim B}[\log D_B(b)] + \mathbb{E}_{a \sim A}[\log(1 - D_B(G(a, z_B)))],$

with the loss for domain A defined symmetrically. By optimizing multiple generative adversarial objectives, the generator recovers the different domain distributions indicated by the domain codes $z_A$ and $z_B$.
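Since in practice the training later replaces the log-likelihood objective with a least-squares loss (Sect. 4), the per-domain adversarial terms can be sketched as follows. The helper names and score lists are illustrative, not the paper's API:

```python
def lsgan_d_loss(real_scores, fake_scores):
    # Least-squares GAN discriminator: push real scores to 1, fakes to 0.
    return (sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
            + sum(s ** 2 for s in fake_scores) / len(fake_scores))

def lsgan_g_loss(fake_scores):
    # Generator pushes its fakes toward the "real" label 1.
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
```

One such pair of losses is maintained per target domain; the group of discriminators is simply this objective instantiated once per domain.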
Although the above GAN losses can complete the domain translation, such a highly under-constrained mapping often leads to mode collapse: many possible mappings can be inferred without the use of pairing information. We therefore adopt the cycle-consistency constraint, which requires a translated sample to be mapped back to its source:

$\mathcal{L}_{cyc} = \mathbb{E}_{a \sim A}[\| G(G(a, z_B), z_A) - a \|_1] + \mathbb{E}_{b \sim B}[\| G(G(b, z_A), z_B) - b \|_1].$
The final objective function is defined as

$\mathcal{L} = \mathcal{L}_{adv}^{A} + \mathcal{L}_{adv}^{B} + \lambda_{cyc}\,\mathcal{L}_{cyc},$

where $\lambda_{cyc}$ controls the relative importance of the two objectives.
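The full generator objective can then be assembled as in the sketch below; the $\lambda_{cyc}$ value used in the test is illustrative only, since the paper's weight is not stated in this text:

```python
def l1_cycle_loss(original, reconstructed):
    # ||G(G(a, z_B), z_A) - a||_1, averaged over elements.
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)

def total_g_loss(adv_losses, cyc_losses, lambda_cyc=10.0):
    # Sum of per-domain adversarial terms plus weighted cycle terms.
    # lambda_cyc is a hypothetical default, not the paper's setting.
    return sum(adv_losses) + lambda_cyc * sum(cyc_losses)
```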
To explore the potential and generality of SingleGAN, we extend the above base model into three variants for different tasks: one-to-many domain translation, many-to-many domain translation and one-to-one domain translation with multi-modal mapping.
The first variant, shown in Fig. 2(a), applies to unidirectional tasks, for example multi-task detection and multi-style image transfer. As far as image style transfer is concerned, transferring a single input image into different styles is a representative task with shared semantics: our model preserves the content information of the input image and applies different styles to it. Compared with traditional style transfer methods, which learn a mapping between one content image and one style image, our model learns different mappings between image collections. For the one-to-three translation task shown in Fig. 2(a), the mapping $G$ is redefined as

$\tilde{b} = G(a, z_B), \quad \tilde{c} = G(a, z_C), \quad \tilde{d} = G(a, z_D),$

where $A$ is the source domain and $B$, $C$, $D$ are the target domains. In the meantime, the cycle-consistency loss is modified to

$\mathcal{L}_{cyc} = \sum_{X \in \{B, C, D\}} \mathbb{E}_{a \sim A}[\| G(G(a, z_X), z_A) - a \|_1].$
As illustrated in Fig. 2(b), the second variant translates images among multiple domains in both directions. In this model, our goal is to train a single generator that learns the mappings among multiple domains and realizes their mutual transformation. For a four-domain instance, the mapping $G$ is redefined as

$\tilde{y} = G(x, z_Y), \quad x \in X, \;\; X, Y \in \{A, B, C, D\}, \;\; X \neq Y,$

and $\mathcal{L}_{cyc}$ also needs to be modified as in the extended model (a).
To address the multi-modal image-to-image translation problem with unpaired data, we introduce the third variant, as shown in Fig. 2(c). Inspired by BicycleGAN, we introduce a VAE-like encoder $E$ to extract a latent attribute code for indicating the translation mapping. Although there is no paired data for supervised learning of the encoder, we utilize cycle consistency to relax the constraint. During training, we randomly sample a latent code $z$ from a standard Gaussian distribution to indicate the multimodality, and concatenate the latent code with the domain code to indicate the final mapping. To constrain the image content and encourage the mapping from the latent code, we use the latent code encoded from the source image and the generated image to reconstruct the source image. Due to the introduction of the VAE-like encoder, the latent distribution produced by the encoder is encouraged to be close to a random Gaussian:

$\mathcal{L}_{KL} = \mathbb{E}_{b \sim B}[D_{KL}(E(b)\,\|\,\mathcal{N}(0, I))],$

where $D_{KL}(p\|q) = \int p(z) \log \frac{p(z)}{q(z)}\, dz$. To enforce the generator to utilize the latent code, a latent code reconstruction loss is also used:

$\mathcal{L}_{latent} = \mathbb{E}_{a \sim A,\, z \sim \mathcal{N}(0, I)}[\| z - E(G(a, z \oplus z_B)) \|_1].$

Combining these two losses with the losses of the base model, our model can overcome the lack of diversity in unpaired image translation. Note that we only discuss the A-to-B translation, as the B-to-A mapping is similar and trained concurrently.
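Under the usual Gaussian-encoder parameterization from VAE practice (the encoder outputs a mean and log-variance per latent dimension; an assumption here, since the paper does not spell this out), the two extra losses reduce to closed forms:

```python
import math

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.
    return sum(0.5 * (m * m + math.exp(lv) - lv - 1.0)
               for m, lv in zip(mu, logvar))

def latent_reconstruction_loss(z_sampled, z_recovered):
    # ||z - E(G(a, z (+) z_B))||_1 : forces the generator to use z,
    # so different samples of z yield different outputs.
    return sum(abs(a - b) for a, b in zip(z_sampled, z_recovered)) / len(z_sampled)
```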
The generator adopts an encoder-decoder framework, which contains two stride-2 convolution layers for downsampling, six residual blocks, and two stride-2 transposed convolution layers for upsampling. We replace all normalization layers, except those in the upsampling layers, with CBIN layers. For the discriminators, we use two discriminators to distinguish real and fake images at different scales. For the multi-modal SingleGAN experiment, the encoder adopts the ResNet structure. We equip the encoder with CBIN so that it can also extract latent information from images of different domains. Code and models are available at https://github.com/Xiaoming-Yu/SingleGAN.
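To see that the two stride-2 convolutions and two stride-2 transposed convolutions are resolution-symmetric, one can check the spatial arithmetic. Kernel size 3, padding 1 and a 128-pixel input are assumptions for illustration, not taken from the paper:

```python
def conv_out(size, kernel=3, stride=2, pad=1):
    # Output spatial size of a standard convolution layer.
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel=3, stride=2, pad=1, out_pad=1):
    # Output spatial size of a transposed (fractionally strided) convolution.
    return (size - 1) * stride - 2 * pad + kernel + out_pad

size = 128                          # hypothetical input resolution
down = conv_out(conv_out(size))     # two stride-2 convs: 128 -> 64 -> 32
# six residual blocks keep the spatial size at `down`
up = deconv_out(deconv_out(down))   # two stride-2 deconvs: 32 -> 64 -> 128
```

The residual blocks operate at the bottleneck resolution, so the network's output matches the input size exactly.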
For all experiments, we train all models with the Adam optimizer at a learning rate of 0.001. In the extended multi-modal network shown in Fig. 2(c), the losses $\mathcal{L}_{KL}$ and $\mathcal{L}_{latent}$ are weighted by $\lambda_{KL}$ and $\lambda_{latent}$ respectively. To generate higher-quality results with stable training, we replace the negative log-likelihood objective with a least-squares loss.
As mentioned in Sect. 3.1, we use a one-hot vector to represent the domain code $z$. For the base model, we use a two-dimensional domain code to indicate the mappings between domains A and B. For the one-to-many and many-to-many translation instances illustrated in Fig. 2, the domain code has one dimension per domain. In the third variant, a latent attribute code is additionally used for multimodal image translation within the specific domain indicated by the domain code.
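Constructing these conditioning inputs is straightforward; the sketch below shows the one-hot domain code and, for the third variant, its concatenation with the attribute latent code (function names are ours):

```python
def domain_code(index, num_domains):
    # One-hot vector z selecting one of the generator's target domains.
    code = [0.0] * num_domains
    code[index] = 1.0
    return code

def conditioning_input(dom_code, latent_code=None):
    # Third variant: concatenate the attribute latent code with the
    # domain code to select both the target domain and the modality.
    return dom_code + (latent_code or [])
```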
To evaluate the base model, we use three unpaired datasets: Apple↔Orange, Horse↔Zebra, and Summer↔Winter. For the three extended models, we use Photo↔Art for one-to-many translation, Transient-Attributes for many-to-many translation, and Edges↔Photos for one-to-one multi-modal translation. All images are scaled to the same resolution.
To compare the performance of our SingleGAN model, we adopt CycleGAN and StarGAN as our baseline models. CycleGAN uses a cycle loss to learn the mapping between two different domains; to achieve cycle consistency, it requires two generators and two discriminators for the two domains. To unify multi-domain translation with a single generator, StarGAN introduces an auxiliary classifier, trained on image-label pairs, in its discriminator to help the generator learn mappings across multiple domains. We compare our method with CycleGAN and StarGAN on two-domain translation tasks.
In this section, we evaluate the performance of the different models. It should be noted that both SingleGAN and StarGAN use a single generator for two-domain image translation, while CycleGAN uses two generators to achieve similar mappings.
The qualitative comparison is shown in Fig. 3. We observe that all models produce pleasant results in simple cases such as the apple-to-orange transformation. In translations with complex scenes, the performance of all models degrades, especially StarGAN's. A possible reason is that the generator of StarGAN introduces adversarial noise to fool the auxiliary classifier and fails to learn an effective mapping. Meanwhile, SingleGAN presents the best results in most cases.
To judge the quality of the generated images quantitatively, we first evaluate their classification accuracy. We train an Xception-based binary classifier for each image dataset; the baseline is the classification accuracy on real images. Higher classification accuracy means that the generated images are easier to attribute to their target domain. Second, we compare the domain consistency between real and generated images by computing an average distance in feature space; a similar idea is used for calculating the diversity of multi-modal generation tasks [3, 22]. We use cosine similarity to evaluate the perceptual distance in the feature space of a VGG-16 network pretrained on ImageNet, summing across the five convolution layers preceding the pooling layers. The larger the value, the more similar two images are. In the test stage, we randomly sample a real image and a generated image from the same domain to form a pair, and compute the average distance over 2,000 pairs. The baseline is computed from 2,000 pairs of real images.
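The domain-consistency metric itself can be sketched as follows; the VGG-16 feature extraction is abstracted away, and the vectors below stand in for pooled convolutional features of a (real, generated) image pair:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def domain_consistency(pairs):
    # Average cosine similarity over (real, generated) feature pairs;
    # larger values mean generated images lie closer to the real domain.
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```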
The quantitative results are shown in Tables 1 and 2. Both SingleGAN and CycleGAN produce quantitative results that agree with their qualitative performance. In contrast, StarGAN achieves higher classification accuracy but poor domain consistency. This validates our conjecture that the images generated by StarGAN may carry adversarial noise that fools the classifier in some complex scenes. In StarGAN, the discriminator learns to tell whether an image is real or fake without considering its classification result, while the generator learns to fool the discriminator with an image that is also correctly classified by the auxiliary classifier. So the generator is not discouraged from adding adversarial noise to the image. For example, on the Summer↔Winter task, although an input summer image is expected to be translated into winter, the generator of StarGAN tends to add only a tiny adversarial perturbation to the input, so that the discriminator still judges it real while the classifier labels it as winter. As a result, the generated images look unchanged to humans but win high classification scores.
This issue does not exist in SingleGAN and CycleGAN, since these models optimize different mappings with different discriminators. The main difference between SingleGAN and CycleGAN is the number of generators. As shown in Fig. 3 and Tables 1 and 2, SingleGAN has the capacity to learn multiple mappings without performance degradation. By sharing the generator across different domain translations, SingleGAN sees more training data from different domains, learns the shared semantics, and improves the performance of the generator.
To explore the potential of SingleGAN, we test the extended models on three different translation tasks.
For one-to-many image translation, we perform multi-style transfer to evaluate model performance. The Photo↔Art dataset contains three artistic styles (500 images by Monet, 584 by Cezanne and 401 by Van Gogh) and 1,000 real photos. The results are shown in Fig. 4. We observe that the generated images have similar artistic styles when we perform the same mapping, while different styles remain distinguishable.
For multi-domain translation, we choose four outdoor scenes in the Transient-Attributes dataset to evaluate the model: 'day', 'night', 'summer' and 'winter'. It should be noted that the multiple domains do not have to be independent; e.g. the subset 'day' contains both summer and winter images. The training data for each domain need not consider the other domains' information. As shown in Fig. 5, SingleGAN is competent at the transformations between all domains, even though the dataset has incomplete labels.
The final experiment verifies the multi-modal performance of SingleGAN after introducing the attribute latent code. The dataset adopted is edges2shoes. Note that this experiment is performed under the unpaired-data setting. The experimental results in Fig. 6 show that SingleGAN is able to learn a multimodal mapping under unsupervised learning.
Although the above experiments assume unpaired data, SingleGAN can also perform multi-domain image translation with paired data by replacing the cycle-consistency loss with a reconstruction loss $\mathcal{L}_{rec}$. Here we use the salient object dataset DUTS-TR and the BSDS500 edge dataset to perform one-to-many image translation. We specify the real images as domain A, the saliency maps as domain B and the edge maps as domain C. Then $\mathcal{L}_{rec}$ can be defined as

$\mathcal{L}_{rec} = \mathbb{E}_{(a, b)}[\| G(a, z_B) - b \|_1] + \mathbb{E}_{(a, c)}[\| G(a, z_C) - c \|_1].$
The results in Fig. 7 demonstrate the effectiveness of SingleGAN.
Although SingleGAN can achieve multi-domain image translation, the multiple adversarial objectives need to be optimized simultaneously. This constraint means SingleGAN can learn only a limited number of domain translations at a time, since memory is limited, so it is valuable to explore transfer learning for existing models. Besides, the capacity of the network to learn different mappings is also an important problem. We also observe that integrating suitable tasks into one single model may improve the performance of the generator, but which kinds of tasks promote each other remains to be explored in future work. Nonetheless, we think the method proposed in this paper is valuable for exploring multi-domain generation.
In this paper we introduced a single-generator model, SingleGAN, for learning multi-mapping image-to-image translation. By introducing multiple adversarial objectives for the generator, SingleGAN is able to learn a variety of mappings effectively and efficiently. Comparative experimental results show quantitatively and qualitatively that our approach is effective in many image translation tasks. Furthermore, to improve the versatility and generality of the model, we presented three variants of SingleGAN for different tasks: one-to-many domain transfer, many-to-many domain transfer and one-to-one domain transfer with varying attributes. The experimental results demonstrate that these variants achieve the corresponding translations effectively.
This work was supported in part by the Project of National Engineering Laboratory for Video Technology - Shenzhen Division, National Natural Science Foundation of China and Guangdong Province Scientific Research on Big Data (No.U1611461), Shenzhen Municipal Science and Technology Program under Grant JCYJ20170818141146428, and Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (No.ZDSYS201703031405467).
Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
Generative adversarial text to image synthesis. In: ICML (2016) 1060-1069
Xception: Deep learning with depthwise separable convolutions. In: CVPR (2017) 1800-1807