With the development of deep generative models, image generation has become popular in various applications. Among them, image-to-image translation, which learns a mapping from one image domain to another, is a recent hot topic. Many computer vision tasks, including cross-domain image generation [1, 2, 3, 4, 5, 6, 7, 8, 9] and image style transfer [10, 11, 12], can be considered as image translation. Based on the two domains of images (either paired or unpaired), a generative model is trained to learn their relationship.
Image translation can be defined as follows. Given an image from a source domain, we map it to a target-domain image that shares some similarity or has a close relationship with the input. The task is to learn the mapping that transforms the source-domain distribution to the target-domain distribution. For example, the target can be a scaled-up version of the input in super-resolution, a colored version of a grayscale image, or a photo corresponding to a sketch (Figure 1). In these tasks, we assume that there also exists an inverse mapping. With the help of the inverse mapping, we can understand the two domains better by learning their joint relationship.
Given a paired data set, early approaches learn the mapping by using the input-output pairs in a supervised manner [1, 6]. The main challenge is to learn and generalize the relationship between the given pairs. However, paired data sets are usually hard to collect, and the corresponding target-domain image may not even exist in practice. For example, it is hard to collect paired images of photos and artistic works. Another example is that if the two domains are male faces and female faces, then paired data for the same person do not exist. In these cases, supervised models fail because of the lack of a ground-truth mapping for training. Learning unpaired image translation is thus more practical, as data collection is much easier, and it has received more attention recently [2, 3, 4, 5].
In this paper, we consider the task of unsupervised one-to-many image generation. Given two image domains A and B, we learn to transfer images in domain A to domain B, and vice versa. Different from previous works on unpaired image translation [2, 3, 4], we assume that there can be many possible target images in domain B given the same image in domain A. For example, in the edge-to-shoe translation task in Figure 1, there can be different colors and textures when generating shoes. To model this variation, we propose to use an additional variable z to complement images in domain A. Moreover, this variable can be easily sampled from a prior distribution, such as the standard normal distribution.
To learn the relationship among A, B and z, we propose a novel generative model under the constraints of a domain adversarial loss and a cycle consistency loss, which was first defined in [4]. The proposed model, which will be called XOGAN, is assembled in an “XO”-structure, and is trained under the generative adversarial network (GAN) framework [13, 14, 15, 16, 17].
In our experiments, we show results on generating shoes and handbags with diverse colors and textures given the edge images. Besides, when the additional variable z is kept the same for different edge inputs, we can generate objects with the same colors. Moreover, we can alter the colors of different objects by substituting the variable z. Hence, not only can our model generate plausible images as other generative models do, it can also replace the color of a given image with that of another.
Section II first reviews related work. The proposed “XO”-structure and the training procedure are introduced in Section III. Section IV presents experimental results on the edges2shoes, edges2handbags and CelebA data sets. Finally, Section V gives some concluding remarks and future directions.
Notations. Samples from the two image domains are denoted x_A and x_B, where the subscripts A and B are the domain indicators, and the images are of size H × W × C (height, width, and number of channels, respectively). The empirical distributions of the two image domains are denoted p_A and p_B, respectively. The additional variable z is sampled from a standard normal distribution N(0, I_n), where n is the dimensionality of z.
The generator for image domain B is denoted G_AB, in which the subscripts denote the input and output domains. The generator for image domain A is denoted G_BA. The input to G_AB is a concatenation of an image from domain A and a sample z from the prior distribution. The generator (also called an encoder) for z is denoted E_z.
The fake samples generated in the first pass are denoted x̂_A (in domain A) and x̂_B (in domain B), and the encoded variable is ẑ. When the fake samples are forwarded once again through the generators, the resulting reconstructions are denoted x̃_A, x̃_B and z̃ (Figure 3).
The discriminator networks for domains A and B and variable z are denoted D_A, D_B and D_z, respectively.
II Related Work
II-A Generative Adversarial Networks
The generative adversarial network (GAN) [13] is a powerful generative model that can generate plausible images. The GAN contains two modules: a generator that generates samples and a discriminator that tries to distinguish whether a sample is from the real or generated distribution. The generator aims to confuse the discriminator by generating samples that are difficult to differentiate from the real ones. Training of GANs often suffers from issues such as vanishing gradients and mode collapse [16], in which the generator tends to collapse to points in a single mode. Very recently, a number of techniques have been introduced to stabilize the training procedure [14, 16]. In cross-domain image generation [1, 2, 4, 18], the GAN is a powerful tool to match the generated images to the real image distributions, especially when paired images are not available.
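As an illustration of this adversarial game, the following is a minimal numerical sketch of the standard GAN losses (with the commonly used non-saturating generator objective); the function names are illustrative, and the networks themselves are left abstract.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Binary cross-entropy for the discriminator: real samples
    should score close to 1, generated samples close to 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-8):
    """Non-saturating generator loss: the generator is rewarded
    when the discriminator scores its samples as real."""
    return -np.mean(np.log(d_fake + eps))

# A discriminator that separates real from fake well has a low loss,
# while the generator's loss is then high, and vice versa.
print(discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05])))
print(generator_loss(np.array([0.1, 0.05])))
```

At the theoretical equilibrium the discriminator outputs 0.5 everywhere, and neither player can improve further.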
II-B Supervised Image-to-Image Translation
Isola et al. [1] showed that cross-domain image translation can be learned and generalized using a paired data set. By using the conditional GAN [15], their model can generate plausible photographs from sketches or semantic layouts. Zhu et al. [19] used bicycle consistency between the latent code and the output images to generate multi-modal target-domain images. However, paired data sets are not always available, and unpaired data sets are more common in practice.
II-C Unsupervised Image-to-Image Translation
Taigman et al. [3] introduced the domain transfer network (DTN) to generate emoji-style images from facial images in an unsupervised manner. In the DTN, image translation is a one-way mapping. If we train another model to map the emoji images back to real faces, the face identity may be changed. More recently, bidirectional mapping has become more appealing, and has been studied in the DiscoGAN [2], CycleGAN [4] and DualGAN [5]. These models use one generator and one discriminator for each mapping, and the symmetric structure helps to learn the bidirectional mapping.
Figure 2 shows the CycleGAN [4], which uses one generator for the mapping from one domain to the other, and another generator for the reverse mapping. Two associated adversarial discriminators are used to measure the quality of the generated samples in the corresponding domains. Figure 2(a) contains the forward cycle-consistency path (from the first domain to the second and back), and Figure 2(b) the backward cycle-consistency path (from the second domain to the first and back). The cycle consistency loss captures the intuition that if we translate from one domain to the other and back again, we should be able to reconstruct the original input. However, the image generated by the CycleGAN is deterministic. As will be shown in Section IV, these models cannot model additional variations even when random noise is added to the inputs.
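The cycle-consistency idea can be sketched numerically. In the toy example below, the two "generators" are hypothetical invertible maps standing in for neural networks, so the L1 cycle loss is essentially zero; with real networks the loss is positive and is minimized during training.

```python
import numpy as np

def G(x_a):   # toy generator, first domain -> second (stands in for a network)
    return x_a + 1.0

def F(x_b):   # toy generator, second domain -> first, the inverse of G
    return x_b - 1.0

def cycle_consistency_loss(x_a, x_b):
    """L1 cycle loss: F(G(x_a)) should reconstruct x_a (forward cycle),
    and G(F(x_b)) should reconstruct x_b (backward cycle)."""
    forward = np.mean(np.abs(F(G(x_a)) - x_a))
    backward = np.mean(np.abs(G(F(x_b)) - x_b))
    return forward + backward

x_a = np.zeros((4, 8, 8, 3))   # a mini-batch of "images" in the first domain
x_b = np.ones((4, 8, 8, 3))    # a mini-batch of "images" in the second domain
print(cycle_consistency_loss(x_a, x_b))  # ~0 for exactly inverse maps
```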
Another recent model is the UNIT [18], which performs image translation by using a shared latent space to encode the two domains. Although it can generate multiple images via the use of a stochastic variable, the variations generated are still limited.
III The Proposed Model
Let A and B be two image domains. In supervised image-to-image translation, a sample pair is drawn from the joint distribution. In this paper, we focus on unsupervised image-to-image translation, in which samples are drawn from the two marginal distributions p_A and p_B.
III-A Generators and Cycle Consistency Loss
The proposed XOGAN model contains three generators G_AB, G_BA and E_z (with parameters θ_AB, θ_BA and θ_z, respectively). In this paper, we propose to use an additional variable z to model the variation when translating from domain A to domain B. Given a sample x_A drawn from p_A and z from the prior distribution, a fake sample in domain B is generated by G_AB as x̂_B = G_AB(x_A, z).
Given x̂_B, generator G_BA generates a reconstruction x̃_A of x_A in domain A, and generator E_z encodes a reconstruction z̃ of z: x̃_A = G_BA(x̂_B) and z̃ = E_z(x̂_B).
Together, this forms the X-path in Figure 3(a). To ensure cycle consistency, the generated sample x̂_B should contain sufficient information to reconstruct x_A (for the path x_A → x̂_B → x̃_A), and similarly z̃ should be similar to z (for the path z → x̂_B → z̃).
On the other hand, given a sample x_B in domain B, generator G_BA can use it to generate a fake sample x̂_A in domain A, and generator E_z can use it to encode a fake ẑ. Using both x̂_A and ẑ, generator G_AB can recover a reconstruction of x_B as x̃_B = G_AB(x̂_A, ẑ). This forms the O-path in Figure 3(b). Again, for cycle consistency, x̃_B should be close to x_B.
Combining the above, the cycle consistency loss can thus be written as:

L_cyc = ‖x̃_A − x_A‖_1 + ‖z̃ − z‖_1 + ‖x̃_B − x_B‖_1.  (1)

Here, we use the ℓ1 norm, though other norms can also be used.
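The X- and O-path reconstructions and the resulting cycle loss can be sketched as follows. The placeholder "generators" below (g_ab, g_ba, e_z) are simple closed-form maps chosen so that the cycle conditions hold exactly; they are illustrative stand-ins for the three networks, not the actual architecture.

```python
import numpy as np

def g_ab(x_a, z):   # stand-in for G_AB: image in A plus code z -> image in B
    return x_a + z.reshape(-1, 1, 1, 1)

def g_ba(x_b):      # stand-in for G_BA: image in B -> image in A (remove the mean)
    return x_b - x_b.mean(axis=(1, 2, 3), keepdims=True)

def e_z(x_b):       # stand-in for E_z: image in B -> code z (the per-image mean)
    return x_b.mean(axis=(1, 2, 3))

def l1(u, v):
    return np.mean(np.abs(u - v))

def xo_cycle_loss(x_a, x_b, z):
    # X-path: (x_a, z) -> fake x_b -> reconstructed x_a and z
    fake_b = g_ab(x_a, z)
    loss_x = l1(g_ba(fake_b), x_a) + l1(e_z(fake_b), z)
    # O-path: x_b -> fake x_a and fake z -> reconstructed x_b
    fake_a, fake_z = g_ba(x_b), e_z(x_b)
    loss_o = l1(g_ab(fake_a, fake_z), x_b)
    return loss_x + loss_o
```

For zero-mean x_a these toy maps invert each other exactly, so the loss is zero up to floating-point error; with learned generators the loss is positive and drives all three networks toward consistency.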
III-B Domain Adversarial Loss
Minimizing (1) alone cannot guarantee that the generated fake samples x̂_A, x̂_B and the encoded variable ẑ follow the distributions of the two image domains and the prior of z. The GAN [13], which is known to be able to learn good generative models, can also be regarded as performing distribution matching. In the following, with the use of the adversarial loss, we will try to match the generated distributions of x̂_A, x̂_B and ẑ with the corresponding p_A, p_B and N(0, I_n).
As in the standard GAN, the discriminators D_A, D_B and D_z are binary classifiers, and the discriminator losses are:

L_{D_A} = −E[log D_A(x_A)] − E[log(1 − D_A(x̂_A))],
L_{D_B} = −E[log D_B(x_B)] − E[log(1 − D_B(x̂_B))],
L_{D_z} = −E[log D_z(z)] − E[log(1 − D_z(ẑ))].
In the GAN, the generators, besides trying to minimize the cycle consistency loss, also need to confuse their corresponding discriminators. The adversarial loss for the generators is

L_adv = −E[log D_A(x̂_A)] − E[log D_B(x̂_B)] − E[log D_z(ẑ)].
To ensure both cycle consistency and distribution matching, the total loss for the generators is a combination of the cycle consistency loss in (1) and the adversarial losses:

L_gen = L_adv + λ L_cyc,  (5)

where λ controls the balance between the two types of losses. In the experiments, λ is set to 10.
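In code, combining the terms of the total loss is a one-liner; the function below is only a sketch of this weighting, with the weight defaulting to the value of 10 used in the experiments.

```python
def total_generator_loss(adversarial_losses, cycle_loss, lam=10.0):
    """Total generator objective: sum of the adversarial terms
    plus the cycle-consistency loss weighted by lam (the balance weight)."""
    return sum(adversarial_losses) + lam * cycle_loss

# e.g., three adversarial terms (for D_A, D_B, D_z) and one cycle term:
# 0.7 + 0.6 + 0.5 + 10 * 0.2 = 3.8
print(total_generator_loss([0.7, 0.6, 0.5], 0.2))
```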
III-C Training Procedure
In each iteration, we sample a mini-batch of images x_A's and x_B's from domains A and B, and the variable z from the prior distribution (the standard normal distribution N(0, I_n)). They are fed through the X- and O-paths in Figure 3 to obtain the generated samples x̂_A, x̂_B, ẑ, and the reconstructions x̃_A, x̃_B, z̃. The real and generated samples are then input to the three discriminators.
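The per-iteration sampling described above can be sketched as follows; the data arrays and shapes are placeholders, and the network updates are omitted since only the sampling scheme is being illustrated.

```python
import numpy as np

def sample_iteration(data_a, data_b, batch_size, z_dim, rng):
    """Draw one training mini-batch: images from both domains,
    plus z from the standard normal prior N(0, I)."""
    x_a = data_a[rng.choice(len(data_a), size=batch_size, replace=False)]
    x_b = data_b[rng.choice(len(data_b), size=batch_size, replace=False)]
    z = rng.standard_normal((batch_size, z_dim))
    # x_a, x_b and z would now be fed through the X- and O-paths,
    # after which the generators and discriminators are updated.
    return x_a, x_b, z

rng = np.random.default_rng(0)
data_a = np.zeros((100, 8, 8, 1))    # placeholder edge images (domain A)
data_b = np.zeros((100, 8, 8, 3))    # placeholder photos (domain B)
x_a, x_b, z = sample_iteration(data_a, data_b, batch_size=16, z_dim=8, rng=rng)
print(x_a.shape, x_b.shape, z.shape)  # (16, 8, 8, 1) (16, 8, 8, 3) (16, 8)
```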
IV Experiments

In this section, experiments are performed on a number of commonly used data sets for image translation.
edges2shoes and edges2handbags: (https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets/) These two data sets have been used in [1]. The edges2shoes data set contains about 50k paired images, and the edges2handbags data set contains about 140k paired images. In both data sets, domain A contains edge images and domain B contains real objects (shoes and handbags). Note that one real object can be mapped to only one edge map, but an edge map can correspond to multiple objects.
Although the two data sets contain paired images, we break the pairing by taking the domain-A images from the first half of the pairs and the domain-B images from the other half. Hence, there is no paired data in the training set, and the task is unsupervised image-to-image translation.
CelebA: (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) This is a large-scale face attributes data set with more than 200K celebrity images, each with 40 attribute annotations [21]. In the experiment, we use the hair color attribute. Domain A contains faces with black hair, and domain B contains faces with other hair colors. As the hair in many male CelebA faces is not apparent, we only use the female faces.
All input images are rescaled to a common size.
In the proposed XOGAN, we use the U-Net [22], which adds skip connections between mirrored layers of the 7 downsampling layers and the 7 upsampling layers, for the image generators G_AB and G_BA. For z, the generator (or encoder) E_z is a 7-layer strided convolutional network with residual blocks [23]. We set the dimensionality of z to 8 for the edges2shoes and edges2handbags data sets, and to 4 for the CelebA data set. For the image discriminators D_A and D_B, we use the patch-discriminator in [24], which only classifies input images at the scale of overlapping patches. The discriminator D_z is a simple two-layer multi-layer perceptron. As in [4], the hyperparameter λ in (5) is set to 10. In each training iteration, we update the generators twice and then update the discriminators once.
Noisy DiscoGAN: This is a variant of the DiscoGAN [2]. It uses the same generators as XOGAN, but the generator from domain A to domain B is augmented with random Gaussian noise. This allows the generation of different images in domain B given the same image from domain A. We do not compare with the CycleGAN [4] and DualGAN [5], as they are very similar to the DiscoGAN.
IV-A Translating A to B with Random z
To show the consistency of the learned additional variable, we sample different random z's to generate different fake images x̂_B for each input image x_A. We keep the set of random z's the same for different x_A's, so that when x_A or z is fixed, the generated samples should keep some attributes fixed (such as color) as the other varies, as shown in Figures 5 and 8.
In this experiment, we randomly sample 4 input images x_A in domain A from the test set. For each x_A, we generate 4 images using different z's sampled from the standard normal distribution.
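The sampling scheme of this experiment, one shared set of z's reused for every input, can be sketched as below (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_codes, z_dim = 4, 4, 8

# One shared set of codes, sampled once and reused for every input image,
# so column j of the 4x4 result grid uses the same z_j for all inputs.
zs = rng.standard_normal((n_codes, z_dim))
grid = [[(i, j) for j in range(n_codes)] for i in range(n_inputs)]

# Each (input i, code j) pair would be fed to the generator; because the
# codes are shared, attributes tied to z (e.g. color) should be
# consistent down each column of the generated grid.
print(zs.shape)  # (4, 8)
```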
IV-A1 edges2shoes and edges2handbags Data Sets
Figure 5 shows that the proposed XOGAN can generate realistic photos of shoes and handbags. With the help of the cycle consistency loss on the variable z, the generated objects have the same color for the same z. This makes it possible to generate plausible and colorful objects by just drawing edges like those in domain A. Besides, we can also control the color of the generated images by varying z. Note that our task differs from drawing software in two ways. (i) We show that an end-to-end deep neural network is able to generate plausible and diverse objects given edges; (ii) The generated objects do not only fill in the colors, but also have smooth textures that make them look real.
Figure 6 shows the results obtained by the noisy DiscoGAN. Although multiple images can be generated, they are inferior to those obtained with the proposed model in two ways. (i) The images generated by the noisy DiscoGAN are not as realistic as those of XOGAN; (ii) They are also not as diverse, i.e., x̂_B has a similar color for different z's in each row. This may be due to the mode collapse problem of GANs [16], in which the model tends to generate images from a single mode (in this case, color).
Figure 7 shows the results obtained by the UNIT. Similar to the noisy DiscoGAN, almost all the generated images tend to have the same color. From our observations, the UNIT also suffers from mode collapse during training: the sampled latent variables do not produce diverse outputs in each row.
The above results show that the proposed XO-path is essential for generating both plausible and consistent target images. On the edges2shoes and edges2handbags data sets, XOGAN is able to generate plausible images in the target domain B. Different from the noisy DiscoGAN and UNIT baselines, we can sample multiple and consistent output images with different z's. Besides, the results of XOGAN are more diverse with the help of z.
IV-A2 CelebA Data Set
Figure 8 shows the generated faces with different hair colors. The variable z is used to model the different hair colors in domain B. Note that the other parts of the images are almost unchanged. Hence, the color change is focused on specific parts (here, the hair).
IV-B Translating B to A to B with Substituted z
In this section, we study whether z can encode relevant information of domain B by substituting the encoded ẑ among different images in domain B.
In this experiment, we randomly sample 4 input images x_B in domain B from the test set. For each x_B, we generate its corresponding image x̂_A in domain A and encode its additional variation in ẑ. As in the previous edges2shoes experiment, x_B represents a colored shoe, x̂_A is its corresponding edge image, and ẑ should encode the content inside the edges. We then concatenate x̂_A with different ẑ's to generate various images in domain B.
IV-B1 edges2shoes and edges2handbags Data Sets
To see what the additional variable encodes, we substitute different ẑ's given different colored shoes and handbags. As shown in Figure 9, we can thus modify images in domain B. In these two pictures, the input shoes and handbags are shown in the first column. We first generate the corresponding edge image x̂_A of each input x_B, shown in the second column of Figure 9, using generator G_BA. The generated edges describe the contours of the given inputs well. Since the edge images discard the content of the given shoes or handbags, we encode the content or color into the additional variable ẑ. As assumed, ẑ should contain the color information of the input image x_B. To show the relevance between ẑ and the color of x_B, we concatenate each edge image with different ẑ's to generate new images in domain B. In each row of Figure 9, we show x_B and x̂_A in the first two columns, and the last four columns show the images generated with substituted codes. Since each such image is generated from the concatenation of an edge image and a code ẑ, it should have the same shape as the edge image and the same color as the image from which ẑ was encoded.
In the figure, we can see that the shape is consistent in each row and the colors are consistent in each of the four rightmost columns. More importantly, when ẑ is fixed, all generated images in that column have colors similar to the input image from which ẑ was encoded. For example, in the edges2shoes results in Figure 9, the shoes in the last column have colors similar to the last input shoe. This verifies our assumption that ẑ encodes relevant information of its corresponding input x_B. Interestingly, we can thus see what an object would look like when its color is replaced with that of another object. In real-world applications such as fitting clothes in a shop, users would not need to try on the same clothes in different colors over and over again.
IV-B2 CelebA Hair Color Conversion
We perform the B-to-A-to-B path on the CelebA data set again. Input faces are sampled from domain B, where the hair colors are not black. If a user wants to change the hair color to black, we can easily transfer the photo from B to A. A more interesting result is to replace the hair color with that of another person. Figure 10 shows that we can change blond hair to gray or brown by substituting the code ẑ with that encoded from a person with gray or brown hair.
In this section, we showed that the proposed model can not only generate plausible and diverse images from domain A to domain B and vice versa, but can also modify the color of specific features in domain B by substituting the additional variable z. The proposed cycle consistency constraints ensure a good joint relationship among A, B and z. They also encode consistent features in the variable z for both the random generation in Section IV-A and the color substitution in Section IV-B.
V Conclusion

In this paper, we presented a generative model called XOGAN for unsupervised image-to-image translation with additional variations in the one-to-many translation setting. We showed that we can generate plausible images in both domains, and the generated samples are more diverse than those of the baseline models. Not only does the learned additional variable z lead to more diverse results, it also controls the colors of certain parts of the generated images. Experiments on the edges2shoes, edges2handbags and CelebA data sets showed that the learned variable is meaningful when generating images in domain B.
The proposed method can be extended in several ways. First, the prior distribution of z is a standard normal distribution, which may be too simple to model more complex variations. This can be improved with the more complicated prior distributions introduced in the VAE literature [26, 27]. Second, the variations in our model are mostly related to color. We hope that the model can be improved to change other attributes such as hair styles and ornaments. Besides, we can also consider a many-to-many mapping based on the one-to-many framework. Similar to the summer-to-winter task in [4], there exist many winter images corresponding to a single summer image and vice versa. Further, we can extend the proposed model to other domains such as text or speech, where the additional variation can be different voices when translating text to speech.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” Preprint arXiv:1611.07004, 2016.
-  T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1857–1865.
-  Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,” Preprint arXiv:1611.02200, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
-  Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
-  Q. Yao and J. T. Kwok, “Colorization by patch-based local low-rank matrix completion,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 1959–1965.
-  R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision, 2016, pp. 649–666.
-  R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with perceptual and contextual losses,” Preprint arXiv:1607.07539, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” in Advances in Neural Information Processing Systems, 2017, pp. 385–395.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016, pp. 694–711.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
-  M. Mirza and S. Osindero, “Conditional generative adversarial nets,” Preprint arXiv:1411.1784, 2014.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223.
-  M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems, 2017, pp. 700–708.
-  J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, 2017, pp. 465–476.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Preprint arXiv:1412.6980, 2014.
-  Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  C. Li and M. Wand, “Precomputed real-time texture synthesis with Markovian generative adversarial networks,” in European Conference on Computer Vision, 2016, pp. 702–716.
-  D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” Preprint arXiv:1312.6114, 2013.
-  D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
-  L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Auxiliary deep generative models,” in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1445–1453.