The human ability to transfer scenes of the world onto a painting has existed since the cradle of civilization. The earliest cave paintings that we know of are not photo-realistic representations, but rather abstract depictions of hunting scenes. Yet when it comes to developing an algorithm that mimics this ability, we are still limited in what we can achieve.
Recent developments in neural networks have greatly improved the generalization ability of image domain transfer algorithms, in which two parallel lines of work prevail: Neural Style Transfer [gatys2015neural] and Generative Adversarial Networks (GANs) [goodfellow2014generative].
Neural Style Transfer defines a content loss that captures the semantic information lost in the transferred image, and a style loss often defined as the statistical correlation (e.g. a Gramian matrix) among the extracted features. It utilizes a pretrained object recognition network such as VGG19 [simonyan2014very] for feature extraction. As a result, style transfer gives less satisfactory results when the target domain contains object representations not seen by the pretrained network [zhang2017style]. Furthermore, because the content loss is defined as the L2 difference between convolutional-layer features of the object recognition network, style-transfer-based methods fail to fundamentally change the object layout when applied to domain transfer, and often rely on the objects in the content and style images having similar spatial proportions in the first place. When semantic objects fail to align, e.g. when body proportions in paintings differ from those in real life, style transfer leads to less desirable results.
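The Gram-matrix style loss mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration only: real implementations compute it over feature maps extracted from several VGG layers, and the function names here are our own.

```python
import numpy as np

def gram_matrix(features):
    """Channel-wise correlation of a (H, W, C) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)   # one row per spatial position
    return flat.T @ flat / (h * w)      # (C, C) Gram matrix

def style_loss(feat_a, feat_b):
    """Squared Frobenius distance between the two Gram matrices."""
    g_a, g_b = gram_matrix(feat_a), gram_matrix(feat_b)
    return np.sum((g_a - g_b) ** 2)

# Identical feature maps have zero style loss.
f_content = np.random.rand(8, 8, 4)
loss_same = style_loss(f_content, f_content)
```

Because the Gram matrix averages over all spatial positions, this loss is invariant to object layout, which is exactly why style transfer cannot rearrange semantic content.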
On the other hand, GANs have shown more promising results when provided with paired data from the two domains of interest. Previous works such as [isola2017image] and [zhang2017style] show promising results when paired datasets are available or can be generated. However, such datasets are often expensive to obtain.
Current works on unpaired cross-domain image translation, including but not limited to [taigman2016unsupervised; yoo2016pixel; zhu2017unpaired; choi2017stargan; royer2017xgan], usually rely on the "shared latent space" assumption: given two related domains S and T, for each item s in S there exists a unique item t in T such that s and t share the same latent encoding z. However, this assumption may not hold for all datasets, and forcing it on the system may result in mode collapse or unrealistic-looking outputs, especially for domains that are vastly different in appearance.
To address the issues mentioned above, we introduce the Twin-GAN network and demonstrate its usage on the task of translating unpaired, unlabeled face images.
2 Related Work
2.1 Domain Transfer through Style Transfer
Conditional Style Transfer
By using conditional instance normalization, which learns a separate set of scale and shift parameters γ_s and β_s in the instance normalization layers for each style s, [dumoulin2017learned] expanded Neural Style Transfer to a single network capable of mixing and matching 32 styles. Later works such as [huang2017arbitrary] further improved on the idea by adaptively computing γ and β from the input style image, enabling real-time arbitrary style transfer with a feed-forward network.
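The idea of per-style normalization parameters can be sketched as follows (a simplified NumPy illustration with hypothetical function names; real conditional instance normalization operates inside a trained network):

```python
import numpy as np

def conditional_instance_norm(x, gammas, betas, style_id, eps=1e-5):
    """Instance-normalize a (H, W, C) feature map, then scale/shift
    with the parameters of the chosen style."""
    mean = x.mean(axis=(0, 1), keepdims=True)  # per-channel statistics
    std = x.std(axis=(0, 1), keepdims=True)
    normalized = (x - mean) / (std + eps)
    return gammas[style_id] * normalized + betas[style_id]

# Two styles share all convolution weights but use different (gamma, beta).
num_styles, channels = 2, 3
gammas = np.ones((num_styles, channels))
betas = np.zeros((num_styles, channels))
betas[1] = 5.0                          # style 1 shifts every channel
x = np.random.rand(4, 4, channels)
out0 = conditional_instance_norm(x, gammas, betas, 0)
out1 = conditional_instance_norm(x, gammas, betas, 1)
```

Only the small (gamma, beta) tables differ per style, which is why a single network can serve many styles.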
Image Analogy through Patch Matching
Liao et al. [liao2017visual] propose a structure-preserving image analogy framework based on patch matching. Given two structurally similar images, it extracts semantic features using an object recognition network. For each feature vector at one position, it finds a matching feature in the style image lying within a fixed-size patch around the same position. The image is then reconstructed using the matched features from the style image. The effect is impressive, yet its application is greatly limited by its requirements on the input images: the images must be spatially and structurally similar, since features can only come from a vicinity determined by the receptive field of the network and the patch size. Nonetheless, it is powerful for some use cases and can achieve state-of-the-art results, e.g. in harmonic photoshopping of paintings [luan2018deep].
2.2 Domain Transfer through Generative Adversarial Networks
The Generative Adversarial Network [goodfellow2014generative] framework synthesizes data points of a given distribution using two competing neural networks playing a minimax game: the Generator takes a noise vector sampled from a random distribution and synthesizes data points, while the Discriminator tells real data points apart from generated ones.
Supervised Image Translation
Isola et al. [isola2017image] propose a conditional GAN framework called "pix2pix" for image translation tasks using a paired dataset. To make preserving details in the output image easier, it uses U-Net [ronneberger2015u], which adds a skip connection between layer i in the encoder and layer n−i in the decoder, where n is the total number of layers. pix2pix is applied to numerous tasks such as sketch-to-photo, image colorization, and aerial-photo-to-map. Similarly, [zhang2017style] use a conditional GAN with a residual U-Net [ronneberger2015u] supplemented with embeddings from VGG19 [simonyan2014very] for sketch colorization. Both show that conditional GANs and U-Net are effective when paired data are readily available or can be generated.
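The U-Net skip connection can be illustrated with plain arrays (a conceptual NumPy sketch; an actual network would interleave convolutions with these operations):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_connect(decoder_feat, encoder_feat):
    """U-Net style skip: upsample the decoder feature and concatenate
    the spatially matching encoder feature along the channel axis."""
    up = upsample2x(decoder_feat)
    assert up.shape[:2] == encoder_feat.shape[:2]
    return np.concatenate([up, encoder_feat], axis=-1)

encoder_feat = np.random.rand(16, 16, 8)   # layer i in the encoder
decoder_feat = np.random.rand(8, 8, 8)     # layer n - i in the decoder
merged = skip_connect(decoder_feat, encoder_feat)
```

The concatenated encoder channels carry fine spatial detail directly into the decoder, bypassing the information bottleneck at the deepest layer.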
Unsupervised Domain Transfer
For an unsupervised domain transfer task where paired datasets are unavailable but labeled data is available for the source domain, DTN [taigman2016unsupervised] transfers images from the source domain to the target domain while retaining their labeled features. It contains a pretrained feature extractor f and a generator g built on top of f. The DTN is trained using a GAN loss for realistic output, a feature consistency loss for preserving key features after translation, and an identity loss that makes the network act as an identity mapping when supplied with inputs sampled from the target domain. DTN's need for a pretrained feature extractor implies that the result is limited by the quality and quantity of the labeled features.
Unsupervised Image Translation
In [liu2017unsupervised], the GAN-VAE based UNIT framework was introduced to handle the domain transfer task on an unpaired, unlabeled dataset. It makes the shared latent space hypothesis: for a pair of corresponding images (x_1, x_2) from the two domains, there exists a shared latent embedding z with a one-to-one mapping between z and each of x_1 and x_2. UNIT uses this hypothesis to learn a pair of encoders mapping images from the two domains into the same latent space, and a pair of generators conditioned on the latent encoding that translate the embedded image back into either domain. Specifically, the latent space is constrained to a Gaussian distribution by the VAE framework, and weights are shared in the higher-level (deeper) layers of the encoders and decoders. UNIT showed promising results on domain transfer tasks such as day/night scenes, canine breeds, feline breeds, and face attribute translation.
Similarly, [zhu2017unpaired] propose the CycleGAN framework for unpaired image translation, which relies on a GAN loss and a cycle consistency loss of the form ||F(G(x)) − x||_1 (and, symmetrically, ||G(F(y)) − y||_1). This resembles the shared latent embedding assumption without specifying the latent embedding distribution. CycleGAN shows some success when applied to a range of classic image translation tasks. Its failure cases include over-recognizing objects and being unable to change the shape of an object during translation (e.g. outputting an apple-shaped orange).
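The cycle consistency loss is simple enough to sketch directly. Below is a NumPy illustration with toy, exactly invertible "translators" standing in for the learned mappings (function names are ours):

```python
import numpy as np

def cycle_consistency_loss(x_s, x_t, F, G):
    """L1 cycle loss in the CycleGAN style: F maps S->T, G maps T->S."""
    return (np.abs(G(F(x_s)) - x_s).mean() +
            np.abs(F(G(x_t)) - x_t).mean())

# Toy invertible "translators": F doubles pixel values, G halves them.
F = lambda x: 2.0 * x
G = lambda x: 0.5 * x
x_s = np.random.rand(4, 4, 3)
x_t = np.random.rand(4, 4, 3)
loss = cycle_consistency_loss(x_s, x_t, F, G)
```

When the two mappings are exact inverses the loss vanishes; in training, the loss pressures F and G toward (approximate) invertibility.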
Finally, [choi2017stargan] introduce a multi-domain transfer framework and apply it to face synthesis using labeled facial features. [royer2017xgan] train an adversarial auto-encoder using cycle consistency, a reconstruction loss, a GAN loss, and a novel teacher loss in which knowledge is distilled from a pretrained teacher network. It shows some success on two all-frontal face datasets: VGG-Face and the Cartoon Set.
Given unpaired samples from two domains S and T depicting the same underlying semantic object, we would like to learn two functions: one that transforms source domain items into the target domain T, and one that does the opposite. Following past literature [taigman2016unsupervised; zhu2017unpaired; choi2017stargan; royer2017xgan], our TwinGAN architecture uses two encoders E_S and E_T, two generators G_S and G_T, and two discriminators D_S and D_T for domains S and T respectively, such that G_T(E_S(x_s)) translates a source image x_s into T and G_S(E_T(x_t)) translates a target image x_t into S.
We follow the current state of the art in image generation and adapt the PGGAN structure [karras2017progressive]. To improve the stability of GAN training, speed up the training process, and output images at higher resolutions, PGGAN progressively grows the generator and discriminator together, alternating between growing stages and reinforcement stages. During the growing stage, the image from the lower resolution is linearly combined with that from the higher resolution; the interpolation factor grows from 0 to 1 as training progresses, allowing the network to gradually adjust to the higher resolution as well as to any newly added variables. During the reinforcement stage, layers used only for the lower resolution are discarded while the grown network continues training. Our discriminator is trained progressively as well.
Encoder and UNet
In an encoder-decoder structure built from convolutional neural networks, the input is progressively down-sampled in the encoder and up-sampled in the decoder. For image translation tasks, some details and spatial information may be lost during down-sampling. U-Net [ronneberger2015u] is commonly used in image translation tasks [isola2017image; zhang2017style] because details of the input image can be preserved in the output through its skip connections. We adopt this structure, and our encoder mirrors the PGGAN generator structure, growing as the generator grows to higher resolutions. The skip connections link the encoder layers right before down-sampling with the generator layers right after up-sampling. For details on the network structure, please see 4.
Domain-adaptive Batch Renormalization
Previous works such as [dumoulin2017learned], [de2017modulating], and [miyato2018spectral] have shown that, by using different sets of normalization parameters γ and β, one can train a generative network to output visually different images of the same object. Inspired by their discovery, we capture the style difference between the two domains S and T by using two sets of batch renormalization [ioffe2017batch] parameters, one for each domain. The motivation behind this design is as follows: since both encoders are trying to encode the same semantic object represented in different styles, sharing weights in all but the normalization layers encourages them to use the same latent encoding for the two visually different domains. Thus, different from prior works [liu2017unsupervised; zhu2017unpaired; choi2017stargan; royer2017xgan], which share parameters only in the higher layers, we share the weights of all layers except the batch renormalization layers, and we use the same weight-sharing strategy for our two generators as well. This is the key design of our TwinGAN model: it enables us to capture shared semantic information and to train the network with fewer parameters. Due to the small batch sizes used at higher resolutions, we found empirically that batch renormalization performs better than batch normalization during inference.
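The weight-sharing scheme can be sketched as follows. For simplicity this NumPy illustration uses a single shared linear layer with plain batch normalization; the actual model uses convolutional layers with batch renormalization, and all names here are hypothetical:

```python
import numpy as np

class SharedEncoder:
    """One set of shared weights, but a separate (gamma, beta)
    normalization pair per domain."""
    def __init__(self, dim, num_domains=2):
        rng = np.random.default_rng(0)
        self.w = rng.standard_normal((dim, dim))   # shared across domains
        self.gamma = np.ones((num_domains, dim))   # per-domain scale
        self.beta = np.zeros((num_domains, dim))   # per-domain shift

    def forward(self, x, domain, eps=1e-5):
        h = x @ self.w                             # shared weights
        mean, std = h.mean(axis=0), h.std(axis=0)  # batch statistics
        h = (h - mean) / (std + eps)
        return self.gamma[domain] * h + self.beta[domain]

enc = SharedEncoder(dim=16)
x = np.random.default_rng(1).standard_normal((8, 16))
z_s = enc.forward(x, domain=0)   # encode as domain S
z_t = enc.forward(x, domain=1)   # encode as domain T
```

Only the tiny per-domain (gamma, beta) tables differ between the two "twin" encoders, so nearly all parameters are trained on data from both domains.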
We use three sets of losses to train our framework. The adversarial loss ensures that the output images are indistinguishable from images sampled from the given domains. The reconstruction loss enforces that the encoder-generator pair actually captures the information in the input image and can reproduce the input. The cycle consistency loss ensures that the input and output images contain the same features.
Given input domain S and output domain T, the objective for image translation using a vanilla GAN is formulated as
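Written out in the notation of this paper (E_S encodes source images, G_T generates target-domain images, D_T discriminates them), the standard formulation [goodfellow2014generative] takes the form:

```latex
\min_{E_S, G_T} \max_{D_T}\;
\mathcal{L}_{GAN} =
\mathbb{E}_{x_t \sim T}\!\left[ \log D_T(x_t) \right] +
\mathbb{E}_{x_s \sim S}\!\left[ \log\!\big( 1 - D_T(G_T(E_S(x_s))) \big) \right]
```

with a symmetric term for the T → S direction.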
To boost the stability of GAN training, we add the DRAGAN [kodali2017convergence] objective to our adversarial loss:
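The DRAGAN objective is commonly written as a gradient penalty on points perturbed around the real data, where λ_gp and c are hyperparameters:

```latex
\mathcal{L}_{DRAGAN} = \lambda_{gp}\,
\mathbb{E}_{x \sim P_{real},\; \delta \sim N(0,\, c I)}
\left[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \right],
\qquad \hat{x} = x + \delta
```

Penalizing gradients only near the real data manifold is what distinguishes DRAGAN from the WGAN-GP penalty taken along interpolation lines.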
The adversarial objective is thus:
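Combining the vanilla GAN term with the DRAGAN penalty, the adversarial objective can plausibly be written as the sum (applied in both translation directions):

```latex
\mathcal{L}_{adv} = \mathcal{L}_{GAN} + \mathcal{L}_{DRAGAN}
```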
Cycle Consistency Loss
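One plausible instantiation of the cycle consistency loss in this paper's encoder/generator notation (the choice of the L1 norm is our assumption, following CycleGAN [zhu2017unpaired]):

```latex
\mathcal{L}_{cyc} =
\mathbb{E}_{x_s \sim S}\,\big\lVert G_S(E_T(G_T(E_S(x_s)))) - x_s \big\rVert_1 +
\mathbb{E}_{x_t \sim T}\,\big\lVert G_T(E_S(G_S(E_T(x_t)))) - x_t \big\rVert_1
```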
Optionally, a discriminator can also be applied to the reconstructed outputs. Empirically, we found that this leads to slightly better performance and less blurry output, but it is by no means required.
Semantic Consistency Loss
We would like the translated image to have the same semantic features as the input image; that is, the encoders should extract the same high-level features for the input and output images regardless of their domains. Since cycle consistency already covers semantic consistency for input and output from the same domain, here we focus on cross-domain semantic consistency. Similar to previous work [royer2017xgan], we formulate the semantic consistency loss as follows:
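A sketch of this loss in the notation above, where d is an embedding distance (e.g. L1 or cosine, following [royer2017xgan]; the exact choice is our assumption):

```latex
\mathcal{L}_{sem} =
\mathbb{E}_{x_s \sim S}\, d\big( E_S(x_s),\; E_T(G_T(E_S(x_s))) \big) +
\mathbb{E}_{x_t \sim T}\, d\big( E_T(x_t),\; E_S(G_S(E_T(x_t))) \big)
```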
We note that a strictly one-to-one mapping between two domains generally does not exist. For example, cat faces lack the facial muscles to produce as many expressions as humans do, and forcing a one-to-one mapping between the two domains at the pixel level (e.g. using a loss such as in [taigman2016unsupervised]) leads to mismatches. Therefore, we apply the loss only on the embeddings, which encode semantic information shared across the domains. Features unique to only one of the two domains are instead encouraged to be captured in the adaptive normalization parameters.
Our overall loss function is defined as:
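A plausible form of the overall loss, as a weighted sum of the terms described above (the exact weights were not recoverable and the grouping is our assumption):

```latex
\mathcal{L} = \lambda_{adv}\,\mathcal{L}_{adv}
            + \lambda_{rec}\,\mathcal{L}_{rec}
            + \lambda_{cyc}\,\mathcal{L}_{cyc}
            + \lambda_{sem}\,\mathcal{L}_{sem}
```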
The λs are hyperparameters that control the weight of each objective.
We first compare our framework with other recent image translation architectures on the real-to-anime face translation task. We then study the benefits of the cycle consistency loss and the semantic consistency loss. We show some use cases of a trained encoder and study its training process. Lastly, we show the generalization power of our framework on the human-to-cat face translation task.
4.1 Real-to-Anime Face Translation
We train our network on two datasets: the CelebA dataset [liu2015deep] with 202,599 celebrity face images, and the "Getchu" dataset [jin2017towards] containing 26,752 anime character face images with clean backgrounds. During training, we randomly crop the images to a fraction of their original size. In addition, we follow [szegedy2017inception] and randomly flip the images and adjust the contrast, hue, brightness, and saturation levels.
Regarding the hyperparameters in our framework, the loss weights are not fine-tuned due to limits on computing resources, and we did not spend much effort on finding the optimal settings. Following [karras2017progressive], we use Adam [kingma2014adam].
We start at a low resolution and gradually grow to 256×256. We use a batch size of 8 for resolutions up to 64, reduced to 4 at resolution 128 and 2 at resolution 256. We show the discriminator 600k images at each stage of training. Different from [karras2017progressive], we find that DRAGAN yields shorter training times than variations of WGAN [arjovsky2017wasserstein; gulrajani2017improved] and provides more stable training for the image translation task. We used pixel-wise feature vector normalization but not the equalized learning rate [karras2017progressive].
4.2 Extra loss terms
| With Style Embedding | 26.85 | 10.87 | 9.89 | 13.06 | 15.16 |
| No Style Embedding | 29.23 | 9.21 | 10.51 | 10.41 | 14.84 |
We study the merits of the cycle loss, the semantic consistency loss, the style embedding, and the U-Net. Because they all require extra computation during training, the extra training time must be justified by the better results they bring. We measure those benefits quantitatively using the Sliced Wasserstein score proposed in [karras2017progressive].
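The sliced Wasserstein distance underlying this score can be sketched in NumPy. This is a conceptual, Monte-Carlo version over generic point sets; the PGGAN metric computes it over patches drawn from a Laplacian pyramid, and the function name is ours:

```python
import numpy as np

def sliced_wasserstein(a, b, num_projections=64, seed=0):
    """Approximate sliced Wasserstein-1 distance between two point sets
    of shape (N, D): project onto random unit directions and compare
    the sorted 1-D projections."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((num_projections, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    total = 0.0
    for d in dirs:
        pa, pb = np.sort(a @ d), np.sort(b @ d)   # 1-D optimal transport
        total += np.abs(pa - pb).mean()
    return total / num_projections

a = np.random.default_rng(1).standard_normal((256, 8))
```

Sorting solves optimal transport exactly in one dimension, which is what makes the sliced version cheap compared to the full high-dimensional Wasserstein distance.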
In our experiments, we observed that the U-Net encourages the network to find more local correspondences. Without it, the network failed to preserve correspondence between semantic parts, and there were common error patterns such as the face direction becoming mirrored after translation, which is technically allowed by all our loss terms but is judged as unnatural by humans. Note that similar error patterns are observed in our experiment with MUNIT 1, which does not use a U-Net.
Adding the cycle loss and the semantic consistency loss both resulted in a higher Sliced Wasserstein score and better output. Adding the style embedding increased the Sliced Wasserstein score only slightly, but it gave the user the ability to control features such as hair color and eye color. However, we argue that those features should perhaps belong to the content, and the style embedding failed to capture the more subtle yet important style information, such as the eye-to-face ratio or the texture of the hair, that varies from painter to painter. We therefore made the style embedding optional and did not use it for the final results.
4.3 Human to cat face translation
To show the general applicability of our model, here we present our results on the task of translating human faces to cat faces. For human faces, we collected 200k images from the CelebA dataset. We extracted around 10k cat faces from the CAT dataset [zhang2008cat] by cropping the cat faces using the eye and ear positions [alexia2018deep]. The network, the CelebA dataset, and the training setup are the same as in 4.1.
4.4 Learned Cross-domain Image Embeddings
We want to verify that meaningful semantic information shared across domains can indeed be extracted by our TwinGAN. For each domain, we use the corresponding encoder to extract the latent embeddings. For each image in one domain, we find the nearest neighbors in the other domain by computing cosine distances between the flattened embeddings. As shown in 1, meaningful correlations, including hair style, facing direction, sex, and clothing, can be established between the latent embeddings of the two domains.
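The retrieval step can be sketched in NumPy as follows (embedding shapes and function names are hypothetical):

```python
import numpy as np

def nearest_neighbors(query_emb, target_embs, k=3):
    """Indices of the k target embeddings closest to query_emb in
    cosine distance; embeddings are flattened latent codes."""
    q = query_emb / np.linalg.norm(query_emb)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    cosine_dist = 1.0 - t @ q         # 0 = identical direction
    return np.argsort(cosine_dist)[:k]

embs_t = np.random.default_rng(0).standard_normal((100, 32))
query = embs_t[42].copy()             # a query identical to target #42
neighbors = nearest_neighbors(query, embs_t)
```

Because cosine distance ignores embedding magnitude, it compares only the direction of the latent codes, which is convenient when the two encoders are not calibrated to the same scale.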
5 Conclusion and Future Work
We proposed the Twin-GAN framework, which performs cross-domain image translation on unlabeled, unpaired data. We demonstrated it on human-to-anime and human-to-cat face translation tasks, and showed that TwinGAN is capable of extracting semantic information shared across the two domains while encoding the domain-specific information in the adaptive normalization parameters.
Despite these successes, there is still a lot of room for improvement. The process of selecting which features to translate is not controllable in our current framework. Furthermore, when applied to more diverse datasets involving changing points of view and reasoning about the 3D environment, our framework does poorly: we experimented with applying TwinGAN to image translation from the game Minecraft to real street views, and our network collapsed even at low resolution. We hope to address these issues in future work.
This work is not an official Google-supported project and was developed solely by the author. We would like to thank Alan Tian, Yanghua Jin, and Minjun Li for their inspiration. We would also like to thank all the creators of the artworks and fan-arts without whom this research would not have been possible.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017.
-  Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6594–6604, 2017.
-  Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. Proc. of ICLR, 2017.
-  Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
-  Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pages 1510–1519, 2017.
-  Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. arXiv preprint arXiv:1804.04732, 2018.
-  Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pages 1945–1953, 2017.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
-  Yanghua Jin, Jiakai Zhang, Minjun Li, Yingtao Tian, Huachun Zhu, and Zhihao Fang. Towards the automatic anime characters creation with generative adversarial networks. arXiv preprint arXiv:1708.05509, 2017.
-  Alexia Jolicoeur-Martineau. Deep learning with cats. https://github.com/AlexiaJM/Deep-learning-with-cats, 2018. Accessed: 2018-03-20.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
-  Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088, 2017.
-  Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
-  Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep painterly harmonization. arXiv preprint arXiv:1804.03189, 2018.
-  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Moressi, Forrester Cole, and Kevin Murphy. Xgan: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
-  Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
-  Lvmin Zhang, Yi Ji, and Xin Lin. Style transfer for anime sketches with enhanced residual u-net and auxiliary classifier gan. arXiv preprint arXiv:1706.03319, 2017.
-  Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection-how to effectively exploit shape and texture features. In European Conference on Computer Vision, pages 802–816. Springer, 2008.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.