Photo-to-Caricature Translation on Faces in the Wild

11/29/2017 ∙ by Ziqiang Zheng, et al. ∙ Ocean University of China

Recently, image-to-image translation has made much progress owing to the success of conditional Generative Adversarial Networks (cGANs). However, translation tasks that require high-level visual information conversion remain very challenging, such as photo-to-caricature translation, which requires satire, exaggeration, lifelikeness and artistry. We present an approach for learning to translate faces in the wild from the source photo domain to the target caricature domain with different styles, which can also be used for other high-level image-to-image translation tasks. To capture global structure together with local statistics during translation, we design a dual pathway model of cGAN with one global discriminator and one patch discriminator. Beyond standard convolution (Conv), we propose a new parallel convolution (ParConv) to construct Parallel Convolutional Neural Networks (ParCNNs) for both the global and patch discriminators, which can combine the information from the previous layer with the current layer. For the generator, we add three extra losses to the adversarial loss to constrain the consistency of the generated output with itself and with the target. The style can also be controlled by the input style info vector. Experiments on photo-to-caricature translation of faces in the wild show a considerable performance gain of our proposed method over state-of-the-art translation methods, as well as its potential for real applications.




1 Introduction

Image-to-image translation has made much progress Isola_2017_CVPR ; Zhu_2017_ICCV ; Yi_2017_ICCV ; Chen_2017_ICCV ; kim2017learning ; mao2018semantic because many tasks zhou2016similarity ; wang2016super ; ma2017unsupervised ; shi2017end in image processing, computer graphics, and computer vision can be posed as translating an input image into a corresponding output image zhu2016generative ; li2016precomputed ; wang2016generative ; Ledig_2017_CVPR ; taigman2017unsupervised ; Sela_2017_ICCV ; Tung_2017_ICCV . These achievements owe mainly to the success of Generative Adversarial Networks (GANs) goodfellow2014generative , especially conditional GANs (cGANs) mirza2014conditional ; radford2016unsupervised ; Isola_2017_CVPR . However, current studies mainly concern image-to-image translation tasks with low-level visual information conversion, e.g., photo-to-sketch liu2018auto .

A caricature is a rendered image that shows the features of its subject in an exaggerated way; it is usually used to depict a politician or movie star for political or entertainment purposes. Creating caricatures is an artistic practice that traces back to the 17th century and the profession of the caricaturist. Some efforts have since been made to produce caricatures semi-automatically using computer graphics techniques akleman2000making , which provide warping tools designed specifically for rapidly producing caricatures. But very few software programs are designed specifically for creating caricatures automatically, and to the best of our knowledge, none is comparable with a caricaturist. Nowadays, besides political and public-figure satire, caricatures are also used as gifts or souvenirs, and more and more museums dedicated to caricature have opened throughout the world. It would therefore be very useful and meaningful if computers could create caricatures from photos automatically and intelligently.

Photo-to-caricature is a typical high-level image-to-image translation problem, but it poses a bigger challenge than low-level translation problems such as photo-to-label, photo-to-map, or photo-to-sketch Isola_2017_CVPR , because caricatures

  • require satire and exaggeration of photos;

  • need artistry with different styles;

  • must be lifelike, especially the expression of a face photo.

Specifically, for a face photo, we want to create face caricatures with different styles that exaggerate the face shape or the facial sense organs (i.e., ears, mouth, nose, eyes and eyebrows) but keep the vivid expression while producing artistry.

In this paper, we propose a GAN-based method for learning to translate faces in the wild from the source photo domain to the target caricature domain (see Figure 1 for translation examples and Figure 2 for the architecture of our proposed method). Although deep convolutional neural networks with adversarial training radford2015unsupervised can generate images with sufficiently precise facial features berthelot2017began , these images sometimes still show wrong relationships or mismatches among facial organs such as the nose and eyes, e.g., a face with more than two eyes or a crooked nose. Traditional GANs can produce correct facial organs but wrong relationships between them, and it is also very challenging to abstract and exaggerate the face and facial organs. We attribute this problem to the insufficient capacity of the GAN discriminator to distinguish real images from fake ones. Therefore, our motivation is to design adversarial training with multiple discriminators to improve the feature representation ability of the GAN's discriminator.

Figure 1: Translating faces in the wild from photo to caricature with different styles by our proposed method. (a) Example results on IIIT-CFW dataset mishra2016iiit ; (b) Example results on PHOTO-SKETCH dataset zhang2011coupled ; wang2009face .

Based on the CycleGAN model Zhu_2017_ICCV , we design a dual pathway model of GAN for high-level image-to-image translation tasks, where the coarse discriminator pathway is in charge of abstracting the global structure information, while the fine discriminator pathway is responsible for the local statistics information. For the generator, besides the adversarial loss, we add an extra perceptual similarity loss to constrain the consistency of the generated output with itself and with the unpaired target domain image. With our proposed method, photos of faces in the wild can be translated to caricatures with learned general-purpose exaggerated artistic styles while still keeping the original lifelike expression (see Figures 1 and 11). Considering that traditional GANs are not robust and are easily disturbed by noise, we design a noise-added training procedure to improve the robustness of our model. Inspired by InfoGAN chen2016infogan , we find that auxiliary noise can help the model learn the caricature style information while translating images in our task.

We have extensively evaluated our method on the IIIT-CFW dataset mishra2016iiit , PHOTO-SKETCH dataset zhang2011coupled ; wang2009face , Caricature abaci2015matching , FEI dataset thomaz2010new , Yale dataset georghiades1997yale , KDEF dataset lundqvist1998karolinska and CelebA dataset liu2015faceattributes . The experimental results show that our method can create acceptable caricatures from face photos while current state-of-the-art image-to-image translation methods cannot. The designed experiments also indicate the effectiveness of our proposed dual pathway of discriminators, the additional noise input and the extra perceptual loss, respectively. In addition, we tested our photo-to-caricature translation method with different proportions of added noise to show its translation robustness and style diversity. Furthermore, the proposed method can create caricatures for arbitrary face photos without pre-training on extra face datasets. Another notable strength of our method is that the model can capture expression information and apply some abstraction and exaggeration. This might help fill the aforementioned gap in automatic and intelligent caricature creation.

2 Related Work

Image-to-image translation. Owing to the success of GANs goodfellow2014generative , especially various conditional GANs mirza2014conditional ; mathieu2015deep ; reed2016generative ; yan2016attribute2image ; wang2016generative ; zhang2017image , much progress has recently been made on image-to-image translation problems, which aim to translate an input image from one domain to another given input-output image pairs Isola_2017_CVPR . Earlier image-conditional models for specific applications have achieved impressive results on inpainting Pathak_2016_CVPR , de-raining zhang2017image , texture synthesis li2016precomputed , style transfer wang2016generative , video prediction mathieu2015deep and super-resolution Ledig_2017_CVPR . The general-purpose solution for image-to-image translation developed by Isola et al. Isola_2017_CVPR , released as the Pix2pix software, has achieved reasonable results on many translation tasks such as photo-to-label, photo-to-map and photo-to-sketch by using paired images for training. CycleGAN Zhu_2017_ICCV , DualGAN Yi_2017_ICCV , and DiscoGAN kim2017learning were then proposed for unpaired or unsupervised image-to-image translation, with results almost comparable to those of paired or supervised methods. However, the translation tasks that these works can tackle usually concern the conversion of low-level visual information such as line (photo-to-sketch), color (summer-to-winter), and texture (horse-to-zebra), and translation tasks that require high-level visual information conversion, e.g., abstraction and exaggeration (photo-to-caricature), remain very challenging. Recently, Iizuka et al. iizuka2017globally combined two discriminators, called the global discriminator and the local discriminator, to improve adversarial training for the image completion task. Their experimental results show that the global discriminator distinguishes images based on the global parts while the local discriminator pays attention to the details of parts. We exploit the design of two discriminators for high-level image-to-image translation tasks in this paper.

Photo-to-cartoon translation. Translating photos to cartoons with computer algorithms has been studied for a long time because of its interesting and meaningful applications. Earlier work mainly relied on the analysis of facial features luo2002exaggeration ; chen2004automatic ; liao2004automatic , which is hard to apply to large-scale faces in the wild. Thanks to the invention of GANs goodfellow2014generative , automatic photo-to-cartoon translation has become feasible. But most of the current related work focuses on the generation of anime zhang2017style or emoji taigman2017unsupervised with a specific style (we consider anime, emoji, and caricature to be three types, or subsets, of cartoon). These works do not address caricature creation, which needs to be exaggerated, lifelike and artistic.

Unlike prior work on image-to-image translation, which deals with low-level visual information conversion, our study focuses on translation problems that require high-level visual information conversion, e.g., photo-to-caricature. For this purpose, our method differs from past work in network architecture as well as in layers and losses. We design a dual pathway model of GAN with two discriminators, named the coarse discriminator and the fine discriminator, to capture global structure and local statistics respectively, and we apply the perceptual similarity loss to the generator. To learn the style information and improve the robustness of our model, we provide the input with auxiliary noise. Here we show that our approach is effective on the face photo-to-caricature translation task, which typically requires high-level visual information conversion.

3 Method

Conditional GANs are generative models that learn a mapping from an observed input image to a target image. The objective of our conditional GAN can be expressed as
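In the standard conditional GAN formulation of Isola_2017_CVPR (written here without a noise input), this objective takes the form below. The notation is an assumption made for this sketch: $x$ is the observed input photo, $y$ a target image, $G$ the generator and $D$ the discriminator.

```latex
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x, y}\left[\log D(x, y)\right]
  + \mathbb{E}_{x}\left[\log\left(1 - D\left(x, G(x)\right)\right)\right]
```

The discriminator tries to maximize this quantity while the generator tries to minimize it.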


The goal is to learn a generator distribution over the data that matches the real data distribution by transforming an observed input image into a generated sample. The generator is trained by playing against an adversarial discriminator that aims to distinguish between samples from the true data distribution and samples from the generator's distribution. To handle the high-level visual information conversion required by more challenging image-to-image translation tasks, we extend the single discriminator to dual discriminators and add two more losses (perceptual loss and cycle consistency loss) to the adversarial loss, so that our method can tackle conversions of both low-level (line, color, texture, etc.) and high-level (expression, exaggeration, artistry, etc.) visual information. To improve the robustness of our model, we train it with added noise and use the auxiliary noise to learn the style information while translating. The overall network architecture and data flow are illustrated in Figure 2.

Figure 2: Network architecture and data flow chart of our proposed method for face photo-to-caricature translation.

3.1 Cycle consistency loss

Adversarial training is commonly used to learn the mapping function between two different image domains. Here we use two identical generators (named G1 and G2 in Figure 2) to model the sample distributions of the two domains, which in our task are the photo domain and the caricature domain. The generators emulate the translation between the two domains. However, using only the adversarial loss cannot guarantee a plausible generation, so a cycle consistency loss is used to help establish the mapping function between domains Zhu_2017_ICCV . It is expressed as
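Following CycleGAN Zhu_2017_ICCV , this loss can be written as below. The symbols are assumptions for this sketch: $X$ is the photo domain, $Y$ the caricature domain, $G_1$ maps photos to caricatures, and $G_2$ maps caricatures back to photos.

```latex
\mathcal{L}_{cyc}(G_1, G_2) =
  \mathbb{E}_{x \sim X}\left[\left\lVert G_2\left(G_1(x)\right) - x \right\rVert_1\right]
  + \mathbb{E}_{y \sim Y}\left[\left\lVert G_1\left(G_2(y)\right) - y \right\rVert_1\right]
```

Each term penalizes the difference between a sample and its reconstruction after a round trip through both generators.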


Here the two generators are G1 and G2, and the samples are drawn from the photo domain and the caricature domain respectively. We use the L1 loss for the cycle consistency loss, following Zhu et al. Zhu_2017_ICCV .

3.2 Perceptual loss

To further reduce the space of possible mapping functions between domains, we apply a perceptual loss to our model. For a constrained translation problem, finding an appropriate loss function is critical. We adopt the content loss of Gatys et al. gatys2016image , which is also referred to as a perceptual similarity loss or feature matching bruna2015super ; dosovitskiy2016generating ; johnson2016perceptual ; ledig2016photo . We apply the perceptual loss in addition to the cycle consistency loss, and compute it between unpaired images from different domains to push the generator to capture the feature representations. The loss is defined in terms of a pre-trained visual perception network (we use a pre-trained VGG19 in our experiments) and the number of feature maps that are matched. Different layers in the network represent low-to-high level information, from edges and color to objects and semantic representation. Matching both low and high layers of the perception network helps achieve a good translation. The perceptual loss can be expressed as


Here the loss compares an image from the caricature domain in our task with the caricature image synthesized by the generator. Note that these two images are unpaired.
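The feature-matching idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_features` is a hypothetical stand-in for the VGG19 activations (it builds a pyramid of average-pooled maps), and the per-layer distance is assumed to be a mean absolute difference, which the text above does not confirm.

```python
def avg_pool2x2(fmap):
    """Halve the resolution of a 2-D list by averaging each 2x2 block."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            (fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1]
             + fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def toy_features(image, num_layers=3):
    """Hypothetical stand-in for a perception network: one map per 'layer'."""
    feats, current = [], image
    for _ in range(num_layers):
        feats.append(current)
        current = avg_pool2x2(current)
    return feats


def perceptual_loss(target, generated, num_layers=3):
    """Sum over layers of the mean absolute difference between feature maps."""
    total = 0.0
    layers = zip(toy_features(target, num_layers),
                 toy_features(generated, num_layers))
    for f_t, f_g in layers:
        diffs = [abs(a - b)
                 for row_t, row_g in zip(f_t, f_g)
                 for a, b in zip(row_t, row_g)]
        total += sum(diffs) / len(diffs)
    return total


# Two constant 8x8 "images" differ by 1.0 at every layer of the pyramid.
black = [[0.0] * 8 for _ in range(8)]
white = [[1.0] * 8 for _ in range(8)]
loss = perceptual_loss(black, white)
```

In a real system the feature extractor would be the pre-trained network, and the loss would be back-propagated into the generator only.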

3.3 Auxiliary noise input

Previous research has shown that plausible images can be generated from noise input radford2016unsupervised ; chen2016infogan ; zhao2016energy ; berthelot2017began . To improve the robustness and enrich the diversity of image translation between domains, we design a noise-added training procedure that is applied before the translation, as shown in Figure 2. First we draw a random noise input from a uniform distribution, then we merge the noise input and the raw image input with appropriate weights to obtain the final input. We define a mixing proportion as the share of the raw image in the final image. This can be expressed as


Here the two terms being mixed are the raw input from the photo domain and the uniformly distributed noise. Figure 3 shows one sample with added noise. With the auxiliary noise input, the GAN objective is expressed as:

Figure 3: A sample with added noise.

We add the auxiliary noise to improve the robustness of our model, and we can obtain different output styles by adjusting the noise input (see Section 4.5). We draw the noise from a uniform distribution to make the style information more representable and to match the style space while translating. We also hope to control the style by adding different noise and thus make the translation conditional, which we leave as future work.
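The noise-added input described in this section is a convex combination of the raw photo and a noise image. A minimal sketch follows, assuming the name `alpha` for the proportion of the raw image and assuming noise bounds of [0, 1]; neither the symbol nor the range is given in the text above, so both are illustrative.

```python
import random


def blend_with_noise(image, alpha, rng=None, low=0.0, high=1.0):
    """Return alpha * image + (1 - alpha) * noise, element by element.

    image     : 2-D list of floats (a single-channel image)
    alpha     : proportion of the raw image kept in the final input
    low, high : assumed bounds of the uniform noise (illustrative only)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    return [
        [alpha * pixel + (1.0 - alpha) * rng.uniform(low, high) for pixel in row]
        for row in image
    ]


photo = [[0.2, 0.8], [0.5, 1.0]]
# Keep 70% of the photo and mix in 30% uniform noise (seeded for repeatability).
mixed = blend_with_noise(photo, alpha=0.7, rng=random.Random(0))
```

Setting `alpha=1.0` returns the photo unchanged, while smaller values hand more of the input over to noise, which is the knob varied in Section 4.5.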

3.4 Dual discriminators

Traditional GANs usually have one generator and one discriminator for adversarial training. In contrast, we design two different discriminators to capture information at different levels.

In our method, we propose two different discriminators, called the coarse discriminator and the fine discriminator. The coarse discriminator encourages the generator to synthesize images based on global style and structure information for domain translation. In our high-level image-to-image task, the coarse discriminator captures the structure information and abstracts the representative information of a face photo, such as emotion and style. The fine discriminator aims to achieve feature matching and helps generate more plausible and precise images; it builds the adversarial training with the generator on face details such as the lips and eyes. Unlike Iizuka et al. iizuka2017globally , we do not use image patches as the input of a local discriminator. Instead, we provide both discriminators with the whole image as input, while their outputs differ (see Figure 2). Each discriminator outputs a patch matrix after a sigmoid activation function at its last layer, and the two patch matrices have different sizes. We tried different combinations of output sizes, and the experiments show that one particular combination of fine and coarse output sizes obtains the best translation result (see Section 4.3). The coarse discriminator has a smaller feature map with a more abstract representation than the fine discriminator. D1 and D2 in Figure 2 represent the dual discriminators for the two translation directions between the two domains.
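To illustrate how two discriminators that see the same whole image but emit patch maps of different resolution can be combined, here is a small sketch. Everything in it is hypothetical: `patch_scores` is a toy scoring function rather than a convolutional network, the grid sizes 2 and 4 are placeholders for the paper's output sizes, and summing the two binary cross-entropy losses is an assumed way of combining them.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def patch_scores(image, grid):
    """Toy discriminator head: score each cell of a grid x grid partition."""
    cell = len(image) // grid
    scores = []
    for gi in range(grid):
        row = []
        for gj in range(grid):
            vals = [image[i][j]
                    for i in range(gi * cell, (gi + 1) * cell)
                    for j in range(gj * cell, (gj + 1) * cell)]
            row.append(sigmoid(sum(vals) / len(vals)))
        scores.append(row)
    return scores


def bce(scores, label, eps=1e-12):
    """Binary cross-entropy of a patch map against one real/fake label."""
    flat = [s for row in scores for s in row]
    return -sum(label * math.log(s + eps) + (1.0 - label) * math.log(1.0 - s + eps)
                for s in flat) / len(flat)


def dual_discriminator_loss(real, fake, coarse_grid=2, fine_grid=4):
    """Sum the real/fake losses of the coarse and the fine patch maps."""
    loss = 0.0
    for grid in (coarse_grid, fine_grid):
        loss += bce(patch_scores(real, grid), 1.0)
        loss += bce(patch_scores(fake, grid), 0.0)
    return loss


real_img = [[5.0] * 8 for _ in range(8)]
fake_img = [[-5.0] * 8 for _ in range(8)]
good = dual_discriminator_loss(real_img, fake_img)   # inputs labeled correctly
bad = dual_discriminator_loss(fake_img, real_img)    # labels swapped
```

The coarse map has fewer, larger cells, so each score summarizes more of the face, while the fine map judges smaller regions.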

3.5 Generator

Previous studies have found it beneficial to mix the GAN objective with a more traditional norm, such as the L2 Pathak_2016_CVPR or L1 Isola_2017_CVPR distance. We extend this idea to the cycle consistency loss by applying the L1 distance to compute it. Our final objective is
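One form consistent with the description in this section, shown for the photo-to-caricature direction, is given below. It is a reconstruction under assumed notation rather than a verbatim equation: $D_c$ and $D_f$ stand for the coarse and fine discriminators, $\mathcal{L}_{per}$ for the perceptual loss, and $\lambda_{cyc}$, $\lambda_{per}$ for the two balancing hyperparameters.

```latex
G_1^{*} = \arg\min_{G_1, G_2}\ \max_{D_c, D_f}\
  \mathcal{L}_{cGAN}(G_1, D_c) + \mathcal{L}_{cGAN}(G_1, D_f)
  + \lambda_{cyc}\, \mathcal{L}_{cyc}(G_1, G_2)
  + \lambda_{per}\, \mathcal{L}_{per}(G_1)
```

The two adversarial terms are maximized by the discriminators, while the cycle consistency and perceptual terms depend only on the generators.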


The objective combines the adversarial loss with the cycle consistency loss and the perceptual loss, each of the two extra losses weighted by a hyperparameter that balances its contribution to the objective. We use greedy search to choose the hyperparameters and keep them fixed for all the experiments in this paper.

As shown in Figure 2, we use Conv-Residual blocks-Deconv he2016deep as the generator to share the low-level and high-level information between the input and output directly across the net.

4 Experiments

As a typical image-to-image translation task, photo-to-caricature requires high-level visual information conversion, which is very challenging for the state-of-the-art general-purpose solutions. To explore the effect of our proposed model, we tested the method on a variety of datasets for translating faces in the wild from photo to caricature with different styles, and the qualitative results are shown in Figs. 4 and 11.

4.1 Dataset and training

Our proposed model is trained in an unpaired fashion on a paired face photo-caricature dataset, named the IIIT-CFW-P2C dataset, which we rebuilt from IIIT-CFW mishra2016iiit . IIIT-CFW is a dataset of cartoon faces in the wild and contains annotated cartoon faces of famous personalities from around the world with varying professions. It also provides real faces of the public figures for cross-modal retrieval tasks. However, it is not suitable for training the photo-to-caricature translation task with paired methods (such as Pix2pix) because the face photos and face cartoons (we consider caricature to be a type, or subset, of cartoon) are not paired, e.g., the facial orientation and expression of the photo and the caricature of the same person vary a lot. We therefore rebuilt a photo-caricature dataset with paired images, by searching the IIIT-CFW dataset and the Internet, as the training set for the comparison experiments. We use one part of the pairs for training and the rest for testing. At inference time, we run the generator in exactly the same manner as during the training phase. In addition, we extensively evaluated our method on a variety of datasets with faces in the wild, including Caricature abaci2015matching , FEI thomaz2010new , IIIT-CFW mishra2016iiit , Yale georghiades1997yale , KDEF lundqvist1998karolinska and CelebA liu2015faceattributes .

We also treat photo-to-sketch as a photo-to-caricature task in experiments on the PHOTO-SKETCH dataset zhang2011coupled ; wang2009face , which has paired images and hence can be used directly for the supervised training of the compared paired methods. Following DualGAN, we split the unpaired images into a training set and a test set. Note that we train CycleGAN, DualGAN and our model using unpaired images, and train Pix2pix using paired images, on both datasets.

4.2 Comparison with state-of-the-arts

Using the IIIT-CFW-P2C dataset, we first compare our proposed method with Pix2pix Isola_2017_CVPR , DualGAN Yi_2017_ICCV , DiscoGAN kim2017learning and CycleGAN Zhu_2017_ICCV on the photo-to-caricature translation task. All four compared methods were trained on the same training set and tested on novel data from the IIIT-CFW-P2C dataset that does not overlap with the training data.

Qualitative evaluation

Figure 4 shows the experimental results of the comparison. DualGAN only learned the color and edge translation rather than structure information, Pix2pix makes structural errors in almost all cases, and CycleGAN keeps the structure information of the input image but without enough conversion for the caricature creation task. DiscoGAN struggles to generate plausible and meaningful results because of the lack of training data (far fewer pairs in our experiments than the tens of thousands of pairs in DiscoGAN's experiments) and the difficulty of the task (photo-to-caricature vs. edge-to-photo). Although the results of our method are still not good enough compared with human caricaturists, the experiments on photo-to-caricature translation of faces show a considerable performance gain of our proposed method over state-of-the-art image-to-image translation methods, especially an encouraging ability of exaggeration and abstraction. However, because the task is very challenging, with little training data covering various styles, our method also messes up some details during translation (see the mouth in the first row and the eyes in the fourth row of our results in Figure 4).

Figure 4: Comparison of state-of-the-art image-to-image translation methods with our proposed method for face photo-to-caricature translation on IIIT-CFW-P2C dataset.

This experiment also shows that high-level image-to-image translation tasks like photo-to-caricature are generally more difficult than low-level translation tasks such as photo-to-sketch, because they need not only to abstract the facial features but also to exaggerate the emotional expression. Pixel-level methods (like Pix2pix) might therefore fail, as they force the generator to concentrate on local information rather than the whole structure.

We also evaluate our method on the PHOTO-SKETCH dataset. Figure 5 shows the comparison results. Compared with Pix2pix, our method reduces blurriness and artifacts. Although DualGAN and CycleGAN can also preserve the structure information of the input faces, they are not good at achieving abstraction and artistry. Similarly, DiscoGAN collapses when facing insufficient training data and a highly challenging task.

Figure 5: Comparison of state-of-the-art image-to-image translation methods with our proposed method for face photo-to-caricature translation on PHOTO-SKETCH dataset.

Quantitative evaluation

Beyond the visual qualitative evaluation, we also evaluated the translated cartoon results of the different methods quantitatively on the IIIT-CFW-P2C and PHOTO-SKETCH datasets in terms of both human judgment and machine scoring. The average results are shown in Table 1 and Table 2 respectively.

For the human score, we invited 40 volunteers to evaluate the quality of the images generated by the different methods in terms of satire, exaggeration, lifelikeness and artistry compared with the given original photo, grading from 1 to 5 (1 is the worst and 5 is the best). For the inception score, we used a pre-trained classifier network and sampled generated images for evaluation, following salimans2016improved . The results show that our proposed method outperforms state-of-the-art image-to-image translation methods, with the highest human and inception scores.
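For reference, the inception score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal distribution over all images. The sketch below computes it from a list of class-probability vectors; the classifier that would produce those vectors is not included, and the function name is only illustrative.

```python
import math


def inception_score(probs):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from per-image class probabilities.

    probs: list of lists, one probability vector per image, each summing to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with zero probability contribute nothing to the KL divergence.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0.0)
    return math.exp(kl_sum / n)


# Identical, uninformative predictions give the minimum score of 1.
uniform_score = inception_score([[0.5, 0.5], [0.5, 0.5]])
# Confident predictions spread evenly over 2 classes give the maximum score of 2.
confident_score = inception_score([[1.0, 0.0], [0.0, 1.0]])
```

The score is commonly reported as a mean and standard deviation over several splits of the sampled images, which matches the two numbers given per method in the tables.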

Method      Human score   Inception score
Pix2pix     2.0106        1.5069 ± 0.1090
DualGAN     1.9946        1.4843 ± 0.1049
DiscoGAN    1.5014        1.3366 ± 0.0714
CycleGAN    3.5001        1.5684 ± 0.1331
Ours        4.0120        1.6043 ± 0.0918
Table 1: Generated image quality evaluation results of different methods on the IIIT-CFW-P2C dataset.

Method      AMT           Inception score
Pix2pix     2.0971        1.3625 ± 0.0706
DualGAN     2.3810        1.4063 ± 0.0822
DiscoGAN    1.0858        1.3142 ± 0.0206
CycleGAN    3.2272        1.3980 ± 0.1130
Ours        4.0750        1.4298 ± 0.0818
Table 2: Generated image quality evaluation results of different methods on the PHOTO-SKETCH dataset.

4.3 Dual discriminators

In this experiment, we verify the effectiveness of our proposed dual pathway of discriminators. We first use only the coarse discriminator (Coarse D) and only the fine discriminator (Fine D) separately, and then use dual discriminators with one coarse discriminator plus one fine discriminator (Coarse D + Fine D), while keeping all other architectures and settings fixed for training and testing. Some example results are shown in Figure 6. The model with only Fine D largely misses the key structure information of faces, whereas our dual Coarse D + Fine D model renders the structure of facial features well. This further suggests that a Fine D model alone only attends to local statistics, which suits low-level image-to-image translation tasks (e.g., photo-to-sketch).

Figure 6: Comparison of only one discriminator (Coarse D and Fine D respectively) with our dual discriminators (Coarse D + Fine D, C & F).

We also ran experiments to search greedily for the best combination of output patch sizes for the coarse and fine discriminators. Figure 7 shows the results of different combinations, which indicate that large Fine D output patches fail to abstract and exaggerate faces, while small Fine D output patches abstract and exaggerate faces too much.

Figure 7: Examples of different combinations of output patch sizes for the coarse discriminator and the fine discriminator. In the labels, C denotes the output patch of the coarse discriminator and F the output patch of the fine discriminator. Large F outputs fail to abstract faces and achieve exaggeration, while small F outputs abstract and exaggerate faces too much.

4.4 Loss selection

We first check whether the cycle consistency loss should be added to the GAN objective (Equation 6); the second column in Figure 8 shows the results without it. Without the cycle consistency loss, although the adversarial system with the adversarial loss can capture the facial features, it is hard to generate caricature images with plausible objects and meaningful relationships between facial organs. The third column in Figure 8 shows the results after adding the cycle consistency loss. Normal adversarial training alone can therefore produce some kind of caricature style, but it fails to be lifelike because the components are not meaningful.

Figure 8: Comparison of the extra loss in the final generator objective: without and with the cycle consistency loss.

On top of the cycle consistency loss, we add the perceptual loss for the generator in our system. Figure 9 compares the results without and with the perceptual loss. The perceptual loss leads to images with exaggerated facial features such as the eyes, nose, and mouth. Because it expresses perceptual errors on facial features such as the head, eyes, and mouth, it improves the artistic expression of the generated images and shows a better abstraction ability. It also reduces blurriness. The second column of Figure 9, without the perceptual loss, shows indistinguishable facial expressions with distorted facial organs after translation, while the third column, with the perceptual loss added, improves the translation of facial expression and organs with a caricature effect, e.g., the smiling woman in the last row.

Figure 9: Comparison of the extra loss in the final generator objective: without and with the perceptual loss.

4.5 Auxiliary noise input

By adding auxiliary noise to our photo-to-caricature system, we can improve the robustness and diversity of synthesized facial caricatures. Figure 10 shows example results of adding auxiliary noise for photo-to-caricature translation. The outputs for different proportions of noise indicate that our system can still synthesize meaningful facial caricatures even when more than half of the input is noise (see Equation 4 for reference). The different proportions of noise also lead to different styles of output, which indicates that the proportion might be used as a factor for tuning the synthesized style.

Figure 10: Examples of adding auxiliary noise for robustness and diversity of our photo-to-caricature translation. Different proportions of noise in the input (see Equation 3) can also lead to meaningfully different styles of output caricatures.

4.6 Freestyle face caricature creation

To evaluate the potential for real applications in daily life, we tested our method on a variety of face datasets, including Caricature abaci2015matching , FEI thomaz2010new , IIIT-CFW mishra2016iiit , Yale georghiades1997yale , KDEF lundqvist1998karolinska and CelebA liu2015faceattributes , to illustrate photo-to-caricature translation on faces in the wild. The results are shown in Figure 11. These freestyle face caricature creation results show that our model works reasonably well on arbitrary faces and has potential value for related applications. The translated results on the KDEF, FEI and Yale datasets also show different facial expressions corresponding to the input faces. Our method successfully preserves the emotion information and renders the facial organs in caricature style. We therefore conclude that more abstract information, such as facial emotion and expression, is preserved together with the global structure information. In addition, our model can enlarge or shrink facial organs such as the chin, lips and eyes, which is required for high-level image-to-image translation tasks.

Figure 11: Example translated caricatures of facial photos from several datasets (Caricature abaci2015matching , FEI thomaz2010new , IIIT-CFW mishra2016iiit , Yale georghiades1997yale , KDEF lundqvist1998karolinska and CelebA liu2015faceattributes ) using our trained model on IIIT-CFW-P2C.

5 Conclusion and Future Work

We present a novel GAN-based method for a high-level image-to-image translation task, i.e., photo-to-caricature translation. The proposed method uses dual discriminators to capture global structure and local statistics information with abstraction ability, and adds an extra perceptual loss to the GAN objective to constrain consistency under exaggeration. In addition, the style information can be learned and represented by adding an auxiliary noise input, and the robustness can be improved by noise-added training. Experimental results show that our method not only outperforms other state-of-the-art image-to-image translation methods, but also works well on a variety of datasets for photo-to-caricature translation of faces in the wild.

Limitations. Translating photos to caricatures is a very challenging high-level image-to-image translation task, and our model fails in some cases, e.g., some generated images from the Yale and CelebA datasets in Figure 11. Although our method can keep the structure information of faces, it is still hard to render the details needed for high-quality caricatures, and some small organs (such as the eyes) lack detail in Figure 4. The method is also sensitive to side faces with complex backgrounds, e.g., some cases from the CelebA and KDEF datasets in Figure 11.

Future work. First, it would be interesting to investigate our method on other high-level image-to-image translation tasks (e.g., human-to-cartoon translation for cartoon movies). Second, the model still needs to be improved to produce high-quality rendered translation results. Third, we intend to apply our method to real applications of automatic and intelligent photo-to-caricature translation. Fourth, we hope to control the caricature style while translating images between domains by tuning the input noise.


We thank the volunteers for grading the human scores of the translation results from the different methods. This work was supported by the National Natural Science Foundation of China [61771440, 41776113], and Qingdao Municipal Science and Technology Program [17-1-1-5-jch].