A caricature is defined as “a picture, description, or imitation of a person or a thing in which certain striking characteristics are exaggerated in order to create a comic or grotesque effect”. Paradoxically, caricatures are images whose facial features represent the face more than the face itself. Compared to cartoons, which are 2D visual art that tries to re-render an object or even a scene in a usually simplified artistic style, caricatures are portraits that exaggerate features of certain persons or things. Some example caricatures of two individuals are shown in Figure 1. The fascinating quality of caricatures is that even with large amounts of distortion, the identity of the person in the caricature can still be easily recognized by humans. In fact, studies have found that we can recognize caricatures even more accurately than the original face images.
Caricature artists capture the most important facial features, including the face and eye shapes, hair styles, etc. Once an artist sketches a rough draft of the face, they start to exaggerate person-specific facial features towards a larger deviation from an average face. Nowadays, artists can create realistic caricatures with computer software by (1) warping the face photo to exaggerate the shape and (2) re-rendering the texture style. By mimicking this process, researchers have been working on automatic caricature generation. A majority of the studies focus on designing a good structural representation to warp the image and change the face shape. However, neither the identity information nor the texture differences between a caricature and a face photo are taken into consideration. In contrast, numerous works have made progress with deep neural networks to transfer image styles. Still, these approaches merely focus on translating the texture style, forgoing any changes in the facial features.
|Category|Study|Exaggeration space|Warping|
|---|---|---|---|
|Shape Deformation|Brennan et al.|Drawing Line|User-interactive|
|Shape Deformation|Liang et al.|2D Landmarks|User-interactive|
|Shape Deformation|CaricatureShop|3D Mesh|Automatic|
|Texture Rendering|Zheng et al.|Image to Image|None|
|Texture Rendering|CariGAN|Image + Landmark Mask|None|
|Texture + Shape|CariGANs|PCA Landmarks|Automatic|
|Texture + Shape|WarpGAN|Image to Image|Automatic|
In this work, we aim to build a completely automated system that can create new caricatures from photos by utilizing Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs). Different from previous works on caricature generation and style transfer, we emphasize the following challenges in our paper:
- Caricature generation involves both texture changes and shape deformation.
- The faces need to be exaggerated in a manner such that they can still be recognized.
- Caricature samples exist in various visual and artistic styles (see Figure 1).
In order to tackle these challenges, we propose a new type of style transfer network, named WarpGAN, which decouples the shape deformation and texture rendering into two tasks. Akin to a human operating image processing software, the generator in our system automatically predicts a set of control points that warp the input face photo into the closest resemblance to a caricature and also transfers the texture style through non-linear filtering. The discriminator is trained via an identity-preserving adversarial loss to distinguish between different identities and styles, and encourages the generator to synthesize diverse caricatures while automatically exaggerating facial features specific to the identity. Experimental results show that, compared to state-of-the-art generation methods, WarpGAN allows for texture update along with face deformation in the image space, while preserving the identity. Compared to other style transfer GANs, our method permits not only a transfer in texture style but also a deformation in shape. The contributions of the paper can be summarized as follows:
- A domain transfer network that decouples the texture style and geometric shape by automatically estimating a set of sparse control points to warp the images.
- A joint learning of texture style transfer and image warping for domain transfer with adversarial loss.
- A quantitative evaluation through face recognition performance, which shows that the proposed method retains identity information after transferring texture style and warping. In addition, we conducted two perceptual studies in which five caricature experts suggest that WarpGAN generates caricatures that are (1) visually appealing and (2) realistic, with only the appropriate facial features exaggerated, and that (3) our method outperforms the state-of-the-art.
- An open-source automatic caricature generator (code will be released after publication) where users can customize both the texture style and exaggeration degree.
2 Related Work
2.1 Automatic Image Warping
Many works have been proposed to enhance the spatial variability of neural networks via automatic warping. Most of them warp images by predicting a set of global transformation parameters or a dense deformation field. Parametric methods estimate a small number of global transformation parameters and therefore cannot handle fine-grained local warping, while dense deformation needs to predict all the vertices in a deformation grid, most of which are useless and hard to estimate. Cole et al. first proposed to use spline interpolation in neural networks to allow control point-based warping, but their method requires pre-detected landmarks as input. Several recent works have attempted to combine image warping with GANs to improve the spatial variability of the generator; however, these methods either train the warping module separately or need paired data as supervision. In comparison, our warping module can be inserted as an enhancement of a normal generator and can be trained as part of an end-to-end system without further modification. To the best of our knowledge, this study is the first work on automatic image warping with self-predicted control points using deep neural networks. An overview of different warping methods is shown in Figure 2.
2.2 Style Transfer Networks
Stylizing images by transferring art characteristics has been extensively studied in the literature. Given the effective ability of CNNs to extract semantic features, powerful style transfer networks have been developed. Gatys et al. first proposed a neural style transfer method that uses a CNN to transfer the style content from the style image to the content image. A limitation of this method is that both the style and content images are required to be similar in nature, which is not the case for caricatures. Using Generative Adversarial Networks (GANs) for image synthesis has been a promising field of study, where state-of-the-art results have been demonstrated in applications ranging from text-to-image translation [37] to image super-resolution. Domain Transfer Network, CycleGAN, StarGAN, UNIT, and MUNIT attempt image translation with unpaired image sets. All of these methods only use a de-convolutional network to construct images from the latent space and perform poorly on caricature generation due to the large spatial variation.
2.3 Caricature Generation
Studies on caricature generation can be mainly classified into three categories: deformation-based methods, texture-based methods, and methods with both. Traditional works mainly focused on exaggerating face shapes by enlarging the deviation of a given shape representation, such as 2D landmarks or 3D meshes, from the average; their deformation capability is usually limited because shape modeling can only happen in the representation space. Recently, with the success of GANs, a few works have attempted to apply style transfer networks to image-to-image caricature generation. However, their results suffer from poor visual quality because these networks are not suitable for problems with large spatial variation. Cao et al. recently proposed to decouple texture rendering and geometric deformation with two CycleGANs trained on image and landmark space, respectively. But with their face shape modeled in the PCA subspace of landmarks, they suffer from the same problem as the traditional deformation-based methods. In this work, we propose an end-to-end system with a joint learning of texture rendering and geometric warping. Compared with previous works, WarpGAN can model both shapes and textures in the image space with flexible spatial variability, leading to better visual quality and more artistic shape exaggeration. The differences between caricature generation methods are summarized in Table 1.
Let $x^p \in \mathcal{X}^p$ be images from the domain of face photos, $x^c \in \mathcal{X}^c$ be images from the caricature domain, and $s \in \mathcal{S}$ be the latent codes of texture styles. We aim to build a network that transforms a photo image into a caricature by both transferring its texture style and exaggerating its geometric shape. Our system includes one deformable generator $G$ (see Figure 3), one style encoder $E_s$, and one discriminator $D$ (see Figure 4). The important notations used in this paper are summarized in Table 2.
|Symbol|Meaning|Symbol|Meaning|
|---|---|---|---|
|$x^p$|real photo image|$y^p$|label of photo image|
|$x^c$|real caricature image|$y^c$|label of caricature image|
|$p$|estimated control points|$\Delta p$|estimated residuals of $p$|
|$M$|number of identities|$k$|number of control points|
The proposed deformable generator in WarpGAN is composed of three sub-networks: a content encoder $E_c$, a decoder $R$, and a warp controller. Given any image $x \in \mathbb{R}^{H \times W \times C}$, the encoder outputs a feature map $E_c(x)$. Here $H$, $W$ and $C$ are height, width and number of channels, respectively. The content decoder takes $E_c(x)$ and a random latent style code $s$ to render the given image into an image of a certain style. The warp controller estimates the control points and their residual flow to warp the rendered images. An overview of the deformable generator is shown in Figure 3.
Texture Style Transfer
Since there is a large variation in the texture styles of caricature images (see Figure 1), we adopt an unsupervised method to disentangle the style representation from the feature map so that we can transfer the input photo into different texture styles present in the caricature domain. During training, the latent style code $s$ is sampled randomly from a normal distribution and passed as an input into the decoder $R$. A multi-layer perceptron in $R$ decodes $s$ to generate the parameters of the Adaptive Instance Normalization (AdaIN) layers in $R$, which have been shown to be effective in controlling visual styles [16]. The generated images with random styles are then warped and passed to the discriminator. Various styles obtained from WarpGAN can be seen in Figure 5.
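The style-conditioning step described above can be illustrated with a minimal NumPy sketch. The function names (`style_mlp`, `adain`), the layer sizes, and the random weights are assumptions chosen for illustration; they are not taken from the WarpGAN implementation.

```python
import numpy as np

def style_mlp(style_code, w1, b1, w2, b2):
    """Hypothetical two-layer perceptron mapping a style code to AdaIN parameters."""
    hidden = np.maximum(0.0, style_code @ w1 + b1)   # ReLU activation
    params = hidden @ w2 + b2                        # shape (2 * C,)
    gamma, beta = np.split(params, 2)                # per-channel scale and shift
    return gamma, beta

def adain(features, gamma, beta, eps=1e-5):
    """Normalize each channel of an (H, W, C) map, then rescale with style parameters."""
    mean = features.mean(axis=(0, 1), keepdims=True)
    var = features.var(axis=(0, 1), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
C, style_dim, hidden_dim = 4, 8, 16                  # illustrative sizes only
features = rng.normal(size=(6, 6, C))                # stand-in for a decoder feature map
s = rng.normal(size=style_dim)                       # style code drawn from a normal distribution
w1, b1 = rng.normal(size=(style_dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, 2 * C)), np.zeros(2 * C)
gamma, beta = style_mlp(s, w1, b1, w2, b2)
styled = adain(features, gamma, beta)
```

After the call, each channel of `styled` has mean `beta` and standard deviation close to `|gamma|`, so a different style code changes the feature statistics without touching the spatial layout.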
To prevent $E_c$ and $R$ from losing semantic information during texture rendering, we combine the identity mapping loss and reconstruction loss to regularize $E_c$ and $R$. In particular, a style encoder $E_s$ is used to learn the inverse mapping from the image space to the style space $\mathcal{S}$. Given its own style code, both photos and caricatures should be reconstructed from the latent feature map:
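Written out under the assumption of an $\ell_1$ reconstruction penalty (the choice of norm and the notation are assumptions: $E_c$ is the content encoder, $E_s$ the style encoder, $R$ the decoder, and $x^p$, $x^c$ are photo and caricature images), this regularizer could take the form:

```latex
\begin{aligned}
\mathcal{L}_{idt}^{p} &= \mathbb{E}_{x^p}\Big[\big\lVert R\big(E_c(x^p), E_s(x^p)\big) - x^p \big\rVert_1\Big], \\
\mathcal{L}_{idt}^{c} &= \mathbb{E}_{x^c}\Big[\big\lVert R\big(E_c(x^c), E_s(x^c)\big) - x^c \big\rVert_1\Big].
\end{aligned}
```

Both terms vanish exactly when the decoder reproduces an image from its own content features and its own style code.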
Automatic Image Warping
The warp controller is a sub-network of two fully connected layers. With the latent feature map $E_c(x)$ as input, the controller learns to estimate $k$ control points $p = \{p_1, p_2, \ldots, p_k\}$ and their residual flow $\Delta p = \{\Delta p_1, \Delta p_2, \ldots, \Delta p_k\}$, where each $p_i$ and $\Delta p_i$ is a 2D vector in the u-v space. The points are then fed into a differentiable warping module. Let $p' = \{p'_1, p'_2, \ldots, p'_k\}$ be the destination points, where $p'_i = p_i + \Delta p_i$. A grid sampler of size $H \times W$ can then be computed via thin-plate spline interpolation:

$$f(q) = \sum_{i=1}^{k} w_i \, \phi\big(\lVert q - p'_i \rVert\big) + v^\top q + b$$

where the vector $q$ denotes the u-v location of a pixel in the target image, $f(q)$ gives the inverse mapping of the pixel in the original image, and $\phi(r)$ is the kernel function. The parameters $w$, $v$, $b$ are fitted to minimize $\sum_{j=1}^{k} \lVert f(p'_j) - p_j \rVert^2$ and a curvature constraint, which can be solved with a closed formula on the fly. With the grid sampler constructed via the inverse mapping function $f(q)$, the warped image can then be generated through bi-linear sampling. The entire warping module is differentiable and can be trained as part of an end-to-end system.
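The closed-form fit and the resulting sampling grid can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the kernel $r^2 \log r$, the exact-interpolation solve without a regularization term, the normalized $[0, 1]$ coordinates, and all function names are assumptions. Bilinear sampling of the image at the returned coordinates is left out.

```python
import numpy as np

def tps_kernel(r):
    """Radial basis U(r) = r^2 log r with U(0) = 0 (an assumed kernel choice)."""
    out = np.zeros_like(r)
    positive = r > 0
    out[positive] = r[positive] ** 2 * np.log(r[positive])
    return out

def fit_tps(dst_pts, src_pts):
    """Solve for weights w (k, 2) and affine part a (3, 2) so that f(dst_pts[i]) = src_pts[i]."""
    k = dst_pts.shape[0]
    dists = np.linalg.norm(dst_pts[:, None, :] - dst_pts[None, :, :], axis=-1)
    basis = np.hstack([np.ones((k, 1)), dst_pts])        # rows [1, u, v]
    system = np.zeros((k + 3, k + 3))
    system[:k, :k] = tps_kernel(dists)
    system[:k, k:] = basis
    system[k:, :k] = basis.T                             # side conditions on w
    target = np.zeros((k + 3, 2))
    target[:k] = src_pts
    solution = np.linalg.solve(system, target)
    return solution[:k], solution[k:]

def tps_apply(query, dst_pts, w, a):
    """Evaluate the inverse mapping f at query points of shape (n, 2)."""
    dists = np.linalg.norm(query[:, None, :] - dst_pts[None, :, :], axis=-1)
    basis = np.hstack([np.ones((query.shape[0], 1)), query])
    return tps_kernel(dists) @ w + basis @ a

def tps_grid(dst_pts, src_pts, height, width):
    """Return an (H, W, 2) grid holding, for each target pixel, its source (u, v) location."""
    w, a = fit_tps(dst_pts, src_pts)
    v, u = np.meshgrid(np.linspace(0, 1, height), np.linspace(0, 1, width), indexing="ij")
    query = np.stack([u.ravel(), v.ravel()], axis=1)
    return tps_apply(query, dst_pts, w, a).reshape(height, width, 2)

# Toy control points p and residual flow dp in normalized u-v coordinates.
p = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
dp = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.1, -0.05]])
grid = tps_grid(p + dp, p, height=8, width=8)
```

With a zero residual flow the grid reduces to the identity mapping, which is a convenient sanity check for any implementation of the warping module.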
Patch Adversarial Loss
We first use a fully convolutional network as a patch discriminator. The patch discriminator is trained as a 3-class classifier to enlarge the difference between the styles of generated images and real photos. Let $D_1$, $D_2$, and $D_3$ denote the logits for the three classes of caricatures, photos and generated images, respectively. The patch adversarial loss is as follows:
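A form consistent with this three-class description, given here as an assumption rather than as the paper's exact equation, is a softmax cross-entropy in which $\log D_i(\cdot)$ is read as the log-probability of class $i$, $G(x^p, s)$ is the caricature generated from photo $x^p$ with style code $s$, and $x^c$ is a real caricature:

```latex
\begin{aligned}
\mathcal{L}_{p}^{G} &= -\mathbb{E}_{x^p, s}\big[\log D_1\big(G(x^p, s)\big)\big], \\
\mathcal{L}_{p}^{D} &= -\mathbb{E}_{x^c}\big[\log D_1(x^c)\big]
                      -\mathbb{E}_{x^p}\big[\log D_2(x^p)\big]
                      -\mathbb{E}_{x^p, s}\big[\log D_3\big(G(x^p, s)\big)\big].
\end{aligned}
```

The discriminator learns to separate the three classes, while the generator is rewarded when its output is assigned to the caricature class.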
Identity-Preservation Adversarial Loss
Although the patch discriminator is suitable for learning visual style transfer, it fails to capture the distinguishing features of different identities. The exaggeration styles for different people could actually be different based on their facial features (see Section 4.2). To combine the identity-preservation and identity-specific style learning, we propose to train the discriminator as a $3M$-class classifier, where $M$ is the number of identities. The first, second, and third $M$ classes correspond to different identities of real photos, real caricatures and fake caricatures, respectively. Let $y^p, y^c \in \{1, 2, \ldots, M\}$ be the identity labels of the photos and caricatures, respectively. The identity-preservation adversarial losses for $G$ and $D$ are as follows:
Here, $D(y; x)$ denotes the logits of class $y$ given an image $x$. The discriminator is trained to tell the differences between real photos, real caricatures, and generated caricatures, as well as the identities in the image. The generator is trained to fool the discriminator into recognizing the generated image as a real caricature of the corresponding identity. Finally, the system is optimized in an end-to-end way with the following objective functions:
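The identity-preservation adversarial losses described in this subsection can be sketched with NumPy. The class layout follows the prose (first $M$ classes for real photos, next $M$ for real caricatures, last $M$ for generated caricatures); the function names, the 0-based labels, and the use of a mean softmax cross-entropy are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, classes):
    """Mean negative log-likelihood of the target classes under softmax(logits)."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(classes)), classes].mean()

def identity_adv_losses(logits_photo, logits_cari, logits_fake, y_photo, y_cari, num_ids):
    """Generator and discriminator losses for a 3M-class discriminator.

    Classes [0, M): real photos, [M, 2M): real caricatures, [2M, 3M): generated
    caricatures. Identity labels are 0-based integers in this sketch.
    """
    loss_d = (cross_entropy(logits_photo, y_photo)
              + cross_entropy(logits_cari, y_cari + num_ids)
              + cross_entropy(logits_fake, y_photo + 2 * num_ids))
    # The generator wants its output classified as a real caricature of the same identity.
    loss_g = cross_entropy(logits_fake, y_photo + num_ids)
    return loss_g, loss_d

rng = np.random.default_rng(0)
M, batch = 3, 2                                        # toy sizes
y_p, y_c = np.array([0, 2]), np.array([1, 1])
loss_g, loss_d = identity_adv_losses(
    rng.normal(size=(batch, 3 * M)), rng.normal(size=(batch, 3 * M)),
    rng.normal(size=(batch, 3 * M)), y_p, y_c, M)
```

Because every identity-domain pair is its own class, the generator can only lower `loss_g` by producing an image that still looks like the same person in the caricature domain.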
We use the images from a public domain dataset, WebCaricature (https://cs.nju.edu.cn/rl/WebCaricature.htm), to conduct the experiments. The dataset consists of caricatures and photos from identities. We align all the images with five landmarks (left eye, right eye, nose, mouth left, mouth right) using the ones provided in the WebCaricature dataset protocol. Since the protocol does not provide the locations of eye centers, we estimate them by taking the average of the corresponding eye corners. Then, the images are aligned through similarity transformation using the five landmarks and are resized to . We randomly split the dataset into a training set of identities ( photos and caricatures) and a testing set of identities ( photos and caricatures). All the testing images in this paper are from identities in the testing set.
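The alignment step can be sketched as a least-squares similarity fit between detected landmarks and a template. The function names and the landmark coordinates below are hypothetical; only the two ideas from the text (eye center as the mean of the eye corners, similarity transformation on five landmarks) are taken from the paper.

```python
import numpy as np

def eye_center(corner_a, corner_b):
    """Estimate an eye center as the average of its two corners."""
    return (np.asarray(corner_a, float) + np.asarray(corner_b, float)) / 2.0

def fit_similarity(src, dst):
    """Least-squares similarity transform with dst ~ src @ A.T + t.

    A = [[a, -b], [b, a]] encodes uniform scale and rotation; t is the translation.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = src.shape[0]
    design = np.zeros((2 * n, 4))                      # unknowns: a, b, tx, ty
    design[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    design[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    a, b, tx, ty = np.linalg.lstsq(design, dst.reshape(-1), rcond=None)[0]
    return np.array([[a, -b], [b, a]]), np.array([tx, ty])

# Hypothetical detected landmarks: eye centers from corners, then nose and mouth corners.
left_eye = eye_center([28, 40], [44, 42])
right_eye = eye_center([60, 41], [76, 39])
landmarks = np.array([left_eye, right_eye, [52, 60], [38, 78], [66, 77]])
template = np.array([[30, 40], [70, 40], [50, 62], [36, 80], [64, 80]], float)
A, t = fit_similarity(landmarks, template)
aligned = landmarks @ A.T + t                          # landmarks mapped into template space
```

The same `A` and `t` would then be used to resample the image itself before resizing.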
We use ADAM optimizers in TensorFlow with and for the whole network. Each mini-batch consists of a random pair of photo and caricature. We train the network for steps. The learning rate starts with and is decreased linearly to after steps. We empirically set , , and number of control points . We conduct all experiments using TensorFlow r1.9 and one GeForce GTX 1080 Ti GPU. The average speed for generating one caricature image on this GPU is s. The details of the architecture are provided in Section A.
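A linearly decaying learning-rate schedule of the kind described above can be sketched as follows. The reading that the rate stays constant until a given step and then decays linearly is an assumption, and every numeric value in the example call is a placeholder rather than a setting from the paper.

```python
def linear_decay_lr(step, base_lr, final_lr, decay_start, total_steps):
    """Hold base_lr until decay_start, then interpolate linearly to final_lr at total_steps."""
    if step <= decay_start:
        return base_lr
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return base_lr + (final_lr - base_lr) * progress

# Placeholder values chosen only to show the shape of the schedule.
schedule = [linear_decay_lr(s, base_lr=1e-3, final_lr=0.0, decay_start=100, total_steps=200)
            for s in (0, 100, 150, 200)]
```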
4.1 Comparison to State-of-the-Art
, and Multimodal UNsupervised Image-to-image Translation (MUNIT) for style transfer approaches (we train the baselines using their official implementations). We find that among the baseline style transfer networks, CycleGAN and MUNIT demonstrate the most visually appealing texture styles (see Figure 5). StarGAN and UNIT produce very photo-like images with minimal or erroneous changes in texture. Since all these networks focus only on transferring the texture styles, the baseline architectures fail to deform the faces into caricatures, unlike WarpGAN. The other issue with the baseline methods is that they do not have a module for warping the images and therefore try to compensate for deformations in the face using only texture. Due to the complexity of this task, it becomes increasingly difficult to train them, and they usually end up generating collapsed images.
4.2 Ablation Study
To analyze the function of different modules in our system, we train three variants of WarpGAN for comparison by removing the identity-preservation adversarial loss, the patch adversarial loss, and the identity mapping loss, respectively. Figure 6 compares these variants with the full model that includes all the loss functions. Without the proposed identity-preservation adversarial loss, the discriminator only focuses on local texture styles, and therefore the geometric warping fails to capture personal features and is close to random. Without the patch adversarial loss, the discriminator mainly focuses on facial shape and the model fails to learn diverse texture styles. The model without the identity mapping loss still performs well in terms of texture rendering and shape exaggeration. We keep the identity mapping loss to improve the visual quality of the generated images.
4.3 Shape Exaggeration Styles
Caricaturists usually define a set of prototypes of face parts and have certain modes on how to exaggerate them. In WarpGAN we do not adopt any method to exaggerate the facial regions explicitly, but instead we introduce the identity preservation constraint as part of the adversarial loss. This forces the network to exaggerate the faces to be more distinctive from other identities and implicitly encourages the network to learn different exaggeration styles for people with different salient features. Some example exaggeration styles learned by the network are shown in Figure 7.
4.4 Customizing the exaggeration
Although WarpGAN is trained as a deterministic model in terms of warping, we can introduce a parameter $\alpha$ during deployment to allow customization of the exaggeration extent. Before warping, the residual flow $\Delta p$ of the control points is scaled by $\alpha$ to control how much the face shape will be exaggerated. The results are shown in Figure 8. $\alpha < 1$ indicates less warping and $\alpha = 1$ leads to the default output of WarpGAN. When changing $\alpha$ to $2$, the images are warped with twice the default degree, but the resulting images still appear reasonable, with only the distinguishing facial features exaggerated. Since the texture styles are learned in a disentangled way, WarpGAN can also generate various texture styles. Figure 5 shows results from WarpGAN with three randomly sampled styles.
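The scaling of the residual flow can be written in a few lines. The function name and the toy coordinates are assumptions; the scale factor is called `alpha` here, with 1.0 as the unscaled default.

```python
import numpy as np

def destination_points(control_points, residual_flow, alpha=1.0):
    """Scale the predicted residual flow by alpha before computing warp destinations."""
    return np.asarray(control_points, float) + alpha * np.asarray(residual_flow, float)

p = np.array([[0.3, 0.4], [0.7, 0.4]])       # toy control points in u-v space
dp = np.array([[0.02, -0.01], [-0.03, 0.0]]) # toy residual flow
no_warp = destination_points(p, dp, alpha=0.0)   # texture change only
default = destination_points(p, dp)              # default exaggeration
doubled = destination_points(p, dp, alpha=2.0)   # twice the default displacement
```

Since only the destination points change, the same trained network can serve every exaggeration level at deployment time.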
4.5 Quantitative Analysis
Even with face deformation, identity in the caricature needs to be preserved. In order to quantify identity preservation accuracy for caricatures generated by WarpGAN, we evaluate automatic face recognition performance using two state-of-the-art face matchers: (1) a COTS matcher, which uses a convolutional neural network for face recognition, and (2) an open-source SphereFace matcher.
|Method|COTS|SphereFace|
|---|---|---|
|Photo-to-Photo|94.81 ± 1.22%|90.78 ± 0.64%|
|Hand-drawn-to-Photo|41.26 ± 1.16%|45.80 ± 1.56%|
|WarpGAN-to-Photo|79.00 ± 1.46%|72.65 ± 0.84%|
An identification experiment is conducted where one photo of the identity is kept in the gallery while all remaining photos, or all hand-drawn caricatures, or all synthesized caricatures for the same identity are used as probes. We evaluate the Rank-1 identification accuracy using 10-fold cross validation and report the mean and standard deviation across the folds in Table 3. We find that the generated caricatures can be matched to real face images with a higher accuracy than hand-drawn caricatures. We also observe the same trend for both matchers, which suggests that recognition on synthesized caricatures is consistent and matcher-independent.
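The Rank-1 identification protocol can be sketched as a nearest-neighbor search over gallery features. Cosine similarity and the function name are assumptions for illustration; the COTS matcher in particular is a black box whose scoring is not described here.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Fraction of probes whose most similar gallery entry has the same identity."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    q = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    similarity = q @ g.T                               # cosine similarity, probes x gallery
    predicted = np.asarray(gallery_ids)[similarity.argmax(axis=1)]
    return float((predicted == np.asarray(probe_ids)).mean())

# Toy features: one gallery photo per identity, three probes.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
probes = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
accuracy = rank1_accuracy(gallery, [10, 20], probes, [10, 20, 20])
```

Running this once per fold and taking the mean and standard deviation of `accuracy` would give numbers in the format reported in Table 3.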
We conducted two perceptual studies by recruiting 5 caricature artists who are experts in their field to compare hand-drawn caricatures with images synthesized by our baselines along with our WarpGAN. A caricature is generated from a random image for each of the 126 subjects in the WebCaricature testing set. The first perceptual study uses of them and are used for the second. Experts do not have any knowledge of the source of the caricatures and they rely solely on their perceptual judgment.
The first study assesses the overall similarity of the generated caricatures to the hand-drawn ones. Each caricature expert was shown a face photograph of a subject along with three corresponding caricatures generated by CycleGAN, MUNIT, and WarpGAN, respectively. The experts then rank each of the three generated caricatures from “most visually closer to a hand-drawn caricature” to “least similar to a hand-drawn caricature”. We find that caricatures generated by WarpGAN are ranked as the most similar to a real caricature of the time, compared to and for CycleGAN and MUNIT, respectively.
In the second study, experts scored the generated caricatures according to two criteria: (i) visual quality, and (ii) whether the caricatures are exaggerated in a proper manner where only prominent facial features are deformed. Experts are shown three photographs of a subject along with a caricature image that can either be (i) a real hand-drawn caricature, or (ii) generated using one of the three automatic generation methods. From Table 4 we find that WarpGAN receives the best perceptual scores out of the three state-of-the-art generation methods. Even though hand-drawn caricatures rate higher, WarpGAN narrows the gap between automatically generated and hand-drawn caricatures compared to the other state-of-the-art methods.
Joint Rendering and Warping Learning
Unlike other visual style transfer tasks, transforming photos into caricatures involves both texture difference and geometric transition. Texture is important in exaggerating local fine-grained features such as the depth of wrinkles, while geometric deformation allows exaggeration of global features such as face shape. Conventional style transfer networks aim to reconstruct an image from feature space using a decoder network. Because the decoder is a stack of nonlinear local filters, it is intrinsically inflexible in terms of spatial variation, and the decoded images usually suffer from poor quality and severe information loss when there is a large geometric discrepancy between the input and output domains. On the other hand, warping-based methods are by nature unable to change the content and fine-grained details. Therefore, both the style transfer and warping modules are necessary parts of our adversarial learning framework. As shown in Figure 6, without either module, the generator will not be able to close the gap between photos and caricatures and the balance of competition between generator and discriminator will be broken, leading to collapsed results.
Identity-preservation Adversarial Loss
The discriminator in conventional GANs is usually trained as a binary or ternary classifier, with each class representing a visual style. However, in our work, we found that because of the large variation of shape exaggeration styles in the caricatures, treating all the caricatures as one class in the discriminator would confuse the generator, as shown in Figure 6. We observe that caricaturists tend to give similar exaggeration styles to the same person. Therefore, we treat each identity-domain pair as a separate class to reduce the difficulty of learning and also encourage identity preservation after warping.
In this paper, we proposed a new method of caricature generation, namely WarpGAN, that addresses both style transfer and face deformation in a joint learning framework. Without explicitly requiring any facial landmarks, the identity-preserving adversarial loss introduced in this work appropriately learns to capture caricature artists’ style while preserving the identity in the generated caricatures. We evaluated the generated caricatures by matching synthesized caricatures to real photos and observed that the recognition accuracy is higher than that of caricatures drawn by artists. Moreover, five caricature experts suggest that caricatures synthesized by WarpGAN are not only pleasing to the eye, but also realistic, with only the appropriate facial features exaggerated, and that WarpGAN indeed outperforms the state-of-the-art networks.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  S. E. Brennan. Caricature generator: The dynamic exaggeration of faces by computer. Leonardo, 1985.
-  J. Bruna, P. Sprechmann, and Y. LeCun. Super-resolution with deep convolutional sufficient statistics. arXiv:1511.05666, 2015.
-  K. Cao, J. Liao, and L. Yuan. CariGANs: Unpaired Photo-to-Caricature Translation. arXiv:1811.00222, 2018.
-  Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. 2018.
-  F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In CVPR, 2017.
-  O. E. Dictionary. Caricature Definition. https://en.oxforddictionaries.com/definition/caricature, 2018. [Online; accessed 31-October-2018].
-  H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin. Soft-gated warping-gan for pose-guided person image synthesis. arXiv:1810.11610, 2018.
-  Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In ECCV, 2016.
-  L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, pages 262–270, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In CVPR, 2017.
-  C. A. Glasbey and K. V. Mardia. A review of image-warping methods. Journal of applied statistics, 1998.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  X. Han, K. Hou, D. Du, Y. Qiu, Y. Yu, K. Zhou, and S. Cui. Caricatureshop: Personalized and photorealistic caricature sketching. arXiv:1807.09064, 2018.
-  X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. arXiv:1804.04732, 2018.
-  J. Huo, W. Li, Y. Shi, Y. Gao, and H. Yin. Webcaricature: a benchmark for caricature face recognition. arXiv:1703.03230, 2017.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
-  M. M. Kalayeh, M. Seifu, W. LaLanne, and M. Shah. How to take a good selfie? In ACM MM, 2015.
-  B. F. Klare, S. S. Bucak, A. K. Jain, and T. Akgul. Towards automated caricature recognition. In ICB, 2012.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
-  T. Lewiner, T. Vieira, D. Martínez, A. Peixoto, V. Mello, and L. Velho. Interactive 3d caricature from harmonic exaggeration. Computers & Graphics, 2011.
-  W. Li, W. Xiong, H. Liao, J. Huo, Y. Gao, and J. Luo. CariGAN: Caricature Generation through Weakly Paired Adversarial Learning. arXiv:1811.00445, 2018.
-  L. Liang, H. Chen, Y.-Q. Xu, and H.-Y. Shum. Example-based caricature generation with exaggeration. In Pacific Conf. on Computer Graphics and Applications, 2002.
-  C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.
-  M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
-  Z. Mo, J. P. Lewis, and U. Neumann. Improved automatic caricature by feature normalization and exaggeration. In SIGGRAPH, 2004.
-  B. Qibal. Photoshop caricature tutorial. https://www.youtube.com/watch?v=EeL2F4cgyPs, 2015. [Online; accessed 04-November-2018].
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv:1605.05396, 2016.
-  G. Rhodes, S. Brennan, and S. Carey. Identification and ratings of caricatures: Implications for mental representations of faces. Cognitive Psychology, 1987.
-  A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe. Deformable gans for pose-based human image generation. In CVPR, 2018.
-  Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
-  D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.
-  J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, 2016.
-  R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv:1607.07539, 2016.
-  Z. Zheng, H. Zheng, Z. Yu, Z. Gu, and B. Zheng. Photo-to-caricature translation on faces in the wild. arXiv:1711.10735, 2017.
-  Z. Zheng, H. Zheng, Z. Yu, Z. Gu, and B. Zheng. Photo-to-caricature translation on faces in the wild. arXiv:1711.10735, 2017.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
Appendix A Architecture
Our network architecture is modified based on MUNIT. Let c7s1-k be a 7×7 convolutional layer with k filters and stride 1. dk denotes a convolutional layer with k filters and stride . Rk denotes a residual block that contains two convolutional layers. uk denotes an upsampling layer followed by a convolutional layer with k filters and stride . fck denotes a fully connected layer with k filters. We apply Instance Normalization (IN) to the content encoder and Adaptive Instance Normalization (AdaIN) to the decoder. No normalization is used in the style encoder. We use Leaky ReLU with slope 0.2 in the discriminator and ReLU activation everywhere else. The architectures of different modules are as follows:
A separate branch of convolutional layer with filters and stride is attached to the last convolutional layer of the discriminator to output for patch adversarial losses. The style decoder (the multi-layer perceptron) has two hidden fully connected layers of filters without normalization and the warp controller has only one hidden fully connected layer of filters with Layer Normalization . The length of the latent style code is set to .
Appendix B Additional Baselines
In the main paper, we compared WarpGAN with state-of-the-art style transfer networks as baselines. Here, we compare WarpGAN with other caricature generation works [25, 29, 15, 39, 24, 4]. Since these methods do not release their code and use different testing images, we crop the images from their papers and compare with them one by one. All the baseline results are also taken from their original papers. The results are shown in Figure 10.
Appendix C Transformation Methods
To see the advantage of the proposed control-points estimation for automatic warping, we train three variants of our model by replacing the warping method with (1) projective transformation, (2) dense deformation and (3) landmark-based warping. In projective transformation, the warp controller outputs parameters for the transformation matrix. In dense deformation, the warp controller outputs a deformation grid, which is further interpolated into for grid sampling. In landmark-based warping, we use the landmarks provided by Dlib (http://dlib.net/face_landmark_detection.py.html) and the warp controller only outputs the residual flow. As shown in Figure 12, the warping is too limited in projective transformation to generate artistic caricatures and so unconstrained in dense deformation that it is difficult to train. Landmark-based warping yields reasonable results, but it is limited by the landmark detector. In comparison, our method does not require any domain knowledge, has little limitation and leads to visually satisfying warping results.
Appendix D More Results
We show more results of the ablation study in Figure 11. The results are consistent with those in the main paper: (1) the joint learning of texture rendering and warping is crucial for generating realistic caricature images and (2) without the patch adversarial loss or the identity-preservation adversarial loss, the model cannot learn to generate caricatures with various texture styles and shape exaggeration styles.
Different Texture Styles
To test the performance of our model in more application scenarios, we download the public Selfie dataset (http://crcv.ucf.edu/data/Selfie/) for cross-dataset evaluation. The dataset includes public selfies crawled from the Internet. Unlike our training dataset (WebCaricature), the identities in this dataset are not restricted to celebrities and there is a difference between the visual styles of these images and the ones in our training dataset. The results are shown in Figure 14.