WarpGAN: Automatic Caricature Generation

11/25/2018 ∙ by Yichun Shi, et al. ∙ Michigan State University

We propose WarpGAN, a fully automatic network that can generate caricatures given an input face photo. Besides transferring rich texture styles, WarpGAN learns to automatically predict a set of control points that warp the photo into a caricature, while preserving identity. We introduce an identity-preserving adversarial loss that helps the discriminator distinguish between different subjects. Moreover, WarpGAN allows customization of the generated caricatures by controlling the exaggeration extent and the visual styles. Experimental results on a public domain dataset, WebCaricature, show that WarpGAN is capable of generating a diverse set of caricatures while preserving the identities. Five caricature experts suggest that caricatures generated by WarpGAN are visually similar to hand-drawn ones and that only prominent facial features are exaggerated.


1 Introduction

A caricature is defined as “a picture, description, or imitation of a person or a thing in which certain striking characteristics are exaggerated in order to create a comic or grotesque effect” [7]. Paradoxically, caricatures are images whose facial features represent the face more than the face itself. Compared to cartoons, which are 2D visual art that re-renders an object or even a scene in a usually simplified artistic style, caricatures are portraits that exaggerate the features of a particular person or object. Some example caricatures of two individuals are shown in Figure 1. The fascinating quality of caricatures is that even with large amounts of distortion, the identity of the person in the caricature can still be easily recognized by humans. In fact, studies have found that we can recognize caricatures even more accurately than the original face images [32].

Caricature artists capture the most important facial features, including the face and eye shapes, hair style, etc. Once an artist sketches a rough draft of the face, they start to exaggerate person-specific facial features towards a larger deviation from an average face. Nowadays, artists can create realistic caricatures with computer software by: (1) warping the face photo to exaggerate the shape and (2) re-rendering the texture style [30]. By mimicking this process, researchers have been working on automatic caricature generation [25][23]. A majority of these studies focus on designing a good structural representation to warp the image and change the face shape. However, neither the identity information nor the texture differences between a caricature and a face photo are taken into consideration. In contrast, numerous works have made progress with deep neural networks to transfer image styles [27][17]. Still, these approaches merely translate the texture style and forgo any changes to the facial geometry.

 

Approach | Study | Exaggeration Space | Warping
Shape Deformation | Brennan et al. [2] | Drawing Line | User-interactive
Shape Deformation | Liang et al. [25] | 2D Landmarks | User-interactive
Shape Deformation | CaricatureShop [15] | 3D Mesh | Automatic
Texture Rendering | Zheng et al. [39] | Image to Image | None
Texture Rendering | CariGAN [24] | Image + Landmark Mask | None
Texture + Shape | CariGANs [4] | PCA Landmarks | Automatic
Texture + Shape | WarpGAN | Image to Image | Automatic
Table 1: Comparison of various studies on caricature generation. The majority of published studies focus on either deforming the faces or transferring caricature texture styles, unlike the proposed WarpGAN which does both. Moreover, WarpGAN deforms the face directly in the image space, thereby truly capturing the transformations from a real face photo to a caricature, and it does not require facial landmarks for generating caricatures.

In this work, we aim to build a completely automated system that can create new caricatures from photos by utilizing Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs). In contrast to previous works on caricature generation and style transfer, we emphasize the following challenges:

  • Caricature generation involves both texture changes and shape deformation.

  • The faces need to be exaggerated in a manner such that they can still be recognized.

  • Caricature samples exist in various visual and artistic styles (see Figure 1).

In order to tackle these challenges, we propose a new type of style transfer network, named WarpGAN, which decouples shape deformation and texture rendering into two tasks. Akin to a human artist using image processing software, the generator in our system automatically predicts a set of control points that warp the input face photo into the closest resemblance of a caricature, and it also transfers the texture style through non-linear filtering. The discriminator is trained via an identity-preserving adversarial loss to distinguish between different identities and styles, and it encourages the generator to synthesize diverse caricatures while automatically exaggerating facial features specific to the identity. Experimental results show that, compared to state-of-the-art generation methods, WarpGAN allows texture updates along with face deformation in the image space, while preserving the identity. Compared to other style transfer GANs [40][17], our method not only permits a transfer in texture style, but also a deformation in shape. The contributions of the paper can be summarized as follows:

  • A domain transfer network that decouples the texture style and geometric shape by automatically estimating a set of sparse control points to warp the images.

  • A joint learning of texture style transfer and image warping for domain transfer with adversarial loss.

  • A quantitative evaluation through face recognition performance shows that the proposed method retains identity information after transferring texture style and warping. In addition, we conducted two perceptual studies in which five caricature experts suggest that (1) caricatures generated by WarpGAN are visually appealing, (2) they are realistic, with only the appropriate facial features exaggerated, and (3) our method outperforms the state-of-the-art.

  • An open-source automatic caricature generator where users can customize both the texture style and the exaggeration degree. (Code will be released after publication.)

Figure 2: Inputs and outputs of different types of warping modules in neural networks: (a) global parameters [19][8][26], (b) dense deformation field [9], (c) landmark-based warping [9], and (d) control-point estimation (ours). Given an image, WarpGAN automatically predicts both the control points and their residual flow based on local features.

2 Related Work

2.1 Automatic Image Warping

Many works have been proposed to enhance the spatial variability of neural networks via automatic warping. Most of them warp images by predicting a set of global transformation parameters [19][26] or a dense deformation field [9]. Parametric methods estimate a small number of global transformation parameters and therefore cannot handle fine-grained local warping, while dense deformation requires predicting all the vertices of a deformation grid, most of which are unnecessary and hard to estimate. Cole et al. [6] first proposed to use spline interpolation in neural networks to allow control-point-based warping, but their method requires pre-detected landmarks as input. Several recent works have attempted to combine image warping with GANs to improve the spatial variability of the generator; however, these methods either train the warping module separately [8][4], or need paired data as supervision [8][33]. In comparison, our warping module can be inserted as an enhancement of a normal generator and can be trained as part of an end-to-end system without further modification. To the best of our knowledge, this is the first work on automatic image warping with self-predicted control points using deep neural networks. An overview of different warping methods is shown in Figure 2.

2.2 Style Transfer Networks

Stylizing images by transferring art characteristics has been extensively studied in the literature. Given the effective ability of CNNs to extract semantic features [3][10][12][22], powerful style transfer networks have been developed. Gatys et al. [11] first proposed a neural style transfer method that uses a CNN to transfer the style content from a style image to a content image. A limitation of this method is that both the style and content images are required to be similar in nature, which is not the case for caricatures. Using Generative Adversarial Networks (GANs) [14][36] for image synthesis has been a promising field of study, where state-of-the-art results have been demonstrated in applications ranging from text-to-image translation [31] and image inpainting [37] to image super-resolution [22]. Domain Transfer Network [34], CycleGAN [40], StarGAN [5], UNIT [27], and MUNIT [17] attempt image translation with unpaired image sets. All of these methods only use a de-convolutional network to construct images from the latent space and perform poorly on caricature generation due to the large spatial variation [24][4].

Figure 3: The generator module of WarpGAN. Given a face image, the generator outputs the pixel residual map for texture style transfer and a set of control points along with their residuals. A differentiable module takes the control points and warps the rendered image to generate a caricature.

2.3 Caricature Generation

Studies on caricature generation can be mainly classified into three categories: deformation-based methods, texture-based methods, and methods that do both. Traditional works mainly focused on exaggerating face shapes by enlarging the deviation of a given shape representation, such as 2D landmarks or 3D meshes, from the average [2][25][23][15]; their deformation capability is usually limited because shape modeling can only happen in the representation space. Recently, with the success of GANs, a few works have attempted to apply style transfer networks to image-to-image caricature generation [39][24]. However, their results suffer from poor visual quality because these networks are not suitable for problems with large spatial variation. Cao et al. [4] recently proposed to decouple texture rendering and geometric deformation with two CycleGANs trained in the image and landmark spaces, respectively. But with the face shape modeled in a PCA subspace of landmarks, they suffer from the same problem as traditional deformation-based methods. In this work, we propose an end-to-end system with joint learning of texture rendering and geometric warping. Compared with previous works, WarpGAN models both shape and texture in the image space with flexible spatial variability, leading to better visual quality and more artistic shape exaggeration. The differences between caricature generation methods are summarized in Table 1.

3 Methodology

Let x ∈ X be images from the domain of face photos, y ∈ Y be images from the caricature domain, and s ∈ S be latent codes of texture styles. We aim to build a network that transforms a photo x into a caricature by both transferring its texture style and exaggerating its geometric shape. Our system includes one deformable generator G (see Figure 3), one style encoder E_s and one discriminator D (see Figure 4). The important notations used in this paper are summarized in Table 2.

Name | Meaning | Name | Meaning
x | real photo image | y_x | identity label of photo image
y | real caricature image | y_y | identity label of caricature image
E_c | content encoder | R | decoder
E_s | style encoder | D | discriminator
p | estimated control points | Δp | estimated residuals of p
M | number of identities | k | number of control points
Table 2: Important notations used in this paper.

3.1 Generator

The proposed deformable generator in WarpGAN is composed of three sub-networks: a content encoder E_c, a decoder R and a warp controller. Given an input image x, the encoder outputs a feature map F = E_c(x) of size h × w × c, where h, w and c are the height, width and number of channels, respectively. The decoder R takes F and a random latent style code s to render the given image in a certain style. The warp controller estimates the control points p and their residual flow Δp to warp the rendered image. An overview of the deformable generator is shown in Figure 3.

Texture Style Transfer

Since there is a large variation in the texture styles of caricature images (see Figure 1), we adopt an unsupervised method [17] to disentangle the style representation from the feature map, so that we can transfer the input photo into the different texture styles present in the caricature domain. During training, the latent style code s is sampled randomly from a normal distribution and passed as an input into the decoder R. A multi-layer perceptron in R decodes s to generate the parameters of the Adaptive Instance Normalization (AdaIN) layers in R, which have been shown to be effective in controlling visual styles [16]. The generated images with random styles are then warped and passed to the discriminator. Various styles obtained from WarpGAN can be seen in Figure 5.
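As a concrete illustration of this texture path, the NumPy sketch below shows how a latent style code can be mapped by a small MLP to AdaIN scale and bias parameters; the 8-dimensional code, layer widths and random weights are assumptions made for the example, not the trained network's values.

```python
import numpy as np

def adain(features, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization over an (h, w, c) feature map."""
    mean = features.mean(axis=(0, 1), keepdims=True)
    std = features.std(axis=(0, 1), keepdims=True)
    return gamma * (features - mean) / (std + eps) + beta

def style_mlp(style_code, w1, b1, w2, b2):
    """Two-layer MLP mapping a style code to AdaIN (gamma, beta) parameters."""
    hidden = np.maximum(style_code @ w1 + b1, 0.0)   # ReLU hidden layer
    params = hidden @ w2 + b2
    gamma, beta = np.split(params, 2)                # one scale and one bias per channel
    return gamma, beta

# Example: an 8-D style code s ~ N(0, I) modulating a 256-channel feature map.
rng = np.random.default_rng(0)
s = rng.normal(size=8)
w1, b1 = rng.normal(size=(8, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.normal(size=(64, 512)) * 0.1, np.zeros(512)
gamma, beta = style_mlp(s, w1, b1, w2, b2)
feature_map = rng.normal(size=(32, 32, 256))
styled = adain(feature_map, gamma, beta)             # decoder features after AdaIN
```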

To prevent E_c and R from losing semantic information during texture rendering, we combine the identity mapping loss [34] and the reconstruction loss [17] to regularize E_c and R. In particular, a style encoder E_s is used to learn the inverse mapping from the image space to the style space S. Given its own style code, both photos and caricatures should be reconstructed from the latent feature map:

L_idt^p = 𝔼_x [ ‖ R(E_c(x), E_s(x)) − x ‖_1 ],   (1)
L_idt^c = 𝔼_y [ ‖ R(E_c(y), E_s(y)) − y ‖_1 ].   (2)
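A minimal sketch of this regularizer follows, assuming images are floating-point arrays; the slightly perturbed copy stands in for the re-rendered image R(E_c(x), E_s(x)).

```python
import numpy as np

def l1_reconstruction_loss(image, reconstruction):
    """L1 penalty between an image and its re-rendering from its own style code."""
    return float(np.mean(np.abs(image - reconstruction)))

x = np.random.rand(256, 256, 3)                    # a photo (or caricature)
x_rec = x + 0.01 * np.random.randn(256, 256, 3)    # stand-in for R(E_c(x), E_s(x))
print(l1_reconstruction_loss(x, x_rec))
```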
Figure 4: Overview of the proposed WarpGAN.

Automatic Image Warping

The warp controller is a sub-network with two fully connected layers. Taking the latent feature map F as input, the controller learns to estimate k control points p = {p_1, p_2, …, p_k} and their residual flow Δp = {Δp_1, Δp_2, …, Δp_k}, where each p_i and Δp_i is a 2D vector in the u-v space. The points are then fed into a differentiable warping module [6]. Let p'_i = p_i + Δp_i, i = 1, 2, …, k, be the destination points. A grid sampler of the same size as the image can then be computed via thin-plate spline interpolation:

f(q) = Σ_{i=1..k} w_i φ(‖q − p'_i‖) + a^T q + b,   (3)

where the vector q denotes the u-v location of a pixel in the target image, f(q) gives the inverse mapping of that pixel into the original image, and φ(r) = r² log r is the kernel function. The parameters w_i, a and b are fitted to minimize Σ_i ‖f(p'_i) − p_i‖² together with a curvature constraint, which can be solved with a closed formula on the fly [13]. With the grid sampler constructed via the inverse mapping function f, the warped image

G(x, s)(q) = R(E_c(x), s)(f(q))   (4)

can then be generated through bi-linear sampling [19]. The entire warping module is differentiable and can be trained as part of an end-to-end system.
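To make the warping step concrete, here is a minimal NumPy sketch of a thin-plate-spline grid sampler followed by bi-linear sampling. It follows the description above but is an illustrative reimplementation, not the authors' TensorFlow module; the regularizer, the coordinate convention (u = column, v = row), the function names and the toy shapes are assumptions.

```python
import numpy as np

def tps_kernel(r):
    # Thin-plate spline radial kernel phi(r) = r^2 log r, with phi(0) = 0.
    return np.where(r > 0, (r ** 2) * np.log(r + 1e-9), 0.0)

def fit_tps(dst_pts, src_pts, reg=1e-4):
    # Fit f so that f(dst_pts[i]) ~ src_pts[i] (the inverse mapping of Eq. 3).
    # dst_pts, src_pts: (k, 2) arrays of (u, v) coordinates.
    k = dst_pts.shape[0]
    d = np.linalg.norm(dst_pts[:, None, :] - dst_pts[None, :, :], axis=-1)
    K = tps_kernel(d) + reg * np.eye(k)          # small regularizer (stands in for the curvature term)
    P = np.hstack([np.ones((k, 1)), dst_pts])    # affine terms [1, u, v]
    A = np.zeros((k + 3, k + 3))
    A[:k, :k], A[:k, k:], A[k:, :k] = K, P, P.T
    b = np.zeros((k + 3, 2))
    b[:k] = src_pts
    params = np.linalg.solve(A, b)               # closed-form solution
    return params[:k], params[k:]                # kernel weights w, affine part a

def warp_image(image, ctrl_pts, residual_flow):
    # Warp an (h, w, c) image so that ctrl_pts move to ctrl_pts + residual_flow.
    h, w = image.shape[:2]
    dst_pts = ctrl_pts + residual_flow
    w_k, a = fit_tps(dst_pts, ctrl_pts)
    # Grid sampler: for every target pixel q, compute its source location f(q).
    vv, uu = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    q = np.stack([uu.ravel(), vv.ravel()], axis=1).astype(float)
    d = np.linalg.norm(q[:, None, :] - dst_pts[None, :, :], axis=-1)
    src = tps_kernel(d) @ w_k + np.hstack([np.ones((len(q), 1)), q]) @ a
    # Bi-linear sampling at the (possibly fractional) source locations.
    u = np.clip(src[:, 0], 0, w - 1)
    v = np.clip(src[:, 1], 0, h - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, w - 1), np.minimum(v0 + 1, h - 1)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    out = (image[v0, u0] * (1 - du) * (1 - dv) + image[v0, u1] * du * (1 - dv) +
           image[v1, u0] * (1 - du) * dv + image[v1, u1] * du * dv)
    return out.reshape(h, w, -1)

# Toy usage: warp a random 256 x 256 image with 16 made-up control points.
rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
pts = rng.random((16, 2)) * 255.0            # control points p in (u, v) coordinates
flow = rng.normal(scale=5.0, size=(16, 2))   # residual flow delta-p
warped = warp_image(img, pts, flow)
```

In the actual network these operations are implemented with differentiable ops, so the gradients of the adversarial losses flow back to the warp controller.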

Figure 5: Comparison of WarpGAN and four other state-of-the-art style transfer networks. WarpGAN is able to deform the faces unlike the baselines which only change the texture. The three texture styles of WarpGAN are generated using latent style codes sampled randomly from the normal distribution.

3.2 Discriminator

Patch Adversarial Loss

We first use a fully convolutional network as a patch discriminator [17][40]. The patch discriminator is trained as a 3-class classifier to enlarge the difference between the styles of generated images, real photos and real caricatures [34]. Let D_1, D_2 and D_3 denote the predicted probabilities for the three classes of caricatures, photos and generated images, respectively. The patch adversarial losses are:

L_p(D) = −𝔼_y[log D_1(y)] − 𝔼_x[log D_2(x)] − 𝔼_{x,s}[log D_3(G(x, s))],   (5)
L_p(G) = −𝔼_{x,s}[log D_1(G(x, s))].   (6)

Identity-Preservation Adversarial Loss

Although the patch discriminator is suitable for learning visual style transfer, it fails to capture the distinguishing features of different identities. The exaggeration styles for different people could actually differ based on their facial features (see Section 4.2). To combine identity preservation and identity-specific style learning, we propose to train the discriminator D as a 3M-class classifier, where M is the number of identities. The first, second and third groups of M classes correspond to the identities of real photos, real caricatures and fake caricatures, respectively. Let y_x and y_y be the identity labels of the photos and caricatures, respectively. The identity-preservation adversarial losses for D and G are as follows:

L_g(D) = −𝔼_x[log D_{y_x}(x)] − 𝔼_y[log D_{M+y_y}(y)] − 𝔼_{x,s}[log D_{2M+y_x}(G(x, s))],   (7)
L_g(G) = −𝔼_{x,s}[log D_{M+y_x}(G(x, s))].   (8)

Here, D_i(t) denotes the predicted probability of class i given an image t. The discriminator is trained to tell the differences between real photos, real caricatures and generated caricatures, as well as the identities in the image. The generator is trained to fool the discriminator into recognizing the generated image as a real caricature of the corresponding identity. Finally, the system is optimized in an end-to-end way with the following objective functions:

min_D L_D = L_g(D) + λ_p L_p(D),   (9)
min_{G, E_s} L_G = L_g(G) + λ_p L_p(G) + λ_idt (L_idt^p + L_idt^c).   (10)
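The class layout behind the identity-preserving adversarial loss can be illustrated with a small NumPy sketch. The shapes, identity labels and random logits below are placeholders (in the real system the discriminator produces these logits from images), but the indexing mirrors Eqs. (7) and (8): classes [0, M) are photo identities, [M, 2M) are real-caricature identities, and [2M, 3M) are generated-caricature identities.

```python
import numpy as np

def softmax_xent(logits, target):
    """Cross-entropy of a single target class for one vector of logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def d_loss(logits_photo, id_photo, logits_cari, id_cari, logits_fake, id_fake, M):
    """Discriminator loss (Eq. 7): classify each input into its identity-domain class."""
    return (softmax_xent(logits_photo, id_photo)            # real photo of identity id_photo
            + softmax_xent(logits_cari, M + id_cari)        # real caricature of identity id_cari
            + softmax_xent(logits_fake, 2 * M + id_fake))   # generated caricature of identity id_fake

def g_loss(logits_fake, id_fake, M):
    """Generator loss (Eq. 8): the fake should look like a real caricature of the same identity."""
    return softmax_xent(logits_fake, M + id_fake)

# Example with M = 4 identities, so the discriminator outputs 3 * M = 12 logits.
rng = np.random.default_rng(1)
M = 4
print(d_loss(rng.normal(size=3 * M), 2, rng.normal(size=3 * M), 1,
             rng.normal(size=3 * M), 2, M))
print(g_loss(rng.normal(size=3 * M), 2, M))
```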

Figure 6: Results of WarpGAN variants trained without certain loss functions (identity-preservation adversarial loss, patch adversarial loss, or identity mapping loss), compared with the input and the model trained with all losses.

4 Experiments

Figure 7: A few typical exaggeration styles learned by WarpGAN (bigger eyes, smaller eyes, longer face, shorter face, bigger mouth, bigger chin, bigger forehead). The first row shows hand-drawn caricatures that exhibit these exaggeration styles; the second and third rows show the input images and the generated images of WarpGAN with the corresponding exaggeration styles. All the identities are from the testing set.

Dataset

We use images from a public domain dataset, WebCaricature [18] (https://cs.nju.edu.cn/rl/WebCaricature.htm), to conduct the experiments. The dataset consists of 6,042 caricatures and 5,974 photos from 252 identities. We align all the images with five landmarks (left eye, right eye, nose, mouth left, mouth right) using the landmarks provided by the WebCaricature [18] protocol. Since the protocol does not provide the locations of the eye centers, we estimate them by taking the average of the corresponding eye corners. The images are then aligned through a similarity transformation using the five landmarks and are resized to 256 × 256. We randomly split the dataset into a training set of 126 identities and a testing set of 126 identities. All the testing images in this paper are from identities in the testing set.
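The alignment step can be sketched as follows: eye centers are taken as the mean of the eye-corner landmarks, and a least-squares similarity transform (Umeyama's method) maps the five points to a canonical template. The corner coordinates and template positions below are made-up placeholders, not the protocol's actual values.

```python
import numpy as np

def similarity_transform(src, dst):
    """Umeyama-style least-squares similarity transform mapping src -> dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    scale = np.trace(D @ np.diag(S)) / src_c.var(0).sum()
    R = U @ D @ Vt
    t = mu_d - scale * R @ mu_s
    return scale * R, t

# Five landmarks: left-eye center, right-eye center, nose, mouth-left, mouth-right.
left_corners = np.array([[94.0, 120.0], [126.0, 120.0]])
right_corners = np.array([[170.0, 120.0], [202.0, 120.0]])
landmarks = np.array([left_corners.mean(0), right_corners.mean(0),
                      [148.0, 165.0], [118.0, 200.0], [178.0, 200.0]])
template = np.array([[89.0, 96.0], [167.0, 96.0], [128.0, 142.0],
                     [99.0, 186.0], [157.0, 186.0]])
A, t = similarity_transform(landmarks, template)
aligned = landmarks @ A.T + t    # the image is warped with the same transform
```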

Training Details

We use ADAM optimizers in TensorFlow for the whole network. Each mini-batch consists of a random pair of one photo and one caricature. The learning rate is decayed linearly over the course of training. The loss weights λ_p and λ_idt and the number of control points k = 16 are set empirically. We conduct all experiments using TensorFlow r1.9 and one GeForce GTX 1080 Ti GPU; generating one caricature image on this GPU takes less than a second on average. The details of the architecture are provided in Appendix A.

4.1 Comparison to State-of-the-Art

We qualitatively compare our caricature generation method with four style transfer approaches: CycleGAN [40], StarGAN [5], Unsupervised Image-to-Image Translation (UNIT) [27], and Multimodal UNsupervised Image-to-image Translation (MUNIT) [17]. (We train the baselines using their official implementations.) Among the baseline style transfer networks, CycleGAN and MUNIT demonstrate the most visually appealing texture styles (see Figure 5). StarGAN and UNIT produce very photo-like images with minimal or erroneous changes in texture. Since all of these networks focus only on transferring texture styles, the baseline architectures fail to deform the faces into caricatures, unlike WarpGAN. The other issue with the baseline methods is that they do not have a module for warping the images; therefore, they try to compensate for deformations in the face using texture alone. Due to the complexity of this task, they become increasingly difficult to train and usually end up generating collapsed images.

4.2 Ablation Study

To analyze the function of the different modules in our system, we train three variants of WarpGAN for comparison by removing the identity-preservation adversarial loss, the patch adversarial loss, and the identity mapping loss, respectively. Figure 6 compares these variants with the model trained with all loss functions. Without the proposed identity-preservation adversarial loss, the discriminator only focuses on local texture styles; the geometric warping therefore fails to capture personal features and is close to random. Without the patch adversarial loss, the discriminator mainly focuses on facial shape and the model fails to learn diverse texture styles. The model without the identity mapping loss still performs well in terms of texture rendering and shape exaggeration; we keep this loss to improve the visual quality of the generated images.

4.3 Shape Exaggeration Styles

Caricaturists usually define a set of prototypes of face parts and have certain modes of exaggerating them [21]. In WarpGAN, we do not adopt any method to exaggerate facial regions explicitly; instead, we introduce the identity-preservation constraint as part of the adversarial loss. This forces the network to exaggerate the faces to be more distinguishable from other identities and implicitly encourages the network to learn different exaggeration styles for people with different salient features. Some example exaggeration styles learned by the network are shown in Figure 7.

4.4 Customizing the Exaggeration

Although WarpGAN is trained as a deterministic model in terms of warping, we can introduce a parameter α during deployment to allow customization of the exaggeration extent. Before warping, the residual flow Δp of the control points is scaled by α to control how much the face shape will be exaggerated. The results are shown in Figure 8. Smaller values of α indicate less warping, and α = 1.0 leads to the default output of WarpGAN. When α is increased to 2.0, the images are warped with twice the default degree, but the resulting images still appear reasonable, with only the distinguishing facial features exaggerated. Since the texture styles are learned in a disentangled way, WarpGAN can also generate various texture styles: Figure 5 shows results from WarpGAN with three randomly sampled styles.
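This deployment-time knob amounts to a one-line change before the warp; the control points and residuals below are arbitrary example values, not network predictions.

```python
import numpy as np

ctrl_pts = np.array([[100.0, 120.0], [150.0, 118.0]])   # example control points p in (u, v)
residual_flow = np.array([[4.0, -2.0], [-3.0, 1.0]])    # example predicted residual flow
alpha = 2.0                                             # alpha = 1.0 reproduces the default output
dst_pts = ctrl_pts + alpha * residual_flow              # destinations fed to the TPS warp
print(dst_pts)
```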

Figure 8: The result of changing the amount of exaggeration by scaling the residual flow Δp with an input parameter α.

4.5 Quantitative Analysis

Face Recognition

Even with face deformation, the identity in the caricature needs to be preserved. In order to quantify identity preservation for caricatures generated by WarpGAN, we evaluate automatic face recognition performance using two state-of-the-art face matchers: (1) a commercial-off-the-shelf matcher (COTS), which uses a convolutional neural network for face recognition, and (2) the open-source SphereFace [28] matcher.

 

Method | COTS | SphereFace [28]
Photo-to-Photo | 94.81 ± 1.22% | 90.78 ± 0.64%
Hand-drawn-to-Photo | 41.26 ± 1.16% | 45.80 ± 1.56%
WarpGAN-to-Photo | 79.00 ± 1.46% | 72.65 ± 0.84%

Table 3: Rank-1 identification accuracy for three different matching protocols using two state-of-the-art face matchers, COTS and SphereFace [28].

An identification experiment is conducted in which one photo of each identity is kept in the gallery, while all remaining photos, all hand-drawn caricatures, or all synthesized caricatures of the same identity are used as probes. We evaluate the Rank-1 identification accuracy using 10-fold cross-validation and report the mean and standard deviation across the folds in Table 3. We find that the generated caricatures can be matched to real face images with a higher accuracy than hand-drawn caricatures. We also observe the same trend for both matchers, which suggests that recognition on synthesized caricatures is consistent and matcher-independent.
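For reference, here is a generic sketch of the rank-1 identification protocol described above. The embedding dimensionality, the probe count and the random features are placeholders standing in for the COTS and SphereFace representations, and the 10-fold splitting is omitted.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Rank-1 identification: each probe is assigned the identity of its nearest gallery entry."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    scores = p @ g.T                                  # cosine similarities
    predicted = gallery_ids[np.argmax(scores, axis=1)]
    return float(np.mean(predicted == probe_ids))

rng = np.random.default_rng(2)
gallery = rng.normal(size=(126, 512))                 # one gallery photo per test identity
gallery_ids = np.arange(126)
probes = rng.normal(size=(500, 512))                  # remaining photos / caricatures as probes
probe_ids = rng.integers(0, 126, size=500)
print(rank1_accuracy(gallery, gallery_ids, probes, probe_ids))
```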

Perceptual Study

We conducted two perceptual studies by recruiting five caricature artists, experts in their field, to compare hand-drawn caricatures with images synthesized by our baselines and by WarpGAN. A caricature is generated from a random photo of each of the 126 subjects in the WebCaricature testing set; a subset of these is used in the first perceptual study and another subset in the second. The experts have no knowledge of the source of the caricatures and rely solely on their perceptual judgment.

The first study assesses the overall similarity of the generated caricatures to hand-drawn ones. Each caricature expert was shown a face photograph of a subject along with three corresponding caricatures generated by CycleGAN, MUNIT, and WarpGAN, respectively. The experts then ranked the three generated caricatures from “most similar to a hand-drawn caricature” to “least similar to a hand-drawn caricature”. We find that caricatures generated by WarpGAN are ranked as the most similar to a hand-drawn caricature far more often than those generated by CycleGAN and MUNIT.

 

Method | Visual Quality | Exaggeration
Hand-Drawn | 7.70 | 7.16
CycleGAN [40] | 2.43 | 2.27
MUNIT [17] | 1.82 | 1.83
WarpGAN | 5.61 | 4.87

Table 4: Average perceptual scores from 5 caricature experts for visual quality and exaggeration extent. Scores range from 1 to 10.

In the second study, the experts scored the generated caricatures according to two criteria: (i) visual quality and (ii) whether the caricatures are exaggerated in a proper manner, i.e., only prominent facial features are deformed. The experts were shown three photographs of a subject along with a caricature image that is either (i) a real hand-drawn caricature or (ii) generated by one of the three automatic generation methods. From Table 4, we find that WarpGAN receives the best perceptual scores among the three automatic generation methods. Even though hand-drawn caricatures are rated higher, WarpGAN represents a substantial step forward in automatic caricature generation compared to the state-of-the-art.

Figure 9: Example images generated by the same WarpGAN model with warping only, with texture rendering only, and with both.

5 Discussion

Joint Rendering and Warping Learning

Unlike other visual style transfer tasks [34][40][17], transforming photos into caricatures involves both texture differences and geometric transitions. Texture is important for exaggerating local fine-grained features, such as the depth of wrinkles, while geometric deformation allows exaggeration of global features such as the face shape. Conventional style transfer networks [34][40][17] aim to reconstruct an image from the feature space using a decoder network. Because the decoder is a stack of nonlinear local filters, it is intrinsically inflexible in terms of spatial variation, and the decoded images usually suffer from poor quality and severe information loss when there is a large geometric discrepancy between the input and output domains. On the other hand, warping-based methods are by nature unable to change the content and fine-grained details. Therefore, both the style transfer and the warping module are necessary parts of our adversarial learning framework. As shown in Figure 6, without either module the generator is unable to close the gap between photos and caricatures, the balance of the competition between generator and discriminator is broken, and the results collapse.

Identity-Preservation Adversarial Loss

The discriminator in conventional GANs is usually trained as a binary [40] or ternary classifier [34], with each class representing a visual style. However, in our work, we found that because of the large variation of shape exaggeration styles in caricatures, treating all caricatures as one class in the discriminator would confuse the generator, as shown in Figure 6. We observe that caricaturists tend to give similar exaggeration styles to the same person. Therefore, we treat each identity-domain pair as a separate class to reduce the difficulty of learning and to encourage identity preservation after warping.

6 Conclusion

In this paper, we proposed a new method for caricature generation, namely WarpGAN, that addresses both style transfer and face deformation in a joint learning framework. Without explicitly requiring any facial landmarks, the identity-preserving adversarial loss introduced in this work learns to capture caricature artists' style while preserving the identity in the generated caricatures. We evaluated the generated caricatures by matching synthesized caricatures to real photos and observed that the recognition accuracy is higher than that of caricatures drawn by artists. Moreover, five caricature experts suggest that caricatures synthesized by WarpGAN are not only pleasing to the eye, but are also realistic, with only the appropriate facial features exaggerated, and that WarpGAN indeed outperforms the state-of-the-art networks.

References

Appendix A Architecture

Our network architecture is modified from MUNIT [17]. Let c7s1-k denote a 7×7 convolutional layer with k filters and stride 1. dk denotes a downsampling convolutional layer with k filters and stride 2. Rk denotes a residual block that contains two convolutional layers. uk denotes an upsampling layer followed by a convolutional layer with k filters and stride 1. fck denotes a fully connected layer with k filters. We apply Instance Normalization (IN) [35] to the content encoder and Adaptive Instance Normalization (AdaIN) [16] to the decoder. No normalization is used in the style encoder. We use Leaky ReLU with slope 0.2 in the discriminator and ReLU activation everywhere else. The architectures of the different modules are as follows:

  • Style Encoder:
    c7s1-64,d128,d256, R256,R256,R256

  • Content Encoder:
    c7s1-64,d128,d256, R256,R256,R256

  • Decoder:
    R256,R256,R256,u128,u64,c7s1-3

  • Discriminator:
    d32,d64,d128,d256,d512,fc512,fc3M

A separate convolutional branch is attached to the last convolutional layer of the discriminator to output the logits for the patch adversarial losses. The style decoder (the multi-layer perceptron) has two hidden fully connected layers without normalization, and the warp controller has one hidden fully connected layer with Layer Normalization [1]. The latent style code s is a fixed-length vector sampled from N(0, I).
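For convenience, the layer-spec strings above can be expanded mechanically. The parser below is an illustrative helper, not part of the released code; it deliberately records only the filter counts for the dk, Rk and uk tokens, since their kernel sizes are not restated here.

```python
import re

def parse_spec(spec):
    """Parse a comma-separated layer spec such as 'c7s1-64,d128,R256,u64,fc512'."""
    layers = []
    for token in spec.split(","):
        token = token.strip()
        if m := re.fullmatch(r"c(\d+)s(\d+)-(\d+)", token):
            layers.append(("conv", {"kernel": int(m[1]), "stride": int(m[2]),
                                    "filters": int(m[3])}))
        elif m := re.fullmatch(r"d(\d+)", token):
            layers.append(("down_conv", {"filters": int(m[1])}))
        elif m := re.fullmatch(r"R(\d+)", token):
            layers.append(("res_block", {"filters": int(m[1])}))
        elif m := re.fullmatch(r"u(\d+)", token):
            layers.append(("up_conv", {"filters": int(m[1])}))
        elif m := re.fullmatch(r"fc(\w+)", token):
            layers.append(("fc", {"units": m[1]}))   # e.g. '512' or '3M'
        else:
            raise ValueError(f"unknown layer token: {token}")
    return layers

for name, spec in [("content encoder", "c7s1-64,d128,d256,R256,R256,R256"),
                   ("decoder", "R256,R256,R256,u128,u64,c7s1-3"),
                   ("discriminator", "d32,d64,d128,d256,d512,fc512,fc3M")]:
    print(name, parse_spec(spec))
```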

Appendix B Additional Baselines

In the main paper, we compared WarpGAN with state-of-the-art style transfer networks as baselines. Here, we compare WarpGAN with other caricature generation works [25, 29, 15, 39, 24, 4]. Since these methods do not release their code and use different testing images, we crop the images from their papers and compare with them one by one. All the baseline results are also taken from their original papers. The results are shown in Figure 10.

Appendix C Transformation Methods

To see the advantage of the proposed control-point estimation for automatic warping, we train three variants of our model by replacing the warping method with (1) a projective transformation, (2) dense deformation and (3) landmark-based warping. For the projective transformation, the warp controller outputs the eight parameters of the transformation matrix. For dense deformation, the warp controller outputs a coarse deformation grid, which is further interpolated to full resolution for grid sampling. For landmark-based warping, we use the landmarks provided by Dlib (http://dlib.net/face_landmark_detection.py.html) and the warp controller only outputs the residual flow. As shown in Figure 12, projective transformation is too limited to generate artistic caricatures, while dense deformation is so unconstrained that it is difficult to train. Landmark-based warping yields reasonable results, but it is limited by the landmark detector. In comparison, our method does not require any domain knowledge, has few limitations, and leads to visually satisfying warping results.
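To contrast the parameterizations, a projective warp is fully determined by eight parameters. The sketch below, with arbitrary example values, applies such an 8-parameter homography to a few points; because a single matrix acts on the whole image, it cannot exaggerate one facial part independently of another, unlike control-point warping.

```python
import numpy as np

def apply_homography(points, params8):
    """Apply a projective transform whose 3x3 matrix has H[2, 2] fixed to 1."""
    H = np.append(params8, 1.0).reshape(3, 3)
    pts_h = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return pts_h[:, :2] / pts_h[:, 2:3]    # divide by the homogeneous coordinate

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
params = np.array([1.05, 0.02, 0.0, -0.01, 0.98, 0.0, 0.0001, 0.0])   # example values
print(apply_homography(pts, params))
```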

Appendix D More Results

Ablation Study

We show more results of the ablation study in Figure 11. The results are consistent with those in the main paper: (1) the joint learning of texture rendering and warping is crucial for generating realistic caricature images, and (2) without the patch adversarial loss or the identity-preservation adversarial loss, the model cannot learn to generate caricatures with diverse texture styles and shape exaggeration styles.

Different Texture Styles

More results of texture style control are shown in Figure 13. Five latent style codes are randomly sampled from the normal distribution N(0, I). Images in the same column of Figure 13 are generated with the same style code.

Selfie Dataset

To test the performance of our model in more application scenarios, we use the public Selfie dataset (http://crcv.ucf.edu/data/Selfie/) [20] for a cross-dataset evaluation. The dataset consists of selfie photos crawled from the Internet. Unlike our training dataset (WebCaricature), the identities in this dataset are not restricted to celebrities, and the visual styles of these images differ from those of our training images. The results are shown in Figure 14.

Figure 10: Comparison with previous works on caricature generation: Liang et al. [25], Mo et al. [29], Han et al. [15], Zheng et al. [39], CariGAN [24] and CariGANs [4]. In each cell, the left and middle images are the input and result images taken from the baseline paper, respectively; the right images are the results of WarpGAN.

Figure 11: More results of the ablation study. Input images are shown in the first column. The subsequent columns show the results of models trained without texture rendering, without warping, without each of the loss terms, and with all components. The texture style codes are randomly sampled from the normal distribution.

Figure 12: Different transformation methods. Input images are shown in the first column. The next four columns show the results and the transformation visualizations of four models trained with projective transformation, dense deformation, landmark-based warping, and our control-point estimation, respectively. The landmark-based model uses landmarks detected by Dlib. Texture rendering is hidden here for clarity.

Figure 13: Results of five different texture styles. Input images are shown in the first column. The subsequent five columns show the results of WarpGAN using five style codes sampled randomly from the normal distribution. All the images in the same column are generated with the same latent style code.


Figure 14: Example results on the Selfie dataset. This is a cross-dataset evaluation and no training is involved. In each pair, the left image is the input and the right image is the output of WarpGAN with a random texture style.