3D-CariGAN: An End-to-End Solution to 3D Caricature Generation from Face Photos

03/15/2020 ∙ by Zipeng Ye, et al. ∙ Tianjin University, Cardiff University, USTC

Caricature is an artistic style of depicting human faces that has attracted considerable research in computer vision. All existing 3D caricature generation methods require caricature-related information as input, e.g., a caricature sketch or a 2D caricature, which is difficult for non-professional users to provide. In this paper, we propose an end-to-end deep neural network model that generates a high-quality 3D caricature from a simple face photo. The most challenging issue is that the source domain of face photos (characterized by normal 2D faces) differs significantly from the target domain of 3D caricatures (characterized by exaggerated 3D face shapes and textures). To address this challenge, we (1) build a large dataset of 6,100 3D caricature meshes and use it to establish a PCA model of the 3D caricature shape space, and (2) detect landmarks in the input face photo and use them to set up correspondence between the 2D caricature and the 3D caricature shape. Our system automatically generates high-quality 3D caricatures. Because users often want to control the output in a simple and intuitive way, we further introduce an easy-to-use interactive control based on three horizontal lines and one vertical line. Experiments and user studies show that our system is easy to use and can generate high-quality 3D caricatures.




1 Introduction

A caricature is a rendered image that uses exaggeration, simplification and abstraction to express the most distinctive characteristics of a person [25]. Caricatures are also used to express sarcasm and humor about political and social issues. Caricatures drawn by artists are 2D images. Although widely used, 2D caricatures are insufficient for many applications, such as animation, virtual reality and 3D printing, where 3D information is essential (Figure 1). 3D caricatures are suited for these applications, but they can only be created by artists with 3D modeling skills, and they are tedious and time-consuming to produce. Motivated by this, we aim to automatically generate 3D caricatures.

Figure 1: Physical prototypes of 3D caricatures generated by our proposed 3D-CariGAN model.

To the best of our knowledge, our work is the first to address the automatic generation of 3D caricatures from photos. It is an extreme cross-domain task: the input is a normal 2D image and the output is an exaggerated 3D mesh, so the two domains differ in both form and style. In the literature, efforts have been made on subproblems of this task. A recently popular problem is translating photos into 2D caricatures [3, 32, 19, 5]; however, such image translation mainly focuses on warping and texture stylization, which only works for 2D images. Wu et al. [31] propose a method to reconstruct 3D caricatures from 2D caricatures, but both its input and output are in the caricature style, and its focus is the mapping from 2D to 3D. Han et al. [10] propose a method for generating 3D caricatures from sketches, but the exaggerated deformation is dictated by the sketches, so the method cannot be applied to normal photos.

In summary, none of the above three types of methods can be directly adapted for automatic generation of 3D caricatures from photos. A straightforward baseline can be created by combining existing methods [31, 5, 7] to first generate 2D caricatures from photos and then 3D caricatures from the 2D caricatures. However, this is time-consuming and tends to lose information in the intermediate steps, resulting in dissimilarity between the input and the output. In this paper, we present an end-to-end method that is much more efficient, preserves information as much as possible, and allows intuitive control.

Another line of work related to ours is 3D face reconstruction from photos, which has been widely studied in recent years. Parametric models are popular for representing faces and heads due to the high complexity of faces and their strong distribution constraints. Some parametric models, such as the well-known 3D Morphable Model (3DMM) [1], are linear models based on principal component analysis (PCA) of normal 3D faces, and are both useful and effective. However, such models do not work for caricature faces due to their limited extrapolation capability [31]. In this paper, we build a PCA model for 3D caricature meshes, so that generating a 3D caricature model can be regarded as interpolation in our PCA space, making the problem much more tractable.

Training datasets are indispensable for learning to transform photos into 3D caricatures. For the photo domain, we use the CelebAMask-HQ dataset [18], which contains 30,000 portrait photos. For the 3D caricature domain, we are not aware of any existing large-scale 3D caricature dataset, so we create our own 3DCari dataset, which contains 6,100 3D caricature meshes sharing the same topology. The two datasets are unpaired because it is difficult to obtain a corresponding 3D caricature for a photo.

In this paper, we present an end-to-end method named 3D-CariGAN for automatic generation of 3D caricatures from photos, which achieves real-time performance and allows users to interactively adjust the caricature facial shape through a simple and effective user interface. Although 3D-CariGAN can automatically generate 3D caricatures from 2D photos, users may have their own preference regarding how the 3D caricature should look. We therefore introduce a simple yet effective user control, which is treated as part of the neural network input, together with a dedicated user control loss that adjusts the overall 3D caricature shape according to the user input. The control is intended to adjust the overall 3D shape, while the detailed geometry is still determined by the input photo to ensure recognizability. Extensive experiments and user studies show that our method produces high-quality 3D caricatures from 2D photos in real time, significantly faster and with better quality than the baseline method. Our simple interaction is easier to use than sketching and lets the user adjust the generated result in real time.

The contributions of this paper include:

  • We create a large dataset of 3D caricatures and, based on it, build a novel PCA-based linear morphable model for 3D caricature shapes.

  • We propose the first method to automatically generate a 3D caricature, including a textured 3D mesh, from a normal facial image. Our end-to-end solution addresses cross-domain and cross-style challenges (normal photo to caricature, and 2D to 3D) by utilizing the caricature morphable model and introducing a novel cross-domain character loss.

2 Related Work

2D Caricature. Many works have studied generating 2D caricatures from photos. The main differences between photos and caricatures are shape and style. Some methods [3, 32] focus on geometric exaggeration and other works [19, 9, 20] focus on stylization. CariGANs [5] proposes a method that combines these two aspects to generate 2D caricatures, consisting of two networks: CariGeoGAN for geometric exaggeration and CariStyGAN for stylization. CariStyGAN disentangles a photo into the style component and the content component, and then replaces the style component by that of a reference or a sample. CariGeoGAN translates the facial landmarks of a photo from a normal shape to those of an exaggerated shape, which are used to warp the image. WarpGAN [28] generates caricatures by warping and stylization. It extracts the content component from the photo, takes a sample in the style latent space, and then transfers the style by combining the content component and sampled style component, which is similar to CariStyGAN. It warps a photo into a caricature while preserving identity by predicting a set of control points. The stylization in these methods can be adapted to stylize textures for 3D caricatures, but geometric exaggeration for 3D caricatures is more complicated, which is a major focus of our paper.

Face Reconstruction. Face reconstruction, i.e., generating 3D faces from photos, is well studied in computer vision, with methods based on monocular RGB and RGB-D input. The reader is referred to [34] for a comprehensive survey and the references therein. Parametric models are popular for representing faces and heads due to the complexity of faces and their strong distribution constraints. 3D Morphable Models (3DMM) [1, 23, 2, 8] and multi-linear models [30, 4] are the two major types of parametric models. 3DMM is a PCA representation of faces covering shapes and textures, while multi-linear models apply a multi-linear tensor decomposition over attributes such as identity and expression. Parametric models provide a strong constraint that ensures the plausibility of reconstructed 3D face shapes while substantially reducing the dimensionality of the generation space, so these representations are widely used for face reconstruction by regressing their parameters. Recent works [15, 29, 16] use convolutional neural networks (CNNs) to regress the parameters for face reconstruction. However, these methods only work on normal photos and generate normal 3D faces. Likewise, existing parametric models lack the extrapolation capability to represent 3D caricature faces [31]. This motivates us to build a new 3D caricature parametric model and a new neural network for unpaired cross-domain translation.

3D Caricatures. Although generating 3D caricatures from 2D caricatures or normal photos resembles face reconstruction, only a few works tackle automatic 3D caricature generation. Sela et al. [26] present a method for directly exaggerating 3D face models, which locally amplifies the area of a given 3D face model based on its Gaussian curvature. A deep-learning-based sketching system [10] supports interactive modeling of 3D caricature faces by drawing facial contours. The method of Clarke et al. [6] generates a 3D caricature from a facial photograph and a corresponding 2D hand-drawn caricature by capturing the artistic deformation style; however, it requires paired data as input, which is difficult to obtain. An optimization-based method [31] reconstructs 3D caricatures from 2D caricatures by formulating 3D caricatures as deformed 3D faces. To support exaggeration, it uses an intrinsic deformation representation with extrapolation capability, turning 3D caricature reconstruction into an optimization problem with facial landmark constraints. However, all of these methods rely on 2D sketches or 2D caricatures, which already encode how to exaggerate and deform; normal 2D face photos carry no such information. Our work addresses the new challenge of automatically transforming normal 2D face photos into 3D caricatures.

Figure 2: Illustration of our pipeline, which has three components: 3D-CariGAN for transforming photos into 3D caricature meshes, CariStyGAN [5] for producing 2D stylized caricature images, and landmark-based texture mapping for generating textured 3D caricatures. Our system also supports intuitive user adjustment of the 3D caricature shape.

3 Our Method

The pipeline of our method is shown in Figure 2. A 3D caricature consists of three components: a 3D mesh, a 2D texture image and a texture mapping (represented by texture coordinates). Automatic generation of 3D caricatures from photos is decoupled into three steps: (1) we develop 3D-CariGAN to infer 3D caricature meshes from photos; (2) we use CariStyGAN [5] to transfer photos into 2D caricatures that serve as texture images; and (3) to map the stylized texture onto the 3D caricature mesh, we reconstruct a normal 3D face with the same mesh connectivity as the 3D caricature mesh and with landmark positions consistent with the photo when projected onto it. This lets us transfer texture coordinates directly from the normal 3D face to the generated 3D caricature mesh.

3.1 Representation of 3D Caricatures

A head mesh has tens of thousands of vertices. An intuitive approach is to directly predict vertex coordinates with a neural network [24]. However, this requires a very large training set and may produce noisy meshes due to insufficient constraints, especially in our cross-domain scenario. Three examples are shown in Figure 7, where we change our pipeline to directly generate the mesh vertices of 3D caricatures; the resulting meshes are rather noisy, indicating insufficient constraints given the amount of training data we have. In face reconstruction, PCA models such as 3DMM address similar issues by providing a strong prior and substantially reducing the dimensionality, which makes learning easier. However, all existing PCA models are only good for interpolation in the shape space of normal 3D faces and do not work well for extrapolation into the 3D caricature shape space. To show this, we use a PCA representation of normal 3D faces from FaceWarehouse [4] to represent caricature faces; the recovered caricatures have substantial distortions, as illustrated in Figure 3. It is therefore necessary to build a parametric model specifically for 3D caricatures.

Figure 3: Using the PCA representation of normal faces to represent caricatures. For each group, the left shows the input 3D caricature model and the right shows its corresponding model represented by the normal PCA representation. These examples show that the representation does not have sufficient extrapolation capability to faithfully reconstruct 3D caricatures. This motivates us to create a PCA model for 3D caricatures.
Figure 4: Illustration of our PCA-based 3D caricature representation. To build it, we collect 3D caricatures by detecting facial landmarks on 2D caricature images (with manual correction where needed) and reconstructing 3D caricature meshes from them; applying PCA, we represent each mesh as a low-dimensional coefficient vector.

In this paper, we build a PCA model for 3D caricature meshes in the following steps, as illustrated in Figure 4. Since we are not aware of any existing large 3D caricature dataset, we first collect hand-drawn portrait caricature images from Pinterest.com and the WebCaricature dataset [13, 14], extract their facial landmarks with a landmark detector [17], and manually correct them where needed. An optimization-based method [31] is then used to generate 3D caricature meshes from the facial landmarks of the 2D caricatures; all generated meshes share the same topology. We normalize the vertex coordinates by translating the center of each mesh to the origin and scaling the meshes to the same size. A 3D caricature mesh can then be represented as a long vector $\mathbf{m}$ containing the coordinates of all its vertices. We apply PCA to all the caricature mesh vectors and obtain the mean head $\bar{\mathbf{m}}$ and the dominant components. A new caricature mesh can then be represented as

$$\mathbf{m} = \bar{\mathbf{m}} + \mathbf{B}\boldsymbol{\alpha},$$

where $\mathbf{B}$ is the matrix whose columns are the components and $\boldsymbol{\alpha}$ is a low-dimensional vector that compactly represents the 3D caricature. The PCA model has only one hyperparameter, i.e., the number of components. Balancing the amount of retained information against the number of components, we choose the number of components such that the sum of explained variance ratios is close to 1, which means the representation retains almost all the information of the 6,100 meshes. We refer to this representation as 3D Caricature PCA (3DCariPCA) and use it in our translation network.

PCA only involves matrix multiplication and addition, which can be implemented efficiently on the GPU. These operations are differentiable, so losses defined on the mesh can be computed and their gradients back-propagated through the PCA layer. The PCA representation is therefore well suited to our neural-network-based method.
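As a concrete illustration, the construction of 3DCariPCA can be sketched with an off-the-shelf PCA implementation. The data below are random stand-ins, and the component count of 64 is an illustrative assumption, not the paper's setting:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: each row is one caricature mesh flattened
# into a 3*V coordinate vector (the real dataset has 6,100 meshes that
# share a common topology; here we use 500 random 100-vertex "meshes").
rng = np.random.default_rng(0)
meshes = rng.normal(size=(500, 300))

# Fit the linear morphable model: m ≈ mean + B @ alpha
pca = PCA(n_components=64)            # component count is the one hyperparameter
alphas = pca.fit_transform(meshes)    # compact per-mesh codes (alpha)
recon = pca.inverse_transform(alphas) # reconstruct full vertex coordinates
```

`inverse_transform` computes exactly the mean-plus-components reconstruction, so mesh losses stay differentiable when the same two linear operations are re-implemented as network layers.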

3.2 A Simple Interactive Control

A fully automatic method is convenient, but users are sometimes unsatisfied with its results, so allowing users to adjust the generated result is necessary. If the user input constrains too many facial details, the result becomes dissimilar to the input photo, which is undesirable. Sketching, for example, is a traditional interaction that is reasonably efficient and user-friendly, but it over-constrains details and requires some artistic skill. We instead propose a new way to interactively adjust the facial shape while keeping facial features: several horizontal and vertical lines that depict the facial shape. The number of lines balances simplicity against flexibility; since overly strange caricature faces are undesirable, we use only three horizontal lines and one vertical line. The three horizontal lines are located at the heights of the eyes, nose and mouth, and the vertical line is the central axis of the face, as illustrated in Figure 5(a).

Figure 5: Illustration of our user control for generated 3D caricatures. (a) A few lines can depict the contour of a face. (b) We use three horizontal lines (at the eyes, nose and mouth) and one vertical line (the central axis). (c) The four lines are defined on a 3D caricature mesh using marked points. (d) The user adjusts the mesh by changing the lengths of the lines.
(a) Training
(b) Training
Figure 6: Architecture of 3D-CariGAN. It includes two generators (photo to 3D caricature, and 3D caricature to photo) and two discriminators (one for photos, one for 3D caricatures); the adversarial and cycle structure follows CycleGAN. (a) Two novel losses are introduced: the character loss measures the difference between the 2D landmarks of the photo and those of the caricature projected to 2D, ensuring that the character is preserved, and the user loss provides a simple and intuitive user control by measuring the difference between the facial shape vector (FSV) of the input (randomly sampled from the normal distribution of FSVs during training) and that of the generated 3D mesh. (b) The FSV of the input mesh, instead of a random vector, is used as the input for recovering faces without deformation.

We use ratios between the four lines as input to 3D-CariGAN because the absolute scale of a face is unknown: we normalize the lengths of the three horizontal lines by the length of the vertical line, yielding three ratios. For a given 3D caricature, we mark points on the mesh and compute the Euclidean lengths of the four lines, denoted $l_e$, $l_n$, $l_m$ (eyes, nose, mouth) and $l_v$ (vertical axis), as shown in Figure 5(c). We then define the 3-dimensional facial shape vector (FSV) that represents the 3D caricature shape as

$$s(M) = \left( \frac{l_e}{l_v},\ \frac{l_n}{l_v},\ \frac{l_m}{l_v} \right),$$

where $M$ is a mesh or the PCA representation of a mesh and the four lines are measured on $M$.

Our generator network takes the FSV as part of its input, which allows several modes of generation. For interactive control, the FSV is obtained from a simple user interface in which users drag handles to change the lengths of the lines. For fully automatic generation from a photo, we detect facial landmarks with a landmark detector [17] and compute the lengths of the four lines from the landmarks, which provides the starting point for user adjustment. Alternatively, we compute the mean $\mu$ and covariance $\Sigma$ of the FSV over the training set, which allows random generation of 3D caricatures with different facial shapes by sampling the FSV from $\mathcal{N}(\mu, \Sigma)$.
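The FSV computation from the four line lengths can be sketched as follows; the endpoint naming is illustrative:

```python
import numpy as np

def facial_shape_vector(eyes, nose, mouth, vertical):
    """Compute the 3-D FSV from endpoint pairs of the three horizontal
    lines (eyes, nose, mouth) and the vertical central-axis line."""
    def length(seg):
        a, b = np.asarray(seg[0], float), np.asarray(seg[1], float)
        return float(np.linalg.norm(a - b))
    lv = length(vertical)  # normalize by the vertical line length
    return np.array([length(eyes) / lv,
                     length(nose) / lv,
                     length(mouth) / lv])
```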

3.3 Generating Caricature Meshes from Photos

We now describe our network architecture for translating a 2D photo into a 3D caricature. This is an extreme cross-domain task: the input is a normal face image and the output is an exaggerated 3D mesh, so the two domains differ in both form and style. We train on the CelebAMask-HQ dataset [18], which contains 30,000 portrait photos, and on our 3DCari dataset of full-head 3D caricatures; the two datasets are naturally unpaired.

The structure of 3D-CariGAN is shown in Figure 6. Inspired by CycleGAN [33], our architecture involves bidirectional mapping with a cycle-consistency loss, but we make significant changes to cope with the distinct forms of the two domains and introduce further losses to constrain the mapping. Denote by $\mathcal{P}$ the domain of photos, $\mathcal{S}$ the domain of facial shape vectors and $\mathcal{M}$ the domain of PCA representations of meshes. The input to our network is a normal face photo $p \in \mathcal{P}$ and a facial shape vector $u \in \mathcal{S}$ (Section 3.2), and the output is a 3D caricature mesh in the 3DCariPCA representation, which makes learning more efficient and incorporates the 3D caricature constraint to improve generation results.

Our network consists of two generators, $G_{PM}$ from $\mathcal{P} \times \mathcal{S}$ to $\mathcal{M}$ and $G_{MP}$ from $\mathcal{M}$ to $\mathcal{P}$, and two discriminators, $D_P$ for photos and $D_M$ for meshes.

Network architecture. Our network deals with both 2D images, for which CNNs and residual blocks [11] are effective, and 3D caricature meshes in the 3DCariPCA space, represented as low-dimensional vectors for which fully connected layers are suitable; both structures are therefore used. $G_{PM}$ maps a 2D face photo to a 3D caricature mesh through a sequence of down-sampling convolutional layers, residual blocks, reshaping and fully connected layers; the facial shape vector for user control is fed into the first fully connected layer. $G_{MP}$ is the reverse generator that maps a caricature mesh (a 3DCariPCA vector) to a 2D face image through a sequence of fully connected layers, reshaping, residual blocks and up-sampling convolutional layers. $D_P$ is the discriminator for face photos and $D_M$ is the discriminator for 3D caricatures. We use batch normalization for fully connected layers and instance normalization for convolutional layers and residual blocks. Full architecture details are given in the supplementary material.
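The mixed convolutional/fully-connected design of the photo-to-mesh generator can be sketched in PyTorch as below; the layer counts, channel sizes and the 64-dimensional PCA output are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PhotoToCariPCA(nn.Module):
    """Sketch of the photo-to-mesh generator: convolutional
    down-sampling, then fully connected layers, with the 3-D facial
    shape vector (FSV) concatenated into the first FC layer."""
    def __init__(self, pca_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1),
            nn.InstanceNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),
            nn.InstanceNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),           # fixed-size feature map
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 4 * 4 + 3, 128), nn.ReLU(),
            nn.Linear(128, pca_dim),           # 3DCariPCA coefficients
        )

    def forward(self, photo, fsv):
        h = self.conv(photo).flatten(1)        # image features
        return self.fc(torch.cat([h, fsv], dim=1))
```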

Loss terms. The adversarial loss and bidirectional cycle-consistency loss of the traditional CycleGAN are useful, but insufficient to ensure that the input and the output depict the same person, leading to the generation of random identities. We introduce a novel cross-domain character loss $\mathcal{L}_{cha}$ to constrain the identity of the generated caricature, and a user control loss $\mathcal{L}_{user}$ to incorporate shape control. In total, four losses, $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$, $\mathcal{L}_{cha}$ and $\mathcal{L}_{user}$, are used for training 3D-CariGAN, as detailed below.

Adversarial loss. $\mathcal{L}_{adv}$ ensures that the distribution of generated meshes matches that of real meshes. We adapt the adversarial loss of LSGAN [21]:

$$\mathcal{L}_{adv}^{M} = \mathbb{E}_{m \sim \mathcal{M}}\left[(D_M(m) - 1)^2\right] + \mathbb{E}_{p \sim \mathcal{P},\, u \sim \mathcal{S}}\left[D_M(G_{PM}(p, u))^2\right],$$

and $\mathcal{L}_{adv}^{P}$ for the photo domain is similarly defined.
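The least-squares formulation above corresponds to the standard LSGAN objectives, which can be written compactly as:

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push outputs toward 1 on real samples, 0 on fakes.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator: make the discriminator output 1 on generated samples.
    return ((d_fake - 1.0) ** 2).mean()
```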

Cycle-consistency loss. $\mathcal{L}_{cyc}$ is the bidirectional cycle-consistency loss, which encourages $G_{MP} \circ G_{PM} \approx \mathrm{id}$ and $G_{PM} \circ G_{MP} \approx \mathrm{id}$, where $\circ$ is function composition and $\mathrm{id}$ is the identity function. For the photo direction, it is defined as the difference between the input and the recovered input:

$$\mathcal{L}_{cyc}^{P} = \mathbb{E}_{p \sim \mathcal{P},\, u \sim \mathcal{S}}\left[\left\| G_{MP}(G_{PM}(p, u)) - p \right\|_1\right],$$

and $\mathcal{L}_{cyc}^{M}$ is similarly defined.
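A round-trip reconstruction penalty of this kind is commonly implemented as an L1 difference (as in CycleGAN); the exact norm here is an assumption:

```python
import torch

def cycle_loss(x, x_recovered):
    # Mean absolute difference between the input and its round-trip
    # reconstruction through both generators.
    return torch.mean(torch.abs(x - x_recovered))
```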

Character loss. We propose $\mathcal{L}_{cha}$ to measure character similarity between the input photo and the generated 3D caricature, penalizing identity change. Since the two domains are rather different, we compare facial landmarks. Let $L(\cdot)$ denote the 2D facial landmarks of an image or a mesh, concatenated into a long vector. Landmarks of 2D faces can be located directly; for 3D meshes, we project a pre-determined set of landmark vertices (identical across meshes, since they share the same connectivity) onto the 2D plane. We use the cosine of the angle between mean-centered landmark vectors to measure shape similarity:

$$\mathcal{L}_{cha} = -\cos\left\langle L(p) - \bar{L},\; L(G_{PM}(p, u)) - \bar{L} \right\rangle,$$

where $\bar{L}$ is the average facial landmark vector.
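One plausible implementation of this landmark-based similarity is the negative cosine between mean-centered landmark vectors; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def character_loss(lm_photo, lm_mesh_proj, lm_mean):
    # Negative cosine similarity between the mean-centered landmark
    # vectors of the photo and of the projected 3D caricature.
    a = (lm_photo - lm_mean).flatten()
    b = (lm_mesh_proj - lm_mean).flatten()
    return -F.cosine_similarity(a, b, dim=0)
```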

User loss. To let the user control the generated 3D caricature shape, the input of 3D-CariGAN consists of a photo and a user-specified facial shape. We define

$$\mathcal{L}_{user} = \mathbb{E}\left[\left\| s(G_{PM}(p, u)) - u \right\|_2\right],$$

where $s(\cdot)$ is the facial shape vector introduced in Section 3.2 and $u$ is the facial shape specified by the user at inference time or randomly sampled during training.

Overall objective function. The overall objective for 3D-CariGAN is

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{cyc} \mathcal{L}_{cyc} + \lambda_{cha} \mathcal{L}_{cha} + \lambda_{user} \mathcal{L}_{user},$$

where $\lambda_{cyc}$, $\lambda_{cha}$ and $\lambda_{user}$ are weight parameters balancing the multiple objectives, fixed across all experiments. We use the Adam solver to optimize the objective function when training the neural network.
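The weighted combination can be sketched as below; the default weight values are placeholders, not the paper's settings:

```python
import torch

def total_loss(l_adv, l_cyc, l_cha, l_user,
               w_cyc=10.0, w_cha=1.0, w_user=1.0):
    # Weighted sum of the four loss terms (placeholder weights).
    return l_adv + w_cyc * l_cyc + w_cha * l_cha + w_user * l_user
```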

3.4 Caricature Texture Generation

Texture is essential for the appearance of a 3D caricature and is represented by a texture image and a texture mapping. Mesh parameterization [27, 12] is widely used to generate a mapping between a 3D mesh and a 2D parameter domain, but the texture image then needs to be aligned with the parameterization. In our case, since all caricature meshes share the same connectivity, we take an easier approach: the texture image is the stylized photo obtained with CariStyGAN [5], which stylizes the image without warping, and the texture mapping is obtained by simple projection as follows.

We reconstruct a normal 3D head mesh from the input photo using 2D facial landmarks, e.g., with the optimization-based method [31]; other face reconstruction methods may also be used. This method also computes a projection matrix such that the projected 3D head mesh is aligned with the 2D photo. We apply the projection matrix to compute the 2D parameter coordinates of each vertex. Since the normal head mesh has the same connectivity as the 3D caricatures, with one-to-one vertex correspondence, the texture coordinates of a 3D caricature are the same as those of the normal head.
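The projection-based texture mapping can be sketched as follows; the 3x4 projection-matrix convention and the v-axis flip are assumptions about the image and texture coordinate conventions:

```python
import numpy as np

def texture_coords(vertices, P, img_w, img_h):
    """Project 3D head vertices with a 3x4 projection matrix P onto the
    photo and normalize pixel coordinates to [0, 1] texture UVs."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    xyw = homo @ P.T                       # homogeneous image coordinates
    xy = xyw[:, :2] / xyw[:, 2:3]          # perspective divide
    uv = xy / np.array([img_w, img_h], float)
    uv[:, 1] = 1.0 - uv[:, 1]              # flip v for texture convention
    return uv
```

Because the normal head and the caricature meshes have one-to-one vertex correspondence, these UVs can be copied directly to the caricature mesh.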

4 Experiments and Results

We have implemented 3D-CariGAN, as well as CariStyGAN and CariGeoGAN for the baseline, in PyTorch [22], and the optimization-based method [31] in C++. We tested them on a PC with an Intel E5-2640v4 CPU (2.40 GHz) and an NVIDIA GeForce RTX 2080Ti GPU. Images from the CelebAMask-HQ dataset are resized before being fed into 3D-CariGAN and CariStyGAN.

4.1 Ablation Study

Figure 7: 3D caricatures obtained by changing our pipeline to directly generate mesh vertex coordinates rather than 3DCariPCA vectors. The method produces noisy output meshes, due to the higher dimensional space and lack of the facial constraint.

We first show the benefit and necessity of our PCA representation for 3D caricatures. As an alternative, we evaluate a variant of our method that uses raw mesh vertex coordinates to represent 3D caricatures. The results, shown in Figure 7, are very noisy and visually unacceptable due to the higher-dimensional space and the lack of the facial constraint.

Figure 8: Comparison of 3D-CariGAN trained with different losses. We successively add the adversarial, cycle-consistency, character and user losses to the objective. With only the adversarial loss, the results are completely random; adding the cycle-consistency and character losses makes them similar to the input photos; after adding the user loss and user interaction, the facial features are kept while the facial shapes change.
Figure 9: Comparison of 3D-CariGAN with different losses. We successively add the adversarial, cycle-consistency and character losses to the objective.
Figure 10: With the input photo fixed, we adjust the facial shape vector (FSV) and obtain various results. The space of results is 3-dimensional, identical to the FSV space. We show a planar section of this space, with the third component fixed to its average value.

We perform an ablation study to demonstrate the effectiveness of each loss term in Figure 8. We successively add the adversarial loss, cycle-consistency loss, character loss and user loss to the objective function. The adversarial loss only ensures that the results are 3D caricatures, leaving them highly random. The cycle-consistency loss regularizes the mapping, making the results less extreme, but is insufficient to retain identity and expression. The character loss constrains the results to have the same identity and expression as the input photo, making them resemble it. The user loss keeps the details and controls only the ratios of the facial shape. To evaluate identity preservation between the input 2D photos and the output 3D caricatures, eleven participants evaluated twenty examples (ten of which are shown in Figure 9). For each example, three 3D caricatures were presented, generated with the adversarial loss only, with the cycle-consistency loss added, and with the character loss further added, respectively; participants were asked to select the one that both achieves an exaggerated style and maintains the face identity. Among the 220 collected votes, the three variants received 4 (1.8%), 45 (20.5%) and 171 (77.7%) votes, respectively. Note that caricatures generated with the adversarial loss alone are only guaranteed to lie in the caricature space, without any control of face identity, and so can be regarded as random picks from that space. These results demonstrate that our generated 3D caricatures preserve identity well.

To further verify the effect of the user loss and user interaction, we fix the input photo and adjust the 3-dimensional facial shape vector, so the adjustment space is also 3-dimensional. We obtain a series of results and show a planar section in Figure 10, which demonstrates that the user loss and user interaction adjust the facial shape while keeping the facial details. To quantitatively measure the shape similarity between the generated 3D caricatures in Figure 10 and the input photo, we also compute the character loss defined in Section 3.3. The computed values are quite small, demonstrating that the generated exaggerated shapes still retain likeness to the given face.

4.2 Comparison with Baseline Method

To the best of our knowledge, there is no existing method for generating 3D caricatures from photos, so we construct a baseline by chaining several existing methods. The optimization-based method [31] is the only method for generating 3D caricature meshes from 2D caricature images, but it needs the facial landmarks of the caricature images as input, which are difficult to obtain automatically. CariGeoGAN [5] exaggerates the facial landmarks of photos, and we can warp the photo guided by the exaggerated landmarks using differentiable spline interpolation [7]. The resulting pipeline for generating 3D caricature meshes is the sequence of steps shown in Figure 11. Obviously, this baseline does not allow users to adjust facial shapes interactively. There are two ways to generate texture for the baseline: one is the same as ours, and the other is warping the stylized photo and projecting the caricature mesh onto it. For comparison, we use the same texture for the baseline and our method.

Figure 11: The pipeline of the baseline method: landmark detection, landmark exaggeration using CariGeoGAN, image warping, and 3D caricature mesh generation from facial landmarks.

The comparison of the two methods is shown in Figure 12. Because the exaggerated 2D facial landmarks of the baseline are not designed for 3D caricatures, its results can be either too plain or too odd. We conducted a user study in which randomly selected pairs of results were shown to participants; the majority considered the results of our method both more exaggerated and more expressive than those of the baseline. The generation time comparison in Table 1 shows that our method runs in real time and is much faster than the baseline.

Figure 12: Visual comparison between our method and the baseline. The generation time for each model is shown beneath it.
Method | Step | Time (s)
Baseline cascade | Landmark detection | 0.086
 | Warping | 7.431
 | CariGeoGAN | 0.001
 | Reconstruction | 14.510
 | Total | 22.028
Ours | Total | 0.011
Table 1: Generation time comparison of the baseline and our method, averaged over the test results; both methods produce meshes with the same numbers of vertices and faces.

4.3 User Interaction

We propose a simple and effective user interaction that uses four lines to control the proportions of the facial shape. It is intuitive, easy to use and efficient. Sketching is an alternative, traditional mode of user interaction; however, drawing a caricature sketch is difficult for non-experts, and compared with drawing sketches our interaction strategy is more user-friendly. We conduct a user study to compare the two interaction modes. For sketching, users are asked to draw five sketches from given reference caricatures, and we record the time taken. For our interaction method, we ask users to adjust models by pressing keys or dragging to control the facial shape vector, and we again record the time. After finishing both tasks, we ask users which one is easier to use.
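To illustrate how a few line positions could be mapped to a facial shape vector, the helper below converts three horizontal line positions and one vertical line position into proportion ratios. The parameterization is our own assumption for illustration; the paper's actual encoding of the shape vector may differ.

```python
import numpy as np

def lines_to_shape_ratios(h1, h2, h3, v):
    """Map three horizontal line positions (0 <= h1 < h2 < h3 <= 1, measured
    from the top of the face) and one vertical line position v in [0, 1]
    to proportion ratios. Hypothetical parameterization for illustration."""
    assert 0.0 <= h1 < h2 < h3 <= 1.0 and 0.0 <= v <= 1.0
    upper = h1            # forehead region height
    middle = h2 - h1      # eye-to-nose region height
    lower = h3 - h2       # nose-to-chin region height
    offset = v - 0.5      # horizontal shift of the face midline
    return np.array([upper, middle, lower, offset])

# Evenly spaced horizontal lines and a centered vertical line
ratios = lines_to_shape_ratios(0.3, 0.6, 0.9, 0.5)
print(ratios)
```

Dragging a line (or pressing a key to nudge it) then amounts to re-evaluating this mapping and feeding the updated ratios to the generator, which is feasible because the system runs in real time.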

Ten participants were involved in this study. The results show that our interaction is faster and easier than sketching: drawing a sketch takes much longer on average than adjusting a model with our method. All participants thought our interaction was easier than sketching; four of them felt stressed when drawing sketches, while none felt stressed using our method.

5 Conclusion

We propose an end-to-end deep neural network model that transforms a normal face photo into a 3D caricature, a challenging cross-domain task. Our system runs in real time, which makes interactive control practical, and we further propose a simple and intuitive interactive control method using a few lines. Experiments and user studies demonstrate the effectiveness of our system.


  • [1] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces.. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pp. 187–194. Cited by: §1, §2.
  • [2] J. Booth, A. Roussos, A. Ponniah, D. Dunaway, and S. Zafeiriou (2018) Large scale 3d morphable models. International Journal of Computer Vision 126 (2-4), pp. 233–254. Cited by: §2.
  • [3] S. E. Brennan (2007) Caricature generator: the dynamic exaggeration of faces by computer. Leonardo 40 (4), pp. 392–400. Cited by: §1, §2.
  • [4] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2013) Facewarehouse: a 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. Cited by: §2, §3.1.
  • [5] K. Cao, J. Liao, and L. Yuan (2018-12) CariGANs: unpaired photo-to-caricature translation. ACM Trans. Graph. 37 (6), pp. 244:1–244:14. Cited by: §1, §1, Figure 2, §2, §3.4, §3, §4.2.
  • [6] L. Clarke, M. Chen, and B. Mora (2010) Automatic generation of 3d caricatures based on artistic deformation styles. IEEE Transactions on Visualization and Computer Graphics 17 (6), pp. 808–821. Cited by: §2.
  • [7] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman (2017-07) Synthesizing normalized faces from facial identity features. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §4.2.
  • [8] H. Dai, N. Pears, W. A. Smith, and C. Duncan (2017) A 3d morphable model of craniofacial shape and texture variation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3085–3093. Cited by: §2.
  • [9] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §2.
  • [10] X. Han, C. Gao, and Y. Yu (2017) DeepSketch2Face: a deep learning based sketching system for 3d face and caricature modeling. ACM Transactions on Graphics (TOG) 36 (4), pp. 126. Cited by: §1, §2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §3.3.
  • [12] K. Hormann, B. Lévy, and A. Sheffer (2007) Mesh parameterization: theory and practice. ACM SIGGRAPH Course Notes. Cited by: §3.4.
  • [13] J. Huo, Y. Gao, Y. Shi, and H. Yin (2017) Variation robust cross-modal metric learning for caricature recognition. In Proceedings of the on Thematic Workshops of ACM Multimedia (ACM MM), pp. 340–348. Cited by: §3.1.
  • [14] J. Huo, W. Li, Y. Shi, Y. Gao, and H. Yin (2018) WebCaricature: a benchmark for caricature recognition. In British Machine Vision Conference (BMVC), Cited by: §3.1.
  • [15] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos (2017) Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1031–1039. Cited by: §2.
  • [16] L. Jiang, J. Zhang, B. Deng, H. Li, and L. Liu (2018) 3D face reconstruction with geometry details from a single image. IEEE Transactions on Image Processing 27 (10), pp. 4756–4770. Cited by: §2.
  • [17] D. E. King (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10 (Jul), pp. 1755–1758. Cited by: §3.1, §3.2.
  • [18] C. Lee, Z. Liu, L. Wu, and P. Luo (2019) MaskGAN: towards diverse and interactive facial image manipulation. arXiv preprint arXiv:1907.11922. Cited by: §1, §3.3.
  • [19] W. Li, W. Xiong, H. Liao, J. Huo, Y. Gao, and J. Luo (2018) CariGAN: caricature generation through weakly paired adversarial learning. arXiv preprint arXiv:1811.00445. Cited by: §1, §2.
  • [20] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang (2017) Visual attribute transfer through deep image analogy. ACM Transactions on Graphics (TOG) 36 (4), pp. 120. Cited by: §2.
  • [21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2794–2802. Cited by: §3.3.
  • [22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Cited by: §4.
  • [23] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 296–301. Cited by: §2.
  • [24] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black (2018) Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 704–720. Cited by: §3.1.
  • [25] S. B. Sadimon, M. S. Sunar, D. Mohamad, and H. Haron (2010) Computer generated caricature: a survey. In International Conference on Cyberworlds (CW), pp. 383–390. Cited by: §1.
  • [26] M. Sela, Y. Aflalo, and R. Kimmel (2015) Computational caricaturization of surfaces. Computer Vision and Image Understanding 141, pp. 1–17. Cited by: §2.
  • [27] A. Sheffer, E. Praun, K. Rose, et al. (2007) Mesh parameterization methods and their applications. Foundations and Trends® in Computer Graphics and Vision 2 (2), pp. 105–171. Cited by: §3.4.
  • [28] Y. Shi, D. Deb, and A. K. Jain (2019) WarpGAN: automatic caricature generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10762–10771. Cited by: §2.
  • [29] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt (2017) Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1274–1283. Cited by: §2.
  • [30] D. Vlasic, M. Brand, H. Pfister, and J. Popović (2005) Face transfer with multilinear models. ACM Transactions on Graphics (TOG) 24 (3), pp. 426–433. Cited by: §2.
  • [31] Q. Wu, J. Zhang, Y. Lai, J. Zheng, and J. Cai (2018) Alive caricature from 2d to 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7336–7345. Cited by: §1, §1, §1, §2, §2, §3.1, §3.4, §4.2, §4.
  • [32] X. Han, K. Hou, D. Du, Y. Qiu, S. Cui, K. Zhou, and Y. Yu (2018) CaricatureShop: personalized and photorealistic caricature sketching. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1. Cited by: §1, §2.
  • [33] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2223–2232. Cited by: §3.3.
  • [34] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt (2018) State of the art on monocular 3d face reconstruction, tracking, and applications. Computer Graphics Forum (CGF) 37 (2), pp. 523–550. Cited by: §2.