A caricature is a vivid art form that depicts persons by abstracting or exaggerating the peculiarities of their facial features. As a way to convey humor or sarcasm, caricatures are widely used in entertainment, social events, electronic games and a variety of artistic creations. While 2D caricatures have gained popularity in comic graphics, there exist many scenarios, including cartoon character creation, game avatar customization and custom-made 3D printing, in which 3D face caricatures remain the mainstream representation. However, creating a high-quality 3D caricature is a labor-intensive and time-consuming task even for a skilled artist. Therefore, generating expressive 3D face caricatures from a minimal input, such as a single image, is a highly demanded but also challenging task.
Most prior works focus on 2D caricature generation [cao2018carigans, shi2019warpgan, Kim20DST], while research on reconstructing 3D caricatures from 2D caricature images remains very rare. Wu et al. [wu2018alive] propose the first work that creates 3D caricatures from 2D caricature images using an optimization-based approach. They formulate caricature generation as a problem of deforming the standard 3D face. In particular, they build an intrinsic deformation space based on the exaggerated morphable models of standard faces [paysan20093d]. The deformation coefficients are then optimized to reduce the landmark fitting errors. Recently, in their follow-up work [zhang2020landmark], they employ a CNN to automate the tasks of 2D facial landmark prediction and deformation regression. However, previous works [sela2017unrestricted, liu20193d] have shown that the traditional 3D morphable models (3DMM) of normal faces have very limited expressiveness in modeling the intricate facial deformations found in reality. Therefore, the deformation space based on a synthetically exaggerated 3DMM, as proposed in [wu2018alive, zhang2020landmark], is far from sufficient to capture realistic 3D caricatures, which are even more diversified and complex than normal faces.
The key to tackling the above problem is a high-quality 3D caricature dataset created by artists that can provide realistic shape priors for both learning-based and optimization-based approaches. However, there exist two challenges in constructing such a dataset. First, the 3D models crafted by artists are not topologically consistent, which makes them unusable for many downstream applications, including blendshape creation, face animation and 3D landmark localization. Second, the manually created meshes are typically not aligned with the corresponding images. Since many face reconstruction techniques require an accurate registration, such misalignment makes the dataset inapplicable to projection-based applications such as landmark fitting, texture restoration and manipulation.
In this work, we introduce 3DCaricShop, a large-scale 3D caricature dataset that simultaneously addresses the above issues. First of all, 3DCaricShop contains 2,000 highly diversified and high-quality 3D caricature models manually crafted by professional artists. It is constructed by requesting artists to create 3D caricatures according to 2,000 manually selected caricature images from WebCaricature [huo2017webcaricature], which span a wide range of shape exaggerations and texturing styles. Compared to the synthetic datasets [han2017deepsketch2face, zhang2020landmark], 3DCaricShop can provide shape priors for 3D caricatures with much higher fidelity. Secondly, all the 3D models in 3DCaricShop have been re-topologized to a consistent mesh topology, which paves the way to a number of future applications, including learning a parametric shape space and batch geometry processing. Thirdly, we provide accurate 3D face landmarks in 3DCaricShop, which facilitates the use of landmark fitting techniques that are widely adopted in state-of-the-art face reconstruction approaches. Last but not least, each model in 3DCaricShop is paired with a 2D caricature image and the camera parameters that are used for mesh alignment. This enables a wide range of techniques that rely on 2D-to-3D consistency, such as differentiable rendering and landmark fitting.
To further exploit the power of 3DCaricShop, we propose a novel baseline approach to infer 3D caricatures from a single caricature image. While the methods based on deep implicit functions [huang2018deep, park2019deepsdf] have shown promising capability of modeling objects with arbitrary topologies, they are prone to artifacts and self-intersections when applied to reconstruct 3D caricatures, which typically contain many extreme distortions. Approaches using a parametric mesh model can ensure the generation of a plausible 3D face, but they struggle to produce realistic faces with accurate geometry. We advocate combining the best of both worlds by transferring the high-fidelity geometry learnt from the implicit reconstruction to a template mesh with a reasonable topology. To enable a faithful transfer, we propose a novel view-collaborative graph convolution network (VC-GCN) to extract key points from the implicit mesh for accurate mesh alignment. To strike a balance between accuracy and robustness, we iteratively project the registered template mesh onto a PCA space pre-trained on 3DCaricShop to avoid overfitting to outliers. Our approach is able to generate high-quality 3D caricatures in a pre-defined mesh topology that is animation-ready.
We have conducted extensive benchmarking and ablation analysis on the proposed dataset. Experimental results show that the proposed approach trained on 3DCaricShop sets a new state of the art on the task of single-view 3D caricature reconstruction from caricature images.
2 Related Work
Shape-from-shading (SfS) is a kind of physically based prior on the relation between illumination and shape, which recovers the detailed shape in photos. However, it fails on artists' works because of their stylized shading effects. Most recently, with the success of deep learning architectures and the release of large-scale 3D shape datasets such as ShapeNet [chang2015shapenet], learning-based approaches have achieved great progress by learning shape priors directly from large datasets. According to the 3D representation used, these methods can be divided into voxel-based [maturana2015voxnet, choy20163d, Riegler_2017_CVPR], point-based [qi2017pointnet, qi2017pointnet++], mesh-based [wang2018pixel2mesh, pan2019deep], and implicit-function-based frameworks [mescheder2019occupancy, saito2019pifu]. Among these methods, PIFu [saito2019pifu], an algorithm based on implicit functions, has been applied to the reconstruction of human bodies and achieves impressive results. In this paper, we employ PIFu to create the 3D mesh for each single caricature image.
Single-view Face Modeling
A closely related task is photo-realistic face reconstruction. Two mainstream methodologies have been developed to handle this problem, i.e., parametric [cao2013facewarehouse, tran2018nonlinear, deng2019accurate] and shape-from-shading based [richardson2017learning, t2018extreme] methods, and remarkable results have been achieved. However, neither can be applied to our task directly. Parametric methods suffer from the large diversity of geometric shapes in caricatures. For SfS algorithms, the underlying physical model cannot capture the various painting styles of artists.
3D Caricature Generation
Following the parametric methods for normal face reconstruction, researchers further introduce deformation to enlarge the representation capability [liu2009semi, wu2018alive, zhang2020landmark]. In [liu2009semi], a semi-supervised manifold regularization method is proposed to learn a regressive model for mapping from 2D real faces to the enlarged training set with 3D caricatures. Wu et al. [wu2018alive] formulate 3D caricature generation as a problem of deformation from the standard 3D face. By introducing local deformation gradients, they build an intrinsic deformation representation with the capability of extrapolation. With this deformation representation, they construct an optimization framework to create caricature models guided by landmark constraints. Following [wu2018alive], Zhang et al. [zhang2020landmark] employ a CNN to learn the deformation parameters of the intrinsic deformation representation. However, due to the lack of 3D caricature data, the results of these works are still far from satisfactory.
3D Face datasets
3D face datasets are of great value in face reconstruction tasks. In general, they can be categorized into synthetic and real captured datasets. For normal faces, existing 3D datasets, including FaceWarehouse [cao2013facewarehouse] and FaceScape [yang2020facescape], are built from scanned 3D data and are widely used in normal face tasks. They focus on the high accuracy and photo-realism of the meshes. However, they cannot be applied directly to caricature reconstruction. Researchers [wu2018alive, zhang2020landmark] have tried to deform real 3D face models to construct synthetic exaggerated data. Although some reasonable results are achieved, the resulting data still lack diversity. To tackle this problem, we propose 3DCaricShop, the first 3D caricature dataset built by artists, composed of pairs of caricature images and meshes. Based on this dataset, 3D caricature shapes can be learned in a model-free manner. We further propose a baseline method to reconstruct a 3D mesh with uniform topology from a single caricature image.
We construct a dataset which contains 2,000 image-model pairs in total. All of the 3D models are annotated with 3D facial landmarks and poses w.r.t images. More details are introduced in the following aspects.
3D Model Collection
WebCaricature [huo2017webcaricature] is the largest dataset of 2D caricatures to date. It contains around 6,000 caricature images with diverse identities, geometry, and textural styles. We first selected 2,000 images from it, making the selection as diverse as possible. Then we recruited 4 paid expert ZBrush artists to create models according to the images. Each model is required to match its image as closely as possible under projection. The contour lines used for matching include the edges of the silhouette, lips, eyes, nose bottom and ears. Each model takes around 40 minutes on average, and the whole collection took around 40 days in total. Several image-model pairs sampled from our dataset are shown in Fig. 2. Each model consists of 300,000 to 700,000 vertices. More examples can be found in the supplemental material.
To support building a parametric space for our 3D caricature dataset, we unify the mesh topology of all models in two steps: 1) we first manually annotate 44 3D landmarks (see details in Fig. 3) for each model; 2) non-rigid ICP [amberg2007optimal, dai2020statistical] is applied to register a pre-defined template mesh to each model, guided by the 3D landmarks. Due to the inherent difficulty of specifying vertices on a 3D mesh, the landmark annotation is performed on 3 rendered views of the 3D shape. As described in [booth20163d], these 2D landmarks can be easily transformed into their corresponding 3D positions. The template mesh we use is from FaceWarehouse [cao2013facewarehouse] and consists of 11,551 vertices. This procedure is illustrated by an example in Fig. 3.
Analysis of the dataset
We quantitatively analyze our dataset by comparing the shape variations with two normal face datasets (FaceWarehouse (FWH) and FaceScape), as well as one synthetic caricature dataset, FaceWarehouse with deformation (Aug. FWH). We measure the shape variation using global and part variance. In particular, the variance is computed between the models and their corresponding mean shape of each dataset in terms of per-vertex displacement. The results are presented below. The shape diversity of our dataset is richer than the normal ones. For most of the face regions, 3DCaricShop has larger shape variance than Aug. FWH. We will include this analysis in the revision.
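The variance measure described above can be sketched as follows. This is a minimal illustration, assuming the meshes are already registered to one topology and that "variance" means the mean squared per-vertex displacement from the mean shape; the function name `shape_variance` and the `part_index` argument (vertex indices of one face region) are hypothetical and not taken from the paper.

```python
import numpy as np

def shape_variance(meshes, part_index=None):
    """Mean squared per-vertex displacement from the dataset's mean shape.

    meshes: (N, V, 3) array of N meshes sharing one topology.
    part_index: optional vertex indices of a face part (hypothetical labelling).
    """
    meshes = np.asarray(meshes, dtype=float)
    if part_index is not None:
        meshes = meshes[:, part_index, :]          # restrict to one face region
    mean_shape = meshes.mean(axis=0)               # (V, 3) mean shape of the dataset
    displacement = np.linalg.norm(meshes - mean_shape, axis=2)  # (N, V)
    return float(np.mean(displacement ** 2))
```

Calling the function without `part_index` gives a global score, while passing the vertex indices of, say, the nose region gives a part score; the exact normalization used for the reported numbers may differ.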
In this section, we introduce the proposed baseline method. Given an input caricature image, the task is to generate the corresponding 3D mesh. With the topologically uniform 3D meshes in 3DCaricShop, a straightforward way to tackle the task is to construct a PCA basis using the 3D Morphable Model algorithm [blanz1999morphable] to build the caricature face space. However, such a space cannot handle the large variation in our data. To capture the diversity of geometry in caricatures, we employ the Pixel-aligned Implicit Function (PIFu) [saito2019pifu] to generate the 3D shape from the input image. Although the implicit function models the variation in the targets, it cannot ensure a uniform topology for the predictions. To achieve that, we register a template mesh to the implicit mesh using non-rigid registration (NICP) [amberg2007optimal]. The output of NICP is then projected onto the pre-constructed PCA space to alleviate deformation artifacts, such as self-intersections. In the following, we refer to the output of NICP as the registered mesh and to the output of the PCA projection as the projected mesh. Considering the large difference between the template and target meshes, a sparse set of 3D landmarks is needed in the NICP stage. We propose a novel view-collaborative graph convolution network (VC-GCN) to predict these key points from the implicit mesh.
4.1 The Baseline Approach
Our parametric model space is built with the standard 3D Morphable Model (3DMM) [blanz1999morphable] algorithm. Given $N$ caricature models with uniform topology and $n$ vertices on each mesh, principal component analysis (PCA) is performed on the shape matrix, which is formed by stacking the 3D coordinates of the vertices. The generated $k$ eigenvectors $B$ are employed as the shape basis, where $k$ is a hyper-parameter. The mean vector $\bar{S}$ represents the mean shape in the mesh set. With this 3DMM, a novel caricature model $S$ can be represented as follows:
$$S = \bar{S} + B\alpha$$
where $\alpha$ is the vector of shape coefficients.
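As an illustration of the PCA construction described above, the following NumPy sketch builds a shape basis from registered meshes and maps a mesh to and from shape coefficients. It is a minimal sketch under the assumption that the basis is obtained from an SVD of the centered shape matrix; the function names `build_pca_space`, `encode` and `decode` are hypothetical.

```python
import numpy as np

def build_pca_space(meshes, k):
    """meshes: (N, V, 3) registered meshes. Returns mean (3V,) and basis (k, 3V)."""
    shapes = np.asarray(meshes, dtype=float).reshape(len(meshes), -1)  # shape matrix
    mean = shapes.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:k]

def encode(mesh, mean, basis):
    """Shape coefficients of a mesh (V, 3) in the PCA space."""
    return basis @ (np.asarray(mesh, dtype=float).reshape(-1) - mean)

def decode(coeffs, mean, basis):
    """Mesh vertices (V, 3) generated from shape coefficients."""
    return (mean + coeffs @ basis).reshape(-1, 3)
```

The number of components `k` trades expressiveness against robustness: a small `k` regularizes more strongly, while a `k` close to the number of training meshes reproduces them almost exactly.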
To capture the diversity of geometric variation in the 3D data, we adopt the Pixel-aligned Implicit Function (PIFu) [saito2019pifu] to reconstruct the underlying 3D shape from images. PIFu performs 3D reconstruction by estimating the occupancy of a dense 3D volume, which determines whether a point in 3D space is inside the model or not. Given an RGB image, its normal maps from the front view and back view are generated by a pix2pixHD network [wang2018high] to strengthen local details. Then the implicit binary function can be written as:
where is the ground-truth occupancy.
3D Landmark Detection for Registration
The output meshes of the implicit function are not topologically uniform. In order to unify the topology, we adopt non-rigid registration [amberg2007optimal] to deform a template mesh into the shape of the implicit mesh. As shown in [amberg2007optimal], without landmarks the cost function of the registration can run into a local minimum, where the template collapses onto a point on the target surface. Thus it is important to introduce 3D landmarks on both meshes to guide the deformation. We design a novel framework to detect 3D landmarks on the implicit mesh. In short, we propose to perform detection on the rendered views of the implicit mesh to leverage the effectiveness of image-based CNN techniques. The process is detailed in Section 4.2.
Because of the large difference between the template and the implicit mesh, the deformation is likely to generate artifacts on the meshes, such as self-intersections. To resolve this problem, we iteratively perform NICP and PCA projection of the results to obtain the registered mesh and the projected mesh. After projection, we obtain a deformed template that is closer in shape to the implicit mesh. Fig. 4 illustrates the process of the progressive deformation.
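The alternation between registration and PCA projection can be sketched as a simple loop. This is only an illustration under stated assumptions: `register` is a hypothetical callable standing in for one landmark-guided NICP pass (NICP itself is not implemented here), `basis` is assumed to have orthonormal rows, and the iteration count is a free parameter not given in the text.

```python
import numpy as np

def pca_project(vertices, mean, basis):
    """Project a mesh (V, 3) onto the PCA space spanned by the rows of basis (k, 3V)."""
    flat = np.asarray(vertices, dtype=float).reshape(-1) - mean
    return (mean + basis.T @ (basis @ flat)).reshape(-1, 3)

def progressive_fit(template, register, mean, basis, n_iters=5):
    """Alternate a registration step with PCA projection.

    register: callable mapping the current template (V, 3) to a registered
    mesh (V, 3); a stand-in for landmark-guided NICP.
    Returns the last registered mesh and its PCA projection.
    """
    registered = current = np.asarray(template, dtype=float)
    for _ in range(n_iters):
        registered = register(current)                  # follow the target closely
        current = pca_project(registered, mean, basis)  # pull back to plausible shapes
    return registered, current
```

Each projection discards the part of the deformation that lies outside the learned shape space, so the next registration pass starts from a plausible face that is already closer to the target.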
4.2 View-collaborative 3D Landmark Detection
In this section, we discuss in more detail how to detect 3D landmarks from the implicit mesh, which is the key to supporting landmark-guided registration. A straightforward way is to directly apply a point-based CNN (e.g., SparseConv [liu2015sparse]) to estimate landmark-aware heatmaps on the mesh vertices. However, due to the inherent difficulty of applying CNNs to a mesh, this approach tends to produce inaccurate results. We thus propose to perform detection on the rendered views of the implicit mesh to leverage the effectiveness of image-based CNN techniques. Coarse locations of the 3D landmarks can be obtained from the detected 2D landmarks on those views. More importantly, a stack of View-Collaborative GCN (VC-GCN) blocks is designed to aggregate and enhance information from multiple views for accurate 3D landmark localization. As illustrated in Fig. 5, single-view graph features (local features) are first extracted for initialization. Then, these local features are enhanced in a progressive manner by continually fusing global information into each view. The final local features are aggregated into the global graph feature for 3D landmark prediction.
In this part, we describe the local feature initialization. Given 2D images rendered from 3 views, we first use a 2D landmark detector [guo2019pfld] to estimate the 2D landmarks in each view, where the number of key points depends on the view. Next, the corresponding 3D landmarks on the mesh are located using the projection matrix of each view. We use the above landmarks which exist in all local views to build local graphs. After that, to enrich the information of each graph node, we extract a feature for each node from the feature maps generated by the landmark detector, according to its 2D coordinates. Finally, the initial local view features for VC-GCN are produced by concatenating the 3D landmark locations with the related node features under each view.
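One plausible way to locate 3D landmarks on the mesh from detected 2D landmarks is sketched below. It is an assumption-laden illustration, not the procedure of the paper: the camera is taken to be an affine 3x4 matrix whose rows output the image coordinates u, v and a depth value (smaller meaning closer to the camera), and the function name `lift_landmarks` and the `radius` parameter are hypothetical.

```python
import numpy as np

def lift_landmarks(vertices, camera, landmarks_2d, radius=2.0):
    """Pick, for every 2D landmark, the closest-to-camera vertex that projects near it.

    vertices: (V, 3) mesh vertices; camera: (3, 4) affine matrix giving (u, v, depth);
    landmarks_2d: (K, 2) detected landmarks; radius: search radius in image units.
    Returns vertex indices (K,) and their 3D positions (K, 3).
    """
    vertices = np.asarray(vertices, dtype=float)
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    projected = homo @ np.asarray(camera, dtype=float).T   # (V, 3): u, v, depth
    uv, depth = projected[:, :2], projected[:, 2]
    indices = []
    for point in np.asarray(landmarks_2d, dtype=float):
        dist = np.linalg.norm(uv - point, axis=1)
        near = np.flatnonzero(dist <= radius)
        if near.size == 0:                       # nothing inside the radius:
            near = np.array([np.argmin(dist)])   # fall back to the nearest projection
        indices.append(int(near[np.argmin(depth[near])]))  # front-most candidate
    indices = np.array(indices)
    return indices, vertices[indices]
```

Choosing the front-most candidate is a cheap substitute for a proper visibility test; a renderer's depth buffer would serve the same purpose more reliably.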
In order to provide global information for each view, we aggregate 3 local features into a global graph feature. Then the global feature is fused into each view to enhance the local view feature. This procedure is performed by a View-Collaborative GCN block. In each VC-GCN block, local features are first sent into several GCN layers for better representations. The layer-wise operation in GCN is defined as:
$$H_v^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H_v^{(l)} W^{(l)}\right) \qquad (3)$$
where $\tilde{A}$ is the adjacency matrix with self-loops, $\tilde{D}$ is its diagonal node degree matrix used to normalize $\tilde{A}$, $H_v^{(l)}$ represents the local feature in layer $l$ under the view $v$, $W^{(l)}$ is a trainable parameter matrix for linear projection, and $\sigma$ represents the non-linear activation operation. Then the obtained local features are combined into a global graph feature. For each node of the 3D landmark graph, its feature is taken from the local feature of the corresponding node under one view. Note that for a node that is shared by different views, its feature is set to the average of the corresponding local features. The combined global graph feature is then strengthened through several GCN layers, with the same operation as in Equation 3. Hence, the process of feature aggregation can be formulated as follows:
where denotes the combination operation, is the strengthened global features. Note that the input is set as the initial local view feature in the first VC-GCN block, and is set as the prior output features in subsequent blocks.
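The layer-wise GCN operation and the combination of local features into a global graph feature can be sketched in NumPy as follows. This is a minimal sketch assuming the widely used symmetric normalization of the self-looped adjacency matrix and a ReLU activation, neither of which is confirmed by the text; the function names and the `node_ids` mapping from local to global node indices are hypothetical.

```python
import numpy as np

def normalize_adjacency(adjacency):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    a_tilde = np.asarray(adjacency, dtype=float) + np.eye(len(adjacency))
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_tilde.sum(axis=1)))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

def gcn_layer(features, a_hat, weight):
    """One GCN layer: aggregate neighbours, project linearly, apply ReLU."""
    return np.maximum(a_hat @ features @ weight, 0.0)

def combine_views(local_feats, node_ids, num_nodes):
    """Average local node features into one global graph feature.

    local_feats: list of (N_v, C) arrays, one per view.
    node_ids: list of integer arrays giving the global node index of each local node.
    Nodes seen in several views receive the mean of their local features.
    """
    channels = local_feats[0].shape[1]
    total = np.zeros((num_nodes, channels))
    count = np.zeros(num_nodes)
    for feats, ids in zip(local_feats, node_ids):
        np.add.at(total, ids, feats)
        np.add.at(count, ids, 1.0)
    return total / np.maximum(count, 1.0)[:, None]
```

In a full block, `gcn_layer` would be applied several times per view, `combine_views` would merge the results, and further `gcn_layer` calls on the global graph would produce the strengthened global feature.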
In the second step, the strengthened global features are fused into the local features of each view in a non-local manner [wang2018non], so that global information can guide the model to learn more representative local features. The enhanced local features of each view are obtained by applying the non-local fusion operation to the local features of that view and the strengthened global features.
More details about the non-local fusion operation are described in the supplementary materials.
It usually takes numerous glimpses to adjust the key points when constructing a 3D face, even for an expert artist. Thus, several VC-GCN blocks are stacked to progressively enhance the local features. Between two consecutive blocks, the enhanced local features from the former block are taken as the input of the latter block.
Given the enhanced local features from the last VC-GCN block, we combine and strengthen them using GCN layers to obtain the final global graph feature. Next, this feature is fed into a GCN head layer to obtain the 3D landmark estimation. The predicted 3D landmarks are supervised by 3D and 2D landmark ground truth simultaneously, which leads to more accurate prediction. We formulate the loss function for training the view-collaborative 3D landmark detection as follows:
The loss is accumulated over all samples of the training set and combines three terms: the 2D landmark detection error in the initialization stage, and the 3D landmark prediction errors measured in 2D and in 3D space, respectively. For the 2D term, the predicted 3D landmarks are mapped to 2D landmarks with the projection matrix of each view. Note that all loss terms are in smooth-$L_1$ form.
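A possible form of this loss is sketched below in NumPy. It is a sketch under several assumptions that are not stated in the text: every landmark is visible in every view, each camera is an affine 2x4 matrix, the smooth-L1 threshold `beta` is 1, and the three weights `w_init`, `w_2d` and `w_3d` are hypothetical placeholders.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-like) loss averaged over all elements."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def project(points_3d, cameras):
    """Project (K, 3) points with (n_views, 2, 4) affine cameras to (n_views, K, 2)."""
    points_3d = np.asarray(points_3d, dtype=float)
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    return np.einsum('vij,kj->vki', np.asarray(cameras, dtype=float), homo)

def landmark_loss(init_2d, gt_2d, pred_3d, gt_3d, cameras,
                  w_init=1.0, w_2d=1.0, w_3d=1.0):
    """Weighted sum of the three landmark terms for one training sample."""
    loss_init = smooth_l1(init_2d, gt_2d)                  # 2D detector error
    loss_2d = smooth_l1(project(pred_3d, cameras), gt_2d)  # 3D prediction, seen in 2D
    loss_3d = smooth_l1(pred_3d, gt_3d)                    # 3D prediction, in 3D space
    return w_init * loss_init + w_2d * loss_2d + w_3d * loss_3d
```

Supervising the projected prediction in every view ties the 3D estimate to the image evidence, while the direct 3D term resolves the depth ambiguity that 2D supervision alone leaves open.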
The proposed framework is trained on our 3DCaricShop. The dataset is separated into 1,600 for training and 400 for testing. The weights of , and are set to , and
respectively. To train the network for learning the implicit reconstruction, an RMSProp optimizer is adopted with a learning rate of 0.001, and the network is pre-trained with a mini-batch size of 2 for 80 epochs. During the training of the 2D landmark detection network, an Adam optimizer is used. The learning rate is set to 0.0001 with a cosine decay, and the network is trained for 30 epochs with a mini-batch size of 24. After that, the whole framework is trained in an end-to-end manner with the same strategy as above.
We present some typical results of the proposed framework in Fig. 6. As illustrated, our method is robust to caricature images with diverse textures. It can also recover diversified geometric features, such as the exaggerated nose in the second sample of the first row and the sharp long chin in the third sample of the second row in Fig. 6.
5.1 Comparisons with the State-of-the-arts
We qualitatively and quantitatively compare the results of our method with a variety of state-of-the-art 3D caricature reconstruction approaches on 3DCaricShop testing set, including linear parametric model (3DMM) [t2018extreme, blanz1999morphable], depth map (DF2Net) [zeng2019df2net], deformation representation (AliveCaric-DL) [zhang2020landmark], and implicit function (PIFu) [saito2019pifu] based methods.
In Fig. 7, we visualize some results of caricature reconstruction on images from 3DCaricShop. Among the parametric methods, the model based on the nonlinear deformation representation [zhang2020landmark] outperforms the linear ones in fitting the exaggerated input images, but it is still not precise enough due to its limited expressiveness. Besides, other deep learning based approaches such as DF2Net and PIFu unavoidably yield artifacts like hollows and spikes. In contrast, our method introduces the constraint of the PCA parametric space into the deep model and can thus produce highly exaggerated local details on top of a plausible global shape.
Considering other methods for comparison only reconstruct the face area, we adopt average point-to-surface Euclidean distance (P2S) as the evaluation metric, which measures the unidirectional distance from the source set to the target set. The average point-to-surface Euclidean distance can be computed as:
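$$\mathrm{P2S}(\mathcal{V}, \mathcal{S}) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \min_{p \in \mathcal{S}} \lVert v - p \rVert_2$$
where $\mathcal{V}$ is the vertex set of the reconstructed mesh and $\mathcal{S}$ is the corresponding ground truth surface. Besides, because the generated meshes and the ground truth differ in orientation and scale, Procrustes alignment is performed before the calculation, with the scaling estimated by least squares.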
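The evaluation protocol can be illustrated with the following NumPy sketch. It rests on two assumptions that go beyond the text: the alignment uses a closed-form least-squares similarity fit between corresponding points (for example, matching landmarks), and the ground truth surface is approximated by a set of points sampled on it, so the result is an upper bound of the exact point-to-triangle distance. The function names are hypothetical.

```python
import numpy as np

def similarity_align(source, target):
    """Least-squares scale, rotation and translation mapping source onto target.

    source, target: (n, 3) corresponding points. Returns (scale, R, t) such that
    scale * source @ R.T + t approximates target.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    cov = tgt.T @ src / len(source)
    u, s, vt = np.linalg.svd(cov)
    d = np.eye(3)
    d[2, 2] = np.sign(np.linalg.det(u) * np.linalg.det(vt))  # avoid reflections
    rotation = u @ d @ vt
    scale = np.trace(np.diag(s) @ d) / (src ** 2).sum(axis=1).mean()
    translation = mu_t - scale * rotation @ mu_s
    return scale, rotation, translation

def p2s_approx(vertices, surface_points):
    """Mean distance from each vertex to its nearest sampled surface point."""
    diff = vertices[:, None, :] - surface_points[None, :, :]
    return float(np.linalg.norm(diff, axis=2).min(axis=1).mean())
```

The brute-force distance matrix is fine for a few thousand points; for meshes with hundreds of thousands of vertices, a spatial index such as a k-d tree would be the usual replacement.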
As shown in Table 2, our method achieves the smallest P2S on the 3DCaricShop.
5.2 Ablation Studies
In this section, we perform ablation studies on the proposed 3D landmark detection framework and landmark guided registration process. The results show the effectiveness and robustness of our pipeline.
3D landmark detection
We analyze five variants of our framework: 1) directly using the 3D landmarks selected from the predicted 2D landmarks without subsequent refinement, denoted as ‘w/o GCN refinement’; 2) utilizing a voxel-based method [moon2018v2v] to estimate 3D heatmaps, denoted as ‘V2V’; 3) employing a global graph to refine the 3D landmarks from the first setting without using the VC-GCN block, denoted as ‘Global only’; 4) only using the local index to gather local features from the global view, rather than using non-local operations for local feature enhancement, denoted as ‘w/o G2L’; 5) the basic setting, denoted as ‘Basic’.
The metric we use to evaluate the results is the mean per joint position error (MPJPE), which is defined as the Euclidean distance between predicted and ground truth 3D landmarks after root joint alignment. The root joint we define is the top of the nose. This metric measures how accurately the root-relative 3D landmark estimation is performed. The quantitative results are listed in Table 3. They confirm the effectiveness of each component in the design of our 3D landmark detector.
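The metric can be written in a few lines of NumPy. This is a minimal sketch; the function name `mpjpe` and the `root` argument (the index of the root landmark, here the one at the top of the nose) are illustrative choices.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean per joint position error after aligning both sets at the root landmark.

    pred, gt: (K, 3) predicted and ground truth 3D landmarks.
    root: index of the root landmark used for alignment.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_rel = pred - pred[root]   # root-relative prediction
    gt_rel = gt - gt[root]         # root-relative ground truth
    return float(np.linalg.norm(pred_rel - gt_rel, axis=1).mean())
```

Because both landmark sets are expressed relative to the root, a pure translation of the prediction does not change the score.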
On landmark guided registration
We evaluate three kinds of registration process: 1) directly performing NICP without landmark information; 2) performing landmark-guided NICP without PCA projection; 3) the process used in our method. The visualized results are shown in Fig. 9. As suggested in [amberg2007optimal], without landmark information, NICP cannot handle the large discrepancy between the template and the implicit mesh. Besides, the deformation without PCA projection is likely to generate meshes with self-intersections. In contrast, our method obtains meshes of higher quality and captures enough of the shape information in the implicit mesh. For example, the artifacts on the ears in Fig. 9(d) are eliminated, while the shape of the nose is more consistent with both the ground truth mesh and the input caricature image.
| Setting | MPJPE | Setting | MPJPE |
|---|---|---|---|
| w/o GCN Refinement | 0.451 | Global only | 0.373 |
| V2V [moon2018v2v] | 0.407 | w/o G2L | 0.358 |

| Ours w/o Landmark | 0.074 |
| Ours w/o PCA projection | 0.076 |
We construct a new dataset and benchmark, called 3DCaricShop, for single-view 3D reconstruction from caricature images. 3DCaricShop is the largest collection by far of 3D caricature models crafted by professional artists. It consists of 2,000 high-quality and diversified 3D caricatures that are richly labeled with paired 2D caricature images, camera parameters, and 3D facial landmarks. A novel baseline approach is also presented to validate the usefulness of the proposed dataset. It combines the merits of flexible implicit functions and the robust parametric mesh representation. Specifically, we transfer the details from the implicit reconstruction to a template mesh with the help of VC-GCN, which accurately predicts 3D landmarks for the implicit mesh. Extensive benchmarking on our dataset has been performed, including a variety of popular approaches. We found that reconstructing a 3D caricature from a single 2D caricature image is a highly challenging task with ample opportunity for improvement. We hope 3DCaricShop and our baseline approach can shed light on future research in this field.
This work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, the Key Area R&D Program of Guangdong Province with grant No. 2018B030338001, Guangdong Research Project No. 2017ZT07X152, the National Natural Science Foundation of China 61902334, Shenzhen Fundamental Research (General Project) JCYJ20190814112007258.
Appendix A Structure of G2L Network
In this section, we illustrate the structure of the Global to Local (G2L) module of VC-GCN. As shown in Fig. 10, the outputs of the local-view GCN and those of the global-view GCN are fed into the G2L network. First, we change the channels of the global and local features with a global-GCN and a local-GCN, whose structures are the same as those employed in the whole pipeline. Then, trainable G2L weights are obtained by matrix multiplication between the two transformed features, followed by a softmax operation. Finally, the updated local-view features are given by the following formula:
where means matrix multiplication. In Fig. 10, is the batch size of inputs. and represent the number of nodes in the local and global graph, with and defined as the number of feature channels. Empirically, is set to 32, considering the trade-off between efficiency and accuracy.
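The global-to-local fusion described in this appendix resembles an attention step, and can be sketched as follows. This is only an illustration under explicit assumptions: plain linear maps `w_local` and `w_global` stand in for the local-GCN and global-GCN that change the channel count, the softmax is taken over the global nodes, and the attended global features are added residually to the local ones. The batch dimension is omitted and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def g2l_fusion(local, global_, w_local, w_global):
    """Fuse global graph features into one view's local features.

    local: (N_l, C) local-view features; global_: (N_g, C) global features.
    w_local, w_global: (C, c) channel-changing projections (stand-ins for GCNs).
    Returns the updated local features (N_l, C) and the G2L weights (N_l, N_g).
    """
    query = local @ w_local                     # (N_l, c)
    key = global_ @ w_global                    # (N_g, c)
    weights = softmax(query @ key.T, axis=1)    # each local node attends to all global nodes
    return local + weights @ global_, weights
```

With weights of this form, every local node can draw on landmarks that are not visible in its own view, which is the motivation the text gives for fusing global information in a non-local manner.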
Appendix B More Qualitative Results
Fig. 11 shows that the reconstruction results using our proposed 3D landmark localization approach could capture the large exaggerations more accurately than other settings. For example, the long chin of the second sample is not distorted. Fig. 12 shows the necessity of landmark-guided registration. Without 3D landmarks, the outputs of NICP [amberg2007optimal] fail to fit the accurate shape of the face, and PCA [blanz1999morphable] projection helps to further reduce artifacts in the final results.
We show a failure case in Fig. 13, where the estimated normal map is blurry, especially in the mouth region, which leads to an inaccurate result.
We present more visual results in Fig. 14 to show the effectiveness of our framework. In addition, we show more qualitative results for the ablation studies on the framework.
Appendix C Applications
The proposed framework generates caricature meshes with uniform topology. With the well-defined topology, various applications can be developed. In Fig. 15, we present mesh generation by interpolating among the predicted caricature meshes.
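Because all predicted meshes share one topology, interpolation reduces to blending vertex positions while reusing the template's faces. The sketch below assumes a simple weighted average of vertices; the function name `interpolate_meshes` is hypothetical, and the blending scheme used for Fig. 15 may differ.

```python
import numpy as np

def interpolate_meshes(vertices, weights):
    """Blend meshes that share one topology.

    vertices: (M, V, 3) vertex positions of M meshes with identical connectivity.
    weights: (M,) non-negative blending weights; they are normalized to sum to 1.
    Returns the blended vertex positions (V, 3); faces are those of the template.
    """
    vertices = np.asarray(vertices, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, vertices, axes=1)
```

Sweeping the weights from `[1, 0]` to `[0, 1]` morphs one caricature into another, which would not be possible with the raw implicit reconstructions since their vertex counts and connectivity differ.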
In Fig. 16, we compare the rigging results with AliveCaric-DL (ADL). Both results are animated using the same skeleton and skinning weights for fair comparison. We show that our method supports faithful rigging of our results and preserves better geometric details than ADL.