SCULPTOR: Skeleton-Consistent Face Creation Using a Learned Parametric Generator

by   Zesong Qiu, et al.
Tencent QQ

Recent years have seen growing interest in 3D human face modeling due to its wide applications in digital humans, character generation and animation. Existing approaches have overwhelmingly emphasized modeling the exterior shapes, textures and skin properties of faces, ignoring the inherent correlation between inner skeletal structures and appearance. In this paper, we present SCULPTOR, 3D face creations with Skeleton Consistency Using a Learned Parametric facial generaTOR, aiming to facilitate easy creation of face models that are both anatomically correct and visually convincing via a hybrid parametric-physical representation. At the core of SCULPTOR is LUCY, the first large-scale shape-skeleton face dataset, built in collaboration with plastic surgeons. Named after the fossils of one of the oldest known human ancestors, our LUCY dataset contains high-quality Computed Tomography (CT) scans of the complete human head before and after orthognathic surgeries, critical for evaluating surgery results. LUCY consists of 144 scans of 72 subjects (31 male and 41 female) where each subject has two CT scans taken pre- and post-orthognathic operations. Based on our LUCY dataset, we learn a novel skeleton consistent parametric facial generator, SCULPTOR, which can create the unique and nuanced facial features that help define a character and at the same time maintain physiological soundness. Our SCULPTOR jointly models the skull, face geometry and face appearance under a unified data-driven framework, by separating the depiction of a 3D face into shape blend shape, pose blend shape and facial expression blend shape. SCULPTOR preserves both anatomic correctness and visual realism in facial generation tasks compared with existing methods. Finally, we showcase the robustness and effectiveness of SCULPTOR in a variety of novel applications.



1. Introduction

The amazing variety of human faces – far greater than that of most other animals – makes each of us unique and easily recognizable (Little et al., 2011; Quian, 2017; Wu, 2020). From the plump cheeks of the Mona Lisa to the strong and finely chiseled chins shared by different versions of Batman, facial traits are the defining characteristics of human characters, real or virtual, physical or digital. To faithfully model human faces, existing approaches have overwhelmingly emphasized modeling the exterior shapes, textures and skin properties of faces. Physically-based techniques based on photometric or multi-view 3D scanning can now recover ultra-fine geometry at the pore level. Yet, they generally require bulky and expensive apparatus (Ghosh et al., 2011) and have been limited to celebrities in feature film productions. So far, 3D scanned models still exhibit far less variety than real faces.

To enrich the diversity of facial models, tremendous efforts have been focused on developing easy-to-use face generators, ranging from earlier parametric models such as 3DMM (Blanz and Vetter, 1999) to the latest data-driven models such as FLAME (Li et al., 2017) and DECA (Feng et al., 2021). Despite their effectiveness, few techniques employ anatomic facial bone structures in the model generation process. Physiologically, facial appearance is biomechanically related to skeletal structures: bones grow under pulling force and are resorbed under pressure; consequently, strong muscles are often matched with thick bones, inducing characteristic contours and features. Ignoring biomechanical rationality can produce absurd results that are far less convincing than anatomically consistent ones. The variety of faces therefore has boundaries and should conform to anatomical rules. For face generators, it is hence crucial to provide easy-to-use controls over the facial skeleton structures while maintaining physiological soundness.

Facial skeletons further serve as a key invariant to appearance: the body weight and muscle composition as well as the skin texture and elasticity of the same character may change over time, but the skeleton geometry remains largely unchanged. Take the renowned actor Christian Bale as an example: across feature films from "The Machinist" to "Batman Begins" and "Vice", he went through drastic body weight changes that led to dramatic variations in facial appearance. His facial skeleton can serve as a constraint to help disentangle shapes, bones, and appearance, as well as enable interpolation or extrapolation of the respective attributes for future auditions.

In this paper, we present SCULPTOR, 3D face creations with Skeleton Consistency Using a Learned Parametric facial generaTOR that we derive from comprehensive anatomical studies. SCULPTOR aims to facilitate easy creation of face models that are both anatomically consistent and visually convincing via a hybrid parametric-physical representation.

At the core of SCULPTOR is LUCY, the first large-scale shape-skeleton face dataset, built in collaboration with plastic surgeons. Named after the fossils of one of the oldest known human ancestors, our LUCY dataset contains high-quality Computed Tomography (CT) scans of the complete human head before and after orthognathic surgeries, critical for evaluating surgery results. CT, as a 3D medical imaging technique, is widely used in orthodontics for diagnosis, treatment planning, mock surgery and post-treatment assessment (Agrawal et al., 2013). LUCY consists of 144 scans of 72 subjects (31 male and 41 female) where each subject has two CT scans taken pre- and post-orthognathic operations. Specifically, we obtain accurate 3D surgical landmarks and segmentation annotations of the internal mandible and maxilla skeleton and of the external facial geometry, both labelled by experienced plastic surgeons. To correlate inner skeletal structures with appearance, we also acquire the exterior 3D facial geometry along with texture maps. To this end, we employ 3dMD (3dMD, 2022), a structured-light-based multi-view RGBD scanning system, to recover initial 3D facial geometry with skin textures before and after operations.

Next, we utilize a general neutral head mesh that consists of mandible, maxilla and outer surface mesh as the template geometry and set out to learn the parametric SCULPTOR model. SCULPTOR aims to separate the depiction of a 3D face into shape blend shape, pose blend shape and facial expression blend shape. Therefore we conduct mesh registration and model learning analogous to techniques used on exterior faces, hands, and even full body shapes  (Loper et al., 2015; Romero et al., 2017; Li et al., 2017) to train SCULPTOR parameters including skinning weight and various blend shapes. Validations on the ground truth pre- and post-orthognathic surgery data further demonstrate that SCULPTOR is reliable and accurate. In particular, compared with prior art  (Li et al., 2017; Blanz and Vetter, 1999), SCULPTOR can create unique facial features that help define a character and at the same time maintain physiological soundness.

The skeleton consistent nature of SCULPTOR benefits a variety of applications including: (1) Facial geometry estimation from incomplete facial bone structures, which can be obtained from sources ranging from a partial CT scan to archaeological explorations. (2) Augmenting existing facial assets by adjusting exterior face geometry or even fusing facial appearance from different characters while enforcing anatomical consistency. (3) Face/skull inference from scanned face models or even images, by employing a differentiable network layer analogous to (Li et al., 2017; Feng et al., 2021). (4) Supporting physically correct facial animations of drastic head/face movements as well as under external forces by first inferring and then imposing inner skeletal structures as constraints.

To summarize, our main technical contributions include:

  • We present LUCY, a comprehensive shape-skeleton correlated face dataset from pre- and post-surgery CT imaging and 3D scans. LUCY contains rich annotations on surgical landmarks and semantic segmentation labels and will be disseminated to the community after de-identification and anonymization.

  • We derive a skeleton consistent face generator model SCULPTOR from LUCY that jointly models the skull, face geometry, and face appearance under a unified data-driven framework. Compared with the SOTA, SCULPTOR preserves both anatomic correctness and visual realism in facial generation tasks.

  • We apply SCULPTOR to the aforementioned applications and demonstrate its robustness and effectiveness. In particular, SCULPTOR helps to enrich currently scarce 3D face data with physical correctness.

2. Related Work

In this section, we review contemporary related studies on Data acquisition for Parametric models, 3D Parametric Face Models and Anatomically-Constrained Parametric Face Models.

Data acquisition for Parametric models. To synthesize the surface of faces with different shape, pose and appearance, parametric face models (Li et al., 2017; Wu et al., 2016; Bao et al., 2021; R et al., 2021) estimate a low-dimensional parametric space to approximate 3D geometry. They assume that human body shape geometries lie on a manifold, which can provide prior statistical knowledge that helps to solve ill-posed vision problems. The primary ingredient of a parametric model is a representative set of 3D shapes, usually coupled with corresponding appearance data acquired from the real world (Egger et al., 2020). Laser scanners, time-of-flight sensors, multi-view photogrammetry and structured light systems are commonly used for 3D face data acquisition. Subsequently, geometric, photometric and hybrid methods are applied to capture facial shape information from the scanned data. The facial appearance can also be constructed via back-projection methods using the reconstructed facial surface mesh as prior guidance.

Recently, tremendous technical improvement has been made in the acquisition of facial performance and morphology, especially on detailed skin micro-structure (Nagano et al., 2015; Gotardo et al., 2018), hair (Hu et al., 2015), eyes (Bérard et al., 2016), eyelids (Bermano et al., 2015; Wen et al., 2017), beards (Olszewski et al., 2020), lips (Garrido et al., 2016), teeth (Wu et al., 2016) and tongue (Ploumpis et al., 2020). In addition, medium-scale details (wrinkles) are captured from monocular input in real-time (Cao et al., 2015; Habermann et al., 2019). Anatomical constraints have proven useful in estimating the rigid transformation of the skull (rigid stabilization) (Beeler and Bradley, 2014; Ichim et al., 2017; Wu et al., 2016; Madsen et al., 2018) and extracting detailed flesh deformations (Wu et al., 2016). However, limited by the light field data capture systems, real relationships between facial skeleton, shape and appearance are not well explored.

Figure 2. Overview of building SCULPTOR, which includes mesh registration to all the interior and exterior face features as well as the photometric appearances in the LUCY dataset, and the parametric model training process over the dataset. SCULPTOR learns a skeleton consistent face model that can generate much more diverse faces while following anatomic principles.

The most common imaging techniques for acquiring facial anatomical structures are computed tomography (CT) and magnetic resonance imaging (MRI). CT provides clear skeleton and face contour information but is constrained by its high radiation exposure, so it is permitted only for clinical use. MRI is a safe alternative that provides good contrast in human soft tissue (such as muscle, fat, tendon and nerve) (Misaki et al., 2014). However, the complex structure of human soft tissue and the unclear skeletal structure in MRI make it difficult to extract anatomical structures related to facial appearance. Therefore, most existing anatomically constrained parametric face models are based on artificial skulls (Beeler and Bradley, 2014) or public CT head datasets (Kustár et al., 2013). The dataset of Kustár et al. (2013) provides CT head scans from 400 patients. However, it cannot be used to model how the face changes when part of the skull is modified.

Parametric Face Models. Duan et al. (2015, 2014) built a statistical skull-face shape model from 114 CT scans, enabling a mapping function from skull to face. However, their face model is not equipped with a pose variation space. Most current pose-enabled parametric face models focus on the face’s outer surface, shape and texture information (Li et al., 2017; Wu et al., 2016; Bao et al., 2021; R et al., 2021). For example, FLAME (Li et al., 2017) learned a comprehensive facial shape and expression model that captures a wide range of outer facial shapes. To generate more realistic geometries for exaggerated expressions such as shouting and laughing, FLAME models an abstract jaw articulation as the axis of mandibular motion. However, without training on precise jaw articulation locations, expressions with large jaw motions generated by FLAME still lack realism. Bao et al. (2021) created a differentiable parametric human head model to recover high-resolution realistic facial geometry and texture of the whole head’s outer surface. Without specific variations in facial expression and jaw motion, the model tends to generate facial geometry with only gentle poses and expressions. These models are typically global, meaning that the entire face is parameterized holistically. Local or region-based parametric face models have also been proposed, which offer more flexibility at the cost of being less constrained to realistic human face shapes. Brunton et al. (2014) used many localized multilinear models to reconstruct faces from noisy or occluded point cloud data. Neumann et al. (2013) extracted sparse localized deformation components from an animated mesh sequence, also with the goal of intuitive editing as well as statistical processing of the face. Wu et al. (2016) proposed a local 3D face model that parameterizes the face into many overlapping patches and explicitly encodes the local deformation of each patch rather than local positions, which allows monocular face reconstruction and single-view direct editing with unprecedented fidelity. Feng et al. (2021) introduced an animatable detailed 3D face model called DECA that disentangles person-specific details from expression-dependent wrinkles, allowing the model to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged.

The internal skeletal structure of the face strongly affects facial appearance. Therefore, physical or anatomical constraints are essential for modeling face pose and shape with a large variation space. Shui et al. (2017) created two independent statistical shape models for the skull and the face, respectively. The skull models were built from a large dataset of CT head scans, and a linear mapping was found between the two models, following a common pipeline from automatic forensic facial reconstruction (Claes et al., 2010). Madsen et al. (2018) proposed to solve the problem of aligning an independent face model to a skull model with stochastic optimization. A probabilistic joint face-skull model is then constructed, which is able to infer a distribution of plausible face shapes given a skull shape. Besides the facial skeleton, other kinds of attributes also play an important role in generating a more realistic face shape and appearance. In several approaches, researchers suggest changing the facial attributes in a final step to adjust for soft tissue variation due to age and Body Mass Index (BMI) differences as a complementary approach (Gruber et al., 2020). For example, by integrating a template skull model together with the underlying facial muscles, Phace (Ichim et al., 2017) produced physically-correct facial animation. Li et al. (2020) added teeth, gums, eyeballs, eye blending, lacrimal fluid, eye occlusion, and eyelashes to the face model to ensure anatomical correctness of the generated digitized face assets. In face capture tasks, Wu et al. (2016) used the underlying bone structure to anatomically constrain the local skin thickness, simultaneously solving for the skin surface and the skull position for every video frame, yielding a rigidly stabilized performance. Despite the limitations of these approaches, they indicate that anatomical structure is an essential component of physically correct facial animation.

Inspired by a recently proposed parametric hand bone model built from an MRI dataset (Li et al., 2021), which captures the inner kinetic structure of the hand in a data-driven manner, we use medical imaging techniques to build a novel face generator that jointly models the skull, face geometry, and facial appearance. This allows us to modify facial appearance through the anatomical structure and to generate more realistic and diverse faces.

Model | Parametric | Skull | Face | Anatomically Consistent | Shape | Pose | Expression | Appearance | Trait
(Madsen et al., 2018)
(Gruber et al., 2020)
(Ichim et al., 2017)
(Li et al., 2020)
(Li et al., 2017)
Table 1. SCULPTOR vs. existing face models.

3. Overview

Figure 3. (a) 29 skeletal landmarks labeled on a CT scan. Skeletal structures modified during orthognathic surgery are marked with different colors. (b) Orthogonal slices of a CT scan, indicating the separated maxilla (white) and mandible (blue). (c) 15 facial landmark positions labeled on face appearance scans. (d) Facial appearance scan using the 3dMDface system in preparation for orthognathic surgery.

In this work, we propose SCULPTOR, a novel face generator that jointly models the skull, face geometry, and face appearance under a unified data-driven framework.

The overview of our method is shown in Fig. 2. SCULPTOR is developed from the first large-scale shape-skeleton face dataset LUCY that contains high-quality Computed Tomography (CT) scans of the complete human head pre and post orthognathic surgeries. Accurate 3D surgical landmarks and segmentation annotations on the internal mandible and maxilla skeleton and the external facial geometry are labeled by experienced plastic surgeons. Additionally, exterior 3D facial geometry along with texture maps are scanned using 3dMD.

We then utilize a general neutral head mesh that consists of mandible, maxilla and outer surface mesh as the template geometry to learn the statistical skeleton-driven facial model from LUCY. We conduct mesh registration to all the interior and exterior face features and photometric appearances in the dataset. We train SCULPTOR parameters including skinning weight and various blend shapes via model learning techniques used on exterior faces, hands, and even full body shapes  (Li et al., 2017; Romero et al., 2017; Loper et al., 2015; Li et al., 2022). SCULPTOR learns a skeleton consistent face model that can generate a much more diverse facial appearance while following anatomic principles.

The rest of the paper is organized as follows: we first introduce our data collection and annotation process in Section 4. Then, we present the model formulation in Section 5.1, followed by model registration and parameter learning on shape, trait, pose and appearance in Section 5.3. The procedure of the skeleton driven face generation and editing is shown in Section 5.4. In Section 6, we demonstrate the effectiveness of SCULPTOR on a variety of applications.

4. Building LUCY

4.1. Data Acquisition and Original Usage

Background on original data usage. We actively collaborate with orthognathic surgeons to collect real-world data that show the skeleton-consistent variation of the facial outer surface. The medical images collected during surgery planning and recovery clearly depict the influence of facial skeleton changes on facial appearance, especially on the face’s outer surface. To achieve the best surgical outcome, each patient underwent two CT scans, pre- and post-surgery (Fig. 3(b)), as well as multi-view facial scans captured by the hospital's imaging system (Fig. 3(d)). This routine examination captures the patient's skeletal structure and facial appearance features.

Data acquisition parameter. To avoid unnecessary radiation exposure, we retrospectively adopt CT scans from archived medical records at the Department of Oral & Craniomaxillofacial Surgery, Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine. In all, a total of 72 individual subject head CT image pairs (pre and post-surgery), as well as the multi-view face appearance scans are collected. The 3D maxillofacial CT imaging was performed using a spiral CT scanner (Light speed 16; GE, Gloucestershire, UK), with image spatial resolution . These real-world data help us to build a more realistic parametric model.

4.2. Data Labeling

The CT volume is a regular volumetric grid of scalar values representing tissue mass density. To analyze the anatomical structure, bone and joint locations must be annotated. To extract the anatomical structure from raw CT data, specialists segment the skull and face from the CT volume using thresholding and morphological operations. In addition, tissues around the condyle structures are carefully annotated to break the connection between mandible and maxilla, yielding separated mandible and maxilla volumes and the facial outer surface. The CT volume and the multi-view face scan for orthognathic surgery are both captured in a neutral pose. We thus merge the face reconstructed from the multi-view scan with the CT facial geometry to add facial details that are smoothed out during CT scanning. Specifically, we apply ICP (Besl and McKay, 1992) to align the multi-view scans to the facial soft tissue captured in CT. We thus obtain an anatomically consistent, detailed facial scan and skull. Then, biologically meaningful landmarks are manually annotated on the mandible, maxilla and face surface for presurgical planning; in our work, 29 skeletal and 15 face surface landmarks are selected as semantic landmarks for model registration. Fig. 3(a) and Fig. 3(c) show the skeletal and facial landmark positions, respectively.
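The ICP alignment step mentioned above can be illustrated with a minimal point-to-point sketch. This is not the pipeline used to build LUCY; it assumes both surfaces are available as plain point arrays, and the function names `kabsch` and `icp` are hypothetical names chosen for illustration. The same least-squares rigid fit on known correspondences is what a Procrustes alignment on landmarks reduces to when scale is fixed.

```python
import numpy as np
from scipy.spatial import cKDTree


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, both (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(src, dst, iters=50, tol=1e-10):
    """Point-to-point ICP aligning src (N, 3) to dst (M, 3).

    Returns the accumulated rotation, translation and the aligned points.
    """
    tree = cKDTree(dst)
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    prev_err = np.inf
    for _ in range(iters):
        dist, idx = tree.query(cur)              # nearest-neighbour matches
        R, t = kabsch(cur, dst[idx])
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = float(np.mean(dist ** 2))          # error before this update
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total, cur
```

In practice ICP only converges to the correct pose from a reasonable initial alignment, which is why a landmark-based rigid fit is a common first step.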


Figure 4. Our realistic face generation pipeline with trait effect. Starting with SCULPTOR full template, we randomly generate and procedurally add shape, trait, appearance and expression/pose effects on the neutral template, rendering the 3D face with environment maps.

The interior skeletal structures determine the exterior face shape, geometry, and skin properties. Inspired by this, we build a skeleton-driven parametric face model that provides expressive control over facial skeletons to achieve faithful exterior geometry and a realistic appearance. To achieve this goal, we jointly model the internal skeletal structure of the face with its exterior shape and geometry. In addition, we add a skeleton modification module trained on the pre- and post-orthognathic surgery medical image data, which we call the characteristic generator. The generator significantly enhances the model’s ability to represent and control skeletal structures and their influence on face shape.

5.1. Model Formulation

Our model follows a general formulation with two components: a geometry component, which covers both the skeleton and the face, and an appearance component, which models the face appearance. Together they are controlled by five sets of parameters, governing pose, shape, trait, expression and appearance, respectively.

In our model, we propose a parametric face model whose appearance varies more accurately with the inner skeleton structure, obtained by extending the human face model FLAME (Li et al., 2017) to a physically precise parametric human skeleton model in the canonical pose. Additionally, to further enhance the ability to control human face appearance using the skeleton, we introduce a novel parameter, called the "trait component", to represent a physiologically reasonable variation space for the internal skeleton of the face, as well as the face shape variations caused by changes of the skeleton.

The geometry is obtained by applying a Linear Blend Skinning (LBS) function, with learned skinning weights, to a person-specific head mesh. The shape, trait, pose and expression spaces are each parameterized by a coefficient vector, and the person-specific head mesh applies the corresponding variations to the general template.

The anatomical joint location for the jaw is computed by a sparse matrix that regresses the joint location from the personalized skull vertices, with shape and trait components applied. Different from FLAME (Li et al., 2017), which regresses the jaw joint from facial vertices, our joint regressor is defined by experienced surgeons as the midpoint of the mandible condyles (Zoss et al., 2018), as the condyle is the key anatomical structure involved in jaw rotation and translation. Since mandible movement lies on a 6 Degree Of Freedom (DoF) manifold (Zoss et al., 2018), containing both rotation and translation, we model jaw movement using the pose parameter. The pose space is thus defined via the mandible joint, plus an additional global head orientation.
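A joint regressor of this kind can be sketched as a one-row sparse matrix that averages the two condyle regions. The vertex index sets and the function name `condyle_midpoint_regressor` below are hypothetical; the paper only states that the joint is the midpoint of the mandible condyles, so the choice of averaging each condyle's vertices first is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix


def condyle_midpoint_regressor(n_vertices, left_ids, right_ids):
    """Sparse (1, n_vertices) matrix J such that J @ V is the midpoint
    between the centroids of the left and right condyle vertex sets."""
    cols = np.concatenate([left_ids, right_ids])
    vals = np.concatenate([
        np.full(len(left_ids), 0.5 / len(left_ids)),    # half of left centroid
        np.full(len(right_ids), 0.5 / len(right_ids)),  # half of right centroid
    ])
    rows = np.zeros(len(cols), dtype=int)
    return csr_matrix((vals, (rows, cols)), shape=(1, n_vertices))
```

Because the matrix is applied to the personalized skull vertices, the joint moves automatically whenever shape or trait coefficients change the skull.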

The personalized template is a linear combination of the general head template, the shape blend shape, the trait blend shape, the pose blend shape and the expression blend shape. We define the general head template with an outer surface and inner skull geometry, including a mandible and a maxilla; the shape, pose and expression blend shapes are defined similarly to FLAME. We refer readers to the supplemental page for details.
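The following sketch shows one way such a template combination and a single-joint skinning step could look in NumPy. All names (`personalized_template`, `skin_jaw`, the basis arrays `S`, `Tr`, `E` and coefficients `beta`, `tau`, `psi`) are assumptions for illustration, the pose blend shape is omitted for brevity, and the jaw is the only articulated joint here, so this is a simplified FLAME-style stand-in rather than the SCULPTOR implementation.

```python
import numpy as np


def rodrigues(axis_angle):
    """Rotation matrix from an axis-angle vector (3,)."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def personalized_template(T, S, Tr, E, beta, tau, psi):
    """Template (N, 3) plus linear shape, trait and expression offsets.

    S, Tr, E are bases of shape (N, 3, n_coeffs); beta, tau, psi are the
    matching coefficient vectors.
    """
    return T + S @ beta + Tr @ tau + E @ psi


def skin_jaw(V, joint, jaw_rot, jaw_trans, w_jaw):
    """Linear blend skinning with two transforms: identity and a rigid jaw
    motion (rotation about `joint` plus translation), blended per vertex."""
    R = rodrigues(jaw_rot)
    moved = (V - joint) @ R.T + joint + jaw_trans
    return (1.0 - w_jaw)[:, None] * V + w_jaw[:, None] * moved
```

Allowing a translation in addition to the rotation reflects the 6 DoF description of mandible motion in the text.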

In SCULPTOR, our goal is to improve the model’s ability to generate more characteristic faces under a physiologically correct constraint. However, simply using shared coefficients to represent facial and skeletal shape can hardly meet this demand. First, the parametric model focuses on global face shape variation across the population, which limits its ability to capture rare individual facial features. Second, skeletal modification follows a certain anatomical distribution. Therefore, to build a physiologically consistent facial variation space, we need to learn from real-world medical data to avoid artificial modifications that reduce the realism of face generation.

Our trait component improves the model’s ability to generate more characteristic faces under a physiologically correct constraint. Specifically, it is a weighted linear combination of the trait blend shapes, with the trait parameter supplying the weights. We define the trait blend shape as the variation of the corresponding vertices between skull and face. Due to the high deformation complexity of the non-rigid soft tissue between skull and face, it is difficult to model the inner muscles and fat efficiently while remaining anatomically consistent. We therefore adopt the plastic surgery CT data, which contain pre-surgery and post-surgery scans of the human face. These data record how the skull and face states change between before and after the operation, which provides the skull-face correspondence variation and lets us define the skull-face variation space as the trait space. For more details, please refer to the supplemental page.
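As a rough illustration of how a linear trait space might be extracted from paired scans, the sketch below runs PCA on the per-vertex displacement between registered pre- and post-surgery meshes. The details of the actual construction are in the paper's supplemental material, so both the choice of PCA on displacements and the function name `trait_basis` are assumptions.

```python
import numpy as np


def trait_basis(pre, post, n_components):
    """PCA of post-minus-pre displacements.

    pre, post: (S, N, 3) registered meshes of S subjects in shared topology
    (skull and face vertices stacked together).
    Returns the mean displacement (N, 3), the basis (n_components, N, 3)
    and the standard deviation along each component.
    """
    D = (post - pre).reshape(len(pre), -1)       # one row per subject
    mean = D.mean(axis=0)
    _, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
    basis = Vt[:n_components]
    std = s[:n_components] / np.sqrt(max(len(pre) - 1, 1))
    return mean.reshape(-1, 3), basis.reshape(n_components, -1, 3), std
```

Because skull and face vertices share each basis vector, moving along a component changes the bone and the soft tissue together, which is the coupling the trait component is meant to capture.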

5.2. Registration

The first step in building a parametric model is to associate the template skull mesh and face mesh to all the individual skulls and faces in our LUCY dataset.

Registration on skull. The major difficulty is that the CT-generated skull mesh is incomplete around orbit and cheekbone, where the bones are too thin compared with CT image resolution, thus leaving numerous small holes in the mesh. To address this issue, we employ a semantic embedded deformation scheme based on (Xu et al., 2019), where we enforce larger mesh regularization for incomplete and noisy areas.

First, the skull template and CT skull are roughly aligned using Procrustes rigid alignment on landmark correspondences. Then we use embedded deformation to recover skull details. We uniformly sample control nodes on the template surface at a fixed interval. Neighboring nodes are connected to form a graph, and the deformation of each template vertex is defined as a weighted blend of the transformations of its nearby nodes, where each node carries its own transformation and an influence weight on the vertex. We compute the weights using a Radial Basis Function (Rhee et al., 2007), where a larger weight indicates a closer distance and stronger influence. We optimize the node deformations by minimizing an energy composed of a dense term, a sparse landmark term and a regularization term.

The dense term enforces vertex alignment between the deformed template and the target CT skull. It combines the Chamfer Distance (Borgefors, 1983) between the two meshes with a penalty on the angle between corresponding vertex normals; the normal penalty prevents the template from fitting to the inner side of the maxilla, where the vertex normals are opposite. The sparse term enforces landmark alignment; it computes the L2 distance between the template landmark set and the target landmark set.

Following (Li et al., 2021) and (Xu et al., 2019), we adopt the as-rigid-as-possible motion regularization term that encourages neighbouring nodes to deform similarly. Please refer to (Xu et al., 2019; Newcombe et al., 2015) for the complete formulation of this term. Instead of using a uniform weight for all vertices, we set the weights for the orbit and nasal regions 50 times larger than for other parts. As a result, the vertices of these regions are more likely to deform in response to neighboring nodes, discarding erroneous correspondences caused by an incomplete target.
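The node-based deformation described above can be sketched as follows. This uses the common embedded-deformation blend, in which each node applies a rotation about its own position plus a translation; the Gaussian kernel, the `k` nearest nodes, and the names `rbf_weights` and `deform` are assumptions for illustration, and the exact formulation in (Xu et al., 2019) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree


def rbf_weights(vertices, nodes, k=4, sigma=1.0):
    """Gaussian RBF weights of each vertex w.r.t. its k nearest nodes
    (k >= 2), normalized so the weights of every vertex sum to 1."""
    dist, idx = cKDTree(nodes).query(vertices, k=k)
    w = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))   # closer node, larger weight
    w /= w.sum(axis=1, keepdims=True)
    return w, idx


def deform(vertices, nodes, R, t, w, idx):
    """Embedded deformation: v' = sum_j w_j * (R_j (v - g_j) + g_j + t_j).

    nodes: (M, 3) node positions g_j; R: (M, 3, 3); t: (M, 3).
    """
    g = nodes[idx]                                   # (V, k, 3)
    local = vertices[:, None, :] - g                 # vertex relative to node
    rotated = np.einsum("vkij,vkj->vki", R[idx], local)
    return np.sum(w[..., None] * (rotated + g + t[idx]), axis=1)
```

Optimizing a few hundred node transforms instead of every vertex position is what lets the regularizer carry sparse or unreliable regions, such as the thin orbital bones, along with their better-constrained neighbours.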

Registration on face. Similarly, the face template and target face are roughly aligned using Procrustes rigid alignment on landmark correspondences. Then we optimize the template deformation by minimizing an energy composed of the mesh distance, the landmark difference and a regularization term.

To maintain the well-defined topology of the template and suppress noise from the original face data, we follow (Li et al., 2017) and adopt the discrete Laplacian regularization term. The mesh distance term and the landmark term are identical to those in Eq. 5 and Eq. 6.
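A simplified version of such a registration energy is sketched below: a symmetric Chamfer distance on vertices, a squared landmark distance, and a uniform Laplacian term that penalizes changes in each vertex's offset from its neighbours. The weights `w_lmk` and `w_lap`, the uniform (rather than cotangent) Laplacian, the squared distances, and the omission of the normal penalty are all assumptions made to keep the example short.

```python
import numpy as np
from scipy.spatial import cKDTree


def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d_ab, _ = cKDTree(b).query(a)
    d_ba, _ = cKDTree(a).query(b)
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))


def landmark_term(verts, lmk_ids, target_lmks):
    """Sum of squared distances between template and target landmarks."""
    return float(np.sum((verts[lmk_ids] - target_lmks) ** 2))


def uniform_laplacian(verts, neighbors):
    """Per-vertex offset from the mean of its neighbours."""
    return np.stack([verts[i] - verts[nbrs].mean(axis=0)
                     for i, nbrs in enumerate(neighbors)])


def registration_energy(verts, template, neighbors, target,
                        lmk_ids, target_lmks, w_lmk=1.0, w_lap=1.0):
    """Mesh distance + landmark difference + Laplacian regularization."""
    lap = uniform_laplacian(verts, neighbors) - uniform_laplacian(template, neighbors)
    return (chamfer(verts, target)
            + w_lmk * landmark_term(verts, lmk_ids, target_lmks)
            + w_lap * float(np.sum(lap ** 2)))
```

The Laplacian term is invariant to rigid translation of the whole mesh, so it penalizes only local distortion of the template and leaves the data and landmark terms to drive the global fit.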

5.3. Parameter Learning

After registration, the skulls and corresponding faces from LUCY have been aligned to the same topology as our general template. We then train the model parameters similarly to previous works (Li et al., 2017; Romero et al., 2017; Loper et al., 2015; Li et al., 2022). Though the data in LUCY are captured under a neutral pose and expression, the definition of "neutral" varies from subject to subject, so we assume that each scan has a minor pose and expression variation. Thus, to better disentangle shape from pose and expression, we adopt the expression basis from (Li et al., 2020) for disentanglement and use FaceScape (Yang et al., 2020) to learn the pose-related parameters, i.e., the skinning weights and pose blend shapes, so that we can better neutralize the LUCY data.

We first learn the initial shape and trait parameters on LUCY, then train pose parameters on FaceScape. We iterate the learning on two data sources until convergence. We start with the general template as the initial and initialize skinning weight with which is transferred from FLAME (Li et al., 2017) via RBF kernel (Rhee et al., 2007).

Learning on LUCY. We first use the LUCY dataset to train skeleton-consistent shape and trait components . To this end, we need to solve for the neutral template for each subject, , as well as the corresponding minor pose parameter and expression parameter . To disentangle the neutral facial shape from jaw pose and expressions, we train model parameters in two alternating steps: pose/expression and shape. In the first step, we fix the shape parameters and optimize for pose and expression; in the second step, we fix pose and expression and optimize for shape. This training process is carried out on the whole LUCY dataset without separating pre-surgery and post-surgery data.

Specifically, for each subject , we compute the deformed model with optimized parameters and Eqn.2, then minimize the following objective function:


where the data term measures the Euclidean distance between the deformed template mesh and the target registration, and the edge term computes the corresponding edge length difference. We refer readers to (Loper et al., 2015; Li et al., 2017) for further details. is the discrete Laplacian (Kobbelt et al., 1998) term. We apply it on to force the vertices to preserve the original topology distribution, so that it is robust to registration noise. is a shape regularizer that constrains the outer surface of to lie in the initial shape space by restraining the projected shape coefficients to be zero (Li et al., 2020). It is defined as follows:


where is the facial geometry of the neutral template of subject , which is defined as the initial shape basis from (Li et al., 2020). We only add a shape regularizer to the facial geometry here because (Li et al., 2020) only has a shape basis for the outer surface.

After iteratively learning for pose, expression and shape on LUCY, we obtain an estimated personal template for both pre- and post-surgery data. Then we separate the trait from the shape component by first applying principal component analysis (PCA) on the post-surgery subset to obtain and mean shape . Then, we compute by performing PCA on the vertex offset of pre- and post-surgery data by to model the trait component.
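The two PCA steps described above can be sketched with NumPy as follows. The toy array sizes, the random stand-in data, the helper name `pca_basis` and the choice to mean-center the pre/post offsets before the second PCA are all assumptions made for illustration; they are not taken from the training code.

```python
import numpy as np

def pca_basis(data, n_components):
    """Return (mean, components); components has shape (n_components, dim)."""
    mean = data.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

rng = np.random.default_rng(0)
n_subjects, n_coords = 6, 12          # toy sizes: 4 vertices x 3 coordinates
post = rng.normal(size=(n_subjects, n_coords))               # stand-in post-surgery templates
pre = post + 0.1 * rng.normal(size=(n_subjects, n_coords))   # stand-in pre-surgery templates

# Shape space: PCA on the post-surgery subset.
mean_shape, shape_basis = pca_basis(post, n_components=3)

# Trait space: PCA on per-subject vertex offsets between pre- and post-surgery data.
_, trait_basis = pca_basis(pre - post, n_components=2)
```

Keeping the two bases separate means a subject's overall identity and its surgery-related local change can be varied independently.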

Learning on FaceScape. After acquiring , and from the LUCY dataset, we set out to learn pose-related parameters . First, we retopologize the FaceScape dataset to match our outer surface model. Then, given the registered facial geometry, we solve for the person-specific template and compute the joint position with . Then we optimize subject-specific parameters and and global parameters .

The objective energy function is defined as follows:


where is the same as in Eqn. 8, except that we only compute the L2 distance between the surface of the deformed model and the registered target, as FaceScape only contains face geometry. To avoid collision between skull and face, we use the collision term, in (Hasson et al., 2019) to penalize collision. To avoid overfitting to the face, we add regularization on the parameters:


where , and are L2 regularization terms on the corresponding parameters. prevents the optimized from drifting far away from the initial . Meanwhile, similar to (Loper et al., 2015), we use the Frobenius norm for to keep the pose blend shape sparse.

Optimization Summary. We alternately optimize the parameters on the two datasets. , optimized from LUCY are passed to FaceScape; the learning process on FaceScape then updates skinning weight and pose blend shape . The iteration continues until convergence.

Appearance Modeling. Following (Qian et al., 2020), we model appearance with texture maps. Besides the texture maps from LUCY, we add extra online face texture assets from (3DSCANSTORE, 2022). We use a pre-defined UV-map to unify all the texture data and then perform principal component analysis directly on images. Thus is able to produce a skin texture image given a random appearance parameter .

Figure 5. Performance of skeleton driven characteristic face editing on one female (top row) and one male (bottom row) actors’ face using the trait space in SCULPTOR. Each row of the partial enlargement displays the facial characteristic variations according to a representative trait component.
Figure 6. Examples of randomly generated facial appearance variations of SCULPTOR on a male and a female face respectively.

5.4. Skeleton Consistent Generator

Skeleton driven face generator. Most existing parametric face models achieve remarkable performance on face shape, expression detail and skin texture when generating high-quality and realistic face appearance. However, it is hard to locally carve characteristic skeletal details of the face, such as the shape of the jaw, the height of the cheekbones or the curve of the eyebrows. SCULPTOR jointly models the correlation of skull and face as well as characteristic variance in a data-driven manner, thus deriving a skeleton consistent face generator model. We show our procedural face generation process in Fig. 4. We start with the SCULPTOR average template model and randomly generate a shape vector to vary the face shape. Then, taking a randomly synthesized vector as input, the model is further modified in its local skull shape details, which adds features such as a higher cheekbone, a more curved jaw, or even mandibular protrusion to the generated face. For example, in Fig. 4, we synthesize a man with a long square chin. Afterward, we add facial appearance, pose, and expression parameters to generate a face with a widely opened mouth. We render the final image with the Blender (Blender, 2021) Cycles render engine using a high-resolution environment texture.
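The first two generation steps, varying shape and then trait on top of the average template, can be sketched as a linear blend. This is a simplified illustration: pose, expression, skinning and appearance are omitted, the bases are random stand-ins, and the names `beta` (shape coefficients) and `gamma` (trait coefficients) are hypothetical labels chosen for this sketch.

```python
import numpy as np

def generate_neutral(mean_shape, shape_basis, trait_basis, beta, gamma):
    """Mean template plus linear shape and trait offsets (flattened vertex vector)."""
    return mean_shape + beta @ shape_basis + gamma @ trait_basis

rng = np.random.default_rng(1)
dim = 9                                  # toy size: 3 vertices x 3 coordinates
mean_shape = rng.normal(size=dim)        # stand-in for the average template
shape_basis = rng.normal(size=(4, dim))  # stand-in shape components
trait_basis = rng.normal(size=(2, dim))  # stand-in trait components

beta = rng.normal(size=4)    # random shape vector: global face and skull variation
gamma = rng.normal(size=2)   # random trait vector: local skeletal detail

shape_only = generate_neutral(mean_shape, shape_basis, trait_basis, beta, np.zeros(2))
with_trait = generate_neutral(mean_shape, shape_basis, trait_basis, beta, gamma)
```

Since skull and face vertices would share one coefficient vector in such a model, changing `gamma` moves both consistently, which is the property the generator relies on.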

Skeleton driven characteristic face editor. As shown in Fig. 4, because CT training data are limited in their ability to capture local face details, the face geometry generated directly from the proposed template is somewhat over-smooth. We thus propose an alternative face generation pipeline for skull editing that starts from a real 3D face. We first fit the SCULPTOR model to the 3D face geometry, producing a smoothed model face. Then, similarly to the previous section, by taking a randomly synthesized vector as input, we add more trait structure details to the skull model. Meanwhile, the face geometry changes accordingly. Finally, we add the personal skin offset back to the modified face geometry and generate a new face with a characteristic face contour and appearance details. The performance of this generator is shown in Fig. 5.

6. Application

6.1. Archaeological Skeletal Facial Completion

During archaeological excavation, archaeologists rarely recover complete evidence. This is especially true of human skeletons: over long periods and under the complex influence of natural factors, organic materials such as human bones are harder to preserve intact than inorganic relics. As one of the most mysterious ancient females, part of Ava’s remains was found during the excavation in Caithness, Scotland, 1987. This 4200-year-old young lady only left her mandible, and her DNA indicates that she likely had bronze skin, brown eyes and black hair (Hoole et al., 2018). Using Ava’s mandible and teeth as a basis, archaeologists have commissioned forensic artists to conduct the original facial reconstruction of Ava. Using SCULPTOR, we have more statistical evidence and flexibility to complete her facial skeleton and infer different facial appearances.

We first use the maxilla part of the proposed SCULPTOR model to fit Ava’s maxilla. Specifically, we adopt our model with 300 shape components to optimize the shape parameter with energy function: , where is the mean vertex error in Eqn. 8 while regularizes the shape parameter.

We optimize by regressing the maxilla part of our parametric model. Then, the optimized shape coefficient is applied to model components related to the maxilla, mandible and face geometry to generate the corresponding face and mandible.
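The idea of fitting shape coefficients to a partial structure and then applying them to the whole model can be sketched as a regularized least-squares problem. This sketch swaps the iterative optimization for a closed-form ridge solution, and the function name, toy dimensions and regularization weight `lam` are assumptions made for illustration.

```python
import numpy as np

def fit_shape_to_part(mean, basis, part_idx, target_part, lam=1e-3):
    """Ridge fit of shape coefficients using only the coordinates in part_idx.

    Minimizes ||mean[part] + beta @ basis[:, part] - target_part||^2 + lam * ||beta||^2.
    """
    B = basis[:, part_idx]                    # (k, m) basis restricted to the observed part
    r = target_part - mean[part_idx]          # (m,) residual to explain
    A = B @ B.T + lam * np.eye(B.shape[0])    # (k, k) normal equations with Tikhonov term
    return np.linalg.solve(A, B @ r)

rng = np.random.default_rng(0)
mean = rng.normal(size=30)                    # stand-in mean model (skull + face, flattened)
basis = rng.normal(size=(3, 30))              # stand-in shape components
beta_true = np.array([0.5, -1.0, 2.0])
full = mean + beta_true @ basis               # synthetic complete subject

part_idx = np.arange(12)                      # pretend these coordinates are the observed part
beta_hat = fit_shape_to_part(mean, basis, part_idx, full[part_idx], lam=1e-9)
completed = mean + beta_hat @ basis           # coefficients applied to the whole model
```

In this noise-free toy case the observed part determines the coefficients, so the unobserved coordinates are recovered as well; with real data the regularizer keeps the completion close to the mean when the observed part is ambiguous.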

To generate various versions of Ava with the fixed maxilla and different mandibles, we further apply our model with 300 shape components and 50 trait components to jointly optimize and on with energy function: , where is a randomly initialized vector that brings a creative and anatomically consistent modification to Ava’s skeletal structure. The effect of the trait components mainly focuses on the mandible shape, but there is still a slight influence on the maxilla. We thus need to slightly refine the model by minimizing the energy function . We additionally define to keep as close to the initialization of as possible when we optimize and .

6.2. Character Fusion

For generating more diverse and realistic faces, we intend to exchange the upper and lower facial skeleton structures of two different actors, instead of simply adjusting the coefficients of the parametric trait components in SCULPTOR.

Specifically, we first generate a new skull using the maxilla of actor and mandible of actor as . Then we optimize the pose parameter , shape parameter and trait parameter to minimize:


where regularizes the trait parameter and is designed to keep the generated face as close as possible to the original faces. Since the face and skeleton shape and trait components share the same weight parameters, optimizing them for the new skull also determines the corresponding face. After fixing the skull model, we can use pose and expression parameters to control the facial pose and expression of the newly generated face and thus add more variation to its appearance.

6.3. Skull Inference from Image

Similar to other parametric models (Li et al., 2017; Romero et al., 2017; Loper et al., 2015; Li et al., 2022), our model is easy to apply to skull inference tasks from images. We treat our model as a differentiable layer that takes pose , shape , trait , expression and appearance parameters as input and directly outputs a 3D head with inner skull and outer skin geometry as well as high-quality facial textures.

For skull inference from images, we build upon the DECA network structure (Feng et al., 2021), which takes 2D in-the-wild images as input and outputs an animatable 3D face. We replace the FLAME model in the DECA training loop with our model and add another regression branch for the trait components; we then train the network to regress model parameters and camera extrinsics on the FFHQ dataset (Karras et al., 2019). Similarly to DECA, we use landmark re-projection loss, photometric loss, identity loss, shape consistency loss, parameter regularization loss and detail reconstruction loss to train the network in an analysis-by-synthesis manner. We refer readers to (Feng et al., 2021) for details. Note that our anatomical landmark definition deviates from the general facial landmark definition, so we manually redefine the corresponding landmarks in our template facial geometry to match the dataset annotation.

Figure 7. Model evaluation on compactness and generalization. From left to the right: (a) Shape space (b) Trait space (c) Appearance space.

6.4. Face Generation with Lipo Level Change Effect

To create the lipo level change effect, we start with multiple images of the same subject with varying levels of lipo. Then we infer the skull and head shape with neutral pose and expression from each image. Since we assume that the 3D offsets between the face and skull do not vary much between subjects throughout our model learning, i.e., a constant lipo level distribution, the predicted skull shape is different for each image. However, in reality, the skull shape of an adult is invariant to body weight gain or loss. As a result, we use the prediction with the minimum fitting error as the subject’s anticipated skull, since it most closely matches the lipo level distribution of our dataset. We denote it as and the corresponding neutral head as . To create a realistic lipo level transition effect, (Ichim et al., 2017) proposed to use a hand-painted "lipo map" to specify which areas of the head are more prone to body weight accumulation. Similarly, we automatically create a person-specific "lipo map" with and all predicted head shapes . It is defined by vertex weight on the face mesh:


where denotes the vertex on the face surface mesh, is the face mesh count and denotes the face mesh. A larger weight specifies that the vertex is more likely to shift when the lipo level changes, namely that it varies more with weight gain and loss. We also normalize by its maximum for the following computation. See Fig. 12 for the visualization of personal lipo maps. Next, we apply PCA on the offset of all head shapes from the neutral one and use the principal components and coefficients as an initial person-specific puffiness component. Although directly modifying achieves a smooth lipo level change effect, the results may suffer from unnatural deformation and visual artifacts around the eyes and ears, as these regions are expected to be invariant to weight change. Thus, we further optimize and such that the deformation matches the personal lipo map. Denoting the recovered head mesh using lipo components as , the optimization energy is defined as:


constrains the deformation to align with the inferred face where the lipo map has larger values, and reduces the deformation effect for regions with a smaller weight. Finally, we perform another PCA on , so we can interpolate and extrapolate anatomically consistent faces with various puffiness levels by modifying . The results are shown in Fig. 12.
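A person-specific lipo map of the kind described in this section can be sketched as a per-vertex weight. Since the defining equation is not reproduced here, this sketch assumes that the weight of a vertex is its mean displacement between the neutral head and each predicted head, followed by maximum normalization; that assumption and the function name are illustrative only.

```python
import math

def lipo_map(neutral, heads):
    """Per-vertex weight: mean displacement from the neutral head, max-normalized."""
    raw = [
        sum(math.dist(head[i], neutral[i]) for head in heads) / len(heads)
        for i in range(len(neutral))
    ]
    peak = max(raw)
    return [w / peak for w in raw] if peak > 0 else raw

# Toy example with two vertices and two predicted heads:
# vertex 0 (a cheek-like vertex) moves a lot, vertex 1 barely moves.
neutral = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
heads = [
    [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    [(4.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
weights = lipo_map(neutral, heads)
```

Vertices with weights near 1 are those that moved most across the subject's images, so a deformation guided by this map concentrates changes there and leaves low-weight regions nearly fixed.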

Model pre-surgery post-surgery
SCULPTOR 1.77 1.77
Table 2. Quantitative results for skull fitting performance. We evaluate mean squared error in millimeters on pre- and post-surgery test scans. SCULPTOR-SIMPLE stands for the simplified version which only models shape, while SCULPTOR models both shape and trait components.

6.5. Facial Animations

To further demonstrate the advantage of modeling the inner structure, we apply physical simulation to our model to include secondary deformation under dynamic pose and external force. We show results of the skin deformation during rapid head-shaking and under a fist punch. Similar to (Kozlov et al., 2017), given a generated 3D head with inner skull and outer skin, we create a simple volume of tissue between them. We use a tetrahedral mesh to represent the soft tissue and have artists manually tune the material parameters for soft-body dynamics. Using collision detection, we additionally ensure that no skin penetrates the skull. To compare with models without inner structure, we treat the whole head as a single volume by omitting the inner skull and rerun the simulations. Visual results are shown in Fig. 13. Please also see the accompanying video for the dynamic sequence.

7. Experiments

Figure 8. Mean squared per-vertex error between FLAME and SCULPTOR-2, measured on pre-surgery, post-surgery and FaceScape test set.
Figure 9. Archaeological Skeletal Facial Completion. (a) The original maxilla of Ava and face generated using SCULPTOR without trait components. (b)-(d) Characteristic face generations with respect to Ava’s maxilla by varying trait parameters in SCULPTOR.

7.1. Implementation Details

In registration on the maxilla, we set to 0.5, as small holes and fragments exist in the raw maxilla mesh, which removes the effect of foldovers. We set to 0.005 and to 0.0005 in practice, so that our template can be aligned to the biological landmarks. In registration on the mandible, we set to 1.0, to 0.01 and to 0.001, since there are no fragments in the raw mandible mesh. For face registration, as we focus on maintaining the topology of the face during registration, we set to 0.03, to 10 and to 0.5. Each embedded deformation is performed 4 times with the node interval ranging from 50 mm to 2 mm. Registration takes approximately 5, 2 and 3 minutes per scan for the maxilla, mandible and face, respectively.

During parameter training, in the stage of learning from CT data, we keep and throughout the whole process. We first set the to update the initial pose without any shape in the first iteration, and keep the in the following optimization. We set to maintain the topology of the face and skull. In the FaceScape learning stage, we set to strictly prevent the skull from penetrating the face. For the remaining parameters, we set , , and . We iteratively optimize the parameters until the model converges, using the limited-memory BFGS (L-BFGS) optimizer in PyTorch. We optimize all the parameters on an NVIDIA GTX TITAN X GPU.

Model pre- post- FaceScape 3DRFE
FLAME 1.58 1.60 1.63 2.79
SCULPTOR-1 1.51 1.54 0.66 2.86
SCULPTOR-2 1.36 1.41 0.67 2.78
Table 3. Comparison with other face parametric models. We report mean squared error in millimeters in the first three columns, and chamfer distance (in millimeters) in 3DRFE dataset (the last column) on facial mesh fitting tasks. We use 300 shape components for FLAME, 300 shape components for SCULPTOR-1 and 228 shape components, 72 trait components for SCULPTOR-2.

7.2. Model Evaluation

We use two metrics, compactness and generalization, to evaluate the quality of our statistical model. The shape space is learned from 72 post-surgery patient scans and 400 face scans with fitted skulls from (Yang et al., 2020), and the trait space is computed from 72 pairs of pre-surgery and post-surgery CT scans. Finally, the appearance space is trained with 126 high-resolution texture maps, 48 of which are online assets from (3DSCANSTORE, 2022).


Fig. 7 (a), (b) and (c) (dark red curves) plot the compactness of the SCULPTOR shape, trait and appearance spaces, respectively. These curves depict the variance in the training data captured by a varying number of principal components. In Fig. 7(a), the curve implies that the first 25 principal components of the shape space cover 95% of the entire space, while 200 principal components express nearly the entire space. In the trait space, the first 50 principal components are plotted with a dark red curve in Fig. 7(b). It shows that the first 10 principal components cover over 80% of the trait space, while 50 components cover nearly 98% of it. The appearance space is also highly compact: the first 10 principal components cover 96% of the full space, while 25 components cover over 98% of the appearance space.
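Compactness as described above is the cumulative fraction of training variance captured by the first k principal components. A minimal NumPy sketch follows; the function name and the synthetic data, in which one direction dominates, are assumptions made for illustration.

```python
import numpy as np

def compactness(data):
    """Cumulative explained-variance ratio for k = 1..rank principal components."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2          # proportional to per-component variance
    return np.cumsum(variance) / variance.sum()

rng = np.random.default_rng(0)
# Toy data: 20 samples, 5 dimensions, the first dimension has 10x the spread.
data = rng.normal(size=(20, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
curve = compactness(data)
# Smallest number of components that covers at least 95% of the variance.
k_95 = int(np.searchsorted(curve, 0.95) + 1)
```

Reading off the smallest k at which the curve crosses a threshold such as 95% gives the kind of statement reported for each space.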

Model Missing maxilla (pre-) Missing maxilla (post-) Missing mandible (pre-) Missing mandible (post-)
SCULPTOR-1 2.304 2.235 2.668 2.421
SCULPTOR-2 2.149 2.156 2.259 2.193
Table 4. Quantitative reconstruction results on the missing mandible or maxilla task. We report Hausdorff distance in millimeters.


Fig. 7 (a), (b) and (c) (blue curves) plot the evaluation results on the generalization ability of the SCULPTOR shape, trait and appearance spaces, respectively. We use 45 additional CT head scans from archived medical records outside the LUCY training dataset to evaluate shape-space generalization. In Fig. 7(a), the blue curve depicts the shape space generalization error. The shape regression error is computed via the mean squared vertex error (MSVE), and the standard deviation is given in millimeters (mm). The vertex error on the test shapes decreases monotonically as the number of principal components increases. The vertex error falls below 3.25 mm and 2.33 mm using 50 and 175 principal components, respectively. In the trait space generalization study, which aims to generalize to unseen traits with a limited number of principal components, the leave-one-out strategy is adopted for the 72 pairs of pre-surgery and post-surgery CT scans in the training data. In Fig. 7(b), the blue curve shows how the reconstruction error varies with an increasing number of principal components. The mean squared error decreases to approximately 1.1 mm when using 50 principal components. Similarly, the error of the appearance space decreases as the number of components increases.
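The leave-one-out generalization test can be sketched as follows: each sample is held out, a PCA basis is built from the rest, and the held-out sample is reconstructed from its first k coefficients. The function name, the toy data and the use of the mean squared per-vertex distance as the error are assumptions of this sketch.

```python
import numpy as np

def loo_generalization(data, n_components):
    """Mean squared per-vertex reconstruction error under leave-one-out PCA."""
    errors = []
    for i in range(len(data)):
        train = np.delete(data, i, axis=0)
        mean = train.mean(axis=0)
        _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
        basis = vt[:n_components]                    # orthonormal rows
        coeffs = (data[i] - mean) @ basis.T          # project the held-out sample
        recon = mean + coeffs @ basis
        residual = (recon - data[i]).reshape(-1, 3)  # one row per vertex
        errors.append(np.mean(np.sum(residual ** 2, axis=1)))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
# Toy data that lies exactly in a 2-D affine subspace of a 12-D space (4 vertices).
directions = rng.normal(size=(2, 12))
low_rank = 1.0 + rng.normal(size=(8, 2)) @ directions
err_k1 = loo_generalization(low_rank, 1)
err_k2 = loo_generalization(low_rank, 2)

# Unstructured data: the error still cannot increase as components are added.
noisy = rng.normal(size=(10, 12))
err_curve = [loo_generalization(noisy, k) for k in (1, 3, 6)]
```

Because the bases for increasing k are nested, the squared error is non-increasing in k, which matches the monotonically decreasing curves described above.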

Figure 10. Example of character fusion result. The fused face is generated by blending mandible of Ben Affleck into the maxilla of Christian Bale using SCULPTOR.

Qualitative Results. Fig. 5 displays the skeleton driven characteristic face editing performance on one female and one male actor’s face using the trait space in SCULPTOR, following the process described in Section 5.4. Each row of the partial enlargement displays the characteristic facial variations according to a representative trait component: the first row varies jaw width, the second row jaw length and the third row the relative position between jaw and maxilla. The proposed trait space in SCULPTOR extends the parametric model so that it can control local skeletal changes and produce the corresponding facial characteristic variations.

Fig. 6 shows a group of randomly generated facial appearance variations of SCULPTOR. The right three columns show appearance changes on a male face with a relatively higher cheekbone and longer mandible, while the left three columns present a female face with a narrow face contour and a round, smooth jaw. As can be seen, our texture space provides a realistic color space for face appearance generation.


The proposed face trait space is built as an additional local variation on top of the parametric shape space. We conduct an ablation study to clarify the effect of using the trait space. We trained a simplified SCULPTOR model that removes the trait components and uses both pre- and post-surgery CT data from 72 identities for training the shape parameter space. The simplified model is denoted as SCULPTOR-SIMPLE and compared with the standard SCULPTOR in Tab. 2. SCULPTOR-SIMPLE uses 144 shape components while SCULPTOR uses 72 shape plus 72 trait components. We use another 12 pairs of pre- and post-surgery CT data to test the two models’ skull fitting performance. Mean Vertex Squared Error (MVSE) is used for evaluating the skull fitting results. The average MVSE values for pre- and post-surgery data from SCULPTOR are both lower than those from the SCULPTOR-SIMPLE model, which demonstrates the effectiveness of adding the trait space in SCULPTOR in elevating the expressiveness of the skeletal structure and capturing nuanced shape differences.

Figure 11. Qualitative result of skull inference from RGB images for actors with a variety of face shapes.

Comparison with other models. In the face reconstruction task, we experiment on 12 pairs of CT test data, 200 neutral face mesh test data from FaceScape (Yang et al., 2020) and 22 neutral faces from 3DRFE (Ma et al., 2007). We evaluate the SCULPTOR-1 model with 300 shape components and 53 expression components, and the SCULPTOR-2 model with 228 shape components, 53 expression components and 72 trait components. SCULPTOR shape components are built from the PCA components on the neutral faces from the post-operation part of LUCY and the FaceScape (Yang et al., 2020) dataset. We compare with the FLAME model (Li et al., 2017) with 300 shape components and 100 expression components. Table 3 reports the average reconstruction error on each group of test data, using facial area RMSE in millimeters (mm) on the pre-surgery, post-surgery and FaceScape datasets. The result on the 3DRFE (Ma et al., 2007) dataset (last column) is reported as chamfer distance in millimeters (mm). Using the same number and type of principal components, SCULPTOR-1 produces lower errors than FLAME on the pre-surgery, post-surgery and FaceScape test sets. In the SCULPTOR-2 model, we replace the last 72 shape components of SCULPTOR-1 with 72 trait components. SCULPTOR-2 achieves better performance on surgery scans, but its fitting result on the FaceScape test set is not as good as that of SCULPTOR-1. This could be caused by the limited variation in face shapes in the FaceScape data. We visualize the facial mean squared per-vertex error in Fig. 8. On the unseen 3DRFE data, SCULPTOR-2 performs nearly the same as FLAME on chamfer distance.

Figure 12. Person specific lipo level change effect. We show facial deformation of two subjects under decreasing lipo levels. The blue models are the personal lipo maps associated with body weight accumulation.
Figure 13. Facial animation using SCULPTOR. A fist punch to the cheekbone results in realistic skin deformation which follows the shape of the maxilla.

7.3. Archaeological Skeletal Facial Completion

Fig. 9 presents the face generation results recovered from the maxilla of Ava. The original Ava maxilla and face constructed by forensic artists are shown in Fig. 9(a). Our parametric model is able to fit with the maxilla and recover the mandible and face appearance for Ava by optimizing .

Fig. 9(a) depicts our reconstructed Ava face with respect to the optimization on the maxilla using 100 components in the SCULPTOR shape space without using trait components. Then, by varying the trait parameter , we generate a variety of face appearances and mandible shapes that make each Ava look different. The visualization results are shown in Fig. 9(b), (c) and (d). Our reconstructed Ava’s face in (b) has a round and short chin, the chin in (c) is wide and the face is flat, while the face in (d) is narrow and long. The two figures in the same column show the same generated face rendered from different views. Our model provides a rich space of characteristic variation for generating characteristic faces. Moreover, we conducted the skull completion task on 12 pre- and post-surgery test data. Given the face and part of the skull, SCULPTOR is used to infer the missing mandible or maxilla. The Hausdorff distance between the inferred and the missing structure is reported in millimeters in Tab. 4.
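For reference, the Hausdorff distance between an inferred structure and the ground truth can be computed on vertex sets as in the brute-force Python sketch below. The function names are chosen for this sketch, and evaluating on vertices only (rather than on the continuous surfaces) is an assumption.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy example in millimeters: the sets share one point and differ in the other.
inferred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
d_forward = directed_hausdorff(inferred, ground_truth)
d_symmetric = hausdorff(inferred, ground_truth)
```

Unlike an averaged error, this metric is governed by the single worst-matched vertex, so it is sensitive to localized failures in the completed bone.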

7.4. Character Fusion

As shown in Fig. 10, we carry out the character fusion task with Christian Bale (top left) and Ben Affleck (bottom left) in Batman. The maxilla and mandible shapes of the two actors are clearly different. Our model fits Christian’s skull with a short maxilla and a pointed, long mandible, while Ben has a short, rounded mandible and a longer maxilla. The fused face is generated by blending Ben’s jaw into Christian’s maxilla and optimizing the pose, shape, and trait parameters in SCULPTOR. The generated face has a slim upper face that is similar to Christian’s, and the originally pointed chin is replaced by the short round chin of Ben. Then we add the pose and expression of an open mouth and raised eyebrows to the fused face, and render it under ambient lighting to get a realistic person.

7.5. Skull Inference from Image

The most important and variable feature for facial contouring is the shape and position of the mandible. People with a square face tend to have a wider mandible, while people with a long mandible tend to have a sharper chin. We present the qualitative skull inference results using SCULPTOR with 500 shape and 50 trait components, projecting the internal mandible of each face back to the images in Fig. 11 to analyze each person’s characteristic face. In order to accurately compare the facial differences in digital photographs, we adopt the "Iris Ruler" as the scale to measure the distance between face and camera when capturing the photo (Driessen et al., 2011). Each person’s iris is 11.5 mm wide, with almost no difference across race, sex and age, so it provides a reliable scale for comparing the width and length of the mandible. We measure two items: Mandible Width (MW), the distance between the left and right mandibular angles, and Ramus Height (RH), the distance between the mandibular angle and the condyle. We show the measurement results in Fig. 11. Benefiting from the detailed 3D face recovery using DECA, SCULPTOR is able to reconstruct the mandible position with high accuracy.
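The "Iris Ruler" scale reduces to a simple conversion from image pixels to millimeters using the 11.5 mm iris width. The sketch below shows that arithmetic; the function names and pixel coordinates are made up for illustration, and it assumes the measured landmarks lie at roughly the same depth as the iris so that a single scale applies.

```python
import math

IRIS_WIDTH_MM = 11.5  # iris width used as the reference scale

def mm_per_pixel(iris_width_px):
    """Image scale implied by the measured iris width in pixels."""
    return IRIS_WIDTH_MM / iris_width_px

def measure_mm(point_a, point_b, iris_width_px):
    """Distance in millimeters between two 2D image points."""
    return math.dist(point_a, point_b) * mm_per_pixel(iris_width_px)

# Hypothetical projected landmarks (pixel coordinates) and iris width.
iris_px = 23.0
left_mandibular_angle = (100.0, 400.0)
right_mandibular_angle = (300.0, 400.0)
condyle = (100.0, 280.0)

mandible_width = measure_mm(left_mandibular_angle, right_mandibular_angle, iris_px)
ramus_height = measure_mm(left_mandibular_angle, condyle, iris_px)
```

With an iris spanning 23 pixels, each pixel corresponds to 0.5 mm, so pixel distances between projected mandible landmarks convert directly to MW and RH values.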

7.6. Face Generation with Lipo Level Change Effect

Based on Section 6.4, we are able to achieve anatomically consistent face deformation results when the lipo level changes. As shown in Fig. 12, the left and right parts represent two individual subjects. The leftmost column of each part shows the personal lipo map, whose values specify which areas of the head are more prone to body weight accumulation. As can be observed, as the subject’s lipo level drops, the vertices with higher values on the lipo map shift closer to the skull. The impact is most noticeable on the cheeks, and less noticeable on the neck and top of the head. This is congruent with human anatomy, as the cheeks are more prone to weight accumulation than the top of the head and neck (Swift et al., 2021). Although this deformation only changes the surface geometry, simply altering skin vertices would result in unnatural deformation and visual artifacts caused by skin and skull collision. Since our model includes an inner skull, we can generate anatomically consistent lipo maps and use collision detection when deforming face vertices to ensure that the face is always outside the skull, thus achieving a more natural weight loss effect and the slimmest possible head shape.

7.7. Facial Animations

Fig. 13 shows simulated facial deformation under a fist punch to the cheekbone. We apply physical simulation on SCULPTOR. Since SCULPTOR models the inner skull, it acts as a constraint that prevents skin from penetrating the skull, thus reproducing realistic skin-skull collision under motion. In contrast, conventional surface models without inner skeletal structures lead to unrealistic sunken-skin artifacts. Please see the comparison sequence in the accompanying video.

8. Conclusion

3D human face modelling has attracted increasing attention over the last few years. Most existing approaches focus overwhelmingly on modeling the exterior shapes, textures and skin properties of faces to generate faithful human faces. However, these methods ignore the anatomic facial bone structures in the model generation process. In this paper, we proposed LUCY, the first comprehensive shape-skeleton correlated face dataset from pre- and post-surgery CT imaging and 3D scans. To better explore the inherent correlation between inner skeletal structures and appearance, we developed SCULPTOR, a novel skeleton consistent parametric facial generator that jointly models the skull, face geometry and face appearance under a unified data-driven framework. SCULPTOR preserves both anatomic correctness and visual realism in facial generation tasks compared with existing methods. The robustness and effectiveness of SCULPTOR have been shown in various applications, e.g. Archaeological Skeletal Facial Completion, Character Fusion, Skull Inference from Image, Face Generation with Lipo Level Change Effect and Physical Simulation. In particular, the skeleton consistent nature of our model design can enrich currently scarce 3D face data with physical correctness.

As analyzed above, SCULPTOR has several strengths in jointly modeling the correlation of skull and face as well as characteristic variance. We also want to highlight the potential drawbacks of this approach. As the data in LUCY only come from plastic surgery, which often changes the mandible, zygomatic and maxillary alveolar bone locally, the diversity of its distribution may be limited. The deformation of the trait component depends entirely on whether a certain part of the bone is surgically modified during the operation. Therefore, the trait component of our model is limited to these corresponding parts. In future work, we plan to enrich our LUCY dataset with the registration of existing off-the-shelf datasets, and further enhance the characteristic variance of our model. Besides, our model uses Linear Blend Skinning and expression blend shapes without muscle modeling. In the future, we plan to collect facial expression data with dynamic MRI, as MRI can quickly scan the face to obtain muscle and soft tissue information without radiation. Jointly modeling the skull, muscles, face geometry and face appearance to construct 3D human faces is an interesting area that remains to be explored.

This work was supported by NSFC programs (61976138, 61977047), the National Key Research and Development Program (2018YFB2100500), STCSM (2015F0203-000-06), SHMEC (2019-01-07-00-01-E00003) and Cultivation of Interdisciplinary Projects (YG2022ZD011).


  • 3dMD (2022) 3dMDface™ system. External Links: Link Cited by: §1.
  • 3DSCANSTORE (2022) 3D scan store: captured assets for digital artists. External Links: Link Cited by: §5.3, §7.2.
  • J. M. Agrawal, M. S. Agrawal, L. G. Nanjannawar, and A. D. Parushetti (2013) CBCT in orthodontics: the wave of future. The journal of contemporary dental practice 14 (1), pp. 153. Cited by: §1.
  • L. Bao, X. Lin, Y. Chen, H. Zhang, S. Wang, X. Zhe, D. Kang, H. Huang, X. Jiang, J. Wang, D. Yu, and Z. Zhang (2021) High-fidelity 3d digital human head creation from rgb-d selfies. ACM Trans. Graph. 41 (1). External Links: ISSN 0730-0301, Link, Document Cited by: §2, §2.
  • T. Beeler and D. Bradley (2014) Rigid stabilization of facial expressions. ACM Transactions on Graphics (TOG) 33 (4), pp. 1–9. Cited by: §2, §2.
  • P. Bérard, D. Bradley, M. Gross, and T. Beeler (2016) Lightweight eye capture using a parametric model. ACM Transactions on Graphics 35, pp. 1–12. External Links: Document Cited by: §2.
  • A. Bermano, T. Beeler, Y. Kozlov, D. Bradley, B. Bickel, and M. Gross (2015) Detailed spatio-temporal reconstruction of eyelids. ACM Transactions on Graphics 34, pp. 44:1–44:11. External Links: Document Cited by: §2.
  • P.J. Besl and N. D. McKay (1992) A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2), pp. 239–256. External Links: Document Cited by: §4.2.
  • V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194. Cited by: §1, §1.
  • Blender (2021) Cycles renderer. Cited by: §5.4.
  • G. Borgefors (1983) Chamfering: a fast method for obtaining approximations of the euclidean distance in n dimensions. In Proc. 3rd Scand. Conf. on Image Analysis (SCIA3), pp. 250–255. Cited by: §5.2.
  • A. Brunton, T. Bolkart, and S. Wuhrer (2014) Multilinear wavelets: a statistical shape space for human faces. Vol. 8689. External Links: ISBN 978-3-319-10589-5, Document Cited by: §2.
  • C. Cao, D. Bradley, K. Zhou, and T. Beeler (2015) Real-time high-fidelity facial performance capture. ACM Transactions on Graphics 34, pp. 46:1–46:9. External Links: Document Cited by: §2.
  • P. Claes, D. Vandermeulen, S. Greef, G. Willems, J. Clement, and P. Suetens (2010) Computerized craniofacial reconstruction: conceptual framework and review. Forensic science international 201, pp. 138–45. External Links: Document Cited by: §2.
  • J. P. Driessen, H. Vuyk, and J. Borgstein (2011) New insights into facial anthropometry in digital photographs using iris dependent calibration. International Journal of Pediatric Otorhinolaryngology 75 (4), pp. 579–584. External Links: ISSN 0165-5876, Document, Link Cited by: §7.5.
  • F. Duan, D. Huang, Y. Tian, K. Lu, Z. Wu, and M. Zhou (2015) 3D face reconstruction from skull by regression modeling in shape parameter spaces. Neurocomputing 151, pp. 674–682. External Links: ISSN 0925-2312, Document, Link Cited by: §2.
  • F. Duan, S. Yang, D. Huang, Y. Hu, Z. Wu, and M. Zhou (2014) Craniofacial reconstruction based on multi-linear subspace analysis. Multimedia Tools and Applications 73 (2), pp. 809–823. Cited by: §2.
  • B. Egger, W. Smith, A. Tewari, S. Wuhrer, M. Zollhöfer, T. Beeler, F. Bernard, T. Bolkart, A. Kortylewski, S. Romdhani, C. Theobalt, V. Blanz, and T. Vetter (2020) 3D morphable face models—past, present, and future. ACM Transactions on Graphics 39, pp. 1–38. External Links: Document Cited by: §2.
  • Y. Feng, H. Feng, M. J. Black, and T. Bolkart (2021) Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (TOG) 40 (4), pp. 1–13. Cited by: §1, §1, §2, §6.3.
  • P. Garrido, M. Zollhöfer, C. Wu, D. Bradley, P. Pérez, T. Beeler, and C. Theobalt (2016) Corrective 3d reconstruction of lips from monocular video. ACM Transactions on Graphics 35, pp. 1–11. External Links: Document Cited by: §2.
  • A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec (2011) Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph. 30 (6), pp. 1–10. External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • P. Gotardo, J. Riviere, D. Bradley, A. Ghosh, and T. Beeler (2018) Practical dynamic facial appearance modeling and acquisition. Vol. 37, pp. 1–13. External Links: Document Cited by: §2.
  • A. Gruber, M. Fratarcangeli, G. Zoss, R. Cattaneo, T. Beeler, M. Gross, and D. Bradley (2020) Interactive sculpting of digital faces using an anatomical modeling paradigm. In Computer Graphics Forum, Vol. 39, pp. 93–102. Cited by: Table 1, §2.
  • M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt (2019) LiveCap: real-time human performance capture from monocular video. ACM Transactions on Graphics 38, pp. 1–17. External Links: Document Cited by: §2.
  • Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid (2019) Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §5.3.
  • M. Hoole, A. Sheridan, A. Boyle, T. Booth, S. Brace, Y. Diekmann, I. Olalde, M. Thomas, I. Barnes, J. Evans, C. Chenery, H. Sloane, H. Morrison, S. Fraser, S. Timpany, and D. Hamilton (2018) ‘Ava’: a beaker-associated woman from a cist at achavanich, highland, and the story of her (re-)discovery and subsequent study. Proceedings of the Society of Antiquaries of Scotland 147, pp. 73–118. External Links: Link, Document Cited by: §6.1.
  • L. Hu, C. Ma, L. Luo, and H. Li (2015) Single-view hair modeling using a hairstyle database. ACM Transactions on Graphics 34, pp. 125:1–125:9. External Links: Document Cited by: §2.
  • A. Ichim, P. Kadleček, L. Kavan, and M. Pauly (2017) Phace: physics-based face modeling and animation. ACM Trans. Graph. 36 (4). External Links: ISSN 0730-0301, Link, Document Cited by: Table 1, §2, §2, §6.4.
  • T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §6.3.
  • L. Kobbelt, S. Campagna, J. Vorsatz, and H. Seidel (1998) Interactive multi-resolution modeling on arbitrary meshes. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’98, New York, NY, USA, pp. 105–114. External Links: ISBN 0897919998, Link, Document Cited by: §5.3.
  • Y. Kozlov, D. Bradley, M. Bächer, B. Thomaszewski, T. Beeler, and M. Gross (2017) Enriching facial blendshape rigs with physical simulation. In Computer Graphics Forum, Vol. 36, pp. 75–84. Cited by: §6.5.
  • Á. Kustár, L. Forró, I. Kalina, F. Fazekas, S. Honti, S. Makra, and M. Friess (2013) FACE-R: a 3d database of 400 living individuals’ full head ct- and face scans and preliminary gmm analysis for craniofacial reconstruction. Journal of forensic sciences 58. External Links: Document Cited by: §2.
  • R. Li, K. Bladin, Y. Zhao, C. Chinara, O. Ingraham, P. Xiang, X. Ren, P. Prasad, B. Kishore, J. Xing, and H. Li (2020) Learning formation of physically-based face attributes. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3407–3416. External Links: Document Cited by: Table 1, §2, §5.3, §5.3.
  • T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero (2017) Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §1, §1, §1, Table 1, §2, §2, §3, §5.1, §5.1, §5.2, §5.3, §5.3, §5.3, §6.3, §7.2.
  • Y. Li, M. Wu, Y. Zhang, L. Xu, and J. Yu (2021) PIANO: a parametric hand bone model from magnetic resonance imaging. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 816–822. External Links: Document, Link Cited by: §2, §5.2.
  • Y. Li, L. Zhang, Z. Qiu, Y. Jiang, N. Li, Y. Ma, Y. Zhang, L. Xu, and J. Yu (2022) NIMBLE: a non-rigid hand model with bones and muscles. ACM Trans. Graph. 41 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §3, §5.3, §6.3.
  • A. Little, B. Jones, and L. DeBruine (2011) Facial attractiveness: evolutionary based research. Philosophical transactions of the Royal Society of London. Series B, Biological sciences 366, pp. 1638–59. External Links: Document Cited by: §1.
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §1, §3, §5.3, §5.3, §5.3, §6.3.
  • W. Ma, T. Hawkins, P. Peers, C. Chabert, M. Weiss, and P. Debevec (2007) Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. pp. 183–194. External Links: Document Cited by: §7.2.
  • D. Madsen, M. Lüthi, A. Schneider, and T. Vetter (2018) Probabilistic joint face-skull modelling for facial reconstruction. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5295–5303. External Links: Document Cited by: Table 1, §2, §2.
  • M. Misaki, J. Savitz, V. Zotev, R. Phillips, H. Yuan, K. Young, W. Drevets, and J. Bodurka (2014) Contrast enhancement by combining t1-and t2-weighted structural brain mr images. Magnetic Resonance in Medicine 74. External Links: Document Cited by: §2.
  • K. Nagano, G. Fyffe, O. Alexander, J. Barbič, H. Li, A. Ghosh, and P. Debevec (2015) Skin microstructure deformation with displacement map convolution. ACM Transactions on Graphics 34, pp. 109:1–109:10. External Links: Document Cited by: §2.
  • T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt (2013) Sparse localized deformation components. ACM Transactions on Graphics (TOG) 32. External Links: Document Cited by: §2.
  • R. A. Newcombe, D. Fox, and S. M. Seitz (2015) Dynamicfusion: reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 343–352. Cited by: §5.2.
  • K. Olszewski, D. Ceylan, J. Xing, J. Echevarria, Z. Chen, W. Chen, and H. Li (2020) Intuitive, interactive beard and hair synthesis with generative models. pp. 7444–7454. External Links: Document Cited by: §2.
  • S. Ploumpis, E. Ververas, E. O’ Sullivan, S. Moschoglou, N. Pears, W. Smith, B. Gecer, and S. Zafeiriou (2020) Towards a complete 3d morphable model of the human head. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. 1–1. External Links: Document Cited by: §2.
  • N. Qian, J. Wang, F. Mueller, F. Bernard, V. Golyanik, and C. Theobalt (2020) HTML: a parametric hand texture model for 3d hand reconstruction and personalization. In European Conference on Computer Vision, pp. 54–71. Cited by: §5.3.
  • R. Quian (2017) How do we recognize a face?. Cell 169, pp. 975–977. External Links: Document Cited by: §1.
  • M. R, A. Tewari, H. Seidel, M. Elgharib, and C. Theobalt (2021) Learning complete 3d morphable face models from images and videos. pp. 3360–3370. External Links: Document Cited by: §2, §2.
  • T. Rhee, J.P. Lewis, U. Neumann, and K. Nayak (2007) Soft-tissue deformation for in vivo volume animation. In 15th Pacific Conference on Computer Graphics and Applications (PG’07), pp. 435–438. External Links: Document Cited by: §5.2, §5.3.
  • J. Romero, D. Tzionas, and M. J. Black (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) 36 (6). Cited by: §1, §3, §5.3, §6.3.
  • W. Shui, M. Zhou, S. Maddock, T. He, X. Wang, and Q. Deng (2017) A pca-based method for determining craniofacial relationship and sexual dimorphism of facial shapes. Computers in Biology and Medicine 90. External Links: Document Cited by: §2.
  • A. Swift, S. Liew, S. Weinkle, J. K. Garcia, and M. B. Silberberg (2021) The facial aging process from the “inside out”. Aesthetic Surgery Journal 41 (10), pp. 1107–1119. Cited by: §7.6.
  • Q. Wen, F. Xu, M. Lu, and J. Yong (2017) Real-time 3d eyelids tracking from semantic edges. ACM Transactions on Graphics 36, pp. 1–11. External Links: Document Cited by: §2.
  • C. Wu, D. Bradley, P. Garrido, M. Zollhöfer, C. Theobalt, M. Gross, and T. Beeler (2016) Model-based teeth reconstruction. ACM Transactions on Graphics 35, pp. 1–13. External Links: Document Cited by: §2.
  • C. Wu, D. Bradley, M. Gross, and T. Beeler (2016) An anatomically-constrained local deformation model for monocular face capture. ACM Transactions on Graphics 35, pp. 1–12. External Links: Document Cited by: §2, §2, §2, §2.
  • W. Wu (2020) Invited update: consensus on changing trends, attitudes, and concepts of asian beauty and consensus on current injectable treatment strategies in the asian face. Aesthetic Plastic Surgery 44. External Links: Document Cited by: §1.
  • L. Xu, W. Cheng, K. Guo, L. Han, Y. Liu, and L. Fang (2019) Flyfusion: realtime dynamic scene reconstruction using a flying depth camera. IEEE transactions on visualization and computer graphics 27 (1), pp. 68–82. Cited by: §5.2, §5.2.
  • H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao (2020) Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 601–610. Cited by: §5.3, §7.2, §7.2.
  • G. Zoss, D. Bradley, P. Bérard, and T. Beeler (2018) An empirical rig for jaw animation. ACM Trans. Graph. 37 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §5.1.