Learning 3D Human Body Embedding

05/14/2019 ∙ by Boyi Jiang, et al. ∙ USTC

Although human body shapes vary across identities and poses, they can be embedded into a low-dimensional space because of their structural similarity. Inspired by recent work on latent representation learning with a deformation-based mesh representation, we propose an autoencoder-like network architecture that learns disentangled shape and pose embeddings specifically for the 3D human body. We also integrate a coarse-to-fine reconstruction pipeline into the disentangling process to improve reconstruction accuracy. Moreover, we construct a large dataset of human body models with consistent topology for training the network. Extensive experiments confirm that the learned embedding not only achieves superior reconstruction accuracy but also provides great flexibility in 3D human body creation via interpolation, bilateral interpolation and latent space sampling. The constructed dataset and trained model will be made publicly available.




1 Introduction

This paper considers the problem of building a parametric 3D human body model, which maps a low-dimensional representation to a high-quality 3D human body mesh. Parametric human body models have a wide range of applications in the computer graphics and vision communities, such as human body tracking [41, 43], reconstruction [2, 7, 26] and pose estimation [29, 23]. However, building a robust and expressive parametric body model is challenging, because the human body varies widely with factors such as gender, ethnicity and stature. In particular, different poses cause significant deformations of the human body, which are hard to represent with widely used linear models such as the PCA models used for 3D face modeling.

The state-of-the-art work SMPL (skinned multi-person linear model) [27] separates human body variations into shape-related and pose-related variations: the former are modeled by a low-dimensional linear shape space with shape parameters, and the latter are handled by a skeleton skinning method with pose parameters derived from 3D joint angles. SMPL has a clear pose definition and can express a wide range of large-scale human poses, and its parameter-to-mesh computation is fast and robust. However, the reconstruction accuracy of the skeleton skinning method relies on the linear shape space spanned by the neutral blend shapes. The skinning weights of SMPL are shared across the neutral shapes of different identities, which further restricts its reconstruction ability. To overcome this problem, SMPL introduces pose-related blend shapes, but these are not disentangled from the blend shapes controlled by the shape parameters, which compromises the independence of the shape and pose parameters. While the pose parameters of SMPL explicitly define movements of the human skeleton and are well suited to character animation editing, they do not embed human body pose priors and may produce unexpected body meshes for an arbitrary set of pose parameters. Thus, pose prior constraints such as joint angle range constraints and a self-intersection penalty energy are needed to generate plausible body shapes [7, 1]. [23] uses a network discriminator to check, during training, whether generated pose parameters are consistent with the distribution of human motion. The limitations of SMPL motivate us to propose a new 3D human body representation, which learns a disentangled latent representation with higher reconstruction accuracy. Moreover, we seek pose latent parameters that encode human pose distributions.

With the advance of deep learning, autoencoders and variational autoencoders (VAEs) have been used to extract latent representations of face geometry [6, 32, 21]. In particular, [6] uses a ladder VAE architecture to effectively encode face shape at different scales and obtains high reconstruction accuracy. [32] defines up- and down-sampling operations on the face mesh and uses graph convolution to encode the latent representation, obtaining outstanding reconstruction accuracy for extreme facial expressions. The method proposed in [21] disentangles shape and pose attributes with two VAE branches and then fuses them back into the input mesh. By exploiting the strong nonlinear expressive capability of neural networks and a deformation representation, it decomposes the identity and expression parts better than existing methods. These works demonstrate the capability of deep-learning-based encoder-decoder architectures for learning human face embeddings. For 3D human body shape, although it can be disentangled in terms of identity and pose, it has a distinctive hierarchical deformation characteristic due to its articulated structure. Exploiting this characteristic is vital for improving body reconstruction accuracy.

Recently, Litany et al. [25] proposed a graph-convolution-based variational autoencoder for 3D body shape, which directly uses Euclidean coordinates as vertex features and encodes the whole shape without disentangling identity and pose attributes. However, an encoder-decoder architecture operating in the Euclidean domain may produce unnaturally deformed bodies from the latent embedding. Compared with Euclidean coordinates, the mesh deformation representation called ACAP (as consistent as possible) introduced in [14] can handle arbitrarily large rotations in a stable way and has good interpolation and extrapolation properties. Recent studies [37, 38] show that an autoencoder or VAE trained on deformation features can learn a powerful latent representation. However, [37, 38] are designed for learning embeddings of general 3D shapes. When applied to 3D human body modeling, they provide only one latent embedding that entangles shape and pose variations, which is insufficient for practical use.

Therefore, in this paper we propose to use a neural network to learn two independent latent representations from ACAP features: one for shape variations and the other for pose variations, both of which are specifically designed and learned for human body modeling. Moreover, a coarse-to-fine reconstruction pipeline is integrated into the disentangling process to improve the reconstruction accuracy. Our major contributions are twofold:

  • We propose a basic framework based on a variational autoencoder architecture for learning disentangled shape and pose latent embeddings of the 3D human body. Our framework introduces a hierarchical representation design, and its basic transformation module leaves great design freedom.

  • Learning on ACAP features [14] requires mesh data with the same topology, while existing human body datasets have different topologies. To address this issue, we convert multiple existing human body mesh datasets into a standard topology via a novel non-rigid registration method and construct a large-scale human body dataset. It consists of over 5000 human body meshes with the same topology, where each identity has a standard or neutral pose.

We have conducted various experiments. The results demonstrate the strength of our learned 3D human body embedding in many aspects, including modeling accuracy and interpolation.

2 Related Work

Human shape models. The intrinsic human shape variations are often modeled in human shape reconstruction [3, 35, 42, 30]. For example, Allen et al. [3] and Seo et al. [35] applied principal component analysis (PCA) to mesh vertex displacements to characterize the non-rigid deformation of human body shapes. Anguelov et al. [5] computed triangle deformations between the template and other shapes; performing PCA on the transformation matrices yielded more robust results. Allen et al. [3] also constructed a global correspondence between a set of semantic body shape parameters and the PCA parameters by linear regression. Zhou et al. [44] used a similar idea to semantically reshape human bodies from a single image. To extract a more local and fine-grained semantic parametric representation of body shape, Yang et al. [42] introduced a local mapping between semantic parameters and per-triangle deformation matrices, which provides precise semantic control of human body shapes.

Human pose shape models. To represent human pose shapes, skeleton skinning is often used, which computes vertex positions directly. Allen et al. [4] learned skinning weights for corrective enveloping and then solved a highly nonlinear equation that describes the relations among pose, skeleton, and skinning weights. SMPL [27] explicitly defines body joints, uses the skeleton to represent body pose, and computes vertex positions with the standard skinning method. SMPL is compatible with graphics rendering and game engines, has a simple computation process, and is easy to animate with the skeleton.

Deformation-based models. Mesh deformations have also been used to analyze 3D human body shape and pose [5, 11, 12, 18, 19, 17]. The most representative work is SCAPE [5], which analyzes body shape and pose deformation in terms of triangle deformations with respect to a reference mesh instead of vertex displacements. The deformation representation can encode detailed shape variations, but an additional least-squares optimization is needed to obtain the resulting vertex positions. This conversion is usually time-consuming, which prevents its use in real-time applications [39]. Chen et al. [10] extended the SCAPE [5] approach for real-time usage. Jain et al. [20] used a common skinning approach for modeling pose-dependent surface variations instead of per-triangle transformations, which makes pose estimation much faster than with SCAPE [5].

Deformation representation approaches. Considering the limitations of representations based on vertex coordinates, Gao et al. [13] proposed to use the rotation difference on each directed edge to represent the deformation. This representation, called RIMD (rotation-invariant mesh difference), is translation and rotation invariant. RIMD is suitable for mesh interpolation and extrapolation, but reconstructing vertex coordinates from RIMD requires solving a complicated optimization problem. The RIMD feature encodes a plausible deformation space. With the RIMD feature, Tan et al. [38] designed a fully connected mesh variational autoencoder network to extract a latent deformation embedding. However, as mentioned above, it does not provide disentangled shape and pose latent embeddings for 3D human modeling.

Gao et al. [14] further proposed another representation called the ACAP (as-consistent-as-possible) feature, which allows more efficient reconstruction and derivative computations. Using the ACAP feature, Tan et al. [37] proposed a convolutional autoencoder to extract localized deformation components from mesh datasets. Gao et al. [15] also used the ACAP feature to achieve automatic unpaired shape deformation transfer between two sets of meshes, and Wu et al. [40] used a simplified ACAP representation to model caricature face geometry. In this paper, we also use the ACAP feature but specifically focus on learning disentangled shape and pose embeddings for 3D human body shapes.

Fig. 1: Left: our reference mesh. Right: anatomical human body parts.

3 Overview

This section gives a high-level description of the new representation we propose for the 3D human body. Given a 3D human shape , we denote by its corresponding latent parameters. A common analysis for shape is to decompose the shape into a two-level representation, which can be abstracted as , where represents the low-frequency or base part of the shape and is the difference between and the base part, representing the high-frequency variation of shape .

For human body shape , the hierarchical decomposition is very suitable. Human body shape can be regarded as a composition of large-scale rigid transformations of anatomical components and relatively small-scale non-rigid deformations within each component. As shown in Fig. 1, a human body has an articulated structure, which can be decomposed into several nearly rigid parts. The movements of these parts and the joint connections correspond to , while soft tissue and muscle deformation correspond to the high-frequency difference part .

Based on the articulated structure of the human body, can be further modeled by a widely used strategy called skinning, which can be abstracted as , where is the reconstruction of a low-dimensional coarse shape defined on the rigid parts of the body and is a sampling operation that recovers the original shape of by combining elements of belonging to different parts. The body skeleton is a typical example of . can be implemented with a simple weighted average operation. Existing body shape representations like SCAPE [5], BlendScape [19] and SMPL [27] share this design characteristic.

Note that some human body related applications require an identity and pose disentangled body representation. Therefore we disentangle latent parameters into and , where controls 3D human body shape determined by identity and controls the shape variation determined by pose. Then our proposed human body representation can be:




aims to reconstruct the coarse-level shape by summing two independent parts and and then applying a mapping . The mapping operation is introduced on the summation to enhance the nonlinearity of the representation and thus improve its expressive ability. For the difference shape part , we follow the same design principles. For the body representation in Eq. (1), each mapping can be implemented with an MLP (multilayer perceptron) of arbitrary complexity. In addition, an end-to-end neural network can be integrated with this representation.

Fig. 2: The architecture of our proposed embedding learning network. The Encoder encodes ACAP feature into shape and pose latent pair . The Decoder has two decoding paths to generate base features and difference features respectively. The base feature captures large scale deformation determined by anatomical human body parts, and the difference feature describes the relatively small scale difference caused by different pose or soft tissue movement. Finally, the reconstructed feature is generated by summing the two features of the two paths. By setting pose latent code to be , we reconstruct the corresponding neutral body shape .

Another factor we need to consider is the geometric expression of human body shape and coarse-level shape . The Euclidean coordinate is a typical representation. However, it lacks translation invariance. As shown in [14, 13], deformation representation is more effective than Euclidean coordinate for large scale deformations, and shows excellent linear internal and external interpolation ability. In this work, we adopt ACAP [14] feature as our human body shape representation , and we extend it for coarse-level shape representation .

In the next few sections we give the detailed description of our proposed human body representation. In particular, we first give the deformation representation for and in Section 4. Then we present our neural network architecture and loss function design in Section 5. The construction of the body shape dataset is given in Section 6, and how to use the learned embedding is described in Section 7. Finally, the detailed experimental evaluations are reported in Section 8.

4 Deformation Representation

ACAP Feature. Assume that the mesh dataset consists of meshes with consistent connectivity. We choose one mesh as the reference and regard the other meshes as deformations of it. We denote the $i$-th vertex coordinates of the reference mesh and the target mesh by $\mathbf{p}_i$ and $\mathbf{p}'_i$, respectively. The deformation at vertex $i$ is described locally by an affine transform matrix $\mathbf{T}_i$ that maps the one-ring neighborhood of $\mathbf{p}_i$ in the reference mesh to its counterpart on the target mesh. The matrix is computed by minimizing

$$E(\mathbf{T}_i) = \sum_{j \in N_i} c_{ij} \left\| (\mathbf{p}'_i - \mathbf{p}'_j) - \mathbf{T}_i (\mathbf{p}_i - \mathbf{p}_j) \right\|_2^2, \tag{3}$$

where $c_{ij}$ is the cotangent weight and $N_i$ is the index set of the one-ring neighbors of the $i$-th vertex. Using polar decomposition, $\mathbf{T}_i = \mathbf{R}_i \mathbf{S}_i$, the deformation matrix is decomposed into a rigid component represented by a rotation matrix $\mathbf{R}_i$ and a non-rigid component represented by a real symmetric matrix $\mathbf{S}_i$. Following [14], the rotation matrix $\mathbf{R}_i$ can be further represented by a vector $\mathbf{r}_i \in \mathbb{R}^3$, and the symmetric matrix $\mathbf{S}_i$ by a vector $\mathbf{s}_i \in \mathbb{R}^6$. To resolve the ambiguity of the axis-angle representation of rotation matrices, Gao et al. [14] proposed an integer programming approach that solves for the optimal $\mathbf{r}_i$ globally and makes all $\mathbf{r}_i$ as consistent as possible. Interested readers are referred to [14] for details. Once $\mathbf{r}_i$ and $\mathbf{s}_i$ are available, we concatenate all $[\mathbf{r}_i, \mathbf{s}_i]$ together to form the ACAP feature vector $\mathbf{f} \in \mathbb{R}^{9|V|}$ for the target mesh, where $V$ represents the entire set of mesh vertices. In this way, we convert the target mesh into its ACAP feature representation.
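The per-vertex step described above (polar decomposition of the affine matrix, followed by packing the rotation and the symmetric factor into a 9-dimensional vector) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [14]: the function names are hypothetical, the polar decomposition is computed via SVD, the axis-angle conversion is the textbook formula and is not robust near an angle of π, and the global integer-programming step that makes the rotation vectors consistent is omitted.

```python
import numpy as np

def polar_decompose(T):
    """Split a 3x3 matrix into T = R @ S with R a rotation and S symmetric."""
    U, sigma, Vt = np.linalg.svd(T)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Guard against reflections: flip one singular direction.
        U[:, -1] *= -1
        sigma[-1] *= -1
        R = U @ Vt
    S = Vt.T @ np.diag(sigma) @ Vt
    return R, S

def rotation_to_axis_angle(R):
    """Return angle * unit_axis for a rotation matrix (angle in [0, pi))."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis

def vertex_feature(T):
    """Pack one affine matrix into 3 rotation + 6 symmetric entries."""
    R, S = polar_decompose(T)
    r = rotation_to_axis_angle(R)
    s = S[np.triu_indices(3)]  # upper triangle holds the 6 unique entries
    return np.concatenate([r, s])
```

Concatenating `vertex_feature(T_i)` over all vertices would give a vector of length nine times the number of vertices, matching the feature size described in the text.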

Coarse Level Deformation Feature. In SCAPE [5], the human body is segmented into some rigid parts. The rigid rotation of each part is considered as the basic transformation for each triangle on the part. Following this idea, we define 16 rigid parts on a human body as shown in Fig. 1. We denote by the set of mesh vertices belonging to the -th part. Similar to Eq. (3), we compute its deformation :


where is the mean position of the target mesh’s -th part. Similarly, we can represent using and . While axis-angle vector represents the same rotation for cycle on radian values, which causes ambiguity for , the ACAP feature has eliminated the ambiguities for all , . This means that all have consistent radian values. Therefore, we choose the specific that is closest to the mean of all of the -th part. Specifically, we modify into


where and are length and normalized vector of initial respectively, and is computed by solving the following optimization problem


Once and are found for all parts, we concatenate all together to form the coarse-level feature .
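The disambiguation step for a part rotation can be illustrated with a small sketch. It assumes, as the surrounding text suggests, that the ambiguity is a whole number of 2π cycles added to the rotation angle along a fixed axis, and that the chosen cycle count is the one that brings the part's rotation vector closest to the mean of the per-vertex rotation vectors of that part. The function name and the closed-form rounding are illustrative choices, not taken from the paper.

```python
import numpy as np

def align_axis_angle(r_part, r_mean):
    """Shift the angle of r_part by 2*pi*k so that it is closest to r_mean.

    r_part: initial axis-angle vector of the part (angle * unit axis).
    r_mean: mean of the consistent per-vertex rotation vectors of the part.
    """
    theta = np.linalg.norm(r_part)
    if theta < 1e-12:
        return r_part.copy()  # axis undefined; leave the vector unchanged
    n = r_part / theta
    # Minimizing ||(theta + 2*pi*k) * n - r_mean||^2 over integers k:
    # the continuous optimum is at angle n.r_mean, so round to nearest cycle.
    k = int(np.round((n @ r_mean - theta) / (2.0 * np.pi)))
    return (theta + 2.0 * np.pi * k) * n
```

For example, a part rotation of 0.5 rad about the z axis is shifted by one full cycle when the per-vertex mean sits near 0.5 + 2π about the same axis, and is left unchanged when the mean is near 0.5.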

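ACAP to Mesh. Converting a given ACAP feature vector back to the target mesh is much easier. In particular, we directly reconstruct $\mathbf{T}_i$ from $\mathbf{r}_i$ and $\mathbf{s}_i$ [14]. Writing $\mathbf{p}_i$ and $\mathbf{p}'_i$ for the $i$-th vertex of the reference and target meshes, $c_{ij}$ for the cotangent weight and $N_i$ for the one-ring neighbors of vertex $i$, the vertex coordinates of the target mesh can be obtained by solving

$$\min_{\mathbf{p}'} \sum_{i} \sum_{j \in N_i} c_{ij} \left\| (\mathbf{p}'_i - \mathbf{p}'_j) - \mathbf{T}_i (\mathbf{p}_i - \mathbf{p}_j) \right\|_2^2, \tag{7}$$

which is equivalent to the following system of linear equations:

$$2 \sum_{j \in N_i} c_{ij}\, \mathbf{e}'_{ij} = \sum_{j \in N_i} c_{ij} (\mathbf{T}_i + \mathbf{T}_j)\, \mathbf{e}_{ij}, \tag{8}$$

where $\mathbf{e}'_{ij} = \mathbf{p}'_i - \mathbf{p}'_j$ and $\mathbf{e}_{ij} = \mathbf{p}_i - \mathbf{p}_j$. Note that Eq. (8) is translation-independent and thus we need to specify the position of one vertex. The amended linear system can then be rewritten as $\mathbf{A}\mathbf{p}' = \mathbf{b}$, where $\mathbf{A}$ is a fixed and sparse coefficient matrix that can be pre-factorized to save computation time.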
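A small dense sketch of this reconstruction is given below. It assumes symmetric edge weights and the stationarity condition of the least-squares energy, under which the left-hand side is a weighted graph Laplacian and the right-hand side sums (T_i + T_j)(p_i - p_j) over neighbors. The helper names, the dense matrix, and the explicit inverse standing in for a sparse pre-factorization are all illustrative assumptions; the uniform weights in the usage note replace the cotangent weights of the paper.

```python
import numpy as np

def build_matrix(C, anchor):
    """Coefficient matrix: depends only on weights C, so it can be reused.

    C is a symmetric (n, n) weight matrix, zero for non-neighbors.
    """
    A = 2.0 * (np.diag(C.sum(axis=1)) - C)
    A[anchor] = 0.0
    A[anchor, anchor] = 1.0  # pin one vertex to remove the translation freedom
    return A

def build_rhs(C, p, T, anchor, anchor_pos):
    """Right-hand side for reference vertices p (n, 3) and affines T (n, 3, 3)."""
    n = len(p)
    b = np.zeros((n, 3))
    for i in range(n):
        for j in np.nonzero(C[i])[0]:
            b[i] += C[i, j] * (T[i] + T[j]) @ (p[i] - p[j])
    b[anchor] = anchor_pos
    return b

def reconstruct(A_inv, C, p, T, anchor, anchor_pos):
    """Solve for target vertex positions with a precomputed inverse of A."""
    return A_inv @ build_rhs(C, p, T, anchor, anchor_pos)
```

As a sanity check of the sketch, assigning the same rotation to every vertex of a tetrahedron with uniform weights reproduces the rigidly rotated tetrahedron, translated so that the anchor lands on the requested position. In practice a sparse factorization would replace `np.linalg.inv`.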

Note that although the conversion from a mesh to its ACAP feature representation is quite complex, we only need it for converting the training mesh data to learn the embedding. Once the learning is done, during online applications we can directly optimize the low-dimensional embedding to fit the input data, e.g., a mesh, which is simple and easy to compute.

Scaling Deformation Feature. Following the strategy of Tan et al. [38], we rescale each dimension of and to independently. This normalizes each feature dimension and reduces the difficulty of learning to reconstruct the deformation features and .
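Per-dimension rescaling of this kind can be sketched as a min-max mapping fitted on the training features. The target interval is not recoverable from the text above, so the symmetric bound of 0.95 used here is a placeholder assumption, as are the function names.

```python
import numpy as np

def fit_scaler(features, bound=0.95):
    """Record per-dimension min and range of an (N, D) feature matrix.

    `bound` is a hypothetical half-width of the target interval.
    """
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return lo, span, bound

def scale(features, scaler):
    """Map each dimension independently into [-bound, bound]."""
    lo, span, bound = scaler
    return (2.0 * (features - lo) / span - 1.0) * bound

def unscale(scaled, scaler):
    """Invert `scale`, e.g. on decoder outputs before mesh reconstruction."""
    lo, span, bound = scaler
    return (scaled / bound + 1.0) * 0.5 * span + lo
```

The scaler is fitted once on the training set and the same statistics are reused to undo the scaling on network outputs.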

5 Embedding Learning

5.1 Network Architecture

In this work, our goal is to learn a disentangled human body representation with a coarse-to-fine reconstruction pipeline. We need the body ACAP feature and the coarse-level feature to supervise our hierarchical reconstruction. Meanwhile, we define a neutral pose of a human body as the standard shape determined by identity only. For corresponding to a specific posed shape of an identity, we denote the ACAP feature corresponding to the neutral pose of the same identity by . Similarly, represents the neutral counterpart of . A group of forms one training sample for our embedding learning.

We use a VAE-like architecture to design our end-to-end learning architecture. Fig. 2 shows the proposed architecture. For the encoder, we first feed into a shared MLP (multilayer perceptron) to generate a 400-dimensional hidden feature. Then we use the standard VAE [24] encoder structure to generate the shape and pose latent representations separately. Specifically, is composed of two fully connected layers with as the activation function. have similar structure: each uses a fully connected layer without activation to output the mean values and another fully connected layer with activation to output the standard deviation. We set the shape embedding to dimensions and the pose embedding to dimensions, i.e. and , to roughly match the dimensions of the shape and pose parameters in SMPL [27].

Our decoder follows the design of Eq. (1). There are two paths, called the base path and the difference path. Each path takes as input, and corresponds to and in Eq. (1), respectively. The decoder outputs by summing the results of the two paths and produces with , and aims to reconstruct . Meanwhile, the decoder outputs by another calculation with as inputs, and aims to reconstruct . Except for the learnable skinning layer, the basic transformations in the decoder all have similar structures. Tab. I gives detailed information.

The learnable skinning layer is introduced to construct the base feature from the coarse-level feature . The skinning method has shown its ability for human body modeling based on Euclidean coordinates [27]. Our learnable skinning layer applies this method in feature space. In particular, we use a learnable sparse matrix to transform a reshaped coarse-level feature to the reshaped base feature , i.e.,


where each row of is a convex combination of rows of the coarse-level feature . Moreover, we constrain to be non-zero only on the parts near the -th vertex to avoid overfitting and non-smooth solutions.
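One way to obtain such a sparse matrix with convex rows is to store unconstrained parameters and pass them through a softmax restricted to the allowed parts. The sketch below assumes this parameterization, which the text does not specify; the names `logits` and `mask` are hypothetical, and the mask would encode which parts count as near each vertex.

```python
import numpy as np

def skinning_weights(logits, mask):
    """Row-wise softmax restricted to allowed parts.

    logits: (num_vertices, num_parts) unconstrained learnable parameters.
    mask:   boolean array of the same shape; True where a part may influence
            the vertex. Every row needs at least one True entry.
    """
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(masked)  # exp(-inf) = 0 keeps disallowed entries exactly zero
    return w / w.sum(axis=1, keepdims=True)

def apply_skinning(weights, coarse):
    """Each output row is a convex combination of rows of the coarse feature.

    coarse: (num_parts, d) reshaped coarse-level feature.
    returns (num_vertices, d) reshaped base feature.
    """
    return weights @ coarse
```

Because the weights are non-negative and sum to one per row, every base feature row stays inside the convex hull of the coarse rows it is allowed to use.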

5.2 Loss Function

Unlike [38, 37] that use mean square error (MSE), we find that error gives better results for the feature reconstruction:


Similarly, for coarse-level feature reconstruction, we define


For the shape and pose embedding, since we use VAE as the encoder, KL divergence losses are needed:



Here, is the posterior probability, is the prior multivariate normal distribution, and is the KL divergence formulation. See [24] for more details of the KL divergence formulation. The total loss is given in the following form:


The hyperparameters , , , , and are set to be , , , , and , respectively. We train the network with as the initial learning rate and decrease it with ratio for about epochs. We train for about 3600 epochs and then fine-tune the tradeoff between the KL loss and the reconstruction loss for another 2000 epochs.
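The ingredients of such a loss can be sketched as below. The sketch assumes an absolute-error (L1) reconstruction term, the standard closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior as in [24], and a plain weighted sum for the total. The function names and any weights passed in are placeholders; the paper's actual hyperparameter values are not reproduced here.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between predicted and target features."""
    return float(np.abs(pred - target).mean())

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return float(0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma)))

def total_loss(terms, weights):
    """Weighted sum of loss terms (reconstruction and KL terms alike)."""
    return float(sum(w * t for w, t in zip(weights, terms)))
```

A training step would evaluate the reconstruction terms for the full and coarse-level features, the two KL terms for the shape and pose codes, and combine them with `total_loss`.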

MLP & &
units number 2 2 1 1
TABLE I: Structure of MLPs in the decoder. A unit is a fully connected layer with as activation. We also list the input and output feature dimensions of units for each MLP.

6 Constructing Training Data

To facilitate data-driven 3D human body analysis, we need a large number of 3D human mesh models. Thus, we collect data from several publicly available datasets. In particular, SCAPE [5] and FAUST [8] provide meshes of several subjects in different poses. Hasler et al. [18] provide 520 body meshes of about 100 subjects at relatively low resolution. MANO [34] collects the body and hand shapes of several people. Dyna [31] and DFaust [9] release alignments of movement scan sequences of several subjects. For rest-pose body data, the CAESAR database [33] is the largest commercially available dataset, containing 3D scans of over 4500 American and European subjects in a standard pose. Yang et al. [42] converted a large part of the CAESAR dataset to the same topology as the SCAPE dataset. These datasets have very different topologies and different poses for each identity.

Our proposed embedding learning network has two main requirements for the training data. First, the topology of the whole dataset must be kept consistent due to the requirement of ACAP feature computation. Second, to disentangle human body variations into shape and pose latent embeddings, we need to define a neutral pose as the specific pose which represents the body variations only caused by shape, i.e., intrinsic factors among individuals. In other words, we need to construct a neutral pose mesh for each identity in our dataset.

For the first requirement, we need to convert the collected public datasets, such as FAUST [8], SCAPE [5] and Hasler et al. [18], into the same topology. Before that, we need to define a standard topology. Considering vertex density and data amount, we modify the topology shared by SCAPE [5] and SPRING [42] to eliminate several non-manifold faces and treat the result as the standard topology. Specifically, we set the mesh graph structure with vertices and 24495 faces, which is much denser than SMPL [27] with its 6890 vertices. We choose one mesh of SCAPE [5] as the reference mesh, as shown in Fig. 1, for the ACAP feature computation.

For the second requirement, SPRING [42] is a dataset with a consistent simple pose, which can be regarded as our neutral pose.

Fig. 3: Reconstruction comparison on two samples from the pose test data. We show the respective reconstruction results and MED error maps on neutral models.

Topology Conversion. We formulate topology conversion as a non-rigid registration problem from the reference topology to a mesh in a target topology dataset. We adopt the data-driven non-rigid deformation method of Gao et al. [14] to solve it. First, we define the prior human body deformation space by a basis of ACAP features. We use 70 large-pose meshes of SCAPE [5] to cover the pose variations, and choose 70 shape meshes of different individuals from SPRING [42] to cover the shape variations. With the computed 140 ACAP features (see Section 4), we get a matrix . Then, we extract compressed sparse basis deformations from . We use the sparse dictionary learning method of [28] to extract the sparse basis . Unlike [14], we extract the sparse basis based on human body parts instead of automatically selecting basis deformation centers. See Fig. 1 for the segmentation of human body parts. In this way, we can now use a vector to get an ACAP feature:

Second, we manually mark a set of corresponding vertices between the reference and the target topology, denoted as , where is the index set of markers on our reference topology and represents the index of the corresponding marker on the target topology.

Finally, we formulate our topology conversion problem as:


where and represent the rotation and the translation of the global rigid transformation, is the point-to-plane ICP energy, is the normal of vertex on the target mesh, is a vertex to be optimized on the reference mesh topology, is a vertex of the reference mesh, and is for sparse landmark constraints. is the formulation from Gao et al. [14], which uses the extracted sparse deformation basis to generate transformations so as to constrain the movements of . For more details, please refer to Gao et al. [14].
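The point-to-plane term mentioned above can be illustrated in isolation. The sketch assumes that correspondences between source and target points are already given, and it evaluates the sum of squared distances measured along the target normals after a global rigid transform. Names and the argument layout are illustrative; the landmark and deformation-prior terms of the full energy are not modeled.

```python
import numpy as np

def point_to_plane_energy(src, dst, dst_normals, R=None, t=None):
    """Sum of squared point-to-plane residuals.

    src:         (N, 3) points being optimized.
    dst:         (N, 3) corresponding points on the target mesh.
    dst_normals: (N, 3) unit normals at the target points.
    R, t:        optional global rotation (3, 3) and translation (3,).
    """
    R = np.eye(3) if R is None else R
    t = np.zeros(3) if t is None else t
    moved = src @ R.T + t
    # Project each offset onto the target normal; tangential sliding is free.
    residual = np.einsum("ij,ij->i", dst_normals, moved - dst)
    return float(np.sum(residual ** 2))
```

Because only the normal component is penalized, a source point can slide along the target surface without cost, which is what makes this term well suited to registration between meshes of different topology.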

By using this topology conversion method, we convert 916 meshes from Dyna [31], all 100 meshes of FAUST [8], 517 meshes of Hasler et al. [18] and 852 meshes of MANO [34] to the standard topology and align the converted meshes to the reference mesh.

Neutral Pose Construction. For each identity in the original dataset, we perform global matching with each pose (i.e., neutral pose) in SPRING [42] under rigid transformation and choose the posture mesh that has the smallest error. Then we use the ARAP (as-rigid-as-possible) [36] deformation method to obtain the corresponding neutral pose mesh. In this way, we generate another 135 neutral meshes.

Finally, with the method described above, we obtain 2385 converted pose meshes plus another 70 from SCAPE [5], and 135 deformed neutral meshes plus 3048 from SPRING [42]. We compute their ACAP features and corresponding coarse-level features using the method described in Section 4. After removing a few bad results, we eventually get 5594 feature pairs. We choose the corresponding neutral features for every pair , and construct the final dataset. Then, we randomly choose 160 neutral meshes and 160 pose meshes as testing data and use the rest as training data. Table II shows the number of meshes used from each dataset and the composition of our constructed dataset.

Fig. 4: Cumulative Errors Distribution (CED) curves on our shape scan dataset.
Fig. 5: Reconstruction comparison on two samples from the shape scan data. We show the respective reconstruction results and PMD error maps on the scan point clouds.
DataSet | Number
Dyna [31] | 907
FAUST [8] | 99
SCAPE [5] | 70
Hasler et al. [18] | 517
MANO [34] | 818
SPRING [42] | 3048
Neutral | 3183
General Pose | 2411
TABLE II: The number of models we chose from each existing dataset for constructing our consistent mesh dataset, and the number of models (neutral models and general pose models) in our final constructed dataset.
Fig. 6: Cumulative Errors Distribution (CED) curves on DFaust [9] scan dataset.

7 Making Use of the Embedding

Once the embedding learning is done, we only need to keep the trained decoder plus the ACAP-feature-to-mesh conversion in Eq. (8), denoted as , which takes shape and pose parameters as input and outputs a mesh in the predefined topology. For various online applications such as reconstruction, we just need to optimize the low-dimensional embedding to fit the input data, which could be an image, video, point cloud, mesh, etc.

Let us use mesh input as a toy example. Given a mesh with our topology, we want to find the optimal whose best reconstructs the given mesh. Here, we do not use our trained encoder to obtain , since the encoder requires converting the given mesh into ACAP features, which is complex and time-consuming. Instead, we optimize directly using only the decoder:


where rotation and translation are the global rigid transformation parameters, is the -th vertex position of the decoded mesh of , and is the -th vertex of the given mesh. For this optimization with per-vertex constraints, we assign to , and to . The optimization generally takes about 200 iterations with the Adam optimizer to achieve millimeter reconstruction accuracy.
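The idea of fitting the latent code through the decoder alone can be shown on a toy problem. In the sketch below a fixed linear map stands in for the trained decoder, the global rotation and translation are left out, and Adam is written out by hand in NumPy; the regularization weight, learning rate and step count are arbitrary illustrative values rather than the settings used in the paper.

```python
import numpy as np

def fit_latent(W, b, target, reg=1e-4, lr=0.02, steps=2000):
    """Fit a latent code z so that the toy decoder W @ z + b matches target.

    Minimizes mean squared vertex error plus a small L2 penalty on z,
    using a hand-written Adam update.
    """
    z = np.zeros(W.shape[1])
    m = np.zeros_like(z)
    v = np.zeros_like(z)
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    history = []
    for step in range(1, steps + 1):
        residual = W @ z + b - target
        history.append(float(np.mean(residual ** 2) + reg * np.sum(z ** 2)))
        grad = 2.0 * W.T @ residual / residual.size + 2.0 * reg * z
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** step)
        v_hat = v / (1.0 - beta2 ** step)
        z = z - lr * m_hat / (np.sqrt(v_hat) + eps)
    return z, history

# Toy usage: recover a known code from its decoded output.
rng = np.random.default_rng(0)
W = rng.normal(size=(60, 4))
b = rng.normal(size=60)
z_true = np.array([1.0, -0.5, 0.3, 2.0])
z_fit, history = fit_latent(W, b, W @ z_true + b)
```

With a real nonlinear decoder the gradient would come from automatic differentiation instead of the closed form used here, but the optimization loop has the same structure.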

Test Dataset | Ours | Baseline | meshVAE
Neutral (160) | | 4.99 | 5.26
Pose (160) | | 3.19 | 3.13
TABLE III: MED() on test data with 160 neutral meshes and 160 pose meshes.
Methods | mean | std | #points
Ours | | | 545263
Baseline | 5.2 | 7.5 | 543848
meshVAE | 5.4 | 7.2 | 544794
SMPL | 6.4 | 8.5 | 546020
TABLE IV: Quantitative comparison between different methods on our shape scan dataset. The mean PMD(), standard deviation and valid number of points for testing (without hand part) are given.

8 Experiments

In this section, we quantitatively analyze our model’s ability in reconstruction tasks and present some qualitative results and potential applications. We set up three baseline methods for comparison in different tasks. To show the power of our hierarchical reconstruction pipeline, we trained a baseline architecture with the base path removed from the decoder, which we call “Baseline”. To evaluate the effect of disentangling shape and pose variations, we trained the non-disentangled meshVAE [38] architecture on our dataset. To evaluate the reconstruction power, we compare our method with the widely used SMPL [27] model. We integrate the official neutral SMPL model code into the PyTorch framework and implement the optimization in the same framework with the Adam optimizer.

Computation Time. Our implementation is based on PyTorch. Our mesh decoder takes about 10 ms to map an embedding to a mesh on a TITAN Xp GPU.

8.1 Quantitative Evaluation

We quantitatively evaluate reconstruction ability on two types of data. One is our test dataset, which has consistent topology. We use the mean Euclidean distance of vertices (MED) as the measurement. The other is general scan data of human body shapes. On this type of data we compare our method with SMPL [27], which has a different topology from ours. We compute the distance between each point of the scan point cloud and the corresponding reconstructed mesh as the measure. The distance is computed with an AABB tree, and we denote this error measurement by PMD (point-mesh distance). As our method and the compared methods mainly focus on the body shape, we ignore the PMDs of vertices belonging to the hand part.
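For meshes that share a topology, the MED measure reduces to averaging per-vertex distances, as in the short sketch below. The function name is illustrative; the point-to-mesh distance used for scans additionally needs a closest-point query on triangles and is not shown.

```python
import numpy as np

def mean_euclidean_distance(pred_vertices, gt_vertices):
    """MED between two (N, 3) vertex arrays with identical vertex ordering."""
    return float(np.linalg.norm(pred_vertices - gt_vertices, axis=1).mean())
```

The result is in the same unit as the vertex coordinates.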

Fig. 7: Reconstruction results of our method and SMPL on scan data of DFaust [9]. The error maps show that our method has better reconstruction accuracy.

Point to Point Reconstruction. We compare the reconstruction ability of our method, Baseline and meshVAE on our test dataset. We obtain the embedding by solving Eq. (15) for each method and report the MED errors in Tab. III. We also show two reconstruction results and their respective error maps in Fig. 3.

Our model outperforms Baseline and meshVAE on both the shape and pose test data. As shown in Fig. 3, the MED of our model is clearly lower than that of the other methods, especially for large-scale body shape and pose variations. The results demonstrate the effectiveness of our disentangled and hierarchical architecture design.

Shape Scan Data Reconstruction. To show the reconstruction ability on scan data of different identities, we use high-accuracy scan data of six males and six females with different body types under a neutral pose. These subjects do not appear in our training dataset and all wear tight clothing. The scan system includes 4 Xtion sensors that rotate around the subject standing in the middle of the scene, and we use the collected multiview RGB and depth data to recover high-accuracy geometry of the subject.

We label eight corresponding landmarks on the scan mesh and use this sparse correspondence to generate a coarse alignment with the scan data. Then we iteratively apply point-to-plane iterative closest point (ICP) optimization. For our method, Baseline and meshVAE, we use the latent parameter regularization adopted in Eq. (15). For SMPL’s optimization, we adopt the pose prior and shape regularization from [7] to constrain SMPL’s parameters. All the optimizations are implemented with the Adam method in PyTorch.
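The point-to-plane data term used in such an ICP step can be sketched as follows. This is an illustrative version only: it assumes that closest-point correspondences and unit normals at the matched points have already been computed, and the function name `point_to_plane_energy` is hypothetical.

```python
def point_to_plane_energy(source_points, target_points, target_normals):
    """Sum of squared point-to-plane residuals ((p - q) . n)^2 over matched pairs."""
    energy = 0.0
    for p, q, n in zip(source_points, target_points, target_normals):
        # Signed distance from p to the tangent plane at q with normal n.
        residual = sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n))
        energy += residual * residual
    return energy
```

Unlike a point-to-point term, this energy does not penalize sliding along the tangent plane, which is one reason the point-to-plane variant is commonly preferred for surface registration.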

We compute PMD for each point of the scan data and draw the Cumulative Error Distribution (CED) curve in Fig. 4. Tab. IV shows the numerical comparison and Fig. 5 shows two examples on the shape scan dataset. Again, our method has the best reconstruction accuracy.
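A CED curve reports, for each error threshold, the fraction of scan points whose error does not exceed that threshold. A minimal sketch, assuming the per-point PMD values are already available as a list of numbers (the function name and the threshold values in the test are illustrative):

```python
import bisect


def cumulative_error_distribution(errors, thresholds):
    """Return, for each threshold, the fraction of errors that are <= the threshold."""
    if not errors:
        raise ValueError("errors must not be empty")
    ordered = sorted(errors)
    n = len(ordered)
    # bisect_right counts how many sorted errors are <= t.
    return [bisect.bisect_right(ordered, t) / n for t in thresholds]
```

A curve that rises faster and saturates earlier indicates that more points are reconstructed within a small error.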

Methods      mean   std   #points
Ours         -      -     30953504
Ours_s       3.1    4.7   30952186
Baseline     3.3    4.8   30956373
Baseline_s   3.6    4.9   30956386
SMPL         4.6    5.5   31015202
SMPL_s       4.8    5.8   31012640
meshVAE      3.2    4.6   30956136
TABLE V: Quantitative comparison between different methods on the DFaust [9] scan dataset. The mean PMD, standard deviation and valid number of points for testing (without hand part) are given.

Methods      mean   std   #points
Ours         6.3    6.4   30962584
Ours_s       -      -     30963799
SMPL         6.9    7.5   31017249
SMPL_s       6.7    7.2   31018236
TABLE VI: Quantitative comparison of sparse reconstruction on the DFaust [9] scan dataset between different methods. The mean PMD, standard deviation and valid number of points for testing (without hand part) are given.
Fig. 8: Reconstruction examples of our method with sparse marker constraints from two CMU MOCAP sequences.
Fig. 9: Results of linearly interpolating the body shape latent code generated by different methods. Red circles indicate unreasonable human body movements.

General Scan Data Reconstruction. To show the reconstruction ability for the human body with different poses, we evaluate our method and the three baselines on the DFaust [9] dataset. DFaust provides ten subjects with several movement scan sequences and highly accurate registered meshes. However, the subjects of DFaust overlap with our training data from Dyna, Faust, and MANO. We choose three subjects from DFaust, labeled 50007, 50009 and 50020, as our test set. We remove the overlapping subjects from our training set and use 1973 pose data and 3021 neutral data to retrain our model, the Baseline model, and meshVAE. We sample DFaust at an interval of 40 frames and obtain 108, 65 and 69 test data for the three subjects, respectively.

We use a point-to-plane ICP registration method similar to the one described above, with 79 sparse landmarks to get a more accurate coarse alignment for general pose scan data. For methods that disentangle shape and pose, we also optimize another result by sharing shape parameters among all scan data of one subject, and we denote these results with the suffix _s in the method name.

We compute PMD for each scan point and draw the Cumulative Error Distribution (CED) curve in Fig. 6. Tab. V shows the numerical comparison on the test dataset. In Fig. 7, we also present several scans, the corresponding reconstructed meshes of our method and SMPL, and their respective error maps on the scan point clouds. Our method has the best reconstruction accuracy, and Ours_s has the second-highest reconstruction accuracy. This indicates that our method effectively disentangles shape and pose variations of the human body.

Fig. 10: Examples of randomly sampling shape parameters with the pose parameters set to zero. The results show rich variations of the neutral-pose human body.
Fig. 11: Results of human body estimation on LSP [22]. Each group includes the input image with skeleton, image with estimated human body and another view of reconstructed body mesh.

Reconstruction with Sparse Constraints. In this experiment, we test our reconstruction ability under the constraints of sparse marker points. Motion capture systems usually use sparse markers to capture human movements, and thus the ability to reconstruct the 3D human body from sparse markers is important. We again test on the selected data of DFaust. We manually mark 39 corresponding landmarks between the registered mesh of DFaust and our template, and likewise between the registered mesh and the SMPL template. We use these sparse corresponding landmarks to reconstruct the mesh and compute PMD errors against the scan data. Tab. VI shows the numerical results on the test dataset.

Even without careful optimization of the locations and offsets of sparse markers on the human body as in Mosh [26], we still get accuracy similar to that of SMPL. Moreover, we also select two sparse marker motion sequences from CMU MOCAP (mocap.cs.cmu.edu) to test our model. Fig. 8 shows the reconstruction results. These experiments indicate that our latent embedding is a reasonable dimensionality reduction for the human body shape manifold and can produce plausible human body shapes from few marker constraints.

Fig. 12: Human body shapes by randomly sampling shape and pose parameters. The results demonstrate rich body posture variations of our representation.

8.2 Qualitative Evaluation

Global Interpolation. Here we test the capability of our representation to interpolate between two random persons under different poses. We qualitatively compare with Baseline, meshVAE, and SMPL. For the source and target meshes, we use the reconstruction methods described in Section 8.1 to get the respective parameters for each representation. Then, we linearly interpolate between the source and target parameters to get a list of parameters and use the corresponding decoder to reconstruct the meshes. Fig. 9 shows one interpolation sequence between two meshes of our test set from two perspectives. All methods produce plausible results for the interpolation part (weights from 0 to 1.0), but for extrapolation, SMPL might generate unnatural body movements compared with our learned method. The pose parameters of SMPL encode the relative coordinate transformation between two joints, and this pose encoding does not consider a prior on human body movement, which is a possible reason for the unrealistic extrapolation results of SMPL.
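The linear interpolation and extrapolation of parameters can be sketched as below. This is an illustration only: the codes are plain lists of floats, the decoder is omitted, and the name `lerp_codes` and the sample weights are assumptions. Weights inside [0, 1] interpolate, while weights outside that range extrapolate.

```python
def lerp_codes(source, target, weights):
    """Return one blended code per weight t, computed as (1 - t) * source + t * target."""
    if len(source) != len(target):
        raise ValueError("codes must have the same dimension")
    return [
        [(1.0 - t) * s + t * g for s, g in zip(source, target)]
        for t in weights
    ]
```

Each returned code would then be passed to the decoder of the corresponding representation to obtain a mesh.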

Bilateral Interpolation. Our representation disentangles shape and pose parameters and thus we can interpolate them separately. Given two meshes with different poses and shapes, we first extract their shape and pose parameters. Then we linearly interpolate the shape and pose parameters separately. Fig. 14 shows the results of such bilateral interpolation. We can see that each column has a consistent pose and each row corresponds to a specific person. Even for extrapolation situations, we can get reasonably good results.
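A minimal sketch of how such a bilateral grid of parameters could be assembled is given below, assuming shape and pose codes are plain lists of floats. The function name `bilateral_grid` and the weight lists are illustrative, and decoding each (shape, pose) pair into a mesh is left out.

```python
def _lerp(a, b, t):
    """Linear blend (1 - t) * a + t * b of two equal-length codes."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def bilateral_grid(shape_a, shape_b, pose_a, pose_b, shape_weights, pose_weights):
    """Build a grid of (shape, pose) codes: rows vary the shape, columns vary the pose."""
    grid = []
    for ts in shape_weights:
        shape = _lerp(shape_a, shape_b, ts)  # fixed identity along this row
        row = [(shape, _lerp(pose_a, pose_b, tp)) for tp in pose_weights]
        grid.append(row)
    return grid
```

With this layout, every entry in a row shares one shape code and every entry in a column shares one pose code, matching the row and column structure described for Fig. 14.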

New Model Generation. Since we encode our shape parameters and pose parameters separately with a VAE architecture, we can generate new shape models by randomly sampling the embedding parameters.

In Fig. 10, we present neutral meshes generated by random sampling in the embedded shape space, shown from two perspectives. The generated shapes have abundant variations. In Fig. 12, we randomly generate posed meshes by sampling in the embedded shape and pose spaces, and the generated meshes have plausible and varied postures.
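Sampling new codes can be sketched as follows. The sketch assumes the usual VAE prior [24], a standard normal distribution over each latent dimension; the text here does not state the sampling distribution or the latent dimensions, so the dimensions used below and the function names are hypothetical.

```python
import random


def sample_code(dim, rng):
    """Draw a latent code with independent standard normal entries (assumed prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_neutral_body(shape_dim, pose_dim, rng):
    """Sample a shape code and pair it with an all-zero pose code (neutral pose)."""
    return sample_code(shape_dim, rng), [0.0] * pose_dim
```

For example, `sample_neutral_body(16, 32, random.Random(0))` returns a random shape code paired with a zero pose code; both codes would then be passed to the decoder.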

Fig. 13: Reconstruction results by applying our body shape representation to fit a depth sequence. The first row shows the original images, which are not used in our algorithm. The second row shows the registered meshes overlaid on the images. The third row shows the reconstructed meshes and the target point clouds together.
Fig. 14: Results of interpolating shape code and pose code separately.

3D Body Reconstruction From 2D Joints. Although our representation does not define an explicit skeleton like SMPL [27], we can still get a rough estimate of a joint by taking the average of related vertices on the body mesh. The related vertices of each joint are manually selected. We use this simple method to estimate the positions of joints such as the wrist and elbow.
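The joint estimate described above amounts to averaging a hand-picked set of mesh vertices. A minimal sketch, in which the function name and the vertex indices are purely illustrative:

```python
def estimate_joint(vertices, vertex_ids):
    """Estimate a joint position as the mean of the selected mesh vertices."""
    if not vertex_ids:
        raise ValueError("at least one vertex index is required")
    n = len(vertex_ids)
    return tuple(
        sum(vertices[i][axis] for i in vertex_ids) / n
        for axis in range(3)
    )
```

Because the estimate is a linear function of the vertex positions, it remains differentiable with respect to the decoded mesh and can be used inside a gradient-based fitting energy.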

Given 2D human joint positions, we can use our representation to reconstruct the 3D human body by solving the optimization problem of Eq. (16). In this energy, the rotation and translation are the global rigid transformation parameters; each joint position of the mesh decoded from the embedding is projected by the given camera projection matrix, with known intrinsic parameters, and compared with the corresponding 2D joint position; the residual is measured with the robust differentiable Geman-McClure penalty function [16]; and controllable weights balance the terms.
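To make the robust data term concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common form ρ(r) = r² / (r² + σ²) of the Geman-McClure penalty, a simple pinhole camera, and joints already expressed in camera coordinates, so the global rotation and translation are omitted. The function names and the parameters `sigma`, `focal` and `center` are illustrative assumptions.

```python
import math


def geman_mcclure(residual, sigma=1.0):
    """Geman-McClure penalty r^2 / (r^2 + sigma^2); bounded above by 1 (assumed form)."""
    r2 = residual * residual
    return r2 / (r2 + sigma * sigma)


def project_pinhole(point, focal, center):
    """Project a 3D point in camera coordinates with an assumed pinhole model."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def robust_joint_term(joints_3d, joints_2d, focal, center, sigma=100.0):
    """Sum of robust penalties on the 2D reprojection error of each joint."""
    total = 0.0
    for joint_3d, joint_2d in zip(joints_3d, joints_2d):
        u, v = project_pinhole(joint_3d, focal, center)
        error = math.hypot(u - joint_2d[0], v - joint_2d[1])
        total += geman_mcclure(error, sigma)
    return total
```

Because the penalty saturates for large residuals, a single badly detected 2D joint contributes at most a bounded amount to the energy, which is the usual motivation for a robust estimator in this setting.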

We use this optimization to estimate the 3D human body from 2D joints for the Leeds Sports Pose Dataset (LSP) [22]. The 2D joint locations provided by LSP are used as the target. As optimizing Eq. (16) needs a reasonable initialization of the global rigid transformation to avoid falling into local minima, we initialize the translation by assuming that the person is standing in front of the camera and estimate the distance via the ratio between the 2D joint length and the torso length of the reference mesh. With this initial translation, we optimize Eq. (16) only for the rotation and translation to obtain an initial estimate of the global rigid transformation. To guard against a wrong orientation in this estimate, we rotate it to obtain an alternative initialization. Finally, we optimize Eq. (16) starting from each of the two initializations and choose the result with the smaller loss as the solution. Our optimization strategy is similar to that of [7]. Fig. 11 shows some qualitative results on the LSP dataset. The results show that our representation can roughly recover the human body shape from 2D joint locations of images in the wild.
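The distance initialization and the final selection step can be sketched as follows. The sketch assumes a pinhole camera with focal length in pixels: a segment of 3D length L at depth z projects to roughly f·L/z pixels, so z ≈ f·L/l for an observed 2D length l. This is an assumption about how the ratio is used, and both function names are hypothetical.

```python
def init_depth(focal_length_px, torso_length_3d, torso_length_2d_px):
    """Estimate camera-to-person distance by similar triangles (assumed pinhole model)."""
    if torso_length_2d_px <= 0:
        raise ValueError("2D torso length must be positive")
    return focal_length_px * torso_length_3d / torso_length_2d_px


def pick_lower_loss(results):
    """Given (loss, parameters) pairs from several initializations, keep the best one."""
    return min(results, key=lambda result: result[0])
```

For instance, with a focal length of 1000 pixels, a reference torso of 0.5 units and an observed 2D torso of 100 pixels, the initial depth is 5 units. Running the optimization from two initial orientations and keeping the lower-loss result is a simple way to reduce the risk of a flipped orientation.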

Registration on Depth Images. We also show an example of fitting our representation to a depth sequence. We use a Kinect v2 to collect depth data. For each frame, we convert the depth image to a point cloud mesh for convenient point-to-plane ICP registration. Besides the depth data, we also record the 3D joint locations predicted by the Kinect v2 SDK. Although the prediction is not accurate enough, it is sufficient to supply a coarse initialization. As before, the regularization and robust estimator in Eq. (16) are used here. For temporal smoothness, we apply a smoothness energy to the pose parameters and share a single set of shape parameters across the whole sequence. Fig. 13 shows an example of registration results for a depth sequence. The color images are for visualization only and are not used in our algorithm.
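The exact form of the temporal smoothness energy is not given in the text. One simple possibility, shown here purely as an assumption, is a first-order penalty on the squared difference between the pose codes of consecutive frames; the function name is hypothetical.

```python
def temporal_smoothness(pose_sequence):
    """Sum of squared differences between pose codes of consecutive frames."""
    energy = 0.0
    for previous, current in zip(pose_sequence, pose_sequence[1:]):
        energy += sum((c - p) ** 2 for p, c in zip(previous, current))
    return energy
```

Such a term is zero for a static pose and grows with frame-to-frame jitter, so adding it to the per-frame fitting energy discourages abrupt pose changes along the sequence.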

9 Limitations

Although our representation defines a coarse-level shape, it still lacks an explicit and simple position computation method for the body skeleton from latent embedding. Currently, a simple average of related mesh vertex positions is treated as an estimation of the corresponding joint for the skeleton. However, this estimation is not very accurate and may introduce subtle differences in the target human pose.

For the neutral pose definition, we directly use the common pose of SPRING [42]. However, the postures in SPRING are not fully consistent. Small misalignments exist in this dataset, such as small swings of the arms and slight offsets in head orientation. These misalignments influence learning accuracy.

In the future, a coarse-level shape with an explicit skeleton definition could be designed to increase the expressive ability of the body shape, and a more strictly defined neutral pose could be used to eliminate the ambiguity caused by misalignments in the original data.

10 Conclusion

This paper aims to create a shape and pose disentangled body model with high accuracy. We have proposed a general learning framework integrating a coarse-to-fine reconstruction pipeline. Based on the framework, we utilize a VAE-like architecture to train our model end-to-end. To make full use of the fitting ability of neural networks, we have constructed a large and topology-consistent dataset with computed deformation shape representations from available datasets that contain models of different topologies. Moreover, neutral shapes are defined for each identity of our dataset. Experimental results have demonstrated the power of our learned embedding in terms of reconstruction accuracy and flexibility for model recreation. We believe that our learned embedding will be very useful to the community for various human body related applications.


This research is partially supported by National Natural Science Foundation of China (No. 61672481), Youth Innovation Promotion Association CAS (No. 2018495), NTU CoE grant, a Joint WASP/NTU project (M4082186), MoE Tier-2 Grant (2016-T2-2-065, 2017-T2-1-076) of Singapore and the Singapore NRF-funded NTU BeingTogether Centre. We thank VRC Inc. (Japan) for sharing the scanned human shape models with us in Fig. 5 and Tab. IV.


  • [1] T. Alldieck, M. Kassubeck, B. Wandt, B. Rosenhahn, and M. Magnor. Optical flow-based 3d human motion estimation from monocular video. In German Conference on Pattern Recognition, pages 347–360. Springer, 2017.
  • [2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [3] B. Allen, B. Curless, and Z. Popović. The space of human body shapes: reconstruction and parameterization from range scans. In ACM transactions on graphics (TOG), volume 22, pages 587–594. ACM, 2003.
  • [4] B. Allen, B. Curless, Z. Popović, and A. Hertzmann. Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 147–156. Eurographics Association, 2006.
  • [5] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape completion and animation of people. In ACM transactions on graphics (TOG), volume 24, pages 408–416. ACM, 2005.
  • [6] T. Bagautdinov, C. Wu, J. Saragih, P. Fua, and Y. Sheikh. Modeling facial geometry using compositional vaes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
  • [8] F. Bogo, J. Romero, M. Loper, and M. J. Black. Faust: Dataset and evaluation for 3d mesh registration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3794–3801, 2014.
  • [9] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic faust: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6233–6242, 2017.
  • [10] Y. Chen, Z.-Q. Cheng, C. Lai, R. R. Martin, and G. Dang. Realtime reconstruction of an animating human body from a single depth camera. IEEE transactions on visualization and computer graphics, 22(8):2000–2011, 2016.
  • [11] Y. Chen, Z. Liu, and Z. Zhang. Tensor-based human body modeling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–112, 2013.
  • [12] O. Freifeld and M. J. Black. Lie bodies: A manifold representation of 3d human shape. In European Conference on Computer Vision, pages 1–14. Springer, 2012.
  • [13] L. Gao, Y.-K. Lai, D. Liang, S.-Y. Chen, and S. Xia. Efficient and flexible deformation representation for data-driven surface modeling. ACM Transactions on Graphics (TOG), 35(5):158, 2016.
  • [14] L. Gao, Y.-K. Lai, J. Yang, L.-X. Zhang, L. Kobbelt, and S. Xia. Sparse data driven mesh deformation. arXiv preprint arXiv:1709.01250, 2017.
  • [15] L. Gao, J. Yang, Y.-L. Qiao, Y.-K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. In SIGGRAPH Asia 2018 Technical Papers, page 237. ACM, 2018.
  • [16] S. Geman. Statistical methods for tomographic image reconstruction. Bull. Int. Stat. Inst, 4:5–21, 1987.
  • [17] N. Hasler, H. Ackermann, B. Rosenhahn, T. Thormählen, and H.-P. Seidel. Multilinear pose and body shape estimation of dressed subjects from image sets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1823–1830. IEEE, 2010.
  • [18] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and body shape. In Computer Graphics Forum, volume 28, pages 337–346. Wiley Online Library, 2009.
  • [19] D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black. Coregistration: Simultaneous alignment and modeling of articulated 3d shape. In European conference on computer vision, pages 242–255. Springer, 2012.
  • [20] A. Jain, T. Thormählen, H.-P. Seidel, and C. Theobalt. Moviereshape: Tracking and reshaping of humans in videos. In ACM Transactions on Graphics (TOG), volume 29, page 148. ACM, 2010.
  • [21] Z.-H. Jiang, Q. Wu, K. Chen, and J. Zhang. Disentangled representation learning for 3d face shape. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [22] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, volume 2, page 5, 2010.
  • [23] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [24] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Second International Conference on Learning Representations, ICLR, 2014.
  • [25] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [26] M. Loper, N. Mahmood, and M. J. Black. Mosh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG), 33(6):220, 2014.
  • [27] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
  • [28] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. Sparse localized deformation components. ACM Transactions on Graphics (TOG), 32(6):179, 2013.
  • [29] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018.
  • [30] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele. Building statistical shape spaces for 3d human modeling. Pattern Recognition, 67:276–286, 2017.
  • [31] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34(4):120, 2015.
  • [32] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part III, pages 725–741, 2018.
  • [33] K. M. Robinette, H. Daanen, and E. Paquet. The caesar project: a 3-d surface anthropometry survey. In 3-D Digital Imaging and Modeling, 1999. Proceedings. Second International Conference on, pages 380–386. IEEE, 1999.
  • [34] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.
  • [35] H. Seo and N. Magnenat-Thalmann. An automatic modeling of human bodies from sizing parameters. In Proceedings of the 2003 symposium on Interactive 3D graphics, pages 19–26. ACM, 2003.
  • [36] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, volume 4, pages 109–116, 2007.
  • [37] Q. Tan, L. Gao, Y. Lai, J. Yang, and S. Xia. Mesh-based autoencoders for localized deformation component analysis. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 2452–2459, 2018.
  • [38] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.
  • [39] A. Weiss, D. Hirshberg, and M. J. Black. Home 3d body scans from noisy image and range data. In IEEE International Conference on Computer Vision (ICCV), pages 1951–1958, 2011.
  • [40] Q. Wu, J. Zhang, Y.-K. Lai, J. Zheng, and J. Cai. Alive caricature from 2d to 3d. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7336–7345, 2018.
  • [41] W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H.-P. Seidel, and C. Theobalt. Monoperfcap: Human performance capture from monocular video. ACM Transactions on Graphics (TOG), 37(2):27, 2018.
  • [42] Y. Yang, Y. Yu, Y. Zhou, S. Du, J. Davis, and R. Yang. Semantic parametric reshaping of human body models. In 3D Vision (3DV), 2014 2nd International Conference on, volume 2, pages 41–48. IEEE, 2014.
  • [43] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu. DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [44] S. Zhou, H. Fu, L. Liu, D. Cohen-Or, and X. Han. Parametric reshaping of human bodies in images. In ACM Transactions on Graphics (TOG), volume 29, page 126. ACM, 2010.