Identity-Disentangled Neural Deformation Model for Dynamic Meshes

09/30/2021 ∙ by Binbin Xu, et al. ∙ 4

Neural shape models can represent complex 3D shapes with a compact latent space. When applied to dynamically deforming shapes such as the human hands, however, they would need to preserve temporal coherence of the deformation as well as the intrinsic identity of the subject. These properties are difficult to regularize with manually designed loss functions. In this paper, we learn a neural deformation model that disentangles the identity-induced shape variations from pose-dependent deformations using implicit neural functions. We perform template-free unsupervised learning on 3D scans without explicit mesh correspondence or semantic correspondences of shapes across subjects. We can then apply the learned model to reconstruct partial dynamic 4D scans of novel subjects performing unseen actions. We propose two methods to integrate global pose alignment with our neural deformation model. Experiments demonstrate the efficacy of our method in the disentanglement of identities and pose. Our method also outperforms traditional skeleton-driven models in reconstructing surface details such as palm prints or tendons without limitations from a fixed template.




1 Introduction

Hands form one of our most important interfaces with the world, and thus modelling and tracking them is an important problem that has recently received significant attention from the computer vision community. However, it is challenging to capture the fine details of hand geometry, due to the complicated interaction between muscles, bones and tendons. In addition, there are significant variations between individuals’ hands.

The most common technique for building a multi-person model of human hands or bodies is to fit an explicit mesh representation to a large collection of scans [29, 26]. These models typically combine a statistical shape basis with the standard Linear Blend Skinning (LBS) method for deforming the mesh using the skeleton. It is non-trivial to develop these models, which require high quality mesh registration and often rely on manual annotation. Solving for the resulting statistical model is a complex, nonconvex optimization that requires careful regularization.

With the rapid development of deep learning, many recent works have shown successful and impressive results on learning shape embeddings using a purely data-driven approach. One promising direction is to train a multi-layer perceptron (MLP) to represent the shape implicitly using, e.g., a signed distance field (SDF). Learning the shape embedding requires that the training data be aligned in order to remove the ambiguity introduced by the 6DoF rigid body or 7DoF similarity transformation. This alignment space is referred to as the canonical coordinates and is defined by the training set. While it may be possible to align the training shapes to the canonical coordinates by preprocessing the data, this assumption is invariably violated on real-world testing data. In practice, using the shape embedding to reconstruct novel inputs requires joint estimation of pose and geometry.

This paper studies the problem of reconstructing identity-invariant deforming shapes from a dynamic sequence of 3D point clouds and provides solutions to apply it in real-world data. Our model produces complete and detailed geometry that captures identity-specific features. We have two major contributions. First, we learn a neural deformation model that factorizes its latent space into subject identity and gesture from unregistered 3D point clouds, without requiring any human registration or annotation. Second, we propose two effective solutions to jointly optimize for a global rigid pose with identity and pose embeddings in the neural deformation model. We conduct extensive ablations to validate our design choice, and show the disentangled latent space can perform identity and pose transfer as well as dynamic shape completion on new subjects.

2 Related Work

Our goal is not only to learn a disentangled deformable model with detailed geometry, but also to use the representation to reconstruct unseen data that are often misaligned with the canonical shape space.

Surface Deformation Models

One widely adopted representation for articulated shapes is Linear Blend Skinning (LBS), which deforms a template mesh based on a skeleton. Low-resolution meshes are more efficient for pose tracking [35], but cannot capture complex deformations or fine details such as wrinkles. Subdividing the mesh can improve tracking convergence [34, 12], but does not add detail. A fixed template mesh also struggles to describe variations among individuals, such as palm prints or fingernails. To improve the deformation quality, Pose-Space Deformation (PSD) [15, 1, 13] models add pose-dependent shapes as “correctives” to LBS results. Similar ideas are used in SMPL [17] and STAR [23] for the body and in MANO [29] for hands. However, these models are still limited by the template mesh resolution.

Recently, neural models have offered an alternative approach to represent or deform 3D geometries. The skinning algorithm can be implemented as network modules that map skeletal pose parameters to the final deformed mesh [10, 5]. The flexibility of a neural network enables the learning of more complex skinning functions [21], or a direct mapping from a latent embedding to a deformed mesh without a skeleton [14]. Such flexibility also opens the door to mesh-less representations. Autodecoder networks can simultaneously learn an embedding space together with a mapping function to signed distance fields [25, 9] or occupancy grids [18, 7], where a surface can be extracted up to a desired resolution. These networks can be trained on 3D point clouds, meshes, or even images [20, 33, 30]. While initial efforts focus on learning static shapes aligned in a canonical space, they can be extended to represent deforming shapes [22, 8, 11, 36]. Our work is closely related to these efforts, where we extend [25, 9] to incorporate the 6D global transformation of deforming shapes, and further disentangle the shape space from the identity space to preserve subject-specific invariance throughout a motion.

Identity-disentangled Deformable Models

Thanks to their speed and simplicity, bilinear deformable models are widely used in pose and shape estimation for hands [29], bodies [17], and faces [27, 6, 16]. These methods use a global linear shape basis to represent identity but are posed using LBS. They are typically built by registering a template mesh to a large 3D shape collection which captures multiple subjects in a variety of poses [2]. Inverting the skinning transform brings the meshes into a consistent space where a statistical model such as Principal Component Analysis (PCA) can disentangle the per-subject identity from pose-dependent corrective shapes. The quality of the resulting model depends critically on accurate template registration, which often requires manual annotations.

Neural networks are able to learn disentangled nonlinear deformable models. Zhou et al. [38] model disentangled shape and pose from registered meshes via a self-consistency loss. This approach can be applied to different subjects without domain-specific designs. I3DMM [37] learns disentangled latent spaces of identity, facial expressions, hairstyle, and color from watertight aligned face scans. Similarly, we learn a neural deformable model from scans with disentangled identity and shapes. NMP [24] disentangles shape and pose by learning the deformation field, which requires dense correspondence annotation that is non-trivial to obtain. Our work assumes the training scans are rigidly aligned, but needs no hole-filling, template registration, or correspondence annotations across subjects. Unlike I3DMM [37], which focuses on faces, we evaluate on hands with a high degree of articulation.

3D Scan Neural Registration

While a neural deformable model is conceptually similar to a bilinear model, it is not obvious how to incorporate a global transformation into a neural shape model. While most existing work only demonstrates reconstruction in a canonical shape space [37, 22, 8, 36], several exciting concurrent works fit 3D models to 3D scans using implicit representations. Most approaches [3, 31, 19] combine conventional template models with neural implicit representations to get the best of both worlds. We instead explore a purely implicit representation without any templates. It allows us to learn unstructured and detailed features that would be difficult to capture with a predefined template. We bypass the template registration completely, and need not define a consistent “rest pose” among different identities.

3 Method

The overview of DiForm is presented in Fig. 2. This section describes each module in detail.

3.1 Learning Embedding for Deformable Shapes



Figure 2: DiForm overview. From a set of deformable shapes, we learn the identity code and the deformation code to decode 3D point clouds in the canonical space to SDF. To reconstruct unseen observations in an arbitrary space at inference, we jointly optimize the latent codes and the transformation to align the coordinates. Two solutions are proposed to seek good rotation initialization.

Consider a shape set $\mathcal{S} = \{S_{ij}\}$, where $S_{ij}$ denotes the $j$-th sample from the $i$-th subject. For deformable shapes, each subject has a different intrinsic geometry that can deform in a similar manner. An example set is hands, each with different gestures, or faces, each with different expressions. We assume the set contains 3D point clouds with normals, $S_{ij} = \{(\mathbf{x}_k, \mathbf{n}_k)\}$. Prior works [25, 9, 32] learn the shape space using MLPs that encode the SDF. To recap, these MLPs map a 3D location and a latent shape code to its SDF value. Shape information can be extracted to meshes by marching cubes or to depth maps by sphere tracing, both of which query the network at many spatial locations while holding the shape code fixed.

This work focuses on deformable shapes. We thus aim to disentangle the two main factors of variation: the intrinsic shape of the subject and its current configuration. Thus, rather than a single latent code $\mathbf{z}$, we condition the MLP on two latent codes. The first, $\alpha_i$, is referred to as the identity code and is meant to store intrinsic shape information. The second, $\beta_{ij}$, is the deformation code and is used by the network along with $\alpha_i$ to pose a concrete shape. Mathematically, we learn the function

$$f_\theta(\mathbf{x}, \alpha_i, \beta_{ij}) = s, \tag{1}$$

where $s$ is the SDF of shape $S_{ij}$ at $\mathbf{x}$, and $\theta$ denotes the network parameters. Similar to previous works [25, 9, 32], we model $f_\theta$ by a decoder MLP. At training, the network parameters and the latent codes are optimized by back-propagating gradients from the loss.

We assume knowledge of which shapes belong to the same subject $i$. However, the correspondence of deformations across subjects is unknown, because the configuration space is continuous and establishing it requires expensive manual annotation. To this end, shapes from the same subject share a common identity code $\alpha_i$, while each shape maintains its own deformation code $\beta_{ij}$. By forcing the network to share the same identity code for all shapes belonging to subject $i$, we prevent the optimizer from storing pose-specific information in these latent codes. As a corollary, all variation between different configurations with the same identity must be encoded in the deformation codes $\beta_{ij}$ for various $j$. We do not have any corresponding mechanism to prevent the optimizer from storing identity-specific information in the deformation codes, because they are not shared among individuals. In theory, it is possible for all information about a shape to be encoded in the codes $\beta_{ij}$. However, our experimental results indicate this does not happen in practice (cf. Section 4).
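The code-sharing rule above can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the class name `LatentCodes`, the helper `decoder_input`, and the small random initialization are assumptions made for illustration. Only the sharing rule (one identity code per subject, one deformation code per scan) and the 64-dimensional code size come from the text.

```python
import random


class LatentCodes:
    """Hypothetical container: one identity code per subject, one deformation code per scan."""

    def __init__(self, dim=64, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.identity = {}     # subject index i -> identity code shared by all scans of i
        self.deformation = {}  # (i, j) -> deformation code private to scan j of subject i

    def _new_code(self):
        # Small random initialization (an assumption, in the spirit of auto-decoders).
        return [self.rng.gauss(0.0, 0.01) for _ in range(self.dim)]

    def codes_for(self, i, j):
        if i not in self.identity:
            self.identity[i] = self._new_code()
        if (i, j) not in self.deformation:
            self.deformation[(i, j)] = self._new_code()
        return self.identity[i], self.deformation[(i, j)]


def decoder_input(x, alpha, beta):
    """Concatenate a 3D query point with both codes; this vector would feed the decoder MLP."""
    return list(x) + list(alpha) + list(beta)


codes = LatentCodes()
alpha_a, beta_a = codes.codes_for(0, 0)
alpha_b, beta_b = codes.codes_for(0, 1)
# Two scans of subject 0 share the identity code but not the deformation code.
print(alpha_a is alpha_b, beta_a is beta_b)
```

Because gradients from every scan of a subject flow into the same identity vector, that vector can only hold what is common to all of the subject's scans, which is the argument made in the paragraph above.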

Inspired by IGR [9] and MetaSDF [32], at training we minimize the energy of Eq. 2, a weighted sum of three loss terms, over the network parameters $\theta$ and all identity and deformation codes. The first term supervises the training with the point clouds. It enforces the SDF to be zero at the surface locations, i.e., the predicted SDF value $f_\theta(\mathbf{x}, \alpha_i, \beta_{ij})$ is zero for every surface point $\mathbf{x}$. The term also encourages the predicted normal, as defined by the gradient of the SDF, to be as similar as possible to the input normal. It is not sufficient to constrain the SDF using only surface points, as that leaves the majority of space without supervision. Therefore, the second term is used to regularize the SDF over the continuous 3D space. This is done by enforcing the predicted normal to be a unit vector, which is a general property of all signed distance functions (except at discontinuities, where the gradient is not well defined). In addition, a regularizer with a positive hyper-parameter prevents off-surface locations from creating a zero-isosurface. The third term encourages the latent space to be zero-mean and helps to prevent over-fitting. The weights of the terms are hyper-parameters used in training and inference.
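To make the three loss terms concrete, the sketch below evaluates plausible versions of them on an analytic sphere SDF that stands in for the decoder. The exact functional forms are assumptions: an absolute-value surface term plus a normal-difference penalty, a squared eikonal penalty, an exponential off-surface penalty `exp(-a|f|)`, and a squared-norm prior on the codes. The weights `tau` and `a` are hypothetical, and gradients are taken by finite differences where a network would use automatic differentiation.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Analytic SDF of a sphere; stands in for the decoder output."""
    return math.sqrt(sum(c * c for c in p)) - radius


def gradient(f, p, eps=1e-5):
    """Central finite-difference gradient of f at p."""
    g = []
    for k in range(3):
        hi, lo = list(p), list(p)
        hi[k] += eps
        lo[k] -= eps
        g.append((f(hi) - f(lo)) / (2.0 * eps))
    return g


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def data_term(f, points, normals, tau=1.0):
    """|f(x)| at surface points plus a penalty on the normal mismatch."""
    total = 0.0
    for x, n in zip(points, normals):
        g = gradient(f, x)
        total += abs(f(x)) + tau * norm([gi - ni for gi, ni in zip(g, n)])
    return total / len(points)


def eikonal_term(f, samples):
    """Penalize deviation of the gradient norm from 1 at free-space samples."""
    return sum((norm(gradient(f, x)) - 1.0) ** 2 for x in samples) / len(samples)


def off_surface_term(f, samples, a=100.0):
    """exp(-a |f|) is large only where an off-surface sample sits near the zero level set."""
    return sum(math.exp(-a * abs(f(x))) for x in samples) / len(samples)


def code_term(alpha, beta):
    """Squared-norm prior pulling both latent codes toward zero."""
    return sum(c * c for c in alpha) + sum(c * c for c in beta)


surface = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
normals = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -1.0)]
free_space = [(0.5, 0.2, 0.1), (1.5, -0.3, 0.4), (-0.2, 0.9, 1.1)]
print(data_term(sphere_sdf, surface, normals),
      eikonal_term(sphere_sdf, free_space),
      off_surface_term(sphere_sdf, free_space))
```

For an exact SDF all three printed values are close to zero, which is the fixed point these terms push the decoder toward.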

3.2 Dynamic Reconstruction

We now describe the inference of identity and deformation on novel observations using the trained model, thereby performing dynamic reconstruction of a deforming shape.

3.2.1 Single shape inference

We first discuss the inference for a single shape. Following related work on learning SDFs, we assume the training shapes are properly aligned such that the network learns the shape space in a ‘canonical’ coordinate frame. This greatly simplifies learning by removing the large-scale variations to SDFs caused by similarity transforms. However, it cannot typically be assumed that novel observations are provided in the canonical space. Instead, these observations appear in what we will call the world coordinates. We can still apply the canonical model by simply introducing a transformation to align the world to the canonical space and optimizing it in addition to the latent codes.

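Without loss of generality, we assume the transformation is a 6DoF rigid transformation $\mathbf{T}_{cw} \in SE(3)$. A point $\mathbf{x}_w$ of a shape in the world can be transformed to the canonical space by

$$\mathbf{x}_c = \mathbf{R}_{cw}\,\mathbf{x}_w + \mathbf{t}_{cw}, \tag{6}$$

with rotation $\mathbf{R}_{cw} \in SO(3)$ and translation $\mathbf{t}_{cw} \in \mathbb{R}^3$, where the subscript $cw$ indicates the transformation from world to canonical. The rigid transformation is differentiable. Inserting Eq. 6 into the energy function of Eq. 2 and optimizing the latent codes together with the pose $\mathbf{T}_{cw}$ yields a joint estimation of the shape and the pose.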
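The world-to-canonical map and its inverse can be written out directly. The following is a minimal sketch with hypothetical function names; the inverse is the map one would use to bring a canonical reconstruction back into the world frame for display.

```python
import math


def rot_z(angle_rad):
    """Rotation about the z axis, used here only to build an example pose."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def world_to_canonical(points_w, R_cw, t_cw):
    """Apply x_c = R_cw x_w + t_cw to every point."""
    out = []
    for x in points_w:
        out.append(tuple(sum(R_cw[r][k] * x[k] for k in range(3)) + t_cw[r]
                         for r in range(3)))
    return out


def canonical_to_world(points_c, R_cw, t_cw):
    """Inverse map x_w = R_cw^T (x_c - t_cw)."""
    out = []
    for x in points_c:
        d = [x[k] - t_cw[k] for k in range(3)]
        out.append(tuple(sum(R_cw[k][r] * d[k] for k in range(3))
                         for r in range(3)))
    return out


R = rot_z(math.pi / 2)
t = (1.0, 2.0, 3.0)
canonical = world_to_canonical([(1.0, 0.0, 0.0)], R, t)
print(canonical, canonical_to_world(canonical, R, t))
```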

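It is straightforward to optimize this problem with gradient descent for the latent codes and the translation $\mathbf{t}_{cw}$. However, 3D rotations are embedded in a non-Euclidean manifold. To take a proper gradient, we use the Lie algebra $[\boldsymbol{\omega}]_\times \in \mathfrak{so}(3)$ with $\boldsymbol{\omega} \in \mathbb{R}^3$, where the operator $[\cdot]_\times$ generates a skew-symmetric matrix. The exponential map $\exp([\boldsymbol{\omega}]_\times)$ converts $\boldsymbol{\omega}$ to the rotation matrix $\mathbf{R}$, and the log map operates vice versa. The Jacobian of the exponential map is analytically defined at $\boldsymbol{\omega} = \mathbf{0}$; elsewhere it is unstable and difficult to compute. Therefore, to optimize $\mathbf{R}_{cw}$, we rewrite the energy $E$ of Eq. 2 in terms of a small perturbation $\boldsymbol{\omega}$ applied to the current rotation estimate through the exponential map. This leads to an update in which the current rotation is composed with $\exp([-\eta\,\nabla_{\boldsymbol{\omega}} E]_\times)$, where the Jacobian is always evaluated at $\boldsymbol{\omega} = \mathbf{0}$ and $\eta$ is the learning rate for gradient descent. This optimization is more stable than updating $\boldsymbol{\omega}$ directly, with the Jacobian evaluated at the current estimate.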
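A minimal sketch of the exponential map and a perturbation-style rotation update follows. Rodrigues' formula is standard; the function names and the choice to left-multiply the perturbation are assumptions, since the text does not fix the convention.

```python
import math


def hat(w):
    """Skew-symmetric matrix [w]_x such that [w]_x v equals the cross product w x v."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_so3(w):
    """Rodrigues' formula: exp([w]_x) = I + sin(t)/t [w]_x + (1 - cos(t))/t^2 [w]_x^2."""
    t = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if t < 1e-8:
        a, b = 1.0, 0.5  # limits of the two coefficients as t -> 0
    else:
        a, b = math.sin(t) / t, (1.0 - math.cos(t)) / (t * t)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def update_rotation(R, grad_w, lr):
    """One descent step: compose the current estimate with exp([-lr * grad_w]_x).

    grad_w is the gradient of the energy with respect to the perturbation,
    evaluated at zero perturbation. Left-multiplication is an assumption;
    a right-multiplied perturbation is an equally valid convention.
    """
    step = exp_so3([-lr * g for g in grad_w])
    return matmul(step, R)


identity = exp_so3([0.0, 0.0, 0.0])
R_next = update_rotation(identity, [0.1, -0.2, 0.3], lr=0.5)
print(R_next)
```

Because every step multiplies by a proper rotation, the estimate stays on the rotation manifold without any re-orthogonalization.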

3.2.2 Initialization for Joint Optimization

The optimization defined in Sec. 3.2.1 is highly non-convex and depends on good initialization to avoid local minima. Our initial experience suggests that it is sufficient to initialize the latent codes with the $\mathbf{0}$-code and the translation with the shape center. However, the rotation needs to be reasonably accurate to ensure the initial estimate is within the basin of convergence. Our experience suggests the rotation should be within about 20° of the global optimum. To find the rotation initialization, we develop two methods: a search solution using policy gradient (PG) from reinforcement learning (RL), and a learning solution that predicts the pose with MLPs.

Searching rotation with policy gradient. Our experiments show that good initial rotations converge to a significantly lower cost, as defined by Eq. 2, and vice versa. Based on this observation, we can sample rotations and evaluate each hypothesis to seek a good guess. This strategy demands intense computation to cover the rotation space. We propose to use PG to perform the search efficiently with a probabilistic formulation.

In the language of RL, a policy $\pi_\phi(a)$ parameterized by $\phi$ is the probability of taking action $a$. Each action leads to a reward $r(a)$. For simplicity, we denote the policy with $\pi$. To find the action trajectory that maximizes the total rewards, PG shapes the action distribution by maximizing the expected rewards

$$J(\phi) = \mathbb{E}_{a \sim \pi}\left[ r(a) \right] \tag{7}$$

with respect to $\phi$. With gradient ascent, the policy is updated by $\phi \leftarrow \phi + \eta \nabla_\phi J(\phi)$. Note that the identity $\nabla_\phi \pi = \pi \nabla_\phi \log \pi$ yields

$$\nabla_\phi J(\phi) = \mathbb{E}_{a \sim \pi}\left[ r(a)\, \nabla_\phi \log \pi(a) \right]. \tag{8}$$

In RL, it is common for the policy to be rolled out across multiple actions with rewards often deferred to the end. However, our case is much simpler, as the action consists of selecting a rotation and the reward (loss) is determined immediately. With this approach, we can sample actions to optimize the policy without differentiating $r$. Consider an action space of $N$ randomly sampled rotations $\{\mathbf{R}_n\}_{n=1}^{N}$, where the action follows a multinomial distribution $\pi$. We can optimize each action and $\phi$ in an alternating fashion. At each iteration, we sample $M$ rotations from $\{\mathbf{R}_n\}$ according to the distribution $\pi$. For each sampled rotation, we optimize the energy defined in Eq. 2 for a fixed number of steps to follow the action trajectory and set the reward from the resulting cost. We then update $\phi$ with the gradient computed by Eq. 8, and iterate the process until convergence. Starting from a uniform distribution, policy gradient will shift $\pi$ to center on the preferred rotation. As a result, computation is effectively distributed to the more promising hypotheses.
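The search loop can be sketched with a softmax policy over a discrete set of hypotheses. This is a toy illustration under stated assumptions, not the authors' code: the softmax parameterization, the batch-mean baseline, and all hyper-parameter values are assumptions, and a fixed table of costs stands in for running a few optimizer steps of Eq. 2 from each sampled rotation.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def policy_gradient_search(costs, iters=300, batch=16, lr=0.1, seed=0):
    """REINFORCE over a discrete set of hypotheses.

    costs[k] stands in for the loss reached after briefly optimizing from
    hypothesis k; the reward of an action is the negative cost.
    """
    rng = random.Random(seed)
    n = len(costs)
    logits = [0.0] * n  # uniform initial policy
    for _ in range(iters):
        probs = softmax(logits)
        actions = rng.choices(range(n), weights=probs, k=batch)
        rewards = [-costs[a] for a in actions]
        baseline = sum(rewards) / batch  # variance reduction (an added assumption)
        grad = [0.0] * n
        for a, r in zip(actions, rewards):
            adv = r - baseline
            for k in range(n):
                # d log pi(a) / d logit_k for a softmax policy
                grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k]) / batch
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


toy_costs = [5.0, 4.0, 0.5, 3.0, 6.0]
probs = policy_gradient_search(toy_costs)
best = max(range(len(probs)), key=lambda k: probs[k])
print(best, [round(p, 3) for p in probs])
```

The reward is never differentiated: only sampled costs and the log-probability gradient of the policy enter the update, which is what lets the cost be an arbitrary inner optimization.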

Direct pose prediction with MLP. Despite being more efficient than exhaustive search, PG still requires evaluating hundreds of hypotheses to update the probability. For fast inference, we train an MLP to directly output the 6DoF pose from the world to the canonical space, which we refer to as PoseNet. To achieve this, one solution is to learn the mapping $\mathcal{P}_w \mapsto (\mathbf{R}, \mathbf{t})$, where $\mathcal{P}_w$ is a point cloud in the world. The problem with this design is that, because the canonical coordinates are arbitrarily defined, training can get stuck at local minima. To help the network, we instead learn $(\mathcal{P}_w, \mathcal{P}_c) \mapsto (\mathbf{R}, \mathbf{t})$, where $\mathcal{P}_c$ is a reference shape in the canonical space. With this modification, the network learns the relative instead of the absolute pose. Empirically, we found that this training behaves better. We also bootstrap the training with the mass center difference of the input shapes to roughly shift the world shapes to the canonical space. The final transformation between the input shape and the canonical space is then composed from this shift and the predicted relative rotation and translation.

3.2.3 Continuous inference

Given a sequential observation of shape deformation, we achieve dynamic reconstruction by estimating the geometry parameters (the identity and deformation codes) and the 6DoF pose for each time stamp $k$. For this purpose, we adopt an incremental optimization strategy to solve the single shape inference described in Sec. 3.2.1, where the optimization at time $k$ is initialized by the estimates at time $k-1$. At the first time stamp, the rotation is first solved by the methods in Sec. 3.2.2, followed by a full optimization for the shape and the pose. Because our shape embedding models identity and deformation separately, we can alternatively freeze the identity code or reduce its learning rate drastically after an identity shape is well observed.
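The benefit of initializing each frame from the previous estimate can be seen in a deliberately small toy problem. Here a single scalar stands in for the full state (latent codes and pose) and a quadratic stands in for the energy of Eq. 2; the function name `track` and all numbers are assumptions chosen only to illustrate warm starting under a fixed per-frame step budget.

```python
def track(targets, steps_per_frame=5, lr=0.2, warm_start=True):
    """Fit a scalar state to each frame's target with a few gradient steps on (x - target)^2."""
    estimates = []
    x = 0.0
    for target in targets:
        if not warm_start:
            x = 0.0  # cold start: discard the previous frame's estimate
        for _ in range(steps_per_frame):
            x -= lr * 2.0 * (x - target)
        estimates.append(x)
    return estimates


# A slowly drifting target mimics a smoothly deforming shape.
targets = [1.0 + 0.05 * k for k in range(10)]
warm = track(targets, warm_start=True)
cold = track(targets, warm_start=False)
print(abs(warm[-1] - targets[-1]), abs(cold[-1] - targets[-1]))
```

With the same number of steps per frame, the warm-started run ends much closer to the target because it only has to absorb the change between consecutive frames.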

3.3 3DH Dataset

In order to evaluate the proposed algorithms, we collect a large number of 3D scans of hands. Similar to the MANO dataset [29], we use the commercial 3DMD scanner, which directly outputs 4D reconstructions by fusing depth measurements from five synchronized RGB-D cameras.

Two types of data are collected. First, we capture the left hand from a variety of people. Each participant performs some predefined gestures, such as counting, grabbing, pointing, etc. During capture, participants rest their left arm on a fixed handle to produce aligned 3D meshes without post-processing. We refer to this dataset as 3DH, for 3D hands. In total, 183 subjects are captured, which are randomly split into 150 for training and 33 for testing. We reserve two random samples per training subject for validation. This provides 13820, 300 and 734 total samples for training, validation and testing, respectively. For the second collection, we remove the arm rest and ask participants to perform random left-hand motion freely in space. A total of 10 sequences from 5 people are captured, none of whom are included in the previous data capture. We refer to this data as 3DH-4D.

4 Experiments

Figure 3: Level of detail of the proposed DiForm versus MANO [29]. Panels include the input point cloud, the ground truth, and DiForm (ours). The ground truth mesh obtained by the scanning device is shown for reference. DiForm is able to express more accurate and detailed muscle deformation, such as creases and bulging, and to fill in the holes that are missing from the inputs.

Figure 4: Qualitative comparison on the 3DH test set (panels include the input point cloud and the ground truth). All methods use point clouds for reconstruction. The ground truth meshes only serve as a visual reference.

Implementation details.

We adopt the DeepSDF [25] auto-decoder backbone for shape embedding, modified to train with a disentangled latent space. The other modification is to add positional encoding [20] with 10 frequency bands to transform the input. Unless otherwise stated, the identity and the deformation codes are both set to 64 dimensions. The training shapes are scaled by a global multiplier estimated from the average hand size. For each shape, we randomly sample 16K surface points and 16K off-surface points according to a Gaussian distribution. Adam is used to optimize the network with a constant learning rate of 0.0005 for 1000 epochs, with a batch size of 24.
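A sinusoidal positional encoding with 10 frequency bands can be sketched as follows. The frequency schedule `2^k * pi` and the option to keep the raw coordinates are assumptions in the style of common NeRF-like encodings; the text only states the number of bands.

```python
import math


def positional_encoding(x, num_bands=10, include_input=True):
    """Map each coordinate p to (sin(2^k pi p), cos(2^k pi p)) for k = 0..num_bands-1."""
    out = list(x) if include_input else []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for p in x:
            out.append(math.sin(freq * p))
            out.append(math.cos(freq * p))
    return out


encoded = positional_encoding((0.1, -0.2, 0.3))
print(len(encoded))
```

Under these assumptions a 3D point becomes a 63-dimensional vector (3 raw coordinates plus 3 x 2 x 10 sinusoids), which would then be concatenated with the two 64-dimensional codes before entering the decoder.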

To train the pose prediction, we use a weight-sharing Siamese PointNet [28] to extract a 1024-dimensional feature vector for a given point cloud. The features of the query and reference shapes are concatenated before being passed to an MLP that outputs the pose. The learning is supervised with ground truth. To avoid discontinuity, the 6D vector proposed in [39] is used to parameterize rotation. At training, the reference shapes are augmented by adding small random perturbations to the learnt $\mathbf{0}$-code shape. At inference, the reference shape is set to the $\mathbf{0}$-code shape. We randomly sample 2822 shapes from the 3DH training set. The network is optimized by Adam with a constant learning rate of 0.0005 for 1000 epochs with a batch size of 36.
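A common continuous 6D rotation parameterization maps two 3-vectors to a rotation matrix by Gram-Schmidt orthogonalization. The sketch below assumes that this is the representation meant by [39]; the function name and the convention of placing the basis vectors in the columns are assumptions.

```python
import math


def rotation_from_6d(v):
    """Gram-Schmidt on two 3-vectors, returning a rotation matrix with columns b1, b2, b3."""
    a1, a2 = v[:3], v[3:]

    def normalize(u):
        n = math.sqrt(sum(c * c for c in u))
        return [c / n for c in u]

    b1 = normalize(a1)
    d = sum(p * q for p, q in zip(b1, a2))
    # Remove the component of a2 along b1, then normalize.
    b2 = normalize([q - d * p for p, q in zip(b1, a2)])
    # The third axis is the cross product, which guarantees a right-handed frame.
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


print(rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0]))
```

Any network output with two non-parallel halves maps to a valid rotation, so the regression target has no jumps of the kind that Euler angles or quaternions exhibit.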

4.1 Shape Reconstruction in Canonical Space

To reconstruct the SDF from point clouds in the canonical coordinates, we optimize the latent codes for 2000 epochs with Adam. In each iteration, 16K surface points and 16K off-surface points are sampled. For shapes belonging to the same subject, the identity code can be optimized separately from each single observation or jointly from all available observations. We refer to the separate identity optimization as DiForm-S and the joint version as DiForm-J. We compare DiForm to the state-of-the-art algorithm IGR [9], which is the baseline method that learns one latent space. We implement IGR with the same backbone network and parameters. Its latent space is set to 128 dimensions, giving it the same capacity as DiForm. For fair comparison, we also train IGR with the positional encoding (denoted as IGR-PE). To compare to state-of-the-art LBS-based hand modeling, we use MANO [29] as a baseline. For fair comparison, we use the published results from MANO and subdivide the meshes to a resolution similar to DiForm, i.e., approximately 100K vertices.

Shape representation power.

We train the baseline and DiForm on the 3DH training set with the same settings. Afterwards, we evaluate the reconstruction on the 3DH test set and the MANO [29] dataset with 271 left hands from 24 subjects. The results of MANO [29] are computed from the published model. To quantify the performance, we measure the Chamfer Distance (CD) between the raw scans and the reconstructions. Because raw scans can contain missing surfaces, we also report the sided CD, where $\text{CD}_{r \to g}$ denotes the CD from the reconstruction to the ground truth and $\text{CD}_{g \to r}$ vice versa. For all methods, we report the mean $\mu$ and the standard deviation $\sigma$, as shown in Tab. 1.
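The reason for reporting each direction separately can be seen in a brute-force sketch of the sided Chamfer Distance. The function names are hypothetical, and using unsquared Euclidean distances and averaging the two directions for the symmetric value are assumptions; a real evaluation would also use a spatial index rather than a double loop.

```python
import math


def sided_chamfer(src, dst):
    """Mean distance from each point in src to its nearest neighbor in dst."""
    total = 0.0
    for p in src:
        total += min(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in dst)
    return total / len(src)


def symmetric_chamfer(a, b):
    """Average of the two sided distances."""
    return 0.5 * (sided_chamfer(a, b) + sided_chamfer(b, a))


# A complete reconstruction compared against a scan with a hole:
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.0)]
print(sided_chamfer(reconstruction, scan),
      sided_chamfer(scan, reconstruction),
      symmetric_chamfer(reconstruction, scan))
```

The reconstruction-to-scan direction is penalized for surface that the scan is simply missing, while the scan-to-reconstruction direction is unaffected, so a method that fills holes can look worse in one direction without being less accurate.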

Overall, training with separate latent spaces yields lower CD in comparison to the LBS-based modeling and one-code training. Our method can even fill in the missing holes from the input point cloud by sharing information between different samples of the same hand, since back-propagated gradients update the same identity code. In Fig. 4, we visualize the reconstructed meshes from different methods, where DiForm-J yields less noisy reconstructions. The LBS-based MANO [29] algorithm shows very low error on the reconstruction-to-ground-truth CD. However, MANO's template meshes are significantly coarser, whereas our model can express intricate muscle deformations (see Fig. 3). The template meshes also self-penetrate when fingers are pressed into each other, which is not a problem for our implicit SDF representation.

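| Dataset | Method | CD mean | CD std | CD (recon. to GT) mean | CD (recon. to GT) std | CD (GT to recon.) mean | CD (GT to recon.) std |
|---|---|---|---|---|---|---|---|
| 3DH | IGR [9] | 1.1848 | 0.3981 | 1.7535 | 0.7281 | 0.6160 | 0.0963 |
| 3DH | IGR-PE | 1.1628 | 0.3720 | 1.7073 | 0.6752 | 0.6182 | 0.0929 |
| 3DH | DiForm-S | 1.0494 | 0.3060 | 1.4697 | 0.5257 | 0.6286 | 0.1178 |
| 3DH | DiForm-J | 1.0594 | 0.3180 | 1.4631 | 0.5413 | 0.6552 | 0.1157 |
| MANO | MANO [29] | 1.9549 | 0.3988 | 0.9174 | 0.2336 | 1.0375 | 0.3416 |
| MANO | IGR [9] | 1.9900 | 0.7551 | 3.3365 | 1.4862 | 0.6444 | 0.1034 |
| MANO | IGR-PE | 2.0340 | 0.7265 | 3.3905 | 1.4246 | 0.6786 | 0.1725 |
| MANO | DiForm-S | 1.8536 | 0.6816 | 3.0032 | 1.2860 | 0.7056 | 0.2156 |
| MANO | DiForm-J | 1.8085 | 0.6243 | 2.8921 | 1.2145 | 0.7253 | 0.1568 |

Table 1: Quantitative evaluation on canonical shape reconstruction. We report the Chamfer Distance (CD) in millimeters on 3DH and MANO.

| Split | Method | CD mean | CD std | CD (recon. to GT) mean | CD (recon. to GT) std | CD (GT to recon.) mean | CD (GT to recon.) std |
|---|---|---|---|---|---|---|---|
| Validation set | IGR [9] | 1.1001 | 0.2516 | 1.6534 | 0.5020 | 0.5471 | 0.0404 |
| Validation set | IGR-PE | 1.0707 | 0.2262 | 1.5951 | 0.4499 | 0.5467 | 0.0396 |
| Validation set | DiForm-S | 0.9993 | 0.2222 | 1.4455 | 0.4187 | 0.5529 | 0.0626 |
| Validation set | DiForm-J | 0.9977 | 0.1983 | 1.4400 | 0.3827 | 0.5557 | 0.0546 |
| Validation set | DiForm-C | 0.9955 | 0.1878 | 1.4312 | 0.3695 | 0.5596 | 0.0446 |
| Test set | IGR [9] | 1.3076 | 0.3261 | 1.9410 | 0.6292 | 0.6740 | 0.0746 |
| Test set | IGR-PE | 1.3681 | 0.3194 | 2.0266 | 0.5910 | 0.7100 | 0.0895 |
| Test set | DiForm-S | 1.1920 | 0.2821 | 1.6770 | 0.4953 | 0.7071 | 0.1039 |
| Test set | DiForm-J | 1.2217 | 0.2823 | 1.6924 | 0.4885 | 0.7509 | 0.1074 |
| Test set | DiForm-C | 1.2034 | 0.2533 | 1.6563 | 0.4435 | 0.7505 | 0.1024 |

Table 2: Quantitative evaluation of shape reconstruction conditioned on a fixed identity code. Chamfer Distance (CD) is in millimeters.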

Figure 5: DiForm reconstructs unseen shapes using pre-optimized identity codes from the same subjects.

Disentangle identity and deformation.

To examine the separation between the identity and the deformation representations, we designed three experiments. First, we examine whether the deformation can be transferred to different individuals. To this end, we randomly sample a set of deformation codes and identity codes that were not paired in training. Then we densely pair each deformation code with each identity code. The decoded shapes can be seen in Fig. 1. The result shows that our deformation code can express the same gesture in different identities, even though the identity codes and the deformation codes were not trained together. Similarly, the same identity code combined with different deformation codes shows visually similar hand geometry. In addition, we linearly interpolate the deformation codes and the identity codes to visualize the interpolated shape, and observe smooth transitions in both identity and gesture interpolation trajectories. Additional results are shown in the supplementary material.

We also conducted a quantitative evaluation for the identity-deformation exchange. We randomly select 100 pairs of different poses from different identities and then synthesize a new shape by taking the pose from one identity and combining it with the other identity. We fit our model to these shapes and compute the L2 distance between the optimized codes and the known input codes. We found that the optimized codes are always closer to the ground truth than to other codes sampled from the embedding space. The average distance of the optimized identity and deformation codes to all the other input codes is 7.19 and 22.60 respectively, compared to 2.29 and 4.81 to the ground truth. This further supports the separation of identity and deformation representations.

In the third experiment, we take an optimized identity code to reconstruct unseen shapes from the same individual while freezing the identity code. To this end, we use the trained identity codes to reconstruct the 3DH validation set, which shares the same identities with the training set. The experiment is also conducted on test sequences, where the identity code is first jointly optimized with 20 shapes before we freeze it to reconstruct the remaining shapes. Tab. 2 presents the quantitative comparison (cf. Fig. 5 for visual inspection). It can be seen that freezing the identity code (DiForm-C) outperforms the other methods, where the identity code is optimized together with the deformation code. It shows that the optimized identity code has captured the underlying shape and driven the deformation code to the corresponding gesture, suggesting a good separation of identity and deformation. The respective reconstructions are shown in the supplementary material.

Generalization to human body.

We further conducted an experiment to examine how DiForm performs on complex shapes. To this end, we trained DiForm with the DFaust [4] dataset using 11442 body shapes from 9 people. Since there are only a few subjects, Fig. 6 visualizes the generated shapes by combining 8 random deformation codes with 3 random identity codes. The results show that the deformation codes are capable of driving the identities to deform in a similar fashion.

Figure 6: Deformation transfer on the DFaust [4].

Figure 7: Dynamic reconstruction of a 4D point cloud in the world (top), with the DiForm output in the world (middle) and in the canonical space (bottom). We direct viewers to the supplementary video for better inspection.

Initial estimation (mean/standard deviation):

| Method | RPE: rotation (degree) | RPE: translation (mm) | CD: can. (mm) | CD: world (mm) |
|---|---|---|---|---|
| MHE | 17.343/27.118 | 10.344/6.4190 | 7.2924/1.5676 | 7.1615/1.2365 |
| PG | 6.2968/4.4954 | 5.2287/3.1341 | 7.2891/1.5588 | 6.7842/1.3300 |
| PN | 5.2534/3.3529 | 5.8876/3.9372 | 7.8876/1.8261 | 7.2946/1.5643 |

After joint shape and pose optimization (mean/standard deviation):

| Method | RPE: rotation (degree) | RPE: translation (mm) | CD: can. (mm) | CD: world (mm) |
|---|---|---|---|---|
| MHE | 12.233/28.327 | 6.9173/6.5961 | 3.6215/1.4521 | 1.6347/0.7655 |
| PG | 5.4425/3.5907 | 5.1671/3.3695 | 3.0956/1.1738 | 1.4422/0.5960 |
| PN | 4.5315/3.1874 | 4.6401/2.8217 | 2.9295/1.2387 | 1.3861/0.5221 |

Table 3: Quantitative evaluation of pose initialization. We compare multi-hypothesis evaluation (MHE), policy gradient (PG) and the pose network prediction (PN). We report CD to the ground truth in the canonical space (can.) and in the world.

4.2 Dynamic Reconstruction

First, we evaluate different algorithms on the 6DoF initialization. Lacking baseline algorithms, we construct one that approximates exhaustive search. To this end, we uniformly sample 2000 rotations as candidate hypotheses for initialization. We run 400 iterations with 1000 surface points to evaluate each hypothesis and take the one that corresponds to the lowest loss as the initialization. We refer to this baseline as MHE. To quantify the performance, we generate a dataset of 136 random shapes with random rotations. In Tab. 3, we report the pose and the geometry error evaluated on the predicted initialization and after the joint optimization. The results show similar performance for PG and PN. The advantage of PG is that no training is required and it is guaranteed to find the solution. The disadvantage is that PG takes significantly longer to compute than PN, which outputs the prediction with a single forward pass. Fig. 7 shows a demo of DiForm reconstructing a dynamic 4D point cloud in the world. Our method robustly tracks the 6DoF pose and simultaneously optimizes the geometry with fine details.
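A multi-hypothesis baseline of this kind needs uniformly distributed rotations and a way to score them. The sketch below draws uniform rotations by normalizing 4D Gaussian samples into unit quaternions, which is one standard way to do so; the text does not state how the 2000 rotations were sampled, so the sampler, the function names, and the toy cost (angular distance to a known target, standing in for the loss after a short optimization) are all assumptions.

```python
import math
import random


def random_unit_quaternion(rng):
    """Normalize a 4D Gaussian sample: uniform on the unit 3-sphere, hence uniform over rotations."""
    while True:
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        n = math.sqrt(sum(c * c for c in q))
        if n > 1e-12:
            return [c / n for c in q]


def quaternion_to_matrix(q):
    w, x, y, z = q
    return [[1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
            [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
            [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)]]


def rotation_angle_deg(Ra, Rb):
    """Geodesic angle between two rotations: acos((trace(Ra^T Rb) - 1) / 2)."""
    tr = sum(Ra[k][i] * Rb[k][i] for i in range(3) for k in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(c))


def best_hypothesis(hypotheses, cost_fn):
    """Multi-hypothesis evaluation: score every candidate and keep the cheapest."""
    return min(hypotheses, key=cost_fn)


rng = random.Random(0)
candidates = [quaternion_to_matrix(random_unit_quaternion(rng)) for _ in range(2000)]
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best = best_hypothesis(candidates, lambda R: rotation_angle_deg(R, target))
print(rotation_angle_deg(best, target))
```

The same `rotation_angle_deg` function is the kind of measure one would use for a rotation error reported in degrees.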

4.3 Limitations and Discussions

Figure 8: Failure cases, due to missing significant observations or rare shapes too different from training.

Fig. 8 shows some failure cases. We observe that DiForm is under-constrained by highly incomplete point clouds. This drawback is mitigated in 4D reconstruction, where DiForm can leverage all available data over time. We also observe that DiForm struggles to extrapolate to unseen shapes, which could be improved with more diverse training data. When latent codes are linearly interpolated, the resulting shapes can contain artifacts (cf. supplementary). This suggests that the latent space is not always smooth, so linear interpolation can deviate from the manifold of valid shapes. Since DiForm is template-free, it cannot support conventional animation. Although we demonstrate motion transfer and dynamic reconstruction, it is difficult to interpret DiForm parameters semantically.

5 Conclusion

This paper proposes DiForm, a neural deformable model with disentangled identity that reconstructs SDFs from dynamic 3D point clouds in arbitrary coordinates. To this end, we explicitly represent deformable shapes with two embedding spaces, one describing identity-related information and the other describing the generic deformation shared among all shape instances. The experiments provide good evidence that DiForm separates identity from deformation: the deformation codes drive similar deformations for different identity embeddings, and the identity code can be solved from a few observations and then conditioned on various deformations. We further developed an end-to-end solution to reconstruct observations in arbitrary coordinates using the learnt embeddings. Our algorithm is evaluated on a large dataset of 3D hand scans, where it outperforms state-of-the-art hand modeling methods. Despite not yet having all the answers, DiForm is a valuable exploration of the solution space and shows new possibilities for a purely implicit representation. A future direction is to explore hybrid approaches that combine it with template models.


  • [1] Brett Allen, Brian Curless, and Zoran Popović. Articulated body deformation from range scan data. ACM Trans. Graph., 21(3):612–619, July 2002.
  • [2] Brett Allen, Brian Curless, Zoran Popović, and Aaron Hertzmann. Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’06, page 147–156, 2006.
  • [3] Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration. In Neural Information Processing Systems (NeurIPS), December 2020.
  • [4] Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Dynamic FAUST: Registering human bodies in motion. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [5] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10843–10852, 2019.
  • [6] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization & Computer Graphics, 20(03), mar 2014.
  • [7] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5939–5948, 2019.
  • [8] Boyang Deng, JP Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. Neural articulated shape approximation. In The European Conference on Computer Vision (ECCV). Springer, August 2020.
  • [9] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • [10] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.
  • [11] Korrawe Karunratanakul, Jinlong Yang, Yan Zhang, Michael Black, Krikamol Muandet, and Siyu Tang. Grasping field: Learning implicit representations for human grasps. In Proceedings of the International Conference on 3D Vision (3DV), 2020.
  • [12] S. Khamis, J. Taylor, J. Shotton, C. Keskin, S. Izadi, and A. Fitzgibbon. Learning an efficient model of hand shape variation from depth images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2540–2548, 2015.
  • [13] Paul G. Kry, Doug L. James, and Dinesh K. Pai. EigenSkin: Real time large deformation character skinning in hardware. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Association for Computing Machinery, 2002.
  • [14] Dominik Kulon, Haoyang Wang, Riza Alp Güler, Michael Bronstein, and Stefanos Zafeiriou. Single image 3d hand reconstruction with mesh convolutions. In Proceedings of the British Machine Vision Conference (BMVC), 2019.
  • [15] J. P. Lewis, Matt Cordner, and Nickson Fong. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In SIGGRAPH, page 165–172, 2000.
  • [16] Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017.
  • [17] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
  • [18] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4460–4470, 2019.
  • [19] Marko Mihajlovic, Yan Zhang, Michael J Black, and Siyu Tang. LEAP: Learning articulated occupancy of people. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2021.
  • [20] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [21] Gyeongsik Moon, Takaaki Shiratori, and Kyoung Mu Lee. Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pages 440–455, 2020.
  • [22] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the International Conference on Computer Vision (ICCV), pages 5379–5389, 2019.
  • [23] Ahmed A A Osman, Timo Bolkart, and Michael J. Black. STAR: Sparse trained articulated human body regressor. In European Conference on Computer Vision (ECCV), 2020.
  • [24] Pablo Palafox, Aljaž Božič, Justus Thies, Matthias Nießner, and Angela Dai. Npms: Neural parametric models for 3d deformable shapes. arXiv preprint arXiv:2104.00702, 2021.
  • [25] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [26] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
  • [27] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301, 2009.
  • [28] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017.
  • [29] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6):245, 2017.
  • [30] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the International Conference on Computer Vision (ICCV), pages 2304–2314, 2019.
  • [31] Shunsuke Saito, Jinlong Yang, Qianli Ma, and Michael J. Black. SCANimate: Weakly supervised learning of skinned clothed avatar networks. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2021.
  • [32] Vincent Sitzmann, Eric R. Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. Metasdf: Meta-learning signed distance functions. In arXiv, 2020.
  • [33] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Neural Information Processing Systems (NeurIPS), 33, 2020.
  • [34] Jonathan Taylor, Lucas Bordeaux, Thomas Cashman, Bob Corish, Cem Keskin, Toby Sharp, Eduardo Soto, David Sweeney, Julien Valentin, Benjamin Luff, Arran Topalian, Erroll Wood, Sameh Khamis, Pushmeet Kohli, Shahram Izadi, Richard Banks, Andrew Fitzgibbon, and Jamie Shotton. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Trans. Graph., 35(4), July 2016.
  • [35] Jonathan Tompson, Murphy Stein, Yann Lecun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph., 33(5), Sept. 2014.
  • [36] Ze Yang, Shenlong Wang, Sivabalan Manivasagam, Zeng Huang, Wei-Chiu Ma, Xinchen Yan, Ersin Yumer, and Raquel Urtasun. S3: Neural shape, skeleton, and skinning fields for 3d human modeling, 2021.
  • [37] Tarun Yenamandra, Ayush Tewari, Florian Bernard, Hans-Peter Seidel, Mohamed Elgharib, Daniel Cremers, and Christian Theobalt. i3dmm: Deep implicit 3d morphable model of human heads, 2021.
  • [38] Keyang Zhou, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Unsupervised shape and pose disentanglement for 3d meshes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 341–357, 2020.
  • [39] Yi Zhou, Connelly Barnes, Lu Jingwan, Yang Jimei, and Li Hao. On the continuity of rotation representations in neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.