
NIMBLE: A Non-rigid Hand Model with Bones and Muscles

02/09/2022
by   Yuwei Li, et al.
Clemson University

Emerging Metaverse applications demand reliable, accurate, and photorealistic reproductions of human hands to perform sophisticated operations as if in the physical world. While the real human hand represents one of the most intricate coordinations between bones, muscles, tendons, and skin, state-of-the-art techniques unanimously focus on modeling only the skeleton of the hand. In this paper, we present NIMBLE, a novel parametric hand model that includes the missing key components, bringing 3D hand modeling to a new level of realism. We first annotate muscles, bones, and skin on the recent Magnetic Resonance Imaging hand (MRI-Hand) dataset and then register a volumetric template hand onto individual poses and subjects within the dataset. NIMBLE consists of 20 bones as triangular meshes, 7 muscle groups as tetrahedral meshes, and a skin mesh. Via iterative shape registration and parameter learning, it further produces shape blend shapes, pose blend shapes, and a joint regressor. We demonstrate applying NIMBLE to modeling, rendering, and visual inference tasks. By enforcing the inner bones and muscles to match anatomic and kinematic rules, NIMBLE can animate 3D hands to new poses with unprecedented realism. To model the appearance of skin, we further construct a photometric HandStage to acquire high-quality textures and normal maps that capture wrinkles and palm prints. Finally, NIMBLE also benefits learning-based hand pose and shape estimation by either synthesizing rich data or acting directly as a differentiable layer in the inference network.



1. Introduction

In the production of animated feature films, VFX juggernauts spend most of their time and resources on rendering faces, hair, and skin complexion as a shortcut for creating realistic humans and emotions. In contrast, renderings of other parts of the human body, particularly human hands, are often glossed over. The reason is simple: VFX producers can easily direct the focus of audience attention away from hands and onto hair and skin, where they can generate the most life-like and thereby attention-grabbing visual effects. Indeed, we rarely see scenes featuring complex and dexterous hand movements. At the dawn of the Metaverse, however, emerging virtual reality consumer products will be decidedly more intimate, immersive, and interactive and will therefore demand life-like renderings of all body parts, especially hands. When users put on head-mounted displays, their virtual hands should replace the physical ones to perform as many operations in the Metaverse as in real life. The dexterousness of human hands - the complex geometric structures, the marvelous things and subtle messages that fingers can construct, create, and pass on when they move - defines us humans as intelligent beings in both the physical and digital worlds.

Hands, however, are difficult to model. Through evolution, hand movements have become an intricate orchestration of bones, muscles, ligaments, nerves, and skin. Performing a specific gesture, for example, stems from the dragging and pulling of the hand muscles, which drives the movements of the bones and eventually changes the shape and appearance through muscle deformation and skin wrinkling. To faithfully reproduce 3D hand movements, it is critical not only to model each individual component but, more importantly, to model their delicate coordination. In contrast to the tremendous efforts on 3D human body modeling (Loper et al., 2015; Hirshberg et al., 2012; Anguelov et al., 2005; Pons-Moll et al., 2015), research on photo-realistic 3D human hands is rather limited. So far, the majority of prior art has focused on modeling the skeleton of the hand. In fact, the most widely adopted hand model, MANO (Romero et al., 2017), defines the skeleton in terms of empirical joint locations without taking into account anatomical bone structure. The recent PIANO (Li et al., 2021) model extends MANO by employing anatomically correct bone structures and shapes, as well as joints to connect the bones. PIANO manages to produce more convincing movements, but the resulting appearance still lacks realism as it ignores the muscles and skin that deform along with the bones. In reality, even a routine posture such as grabbing or making a fist requires 24 muscle groups surrounding the hand and wrist (Panchal-Kildare and Malone, 2013) to work together to execute extension and flexion. Shape deformations of these muscle groups subsequently affect the appearance of the skin covering them and the overall realism of virtual hand rendering. To date, the graphics community still lacks a reliable and practical musculoskeletal parametric hand model.

In this paper, we present NIMBLE, a Non-rIgid hand Model with skin, Bones, and muscLEs, to bring 3D modeling of dynamic hands to a new level of realism. Our work is enabled by the recent Magnetic Resonance Imaging hand (MRI-Hand) dataset (Li et al., 2021), which captures 35 subjects (19 male, 16 female) in 50 different hand poses, along with annotated segmentation labels of bones. We first conduct comprehensive segmentation annotation to further identify muscles and skin in the MRI-Hand dataset. Similar to how SMPL (Loper et al., 2015) starts with a T-pose for modeling comprehensive movements, NIMBLE utilizes a general rest hand pose as the template, which includes 20 bones, 7 muscle groups, and a skin mesh. In particular, for the sake of computation and rendering efficiency, we cluster the 24 anatomical muscle groups into 7 while preserving as much physical meaning as possible.

To derive a new parametric hand model, NIMBLE uses triangle meshes to model bones and tetrahedral meshes for the deformable muscles and skin. The formulation manages to model shape deformations while maintaining structural rigidity. To register the internal structures under the rest pose to the ones in MRI-Hand, we present a multi-stage registration technique that leverages the benefits of pose initialization, the interdependence of hand structures, and physically based simulation. Our technique accurately models the elasticity of deformations without sacrificing pose accuracy. The registered results lead to a new parametric model containing bone, muscle, and skin template meshes, a kinematic tree, shape blend shapes, pose blend shapes, and a joint regressor. In addition to parameter learning, we further impose penalty terms to avoid collisions and enforce physically correct muscle deformations.

Applications of NIMBLE span from geometric modeling and rendering to visual inference. We first demonstrate applying NIMBLE to animate 3D hands of different shapes and under arbitrary poses. NIMBLE provides an unprecedented level of realism by enforcing inner bones and muscles to match anatomic and kinematic rules. To further enhance visual realism, we construct a photometric appearance capture system called HandStage, analogous to the USC LightStage (Debevec, 2012), to acquire high-quality textures with appearance details (e.g., normal maps), including wrinkles and palm prints. These appearance details further improve visual fidelity and adapt to new lighting conditions. NIMBLE, as a parametric model, can be further integrated into state-of-the-art animation systems and rendering engines such as Blender and Unity for VR and AR applications. Furthermore, the parametric NIMBLE representation benefits learning-based hand pose and shape estimation on both MRI and RGB imagery, by either synthesizing rich data with varying shapes and poses or acting directly as a differentiable layer in the inference network.

To summarize, our main contributions include:

  • We exploit the recent MRI-Hand dataset to conduct complete segmentation annotation for bones, muscles, and skin, and provide auto-registered meshes obtained by optimization with physical constraints.

  • We derive a parametric NIMBLE model by iterating between shape registration and parameter learning, using new penalty terms to guarantee physically correct bone movements and muscle deformations.

  • We demonstrate using NIMBLE for anatomically correct digital hand synthesis, motion animation and photorealistic rendering. The results benefit many downstream tasks including pose and shape inference, visual tracking, etc.

  • We make available our NIMBLE model and annotation data at https://reyuwei.github.io/proj/nimble.

2. Related Work

In this section, we survey closely related works and discuss their relationship to the proposed work.

Parametric Models.

With parametric modeling, a low-dimensional parametric space is estimated to approximate human body geometry. Many existing methods incorporate linear blend skinning, which deforms a mesh based on a linear combination of rigid transformations of associated bones, on top of various skeletal representations. With carefully designed skinning weights, these representations can produce reasonable deformations for articulated body tissues. A pioneering work on 3D morphable faces was proposed by (Blanz and Vetter, 1999). Since then, numerous methods have learned 3D face shape and expression from scanning data (Allen et al., 2006; Zollhöfer et al., 2018). The advantage of such geometric models is their ability to represent a variety of face shapes and a wide range of expressions. Unlike most face models that focus only on the facial region, the recent popular models FLAME (Li et al., 2017) and its extension DECA (Feng et al., 2021) consider the whole head and neck regions instead. With the entire head, the authors were able to include a simulated jaw joint to achieve large facial pose deformations and expressions. The availability of 3D body scanners enabled learning of body shape from scans. Since the CAESAR dataset opened up the learning of body shape (Allen et al., 2003), most early works focused on modeling only body shape (varying with identity) using subjects scanned in roughly the same pose. Combining body shapes from a group of people and different poses of a single subject, (Anguelov et al., 2005) learned a factored model of both body shape and pose based on triangle deformations. Following this work, many human body parametric models were built using either triangle deformations (Pons-Moll et al., 2015; Hirshberg et al., 2012) or vertex-based displacements (Hasler et al., 2010; Loper et al., 2015); however, all these works focus on modeling body shape and pose without the hands or face. Compared with the face and body, human hands are more complex to model parametrically due to the extreme flexibility of hand motion. Early 3D hand models are typically not learned but are based on shape primitives (Melax et al., 2013; Oikonomidis et al., 2011; Schmidt et al., 2014), reconstructed with multiview stereo with fixed shape (Ballan et al., 2012; Tzionas et al., 2016), use non-learned per-part scaling parameters (de La Gorce et al., 2011), or use simple shape spaces (Tkach et al., 2016). Only recently have learned hand models been proposed (Romero et al., 2017; Khamis et al., 2015; Li et al., 2021). (Khamis et al., 2015) collected partial depth maps of 50 people to learn a model of shape variation; however, they did not capture a pose space. (Romero et al., 2017), on the other hand, learned a parametric hand model (MANO) with both a rich shape and pose space from 3D scans of 31 subjects in up to 51 poses, following the SMPL (Loper et al., 2015) formulation. (Li et al., 2021) built a parametric hand bone model from MRI data that drives hand shape and pose using real bone and joint structures.

Hand Models.

Hand modeling is an essential topic in computer graphics. Many hand models have been proposed; Table 1 summarizes them, categorized according to whether they emphasize the hand's inner biomechanical structures (Albrecht et al., 2003; Wang et al., 2019, 2021; Li et al., 2021) or its outer shape, color, and texture (Romero et al., 2017; Moon et al., 2020; Qian et al., 2020).

Hand Inner Biomechanical Model.

Hand structure and function are biomechanically complex. Therefore, physically based kinetic simulation is essential to model hand pose and shape. For example, early works used a simulated underlying hand skeleton to constrain the generation of a solid hand surface mesh (Capell et al., 2005; Kim and Pollard, 2011; Liu et al., 2013). Beyond bones, hand skin and tendons have also been widely considered to refine the visual appearance or control of hand articulation (Sueda et al., 2008; Li et al., 2013; Sachdeva et al., 2015). Musculotendon modeling and simulation (Kadleček et al., 2016; Lee et al., 2018; Abdrashitov et al., 2021) have also been studied for the human body. More recently, to achieve real-human-like hand animation, researchers have paid more attention to anatomical structures instead of simulated models. (Mirakhorlo et al., 2018) built a comprehensive biomechanical hand model based on detailed measurements of a hand specimen, yet it is not differentiable and cannot be embedded in deep learning frameworks.

(Anas et al., 2016) proposed a statistical wrist shape and bone model for automatic carpal bone segmentation. (Wang et al., 2019) acquired a single subject's complete hand bone anatomy, and later muscle anatomy (Wang et al., 2021), in multiple poses using magnetic resonance imaging (MRI), to build an anatomically correct hand bone rig of a target performer. However, the approach relies on time-consuming, user-specified bone segmentation, which is impractical to apply to many individuals when building a parametric model. (Li et al., 2021) constructed a parametric hand bone model named PIANO from multi-subject, multi-pose MRI acquisitions; it is physically precise, differentiable, and can be applied in deep neural networks for computer vision tasks. However, it still lacks more comprehensive anatomical structures, such as muscles, to support more realistic generation of the hand's outer surface.

Hand Outer Appearance Model.

Even when fully driven by the underlying biomechanical hand structures, skinning remains an indispensable procedure for generating high-fidelity hand animation. (Lewis et al., 2000) proposed Pose-Space Deformation (PSD), which combines skeleton subspace deformation (Magnenat-Thalmann et al., 1989) with artist-corrected pose shapes. It is widely used in industry due to its speed, simplicity, and ability to incorporate real-world scans and arbitrary artist corrections. Kry and his collaborators (2002) further proposed to use Principal Component Analysis (PCA) to represent the large set of pose corrections. Recently, (Romero et al., 2017) augmented an LBS-based hand surface model with statistical individual- and pose-dependent parametric correctives, constructing a system referred to as MANO. MANO has been widely used in a variety of hand fitting and tracking scenarios, including hand interactions (Hasson et al., 2019; Mueller et al., 2019) and single RGB image hand pose estimation (Baek et al., 2019). These approaches are fully constrained by the underlying MANO model, which lacks biomechanically correct constraints from the underlying hand tissue, and thus may fail to replicate subtle details of hand geometry like creases and bulging. Inspired by these works, we construct the first complete parametric hand model with bone, muscle, and skin. By incorporating inherent kinematic structures and considering physically precise constraints, NIMBLE enables authentic hand shape and appearance generation and differentiable training for many downstream tasks.

Table 1. NIMBLE vs. existing hand models, compared in terms of whether each is a parametric model and which aspects it covers: skin, bone, muscle, shape, pose, and appearance. The compared models are (Albrecht et al., 2003), (Wang et al., 2019), (Wang et al., 2021), MANO (Romero et al., 2017), HTML (Qian et al., 2020), PIANO (Li et al., 2021), and NIMBLE (Ours).

3. Overview

Figure 2. (a) MRI annotation of bone and muscle mask. (b) Original slices. (c) Reconstructed bone, muscle and skin mesh, joints are visualized in red.

We present a novel method for non-rigid parametric hand modeling with bones and muscles (NIMBLE). To the best of our knowledge, this is the first anatomy-based parametric hand model that can simultaneously model the interior hand kinematic structure and the exterior hand shape with the high-fidelity appearance of individuals.

NIMBLE is developed on a large amount of hand data with annotated inner and outer structural features. Specifically, we use the MRI hand dataset from (Li et al., 2021) and further annotate the MRI data to segment out the muscles and skin beyond the original annotations, as shown in Figure 2. The hand appearance is represented as textures with diffuse, normal, and specular maps, collected by a photometric appearance capture system called HandStage. We then build a parametric hand model by registering the hand template to all the interior and exterior hand features and photometric appearances in the dataset. After registration, we extend the general hand modeling pipeline of (Romero et al., 2017; Li et al., 2021), so that NIMBLE learns a complete anatomy-based hand model with bones, muscles, and textures by iteratively fitting the hand template to the multi-modal data and regenerating the parametric model from the registered multi-modal features. The pipeline of our method is shown in Figure 3. The rest of the paper is organized as follows: we first introduce our data collection and annotation in Section 3.1. Next, we describe our model formulation in Section 4.1, followed by a physically based registration in Section 4.2 and multi-stage parameter learning on shape and pose in Section 4.3. After obtaining the hand template, we attach the hand appearance to achieve photo-realistic rendering, as discussed in Section 5. In Section 6, we evaluate the effectiveness of NIMBLE on numerous 3D hands with different poses, shapes, appearances, and photorealistic rendering conditions. We also show that our method can be easily fitted into hand inference pipelines with various inputs.

3.1. MRI Data Collection and Preparation

The dataset from (Li et al., 2021) contains 200 hand MRI volumes spanning 50 different hand postures of 35 individual subjects. However, it only provides annotations for the bone mask and 25 joint positions per hand. In this paper, we regenerate a fine-grained bone segmentation mask on each MRI volume using radial basis functions and the joint annotations. Additionally, we handcraft binary muscle masks on each volume slice using Amira (Amira, 2022). For time efficiency, we only annotate large and notable muscle areas on each MRI scan; our model registration algorithm, as discussed in Section 4.2, automatically fills in the missing parts. For skin annotation, we use an automatic thresholding method (Otsu, 1979) to extract the skin mask. Next, we extract the iso-surfaces of bone, muscle, and skin by applying the Marching Cubes algorithm (Lorensen and Cline, 1987) to obtain rough hand meshes, as shown in Figure 2(c).
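The thresholding-and-meshing step above can be sketched with standard scientific Python tools. The snippet below is only an illustrative reconstruction under assumed variable names (mri_volume, voxel spacing); it uses scikit-image's Otsu threshold and Marching Cubes routines, not the authors' actual preprocessing code.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import marching_cubes

def extract_skin_mesh(mri_volume: np.ndarray, spacing=(1.0, 1.0, 1.0)):
    """mri_volume: 3D array of MRI intensities; returns (vertices, faces) of a rough skin mesh."""
    # Automatic global threshold separates tissue from background (Otsu, 1979).
    level = threshold_otsu(mri_volume)
    skin_mask = (mri_volume > level).astype(np.float32)
    # Marching Cubes (Lorensen and Cline, 1987) extracts the iso-surface of the mask;
    # `spacing` converts voxel indices to physical units (e.g., millimetres).
    verts, faces, normals, _ = marching_cubes(skin_mask, level=0.5, spacing=spacing)
    return verts, faces

# Example usage on a synthetic volume:
verts, faces = extract_skin_mesh(np.random.rand(64, 64, 64), spacing=(1.2, 0.5, 0.5))
```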

4. NIMBLE

Modeling a photo-realistic hand is challenging: the interior structure of the hand, including muscles and bones, is unobserved, yet the non-rigid, elastic properties of muscles and the biomechanical connection between muscles and bones largely determine the exterior shape of the hand. State-of-the-art solutions neglect this connection and thus fail to capture realistic effects such as muscle and skin bulging. We address this issue by jointly modeling the internal hand structure with the outer appearance while considering the physical constraints on the relative motion between muscles, bones, and skin.

Figure 3. Overview of building NIMBLE, which includes inner and outer registration and parametric model learning, as well as applications of NIMBLE to synthetic hand generation and photorealistic rendering. θ, β, and α are parameters that control the model's pose, shape, and appearance.

4.1. Model Formulation

Figure 4. Three cutaway views of the tetrahedral mesh of our template hand.
Tissue # parts # vertices # faces # tetrahedron
Bone 20 3345 6610 -
Muscle 7 5635 10512 15986
Skin 1 5990 9984 19562
Table 2. NIMBLE template mesh details. Number of semantic parts, vertices, mesh faces and tetrahedrons.

The general formulation of NIMBLE is defined as follows:

M(θ, β, α) = {G(θ, β), A(α)},    (1)

where G denotes the hand geometry and A models the hand appearance. θ, β, and α are parameters controlling hand pose, shape, and appearance, respectively. In this section, we focus on the model of hand geometry.

To generate an accurate hand template, we extend the PIANO (Li et al., 2021) pipeline, which only considers the bone structures, shapes, and joints, and add the muscle and skin features to the template formulation:

G(θ, β) = LBS(T_P(θ, β), J(β), θ, W),    (2)

where LBS denotes the Linear Blend Skinning function; W is its learned skinning weight matrix; J(β) represents the joint locations; θ is an array of joint rotation axes and angles; β is the PCA coefficient vector of the shape space; and T_P(θ, β) is a person-specific hand template mesh. In other words, we can formulate the geometry of arbitrary individuals by Eqn. 2, as long as we know T_P. Specifically, J(β) is defined by a joint regressor that maps the bone mesh vertices to joint locations while taking into account the shape parameters β; we refer readers to (Li et al., 2021; Loper et al., 2015) for details. Note that we use the bone mesh because joints are the essential rotation centers of bone segments, which are invariant to skin and muscle shape.

The personalized template T_P(θ, β) is a linear combination of the general hand template T̄, the pose blend shape B_P(θ), and the shape blend shape B_S(β) (Eqns. 2, 3, and 4 of (Romero et al., 2017)), where B_P(θ) is the product of the pose blend shapes and the pose rotation matrices, and B_S(β) is the product of the orthonormal PCA shape basis and β. B_P and B_S correct artifacts introduced by LBS by adding vertex offsets to the general template T̄. We use the same number of rotation joints as in PIANO (Li et al., 2021). In the following, we use θ and β to parameterize the hand geometry.

Unlike the popular surface modeling methods, i.e., SMPL (Loper et al., 2015) and MANO (Romero et al., 2017), we define the general hand template T̄ by jointly considering the bone, muscle, and skin meshes as a whole set. For bones, we adopt the triangular mesh settings in PIANO (Li et al., 2021) due to the rigid deformation property of bones. We use tetrahedral meshes to model muscle and skin so that NIMBLE can capture non-rigid motion effects such as muscle bulging and skin wrinkling. Given that each hand has about 24 major muscles, which would significantly increase computation and rendering cost if NIMBLE modeled all of them, we follow (Schwarz and Taylor, 1955; Erolin et al., 2016) and anatomically integrate the muscle groups into 7 based on their functional and physical properties. For the skin template, we use the MANO topology as an initialization. We manually register the bone, muscle, and skin meshes to the same rest pose in the MRI dataset using Wrap3D (R3DS, 2022), so that the three components lie in the same physical space. The process takes less than 10 minutes and is only performed once. We then remesh the registered triangular meshes of the seven muscles and the skin using an isotropic explicit remeshing algorithm (Alliez et al., 2003) with a target edge length of 3 mm, and create tetrahedral meshes using TetGen (Si, 2015). The model details are listed in Table 2, and Figure 4 shows three cutaway views of our template mesh with tetrahedra. Our complete hand template mesh consists of 14970 vertices and 27106 faces.
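To make the structure of Equation (2) concrete, the following is a minimal PyTorch sketch of the skinning step, assuming the blend-shape bases and the per-joint global transforms (obtained from forward kinematics along the kinematic tree, omitted here) are already available. All tensor names and shapes are illustrative and do not reflect the released NIMBLE API.

```python
import torch

def nimble_geometry(t_bar, shape_dirs, pose_dirs, skin_weights,
                    beta, pose_feature, global_transforms):
    """
    t_bar:             (V, 3)    rest-pose template vertices (bone + muscle + skin)
    shape_dirs:        (V, 3, S) shape blend shapes (PCA basis)
    pose_dirs:         (V, 3, P) pose blend shapes
    skin_weights:      (V, J)    learned LBS weights
    beta:              (S,)      shape coefficients
    pose_feature:      (P,)      flattened pose-rotation feature
    global_transforms: (J, 4, 4) per-joint world transforms from forward kinematics
    """
    # Personalised template: T_bar + B_S(beta) + B_P(theta).
    t_personal = t_bar + shape_dirs @ beta + pose_dirs @ pose_feature
    # Blend the per-joint transforms with the skinning weights (linear blend skinning).
    blended = torch.einsum('vj,jab->vab', skin_weights, global_transforms)   # (V, 4, 4)
    homo = torch.cat([t_personal, torch.ones(len(t_personal), 1)], dim=1)    # (V, 4)
    posed = torch.einsum('vab,vb->va', blended, homo)[:, :3]
    return posed
```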

4.2. Muscle Registration

Figure 5. Registered hand muscles from MRI segmentation. Observe that the muscles around the arrows become thicker and tighter after registration.

Before training the parametric hand model, we first need to register the general template mesh to the whole dataset so that all scans share the same topology. However, mesh registration is a long-standing open problem, and our goal of registering low-resolution meshes from a large MRI dataset makes it even harder. In this paper, we bypass the manual landmark labelling method (Wang et al., 2021) and propose a physically based multi-stage registration algorithm that can model accurate poses with high-quality elastic non-rigid deformation. We adopt similar registration pipelines for muscle, bone, and skin. Here, we present muscle registration in detail and briefly discuss its differences from bone and skin registration.

Generally, our registration pipeline consists of two steps: pose initialization and iterative refinement. Pose initialization provides a good initial alignment to account for the highly nonlinear deformation of muscle. After initialization, we iteratively update the mesh vertex offsets so that the deformed template best matches the target extracted from the MRI scans.

Pose Initialization.

We use a simplified parametric model to initialize the pose parameter θ:

G₀(θ) = LBS(T̄, J, θ, W₀),    (3)

where we remove all shape-relevant parameters from Equation (2), including the shape parameter β, the shape blend shape B_S, and the pose blend shape B_P. We use G₀ to indicate the trimmed model, and W₀ is the skinning weight of this LBS function, initialized by radial basis functions (RBF) according to the template joint positions (Rhee et al., 2007). We minimize the L2 joint error between the posed template joints and the target joint annotations and solve the inverse kinematics to obtain the initial pose.
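For illustration, the pose initialization can be realised as a small gradient-based inverse-kinematics loop; `pose_to_joints` below is a stand-in for the trimmed LBS model of Equation (3), and the optimizer settings are arbitrary.

```python
import torch

def initialize_pose(pose_to_joints, target_joints, n_joint_params, iters=500, lr=0.05):
    """Solve for an initial pose by minimising the L2 joint error against the annotations."""
    theta = torch.zeros(n_joint_params, requires_grad=True)   # axis-angle pose vector
    optim = torch.optim.Adam([theta], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        joints = pose_to_joints(theta)                # (J, 3) posed template joints
        loss = ((joints - target_joints) ** 2).sum()  # L2 joint error
        loss.backward()
        optim.step()
    return theta.detach()
```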

Iterative Refinement.

Then, we perform non-rigid registration to align the hand model at a finer scale. We formulate this as an energy minimization problem that matches the deformed template muscle mesh to the target muscle mesh extracted from MRI. The objective function for non-rigid registration is defined as:

E = λ_geo E_geo + λ_reg E_reg + λ_elas E_elas + λ_att E_att + λ_coli E_coli + λ_cole E_cole,    (4)

where E_geo is the geometry term, E_reg is the regularization term, E_elas is the non-rigid elasticity term, E_att encodes the attachment constraints, E_coli and E_cole are the internal/external collision penalties, and the λ's are balancing weights. We discuss each term and its benefit in detail below.

Geometry Term. Inspired by surface tracking algorithms (Xu et al., 2019; Newcombe et al., 2015; Smith et al., 2020), we use the vertex distance and normal angle error to measure the distance between the template mesh M and the target mesh M*:

E_geo = d_CD(M, M*) + λ_n Σ_v ∠(n_v, n_v*),    (5)

where d_CD measures the Chamfer Distance (Borgefors, 1983) between the two meshes and ∠(n_v, n_v*) computes the angle between corresponding vertex normals. The first term pulls each template vertex toward its nearest target vertex, while the second term adds a normal penalty to prevent the template from being fitted to a target vertex with an opposite normal.
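A possible implementation of this term, using a brute-force nearest-neighbour search for clarity, is sketched below; the exact weighting and correspondence search used in the paper may differ.

```python
import torch

def geometry_term(src_v, src_n, tgt_v, tgt_n, w_normal=0.1):
    """src_v: (N,3) template vertices, src_n: (N,3) unit normals; tgt_v/tgt_n: target mesh."""
    d = torch.cdist(src_v, tgt_v)                      # (N, M) pairwise distances
    nn_idx = d.argmin(dim=1)                           # nearest target vertex per source vertex
    # Symmetric Chamfer distance between the two vertex sets.
    chamfer = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
    # Penalise matches whose normals point in opposite directions.
    cos = (src_n * tgt_n[nn_idx]).sum(dim=1)
    normal_penalty = (1.0 - cos).mean()
    return chamfer + w_normal * normal_penalty
```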

Regularization Term. The regularization term consists of three components, i.e., a rigidity regularizer E_rigid, a face normal consistency regularizer E_norm, and an edge length regularizer E_edge:

E_reg = E_rigid + E_norm + E_edge.    (6)

E_rigid regularizes the deformation of vertices by comparing the deformation of adjacent vertices, to avoid implausible shapes in unobserved regions. Instead of directly regulating the rotation of mesh nodes, we add constraints on the vertices:

E_rigid = Σ_v Σ_{u ∈ N(v)} w_{vu} ||d_v − d_u||²,    (7)

where d_v represents the deformation of vertex v, N(v) is the neighborhood of v, and w_{vu} is a distance-based weight between vertices v and u: a higher w_{vu} corresponds to a closer distance and thus a higher impact.

To regularize the moving direction of vertices, we adopt the face normal consistency term and the edge length term from (Wang et al., 2018) to further ensure mesh surface smoothness and avoid flying vertices. Specifically, E_norm computes the angle between the normals of each pair of neighbouring faces to encourage consistent face normals and a smooth surface, while E_edge penalizes flying vertices that cause long edges by minimizing the average edge length.
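For illustration, the edge-length and face-normal-consistency regularisers can be written as follows; mesh connectivity is passed in as index tensors, and the exact weighting in the paper may differ.

```python
import torch

def edge_length_reg(verts, edges):
    """verts: (V,3); edges: (E,2) vertex indices. Penalise long (flying-vertex) edges."""
    e = verts[edges[:, 0]] - verts[edges[:, 1]]
    return (e ** 2).sum(dim=1).mean()

def normal_consistency_reg(face_normals, neighbor_faces):
    """face_normals: (F,3) unit face normals; neighbor_faces: (P,2) adjacent face pairs."""
    n0 = face_normals[neighbor_faces[:, 0]]
    n1 = face_normals[neighbor_faces[:, 1]]
    # Angle between neighbouring face normals, expressed via 1 - cos.
    return (1.0 - (n0 * n1).sum(dim=1)).mean()
```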

Non-rigid Elasticity Term. To capture the non-rigid deformation of hands, we define E_elas using the Neo-Hookean elastic function, which has proven effective for muscle and flesh simulation in (Smith et al., 2020, 2018):

E_elas = Σ_t V_t Ψ(F_t),    (8)

where V_t denotes the volume of tetrahedron t and Ψ can be viewed as an energy density. Ψ encourages the deformation gradient F_t to stay close to the identity and thus can effectively prevent large shape changes and heavy self-collisions of muscles. Please refer to (Smith et al., 2020) for the complete formulation of Ψ.
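The sketch below builds the per-tetrahedron deformation gradient and evaluates a classical compressible Neo-Hookean density; the stable variant actually referenced above follows (Smith et al., 2018, 2020), so this is only an illustration of the general form.

```python
import torch

def neo_hookean_energy(rest_verts, def_verts, mu=1.0, lam=10.0):
    """rest_verts, def_verts: (T, 4, 3) vertex positions of T tetrahedra."""
    # Edge matrices relative to vertex 0: columns are the three tet edges.
    Dm = (rest_verts[:, 1:] - rest_verts[:, :1]).transpose(1, 2)   # (T, 3, 3) rest edges
    Ds = (def_verts[:, 1:] - def_verts[:, :1]).transpose(1, 2)     # (T, 3, 3) deformed edges
    F = Ds @ torch.linalg.inv(Dm)                                   # deformation gradient
    J = torch.linalg.det(F).clamp(min=1e-8)                         # volume ratio
    I_C = (F * F).sum(dim=(1, 2))                                   # first invariant tr(F^T F)
    # Classical compressible Neo-Hookean energy density.
    psi = 0.5 * mu * (I_C - 3.0) - mu * torch.log(J) + 0.5 * lam * torch.log(J) ** 2
    rest_vol = torch.linalg.det(Dm).abs() / 6.0                      # rest volumes of the tets
    return (rest_vol * psi).sum()
```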

Attachment Constraints. To ensure that muscles are attached to their corresponding bones properly, we use E_att to tie corresponding attachment points on the mesh surfaces together:

E_att = Σ_i ||a_i^m − a_i^b||²,    (9)

where a_i^m and a_i^b are hand-crafted matching attachment points on the muscle and bone meshes.

Internal Collision. To avoid hand mesh self-penetration, similar to (Hirota et al., 2001), we penalize internal collisions by:

E_coli = Σ_{v ∈ P} ((v − s_v) · n_{s_v})²,    (10)

where P is the set of interior penetrating vertices, s_v is the corresponding target surface position, and n_{s_v} is the corresponding surface normal. Due to the large search space of s_v and n_{s_v}, E_coli can only handle small collisions. We therefore add an additional normal and distance filter to shrink the search space; specifically, we discard collision candidates whose normal angle exceeds a threshold, which removes large self-collisions such as a finger penetrating the palm.

External Collision. External collisions happen between muscle and muscle and between muscle and bone. To eliminate them, we use the contact loss proposed in (Hasson et al., 2019):

E_cole = E_R + E_A,    (11)

where E_R is a repulsion term that measures the point-to-plane distance, and E_A is an attraction term that computes the point-to-point distance of corresponding vertices. E_R detects interpenetrating points and pushes them toward the target mesh surface, while the attraction term E_A finds close vertices and forces them to come into contact. By doing so, E_cole forces the muscle groups and bones to be adjacent without colliding with each other.
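An illustrative version of such a contact loss is sketched below, in the spirit of (Hasson et al., 2019); a nearest-vertex distance stands in for the point-to-plane distance, and the inside/outside (penetration) test is assumed to be computed elsewhere.

```python
import torch

def contact_loss(verts_a, verts_b, inside_mask, attract_radius=2.0, w_attr=0.5):
    """
    verts_a, verts_b: (N,3)/(M,3) vertices of two tissues (e.g., muscle and bone).
    inside_mask:      (N,) bool, True where a vertex of A penetrates B
                      (assumed to come from a separate inside/outside test).
    """
    d = torch.cdist(verts_a, verts_b)         # (N, M) pairwise distances
    nearest = d.min(dim=1).values              # distance to the surface of B
    # Repulsion: push penetrating vertices back to the other surface.
    repulsion = nearest[inside_mask].sum()
    # Attraction: vertices just outside, within a small radius, are pulled into contact.
    near = (~inside_mask) & (nearest < attract_radius)
    attraction = nearest[near].sum()
    return repulsion + w_attr * attraction
```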

Bone and Skin Registration.

The pipelines for bone and skin registration are similar to the muscle pipeline, except that we use different term combinations and balancing weights. For bone registration, we omit the non-rigid elasticity term in Equation (4), given the rigid deformation property of bones. For skin registration, which also requires non-rigid deformation constraints, we use all the terms in Equation (4). The core difference is that we use larger weights on the geometry term to align the skin, as the skin annotations are more reliable than the muscle annotations in the MRI dataset.

4.3. Parameter Learning

After registration, we have an initialized model. The general template mesh and the hand scans in the MRI dataset have been aligned to the same topology; consequently, for each subject and hand pose, we can generate an aligned mesh. We then set out to train the model parameters. Note that bones, muscles, and skin follow different anatomical and physical properties during shape and pose changes. Therefore, given the MRI scans, we train the model with a multi-stage strategy to disentangle deformations caused by pose and shape. Nevertheless, public MRI datasets only contain a limited number of hand poses due to the high cost and time intensity of MRI acquisition. Thus, we further optimize the model using additional hand scans from a larger scan dataset to extend our pose variance.


Learning on MRI dataset.

Given the MRI scans, we train the model through three stages, i.e., the pose stage, the shape stage, and the parameter stage. In each stage, we only update certain parameters while keeping the rest fixed. The objective function is defined as follows:

E = E_pose + E_shape + E_param,    (12)

where E_pose is the energy term for the pose stage, which updates the pose parameter θ; E_shape constrains the shape-related parameters; and E_param is for the parameter stage, which updates the skinning weights W and the pose blend shapes B_P. To avoid collisions between muscles, bones, and skin, we add a coupling penalty term (Equation (11)) to E_pose, E_shape, and E_param throughout the training procedure, while assigning different weights to balance its impact at each stage. We minimize E by iteratively going through the three stages until convergence.
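Schematically, this alternating optimisation can be organised as below; the stage energies are passed in as callables and stand in for E_pose, E_shape, and E_param defined in the following paragraphs, and the optimizer settings are arbitrary.

```python
import torch

def run_stage(params, energy_fn, iters=200, lr=1e-2):
    """Minimise one stage's energy over the given parameter tensors (requires_grad=True)."""
    optim = torch.optim.Adam(params, lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        loss = energy_fn()
        loss.backward()
        optim.step()

def train_nimble(model, pose_energy, shape_energy, param_energy, n_outer=5):
    """Alternate the three stages, keeping all other parameters fixed in each stage."""
    for _ in range(n_outer):
        run_stage([model['poses']], pose_energy)                                  # pose stage
        run_stage([model['templates'], model['joint_regressor']], shape_energy)   # shape stage
        run_stage([model['skin_weights'], model['pose_dirs']], param_energy)      # parameter stage
    return model
```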

Pose stage.

Given each registered scan and the current estimates of the remaining model parameters, we solve for the scan-specific pose parameter θ:

E_pose = E_joint + E_edge + E_mcp.    (13)

The joint term E_joint forces the posed template to match the registration by measuring the L2 distance between the posed template joints and the target joint annotations. E_edge measures the edge-length difference between the posed template and the registered mesh. These terms provide a good estimate of the pose without knowing the subject-specific shape. Since the metacarpals have a limited range of motion according to (Wang et al., 2019; Panchal-Kildare and Malone, 2013), we add a regularization term to prevent the metacarpal joints from undergoing unrealistic rotations:

E_mcp = ||m ⊙ θ||²,    (14)

where m is a binary mask selecting only the metacarpal joints and ⊙ denotes element-wise multiplication.

Shape stage.

In this stage, we update the shape-related parameters, the joint regressor, and the general template. First, we optimize the subject-specific template and joints, which are directly relevant to the registered meshes:

E_shape = E_geo + E_reg + E_joint + E_jreg,    (15)

where E_geo is a geometry term (Equation (5)); E_reg collects the regularization terms (Equation (6)); E_joint is a joint term; and E_jreg is a joint regularization term. For the geometry term, we use a lower weight at the interior boundaries of the muscle groups, namely the contacting vertices between muscles, to ensure a consistent muscle boundary. Additionally, we apply an extra smoothness regularizer on the muscle vertices. E_jreg is a joint regularization that keeps the subject's joint locations consistent with the joints predicted by the initial joint regressor:

E_jreg = ||J_s − R₀(T_s)||²,    (16)

where J_s are the optimized joints of subject s, T_s is the subject-specific bone template, and R₀ denotes the initial joint regressor.

After learning the subject-specific templates and joints by optimizing E_shape, we obtain the joint regressor by enforcing the regressed joints and the optimized joints to be equivalent. We then run principal component analysis (PCA) on the subject-specific templates to obtain the shape space: the mean shape of the MRI dataset serves as the general template T̄, the principal component matrix forms the shape basis, and β is the PCA coefficient vector of the shape space.
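The shape-space construction amounts to a PCA over the per-subject rest templates; a minimal NumPy sketch, with illustrative variable names, is given below.

```python
import numpy as np

def build_shape_space(templates, n_components=20):
    """templates: (N, V, 3) per-subject rest-pose templates in vertex correspondence."""
    n = templates.shape[0]
    flat = templates.reshape(n, -1)                    # (N, 3V)
    mean_shape = flat.mean(axis=0)                     # general template (flattened)
    centered = flat - mean_shape
    # SVD-based PCA; rows of vt are the principal shape directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                          # (K, 3V) shape basis
    betas = centered @ basis.T                         # (N, K) per-subject coefficients
    return mean_shape.reshape(-1, 3), basis, betas
```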

Parameter stage.

We optimize the skinning weights W and the pose blend shapes B_P by:

E_param = E_geo + λ_P ||B_P||²_F + λ_W ||W − W₀||²,    (17)

where E_geo is the geometry term. Similar to (Loper et al., 2015), the second term regularizes the Frobenius norm of B_P toward zero, which prevents overfitting of the pose-dependent blend shapes. The third term regularizes the skinning weights by minimizing the distance between W and its initialization W₀.

Pose Augmentation.

After optimization, NIMBLE can potentially be used directly to better estimate hand pose thanks to its more reliable bone and muscle modeling. As shown in Figure 15, we are able to provide anatomically correct and physically plausible deformations compared with the state of the art. However, due to the limited hand poses provided by the MRI datasets (Li et al., 2021; Wang et al., 2019), NIMBLE may suffer from degraded performance in applications requiring large hand pose variance. To address this issue, we additionally optimize NIMBLE on hand scans from the MANO dataset (Romero et al., 2017). MANO provides 1554 raw scans along with hand registrations aligned with the topology of the MANO hand model. We perform a topology transfer with a simplified physically based non-rigid registration (Section 4.2) to align our model with the MANO dataset. To achieve this, we compute a dense correspondence from the MANO topology to ours by manually fitting MANO to our template using Wrap3D (R3DS, 2022). We then run the non-rigid optimization with the geometry term, the non-rigid elasticity term, and the dense correspondence to match the raw scans. By doing so, we obtain another 1554 hand registrations with large pose variance.

We follow the same parameter learning strategy as on the MRI dataset and further optimize the model learned from MRI scans. Note that the canonicalized MANO data only contains skin geometry, resulting in weak supervision on bone and muscle. To prevent unexpected deformations, we leverage an additional shape regularizer to constrain the deformation of the inner geometry. Essentially, we want to use the skin to guide the deformation of bone and muscle so that the inner and outer meshes do not degrade to the average template of our previously registered MRI shape space. We define the shape regularizer as follows:

E_sreg = ||β − β̂||²,    (18)

where β̂ denotes the shape coefficients obtained by projecting onto the previously learned MRI shape space.

More details on parameter settings, registration, and learning can be found in Section 6.1.

5. Photorealistic Rendering

Modeling high-quality, realistic appearance is important for a realistic rendering pipeline. Physically based textures, including the diffuse albedo, normal maps, and specular maps, play an important role in rendering photo-realistic hand appearance. Here, we introduce how we model the appearance component A(α) of NIMBLE.

Appearance Capture.

We utilize a photometric appearance capture system, which we call HandStage, analogous to the USC LightStage (Debevec et al., 2000), to capture detailed physically based textures. We obtain the diffuse albedo, normal maps, and specular maps of hands by applying several patterns of polarized gradient illumination. We captured 20 hands of different identities with our HandStage capture system to reconstruct 8192x8192 physically based hand textures with pore-level detail. To increase diversity, we include an extra 18 online hand texture assets from (3DSCANSTORE, 2022). Our final appearance dataset consists of 38 photo-realistic hand texture assets spanning different ages, genders, and races.

For rich and authentic hand appearance generation, we create a parametric appearance model from our appearance dataset. Every asset in the dataset has physically based textures as well as a uniform texture UV mapping, which allows us to linearly interpolate between existing textures. From the appearances in our dataset, we compute the average appearance Ā, including the average diffuse albedo, normal maps, and specular maps. Then, we run principal component analysis (PCA) using singular value decomposition to obtain the principal components A_i from every existing appearance in our appearance dataset. We thereby obtain the parametric appearance model for an appearance parameter vector α as

A(α) = Ā + Σ_{i=1}^{n} α_i A_i,    (19)

where n is the number of principal components. With the parametric appearance model created from PCA, we can also generate realistic physically based textures beyond our dataset. Since our textures share a uniform texture UV mapping with our template skin mesh, we can directly apply the generated physically based textures to NIMBLE hands of different shapes and produce a photo-realistic appearance.
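Sampling a new appearance from the PCA model of Equation (19) then reduces to drawing a coefficient vector and blending the components onto the mean maps, as in the following illustrative sketch (texture layout and names are assumptions).

```python
import numpy as np

def sample_appearance(mean_maps, components, stddevs, scale=1.0):
    """
    mean_maps:  (D,)   concatenated mean diffuse/normal/specular maps, flattened
    components: (K, D) PCA components of the appearance dataset
    stddevs:    (K,)   per-component standard deviations
    Returns a new flattened texture stack A(alpha) = mean + alpha @ components.
    """
    alpha = np.random.randn(len(stddevs)) * stddevs * scale   # random appearance coefficients
    return mean_maps + alpha @ components
```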

Rendering Process.

Figure 6. Our rendering pipeline for creating photorealistic hands. Starting with our full template, we randomly generate shape, pose, and appearance and render the 3D hand with environment maps. We also show the different rendering effects obtained with only the diffuse map and with the normal map added.

Figure 6 shows our procedural rendering process: we start with the NIMBLE template model and generate shape, pose, and appearance parameters randomly. NIMBLE takes the parameters as input and generates a realistic 3D hand with bone, muscle, and skin geometry, as well as a photo-realistic appearance with a diffuse map, normal map, and specular map. We then render the 3D hand with Cycles, a photo-realistic ray-tracing renderer (Blender, 2021). We employ image-based lighting with high dynamic range images (HDRI) as background textures to illuminate the hand, as shown in Figure 6. We also show a variety of different hand poses, shapes, and appearances under uniform lighting in Figure 8. To generate a dynamic motion sequence, we map the pose parameters to a full per-joint quaternion representation and linearly interpolate between different poses to keep the pose morphing smooth. Please see the supplementary video for examples. To further enhance our rendering quality, we generate skin wrinkles with cloth simulation in the rendering engine. Here we only use the surface triangles of our volumetric mesh so that cloth simulation schemes can be applied. See Figure 7 for an example. During the adduction of the thumb toward the index finger, the first dorsal interosseous between the thumb and index finger contracts and pulls the bones closer; the contracted muscle makes the skin bulge, and the purlicue skin is then squeezed into a fold. As can be seen in the close-up of Figure 7, the wrinkle near the purlicue gradually appears as the fingers come closer. NIMBLE recovers the muscle and skin bulging under such a pose, while the cloth simulation produces wrinkles caused by the squeezing between the geometry of the thumb and index finger.
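The pose interpolation mentioned above can be sketched as a standard quaternion slerp applied per joint; the conversion from NIMBLE pose parameters to quaternions is assumed to happen beforehand, and the example poses are synthetic.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1 of shape (..., 4)."""
    dot = np.sum(q0 * q1, axis=-1, keepdims=True)
    q1 = np.where(dot < 0, -q1, q1)                 # take the shorter arc
    dot = np.abs(dot).clip(-1.0, 1.0)
    theta = np.arccos(dot)
    sin_theta = np.sin(theta)
    # Fall back to linear interpolation when the quaternions are nearly parallel.
    w0 = np.where(sin_theta > 1e-6, np.sin((1 - t) * theta) / sin_theta, 1 - t)
    w1 = np.where(sin_theta > 1e-6, np.sin(t * theta) / sin_theta, t)
    out = w0 * q0 + w1 * q1
    return out / np.linalg.norm(out, axis=-1, keepdims=True)

# Interpolate 30 frames between two per-joint quaternion poses (25 joints):
pose_a = np.tile([1.0, 0.0, 0.0, 0.0], (25, 1))        # identity rotations
pose_b = pose_a.copy()
pose_b[:, :2] = [np.cos(0.3), np.sin(0.3)]             # small rotation about x per joint
frames = [slerp(pose_a, pose_b, t) for t in np.linspace(0.0, 1.0, 30)]
```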

Figure 7. Simulation of skin wrinkles due to thumb adduction. Note how the wrinkle gradually appears and the skin bulges as the fingers come closer.
Figure 8. Gallery of generated 3D hands. Our model is able to synthesize realistic digital hands with large variations in pose, shape, and appearance.

6. Experimental Results

We first describe our implementation details, then evaluate the performance of our registration process and the NIMBLE models learned from these registrations. We test NIMBLE on public hand datasets and compare it with the state of the art, e.g., MANO (Romero et al., 2017) and HTML (Qian et al., 2020). Extensive experiments demonstrate NIMBLE's superior performance and photo-realistic rendering capability across diverse datasets. Additionally, we show that NIMBLE can be easily fitted into hand inference pipelines with various inputs.

6.1. Implementation Details

In our experiments, we set the term weights in Equation (4) according to the template type during registration. For muscle, since the target is not cleanly separated, we use larger regularization term weights to guide the registration. For skin, we set larger geometry term and non-rigid elasticity term weights. For bone, we omit the non-rigid elasticity term. The whole registration takes approximately 20 minutes for the muscle group, 8 minutes for the skin, and 2 minutes for the bones. For parameter learning, we also use weights to balance each term. During the first iteration, we set two of the weights in Equation (12) to 100 to enforce a strong pose constraint, and decrease them to 0.01 and 10, respectively, after the first stage. We gradually increase the weight of the coupling penalty from 0.1 to 1 to ensure that our final model is collision-free. We also adopt an additional optimization process with Equations (10) and (11) to handle skin collisions with bone and muscle during model usage. We iterate the whole process of registration and parameter learning several times to obtain a stable result. All of our experiments are performed with PyTorch automatic differentiation on an NVIDIA GeForce RTX 3090.

6.2. Registration Evaluation

Quantitative Result.

Figure 9. The histogram of median per-vertex distance between registration and the target surface.

In Figure 9, we plot histograms of the median per-vertex distance from the registration to the MRI mesh and to the MANO scan mesh. The distance is measured across all registered MRI data and MANO scans. Note that we discard the inner vertices in this evaluation. Our method produces registrations that generally match the target mesh within a 2 mm distance error. For bone registration, almost all vertices have a distance error below 1.4 mm, and 87% of the vertices achieve a median distance of less than 1 mm. For muscle registration, 65% of the vertices have a median distance of less than 1 mm and only 9% of the vertices are above 5 mm. The larger errors are caused mainly by missing data at the attachment regions where the muscle gets thinner and attaches to the bone, as described in Section 4. We thus use higher regularization weights and attachment terms to increase robustness in these regions. For skin registration on the MRI dataset, 70% of the vertices have a median distance of less than 1.5 mm, while the maximum distance error is 3.69 mm. For the MANO scan skin data, with our topology transfer method, we achieve a mean error of 1.09 mm, and 4% of the vertices have an error above 3 mm. This is mostly caused by incomplete scans in the dataset, especially under object occlusion, as shown in Figure 10 (last two rows). Meanwhile, the mean squared distance from scan to registration for the MANO skin is 1.76 mm.

Qualitative Result.

Figure 10 shows representative registration samples from the MRI dataset and the MANO scan dataset. For all inner and outer data, our method provides an accurate and smooth registration, and we are able to capture muscle stretching and bulging effects. Notably, our registration maintains robust performance on noisy MRI targets while capturing detailed skin wrinkles from the high-resolution MANO scans (Romero et al., 2017).

Figure 10. Qualitative result of MRI registration. From top to bottom: MRI muscle mesh (purple) and our registration (green), MRI bone mesh and registration, MRI skin mesh and our registration, as well as the MANO scan mesh (yellow) and our registration.

Ablation Study.

Figure 11. Results of the LBS initialization and non-rigid refinement on an MRI mesh. From left to right: target MRI mesh, LBS registration with the color-coded MRI-to-registration distance, and non-rigid registration with the color-coded distance.

Our registration process contains two steps: LBS pose initialization and iterative non-rigid refinement. Figure 11 visualizes the registration results of each optimization step. The LBS step serves only as a pose initialization; it is unable to capture the details of muscle and skin bulging, especially around the thumb muscle and the palm. After the iterative non-rigid refinement, the registration tightly fits the surface of the MRI mesh. Note that in the top right of Figure 11, the geometry error remains large in the middle of the MRI muscle mesh. This is due to segmentation error: some tendons are mislabeled as muscle because they are hard to distinguish on MRI slices. Thanks to the strong regularization term from Section 4.2, we successfully match the template to the target without fitting it to the mislabeled tendons.

Figure 12. Registration results with different optimization terms. We label regions of interest with arrows. From left to right, we run registration with geometry term and then add regularization term, non-rigid elasticity term, and the collision terms to obtain the total energy.

To assess the impact of each of our energy terms on iterative refinement, we register an MRI hand mesh with multiple energy term variants. As illustrated in Figure 12, we begin with the geometry term and then add the regularization, non-rigid elasticity, and collision terms one by one to obtain the total energy defined in Section 4.2. While the geometry term forces the vertices to align with the target, the fingers collide with one another, resulting in severe artifacts. Adding the regularization term eliminates some artifacts, but finger collisions and unnatural knuckle deformations remain. With the addition of the non-rigid elasticity term, the deformation of each finger becomes more realistic, but the thumb and middle finger continue to self-penetrate. By including the collision terms, the self-penetration problem is resolved and the final collision-free registration result is obtained.

6.3. Model Evaluation

The general method to evaluate a statistical model is to measure its compactness and generalization (Romero et al., 2017; Loper et al., 2015). Compactness measures the amount of variability in the training data captured by the model, while generalization measures the model's ability to represent unseen shapes, poses, and appearances.

Figure 13. Model Quality of compactness and generalization. (a) Pose space (b) Shape space (c) Appearance space.
Figure 14. Per-tissue shape compactness and generalization.

Compactness.

Figure 13 (a) and (b) (red curves) plot the compactness of the NIMBLE pose and shape spaces, respectively. These curves depict the variance in the training data captured by a varying number of principal components. The pose space plot shows that 15, 25, and 30 components express 83%, 92%, and 95% of the full space. The result is consistent with the anatomy of the human hand, which is generally considered to have 27 degrees of freedom (Panchal-Kildare and Malone, 2013). As for the shape space, we only plot the first 33 principal components, as our shape space is mainly learned from 33 individuals in the MRI dataset. As shown in Figure 13 (b), the first principal component covers 50% of the shape space, while 10 and 20 components cover 83% and 93% of the complete shape space. We also plot the per-tissue compactness curves in Figure 14(a); they indicate that the variance of the different tissues is mostly consistent.

Generalization.

To study the generalization ability of the NIMBLE shape space in the presence of limited shape variance, we perform a leave-one-out evaluation on our MRI and MANO training sets, which contain 62 individuals in total. The blue curve in Figure 13(b) shows the generalization curve of the shape space. We report the mean squared error and the standard deviation in millimeters. As the number of principal components increases, the mean error decreases to a minimum of 0.6 mm achieved by the full space. We also plot the per-tissue generalization in Figure 14(b). Note that the muscle error is the lowest across all components, meaning the shape variance of muscle is relatively small compared to bone and skin.

To evaluate the generalization capability of the pose space in NIMBLE, namely the ability to generalize to unseen poses with known shape parameters, we construct a test set containing 9 registered MRI scans in unseen poses, combined with the test scan set from (Romero et al., 2017). The test scan set contains 50 hand surface scans with unseen poses and shapes. All meshes are aligned to our topology and none were used to train our models. We fit our trained model to each registered mesh, optimizing over the pose and shape coefficients to find the best fit in terms of mean squared vertex distance. Since we are evaluating the pose generalization ability, we use the full shape space for this experiment. The blue curve in Figure 13(a) shows the generalization results, reported as the mean squared distance error and the standard deviation. Similar to the shape space, the plot for the pose space decreases monotonically with an increasing number of components.

Compare with MANO.

To compare with MANO (Romero et al., 2017), similar to the pose generalization experiment, we fit the models to our MRI test set and to their scan test set, respectively. We use the full pose and shape space for all models in this experiment. In Table 3, we report the mean squared vertex error in millimeters. The MANO model performs best on its own test set but does not generalize well to the MRI data. Meanwhile, our MRI-only model, whose pose variance is derived solely from MRI data, is not able to generalize well to the unseen poses and shapes of the MANO test set. After pose augmentation, although the performance on the MRI test set drops slightly, the result on the MANO test set is significantly improved. Overall, our model achieves satisfying results on both test sets and the smallest average error. Figure 15 further shows a qualitative hand fitting comparison with MANO. Notably, MANO suffers from impractical inner deformation and lacks skin details, as it is built on the outer surface only. In contrast, our NIMBLE model achieves anatomically correct inner hand tissue deformation while retaining skin details.

Model MRI test MANO test Avg.
MANO (Romero et al., 2017) 3.32 1.46 2.39
Ours - MRI 2.51 3.89 3.20
Ours - Pose Aug 2.67 1.62 2.15
Table 3. Comparison with MANO. We evaluate mean squared distance in millimeters on MRI test set and MANO test set respectively, and report the average error.
Figure 15. Deformation comparison with MANO (Romero et al., 2017). (a) NIMBLE retains skin details during deformation, while MANO provides an overly smoothed skin. (b) (c) MANO presents implausible flexion of the inner bone and muscle, as well as an unrealistically sunken skin, whereas NIMBLE maintains anatomically correct and physically plausible deformation.

Bone/Muscle/Skin Correlation.

Regarding the correlation of bone, muscle, and skin, we impose biomechanical constraints that correlate the three tissues, implicitly encoded via the non-rigid elasticity term in registration (Section 4.2) and the coupling term in parameter training (Section 4.3). These terms force skin deformations to follow bone and muscle movements, which is critical for physically correct simulation. Visually, such deformations are more nuanced, as shown in Figure 15 and the accompanying video; both demonstrate the bulging thumb base (thenar eminence) when the thumb touches the index finger, illustrating the intricate coordination between bones, muscles, and skin. To quantitatively assess the impact of the inner muscle layer, we conduct an ablative study on bone-skin vs. bone-muscle-skin models. We fit the MRI test set and the MANO test set with models learned on bone-skin and bone-muscle-skin data separately, using the full pose and shape space. The evaluation results are shown in Table 4. Adding the muscle layer achieves lower errors on all metrics. Compared to bone, the skin error shows a larger improvement on both test sets, indicating that the muscle layer has a positive impact on skin deformation. We thus conclude that modeling the muscle layer facilitates both visual realism and fitting quality, and that the correlation between the tissues is successfully encoded in the model through our registration and parameter learning pipeline.

Model MRI-bone MRI-skin MANO-skin
Bone-Skin 2.59 2.61 1.67
Bone-Muscle-Skin 2.58 2.56 1.62
Table 4. Ablative comparison of bone-skin model and bone-muscle-skin model. We evaluate mean squared distance in millimeters on bone and skin mesh from MRI test set, as well as skin mesh from MANO test set.

Photorealistic Appearance.

Figure 13(c) shows the evaluation of our appearance model. The plot depicts the rising variability of our appearance dataset as the number of employed principal components increases. The first several components represent a significant amount of variation, mainly skin tone and ruddiness, while the remaining components control the details of the skin. For evaluating generalization, we similarly perform a leave-one-out evaluation. Since our appearance dataset includes the diffuse albedo, normal maps, and specular maps, we utilize the root mean squared error (RMSE) as the metric for measuring reconstruction error. We reconstruct the left-out textures using the PCA computed from the other textures and measure the reconstruction error as the RMSE of the vectorized textures. As shown in Figure 13(c), the reconstruction error decreases as the number of components increases. Figure 16 shows a qualitative appearance comparison with HTML (Qian et al., 2020). We use Wrap3D (R3DS, 2022) to transfer the texture from HTML to our model and render the result under the same lighting conditions. The appearance submodule of NIMBLE covers a wide diversity of skin complexions. In particular, the use of normal maps in NIMBLE better illustrates the tendons on the back of the hand and the palm lines despite complexion variations.

Figure 16. Appearance comparison with HTML (Qian et al., 2020). We render our model with textures from HTML and from our appearance data, respectively. (a)(b) show the back and front sides of captured hand textures from each model; (b)(c) show randomly sampled textures. (a)(c) are rendered with an additional lighting source to highlight the differences in normals.

6.4. Applications

Synthesizing Digital Hand.

Learning-based hand-related tasks rely on high-quality labeled datasets of hand images, yet acquiring such datasets with correct labels (e.g., 3D geometry, pose, and appearance) is extremely challenging owing to the high degrees of freedom (DoF) of hand motion: each finger can flex and extend, abduct and adduct, and circumduct, and all fingers can move independently as well as in coordination to form specific gestures. Such high DoF causes complex occlusions and hence imposes significant difficulties on skeleton annotation. Even for humans, it would be very difficult to manually label hand joints of complex gestures at high precision, largely due to the ambiguity caused by occlusions. Our model is well suited to help resolve these issues. With the NIMBLE model and rendering engines, we can create an unlimited number of photo-realistic hand images and video sequences with corresponding ground-truth inner and outer geometry, pose, and texture maps, all of which can be used for downstream hand-related learning tasks. We demonstrate qualitative results of our photorealistic rendering and its ability to generate a complete digital hand in Figure 8. Several results are shown in Figure 17 for the same pose with different textures under different lighting environments. We can also provide the corresponding ground-truth 3D joint annotations.

Figure 17. Representative results of the same posed hand with different camera views, illuminations, and textures. (a) Inner and outer geometry of a NIMBLE-generated digital hand and the corresponding 3D joint annotation. (b) Photorealistic images. From left to right, the first two columns show the same texture under different illumination, while the second and third columns show different textures under the same illumination.

Hand Inference.

Like other parametric hand models (Li et al., 2021; Romero et al., 2017), NIMBLE is easily adaptable to a variety of optimization- and regression-based hand inference tasks, such as hand anatomy analysis, pose and shape estimation, and hand tracking, with a variety of inputs including meshes, point clouds, MRI, and RGB images. We integrate NIMBLE as a differentiable layer that takes shape, pose, and appearance parameters as input and outputs a 3D hand mesh with photorealistic textures and 3D joints. Similar to (Li et al., 2021; Hasson et al., 2019), NIMBLE supports training with multiple losses, such as a parameter loss, a regularization loss, 2D/3D joint losses, and a mesh loss, as well as a photometric loss (Qian et al., 2020) and a silhouette loss (Xu et al., 2018) via the differentiable renderers provided in PyTorch3D (Johnson et al., 2020).
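A minimal sketch of such a training step is shown below, assuming a hypothetical `nimble_layer` and image `encoder`; the loss weights and the particular subset of losses are illustrative, and photometric or silhouette terms could be added through PyTorch3D's differentiable renderers.

```python
import torch

def training_step(encoder, nimble_layer, image, gt_joints3d,
                  w_joint=1.0, w_reg=1e-3):
    """Regress NIMBLE parameters from an image and supervise through the layer."""
    beta, theta, alpha = encoder(image)                    # predicted parameters
    verts, joints3d, textures = nimble_layer(beta, theta, alpha)
    loss_joint = torch.nn.functional.mse_loss(joints3d, gt_joints3d)  # 3D joint loss
    loss_reg = beta.pow(2).mean() + theta.pow(2).mean() + alpha.pow(2).mean()
    return w_joint * loss_joint + w_reg * loss_reg   # photometric/silhouette terms omitted
```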

We show representative results of using NIMBLE in Figure 18. We are able to estimate and recover anatomically correct inner and outer hand structures and provide photorealistic renderings from various inputs. Figure 18 (a) shows an example of hand inference from a point cloud. The input point cloud is taken from the MANO test set, and we perform this task using an optimization-based method (Newcombe et al., 2015): we optimize the pose and shape parameters with a joint loss and a mesh loss with respect to the target point cloud. Figure 18 (b) illustrates the inference of hand anatomy from an MRI volume. We build a network with a ResNet3D (Tran et al., 2018) encoder and a parameter regressor branch to directly regress NIMBLE parameters from the MRI volume, and train it on our MRI training set with supervision on the pose and shape parameters, for which we obtain ground-truth labels from our registration and parameter learning stages. Since point clouds and MRI volumes provide no appearance guidance, we omit the appearance parameter during optimization and training and use the mean texture for rendering in Figure 18 (a)(b). Similarly, for the image-based hand inference task shown in Figure 18 (c), we build upon I2L-MeshNet (Moon and Lee, 2020) and train an additional parameter regression branch with a 3D joint loss and a photometric loss on the FreiHAND (Zimmermann et al., 2019) dataset. Note that FreiHAND provides ground-truth annotations with 21 3D keypoints, whereas our model is defined with 25 anatomical joints; following (Li et al., 2021), we add a linear layer that maps our joints to the dataset annotation to account for the mismatch. In addition, we add L2 regularizers on the magnitudes of the shape, pose, and appearance parameters, and assume the same fixed lighting condition as HTML (Qian et al., 2020) when predicting appearance parameters. The quantitative results are shown in Table 5. Following (Zimmermann et al., 2019), we report PA MPJPE, the Euclidean distance (mm) between the predicted and ground-truth 3D joints after rigid alignment, as well as F-scores at two distance thresholds. Although our model does not outperform (Moon and Lee, 2020), owing to the fundamental difference in joint definition, we achieve comparable quantitative results while predicting unprecedentedly photorealistic hands with inner structures (Fig. 18(c)).
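For reference, PA MPJPE can be computed as sketched below: the predicted joints are aligned to the ground truth with a Procrustes (rotation, translation, and scale) fit before measuring the mean per-joint Euclidean distance. This is a standard formulation rather than code from our evaluation pipeline; some implementations omit the scale term.

```python
import numpy as np

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint error (same unit as the input) after Procrustes alignment
    of the predicted joints (J, 3) to the ground-truth joints (J, 3)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                      # remove translation
    U, s, Vt = np.linalg.svd(p.T @ g)                  # optimal rotation (Kabsch)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # avoid reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (s * np.diag(D)).sum() / (p ** 2).sum()    # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g                   # aligned prediction
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```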

Figure 18. Representative results of using NIMBLE for hand pose and shape estimation from (a) a point cloud, (b) an MRI volume, and (c) an RGB image. For the textureless point cloud and MRI inputs, NIMBLE is rendered with the mean texture.
Methods PA MPJPE F@5mm F@15mm
(Moon and Lee, 2020) 7.4 0.681 0.973
NIMBLE 9.4 0.547 0.955
Table 5. Quantitative results of RGB inference. We report the joint error (PA MPJPE) and F-scores for (Moon and Lee, 2020) and our model on the FreiHAND dataset.

7. Conclusion and Future Work

To generate faithful hands for an immersive Metaverse experience, we propose NIMBLE, a non-rigid hand model with skin, bones, and muscles that is anatomically correct and captures the delicate coordination between the inner and outer kinematic structures of the hand. In particular, we rely on an enhanced MRI hand dataset with full segmentation annotations for bones, muscles, and skin, as well as meshes auto-registered by our optimization method with physical constraints. For the parameter learning of NIMBLE, we further introduce penalty terms to guarantee physically plausible muscle deformations. By enforcing the inner bones and muscles to match anatomic and kinematic rules, NIMBLE provides an unprecedented level of realism and achieves anatomically correct digital hand synthesis, motion animation, and photorealistic rendering. Owing to its parametric representation, NIMBLE also benefits many learning-based vision applications across different modalities of input data.

There are several avenues for future work. We demonstrate how NIMBLE can change shape and pose with inner and outer consistency, but we do not explicitly model the interconnections; we intend to employ implicit skinning to include explicit constraints on bone, muscle, and skin interactions. Moreover, with our tetrahedral modeling, we can extend the model to include parametric secondary deformation using specifically designed blend shapes or FEM soft-body dynamics, as in (Pons-Moll et al., 2015; Tsoli et al., 2014). We also plan to analyze muscular attributes such as stiffness and elasticity to produce a more realistic physical model for efficient muscle and flesh modeling, and to extend our parametric model with tendons and ligaments to improve skin deformation and overall hand movement realism, allowing for even more dexterous hand modeling and anatomical and kinematic analysis. Additionally, we plan to explore alternative approaches, such as geometric modeling via parametric or learning-based methods, to model skin wrinkles; such approaches, however, require capturing significantly more detailed normal maps, and our next step is to use the HandStage to capture dynamic sequences and model these fine details in both shape and appearance. Hand tracking applications rely heavily on hand datasets, yet existing multiview datasets (Zimmermann et al., 2019; Moon et al., 2020) provide limited annotation, and synthetic datasets (Hasson et al., 2019) lack realism and exhibit a domain gap compared to real images. We therefore plan to use NIMBLE to create a high-quality hand dataset with comprehensive ground-truth annotations, including inner and outer geometry, pose, and appearance, and to train a deep network on it for hand motion capture. Finally, two-handed contact and object interaction are also vital: we currently use only right-hand data, but a left-hand model and a hand-object parametric model would be tremendously useful for two-hand motion capture and immersive VR interactions.

Acknowledgements.
This work was supported by NSFC programs (61976138, 61977047), the National Key Research and Development Program (2018YFB2100500), STCSM (2015F0203-000-06), and SHMEC (2019-01-07-00-01-E00003).

References

  • 3DSCANSTORE (2022) 3D scan store: captured assets for digital artists. External Links: Link Cited by: §5.
  • R. Abdrashitov, S. Bang, D. Levin, K. Singh, and A. Jacobson (2021) Interactive modelling of volumetric musculoskeletal anatomy. ACM Trans. Graph. 40 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • I. Albrecht, J. Haber, and H. Seidel (2003) Construction and animation of anatomically based human hand models. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 98–109. Cited by: §2, Table 1.
  • B. Allen, B. Curless, Z. Popović, and A. Hertzmann (2006) Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’06, Goslar, DEU, pp. 147–156. External Links: ISBN 3905673347 Cited by: §2.
  • B. Allen, B. Curless, and Z. Popović (2003) The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans. Graph. 22 (3), pp. 587–594. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • P. Alliez, E. C. De Verdire, O. Devillers, and M. Isenburg (2003) Isotropic surface remeshing. In 2003 Shape Modeling International., pp. 49–58. Cited by: §4.1.
  • Amira (2022) Amira software for biomedical and life science research. External Links: Link Cited by: §3.1.
  • E. M. A. Anas, A. Rasoulian, A. Seitel, K. Darras, D. Wilson, P. S. John, D. Pichora, P. Mousavi, R. Rohling, and P. Abolmaesumi (2016) Automatic segmentation of wrist bones in ct using a statistical wrist shape pose model. IEEE Transactions on Medical Imaging 35 (8), pp. 1789–1801. External Links: Document Cited by: §2.
  • D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, pp. 408–416. Cited by: §1, §2.
  • S. Baek, K. I. Kim, and T. Kim (2019) Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • L. Ballan, A. Taneja, J. Gall, L. V. Gool, and M. Pollefeys (2012) Motion capture of hands in action using discriminative salient points. In European Conference on Computer Vision, pp. 640–653. Cited by: §2.
  • V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194. Cited by: §2.
  • Blender (2021) Cycles renderer. Cited by: §5.
  • G. Borgefors (1983) Chamfering: a fast method for obtaining approximations of the euclidean distance in n dimensions. In Proc. 3rd Scand. Conf. on Image Analysis (SCIA3), pp. 250–255. Cited by: §4.2.
  • S. Capell, M. Burkhart, B. Curless, T. Duchamp, and Z. Popović (2005) Physically based rigging for deformable characters. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 301–310. Cited by: §2.
  • M. de la Gorce, D. Fleet, and N. Paragios (2011) Model-based 3d hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, pp. 1793–1805. External Links: Document Cited by: §2.
  • P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar (2000) Acquiring the reflectance field of a human face. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 145–156. Cited by: §5.
  • P. Debevec (2012) The light stages and their applications to photoreal digital actors. SIGGRAPH Asia 2 (4), pp. 1–6. Cited by: §1.
  • C. Erolin, C. Lamb, R. Soames, and C. Wilkinson (2016) Does virtual haptic dissection improve student learning? a multi-year comparative study.. In MMVR, pp. 110–117. Cited by: §4.1.
  • Y. Feng, H. Feng, M. J. Black, and T. Bolkart (2021) Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (TOG) 40 (4), pp. 1–13. Cited by: §2.
  • N. Hasler, T. Thormählen, B. Rosenhahn, and H. Seidel (2010) Learning skeletons for shape and pose. In Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’10, New York, NY, USA, pp. 23–30. External Links: ISBN 9781605589398, Link, Document Cited by: §2.
  • Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid (2019) Learning joint reconstruction of hands and manipulated objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11807–11816. Cited by: §2, §4.2, §6.4, §7.
  • G. Hirota, S. Fisher, A. State, C. Lee, and H. Fuchs (2001) An implicit finite element method for elastic solids in contact. In Proceedings Computer Animation 2001. Fourteenth Conference on Computer Animation (Cat. No. 01TH8596), pp. 136–254. Cited by: §4.2.
  • D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black (2012) Coregistration: simultaneous alignment and modeling of articulated 3d shape. In European conference on computer vision, pp. 242–255. Cited by: §1, §2.
  • J. Johnson, N. Ravi, J. Reizenstein, D. Novotny, S. Tulsiani, C. Lassner, and S. Branson (2020) Accelerating 3d deep learning with pytorch3d. In SIGGRAPH Asia 2020 Courses, SA ’20, New York, NY, USA. External Links: ISBN 9781450381123, Link, Document Cited by: §6.4.
  • P. Kadleček, A. Ichim, T. Liu, J. Křivánek, and L. Kavan (2016) Reconstructing personalized anatomical models for physics-based body animation. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–13. Cited by: §2.
  • S. Khamis, J. Taylor, J. Shotton, C. Keskin, S. Izadi, and A. W. Fitzgibbon (2015) Learning an efficient model of hand shape variation from depth images. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2540–2548. Cited by: §2.
  • J. Kim and N. S. Pollard (2011) Fast simulation of skeleton-driven deformable body characters. ACM Trans. Graph. 30 (5). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • P. G. Kry, D. L. James, and D. K. Pai (2002) Eigenskin: real time large deformation character skinning in hardware. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, pp. 153–159. Cited by: §2.
  • S. Lee, R. Yu, J. Park, M. Aanjaneya, E. Sifakis, and J. Lee (2018) Dexterous manipulation and control with volumetric muscles. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–13. Cited by: §2.
  • J. P. Lewis, M. Cordner, and N. Fong (2000) Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, USA, pp. 165–172. External Links: ISBN 1581132085, Link, Document Cited by: §2.
  • D. Li, S. Sueda, D. R. Neog, and D. K. Pai (2013) Thin skin elastodynamics. ACM Trans. Graph. 32 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero (2017) Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • Y. Li, M. Wu, Y. Zhang, L. Xu, and J. Yu (2021) PIANO: a parametric hand bone model from magnetic resonance imaging. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 816–822. External Links: Document, Link Cited by: NIMBLE: A Non-rigid Hand Model with Bones and Muscles, §1, §1, §2, §2, §2, Table 1, §3.1, §3, §4.1, §4.1, §4.1, §4.3, §6.4, §6.4.
  • L. Liu, K. Yin, B. Wang, and B. Guo (2013) Simulation and control of skeleton-driven soft body characters. ACM Trans. Graph. 32 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG) 34 (6), pp. 1–16. Cited by: §1, §1, §2, §4.1, §4.1, §4.3, §6.3.
  • W. E. Lorensen and H. E. Cline (1987) Marching cubes: a high resolution 3d surface construction algorithm. ACM siggraph computer graphics 21 (4), pp. 163–169. Cited by: §3.1.
  • N. Magnenat-Thalmann, R. Laperrière, and D. Thalmann (1989) Joint-dependent local deformations for hand animation and object grasping. In Proceedings on Graphics Interface ’88, CAN, pp. 26–33. Cited by: §2.
  • S. Melax, L. Keselman, and S. Orsten (2013) Dynamics based 3d skeletal hand tracking. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, pp. 184–184. Cited by: §2.
  • M. Mirakhorlo, N. Van Beek, M. Wesseling, H. Maas, H. Veeger, and I. Jonkers (2018) A musculoskeletal model of the hand and wrist: model definition and evaluation. Computer methods in biomechanics and biomedical engineering 21 (9), pp. 548–557. Cited by: §2.
  • G. Moon and K. M. Lee (2020) I2l-meshnet: image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In European Conference on Computer Vision, pp. 752–768. Cited by: §6.4, Table 5.
  • G. Moon, T. Shiratori, and K. M. Lee (2020) DeepHandMesh: a weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. In European Conference on Computer Vision (ECCV), pp. 440–455. External Links: ISBN 978-3-030-58535-8, Document Cited by: §2.
  • G. Moon, S. Yu, H. Wen, T. Shiratori, and K. M. Lee (2020) InterHand2.6m: a dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In European Conference on Computer Vision (ECCV), Cited by: §7.
  • F. Mueller, M. Davis, F. Bernard, O. Sotnychenko, M. Verschoor, M. A. Otaduy, D. Casas, and C. Theobalt (2019) Real-time pose and shape reconstruction of two interacting hands with a single depth camera. ACM Transactions on Graphics (TOG) 38 (4). Cited by: §2.
  • R. A. Newcombe, D. Fox, and S. M. Seitz (2015) Dynamicfusion: reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 343–352. Cited by: §4.2, §6.4.
  • I. Oikonomidis, N. Kyriazis, and A. A. Argyros (2011) Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, Cited by: §2.
  • N. Otsu (1979) A threshold selection method from gray level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9, pp. 62–66. Cited by: §3.1.
  • S. Panchal-Kildare and K. Malone (2013) Skeletal anatomy of the hand. Hand clinics 29 (4), pp. 459–471. Cited by: §1, §4.3, §6.3.
  • G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black (2015) Dyna: a model of dynamic human shape in motion. ACM Transactions on Graphics (TOG) 34 (4), pp. 1–14. Cited by: §1, §2, §7.
  • N. Qian, J. Wang, F. Mueller, F. Bernard, V. Golyanik, and C. Theobalt (2020) HTML: a parametric hand texture model for 3d hand reconstruction and personalization. In European Conference on Computer Vision, pp. 54–71. Cited by: §2, Table 1, Figure 16, §6.3, §6.4, §6.4, §6.
  • R3DS (2022) WRAP3D. External Links: Link Cited by: §4.1, §4.3, §6.3.
  • T. Rhee, J. P. Lewis, U. Neumann, and K. Nayak (2007) Soft-tissue deformation for in vivo volume animation. In 15th Pacific Conference on Computer Graphics and Applications (PG’07), pp. 435–438. Cited by: §4.2.
  • J. Romero, D. Tzionas, and M. J. Black (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics (ToG) 36 (6), pp. 245. Cited by: §1, §2, §2, §2, Table 1, §3, §4.1, §4.1, §4.3, Figure 15, §6.2, §6.3, §6.3, §6.3, §6.4, Table 3, §6.
  • P. Sachdeva, S. Sueda, S. Bradley, M. Fain, and D. K. Pai (2015) Biomechanical simulation and control of hands and tendinous systems. ACM Trans. Graph. 34 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • T. Schmidt, R. A. Newcombe, and D. Fox (2014) DART: dense articulated real-time tracking. In Robotics: Science and Systems, Cited by: §2.
  • R. J. Schwarz and C. Taylor (1955) The anatomy and mechanics of the human hand. Artificial limbs 2 (2), pp. 22–35. Cited by: §4.1.
  • H. Si (2015) TetGen, a delaunay-based quality tetrahedral mesh generator. ACM Transactions on Mathematical Software (TOMS) 41 (2), pp. 1–36. Cited by: §4.1.
  • B. Smith, F. D. Goes, and T. Kim (2018) Stable neo-hookean flesh simulation. ACM Transactions on Graphics (TOG) 37 (2), pp. 1–15. Cited by: §4.2.
  • B. Smith, C. Wu, H. Wen, P. Peluse, Y. Sheikh, J. K. Hodgins, and T. Shiratori (2020) Constraining dense hand surface tracking with elasticity. ACM Transactions on Graphics (TOG) 39 (6), pp. 1–14. Cited by: §4.2, §4.2.
  • S. Sueda, A. Kaufman, and D. K. Pai (2008) Musculotendon simulation for hand animation. ACM Trans. Graph. 27 (3), pp. 1–8. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • A. Tkach, M. Pauly, and A. Tagliasacchi (2016) Sphere-meshes for real-time hand modeling and tracking. ACM Trans. Graph. 35 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §6.4.
  • A. Tsoli, N. Mahmood, and M. J. Black (2014) Breathing life into shape: capturing, modeling and animating 3d human breathing. ACM Transactions on graphics (TOG) 33 (4), pp. 1–11. Cited by: §7.
  • D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall (2016) Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision 118 (2), pp. 172–193. Cited by: §2.
  • B. Wang, G. Matcuk, and J. Barbič (2019) Hand modeling and simulation using stabilized magnetic resonance imaging. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2, §2, Table 1, §4.3, §4.3.
  • B. Wang, G. Matcuk, and J. Barbič (2021) Modeling of personalized anatomy using plastic strains. ACM Transactions on Graphics (TOG) 40 (2), pp. 1–21. Cited by: §2, §2, Table 1, §4.2.
  • N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67. Cited by: §4.2.
  • L. Xu, W. Cheng, K. Guo, L. Han, Y. Liu, and L. Fang (2019) Flyfusion: realtime dynamic scene reconstruction using a flying depth camera. IEEE Transactions on Visualization and Computer Graphics. Cited by: §4.2.
  • W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H. Seidel, and C. Theobalt (2018) MonoPerfCap: human performance capture from monocular video. ACM Trans. Graph. 37 (2), pp. 27:1–27:15. External Links: ISSN 0730-0301, Link, Document Cited by: §6.4.
  • C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox (2019) Freihand: a dataset for markerless capture of hand pose and shape from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 813–822. Cited by: §6.4, §7.
  • M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt (2018) State of the art on monocular 3d face reconstruction, tracking, and applications. Computer Graphics Forum 37. Cited by: §2.